gnomad.utils.file_utils

gnomad.utils.file_utils.file_exists(fname)

Check whether a file exists.

gnomad.utils.file_utils.check_file_exists_raise_error(fname)

Check whether the file or all files in a list of files exist and optionally raise an exception.

gnomad.utils.file_utils.write_temp_gcs(t, ...)

gnomad.utils.file_utils.select_primitives_from_ht(ht)

Select only primitive types (string, int, float, bool) from a Table.

gnomad.utils.file_utils.get_file_stats(url)

Get size (as both int and str) and md5 for file at specified URL.

gnomad.utils.file_utils.read_list_data(...)

Read a file input into a python list (each line will be an element).

gnomad.utils.file_utils.repartition_for_join(ht_path)

Calculate new partition intervals using input Table.

gnomad.utils.file_utils.create_vds(...[, ...])

Combine GVCFs into a single VDS.

gnomad.utils.file_utils.file_exists(fname)

Check whether a file exists.

Supports both local and Google Cloud Storage (gs://) paths. For a Hail file (.ht, .mt, .bm, .parquet, .he, or .vds extension), it checks that _SUCCESS is present.

Parameters:

fname (str) – File name.

Return type:

bool

Returns:

Whether the file exists.
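
The _SUCCESS rule described above can be illustrated with a minimal, local-only sketch. This is not the gnomad implementation: `local_file_exists` and `HAIL_EXTENSIONS` are hypothetical names, gs:// paths are not handled, and the only behavior assumed is what the description states (a Hail output counts as existing only when its _SUCCESS file is present).

```python
import os
import tempfile

# Extensions that the description above lists as Hail files.
HAIL_EXTENSIONS = (".ht", ".mt", ".bm", ".parquet", ".he", ".vds")


def local_file_exists(fname: str) -> bool:
    """Hypothetical local-only analogue of file_exists (no gs:// support)."""
    if fname.rstrip("/").endswith(HAIL_EXTENSIONS):
        # A Hail output is a directory; treat it as present only once
        # the _SUCCESS file has been written inside it.
        return os.path.exists(os.path.join(fname, "_SUCCESS"))
    return os.path.exists(fname)


with tempfile.TemporaryDirectory() as tmp:
    ht_path = os.path.join(tmp, "example.ht")
    os.makedirs(ht_path)
    incomplete = local_file_exists(ht_path)  # directory exists, no _SUCCESS yet
    open(os.path.join(ht_path, "_SUCCESS"), "w").close()
    complete = local_file_exists(ht_path)  # _SUCCESS now present
```

The point of the marker check is that a partially written Table directory is reported as missing rather than as a usable file.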

gnomad.utils.file_utils.check_file_exists_raise_error(fname, error_if_exists=False, error_if_not_exists=False, error_if_exists_msg='The following files already exist: ', error_if_not_exists_msg='The following files do not exist: ')

Check whether the file or all files in a list of files exist and optionally raise an exception.

This is useful in a pipeline that writes files as its last step: checking up front whether an output file already exists (and therefore must be removed, or overwrite must be specified) keeps the pipeline from failing at the write.

Parameters:
  • fname (Union[str, List[str]]) – File path, or list of file paths to check the existence of.

  • error_if_exists (bool) – Whether to raise an exception if any of the files exist. Default is False.

  • error_if_not_exists (bool) – Whether to raise an exception if any of the files do not exist. Default is False.

  • error_if_exists_msg (str) – Error message to print if any of the files exist.

  • error_if_not_exists_msg (str) – Error message to print if any of the files do not exist.

Return type:

bool

Returns:

Boolean indicating if fname or all files in fname exist.
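
How the two flags interact with the return value can be sketched on local paths. This is an illustration under stated assumptions, not the library's code: `check_exists_sketch` is a hypothetical name, existence is tested with `os.path.exists` only, and the exception types (`FileExistsError`, `FileNotFoundError`) are assumptions, since this page does not say which exceptions are raised.

```python
import os
import tempfile
from typing import List, Union


def check_exists_sketch(
    fname: Union[str, List[str]],
    error_if_exists: bool = False,
    error_if_not_exists: bool = False,
    error_if_exists_msg: str = "The following files already exist: ",
    error_if_not_exists_msg: str = "The following files do not exist: ",
) -> bool:
    """Hypothetical local-only analogue of check_file_exists_raise_error."""
    paths = [fname] if isinstance(fname, str) else list(fname)
    existing = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]
    # Exception types below are assumptions made for this sketch.
    if error_if_exists and existing:
        raise FileExistsError(error_if_exists_msg + ", ".join(existing))
    if error_if_not_exists and missing:
        raise FileNotFoundError(error_if_not_exists_msg + ", ".join(missing))
    # True only when every path exists.
    return not missing


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.txt")
    absent = os.path.join(tmp, "absent.txt")
    open(present, "w").close()
    all_exist = check_exists_sketch([present, absent])  # one path is missing
    try:
        check_exists_sketch(present, error_if_exists=True)
        raised = False
    except FileExistsError:
        raised = True
```

With both flags left at their defaults the call only reports existence; setting `error_if_exists=True` before a write turns an accidental overwrite into an early failure.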

gnomad.utils.file_utils.write_temp_gcs(t, gcs_path, overwrite=False, temp_path=None)

Parameters:
  • t (Union[MatrixTable, Table]) –

  • gcs_path (str) –

  • overwrite (bool) –

  • temp_path (Optional[str]) –

Return type:

None

gnomad.utils.file_utils.select_primitives_from_ht(ht)

Select only primitive types (string, int, float, bool) from a Table.

Particularly useful for exporting a Table.

Parameters:

ht (Table) – Input Table

Return type:

Table

Returns:

Table with only primitive types selected

gnomad.utils.file_utils.get_file_stats(url, project_id=None)

Get size (as both int and str) and md5 for file at specified URL.

Typically used to get stats on VCFs.

Parameters:
  • url (str) – Path to file of interest.

  • project_id (Optional[str]) – Google project ID. Specify if URL points to a requester-pays bucket.

Return type:

Tuple[int, str, str]

Returns:

Tuple of file size (as int), file size (as str), and md5.
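
The shape of the returned tuple can be illustrated for a local file. This is a sketch under assumptions, not what get_file_stats does internally: `local_file_stats` and `human_readable` are hypothetical helpers, the human-readable size format and the hex md5 encoding are choices made for this example, and URLs and requester-pays buckets are not handled.

```python
import hashlib
import os
import tempfile
from typing import Tuple


def human_readable(size: int) -> str:
    """Format a byte count with binary units (format is an assumption)."""
    value = float(size)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.2f} {unit}"
        value /= 1024


def local_file_stats(path: str) -> Tuple[int, str, str]:
    """Hypothetical local analogue: (size as int, size as str, md5 hex digest)."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Hash in 1 MiB chunks so large VCFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            md5.update(chunk)
    return size, human_readable(size), md5.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.vcf")
    with open(example_path, "wb") as out:
        out.write(b"hello")
    size_int, size_str, md5_hex = local_file_stats(example_path)
```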

gnomad.utils.file_utils.read_list_data(input_file_path)

Read a file input into a python list (each line will be an element).

Supports Google storage paths and .gz compression.

Parameters:

input_file_path (str) – File path

Return type:

List[str]

Returns:

List of lines
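
The line-per-element behavior, including .gz handling, can be sketched for local files. `read_lines_local` is a hypothetical name; Google storage paths are not supported here, and stripping the trailing newline from each element is an assumption of this sketch.

```python
import gzip
import os
import tempfile
from typing import List


def read_lines_local(input_file_path: str) -> List[str]:
    """Hypothetical local analogue of read_list_data."""
    # Choose the opener from the file extension; both accept text mode "rt".
    opener = gzip.open if input_file_path.endswith(".gz") else open
    with opener(input_file_path, "rt") as handle:
        # Assumption: the trailing newline is dropped from each element.
        return [line.rstrip("\n") for line in handle]


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "samples.txt")
    gz_path = os.path.join(tmp, "samples.txt.gz")
    with open(plain_path, "w") as out:
        out.write("sample_1\nsample_2\n")
    with gzip.open(gz_path, "wt") as out:
        out.write("sample_1\nsample_2\n")
    plain_lines = read_lines_local(plain_path)
    gz_lines = read_lines_local(gz_path)
```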

gnomad.utils.file_utils.repartition_for_join(ht_path, new_partition_percent=1.1)

Calculate new partition intervals using input Table.

Reading in all Tables using the same partition intervals (via _intervals) before they are joined makes the joins much more efficient. For more information, see: https://discuss.hail.is/t/room-for-improvement-when-joining-multiple-hts/2278/8

Parameters:
  • ht_path (str) – Path to Table to use for interval partition calculation.

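  • new_partition_percent (float) – Factor by which to scale the number of partitions in the input Table, expressed as a fraction (1.1 means 110% of the original partition count). Value should be greater than 1 so that the input Table will have more partitions for the join. Defaults to 1.1.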

Return type:

List[IntervalExpression]

Returns:

List of IntervalExpressions calculated over the new set of partitions (number of partitions in the HT multiplied by new_partition_percent).
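
The partition arithmetic in the description can be worked through in a few lines. `target_partition_count` is a hypothetical helper, and rounding to the nearest integer is an assumption; this page does not say how a fractional product is handled.

```python
def target_partition_count(n_partitions: int, new_partition_percent: float = 1.1) -> int:
    """Hypothetical helper: partitions in the HT multiplied by the scaling factor."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(n_partitions * new_partition_percent)


# A Table with 200 partitions and the default factor of 1.1.
default_target = target_partition_count(200)
# The same Table with a larger factor.
larger_target = target_partition_count(200, new_partition_percent=1.5)
```

Under these assumptions a 200-partition Table yields 220 target partitions with the default factor and 300 with a factor of 1.5.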

gnomad.utils.file_utils.create_vds(output_path, temp_path, vdses=None, gvcfs=None, save_path=None, use_genome_default_intervals=False, use_exome_default_intervals=False, intervals=None, gvcf_batch_size=None, reference_genome='GRCh38')

Combine GVCFs into a single VDS.

Parameters:
  • output_path (str) – Path to write output VDS.

  • temp_path (str) – Directory path to write temporary files. A bucket with a life-cycle policy is recommended.

  • vdses (Optional[str]) – Path to file containing VDS paths with no header.

  • gvcfs (Optional[str]) – Path to file containing GVCF paths with no header.

  • save_path (Optional[str]) – Path to write combiner to on failure. Can be used to restart combiner from a failed state. If not specified, defaults to temp_path + combiner_plan.json.

  • use_genome_default_intervals (bool) – Use the default genome intervals.

  • use_exome_default_intervals (bool) – Use the default exome intervals.

  • intervals (Optional[str]) – Path to text file with intervals to use for VDS creation.

  • gvcf_batch_size (Optional[int]) – Number of GVCFs to combine into a Variant Dataset at once.

  • reference_genome (str) – Reference genome to use. Default is GRCh38.

Return type:

VariantDataset

Returns:

Combined VDS.