gnomad.utils.file_utils

gnomad.utils.file_utils.file_exists(fname)

Check whether a file exists.

gnomad.utils.file_utils.check_file_exists_raise_error(fname)

Check whether the file or all files in a list of files exist and optionally raise an exception.

gnomad.utils.file_utils.write_temp_gcs(t, ...)

gnomad.utils.file_utils.select_primitives_from_ht(ht)

Select only primitive types (string, int, float, bool) from a Table.

gnomad.utils.file_utils.get_file_stats(url)

Get size (as both int and str) and md5 for file at specified URL.

gnomad.utils.file_utils.read_list_data(...)

Read a file input into a Python list (each line will be an element).

gnomad.utils.file_utils.repartition_for_join(ht)

Calculate new partition intervals from a Table for co-partitioned joins.

gnomad.utils.file_utils.create_vds(...[, ...])

Combine GVCFs into a single VDS.

gnomad.utils.file_utils.file_exists(fname)[source]

Check whether a file exists.

Supports either local or Google cloud (gs://) paths. If the file is a Hail file (.ht, .mt, .bm, .parquet, .he, and .vds extensions), it checks that _SUCCESS is present.

Parameters:

fname (str) – File name.

Return type:

bool

Returns:

Whether the file exists.
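The _SUCCESS rule above can be illustrated with a minimal local-only sketch. This is not the library's implementation: `local_file_exists` is a hypothetical name, gs:// paths are not handled, and placing the `_SUCCESS` marker directly inside the Hail directory is an assumption.

```python
import os
import tempfile

# Extensions the documentation lists as Hail files.
HAIL_EXTENSIONS = (".ht", ".mt", ".bm", ".parquet", ".he", ".vds")


def local_file_exists(fname: str) -> bool:
    """Hypothetical local-only stand-in for the check described above."""
    if fname.rstrip("/").endswith(HAIL_EXTENSIONS):
        # Assumption: a Hail output counts as present only once the
        # _SUCCESS marker exists directly inside its directory.
        return os.path.exists(os.path.join(fname, "_SUCCESS"))
    return os.path.exists(fname)


with tempfile.TemporaryDirectory() as tmp:
    ht_path = os.path.join(tmp, "example.ht")
    os.mkdir(ht_path)
    # The directory exists, but the write has not "completed" yet.
    incomplete = local_file_exists(ht_path)
    open(os.path.join(ht_path, "_SUCCESS"), "w").close()
    complete = local_file_exists(ht_path)
```

The point of the marker check is that a partially written Hail directory is treated as missing rather than as a usable file.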

gnomad.utils.file_utils.check_file_exists_raise_error(fname, error_if_exists=False, error_if_not_exists=False, error_if_exists_msg='The following files already exist: ', error_if_not_exists_msg='The following files do not exist: ')[source]

Check whether the file or all files in a list of files exist and optionally raise an exception.

This is useful when a pipeline writes files at its end: checking up front whether an output file already exists (and therefore must be removed, or overwrite must be specified) keeps the pipeline from failing at the write step.

Parameters:
  • fname (Union[str, List[str]]) – File path, or list of file paths to check the existence of.

  • error_if_exists (bool) – Whether to raise an exception if any of the files exist. Default is False.

  • error_if_not_exists (bool) – Whether to raise an exception if any of the files do not exist. Default is False.

  • error_if_exists_msg (str) – String of the error message to print if any of the files exist.

  • error_if_not_exists_msg (str) – String of the error message to print if any of the files do not exist.

Return type:

bool

Returns:

Boolean indicating if fname or all files in fname exist.
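The flag semantics can be sketched in plain Python. This is an illustrative stand-in, not the library code: `check_files` and its `exists` parameter are hypothetical, and the exception types (`FileExistsError`, `FileNotFoundError`) are assumptions, since the documentation does not name the exception raised.

```python
import os
from typing import Callable, List, Union


def check_files(
    fname: Union[str, List[str]],
    error_if_exists: bool = False,
    error_if_not_exists: bool = False,
    error_if_exists_msg: str = "The following files already exist: ",
    error_if_not_exists_msg: str = "The following files do not exist: ",
    exists: Callable[[str], bool] = os.path.exists,  # hypothetical hook
) -> bool:
    """Return True only if every path exists, optionally raising."""
    paths = [fname] if isinstance(fname, str) else list(fname)
    present = [p for p in paths if exists(p)]
    missing = [p for p in paths if not exists(p)]
    if error_if_exists and present:
        # Assumed exception type; the real function may raise another.
        raise FileExistsError(error_if_exists_msg + ", ".join(present))
    if error_if_not_exists and missing:
        raise FileNotFoundError(error_if_not_exists_msg + ", ".join(missing))
    return not missing


# Pretend only "a.ht" exists so the example needs no real files.
existing = {"a.ht"}
all_present = check_files(["a.ht", "b.ht"], exists=existing.__contains__)
```

With both flags left at False the call never raises and only reports whether all paths exist; setting `error_if_exists=True` before a write turns an accidental overwrite into an early failure.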

gnomad.utils.file_utils.write_temp_gcs(t, gcs_path, overwrite=False, temp_path=None)[source]
Parameters:
  • t (Union[MatrixTable, Table]) –

  • gcs_path (str) –

  • overwrite (bool) –

  • temp_path (Optional[str]) –

Return type:

None

gnomad.utils.file_utils.select_primitives_from_ht(ht)[source]

Select only primitive types (string, int, float, bool) from a Table.

Particularly useful for exporting a Table.

Parameters:

ht (Table) – Input Table

Return type:

Table

Returns:

Table with only primitive types selected

gnomad.utils.file_utils.get_file_stats(url, project_id=None)[source]

Get size (as both int and str) and md5 for file at specified URL.

Typically used to get stats on VCFs.

Parameters:
  • url (str) – Path to file of interest.

  • project_id (Optional[str]) – Google project ID. Specify if URL points to a requester-pays bucket.

Return type:

Tuple[int, str, str]

Returns:

Tuple of file size (as int), file size (as str), and md5.
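For a sense of the shape of the returned tuple, here is a minimal sketch that computes the same three values for a local file. It does not call the library and does not handle URLs or requester-pays buckets; `local_file_stats` and `human_readable_size` are hypothetical names, and both the string format of the size and the hex encoding of the md5 are assumptions.

```python
import hashlib
import os
import tempfile
from typing import Tuple


def human_readable_size(n_bytes: int) -> str:
    """Format a byte count with binary units (assumed format)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


def local_file_stats(path: str) -> Tuple[int, str, str]:
    """Return (size as int, size as str, md5 hex digest) for a local file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large VCFs are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    size = os.path.getsize(path)
    return size, human_readable_size(size), md5.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.vcf")
    with open(example_path, "wb") as f:
        f.write(b"hello\n")
    size_int, size_str, md5_hex = local_file_stats(example_path)
```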

gnomad.utils.file_utils.read_list_data(input_file_path)[source]

Read a file input into a Python list (each line will be an element).

Supports Google storage paths and .gz compression.

Parameters:

input_file_path (str) – File path

Return type:

List[str]

Returns:

List of lines
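The behavior described above can be approximated for local files as follows. This is a sketch, not the library's code: `read_lines` is a hypothetical name, Google storage paths are not handled, and stripping the trailing newline from each element is an assumption.

```python
import gzip
import os
import tempfile
from typing import List


def read_lines(input_file_path: str) -> List[str]:
    """Read a local text file, optionally gzipped, into a list of lines."""
    # Choose the opener from the extension, mirroring the .gz support
    # mentioned in the documentation.
    opener = gzip.open if input_file_path.endswith(".gz") else open
    with opener(input_file_path, "rt") as f:
        # Assumption: trailing newlines are dropped from each element.
        return [line.rstrip("\n") for line in f]


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "samples.txt")
    gz_path = os.path.join(tmp, "samples.txt.gz")
    with open(plain_path, "w") as f:
        f.write("sample_1\nsample_2\n")
    with gzip.open(gz_path, "wt") as f:
        f.write("sample_1\nsample_2\n")
    plain_lines = read_lines(plain_path)
    gz_lines = read_lines(gz_path)
```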

gnomad.utils.file_utils.repartition_for_join(ht, new_partition_percent=1.1, n_partitions=None, locus_intervals=False)[source]

Calculate new partition intervals from a Table for co-partitioned joins.

Reading all Tables, MatrixTables, and VDSes with the same partition intervals (via _intervals for Tables and MatrixTables, hl.vds.read_vds(intervals=…) for VDS) before they are joined makes the joins much more efficient. For more information, see: https://discuss.hail.is/t/room-for-improvement-when-joining-multiple-hts/2278/8

The number of intervals is either n_partitions (when provided) or ht.n_partitions() * new_partition_percent.

Parameters:
  • ht (Union[str, Table]) – Table, or path to a Table, to use for interval partition calculation. A path is read with hl.read_table; an in-memory Table (e.g. a partition-filtered VDS reference_data.rows()) is used directly.

  • new_partition_percent (float) – Multiplier applied to the input Table’s partition count when n_partitions is not given (e.g., 1.1 yields 10% more partitions than the input). Value should be greater than 1 so that the result has more partitions than the input for the join. Defaults to 1.1. Ignored when n_partitions is provided.

  • n_partitions (Optional[int]) – Explicit number of partitions/intervals to calculate. Overrides new_partition_percent when set (e.g., to subdivide into a fixed, finer-grained number than the input has).

  • locus_intervals (bool) – If True, return bare hl.Locus intervals instead of the default key-struct intervals. Use this when passing the intervals to hl.vds.read_vds(intervals=…), whose reader expects locus intervals. Requires the input Table to be keyed by locus.

Return type:

List[Interval]

Returns:

List of hl.Interval objects over the calculated partitions.
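The rule for how many intervals are calculated can be written out as plain arithmetic. This sketch does not use Hail; `target_interval_count` is a hypothetical helper, and truncating the product to an integer is an assumption about rounding.

```python
from typing import Optional


def target_interval_count(
    current_partitions: int,
    new_partition_percent: float = 1.1,
    n_partitions: Optional[int] = None,
) -> int:
    """Number of intervals implied by the parameters described above."""
    if n_partitions is not None:
        # An explicit count overrides the multiplier entirely.
        return n_partitions
    # Assumption: the product is truncated to an integer.
    return int(current_partitions * new_partition_percent)


default_count = target_interval_count(1000)
explicit_count = target_interval_count(1000, n_partitions=20000)
```

The resulting list of intervals would then be passed to each reader involved in the join (via _intervals for Tables and MatrixTables, or the intervals argument of hl.vds.read_vds with locus_intervals=True), so that all inputs share the same partitioning.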

gnomad.utils.file_utils.create_vds(output_path, temp_path, vdses=None, gvcfs=None, save_path=None, use_genome_default_intervals=False, use_exome_default_intervals=False, intervals=None, gvcf_batch_size=None, reference_genome='GRCh38')[source]

Combine GVCFs into a single VDS.

Parameters:
  • output_path (str) – Path to write output VDS.

  • temp_path (str) – Directory path to write temporary files. A bucket with a life-cycle policy is recommended.

  • vdses (Optional[str]) – Path to file containing VDS paths with no header.

  • gvcfs (Optional[str]) – Path to file containing GVCF paths with no header.

  • save_path (Optional[str]) – Path to write combiner to on failure. Can be used to restart combiner from a failed state. If not specified, defaults to temp_path + combiner_plan.json.

  • use_genome_default_intervals (bool) – Use the default genome intervals.

  • use_exome_default_intervals (bool) – Use the default exome intervals.

  • intervals (Optional[str]) – Path to text file with intervals to use for VDS creation.

  • gvcf_batch_size (Optional[int]) – Number of GVCFs to combine into a Variant Dataset at once.

  • reference_genome (str) – Reference genome to use. Default is GRCh38.

Return type:

VariantDataset

Returns:

Combined VDS.