gnomad.utils.file_utils

`gnomad.utils.file_utils.file_exists`(fname)	Check whether a file exists.
`gnomad.utils.file_utils.check_file_exists_raise_error`(fname)	Check whether the file or all files in a list of files exist and optionally raise an exception.
`gnomad.utils.file_utils.write_temp_gcs`(t, ...)
`gnomad.utils.file_utils.select_primitives_from_ht`(ht)	Select only primitive types (string, int, float, bool) from a Table.
`gnomad.utils.file_utils.get_file_stats`(url)	Get size (as both int and str) and md5 for file at specified URL.
`gnomad.utils.file_utils.read_list_data`(...)	Read a file input into a Python list (each line will be an element).
`gnomad.utils.file_utils.repartition_for_join`(ht_path)	Calculate new partition intervals using input Table.
`gnomad.utils.file_utils.create_vds`(...[, ...])	Combine GVCFs into a single VDS.

gnomad.utils.file_utils.file_exists(fname)[source]

Check whether a file exists.

Supports either local or Google cloud (gs://) paths. If the file is a Hail file (.ht, .mt, .bm, .parquet, .he, and .vds extensions), it checks that _SUCCESS is present.

Parameters: fname (str) – File name.
Return type: bool
Returns: Whether the file exists.
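
The Hail-file rule described above can be illustrated for local paths only. This is a minimal sketch, not the library's implementation: the helper name `local_file_exists_sketch` is hypothetical, gs:// handling is omitted, and it assumes the `_SUCCESS` marker sits directly inside the Hail output directory.

```python
import os

# Extensions the description above lists as Hail files.
HAIL_EXTENSIONS = (".ht", ".mt", ".bm", ".parquet", ".he", ".vds")


def local_file_exists_sketch(fname: str) -> bool:
    """Hypothetical local-only analogue of the check described above."""
    path = fname.rstrip("/")
    if path.endswith(HAIL_EXTENSIONS):
        # Assumption: a Hail output counts as present only when a _SUCCESS
        # marker file exists directly inside its directory.
        return os.path.isfile(os.path.join(path, "_SUCCESS"))
    # Any other path is checked as an ordinary file or directory.
    return os.path.exists(path)
```

Under this sketch, a partially written `.ht` directory with no `_SUCCESS` marker is reported as missing.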

gnomad.utils.file_utils.check_file_exists_raise_error(fname, error_if_exists=False, error_if_not_exists=False, error_if_exists_msg='The following files already exist: ', error_if_not_exists_msg='The following files do not exist: ')[source]

Check whether the file or all files in a list of files exist and optionally raise an exception.

This is useful at the end of a pipeline, before writing output files: checking first whether a file already exists shows whether it must be removed, or overwrite specified, so that the pipeline doesn’t fail.

Parameters:

fname (Union[str, List[str]]) – File path, or list of file paths to check the existence of.
error_if_exists (bool) – Whether to raise an exception if any of the files exist. Default is False.
error_if_not_exists (bool) – Whether to raise an exception if any of the files do not exist. Default is False.
error_if_exists_msg (str) – String of the error message to print if any of the files exist.
error_if_not_exists_msg (str) – String of the error message to print if any of the files do not exist.

Return type:

bool

Returns:

Boolean indicating if fname or all files in fname exist.
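
The documented behavior can be sketched in plain Python. This is a hedged illustration rather than the library code: `check_exists_sketch` and its injected `exists` callable are hypothetical, and the exception types are assumptions, since the documentation only says "an exception".

```python
from typing import Callable, List, Union


def check_exists_sketch(
    fname: Union[str, List[str]],
    exists: Callable[[str], bool],
    error_if_exists: bool = False,
    error_if_not_exists: bool = False,
    error_if_exists_msg: str = "The following files already exist: ",
    error_if_not_exists_msg: str = "The following files do not exist: ",
) -> bool:
    """Hypothetical sketch of the documented behavior."""
    # Accept a single path or a list of paths.
    paths = [fname] if isinstance(fname, str) else list(fname)
    status = {p: exists(p) for p in paths}
    present = [p for p in paths if status[p]]
    missing = [p for p in paths if not status[p]]

    # Assumption: these exception types are chosen for illustration only.
    if error_if_exists and present:
        raise FileExistsError(error_if_exists_msg + ", ".join(present))
    if error_if_not_exists and missing:
        raise FileNotFoundError(error_if_not_exists_msg + ", ".join(missing))

    # True only when every path exists.
    return not missing
```

With both flags left at False, the sketch never raises and simply reports whether all paths exist.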

gnomad.utils.file_utils.write_temp_gcs(t, gcs_path, overwrite=False, temp_path=None)[source]

Parameters:

t (Union[MatrixTable, Table]) –
gcs_path (str) –
overwrite (bool) –
temp_path (Optional[str]) –

Return type:

None

gnomad.utils.file_utils.select_primitives_from_ht(ht)[source]

Select only primitive types (string, int, float, bool) from a Table.

Particularly useful for exporting a Table.

Parameters: ht (Table) – Input Table
Return type: Table
Returns: Table with only primitive types selected

gnomad.utils.file_utils.get_file_stats(url, project_id=None)[source]

Get size (as both int and str) and md5 for file at specified URL.

Typically used to get stats on VCFs.

Parameters:

url (str) – Path to file of interest.
project_id (Optional[str]) – Google project ID. Specify if URL points to a requester-pays bucket.

Return type:

Tuple[int, str, str]

Returns:

Tuple of file size as an int, file size as a str, and md5.
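
The shape of the returned tuple can be illustrated with a local-file analogue. This sketch is an assumption-laden stand-in, not how the library queries a URL: `local_file_stats_sketch` and `human_readable_size` are hypothetical names, the string format of the size and the hex encoding of the md5 are assumptions, and `project_id` has no local equivalent.

```python
import hashlib
import os
from typing import Tuple


def human_readable_size(num_bytes: int) -> str:
    """Format a byte count with binary units (the format is an assumption)."""
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    size = float(num_bytes)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.2f} {units[index]}"


def local_file_stats_sketch(path: str) -> Tuple[int, str, str]:
    """Return (size as int, size as str, md5 hex digest) for a local file."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            md5.update(chunk)
    return size, human_readable_size(size), md5.hexdigest()
```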

gnomad.utils.file_utils.read_list_data(input_file_path)[source]

Read a file input into a Python list (each line will be an element).

Supports Google storage paths and .gz compression.

Parameters: input_file_path (str) – File path
Return type: List[str]
Returns: List of lines
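
For local files, the line-per-element behavior with .gz support can be sketched as follows. The helper `read_lines_sketch` is hypothetical and omits Google storage paths; stripping only the trailing newline from each line is an assumption about how elements are cleaned.

```python
import gzip
from typing import List


def read_lines_sketch(input_file_path: str) -> List[str]:
    """Hypothetical local-only analogue: one list element per line."""
    # Pick the gzip reader for .gz files and the built-in open otherwise.
    opener = gzip.open if input_file_path.endswith(".gz") else open
    with opener(input_file_path, "rt") as handle:
        # Assumption: only the trailing newline is removed from each line.
        return [line.rstrip("\n") for line in handle]
```

A typical input would be a text file of sample IDs or file paths, one per line.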

gnomad.utils.file_utils.repartition_for_join(ht_path, new_partition_percent=1.1)[source]

Calculate new partition intervals using input Table.

Reading in all Tables using the same partition intervals (via _intervals) before they are joined makes the joins much more efficient. For more information, see: https://discuss.hail.is/t/room-for-improvement-when-joining-multiple-hts/2278/8

Parameters:

ht_path (str) – Path to Table to use for interval partition calculation.
new_partition_percent (float) – Multiplier applied to the number of partitions in the input Table (for example, 1.1 gives roughly 10% more partitions). Value should be greater than 1 so that the input Table has more partitions for the join. Defaults to 1.1.

Return type:

List[IntervalExpression]

Returns:

List of IntervalExpressions calculated over new set of partitions (number of partitions in HT * desired percent increase).
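
The target partition count in the parenthetical above is simple arithmetic. The sketch below only illustrates that multiplication; `target_partition_count` is a hypothetical helper, and rounding to the nearest integer is an assumption, since the documentation does not say how a fractional result is handled.

```python
def target_partition_count(
    n_partitions: int, new_partition_percent: float = 1.1
) -> int:
    """Number of partitions in the Table times the desired increase."""
    # Assumption: fractional results are rounded to the nearest integer.
    return round(n_partitions * new_partition_percent)
```

For a Table with 1000 partitions and the default of 1.1, this gives 1100 partitions.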

gnomad.utils.file_utils.create_vds(output_path, temp_path, vdses=None, gvcfs=None, save_path=None, use_genome_default_intervals=False, use_exome_default_intervals=False, intervals=None, gvcf_batch_size=None, reference_genome='GRCh38')[source]

Combine GVCFs into a single VDS.

Parameters:

output_path (str) – Path to write output VDS.
temp_path (str) – Directory path to write temporary files. A bucket with a life-cycle policy is recommended.
vdses (Optional[str]) – Path to file containing VDS paths with no header.
gvcfs (Optional[str]) – Path to file containing GVCF paths with no header.
save_path (Optional[str]) – Path to write combiner to on failure. Can be used to restart combiner from a failed state. If not specified, defaults to temp_path + combiner_plan.json.
use_genome_default_intervals (bool) – Use the default genome intervals.
use_exome_default_intervals (bool) – Use the default exome intervals.
intervals (Optional[str]) – Path to text file with intervals to use for VDS creation.
gvcf_batch_size (Optional[int]) – Number of GVCFs to combine into a Variant Dataset at once.
reference_genome (str) – Reference genome to use. Default is GRCh38.

Return type:

VariantDataset

Returns:

Combined VDS.