gnomad.utils.file_utils
| file_exists | Check whether a file exists. |
| check_file_exists_raise_error | Check whether the file or all files in a list of files exist and optionally raise an exception. |
| select_primitives_from_ht | Select only primitive types (string, int, float, bool) from a Table. |
| get_file_stats | Get size (as both int and str) and md5 for file at specified URL. |
| read_list_data | Read a file input into a python list (each line will be an element). |
| repartition_for_join | Calculate new partition intervals using input Table. |
| create_vds | Combine GVCFs into a single VDS. |
- gnomad.utils.file_utils.file_exists(fname)[source]
- Check whether a file exists. - Supports either local or Google Cloud (gs://) paths. If the file is a Hail file (.ht, .mt, .bm, .parquet, .he, or .vds extension), it checks that _SUCCESS is present. - Parameters:
- fname (str) – File name.
- Return type:
- bool
- Returns:
- Whether the file exists. 
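The _SUCCESS check described above can be illustrated with a local-only sketch. This is not the library's implementation: `local_file_exists` is a hypothetical helper, it handles only local paths (no gs:// support), and it assumes a Hail output is a directory that contains a `_SUCCESS` marker once the write has completed.

```python
import os
import tempfile

# Extensions treated as Hail outputs, taken from the description above.
HAIL_EXTENSIONS = (".ht", ".mt", ".bm", ".parquet", ".he", ".vds")


def local_file_exists(fname: str) -> bool:
    """Hypothetical local-only analogue of file_exists."""
    fname = fname.rstrip("/")
    if fname.endswith(HAIL_EXTENSIONS):
        # Assumption: a finished Hail write leaves a _SUCCESS file in the directory.
        return os.path.isfile(os.path.join(fname, "_SUCCESS"))
    return os.path.exists(fname)


# Demonstration: a Table directory without _SUCCESS is treated as missing.
with tempfile.TemporaryDirectory() as tmp:
    ht_dir = os.path.join(tmp, "example.ht")
    os.makedirs(ht_dir)
    incomplete = local_file_exists(ht_dir)

    # Create the marker file, as a completed write would.
    open(os.path.join(ht_dir, "_SUCCESS"), "w").close()
    complete = local_file_exists(ht_dir)
```

Checking for the marker rather than the directory itself means that a partially written Table from an interrupted job is not mistaken for a usable one.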
 
- gnomad.utils.file_utils.check_file_exists_raise_error(fname, error_if_exists=False, error_if_not_exists=False, error_if_exists_msg='The following files already exist: ', error_if_not_exists_msg='The following files do not exist: ')[source]
- Check whether the file or all files in a list of files exist and optionally raise an exception. - This is useful before writing output at the end of a pipeline: checking up front whether an output file already exists (and so must be removed, or overwrite must be specified) prevents the pipeline from failing at the write step. - Parameters:
- fname (Union[str, List[str]]) – File path, or list of file paths, to check for existence.
- error_if_exists (bool) – Whether to raise an exception if any of the files exist. Default is False.
- error_if_not_exists (bool) – Whether to raise an exception if any of the files do not exist. Default is False.
- error_if_exists_msg (str) – String of the error message to print if any of the files exist.
- error_if_not_exists_msg (str) – String of the error message to print if any of the files do not exist.
 
- Return type:
- bool
- Returns:
- Boolean indicating if fname or all files in fname exist. 
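A minimal sketch of the behavior documented above, assuming local paths only. The name `check_exists_sketch`, the injectable `exists` callable, and the exception types (`FileExistsError`, `FileNotFoundError`) are assumptions made for illustration; the exception class the library actually raises is not stated here.

```python
import os
from typing import Callable, List, Union


def check_exists_sketch(
    fname: Union[str, List[str]],
    error_if_exists: bool = False,
    error_if_not_exists: bool = False,
    error_if_exists_msg: str = "The following files already exist: ",
    error_if_not_exists_msg: str = "The following files do not exist: ",
    exists: Callable[[str], bool] = os.path.exists,  # hypothetical hook for testing
) -> bool:
    """Return True only if every path exists; optionally raise instead."""
    paths = [fname] if isinstance(fname, str) else list(fname)

    # Evaluate each path once and split into existing and missing.
    status = {p: exists(p) for p in paths}
    existing = [p for p, found in status.items() if found]
    missing = [p for p, found in status.items() if not found]

    if error_if_exists and existing:
        raise FileExistsError(error_if_exists_msg + ", ".join(existing))
    if error_if_not_exists and missing:
        raise FileNotFoundError(error_if_not_exists_msg + ", ".join(missing))
    return not missing
```

A pipeline would typically call such a check on its output paths with `error_if_exists=True` before any expensive computation starts, and on its input paths with `error_if_not_exists=True`.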
 
- gnomad.utils.file_utils.write_temp_gcs(t, gcs_path, overwrite=False, temp_path=None)[source]
- Parameters:
- t (Union[MatrixTable, Table]) –
- gcs_path (str) –
- overwrite (bool) –
- temp_path (Optional[str]) –
 
- Return type:
- None
 
- gnomad.utils.file_utils.select_primitives_from_ht(ht)[source]
- Select only primitive types (string, int, float, bool) from a Table. - Particularly useful for exporting a Table. 
- gnomad.utils.file_utils.get_file_stats(url, project_id=None)[source]
- Get size (as both int and str) and md5 for file at specified URL. - Typically used to get stats on VCFs. - Parameters:
- url (str) – Path to file of interest.
- project_id (Optional[str]) – Google project ID. Specify if URL points to a requester-pays bucket.
 
- Return type:
- Tuple[int, str, str]
- Returns:
- Tuple of file size (as an int and as a str) and md5.
 
- gnomad.utils.file_utils.read_list_data(input_file_path)[source]
- Read a file input into a python list (each line will be an element). - Supports Google storage paths and .gz compression. - Parameters:
- input_file_path (str) – File path.
- Return type:
- List[str]
- Returns:
- List of lines 
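The line-per-element behavior, including .gz handling, can be sketched for local files as follows. `read_lines_sketch` is a hypothetical stand-in, it does not support Google storage paths, and stripping the trailing newline from each line is an assumption about the desired output.

```python
import gzip
import os
import tempfile
from typing import List


def read_lines_sketch(path: str) -> List[str]:
    """Read a local text file (optionally gzip-compressed) into a list of lines."""
    # Choose the opener from the file extension; both are opened in text mode.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Assumption: trailing newlines are not wanted in the returned elements.
        return [line.rstrip("\n") for line in handle]


# Demonstration with one plain and one compressed file holding the same content.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "samples.txt")
    gz_path = os.path.join(tmp, "samples.txt.gz")
    with open(plain_path, "w") as out:
        out.write("sample1\nsample2\n")
    with gzip.open(gz_path, "wt") as out:
        out.write("sample1\nsample2\n")

    plain_lines = read_lines_sketch(plain_path)
    gz_lines = read_lines_sketch(gz_path)
```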
 
- gnomad.utils.file_utils.repartition_for_join(ht_path, new_partition_percent=1.1)[source]
- Calculate new partition intervals using input Table. - Reading in all Tables using the same partition intervals (via _intervals) before they are joined makes the joins much more efficient. For more information, see: https://discuss.hail.is/t/room-for-improvement-when-joining-multiple-hts/2278/8 - Parameters:
- ht_path (str) – Path to Table to use for interval partition calculation.
- new_partition_percent (float) – Multiplier applied to the number of partitions in the input Table (for example, 1.1 yields 10% more partitions). Value should be greater than 1 so that the input Table will have more partitions for the join. Defaults to 1.1.
 
- Return type:
- List[IntervalExpression]
- Returns:
- List of IntervalExpressions calculated over new set of partitions (number of partitions in HT * desired percent increase). 
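The partition count described in the return value (number of partitions in the HT multiplied by the desired percent increase) is simple arithmetic, sketched below. `target_partition_count` is a hypothetical helper, and rounding to the nearest integer is an assumption; how the library converts the product to a whole number is not stated here.

```python
def target_partition_count(n_partitions: int, new_partition_percent: float = 1.1) -> int:
    """Scale a partition count by a multiplier and round to the nearest integer."""
    # Rounding (rather than truncating or taking the ceiling) is an assumption;
    # it also absorbs floating-point noise such as 1000 * 1.1 = 1100.0000000000002.
    return round(n_partitions * new_partition_percent)
```

With the default multiplier, a Table with 1000 partitions would be re-read with about 1100 partition intervals.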
 
- gnomad.utils.file_utils.create_vds(output_path, temp_path, vdses=None, gvcfs=None, save_path=None, use_genome_default_intervals=False, use_exome_default_intervals=False, intervals=None, gvcf_batch_size=None, reference_genome='GRCh38')[source]
- Combine GVCFs into a single VDS. - Parameters:
- output_path (str) – Path to write output VDS.
- temp_path (str) – Directory path to write temporary files. A bucket with a life-cycle policy is recommended.
- vdses (Optional[str]) – Path to file containing VDS paths with no header.
- gvcfs (Optional[str]) – Path to file containing GVCF paths with no header.
- save_path (Optional[str]) – Path to write combiner to on failure. Can be used to restart combiner from a failed state. If not specified, defaults to temp_path + 'combiner_plan.json'.
- use_genome_default_intervals (bool) – Use the default genome intervals.
- use_exome_default_intervals (bool) – Use the default exome intervals.
- intervals (Optional[str]) – Path to text file with intervals to use for VDS creation.
- gvcf_batch_size (Optional[int]) – Number of GVCFs to combine into a Variant Dataset at once.
- reference_genome (str) – Reference genome to use. Default is GRCh38.
 
- Return type:
- VariantDataset
- Returns:
- Combined VDS.