gnomad.utils.file_utils
file_exists | Check whether a file exists.
check_file_exists_raise_error | Check whether the file or all files in a list of files exist and optionally raise an exception.
write_temp_gcs |
select_primitives_from_ht | Select only primitive types (string, int, float, bool) from a Table.
get_file_stats | Get size (as both int and str) and md5 for file at specified URL.
read_list_data | Read a file input into a python list (each line will be an element).
repartition_for_join | Calculate new partition intervals using input Table.
create_vds | Combine GVCFs into a single VDS.
- gnomad.utils.file_utils.file_exists(fname)[source]
Check whether a file exists.
Supports either local or Google cloud (gs://) paths. If the file is a Hail file (.ht, .mt, .bm, .parquet, .he, and .vds extensions), it checks that _SUCCESS is present.
- Parameters:
fname (str) – File name.
- Return type:
bool
- Returns:
Whether the file exists.
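The _SUCCESS check described above can be illustrated for local paths with a minimal sketch. This is not the library's implementation: the helper name local_file_exists_sketch is hypothetical, the extension set is copied from the description, gs:// paths are not handled, and looking for the marker directly inside the top-level directory is an assumption.

```python
import os

# Extensions listed in the description above; paths with these extensions
# are treated as Hail outputs (directories) in this sketch.
HAIL_EXTENSIONS = {".ht", ".mt", ".bm", ".parquet", ".he", ".vds"}


def local_file_exists_sketch(fname: str) -> bool:
    """Hypothetical, local-only illustration of the _SUCCESS check."""
    ext = os.path.splitext(fname.rstrip("/"))[1]
    if ext in HAIL_EXTENSIONS:
        # Assumption: the marker sits directly inside the output directory.
        # A write is only treated as complete once _SUCCESS is present.
        return os.path.isfile(os.path.join(fname, "_SUCCESS"))
    return os.path.exists(fname)
```

Under this sketch, a partially written Table directory without a _SUCCESS marker is reported as missing, which is the point of the check.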
- gnomad.utils.file_utils.check_file_exists_raise_error(fname, error_if_exists=False, error_if_not_exists=False, error_if_exists_msg='The following files already exist: ', error_if_not_exists_msg='The following files do not exist: ')[source]
Check whether the file or all files in a list of files exist and optionally raise an exception.
This is useful for pipelines that write files at the end: checking first whether an output file already exists shows whether it must be removed, or overwrite specified, so that the pipeline doesn’t fail at the write step.
- Parameters:
fname (Union[str, List[str]]) – File path, or list of file paths, to check for existence.
error_if_exists (bool) – Whether to raise an exception if any of the files exist. Default is False.
error_if_not_exists (bool) – Whether to raise an exception if any of the files do not exist. Default is False.
error_if_exists_msg (str) – Error message to print if any of the files exist.
error_if_not_exists_msg (str) – Error message to print if any of the files do not exist.
- Return type:
bool
- Returns:
Boolean indicating if fname or all files in fname exist.
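The interaction of the two flags and the return value can be sketched for local paths as follows. The function name check_exists_sketch and the exception types are assumptions made for illustration (the documentation only says that an exception is raised), and existence is checked with os.path.exists rather than the library's own check.

```python
import os
from typing import List, Union


def check_exists_sketch(
    fname: Union[str, List[str]],
    error_if_exists: bool = False,
    error_if_not_exists: bool = False,
    error_if_exists_msg: str = "The following files already exist: ",
    error_if_not_exists_msg: str = "The following files do not exist: ",
) -> bool:
    """Hypothetical, local-only illustration of the documented behavior."""
    paths = [fname] if isinstance(fname, str) else list(fname)
    existing = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]

    # Assumption: built-in exception types stand in for whatever the
    # library actually raises.
    if error_if_exists and existing:
        raise FileExistsError(error_if_exists_msg + ", ".join(existing))
    if error_if_not_exists and missing:
        raise FileNotFoundError(error_if_not_exists_msg + ", ".join(missing))

    # True only when every path exists.
    return not missing
```

With both flags left at False the call is a pure query; setting error_if_exists=True before a write step makes an accidental overwrite fail early.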
- gnomad.utils.file_utils.write_temp_gcs(t, gcs_path, overwrite=False, temp_path=None)[source]
- Parameters:
t (Union[MatrixTable, Table]) –
gcs_path (str) –
overwrite (bool) –
temp_path (Optional[str]) –
- Return type:
None
- gnomad.utils.file_utils.select_primitives_from_ht(ht)[source]
Select only primitive types (string, int, float, bool) from a Table.
Particularly useful for exporting a Table.
- gnomad.utils.file_utils.get_file_stats(url, project_id=None)[source]
Get size (as both int and str) and md5 for file at specified URL.
Typically used to get stats on VCFs.
- Parameters:
url (str) – Path to file of interest.
project_id (Optional[str]) – Google project ID. Specify if URL points to a requester-pays bucket.
- Return type:
Tuple[int, str, str]
- Returns:
Tuple of file size as an int, file size as a str, and md5.
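To make the shape of the returned tuple concrete, here is a minimal local-file sketch. It does not reproduce how the library queries a URL: the name local_file_stats_sketch, the human-readable size format, and the use of a hex md5 digest are all assumptions.

```python
import hashlib
import os
from typing import Tuple


def local_file_stats_sketch(path: str) -> Tuple[int, str, str]:
    """Hypothetical local analogue returning (size_int, size_str, md5)."""
    size_int = os.path.getsize(path)

    # Assumption: a binary-unit, two-decimal string for the readable size.
    size = float(size_int)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            size_str = f"{size:.2f} {unit}"
            break
        size /= 1024

    # Hash in 1 MiB chunks so large VCFs are not loaded into memory.
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            md5.update(chunk)

    return size_int, size_str, md5.hexdigest()
```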
- gnomad.utils.file_utils.read_list_data(input_file_path)[source]
Read a file input into a python list (each line will be an element).
Supports Google storage paths and .gz compression.
- Parameters:
input_file_path (str) – File path.
- Return type:
List[str]
- Returns:
List of lines.
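A local-only sketch of the same idea, one list element per line with transparent .gz handling, is shown below. The name read_lines_sketch is hypothetical, Google storage paths are not handled, and stripping only the trailing newline is an assumption about how each line is cleaned.

```python
import gzip
from typing import List


def read_lines_sketch(input_file_path: str) -> List[str]:
    """Hypothetical, local-only illustration: one element per line."""
    # Choose the opener from the extension; both are used in text mode.
    opener = gzip.open if input_file_path.endswith(".gz") else open
    with opener(input_file_path, "rt") as handle:
        # Assumption: only the trailing newline is removed from each line.
        return [line.rstrip("\n") for line in handle]
```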
- gnomad.utils.file_utils.repartition_for_join(ht_path, new_partition_percent=1.1)[source]
Calculate new partition intervals using input Table.
Reading in all Tables using the same partition intervals (via _intervals) before they are joined makes the joins much more efficient. For more information, see: https://discuss.hail.is/t/room-for-improvement-when-joining-multiple-hts/2278/8
- Parameters:
ht_path (str) – Path to Table to use for interval partition calculation.
new_partition_percent (float) – Percent of initial dataset partitions to use. Value should be greater than 1 so that input Table will have more partitions for the join. Defaults to 1.1.
- Return type:
List[IntervalExpression]
- Returns:
List of IntervalExpressions calculated over new set of partitions (number of partitions in HT * desired percent increase).
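The arithmetic in the return description can be sketched on its own. Note that new_partition_percent acts as a multiplier (1.1 means 10% more partitions), not as a value out of 100. The helper name target_partition_count is hypothetical, and truncating the product to an integer is an assumption about rounding.

```python
def target_partition_count(
    n_partitions: int, new_partition_percent: float = 1.1
) -> int:
    """Hypothetical illustration: partitions in HT * desired percent increase."""
    # Assumption: the product is truncated to a whole number of partitions.
    return int(n_partitions * new_partition_percent)
```

For example, a Table with 1000 partitions and the default of 1.1 gives a target of 1100 partitions over which the new intervals would be calculated.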
- gnomad.utils.file_utils.create_vds(output_path, temp_path, vdses=None, gvcfs=None, save_path=None, use_genome_default_intervals=False, use_exome_default_intervals=False, intervals=None, gvcf_batch_size=None, reference_genome='GRCh38')[source]
Combine GVCFs into a single VDS.
- Parameters:
output_path (str) – Path to write output VDS.
temp_path (str) – Directory path to write temporary files. A bucket with a life-cycle policy is recommended.
vdses (Optional[str]) – Path to file containing VDS paths with no header.
gvcfs (Optional[str]) – Path to file containing GVCF paths with no header.
save_path (Optional[str]) – Path to write combiner to on failure. Can be used to restart combiner from a failed state. If not specified, defaults to temp_path + combiner_plan.json.
use_genome_default_intervals (bool) – Use the default genome intervals.
use_exome_default_intervals (bool) – Use the default exome intervals.
intervals (Optional[str]) – Path to text file with intervals to use for VDS creation.
gvcf_batch_size (Optional[int]) – Number of GVCFs to combine into a Variant Dataset at once.
reference_genome (str) – Reference genome to use. Default is GRCh38.
- Return type:
VariantDataset
- Returns:
Combined VDS.