gnomad_qc.v4.subset

Script to filter the gnomAD v4 VariantDataset to a subset of specified samples.

Run this script on Hail version 0.2.120. Higher versions of Hail greatly increase runtime and cost, so this version is enforced.

This script subsets gnomAD using a list of samples or Terra workspaces.

usage: gnomad_qc.v4.subset.py [-h] [--test]
                              (--subset-samples SUBSET_SAMPLES | --subset-workspaces SUBSET_WORKSPACES)
                              [--data-type {exomes,genomes}]
                              (--output-vds | --output-vcf) [--split-multi]
                              [--rep-on-read-partitions REP_ON_READ_PARTITIONS]
                              [--add-variant-qc] [--pass-only]
                              [--variant-qc-annotations VARIANT_QC_ANNOTATIONS [VARIANT_QC_ANNOTATIONS ...]]
                              [--export-meta] [--keep-data-paths]
                              --output-path OUTPUT_PATH
                              [--output-filename OUTPUT_FILENAME]
                              [--output-partitions OUTPUT_PARTITIONS] [-o]
                              [--tmp-dir TMP_DIR]
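
The parenthesized groups in the usage string mark required, mutually exclusive choices: exactly one of --subset-samples or --subset-workspaces, and exactly one of --output-vds or --output-vcf. A minimal sketch of how such groups behave in argparse follows. The function name build_parser and the bucket paths are hypothetical, only a few of the documented flags are included, and this is not the script's own parser (that is returned by get_script_argument_parser()).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring part of the documented usage string."""
    parser = argparse.ArgumentParser(prog="gnomad_qc.v4.subset.py")
    parser.add_argument("--test", action="store_true")

    # Exactly one way of selecting samples must be supplied.
    subset_group = parser.add_mutually_exclusive_group(required=True)
    subset_group.add_argument("--subset-samples")
    subset_group.add_argument("--subset-workspaces")

    parser.add_argument(
        "--data-type", choices=["exomes", "genomes"], default="exomes"
    )

    # Exactly one output format must be supplied.
    output_group = parser.add_mutually_exclusive_group(required=True)
    output_group.add_argument("--output-vds", action="store_true")
    output_group.add_argument("--output-vcf", action="store_true")

    parser.add_argument("--split-multi", action="store_true")
    parser.add_argument("--output-path", required=True)
    parser.add_argument("--output-filename", default="subset")
    return parser


# Hypothetical invocation: subset by sample list and write a VDS.
args = build_parser().parse_args(
    [
        "--subset-samples", "gs://my-bucket/samples.tsv",
        "--output-vds",
        "--output-path", "gs://my-bucket/subsets",
    ]
)
```

Omitting both options of a required group, or passing both at once, makes argparse exit with a usage error before any data is read.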

Named Arguments

--test

Filter to the first 2 partitions for testing.

Default: False

--subset-samples

Path to a text file of sample IDs to subset to. The file must use the header ‘s’.

--subset-workspaces

Path to a text file of Terra workspaces to include in the subset. The file must use the header ‘terra_workspace’.
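
Both input files are single-column text files whose first line is the expected header. The sketch below writes one of each to a local temporary directory and reads the header back. The helper names, the sample IDs, and the workspace name are hypothetical, and a real run would more likely point at files in cloud storage.

```python
import tempfile
from pathlib import Path
from typing import List


def write_id_file(path: Path, header: str, values: List[str]) -> None:
    """Write a single-column text file: the header line, then one value per line."""
    path.write_text("\n".join([header, *values]) + "\n")


def read_header(path: Path) -> str:
    """Return the first line of the file, which is expected to be the header."""
    with path.open() as handle:
        return handle.readline().strip()


tmp_dir = Path(tempfile.mkdtemp())

# Sample list for --subset-samples: header 's', one sample ID per line.
samples_path = tmp_dir / "samples.tsv"
write_id_file(samples_path, "s", ["sample_1", "sample_2"])

# Workspace list for --subset-workspaces: header 'terra_workspace'.
workspaces_path = tmp_dir / "workspaces.tsv"
write_id_file(workspaces_path, "terra_workspace", ["workspace_a"])
```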

--data-type

Possible choices: exomes, genomes

Type of data to subset.

Default: “exomes”

--output-vds

Whether to output a subset VDS.

Default: False

--output-vcf

Whether to output a subset VCF.

Default: False

--split-multi

Whether to split multi-allelic variants.

Default: False

--rep-on-read-partitions

Number of partitions to pass when reading in the VDS. If this is passed and --output-partitions is not, it is also used as the number of output partitions. By default, partitioning is unchanged.
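
The interaction between this option and --output-partitions can be sketched as a small precedence rule. The function name resolve_output_partitions is hypothetical, and the sketch only restates the behavior described in the help text rather than reproducing the script's code.

```python
from typing import Optional


def resolve_output_partitions(
    rep_on_read_partitions: Optional[int],
    output_partitions: Optional[int],
) -> Optional[int]:
    """Return the output partition count, or None to leave partitioning unchanged."""
    # An explicit --output-partitions value always decides the output.
    if output_partitions is not None:
        return output_partitions
    # Otherwise the read-time repartitioning (if any) carries through.
    return rep_on_read_partitions
```

With neither option set the function returns None, which corresponds to the documented default of no change in partitioning.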

--add-variant-qc

Annotate the exported file with gnomAD’s variant QC annotations. Defaults to all annotations if a subset is not specified with --variant-qc-annotations.

Default: False

--pass-only

Keep only the variants that passed variant QC, i.e. the filter field is PASS.

Default: False
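
As an illustration of what keeping only PASS variants means, the sketch below uses plain Python records. It assumes that a passing variant carries an empty set of failed filters, which is the convention Hail uses when reading and writing the VCF FILTER column. Whether the script tests exactly this condition is an assumption, and the function name, loci, and the filter name shown are illustrative only.

```python
from typing import Dict, List


def keep_pass_variants(variants: List[Dict]) -> List[Dict]:
    """Keep variants whose set of failed filters is empty (assumed to mean PASS)."""
    return [variant for variant in variants if not variant["filters"]]


# Illustrative records: two passing variants and one that failed a filter.
variants = [
    {"locus": "chr1:1000", "filters": set()},
    {"locus": "chr1:2000", "filters": {"AC0"}},
    {"locus": "chr1:3000", "filters": set()},
]
passing = keep_pass_variants(variants)
```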

--variant-qc-annotations

Variant QC annotations to add to the output file. Defaults to all annotations.

--export-meta

Pull metadata for the subset samples and export it to a Hail Table (HT) and a .tsv file.

Default: False

--keep-data-paths

Keep CRAM and gVCF paths in the project metadata export.

Default: False

--output-path

Output path for the subsetted VDS/VCF/MT. Do not include the file name or file extension.

--output-filename

Name of the output file. Do not include the file extension.

Default: “subset”
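
Because --output-path and --output-filename both exclude the file extension, the final location is presumably assembled from the two plus an extension chosen by the output type. The sketch below shows one way that could look. The function name, the joining rule, and the extension strings are all assumptions, not taken from the script.

```python
def build_output_path(
    output_path: str, output_filename: str = "subset", extension: str = "vds"
) -> str:
    """Join directory, file name, and extension (assumed layout, not the script's)."""
    # Strip a trailing slash so the result never contains a doubled separator.
    return f"{output_path.rstrip('/')}/{output_filename}.{extension}"


# Hypothetical bucket path; 'vds' and 'vcf.bgz' are assumed extensions.
vds_target = build_output_path("gs://my-bucket/subsets/")
vcf_target = build_output_path("gs://my-bucket/subsets", "my_cohort", "vcf.bgz")
```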

--output-partitions

Number of desired partitions for the output file.

-o, --overwrite

Overwrite all data from this subset.

Default: False

--tmp-dir

Temporary directory for Hail to write files to.

Default: “gs://gnomad-tmp-4day”

Module Functions

gnomad_qc.v4.subset.ProcessingConfig(...[, ...])

Configuration for data processing operations.

gnomad_qc.v4.subset.get_gnomad_datasets(...)

Get requested data type's v4 VariantDataset.

gnomad_qc.v4.subset.get_subset_ht(...)

Get the subset HT.

gnomad_qc.v4.subset.check_subset_ht(...)

Check that the subset HT is valid.

gnomad_qc.v4.subset.apply_split_multi_logic(...)

Apply split multi logic to either a MatrixTable (VCF path) or VariantDataset (VDS path).

gnomad_qc.v4.subset.apply_min_rep_logic(vds)

Apply min_rep logic to an unsplit VariantDataset.

gnomad_qc.v4.subset.apply_variant_qc_annotations(...)

Apply variant QC annotations and filtering to either a MatrixTable or VariantDataset.

gnomad_qc.v4.subset.filter_to_pass_only(...)

Filter to variants that passed variant QC.

gnomad_qc.v4.subset.format_vcf_info_fields(mt)

Apply VCF-specific formatting to info fields for export.

gnomad_qc.v4.subset.process_metadata_export(...)

Process metadata and return the processed metadata Table.

gnomad_qc.v4.subset.main(args)

Filter the gnomAD v4 VariantDataset to a subset of specified samples.

gnomad_qc.v4.subset.get_script_argument_parser()

Get script argument parser.

class gnomad_qc.v4.subset.ProcessingConfig(split_multi, pass_only, add_variant_qc, variant_qc_annotations, data_type, output_vcf=False, output_vds=False, export_meta=False, keep_data_paths=False, overwrite=False, output_partitions=None, test=False, rep_on_read_partitions=None, output_path='', output_filename='subset', subset_samples=None, subset_workspaces=None, tmp_dir='')

Configuration for data processing operations.

Parameters:
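  • split_multi (bool) – Whether to split multi-allelic variants.

  • pass_only (bool) – Whether to keep only the variants that passed variant QC.

  • add_variant_qc (bool) – Whether to annotate the exported file with gnomAD’s variant QC annotations.

  • variant_qc_annotations (Optional[List[str]]) – Variant QC annotations to add to the output file. Defaults to all annotations.

  • data_type (str) – Type of data to subset.

  • output_vcf (bool) – Whether to output a subset VCF.

  • output_vds (bool) – Whether to output a subset VDS.

  • export_meta (bool) – Whether to pull sample subset metadata and export it to an HT and a .tsv file.

  • keep_data_paths (bool) – Whether to keep CRAM and gVCF paths in the project metadata export.

  • overwrite (bool) – Whether to overwrite all data from this subset.

  • output_partitions (Optional[int]) – Number of desired partitions for the output file.

  • test (bool) – Whether to filter to the first 2 partitions for testing.

  • rep_on_read_partitions (Optional[int]) – Number of partitions to pass when reading in the VDS.

  • output_path (str) – Output path for the subsetted VDS/VCF/MT, without file name or file extension.

  • output_filename (str) – Name of the output file, without file extension.

  • subset_samples (Optional[str]) – Path to a text file with sample IDs for subsetting and a header: ‘s’.

  • subset_workspaces (Optional[str]) – Path to a text file with Terra workspaces that should be included in the subset and a header: ‘terra_workspace’.

  • tmp_dir (str) – Temporary directory for Hail to write files to.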

split_multi: bool
pass_only: bool
add_variant_qc: bool
variant_qc_annotations: Optional[List[str]]
data_type: str
output_vcf: bool = False
output_vds: bool = False
export_meta: bool = False
keep_data_paths: bool = False
overwrite: bool = False
output_partitions: Optional[int] = None
test: bool = False
rep_on_read_partitions: Optional[int] = None
output_path: str = ''
output_filename: str = 'subset'
subset_samples: Optional[str] = None
subset_workspaces: Optional[str] = None
tmp_dir: str = ''
classmethod from_args(args)

Create ProcessingConfig from argparse arguments.

Return type:

ProcessingConfig
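
The pattern of building a configuration object from parsed arguments can be sketched with a dataclass. The class name ProcessingConfigSketch is hypothetical, only a subset of the documented fields is included, and the body of from_args is an assumption about how such a classmethod could work rather than the module's actual code.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProcessingConfigSketch:
    """Hypothetical, trimmed-down stand-in for ProcessingConfig."""

    split_multi: bool
    pass_only: bool
    add_variant_qc: bool
    variant_qc_annotations: Optional[List[str]]
    data_type: str
    output_vcf: bool = False
    output_vds: bool = False
    output_path: str = ""
    output_filename: str = "subset"

    @classmethod
    def from_args(cls, args: argparse.Namespace) -> "ProcessingConfigSketch":
        """Copy each declared field from the namespace, ignoring extra attributes."""
        return cls(**{f.name: getattr(args, f.name) for f in fields(cls)})


# Hypothetical namespace, as argparse would produce after parsing.
namespace = argparse.Namespace(
    split_multi=True,
    pass_only=False,
    add_variant_qc=False,
    variant_qc_annotations=None,
    data_type="exomes",
    output_vcf=True,
    output_vds=False,
    output_path="gs://my-bucket/subsets",
    output_filename="subset",
    tmp_dir="gs://my-bucket/tmp",  # Not a field of the sketch, so it is ignored.
)
config = ProcessingConfigSketch.from_args(namespace)
```

Bundling the flags into one object lets the processing functions take a single config argument instead of a long list of keyword arguments.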

gnomad_qc.v4.subset.get_gnomad_datasets(data_type, n_partitions, test)

Get requested data type’s v4 VariantDataset.

Parameters:
  • data_type (str) – Type of data to subset.

  • n_partitions (Optional[int]) – Number of desired partitions for the VDS, repartitioned on read.

  • test (bool) – Whether to filter to the first 2 partitions for testing.

Returns:

The gnomAD v4 VariantDataset and metadata Table.

gnomad_qc.v4.subset.get_subset_ht(subset_samples, subset_workspaces, meta_ht)

Get the subset HT.

Parameters:
  • subset_samples (Optional[str]) – Path to a text file with sample IDs for subsetting and a header: ‘s’.

  • subset_workspaces (Optional[str]) – Path to a text file with Terra workspaces that should be included in the subset and a header: ‘terra_workspace’.

  • meta_ht (Optional[Table]) – The meta HT.

Return type:

Table

Returns:

The subset HT.

gnomad_qc.v4.subset.check_subset_ht(subset_ht, vmt_cols)

Check that the subset HT is valid.

Parameters:
  • subset_ht (Table) – The subset HT.

  • vmt_cols (Table) – The variant data MatrixTable samples.

gnomad_qc.v4.subset.apply_split_multi_logic(mtds, config)

Apply split multi logic to either a MatrixTable (VCF path) or VariantDataset (VDS path).

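Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – MatrixTable or VariantDataset to process.

  • config (ProcessingConfig) – Processing configuration.
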
Return type:

Union[MatrixTable, VariantDataset]

Returns:

MatrixTable or VariantDataset with multi-allelic sites split.

gnomad_qc.v4.subset.apply_min_rep_logic(vds)

Apply min_rep logic to an unsplit VariantDataset.

Parameters:
  • vds (VariantDataset) – VariantDataset to process.

Return type:

VariantDataset

Returns:

VariantDataset with rows keyed by minimum representation.

gnomad_qc.v4.subset.apply_variant_qc_annotations(mtds, config)

Apply variant QC annotations and filtering to either a MatrixTable or VariantDataset.

Note

If the variant QC annotations are not specified, all annotations will be added.

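Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – MatrixTable or VariantDataset to annotate.

  • config (ProcessingConfig) – Processing configuration.
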
Return type:

Union[MatrixTable, VariantDataset]

Returns:

MatrixTable or VariantDataset with variant QC annotations.

gnomad_qc.v4.subset.filter_to_pass_only(mtds, config)

Filter to variants that passed variant QC.

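Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – MatrixTable or VariantDataset to filter.

  • config (ProcessingConfig) – Processing configuration.
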
Return type:

Union[MatrixTable, VariantDataset]

Returns:

MatrixTable or VariantDataset containing only variants that passed variant QC.

gnomad_qc.v4.subset.format_vcf_info_fields(mt)

Apply VCF-specific formatting to info fields for export.

Parameters:

mt (MatrixTable) – MatrixTable to format.

Return type:

MatrixTable

Returns:

MatrixTable with fields formatted for VCF export.

gnomad_qc.v4.subset.process_metadata_export(meta_ht, vmt, config)

Process metadata and return the processed metadata Table.

Parameters:
  • meta_ht (Table) – The metadata Table to process.

  • vmt (MatrixTable) – The dataset’s subsetted MatrixTable containing the samples to keep.

  • config (ProcessingConfig) – Processing configuration.

Return type:

Table

Returns:

The subsetted metadata Table.

gnomad_qc.v4.subset.main(args)

Filter the gnomAD v4 VariantDataset to a subset of specified samples.