gnomad_qc.v4.subset

Script to filter the gnomAD v4 VariantDataset to a subset of specified samples.

This script subsets gnomAD using a list of samples or Terra workspaces.

usage: gnomad_qc.v4.subset.py [-h] [--test]
                              (--subset-samples SUBSET_SAMPLES | --subset-workspaces SUBSET_WORKSPACES)
                              [--include-ukb-200k] [--vds] [--vcf]
                              [--dense-mt] [--split-multi]
                              [--n-partitions N_PARTITIONS]
                              [--subset-call-stats] [--add-variant-qc]
                              [--pass-only]
                              [--variant-qc-annotations VARIANT_QC_ANNOTATIONS [VARIANT_QC_ANNOTATIONS ...]]
                              [--export-meta] [--keep-data-paths]
                              --output-path OUTPUT_PATH [-o]
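The parenthesized pair in the usage line means that exactly one of --subset-samples and --subset-workspaces must be supplied, and --output-path is always required. The following is a minimal sketch of how such a command line behaves under argparse. It is a reduced, hypothetical parser written for illustration, not the parser returned by get_script_argument_parser(); the file names and the integer type of --n-partitions are assumptions.

```python
import argparse


def build_sketch_parser() -> argparse.ArgumentParser:
    """Build a reduced, hypothetical parser that mirrors part of the usage line."""
    parser = argparse.ArgumentParser(prog="gnomad_qc.v4.subset.py")

    # Exactly one of the two subset sources must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--subset-samples")
    source.add_argument("--subset-workspaces")

    # Output type switches; each defaults to False.
    parser.add_argument("--vds", action="store_true")
    parser.add_argument("--vcf", action="store_true")
    parser.add_argument("--dense-mt", action="store_true")

    # Assumption: the partition count is parsed as an integer.
    parser.add_argument("--n-partitions", type=int)
    parser.add_argument("--output-path", required=True)
    parser.add_argument("-o", "--overwrite", action="store_true")
    return parser


# Hypothetical invocation: subset by sample list and write a VDS.
args = build_sketch_parser().parse_args(
    ["--subset-samples", "samples.txt", "--vds", "--output-path", "out/my_subset"]
)
```

With this sketch, passing both --subset-samples and --subset-workspaces, or neither, makes argparse exit with a usage error before any data is read.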

Named Arguments

--test

Filter to the first 2 partitions for testing.

Default: False

--subset-samples

Path to a text file of sample IDs to subset to. The file must have the header ‘s’.

--subset-workspaces

Path to a text file of Terra workspaces to include in the subset. The file must have the header ‘terra_workspace’.
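As an illustration of the two input files described above, the sketch below writes a sample list with the header s and a workspace list with the header terra_workspace. The one-value-per-line layout, the helper name write_id_list, and the sample and workspace names are assumptions made for this example; only the header names come from the argument descriptions.

```python
import os
import tempfile
from typing import List


def write_id_list(path: str, header: str, ids: List[str]) -> None:
    """Write a one-column text file: a header line, then one value per line."""
    with open(path, "w") as handle:
        handle.write(header + "\n")
        for value in ids:
            handle.write(value + "\n")


# Write both files into a temporary directory (hypothetical location).
out_dir = tempfile.mkdtemp()
samples_path = os.path.join(out_dir, "samples.txt")
workspaces_path = os.path.join(out_dir, "workspaces.txt")

# Header 's' for --subset-samples, 'terra_workspace' for --subset-workspaces.
write_id_list(samples_path, "s", ["sample_1", "sample_2"])
write_id_list(workspaces_path, "terra_workspace", ["workspace_a"])
```

Only one of the two files would be passed to the script in a given run, since the two arguments are mutually exclusive.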

--include-ukb-200k

Whether to include the 200K UK Biobank samples.

Default: False

--vds

Whether to make a subset VDS.

Default: False

--vcf

Whether to make a subset VCF.

Default: False

--dense-mt

Whether to make a dense MT.

Default: False

--split-multi

Whether to split multi-allelic variants.

Default: False

--n-partitions

Desired number of partitions for the subset VDS (if --vds is set) and/or MT (if --dense-mt is set), and/or the number of shards in the output VCF (if --vcf is set). By default, partitioning is unchanged.

--subset-call-stats

Add subset call statistics: AC, AN, AF, and nhomalt.

Default: False
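For readers unfamiliar with the four call statistics, the sketch below computes them for a single biallelic site from a plain Python list of diploid genotypes. It uses the conventional definitions (AC: alternate allele count, AN: number of called alleles, AF: AC divided by AN, nhomalt: number of homozygous-alternate samples) and is not the script's Hail implementation; the function name and the genotype encoding are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# A diploid genotype as a pair of allele indices (0 = ref, 1 = alt),
# or None when the genotype is missing.
Genotype = Optional[Tuple[int, int]]


def call_stats(genotypes: List[Genotype]) -> Dict[str, Optional[float]]:
    """Compute AC, AN, AF, and nhomalt for one biallelic site."""
    called = [gt for gt in genotypes if gt is not None]
    an = 2 * len(called)  # two alleles per called diploid genotype
    ac = sum(allele == 1 for gt in called for allele in gt)
    nhomalt = sum(gt == (1, 1) for gt in called)
    af = ac / an if an else None  # undefined when nothing is called
    return {"AC": ac, "AN": an, "AF": af, "nhomalt": nhomalt}


# Four samples: hom-ref, het, hom-alt, and one missing genotype.
stats = call_stats([(0, 0), (0, 1), (1, 1), None])
```

Because the statistics are computed over the subset only, they will generally differ from the values in the full gnomAD release.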

--add-variant-qc

Annotate the exported file with gnomAD’s variant QC annotations. Defaults to all annotations if a subset of annotations is not specified using the --variant-qc-annotations argument.

Default: False

--pass-only

Keep only the variants that passed variant QC, i.e. the filter field is PASS.

Default: False

--variant-qc-annotations

Variant QC annotations to add to the output file. Defaults to all annotations.

--export-meta

Pull metadata for the subset samples and export it to a Hail Table (HT) and a .tsv file.

Default: False

--keep-data-paths

Keep CRAM and gVCF paths in the project metadata export.

Default: False

--output-path

Output file path for the subsetted VDS/VCF/MT. Do not include a file extension.

-o, --overwrite

Overwrite all data from this subset.

Default: False

Module Functions

gnomad_qc.v4.subset.make_variant_qc_annotations_dict(...)

Make a dictionary of gnomAD release annotation expressions to annotate onto the subsetted data.

gnomad_qc.v4.subset.main(args)

Filter the gnomAD v4 VariantDataset to a subset of specified samples.

gnomad_qc.v4.subset.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v4.subset.make_variant_qc_annotations_dict(key_expr, vqc_annotations=None)

Make a dictionary of gnomAD release annotation expressions to annotate onto the subsetted data.

Parameters:
  • key_expr (StructExpression) – Key to join annotations on.

  • vqc_annotations (Optional[List[str]]) – Optional list of desired annotations from the release HT.

Return type:

Dict[str, Expression]

Returns:

Dictionary containing Hail expressions to annotate onto subset.
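The "optional list, otherwise all annotations" behavior described above can be sketched in plain Python. This is only an analogy: a dictionary of strings stands in for the release HT and its Hail expressions, the function name select_annotations and the annotation names are hypothetical, and raising an error on unknown names is a choice of this sketch rather than documented behavior of make_variant_qc_annotations_dict.

```python
from typing import Dict, List, Optional


def select_annotations(
    available: Dict[str, str],
    requested: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Return the requested annotations, or all of them when none are requested."""
    if requested is None:
        # No list given: default to every available annotation.
        return dict(available)

    # Assumption of this sketch: unknown names are reported as an error.
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"Unknown annotations: {unknown}")
    return {name: available[name] for name in requested}


# Placeholder annotation names and values standing in for Hail expressions.
release = {"annotation_a": "expr_a", "annotation_b": "expr_b"}

all_annotations = select_annotations(release)
only_a = select_annotations(release, ["annotation_a"])
```

In the real function the values are Hail expressions joined on key_expr, so the returned dictionary can be passed to an annotate call on the subsetted data.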

gnomad_qc.v4.subset.main(args)

Filter the gnomAD v4 VariantDataset to a subset of specified samples.