gnomad_qc.v4.resources.variant_qc

Script containing variant QC related resources.

Module Functions

gnomad_qc.v4.resources.variant_qc.VQSR_FEATURES

Features used in the VQSR model, keyed by data type and variant type.

gnomad_qc.v4.resources.variant_qc.SYNDIP

String representation for syndip truth sample.

gnomad_qc.v4.resources.variant_qc.NA12878

String representation for NA12878 truth sample.

gnomad_qc.v4.resources.variant_qc.UKB_NA12878

String representation for the UKB Regeneron generated NA12878 truth sample.

gnomad_qc.v4.resources.variant_qc.TRUTH_SAMPLES

Dictionary containing necessary information for truth samples.

gnomad_qc.v4.resources.variant_qc.get_callset_truth_data(...)

Get resources for the truth sample data that is subset from the full callset.

gnomad_qc.v4.resources.variant_qc.get_score_bins(...)

Return the path to a Table containing RF or VQSR scores, annotated with a bin based on the rank of the metric scores.

gnomad_qc.v4.resources.variant_qc.get_binned_concordance(...)

Return the path to a truth sample concordance Table.

gnomad_qc.v4.resources.variant_qc.get_rf_run_path([...])

Return the path to the JSON file containing the RF runs list.

gnomad_qc.v4.resources.variant_qc.get_rf_model_path(...)

Get the path to the RF model for a given run.

gnomad_qc.v4.resources.variant_qc.get_rf_training(...)

Get the training data for a given run.

gnomad_qc.v4.resources.variant_qc.get_variant_qc_result(...)

Get the results of variant QC filtering for a given run.

gnomad_qc.v4.resources.variant_qc.final_filter([...])

Get finalized variant QC filtering Table.

gnomad_qc.v4.resources.variant_qc.VQSR_FEATURES = {'exomes': {'indel': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS'], 'snv': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS', 'AS_MQ']}, 'genomes': {'indel': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS', 'AS_SOR'], 'snv': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS', 'AS_SOR', 'AS_MQ']}}

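Dictionary of features used in the VQSR model, keyed by data type ('exomes' or 'genomes') and then by variant type ('snv' or 'indel'). Each value is a list of feature names.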
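The nested layout can be illustrated with a short sketch. The dictionary literal below is copied from the value shown above so that the example runs without gnomad_qc installed; the helper vqsr_features_for is hypothetical and is not part of the module.

```python
# Copied from the documented VQSR_FEATURES value so the sketch is standalone.
VQSR_FEATURES = {
    "exomes": {
        "indel": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS"],
        "snv": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS", "AS_MQ"],
    },
    "genomes": {
        "indel": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS", "AS_SOR"],
        "snv": [
            "AS_QD",
            "AS_MQRankSum",
            "AS_ReadPosRankSum",
            "AS_FS",
            "AS_SOR",
            "AS_MQ",
        ],
    },
}


def vqsr_features_for(data_type, variant_type):
    """Hypothetical helper: look up the feature list for one model."""
    try:
        return VQSR_FEATURES[data_type][variant_type]
    except KeyError:
        raise ValueError(
            f"No VQSR features for data_type={data_type!r}, "
            f"variant_type={variant_type!r}"
        ) from None


# Features used for genome SNVs but not for exome SNVs.
genome_only_snv = sorted(
    set(vqsr_features_for("genomes", "snv"))
    - set(vqsr_features_for("exomes", "snv"))
)
```

Raising ValueError for an unknown key is a design choice of this sketch, not documented behavior of the module.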

gnomad_qc.v4.resources.variant_qc.SYNDIP = 'CHMI_CHMI3_Nex1'

String representation for syndip truth sample.

gnomad_qc.v4.resources.variant_qc.NA12878 = 'ASC-4Set-1573S_NA12878@1075619236'

String representation for NA12878 truth sample.

gnomad_qc.v4.resources.variant_qc.UKB_NA12878 = 'Coriell_NA12878_NA12878'

String representation for the UKB Regeneron generated NA12878 truth sample.

gnomad_qc.v4.resources.variant_qc.TRUTH_SAMPLES = {'NA12878': {'hc_intervals': GnomadPublicTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7_hc_regions.ht,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed', 'reference_genome': 'GRCh38', 'skip_invalid_intervals': True}), 's': 'ASC-4Set-1573S_NA12878@1075619236', 'truth_mt': GnomadPublicMatrixTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.mt,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz', 'force_bgz': True, 'min_partitions': 100, 'reference_genome': 'GRCh38'})}, 'UKB_NA12878': {'hc_intervals': GnomadPublicTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7_hc_regions.ht,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed', 'reference_genome': 'GRCh38', 'skip_invalid_intervals': True}), 's': 'Coriell_NA12878_NA12878', 'truth_mt': GnomadPublicMatrixTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.mt,import_args={'path': 
'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz', 'force_bgz': True, 'min_partitions': 100, 'reference_genome': 'GRCh38'})}, 'syndip': {'hc_intervals': VersionedTableResource(default_version=20180222, versions={"20180222": GnomadPublicTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/syndip/syndip_b38_20180222_hc_regions.ht,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/syndip/syndip.b38_20180222.bed', 'reference_genome': 'GRCh38', 'skip_invalid_intervals': True, 'min_partitions': 10})}), 's': 'CHMI_CHMI3_Nex1', 'truth_mt': VersionedMatrixTableResource(default_version=20180222, versions={"20180222": GnomadPublicMatrixTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/syndip/syndip.b38_20180222.mt,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/syndip/full.38.20180222.vcf.gz', 'force_bgz': True, 'min_partitions': 100, 'reference_genome': 'GRCh38'})})}}

Dictionary containing necessary information for truth samples.

Current truth samples available are syndip, NA12878, and UKB_NA12878. Available data for each are the following:

  • s: Sample name in the callset

  • truth_mt: Truth sample MatrixTable resource

  • hc_intervals: High confidence interval Table resource in truth sample
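As a rough illustration of this per-sample layout, the sketch below builds a stand-in dictionary with the same three keys. The sample names are the ones documented above; the resource objects are replaced with None placeholders, and both helper functions are hypothetical rather than part of gnomad_qc.

```python
# Stand-in with the same shape as TRUTH_SAMPLES. The 's' values are the
# documented sample names; the resources are None placeholders here.
TRUTH_SAMPLES_SKETCH = {
    "syndip": {
        "s": "CHMI_CHMI3_Nex1",
        "truth_mt": None,
        "hc_intervals": None,
    },
    "NA12878": {
        "s": "ASC-4Set-1573S_NA12878@1075619236",
        "truth_mt": None,
        "hc_intervals": None,
    },
    "UKB_NA12878": {
        "s": "Coriell_NA12878_NA12878",
        "truth_mt": None,
        "hc_intervals": None,
    },
}


def callset_sample_name(truth_sample):
    """Hypothetical helper: return the callset sample name for a truth sample."""
    if truth_sample not in TRUTH_SAMPLES_SKETCH:
        raise ValueError(
            f"Unknown truth sample {truth_sample!r}; "
            f"expected one of {sorted(TRUTH_SAMPLES_SKETCH)}"
        )
    return TRUTH_SAMPLES_SKETCH[truth_sample]["s"]


def truth_sample_for(callset_name):
    """Hypothetical helper: reverse lookup from callset name to truth sample key."""
    by_name = {info["s"]: key for key, info in TRUTH_SAMPLES_SKETCH.items()}
    return by_name.get(callset_name)
```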

gnomad_qc.v4.resources.variant_qc.get_callset_truth_data(truth_sample, mt=True, test=False)[source]

Get resources for the truth sample data that is subset from the full callset.

If mt is True, this returns the truth sample MatrixTable (subset from the callset); otherwise, it returns the merged truth sample Table, which includes both the truth data and the data from the callset.

Parameters:
  • truth_sample (str) – Name of the truth sample.

  • mt (bool) – Whether the path is for a MatrixTable. Default is True.

  • test (bool) – Whether to use a tmp path for variant QC tests.

Return type:

Union[VersionedMatrixTableResource, VersionedTableResource]

Returns:

Resource for the callset truth sample MatrixTable, or for the merged truth sample Table if mt is False.

gnomad_qc.v4.resources.variant_qc.get_score_bins(model_id, aggregated, test=False)[source]

Return the path to a Table containing RF or VQSR scores, annotated with a bin based on the rank of the metric scores.

Note

These Tables are only available for exomes data. Use the v3 resources for the genomes data.

Parameters:
  • model_id (str) – RF or VQSR model ID for which to return score data.

  • aggregated (bool) – Whether to get the aggregated data. If True, will return the path to Table grouped by bin that contains aggregated variant counts per bin.

  • test (bool) – Whether to use a tmp path for variant QC tests.

Return type:

VersionedTableResource

Returns:

Path to the desired Hail Table.

gnomad_qc.v4.resources.variant_qc.get_binned_concordance(model_id, truth_sample, test=False)[source]

Return the path to a truth sample concordance Table.

Note

These Tables are only available for exomes data. Use the v3 resources for the genomes data.

This Table contains concordance information (TP, FP, FN) between a truth sample within the callset and the sample’s truth data, grouped by bins of a metric (RF or VQSR scores).

Parameters:
  • model_id (str) – RF or VQSR model ID for which to return score data.

  • truth_sample (str) – Which truth sample concordance to analyze (e.g., “NA12878” or “syndip”).

  • test (bool) – Whether to use a tmp path for variant QC tests.

Return type:

VersionedTableResource

Returns:

Path to binned truth data concordance Hail Table.

gnomad_qc.v4.resources.variant_qc.get_rf_run_path(version='4.0', test=False)[source]

Return the path to the JSON file containing the RF runs list.

Parameters:
  • version (str) – Version of RF path to return.

  • test (bool) – Whether to return the test RF runs list.

Return type:

str

Returns:

Path to the JSON file.

gnomad_qc.v4.resources.variant_qc.get_rf_model_path(model_id, version='4.0', test=False)[source]

Get the path to the RF model for a given run.

Parameters:
  • model_id (str) – RF run to load.

  • version (str) – Version of model path to return.

  • test (bool) – Whether to use a tmp path for variant QC tests.

Return type:

str

Returns:

Path to the RF model.

gnomad_qc.v4.resources.variant_qc.get_rf_training(model_id, test=False)[source]

Get the training data for a given run.

Parameters:
  • model_id (str) – RF run to load.

  • test (bool) – Whether to use a tmp path for variant QC tests.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for RF training data.

gnomad_qc.v4.resources.variant_qc.get_variant_qc_result(model_id, test=False, split=True)[source]

Get the results of variant QC filtering for a given run.

Note

These Tables are only available for exomes data. Use the v3 resources for the genomes data.

Parameters:
  • model_id (str) – Model ID of variant QC run to load. Must start with ‘rf_’, ‘vqsr_’, or ‘if_’.

  • test (bool) – Whether to use a tmp path for variant QC tests.

  • split (bool) – Whether to return the split or unsplit variant QC result.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for variant QC results.
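The prefix requirement on model_id can be checked before requesting a resource. The following is a minimal sketch of such a check; model_type_from_id and the example IDs are hypothetical, and the source does not say how gnomad_qc itself validates or reports an invalid ID.

```python
# Prefixes taken from the model_id parameter description above.
VALID_MODEL_PREFIXES = ("rf_", "vqsr_", "if_")


def model_type_from_id(model_id):
    """Hypothetical helper: return 'rf', 'vqsr', or 'if' for a model ID."""
    for prefix in VALID_MODEL_PREFIXES:
        if model_id.startswith(prefix):
            # Drop the trailing underscore to get the bare model type.
            return prefix[:-1]
    raise ValueError(
        f"model_id {model_id!r} must start with one of {VALID_MODEL_PREFIXES}"
    )


# The IDs below are made up for illustration only.
example_types = [model_type_from_id(m) for m in ("rf_example", "vqsr_example")]
```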

gnomad_qc.v4.resources.variant_qc.final_filter(data_type='exomes', test=False)[source]

Get finalized variant QC filtering Table.

Parameters:
  • data_type (str) – Whether to return ‘exomes’ or ‘genomes’ data. Default is exomes.

  • test (bool) – Whether to use a tmp path for variant QC tests.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for final variant QC data.