gnomad_qc.v4.resources.variant_qc
Script containing variant QC related resources.
Module Functions
VQSR_FEATURES | List of features used in the VQSR model.
SYNDIP | String representation for syndip truth sample.
NA12878 | String representation for NA12878 truth sample.
UKB_NA12878 | String representation for the UKB Regeneron generated NA12878 truth sample.
TRUTH_SAMPLES | Dictionary containing necessary information for truth samples.
get_callset_truth_data | Get resources for the truth sample data that is subset from the full callset.
get_score_bins | Return the path to a Table containing RF or VQSR scores and annotated with a bin based on rank of the metric scores.
get_binned_concordance | Return the path to a truth sample concordance Table.
get_rf_run_path | Return the path to the json file containing the RF runs list.
get_rf_model_path | Get the path to the RF model for a given run.
get_rf_training | Get the training data for a given run.
get_variant_qc_result | Get the results of variant QC filtering for a given run.
final_filter | Get finalized variant QC filtering Table.
- gnomad_qc.v4.resources.variant_qc.VQSR_FEATURES = {'exomes': {'indel': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS'], 'snv': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS', 'AS_MQ']}, 'genomes': {'indel': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS', 'AS_SOR'], 'snv': ['AS_QD', 'AS_MQRankSum', 'AS_ReadPosRankSum', 'AS_FS', 'AS_SOR', 'AS_MQ']}}
Features used in the VQSR model, keyed by data type ('exomes' or 'genomes') and variant type ('indel' or 'snv').
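A minimal sketch of how the nested dictionary can be queried. The values are copied from the definition above so the snippet runs without gnomad_qc installed; the helper name `features_for` is hypothetical and not part of the module.

```python
# Values copied from the VQSR_FEATURES definition above.
VQSR_FEATURES = {
    "exomes": {
        "indel": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS"],
        "snv": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS", "AS_MQ"],
    },
    "genomes": {
        "indel": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS", "AS_SOR"],
        "snv": ["AS_QD", "AS_MQRankSum", "AS_ReadPosRankSum", "AS_FS", "AS_SOR", "AS_MQ"],
    },
}


def features_for(data_type: str, variant_type: str) -> list:
    """Hypothetical helper: look up the feature list for one model."""
    return VQSR_FEATURES[data_type][variant_type]


# Features listed for genomes but not for exomes, per variant type.
genome_only = {
    vt: sorted(set(features_for("genomes", vt)) - set(features_for("exomes", vt)))
    for vt in ("indel", "snv")
}
print(genome_only)  # {'indel': ['AS_SOR'], 'snv': ['AS_SOR']}
```

In the values shown, the exomes and genomes lists differ only by AS_SOR, and the SNV lists add AS_MQ to the indel lists.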
- gnomad_qc.v4.resources.variant_qc.SYNDIP = 'CHMI_CHMI3_Nex1'
String representation for syndip truth sample.
- gnomad_qc.v4.resources.variant_qc.NA12878 = 'ASC-4Set-1573S_NA12878@1075619236'
String representation for NA12878 truth sample.
- gnomad_qc.v4.resources.variant_qc.UKB_NA12878 = 'Coriell_NA12878_NA12878'
String representation for the UKB Regeneron generated NA12878 truth sample.
- gnomad_qc.v4.resources.variant_qc.TRUTH_SAMPLES = {'NA12878': {'hc_intervals': GnomadPublicTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7_hc_regions.ht,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed', 'reference_genome': 'GRCh38', 'skip_invalid_intervals': True}), 's': 'ASC-4Set-1573S_NA12878@1075619236', 'truth_mt': GnomadPublicMatrixTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.mt,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz', 'force_bgz': True, 'min_partitions': 100, 'reference_genome': 'GRCh38'})}, 'UKB_NA12878': {'hc_intervals': GnomadPublicTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7_hc_regions.ht,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed', 'reference_genome': 'GRCh38', 'skip_invalid_intervals': True}), 's': 'Coriell_NA12878_NA12878', 'truth_mt': GnomadPublicMatrixTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.mt,import_args={'path': 
'gs://gcp-public-data--gnomad/resources/grch38/na12878/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz', 'force_bgz': True, 'min_partitions': 100, 'reference_genome': 'GRCh38'})}, 'syndip': {'hc_intervals': VersionedTableResource(default_version=20180222, versions={"20180222": GnomadPublicTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/syndip/syndip_b38_20180222_hc_regions.ht,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/syndip/syndip.b38_20180222.bed', 'reference_genome': 'GRCh38', 'skip_invalid_intervals': True, 'min_partitions': 10})}), 's': 'CHMI_CHMI3_Nex1', 'truth_mt': VersionedMatrixTableResource(default_version=20180222, versions={"20180222": GnomadPublicMatrixTableResource(path=gs://gnomad-public-requester-pays/resources/grch38/syndip/syndip.b38_20180222.mt,import_args={'path': 'gs://gcp-public-data--gnomad/resources/grch38/syndip/full.38.20180222.vcf.gz', 'force_bgz': True, 'min_partitions': 100, 'reference_genome': 'GRCh38'})})}}
Dictionary containing necessary information for truth samples.
Current truth samples available are syndip, NA12878, and UKB_NA12878. The following data are available for each:
s: Sample name in the callset
truth_mt: Truth sample MatrixTable resource
hc_intervals: High-confidence interval Table resource for the truth sample
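The layout described above can be sketched with a simplified stand-in. This is an illustration only: the real entries hold gnomAD resource objects, which are replaced here by placeholder strings, and the names `TRUTH_SAMPLES_SKETCH` and `callset_sample_name` are hypothetical.

```python
# Sample names copied from the module constants documented above.
SYNDIP = "CHMI_CHMI3_Nex1"
NA12878 = "ASC-4Set-1573S_NA12878@1075619236"
UKB_NA12878 = "Coriell_NA12878_NA12878"

# Stand-in for TRUTH_SAMPLES: placeholder strings replace the resource objects.
TRUTH_SAMPLES_SKETCH = {
    name: {
        "s": sample_id,
        "truth_mt": "<MatrixTable resource>",
        "hc_intervals": "<Table resource>",
    }
    for name, sample_id in (
        ("syndip", SYNDIP),
        ("NA12878", NA12878),
        ("UKB_NA12878", UKB_NA12878),
    )
}


def callset_sample_name(truth_sample: str) -> str:
    """Hypothetical helper: map a truth sample key to its callset sample name."""
    if truth_sample not in TRUTH_SAMPLES_SKETCH:
        raise ValueError(f"Unknown truth sample: {truth_sample!r}")
    return TRUTH_SAMPLES_SKETCH[truth_sample]["s"]


print(callset_sample_name("syndip"))  # CHMI_CHMI3_Nex1
```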
- gnomad_qc.v4.resources.variant_qc.get_callset_truth_data(truth_sample, mt=True, test=False)
Get resources for the truth sample data that is subset from the full callset.
If mt is True, this returns the truth sample MatrixTable (subset from the callset); otherwise it returns the merged truth sample Table, which includes both the truth data and the data from the callset.
- Parameters:
truth_sample (str) – Name of the truth sample.
mt (bool) – Whether path is for a MatrixTable, default is True.
test (bool) – Whether to use a tmp path for variant QC tests.
- Return type:
Union[VersionedMatrixTableResource, VersionedTableResource]
- Returns:
Resource for the callset truth sample MatrixTable if mt is True; otherwise the merged truth sample Table.
- gnomad_qc.v4.resources.variant_qc.get_score_bins(model_id, aggregated, test=False)
Return the path to a Table containing RF or VQSR scores, annotated with a bin based on the rank of the metric scores.
Note
These Tables are only available for exomes data. Use the v3 resources for the genomes data.
- Parameters:
model_id (str) – RF or VQSR model ID for which to return score data.
aggregated (bool) – Whether to get the aggregated data. If True, return the path to a Table grouped by bin that contains aggregated variant counts per bin.
test (bool) – Whether to use a tmp path for variant QC tests.
- Return type:
VersionedTableResource
- Returns:
Path to the desired Hail Table.
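The idea of binning by score rank, and of aggregating variant counts per bin, can be sketched in plain Python. This is an illustration under stated assumptions, not the gnomAD implementation (which operates on Hail Tables): the function names are hypothetical, higher scores are assumed to rank first, and ties are broken by input order.

```python
from collections import Counter


def assign_rank_bins(scores, n_bins=100):
    """Assign each score a 1-based bin from its rank (highest score -> bin 1)."""
    # Indices ordered by score, highest first (ordering is an assumption).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bins = [0] * len(scores)
    for rank, idx in enumerate(order):
        # Spread ranks 0..n-1 evenly across bins 1..n_bins.
        bins[idx] = rank * n_bins // len(scores) + 1
    return bins


def aggregate_bins(bins):
    """Count variants per bin, analogous to the aggregated Table."""
    return dict(sorted(Counter(bins).items()))


scores = [0.9, 0.1, 0.5, 0.7]
bins = assign_rank_bins(scores, n_bins=2)
print(bins)                  # [1, 2, 2, 1]
print(aggregate_bins(bins))  # {1: 2, 2: 2}
```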
- gnomad_qc.v4.resources.variant_qc.get_binned_concordance(model_id, truth_sample, test=False)
Return the path to a truth sample concordance Table.
Note
These Tables are only available for exomes data. Use the v3 resources for the genomes data.
This Table contains concordance information (TP, FP, FN) between a truth sample within the callset and the sample’s truth data, grouped by bins of a metric (RF or VQSR scores).
- Parameters:
model_id (str) – RF or VQSR model ID for which to return score data.
truth_sample (str) – Which truth sample concordance to analyze (e.g., “NA12878” or “syndip”).
test (bool) – Whether to use a tmp path for variant QC tests.
- Return type:
VersionedTableResource
- Returns:
Path to binned truth data concordance Hail Table.
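A rough sketch of what concordance counts grouped by bin look like, written in plain Python. It assumes each variant record carries a bin, a flag for presence in the truth data, and a flag for being called in the callset copy of the sample. The record layout and the function name are hypothetical; the actual Table is produced with Hail.

```python
from collections import defaultdict


def binned_concordance(variants):
    """Count TP, FP, and FN per bin.

    variants: iterable of (bin, in_truth, in_callset) tuples.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for bin_id, in_truth, in_callset in variants:
        if in_truth and in_callset:
            counts[bin_id]["TP"] += 1  # called and confirmed by truth data
        elif in_callset:
            counts[bin_id]["FP"] += 1  # called but absent from truth data
        elif in_truth:
            counts[bin_id]["FN"] += 1  # in truth data but not called
    return {b: counts[b] for b in sorted(counts)}


records = [
    (1, True, True),
    (1, False, True),
    (2, True, False),
    (1, True, True),
]
print(binned_concordance(records))
# {1: {'TP': 2, 'FP': 1, 'FN': 0}, 2: {'TP': 0, 'FP': 0, 'FN': 1}}
```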
- gnomad_qc.v4.resources.variant_qc.get_rf_run_path(version='4.0', test=False)
Return the path to the JSON file containing the RF runs list.
- Parameters:
version (str) – Version of RF path to return.
test (bool) – Whether to return the test RF runs list.
- Return type:
str
- Returns:
Path to the JSON file.
- gnomad_qc.v4.resources.variant_qc.get_rf_model_path(model_id, version='4.0', test=False)
Get the path to the RF model for a given run.
- Parameters:
model_id (str) – RF run to load.
version (str) – Version of model path to return.
test (bool) – Whether to use a tmp path for variant QC tests.
- Return type:
str
- Returns:
Path to the RF model.
- gnomad_qc.v4.resources.variant_qc.get_rf_training(model_id, test=False)
Get the training data for a given run.
- Parameters:
model_id (str) – RF run to load.
test (bool) – Whether to use a tmp path for variant QC tests.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource for RF training data.
- gnomad_qc.v4.resources.variant_qc.get_variant_qc_result(model_id, test=False, split=True)
Get the results of variant QC filtering for a given run.
Note
These Tables are only available for exomes data. Use the v3 resources for the genomes data.
- Parameters:
model_id (str) – Model ID of variant QC run to load. Must start with ‘rf_’, ‘vqsr_’, or ‘if_’.
test (bool) – Whether to use a tmp path for variant QC tests.
split (bool) – Whether to return the split or unsplit variant QC result.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource for variant QC results.
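The prefix requirement on model_id can be checked before calling the function. The sketch below is a hypothetical pre-check, not the validation performed inside gnomad_qc; the function name and the example model IDs are made up for illustration.

```python
# Prefixes taken from the model_id description above.
VALID_PREFIXES = ("rf_", "vqsr_", "if_")


def model_type(model_id: str) -> str:
    """Hypothetical helper: return the model type implied by the ID prefix."""
    for prefix in VALID_PREFIXES:
        if model_id.startswith(prefix):
            return prefix.rstrip("_")
    raise ValueError(
        f"model_id must start with one of {VALID_PREFIXES}, got {model_id!r}"
    )


# Example IDs are invented for illustration.
print(model_type("rf_example"))    # rf
print(model_type("vqsr_example"))  # vqsr
```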
- gnomad_qc.v4.resources.variant_qc.final_filter(data_type='exomes', test=False)
Get finalized variant QC filtering Table.
- Parameters:
data_type (str) – Whether to return ‘exomes’ or ‘genomes’ data. Default is exomes.
test (bool) – Whether to use a tmp path for variant QC tests.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource for final variant QC data.