gnomad.utils.sparse_mt

gnomad.utils.sparse_mt.compute_last_ref_block_end(mt)

Compute the genomic position of the most upstream reference block overlapping each row on a sparse MT.

gnomad.utils.sparse_mt.densify_sites(mt, ...)

Create a dense version of the input sparse MT at the sites in sites_ht reading the minimal amount of data required.

gnomad.utils.sparse_mt.get_as_info_expr(mt)

Return an allele-specific annotation Struct containing typical VCF INFO fields from GVCF INFO fields stored in the MT entries.

gnomad.utils.sparse_mt.get_site_info_expr(mt)

Create a site-level annotation Struct aggregating typical VCF INFO fields from GVCF INFO fields stored in the MT entries.

gnomad.utils.sparse_mt.default_compute_info(mt)

Compute a HT with the typical GATK allele-specific (AS) info fields as well as ACs and lowqual fields.

gnomad.utils.sparse_mt.split_info_annotation(...)

Split multi-allelic allele-specific info fields.

gnomad.utils.sparse_mt.split_lowqual_annotation(...)

Split multi-allelic low QUAL annotation.

gnomad.utils.sparse_mt.impute_sex_ploidy(mt)

Impute sex ploidy from a sparse MatrixTable.

gnomad.utils.sparse_mt.densify_all_reference_sites(...)

Densify a VariantDataset or Sparse MatrixTable at all sites in a reference Table.

gnomad.utils.sparse_mt.compute_stats_per_ref_site(...)

Compute stats per site in a reference Table.

gnomad.utils.sparse_mt.compute_coverage_stats(...)

Compute coverage statistics for every base of the reference_ht provided.

gnomad.utils.sparse_mt.get_allele_number_agg_func([...])

Get a transformation and aggregation function for computing the allele number.

gnomad.utils.sparse_mt.compute_allele_number_per_ref_site(...)

Compute the allele number per reference site.

gnomad.utils.sparse_mt.filter_ref_blocks(t)

Filter ref blocks out of the Table or MatrixTable.

gnomad.utils.sparse_mt.compute_last_ref_block_end(mt)

Compute the genomic position of the most upstream reference block overlapping each row on a sparse MT.

Note that since reference blocks do not extend beyond contig boundaries, only the position is kept.

This function returns a Table with that annotation (last_END_position).

Parameters:

mt (MatrixTable) – Input MatrixTable

Return type:

Table

Returns:

Output Table with last_END_position annotation

gnomad.utils.sparse_mt.densify_sites(mt, sites_ht, last_END_positions_ht, semi_join_rows=True)

Create a dense version of the input sparse MT at the sites in sites_ht reading the minimal amount of data required.

Note that only rows that appear both in mt and sites_ht are returned.
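As a rough illustration of what densifying at a set of sites means, the sketch below fills in one sample's entry at each requested position from the reference block that covers it. It is pure Python and does not use Hail or the gnomAD implementation. The dict layout (position mapped to an entry, with an inclusive END on reference blocks) and the helper name densify_sample_at_sites are assumptions made up for this example, and the row filtering described above is ignored.

```python
import bisect
from typing import Dict, List, Optional


def densify_sample_at_sites(
    sparse: Dict[int, dict], sites: List[int]
) -> Dict[int, Optional[dict]]:
    """Fill in one sample's entry at each requested site (toy model, one contig)."""
    starts = sorted(sparse)
    dense: Dict[int, Optional[dict]] = {}
    for site in sites:
        # Index of the last sparse entry that starts at or before the site.
        i = bisect.bisect_right(starts, site) - 1
        entry = sparse[starts[i]] if i >= 0 else None
        if entry is not None and starts[i] != site:
            # An upstream entry only applies if it is a ref block whose
            # inclusive END reaches the site; otherwise the site is missing.
            if entry.get("END", -1) < site:
                entry = None
        dense[site] = entry
    return dense


# Hypothetical sparse data for a single sample.
sparse_entries = {
    100: {"GT": "0/0", "DP": 30, "END": 199},  # ref block covering 100-199
    200: {"GT": "0/1", "DP": 28},              # variant call, no END
    250: {"GT": "0/0", "DP": 12, "END": 260},  # ref block covering 250-260
}
dense_entries = densify_sample_at_sites(sparse_entries, [150, 200, 230, 255])
```

In this toy data, position 230 stays missing because the closest upstream entry is a variant call rather than a reference block.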

Parameters:
  • mt (MatrixTable) – Input sparse MT

  • sites_ht (Table) – Desired sites to densify

  • last_END_positions_ht (Table) – Table storing positions of the furthest ref block (END tag)

  • semi_join_rows (bool) – Whether to filter the MT rows based on semi-join (default, better if sites_ht is large) or based on filter_intervals (better if sites_ht only contains a few sites)

Return type:

MatrixTable

Returns:

Dense MT filtered to the sites in sites_ht

gnomad.utils.sparse_mt.get_as_info_expr(mt, sum_agg_fields=['QUALapprox'], int32_sum_agg_fields=['VarDP'], median_agg_fields=['ReadPosRankSum', 'MQRankSum'], array_sum_agg_fields=['SB', 'RAW_MQandDP'], alt_alleles_range_array_field='alt_alleles_range_array', treat_fields_as_allele_specific=False, retain_cdfs=False, cdf_k=200)

Return an allele-specific annotation Struct containing typical VCF INFO fields from GVCF INFO fields stored in the MT entries.

Note

  • If SB is specified in array_sum_agg_fields, it will be aggregated as AS_SB_TABLE, according to GATK standard nomenclature.

  • If RAW_MQandDP is specified in array_sum_agg_fields, it will be used for the MQ calculation and then dropped according to GATK recommendation.

  • If RAW_MQ and MQ_DP are given, they will be used for the MQ calculation and then dropped according to GATK recommendation.

  • If the fields to be aggregated (sum_agg_fields, int32_sum_agg_fields, median_agg_fields) are passed as a list of str, they should correspond to entry fields in mt or in mt.gvcf_info.

  • Priority is given to entry fields in mt over those in mt.gvcf_info in case of a name clash.

  • If treat_fields_as_allele_specific is False, it’s expected that there is a single value for each entry field to be aggregated. Then when performing the aggregation per global alternate allele, that value is included in the aggregation if the global allele is present in the entry’s list of local alleles. If treat_fields_as_allele_specific is True, it’s expected that each entry field to be aggregated has one value per local allele, and each of those is mapped to a global allele for aggregation.
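The two modes described in the last note can be sketched in plain Python. This is only an illustration of the idea, not the Hail expression the function builds: the entry layout (a local_alleles list of global allele indices with the reference at index 0, plus either a single value or a values list aligned with local_alleles), the helper names, and the choice of sum as the aggregation are all assumptions made for the example.

```python
from typing import Dict, List


def aggregate_quasi_as(entries: List[dict], n_alt: int) -> Dict[int, float]:
    """treat_fields_as_allele_specific=False: one value per entry, counted
    for every global alternate allele found in the entry's local alleles."""
    totals = {alt: 0.0 for alt in range(1, n_alt + 1)}
    for entry in entries:
        for alt in totals:
            if alt in entry["local_alleles"]:
                totals[alt] += entry["value"]
    return totals


def aggregate_as(entries: List[dict], n_alt: int) -> Dict[int, float]:
    """treat_fields_as_allele_specific=True: one value per local allele,
    each mapped to its global allele before aggregating."""
    totals = {alt: 0.0 for alt in range(1, n_alt + 1)}
    for entry in entries:
        for local_idx, global_idx in enumerate(entry["local_alleles"]):
            if global_idx in totals:  # skips the reference allele (index 0)
                totals[global_idx] += entry["values"][local_idx]
    return totals


# Hypothetical site with two global alternate alleles (indices 1 and 2).
single_value_entries = [
    {"local_alleles": [0, 1], "value": 10.0},
    {"local_alleles": [0, 2], "value": 4.0},
    {"local_alleles": [0, 1, 2], "value": 6.0},
]
per_allele_entries = [
    {"local_alleles": [0, 1], "values": [0.0, 10.0]},
    {"local_alleles": [0, 2], "values": [0.0, 4.0]},
    {"local_alleles": [0, 1, 2], "values": [0.0, 5.0, 1.0]},
]
quasi_totals = aggregate_quasi_as(single_value_entries, n_alt=2)
as_totals = aggregate_as(per_allele_entries, n_alt=2)
```

In the first mode the third entry contributes its full value to both alternate alleles, while in the second mode its value is split according to the per-allele numbers it carries.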

Parameters:
  • mt (MatrixTable) – Input Matrix Table

  • sum_agg_fields (Union[List[str], Dict[str, NumericExpression]]) – Fields to aggregate using sum.

  • int32_sum_agg_fields (Union[List[str], Dict[str, NumericExpression]]) – Fields to aggregate using an int32 sum.

  • median_agg_fields (Union[List[str], Dict[str, NumericExpression]]) – Fields to aggregate using (approximate) median.

  • array_sum_agg_fields (Union[List[str], Dict[str, ArrayNumericExpression]]) – Fields to aggregate using array sum.

  • alt_alleles_range_array_field (str) – Annotation containing an array of the range of alternate alleles e.g., hl.range(1, hl.len(mt.alleles))

  • treat_fields_as_allele_specific (bool) – Treat info fields as allele-specific. Defaults to False.

  • retain_cdfs (bool) – If True, retains the cumulative distribution functions (CDFs) as an annotation for median_agg_fields. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage. Default is False.

  • cdf_k (int) – Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time. Default is 200.

Return type:

StructExpression

Returns:

Expression containing the AS info fields

gnomad.utils.sparse_mt.get_site_info_expr(mt, sum_agg_fields=['QUALapprox'], int32_sum_agg_fields=['VarDP'], median_agg_fields=['ReadPosRankSum', 'MQRankSum'], array_sum_agg_fields=['SB', 'RAW_MQandDP'], retain_cdfs=False, cdf_k=200)

Create a site-level annotation Struct aggregating typical VCF INFO fields from GVCF INFO fields stored in the MT entries.

Note

  • If RAW_MQandDP is specified in array_sum_agg_fields, it will be used for the MQ calculation and then dropped according to GATK recommendation.

  • If RAW_MQ and MQ_DP are given, they will be used for the MQ calculation and then dropped according to GATK recommendation.

  • If the fields to be aggregated (sum_agg_fields, int32_sum_agg_fields, median_agg_fields) are passed as a list of str, they should correspond to entry fields in mt or in mt.gvcf_info.

  • Priority is given to entry fields in mt over those in mt.gvcf_info in case of a name clash.
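The MQ calculation mentioned in the notes above can be sketched as follows, assuming the GATK convention that RAW_MQandDP holds two numbers per sample: the sum of squared mapping qualities and the number of reads that contributed to it. The helper name and the sample values are made up for this example, and the real function works on Hail expressions rather than Python lists.

```python
import math
from typing import List, Optional, Tuple


def mq_from_raw_mq_and_dp(per_sample: List[Tuple[int, int]]) -> Optional[float]:
    """Root-mean-square mapping quality from per-sample RAW_MQandDP pairs."""
    # Element-wise array sum across samples, as an array-sum aggregation would do.
    sum_squared_mq = sum(raw_mq for raw_mq, _ in per_sample)
    total_dp = sum(dp for _, dp in per_sample)
    if total_dp == 0:
        return None  # no reads, so MQ is undefined
    return math.sqrt(sum_squared_mq / total_dp)


# Two hypothetical samples whose reads all have mapping quality 60:
# 10 reads -> 10 * 60**2 = 36000, 15 reads -> 15 * 60**2 = 54000.
site_mq = mq_from_raw_mq_and_dp([(36000, 10), (54000, 15)])
```

Because every read in the toy data has mapping quality 60, the site-level MQ comes out as 60 as well.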

Parameters:
  • mt (MatrixTable) – Input Matrix Table

  • sum_agg_fields (Union[List[str], Dict[str, NumericExpression]]) – Fields to aggregate using sum.

  • int32_sum_agg_fields (Union[List[str], Dict[str, NumericExpression]]) – Fields to aggregate using an int32 sum.

  • median_agg_fields (Union[List[str], Dict[str, NumericExpression]]) – Fields to aggregate using (approximate) median.

  • retain_cdfs (bool) – If True, retains the cumulative distribution functions (CDFs) as an annotation for median_agg_fields. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage. Default is False.

  • cdf_k (int) – Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time. Default is 200.

  • array_sum_agg_fields (Union[List[str], Dict[str, ArrayNumericExpression]]) – Fields to aggregate using array sum.

Return type:

StructExpression

Returns:

Expression containing the site-level info fields

gnomad.utils.sparse_mt.default_compute_info(mt, site_annotations=False, as_annotations=False, quasi_as_annotations=True, n_partitions=5000, lowqual_indel_phred_het_prior=40, ac_filter_groups=None, retain_cdfs=False, cdf_k=200)

Compute a HT with the typical GATK allele-specific (AS) info fields as well as ACs and lowqual fields.

Note

  • This table doesn’t split multi-allelic sites.

  • At least one of site_annotations, as_annotations or quasi_as_annotations must be True.

Parameters:
  • mt (MatrixTable) – Input MatrixTable. Note that this table should be filtered to nonref sites.

  • site_annotations (bool) – Whether to generate site level info fields. Default is False.

  • as_annotations (bool) – Whether to generate allele-specific info fields using allele-specific annotations in gvcf_info. Default is False.

  • quasi_as_annotations (bool) – Whether to generate allele-specific info fields using non-allele-specific annotations in gvcf_info, but performing per allele aggregations. This method can be used in cases where genotype data doesn’t contain allele-specific annotations to approximate allele-specific annotations. Default is True.

  • n_partitions (Optional[int]) – Optional number of desired partitions for output Table. If specified, naive_coalesce is performed. Default is 5000.

  • lowqual_indel_phred_het_prior (int) – Phred-scaled prior for a het genotype at a site with a low quality indel. Default is 40. We use 1/10k bases (phred=40) to be more consistent with the filtering used by Broad’s Data Sciences Platform for VQSR.

  • ac_filter_groups (Optional[Dict[str, Expression]]) – Optional dictionary of sample filter expressions to compute additional groupings of ACs. Default is None.

  • retain_cdfs (bool) – If True, retains the cumulative distribution functions (CDFs) as an annotation for median_agg_fields. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage. Default is False.

  • cdf_k (int) – Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time. Default is 200.
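The correspondence between the lowqual_indel_phred_het_prior default of 40 and a rate of 1 in 10,000 bases follows from the standard Phred scale. The small helper below is only a worked check of that arithmetic; its name is made up for this example and it is not part of the module.

```python
def phred_to_probability(phred: float) -> float:
    """Convert a Phred-scaled value Q to a probability: p = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# Default indel prior from the signature above.
indel_het_prior = phred_to_probability(40)
# Equivalent "1 in N bases" figure.
bases_per_event = round(1 / indel_het_prior)
```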

Returns:

Table with info fields

Return type:

Table

gnomad.utils.sparse_mt.split_info_annotation(info_expr, a_index)

Split multi-allelic allele-specific info fields.

Parameters:
  • info_expr (StructExpression) – Field containing info struct.

  • a_index (Int32Expression) – Allele index. Output by hl.split_multi or hl.split_multi_hts.

Return type:

StructExpression

Returns:

Info struct with split annotations.
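A minimal sketch of the indexing involved, in plain Python rather than Hail expressions. It relies on the a_index produced by hl.split_multi being 1-based over alternate alleles, so per-alternate-allele arrays are indexed with a_index - 1. Selecting fields by an "AS_" prefix and the helper name split_as_fields are assumptions for illustration; the real function may treat some fields differently.

```python
from typing import Any, Dict


def split_as_fields(info: Dict[str, Any], a_index: int) -> Dict[str, Any]:
    """Pick the value for one alternate allele from per-allele arrays."""
    split = {}
    for name, value in info.items():
        if name.startswith("AS_") and isinstance(value, list):
            # a_index is 1 for the first alternate allele, hence the offset.
            split[name] = value[a_index - 1]
        else:
            # Site-level fields are carried over unchanged.
            split[name] = value
    return split


# Hypothetical info struct for a site with two alternate alleles.
multi_allelic_info = {"AS_QD": [12.5, 3.1], "AS_MQ": [60.0, 42.0], "QD": 9.8}
second_allele_info = split_as_fields(multi_allelic_info, a_index=2)
```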

gnomad.utils.sparse_mt.split_lowqual_annotation(lowqual_expr, a_index)

Split multi-allelic low QUAL annotation.

Parameters:
  • lowqual_expr (ArrayExpression) – Field containing low QUAL annotation.

  • a_index (Int32Expression) – Allele index. Output by hl.split_multi or hl.split_multi_hts.

Return type:

BooleanExpression

Returns:

Low QUAL expression for particular allele.

gnomad.utils.sparse_mt.impute_sex_ploidy(mt, excluded_calling_intervals=None, included_calling_intervals=None, normalization_contig='chr20', chr_x=None, chr_y=None, use_only_variants=False)

Impute sex ploidy from a sparse MatrixTable.

Sex ploidy is imputed by normalizing the coverage of chromosomes X and Y using the coverage of an autosomal chromosome (by default chr20).

Coverage is computed using the median block coverage (summed over the block size) and the non-ref coverage at non-ref genotypes. If the use_only_variants argument is set to True, the mean coverage defined by only the variants is used instead.
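The normalization step can be sketched with plain numbers. The example assumes the normalization contig is diploid, so half of its mean coverage corresponds to one chromosome copy. The helper name, the output keys, and the coverage values are made up for this illustration, and the sketch skips how the per-contig mean coverage itself is obtained.

```python
from typing import Dict


def ploidy_from_mean_coverage(
    mean_dp: Dict[str, float],
    normalization_contig: str = "chr20",
    chr_x: str = "chrX",
    chr_y: str = "chrY",
) -> Dict[str, float]:
    """Estimate X and Y ploidy by normalizing to an autosomal contig."""
    # Assumption: the autosome is diploid, so one copy is half its coverage.
    coverage_per_copy = mean_dp[normalization_contig] / 2
    return {
        "x_ploidy": mean_dp[chr_x] / coverage_per_copy,
        "y_ploidy": mean_dp[chr_y] / coverage_per_copy,
    }


# Hypothetical per-contig mean coverage for one sample.
sample_ploidy = ploidy_from_mean_coverage(
    {"chr20": 30.0, "chrX": 15.3, "chrY": 14.1}
)
```

With these toy values both estimates land close to 1, which is what one would expect for a sample with one X and one Y chromosome.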

Parameters:
  • mt (MatrixTable) – Input sparse Matrix Table

  • excluded_calling_intervals (Optional[Table]) – Optional table of intervals to exclude from the computation. Used only when determining contig size (not used when computing chromosome depth) when use_only_variants is False.

  • included_calling_intervals (Optional[Table]) – Optional table of intervals to use in the computation. Used only when determining contig size (not used when computing chromosome depth) when use_only_variants is False.

  • normalization_contig (str) – Which chromosome to normalize by

  • chr_x (Optional[str]) – Optional X Chromosome contig name (by default uses the X contig in the reference)

  • chr_y (Optional[str]) – Optional Y Chromosome contig name (by default uses the Y contig in the reference)

  • use_only_variants (bool) – Whether to use depth of variant data within calling intervals instead of reference data. Default is False, in which case only reference data is used.

Return type:

Table

Returns:

Table with mean coverage over the normalization contig (chr20 by default), X and Y, and sex chromosome ploidy based on normalized coverage.

gnomad.utils.sparse_mt.densify_all_reference_sites(mtds, reference_ht, interval_ht=None, row_key_fields=('locus',), entry_keep_fields=('GT',))

Densify a VariantDataset or Sparse MatrixTable at all sites in a reference Table.

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – Input sparse MatrixTable or VariantDataset.

  • reference_ht (Table) – Table of reference sites.

  • interval_ht (Optional[Table]) – Optional Table of intervals to filter to.

  • row_key_fields (Union[Tuple[str], List[str], Set[str]]) – Fields to use as row key. Defaults to locus.

  • entry_keep_fields (Union[Tuple[str], List[str], Set[str]]) – Fields to keep in entries before performing the densification. Defaults to GT.

Return type:

MatrixTable

Returns:

Densified MatrixTable.

gnomad.utils.sparse_mt.compute_stats_per_ref_site(mtds, reference_ht, entry_agg_funcs, row_key_fields=('locus',), interval_ht=None, entry_keep_fields=None, row_keep_fields=None, entry_agg_group_membership=None, strata_expr=None, group_membership_ht=None, sex_karyotype_field=None)

Compute stats per site in a reference Table.

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – Input sparse Matrix Table or VariantDataset.

  • reference_ht (Table) – Table of reference sites.

  • entry_agg_funcs (Dict[str, Tuple[Callable, Callable]]) – Dict of entry aggregation functions to perform on the VariantDataset/MatrixTable. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.

  • row_key_fields (Union[Tuple[str], List[str]]) – Fields to use as row key. Defaults to locus.

  • interval_ht (Optional[Table]) – Optional table of intervals to filter to.

  • entry_keep_fields (Union[Tuple[str], List[str], Set[str]]) – Fields to keep in entries before performing the densification in densify_all_reference_sites. Should include any fields needed for the functions in entry_agg_funcs. By default, only GT or LGT is kept.

  • row_keep_fields (Union[Tuple[str], List[str], Set[str]]) – Fields to keep in rows after performing the stats aggregation. By default, only the row key fields are kept.

  • entry_agg_group_membership (Optional[Dict[str, List[dict[str, str]]]]) – Optional dict indicating the subset of group strata in ‘freq_meta’ to use the entry aggregation functions on. The keys of the dict can be any of the keys in entry_agg_funcs and the values are lists of dicts. Each dict in the list contains the strata in ‘freq_meta’ to use for the corresponding entry aggregation function. If provided, ‘freq_meta’ must be present in group_membership_ht and represent the same strata as those in ‘group_membership’. If not provided, all entries of the ‘group_membership’ annotation will have the entry aggregation functions applied to them.

  • strata_expr (Optional[List[Dict[str, StringExpression]]]) – Optional list of dicts of expressions to stratify by.

  • group_membership_ht (Optional[Table]) – Optional Table of group membership annotations.

  • sex_karyotype_field (Optional[str]) – Optional field to use to adjust genotypes for sex karyotype before stats aggregation. If provided, the field must be present in the columns of mtds (variant_data MT if mtds is a VDS) and use “XX” and “XY” as values. If not provided, no sex karyotype adjustment is performed. Default is None.

Return type:

Table

Returns:

Table of stats per site.

gnomad.utils.sparse_mt.compute_coverage_stats(mtds, reference_ht, interval_ht=None, coverage_over_x_bins=[1, 5, 10, 15, 20, 25, 30, 50, 100], row_key_fields=['locus'], strata_expr=None, group_membership_ht=None)

Compute coverage statistics for every base of the reference_ht provided.

The following coverage stats are calculated:
  • mean

  • median

  • total DP

  • fraction of samples with coverage above X, for each X in coverage_over_x_bins
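For a single locus, the statistics listed above can be sketched in plain Python from a list of per-sample depths. The output key names, the helper name, and the use of "at least X" as the threshold are assumptions made for this illustration; the exact comparison and field names used by the function are not stated here, and the sketch uses an exact median.

```python
from statistics import mean, median
from typing import Dict, List


def coverage_stats_at_locus(
    depths: List[int], coverage_over_x_bins: List[int]
) -> Dict[str, float]:
    """Compute mean, median, total DP and per-bin sample fractions."""
    n_samples = len(depths)
    stats: Dict[str, float] = {
        "mean": mean(depths),
        "median": median(depths),
        "total_DP": sum(depths),
    }
    for x in coverage_over_x_bins:
        # Assumption: a sample counts for bin x when its depth is >= x.
        stats[f"over_{x}"] = sum(dp >= x for dp in depths) / n_samples
    return stats


# Hypothetical depths for five samples at one locus.
locus_stats = coverage_stats_at_locus([0, 8, 12, 30, 50], [1, 10, 30])
```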

The reference_ht is a Table that contains a row for each locus at which coverage should be computed. It needs to be keyed by locus. The reference_ht can, for example, be created using get_reference_ht.

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – Input sparse MT or VDS

  • reference_ht (Table) – Input reference HT

  • interval_ht (Optional[Table]) – Optional Table containing intervals to filter to

  • coverage_over_x_bins (List[int]) – List of boundaries for computing samples over X

  • row_key_fields (List[str]) – List of row key fields to use for joining mtds with reference_ht

  • strata_expr (Optional[List[Dict[str, StringExpression]]]) – Optional list of dicts containing expressions to stratify the coverage stats by. Only one of group_membership_ht or strata_expr can be specified.

  • group_membership_ht (Optional[Table]) – Optional Table containing group membership annotations to stratify the coverage stats by. Only one of group_membership_ht or strata_expr can be specified.

Return type:

Table

Returns:

Table with per-base coverage stats.

gnomad.utils.sparse_mt.get_allele_number_agg_func(gt_field='GT')

Get a transformation and aggregation function for computing the allele number.

Can be used as an entry aggregation function in compute_stats_per_ref_site.
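The (transform, aggregate) pair can be pictured with a plain-Python analogue. The sketch assumes that allele number is the sum of the ploidy of called genotypes and represents a call as a tuple of allele indices, with None for a missing call. Both choices, and the helper name, are assumptions for illustration rather than the Hail functions that are actually returned.

```python
from typing import Callable, Iterable, Optional, Tuple


def toy_allele_number_funcs(gt_field: str = "GT") -> Tuple[Callable, Callable]:
    """Return a (transform, aggregate) pair in the shape entry_agg_funcs expects."""

    def transform(entry: dict) -> Optional[int]:
        # Ploidy of the call, or None when the genotype is missing.
        call = entry.get(gt_field)
        return len(call) if call is not None else None

    def aggregate(values: Iterable[Optional[int]]) -> int:
        # Sum over samples, ignoring missing values.
        return sum(v for v in values if v is not None)

    return transform, aggregate


transform, aggregate = toy_allele_number_funcs()
# Same dict shape as the entry_agg_funcs parameter: name -> (transform, aggregate).
toy_entry_agg_funcs = {"AN": (transform, aggregate)}

# Hypothetical entries at one site: two diploid calls, one missing, one haploid.
site_entries = [{"GT": (0, 1)}, {"GT": (0, 0)}, {"GT": None}, {"GT": (1,)}]
allele_number = aggregate(transform(entry) for entry in site_entries)
```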

Parameters:

gt_field (str) – Genotype field to use for computing the allele number.

Return type:

Tuple[Callable, Callable]

Returns:

Tuple of functions to transform and aggregate the allele number.

gnomad.utils.sparse_mt.compute_allele_number_per_ref_site(mtds, reference_ht, **kwargs)

Compute the allele number per reference site.

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – Input sparse Matrix Table or VariantDataset.

  • reference_ht (Table) – Table of reference sites.

  • kwargs – Keyword arguments to pass to compute_stats_per_ref_site.

Return type:

Table

Returns:

Table of allele number per reference site.

gnomad.utils.sparse_mt.filter_ref_blocks(t)

Filter ref blocks out of the Table or MatrixTable.
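One way to picture this filter, under the assumption that a reference-block row is one that lists only the reference allele, is the plain-Python sketch below. The row layout and the helper name are made up for this example, and the criterion is an assumption rather than a statement of how the function is implemented.

```python
from typing import List


def drop_ref_block_rows(rows: List[dict]) -> List[dict]:
    """Keep only rows that carry at least one alternate allele."""
    # Assumption: ref-block rows have a single (reference) allele.
    return [row for row in rows if len(row["alleles"]) > 1]


# Hypothetical rows: two reference blocks and one variant site.
toy_rows = [
    {"position": 100, "alleles": ["A"]},
    {"position": 200, "alleles": ["A", "T"]},
    {"position": 201, "alleles": ["G"]},
]
variant_rows = drop_ref_block_rows(toy_rows)
```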

Parameters:

t (Union[MatrixTable, Table]) – Input MT/HT

Return type:

Union[MatrixTable, Table]

Returns:

MT/HT with ref blocks removed