gnomad_qc
This is the documentation for the complete set of scripts used to perform sample and variant QC for the gnomAD v4 release. We are continuously updating and improving upon the code to handle new releases as they grow in size and complexity and as they require increasingly sophisticated QC treatment. The current code therefore represents the most recent iteration of our pipelines and is guaranteed to change over time.
The scripts reference gnomAD-related metadata files (not public) and are specifically tailored to Broad-sequenced/processed data, so they may perform procedures that are not strictly necessary for quality control of all germline datasets. For example, the gnomAD v4 dataset comprises both exomes and genomes, and a substantial portion of the code is written to handle technical differences between those call sets, as well as to perform relevant joint analyses (such as inferring cryptically related individuals across exomes and genomes). These steps may not be relevant for all call sets.
We therefore encourage users to browse through the code and identify modules and functions that will be useful in their own pipelines, and to edit and reconfigure the gnomAD pipeline to suit their particular analysis and QC needs.
Note also that many basic functions and file paths used in the code are imported from a separate repo, gnomad_methods.
A more extensive overview and explanation of the gnomAD QC process is available on the gnomAD browser and may help inform users’ design decisions for other pipelines.