Getting Started
This short guide will help you get started with Hail and the gnomAD Python package. Note that this guide was originally written in 2022, and the arguments for creating Hail clusters may be out of date. For example, as of the September 2024 update, the command in step 2 below requires an additional flag (--public-ip-address) if you are using Hail version 0.2.131/Dataproc version 2.2 or later. Please refer to the Hail documentation for the most up-to-date information.
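The version condition in the note above can be sketched in Python. This is a minimal illustration rather than part of Hail or the gnomAD package: the helper names (parse_version, dataproc_start_command) are hypothetical, and appending the flag to the end of the step 2 command is an assumption, so check the Hail documentation before relying on it.

```python
import shlex

# Threshold taken from the note above: Hail 0.2.131 or later.
PUBLIC_IP_MIN_VERSION = (0, 2, 131)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '0.2.131' into (0, 2, 131).

    Anything after a '-' (for example a build suffix) is ignored.
    """
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def dataproc_start_command(cluster_name: str, hail_version: str) -> list[str]:
    """Build the step 2 command, adding --public-ip-address when needed."""
    command = ["hailctl", "dataproc", "start", cluster_name, "--packages", "gnomad"]
    if parse_version(hail_version) >= PUBLIC_IP_MIN_VERSION:
        # Assumption: the extra flag is simply appended to the command.
        command.append("--public-ip-address")
    return command


# Print the command line for a cluster named "cluster-name" on Hail 0.2.131.
print(shlex.join(dataproc_start_command("cluster-name", "0.2.131")))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "0.2.99" would sort after "0.2.131".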
1. Install Hail:
    pip install hail
2. Use hailctl to start a Google Dataproc cluster with the gnomad package installed (see Hail on the Cloud for more detail on hailctl):
    hailctl dataproc start cluster-name --packages gnomad
3. Connect to a Jupyter Notebook on the cluster:
    hailctl dataproc connect cluster-name notebook
4. Import gnomAD data in Hail Table format:
    gnomAD v2.1.1 variants:
        from gnomad.resources.grch37 import gnomad

        gnomad_v2_exomes = gnomad.public_release("exomes")
        exomes_ht = gnomad_v2_exomes.ht()
        exomes_ht.describe()

        gnomad_v2_genomes = gnomad.public_release("genomes")
        genomes_ht = gnomad_v2_genomes.ht()
        genomes_ht.describe()
    gnomAD v3 variants:
        from gnomad.resources.grch38 import gnomad

        gnomad_v3_genomes = gnomad.public_release("genomes")
        ht = gnomad_v3_genomes.ht()
        ht.describe()
5. Shut down the cluster when finished with it:
    hailctl dataproc stop cluster-name
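The import step above repeats the same three calls for each release. If you load several tables, a small wrapper along the following lines may help. It is a sketch only: RELEASE_DATA_TYPES, resource_module_name, and load_release_table are hypothetical names, the mapping lists just the build and data type combinations used in this guide, and the gnomad import is deferred so that the validation helper can run without the package installed.

```python
import importlib

# Data types that the import step in this guide loads for each reference build.
RELEASE_DATA_TYPES = {
    "grch37": ("exomes", "genomes"),  # gnomAD v2.1.1
    "grch38": ("genomes",),  # gnomAD v3
}


def resource_module_name(build: str, data_type: str) -> str:
    """Validate the request and return the gnomad resource module to import."""
    if build not in RELEASE_DATA_TYPES:
        raise ValueError(f"unknown reference build: {build!r}")
    if data_type not in RELEASE_DATA_TYPES[build]:
        raise ValueError(f"{data_type!r} is not used with {build} in this guide")
    return f"gnomad.resources.{build}.gnomad"


def load_release_table(build: str, data_type: str):
    """Return the public release Hail Table, mirroring the import step."""
    # Imported lazily so the helper above works without gnomad installed.
    module = importlib.import_module(resource_module_name(build, data_type))
    return module.public_release(data_type).ht()


# Example use in a notebook on the cluster (requires hail and gnomad):
# exomes_ht = load_release_table("grch37", "exomes")
# exomes_ht.describe()
```

Validating the build and data type before importing means a typo fails with a clear ValueError instead of an import or lookup error on the cluster.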