Resource Sources

gnomAD data is available through multiple cloud providers’ public datasets programs.

The functions in the gnomad.resources package can be configured to load data from different sources.

If Hail determines that it is running in a cloud provider’s Spark environment, resources default to being read from that provider’s datasets program. For example, resources are read from Azure Open Datasets if Hail determines that it is running on an Azure HDInsight cluster. Otherwise, resources default to being read from Google Cloud Public Datasets. This default can be overridden using the GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE environment variable.
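As a minimal sketch, the environment variable can also be set from Python rather than from the shell. This assumes the variable is read when the gnomad.resources configuration is first used, so it is set before any gnomad import; that ordering is an assumption, not something this page states.

```python
import os

# Set the default source before importing anything from gnomad.resources.
# Assumption: the variable is consulted when the configuration is first used.
# setdefault() leaves any value already exported in the shell untouched.
os.environ.setdefault(
    "GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE", "Google Cloud Public Datasets"
)

print(os.environ["GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE"])
```

Using setdefault rather than plain assignment means a value chosen by the cluster or shell configuration still takes precedence over the script.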

To load resources from a different source (for example, the gnomAD project’s public GCS bucket), use:

from gnomad.resources.config import gnomad_public_resource_configuration, GnomadPublicResourceSource

gnomad_public_resource_configuration.source = GnomadPublicResourceSource.GNOMAD

To see all available public sources for gnomAD resources, use:

from gnomad.resources.config import GnomadPublicResourceSource

list(GnomadPublicResourceSource)

Note

The gnomAD project’s bucket (gs://gnomad-public-requester-pays) is requester pays, meaning that charges for data access and transfer will be billed to your Google Cloud project.

Clusters must be configured to read requester pays buckets during creation. For example,

hailctl dataproc start cluster-name --packages gnomad --requester-pays-allow-buckets gnomad-public-requester-pays
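When cluster creation is scripted, the same command can be assembled in Python. The sketch below only builds and prints the argument list; build_start_command is a hypothetical helper name, and the flags are exactly those shown in the command above.

```python
import shlex


def build_start_command(cluster_name, bucket):
    """Build the hailctl argument list for a requester-pays-enabled cluster.

    Hypothetical helper for illustration; it only mirrors the flags shown
    in the command above and does not run anything.
    """
    return [
        "hailctl", "dataproc", "start", cluster_name,
        "--packages", "gnomad",
        "--requester-pays-allow-buckets", bucket,
    ]


cmd = build_start_command("cluster-name", "gnomad-public-requester-pays")
# shlex.join quotes arguments only where the shell would need it.
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) to actually start the cluster.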

Custom Sources

Instead of using one of the predefined public sources, you can provide a custom source:

from gnomad.resources.config import gnomad_public_resource_configuration

gnomad_public_resource_configuration.source = "gs://my-bucket/gnomad-resources"

Environment Configuration

The default source can be configured through the GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE environment variable. This variable can be set to either the name of one of the public datasets programs or the URL of a custom source.

Examples:

  • GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE="Google Cloud Public Datasets"

  • GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE="gs://my-bucket/gnomad-resources"
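The two kinds of value can be told apart with a simple check. The following is an illustrative sketch only, not the library's implementation: classify_default_source and KNOWN_PROGRAMS are hypothetical names, and the set lists only the two program names mentioned on this page rather than every available source.

```python
import os

ENV_VAR = "GNOMAD_DEFAULT_PUBLIC_RESOURCE_SOURCE"

# Illustrative subset only (assumption). The authoritative list comes from
# list(GnomadPublicResourceSource).
KNOWN_PROGRAMS = {"Google Cloud Public Datasets", "Azure Open Datasets"}


def classify_default_source(environ=os.environ):
    """Return (kind, value) describing the configured default source.

    kind is "unset" when the variable is absent, "program" when the value
    names a known datasets program, and "custom" for anything else, which
    is treated as the URL of a custom source.
    """
    value = environ.get(ENV_VAR)
    if value is None:
        return ("unset", None)
    if value in KNOWN_PROGRAMS:
        return ("program", value)
    return ("custom", value)


print(classify_default_source({ENV_VAR: "Google Cloud Public Datasets"}))
print(classify_default_source({ENV_VAR: "gs://my-bucket/gnomad-resources"}))
```

Passing the environment as a mapping keeps the check easy to exercise without modifying the real process environment.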