HDL PRS example

This example runs our PRS pipeline for HDL on UK Biobank samples. We use two sets of effect estimates: the original estimates from the MVP lipid traits analysis, and the posterior effects generated by the mashr package.

Data used

Reference panel

Obtained via download_1000G() in bigsnpr.

The panel includes 503 (mostly unrelated) European individuals and ~1.7M SNPs in common with either HapMap3 or the UK Biobank. The classification of European populations can be found at IGSR, and the European individual IDs are taken from the IGSR data portal.

GWAS summary statistics data

The summary statistics come from MVP. We have the original GWAS summary data as well as multivariate posterior estimates of HDL effects computed with mashr. In brief, we have two versions of summary statistics (effect estimates) for HDL.

Target test data: UK Biobank

We randomly select 2000 UK Biobank individuals with covariates and an HDL phenotype (medication adjusted, inverse normalized), and extract their genotypes. See the UKB.QC.* PLINK file bundle.

PRS Models

The auto model runs the algorithm for 30 different values of $p$ (the proportion of causal variants), ranging from 10e-4 to 0.9, with the heritability $h^2$ estimated by LD score regression as the initial value.

The grid model tries a grid of parameters: $p$, ranging from 0 to 1, and three values of $h^2$, namely 0.7, 1 and 1.4 times the initial $h^2$ estimated by LD score regression.
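As a rough sketch of the two parameter sets described above, the snippet below generates 30 values of $p$ between the stated endpoints and the three $h^2$ values used by the grid model. The log spacing, the output file names and the value h2_init=0.3 are assumptions for illustration only; in the pipeline the initial $h^2$ comes from LD score regression. The lower endpoint 10e-4 is read literally here, i.e. as 0.001.

```shell
set -eo pipefail

h2_init=0.3   # hypothetical value; in practice estimated by LD score regression

# 30 log-spaced values of p between the endpoints quoted in the text
# (log spacing is an assumption, the text does not state the spacing)
awk -v lo=10e-4 -v hi=0.9 -v n=30 'BEGIN {
    step = (log(hi) - log(lo)) / (n - 1)
    for (i = 0; i < n; i++) printf "%.6g\n", exp(log(lo) + i * step)
}' > p_auto.txt

# Three h2 values for the grid: 0.7, 1 and 1.4 times the initial estimate
awk -v h2="$h2_init" 'BEGIN {
    n = split("0.7 1 1.4", mult, " ")
    for (i = 1; i <= n; i++) printf "%.4g\n", h2 * mult[i]
}' > h2_grid.txt
```

Each line of p_auto.txt is one candidate value of $p$, and h2_grid.txt holds the three heritability values that are crossed with the $p$ grid.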

Test genotype data preparation

We use awk to select columns from the phenotype file and save them to the traits file UKB.hdl.cov and the covariates file UKB.ind.cov. To merge all bed, bim and fam files, we use the following command:

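# List the per-chromosome bed/bim/fam files, one fileset per line
for i in {1..22}; do echo ukb_cal_chr${i}_v2.bed ukb_snp_chr${i}_v2.bim ukb708_cal_chr${i}_v2_s488374.fam; done > all_files.txt
# Merge all chromosomes into a single PLINK fileset
plink --merge-list all_files.txt --make-bed --out ukbb_merged --threads 30 --memory 100000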
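The awk column selection mentioned above could look roughly like the following. The phenotype file name pheno.txt, its column names and the helper select_cols are hypothetical, since the layout of the real phenotype file is not given here; only the output names UKB.hdl.cov and UKB.ind.cov come from the text. The sketch works in a scratch directory on toy data so that no real files are overwritten.

```shell
set -eo pipefail
workdir=$(mktemp -d)   # scratch directory, so real files are not overwritten
cd "$workdir"

# Toy phenotype table (hypothetical file name and column names)
printf 'FID IID sex age hdl\n1 1 0 55 0.31\n2 2 1 61 -1.20\n3 3 1 48 0.87\n' > pheno.txt

# Print the comma-separated columns in $1, looked up by name in the header of file $2
select_cols() {
    awk -v cols="$1" 'NR == 1 {
        n = split(cols, want, ",")
        for (i = 1; i <= NF; i++) idx[$i] = i
        for (j = 1; j <= n; j++)
            if (!(want[j] in idx)) {
                print "missing column: " want[j] > "/dev/stderr"
                exit 1
            }
    }
    {
        line = $(idx[want[1]])
        for (j = 2; j <= n; j++) line = line " " $(idx[want[j]])
        print line
    }' "$2"
}

select_cols FID,IID,hdl     pheno.txt > UKB.hdl.cov   # traits file
select_cols FID,IID,sex,age pheno.txt > UKB.ind.cov   # covariates file
```

Selecting columns by header name rather than by position keeps the command valid if the column order of the phenotype file changes.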

We only use the selected UKB samples as target data, so we extract this subset from the merged genotypes.
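One way this extraction could be done is with PLINK's --keep option, sketched below. The sketch assumes that the traits file has a header row followed by FID and IID in its first two columns, and that the output prefix is UKB.QC; neither is confirmed by the text. The file keep_ids.txt and the environment variable MERGED_PREFIX are hypothetical names. Toy data is used, and PLINK is only called when it and the merged fileset are actually available.

```shell
set -eo pipefail
workdir=$(mktemp -d)   # scratch directory for the toy example
cd "$workdir"

# Toy traits file (assumed layout: header row, then FID IID phenotype)
printf 'FID IID hdl\n1 1 0.31\n2 2 -1.20\n3 3 0.87\n' > UKB.hdl.cov

# PLINK --keep reads one "FID IID" pair per line, so the header is dropped
awk 'NR > 1 { print $1, $2 }' UKB.hdl.cov > keep_ids.txt

# Prefix of the merged fileset; override with MERGED_PREFIX (hypothetical variable)
merged_prefix=${MERGED_PREFIX:-ukbb_merged}

# Run the extraction only where PLINK and the merged fileset are present
if command -v plink > /dev/null 2>&1 && [ -f "${merged_prefix}.bed" ]; then
    plink --bfile "$merged_prefix" --keep keep_ids.txt --make-bed --out UKB.QC
else
    echo "plink or ${merged_prefix}.* not found; wrote keep_ids.txt only"
fi
```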

At this point files on the disk should be:

Analysis of MVP GWAS data

Step 1: QC on reference panel

We assume that QC has already been performed on the target data, so here we only run QC on the reference panel.

Step 2: Intersect SNPs among summary stats, reference panel and target data
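To illustrate the idea of this step, the sketch below intersects three lists of SNP identifiers with sort and comm. The file names and the toy rsIDs are hypothetical, and matching on rsID alone is an assumption made to keep the example short; a real pipeline may also match on chromosome and position.

```shell
set -eo pipefail
export LC_ALL=C        # same collation for sort and comm
workdir=$(mktemp -d)   # scratch directory for the toy example
cd "$workdir"

# Toy rsID lists standing in for the three sources (hypothetical file names)
printf 'rs1\nrs2\nrs3\nrs5\n' > sumstats.snps
printf 'rs2\nrs3\nrs4\nrs5\n' > reference.snps
printf 'rs3\nrs5\nrs6\n'      > target.snps

# comm needs sorted input; -12 keeps only the lines common to both files
sort -u sumstats.snps  > a.sorted
sort -u reference.snps > b.sorted
sort -u target.snps    > c.sorted
comm -12 a.sorted b.sorted | comm -12 - c.sorted > shared.snps

cat shared.snps
```

Only rs3 and rs5 appear in all three toy lists, so these are the two identifiers written to shared.snps.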

Step 3: Harmonize alleles for shared SNPs

This step handles major/minor allele swaps and strand flips, and the resulting sign flips in the summary statistics.
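The logic of this step can be sketched on a toy table as follows. bigsnpr provides snp_match() for this kind of matching; the awk script below is not that function, only a minimal illustration of the rules under stated assumptions. The input layout (SNP, summary statistics effect allele, other allele, effect size, reference allele 1, reference allele 0) and the file names are hypothetical. Strand-ambiguous A/T and C/G SNPs are dropped here because a strand flip cannot be told apart from an allele swap.

```shell
set -eo pipefail
workdir=$(mktemp -d)   # scratch directory for the toy example
cd "$workdir"

# Toy input: SNP, sumstats effect allele, other allele, beta, reference a1, reference a0
cat > shared.txt <<'EOF'
rs1 A G 0.10 A G
rs2 A G 0.20 G A
rs3 A G 0.30 T C
rs4 A G 0.40 C T
rs5 A T 0.50 A T
rs6 A G 0.60 A C
EOF

awk 'BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
{
    s1 = $2; s0 = $3; beta = $4; r1 = $5; r0 = $6
    # A/T and C/G SNPs look the same on both strands, so they are dropped
    if (comp[s1] == s0) next
    if      (s1 == r1 && s0 == r0)             sign =  1   # same alleles
    else if (s1 == r0 && s0 == r1)             sign = -1   # alleles swapped
    else if (comp[s1] == r1 && comp[s0] == r0) sign =  1   # strand flip
    else if (comp[s1] == r0 && comp[s0] == r1) sign = -1   # strand flip and swap
    else next                                              # alleles do not match
    # Report the SNP on the reference alleles with the sign-adjusted effect
    printf "%s %s %s %.4f\n", $1, r1, r0, sign * beta
}' shared.txt > harmonized.txt

cat harmonized.txt
```

In the toy data rs2 and rs4 have their effect sign flipped, rs3 is kept unchanged after a strand flip, and rs5 (ambiguous) and rs6 (mismatched alleles) are removed.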

Step 4: Calculate LD matrix and fit LDSC model
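For orientation, the standard LD score regression model that underlies the $h^2$ estimate is recalled below. The symbols $N$ (GWAS sample size), $M$ (number of variants), $\ell_j$ (LD score of variant $j$), $r_{jk}$ (correlation between variants $j$ and $k$) and $a$ (confounding term) are introduced here for illustration and do not appear elsewhere in this example.

$$
\mathbb{E}\left[\chi_j^2 \mid \ell_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1,
\qquad
\ell_j = \sum_{k} r_{jk}^2
$$

The LD scores $\ell_j$ are computed from the LD matrix, and the slope of the regression of $\chi_j^2$ on $\ell_j$ gives the $h^2$ estimate that is used as the initial value in the PRS models.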

Step 6: Estimate posterior effect sizes and PRS

For original data,

Step 7: Predict phenotypes

Baseline model: Traits ~ Sex + Age

Inf/grid/auto model: Traits ~ Sex + Age + PRS