Skip to content

LD score regression


On this page


Introduction

LDSC is one of the most commonly used command-line tools to estimate inflation, heritability, genetic correlation and cell/tissue type specificity from GWAS summary statistics.


LD: Linkage disequilibrium

Linkage disequilibrium (LD) : non-random association of alleles at different loci in a given population. (Wiki)


LD score

LD score \(l_j\) for a SNP \(j\) is defined as the sum of \(r^2\) for the SNP and other SNPs in a region.

\[ l_j= \Sigma_k{r^2_{j,k}} \]

LD score regression

Key idea: A variant will have higher test statistics if it is in LD with causal variant, and the elevation is proportional to the correlation ( \(r^2\) ) with the causal variant.

\[ E[\chi^2|l_j] = {{Nh^2l_j}\over{M}} + Na + 1 \]
  • \(N\): sample size.
  • \(M\) : number of SNPs.
  • \(h^2\) : observed-scale heritability
  • \(a\) : the effect of confounding factors, including cryptic relatedness and population stratification.

LD Score Regression Example

For more details of LD score regression, please refer to : - Bulik-Sullivan, Brendan K., et al. "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies." Nature genetics 47.3 (2015): 291-295.


Install LDSC

LDSC can be downloaded from github (GPL-3.0 license): https://github.com/bulik/ldsc

For ldsc, we need Anaconda to create virtual environment (for python2). If you haven't installed Anaconda, please check how to install anaconda.

# change to your directory for tools
cd ~/tools

# clone the ldsc github repository
git clone https://github.com/bulik/ldsc.git

# create a virtual environment for ldsc (python2)
cd ldsc
conda env create --file environment.yml  

# activate ldsc environment
conda activate ldsc

Data preparation

Required data and tools

  • LDSC — clone repo, Python 2 conda env (Install LDSC); Anaconda/Miniconda if needed.
  • Reference files — EAS weights, frequency/plink reference, and partitioned baseline LD scores are on Zenodo (S-LDSC reference files, DOI 10.5281/zenodo.10515792). Google Cloud with gsutil and a billing-enabled project is still needed for EAS genome-wide LD scores (eas_ldscores) and LDSC-SEG cell-type scores (not included in that Zenodo deposit).
  • GWAS summary statistics — this tutorial uses BBJ HDL-C / LDL-C (Jenger); you can substitute your own sumstats after munging (Munge sumstats).

In this tutorial, we will use sample summary statistics for HDLC and LDLC from Jenger. - Kanai, Masahiro, et al. "Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases." Nature genetics 50.3 (2018): 390-400.

The Miami plot for the two traits:

image


Download sample summary statistics

# HDL-c and LDL-c in Biobank Japan
wget -O BBJ_LDLC.txt.gz http://jenger.riken.jp/61analysisresult_qtl_download/
wget -O BBJ_HDLC.txt.gz http://jenger.riken.jp/47analysisresult_qtl_download/

Download reference files

Precomputed LD-score resources for S-LDSC (standard LD scores, baseline and baseline-LD models, weights, 1000 Genomes Phase 3 plink/frequency files, HapMap3 SNP lists, example sumstats, and GRCh38-related bundles) are distributed on Zenodo as S-LDSC reference files (record 10515792; DOI 10.5281/zenodo.10515792; Steven Gazal; CC BY 4.0). See the record description for the full file list (EUR and EAS), readme_baseline_versions.txt, and the 107 independent GWAS sumstats archive sumstats_indep107.tgz.

For this tutorial (EAS BBJ traits), download and unpack the EAS archives from Zenodo, plus the HapMap3 allele list used by munge_sumstats.py (see note below). EAS genome-wide LD scores for univariate --h2 / --rg and LDSC-SEG Cahoy scores are not in record 10515792; those still come from the Broad requester-pays bucket via gsutil.

# Zenodo: S-LDSC reference files (v4)
ZENODO=https://zenodo.org/records/10515792/files

cd ~/tools/ldsc
mkdir -p resource
cd resource

# EAS weights, plink reference, and baseline LD scores (partitioned heritability)
wget -c "${ZENODO}/1000G_Phase3_EAS_weights_hm3_no_MHC.tgz?download=1"
wget -c "${ZENODO}/1000G_Phase3_EAS_plinkfiles.tgz?download=1"
wget -c "${ZENODO}/1000G_Phase3_EAS_baseline_v1.2_ldscores.tgz?download=1"
tar -xzf 1000G_Phase3_EAS_weights_hm3_no_MHC.tgz
tar -xzf 1000G_Phase3_EAS_plinkfiles.tgz
# Baseline archive unpacks baseline.* files at the top level; keep the layout expected by partitioned LDSC below
mkdir -p 1000G_Phase3_EAS_baseline_v1_2_ldscores
tar -xzf 1000G_Phase3_EAS_baseline_v1.2_ldscores.tgz -C 1000G_Phase3_EAS_baseline_v1_2_ldscores

# Optional on the same record: EAS baselineLD v2.2 (newer baseline-LD annotation set)
# wget -c "${ZENODO}/1000G_Phase3_EAS_baselineLD_v2.2_ldscores.tgz?download=1"
# tar -xzf 1000G_Phase3_EAS_baselineLD_v2.2_ldscores.tgz

# HapMap3 SNP list with reference alleles for --merge-alleles (required by munge_sumstats below).
# Zenodo 10515792 provides hm3_no_MHC.list.txt (rsIDs only); for allele alignment use w_hm3.snplist, e.g.:
wget -c "https://zenodo.org/records/7773502/files/w_hm3.snplist.gz?download=1"
gunzip -f w_hm3.snplist.gz

# EAS LD scores for standard --h2 / --rg (Google Cloud, requester-pays — not on Zenodo 10515792)
gsutil -u <project_name> cp gs://broad-alkesgroup-public-requester-pays/LDSCORE/eas_ldscores.tar.bz2 .
tar -xjf eas_ldscores.tar.bz2

# Cell-type LDSC-SEG scores (Google Cloud, requester-pays)
gsutil -u <project_name> cp gs://broad-alkesgroup-public-requester-pays/LDSCORE/LDSC_SEG_ldscores/Cahoy_EAS_1000Gv3_ldscores.tar.gz .
tar -xzf Cahoy_EAS_1000Gv3_ldscores.tar.gz

EUR analyses can take 1000G_Phase3_ldscores.tgz, 1000G_Phase3_weights_hm3_no_MHC.tgz, 1000G_Phase3_plinkfiles.tgz, and the EUR baseline / baselineLD archives from the same Zenodo record instead of the EAS filenames above.


Munge sumstats

Before the analysis, we need to format and clean the raw sumstats.

Note

Rsid is used here. If the sumstats only contained id like CHR:POS:REF:ALT, annotate it first.

snplist=~/tools/ldsc/resource/w_hm3.snplist
munge_sumstats.py \
    --sumstats BBJ_HDLC.txt.gz \
    --merge-alleles $snplist \
    --a1 ALT \
    --a2 REF \
    --chunksize 500000 \
    --out BBJ_HDLC
munge_sumstats.py \
  --sumstats BBJ_LDLC.txt.gz \
    --a1 ALT \
  --a2 REF \
  --chunksize 500000 \
  --merge-alleles $snplist \
  --out BBJ_LDLC

After munging, you will get two munged and formatted files:

BBJ_HDLC.sumstats.gz
BBJ_LDLC.sumstats.gz
And these are the files we will use to run LD score regression.


LD score regression

Univariate LD score regression is utilized to estimate heritability and confounding factors (cryptic relatedness and population stratification) of a certain trait.

  • Reference : Bulik-Sullivan, Brendan K., et al. "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies." Nature genetics 47.3 (2015): 291-295.

Using the munged sumstats, we can run:

ldsc.py \
  --h2 BBJ_HDLC.sumstats.gz \
  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \
  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \
  --out BBJ_HDLC

ldsc.py \
  --h2 BBJ_LDLC.sumstats.gz \
  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \
  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \
  --out BBJ_LDLC

Let's check the results for HDLC:

cat BBJ_HDLC.log
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 BBJ_HDLC.sumstats.gz \
--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \
--out BBJ_HDLC \
--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ 

Beginning analysis at Sat Dec 24 20:40:34 2022
Reading summary statistics from BBJ_HDLC.sumstats.gz ...
Read summary statistics for 1020377 SNPs.
Reading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 1208050 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)
Read regression weight LD Scores for 1208050 SNPs.
After merging with reference panel LD, 1012040 SNPs remain.
After merging with regression SNP LD, 1012040 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.1583 (0.0281)
Lambda GC: 1.1523
Mean Chi^2: 1.2843
Intercept: 1.0563 (0.0114)
Ratio: 0.1981 (0.0402)
Analysis finished at Sat Dec 24 20:40:41 2022
Total time elapsed: 6.57s

We can see that from the log:

  • Observed scale h2 = 0.1583
  • lambda GC = 1.1523
  • intercept = 1.0563
  • Ratio = 0.1981

According to LDSC documents, Ratio measures the proportion of the inflation in the mean chi^2 that the LD Score regression intercept ascribes to causes other than polygenic heritability. The value of ratio should be close to zero, though in practice values of 10-20% are not uncommon.

\[ Ratio = {{intercept-1}\over{mean(\chi^2)-1}} \]

Distribution of h2 and intercept across traits in UKB

The Neale Lab estimated SNP heritability using LDSC across more than 4,000 primary GWAS in UKB. You can check the distributions of SNP heritability and intercept estimates using the following link to get the idea of what you can expect from LD score regresion:

https://nealelab.github.io/UKBB_ldsc/viz_h2.html


Cross-trait LD score regression

Cross-trait LD score regression is employed to estimate the genetic correlation between a pair of traits.

Key idea: replace \chi^2 in univariate LD score regression and the relationship (SNPs with high LD ) still holds.

\[ E[z_{1j}z_{2j}] = {{\sqrt{N_1N_2}\rho_g}\over{M}}l_j + {{\rho N_s}\over{\sqrt{N_1N_2}}} \]
  • \(z_ij\) : z score of trait i for SNP j
  • \(N_i\) : sample size of trait i
  • \(\rho\) : phenotypic correlation
  • \(\rho_g\) : genetic covariance
  • \(l_j\) : LD score
  • \(M\) : number of SNPs

Then we can get the genetic correlation by :

\[ r_g = {{\rho_g}\over{\sqrt{h_1^2h_2^2}}} \]
  • Reference: Bulik-Sullivan, Brendan, et al. "An atlas of genetic correlations across human diseases and traits." Nature genetics 47.11 (2015): 1236-1241.

ldsc.py \
  --rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \
  --ref-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \
  --w-ld-chr ~/tools/ldsc/resource/eas_ldscores/ \
  --out BBJ_HDLC_LDLC
Let's check the results:

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--ref-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ \
--out BBJ_HDLC_LDLC \
--rg BBJ_HDLC.sumstats.gz,BBJ_LDLC.sumstats.gz \
--w-ld-chr /home/he/tools/ldsc/resource/eas_ldscores/ 

Beginning analysis at Thu Dec 29 21:02:37 2022
Reading summary statistics from BBJ_HDLC.sumstats.gz ...
Read summary statistics for 1020377 SNPs.
Reading reference panel LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 1208050 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from /home/he/tools/ldsc/resource/eas_ldscores/[1-22] ... (ldscore_fromlist)
Read regression weight LD Scores for 1208050 SNPs.
After merging with reference panel LD, 1012040 SNPs remain.
After merging with regression SNP LD, 1012040 SNPs remain.
Computing rg for phenotype 2/2
Reading summary statistics from BBJ_LDLC.sumstats.gz ...
Read summary statistics for 1217311 SNPs.
After merging with summary statistics, 1012040 SNPs remain.
1012040 SNPs with valid alleles.

Heritability of phenotype 1
---------------------------
Total Observed scale h2: 0.1054 (0.0383)
Lambda GC: 1.1523
Mean Chi^2: 1.2843
Intercept: 1.1234 (0.0607)
Ratio: 0.4342 (0.2134)

Heritability of phenotype 2/2
-----------------------------
Total Observed scale h2: 0.0543 (0.0211)
Lambda GC: 1.0833
Mean Chi^2: 1.1465
Intercept: 1.0583 (0.0335)
Ratio: 0.398 (0.2286)

Genetic Covariance
------------------
Total Observed scale gencov: 0.0121 (0.0106)
Mean z1*z2: -0.001
Intercept: -0.0198 (0.0121)

Genetic Correlation
-------------------
Genetic Correlation: 0.1601 (0.1821)
Z-score: 0.8794
P: 0.3792


Summary of Genetic Correlation Results
p1                    p2      rg      se       z       p  h2_obs  h2_obs_se  h2_int  h2_int_se  gcov_int  gcov_int_se
BBJ_HDLC.sumstats.gz  BBJ_LDLC.sumstats.gz  0.1601  0.1821  0.8794  0.3792  0.0543     0.0211  1.0583     0.0335   -0.0198       0.0121

Analysis finished at Thu Dec 29 21:02:47 2022
Total time elapsed: 10.39s

Partitioned LD regression

Partitioned LD regression is utilized to evaluate the contribution of each functional group to the total SNP heritability.

\[ E[\chi^2] = N \sum\limits_C \tau_C l(j,C) + Na + 1 \]
  • \(C\) : functional categories
  • \(l(j,C)\) : LD score for SNP j with respect to C.
  • \(\tau_C\) : per-SNP contribution of category C to heritability

  • Reference: Finucane, Hilary K., et al. "Partitioning heritability by functional annotation using genome-wide association summary statistics." Nature genetics 47.11 (2015): 1228-1235.

ldsc.py \
  --h2 BBJ_HDLC.sumstats.gz \
  --overlap-annot \
  --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \
  --frqfile-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_plinkfiles/1000G.EAS.QC. \
  --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \
  --out BBJ_HDLC_baseline

Celltype specificity LD regression

LDSC-SEG : LD score regression applied to specifically expressed genes

An extension of Partitioned LD regression. Categories are defined by tissue or cell-type specific genes.

  • Reference: Finucane, Hilary K., et al. "Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types." Nature genetics 50.4 (2018): 621-629.
ldsc.py \
  --h2-cts BBJ_HDLC.sumstats.gz \
  --ref-ld-chr-cts ~/tools/ldsc/resource/Cahoy_EAS_1000Gv3_ldscores/Cahoy.EAS.ldcts \
  --ref-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_baseline_v1_2_ldscores/baseline. \
  --w-ld-chr ~/tools/ldsc/resource/1000G_Phase3_EAS_weights_hm3_no_MHC/weights.EAS.hm3_noMHC. \
  --out BBJ_HDLC_baseline_cts

References

  • Gazal, S. S-LDSC reference files (1000 Genomes Phase 3 LD scores, baseline/baseline-LD models, weights, plink/frequency files, HapMap3 lists, example sumstats). Zenodo (2024). https://doi.org/10.5281/zenodo.10515792
  • Bulik-Sullivan, Brendan K., et al. "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies." Nature genetics 47.3 (2015): 291-295.
  • Bulik-Sullivan, Brendan, et al. "An atlas of genetic correlations across human diseases and traits." Nature genetics 47.11 (2015): 1236-1241.
  • Finucane, Hilary K., et al. "Partitioning heritability by functional annotation using genome-wide association summary statistics." Nature genetics 47.11 (2015): 1228-1235.
  • Finucane, Hilary K., et al. "Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types." Nature genetics 50.4 (2018): 621-629.
  • Kanai, Masahiro, et al. "Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases." Nature genetics 50.3 (2018): 390-400.