TWAS
Background
Most variants identified in GWAS are located in regulatory regions, and these genetic variants could potentially affect complex traits through gene expression.
However, due to the limitation of samples and high cost, it is difficult to measure gene expression at a large scale. Consequently, many expression-trait associations have not been detected, especially for those with small effect sizes.
To address these issues, alternative approaches have been proposed and transcriptome-wide association study (TWAS) has become a common and easy-to-perform approach to identify genes whose expression is significantly associated with complex traits in individuals without directly measured expression levels.
GWAS and TWAS
Definition
TWAS is a method to identify significant expression-trait associations using expression imputation from genetic data or summary statistics.
- Individual-level TWAS: uses individual-level genotype and phenotype for expression prediction and association test (e.g. PrediXcan)
- Summary-level TWAS: uses only GWAS summary statistics for expression prediction and association test (e.g. FUSION, S-PrediXcan)
Individual-level and summary-level TWAS
FUSION
In this tutorial, we will introduce FUSION, which is one of the most commonly used tools for performing transcriptome-wide association studies (TWAS) using summary-level data.
url : http://gusevlab.org/projects/fusion/
FUSION trains predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. (http://gusevlab.org/projects/fusion/)
Quote
Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., ... & Pasaniuc, B. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3), 245-252.
Algorithm for imputing expression into GWAS summary statistics
ImpG-Summary algorithm was extended to impute the Z scores for the cis genetic component of expression.
FUSION statistical model
\(Z\) : a vector of standardized effect sizes (z scores) of SNPs for the target trait at a given locus
We impute the Z score of the expression and trait as a linear combination of elements of \(Z\) with weights \(W\).
-
\(\Sigma_{e,s}\) : covariance matrix between all SNPs and gene expression
-
\(\Sigma_{s,s}\) : covariance among all SNPs (LD)
Both \(\Sigma_{e,s}\) and \(\Sigma_{s,s}\) are estimated from reference datsets.
The variance of \(WZ\) (imputed z score of expression and trait)
The imputation Z score can be obtained by:
ImpG-Summary algorithm
Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., ... & Price, A. L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906-2914.
Installation
Download FUSION from github and install
wget https://github.com/gusevlab/fusion_twas/archive/master.zip
unzip master.zip
cd fusion_twas-master
Download and unzip the LD reference data (1000 genome)
wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2
tar xjvf LDREF.tar.bz2
Download and unzip plink2R
wget https://github.com/gabraham/plink2R/archive/master.zip
unzip master.zip
Install R packages
# R >= 4.0
R
install.packages(c('optparse','RColorBrewer'))
install.packages('plink2R-master/plink2R/',repos=NULL)
Example
FUSION framework
Input:
- GWAS summary statistics (in LDSC format)
- pre-computed gene expression weights (from http://gusevlab.org/projects/fusion/)
Input GWAS sumstats fromat
- SNP (rsID)
- A1 (effect allele)
- A2 (non-effect allele)
- Z (Z score)
Example:
SNP A1 A2 N CHISQ Z
rs6671356 C T 70100.0 0.172612905312 0.415467092935
rs6604968 G A 70100.0 0.291125788806 0.539560736902
rs4970405 A G 70100.0 0.102204513891 0.319694407037
rs12726255 G A 70100.0 0.312418295691 0.558943911042
rs4970409 G A 70100.0 0.0524226849517 0.228960007319
Get sample sumstats and weights
wget https://data.broadinstitute.org/alkesgroup/FUSION/SUM/PGC2.SCZ.sumstats
mkdir WEIGHTS
cd WEIGHTS
wget https://data.broadinstitute.org/alkesgroup/FUSION/WGT/GTEx.Whole_Blood.tar.bz2
tar xjf GTEx.Whole_Blood.tar.bz2
WEIGHTS
files in each WEIGHTS folder
RDat weight files for each gene in a tissue type
GTEx.Whole_Blood.ENSG00000002549.8.LAP3.wgt.RDat GTEx.Whole_Blood.ENSG00000166394.10.CYB5R2.wgt.RDat
GTEx.Whole_Blood.ENSG00000002822.11.MAD1L1.wgt.RDat GTEx.Whole_Blood.ENSG00000166435.11.XRRA1.wgt.RDat
GTEx.Whole_Blood.ENSG00000002919.10.SNX11.wgt.RDat GTEx.Whole_Blood.ENSG00000166436.11.TRIM66.wgt.RDat
GTEx.Whole_Blood.ENSG00000002933.3.TMEM176A.wgt.RDat GTEx.Whole_Blood.ENSG00000166444.13.ST5.wgt.RDat
GTEx.Whole_Blood.ENSG00000003137.4.CYP26B1.wgt.RDat GTEx.Whole_Blood.ENSG00000166471.6.TMEM41B.wgt.RDat
...
Expression imputation
Rscript FUSION.assoc_test.R \
--sumstats PGC2.SCZ.sumstats \
--weights ./WEIGHTS/GTEx.Whole_Blood.pos \
--weights_dir ./WEIGHTS/ \
--ref_ld_chr ./LDREF/1000G.EUR. \
--chr 22 \
--out PGC2.SCZ.22.dat
Results
head PGC2.SCZ.22.dat
PANEL FILE ID CHR P0 P1 HSQ BEST.GWAS.ID BEST.GWAS.Z EQTL.ID EQTL.R2 EQTL.Z EQTL.GWAS.Z NSNP NWGT MODEL MODELCV.R2 MODELCV.PV TWAS.Z TWAS.P
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000273311.1.DGCR11.wgt.RDat DGCR11 22 19033675 19035888 0.0551 rs2238767 -2.98 rs2283641 0.013728 4.33 2.5818 408 1 top1 0.014 0.018 2.5818 9.83e-03
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000100075.5.SLC25A1.wgt.RDat SLC25A1 22 19163095 19166343 0.0740 rs2238767 -2.98 rs762523 0.080367 5.36 -1.8211 406 1 top1 0.08 7.2e-08 -1.8216.86e-02
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000070371.11.CLTCL1.wgt.RDat CLTCL1 22 19166986 19279239 0.1620 rs4819843 3.04 rs809901 0.072193 5.53 -1.9928 456 19 enet 0.085 2.8e-08 -1.8806.00e-02
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000232926.1.AC000078.5.wgt.RDat AC000078.5 22 19874812 19875493 0.2226 rs5748555 -3.15 rs13057784 0.052796 5.60 -0.1652 514 44 enet 0.099 2e-09 0.0524 9.58e-01
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000185252.13.ZNF74.wgt.RDat ZNF74 22 20748405 20762745 0.1120 rs595272 4.09 rs1005640 0.001422 3.44 -1.3677 301 8 enet 0.008 0.054 -0.8550 3.93e-01
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000099940.7.SNAP29.wgt.RDat SNAP29 22 21213771 21245506 0.1286 rs595272 4.09 rs4820575 0.061763 5.94 -1.1978 416 27 enet 0.079 9.4e-08 -1.0354 3.00e-01
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000272600.1.AC007308.7.wgt.RDat AC007308.7 22 21243494 21245502 0.2076 rs595272 4.09 rs165783 0.100625 6.79 -0.8871 408 12 lasso 0.16 5.4e-1-1.2049 2.28e-01
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000183773.11.AIFM3.wgt.RDat AIFM3 22 21319396 21335649 0.0676 rs595272 4.09 rs565979 0.036672 4.50 -0.4474 362 1 top1 0.037 0.00024 -0.4474 6.55e-01
NA ../WEIGHTS//GTEx.Whole_Blood/GTEx.Whole_Blood.ENSG00000230513.1.THAP7-AS1.wgt.RDat THAP7-AS1 22 21356175 21357118 0.2382 rs595272 4.09 rs2239961 0.105307 -7.04 -0.3783 347 5 lasso 0.15 7.6e-1 0.2292 8.19e-01
Descriptions of the output (cited from http://gusevlab.org/projects/fusion/ )
Colume number | Column header | Value | Usage |
---|---|---|---|
1 | FILE | … | Full path to the reference weight file used |
2 | ID | FAM109B | Feature/gene identifier, taken from --weights file |
3 | CHR | 22 | Chromosome |
4 | P0 | 42470255 | Gene start (from --weights) |
5 | P1 | 42475445 | Gene end (from --weights) |
6 | HSQ | 0.0447 | Heritability of the gene |
7 | BEST.GWAS.ID | rs1023500 | rsID of the most significant GWAS SNP in locus |
8 | BEST.GWAS.Z | -5.94 | Z-score of the most significant GWAS SNP in locus |
9 | EQTL.ID | rs5758566 | rsID of the best eQTL in the locus |
10 | EQTL.R2 | 0.058680 | cross-validation R2 of the best eQTL in the locus |
11 | EQTL.Z | -5.16 | Z-score of the best eQTL in the locus |
12 | EQTL.GWAS.Z | -5.0835 | GWAS Z-score for this eQTL |
13 | NSNP | 327 | Number of SNPs in the locus |
14 | MODEL | lasso | Best performing model |
15 | MODELCV.R2 | 0.058870 | cross-validation R2 of the best performing model |
16 | MODELCV.PV | 3.94e-06 | cross-validation P-value of the best performing model |
17 | TWAS.Z | 5.1100 | TWAS Z-score (our primary statistic of interest) |
18 | TWAS.P | 3.22e-07 | TWAS P-value |
Limitations
-
Significant loci identified in TWAS also contain multiple tarit-associated genes. GWAS often identifies multiple variants in LD. Similarly, TWAS frequently identifies multiple genes in a locus.
-
Co-regulation may cause false positive results. Just like SNPs are correlated due to LD, gene expressions are often correlated due to co-regulation.
-
Sometimes even when co-regulation is not captured, the shared variants (or variants in strong LD) in different expression prediction models may cause false positive results.
-
Predicted expression account for only a limited portion of total gene expression. Total expression is affected not only by genetic components like cis-eQTL but also by other factors like environmental and technical components.
-
Other factors. For example, the window size for selecting variants may affect association results.
Criticism
TWAS aims to test the relationship of the phenotype with the genetic component of the gene expression. But under current framework, TWAS only test the relationship of the phenotype with the predicted gene expression without accounting for the uncertainty in that prediction. The key point here is that the current framework omits the fact that the gene expression data is also the result of a sampling process from the analysis.
"Consequently, the test of association between that predicted genetic component and a phenotype reduces to merely a (weighted) test of joint association of the SNPs with the phenotype, which means that they cannot be used to infer a genetic relationship between gene expression and the phenotype on a population level."
Quote
de Leeuw, C., Werme, J., Savage, J. E., Peyrot, W. J., & Posthuma, D. (2021). On the interpretation of transcriptome-wide association studies. bioRxiv, 2021-08.
Reference
- (FUSION) Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., ... & Pasaniuc, B. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3), 245-252.
- (review of TWAS) Wainberg, M., Sinnott-Armstrong, N., Mancuso, N., Barbeira, A. N., Knowles, D. A., Golan, D., ... & Kundaje, A. (2019). Opportunities and challenges for transcriptome-wide association studies. Nature genetics, 51(4), 592-599.
- (PrediXcan) Gamazon, E. R., Wheeler, H. E., Shah, K. P., Mozaffari, S. V., Aquino-Michaels, K., Carroll, R. J., ... & Im, H. K. (2015). A gene-based association method for mapping traits using reference transcriptome data. Nature genetics, 47(9), 1091-1098.
- (S-PrediXcan) Barbeira, A. N., Dickinson, S. P., Bonazzola, R., Zheng, J., Wheeler, H. E., Torres, J. M., ... & Genome Browser Data Integration & Visualization—EBI Flicek Paul 108 Juettemann Thomas 108 Ruffier Magali 108 Sheppard Dan 108 Taylor Kieron 108 Trevanion Stephen J. 108 Zerbino Daniel R. 108. (2018). Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nature communications, 9(1), 1825.
- (Criticism of current TWAS framework) de Leeuw, C., Werme, J., Savage, J. E., Peyrot, W. J., & Posthuma, D. (2021). On the interpretation of transcriptome-wide association studies. bioRxiv, 2021-08