Finemapping using susieR¶
Data preparation¶
In [1]:
Copied!
import gwaslab as gl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import gwaslab as gl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Load sumstats¶
In [2]:
Copied!
sumstats = gl.Sumstats("./1kgeas.B1.glm.firth.gz",fmt="plink2")
sumstats = gl.Sumstats("./1kgeas.B1.glm.firth.gz",fmt="plink2")
2024/04/18 10:40:48 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/ 2024/04/18 10:40:48 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com 2024/04/18 10:40:48 Start to load format from formatbook.... 2024/04/18 10:40:48 -plink2 format meta info: 2024/04/18 10:40:48 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear 2024/04/18 10:40:48 - format_source : https://www.cog-genomics.org/plink/2.0/formats 2024/04/18 10:40:48 - format_version : Alpha 3.3 final (3 Jun) 2024/04/18 10:40:48 - last_check_date : 20220806 2024/04/18 10:40:48 -plink2 to gwaslab format dictionary: 2024/04/18 10:40:48 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR 2024/04/18 10:40:48 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR 2024/04/18 10:40:48 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz 2024/04/18 10:40:49 -Reading columns : Z_STAT,A1_FREQ,POS,ALT,REF,P,A1,OR,OBS_CT,#CHROM,LOG(OR)_SE,ID 2024/04/18 10:40:49 -Renaming columns to : Z,EAF,POS,ALT,REF,P,EA,OR,N,CHR,SE,SNPID 2024/04/18 10:40:49 -Current Dataframe shape : 1128732 x 12 2024/04/18 10:40:49 -Initiating a status column: STATUS ... 2024/04/18 10:40:49 #WARNING! Version of genomic coordinates is unknown... 2024/04/18 10:40:49 NEA not available: assigning REF to NEA... 2024/04/18 10:40:49 -EA,REF and ALT columns are available: assigning NEA... 2024/04/18 10:40:49 -For variants with EA == ALT : assigning REF to NEA ... 2024/04/18 10:40:49 -For variants with EA != ALT : assigning ALT to NEA ... 2024/04/18 10:40:49 Start to reorder the columns...v3.4.43 2024/04/18 10:40:49 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB 2024/04/18 10:40:49 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT 2024/04/18 10:40:49 Finished reordering the columns. 2024/04/18 10:40:49 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 2024/04/18 10:40:49 -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category 2024/04/18 10:40:49 -Verified: T T T T T T T T T T T T T T 2024/04/18 10:40:50 -Current Dataframe memory usage: 106.06 MB 2024/04/18 10:40:50 Finished loading data successfully!
Data standardization and sanity check¶
In [3]:
Copied!
sumstats.basic_check()
sumstats.basic_check()
2024/04/18 10:40:50 Start to check SNPID/rsID...v3.4.43 2024/04/18 10:40:50 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB 2024/04/18 10:40:50 -Checking SNPID data type... 2024/04/18 10:40:50 -Converting SNPID to pd.string data type... 2024/04/18 10:40:50 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _) 2024/04/18 10:40:51 Finished checking SNPID/rsID. 2024/04/18 10:40:51 Start to fix chromosome notation (CHR)...v3.4.43 2024/04/18 10:40:51 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB 2024/04/18 10:40:51 -Checking CHR data type... 2024/04/18 10:40:51 -Variants with standardized chromosome notation: 1128732 2024/04/18 10:40:51 -All CHR are already fixed... 2024/04/18 10:40:52 Finished fixing chromosome notation (CHR). 2024/04/18 10:40:52 Start to fix basepair positions (POS)...v3.4.43 2024/04/18 10:40:52 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 107.13 MB 2024/04/18 10:40:52 -Converting to Int64 data type ... 2024/04/18 10:40:53 -Position bound:(0 , 250,000,000) 2024/04/18 10:40:53 -Removed outliers: 0 2024/04/18 10:40:53 Finished fixing basepair positions (POS). 2024/04/18 10:40:53 Start to fix alleles (EA and NEA)...v3.4.43 2024/04/18 10:40:53 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB 2024/04/18 10:40:53 -Converted all bases to string datatype and UPPERCASE. 2024/04/18 10:40:53 -Variants with bad EA : 0 2024/04/18 10:40:54 -Variants with bad NEA : 0 2024/04/18 10:40:54 -Variants with NA for EA or NEA: 0 2024/04/18 10:40:54 -Variants with same EA and NEA: 0 2024/04/18 10:40:54 -Detected 0 variants with alleles that contain bases other than A/C/T/G . 2024/04/18 10:40:55 Finished fixing alleles (EA and NEA). 2024/04/18 10:40:55 Start to perform sanity check for statistics...v3.4.43 2024/04/18 10:40:55 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB 2024/04/18 10:40:55 -Comparison tolerance for floats: 1e-07 2024/04/18 10:40:55 -Checking if 0 <= N <= 2147483647 ... 2024/04/18 10:40:55 -Removed 0 variants with bad/na N. 2024/04/18 10:40:55 -Checking if -1e-07 < EAF < 1.0000001 ... 2024/04/18 10:40:55 -Removed 0 variants with bad/na EAF. 2024/04/18 10:40:55 -Checking if -9999.0000001 < Z < 9999.0000001 ... 2024/04/18 10:40:55 -Examples of invalid variants(SNPID): 1:15774:G:A,1:15777:A:G,1:57292:C:T,1:87360:C:T,1:625392:T:C ... 2024/04/18 10:40:55 -Examples of invalid values (Z): NA,NA,NA,NA,NA ... 2024/04/18 10:40:55 -Removed 220793 variants with bad/na Z. 2024/04/18 10:40:55 -Checking if -1e-07 < P < 1.0000001 ... 2024/04/18 10:40:55 -Removed 0 variants with bad/na P. 2024/04/18 10:40:55 -Checking if -1e-07 < SE < inf ... 2024/04/18 10:40:55 -Removed 0 variants with bad/na SE. 2024/04/18 10:40:55 -Checking if -100.0000001 < OR < 100.0000001 ... 2024/04/18 10:40:55 -Removed 0 variants with bad/na OR. 2024/04/18 10:40:55 -Checking STATUS and converting STATUS to categories.... 2024/04/18 10:40:56 -Removed 220793 variants with bad statistics in total. 2024/04/18 10:40:56 -Data types for each column: 2024/04/18 10:40:56 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 2024/04/18 10:40:56 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category 2024/04/18 10:40:56 -Verified: T T T T T T T T T T T T T T 2024/04/18 10:40:56 Finished sanity check for statistics. 2024/04/18 10:40:56 Start to check data consistency across columns...v3.4.43 2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB 2024/04/18 10:40:56 -Tolerance: 0.001 (Relative) and 0.001 (Absolute) 2024/04/18 10:40:56 -No availalbe columns for data consistency checking...Skipping... 2024/04/18 10:40:56 Finished checking data consistency across columns. 2024/04/18 10:40:56 Start to normalize indels...v3.4.43 2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB 2024/04/18 10:40:56 -No available variants to normalize.. 2024/04/18 10:40:56 Finished normalizing variants successfully! 2024/04/18 10:40:56 Start to sort the genome coordinates...v3.4.43 2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB 2024/04/18 10:40:56 Finished sorting coordinates. 2024/04/18 10:40:56 Start to reorder the columns...v3.4.43 2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB 2024/04/18 10:40:56 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT 2024/04/18 10:40:56 Finished reordering the columns.
Note: 220793 variants were removed due to na Z values.This is due to FIRTH_CONVERGE_FAIL when performing GWAS using PLINK2.
Extract lead variants¶
In [4]:
Copied!
sumstats.get_lead()
sumstats.get_lead()
2024/04/18 10:40:56 Start to extract lead variants...v3.4.43 2024/04/18 10:40:56 -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB 2024/04/18 10:40:56 -Processing 907939 variants... 2024/04/18 10:40:56 -Significance threshold : 5e-08 2024/04/18 10:40:56 -Sliding window size: 500 kb 2024/04/18 10:40:56 -Using P for extracting lead variants... 2024/04/18 10:40:56 -Found 43 significant variants in total... 2024/04/18 10:40:56 -Identified 4 lead variants! 2024/04/18 10:40:56 Finished extracting lead variants.
Out[4]:
SNPID | CHR | POS | EA | NEA | EAF | SE | Z | P | OR | N | STATUS | REF | ALT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44298 | 1:167562605:G:A | 1 | 167562605 | A | G | 0.391481 | 0.159645 | 7.69462 | 1.419150e-14 | 3.415780 | 493 | 9960099 | G | A |
91266 | 2:55513738:C:T | 2 | 55513738 | C | T | 0.376008 | 0.153159 | -7.96244 | 1.686760e-15 | 0.295373 | 496 | 9960099 | C | T |
442239 | 7:134368632:T:G | 7 | 134368632 | G | T | 0.138105 | 0.225526 | 6.89025 | 5.569440e-12 | 4.730010 | 496 | 9960099 | T | G |
875859 | 20:42758834:T:C | 20 | 42758834 | T | C | 0.227273 | 0.184323 | -7.76902 | 7.909780e-15 | 0.238829 | 495 | 9960099 | T | C |
Create manhattan plot for checking¶
In [5]:
Copied!
sumstats.plot_mqq()
sumstats.plot_mqq()
2024/04/18 10:40:57 Start to create MQQ plot...v3.4.43: 2024/04/18 10:40:57 -Genomic coordinates version: 99... 2024/04/18 10:40:57 #WARNING! Genomic coordinates version is unknown. 2024/04/18 10:40:57 -Genome-wide significance level to plot is set to 5e-08 ... 2024/04/18 10:40:57 -Raw input contains 907939 variants... 2024/04/18 10:40:57 -MQQ plot layout mode is : mqq 2024/04/18 10:40:57 Finished loading specified columns from the sumstats. 2024/04/18 10:40:57 Start data conversion and sanity check: 2024/04/18 10:40:57 -Removed 0 variants with nan in CHR or POS column ... 2024/04/18 10:40:57 -Removed 0 variants with CHR <=0... 2024/04/18 10:40:57 -Removed 0 variants with nan in P column ... 2024/04/18 10:40:57 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... 2024/04/18 10:40:57 -Sumstats P values are being converted to -log10(P)... 2024/04/18 10:40:57 -Sanity check: 0 na/inf/-inf variants will be removed... 2024/04/18 10:40:57 -Converting data above cut line... 2024/04/18 10:40:57 -Maximum -log10(P) value is 14.772946706439042 . 2024/04/18 10:40:57 Finished data conversion and sanity check. 2024/04/18 10:40:57 Start to create MQQ plot with 907939 variants... 2024/04/18 10:40:58 -Creating background plot... 2024/04/18 10:40:59 Finished creating MQQ plot successfully! 2024/04/18 10:40:59 Start to extract variants for annotation... 2024/04/18 10:40:59 -Found 4 significant variants with a sliding window size of 500 kb... 2024/04/18 10:40:59 Finished extracting variants for annotation... 2024/04/18 10:40:59 Start to process figure arts. 2024/04/18 10:40:59 -Processing X ticks... 2024/04/18 10:40:59 -Processing X labels... 2024/04/18 10:40:59 -Processing Y labels... 2024/04/18 10:40:59 -Processing Y tick lables... 2024/04/18 10:40:59 -Processing Y labels... 2024/04/18 10:40:59 -Processing lines... 2024/04/18 10:40:59 Finished processing figure arts. 2024/04/18 10:40:59 Start to annotate variants... 2024/04/18 10:40:59 -Skip annotating 2024/04/18 10:40:59 Finished annotating variants. 2024/04/18 10:40:59 Start to create QQ plot with 907939 variants: 2024/04/18 10:40:59 -Plotting all variants... 2024/04/18 10:40:59 -Expected range of P: (0,1.0) 2024/04/18 10:40:59 -Lambda GC (MLOG10P mode) at 0.5 is 0.98908 2024/04/18 10:40:59 -Processing Y tick lables... 2024/04/18 10:40:59 Finished creating QQ plot successfully! 2024/04/18 10:40:59 Start to save figure... 2024/04/18 10:40:59 -Skip saving figure! 2024/04/18 10:40:59 Finished saving figure... 2024/04/18 10:40:59 Finished creating plot successfully!
Out[5]:
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fa6ad1132b0>)
Extract the variants around 2:55513738:C:T for finemapping¶
In [6]:
Copied!
locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
2024/04/18 10:41:06 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738 2024/04/18 10:41:06 -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738 2024/04/18 10:41:06 Finished filtering values.
Convert OR to BETA¶
In [7]:
Copied!
locus.fill_data(to_fill=["BETA"])
locus.fill_data(to_fill=["BETA"])
2024/04/18 10:41:06 Start filling data using existing columns...v3.4.43 2024/04/18 10:41:06 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT 2024/04/18 10:41:06 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category 2024/04/18 10:41:06 -Verified: T T T T T T T T T T T T T T 2024/04/18 10:41:06 -Overwrite mode: False 2024/04/18 10:41:06 -Skipping columns: [] 2024/04/18 10:41:06 -Filling columns: ['BETA'] 2024/04/18 10:41:06 - Filling Columns iteratively... 2024/04/18 10:41:06 - Filling BETA value using OR column... 2024/04/18 10:41:06 Finished filling data using existing columns. 2024/04/18 10:41:06 Start to reorder the columns...v3.4.43 2024/04/18 10:41:06 -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB 2024/04/18 10:41:06 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT 2024/04/18 10:41:06 Finished reordering the columns.
In [8]:
Copied!
locus.data
locus.data
Out[8]:
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
91067 | 2:55015281:A:T | 2 | 55015281 | T | A | 0.126263 | -0.048075 | 0.193967 | -0.247856 | 0.804246 | 0.953062 | 495 | 9960099 | A | T |
91068 | 2:55015604:G:A | 2 | 55015604 | A | G | 0.119192 | -0.047357 | 0.195199 | -0.242606 | 0.808311 | 0.953747 | 495 | 9960099 | G | A |
91069 | 2:55015764:G:A | 2 | 55015764 | A | G | 0.339394 | 0.028986 | 0.135064 | 0.214575 | 0.830098 | 1.029410 | 495 | 9960099 | G | A |
91070 | 2:55016143:A:C | 2 | 55016143 | C | A | 0.126263 | 0.004659 | 0.195728 | 0.023784 | 0.981025 | 1.004670 | 495 | 9960099 | A | C |
91071 | 2:55017199:T:C | 2 | 55017199 | C | T | 0.093306 | 0.268767 | 0.219657 | 1.223580 | 0.221112 | 1.308350 | 493 | 9960099 | T | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
91441 | 2:56004219:G:T | 2 | 56004219 | G | T | 0.171717 | 0.148489 | 0.169557 | 0.875763 | 0.381159 | 1.160080 | 495 | 9960099 | G | T |
91442 | 2:56007034:T:C | 2 | 56007034 | T | C | 0.260121 | 0.073325 | 0.145565 | 0.503737 | 0.614446 | 1.076080 | 494 | 9960099 | T | C |
91443 | 2:56008984:C:G | 2 | 56008984 | G | C | 0.013185 | 0.205883 | 0.547226 | 0.376227 | 0.706748 | 1.228610 | 493 | 9960099 | C | G |
91444 | 2:56009480:A:T | 2 | 56009480 | A | T | 0.157258 | 0.135667 | 0.177621 | 0.763784 | 0.444996 | 1.145300 | 496 | 9960099 | A | T |
91445 | 2:56010434:C:T | 2 | 56010434 | T | C | 0.017172 | 0.300305 | 0.491815 | 0.610604 | 0.541462 | 1.350270 | 495 | 9960099 | C | T |
379 rows × 15 columns
Align NEA with reference sequence¶
In [9]:
Copied!
locus.harmonize(basic_check=False, ref_seq="/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta")
locus.harmonize(basic_check=False, ref_seq="/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta")
2024/04/18 10:41:07 Start to check if NEA is aligned with reference sequence...v3.4.43 2024/04/18 10:41:07 -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB 2024/04/18 10:41:07 -Reference genome FASTA file: /home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta 2024/04/18 10:41:07 -Loading fasta records:2 2024/04/18 10:41:19 -Checking records 2024/04/18 10:41:19 -Building numpy fasta records from dict 2024/04/18 10:41:20 -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 ) 2024/04/18 10:41:20 -Checking records for ( len(NEA) > 4 or len(EA) > 4 ) 2024/04/18 10:41:20 -Finished checking records 2024/04/18 10:41:20 -Variants allele on given reference sequence : 264 2024/04/18 10:41:20 -Variants flipped : 115 2024/04/18 10:41:20 -Raw Matching rate : 100.00% 2024/04/18 10:41:20 -Variants inferred reverse_complement : 0 2024/04/18 10:41:20 -Variants inferred reverse_complement_flipped : 0 2024/04/18 10:41:20 -Both allele on genome + unable to distinguish : 0 2024/04/18 10:41:20 -Variants not on given reference sequence : 0 2024/04/18 10:41:20 Finished checking if NEA is aligned with reference sequence. 2024/04/18 10:41:20 Start to adjust statistics based on STATUS code...v3.4.43 2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB 2024/04/18 10:41:20 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.43 2024/04/18 10:41:20 -Flipping 115 variants... 2024/04/18 10:41:20 -Swapping column: NEA <=> EA... 2024/04/18 10:41:20 -Flipping column: BETA = - BETA... 2024/04/18 10:41:20 -Flipping column: Z = - Z... 2024/04/18 10:41:20 -Flipping column: EAF = 1 - EAF... 2024/04/18 10:41:20 -Flipping column: OR = 1 / OR... 2024/04/18 10:41:20 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x 2024/04/18 10:41:20 Finished adjusting statistics based on STATUS code. 2024/04/18 10:41:20 Start to sort the genome coordinates...v3.4.43 2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB 2024/04/18 10:41:20 Finished sorting coordinates. 2024/04/18 10:41:20 Start to reorder the columns...v3.4.43 2024/04/18 10:41:20 -Current Dataframe shape : 379 x 15 ; Memory usage: 0.03 MB 2024/04/18 10:41:20 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT 2024/04/18 10:41:20 Finished reordering the columns.
Out[9]:
<gwaslab.g_Sumstats.Sumstats at 0x7fa6a33a8130>
In [10]:
Copied!
locus.data
locus.data
Out[10]:
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2:55015281:A:T | 2 | 55015281 | T | A | 0.126263 | -0.048075 | 0.193967 | -0.247856 | 0.804246 | 0.953062 | 495 | 9960009 | A | T |
1 | 2:55015604:G:A | 2 | 55015604 | A | G | 0.119192 | -0.047357 | 0.195199 | -0.242606 | 0.808311 | 0.953747 | 495 | 9960009 | G | A |
2 | 2:55015764:G:A | 2 | 55015764 | A | G | 0.339394 | 0.028986 | 0.135064 | 0.214575 | 0.830098 | 1.029410 | 495 | 9960009 | G | A |
3 | 2:55016143:A:C | 2 | 55016143 | C | A | 0.126263 | 0.004659 | 0.195728 | 0.023784 | 0.981025 | 1.004670 | 495 | 9960009 | A | C |
4 | 2:55017199:T:C | 2 | 55017199 | C | T | 0.093306 | 0.268767 | 0.219657 | 1.223580 | 0.221112 | 1.308350 | 493 | 9960009 | T | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
374 | 2:56004219:G:T | 2 | 56004219 | T | G | 0.828283 | -0.148489 | 0.169557 | -0.875763 | 0.381159 | 0.862010 | 495 | 9960019 | G | T |
375 | 2:56007034:T:C | 2 | 56007034 | C | T | 0.739879 | -0.073325 | 0.145565 | -0.503737 | 0.614446 | 0.929299 | 494 | 9960019 | T | C |
376 | 2:56008984:C:G | 2 | 56008984 | G | C | 0.013185 | 0.205883 | 0.547226 | 0.376227 | 0.706748 | 1.228610 | 493 | 9960009 | C | G |
377 | 2:56009480:A:T | 2 | 56009480 | T | A | 0.842742 | -0.135667 | 0.177621 | -0.763784 | 0.444996 | 0.873134 | 496 | 9960019 | A | T |
378 | 2:56010434:C:T | 2 | 56010434 | T | C | 0.017172 | 0.300305 | 0.491815 | 0.610604 | 0.541462 | 1.350270 | 495 | 9960009 | C | T |
379 rows × 15 columns
Output the sumstats of this locus¶
In [11]:
Copied!
locus.data.to_csv("sig_locus.tsv",sep="\t",index=None)
locus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)
locus.data.to_csv("sig_locus.tsv",sep="\t",index=None)
locus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)
Run PLINK to get LD matrix for this locus¶
In [12]:
Copied!
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r square \
--extract sig_locus.snplist \
--out sig_locus_mt
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract sig_locus.snplist \
--out sig_locus_mt_r2
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r square \
--extract sig_locus.snplist \
--out sig_locus_mt
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract sig_locus.snplist \
--out sig_locus_mt_r2
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/ (C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to sig_locus_mt.log. Options in effect: --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing --extract sig_locus.snplist --keep-allele-order --out sig_locus_mt --r square 31934 MB RAM detected; reserving 15967 MB for main workspace. 1235116 variants loaded from .bim file. 504 people (0 males, 0 females, 504 ambiguous) loaded from .fam. Ambiguous sex IDs written to sig_locus_mt.nosex . --extract: 379 variants remaining. Using up to 19 threads (change this with --threads). Before main variant filters, 504 founders and 0 nonfounders present. Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done. Total genotyping rate is 0.992472. 379 variants and 504 people pass filters and QC. Note: No phenotypes present. --r square to sig_locus_mt.ld ... 0% [processingwriting] done. PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/ (C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to sig_locus_mt_r2.log. Options in effect: --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing --extract sig_locus.snplist --keep-allele-order --out sig_locus_mt_r2 --r2 square 31934 MB RAM detected; reserving 15967 MB for main workspace. 1235116 variants loaded from .bim file. 504 people (0 males, 0 females, 504 ambiguous) loaded from .fam. Ambiguous sex IDs written to sig_locus_mt_r2.nosex . --extract: 379 variants remaining. Using up to 19 threads (change this with --threads). Before main variant filters, 504 founders and 0 nonfounders present. Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done. Total genotyping rate is 0.992472. 379 variants and 504 people pass filters and QC. Note: No phenotypes present. --r2 square to sig_locus_mt_r2.ld ... 0% [processingwriting] done.
Finemapping¶
In [13]:
Copied!
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri as numpy2ri
numpy2ri.activate()
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri as numpy2ri
numpy2ri.activate()
INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY INFO:rpy2.situation:R home found: /home/yunye/anaconda3/envs/gwaslab_py39/lib/R INFO:rpy2.situation:R library path: INFO:rpy2.situation:LD_LIBRARY_PATH: INFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save INFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.
Load locus sumstats¶
In [14]:
Copied!
df = pd.read_csv("sig_locus.tsv",sep="\t")
df
df = pd.read_csv("sig_locus.tsv",sep="\t")
df
Out[14]:
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2:55015281:A:T | 2 | 55015281 | T | A | 0.126263 | -0.048075 | 0.193967 | -0.247856 | 0.804246 | 0.953062 | 495 | 9960009 | A | T |
1 | 2:55015604:G:A | 2 | 55015604 | A | G | 0.119192 | -0.047357 | 0.195199 | -0.242606 | 0.808311 | 0.953747 | 495 | 9960009 | G | A |
2 | 2:55015764:G:A | 2 | 55015764 | A | G | 0.339394 | 0.028986 | 0.135064 | 0.214575 | 0.830098 | 1.029410 | 495 | 9960009 | G | A |
3 | 2:55016143:A:C | 2 | 55016143 | C | A | 0.126263 | 0.004659 | 0.195728 | 0.023784 | 0.981025 | 1.004670 | 495 | 9960009 | A | C |
4 | 2:55017199:T:C | 2 | 55017199 | C | T | 0.093306 | 0.268767 | 0.219657 | 1.223580 | 0.221112 | 1.308350 | 493 | 9960009 | T | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
374 | 2:56004219:G:T | 2 | 56004219 | T | G | 0.828283 | -0.148489 | 0.169557 | -0.875763 | 0.381159 | 0.862010 | 495 | 9960019 | G | T |
375 | 2:56007034:T:C | 2 | 56007034 | C | T | 0.739879 | -0.073325 | 0.145565 | -0.503737 | 0.614446 | 0.929299 | 494 | 9960019 | T | C |
376 | 2:56008984:C:G | 2 | 56008984 | G | C | 0.013185 | 0.205883 | 0.547226 | 0.376227 | 0.706748 | 1.228610 | 493 | 9960009 | C | G |
377 | 2:56009480:A:T | 2 | 56009480 | T | A | 0.842742 | -0.135667 | 0.177621 | -0.763784 | 0.444996 | 0.873134 | 496 | 9960019 | A | T |
378 | 2:56010434:C:T | 2 | 56010434 | T | C | 0.017172 | 0.300305 | 0.491815 | 0.610604 | 0.541462 | 1.350270 | 495 | 9960009 | C | T |
379 rows × 15 columns
Import sumsieR¶
In [15]:
Copied!
# import susieR as object
susieR = importr('susieR')
# import susieR as object
susieR = importr('susieR')
Load LD matrix¶
In [16]:
Copied!
# convert pd.DataFrame to numpy
ld = pd.read_csv("sig_locus_mt.ld",sep="\t",header=None)
R_df = ld.values
ld2 = pd.read_csv("sig_locus_mt_r2.ld",sep="\t",header=None)
R_df2 = ld2.values
# convert pd.DataFrame to numpy
ld = pd.read_csv("sig_locus_mt.ld",sep="\t",header=None)
R_df = ld.values
ld2 = pd.read_csv("sig_locus_mt_r2.ld",sep="\t",header=None)
R_df2 = ld2.values
In [17]:
Copied!
R_df
R_df
Out[17]:
array([[ 1.00000e+00, 9.58562e-01, -3.08678e-01, ..., 1.96204e-02, -3.54602e-04, -7.14868e-03], [ 9.58562e-01, 1.00000e+00, -2.97617e-01, ..., 2.47755e-02, -1.49234e-02, -7.00509e-03], [-3.08678e-01, -2.97617e-01, 1.00000e+00, ..., -3.49335e-02, -1.37163e-02, -2.12828e-02], ..., [ 1.96204e-02, 2.47755e-02, -3.49335e-02, ..., 1.00000e+00, 5.26193e-02, -3.09069e-02], [-3.54602e-04, -1.49234e-02, -1.37163e-02, ..., 5.26193e-02, 1.00000e+00, -3.01142e-01], [-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02, -3.01142e-01, 1.00000e+00]])
Visualize the LD structure of this locus¶
In [18]:
Copied!
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=R_df,cmap="Spectral",ax=ax[0])
sns.heatmap(data=R_df2,ax=ax[1])
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=R_df,cmap="Spectral",ax=ax[0])
sns.heatmap(data=R_df2,ax=ax[1])
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
Out[18]:
Text(0.5, 1.0, 'LD r2 matrix')
<Figure size 2000x2000 with 0 Axes>
Run finemapping use susieR¶
In [19]:
Copied!
ro.r('set.seed(123)')
fit = susieR.susie_rss(
bhat = df["BETA"].values.reshape((len(R_df),1)),
shat = df["SE"].values.reshape((len(R_df),1)),
R = R_df,
L = 10,
n = 503
)
ro.r('set.seed(123)')
fit = susieR.susie_rss(
bhat = df["BETA"].values.reshape((len(R_df),1)),
shat = df["SE"].values.reshape((len(R_df),1)),
R = R_df,
L = 10,
n = 503
)
Extract credible sets and PIP¶
In [20]:
Copied!
# show the results of susie_get_cs
print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
# show the results of susie_get_cs
print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
$L1 [1] 200 218 221 224
We found 1 credible set here
In [21]:
Copied!
# add the information to dataframe for plotting
df["cs"] = 0
n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
for i in range(n_cs):
cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]
df.loc[np.array(cs_index)-1,"cs"] = i + 1
df["pip"] = np.array(susieR.susie_get_pip(fit))
# add the information to dataframe for plotting
df["cs"] = 0
n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
for i in range(n_cs):
cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]
df.loc[np.array(cs_index)-1,"cs"] = i + 1
df["pip"] = np.array(susieR.susie_get_pip(fit))
Create regional plot¶
In [22]:
Copied!
fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))
df["MLOG10P"] = -np.log10(df["P"])
col_to_plot = "MLOG10P"
p=axes[0].scatter(df["POS"],df[col_to_plot],c=ld[df["P"].idxmin()]**2)
axes[0].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,col_to_plot],
marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")
axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
marker='x',s=40,c="red",edgecolors='black',label="Causal")
plt.colorbar( p , label="Rsq with the lead variant")
axes[0].set_xlabel("position")
axes[0].set_xlim((55400000, 55800000))
axes[0].set_ylabel(col_to_plot)
axes[0].legend()
p=axes[1].scatter(df["POS"],df["pip"],c=ld[df["P"].idxmin()]**2)
axes[1].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,"pip"],
marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")
plt.colorbar( p , label="Rsq with the lead variant")
axes[1].set_xlabel("position")
axes[1].set_xlim((55400000, 55800000))
axes[1].set_ylabel("PIP")
axes[1].legend()
fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))
df["MLOG10P"] = -np.log10(df["P"])
col_to_plot = "MLOG10P"
p=axes[0].scatter(df["POS"],df[col_to_plot],c=ld[df["P"].idxmin()]**2)
axes[0].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,col_to_plot],
marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")
axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
marker='x',s=40,c="red",edgecolors='black',label="Causal")
plt.colorbar( p , label="Rsq with the lead variant")
axes[0].set_xlabel("position")
axes[0].set_xlim((55400000, 55800000))
axes[0].set_ylabel(col_to_plot)
axes[0].legend()
p=axes[1].scatter(df["POS"],df["pip"],c=ld[df["P"].idxmin()]**2)
axes[1].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,"pip"],
marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")
plt.colorbar( p , label="Rsq with the lead variant")
axes[1].set_xlabel("position")
axes[1].set_xlim((55400000, 55800000))
axes[1].set_ylabel("PIP")
axes[1].legend()
/tmp/ipykernel_420/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x'). Matplotlib is ignoring the edgecolor in favor of the facecolor. This behavior may change in the future. axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
Out[22]:
<matplotlib.legend.Legend at 0x7fa6a330d5e0>
Pitfalls¶
The causal variant we used to simulate is actually 2:55620927:G:A, which was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. So the credible set we identified does not really include the bona fide causal variant.
Lets then check the variants in credible set
In [23]:
Copied!
df.loc[np.array(cs_index)-1,:]
df.loc[np.array(cs_index)-1,:]
Out[23]:
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT | cs | pip | MLOG10P | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
199 | 2:55513738:C:T | 2 | 55513738 | T | C | 0.623992 | 1.219516 | 0.153159 | 7.96244 | 1.686760e-15 | 3.385550 | 496 | 9960019 | C | T | 1 | 0.325435 | 14.772947 |
217 | 2:55605943:A:G | 2 | 55605943 | G | A | 0.685484 | 1.321987 | 0.166688 | 7.93089 | 2.175840e-15 | 3.750867 | 496 | 9960019 | A | G | 1 | 0.267953 | 14.662373 |
220 | 2:55612986:G:C | 2 | 55612986 | C | G | 0.685223 | 1.302133 | 0.166154 | 7.83691 | 4.617840e-15 | 3.677133 | 494 | 9960019 | G | C | 1 | 0.150449 | 14.335561 |
223 | 2:55622624:G:A | 2 | 55622624 | A | G | 0.688508 | 1.324109 | 0.167119 | 7.92315 | 2.315640e-15 | 3.758833 | 496 | 9960019 | G | A | 1 | 0.255449 | 14.635329 |
Check LD of the causal variant and variants in the credible set¶
In [24]:
Copied!
!echo "2:55513738:C:T" > credible.snplist
!echo "2:55605943:A:G" >> credible.snplist
!echo "2:55612986:G:C" >> credible.snplist
!echo "2:55620927:G:A" >> credible.snplist
!echo "2:55622624:G:A" >> credible.snplist
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract credible.snplist \
--out credible_r
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract credible.snplist \
--out credible_r2
!echo "2:55513738:C:T" > credible.snplist
!echo "2:55605943:A:G" >> credible.snplist
!echo "2:55612986:G:C" >> credible.snplist
!echo "2:55620927:G:A" >> credible.snplist
!echo "2:55622624:G:A" >> credible.snplist
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract credible.snplist \
--out credible_r
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract credible.snplist \
--out credible_r2
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/ (C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to credible_r.log. Options in effect: --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing --extract credible.snplist --keep-allele-order --out credible_r --r2 square 31934 MB RAM detected; reserving 15967 MB for main workspace. 1235116 variants loaded from .bim file. 504 people (0 males, 0 females, 504 ambiguous) loaded from .fam. Ambiguous sex IDs written to credible_r.nosex . --extract: 5 variants remaining. Using up to 19 threads (change this with --threads). Before main variant filters, 504 founders and 0 nonfounders present. Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done. Total genotyping rate is 0.995635. 5 variants and 504 people pass filters and QC. Note: No phenotypes present. --r2 square to credible_r.ld ... 0% [processingwriting] done. PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/ (C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to credible_r2.log. Options in effect: --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing --extract credible.snplist --keep-allele-order --out credible_r2 --r2 square 31934 MB RAM detected; reserving 15967 MB for main workspace. 1235116 variants loaded from .bim file. 504 people (0 males, 0 females, 504 ambiguous) loaded from .fam. Ambiguous sex IDs written to credible_r2.nosex . --extract: 5 variants remaining. Using up to 19 threads (change this with --threads). Before main variant filters, 504 founders and 0 nonfounders present. Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done. Total genotyping rate is 0.995635. 5 variants and 504 people pass filters and QC. Note: No phenotypes present. --r2 square to credible_r2.ld ... 0% [processingwriting] done.
Load LD and plot¶
In [25]:
Copied!
credible_snplist=["2:55513738:C:T","2:55605943:A:G", "2:55612986:G:C", "2:55620927:G:A", "2:55622624:G:A"]
ld = pd.read_csv("credible_r.ld",sep="\t",header=None)
ld.columns=credible_snplist
ld.index=credible_snplist
ld2 = pd.read_csv("credible_r2.ld",sep="\t",header=None)
ld2.columns=credible_snplist
ld2.index=credible_snplist
credible_snplist=["2:55513738:C:T","2:55605943:A:G", "2:55612986:G:C", "2:55620927:G:A", "2:55622624:G:A"]
ld = pd.read_csv("credible_r.ld",sep="\t",header=None)
ld.columns=credible_snplist
ld.index=credible_snplist
ld2 = pd.read_csv("credible_r2.ld",sep="\t",header=None)
ld2.columns=credible_snplist
ld2.index=credible_snplist
In [26]:
Copied!
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=ld, cmap="Spectral_r",ax=ax[0],center=0)
sns.heatmap(data=ld2,cmap="Spectral_r",ax=ax[1],vmin=0,vmax=1)
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=ld, cmap="Spectral_r",ax=ax[0],center=0)
sns.heatmap(data=ld2,cmap="Spectral_r",ax=ax[1],vmin=0,vmax=1)
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
Out[26]:
Text(0.5, 1.0, 'LD r2 matrix')
<Figure size 2000x2000 with 0 Axes>
Variants in the credible set are in strong LD with the bona fide causal variant.
This could also happen in real-world analysis. Please always be cautious when interpreting fine-mapping results.