Finemapping using susieR¶

Data preparation¶

In [1]:

Copied!





import gwaslab as gl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import gwaslab as gl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load sumstats¶

In [2]:

Copied!

sumstats = gl.Sumstats("./1kgeas.B1.glm.firth.gz",fmt="plink2")
sumstats = gl.Sumstats("./1kgeas.B1.glm.firth.gz",fmt="plink2")

2024/04/18 10:40:48 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/
2024/04/18 10:40:48 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/04/18 10:40:48 Start to load format from formatbook....
2024/04/18 10:40:48  -plink2 format meta info:
2024/04/18 10:40:48   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear
2024/04/18 10:40:48   - format_source  : https://www.cog-genomics.org/plink/2.0/formats
2024/04/18 10:40:48   - format_version  : Alpha 3.3 final (3 Jun)
2024/04/18 10:40:48   - last_check_date  :  20220806
2024/04/18 10:40:48  -plink2 to gwaslab format dictionary:
2024/04/18 10:40:48   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR
2024/04/18 10:40:48   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR
2024/04/18 10:40:48 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz
2024/04/18 10:40:49  -Reading columns          : Z_STAT,A1_FREQ,POS,ALT,REF,P,A1,OR,OBS_CT,#CHROM,LOG(OR)_SE,ID
2024/04/18 10:40:49  -Renaming columns to      : Z,EAF,POS,ALT,REF,P,EA,OR,N,CHR,SE,SNPID
2024/04/18 10:40:49  -Current Dataframe shape : 1128732  x  12
2024/04/18 10:40:49  -Initiating a status column: STATUS ...
2024/04/18 10:40:49  #WARNING! Version of genomic coordinates is unknown...
2024/04/18 10:40:49  NEA not available: assigning REF to NEA...
2024/04/18 10:40:49  -EA,REF and ALT columns are available: assigning NEA...
2024/04/18 10:40:49  -For variants with EA == ALT : assigning REF to NEA ...
2024/04/18 10:40:49  -For variants with EA != ALT : assigning ALT to NEA ...
2024/04/18 10:40:49 Start to reorder the columns...v3.4.43
2024/04/18 10:40:49  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB
2024/04/18 10:40:49  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT
2024/04/18 10:40:49 Finished reordering the columns.
2024/04/18 10:40:49  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     
2024/04/18 10:40:49  -DType   : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category
2024/04/18 10:40:49  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       
2024/04/18 10:40:50  -Current Dataframe memory usage: 106.06 MB
2024/04/18 10:40:50 Finished loading data successfully!

Data standardization and sanity check¶

In [3]:

Copied!

sumstats.basic_check()
sumstats.basic_check()

2024/04/18 10:40:50 Start to check SNPID/rsID...v3.4.43
2024/04/18 10:40:50  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB
2024/04/18 10:40:50  -Checking SNPID data type...
2024/04/18 10:40:50  -Converting SNPID to pd.string data type...
2024/04/18 10:40:50  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
2024/04/18 10:40:51 Finished checking SNPID/rsID.
2024/04/18 10:40:51 Start to fix chromosome notation (CHR)...v3.4.43
2024/04/18 10:40:51  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB
2024/04/18 10:40:51  -Checking CHR data type...
2024/04/18 10:40:51  -Variants with standardized chromosome notation: 1128732
2024/04/18 10:40:51  -All CHR are already fixed...
2024/04/18 10:40:52 Finished fixing chromosome notation (CHR).
2024/04/18 10:40:52 Start to fix basepair positions (POS)...v3.4.43
2024/04/18 10:40:52  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 107.13 MB
2024/04/18 10:40:52  -Converting to Int64 data type ...
2024/04/18 10:40:53  -Position bound:(0 , 250,000,000)
2024/04/18 10:40:53  -Removed outliers: 0
2024/04/18 10:40:53 Finished fixing basepair positions (POS).
2024/04/18 10:40:53 Start to fix alleles (EA and NEA)...v3.4.43
2024/04/18 10:40:53  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB
2024/04/18 10:40:53  -Converted all bases to string datatype and UPPERCASE.
2024/04/18 10:40:53  -Variants with bad EA  : 0
2024/04/18 10:40:54  -Variants with bad NEA : 0
2024/04/18 10:40:54  -Variants with NA for EA or NEA: 0
2024/04/18 10:40:54  -Variants with same EA and NEA: 0
2024/04/18 10:40:54  -Detected 0 variants with alleles that contain bases other than A/C/T/G .
2024/04/18 10:40:55 Finished fixing alleles (EA and NEA).
2024/04/18 10:40:55 Start to perform sanity check for statistics...v3.4.43
2024/04/18 10:40:55  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 116.82 MB
2024/04/18 10:40:55  -Comparison tolerance for floats: 1e-07
2024/04/18 10:40:55  -Checking if 0 <= N <= 2147483647 ...
2024/04/18 10:40:55  -Removed 0 variants with bad/na N.
2024/04/18 10:40:55  -Checking if -1e-07 < EAF < 1.0000001 ...
2024/04/18 10:40:55  -Removed 0 variants with bad/na EAF.
2024/04/18 10:40:55  -Checking if -9999.0000001 < Z < 9999.0000001 ...
2024/04/18 10:40:55   -Examples of invalid variants(SNPID): 1:15774:G:A,1:15777:A:G,1:57292:C:T,1:87360:C:T,1:625392:T:C ...
2024/04/18 10:40:55   -Examples of invalid values (Z): NA,NA,NA,NA,NA ...
2024/04/18 10:40:55  -Removed 220793 variants with bad/na Z.
2024/04/18 10:40:55  -Checking if -1e-07 < P < 1.0000001 ...
2024/04/18 10:40:55  -Removed 0 variants with bad/na P.
2024/04/18 10:40:55  -Checking if -1e-07 < SE < inf ...
2024/04/18 10:40:55  -Removed 0 variants with bad/na SE.
2024/04/18 10:40:55  -Checking if -100.0000001 < OR < 100.0000001 ...
2024/04/18 10:40:55  -Removed 0 variants with bad/na OR.
2024/04/18 10:40:55  -Checking STATUS and converting STATUS to categories....
2024/04/18 10:40:56  -Removed 220793 variants with bad statistics in total.
2024/04/18 10:40:56  -Data types for each column:
2024/04/18 10:40:56  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     
2024/04/18 10:40:56  -DType   : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category
2024/04/18 10:40:56  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       
2024/04/18 10:40:56 Finished sanity check for statistics.
2024/04/18 10:40:56 Start to check data consistency across columns...v3.4.43
2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB
2024/04/18 10:40:56  -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
2024/04/18 10:40:56  -No availalbe columns for data consistency checking...Skipping...
2024/04/18 10:40:56 Finished checking data consistency across columns.
2024/04/18 10:40:56 Start to normalize indels...v3.4.43
2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB
2024/04/18 10:40:56  -No available variants to normalize..
2024/04/18 10:40:56 Finished normalizing variants successfully!
2024/04/18 10:40:56 Start to sort the genome coordinates...v3.4.43
2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 95.27 MB
2024/04/18 10:40:56 Finished sorting coordinates.
2024/04/18 10:40:56 Start to reorder the columns...v3.4.43
2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB
2024/04/18 10:40:56  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT
2024/04/18 10:40:56 Finished reordering the columns.

Note: 220793 variants were removed due to na Z values.This is due to FIRTH_CONVERGE_FAIL when performing GWAS using PLINK2.

Extract lead variants¶

In [4]:

Copied!

sumstats.get_lead()
sumstats.get_lead()

2024/04/18 10:40:56 Start to extract lead variants...v3.4.43
2024/04/18 10:40:56  -Current Dataframe shape : 907939 x 14 ; Memory usage: 88.35 MB
2024/04/18 10:40:56  -Processing 907939 variants...
2024/04/18 10:40:56  -Significance threshold : 5e-08
2024/04/18 10:40:56  -Sliding window size: 500  kb
2024/04/18 10:40:56  -Using P for extracting lead variants...
2024/04/18 10:40:56  -Found 43 significant variants in total...
2024/04/18 10:40:56  -Identified 4 lead variants!
2024/04/18 10:40:56 Finished extracting lead variants.

Out[4]:

	SNPID	CHR	POS	EA	NEA	EAF	SE	Z	P	OR	N	STATUS	REF	ALT
44298	1:167562605:G:A	1	167562605	A	G	0.391481	0.159645	7.69462	1.419150e-14	3.415780	493	9960099	G	A
91266	2:55513738:C:T	2	55513738	C	T	0.376008	0.153159	-7.96244	1.686760e-15	0.295373	496	9960099	C	T
442239	7:134368632:T:G	7	134368632	G	T	0.138105	0.225526	6.89025	5.569440e-12	4.730010	496	9960099	T	G
875859	20:42758834:T:C	20	42758834	T	C	0.227273	0.184323	-7.76902	7.909780e-15	0.238829	495	9960099	T	C

Create manhattan plot for checking¶

In [5]:

Copied!

sumstats.plot_mqq()
sumstats.plot_mqq()

2024/04/18 10:40:57 Start to create MQQ plot...v3.4.43:
2024/04/18 10:40:57  -Genomic coordinates version: 99...
2024/04/18 10:40:57  #WARNING! Genomic coordinates version is unknown.
2024/04/18 10:40:57  -Genome-wide significance level to plot is set to 5e-08 ...
2024/04/18 10:40:57  -Raw input contains 907939 variants...
2024/04/18 10:40:57  -MQQ plot layout mode is : mqq
2024/04/18 10:40:57 Finished loading specified columns from the sumstats.
2024/04/18 10:40:57 Start data conversion and sanity check:
2024/04/18 10:40:57  -Removed 0 variants with nan in CHR or POS column ...
2024/04/18 10:40:57  -Removed 0 variants with CHR <=0...
2024/04/18 10:40:57  -Removed 0 variants with nan in P column ...
2024/04/18 10:40:57  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2024/04/18 10:40:57  -Sumstats P values are being converted to -log10(P)...
2024/04/18 10:40:57  -Sanity check: 0 na/inf/-inf variants will be removed...
2024/04/18 10:40:57  -Converting data above cut line...
2024/04/18 10:40:57  -Maximum -log10(P) value is 14.772946706439042 .
2024/04/18 10:40:57 Finished data conversion and sanity check.
2024/04/18 10:40:57 Start to create MQQ plot with 907939 variants...
2024/04/18 10:40:58  -Creating background plot...
2024/04/18 10:40:59 Finished creating MQQ plot successfully!
2024/04/18 10:40:59 Start to extract variants for annotation...
2024/04/18 10:40:59  -Found 4 significant variants with a sliding window size of 500 kb...
2024/04/18 10:40:59 Finished extracting variants for annotation...
2024/04/18 10:40:59 Start to process figure arts.
2024/04/18 10:40:59  -Processing X ticks...
2024/04/18 10:40:59  -Processing X labels...
2024/04/18 10:40:59  -Processing Y labels...
2024/04/18 10:40:59  -Processing Y tick lables...
2024/04/18 10:40:59  -Processing Y labels...
2024/04/18 10:40:59  -Processing lines...
2024/04/18 10:40:59 Finished processing figure arts.
2024/04/18 10:40:59 Start to annotate variants...
2024/04/18 10:40:59  -Skip annotating
2024/04/18 10:40:59 Finished annotating variants.
2024/04/18 10:40:59 Start to create QQ plot with 907939 variants:
2024/04/18 10:40:59  -Plotting all variants...
2024/04/18 10:40:59  -Expected range of P: (0,1.0)
2024/04/18 10:40:59  -Lambda GC (MLOG10P mode) at 0.5 is   0.98908
2024/04/18 10:40:59  -Processing Y tick lables...
2024/04/18 10:40:59 Finished creating QQ plot successfully!
2024/04/18 10:40:59 Start to save figure...
2024/04/18 10:40:59  -Skip saving figure!
2024/04/18 10:40:59 Finished saving figure...
2024/04/18 10:40:59 Finished creating plot successfully!

Out[5]:

(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fa6ad1132b0>)

No description has been provided for this image

Extract the variants around 2:55513738:C:T for finemapping¶

In [6]:

Copied!

locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')

2024/04/18 10:41:06 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738
2024/04/18 10:41:06  -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738
2024/04/18 10:41:06 Finished filtering values.

Convert OR to BETA¶

In [7]:

Copied!

locus.fill_data(to_fill=["BETA"])
locus.fill_data(to_fill=["BETA"])

2024/04/18 10:41:06 Start filling data using existing columns...v3.4.43
2024/04/18 10:41:06  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT     
2024/04/18 10:41:06  -DType   : string Int64 Int64 category category float32 float64 float64 float64 float64 Int64 category category category
2024/04/18 10:41:06  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T       
2024/04/18 10:41:06  -Overwrite mode:  False
2024/04/18 10:41:06   -Skipping columns:  []
2024/04/18 10:41:06  -Filling columns:  ['BETA']
2024/04/18 10:41:06   - Filling Columns iteratively...
2024/04/18 10:41:06   - Filling BETA value using OR column...
2024/04/18 10:41:06 Finished filling data using existing columns.
2024/04/18 10:41:06 Start to reorder the columns...v3.4.43
2024/04/18 10:41:06  -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB
2024/04/18 10:41:06  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT
2024/04/18 10:41:06 Finished reordering the columns.

In [8]:

Copied!

locus.data
locus.data

Out[8]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	Z	P	OR	N	STATUS	REF	ALT
91067	2:55015281:A:T	2	55015281	T	A	0.126263	-0.048075	0.193967	-0.247856	0.804246	0.953062	495	9960099	A	T
91068	2:55015604:G:A	2	55015604	A	G	0.119192	-0.047357	0.195199	-0.242606	0.808311	0.953747	495	9960099	G	A
91069	2:55015764:G:A	2	55015764	A	G	0.339394	0.028986	0.135064	0.214575	0.830098	1.029410	495	9960099	G	A
91070	2:55016143:A:C	2	55016143	C	A	0.126263	0.004659	0.195728	0.023784	0.981025	1.004670	495	9960099	A	C
91071	2:55017199:T:C	2	55017199	C	T	0.093306	0.268767	0.219657	1.223580	0.221112	1.308350	493	9960099	T	C
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
91441	2:56004219:G:T	2	56004219	G	T	0.171717	0.148489	0.169557	0.875763	0.381159	1.160080	495	9960099	G	T
91442	2:56007034:T:C	2	56007034	T	C	0.260121	0.073325	0.145565	0.503737	0.614446	1.076080	494	9960099	T	C
91443	2:56008984:C:G	2	56008984	G	C	0.013185	0.205883	0.547226	0.376227	0.706748	1.228610	493	9960099	C	G
91444	2:56009480:A:T	2	56009480	A	T	0.157258	0.135667	0.177621	0.763784	0.444996	1.145300	496	9960099	A	T
91445	2:56010434:C:T	2	56010434	T	C	0.017172	0.300305	0.491815	0.610604	0.541462	1.350270	495	9960099	C	T

379 rows × 15 columns

Align NEA with reference sequence¶

In [9]:

Copied!

locus.harmonize(basic_check=False, ref_seq="/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta")
locus.harmonize(basic_check=False, ref_seq="/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta")

2024/04/18 10:41:07 Start to check if NEA is aligned with reference sequence...v3.4.43
2024/04/18 10:41:07  -Current Dataframe shape : 379 x 15 ; Memory usage: 19.97 MB
2024/04/18 10:41:07  -Reference genome FASTA file: /home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta
2024/04/18 10:41:07  -Loading fasta records:2  
2024/04/18 10:41:19  -Checking records
2024/04/18 10:41:19    -Building numpy fasta records from dict
2024/04/18 10:41:20    -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )
2024/04/18 10:41:20    -Checking records for ( len(NEA) > 4 or len(EA) > 4 )
2024/04/18 10:41:20  -Finished checking records
2024/04/18 10:41:20  -Variants allele on given reference sequence :  264
2024/04/18 10:41:20  -Variants flipped :  115
2024/04/18 10:41:20   -Raw Matching rate :  100.00%
2024/04/18 10:41:20  -Variants inferred reverse_complement :  0
2024/04/18 10:41:20  -Variants inferred reverse_complement_flipped :  0
2024/04/18 10:41:20  -Both allele on genome + unable to distinguish :  0
2024/04/18 10:41:20  -Variants not on given reference sequence :  0
2024/04/18 10:41:20 Finished checking if NEA is aligned with reference sequence.
2024/04/18 10:41:20 Start to adjust statistics based on STATUS code...v3.4.43
2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB
2024/04/18 10:41:20 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.43
2024/04/18 10:41:20  -Flipping 115 variants...
2024/04/18 10:41:20  -Swapping column: NEA <=> EA...
2024/04/18 10:41:20  -Flipping column: BETA = - BETA...
2024/04/18 10:41:20  -Flipping column: Z = - Z...
2024/04/18 10:41:20  -Flipping column: EAF = 1 - EAF...
2024/04/18 10:41:20  -Flipping column: OR = 1 / OR...
2024/04/18 10:41:20  -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x
2024/04/18 10:41:20 Finished adjusting statistics based on STATUS code.
2024/04/18 10:41:20 Start to sort the genome coordinates...v3.4.43
2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.04 MB
2024/04/18 10:41:20 Finished sorting coordinates.
2024/04/18 10:41:20 Start to reorder the columns...v3.4.43
2024/04/18 10:41:20  -Current Dataframe shape : 379 x 15 ; Memory usage: 0.03 MB
2024/04/18 10:41:20  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT
2024/04/18 10:41:20 Finished reordering the columns.

Out[9]:

<gwaslab.g_Sumstats.Sumstats at 0x7fa6a33a8130>

In [10]:

Copied!

locus.data
locus.data

Out[10]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	Z	P	OR	N	STATUS	REF	ALT
0	2:55015281:A:T	2	55015281	T	A	0.126263	-0.048075	0.193967	-0.247856	0.804246	0.953062	495	9960009	A	T
1	2:55015604:G:A	2	55015604	A	G	0.119192	-0.047357	0.195199	-0.242606	0.808311	0.953747	495	9960009	G	A
2	2:55015764:G:A	2	55015764	A	G	0.339394	0.028986	0.135064	0.214575	0.830098	1.029410	495	9960009	G	A
3	2:55016143:A:C	2	55016143	C	A	0.126263	0.004659	0.195728	0.023784	0.981025	1.004670	495	9960009	A	C
4	2:55017199:T:C	2	55017199	C	T	0.093306	0.268767	0.219657	1.223580	0.221112	1.308350	493	9960009	T	C
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
374	2:56004219:G:T	2	56004219	T	G	0.828283	-0.148489	0.169557	-0.875763	0.381159	0.862010	495	9960019	G	T
375	2:56007034:T:C	2	56007034	C	T	0.739879	-0.073325	0.145565	-0.503737	0.614446	0.929299	494	9960019	T	C
376	2:56008984:C:G	2	56008984	G	C	0.013185	0.205883	0.547226	0.376227	0.706748	1.228610	493	9960009	C	G
377	2:56009480:A:T	2	56009480	T	A	0.842742	-0.135667	0.177621	-0.763784	0.444996	0.873134	496	9960019	A	T
378	2:56010434:C:T	2	56010434	T	C	0.017172	0.300305	0.491815	0.610604	0.541462	1.350270	495	9960009	C	T

379 rows × 15 columns

Output the sumstats of this locus¶

In [11]:

Copied!

locus.data.to_csv("sig_locus.tsv",sep="\t",index=None)
locus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)
locus.data.to_csv("sig_locus.tsv",sep="\t",index=None)
locus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)

Run PLINK to get LD matrix for this locus¶

In [12]:

Copied!





!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r square \
  --extract sig_locus.snplist \
  --out sig_locus_mt

!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r2 square \
  --extract sig_locus.snplist \
  --out sig_locus_mt_r2
!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r square \
  --extract sig_locus.snplist \
  --out sig_locus_mt

!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r2 square \
  --extract sig_locus.snplist \
  --out sig_locus_mt_r2

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to sig_locus_mt.log.
Options in effect:
  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
  --extract sig_locus.snplist
  --keep-allele-order
  --out sig_locus_mt
  --r square

31934 MB RAM detected; reserving 15967 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to sig_locus_mt.nosex .
--extract: 379 variants remaining.
Using up to 19 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.992472.
379 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r square to sig_locus_mt.ld ... 0% [processingwriting]          done.
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to sig_locus_mt_r2.log.
Options in effect:
  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
  --extract sig_locus.snplist
  --keep-allele-order
  --out sig_locus_mt_r2
  --r2 square

31934 MB RAM detected; reserving 15967 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to sig_locus_mt_r2.nosex .
--extract: 379 variants remaining.
Using up to 19 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.992472.
379 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to sig_locus_mt_r2.ld ... 0% [processingwriting]          done.

Finemapping¶

In [13]:

Copied!





import rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri as numpy2ri
numpy2ri.activate()
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri as numpy2ri
numpy2ri.activate()

INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY
INFO:rpy2.situation:R home found: /home/yunye/anaconda3/envs/gwaslab_py39/lib/R
INFO:rpy2.situation:R library path: 
INFO:rpy2.situation:LD_LIBRARY_PATH: 
INFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save
INFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.

Load locus sumstats¶

In [14]:

Copied!

df = pd.read_csv("sig_locus.tsv",sep="\t")
df
df = pd.read_csv("sig_locus.tsv",sep="\t")
df

Out[14]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	Z	P	OR	N	STATUS	REF	ALT
0	2:55015281:A:T	2	55015281	T	A	0.126263	-0.048075	0.193967	-0.247856	0.804246	0.953062	495	9960009	A	T
1	2:55015604:G:A	2	55015604	A	G	0.119192	-0.047357	0.195199	-0.242606	0.808311	0.953747	495	9960009	G	A
2	2:55015764:G:A	2	55015764	A	G	0.339394	0.028986	0.135064	0.214575	0.830098	1.029410	495	9960009	G	A
3	2:55016143:A:C	2	55016143	C	A	0.126263	0.004659	0.195728	0.023784	0.981025	1.004670	495	9960009	A	C
4	2:55017199:T:C	2	55017199	C	T	0.093306	0.268767	0.219657	1.223580	0.221112	1.308350	493	9960009	T	C
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
374	2:56004219:G:T	2	56004219	T	G	0.828283	-0.148489	0.169557	-0.875763	0.381159	0.862010	495	9960019	G	T
375	2:56007034:T:C	2	56007034	C	T	0.739879	-0.073325	0.145565	-0.503737	0.614446	0.929299	494	9960019	T	C
376	2:56008984:C:G	2	56008984	G	C	0.013185	0.205883	0.547226	0.376227	0.706748	1.228610	493	9960009	C	G
377	2:56009480:A:T	2	56009480	T	A	0.842742	-0.135667	0.177621	-0.763784	0.444996	0.873134	496	9960019	A	T
378	2:56010434:C:T	2	56010434	T	C	0.017172	0.300305	0.491815	0.610604	0.541462	1.350270	495	9960009	C	T

379 rows × 15 columns

Import sumsieR¶

In [15]:

Copied!

# import susieR as object
susieR = importr('susieR')
# import susieR as object
susieR = importr('susieR')

Load LD matrix¶

In [16]:

Copied!





# convert pd.DataFrame to numpy
ld = pd.read_csv("sig_locus_mt.ld",sep="\t",header=None)
R_df = ld.values
ld2 = pd.read_csv("sig_locus_mt_r2.ld",sep="\t",header=None)
R_df2 = ld2.values
# convert pd.DataFrame to numpy
ld = pd.read_csv("sig_locus_mt.ld",sep="\t",header=None)
R_df = ld.values
ld2 = pd.read_csv("sig_locus_mt_r2.ld",sep="\t",header=None)
R_df2 = ld2.values

In [17]:

Copied!

R_df
R_df

Out[17]:

array([[ 1.00000e+00,  9.58562e-01, -3.08678e-01, ...,  1.96204e-02,
        -3.54602e-04, -7.14868e-03],
       [ 9.58562e-01,  1.00000e+00, -2.97617e-01, ...,  2.47755e-02,
        -1.49234e-02, -7.00509e-03],
       [-3.08678e-01, -2.97617e-01,  1.00000e+00, ..., -3.49335e-02,
        -1.37163e-02, -2.12828e-02],
       ...,
       [ 1.96204e-02,  2.47755e-02, -3.49335e-02, ...,  1.00000e+00,
         5.26193e-02, -3.09069e-02],
       [-3.54602e-04, -1.49234e-02, -1.37163e-02, ...,  5.26193e-02,
         1.00000e+00, -3.01142e-01],
       [-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02,
        -3.01142e-01,  1.00000e+00]])

Visualize the LD structure of this locus¶

In [18]:

Copied!





plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=R_df,cmap="Spectral",ax=ax[0])
sns.heatmap(data=R_df2,ax=ax[1])
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=R_df,cmap="Spectral",ax=ax[0])
sns.heatmap(data=R_df2,ax=ax[1])
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")

Out[18]:

Text(0.5, 1.0, 'LD r2 matrix')

<Figure size 2000x2000 with 0 Axes>

https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html#fine-mapping-with-susier-using-summary-statistics

Run finemapping use susieR¶

In [19]:

Copied!





ro.r('set.seed(123)')
fit = susieR.susie_rss(
    bhat = df["BETA"].values.reshape((len(R_df),1)),
    shat = df["SE"].values.reshape((len(R_df),1)),
    R = R_df,
    L = 10,
    n = 503
)
ro.r('set.seed(123)')
fit = susieR.susie_rss(
    bhat = df["BETA"].values.reshape((len(R_df),1)),
    shat = df["SE"].values.reshape((len(R_df),1)),
    R = R_df,
    L = 10,
    n = 503
)

Extract credible sets and PIP¶

In [20]:

Copied!

# show the results of susie_get_cs
print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
# show the results of susie_get_cs
print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])

$L1
[1] 200 218 221 224

We found 1 credible set here

In [21]:

Copied!





# add the information to dataframe for plotting
df["cs"] = 0
n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
for i in range(n_cs):
    cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]
    df.loc[np.array(cs_index)-1,"cs"] = i + 1
df["pip"] = np.array(susieR.susie_get_pip(fit))
# add the information to dataframe for plotting
df["cs"] = 0
n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
for i in range(n_cs):
    cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]
    df.loc[np.array(cs_index)-1,"cs"] = i + 1
df["pip"] = np.array(susieR.susie_get_pip(fit))

Create regional plot¶

In [22]:

Copied!





fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))
df["MLOG10P"] = -np.log10(df["P"])
col_to_plot = "MLOG10P"
p=axes[0].scatter(df["POS"],df[col_to_plot],c=ld[df["P"].idxmin()]**2)

axes[0].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,col_to_plot],
           marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")

axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
           marker='x',s=40,c="red",edgecolors='black',label="Causal")

plt.colorbar( p , label="Rsq with the lead variant")
axes[0].set_xlabel("position")
axes[0].set_xlim((55400000, 55800000))
axes[0].set_ylabel(col_to_plot)
axes[0].legend()

p=axes[1].scatter(df["POS"],df["pip"],c=ld[df["P"].idxmin()]**2)

axes[1].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,"pip"],
           marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")

plt.colorbar( p , label="Rsq with the lead variant")
axes[1].set_xlabel("position")
axes[1].set_xlim((55400000, 55800000))
axes[1].set_ylabel("PIP")
axes[1].legend()
fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))
df["MLOG10P"] = -np.log10(df["P"])
col_to_plot = "MLOG10P"
p=axes[0].scatter(df["POS"],df[col_to_plot],c=ld[df["P"].idxmin()]**2)

axes[0].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,col_to_plot],
           marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")

axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
           marker='x',s=40,c="red",edgecolors='black',label="Causal")

plt.colorbar( p , label="Rsq with the lead variant")
axes[0].set_xlabel("position")
axes[0].set_xlim((55400000, 55800000))
axes[0].set_ylabel(col_to_plot)
axes[0].legend()

p=axes[1].scatter(df["POS"],df["pip"],c=ld[df["P"].idxmin()]**2)

axes[1].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,"pip"],
           marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")

plt.colorbar( p , label="Rsq with the lead variant")
axes[1].set_xlabel("position")
axes[1].set_xlim((55400000, 55800000))
axes[1].set_ylabel("PIP")
axes[1].legend()

/tmp/ipykernel_420/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x').  Matplotlib is ignoring the edgecolor in favor of the facecolor.  This behavior may change in the future.
  axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],

Out[22]:

<matplotlib.legend.Legend at 0x7fa6a330d5e0>

Pitfalls¶

The causal variant we used to simulate is actually 2:55620927:G:A, which was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. So the credible set we identified does not really include the bona fide causal variant.

Lets then check the variants in credible set

In [23]:

Copied!

df.loc[np.array(cs_index)-1,:]
df.loc[np.array(cs_index)-1,:]

Out[23]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	Z	P	OR	N	STATUS	REF	ALT	cs	pip	MLOG10P
199	2:55513738:C:T	2	55513738	T	C	0.623992	1.219516	0.153159	7.96244	1.686760e-15	3.385550	496	9960019	C	T	1	0.325435	14.772947
217	2:55605943:A:G	2	55605943	G	A	0.685484	1.321987	0.166688	7.93089	2.175840e-15	3.750867	496	9960019	A	G	1	0.267953	14.662373
220	2:55612986:G:C	2	55612986	C	G	0.685223	1.302133	0.166154	7.83691	4.617840e-15	3.677133	494	9960019	G	C	1	0.150449	14.335561
223	2:55622624:G:A	2	55622624	A	G	0.688508	1.324109	0.167119	7.92315	2.315640e-15	3.758833	496	9960019	G	A	1	0.255449	14.635329

Check LD of the causal variant and variants in the credible set¶

In [24]:

Copied!





!echo "2:55513738:C:T" > credible.snplist
!echo "2:55605943:A:G" >> credible.snplist
!echo "2:55612986:G:C" >> credible.snplist
!echo "2:55620927:G:A" >> credible.snplist
!echo "2:55622624:G:A" >> credible.snplist

!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r2 square \
  --extract credible.snplist \
  --out credible_r

!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r2 square \
  --extract credible.snplist \
  --out credible_r2
!echo "2:55513738:C:T" > credible.snplist
!echo "2:55605943:A:G" >> credible.snplist
!echo "2:55612986:G:C" >> credible.snplist
!echo "2:55620927:G:A" >> credible.snplist
!echo "2:55622624:G:A" >> credible.snplist

!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r2 square \
  --extract credible.snplist \
  --out credible_r

!plink \
  --bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
  --keep-allele-order \
  --r2 square \
  --extract credible.snplist \
  --out credible_r2

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to credible_r.log.
Options in effect:
  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
  --extract credible.snplist
  --keep-allele-order
  --out credible_r
  --r2 square

31934 MB RAM detected; reserving 15967 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to credible_r.nosex .
--extract: 5 variants remaining.
Using up to 19 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.995635.
5 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to credible_r.ld ... 0% [processingwriting]          done.
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to credible_r2.log.
Options in effect:
  --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
  --extract credible.snplist
  --keep-allele-order
  --out credible_r2
  --r2 square

31934 MB RAM detected; reserving 15967 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to credible_r2.nosex .
--extract: 5 variants remaining.
Using up to 19 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.995635.
5 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to credible_r2.ld ... 0% [processingwriting]          done.

Load LD and plot¶

In [25]:

Copied!





credible_snplist=["2:55513738:C:T","2:55605943:A:G", "2:55612986:G:C", "2:55620927:G:A", "2:55622624:G:A"]
ld = pd.read_csv("credible_r.ld",sep="\t",header=None)
ld.columns=credible_snplist
ld.index=credible_snplist
ld2 = pd.read_csv("credible_r2.ld",sep="\t",header=None)
ld2.columns=credible_snplist
ld2.index=credible_snplist
credible_snplist=["2:55513738:C:T","2:55605943:A:G", "2:55612986:G:C", "2:55620927:G:A", "2:55622624:G:A"]
ld = pd.read_csv("credible_r.ld",sep="\t",header=None)
ld.columns=credible_snplist
ld.index=credible_snplist
ld2 = pd.read_csv("credible_r2.ld",sep="\t",header=None)
ld2.columns=credible_snplist
ld2.index=credible_snplist

In [26]:

Copied!





plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=ld, cmap="Spectral_r",ax=ax[0],center=0)
sns.heatmap(data=ld2,cmap="Spectral_r",ax=ax[1],vmin=0,vmax=1)
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=ld, cmap="Spectral_r",ax=ax[0],center=0)
sns.heatmap(data=ld2,cmap="Spectral_r",ax=ax[1],vmin=0,vmax=1)
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")

Out[26]:

Text(0.5, 1.0, 'LD r2 matrix')

<Figure size 2000x2000 with 0 Axes>

Variants in the credible set are in strong LD with the bona fide causal variant.

This could also happen in real-world analysis. Please always be cautious when interpreting fine-mapping results.