Finemapping using susieR
Data preparation
import sys
sys.path.insert(0,"/home/yunye/work/gwaslab/src")
import gwaslab as gl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import gc
Load sumstats
sumstats = gl.Sumstats("./1kgeas.B1.glm.firth.gz",fmt="plink2")
stdout:
2024/04/19 22:44:31 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/
2024/04/19 22:44:31 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/04/19 22:44:31 Start to load format from formatbook....
2024/04/19 22:44:31 -plink2 format meta info:
2024/04/19 22:44:31 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear
2024/04/19 22:44:31 - format_source : https://www.cog-genomics.org/plink/2.0/formats
2024/04/19 22:44:31 - format_version : Alpha 3.3 final (3 Jun)
2024/04/19 22:44:31 - last_check_date : 20220806
2024/04/19 22:44:31 -plink2 to gwaslab format dictionary:
2024/04/19 22:44:31 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR
2024/04/19 22:44:31 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR
2024/04/19 22:44:31 Start to initialize gl.Sumstats from file :./1kgeas.B1.glm.firth.gz
2024/04/19 22:44:32 -Reading columns : #CHROM,POS,OBS_CT,A1_FREQ,P,OR,Z_STAT,A1,LOG(OR)_SE,ALT,REF,ID
2024/04/19 22:44:32 -Renaming columns to : CHR,POS,N,EAF,P,OR,Z,EA,SE,ALT,REF,SNPID
2024/04/19 22:44:32 -Current Dataframe shape : 1128732 x 12
2024/04/19 22:44:32 -Initiating a status column: STATUS ...
2024/04/19 22:44:32 #WARNING! Version of genomic coordinates is unknown...
2024/04/19 22:44:33 NEA not available: assigning REF to NEA...
2024/04/19 22:44:33 -EA,REF and ALT columns are available: assigning NEA...
2024/04/19 22:44:33 -For variants with EA == ALT : assigning REF to NEA ...
2024/04/19 22:44:33 -For variants with EA != ALT : assigning ALT to NEA ...
2024/04/19 22:44:33 Start to reorder the columns...v3.4.43
2024/04/19 22:44:33 -Current Dataframe shape : 1128732 x 14 ; Memory usage: 106.06 MB
2024/04/19 22:44:33 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT
2024/04/19 22:44:33 Finished reordering the columns.
2024/04/19 22:44:33 -Column : SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT
2024/04/19 22:44:33 -DType : object int64 int64 category category float64 float64 float64 float64 float64 int64 category category category
2024/04/19 22:44:33 -Verified: T T T T T T T T T T T T T T
2024/04/19 22:44:33 -Current Dataframe memory usage: 106.06 MB
2024/04/19 22:44:33 Finished loading data successfully!
import pandas as pd
df = pd.concat([sumstats.data[["SNPID","CHR","POS","EA","NEA","EAF","N","STATUS","REF","ALT"]]]*20, ignore_index=True).copy()
df
| SNPID | CHR | POS | EA | NEA | EAF | N | STATUS | REF | ALT |
|---|---|---|---|---|---|---|---|---|---|
| 1:15774:G:A | 1 | 15774 | A | G | 0.028283 | 495 | 9999999 | G | A |
| 1:15777:A:G | 1 | 15777 | G | A | 0.073737 | 495 | 9999999 | A | G |
| 1:57292:C:T | 1 | 57292 | T | C | 0.104675 | 492 | 9999999 | C | T |
| 1:77874:G:A | 1 | 77874 | A | G | 0.019153 | 496 | 9999999 | G | A |
| 1:87360:C:T | 1 | 87360 | T | C | 0.023139 | 497 | 9999999 | C | T |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22:51217954:G:A | 22 | 51217954 | A | G | 0.033199 | 497 | 9999999 | G | A |
| 22:51218377:G:C | 22 | 51218377 | C | G | 0.033333 | 495 | 9999999 | G | C |
| 22:51218615:T:A | 22 | 51218615 | A | T | 0.033266 | 496 | 9999999 | T | A |
| 22:51222100:G:T | 22 | 51222100 | T | G | 0.039157 | 498 | 9999999 | G | T |
| 22:51239678:G:T | 22 | 51239678 | T | G | 0.034137 | 498 | 9999999 | G | T |
sumstats = gl.Sumstats(df,fmt="gwaslab")
df = 0
gc.collect()
stdout:
2024/04/19 22:44:35 GWASLab v3.4.43 https://cloufield.github.io/gwaslab/
2024/04/19 22:44:35 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/04/19 22:44:35 Start to load format from formatbook....
2024/04/19 22:44:35 -gwaslab format meta info:
2024/04/19 22:44:35 - format_name : gwaslab
2024/04/19 22:44:35 - format_source : https://cloufield.github.io/gwaslab/
2024/04/19 22:44:35 - format_version : 20231220_v4
2024/04/19 22:44:35 Start to initialize gl.Sumstats from pandas DataFrame ...
2024/04/19 22:44:43 -Reading columns : POS,EA,STATUS,CHR,N,REF,EAF,ALT,NEA,SNPID
2024/04/19 22:44:43 -Renaming columns to : POS,EA,STATUS,CHR,N,REF,EAF,ALT,NEA,SNPID
2024/04/19 22:44:43 -Current Dataframe shape : 22574640 x 10
2024/04/19 22:44:44 -Initiating a status column: STATUS ...
2024/04/19 22:44:44 #WARNING! Version of genomic coordinates is unknown...
2024/04/19 22:44:45 -EA,REF and ALT columns are available: assigning NEA...
2024/04/19 22:44:45 -For variants with EA == ALT : assigning REF to NEA ...
2024/04/19 22:44:46 -For variants with EA != ALT : assigning ALT to NEA ...
2024/04/19 22:44:46 Start to reorder the columns...v3.4.43
2024/04/19 22:44:46 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1053.33 MB
2024/04/19 22:44:46 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,N,STATUS,REF,ALT
2024/04/19 22:44:46 Finished reordering the columns.
2024/04/19 22:44:47 -Column : SNPID CHR POS EA NEA EAF N STATUS REF ALT
2024/04/19 22:44:47 -DType : object string int64 category category float64 int64 category category category
2024/04/19 22:44:47 -Verified: T F T T T T T T T T
2024/04/19 22:44:47 #WARNING! Columns with possibly incompatible dtypes: CHR
2024/04/19 22:44:47 -Current Dataframe memory usage: 1053.33 MB
2024/04/19 22:44:47 Finished loading data successfully!
0
sumstats.data["EA"]="AA"
sumstats.data["NEA"]="AT"
sumstats.basic_check()
stdout:
2024/04/19 22:44:47 Start to check SNPID/rsID...v3.4.43
2024/04/19 22:44:47 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1354.73 MB
2024/04/19 22:44:47 -Checking SNPID data type...
2024/04/19 22:44:47 -Converting SNPID to pd.string data type...
2024/04/19 22:44:48 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
2024/04/19 22:45:08 Finished checking SNPID/rsID.
2024/04/19 22:45:08 Start to fix chromosome notation (CHR)...v3.4.43
2024/04/19 22:45:08 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1354.73 MB
2024/04/19 22:45:08 -Checking CHR data type...
2024/04/19 22:45:59 -Variants with standardized chromosome notation: 22574640
2024/04/19 22:46:49 -All CHR are already fixed...
2024/04/19 22:47:49 Finished fixing chromosome notation (CHR).
2024/04/19 22:47:49 Start to fix basepair positions (POS)...v3.4.43
2024/04/19 22:47:49 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1376.26 MB
2024/04/19 22:47:49 -Converting to Int64 data type ...
2024/04/19 22:48:01 -Position bound:(0 , 250,000,000)
2024/04/19 22:48:50 -Removed outliers: 0
2024/04/19 22:48:53 Finished fixing basepair positions (POS).
2024/04/19 22:48:53 Start to fix alleles (EA and NEA)...v3.4.43
2024/04/19 22:48:53 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1570.02 MB
2024/04/19 22:48:53 -Converted all bases to string datatype and UPPERCASE.
2024/04/19 22:49:16 -Variants with bad EA : 0
2024/04/19 22:49:17 -Variants with bad NEA : 0
2024/04/19 22:49:19 -Variants with NA for EA or NEA: 0
2024/04/19 22:49:20 -Variants with same EA and NEA: 0
2024/04/19 22:49:21 -Detected 0 variants with alleles that contain bases other than A/C/T/G .
2024/04/19 22:49:41 Finished fixing alleles (EA and NEA).
2024/04/19 22:49:41 Start to perform sanity check for statistics...v3.4.43
2024/04/19 22:49:41 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1268.61 MB
2024/04/19 22:49:41 -Comparison tolerance for floats: 1e-07
2024/04/19 22:49:42 -Checking if 0 <= N <= 2147483647 ...
2024/04/19 22:50:34 -Removed 0 variants with bad/na N.
2024/04/19 22:50:34 -Checking if -1e-07 < EAF < 1.0000001 ...
2024/04/19 22:50:37 -Removed 0 variants with bad/na EAF.
2024/04/19 22:50:37 -Checking STATUS and converting STATUS to categories....
2024/04/19 22:50:38 -Removed 0 variants with bad statistics in total.
2024/04/19 22:50:38 -Data types for each column:
2024/04/19 22:50:38 -Column : SNPID CHR POS EA NEA EAF N STATUS REF ALT
2024/04/19 22:50:38 -DType : string Int64 Int64 category category float32 Int64 category category category
2024/04/19 22:50:38 -Verified: T T T T T T T T T T
2024/04/19 22:50:38 Finished sanity check for statistics.
2024/04/19 22:50:38 Start to check data consistency across columns...v3.4.43
2024/04/19 22:50:38 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1204.03 MB
2024/04/19 22:50:38 -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
2024/04/19 22:50:38 -No availalbe columns for data consistency checking...Skipping...
2024/04/19 22:50:38 Finished checking data consistency across columns.
2024/04/19 22:50:38 Start to normalize indels...v3.4.43
2024/04/19 22:50:38 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1204.03 MB
2024/04/19 22:56:44 -Not normalized allele IDs:1:15774:G:A 1:15777:A:G 1:57292:C:T 1:77874:G:A 1:87360:C:T ...
2024/04/19 22:56:46 -Not normalized allele:['AA' 'AT']['AA' 'AT']['AA' 'AT']['AA' 'AT']['AA' 'AT']...
2024/04/19 22:56:48 -Modified 22574640 variants according to parsimony and left alignment principal.
2024/04/19 22:57:07 Finished normalizing indels.
2024/04/19 22:57:07 Start to sort the genome coordinates...v3.4.43
2024/04/19 22:57:07 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1204.03 MB
2024/04/19 22:57:12 Finished sorting coordinates.
2024/04/19 22:57:12 Start to reorder the columns...v3.4.43
2024/04/19 22:57:12 -Current Dataframe shape : 22574640 x 10 ; Memory usage: 1031.80 MB
2024/04/19 22:57:12 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,N,STATUS,REF,ALT
2024/04/19 22:57:13 Finished reordering the columns.
sumstats.memory_usage()
Data standardization and sanity check
sumstats.basic_check(sanitycheckstats_args={"OR":(-10,10)})
stdout:
Fri Apr 19 20:13:50 2024 Start to check IDs...
Fri Apr 19 20:13:50 2024 -Current Dataframe shape : 1128732 x 14
Fri Apr 19 20:13:50 2024 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)
Fri Apr 19 20:13:51 2024 Finished checking IDs successfully!
Fri Apr 19 20:13:51 2024 Start to fix chromosome notation...
Fri Apr 19 20:13:51 2024 -Current Dataframe shape : 1128732 x 14
Fri Apr 19 20:13:51 2024 -Checking CHR data type...
Fri Apr 19 20:13:54 2024 -Variants with standardized chromosome notation: 1128732
Fri Apr 19 20:13:57 2024 -All CHR are already fixed...
Fri Apr 19 20:13:58 2024 -Sanity check for CHR...
Fri Apr 19 20:14:00 2024 -Removed 0 varaints with CHR < 1...
Fri Apr 19 20:14:01 2024 Finished fixing chromosome notation successfully!
Fri Apr 19 20:14:01 2024 Start to fix basepair positions...
Fri Apr 19 20:14:01 2024 -Current Dataframe shape : 1128732 x 14
Fri Apr 19 20:14:01 2024 -Converting to Int64 data type ...
Fri Apr 19 20:14:02 2024 -Position upper_bound is: 250,000,000
Fri Apr 19 20:14:04 2024 -Remove outliers: 0
Fri Apr 19 20:14:04 2024 -Converted all position to datatype Int64.
Fri Apr 19 20:14:04 2024 Finished fixing basepair position successfully!
Fri Apr 19 20:14:04 2024 Start to fix alleles...
Fri Apr 19 20:14:04 2024 -Current Dataframe shape : 1128732 x 14
Fri Apr 19 20:14:04 2024 -Converted all bases to string datatype and UPPERCASE.
Fri Apr 19 20:14:05 2024 -Detected 0 variants with alleles that contain bases other than A/C/T/G .
Fri Apr 19 20:14:06 2024 Finished fixing allele successfully!
Fri Apr 19 20:14:06 2024 Start sanity check for statistics ...
Fri Apr 19 20:14:06 2024 -Current Dataframe shape : 1128732 x 14
Fri Apr 19 20:14:06 2024 -Checking if 0 <=N<= 2147483647 ...
Fri Apr 19 20:14:07 2024 -Removed 0 variants with bad N.
Fri Apr 19 20:14:07 2024 -Checking if 0 <EAF< 1 ...
Fri Apr 19 20:14:07 2024 -Removed 0 variants with bad EAF.
Fri Apr 19 20:14:07 2024 -Checking if 0 <=MAC<= 2147483647 ...
Fri Apr 19 20:14:07 2024 -Removed 0 variants with bad MAC.
Fri Apr 19 20:14:07 2024 -Checking if -37.5 <Z< 37.5 ...
Fri Apr 19 20:14:07 2024 -Removed 220793 variants with bad Z.
Fri Apr 19 20:14:07 2024 -Checking if 5e-300 < P < 1.000001 ...
Fri Apr 19 20:14:07 2024 -Removed 0 variants with bad P.
Fri Apr 19 20:14:07 2024 -Checking if 0 <SE< inf ...
Fri Apr 19 20:14:07 2024 -Removed 0 variants with bad SE.
Fri Apr 19 20:14:07 2024 -Checking if -10 <log(OR)< 10 ...
Fri Apr 19 20:14:07 2024 -Removed 0 variants with bad OR.
Fri Apr 19 20:14:07 2024 -Checking STATUS and converting STATUS to categories....
Fri Apr 19 20:14:07 2024 -Dropping 0 variants with NAs in the checked columns...
Fri Apr 19 20:14:07 2024 -Removed 220793 variants with bad statistics in total.
Fri Apr 19 20:14:07 2024 -Data types for each column:
Fri Apr 19 20:14:07 2024 -Column: SNPID CHR POS EA NEA EAF SE Z P OR N STATUS REF ALT
Fri Apr 19 20:14:07 2024 -DType : object Int64 Int64 category category float32 float32 float32 float64 float32 Int32 category category category
Fri Apr 19 20:14:07 2024 Finished sanity check successfully!
Fri Apr 19 20:14:08 2024 Start to normalize variants...
Fri Apr 19 20:14:08 2024 -Current Dataframe shape : 907939 x 14
Fri Apr 19 20:14:08 2024 -No available variants to normalize..
Fri Apr 19 20:14:08 2024 Finished normalizing variants successfully!
Fri Apr 19 20:14:08 2024 Start to sort the genome coordinates...
Fri Apr 19 20:14:08 2024 -Current Dataframe shape : 907939 x 14
Fri Apr 19 20:14:08 2024 -Sorting genome coordinates...
Fri Apr 19 20:14:09 2024 Finished sorting genome coordinates successfully!
Fri Apr 19 20:14:09 2024 Start to reorder the columns...
Fri Apr 19 20:14:09 2024 -Current Dataframe shape : 907939 x 14
Fri Apr 19 20:14:09 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT
Fri Apr 19 20:14:09 2024 Finished sorting columns successfully!
Note: 220793 variants were removed due to na Z values.This is due to FIRTH_CONVERGE_FAIL when performing GWAS using PLINK2.
Extract lead variants
sumstats.get_lead()
stdout:
Fri Apr 19 20:14:09 2024 Start to extract lead variants...
Fri Apr 19 20:14:09 2024 -Processing 907939 variants...
Fri Apr 19 20:14:09 2024 -Significance threshold : 5e-08
Fri Apr 19 20:14:09 2024 -Sliding window size: 500 kb
Fri Apr 19 20:14:09 2024 -Found 43 significant variants in total...
Fri Apr 19 20:14:09 2024 -Identified 4 lead variants!
Fri Apr 19 20:14:09 2024 Finished extracting lead variants successfully!
| SNPID | CHR | POS | EA | NEA | EAF | SE | Z | P | OR | N | STATUS | REF | ALT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:167562605:G:A | 1 | 167562605 | A | G | 0.391481 | 0.159645 | 7.69462 | 1.419150e-14 | 3.415780 | 493 | 9960099 | G | A |
| 2:55513738:C:T | 2 | 55513738 | C | T | 0.376008 | 0.153159 | -7.96244 | 1.686760e-15 | 0.295373 | 496 | 9960099 | C | T |
| 7:134368632:T:G | 7 | 134368632 | G | T | 0.138105 | 0.225526 | 6.89025 | 5.569440e-12 | 4.730010 | 496 | 9960099 | T | G |
| 20:42758834:T:C | 20 | 42758834 | T | C | 0.227273 | 0.184323 | -7.76902 | 7.909780e-15 | 0.238829 | 495 | 9960099 | T | C |
Create manhattan plot for checking
sumstats.plot_mqq()
stdout:
Fri Apr 19 20:15:00 2024 Start to plot manhattan/qq plot with the following basic settings:
Fri Apr 19 20:15:00 2024 -Genomic coordinates version: 99...
Fri Apr 19 20:15:00 2024 -WARNING!!! Genomic coordinates version is unknown...
Fri Apr 19 20:15:00 2024 -Genome-wide significance level is set to 5e-08 ...
Fri Apr 19 20:15:00 2024 -Raw input contains 907939 variants...
Fri Apr 19 20:15:00 2024 -Plot layout mode is : mqq
Fri Apr 19 20:15:00 2024 Finished loading specified columns from the sumstats.
Fri Apr 19 20:15:00 2024 Start conversion and sanity check:
Fri Apr 19 20:15:00 2024 -Removed 0 variants with nan in CHR or POS column ...
Fri Apr 19 20:15:02 2024 -Removed 0 varaints with CHR <=0...
Fri Apr 19 20:15:02 2024 -Removed 0 variants with nan in P column ...
Fri Apr 19 20:15:02 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Fri Apr 19 20:15:02 2024 -Sumstats P values are being converted to -log10(P)...
Fri Apr 19 20:15:02 2024 -Sanity check: 0 na/inf/-inf variants will be removed...
Fri Apr 19 20:15:02 2024 -Maximum -log10(P) values is 14.772946706439042 .
Fri Apr 19 20:15:02 2024 Finished data conversion and sanity check.
Fri Apr 19 20:15:03 2024 Start to create manhattan plot with 907939 variants:
Fri Apr 19 20:15:06 2024 -Found 4 significant variants with a sliding window size of 500 kb...
Fri Apr 19 20:15:06 2024 Finished creating Manhattan plot successfully!
Fri Apr 19 20:15:06 2024 -Skip annotating
Fri Apr 19 20:15:06 2024 Start to create QQ plot with 907939 variants:
Fri Apr 19 20:15:06 2024 Expected range of P: (0,1.0)
Fri Apr 19 20:15:06 2024 -Lambda GC (MLOG10P mode) at 0.5 is 0.98908
Fri Apr 19 20:15:06 2024 Finished creating QQ plot successfully!
Fri Apr 19 20:15:06 2024 -Skip saving figures!
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f8f8f8cf220>)

Extract the variants around 2:55513738:C:T for finemapping
locus = sumstats.filter_value('CHR==2 & POS>55013738 & POS<56013738')
stdout:
Fri Apr 19 20:15:18 2024 Start filtering values by condition: CHR==2 & POS>55013738 & POS<56013738
Fri Apr 19 20:15:18 2024 -Removing 907560 variants not meeting the conditions: CHR==2 & POS>55013738 & POS<56013738
Fri Apr 19 20:15:18 2024 Finished filtering values.
Convert OR to BETA
locus.fill_data(to_fill=["BETA"])
stdout:
Fri Apr 19 20:15:18 2024 Start filling data using existing columns...
Fri Apr 19 20:15:18 2024 -Raw input columns: ['SNPID', 'CHR', 'POS', 'EA', 'NEA', 'EAF', 'SE', 'Z', 'P', 'OR', 'N', 'STATUS', 'REF', 'ALT']
Fri Apr 19 20:15:18 2024 -Overwrite mode: False
Fri Apr 19 20:15:18 2024 -Skipping columns: []
Fri Apr 19 20:15:18 2024 -Filling columns: ['BETA']
Fri Apr 19 20:15:18 2024 - Filling Columns iteratively...
Fri Apr 19 20:15:18 2024 - Filling BETA value using OR column...
Fri Apr 19 20:15:18 2024 Start to reorder the columns...
Fri Apr 19 20:15:18 2024 -Current Dataframe shape : 379 x 15
Fri Apr 19 20:15:18 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT
Fri Apr 19 20:15:18 2024 Finished sorting columns successfully!
Fri Apr 19 20:15:18 2024 Finished filling data using existing columns.
locus.data
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2:55015281:A:T | 2 | 55015281 | T | A | 0.126263 | -0.048075 | 0.193967 | -0.247856 | 0.804246 | 0.953062 | 495 | 9960099 | A | T |
| 2:55015604:G:A | 2 | 55015604 | A | G | 0.119192 | -0.047357 | 0.195199 | -0.242606 | 0.808311 | 0.953747 | 495 | 9960099 | G | A |
| 2:55015764:G:A | 2 | 55015764 | A | G | 0.339394 | 0.028986 | 0.135064 | 0.214575 | 0.830098 | 1.029410 | 495 | 9960099 | G | A |
| 2:55016143:A:C | 2 | 55016143 | C | A | 0.126263 | 0.004659 | 0.195728 | 0.023784 | 0.981025 | 1.004670 | 495 | 9960099 | A | C |
| 2:55017199:T:C | 2 | 55017199 | C | T | 0.093306 | 0.268767 | 0.219657 | 1.223580 | 0.221112 | 1.308350 | 493 | 9960099 | T | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2:56004219:G:T | 2 | 56004219 | G | T | 0.171717 | 0.148489 | 0.169557 | 0.875763 | 0.381159 | 1.160080 | 495 | 9960099 | G | T |
| 2:56007034:T:C | 2 | 56007034 | T | C | 0.260121 | 0.073325 | 0.145565 | 0.503737 | 0.614446 | 1.076080 | 494 | 9960099 | T | C |
| 2:56008984:C:G | 2 | 56008984 | G | C | 0.013185 | 0.205883 | 0.547226 | 0.376227 | 0.706748 | 1.228610 | 493 | 9960099 | C | G |
| 2:56009480:A:T | 2 | 56009480 | A | T | 0.157258 | 0.135667 | 0.177621 | 0.763784 | 0.444996 | 1.145300 | 496 | 9960099 | A | T |
| 2:56010434:C:T | 2 | 56010434 | T | C | 0.017172 | 0.300305 | 0.491815 | 0.610604 | 0.541462 | 1.350270 | 495 | 9960099 | C | T |
Align NEA with reference sequence
locus.harmonize(basic_check=False, ref_seq="/home/yunye/mydata/d_disk/human_g1k_v37.fasta")
stdout:
Fri Apr 19 20:15:18 2024 Start to check if NEA is aligned with reference sequence...
Fri Apr 19 20:15:18 2024 -Current Dataframe shape : 379 x 15
Fri Apr 19 20:15:18 2024 -Reference genome fasta file: /home/yunye/mydata/d_disk/human_g1k_v37.fasta
Fri Apr 19 20:15:18 2024 -Checking records: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT
Fri Apr 19 20:16:30 2024 -Variants allele on given reference sequence : 264
Fri Apr 19 20:16:30 2024 -Variants flipped : 115
Fri Apr 19 20:16:30 2024 -Raw Matching rate : 100.00%
Fri Apr 19 20:16:30 2024 -Variants inferred reverse_complement : 0
Fri Apr 19 20:16:30 2024 -Variants inferred reverse_complement_flipped : 0
Fri Apr 19 20:16:30 2024 -Both allele on genome + unable to distinguish : 0
Fri Apr 19 20:16:30 2024 -Variants not on given reference sequence : 0
Fri Apr 19 20:16:30 2024 -Current Dataframe shape : 379 x 15
Fri Apr 19 20:16:30 2024 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: alt->ea , ref->nea ...
Fri Apr 19 20:16:30 2024 -Flipping 115 variants...
Fri Apr 19 20:16:30 2024 -Swapping column: NEA <=> EA...
Fri Apr 19 20:16:30 2024 -Flipping column: BETA = - BETA...
Fri Apr 19 20:16:30 2024 -Flipping column: EAF = 1 - EAF...
Fri Apr 19 20:16:30 2024 -Flipping column: OR = 1 / OR...
Fri Apr 19 20:16:30 2024 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x
Fri Apr 19 20:16:30 2024 Finished converting successfully!
Fri Apr 19 20:16:30 2024 Start to sort the genome coordinates...
Fri Apr 19 20:16:30 2024 -Current Dataframe shape : 379 x 15
Fri Apr 19 20:16:30 2024 -Sorting genome coordinates...
Fri Apr 19 20:16:30 2024 Finished sorting genome coordinates successfully!
Fri Apr 19 20:16:30 2024 Start to reorder the columns...
Fri Apr 19 20:16:30 2024 -Current Dataframe shape : 379 x 15
Fri Apr 19 20:16:30 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,P,OR,N,STATUS,REF,ALT
Fri Apr 19 20:16:30 2024 Finished sorting columns successfully!
<gwaslab.Sumstats.Sumstats at 0x7f8f826ef6d0>
locus.data
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2:55015281:A:T | 2 | 55015281 | T | A | 0.126263 | -0.048075 | 0.193967 | -0.247856 | 0.804246 | 0.953062 | 495 | 9960009 | A | T |
| 2:55015604:G:A | 2 | 55015604 | A | G | 0.119192 | -0.047357 | 0.195199 | -0.242606 | 0.808311 | 0.953747 | 495 | 9960009 | G | A |
| 2:55015764:G:A | 2 | 55015764 | A | G | 0.339394 | 0.028986 | 0.135064 | 0.214575 | 0.830098 | 1.029410 | 495 | 9960009 | G | A |
| 2:55016143:A:C | 2 | 55016143 | C | A | 0.126263 | 0.004659 | 0.195728 | 0.023784 | 0.981025 | 1.004670 | 495 | 9960009 | A | C |
| 2:55017199:T:C | 2 | 55017199 | C | T | 0.093306 | 0.268767 | 0.219657 | 1.223580 | 0.221112 | 1.308350 | 493 | 9960009 | T | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2:56004219:G:T | 2 | 56004219 | T | G | 0.828283 | -0.148489 | 0.169557 | 0.875763 | 0.381159 | 0.862010 | 495 | 9960019 | G | T |
| 2:56007034:T:C | 2 | 56007034 | C | T | 0.739879 | -0.073325 | 0.145565 | 0.503737 | 0.614446 | 0.929299 | 494 | 9960019 | T | C |
| 2:56008984:C:G | 2 | 56008984 | G | C | 0.013185 | 0.205883 | 0.547226 | 0.376227 | 0.706748 | 1.228610 | 493 | 9960009 | C | G |
| 2:56009480:A:T | 2 | 56009480 | T | A | 0.842742 | -0.135667 | 0.177621 | 0.763784 | 0.444996 | 0.873134 | 496 | 9960019 | A | T |
| 2:56010434:C:T | 2 | 56010434 | T | C | 0.017172 | 0.300305 | 0.491815 | 0.610604 | 0.541462 | 1.350270 | 495 | 9960009 | C | T |
Output the sumstats of this locus
locus.data.to_csv("sig_locus.tsv",sep="\t",index=None)
locus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)
Run PLINK to get LD matrix for this locus
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r square \
--extract sig_locus.snplist \
--out sig_locus_mt
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract sig_locus.snplist \
--out sig_locus_mt_r2
stdout:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to sig_locus_mt.log.
Options in effect:
--bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
--extract sig_locus.snplist
--keep-allele-order
--out sig_locus_mt
--r square
24011 MB RAM detected; reserving 12005 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to sig_locus_mt.nosex .
--extract: 379 variants remaining.
Using up to 11 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.992472.
379 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r square to sig_locus_mt.ld ... 0% [processingwriting] done.
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to sig_locus_mt_r2.log.
Options in effect:
--bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
--extract sig_locus.snplist
--keep-allele-order
--out sig_locus_mt_r2
--r2 square
24011 MB RAM detected; reserving 12005 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to sig_locus_mt_r2.nosex .
--extract: 379 variants remaining.
Using up to 11 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.992472.
379 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to sig_locus_mt_r2.ld ... 0% [processingwriting] done.
Finemapping
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri as numpy2ri
numpy2ri.activate()
stderr:
INFO:rpy2.situation:cffi mode is CFFI_MODE.ANY
INFO:rpy2.situation:R home found: /home/yunye/anaconda3/lib/R
INFO:rpy2.situation:R library path:
INFO:rpy2.situation:LD_LIBRARY_PATH:
INFO:rpy2.rinterface_lib.embedded:Default options to initialize R: rpy2, --quiet, --no-save
INFO:rpy2.rinterface_lib.embedded:R is already initialized. No need to initialize.
Load locus sumstats
df = pd.read_csv("sig_locus.tsv",sep="\t")
df
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2:55015281:A:T | 2 | 55015281 | T | A | 0.126263 | -0.048075 | 0.193967 | -0.247856 | 0.804246 | 0.953062 | 495 | 9960009 | A | T |
| 2:55015604:G:A | 2 | 55015604 | A | G | 0.119192 | -0.047357 | 0.195199 | -0.242606 | 0.808311 | 0.953747 | 495 | 9960009 | G | A |
| 2:55015764:G:A | 2 | 55015764 | A | G | 0.339394 | 0.028986 | 0.135064 | 0.214575 | 0.830098 | 1.029410 | 495 | 9960009 | G | A |
| 2:55016143:A:C | 2 | 55016143 | C | A | 0.126263 | 0.004659 | 0.195728 | 0.023784 | 0.981025 | 1.004670 | 495 | 9960009 | A | C |
| 2:55017199:T:C | 2 | 55017199 | C | T | 0.093306 | 0.268767 | 0.219657 | 1.223580 | 0.221112 | 1.308350 | 493 | 9960009 | T | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2:56004219:G:T | 2 | 56004219 | T | G | 0.828283 | -0.148489 | 0.169557 | 0.875763 | 0.381159 | 0.862009 | 495 | 9960019 | G | T |
| 2:56007034:T:C | 2 | 56007034 | C | T | 0.739879 | -0.073325 | 0.145565 | 0.503737 | 0.614446 | 0.929299 | 494 | 9960019 | T | C |
| 2:56008984:C:G | 2 | 56008984 | G | C | 0.013185 | 0.205883 | 0.547226 | 0.376227 | 0.706748 | 1.228610 | 493 | 9960009 | C | G |
| 2:56009480:A:T | 2 | 56009480 | T | A | 0.842742 | -0.135667 | 0.177621 | 0.763784 | 0.444996 | 0.873134 | 496 | 9960019 | A | T |
| 2:56010434:C:T | 2 | 56010434 | T | C | 0.017172 | 0.300305 | 0.491815 | 0.610604 | 0.541462 | 1.350270 | 495 | 9960009 | C | T |
Import sumsieR
# import susieR as object
susieR = importr('susieR')
Load LD matrix
# convert pd.DataFrame to numpy
ld = pd.read_csv("sig_locus_mt.ld",sep="\t",header=None)
R_df = ld.values
ld2 = pd.read_csv("sig_locus_mt_r2.ld",sep="\t",header=None)
R_df2 = ld2.values
R_df
array([[ 1.00000e+00, 9.58562e-01, -3.08678e-01, ..., 1.96204e-02,
-3.54602e-04, -7.14868e-03],
[ 9.58562e-01, 1.00000e+00, -2.97617e-01, ..., 2.47755e-02,
-1.49234e-02, -7.00509e-03],
[-3.08678e-01, -2.97617e-01, 1.00000e+00, ..., -3.49335e-02,
-1.37163e-02, -2.12828e-02],
...,
[ 1.96204e-02, 2.47755e-02, -3.49335e-02, ..., 1.00000e+00,
5.26193e-02, -3.09069e-02],
[-3.54602e-04, -1.49234e-02, -1.37163e-02, ..., 5.26193e-02,
1.00000e+00, -3.01142e-01],
[-7.14868e-03, -7.00509e-03, -2.12828e-02, ..., -3.09069e-02,
-3.01142e-01, 1.00000e+00]])
Visualize the LD structure of this locus
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=R_df,cmap="Spectral",ax=ax[0])
sns.heatmap(data=R_df2,ax=ax[1])
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
Text(0.5, 1.0, 'LD r2 matrix')
<Figure size 2000x2000 with 0 Axes>

Run finemapping use susieR
ro.r('set.seed(123)')
fit = susieR.susie_rss(
bhat = df["BETA"].values.reshape((len(R_df),1)),
shat = df["SE"].values.reshape((len(R_df),1)),
R = R_df,
L = 10,
n = 503
)
Extract credible sets and PIP
# show the results of susie_get_cs
print(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
stdout:
$L1
[1] 200 218 221 224
We found 1 credible set here
# add the information to dataframe for plotting
df["cs"] = 0
n_cs=len(susieR.susie_get_cs(fit, coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0])
for i in range(n_cs):
cs_index = susieR.susie_get_cs(fit,coverage = 0.95,min_abs_corr = 0.5,Xcorr = R_df)[0][i]
df.loc[np.array(cs_index)-1,"cs"] = i + 1
df["pip"] = np.array(susieR.susie_get_pip(fit))
Create regional plot
fig ,axes = plt.subplots(nrows=2,sharex=True,figsize=(15,7),height_ratios=(4,1))
df["MLOG10P"] = -np.log10(df["P"])
col_to_plot = "MLOG10P"
p=axes[0].scatter(df["POS"],df[col_to_plot],c=ld[df["P"].idxmin()]**2)
axes[0].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,col_to_plot],
marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")
axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
marker='x',s=40,c="red",edgecolors='black',label="Causal")
plt.colorbar( p , label="Rsq with the lead variant")
axes[0].set_xlabel("position")
axes[0].set_xlim((55400000, 55800000))
axes[0].set_ylabel(col_to_plot)
axes[0].legend()
p=axes[1].scatter(df["POS"],df["pip"],c=ld[df["P"].idxmin()]**2)
axes[1].scatter(df.loc[df["cs"]==1,"POS"],df.loc[df["cs"]==1,"pip"],
marker='o',s=40,c="None",edgecolors='black',label="Variants in credible set 1")
plt.colorbar( p , label="Rsq with the lead variant")
axes[1].set_xlabel("position")
axes[1].set_xlim((55400000, 55800000))
axes[1].set_ylabel("PIP")
axes[1].legend()
stderr:
/tmp/ipykernel_3707/3928380454.py:9: UserWarning: You passed a edgecolor/edgecolors ('black') for an unfilled marker ('x'). Matplotlib is ignoring the edgecolor in favor of the facecolor. This behavior may change in the future.
axes[0].scatter(df.loc[(df["CHR"]==2)&(df["POS"]==55620927),"POS"],df.loc[(df["CHR"]==2)&(df["POS"]==55620927),col_to_plot],
<matplotlib.legend.Legend at 0x7f8faaa6ccd0>

Pitfalls
The causal variant we used to simulate is actually 2:55620927:G:A, which was filtered out during data preparation due to FIRTH_CONVERGE_FAIL. So the credible set we identified does not really include the bona fide causal variant.
Lets then check the variants in credible set
df.loc[np.array(cs_index)-1,:]
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | P | OR | N | STATUS | REF | ALT | cs | pip | MLOG10P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2:55513738:C:T | 2 | 55513738 | T | C | 0.623992 | 1.219516 | 0.153159 | -7.96244 | 1.686760e-15 | 3.385550 | 496 | 9960019 | C | T | 1 | 0.325435 | 14.772947 |
| 2:55605943:A:G | 2 | 55605943 | G | A | 0.685484 | 1.321987 | 0.166688 | -7.93089 | 2.175840e-15 | 3.750868 | 496 | 9960019 | A | G | 1 | 0.267953 | 14.662373 |
| 2:55612986:G:C | 2 | 55612986 | C | G | 0.685223 | 1.302133 | 0.166154 | -7.83691 | 4.617840e-15 | 3.677133 | 494 | 9960019 | G | C | 1 | 0.150449 | 14.335561 |
| 2:55622624:G:A | 2 | 55622624 | A | G | 0.688508 | 1.324109 | 0.167119 | -7.92315 | 2.315640e-15 | 3.758833 | 496 | 9960019 | G | A | 1 | 0.255448 | 14.635329 |
Check LD of the causal variant and variants in the credible set
!echo "2:55513738:C:T" > credible.snplist
!echo "2:55605943:A:G" >> credible.snplist
!echo "2:55612986:G:C" >> credible.snplist
!echo "2:55620927:G:A" >> credible.snplist
!echo "2:55622624:G:A" >> credible.snplist
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract credible.snplist \
--out credible_r
!plink \
--bfile "../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing" \
--keep-allele-order \
--r2 square \
--extract credible.snplist \
--out credible_r2
stdout:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to credible_r.log.
Options in effect:
--bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
--extract credible.snplist
--keep-allele-order
--out credible_r
--r2 square
24011 MB RAM detected; reserving 12005 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to credible_r.nosex .
--extract: 5 variants remaining.
Using up to 11 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.995635.
5 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to credible_r.ld ... 0% [processingwriting] done.
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to credible_r2.log.
Options in effect:
--bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
--extract credible.snplist
--keep-allele-order
--out credible_r2
--r2 square
24011 MB RAM detected; reserving 12005 MB for main workspace.
1235116 variants loaded from .bim file.
504 people (0 males, 0 females, 504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to credible_r2.nosex .
--extract: 5 variants remaining.
Using up to 11 threads (change this with --threads).
Before main variant filters, 504 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.995635.
5 variants and 504 people pass filters and QC.
Note: No phenotypes present.
--r2 square to credible_r2.ld ... 0% [processingwriting] done.
Load LD and plot
credible_snplist=["2:55513738:C:T","2:55605943:A:G", "2:55612986:G:C", "2:55620927:G:A", "2:55622624:G:A"]
ld = pd.read_csv("credible_r.ld",sep="\t",header=None)
ld.columns=credible_snplist
ld.index=credible_snplist
ld2 = pd.read_csv("credible_r2.ld",sep="\t",header=None)
ld2.columns=credible_snplist
ld2.index=credible_snplist
plt.figure(figsize=(10,10),dpi=200)
fig, ax = plt.subplots(ncols=2,figsize=(20,10))
sns.heatmap(data=ld, cmap="Spectral_r",ax=ax[0],center=0)
sns.heatmap(data=ld2,cmap="Spectral_r",ax=ax[1],vmin=0,vmax=1)
ax[0].set_title("LD r matrix")
ax[1].set_title("LD r2 matrix")
Text(0.5, 1.0, 'LD r2 matrix')
<Figure size 2000x2000 with 0 Axes>

Variants in the credible set are in strong LD with the bona fide causal variant.
This could also happen in real-world analysis. Please always be cautious when interpreting fine-mapping results.