Clumping by calling PLINK¶
In [1]:
Copied!
import gwaslab as gl
import gwaslab as gl
In [2]:
Copied!
gl.show_version()
gl.show_version()
2025/07/18 17:37:30 GWASLab v3.6.7 https://cloufield.github.io/gwaslab/ 2025/07/18 17:37:30 (C) 2022-2025, Yunye He, Kamatani Lab, GPL-3.0 license, gwaslab@gmail.com 2025/07/18 17:37:30 Python version: 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:43:22) [GCC 12.3.0]
Load sample data and perform QC¶
In [3]:
Copied!
mysumstats = gl.Sumstats("../0_sample_data/t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
se="SE",
p="P",
nrows=1000000 # Load only the first 1 million lines for demonstration purposes
)
mysumstats.basic_check(verbose=False)
mysumstats.fix_id(fixsep=True)
mysumstats = gl.Sumstats("../0_sample_data/t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
se="SE",
p="P",
nrows=1000000 # Load only the first 1 million lines for demonstration purposes
)
mysumstats.basic_check(verbose=False)
mysumstats.fix_id(fixsep=True)
2025/07/18 17:32:20 GWASLab v3.6.6 https://cloufield.github.io/gwaslab/ 2025/07/18 17:32:20 (C) 2022-2025, Yunye He, Kamatani Lab, GPL-3.0 license, gwaslab@gmail.com 2025/07/18 17:32:20 Python version: 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:43:22) [GCC 12.3.0] 2025/07/18 17:32:20 Start to initialize gl.Sumstats from file :../0_sample_data/t2d_bbj.txt.gz 2025/07/18 17:32:20 -Reading columns : POS,SE,SNP,P,CHR 2025/07/18 17:32:20 -Renaming columns to : POS,SE,SNPID,P,CHR 2025/07/18 17:32:20 -Current Dataframe shape : 1000000 x 5 2025/07/18 17:32:20 -Initiating a status column: STATUS ... 2025/07/18 17:32:20 #WARNING! Version of genomic coordinates is unknown... 2025/07/18 17:32:21 Start to reorder the columns...v3.6.6 2025/07/18 17:32:21 -Current Dataframe shape : 1000000 x 6 ; Memory usage: 63.43 MB 2025/07/18 17:32:21 -Reordering columns to : SNPID,CHR,POS,SE,P,STATUS 2025/07/18 17:32:21 Finished reordering the columns. 2025/07/18 17:32:21 -Trying to convert datatype for CHR: string -> Int64...Int64 2025/07/18 17:32:21 -Column : SNPID CHR POS SE P STATUS 2025/07/18 17:32:21 -DType : object Int64 int64 float64 float64 category 2025/07/18 17:32:21 -Verified: T T T T T T 2025/07/18 17:32:21 -Current Dataframe memory usage: 64.38 MB 2025/07/18 17:32:21 Finished loading data successfully! 2025/07/18 17:32:27 #WARNING! Necessary columns for .fix_allele() were not detected:EA,NEA 2025/07/18 17:32:28 #WARNING! Necessary columns for .normalize() were not detected:EA,NEA 2025/07/18 17:32:28 Start to check SNPID/rsID...v3.6.6 2025/07/18 17:32:28 -Current Dataframe shape : 1000000 x 6 ; Memory usage: 65.33 MB 2025/07/18 17:32:28 -Checking SNPID data type... 2025/07/18 17:32:28 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _) 2025/07/18 17:32:31 -Replacing [_-] in SNPID with ":" ... 2025/07/18 17:32:31 Finished checking SNPID/rsID.
In [4]:
Copied!
mysumstats.data
mysumstats.data
Out[4]:
| SNPID | CHR | POS | SE | P | STATUS | |
|---|---|---|---|---|---|---|
| 0 | 1:725932:G:A | 1 | 725932 | 0.1394 | 0.59700 | 9960999 |
| 1 | 1:725933:A:G | 1 | 725933 | 0.1394 | 0.59730 | 9960999 |
| 2 | 1:737801:T:C | 1 | 737801 | 0.1231 | 0.69080 | 9960999 |
| 3 | 1:749963:T:TAA | 1 | 749963 | 0.0199 | 0.28460 | 9960999 |
| 4 | 1:751343:T:A | 1 | 751343 | 0.0156 | 0.27050 | 9960999 |
| ... | ... | ... | ... | ... | ... | ... |
| 999995 | 2:6347639:C:A | 2 | 6347639 | 0.0111 | 0.15100 | 9960999 |
| 999996 | 2:6347694:G:C | 2 | 6347694 | 0.0111 | 0.17250 | 9960999 |
| 999997 | 2:6348478:G:A | 2 | 6348478 | 0.0111 | 0.17160 | 9960999 |
| 999998 | 2:6348490:G:C | 2 | 6348490 | 0.2032 | 0.11750 | 9960999 |
| 999999 | 2:6348754:A:C | 2 | 6348754 | 0.0126 | 0.01846 | 9960999 |
1000000 rows × 6 columns
Run clumping by calling plink2¶
GWASLab will extract the chromosomes with lead variants and prepare input files for running clumping by calling PLINK2
- VCF: reference LD VCF file (it will be converted to bfile)
- bfile: PLINK bfile prefix
- pfile: PLINK2 pfile prefix
Default parameters:
- clump_p1=5e-8
- clump_p2=5e-8
- clump_r2=0.01
- clump_kb=250
Note:
- plink2 need to be avaialble in your environment, or you can specify the path for PLINK2.
- out: output prefix
In [5]:
Copied!
mysumstats.clump( out="./t2d_bbj",
plink2="plink2", # default
clump_r2=0.1,
clump_p2=1e-6,
vcf=gl.get_path("1kg_eas_hg19"))
mysumstats.clump( out="./t2d_bbj",
plink2="plink2", # default
clump_r2=0.1,
clump_p2=1e-6,
vcf=gl.get_path("1kg_eas_hg19"))
2025/07/18 17:32:32 Dumpping dataframe to : ./123264708373936 2025/07/18 17:32:32 Start to perfrom clumping...v3.6.6 2025/07/18 17:32:32 -Current Dataframe shape : 1000000 x 6 ; Memory usage: 65.33 MB 2025/07/18 17:32:32 Start to perform clumping... 2025/07/18 17:32:32 -Clumping parameters for PLINK2: 2025/07/18 17:32:32 -clump_p1 : 5e-08... 2025/07/18 17:32:32 -clump_p2 : 1e-06... 2025/07/18 17:32:32 -clump_kb : 250... 2025/07/18 17:32:32 -clump_r2 : 0.1... 2025/07/18 17:32:32 -Clumping will be performed using P 2025/07/18 17:32:32 -Significant variants on CHR: [np.int64(1), np.int64(2)] 2025/07/18 17:32:32 -Processing VCF : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz... 2025/07/18 17:32:32 -PLINK version: PLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) 2025/07/18 17:32:32 -Processing VCF for CHR 1... 2025/07/18 17:32:32 -Converting VCF to bfile: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.1.bim/bed/fam... 2025/07/18 17:34:13 -Processing VCF for CHR 2... 2025/07/18 17:34:13 -Converting VCF to bfile: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.2.bim/bed/fam... 2025/07/18 17:36:04 -Processing sumstats for CHR 1... 2025/07/18 17:36:08 -Variants in reference file: 6471908... 2025/07/18 17:36:08 -Variants in sumstats: 624... 2025/07/18 17:36:08 -Variants available in both reference and sumstats: 624... 2025/07/18 17:36:08 -Processing sumstats for CHR 2... 2025/07/18 17:36:13 -Variants in reference file: 7085912... 2025/07/18 17:36:13 -Variants in sumstats: 233... 2025/07/18 17:36:13 -Variants available in both reference and sumstats: 233... 2025/07/18 17:36:13 -Performing clumping for CHR 1... 2025/07/18 17:36:13 -PLINK version: PLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) 2025/07/18 17:36:14 -Saved results for CHR 1 to : ./t2d_bbj.1.clumps 2025/07/18 17:36:14 -Performing clumping for CHR 2... 2025/07/18 17:36:14 -PLINK version: PLINK v2.00a5.9LM AVX2 AMD (12 Dec 2023) 2025/07/18 17:36:15 -Saved results for CHR 2 to : ./t2d_bbj.2.clumps 2025/07/18 17:36:15 Finished clumping. 2025/07/18 17:36:16 Loaded dataframe back from : ./123264708373936
if using multiple bfile or pfile, @ can be used to indicate each chromosome.
In [6]:
Copied!
#mysumstats.clump( plink2="plink2", # default
# clump_r2=0.1,
# clump_p2=1e-6,
# bfile="/home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.@")
#mysumstats.clump( plink2="plink2", # default
# clump_r2=0.1,
# clump_p2=1e-6,
# bfile="/home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.@")
clump results are stored in mysumstats.clumps¶
In [7]:
Copied!
mysumstats.clumps["clumps"]
mysumstats.clumps["clumps"]
Out[7]:
| SNPID | CHR | POS | SE | P | STATUS | |
|---|---|---|---|---|---|---|
| 96739 | 1:22068326:A:G | 1 | 22068326 | 0.0103 | 1.629000e-09 | 9960999 |
| 213860 | 1:51103268:T:C | 1 | 51103268 | 0.0120 | 2.519000e-11 | 9960999 |
| 214500 | 1:51401146:CT:C | 1 | 51401146 | 0.0145 | 6.090000e-10 | 9960999 |
| 534095 | 1:154309595:TA:T | 1 | 154309595 | 0.0166 | 3.289000e-08 | 9960999 |
| 969974 | 2:640986:CACAT:C | 2 | 640986 | 0.0150 | 2.665000e-10 | 9960999 |
In [8]:
Copied!
mysumstats.clumps["clumps_raw"]
mysumstats.clumps["clumps_raw"]
Out[8]:
| CHR | POS | SNPID | P | TOTAL | NONSIG | S0.05 | S0.01 | S0.001 | S0.0001 | SP2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 22068326 | 1:22068326:A:G | 1.629000e-09 | 89 | 0 | 0 | 0 | 0 | 89 | 1:21982702:A:G,1:21983254:C:T,1:21983611:C:T,1... |
| 0 | 1 | 51103268 | 1:51103268:T:C | 2.519000e-11 | 347 | 0 | 0 | 0 | 0 | 347 | 1:50854197:G:C,1:50854654:A:G,1:50855083:A:G,1... |
| 1 | 1 | 51401146 | 1:51401146:CT:C | 6.090000e-10 | 134 | 0 | 0 | 0 | 0 | 134 | 1:51353720:C:G,1:51355823:A:G,1:51356091:G:T,1... |
| 3 | 1 | 154309595 | 1:154309595:TA:T | 3.289000e-08 | 5 | 0 | 0 | 0 | 0 | 5 | 1:154264194:G:A,1:154321382:C:T,1:154345344:G:... |
| 4 | 2 | 640986 | 2:640986:CACAT:C | 2.665000e-10 | 232 | 0 | 0 | 0 | 0 | 232 | 2:601905:T:G,2:605176:T:A,2:609177:A:G,2:61060... |
In [10]:
Copied!
# print(mysumstats.clumps["plink_log"])
# print(mysumstats.clumps["plink_log"])