Tutorial for gwaslab 3.6¶

In this tutorial, we will provide a brief overview of the core functions in GWASLab for summary statistics quality control (QC), standardization, and harmonization.
We will also demonstrate visualization features, including Manhattan plots, Q–Q plots, and regional plots.
The Jupyter notebook for this tutorial can be downloaded from https://github.com/Cloufield/gwaslab/blob/main/docs/tutorial_3.6.ipynb
Please note that the sample data and reference files are available on GitHub, while the full processed reference datasets are currently hosted on Dropbox.

Download sample data¶

Using a jupyter notebook, we first download a sample dataset.
The sample dataset we will use as an example (File size: 40M)
1. sample sumstats bbj_t2d_hm3_chr7_variants.txt.gz processed sumstats (only hapmap3 variants and chr7 region) of type 2 diabetes from BBJ (K. Suzuki et al., Nature Genetics. 51, 379–386 (2019).)
2. SNPID-rsID conversion table (sample) 1kg_dbsnp151_hg19_auto_hm3_chr7_variants.txt.gz
3. dbSNP vcf (sample) b157_2564.vcf.gz
4. 1KG EAS reference VCF (only chr7:126253550-128253550) 1kg_eas_hg19.chr7_126253550_128253550.vcf.gz

In [1]:

Copied!

!wget https://raw.githubusercontent.com/cloufield/gwaslab/main/examples/1_main_tutorial/sample_data.tar.gz
!tar -xvzf sample_data.tar.gz
!wget https://raw.githubusercontent.com/cloufield/gwaslab/main/examples/1_main_tutorial/sample_data.tar.gz
!tar -xvzf sample_data.tar.gz

--2025-10-19 20:56:16--  https://raw.githubusercontent.com/cloufield/gwaslab/main/examples/1_main_tutorial/sample_data.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41498334 (40M) [application/octet-stream]
Saving to: ‘sample_data.tar.gz’

sample_data.tar.gz  100%[===================>]  39.58M  8.96MB/s    in 4.5s    

2025-10-19 20:56:20 (8.78 MB/s) - ‘sample_data.tar.gz’ saved [41498334/41498334]

./sample_data/
./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz
./sample_data/1kg_eas_hg19.chr7_126253550_128253550.vcf.gz
./sample_data/b157_2564.vcf.gz
./sample_data/1kg_eas_hg19.chr7_126253550_128253550.vcf.gz.tbi
./sample_data/1kg_dbsnp151_hg19_auto_hm3_chr7_variants.txt.gz
./sample_data/b157_2564.vcf.gz.tbi

Import gwaslab package¶

gwaslab can be installed using pip

In [2]:

Copied!

# !pip install gwaslab==3.6.9
# !pip install gwaslab==3.6.9

If you installed gwaslab from pip, simply run the command to import the package:

In [3]:

Copied!

import gwaslab as gl
import gwaslab as gl

Or if you want to use the latest version from github (beta version), you can clone the repository and import the package by inserting your package path into the system path like:

In [4]:

Copied!

#!git clone https://github.com/Cloufield/gwaslab.git

import sys
sys.path.insert(0,"/home/yunye/work/gwaslab/src")
import gwaslab as gl
#!git clone https://github.com/Cloufield/gwaslab.git

import sys
sys.path.insert(0,"/home/yunye/work/gwaslab/src")
import gwaslab as gl

Loading data into gwaslab Sumstats¶

Let's import the raw sumstats into the gwaslab.Sumstats Object by specifying the necessary columns.

In [5]:

Copied!





mysumstats = gl.Sumstats("./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz",
             snpid="SNPID",
             chrom="CHR",
             pos="POS",
             ea="EA",
             nea="NEA",
             neaf="EAF",
             beta="BETA",
             se="SE",
             p="P",
             n="N",
             sep="\t")
mysumstats = gl.Sumstats("./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz",
             snpid="SNPID",
             chrom="CHR",
             pos="POS",
             ea="EA",
             nea="NEA",
             neaf="EAF",
             beta="BETA",
             se="SE",
             p="P",
             n="N",
             sep="\t")

2025/10/19 20:56:22 GWASLab v3.6.9 https://cloufield.github.io/gwaslab/
2025/10/19 20:56:22 (C) 2022-2025, Yunye He, Kamatani Lab, GPL-3.0 license, gwaslab@gmail.com
2025/10/19 20:56:22 Python version: 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:43:22) [GCC 12.3.0]
2025/10/19 20:56:22 Start to initialize gl.Sumstats from file :./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz
2025/10/19 20:56:24  -Reading columns          : BETA,SE,NEA,SNPID,EAF,EA,P,N,POS,CHR
2025/10/19 20:56:24  -Renaming columns to      : BETA,SE,NEA,SNPID,EAF,EA,P,N,POS,CHR
2025/10/19 20:56:24  -Current Dataframe shape : 1103020  x  10
2025/10/19 20:56:24  -Initiating a status column: STATUS ...
2025/10/19 20:56:24  #WARNING! Version of genomic coordinates is unknown...
2025/10/19 20:56:24  -NEAF is specified...
2025/10/19 20:56:24  -Checking if 0<= NEAF <=1 ...
2025/10/19 20:56:24  -Converted NEAF to EAF.
2025/10/19 20:56:24  -Removed 0 variants with bad NEAF.
2025/10/19 20:56:24 Start to reorder the columns...v3.6.9
2025/10/19 20:56:24  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 105.34 MB
2025/10/19 20:56:24  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,STATUS
2025/10/19 20:56:24 Finished reordering the columns.
2025/10/19 20:56:25  -Trying to convert datatype for CHR: string -> Int64...Failed...
2025/10/19 20:56:25  -Column  : SNPID  CHR    POS   EA       NEA      EAF     BETA    SE      P       N     STATUS  
2025/10/19 20:56:25  -DType   : object string int64 category category float64 float64 float64 float64 int64 category
2025/10/19 20:56:25  -Verified: T      F      T     T        T        T       T       T       T       T     T       
2025/10/19 20:56:25  #WARNING! Columns with possibly incompatible dtypes: CHR
2025/10/19 20:56:25  -Current Dataframe memory usage: 105.34 MB
2025/10/19 20:56:25 Finished loading data successfully!
2025/10/19 20:56:25 Path component detected: ['140656748651744']
2025/10/19 20:56:25 Creating path: ./140656748651744

Loading by specifying keyword in formatbook¶

Or you can simply specify format keywords listed in formatbook (https://github.com/Cloufield/formatbook)

In [6]:

Copied!

# mysumstats = gl.Sumstats("./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz", fmt="gwaslab", verbose=True)
# mysumstats = gl.Sumstats("./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz", fmt="gwaslab", verbose=True)

Sumstats.data¶

Sumstats are stored in Sumstats.data as a pandas.DataFrame, you can check the data like:

In [7]:

Copied!

mysumstats.data
mysumstats.data

Out[7]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS
0	1:752566_G_A	1	752566	G	A	0.8422	-0.0155	0.0131	0.2350	166718	9999999
1	1:752721_A_G	1	752721	G	A	0.2507	0.0204	0.0147	0.1650	166718	9999999
2	1:754182_A_G	1	754182	G	A	0.2505	0.0222	0.0166	0.1817	166718	9999999
3	1:760912_C_T	1	760912	C	T	0.8425	-0.0171	0.0148	0.2480	166718	9999999
4	1:761147_T_C	1	761147	C	T	0.1581	0.0171	0.0148	0.2480	166718	9999999
...	...	...	...	...	...	...	...	...	...	...	...
1103015	X:154343911_A_G	X	154343911	G	A	0.8058	0.0019	0.0090	0.8297	191764	9999999
1103016	X:154379088_C_A	X	154379088	C	A	0.7783	0.0027	0.0094	0.7723	191764	9999999
1103017	X:154536836_C_T	X	154536836	C	T	0.2196	-0.0084	0.0085	0.3192	191764	9999999
1103018	X:154763036_A_G	X	154763036	G	A	0.3686	-0.0102	0.0105	0.3302	191764	9999999
1103019	X:154816439_A_C	X	154816439	C	A	0.2339	-0.0083	0.0100	0.4021	191764	9999999

1103020 rows × 11 columns

For details on gwaslab Sumstats Object, see: https://cloufield.github.io/gwaslab/SumstatsObject/

Infer genome build¶

If the build is not specified, gwaslab can infer it based on HapMap3 SNPs. For more details, see: https://cloufield.github.io/gwaslab/InferBuild/

In [8]:

Copied!

mysumstats.infer_build()
mysumstats.infer_build()

2025/10/19 20:56:25 Start to infer genome build version using hapmap3 SNPs...v3.6.9
2025/10/19 20:56:25  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 105.34 MB
2025/10/19 20:56:25 Start to infer genome build version using hapmap3 SNPs...
2025/10/19 20:56:25  -Loading Hapmap3 variants data...
2025/10/19 20:56:26  -CHR:POS will be used for matching...
2025/10/19 20:56:28  -Matching variants for hg19: num_hg19 =  1093200
2025/10/19 20:56:28  -Matching variants for hg38: num_hg38 =  10807
2025/10/19 20:56:28  -Since num_hg19 >> num_hg38, assigning genome build hg19...
2025/10/19 20:56:29 Finished inferring genome build version using hapmap3 SNPs.

Create Manhattan plots and Q-Q plots¶

The first thing you’ll probably want to check is the Manhattan and Q–Q plots for your summary statistics. gwaslab performs a minimal QC on the data automatically when generating these plots.

In [9]:

Copied!

mysumstats.plot_mqq()
mysumstats.plot_mqq()

2025/10/19 20:56:29 Start to create MQQ plot...v3.6.9:
2025/10/19 20:56:29  -Genomic coordinates version: 19...
2025/10/19 20:56:29  -Genome-wide significance level to plot is set to 5e-08 ...
2025/10/19 20:56:29  -Raw input contains 1103020 variants...
2025/10/19 20:56:29  -MQQ plot layout mode is : mqq
2025/10/19 20:56:29 Finished loading specified columns from the sumstats.
2025/10/19 20:56:29 Start data conversion and sanity check:
2025/10/19 20:56:29  -Removed 0 variants with nan in CHR or POS column ...
2025/10/19 20:56:30  -Removed 0 variants with CHR <=0...
2025/10/19 20:56:30  -Removed 0 variants with nan in P column ...
2025/10/19 20:56:30  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/10/19 20:56:30  -Sumstats P values are being converted to -log10(P)...
2025/10/19 20:56:30  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/10/19 20:56:30  -Converting data above cut line...
2025/10/19 20:56:30  -Maximum -log10(P) value is 152.39254497678533 .
2025/10/19 20:56:30 Finished data conversion and sanity check.
2025/10/19 20:56:30 Start to create MQQ plot with 1103020 variants...
2025/10/19 20:56:31  -Creating background plot...
2025/10/19 20:56:33 Finished creating MQQ plot successfully!
2025/10/19 20:56:33 Start to extract variants for annotation...
2025/10/19 20:56:33  -Found 75 significant variants with a sliding window size of 500 kb...
2025/10/19 20:56:33 Finished extracting variants for annotation...
2025/10/19 20:56:33 Start to process figure arts.
2025/10/19 20:56:33  -Processing X ticks...
2025/10/19 20:56:33  -Processing X labels...
2025/10/19 20:56:33  -Processing Y labels...
2025/10/19 20:56:33  -Processing Y tick lables...
2025/10/19 20:56:33  -Processing Y labels...
2025/10/19 20:56:33  -Processing lines...
2025/10/19 20:56:33 Finished processing figure arts.
2025/10/19 20:56:33 Start to annotate variants...
2025/10/19 20:56:33  -Skip annotating
2025/10/19 20:56:33 Finished annotating variants.
2025/10/19 20:56:33 Start to create QQ plot with 1103020 variants:
2025/10/19 20:56:33  -Plotting all variants...
2025/10/19 20:56:33  -Expected range of P: (0,1.0)
2025/10/19 20:56:33  -Lambda GC (MLOG10P mode) at 0.5 is   1.33437
2025/10/19 20:56:33  -Processing Y tick lables...
2025/10/19 20:56:33 Finished creating QQ plot successfully!
2025/10/19 20:56:33 Start to save figure...
2025/10/19 20:56:33  -Skip saving figure!
2025/10/19 20:56:33 Finished saving figure...
2025/10/19 20:56:34 Finished creating plot successfully!

Out[9]:

(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fed335eb950>)

No description has been provided for this image

Using .plot_mqq(), you can easily plot the Manhattan and QQ plot, but the plots without any manipulations are not really informative in this case, and plotting all points takes a relatively long time. The most significant locus dwarfed other less signicant loci. To adjust the plot, gwaslab provides a wide range of options for customization. For example, we can use skip and cut :

skip : skip variants with MLOG10P < skip for faster plotting speed
cut : rescale the MLOG10P values when MLOG10P > cut

In [10]:

Copied!

mysumstats.plot_mqq(skip=2, cut=20)
mysumstats.plot_mqq(skip=2, cut=20)

2025/10/19 20:56:46 Start to create MQQ plot...v3.6.9:
2025/10/19 20:56:46  -Genomic coordinates version: 19...
2025/10/19 20:56:46  -Genome-wide significance level to plot is set to 5e-08 ...
2025/10/19 20:56:46  -Raw input contains 1103020 variants...
2025/10/19 20:56:46  -MQQ plot layout mode is : mqq
2025/10/19 20:56:46 Finished loading specified columns from the sumstats.
2025/10/19 20:56:46 Start data conversion and sanity check:
2025/10/19 20:56:46  -Removed 0 variants with nan in CHR or POS column ...
2025/10/19 20:56:46  -Removed 0 variants with CHR <=0...
2025/10/19 20:56:46  -Removed 0 variants with nan in P column ...
2025/10/19 20:56:46  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/10/19 20:56:46  -Sumstats P values are being converted to -log10(P)...
2025/10/19 20:56:46  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/10/19 20:56:47  -Converting data above cut line...
2025/10/19 20:56:47  -Maximum -log10(P) value is 152.39254497678533 .
2025/10/19 20:56:47  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
2025/10/19 20:56:47 Finished data conversion and sanity check.
2025/10/19 20:56:47 Start to create MQQ plot with 38995 variants...
2025/10/19 20:56:47  -Creating background plot...
2025/10/19 20:56:47 Finished creating MQQ plot successfully!
2025/10/19 20:56:47 Start to extract variants for annotation...
2025/10/19 20:56:47  -Found 75 significant variants with a sliding window size of 500 kb...
2025/10/19 20:56:47 Finished extracting variants for annotation...
2025/10/19 20:56:47 Start to process figure arts.
2025/10/19 20:56:47  -Processing X ticks...
2025/10/19 20:56:47  -Processing X labels...
2025/10/19 20:56:47  -Processing Y labels...
2025/10/19 20:56:47  -Processing Y tick lables...
2025/10/19 20:56:47  -Processing Y labels...
2025/10/19 20:56:47  -Processing lines...
2025/10/19 20:56:47 Finished processing figure arts.
2025/10/19 20:56:47 Start to annotate variants...
2025/10/19 20:56:47  -Skip annotating
2025/10/19 20:56:47 Finished annotating variants.
2025/10/19 20:56:47 Start to create QQ plot with 38995 variants:
2025/10/19 20:56:47  -Plotting all variants...
2025/10/19 20:56:47  -Expected range of P: (0,1.0)
2025/10/19 20:56:47  -Lambda GC (MLOG10P mode) at 0.5 is   1.33437
2025/10/19 20:56:47  -Processing Y tick lables...
2025/10/19 20:56:47 Finished creating QQ plot successfully!
2025/10/19 20:56:47 Start to save figure...
2025/10/19 20:56:47  -Skip saving figure!
2025/10/19 20:56:47 Finished saving figure...
2025/10/19 20:56:47 Finished creating plot successfully!

Out[10]:

(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fed335eb950>)

Looks better now. But what if we want to annotate some of the most significant loci (for example, lead variants with MLOG10P>30) and only plot Manhattan plot?

In [11]:

Copied!

mysumstats.plot_mqq(skip=2, cut=20, mode="m", anno=True, sig_level_lead=1e-30)
mysumstats.plot_mqq(skip=2, cut=20, mode="m", anno=True, sig_level_lead=1e-30)

2025/10/19 20:56:48 Start to create MQQ plot...v3.6.9:
2025/10/19 20:56:48  -Genomic coordinates version: 19...
2025/10/19 20:56:48  -Genome-wide significance level to plot is set to 5e-08 ...
2025/10/19 20:56:48  -Raw input contains 1103020 variants...
2025/10/19 20:56:48  -MQQ plot layout mode is : m
2025/10/19 20:56:48 Finished loading specified columns from the sumstats.
2025/10/19 20:56:48 Start data conversion and sanity check:
2025/10/19 20:56:48  -Removed 0 variants with nan in CHR or POS column ...
2025/10/19 20:56:48  -Removed 0 variants with CHR <=0...
2025/10/19 20:56:48  -Removed 0 variants with nan in P column ...
2025/10/19 20:56:48  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/10/19 20:56:48  -Sumstats P values are being converted to -log10(P)...
2025/10/19 20:56:49  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/10/19 20:56:49  -Converting data above cut line...
2025/10/19 20:56:49  -Maximum -log10(P) value is 152.39254497678533 .
2025/10/19 20:56:49  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
2025/10/19 20:56:49 Finished data conversion and sanity check.
2025/10/19 20:56:49 Start to create MQQ plot with 38995 variants...
2025/10/19 20:56:49  -Creating background plot...
2025/10/19 20:56:49 Finished creating MQQ plot successfully!
2025/10/19 20:56:49 Start to extract variants for annotation...
2025/10/19 20:56:49  -Found 1 significant variants with a sliding window size of 500 kb...
2025/10/19 20:56:49 Finished extracting variants for annotation...
2025/10/19 20:56:49 Start to process figure arts.
2025/10/19 20:56:49  -Processing X ticks...
2025/10/19 20:56:49  -Processing X labels...
2025/10/19 20:56:49  -Processing Y labels...
2025/10/19 20:56:49  -Processing Y tick lables...
2025/10/19 20:56:49  -Processing Y labels...
2025/10/19 20:56:49  -Processing lines...
2025/10/19 20:56:49 Finished processing figure arts.
2025/10/19 20:56:49 Start to annotate variants...
2025/10/19 20:56:49  -Annotating using column CHR:POS...
2025/10/19 20:56:49  -Adjusting text positions with repel_force=0.03...
2025/10/19 20:56:49 Finished annotating variants.
2025/10/19 20:56:49 Start to save figure...
2025/10/19 20:56:49  -Skip saving figure!
2025/10/19 20:56:49 Finished saving figure...
2025/10/19 20:56:49 Finished creating plot successfully!

Out[11]:

(<Figure size 3000x1000 with 1 Axes>, <gwaslab.g_Log.Log at 0x7fed335eb950>)

gwaslab supports a wide range of customizable options. For details on other options for Manhattan and Q-Q plots, see: https://cloufield.github.io/gwaslab/Visualization/

Standardization & QC : `.basic_check()`¶

It is needed to check variant ID (SNPID), rsID, chromosome(CHR), basepair position(POS), alleles (EA and NEA) and statistics first before any manipulations or analysis. gwaslab provides a all-in-one function for this, .basic_check().

Note: Sometimes you need to do this before plotting if the sumstats are not in a standard format.

In [12]:

Copied!

#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()
#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()

2025/10/19 20:56:50 Start to check SNPID/rsID...v3.6.9
2025/10/19 20:56:50  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 105.34 MB
2025/10/19 20:56:50  -Checking SNPID data type...
2025/10/19 20:56:50  -Converting SNPID to pd.string data type...
2025/10/19 20:56:50  -Checking NA strings :na,NA,Na,Nan,NaN,<NA>,null,NULL,#N/A,#VALUE!,N/A,n/a,missing,
2025/10/19 20:56:50  -Checking if SNPID contains NA strings...
2025/10/19 20:56:50  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
2025/10/19 20:56:52 Finished checking SNPID/rsID.
2025/10/19 20:56:52 Start to fix chromosome notation (CHR)...v3.6.9
2025/10/19 20:56:52  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 105.34 MB
2025/10/19 20:56:52  -Checking CHR data type...
2025/10/19 20:56:52  -Variants with standardized chromosome notation: 1100517
2025/10/19 20:56:52  -Variants with fixable chromosome notations: 2503
2025/10/19 20:56:52  -No unrecognized chromosome notations...
2025/10/19 20:56:52  -Identifying non-autosomal chromosomes : X, Y, and MT ...
2025/10/19 20:56:52  -Identified  2503  variants on sex chromosomes...
2025/10/19 20:56:52  -Standardizing sex chromosome notations: X to 23...
2025/10/19 20:56:55 Finished fixing chromosome notation (CHR).
2025/10/19 20:56:55 Start to fix basepair positions (POS)...v3.6.9
2025/10/19 20:56:55  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 138.64 MB
2025/10/19 20:56:55  -Converting to Int64 data type ...
2025/10/19 20:56:56  -Position bound:(0 , 250,000,000)
2025/10/19 20:56:56  -Removed outliers: 0
2025/10/19 20:56:56 Finished fixing basepair positions (POS).
2025/10/19 20:56:56 Start to fix alleles (EA and NEA)...v3.6.9
2025/10/19 20:56:56  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 107.45 MB
2025/10/19 20:56:56  -Converted all bases to string datatype and UPPERCASE.
2025/10/19 20:56:57  -Variants with bad EA  : 0
2025/10/19 20:56:57  -Variants with bad NEA : 0
2025/10/19 20:56:57  -Variants with NA for EA or NEA: 0
2025/10/19 20:56:57  -Variants with same EA and NEA: 0
2025/10/19 20:56:57  -Detected 0 variants with alleles that contain bases other than A/C/T/G .
2025/10/19 20:56:59 Finished fixing alleles (EA and NEA).
2025/10/19 20:56:59 Start to perform sanity check for statistics...v3.6.9
2025/10/19 20:56:59  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 108.51 MB
2025/10/19 20:56:59  -Comparison tolerance for floats: 1e-07
2025/10/19 20:56:59  -Checking if 0 <= N <= 2147483647 ...
2025/10/19 20:57:00  -Removed 0 variants with bad/na N.
2025/10/19 20:57:00  -Checking if -1e-07 < EAF < 1.0000001 ...
2025/10/19 20:57:00  -Removed 0 variants with bad/na EAF.
2025/10/19 20:57:00  -Checking if -1e-07 < P < 1.0000001 ...
2025/10/19 20:57:00  -Removed 0 variants with bad/na P.
2025/10/19 20:57:00  -Checking if -100.0000001 < BETA < 100.0000001 ...
2025/10/19 20:57:00  -Removed 0 variants with bad/na BETA.
2025/10/19 20:57:00  -Checking if -1e-07 < SE < inf ...
2025/10/19 20:57:00  -Removed 0 variants with bad/na SE.
2025/10/19 20:57:00  -Checking STATUS and converting STATUS to categories....
2025/10/19 20:57:00  -Removed 0 variants with bad statistics in total.
2025/10/19 20:57:00  -Data types for each column:
2025/10/19 20:57:00  -Column  : SNPID  CHR   POS   EA       NEA      EAF     BETA    SE      P       N     STATUS  
2025/10/19 20:57:00  -DType   : string Int64 Int64 category category float32 float64 float64 float64 Int64 category
2025/10/19 20:57:00  -Verified: T      T     T     T        T        T       T       T       T       T     T       
2025/10/19 20:57:00 Finished sanity check for statistics.
2025/10/19 20:57:01 Start to check data consistency across columns...v3.6.9
2025/10/19 20:57:01  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 105.35 MB
2025/10/19 20:57:01  -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
2025/10/19 20:57:01  -Checking if BETA/SE-derived-P is consistent with P...
2025/10/19 20:57:04   -Not consistent: 380165 variant(s)
2025/10/19 20:57:04   -Variant SNPID with max difference: X:15800751_G_A with 0.006652248071004951
2025/10/19 20:57:04  -Note: if the max difference is greater than expected, please check your original sumstats.
2025/10/19 20:57:04 Finished checking data consistency across columns.
2025/10/19 20:57:04 Start to normalize indels...v3.6.9
2025/10/19 20:57:04  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 137.60 MB
2025/10/19 20:57:05  -No available variants to normalize..
2025/10/19 20:57:05 Finished normalizing variants successfully!
2025/10/19 20:57:05 Start to sort the genome coordinates...v3.6.9
2025/10/19 20:57:05  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 137.60 MB
2025/10/19 20:57:05 Finished sorting coordinates.
2025/10/19 20:57:05 Start to reorder the columns...v3.6.9
2025/10/19 20:57:05  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 96.94 MB
2025/10/19 20:57:05  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,STATUS
2025/10/19 20:57:05 Finished reordering the columns.

In [13]:

Copied!

mysumstats.data
mysumstats.data

Out[13]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS
0	1:752566_G_A	1	752566	G	A	0.8422	-0.0155	0.0131	0.2350	166718	1960099
1	1:752721_A_G	1	752721	G	A	0.2507	0.0204	0.0147	0.1650	166718	1960099
2	1:754182_A_G	1	754182	G	A	0.2505	0.0222	0.0166	0.1817	166718	1960099
3	1:760912_C_T	1	760912	C	T	0.8425	-0.0171	0.0148	0.2480	166718	1960099
4	1:761147_T_C	1	761147	C	T	0.1581	0.0171	0.0148	0.2480	166718	1960099
...	...	...	...	...	...	...	...	...	...	...	...
1103015	X:154343911_A_G	23	154343911	G	A	0.8058	0.0019	0.0090	0.8297	191764	1960099
1103016	X:154379088_C_A	23	154379088	C	A	0.7783	0.0027	0.0094	0.7723	191764	1960099
1103017	X:154536836_C_T	23	154536836	C	T	0.2196	-0.0084	0.0085	0.3192	191764	1960099
1103018	X:154763036_A_G	23	154763036	G	A	0.3686	-0.0102	0.0105	0.3302	191764	1960099
1103019	X:154816439_A_C	23	154816439	C	A	0.2339	-0.0083	0.0100	0.4021	191764	1960099

1103020 rows × 11 columns

By checking the log, we can see that the sumstats look good. But we still found several variants that were not normalized. gwaslab fixed the position and alleles for the un-normalizated indels. And gwaslab standardize the notation for chromosome X to 23.

In fact, .basic_check() is a wrapper of the following basic functions, you can also use these separately.

mysumstats.fix_ID()
mysumstats.fix_chr()
mysumstats.fix_pos()
mysumstats.fix_allele()
mysumstats.check_sanity()
mysumstats.check_data_consistency()
mysumstats.normalize_allele()
mysumstats.remove_dup()

For other options, see: https://cloufield.github.io/gwaslab/Standardization/

Extract lead variants : get_lead()¶

Let's extract the lead variants in each significant loci to check our data.

The significant loci are detected based on a sliding window (default window size: windowsizekb=500 kb)

By specifying anno=True , gwaslab will also annotate the lead variant with its nearest gene names and distance.

Note: If you did not specify build when loading the summary statistics or have not run .infer_build(), you need to explicitly set build="19" (GRCh37/hg19) or build="38" (GRCh38/hg38) here for annotation.

Note: GWASLab will download reference files when you run it for the first time. In this case, ensembl_hg19_gtf_protein_coding was downloaded and processed automatically.

In [14]:

Copied!

mysumstats.get_lead(anno=True)
mysumstats.get_lead(anno=True)

2025/10/19 20:57:05 Start to extract lead variants...v3.6.9
2025/10/19 20:57:05  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 96.94 MB
2025/10/19 20:57:05  -Processing 1103020 variants...
2025/10/19 20:57:05  -Significance threshold : 5e-08
2025/10/19 20:57:05  -Sliding window size: 500  kb
2025/10/19 20:57:05  -Using P for extracting lead variants...
2025/10/19 20:57:05  -Found 1271 significant variants in total...
2025/10/19 20:57:06  -Identified 75 lead variants!
2025/10/19 20:57:06  -Annotating variants using references:ensembl
2025/10/19 20:57:06  -Annotating variants using references based on genome build:19
2025/10/19 20:57:06 Start to annotate variants with nearest gene name(s)...
2025/10/19 20:57:06  -Assigning Gene name using ensembl_hg19_gtf for protein coding genes
2025/10/19 20:57:08 Finished annotating variants with nearest gene name(s) successfully!
2025/10/19 20:57:08 Finished extracting lead variants.

Out[14]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS	LOCATION	GENE
8927	1:22046558_A_C	1	22046558	C	A	0.2527	0.0607	0.0103	3.941000e-09	191764	1960099	0	USP48
19220	1:51174330_C_T	1	51174330	C	T	0.7902	0.0723	0.0113	1.380000e-10	191764	1960099	0	FAF1
90573	2:637597_C_T	2	637597	C	T	0.9017	0.0938	0.0150	3.725000e-10	191764	1960099	29738	TMEM18
102403	2:27742603_T_C	2	27742603	C	T	0.5613	-0.0690	0.0088	4.358000e-15	191764	1960099	0	GCKR
117019	2:58959112_G_A	2	58959112	G	A	0.1754	0.0654	0.0117	2.237000e-08	191764	1960099	-490605	FANCL
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1035288	19:46158643_G_T	19	46158643	G	T	0.6372	0.0805	0.0094	1.332000e-17	191764	1960099	-9756	EML2
1059172	20:42989777_G_A	20	42989777	G	A	0.4503	0.0518	0.0088	3.770000e-09	191764	1960099	0	HNF4A
1062583	20:50155386_T_C	20	50155386	C	T	0.4266	0.0538	0.0090	1.836000e-09	191764	1960099	0	NFATC2
1101431	X:57150010_C_T	23	57150010	C	T	0.3012	0.0443	0.0076	6.427000e-09	191764	1960099	-2030	SPIN2B
1102334	X:117916520_A_G	23	117916520	G	A	0.5681	-0.0507	0.0069	2.330000e-13	191764	1960099	0	IL13RA1

75 rows × 13 columns

We extracted a total of 75 lead variants with a sliding window size of 500kb from this processed dataset.

For other options, see: https://cloufield.github.io/gwaslab/ExtractLead/

Use the SNPID to create some highly customized mqq plot¶

GWASLab can create much more complicated Manhattan plots.

For example,

annotate the lead variants with closest gene names (threshold for annotation p<1e-20)
annotate selected variants with user-provided texts
pinpoint some variants
highlight some loci
MAF-stratified Q-Q plot
save as my_first_mqq_plot.png with {"dpi":400,"facecolor":"white"}

In [15]:

Copied!





mysumstats.plot_mqq(mode="mqq",
                    cut=20,
                    skip=2,
                    anno="GENENAME",
                    sig_level_lead=1e-20,
                    anno_alias={"9:22134094_T_C":"some annotations"},
                    anno_style="expand",
                    xpad=0.01,
                    pinpoint=["9:22134094_T_C","6:20686996_C_A"], 
                    pinpoint_color="green",
                    highlight=["7:127253550_C_T","11:2858440_G_A"],
                    highlight_windowkb =1000,
                    stratified=True,
                    jagged=True,
                    marker_size=(5,5),
                    figargs={"figsize":(15,5),"dpi":300},
                    save="my_first_mqq_plot.png", 
                    save_args={"dpi":400,"facecolor":"white"})
mysumstats.plot_mqq(mode="mqq",
                    cut=20,
                    skip=2,
                    anno="GENENAME",
                    sig_level_lead=1e-20,
                    anno_alias={"9:22134094_T_C":"some annotations"},
                    anno_style="expand",
                    xpad=0.01,
                    pinpoint=["9:22134094_T_C","6:20686996_C_A"], 
                    pinpoint_color="green",
                    highlight=["7:127253550_C_T","11:2858440_G_A"],
                    highlight_windowkb =1000,
                    stratified=True,
                    jagged=True,
                    marker_size=(5,5),
                    figargs={"figsize":(15,5),"dpi":300},
                    save="my_first_mqq_plot.png", 
                    save_args={"dpi":400,"facecolor":"white"})

2025/10/19 20:57:08 Start to create MQQ plot...v3.6.9:
2025/10/19 20:57:08  -Genomic coordinates version: 19...
2025/10/19 20:57:08  -Genome-wide significance level to plot is set to 5e-08 ...
2025/10/19 20:57:08  -Raw input contains 1103020 variants...
2025/10/19 20:57:08  -MQQ plot layout mode is : mqq
2025/10/19 20:57:08  -Loci to highlight (#CB132D): 7:127253550_C_T,11:2858440_G_A
2025/10/19 20:57:08   -highlight_windowkb is set to:  1000  kb
2025/10/19 20:57:08  -Variants to pinpoint (green) : 9:22134094_T_C,6:20686996_C_A
2025/10/19 20:57:09 Finished loading specified columns from the sumstats.
2025/10/19 20:57:09 Start data conversion and sanity check:
2025/10/19 20:57:09  -Removed 0 variants with nan in CHR or POS column ...
2025/10/19 20:57:09  -Removed 0 variants with CHR <=0...
2025/10/19 20:57:09  -Removed 0 variants with nan in EAF column ...
2025/10/19 20:57:09  -Removed 0 variants with nan in P column ...
2025/10/19 20:57:09  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/10/19 20:57:09  -Sumstats P values are being converted to -log10(P)...
2025/10/19 20:57:09  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/10/19 20:57:09  -Converting data above cut line...
2025/10/19 20:57:09  -Maximum -log10(P) value is 152.39254497678533 .
2025/10/19 20:57:09  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
2025/10/19 20:57:09 Finished data conversion and sanity check.
2025/10/19 20:57:09 Start to create MQQ plot with 38995 variants...
2025/10/19 20:57:09  -Creating background plot...
2025/10/19 20:57:10  -Highlighting target loci...
2025/10/19 20:57:10  -Pinpointing target vairants...
2025/10/19 20:57:10 Finished creating MQQ plot successfully!
2025/10/19 20:57:10 Start to extract variants for annotation...
2025/10/19 20:57:10  -Found 14 significant variants with a sliding window size of 500 kb...
2025/10/19 20:57:10 Start to annotate variants with nearest gene name(s)...
2025/10/19 20:57:10  -Assigning Gene name using ensembl_hg19_gtf for protein coding genes
2025/10/19 20:57:10 Finished annotating variants with nearest gene name(s) successfully!
2025/10/19 20:57:10 Finished extracting variants for annotation...
2025/10/19 20:57:10 Start to process figure arts.
2025/10/19 20:57:10  -Processing X ticks...
2025/10/19 20:57:10  -Processing X labels...
2025/10/19 20:57:10  -Processing Y labels...
2025/10/19 20:57:10  -Processing Y tick lables...
2025/10/19 20:57:10  -Processing Y labels...
2025/10/19 20:57:10  -Processing lines...
2025/10/19 20:57:10 Finished processing figure arts.
2025/10/19 20:57:10 Start to annotate variants...
2025/10/19 20:57:10  -Annotating using column GENENAME...
2025/10/19 20:57:10  -Adjusting text positions with repel_force=0.03...
2025/10/19 20:57:10 Finished annotating variants.
2025/10/19 20:57:10 Start to create QQ plot with 38995 variants:
2025/10/19 20:57:10  -Plotting variants stratified by MAF...
2025/10/19 20:57:10  -Lambda GC (MLOG10P mode) at 0.5 is   1.33437
2025/10/19 20:57:10  -Processing Y tick lables...
2025/10/19 20:57:10 Finished creating QQ plot successfully!
2025/10/19 20:57:10  -Processing jagged Y axis...
2025/10/19 20:57:10  -Processing jagged Y axis...
2025/10/19 20:57:10  -Adjusting X padding on both side: 0.01
2025/10/19 20:57:10 Start to save figure...
2025/10/19 20:57:11  -Saved to my_first_mqq_plot.png successfully! (overwrite)
2025/10/19 20:57:11 Finished saving figure...
2025/10/19 20:57:11 Finished creating plot successfully!

Out[15]:

(<Figure size 6000x2000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fed335eb950>)

For details, see: https://cloufield.github.io/gwaslab/Visualization/

Quick regional plot without LD-information¶

gwaslab can also plot regional plots with or with out LD reference files.

For details, see: https://cloufield.github.io/gwaslab/RegionalPlot/

We first create a regional plot without references by specifying mode and region.

In [16]:

Copied!

mysumstats.plot_mqq(mode="r",skip=2,cut=20, region=(7,126253550,128253550),region_grid=True)
mysumstats.plot_mqq(mode="r",skip=2,cut=20, region=(7,126253550,128253550),region_grid=True)

2025/10/19 20:57:13 Start to create MQQ plot...v3.6.9:
2025/10/19 20:57:13  -Genomic coordinates version: 19...
2025/10/19 20:57:13  -Genome-wide significance level to plot is set to 5e-08 ...
2025/10/19 20:57:13  -Raw input contains 1103020 variants...
2025/10/19 20:57:13  -MQQ plot layout mode is : r
2025/10/19 20:57:13  -Region to plot : chr7:126253550-128253550.
2025/10/19 20:57:13  -Extract SNPs in region : chr7:126253550-128253550...
2025/10/19 20:57:14  -Extract SNPs in specified regions: 8857
2025/10/19 20:57:14 Finished loading specified columns from the sumstats.
2025/10/19 20:57:14 Start data conversion and sanity check:
2025/10/19 20:57:14  -Removed 0 variants with nan in CHR or POS column ...
2025/10/19 20:57:14  -Removed 0 variants with CHR <=0...
2025/10/19 20:57:14  -Removed 0 variants with nan in P column ...
2025/10/19 20:57:14  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/10/19 20:57:14  -Sumstats P values are being converted to -log10(P)...
2025/10/19 20:57:14  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/10/19 20:57:14  -Converting data above cut line...
2025/10/19 20:57:14  -Maximum -log10(P) value is 73.38711023071251 .
2025/10/19 20:57:14  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
2025/10/19 20:57:14 Finished data conversion and sanity check.
2025/10/19 20:57:14 Start to create MQQ plot with 1878 variants...
2025/10/19 20:57:14  -Creating background plot...
2025/10/19 20:57:14  -Extracting lead variant...
2025/10/19 20:57:14  -Loading gtf files from:default

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']

2025/10/19 20:57:45  -plotting gene track..
2025/10/19 20:57:45  -plotting genes: 14..
2025/10/19 20:57:45  -plotting exons: 675..
2025/10/19 20:57:46  -Finished plotting gene track..
2025/10/19 20:57:46 Finished creating MQQ plot successfully!
2025/10/19 20:57:46 Start to extract variants for annotation...
2025/10/19 20:57:46  -Found 1 significant variants with a sliding window size of 500 kb...
2025/10/19 20:57:46 Finished extracting variants for annotation...
2025/10/19 20:57:46 Start to process figure arts.
2025/10/19 20:57:46  -Processing X ticks...
2025/10/19 20:57:46  -Processing X labels...
2025/10/19 20:57:46  -Processing Y labels...
2025/10/19 20:57:46  -Processing Y tick lables...
2025/10/19 20:57:46  -Processing Y labels...
2025/10/19 20:57:46  -Processing lines...
2025/10/19 20:57:46 Finished processing figure arts.
2025/10/19 20:57:46 Start to annotate variants...
2025/10/19 20:57:46  -Skip annotating
2025/10/19 20:57:46 Finished annotating variants.
2025/10/19 20:57:46 Start to save figure...
2025/10/19 20:57:46  -Skip saving figure!
2025/10/19 20:57:46 Finished saving figure...
2025/10/19 20:57:46 Finished creating plot successfully!

Out[16]:

(<Figure size 3000x2000 with 3 Axes>, <gwaslab.g_Log.Log at 0x7fed335eb950>)

Reference file downloading¶

Full regional plot will require either: user-provided vcf or preprocessed vcf files by gwaslab: (e.g 1000 Genomes project, see Reference: https://cloufield.github.io/gwaslab/Reference/)

GWASLab provides pre-processed 1KG datasets for downloading.

check available reference from gwaslab¶

Update the available reference list first if needed

In [17]:

Copied!

# gl.update_available_ref()
# gl.update_available_ref()

In [18]:

Copied!

gl.check_available_ref()
gl.check_available_ref()

2025/10/19 20:57:47 Start to check available reference files...
2025/10/19 20:57:47  - 1kg_eas_hg19  :  https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1
2025/10/19 20:57:47  - 1kg_eas_hg19_md5  :  c8c97434843c0da3113fc06879ead472
2025/10/19 20:57:47  - 1kg_eas_hg19_tbi  :  https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1
2025/10/19 20:57:47  - 1kg_eur_hg19  :  https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1
2025/10/19 20:57:47  - 1kg_eur_hg19_md5  :  734069d895009d38c2f962bfbb6fab52
2025/10/19 20:57:47  - 1kg_eur_hg19_tbi  :  https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1
2025/10/19 20:57:47  - 1kg_eas_hg38  :  https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1
2025/10/19 20:57:47  - 1kg_eas_hg38_md5  :  f45e80bca9ef7b29e6b1832e6ac15375
2025/10/19 20:57:47  - 1kg_eas_hg38_tbi  :  https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1
2025/10/19 20:57:47  - 1kg_eur_hg38  :  https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1
2025/10/19 20:57:47  - 1kg_eur_hg38_md5  :  228d3285fa99132cc6321e2925e0768d
2025/10/19 20:57:47  - 1kg_eur_hg38_tbi  :  https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1
2025/10/19 20:57:47  - 1kg_sas_hg19  :  https://www.dropbox.com/scl/fi/fubqvuj3p4ii4y35zknv8/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=5z50f66iltjchcaszznq5bczt&dl=1
2025/10/19 20:57:47  - 1kg_sas_hg19_md5  :  e2d3f9e2e6580d05e877e9effd435c4e
2025/10/19 20:57:47  - 1kg_sas_hg19_tbi  :  https://www.dropbox.com/scl/fi/icnmrnzee7ofdpx5l96tg/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=st8t88snby26q37rqi6zh5zck&dl=1
2025/10/19 20:57:47  - 1kg_amr_hg19  :  https://www.dropbox.com/scl/fi/bxa4zfngsxsc38rhtiv8c/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=ibcn8hb1n8n36j3u0jfzci267&dl=1
2025/10/19 20:57:47  - 1kg_amr_hg19_md5  :  68d3cdf01cbabdae6e74a07795fa881c
2025/10/19 20:57:47  - 1kg_amr_hg19_tbi  :  https://www.dropbox.com/scl/fi/1zk16x7h4r89jurzwu05u/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=b4cere4w38zvzyfitfge3r8n0&dl=1
2025/10/19 20:57:47  - 1kg_sas_hg38  :  https://www.dropbox.com/scl/fi/jr3l5zz42py3kny2bccmj/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=x0t6tsy71jxzf021wfqdn8k5q&dl=1
2025/10/19 20:57:47  - 1kg_sas_hg38_md5  :  e5d79bea1958aa50c23f618d342ccc83
2025/10/19 20:57:47  - 1kg_sas_hg38_tbi  :  https://www.dropbox.com/scl/fi/02oia4ur5r7w9qgiuf6i9/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=00p9rxe0xzfs6hr1rg4d8oadm&dl=1
2025/10/19 20:57:47  - 1kg_amr_hg38  :  https://www.dropbox.com/scl/fi/4t4tyuhzp78uyb6tgkroq/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=p96gbs1tcdia31jnjv1b82kuz&dl=1
2025/10/19 20:57:47  - 1kg_amr_hg38_md5  :  229fbd610001cf6f137b7f738352a44a
2025/10/19 20:57:47  - 1kg_amr_hg38_tbi  :  https://www.dropbox.com/scl/fi/x0dby543tr9xpaqj2i0ba/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=uj8o7j0cy0spipe174jn54sqs&dl=1
2025/10/19 20:57:47  - 1kg_afr_hg19  :  https://www.dropbox.com/scl/fi/tq4w9lyt5z47ym7grtrxg/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=k3bimeu3yr5loq8hohba5mr6k&dl=1
2025/10/19 20:57:47  - 1kg_afr_hg19_md5  :  f7b4425f39e8292dce6f13711e7f6c50
2025/10/19 20:57:47  - 1kg_afr_hg19_tbi  :  https://www.dropbox.com/scl/fi/0giiptu0btwj1kfm6jdzr/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=ucb5weprsc5prcg8hvtgmruxx&dl=1
2025/10/19 20:57:47  - 1kg_pan_hg19  :  https://www.dropbox.com/scl/fi/6b4j9z9knmllfnbx86aw6/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=eento8vg06zyrkvooc9wd4cvu&dl=1
2025/10/19 20:57:47  - 1kg_pan_hg19_md5  :  fed846482204487b60d33b21ddb18106
2025/10/19 20:57:47  - 1kg_pan_hg19_tbi  :  https://www.dropbox.com/scl/fi/stco946scio5tvto0ln4j/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=hfh53beb627lmqwv3d8mzqy0c&dl=1
2025/10/19 20:57:47  - 1kg_afr_hg38  :  https://www.dropbox.com/scl/fi/239xmm7qijtnsks97chc9/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=47en5fk1icbekpg7we3uot9g8&dl=1
2025/10/19 20:57:47  - 1kg_afr_hg38_md5  :  3bb7923be0809a324d7b7633b8d58a3b
2025/10/19 20:57:47  - 1kg_afr_hg38_tbi  :  https://www.dropbox.com/scl/fi/3y3pg4yqwo2jaaamx1c8f/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=say0ihfwa51z3otgn4bjtze8p&dl=1
2025/10/19 20:57:47  - 1kg_pan_hg38  :  https://www.dropbox.com/scl/fi/nf01487smtmeq243ihfwm/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=3pefbkzxwcnejx4inynifpft7&dl=1
2025/10/19 20:57:47  - 1kg_pan_hg38_md5  :  23bb86d748c4a66e85e087f647e8b60e
2025/10/19 20:57:47  - 1kg_pan_hg38_tbi  :  https://www.dropbox.com/scl/fi/hu7cttr4cenw5yjsm2775/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=568u7bkvkybm4wt2q9284o198&dl=1
2025/10/19 20:57:47  - 1kg_eas_x_hg19  :  https://www.dropbox.com/scl/fi/1inmw09rk35ncuq7tibmp/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=vcjpukgsb7gt4tizg1fvr7tr2&dl=1
2025/10/19 20:57:47  - 1kg_eas_x_hg19_md5  :  b2aced2a1522ed23818989b3153b7e91
2025/10/19 20:57:47  - 1kg_eas_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/uyxb9lfi88dqjp5l3vrzf/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=vt196d16h690dmzox5jyc33xx&dl=1
2025/10/19 20:57:47  - 1kg_eur_x_hg19  :  https://www.dropbox.com/scl/fi/6r4sc2yax8pk644piew2d/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=l5towjhyl733nrd1msjr1d8gl&dl=1
2025/10/19 20:57:47  - 1kg_eur_x_hg19_md5  :  6380cb71eafe985d7b894029e979139b
2025/10/19 20:57:47  - 1kg_eur_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/yuid87x398yc9n8nc4bb1/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=01skm13sk6099y34zy6qvweqj&dl=1
2025/10/19 20:57:47  - 1kg_eas_x_hg38  :  https://www.dropbox.com/scl/fi/2m6i93vv1ooano0muukck/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=y6087mmt9kmls066mzobjeqp7&dl=1
2025/10/19 20:57:47  - 1kg_eas_x_hg38_md5  :  8c6a35da51621f952a5b97cbcc832046
2025/10/19 20:57:47  - 1kg_eas_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/l6jpt86edarb4emehxwcy/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ddr1fcijb1bh2nso0q0updolh&dl=1
2025/10/19 20:57:47  - 1kg_eur_x_hg38  :  https://www.dropbox.com/scl/fi/ceoff4p95ftef6yldhl67/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=yyt9u11dk6kyha0cvturvrtvf&dl=1
2025/10/19 20:57:47  - 1kg_eur_x_hg38_md5  :  b9a4b8553dec202109f72281f33cb454
2025/10/19 20:57:47  - 1kg_eur_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/tux32myi6g18bx7nd1rdq/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=m4f0v3rblnzv7dj0lo6hsd0hb&dl=1
2025/10/19 20:57:47  - 1kg_sas_x_hg19  :  https://www.dropbox.com/scl/fi/592lbmrkfjn80twnvlt2q/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=zrar7nmltpsuznlyuqw9k77q6&dl=1
2025/10/19 20:57:47  - 1kg_sas_x_hg19_md5  :  f4f370274fe586d209ca6fddc4eceaaf
2025/10/19 20:57:47  - 1kg_sas_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/rmaybun6v248nmjmaz2tj/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=izv96vfajgdd5wsyuvx98ntvk&dl=1
2025/10/19 20:57:47  - 1kg_amr_x_hg19  :  https://www.dropbox.com/scl/fi/gwryyxs0ilgoazqvp39be/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=z46we3kshi9t96x7issl36bda&dl=1
2025/10/19 20:57:47  - 1kg_amr_x_hg19_md5  :  ead838f7059a80118e949959cf1a3ff3
2025/10/19 20:57:47  - 1kg_amr_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/r6g5893smgmnsir5r0v5n/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=sel2f4p7ctggf30udrhcy0psu&dl=1
2025/10/19 20:57:47  - 1kg_sas_x_hg38  :  https://www.dropbox.com/scl/fi/r6qa9l6h9rc5rvenjsdqo/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=mucn2zizrlkebn1e5q7rtzu8e&dl=1
2025/10/19 20:57:47  - 1kg_sas_x_hg38_md5  :  31c60999ebb9a13d17d21e02fd9d1f4c
2025/10/19 20:57:47  - 1kg_sas_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/5ktxs24sq6v8hvaowr4xv/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=5y2z66b3s6bgzoikdjvr46ccw&dl=1
2025/10/19 20:57:47  - 1kg_amr_x_hg38  :  https://www.dropbox.com/scl/fi/5brd1nh7u20oigtb17mkb/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=zt0d0nqmlq3u5uu6ukwvof6ta&dl=1
2025/10/19 20:57:47  - 1kg_amr_x_hg38_md5  :  bc7de683d603c8bbff02f5bec8d3469a
2025/10/19 20:57:47  - 1kg_amr_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/8bz6uwjgw8bj16uaz6kye/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=pkc2mepgosxdijpfru4vhum5v&dl=1
2025/10/19 20:57:47  - 1kg_afr_x_hg19  :  https://www.dropbox.com/scl/fi/kz5j4532pyaigceww0vg5/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=sti0g2ri004chu8b12pqgicvw&dl=1
2025/10/19 20:57:47  - 1kg_afr_x_hg19_md5  :  77807e5e315a6e47504c175b0aaece88
2025/10/19 20:57:47  - 1kg_afr_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/oadjwamy5pe1ilv2237gq/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=a10bhxrfa904dasrcuc40njq4&dl=1
2025/10/19 20:57:47  - 1kg_pan_x_hg19  :  https://www.dropbox.com/scl/fi/rwov9vszj8rx78u65dxnk/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=ej33zb9ulwdfseur1surz653z&dl=1
2025/10/19 20:57:47  - 1kg_pan_x_hg19_md5  :  389d474984ff82df79efd25c0dd66fc9
2025/10/19 20:57:47  - 1kg_pan_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/x0n1htmkbulqybr5cc2cb/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=8rga64u5gm9vp9whqwlj37hre&dl=1
2025/10/19 20:57:47  - 1kg_afr_x_hg38  :  https://www.dropbox.com/scl/fi/ef8h09lhg8vmdxfxsayv4/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=96xjxu546sbbq5hbgaihh84l8&dl=1
2025/10/19 20:57:47  - 1kg_afr_x_hg38_md5  :  b1410bb21e389a0f08fc2741d33fcc51
2025/10/19 20:57:47  - 1kg_afr_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/tlwhui80cy32mtx6hc8f6/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ana497w8is840ygarx3w1v5pc&dl=1
2025/10/19 20:57:47  - 1kg_pan_x_hg38  :  https://www.dropbox.com/scl/fi/tf7j5540jyzxz2oo7jtho/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=e872ihg5477mlu30vet1e5lvk&dl=1
2025/10/19 20:57:47  - 1kg_pan_x_hg38_md5  :  8ae424786a6bfe64c92ca6b9f96ee5e6
2025/10/19 20:57:47  - 1kg_pan_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/u28zo2sjbcmtfs69zqaya/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=t5psxvoewd1oog2hwzellc78p&dl=1
2025/10/19 20:57:47  - dbsnp_v151_hg19  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
2025/10/19 20:57:47  - dbsnp_v151_hg19_tbi  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz.tbi
2025/10/19 20:57:47  - dbsnp_v151_hg38  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz
2025/10/19 20:57:47  - dbsnp_v151_hg38_tbi  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi
2025/10/19 20:57:47  - dbsnp_v156_hg19  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz
2025/10/19 20:57:47  - dbsnp_v156_hg19_tbi  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz.tbi
2025/10/19 20:57:47  - dbsnp_v156_hg38  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz
2025/10/19 20:57:47  - dbsnp_v156_hg38_tbi  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi
2025/10/19 20:57:47  - ucsc_genome_hg19  :  http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
2025/10/19 20:57:47  - ucsc_genome_hg38  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
2025/10/19 20:57:47  - 1kg_dbsnp151_hg19_auto  :  https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1
2025/10/19 20:57:47  - 1kg_dbsnp151_hg19_auto_md5  :  7d1e7624fb6e4df7a2f6f05558d436b4
2025/10/19 20:57:47  - 1kg_dbsnp151_hg38_auto  :  https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1
2025/10/19 20:57:47  - 1kg_dbsnp151_hg38_auto_md5  :  4c7ef2d2415c18c286219e970fdda972
2025/10/19 20:57:47  - 1kg_dbsnp151_hg19_x  :  https://www.dropbox.com/scl/fi/ghwq2yh5bi7o411m1mw15/1kg_dbsnp151_hg19_X.txt.gz?rlkey=50du8e42qjdgzge0lcdiu7tv2&dl=1
2025/10/19 20:57:47  - 1kg_dbsnp151_hg19_x_md5  :  cbf0e3518ab73d6d8a96bab9d55c094d
2025/10/19 20:57:47  - 1kg_dbsnp151_hg38_x  :  https://www.dropbox.com/scl/fi/bqdrfh0dx561ir210tu92/1kg_dbsnp151_hg38_X.txt.gz?rlkey=jjetkirbflt02f3w8mrxdqa4g&dl=1
2025/10/19 20:57:47  - 1kg_dbsnp151_hg38_x_md5  :  48c05eeb1454c0dd4cbee3cb26382e8e
2025/10/19 20:57:47  - recombination_hg19  :  https://www.dropbox.com/s/wbesl8haxknonuc/recombination_hg19.tar.gz?dl=1
2025/10/19 20:57:47  - recombination_hg38  :  https://www.dropbox.com/s/vuo8mvqx0fpibzj/recombination_hg38.tar.gz?dl=1
2025/10/19 20:57:47  - ensembl_hg19_gtf  :  https://ftp.ensembl.org/pub/grch37/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz
2025/10/19 20:57:47  - ensembl_hg38_gtf  :  https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens//Homo_sapiens.GRCh38.109.chr.gtf.gz
2025/10/19 20:57:47  - refseq_hg19_gtf  :  https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz
2025/10/19 20:57:47  - refseq_hg38_gtf  :  https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz
2025/10/19 20:57:47  - testlink  :  https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1
2025/10/19 20:57:47  - testlink_tbi  :  https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1
2025/10/19 20:57:47  - 19to38  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
2025/10/19 20:57:47  - 19to13  :  https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/hg19-chm13v2.chain
2025/10/19 20:57:47  - 38to19  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
2025/10/19 20:57:47  - 38to13  :  https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/grch38-chm13v2.chain
2025/10/19 20:57:47  - 13to19  :  https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/chm13v2-hg19.chain
2025/10/19 20:57:47  - 13to38  :  https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/chm13v2-grch38.chain
2025/10/19 20:57:47  - 18to19  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz
2025/10/19 20:57:47  - 18to38  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg38.over.chain.gz
2025/10/19 20:57:47  - 1kg_hm3_hg38_eaf  :  https://www.dropbox.com/scl/fi/ymkqfsaec6mwjzlvxsm45/PAN.hapmap3.hg38.EAF.tsv.gz?rlkey=p1auef5y1kk7ui41k6j3s8b0z&dl=1
2025/10/19 20:57:47  - 1kg_hm3_hg19_eaf  :  https://www.dropbox.com/scl/fi/dmv9wtfchv6ahim86d49r/PAN.hapmap3.hg19.EAF.tsv.gz?rlkey=ywne2gj1rlm2nj42q9lt2d99n&dl=1

Out[18]:

{'1kg_eas_hg19': 'https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
 '1kg_eas_hg19_md5': 'c8c97434843c0da3113fc06879ead472',
 '1kg_eas_hg19_tbi': 'https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
 '1kg_eur_hg19': 'https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
 '1kg_eur_hg19_md5': '734069d895009d38c2f962bfbb6fab52',
 '1kg_eur_hg19_tbi': 'https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
 '1kg_eas_hg38': 'https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
 '1kg_eas_hg38_md5': 'f45e80bca9ef7b29e6b1832e6ac15375',
 '1kg_eas_hg38_tbi': 'https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
 '1kg_eur_hg38': 'https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
 '1kg_eur_hg38_md5': '228d3285fa99132cc6321e2925e0768d',
 '1kg_eur_hg38_tbi': 'https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
 '1kg_sas_hg19': 'https://www.dropbox.com/scl/fi/fubqvuj3p4ii4y35zknv8/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=5z50f66iltjchcaszznq5bczt&dl=1',
 '1kg_sas_hg19_md5': 'e2d3f9e2e6580d05e877e9effd435c4e',
 '1kg_sas_hg19_tbi': 'https://www.dropbox.com/scl/fi/icnmrnzee7ofdpx5l96tg/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=st8t88snby26q37rqi6zh5zck&dl=1',
 '1kg_amr_hg19': 'https://www.dropbox.com/scl/fi/bxa4zfngsxsc38rhtiv8c/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=ibcn8hb1n8n36j3u0jfzci267&dl=1',
 '1kg_amr_hg19_md5': '68d3cdf01cbabdae6e74a07795fa881c',
 '1kg_amr_hg19_tbi': 'https://www.dropbox.com/scl/fi/1zk16x7h4r89jurzwu05u/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=b4cere4w38zvzyfitfge3r8n0&dl=1',
 '1kg_sas_hg38': 'https://www.dropbox.com/scl/fi/jr3l5zz42py3kny2bccmj/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=x0t6tsy71jxzf021wfqdn8k5q&dl=1',
 '1kg_sas_hg38_md5': 'e5d79bea1958aa50c23f618d342ccc83',
 '1kg_sas_hg38_tbi': 'https://www.dropbox.com/scl/fi/02oia4ur5r7w9qgiuf6i9/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=00p9rxe0xzfs6hr1rg4d8oadm&dl=1',
 '1kg_amr_hg38': 'https://www.dropbox.com/scl/fi/4t4tyuhzp78uyb6tgkroq/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=p96gbs1tcdia31jnjv1b82kuz&dl=1',
 '1kg_amr_hg38_md5': '229fbd610001cf6f137b7f738352a44a',
 '1kg_amr_hg38_tbi': 'https://www.dropbox.com/scl/fi/x0dby543tr9xpaqj2i0ba/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=uj8o7j0cy0spipe174jn54sqs&dl=1',
 '1kg_afr_hg19': 'https://www.dropbox.com/scl/fi/tq4w9lyt5z47ym7grtrxg/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=k3bimeu3yr5loq8hohba5mr6k&dl=1',
 '1kg_afr_hg19_md5': 'f7b4425f39e8292dce6f13711e7f6c50',
 '1kg_afr_hg19_tbi': 'https://www.dropbox.com/scl/fi/0giiptu0btwj1kfm6jdzr/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=ucb5weprsc5prcg8hvtgmruxx&dl=1',
 '1kg_pan_hg19': 'https://www.dropbox.com/scl/fi/6b4j9z9knmllfnbx86aw6/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=eento8vg06zyrkvooc9wd4cvu&dl=1',
 '1kg_pan_hg19_md5': 'fed846482204487b60d33b21ddb18106',
 '1kg_pan_hg19_tbi': 'https://www.dropbox.com/scl/fi/stco946scio5tvto0ln4j/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=hfh53beb627lmqwv3d8mzqy0c&dl=1',
 '1kg_afr_hg38': 'https://www.dropbox.com/scl/fi/239xmm7qijtnsks97chc9/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=47en5fk1icbekpg7we3uot9g8&dl=1',
 '1kg_afr_hg38_md5': '3bb7923be0809a324d7b7633b8d58a3b',
 '1kg_afr_hg38_tbi': 'https://www.dropbox.com/scl/fi/3y3pg4yqwo2jaaamx1c8f/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=say0ihfwa51z3otgn4bjtze8p&dl=1',
 '1kg_pan_hg38': 'https://www.dropbox.com/scl/fi/nf01487smtmeq243ihfwm/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=3pefbkzxwcnejx4inynifpft7&dl=1',
 '1kg_pan_hg38_md5': '23bb86d748c4a66e85e087f647e8b60e',
 '1kg_pan_hg38_tbi': 'https://www.dropbox.com/scl/fi/hu7cttr4cenw5yjsm2775/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=568u7bkvkybm4wt2q9284o198&dl=1',
 '1kg_eas_x_hg19': 'https://www.dropbox.com/scl/fi/1inmw09rk35ncuq7tibmp/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=vcjpukgsb7gt4tizg1fvr7tr2&dl=1',
 '1kg_eas_x_hg19_md5': 'b2aced2a1522ed23818989b3153b7e91',
 '1kg_eas_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/uyxb9lfi88dqjp5l3vrzf/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=vt196d16h690dmzox5jyc33xx&dl=1',
 '1kg_eur_x_hg19': 'https://www.dropbox.com/scl/fi/6r4sc2yax8pk644piew2d/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=l5towjhyl733nrd1msjr1d8gl&dl=1',
 '1kg_eur_x_hg19_md5': '6380cb71eafe985d7b894029e979139b',
 '1kg_eur_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/yuid87x398yc9n8nc4bb1/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=01skm13sk6099y34zy6qvweqj&dl=1',
 '1kg_eas_x_hg38': 'https://www.dropbox.com/scl/fi/2m6i93vv1ooano0muukck/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=y6087mmt9kmls066mzobjeqp7&dl=1',
 '1kg_eas_x_hg38_md5': '8c6a35da51621f952a5b97cbcc832046',
 '1kg_eas_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/l6jpt86edarb4emehxwcy/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ddr1fcijb1bh2nso0q0updolh&dl=1',
 '1kg_eur_x_hg38': 'https://www.dropbox.com/scl/fi/ceoff4p95ftef6yldhl67/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=yyt9u11dk6kyha0cvturvrtvf&dl=1',
 '1kg_eur_x_hg38_md5': 'b9a4b8553dec202109f72281f33cb454',
 '1kg_eur_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/tux32myi6g18bx7nd1rdq/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=m4f0v3rblnzv7dj0lo6hsd0hb&dl=1',
 '1kg_sas_x_hg19': 'https://www.dropbox.com/scl/fi/592lbmrkfjn80twnvlt2q/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=zrar7nmltpsuznlyuqw9k77q6&dl=1',
 '1kg_sas_x_hg19_md5': 'f4f370274fe586d209ca6fddc4eceaaf',
 '1kg_sas_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/rmaybun6v248nmjmaz2tj/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=izv96vfajgdd5wsyuvx98ntvk&dl=1',
 '1kg_amr_x_hg19': 'https://www.dropbox.com/scl/fi/gwryyxs0ilgoazqvp39be/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=z46we3kshi9t96x7issl36bda&dl=1',
 '1kg_amr_x_hg19_md5': 'ead838f7059a80118e949959cf1a3ff3',
 '1kg_amr_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/r6g5893smgmnsir5r0v5n/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=sel2f4p7ctggf30udrhcy0psu&dl=1',
 '1kg_sas_x_hg38': 'https://www.dropbox.com/scl/fi/r6qa9l6h9rc5rvenjsdqo/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=mucn2zizrlkebn1e5q7rtzu8e&dl=1',
 '1kg_sas_x_hg38_md5': '31c60999ebb9a13d17d21e02fd9d1f4c',
 '1kg_sas_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/5ktxs24sq6v8hvaowr4xv/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=5y2z66b3s6bgzoikdjvr46ccw&dl=1',
 '1kg_amr_x_hg38': 'https://www.dropbox.com/scl/fi/5brd1nh7u20oigtb17mkb/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=zt0d0nqmlq3u5uu6ukwvof6ta&dl=1',
 '1kg_amr_x_hg38_md5': 'bc7de683d603c8bbff02f5bec8d3469a',
 '1kg_amr_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/8bz6uwjgw8bj16uaz6kye/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=pkc2mepgosxdijpfru4vhum5v&dl=1',
 '1kg_afr_x_hg19': 'https://www.dropbox.com/scl/fi/kz5j4532pyaigceww0vg5/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=sti0g2ri004chu8b12pqgicvw&dl=1',
 '1kg_afr_x_hg19_md5': '77807e5e315a6e47504c175b0aaece88',
 '1kg_afr_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/oadjwamy5pe1ilv2237gq/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=a10bhxrfa904dasrcuc40njq4&dl=1',
 '1kg_pan_x_hg19': 'https://www.dropbox.com/scl/fi/rwov9vszj8rx78u65dxnk/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=ej33zb9ulwdfseur1surz653z&dl=1',
 '1kg_pan_x_hg19_md5': '389d474984ff82df79efd25c0dd66fc9',
 '1kg_pan_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/x0n1htmkbulqybr5cc2cb/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=8rga64u5gm9vp9whqwlj37hre&dl=1',
 '1kg_afr_x_hg38': 'https://www.dropbox.com/scl/fi/ef8h09lhg8vmdxfxsayv4/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=96xjxu546sbbq5hbgaihh84l8&dl=1',
 '1kg_afr_x_hg38_md5': 'b1410bb21e389a0f08fc2741d33fcc51',
 '1kg_afr_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/tlwhui80cy32mtx6hc8f6/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ana497w8is840ygarx3w1v5pc&dl=1',
 '1kg_pan_x_hg38': 'https://www.dropbox.com/scl/fi/tf7j5540jyzxz2oo7jtho/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=e872ihg5477mlu30vet1e5lvk&dl=1',
 '1kg_pan_x_hg38_md5': '8ae424786a6bfe64c92ca6b9f96ee5e6',
 '1kg_pan_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/u28zo2sjbcmtfs69zqaya/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=t5psxvoewd1oog2hwzellc78p&dl=1',
 'dbsnp_v151_hg19': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz',
 'dbsnp_v151_hg19_tbi': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz.tbi',
 'dbsnp_v151_hg38': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz',
 'dbsnp_v151_hg38_tbi': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi',
 'dbsnp_v156_hg19': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz',
 'dbsnp_v156_hg19_tbi': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz.tbi',
 'dbsnp_v156_hg38': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz',
 'dbsnp_v156_hg38_tbi': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi',
 'ucsc_genome_hg19': 'http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz',
 'ucsc_genome_hg38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz',
 '1kg_dbsnp151_hg19_auto': 'https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1',
 '1kg_dbsnp151_hg19_auto_md5': '7d1e7624fb6e4df7a2f6f05558d436b4',
 '1kg_dbsnp151_hg38_auto': 'https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1',
 '1kg_dbsnp151_hg38_auto_md5': '4c7ef2d2415c18c286219e970fdda972',
 '1kg_dbsnp151_hg19_x': 'https://www.dropbox.com/scl/fi/ghwq2yh5bi7o411m1mw15/1kg_dbsnp151_hg19_X.txt.gz?rlkey=50du8e42qjdgzge0lcdiu7tv2&dl=1',
 '1kg_dbsnp151_hg19_x_md5': 'cbf0e3518ab73d6d8a96bab9d55c094d',
 '1kg_dbsnp151_hg38_x': 'https://www.dropbox.com/scl/fi/bqdrfh0dx561ir210tu92/1kg_dbsnp151_hg38_X.txt.gz?rlkey=jjetkirbflt02f3w8mrxdqa4g&dl=1',
 '1kg_dbsnp151_hg38_x_md5': '48c05eeb1454c0dd4cbee3cb26382e8e',
 'recombination_hg19': 'https://www.dropbox.com/s/wbesl8haxknonuc/recombination_hg19.tar.gz?dl=1',
 'recombination_hg38': 'https://www.dropbox.com/s/vuo8mvqx0fpibzj/recombination_hg38.tar.gz?dl=1',
 'ensembl_hg19_gtf': 'https://ftp.ensembl.org/pub/grch37/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz',
 'ensembl_hg38_gtf': 'https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens//Homo_sapiens.GRCh38.109.chr.gtf.gz',
 'refseq_hg19_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz',
 'refseq_hg38_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz',
 'testlink': 'https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1',
 'testlink_tbi': 'https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1',
 '19to38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz',
 '19to13': 'https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/hg19-chm13v2.chain',
 '38to19': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz',
 '38to13': 'https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/grch38-chm13v2.chain',
 '13to19': 'https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/chm13v2-hg19.chain',
 '13to38': 'https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/chm13v2-grch38.chain',
 '18to19': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz',
 '18to38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg38.over.chain.gz',
 '1kg_hm3_hg38_eaf': 'https://www.dropbox.com/scl/fi/ymkqfsaec6mwjzlvxsm45/PAN.hapmap3.hg38.EAF.tsv.gz?rlkey=p1auef5y1kk7ui41k6j3s8b0z&dl=1',
 '1kg_hm3_hg19_eaf': 'https://www.dropbox.com/scl/fi/dmv9wtfchv6ahim86d49r/PAN.hapmap3.hg19.EAF.tsv.gz?rlkey=ywne2gj1rlm2nj42q9lt2d99n&dl=1'}

You can see the current available reference files (from the original source or pre-processed by gwaslab).

download reference using gwaslab¶

Select the keyword and use download_ref to download the files. The downloaded files will be placed in ~/.gwasalb by default.

1kg_eas_hg19 : processed 1000 Genomes Project EAS samples dataset(hg19; ~2.8GB) It may take several minutes to download.

In [19]:

Copied!

# ~2.8GB
# gl.download_ref("1kg_eas_hg19")
# ~2.8GB
# gl.download_ref("1kg_eas_hg19")

In [20]:

Copied!

# gl.check_downloaded_ref()
# gl.check_downloaded_ref()

After downloading, use get_path to obtain the file path by specifying the keyword.

In [21]:

Copied!

# gl.get_path("1kg_eas_hg19")
# gl.get_path("1kg_eas_hg19")

Note:

If tabix is available in PATH, the speed will be greatly improved. Otherwise, vcf files will be loaded from the head.
tabix: http://www.htslib.org/download/

Region plots with or with out LD reference files.¶

To create region plot with LD information, you need pass the path of reference vcf to vcf_path.

For this tutorial, we just use the 1kg_eas_hg19.chr7_126253550_128253550.vcf.gz included in the sample dataset.
You can also use the downloaded preprocessed file like vcf_path = gl.get_path("1kg_eas_hg19") or your own vcf files. )

In [22]:

Copied!





mysumstats.plot_mqq(mode="r",
                    region=(7,126253550,128253550),
                    region_grid=True,
                    anno=True,
                    anno_args={"rotation":0,"fontsize":12},
                    vcf_path="sample_data/1kg_eas_hg19.chr7_126253550_128253550.vcf.gz")
mysumstats.plot_mqq(mode="r",
                    region=(7,126253550,128253550),
                    region_grid=True,
                    anno=True,
                    anno_args={"rotation":0,"fontsize":12},
                    vcf_path="sample_data/1kg_eas_hg19.chr7_126253550_128253550.vcf.gz")

2025/10/19 20:57:47 Start to create MQQ plot...v3.6.9:
2025/10/19 20:57:47  -Genomic coordinates version: 19...
2025/10/19 20:57:47  -Genome-wide significance level to plot is set to 5e-08 ...
2025/10/19 20:57:47  -Raw input contains 1103020 variants...
2025/10/19 20:57:47  -MQQ plot layout mode is : r
2025/10/19 20:57:47  -Region to plot : chr7:126253550-128253550.
2025/10/19 20:57:47  -Checking chromosome notations in VCF/BCF files...
2025/10/19 20:57:47  -Checking prefix for chromosomes in VCF/BCF files...
2025/10/19 20:57:47  -No prefix for chromosomes in the VCF/BCF files.
2025/10/19 20:57:47  -Extract SNPs in region : chr7:126253550-128253550...
2025/10/19 20:57:47  -Extract SNPs in specified regions: 8857
2025/10/19 20:57:47 Finished loading specified columns from the sumstats.
2025/10/19 20:57:47 Start data conversion and sanity check:
2025/10/19 20:57:47  -Removed 0 variants with nan in CHR or POS column ...
2025/10/19 20:57:47  -Removed 0 variants with CHR <=0...
2025/10/19 20:57:47  -Removed 0 variants with nan in P column ...
2025/10/19 20:57:47  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/10/19 20:57:47  -Sumstats P values are being converted to -log10(P)...
2025/10/19 20:57:47  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/10/19 20:57:47  -Converting data above cut line...
2025/10/19 20:57:47  -Maximum -log10(P) value is 73.38711023071251 .
2025/10/19 20:57:47 Finished data conversion and sanity check.
2025/10/19 20:57:47 Start to create MQQ plot with 8857 variants...
2025/10/19 20:57:47  -tabix will be used: /home/yunye/tools/bin/tabix
2025/10/19 20:57:47 Start to load reference genotype...
2025/10/19 20:57:47  -reference vcf path : sample_data/1kg_eas_hg19.chr7_126253550_128253550.vcf.gz
2025/10/19 20:57:50  -Retrieving index...
2025/10/19 20:57:50  -Ref variants in the region: 54234
2025/10/19 20:57:50  -Matching variants using POS, NEA, EA ...
2025/10/19 20:57:51  -Calculating Rsq...
2025/10/19 20:57:51 Finished loading reference genotype successfully!
2025/10/19 20:57:51  -Creating background plot...
2025/10/19 20:57:51  -Extracting lead variant...
2025/10/19 20:57:51  -Loading gtf files from:default

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']

2025/10/19 20:58:23  -plotting gene track..
2025/10/19 20:58:23  -plotting genes: 14..
2025/10/19 20:58:23  -plotting exons: 675..
2025/10/19 20:58:23  -Finished plotting gene track..
2025/10/19 20:58:23 Finished creating MQQ plot successfully!
2025/10/19 20:58:23 Start to extract variants for annotation...
2025/10/19 20:58:23  -Found 1 significant variants with a sliding window size of 500 kb...
2025/10/19 20:58:23 Finished extracting variants for annotation...
2025/10/19 20:58:23 Start to process figure arts.
2025/10/19 20:58:23  -Processing X ticks...
2025/10/19 20:58:23  -Processing X labels...
2025/10/19 20:58:23  -Processing Y labels...
2025/10/19 20:58:23  -Processing Y tick lables...
2025/10/19 20:58:23  -Processing Y labels...
2025/10/19 20:58:23  -Processing color bar...
2025/10/19 20:58:23  -Processing lines...
2025/10/19 20:58:23 Finished processing figure arts.
2025/10/19 20:58:23 Start to annotate variants...
2025/10/19 20:58:23  -Annotating using column CHR:POS...
2025/10/19 20:58:23  -Adjusting text positions with repel_force=0.03...
2025/10/19 20:58:23 Finished annotating variants.
2025/10/19 20:58:23 Start to save figure...
2025/10/19 20:58:23  -Skip saving figure!
2025/10/19 20:58:23 Finished saving figure...
2025/10/19 20:58:24 Finished creating plot successfully!

Out[22]:

(<Figure size 3000x2000 with 4 Axes>, <gwaslab.g_Log.Log at 0x7fed335eb950>)

Note:

GWASLab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed.
For gene tracks, default is gtf_path="ensembl" , you can also use gtf_path="refseq" (NCBA RefSeq)

Fix SNPID¶

You may notice that the SNPID is in CHR:POS_REF_ALT format. We want SNPID to be in a standardized format CHR:POS:REF:ALT, we can use fix_id for this:

For other options of standardization, see: https://cloufield.github.io/gwaslab/Standardization/

In [23]:

Copied!

#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)
#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)

2025/10/19 20:58:24 Start to check SNPID/rsID...v3.6.9
2025/10/19 20:58:24  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 96.94 MB
2025/10/19 20:58:24  -Checking SNPID data type...
2025/10/19 20:58:24  -Checking NA strings :na,NA,Na,Nan,NaN,<NA>,null,NULL,#N/A,#VALUE!,N/A,n/a,missing,
2025/10/19 20:58:24  -Checking if SNPID contains NA strings...
2025/10/19 20:58:25  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
2025/10/19 20:58:26  -Replacing [_-] in SNPID with ":" ...
2025/10/19 20:58:27 Finished checking SNPID/rsID.

In [24]:

Copied!

mysumstats.data
mysumstats.data

Out[24]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS
0	1:752566:G:A	1	752566	G	A	0.8422	-0.0155	0.0131	0.2350	166718	1960099
1	1:752721:A:G	1	752721	G	A	0.2507	0.0204	0.0147	0.1650	166718	1960099
2	1:754182:A:G	1	754182	G	A	0.2505	0.0222	0.0166	0.1817	166718	1960099
3	1:760912:C:T	1	760912	C	T	0.8425	-0.0171	0.0148	0.2480	166718	1960099
4	1:761147:T:C	1	761147	C	T	0.1581	0.0171	0.0148	0.2480	166718	1960099
...	...	...	...	...	...	...	...	...	...	...	...
1103015	X:154343911:A:G	23	154343911	G	A	0.8058	0.0019	0.0090	0.8297	191764	1960099
1103016	X:154379088:C:A	23	154379088	C	A	0.7783	0.0027	0.0094	0.7723	191764	1960099
1103017	X:154536836:C:T	23	154536836	C	T	0.2196	-0.0084	0.0085	0.3192	191764	1960099
1103018	X:154763036:A:G	23	154763036	G	A	0.3686	-0.0102	0.0105	0.3302	191764	1960099
1103019	X:154816439:A:C	23	154816439	C	A	0.2339	-0.0083	0.0100	0.4021	191764	1960099

1103020 rows × 11 columns

Annotate rsID¶

rsID is assigned using two types of reference file:

ref_rsid_tsv : tsv file for annotation of commonly used variants
ref_rsid_vcf : vcf file for annotation of other variants

GWASLab provides preprocessed tsv files for 1KG varaints (~80M), we can download the file using .download_ref with key words. This time we will use 1kg_dbsnp151_hg19_auto, which is the SNPID-rsID conversion table for autosomal variants in 1KG project (hg19). This will take around a few minutes to dowadload.

In [25]:

Copied!

# 961M
# gl.download_ref("1kg_dbsnp151_hg19_auto")
# 961M
# gl.download_ref("1kg_dbsnp151_hg19_auto")

For now, we just use the "1kg_dbsnp151_hg19_auto_hm3_chr7_variants.txt.gz" included in the sample dataset.

In [26]:

Copied!

mysumstats.assign_rsid(ref_rsid_tsv= "sample_data/1kg_dbsnp151_hg19_auto_hm3_chr7_variants.txt.gz")
mysumstats.assign_rsid(ref_rsid_tsv= "sample_data/1kg_dbsnp151_hg19_auto_hm3_chr7_variants.txt.gz")

2025/10/19 20:58:27 Start to assign rsID by matching SNPID with CHR:POS:REF:ALT in the reference TSV...v3.6.9
2025/10/19 20:58:27  -Current Dataframe shape : 1103020 x 11 ; Memory usage: 96.94 MB
2025/10/19 20:58:27  -Number of threads/cores to use: 1
2025/10/19 20:58:27  -Reference TSV: sample_data/1kg_dbsnp151_hg19_auto_hm3_chr7_variants.txt.gz
2025/10/19 20:58:27  -1103020 rsID could be possibly fixed...
2025/10/19 20:58:27  -Setting block size:  5000000
2025/10/19 20:58:27  -Loading block: 0   
2025/10/19 20:58:30  -rsID annotation for 2564 needed to be fixed!
2025/10/19 20:58:30  -Annotated 1100456 rsID successfully!
2025/10/19 20:58:30 Finished assign rsID using reference file.

In [27]:

Copied!

mysumstats.data
mysumstats.data

Out[27]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS	rsID
0	1:752566:G:A	1	752566	G	A	0.8422	-0.0155	0.0131	0.2350	166718	1960099	rs3094315
1	1:752721:A:G	1	752721	G	A	0.2507	0.0204	0.0147	0.1650	166718	1960099	rs3131972
2	1:754182:A:G	1	754182	G	A	0.2505	0.0222	0.0166	0.1817	166718	1960099	rs3131969
3	1:760912:C:T	1	760912	C	T	0.8425	-0.0171	0.0148	0.2480	166718	1960099	rs1048488
4	1:761147:T:C	1	761147	C	T	0.1581	0.0171	0.0148	0.2480	166718	1960099	rs3115850
...	...	...	...	...	...	...	...	...	...	...	...	...
1103015	X:154343911:A:G	23	154343911	G	A	0.8058	0.0019	0.0090	0.8297	191764	1960099	<NA>
1103016	X:154379088:C:A	23	154379088	C	A	0.7783	0.0027	0.0094	0.7723	191764	1960099	<NA>
1103017	X:154536836:C:T	23	154536836	C	T	0.2196	-0.0084	0.0085	0.3192	191764	1960099	<NA>
1103018	X:154763036:A:G	23	154763036	G	A	0.3686	-0.0102	0.0105	0.3302	191764	1960099	<NA>
1103019	X:154816439:A:C	23	154816439	C	A	0.2339	-0.0083	0.0100	0.4021	191764	1960099	<NA>

1103020 rows × 12 columns

We annotated 1100456 (1100456 /1103020 = 99.7%) variants using the conversion table.
For the annotation of other variants (0.3%), we may need the dbSNP reference vcf.

Maximize the annotation¶

For the annotation of not-so-common variants, we can use vcf files downloaded form NCBI dbsnp (https://ftp.ncbi.nih.gov/snp/latest_release/VCF/).

Note:

The file size is huge (>20GB) and it might take several hours to download. (we can skip this step in this tutorial. )
Specify chr_dict to match the chromosome notations in sumstats and in vcf files.

Parameters :

ref_rsid_vcf: the path to the reference rsID vcf file
n_cores : number of threads to use

For this tutorial, we use b157_2564.vcf.gz in the sample dataset, which is extracted from the b157 version of dbSNP VCF.

In [28]:

Copied!

mysumstats.assign_rsid(n_cores=1,
                       ref_rsid_vcf= "sample_data/b157_2564.vcf.gz",
                       chr_dict = gl.get_number_to_NC(build="19") )
mysumstats.assign_rsid(n_cores=1,
                       ref_rsid_vcf= "sample_data/b157_2564.vcf.gz",
                       chr_dict = gl.get_number_to_NC(build="19") )

2025/10/19 20:58:30 Start to assign rsID using reference VCF...v3.6.9
2025/10/19 20:58:30  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:58:30  -Number of threads/cores to use: 1
2025/10/19 20:58:30  -Reference VCF: sample_data/b157_2564.vcf.gz
2025/10/19 20:58:30  -Assigning rsID based on CHR:POS and REF:ALT/ALT:REF...
2025/10/19 20:58:32  -rsID Annotation for 98 need to be fixed!
2025/10/19 20:58:32  -Annotated 2466 rsID successfully!
2025/10/19 20:58:32 Finished assign rsID using reference file.

After this, only 98 variants were not annotated, mostly indels that are not available in the reference VCF.

In [29]:

Copied!

mysumstats.data.loc[mysumstats.data["rsID"].isna(),:]
mysumstats.data.loc[mysumstats.data["rsID"].isna(),:]

Out[29]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS	rsID
514556	7:126557427:CCACACACACA:C	7	126557427	C	CCACACACACA	0.9952	-0.0497	0.0724	0.49210	191764	1960399	<NA>
518422	7:127424966:C:CTTGTTTTGTTTTGTTTTGTTTTG	7	127424966	C	CTTGTTTTGTTTTGTTTTGTTTTG	0.0034	-0.1844	0.1342	0.16940	191764	1960399	<NA>
519593	7:127765815:A:AAAATAAATAAATAAATAAAT	7	127765815	AAAATAAATAAATAAATAAAT	A	0.9442	-0.0262	0.0239	0.27350	191764	1960399	<NA>
519882	7:127825806:T:TACACAC	7	127825806	TACACAC	T	0.9640	0.0721	0.0283	0.01093	191764	1960399	<NA>
519883	7:127825806:T:TACAC	7	127825806	TACAC	T	0.8700	0.0232	0.0143	0.10540	191764	1960399	<NA>
...	...	...	...	...	...	...	...	...	...	...	...	...
1102883	X:145493373:TAA:T	23	145493373	TAA	T	0.1006	0.0098	0.0120	0.41410	191764	1960399	<NA>
1102893	X:145943926:TA:T	23	145943926	TA	T	0.5974	-0.0076	0.0118	0.52130	191764	1960399	<NA>
1102959	X:151172462:CGT:C	23	151172462	C	CGT	0.7959	0.0066	0.0138	0.63190	191764	1960399	<NA>
1102972	X:152032452:A:AT	23	152032452	AT	A	0.9633	0.0111	0.0239	0.64110	191764	1960399	<NA>
1103003	X:153254566:C:CTTT	23	153254566	C	CTTT	0.6664	-0.0014	0.0084	0.86770	191764	1960399	<NA>

98 rows × 12 columns

Harmonization¶

gwaslab can harmonize the sumstats based on reference files.

ref_seq : reference genome fasta file for allele alignment
ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
ref_alt_freq : field in INFO of vcf file for alternative allele frequency

For details see: https://cloufield.github.io/gwaslab/Harmonization/

For reference data, see: https://cloufield.github.io/gwaslab/Reference/

We can use the refernce genome from 1000 genomes.

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai

When calling .harmonize(), .basic_check() will be called first to make sure the dataset is ready for harmonization. Since we have already performed basic_check, we set basic_check=False here. For ref_infer, we pass the downloaded vcf files for 1KG EAS and specify the field for alternative allele frequency to AF by ref_alt_freq= "AF".

In [30]:

Copied!





mysumstats.harmonize(basic_check=False,
                     n_cores=3,
                     ref_seq     =  "/home/yunye/.gwaslab/hg19.fa",
                     ref_infer   =  gl.get_path("1kg_eas_hg19"),
                     ref_alt_freq= "AF")
mysumstats.harmonize(basic_check=False,
                     n_cores=3,
                     ref_seq     =  "/home/yunye/.gwaslab/hg19.fa",
                     ref_infer   =  gl.get_path("1kg_eas_hg19"),
                     ref_alt_freq= "AF")

2025/10/19 20:58:32 Start to check if NEA is aligned with reference sequence...v3.6.9
2025/10/19 20:58:32  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:58:32  -Reference genome FASTA file: /home/yunye/.gwaslab/hg19.fa
2025/10/19 20:58:32  -Loading fasta records:1  2  3  4  5  6  7  X  8  9  10  11  12  13  14  15  16  17  18  20  19  22  21  
2025/10/19 20:58:49  -Checking records
2025/10/19 20:58:49    -Building numpy fasta records from dict
2025/10/19 20:59:02    -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )
2025/10/19 20:59:04    -Checking records for ( len(NEA) > 4 or len(EA) > 4 )
2025/10/19 20:59:05  -Finished checking records
2025/10/19 20:59:07  -Variants allele on given reference sequence :  530823
2025/10/19 20:59:07  -Variants flipped :  571268
2025/10/19 20:59:07   -Raw Matching rate :  99.92%
2025/10/19 20:59:07  -Variants inferred reverse_complement :  0
2025/10/19 20:59:07  -Variants inferred reverse_complement_flipped :  0
2025/10/19 20:59:07  -Both allele on genome + unable to distinguish :  929
2025/10/19 20:59:07  -Variants not on given reference sequence :  0
2025/10/19 20:59:07 Finished checking if NEA is aligned with reference sequence.
2025/10/19 20:59:07 Start to adjust statistics based on STATUS code...v3.6.9
2025/10/19 20:59:07  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:59:08 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.6.9
2025/10/19 20:59:08  -Flipping 571268 variants...
2025/10/19 20:59:08  -Swapping column: NEA <=> EA...
2025/10/19 20:59:08  -Flipping column: BETA = - BETA...
2025/10/19 20:59:08  -Flipping column: EAF = 1 - EAF...
2025/10/19 20:59:08  -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x
2025/10/19 20:59:09 Finished adjusting statistics based on STATUS code.
2025/10/19 20:59:10 Start to infer strand for palindromic SNPs/align indistinguishable indels...v3.6.9
2025/10/19 20:59:10  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:59:10  -Number of threads/cores to use: 3
2025/10/19 20:59:10  -Reference VCF: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
2025/10/19 20:59:10  -Checking chromosome notations in VCF/BCF files...
2025/10/19 20:59:10  -Checking prefix for chromosomes in VCF/BCF files...
2025/10/19 20:59:10  -No prefix for chromosomes in the VCF/BCF files.
2025/10/19 20:59:10  -Field for alternative allele frequency in VCF INFO: AF
2025/10/19 20:59:11  -Identified  1431  palindromic SNPs...
2025/10/19 20:59:12  -After filtering by MAF< 0.4 , 1334 palindromic SNPs with unknown strand will be inferred...
2025/10/19 20:59:12  -Starting strand inference for palindromic SNPs...
2025/10/19 20:59:14  -Finished strand inference.
2025/10/19 20:59:15   -Non-palindromic :  1100575
2025/10/19 20:59:15   -Palindromic SNPs on + strand:  0
2025/10/19 20:59:15   -Palindromic SNPs on - strand and needed to be flipped: 1014
2025/10/19 20:59:15   -Palindromic SNPs with MAF not available to infer :  97
2025/10/19 20:59:15   -Palindromic SNPs with no macthes or no information :  164
2025/10/19 20:59:16  -Identified  929  indistinguishable Indels...
2025/10/19 20:59:16  -Indistinguishable indels will be inferred from reference vcf REF and ALT...
2025/10/19 20:59:16  -Difference in allele frequency (DAF) tolerance: 0.2
2025/10/19 20:59:16  -Starting indistinguishable indel inference...
2025/10/19 20:59:17  -Finished indistinguishable indel inference.
2025/10/19 20:59:18   -Indels ea/nea match reference :  27
2025/10/19 20:59:18   -Indels ea/nea need to be flipped :  31
2025/10/19 20:59:18   -Indels with no macthes or no information :  871
2025/10/19 20:59:18 Finished inferring strand for palindromic SNPs/align indistinguishable indels.
2025/10/19 20:59:18 Start to adjust statistics based on STATUS code...v3.6.9
2025/10/19 20:59:18  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:59:20 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: ALT->EA , REF->NEA...v3.6.9
2025/10/19 20:59:20  -Flipping 31 variants...
2025/10/19 20:59:20  -Swapping column: NEA <=> EA...
2025/10/19 20:59:20  -Flipping column: BETA = - BETA...
2025/10/19 20:59:20  -Flipping column: EAF = 1 - EAF...
2025/10/19 20:59:20  -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4
2025/10/19 20:59:21 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand...v3.6.9
2025/10/19 20:59:21  -Flipping 1014 variants...
2025/10/19 20:59:21  -Flipping column: BETA = - BETA...
2025/10/19 20:59:21  -Flipping column: EAF = 1 - EAF...
2025/10/19 20:59:21  -Changed the status for flipped variants:  xxxxx[012]5: ->  xxxxx[012]2
2025/10/19 20:59:21 Finished adjusting statistics based on STATUS code.
2025/10/19 20:59:22 Start to sort the genome coordinates...v3.6.9
2025/10/19 20:59:22  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:59:22 Finished sorting coordinates.
2025/10/19 20:59:22 Start to reorder the columns...v3.6.9
2025/10/19 20:59:22  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:59:22  -Reordering columns to    : SNPID,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,STATUS
2025/10/19 20:59:22 Finished reordering the columns.

Out[30]:

<gwaslab.g_Sumstats.Sumstats at 0x7fed3389e4e0>

In [31]:

Copied!

mysumstats.data
mysumstats.data

Out[31]:

	SNPID	rsID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS
0	1:752566:G:A	rs3094315	1	752566	A	G	0.1578	0.0155	0.0131	0.2350	166718	1960010
1	1:752721:A:G	rs3131972	1	752721	G	A	0.2507	0.0204	0.0147	0.1650	166718	1960000
2	1:754182:A:G	rs3131969	1	754182	G	A	0.2505	0.0222	0.0166	0.1817	166718	1960000
3	1:760912:C:T	rs1048488	1	760912	T	C	0.1575	0.0171	0.0148	0.2480	166718	1960010
4	1:761147:T:C	rs3115850	1	761147	C	T	0.1581	0.0171	0.0148	0.2480	166718	1960000
...	...	...	...	...	...	...	...	...	...	...	...	...
1103015	X:154343911:A:G	rs4898358	23	154343911	G	A	0.8058	0.0019	0.0090	0.8297	191764	1960000
1103016	X:154379088:C:A	rs5987090	23	154379088	A	C	0.2217	-0.0027	0.0094	0.7723	191764	1960010
1103017	X:154536836:C:T	rs12013168	23	154536836	T	C	0.7804	0.0084	0.0085	0.3192	191764	1960010
1103018	X:154763036:A:G	rs5940466	23	154763036	G	A	0.3686	-0.0102	0.0105	0.3302	191764	1960000
1103019	X:154816439:A:C	rs1084451	23	154816439	C	A	0.2339	-0.0083	0.0100	0.4021	191764	1960000

1103020 rows × 12 columns

Check the data again. Looks good!

Sumstats Summary¶

Check the summary of the currrent sumstats (see: https://cloufield.github.io/gwaslab/StatusCode/):

In [32]:

Copied!

mysumstats.summary()
mysumstats.summary()

Out[32]:

		Values	Percentage
Category	Items
META	Row_num	1103020	<NA>
	Column_num	12	<NA>
	Column_names	SNPID,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,STATUS	<NA>
	Last_checked_time	Sun Oct 19 20:59:23 2025	<NA>
MISSING	Missing_total	98	0.01
MISSING	Missing_rsID	98	0.01
MAF	Common	954139	86.5
	Low_frequency	105565	9.57
	Rare	43245	3.92
P	Minimum	4.05e-153	0.0
	Significant	1271	0.12
	Suggestive	2532	0.23
STATUS	1960010	570540	51.73
	1960000	530035	48.05
	1960368	871	0.08
	1960002	520	0.05
	1960012	494	0.04
	1960018	164	0.01
	1960008	156	0.01
	1960309	63	0.01
	1960007	49	0.0
	1960017	48	0.0
	1960364	31	0.0
	1960363	27	0.0
	1960319	22	0.0

Check the details of harmonization results .lookup_status()

In [33]:

Copied!

mysumstats.lookup_status()
mysumstats.lookup_status()

Out[33]:

	Genome_Build	rsID&SNPID	CHR&POS	Stadardize&Normalize	Align	Panlidromic_SNP&Indel	Count	Percentage(%)
1960000	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Not_palindromic_SNPs	530035	48.05
1960002	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Palindromic-strand_fixed	520	0.05
1960007	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Indistinguishable	49	0.0
1960008	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	No_matching_or_no_info	156	0.01
1960010	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Not_palindromic_SNPs	570540	51.73
1960012	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Palindromic-strand_fixed	494	0.04
1960017	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Indistinguishable	48	0.0
1960018	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	No_matching_or_no_info	164	0.01
1960309	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Match: NEA=REF	Unchecked	63	0.01
1960319	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Flipped_fixed	Unchecked	22	0.0
1960363	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Both_alleles_on_ref+indistinguishable	Indel_match	27	0.0
1960364	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Both_alleles_on_ref+indistinguishable	Indel_flipped_fixed	31	0.0
1960368	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Both_alleles_on_ref+indistinguishable	No_matching_or_no_info	871	0.08

Formatting and saving : to_format()¶

You can easily format the processed sumstats and save it. For details, see: https://cloufield.github.io/gwaslab/Format/

Let's export the sumstats as the default format used in LDSC.

In [34]:

Copied!

mysumstats.to_format("clean_sumstats",fmt="ldsc")
mysumstats.to_format("clean_sumstats",fmt="ldsc")

2025/10/19 20:59:25 Start to convert the output sumstats in:  ldsc  format
2025/10/19 20:59:25  -Formatting statistics ...
2025/10/19 20:59:27  -Float statistics formats:
2025/10/19 20:59:27   - Columns       : ['EAF', 'BETA', 'SE', 'P']
2025/10/19 20:59:27   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
2025/10/19 20:59:27  -Start outputting sumstats in ldsc format...
2025/10/19 20:59:27  -ldsc format will be loaded...
2025/10/19 20:59:27  -ldsc format meta info:
2025/10/19 20:59:27   - format_name  : ldsc
2025/10/19 20:59:27   - format_source  : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
2025/10/19 20:59:27   - format_source2  : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
2025/10/19 20:59:27   - format_version  :  20150306
2025/10/19 20:59:27  -gwaslab to ldsc format dictionary:
2025/10/19 20:59:27   - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS
2025/10/19 20:59:27   - ldsc values: SNP,A2,A1,Frq,N,Beta,P,Z,INFO,OR,CHR,POS
2025/10/19 20:59:27  -Output path: clean_sumstats.ldsc.tsv.gz
2025/10/19 20:59:27  -Output columns: SNP,CHR,POS,A1,A2,Frq,Beta,P,N
2025/10/19 20:59:27  -Writing sumstats to: clean_sumstats.ldsc.tsv.gz...
2025/10/19 20:59:27   -Fast to csv mode...
2025/10/19 20:59:29  -Saving log file to: clean_sumstats.ldsc.log
2025/10/19 20:59:29 Finished outputting successfully!

Sometimes we only need to export part of the sumstats. For example,

we can specify hapmap3=True to export only hapmap3 variants;
specify exclude_hla=True to exclude variants in HLA region when exporting;
specify md5sum=True to calculate the md5sum value for the exported sumstats.

In [35]:

Copied!

mysumstats.to_format("clean_sumstats",fmt="ldsc",hapmap3=True,exclude_hla=True,md5sum=True)
mysumstats.to_format("clean_sumstats",fmt="ldsc",hapmap3=True,exclude_hla=True,md5sum=True)

2025/10/19 20:59:29 Start to convert the output sumstats in:  ldsc  format
2025/10/19 20:59:29  -Excluded 2757 variants in HLA region (chr6: 25000000-34000000 )...
2025/10/19 20:59:29 Start to extract HapMap3 SNPs...v3.6.9
2025/10/19 20:59:29  -Current Dataframe shape : 1100263 x 12 ; Memory usage: 113.54 MB
2025/10/19 20:59:29  -Loading Hapmap3 variants from built-in datasets...
2025/10/19 20:59:31  -rsID will be used for matching...
2025/10/19 20:59:33  -Raw input contains 1090439 Hapmap3 variants based on rsID...
2025/10/19 20:59:33  -Checking if alleles are same...
2025/10/19 20:59:33  -Filtered 0 Hapmap3 variants due to unmatech alleles...
2025/10/19 20:59:33  -Extract 1090439 variants in Hapmap3 datasets for build 19.
2025/10/19 20:59:33  -Formatting statistics ...
2025/10/19 20:59:35  -Float statistics formats:
2025/10/19 20:59:35   - Columns       : ['EAF', 'BETA', 'SE', 'P']
2025/10/19 20:59:35   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
2025/10/19 20:59:35  -Start outputting sumstats in ldsc format...
2025/10/19 20:59:35  -ldsc format will be loaded...
2025/10/19 20:59:35  -ldsc format meta info:
2025/10/19 20:59:35   - format_name  : ldsc
2025/10/19 20:59:35   - format_source  : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
2025/10/19 20:59:35   - format_source2  : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
2025/10/19 20:59:35   - format_version  :  20150306
2025/10/19 20:59:35  -gwaslab to ldsc format dictionary:
2025/10/19 20:59:35   - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS
2025/10/19 20:59:35   - ldsc values: SNP,A2,A1,Frq,N,Beta,P,Z,INFO,OR,CHR,POS
2025/10/19 20:59:35  -Output path: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz
2025/10/19 20:59:35  -Output columns: SNP,CHR,POS,A1,A2,Frq,Beta,P,N
2025/10/19 20:59:35  -Writing sumstats to: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz...
2025/10/19 20:59:35   -Fast to csv mode...
2025/10/19 20:59:37  -md5sum hashing for the file: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz
2025/10/19 20:59:37  -md5sum path: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz.md5sum
2025/10/19 20:59:37  -md5sum: a1436f8f17b81adce3fc1e14017a836f
2025/10/19 20:59:37  -Saving log file to: clean_sumstats.hapmap3.noMHC.ldsc.log
2025/10/19 20:59:37 Finished outputting successfully!

Output in GWAS-SSF format

In [36]:

Copied!

mysumstats.to_format("clean_sumstats",fmt="ssf",ssfmeta=True, md5sum=True)
mysumstats.to_format("clean_sumstats",fmt="ssf",ssfmeta=True, md5sum=True)

2025/10/19 20:59:37 Start to convert the output sumstats in:  ssf  format
2025/10/19 20:59:37  -Formatting statistics ...
2025/10/19 20:59:39  -Float statistics formats:
2025/10/19 20:59:39   - Columns       : ['EAF', 'BETA', 'SE', 'P']
2025/10/19 20:59:39   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
2025/10/19 20:59:39  -Replacing SNPID separator from ":" to "_"...
2025/10/19 20:59:39  -Start outputting sumstats in ssf format...
2025/10/19 20:59:39  -ssf format will be loaded...
2025/10/19 20:59:39  -ssf format meta info:
2025/10/19 20:59:39   - format_name  : ssf
2025/10/19 20:59:39   - format_source  : https://www.biorxiv.org/content/10.1101/2022.07.15.500230v1.full
2025/10/19 20:59:39   - format_cite_name  : GWAS-SSF v0.1
2025/10/19 20:59:39   - format_separator  : \t
2025/10/19 20:59:39   - format_na  : #NA
2025/10/19 20:59:39   - format_col_order  : chromosome,base_pair_location,effect_allele,other_allele,beta,odds_ratio,hazard_ratio,standard_error,effect_allele_frequency,p_value,neg_log_10_p_value,ci_upper,ci_lower,rsid,variant_id,info,ref_allele,n
2025/10/19 20:59:39   - format_version  :  20230328
2025/10/19 20:59:39  -gwaslab to ssf format dictionary:
2025/10/19 20:59:39   - gwaslab keys: SNPID,rsID,CHR,POS,NEA,EA,EAF,N,BETA,SE,P,MLOG10P,INFO,OR,HR,OR_95L,OR_95U
2025/10/19 20:59:39   - ssf values: variant_id,rsid,chromosome,base_pair_location,other_allele,effect_allele,effect_allele_frequency,n,beta,standard_error,p_value,neg_log_10_p_value,info,odds_ratio,hazard_ratio,ci_lower,ci_upper
2025/10/19 20:59:39  -Output path: clean_sumstats.ssf.tsv.gz
2025/10/19 20:59:39  -Output columns: chromosome,base_pair_location,effect_allele,other_allele,beta,standard_error,effect_allele_frequency,p_value,rsid,variant_id,n
2025/10/19 20:59:39  -Writing sumstats to: clean_sumstats.ssf.tsv.gz...
2025/10/19 20:59:39   -Fast to csv mode...
2025/10/19 20:59:42  -md5sum hashing for the file: clean_sumstats.ssf.tsv.gz
2025/10/19 20:59:42  -md5sum path: clean_sumstats.ssf.tsv.gz.md5sum
2025/10/19 20:59:42  -md5sum: 72da5fc163fce94c690e596d988ac48f
2025/10/19 20:59:42  -Exporting SSF-style meta data to clean_sumstats.ssf.tsv.gz.ssf.tsv-meta.yaml
2025/10/19 20:59:42  -Saving log file to: clean_sumstats.ssf.log
2025/10/19 20:59:42 Finished outputting successfully!

Liftover¶

gwaslab can perform liftover for base pair positions.

Note: GWASLab only liftover CHR and POS, and when lifted, the last two digits status code will be rolled back to 99. Since for different reference genome, the reference allele or strand might be reverse, so it is needed to harmonize again after liftover.

In [37]:

Copied!

mysumstats.liftover(n_cores=3, from_build="19", to_build="38")
mysumstats.liftover(n_cores=3, from_build="19", to_build="38")

2025/10/19 20:59:42 Start to perform liftover...v3.6.9
2025/10/19 20:59:42  -Current Dataframe shape : 1103020 x 12 ; Memory usage: 105.35 MB
2025/10/19 20:59:42  -Number of threads/cores to use: 3
2025/10/19 20:59:42 No records in config file. Please download first.
2025/10/19 20:59:42 Start to download  19to38  ...
2025/10/19 20:59:42  -Downloading to: /home/yunye/.gwaslab/hg19ToHg38.over.chain.gz
2025/10/19 20:59:42  -File /home/yunye/.gwaslab/hg19ToHg38.over.chain.gz exists.
2025/10/19 20:59:42  -Updating record in config file...
2025/10/19 20:59:42 Downloaded  19to38  successfully!
2025/10/19 20:59:42  -Creating converter using provided ChainFile: /home/yunye/.gwaslab/hg19ToHg38.over.chain.gz
2025/10/19 20:59:42  -Creating converter : 19 -> 38
2025/10/19 20:59:43  -Converting variants with status code xxx0xxx :1103020...
2025/10/19 21:00:52  -Removed unmapped variants: 174
2025/10/19 21:00:52 Start to fix chromosome notation (CHR)...v3.6.9
2025/10/19 21:00:52  -Current Dataframe shape : 1102846 x 12 ; Memory usage: 113.75 MB
2025/10/19 21:00:52  -Checking CHR data type...
2025/10/19 21:00:52  -Variants with standardized chromosome notation: 1102846
2025/10/19 21:00:53  -All CHR are already fixed...
2025/10/19 21:00:54 Finished fixing chromosome notation (CHR).
2025/10/19 21:00:54 Start to fix basepair positions (POS)...v3.6.9
2025/10/19 21:00:54  -Current Dataframe shape : 1102846 x 12 ; Memory usage: 113.75 MB
2025/10/19 21:00:54  -Converting to Int64 data type ...
2025/10/19 21:00:55  -Position bound:(0 , 250,000,000)
2025/10/19 21:00:55  -Removed outliers: 0
2025/10/19 21:00:56  -Removed 0 variants with bad positions.
2025/10/19 21:00:56 Finished fixing basepair positions (POS).
2025/10/19 21:00:56 Finished liftover.

In [38]:

Copied!

mysumstats.data
mysumstats.data

Out[38]:

	SNPID	rsID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	STATUS
0	1:752566:G:A	rs3094315	1	817186	A	G	0.1578	0.0155	0.0131	0.2350	166718	3860099
1	1:752721:A:G	rs3131972	1	817341	G	A	0.2507	0.0204	0.0147	0.1650	166718	3860099
2	1:754182:A:G	rs3131969	1	818802	G	A	0.2505	0.0222	0.0166	0.1817	166718	3860099
3	1:760912:C:T	rs1048488	1	825532	T	C	0.1575	0.0171	0.0148	0.2480	166718	3860099
4	1:761147:T:C	rs3115850	1	825767	C	T	0.1581	0.0171	0.0148	0.2480	166718	3860099
...	...	...	...	...	...	...	...	...	...	...	...	...
1103015	X:154343911:A:G	rs4898358	23	155115636	G	A	0.8058	0.0019	0.0090	0.8297	191764	3860099
1103016	X:154379088:C:A	rs5987090	23	155150813	A	C	0.2217	-0.0027	0.0094	0.7723	191764	3860099
1103017	X:154536836:C:T	rs12013168	23	155307523	T	C	0.7804	0.0084	0.0085	0.3192	191764	3860099
1103018	X:154763036:A:G	rs5940466	23	155533375	G	A	0.3686	-0.0102	0.0105	0.3302	191764	3860099
1103019	X:154816439:A:C	rs1084451	23	155586778	C	A	0.2339	-0.0083	0.0100	0.4021	191764	3860099

1102846 rows × 12 columns

For details, see https://cloufield.github.io/gwaslab/LiftOver/

LDSC¶

Reload the sample data and run LDSC in GWASLab

In [39]:

Copied!

mysumstats = gl.Sumstats("./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz", fmt="gwaslab", build="19", verbose=False)
mysumstats = gl.Sumstats("./sample_data/bbj_t2d_hm3_chr7_variants.txt.gz", fmt="gwaslab", build="19", verbose=False)

In [40]:

Copied!

mysumstats.filter_hapmap3(inplace=True)
mysumstats.filter_hapmap3(inplace=True)

2025/10/19 21:00:58 Start to extract HapMap3 SNPs...v3.6.9
2025/10/19 21:00:58  -Current Dataframe shape : 1103020 x 13 ; Memory usage: 113.76 MB
2025/10/19 21:00:58  -Loading Hapmap3 variants from built-in datasets...
2025/10/19 21:00:59  -rsID will be used for matching...
2025/10/19 21:01:01  -Raw input contains 1092430 Hapmap3 variants based on rsID...
2025/10/19 21:01:01  -Checking if alleles are same...
2025/10/19 21:01:02  -Filtered 0 Hapmap3 variants due to unmatech alleles...

In [41]:

Copied!

# pass your ld score files downloaded from LDSC
# https://github.com/bulik/ldsc

mysumstats.estimate_h2_by_ldsc(ref_ld_chr = "~/tools/ldsc/ldscores/eas_ldscores/", 
                                 w_ld_chr = "~/tools/ldsc/ldscores/eas_ldscores/")
# pass your ld score files downloaded from LDSC
# https://github.com/bulik/ldsc

mysumstats.estimate_h2_by_ldsc(ref_ld_chr = "~/tools/ldsc/ldscores/eas_ldscores/", 
                                 w_ld_chr = "~/tools/ldsc/ldscores/eas_ldscores/")

2025/10/19 21:01:02 Start to extract HapMap3 SNPs...v3.6.9
2025/10/19 21:01:02  -Current Dataframe shape : 1092430 x 13 ; Memory usage: 121.21 MB
2025/10/19 21:01:02  -Loading Hapmap3 variants from built-in datasets...
2025/10/19 21:01:03  -rsID will be used for matching...
2025/10/19 21:01:05  -Raw input contains 1213752 Hapmap3 variants based on rsID...
2025/10/19 21:01:05  -Checking if alleles are same...
2025/10/19 21:01:06  -Filtered 0 Hapmap3 variants due to unmatech alleles...
2025/10/19 21:01:06 Start to run LD score regression...v3.6.9
2025/10/19 21:01:06  -Current Dataframe shape : 1213752 x 13 ; Memory usage: 132.21 MB
2025/10/19 21:01:06  -Run single variate LD score regression:
2025/10/19 21:01:06   -Adopted from LDSC source code: https://github.com/bulik/ldsc
2025/10/19 21:01:06   -Please cite LDSC: Bulik-Sullivan, et al. LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies. Nature Genetics, 2015.
2025/10/19 21:01:06  -Arguments:
2025/10/19 21:01:06   -ref_ld_chr:~/tools/ldsc/ldscores/eas_ldscores/
2025/10/19 21:01:06   -w_ld_chr:~/tools/ldsc/ldscores/eas_ldscores/
2025/10/19 21:01:06  -LDSC log:
2025/10/19 21:01:06   -Reading reference panel LD Score from ~/tools/ldsc/ldscores/eas_ldscores/[1-22] ... (ldscore_fromlist)
2025/10/19 21:01:08   -Read reference panel LD Scores for 1208050 SNPs.
2025/10/19 21:01:08   -Removing partitioned LD Scores with zero variance.
2025/10/19 21:01:08   -Reading regression weight LD Score from ~/tools/ldsc/ldscores/eas_ldscores/[1-22] ... (ldscore_fromlist)
2025/10/19 21:01:09   -Read regression weight LD Scores for 1208050 SNPs.
2025/10/19 21:01:10   -After merging with reference panel LD, 1079390 SNPs remain.
2025/10/19 21:01:11   -After merging with regression SNP LD, 1079390 SNPs remain.
2025/10/19 21:01:11   -Using two-step estimator with cutoff at 30.
2025/10/19 21:01:11  -Results have been stored in .ldsc_h2
2025/10/19 21:01:11 Finished running LD score regression.

In [42]:

Copied!

mysumstats.ldsc_h2
mysumstats.ldsc_h2

Out[42]:

	h2_obs	h2_se	Lambda_gc	Mean_chi2	Intercept	Intercept_se	Ratio	Ratio_se	Catagories
0	0.10394433	0.00650644	1.32982693	1.49125406	1.09147712	0.01056279	0.18621142	0.02150169	NA

For details, see https://cloufield.github.io/gwaslab/LDSCinGWASLab/

In [ ]:

Tutorial for gwaslab 3.6¶

Download sample data¶

Import gwaslab package¶

Loading data into gwaslab Sumstats¶

Loading by specifying keyword in formatbook¶

Sumstats.data¶

Infer genome build¶

Create Manhattan plots and Q-Q plots¶

Standardization & QC : .basic_check()¶

Extract lead variants : get_lead()¶

Use the SNPID to create some highly customized mqq plot¶

Quick regional plot without LD-information¶

Reference file downloading¶

check available reference from gwaslab¶

download reference using gwaslab¶

Region plots with or with out LD reference files.¶

Fix SNPID¶

Annotate rsID¶

Maximize the annotation¶

Harmonization¶

Sumstats Summary¶

Formatting and saving : to_format()¶

Liftover¶

LDSC¶

Standardization & QC : `.basic_check()`¶