Tutorial for gwaslab¶

In this tutorial, we will briefly describe the core functions in gwaslab for sumstats QC, standardization and harmonization.
We will also show examples of visualization, including Manhattan plots, Q-Q plots and regional plots.
This jupyter notebook can be downloaded from https://github.com/Cloufield/gwaslab/blob/main/docs/tutorial_3.4.ipynb
Some steps need reference files, which may take some time to download.
Please note that the processed reference datasets are currently hosted on Dropbox.

Download sample data¶

Using a jupyter notebook, we first download a sample dataset.
The dataset we will use as an example is the sumstats of type 2 diabetes from BBJ (K. Suzuki et al., Nature Genetics. 51, 379–386 (2019).)
File size: 261M

In [1]:

Copied!

!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/

--2023-04-21 13:41:25--  http://jenger.riken.jp/14/
Resolving jenger.riken.jp (jenger.riken.jp)... 134.160.84.25, 161.232.72.25, 202.12.30.131, ...
Connecting to jenger.riken.jp (jenger.riken.jp)|134.160.84.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 274187574 (261M) [text/plain]
Saving to: ‘t2d_bbj.txt.gz’

t2d_bbj.txt.gz      100%[===================>] 261.49M  25.2MB/s    in 11s     

2023-04-21 13:41:35 (24.7 MB/s) - ‘t2d_bbj.txt.gz’ saved [274187574/274187574]

Sometimes the link is not stable. You can also download from alternative source: !wget -O t2d_bbj.txt.gz https://www.dropbox.com/s/5vp93eq0gtsm92n/t2d_bbj.txt.gz?dl=1

Import gwaslab package¶

gwaslab can be installed using pip

In [1]:

Copied!

!pip install gwaslab==3.4.41
!pip install gwaslab==3.4.41

If you installed gwaslab from pip, simply run the command to import the package:

In [1]:

Copied!

import gwaslab as gl
import gwaslab as gl

Or if you want to use the latest version from github (beta version), you can clone the repository and import the package by inserting your package path into the system path like:

In [1]:

Copied!

import sys
sys.path.insert(0,"/home/yunye/work/gwaslab/src")
import gwaslab as gl
import sys
sys.path.insert(0,"/home/yunye/work/gwaslab/src")
import gwaslab as gl

Loading data into gwaslab Sumstats¶

Let's import the raw sumstats into the gwaslab.Sumstats Object by specifying the necessary columns.

Note: you can either specify eaf (effect allele frequency) or neaf (non-effect allele frequency), if neaf is specified, it will be converted to eaf when loading the sumstats.

In [2]:

Copied!





mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",
             neaf="Frq",
             beta="BETA",
             se="SE",
             p="P",
             direction="Dir",
             n="N",
             sep="\t")
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",
             neaf="Frq",
             beta="BETA",
             se="SE",
             p="P",
             direction="Dir",
             n="N",
             sep="\t")

2024/02/15 18:55:03 GWASLab v3.4.40 https://cloufield.github.io/gwaslab/
2024/02/15 18:55:03 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/02/15 18:55:03 Start to initialize gl.Sumstats from file :t2d_bbj.txt.gz
2024/02/15 18:55:18  -Reading columns          : Dir,SNP,REF,N,P,POS,ALT,Frq,BETA,CHR,SE
2024/02/15 18:55:18  -Renaming columns to      : DIRECTION,SNPID,NEA,N,P,POS,EA,EAF,BETA,CHR,SE
2024/02/15 18:55:18  -Current Dataframe shape : 12557761  x  11
2024/02/15 18:55:19  -Initiating a status column: STATUS ...
2024/02/15 18:55:19  #WARNING! Version of genomic coordinates is unknown...
2024/02/15 18:55:21  -NEAF is specified...
2024/02/15 18:55:21  -Checking if 0<= NEAF <=1 ...
2024/02/15 18:55:22  -Converted NEAF to EAF.
2024/02/15 18:55:22  -Removed 0 variants with bad NEAF.
2024/02/15 18:55:22 Start to reorder the columns...v3.4.40
2024/02/15 18:55:22  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1100.88 MB
2024/02/15 18:55:22  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS
2024/02/15 18:55:22 Finished reordering the columns.
2024/02/15 18:55:22  -Column  : SNPID  CHR    POS   EA       NEA      EAF     BETA    SE      P       N     DIRECTION STATUS  
2024/02/15 18:55:22  -DType   : object string int64 category category float64 float64 float64 float64 int64 object    category
2024/02/15 18:55:22  -Verified: T      F      T     T        T        T       T       T       T       T     T         T       
2024/02/15 18:55:22  #WARNING! Columns with possibly incompatible dtypes: CHR
2024/02/15 18:55:22  -Current Dataframe memory usage: 1100.88 MB
2024/02/15 18:55:22 Finished loading data successfully!

Sumstats are stored in Sumstats.data as a pandas.DataFrame, you can check the data like:

In [3]:

Copied!

mysumstats.data
mysumstats.data

Out[3]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS
0	1:725932_G_A	1	725932	G	A	0.9960	-0.0737	0.1394	0.5970	166718	-?+-	9999999
1	1:725933_A_G	1	725933	G	A	0.0040	0.0737	0.1394	0.5973	166718	+?-+	9999999
2	1:737801_T_C	1	737801	C	T	0.0051	0.0490	0.1231	0.6908	166718	+?-+	9999999
3	1:749963_T_TAA	1	749963	TAA	T	0.8374	0.0213	0.0199	0.2846	166718	-?++	9999999
4	1:751343_T_A	1	751343	T	A	0.8593	0.0172	0.0156	0.2705	166718	-?++	9999999
...	...	...	...	...	...	...	...	...	...	...	...	...
12557756	X:154874837_A_G	X	154874837	G	A	0.7478	-0.0064	0.0117	0.5840	191764	-+-+	9999999
12557757	X:154875192_GTACTC_G	X	154875192	GTACTC	G	0.2525	0.0071	0.0122	0.5612	191764	+-+-	9999999
12557758	X:154879115_A_G	X	154879115	G	A	0.7463	-0.0070	0.0122	0.5646	191764	-+-+	9999999
12557759	X:154880669_T_A	X	154880669	T	A	0.2558	0.0071	0.0122	0.5618	191764	+-+-	9999999
12557760	X:154880917_C_T	X	154880917	C	T	0.2558	0.0072	0.0122	0.5570	191764	+-+-	9999999

12557761 rows × 12 columns

For details on gwaslab Sumstats Object, see: https://cloufield.github.io/gwaslab/SumstatsObject/

Create Manhattan plots and Q-Q plots¶

The first thing you want to check is probably the Manhattan and Q-Q plots for your sumstats.

gwaslab will conduct a minimal QC for sumstats when plotting.

In [4]:

Copied!

mysumstats.plot_mqq()
mysumstats.plot_mqq()

Sun Feb  4 01:02:47 2024 Start to create MQQ plot...v3.4.38:
Sun Feb  4 01:02:47 2024  -Genomic coordinates version: 99...
Sun Feb  4 01:02:47 2024  #WARNING! Genomic coordinates version is unknown.
Sun Feb  4 01:02:47 2024  -Genome-wide significance level to plot is set to 5e-08 ...
Sun Feb  4 01:02:47 2024  -Raw input contains 12557761 variants...
Sun Feb  4 01:02:47 2024  -MQQ plot layout mode is : mqq
Sun Feb  4 01:02:51 2024 Finished loading specified columns from the sumstats.
Sun Feb  4 01:02:51 2024 Start data conversion and sanity check:
Sun Feb  4 01:02:51 2024  -Removed 0 variants with nan in CHR or POS column ...
Sun Feb  4 01:03:16 2024  -Removed 0 variants with CHR <=0...
Sun Feb  4 01:03:17 2024  -Removed 0 variants with nan in P column ...
Sun Feb  4 01:03:18 2024  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Sun Feb  4 01:03:18 2024  -Sumstats P values are being converted to -log10(P)...
Sun Feb  4 01:03:19 2024  -Sanity check: 0 na/inf/-inf variants will be removed...
Sun Feb  4 01:03:21 2024  -Converting data above cut line...
Sun Feb  4 01:03:21 2024  -Maximum -log10(P) value is 167.58838029403677 .
Sun Feb  4 01:03:21 2024 Finished data conversion and sanity check.
Sun Feb  4 01:03:21 2024 Start to create MQQ plot with 12557761 variants...
Sun Feb  4 01:03:35 2024  -Creating background plot...
Sun Feb  4 01:04:00 2024 Finished creating MQQ plot successfully!
Sun Feb  4 01:04:00 2024 Start to extract variants for annotation...
Sun Feb  4 01:04:00 2024  -Found 89 significant variants with a sliding window size of 500 kb...
Sun Feb  4 01:04:00 2024 Finished extracting variants for annotation...
Sun Feb  4 01:04:00 2024 Start to process figure arts.
Sun Feb  4 01:04:00 2024  -Processing X ticks...
Sun Feb  4 01:04:00 2024  -Processing X labels...
Sun Feb  4 01:04:00 2024  -Processing Y labels...
Sun Feb  4 01:04:00 2024  -Processing Y tick lables...
Sun Feb  4 01:04:00 2024  -Processing Y labels...
Sun Feb  4 01:04:00 2024  -Processing lines...
Sun Feb  4 01:04:00 2024 Finished processing figure arts.
Sun Feb  4 01:04:00 2024 Start to annotate variants...
Sun Feb  4 01:04:00 2024  -Skip annotating
Sun Feb  4 01:04:00 2024 Finished annotating variants.
Sun Feb  4 01:04:00 2024 Start to create QQ plot with 12557761 variants:
Sun Feb  4 01:04:00 2024  -Plotting all variants...
Sun Feb  4 01:04:03 2024  -Expected range of P: (0,1.0)
Sun Feb  4 01:04:04 2024  -Lambda GC (MLOG10P mode) at 0.5 is   1.21283
Sun Feb  4 01:04:04 2024  -Processing Y tick lables...
Sun Feb  4 01:04:04 2024 Finished creating QQ plot successfully!
Sun Feb  4 01:04:04 2024 Start to save figure...
Sun Feb  4 01:04:04 2024  -Skip saving figure!
Sun Feb  4 01:04:04 2024 Finished saving figure...
Sun Feb  4 01:04:04 2024 Finished creating plot successfully!

Out[4]:

(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)

No description has been provided for this image

Using .plot_mqq(), you can easily plot the Manhattan and QQ plot, but the plots without any manipulations are not really informative in this case, and plotting all points takes a relatively long time. The most significant locus dwarfed other less signicant loci. To adjust the plot, gwaslab provides a wide range of options for customization. For example, we can use skip and cut :

skip : skip variants with MLOG10P < skip for faster plotting speed
cut : rescale the MLOG10P values when MLOG10P > cut

In [5]:

Copied!

mysumstats.plot_mqq(skip=2, cut=20)
mysumstats.plot_mqq(skip=2, cut=20)

Sun Feb  4 01:06:17 2024 Start to create MQQ plot...v3.4.38:
Sun Feb  4 01:06:17 2024  -Genomic coordinates version: 99...
Sun Feb  4 01:06:17 2024  #WARNING! Genomic coordinates version is unknown.
Sun Feb  4 01:06:17 2024  -Genome-wide significance level to plot is set to 5e-08 ...
Sun Feb  4 01:06:17 2024  -Raw input contains 12557761 variants...
Sun Feb  4 01:06:17 2024  -MQQ plot layout mode is : mqq
Sun Feb  4 01:06:20 2024 Finished loading specified columns from the sumstats.
Sun Feb  4 01:06:20 2024 Start data conversion and sanity check:
Sun Feb  4 01:06:21 2024  -Removed 0 variants with nan in CHR or POS column ...
Sun Feb  4 01:06:45 2024  -Removed 0 variants with CHR <=0...
Sun Feb  4 01:06:46 2024  -Removed 0 variants with nan in P column ...
Sun Feb  4 01:06:47 2024  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Sun Feb  4 01:06:47 2024  -Sumstats P values are being converted to -log10(P)...
Sun Feb  4 01:06:48 2024  -Sanity check: 0 na/inf/-inf variants will be removed...
Sun Feb  4 01:06:49 2024  -Converting data above cut line...
Sun Feb  4 01:06:49 2024  -Maximum -log10(P) value is 167.58838029403677 .
Sun Feb  4 01:06:49 2024  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
Sun Feb  4 01:06:49 2024 Finished data conversion and sanity check.
Sun Feb  4 01:06:49 2024 Start to create MQQ plot with 332882 variants...
Sun Feb  4 01:06:49 2024  -Creating background plot...
Sun Feb  4 01:06:50 2024 Finished creating MQQ plot successfully!
Sun Feb  4 01:06:50 2024 Start to extract variants for annotation...
Sun Feb  4 01:06:51 2024  -Found 89 significant variants with a sliding window size of 500 kb...
Sun Feb  4 01:06:51 2024 Finished extracting variants for annotation...
Sun Feb  4 01:06:51 2024 Start to process figure arts.
Sun Feb  4 01:06:51 2024  -Processing X ticks...
Sun Feb  4 01:06:51 2024  -Processing X labels...
Sun Feb  4 01:06:51 2024  -Processing Y labels...
Sun Feb  4 01:06:51 2024  -Processing Y tick lables...
Sun Feb  4 01:06:51 2024  -Processing Y labels...
Sun Feb  4 01:06:51 2024  -Processing lines...
Sun Feb  4 01:06:51 2024 Finished processing figure arts.
Sun Feb  4 01:06:51 2024 Start to annotate variants...
Sun Feb  4 01:06:51 2024  -Skip annotating
Sun Feb  4 01:06:51 2024 Finished annotating variants.
Sun Feb  4 01:06:51 2024 Start to create QQ plot with 332882 variants:
Sun Feb  4 01:06:51 2024  -Plotting all variants...
Sun Feb  4 01:06:51 2024  -Expected range of P: (0,1.0)
Sun Feb  4 01:06:52 2024  -Lambda GC (MLOG10P mode) at 0.5 is   1.21283
Sun Feb  4 01:06:52 2024  -Processing Y tick lables...
Sun Feb  4 01:06:52 2024 Finished creating QQ plot successfully!
Sun Feb  4 01:06:52 2024 Start to save figure...
Sun Feb  4 01:06:52 2024  -Skip saving figure!
Sun Feb  4 01:06:52 2024 Finished saving figure...
Sun Feb  4 01:06:52 2024 Finished creating plot successfully!

Out[5]:

(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)

Looks better now. But what if we want to annotate some of the most significant loci (for example, lead variants with MLOG10P>30) and only plot Manhattan plot?

In [6]:

Copied!

mysumstats.plot_mqq(skip=2, cut=20, mode="m", anno=True, sig_level_lead=1e-30)
mysumstats.plot_mqq(skip=2, cut=20, mode="m", anno=True, sig_level_lead=1e-30)

Sun Feb  4 01:06:56 2024 Start to create MQQ plot...v3.4.38:
Sun Feb  4 01:06:56 2024  -Genomic coordinates version: 99...
Sun Feb  4 01:06:56 2024  #WARNING! Genomic coordinates version is unknown.
Sun Feb  4 01:06:56 2024  -Genome-wide significance level to plot is set to 5e-08 ...
Sun Feb  4 01:06:56 2024  -Raw input contains 12557761 variants...
Sun Feb  4 01:06:56 2024  -MQQ plot layout mode is : m
Sun Feb  4 01:07:01 2024 Finished loading specified columns from the sumstats.
Sun Feb  4 01:07:01 2024 Start data conversion and sanity check:
Sun Feb  4 01:07:01 2024  -Removed 0 variants with nan in CHR or POS column ...
Sun Feb  4 01:07:25 2024  -Removed 0 variants with CHR <=0...
Sun Feb  4 01:07:26 2024  -Removed 0 variants with nan in P column ...
Sun Feb  4 01:07:27 2024  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Sun Feb  4 01:07:28 2024  -Sumstats P values are being converted to -log10(P)...
Sun Feb  4 01:07:29 2024  -Sanity check: 0 na/inf/-inf variants will be removed...
Sun Feb  4 01:07:30 2024  -Converting data above cut line...
Sun Feb  4 01:07:30 2024  -Maximum -log10(P) value is 167.58838029403677 .
Sun Feb  4 01:07:30 2024  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
Sun Feb  4 01:07:30 2024 Finished data conversion and sanity check.
Sun Feb  4 01:07:30 2024 Start to create MQQ plot with 332882 variants...
Sun Feb  4 01:07:30 2024  -Creating background plot...
Sun Feb  4 01:07:31 2024 Finished creating MQQ plot successfully!
Sun Feb  4 01:07:31 2024 Start to extract variants for annotation...
Sun Feb  4 01:07:31 2024  -Found 1 significant variants with a sliding window size of 500 kb...
Sun Feb  4 01:07:31 2024 Finished extracting variants for annotation...
Sun Feb  4 01:07:31 2024 Start to process figure arts.
Sun Feb  4 01:07:31 2024  -Processing X ticks...
Sun Feb  4 01:07:31 2024  -Processing X labels...
Sun Feb  4 01:07:31 2024  -Processing Y labels...
Sun Feb  4 01:07:31 2024  -Processing Y tick lables...
Sun Feb  4 01:07:31 2024  -Processing Y labels...
Sun Feb  4 01:07:31 2024  -Processing lines...
Sun Feb  4 01:07:31 2024 Finished processing figure arts.
Sun Feb  4 01:07:31 2024 Start to annotate variants...
Sun Feb  4 01:07:31 2024  -Annotating using column CHR:POS...
Sun Feb  4 01:07:31 2024  -Adjusting text positions with repel_force=0.03...
Sun Feb  4 01:07:31 2024 Finished annotating variants.
Sun Feb  4 01:07:31 2024 Start to save figure...
Sun Feb  4 01:07:31 2024  -Skip saving figure!
Sun Feb  4 01:07:31 2024 Finished saving figure...
Sun Feb  4 01:07:31 2024 Finished creating plot successfully!

Out[6]:

(<Figure size 3000x1000 with 1 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)

GWASLab supports a wide range of customizable options. For details on other options for Manhattan and Q-Q plots, see: https://cloufield.github.io/gwaslab/Visualization/

For more examples, see https://cloufield.github.io/gwaslab/visualization_mqq/

Standardization & QC : `.basic_check()`¶

It is needed to check variant ID (SNPID), rsID, chromosome(CHR), basepair position(POS), alleles (EA and NEA) and statistics first before any manipulations or analysis. gwaslab provides a all-in-one function for this, .basic_check().

Note: Sometimes you need do this before plotting if the sumstats are not in a standard format.

In [7]:

Copied!

#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()
#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()

Sun Feb  4 01:07:34 2024 Start to check SNPID/rsID...v3.4.38
Sun Feb  4 01:07:34 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1100.88 MB
Sun Feb  4 01:07:34 2024  -Checking SNPID data type...
Sun Feb  4 01:07:34 2024  -Converting SNPID to pd.string data type...
Sun Feb  4 01:07:34 2024  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
Sun Feb  4 01:07:45 2024 Finished checking SNPID/rsID.
Sun Feb  4 01:07:46 2024 Start to fix chromosome notation (CHR)...v3.4.38
Sun Feb  4 01:07:46 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1100.88 MB
Sun Feb  4 01:07:46 2024  -Checking CHR data type...
Sun Feb  4 01:08:11 2024  -Variants with standardized chromosome notation: 12228970
Sun Feb  4 01:08:35 2024  -Variants with fixable chromosome notations: 328791
Sun Feb  4 01:08:35 2024  -No unrecognized chromosome notations...
Sun Feb  4 01:08:37 2024  -Identifying non-autosomal chromosomes : X, Y, and MT ...
Sun Feb  4 01:08:38 2024  -Identified  328791  variants on sex chromosomes...
Sun Feb  4 01:08:39 2024  -Standardizing sex chromosome notations: X to 23...
Sun Feb  4 01:09:14 2024 Finished fixing chromosome notation (CHR).
Sun Feb  4 01:09:14 2024 Start to fix basepair positions (POS)...v3.4.38
Sun Feb  4 01:09:14 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1370.85 MB
Sun Feb  4 01:09:14 2024  -Converting to Int64 data type ...
Sun Feb  4 01:09:20 2024  -Position bound:(0 , 250,000,000)
Sun Feb  4 01:09:45 2024  -Removed outliers: 0
Sun Feb  4 01:09:46 2024 Finished fixing basepair positions (POS).
Sun Feb  4 01:09:46 2024 Start to fix alleles (EA and NEA)...v3.4.38
Sun Feb  4 01:09:46 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1124.83 MB
Sun Feb  4 01:09:46 2024  -Converted all bases to string datatype and UPPERCASE.
Sun Feb  4 01:09:50 2024  -Variants with bad EA  : 0
Sun Feb  4 01:09:50 2024  -Variants with bad NEA : 0
Sun Feb  4 01:09:51 2024  -Variants with NA for EA or NEA: 0
Sun Feb  4 01:09:52 2024  -Variants with same EA and NEA: 0
Sun Feb  4 01:09:52 2024  -Detected 0 variants with alleles that contain bases other than A/C/T/G .
Sun Feb  4 01:10:07 2024 Finished fixing alleles (EA and NEA).
Sun Feb  4 01:10:08 2024 Start to perform sanity check for statistics...v3.4.38
Sun Feb  4 01:10:08 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1150.87 MB
Sun Feb  4 01:10:08 2024  -Comparison tolerance for floats: 1e-07
Sun Feb  4 01:10:08 2024  -Checking if 0 <= N <= 2147483647 ...
Sun Feb  4 01:10:35 2024  -Removed 0 variants with bad/na N.
Sun Feb  4 01:10:35 2024  -Checking if -1e-07 < EAF < 1.0000001 ...
Sun Feb  4 01:10:37 2024  -Removed 0 variants with bad/na EAF.
Sun Feb  4 01:10:37 2024  -Checking if -1e-07 < P < 1.0000001 ...
Sun Feb  4 01:10:39 2024  -Removed 0 variants with bad/na P.
Sun Feb  4 01:10:39 2024  -Checking if -100.0000001 < BETA < 100.0000001 ...
Sun Feb  4 01:10:42 2024  -Removed 0 variants with bad/na BETA.
Sun Feb  4 01:10:42 2024  -Checking if -1e-07 < SE < inf ...
Sun Feb  4 01:10:43 2024  -Removed 0 variants with bad/na SE.
Sun Feb  4 01:10:43 2024  -Checking STATUS and converting STATUS to categories....
Sun Feb  4 01:10:44 2024  -Removed 0 variants with bad statistics in total.
Sun Feb  4 01:10:44 2024  -Data types for each column:
Sun Feb  4 01:10:44 2024  -Column  : SNPID  CHR   POS   EA       NEA      EAF     BETA    SE      P       N     DIRECTION STATUS  
Sun Feb  4 01:10:44 2024  -DType   : string Int64 Int64 category category float32 float64 float64 float64 Int64 object    category
Sun Feb  4 01:10:44 2024  -Verified: T      T     T     T        T        T       T       T       T       T     T         T       
Sun Feb  4 01:10:44 2024 Finished sanity check for statistics.
Sun Feb  4 01:10:44 2024 Start to check data consistency across columns...v3.4.38
Sun Feb  4 01:10:44 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1114.94 MB
Sun Feb  4 01:10:44 2024  -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
Sun Feb  4 01:10:44 2024  -Checking if BETA/SE-derived-P is consistent with P...
Sun Feb  4 01:12:03 2024   -Not consistent: 2380309 variant(s)
Sun Feb  4 01:12:04 2024   -Variant SNPID with max difference: X:3293819_T_C with 0.0072642804381793935
Sun Feb  4 01:12:04 2024  -Note: if the max difference is greater than expected, please check your original sumstats.
Sun Feb  4 01:12:04 2024 Finished checking data consistency across columns.
Sun Feb  4 01:12:04 2024 Start to normalize indels...v3.4.38
Sun Feb  4 01:12:04 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1372.94 MB
Sun Feb  4 01:12:09 2024  -Not normalized allele IDs:X:7151130_TT_TC X:9093382_CTTTT_CTTT X:12292253_ATTT_ATT X:16001576_ATT_ATTT X:33822416_GT_GTT ... 
Sun Feb  4 01:12:09 2024  -Not normalized allele:['TT' 'TC']['CTTTT' 'CTTT']['ATTT' 'ATT']['ATTT' 'ATT']['GTT' 'GT']... 
Sun Feb  4 01:12:09 2024  -Modified 13 variants according to parsimony and left alignment principal.
Sun Feb  4 01:12:10 2024 Finished normalizing indels.
Sun Feb  4 01:12:10 2024 Start to sort the genome coordinates...v3.4.38
Sun Feb  4 01:12:10 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1372.94 MB
Sun Feb  4 01:12:16 2024 Finished sorting coordinates.
Sun Feb  4 01:12:17 2024 Start to reorder the columns...v3.4.38
Sun Feb  4 01:12:17 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1019.13 MB
Sun Feb  4 01:12:17 2024  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS
Sun Feb  4 01:12:17 2024 Finished reordering the columns.

In [8]:

Copied!

mysumstats.data
mysumstats.data

Out[8]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS
0	1:725932_G_A	1	725932	G	A	0.9960	-0.0737	0.1394	0.5970	166718	-?+-	9960099
1	1:725933_A_G	1	725933	G	A	0.0040	0.0737	0.1394	0.5973	166718	+?-+	9960099
2	1:737801_T_C	1	737801	C	T	0.0051	0.0490	0.1231	0.6908	166718	+?-+	9960099
3	1:749963_T_TAA	1	749963	TAA	T	0.8374	0.0213	0.0199	0.2846	166718	-?++	9960399
4	1:751343_T_A	1	751343	T	A	0.8593	0.0172	0.0156	0.2705	166718	-?++	9960099
...	...	...	...	...	...	...	...	...	...	...	...	...
12557756	X:154874837_A_G	23	154874837	G	A	0.7478	-0.0064	0.0117	0.5840	191764	-+-+	9960099
12557757	X:154875192_GTACTC_G	23	154875192	GTACTC	G	0.2525	0.0071	0.0122	0.5612	191764	+-+-	9960399
12557758	X:154879115_A_G	23	154879115	G	A	0.7463	-0.0070	0.0122	0.5646	191764	-+-+	9960099
12557759	X:154880669_T_A	23	154880669	T	A	0.2558	0.0071	0.0122	0.5618	191764	+-+-	9960099
12557760	X:154880917_C_T	23	154880917	C	T	0.2558	0.0072	0.0122	0.5570	191764	+-+-	9960099

12557761 rows × 12 columns

By checking the log, we can see that the sumstats look good. But we still found several variants that were not normalized. GWASLab fixed the position and alleles for the un-normalizated indels. And GWASLab standardizes the notation for chromosome X to 23.

In fact, .basic_check() is a wrapper of the following basic functions, you can also use these separately.

mysumstats.fix_ID()
mysumstats.fix_chr()
mysumstats.fix_pos()
mysumstats.fix_allele()
mysumstats.check_sanity()
mysumstats.check_data_consisteny()
mysumstats.normalize_allele()

For other options, see: https://cloufield.github.io/gwaslab/Standardization/

For other examples, see: https://cloufield.github.io/gwaslab/standardization_workflow/

Extract lead variants : get_lead()¶

Let's extract the lead variants in each significant locus to check our data.

The significant loci are detected based on a sliding window (default window size: windowsizekb=500 kb)

By specifying anno=True , gwaslab will also annotate the lead variant with its nearest gene names and distance.

Note: GWASLab default genome build version is build="99" (Unknown) , you can change it to build="19" (GRCh37/hg19) or build="38" (GRCh38/hg38) when needed. Or you can specify the build for sumstats when loading into gl.Sumstats(build="19") or use mysumstats.infer_build().

Note: GWASLab will donwload reference files when you run it for the first time. In this case, ensembl_hg19_gtf_protein_coding was downloaded and processed automatically.

In [9]:

Copied!

mysumstats.get_lead(anno=True, build="19")
mysumstats.get_lead(anno=True, build="19")

Sun Feb  4 01:12:17 2024 Start to extract lead variants...v3.4.38
Sun Feb  4 01:12:17 2024  -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1019.13 MB
Sun Feb  4 01:12:17 2024  -Processing 12557761 variants...
Sun Feb  4 01:12:17 2024  -Significance threshold : 5e-08
Sun Feb  4 01:12:17 2024  -Sliding window size: 500  kb
Sun Feb  4 01:12:20 2024  -Using P for extracting lead variants...
Sun Feb  4 01:12:20 2024  -Found 9461 significant variants in total...
Sun Feb  4 01:12:20 2024  -Identified 89 lead variants!
Sun Feb  4 01:12:20 2024  -Annotating variants using references:ensembl
Sun Feb  4 01:12:20 2024  -Annotating variants using references based on genome build:19
Sun Feb  4 01:12:20 2024 Start to annotate variants with nearest gene name(s)...
Sun Feb  4 01:12:20 2024  -Assigning Gene name using ensembl_hg19_gtf for protein coding genes
Sun Feb  4 01:12:23 2024 Finished annotating variants with nearest gene name(s) successfully!
Sun Feb  4 01:12:23 2024 Finished extracting lead variants.

Out[9]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS	LOCATION	GENE
96739	1:22068326_A_G	1	22068326	G	A	0.7550	0.0621	0.0103	1.629000e-09	191764	++++	9960099	0	USP48
213860	1:51103268_T_C	1	51103268	C	T	0.7953	-0.0802	0.0120	2.519000e-11	191764	----	9960099	0	FAF1
534095	1:154309595_TA_T	1	154309595	TA	T	0.0947	-0.0915	0.0166	3.289000e-08	191764	----	9960399	0	ATP8B2
969974	2:640986_CACAT_C	2	640986	C	CACAT	0.9006	-0.0946	0.0150	2.665000e-10	191764	----	9960399	26349	TMEM18
1091807	2:27734972_G_A	2	27734972	G	A	0.5605	0.0691	0.0088	3.897000e-15	191764	++++	9960099	0	GCKR
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
12272930	X:21569920_A_G	23	21569920	G	A	0.3190	0.0423	0.0076	2.616000e-08	191764	++++	9960099	0	CNKSR2
12341406	X:48724648_CAA_C	23	48724648	C	CAA	0.6260	-0.0602	0.0103	4.576000e-09	191764	----	9960399	26082	TIMM17B
12350767	X:57170781_A_AT	23	57170781	AT	A	0.3003	-0.0447	0.0076	4.583000e-09	191764	----	9960399	-6723	SPIN2A
12469290	X:117915163_T_TA	23	117915163	TA	T	0.5560	0.0548	0.0071	9.818000e-15	191764	++++	9960399	0	IL13RA1
12554976	X:152908887_G_A	23	152908887	G	A	0.6792	-0.1235	0.0077	9.197000e-58	191764	----	9960099	0	DUSP9

89 rows × 14 columns

We extracted a total of 89 lead variants with a sliding window size of 500kb!

For other options, see: https://cloufield.github.io/gwaslab/ExtractLead/

Use the SNPID to create some highly customized mqq plot¶

GWASLab can create much more complicated Manhattan plots.

For example,

annotate the lead variants with closest gene names (threshold for annotation p<1e-20)
annotate selected variants with user-provided texts
pinpoint some variants
highlight some loci
MAF-stratified Q-Q plot
save as my_first_mqq_plot.png with {"dpi":400,"facecolor":"white"}

In [10]:

Copied!





mysumstats.plot_mqq(mode="mqq",
                    build="19",
                    cut=20,
                    skip=2,
                    anno="GENENAME",
                    sig_level_lead=1e-20,
                    anno_alias={"9:22132729_A_G":"some annotations"},
                    anno_style="expand",
                    xpad=0.01,
                    pinpoint=["9:22132729_A_G","5:176513896_C_A"], 
                    pinpoint_color="green",
                    highlight=["7:127253550_C_T","19:46166604_C_T"],
                    highlight_windowkb =1000,
                    stratified=True,
                    jagged=True,
                    marker_size=(5,5),
                    figargs={"figsize":(15,5),"dpi":300},
                    save="my_first_mqq_plot.png", 
                    save_args={"dpi":400,"facecolor":"white"})
mysumstats.plot_mqq(mode="mqq",
                    build="19",
                    cut=20,
                    skip=2,
                    anno="GENENAME",
                    sig_level_lead=1e-20,
                    anno_alias={"9:22132729_A_G":"some annotations"},
                    anno_style="expand",
                    xpad=0.01,
                    pinpoint=["9:22132729_A_G","5:176513896_C_A"], 
                    pinpoint_color="green",
                    highlight=["7:127253550_C_T","19:46166604_C_T"],
                    highlight_windowkb =1000,
                    stratified=True,
                    jagged=True,
                    marker_size=(5,5),
                    figargs={"figsize":(15,5),"dpi":300},
                    save="my_first_mqq_plot.png", 
                    save_args={"dpi":400,"facecolor":"white"})

Sun Feb  4 01:12:24 2024 Start to create MQQ plot...v3.4.38:
Sun Feb  4 01:12:24 2024  -Genomic coordinates version: 19...
Sun Feb  4 01:12:24 2024  -Genome-wide significance level to plot is set to 5e-08 ...
Sun Feb  4 01:12:24 2024  -Raw input contains 12557761 variants...
Sun Feb  4 01:12:24 2024  -MQQ plot layout mode is : mqq
Sun Feb  4 01:12:24 2024  -Loci to highlight (#CB132D): 7:127253550_C_T,19:46166604_C_T
Sun Feb  4 01:12:24 2024   -highlight_windowkb is set to:  1000  kb
Sun Feb  4 01:12:24 2024  -Variants to pinpoint (green) : 9:22132729_A_G,5:176513896_C_A
Sun Feb  4 01:12:27 2024 Finished loading specified columns from the sumstats.
Sun Feb  4 01:12:27 2024 Start data conversion and sanity check:
Sun Feb  4 01:12:27 2024  -Removed 0 variants with nan in CHR or POS column ...
Sun Feb  4 01:12:52 2024  -Removed 0 variants with CHR <=0...
Sun Feb  4 01:12:53 2024  -Removed 0 variants with nan in EAF column ...
Sun Feb  4 01:12:53 2024  -Removed 0 variants with nan in P column ...
Sun Feb  4 01:12:55 2024  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Sun Feb  4 01:12:56 2024  -Sumstats P values are being converted to -log10(P)...
Sun Feb  4 01:12:57 2024  -Sanity check: 0 na/inf/-inf variants will be removed...
Sun Feb  4 01:12:59 2024  -Converting data above cut line...
Sun Feb  4 01:12:59 2024  -Maximum -log10(P) value is 167.58838029403677 .
Sun Feb  4 01:12:59 2024  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
Sun Feb  4 01:12:59 2024 Finished data conversion and sanity check.
Sun Feb  4 01:12:59 2024 Start to create MQQ plot with 332882 variants...
Sun Feb  4 01:12:59 2024  -Creating background plot...
Sun Feb  4 01:13:00 2024  -Highlighting target loci...
Sun Feb  4 01:13:00 2024  -Pinpointing target vairants...
Sun Feb  4 01:13:00 2024 Finished creating MQQ plot successfully!
Sun Feb  4 01:13:00 2024 Start to extract variants for annotation...
Sun Feb  4 01:13:00 2024  -Found 16 significant variants with a sliding window size of 500 kb...
Sun Feb  4 01:13:00 2024 Start to annotate variants with nearest gene name(s)...
Sun Feb  4 01:13:00 2024  -Assigning Gene name using ensembl_hg19_gtf for protein coding genes
Sun Feb  4 01:13:00 2024 Finished annotating variants with nearest gene name(s) successfully!
Sun Feb  4 01:13:00 2024 Finished extracting variants for annotation...
Sun Feb  4 01:13:00 2024 Start to process figure arts.
Sun Feb  4 01:13:00 2024  -Processing X ticks...
Sun Feb  4 01:13:00 2024  -Processing X labels...
Sun Feb  4 01:13:00 2024  -Processing Y labels...
Sun Feb  4 01:13:00 2024  -Processing Y tick lables...
Sun Feb  4 01:13:00 2024  -Processing Y labels...
Sun Feb  4 01:13:00 2024  -Processing lines...
Sun Feb  4 01:13:00 2024 Finished processing figure arts.
Sun Feb  4 01:13:00 2024 Start to annotate variants...
Sun Feb  4 01:13:00 2024  -Annotating using column GENENAME...
Sun Feb  4 01:13:00 2024  -Adjusting text positions with repel_force=0.03...
Sun Feb  4 01:13:00 2024 Finished annotating variants.
Sun Feb  4 01:13:00 2024 Start to create QQ plot with 332882 variants:
Sun Feb  4 01:13:00 2024  -Plotting variants stratified by MAF...
Sun Feb  4 01:13:02 2024  -Lambda GC (MLOG10P mode) at 0.5 is   1.21283
Sun Feb  4 01:13:02 2024  -Processing Y tick lables...
Sun Feb  4 01:13:02 2024 Finished creating QQ plot successfully!
Sun Feb  4 01:13:02 2024  -Processing jagged Y axis...
Sun Feb  4 01:13:02 2024  -Processing jagged Y axis...
Sun Feb  4 01:13:02 2024 Start to save figure...
Sun Feb  4 01:13:07 2024  -Saved to my_first_mqq_plot.png successfully! (overwrite)
Sun Feb  4 01:13:07 2024 Finished saving figure...
Sun Feb  4 01:13:07 2024 Finished creating plot successfully!

Out[10]:

(<Figure size 4500x1500 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)

For details, see: https://cloufield.github.io/gwaslab/Visualization/ and https://cloufield.github.io/gwaslab/visualization_mqq/

Quick regional plot without LD-information¶

gwaslab can also plot regional plots with or with out LD reference files.

For details, see: https://cloufield.github.io/gwaslab/RegionalPlot/

We first create a regional plot without references by specifying mode and region.

In [11]:

Copied!

mysumstats.plot_mqq(mode="r",build="19",skip=2,cut=20, region=(7,126253550,128253550),region_grid=True)
mysumstats.plot_mqq(mode="r",build="19",skip=2,cut=20, region=(7,126253550,128253550),region_grid=True)

Sun Feb  4 01:13:14 2024 Start to create MQQ plot...v3.4.38:
Sun Feb  4 01:13:14 2024  -Genomic coordinates version: 19...
Sun Feb  4 01:13:14 2024  -Genome-wide significance level to plot is set to 5e-08 ...
Sun Feb  4 01:13:14 2024  -Raw input contains 12557761 variants...
Sun Feb  4 01:13:14 2024  -MQQ plot layout mode is : r
Sun Feb  4 01:13:14 2024  -Region to plot : chr7:126253550-128253550.
Sun Feb  4 01:13:16 2024  -Extract SNPs in region : chr7:126253550-128253550...
Sun Feb  4 01:13:40 2024  -Extract SNPs in specified regions: 8087
Sun Feb  4 01:13:40 2024 Finished loading specified columns from the sumstats.
Sun Feb  4 01:13:40 2024 Start data conversion and sanity check:
Sun Feb  4 01:13:40 2024  -Removed 0 variants with nan in CHR or POS column ...
Sun Feb  4 01:13:40 2024  -Removed 0 variants with CHR <=0...
Sun Feb  4 01:13:40 2024  -Removed 0 variants with nan in P column ...
Sun Feb  4 01:13:40 2024  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Sun Feb  4 01:13:40 2024  -Sumstats P values are being converted to -log10(P)...
Sun Feb  4 01:13:40 2024  -Sanity check: 0 na/inf/-inf variants will be removed...
Sun Feb  4 01:13:40 2024  -Converting data above cut line...
Sun Feb  4 01:13:40 2024  -Maximum -log10(P) value is 73.38711023071251 .
Sun Feb  4 01:13:40 2024  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
Sun Feb  4 01:13:40 2024 Finished data conversion and sanity check.
Sun Feb  4 01:13:40 2024 Start to create MQQ plot with 1600 variants...
Sun Feb  4 01:13:40 2024  -Creating background plot...
Sun Feb  4 01:13:40 2024  -Extracting lead variant...
Sun Feb  4 01:13:40 2024  -Loading gtf files from:default

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']

Sun Feb  4 01:14:18 2024  -plotting gene track..
Sun Feb  4 01:14:18 2024  -Finished plotting gene track..
Sun Feb  4 01:14:19 2024 Finished creating MQQ plot successfully!
Sun Feb  4 01:14:19 2024 Start to extract variants for annotation...
Sun Feb  4 01:14:19 2024  -Found 1 significant variants with a sliding window size of 500 kb...
Sun Feb  4 01:14:19 2024 Finished extracting variants for annotation...
Sun Feb  4 01:14:19 2024 Start to process figure arts.
Sun Feb  4 01:14:19 2024  -Processing X labels...
Sun Feb  4 01:14:19 2024  -Processing Y labels...
Sun Feb  4 01:14:19 2024  -Processing Y tick lables...
Sun Feb  4 01:14:19 2024  -Processing Y labels...
Sun Feb  4 01:14:19 2024  -Processing lines...
Sun Feb  4 01:14:19 2024 Finished processing figure arts.
Sun Feb  4 01:14:19 2024 Start to annotate variants...
Sun Feb  4 01:14:19 2024  -Skip annotating
Sun Feb  4 01:14:19 2024 Finished annotating variants.
Sun Feb  4 01:14:19 2024 Start to save figure...
Sun Feb  4 01:14:19 2024  -Skip saving figure!
Sun Feb  4 01:14:19 2024 Finished saving figure...
Sun Feb  4 01:14:19 2024 Finished creating plot successfully!

Out[11]:

(<Figure size 3000x2000 with 3 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)

Reference file downloading¶

Full regional plot will require user-provided vcf or preprocessed vcf files: (e.g 1000 Genomes project, see Reference: https://cloufield.github.io/gwaslab/Reference/)

GWASLab also provides pre-processed 1KG datasets.

Check available reference from gwaslab¶

Update the available reference list first if needed

In [12]:

Copied!

# gl.update_available_ref()
# gl.update_available_ref()

In [13]:

Copied!

gl.check_available_ref()
gl.check_available_ref()

Sun Feb  4 01:14:20 2024 Start to check available reference files...
Sun Feb  4 01:14:20 2024  - 1kg_eas_hg19  :  https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_hg19_md5  :  c8c97434843c0da3113fc06879ead472
Sun Feb  4 01:14:20 2024  - 1kg_eas_hg19_tbi  :  https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_hg19  :  https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_hg19_md5  :  734069d895009d38c2f962bfbb6fab52
Sun Feb  4 01:14:20 2024  - 1kg_eur_hg19_tbi  :  https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_hg38  :  https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_hg38_md5  :  f45e80bca9ef7b29e6b1832e6ac15375
Sun Feb  4 01:14:20 2024  - 1kg_eas_hg38_tbi  :  https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_hg38  :  https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_hg38_md5  :  228d3285fa99132cc6321e2925e0768d
Sun Feb  4 01:14:20 2024  - 1kg_eur_hg38_tbi  :  https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_hg19  :  https://www.dropbox.com/scl/fi/fubqvuj3p4ii4y35zknv8/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=5z50f66iltjchcaszznq5bczt&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_hg19_md5  :  e2d3f9e2e6580d05e877e9effd435c4e
Sun Feb  4 01:14:20 2024  - 1kg_sas_hg19_tbi  :  https://www.dropbox.com/scl/fi/icnmrnzee7ofdpx5l96tg/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=st8t88snby26q37rqi6zh5zck&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_hg19  :  https://www.dropbox.com/scl/fi/bxa4zfngsxsc38rhtiv8c/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=ibcn8hb1n8n36j3u0jfzci267&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_hg19_md5  :  68d3cdf01cbabdae6e74a07795fa881c
Sun Feb  4 01:14:20 2024  - 1kg_amr_hg19_tbi  :  https://www.dropbox.com/scl/fi/1zk16x7h4r89jurzwu05u/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=b4cere4w38zvzyfitfge3r8n0&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_hg38  :  https://www.dropbox.com/scl/fi/jr3l5zz42py3kny2bccmj/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=x0t6tsy71jxzf021wfqdn8k5q&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_hg38_md5  :  e5d79bea1958aa50c23f618d342ccc83
Sun Feb  4 01:14:20 2024  - 1kg_sas_hg38_tbi  :  https://www.dropbox.com/scl/fi/02oia4ur5r7w9qgiuf6i9/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=00p9rxe0xzfs6hr1rg4d8oadm&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_hg38  :  https://www.dropbox.com/scl/fi/4t4tyuhzp78uyb6tgkroq/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=p96gbs1tcdia31jnjv1b82kuz&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_hg38_md5  :  229fbd610001cf6f137b7f738352a44a
Sun Feb  4 01:14:20 2024  - 1kg_amr_hg38_tbi  :  https://www.dropbox.com/scl/fi/x0dby543tr9xpaqj2i0ba/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=uj8o7j0cy0spipe174jn54sqs&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_hg19  :  https://www.dropbox.com/scl/fi/tq4w9lyt5z47ym7grtrxg/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=k3bimeu3yr5loq8hohba5mr6k&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_hg19_md5  :  f7b4425f39e8292dce6f13711e7f6c50
Sun Feb  4 01:14:20 2024  - 1kg_afr_hg19_tbi  :  https://www.dropbox.com/scl/fi/0giiptu0btwj1kfm6jdzr/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=ucb5weprsc5prcg8hvtgmruxx&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_hg19  :  https://www.dropbox.com/scl/fi/6b4j9z9knmllfnbx86aw6/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=eento8vg06zyrkvooc9wd4cvu&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_hg19_md5  :  fed846482204487b60d33b21ddb18106
Sun Feb  4 01:14:20 2024  - 1kg_pan_hg19_tbi  :  https://www.dropbox.com/scl/fi/stco946scio5tvto0ln4j/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=hfh53beb627lmqwv3d8mzqy0c&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_hg38  :  https://www.dropbox.com/scl/fi/239xmm7qijtnsks97chc9/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=47en5fk1icbekpg7we3uot9g8&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_hg38_md5  :  3bb7923be0809a324d7b7633b8d58a3b
Sun Feb  4 01:14:20 2024  - 1kg_afr_hg38_tbi  :  https://www.dropbox.com/scl/fi/3y3pg4yqwo2jaaamx1c8f/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=say0ihfwa51z3otgn4bjtze8p&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_hg38  :  https://www.dropbox.com/scl/fi/nf01487smtmeq243ihfwm/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=3pefbkzxwcnejx4inynifpft7&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_hg38_md5  :  23bb86d748c4a66e85e087f647e8b60e
Sun Feb  4 01:14:20 2024  - 1kg_pan_hg38_tbi  :  https://www.dropbox.com/scl/fi/hu7cttr4cenw5yjsm2775/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=568u7bkvkybm4wt2q9284o198&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_x_hg19  :  https://www.dropbox.com/scl/fi/1inmw09rk35ncuq7tibmp/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=vcjpukgsb7gt4tizg1fvr7tr2&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_x_hg19_md5  :  b2aced2a1522ed23818989b3153b7e91
Sun Feb  4 01:14:20 2024  - 1kg_eas_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/uyxb9lfi88dqjp5l3vrzf/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=vt196d16h690dmzox5jyc33xx&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_x_hg19  :  https://www.dropbox.com/scl/fi/6r4sc2yax8pk644piew2d/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=l5towjhyl733nrd1msjr1d8gl&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_x_hg19_md5  :  6380cb71eafe985d7b894029e979139b
Sun Feb  4 01:14:20 2024  - 1kg_eur_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/yuid87x398yc9n8nc4bb1/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=01skm13sk6099y34zy6qvweqj&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_x_hg38  :  https://www.dropbox.com/scl/fi/2m6i93vv1ooano0muukck/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=y6087mmt9kmls066mzobjeqp7&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eas_x_hg38_md5  :  8c6a35da51621f952a5b97cbcc832046
Sun Feb  4 01:14:20 2024  - 1kg_eas_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/l6jpt86edarb4emehxwcy/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ddr1fcijb1bh2nso0q0updolh&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_x_hg38  :  https://www.dropbox.com/scl/fi/ceoff4p95ftef6yldhl67/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=yyt9u11dk6kyha0cvturvrtvf&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_eur_x_hg38_md5  :  b9a4b8553dec202109f72281f33cb454
Sun Feb  4 01:14:20 2024  - 1kg_eur_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/tux32myi6g18bx7nd1rdq/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=m4f0v3rblnzv7dj0lo6hsd0hb&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_x_hg19  :  https://www.dropbox.com/scl/fi/592lbmrkfjn80twnvlt2q/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=zrar7nmltpsuznlyuqw9k77q6&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_x_hg19_md5  :  f4f370274fe586d209ca6fddc4eceaaf
Sun Feb  4 01:14:20 2024  - 1kg_sas_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/rmaybun6v248nmjmaz2tj/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=izv96vfajgdd5wsyuvx98ntvk&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_x_hg19  :  https://www.dropbox.com/scl/fi/gwryyxs0ilgoazqvp39be/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=z46we3kshi9t96x7issl36bda&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_x_hg19_md5  :  ead838f7059a80118e949959cf1a3ff3
Sun Feb  4 01:14:20 2024  - 1kg_amr_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/r6g5893smgmnsir5r0v5n/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=sel2f4p7ctggf30udrhcy0psu&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_x_hg38  :  https://www.dropbox.com/scl/fi/r6qa9l6h9rc5rvenjsdqo/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=mucn2zizrlkebn1e5q7rtzu8e&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_sas_x_hg38_md5  :  31c60999ebb9a13d17d21e02fd9d1f4c
Sun Feb  4 01:14:20 2024  - 1kg_sas_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/5ktxs24sq6v8hvaowr4xv/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=5y2z66b3s6bgzoikdjvr46ccw&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_x_hg38  :  https://www.dropbox.com/scl/fi/5brd1nh7u20oigtb17mkb/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=zt0d0nqmlq3u5uu6ukwvof6ta&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_amr_x_hg38_md5  :  bc7de683d603c8bbff02f5bec8d3469a
Sun Feb  4 01:14:20 2024  - 1kg_amr_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/8bz6uwjgw8bj16uaz6kye/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=pkc2mepgosxdijpfru4vhum5v&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_x_hg19  :  https://www.dropbox.com/scl/fi/kz5j4532pyaigceww0vg5/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=sti0g2ri004chu8b12pqgicvw&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_x_hg19_md5  :  77807e5e315a6e47504c175b0aaece88
Sun Feb  4 01:14:20 2024  - 1kg_afr_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/oadjwamy5pe1ilv2237gq/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=a10bhxrfa904dasrcuc40njq4&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_x_hg19  :  https://www.dropbox.com/scl/fi/rwov9vszj8rx78u65dxnk/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=ej33zb9ulwdfseur1surz653z&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_x_hg19_md5  :  389d474984ff82df79efd25c0dd66fc9
Sun Feb  4 01:14:20 2024  - 1kg_pan_x_hg19_tbi  :  https://www.dropbox.com/scl/fi/x0n1htmkbulqybr5cc2cb/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=8rga64u5gm9vp9whqwlj37hre&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_x_hg38  :  https://www.dropbox.com/scl/fi/ef8h09lhg8vmdxfxsayv4/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=96xjxu546sbbq5hbgaihh84l8&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_afr_x_hg38_md5  :  b1410bb21e389a0f08fc2741d33fcc51
Sun Feb  4 01:14:20 2024  - 1kg_afr_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/tlwhui80cy32mtx6hc8f6/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ana497w8is840ygarx3w1v5pc&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_x_hg38  :  https://www.dropbox.com/scl/fi/tf7j5540jyzxz2oo7jtho/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=e872ihg5477mlu30vet1e5lvk&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_pan_x_hg38_md5  :  8ae424786a6bfe64c92ca6b9f96ee5e6
Sun Feb  4 01:14:20 2024  - 1kg_pan_x_hg38_tbi  :  https://www.dropbox.com/scl/fi/u28zo2sjbcmtfs69zqaya/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=t5psxvoewd1oog2hwzellc78p&dl=1
Sun Feb  4 01:14:20 2024  - dbsnp_v151_hg19  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
Sun Feb  4 01:14:20 2024  - dbsnp_v151_hg19_tbi  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz.tbi
Sun Feb  4 01:14:20 2024  - dbsnp_v151_hg38  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz
Sun Feb  4 01:14:20 2024  - dbsnp_v151_hg38_tbi  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi
Sun Feb  4 01:14:20 2024  - dbsnp_v156_hg19  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz
Sun Feb  4 01:14:20 2024  - dbsnp_v156_hg19_tbi  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz.tbi
Sun Feb  4 01:14:20 2024  - dbsnp_v156_hg38  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz
Sun Feb  4 01:14:20 2024  - dbsnp_v156_hg38_tbi  :  https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi
Sun Feb  4 01:14:20 2024  - ucsc_genome_hg19  :  http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
Sun Feb  4 01:14:20 2024  - ucsc_genome_hg38  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg19_auto  :  https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg19_auto_md5  :  7d1e7624fb6e4df7a2f6f05558d436b4
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg38_auto  :  https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg38_auto_md5  :  4c7ef2d2415c18c286219e970fdda972
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg19_x  :  https://www.dropbox.com/scl/fi/ghwq2yh5bi7o411m1mw15/1kg_dbsnp151_hg19_X.txt.gz?rlkey=50du8e42qjdgzge0lcdiu7tv2&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg19_x_md5  :  cbf0e3518ab73d6d8a96bab9d55c094d
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg38_x  :  https://www.dropbox.com/scl/fi/bqdrfh0dx561ir210tu92/1kg_dbsnp151_hg38_X.txt.gz?rlkey=jjetkirbflt02f3w8mrxdqa4g&dl=1
Sun Feb  4 01:14:20 2024  - 1kg_dbsnp151_hg38_x_md5  :  48c05eeb1454c0dd4cbee3cb26382e8e
Sun Feb  4 01:14:20 2024  - recombination_hg19  :  https://www.dropbox.com/s/wbesl8haxknonuc/recombination_hg19.tar.gz?dl=1
Sun Feb  4 01:14:20 2024  - recombination_hg38  :  https://www.dropbox.com/s/vuo8mvqx0fpibzj/recombination_hg38.tar.gz?dl=1
Sun Feb  4 01:14:20 2024  - ensembl_hg19_gtf  :  https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz
Sun Feb  4 01:14:20 2024  - ensembl_hg38_gtf  :  https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens//Homo_sapiens.GRCh38.109.chr.gtf.gz
Sun Feb  4 01:14:20 2024  - refseq_hg19_gtf  :  https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz
Sun Feb  4 01:14:20 2024  - refseq_hg38_gtf  :  https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz
Sun Feb  4 01:14:20 2024  - testlink  :  https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1
Sun Feb  4 01:14:20 2024  - testlink_tbi  :  https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1

Out[13]:

{'1kg_eas_hg19': 'https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
 '1kg_eas_hg19_md5': 'c8c97434843c0da3113fc06879ead472',
 '1kg_eas_hg19_tbi': 'https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
 '1kg_eur_hg19': 'https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
 '1kg_eur_hg19_md5': '734069d895009d38c2f962bfbb6fab52',
 '1kg_eur_hg19_tbi': 'https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
 '1kg_eas_hg38': 'https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
 '1kg_eas_hg38_md5': 'f45e80bca9ef7b29e6b1832e6ac15375',
 '1kg_eas_hg38_tbi': 'https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
 '1kg_eur_hg38': 'https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
 '1kg_eur_hg38_md5': '228d3285fa99132cc6321e2925e0768d',
 '1kg_eur_hg38_tbi': 'https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
 '1kg_sas_hg19': 'https://www.dropbox.com/scl/fi/fubqvuj3p4ii4y35zknv8/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=5z50f66iltjchcaszznq5bczt&dl=1',
 '1kg_sas_hg19_md5': 'e2d3f9e2e6580d05e877e9effd435c4e',
 '1kg_sas_hg19_tbi': 'https://www.dropbox.com/scl/fi/icnmrnzee7ofdpx5l96tg/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=st8t88snby26q37rqi6zh5zck&dl=1',
 '1kg_amr_hg19': 'https://www.dropbox.com/scl/fi/bxa4zfngsxsc38rhtiv8c/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=ibcn8hb1n8n36j3u0jfzci267&dl=1',
 '1kg_amr_hg19_md5': '68d3cdf01cbabdae6e74a07795fa881c',
 '1kg_amr_hg19_tbi': 'https://www.dropbox.com/scl/fi/1zk16x7h4r89jurzwu05u/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=b4cere4w38zvzyfitfge3r8n0&dl=1',
 '1kg_sas_hg38': 'https://www.dropbox.com/scl/fi/jr3l5zz42py3kny2bccmj/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=x0t6tsy71jxzf021wfqdn8k5q&dl=1',
 '1kg_sas_hg38_md5': 'e5d79bea1958aa50c23f618d342ccc83',
 '1kg_sas_hg38_tbi': 'https://www.dropbox.com/scl/fi/02oia4ur5r7w9qgiuf6i9/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=00p9rxe0xzfs6hr1rg4d8oadm&dl=1',
 '1kg_amr_hg38': 'https://www.dropbox.com/scl/fi/4t4tyuhzp78uyb6tgkroq/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=p96gbs1tcdia31jnjv1b82kuz&dl=1',
 '1kg_amr_hg38_md5': '229fbd610001cf6f137b7f738352a44a',
 '1kg_amr_hg38_tbi': 'https://www.dropbox.com/scl/fi/x0dby543tr9xpaqj2i0ba/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=uj8o7j0cy0spipe174jn54sqs&dl=1',
 '1kg_afr_hg19': 'https://www.dropbox.com/scl/fi/tq4w9lyt5z47ym7grtrxg/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=k3bimeu3yr5loq8hohba5mr6k&dl=1',
 '1kg_afr_hg19_md5': 'f7b4425f39e8292dce6f13711e7f6c50',
 '1kg_afr_hg19_tbi': 'https://www.dropbox.com/scl/fi/0giiptu0btwj1kfm6jdzr/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=ucb5weprsc5prcg8hvtgmruxx&dl=1',
 '1kg_pan_hg19': 'https://www.dropbox.com/scl/fi/6b4j9z9knmllfnbx86aw6/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=eento8vg06zyrkvooc9wd4cvu&dl=1',
 '1kg_pan_hg19_md5': 'fed846482204487b60d33b21ddb18106',
 '1kg_pan_hg19_tbi': 'https://www.dropbox.com/scl/fi/stco946scio5tvto0ln4j/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=hfh53beb627lmqwv3d8mzqy0c&dl=1',
 '1kg_afr_hg38': 'https://www.dropbox.com/scl/fi/239xmm7qijtnsks97chc9/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=47en5fk1icbekpg7we3uot9g8&dl=1',
 '1kg_afr_hg38_md5': '3bb7923be0809a324d7b7633b8d58a3b',
 '1kg_afr_hg38_tbi': 'https://www.dropbox.com/scl/fi/3y3pg4yqwo2jaaamx1c8f/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=say0ihfwa51z3otgn4bjtze8p&dl=1',
 '1kg_pan_hg38': 'https://www.dropbox.com/scl/fi/nf01487smtmeq243ihfwm/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=3pefbkzxwcnejx4inynifpft7&dl=1',
 '1kg_pan_hg38_md5': '23bb86d748c4a66e85e087f647e8b60e',
 '1kg_pan_hg38_tbi': 'https://www.dropbox.com/scl/fi/hu7cttr4cenw5yjsm2775/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=568u7bkvkybm4wt2q9284o198&dl=1',
 '1kg_eas_x_hg19': 'https://www.dropbox.com/scl/fi/1inmw09rk35ncuq7tibmp/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=vcjpukgsb7gt4tizg1fvr7tr2&dl=1',
 '1kg_eas_x_hg19_md5': 'b2aced2a1522ed23818989b3153b7e91',
 '1kg_eas_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/uyxb9lfi88dqjp5l3vrzf/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=vt196d16h690dmzox5jyc33xx&dl=1',
 '1kg_eur_x_hg19': 'https://www.dropbox.com/scl/fi/6r4sc2yax8pk644piew2d/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=l5towjhyl733nrd1msjr1d8gl&dl=1',
 '1kg_eur_x_hg19_md5': '6380cb71eafe985d7b894029e979139b',
 '1kg_eur_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/yuid87x398yc9n8nc4bb1/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=01skm13sk6099y34zy6qvweqj&dl=1',
 '1kg_eas_x_hg38': 'https://www.dropbox.com/scl/fi/2m6i93vv1ooano0muukck/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=y6087mmt9kmls066mzobjeqp7&dl=1',
 '1kg_eas_x_hg38_md5': '8c6a35da51621f952a5b97cbcc832046',
 '1kg_eas_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/l6jpt86edarb4emehxwcy/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ddr1fcijb1bh2nso0q0updolh&dl=1',
 '1kg_eur_x_hg38': 'https://www.dropbox.com/scl/fi/ceoff4p95ftef6yldhl67/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=yyt9u11dk6kyha0cvturvrtvf&dl=1',
 '1kg_eur_x_hg38_md5': 'b9a4b8553dec202109f72281f33cb454',
 '1kg_eur_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/tux32myi6g18bx7nd1rdq/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=m4f0v3rblnzv7dj0lo6hsd0hb&dl=1',
 '1kg_sas_x_hg19': 'https://www.dropbox.com/scl/fi/592lbmrkfjn80twnvlt2q/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=zrar7nmltpsuznlyuqw9k77q6&dl=1',
 '1kg_sas_x_hg19_md5': 'f4f370274fe586d209ca6fddc4eceaaf',
 '1kg_sas_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/rmaybun6v248nmjmaz2tj/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=izv96vfajgdd5wsyuvx98ntvk&dl=1',
 '1kg_amr_x_hg19': 'https://www.dropbox.com/scl/fi/gwryyxs0ilgoazqvp39be/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=z46we3kshi9t96x7issl36bda&dl=1',
 '1kg_amr_x_hg19_md5': 'ead838f7059a80118e949959cf1a3ff3',
 '1kg_amr_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/r6g5893smgmnsir5r0v5n/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=sel2f4p7ctggf30udrhcy0psu&dl=1',
 '1kg_sas_x_hg38': 'https://www.dropbox.com/scl/fi/r6qa9l6h9rc5rvenjsdqo/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=mucn2zizrlkebn1e5q7rtzu8e&dl=1',
 '1kg_sas_x_hg38_md5': '31c60999ebb9a13d17d21e02fd9d1f4c',
 '1kg_sas_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/5ktxs24sq6v8hvaowr4xv/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=5y2z66b3s6bgzoikdjvr46ccw&dl=1',
 '1kg_amr_x_hg38': 'https://www.dropbox.com/scl/fi/5brd1nh7u20oigtb17mkb/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=zt0d0nqmlq3u5uu6ukwvof6ta&dl=1',
 '1kg_amr_x_hg38_md5': 'bc7de683d603c8bbff02f5bec8d3469a',
 '1kg_amr_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/8bz6uwjgw8bj16uaz6kye/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=pkc2mepgosxdijpfru4vhum5v&dl=1',
 '1kg_afr_x_hg19': 'https://www.dropbox.com/scl/fi/kz5j4532pyaigceww0vg5/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=sti0g2ri004chu8b12pqgicvw&dl=1',
 '1kg_afr_x_hg19_md5': '77807e5e315a6e47504c175b0aaece88',
 '1kg_afr_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/oadjwamy5pe1ilv2237gq/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=a10bhxrfa904dasrcuc40njq4&dl=1',
 '1kg_pan_x_hg19': 'https://www.dropbox.com/scl/fi/rwov9vszj8rx78u65dxnk/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=ej33zb9ulwdfseur1surz653z&dl=1',
 '1kg_pan_x_hg19_md5': '389d474984ff82df79efd25c0dd66fc9',
 '1kg_pan_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/x0n1htmkbulqybr5cc2cb/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=8rga64u5gm9vp9whqwlj37hre&dl=1',
 '1kg_afr_x_hg38': 'https://www.dropbox.com/scl/fi/ef8h09lhg8vmdxfxsayv4/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=96xjxu546sbbq5hbgaihh84l8&dl=1',
 '1kg_afr_x_hg38_md5': 'b1410bb21e389a0f08fc2741d33fcc51',
 '1kg_afr_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/tlwhui80cy32mtx6hc8f6/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ana497w8is840ygarx3w1v5pc&dl=1',
 '1kg_pan_x_hg38': 'https://www.dropbox.com/scl/fi/tf7j5540jyzxz2oo7jtho/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=e872ihg5477mlu30vet1e5lvk&dl=1',
 '1kg_pan_x_hg38_md5': '8ae424786a6bfe64c92ca6b9f96ee5e6',
 '1kg_pan_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/u28zo2sjbcmtfs69zqaya/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=t5psxvoewd1oog2hwzellc78p&dl=1',
 'dbsnp_v151_hg19': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz',
 'dbsnp_v151_hg19_tbi': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz.tbi',
 'dbsnp_v151_hg38': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz',
 'dbsnp_v151_hg38_tbi': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi',
 'dbsnp_v156_hg19': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz',
 'dbsnp_v156_hg19_tbi': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz.tbi',
 'dbsnp_v156_hg38': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz',
 'dbsnp_v156_hg38_tbi': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi',
 'ucsc_genome_hg19': 'http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz',
 'ucsc_genome_hg38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz',
 '1kg_dbsnp151_hg19_auto': 'https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1',
 '1kg_dbsnp151_hg19_auto_md5': '7d1e7624fb6e4df7a2f6f05558d436b4',
 '1kg_dbsnp151_hg38_auto': 'https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1',
 '1kg_dbsnp151_hg38_auto_md5': '4c7ef2d2415c18c286219e970fdda972',
 '1kg_dbsnp151_hg19_x': 'https://www.dropbox.com/scl/fi/ghwq2yh5bi7o411m1mw15/1kg_dbsnp151_hg19_X.txt.gz?rlkey=50du8e42qjdgzge0lcdiu7tv2&dl=1',
 '1kg_dbsnp151_hg19_x_md5': 'cbf0e3518ab73d6d8a96bab9d55c094d',
 '1kg_dbsnp151_hg38_x': 'https://www.dropbox.com/scl/fi/bqdrfh0dx561ir210tu92/1kg_dbsnp151_hg38_X.txt.gz?rlkey=jjetkirbflt02f3w8mrxdqa4g&dl=1',
 '1kg_dbsnp151_hg38_x_md5': '48c05eeb1454c0dd4cbee3cb26382e8e',
 'recombination_hg19': 'https://www.dropbox.com/s/wbesl8haxknonuc/recombination_hg19.tar.gz?dl=1',
 'recombination_hg38': 'https://www.dropbox.com/s/vuo8mvqx0fpibzj/recombination_hg38.tar.gz?dl=1',
 'ensembl_hg19_gtf': 'https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz',
 'ensembl_hg38_gtf': 'https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens//Homo_sapiens.GRCh38.109.chr.gtf.gz',
 'refseq_hg19_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz',
 'refseq_hg38_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz',
 'testlink': 'https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1',
 'testlink_tbi': 'https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1'}

You can see the current available reference files (from the original source or pre-processed by gwaslab).

Download reference using gwaslab¶

Select the keyword and use download_ref to download the files. The downloaded files will be placed in ~/.gwasalb by default.

1kg_eas_hg19 : processed 1000 Genomes Project EAS samples dataset(hg19; ~2.8GB) It may take several minutes to download.

Default directory for reference files downloaded using GWASLab : ~/.gwaslab

In [14]:

Copied!

# ~2.8GB
gl.download_ref("1kg_eas_hg19")
# ~2.8GB
gl.download_ref("1kg_eas_hg19")

Tue Apr 25 14:13:51 2023 Start to download  1kg_eas_hg19  ...
Tue Apr 25 14:13:51 2023  -Downloading to: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Apr 25 14:15:41 2023  -File path: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Apr 25 14:15:41 2023  -MD5 check: c8c97434843c0da3113fc06879ead472
Tue Apr 25 14:15:41 2023  -MD5 verified.
Tue Apr 25 14:15:41 2023  -Updating record in config file...
Tue Apr 25 14:15:43 2023  -Updating record in config file...
Tue Apr 25 14:15:43 2023  -Downloading to: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi
Tue Apr 25 14:15:43 2023 Downloaded  1kg_eas_hg19  successfully!

In [15]:

Copied!

gl.check_downloaded_ref()
gl.check_downloaded_ref()

Tue Apr 25 14:15:43 2023 Start to check downloaded reference files...
Tue Apr 25 14:15:43 2023  -Checking the config file:/home/yunye/work/gwaslab/src/gwaslab/data/config.json
Tue Apr 25 14:15:43 2023  -Config file exists.
Tue Apr 25 14:15:43 2023  -Updating config.json...
Tue Apr 25 14:15:43 2023   - ensembl_hg19_gtf  :  /home/yunye/.gwaslab/Homo_sapiens.GRCh37.87.chr.gtf.gz
Tue Apr 25 14:15:43 2023   - recombination_hg19  :  /home/yunye/.gwaslab/recombination/hg19/recombination_hg19.tar.gz
Tue Apr 25 14:15:43 2023   - 1kg_eas_hg19  :  /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Apr 25 14:15:43 2023   - 1kg_eas_hg19_tbi  :  /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi

Out[15]:

{'ensembl_hg19_gtf': '/home/yunye/.gwaslab/Homo_sapiens.GRCh37.87.chr.gtf.gz',
 'recombination_hg19': '/home/yunye/.gwaslab/recombination/hg19/recombination_hg19.tar.gz',
 '1kg_eas_hg19': '/home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz',
 '1kg_eas_hg19_tbi': '/home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi'}

After downloading, use get_path to obtain the file path by specifying the keyword.

Note:

If tabix is available in PATH, the speed will be will greatly improved. Otherwise, vcf files will be loaded from the head.
tabix: http://www.htslib.org/download/

In [3]:

Copied!





mysumstats.plot_mqq(mode="r",
                    build="19",
                    region=(7,126253550,128253550),
                    region_grid=True,
                    anno=True,
                    anno_args={"rotation":0,"fontsize":12},
                    vcf_path=gl.get_path("1kg_eas_hg19"))
mysumstats.plot_mqq(mode="r",
                    build="19",
                    region=(7,126253550,128253550),
                    region_grid=True,
                    anno=True,
                    anno_args={"rotation":0,"fontsize":12},
                    vcf_path=gl.get_path("1kg_eas_hg19"))

2024/02/15 18:55:28 Start to create MQQ plot...v3.4.40:
2024/02/15 18:55:28  -Genomic coordinates version: 19...
2024/02/15 18:55:28  -Genome-wide significance level to plot is set to 5e-08 ...
2024/02/15 18:55:28  -Raw input contains 12557761 variants...
2024/02/15 18:55:28  -MQQ plot layout mode is : r
2024/02/15 18:55:28  -Region to plot : chr7:126253550-128253550.
2024/02/15 18:55:28  -Checking prefix for chromosomes in vcf files...
2024/02/15 18:55:28  -No prefix for chromosomes in the VCF files.
2024/02/15 18:55:31  -Extract SNPs in region : chr7:126253550-128253550...
2024/02/15 18:55:32  -Extract SNPs in specified regions: 8087
2024/02/15 18:55:32 Finished loading specified columns from the sumstats.
2024/02/15 18:55:32 Start data conversion and sanity check:
2024/02/15 18:55:32  -Removed 0 variants with nan in CHR or POS column ...
2024/02/15 18:55:32  -Removed 0 variants with CHR <=0...
2024/02/15 18:55:32  -Removed 0 variants with nan in P column ...
2024/02/15 18:55:32  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2024/02/15 18:55:32  -Sumstats P values are being converted to -log10(P)...
2024/02/15 18:55:32  -Sanity check: 0 na/inf/-inf variants will be removed...
2024/02/15 18:55:32  -Converting data above cut line...
2024/02/15 18:55:32  -Maximum -log10(P) value is 73.38711023071251 .
2024/02/15 18:55:32 Finished data conversion and sanity check.
2024/02/15 18:55:32 Start to create MQQ plot with 8087 variants...
2024/02/15 18:55:32  -tabix will be used: /home/yunye/tools/bin/tabix
2024/02/15 18:55:32 Start to load reference genotype...
2024/02/15 18:55:32  -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
2024/02/15 18:55:34  -Retrieving index...
2024/02/15 18:55:34  -Ref variants in the region: 54234
2024/02/15 18:55:34  -Matching variants using POS, NEA, EA ...
2024/02/15 18:55:35  -Calculating Rsq...
2024/02/15 18:55:35 Finished loading reference genotype successfully!
2024/02/15 18:55:35  -Creating background plot...
2024/02/15 18:55:35  -Extracting lead variant...
2024/02/15 18:55:35  -Loading gtf files from:default

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']

2024/02/15 18:56:04  -plotting gene track..
2024/02/15 18:56:04  -Finished plotting gene track..
2024/02/15 18:56:04 Finished creating MQQ plot successfully!
2024/02/15 18:56:04 Start to extract variants for annotation...
2024/02/15 18:56:04  -Found 1 significant variants with a sliding window size of 500 kb...
2024/02/15 18:56:04 Finished extracting variants for annotation...
2024/02/15 18:56:04 Start to process figure arts.
2024/02/15 18:56:04  -Processing X labels...
2024/02/15 18:56:04  -Processing Y labels...
2024/02/15 18:56:04  -Processing Y tick lables...
2024/02/15 18:56:04  -Processing Y labels...
2024/02/15 18:56:04  -Processing color bar...
2024/02/15 18:56:04  -Processing lines...
2024/02/15 18:56:04 Finished processing figure arts.
2024/02/15 18:56:04 Start to annotate variants...
2024/02/15 18:56:04  -Annotating using column CHR:POS...
2024/02/15 18:56:04  -Adjusting text positions with repel_force=0.03...
2024/02/15 18:56:04 Finished annotating variants.
2024/02/15 18:56:04 Start to save figure...
2024/02/15 18:56:04  -Skip saving figure!
2024/02/15 18:56:04 Finished saving figure...
2024/02/15 18:56:04 Finished creating plot successfully!

Out[3]:

(<Figure size 3000x2000 with 4 Axes>, <gwaslab.g_Log.Log at 0x7fa21cbcb880>)

Or you can provide your own vcf files for vcf_path.

In [15]:

Copied!





# mysumstats.plot_mqq(mode="r",
#                     region=(7,156538803,157538803),
#                     region_grid=True,
#                     anno=True,
#                     vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")
# mysumstats.plot_mqq(mode="r",
#                     region=(7,156538803,157538803),
#                     region_grid=True,
#                     anno=True,
#                     vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")

Note: GWASLab default genome build version is build="99" (Unknown) , you can change it to build="19" (GRCh37/hg19) or build="38" (GRCh38/hg38) when needed. Or you can specify build when loading into gl.Sumstats(build="19") or use mysumstats.infer_build().

For gene tracks, default is gtf_path="ensembl" , you can also use gtf_path="refseq" (NCBA RefSeq)

Sampling¶

There are more than 10 million variants in the original sumstats and it will take some time to process the entrie dataset. Let's just randomly sample 100K variants for this tutorial.

In [16]:

Copied!

mysumstats.random_variants(n=100000,inplace=True,random_state=1234)
mysumstats.random_variants(n=100000,inplace=True,random_state=1234)

Sun Feb  4 01:15:53 2024 Start to randomly select variants from the sumstats...
Sun Feb  4 01:15:53 2024  -Number of variants selected from the sumstats: 100000
Sun Feb  4 01:15:53 2024  -Random state (seed): 1234
Sun Feb  4 01:15:54 2024 Finished sampling...

Infer genome build¶

In case you don't know the genome build of the sumstats

For details, see: https://cloufield.github.io/gwaslab/InferBuild/

In [17]:

Copied!

mysumstats.infer_build()
mysumstats.infer_build()

Sun Feb  4 01:15:54 2024 Start to infer genome build version using hapmap3 SNPs...v3.4.38
Sun Feb  4 01:15:54 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 33.80 MB
Sun Feb  4 01:15:54 2024 Start to infer genome build version using hapmap3 SNPs...
Sun Feb  4 01:15:54 2024  -Loading Hapmap3 variants data...
Sun Feb  4 01:15:56 2024  -CHR:POS will be used for matching...
Sun Feb  4 01:15:57 2024  -Matching variants for hg19: num_hg19 =  8671
Sun Feb  4 01:15:57 2024  -Matching variants for hg38: num_hg38 =  145
Sun Feb  4 01:15:57 2024  -Warning: please be cautious due to the limited number of variants.
Sun Feb  4 01:15:57 2024  -Since num_hg19 >> num_hg38, assigning genome build hg19...
Sun Feb  4 01:15:57 2024 Finished inferring genome build version using hapmap3 SNPs.

Fix SNPID¶

You may notice that the SNPID is in CHR:POS_REF_ALT format. We want SNPID to be in a stadardized format CHR:POS:REF:ALT, we can use fix_id for this:

For other options of standardization, see: https://cloufield.github.io/gwaslab/Standardization/

In [18]:

Copied!

#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)
#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)

Sun Feb  4 01:15:57 2024 Start to check SNPID/rsID...v3.4.38
Sun Feb  4 01:15:57 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB
Sun Feb  4 01:15:57 2024  -Checking SNPID data type...
Sun Feb  4 01:15:57 2024  -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
Sun Feb  4 01:15:58 2024  -Replacing [_-] in SNPID with ":" ...
Sun Feb  4 01:15:58 2024 Finished checking SNPID/rsID.

In [19]:

Copied!

mysumstats.data
mysumstats.data

Out[19]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS
1736173	2:180784768:G:T	2	180784768	G	T	0.9978	0.0797	0.1141	0.4850	191764	-+++	1960099
258841	1:61108984:G:C	1	61108984	G	C	0.4847	-0.0095	0.0087	0.2767	191764	+--+	1960099
89268	1:20294687:G:A	1	20294687	G	A	0.7936	0.0161	0.0111	0.1484	191764	0++-	1960099
10697866	16:86560264:T:C	16	86560264	C	T	0.3114	-0.0016	0.0095	0.8663	191764	--+-	1960099
6068219	7:152311067:G:A	7	152311067	G	A	0.9977	0.1369	0.1433	0.3395	166718	+?++	1960099
...	...	...	...	...	...	...	...	...	...	...	...	...
3376277	4:105613391:C:T	4	105613391	C	T	0.9966	0.1211	0.0939	0.1974	191764	++++	1960099
5598990	7:38184549:A:T	7	38184549	T	A	0.6533	-0.0000	0.0095	0.9982	191764	+-+-	1960099
7580034	10:58897803:A:G	10	58897803	G	A	0.0096	-0.0339	0.0623	0.5860	191764	++--	1960099
10546104	16:58355232:G:A	16	58355232	G	A	0.6904	0.0083	0.0108	0.4401	191764	+-++	1960099
9937711	14:94630415:G:A	14	94630415	G	A	0.7970	0.0174	0.0113	0.1221	191764	++++	1960099

100000 rows × 12 columns

Harmonization¶

gwaslab can harmonize the sumstats based on reference files.

ref_seq : reference genome fasta file for allele alignment
ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
ref_alt_freq : field in INFO of vcf file for alternative allele frequency

For details see: https://cloufield.github.io/gwaslab/Harmonization/

For reference data, see: https://cloufield.github.io/gwaslab/Reference/

When calling .harmonize(), .basic_check() will be called first to make sure the dataset is ready for harmonization. Since we have already performed basic_check, we set basic_check=False here. For ref_infer, we pass the downloaded vcf files for 1KG EAS and specify the field for alternative allele frequency to AF by ref_alt_freq= "AF".

We will use hg19 reference genome FASTA file from UCSC.

In [20]:

Copied!

gl.download_ref("ucsc_genome_hg19")
gl.download_ref("ucsc_genome_hg19")

Sun Feb  4 01:16:03 2024 Start to download  ucsc_genome_hg19  ...
Sun Feb  4 01:16:03 2024  -Downloading to: /home/yunye/.gwaslab/hg19.fa.gz
Sun Feb  4 01:16:03 2024  -File /home/yunye/.gwaslab/hg19.fa.gz exists.
Sun Feb  4 01:16:03 2024  -Updating record in config file...
Sun Feb  4 01:16:03 2024  -gunzip : /home/yunye/.gwaslab/hg19.fa.gz
Sun Feb  4 01:16:18 2024  -Updating record in config file...
Sun Feb  4 01:16:18 2024 Downloaded  ucsc_genome_hg19  successfully!

In [21]:

Copied!





mysumstats.harmonize(basic_check=False,
                     n_cores=3,
                     ref_seq     = gl.get_path("ucsc_genome_hg19"),
                     ref_infer   = gl.get_path("1kg_eas_hg19"),
                     ref_alt_freq= "AF")
mysumstats.harmonize(basic_check=False,
                     n_cores=3,
                     ref_seq     = gl.get_path("ucsc_genome_hg19"),
                     ref_infer   = gl.get_path("1kg_eas_hg19"),
                     ref_alt_freq= "AF")

Sun Feb  4 01:16:18 2024 Start to check if NEA is aligned with reference sequence...v3.4.38
Sun Feb  4 01:16:18 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB
Sun Feb  4 01:16:18 2024  -Reference genome FASTA file: /home/yunye/.gwaslab/hg19.fa
Sun Feb  4 01:16:18 2024  -Checking records: 1  2  3  4  5  6  7  X  8  9  10  11  12  13  14  15  16  17  18  20  Y  19  22  21  M  
Sun Feb  4 01:16:41 2024  -Variants allele on given reference sequence :  41528
Sun Feb  4 01:16:41 2024  -Variants flipped :  50055
Sun Feb  4 01:16:41 2024   -Raw Matching rate :  91.58%
Sun Feb  4 01:16:41 2024  -Variants inferred reverse_complement :  0
Sun Feb  4 01:16:41 2024  -Variants inferred reverse_complement_flipped :  0
Sun Feb  4 01:16:41 2024  -Both allele on genome + unable to distinguish :  8417
Sun Feb  4 01:16:41 2024  -Variants not on given reference sequence :  0
Sun Feb  4 01:16:41 2024 Finished checking if NEA is aligned with reference sequence.
Sun Feb  4 01:16:41 2024 Start to adjust statistics based on STATUS code...v3.4.38
Sun Feb  4 01:16:41 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB
Sun Feb  4 01:16:42 2024 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.38
Sun Feb  4 01:16:42 2024  -Flipping 50055 variants...
Sun Feb  4 01:16:42 2024  -Swapping column: NEA <=> EA...
Sun Feb  4 01:16:42 2024  -Flipping column: BETA = - BETA...
Sun Feb  4 01:16:42 2024  -Flipping column: DIRECTION +-?0 <=> -+?0 ...
Sun Feb  4 01:16:42 2024  -Flipping column: EAF = 1 - EAF...
Sun Feb  4 01:16:42 2024  -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x
Sun Feb  4 01:16:43 2024 Finished adjusting.
Sun Feb  4 01:16:43 2024 Start to infer strand for palindromic SNPs/align indistinguishable indels...v3.4.38
Sun Feb  4 01:16:43 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB
Sun Feb  4 01:16:43 2024  -Number of threads/cores to use: 3
Sun Feb  4 01:16:43 2024  -Reference VCF: /home/yunye/anaconda3/lib/python3.9/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Sun Feb  4 01:16:43 2024  -Checking prefix for chromosomes in vcf files...
Sun Feb  4 01:16:43 2024  -No prefix for chromosomes in the VCF files.
Sun Feb  4 01:16:43 2024  -Field for alternative allele frequency in VCF INFO: AF
Sun Feb  4 01:16:43 2024  -Identified  14174  palindromic SNPs...
Sun Feb  4 01:16:44 2024  -After filtering by MAF< 0.4 , 12937 palindromic SNPs with unknown strand will be inferred...
Sun Feb  4 01:16:52 2024   -Non-palindromic :  76650
Sun Feb  4 01:16:52 2024   -Palindromic SNPs on + strand:  12566
Sun Feb  4 01:16:52 2024   -Palindromic SNPs on - strand and needed to be flipped: 17
Sun Feb  4 01:16:52 2024   -Palindromic SNPs with MAF not available to infer :  1237
Sun Feb  4 01:16:52 2024   -Palindromic SNPs with no macthes or no information :  181
Sun Feb  4 01:16:53 2024  -Identified  8417  indistinguishable Indels...
Sun Feb  4 01:16:53 2024  -Indistinguishable indels will be inferred from reference vcf ref and alt...
Sun Feb  4 01:16:53 2024  -DAF tolerance: 0.2
Sun Feb  4 01:17:05 2024   -Indels ea/nea match reference :  3656
Sun Feb  4 01:17:06 2024   -Indels ea/nea need to be flipped :  4483
Sun Feb  4 01:17:06 2024   -Indels with no macthes or no information :  278
Sun Feb  4 01:17:06 2024 Finished inferring strand for palindromic SNPs/align indistinguishable indels.
Sun Feb  4 01:17:06 2024 Start to adjust statistics based on STATUS code...v3.4.38
Sun Feb  4 01:17:06 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB
Sun Feb  4 01:17:07 2024 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: ALT->EA , REF->NEA...v3.4.38
Sun Feb  4 01:17:07 2024  -Flipping 4483 variants...
Sun Feb  4 01:17:07 2024  -Swapping column: NEA <=> EA...
Sun Feb  4 01:17:07 2024  -Flipping column: BETA = - BETA...
Sun Feb  4 01:17:07 2024  -Flipping column: DIRECTION +-?0 <=> -+?0 ...
Sun Feb  4 01:17:07 2024  -Flipping column: EAF = 1 - EAF...
Sun Feb  4 01:17:07 2024  -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4
Sun Feb  4 01:17:07 2024 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand...v3.4.38
Sun Feb  4 01:17:07 2024  -Flipping 17 variants...
Sun Feb  4 01:17:07 2024  -Flipping column: BETA = - BETA...
Sun Feb  4 01:17:07 2024  -Flipping column: DIRECTION +-?0 <=> -+?0 ...
Sun Feb  4 01:17:07 2024  -Flipping column: EAF = 1 - EAF...
Sun Feb  4 01:17:07 2024  -Changed the status for flipped variants:  xxxxx[012]5: ->  xxxxx[012]2
Sun Feb  4 01:17:07 2024 Finished adjusting.
Sun Feb  4 01:17:08 2024 Start to sort the genome coordinates...v3.4.38
Sun Feb  4 01:17:08 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB
Sun Feb  4 01:17:08 2024 Finished sorting coordinates.
Sun Feb  4 01:17:08 2024 Start to reorder the columns...v3.4.38
Sun Feb  4 01:17:08 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 13.48 MB
Sun Feb  4 01:17:08 2024  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS
Sun Feb  4 01:17:08 2024 Finished reordering the columns.

Out[21]:

<gwaslab.g_Sumstats.Sumstats at 0x7fe710b84310>

Check the data again. Looks good!

In [22]:

Copied!

mysumstats.data
mysumstats.data

Out[22]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS
0	1:875643:C:G	1	875643	G	C	0.7290	0.0147	0.0130	0.258600	166718	+?+-	1960001
1	1:928416:G:A	1	928416	A	G	0.1363	0.0113	0.0142	0.423300	191764	--+-	1960010
2	1:933045:C:T	1	933045	T	C	0.0087	0.0008	0.0704	0.991500	166718	-?+-	1960010
3	1:1019158:G:A	1	1019158	A	G	0.0110	0.1189	0.0613	0.052510	166718	+?++	1960010
4	1:1023310:T:C	1	1023310	C	T	0.0854	0.0202	0.0182	0.265600	191764	+++-	1960000
...	...	...	...	...	...	...	...	...	...	...	...	...
99995	X:153882706:C:T	23	153882706	T	C	0.0125	-0.0017	0.0526	0.973600	166718	+?--	1960010
99996	X:154087726:C:T	23	154087726	T	C	0.0743	0.0414	0.0151	0.006171	191764	+-+-	1960010
99997	X:154126990:T:C	23	154126990	C	T	0.0763	-0.0085	0.0132	0.519500	191764	+---	1960000
99998	X:154140146:T:G	23	154140146	G	T	0.1707	0.0043	0.0096	0.652700	191764	+-+-	1960000
99999	X:154278797:T:C	23	154278797	C	T	0.1738	0.0025	0.0091	0.785500	191764	+-+-	1960000

100000 rows × 12 columns

Annotate rsID¶

rsID is assigned using two types of reference file:

ref_rsid_tsv : tsv file for annotation of commonly used variants
ref_rsid_vcf : vcf file for annotation of other variants

GWASLab provides preprocessed tsv files for 1KG varaints (~80M), we can download the file using .download_ref with key words. This time we will use 1kg_dbsnp151_hg19_auto, which is the SNPID-rsID conversion table for autosomal variants in 1KG project (hg19). This will take around a few minutes to dowadload.

In [23]:

Copied!

# 961M
gl.download_ref("1kg_dbsnp151_hg19_auto")
# 961M
gl.download_ref("1kg_dbsnp151_hg19_auto")

Sun Feb  4 01:17:08 2024 Start to download  1kg_dbsnp151_hg19_auto  ...
Sun Feb  4 01:17:08 2024  -Downloading to: /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz
Sun Feb  4 01:17:08 2024  -File /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz exists.
Sun Feb  4 01:17:10 2024  -File path: /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz
Sun Feb  4 01:17:10 2024  -MD5 check: 7d1e7624fb6e4df7a2f6f05558d436b4
Sun Feb  4 01:17:10 2024  -MD5 verified.
Sun Feb  4 01:17:10 2024  -Updating record in config file...
Sun Feb  4 01:17:10 2024 Downloaded  1kg_dbsnp151_hg19_auto  successfully!

In [24]:

Copied!

mysumstats.assign_rsid(ref_rsid_tsv= gl.get_path("1kg_dbsnp151_hg19_auto"))
mysumstats.assign_rsid(ref_rsid_tsv= gl.get_path("1kg_dbsnp151_hg19_auto"))

Sun Feb  4 01:17:10 2024 Start to assign rsID by matching SNPID with CHR:POS:REF:ALT in the reference TSV...v3.4.38
Sun Feb  4 01:17:10 2024  -Current Dataframe shape : 100000 x 12 ; Memory usage: 13.48 MB
Sun Feb  4 01:17:10 2024  -Number of threads/cores to use: 1
Sun Feb  4 01:17:10 2024  -Reference TSV: /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz
Sun Feb  4 01:17:10 2024  -100000 rsID could be possibly fixed...
Sun Feb  4 01:17:10 2024  -Setting block size:  5000000
Sun Feb  4 01:17:10 2024  -Loading block: 0   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   
Sun Feb  4 01:21:13 2024  -rsID annotation for 3266 needed to be fixed!
Sun Feb  4 01:21:13 2024  -Annotated 96734 rsID successfully!
Sun Feb  4 01:21:13 2024 Finished assign rsID using reference file.

We annotated 96734 (96734 /100000 = 96.7%) variants using the conversion table.
For the annotation of other variants (3.3%), we may need the dbSNP reference vcf.

In [25]:

Copied!

mysumstats.data
mysumstats.data

Out[25]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS	rsID
0	1:875643:C:G	1	875643	G	C	0.7290	0.0147	0.0130	0.258600	166718	+?+-	1960001	rs7411115
1	1:928416:G:A	1	928416	A	G	0.1363	0.0113	0.0142	0.423300	191764	--+-	1960010	rs111754459
2	1:933045:C:T	1	933045	T	C	0.0087	0.0008	0.0704	0.991500	166718	-?+-	1960010	rs189213898
3	1:1019158:G:A	1	1019158	A	G	0.0110	0.1189	0.0613	0.052510	166718	+?++	1960010	rs142369888
4	1:1023310:T:C	1	1023310	C	T	0.0854	0.0202	0.0182	0.265600	191764	+++-	1960000	rs11260589
...	...	...	...	...	...	...	...	...	...	...	...	...	...
99995	X:153882706:C:T	23	153882706	T	C	0.0125	-0.0017	0.0526	0.973600	166718	+?--	1960010	<NA>
99996	X:154087726:C:T	23	154087726	T	C	0.0743	0.0414	0.0151	0.006171	191764	+-+-	1960010	<NA>
99997	X:154126990:T:C	23	154126990	C	T	0.0763	-0.0085	0.0132	0.519500	191764	+---	1960000	<NA>
99998	X:154140146:T:G	23	154140146	G	T	0.1707	0.0043	0.0096	0.652700	191764	+-+-	1960000	<NA>
99999	X:154278797:T:C	23	154278797	C	T	0.1738	0.0025	0.0091	0.785500	191764	+-+-	1960000	<NA>

100000 rows × 13 columns

Maximize the annotation¶

For the annotation of not-so-common variants, we can use VCF files downloaded form NCBI dbsnp (https://ftp.ncbi.nih.gov/snp/latest_release/VCF/).

Note:

The file size is huge (>20GB) and it might take several hours to download. (we can skip this step in this tutorial. )
Specify chr_dict to match the chromosome notations in sumstats and in vcf files.

Parameters :

ref_rsid_vcf: the path to the reference rsID vcf file
n_cores : number of threads to use

In [26]:

Copied!

mysumstats.assign_rsid(n_cores=1,
                       ref_rsid_vcf= "/home/yunye/mydata/d_disk/dbsnp/GCF_000001405.25.vcf.gz",
                       chr_dict = gl.get_number_to_NC(build="19") )
mysumstats.assign_rsid(n_cores=1,
                       ref_rsid_vcf= "/home/yunye/mydata/d_disk/dbsnp/GCF_000001405.25.vcf.gz",
                       chr_dict = gl.get_number_to_NC(build="19") )

Sun Feb  4 01:21:13 2024 Start to assign rsID using reference VCF...v3.4.38
Sun Feb  4 01:21:13 2024  -Current Dataframe shape : 100000 x 13 ; Memory usage: 14.24 MB
Sun Feb  4 01:21:13 2024  -Number of threads/cores to use: 1
Sun Feb  4 01:21:13 2024  -Reference VCF: /home/yunye/mydata/d_disk/dbsnp/GCF_000001405.25.vcf.gz
Sun Feb  4 01:21:13 2024  -Assigning rsID based on CHR:POS and REF:ALT/ALT:REF...
Sun Feb  4 01:21:25 2024  -rsID Annotation for 122 need to be fixed!
Sun Feb  4 01:21:25 2024  -Annotated 3144 rsID successfully!
Sun Feb  4 01:21:25 2024 Finished assign rsID using reference file.

After this, only 122 variants were not annotated. (122/100000 = 0.122%), mostly indels not available in the reference VCF.

In [27]:

Copied!

mysumstats.data.loc[mysumstats.data["rsID"].isna(),:]
mysumstats.data.loc[mysumstats.data["rsID"].isna(),:]

Out[27]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS	rsID
614	1:18539349:T:AG	1	18539349	AG	T	0.3139	-0.0099	0.0096	0.29840	191764	+---	1960309	<NA>
4451	1:160397344:C:CATTATT	1	160397344	C	CATTATT	0.8184	0.0298	0.0123	0.01560	191764	++++	1960368	<NA>
11431	2:107374263:ATCTC:A	2	107374263	ATCTC	A	0.9613	0.0452	0.0238	0.05739	191764	+++-	1960368	<NA>
12873	2:153147115:C:AA	2	153147115	AA	C	0.4939	0.0005	0.0103	0.96120	191764	+--+	1960319	<NA>
13626	2:175894098:G:AA	2	175894098	AA	G	0.3400	0.0029	0.0095	0.76090	191764	--++	1960319	<NA>
...	...	...	...	...	...	...	...	...	...	...	...	...	...
99841	X:145191155:CT:C	23	145191155	C	CT	0.5646	-0.0023	0.0071	0.74480	191764	-+--	1960368	<NA>
99856	X:145727210:CAT:C	23	145727210	C	CAT	0.4036	-0.0159	0.0116	0.16940	191764	--+-	1960368	<NA>
99935	X:150826311:GT:G	23	150826311	GT	G	0.2320	0.0016	0.0143	0.91320	191764	--+-	1960368	<NA>
99971	X:152656101:CCA:C	23	152656101	C	CCA	0.1351	-0.0298	0.0134	0.02646	191764	----	1960368	<NA>
99994	X:153820016:AT:A	23	153820016	AT	A	0.9888	0.0470	0.0605	0.43720	166718	+?++	1960368	<NA>

122 rows × 13 columns

Tips : The two steps can be run in a single one.

In [28]:

Copied!





#mysumstats.assign_rsid(n_cores=3,
#                       ref_rsid_tsv= gl.get_path("1kg_dbsnp151_hg19_auto"),
#                       ref_rsid_vcf= "/home/yunye/CommonData/Reference/ncbi_dbsnp/ncbi_dbsnp/db155/GCF_000001405.25.gz",
#                       chr_dict = gl.get_number_to_NC(build="19") )
#mysumstats.assign_rsid(n_cores=3,
#                       ref_rsid_tsv= gl.get_path("1kg_dbsnp151_hg19_auto"),
#                       ref_rsid_vcf= "/home/yunye/CommonData/Reference/ncbi_dbsnp/ncbi_dbsnp/db155/GCF_000001405.25.gz",
#                       chr_dict = gl.get_number_to_NC(build="19") )

Sumstats Summary¶

Check the summary of the currrent sumstats (see: https://cloufield.github.io/gwaslab/StatusCode/):

In [29]:

Copied!

mysumstats.summary()
mysumstats.summary()

Out[29]:

		Values	Percentage
Category	Items
META	Row_num	100000	<NA>
	Column_num	13	<NA>
	Column_names	SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION...	<NA>
	Last_checked_time	Sun Feb 4 01:21:25 2024	<NA>
MISSING	Missing_total	122	0.12
MISSING	Missing_rsID	122	0.12
MAF	Common	50983	50.98
	Low_frequency	16809	16.81
	Rare	32157	32.16
P	Minimum	1.1900000000000001e-77	0.0
	Significant	86	0.09
	Suggestive	166	0.17
STATUS	1960010	42815	42.82
	1960000	33835	33.84
	1960001	6329	6.33
	1960011	6237	6.24
	1960364	4483	4.48
	1960363	3656	3.66
	1960007	621	0.62
	1960017	616	0.62
	1960309	563	0.56
	1960368	278	0.28
	1960319	196	0.2
	1960018	181	0.18
	1960008	173	0.17
	1960012	10	0.01
	1960002	7	0.01

Check the details of harmonization results .lookup_status()

In [30]:

Copied!

mysumstats.lookup_status()
mysumstats.lookup_status()

Out[30]:

	Genome_Build	rsID&SNPID	CHR&POS	Stadardize&Normalize	Align	Panlidromic_SNP&Indel	Count	Percentage(%)
1960000	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Not_palindromic_SNPs	33835	33.84
1960001	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Palindromic+strand	6329	6.33
1960002	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Palindromic-strand_fixed	7	0.01
1960007	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	Indistinguishable	621	0.62
1960008	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Match: NEA=REF	No_matching_or_no_info	173	0.17
1960010	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Not_palindromic_SNPs	42815	42.82
1960011	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Palindromic+strand	6237	6.24
1960012	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Palindromic-strand_fixed	10	0.01
1960017	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	Indistinguishable	616	0.62
1960018	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized SNP	Flipped_fixed	No_matching_or_no_info	181	0.18
1960309	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Match: NEA=REF	Unchecked	563	0.56
1960319	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Flipped_fixed	Unchecked	196	0.2
1960363	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Both_alleles_on_ref+indistinguishable	Indel_match	3656	3.66
1960364	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Both_alleles_on_ref+indistinguishable	Indel_flipped_fixed	4483	4.48
1960368	hg19	rsid unknown & SNPID valid	CHR valid & POS valid	standardized & normalized indel	Both_alleles_on_ref+indistinguishable	No_matching_or_no_info	278	0.28

Formatting and saving : to_format()¶

You can easily format the processed sumstats and save it. For details, see: https://cloufield.github.io/gwaslab/Format/

Let's export the sumstats as the default format used in LDSC.

In [31]:

Copied!

mysumstats.to_format("clean_sumstats",fmt="ldsc")
mysumstats.to_format("clean_sumstats",fmt="ldsc")

Sun Feb  4 01:21:28 2024 Start to convert the output sumstats in:  ldsc  format
Sun Feb  4 01:21:28 2024  -Formatting statistics ...
Sun Feb  4 01:21:29 2024  -Float statistics formats:
Sun Feb  4 01:21:29 2024   - Columns       : ['EAF', 'BETA', 'SE', 'P']
Sun Feb  4 01:21:29 2024   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Sun Feb  4 01:21:29 2024  -Start outputting sumstats in ldsc format...
Sun Feb  4 01:21:29 2024  -ldsc format will be loaded...
Sun Feb  4 01:21:29 2024  -ldsc format meta info:
Sun Feb  4 01:21:29 2024   - format_name  : ldsc
Sun Feb  4 01:21:29 2024   - format_source  : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
Sun Feb  4 01:21:29 2024   - format_source2  : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
Sun Feb  4 01:21:29 2024   - format_version  :  20150306
Sun Feb  4 01:21:29 2024  -gwaslab to ldsc format dictionary:
Sun Feb  4 01:21:29 2024   - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS
Sun Feb  4 01:21:29 2024   - ldsc values: SNP,A2,A1,Frq,N,Beta,P,Z,INFO,OR,CHR,POS
Sun Feb  4 01:21:29 2024  -Output path: clean_sumstats.ldsc.tsv.gz
Sun Feb  4 01:21:29 2024  -Output columns: Beta,Frq,SNP,A2,P,N,A1,POS,CHR
Sun Feb  4 01:21:29 2024  -Writing sumstats to: clean_sumstats.ldsc.tsv.gz...
Sun Feb  4 01:21:30 2024  -Saving log file to: clean_sumstats.ldsc.log
Sun Feb  4 01:21:30 2024 Finished outputting successfully!

Sometimes we only need to export part of the sumstats. For example,

we can specify hapmap3=True to export only hapmap3 variants;
specify exclude_hla=True to exclude variants in HLA region when exporting;
specify md5sum=True to calculate the md5sum value for the exported sumstats.

In [32]:

Copied!

mysumstats.to_format("clean_sumstats",fmt="ldsc",hapmap3=True,exclude_hla=True,md5sum=True)
mysumstats.to_format("clean_sumstats",fmt="ldsc",hapmap3=True,exclude_hla=True,md5sum=True)

Sun Feb  4 01:21:30 2024 Start to convert the output sumstats in:  ldsc  format
Sun Feb  4 01:21:30 2024  -Excluding variants in MHC (HLA) region ...
Sun Feb  4 01:21:30 2024  -Exclude 609 variants in MHC (HLA) region : 25Mb - 34Mb.
Sun Feb  4 01:21:30 2024  -Processing 99391 raw variants...
Sun Feb  4 01:21:30 2024  -Loading Hapmap3 variants data...
Sun Feb  4 01:21:31 2024  -Extract 8644 variants in Hapmap3 datasets for build 19.
Sun Feb  4 01:21:31 2024  -Formatting statistics ...
Sun Feb  4 01:21:31 2024  -Float statistics formats:
Sun Feb  4 01:21:31 2024   - Columns       : ['EAF', 'BETA', 'SE', 'P']
Sun Feb  4 01:21:31 2024   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Sun Feb  4 01:21:31 2024  -Start outputting sumstats in ldsc format...
Sun Feb  4 01:21:31 2024  -ldsc format will be loaded...
Sun Feb  4 01:21:31 2024  -ldsc format meta info:
Sun Feb  4 01:21:31 2024   - format_name  : ldsc
Sun Feb  4 01:21:31 2024   - format_source  : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
Sun Feb  4 01:21:31 2024   - format_source2  : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
Sun Feb  4 01:21:31 2024   - format_version  :  20150306
Sun Feb  4 01:21:31 2024  -gwaslab to ldsc format dictionary:
Sun Feb  4 01:21:31 2024   - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS
Sun Feb  4 01:21:31 2024   - ldsc values: SNP,A2,A1,Frq,N,Beta,P,Z,INFO,OR,CHR,POS
Sun Feb  4 01:21:31 2024  -Output path: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz
Sun Feb  4 01:21:31 2024  -Output columns: Beta,Frq,SNP,A2,P,N,A1,POS,CHR
Sun Feb  4 01:21:31 2024  -Writing sumstats to: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz...
Sun Feb  4 01:21:31 2024  -md5sum hashing for the file: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz
Sun Feb  4 01:21:31 2024  -md5sum path: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz.md5sum
Sun Feb  4 01:21:31 2024  -md5sum: 7b148066fae5f407f1239e49efe1b88a
Sun Feb  4 01:21:31 2024  -Saving log file to: clean_sumstats.hapmap3.noMHC.ldsc.log
Sun Feb  4 01:21:31 2024 Finished outputting successfully!

Output in GWAS-SSF format

In [33]:

Copied!

mysumstats.to_format("clean_sumstats",fmt="ssf",md5sum=True)
mysumstats.to_format("clean_sumstats",fmt="ssf",md5sum=True)

Sun Feb  4 01:21:31 2024 Start to convert the output sumstats in:  ssf  format
Sun Feb  4 01:21:31 2024  -Formatting statistics ...
Sun Feb  4 01:21:31 2024  -Float statistics formats:
Sun Feb  4 01:21:31 2024   - Columns       : ['EAF', 'BETA', 'SE', 'P']
Sun Feb  4 01:21:31 2024   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Sun Feb  4 01:21:31 2024  -Replacing SNPID separator from ":" to "_"...
Sun Feb  4 01:21:31 2024  -Start outputting sumstats in ssf format...
Sun Feb  4 01:21:31 2024  -ssf format will be loaded...
Sun Feb  4 01:21:31 2024  -ssf format meta info:
Sun Feb  4 01:21:31 2024   - format_name  : ssf
Sun Feb  4 01:21:31 2024   - format_source  : https://www.biorxiv.org/content/10.1101/2022.07.15.500230v1.full
Sun Feb  4 01:21:31 2024   - format_cite_name  : GWAS-SSF v0.1
Sun Feb  4 01:21:31 2024   - format_separator  : \t
Sun Feb  4 01:21:31 2024   - format_na  : #NA
Sun Feb  4 01:21:31 2024   - format_col_order  : chromosome,base_pair_location,effect_allele,other_allele,beta,odds_ratio,hazard_ratio,standard_error,effect_allele_frequency,p_value,neg_log_10_p_value,ci_upper,ci_lower,rsid,variant_id,info,ref_allele,n
Sun Feb  4 01:21:31 2024   - format_version  :  20230328
Sun Feb  4 01:21:31 2024  -gwaslab to ssf format dictionary:
Sun Feb  4 01:21:31 2024   - gwaslab keys: SNPID,rsID,CHR,POS,NEA,EA,EAF,N,BETA,SE,P,MLOG10P,INFO,OR,HR,OR_95L,OR_95U
Sun Feb  4 01:21:31 2024   - ssf values: variant_id,rsid,chromosome,base_pair_location,other_allele,effect_allele,effect_allele_frequency,n,beta,standard_error,p_value,neg_log_10_p_value,info,odds_ratio,hazard_ratio,ci_lower,ci_upper
Sun Feb  4 01:21:31 2024  -Output path: clean_sumstats.ssf.tsv.gz
Sun Feb  4 01:21:31 2024  -Output columns: chromosome,base_pair_location,effect_allele,other_allele,beta,standard_error,effect_allele_frequency,p_value,rsid,variant_id,n
Sun Feb  4 01:21:31 2024  -Writing sumstats to: clean_sumstats.ssf.tsv.gz...
Sun Feb  4 01:21:32 2024  -md5sum hashing for the file: clean_sumstats.ssf.tsv.gz
Sun Feb  4 01:21:32 2024  -md5sum path: clean_sumstats.ssf.tsv.gz.md5sum
Sun Feb  4 01:21:32 2024  -md5sum: f4cee107b7e9dc7d24c97dc4d051e79b
Sun Feb  4 01:21:32 2024  -Saving log file to: clean_sumstats.ssf.log
Sun Feb  4 01:21:32 2024 Finished outputting successfully!

For more formats and examples, see https://cloufield.github.io/gwaslab/format_load_save/

Liftover¶

GWASLab can perform liftover for base pair positions.

Note: GWASLab only liftover CHR and POS, and when lifted, the last two digits status (alignment and palindromic/indel check) code will be rolled back to 99. Since for different reference genome, the reference allele or strand might be reverse, so it is needed to harmonize again using reference files of the target genome build after liftover.

In [34]:

Copied!

mysumstats.liftover(n_cores=1, from_build="19", to_build="38")
mysumstats.liftover(n_cores=1, from_build="19", to_build="38")

Sun Feb  4 01:21:32 2024 Start to perform liftover...v3.4.38
Sun Feb  4 01:21:32 2024  -Current Dataframe shape : 100000 x 13 ; Memory usage: 14.24 MB
Sun Feb  4 01:21:32 2024  -Number of threads/cores to use: 1
Sun Feb  4 01:21:32 2024  -Creating converter : hg19 to hg38
Sun Feb  4 01:21:33 2024  -Converting variants with status code xxx0xxx :100000...
Sun Feb  4 01:21:54 2024  -Removed unmapped variants: 57
Sun Feb  4 01:21:54 2024 Start to fix chromosome notation (CHR)...v3.4.38
Sun Feb  4 01:21:54 2024  -Current Dataframe shape : 99943 x 13 ; Memory usage: 15.00 MB
Sun Feb  4 01:21:54 2024  -Checking CHR data type...
Sun Feb  4 01:21:54 2024  -Variants with standardized chromosome notation: 99943
Sun Feb  4 01:21:54 2024  -All CHR are already fixed...
Sun Feb  4 01:21:55 2024 Finished fixing chromosome notation (CHR).
Sun Feb  4 01:21:55 2024 Start to fix basepair positions (POS)...v3.4.38
Sun Feb  4 01:21:55 2024  -Current Dataframe shape : 99943 x 13 ; Memory usage: 15.00 MB
Sun Feb  4 01:21:55 2024  -Converting to Int64 data type ...
Sun Feb  4 01:21:55 2024  -Position bound:(0 , 250,000,000)
Sun Feb  4 01:21:55 2024  -Removed outliers: 0
Sun Feb  4 01:21:55 2024  -Removed 0 variants with bad positions.
Sun Feb  4 01:21:55 2024 Finished fixing basepair positions (POS).
Sun Feb  4 01:21:55 2024 Finished liftover.

In [35]:

Copied!

mysumstats.data
mysumstats.data

Out[35]:

	SNPID	CHR	POS	EA	NEA	EAF	BETA	SE	P	N	DIRECTION	STATUS	rsID
0	1:875643:C:G	1	940263	G	C	0.7290	0.0147	0.0130	0.258600	166718	+?+-	3860099	rs7411115
1	1:928416:G:A	1	993036	A	G	0.1363	0.0113	0.0142	0.423300	191764	--+-	3860099	rs111754459
2	1:933045:C:T	1	997665	T	C	0.0087	0.0008	0.0704	0.991500	166718	-?+-	3860099	rs189213898
3	1:1019158:G:A	1	1083778	A	G	0.0110	0.1189	0.0613	0.052510	166718	+?++	3860099	rs142369888
4	1:1023310:T:C	1	1087930	C	T	0.0854	0.0202	0.0182	0.265600	191764	+++-	3860099	rs11260589
...	...	...	...	...	...	...	...	...	...	...	...	...	...
99995	X:153882706:C:T	23	154654432	T	C	0.0125	-0.0017	0.0526	0.973600	166718	+?--	3860099	rs3829775
99996	X:154087726:C:T	23	154859451	T	C	0.0743	0.0414	0.0151	0.006171	191764	+-+-	3860099	rs112922881
99997	X:154126990:T:C	23	154898715	C	T	0.0763	-0.0085	0.0132	0.519500	191764	+---	3860099	rs28370221
99998	X:154140146:T:G	23	154911871	G	T	0.1707	0.0043	0.0096	0.652700	191764	+-+-	3860099	rs28895719
99999	X:154278797:T:C	23	155050522	C	T	0.1738	0.0025	0.0091	0.785500	191764	+-+-	3860099	rs114209171

99943 rows × 13 columns

For details, see https://cloufield.github.io/gwaslab/LiftOver/

Tutorial for gwaslab¶

Download sample data¶

Import gwaslab package¶

Loading data into gwaslab Sumstats¶

Create Manhattan plots and Q-Q plots¶

Standardization & QC : .basic_check()¶

Extract lead variants : get_lead()¶

Use the SNPID to create some highly customized mqq plot¶

Quick regional plot without LD-information¶

Reference file downloading¶

Check available reference from gwaslab¶

Download reference using gwaslab¶

Sampling¶

Infer genome build¶

Fix SNPID¶

Harmonization¶

Annotate rsID¶

Maximize the annotation¶

Sumstats Summary¶

Formatting and saving : to_format()¶

Liftover¶

Standardization & QC : `.basic_check()`¶