Tutorial for gwaslab¶
- In this tutorial, we will briefly describe the core functions in gwaslab for sumstats QC, standardization and harmonization.
- We will also show examples of visualization, including Manhattan plots, Q-Q plots and regional plots.
- This jupyter notebook can be downloaded from https://github.com/Cloufield/gwaslab/blob/main/docs/tutorial_3.4.ipynb
- Some steps need reference files, which may take some time to download.
- Please note that the processed reference datasets are currently hosted on Dropbox.
Download sample data¶
- Using a jupyter notebook, we first download a sample dataset.
- The dataset we will use as an example is the sumstats of type 2 diabetes from BBJ (K. Suzuki et al., Nature Genetics. 51, 379–386 (2019).)
- File size: 261M
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
--2023-04-21 13:41:25-- http://jenger.riken.jp/14/ Resolving jenger.riken.jp (jenger.riken.jp)... 134.160.84.25, 161.232.72.25, 202.12.30.131, ... Connecting to jenger.riken.jp (jenger.riken.jp)|134.160.84.25|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 274187574 (261M) [text/plain] Saving to: ‘t2d_bbj.txt.gz’ t2d_bbj.txt.gz 100%[===================>] 261.49M 25.2MB/s in 11s 2023-04-21 13:41:35 (24.7 MB/s) - ‘t2d_bbj.txt.gz’ saved [274187574/274187574]
Sometimes the link is not stable. You can also download from alternative source:
!wget -O t2d_bbj.txt.gz https://www.dropbox.com/s/5vp93eq0gtsm92n/t2d_bbj.txt.gz?dl=1
Import gwaslab package¶
gwaslab can be installed using pip
!pip install gwaslab==3.4.41
If you installed gwaslab from pip, simply run the command to import the package:
import gwaslab as gl
Or if you want to use the latest version from github (beta version), you can clone the repository and import the package by inserting your package path into the system path like:
import sys
sys.path.insert(0,"/home/yunye/work/gwaslab/src")
import gwaslab as gl
Loading data into gwaslab Sumstats¶
Let's import the raw sumstats into the gwaslab.Sumstats
Object by specifying the necessary columns.
Note: you can either specify eaf
(effect allele frequency) or neaf
(non-effect allele frequency), if neaf
is specified, it will be converted to eaf
when loading the sumstats.
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
neaf="Frq",
beta="BETA",
se="SE",
p="P",
direction="Dir",
n="N",
sep="\t")
2024/02/15 18:55:03 GWASLab v3.4.40 https://cloufield.github.io/gwaslab/ 2024/02/15 18:55:03 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com 2024/02/15 18:55:03 Start to initialize gl.Sumstats from file :t2d_bbj.txt.gz 2024/02/15 18:55:18 -Reading columns : Dir,SNP,REF,N,P,POS,ALT,Frq,BETA,CHR,SE 2024/02/15 18:55:18 -Renaming columns to : DIRECTION,SNPID,NEA,N,P,POS,EA,EAF,BETA,CHR,SE 2024/02/15 18:55:18 -Current Dataframe shape : 12557761 x 11 2024/02/15 18:55:19 -Initiating a status column: STATUS ... 2024/02/15 18:55:19 #WARNING! Version of genomic coordinates is unknown... 2024/02/15 18:55:21 -NEAF is specified... 2024/02/15 18:55:21 -Checking if 0<= NEAF <=1 ... 2024/02/15 18:55:22 -Converted NEAF to EAF. 2024/02/15 18:55:22 -Removed 0 variants with bad NEAF. 2024/02/15 18:55:22 Start to reorder the columns...v3.4.40 2024/02/15 18:55:22 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1100.88 MB 2024/02/15 18:55:22 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS 2024/02/15 18:55:22 Finished reordering the columns. 2024/02/15 18:55:22 -Column : SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS 2024/02/15 18:55:22 -DType : object string int64 category category float64 float64 float64 float64 int64 object category 2024/02/15 18:55:22 -Verified: T F T T T T T T T T T T 2024/02/15 18:55:22 #WARNING! Columns with possibly incompatible dtypes: CHR 2024/02/15 18:55:22 -Current Dataframe memory usage: 1100.88 MB 2024/02/15 18:55:22 Finished loading data successfully!
Sumstats are stored in Sumstats.data
as a pandas.DataFrame, you can check the data like:
mysumstats.data
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:725932_G_A | 1 | 725932 | G | A | 0.9960 | -0.0737 | 0.1394 | 0.5970 | 166718 | -?+- | 9999999 |
1 | 1:725933_A_G | 1 | 725933 | G | A | 0.0040 | 0.0737 | 0.1394 | 0.5973 | 166718 | +?-+ | 9999999 |
2 | 1:737801_T_C | 1 | 737801 | C | T | 0.0051 | 0.0490 | 0.1231 | 0.6908 | 166718 | +?-+ | 9999999 |
3 | 1:749963_T_TAA | 1 | 749963 | TAA | T | 0.8374 | 0.0213 | 0.0199 | 0.2846 | 166718 | -?++ | 9999999 |
4 | 1:751343_T_A | 1 | 751343 | T | A | 0.8593 | 0.0172 | 0.0156 | 0.2705 | 166718 | -?++ | 9999999 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12557756 | X:154874837_A_G | X | 154874837 | G | A | 0.7478 | -0.0064 | 0.0117 | 0.5840 | 191764 | -+-+ | 9999999 |
12557757 | X:154875192_GTACTC_G | X | 154875192 | GTACTC | G | 0.2525 | 0.0071 | 0.0122 | 0.5612 | 191764 | +-+- | 9999999 |
12557758 | X:154879115_A_G | X | 154879115 | G | A | 0.7463 | -0.0070 | 0.0122 | 0.5646 | 191764 | -+-+ | 9999999 |
12557759 | X:154880669_T_A | X | 154880669 | T | A | 0.2558 | 0.0071 | 0.0122 | 0.5618 | 191764 | +-+- | 9999999 |
12557760 | X:154880917_C_T | X | 154880917 | C | T | 0.2558 | 0.0072 | 0.0122 | 0.5570 | 191764 | +-+- | 9999999 |
12557761 rows × 12 columns
- For details on gwaslab Sumstats Object, see: https://cloufield.github.io/gwaslab/SumstatsObject/
Create Manhattan plots and Q-Q plots¶
The first thing you want to check is probably the Manhattan and Q-Q plots for your sumstats.
gwaslab will conduct a minimal QC for sumstats when plotting.
mysumstats.plot_mqq()
Sun Feb 4 01:02:47 2024 Start to create MQQ plot...v3.4.38: Sun Feb 4 01:02:47 2024 -Genomic coordinates version: 99... Sun Feb 4 01:02:47 2024 #WARNING! Genomic coordinates version is unknown. Sun Feb 4 01:02:47 2024 -Genome-wide significance level to plot is set to 5e-08 ... Sun Feb 4 01:02:47 2024 -Raw input contains 12557761 variants... Sun Feb 4 01:02:47 2024 -MQQ plot layout mode is : mqq Sun Feb 4 01:02:51 2024 Finished loading specified columns from the sumstats. Sun Feb 4 01:02:51 2024 Start data conversion and sanity check: Sun Feb 4 01:02:51 2024 -Removed 0 variants with nan in CHR or POS column ... Sun Feb 4 01:03:16 2024 -Removed 0 variants with CHR <=0... Sun Feb 4 01:03:17 2024 -Removed 0 variants with nan in P column ... Sun Feb 4 01:03:18 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sun Feb 4 01:03:18 2024 -Sumstats P values are being converted to -log10(P)... Sun Feb 4 01:03:19 2024 -Sanity check: 0 na/inf/-inf variants will be removed... Sun Feb 4 01:03:21 2024 -Converting data above cut line... Sun Feb 4 01:03:21 2024 -Maximum -log10(P) value is 167.58838029403677 . Sun Feb 4 01:03:21 2024 Finished data conversion and sanity check. Sun Feb 4 01:03:21 2024 Start to create MQQ plot with 12557761 variants... Sun Feb 4 01:03:35 2024 -Creating background plot... Sun Feb 4 01:04:00 2024 Finished creating MQQ plot successfully! Sun Feb 4 01:04:00 2024 Start to extract variants for annotation... Sun Feb 4 01:04:00 2024 -Found 89 significant variants with a sliding window size of 500 kb... Sun Feb 4 01:04:00 2024 Finished extracting variants for annotation... Sun Feb 4 01:04:00 2024 Start to process figure arts. Sun Feb 4 01:04:00 2024 -Processing X ticks... Sun Feb 4 01:04:00 2024 -Processing X labels... Sun Feb 4 01:04:00 2024 -Processing Y labels... Sun Feb 4 01:04:00 2024 -Processing Y tick lables... Sun Feb 4 01:04:00 2024 -Processing Y labels... Sun Feb 4 01:04:00 2024 -Processing lines... Sun Feb 4 01:04:00 2024 Finished processing figure arts. Sun Feb 4 01:04:00 2024 Start to annotate variants... Sun Feb 4 01:04:00 2024 -Skip annotating Sun Feb 4 01:04:00 2024 Finished annotating variants. Sun Feb 4 01:04:00 2024 Start to create QQ plot with 12557761 variants: Sun Feb 4 01:04:00 2024 -Plotting all variants... Sun Feb 4 01:04:03 2024 -Expected range of P: (0,1.0) Sun Feb 4 01:04:04 2024 -Lambda GC (MLOG10P mode) at 0.5 is 1.21283 Sun Feb 4 01:04:04 2024 -Processing Y tick lables... Sun Feb 4 01:04:04 2024 Finished creating QQ plot successfully! Sun Feb 4 01:04:04 2024 Start to save figure... Sun Feb 4 01:04:04 2024 -Skip saving figure! Sun Feb 4 01:04:04 2024 Finished saving figure... Sun Feb 4 01:04:04 2024 Finished creating plot successfully!
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)
Using .plot_mqq()
, you can easily plot the Manhattan and QQ plot, but the plots without any manipulations are not really informative in this case, and plotting all points takes a relatively long time. The most significant locus dwarfed other less signicant loci. To adjust the plot, gwaslab provides a wide range of options for customization. For example, we can use skip
and cut
:
- skip : skip variants with MLOG10P <
skip
for faster plotting speed - cut : rescale the MLOG10P values when MLOG10P >
cut
mysumstats.plot_mqq(skip=2, cut=20)
Sun Feb 4 01:06:17 2024 Start to create MQQ plot...v3.4.38: Sun Feb 4 01:06:17 2024 -Genomic coordinates version: 99... Sun Feb 4 01:06:17 2024 #WARNING! Genomic coordinates version is unknown. Sun Feb 4 01:06:17 2024 -Genome-wide significance level to plot is set to 5e-08 ... Sun Feb 4 01:06:17 2024 -Raw input contains 12557761 variants... Sun Feb 4 01:06:17 2024 -MQQ plot layout mode is : mqq Sun Feb 4 01:06:20 2024 Finished loading specified columns from the sumstats. Sun Feb 4 01:06:20 2024 Start data conversion and sanity check: Sun Feb 4 01:06:21 2024 -Removed 0 variants with nan in CHR or POS column ... Sun Feb 4 01:06:45 2024 -Removed 0 variants with CHR <=0... Sun Feb 4 01:06:46 2024 -Removed 0 variants with nan in P column ... Sun Feb 4 01:06:47 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sun Feb 4 01:06:47 2024 -Sumstats P values are being converted to -log10(P)... Sun Feb 4 01:06:48 2024 -Sanity check: 0 na/inf/-inf variants will be removed... Sun Feb 4 01:06:49 2024 -Converting data above cut line... Sun Feb 4 01:06:49 2024 -Maximum -log10(P) value is 167.58838029403677 . Sun Feb 4 01:06:49 2024 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Sun Feb 4 01:06:49 2024 Finished data conversion and sanity check. Sun Feb 4 01:06:49 2024 Start to create MQQ plot with 332882 variants... Sun Feb 4 01:06:49 2024 -Creating background plot... Sun Feb 4 01:06:50 2024 Finished creating MQQ plot successfully! Sun Feb 4 01:06:50 2024 Start to extract variants for annotation... Sun Feb 4 01:06:51 2024 -Found 89 significant variants with a sliding window size of 500 kb... Sun Feb 4 01:06:51 2024 Finished extracting variants for annotation... Sun Feb 4 01:06:51 2024 Start to process figure arts. Sun Feb 4 01:06:51 2024 -Processing X ticks... Sun Feb 4 01:06:51 2024 -Processing X labels... Sun Feb 4 01:06:51 2024 -Processing Y labels... Sun Feb 4 01:06:51 2024 -Processing Y tick lables... Sun Feb 4 01:06:51 2024 -Processing Y labels... Sun Feb 4 01:06:51 2024 -Processing lines... Sun Feb 4 01:06:51 2024 Finished processing figure arts. Sun Feb 4 01:06:51 2024 Start to annotate variants... Sun Feb 4 01:06:51 2024 -Skip annotating Sun Feb 4 01:06:51 2024 Finished annotating variants. Sun Feb 4 01:06:51 2024 Start to create QQ plot with 332882 variants: Sun Feb 4 01:06:51 2024 -Plotting all variants... Sun Feb 4 01:06:51 2024 -Expected range of P: (0,1.0) Sun Feb 4 01:06:52 2024 -Lambda GC (MLOG10P mode) at 0.5 is 1.21283 Sun Feb 4 01:06:52 2024 -Processing Y tick lables... Sun Feb 4 01:06:52 2024 Finished creating QQ plot successfully! Sun Feb 4 01:06:52 2024 Start to save figure... Sun Feb 4 01:06:52 2024 -Skip saving figure! Sun Feb 4 01:06:52 2024 Finished saving figure... Sun Feb 4 01:06:52 2024 Finished creating plot successfully!
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)
Looks better now. But what if we want to annotate some of the most significant loci (for example, lead variants with MLOG10P>30) and only plot Manhattan plot?
mysumstats.plot_mqq(skip=2, cut=20, mode="m", anno=True, sig_level_lead=1e-30)
Sun Feb 4 01:06:56 2024 Start to create MQQ plot...v3.4.38: Sun Feb 4 01:06:56 2024 -Genomic coordinates version: 99... Sun Feb 4 01:06:56 2024 #WARNING! Genomic coordinates version is unknown. Sun Feb 4 01:06:56 2024 -Genome-wide significance level to plot is set to 5e-08 ... Sun Feb 4 01:06:56 2024 -Raw input contains 12557761 variants... Sun Feb 4 01:06:56 2024 -MQQ plot layout mode is : m Sun Feb 4 01:07:01 2024 Finished loading specified columns from the sumstats. Sun Feb 4 01:07:01 2024 Start data conversion and sanity check: Sun Feb 4 01:07:01 2024 -Removed 0 variants with nan in CHR or POS column ... Sun Feb 4 01:07:25 2024 -Removed 0 variants with CHR <=0... Sun Feb 4 01:07:26 2024 -Removed 0 variants with nan in P column ... Sun Feb 4 01:07:27 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sun Feb 4 01:07:28 2024 -Sumstats P values are being converted to -log10(P)... Sun Feb 4 01:07:29 2024 -Sanity check: 0 na/inf/-inf variants will be removed... Sun Feb 4 01:07:30 2024 -Converting data above cut line... Sun Feb 4 01:07:30 2024 -Maximum -log10(P) value is 167.58838029403677 . Sun Feb 4 01:07:30 2024 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Sun Feb 4 01:07:30 2024 Finished data conversion and sanity check. Sun Feb 4 01:07:30 2024 Start to create MQQ plot with 332882 variants... Sun Feb 4 01:07:30 2024 -Creating background plot... Sun Feb 4 01:07:31 2024 Finished creating MQQ plot successfully! Sun Feb 4 01:07:31 2024 Start to extract variants for annotation... Sun Feb 4 01:07:31 2024 -Found 1 significant variants with a sliding window size of 500 kb... Sun Feb 4 01:07:31 2024 Finished extracting variants for annotation... Sun Feb 4 01:07:31 2024 Start to process figure arts. Sun Feb 4 01:07:31 2024 -Processing X ticks... Sun Feb 4 01:07:31 2024 -Processing X labels... Sun Feb 4 01:07:31 2024 -Processing Y labels... Sun Feb 4 01:07:31 2024 -Processing Y tick lables... Sun Feb 4 01:07:31 2024 -Processing Y labels... Sun Feb 4 01:07:31 2024 -Processing lines... Sun Feb 4 01:07:31 2024 Finished processing figure arts. Sun Feb 4 01:07:31 2024 Start to annotate variants... Sun Feb 4 01:07:31 2024 -Annotating using column CHR:POS... Sun Feb 4 01:07:31 2024 -Adjusting text positions with repel_force=0.03... Sun Feb 4 01:07:31 2024 Finished annotating variants. Sun Feb 4 01:07:31 2024 Start to save figure... Sun Feb 4 01:07:31 2024 -Skip saving figure! Sun Feb 4 01:07:31 2024 Finished saving figure... Sun Feb 4 01:07:31 2024 Finished creating plot successfully!
(<Figure size 3000x1000 with 1 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)
GWASLab supports a wide range of customizable options. For details on other options for Manhattan and Q-Q plots, see: https://cloufield.github.io/gwaslab/Visualization/
For more examples, see https://cloufield.github.io/gwaslab/visualization_mqq/
Standardization & QC : .basic_check()
¶
It is needed to check variant ID (SNPID), rsID, chromosome(CHR), basepair position(POS), alleles (EA and NEA) and statistics first before any manipulations or analysis. gwaslab provides a all-in-one function for this, .basic_check()
.
Note: Sometimes you need do this before plotting if the sumstats are not in a standard format.
#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()
Sun Feb 4 01:07:34 2024 Start to check SNPID/rsID...v3.4.38 Sun Feb 4 01:07:34 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1100.88 MB Sun Feb 4 01:07:34 2024 -Checking SNPID data type... Sun Feb 4 01:07:34 2024 -Converting SNPID to pd.string data type... Sun Feb 4 01:07:34 2024 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _) Sun Feb 4 01:07:45 2024 Finished checking SNPID/rsID. Sun Feb 4 01:07:46 2024 Start to fix chromosome notation (CHR)...v3.4.38 Sun Feb 4 01:07:46 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1100.88 MB Sun Feb 4 01:07:46 2024 -Checking CHR data type... Sun Feb 4 01:08:11 2024 -Variants with standardized chromosome notation: 12228970 Sun Feb 4 01:08:35 2024 -Variants with fixable chromosome notations: 328791 Sun Feb 4 01:08:35 2024 -No unrecognized chromosome notations... Sun Feb 4 01:08:37 2024 -Identifying non-autosomal chromosomes : X, Y, and MT ... Sun Feb 4 01:08:38 2024 -Identified 328791 variants on sex chromosomes... Sun Feb 4 01:08:39 2024 -Standardizing sex chromosome notations: X to 23... Sun Feb 4 01:09:14 2024 Finished fixing chromosome notation (CHR). Sun Feb 4 01:09:14 2024 Start to fix basepair positions (POS)...v3.4.38 Sun Feb 4 01:09:14 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1370.85 MB Sun Feb 4 01:09:14 2024 -Converting to Int64 data type ... Sun Feb 4 01:09:20 2024 -Position bound:(0 , 250,000,000) Sun Feb 4 01:09:45 2024 -Removed outliers: 0 Sun Feb 4 01:09:46 2024 Finished fixing basepair positions (POS). Sun Feb 4 01:09:46 2024 Start to fix alleles (EA and NEA)...v3.4.38 Sun Feb 4 01:09:46 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1124.83 MB Sun Feb 4 01:09:46 2024 -Converted all bases to string datatype and UPPERCASE. Sun Feb 4 01:09:50 2024 -Variants with bad EA : 0 Sun Feb 4 01:09:50 2024 -Variants with bad NEA : 0 Sun Feb 4 01:09:51 2024 -Variants with NA for EA or NEA: 0 Sun Feb 4 01:09:52 2024 -Variants with same EA and NEA: 0 Sun Feb 4 01:09:52 2024 -Detected 0 variants with alleles that contain bases other than A/C/T/G . Sun Feb 4 01:10:07 2024 Finished fixing alleles (EA and NEA). Sun Feb 4 01:10:08 2024 Start to perform sanity check for statistics...v3.4.38 Sun Feb 4 01:10:08 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1150.87 MB Sun Feb 4 01:10:08 2024 -Comparison tolerance for floats: 1e-07 Sun Feb 4 01:10:08 2024 -Checking if 0 <= N <= 2147483647 ... Sun Feb 4 01:10:35 2024 -Removed 0 variants with bad/na N. Sun Feb 4 01:10:35 2024 -Checking if -1e-07 < EAF < 1.0000001 ... Sun Feb 4 01:10:37 2024 -Removed 0 variants with bad/na EAF. Sun Feb 4 01:10:37 2024 -Checking if -1e-07 < P < 1.0000001 ... Sun Feb 4 01:10:39 2024 -Removed 0 variants with bad/na P. Sun Feb 4 01:10:39 2024 -Checking if -100.0000001 < BETA < 100.0000001 ... Sun Feb 4 01:10:42 2024 -Removed 0 variants with bad/na BETA. Sun Feb 4 01:10:42 2024 -Checking if -1e-07 < SE < inf ... Sun Feb 4 01:10:43 2024 -Removed 0 variants with bad/na SE. Sun Feb 4 01:10:43 2024 -Checking STATUS and converting STATUS to categories.... Sun Feb 4 01:10:44 2024 -Removed 0 variants with bad statistics in total. Sun Feb 4 01:10:44 2024 -Data types for each column: Sun Feb 4 01:10:44 2024 -Column : SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS Sun Feb 4 01:10:44 2024 -DType : string Int64 Int64 category category float32 float64 float64 float64 Int64 object category Sun Feb 4 01:10:44 2024 -Verified: T T T T T T T T T T T T Sun Feb 4 01:10:44 2024 Finished sanity check for statistics. Sun Feb 4 01:10:44 2024 Start to check data consistency across columns...v3.4.38 Sun Feb 4 01:10:44 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1114.94 MB Sun Feb 4 01:10:44 2024 -Tolerance: 0.001 (Relative) and 0.001 (Absolute) Sun Feb 4 01:10:44 2024 -Checking if BETA/SE-derived-P is consistent with P... Sun Feb 4 01:12:03 2024 -Not consistent: 2380309 variant(s) Sun Feb 4 01:12:04 2024 -Variant SNPID with max difference: X:3293819_T_C with 0.0072642804381793935 Sun Feb 4 01:12:04 2024 -Note: if the max difference is greater than expected, please check your original sumstats. Sun Feb 4 01:12:04 2024 Finished checking data consistency across columns. Sun Feb 4 01:12:04 2024 Start to normalize indels...v3.4.38 Sun Feb 4 01:12:04 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1372.94 MB Sun Feb 4 01:12:09 2024 -Not normalized allele IDs:X:7151130_TT_TC X:9093382_CTTTT_CTTT X:12292253_ATTT_ATT X:16001576_ATT_ATTT X:33822416_GT_GTT ... Sun Feb 4 01:12:09 2024 -Not normalized allele:['TT' 'TC']['CTTTT' 'CTTT']['ATTT' 'ATT']['ATTT' 'ATT']['GTT' 'GT']... Sun Feb 4 01:12:09 2024 -Modified 13 variants according to parsimony and left alignment principal. Sun Feb 4 01:12:10 2024 Finished normalizing indels. Sun Feb 4 01:12:10 2024 Start to sort the genome coordinates...v3.4.38 Sun Feb 4 01:12:10 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1372.94 MB Sun Feb 4 01:12:16 2024 Finished sorting coordinates. Sun Feb 4 01:12:17 2024 Start to reorder the columns...v3.4.38 Sun Feb 4 01:12:17 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1019.13 MB Sun Feb 4 01:12:17 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Sun Feb 4 01:12:17 2024 Finished reordering the columns.
mysumstats.data
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:725932_G_A | 1 | 725932 | G | A | 0.9960 | -0.0737 | 0.1394 | 0.5970 | 166718 | -?+- | 9960099 |
1 | 1:725933_A_G | 1 | 725933 | G | A | 0.0040 | 0.0737 | 0.1394 | 0.5973 | 166718 | +?-+ | 9960099 |
2 | 1:737801_T_C | 1 | 737801 | C | T | 0.0051 | 0.0490 | 0.1231 | 0.6908 | 166718 | +?-+ | 9960099 |
3 | 1:749963_T_TAA | 1 | 749963 | TAA | T | 0.8374 | 0.0213 | 0.0199 | 0.2846 | 166718 | -?++ | 9960399 |
4 | 1:751343_T_A | 1 | 751343 | T | A | 0.8593 | 0.0172 | 0.0156 | 0.2705 | 166718 | -?++ | 9960099 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12557756 | X:154874837_A_G | 23 | 154874837 | G | A | 0.7478 | -0.0064 | 0.0117 | 0.5840 | 191764 | -+-+ | 9960099 |
12557757 | X:154875192_GTACTC_G | 23 | 154875192 | GTACTC | G | 0.2525 | 0.0071 | 0.0122 | 0.5612 | 191764 | +-+- | 9960399 |
12557758 | X:154879115_A_G | 23 | 154879115 | G | A | 0.7463 | -0.0070 | 0.0122 | 0.5646 | 191764 | -+-+ | 9960099 |
12557759 | X:154880669_T_A | 23 | 154880669 | T | A | 0.2558 | 0.0071 | 0.0122 | 0.5618 | 191764 | +-+- | 9960099 |
12557760 | X:154880917_C_T | 23 | 154880917 | C | T | 0.2558 | 0.0072 | 0.0122 | 0.5570 | 191764 | +-+- | 9960099 |
12557761 rows × 12 columns
By checking the log, we can see that the sumstats look good. But we still found several variants that were not normalized. GWASLab fixed the position and alleles for the un-normalizated indels. And GWASLab standardizes the notation for chromosome X to 23.
In fact, .basic_check()
is a wrapper of the following basic functions, you can also use these separately.
- mysumstats.fix_ID()
- mysumstats.fix_chr()
- mysumstats.fix_pos()
- mysumstats.fix_allele()
- mysumstats.check_sanity()
- mysumstats.check_data_consisteny()
- mysumstats.normalize_allele()
For other options, see: https://cloufield.github.io/gwaslab/Standardization/
For other examples, see: https://cloufield.github.io/gwaslab/standardization_workflow/
Extract lead variants : get_lead()¶
Let's extract the lead variants in each significant locus to check our data.
The significant loci are detected based on a sliding window (default window size: windowsizekb=500
kb)
By specifying anno=True
, gwaslab will also annotate the lead variant with its nearest gene names and distance.
Note: GWASLab default genome build version is build="99"
(Unknown) , you can change it to build="19"
(GRCh37/hg19) or build="38"
(GRCh38/hg38) when needed. Or you can specify the build for sumstats when loading into gl.Sumstats(build="19")
or use mysumstats.infer_build()
.
Note: GWASLab will donwload reference files when you run it for the first time. In this case, ensembl_hg19_gtf_protein_coding
was downloaded and processed automatically.
mysumstats.get_lead(anno=True, build="19")
Sun Feb 4 01:12:17 2024 Start to extract lead variants...v3.4.38 Sun Feb 4 01:12:17 2024 -Current Dataframe shape : 12557761 x 12 ; Memory usage: 1019.13 MB Sun Feb 4 01:12:17 2024 -Processing 12557761 variants... Sun Feb 4 01:12:17 2024 -Significance threshold : 5e-08 Sun Feb 4 01:12:17 2024 -Sliding window size: 500 kb Sun Feb 4 01:12:20 2024 -Using P for extracting lead variants... Sun Feb 4 01:12:20 2024 -Found 9461 significant variants in total... Sun Feb 4 01:12:20 2024 -Identified 89 lead variants! Sun Feb 4 01:12:20 2024 -Annotating variants using references:ensembl Sun Feb 4 01:12:20 2024 -Annotating variants using references based on genome build:19 Sun Feb 4 01:12:20 2024 Start to annotate variants with nearest gene name(s)... Sun Feb 4 01:12:20 2024 -Assigning Gene name using ensembl_hg19_gtf for protein coding genes Sun Feb 4 01:12:23 2024 Finished annotating variants with nearest gene name(s) successfully! Sun Feb 4 01:12:23 2024 Finished extracting lead variants.
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | LOCATION | GENE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
96739 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.7550 | 0.0621 | 0.0103 | 1.629000e-09 | 191764 | ++++ | 9960099 | 0 | USP48 |
213860 | 1:51103268_T_C | 1 | 51103268 | C | T | 0.7953 | -0.0802 | 0.0120 | 2.519000e-11 | 191764 | ---- | 9960099 | 0 | FAF1 |
534095 | 1:154309595_TA_T | 1 | 154309595 | TA | T | 0.0947 | -0.0915 | 0.0166 | 3.289000e-08 | 191764 | ---- | 9960399 | 0 | ATP8B2 |
969974 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | 0.9006 | -0.0946 | 0.0150 | 2.665000e-10 | 191764 | ---- | 9960399 | 26349 | TMEM18 |
1091807 | 2:27734972_G_A | 2 | 27734972 | G | A | 0.5605 | 0.0691 | 0.0088 | 3.897000e-15 | 191764 | ++++ | 9960099 | 0 | GCKR |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12272930 | X:21569920_A_G | 23 | 21569920 | G | A | 0.3190 | 0.0423 | 0.0076 | 2.616000e-08 | 191764 | ++++ | 9960099 | 0 | CNKSR2 |
12341406 | X:48724648_CAA_C | 23 | 48724648 | C | CAA | 0.6260 | -0.0602 | 0.0103 | 4.576000e-09 | 191764 | ---- | 9960399 | 26082 | TIMM17B |
12350767 | X:57170781_A_AT | 23 | 57170781 | AT | A | 0.3003 | -0.0447 | 0.0076 | 4.583000e-09 | 191764 | ---- | 9960399 | -6723 | SPIN2A |
12469290 | X:117915163_T_TA | 23 | 117915163 | TA | T | 0.5560 | 0.0548 | 0.0071 | 9.818000e-15 | 191764 | ++++ | 9960399 | 0 | IL13RA1 |
12554976 | X:152908887_G_A | 23 | 152908887 | G | A | 0.6792 | -0.1235 | 0.0077 | 9.197000e-58 | 191764 | ---- | 9960099 | 0 | DUSP9 |
89 rows × 14 columns
We extracted a total of 89 lead variants with a sliding window size of 500kb!
For other options, see: https://cloufield.github.io/gwaslab/ExtractLead/
Use the SNPID to create some highly customized mqq plot¶
GWASLab can create much more complicated Manhattan plots.
For example,
- annotate the lead variants with closest gene names (threshold for annotation p<1e-20)
- annotate selected variants with user-provided texts
- pinpoint some variants
- highlight some loci
- MAF-stratified Q-Q plot
- save as my_first_mqq_plot.png with {"dpi":400,"facecolor":"white"}
mysumstats.plot_mqq(mode="mqq",
build="19",
cut=20,
skip=2,
anno="GENENAME",
sig_level_lead=1e-20,
anno_alias={"9:22132729_A_G":"some annotations"},
anno_style="expand",
xpad=0.01,
pinpoint=["9:22132729_A_G","5:176513896_C_A"],
pinpoint_color="green",
highlight=["7:127253550_C_T","19:46166604_C_T"],
highlight_windowkb =1000,
stratified=True,
jagged=True,
marker_size=(5,5),
figargs={"figsize":(15,5),"dpi":300},
save="my_first_mqq_plot.png",
save_args={"dpi":400,"facecolor":"white"})
Sun Feb 4 01:12:24 2024 Start to create MQQ plot...v3.4.38: Sun Feb 4 01:12:24 2024 -Genomic coordinates version: 19... Sun Feb 4 01:12:24 2024 -Genome-wide significance level to plot is set to 5e-08 ... Sun Feb 4 01:12:24 2024 -Raw input contains 12557761 variants... Sun Feb 4 01:12:24 2024 -MQQ plot layout mode is : mqq Sun Feb 4 01:12:24 2024 -Loci to highlight (#CB132D): 7:127253550_C_T,19:46166604_C_T Sun Feb 4 01:12:24 2024 -highlight_windowkb is set to: 1000 kb Sun Feb 4 01:12:24 2024 -Variants to pinpoint (green) : 9:22132729_A_G,5:176513896_C_A Sun Feb 4 01:12:27 2024 Finished loading specified columns from the sumstats. Sun Feb 4 01:12:27 2024 Start data conversion and sanity check: Sun Feb 4 01:12:27 2024 -Removed 0 variants with nan in CHR or POS column ... Sun Feb 4 01:12:52 2024 -Removed 0 variants with CHR <=0... Sun Feb 4 01:12:53 2024 -Removed 0 variants with nan in EAF column ... Sun Feb 4 01:12:53 2024 -Removed 0 variants with nan in P column ... Sun Feb 4 01:12:55 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sun Feb 4 01:12:56 2024 -Sumstats P values are being converted to -log10(P)... Sun Feb 4 01:12:57 2024 -Sanity check: 0 na/inf/-inf variants will be removed... Sun Feb 4 01:12:59 2024 -Converting data above cut line... Sun Feb 4 01:12:59 2024 -Maximum -log10(P) value is 167.58838029403677 . Sun Feb 4 01:12:59 2024 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Sun Feb 4 01:12:59 2024 Finished data conversion and sanity check. Sun Feb 4 01:12:59 2024 Start to create MQQ plot with 332882 variants... Sun Feb 4 01:12:59 2024 -Creating background plot... Sun Feb 4 01:13:00 2024 -Highlighting target loci... Sun Feb 4 01:13:00 2024 -Pinpointing target vairants... Sun Feb 4 01:13:00 2024 Finished creating MQQ plot successfully! Sun Feb 4 01:13:00 2024 Start to extract variants for annotation... Sun Feb 4 01:13:00 2024 -Found 16 significant variants with a sliding window size of 500 kb... Sun Feb 4 01:13:00 2024 Start to annotate variants with nearest gene name(s)... Sun Feb 4 01:13:00 2024 -Assigning Gene name using ensembl_hg19_gtf for protein coding genes Sun Feb 4 01:13:00 2024 Finished annotating variants with nearest gene name(s) successfully! Sun Feb 4 01:13:00 2024 Finished extracting variants for annotation... Sun Feb 4 01:13:00 2024 Start to process figure arts. Sun Feb 4 01:13:00 2024 -Processing X ticks... Sun Feb 4 01:13:00 2024 -Processing X labels... Sun Feb 4 01:13:00 2024 -Processing Y labels... Sun Feb 4 01:13:00 2024 -Processing Y tick lables... Sun Feb 4 01:13:00 2024 -Processing Y labels... Sun Feb 4 01:13:00 2024 -Processing lines... Sun Feb 4 01:13:00 2024 Finished processing figure arts. Sun Feb 4 01:13:00 2024 Start to annotate variants... Sun Feb 4 01:13:00 2024 -Annotating using column GENENAME... Sun Feb 4 01:13:00 2024 -Adjusting text positions with repel_force=0.03... Sun Feb 4 01:13:00 2024 Finished annotating variants. Sun Feb 4 01:13:00 2024 Start to create QQ plot with 332882 variants: Sun Feb 4 01:13:00 2024 -Plotting variants stratified by MAF... Sun Feb 4 01:13:02 2024 -Lambda GC (MLOG10P mode) at 0.5 is 1.21283 Sun Feb 4 01:13:02 2024 -Processing Y tick lables... Sun Feb 4 01:13:02 2024 Finished creating QQ plot successfully! Sun Feb 4 01:13:02 2024 -Processing jagged Y axis... Sun Feb 4 01:13:02 2024 -Processing jagged Y axis... Sun Feb 4 01:13:02 2024 Start to save figure... Sun Feb 4 01:13:07 2024 -Saved to my_first_mqq_plot.png successfully! (overwrite) Sun Feb 4 01:13:07 2024 Finished saving figure... Sun Feb 4 01:13:07 2024 Finished creating plot successfully!
(<Figure size 4500x1500 with 2 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)
Quick regional plot without LD-information¶
gwaslab can also plot regional plots with or with out LD reference files.
For details, see: https://cloufield.github.io/gwaslab/RegionalPlot/
We first create a regional plot without references by specifying mode
and region
.
mysumstats.plot_mqq(mode="r",build="19",skip=2,cut=20, region=(7,126253550,128253550),region_grid=True)
Sun Feb 4 01:13:14 2024 Start to create MQQ plot...v3.4.38: Sun Feb 4 01:13:14 2024 -Genomic coordinates version: 19... Sun Feb 4 01:13:14 2024 -Genome-wide significance level to plot is set to 5e-08 ... Sun Feb 4 01:13:14 2024 -Raw input contains 12557761 variants... Sun Feb 4 01:13:14 2024 -MQQ plot layout mode is : r Sun Feb 4 01:13:14 2024 -Region to plot : chr7:126253550-128253550. Sun Feb 4 01:13:16 2024 -Extract SNPs in region : chr7:126253550-128253550... Sun Feb 4 01:13:40 2024 -Extract SNPs in specified regions: 8087 Sun Feb 4 01:13:40 2024 Finished loading specified columns from the sumstats. Sun Feb 4 01:13:40 2024 Start data conversion and sanity check: Sun Feb 4 01:13:40 2024 -Removed 0 variants with nan in CHR or POS column ... Sun Feb 4 01:13:40 2024 -Removed 0 variants with CHR <=0... Sun Feb 4 01:13:40 2024 -Removed 0 variants with nan in P column ... Sun Feb 4 01:13:40 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sun Feb 4 01:13:40 2024 -Sumstats P values are being converted to -log10(P)... Sun Feb 4 01:13:40 2024 -Sanity check: 0 na/inf/-inf variants will be removed... Sun Feb 4 01:13:40 2024 -Converting data above cut line... Sun Feb 4 01:13:40 2024 -Maximum -log10(P) value is 73.38711023071251 . Sun Feb 4 01:13:40 2024 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Sun Feb 4 01:13:40 2024 Finished data conversion and sanity check. Sun Feb 4 01:13:40 2024 Start to create MQQ plot with 1600 variants... Sun Feb 4 01:13:40 2024 -Creating background plot... Sun Feb 4 01:13:40 2024 -Extracting lead variant... Sun Feb 4 01:13:40 2024 -Loading gtf files from:default
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']
Sun Feb 4 01:14:18 2024 -plotting gene track.. Sun Feb 4 01:14:18 2024 -Finished plotting gene track.. Sun Feb 4 01:14:19 2024 Finished creating MQQ plot successfully! Sun Feb 4 01:14:19 2024 Start to extract variants for annotation... Sun Feb 4 01:14:19 2024 -Found 1 significant variants with a sliding window size of 500 kb... Sun Feb 4 01:14:19 2024 Finished extracting variants for annotation... Sun Feb 4 01:14:19 2024 Start to process figure arts. Sun Feb 4 01:14:19 2024 -Processing X labels... Sun Feb 4 01:14:19 2024 -Processing Y labels... Sun Feb 4 01:14:19 2024 -Processing Y tick lables... Sun Feb 4 01:14:19 2024 -Processing Y labels... Sun Feb 4 01:14:19 2024 -Processing lines... Sun Feb 4 01:14:19 2024 Finished processing figure arts. Sun Feb 4 01:14:19 2024 Start to annotate variants... Sun Feb 4 01:14:19 2024 -Skip annotating Sun Feb 4 01:14:19 2024 Finished annotating variants. Sun Feb 4 01:14:19 2024 Start to save figure... Sun Feb 4 01:14:19 2024 -Skip saving figure! Sun Feb 4 01:14:19 2024 Finished saving figure... Sun Feb 4 01:14:19 2024 Finished creating plot successfully!
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.g_Log.Log at 0x7fe713e8edc0>)
Reference file downloading¶
Full regional plot will require user-provided vcf or preprocessed vcf files: (e.g 1000 Genomes project, see Reference: https://cloufield.github.io/gwaslab/Reference/)
GWASLab also provides pre-processed 1KG datasets.
Check available reference from gwaslab¶
Update the available reference list first if needed
# gl.update_available_ref()
gl.check_available_ref()
Sun Feb 4 01:14:20 2024 Start to check available reference files... Sun Feb 4 01:14:20 2024 - 1kg_eas_hg19 : https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_hg19_md5 : c8c97434843c0da3113fc06879ead472 Sun Feb 4 01:14:20 2024 - 1kg_eas_hg19_tbi : https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_hg19 : https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_hg19_md5 : 734069d895009d38c2f962bfbb6fab52 Sun Feb 4 01:14:20 2024 - 1kg_eur_hg19_tbi : https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_hg38 : https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_hg38_md5 : f45e80bca9ef7b29e6b1832e6ac15375 Sun Feb 4 01:14:20 2024 - 1kg_eas_hg38_tbi : https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_hg38 : https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_hg38_md5 : 228d3285fa99132cc6321e2925e0768d Sun Feb 4 01:14:20 2024 - 1kg_eur_hg38_tbi : https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_hg19 : https://www.dropbox.com/scl/fi/fubqvuj3p4ii4y35zknv8/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=5z50f66iltjchcaszznq5bczt&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_hg19_md5 : e2d3f9e2e6580d05e877e9effd435c4e Sun Feb 4 01:14:20 2024 - 1kg_sas_hg19_tbi : https://www.dropbox.com/scl/fi/icnmrnzee7ofdpx5l96tg/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=st8t88snby26q37rqi6zh5zck&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_hg19 : https://www.dropbox.com/scl/fi/bxa4zfngsxsc38rhtiv8c/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=ibcn8hb1n8n36j3u0jfzci267&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_hg19_md5 : 68d3cdf01cbabdae6e74a07795fa881c Sun Feb 4 01:14:20 2024 - 1kg_amr_hg19_tbi : https://www.dropbox.com/scl/fi/1zk16x7h4r89jurzwu05u/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=b4cere4w38zvzyfitfge3r8n0&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_hg38 : https://www.dropbox.com/scl/fi/jr3l5zz42py3kny2bccmj/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=x0t6tsy71jxzf021wfqdn8k5q&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_hg38_md5 : e5d79bea1958aa50c23f618d342ccc83 Sun Feb 4 01:14:20 2024 - 1kg_sas_hg38_tbi : https://www.dropbox.com/scl/fi/02oia4ur5r7w9qgiuf6i9/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=00p9rxe0xzfs6hr1rg4d8oadm&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_hg38 : https://www.dropbox.com/scl/fi/4t4tyuhzp78uyb6tgkroq/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=p96gbs1tcdia31jnjv1b82kuz&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_hg38_md5 : 229fbd610001cf6f137b7f738352a44a Sun Feb 4 01:14:20 2024 - 1kg_amr_hg38_tbi : https://www.dropbox.com/scl/fi/x0dby543tr9xpaqj2i0ba/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=uj8o7j0cy0spipe174jn54sqs&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_hg19 : https://www.dropbox.com/scl/fi/tq4w9lyt5z47ym7grtrxg/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=k3bimeu3yr5loq8hohba5mr6k&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_hg19_md5 : f7b4425f39e8292dce6f13711e7f6c50 Sun Feb 4 01:14:20 2024 - 1kg_afr_hg19_tbi : https://www.dropbox.com/scl/fi/0giiptu0btwj1kfm6jdzr/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=ucb5weprsc5prcg8hvtgmruxx&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_hg19 : https://www.dropbox.com/scl/fi/6b4j9z9knmllfnbx86aw6/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=eento8vg06zyrkvooc9wd4cvu&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_hg19_md5 : fed846482204487b60d33b21ddb18106 Sun Feb 4 01:14:20 2024 - 1kg_pan_hg19_tbi : https://www.dropbox.com/scl/fi/stco946scio5tvto0ln4j/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=hfh53beb627lmqwv3d8mzqy0c&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_hg38 : https://www.dropbox.com/scl/fi/239xmm7qijtnsks97chc9/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=47en5fk1icbekpg7we3uot9g8&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_hg38_md5 : 3bb7923be0809a324d7b7633b8d58a3b Sun Feb 4 01:14:20 2024 - 1kg_afr_hg38_tbi : https://www.dropbox.com/scl/fi/3y3pg4yqwo2jaaamx1c8f/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=say0ihfwa51z3otgn4bjtze8p&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_hg38 : https://www.dropbox.com/scl/fi/nf01487smtmeq243ihfwm/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=3pefbkzxwcnejx4inynifpft7&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_hg38_md5 : 23bb86d748c4a66e85e087f647e8b60e Sun Feb 4 01:14:20 2024 - 1kg_pan_hg38_tbi : https://www.dropbox.com/scl/fi/hu7cttr4cenw5yjsm2775/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=568u7bkvkybm4wt2q9284o198&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_x_hg19 : https://www.dropbox.com/scl/fi/1inmw09rk35ncuq7tibmp/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=vcjpukgsb7gt4tizg1fvr7tr2&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_x_hg19_md5 : b2aced2a1522ed23818989b3153b7e91 Sun Feb 4 01:14:20 2024 - 1kg_eas_x_hg19_tbi : https://www.dropbox.com/scl/fi/uyxb9lfi88dqjp5l3vrzf/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=vt196d16h690dmzox5jyc33xx&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_x_hg19 : https://www.dropbox.com/scl/fi/6r4sc2yax8pk644piew2d/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=l5towjhyl733nrd1msjr1d8gl&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_x_hg19_md5 : 6380cb71eafe985d7b894029e979139b Sun Feb 4 01:14:20 2024 - 1kg_eur_x_hg19_tbi : https://www.dropbox.com/scl/fi/yuid87x398yc9n8nc4bb1/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=01skm13sk6099y34zy6qvweqj&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_x_hg38 : https://www.dropbox.com/scl/fi/2m6i93vv1ooano0muukck/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=y6087mmt9kmls066mzobjeqp7&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eas_x_hg38_md5 : 8c6a35da51621f952a5b97cbcc832046 Sun Feb 4 01:14:20 2024 - 1kg_eas_x_hg38_tbi : https://www.dropbox.com/scl/fi/l6jpt86edarb4emehxwcy/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ddr1fcijb1bh2nso0q0updolh&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_x_hg38 : https://www.dropbox.com/scl/fi/ceoff4p95ftef6yldhl67/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=yyt9u11dk6kyha0cvturvrtvf&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_eur_x_hg38_md5 : b9a4b8553dec202109f72281f33cb454 Sun Feb 4 01:14:20 2024 - 1kg_eur_x_hg38_tbi : https://www.dropbox.com/scl/fi/tux32myi6g18bx7nd1rdq/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=m4f0v3rblnzv7dj0lo6hsd0hb&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_x_hg19 : https://www.dropbox.com/scl/fi/592lbmrkfjn80twnvlt2q/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=zrar7nmltpsuznlyuqw9k77q6&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_x_hg19_md5 : f4f370274fe586d209ca6fddc4eceaaf Sun Feb 4 01:14:20 2024 - 1kg_sas_x_hg19_tbi : https://www.dropbox.com/scl/fi/rmaybun6v248nmjmaz2tj/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=izv96vfajgdd5wsyuvx98ntvk&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_x_hg19 : https://www.dropbox.com/scl/fi/gwryyxs0ilgoazqvp39be/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=z46we3kshi9t96x7issl36bda&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_x_hg19_md5 : ead838f7059a80118e949959cf1a3ff3 Sun Feb 4 01:14:20 2024 - 1kg_amr_x_hg19_tbi : https://www.dropbox.com/scl/fi/r6g5893smgmnsir5r0v5n/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=sel2f4p7ctggf30udrhcy0psu&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_x_hg38 : https://www.dropbox.com/scl/fi/r6qa9l6h9rc5rvenjsdqo/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=mucn2zizrlkebn1e5q7rtzu8e&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_sas_x_hg38_md5 : 31c60999ebb9a13d17d21e02fd9d1f4c Sun Feb 4 01:14:20 2024 - 1kg_sas_x_hg38_tbi : https://www.dropbox.com/scl/fi/5ktxs24sq6v8hvaowr4xv/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=5y2z66b3s6bgzoikdjvr46ccw&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_x_hg38 : https://www.dropbox.com/scl/fi/5brd1nh7u20oigtb17mkb/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=zt0d0nqmlq3u5uu6ukwvof6ta&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_amr_x_hg38_md5 : bc7de683d603c8bbff02f5bec8d3469a Sun Feb 4 01:14:20 2024 - 1kg_amr_x_hg38_tbi : https://www.dropbox.com/scl/fi/8bz6uwjgw8bj16uaz6kye/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=pkc2mepgosxdijpfru4vhum5v&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_x_hg19 : https://www.dropbox.com/scl/fi/kz5j4532pyaigceww0vg5/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=sti0g2ri004chu8b12pqgicvw&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_x_hg19_md5 : 77807e5e315a6e47504c175b0aaece88 Sun Feb 4 01:14:20 2024 - 1kg_afr_x_hg19_tbi : https://www.dropbox.com/scl/fi/oadjwamy5pe1ilv2237gq/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=a10bhxrfa904dasrcuc40njq4&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_x_hg19 : https://www.dropbox.com/scl/fi/rwov9vszj8rx78u65dxnk/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=ej33zb9ulwdfseur1surz653z&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_x_hg19_md5 : 389d474984ff82df79efd25c0dd66fc9 Sun Feb 4 01:14:20 2024 - 1kg_pan_x_hg19_tbi : https://www.dropbox.com/scl/fi/x0n1htmkbulqybr5cc2cb/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=8rga64u5gm9vp9whqwlj37hre&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_x_hg38 : https://www.dropbox.com/scl/fi/ef8h09lhg8vmdxfxsayv4/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=96xjxu546sbbq5hbgaihh84l8&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_afr_x_hg38_md5 : b1410bb21e389a0f08fc2741d33fcc51 Sun Feb 4 01:14:20 2024 - 1kg_afr_x_hg38_tbi : https://www.dropbox.com/scl/fi/tlwhui80cy32mtx6hc8f6/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ana497w8is840ygarx3w1v5pc&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_x_hg38 : https://www.dropbox.com/scl/fi/tf7j5540jyzxz2oo7jtho/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=e872ihg5477mlu30vet1e5lvk&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_pan_x_hg38_md5 : 8ae424786a6bfe64c92ca6b9f96ee5e6 Sun Feb 4 01:14:20 2024 - 1kg_pan_x_hg38_tbi : https://www.dropbox.com/scl/fi/u28zo2sjbcmtfs69zqaya/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=t5psxvoewd1oog2hwzellc78p&dl=1 Sun Feb 4 01:14:20 2024 - dbsnp_v151_hg19 : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz Sun Feb 4 01:14:20 2024 - dbsnp_v151_hg19_tbi : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz.tbi Sun Feb 4 01:14:20 2024 - dbsnp_v151_hg38 : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz Sun Feb 4 01:14:20 2024 - dbsnp_v151_hg38_tbi : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi Sun Feb 4 01:14:20 2024 - dbsnp_v156_hg19 : https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz Sun Feb 4 01:14:20 2024 - dbsnp_v156_hg19_tbi : https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz.tbi Sun Feb 4 01:14:20 2024 - dbsnp_v156_hg38 : https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz Sun Feb 4 01:14:20 2024 - dbsnp_v156_hg38_tbi : https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi Sun Feb 4 01:14:20 2024 - ucsc_genome_hg19 : http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz Sun Feb 4 01:14:20 2024 - ucsc_genome_hg38 : https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg19_auto : https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg19_auto_md5 : 7d1e7624fb6e4df7a2f6f05558d436b4 Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg38_auto : https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1 Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg38_auto_md5 : 4c7ef2d2415c18c286219e970fdda972 Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg19_x : https://www.dropbox.com/scl/fi/ghwq2yh5bi7o411m1mw15/1kg_dbsnp151_hg19_X.txt.gz?rlkey=50du8e42qjdgzge0lcdiu7tv2&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg19_x_md5 : cbf0e3518ab73d6d8a96bab9d55c094d Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg38_x : https://www.dropbox.com/scl/fi/bqdrfh0dx561ir210tu92/1kg_dbsnp151_hg38_X.txt.gz?rlkey=jjetkirbflt02f3w8mrxdqa4g&dl=1 Sun Feb 4 01:14:20 2024 - 1kg_dbsnp151_hg38_x_md5 : 48c05eeb1454c0dd4cbee3cb26382e8e Sun Feb 4 01:14:20 2024 - recombination_hg19 : https://www.dropbox.com/s/wbesl8haxknonuc/recombination_hg19.tar.gz?dl=1 Sun Feb 4 01:14:20 2024 - recombination_hg38 : https://www.dropbox.com/s/vuo8mvqx0fpibzj/recombination_hg38.tar.gz?dl=1 Sun Feb 4 01:14:20 2024 - ensembl_hg19_gtf : https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz Sun Feb 4 01:14:20 2024 - ensembl_hg38_gtf : https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens//Homo_sapiens.GRCh38.109.chr.gtf.gz Sun Feb 4 01:14:20 2024 - refseq_hg19_gtf : https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz Sun Feb 4 01:14:20 2024 - refseq_hg38_gtf : https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz Sun Feb 4 01:14:20 2024 - testlink : https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1 Sun Feb 4 01:14:20 2024 - testlink_tbi : https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1
{'1kg_eas_hg19': 'https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1', '1kg_eas_hg19_md5': 'c8c97434843c0da3113fc06879ead472', '1kg_eas_hg19_tbi': 'https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1', '1kg_eur_hg19': 'https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1', '1kg_eur_hg19_md5': '734069d895009d38c2f962bfbb6fab52', '1kg_eur_hg19_tbi': 'https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1', '1kg_eas_hg38': 'https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1', '1kg_eas_hg38_md5': 'f45e80bca9ef7b29e6b1832e6ac15375', '1kg_eas_hg38_tbi': 'https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1', '1kg_eur_hg38': 'https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1', '1kg_eur_hg38_md5': '228d3285fa99132cc6321e2925e0768d', '1kg_eur_hg38_tbi': 'https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1', '1kg_sas_hg19': 'https://www.dropbox.com/scl/fi/fubqvuj3p4ii4y35zknv8/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=5z50f66iltjchcaszznq5bczt&dl=1', '1kg_sas_hg19_md5': 'e2d3f9e2e6580d05e877e9effd435c4e', '1kg_sas_hg19_tbi': 'https://www.dropbox.com/scl/fi/icnmrnzee7ofdpx5l96tg/SAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=st8t88snby26q37rqi6zh5zck&dl=1', '1kg_amr_hg19': 'https://www.dropbox.com/scl/fi/bxa4zfngsxsc38rhtiv8c/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=ibcn8hb1n8n36j3u0jfzci267&dl=1', '1kg_amr_hg19_md5': '68d3cdf01cbabdae6e74a07795fa881c', '1kg_amr_hg19_tbi': 'https://www.dropbox.com/scl/fi/1zk16x7h4r89jurzwu05u/AMR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=b4cere4w38zvzyfitfge3r8n0&dl=1', '1kg_sas_hg38': 'https://www.dropbox.com/scl/fi/jr3l5zz42py3kny2bccmj/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=x0t6tsy71jxzf021wfqdn8k5q&dl=1', '1kg_sas_hg38_md5': 'e5d79bea1958aa50c23f618d342ccc83', '1kg_sas_hg38_tbi': 'https://www.dropbox.com/scl/fi/02oia4ur5r7w9qgiuf6i9/SAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=00p9rxe0xzfs6hr1rg4d8oadm&dl=1', '1kg_amr_hg38': 'https://www.dropbox.com/scl/fi/4t4tyuhzp78uyb6tgkroq/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=p96gbs1tcdia31jnjv1b82kuz&dl=1', '1kg_amr_hg38_md5': '229fbd610001cf6f137b7f738352a44a', '1kg_amr_hg38_tbi': 'https://www.dropbox.com/scl/fi/x0dby543tr9xpaqj2i0ba/AMR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=uj8o7j0cy0spipe174jn54sqs&dl=1', '1kg_afr_hg19': 'https://www.dropbox.com/scl/fi/tq4w9lyt5z47ym7grtrxg/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=k3bimeu3yr5loq8hohba5mr6k&dl=1', '1kg_afr_hg19_md5': 'f7b4425f39e8292dce6f13711e7f6c50', '1kg_afr_hg19_tbi': 'https://www.dropbox.com/scl/fi/0giiptu0btwj1kfm6jdzr/AFR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=ucb5weprsc5prcg8hvtgmruxx&dl=1', '1kg_pan_hg19': 'https://www.dropbox.com/scl/fi/6b4j9z9knmllfnbx86aw6/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?rlkey=eento8vg06zyrkvooc9wd4cvu&dl=1', '1kg_pan_hg19_md5': 'fed846482204487b60d33b21ddb18106', '1kg_pan_hg19_tbi': 'https://www.dropbox.com/scl/fi/stco946scio5tvto0ln4j/PAN.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?rlkey=hfh53beb627lmqwv3d8mzqy0c&dl=1', '1kg_afr_hg38': 'https://www.dropbox.com/scl/fi/239xmm7qijtnsks97chc9/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=47en5fk1icbekpg7we3uot9g8&dl=1', '1kg_afr_hg38_md5': '3bb7923be0809a324d7b7633b8d58a3b', '1kg_afr_hg38_tbi': 'https://www.dropbox.com/scl/fi/3y3pg4yqwo2jaaamx1c8f/AFR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=say0ihfwa51z3otgn4bjtze8p&dl=1', '1kg_pan_hg38': 'https://www.dropbox.com/scl/fi/nf01487smtmeq243ihfwm/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=3pefbkzxwcnejx4inynifpft7&dl=1', '1kg_pan_hg38_md5': '23bb86d748c4a66e85e087f647e8b60e', '1kg_pan_hg38_tbi': 'https://www.dropbox.com/scl/fi/hu7cttr4cenw5yjsm2775/PAN.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=568u7bkvkybm4wt2q9284o198&dl=1', '1kg_eas_x_hg19': 'https://www.dropbox.com/scl/fi/1inmw09rk35ncuq7tibmp/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=vcjpukgsb7gt4tizg1fvr7tr2&dl=1', '1kg_eas_x_hg19_md5': 'b2aced2a1522ed23818989b3153b7e91', '1kg_eas_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/uyxb9lfi88dqjp5l3vrzf/EAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=vt196d16h690dmzox5jyc33xx&dl=1', '1kg_eur_x_hg19': 'https://www.dropbox.com/scl/fi/6r4sc2yax8pk644piew2d/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=l5towjhyl733nrd1msjr1d8gl&dl=1', '1kg_eur_x_hg19_md5': '6380cb71eafe985d7b894029e979139b', '1kg_eur_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/yuid87x398yc9n8nc4bb1/EUR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=01skm13sk6099y34zy6qvweqj&dl=1', '1kg_eas_x_hg38': 'https://www.dropbox.com/scl/fi/2m6i93vv1ooano0muukck/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=y6087mmt9kmls066mzobjeqp7&dl=1', '1kg_eas_x_hg38_md5': '8c6a35da51621f952a5b97cbcc832046', '1kg_eas_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/l6jpt86edarb4emehxwcy/EAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ddr1fcijb1bh2nso0q0updolh&dl=1', '1kg_eur_x_hg38': 'https://www.dropbox.com/scl/fi/ceoff4p95ftef6yldhl67/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=yyt9u11dk6kyha0cvturvrtvf&dl=1', '1kg_eur_x_hg38_md5': 'b9a4b8553dec202109f72281f33cb454', '1kg_eur_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/tux32myi6g18bx7nd1rdq/EUR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=m4f0v3rblnzv7dj0lo6hsd0hb&dl=1', '1kg_sas_x_hg19': 'https://www.dropbox.com/scl/fi/592lbmrkfjn80twnvlt2q/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=zrar7nmltpsuznlyuqw9k77q6&dl=1', '1kg_sas_x_hg19_md5': 'f4f370274fe586d209ca6fddc4eceaaf', '1kg_sas_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/rmaybun6v248nmjmaz2tj/SAS.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=izv96vfajgdd5wsyuvx98ntvk&dl=1', '1kg_amr_x_hg19': 'https://www.dropbox.com/scl/fi/gwryyxs0ilgoazqvp39be/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=z46we3kshi9t96x7issl36bda&dl=1', '1kg_amr_x_hg19_md5': 'ead838f7059a80118e949959cf1a3ff3', '1kg_amr_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/r6g5893smgmnsir5r0v5n/AMR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=sel2f4p7ctggf30udrhcy0psu&dl=1', '1kg_sas_x_hg38': 'https://www.dropbox.com/scl/fi/r6qa9l6h9rc5rvenjsdqo/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=mucn2zizrlkebn1e5q7rtzu8e&dl=1', '1kg_sas_x_hg38_md5': '31c60999ebb9a13d17d21e02fd9d1f4c', '1kg_sas_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/5ktxs24sq6v8hvaowr4xv/SAS.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=5y2z66b3s6bgzoikdjvr46ccw&dl=1', '1kg_amr_x_hg38': 'https://www.dropbox.com/scl/fi/5brd1nh7u20oigtb17mkb/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=zt0d0nqmlq3u5uu6ukwvof6ta&dl=1', '1kg_amr_x_hg38_md5': 'bc7de683d603c8bbff02f5bec8d3469a', '1kg_amr_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/8bz6uwjgw8bj16uaz6kye/AMR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=pkc2mepgosxdijpfru4vhum5v&dl=1', '1kg_afr_x_hg19': 'https://www.dropbox.com/scl/fi/kz5j4532pyaigceww0vg5/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=sti0g2ri004chu8b12pqgicvw&dl=1', '1kg_afr_x_hg19_md5': '77807e5e315a6e47504c175b0aaece88', '1kg_afr_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/oadjwamy5pe1ilv2237gq/AFR.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=a10bhxrfa904dasrcuc40njq4&dl=1', '1kg_pan_x_hg19': 'https://www.dropbox.com/scl/fi/rwov9vszj8rx78u65dxnk/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz?rlkey=ej33zb9ulwdfseur1surz653z&dl=1', '1kg_pan_x_hg19_md5': '389d474984ff82df79efd25c0dd66fc9', '1kg_pan_x_hg19_tbi': 'https://www.dropbox.com/scl/fi/x0n1htmkbulqybr5cc2cb/PAN.chrX.split_norm_af.1kgp3v5.vcf.gz.tbi?rlkey=8rga64u5gm9vp9whqwlj37hre&dl=1', '1kg_afr_x_hg38': 'https://www.dropbox.com/scl/fi/ef8h09lhg8vmdxfxsayv4/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=96xjxu546sbbq5hbgaihh84l8&dl=1', '1kg_afr_x_hg38_md5': 'b1410bb21e389a0f08fc2741d33fcc51', '1kg_afr_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/tlwhui80cy32mtx6hc8f6/AFR.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=ana497w8is840ygarx3w1v5pc&dl=1', '1kg_pan_x_hg38': 'https://www.dropbox.com/scl/fi/tf7j5540jyzxz2oo7jtho/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz?rlkey=e872ihg5477mlu30vet1e5lvk&dl=1', '1kg_pan_x_hg38_md5': '8ae424786a6bfe64c92ca6b9f96ee5e6', '1kg_pan_x_hg38_tbi': 'https://www.dropbox.com/scl/fi/u28zo2sjbcmtfs69zqaya/PAN.chrX.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?rlkey=t5psxvoewd1oog2hwzellc78p&dl=1', 'dbsnp_v151_hg19': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz', 'dbsnp_v151_hg19_tbi': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz.tbi', 'dbsnp_v151_hg38': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz', 'dbsnp_v151_hg38_tbi': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi', 'dbsnp_v156_hg19': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz', 'dbsnp_v156_hg19_tbi': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.25.gz.tbi', 'dbsnp_v156_hg38': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz', 'dbsnp_v156_hg38_tbi': 'https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi', 'ucsc_genome_hg19': 'http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz', 'ucsc_genome_hg38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz', '1kg_dbsnp151_hg19_auto': 'https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1', '1kg_dbsnp151_hg19_auto_md5': '7d1e7624fb6e4df7a2f6f05558d436b4', '1kg_dbsnp151_hg38_auto': 'https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1', '1kg_dbsnp151_hg38_auto_md5': '4c7ef2d2415c18c286219e970fdda972', '1kg_dbsnp151_hg19_x': 'https://www.dropbox.com/scl/fi/ghwq2yh5bi7o411m1mw15/1kg_dbsnp151_hg19_X.txt.gz?rlkey=50du8e42qjdgzge0lcdiu7tv2&dl=1', '1kg_dbsnp151_hg19_x_md5': 'cbf0e3518ab73d6d8a96bab9d55c094d', '1kg_dbsnp151_hg38_x': 'https://www.dropbox.com/scl/fi/bqdrfh0dx561ir210tu92/1kg_dbsnp151_hg38_X.txt.gz?rlkey=jjetkirbflt02f3w8mrxdqa4g&dl=1', '1kg_dbsnp151_hg38_x_md5': '48c05eeb1454c0dd4cbee3cb26382e8e', 'recombination_hg19': 'https://www.dropbox.com/s/wbesl8haxknonuc/recombination_hg19.tar.gz?dl=1', 'recombination_hg38': 'https://www.dropbox.com/s/vuo8mvqx0fpibzj/recombination_hg38.tar.gz?dl=1', 'ensembl_hg19_gtf': 'https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz', 'ensembl_hg38_gtf': 'https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens//Homo_sapiens.GRCh38.109.chr.gtf.gz', 'refseq_hg19_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz', 'refseq_hg38_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz', 'testlink': 'https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1', 'testlink_tbi': 'https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1'}
You can see the current available reference files (from the original source or pre-processed by gwaslab).
Download reference using gwaslab¶
Select the keyword and use download_ref
to download the files. The downloaded files will be placed in ~/.gwasalb
by default.
1kg_eas_hg19
: processed 1000 Genomes Project EAS samples dataset(hg19; ~2.8GB) It may take several minutes to download.
Default directory for reference files downloaded using GWASLab : ~/.gwaslab
# ~2.8GB
gl.download_ref("1kg_eas_hg19")
Tue Apr 25 14:13:51 2023 Start to download 1kg_eas_hg19 ... Tue Apr 25 14:13:51 2023 -Downloading to: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Tue Apr 25 14:15:41 2023 -File path: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Tue Apr 25 14:15:41 2023 -MD5 check: c8c97434843c0da3113fc06879ead472 Tue Apr 25 14:15:41 2023 -MD5 verified. Tue Apr 25 14:15:41 2023 -Updating record in config file... Tue Apr 25 14:15:43 2023 -Updating record in config file... Tue Apr 25 14:15:43 2023 -Downloading to: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi Tue Apr 25 14:15:43 2023 Downloaded 1kg_eas_hg19 successfully!
gl.check_downloaded_ref()
Tue Apr 25 14:15:43 2023 Start to check downloaded reference files... Tue Apr 25 14:15:43 2023 -Checking the config file:/home/yunye/work/gwaslab/src/gwaslab/data/config.json Tue Apr 25 14:15:43 2023 -Config file exists. Tue Apr 25 14:15:43 2023 -Updating config.json... Tue Apr 25 14:15:43 2023 - ensembl_hg19_gtf : /home/yunye/.gwaslab/Homo_sapiens.GRCh37.87.chr.gtf.gz Tue Apr 25 14:15:43 2023 - recombination_hg19 : /home/yunye/.gwaslab/recombination/hg19/recombination_hg19.tar.gz Tue Apr 25 14:15:43 2023 - 1kg_eas_hg19 : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Tue Apr 25 14:15:43 2023 - 1kg_eas_hg19_tbi : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi
{'ensembl_hg19_gtf': '/home/yunye/.gwaslab/Homo_sapiens.GRCh37.87.chr.gtf.gz', 'recombination_hg19': '/home/yunye/.gwaslab/recombination/hg19/recombination_hg19.tar.gz', '1kg_eas_hg19': '/home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz', '1kg_eas_hg19_tbi': '/home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi'}
After downloading, use get_path
to obtain the file path by specifying the keyword.
Note:
- If
tabix
is available in PATH, the speed will be will greatly improved. Otherwise, vcf files will be loaded from the head. - tabix: http://www.htslib.org/download/
mysumstats.plot_mqq(mode="r",
build="19",
region=(7,126253550,128253550),
region_grid=True,
anno=True,
anno_args={"rotation":0,"fontsize":12},
vcf_path=gl.get_path("1kg_eas_hg19"))
2024/02/15 18:55:28 Start to create MQQ plot...v3.4.40: 2024/02/15 18:55:28 -Genomic coordinates version: 19... 2024/02/15 18:55:28 -Genome-wide significance level to plot is set to 5e-08 ... 2024/02/15 18:55:28 -Raw input contains 12557761 variants... 2024/02/15 18:55:28 -MQQ plot layout mode is : r 2024/02/15 18:55:28 -Region to plot : chr7:126253550-128253550. 2024/02/15 18:55:28 -Checking prefix for chromosomes in vcf files... 2024/02/15 18:55:28 -No prefix for chromosomes in the VCF files. 2024/02/15 18:55:31 -Extract SNPs in region : chr7:126253550-128253550... 2024/02/15 18:55:32 -Extract SNPs in specified regions: 8087 2024/02/15 18:55:32 Finished loading specified columns from the sumstats. 2024/02/15 18:55:32 Start data conversion and sanity check: 2024/02/15 18:55:32 -Removed 0 variants with nan in CHR or POS column ... 2024/02/15 18:55:32 -Removed 0 variants with CHR <=0... 2024/02/15 18:55:32 -Removed 0 variants with nan in P column ... 2024/02/15 18:55:32 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... 2024/02/15 18:55:32 -Sumstats P values are being converted to -log10(P)... 2024/02/15 18:55:32 -Sanity check: 0 na/inf/-inf variants will be removed... 2024/02/15 18:55:32 -Converting data above cut line... 2024/02/15 18:55:32 -Maximum -log10(P) value is 73.38711023071251 . 2024/02/15 18:55:32 Finished data conversion and sanity check. 2024/02/15 18:55:32 Start to create MQQ plot with 8087 variants... 2024/02/15 18:55:32 -tabix will be used: /home/yunye/tools/bin/tabix 2024/02/15 18:55:32 Start to load reference genotype... 2024/02/15 18:55:32 -reference vcf path : /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz 2024/02/15 18:55:34 -Retrieving index... 2024/02/15 18:55:34 -Ref variants in the region: 54234 2024/02/15 18:55:34 -Matching variants using POS, NEA, EA ... 2024/02/15 18:55:35 -Calculating Rsq... 2024/02/15 18:55:35 Finished loading reference genotype successfully! 2024/02/15 18:55:35 -Creating background plot... 2024/02/15 18:55:35 -Extracting lead variant... 2024/02/15 18:55:35 -Loading gtf files from:default
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']
2024/02/15 18:56:04 -plotting gene track.. 2024/02/15 18:56:04 -Finished plotting gene track.. 2024/02/15 18:56:04 Finished creating MQQ plot successfully! 2024/02/15 18:56:04 Start to extract variants for annotation... 2024/02/15 18:56:04 -Found 1 significant variants with a sliding window size of 500 kb... 2024/02/15 18:56:04 Finished extracting variants for annotation... 2024/02/15 18:56:04 Start to process figure arts. 2024/02/15 18:56:04 -Processing X labels... 2024/02/15 18:56:04 -Processing Y labels... 2024/02/15 18:56:04 -Processing Y tick lables... 2024/02/15 18:56:04 -Processing Y labels... 2024/02/15 18:56:04 -Processing color bar... 2024/02/15 18:56:04 -Processing lines... 2024/02/15 18:56:04 Finished processing figure arts. 2024/02/15 18:56:04 Start to annotate variants... 2024/02/15 18:56:04 -Annotating using column CHR:POS... 2024/02/15 18:56:04 -Adjusting text positions with repel_force=0.03... 2024/02/15 18:56:04 Finished annotating variants. 2024/02/15 18:56:04 Start to save figure... 2024/02/15 18:56:04 -Skip saving figure! 2024/02/15 18:56:04 Finished saving figure... 2024/02/15 18:56:04 Finished creating plot successfully!
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.g_Log.Log at 0x7fa21cbcb880>)
Or you can provide your own vcf files for vcf_path
.
# mysumstats.plot_mqq(mode="r",
# region=(7,156538803,157538803),
# region_grid=True,
# anno=True,
# vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")
Note: GWASLab default genome build version is build="99"
(Unknown) , you can change it to build="19"
(GRCh37/hg19) or build="38"
(GRCh38/hg38) when needed. Or you can specify build when loading into gl.Sumstats(build="19")
or use mysumstats.infer_build()
.
For gene tracks, default is gtf_path="ensembl"
, you can also use gtf_path="refseq"
(NCBA RefSeq)
Sampling¶
There are more than 10 million variants in the original sumstats and it will take some time to process the entrie dataset. Let's just randomly sample 100K variants for this tutorial.
mysumstats.random_variants(n=100000,inplace=True,random_state=1234)
Sun Feb 4 01:15:53 2024 Start to randomly select variants from the sumstats... Sun Feb 4 01:15:53 2024 -Number of variants selected from the sumstats: 100000 Sun Feb 4 01:15:53 2024 -Random state (seed): 1234 Sun Feb 4 01:15:54 2024 Finished sampling...
Infer genome build¶
In case you don't know the genome build of the sumstats
For details, see: https://cloufield.github.io/gwaslab/InferBuild/
mysumstats.infer_build()
Sun Feb 4 01:15:54 2024 Start to infer genome build version using hapmap3 SNPs...v3.4.38 Sun Feb 4 01:15:54 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 33.80 MB Sun Feb 4 01:15:54 2024 Start to infer genome build version using hapmap3 SNPs... Sun Feb 4 01:15:54 2024 -Loading Hapmap3 variants data... Sun Feb 4 01:15:56 2024 -CHR:POS will be used for matching... Sun Feb 4 01:15:57 2024 -Matching variants for hg19: num_hg19 = 8671 Sun Feb 4 01:15:57 2024 -Matching variants for hg38: num_hg38 = 145 Sun Feb 4 01:15:57 2024 -Warning: please be cautious due to the limited number of variants. Sun Feb 4 01:15:57 2024 -Since num_hg19 >> num_hg38, assigning genome build hg19... Sun Feb 4 01:15:57 2024 Finished inferring genome build version using hapmap3 SNPs.
Fix SNPID¶
You may notice that the SNPID is in CHR:POS_REF_ALT
format. We want SNPID to be in a stadardized format CHR:POS:REF:ALT
, we can use fix_id for this:
For other options of standardization, see: https://cloufield.github.io/gwaslab/Standardization/
#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)
Sun Feb 4 01:15:57 2024 Start to check SNPID/rsID...v3.4.38 Sun Feb 4 01:15:57 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB Sun Feb 4 01:15:57 2024 -Checking SNPID data type... Sun Feb 4 01:15:57 2024 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _) Sun Feb 4 01:15:58 2024 -Replacing [_-] in SNPID with ":" ... Sun Feb 4 01:15:58 2024 Finished checking SNPID/rsID.
mysumstats.data
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1736173 | 2:180784768:G:T | 2 | 180784768 | G | T | 0.9978 | 0.0797 | 0.1141 | 0.4850 | 191764 | -+++ | 1960099 |
258841 | 1:61108984:G:C | 1 | 61108984 | G | C | 0.4847 | -0.0095 | 0.0087 | 0.2767 | 191764 | +--+ | 1960099 |
89268 | 1:20294687:G:A | 1 | 20294687 | G | A | 0.7936 | 0.0161 | 0.0111 | 0.1484 | 191764 | 0++- | 1960099 |
10697866 | 16:86560264:T:C | 16 | 86560264 | C | T | 0.3114 | -0.0016 | 0.0095 | 0.8663 | 191764 | --+- | 1960099 |
6068219 | 7:152311067:G:A | 7 | 152311067 | G | A | 0.9977 | 0.1369 | 0.1433 | 0.3395 | 166718 | +?++ | 1960099 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3376277 | 4:105613391:C:T | 4 | 105613391 | C | T | 0.9966 | 0.1211 | 0.0939 | 0.1974 | 191764 | ++++ | 1960099 |
5598990 | 7:38184549:A:T | 7 | 38184549 | T | A | 0.6533 | -0.0000 | 0.0095 | 0.9982 | 191764 | +-+- | 1960099 |
7580034 | 10:58897803:A:G | 10 | 58897803 | G | A | 0.0096 | -0.0339 | 0.0623 | 0.5860 | 191764 | ++-- | 1960099 |
10546104 | 16:58355232:G:A | 16 | 58355232 | G | A | 0.6904 | 0.0083 | 0.0108 | 0.4401 | 191764 | +-++ | 1960099 |
9937711 | 14:94630415:G:A | 14 | 94630415 | G | A | 0.7970 | 0.0174 | 0.0113 | 0.1221 | 191764 | ++++ | 1960099 |
100000 rows × 12 columns
Harmonization¶
gwaslab can harmonize the sumstats based on reference files.
- ref_seq : reference genome fasta file for allele alignment
- ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
- ref_alt_freq : field in INFO of vcf file for alternative allele frequency
For details see: https://cloufield.github.io/gwaslab/Harmonization/
For reference data, see: https://cloufield.github.io/gwaslab/Reference/
When calling .harmonize()
, .basic_check()
will be called first to make sure the dataset is ready for harmonization. Since we have already performed basic_check, we set basic_check=False
here. For ref_infer
, we pass the downloaded vcf files for 1KG EAS and specify the field for alternative allele frequency to AF by ref_alt_freq= "AF"
.
We will use hg19 reference genome FASTA file from UCSC.
gl.download_ref("ucsc_genome_hg19")
Sun Feb 4 01:16:03 2024 Start to download ucsc_genome_hg19 ... Sun Feb 4 01:16:03 2024 -Downloading to: /home/yunye/.gwaslab/hg19.fa.gz Sun Feb 4 01:16:03 2024 -File /home/yunye/.gwaslab/hg19.fa.gz exists. Sun Feb 4 01:16:03 2024 -Updating record in config file... Sun Feb 4 01:16:03 2024 -gunzip : /home/yunye/.gwaslab/hg19.fa.gz Sun Feb 4 01:16:18 2024 -Updating record in config file... Sun Feb 4 01:16:18 2024 Downloaded ucsc_genome_hg19 successfully!
mysumstats.harmonize(basic_check=False,
n_cores=3,
ref_seq = gl.get_path("ucsc_genome_hg19"),
ref_infer = gl.get_path("1kg_eas_hg19"),
ref_alt_freq= "AF")
Sun Feb 4 01:16:18 2024 Start to check if NEA is aligned with reference sequence...v3.4.38 Sun Feb 4 01:16:18 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB Sun Feb 4 01:16:18 2024 -Reference genome FASTA file: /home/yunye/.gwaslab/hg19.fa Sun Feb 4 01:16:18 2024 -Checking records: 1 2 3 4 5 6 7 X 8 9 10 11 12 13 14 15 16 17 18 20 Y 19 22 21 M Sun Feb 4 01:16:41 2024 -Variants allele on given reference sequence : 41528 Sun Feb 4 01:16:41 2024 -Variants flipped : 50055 Sun Feb 4 01:16:41 2024 -Raw Matching rate : 91.58% Sun Feb 4 01:16:41 2024 -Variants inferred reverse_complement : 0 Sun Feb 4 01:16:41 2024 -Variants inferred reverse_complement_flipped : 0 Sun Feb 4 01:16:41 2024 -Both allele on genome + unable to distinguish : 8417 Sun Feb 4 01:16:41 2024 -Variants not on given reference sequence : 0 Sun Feb 4 01:16:41 2024 Finished checking if NEA is aligned with reference sequence. Sun Feb 4 01:16:41 2024 Start to adjust statistics based on STATUS code...v3.4.38 Sun Feb 4 01:16:41 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB Sun Feb 4 01:16:42 2024 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.38 Sun Feb 4 01:16:42 2024 -Flipping 50055 variants... Sun Feb 4 01:16:42 2024 -Swapping column: NEA <=> EA... Sun Feb 4 01:16:42 2024 -Flipping column: BETA = - BETA... Sun Feb 4 01:16:42 2024 -Flipping column: DIRECTION +-?0 <=> -+?0 ... Sun Feb 4 01:16:42 2024 -Flipping column: EAF = 1 - EAF... Sun Feb 4 01:16:42 2024 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x Sun Feb 4 01:16:43 2024 Finished adjusting. Sun Feb 4 01:16:43 2024 Start to infer strand for palindromic SNPs/align indistinguishable indels...v3.4.38 Sun Feb 4 01:16:43 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB Sun Feb 4 01:16:43 2024 -Number of threads/cores to use: 3 Sun Feb 4 01:16:43 2024 -Reference VCF: /home/yunye/anaconda3/lib/python3.9/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Sun Feb 4 01:16:43 2024 -Checking prefix for chromosomes in vcf files... Sun Feb 4 01:16:43 2024 -No prefix for chromosomes in the VCF files. Sun Feb 4 01:16:43 2024 -Field for alternative allele frequency in VCF INFO: AF Sun Feb 4 01:16:43 2024 -Identified 14174 palindromic SNPs... Sun Feb 4 01:16:44 2024 -After filtering by MAF< 0.4 , 12937 palindromic SNPs with unknown strand will be inferred... Sun Feb 4 01:16:52 2024 -Non-palindromic : 76650 Sun Feb 4 01:16:52 2024 -Palindromic SNPs on + strand: 12566 Sun Feb 4 01:16:52 2024 -Palindromic SNPs on - strand and needed to be flipped: 17 Sun Feb 4 01:16:52 2024 -Palindromic SNPs with MAF not available to infer : 1237 Sun Feb 4 01:16:52 2024 -Palindromic SNPs with no macthes or no information : 181 Sun Feb 4 01:16:53 2024 -Identified 8417 indistinguishable Indels... Sun Feb 4 01:16:53 2024 -Indistinguishable indels will be inferred from reference vcf ref and alt... Sun Feb 4 01:16:53 2024 -DAF tolerance: 0.2 Sun Feb 4 01:17:05 2024 -Indels ea/nea match reference : 3656 Sun Feb 4 01:17:06 2024 -Indels ea/nea need to be flipped : 4483 Sun Feb 4 01:17:06 2024 -Indels with no macthes or no information : 278 Sun Feb 4 01:17:06 2024 Finished inferring strand for palindromic SNPs/align indistinguishable indels. Sun Feb 4 01:17:06 2024 Start to adjust statistics based on STATUS code...v3.4.38 Sun Feb 4 01:17:06 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB Sun Feb 4 01:17:07 2024 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: ALT->EA , REF->NEA...v3.4.38 Sun Feb 4 01:17:07 2024 -Flipping 4483 variants... Sun Feb 4 01:17:07 2024 -Swapping column: NEA <=> EA... Sun Feb 4 01:17:07 2024 -Flipping column: BETA = - BETA... Sun Feb 4 01:17:07 2024 -Flipping column: DIRECTION +-?0 <=> -+?0 ... Sun Feb 4 01:17:07 2024 -Flipping column: EAF = 1 - EAF... Sun Feb 4 01:17:07 2024 -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4 Sun Feb 4 01:17:07 2024 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand...v3.4.38 Sun Feb 4 01:17:07 2024 -Flipping 17 variants... Sun Feb 4 01:17:07 2024 -Flipping column: BETA = - BETA... Sun Feb 4 01:17:07 2024 -Flipping column: DIRECTION +-?0 <=> -+?0 ... Sun Feb 4 01:17:07 2024 -Flipping column: EAF = 1 - EAF... Sun Feb 4 01:17:07 2024 -Changed the status for flipped variants: xxxxx[012]5: -> xxxxx[012]2 Sun Feb 4 01:17:07 2024 Finished adjusting. Sun Feb 4 01:17:08 2024 Start to sort the genome coordinates...v3.4.38 Sun Feb 4 01:17:08 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 14.24 MB Sun Feb 4 01:17:08 2024 Finished sorting coordinates. Sun Feb 4 01:17:08 2024 Start to reorder the columns...v3.4.38 Sun Feb 4 01:17:08 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 13.48 MB Sun Feb 4 01:17:08 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Sun Feb 4 01:17:08 2024 Finished reordering the columns.
<gwaslab.g_Sumstats.Sumstats at 0x7fe710b84310>
Check the data again. Looks good!
mysumstats.data
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:875643:C:G | 1 | 875643 | G | C | 0.7290 | 0.0147 | 0.0130 | 0.258600 | 166718 | +?+- | 1960001 |
1 | 1:928416:G:A | 1 | 928416 | A | G | 0.1363 | 0.0113 | 0.0142 | 0.423300 | 191764 | --+- | 1960010 |
2 | 1:933045:C:T | 1 | 933045 | T | C | 0.0087 | 0.0008 | 0.0704 | 0.991500 | 166718 | -?+- | 1960010 |
3 | 1:1019158:G:A | 1 | 1019158 | A | G | 0.0110 | 0.1189 | 0.0613 | 0.052510 | 166718 | +?++ | 1960010 |
4 | 1:1023310:T:C | 1 | 1023310 | C | T | 0.0854 | 0.0202 | 0.0182 | 0.265600 | 191764 | +++- | 1960000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | X:153882706:C:T | 23 | 153882706 | T | C | 0.0125 | -0.0017 | 0.0526 | 0.973600 | 166718 | +?-- | 1960010 |
99996 | X:154087726:C:T | 23 | 154087726 | T | C | 0.0743 | 0.0414 | 0.0151 | 0.006171 | 191764 | +-+- | 1960010 |
99997 | X:154126990:T:C | 23 | 154126990 | C | T | 0.0763 | -0.0085 | 0.0132 | 0.519500 | 191764 | +--- | 1960000 |
99998 | X:154140146:T:G | 23 | 154140146 | G | T | 0.1707 | 0.0043 | 0.0096 | 0.652700 | 191764 | +-+- | 1960000 |
99999 | X:154278797:T:C | 23 | 154278797 | C | T | 0.1738 | 0.0025 | 0.0091 | 0.785500 | 191764 | +-+- | 1960000 |
100000 rows × 12 columns
Annotate rsID¶
rsID is assigned using two types of reference file:
- ref_rsid_tsv : tsv file for annotation of commonly used variants
- ref_rsid_vcf : vcf file for annotation of other variants
GWASLab provides preprocessed tsv files for 1KG varaints (~80M), we can download the file using .download_ref
with key words. This time we will use 1kg_dbsnp151_hg19_auto
, which is the SNPID-rsID conversion table for autosomal variants in 1KG project (hg19). This will take around a few minutes to dowadload.
# 961M
gl.download_ref("1kg_dbsnp151_hg19_auto")
Sun Feb 4 01:17:08 2024 Start to download 1kg_dbsnp151_hg19_auto ... Sun Feb 4 01:17:08 2024 -Downloading to: /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz Sun Feb 4 01:17:08 2024 -File /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz exists. Sun Feb 4 01:17:10 2024 -File path: /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz Sun Feb 4 01:17:10 2024 -MD5 check: 7d1e7624fb6e4df7a2f6f05558d436b4 Sun Feb 4 01:17:10 2024 -MD5 verified. Sun Feb 4 01:17:10 2024 -Updating record in config file... Sun Feb 4 01:17:10 2024 Downloaded 1kg_dbsnp151_hg19_auto successfully!
mysumstats.assign_rsid(ref_rsid_tsv= gl.get_path("1kg_dbsnp151_hg19_auto"))
Sun Feb 4 01:17:10 2024 Start to assign rsID by matching SNPID with CHR:POS:REF:ALT in the reference TSV...v3.4.38 Sun Feb 4 01:17:10 2024 -Current Dataframe shape : 100000 x 12 ; Memory usage: 13.48 MB Sun Feb 4 01:17:10 2024 -Number of threads/cores to use: 1 Sun Feb 4 01:17:10 2024 -Reference TSV: /home/yunye/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz Sun Feb 4 01:17:10 2024 -100000 rsID could be possibly fixed... Sun Feb 4 01:17:10 2024 -Setting block size: 5000000 Sun Feb 4 01:17:10 2024 -Loading block: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Sun Feb 4 01:21:13 2024 -rsID annotation for 3266 needed to be fixed! Sun Feb 4 01:21:13 2024 -Annotated 96734 rsID successfully! Sun Feb 4 01:21:13 2024 Finished assign rsID using reference file.
- We annotated 96734 (96734 /100000 = 96.7%) variants using the conversion table.
- For the annotation of other variants (3.3%), we may need the dbSNP reference vcf.
mysumstats.data
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | rsID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:875643:C:G | 1 | 875643 | G | C | 0.7290 | 0.0147 | 0.0130 | 0.258600 | 166718 | +?+- | 1960001 | rs7411115 |
1 | 1:928416:G:A | 1 | 928416 | A | G | 0.1363 | 0.0113 | 0.0142 | 0.423300 | 191764 | --+- | 1960010 | rs111754459 |
2 | 1:933045:C:T | 1 | 933045 | T | C | 0.0087 | 0.0008 | 0.0704 | 0.991500 | 166718 | -?+- | 1960010 | rs189213898 |
3 | 1:1019158:G:A | 1 | 1019158 | A | G | 0.0110 | 0.1189 | 0.0613 | 0.052510 | 166718 | +?++ | 1960010 | rs142369888 |
4 | 1:1023310:T:C | 1 | 1023310 | C | T | 0.0854 | 0.0202 | 0.0182 | 0.265600 | 191764 | +++- | 1960000 | rs11260589 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | X:153882706:C:T | 23 | 153882706 | T | C | 0.0125 | -0.0017 | 0.0526 | 0.973600 | 166718 | +?-- | 1960010 | <NA> |
99996 | X:154087726:C:T | 23 | 154087726 | T | C | 0.0743 | 0.0414 | 0.0151 | 0.006171 | 191764 | +-+- | 1960010 | <NA> |
99997 | X:154126990:T:C | 23 | 154126990 | C | T | 0.0763 | -0.0085 | 0.0132 | 0.519500 | 191764 | +--- | 1960000 | <NA> |
99998 | X:154140146:T:G | 23 | 154140146 | G | T | 0.1707 | 0.0043 | 0.0096 | 0.652700 | 191764 | +-+- | 1960000 | <NA> |
99999 | X:154278797:T:C | 23 | 154278797 | C | T | 0.1738 | 0.0025 | 0.0091 | 0.785500 | 191764 | +-+- | 1960000 | <NA> |
100000 rows × 13 columns
Maximize the annotation¶
For the annotation of not-so-common variants, we can use VCF files downloaded form NCBI dbsnp (https://ftp.ncbi.nih.gov/snp/latest_release/VCF/).
Note:
- The file size is huge (>20GB) and it might take several hours to download. (we can skip this step in this tutorial. )
- Specify
chr_dict
to match the chromosome notations in sumstats and in vcf files.
Parameters :
ref_rsid_vcf
: the path to the reference rsID vcf filen_cores
: number of threads to use
mysumstats.assign_rsid(n_cores=1,
ref_rsid_vcf= "/home/yunye/mydata/d_disk/dbsnp/GCF_000001405.25.vcf.gz",
chr_dict = gl.get_number_to_NC(build="19") )
Sun Feb 4 01:21:13 2024 Start to assign rsID using reference VCF...v3.4.38 Sun Feb 4 01:21:13 2024 -Current Dataframe shape : 100000 x 13 ; Memory usage: 14.24 MB Sun Feb 4 01:21:13 2024 -Number of threads/cores to use: 1 Sun Feb 4 01:21:13 2024 -Reference VCF: /home/yunye/mydata/d_disk/dbsnp/GCF_000001405.25.vcf.gz Sun Feb 4 01:21:13 2024 -Assigning rsID based on CHR:POS and REF:ALT/ALT:REF... Sun Feb 4 01:21:25 2024 -rsID Annotation for 122 need to be fixed! Sun Feb 4 01:21:25 2024 -Annotated 3144 rsID successfully! Sun Feb 4 01:21:25 2024 Finished assign rsID using reference file.
After this, only 122 variants were not annotated. (122/100000 = 0.122%), mostly indels not available in the reference VCF.
mysumstats.data.loc[mysumstats.data["rsID"].isna(),:]
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | rsID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
614 | 1:18539349:T:AG | 1 | 18539349 | AG | T | 0.3139 | -0.0099 | 0.0096 | 0.29840 | 191764 | +--- | 1960309 | <NA> |
4451 | 1:160397344:C:CATTATT | 1 | 160397344 | C | CATTATT | 0.8184 | 0.0298 | 0.0123 | 0.01560 | 191764 | ++++ | 1960368 | <NA> |
11431 | 2:107374263:ATCTC:A | 2 | 107374263 | ATCTC | A | 0.9613 | 0.0452 | 0.0238 | 0.05739 | 191764 | +++- | 1960368 | <NA> |
12873 | 2:153147115:C:AA | 2 | 153147115 | AA | C | 0.4939 | 0.0005 | 0.0103 | 0.96120 | 191764 | +--+ | 1960319 | <NA> |
13626 | 2:175894098:G:AA | 2 | 175894098 | AA | G | 0.3400 | 0.0029 | 0.0095 | 0.76090 | 191764 | --++ | 1960319 | <NA> |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99841 | X:145191155:CT:C | 23 | 145191155 | C | CT | 0.5646 | -0.0023 | 0.0071 | 0.74480 | 191764 | -+-- | 1960368 | <NA> |
99856 | X:145727210:CAT:C | 23 | 145727210 | C | CAT | 0.4036 | -0.0159 | 0.0116 | 0.16940 | 191764 | --+- | 1960368 | <NA> |
99935 | X:150826311:GT:G | 23 | 150826311 | GT | G | 0.2320 | 0.0016 | 0.0143 | 0.91320 | 191764 | --+- | 1960368 | <NA> |
99971 | X:152656101:CCA:C | 23 | 152656101 | C | CCA | 0.1351 | -0.0298 | 0.0134 | 0.02646 | 191764 | ---- | 1960368 | <NA> |
99994 | X:153820016:AT:A | 23 | 153820016 | AT | A | 0.9888 | 0.0470 | 0.0605 | 0.43720 | 166718 | +?++ | 1960368 | <NA> |
122 rows × 13 columns
Tips : The two steps can be run in a single one.
#mysumstats.assign_rsid(n_cores=3,
# ref_rsid_tsv= gl.get_path("1kg_dbsnp151_hg19_auto"),
# ref_rsid_vcf= "/home/yunye/CommonData/Reference/ncbi_dbsnp/ncbi_dbsnp/db155/GCF_000001405.25.gz",
# chr_dict = gl.get_number_to_NC(build="19") )
Sumstats Summary¶
Check the summary of the currrent sumstats (see: https://cloufield.github.io/gwaslab/StatusCode/):
mysumstats.summary()
Values | Percentage | ||
---|---|---|---|
Category | Items | ||
META | Row_num | 100000 | <NA> |
Column_num | 13 | <NA> | |
Column_names | SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION... | <NA> | |
Last_checked_time | Sun Feb 4 01:21:25 2024 | <NA> | |
MISSING | Missing_total | 122 | 0.12 |
Missing_rsID | 122 | 0.12 | |
MAF | Common | 50983 | 50.98 |
Low_frequency | 16809 | 16.81 | |
Rare | 32157 | 32.16 | |
P | Minimum | 1.1900000000000001e-77 | 0.0 |
Significant | 86 | 0.09 | |
Suggestive | 166 | 0.17 | |
STATUS | 1960010 | 42815 | 42.82 |
1960000 | 33835 | 33.84 | |
1960001 | 6329 | 6.33 | |
1960011 | 6237 | 6.24 | |
1960364 | 4483 | 4.48 | |
1960363 | 3656 | 3.66 | |
1960007 | 621 | 0.62 | |
1960017 | 616 | 0.62 | |
1960309 | 563 | 0.56 | |
1960368 | 278 | 0.28 | |
1960319 | 196 | 0.2 | |
1960018 | 181 | 0.18 | |
1960008 | 173 | 0.17 | |
1960012 | 10 | 0.01 | |
1960002 | 7 | 0.01 |
Check the details of harmonization results .lookup_status()
mysumstats.lookup_status()
Genome_Build | rsID&SNPID | CHR&POS | Stadardize&Normalize | Align | Panlidromic_SNP&Indel | Count | Percentage(%) | |
---|---|---|---|---|---|---|---|---|
1960000 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Not_palindromic_SNPs | 33835 | 33.84 |
1960001 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Palindromic+strand | 6329 | 6.33 |
1960002 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Palindromic-strand_fixed | 7 | 0.01 |
1960007 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Indistinguishable | 621 | 0.62 |
1960008 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | No_matching_or_no_info | 173 | 0.17 |
1960010 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Flipped_fixed | Not_palindromic_SNPs | 42815 | 42.82 |
1960011 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Flipped_fixed | Palindromic+strand | 6237 | 6.24 |
1960012 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Flipped_fixed | Palindromic-strand_fixed | 10 | 0.01 |
1960017 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Flipped_fixed | Indistinguishable | 616 | 0.62 |
1960018 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Flipped_fixed | No_matching_or_no_info | 181 | 0.18 |
1960309 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Match: NEA=REF | Unchecked | 563 | 0.56 |
1960319 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Flipped_fixed | Unchecked | 196 | 0.2 |
1960363 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | Indel_match | 3656 | 3.66 |
1960364 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | Indel_flipped_fixed | 4483 | 4.48 |
1960368 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | No_matching_or_no_info | 278 | 0.28 |
Formatting and saving : to_format()¶
You can easily format the processed sumstats and save it. For details, see: https://cloufield.github.io/gwaslab/Format/
Let's export the sumstats as the default format used in LDSC.
mysumstats.to_format("clean_sumstats",fmt="ldsc")
Sun Feb 4 01:21:28 2024 Start to convert the output sumstats in: ldsc format Sun Feb 4 01:21:28 2024 -Formatting statistics ... Sun Feb 4 01:21:29 2024 -Float statistics formats: Sun Feb 4 01:21:29 2024 - Columns : ['EAF', 'BETA', 'SE', 'P'] Sun Feb 4 01:21:29 2024 - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}'] Sun Feb 4 01:21:29 2024 -Start outputting sumstats in ldsc format... Sun Feb 4 01:21:29 2024 -ldsc format will be loaded... Sun Feb 4 01:21:29 2024 -ldsc format meta info: Sun Feb 4 01:21:29 2024 - format_name : ldsc Sun Feb 4 01:21:29 2024 - format_source : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format Sun Feb 4 01:21:29 2024 - format_source2 : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py Sun Feb 4 01:21:29 2024 - format_version : 20150306 Sun Feb 4 01:21:29 2024 -gwaslab to ldsc format dictionary: Sun Feb 4 01:21:29 2024 - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS Sun Feb 4 01:21:29 2024 - ldsc values: SNP,A2,A1,Frq,N,Beta,P,Z,INFO,OR,CHR,POS Sun Feb 4 01:21:29 2024 -Output path: clean_sumstats.ldsc.tsv.gz Sun Feb 4 01:21:29 2024 -Output columns: Beta,Frq,SNP,A2,P,N,A1,POS,CHR Sun Feb 4 01:21:29 2024 -Writing sumstats to: clean_sumstats.ldsc.tsv.gz... Sun Feb 4 01:21:30 2024 -Saving log file to: clean_sumstats.ldsc.log Sun Feb 4 01:21:30 2024 Finished outputting successfully!
Sometimes we only need to export part of the sumstats. For example,
- we can specify
hapmap3=True
to export only hapmap3 variants; - specify
exclude_hla=True
to exclude variants in HLA region when exporting; - specify
md5sum=True
to calculate the md5sum value for the exported sumstats.
mysumstats.to_format("clean_sumstats",fmt="ldsc",hapmap3=True,exclude_hla=True,md5sum=True)
Sun Feb 4 01:21:30 2024 Start to convert the output sumstats in: ldsc format Sun Feb 4 01:21:30 2024 -Excluding variants in MHC (HLA) region ... Sun Feb 4 01:21:30 2024 -Exclude 609 variants in MHC (HLA) region : 25Mb - 34Mb. Sun Feb 4 01:21:30 2024 -Processing 99391 raw variants... Sun Feb 4 01:21:30 2024 -Loading Hapmap3 variants data... Sun Feb 4 01:21:31 2024 -Extract 8644 variants in Hapmap3 datasets for build 19. Sun Feb 4 01:21:31 2024 -Formatting statistics ... Sun Feb 4 01:21:31 2024 -Float statistics formats: Sun Feb 4 01:21:31 2024 - Columns : ['EAF', 'BETA', 'SE', 'P'] Sun Feb 4 01:21:31 2024 - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}'] Sun Feb 4 01:21:31 2024 -Start outputting sumstats in ldsc format... Sun Feb 4 01:21:31 2024 -ldsc format will be loaded... Sun Feb 4 01:21:31 2024 -ldsc format meta info: Sun Feb 4 01:21:31 2024 - format_name : ldsc Sun Feb 4 01:21:31 2024 - format_source : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format Sun Feb 4 01:21:31 2024 - format_source2 : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py Sun Feb 4 01:21:31 2024 - format_version : 20150306 Sun Feb 4 01:21:31 2024 -gwaslab to ldsc format dictionary: Sun Feb 4 01:21:31 2024 - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS Sun Feb 4 01:21:31 2024 - ldsc values: SNP,A2,A1,Frq,N,Beta,P,Z,INFO,OR,CHR,POS Sun Feb 4 01:21:31 2024 -Output path: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz Sun Feb 4 01:21:31 2024 -Output columns: Beta,Frq,SNP,A2,P,N,A1,POS,CHR Sun Feb 4 01:21:31 2024 -Writing sumstats to: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz... Sun Feb 4 01:21:31 2024 -md5sum hashing for the file: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz Sun Feb 4 01:21:31 2024 -md5sum path: clean_sumstats.hapmap3.noMHC.ldsc.tsv.gz.md5sum Sun Feb 4 01:21:31 2024 -md5sum: 7b148066fae5f407f1239e49efe1b88a Sun Feb 4 01:21:31 2024 -Saving log file to: clean_sumstats.hapmap3.noMHC.ldsc.log Sun Feb 4 01:21:31 2024 Finished outputting successfully!
Output in GWAS-SSF format
mysumstats.to_format("clean_sumstats",fmt="ssf",md5sum=True)
Sun Feb 4 01:21:31 2024 Start to convert the output sumstats in: ssf format Sun Feb 4 01:21:31 2024 -Formatting statistics ... Sun Feb 4 01:21:31 2024 -Float statistics formats: Sun Feb 4 01:21:31 2024 - Columns : ['EAF', 'BETA', 'SE', 'P'] Sun Feb 4 01:21:31 2024 - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}'] Sun Feb 4 01:21:31 2024 -Replacing SNPID separator from ":" to "_"... Sun Feb 4 01:21:31 2024 -Start outputting sumstats in ssf format... Sun Feb 4 01:21:31 2024 -ssf format will be loaded... Sun Feb 4 01:21:31 2024 -ssf format meta info: Sun Feb 4 01:21:31 2024 - format_name : ssf Sun Feb 4 01:21:31 2024 - format_source : https://www.biorxiv.org/content/10.1101/2022.07.15.500230v1.full Sun Feb 4 01:21:31 2024 - format_cite_name : GWAS-SSF v0.1 Sun Feb 4 01:21:31 2024 - format_separator : \t Sun Feb 4 01:21:31 2024 - format_na : #NA Sun Feb 4 01:21:31 2024 - format_col_order : chromosome,base_pair_location,effect_allele,other_allele,beta,odds_ratio,hazard_ratio,standard_error,effect_allele_frequency,p_value,neg_log_10_p_value,ci_upper,ci_lower,rsid,variant_id,info,ref_allele,n Sun Feb 4 01:21:31 2024 - format_version : 20230328 Sun Feb 4 01:21:31 2024 -gwaslab to ssf format dictionary: Sun Feb 4 01:21:31 2024 - gwaslab keys: SNPID,rsID,CHR,POS,NEA,EA,EAF,N,BETA,SE,P,MLOG10P,INFO,OR,HR,OR_95L,OR_95U Sun Feb 4 01:21:31 2024 - ssf values: variant_id,rsid,chromosome,base_pair_location,other_allele,effect_allele,effect_allele_frequency,n,beta,standard_error,p_value,neg_log_10_p_value,info,odds_ratio,hazard_ratio,ci_lower,ci_upper Sun Feb 4 01:21:31 2024 -Output path: clean_sumstats.ssf.tsv.gz Sun Feb 4 01:21:31 2024 -Output columns: chromosome,base_pair_location,effect_allele,other_allele,beta,standard_error,effect_allele_frequency,p_value,rsid,variant_id,n Sun Feb 4 01:21:31 2024 -Writing sumstats to: clean_sumstats.ssf.tsv.gz... Sun Feb 4 01:21:32 2024 -md5sum hashing for the file: clean_sumstats.ssf.tsv.gz Sun Feb 4 01:21:32 2024 -md5sum path: clean_sumstats.ssf.tsv.gz.md5sum Sun Feb 4 01:21:32 2024 -md5sum: f4cee107b7e9dc7d24c97dc4d051e79b Sun Feb 4 01:21:32 2024 -Saving log file to: clean_sumstats.ssf.log Sun Feb 4 01:21:32 2024 Finished outputting successfully!
For more formats and examples, see https://cloufield.github.io/gwaslab/format_load_save/
Liftover¶
GWASLab can perform liftover for base pair positions.
Note: GWASLab only liftover CHR and POS, and when lifted, the last two digits status (alignment and palindromic/indel check) code will be rolled back to 99. Since for different reference genome, the reference allele or strand might be reverse, so it is needed to harmonize again using reference files of the target genome build after liftover.
mysumstats.liftover(n_cores=1, from_build="19", to_build="38")
Sun Feb 4 01:21:32 2024 Start to perform liftover...v3.4.38 Sun Feb 4 01:21:32 2024 -Current Dataframe shape : 100000 x 13 ; Memory usage: 14.24 MB Sun Feb 4 01:21:32 2024 -Number of threads/cores to use: 1 Sun Feb 4 01:21:32 2024 -Creating converter : hg19 to hg38 Sun Feb 4 01:21:33 2024 -Converting variants with status code xxx0xxx :100000... Sun Feb 4 01:21:54 2024 -Removed unmapped variants: 57 Sun Feb 4 01:21:54 2024 Start to fix chromosome notation (CHR)...v3.4.38 Sun Feb 4 01:21:54 2024 -Current Dataframe shape : 99943 x 13 ; Memory usage: 15.00 MB Sun Feb 4 01:21:54 2024 -Checking CHR data type... Sun Feb 4 01:21:54 2024 -Variants with standardized chromosome notation: 99943 Sun Feb 4 01:21:54 2024 -All CHR are already fixed... Sun Feb 4 01:21:55 2024 Finished fixing chromosome notation (CHR). Sun Feb 4 01:21:55 2024 Start to fix basepair positions (POS)...v3.4.38 Sun Feb 4 01:21:55 2024 -Current Dataframe shape : 99943 x 13 ; Memory usage: 15.00 MB Sun Feb 4 01:21:55 2024 -Converting to Int64 data type ... Sun Feb 4 01:21:55 2024 -Position bound:(0 , 250,000,000) Sun Feb 4 01:21:55 2024 -Removed outliers: 0 Sun Feb 4 01:21:55 2024 -Removed 0 variants with bad positions. Sun Feb 4 01:21:55 2024 Finished fixing basepair positions (POS). Sun Feb 4 01:21:55 2024 Finished liftover.
mysumstats.data
SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | rsID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:875643:C:G | 1 | 940263 | G | C | 0.7290 | 0.0147 | 0.0130 | 0.258600 | 166718 | +?+- | 3860099 | rs7411115 |
1 | 1:928416:G:A | 1 | 993036 | A | G | 0.1363 | 0.0113 | 0.0142 | 0.423300 | 191764 | --+- | 3860099 | rs111754459 |
2 | 1:933045:C:T | 1 | 997665 | T | C | 0.0087 | 0.0008 | 0.0704 | 0.991500 | 166718 | -?+- | 3860099 | rs189213898 |
3 | 1:1019158:G:A | 1 | 1083778 | A | G | 0.0110 | 0.1189 | 0.0613 | 0.052510 | 166718 | +?++ | 3860099 | rs142369888 |
4 | 1:1023310:T:C | 1 | 1087930 | C | T | 0.0854 | 0.0202 | 0.0182 | 0.265600 | 191764 | +++- | 3860099 | rs11260589 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | X:153882706:C:T | 23 | 154654432 | T | C | 0.0125 | -0.0017 | 0.0526 | 0.973600 | 166718 | +?-- | 3860099 | rs3829775 |
99996 | X:154087726:C:T | 23 | 154859451 | T | C | 0.0743 | 0.0414 | 0.0151 | 0.006171 | 191764 | +-+- | 3860099 | rs112922881 |
99997 | X:154126990:T:C | 23 | 154898715 | C | T | 0.0763 | -0.0085 | 0.0132 | 0.519500 | 191764 | +--- | 3860099 | rs28370221 |
99998 | X:154140146:T:G | 23 | 154911871 | G | T | 0.1707 | 0.0043 | 0.0096 | 0.652700 | 191764 | +-+- | 3860099 | rs28895719 |
99999 | X:154278797:T:C | 23 | 155050522 | C | T | 0.1738 | 0.0025 | 0.0091 | 0.785500 | 191764 | +-+- | 3860099 | rs114209171 |
99943 rows × 13 columns
For details, see https://cloufield.github.io/gwaslab/LiftOver/