Lead and novel variants¶
In [1]:
Copied!
import gwaslab as gl
import gwaslab as gl
Load sample data¶
Use only first 1000000 variants as example
In [2]:
Copied!
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
beta="BETA",
se="SE",
p="P",
build="19",
verbose=False,
nrows=1000000)
mysumstats.basic_check(verbose=False)
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
beta="BETA",
se="SE",
p="P",
build="19",
verbose=False,
nrows=1000000)
mysumstats.basic_check(verbose=False)
Get lead variants¶
GWASLab will use MLOG10P first by default. If MLOG10P is not avaiable, it will look for P column.
In [3]:
Copied!
mysumstats.get_lead()
mysumstats.get_lead()
Sat Feb 3 01:27:18 2024 Start to extract lead variants...v3.4.38 Sat Feb 3 01:27:18 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:27:18 2024 -Processing 1000000 variants... Sat Feb 3 01:27:18 2024 -Significance threshold : 5e-08 Sat Feb 3 01:27:18 2024 -Sliding window size: 500 kb Sat Feb 3 01:27:19 2024 -Using P for extracting lead variants... Sat Feb 3 01:27:19 2024 -Found 543 significant variants in total... Sat Feb 3 01:27:19 2024 -Identified 4 lead variants! Sat Feb 3 01:27:19 2024 Finished extracting lead variants.
Out[3]:
SNPID | CHR | POS | EA | NEA | BETA | SE | P | STATUS | |
---|---|---|---|---|---|---|---|---|---|
96739 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.0621 | 0.0103 | 1.629000e-09 | 1960099 |
213860 | 1:51103268_T_C | 1 | 51103268 | C | T | -0.0802 | 0.0120 | 2.519000e-11 | 1960099 |
534095 | 1:154309595_TA_T | 1 | 154309595 | TA | T | -0.0915 | 0.0166 | 3.289000e-08 | 1960399 |
969974 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | -0.0946 | 0.0150 | 2.665000e-10 | 1960399 |
Get lead variants with gene name annotation¶
In [4]:
Copied!
mysumstats.get_lead(anno=True)
mysumstats.get_lead(anno=True)
Sat Feb 3 01:27:19 2024 Start to extract lead variants...v3.4.38 Sat Feb 3 01:27:19 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:27:19 2024 -Processing 1000000 variants... Sat Feb 3 01:27:19 2024 -Significance threshold : 5e-08 Sat Feb 3 01:27:19 2024 -Sliding window size: 500 kb Sat Feb 3 01:27:19 2024 -Using P for extracting lead variants... Sat Feb 3 01:27:19 2024 -Found 543 significant variants in total... Sat Feb 3 01:27:19 2024 -Identified 4 lead variants! Sat Feb 3 01:27:19 2024 -Annotating variants using references:ensembl Sat Feb 3 01:27:19 2024 -Annotating variants using references based on genome build:19 Sat Feb 3 01:27:19 2024 Start to annotate variants with nearest gene name(s)... Sat Feb 3 01:27:19 2024 -Assigning Gene name using ensembl_hg19_gtf for protein coding genes Sat Feb 3 01:27:19 2024 Finished annotating variants with nearest gene name(s) successfully! Sat Feb 3 01:27:19 2024 Finished extracting lead variants.
Out[4]:
SNPID | CHR | POS | EA | NEA | BETA | SE | P | STATUS | LOCATION | GENE | |
---|---|---|---|---|---|---|---|---|---|---|---|
96739 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.0621 | 0.0103 | 1.629000e-09 | 1960099 | 0 | USP48 |
213860 | 1:51103268_T_C | 1 | 51103268 | C | T | -0.0802 | 0.0120 | 2.519000e-11 | 1960099 | 0 | FAF1 |
534095 | 1:154309595_TA_T | 1 | 154309595 | TA | T | -0.0915 | 0.0166 | 3.289000e-08 | 1960399 | 0 | ATP8B2 |
969974 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | -0.0946 | 0.0150 | 2.665000e-10 | 1960399 | 26349 | TMEM18 |
Different window sizes¶
In [5]:
Copied!
mysumstats.get_lead(windowsizekb=1000)
mysumstats.get_lead(windowsizekb=1000)
Sat Feb 3 01:27:19 2024 Start to extract lead variants...v3.4.38 Sat Feb 3 01:27:19 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:27:19 2024 -Processing 1000000 variants... Sat Feb 3 01:27:19 2024 -Significance threshold : 5e-08 Sat Feb 3 01:27:19 2024 -Sliding window size: 1000 kb Sat Feb 3 01:27:19 2024 -Using P for extracting lead variants... Sat Feb 3 01:27:19 2024 -Found 543 significant variants in total... Sat Feb 3 01:27:19 2024 -Identified 4 lead variants! Sat Feb 3 01:27:19 2024 Finished extracting lead variants.
Out[5]:
SNPID | CHR | POS | EA | NEA | BETA | SE | P | STATUS | |
---|---|---|---|---|---|---|---|---|---|
96739 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.0621 | 0.0103 | 1.629000e-09 | 1960099 |
213860 | 1:51103268_T_C | 1 | 51103268 | C | T | -0.0802 | 0.0120 | 2.519000e-11 | 1960099 |
534095 | 1:154309595_TA_T | 1 | 154309595 | TA | T | -0.0915 | 0.0166 | 3.289000e-08 | 1960399 |
969974 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | -0.0946 | 0.0150 | 2.665000e-10 | 1960399 |
Different thresholds¶
In [6]:
Copied!
mysumstats.get_lead(sig_level=1e-10)
mysumstats.get_lead(sig_level=1e-10)
Sat Feb 3 01:27:19 2024 Start to extract lead variants...v3.4.38 Sat Feb 3 01:27:19 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:27:19 2024 -Processing 1000000 variants... Sat Feb 3 01:27:19 2024 -Significance threshold : 1e-10 Sat Feb 3 01:27:19 2024 -Sliding window size: 500 kb Sat Feb 3 01:27:20 2024 -Using P for extracting lead variants... Sat Feb 3 01:27:20 2024 -Found 1 significant variants in total... Sat Feb 3 01:27:20 2024 -Identified 1 lead variants! Sat Feb 3 01:27:20 2024 Finished extracting lead variants.
Out[6]:
SNPID | CHR | POS | EA | NEA | BETA | SE | P | STATUS | |
---|---|---|---|---|---|---|---|---|---|
213860 | 1:51103268_T_C | 1 | 51103268 | C | T | -0.0802 | 0.012 | 2.519000e-11 | 1960099 |
Check if novel against a tabular file¶
In [8]:
Copied!
novel = mysumstats.get_novel(known="./toy_data/known_loci.txt")
novel = mysumstats.get_novel(known="./toy_data/known_loci.txt")
Sat Feb 3 01:22:31 2024 Start to check if lead variants are known...v3.4.38 Sat Feb 3 01:22:31 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:22:31 2024 Start to extract lead variants...v3.4.38 Sat Feb 3 01:22:31 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:22:31 2024 -Processing 1000000 variants... Sat Feb 3 01:22:31 2024 -Significance threshold : 5e-08 Sat Feb 3 01:22:31 2024 -Sliding window size: 500 kb Sat Feb 3 01:22:31 2024 -Using P for extracting lead variants... Sat Feb 3 01:22:31 2024 -Found 543 significant variants in total... Sat Feb 3 01:22:31 2024 -Identified 4 lead variants! Sat Feb 3 01:22:31 2024 Finished extracting lead variants. Sat Feb 3 01:22:31 2024 -Lead variants in known loci: 2 Sat Feb 3 01:22:31 2024 -Checking the minimum distance between identified lead variants and provided known variants... Sat Feb 3 01:22:31 2024 -Identified 2 known vairants in current sumstats... Sat Feb 3 01:22:31 2024 -Identified 2 novel vairants in current sumstats... Sat Feb 3 01:22:31 2024 Finished checking if lead variants are known.
In [9]:
Copied!
novel
novel
Out[9]:
SNPID | CHR | POS | EA | NEA | BETA | SE | P | STATUS | DISTANCE_TO_KNOWN | KNOWN_ID | NOVEL | LOCATION_OF_KNOWN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.0621 | 0.0103 | 1.629000e-09 | 1960099 | 29034942.0 | 1:51103268 | True | Upstream |
1 | 1:51103268_T_C | 1 | 51103268 | C | T | -0.0802 | 0.0120 | 2.519000e-11 | 1960099 | 0.0 | 1:51103268 | False | Same |
2 | 1:154309595_TA_T | 1 | 154309595 | TA | T | -0.0915 | 0.0166 | 3.289000e-08 | 1960399 | 0.0 | 1:154309595 | False | Same |
3 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | -0.0946 | 0.0150 | 2.665000e-10 | 1960399 | NaN | <NA> | True | NoneOnThisChr |
Ccheck against GWAS Catalog using EFO ID¶
In [10]:
Copied!
# EFO ID can be found on gwas catalog
mysumstats.get_novel(efo="MONDO_0005148")
# EFO ID can be found on gwas catalog
mysumstats.get_novel(efo="MONDO_0005148")
Sat Feb 3 01:22:31 2024 Start to check if lead variants are known...v3.4.38 Sat Feb 3 01:22:31 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:22:31 2024 Start to extract lead variants...v3.4.38 Sat Feb 3 01:22:31 2024 -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB Sat Feb 3 01:22:31 2024 -Processing 1000000 variants... Sat Feb 3 01:22:31 2024 -Significance threshold : 5e-08 Sat Feb 3 01:22:31 2024 -Sliding window size: 500 kb Sat Feb 3 01:22:31 2024 -Using P for extracting lead variants... Sat Feb 3 01:22:31 2024 -Found 543 significant variants in total... Sat Feb 3 01:22:31 2024 -Identified 4 lead variants! Sat Feb 3 01:22:31 2024 Finished extracting lead variants. Sat Feb 3 01:22:32 2024 Start to retrieve data using EFO: MONDO_0005148... Sat Feb 3 01:22:32 2024 Start to retrieve data from GWASCatalog... Sat Feb 3 01:22:32 2024 -Please make sure your sumstats is based on GRCh38... Sat Feb 3 01:22:32 2024 -Requesting (GET) trait information through the GWASCatalog API... Sat Feb 3 01:22:32 2024 -EFO trait api: https://www.ebi.ac.uk/gwas/rest/api/efoTraits/MONDO_0005148 Sat Feb 3 01:22:33 2024 -Status code: 200 Sat Feb 3 01:22:33 2024 -Trait Name: type 2 diabetes mellitus Sat Feb 3 01:22:33 2024 -Trait URL: http://purl.obolibrary.org/obo/MONDO_0005148 Sat Feb 3 01:22:33 2024 -Requesting (GET) GWAS associations through the GWASCatalog API... Sat Feb 3 01:22:33 2024 -associationsByTraitSummary API: https://www.ebi.ac.uk/gwas/rest/api/efoTraits/MONDO_0005148/associations?projection=associationByEfoTrait Sat Feb 3 01:22:33 2024 -Note: this step might take a while... Sat Feb 3 01:25:44 2024 -Status code 200 OK: Retrieved data from GWASCatalog successffully ... Sat Feb 3 01:25:44 2024 -Loading json ... Sat Feb 3 01:25:45 2024 -Parsing json ... Sat Feb 3 01:25:45 2024 -Number of reported associations for MONDO_0005148 in GWASCatalog: 5450 Sat Feb 3 01:25:45 2024 -Loading retrieved data into gwaslab Sumstats object ... Sat Feb 3 01:25:50 2024 Finished retrieving data from GWASCatalog... Sat Feb 3 01:25:50 2024 -Retrieved 4176 associations from GWAS catalog. Sat Feb 3 01:25:50 2024 -Lead variants in known loci: 4176 Sat Feb 3 01:25:50 2024 -Checking the minimum distance between identified lead variants and provided known variants... Sat Feb 3 01:25:50 2024 -Identified 4 known vairants in current sumstats... Sat Feb 3 01:25:50 2024 -Identified 0 novel vairants in current sumstats... Sat Feb 3 01:25:50 2024 Finished checking if lead variants are known.
Out[10]:
SNPID | CHR | POS | EA | NEA | BETA | SE | P | STATUS | DISTANCE_TO_KNOWN | KNOWN_ID | KNOWN_PUBMED_ID | KNOWN_AUTHOR | NOVEL | LOCATION_OF_KNOWN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.0621 | 0.0103 | 1.629000e-09 | 1960099 | -279528 | rs34759301 | 34594039 | Sakaue S | False | Downstream |
1 | 1:51103268_T_C | 1 | 51103268 | C | T | -0.0802 | 0.0120 | 2.519000e-11 | 1960099 | -62054 | rs12088739 | 30054458 | Xue A | False | Downstream |
2 | 1:154309595_TA_T | 1 | 154309595 | TA | T | -0.0915 | 0.0166 | 3.289000e-08 | 1960399 | 12189 | rs1194606 | 32541925 | Vujkovic M | False | Upstream |
3 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | -0.0946 | 0.0150 | 2.665000e-10 | 1960399 | 1 | rs72156956 | 30718926 | Suzuki K | False | Upstream |