Lead and novel variants¶

In [1]:

            
                Copied!
                
import gwaslab as gl
import gwaslab as gl

Load sample data¶

Use only first 1000000 variants as example

In [2]:

            
                Copied!
                
                    
                    
                
                

        
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",            
             beta="BETA",
             se="SE",
             p="P", 
             build="19",
             verbose=False, 
             nrows=1000000)
mysumstats.basic_check(verbose=False)
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",            
             beta="BETA",
             se="SE",
             p="P", 
             build="19",
             verbose=False, 
             nrows=1000000)
mysumstats.basic_check(verbose=False)

Get lead variants¶

GWASLab will use MLOG10P first by default. If MLOG10P is not avaiable, it will look for P column.

In [3]:

            
                Copied!
                
mysumstats.get_lead()
mysumstats.get_lead()

Sat Feb  3 01:27:18 2024 Start to extract lead variants...v3.4.38
Sat Feb  3 01:27:18 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:27:18 2024  -Processing 1000000 variants...
Sat Feb  3 01:27:18 2024  -Significance threshold : 5e-08
Sat Feb  3 01:27:18 2024  -Sliding window size: 500  kb
Sat Feb  3 01:27:19 2024  -Using P for extracting lead variants...
Sat Feb  3 01:27:19 2024  -Found 543 significant variants in total...
Sat Feb  3 01:27:19 2024  -Identified 4 lead variants!
Sat Feb  3 01:27:19 2024 Finished extracting lead variants.

Out[3]:

	SNPID	CHR	POS	EA	NEA	BETA	SE	P	STATUS
96739	1:22068326_A_G	1	22068326	G	A	0.0621	0.0103	1.629000e-09	1960099
213860	1:51103268_T_C	1	51103268	C	T	-0.0802	0.0120	2.519000e-11	1960099
534095	1:154309595_TA_T	1	154309595	TA	T	-0.0915	0.0166	3.289000e-08	1960399
969974	2:640986_CACAT_C	2	640986	C	CACAT	-0.0946	0.0150	2.665000e-10	1960399

Get lead variants with gene name annotation¶

In [4]:

            
                Copied!
                
mysumstats.get_lead(anno=True)
mysumstats.get_lead(anno=True)

Sat Feb  3 01:27:19 2024 Start to extract lead variants...v3.4.38
Sat Feb  3 01:27:19 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:27:19 2024  -Processing 1000000 variants...
Sat Feb  3 01:27:19 2024  -Significance threshold : 5e-08
Sat Feb  3 01:27:19 2024  -Sliding window size: 500  kb
Sat Feb  3 01:27:19 2024  -Using P for extracting lead variants...
Sat Feb  3 01:27:19 2024  -Found 543 significant variants in total...
Sat Feb  3 01:27:19 2024  -Identified 4 lead variants!
Sat Feb  3 01:27:19 2024  -Annotating variants using references:ensembl
Sat Feb  3 01:27:19 2024  -Annotating variants using references based on genome build:19
Sat Feb  3 01:27:19 2024 Start to annotate variants with nearest gene name(s)...
Sat Feb  3 01:27:19 2024  -Assigning Gene name using ensembl_hg19_gtf for protein coding genes
Sat Feb  3 01:27:19 2024 Finished annotating variants with nearest gene name(s) successfully!
Sat Feb  3 01:27:19 2024 Finished extracting lead variants.

Out[4]:

	SNPID	CHR	POS	EA	NEA	BETA	SE	P	STATUS	LOCATION	GENE
96739	1:22068326_A_G	1	22068326	G	A	0.0621	0.0103	1.629000e-09	1960099	0	USP48
213860	1:51103268_T_C	1	51103268	C	T	-0.0802	0.0120	2.519000e-11	1960099	0	FAF1
534095	1:154309595_TA_T	1	154309595	TA	T	-0.0915	0.0166	3.289000e-08	1960399	0	ATP8B2
969974	2:640986_CACAT_C	2	640986	C	CACAT	-0.0946	0.0150	2.665000e-10	1960399	26349	TMEM18

Different window sizes¶

In [5]:

            
                Copied!
                
mysumstats.get_lead(windowsizekb=1000)
mysumstats.get_lead(windowsizekb=1000)

Sat Feb  3 01:27:19 2024 Start to extract lead variants...v3.4.38
Sat Feb  3 01:27:19 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:27:19 2024  -Processing 1000000 variants...
Sat Feb  3 01:27:19 2024  -Significance threshold : 5e-08
Sat Feb  3 01:27:19 2024  -Sliding window size: 1000  kb
Sat Feb  3 01:27:19 2024  -Using P for extracting lead variants...
Sat Feb  3 01:27:19 2024  -Found 543 significant variants in total...
Sat Feb  3 01:27:19 2024  -Identified 4 lead variants!
Sat Feb  3 01:27:19 2024 Finished extracting lead variants.

Out[5]:

	SNPID	CHR	POS	EA	NEA	BETA	SE	P	STATUS
96739	1:22068326_A_G	1	22068326	G	A	0.0621	0.0103	1.629000e-09	1960099
213860	1:51103268_T_C	1	51103268	C	T	-0.0802	0.0120	2.519000e-11	1960099
534095	1:154309595_TA_T	1	154309595	TA	T	-0.0915	0.0166	3.289000e-08	1960399
969974	2:640986_CACAT_C	2	640986	C	CACAT	-0.0946	0.0150	2.665000e-10	1960399

Different thresholds¶

In [6]:

            
                Copied!
                
mysumstats.get_lead(sig_level=1e-10)
mysumstats.get_lead(sig_level=1e-10)

Sat Feb  3 01:27:19 2024 Start to extract lead variants...v3.4.38
Sat Feb  3 01:27:19 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:27:19 2024  -Processing 1000000 variants...
Sat Feb  3 01:27:19 2024  -Significance threshold : 1e-10
Sat Feb  3 01:27:19 2024  -Sliding window size: 500  kb
Sat Feb  3 01:27:20 2024  -Using P for extracting lead variants...
Sat Feb  3 01:27:20 2024  -Found 1 significant variants in total...
Sat Feb  3 01:27:20 2024  -Identified 1 lead variants!
Sat Feb  3 01:27:20 2024 Finished extracting lead variants.

Out[6]:

	SNPID	CHR	POS	EA	NEA	BETA	SE	P	STATUS
213860	1:51103268_T_C	1	51103268	C	T	-0.0802	0.012	2.519000e-11	1960099

Check if novel against a tabular file¶

known_loci.txt file looks like: CHR POS 1 154309595 1 51103268

In [8]:

            
                Copied!
                
novel = mysumstats.get_novel(known="./toy_data/known_loci.txt")
novel = mysumstats.get_novel(known="./toy_data/known_loci.txt")

Sat Feb  3 01:22:31 2024 Start to check if lead variants are known...v3.4.38
Sat Feb  3 01:22:31 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:22:31 2024 Start to extract lead variants...v3.4.38
Sat Feb  3 01:22:31 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:22:31 2024  -Processing 1000000 variants...
Sat Feb  3 01:22:31 2024  -Significance threshold : 5e-08
Sat Feb  3 01:22:31 2024  -Sliding window size: 500  kb
Sat Feb  3 01:22:31 2024  -Using P for extracting lead variants...
Sat Feb  3 01:22:31 2024  -Found 543 significant variants in total...
Sat Feb  3 01:22:31 2024  -Identified 4 lead variants!
Sat Feb  3 01:22:31 2024 Finished extracting lead variants.
Sat Feb  3 01:22:31 2024  -Lead variants in known loci: 2
Sat Feb  3 01:22:31 2024  -Checking the minimum distance between identified lead variants and provided known variants...
Sat Feb  3 01:22:31 2024  -Identified  2  known vairants in current sumstats...
Sat Feb  3 01:22:31 2024  -Identified  2  novel vairants in current sumstats...
Sat Feb  3 01:22:31 2024 Finished checking if lead variants are known.

In [9]:

            
                Copied!
                
novel
novel

Out[9]:

	SNPID	CHR	POS	EA	NEA	BETA	SE	P	STATUS	DISTANCE_TO_KNOWN	KNOWN_ID	NOVEL	LOCATION_OF_KNOWN
0	1:22068326_A_G	1	22068326	G	A	0.0621	0.0103	1.629000e-09	1960099	29034942.0	1:51103268	True	Upstream
1	1:51103268_T_C	1	51103268	C	T	-0.0802	0.0120	2.519000e-11	1960099	0.0	1:51103268	False	Same
2	1:154309595_TA_T	1	154309595	TA	T	-0.0915	0.0166	3.289000e-08	1960399	0.0	1:154309595	False	Same
3	2:640986_CACAT_C	2	640986	C	CACAT	-0.0946	0.0150	2.665000e-10	1960399	NaN	<NA>	True	NoneOnThisChr

Ccheck against GWAS Catalog using EFO ID¶

In [10]:

            
                Copied!
                
# EFO ID can be found on gwas catalog
mysumstats.get_novel(efo="MONDO_0005148")
# EFO ID can be found on gwas catalog
mysumstats.get_novel(efo="MONDO_0005148")

Sat Feb  3 01:22:31 2024 Start to check if lead variants are known...v3.4.38
Sat Feb  3 01:22:31 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:22:31 2024 Start to extract lead variants...v3.4.38
Sat Feb  3 01:22:31 2024  -Current Dataframe shape : 1000000 x 9 ; Memory usage: 75.89 MB
Sat Feb  3 01:22:31 2024  -Processing 1000000 variants...
Sat Feb  3 01:22:31 2024  -Significance threshold : 5e-08
Sat Feb  3 01:22:31 2024  -Sliding window size: 500  kb
Sat Feb  3 01:22:31 2024  -Using P for extracting lead variants...
Sat Feb  3 01:22:31 2024  -Found 543 significant variants in total...
Sat Feb  3 01:22:31 2024  -Identified 4 lead variants!
Sat Feb  3 01:22:31 2024 Finished extracting lead variants.
Sat Feb  3 01:22:32 2024 Start to retrieve data using EFO: MONDO_0005148...
Sat Feb  3 01:22:32 2024 Start to retrieve data from GWASCatalog...
Sat Feb  3 01:22:32 2024  -Please make sure your sumstats is based on GRCh38...
Sat Feb  3 01:22:32 2024  -Requesting (GET) trait information through the GWASCatalog API...
Sat Feb  3 01:22:32 2024  -EFO trait api: https://www.ebi.ac.uk/gwas/rest/api/efoTraits/MONDO_0005148
Sat Feb  3 01:22:33 2024  -Status code: 200
Sat Feb  3 01:22:33 2024  -Trait Name: type 2 diabetes mellitus
Sat Feb  3 01:22:33 2024  -Trait URL: http://purl.obolibrary.org/obo/MONDO_0005148
Sat Feb  3 01:22:33 2024  -Requesting (GET) GWAS associations through the GWASCatalog API...
Sat Feb  3 01:22:33 2024  -associationsByTraitSummary API: https://www.ebi.ac.uk/gwas/rest/api/efoTraits/MONDO_0005148/associations?projection=associationByEfoTrait
Sat Feb  3 01:22:33 2024  -Note: this step might take a while...
Sat Feb  3 01:25:44 2024  -Status code 200 OK: Retrieved data from GWASCatalog successffully ...
Sat Feb  3 01:25:44 2024  -Loading json ...
Sat Feb  3 01:25:45 2024  -Parsing json ...
Sat Feb  3 01:25:45 2024  -Number of reported associations for MONDO_0005148 in GWASCatalog: 5450
Sat Feb  3 01:25:45 2024  -Loading retrieved data into gwaslab Sumstats object ...
Sat Feb  3 01:25:50 2024 Finished retrieving data from GWASCatalog...
Sat Feb  3 01:25:50 2024  -Retrieved 4176 associations from GWAS catalog.
Sat Feb  3 01:25:50 2024  -Lead variants in known loci: 4176
Sat Feb  3 01:25:50 2024  -Checking the minimum distance between identified lead variants and provided known variants...
Sat Feb  3 01:25:50 2024  -Identified  4  known vairants in current sumstats...
Sat Feb  3 01:25:50 2024  -Identified  0  novel vairants in current sumstats...
Sat Feb  3 01:25:50 2024 Finished checking if lead variants are known.

Out[10]:

	SNPID	CHR	POS	EA	NEA	BETA	SE	P	STATUS	DISTANCE_TO_KNOWN	KNOWN_ID	KNOWN_PUBMED_ID	KNOWN_AUTHOR	NOVEL	LOCATION_OF_KNOWN
0	1:22068326_A_G	1	22068326	G	A	0.0621	0.0103	1.629000e-09	1960099	-279528	rs34759301	34594039	Sakaue S	False	Downstream
1	1:51103268_T_C	1	51103268	C	T	-0.0802	0.0120	2.519000e-11	1960099	-62054	rs12088739	30054458	Xue A	False	Downstream
2	1:154309595_TA_T	1	154309595	TA	T	-0.0915	0.0166	3.289000e-08	1960399	12189	rs1194606	32541925	Vujkovic M	False	Upstream
3	2:640986_CACAT_C	2	640986	C	CACAT	-0.0946	0.0150	2.665000e-10	1960399	1	rs72156956	30718926	Suzuki K	False	Upstream