Standardization Workflow
In [1]:
import gwaslab as gl
In [2]:
gl.show_version()
2024/12/20 13:17:13 GWASLab v3.5.4 https://cloufield.github.io/gwaslab/
2024/12/20 13:17:13 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
Load sample data
In [3]:
mysumstats = gl.Sumstats("../0_sample_data/toy_data/dirty_sumstats.tsv",fmt="gwaslab",other=["NOTE"])
2024/12/20 13:17:15 GWASLab v3.5.4 https://cloufield.github.io/gwaslab/
2024/12/20 13:17:15 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/12/20 13:17:15 Start to load format from formatbook....
2024/12/20 13:17:15 -gwaslab format meta info:
2024/12/20 13:17:15 - format_name : gwaslab
2024/12/20 13:17:15 - format_source : https://cloufield.github.io/gwaslab/
2024/12/20 13:17:15 - format_version : 20231220_v4
2024/12/20 13:17:15 Start to initialize gl.Sumstats from file :../0_sample_data/toy_data/dirty_sumstats.tsv
2024/12/20 13:17:15 -Reading columns : OR_95L,NEA,BETA,DIRECTION,OR_95U,P,N_CASE,CHISQ,MLOG10P,NOTE,EAF,N_CONTROL,Z,SE,SNPID,N,EA,OR,POS,CHR
2024/12/20 13:17:15 -Renaming columns to : OR_95L,NEA,BETA,DIRECTION,OR_95U,P,N_CASE,CHISQ,MLOG10P,NOTE,EAF,N_CONTROL,Z,SE,SNPID,N,EA,OR,POS,CHR
2024/12/20 13:17:15 -Current Dataframe shape : 63 x 20
2024/12/20 13:17:15 -Initiating a status column: STATUS ...
2024/12/20 13:17:15 #WARNING! Version of genomic coordinates is unknown...
2024/12/20 13:17:16 Start to reorder the columns...v3.5.4
2024/12/20 13:17:16 -Current Dataframe shape : 63 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:17:16 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE
2024/12/20 13:17:16 Finished reordering the columns.
2024/12/20 13:17:16 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE
2024/12/20 13:17:16 -DType : object string object category category float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 int64 object category object
2024/12/20 13:17:16 -Verified: T F F T T T T T T T T T T T T F T T T T NA
2024/12/20 13:17:16 #WARNING! Columns with possibly incompatible dtypes: CHR,POS,N
2024/12/20 13:17:16 -Current Dataframe memory usage: 21.48 MB
2024/12/20 13:17:16 Finished loading data successfully!
The toy sumstats are intentionally dirty; the issue affecting each row is described in its NOTE column.
In [4]:
mysumstats.data
Out[4]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1_G_A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Duplicated |
1 | 1:1_A_G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Duplicated |
2 | 1:1_A_G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Multiallelic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Clean sumstats |
63 rows × 21 columns
All-in-one function
In [5]:
mysumstats.basic_check(remove=True,remove_dup=True)
2024/12/20 13:17:27 Start to check SNPID/rsID...v3.5.4
2024/12/20 13:17:27 -Current Dataframe shape : 63 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:17:27 -Checking SNPID data type...
2024/12/20 13:17:27 -Converting SNPID to pd.string data type...
2024/12/20 13:17:27 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
2024/12/20 13:17:29 Finished checking SNPID/rsID.
2024/12/20 13:17:29 Start to fix chromosome notation (CHR)...v3.5.4
2024/12/20 13:17:29 -Current Dataframe shape : 63 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:17:29 -Checking CHR data type...
2024/12/20 13:17:29 -Variants with standardized chromosome notation: 56
2024/12/20 13:17:29 -Variants with fixable chromosome notations: 4
2024/12/20 13:17:29 -Variants with NA chromosome notations: 1
2024/12/20 13:17:29 -Variants with invalid chromosome notations: 2
2024/12/20 13:17:29 -A look at invalid chromosome notations: {'1.0001', '-1'}
2024/12/20 13:17:29 -Identifying non-autosomal chromosomes : X, Y, and MT ...
2024/12/20 13:17:29 -Identified 1 variants on sex chromosomes...
2024/12/20 13:17:29 -Standardizing sex chromosome notations: X to 23...
2024/12/20 13:17:31 -Valid CHR list: 1 - 25
2024/12/20 13:17:31 -Removed 5 variants with chromosome notations not in CHR list.
2024/12/20 13:17:31 -A look at chromosome notations not in CHR list: {'0', '300', <NA>}
2024/12/20 13:17:31 Finished fixing chromosome notation (CHR).
2024/12/20 13:17:31 Start to fix basepair positions (POS)...v3.5.4
2024/12/20 13:17:31 -Current Dataframe shape : 58 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:17:31 -Removing thousands separator "," or underbar "_" ...
2024/12/20 13:17:31 -Converting to Int64 data type ...
2024/12/20 13:17:31 -Force converting to Int64 data type ...
2024/12/20 13:17:33 -Position bound:(0 , 250,000,000)
2024/12/20 13:17:33 -Removed outliers: 2
2024/12/20 13:17:33 -Removed 4 variants with bad positions.
2024/12/20 13:17:33 Finished fixing basepair positions (POS).
2024/12/20 13:17:33 Start to fix alleles (EA and NEA)...v3.5.4
2024/12/20 13:17:33 -Current Dataframe shape : 54 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:33 -Converted all bases to string datatype and UPPERCASE.
2024/12/20 13:17:33 -Variants with bad EA : 1
2024/12/20 13:17:33 -Variants with bad NEA : 5
2024/12/20 13:17:33 -Variants with NA for EA or NEA: 1
2024/12/20 13:17:33 -Variants with same EA and NEA: 1
2024/12/20 13:17:33 -A look at the non-ATCG EA: {'<CN0>'} ...
2024/12/20 13:17:33 -A look at the non-ATCG NEA: {nan, '*', 'N', '<CN1>'} ...
2024/12/20 13:17:33 -Removed 5 variants with NA alleles or alleles that contain bases other than A/C/T/G.
2024/12/20 13:17:33 -Removed 1 variants with same allele for EA and NEA.
2024/12/20 13:17:38 Finished fixing alleles (EA and NEA).
2024/12/20 13:17:38 Start to perform sanity check for statistics...v3.5.4
2024/12/20 13:17:38 -Current Dataframe shape : 48 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:38 -Comparison tolerance for floats: 1e-07
2024/12/20 13:17:38 -Checking if 0 <= N <= 2147483647 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 25,26,27 ...
2024/12/20 13:17:38 -Examples of invalid values (N): 12345700000000000,NA,-1 ...
2024/12/20 13:17:38 -Removed 3 variants with bad/na N.
2024/12/20 13:17:38 -Checking if 0 <= N_CASE <= 2147483647 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 29 ...
2024/12/20 13:17:38 -Examples of invalid values (N_CASE): -1 ...
2024/12/20 13:17:38 -Removed 1 variants with bad/na N_CASE.
2024/12/20 13:17:38 -Checking if 0 <= N_CONTROL <= 2147483647 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 28 ...
2024/12/20 13:17:38 -Examples of invalid values (N_CONTROL): -1 ...
2024/12/20 13:17:38 -Removed 1 variants with bad/na N_CONTROL.
2024/12/20 13:17:38 -Checking if -1e-07 < EAF < 1.0000001 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 31,32,33 ...
2024/12/20 13:17:38 -Examples of invalid values (EAF): 1.02,-0.01,NA ...
2024/12/20 13:17:38 -Removed 3 variants with bad/na EAF.
2024/12/20 13:17:38 -Checking if -1e-07 < CHISQ < inf ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 38,39 ...
2024/12/20 13:17:38 -Examples of invalid values (CHISQ): -0.01,NA ...
2024/12/20 13:17:38 -Removed 2 variants with bad/na CHISQ.
2024/12/20 13:17:38 -Checking if -9999.0000001 < Z < 9999.0000001 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 40,41 ...
2024/12/20 13:17:38 -Examples of invalid values (Z): NA,999999.0 ...
2024/12/20 13:17:38 -Removed 2 variants with bad/na Z.
2024/12/20 13:17:38 -Checking if -1e-07 < P < 1.0000001 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 48,49,50 ...
2024/12/20 13:17:38 -Examples of invalid values (P): 1.1,-0.01,NA ...
2024/12/20 13:17:38 -Removed 3 variants with bad/na P.
2024/12/20 13:17:38 -Checking if -1e-07 < MLOG10P < 9999.0000001 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 51,52,53 ...
2024/12/20 13:17:38 -Examples of invalid values (MLOG10P): 12345.0,-0.1,NA ...
2024/12/20 13:17:38 -Removed 3 variants with bad/na MLOG10P.
2024/12/20 13:17:38 -Checking if -100.0000001 < BETA < 100.0000001 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 34,35 ...
2024/12/20 13:17:38 -Examples of invalid values (BETA): 99999.0,NA ...
2024/12/20 13:17:38 -Removed 2 variants with bad/na BETA.
2024/12/20 13:17:38 -Checking if -1e-07 < SE < inf ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 37 ...
2024/12/20 13:17:38 -Examples of invalid values (SE): NA ...
2024/12/20 13:17:38 -Removed 1 variants with bad/na SE.
2024/12/20 13:17:38 -Checking if -100.0000001 < OR < 100.0000001 ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 42,43 ...
2024/12/20 13:17:38 -Examples of invalid values (OR): 999999.0,NA ...
2024/12/20 13:17:38 -Removed 2 variants with bad/na OR.
2024/12/20 13:17:38 -Checking if -1e-07 < OR_95L < inf ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 44,45 ...
2024/12/20 13:17:38 -Examples of invalid values (OR_95L): -0.01,NA ...
2024/12/20 13:17:38 -Removed 2 variants with bad/na OR_95L.
2024/12/20 13:17:38 -Checking if -1e-07 < OR_95U < inf ...
2024/12/20 13:17:38 -Examples of invalid variants(SNPID): 46,47 ...
2024/12/20 13:17:38 -Examples of invalid values (OR_95U): -0.01,NA ...
2024/12/20 13:17:38 -Removed 2 variants with bad/na OR_95U.
2024/12/20 13:17:38 -Checking STATUS and converting STATUS to categories....
2024/12/20 13:17:38 -Removed 27 variants with bad statistics in total.
2024/12/20 13:17:38 -Data types for each column:
2024/12/20 13:17:38 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE
2024/12/20 13:17:38 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 float64 float64 float64 float64 float64 Int64 Int64 Int64 object category object
2024/12/20 13:17:38 -Verified: T T T T T T T T T T T T T T T T T T T T NA
2024/12/20 13:17:38 Finished sanity check for statistics.
2024/12/20 13:17:38 Start to check data consistency across columns...v3.5.4
2024/12/20 13:17:38 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:38 -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
2024/12/20 13:17:38 -Checking if BETA/SE-derived-MLOG10P is consistent with MLOG10P...
2024/12/20 13:17:38 -Variants with inconsistent values were not detected.
2024/12/20 13:17:38 -Checking if BETA/SE-derived-P is consistent with P...
2024/12/20 13:17:38 -Variants with inconsistent values were not detected.
2024/12/20 13:17:38 -Checking if MLOG10P-derived-P is consistent with P...
2024/12/20 13:17:38 -Variants with inconsistent values were not detected.
2024/12/20 13:17:38 -Checking if N is consistent with N_CASE + N_CONTROL ...
2024/12/20 13:17:38 -Not consistent: 1 variant(s)
2024/12/20 13:17:38 -Variant SNPID with max difference: 30 with 10000
2024/12/20 13:17:38 -Note: if the max difference is greater than expected, please check your original sumstats.
2024/12/20 13:17:38 Finished checking data consistency across columns.
2024/12/20 13:17:38 Start to normalize indels...v3.5.4
2024/12/20 13:17:38 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 -Not normalized allele IDs:22 ...
2024/12/20 13:17:39 -Not normalized allele:['AT' 'GT']...
2024/12/20 13:17:39 -Modified 1 variants according to parsimony and left alignment principal.
2024/12/20 13:17:39 Finished normalizing indels.
2024/12/20 13:17:39 Start to remove duplicated/multiallelic variants...v3.5.4
2024/12/20 13:17:39 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 -Removing mode:dm
2024/12/20 13:17:39 Start to sort the sumstats using P...
2024/12/20 13:17:39 Start to remove duplicated variants based on snpid...v3.5.4
2024/12/20 13:17:39 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 -Which variant to keep: first
2024/12/20 13:17:39 -Removed 2 based on SNPID...
2024/12/20 13:17:39 Start to remove duplicated variants based on CHR,POS,EA and NEA...
2024/12/20 13:17:39 -Current Dataframe shape : 19 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 -Which variant to keep: first
2024/12/20 13:17:39 -Removed 1 based on CHR,POS,EA and NEA...
2024/12/20 13:17:39 Start to remove multiallelic variants based on chr:pos...
2024/12/20 13:17:39 -Current Dataframe shape : 18 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 -Which variant to keep: first
2024/12/20 13:17:39 -Removed 1 multiallelic variants...
2024/12/20 13:17:39 -Removed 4 variants in total.
2024/12/20 13:17:39 -Sort the coordinates based on CHR and POS...
2024/12/20 13:17:39 Finished removing duplicated/multiallelic variants.
2024/12/20 13:17:39 Start to sort the genome coordinates...v3.5.4
2024/12/20 13:17:39 -Current Dataframe shape : 17 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 Finished sorting coordinates.
2024/12/20 13:17:39 Start to reorder the columns...v3.5.4
2024/12/20 13:17:39 -Current Dataframe shape : 17 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:17:39 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE
2024/12/20 13:17:39 Finished reordering the columns.
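The sanity-check lines in the log compare each statistic against bounds widened by a float tolerance (for example `-1e-07 < P < 1.0000001`). The following is a minimal pure-Python sketch of that kind of check, not GWASLab's actual implementation; the helper name `in_open_range` is hypothetical, and the sample values are the invalid P values shown in the log plus one valid value added for illustration.

```python
import math

def in_open_range(value, lower, upper, tol=1e-7):
    """Return True if lower - tol < value < upper + tol; missing values fail.

    Hypothetical helper for illustration only (not part of GWASLab).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    return (lower - tol) < value < (upper + tol)

# 1.1, -0.01 and NA come from the "invalid values (P)" log line;
# 5e-8 is an illustrative valid P value.
p_values = [1.1, -0.01, float("nan"), 5e-8]
kept = [p for p in p_values if in_open_range(p, 0.0, 1.0)]
print(kept)
```

Widening the bounds by the tolerance keeps values such as exactly 1.0 that could otherwise be lost to floating-point rounding.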
In [5]:
mysumstats.data
Out[5]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1_G_A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
2 | 1:3_T_GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
3 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
4 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
5 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
6 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
7 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
8 | 22 | 1 | 22 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Not normalizated allelels |
9 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
10 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
11 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
12 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
13 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
14 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
15 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
16 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
17 rows × 21 columns
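In the table above, the variant with SNPID 22 went from alleles `AT`/`GT` to `A`/`G`, which matches the "parsimony and left alignment" message in the log. The sketch below shows only the reference-free trimming part of that idea and is an assumption about the general technique, not GWASLab's code; `normalize_alleles` is a hypothetical name, and shifting indels along a reference genome is not covered.

```python
def normalize_alleles(pos, ea, nea):
    """Trim bases shared by both alleles, keeping at least one base in each.

    Hypothetical helper for illustration only (not part of GWASLab).
    """
    # Trim shared trailing bases first; the position is unaffected.
    while len(ea) > 1 and len(nea) > 1 and ea[-1] == nea[-1]:
        ea, nea = ea[:-1], nea[:-1]
    # Then trim shared leading bases; each removal shifts the position by one.
    while len(ea) > 1 and len(nea) > 1 and ea[0] == nea[0]:
        ea, nea = ea[1:], nea[1:]
        pos += 1
    return pos, ea, nea

print(normalize_alleles(22, "AT", "GT"))  # (22, 'A', 'G')
print(normalize_alleles(54, "AG", "A"))   # (54, 'AG', 'A'), already minimal
```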
Separate functions
In [7]:
# Reload the original dirty sumstats so that each step can be run separately
mysumstats = gl.Sumstats("../0_sample_data/toy_data/dirty_sumstats.tsv",fmt="gwaslab",other=["NOTE"])
2024/12/20 13:18:36 GWASLab v3.5.4 https://cloufield.github.io/gwaslab/
2024/12/20 13:18:36 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/12/20 13:18:36 Start to load format from formatbook....
2024/12/20 13:18:36 -gwaslab format meta info:
2024/12/20 13:18:36 - format_name : gwaslab
2024/12/20 13:18:36 - format_source : https://cloufield.github.io/gwaslab/
2024/12/20 13:18:36 - format_version : 20231220_v4
2024/12/20 13:18:36 Start to initialize gl.Sumstats from file :../0_sample_data/toy_data/dirty_sumstats.tsv
2024/12/20 13:18:36 -Reading columns : OR_95L,NEA,BETA,DIRECTION,OR_95U,P,N_CASE,CHISQ,MLOG10P,NOTE,EAF,N_CONTROL,Z,SE,SNPID,N,EA,OR,POS,CHR
2024/12/20 13:18:36 -Renaming columns to : OR_95L,NEA,BETA,DIRECTION,OR_95U,P,N_CASE,CHISQ,MLOG10P,NOTE,EAF,N_CONTROL,Z,SE,SNPID,N,EA,OR,POS,CHR
2024/12/20 13:18:36 -Current Dataframe shape : 63 x 20
2024/12/20 13:18:36 -Initiating a status column: STATUS ...
2024/12/20 13:18:36 #WARNING! Version of genomic coordinates is unknown...
2024/12/20 13:18:36 Start to reorder the columns...v3.5.4
2024/12/20 13:18:36 -Current Dataframe shape : 63 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:18:36 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE
2024/12/20 13:18:36 Finished reordering the columns.
2024/12/20 13:18:36 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE
2024/12/20 13:18:36 -DType : object string object category category float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 int64 object category object
2024/12/20 13:18:36 -Verified: T F F T T T T T T T T T T T T F T T T T NA
2024/12/20 13:18:36 #WARNING! Columns with possibly incompatible dtypes: CHR,POS,N
2024/12/20 13:18:36 -Current Dataframe memory usage: 21.48 MB
2024/12/20 13:18:36 Finished loading data successfully!
Fix ID
In [8]:
mysumstats.fix_id(fixsep=True)
2024/12/20 13:18:38 Start to check SNPID/rsID...v3.5.4
2024/12/20 13:18:38 -Current Dataframe shape : 63 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:18:38 -Checking SNPID data type...
2024/12/20 13:18:38 -Converting SNPID to pd.string data type...
2024/12/20 13:18:38 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
2024/12/20 13:18:40 -Replacing [_-] in SNPID with ":" ...
2024/12/20 13:18:40 Finished checking SNPID/rsID.
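The log reports that `_` and `-` in SNPID are replaced with `:` when the ID looks like CHR:POS:NEA:EA. A rough standalone sketch of that idea follows; it assumes a simple regular expression for the CHR:POS:allele:allele shape and is not GWASLab's actual logic. The function name, the pattern, and the `1-3-T-GA` and `rs_123` inputs are hypothetical.

```python
import re

# Assumed shape of a CHR:POS:allele:allele ID with any of the separators : _ -
CHR_POS_ALLELES = re.compile(
    r"^(chr)?[0-9XYMT]+[:_-][0-9]+[:_-][ACGT]+[:_-][ACGT]+$",
    re.IGNORECASE,
)

def fix_snpid_separator(snpid):
    """Replace '_' and '-' with ':' only when the ID matches the assumed shape."""
    if CHR_POS_ALLELES.match(snpid):
        return re.sub(r"[_-]", ":", snpid)
    return snpid

# "1:1_G_A" and "1:2" appear in the toy data; the other two are made up.
for raw in ["1:1_G_A", "1:2", "1-3-T-GA", "rs_123"]:
    print(raw, "->", fix_snpid_separator(raw))
```

IDs that do not match the pattern, such as `1:2`, are left untouched, which is consistent with the table that follows.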
In [9]:
mysumstats.data
Out[9]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9969999 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9969999 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9969999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Multiallelic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Clean sumstats |
63 rows × 21 columns
Fix chromosome
In [10]:
mysumstats.fix_chr(remove=True)
2024/12/20 13:18:40 Start to fix chromosome notation (CHR)...v3.5.4
2024/12/20 13:18:40 -Current Dataframe shape : 63 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:18:40 -Checking CHR data type...
2024/12/20 13:18:40 -Variants with standardized chromosome notation: 56
2024/12/20 13:18:40 -Variants with fixable chromosome notations: 4
2024/12/20 13:18:40 -Variants with NA chromosome notations: 1
2024/12/20 13:18:40 -Variants with invalid chromosome notations: 2
2024/12/20 13:18:40 -A look at invalid chromosome notations: {'1.0001', '-1'}
2024/12/20 13:18:40 -Identifying non-autosomal chromosomes : X, Y, and MT ...
2024/12/20 13:18:40 -Identified 1 variants on sex chromosomes...
2024/12/20 13:18:40 -Standardizing sex chromosome notations: X to 23...
2024/12/20 13:18:42 -Valid CHR list: 1 - 25
2024/12/20 13:18:42 -Removed 5 variants with chromosome notations not in CHR list.
2024/12/20 13:18:42 -A look at chromosome notations not in CHR list: {'0', '300', <NA>}
2024/12/20 13:18:42 Finished fixing chromosome notation (CHR).
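To make the log above more concrete, here is a small sketch of chromosome standardization: strip a `chr` prefix, map sex and mitochondrial chromosomes to integers, and reject anything outside 1 to 25. It is an illustration under assumptions, not GWASLab's implementation. The log only confirms `X to 23` and the valid range 1 - 25; the `Y` to 24 and `MT` to 25 mappings and the name `standardize_chr` are assumptions.

```python
import re

# X -> 23 is shown in the log; Y -> 24 and MT -> 25 are assumed here.
SEX_AND_MT = {"X": 23, "Y": 24, "MT": 25}

def standardize_chr(raw):
    """Return an int in 1..25, or None when the notation cannot be fixed."""
    if raw is None:
        return None
    text = re.sub(r"^CHR", "", str(raw).strip().upper())  # drop a "chr" prefix
    if text in SEX_AND_MT:
        return SEX_AND_MT[text]
    if not re.fullmatch(r"[0-9]+", text):
        return None  # e.g. '1.0001' or '-1', reported as invalid in the log
    number = int(text)
    return number if 1 <= number <= 25 else None  # e.g. '0' and '300' are dropped

for raw in ["1", "chr1", "X", "1.0001", "-1", "0", "300", None]:
    print(repr(raw), "->", standardize_chr(raw))
```

With `remove=True`, rows whose chromosome cannot be standardized are dropped, which corresponds to the values mapped to `None` in this sketch.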
In [11]:
mysumstats.data
Out[11]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHR with prefix |
16 | 10 | 1 | 123,456,789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS with separator |
17 | 11 | 1 | -1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS out of normal range |
18 | 12 | 1 | 1.23214E+13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS out of normal range |
19 | 13 | 1 | 13.00000001 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS float |
20 | 14 | 1 | NaN | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS missing |
21 | 13 | 1 | abc | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS string |
22 | 15 | 1 | 15 | A | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Same alleles |
23 | 16 | 1 | 16 | a | g | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Lowercase alleles |
24 | 17 | 1 | 17 | A | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
25 | 18 | 1 | 18 | <CN0> | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
26 | 19 | 1 | 19 | A | * | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
27 | 20 | 1 | 20 | A | N | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
28 | 21 | 1 | 21 | A | NaN | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Allele missing |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Not normalizated allelels |
30 | 23 | 1 | 23 | At | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600001e+05 | 120000 | 40000 | --++ | 9985999 | N float |
32 | 25 | 1 | 25 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.234570e+16 | 120000 | 40000 | --++ | 9985999 | N out of range |
33 | 26 | 1 | 26 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | NaN | 120000 | 40000 | --++ | 9985999 | N missing |
34 | 27 | 1 | 27 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | -1.000000e+00 | 120000 | 40000 | --++ | 9985999 | N out of range |
35 | 28 | 1 | 28 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | -1 | --++ | 9985999 | N_CASE out of range |
36 | 29 | 1 | 29 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | -1 | 40000 | --++ | 9985999 | N_CONTROL out of range |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 130000 | 40000 | --++ | 9985999 | N!=N_CONTROL +N_CASE |
38 | 31 | 1 | 31 | A | G | 1.020 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | EAF out of range |
39 | 32 | 1 | 32 | A | G | -0.010 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | EAF out of range |
40 | 33 | 1 | 33 | A | G | NaN | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | EAF missing |
41 | 34 | 1 | 34 | A | G | 0.996 | 99999.0000 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | BETA out of range |
42 | 35 | 1 | 35 | A | G | 0.996 | NaN | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | BETA missing |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | SE out of range |
44 | 37 | 1 | 37 | A | G | 0.996 | 0.0603 | NaN | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | SE missing |
45 | 38 | 1 | 38 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | -0.010000 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHISQ out of range |
46 | 39 | 1 | 39 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | NaN | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHISQ missing |
47 | 40 | 1 | 40 | A | G | 0.996 | 0.0603 | 0.0103 | NaN | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Z missing |
48 | 41 | 1 | 41 | A | G | 0.996 | 0.0603 | 0.0103 | 999999.000000 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Z out of range |
49 | 42 | 1 | 42 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 999999.000000 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR out of range |
50 | 43 | 1 | 43 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | NaN | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR missing |
51 | 44 | 1 | 44 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | -0.010000 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95L out of range |
52 | 45 | 1 | 45 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | NaN | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95L missing |
53 | 46 | 1 | 46 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | -0.010000 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95U out of range |
54 | 47 | 1 | 47 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | NaN | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95U missing |
55 | 48 | 1 | 48 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.041393 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | P out of range |
56 | 49 | 1 | 49 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | P out of range |
57 | 50 | 1 | 50 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | P missing |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Clean sumstats |
58 rows × 21 columns
fix position
In [12]:
mysumstats.fix_pos(remove=True)
2024/12/20 13:18:42 Start to fix basepair positions (POS)...v3.5.4
2024/12/20 13:18:42 -Current Dataframe shape : 58 x 21 ; Memory usage: 21.48 MB
2024/12/20 13:18:42 -Removing thousands separator "," or underbar "_" ...
2024/12/20 13:18:42 -Converting to Int64 data type ...
2024/12/20 13:18:42 -Force converting to Int64 data type ...
2024/12/20 13:18:44 -Position bound:(0 , 250,000,000)
2024/12/20 13:18:44 -Removed outliers: 2
2024/12/20 13:18:44 -Removed 4 variants with bad positions.
2024/12/20 13:18:44 Finished fixing basepair positions (POS).
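The log lists the cleaning steps in order: separators are stripped, values are cast to integers, and positions outside the reported bound are dropped. The pure-Python sketch below mimics that sequence on a few made-up values. It is not GWASLab's implementation: the helper name `clean_pos`, the sample inputs, the decision to reject non-integral numbers, and the choice to treat the bound as exclusive at 0 and inclusive at 250,000,000 are all assumptions made for illustration.

```python
from typing import Optional

# Bounds taken from the "Position bound" log line. Whether GWASLab treats
# them as inclusive or exclusive is an assumption in this sketch.
POS_LOWER = 0
POS_UPPER = 250_000_000


def clean_pos(raw: object) -> Optional[int]:
    """Return a cleaned integer position, or None if the value is unusable."""
    if raw is None:
        return None
    # Strip thousands separators and underscores, as the log describes.
    text = str(raw).strip().replace(",", "").replace("_", "")
    try:
        value = float(text)
    except ValueError:
        return None  # not a number at all
    # Reject NaN and infinities.
    if value != value or value in (float("inf"), float("-inf")):
        return None
    # Assumption: only whole numbers (e.g. "13.0") count as valid positions.
    if not value.is_integer():
        return None
    pos = int(value)
    # Drop positions outside the bound reported in the log.
    if not (POS_LOWER < pos <= POS_UPPER):
        return None
    return pos


# Hypothetical raw values, not taken from dirty_sumstats.tsv.
raw_positions = ["123,456,789", "13.0", "1_000", "300000000", "-5", "abc", None]
cleaned = [clean_pos(p) for p in raw_positions]
print(cleaned)
```

Values that come back as `None` correspond to the rows that `remove=True` would drop in this simplified model.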
In [13]:
mysumstats.data
Out[13]:
 | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | POS float |
22 | 15 | 1 | 15 | A | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Same alleles |
23 | 16 | 1 | 16 | a | g | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Lowercase alleles |
24 | 17 | 1 | 17 | A | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
25 | 18 | 1 | 18 | <CN0> | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
26 | 19 | 1 | 19 | A | * | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
27 | 20 | 1 | 20 | A | N | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
28 | 21 | 1 | 21 | A | NaN | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Allele missing |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Not normalizated allelels |
30 | 23 | 1 | 23 | At | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600001e+05 | 120000 | 40000 | --++ | 9980999 | N float |
32 | 25 | 1 | 25 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.234570e+16 | 120000 | 40000 | --++ | 9980999 | N out of range |
33 | 26 | 1 | 26 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | NaN | 120000 | 40000 | --++ | 9980999 | N missing |
34 | 27 | 1 | 27 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | -1.000000e+00 | 120000 | 40000 | --++ | 9980999 | N out of range |
35 | 28 | 1 | 28 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | -1 | --++ | 9980999 | N_CASE out of range |
36 | 29 | 1 | 29 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | -1 | 40000 | --++ | 9980999 | N_CONTROL out of range |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 130000 | 40000 | --++ | 9980999 | N!=N_CONTROL +N_CASE |
38 | 31 | 1 | 31 | A | G | 1.020 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | EAF out of range |
39 | 32 | 1 | 32 | A | G | -0.010 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | EAF out of range |
40 | 33 | 1 | 33 | A | G | NaN | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | EAF missing |
41 | 34 | 1 | 34 | A | G | 0.996 | 99999.0000 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | BETA out of range |
42 | 35 | 1 | 35 | A | G | 0.996 | NaN | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | BETA missing |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | SE out of range |
44 | 37 | 1 | 37 | A | G | 0.996 | 0.0603 | NaN | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | SE missing |
45 | 38 | 1 | 38 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | -0.010000 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHISQ out of range |
46 | 39 | 1 | 39 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | NaN | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHISQ missing |
47 | 40 | 1 | 40 | A | G | 0.996 | 0.0603 | 0.0103 | NaN | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Z missing |
48 | 41 | 1 | 41 | A | G | 0.996 | 0.0603 | 0.0103 | 999999.000000 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Z out of range |
49 | 42 | 1 | 42 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 999999.000000 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR out of range |
50 | 43 | 1 | 43 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | NaN | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR missing |
51 | 44 | 1 | 44 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | -0.010000 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95L out of range |
52 | 45 | 1 | 45 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | NaN | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95L missing |
53 | 46 | 1 | 46 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | -0.010000 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95U out of range |
54 | 47 | 1 | 47 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | NaN | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95U missing |
55 | 48 | 1 | 48 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.041393 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | P out of range |
56 | 49 | 1 | 49 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | P out of range |
57 | 50 | 1 | 50 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | P missing |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Clean sumstats |
54 rows × 21 columns
fix allele
In [14]:
mysumstats.fix_allele(remove=True)
2024/12/20 13:18:44 Start to fix alleles (EA and NEA)...v3.5.4
2024/12/20 13:18:44 -Current Dataframe shape : 54 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:44 -Converted all bases to string datatype and UPPERCASE.
2024/12/20 13:18:44 -Variants with bad EA : 1
2024/12/20 13:18:44 -Variants with bad NEA : 5
2024/12/20 13:18:44 -Variants with NA for EA or NEA: 1
2024/12/20 13:18:44 -Variants with same EA and NEA: 1
2024/12/20 13:18:44 -A look at the non-ATCG EA: {'<CN0>'} ...
2024/12/20 13:18:44 -A look at the non-ATCG NEA: {nan, '*', 'N', '<CN1>'} ...
2024/12/20 13:18:44 -Removed 5 variants with NA alleles or alleles that contain bases other than A/C/T/G.
2024/12/20 13:18:44 -Removed 1 variants with same allele for EA and NEA.
2024/12/20 13:18:49 Finished fixing alleles (EA and NEA).
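The rules reported in this log (uppercase everything, drop missing or non-A/C/T/G alleles, drop pairs where EA equals NEA) can be sketched in a few lines of plain Python. This is only an illustration of the logic the log describes, not GWASLab's code: the function name `standardize_alleles` is hypothetical, and the allele pairs are copied from the EA and NEA columns of the table shown before this step.

```python
from typing import Optional, Tuple

VALID_BASES = set("ACGT")


def standardize_alleles(ea: object, nea: object) -> Optional[Tuple[str, str]]:
    """Uppercase both alleles; return None when the pair should be dropped."""
    if ea is None or nea is None:
        return None  # missing allele
    ea_up, nea_up = str(ea).upper(), str(nea).upper()
    for allele in (ea_up, nea_up):
        # Empty strings and anything with characters outside A/C/G/T are bad.
        # A float NaN becomes the string "NAN" and is rejected here as well.
        if not allele or not set(allele) <= VALID_BASES:
            return None
    if ea_up == nea_up:
        return None  # same allele for EA and NEA
    return ea_up, nea_up


# (EA, NEA) pairs as they appear in the table before fix_allele().
pairs = [("a", "g"), ("At", "A"), ("AT", "GT"), ("A", "<CN1>"),
         ("<CN0>", "<CN1>"), ("A", "*"), ("A", "N"), ("A", None), ("A", "A")]
for ea, nea in pairs:
    print((ea, nea), "->", standardize_alleles(ea, nea))
```

Note that `("AT", "GT")` passes this check: the pair is not normalized, but it contains only valid bases, which matches the table after this step, where that row is kept.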
In [15]:
mysumstats.data
Out[15]:
 | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | POS float |
23 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980599 | Not normalizated allelels |
30 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600001e+05 | 120000 | 40000 | --++ | 9980099 | N float |
32 | 25 | 1 | 25 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.234570e+16 | 120000 | 40000 | --++ | 9980099 | N out of range |
33 | 26 | 1 | 26 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | NaN | 120000 | 40000 | --++ | 9980099 | N missing |
34 | 27 | 1 | 27 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | -1.000000e+00 | 120000 | 40000 | --++ | 9980099 | N out of range |
35 | 28 | 1 | 28 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | -1 | --++ | 9980099 | N_CASE out of range |
36 | 29 | 1 | 29 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | -1 | 40000 | --++ | 9980099 | N_CONTROL out of range |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
38 | 31 | 1 | 31 | A | G | 1.020 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | EAF out of range |
39 | 32 | 1 | 32 | A | G | -0.010 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | EAF out of range |
40 | 33 | 1 | 33 | A | G | NaN | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | EAF missing |
41 | 34 | 1 | 34 | A | G | 0.996 | 99999.0000 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | BETA out of range |
42 | 35 | 1 | 35 | A | G | 0.996 | NaN | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | BETA missing |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | SE out of range |
44 | 37 | 1 | 37 | A | G | 0.996 | 0.0603 | NaN | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | SE missing |
45 | 38 | 1 | 38 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | -0.010000 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHISQ out of range |
46 | 39 | 1 | 39 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | NaN | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHISQ missing |
47 | 40 | 1 | 40 | A | G | 0.996 | 0.0603 | 0.0103 | NaN | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Z missing |
48 | 41 | 1 | 41 | A | G | 0.996 | 0.0603 | 0.0103 | 999999.000000 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Z out of range |
49 | 42 | 1 | 42 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 999999.000000 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR out of range |
50 | 43 | 1 | 43 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | NaN | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR missing |
51 | 44 | 1 | 44 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | -0.010000 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95L out of range |
52 | 45 | 1 | 45 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | NaN | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95L missing |
53 | 46 | 1 | 46 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | -0.010000 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95U out of range |
54 | 47 | 1 | 47 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | NaN | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95U missing |
55 | 48 | 1 | 48 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.041393 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | P out of range |
56 | 49 | 1 | 49 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | P out of range |
57 | 50 | 1 | 50 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | P missing |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
48 rows × 21 columns
sanity check for statistics
In [16]:
mysumstats.check_sanity()
2024/12/20 13:18:49 Start to perform sanity check for statistics...v3.5.4
2024/12/20 13:18:49 -Current Dataframe shape : 48 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:49 -Comparison tolerance for floats: 1e-07
2024/12/20 13:18:49 -Checking if 0 <= N <= 2147483647 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 25,26,27 ...
2024/12/20 13:18:49 -Examples of invalid values (N): 12345700000000000,NA,-1 ...
2024/12/20 13:18:49 -Removed 3 variants with bad/na N.
2024/12/20 13:18:49 -Checking if 0 <= N_CASE <= 2147483647 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 29 ...
2024/12/20 13:18:49 -Examples of invalid values (N_CASE): -1 ...
2024/12/20 13:18:49 -Removed 1 variants with bad/na N_CASE.
2024/12/20 13:18:49 -Checking if 0 <= N_CONTROL <= 2147483647 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 28 ...
2024/12/20 13:18:49 -Examples of invalid values (N_CONTROL): -1 ...
2024/12/20 13:18:49 -Removed 1 variants with bad/na N_CONTROL.
2024/12/20 13:18:49 -Checking if -1e-07 < EAF < 1.0000001 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 31,32,33 ...
2024/12/20 13:18:49 -Examples of invalid values (EAF): 1.02,-0.01,NA ...
2024/12/20 13:18:49 -Removed 3 variants with bad/na EAF.
2024/12/20 13:18:49 -Checking if -1e-07 < CHISQ < inf ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 38,39 ...
2024/12/20 13:18:49 -Examples of invalid values (CHISQ): -0.01,NA ...
2024/12/20 13:18:49 -Removed 2 variants with bad/na CHISQ.
2024/12/20 13:18:49 -Checking if -9999.0000001 < Z < 9999.0000001 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 40,41 ...
2024/12/20 13:18:49 -Examples of invalid values (Z): NA,999999.0 ...
2024/12/20 13:18:49 -Removed 2 variants with bad/na Z.
2024/12/20 13:18:49 -Checking if -1e-07 < P < 1.0000001 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 48,49,50 ...
2024/12/20 13:18:49 -Examples of invalid values (P): 1.1,-0.01,NA ...
2024/12/20 13:18:49 -Removed 3 variants with bad/na P.
2024/12/20 13:18:49 -Checking if -1e-07 < MLOG10P < 9999.0000001 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 51,52,53 ...
2024/12/20 13:18:49 -Examples of invalid values (MLOG10P): 12345.0,-0.1,NA ...
2024/12/20 13:18:49 -Removed 3 variants with bad/na MLOG10P.
2024/12/20 13:18:49 -Checking if -100.0000001 < BETA < 100.0000001 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 34,35 ...
2024/12/20 13:18:49 -Examples of invalid values (BETA): 99999.0,NA ...
2024/12/20 13:18:49 -Removed 2 variants with bad/na BETA.
2024/12/20 13:18:49 -Checking if -1e-07 < SE < inf ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 37 ...
2024/12/20 13:18:49 -Examples of invalid values (SE): NA ...
2024/12/20 13:18:49 -Removed 1 variants with bad/na SE.
2024/12/20 13:18:49 -Checking if -100.0000001 < OR < 100.0000001 ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 42,43 ...
2024/12/20 13:18:49 -Examples of invalid values (OR): 999999.0,NA ...
2024/12/20 13:18:49 -Removed 2 variants with bad/na OR.
2024/12/20 13:18:49 -Checking if -1e-07 < OR_95L < inf ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 44,45 ...
2024/12/20 13:18:49 -Examples of invalid values (OR_95L): -0.01,NA ...
2024/12/20 13:18:49 -Removed 2 variants with bad/na OR_95L.
2024/12/20 13:18:49 -Checking if -1e-07 < OR_95U < inf ...
2024/12/20 13:18:49 -Examples of invalid variants(SNPID): 46,47 ...
2024/12/20 13:18:49 -Examples of invalid values (OR_95U): -0.01,NA ...
2024/12/20 13:18:49 -Removed 2 variants with bad/na OR_95U.
2024/12/20 13:18:49 -Checking STATUS and converting STATUS to categories....
2024/12/20 13:18:50 -Removed 27 variants with bad statistics in total.
2024/12/20 13:18:50 -Data types for each column:
2024/12/20 13:18:50 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE
2024/12/20 13:18:50 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 float64 float64 float64 float64 float64 Int64 Int64 Int64 object category object
2024/12/20 13:18:50 -Verified: T T T T T T T T T T T T T T T T T T T T NA
2024/12/20 13:18:50 Finished sanity check for statistics.
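Each "Checking if ..." line in this log is a range test on one column, and a variant is removed when its value is missing or falls outside that range. The sketch below restates those tests in plain Python so the thresholds can be tried on individual values. The bounds are copied from the log; the names `RANGES` and `failed_checks`, and the dictionary-per-row representation, are assumptions for illustration and do not reflect how GWASLab implements the check.

```python
import math
from typing import Dict, List, Optional, Tuple

# column -> (lower, upper, inclusive), copied from the check_sanity() log.
# "inclusive" mirrors the "<=" used for the N columns versus "<" elsewhere.
RANGES: Dict[str, Tuple[float, float, bool]] = {
    "N": (0, 2147483647, True),
    "N_CASE": (0, 2147483647, True),
    "N_CONTROL": (0, 2147483647, True),
    "EAF": (-1e-07, 1.0000001, False),
    "CHISQ": (-1e-07, math.inf, False),
    "Z": (-9999.0000001, 9999.0000001, False),
    "P": (-1e-07, 1.0000001, False),
    "MLOG10P": (-1e-07, 9999.0000001, False),
    "BETA": (-100.0000001, 100.0000001, False),
    "SE": (-1e-07, math.inf, False),
    "OR": (-100.0000001, 100.0000001, False),
    "OR_95L": (-1e-07, math.inf, False),
    "OR_95U": (-1e-07, math.inf, False),
}


def failed_checks(row: Dict[str, Optional[float]]) -> List[str]:
    """Return the columns in `row` whose value is missing or out of range."""
    failed = []
    for column, (lower, upper, inclusive) in RANGES.items():
        if column not in row:
            continue  # only test the statistics that are present
        value = row[column]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            failed.append(column)  # NA values fail the check
            continue
        in_range = (lower <= value <= upper) if inclusive else (lower < value < upper)
        if not in_range:
            failed.append(column)
    return failed


# Invalid values quoted in the log, plus one clean row for comparison.
print(failed_checks({"EAF": 1.02, "BETA": 0.0603}))
print(failed_checks({"N": -1, "P": None}))
print(failed_checks({"EAF": 0.996, "BETA": 0.0603, "SE": 0.0103, "P": 4.7e-09}))
```

A row would be dropped in this model whenever `failed_checks` returns a non-empty list.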
In [17]:
mysumstats.data
Out[17]:
 | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
23 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980599 | Not normalizated allelels |
30 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
21 rows × 21 columns
Check data consistency¶
In [18]:
mysumstats.check_data_consistency()
2024/12/20 13:18:50 Start to check data consistency across columns...v3.5.4
2024/12/20 13:18:50 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:50 -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
2024/12/20 13:18:50 -Checking if BETA/SE-derived-MLOG10P is consistent with MLOG10P...
2024/12/20 13:18:50 -Variants with inconsistent values were not detected.
2024/12/20 13:18:50 -Checking if BETA/SE-derived-P is consistent with P...
2024/12/20 13:18:50 -Variants with inconsistent values were not detected.
2024/12/20 13:18:50 -Checking if MLOG10P-derived-P is consistent with P...
2024/12/20 13:18:50 -Variants with inconsistent values were not detected.
2024/12/20 13:18:50 -Checking if N is consistent with N_CASE + N_CONTROL ...
2024/12/20 13:18:50 -Not consistent: 1 variant(s)
2024/12/20 13:18:50 -Variant SNPID with max difference: 30 with 10000
2024/12/20 13:18:50 -Note: if the max difference is greater than expected, please check your original sumstats.
2024/12/20 13:18:50 Finished checking data consistency across columns.
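The `N` versus `N_CASE + N_CONTROL` check in the log above can be reproduced by hand. The sketch below is a plain-Python illustration, not GWASLab's code: `is_close` and `find_inconsistent_n` are hypothetical helpers, and combining the relative and absolute tolerances as `atol + rtol * |b|` (the `numpy.isclose` convention) is an assumption. The two rows are SNPIDs 24 and 30 from the table above.

```python
def is_close(a, b, rtol=0.001, atol=0.001):
    """Tolerance test, assumed here to follow numpy.isclose: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)

def find_inconsistent_n(rows):
    """Return (SNPID, difference) pairs where N deviates from N_CASE + N_CONTROL."""
    flagged = []
    for row in rows:
        expected = row["N_CASE"] + row["N_CONTROL"]
        if not is_close(row["N"], expected):
            flagged.append((row["SNPID"], abs(row["N"] - expected)))
    return flagged

# Sample sizes taken from the table above (SNPID 24 is consistent, SNPID 30 is not).
rows = [
    {"SNPID": "24", "N": 160000, "N_CASE": 120000, "N_CONTROL": 40000},
    {"SNPID": "30", "N": 160000, "N_CASE": 130000, "N_CONTROL": 40000},
]

inconsistent = find_inconsistent_n(rows)
print(inconsistent)
```

For SNPID 30 the sum is 130000 + 40000 = 170000, which differs from N = 160000 by 10000, the same maximum difference the log reports. Note that `check_data_consistency()` only reports the mismatch; the row is still present in the later tables.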
Normalize variants¶
In [19]:
mysumstats.normalize_allele()
2024/12/20 13:18:50 Start to normalize indels...v3.5.4
2024/12/20 13:18:50 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:50 -Not normalized allele IDs:22 ...
2024/12/20 13:18:50 -Not normalized allele:['AT' 'GT']...
2024/12/20 13:18:50 -Modified 1 variants according to parsimony and left alignment principal.
2024/12/20 13:18:50 Finished normalizing indels.
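To make the parsimony step concrete, the following is a minimal sketch of reference-free allele trimming. It is an illustration under stated assumptions rather than GWASLab's implementation: `normalize_alleles` is a hypothetical function, and it only trims bases shared by both alleles. Full left alignment of indels in repetitive sequence would additionally need the reference genome, which this sketch does not use.

```python
def normalize_alleles(pos, ea, nea):
    """Trim bases shared by both alleles, keeping at least one base in each."""
    # Step 1: drop shared trailing bases; the position is unchanged.
    while len(ea) > 1 and len(nea) > 1 and ea[-1] == nea[-1]:
        ea, nea = ea[:-1], nea[:-1]
    # Step 2: drop shared leading bases; each trimmed base shifts POS right by one.
    while len(ea) > 1 and len(nea) > 1 and ea[0] == nea[0]:
        ea, nea = ea[1:], nea[1:]
        pos += 1
    return pos, ea, nea

# SNPID 22 from the table above: AT/GT at POS 22 shares a trailing T.
print(normalize_alleles(22, "AT", "GT"))
# SNPID 23: AT/A is already parsimonious, so nothing changes.
print(normalize_alleles(23, "AT", "A"))
```

The first call trims the shared trailing `T` and yields `A`/`G` at position 22, which agrees with the row for SNPID 22 in the table printed after `normalize_allele()`.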
In [20]:
mysumstats.data
Out[20]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
23 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
29 | 22 | 1 | 22 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Not normalizated allelels |
30 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
21 rows × 21 columns
Remove duplicated / multiallelic variants¶
In [21]:
mysumstats.remove_dup(mode="md")
2024/12/20 13:18:50 Start to remove duplicated/multiallelic variants...v3.5.4
2024/12/20 13:18:50 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:50 -Removing mode:md
2024/12/20 13:18:50 Start to sort the sumstats using P...
2024/12/20 13:18:50 Start to remove duplicated variants based on snpid...v3.5.4
2024/12/20 13:18:50 -Current Dataframe shape : 21 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:50 -Which variant to keep: first
2024/12/20 13:18:50 -Removed 2 based on SNPID...
2024/12/20 13:18:50 Start to remove duplicated variants based on CHR,POS,EA and NEA...
2024/12/20 13:18:50 -Current Dataframe shape : 19 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:50 -Which variant to keep: first
2024/12/20 13:18:50 -Removed 1 based on CHR,POS,EA and NEA...
2024/12/20 13:18:50 Start to remove multiallelic variants based on chr:pos...
2024/12/20 13:18:50 -Current Dataframe shape : 18 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:50 -Which variant to keep: first
2024/12/20 13:18:50 -Removed 1 multiallelic variants...
2024/12/20 13:18:50 -Removed 4 variants in total.
2024/12/20 13:18:50 -Sort the coordinates based on CHR and POS...
2024/12/20 13:18:50 Finished removing duplicated/multiallelic variants.
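The log above describes three keep-first passes after sorting by P: by SNPID, by CHR, POS, EA and NEA, and by CHR and POS. The sketch below replays those passes in plain Python on the first seven rows of the table printed before this step. It is an illustration only: `drop_duplicates` is a hypothetical helper, and the P values are placeholders, since the P column is hidden in the rendered table.

```python
def drop_duplicates(rows, key):
    """Keep the first row seen for each key and drop the rest."""
    seen, kept = set(), []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

P = 1e-8  # placeholder P value (assumption): identical for every toy row
rows = [
    {"SNPID": "1:1:G:A",  "CHR": 1, "POS": 1, "EA": "A", "NEA": "G",   "P": P},
    {"SNPID": "1:1:A:G",  "CHR": 1, "POS": 1, "EA": "A", "NEA": "G",   "P": P},
    {"SNPID": "1:1:A:G",  "CHR": 1, "POS": 1, "EA": "A", "NEA": "G",   "P": P},
    {"SNPID": "1:2",      "CHR": 1, "POS": 2, "EA": "T", "NEA": "C",   "P": P},
    {"SNPID": "1:2",      "CHR": 1, "POS": 2, "EA": "T", "NEA": "TAA", "P": P},
    {"SNPID": "1:3:T:A",  "CHR": 1, "POS": 3, "EA": "A", "NEA": "T",   "P": P},
    {"SNPID": "1:3:T:GA", "CHR": 1, "POS": 3, "EA": "T", "NEA": "GA",  "P": P},
]

# Sort by P first so that "keep first" keeps the most significant record.
current = sorted(rows, key=lambda r: r["P"])

passes = [
    ("SNPID", lambda r: r["SNPID"]),
    ("CHR,POS,EA,NEA", lambda r: (r["CHR"], r["POS"], r["EA"], r["NEA"])),
    ("CHR,POS", lambda r: (r["CHR"], r["POS"])),
]

removed_counts = []
for name, key in passes:
    deduped = drop_duplicates(current, key)
    removed_counts.append(len(current) - len(deduped))
    current = deduped

print(removed_counts, [r["SNPID"] for r in current])
```

The per-pass counts (2, 1 and 1) match the log. Which record survives each pass depends on the order after sorting by P; with the tied placeholder values used here the first-listed rows are kept, so the surviving SNPIDs need not match the table that follows.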
In [22]:
mysumstats.data
Out[22]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
2 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
3 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
4 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
5 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
6 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
7 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
8 | 22 | 1 | 22 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Not normalizated allelels |
9 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
10 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
11 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
12 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
13 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
14 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
15 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
16 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
17 rows × 21 columns
Sort genome coordinates¶
In [23]:
mysumstats.sort_coordinate()
2024/12/20 13:18:51 Start to sort the genome coordinates...v3.5.4
2024/12/20 13:18:51 -Current Dataframe shape : 17 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:51 Finished sorting coordinates.
Sort columns¶
In [24]:
mysumstats.sort_column()
2024/12/20 13:18:51 Start to reorder the columns...v3.5.4
2024/12/20 13:18:51 -Current Dataframe shape : 17 x 21 ; Memory usage: 21.47 MB
2024/12/20 13:18:51 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE
2024/12/20 13:18:51 Finished reordering the columns.