Standardization
In [1]:
import gwaslab as gl
Load sample data
In [2]:
mysumstats = gl.Sumstats("/home/yunye/work/gwaslab/examples/toy_data/dirty_sumstats.tsv",fmt="gwaslab",other=["NOTE"])
Fri Feb 2 19:46:01 2024 GWASLab v3.4.38 https://cloufield.github.io/gwaslab/
Fri Feb 2 19:46:01 2024 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
Fri Feb 2 19:46:01 2024 Start to load format from formatbook....
Fri Feb 2 19:46:01 2024 -gwaslab format meta info:
Fri Feb 2 19:46:01 2024 - format_name : gwaslab
Fri Feb 2 19:46:01 2024 - format_source : https://cloufield.github.io/gwaslab/
Fri Feb 2 19:46:01 2024 - format_version : 20231220_v4
Fri Feb 2 19:46:01 2024 Start to initialize gl.Sumstats from file :/home/yunye/work/gwaslab/examples/toy_data/dirty_sumstats.tsv
Fri Feb 2 19:46:01 2024 -Reading columns : BETA,DIRECTION,EA,P,N,NOTE,NEA,CHISQ,EAF,POS,MLOG10P,Z,N_CONTROL,OR_95L,SNPID,CHR,OR_95U,SE,N_CASE,OR
Fri Feb 2 19:46:01 2024 -Renaming columns to : BETA,DIRECTION,EA,P,N,NOTE,NEA,CHISQ,EAF,POS,MLOG10P,Z,N_CONTROL,OR_95L,SNPID,CHR,OR_95U,SE,N_CASE,OR
Fri Feb 2 19:46:01 2024 -Current Dataframe shape : 63 x 20
Fri Feb 2 19:46:01 2024 -Initiating a status column: STATUS ...
Fri Feb 2 19:46:01 2024 #WARNING! Version of genomic coordinates is unknown...
Fri Feb 2 19:46:01 2024 Start to reorder the columns...v3.4.38
Fri Feb 2 19:46:01 2024 -Current Dataframe shape : 63 x 21 ; Memory usage: 19.95 MB
Fri Feb 2 19:46:01 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE
Fri Feb 2 19:46:01 2024 Finished reordering the columns.
Fri Feb 2 19:46:01 2024 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE
Fri Feb 2 19:46:01 2024 -DType : object string object category category float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 int64 object category object
Fri Feb 2 19:46:01 2024 -Verified: T F F T T T T T T T T T T T T F T T T T NA
Fri Feb 2 19:46:01 2024 #WARNING! Columns with possibly incompatable dtypes: CHR,POS,N
Fri Feb 2 19:46:02 2024 -Current Dataframe memory usage: 19.95 MB
Fri Feb 2 19:46:02 2024 Finished loading data successfully!
The toy sumstats are deliberately dirty: the NOTE column describes the issue planted in each row.
In [3]:
mysumstats.data
Out[3]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1_G_A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Duplicated |
1 | 1:1_A_G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Duplicated |
2 | 1:1_A_G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Multiallelic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9999999 | Clean sumstats |
63 rows × 21 columns
All-in-one function
In [4]:
mysumstats.basic_check(remove=True,remove_dup=True)
Fri Feb 2 19:46:02 2024 Start to check SNPID/rsID...v3.4.38 Fri Feb 2 19:46:02 2024 -Current Dataframe shape : 63 x 21 ; Memory usage: 19.95 MB Fri Feb 2 19:46:02 2024 -Checking SNPID data type... Fri Feb 2 19:46:02 2024 -Converting SNPID to pd.string data type... Fri Feb 2 19:46:02 2024 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _) Fri Feb 2 19:46:03 2024 Finished checking SNPID/rsID. Fri Feb 2 19:46:03 2024 Start to fix chromosome notation (CHR)...v3.4.38 Fri Feb 2 19:46:03 2024 -Current Dataframe shape : 63 x 21 ; Memory usage: 19.95 MB Fri Feb 2 19:46:03 2024 -Checking CHR data type... Fri Feb 2 19:46:03 2024 -Variants with standardized chromosome notation: 56 Fri Feb 2 19:46:03 2024 -Variants with fixable chromosome notations: 4 Fri Feb 2 19:46:03 2024 -Variants with NA chromosome notations: 1 Fri Feb 2 19:46:03 2024 -Variants with invalid chromosome notations: 2 Fri Feb 2 19:46:03 2024 -A look at invalid chromosome notations: {'1.0001', '-1'} Fri Feb 2 19:46:03 2024 -Identifying non-autosomal chromosomes : X, Y, and MT ... Fri Feb 2 19:46:03 2024 -Identified 1 variants on sex chromosomes... Fri Feb 2 19:46:03 2024 -Standardizing sex chromosome notations: X to 23... Fri Feb 2 19:46:04 2024 -Valid CHR list: 1 - 25 Fri Feb 2 19:46:04 2024 -Removed 5 variants with chromosome notations not in CHR list. Fri Feb 2 19:46:04 2024 -A look at chromosome notations not in CHR list: {'0', '300', <NA>} Fri Feb 2 19:46:04 2024 Finished fixing chromosome notation (CHR). Fri Feb 2 19:46:04 2024 Start to fix basepair positions (POS)...v3.4.38 Fri Feb 2 19:46:04 2024 -Current Dataframe shape : 58 x 21 ; Memory usage: 19.95 MB Fri Feb 2 19:46:04 2024 -Removing thousands separator "," or underbar "_" ... Fri Feb 2 19:46:04 2024 -Converting to Int64 data type ... Fri Feb 2 19:46:04 2024 -Force converting to Int64 data type ... 
Fri Feb 2 19:46:05 2024 -Position bound:(0 , 250,000,000) Fri Feb 2 19:46:05 2024 -Removed outliers: 2 Fri Feb 2 19:46:05 2024 -Removed 4 variants with bad positions. Fri Feb 2 19:46:05 2024 Finished fixing basepair positions (POS). Fri Feb 2 19:46:05 2024 Start to fix alleles (EA and NEA)...v3.4.38 Fri Feb 2 19:46:05 2024 -Current Dataframe shape : 54 x 21 ; Memory usage: 19.95 MB Fri Feb 2 19:46:05 2024 -Converted all bases to string datatype and UPPERCASE. Fri Feb 2 19:46:05 2024 -Variants with bad EA : 1 Fri Feb 2 19:46:05 2024 -Variants with bad NEA : 5 Fri Feb 2 19:46:05 2024 -Variants with NA for EA or NEA: 1 Fri Feb 2 19:46:05 2024 -Variants with same EA and NEA: 1 Fri Feb 2 19:46:05 2024 -A look at the non-ATCG EA: {'<CN0>'} ... Fri Feb 2 19:46:05 2024 -A look at the non-ATCG NEA: {nan, '*', '<CN1>', 'N'} ... Fri Feb 2 19:46:05 2024 -Removed 5 variants with NA alleles or alleles that contain bases other than A/C/T/G. Fri Feb 2 19:46:05 2024 -Removed 1 variants with same allele for EA and NEA. Fri Feb 2 19:46:09 2024 Finished fixing alleles (EA and NEA). Fri Feb 2 19:46:09 2024 Start to perform sanity check for statistics...v3.4.38 Fri Feb 2 19:46:09 2024 -Current Dataframe shape : 48 x 21 ; Memory usage: 19.95 MB Fri Feb 2 19:46:09 2024 -Comparison tolerance for floats: 1e-07 Fri Feb 2 19:46:09 2024 -Checking if 0 <= N <= 2147483647 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 25,26,27 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (N): 12345700000000000,NA,-1 ... Fri Feb 2 19:46:09 2024 -Removed 3 variants with bad/na N. Fri Feb 2 19:46:09 2024 -Checking if 0 <= N_CASE <= 2147483647 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 29 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (N_CASE): -1 ... Fri Feb 2 19:46:09 2024 -Removed 1 variants with bad/na N_CASE. Fri Feb 2 19:46:09 2024 -Checking if 0 <= N_CONTROL <= 2147483647 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 28 ... 
Fri Feb 2 19:46:09 2024 -Examples of invalid values (N_CONTROL): -1 ... Fri Feb 2 19:46:09 2024 -Removed 1 variants with bad/na N_CONTROL. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < EAF < 1.0000001 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 31,32,33 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (EAF): 1.02,-0.01,NA ... Fri Feb 2 19:46:09 2024 -Removed 3 variants with bad/na EAF. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < CHISQ < inf ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 38,39 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (CHISQ): -0.01,NA ... Fri Feb 2 19:46:09 2024 -Removed 2 variants with bad/na CHISQ. Fri Feb 2 19:46:09 2024 -Checking if -9999.0000001 < Z < 9999.0000001 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 40,41 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (Z): NA,999999.0 ... Fri Feb 2 19:46:09 2024 -Removed 2 variants with bad/na Z. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < P < 1.0000001 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 48,49,50 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (P): 1.1,-0.01,NA ... Fri Feb 2 19:46:09 2024 -Removed 3 variants with bad/na P. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < MLOG10P < 9999.0000001 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 51,52,53 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (MLOG10P): 12345.0,-0.1,NA ... Fri Feb 2 19:46:09 2024 -Removed 3 variants with bad/na MLOG10P. Fri Feb 2 19:46:09 2024 -Checking if -100.0000001 < BETA < 100.0000001 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 34,35 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (BETA): 99999.0,NA ... Fri Feb 2 19:46:09 2024 -Removed 2 variants with bad/na BETA. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < SE < inf ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 37 ... 
Fri Feb 2 19:46:09 2024 -Examples of invalid values (SE): NA ... Fri Feb 2 19:46:09 2024 -Removed 1 variants with bad/na SE. Fri Feb 2 19:46:09 2024 -Checking if -100.0000001 < OR < 100.0000001 ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 42,43 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (OR): 999999.0,NA ... Fri Feb 2 19:46:09 2024 -Removed 2 variants with bad/na OR. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < OR_95L < inf ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 44,45 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (OR_95L): -0.01,NA ... Fri Feb 2 19:46:09 2024 -Removed 2 variants with bad/na OR_95L. Fri Feb 2 19:46:09 2024 -Checking if -1e-07 < OR_95U < inf ... Fri Feb 2 19:46:09 2024 -Examples of invalid variants(SNPID): 46,47 ... Fri Feb 2 19:46:09 2024 -Examples of invalid values (OR_95U): -0.01,NA ... Fri Feb 2 19:46:09 2024 -Removed 2 variants with bad/na OR_95U. Fri Feb 2 19:46:09 2024 -Checking STATUS and converting STATUS to categories.... Fri Feb 2 19:46:09 2024 -Removed 27 variants with bad statistics in total. Fri Feb 2 19:46:09 2024 -Data types for each column: Fri Feb 2 19:46:09 2024 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE Fri Feb 2 19:46:09 2024 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 float64 float64 float64 float64 float64 Int64 Int64 Int64 object category object Fri Feb 2 19:46:09 2024 -Verified: T T T T T T T T T T T T T T T T T T T T NA Fri Feb 2 19:46:09 2024 Finished sanity check for statistics. Fri Feb 2 19:46:09 2024 Start to check data consistency across columns...v3.4.38 Fri Feb 2 19:46:09 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:09 2024 -Tolerance: 0.001 (Relative) and 0.001 (Absolute) Fri Feb 2 19:46:09 2024 -Checking if BETA/SE-derived-MLOG10P is consistent with MLOG10P... 
Fri Feb 2 19:46:09 2024 -Variants with inconsistent values were not detected. Fri Feb 2 19:46:09 2024 -Checking if BETA/SE-derived-P is consistent with P... Fri Feb 2 19:46:09 2024 -Variants with inconsistent values were not detected. Fri Feb 2 19:46:09 2024 -Checking if MLOG10P-derived-P is consistent with P... Fri Feb 2 19:46:09 2024 -Variants with inconsistent values were not detected. Fri Feb 2 19:46:09 2024 -Checking if N is consistent with N_CASE + N_CONTROL ... Fri Feb 2 19:46:09 2024 -Not consistent: 1 variant(s) Fri Feb 2 19:46:09 2024 -Variant SNPID with max difference: 30 with 10000 Fri Feb 2 19:46:09 2024 -Note: if the max difference is greater than expected, please check your original sumstats. Fri Feb 2 19:46:09 2024 Finished checking data consistency across columns. Fri Feb 2 19:46:09 2024 Start to normalize indels...v3.4.38 Fri Feb 2 19:46:09 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:10 2024 -Not normalized allele IDs:22 ... Fri Feb 2 19:46:10 2024 -Not normalized allele:['AT' 'GT']... Fri Feb 2 19:46:10 2024 -Modified 1 variants according to parsimony and left alignment principal. Fri Feb 2 19:46:10 2024 Finished normalizing indels. Fri Feb 2 19:46:10 2024 Start to remove duplicated/multiallelic variants...v3.4.38 Fri Feb 2 19:46:10 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:10 2024 -Removing mode:dm Fri Feb 2 19:46:10 2024 Start to sort the sumstats using P... Fri Feb 2 19:46:10 2024 Start to remove duplicated variants based on snpid...v3.4.38 Fri Feb 2 19:46:10 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:10 2024 -Which variant to keep: first Fri Feb 2 19:46:10 2024 -Removed 2 based on SNPID... Fri Feb 2 19:46:10 2024 Start to remove duplicated variants based on CHR,POS,EA and NEA... Fri Feb 2 19:46:10 2024 -Current Dataframe shape : 19 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:10 2024 -Which variant to keep: first Fri Feb 2 19:46:10 2024 -Removed 1 based on CHR,POS,EA and NEA... Fri Feb 2 19:46:10 2024 Start to remove multiallelic variants based on chr:pos... Fri Feb 2 19:46:10 2024 -Current Dataframe shape : 18 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:10 2024 -Which variant to keep: first Fri Feb 2 19:46:10 2024 -Removed 1 multiallelic variants... Fri Feb 2 19:46:10 2024 -Removed 4 variants in total. Fri Feb 2 19:46:10 2024 -Sort the coordinates based on CHR and POS... Fri Feb 2 19:46:10 2024 Finished removing duplicated/multiallelic variants. Fri Feb 2 19:46:10 2024 Start to sort the genome coordinates...v3.4.38 Fri Feb 2 19:46:10 2024 -Current Dataframe shape : 17 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:10 2024 Finished sorting coordinates. 
Fri Feb 2 19:46:10 2024 Start to reorder the columns...v3.4.38 Fri Feb 2 19:46:10 2024 -Current Dataframe shape : 17 x 21 ; Memory usage: 19.94 MB Fri Feb 2 19:46:10 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE Fri Feb 2 19:46:10 2024 Finished reordering the columns.
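The sanity check reported in the log above boils down to a range test per column, with missing values treated as invalid. The following is a minimal sketch in plain Python, not GWASLab's implementation: the bounds are copied from the log for a handful of columns, the input rows are made up for illustration, and `BOUNDS`, `TOL` and `is_sane` are hypothetical names.

```python
import math

# Float tolerance as printed in the log ("Comparison tolerance for floats: 1e-07").
TOL = 1e-7

# Bounds copied from the log for a few columns (hypothetical container name).
BOUNDS = {
    "N": (0, 2147483647),
    "EAF": (0.0, 1.0),
    "P": (0.0, 1.0),
    "BETA": (-100.0, 100.0),
    "MLOG10P": (0.0, 9999.0),
}

def is_sane(row):
    """Return True if every bounded column is present, not NaN, and in range."""
    for column, (low, high) in BOUNDS.items():
        value = row.get(column)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return False
        # Widening by TOL mirrors checks such as "-1e-07 < EAF < 1.0000001";
        # for integer N it is equivalent to the inclusive test in the log.
        if not (low - TOL < value < high + TOL):
            return False
    return True

# Illustrative rows (values are assumptions, not taken from the toy file).
clean = {"N": 160000, "EAF": 0.996, "P": 0.5, "BETA": 0.0603, "MLOG10P": 0.30103}
bad_eaf = dict(clean, EAF=1.02)
missing_p = dict(clean, P=float("nan"))

print(is_sane(clean), is_sane(bad_eaf), is_sane(missing_p))
```

With `remove=True`, rows failing such a test are dropped, which is why the data frame shrinks from 48 to 21 rows during this step in the log.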
In [5]:
mysumstats.data
Out[5]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1_G_A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
2 | 1:3_T_GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
3 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
4 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
5 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
6 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
7 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
8 | 22 | 1 | 22 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Not normalizated allelels |
9 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
10 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
11 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
12 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
13 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
14 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
15 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
16 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
17 rows × 21 columns
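In the table above, the variant with SNPID 22 went from EA/NEA `AT`/`GT` to `A`/`G`, which the log attributes to parsimony and left alignment. The trimming part of that idea can be sketched as follows. This is an illustration only, not GWASLab's code: `normalize_alleles` is a hypothetical name, and the sketch does not consult a reference genome, so it cannot shift indels within repeats.

```python
def normalize_alleles(pos, ea, nea):
    """Trim bases shared by EA and NEA, adjusting POS for left trims."""
    # Trim identical trailing bases while both alleles keep at least one base.
    while len(ea) > 1 and len(nea) > 1 and ea[-1] == nea[-1]:
        ea, nea = ea[:-1], nea[:-1]
    # Trim identical leading bases; each trimmed base moves the position by one.
    while len(ea) > 1 and len(nea) > 1 and ea[0] == nea[0]:
        ea, nea = ea[1:], nea[1:]
        pos += 1
    return pos, ea, nea

print(normalize_alleles(22, "AT", "GT"))  # (22, 'A', 'G')
print(normalize_alleles(54, "AG", "A"))   # (54, 'AG', 'A')
```

The second call shows that an already parsimonious indel such as `AG`/`A` is left untouched, matching the row with SNPID 54 in the table.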
Separate functions
In [6]:
# Reload the raw data so that each fix can be applied separately
mysumstats = gl.Sumstats("/home/yunye/work/gwaslab/examples/toy_data/dirty_sumstats.tsv",fmt="gwaslab",other=["NOTE"], verbose=False)
Fix ID
In [7]:
mysumstats.fix_id(fixsep=True)
Fri Feb 2 19:46:11 2024 Start to check SNPID/rsID...v3.4.38
Fri Feb 2 19:46:11 2024 -Current Dataframe shape : 63 x 21 ; Memory usage: 19.95 MB
Fri Feb 2 19:46:11 2024 -Checking SNPID data type...
Fri Feb 2 19:46:11 2024 -Converting SNPID to pd.string data type...
Fri Feb 2 19:46:11 2024 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
Fri Feb 2 19:46:12 2024 -Replacing [_-] in SNPID with ":" ...
Fri Feb 2 19:46:12 2024 Finished checking SNPID/rsID.
In [8]:
mysumstats.data
Out[8]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9969999 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9969999 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9969999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Multiallelic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000.0 | 120000 | 40000 | --++ | 9989999 | Clean sumstats |
63 rows × 21 columns
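The SNPID column above now uses ":" throughout (for example `1:1_G_A` became `1:1:G:A`), as the log line about replacing `[_-]` indicates. A minimal sketch of that substitution with Python's `re` module is shown below. The function name `fix_snpid_separator` is hypothetical, and unlike GWASLab, which first checks whether the ID looks like CHR:POS:NEA:EA, this sketch rewrites every string it is given.

```python
import re

def fix_snpid_separator(snpid):
    """Replace '_' and '-' with ':' in a SNPID string."""
    return re.sub(r"[_-]", ":", snpid)

print(fix_snpid_separator("1:1_G_A"))  # 1:1:G:A
print(fix_snpid_separator("1:2"))      # 1:2
```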
Fix chromosome
In [9]:
mysumstats.fix_chr(remove=True)
Fri Feb 2 19:46:12 2024 Start to fix chromosome notation (CHR)...v3.4.38
Fri Feb 2 19:46:12 2024 -Current Dataframe shape : 63 x 21 ; Memory usage: 19.95 MB
Fri Feb 2 19:46:12 2024 -Checking CHR data type...
Fri Feb 2 19:46:12 2024 -Variants with standardized chromosome notation: 56
Fri Feb 2 19:46:12 2024 -Variants with fixable chromosome notations: 4
Fri Feb 2 19:46:12 2024 -Variants with NA chromosome notations: 1
Fri Feb 2 19:46:12 2024 -Variants with invalid chromosome notations: 2
Fri Feb 2 19:46:12 2024 -A look at invalid chromosome notations: {'1.0001', '-1'}
Fri Feb 2 19:46:12 2024 -Identifying non-autosomal chromosomes : X, Y, and MT ...
Fri Feb 2 19:46:12 2024 -Identified 1 variants on sex chromosomes...
Fri Feb 2 19:46:12 2024 -Standardizing sex chromosome notations: X to 23...
Fri Feb 2 19:46:13 2024 -Valid CHR list: 1 - 25
Fri Feb 2 19:46:13 2024 -Removed 5 variants with chromosome notations not in CHR list.
Fri Feb 2 19:46:13 2024 -A look at chromosome notations not in CHR list: {'0', '300', <NA>}
Fri Feb 2 19:46:13 2024 Finished fixing chromosome notation (CHR).
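The log above distinguishes standardized, fixable, missing and invalid chromosome notations, and keeps only values in 1 to 25. A rough sketch of such a rule in plain Python follows. It is not GWASLab's implementation: `standardize_chr` is a hypothetical name, the "chr" prefix is an assumed example of a fixable notation, and only the X to 23 mapping is confirmed by the log (Y to 24 and MT to 25 are assumptions chosen to fit the logged valid range).

```python
import re

# X -> 23 is taken from the log; Y -> 24 and MT -> 25 are assumptions.
NON_AUTOSOMES = {"X": 23, "Y": 24, "MT": 25}

def standardize_chr(raw):
    """Return an int in 1..25, or None if the notation cannot be used."""
    if raw is None:
        return None
    text = str(raw).strip().upper()
    text = re.sub(r"^CHR", "", text)          # drop an assumed "chr" prefix
    if text in NON_AUTOSOMES:
        return NON_AUTOSOMES[text]
    if not re.fullmatch(r"[0-9]+", text):     # rejects '-1', '1.0001', ''
        return None
    number = int(text)
    return number if 1 <= number <= 25 else None  # rejects 0 and 300

for raw in ["1", "chr1", "X", "1.0001", "-1", "0", "300", None]:
    print(repr(raw), "->", standardize_chr(raw))
```

Rows for which the function returns `None` correspond to the variants that `remove=True` drops in the log.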
In [10]:
mysumstats.data
Out[10]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9965999 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHR with prefix |
16 | 10 | 1 | 123,456,789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS with separator |
17 | 11 | 1 | -1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS out of normal range |
18 | 12 | 1 | 1.23214E+13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS out of normal range |
19 | 13 | 1 | 13.00000001 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS float |
20 | 14 | 1 | NaN | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS missing |
21 | 13 | 1 | abc | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | POS string |
22 | 15 | 1 | 15 | A | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Same alleles |
23 | 16 | 1 | 16 | a | g | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Lowercase alleles |
24 | 17 | 1 | 17 | A | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
25 | 18 | 1 | 18 | <CN0> | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
26 | 19 | 1 | 19 | A | * | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
27 | 20 | 1 | 20 | A | N | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Unrecognized alleles |
28 | 21 | 1 | 21 | A | NaN | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Allele missing |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Not normalizated allelels |
30 | 23 | 1 | 23 | At | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600001e+05 | 120000 | 40000 | --++ | 9985999 | N float |
32 | 25 | 1 | 25 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.234570e+16 | 120000 | 40000 | --++ | 9985999 | N out of range |
33 | 26 | 1 | 26 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | NaN | 120000 | 40000 | --++ | 9985999 | N missing |
34 | 27 | 1 | 27 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | -1.000000e+00 | 120000 | 40000 | --++ | 9985999 | N out of range |
35 | 28 | 1 | 28 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | -1 | --++ | 9985999 | N_CASE out of range |
36 | 29 | 1 | 29 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | -1 | 40000 | --++ | 9985999 | N_CONTROL out of range |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 130000 | 40000 | --++ | 9985999 | N!=N_CONTROL +N_CASE |
38 | 31 | 1 | 31 | A | G | 1.020 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | EAF out of range |
39 | 32 | 1 | 32 | A | G | -0.010 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | EAF out of range |
40 | 33 | 1 | 33 | A | G | NaN | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | EAF missing |
41 | 34 | 1 | 34 | A | G | 0.996 | 99999.0000 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | BETA out of range |
42 | 35 | 1 | 35 | A | G | 0.996 | NaN | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | BETA missing |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | SE out of range |
44 | 37 | 1 | 37 | A | G | 0.996 | 0.0603 | NaN | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | SE missing |
45 | 38 | 1 | 38 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | -0.010000 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHISQ out of range |
46 | 39 | 1 | 39 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | NaN | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | CHISQ missing |
47 | 40 | 1 | 40 | A | G | 0.996 | 0.0603 | 0.0103 | NaN | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Z missing |
48 | 41 | 1 | 41 | A | G | 0.996 | 0.0603 | 0.0103 | 999999.000000 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Z out of range |
49 | 42 | 1 | 42 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 999999.000000 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR out of range |
50 | 43 | 1 | 43 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | NaN | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR missing |
51 | 44 | 1 | 44 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | -0.010000 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95L out of range |
52 | 45 | 1 | 45 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | NaN | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95L missing |
53 | 46 | 1 | 46 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | -0.010000 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95U out of range |
54 | 47 | 1 | 47 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | NaN | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | OR_95U missing |
55 | 48 | 1 | 48 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.041393 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | P out of range |
56 | 49 | 1 | 49 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | P out of range |
57 | 50 | 1 | 50 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | P missing |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9985999 | Clean sumstats |
58 rows × 21 columns
fix position¶
In [11]:
mysumstats.fix_pos(remove=True)
Fri Feb 2 19:46:13 2024 Start to fix basepair positions (POS)...v3.4.38
Fri Feb 2 19:46:13 2024 -Current Dataframe shape : 58 x 21 ; Memory usage: 19.95 MB
Fri Feb 2 19:46:13 2024 -Removing thousands separator "," or underbar "_" ...
Fri Feb 2 19:46:13 2024 -Converting to Int64 data type ...
Fri Feb 2 19:46:13 2024 -Force converting to Int64 data type ...
Fri Feb 2 19:46:15 2024 -Position bound:(0 , 250,000,000)
Fri Feb 2 19:46:15 2024 -Removed outliers: 2
Fri Feb 2 19:46:15 2024 -Removed 4 variants with bad positions.
Fri Feb 2 19:46:15 2024 Finished fixing basepair positions (POS).
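The log above names three steps: stripping thousands separators, converting to an integer type, and dropping positions outside the bound (0, 250,000,000). A minimal pure-Python sketch of that idea follows. It is not GWASLab's implementation: the helper name `clean_position`, the exclusive treatment of both bounds, and the rejection of non-whole numbers are assumptions made for illustration, and the input values are made up.

```python
import math

def clean_position(raw, lower=0, upper=250_000_000):
    """Hypothetical helper: return a cleaned integer position, or None if unusable."""
    # Drop thousands separators, e.g. "123,456,789" or "123_456_789".
    text = str(raw).replace(",", "").replace("_", "")
    try:
        value = float(text)
    except ValueError:
        return None  # not numeric at all
    # Reject NaN, infinities, and non-whole numbers (assumption).
    if not math.isfinite(value) or value % 1 != 0:
        return None
    position = int(value)
    # Assumption: both bounds are exclusive.
    return position if lower < position < upper else None

# Illustrative inputs only; they are not read from the toy file.
for raw in ["123,456,789", "13.0", "-5", "300,000,000", "abc"]:
    print(repr(raw), "->", clean_position(raw))
```

Returning `None` rather than raising mirrors the `remove=True` behaviour in spirit: unusable positions are flagged so the caller can drop those rows.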
In [12]:
mysumstats.data
Out[12]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960999 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | POS float |
22 | 15 | 1 | 15 | A | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Same alleles |
23 | 16 | 1 | 16 | a | g | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Lowercase alleles |
24 | 17 | 1 | 17 | A | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
25 | 18 | 1 | 18 | <CN0> | <CN1> | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
26 | 19 | 1 | 19 | A | * | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
27 | 20 | 1 | 20 | A | N | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Unrecognized alleles |
28 | 21 | 1 | 21 | A | NaN | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Allele missing |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Not normalizated allelels |
30 | 23 | 1 | 23 | At | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600001e+05 | 120000 | 40000 | --++ | 9980999 | N float |
32 | 25 | 1 | 25 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.234570e+16 | 120000 | 40000 | --++ | 9980999 | N out of range |
33 | 26 | 1 | 26 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | NaN | 120000 | 40000 | --++ | 9980999 | N missing |
34 | 27 | 1 | 27 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | -1.000000e+00 | 120000 | 40000 | --++ | 9980999 | N out of range |
35 | 28 | 1 | 28 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | -1 | --++ | 9980999 | N_CASE out of range |
36 | 29 | 1 | 29 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | -1 | 40000 | --++ | 9980999 | N_CONTROL out of range |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 130000 | 40000 | --++ | 9980999 | N!=N_CONTROL +N_CASE |
38 | 31 | 1 | 31 | A | G | 1.020 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | EAF out of range |
39 | 32 | 1 | 32 | A | G | -0.010 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | EAF out of range |
40 | 33 | 1 | 33 | A | G | NaN | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | EAF missing |
41 | 34 | 1 | 34 | A | G | 0.996 | 99999.0000 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | BETA out of range |
42 | 35 | 1 | 35 | A | G | 0.996 | NaN | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | BETA missing |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | SE out of range |
44 | 37 | 1 | 37 | A | G | 0.996 | 0.0603 | NaN | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | SE missing |
45 | 38 | 1 | 38 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | -0.010000 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHISQ out of range |
46 | 39 | 1 | 39 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | NaN | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | CHISQ missing |
47 | 40 | 1 | 40 | A | G | 0.996 | 0.0603 | 0.0103 | NaN | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Z missing |
48 | 41 | 1 | 41 | A | G | 0.996 | 0.0603 | 0.0103 | 999999.000000 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Z out of range |
49 | 42 | 1 | 42 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 999999.000000 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR out of range |
50 | 43 | 1 | 43 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | NaN | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR missing |
51 | 44 | 1 | 44 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | -0.010000 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95L out of range |
52 | 45 | 1 | 45 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | NaN | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95L missing |
53 | 46 | 1 | 46 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | -0.010000 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95U out of range |
54 | 47 | 1 | 47 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | NaN | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | OR_95U missing |
55 | 48 | 1 | 48 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.041393 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | P out of range |
56 | 49 | 1 | 49 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | P out of range |
57 | 50 | 1 | 50 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | P missing |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980999 | Clean sumstats |
54 rows × 21 columns
fix allele¶
In [13]:
mysumstats.fix_allele(remove=True)
Fri Feb 2 19:46:15 2024 Start to fix alleles (EA and NEA)...v3.4.38
Fri Feb 2 19:46:15 2024 -Current Dataframe shape : 54 x 21 ; Memory usage: 19.95 MB
Fri Feb 2 19:46:15 2024 -Converted all bases to string datatype and UPPERCASE.
Fri Feb 2 19:46:15 2024 -Variants with bad EA : 1
Fri Feb 2 19:46:15 2024 -Variants with bad NEA : 5
Fri Feb 2 19:46:15 2024 -Variants with NA for EA or NEA: 1
Fri Feb 2 19:46:15 2024 -Variants with same EA and NEA: 1
Fri Feb 2 19:46:15 2024 -A look at the non-ATCG EA: {'<CN0>'} ...
Fri Feb 2 19:46:15 2024 -A look at the non-ATCG NEA: {nan, '*', '<CN1>', 'N'} ...
Fri Feb 2 19:46:15 2024 -Removed 5 variants with NA alleles or alleles that contain bases other than A/C/T/G.
Fri Feb 2 19:46:15 2024 -Removed 1 variants with same allele for EA and NEA.
Fri Feb 2 19:46:18 2024 Finished fixing alleles (EA and NEA).
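The log above reports three kinds of problem: missing alleles, alleles containing characters other than A/C/T/G, and identical EA and NEA. The sketch below shows one way such a check could look in plain Python. It is an illustration under stated assumptions, not GWASLab's code: `check_alleles`, `is_missing`, and the verdict strings are hypothetical names, and the order in which the checks run is a guess.

```python
import re

# One or more of A, C, G, T and nothing else.
VALID_ALLELE = re.compile(r"[ACGT]+")

def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and value != value)

def check_alleles(ea, nea):
    """Hypothetical helper: upper-case both alleles and return (EA, NEA, verdict)."""
    if is_missing(ea) or is_missing(nea):
        return ea, nea, "missing"
    # Upper-casing first means "a"/"g" and "At" become valid alleles.
    ea, nea = str(ea).upper(), str(nea).upper()
    if not (VALID_ALLELE.fullmatch(ea) and VALID_ALLELE.fullmatch(nea)):
        return ea, nea, "non-ACGT"
    if ea == nea:
        return ea, nea, "same"
    return ea, nea, "ok"

# Allele pairs taken from the table before fix_allele.
for pair in [("a", "g"), ("At", "A"), ("A", "<CN1>"), ("A", "*"), ("A", "N"), ("A", "A")]:
    print(pair, "->", check_alleles(*pair))
```

Under these assumptions, lowercase and mixed-case alleles are repaired rather than removed, which matches the rows labelled "Lowercase alleles" and "Both Uppercase and lowercases" surviving in the next table.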
In [14]:
mysumstats.data
Out[14]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960099 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | POS float |
23 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980599 | Not normalizated allelels |
30 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600001e+05 | 120000 | 40000 | --++ | 9980099 | N float |
32 | 25 | 1 | 25 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.234570e+16 | 120000 | 40000 | --++ | 9980099 | N out of range |
33 | 26 | 1 | 26 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | NaN | 120000 | 40000 | --++ | 9980099 | N missing |
34 | 27 | 1 | 27 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | -1.000000e+00 | 120000 | 40000 | --++ | 9980099 | N out of range |
35 | 28 | 1 | 28 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | -1 | --++ | 9980099 | N_CASE out of range |
36 | 29 | 1 | 29 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | -1 | 40000 | --++ | 9980099 | N_CONTROL out of range |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
38 | 31 | 1 | 31 | A | G | 1.020 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | EAF out of range |
39 | 32 | 1 | 32 | A | G | -0.010 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | EAF out of range |
40 | 33 | 1 | 33 | A | G | NaN | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | EAF missing |
41 | 34 | 1 | 34 | A | G | 0.996 | 99999.0000 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | BETA out of range |
42 | 35 | 1 | 35 | A | G | 0.996 | NaN | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | BETA missing |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | SE out of range |
44 | 37 | 1 | 37 | A | G | 0.996 | 0.0603 | NaN | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | SE missing |
45 | 38 | 1 | 38 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | -0.010000 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHISQ out of range |
46 | 39 | 1 | 39 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | NaN | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | CHISQ missing |
47 | 40 | 1 | 40 | A | G | 0.996 | 0.0603 | 0.0103 | NaN | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Z missing |
48 | 41 | 1 | 41 | A | G | 0.996 | 0.0603 | 0.0103 | 999999.000000 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Z out of range |
49 | 42 | 1 | 42 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 999999.000000 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR out of range |
50 | 43 | 1 | 43 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | NaN | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR missing |
51 | 44 | 1 | 44 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | -0.010000 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95L out of range |
52 | 45 | 1 | 45 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | NaN | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95L missing |
53 | 46 | 1 | 46 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | -0.010000 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95U out of range |
54 | 47 | 1 | 47 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | NaN | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | OR_95U missing |
55 | 48 | 1 | 48 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.041393 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | P out of range |
56 | 49 | 1 | 49 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | P out of range |
57 | 50 | 1 | 50 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | P missing |
58 | 51 | 1 | 51 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 12345.000000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | MLOG10P out of range |
59 | 52 | 1 | 52 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | -0.100000 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | MLOG10P out of range |
60 | 53 | 1 | 53 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | NaN | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | MLOG10P missing |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 1.600000e+05 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
48 rows × 21 columns
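The STATUS column changes after each fixing step: in the tables above, `9985999` becomes `9980999` after `fix_pos`, and a row such as `1:2` with alleles T/TAA goes from `9980999` to `9980399` after `fix_allele`. A small helper makes it easy to see which digit moved. The function name `changed_digits` is hypothetical, and the sketch deliberately does not interpret what each digit means; see the GWASLab documentation for that.

```python
def changed_digits(before, after):
    """Return 1-based positions at which two equal-length STATUS codes differ."""
    if len(before) != len(after):
        raise ValueError("STATUS codes must have the same length")
    return [i for i, (a, b) in enumerate(zip(before, after), start=1) if a != b]

# STATUS values copied from the tables before and after each step.
print(changed_digits("9985999", "9980999"))  # fix_pos    -> [4]
print(changed_digits("9980999", "9980399"))  # fix_allele -> [5]
```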
sanity check for statistics¶
In [15]:
mysumstats.check_sanity()
Fri Feb 2 19:46:18 2024 Start to perform sanity check for statistics...v3.4.38
Fri Feb 2 19:46:18 2024 -Current Dataframe shape : 48 x 21 ; Memory usage: 19.95 MB
Fri Feb 2 19:46:18 2024 -Comparison tolerance for floats: 1e-07
Fri Feb 2 19:46:18 2024 -Checking if 0 <= N <= 2147483647 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 25,26,27 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (N): 12345700000000000,NA,-1 ...
Fri Feb 2 19:46:18 2024 -Removed 3 variants with bad/na N.
Fri Feb 2 19:46:18 2024 -Checking if 0 <= N_CASE <= 2147483647 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 29 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (N_CASE): -1 ...
Fri Feb 2 19:46:18 2024 -Removed 1 variants with bad/na N_CASE.
Fri Feb 2 19:46:18 2024 -Checking if 0 <= N_CONTROL <= 2147483647 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 28 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (N_CONTROL): -1 ...
Fri Feb 2 19:46:18 2024 -Removed 1 variants with bad/na N_CONTROL.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < EAF < 1.0000001 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 31,32,33 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (EAF): 1.02,-0.01,NA ...
Fri Feb 2 19:46:18 2024 -Removed 3 variants with bad/na EAF.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < CHISQ < inf ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 38,39 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (CHISQ): -0.01,NA ...
Fri Feb 2 19:46:18 2024 -Removed 2 variants with bad/na CHISQ.
Fri Feb 2 19:46:18 2024 -Checking if -9999.0000001 < Z < 9999.0000001 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 40,41 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (Z): NA,999999.0 ...
Fri Feb 2 19:46:18 2024 -Removed 2 variants with bad/na Z.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < P < 1.0000001 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 48,49,50 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (P): 1.1,-0.01,NA ...
Fri Feb 2 19:46:18 2024 -Removed 3 variants with bad/na P.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < MLOG10P < 9999.0000001 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 51,52,53 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (MLOG10P): 12345.0,-0.1,NA ...
Fri Feb 2 19:46:18 2024 -Removed 3 variants with bad/na MLOG10P.
Fri Feb 2 19:46:18 2024 -Checking if -100.0000001 < BETA < 100.0000001 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 34,35 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (BETA): 99999.0,NA ...
Fri Feb 2 19:46:18 2024 -Removed 2 variants with bad/na BETA.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < SE < inf ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 37 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (SE): NA ...
Fri Feb 2 19:46:18 2024 -Removed 1 variants with bad/na SE.
Fri Feb 2 19:46:18 2024 -Checking if -100.0000001 < OR < 100.0000001 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 42,43 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (OR): 999999.0,NA ...
Fri Feb 2 19:46:18 2024 -Removed 2 variants with bad/na OR.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < OR_95L < inf ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 44,45 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (OR_95L): -0.01,NA ...
Fri Feb 2 19:46:18 2024 -Removed 2 variants with bad/na OR_95L.
Fri Feb 2 19:46:18 2024 -Checking if -1e-07 < OR_95U < inf ...
Fri Feb 2 19:46:18 2024 -Examples of invalid variants(SNPID): 46,47 ...
Fri Feb 2 19:46:18 2024 -Examples of invalid values (OR_95U): -0.01,NA ...
Fri Feb 2 19:46:18 2024 -Removed 2 variants with bad/na OR_95U.
Fri Feb 2 19:46:18 2024 -Checking STATUS and converting STATUS to categories....
Fri Feb 2 19:46:18 2024 -Removed 27 variants with bad statistics in total.
Fri Feb 2 19:46:18 2024 -Data types for each column:
Fri Feb 2 19:46:18 2024 -Column : SNPID CHR POS EA NEA EAF BETA SE Z CHISQ P MLOG10P OR OR_95L OR_95U N N_CASE N_CONTROL DIRECTION STATUS NOTE
Fri Feb 2 19:46:18 2024 -DType : string Int64 Int64 category category float32 float64 float64 float64 float64 float64 float64 float64 float64 float64 Int64 Int64 Int64 object category object
Fri Feb 2 19:46:18 2024 -Verified: T T T T T T T T T T T T T T T T T T T T NA
Fri Feb 2 19:46:18 2024 Finished sanity check for statistics.
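Each "Checking if ..." line in the log is a range test widened by the float tolerance of 1e-07, with missing values counted as invalid. The sketch below reproduces that pattern for a subset of the columns in plain Python. It is an illustration only: `BOUNDS`, `TOL`, and `failed_checks` are hypothetical names, the nominal bounds are inferred from the log lines, and the real method operates on the whole DataFrame rather than one row at a time.

```python
import math

TOL = 1e-7  # "Comparison tolerance for floats" from the log

# Nominal (lower, upper) bounds inferred from the log; only a subset of columns.
BOUNDS = {
    "EAF": (0.0, 1.0),
    "P": (0.0, 1.0),
    "MLOG10P": (0.0, 9999.0),
    "Z": (-9999.0, 9999.0),
    "BETA": (-100.0, 100.0),
    "CHISQ": (0.0, math.inf),
    "SE": (0.0, math.inf),
}

def failed_checks(row):
    """Hypothetical helper: list the columns of `row` that fail their range check."""
    failed = []
    for column, (low, high) in BOUNDS.items():
        if column not in row:
            continue  # column absent: nothing to check
        value = row[column]
        missing = value is None or math.isnan(value)
        # Strict comparison against the tolerance-widened interval.
        if missing or not (low - TOL < value < high + TOL):
            failed.append(column)
    return failed

print(failed_checks({"EAF": 1.02, "P": 0.5}))                 # ['EAF']
print(failed_checks({"BETA": 99999.0, "Z": float("nan")}))    # ['Z', 'BETA']
print(failed_checks({"EAF": 0.996, "BETA": 0.0603, "SE": 0.0103}))  # []
```

A row is removed when any of its statistics fails, which is why a single out-of-range value is enough to drop a variant.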
In [16]:
mysumstats.data
Out[16]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
23 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
29 | 22 | 1 | 22 | AT | GT | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980599 | Not normalizated allelels |
30 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
21 rows × 21 columns
check data consistency¶
In [17]:
mysumstats.check_data_consistency()
Fri Feb 2 19:46:19 2024 Start to check data consistency across columns...v3.4.38
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 -Tolerance: 0.001 (Relative) and 0.001 (Absolute)
Fri Feb 2 19:46:19 2024 -Checking if BETA/SE-derived-MLOG10P is consistent with MLOG10P...
Fri Feb 2 19:46:19 2024 -Variants with inconsistent values were not detected.
Fri Feb 2 19:46:19 2024 -Checking if BETA/SE-derived-P is consistent with P...
Fri Feb 2 19:46:19 2024 -Variants with inconsistent values were not detected.
Fri Feb 2 19:46:19 2024 -Checking if MLOG10P-derived-P is consistent with P...
Fri Feb 2 19:46:19 2024 -Variants with inconsistent values were not detected.
Fri Feb 2 19:46:19 2024 -Checking if N is consistent with N_CASE + N_CONTROL ...
Fri Feb 2 19:46:19 2024 -Not consistent: 1 variant(s)
Fri Feb 2 19:46:19 2024 -Variant SNPID with max difference: 30 with 10000
Fri Feb 2 19:46:19 2024 -Note: if the max difference is greater than expected, please check your original sumstats.
Fri Feb 2 19:46:19 2024 Finished checking data consistency across columns.
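The last check in the log above flags one variant whose N differs from N_CASE + N_CONTROL. A minimal pandas sketch of that comparison follows, using three rows copied from the table. Using `np.isclose` with the logged tolerances (0.001 relative, 0.001 absolute) is an assumption made for illustration; GWASLab may compare the values differently internally.

```python
import numpy as np
import pandas as pd

# Three rows taken from the table above (SNPID 24, 30 and 36).
df = pd.DataFrame({
    "SNPID": ["24", "30", "36"],
    "N": [160000, 160000, 160000],
    "N_CASE": [120000, 130000, 120000],
    "N_CONTROL": [40000, 40000, 40000],
})

# Expected total sample size from the case/control counts.
expected_n = df["N_CASE"] + df["N_CONTROL"]

# Assumption: tolerances mirror the values printed in the log.
consistent = np.isclose(df["N"], expected_n, rtol=0.001, atol=0.001)

inconsistent = df.loc[~consistent, "SNPID"].tolist()
max_diff = int((df["N"] - expected_n).abs().max())

print(inconsistent, max_diff)  # ['30'] 10000
```

Only SNPID 30 fails, with a difference of 10000, which matches the variant reported in the log.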
normalize variants¶
In [18]:
mysumstats.normalize_allele()
Fri Feb 2 19:46:19 2024 Start to normalize indels...v3.4.38
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 -Not normalized allele IDs:22 ...
Fri Feb 2 19:46:19 2024 -Not normalized allele:['AT' 'GT']...
Fri Feb 2 19:46:19 2024 -Modified 1 variants according to parsimony and left alignment principal.
Fri Feb 2 19:46:19 2024 Finished normalizing indels.
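The log above reports that variant 22 (alleles `AT`/`GT`) was modified. The parsimony part of that step can be sketched in plain Python as below. The function `trim_alleles` is a hypothetical helper written for this illustration, not GWASLab's implementation, and it only trims bases shared by both alleles; full left alignment of indels in repeats would also need a reference genome, which this sketch does not use.

```python
def trim_alleles(pos, ea, nea):
    """Trim bases shared by both alleles (hypothetical helper, illustration only)."""
    # Drop shared trailing bases while each allele keeps at least one base.
    while len(ea) > 1 and len(nea) > 1 and ea[-1] == nea[-1]:
        ea, nea = ea[:-1], nea[:-1]
    # Drop shared leading bases; each removal shifts the position by one.
    while len(ea) > 1 and len(nea) > 1 and ea[0] == nea[0]:
        ea, nea = ea[1:], nea[1:]
        pos += 1
    return pos, ea, nea


# Variant 22 from the table: the shared trailing T is removed, POS is unchanged.
print(trim_alleles(22, "AT", "GT"))  # (22, 'A', 'G')

# Variant 23 is already minimal, so it is returned as-is.
print(trim_alleles(23, "AT", "A"))   # (23, 'AT', 'A')
```

The first call reproduces the change visible in the next table, where row 22 goes from `AT`/`GT` to `A`/`G` at the same position.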
In [19]:
mysumstats.data
Out[19]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
2 | 1:1:A:G | 1 | 1 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
3 | 1:2 | 1 | 2 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Multiallelic |
4 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
5 | 1:3:T:A | 1 | 3 | A | T | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Multiallelic |
6 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
7 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
9 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
10 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
11 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
16 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
19 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
23 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
29 | 22 | 1 | 22 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Not normalizated allelels |
30 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
31 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
37 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
43 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
61 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
62 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
21 rows × 21 columns
remove duplicated / multiallelic variants¶
In [20]:
mysumstats.remove_dup(mode="md")
Fri Feb 2 19:46:19 2024 Start to remove duplicated/multiallelic variants...v3.4.38
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 -Removing mode:md
Fri Feb 2 19:46:19 2024 Start to sort the sumstats using P...
Fri Feb 2 19:46:19 2024 Start to remove duplicated variants based on snpid...v3.4.38
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 21 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 -Which variant to keep: first
Fri Feb 2 19:46:19 2024 -Removed 2 based on SNPID...
Fri Feb 2 19:46:19 2024 Start to remove duplicated variants based on CHR,POS,EA and NEA...
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 19 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 -Which variant to keep: first
Fri Feb 2 19:46:19 2024 -Removed 1 based on CHR,POS,EA and NEA...
Fri Feb 2 19:46:19 2024 Start to remove multiallelic variants based on chr:pos...
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 18 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 -Which variant to keep: first
Fri Feb 2 19:46:19 2024 -Removed 1 multiallelic variants...
Fri Feb 2 19:46:19 2024 -Removed 4 variants in total.
Fri Feb 2 19:46:19 2024 -Sort the coordinates based on CHR and POS...
Fri Feb 2 19:46:19 2024 Finished removing duplicated/multiallelic variants.
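The three passes listed in the log above (by SNPID, by CHR/POS/EA/NEA, then by CHR/POS) can be approximated with pandas `drop_duplicates`. This is a sketch on the seven duplicated or multiallelic rows from the earlier table, not GWASLab's own code. It skips the initial sort by P because P is elided in the displayed table and all toy rows share the same statistics, so the row that survives each pass here may differ from the one kept in the real output.

```python
import pandas as pd

# The seven "Duplicated" / "Multiallelic" rows from the table before remove_dup.
df = pd.DataFrame({
    "SNPID": ["1:1:G:A", "1:1:A:G", "1:1:A:G", "1:2", "1:2", "1:3:T:A", "1:3:T:GA"],
    "CHR": [1, 1, 1, 1, 1, 1, 1],
    "POS": [1, 1, 1, 2, 2, 3, 3],
    "EA": ["A", "A", "A", "T", "T", "A", "T"],
    "NEA": ["G", "G", "G", "C", "TAA", "T", "GA"],
})

removed = []

# Pass 1: rows sharing the same SNPID.
step1 = df.drop_duplicates(subset="SNPID", keep="first")
removed.append(len(df) - len(step1))

# Pass 2: rows sharing CHR, POS, EA and NEA.
step2 = step1.drop_duplicates(subset=["CHR", "POS", "EA", "NEA"], keep="first")
removed.append(len(step1) - len(step2))

# Pass 3: multiallelic sites, i.e. rows sharing CHR and POS.
step3 = step2.drop_duplicates(subset=["CHR", "POS"], keep="first")
removed.append(len(step2) - len(step3))

print(removed, sum(removed))  # [2, 1, 1] 4
```

The per-pass counts (2, 1, 1) and the total of 4 agree with the numbers in the log.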
In [21]:
mysumstats.data
Out[21]:
| | SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | Z | CHISQ | ... | MLOG10P | OR | OR_95L | OR_95U | N | N_CASE | N_CONTROL | DIRECTION | STATUS | NOTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1:1:G:A | 1 | 1 | A | G | 0.004 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960099 | Duplicated |
1 | 1:2 | 1 | 2 | T | TAA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Multiallelic |
2 | 1:3:T:GA | 1 | 3 | T | GA | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9960399 | Multiallelic |
3 | 3 | 1 | 6 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
4 | 4 | 1 | 7 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
5 | 5 | 1 | 8 | T | C | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | CHR with prefix |
6 | 13 | 1 | 13 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS float |
7 | 16 | 1 | 16 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Lowercase alleles |
8 | 22 | 1 | 22 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Not normalizated allelels |
9 | 23 | 1 | 23 | AT | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Both Uppercase and lowercases |
10 | 24 | 1 | 24 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | N float |
11 | 30 | 1 | 30 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 130000 | 40000 | --++ | 9980099 | N!=N_CONTROL +N_CASE |
12 | 36 | 1 | 36 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | SE out of range |
13 | 54 | 1 | 54 | AG | A | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980399 | Clean sumstats |
14 | 55 | 1 | 55 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Clean sumstats |
15 | 10 | 1 | 123456789 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | POS with separator |
16 | 1 | 23 | 4 | A | G | 0.996 | 0.0603 | 0.0103 | 5.854369 | 34.273636 | ... | 8.327348 | 1.062155 | 1.040927 | 1.083816 | 160000 | 120000 | 40000 | --++ | 9980099 | Sex chromosomes |
17 rows × 21 columns
sort genome coordinate¶
In [22]:
mysumstats.sort_coordinate()
Fri Feb 2 19:46:19 2024 Start to sort the genome coordinates...v3.4.38
Fri Feb 2 19:46:19 2024 -Current Dataframe shape : 17 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:19 2024 Finished sorting coordinates.
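As a rough picture of what coordinate sorting does, the pandas sketch below orders four rows from the table by CHR and then POS. It assumes both columns are already numeric (with the sex chromosome coded as 23, as in the table) and is not taken from GWASLab's source.

```python
import pandas as pd

# Four rows from the table above, deliberately out of order.
df = pd.DataFrame({
    "SNPID": ["1", "10", "55", "3"],
    "CHR": [23, 1, 1, 1],
    "POS": [4, 123456789, 55, 6],
})

# Sort by chromosome first, then by base-pair position within each chromosome.
sorted_df = df.sort_values(["CHR", "POS"]).reset_index(drop=True)

print(sorted_df["SNPID"].tolist())  # ['3', '55', '10', '1']
```

This is the same relative order seen at the end of the previous table: position 123456789 on chromosome 1 comes after position 55, and the chromosome 23 variant comes last.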
sort column¶
In [23]:
mysumstats.sort_column()
Fri Feb 2 19:46:20 2024 Start to reorder the columns...v3.4.38
Fri Feb 2 19:46:20 2024 -Current Dataframe shape : 17 x 21 ; Memory usage: 19.94 MB
Fri Feb 2 19:46:20 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,Z,CHISQ,P,MLOG10P,OR,OR_95L,OR_95U,N,N_CASE,N_CONTROL,DIRECTION,STATUS,NOTE
Fri Feb 2 19:46:20 2024 Finished reordering the columns.
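Column reordering of the kind shown in the log above can be sketched with a preferred-order list. The order below is copied from the log; appending any column that is not in the list at the end is an assumption of this sketch rather than documented GWASLab behavior.

```python
import pandas as pd

# Column order as printed in the log above.
preferred = [
    "SNPID", "CHR", "POS", "EA", "NEA", "EAF", "BETA", "SE", "Z", "CHISQ",
    "P", "MLOG10P", "OR", "OR_95L", "OR_95U", "N", "N_CASE", "N_CONTROL",
    "DIRECTION", "STATUS", "NOTE",
]

# A toy frame holding a subset of the columns in scrambled order.
df = pd.DataFrame({
    "NOTE": ["Clean sumstats"],
    "POS": [55],
    "SNPID": ["55"],
    "CHR": [1],
})

# Keep the preferred columns that exist, then (assumption) any remaining ones.
ordered = [c for c in preferred if c in df.columns]
ordered += [c for c in df.columns if c not in preferred]
df = df[ordered]

print(list(df.columns))  # ['SNPID', 'CHR', 'POS', 'NOTE']
```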