QC and filtering
GWASLab provides all-in-one functions and customizable functions for sumstats QC and filtering.
Methods Summary
Sumstats Methods | Options | Description |
---|---|---|
.check_sanity() | n, ncase, ncontrol, beta, se, eaf ... | sanity check for statistics including BETA, SE, Z, CHISQ, EAF, OR, N... |
.check_data_consistency() | | check if BETA/SE-derived P/MLOG10P = original P/MLOG10P, N = N_CASE + N_CONTROL ... |
.remove_dup() | mode="md", keep='first', keep_col="P", remove=False | remove duplicated, multiallelic or NA variants |
.filter_value() | expr, inplace=False | keep variants that satisfy expr |
.filter_flanking_by_id() | snpid, inplace=False | keep variants in the flanking regions of the specified variants (SNPID/rsID) |
.filter_flanking_by_chrpos() | chrpos, inplace=False | keep variants in the flanking regions of the specified positions (CHR, POS tuples) |
.filter_region_in() | path, inplace=False, high_ld=False, build="19" | keep variants in the regions defined by a BED file |
.filter_region_out() | path, inplace=False, high_ld=False, build="19" | remove variants in the regions defined by a BED file |
Statistics Sanity Check
Default parameters have been updated since v3.4.36
.check_sanity(): a basic sanity check is performed on the statistics to detect extreme values or values outside the expected ranges.
Comparisons use float_tolerence = 1e-7 for all float-type statistics. For example, eaf=(0, 1) is converted to eaf=(-1e-7, 1 + 1e-7).
Parameters | Type | Range |
---|---|---|
float_tolerence | float | tolerance for comparison |
n=(0,2**31-1) | integer | 0<N< \(2^{31}-1\) |
ncase=(0,2**31-1) | integer | 0<N_CASE< \(2^{31}-1\) |
ncontrol=(0,2**31-1) | integer | 0<N_CONTROL< \(2^{31}-1\) |
mac=(0,2**31-1) | integer | MAC>=0 |
eaf=(0,1) | float | 0<EAF<1 |
chisq=(0,float("Inf")) | float | CHISQ>0 |
p=(0,1) | float | 0<P<1 (Any P=0 will cause a warning) |
mlog10p=(0,9999) | float | 0<MLOG10P<9999 |
beta=(-100,100) | float | -100<BETA<100 |
z=(-9999,9999) | float | -9999<Z<9999 |
se=(0,float("Inf")) | float | SE>0 |
OR=(-100,100) | float | -100<log(OR)<100 |
OR_95L=(0,float("Inf")) | float | OR_95L>0 |
OR_95U=(0,float("Inf")) | float | OR_95U>0 |
HR=(-100,100) | float | -100<log(HR)<100 |
HR_95L=(0,float("Inf")) | float | HR_95L>0 |
HR_95U=(0,float("Inf")) | float | HR_95U>0 |
info=(0,1) | float | 0<INFO<1 |
direction | string | only contains "+", "-", "0" or "?" |
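The tolerance-widened comparison described above can be pictured with a small standalone sketch. It is not GWASLab's implementation: the helper names, the example records, and the use of strict inequalities after widening are assumptions made for illustration only.

```python
import math

FLOAT_TOLERANCE = 1e-7  # same value as the documented float_tolerence default


def widen(bounds, tol=FLOAT_TOLERANCE):
    """Widen a (lower, upper) range by tol on both sides."""
    lower, upper = bounds
    return (lower - tol, upper + tol)


def in_range(value, bounds, tol=FLOAT_TOLERANCE):
    """Return True if value lies inside the widened range (assumed strict)."""
    if value is None or math.isnan(value):
        return False
    lower, upper = widen(bounds, tol)
    return lower < value < upper


# Hypothetical records; only EAF and BETA are checked in this sketch.
records = [
    {"SNPID": "1:100:A:G", "EAF": 0.0, "BETA": 0.02},   # EAF on the boundary
    {"SNPID": "1:200:C:T", "EAF": 1.2, "BETA": 0.10},   # EAF out of range
    {"SNPID": "1:300:G:A", "EAF": 0.5, "BETA": 250.0},  # BETA out of range
]
ranges = {"EAF": (0, 1), "BETA": (-100, 100)}

passing_ids = [
    r["SNPID"]
    for r in records
    if all(in_range(r[col], bounds) for col, bounds in ranges.items())
]
print(passing_ids)
```

With the widened range, a boundary value such as EAF = 0 still passes, which is the practical effect of the tolerance.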
Remove duplicated or multiallelic variants
After standardizing and normalizing the sumstats, you can also remove duplicated or multiallelic variants using .remove_dup(). Modes can be combined, as in mode="md".
- mode=d: remove duplicate variants.
  - remove duplicate SNPs based on 1. SNPID
  - remove duplicate SNPs based on 2. CHR, POS, EA, and NEA
  - remove duplicate SNPs based on 3. rsID
- mode=s: remove duplicate SNPs based on 1. SNPID
- mode=c: remove duplicate SNPs based on 2. CHR, POS, EA, and NEA
- mode=r: remove duplicate SNPs based on 3. rsID
- mode=m: remove multiallelic SNPs based on 4. CHR, POS
- remove=True: remove NAs
- keep_col: the column used to sort the values (keep_ascend=True: ascending order)
- keep: keep 'first' or 'last'.
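The interaction of keep_col, keep_ascend and keep can be sketched as "sort, then keep one row per key". The function below is a hypothetical pure-Python illustration of that idea, not GWASLab's code; the rows and the function name are made up for the example.

```python
def remove_dup_sketch(rows, keys, keep_col="P", keep_ascend=True, keep="first"):
    """Sort rows by keep_col, then keep one row per unique combination of keys."""
    ordered = sorted(rows, key=lambda r: r[keep_col], reverse=not keep_ascend)
    if keep == "last":
        ordered = ordered[::-1]  # first-seen in reversed order == last occurrence
    seen = set()
    kept = []
    for row in ordered:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept[::-1] if keep == "last" else kept


# Hypothetical sumstats rows.
rows = [
    {"SNPID": "1:100:A:G", "CHR": 1, "POS": 100, "EA": "A", "NEA": "G", "P": 0.04},
    {"SNPID": "1:100:A:G", "CHR": 1, "POS": 100, "EA": "A", "NEA": "G", "P": 0.001},
    {"SNPID": "1:100:A:T", "CHR": 1, "POS": 100, "EA": "A", "NEA": "T", "P": 0.5},
    {"SNPID": "2:300:C:T", "CHR": 2, "POS": 300, "EA": "C", "NEA": "T", "P": 0.2},
]

# Duplicates by SNPID: the copy with the lowest P survives.
dedup = remove_dup_sketch(rows, keys=["SNPID"])
# Multiallelic sites by CHR and POS: again the lowest P survives.
biallelic = remove_dup_sketch(dedup, keys=["CHR", "POS"])
print([r["SNPID"] for r in biallelic])
```

Sorting by P in ascending order and keeping the first row is what makes the retained variant the one with the lowest P.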
Example
sumstats.remove_dup(mode="md",keep='first',keep_col="P",remove=False)
Fri Jan 13 17:34:38 2023 Start to sort the sumstats using P...
Fri Jan 13 17:34:38 2023 Start to remove duplicated variants based on snpid...
Fri Jan 13 17:34:38 2023 -Current Dataframe shape : 9 x 11
Fri Jan 13 17:34:38 2023 -Which variant to keep: first
Fri Jan 13 17:34:38 2023 -Removed 1 based on SNPID...
Fri Jan 13 17:34:38 2023 Start to remove duplicated variants based on rsID...
Fri Jan 13 17:34:38 2023 -Removed 1 based on rsID...
Fri Jan 13 17:34:38 2023 Start to remove duplicated variants based on CHR,POS,EA and NEA...
Fri Jan 13 17:34:38 2023 -Current Dataframe shape : 7 x 11
Fri Jan 13 17:34:38 2023 -Which variant to keep: first
Fri Jan 13 17:34:38 2023 -Removed 1 based on CHR,POS,EA and NEA...
Fri Jan 13 17:34:38 2023 Start to remove multiallelic variants based on chr:pos...
Fri Jan 13 17:34:38 2023 -Which variant to keep: first
Fri Jan 13 17:34:38 2023 -Removed 0 multiallelic variants...
Fri Jan 13 17:34:38 2023 -Removed 3 variants in total.
Fri Jan 13 17:34:38 2023 -Sort the coordinates...
Fri Jan 13 17:34:38 2023 Finished removing successfully!
This will remove duplicated and multiallelic variants and keep the one with the lowest P.
Check data consistency
.check_data_consistency() checks whether the BETA/SE-derived P/MLOG10P equals the original P/MLOG10P, and whether N = N_CASE + N_CONTROL.
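As a rough illustration of what such a check involves, the sketch below derives a two-sided P value from Z = BETA / SE under a normal approximation and compares it with the reported value. The row, the 1% relative tolerance, and the function name are assumptions for this example; the tolerance GWASLab applies is not stated here.

```python
import math


def p_from_beta_se(beta, se):
    """Two-sided P value for a Wald statistic Z = BETA / SE (normal approximation)."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical sumstats row.
row = {"BETA": 0.1, "SE": 0.05, "P": 0.0455, "N": 1000, "N_CASE": 400, "N_CONTROL": 600}

derived_p = p_from_beta_se(row["BETA"], row["SE"])
derived_mlog10p = -math.log10(derived_p)

# Assumed tolerance: 1% relative difference between derived and reported P.
p_consistent = math.isclose(derived_p, row["P"], rel_tol=0.01)
n_consistent = row["N"] == row["N_CASE"] + row["N_CONTROL"]

print(p_consistent, n_consistent)
```

A mismatch in either comparison usually points to a mislabeled column or rounding in the original file, so it is worth inspecting the affected rows before further analysis.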
Filtering variants
.filter_value() filters the sumstats by expr (a wrapper of pandas.DataFrame.query) and returns a new Sumstats object by default. This allows method chaining. For example, you can filter certain variants first and then create a Manhattan plot: mysumstats.filter_value('BETA<0 & CHR==1').plot_mqq().
Options | DataType | Description | Default |
---|---|---|---|
expr | string | the query string used for filtering. For example: '1>BETA>0 & N>10000' | |
inplace | boolean | if False, return a new Sumstats object. If True, the current Sumstats object will be filtered in place. | False |
pd.DataFrame.query()
For the supported expression syntax, please check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
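The difference between inplace=False and inplace=True, and why the default enables method chaining, can be shown with a toy class. MiniSumstats is a hypothetical stand-in and not GWASLab's Sumstats class; it takes a Python callable instead of a query string, and returning None when inplace=True is an assumption borrowed from the usual pandas convention.

```python
class MiniSumstats:
    """Hypothetical stand-in for a Sumstats object (illustration only)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter_value(self, predicate, inplace=False):
        """Keep rows for which predicate(row) is true."""
        kept = [r for r in self.rows if predicate(r)]
        if inplace:
            self.rows = kept  # modify the current object
            return None
        return MiniSumstats(kept)  # new object, so calls can be chained

    def count(self):
        return len(self.rows)


full = MiniSumstats([
    {"CHR": 1, "BETA": -0.2},
    {"CHR": 1, "BETA": 0.3},
    {"CHR": 2, "BETA": -0.1},
])

# Chaining works because a new object is returned; the original is untouched.
n_chained = full.filter_value(lambda r: r["BETA"] < 0 and r["CHR"] == 1).count()
n_before = full.count()

# In-place filtering changes the original object instead.
full.filter_value(lambda r: r["CHR"] == 1, inplace=True)
n_after = full.count()

print(n_chained, n_before, n_after)
```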
Filtering variants in flanking regions
Available since v3.4.37
Extract variants in the specified flanking regions.
Options | DataType | Description | Default |
---|---|---|---|
snpid | list | a list of reference SNPIDs or rsIDs, e.g. ["rs123", "1:123:A:G"] (for .filter_flanking_by_id()) | |
chrpos | list | a list of reference (CHR, POS) tuples, e.g. [(1, 12345), (2, 67891)] (for .filter_flanking_by_chrpos()) | |
windowsizekb | int | flanking window size in kb | 500 |
inplace | boolean | if False, return a new Sumstats object. If True, the current Sumstats object will be filtered in place. | False |
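Conceptually, flanking-region extraction keeps every variant on the same chromosome as a reference position and within the window around it. The sketch below illustrates this with plain tuples; the function name, the window_kb argument (standing in for the window-size option, default 500 kb), and the inclusive window boundaries are assumptions, not GWASLab's implementation.

```python
def in_flanking_window(chrom, pos, references, window_kb=500):
    """Return True if (chrom, pos) is within window_kb of any reference (CHR, POS)."""
    window_bp = window_kb * 1000
    return any(
        chrom == ref_chr and abs(pos - ref_pos) <= window_bp
        for ref_chr, ref_pos in references
    )


# Hypothetical reference position and variants, given as (CHR, POS) tuples.
references = [(1, 1_000_000)]
variants = [
    (1, 600_000),    # 400 kb upstream: inside
    (1, 1_500_000),  # exactly 500 kb downstream: inside (inclusive boundary assumed)
    (1, 1_600_000),  # 600 kb downstream: outside
    (2, 1_000_000),  # same position on another chromosome: outside
]

flanking = [v for v in variants if in_flanking_window(v[0], v[1], references)]
print(flanking)
```

Filtering by SNPID or rsID follows the same logic once each reference ID has been resolved to its CHR and POS.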
Filtering variants in regions using BED
Filter variants in pre-defined regions or regions defined in bed files.
Options | DataType | Description | Default |
---|---|---|---|
path | string | path to the bed files | None |
high_ld | boolean | if True, filter high ld regions using built-in data | False |
inplace | boolean | if False, return a new Sumstats object. If True, the current Sumstats object will be filtered in place. | False |
Example
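The following is a standalone sketch of region-based filtering, not a GWASLab call. It parses BED-style text and splits variants into those inside and outside the regions. BED intervals are 0-based and half-open, so a 1-based POS falls inside an interval when start < POS <= end; how GWASLab treats the boundaries and the "chr" prefix is not stated here, so both choices below are assumptions, as are the function names and data.

```python
def parse_bed(text):
    """Parse BED-style text into (chrom, start, end) tuples, dropping a 'chr' prefix."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        chrom, start, end = line.split()[:3]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions):
    """Return True if a 1-based position lies in any 0-based half-open interval."""
    return any(str(chrom) == c and start < pos <= end for c, start, end in regions)


# Hypothetical BED content and variants given as (CHR, POS) tuples.
bed_text = "chr1\t1000\t2000\nchr2\t500\t800\n"
regions = parse_bed(bed_text)
variants = [(1, 1000), (1, 1001), (1, 2000), (2, 900)]

region_in = [v for v in variants if in_regions(v[0], v[1], regions)]
region_out = [v for v in variants if not in_regions(v[0], v[1], regions)]
print(region_in, region_out)
```

The two result lists correspond to the "filter in" and "filter out" behaviors: every variant ends up in exactly one of them.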
filter_in & filter_out (deprecated)
- gt: greater than
- lt: less than
- eq: equal to
- inplace: True or False. If False, return a dataframe. If True, the Sumstats object will be filtered.