Sumstats Object in GWASLab

In GWASLab, sumstats were stored in a Sumstats Object, which is built on pandas Dataframe. All other functions are designed as methods of this Sumstats Object.

To load any sumstats into the object, simply specify the column name and load the raw GWAS summary statistics from a pandas DataFrame or specify a file path. All raw data will be loaded as "string" datatype.

gl.Sumstats()

mysumstats = gl.Sumstats(
             sumstats,
             fmt=None,
             snpid=None,
             rsid=None,
             chrom=None,
             pos=None,
             ea=None,
             nea=None,
             ...
)

Options

sumstats: either a file path string or a pandas DataFrame

Currently, GWASLab supports the following columns:

Option	DataType	Description	Header in GWASLab
`snpid`	`string`	variant ID column name, preferably in chr:pos:ea:nea format.	`SNPID`
`rsid`	`string`	dbSNP rsID column name	`rsID`

The minimum required columns are just either rsidor snpid. All other columns and options are optional.

Option	DataType	Description	Header in GWASLab
`fmt`	`string`	input sumstats format. For formats supported by GWASLab, please check https://github.com/Cloufield/formatbook	-
`chrom`	`string`	chromosome column name	`CHR`
`pos`	`string`	basepair position column name	`POS`
`ea`	`string`	effect allele column name; BETA, OR, HR, EAF are in reference to EA...	`EA`
`nea`	`string`	non-effect allele column name	`NEA`
`ref`	`string`	reference allele column name; the allele on reference genome	`REF`
`alt`	`string`	alternative allele column name; the allele that is not on reference genome; when `ea`,`ref` and `alt` are specified, `nea` will be inferred.	`ALT`
`eaf`	`string`	effect allele frequency	`EAF`
`neaf`	`string`	non-effect allele frequency. NEAF will be converted to EAF (EAF = 1 - NEAF) while loading.	`EAF`
`n`	`string` or `integer`	sample size column name or just input a single `integer` as sample size for all variants	`N`
`beta`	`string`	effect size beta column name; in reference to EA	`BETA`
`se`	`string`	standard error column name	`SE`
`chisq`	`string`	chi square column name	`CHISQ`
`z`	`string`	z score column name	`Z`
`p`	`string`	p value column name	`P`
`mlog10p`	`string`	-log10(P) column name	`MLOG10P`
`info`	`string`	imputation info or rsq column name	`INFO`
`OR`	`string`	odds ratio column name; in reference to EA	`OR`
`OR_95L`	`string`	odds ratio lower 95% CI column name	`OR_95L`
`OR_95U`	`string`	odds ratio upper 95% CI column name	`OR_95U`
`direction`	`string`	direction column name. GWASLab uses METAL format (e.g. `++--+?+0`)	`DIRECTION`
`other`	`list`	a list of other column names you want to keep with the core columns (probably some annotations).	-
`ncontrol`	`string` or `integer`	sample size column name for controls or just input a single `integer` as sample size for all variants	`N_CONTROL`
`ncase`	`string` or `integer`	sample size column name for cases or just input a single `integer` as sample size for all variants	`N_CASE`
`HR`	`string`	hazrad ratio column name	`HR`
`HR_95U`	`string`	hazrad ratio upper 95% CI column name	`HR_95U`
`HR_95L`	`string`	hazrad ratio lower 95% CI column name	`HR_95L`
`beta_95U`	`string`	beta upper 95% CI column name	`BETA_95U`
`beta_95L`	`string`	beta lower 95% CI column name	`BETA_95L`
`i2`	`string`	I2 column name	`I2_HET`
`phet`	`string`	heterogeneity test P value	`P_HET`
`dof`	`string` or `integer`	degree of freedom	`DOF`
`snpr2`	`string`	column name for proportion of phenotypic variance explained by each variant	`SNPR2`
`maf`	`string`	minor allele frequency column header	`MAF`
`f`	`string`	F statistics column header	`F`
`status`	`string`	status code column name. GWASLab uses a 7-digit vairant status code. For details, please check status code page.	`STATUS`
`verbose`	`boolean`	if True, print log.	-
`build`	`string`	genome build. `19` for hg19, `38` for hg38 and `99` for unknown.	The first two digits of `STATUS`
`**arg`	`string`	additional parameters for pd.read_table() function. Some common options include : `sep`,`nrows`, `skiprows` and `na_values`.	-

Loading sumstats

Load by specifying columns

You can load the sumstats by specifying the columns like:

Load sumstats by manually specifying columns

mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNPID",
             chrom="CHR",
             pos="POS",
             ea="Allele2",
             nea="Allele1",
             eaf="AF_Allele2",
             beta="BETA",
             se="SE",
             p="p.value",
             n="N")

Load by specifying formats

GWASLab supports common sumstats formats and you can load sumstats by specifying fmt.

GWASLab uses a manually curated format conversion dictionary in https://github.com/Cloufield/formatbook. Currently, it supports the following formats:

`Keyword`	`Description`
`ssf`	GWAS-SSF
`gwascatalog`	GWAS Catalog format
`pgscatalog`	PGS Catalog format
`plink`	PLINK output format
`plink2`	PLINK2 output format
`saige`	SAIGE output format
`regenie`	output format
`fastgwa`	output format
`metal`	output format
`mrmega`	output format
`fuma`	input format
`ldsc`	input format
`locuszoom`	input format
`vcf`	gwas-vcf format
`bolt_lmm`	output format

Update Formatbook using gl.update_formaybook()

gl.update_formatbook()
Mon Jul 17 17:38:11 2023 Updating formatbook from: https://raw.github.com/Cloufield/formatbook/main/formatbook.json
Mon Jul 17 17:38:12 2023 Overwrite formatbook to :  /home/yunye/gwaslab/gwaslab/src/gwaslab/data/formatbook.json
Mon Jul 17 17:38:12 2023 Available formats: auto,bolt_lmm,fastgwa,gwascatalog,gwascatalog_hm,gwaslab,ldsc,metal,mrmega,mtag,pgscatalog,pgscatalog_hm,pheweb,plink,plink2,regenie,saige,ssf,template,vcf
Mon Jul 17 17:38:12 2023 Formatbook has been updated!

Load sumstats by simply specifying the format

mysumstats = gl.Sumstats("t2d_bbj.txt.gz", fmt="saige")

Load sumstats with `auto` mode

GWASLab also provides an auto mode (fmt="auto"; available since v3.4.21) which assumes A1 or alternative allele (ALT) is the effect allele (EA) and Frq refers to the allele frequency of effect allele (EAF). Common headers will be detected. You can find the conversion table here

Load sumstats with auto mode

mysumstats = gl.Sumstats("t2d_bbj.txt.gz", fmt="auto")

Load sumstats from chromosome-separated files

GWASLab supports loading sumstats from chromosome-separated files (file names need to be in the same pattern.). Just use @ to replace the chromosome numbers.

Example

mysumstats = gl.Sumstats("t2d.chr@.txt.gz",fmt="metal")

Check and save sumstats

After loading, the raw data columns will be renamed to new columns without ambiguity and the DataFrame is stored in .data :

Example

mysumstats.data

You can simply save the processed data using pandas saving functions, for example:

Example

mysumstats.data.to_csv("./mysumstats.csv")

or convert the sumstats to other sumstats using GWASLab to_format() function (recommended):

Example

mysumstats.to_format("./mysumstats", fmt="ldsc",hapmap3=True, exclude_hla=True, build="19")

Please check GWASLab - Format for more details.

Saving half-finished Sumstats Object

If the pipeline is very long, and you need to temporarily save the Sumstats Object, you can use the .dump_pickle() method to temporarily save the Sumstats Object.

Please check GWASLab - Pickle for more details.

Logging

All manipulation conducted to the sumstats will be logged for reproducibility and traceability.

The log is stored in a gl.Log() object. You can check it by .log.show()and save it using .log.save()

Example

mysumstats.log.show()

mysumstats.log.save()

Sumstats summary

You can check the meta information of sumstats by:

Example

mysumstats.summary()

Other functions

Other functions of GWASLab are implemented as the methods of Sumstats Object.

Example

mysumstats.basic_check()

mysumstats.plot_mqq()

...