Skip to content

GWASLab

image

badge Downloads badge_pip badge_commit_m

  • A handy Python-based toolkit for handling GWAS summary statistics (sumstats).
  • Each process is modularized and can be customized to your needs.
  • Sumstats-specific manipulations are designed as methods of a Python object, gwaslab.Sumstats.

4th Anniversary & Major Upgrade

This year marks the 4th year since GWASLab launched. To celebrate this milestone, we will release a major upgrade v4 with significant improvements and new features.

GWASLab v4.0.0

GWASLab v4.0.0 introduces major improvements including:

  • Command Line Interface (CLI): Process sumstats directly from the command line
  • Performance Improvements: Optimized algorithms and data structures for faster processing
  • Visualization Parameter Management: Centralized parameter management system for all plotting functions
  • API Consistency: Standardized parameter names across functions
  • Enhanced Documentation: Comprehensive documentation and tutorials

Installation

Install via pip

The latest version of GWASLab now supports Python 3.9, 3.10, 3.11, and 3.12.

Recommended: Python 3.12

We recommend using Python 3.12 for the best performance and compatibility with GWASLab v4.0.0.

pip install gwaslab

Install via uv

uv is a fast Python package installer and resolver written in Rust.

uv pip install gwaslab

Install in conda environment

Create a Python 3.9, 3.10, 3.11 or 3.12 environment and install gwaslab using pip. We recommend Python 3.12:

conda create -n gwaslab -c conda-forge python=3.12

conda activate gwaslab

pip install gwaslab

or create a new environment using yml file environment.yml

conda env create -n gwaslab -f environment.yml -c conda-forge

Install using docker (deprecated)

A docker file is available here for building local images.

Quick Start

Python API

import gwaslab as gl

# load plink2 output
mysumstats = gl.Sumstats("sumstats.txt.gz", fmt="plink2")

# or load sumstats with auto mode (auto-detecting commonly used headers) 
# assuming ALT/A1 is **EA**, and frq is **EAF**
mysumstats = gl.Sumstats("sumstats.txt.gz", fmt="auto")

# or you can specify the columns:
mysumstats = gl.Sumstats("sumstats.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",
             eaf="Frq",
             beta="BETA",
             se="SE",
             p="P",
             direction="Dir",
             n="N",
             build="19")

# basic quality control
mysumstats.basic_check()

# manhattan and qq plot
mysumstats.plot_mqq()

# extract lead variants
lead_variants = mysumstats.get_lead()

Command Line Interface (CLI) - New in v4.0.0

# Show version
gwaslab version

# Basic QC and output
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab

# Harmonization with reference
gwaslab --input sumstats.tsv --fmt auto --ref-seq ref.fasta --harmonize --out harmonized.tsv --to-fmt gwaslab

# Format conversion
gwaslab --input sumstats.tsv --fmt gwaslab --out sumstats.ldsc --to-fmt ldsc

See CLI documentation for more details.

Documentation and Tutorials

Documentation and tutorials for GWASLab are available at https://cloufield.github.io/gwaslab/.

Features

Loading and Formatting

  • Loading sumstats by simply specifying the software name or format name, or specifying each column name.
  • Converting GWAS sumstats to specific formats:
  • LDSC / MAGMA / METAL / PLINK / SAIGE / REGENIE / MR-MEGA / GWAS-SSF / FUMA / GWAS-VCF / BED...
  • check available formats
  • Optional filtering of variants in commonly used genomic regions: Hapmap3 SNPs / High-LD regions / MHC region

Standardization & Normalization

  • Variant ID standardization
  • CHR and POS notation standardization
  • Variant POS and allele normalization
  • Genome build : Inference and Liftover

Quality Control, Value Conversion & Filtering

  • Statistics sanity check
  • Extreme value removal
  • Equivalent statistics conversion
    • BETA/SE , OR/OR_95L/OR_95U
    • P, Z, CHISQ, MLOG10P
  • Customizable value filtering

Harmonization

  • rsID assignment based on CHR, POS, and REF/ALT (with sweep mode for large datasets)
  • CHR POS assignment based on rsID using a reference text file
  • Palindromic SNPs and indels strand inference using a reference VCF
  • Check allele frequency discrepancy using a reference VCF
  • Reference allele alignment using a reference genome sequence FASTA file

Visualization

  • MQQ plot: Manhattan plot, QQ plot or MQQ plot (with a bunch of customizable features including auto-annotate nearest gene names)
  • Miami plot: Mirrored Manhattan plot
  • Brisbane plot: GWAS hits density plot
  • Regional plot: GWAS regional plot with LD information
  • Genetic correlation heatmap: LDSC-rg genetic correlation matrix
  • Scatter plots:
  • Variant effect size comparison
  • Allele frequency comparison
  • Trumpet plot (plot of MAF and effect size with power lines)

Visualization Examples

image image image image

Other Utilities

  • Read LDSC h2 or rg outputs directly as DataFrames (auto-parsing).
  • Extract lead variants given a sliding window size.
  • Extract novel loci given a list of known lead variants or known loci obtained from GWAS Catalog.
  • Clumping and fine-mapping utilities.
  • LD score regression (LDSC) integration for heritability estimation.
  • Logging: keep a complete record of manipulations applied to the sumstats.
  • Sumstats summary: give you a quick overview of the sumstats.
  • ...

Command Line Interface (CLI) - New in v4.0.0

GWASLab now provides a unified command-line interface for processing GWAS summary statistics. The CLI supports:

  • Quality control (QC)
  • Harmonization
  • Format conversion
  • Batch processing
  • Various output formatting options

See CLI documentation for complete details and examples.

Performance Improvements in v4.0.0

  • Optimized algorithms: Core functions have been reimplemented with optimized algorithms and data structures
  • Sweep mode: New sweep mode for rsID assignment and strand inference, significantly faster for large datasets
  • Efficient data processing: Improved memory usage and processing speed for large datasets
  • Better parallelization: Enhanced multi-threading support

Issues

  • GWASLab is currently under active development, with frequent updates.
  • Note: Known issues are documented at https://cloufield.github.io/gwaslab/KnownIssues/.

How to Cite

  • GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

Sample Data Used for Tutorial

  • Sample GWAS data used in GWASLab is obtained from: http://jenger.riken.jp/ (Suzuki, Ken, et al. "Identification of 28 new susceptibility loci for type 2 diabetes in the Japanese population." Nature genetics 51.3 (2019): 379-386.).

Acknowledgement

Thanks to @sup3rgiu, @soumickmj and @gmauro for their contributions to the source codes.

Contacts

AI Era

With assistance from LLM-powered IDEs, I was finally able to restructure the package as I had long intended. This includes reimplementing functions to reduce overhead, adding comprehensive test modules, writing optimized functions, and improving overall code quality and maintainability. I could not imagine how long it would have taken to accomplish all of this without AI assistance.