Skip to content

Command Line Interface (CLI)

GWASLab provides a unified command-line interface for processing GWAS summary statistics. The CLI supports quality control (QC), harmonization, format conversion, and various output formatting options.

Warning — heavy development: The CLI is changing quickly (flags, defaults, and behavior may shift between releases). Pin a GWASLab version for reproducible pipelines, check gwaslab --help after upgrading, and expect occasional breaking changes until the interface stabilizes.

Basic Usage

The CLI follows a unified interface pattern:

gwaslab --input <file> --fmt <format> [--options] --to-fmt <format> --out <file>

Quick Examples

# Show version
gwaslab version

# Show default and current config paths
gwaslab config

# Show one configured path by key
gwaslab config show config
gwaslab config show reference

# Resolve built-in path key
gwaslab path config

# List all formats in formatbook
gwaslab formatbook list

# Show one format mapping
gwaslab formatbook show metal

# Reference catalog + downloads (see "Download" and "list ref" sections)
gwaslab list ref --available
gwaslab download ref 1kg_eas_hg19
# Sumstats download: output dir is -o / --output-dir / -d / --directory (all equivalent)
gwaslab download sumstats GCST90270926 --directory downloads
gwaslab download-sumstats GCST90270926 --output-dir downloads

# Basic QC and output
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab

# Harmonization with reference
gwaslab --input sumstats.tsv --fmt auto --ref-seq ref.fasta --harmonize --out harmonized.tsv --to-fmt gwaslab

# Format conversion only
gwaslab --input sumstats.tsv --fmt gwaslab --out sumstats.ldsc --to-fmt ldsc

# Plot (CLI runs fix_chr + fix_pos if basic_check was not run)
gwaslab --input sumstats.tsv --plot manhattan --out manhattan.png

# Assign rsID (CLI runs fix_chr + fix_pos if basic_check was not run)
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-vcf /path/to/rsid.vcf.gz --out output.tsv --to-fmt gwaslab

# Liftover (CLI runs fix_chr + fix_pos if basic_check was not run)
gwaslab --input sumstats.tsv --liftover 19 38 --out lifted_hg38.tsv

# Infer build (hg19/hg38) from HapMap3 coordinates
gwaslab --input sumstats.tsv --infer-build --out inferred.tsv

# Extract lead signals
gwaslab --input sumstats.tsv --get lead --out lead.tsv

Utility Subcommands

config

Inspect GWASLab path configuration and query a single configured path.

# Show default + current path config
gwaslab config

# Show one configured path by keyword
gwaslab config show config
gwaslab config show reference

# JSON output (easy to parse)
gwaslab config --json

Options:

Option Description
--json Print as JSON
show <keyword> Show JSON content for config/reference/formatbook; otherwise print resolved path

path

Resolve a local path by built-in key or downloaded reference keyword.

# Built-in keys
gwaslab path config
gwaslab path reference
gwaslab path formatbook
gwaslab path data_directory

# Downloaded reference keyword
gwaslab path <downloaded_keyword>

Options:

Option Description
keyword Built-in key or downloaded reference keyword

formatbook

Inspect and update format definitions in the formatbook (e.g., saige, metal).

# List all available formats
gwaslab formatbook list

# Show mapping for one format
gwaslab formatbook show saige

# JSON output
gwaslab formatbook list --json

# Update local formatbook from remote repository
gwaslab formatbook update

Actions and options:

Command Description
formatbook list List available formats in formatbook
formatbook show <format> Show header mapping for one format
formatbook update Update formatbook from remote source
--json Print output in JSON format (list only)

Download (references and GWAS Catalog sumstats)

GWASLab keeps flat legacy commands (download-ref, download-sumstats) and adds a grouped form so documentation and flags line up:

Goal Grouped command Legacy (same behavior)
Fetch a packaged reference by keyword gwaslab download ref KEY gwaslab download-ref KEY
Fetch GWAS Catalog sumstats by GCST gwaslab download sumstats GCST… gwaslab download-sumstats GCST…

Unified output directory flag for sumstats: any of -o, --output-dir, -d, or --directory sets the download folder (same underlying option).

# References
gwaslab download ref 1kg_eas_hg19
gwaslab download ref 1kg_eas_hg19 --directory ~/.gwaslab --overwrite
gwaslab download-ref 1kg_eas_hg19 --directory ~/.gwaslab

# GWAS Catalog sumstats
gwaslab download sumstats GCST90270926
gwaslab download sumstats GCST90270926 --directory ./downloads
gwaslab download-sumstats GCST90270926 -o ./downloads

Reference downloads also accept --local-filename and --overwrite (see gwaslab download ref --help).

list ref

List available reference keywords (from the bundled/updated reference catalog) and/or downloaded entries registered in your local config. With no scope flags, both sections are shown.

gwaslab list ref
gwaslab list ref --available
gwaslab list ref --downloaded
gwaslab list ref --available --downloaded --json
Option Description
--available Only keywords you can install with gwaslab download ref …
--downloaded Only keywords already recorded under downloaded in config
--json Machine-readable output
-q / --quiet Less library logging

GWAS Catalog sumstats do not have a browseable list command in the CLI (you supply a GCST… ID).

Command Structure

Required Arguments

Argument Description Example
--input Input sumstats file path (required for main processing mode) --input data/sumstats.tsv

Optional Arguments

Argument Description Default
--out Output file path (prefix) None
--to-fmt Output format gwaslab
--tab-fmt Tabular format (tsv, csv, parquet) tsv
--nrows Number of rows to read (for testing) None
--quiet Suppress output messages False
--threads Number of threads for parallel processing 1

Many other flags (--fix-chr, --fix-chr-pos, variant filters, --get, --plot-chr, harmonization, liftover, etc.) are summarized in Processing Options below and in gwaslab --help.

Plot / Get shared optional arguments

Argument Description Default
--sig-level Significance threshold used in Manhattan/MQQ/Regional plotting 5e-8
--ylim <min> <max> Y-axis limits for plotting None
--highlight <id ...> Variant IDs (e.g. rsID/SNPID) to highlight in Manhattan/MQQ/Regional plots None
--sig-level-extract P-value threshold for --get operations 5e-8
--windowsizekb Window size (kb) for --get lead 500

Processing Options

When multiple flags are used in one command, the CLI applies steps in this order: optional fix_* flags → --qc / remove / dedup / normalize → --filter-regionvariant filters (--extract / --exclude / BED / --chr / MAF / MAC / --snps-only / --min-info) → harmonization → assign-rsid → rsid-to-chrpos → infer-build → liftover → plot (if any) → --get (if set) → to_format output. See gwaslab --help for the authoritative list.

Quality Control (QC)

Perform quality control on sumstats using basic_check():

# Basic QC
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab

# QC with remove bad variants
gwaslab --input sumstats.tsv --fmt auto --qc --remove --out cleaned.tsv --to-fmt gwaslab

# QC with remove duplicates
gwaslab --input sumstats.tsv --fmt auto --qc --remove-dup --out cleaned.tsv --to-fmt gwaslab

# QC with normalize indels
gwaslab --input sumstats.tsv --fmt auto --qc --normalize --out cleaned.tsv --to-fmt gwaslab

# QC with all options
gwaslab --input sumstats.tsv --fmt auto --qc --remove --remove-dup --normalize --threads 4 --out cleaned.tsv --to-fmt gwaslab

QC Options:

Option Description
--qc Perform quality control (basic_check)
--remove Remove bad quality variants detected during QC
--remove-dup Remove duplicated or multi-allelic variants
--normalize Normalize indels (e.g., ATA:AA -> AT:A)

Optional coordinate and ID fixes

Run individual Sumstats.fix_*() steps without full --qc. These run before QC when combined in one command. Use them to normalize columns before export or downstream steps.

Option Description
--fix-chr fix_chr() only (chromosome notation)
--fix-pos fix_pos() only (position dtype / range)
--fix-chr-pos fix_chr() then fix_pos() (same as --fix-chr --fix-pos)
--fix-chr-pos-allele fix_chr(), fix_pos(), and fix_allele()
--fix-allele fix_allele() only (allele notation)
--fix-id fix_id() only (SNPID / rsID column)

If both fix_chr and fix_pos run (via --fix-chr-pos or the pair --fix-chr --fix-pos), the CLI treats coordinates as ready for steps that normally auto-run fix_chr + fix_pos (e.g. liftover, --assign-rsid).

Subset rows after --filter-region (if any) and before harmonization, assign-rsid, liftover, plotting, or --get. List-based filters require a SNPID or rsID column. BED and chromosome filters run fix_chr + fix_pos first if coordinates are not already normalized.

See also examples/10_cli/09_variant_filters.sh for runnable demos (uses ../../src when run from a git checkout).

# Keep only IDs in a file (one per line; # comments and extra columns ignored)
gwaslab --input sumstats.tsv --extract keep.txt --out subset.tsv --to-fmt gwaslab

# Drop IDs from a file
gwaslab --input sumstats.tsv --exclude drop.txt --out pruned.tsv --to-fmt gwaslab

# Autosomes only (example)
gwaslab --input sumstats.tsv --chr 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 --out autosomes.tsv --to-fmt gwaslab

# MAF / MAC (uses MAF, or EAF/FRQ as min(f, 1−f); MAC from column or 2×N×MAF if N present)
gwaslab --input sumstats.tsv --maf 0.01 --max-maf 0.5 --mac 20 --out filtered.tsv --to-fmt gwaslab

# SNPs only; imputation quality
gwaslab --input sumstats.tsv --snps-only --min-info 0.8 --out qc.tsv --to-fmt gwaslab

Variant filter options:

Option Description Default
--extract Path to file of variant IDs to keep (first column per line) None
--exclude Path to file of variant IDs to remove None
--extract-bed BED path: keep variants overlapping intervals (0-based half-open; uses --build) None
--exclude-bed BED path: remove variants overlapping intervals None
--chr Keep only listed chromosomes (space-separated; PLINK 2 --chr analog). For --plot regional, a single --chr together with --start and --end is interpreted as the plot chromosome instead of a filter None
--maf Minimum MAF None
--max-maf Maximum MAF None
--mac Minimum MAC (uses MAC column, or 2 × N × MAF when N and a frequency column exist) None
--snps-only Keep rows where EA and NEA are single-nucleotide False
--min-info Minimum INFO (requires a column named INFO, case-insensitive) None

Harmonization

Harmonize sumstats with reference data:

# Basic harmonization (without reference files)
gwaslab --input sumstats.tsv --fmt auto --harmonize --out harmonized.tsv --to-fmt gwaslab

# Harmonization with reference sequence for allele flipping
gwaslab --input sumstats.tsv --fmt auto --harmonize --ref-seq /path/to/reference.fasta --out harmonized.tsv --to-fmt gwaslab

# Harmonization with rsID assignment
gwaslab --input sumstats.tsv --fmt auto --harmonize --ref-rsid-vcf /path/to/reference.vcf.gz --out harmonized.tsv --to-fmt gwaslab

# Full harmonization pipeline
gwaslab --input sumstats.tsv --fmt auto \
  --harmonize \
  --ref-seq /path/to/reference.fasta \
  --ref-rsid-vcf /path/to/rsid.vcf.gz \
  --ref-infer /path/to/inference.vcf.gz \
  --maf-threshold 0.40 \
  --ref-maf-threshold 0.4 \
  --sweep-mode \
  --threads 8 \
  --out harmonized.tsv \
  --to-fmt gwaslab

Harmonization Options:

Option Description Default
--harmonize Perform harmonization False
--basic-check Run basic QC in harmonization True
--no-basic-check Skip basic QC in harmonization -
--ref-seq Reference sequence file (FASTA) for allele flipping None
--ref-rsid-tsv Reference rsID HDF5 file (legacy name, accepts HDF5 path) None
--ref-rsid-vcf Reference rsID VCF/BCF file None
--ref-infer Reference VCF/BCF file for strand inference None
--ref-alt-freq INFO field name for ALT allele frequency when using --ref-infer AF
--ref-maf-threshold MAF threshold for reference 0.4
--maf-threshold MAF threshold for sumstats 0.40
--sweep-mode Use sweep mode for large datasets False

Infer Build

Infer genome build (hg19/hg38) from HapMap3 SNP coordinates:

gwaslab --input sumstats.tsv --infer-build --out inferred.tsv --to-fmt gwaslab

Infer Build Options:

Option Description Default
--infer-build Infer genome build from HapMap3 coordinates False

Assign rsID

Assign rsID to variants using reference data:

# Assign rsID from HDF5 file
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-tsv /path/to/rsid.hdf5 --out output.tsv --to-fmt gwaslab

# Assign rsID from VCF file
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-vcf /path/to/rsid.vcf.gz --overwrite empty --out output.tsv --to-fmt gwaslab

Assign rsID Options:

Option Description Default
--assign-rsid Assign rsID to variants (auto runs fix_chr + fix_pos if basic_check not run) False
--ref-rsid-tsv Reference rsID HDF5 file (legacy name, accepts HDF5 path) None
--ref-rsid-vcf Reference rsID VCF/BCF file None
--overwrite Overwrite mode (all, invalid, empty) empty
--threads Number of threads for parallel processing 1

rsID to CHR:POS

Convert rsID to CHR:POS coordinates:

# Convert rsID to CHR:POS using VCF (auto-generates HDF5)
gwaslab --input sumstats.tsv --fmt auto --rsid-to-chrpos --ref-rsid-vcf /path/to/reference.vcf.gz --build 19 --out output.tsv --to-fmt gwaslab

# Convert rsID to CHR:POS using existing HDF5 file
gwaslab --input sumstats.tsv --fmt auto --rsid-to-chrpos --ref-rsid-tsv /path/to/reference.hdf5 --build 19 --out output.tsv --to-fmt gwaslab

rsID to CHR:POS Options:

Option Description Default
--rsid-to-chrpos Convert rsID to CHR:POS False
--ref-rsid-vcf Reference VCF file for rsID to CHR:POS conversion (auto-generates HDF5) None
--ref-rsid-tsv Reference HDF5 file path for rsID to CHR:POS conversion None
--build Genome build version 19
--threads Number of threads for parallel processing 4 (when using rsid-to-chrpos)

Liftover

Convert coordinates between genome builds:

# Liftover from hg19 to hg38
gwaslab --input sumstats.tsv --fmt auto --liftover 19 38 --out lifted_hg38.tsv --to-fmt gwaslab

Liftover Notes:

  • CLI auto-runs fix_chr + fix_pos before liftover if basic_check was not run.
  • You can still run --qc earlier in the same command when full QC is preferred.

Plotting

Generate plots from one input sumstats file:

# Manhattan
gwaslab --input sumstats.tsv --plot manhattan --out manhattan.png

# QQ
gwaslab --input sumstats.tsv --plot qq --out qq.png

# Combined Manhattan+QQ
gwaslab --input sumstats.tsv --plot mqq --out mqq.png

# Regional (explicit chromosome flag)
gwaslab --input sumstats.tsv --plot regional --plot-chr 6 --start 26000000 --end 34000000 --out region.png

# Regional (legacy: one --chr with --start/--end is treated as the plot region, not a chromosome filter)
gwaslab --input sumstats.tsv --plot regional --chr 6 --start 26000000 --end 34000000 --out region.png

Forest plots are not supported on the CLI; use the Python API (gl.plot_forest(), see Forest plot).

Plot Options:

Option Description Default
--plot Plot type: manhattan, qq, mqq, regional, miami None
--sig-level Significance threshold for Manhattan/MQQ/Regional plots 5e-8
--ylim Y-axis range for Manhattan/MQQ/Regional plots (min max) None
--highlight Variant IDs to highlight in Manhattan/MQQ/Regional plots None
--plot-chr Chromosome for --plot regional (use with --start, --end) None
--start, --end Genomic interval for --plot regional (1-based positions as in CLI) None
--chr With --plot regional: if exactly one value is given and --start/--end are set, that value is the regional chromosome (same role as --plot-chr). With multiple values, or without regional plotting, --chr is a variant filter (see Variant filters) None

Notes: - --plot miami is currently not available from single-input CLI mode and exits with guidance to use Python API. - As with liftover/assign-rsid, CLI runs fix_chr + fix_pos before plotting if basic_check was not run.

Get / variant lists

--get writes lead, novel, or proxy results to --out / --output and exits (does not run the same path as --extract file filtering). --extract is only a variant-ID list for in-pipeline filtering (see Variant filters).

Extract lead or novel variants:

# Lead variants
gwaslab --input sumstats.tsv --get lead --out lead.tsv

# Novel variants with GWAS Catalog EFO trait(s)
gwaslab --input sumstats.tsv --get novel --efo EFO_0004340 --out novel.tsv

--get options:

Option Description Default
--get lead, novel, or proxy — write results to --out / --output, then exit (proxy not yet implemented) None
--sig-level-extract P-value threshold for --get 5e-8
--windowsizekb Lead-variant window size (kb) 500
--efo One or more EFO IDs for novel extraction None
--only-novel Return only truly novel hits in novel extraction False

Output Formatting Options

Basic Formatting

# Output in gwaslab format (default)
gwaslab --input sumstats.tsv --fmt auto --out output --to-fmt gwaslab

# Output in LDSC format
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt ldsc

# Output in PLINK format
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt plink

# Output as CSV instead of TSV
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --tab-fmt csv

# Output without compression
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --no-gzip

Advanced Formatting

# Output with bgzip compression and tabix index
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --bgzip --tabix

# Extract only HapMap3 variants
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --hapmap3

# Exclude HLA region
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --exclude-hla

# Exclude HLA with custom range (in Mbp)
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --exclude-hla --hla-lower 20 --hla-upper 30

# Add chromosome prefix
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --chr-prefix chr

# Use numeric notation for X, Y, MT (23, 24, 25)
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --xymt-number

# Add N column with specified value
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --n 10000

Output Formatting Options:

Option Description Default
--to-fmt Output format gwaslab
--tab-fmt Tabular format (tsv, csv, parquet) tsv
--no-gzip Disable gzip compression False (gzip enabled)
--bgzip Use bgzip compression False
--tabix Create tabix index (requires bgzip) False
--hapmap3 Extract HapMap3 variants only False
--exclude-hla Exclude HLA region False
--hla-lower HLA region lower bound (Mbp) 25
--hla-upper HLA region upper bound (Mbp) 34
--n Add N column with specified value None
--chr-prefix Prefix for chromosome column ""
--xymt-number Use numeric notation for X, Y, MT False

For complete workflow examples, see CLI Workflow Examples.

Output File Naming

The CLI follows a consistent naming pattern for output files:

  • Basic format: {output_path}.{format}.{tab_fmt}.gz
  • Example: output.gwaslab.tsv.gz

  • With filters: {output_path}.{filter}.{format}.{tab_fmt}.gz

  • Example: output.hapmap3.gwaslab.tsv.gz (with --hapmap3)
  • Example: output.noMHC.gwaslab.tsv.gz (with --exclude-hla)

  • Without compression: {output_path}.{format}.{tab_fmt}

  • Example: output.gwaslab.tsv (with --no-gzip)

  • Log file: {output_path}.{format}.log

  • Example: output.gwaslab.log

Tips and Best Practices

1. Input Format Detection

Use --fmt auto to let GWASLab automatically detect the input format:

gwaslab --input sumstats.tsv --fmt auto --qc --out output --to-fmt gwaslab

2. Parallel Processing

Use --threads to speed up processing for large files:

gwaslab --input large_sumstats.tsv --fmt auto --qc --threads 8 --out output --to-fmt gwaslab

3. Quiet Mode

Use --quiet to suppress verbose output in scripts:

gwaslab --input sumstats.tsv --fmt auto --qc --out output --to-fmt gwaslab --quiet

4. Testing with Small Samples

Use --nrows to test commands on a subset of data:

gwaslab --input large_sumstats.tsv --fmt auto --nrows 1000 --qc --out test_output --to-fmt gwaslab

5. Combining Operations

You can combine multiple processing steps in a single command:

# QC + Harmonization + Format conversion
gwaslab --input sumstats.tsv --fmt auto \
  --qc --remove-dup \
  --harmonize --ref-seq ref.fasta \
  --out output --to-fmt gwaslab

6. Reference Files

For harmonization, you typically need: - Reference sequence (FASTA): For allele flipping - rsID reference (VCF/TSV): For rsID assignment - Inference reference (VCF): For strand inference of palindromic SNPs

Download reference files from: - dbSNP - 1000 Genomes - UCSC Genome Browser

7. Memory Considerations

For very large files: - Use --sweep-mode for harmonization (faster for large datasets) - Process in chunks if memory is limited - Consider using --nrows for testing first

Common Use Cases

Use Case 1: Quick Format Check

# Just load and check the file (no output)
gwaslab --input sumstats.tsv --fmt auto

Use Case 2: Standard QC Workflow

gwaslab --input sumstats.tsv --fmt auto \
  --qc --remove --remove-dup --normalize \
  --out qc_sumstats --to-fmt gwaslab

Use Case 3: Prepare for LDSC

gwaslab --input sumstats.tsv --fmt gwaslab \
  --out ldsc_input --to-fmt ldsc --no-gzip

Use Case 4: Prepare for Meta-analysis

gwaslab --input sumstats.tsv --fmt auto \
  --qc --remove-dup \
  --harmonize --ref-seq ref.fasta --ref-rsid-vcf dbsnp.vcf.gz \
  --out meta_ready --to-fmt gwaslab --exclude-hla

Use Case 5: Extract Replication Set

gwaslab --input discovery.tsv --fmt gwaslab \
  --out replication --to-fmt gwaslab --hapmap3 --build 19

Troubleshooting

Issue: File not found

Error: FileNotFoundError: [Errno 2] No such file or directory

Solution: Check that the input file path is correct and accessible.

Issue: Format detection fails

Error: Format auto-detection doesn't work

Solution: Specify the format explicitly with --fmt:

gwaslab --input sumstats.tsv --fmt gwaslab --qc --out output --to-fmt gwaslab

Issue: Memory errors with large files

Error: Out of memory errors

Solution: - Use --threads 1 to reduce memory usage - Process in smaller chunks - Use --sweep-mode for harmonization

Issue: Reference file errors

Error: Reference file not found or invalid

Solution: - Verify reference file paths are correct - Check that reference files are properly formatted - Ensure VCF files are indexed if using VCF references

Getting Help

For more information:

# Show help message
gwaslab --help

# Show version
gwaslab version

For detailed documentation on specific functions, see: - QC & Filtering - Harmonization - Format - Standardization

Supported Formats

GWASLab supports many input and output formats through the formatbook repository. Common formats include:

  • Input: auto, gwaslab, plink, ldsc, vcf, and many more
  • Output: gwaslab, ldsc, plink, plink2, saige, fastgwa, regenie, vcf, and many more

Check the formatbook repository for the complete list of supported formats.