Skip to content

Command Line Interface (CLI)

GWASLab provides a unified command-line interface for processing GWAS summary statistics. The CLI supports quality control (QC), harmonization, format conversion, and various output formatting options.

Basic Usage

The CLI follows a unified interface pattern:

gwaslab --input <file> --fmt <format> [--options] --to-fmt <format> --out <file>

Quick Examples

# Show version
gwaslab version

# Basic QC and output
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab

# Harmonization with reference
gwaslab --input sumstats.tsv --fmt auto --ref-seq ref.fasta --harmonize --out harmonized.tsv --to-fmt gwaslab

# Format conversion only
gwaslab --input sumstats.tsv --fmt gwaslab --out sumstats.ldsc --to-fmt ldsc

Command Structure

Required Arguments

Argument Description Example
--input Input sumstats file path --input data/sumstats.tsv
--fmt Input format (default: auto) --fmt gwaslab or --fmt auto

Optional Arguments

Argument Description Default
--out Output file path (prefix) None
--to-fmt Output format gwaslab
--tab-fmt Tabular format (tsv, csv, parquet) tsv
--nrows Number of rows to read (for testing) None
--quiet Suppress output messages False
--threads Number of threads for parallel processing 1

Processing Options

Quality Control (QC)

Perform quality control on sumstats using basic_check():

# Basic QC
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab

# QC with remove bad variants
gwaslab --input sumstats.tsv --fmt auto --qc --remove --out cleaned.tsv --to-fmt gwaslab

# QC with remove duplicates
gwaslab --input sumstats.tsv --fmt auto --qc --remove-dup --out cleaned.tsv --to-fmt gwaslab

# QC with normalize indels
gwaslab --input sumstats.tsv --fmt auto --qc --normalize --out cleaned.tsv --to-fmt gwaslab

# QC with all options
gwaslab --input sumstats.tsv --fmt auto --qc --remove --remove-dup --normalize --threads 4 --out cleaned.tsv --to-fmt gwaslab

QC Options:

Option Description
--qc Perform quality control (basic_check)
--remove Remove bad quality variants detected during QC
--remove-dup Remove duplicated or multi-allelic variants
--normalize Normalize indels (e.g., ATA:AA -> AT:A)

Harmonization

Harmonize sumstats with reference data:

# Basic harmonization (without reference files)
gwaslab --input sumstats.tsv --fmt auto --harmonize --out harmonized.tsv --to-fmt gwaslab

# Harmonization with reference sequence for allele flipping
gwaslab --input sumstats.tsv --fmt auto --harmonize --ref-seq /path/to/reference.fasta --out harmonized.tsv --to-fmt gwaslab

# Harmonization with rsID assignment
gwaslab --input sumstats.tsv --fmt auto --harmonize --ref-rsid-vcf /path/to/reference.vcf.gz --out harmonized.tsv --to-fmt gwaslab

# Full harmonization pipeline
gwaslab --input sumstats.tsv --fmt auto \
  --harmonize \
  --ref-seq /path/to/reference.fasta \
  --ref-rsid-vcf /path/to/rsid.vcf.gz \
  --ref-infer /path/to/inference.vcf.gz \
  --ref-alt-freq AF \
  --maf-threshold 0.40 \
  --ref-maf-threshold 0.4 \
  --sweep-mode \
  --threads 8 \
  --out harmonized.tsv \
  --to-fmt gwaslab

Harmonization Options:

Option Description Default
--harmonize Perform harmonization False
--basic-check Run basic QC in harmonization True
--no-basic-check Skip basic QC in harmonization -
--ref-seq Reference sequence file (FASTA) for allele flipping None
--ref-rsid-tsv Reference rsID HDF5 file (legacy name, accepts HDF5 path) None
--ref-rsid-vcf Reference rsID VCF/BCF file None
--ref-infer Reference VCF/BCF file for strand inference None
--ref-alt-freq Allele frequency field name in VCF INFO AF
--ref-maf-threshold MAF threshold for reference 0.4
--maf-threshold MAF threshold for sumstats 0.40
--sweep-mode Use sweep mode for large datasets False

Assign rsID

Assign rsID to variants using reference data:

# Assign rsID from HDF5 file
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-tsv /path/to/rsid.hdf5 --out output.tsv --to-fmt gwaslab

# Assign rsID from VCF file
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-vcf /path/to/rsid.vcf.gz --overwrite empty --out output.tsv --to-fmt gwaslab

Assign rsID Options:

Option Description Default
--assign-rsid Assign rsID to variants False
--ref-rsid-tsv Reference rsID HDF5 file (legacy name, accepts HDF5 path) None
--ref-rsid-vcf Reference rsID VCF/BCF file None
--overwrite Overwrite mode (all, invalid, empty) empty
--threads Number of threads for parallel processing 1

rsID to CHR:POS

Convert rsID to CHR:POS coordinates:

# Convert rsID to CHR:POS using VCF (auto-generates HDF5)
gwaslab --input sumstats.tsv --fmt auto --rsid-to-chrpos --ref-rsid-vcf /path/to/reference.vcf.gz --build 19 --out output.tsv --to-fmt gwaslab

# Convert rsID to CHR:POS using existing HDF5 file
gwaslab --input sumstats.tsv --fmt auto --rsid-to-chrpos --ref-rsid-tsv /path/to/reference.hdf5 --build 19 --out output.tsv --to-fmt gwaslab

rsID to CHR:POS Options:

Option Description Default
--rsid-to-chrpos Convert rsID to CHR:POS False
--ref-rsid-vcf Reference VCF file for rsID to CHR:POS conversion (auto-generates HDF5) None
--ref-rsid-tsv Reference HDF5 file path for rsID to CHR:POS conversion None
--build Genome build version 19
--overwrite-rtc Overwrite existing CHR:POS False
--chunksize Chunk size for processing 5000000
--threads Number of threads for parallel processing 4 (when using rsid-to-chrpos)

Output Formatting Options

Basic Formatting

# Output in gwaslab format (default)
gwaslab --input sumstats.tsv --fmt auto --out output --to-fmt gwaslab

# Output in LDSC format
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt ldsc

# Output in PLINK format
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt plink

# Output as CSV instead of TSV
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --tab-fmt csv

# Output without compression
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --no-gzip

Advanced Formatting

# Output with bgzip compression and tabix index
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --bgzip --tabix

# Extract only HapMap3 variants
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --hapmap3

# Exclude HLA region
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --exclude-hla

# Exclude HLA with custom range (in Mbp)
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --exclude-hla --hla-lower 20 --hla-upper 30

# Add chromosome prefix
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --chr-prefix chr

# Use numeric notation for X, Y, MT (23, 24, 25)
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --xymt-number

# Add N column with specified value
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --n 10000

Output Formatting Options:

Option Description Default
--to-fmt Output format gwaslab
--tab-fmt Tabular format (tsv, csv, parquet) tsv
--no-gzip Disable gzip compression False (gzip enabled)
--bgzip Use bgzip compression False
--tabix Create tabix index (requires bgzip) False
--hapmap3 Extract HapMap3 variants only False
--exclude-hla Exclude HLA region False
--hla-lower HLA region lower bound (Mbp) 25
--hla-upper HLA region upper bound (Mbp) 34
--n Add N column with specified value None
--chr-prefix Prefix for chromosome column ""
--xymt-number Use numeric notation for X, Y, MT False

For complete workflow examples, see CLI Workflow Examples.

Output File Naming

The CLI follows a consistent naming pattern for output files:

  • Basic format: {output_path}.{format}.{tab_fmt}.gz
  • Example: output.gwaslab.tsv.gz

  • With filters: {output_path}.{filter}.{format}.{tab_fmt}.gz

  • Example: output.hapmap3.gwaslab.tsv.gz (with --hapmap3)
  • Example: output.noMHC.gwaslab.tsv.gz (with --exclude-hla)

  • Without compression: {output_path}.{format}.{tab_fmt}

  • Example: output.gwaslab.tsv (with --no-gzip)

  • Log file: {output_path}.{format}.log

  • Example: output.gwaslab.log

Tips and Best Practices

1. Input Format Detection

Use --fmt auto to let GWASLab automatically detect the input format:

gwaslab --input sumstats.tsv --fmt auto --qc --out output --to-fmt gwaslab

2. Parallel Processing

Use --threads to speed up processing for large files:

gwaslab --input large_sumstats.tsv --fmt auto --qc --threads 8 --out output --to-fmt gwaslab

3. Quiet Mode

Use --quiet to suppress verbose output in scripts:

gwaslab --input sumstats.tsv --fmt auto --qc --out output --to-fmt gwaslab --quiet

4. Testing with Small Samples

Use --nrows to test commands on a subset of data:

gwaslab --input large_sumstats.tsv --fmt auto --nrows 1000 --qc --out test_output --to-fmt gwaslab

5. Combining Operations

You can combine multiple processing steps in a single command:

# QC + Harmonization + Format conversion
gwaslab --input sumstats.tsv --fmt auto \
  --qc --remove-dup \
  --harmonize --ref-seq ref.fasta \
  --out output --to-fmt gwaslab

6. Reference Files

For harmonization, you typically need: - Reference sequence (FASTA): For allele flipping - rsID reference (VCF/TSV): For rsID assignment - Inference reference (VCF): For strand inference of palindromic SNPs

Download reference files from: - dbSNP - 1000 Genomes - UCSC Genome Browser

7. Memory Considerations

For very large files: - Use --sweep-mode for harmonization (faster for large datasets) - Process in chunks if memory is limited - Consider using --nrows for testing first

Common Use Cases

Use Case 1: Quick Format Check

# Just load and check the file (no output)
gwaslab --input sumstats.tsv --fmt auto

Use Case 2: Standard QC Workflow

gwaslab --input sumstats.tsv --fmt auto \
  --qc --remove --remove-dup --normalize \
  --out qc_sumstats --to-fmt gwaslab

Use Case 3: Prepare for LDSC

gwaslab --input sumstats.tsv --fmt gwaslab \
  --out ldsc_input --to-fmt ldsc --no-gzip

Use Case 4: Prepare for Meta-analysis

gwaslab --input sumstats.tsv --fmt auto \
  --qc --remove-dup \
  --harmonize --ref-seq ref.fasta --ref-rsid-vcf dbsnp.vcf.gz \
  --out meta_ready --to-fmt gwaslab --exclude-hla

Use Case 5: Extract Replication Set

gwaslab --input discovery.tsv --fmt gwaslab \
  --out replication --to-fmt gwaslab --hapmap3 --build 19

Troubleshooting

Issue: File not found

Error: FileNotFoundError: [Errno 2] No such file or directory

Solution: Check that the input file path is correct and accessible.

Issue: Format detection fails

Error: Format auto-detection doesn't work

Solution: Specify the format explicitly with --fmt:

gwaslab --input sumstats.tsv --fmt gwaslab --qc --out output --to-fmt gwaslab

Issue: Memory errors with large files

Error: Out of memory errors

Solution: - Use --threads 1 to reduce memory usage - Process in smaller chunks - Use --sweep-mode for harmonization

Issue: Reference file errors

Error: Reference file not found or invalid

Solution: - Verify reference file paths are correct - Check that reference files are properly formatted - Ensure VCF files are indexed if using VCF references

Getting Help

For more information:

# Show help message
gwaslab --help

# Show version
gwaslab version

For detailed documentation on specific functions, see: - QC & Filtering - Harmonization - Format - Standardization

Supported Formats

GWASLab supports many input and output formats through the formatbook repository. Common formats include:

  • Input: auto, gwaslab, plink, ldsc, vcf, and many more
  • Output: gwaslab, ldsc, plink, plink2, saige, fastgwa, regenie, vcf, and many more

Check the formatbook repository for the complete list of supported formats.