Command Line Interface (CLI)
GWASLab provides a unified command-line interface for processing GWAS summary statistics. The CLI supports quality control (QC), harmonization, format conversion, and various output formatting options.
Basic Usage
The CLI follows a unified interface pattern:
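The general shape below is reconstructed from the examples on this page rather than taken from the tool's help text, so treat it as a sketch. The variable names and file names are placeholders, and the snippet only prints the assembled command instead of running it.

```shell
# General shape (angle brackets mark values you supply):
#   gwaslab --input <file> --fmt <input-format> [processing options] \
#           --out <prefix> --to-fmt <output-format> [output options]

input=sumstats.tsv     # placeholder input file
in_fmt=auto            # input format, or auto for detection
out_prefix=cleaned     # output prefix
out_fmt=gwaslab        # output format

# Assemble a concrete command and print it for inspection.
cmd="gwaslab --input $input --fmt $in_fmt --qc --out $out_prefix --to-fmt $out_fmt"
echo "$cmd"
```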
Quick Examples
# Show version
gwaslab version
# Basic QC and output
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab
# Harmonization with reference
gwaslab --input sumstats.tsv --fmt auto --ref-seq ref.fasta --harmonize --out harmonized.tsv --to-fmt gwaslab
# Format conversion only
gwaslab --input sumstats.tsv --fmt gwaslab --out sumstats.ldsc --to-fmt ldsc
Command Structure
Required Arguments
| Argument | Description | Example |
|---|---|---|
| --input | Input sumstats file path | --input data/sumstats.tsv |
| --fmt | Input format (default: auto) | --fmt gwaslab or --fmt auto |
Optional Arguments
| Argument | Description | Default |
|---|---|---|
| --out | Output file path (prefix) | None |
| --to-fmt | Output format | gwaslab |
| --tab-fmt | Tabular format (tsv, csv, parquet) | tsv |
| --nrows | Number of rows to read (for testing) | None |
| --quiet | Suppress output messages | False |
| --threads | Number of threads for parallel processing | 1 |
Processing Options
Quality Control (QC)
Perform quality control on sumstats using basic_check():
# Basic QC
gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned.tsv --to-fmt gwaslab
# QC, removing bad-quality variants
gwaslab --input sumstats.tsv --fmt auto --qc --remove --out cleaned.tsv --to-fmt gwaslab
# QC, removing duplicated variants
gwaslab --input sumstats.tsv --fmt auto --qc --remove-dup --out cleaned.tsv --to-fmt gwaslab
# QC with indel normalization
gwaslab --input sumstats.tsv --fmt auto --qc --normalize --out cleaned.tsv --to-fmt gwaslab
# QC with all options
gwaslab --input sumstats.tsv --fmt auto --qc --remove --remove-dup --normalize --threads 4 --out cleaned.tsv --to-fmt gwaslab
QC Options:
| Option | Description |
|---|---|
| --qc | Perform quality control (basic_check) |
| --remove | Remove bad quality variants detected during QC |
| --remove-dup | Remove duplicated or multi-allelic variants |
| --normalize | Normalize indels (e.g., ATA:AA -> AT:A) |
Harmonization
Harmonize sumstats with reference data:
# Basic harmonization (without reference files)
gwaslab --input sumstats.tsv --fmt auto --harmonize --out harmonized.tsv --to-fmt gwaslab
# Harmonization with reference sequence for allele flipping
gwaslab --input sumstats.tsv --fmt auto --harmonize --ref-seq /path/to/reference.fasta --out harmonized.tsv --to-fmt gwaslab
# Harmonization with rsID assignment
gwaslab --input sumstats.tsv --fmt auto --harmonize --ref-rsid-vcf /path/to/reference.vcf.gz --out harmonized.tsv --to-fmt gwaslab
# Full harmonization pipeline
gwaslab --input sumstats.tsv --fmt auto \
--harmonize \
--ref-seq /path/to/reference.fasta \
--ref-rsid-vcf /path/to/rsid.vcf.gz \
--ref-infer /path/to/inference.vcf.gz \
--ref-alt-freq AF \
--maf-threshold 0.40 \
--ref-maf-threshold 0.4 \
--sweep-mode \
--threads 8 \
--out harmonized.tsv \
--to-fmt gwaslab
Harmonization Options:
| Option | Description | Default |
|---|---|---|
| --harmonize | Perform harmonization | False |
| --basic-check | Run basic QC in harmonization | True |
| --no-basic-check | Skip basic QC in harmonization | - |
| --ref-seq | Reference sequence file (FASTA) for allele flipping | None |
| --ref-rsid-tsv | Reference rsID HDF5 file (legacy name, accepts HDF5 path) | None |
| --ref-rsid-vcf | Reference rsID VCF/BCF file | None |
| --ref-infer | Reference VCF/BCF file for strand inference | None |
| --ref-alt-freq | Allele frequency field name in VCF INFO | AF |
| --ref-maf-threshold | MAF threshold for reference | 0.4 |
| --maf-threshold | MAF threshold for sumstats | 0.40 |
| --sweep-mode | Use sweep mode for large datasets | False |
Assign rsID
Assign rsID to variants using reference data:
# Assign rsID from HDF5 file
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-tsv /path/to/rsid.hdf5 --out output.tsv --to-fmt gwaslab
# Assign rsID from VCF file
gwaslab --input sumstats.tsv --fmt auto --assign-rsid --ref-rsid-vcf /path/to/rsid.vcf.gz --overwrite empty --out output.tsv --to-fmt gwaslab
Assign rsID Options:
| Option | Description | Default |
|---|---|---|
| --assign-rsid | Assign rsID to variants | False |
| --ref-rsid-tsv | Reference rsID HDF5 file (legacy name, accepts HDF5 path) | None |
| --ref-rsid-vcf | Reference rsID VCF/BCF file | None |
| --overwrite | Overwrite mode (all, invalid, empty) | empty |
| --threads | Number of threads for parallel processing | 1 |
rsID to CHR:POS
Convert rsID to CHR:POS coordinates:
# Convert rsID to CHR:POS using VCF (auto-generates HDF5)
gwaslab --input sumstats.tsv --fmt auto --rsid-to-chrpos --ref-rsid-vcf /path/to/reference.vcf.gz --build 19 --out output.tsv --to-fmt gwaslab
# Convert rsID to CHR:POS using existing HDF5 file
gwaslab --input sumstats.tsv --fmt auto --rsid-to-chrpos --ref-rsid-tsv /path/to/reference.hdf5 --build 19 --out output.tsv --to-fmt gwaslab
rsID to CHR:POS Options:
| Option | Description | Default |
|---|---|---|
| --rsid-to-chrpos | Convert rsID to CHR:POS | False |
| --ref-rsid-vcf | Reference VCF file for rsID to CHR:POS conversion (auto-generates HDF5) | None |
| --ref-rsid-tsv | Reference HDF5 file path for rsID to CHR:POS conversion | None |
| --build | Genome build version | 19 |
| --overwrite-rtc | Overwrite existing CHR:POS | False |
| --chunksize | Chunk size for processing | 5000000 |
| --threads | Number of threads for parallel processing | 4 (when using --rsid-to-chrpos) |
Output Formatting Options
Basic Formatting
# Output in gwaslab format (default)
gwaslab --input sumstats.tsv --fmt auto --out output --to-fmt gwaslab
# Output in LDSC format
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt ldsc
# Output in PLINK format
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt plink
# Output as CSV instead of TSV
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --tab-fmt csv
# Output without compression
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --no-gzip
Advanced Formatting
# Output with bgzip compression and tabix index
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --bgzip --tabix
# Extract only HapMap3 variants
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --hapmap3
# Exclude HLA region
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --exclude-hla
# Exclude HLA with custom range (in Mbp)
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --exclude-hla --hla-lower 20 --hla-upper 30
# Add chromosome prefix
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --chr-prefix chr
# Use numeric notation for X, Y, MT (23, 24, 25)
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --xymt-number
# Add N column with specified value
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab --n 10000
Output Formatting Options:
| Option | Description | Default |
|---|---|---|
| --to-fmt | Output format | gwaslab |
| --tab-fmt | Tabular format (tsv, csv, parquet) | tsv |
| --no-gzip | Disable gzip compression | False (gzip enabled) |
| --bgzip | Use bgzip compression | False |
| --tabix | Create tabix index (requires bgzip) | False |
| --hapmap3 | Extract HapMap3 variants only | False |
| --exclude-hla | Exclude HLA region | False |
| --hla-lower | HLA region lower bound (Mbp) | 25 |
| --hla-upper | HLA region upper bound (Mbp) | 34 |
| --n | Add N column with specified value | None |
| --chr-prefix | Prefix for chromosome column | "" |
| --xymt-number | Use numeric notation for X, Y, MT | False |
For complete workflow examples, see CLI Workflow Examples.
Output File Naming
The CLI follows a consistent naming pattern for output files:
- Basic format: {output_path}.{format}.{tab_fmt}.gz
  - Example: output.gwaslab.tsv.gz
- With filters: {output_path}.{filter}.{format}.{tab_fmt}.gz
  - Example: output.hapmap3.gwaslab.tsv.gz (with --hapmap3)
  - Example: output.noMHC.gwaslab.tsv.gz (with --exclude-hla)
- Without compression: {output_path}.{format}.{tab_fmt}
  - Example: output.gwaslab.tsv (with --no-gzip)
- Log file: {output_path}.{format}.log
  - Example: output.gwaslab.log
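To make the naming pattern concrete, here is a small helper that builds the expected file name from its parts. It is a sketch based only on the patterns listed above: the function name predict_output_name is hypothetical, it handles at most one filter label, and how GWASLab names files when several filters are combined is not covered here.

```shell
# predict_output_name PREFIX FORMAT TAB_FMT [FILTER] [GZIP]
#   FILTER: optional label such as hapmap3 or noMHC
#   GWASLab compresses by default; pass "no" as GZIP to mimic --no-gzip
predict_output_name() {
    prefix=$1; fmt=$2; tab=$3; filter=${4:-}; gzip=${5:-yes}
    name=$prefix
    if [ -n "$filter" ]; then
        name="$name.$filter"
    fi
    name="$name.$fmt.$tab"
    if [ "$gzip" = yes ]; then
        name="$name.gz"
    fi
    printf '%s\n' "$name"
}

predict_output_name output gwaslab tsv            # output.gwaslab.tsv.gz
predict_output_name output gwaslab tsv hapmap3    # output.hapmap3.gwaslab.tsv.gz
predict_output_name output gwaslab tsv "" no      # output.gwaslab.tsv
```

A helper like this is handy in pipelines where a later step needs to know which file the previous gwaslab call produced.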
Tips and Best Practices
1. Input Format Detection
Use --fmt auto to let GWASLab automatically detect the input format:
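A minimal sketch, assuming a placeholder input file named sumstats.tsv. The guard around the call is only there so the snippet can be pasted safely on a machine where gwaslab or the file is missing.

```shell
input=sumstats.tsv   # placeholder path

# Run only when gwaslab is installed and the file is readable.
if command -v gwaslab >/dev/null 2>&1 && [ -r "$input" ]; then
    gwaslab --input "$input" --fmt auto --out output --to-fmt gwaslab
else
    echo "skipped: gwaslab or $input is not available"
fi
```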
2. Parallel Processing
Use --threads to speed up processing for large files:
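One possible way to pick the thread count is to ask the system how many cores are online. This sketch assumes getconf is available (it falls back to 1 otherwise) and only prints the resulting command. The file names are placeholders.

```shell
# Detect online CPU cores; fall back to 1 if getconf cannot report them.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

cmd="gwaslab --input sumstats.tsv --fmt auto --qc --threads $threads --out cleaned --to-fmt gwaslab"
echo "$cmd"   # review, then run the printed command
```

On shared machines you may prefer a fixed, smaller value than the full core count.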
3. Quiet Mode
Use --quiet to suppress verbose output in scripts:
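As an illustration, a batch loop might look like the following. The input names study1.tsv and study2.tsv are hypothetical, and the loop prints its own one-line progress message so the log stays short while gwaslab itself is silenced.

```shell
for f in study1.tsv study2.tsv; do          # hypothetical input files
    prefix=${f%.tsv}_qc                     # study1.tsv -> study1_qc
    echo "processing $f -> $prefix"
    # Run only when gwaslab is installed and the file is readable.
    if command -v gwaslab >/dev/null 2>&1 && [ -r "$f" ]; then
        gwaslab --input "$f" --fmt auto --qc --quiet --out "$prefix" --to-fmt gwaslab
    fi
done
```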
4. Testing with Small Samples
Use --nrows to test commands on a subset of data:
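A minimal sketch, assuming a placeholder input file and an arbitrary subset size of 1000 rows. The output prefix trial is also just an example.

```shell
input=sumstats.tsv   # placeholder path
nrows=1000           # small subset for a trial run

# Run only when gwaslab is installed and the file is readable.
if command -v gwaslab >/dev/null 2>&1 && [ -r "$input" ]; then
    gwaslab --input "$input" --fmt auto --nrows "$nrows" --qc --out trial --to-fmt gwaslab
else
    echo "skipped: gwaslab or $input is not available"
fi
```

Once the trial run behaves as expected, drop --nrows to process the full file.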
5. Combining Operations
You can combine multiple processing steps in a single command:
# QC + Harmonization + Format conversion
gwaslab --input sumstats.tsv --fmt auto \
--qc --remove-dup \
--harmonize --ref-seq ref.fasta \
--out output --to-fmt gwaslab
6. Reference Files
For harmonization, you typically need:
- Reference sequence (FASTA): for allele flipping
- rsID reference (VCF or HDF5): for rsID assignment
- Inference reference (VCF): for strand inference of palindromic SNPs

Download reference files from:
- dbSNP
- 1000 Genomes
- UCSC Genome Browser
7. Memory Considerations
For very large files:
- Use --sweep-mode for harmonization (faster for large datasets)
- Process in chunks if memory is limited
- Consider using --nrows for testing first
Common Use Cases
Use Case 1: Quick Format Check
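One way to do a quick check, sketched under assumptions: read a handful of rows with --fmt auto, write them in gwaslab format, and look at the first lines of the result. The file names and the row count are placeholders, and the output name is derived from the naming pattern described under Output File Naming.

```shell
input=sumstats.tsv   # placeholder path
out=format_check     # placeholder output prefix

if command -v gwaslab >/dev/null 2>&1 && [ -r "$input" ]; then
    gwaslab --input "$input" --fmt auto --nrows 100 --out "$out" --to-fmt gwaslab
    # Expected name follows {output_path}.{format}.{tab_fmt}.gz
    gzip -dc "$out.gwaslab.tsv.gz" | head -n 5
fi
```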
Use Case 2: Standard QC Workflow
gwaslab --input sumstats.tsv --fmt auto \
--qc --remove --remove-dup --normalize \
--out qc_sumstats --to-fmt gwaslab
Use Case 3: Prepare for LDSC
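A possible command for this case, assembled from options documented on this page. Restricting to HapMap3 variants and excluding the HLA region is a common choice before LDSC, but whether you need both depends on your analysis. The file names are placeholders and the snippet only prints the command.

```shell
cmd="gwaslab --input sumstats.tsv --fmt auto --qc --remove-dup --out ldsc_ready --to-fmt ldsc --hapmap3 --exclude-hla"
echo "$cmd"   # review, then run the printed command
```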
Use Case 4: Prepare for Meta-analysis
gwaslab --input sumstats.tsv --fmt auto \
--qc --remove-dup \
--harmonize --ref-seq ref.fasta --ref-rsid-vcf dbsnp.vcf.gz \
--out meta_ready --to-fmt gwaslab --exclude-hla
Use Case 5: Extract Replication Set
gwaslab --input discovery.tsv --fmt gwaslab \
--out replication --to-fmt gwaslab --hapmap3 --build 19
Troubleshooting
Issue: File not found
Error: FileNotFoundError: [Errno 2] No such file or directory
Solution: Check that the input file path is correct and accessible.
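A small pre-flight check can catch this before GWASLab starts. The helper name check_input is hypothetical, and sumstats.tsv is a placeholder path.

```shell
# check_input FILE: succeed if FILE exists and is readable, otherwise explain why not.
check_input() {
    if [ ! -r "$1" ]; then
        echo "cannot read input file: $1 (current directory: $(pwd))" >&2
        return 1
    fi
}

# Run gwaslab only when the input is readable and the tool is installed.
if check_input sumstats.tsv && command -v gwaslab >/dev/null 2>&1; then
    gwaslab --input sumstats.tsv --fmt auto --qc --out cleaned --to-fmt gwaslab
fi
```

Printing the current directory helps when a relative path is resolved from an unexpected location, for example inside a job scheduler.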
Issue: Format detection fails
Error: Format auto-detection doesn't work
Solution: Specify the format explicitly with --fmt:
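For instance, assuming the file is in PLINK format (substitute whichever supported format matches your data), the command could look like this. The snippet only prints it, and the file names are placeholders.

```shell
fmt=plink   # placeholder: a supported input format such as gwaslab, plink, ldsc, or vcf

cmd="gwaslab --input sumstats.tsv --fmt $fmt --out output --to-fmt gwaslab"
echo "$cmd"   # review, then run the printed command
```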
Issue: Memory errors with large files
Error: Out of memory errors
Solution:
- Use --threads 1 to reduce memory usage
- Process in smaller chunks
- Use --sweep-mode for harmonization
Issue: Reference file errors
Error: Reference file not found or invalid
Solution:
- Verify that reference file paths are correct
- Check that reference files are properly formatted
- Ensure VCF files are indexed if using VCF references
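These checks can be scripted. The sketch below tests that a reference VCF is readable and that an index file sits next to it; the helper name has_vcf_index and the path are placeholders. It assumes the index follows the usual tabix naming (.tbi or .csi appended to the file name).

```shell
# has_vcf_index FILE: succeed if a tabix (.tbi) or CSI (.csi) index sits next to FILE.
has_vcf_index() {
    [ -f "$1.tbi" ] || [ -f "$1.csi" ]
}

ref=/path/to/reference.vcf.gz   # placeholder path

if [ ! -r "$ref" ]; then
    echo "reference file not found or unreadable: $ref" >&2
elif ! has_vcf_index "$ref"; then
    echo "no index found; create one with: tabix -p vcf $ref" >&2
else
    echo "reference file and index are present: $ref"
fi
```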
Getting Help
For more information:
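A quick way to see what your installed version supports is to ask the tool itself. The version subcommand appears in the Quick Examples above. The --help flag is an assumption here, based on the usual command-line convention rather than on this page.

```shell
have_gwaslab=no
if command -v gwaslab >/dev/null 2>&1; then
    have_gwaslab=yes
    gwaslab version
    gwaslab --help      # assumption: the conventional --help flag is available
else
    echo "gwaslab is not on PATH; install it first, for example with: pip install gwaslab"
fi
```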
For detailed documentation on specific functions, see:
- QC & Filtering
- Harmonization
- Format
- Standardization
Supported Formats
GWASLab supports many input and output formats through the formatbook repository. Common formats include:
- Input: auto, gwaslab, plink, ldsc, vcf, and many more
- Output: gwaslab, ldsc, plink, plink2, saige, fastgwa, regenie, vcf, and many more
Check the formatbook repository for the complete list of supported formats.