Output sumstats in certain formats

GWASLab provides a flexible formatting and saving function.

.to_format()

mysumstats.to_format(
          path="./sumstats",
          fmt="ldsc",   
          ...
          )

Options

`.to_format()` options	DataType	Description	Default
`path`	`string`	the path for the output file; only prefix is needed.	`"./sumstats"`
`fmt`	`string`	output format for sumstats. Currently support `plink`, `plink2`, `ldsc`, `saige`, `fastgwa`, `regenie` and so forth. For details, please check https://github.com/Cloufield/formatbook.	`"gwaslab"`
`tab_fmt`	`string`	tabular format type when `fmt` is not 'vcf', 'bed', or 'annovar'. Options: `tsv`, `csv`, `parquet`	`"tsv"`
`cols`	`list`	list of additional columns to include in the output	`None`
`extract`	`list`	a list of variant SNPIDs to include.	`None`
`exclude`	`list`	a list of variant SNPIDs to exclude.	`None`
`id_use`	`SNPID` or `rsID`	specify which ID to use when excluding or extracting variants.	`rsID`
`hapmap3`	`boolean`	If True, only output Hapmap3 SNPs.	`False`
`exclude_hla`	`boolean`	If True, exclude variants in the MHC region from the output.	`False`
`hla_range`	`tuple`	a tuple of 2 numbers (Mbp) indicating the start and the end position of the HLA region.	`(25,34)`
`build`	`string`	reference genome build.	`None`
`n`	`float`	sample size to add as 'N' column.	`None`
`no_status`	`boolean`	If True, exclude 'STATUS' column from output.	`False`
`xymt_number`	`boolean`	If True, output sex chromosomes as numeric codes (23, 24, 25 for X, Y, MT).	`False`
`xymt`	`list`	3-element list of sex chromosome notations. If `None`, automatically derived from Sumstats object's species (species-aware). Default: `None` (auto-detect from species)	`None`
`chr_prefix`	`string`	Add a prefix to chromosomes. For example, 6 -> chr6.	`""`
`gzip`	`boolean`	If True, gzip compress the output file.	`True`
`bgzip`	`boolean`	If True, bgzip the output file. Only works for bed and vcf format.	`False`
`tabix`	`boolean`	If True, use tabix to index the bgzipped output file. Only works for bed and vcf format. Requires `bgzip=True`.	`False`
`tabix_indexargs`	`dict`	extra parameters for pysam.tabix_index()	`{}`
`md5sum`	`boolean`	If True, calculate and output the file MD5 hashes	`False`
`to_csvargs`	`dict`	extra parameters for pd.to_csv()	`None`
`to_tabular_kwargs`	`dict`	extra parameters for tabular format output (tsv, csv, parquet)	`None`
`float_formats`	`dict`	a dictionary to specify the float format for each column.	`None`
`validate`	`boolean`	If True, use gwas-ssf CLI tool for validation (only for SSF format).	`False`
`verbose`	`boolean`	If True, print logs.	`True`
`output_log`	`boolean`	If True, save the log to a file.	`True`
`ssfmeta`	`boolean`	If True, output a gwas-ssf-style meta file.	`False`

Format dictionary

Using float_formats, you can specify the formats for numbers.

Default formats for floating-point numbers

{'EAF': '{:.4g}', 'BETA': '{:.4f}', 'Z': '{:.4f}','CHISQ': '{:.4f}','SE': '{:.4f}','OR': '{:.4f}','OR_95U': '{:.4f}','OR_95L': '{:.4f}','INFO': '{:.4f}','P': '{:.4e}','MLOG10P': '{:.4f}','DAF': '{:.4f}'}

Output File Naming

The output filename is automatically constructed based on the format and tabular format: - Pattern: {path}.{fmt}.{tab_fmt}[.gz] - Examples: - path="./sumstats", fmt="gwaslab", tab_fmt="tsv", gzip=True → sumstats.gwaslab.tsv.gz - path="./output", fmt="ldsc", tab_fmt="csv", gzip=False → output.ldsc.csv - path="./data", fmt="vcf", bgzip=True → data.vcf.bcf.gz (if bgzip) or data.vcf.gz (if gzip)

Examples

GWASLab supports commonly used tabular formats, which are listed in a companion repository formatbook.

formatbook

For more details, please check formatbook

Basic format conversion

# Convert to LDSC format
mysumstats.to_format(path="./output", fmt="ldsc")
# Output: output.ldsc.tsv.gz

# Convert to PLINK format with CSV
mysumstats.to_format(path="./output", fmt="plink", tab_fmt="csv", gzip=False)
# Output: output.plink.csv

# Convert to VCF with bgzip and tabix index
mysumstats.to_format(path="./output", fmt="vcf", bgzip=True, tabix=True)
# Output: output.vcf.bcf.gz and output.vcf.bcf.gz.tbi

Filtering and formatting

# Extract HapMap3 SNPs only
mysumstats.to_format(path="./hapmap3", fmt="ldsc", hapmap3=True)

# Exclude HLA region
mysumstats.to_format(path="./no_hla", fmt="gwaslab", exclude_hla=True, hla_range=(25, 34))

# Extract specific variants
mysumstats.to_format(path="./subset", fmt="gwaslab", extract=["rs123", "rs456"], id_use="rsID")

# Add chromosome prefix and N column
mysumstats.to_format(path="./formatted", fmt="gwaslab", chr_prefix="chr", n=10000)

Advanced options

# Custom float formatting
float_formats = {'P': '{:.2e}', 'BETA': '{:.6f}', 'SE': '{:.6f}'}
mysumstats.to_format(path="./formatted", fmt="gwaslab", float_formats=float_formats)

# Output as Parquet format
mysumstats.to_format(path="./parquet", fmt="gwaslab", tab_fmt="parquet", gzip=False)

# Generate MD5 checksum
mysumstats.to_format(path="./checksummed", fmt="gwaslab", md5sum=True)

Species-Aware Sex Chromosome Handling

The xymt parameter is automatically derived from the Sumstats object's species when None:

# Human (default) - automatically uses ["X", "Y", "MT"]
mysumstats = gl.Sumstats("data.txt", species="homo sapiens")
mysumstats.to_format("output.tsv")  # Uses X, Y, MT

# Chicken - automatically uses ["Z", "W", "MT"]
mysumstats = gl.Sumstats("data.txt", species="chicken")
mysumstats.to_format("output.tsv")  # Uses Z, W, MT

# You can still override if needed
mysumstats.to_format("output.tsv", xymt=["X", "Y", "MT"])  # Override to human convention

CLI usage

# Basic format conversion
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt ldsc

# With filtering options
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab \
  --hapmap3 --exclude-hla --n 10000

# Output as CSV without compression
gwaslab --input sumstats.tsv --fmt gwaslab --out output --to-fmt gwaslab \
  --tab-fmt csv --no-gzip

Special Format Specifications

GWASLab supports several specialized formats for variant annotation tools. These formats have specific coordinate conventions that are automatically handled.

BED Format (0-based)

The BED (Browser Extensible Data) format is used by the UCSC Genome Browser and other tools. GWASLab outputs BED format with 0-based, half-open coordinates.

Output columns: CHR, START, END, NEA/EA, STRAND, SNPID

Coordinate conventions: - SNPs: START = POS - 1, END = POS - 1 + len(NEA) (0-based) - Insertions: START = POS, END = POS (0-based) - Deletions: START = POS, END = POS + len(NEA) - 1 (0-based)

Example:

mysumstats.to_format(path="./output", fmt="bed", bgzip=True, tabix=True)
# Output: output.bed.gz (bgzipped and tabix-indexed)

Source: UCSC BED Format Specification

VEP Format (1-based)

The VEP (Variant Effect Predictor) format is used by Ensembl's VEP tool for variant annotation. GWASLab outputs VEP format with 1-based coordinates.

Output columns: CHR, START, END, NEA/EA, STRAND, SNPID

Coordinate conventions: - SNPs: START = END = POS + (len(NEA) - 1) (1-based) - Insertions: START = POS + 1, END = POS (VEP convention: START > END indicates insertion) - Deletions: START = POS + 1, END = POS + (len(NEA) - 1) (1-based)

VEP Insertion Convention

VEP format uses START > END for insertions as a special convention to indicate an insertion between positions. This is intentional and correct according to VEP specifications.

Example:

mysumstats.to_format(path="./output", fmt="vep", bgzip=True, tabix=True)
# Output: output.vep.gz (bgzipped and tabix-indexed)

Source: Ensembl VEP Format Documentation

Annovar Format (1-based)

The Annovar format is used by the ANNOVAR tool for functional annotation of genetic variants. GWASLab outputs Annovar format with 1-based coordinates.

Output columns: CHR, START, END, NEA_out, EA_out, SNPID

Format specification: According to the ANNOVAR documentation, the input format requires: - First 5 columns: Chromosome, Start, End, Reference Allele, Alternative Allele - Coordinate system: 1-based (by default) - Insertions: Use - for reference allele - Deletions: Use - for alternative allele

Coordinate conventions: - SNPs: START = POS, END = POS - 1 + len(NEA) - Example: SNP A/G at POS=100 → START=100, END=100 (since len(NEA)=1) - Matches ANNOVAR example: 1 948921 948921 T C - Insertions: START = END = POS (matches ANNOVAR specification) - Example: Insertion A/AT at POS=200 → START=200, END=200, REF=-, ALT=TC - Matches ANNOVAR example: 1 11403596 11403596 - AT (START=END=POS) - Deletions: START = POS, END = POS - 1 + len(NEA) - Example: Deletion AT/A at POS=300 → START=300, END=302, REF=TC, ALT=- - Matches ANNOVAR example: 1 13211293 13211294 TC - (2-bp deletion)

Examples from ANNOVAR documentation:

1 948921 948921 T C comments: rs15842, a SNP in 5' UTR
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion

Example:

mysumstats.to_format(path="./output", fmt="annovar", bgzip=True, tabix=True)
# Output: output.annovar.gz (bgzipped and tabix-indexed)

Source: ANNOVAR Input Format Documentation

Format Comparison

Format	Coordinate System	Insertion Convention	Deletion Convention
BED	0-based, half-open	`START = END = POS`	`START = POS`, `END = POS + len(NEA) - 1`
VEP	1-based	`START = POS + 1`, `END = POS` (START > END)	`START = POS + 1`, `END = POS + len(NEA) - 1`
Annovar	1-based	`START = END = POS`	`START = POS`, `END = POS - 1 + len(NEA)`

Automatic Coordinate Conversion

GWASLab automatically handles coordinate conversion for all variant types (SNPs, insertions, deletions) based on the selected format. You don't need to manually adjust coordinates.

SSF Format (GWAS-SSF v0.1)

The SSF (Summary Statistics Format) is a standardized format for GWAS summary statistics proposed to improve interoperability and reproducibility. GWASLab supports outputting and validating SSF format files.

Format Specification: - Source: GWAS-SSF v0.1 - Separator: Tab (\t) - Missing values: #NA - File extension: .tsv or .tsv.gz

Required columns: - chromosome - Chromosome number - base_pair_location - Base pair position - effect_allele - Effect allele - other_allele - Non-effect allele - standard_error - Standard error - effect_allele_frequency - Effect allele frequency - p_value - P-value

Optional columns: - beta, odds_ratio, or hazard_ratio - At least one effect measure required - neg_log_10_p_value - Alternative to p_value - rsid - dbSNP rsID - variant_id - Variant identifier - info - Imputation quality score - ref_allele - Reference allele - n - Sample size - ci_upper, ci_lower - Confidence interval bounds

Column order: SSF format has a strict column order requirement. GWASLab automatically orders columns according to the SSF specification.

Example:

# Output in SSF format
mysumstats.to_format(path="./output", fmt="ssf")

# Output SSF format with metadata file
mysumstats.to_format(path="./output", fmt="ssf", ssfmeta=True)

# Output and validate SSF format
mysumstats.to_format(path="./output", fmt="ssf", validate=True)

SSF Metadata: When ssfmeta=True, GWASLab generates a YAML metadata file alongside the SSF file containing: - Study information - Sample characteristics - File checksums (MD5) - Format version information

SSF Validation

GWASLab provides built-in validation for SSF format files to ensure compliance with the GWAS-SSF specification. The built-in validator replicates the behavior of gwas-sumstats-validator (also known as gwas-ssf CLI tool) and performs the same validation checks.

Validation checks: 1. File extension: Must be .tsv or .tsv.gz 2. Required columns: All 7 required columns must be present 3. Effect field: At least one of beta, odds_ratio, or hazard_ratio must be present 4. P-value field: Either p_value or neg_log_10_p_value must be present 5. Column order: Columns must follow the SSF specification order 6. Chromosome coverage: Validates presence of all autosomes (1-22) 7. Data validation: Checks data types, ranges, and consistency 8. Minimum rows: Requires at least 100,000 variants (configurable)

Usage:

# Validate SSF file during output
mysumstats.to_format(path="./output", fmt="ssf", validate=True)

# Validation uses built-in validator by default
# If gwas-ssf CLI is available, it will be used instead

Validation methods: - Primary: Uses gwas-ssf CLI tool (gwas-sumstats-validator) if available (external dependency) - Fallback: Uses built-in GWASLab validator that replicates the same validation logic as gwas-sumstats-validator (no external dependencies)

Validation output: - Success: ✓ SSF validation successful - Failure: Lists specific validation errors and issues found

Example validation output:

✓ SSF validation successful: File passes all validation checks

or

✗ SSF validation failed: Missing required columns: ['standard_error']
  - Missing required columns: ['standard_error']