Harmonization

GWASLab provides reference-dependent harmonization functions.

Harmonization Workflow

The harmonization process follows a sequential workflow to align and standardize summary statistics with reference data:

Load Sumstats
     │
     ▼
Basic Check (optional)
     │
     ▼
Check Reference Alignment
     │
     ├─→ NEA ≠ REF ─→ Flip Allele Stats ─┐
     │                                    │
     └─→ NEA == REF ──────────────────────┘
                         │
                         ▼
                    Assign rsID
                         │
                         ▼
                    Infer Strand
                         │
                         ├─→ Strand mismatch ─→ Flip Allele Stats ─┐
                         │                                         │
                         └─→ Strand correct ───────────────────────┘
                                             │
                                             ▼
                                    Harmonized Sumstats

Workflow Steps:

Basic Check (optional, basic_check=True): Fix IDs, chromosomes, positions, and alleles
Check Reference Alignment (check_ref()): Verify NEA matches REF in reference genome
Flip Allele Stats (if needed): Flip alleles and allele-specific statistics when NEA ≠ REF
Assign rsID (assign_rsid()): Annotate variants with rsIDs from reference
Infer Strand (infer_strand()): Determine strand orientation for palindromic SNPs and indels
Flip Allele Stats (if needed): Flip again based on strand inference results

All-in-One Function

Use .harmonize() to run the complete workflow automatically:

mysumstats.harmonize(
    ref_seq="reference.fasta",
    ref_rsid_vcf="dbsnp.vcf.gz",
    ref_infer="1kg.vcf.gz",
    ref_alt_freq="AF"
)

Methods summary

Sumstats Methods	Options	Description
`.check_ref()`	`ref_path`	Check alignment with a reference genome FASTA file
`.assign_rsid()`	`ref_rsid_tsv`, `ref_rsid_vcf`, `threads=1`, `chunksize=5000000`, `overwrite="empty"`	Annotate rsID using a reference tabular file or VCF/BCF file
`.infer_strand()`	`ref_infer`, `ref_alt_freq=None`, `maf_threshold=0.40`, `remove_snp=""`, `daf_tolerance=0.2`, , `mode="pi"`, `threads=1`, `remove_indel=""`	Infer the strand of palindromic variants/indistinguishable indels using reference VCF/BCF files based on allele frequency in INFO
`.check_af()`	`ref_infer`, `ref_alt_freq=None`, `maf_threshold=0.40`, `threads=1`	Calculate difference in allele frequencies (DAF) between sumstats EAF and reference VCF ALT frequency
`.flip_allele_stats()`		After alignment and inferring, flip the alleles and allele-specific statistics to harmonize the variants.
`.harmonize()`	`basic_check=True`, `ref_seq=None`, `ref_rsid_tsv=None`, `ref_rsid_vcf=None`, `ref_infer=None`, `ref_alt_freq=None`, `maf_threshold=0.40`, `threads=1`, `remove=False`, `sweep_mode=False`, `checkref_args={}`, `removedup_args={}`, `assignrsid_args={}`, `inferstrand_args={}`, `flipallelestats_args={}`, `fixid_args={}`, `fixchr_agrs={}`, `fixpos_args={}`, `fixallele_args={}`, `sanitycheckstats_args={}`, `normalizeallele_args={}`	All-in-one function for harmonization

Align NEA with REF in the reference genome

.check_ref(): Check if NEA is aligned with the reference sequence. After checking, the tracking status code will be changed accordingly.

`.check_ref()` options	DataType	Description	Default
`ref_path`	`string`	path to the reference genome FASTA file.
`ref_seq_mode` (available since v3.4.42)	`v` or `s`	`v` for vectorized implementation (faster); `s` for single row iteration mode	`v` since v3.4.42 (except v3.4.43)

Chromosome Conversion

Chromosome format conversion between sumstats and reference files is handled automatically by ChromosomeMapper (accessed via mysumstats.mapper). No manual chr_dict parameter is needed.

Example

mysumstats.check_ref(ref_path="ref_genome.fa")
mysumstats.flip_allele_stats()

Note

check_ref() only change the status code. Use flip function .flip_allele_stats() to flip the allele-specific stats.

Assign rsID according to CHR, POS, REF/ALT

.assign_rsid() : Annotated variants with rsID using a reference tsv file (1KG variants) and reference vcf file (tabix indexed, entire dbSNP).

See https://cloufield.github.io/gwaslab/AssignrsID/

Example

mysumstats.assign_rsid(ref_rsid_tsv = gl.get_path("1kg_dbsnp151_hg19_auto"), 
                       ref_rsid_vcf = "/home/yunye/mydata/d_disk/dbsnp/GCF_000001405.25.vcf.gz",
                       threads=1)

For TSV file, variants will be matched using SNPID (CHR:POS:NEA:EA) for quick assigning.
For VCF file, GWASLab will first extract all variants in the reference file with matching CHR and POS. And then compare EA/NEA in sumstats with REF/ALT in reference vcf. When matching, it will annotate the variant in sumstats with the matching rsID in reference vcf.

Check palindromic SNPs or indistinguishable Indels

.infer_strand() and .infer_strand2():

Infer the strand for palindromic SNPs (A/T, T/A, G/C, C/G) with MAF ≤ maf_threshold.
Check the alignment status of indels with the REF allele in a reference VCF/BCF file and verify if the allele frequencies are consistent (DAF ≤ daf_tolerance).
Update STATUS code 7th digit to reflect strand inference results (0=non-palindromic, 1=forward, 2=reverse, 3=indel forward, 4=ambiguous indel, 6=indel reverse, 7=MAF too high, 8=not found, 9=unchecked).

DAF in GWASLab

DAF in GWASLab: Difference between effect allele frequency (EAF) in sumstats and ALT frequency in reference VCF/BCF file

Note

This DAF in GWASLab is not the derived allele frequency in evolutionary genetics.

`.infer_strand()` options	DataType	Description	Default
`ref_infer`	`string`	path to the reference VCF/BCF file (must be tabix-indexed for efficient querying).
`ref_alt_freq`	`string`	the field for alternative allele frequency in VCF INFO (e.g., "AF", "AF_popmax", "gnomAD_AF"). If None, auto-detection is attempted.
`maf_threshold`	`float`	maximum minor allele frequency threshold. Only palindromic SNPs with MAF ≤ `maf_threshold` will be inferred	`0.4`
`ref_maf_threshold`	`float`	maximum minor allele frequency threshold for reference RAF. Used as additional filter for both palindromic SNPs and indels.	`0.4`
`daf_tolerance`	`float`	difference in allele frequency tolerance for indels. Only indels with	EAF - RAF
`remove_snp`	, `7` or `8`	`7` remove palindromic SNPs with MAF unable to infer. `8`: remove palindromic SNPs with No information in reference VCF/BCF	``
`remove_indel`	or `8`	`8`: indistinguishable indels with No information in reference VCF/BCF	``
`mode`	`p`, `i`, or `pi`	`p`: infer palindromic SNPs only. `i`: infer indels only. `pi`: infer both (default).	`pi`
`threads`	`int`	number of CPU threads to use	`1`
`cache_options` (available since v3.4.42)	`dict`	options for using cache to speed up this step	`None`

`.infer_strand2()` options	DataType	Description	Default
`path` / `vcf_path` / `tsv_path`	`string`	path to reference file (VCF/BCF or TSV). `vcf_path` overrides `path`, `tsv_path` overrides both. If TSV not provided, generated from VCF.
`assign_cols`	`tuple` or `str`	column names to extract from reference (default: ("AF",)). First column will be renamed to RAF.	`("AF",)`
`maf_threshold`	`float`	maximum minor allele frequency threshold for sumstats EAF	`0.4`
`ref_maf_threshold`	`float`	maximum minor allele frequency threshold for reference RAF	`0.4`
`daf_tolerance`	`float`	difference in allele frequency tolerance for indels	`0.2`
`threads`	`int`	number of CPU threads for processing	`6`
`reuse_lookup`	`bool`	if True, reuse existing lookup table TSV file if found	`True`
`convert_to_bcf`	`bool`	if True, convert VCF to BCF format for faster processing	`False`
`strip_info`	`bool`	if True, strip INFO fields from VCF during lookup table generation	`True`

Chromosome Conversion

Chromosome format conversion is handled automatically by ChromosomeMapper. No manual chr_dict parameter is needed.

`cache_options`	DataType	Description	Default
`cache_manager`	CacheManager object or `None`	If any between cache_loader and cache_process is not None, or use_cache is True, a CacheManager object will be created automatically.
`trust_cache`	`bool`	Whether to completely trust the cache or not. Trusting the cache means that any key not found inside the cache will be considered as a missing value even in the VCF file.	True
`cache_loader`	Object with a get_cache() method or `None`	Object with a get_cache() method or None.
`cache_process`	Object with an apply_fn() method or `None`	Object with an apply_fn() method or None.
`use_cache`	`bool`	If any of the cache_manager, cache_loader or cache_process is not None, this will be set to True automatically. If set to True and all between cache_manager, cache_loader and cache_process are None, the cache will be loaded (or built) on the spot.	False

Example

# Old method: per-variant VCF queries (good for small datasets)
mysumstats.infer_strand(ref_infer=gl.get_path("1kg_eas_hg19"),
                        ref_alt_freq="AF",
                        maf_threshold=0.40,
                        daf_tolerance=0.20,
                        threads=4)
mysumstats.flip_allele_stats()

# New optimized method: bulk lookup table (faster for large datasets)
mysumstats.infer_strand2(vcf_path=gl.get_path("1kg_eas_hg19"),
                         maf_threshold=0.40,
                         ref_maf_threshold=0.40,
                         daf_tolerance=0.20,
                         threads=6)
mysumstats.flip_allele_stats()

# Using cache to speed up old method
mysumstats.infer_strand(ref_infer=gl.get_path("1kg_eas_hg19"),
                        cache_options={"use_cache":True})
mysumstats.flip_allele_stats()

Note

infer_strand() and infer_strand2() only change the status code. Use flip function .flip_allele_stats() to flip the allele-specific stats.

When to use infer_strand2()

Use .infer_strand2() for: - Large datasets (millions of variants) - When you have bcftools installed - When you want faster processing with reduced I/O operations - When you plan to reuse lookup tables for multiple runs

Use .infer_strand() for: - Small to medium datasets (< 1 million variants) - When bcftools is not available - When you need maximum parallelization across CPU cores

Check the difference in allele frequency

.check_af() : Check the allele frequency discrepancy with a reference VCF/BCF file. This function calculates the difference (DAF) between the effect allele frequency (EAF) in your sumstats and the alternative allele frequency (ALT_AF) in the reference VCF.

Purpose:

Quality control: Identify variants with large differences in allele frequency between sumstats and reference
Validation: Verify that EAF values in sumstats are consistent with reference population frequencies
Detect potential issues: Large |DAF| values (> 0.2) may indicate allele mismatches, strand flips, or data quality issues

Important: Please make sure your sumstats are already harmonized, and the variants in reference VCF are also aligned. GWASLab will retrieve information only for matched variants (CHR, POS, EA-ALT, and NEA-REF).

`.check_af()` options	DataType	Description	Default
`ref_infer`	`string`	path to the reference VCF/BCF file (must be tabix-indexed for efficient querying).
`ref_alt_freq`	`string`	the field for alternative allele frequency in VCF INFO (e.g., "AF", "AF_popmax", "gnomAD_AF"). If None, auto-detection is attempted.
`maf_threshold`	`float`	minor allele frequency threshold for filtering variants. Only variants with MAF ≤ threshold in both sumstats and reference are included in DAF statistics.	`0.4`
`column_name`	`string`	name of the column to store DAF values	`"DAF"`
`suffix`	`string`	suffix to append to column name (useful when comparing multiple reference populations)	`""`
`threads`	`int`	number of CPU threads to use	`1`
`force`	`bool`	if True, check all variants regardless of STATUS codes	`False`

Example

mysumstats.check_af(ref_infer=gl.get_path("1kg_eas_hg19"), 
                    ref_alt_freq="AF",
                    threads=2)

Terminology:

DAF: Difference in Allele Frequency = EAF (sumstats) - ALT_AF (reference VCF)
EAF: Effect allele frequency (from your sumstats)
RAF: Reference ALT allele frequency (from reference VCF/BCF file)

Interpretation:

Positive DAF: EAF in sumstats is higher than reference
Negative DAF: EAF in sumstats is lower than reference
Large |DAF| values (> 0.2) may indicate:
Population differences (expected for population-specific variants)
Allele mismatches (check EA/NEA alignment)
Strand flips (use infer_strand() to resolve)
Data quality issues (verify EAF calculation in sumstats)

You may want to check the allele frequency discrepancy with a reference VCF. Just specify the path and the right allele frequency for your target ancestry in INFO field.

Allele frequency correlation plot

GWASLab will simply calculate DAF = EAF (sumstats) - frequency in VCF file, and store the results in DAF column.

DAF can then be used for plotting (.plot_daf()) or filter variants.

Example

mysumstats.plot_daf(threshold=0.12)

Flip allele-specific statistics

.flip_allele_stats() : Flip allele-specific statistics to harmonize the variants based on the tracking status code in STATUS.

Example

mysumstats.check_ref(ref_path="ref_genome.fa")
mysumstats.flip_allele_stats()

mysumstats.infer_strand()
mysumstats.flip_allele_stats()

Sweep Mode for Large Datasets

Available since v3.4.42

Sweep mode is an optimized processing mode designed for large datasets that significantly improves performance when harmonizing summary statistics with reference VCF/BCF files by reducing I/O operations. Sweep mode uses bcftools for efficient VCF/BCF processing.

bcftools Required

Sweep mode requires bcftools to be installed and available in your PATH. If bcftools is not found, GWASLab will raise a RuntimeError.

Verify bcftools installation:

bcftools --version

Example output:

bcftools 1.19
Using htslib 1.19
Copyright (C) 2023 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

How Sweep Mode Works

When sweep_mode=True, GWASLab uses a different processing strategy that creates lookup tables instead of querying the reference file per-variant:

Normal mode (sweep_mode=False):
For strand inference (ref_infer): Uses _parallelize_infer_strand which queries the reference VCF/BCF file using tabix for each variant (or chunk of variants). Each query requires a separate I/O operation to the reference file.
For rsID assignment (ref_rsid_vcf): Uses _parallelize_assign_rsid which also performs per-variant or per-chunk tabix queries against the reference VCF/BCF file.
Sweep mode (sweep_mode=True):
For strand inference (ref_infer): Uses _infer_strand_with_annotation which:
1. First extracts all needed variants from the reference VCF/BCF in one pass using bcftools (bcftools view, bcftools query) to create a lookup table (TSV)
2. Then annotates all variants in your sumstats using this lookup table (fast in-memory operations)
3. Finally infers strand orientation using the annotated allele frequencies
For rsID assignment (ref_rsid_vcf): Uses _assign_rsid which:
1. First extracts a lookup table from the reference VCF/BCF in one pass using bcftools (bcftools view, bcftools query, bcftools index) via _extract_lookup_table_from_vcf_bcf
2. Then assigns rsIDs to all variants using _assign_from_lookup (fast in-memory operations)

Key Advantages of Sweep Mode

Reduced I/O operations: Instead of many tabix queries (one per variant/chunk), sweep mode makes a single pass through the reference file to extract all needed data
Better performance for large datasets: The lookup table approach scales better when processing millions of variants
Reusable lookup tables: The extracted lookup tables can be saved and reused in subsequent runs (if lookup_path is specified)
Lower overhead: Fewer file system operations and less network I/O (if using remote files)

When to Use Sweep Mode

Use sweep mode when: - Processing large datasets (millions of variants) - Using large reference VCF/BCF files (e.g., full dbSNP, 1000 Genomes) - You want to minimize I/O operations and improve throughput - You plan to reuse lookup tables for multiple harmonization runs - Processing variants from remote/network-mounted reference files

Use normal mode when: - Processing small to medium datasets (< 1 million variants) - You need maximum parallelization across CPU cores (normal mode parallelizes tabix queries) - The reference file is small or already indexed in a way that makes per-variant queries efficient - You prefer lower memory usage (sweep mode loads lookup tables into memory)

Usage Example

Example

# Harmonization with sweep mode for large dataset
mysumstats.harmonize(
    ref_seq="/path/to/reference.fasta",
    ref_rsid_vcf="/path/to/dbsnp.vcf.gz",
    ref_infer="/path/to/reference.vcf.gz",
    ref_alt_freq="AF",
    sweep_mode=True,  # Enable sweep mode
    threads=8
)

Note

Sweep mode only affects processing with ref_rsid_vcf (VCF/BCF files) and ref_infer (for strand inference)
TSV-based rsID assignment (ref_rsid_tsv) always uses the same efficient method regardless of sweep mode
Sweep mode uses bcftools for VCF/BCF processing - ensure bcftools is installed and in your PATH (see version info above)
Sweep mode creates temporary lookup TSV files that can be reused if you specify lookup_path in the kwargs
The lookup table extraction step may take some time initially, but subsequent annotation is very fast
bcftools operations are parallelized using the threads parameter

Assign CHR and POS according to rsID and reference data

mysumstats.rsid_to_chrpos()  

mysumstats.rsid_to_chrpos2()