Harmonization
GWASLab provides reference-dependent harmonization functions.
Harmonization Workflow
The harmonization process follows a sequential workflow to align and standardize summary statistics with reference data:
Load Sumstats
│
▼
Basic Check (optional)
│
▼
Check Reference Alignment
│
├─→ NEA ≠ REF ─→ Flip Allele Stats ─┐
│ │
└─→ NEA == REF ──────────────────────┘
│
▼
Assign rsID
│
▼
Infer Strand
│
├─→ Strand mismatch ─→ Flip Allele Stats ─┐
│ │
└─→ Strand correct ───────────────────────┘
│
▼
Harmonized Sumstats
Workflow Steps:
- Basic Check (optional,
basic_check=True): Fix IDs, chromosomes, positions, and alleles - Check Reference Alignment (
check_ref()): Verify NEA matches REF in reference genome - Flip Allele Stats (if needed): Flip alleles and allele-specific statistics when NEA ≠ REF
- Assign rsID (
assign_rsid()): Annotate variants with rsIDs from reference - Infer Strand (
infer_strand()): Determine strand orientation for palindromic SNPs and indels - Flip Allele Stats (if needed): Flip again based on strand inference results
All-in-One Function
Use .harmonize() to run the complete workflow automatically:
Methods summary
| Sumstats Methods | Options | Description |
|---|---|---|
.check_ref() |
ref_path |
Check alignment with a reference genome FASTA file |
.assign_rsid() |
ref_rsid_tsv,ref_rsid_vcf,threads=1, chunksize=5000000, overwrite="empty" |
Annotate rsID using a reference tabular file or VCF/BCF file |
.infer_strand() |
ref_infer,ref_alt_freq=None,maf_threshold=0.40,remove_snp="",daf_tolerance=0.2, ,mode="pi",threads=1,remove_indel="" |
Infer the strand of palindromic variants/indistinguishable indels using reference VCF/BCF files based on allele frequency in INFO |
.check_af() |
ref_infer,ref_alt_freq=None,maf_threshold=0.40,threads=1 |
Calculate difference in allele frequencies (DAF) between sumstats EAF and reference VCF ALT frequency |
.flip_allele_stats() |
After alignment and inferring, flip the alleles and allele-specific statistics to harmonize the variants. | |
.harmonize() |
basic_check=True, ref_seq=None,ref_rsid_tsv=None,ref_rsid_vcf=None,ref_infer=None,ref_alt_freq=None,maf_threshold=0.40,threads=1,remove=False,sweep_mode=False,checkref_args={},removedup_args={},assignrsid_args={},inferstrand_args={},flipallelestats_args={},fixid_args={},fixchr_agrs={},fixpos_args={},fixallele_args={},sanitycheckstats_args={},normalizeallele_args={} |
All-in-one function for harmonization |
Align NEA with REF in the reference genome
.check_ref(): Check if NEA is aligned with the reference sequence. After checking, the tracking status code will be changed accordingly.
.check_ref() options |
DataType | Description | Default |
|---|---|---|---|
ref_path |
string |
path to the reference genome FASTA file. | |
ref_seq_mode (available since v3.4.42) |
v or s |
v for vectorized implementation (faster); s for single row iteration mode |
v since v3.4.42 (except v3.4.43) |
Chromosome Conversion
Chromosome format conversion between sumstats and reference files is handled automatically by ChromosomeMapper (accessed via mysumstats.mapper). No manual chr_dict parameter is needed.
Note
check_ref() only change the status code. Use flip function .flip_allele_stats() to flip the allele-specific stats.
Assign rsID according to CHR, POS, REF/ALT
.assign_rsid() : Annotated variants with rsID using a reference tsv file (1KG variants) and reference vcf file (tabix indexed, entire dbSNP).
See https://cloufield.github.io/gwaslab/AssignrsID/
Example
- For TSV file, variants will be matched using SNPID (CHR:POS:NEA:EA) for quick assigning.
- For VCF file, GWASLab will first extract all variants in the reference file with matching CHR and POS. And then compare EA/NEA in sumstats with REF/ALT in reference vcf. When matching, it will annotate the variant in sumstats with the matching rsID in reference vcf.
Check palindromic SNPs or indistinguishable Indels
.infer_strand() and .infer_strand2():
- Infer the strand for palindromic SNPs (A/T, T/A, G/C, C/G) with MAF ≤
maf_threshold. - Check the alignment status of indels with the REF allele in a reference VCF/BCF file and verify if the allele frequencies are consistent (DAF ≤
daf_tolerance). - Update STATUS code 7th digit to reflect strand inference results (0=non-palindromic, 1=forward, 2=reverse, 3=indel forward, 4=ambiguous indel, 6=indel reverse, 7=MAF too high, 8=not found, 9=unchecked).
DAF in GWASLab
DAF in GWASLab: Difference between effect allele frequency (EAF) in sumstats and ALT frequency in reference VCF/BCF file
Note
This DAF in GWASLab is not the derived allele frequency in evolutionary genetics.
.infer_strand() options |
DataType | Description | Default |
|---|---|---|---|
ref_infer |
string |
path to the reference VCF/BCF file (must be tabix-indexed for efficient querying). | |
ref_alt_freq |
string |
the field for alternative allele frequency in VCF INFO (e.g., "AF", "AF_popmax", "gnomAD_AF"). If None, auto-detection is attempted. | |
maf_threshold |
float |
maximum minor allele frequency threshold. Only palindromic SNPs with MAF ≤ maf_threshold will be inferred |
0.4 |
ref_maf_threshold |
float |
maximum minor allele frequency threshold for reference RAF. Used as additional filter for both palindromic SNPs and indels. | 0.4 |
daf_tolerance |
float |
difference in allele frequency tolerance for indels. Only indels with | EAF - RAF |
remove_snp |
, 7 or 8 |
7 remove palindromic SNPs with MAF unable to infer. 8: remove palindromic SNPs with No information in reference VCF/BCF |
`` |
remove_indel |
or 8 |
8: indistinguishable indels with No information in reference VCF/BCF |
`` |
mode |
p, i, or pi |
p: infer palindromic SNPs only. i: infer indels only. pi: infer both (default). |
pi |
threads |
int |
number of CPU threads to use | 1 |
cache_options (available since v3.4.42) |
dict |
options for using cache to speed up this step | None |
.infer_strand2() options |
DataType | Description | Default |
|---|---|---|---|
path / vcf_path / tsv_path |
string |
path to reference file (VCF/BCF or TSV). vcf_path overrides path, tsv_path overrides both. If TSV not provided, generated from VCF. |
|
assign_cols |
tuple or str |
column names to extract from reference (default: ("AF",)). First column will be renamed to RAF. | ("AF",) |
maf_threshold |
float |
maximum minor allele frequency threshold for sumstats EAF | 0.4 |
ref_maf_threshold |
float |
maximum minor allele frequency threshold for reference RAF | 0.4 |
daf_tolerance |
float |
difference in allele frequency tolerance for indels | 0.2 |
threads |
int |
number of CPU threads for processing | 6 |
reuse_lookup |
bool |
if True, reuse existing lookup table TSV file if found | True |
convert_to_bcf |
bool |
if True, convert VCF to BCF format for faster processing | False |
strip_info |
bool |
if True, strip INFO fields from VCF during lookup table generation | True |
Chromosome Conversion
Chromosome format conversion is handled automatically by ChromosomeMapper. No manual chr_dict parameter is needed.
cache_options |
DataType | Description | Default |
|---|---|---|---|
cache_manager |
CacheManager object or None |
If any between cache_loader and cache_process is not None, or use_cache is True, a CacheManager object will be created automatically. | |
trust_cache |
bool |
Whether to completely trust the cache or not. Trusting the cache means that any key not found inside the cache will be considered as a missing value even in the VCF file. | True |
cache_loader |
Object with a get_cache() method or None |
Object with a get_cache() method or None. | |
cache_process |
Object with an apply_fn() method or None |
Object with an apply_fn() method or None. | |
use_cache |
bool |
If any of the cache_manager, cache_loader or cache_process is not None, this will be set to True automatically. If set to True and all between cache_manager, cache_loader and cache_process are None, the cache will be loaded (or built) on the spot. | False |
Example
# Old method: per-variant VCF queries (good for small datasets)
mysumstats.infer_strand(ref_infer=gl.get_path("1kg_eas_hg19"),
ref_alt_freq="AF",
maf_threshold=0.40,
daf_tolerance=0.20,
threads=4)
mysumstats.flip_allele_stats()
# New optimized method: bulk lookup table (faster for large datasets)
mysumstats.infer_strand2(vcf_path=gl.get_path("1kg_eas_hg19"),
maf_threshold=0.40,
ref_maf_threshold=0.40,
daf_tolerance=0.20,
threads=6)
mysumstats.flip_allele_stats()
# Using cache to speed up old method
mysumstats.infer_strand(ref_infer=gl.get_path("1kg_eas_hg19"),
cache_options={"use_cache":True})
mysumstats.flip_allele_stats()
Note
infer_strand() and infer_strand2() only change the status code. Use flip function .flip_allele_stats() to flip the allele-specific stats.
When to use infer_strand2()
Use .infer_strand2() for:
- Large datasets (millions of variants)
- When you have bcftools installed
- When you want faster processing with reduced I/O operations
- When you plan to reuse lookup tables for multiple runs
Use .infer_strand() for:
- Small to medium datasets (< 1 million variants)
- When bcftools is not available
- When you need maximum parallelization across CPU cores
Check the difference in allele frequency
.check_af() : Check the allele frequency discrepancy with a reference VCF/BCF file. This function calculates the difference (DAF) between the effect allele frequency (EAF) in your sumstats and the alternative allele frequency (ALT_AF) in the reference VCF.
Purpose:
- Quality control: Identify variants with large differences in allele frequency between sumstats and reference
- Validation: Verify that EAF values in sumstats are consistent with reference population frequencies
- Detect potential issues: Large |DAF| values (> 0.2) may indicate allele mismatches, strand flips, or data quality issues
Important: Please make sure your sumstats are already harmonized, and the variants in reference VCF are also aligned. GWASLab will retrieve information only for matched variants (CHR, POS, EA-ALT, and NEA-REF).
.check_af() options |
DataType | Description | Default |
|---|---|---|---|
ref_infer |
string |
path to the reference VCF/BCF file (must be tabix-indexed for efficient querying). | |
ref_alt_freq |
string |
the field for alternative allele frequency in VCF INFO (e.g., "AF", "AF_popmax", "gnomAD_AF"). If None, auto-detection is attempted. | |
maf_threshold |
float |
minor allele frequency threshold for filtering variants. Only variants with MAF ≤ threshold in both sumstats and reference are included in DAF statistics. | 0.4 |
column_name |
string |
name of the column to store DAF values | "DAF" |
suffix |
string |
suffix to append to column name (useful when comparing multiple reference populations) | "" |
threads |
int |
number of CPU threads to use | 1 |
force |
bool |
if True, check all variants regardless of STATUS codes | False |
Terminology:
- DAF: Difference in Allele Frequency = EAF (sumstats) - ALT_AF (reference VCF)
- EAF: Effect allele frequency (from your sumstats)
- RAF: Reference ALT allele frequency (from reference VCF/BCF file)
Interpretation:
- Positive DAF: EAF in sumstats is higher than reference
- Negative DAF: EAF in sumstats is lower than reference
- Large |DAF| values (> 0.2) may indicate:
- Population differences (expected for population-specific variants)
- Allele mismatches (check EA/NEA alignment)
- Strand flips (use
infer_strand()to resolve) - Data quality issues (verify EAF calculation in sumstats)
You may want to check the allele frequency discrepancy with a reference VCF. Just specify the path and the right allele frequency for your target ancestry in INFO field.
Allele frequency correlation plot
GWASLab will simply calculate DAF = EAF (sumstats) - frequency in VCF file, and store the results in DAF column.
DAF can then be used for plotting (.plot_daf()) or filter variants.
Flip allele-specific statistics
.flip_allele_stats() : Flip allele-specific statistics to harmonize the variants based on the tracking status code in STATUS.
Example
Sweep Mode for Large Datasets
Available since v3.4.42
Sweep mode is an optimized processing mode designed for large datasets that significantly improves performance when harmonizing summary statistics with reference VCF/BCF files by reducing I/O operations. Sweep mode uses bcftools for efficient VCF/BCF processing.
bcftools Required
Sweep mode requires bcftools to be installed and available in your PATH. If bcftools is not found, GWASLab will raise a RuntimeError.
Verify bcftools installation:
Example output:
How Sweep Mode Works
When sweep_mode=True, GWASLab uses a different processing strategy that creates lookup tables instead of querying the reference file per-variant:
- Normal mode (sweep_mode=False):
- For strand inference (
ref_infer): Uses_parallelize_infer_strandwhich queries the reference VCF/BCF file using tabix for each variant (or chunk of variants). Each query requires a separate I/O operation to the reference file. -
For rsID assignment (
ref_rsid_vcf): Uses_parallelize_assign_rsidwhich also performs per-variant or per-chunk tabix queries against the reference VCF/BCF file. -
Sweep mode (sweep_mode=True):
- For strand inference (
ref_infer): Uses_infer_strand_with_annotationwhich:- First extracts all needed variants from the reference VCF/BCF in one pass using bcftools (
bcftools view,bcftools query) to create a lookup table (TSV) - Then annotates all variants in your sumstats using this lookup table (fast in-memory operations)
- Finally infers strand orientation using the annotated allele frequencies
- First extracts all needed variants from the reference VCF/BCF in one pass using bcftools (
- For rsID assignment (
ref_rsid_vcf): Uses_assign_rsidwhich:- First extracts a lookup table from the reference VCF/BCF in one pass using bcftools (
bcftools view,bcftools query,bcftools index) via_extract_lookup_table_from_vcf_bcf - Then assigns rsIDs to all variants using
_assign_from_lookup(fast in-memory operations)
- First extracts a lookup table from the reference VCF/BCF in one pass using bcftools (
Key Advantages of Sweep Mode
- Reduced I/O operations: Instead of many tabix queries (one per variant/chunk), sweep mode makes a single pass through the reference file to extract all needed data
- Better performance for large datasets: The lookup table approach scales better when processing millions of variants
- Reusable lookup tables: The extracted lookup tables can be saved and reused in subsequent runs (if
lookup_pathis specified) - Lower overhead: Fewer file system operations and less network I/O (if using remote files)
When to Use Sweep Mode
Use sweep mode when: - Processing large datasets (millions of variants) - Using large reference VCF/BCF files (e.g., full dbSNP, 1000 Genomes) - You want to minimize I/O operations and improve throughput - You plan to reuse lookup tables for multiple harmonization runs - Processing variants from remote/network-mounted reference files
Use normal mode when: - Processing small to medium datasets (< 1 million variants) - You need maximum parallelization across CPU cores (normal mode parallelizes tabix queries) - The reference file is small or already indexed in a way that makes per-variant queries efficient - You prefer lower memory usage (sweep mode loads lookup tables into memory)
Usage Example
Example
Note
- Sweep mode only affects processing with
ref_rsid_vcf(VCF/BCF files) andref_infer(for strand inference) - TSV-based rsID assignment (
ref_rsid_tsv) always uses the same efficient method regardless of sweep mode - Sweep mode uses bcftools for VCF/BCF processing - ensure bcftools is installed and in your PATH (see version info above)
- Sweep mode creates temporary lookup TSV files that can be reused if you specify
lookup_pathin the kwargs - The lookup table extraction step may take some time initially, but subsequent annotation is very fast
- bcftools operations are parallelized using the
threadsparameter