QC and Filtering

GWASLab provides comprehensive quality control (QC) and filtering functions for summary statistics. These tools help you identify and remove problematic variants, check data consistency, and filter variants based on various criteria.

Overview

GWASLab's QC and filtering capabilities include:

  • Quality checks: Sanity checks for statistics, data consistency validation
  • Duplicate removal: Remove duplicated and multiallelic variants
  • Variant filtering: Filter by value, region, variant type, and more
  • All-in-one QC: basic_check() performs multiple QC steps in one call

Methods Summary

Quality Control Methods

| Method | Options | Description |
|---|---|---|
| .basic_check() | remove=False, remove_dup=False, normalize=True, threads=1, ... | All-in-one QC function that performs multiple checks and fixes |
| .check_sanity() | n, ncase, ncontrol, beta, se, eaf, p, mlog10p, ... | Sanity check for statistics including BETA, SE, Z, CHISQ, EAF, OR, N, ... |
| .check_data_consistency() | | Check whether the P/MLOG10P derived from BETA/SE matches the original P/MLOG10P and whether N = N_CASE + N_CONTROL |
| .remove_dup() | mode="md", keep='first', keep_col="P", remove=False | Remove duplicated, multiallelic, or NA variants |

Filtering Methods

| Method | Options | Description |
|---|---|---|
| .filter_value() | expr, inplace=False | Filter variants based on an expression (pandas query syntax) |
| .filter_flanking_by_id() | snpid, windowsizekb=500, inplace=False | Filter variants in flanking regions around specified SNPIDs/rsIDs |
| .filter_flanking_by_chrpos() | chrpos, windowsizekb=500, inplace=False | Filter variants in flanking regions around specified CHR:POS coordinates |
| .filter_region_in() | path, high_ld=False, build="19", inplace=False | Keep variants in regions defined by a BED file |
| .filter_region_out() | path, high_ld=False, build="19", inplace=False | Remove variants in regions defined by a BED file |
| .filter_palindromic() | mode="out", inplace=False | Filter palindromic variants (keep or remove) |
| .filter_snp() | mode="in", inplace=False | Filter SNPs (keep or remove) |
| .filter_indel() | mode="out", inplace=False | Filter indels (keep or remove) |
| .filter_hapmap3() | inplace=False | Filter to HapMap3 variants only |
| .exclude_hla() | inplace=False | Exclude variants in the HLA region (chr6:25-35 Mb) |

Statistics Sanity Check

.check_sanity() performs basic sanity checks on statistics to identify extreme values or values out of expected ranges.

Comparisons are performed with float_tolerance = 1e-7 for all float-type statistics. For example, the range eaf=(0, 1) is effectively widened to (-1e-7, 1 + 1e-7) to account for floating-point precision.
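
As a plain-Python illustration of this widening (not GWASLab internals):

# Illustration only: how a nominal (0, 1) range is widened by float_tolerance
tol = 1e-7
lo, hi = 0.0, 1.0
effective_range = (lo - tol, hi + tol)   # (-1e-7, 1.0000001)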

Default Parameter Ranges:

| Parameter | DataType | Default Range | Description |
|---|---|---|---|
| float_tolerance | float | 1e-7 | Tolerance for floating-point comparisons |
| n | tuple | (0, 2**31-1) | Sample size: 0 < N < 2³¹-1 |
| ncase | tuple | (0, 2**31-1) | Number of cases: 0 < N_CASE < 2³¹-1 |
| ncontrol | tuple | (0, 2**31-1) | Number of controls: 0 < N_CONTROL < 2³¹-1 |
| mac | tuple | (0, 2**31-1) | Minor allele count: 0 < MAC < 2³¹-1 |
| eaf | tuple | (0, 1) | Effect allele frequency: 0 < EAF < 1 |
| maf | tuple | (0, 0.5) | Minor allele frequency: 0 < MAF < 0.5 |
| chisq | tuple | (0, float("Inf")) | Chi-square statistic: CHISQ > 0 |
| z | tuple | (-9999, 9999) | Z-score: -9999 < Z < 9999 |
| t | tuple | (-99999, 99999) | T-statistic: -99999 < T < 99999 |
| f | tuple | (0, float("Inf")) | F-statistic: F > 0 |
| p | tuple | (0, 1) | P-value: 0 < P < 1 (P = 0 will cause a warning) |
| mlog10p | tuple | (0, 99999) | Negative log10 P-value: 0 < MLOG10P < 99999 |
| beta | tuple | (-100, 100) | Effect size: -100 < BETA < 100 |
| se | tuple | (0, float("Inf")) | Standard error: SE > 0 |
| OR | tuple | (0, 100) | Odds ratio: 0 < OR < 100 |
| OR_95L | tuple | (0, float("Inf")) | OR 95% CI lower bound: OR_95L > 0 |
| OR_95U | tuple | (0, float("Inf")) | OR 95% CI upper bound: OR_95U > 0 |
| HR | tuple | (0, 100) | Hazard ratio: 0 < HR < 100 |
| HR_95L | tuple | (0, float("Inf")) | HR 95% CI lower bound: HR_95L > 0 |
| HR_95U | tuple | (0, float("Inf")) | HR 95% CI upper bound: HR_95U > 0 |
| info | tuple | (0, 2) | Imputation INFO score: 0 < INFO < 2 |

Usage:

# Use default ranges
mysumstats.check_sanity()

# Customize ranges
mysumstats.check_sanity(
    p=(1e-300, 1),           # Flag p-values smaller than 1e-300
    beta=(-50, 50),          # Tighter BETA range
    info=(0.8, 2),           # Require INFO > 0.8
    float_tolerance=1e-6     # Looser comparison tolerance
)

Notes:

  • Columns not present in the sumstats are automatically skipped
  • Invalid values are flagged but not automatically removed; to remove them, use basic_check(remove=True) as shown below
  • The DIRECTION column is checked to ensure it contains only "+", "-", "0", or "?"
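
A minimal sketch of the flag-then-remove pattern described above:

# Flag out-of-range values only; results are reported in the log, nothing is removed
mysumstats.check_sanity(info=(0.8, 2))

# To actually drop variants failing the (default-range) checks, use the all-in-one QC
mysumstats.basic_check(remove=True)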

Remove duplicated or multiallelic variants

After standardizing and normalizing the sumstats, you can also remove duplicated or multiallelic variants using:

mysumstats.remove_dup(mode="md")
  • mode="d": remove duplicated variants, checking (1) SNPID, (2) CHR, POS, EA, and NEA, and (3) rsID
  • mode="s": remove duplicated variants based on SNPID only
  • mode="c": remove duplicated variants based on CHR, POS, EA, and NEA only
  • mode="r": remove duplicated variants based on rsID only
  • mode="m": remove multiallelic variants based on CHR and POS
  • remove=True: also remove variants with NA values
  • keep_col: column used to sort variants before deduplication (keep_ascend=True: ascending order)
  • keep: which duplicate to keep, 'first' or 'last'

Example

mysumstats.remove_dup(mode="md",keep='first',keep_col="P",remove=False)

Fri Jan 13 17:34:38 2023 Start to sort the sumstats using P...
Fri Jan 13 17:34:38 2023 Start to remove duplicated variants based on snpid...
Fri Jan 13 17:34:38 2023  -Current Dataframe shape : 9  x  11
Fri Jan 13 17:34:38 2023  -Which variant to keep:  first
Fri Jan 13 17:34:38 2023  -Removed  1  based on SNPID...
Fri Jan 13 17:34:38 2023 Start to remove duplicated variants based on rsID...
Fri Jan 13 17:34:38 2023  -Removed  1  based on rsID...
Fri Jan 13 17:34:38 2023 Start to remove duplicated variants based on CHR,POS,EA and NEA...
Fri Jan 13 17:34:38 2023  -Current Dataframe shape : 7  x  11
Fri Jan 13 17:34:38 2023  -Which variant to keep:  first
Fri Jan 13 17:34:38 2023  -Removed  1  based on CHR,POS,EA and NEA...
Fri Jan 13 17:34:38 2023 Start to remove multiallelic variants based on chr:pos...
Fri Jan 13 17:34:38 2023  -Which variant to keep:  first
Fri Jan 13 17:34:38 2023  -Removed  0  multiallelic variants...
Fri Jan 13 17:34:38 2023  -Removed  3  variants in total.
Fri Jan 13 17:34:38 2023  -Sort the coordinates...
Fri Jan 13 17:34:38 2023 Finished removing successfully!

This will remove duplicated and multiallelic variants and keep the one with the lowest P.

[Images: the sumstats before and after duplicate removal]
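
If you also want to drop variants with missing values, combine the remove option described above with the same call (a short sketch):

# Remove duplicated/multiallelic variants and also drop variants with NA values
mysumstats.remove_dup(mode="md", keep='first', keep_col="P", remove=True)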

Check Data Consistency

mysumstats.check_data_consistency()

GWASLab checks whether the P/MLOG10P values derived from BETA and SE match the reported P/MLOG10P, and whether N equals N_CASE + N_CONTROL.
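
To make the checked relationship concrete, here is a small illustration in plain Python/SciPy (not GWASLab internals) of how P and MLOG10P are derived from BETA and SE:

import numpy as np
from scipy import stats

beta, se = 0.12, 0.03
z = beta / se                            # Z-score implied by BETA and SE
p_derived = 2 * stats.norm.sf(abs(z))    # two-sided P derived from BETA/SE
mlog10p_derived = -np.log10(p_derived)   # corresponding -log10(P)
# check_data_consistency() compares such derived values with the reported P/MLOG10P,
# and, when the columns are present, checks that N equals N_CASE + N_CONTROL.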

Filtering Variants by Expression: filter_value()

Filter variants using pandas query syntax. This is the most flexible filtering method and allows complex conditional filtering.

mysumstats.filter_value(expr, inplace=False)

Parameters:

| Parameter | DataType | Description | Default |
|---|---|---|---|
| expr | str | Query string using pandas syntax, e.g. 'BETA > 0 & N > 10000' | Required |
| inplace | bool | If False, return a new Sumstats object. If True, filter in place. | False |

Features:

  • Supports method chaining (returns new object by default)
  • Uses pandas DataFrame.query() syntax
  • Can combine multiple conditions with & (and), | (or), ~ (not)

pandas.DataFrame.query()

For detailed query syntax, see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Filter by p-value and effect size

# Keep only significant variants with positive effect
significant = mysumstats.filter_value('P < 5e-8 & BETA > 0')

# Filter and plot in one chain
mysumstats.filter_value('BETA < 0 & CHR == 1').plot_mqq()

Complex filtering

# Multiple conditions
filtered = mysumstats.filter_value(
    'P < 1e-5 & INFO > 0.8 & N > 5000 & EAF > 0.01 & EAF < 0.99'
)

# Filter by chromosome
chr1_variants = mysumstats.filter_value('CHR == 1')

# Filter by range
beta_range = mysumstats.filter_value('-1 < BETA < 1')

Filtering Variants in Flanking Regions

Extract variants within a specified window around target variants. Useful for regional association plots and fine-mapping.

Available since v3.4.37

.filter_flanking_by_id()

Filter variants in flanking regions around specified SNPID or rsID.

mysumstats.filter_flanking_by_id(
    snpid=["rs123", "rs456", "1:12345:A:G"],
    windowsizekb=500,
    inplace=False
)

Parameters:

| Parameter | DataType | Description | Default |
|---|---|---|---|
| snpid | list | List of SNPIDs or rsIDs, e.g. ["rs123"], ["1:12345:A:G"], ["rs123", "rs456"] | Required |
| windowsizekb | int | Flanking window size in kilobases (kb) on each side | 500 |
| inplace | bool | If False, return a new Sumstats object. If True, filter in place. | False |

.filter_flanking_by_chrpos()

Filter variants in flanking regions around specified chromosome and position coordinates.

mysumstats.filter_flanking_by_chrpos(
    chrpos=[(1, 12345), (2, 67891)],
    windowsizekb=500,
    inplace=False
)

Parameters:

| Parameter | DataType | Description | Default |
|---|---|---|---|
| chrpos | list | List of (CHR, POS) tuples, e.g. [(1, 12345), (2, 67891)] | Required |
| windowsizekb | int | Flanking window size in kilobases (kb) on each side | 500 |
| inplace | bool | If False, return a new Sumstats object. If True, filter in place. | False |

Extract variants around lead SNPs

# Extract variants ±500kb around a lead SNP
regional = mysumstats.filter_flanking_by_id(
    snpid=["rs123456"],
    windowsizekb=500
)

# Extract variants around multiple positions
multiple_regions = mysumstats.filter_flanking_by_chrpos(
    chrpos=[(1, 12345678), (2, 87654321)],
    windowsizekb=200  # ±200kb window
)

Filtering Variants in Regions Using BED Files

Filter variants based on genomic regions defined in BED (Browser Extensible Data) format files or built-in regions.

.filter_region_in()

Keep only variants within specified regions.

mysumstats.filter_region_in(
    path="regions.bed",
    high_ld=False,
    build="19",
    inplace=False
)

.filter_region_out()

Remove variants within specified regions.

mysumstats.filter_region_out(
    path="regions.bed",
    high_ld=False,
    build="19",
    inplace=False
)

Parameters:

| Parameter | DataType | Description | Default |
|---|---|---|---|
| path | str | Path to a BED file. BED format: chr\tstart\tend (0-based, half-open) | None |
| high_ld | bool | If True, use built-in high-LD regions (commonly excluded regions) | False |
| build | str | Genome build for the high-LD regions ("19" for hg19, "38" for hg38) | "19" |
| inplace | bool | If False, return a new Sumstats object. If True, filter in place. | False |

BED File Format:

chr1    1000000    2000000
chr2    5000000    6000000
chr3    10000000   11000000
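
If you prefer to generate the BED file from Python, a minimal sketch (my_regions.bed is just an example file name) is:

# Write tab-separated, header-less BED lines matching the format above
regions = [("chr1", 1000000, 2000000),
           ("chr2", 5000000, 6000000)]
with open("my_regions.bed", "w") as f:
    for chrom, start, end in regions:
        f.write(f"{chrom}\t{start}\t{end}\n")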

Filter by BED file

# Keep variants in regions defined by BED file
in_regions = mysumstats.filter_region_in(path="my_regions.bed")

# Remove variants in high-LD regions (built-in)
no_high_ld = mysumstats.filter_region_out(high_ld=True, build="38")

# Remove variants in custom regions
filtered = mysumstats.filter_region_out(path="exclude_regions.bed")

Common use cases

# Exclude high-LD regions (often done before LDSC); reassign since a new object is returned by default
mysumstats = mysumstats.filter_region_out(high_ld=True, build="38")

# Keep only variants in gene regions
gene_regions = mysumstats.filter_region_in(path="genes.bed")

Quick Reference: Common QC Workflows

Standard QC Pipeline

import gwaslab as gl

# Load sumstats
mysumstats = gl.Sumstats("sumstats.txt.gz", fmt="plink2")

# Comprehensive QC
mysumstats.basic_check(remove=True, remove_dup=True, normalize=True)

# Additional filtering (these filters return a new object by default, so reassign)
mysumstats = mysumstats.filter_palindromic(mode="out")  # Remove palindromic variants
mysumstats = mysumstats.exclude_hla()                   # Exclude HLA region
mysumstats = mysumstats.filter_value('INFO > 0.8 & EAF > 0.01 & EAF < 0.99')  # Quality filters

# Check results
mysumstats.summary()

Pre-LDSC Filtering

# Standard filtering for LDSC (reassign since filters return a new object by default)
mysumstats = mysumstats.filter_hapmap3()                             # HapMap3 variants only
mysumstats = mysumstats.filter_palindromic(mode="out")               # Remove palindromic variants
mysumstats = mysumstats.exclude_hla()                                # Exclude HLA region
mysumstats = mysumstats.filter_region_out(high_ld=True, build="38")  # Exclude high-LD regions
mysumstats = mysumstats.filter_value('INFO > 0.9 & MAF > 0.01')      # Quality filters

Regional Analysis

# Extract variants around lead SNPs for regional plots
regional = mysumstats.filter_flanking_by_id(
    snpid=["rs123456", "rs789012"],
    windowsizekb=500
)

# Plot regional association
regional.plot_regional()

Notes

  • Inplace vs New Object: Most filtering methods return a new Sumstats object by default (inplace=False), allowing method chaining. Set inplace=True to modify the current object.
  • Method Chaining: You can chain multiple filtering operations: mysumstats.filter_value('P < 1e-5').filter_palindromic(mode="out").plot_mqq()
  • Performance: For large datasets, consider using inplace=True to avoid creating multiple copies of the data.
  • Status Codes: Filtering operations may update the STATUS column. Check status codes after filtering to understand variant quality.
  • Logging: All QC and filtering operations are logged. Use mysumstats.log.show() to review the operations performed.
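
As a short sketch of the inplace and logging behavior described in these notes (using the filter_value() options documented above):

# Default (inplace=False): a new Sumstats object is returned; mysumstats is unchanged
subset = mysumstats.filter_value('P < 1e-5')

# inplace=True: filter the current object directly (saves memory on large datasets)
mysumstats.filter_value('P < 1e-5', inplace=True)

# Review all QC and filtering operations performed so far
mysumstats.log.show()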

All-in-One QC: basic_check()

The basic_check() method is a comprehensive QC function that performs multiple quality control steps in a single call. It's recommended for initial QC of your sumstats.

mysumstats.basic_check(
    remove=False,           # Remove bad quality variants
    remove_dup=False,       # Remove duplicated variants
    normalize=True,         # Normalize indels
    threads=1,              # Number of threads
    verbose=True
)

What it does:

The function performs the following steps in sequence:

1. ID check (fix_id)
   - Validates and fixes the SNPID format (CHR:POS:NEA:EA pattern)
   - Extracts CHR and POS from SNPID if available
   - Reconstructs SNPID from CHR, POS, EA, and NEA if missing
   - Standardizes SNPID delimiters (converts "_" and "-" to ":")
   - Validates the rsID format (rs followed by numbers)
   - Removes the "chr" prefix from SNPID if present
   - Extracts EA and NEA from SNPID if needed
2. Chromosome check (fix_chr)
   - Standardizes chromosome notation (converts to integer type)
   - Extracts chromosome numbers from various formats (e.g., "chr1", "1", "chrX")
   - Maps special chromosomes: X→23, Y→24, MT→25
   - Validates chromosome values (removes invalid or out-of-range values if remove=True)
   - Checks for missing chromosome values
3. Position check (fix_pos)
   - Validates that basepair positions fall within valid chromosomal bounds
   - Removes thousands separators ("," or "_") from position strings
   - Converts positions to Int64 type
   - Filters out positions outside the valid range (default: 0 to 250,000,000)
   - Removes invalid or missing positions if remove=True
4. Allele check (fix_allele)
   - Validates that allele fields contain only valid nucleotide characters (A, T, C, G)
   - Converts all alleles to uppercase
   - Standardizes allele storage using categorical data types
   - Checks for missing EA or NEA values
   - Identifies variants where EA == NEA (non-variants)
   - Classifies variants as SNPs, indels, normalized, or not normalized
   - Removes invalid alleles if remove=True
5. Statistics sanity check (check_sanity)
   - Validates that numerical summary statistics (N, N_CASE, N_CONTROL, EAF, MAF, BETA, SE, P, MLOG10P, Z, CHISQ, OR, HR, INFO, ...) fall within the default ranges listed in the Statistics Sanity Check section above
   - Checks and fixes data types for all statistical columns
   - Removes variants with invalid statistics if remove=True
6. Data consistency check (check_data_consistency)
   - Checks consistency between BETA/SE and Z (Z should equal BETA/SE)
   - Checks consistency between BETA/SE and P (P derived from BETA/SE)
   - Checks consistency between MLOG10P and P (P should equal 10^(-MLOG10P))
   - Checks consistency between N and N_CASE + N_CONTROL (if all are present)
   - Reports inconsistencies (often due to rounding) but does not remove variants
7. Indel normalization, if normalize=True (normalize_allele)
   - Normalizes indels using left-alignment and parsimony principles (VCF standard)
   - Left-aligns variants by removing common sequence context
   - Adjusts positions to reflect the normalized representation
   - Only processes variants that need normalization (indels)
   - Updates status codes to indicate the normalization state
8. Duplicate removal, if remove_dup=True (remove_dup)
   - Removes duplicate variants based on SNPID (if mode includes "d" or "s")
   - Removes duplicate variants based on rsID (if mode includes "d" or "r")
   - Removes duplicate variants based on CHR, POS, EA, and NEA (if mode includes "d" or "c")
   - Removes multiallelic variants (same CHR and POS, if mode includes "m")
   - Keeps the best variant based on the specified column (default: P)
9. Coordinate sorting (sort_coordinate)
   - Sorts variants by genomic coordinates (CHR, then POS)
10. Column sorting (sort_column)
   - Sorts columns into the GWASLab default order

Parameters:

  • remove (bool): Remove bad quality variants detected during fixing
  • remove_dup (bool): Remove duplicated or multiallelic variants
  • normalize (bool): Perform indel normalization
  • threads (int): Number of threads for parallel processing
  • fix_id_kwargs, fix_chr_kwargs, etc.: Keyword arguments passed to individual functions
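
A common pattern consistent with these parameters is to run basic_check() once without removal to see what gets flagged, then rerun it with removal enabled (a sketch of one possible workflow, not the only one):

# First pass: flag issues only; nothing is removed
mysumstats.basic_check(remove=False, remove_dup=False)
mysumstats.log.show()

# Second pass: re-run with removal once the flagged issues look acceptable
mysumstats.basic_check(remove=True, remove_dup=True)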

Basic QC workflow

import gwaslab as gl

# Load sumstats
mysumstats = gl.Sumstats("sumstats.txt.gz", fmt="plink2")

# Run comprehensive QC
mysumstats.basic_check(remove=True, remove_dup=True)

# Check results
mysumstats.summary()

Additional Filtering Methods

Filter by Variant Type

.filter_palindromic()

Filter palindromic variants (e.g., A/T, C/G SNPs that are ambiguous on the reverse strand).

# Remove palindromic variants (recommended for most analyses)
mysumstats.filter_palindromic(mode="out", inplace=False)

# Keep only palindromic variants
mysumstats.filter_palindromic(mode="in", inplace=False)

Parameters:

  • mode (str): "in" to keep palindromic variants, "out" to remove them (default: "out")
  • inplace (bool): If False, return a new Sumstats object. If True, filter in place.

.filter_snp() and .filter_indel()

Filter variants by type (SNPs vs indels).

# Keep only SNPs (remove indels)
mysumstats.filter_snp(mode="in", inplace=False)

# Remove indels
mysumstats.filter_indel(mode="out", inplace=False)

Parameters:

  • mode (str): "in" to keep the variant type, "out" to remove it
  • inplace (bool): If False, return a new Sumstats object. If True, filter in place.

Filter by Reference Panel

.filter_hapmap3()

Filter to variants present in HapMap3 reference panel (commonly used for LDSC and other analyses).

mysumstats.filter_hapmap3(inplace=False)

Parameters:

  • inplace (bool): If False, return a new Sumstats object. If True, filter in place.
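
Because inplace defaults to False, keep the returned subset (or pass inplace=True):

# Keep only HapMap3 variants and continue with the returned subset
hm3 = mysumstats.filter_hapmap3()

# Or filter the current object in place
mysumstats.filter_hapmap3(inplace=True)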

Exclude Specific Regions

.exclude_hla()

Exclude variants in the HLA (Human Leukocyte Antigen) region on chromosome 6 (chr6:25-35 Mb). This region has complex LD structure and is often excluded from certain analyses.

mysumstats.exclude_hla(inplace=False)

Parameters:

  • inplace (bool): If False, return a new Sumstats object. If True, filter in place.

Common filtering workflow

# These filters return a new object by default, so reassign at each step

# Remove palindromic variants
mysumstats = mysumstats.filter_palindromic(mode="out")

# Remove indels
mysumstats = mysumstats.filter_indel(mode="out")

# Exclude HLA region
mysumstats = mysumstats.exclude_hla()

# Filter to HapMap3 variants (for LDSC)
mysumstats = mysumstats.filter_hapmap3()

Filtering with Thresholds: filter_in() and filter_out()

These methods provide a convenient way to filter variants based on threshold values.

# Filter in variants (keep variants meeting criteria)
mysumstats.filter_in(
    gt={"INFO": 0.9, "N": 10000},      # Keep variants with INFO > 0.9 and N > 10000
    lt={"P": 1e-5},                     # Keep variants with P < 1e-5
    eq={"CHR": 1},                      # Keep variants on chromosome 1
    inplace=False
)

# Filter out variants (remove variants meeting criteria)
mysumstats.filter_out(
    gt={"P": 0.05},                     # Remove variants with P > 0.05
    lt={"INFO": 0.3},                   # Remove variants with INFO < 0.3
    eq={"CHR": "X"},                    # Remove variants on chromosome X
    inplace=False
)

Parameters:

  • gt (dict): Dictionary of {column: threshold} for "greater than" filtering
  • lt (dict): Dictionary of {column: threshold} for "less than" filtering
  • eq (dict): Dictionary of {column: value} for "equal to" filtering
  • inplace (bool): If False, return a new Sumstats object. If True, filter in place.

Note on deprecated methods

The filter_in() and filter_out() methods are still available and functional, but filter_value() with pandas query syntax is generally more flexible and recommended for complex filtering.
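
For reference, the filter_in() example above can be written as a single filter_value() query (an equivalent sketch using the documented query syntax):

# Equivalent of the filter_in() example: INFO > 0.9, N > 10000, P < 1e-5, CHR == 1
kept = mysumstats.filter_value('INFO > 0.9 & N > 10000 & P < 1e-5 & CHR == 1')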