Infer Ancestry

GWASLab infers the genetic ancestry of your summary statistics by comparing effect allele frequencies (EAF) with 1000 Genomes Project reference data using Fst (fixation index) calculations.

How It Works

The function compares the effect allele frequencies from your sumstats against allele frequencies from 26 populations in the 1000 Genomes Project. It calculates Fst values for each population and identifies the population with the minimum average Fst, which indicates the closest genetic ancestry match.

Loads 1000 Genomes reference data (Hapmap3 SNPs) for the specified genome build
Matches CHR:POS coordinates between your sumstats and reference data
Aligns alleles (EA/NEA) with reference ALT/REF alleles
Calculates Fst values for each variant against all 26 populations
Identifies closest ancestry based on minimum average Fst
Updates metadata in the Sumstats object
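
The matching, alignment, and scoring steps above can be sketched with pandas on toy data. This is a minimal illustration, not GWASLab's internal code: the helper names (`fst_like`, `closest_population`) and the population columns `POP_A` and `POP_B` are hypothetical, the reference frequencies are assumed to refer to the ALT allele, and strand flips are not handled.

```python
import pandas as pd

def fst_like(p1, p2):
    """Fst-like score (H_T - H_S) / H_T for two aligned allele-frequency Series."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_t = (p1 + p2) / 2
    h_t = 2 * p_t * (1 - p_t)
    # The score is undefined where H_T is 0, so those sites become NaN
    return ((h_t - h_s) / h_t).where(h_t > 0)

def closest_population(sumstats, ref, pops):
    # Match variants on CHR:POS
    merged = sumstats.merge(ref, on=["CHR", "POS"], how="inner")
    # Align alleles: EA may be the reference ALT allele (same) or REF allele (flipped)
    same = (merged["EA"] == merged["ALT"]) & (merged["NEA"] == merged["REF"])
    flipped = (merged["EA"] == merged["REF"]) & (merged["NEA"] == merged["ALT"])
    # Express the sumstats frequency in terms of the ALT allele (assumption)
    merged["ALT_AF"] = merged["EAF"].where(same, 1 - merged["EAF"])
    merged = merged[same | flipped]
    # Average the per-variant score for each population, then pick the minimum
    mean_fst = pd.Series(
        {pop: fst_like(merged["ALT_AF"], merged[pop]).mean() for pop in pops}
    )
    return mean_fst.idxmin(), mean_fst.sort_values()

# Toy sumstats: the second variant has EA/NEA swapped relative to ALT/REF
sumstats = pd.DataFrame({
    "CHR": [1, 1, 2, 2],
    "POS": [100, 200, 300, 400],
    "EA": ["A", "C", "G", "T"],
    "NEA": ["G", "T", "A", "C"],
    "EAF": [0.10, 0.80, 0.30, 0.60],
})
# Toy reference with two hypothetical populations
ref = pd.DataFrame({
    "CHR": [1, 1, 2, 2],
    "POS": [100, 200, 300, 400],
    "REF": ["G", "C", "A", "C"],
    "ALT": ["A", "T", "G", "T"],
    "POP_A": [0.12, 0.22, 0.28, 0.58],
    "POP_B": [0.60, 0.70, 0.80, 0.10],
})

best, ranked = closest_population(sumstats, ref, ["POP_A", "POP_B"])
print(best)    # POP_A
print(ranked)  # mean scores, lowest first
```

In this toy example the flipped variant contributes 1 - 0.80 = 0.20 as its ALT frequency, which is why it still matches `POP_A` closely.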

.infer_ancestry()

mysumstats.infer_ancestry(build="19", verbose=True)

Parameters

Parameter	DataType	Description	Default
`build`	`str`	Genome build version. Must be "19" (hg19/GRCh37) or "38" (hg38/GRCh38). If omitted, the build recorded in the Sumstats object is used.	Uses `mysumstats.build`
`ancestry_af`	`str`	Path to custom allele frequency file. If None, uses built-in 1kg_hm3_hg19_eaf or 1kg_hm3_hg38_eaf based on build.	`None`
`verbose`	`bool`	If True, print detailed logs, including Fst values for all populations.	`True`

Required Columns

The function requires the following columns in your sumstats:

CHR: Chromosome
POS: Base pair position
EA: Effect allele
NEA: Non-effect allele
EAF: Effect allele frequency
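
A quick pre-flight check for these columns might look like the sketch below. It assumes the sumstats table is available as a pandas DataFrame; `missing_columns` is a hypothetical helper written for illustration, not part of GWASLab.

```python
import pandas as pd

# Columns listed as required for ancestry inference
REQUIRED_COLUMNS = ["CHR", "POS", "EA", "NEA", "EAF"]

def missing_columns(df):
    """Return the required columns that are absent from the DataFrame."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]

# Toy table that lacks EAF
df = pd.DataFrame({"CHR": [1], "POS": [100], "EA": ["A"], "NEA": ["G"]})
print(missing_columns(df))  # ['EAF']
```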

Build Parameter

The build parameter must match the genome build of your sumstats. If it is not specified, the function uses mysumstats.build.

Return Value

The function returns a string indicating the closest ancestry (e.g., "EUR", "EAS", "AFR") and stores the same code in the Sumstats object metadata at mysumstats.meta["gwaslab"]["inferred_ancestry"].

Reference Data

The function uses 1000 Genomes HapMap3 allele-frequency panels. Resolution order:

Downloaded full panel in ~/.gwaslab/ (if you ran download_ref("1kg_hm3_hg19_eaf") or ..._hg38_eaf)
Builtin core panel shipped with gwaslab (offline, no download)

Builtin core panel (default when not downloaded)

Files: PAN.hapmap3.hg19.EAF.core.tsv.gz and PAN.hapmap3.hg38.EAF.core.tsv.gz (~30,000 variants each)
Combined package size: under 5 MB for both builds
Selection: MAF 0.05–0.95, then top variants by max pairwise Fst among EUR/EAS/AMR/SAS/AFR (same algorithm applied per build)
Sufficient for typical ancestry inference; use the full panel for maximum overlap

Optional full panel (download)

1kg_hm3_hg19_eaf: PAN.hapmap3.hg19.EAF.tsv.gz (~1.2M variants, hg19)
1kg_hm3_hg38_eaf: PAN.hapmap3.hg38.EAF.tsv.gz (~1.2M variants, hg38)

import gwaslab as gl
gl.download_ref("1kg_hm3_hg19_eaf")  # optional; overrides builtin when present

Data format

Tab-separated values with CHR, POS, REF, ALT, 26 subpopulation AF columns, and EUR/EAS/AMR/SAS/AFR superpopulation columns.

Custom Reference Files

You can provide a custom allele frequency file using the ancestry_af parameter. The custom file should:

Be in TSV format (tab-separated)
Contain columns: CHR, POS, REF, ALT, and population-specific allele frequency columns
Match the genome build of your sumstats
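
As a rough sketch, a file with this layout could be written with pandas as shown below. The population column names `POP_A` and `POP_B` are placeholders; this page does not state which column names GWASLab expects in a custom file, so check them against the builtin panel before relying on this.

```python
import os
import tempfile

import pandas as pd

# Toy allele-frequency table: coordinates, alleles, then one column per population
custom_af = pd.DataFrame({
    "CHR": [1, 1],
    "POS": [100, 200],
    "REF": ["G", "C"],
    "ALT": ["A", "T"],
    "POP_A": [0.12, 0.22],  # placeholder population column
    "POP_B": [0.60, 0.70],  # placeholder population column
})

# Write a tab-separated file without the pandas index column
out_path = os.path.join(tempfile.mkdtemp(), "custom_af.tsv")
custom_af.to_csv(out_path, sep="\t", index=False)

# Read it back to confirm the header layout
check = pd.read_csv(out_path, sep="\t")
print(list(check.columns))
```

The resulting path can then be passed as the ancestry_af argument.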

Supported Ancestry Populations

The function compares against 26 populations from the 1000 Genomes Project:

European (EUR): GBR, FIN, IBS, CEU, TSI
East Asian (EAS): CHS, CDX, CHB, JPT, KHV
African (AFR): YRI, LWK, GWD, ESN, MSL, ACB, ASW
South Asian (SAS): GIH, PJL, BEB, STU, ITU
American (AMR): MXL, PUR, CLM, PEL
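
For downstream use, the grouping above can be written as a lookup table. The dictionary names here are illustrative and not GWASLab identifiers.

```python
# Superpopulation -> population codes, copied from the list above
SUPERPOPULATIONS = {
    "EUR": ["GBR", "FIN", "IBS", "CEU", "TSI"],
    "EAS": ["CHS", "CDX", "CHB", "JPT", "KHV"],
    "AFR": ["YRI", "LWK", "GWD", "ESN", "MSL", "ACB", "ASW"],
    "SAS": ["GIH", "PJL", "BEB", "STU", "ITU"],
    "AMR": ["MXL", "PUR", "CLM", "PEL"],
}

# Reverse lookup: population code -> superpopulation
POP_TO_SUPERPOP = {
    pop: superpop
    for superpop, pops in SUPERPOPULATIONS.items()
    for pop in pops
}

print(len(POP_TO_SUPERPOP))    # 26
print(POP_TO_SUPERPOP["JPT"])  # EAS
```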

Examples

Basic usage

# Infer ancestry (build will be taken from mysumstats.build)
mysumstats.infer_ancestry()

# Check the inferred ancestry
print(mysumstats.meta["gwaslab"]["inferred_ancestry"])  # e.g., "EUR"

Specify build explicitly

# Infer ancestry for hg19 data
mysumstats.infer_ancestry(build="19")

# Infer ancestry for hg38 data
mysumstats.infer_ancestry(build="38")

Quiet mode

# Infer ancestry without verbose output
mysumstats.infer_ancestry(build="19", verbose=False)

Custom reference file

# Use a custom allele frequency file
mysumstats.infer_ancestry(build="19", ancestry_af="/path/to/custom_af.tsv")

Output

With verbose=True, the function logs the mean Fst for every superpopulation and population:

Start to infer ancestry based on Fst ...(version)
 -Estimating Fst using 12345 variants...
 -Superpopulation (mean Fst):
 -FST_EAS : 0.002345
 -FST_EUR : 0.003456
 ...
 -Population (mean Fst):
 -FST_JPT : 0.001234
 -FST_CHB : 0.001456
 ...
 -Closest superpopulation: EAS
 -Closest population: JPT
Finished inferring ancestry.

Within each section, groups are sorted by mean Fst (lowest to highest). The returned label is the overall minimum across all superpopulations and populations.

Notes

Fst-like score calculation

GWASLab calculates an \(F_{ST}\)-like allele-frequency differentiation score based on expected heterozygosity.

For two populations with allele frequencies \(p_1\) and \(p_2\):

Within-population expected heterozygosity: \(H_S = \frac{2p_1(1-p_1) + 2p_2(1-p_2)}{2}\)
Mean allele frequency: \(p_T = \frac{p_1 + p_2}{2}\)
Total expected heterozygosity: \(H_T = 2p_T(1-p_T)\)
Fst-like score: \(\frac{H_T - H_S}{H_T}\)

This score reflects how much of the total allele-frequency variation can be attributed to differences between the two populations. In GWASLab, it is used as a simple \(F_{ST}\)-like distance for comparing GWAS summary-statistics allele frequencies with reference population allele frequencies. Lower values indicate greater similarity, and the population with the minimum average score is selected as the closest ancestry match.

This value is referred to as an \(F_{ST}\)-like score rather than a formal \(F_{ST}\) estimate because it is calculated only from allele frequencies in summary statistics and reference panels, without individual-level genotypes or genotype-count-based variance estimation.
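
The formulas above can be checked numerically with a few lines of Python. This is a scalar sketch of the stated equations, not GWASLab's implementation; the function name `fst_like` and the NaN return for monomorphic sites are choices made for this example.

```python
def fst_like(p1, p2):
    """Fst-like score (H_T - H_S) / H_T for two allele frequencies."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # within-population heterozygosity
    p_t = (p1 + p2) / 2                                # mean allele frequency
    h_t = 2 * p_t * (1 - p_t)                          # total heterozygosity
    if h_t == 0:
        # Both frequencies fixed at the same allele: the ratio is undefined
        return float("nan")
    return (h_t - h_s) / h_t

# Identical frequencies: H_S equals H_T, so the score is 0
print(fst_like(0.3, 0.3))

# Strongly differentiated frequencies: H_S = 0.18, H_T = 0.5, score is about 0.64
print(round(fst_like(0.1, 0.9), 6))
```

With p1 = 0.1 and p2 = 0.9, H_S = (0.18 + 0.18) / 2 = 0.18, p_T = 0.5, and H_T = 0.5, giving (0.5 - 0.18) / 0.5 = 0.64.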

EAF Accuracy: The accuracy of ancestry inference depends on the accuracy of EAF values in your sumstats. Inconsistent or mislabeled EAF values may lead to incorrect ancestry inference.
Reference Data: Uses HapMap3 SNPs from the 1000 Genomes Project (see Reference Data section above for details). A builtin core panel ships with gwaslab and works offline; the full panel is used instead once it has been downloaded with download_ref().
Allele Alignment: The function automatically handles allele flipping to ensure proper comparison between sumstats and reference data.
Metadata Storage: The inferred ancestry is stored in the Sumstats object metadata for future reference.
File Format: The reference files are compressed TSV files (.tsv.gz) that are automatically decompressed during loading.

When to Use

When the genetic ancestry of your sumstats is unknown
To verify the reported ancestry of your dataset
As part of quality control workflows to detect potential data issues
Before performing population-specific analyses
To ensure proper matching with reference datasets

See Also

.infer_build(): Infer the genome build of your sumstats
.check_af(): Compare allele frequencies with reference data
.harmonize(): Harmonize alleles with reference data (may be needed before ancestry inference)