Infer Ancestry
GWASLab infers the genetic ancestry of your summary statistics by comparing effect allele frequencies (EAF) with 1000 Genomes Project reference data using Fst (fixation index) calculations.
How It Works
The function compares the effect allele frequencies from your sumstats against allele frequencies from 26 populations in the 1000 Genomes Project. It calculates Fst values for each population and identifies the population with the minimum average Fst, which indicates the closest genetic ancestry match.
- Loads 1000 Genomes reference data (HapMap3 SNPs) for the specified genome build
- Matches CHR:POS coordinates between your sumstats and reference data
- Aligns alleles (EA/NEA) with reference ALT/REF alleles
- Calculates Fst values for each variant against all 26 populations
- Identifies closest ancestry based on minimum average Fst
- Updates metadata in the Sumstats object
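The steps above can be sketched in plain Python on toy data. This is only an illustration under stated assumptions, not GWASLab's implementation: the helper names (`fst_like`, `closest_population`), the input layout, and the population labels `POP_A`/`POP_B` are hypothetical, and the reference frequencies are assumed to refer to the ALT allele.

```python
def fst_like(p1, p2):
    """Fst-like score (H_T - H_S) / H_T for two allele frequencies."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_t = (p1 + p2) / 2
    h_t = 2 * p_t * (1 - p_t)
    # Assumption: a variant fixed for the same allele in both groups scores 0.
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def closest_population(sumstats, reference):
    """Return (closest label, mean score per population) for toy inputs.

    sumstats:  list of dicts with CHR, POS, EA, NEA, EAF
    reference: dict keyed by (CHR, POS) holding REF, ALT and an "AF" dict
               of ALT-allele frequencies per population (assumed layout)
    """
    scores = {}  # population -> list of per-variant scores
    for row in sumstats:
        ref = reference.get((row["CHR"], row["POS"]))
        if ref is None:
            continue  # no CHR:POS match in the reference
        if (row["EA"], row["NEA"]) == (ref["ALT"], ref["REF"]):
            eaf = row["EAF"]
        elif (row["EA"], row["NEA"]) == (ref["REF"], ref["ALT"]):
            eaf = 1 - row["EAF"]  # flip so the frequency refers to ALT
        else:
            continue  # alleles do not match the reference pair
        for pop, af in ref["AF"].items():
            scores.setdefault(pop, []).append(fst_like(eaf, af))
    means = {pop: sum(v) / len(v) for pop, v in scores.items()}
    if not means:
        return None, means  # nothing overlapped
    return min(means, key=means.get), means


sumstats = [
    {"CHR": 1, "POS": 1000, "EA": "A", "NEA": "G", "EAF": 0.30},
    {"CHR": 1, "POS": 2000, "EA": "C", "NEA": "T", "EAF": 0.80},  # flipped
    {"CHR": 2, "POS": 3000, "EA": "G", "NEA": "A", "EAF": 0.50},  # unmatched
]
reference = {
    (1, 1000): {"REF": "G", "ALT": "A", "AF": {"POP_A": 0.32, "POP_B": 0.70}},
    (1, 2000): {"REF": "C", "ALT": "T", "AF": {"POP_A": 0.18, "POP_B": 0.60}},
}
label, means = closest_population(sumstats, reference)
print(label, means)
```

In this toy input the second variant has EA equal to the reference REF allele, so its frequency is flipped before comparison, and the third variant is dropped because it has no CHR:POS match.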
.infer_ancestry()
Parameters
| Parameter | DataType | Description | Default |
|---|---|---|---|
| build | str | Genome build version. Must be "19" (hg19/GRCh37) or "38" (hg38/GRCh38). Required. | Uses mysumstats.build |
| ancestry_af | str | Path to custom allele frequency file. If None, uses built-in 1kg_hm3_hg19_eaf or 1kg_hm3_hg38_eaf based on build. | None |
| verbose | boolean | If True, print detailed logs including Fst values for all populations | True |
Required Columns
The function requires the following columns in your sumstats:
- CHR: Chromosome
- POS: Base pair position
- EA: Effect allele
- NEA: Non-effect allele
- EAF: Effect allele frequency
Build Parameter
The build must match the genome build of your sumstats. If build is not passed explicitly, the function falls back to mysumstats.build, so make sure that value is set and correct.
Return Value
The function returns a string indicating the closest ancestry (e.g., "EUR", "EAS", "AFR") and updates the Sumstats object metadata:
- Updates mysumstats.meta["gwaslab"]["inferred_ancestry"] with the inferred ancestry code
Reference Data
The function uses 1000 Genomes HapMap3 allele-frequency panels. Resolution order:
- Downloaded full panel in ~/.gwaslab/ (if you ran download_ref("1kg_hm3_hg19_eaf") or ..._hg38_eaf)
- Builtin core panel shipped with gwaslab (offline, no download)
Builtin core panel (default when not downloaded)
- Files: PAN.hapmap3.hg19.EAF.core.tsv.gz and PAN.hapmap3.hg38.EAF.core.tsv.gz (~30,000 variants each)
- Combined package size: under 5 MB for both builds
- Selection: MAF 0.05–0.95, then top variants by max pairwise Fst among EUR/EAS/AMR/SAS/AFR (same algorithm applied per build)
- Sufficient for typical ancestry inference; use the full panel for maximum overlap
Optional full panel (download)
- 1kg_hm3_hg19_eaf: PAN.hapmap3.hg19.EAF.tsv.gz (~1.2M variants, hg19)
- 1kg_hm3_hg38_eaf: PAN.hapmap3.hg38.EAF.tsv.gz (~1.2M variants, hg38)
Data format
Tab-separated values with CHR, POS, REF, ALT, 26 subpopulation AF columns, and EUR/EAS/AMR/SAS/AFR superpopulation columns.
Custom Reference Files
You can provide a custom allele frequency file using the ancestry_af parameter. The custom file should:
- Be in TSV format (tab-separated)
- Contain columns: CHR, POS, REF, ALT, and population-specific allele frequency columns
- Match the genome build of your sumstats
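As a rough pre-flight check for a custom file, the sketch below writes a two-variant toy TSV and verifies that its header has the four coordinate/allele columns plus at least one population column. The helper `check_af_header`, the file name, and the column names `POP_A`/`POP_B` are hypothetical; the exact population column names GWASLab expects are not specified here, so treat this as a layout check only.

```python
import csv
import os
import tempfile

REQUIRED = ["CHR", "POS", "REF", "ALT"]


def check_af_header(path):
    """Return the population columns found in a TSV allele-frequency file."""
    with open(path, "rt", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    pops = [c for c in header if c not in REQUIRED]
    if not pops:
        raise ValueError("no population allele-frequency columns found")
    return pops


# Toy content; POP_A and POP_B are placeholder population columns.
rows = [
    ["CHR", "POS", "REF", "ALT", "POP_A", "POP_B"],
    ["1", "1000", "G", "A", "0.32", "0.70"],
    ["1", "2000", "C", "T", "0.18", "0.60"],
]
toy_path = os.path.join(tempfile.mkdtemp(), "custom_af.tsv")
with open(toy_path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)

population_columns = check_af_header(toy_path)
print(population_columns)
```

A check like this does not confirm the genome build; that still has to match your sumstats.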
Supported Ancestry Populations
The function compares against 26 populations from the 1000 Genomes Project:
European (EUR): GBR, FIN, IBS, CEU, TSI
East Asian (EAS): CHS, CDX, CHB, JPT, KHV
African (AFR): YRI, LWK, GWD, ESN, MSL, ACB, ASW
South Asian (SAS): GIH, PJL, BEB, STU, ITU
American (AMR): MXL, PUR, CLM, PEL
Examples
Basic usage
Specify build explicitly
Quiet mode
Custom reference file
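The four cases listed above can be sketched together as follows. This assumes `mysumstats` is an already-loaded GWASLab Sumstats object containing the required columns, and uses only the parameters documented in the table above (`build`, `ancestry_af`, `verbose`). The wrapper function and the `custom_af_path` argument are hypothetical and exist only to keep the snippet self-contained.

```python
def run_infer_ancestry_examples(mysumstats, custom_af_path):
    """Call .infer_ancestry() in the four documented ways.

    mysumstats:     an existing Sumstats object (assumed, not created here)
    custom_af_path: path to your own allele-frequency TSV (placeholder)
    """
    results = {}

    # Basic usage: the build is taken from mysumstats.build
    results["basic"] = mysumstats.infer_ancestry()

    # Specify build explicitly ("19" or "38")
    results["explicit_build"] = mysumstats.infer_ancestry(build="19")

    # Quiet mode: suppress the per-population Fst log
    results["quiet"] = mysumstats.infer_ancestry(verbose=False)

    # Custom reference file
    results["custom"] = mysumstats.infer_ancestry(ancestry_af=custom_af_path)

    return results
```

In an interactive session you would normally issue just one of these calls directly on your Sumstats object, for example `mysumstats.infer_ancestry(build="19")`.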
Output
The function provides detailed logging output showing Fst values for all populations:
Start to infer ancestry based on Fst ...(version)
-Estimating Fst using 12345 variants...
-Superpopulation (mean Fst):
-FST_EAS : 0.002345
-FST_EUR : 0.003456
...
-Population (mean Fst):
-FST_JPT : 0.001234
-FST_CHB : 0.001456
...
-Closest superpopulation: EAS
-Closest population: JPT
Finished inferring ancestry.
Within each section, groups are sorted by mean Fst (lowest to highest). The returned label is the overall minimum across all superpopulations and populations.
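The sorting and selection rule can be illustrated with the four values shown in the sample log (the entries elided by `...` are left out). This is a minimal sketch of the described behavior, not the library's code; the variable and function names are illustrative.

```python
# Mean Fst values copied from the sample log above (partial list).
superpop_mean_fst = {"EUR": 0.003456, "EAS": 0.002345}
pop_mean_fst = {"CHB": 0.001456, "JPT": 0.001234}


def rank(mean_fst):
    """Sort (label, mean Fst) pairs from lowest to highest."""
    return sorted(mean_fst.items(), key=lambda item: item[1])


ranked_superpops = rank(superpop_mean_fst)
ranked_pops = rank(pop_mean_fst)

# Overall minimum across superpopulations and populations together.
combined = {**superpop_mean_fst, **pop_mean_fst}
overall_label = min(combined, key=combined.get)

print(ranked_superpops[0][0], ranked_pops[0][0], overall_label)
```

With these values the closest superpopulation is EAS and the closest population is JPT; JPT also has the lowest value overall.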
Notes
Fst-like score calculation
GWASLab calculates an \(F_{ST}\)-like allele-frequency differentiation score based on expected heterozygosity.
For two populations with allele frequencies \(p_1\) and \(p_2\):
- Within-population expected heterozygosity: \(H_S = \frac{2p_1(1-p_1) + 2p_2(1-p_2)}{2}\)
- Mean allele frequency: \(p_T = \frac{p_1 + p_2}{2}\)
- Total expected heterozygosity: \(H_T = 2p_T(1-p_T)\)
- Fst-like score: \(\frac{H_T - H_S}{H_T}\)
This score reflects how much of the total allele-frequency variation can be attributed to differences between the two populations. In GWASLab, it is used as a simple \(F_{ST}\)-like distance for comparing GWAS summary-statistics allele frequencies with reference population allele frequencies. Lower values indicate greater similarity, and the population with the minimum average score is selected as the closest ancestry match.
This value is referred to as an \(F_{ST}\)-like score rather than a formal \(F_{ST}\) estimate because it is calculated only from allele frequencies in summary statistics and reference panels, without individual-level genotypes or genotype-count-based variance estimation.
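The formulas above translate directly into a few lines of Python. The function name `fst_like` is illustrative, and returning 0 when \(H_T = 0\) (both groups fixed for the same allele) is an assumption made here to avoid division by zero; the source does not say how that case is handled.

```python
import math


def fst_like(p1, p2):
    """Fst-like score (H_T - H_S) / H_T for allele frequencies p1 and p2."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # within-population
    p_t = (p1 + p2) / 2                                # mean allele frequency
    h_t = 2 * p_t * (1 - p_t)                          # total
    if h_t == 0:
        return 0.0  # assumption: undefined case treated as no differentiation
    return (h_t - h_s) / h_t


# Worked example: p1 = 0.2, p2 = 0.4
# H_S = (0.32 + 0.48) / 2 = 0.40, p_T = 0.3, H_T = 0.42
# score = 0.02 / 0.42, roughly 0.0476
score = fst_like(0.2, 0.4)
print(round(score, 4))
```

Identical frequencies give a score of 0, and the score is symmetric in its two arguments.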
- EAF Accuracy: The accuracy of ancestry inference depends on the accuracy of EAF values in your sumstats. Inconsistent or mislabeled EAF values may lead to incorrect ancestry inference.
- Reference Data: Uses HapMap3 SNPs from the 1000 Genomes Project (see the Reference Data section above for details). The full panel in ~/.gwaslab/ is used if you have downloaded it; otherwise the builtin core panel shipped with gwaslab is used.
- Allele Alignment: The function automatically handles allele flipping to ensure proper comparison between sumstats and reference data.
- Metadata Storage: The inferred ancestry is stored in the Sumstats object metadata for future reference.
- File Format: The reference files are compressed TSV files (.tsv.gz) that are automatically decompressed during loading.
When to Use
- When the genetic ancestry of your sumstats is unknown
- To verify the reported ancestry of your dataset
- As part of quality control workflows to detect potential data issues
- Before performing population-specific analyses
- To ensure proper matching with reference datasets
Related Functions
- .infer_build(): Infer the genome build of your sumstats
- .check_af(): Compare allele frequencies with reference data
- .harmonize(): Harmonize alleles with reference data (may be needed before ancestry inference)