Infer Genome Build
GWASLab uses chromosome and base pair position information from built-in Hapmap3 SNPs to infer the reference genome build (GRCh37/hg19 or GRCh38/hg38) for your summary statistics.
How It Works
The function compares the CHR:POS coordinates in your sumstats against Hapmap3 SNP coordinates for both hg19 and hg38. The build with more matching variants is taken as the inferred build. Step by step, the function:
- Loads Hapmap3 reference data for both hg19 and hg38
- Matches CHR:POS coordinates between your sumstats and reference data
- Counts matches for each build version
- Infers the build based on which has more matches
- Updates STATUS codes to reflect the inferred build (if change_status=True)
- Updates metadata in the Sumstats object
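The matching and counting steps above can be illustrated with a minimal, library-independent sketch. It is not GWASLab's implementation: the function name `infer_build_sketch`, the use of plain `(CHR, POS)` tuples, and the toy coordinates are all assumptions made for illustration.

```python
def infer_build_sketch(sumstats_coords, hg19_coords, hg38_coords):
    """Return (build, num_hg19, num_hg38) from CHR:POS match counts.

    build is "19", "38", or None when the two counts are equal.
    All names here are hypothetical, not part of GWASLab.
    """
    observed = set(sumstats_coords)
    # Count sumstats coordinates that also appear in each reference set
    num_hg19 = len(observed & set(hg19_coords))
    num_hg38 = len(observed & set(hg38_coords))
    if num_hg19 > num_hg38:
        build = "19"
    elif num_hg38 > num_hg19:
        build = "38"
    else:
        build = None  # tie: the build cannot be determined
    return build, num_hg19, num_hg38


# Toy reference data: the same three SNPs at made-up positions in each build
hg19_ref = {(1, 1000), (1, 2000), (2, 3000)}
hg38_ref = {(1, 1100), (1, 2100), (2, 3100)}
sumstats = [(1, 1000), (1, 2000), (2, 3100), (3, 500)]

result = infer_build_sketch(sumstats, hg19_ref, hg38_ref)
print(result)  # ('19', 2, 1)
```

Two of the four toy variants match the hg19 positions and one matches hg38, so the sketch reports build "19".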
.infer_build()
Parameters
| Parameter | DataType | Description | Default |
|---|---|---|---|
| verbose | boolean | If True, print detailed logs including match counts for each build | True |
Reliability
The inference is more reliable when your sumstats contain enough variants that overlap with the Hapmap3 reference. The function issues a warning if fewer than 10,000 matching variants are found, because so few matches may make the inference unreliable.
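As a rough sketch of such a reliability check, the snippet below warns when the match count falls under the 10,000 threshold quoted above. The helper name `check_reliability` is hypothetical, and applying the threshold to the larger of the two counts is an assumption; the text does not say which count GWASLab actually tests.

```python
import warnings

MIN_MATCHES = 10_000  # threshold quoted in the text


def check_reliability(num_hg19, num_hg38, min_matches=MIN_MATCHES):
    """Return True if the inference looks reliable, otherwise warn and return False.

    Assumption: the threshold is applied to the larger of the two match counts.
    """
    best = max(num_hg19, num_hg38)
    if best < min_matches:
        warnings.warn(
            f"Only {best} matching variants found (< {min_matches}); "
            "the inferred build may be unreliable."
        )
        return False
    return True


reliable = check_reliability(12345, 5678)
print(reliable)  # True
```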
STATUS Code Updates
The function automatically updates the STATUS code to reflect the inferred build:
- For hg19: Sets digit 1 to "1" (build = 19)
- For hg38: Sets digit 1 to "3" and digit 2 to "8" (build = 38)
This ensures the STATUS code accurately reflects the inferred genome build.
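The digit updates described above can be sketched on a plain string. This is an illustration only: the helper `apply_build_to_status` is hypothetical, and the example assumes a 7-digit STATUS string that starts as "99" (build unknown), so that setting only digit 1 for hg19 yields "19".

```python
def apply_build_to_status(status, build):
    """Return a copy of a STATUS string with the build digits updated.

    Hypothetical helper; follows the rules stated in the text:
    hg19 sets digit 1 to "1", hg38 sets digit 1 to "3" and digit 2 to "8".
    """
    digits = list(status)
    if build == "19":
        digits[0] = "1"  # digit 2 is left as-is (assumed to already be "9")
    elif build == "38":
        digits[0] = "3"
        digits[1] = "8"
    else:
        raise ValueError(f"unsupported build: {build!r}")
    return "".join(digits)


print(apply_build_to_status("9999999", "19"))  # 1999999
print(apply_build_to_status("9999999", "38"))  # 3899999
```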
Return Value
The function updates the Sumstats object in place:
- Updates mysumstats.data with modified STATUS codes
- Updates mysumstats.build with the inferred build ("19" or "38")
- Updates mysumstats.meta["gwaslab"]["genome_build"] with the inferred build
Examples
Basic usage
Output
The function provides detailed logging output:
-Loading Hapmap3 variants data...
-CHR and POS will be used for matching...
-Matching variants for hg19: num_hg19 = 12345
-Matching variants for hg38: num_hg38 = 5678
-Since num_hg19 >> num_hg38, set the genome build to hg19 for the STATUS code....
If fewer than 10,000 matches are found, the log also includes a warning that the inference may be unreliable.
Notes
- Requires CHR and POS columns: The function requires valid chromosome and position information in your sumstats
- Hapmap3 reference: Uses built-in Hapmap3 SNP coordinates for matching
- Automatic metadata update: The inferred build is automatically stored in the Sumstats object's metadata
- STATUS code integration: The STATUS code is automatically updated to reflect the inferred build, which is important for downstream harmonization steps
- Uncertain inference: If the match counts are equal, the function cannot determine the build and will log a message
When to Use
- When the genome build of your sumstats is unknown
- Before performing harmonization or liftover operations
- To verify the genome build of your data
- As part of quality control workflows
Related Functions
- .set_build(): Manually set the genome build
- .liftover(): Convert coordinates between genome builds (requires knowing the source build)