Checking if Lead Variants are Novel
GWASLab can check whether the lead variants in your summary statistics overlap with previously reported variants, based on physical distance.
.get_novel()
mysumstats.get_novel(
    known=None,
    efo=None,
    only_novel=False,
    windowsizekb_for_novel=1000,
    windowsizekb=500,
    sig_level=5e-8,
    if_get_lead=True,
    group_key=None,
    use_p=False,
    anno=False,
    wc_correction=False,
    use_cache=True,
    cache_dir="./",
    build="19",
    source="ensembl",
    gwascatalog_source="NCBI",
    output_known=False,
    verbose=True
)
GWASLab checks for overlap against either a local file of reported variants or records in the GWAS Catalog.
Required Parameters
Either known or efo must be provided:
- known: string or pandas.DataFrame, path to the local file of reported variants or a DataFrame containing known variants with CHR/POS columns
- efo: string or list, EFO ID, MONDO ID, or trait name for the target trait, which is used for querying the GWAS Catalog API v2.
Examples:
- EFO ID: "EFO_0001360" (type 2 diabetes)
- MONDO ID: "MONDO_0005148" (type 2 diabetes)
- Trait name: "type 2 diabetes mellitus"
- Multiple traits: ["EFO_0001360", "EFO_0001361"]
Parameters
| Parameter | DataType | Description | Default |
|---|---|---|---|
| known | string or DataFrame | Path to local file of reported variants or DataFrame with CHR/POS columns | None |
| efo | string or list | EFO ID, MONDO ID, or trait name(s) for querying GWAS Catalog | None |
| only_novel | boolean | If True, output only novel variants | False |
| windowsizekb_for_novel | int | Window size (kb) for determining if lead variants overlap with reported variants in GWAS Catalog | 1000 |
| windowsizekb | int | Window size (kb) for lead variant extraction | 500 |
| sig_level | float | P value threshold for lead variant extraction | 5e-8 |
| if_get_lead | boolean | If True, first extract lead variants using .get_lead() | True |
| group_key | string | Column name for grouping variants (e.g., trait/phenotype ID) | None |
| use_p | boolean | If True, use P values instead of MLOG10P for extraction | False |
| anno | boolean | If True, annotate variants with gene information | False |
| wc_correction | boolean | If True, apply Winner's Curse correction to effect sizes | False |
| use_cache | boolean | If True, use cached GWAS catalog data | True |
| cache_dir | string | Directory for caching downloaded GWAS catalog data | "./" |
| build | "19" or "38" | Genome build version "19" (GRCh37/hg19) or "38" (GRCh38/hg38) | "19" |
| source | "ensembl" or "refseq" | Database source for gene annotation | "ensembl" |
| gwascatalog_source | "NCBI" or "EBI" | Source for GWAS catalog data | "NCBI" |
| output_known | boolean | If True, additionally output the reported variants | False |
| verbose | boolean | If True, print logs | True |
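To make the roles of sig_level and windowsizekb more concrete, here is a minimal pure-Python sketch of a greedy, distance-based lead variant selection. It is not GWASLab's .get_lead() implementation; the function name `extract_leads`, the list-of-dicts input, the strict `P < sig_level` comparison, and the inclusive window boundary are all assumptions made for illustration.

```python
def extract_leads(variants, sig_level=5e-8, windowsizekb=500):
    """Greedy sketch: keep the most significant variant in each window.

    `variants` is a list of dicts with SNPID, CHR, POS and P keys
    (a hypothetical stand-in for a sumstats table).
    """
    window_bp = windowsizekb * 1000

    # Keep only variants passing the significance threshold
    # (strict comparison is an assumption of this sketch).
    significant = [v for v in variants if v["P"] < sig_level]

    # Visit variants from most to least significant.
    significant.sort(key=lambda v: v["P"])

    leads = []
    for variant in significant:
        # Skip a variant if a stronger lead on the same chromosome
        # already lies within the window.
        near_existing_lead = any(
            variant["CHR"] == lead["CHR"]
            and abs(variant["POS"] - lead["POS"]) <= window_bp
            for lead in leads
        )
        if not near_existing_lead:
            leads.append(variant)

    # Return leads in genomic order.
    return sorted(leads, key=lambda v: (v["CHR"], v["POS"]))


toy_variants = [
    {"SNPID": "rs1", "CHR": 1, "POS": 1_000_000, "P": 1e-10},
    {"SNPID": "rs2", "CHR": 1, "POS": 1_200_000, "P": 1e-12},
    {"SNPID": "rs3", "CHR": 1, "POS": 2_500_000, "P": 3e-9},
    {"SNPID": "rs4", "CHR": 2, "POS": 500_000, "P": 1e-7},
    {"SNPID": "rs5", "CHR": 2, "POS": 900_000, "P": 2e-8},
]
lead_ids = [v["SNPID"] for v in extract_leads(toy_variants)]
```

In this toy input, rs1 is dropped because the stronger rs2 lies 200 kb away, and rs4 is dropped because it does not pass sig_level.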
Return Value
Returns a pandas.DataFrame containing variants with novelty status and metadata:
- NOVEL: Boolean indicating novelty status
- DISTANCE_TO_KNOWN: Distance to nearest known variant (in base pairs)
- LOCATION_OF_KNOWN: Relative position to known variant
- KNOWN_ID: ID of matching known variant
- Additional metadata from known variants (if available)
EFO ID
You can find the EFO ID by searching for the trait in the GWAS Catalog, for example by searching for type 2 diabetes (T2D).

EFO ID, MONDO ID, or Trait Name
The efo parameter now supports multiple formats:
- EFO ID: "EFO_0001360" (e.g., for type 2 diabetes)
- MONDO ID: "MONDO_0005148" (e.g., for type 2 diabetes)
- Trait name: "type 2 diabetes mellitus" (e.g., for type 2 diabetes)
- Multiple traits: ["EFO_0001360", "EFO_0001361"] (list of trait identifiers)
The function automatically handles MONDO to EFO conversion and trait name lookups. If a MONDO ID is provided, it will attempt to find the corresponding EFO ID. If that fails, it will try using the trait name.
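The accepted formats can be told apart by their prefix. The sketch below shows one way to classify each entry of an efo argument before querying; the helper names `classify_trait_query` and `normalize_efo_argument` and the regular expressions are hypothetical and only illustrate the idea, they are not part of GWASLab.

```python
import re

# Assumed identifier shapes: a prefix, an underscore, then digits.
EFO_PATTERN = re.compile(r"^EFO_\d+$")
MONDO_PATTERN = re.compile(r"^MONDO_\d+$")


def classify_trait_query(value):
    """Label a single query string as an EFO ID, MONDO ID, or trait name."""
    value = value.strip()
    if EFO_PATTERN.match(value):
        return "efo_id"
    if MONDO_PATTERN.match(value):
        return "mondo_id"
    # Anything else is treated as a free-text trait name.
    return "trait_name"


def normalize_efo_argument(efo):
    """Accept a string or a list of strings and return (query, kind) pairs."""
    items = [efo] if isinstance(efo, str) else list(efo)
    return [(item.strip(), classify_trait_query(item)) for item in items]


single = normalize_efo_argument("MONDO_0005148")
mixed = normalize_efo_argument(["EFO_0001360", "type 2 diabetes mellitus"])
```

A caller could then route "mondo_id" entries through a MONDO to EFO lookup and "trait_name" entries through a name search, mirroring the fallback order described above.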
Genome Build
When using the GWAS Catalog (efo parameter), make sure the build parameter matches the genome build of your sumstats. GWAS Catalog data is available for both GRCh37 (build="19") and GRCh38 (build="38").
GWAS Catalog Trait Associations
Only associations with the specified EFO trait will be obtained. This does not include associations with child traits.
Examples
Check novelty using GWAS Catalog (EFO ID)
Check novelty using GWAS Catalog (trait name)
Check novelty using local file
Check novelty using DataFrame
Get only novel variants
Output known variants as well
Skip lead variant extraction
Multiple traits
Custom cache directory
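Each scenario listed above corresponds to a small set of keyword arguments for get_novel(). The sketch below collects plausible arguments for most of them and checks them against the documented signature without calling GWASLab itself. The file path "known_variants.tsv", the directory "./gwascatalog_cache", and the helpers `check_kwargs` and `format_call` are hypothetical; on a loaded Sumstats object the actual call would be mysumstats.get_novel(**kwargs). The DataFrame scenario is the same as the local-file one, with a DataFrame holding CHR/POS columns passed as known.

```python
# Parameter names copied from the documented signature.
DOCUMENTED_PARAMS = {
    "known", "efo", "only_novel", "windowsizekb_for_novel", "windowsizekb",
    "sig_level", "if_get_lead", "group_key", "use_p", "anno",
    "wc_correction", "use_cache", "cache_dir", "build", "source",
    "gwascatalog_source", "output_known", "verbose",
}

# Illustrative keyword arguments for each scenario (values are examples only).
EXAMPLE_KWARGS = {
    "GWAS Catalog (EFO ID)": {"efo": "EFO_0001360"},
    "GWAS Catalog (trait name)": {"efo": "type 2 diabetes mellitus"},
    "Local file": {"known": "known_variants.tsv"},  # hypothetical path
    "Only novel variants": {"efo": "EFO_0001360", "only_novel": True},
    "Output known variants": {"efo": "EFO_0001360", "output_known": True},
    "Skip lead extraction": {"efo": "EFO_0001360", "if_get_lead": False},
    "Multiple traits": {"efo": ["EFO_0001360", "EFO_0001361"]},
    "Custom cache directory": {
        "efo": "EFO_0001360",
        "cache_dir": "./gwascatalog_cache",  # hypothetical directory
    },
}


def check_kwargs(kwargs):
    """Raise ValueError if the arguments do not fit the documented signature."""
    unknown = set(kwargs) - DOCUMENTED_PARAMS
    if unknown:
        raise ValueError(f"Undocumented parameters: {sorted(unknown)}")
    if kwargs.get("known") is None and kwargs.get("efo") is None:
        raise ValueError("Either known or efo must be provided")


def format_call(kwargs):
    """Render the call as it would be written on a Sumstats object."""
    args = ", ".join(f"{name}={value!r}" for name, value in kwargs.items())
    return f"mysumstats.get_novel({args})"


for kwargs in EXAMPLE_KWARGS.values():
    check_kwargs(kwargs)

rendered = {name: format_call(kw) for name, kw in EXAMPLE_KWARGS.items()}
```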
Notes
- The function first extracts lead variants (unless if_get_lead=False) and then checks their novelty.
- Novelty is determined based on physical distance: variants within windowsizekb_for_novel kb of a known variant are considered "known".
- GWAS Catalog queries require an internet connection and may take some time depending on the trait.
- Cached GWAS Catalog data is stored in cache_dir and reused in subsequent runs (if use_cache=True).
- The function supports both GRCh37 (build="19") and GRCh38 (build="38") genome builds.
- When using group_key, variants are compared within groups, which is useful for multi-trait analyses.
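The distance rule in the notes above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not GWASLab's implementation: the function name `annotate_novelty` and the list-of-dicts inputs are hypothetical, the distance is reported as an absolute value (the real DISTANCE_TO_KNOWN column may be signed), a variant exactly at the window edge is treated as known, and LOCATION_OF_KNOWN is omitted.

```python
def annotate_novelty(leads, known, windowsizekb_for_novel=1000, only_novel=False):
    """Flag each lead variant as novel or known by distance to known variants.

    `leads` and `known` are lists of dicts with SNPID, CHR and POS keys
    (hypothetical stand-ins for the lead and reported variant tables).
    """
    window_bp = windowsizekb_for_novel * 1000
    annotated = []
    for lead in leads:
        row = dict(lead)
        # Only known variants on the same chromosome can be "nearby".
        candidates = [k for k in known if k["CHR"] == lead["CHR"]]
        if candidates:
            nearest = min(candidates, key=lambda k: abs(k["POS"] - lead["POS"]))
            distance = abs(nearest["POS"] - lead["POS"])
            row["DISTANCE_TO_KNOWN"] = distance
            row["KNOWN_ID"] = nearest["SNPID"]
            # Assumption: a variant exactly window_bp away counts as known.
            row["NOVEL"] = distance > window_bp
        else:
            # No reported variant on this chromosome at all.
            row["DISTANCE_TO_KNOWN"] = None
            row["KNOWN_ID"] = None
            row["NOVEL"] = True
        annotated.append(row)
    if only_novel:
        annotated = [row for row in annotated if row["NOVEL"]]
    return annotated


toy_leads = [
    {"SNPID": "rs2", "CHR": 1, "POS": 1_200_000},
    {"SNPID": "rs3", "CHR": 1, "POS": 2_600_000},
    {"SNPID": "rs5", "CHR": 2, "POS": 900_000},
]
toy_known = [
    {"SNPID": "rsA", "CHR": 1, "POS": 1_500_000},
    {"SNPID": "rsB", "CHR": 3, "POS": 100},
]
result = annotate_novelty(toy_leads, toy_known)
novel_only = annotate_novelty(toy_leads, toy_known, only_novel=True)
```

Here rs2 lies 300 kb from rsA and is flagged as known, while rs3 (1.1 Mb from rsA) and rs5 (no reported variant on chromosome 2) are flagged as novel. To mimic group_key, the same function could be applied separately to the leads and known variants of each group.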
Reference
- Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., ... & Parkinson, H. (2019). The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1), D1005-D1012.