Statistics conversion

GWASLab can convert equivalent statistics, including:

Target stats | Original stats | Implementation
MLOG10P | P | sumstats["MLOG10P"] = -np.log10(sumstats["P"])
P | MLOG10P | sumstats["P"] = np.power(10, -sumstats["MLOG10P"])
P | Z | sumstats["P"] = ss.chi2.sf(sumstats["Z"]**2, 1) (equivalent to a two-sided normal test)
P | CHISQ | sumstats["P"] = ss.chi2.sf(sumstats["CHISQ"], 1)
OR, OR_95L, OR_95U | BETA, SE | sumstats["OR"] = np.exp(sumstats["BETA"]); sumstats["OR_95L"] = np.exp(sumstats["BETA"] - ss.norm.ppf(0.975)*sumstats["SE"]); sumstats["OR_95U"] = np.exp(sumstats["BETA"] + ss.norm.ppf(0.975)*sumstats["SE"])
BETA, SE | OR, OR_95L, OR_95U | sumstats["BETA"] = np.log(sumstats["OR"]); SE from the lower bound: sumstats["SE"] = (np.log(sumstats["OR"]) - np.log(sumstats["OR_95L"]))/ss.norm.ppf(0.975); or SE from the upper bound: sumstats["SE"] = (np.log(sumstats["OR_95U"]) - np.log(sumstats["OR"]))/ss.norm.ppf(0.975)
Z | BETA/SE | sumstats["Z"] = sumstats["BETA"]/sumstats["SE"]
CHISQ | P | sumstats["CHISQ"] = ss.chi2.isf(sumstats["P"], 1)
CHISQ | Z | sumstats["CHISQ"] = (sumstats["Z"])**2
MAF | EAF | sumstats["MAF"] = sumstats["EAF"].apply(lambda x: min(x,1-x) if pd.notnull(x) else np.nan)
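
As a quick sanity check of the table above, the following minimal sketch applies a few of these conversions to a small made-up DataFrame using plain pandas, NumPy and SciPy. The DataFrame `toy` and its values are illustrative assumptions, not real data and not GWASLab internals.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Hypothetical toy summary statistics (values are made up for illustration).
toy = pd.DataFrame({
    "BETA": [0.10, -0.25, 0.02],
    "SE": [0.02, 0.05, 0.03],
    "EAF": [0.30, 0.85, np.nan],
})

# BETA/SE -> Z -> P -> MLOG10P
toy["Z"] = toy["BETA"] / toy["SE"]
toy["P"] = ss.chi2.sf(toy["Z"] ** 2, 1)
toy["MLOG10P"] = -np.log10(toy["P"])

# BETA/SE -> OR with its 95% confidence interval
z975 = ss.norm.ppf(0.975)
toy["OR"] = np.exp(toy["BETA"])
toy["OR_95L"] = np.exp(toy["BETA"] - z975 * toy["SE"])
toy["OR_95U"] = np.exp(toy["BETA"] + z975 * toy["SE"])

# EAF -> MAF; a vectorized alternative to the apply/lambda form (NaN stays NaN)
toy["MAF"] = np.minimum(toy["EAF"], 1 - toy["EAF"])

# The chi-square form of P should match the two-sided normal test.
p_two_sided = 2 * ss.norm.sf(np.abs(toy["Z"]))
print(np.allclose(toy["P"], p_two_sided))
```

Going the other way, SE can be recovered from OR and either confidence bound, which is why the table lists two alternative SE formulas.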

Extreme P values

For extreme P values (P < 1e-308), set extreme=True to overcome float64 precision limitations. MLOG10P is then calculated directly from Z-scores (or BETA/SE) or from CHISQ, without going through P:

mysumstats.fill_data(to_fill=["MLOG10P"], extreme=True)

When extreme=True:

  • Z-scores (or BETA and SE) or CHISQ will be used to calculate MLOG10P directly, bypassing P-value calculation
  • Two additional columns P_MANTISSA and P_EXPONENT will be added to represent P values in scientific notation
  • This allows handling of extremely small P-values that exceed standard floating-point precision

Formulas for Extreme MLOG10P Calculation:

From Z-scores (or BETA/SE):

log_pvalue = log(2) + norm.logsf(|Z|)  # two-sided test
log10_pvalue = log_pvalue / log(10)
MLOG10P = -log10_pvalue
P_MANTISSA = 10^(log10_pvalue mod 1)
P_EXPONENT = floor(log10_pvalue)

Where:

  • norm.logsf(|Z|) is the log survival function of the standard normal distribution
  • log(2) accounts for the two-sided test
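  • P_MANTISSA and P_EXPONENT represent P in scientific notation: P = P_MANTISSA × 10^P_EXPONENT
  • mod is the floored modulo (as in Python's % operator), so log10_pvalue mod 1 lies in [0, 1) and P_MANTISSA lies in [1, 10) even though log10_pvalue is negative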
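
The Z-score formulas above can be sketched with NumPy and SciPy as follows. This is a minimal illustration, not GWASLab's actual implementation; the function name `mlog10p_from_z` and the example Z-scores are assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm

def mlog10p_from_z(z):
    """Return (MLOG10P, P_MANTISSA, P_EXPONENT) for two-sided Z-scores."""
    z = np.asarray(z, dtype=float)
    log_p = np.log(2) + norm.logsf(np.abs(z))  # natural log of the two-sided P
    log10_p = log_p / np.log(10)
    exponent = np.floor(log10_p)
    mantissa = 10 ** (log10_p - exponent)      # same as 10 ** (log10_p % 1)
    return -log10_p, mantissa, exponent.astype(int)

z = np.array([5.0, 40.0])
mlog10p, mantissa, exponent = mlog10p_from_z(z)

# Computing P directly underflows to 0 for Z = 40,
# while the log-space result stays finite.
p_direct = 2 * norm.sf(np.abs(z))
print(p_direct)
print(mlog10p, mantissa, exponent)
```

For moderate Z-scores the log-space result agrees with -log10 of the directly computed P; the two only diverge once P underflows.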

From CHISQ (with degrees of freedom):

log_pvalue = chi2.logsf(CHISQ, df)
log10_pvalue = log_pvalue / log(10)
MLOG10P = -log10_pvalue
P_MANTISSA = 10^(log10_pvalue mod 1)
P_EXPONENT = floor(log10_pvalue)

Where:

  • chi2.logsf(CHISQ, df) is the log survival function of the chi-square distribution
  • df is the degrees of freedom (typically 1 for GWAS)
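
The CHISQ variant can be sketched the same way. The helper name `mlog10p_from_chisq` and the example statistics are assumptions for illustration; the check below only uses moderate values, where the chi-square and Z-score paths can both be evaluated and compared.

```python
import numpy as np
from scipy.stats import chi2, norm

def mlog10p_from_chisq(chisq, df=1):
    """Return (MLOG10P, P_MANTISSA, P_EXPONENT) from chi-square statistics."""
    chisq = np.asarray(chisq, dtype=float)
    log10_p = chi2.logsf(chisq, df) / np.log(10)
    exponent = np.floor(log10_p)
    mantissa = 10 ** (log10_p - exponent)
    return -log10_p, mantissa, exponent.astype(int)

chisq = np.array([25.0, 100.0])
mlog10p, mantissa, exponent = mlog10p_from_chisq(chisq, df=1)

# With df = 1, CHISQ = Z**2, so the result should agree with the
# two-sided normal calculation on Z = sqrt(CHISQ).
z = np.sqrt(chisq)
mlog10p_z = -(np.log(2) + norm.logsf(z)) / np.log(10)
print(np.allclose(mlog10p, mlog10p_z))
```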

Note

The conversion is implemented using scipy and numpy.

  • ss : import scipy.stats as ss
  • np : import numpy as np

See examples here.

fill_data()

mysumstats.fill_data( 
    to_fill=None,
    df=None,
    overwrite=False,
    verbose=True,
    only_sig=False,
    sig_level=5e-8,
    extreme=False
)

Options

Option | DataType | Description | Default
to_fill | str or list | Column name(s) to fill. Valid values: "OR", "OR_95L", "OR_95U", "BETA", "SE", "P", "Z", "CHISQ", "MLOG10P", "MAF", "SIG". Note: "SIG" creates a "SIGNIFICANT" column (boolean) based on P or MLOG10P threshold | None
df | str | Column name containing degrees of freedom for chi-square tests (only used when filling CHISQ) | None
overwrite | boolean | If True, overwrite existing values in target columns | False
verbose | boolean | If True, display progress messages | True
only_sig | boolean | If True, only fill data for significant variants (P < sig_level) | False
sig_level | float | Significance threshold for P-value filtering (used when only_sig=True or when filling SIG column) | 5e-8
extreme | boolean | If True, use extreme value calculations for MLOG10P (helpful when P < 1e-300) | False

Conversion Priority

GWASLab uses the following priority order when multiple source columns are available:

  • For P: MLOG10P, then Z, then CHISQ
  • For MLOG10P: P, then Z, then CHISQ (or BETA/SE if extreme=True)
  • For BETA/SE: OR/OR_95L/OR_95U
  • For OR/OR_95L/OR_95U: BETA/SE
  • For Z: BETA/SE
  • For CHISQ: Z, then P
  • For MAF: EAF (MAF = min(EAF, 1-EAF))
  • For SIG: P or MLOG10P (creates SIGNIFICANT column: True if P < sig_level or MLOG10P > -log10(sig_level))
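
One way to picture this priority order is as an ordered lookup: for each target, try the candidate source columns in order and take the first group that is fully available. The sketch below is purely illustrative; `PRIORITY` and `pick_source` are hypothetical names and do not reflect GWASLab's internal data structures.

```python
# Hypothetical priority table mirroring part of the list above
# (not GWASLab's internal representation).
PRIORITY = {
    "P": [("MLOG10P",), ("Z",), ("CHISQ",)],
    "MLOG10P": [("P",), ("Z",), ("CHISQ",)],
    "Z": [("BETA", "SE")],
    "CHISQ": [("Z",), ("P",)],
    "MAF": [("EAF",)],
}

def pick_source(target, available):
    """Return the first source group whose columns are all available, else None."""
    for group in PRIORITY.get(target, []):
        if all(col in available for col in group):
            return group
    return None

print(pick_source("P", {"Z", "CHISQ"}))  # Z is preferred over CHISQ
print(pick_source("CHISQ", {"P"}))       # falls back to P when Z is absent
print(pick_source("Z", {"BETA"}))        # SE is missing, so nothing qualifies
```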

Iterative Filling Process

The function performs iterative filling in multiple rounds:

  • Round 1: Attempts to fill all requested columns using available source data
  • Subsequent rounds: Newly filled columns may enable additional conversions
  • The process continues until all columns are filled or no further progress can be made
  • This allows complex conversions like: BETA/SE → Z → P → MLOG10P
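
The round-based idea can be sketched as a loop that keeps trying the pending targets until a full round makes no progress. This is a simplified illustration under stated assumptions: `RULES` and `fill_iteratively` are hypothetical names, only three conversions are covered, and every intermediate column must be requested explicitly, unlike the temporary intermediate columns that GWASLab creates and removes.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Hypothetical rules: target -> (required source columns, conversion function).
RULES = {
    "Z": (["BETA", "SE"], lambda d: d["BETA"] / d["SE"]),
    "P": (["Z"], lambda d: ss.chi2.sf(d["Z"] ** 2, 1)),
    "MLOG10P": (["P"], lambda d: -np.log10(d["P"])),
}

def fill_iteratively(data, to_fill):
    """Fill targets round by round; stop when done or when a round fills nothing."""
    data = data.copy()
    pending = [c for c in to_fill if c not in data.columns]
    rounds = 0
    while pending:
        filled = []
        for target in pending:
            sources, convert = RULES[target]
            if all(s in data.columns for s in sources):
                data[target] = convert(data)
                filled.append(target)
        if not filled:
            break  # no further progress is possible
        pending = [c for c in pending if c not in filled]
        rounds += 1
    return data, pending, rounds

toy = pd.DataFrame({"BETA": [0.1], "SE": [0.02]})
out, pending, rounds = fill_iteratively(toy, ["MLOG10P", "P", "Z"])
print(list(out.columns), pending, rounds)
```

With the targets listed in this order, each round unlocks exactly one further column, so the chain BETA/SE → Z → P → MLOG10P takes three rounds.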

Column Handling

  • If a target column already exists and overwrite=False, it will be skipped
  • When extreme=True, MLOG10P is calculated from Z-scores (or BETA/SE) to handle P-values below float64 precision limits
  • Intermediate columns created during conversion (but not requested) are automatically removed

Examples

Basic conversion

# Raw data with BETA, SE, P
# SNPID         CHR POS     EA  NEA EAF     BETA    SE      P       STATUS
# 1:725932_G_A  1   725932  G   A   0.9960  -0.0737 0.1394  0.5970  9999999

# Fill missing statistics
# GWASLab will automatically search for equivalent statistics
mysumstats.fill_data(to_fill=["MLOG10P", "Z", "OR", "OR_95L", "OR_95U"])

# Output:
# Start filling data using existing columns...
#  -Raw input columns:  ['SNPID', 'CHR', 'POS', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'STATUS']
#  -Overwrite mode:  False
#  -Skipping columns:  []
# Filling columns:  ['MLOG10P', 'OR', 'OR_95L', 'OR_95U', 'Z']
#   - Filling OR using BETA column...
#   - Filling OR_95L/OR_95U using BETA/SE columns...
#   - Filling MLOG10P using P column...
#   - Filling Z using BETA/SE columns...
# Finished filling data using existing columns.

Fill only significant variants

# Fill statistics only for variants with P < 5e-8
mysumstats.fill_data(
    to_fill=["MLOG10P", "Z"],
    only_sig=True,
    sig_level=5e-8
)

Overwrite existing columns

# Overwrite existing MLOG10P values
mysumstats.fill_data(
    to_fill=["MLOG10P"],
    overwrite=True
)

Fill single column

# Fill a single column (can pass string instead of list)
mysumstats.fill_data(to_fill="Z")

Fill MAF from EAF

# Calculate MAF from EAF
mysumstats.fill_data(to_fill=["MAF"])
# MAF = min(EAF, 1 - EAF)

Extreme P values

# Handle extreme P values using Z-scores
mysumstats.fill_data(
    to_fill=["MLOG10P"],
    extreme=True
)
# Creates P_MANTISSA and P_EXPONENT columns
# MLOG10P is calculated from Z-scores (or BETA/SE) to avoid precision loss

Fill SIGNIFICANT column

# Create a SIGNIFICANT column based on P-value threshold
mysumstats.fill_data(
    to_fill=["SIG"],
    sig_level=5e-8
)
# Creates SIGNIFICANT column: True if P < 5e-8 or MLOG10P > 7.3
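
The two thresholds are the same test on different scales, since -log10(5e-8) ≈ 7.30. The following minimal sketch checks this on a made-up DataFrame with plain pandas and NumPy; `toy` and its values are illustrative assumptions and the code does not call GWASLab.

```python
import numpy as np
import pandas as pd

sig_level = 5e-8

# Hypothetical P values, chosen away from the threshold itself.
toy = pd.DataFrame({"P": [1e-9, 1e-7, 3e-4]})
toy["MLOG10P"] = -np.log10(toy["P"])

sig_from_p = toy["P"] < sig_level
sig_from_mlog10p = toy["MLOG10P"] > -np.log10(sig_level)

print(-np.log10(sig_level))  # approximately 7.30103
print(sig_from_p.tolist(), sig_from_mlog10p.tolist())
```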