Assigning CHR/POS
GWASLab can update CHR/POS using a pre-processed rsID-to-CHR:POS mapping table stored in HDF5 format. The HDF5-based approach provides optimized performance for processing large datasets with parallel processing capabilities.
Note on .rsid_to_chrpos()
The old TSV-based implementation of .rsid_to_chrpos() has been removed due to poor performance. Please use .rsid_to_chrpos2() with HDF5 files instead, which provides significantly better performance for large datasets.
.rsid_to_chrpos2()
Available since v3.4.31
.rsid_to_chrpos2() is a high-performance function to assign CHR and POS based on rsID using a HDF5 file derived from dbSNP reference VCF files. This method is optimized for large datasets and uses parallel processing for fast lookups.
How it works
The function uses pre-processed HDF5 files (one per chromosome) that contain rsID-to-POS mappings organized into 10 groups based on modulo 10 grouping (rsID % 10). This grouping strategy:
- Creates approximately equal-sized groups for balanced data distribution
- Enables efficient parallel lookups by processing chromosome-group combinations independently
- Minimizes memory usage by loading only relevant groups into memory
- Supports fast random access to specific rsID ranges using pandas index operations
- Separate files per chromosome allow true parallel writes without locking overhead
- CHR not stored - assigned from chromosome number (extracted from filename or provided parameter)
- Index-based matching - uses pandas index intersection for fast O(1) average-case lookups
- Smart processing - if CHR column is available, only processes relevant chromosome files (much faster)
Reference VCF from dbSNP
First, download reference VCF file from dbSNP ftp site.
Requirements:
- The VCF file must be indexed (
.tbifor bgzip-compressed VCF, or.csifor BCF format) bcftoolsmust be installed and available in your PATH - the function usesbcftools queryto extract data from VCF files- Install bcftools:
conda install -c bioconda bcftoolsor download from https://www.htslib.org/download/ - Verify installation:
bcftools --version
Example output:
bcftools 1.19
Using htslib 1.19
Copyright (C) 2023 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
gl.process_vcf_to_hfd5()
From gwaslab v3.4.42, process_ref_vcf() was renamed to process_vcf_to_hfd5()
Prerequisites:
bcftoolsmust be installed - this function usesbcftools queryto extract POS and rsID data from VCF files- Install:
conda install -c bioconda bcftoolsor from https://www.htslib.org/download/ - Verify:
bcftools --version
Example output:
bcftools 1.19
Using htslib 1.19
Copyright (C) 2023 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Process the VCF file and convert it to HDF5 files (one per chromosome) using .process_vcf_to_hfd5(). This step may take 1-3 hours for a 30GB VCF file, depending on your system. Each chromosome is written to a separate HDF5 file for maximum parallel processing speed.
| Option | DataType | Description | Default |
|---|---|---|---|
vcf |
string |
the path to dbSNP VCF file (must be indexed with .tbi/.csi) | - |
directory |
string |
the directory where you want output the converted HDF5 file | the same as VCF file |
chr_dict |
dict |
dictionary for chromosome mapping (e.g., NC_000001.10 -> 1) | None |
complevel |
int |
compression level for HDF5 (0-9, higher = more compression) | 3 |
threads |
int |
number of threads for parallel processing | 1 |
chr_list |
list |
list of chromosomes to process (None = all 1-25) | None |
overwrite |
bool |
if True, overwrite existing HDF5 files; if False, skip existing files | False |
Example:
directory="/home/yunye/work/gwaslab/examples/vcf_hd5/"
vcf = "/home/yunye/CommonData/Reference/ncbi_dbsnp/ncbi_dbsnp/db155/GCF_000001405.25.gz"
gl.process_vcf_to_hfd5(vcf=vcf,
directory=directory,
chr_dict=gl.get_NC_to_number(build="19"),
complevel=3,
threads=6)
# To overwrite existing HDF5 files, set overwrite=True:
# gl.process_vcf_to_hfd5(vcf=vcf,
# directory=directory,
# chr_dict=gl.get_NC_to_number(build="19"),
# complevel=3,
# threads=6,
# overwrite=True)
2025/12/24 14:35:36 Start to process VCF file to HDF5 using bcftools:
2025/12/24 14:35:36 -Reference VCF path: /home/yunye/CommonData/dbsnp/GCF_000001405.25.gz
2025/12/24 14:35:36 -Grouping method: modulo 10 (creating 10 groups per chromosome)
2025/12/24 14:35:36 -Output format: separate HDF5 file per chromosome (no locking, true parallel writes)
2025/12/24 14:35:36 -Compression level: 3
2025/12/24 14:35:36 -Number of threads: 6
2025/12/24 14:35:36 -Overwrite existing files: False
2025/12/24 14:35:36 -HDF5 Output directory: /home/yunye/CommonData/dbsnp
2025/12/24 14:35:36 -HDF5 File pattern: GCF_000001405.25.chr{chr_num}.rsID_CHR_POS_mod10.h5
2025/12/24 14:35:36 -Processing chromosomes in parallel...
HDF5 File Structure
The generated HDF5 files are organized as one file per chromosome, each containing 10 groups (group_0 through group_9):
File Organization:
- One HDF5 file per chromosome:
{vcf_file_name}.chr{chr_num}.rsID_CHR_POS_mod10.h5 - Example:
GCF_000001405.25.chr1.rsID_CHR_POS_mod10.h5,GCF_000001405.25.chr2.rsID_CHR_POS_mod10.h5, etc. - CHR is not stored in the HDF5 file - it's encoded in the filename (e.g.,
chr1→ CHR = 1)
Group Organization:
- Groups are named
group_0,group_1, ...,group_9 - Each rsID is assigned to a group based on:
group_id = rsID_number % 10 - For example:
rs123456789→789 % 10 = 9→ stored ingroup_9
Data Structure within each group:
Each group contains a DataFrame with rsn as index and POS as column (CHR is not stored):
| Structure | DataType | Description | Example |
|---|---|---|---|
Index: rsn |
int64 |
rsID number (without "rs" prefix) - used as index for fast matching | 123456789 |
Column: POS |
int32 |
Base pair position | 10177 |
Note: The rsn column is stored as the DataFrame index, enabling fast index-based matching using pandas index intersection operations.
Example of data storage:
HDF5 Directory: /path/to/dbsnp/
├── GCF_000001405.25.chr1.rsID_CHR_POS_mod10.h5
│ ├── group_0
│ │ └── DataFrame: rsn (int64) as index, POS (int32) as column
│ │ Example rows:
│ │ POS
│ │ rsn
│ │ 100 10177
│ │ 200 10235
│ │ 300 10352
│ │ ...
│ │
│ ├── group_1
│ │ └── DataFrame: rsn (int64) as index, POS (int32) as column
│ │ ...
│ │
│ └── group_9
│ └── DataFrame: rsn (int64) as index, POS (int32) as column
│ Example rows:
│ POS
│ rsn
│ 99 10616
│ 199 10789
│ ...
│
├── GCF_000001405.25.chr2.rsID_CHR_POS_mod10.h5
│ └── (same structure with chromosome 2 data)
│
└── ... (one file per chromosome)
Accessing HDF5 data directly:
You can inspect the HDF5 file structure using pandas:
import pandas as pd
import glob
import os
# Find all chromosome HDF5 files
h5_dir = "/path/to/dbsnp/"
h5_files = glob.glob(os.path.join(h5_dir, "*.chr*.rsID_CHR_POS_mod10.h5"))
# Access a specific chromosome file (e.g., chromosome 1)
chr1_file = [f for f in h5_files if '.chr1.' in f][0]
with pd.HDFStore(chr1_file, mode='r') as store:
# List all groups
print("Available groups:", store.keys())
# Access a specific group
group_0 = store['group_0']
print("\nGroup 0 structure:")
print(group_0.head())
print(f"\nGroup 0 shape: {group_0.shape}")
print(f"Group 0 index name: {group_0.index.name}")
print(f"Group 0 dtypes:\n{group_0.dtypes}")
print("\nNote: CHR is not stored - extract from filename")
print("Note: rsn is stored as index for fast matching")
# Extract CHR from filename
import re
match = re.search(r'\.chr(\d+)\.', chr1_file)
if match:
chr_num = int(match.group(1))
print(f"CHR from filename: {chr_num}")
# Search for a specific rsID (e.g., rs123456789) using index
rsn_to_find = 123456789
group_id = rsn_to_find % 10 # = 9
group_9 = store[f'group_{group_id}']
# Use index-based lookup (faster than column filtering)
if rsn_to_find in group_9.index:
result = group_9.loc[[rsn_to_find]]
print(f"\nFound rs{rsn_to_find} in chromosome {chr_num}:")
print(result)
print(f"CHR: {chr_num} (from filename)")
print(f"POS: {result['POS'].iloc[0]}")
Storage optimization:
The HDF5 format uses optimized data types to minimize storage:
- int32 for POS (4 bytes per value, range: up to ~2.1 billion)
- int64 for rsn (8 bytes per value, sufficient for all rsID numbers)
- CHR not stored - encoded in filename (saves 1 byte per variant)
This results in approximately 12 bytes per variant (8 + 4), plus HDF5 overhead and compression. With compression level 3, a typical dbSNP v155 file (~1 billion variants) may compress to ~9-14 GB. The separate files per chromosome also enable better parallel processing and reduce file size per chromosome.
.rsid_to_chrpos2()
Options
| Option | DataType | Description | Default |
|---|---|---|---|
path |
string |
the path to the HDF5 directory (containing chromosome-specific files) | - |
ref_rsid_to_chrpos_hdf5 |
string |
Path to HDF5 directory/file (takes precedence over VCF path) | None |
ref_rsid_to_chrpos_vcf |
string |
Path to VCF file (auto-generates HDF5 path from VCF location) | None |
threads |
int |
number of threads to use for parallel processing | 4 |
build |
string |
genome build version for CHR and POS (e.g., "19", "38") | "99" |
rsid |
string |
column name containing rsID values | "rsID" |
chrom |
string |
column name for chromosome values to be updated | "CHR" |
pos |
string |
column name for position values to be updated | "POS" |
status |
string |
column name for status codes | "STATUS" |
verbose |
bool |
if True, print progress messages | True |
Processing Strategy:
The function automatically chooses the optimal processing strategy based on your data:
-
If CHR column has values: Processes by chromosome first (much faster)
-
Groups data by chromosome, then by group (modulo 10)
- Only processes relevant chromosome HDF5 files
-
Example: If data has CHR=1, only checks
*.chr1.rsID_CHR_POS_mod10.h5 -
If CHR column is missing/empty: Searches across all chromosomes
- Groups by group (modulo 10), then tries all chromosome files
- Automatically deduplicates results (keeps best match per variant)
- Slower but necessary when CHR is unknown
Example
# Load summary statistics
mysumstats = gl.Sumstats("my_sumstats.txt")
# Before: missing CHR and POS
mysumstats.data[['rsID', 'CHR', 'POS']].head()
# Output:
# rsID CHR POS
# 0 rs123456 NaN NaN
# 1 rs234567 NaN NaN
# 2 rs345678 NaN NaN
# 3 rs456789 NaN NaN
# 4 rs567890 NaN NaN
# Option 1: Direct path to HDF5 directory (contains chromosome-specific files)
mysumstats.rsid_to_chrpos2(
path="/home/yunye/work/gwaslab/examples/vcf_hd5/",
threads=6
)
# Option 2: Using ref_rsid_to_chrpos_hdf5 parameter
mysumstats.rsid_to_chrpos2(
ref_rsid_to_chrpos_hdf5="/home/yunye/work/gwaslab/examples/vcf_hd5/",
threads=6
)
# Option 3: Using VCF path (auto-generates HDF5 path from VCF location)
mysumstats.rsid_to_chrpos2(
ref_rsid_to_chrpos_vcf="/path/to/dbsnp/GCF_000001405.25.gz",
threads=6
)
# After: CHR and POS assigned
mysumstats.data[['rsID', 'CHR', 'POS']].head()
# Output:
# rsID CHR POS
# 0 rs123456 1 10177
# 1 rs234567 1 10235
# 2 rs345678 1 10352
# 3 rs456789 1 10505
# 4 rs567890 1 10511
Processing workflow:
- rsID preprocessing: Extracts numeric rsID values (strips first 2 characters) and identifies valid rsIDs
- HDF5 file discovery: Finds all chromosome-specific HDF5 files in the directory (e.g.,
*.chr*.**rsID**_**CHR**_**POS**_mod10.h5) -
Processing strategy:
-
If CHR column available: Groups data by chromosome first, then by group (modulo 10), processes each chromosome-group combination
- If CHR column not available: Groups by group, searches across all chromosome files, then deduplicates results
- Group assignment: Each rsID is assigned to one of 10 groups based on
rsID % 10 - Parallel lookup: Multiple threads process tasks in parallel (each task = chromosome + group combination)
- Index-based matching: Uses pandas index intersection for fast matching (rsn as index in both sumstats and reference data)
- CHR assignment: CHR is assigned from the chromosome number (extracted from filename or provided parameter)
- POS assignment: POS is extracted from the matched reference data
- Deduplication: When searching across all chromosomes, keeps the first match per variant (prioritizes matches with both CHR and POS)
Performance characteristics:
- Speed: Processing 10,000 variants typically takes 1-3 seconds with 4-6 threads
- Memory: Only loads relevant HDF5 groups into memory (one group at a time per thread)
- Scalability: Can handle datasets with millions of variants efficiently
- Parallelization: Uses multiprocessing for concurrent task processing (chromosome × group combinations)
- Index-based matching: Uses pandas index intersection for O(1) average-case lookups
- Optimized design: Separate files per chromosome enable true parallel writes and efficient lookups
- Smart processing: When CHR is available, only processes relevant chromosome files (much faster)
Log output example:
2025/12/24 17:30:44 Start to assign CHR and POS using rsIDs ...(v4.0.0)
2025/12/24 17:30:44 -Current Dataframe shape : 10000 x 11 ; Memory usage: 0.64 MB
2025/12/24 17:30:44 -Source hdf5 file: /path/to/dbsnp/
2025/12/24 17:30:44 -Threads to use: 6
2025/12/24 17:30:44 -Grouping method: modulo 10 (rsID % 10)
2025/12/24 17:30:44 -Non-Valid rsIDs: 218
2025/12/24 17:30:44 -Valid rsIDs: 9782
2025/12/24 17:30:44 -Found HDF5 files for 25 chromosome(s)
2025/12/24 17:30:44 -Initiating CHR ...
2025/12/24 17:30:44 -Initiating POS ...
2025/12/24 17:30:44 -Searching across all chromosomes (CHR column not available or single-file format)
2025/12/24 17:30:44 -Created 250 processing tasks: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
2025/12/24 17:32:44 -Variants matched: 1922 / 9782
2025/12/24 17:32:44 -Merging group data...
2025/12/24 17:32:44 -Append data...
2025/12/24 17:32:44 Start to fix chromosome notation (CHR) ...(v4.0.0)
2025/12/24 17:32:44 -Variants with standardized chromosome notation: 1922
2025/12/24 17:32:44 -Variants with NA chromosome notations: 8078
2025/12/24 17:32:44 Finished fixing chromosome notation (CHR).
2025/12/24 17:32:44 Start to fix basepair positions (POS) ...(v4.0.0)
2025/12/24 17:32:44 -Position bound:(0 , 250,000,000)
2025/12/24 17:32:44 -Removed variants outliers: 0
2025/12/24 17:32:44 Finished fixing basepair positions (POS).
2025/12/24 17:32:44 Finished assigning CHR and POS using rsIDs.
Tips for optimal performance
- Use appropriate thread count: Set
threadsto match your CPU cores (typically 4-8 for most systems) - Pre-generate HDF5 files: Process VCF files once and reuse the HDF5 directory for multiple datasets
- Compression level: Use
complevel=3(default) for a good balance between file size and processing speed - Memory considerations: The function loads one HDF5 group per thread, so memory usage scales with thread count
- Large datasets: For datasets with >10M variants, consider processing in batches or using more threads
- Separate files per chromosome: The new format allows true parallel writes during processing, significantly faster than single-file approach
- CHR column advantage: If your data already has CHR values, the function will process much faster by only checking relevant chromosome files
- Index-based matching: Uses pandas index operations for efficient matching (faster than binary search for this use case)
- Task-based progress: Task numbers are printed during processing to show progress (e.g., "1 2 3 4 5...")
- Automatic deduplication: When searching across all chromosomes, the function automatically keeps the best match per variant