Assigning CHR/POS
GWASLab can update CHR/POS using a pre-processed SNPID-rsID table to assign CHR and POS based on rsID. Currently, only variants in 1KG phase3v5 are supported.
.rsid_to_chrpos()
Reference data
GWASLab provides a download function gl.download_ref() and two curated tables which contains ~80M 1KG variants:
- hg19:- gl.download_ref("1kg_dbsnp151_hg19_auto")
- hg38:- gl.download_ref("1kg_dbsnp151_hg38_auto")
1kg_dbsnp151_hg19_auto
~/.gwaslab$ zcat 1kg_dbsnp151_hg19_auto.txt.gz |head
SNPID   rsID    CHR     POS     NEA     EA
1:10177:A:AC    rs367896724     1       10177   A       AC
1:10235:T:TA    rs540431307     1       10235   T       TA
1:10352:T:TA    rs555500075     1       10352   T       TA
1:10505:A:T     rs548419688     1       10505   A       T
1:10511:G:A     rs534229142     1       10511   G       A
1:10539:C:A     rs537182016     1       10539   C       A
1:10542:C:T     rs572818783     1       10542   C       T
1:10579:C:A     rs538322974     1       10579   C       A
1:10616:CCGCCGTTGCAAAGGCGCGCCG:C        rs376342519     1       10616   CCGCCGTTGCAAAGGCGCGCCG  C
.rsid_to_chrpos()
# if not downloaded yet :
# gl.download_ref("1kg_dbsnp151_hg19_auto")
# assign chr and pos using rsID
mysumstats.rsid_to_chrpos( path = gl.get_path("1kg_dbsnp151_hg19_auto"))
| Option | DataType | Description | Default | 
|---|---|---|---|
| path | string | the path to reference tabular file | - | 
Example
Assign CHR and POS using rsID
mysumstats.data
EA  NEA EAF BETA    SE  P   N   DIRECTION   STATUS  rsID
0   G   A   0.9960  -0.0737 0.1394  0.59700 166718  -?+-    9960099 rs565766235
1   G   A   0.0040  0.0737  0.1394  0.59730 166718  +?-+    9960099 rs534711480
2   C   T   0.0051  0.0490  0.1231  0.69080 166718  +?-+    9960099 rs540210562
3   TAA T   0.8374  0.0213  0.0199  0.28460 166718  -?++    9960399 rs529266287
4   T   A   0.8593  0.0172  0.0156  0.27050 166718  -?++    9960099 rs28544273
... ... ... ... ... ... ... ... ... ... ...
9995    C   T   0.1292  -0.0350 0.0191  0.06686 191764  ----    9960099 rs55938238
9996    C   T   0.4886  -0.0014 0.0094  0.88070 191764  -0+-    9960099 rs7520225
9997    C   T   0.9476  -0.0061 0.0216  0.77790 191764  ---+    9960099 rs151238770
9998    C   CT  0.9418  -0.0047 0.0199  0.81500 191764  ---+    9960399 rs137909285
9999    G   A   0.9828  -0.0084 0.0433  0.84620 191764  +--+    9960099 rs77575110
mysumstats.rsid_to_chrpos(path=gl.get_path("1kg_dbsnp151_hg19_auto"))
Fri Jan 27 00:06:24 2023 Start to update chromosome and position information based on rsID...
Fri Jan 27 00:06:24 2023  -Current Dataframe shape : 10000  x  10
Fri Jan 27 00:06:24 2023  -rsID dictionary file: /home/he/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz
Fri Jan 27 00:06:24 2023  -Setting block size:  10000000
Fri Jan 27 00:06:24 2023  -Loading block: 0   1   2   3   4   5   6   7   
Fri Jan 27 00:11:44 2023  -Updating CHR and POS finished.Start to re-fixing CHR and POS... 
Fri Jan 27 00:11:44 2023 Start to fix chromosome notation...
Fri Jan 27 00:11:44 2023  -Current Dataframe shape : 10000  x  12
Fri Jan 27 00:11:45 2023  -Variants with standardized chromosome notation: 9942
Fri Jan 27 00:11:45 2023  -Variants with fixable chromosome notations: 0
Fri Jan 27 00:11:45 2023  -Variants with NA chromosome notations: 58
Fri Jan 27 00:11:45 2023  -No unrecognized chromosome notations...
Fri Jan 27 00:11:46 2023 Finished fixing chromosome notation successfully!
Fri Jan 27 00:11:46 2023 Start to fix basepair positions...
Fri Jan 27 00:11:46 2023  -Current Dataframe shape : 10000  x  12
Fri Jan 27 00:11:46 2023  -Converting to Int64 data type ...
Fri Jan 27 00:11:46 2023  -Position upper_bound is: 250,000,000
Fri Jan 27 00:11:46 2023  -Remove outliers: 0
Fri Jan 27 00:11:46 2023  -Converted all position to datatype Int64.
Fri Jan 27 00:11:46 2023 Finished fixing basepair position successfully!
mysumstats.data
rsID    EA  NEA EAF BETA    SE  P   N   DIRECTION   STATUS  CHR POS
0   rs565766235 G   A   0.9960  -0.0737 0.1394  0.59700 166718  -?+-    9960099 1   725932
1   rs534711480 G   A   0.0040  0.0737  0.1394  0.59730 166718  +?-+    9960099 1   725933
2   rs540210562 C   T   0.0051  0.0490  0.1231  0.69080 166718  +?-+    9960099 1   737801
3   rs529266287 TAA T   0.8374  0.0213  0.0199  0.28460 166718  -?++    9960399 1   749963
4   rs28544273  T   A   0.8593  0.0172  0.0156  0.27050 166718  -?++    9960099 1   751343
... ... ... ... ... ... ... ... ... ... ... ... ...
9995    rs55938238  C   T   0.1292  -0.0350 0.0191  0.06686 191764  ----    9960099 1   3142135
9996    rs7520225   C   T   0.4886  -0.0014 0.0094  0.88070 191764  -0+-    9960099 1   3142137
9997    rs151238770 C   T   0.9476  -0.0061 0.0216  0.77790 191764  ---+    9960099 1   3142161
9998    rs137909285 C   CT  0.9418  -0.0047 0.0199  0.81500 191764  ---+    9960399 1   3142212
9999    rs77575110  G   A   0.9828  -0.0084 0.0433  0.84620 191764  +--+    9960099 1   3142762
.rsid_to_chrpos2()
Available since v3.4.31
.rsid_to_chrpos2() is a function to assign CHR and POS based on rsID using a HDF5 file derived from dbSNP reference VCF files.
Refernce VCF from dbSNP
First, download reference VCF file from dbSNP ftp site.
gl.process_vcf_to_hfd5()
From gwaslab v3.4.42,  process_ref_vcf() was renamed to process_vcf_to_hfd5()
Process the vcf file and convert it to HDF5 file using .process_vcf_to_hfd5(). This step may take up to one or two hours.
| Option | DataType | Description | Default | 
|---|---|---|---|
| vcf | string | the path to dbSNP VCF file | - | 
| directory | string | the directory where you want output the converted HDF5 file | the same as VCF file | 
directory="/home/yunye/work/gwaslab/examples/vcf_hd5/"
vcf = "/home/yunye/CommonData/Reference/ncbi_dbsnp/ncbi_dbsnp/db155/GCF_000001405.25.gz"
gl.process_vcf_to_hfd5(vcf=vcf,
                   directory=directory,
                   chr_dict=gl.get_NC_to_number(build="19"))
Fri Nov 10 11:27:59 2023 Start processing VCF files:
Fri Nov 10 11:27:59 2023  -Reference VCF path:/home/yunye/CommonData/Reference/ncbi_dbsnp/ncbi_dbsnp/db155/GCF_000001405.25.gz
Fri Nov 10 11:27:59 2023  -Output group size:20000000
Fri Nov 10 11:27:59 2023  -Compression level:9
Fri Nov 10 11:27:59 2023  -Loading chunksize:20000000
Fri Nov 10 11:27:59 2023  -HDF5 Output path: /home/yunye/work/gwaslab/examples/vcf_hd5/rsID_CHR_POS_groups_20000000.h5
Fri Nov 10 11:27:59 2023  -Log output path: /home/yunye/work/gwaslab/examples/vcf_hd5/rsID_CHR_POS_groups_20000000.log
Fri Nov 10 11:27:59 2023  -Processing chunk: 0 1 2 3 4 ...
Assign CHR POS using the HDF5 file.
.rsid_to_chrpos2()
Options
| Option | DataType | Description | Default | 
|---|---|---|---|
| path | string | the path to the HDF5 file | - | 
| n_cores | int | number of threads to use | ./ | 
| build | string | genome build version for CHR and POS | "99" | 
Example
...
mysumstats.rsid_to_chrpos2(path="/home/yunye/work/gwaslab/examples/vcf_hd5/rsID_CHR_POS_groups_20000000.h5")
Fri Nov 10 17:30:44 2023 Start to assign CHR and POS using rsIDs... 
Fri Nov 10 17:30:44 2023  -Source hdf5 file:  ./vcf_hd5/rsID_CHR_POS_groups_20000000.hd5
Fri Nov 10 17:30:44 2023  -Cores to use :  4
Fri Nov 10 17:30:44 2023  -Blocksize (make sure it is the same as hdf5 file ):  20000000
Fri Nov 10 17:30:44 2023  -Non-Valid rsIDs:  58
Fri Nov 10 17:30:44 2023  -Duplicated rsIDs except for the first occurrence:  0
Fri Nov 10 17:30:44 2023  -Valid rsIDs:  9942
Fri Nov 10 17:30:44 2023  -Initiating CHR ... 
Fri Nov 10 17:30:44 2023  -Initiating POS ... 
Fri Nov 10 17:30:44 2023  -Divided into groups:  41
Fri Nov 10 17:30:44 2023   - {0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 18, 19, 26, 27, 28, 37, 38, 39, 43, 44, 48, 49, 52, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74}
Fri Nov 10 17:30:46 2023  -Number of groups in HDF5:  92
Fri Nov 10 17:30:46 2023  -Max index of groups in HDF5:  105
Fri Nov 10 17:32:44 2023  -Merging group data... 
Fri Nov 10 17:32:44 2023  -Append data... 
Fri Nov 10 17:32:44 2023 Start to fix chromosome notation...
Fri Nov 10 17:32:44 2023  -Current Dataframe shape : 10000  x  13
Fri Nov 10 17:32:44 2023  -Checking CHR data type...
Fri Nov 10 17:32:44 2023  -Variants with standardized chromosome notation: 9705
Fri Nov 10 17:32:44 2023  -Variants with fixable chromosome notations: 0
Fri Nov 10 17:32:44 2023  -Variants with NA chromosome notations: 295
Fri Nov 10 17:32:44 2023  -No unrecognized chromosome notations...
Fri Nov 10 17:32:44 2023  -Sanity check for CHR...
Fri Nov 10 17:32:44 2023  -Removed 0 variants with CHR < 1...
Fri Nov 10 17:32:44 2023 Finished fixing chromosome notation successfully!
Fri Nov 10 17:32:44 2023 Start to fix basepair positions...
Fri Nov 10 17:32:44 2023  -Current Dataframe shape : 10000  x  13
Fri Nov 10 17:32:44 2023  -Converting to Int64 data type ...
Fri Nov 10 17:32:44 2023  -Position upper_bound is: 250,000,000
Fri Nov 10 17:32:44 2023  -Remove outliers: 0
Fri Nov 10 17:32:44 2023  -Converted all position to datatype Int64.
Fri Nov 10 17:32:44 2023 Finished fixing basepair position successfully!
Fri Nov 10 17:32:44 2023 Finished assigning CHR and POS using rsIDs.