Harmonization
Load test data
Example
stdout:
Fri Feb 2 19:36:24 2024 GWASLab v3.4.38 https://cloufield.github.io/gwaslab/
Fri Feb 2 19:36:24 2024 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
Fri Feb 2 19:36:24 2024 Start to load format from formatbook....
Fri Feb 2 19:36:24 2024 -gwaslab format meta info:
Fri Feb 2 19:36:24 2024 - format_name : gwaslab
Fri Feb 2 19:36:24 2024 - format_source : https://cloufield.github.io/gwaslab/
Fri Feb 2 19:36:24 2024 - format_version : 20231220_v4
Fri Feb 2 19:36:24 2024 Start to initialize gl.Sumstats from file :/home/yunye/work/gwaslab/examples/toy_data/to_harmonize.tsv
Fri Feb 2 19:36:24 2024 -Reading columns : CHR,NEA,SE,P,DIRECTION,BETA,SNPID,NOTE,N,EAF,EA,POS
Fri Feb 2 19:36:24 2024 -Renaming columns to : CHR,NEA,SE,P,DIRECTION,BETA,SNPID,NOTE,N,EAF,EA,POS
Fri Feb 2 19:36:24 2024 -Current Dataframe shape : 15 x 12
Fri Feb 2 19:36:24 2024 -Initiating a status column: STATUS ...
Fri Feb 2 19:36:24 2024 #WARNING! Version of genomic coordinates is unknown...
Fri Feb 2 19:36:24 2024 Start to reorder the columns...v3.4.38
Fri Feb 2 19:36:24 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:24 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS,NOTE
Fri Feb 2 19:36:24 2024 Finished reordering the columns.
Fri Feb 2 19:36:24 2024 -Column : SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS NOTE
Fri Feb 2 19:36:24 2024 -DType : object string int64 category category float64 float64 float64 float64 int64 object category object
Fri Feb 2 19:36:24 2024 -Verified: T F T T T T T T T T T T NA
Fri Feb 2 19:36:24 2024 #WARNING! Columns with possibly incompatable dtypes: CHR
Fri Feb 2 19:36:24 2024 -Current Dataframe memory usage: 19.94 MB
Fri Feb 2 19:36:24 2024 Finished loading data successfully!
A list of variants that needs to be harmonized. Issues are described in NOTE column.
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | NOTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:1066403:T:C | 1 | 1066403 | C | T | 0.5276 | 0.0043 | 0.0109 | 0.68910 | 191764 | ++-+ | 9999999 | Clean |
| 1:1163266:G:A | 1 | 1163266 | G | A | 0.8355 | 0.0122 | 0.0120 | 0.30810 | 191764 | -+++ | 9999999 | Flip |
| 1:1213224:T:A | 1 | 1213224 | T | A | 0.3345 | 0.0071 | 0.0095 | 0.45280 | 191764 | +++- | 9999999 | Flip + Palindromic |
| 1:3066761:A:T | 1 | 3066761 | T | A | 0.1245 | -0.0066 | 0.0157 | 0.67420 | 191764 | -+++ | 9999999 | Palindromic + No flip |
| 1:3997271:T:TTTTA | 1 | 3997271 | TTTTA | T | 0.3840 | 0.0188 | 0.0135 | 0.16300 | 191764 | ++++ | 9999999 | Indel + Both on genome + No flip |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3:183629306:T:TTCTC | 3 | 183629306 | TTCTC | T | 0.0082 | -0.0083 | 0.0555 | 0.88060 | 191764 | +--+ | 9999999 | Indel + Both on genome + No information |
| 4:99731866:G:C | 4 | 99731866 | G | C | 0.3883 | 0.0052 | 0.0093 | 0.57360 | 191764 | +-++ | 9999999 | Palindromic+ Flip + Different strand (REF G, T... |
| 8:89935201:C:G | 8 | 89935201 | G | C | 0.3533 | 0.0103 | 0.0094 | 0.27560 | 191764 | ++++ | 9999999 | Palindromic + Different strand (REF A; TOMMO G... |
| X:4206466:G:C | X | 4206466 | G | C | 0.8727 | -0.0055 | 0.0104 | 0.59840 | 191764 | -+-- | 9999999 | Palindromic + Flip + No information |
| X:5053229:A:T | X | 5053229 | T | A | 0.9747 | 0.0218 | 0.0225 | 0.33180 | 191764 | ++++ | 9999999 | Palindromic + No information |
Perform harmonization
- ref_seq : reference genome fasta file for allele alignment
- ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
- ref_alt_freq : field in INFO of vcf file for alternative allele frequency
Example
stdout:
Fri Feb 2 19:36:24 2024 Start to check SNPID/rsID...v3.4.38
Fri Feb 2 19:36:24 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:24 2024 -Checking SNPID data type...
Fri Feb 2 19:36:24 2024 -Converting SNPID to pd.string data type...
Fri Feb 2 19:36:24 2024 -Checking if SNPID is CHR:POS:NEA:EA...(separator: - ,: , _)
Fri Feb 2 19:36:25 2024 Finished checking SNPID/rsID.
Fri Feb 2 19:36:25 2024 Start to fix chromosome notation (CHR)...v3.4.38
Fri Feb 2 19:36:25 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:25 2024 -Checking CHR data type...
Fri Feb 2 19:36:25 2024 -Variants with standardized chromosome notation: 13
Fri Feb 2 19:36:25 2024 -Variants with fixable chromosome notations: 2
Fri Feb 2 19:36:25 2024 -No unrecognized chromosome notations...
Fri Feb 2 19:36:25 2024 -Identifying non-autosomal chromosomes : X, Y, and MT ...
Fri Feb 2 19:36:25 2024 -Identified 2 variants on sex chromosomes...
Fri Feb 2 19:36:25 2024 -Standardizing sex chromosome notations: X to 23...
Fri Feb 2 19:36:27 2024 Finished fixing chromosome notation (CHR).
Fri Feb 2 19:36:27 2024 Start to fix basepair positions (POS)...v3.4.38
Fri Feb 2 19:36:27 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:27 2024 -Converting to Int64 data type ...
Fri Feb 2 19:36:28 2024 -Position bound:(0 , 250,000,000)
Fri Feb 2 19:36:28 2024 -Removed outliers: 0
Fri Feb 2 19:36:28 2024 Finished fixing basepair positions (POS).
Fri Feb 2 19:36:28 2024 Start to fix alleles (EA and NEA)...v3.4.38
Fri Feb 2 19:36:28 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:28 2024 -Converted all bases to string datatype and UPPERCASE.
Fri Feb 2 19:36:28 2024 -Variants with bad EA : 0
Fri Feb 2 19:36:28 2024 -Variants with bad NEA : 0
Fri Feb 2 19:36:28 2024 -Variants with NA for EA or NEA: 0
Fri Feb 2 19:36:28 2024 -Variants with same EA and NEA: 0
Fri Feb 2 19:36:28 2024 -Detected 0 variants with alleles that contain bases other than A/C/T/G .
Fri Feb 2 19:36:30 2024 Finished fixing alleles (EA and NEA).
Fri Feb 2 19:36:30 2024 Start to perform sanity check for statistics...v3.4.38
Fri Feb 2 19:36:30 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:30 2024 -Comparison tolerance for floats: 1e-07
Fri Feb 2 19:36:30 2024 -Checking if 0 <= N <= 2147483647 ...
Fri Feb 2 19:36:30 2024 -Removed 0 variants with bad/na N.
Fri Feb 2 19:36:30 2024 -Checking if -1e-07 < EAF < 1.0000001 ...
Fri Feb 2 19:36:30 2024 -Removed 0 variants with bad/na EAF.
Fri Feb 2 19:36:30 2024 -Checking if -1e-07 < P < 1.0000001 ...
Fri Feb 2 19:36:30 2024 -Removed 0 variants with bad/na P.
Fri Feb 2 19:36:30 2024 -Checking if -100.0000001 < BETA < 100.0000001 ...
Fri Feb 2 19:36:30 2024 -Removed 0 variants with bad/na BETA.
Fri Feb 2 19:36:30 2024 -Checking if -1e-07 < SE < inf ...
Fri Feb 2 19:36:30 2024 -Removed 0 variants with bad/na SE.
Fri Feb 2 19:36:30 2024 -Checking STATUS and converting STATUS to categories....
Fri Feb 2 19:36:30 2024 -Removed 0 variants with bad statistics in total.
Fri Feb 2 19:36:30 2024 -Data types for each column:
Fri Feb 2 19:36:30 2024 -Column : SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS NOTE
Fri Feb 2 19:36:30 2024 -DType : string Int64 Int64 category category float32 float64 float64 float64 Int64 object category object
Fri Feb 2 19:36:30 2024 -Verified: T T T T T T T T T T T T NA
Fri Feb 2 19:36:30 2024 Finished sanity check for statistics.
Fri Feb 2 19:36:30 2024 Start to normalize indels...v3.4.38
Fri Feb 2 19:36:30 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:30 2024 -No available variants to normalize..
Fri Feb 2 19:36:30 2024 Finished normalizing variants successfully!
Fri Feb 2 19:36:30 2024 Start to reorder the columns...v3.4.38
Fri Feb 2 19:36:30 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:30 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS,NOTE
Fri Feb 2 19:36:30 2024 Finished reordering the columns.
Fri Feb 2 19:36:31 2024 Start to check if NEA is aligned with reference sequence...v3.4.38
Fri Feb 2 19:36:31 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 19.94 MB
Fri Feb 2 19:36:31 2024 -Reference genome FASTA file: /home/yunye/.gwaslab/hg19.fa
Fri Feb 2 19:36:31 2024 -Checking records: 1 2 3 4 5 6 7 X 8 9 10 11 12 13 14 15 16 17 18 20 Y 19 22 21 M
Fri Feb 2 19:36:47 2024 -Variants allele on given reference sequence : 6
Fri Feb 2 19:36:47 2024 -Variants flipped : 6
Fri Feb 2 19:36:47 2024 -Raw Matching rate : 80.00%
Fri Feb 2 19:36:47 2024 -Variants inferred reverse_complement : 0
Fri Feb 2 19:36:47 2024 -Variants inferred reverse_complement_flipped : 0
Fri Feb 2 19:36:47 2024 -Both allele on genome + unable to distinguish : 3
Fri Feb 2 19:36:47 2024 -Variants not on given reference sequence : 0
Fri Feb 2 19:36:47 2024 Finished checking if NEA is aligned with reference sequence.
Fri Feb 2 19:36:47 2024 Start to adjust statistics based on STATUS code...v3.4.38
Fri Feb 2 19:36:47 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 0.00 MB
Fri Feb 2 19:36:47 2024 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: ALT->EA , REF->NEA ...v3.4.38
Fri Feb 2 19:36:47 2024 -Flipping 6 variants...
Fri Feb 2 19:36:47 2024 -Swapping column: NEA <=> EA...
Fri Feb 2 19:36:47 2024 -Flipping column: BETA = - BETA...
Fri Feb 2 19:36:47 2024 -Flipping column: DIRECTION +-?0 <=> -+?0 ...
Fri Feb 2 19:36:47 2024 -Flipping column: EAF = 1 - EAF...
Fri Feb 2 19:36:47 2024 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x
Fri Feb 2 19:36:47 2024 Finished adjusting.
Fri Feb 2 19:36:47 2024 Start to infer strand for palindromic SNPs/align indistinguishable indels...v3.4.38
Fri Feb 2 19:36:47 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 0.00 MB
Fri Feb 2 19:36:47 2024 -Number of threads/cores to use: 1
Fri Feb 2 19:36:47 2024 -Reference VCF: /home/yunye/.gwaslab/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Fri Feb 2 19:36:47 2024 -Checking prefix for chromosomes in vcf files...
Fri Feb 2 19:36:47 2024 -No prefix for chromosomes in the VCF files.
Fri Feb 2 19:36:47 2024 -Field for alternative allele frequency in VCF INFO: AF
Fri Feb 2 19:36:47 2024 -Identified 8 palindromic SNPs...
Fri Feb 2 19:36:47 2024 -After filtering by MAF< 0.4 , 6 palindromic SNPs with unknown strand will be inferred...
Fri Feb 2 19:36:47 2024 -Non-palindromic : 2
Fri Feb 2 19:36:47 2024 -Palindromic SNPs on + strand: 2
Fri Feb 2 19:36:47 2024 -Palindromic SNPs on - strand and needed to be flipped: 2
Fri Feb 2 19:36:47 2024 -Palindromic SNPs with MAF not available to infer : 2
Fri Feb 2 19:36:47 2024 -Palindromic SNPs with no macthes or no information : 1
Fri Feb 2 19:36:47 2024 -Identified 3 indistinguishable Indels...
Fri Feb 2 19:36:47 2024 -Indistinguishable indels will be inferred from reference vcf ref and alt...
Fri Feb 2 19:36:47 2024 -DAF tolerance: 0.2
Fri Feb 2 19:36:47 2024 -Indels ea/nea match reference : 1
Fri Feb 2 19:36:47 2024 -Indels ea/nea need to be flipped : 1
Fri Feb 2 19:36:47 2024 -Indels with no macthes or no information : 1
Fri Feb 2 19:36:47 2024 Finished inferring strand for palindromic SNPs/align indistinguishable indels.
Fri Feb 2 19:36:47 2024 Start to adjust statistics based on STATUS code...v3.4.38
Fri Feb 2 19:36:47 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 0.00 MB
Fri Feb 2 19:36:47 2024 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: ALT->EA , REF->NEA...v3.4.38
Fri Feb 2 19:36:47 2024 -Flipping 1 variants...
Fri Feb 2 19:36:47 2024 -Swapping column: NEA <=> EA...
Fri Feb 2 19:36:47 2024 -Flipping column: BETA = - BETA...
Fri Feb 2 19:36:47 2024 -Flipping column: DIRECTION +-?0 <=> -+?0 ...
Fri Feb 2 19:36:47 2024 -Flipping column: EAF = 1 - EAF...
Fri Feb 2 19:36:47 2024 -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4
Fri Feb 2 19:36:47 2024 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand...v3.4.38
Fri Feb 2 19:36:47 2024 -Flipping 2 variants...
Fri Feb 2 19:36:47 2024 -Flipping column: BETA = - BETA...
Fri Feb 2 19:36:47 2024 -Flipping column: DIRECTION +-?0 <=> -+?0 ...
Fri Feb 2 19:36:47 2024 -Flipping column: EAF = 1 - EAF...
Fri Feb 2 19:36:47 2024 -Changed the status for flipped variants: xxxxx[012]5: -> xxxxx[012]2
Fri Feb 2 19:36:47 2024 Finished adjusting.
stdout:
Fri Feb 2 19:36:48 2024 Start to sort the genome coordinates...v3.4.38
Fri Feb 2 19:36:48 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 0.00 MB
Fri Feb 2 19:36:48 2024 Finished sorting coordinates.
Fri Feb 2 19:36:48 2024 Start to reorder the columns...v3.4.38
Fri Feb 2 19:36:48 2024 -Current Dataframe shape : 15 x 13 ; Memory usage: 0.00 MB
Fri Feb 2 19:36:48 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS,NOTE
Fri Feb 2 19:36:48 2024 Finished reordering the columns.
All variants were checked and harmonized based on the reference datasets. The manipulations are reflected in STATUS column.
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | NOTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:1066403:T:C | 1 | 1066403 | C | T | 0.5276 | 0.0043 | 0.0109 | 0.68910 | 191764 | ++-+ | 9960000 | Clean |
| 1:1163266:G:A | 1 | 1163266 | A | G | 0.1645 | -0.0122 | 0.0120 | 0.30810 | 191764 | +--- | 9960010 | Flip |
| 1:1213224:T:A | 1 | 1213224 | A | T | 0.6655 | -0.0071 | 0.0095 | 0.45280 | 191764 | ---+ | 9960011 | Flip + Palindromic |
| 1:3066761:A:T | 1 | 3066761 | T | A | 0.1245 | -0.0066 | 0.0157 | 0.67420 | 191764 | -+++ | 9960001 | Palindromic + No flip |
| 1:3997271:T:TTTTA | 1 | 3997271 | TTTTA | T | 0.3840 | 0.0188 | 0.0135 | 0.16300 | 191764 | ++++ | 9960363 | Indel + Both on genome + No flip |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3:183629306:T:TTCTC | 3 | 183629306 | TTCTC | T | 0.0082 | -0.0083 | 0.0555 | 0.88060 | 191764 | +--+ | 9960368 | Indel + Both on genome + No information |
| 4:99731866:G:C | 4 | 99731866 | C | G | 0.3883 | 0.0052 | 0.0093 | 0.57360 | 191764 | +-++ | 9960012 | Palindromic+ Flip + Different strand (REF G, T... |
| 8:89935201:C:G | 8 | 89935201 | G | C | 0.6467 | -0.0103 | 0.0094 | 0.27560 | 191764 | ---- | 9960002 | Palindromic + Different strand (REF A; TOMMO G... |
| X:4206466:G:C | 23 | 4206466 | C | G | 0.1273 | 0.0055 | 0.0104 | 0.59840 | 191764 | +-++ | 9960018 | Palindromic + Flip + No information |
| X:5053229:A:T | 23 | 5053229 | T | A | 0.9747 | 0.0218 | 0.0225 | 0.33180 | 191764 | ++++ | 9960008 | Palindromic + No information |
Check summary
| Values | Percentage |
|---|---|
| 15 | |
| 0 | 0.0 |
| ... | ... |
| 6.67 | |
| 6.67 | |
| 6.67 | |
| 6.67 | |
| 6.67 |
| Genome_Build | rsID&SNPID | CHR&POS | Stadardize&Normalize | Align | Panlidromic_SNP&Indel | Count | Percentage(%) |
|---|---|---|---|---|---|---|---|
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Not_palindromic_SNPs | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Palindromic+strand | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Palindromic-strand_fixed | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Indistinguishable | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | No_matching_or_no_info | 1 | 6.67 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Match: NEA=REF | Unchecked | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Flipped_fixed | Unchecked | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | Indel_match | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | Indel_flipped_fixed | 1 | 6.67 |
| Unchecked | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | No_matching_or_no_info | 1 | 6.67 |
VCF : NCBI Sequence Identifiers
For some reference file the chromosome notation is not in the form of 1, or chr1, instead, they are using NCBI Sequence Identifiers (for example: NC_000001.11).
See https://www.ncbi.nlm.nih.gov/genbank/sequenceids/
For example: GCF_000001405.25.vcf.gz
NC_000001.10 10059 rs1570391745 C G . . RS=1570391745;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9997,0.0003425|dbGaP_PopFreq:1,0
NC_000001.10 10060 rs1639544146 C CT . . RS=1639544146;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0
NC_000001.10 10060 rs1639544159 CT C . . RS=1639544159;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=DEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0
In such case, gwaslab provide built-in conversion table.
{1: 'NC_000001.10',
2: 'NC_000002.11',
3: 'NC_000003.11',
4: 'NC_000004.11',
5: 'NC_000005.9',
6: 'NC_000006.11',
7: 'NC_000007.13',
8: 'NC_000008.10',
9: 'NC_000009.11',
10: 'NC_000010.10',
11: 'NC_000011.9',
12: 'NC_000012.11',
13: 'NC_000013.10',
14: 'NC_000014.8',
15: 'NC_000015.9',
16: 'NC_000016.9',
17: 'NC_000017.10',
18: 'NC_000018.9',
19: 'NC_000019.9',
20: 'NC_000020.10',
21: 'NC_000021.8',
22: 'NC_000022.10',
23: 'NC_000023.10',
24: 'NC_000024.9',
25: 'NC_012920.1'}
Specify it in assignrsid_args and inferstrand_args for the all-in-one function: