Skip to content

Tokenization

Definition
AI-generated

Tokenization is the process of splitting raw input (text, sequence, or other discrete stream) into tokens—units a model consumes—using a fixed vocabulary, learned subword merges (BPE, SentencePiece), k-mers, codons, or domain-specific rules.

Why it matters in GWAS

Genomic and protein transformers inherit biases from k-mer or residue tokenization (stride, ambiguity codes, uppercase masking); text LLM workflows inherit tokenizer and vocabulary choices that affect rare allele symbols, rsIDs, and chemical names. Methods should name the tokenizer and vocabulary version alongside model checkpoints for reproducibility.

Example usage

"We used byte-pair tokenization for the trait descriptions but fixed 6-mer tokenization for the reference window fed to the DNA encoder."

References

  • Sennrich R, Haddow B, Birch A. (2016). Neural machine translation of rare words with subword units. ACL.

Last updated (UTC · Git history)