Tokenization¶

Definition

AI-generated

Tokenization is the process of splitting raw input (text, sequence, or other discrete stream) into tokens—units a model consumes—using a fixed vocabulary, learned subword merges (BPE, SentencePiece), k-mers, codons, or domain-specific rules.

Topics

LLM and Agents Machine learning concepts

Why it matters in GWAS¶

Genomic and protein transformers inherit biases from k-mer or residue tokenization (stride, ambiguity codes, uppercase masking); text LLM workflows inherit tokenizer and vocabulary choices that affect rare allele symbols, rsIDs, and chemical names. Methods should name the tokenizer and vocabulary version alongside model checkpoints for reproducibility.

Example usage¶

"We used byte-pair tokenization for the trait descriptions but fixed 6-mer tokenization for the reference window fed to the DNA encoder."

References¶

Sennrich R, Haddow B, Birch A. (2016). Neural machine translation of rare words with subword units. ACL.

← Token Tool Calling →

Last updated 2026-04-05 (UTC · Git history)

Tokenization¶

Why it matters in GWAS¶

Example usage¶

Related terms¶

References¶