Reinforcement Learning from Human Feedback (RLHF)
Definition
AI-generated
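Reinforcement learning from human feedback (RLHF) aligns a pretrained language model in two stages. First, a reward model is trained on human preference comparisons between candidate outputs. Second, the policy is optimized against that reward model, often with proximal policy optimization (PPO) on sampled outputs and a KL penalty that keeps the policy close to the original model. The goal is an assistant that labelers rank as more helpful, honest, and harmless.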
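The two ingredients in this definition can be sketched numerically. The snippet below is a minimal illustration, not any particular library's implementation: it assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample KL penalty on the policy reward. The function names (`preference_loss`, `kl_penalized_reward`) and the default `beta` are hypothetical choices made for this example.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the labeler-preferred
    response well above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Reward-model score minus beta times a per-sample KL estimate.

    logp_policy and logp_reference are the log-probabilities of the same
    sampled output under the tuned policy and the original model.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The reward model agrees with the labeler: lower loss.
    print(preference_loss(2.0, 0.0))
    # The reward model disagrees with the labeler: higher loss.
    print(preference_loss(0.0, 2.0))
    # The policy has drifted from the reference, so the reward is reduced.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

In practice the rewards and log-probabilities come from neural networks and the loss is averaged over many comparisons; scalars are used here only to keep the arithmetic visible.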
Why it matters in GWAS
Instruction-tuned models used for curation or methods drafting inherit RLHF tradeoffs such as verbosity, over-refusal, and preference bias. Critical genomic claims should still be checked against databases and statistics, not accepted because the model's answer sounds fluent or agreeable.
Example usage
"The RLHF-tuned assistant refused to estimate individual disease risk from a VCF paste, which we preferred over a speculative guess."
References
- Ouyang L, et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
- Christiano PF, et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.