Alignment
Tools for understanding how transformer predictions are built layer-by-layer
Implementations of selected inverse reinforcement learning algorithms.
Robust recipes to align language models with human and AI preferences
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
A simulation framework for RLHF and alternatives. Develop your RLHF method without collecting human data.
RLHF implementation details of OpenAI's 2019 codebase
Reference implementation for DPO (Direct Preference Optimization); a minimal sketch of the DPO loss appears after this list
Train transformer language models with reinforcement learning.
Keeping language models honest by directly eliciting knowledge encoded in their activations.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Representation Engineering: A Top-Down Approach to AI Transparency
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
Aligning pretrained language models with instruction data generated by themselves.
A guidance language for controlling large language models.
Distilabel is a framework for synthetic data generation and AI feedback, aimed at engineers who need fast, reliable, and scalable pipelines based on verified research papers.
High-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG); PPO's clipped objective is sketched after this list
An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
RewardBench: the first evaluation tool for reward models.
Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment
Scalable toolkit for efficient model alignment
A recipe for online RLHF and online iterative DPO.
Recipes for training reward models for RLHF (see the reward-model loss sketch after this list).
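
Several entries above (the HALOs library, the DPO reference implementation, and the online iterative DPO recipe) center on Direct Preference Optimization. Below is a minimal sketch of the DPO objective, assuming PyTorch and summed per-completion token log-probabilities; the function and argument names are illustrative and not taken from any of the listed repositories:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO pairwise loss: increase the policy's implicit reward margin
    (relative to a frozen reference model) for chosen over rejected responses.
    Illustrative sketch only, not the reference repo's code."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin), averaged over the batch of preference pairs
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```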
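Many of the RLHF toolkits listed (TRL, Safe RLHF, NeMo-Aligner, CleanRL) build on PPO. The sketch below shows only the clipped surrogate policy term at PPO's core, in PyTorch; the value loss, entropy bonus, and KL penalty against the reference model are omitted, and all names are illustrative:

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    """PPO clipped surrogate objective (policy term only); illustrative sketch."""
    ratio = torch.exp(new_logprobs - old_logprobs)  # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    # take the pessimistic (minimum) surrogate and negate it for gradient descent
    return -torch.min(unclipped, clipped).mean()
```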
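Reward-model recipes like those in the last entry typically fit a pairwise Bradley-Terry preference model over scalar response scores. A hedged sketch of that loss, again with illustrative names:

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise loss: the chosen response should receive
    a higher scalar reward than the rejected one. Illustrative sketch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```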