Stars

Alignment (basicv8vc's list, 32 repositories)

Tools for understanding how transformer predictions are built layer-by-layer

Python · 456 stars · 47 forks · Updated Jun 2, 2024

Implementations of selected inverse reinforcement learning algorithms.

Python · 991 stars · 238 forks · Updated Oct 21, 2022

Robust recipes to align language models with human and AI preferences

Python · 4,886 stars · 423 forks · Updated Nov 21, 2024

A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).

Python · 779 stars · 48 forks · Updated Jan 4, 2025

A simulation framework for RLHF and alternatives. Develop your RLHF method without collecting human data.

Python · 790 stars · 59 forks · Updated Jul 1, 2024

RLHF implementation details of OpenAI's 2019 codebase

Python · 165 stars · 8 forks · Updated Jan 14, 2024

Reference implementation for DPO (Direct Preference Optimization)

Python · 2,316 stars · 191 forks · Updated Aug 11, 2024

Train transformer language models with reinforcement learning.

Python · 10,579 stars · 1,369 forks · Updated Jan 12, 2025

Keeping language models honest by directly eliciting knowledge encoded in their activations.

Python · 192 stars · 33 forks · Updated Jan 6, 2025

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Python · 1,391 stars · 118 forks · Updated Jun 13, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Jupyter Notebook · 770 stars · 88 forks · Updated Aug 14, 2024

Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".

93 stars · 11 forks · Updated Mar 9, 2024

Aligning pretrained language models with instruction data generated by themselves.

Python · 4,238 stars · 495 forks · Updated Mar 27, 2023

[NeurIPS 2023] RRHF & Wombat

Python · 802 stars · 49 forks · Updated Sep 22, 2023

A guidance language for controlling large language models.

Jupyter Notebook · 19,436 stars · 1,059 forks · Updated Jan 7, 2025

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

Python · 1,867 stars · 148 forks · Updated Jan 10, 2025

Official repository for ORPO

Python · 429 stars · 40 forks · Updated May 31, 2024

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

Python · 6,043 stars · 684 forks · Updated Jan 9, 2025

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Python · 7,832 stars · 881 forks · Updated Jan 12, 2025

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Jupyter Notebook · 388 stars · 61 forks · Updated Aug 16, 2024

Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).

HTML · 176 stars · 37 forks · Updated Dec 16, 2024

RewardBench: the first evaluation tool for reward models.

Python · 485 stars · 56 forks · Updated Jan 8, 2025

Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment

Python · 1,027 stars · 41 forks · Updated May 31, 2024

Scalable toolkit for efficient model alignment

Python · 670 stars · 85 forks · Updated Jan 12, 2025

A recipe for online RLHF and online iterative DPO.

Python · 480 stars · 51 forks · Updated Dec 28, 2024

Recipes to train reward models for RLHF.

Python · 1,129 stars · 79 forks · Updated Dec 12, 2024