- motivation: use TTT (program search / hypothesis filtering via gradient descent)
- data is king: scale high-quality synthetic riddle generation
- aim for collisions with the private ARC test set: most ARC-AGI-1 riddles were probably generated by Chollet himself, so the attack vector is "replicate Chollet". Already extracted the "core ideas" behind ARC-1 with Sonnet and generated >1k derived riddle-generator+verifier pairs (open-ended generation starting from core-idea descriptions, with re-arc generators as seed samples).
- maximize synthetic riddle diversity and quality, e.g. by cleverly selecting in-context examples for generation (take inspiration from Novelty Search) and by filtering poor outputs and near-duplicates
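  A minimal sketch of one way to do this, assuming riddle embeddings are already available and stacked into a matrix; greedy max-min selection keeps the in-context set diverse and drops near-duplicates:

  ```python
  import numpy as np

  def select_diverse(embeddings: np.ndarray, k: int, dup_threshold: float = 0.98):
      """Greedy max-min (farthest-point) selection over normalized riddle
      embeddings: repeatedly pick the candidate least similar to everything
      chosen so far; stop early once only near-duplicates remain."""
      X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
      selected = [0]                     # seed with an arbitrary first sample
      max_sim = X @ X[0]                 # max similarity to the selected set
      for _ in range(k - 1):
          max_sim[selected] = np.inf     # never re-pick selected items
          idx = int(np.argmin(max_sim))  # the most novel remaining candidate
          if max_sim[idx] >= dup_threshold:
              break                      # everything left is a near-duplicate
          selected.append(idx)
          max_sim = np.maximum(max_sim, X @ X[idx])
      return selected
  ```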
- strive for new, original ideas, e.g. use a human in the loop for steering & filtering/ranking of synthetic riddle generation
- collect the inputs & outputs of human feedback to capture human preferences (e.g. to filter out out-of-human-distribution samples)
- search for the best open-weights model for transduction fine-tuning (probably needs to be <= 8B); test coding and reasoning models
- catch up with experienced teams: gain experience with how transduction generalizes (e.g. by training on the data of a single re-arc generator or on simple riddle subsets, e.g. the generators with the lowest number of source lines)
- try fine-tuning a VLM, e.g. on visual representations of ARC boards (vision encoders might have advantages in recognizing 2D spatial relations)
- collect a list of riddle board augmentation techniques and test their effectiveness for fine-tuning and during TTT (e.g. see arcmentations)
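  A minimal augmentation sketch (dihedral symmetries + color permutation), assuming boards are small numpy int arrays with colors 0-9; some teams keep color 0 fixed as background:

  ```python
  import numpy as np

  def augment_riddle(pairs, rng: np.random.Generator):
      """Apply one random dihedral transform (8 board symmetries) plus a
      random color permutation to every (input, output) board of a riddle.
      The same transform must hit all boards, or the hidden rule is destroyed."""
      k = int(rng.integers(4))           # number of 90-degree rotations
      flip = bool(rng.integers(2))       # optional horizontal flip
      perm = rng.permutation(10)         # color remapping for colors 0..9

      def t(board):
          b = np.rot90(board, k)
          if flip:
              b = np.fliplr(b)
          return perm[b]                 # remap colors via fancy indexing

      return [(t(x), t(y)) for x, y in pairs]
  ```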
- conduct TTT experiments: augmentations, freezing layers, training only FFN/KQV, different optimizers, regularization & hyperparameters
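  A PyTorch sketch for the parameter-subset ablations, assuming LLaMA-style Hugging Face parameter names (`mlp.`, `self_attn.q_proj`, ...); the name patterns must be adjusted per model:

  ```python
  def set_trainable(model, mode: str):
      """Freeze everything, then unfreeze a subset for TTT.
      mode 'ffn' trains only the MLP blocks, 'kqv' only the attention
      projections; returns the trainable parameters for the optimizer."""
      patterns = {
          "ffn": ("mlp.",),
          "kqv": ("self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"),
          "all": ("",),                  # empty pattern matches every name
      }[mode]
      for name, p in model.named_parameters():
          p.requires_grad = any(pat in name for pat in patterns)
      return [p for p in model.parameters() if p.requires_grad]

  # usage: optimizer = torch.optim.AdamW(set_trainable(model, "ffn"), lr=1e-5)
  ```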
- check the effectiveness of hallucination detection (via an internal-state classifier) and mitigation for ARC riddles (e.g. see the FactCheckmate paper)
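  A sketch of an internal-state probe in the spirit of FactCheckmate: a linear classifier over pooled hidden states of solutions labeled correct/incorrect; hidden-state extraction and labels are assumed to exist already:

  ```python
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  def train_hallucination_probe(hidden_states: np.ndarray, correct: np.ndarray):
      """hidden_states: (n_samples, d_model), e.g. mean over answer tokens
      at one layer of the solver model; correct: (n_samples,) 0/1 labels.
      The probe's probability can gate whether a prediction is trusted."""
      return LogisticRegression(max_iter=1000).fit(hidden_states, correct)

  # usage: p_ok = probe.predict_proba(h_new.reshape(1, -1))[0, 1]
  ```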
- use the test-input board for consistency checking: predict the test output & swap it with a training example -> verify that the training example's output is predicted correctly given the test input & output as a training pair (check whether this consistency test can be used in the TTT objective).
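  A sketch of the swap test, assuming boards are comparable values (e.g. tuples of tuples) and `predict` is a hypothetical function mapping (training pairs, test input) to an output board:

  ```python
  def consistency_check(train_pairs, test_input, predict) -> bool:
      """Predict the test output, then for each training example swap roles:
      the predicted (test_input, test_output) pair joins the context and the
      held-out training input must still be solved correctly; any miss
      rejects the prediction."""
      test_output = predict(train_pairs, test_input)
      for i, (x_i, y_i) in enumerate(train_pairs):
          context = train_pairs[:i] + train_pairs[i + 1:] + [(test_input, test_output)]
          if predict(context, x_i) != y_i:
              return False
      return True
  ```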
- try to apply nanoGPT speedrun (leaderboard) insights (e.g. Muon optimizer)
- evaluate the effectiveness of adding `<pause>` thought-tokens during fine-tuning (see the pause-token and Coconut papers)
- train board transformation from a natural-language description task (separate the deduction of the board transformation from its execution)
- motivation: human-interpretable program-search heuristic with intermediate results (potentially multi-step)
- create an ARC agent environment which allows riddle analysis, solution checking, plan storage and code execution, e.g. to iteratively re-write, improve & fix synthesized transformation-program candidates
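  A minimal sketch of the code-execution part, assuming candidate programs define a `transform(board)` function; a subprocess gives a crude sandbox with a timeout, and errors are returned as feedback for the re-writing loop:

  ```python
  import multiprocessing as mp

  def _worker(src, board, q):
      try:
          ns = {}
          exec(src, ns)                  # candidate must define transform(board)
          q.put(("ok", ns["transform"](board)))
      except Exception as e:
          q.put(("err", repr(e)))

  def run_program(src: str, board, timeout: float = 2.0):
      """Run a synthesized program in a subprocess; return (output, error)
      so interpreter errors can be fed back into the next prompt."""
      q = mp.Queue()
      p = mp.Process(target=_worker, args=(src, board, q))
      p.start()
      p.join(timeout)
      if p.is_alive():
          p.terminate()
          return None, "timeout"
      try:
          status, payload = q.get(timeout=1.0)
      except Exception:
          return None, "crashed without output"
      return (payload, None) if status == "ok" else (None, payload)
  ```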
- find a good structured prompt template containing the training examples, plan/idea, history of failed attempts, found solutions and DSL functions used for "similar" riddles (retrieved via RAG), the current program candidate, the resulting output board or compiler/interpreter error messages, etc.
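  A minimal template-builder sketch; the section headings and the `format_board` helper are assumptions:

  ```python
  def build_prompt(train_pairs, plan, retrieved, failed, candidate, feedback,
                   format_board):
      """Assemble the structured context; every section is optional so the
      same template works at any stage of the solve loop."""
      parts = ["## Training examples"]
      for i, (x, y) in enumerate(train_pairs):
          parts += [f"### Pair {i}", "input:", format_board(x),
                    "output:", format_board(y)]
      if plan:
          parts += ["## Current plan / idea", plan]
      if retrieved:
          parts += ["## DSL snippets from similar riddles (RAG)", *retrieved]
      if failed:
          parts += ["## Failed attempts (summarized)", *failed]
      if candidate:
          parts += ["## Current program candidate", candidate]
      if feedback:
          parts += ["## Execution result / error", feedback]
      return "\n".join(parts)
  ```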
- generate an ARC CoT "reasoning" dataset of transformation descriptions, using the synthetic riddle-generator source code as an oracle
- describe the content and features of the training boards, commonalities & differences, and obvious relationships between input and output
- train a model to generate transformation hypotheses (natural language)
- test the effectiveness of RAG: train a model to describe the first impression or core concept of a riddle; use embeddings of the description to retrieve similar solutions from the riddle DB.
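  A retrieval sketch, assuming a hypothetical `embed` function and a riddle DB stored as a matrix of description embeddings aligned with a list of solutions:

  ```python
  import numpy as np

  def retrieve_similar(query_desc: str, db_embeddings: np.ndarray,
                       db_solutions: list, embed, k: int = 5):
      """Cosine-similarity top-k lookup of solved riddles whose generated
      core-concept description resembles the current riddle's."""
      q = embed(query_desc)
      q = q / np.linalg.norm(q)
      D = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
      top = np.argsort(D @ q)[::-1][:k]
      return [db_solutions[i] for i in top]
  ```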
- train a verifier / process reward model & use tree-search techniques to find the correct result: detect an invalid hypothesis as soon as possible & backtrack (before synthesizing a candidate program)
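  A best-first search skeleton around such a verifier/PRM score; `expand`, `score` and `is_complete` are hypothetical callbacks over partial hypotheses:

  ```python
  import heapq
  import itertools

  def best_first_search(root, expand, score, is_complete, budget: int = 200):
      """Always expand the highest-scoring partial hypothesis; low-scoring
      (likely invalid) branches never reach the top of the queue, which is
      the 'detect & backtrack as early as possible' behavior."""
      tie = itertools.count()            # stable tie-breaker for equal scores
      frontier = [(-score(root), next(tie), root)]
      for _ in range(budget):
          if not frontier:
              return None
          _, _, node = heapq.heappop(frontier)
          if is_complete(node):
              return node
          for child in expand(node):
              heapq.heappush(frontier, (-score(child), next(tie), child))
      return None
  ```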
- reflect with a strong teacher model about failures -> generate data for back-tracking/self-correction; give the teacher model oracle knowledge about the riddle generators
- augment riddles with basic data about the riddle boards (input/output sizes, color histogram, segmentation)
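  A sketch of these features, using `scipy.ndimage.label` for a simple 4-connected segmentation:

  ```python
  import numpy as np
  from scipy.ndimage import label

  def board_stats(board: np.ndarray, background: int = 0):
      """Cheap features worth prepending to the prompt: shape, color
      histogram, and connected-component counts per non-background color."""
      components = {}
      for color in np.unique(board):
          if color == background:
              continue
          _, n = label(board == color)   # 4-connectivity by default
          components[int(color)] = int(n)
      return {"shape": board.shape,
              "color_histogram": np.bincount(board.ravel(), minlength=10).tolist(),
              "components_per_color": components}
  ```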
- use meta-planning prompts to summarize and critique a rollout and to suggest what should be tried next (potentially combine with RL to learn an effective strategy for ARC riddles, e.g. what to ask/try first, which prompts/questions uncover the transformation rules)
- training data: generate riddle-board descriptions with the riddle-generator source as oracle information (not shown during training)
- optimize the system's key prompts in an outer loop with an LLM (learn to ask the 'right questions' in a higher meta-cognition layer)
- reflect on failed attempts with an LLM to generate filtered/summarized context for future requests
- use an LLM to generate corrected reasoning traces from failed attempts when shown the ground-truth example
- adaptive sampling: start sampling with a low temperature and increase it gradually during reasoning (requires a solid verifier)
- aim to limit the amount of reasoning work: for simple 'well-known' riddles a direct transduction should be enough; try quick shots first and check them before starting more elaborate reasoning (requires a solid verifier, which is simpler for induction/program synthesis); see the combined sketch below
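  A combined sketch of this quick-shot ladder with the adaptive temperature schedule above; `sample` and `verify` are hypothetical:

  ```python
  def solve_with_escalation(riddle, sample, verify,
                            temperatures=(0.0, 0.3, 0.7, 1.0),
                            shots_per_level=(1, 4, 16, 64)):
      """Try cheap, near-greedy shots first; escalate to hotter, broader
      sampling only when the verifier rejects everything so far."""
      for temp, shots in zip(temperatures, shots_per_level):
          for _ in range(shots):
              candidate = sample(riddle, temperature=temp)
              if verify(riddle, candidate):
                  return candidate
      return None  # fall through to the more elaborate reasoning pipeline
  ```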
- try to classify, based on a generated riddle analysis, whether an induction or a transduction approach should be tried first
- challenge: LLM-based program generation is slow
- estimate the maximum LLM-based program-generation throughput (e.g. non-naive tree search with prefix caching)
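  A back-of-the-envelope sketch; every number here is an assumption to be replaced by measured values:

  ```python
  def programs_per_gpu_hour(decode_tok_s: float = 1500.0,   # batched decode speed
                            avg_program_tokens: int = 350,  # tokens per candidate
                            prefix_hit_rate: float = 0.6,   # share of reused tokens
                            verify_overhead: float = 0.15): # sandbox/exec share
      """Crude upper bound on candidate programs per GPU-hour. With prefix
      caching in a tree search, siblings share the prompt and the common
      program prefix, so only (1 - hit_rate) of tokens are newly decoded."""
      effective_tokens = avg_program_tokens * (1.0 - prefix_hit_rate)
      seconds_per_program = effective_tokens / decode_tok_s
      return 3600.0 / (seconds_per_program * (1.0 + verify_overhead))

  # with these defaults: ~33.5k candidates/hour
  ```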