# Reasoning LLMs

Philip Dhingra's latest thoughts on reasoning LLMs.

## About me

I'm an independent AGI researcher and consultant based in San Francisco. I also co-work out of the Canopy Jackson Square office.

During the pandemic, I built a prototype for Chain-of-Thought prompting and then achieved SOTA for ARC-AGI on Kaggle (+38% accuracy). Afterwards, I won a silver medal in Cornell's BirdCLEF competition. You can read more about me here or check out my LinkedIn.

## Multi-query problem-solving

Sam Altman wants customers to run ChatGPT queries not just for minutes but for days, maybe even months, to solve hard math problems such as the Riemann Hypothesis or to find new cancer drugs. However, the reception to OpenAI's latest model, o1-preview, has been lukewarm at best. What went wrong?

Open-source researchers like myself are still trying to reverse-engineer o1, and so far it appears to be a bespoke version of either Tree-of-Thoughts (ToT) or one of the other "of-Thoughts" approaches (see below). I believe ToT-style approaches can bear fruit if we focus on hard, not soft, feedback. Not every problem fits this mold, but consider Game of 24 or ARC-AGI (the subject of my research). With these "puzzles," you don't have to wait for a hidden test set to validate the model's predictions. With Game of 24, you can simply check whether the arithmetic operations get you to 24. With ARC-AGI, if you use a code-gen approach like Ryan Greenblatt's, you can just run the generated code against the training examples and check for consistency.
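
To make the hard-feedback idea concrete, below is a minimal sketch, in Python, of a deterministic Game of 24 checker. The function name `check_game_of_24` and the expression format are illustrative assumptions rather than anything taken from a published ToT or o1 pipeline; the point is only that a candidate answer can be verified mechanically, with no LLM in the loop.

```python
import ast
from collections import Counter
from fractions import Fraction


def check_game_of_24(numbers, expression):
    """Deterministically verify a Game of 24 candidate (illustrative sketch).

    `numbers` is the list of four starting integers and `expression` is a
    candidate arithmetic string such as "(10 - 4) * (13 - 9)". Returns True
    only if the expression uses each starting number exactly once, contains
    only +, -, *, / and parentheses, and evaluates to exactly 24.
    """
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False

    used = []  # literals encountered while walking the expression

    def evaluate(node):
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            used.append(node.value)
            return Fraction(node.value)
        if isinstance(node, ast.BinOp):
            left, right = evaluate(node.left), evaluate(node.right)
            if isinstance(node.op, ast.Add):
                return left + right
            if isinstance(node.op, ast.Sub):
                return left - right
            if isinstance(node.op, ast.Mult):
                return left * right
            if isinstance(node.op, ast.Div):
                if right == 0:
                    raise ValueError("division by zero")
                return left / right
        raise ValueError("disallowed syntax")

    try:
        value = evaluate(tree)
    except ValueError:
        return False

    return Counter(used) == Counter(numbers) and value == 24


print(check_game_of_24([4, 9, 10, 13], "(10 - 4) * (13 - 9)"))  # True
print(check_game_of_24([4, 9, 10, 13], "(10 + 4) * (13 - 9)"))  # False: 56
```

Exact `Fraction` arithmetic keeps the check deterministic even for division-heavy solutions like 8 / (3 - 8 / 3), where floating-point evaluation lands slightly off 24.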

At every reasoning step, you can mathematically verify whether ToT got it right. This way, the LLM builds true experience, as opposed to what o1 appears to be doing, which is using the LLM to grade its own outputs. o1 currently recapitulates some of the same garbage-in/garbage-out traps that much LLM research has fallen into over the last two years. Everybody thought that more parameters, better prompts, larger context windows, or multiple agents would provide categorical improvements to LLMs. But now we're in the winter of AI discontent, feeling like these gains are starting to "log out," i.e., proving to be sources of logarithmic, not exponential, gains.

But what if we could provide hard, deterministic feedback from systems that run outside the LLM, rather than the soft, fuzzy feedback of the LLM judging itself? Couldn't we escape the garbage-in/garbage-out trap?
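
Below is a skeletal propose-and-verify loop, again in Python, showing what that could look like for an ARC-style code-gen setup: the LLM proposes candidate programs, but survival is decided by a deterministic check against the training pairs. `propose_candidates`, the `Grid`/`Program` types, and the round/beam parameters are hypothetical stand-ins of my own, not details of OpenAI's o1 or the Tree-of-Thoughts paper.

```python
from typing import Callable, List, Tuple

# A "program" here is any Python callable from an input grid to an output grid.
# In a real code-gen setup it would be source code written by the LLM and run in
# a sandbox; `propose_candidates` below stands in for that LLM call.
Grid = List[List[int]]
Program = Callable[[Grid], Grid]
Proposer = Callable[[List[Tuple[Grid, Grid]], int], List[Program]]


def consistent_with_examples(program: Program,
                             examples: List[Tuple[Grid, Grid]]) -> bool:
    """Hard feedback: the candidate must reproduce every training output exactly."""
    try:
        return all(program(inp) == out for inp, out in examples)
    except Exception:
        return False  # crashing candidates are simply rejected


def search(examples: List[Tuple[Grid, Grid]],
           propose_candidates: Proposer,
           rounds: int = 3,
           beam: int = 8) -> List[Program]:
    """Ask for candidates, keep only those the deterministic check accepts."""
    survivors: List[Program] = []
    for _ in range(rounds):
        for candidate in propose_candidates(examples, beam):
            if consistent_with_examples(candidate, examples):
                survivors.append(candidate)
        if survivors:
            break  # verified programs found; stop spending queries
    return survivors


# Toy demonstration with a stand-in proposer (a real system would query the LLM).
examples = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]  # hidden rule: transpose


def toy_proposer(exs: List[Tuple[Grid, Grid]], beam: int) -> List[Program]:
    identity: Program = lambda g: g
    transpose: Program = lambda g: [list(row) for row in zip(*g)]
    return [identity, transpose]


print(len(search(examples, toy_proposer)))  # 1: only the transpose survives
```

Because the survivors are verified against something outside the model, whatever gets fed back into the next round of prompting is grounded in hard outcomes rather than in the LLM's opinion of its own work.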

## The "of-Thoughts" family

We're two generations into the "of-Thoughts" strategy for augmenting LLMs, and progress is moving fast. AI critics like Gary Marcus mock these as mere "prompt engineering" techniques, but once you get to multi-query methods like Tree-of-Thoughts and OpenAI's o1, it's apparent we've only scratched the surface.

A brief timeline:

(1 line = 1 month)

2022 Oct 31 - Chain of Thoughts (first generation)
.
.
.
.
.
.
2023 May 17 - Tree of Thoughts (start of second generation, discussion)
.
.
.
.
2023 Oct 11 - Diversity of Thoughts (discussion)
.
.
2024 Jan 16 - Boosting of Thoughts
.
.
.
.
.
2024 Jun 6 - Buffer of Thoughts
.
.
.
.
(present)

## AI code assistants

(1 line = 1 quarter)

2021 Oct 29 - GitHub Copilot plugin released, using Codex, a descendant of GPT-3
.
.
.
2022 Q1
.
.
.
2023 Q1
.
.
2023 Oct 15 - Cursor, which uses GPT-4, gets its first post on Hacker News
.
.
2024 Q3
(present)