10 Comments

Great article!

I do take LSAT scores, Olympiad solving, etc. with a grain of salt due to data contamination. Can this be overcome by using an eval that was created after the cutoff date? For example, using Olympiad questions from 2024 instead of 2020.

Depends on what you're using it for. For a basic head-to-head comparison, yes, that works; but to actually choose one for production you need to build much more specific evals.
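
To give a rough idea of what "more specific" can mean in practice, here is a minimal sketch of a use-case eval: a small harness over your own prompts and pass/fail checks. The `call_llm` function, the cases, and the checks are placeholders for whatever client and tasks you actually use, not any particular framework.

```python
# Minimal sketch of a use-case-specific eval: your own prompts, your own checks.
# `call_llm` is a placeholder for whatever client you use (OpenAI, Ollama, etc.).
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # pass/fail criterion for this use case

def run_eval(call_llm: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case.check(call_llm(case.prompt)))
    return passed / len(cases)

# Cases drawn from your own workload (private, post-cutoff) rather than a
# public benchmark, which also sidesteps most contamination worries.
cases = [
    EvalCase("Summarise this support ticket: ...", lambda out: len(out) < 500),
    EvalCase("Extract the invoice total from: ...", lambda out: "$" in out),
]
```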

Hi, thanks for the article. How much fine-tuning do you need to do to score high on things like the LSAT and Olympiad tests? What if you never trained the models on LSAT questions, and instead only on the underlying knowledge that a normal person taking the LSAT would have (no significant study, just the underlying skills)? Would they still score well?

Standardized tests are such a narrow way to measure intelligence anyway. Most kids don't grow up loving to stare at a book, but I bet that if you "made" them, gave them the right environment, and trained them to do well on these types of tests, they would all score high on these limited kinds of tests. I don't believe in these tests, or in studies that relate intelligence to life outcomes; correlation does not imply causation, so of course someone who scores high on these tests will typically have a "better job" and "life prospects" later, because they have preconditioned themselves to live that way, or to "score high" and "achieve more". I just don't believe in these narrow notions of intelligence, and even less in using IQ or LSAT scores as a measure of an LLM's supposed cognitive span or reasoning ability.

We need much better benchmarks and measures (I will be thinking about better ones for a while). I know I was being tangential, but I wanted to give my opinion. Let's push the state of the art in benchmarks and measures (e.g. the ability to solve hierarchical decisions, tackle unsolved or unseen problems, do science, etc.).

By the way, I read another one of your articles and I really liked your Socratic AI approach. I also believe in modular approaches with internal feedback loops: I think end-to-end or zero/one-shot approaches need more internal feedback for models to advance to the level of solving extremely hard problems, because of the amount of scale they otherwise require to be successful. I guess this is kind of intuitive and obvious to you, but I wanted to mention my opinion anyway.

Great article! I think there is a lot of misunderstanding around the capacity of our current evals to actually "test" LLMs' reasoning abilities. This is a great resource to point people to for a high-level overview of the meta-problems.

Also, after reading the line "as someone who is definitely in the very highest percentile of LLM usage, I still can’t easily say which LLMs to use for particular use cases", it would be interesting to see which LLMs/stacks you use for various tasks.

Thanks! I use Perplexity and GPT-4 primarily, with a bunch of little things running on-device with e.g. Ollama.

Really interesting thoughts. I’m trying to build an eval for front-end-specific engineering problems right now, so this post is spot-on.

Have you seen any eval frameworks that let you run these multi-dialog / multi-step agent behaviours to reach an acceptable final outcome, and then collect such outcomes as a way to fine-tune and improve pass@1 for future inference with the model? I believe I saw TruLens, but it didn’t quite have the same level of complex evaluation in it.
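
Roughly what I have in mind is something like this sketch: run each multi-step episode until an outcome check passes, then keep the accepted trajectories as fine-tuning candidates. `agent_step` and `is_acceptable` are placeholders for my own agent loop and acceptance criteria, not any existing framework's API.

```python
# Sketch: run a multi-step agent episode until the outcome check passes,
# then collect the accepted trajectories as fine-tuning data.
import json
from typing import Callable, Optional

def run_episode(agent_step: Callable[[list[dict]], dict],
                is_acceptable: Callable[[dict], bool],
                task: str, max_steps: int = 10) -> Optional[list[dict]]:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        turn = agent_step(history)     # one dialog turn or tool call
        history.append(turn)
        if is_acceptable(turn):        # final outcome meets the bar
            return history
    return None                        # failed episode, not collected

def collect_accepted(tasks, agent_step, is_acceptable, path="accepted.jsonl"):
    # Accepted trajectories become training examples to push up pass@1 later.
    with open(path, "w") as f:
        for task in tasks:
            episode = run_episode(agent_step, is_acceptable, task)
            if episode is not None:
                f.write(json.dumps({"messages": episode}) + "\n")
```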

Thanks! And that's very interesting. There isn't one afaik; everything good has to be handmade, which is partly why I ended up making the LOOP evals to test this myself. Please do let me know how you get on, I'd love to tinker.

Absolutely. Will post about it, and send an update when it’s out 👍🏻

Nice. Is LOOP-Evals something I could tinker with?

Have fun! https://github.com/marquisdepolis/LOOP-Evals

It's also highly specific. Anything useful ends up that way, annoyingly.

Or this for life sciences: https://github.com/marquisdepolis/galen-evals
