9 Comments
Gautham Srinivas:

Great article!

I do take LSAT scores, Olympiad solving, etc. with a grain of salt due to data contamination. Can this be overcome by using an eval created after the training cutoff date? For example, using Olympiad questions from 2024 instead of 2020.

Rohit Krishnan:

Depends on what you're using it for. For a basic head-to-head comparison, yes, that works, but actually choosing one for production requires much more specific evals.

Valentin Baltadzhiev:

Great article! I think there is a lot of misunderstanding around the capacity of our current evals to actually "test" LLMs' reasoning abilities. This is a great resource to point people to for a high-level overview of the meta-problems.

Also, after reading this line, "as someone who is definitely in the very highest percentile of LLM usage, I still can’t easily say which LLMs to use for particular use cases", it would be interesting to see which LLMs/stacks you use for various tasks.

Rohit Krishnan:

Thanks! I use Perplexity and GPT-4 primarily, with a bunch of little things running on-device with, e.g., Ollama.

Kshitij Banerjee:

Really interesting thoughts. I’m trying to build an eval for front-end-specific engineering problems right now, so this post is spot-on.

Have you seen any eval frameworks that let you run these multi-dialog/multi-step agent behaviours to reach an acceptable final outcome, and then collect such outcomes as a way to fine-tune and improve the model's pass@1 for future inference? I believe I saw TruLens, but it didn’t quite have the same level of complex evaluations in it.

Rohit Krishnan:

Thanks! And that's very interesting. There isn't one, as far as I know; everything good has to be handmade, which is partly why I ended up making LOOP Evals to test with. Please do let me know how you get on; I'd love to tinker.
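[Editor's note: a minimal sketch of what such a handmade multi-step eval loop might look like. All names here are hypothetical illustrations, not the actual LOOP-Evals code; the stub stands in for a real LLM call.]

```python
def stub_model(prompt):
    # Stand-in for a real LLM API call; it only gets the answer
    # right once it is told its first attempt was wrong.
    return "42" if "was wrong" in prompt else "41"

def run_multi_step_eval(model, task, check, max_turns=3):
    """Run a multi-turn dialog until `check` accepts the final outcome.

    Returns (passed, transcript); passing transcripts could later be
    collected into a fine-tuning set to improve pass@1.
    """
    transcript = []
    prompt = task
    for _ in range(max_turns):
        answer = model(prompt)
        transcript.append((prompt, answer))
        if check(answer):
            return True, transcript
        # Feed the failure back as the next turn's prompt.
        prompt = f"{task}\nYour answer {answer!r} was wrong; try again."
    return False, transcript

passed, transcript = run_multi_step_eval(
    stub_model,
    task="What is 6 * 7?",
    check=lambda a: a.strip() == "42",
)
```

Here the stub model fails on turn one and recovers on turn two, so the run passes with a two-turn transcript; a real harness would swap in an actual model call and a task-specific checker.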

Kshitij Banerjee:

Absolutely. I'll post about it and send an update when it’s out 👍🏻

Kshitij Banerjee:

Nice. Is LOOP-Evals something I could tinker with?

Rohit Krishnan:

Have fun! https://github.com/marquisdepolis/LOOP-Evals

It's also highly specific. Anything useful ends up that way, annoyingly.

Or this one for life sciences: https://github.com/marquisdepolis/galen-evals
