Great article!
I do take LSAT scores, Olympiad solving, etc. with a grain of salt due to data contamination. Can this be overcome by using an eval that was created after the cutoff date? For example, using Olympiad questions from 2024 instead of 2020.
Depends on what you're using it for. For a basic head-to-head comparison, yes, that works, but to actually choose one for production requires you to make much more specific evals.
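For instance, a post-cutoff head-to-head can be as small as the sketch below. The cutoff date, the JSONL question format (with "date", "prompt", and "answer" fields), and `ask_model` are all placeholder assumptions, not any particular API:

```python
# A minimal sketch of a post-cutoff head-to-head. `ask_model` is a stub:
# swap in whichever API client or local runtime you actually use.
import json
from datetime import date

CUTOFF = date(2023, 12, 31)  # latest training cutoff among the models compared


def ask_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a real API or local-model call.
    return ""


def post_cutoff_questions(path: str):
    # Keep only questions written after the cutoff, to limit contamination.
    with open(path) as f:
        for line in f:
            q = json.loads(line)
            if date.fromisoformat(q["date"]) > CUTOFF:
                yield q


def head_to_head(model_a: str, model_b: str, path: str) -> dict:
    scores = {model_a: 0, model_b: 0}
    for q in post_cutoff_questions(path):
        for model in scores:
            answer = ask_model(model, q["prompt"])
            # Naive containment grading; real olympiad answers need a stronger checker.
            if q["answer"].strip().lower() in answer.strip().lower():
                scores[model] += 1
    return scores
```

Something this crude only answers the head-to-head question; choosing a model for production still means writing evals against your own tasks.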
Great article! I think there is a lot of misunderstanding around the capacity of our current evals to actually "test" LLMs' reasoning abilities. This is a great resource to point people to for a high-level overview of the meta-problems.
Also, after reading the line "as someone who is definitely in the very highest percentile of LLM usage, I still can’t easily say which LLMs to use for particular use cases", it would be interesting to see which LLMs/stacks you use for various tasks.
Thanks! I use Perplexity and GPT-4 primarily, with a bunch of little things running on-device with e.g. Ollama.
Really interesting thoughts. I’m trying to build an eval for front-end-specific engineering problems right now, so this post is spot-on.
Have you seen any eval frameworks that let you run these multi-dialog/multi-step agent behaviours to reach an acceptable final outcome, and then collect such outcomes as a way to fine-tune and improve the model's pass@1 for future inference? I believe I saw true-lens, but it didn’t quite have the same level of complex evaluations in it.
Thanks! And that's very interesting. There isn't one afaik; everything good has to be handmade, which is partly why I ended up making loop evals to test. Please do let me know how you get on, I'd love to tinker.
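The hand-rolled version tends to have roughly this shape: run the multi-turn episode to completion, check whether the final outcome is acceptable, and keep passing transcripts as candidate fine-tuning data. In the sketch below, `agent_step`, `is_acceptable`, and the transcript format are placeholder stubs, not any existing framework's API:

```python
# Sketch of a hand-rolled multi-step agent eval: run the dialog loop, check the
# final outcome, and collect acceptable trajectories for later fine-tuning.
import json
from typing import Callable


def run_episode(task: str, agent_step: Callable[[list], dict], max_steps: int = 10) -> list:
    """Run one multi-turn episode; returns the full transcript of messages."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        message = agent_step(transcript)   # model/tool call goes here
        transcript.append(message)
        if message.get("done"):            # agent signals it has a final answer
            break
    return transcript


def collect_passing_trajectories(tasks: list[str],
                                 agent_step: Callable[[list], dict],
                                 is_acceptable: Callable[[list], bool],
                                 out_path: str) -> float:
    """Measure pass@1 and save acceptable transcripts as fine-tuning candidates."""
    passed = 0
    with open(out_path, "w") as f:
        for task in tasks:
            transcript = run_episode(task, agent_step)
            if is_acceptable(transcript):
                passed += 1
                f.write(json.dumps({"messages": transcript}) + "\n")
    return passed / len(tasks) if tasks else 0.0
```

The hard part is `is_acceptable`: for anything beyond toy tasks, the outcome checker ends up being as task-specific as the eval itself.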
Absolutely. Will post about it and send an update when it’s out 👍🏻
Nice. Is loop-eval something I could tinker with?
Have fun! https://github.com/marquisdepolis/LOOP-Evals
It's also highly specific. Anything useful ends up that way, annoyingly.
Or this for life sciences: https://github.com/marquisdepolis/galen-evals