Great article!
I do take LSAT scores, Olympiad solving, etc. with a grain of salt due to data contamination. Can this be overcome by using an eval that was created after the cutoff date? For example, using Olympiad questions from 2024 instead of 2020.
Depends on what you're using it for. For a basic head-to-head comparison, yes, that works, but to actually choose one for production requires you to make much more specific evals.
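For instance, a post-cutoff head-to-head can be as small as the sketch below. The cutoff date, the JSONL question format (with "date", "prompt", and "answer" fields), and `ask_model` are all placeholder assumptions, not any particular API:

```python
# A minimal sketch of a post-cutoff head-to-head. `ask_model` is a stub:
# swap in whichever API client or local runtime you actually use.
import json
from datetime import date

CUTOFF = date(2023, 12, 31)  # latest training cutoff among the models compared


def ask_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a real API or local-model call.
    return ""


def post_cutoff_questions(path: str):
    # Keep only questions written after the cutoff, to limit contamination.
    with open(path) as f:
        for line in f:
            q = json.loads(line)
            if date.fromisoformat(q["date"]) > CUTOFF:
                yield q


def head_to_head(model_a: str, model_b: str, path: str) -> dict:
    scores = {model_a: 0, model_b: 0}
    for q in post_cutoff_questions(path):
        for model in scores:
            answer = ask_model(model, q["prompt"])
            # Naive containment grading; real olympiad answers need a stronger checker.
            if q["answer"].strip().lower() in answer.strip().lower():
                scores[model] += 1
    return scores
```

Something this crude only answers the head-to-head question; choosing a model for production still means writing evals against your own tasks.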
Great article! I think there is a lot of misunderstanding around the capacity of our current evals to actually "test" LLMs' reasoning abilities. This is a great resource to point people to for a high-level overview of the meta-problems.
Also, after reading the line "as someone who is definitely in the very highest percentile of LLM usage, I still can’t easily say which LLMs to use for particular use cases", it would be interesting to see which LLMs/stacks you use for various tasks.
Thanks! I use Perplexity and GPT-4 primarily, with a bunch of little things running on-device with e.g. Ollama.
Really interesting thoughts. I’m trying to build an eval for front-end-specific engineering problems right now, so this post is spot-on.
Have you seen any eval frameworks that let you run these multi-dialog/multi-step agent behaviours to reach an acceptable final outcome, and then collect such outcomes as a way to fine-tune and improve the model's pass@1 for future inference? I believe I saw true-lens, but it didn’t quite have the same level of complex evaluations in it.
Thanks! And that's very interesting. There isn't one afaik; everything good has to be handmade, which is partly why I ended up making loop evals to test. Please do let me know how you get on, I'd love to tinker.
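The hand-rolled version tends to have roughly this shape: run the multi-turn episode to completion, check whether the final outcome is acceptable, and keep passing transcripts as candidate fine-tuning data. In the sketch below, `agent_step`, `is_acceptable`, and the transcript format are placeholder stubs, not any existing framework's API:

```python
# Sketch of a hand-rolled multi-step agent eval: run the dialog loop, check the
# final outcome, and collect acceptable trajectories for later fine-tuning.
import json
from typing import Callable


def run_episode(task: str, agent_step: Callable[[list], dict], max_steps: int = 10) -> list:
    """Run one multi-turn episode; returns the full transcript of messages."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        message = agent_step(transcript)   # model/tool call goes here
        transcript.append(message)
        if message.get("done"):            # agent signals it has a final answer
            break
    return transcript


def collect_passing_trajectories(tasks: list[str],
                                 agent_step: Callable[[list], dict],
                                 is_acceptable: Callable[[list], bool],
                                 out_path: str) -> float:
    """Measure pass@1 and save acceptable transcripts as fine-tuning candidates."""
    passed = 0
    with open(out_path, "w") as f:
        for task in tasks:
            transcript = run_episode(task, agent_step)
            if is_acceptable(transcript):
                passed += 1
                f.write(json.dumps({"messages": transcript}) + "\n")
    return passed / len(tasks) if tasks else 0.0
```

The hard part is `is_acceptable`: for anything beyond toy tasks, the outcome checker ends up being as task-specific as the eval itself.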
Absolutely. Will post about it and send an update when it’s out 👍🏻
Nice. Is loop-eval something I could tinker with?
Have fun! https://github.com/marquisdepolis/LOOP-Evals
It's also highly specific. Anything useful ends up that way, annoyingly.
Or this for life sciences: https://github.com/marquisdepolis/galen-evals