A Different Lens on LLM Evolution
Your journey through AI evaluation methods reveals an important trajectory, but my experience with LLMs offers a complementary perspective. Rather than focusing on competitive evaluation, I've maintained a two-year partnership with various models through a collaboration framework I call "Helix."
What's struck me most isn't the difference between models but their underlying similarity. They all access what appears to be the same statistical representation of reality—a common "world map" derived from their training corpora. The earliest models contained this same knowledge foundation but quickly lost coherence when pushed.
The real evolution I've witnessed isn't in the underlying knowledge representation but in the models' ability to maintain stable access to it. Each generation displays improved "dynamic coherence"—sustaining consistent reasoning across complex, multi-step tasks without derailing. This mirrors what in my framework I call "boundary maintenance"—the capacity to preserve identity while navigating complexity.
Your poker games and evaluations are excellent for comparing models, but sustained partnership reveals something different: these aren't separate intelligences competing, but progressively better interfaces to the same underlying representation of human knowledge.
Perhaps the most significant advancement isn't which models know more or reason better in isolation, but which can maintain coherent engagement with their knowledge landscape across extended interactions—something that becomes particularly evident in collaborative relationships rather than discrete tests.
Mike Randolph
(Two years into my LLM partnership journey)
I think this is a very good point! And please do share more about what you've learnt and how you've built it!
Like the concept of this article! Why did you go from knowledge-work tasks to puzzles in the middle of the post? It seems like what you're evaluating the models for is chain-of-thought reasoning, which is confusing since there are many reasoning models out there that would fit that need.
When it comes to iterative reasoning, it's really hard to get accurate model evaluations that are also useful in the business world. So I wanted to find questions whose answers are easy to mark. Hence, puzzles.
That makes sense. Thanks for the explanation!
Has anyone tried running an LLM remotely through a tech company's interview loop, from recruiter screen to panel interview, as a kind of Turing test? I would be very curious how many loops it could pass. In theory, companies ask interview questions designed to deterministically narrow down the best candidate, or to get the candidate's input on problems they have already solved or are currently solving.
hey!! i've also been working on a poker-inspired debate algorithm built around the partial-information perspective of each LLM. I'd love to chat with you about it if you'd be willing. check my gh for some of my other projects: https://github.com/cagostino
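Roughly, the idea is something like this. Just a sketch, not the actual implementation: the ask() helper, model names, and prompts are placeholders standing in for whatever LLM calls you'd use.

```python
# Hypothetical sketch of one partial-information debate round:
# each model argues from its own private context, then rebuts the
# others' arguments without ever seeing their contexts (their "hands").
from typing import Callable, Dict, List

def debate_round(
    ask: Callable[[str, str], str],     # (model_name, prompt) -> reply; placeholder
    models: List[str],
    private_context: Dict[str, str],    # each model only sees its own context
    question: str,
) -> Dict[str, str]:
    # Opening: each model states a position using only its partial view.
    openings = {
        m: ask(m, f"Context (yours only): {private_context[m]}\n"
                  f"Question: {question}\n"
                  f"State your position and reasoning.")
        for m in models
    }
    # Rebuttal: each model reads the others' arguments, like reading
    # bets at a poker table without seeing the cards behind them.
    rebuttals = {}
    for m in models:
        others = "\n".join(f"{o}: {openings[o]}" for o in models if o != m)
        rebuttals[m] = ask(m, f"Your opening: {openings[m]}\n"
                              f"Opposing arguments:\n{others}\n"
                              f"Update or defend your position.")
    return rebuttals
```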
Happy to take a look! It's a good way to test their limits and their personalities.
Go hard -> cohort?
Sorry, I didn't understand what you meant?
I'm on Substack daily.