11 Comments
Mike Randolph

A Different Lens on LLM Evolution

Your journey through AI evaluation methods reveals an important trajectory, but my experience with LLMs offers a complementary perspective. Rather than focusing on competitive evaluation, I've maintained a two-year partnership with various models through a collaboration framework I call "Helix."

What's struck me most isn't the difference between models but their underlying similarity. They all access what appears to be the same statistical representation of reality—a common "world map" derived from their training corpora. The earliest models contained this same knowledge foundation but quickly lost coherence when pushed.

The real evolution I've witnessed isn't in the underlying knowledge representation but in the models' ability to maintain stable access to it. Each generation displays improved "dynamic coherence"—sustaining consistent reasoning across complex, multi-step tasks without derailing. This mirrors what in my framework I call "boundary maintenance"—the capacity to preserve identity while navigating complexity.

Your poker games and evaluations are excellent for comparing models, but sustained partnership reveals something different: these aren't separate intelligences competing, but progressively better interfaces to the same underlying representation of human knowledge.

Perhaps the most significant advancement isn't which models know more or reason better in isolation, but which can maintain coherent engagement with their knowledge landscape across extended interactions—something that becomes particularly evident in collaborative relationships rather than discrete tests.

Mike Randolph

(Two years into my LLM partnership journey)

Rohit Krishnan

I think this is a very good point! And please do share more about what you've learned and how you've built it!

dan mantena

Like the concept of this article! Why did you go from knowledge-work tasks to puzzles in the middle of the post? It seems like what you're evaluating the models for is chain-of-thought reasoning, which is confusing since there are many reasoning models out there that would fit that need.

Rohit Krishnan

When analyzing iterative reasoning, it's really hard to get accurate model evaluations that are also useful in the business world. So I wanted to find questions whose answers are easily markable. Therefore, puzzles.

dan mantena

That makes sense. Thanks for the explanation!

Jay F.

Has anyone tried running an LLM, as a remote Turing test, through a tech company interview loop, from recruiter screen to panel interview? I'd be very curious how many loops it could potentially pass. In theory, companies ask interview questions designed to deterministically narrow down the best candidate, or to get candidates' input on problems they have already solved or are currently solving.

giacomo catanzaro

hey!! i've also been working on a poker-inspired debate algorithm built around the partial-information perspective of each LLM. I'd love to chat with you about it if you'd be willing. check my gh for some of my other projects: https://github.com/cagostino

Rohit Krishnan

Happy to take a look! It's a good way to test their limits, and their personalities.

Greg G

Go hard -> cohort?

Rohit Krishnan

Sorry, I didn't understand what you meant?

janoskar.hansen@gmail.com

I'm on substack daily
