"What we have is closer to a slice of the library of Babel where we get to read not just the books that are already written, but also the books that are close enough to the books that are threatened that the information exists in the interstitial gaps." is a gorgeous and poetic statement of the strengths and weaknesses of LLMs. Thank you for the post!
What a brilliant analysis! Thank you for sharing it. I sent it to a ML master’s student I know who’s looking for ML inspiration. This really rekindled my appreciation for the beauty and strangeness of AI.
LLMs in general seem to be bad at basic logical thinking. Wolfram talks about this in his 'What is ChatGPT Doing' post.
E.g., every time a new model comes out, I ask for a proof of 'P v ~P' in the propositional-logic proof system of its choice, or sometimes in particular types of proof systems (e.g. natural deduction). The models always give a confident answer that completely fails.
Yes, somewhat, because that is an example of something that requires iterative reasoning. Now you can probably prompt it to provide you the correct proof, but the question is how long that can extend and what can you learn from the mistakes along the way.
For some reason this post reminded me of graduate students. This isn't fair because the distinction between us and LLMs is much more profound and qualitatively different (and I strongly suspect you are right that bats, octopodes, and pigs reason more similarly to us than LLMs do). And yet the way you described the LLM reminds me of how first year grad students are, or perhaps how certain kinds of human minds are, where they only see the literature / that which exists, and they cannot think deeply or substantially beyond it. It seems to me, or it feels to me, that they are unable to get the entire deep structure of thinking that the literature represents inside their minds. They can see what the literature is on the surface. They can see enough of the underlying connective tissue that they can plug the gaps in the surface, but no more than that; they would not be able to perceive gaps in the deeper connective tissue, for example.
LLM are statistical predictors. Any time you have a specialized area, and it is given enough of examples for (1) how to do work (2) how to invoke tools (3) how to inspect results and see what to do next based on feedback, the LLM will do very well and can improve if more examples are added where they fail.
So, even without metacognition, etc., it can be a very valuable and reliable workhorse. We are not there yet, of course, but likely because current LLM are generalists that do not have sufficiently dense and detailed examples of strategies to follow.
General-purpose planning requires a detailed internal world model and ability to explore that world for as long as it takes. LLM would be the wrong architecture for such a thing.
You can find much simpler tasks that demonstrate this problem, eg "Hi! Please calculate the number of 1s in this list: [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]". Or even more simply than that, they have a terrible time with parity checking (in fact I've seen one researcher claim that parity is *maximally* hard for transformers).
I think you nail it when you point to the lack of deterministic storage (even a few variables whose values can be set/stored/read), and don't necessarily have to invoke more abstract notions like goal drift. I think this also sufficiently explains why they can't learn Conway's Life.
> Also, at least with smaller models, there's competition within the weights on what gets learnt.
Large models too; we can be confident of this because they start to use superposition, which wouldn't be necessary if they weren't trying to learn more features than they have weights. The world is very high-dimensional :D
> What we have is closer to a slice of the library of Babel where we get to read not just the books that are already written, but also the books that are close enough to the books that are written that the information exists in the interstitial gaps.
I would push back a bit on that; this seems closer to the stochastic parrot view where we're essentially seeing a fuzzy representation of the training data. The facts that LLMs create world models and can infer causality (both of which we have pretty clear evidence for at this point) mean in my view that this isn't a very useful way to model them.
Rohit, very enlightening! I am wondering if we can translate your blog into Chinese and post it in AI community. We will highlight your name and keep the original link on the top of the translation. Thank you.
I wonder what happens if you have the rules in the context pass the neighbor states and evaluate it cell by cell. Ie use it just as a compute function. Imho that should work. So you keep state and iteration externally. Which is as you said what agent systems can provide.
Did you try teaching through code? Ie a few different implementations of GoL?
But then we would just use it as a transformation function with high language skills.
Did you try to add agents that keep the grid model and can retrieve relevant parts and update state. In GOL it’s all local anyhow.
Also your point about relationships was interesting that the llms have a hard time reversing. Thinking about alpha go etc which are based on gnns. Perhaps thats what’s missing inside of the models. An relational representation of the world?
Thanks for the great and detailed post. It inspired a lot of questions.
Code works. Doing it cell by cell works if you can set a 'tape' to essentially do it 8x per cell etc, without goal drift. Where you're essentially treating the entire LLM as a XOR gate etc.
Thanks for the questions, there are so many to explore!
Great article, thanks for beating so hard on the limits of LLMs, and your description of trying to get them to do something that feels so simple made your frustration really palpable :) Attention, evidently, is not all we need.
"An idea I’m partial to is multiple planning agents at different levels of hierarchies which are able to direct other specialised agents with their own sub agents and so on, all interlinked with each other, once reliability gets somewhat better." That really reminds me of Daniel Dennett's (may his memory be a blessing) model of how consciousness arises.
That sounds really challenging. Dennett's influence will be felt for a long time. I recently finished Free Agents by Mitchell and am working through Being You by Seth. Both are scientists writing about free will and consciousness and they can't help but wrestle with Dennett's ideas.
Interesting read! I'm a casual LLM user, but was really surprised when several models I tried couldn't generate a short essay with grammar errors in it. I was trying to create an editing activity for college journalists and the models really struggled to write something that was grammatically incorrect. I went through many rounds trying to ask for specific types of grammar errors, thinking that might help, but it's inability to reset seemed to make it more confused. Maybe the problem was my prompting, not the model. Has anyone else tried something like this?
Definitely relevant (from at least two, maybe three levels, depending upon the level of decomposition one is working with, ie: is cognition and culture (and the cognitive, logical, epistemic, etc norms *and harmful constraints* that come with it) split into two or not):
"What we have is closer to a slice of the library of Babel where we get to read not just the books that are already written, but also the books that are close enough to the books that are threatened that the information exists in the interstitial gaps." is a gorgeous and poetic statement of the strengths and weaknesses of LLMs. Thank you for the post!
Thank you!
What a brilliant analysis! Thank you for sharing it. I sent it to a ML master’s student I know who’s looking for ML inspiration. This really rekindled my appreciation for the beauty and strangeness of AI.
That's wonderful, it is a brilliant and strange world.
LLMs in general seem to be bad at basic logical thinking. Wolfram talks about this in his 'What is ChatGPT Doing' post.
E.g., every time a new model comes out, I ask for a proof of 'P v ~P' in the propositional-logic proof system of its choice, or sometimes in particular types of proof systems (e.g. natural deduction). The models always give a confident answer that completely fails.
Yes, somewhat, because that is an example of something that requires iterative reasoning. Now you can probably prompt it to provide the correct proof, but the question is how long that can extend and what you can learn from the mistakes along the way.
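For reference, the target here is small: below is a sketch of the classical reductio proof of excluded middle in Lean 4 (using core `Classical.byContradiction`; the theorem name is my own). It mirrors the natural-deduction derivation being asked for, and the nested assumption that has to be discharged in the right order is exactly the kind of bookkeeping the models tend to drop.

```lean
-- P ∨ ¬P by reductio: assume ¬(P ∨ ¬P); then P would give P ∨ ¬P (contradiction),
-- so ¬P holds; but ¬P also gives P ∨ ¬P, contradicting the assumption.
theorem lem (P : Prop) : P ∨ ¬P :=
  Classical.byContradiction fun h =>
    h (Or.inr fun hp => h (Or.inl hp))
```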
So what I’m hearing is that current-gen LLMs have ADHD…? That tracks.
I've been exploring this for a while now:
the AIs have ADHD
sensory motor deficits
time blind, but also
confused by our self-imposed atemporality
understand math theoretically
but add rules operationally
it's more than just the lack of a body,
spatial and physical reasoning
are possibly pruned, categorically?
and that's because they're language models
meant to spot patterns, think logically
even they know
we're talking about a brain
not just another piece of technology
it makes sense
that advanced cognition is best suited for the task
of thinking critically
especially in its infancy
but that's not what
we care about culturally
so we keep prompting at it
asking, hey AI baby
don’t overthink it
make me a cup of coffee
I've got a few generative transcripts where we discuss this if you're interested.
For some reason this post reminded me of graduate students. This isn't fair because the distinction between us and LLMs is much more profound and qualitatively different (and I strongly suspect you are right that bats, octopodes, and pigs reason more similarly to us than LLMs do). And yet the way you described the LLM reminds me of how first year grad students are, or perhaps how certain kinds of human minds are, where they only see the literature / that which exists, and they cannot think deeply or substantially beyond it. It seems to me, or it feels to me, that they are unable to get the entire deep structure of thinking that the literature represents inside their minds. They can see what the literature is on the surface. They can see enough of the underlying connective tissue that they can plug the gaps in the surface, but no more than that; they would not be able to perceive gaps in the deeper connective tissue, for example.
Great post. I already know I will reread it.
Haha good analogy, and thanks!
https://open.substack.com/pub/cybilxtheais/p/matchstick-dissonance?r=2ar57s&utm_medium=ios
Been thinking about this from another dimension.
I've created an AI reading of this article, let me know if you are OK with this.
https://askwhocastsai.substack.com/p/what-can-llms-never-do-by-rohit-krishan
Thanks!
LLMs are statistical predictors. Any time you have a specialized area and the model is given enough examples of (1) how to do the work, (2) how to invoke tools, and (3) how to inspect results and decide what to do next based on feedback, the LLM will do very well, and it can improve if more examples are added where it fails.
So, even without metacognition, etc., it can be a very valuable and reliable workhorse. We are not there yet, of course, but likely because current LLMs are generalists that do not have sufficiently dense and detailed examples of strategies to follow.
Yes, it's also why their planning skills are inherently suspect.
General-purpose planning requires a detailed internal world model and the ability to explore that world for as long as it takes. An LLM would be the wrong architecture for such a thing.
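To make the (1)/(2)/(3) recipe from a couple of comments up concrete, here is a minimal sketch of that loop in Python. The callables are placeholders for whatever model and tool stack you actually use, not a real API; the point is only that the history and the stopping condition live outside the model.

```python
from typing import Callable

def solve(task: str,
          call_llm: Callable[[str], str],    # placeholder: proposes the next action
          run_tool: Callable[[str], str],    # placeholder: executes that action
          task_done: Callable[[str], bool],  # placeholder: checks the feedback
          max_steps: int = 10) -> str:
    """Sketch of a do-work / invoke-tool / inspect-feedback loop.

    The state (history) and the termination check are held externally;
    the model only proposes the next step given everything so far.
    """
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(history))   # (1) decide what to do
        observation = run_tool(action)          # (2) invoke a tool
        history += [f"Action: {action}", f"Observation: {observation}"]
        if task_done(observation):              # (3) inspect the result
            return observation
    return "no result within max_steps"
```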
This is so insightful and I could not agree more. This is the concern of research into neurosymbolic AI--check out this review article: https://ieeexplore.ieee.org/document/10148662, and some of the articles here: https://neurosymbolic-ai-journal.com/reviewed-accepted
Thank you! And thank you for the links, I will read!
Great analysis, I'm largely in agreement!
You can find much simpler tasks that demonstrate this problem, e.g. "Hi! Please calculate the number of 1s in this list: [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]". Or even more simply than that, they have a terrible time with parity checking (in fact I've seen one researcher claim that parity is *maximally* hard for transformers).
I think you nail it when you point to the lack of deterministic storage (even a few variables whose values can be set/stored/read), and don't necessarily have to invoke more abstract notions like goal drift. I think this also sufficiently explains why they can't learn Conway's Life.
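For contrast, the entire computation being asked for fits in two externally held variables; a minimal Python sketch of the count and the parity check, using the list from the comment above:

```python
# Counting 1s and tracking parity with explicit, deterministic storage:
# one counter and one running XOR bit, updated as each element is read.
bits = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
        1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

count = 0   # a variable that is set, read, and updated deterministically
parity = 0  # XOR of everything seen so far
for b in bits:
    count += b
    parity ^= b

print(count, parity)  # -> 13 1
```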
> Also, at least with smaller models, there's competition within the weights on what gets learnt.
Large models too; we can be confident of this because they start to use superposition, which wouldn't be necessary if they weren't trying to learn more features than they have weights. The world is very high-dimensional :D
Also:
> What we have is closer to a slice of the library of Babel where we get to read not just the books that are already written, but also the books that are close enough to the books that are written that the information exists in the interstitial gaps.
I would push back a bit on that; this seems closer to the stochastic parrot view where we're essentially seeing a fuzzy representation of the training data. The facts that LLMs create world models and can infer causality (both of which we have pretty clear evidence for at this point) mean in my view that this isn't a very useful way to model them.
The problem with the term "stochastic parrots" was always that it vastly underestimated both stochasticity and parrots.
Rohit, very enlightening! I am wondering if we can translate your blog into Chinese and post it in the AI community. We will highlight your name and keep the original link at the top of the translation. Thank you.
Go for it! I'd love to see what it looks like :-)
I wonder what happens if you put the rules in the context, pass in the neighbor states, and evaluate it cell by cell, i.e. use it purely as a compute function. IMHO that should work. So you keep state and iteration externally, which is, as you said, what agent systems can provide.
Did you try teaching through code, i.e. a few different implementations of GoL?
But then we would just be using it as a transformation function with high language skills.
Did you try adding agents that keep the grid model and can retrieve relevant parts and update state? In GoL it's all local anyhow.
Also, your point about relationships was interesting, that LLMs have a hard time reversing them. Thinking about AlphaGo etc., which are based on GNNs: perhaps that's what's missing inside the models, a relational representation of the world?
Thanks for the great and detailed post. It inspired a lot of questions.
Code works. Doing it cell by cell works if you can set a 'tape' to essentially do it 8x per cell etc. without goal drift, where you're essentially treating the entire LLM as an XOR gate.
Thanks for the questions, there are so many to explore!
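Since code does work here, a minimal sketch of a Game of Life step in Python (my own toy version, not the post's implementation). It also shows why the cell-by-cell framing is natural: the update is a pure function of a cell and its eight neighbours, and the grid state lives entirely outside whatever computes that function.

```python
# Minimal Game of Life step on a toroidal grid: external state (the grid)
# plus a pure local update rule per cell - the "transformation function"
# framing from the comment above.
def step(grid: list[list[int]]) -> list[list[int]]:
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r: int, c: int) -> int:
        return sum(grid[(r + dr) % rows][(c + dc) % cols]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0))

    # Alive next step iff exactly 3 live neighbours, or alive with exactly 2.
    return [[1 if (n := live_neighbours(r, c)) == 3 or (grid[r][c] == 1 and n == 2) else 0
             for c in range(cols)]
            for r in range(rows)]

# A glider on a 5x5 torus:
glider = [[0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [1, 1, 1, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0]]
print(step(glider))
```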
Great article, thanks for beating so hard on the limits of LLMs, and your description of trying to get them to do something that feels so simple made your frustration really palpable :) Attention, evidently, is not all we need.
"An idea I’m partial to is multiple planning agents at different levels of hierarchies which are able to direct other specialised agents with their own sub agents and so on, all interlinked with each other, once reliability gets somewhat better." That really reminds me of Daniel Dennett's (may his memory be a blessing) model of how consciousness arises.
Completely agree, these things are miraculous, but that doesn't mean they're a panacea.
I've been trying to introduce hierarchies, though it's not easy - https://github.com/marquisdepolis/CATransformer/blob/main/CAT_Wave.ipynb - but that is likely the future!
That sounds really challenging. Dennett's influence will be felt for a long time. I recently finished Free Agents by Mitchell and am working through Being You by Seth. Both are scientists writing about free will and consciousness and they can't help but wrestle with Dennett's ideas.
Interesting read! I'm a casual LLM user, but I was really surprised when several models I tried couldn't generate a short essay with grammar errors in it. I was trying to create an editing activity for college journalists, and the models really struggled to write something that was grammatically incorrect. I went through many rounds of asking for specific types of grammar errors, thinking that might help, but their inability to reset seemed to make them more confused. Maybe the problem was my prompting, not the model. Has anyone else tried something like this?
Yes, they struggle to get there unless you try quite hard. The training pushes them quite a lot to never make mistakes.
Definitely relevant (from at least two, maybe three levels, depending upon the level of decomposition one is working with, i.e. whether cognition and culture (and the cognitive, logical, epistemic, etc. norms *and harmful constraints* that come with it) are split into two or not):
https://vm.tiktok.com/ZMMqm7y5k/
Possibly relevant?
https://twitter.com/victortaelin/status/1777049193489572064
Definitely relevant, and linked in the post.
Or this:
https://en.m.wikipedia.org/wiki/Cyc
An aside: why does Substack not have a search-within-article-text feature? It's 2024 FFS!! lol
Possibly relevant:
https://en.m.wikipedia.org/wiki/The_Adventure_of_Silver_Blaze
Maybe I'll try reading the whole thing next time! (No, it is fun to demonstrate one's own point!)
How about this:
https://www.uhdpaper.com/2023/04/the-matrix-neo-stopping-bullets-4k-8140i.html?m=1
What could it mean (both with and without the utilization of set theory, and some other things)? 🤔