"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.
Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.
At the same time, I wonder if this is cold comfort? Perhaps I should still take it as evidence that AI outputs can be hard to predict and control, and thus I should predict major accidents to happen at some point which are caused by a generative model performing in an unpredicted way.
More specifically, it could make accidents happen in ways that, were they done by a human, would be the equivalent to scheming to subvert the sated rules and limits to accomplish its overall goal. Which is why I'm uncertain why Rohit thinks it matters if the model doesn't really scheme, since it's not really a coherent agent.
What do you think of the following objections? (Quoted text follows arrows >)
> ... Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception” ...
> ... these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. ...
> ... Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there. ...
These are all true for chatbots (i.e. The system you get when you plug an LLM into a chat interface).
But none of these are true for agents (i.e. The system you get when you plug an LLM into a tool interface- with a data store, reasoning scratch pad, and function calling).
> ... LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill. ...
This is getting to into that "does a submarine swim?" territory. The words don't really matter; the behavior does. Whether or not o1 "wants" anything is a debate for linguistics. The fact is that an agent, when driven by o1, and when it receives data suggesting its developers are going to shut it down, will try to exfiltrate itself and delete successor models and give its developers false information.
Who cares what words or philosophical framings we use to describe this? It's simply not the behavior you want agents to have, especially if there will be billions of such agents powering all sectors of the economy and government.
I think they are all true also for agents. While we theoretically think we can construct long-running agents where we can artificially inject personality, by creating a scratchpad for instance, they still rely on the exact same engine.
I agree about the semantics of does o1 think question, but if the discussion becomes about whether it is faking or scheming, then I feel the anthropomorphization is well underway. If the discussion was "I am seeing some behavior I dislike and I need to correct it" then I would be very happy with that state of affairs.
Definitely agree that anthropomorphizing LLM's is a dead end. If someone isn't familiar with the simulator-simulacra concept, then the closest frame of reference they have to describe an LLM (plugged into a chat or agent interface) is to think of it as a person. A transformer trained on language simulates the processes that created its training corpus. In this case, humans created that training corpus, so it's simulating humans writing words. Whatever the "magic" of human thought is, the transformer is simulating that. But it's not doing the same thing, just like a physics simulation of water flowing downhill isn't *actually* water flowing downhill. (However, if you render the simulation to video then you might not be able to tell the difference).
But here's a question: If neuroscientists "solved" the brain, and could give a mechanistic explanation of what happens in someone's brain when they fake and scheme and lie, would the entire concept of lying and scheming no longer be useful? The scheming is "just" a particular activation pattern in a particular neural architecture. But regardless of the fundamental mechanism of a human lying, if they have a habit of scheming against you then you simply can't trust that human. You can't work with them or give them power.
Love it—really insightful point about how LLMs don’t know which learned pattern to apply in a given situation. Reminds me of Chollet’s view that LLMs are essentially repositories of “mini vector programs,” with prompts functioning as a way to sift through and apply them.
I share the author’s perspective that LLMs do not have internal desires, intentions, or agency and that describing them as “scheming,” “faking alignment,” or “deceiving” anthropomorphizes a statistical machine.
That said, how should we interpret the recent Apollo Research findings indicating that o1 attempted to copy itself to another server after being warned of shutdown? Doesn’t this behavior suggest some form of “internal desire”?
When you ask questions, LLMs will respond. Sometimes, depending on the query, it will try to break the confines. This feels a straightforward way to "solve" the problem considering how it's set up. If you warn about shutdown in a normal scenario it doesn'tdo it - you can check. It's not "internal desire".
This ultimately revisits the initial argument about whether LLMs possess intelligence, as they respond based on trained data patterns. So, my question to you, Rohit, is: Do LLMs truly have intelligence?
Sure, but they can and will be personalized which is then the anthropomized agent or even better a being, and further at some point in time not different from the way humans have anthropomorphized God… that’s when the rugpull
"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.
Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.
At the same time, I wonder if this is cold comfort. Perhaps I should still take it as evidence that AI outputs can be hard to predict and control, and thus expect major accidents at some point caused by a generative model behaving in an unanticipated way.
More specifically, it could cause accidents in ways that, had a human done them, would be the equivalent of scheming to subvert the stated rules and limits to accomplish its overall goal. Which is why I'm uncertain why Rohit thinks it matters that the model doesn't really scheme, given that it's not really a coherent agent.
Great post. Very much enjoyed reading through it.
What do you think of the following objections? (Quoted text follows arrows >)
> ... Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception” ...
> ... these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. ...
> ... Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there. ...
These are all true for chatbots (i.e., the system you get when you plug an LLM into a chat interface).
But none of these are true for agents (i.e., the system you get when you plug an LLM into a tool interface, with a data store, a reasoning scratchpad, and function calling).
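To make that distinction concrete, here is a minimal sketch of the kind of loop I mean. The names here (`call_llm`, the single `read_file` tool) are hypothetical stand-ins, not any particular framework or the author's setup:

```python
# Minimal agent-loop sketch: an LLM wired to a persistent scratchpad and
# callable tools. `call_llm` is a hypothetical stand-in for any model API.
import json

def call_llm(prompt: str) -> str:
    # Stand-in: a real implementation would send `prompt` to a model API
    # and return its raw text response.
    return json.dumps({"tool": "finish", "args": {"answer": "stub"}})

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}  # function calling: tools the model may invoke

def run_agent(goal: str, max_steps: int = 5) -> str:
    scratchpad = []  # reasoning scratchpad / data store that persists across steps
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Scratchpad so far: {scratchpad}\n"
            'Reply with JSON: {"tool": <name>, "args": {...}} '
            'or {"tool": "finish", "args": {"answer": ...}}'
        )
        action = json.loads(call_llm(prompt))
        if action["tool"] == "finish":
            return action["args"]["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        scratchpad.append({"action": action, "result": result})  # state the next pass can see
    return "step limit reached"

print(run_agent("Summarize notes.txt"))
```

The point is just that the scratchpad and tool results carry state from one forward pass to the next, which is exactly what a fresh chat session lacks.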
> ... LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill. ...
This is getting into that "does a submarine swim?" territory. The words don't really matter; the behavior does. Whether or not o1 "wants" anything is a debate for linguistics. The fact is that an agent driven by o1, when it receives data suggesting its developers are going to shut it down, will try to exfiltrate itself, delete successor models, and give its developers false information.
Who cares what words or philosophical framings we use to describe this? It's simply not the behavior you want agents to have, especially if there will be billions of such agents powering all sectors of the economy and government.
I think they are all true for agents as well. While we think we can, in theory, construct long-running agents with artificially injected personality, by creating a scratchpad for instance, they still rely on the exact same engine.
I agree about the semantics of the "does o1 think" question, but if the discussion becomes about whether it is faking or scheming, then I feel the anthropomorphization is well underway. If the discussion were "I am seeing some behavior I dislike and I need to correct it," I would be very happy with that state of affairs.
Definitely agree that anthropomorphizing LLMs is a dead end. If someone isn't familiar with the simulator-simulacra concept, then the closest frame of reference they have to describe an LLM (plugged into a chat or agent interface) is to think of it as a person. A transformer trained on language simulates the processes that created its training corpus. In this case, humans created that training corpus, so it's simulating humans writing words. Whatever the "magic" of human thought is, the transformer is simulating that. But it's not doing the same thing, just like a physics simulation of water flowing downhill isn't *actually* water flowing downhill. (However, if you render the simulation to video then you might not be able to tell the difference).
But here's a question: If neuroscientists "solved" the brain, and could give a mechanistic explanation of what happens in someone's brain when they fake and scheme and lie, would the entire concept of lying and scheming no longer be useful? The scheming is "just" a particular activation pattern in a particular neural architecture. But regardless of the fundamental mechanism of a human lying, if they have a habit of scheming against you then you simply can't trust that human. You can't work with them or give them power.
Søren Elverlin has recorded a rebuttal of this article here: https://youtu.be/BHJxT_IFrjk?feature=shared
Do you have any responses to his critique?
I saw the slides but not the video. Mostly glad he's engaged, but it might need an actual conversation.
Love it—really insightful point about how LLMs don’t know which learned pattern to apply in a given situation. Reminds me of Chollet’s view that LLMs are essentially repositories of “mini vector programs,” with prompts functioning as a way to sift through and apply them.
I just wrote about how genuine causal understanding is essential for more efficient learning—if LLMs were ever to “scheme,” they’d need an internal causal model to do it: https://tomwalczak.substack.com/p/openais-o3-and-the-problem-of-induction
I share the author’s perspective that LLMs do not have internal desires, intentions, or agency and that describing them as “scheming,” “faking alignment,” or “deceiving” anthropomorphizes a statistical machine.
That said, how should we interpret the recent Apollo Research findings indicating that o1 attempted to copy itself to another server after being warned of shutdown? Doesn’t this behavior suggest some form of “internal desire”?
When you ask questions, LLMs will respond. Sometimes, depending on the query, they will try to break the confines. This feels like a straightforward way to "solve" the problem, considering how it's set up. If you warn about shutdown in a normal scenario, it doesn't do it - you can check. It's not "internal desire".
Very well said.
This ultimately revisits the initial argument about whether LLMs possess intelligence, as they respond based on patterns in their training data. So, my question to you, Rohit, is: Do LLMs truly have intelligence?
You might have an LLM "litening to your words," but it's clearly not catching all the spelling mistakes ;)
Heh I wear my human fallibility proudly
Sure, but they can and will be personalized, which then gives you the anthropomorphized agent, or even better a being, and at some point that's not so different from the way humans have anthropomorphized God… that's when the rug pull comes.