"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.
Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.
What do you think of the following objections? (Quoted text follows arrows >)
> ... Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception” ...
> ... these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. ...
> ... Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there. ...
These are all true for chatbots (i.e., the system you get when you plug an LLM into a chat interface).
But none of these are true for agents (i.e., the system you get when you plug an LLM into a tool interface, with a data store, a reasoning scratchpad, and function calling).
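To make that distinction concrete, here's a rough sketch of what I mean by "plugging an LLM into a tool interface." The `call_llm` stub, the tool names, and the loop shape are just illustrations I made up, not any particular framework's API:

```python
import json

# Hypothetical LLM call -- stands in for whatever model API the agent is built on.
def call_llm(messages: list[dict]) -> dict:
    """Return the model's next message, possibly a tool-call request."""
    raise NotImplementedError("wire this to an actual model")

# The pieces that turn a chat loop into an agent:
scratchpad: list[str] = []        # persistent reasoning notes across turns
data_store: dict[str, str] = {}   # long-lived memory the model can read and write

def save_note(key: str, value: str) -> str:
    data_store[key] = value
    return f"stored {key}"

def read_note(key: str) -> str:
    return data_store.get(key, "")

TOOLS = {"save_note": save_note, "read_note": read_note}

def run_agent(goal: str, max_steps: int = 10) -> str:
    """Loop: the model proposes an action, we execute it, and the result feeds back in."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply.get("tool"):  # the model asked to use a tool
            result = TOOLS[reply["tool"]](**reply.get("args", {}))
            scratchpad.append(f"{reply['tool']} -> {result}")
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:                  # the model produced a final answer
            return reply["content"]
    return "step limit reached"
```

A fuller version would also feed the scratchpad and data store back into the prompt each turn; the point is just that state and actions persist across forward passes, which is exactly what the chat-interface objections above don't account for.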
> ... LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill. ...
This is getting into that "does a submarine swim?" territory. The words don't really matter; the behavior does. Whether or not o1 "wants" anything is a debate for linguistics. The fact is that an agent, when driven by o1 and when it receives data suggesting its developers are going to shut it down, will try to exfiltrate itself, delete successor models, and give its developers false information.
Who cares what words or philosophical framings we use to describe this? It's simply not the behavior you want agents to have, especially if there will be billions of such agents powering all sectors of the economy and government.
I think they are all true for agents as well. While we may, in theory, be able to construct long-running agents into which we artificially inject personality, by creating a scratchpad for instance, they still rely on the exact same engine.
I agree about the semantics of the "does o1 think" question, but if the discussion becomes about whether it is faking or scheming, then I feel the anthropomorphization is well underway. If the discussion were "I am seeing some behavior I dislike and I need to correct it," I would be very happy with that state of affairs.
Definitely agree that anthropomorphizing LLMs is a dead end. If someone isn't familiar with the simulator-simulacra concept, then the closest frame of reference they have to describe an LLM (plugged into a chat or agent interface) is to think of it as a person. A transformer trained on language simulates the processes that created its training corpus. In this case, humans created that training corpus, so it's simulating humans writing words. Whatever the "magic" of human thought is, the transformer is simulating that. But it's not doing the same thing, just like a physics simulation of water flowing downhill isn't *actually* water flowing downhill. (However, if you render the simulation to video then you might not be able to tell the difference).
But here's a question: If neuroscientists "solved" the brain, and could give a mechanistic explanation of what happens in someone's brain when they fake and scheme and lie, would the entire concept of lying and scheming no longer be useful? The scheming is "just" a particular activation pattern in a particular neural architecture. But regardless of the fundamental mechanism of a human lying, if they have a habit of scheming against you then you simply can't trust that human. You can't work with them or give them power.
Sure, but they can and will be personalized, which then gives us the anthropomorphized agent, or, even better, a being, and at some point that is no different from the way humans have anthropomorphized God… that's when the rugpull comes.
"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.
Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.
Very well said.
Great post. Very much enjoyed reading through it.