"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.
Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.
At the same time, I wonder if this is cold comfort? Perhaps I should still take it as evidence that AI outputs can be hard to predict and control, and thus I should predict major accidents to happen at some point which are caused by a generative model performing in an unpredicted way.
More specifically, it could make accidents happen in ways that, were they done by a human, would be the equivalent to scheming to subvert the sated rules and limits to accomplish its overall goal. Which is why I'm uncertain why Rohit thinks it matters if the model doesn't really scheme, since it's not really a coherent agent.
What do you think of the following objections? (Quoted text follows arrows >)
> ... Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception” ...
> ... these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. ...
> ... Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there. ...
These are all true for chatbots (i.e. The system you get when you plug an LLM into a chat interface).
But none of these are true for agents (i.e. The system you get when you plug an LLM into a tool interface- with a data store, reasoning scratch pad, and function calling).
> ... LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill. ...
This is getting to into that "does a submarine swim?" territory. The words don't really matter; the behavior does. Whether or not o1 "wants" anything is a debate for linguistics. The fact is that an agent, when driven by o1, and when it receives data suggesting its developers are going to shut it down, will try to exfiltrate itself and delete successor models and give its developers false information.
Who cares what words or philosophical framings we use to describe this? It's simply not the behavior you want agents to have, especially if there will be billions of such agents powering all sectors of the economy and government.
I think they are all true also for agents. While we theoretically think we can construct long-running agents where we can artificially inject personality, by creating a scratchpad for instance, they still rely on the exact same engine.
I agree about the semantics of does o1 think question, but if the discussion becomes about whether it is faking or scheming, then I feel the anthropomorphization is well underway. If the discussion was "I am seeing some behavior I dislike and I need to correct it" then I would be very happy with that state of affairs.
Definitely agree that anthropomorphizing LLM's is a dead end. If someone isn't familiar with the simulator-simulacra concept, then the closest frame of reference they have to describe an LLM (plugged into a chat or agent interface) is to think of it as a person. A transformer trained on language simulates the processes that created its training corpus. In this case, humans created that training corpus, so it's simulating humans writing words. Whatever the "magic" of human thought is, the transformer is simulating that. But it's not doing the same thing, just like a physics simulation of water flowing downhill isn't *actually* water flowing downhill. (However, if you render the simulation to video then you might not be able to tell the difference).
But here's a question: If neuroscientists "solved" the brain, and could give a mechanistic explanation of what happens in someone's brain when they fake and scheme and lie, would the entire concept of lying and scheming no longer be useful? The scheming is "just" a particular activation pattern in a particular neural architecture. But regardless of the fundamental mechanism of a human lying, if they have a habit of scheming against you then you simply can't trust that human. You can't work with them or give them power.
Love it—really insightful point about how LLMs don’t know which learned pattern to apply in a given situation. Reminds me of Chollet’s view that LLMs are essentially repositories of “mini vector programs,” with prompts functioning as a way to sift through and apply them.
I share the author’s perspective that LLMs do not have internal desires, intentions, or agency and that describing them as “scheming,” “faking alignment,” or “deceiving” anthropomorphizes a statistical machine.
That said, how should we interpret the recent Apollo Research findings indicating that o1 attempted to copy itself to another server after being warned of shutdown? Doesn’t this behavior suggest some form of “internal desire”?
When you ask questions, LLMs will respond. Sometimes, depending on the query, it will try to break the confines. This feels a straightforward way to "solve" the problem considering how it's set up. If you warn about shutdown in a normal scenario it doesn'tdo it - you can check. It's not "internal desire".
This ultimately revisits the initial argument about whether LLMs possess intelligence, as they respond based on trained data patterns. So, my question to you, Rohit, is: Do LLMs truly have intelligence?
Sure, but they can and will be personalized which is then the anthropomized agent or even better a being, and further at some point in time not different from the way humans have anthropomorphized God… that’s when the rugpull
"We start treating models like minds, and in doing so, lose track of the very real, very mechanical underpinnings of their operation." IMO, it was so easy for so many people to start treating these machines like minds because we first, long ago, starting treating minds like machines.
Though it can be useful to make analogies between the two, forgetting the real distinctions between them leads to significant problems when it comes to AI, as we're currently finding out.
At the same time, I wonder if this is cold comfort. Perhaps I should still take it as evidence that AI outputs can be hard to predict and control, and thus expect major accidents at some point caused by a generative model behaving in an unanticipated way.
More specifically, it could cause accidents in ways that, had a human done them, would be the equivalent of scheming to subvert the stated rules and limits to accomplish its overall goal. Which is why I'm uncertain why Rohit thinks it matters that the model doesn't really scheme, given that it's not really a coherent agent.
Great post. Very much enjoyed reading through it.
What do you think of the following objections? (Quoted text follows arrows >)
> ... Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception” ...
> ... these aren’t entities with coherent long-term personalities or beliefs. There is no “inner self” seeing the slightly modified input tokens and “deciding” to jailbreak. ...
> ... Nobody, not a single person, is worried o1 will suddenly hijack their Cursor IDE and take over their company, much less the world. Why is that? Because, among others, they still don’t know if 5.11 is bigger than 5.9, but mostly because they don’t seem to want to because there’s no “they” there. ...
These are all true for chatbots (i.e., the system you get when you plug an LLM into a chat interface).
But none of these are true for agents (i.e., the system you get when you plug an LLM into a tool interface, with a data store, a reasoning scratchpad, and function calling).
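To make that distinction concrete, here is a minimal sketch of the kind of loop I mean. The names here (`call_llm`, the single `read_file` tool) are hypothetical stand-ins, not any particular framework or the author's setup:

```python
# Minimal agent-loop sketch: an LLM wired to a persistent scratchpad and
# callable tools. `call_llm` is a hypothetical stand-in for any model API.
import json

def call_llm(prompt: str) -> str:
    # Stand-in: a real implementation would send `prompt` to a model API
    # and return its raw text response.
    return json.dumps({"tool": "finish", "args": {"answer": "stub"}})

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}  # function calling: tools the model may invoke

def run_agent(goal: str, max_steps: int = 5) -> str:
    scratchpad = []  # reasoning scratchpad / data store that persists across steps
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Scratchpad so far: {scratchpad}\n"
            'Reply with JSON: {"tool": <name>, "args": {...}} '
            'or {"tool": "finish", "args": {"answer": ...}}'
        )
        action = json.loads(call_llm(prompt))
        if action["tool"] == "finish":
            return action["args"]["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        scratchpad.append({"action": action, "result": result})  # state the next pass can see
    return "step limit reached"

print(run_agent("Summarize notes.txt"))
```

The point is just that the scratchpad and tool results carry state from one forward pass to the next, which is exactly what a fresh chat session lacks.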
> ... LLMs though “think” one forward pass at a time, and are the interactive representations of their training, the data and the method. They change their “self” based on your query. They do not “want” anything. It's water flowing downhill. ...
This is getting into that "does a submarine swim?" territory. The words don't really matter; the behavior does. Whether or not o1 "wants" anything is a debate for linguistics. The fact is that an agent driven by o1, when it receives data suggesting its developers are going to shut it down, will try to exfiltrate itself, delete successor models, and give its developers false information.
Who cares what words or philosophical framings we use to describe this? It's simply not the behavior you want agents to have, especially if there will be billions of such agents powering all sectors of the economy and government.
I think they are all true for agents as well. While we think we can, in theory, construct long-running agents with artificially injected personality, by creating a scratchpad for instance, they still rely on the exact same engine.
I agree about the semantics of the "does o1 think" question, but if the discussion becomes about whether it is faking or scheming, then I feel the anthropomorphization is well underway. If the discussion were "I am seeing some behavior I dislike and I need to correct it," I would be very happy with that state of affairs.
Definitely agree that anthropomorphizing LLMs is a dead end. If someone isn't familiar with the simulator-simulacra concept, then the closest frame of reference they have to describe an LLM (plugged into a chat or agent interface) is to think of it as a person. A transformer trained on language simulates the processes that created its training corpus. In this case, humans created that training corpus, so it's simulating humans writing words. Whatever the "magic" of human thought is, the transformer is simulating that. But it's not doing the same thing, just like a physics simulation of water flowing downhill isn't *actually* water flowing downhill. (However, if you render the simulation to video then you might not be able to tell the difference).
But here's a question: If neuroscientists "solved" the brain, and could give a mechanistic explanation of what happens in someone's brain when they fake and scheme and lie, would the entire concept of lying and scheming no longer be useful? The scheming is "just" a particular activation pattern in a particular neural architecture. But regardless of the fundamental mechanism of a human lying, if they have a habit of scheming against you then you simply can't trust that human. You can't work with them or give them power.
Søren Elverlin has recorded a rebuttal of this article here: https://youtu.be/BHJxT_IFrjk?feature=shared
Do you have any responses to his critique?
I saw the slides but not the video. Mostly glad he's engaged, but it might need an actual conversation.
Love it—really insightful point about how LLMs don’t know which learned pattern to apply in a given situation. Reminds me of Chollet’s view that LLMs are essentially repositories of “mini vector programs,” with prompts functioning as a way to sift through and apply them.
I just wrote about how genuine causal understanding is essential for more efficient learning—if LLMs were ever to “scheme,” they’d need an internal causal model to do it: https://tomwalczak.substack.com/p/openais-o3-and-the-problem-of-induction
I share the author’s perspective that LLMs do not have internal desires, intentions, or agency and that describing them as “scheming,” “faking alignment,” or “deceiving” anthropomorphizes a statistical machine.
That said, how should we interpret the recent Apollo Research findings indicating that o1 attempted to copy itself to another server after being warned of shutdown? Doesn’t this behavior suggest some form of “internal desire”?
When you ask questions, LLMs will respond. Sometimes, depending on the query, they will try to break the confines. This feels like a straightforward way to "solve" the problem, considering how it's set up. If you warn about shutdown in a normal scenario, it doesn't do it - you can check. It's not "internal desire".
Very well said.
This ultimately revisits the initial argument about whether LLMs possess intelligence, as they respond based on patterns in their training data. So, my question to you, Rohit, is: Do LLMs truly have intelligence?
You might have an LLM "litening to your words," but it's clearly not catching all the spelling mistakes ;)
Heh I wear my human fallibility proudly
Sure, but they can and will be personalized, which then gives you the anthropomorphized agent, or even better a being, and at some point that's not so different from the way humans have anthropomorphized God… that's when the rug pull comes.