How do you compete in providing intelligence?
This week offered a masterclass in figuring that out. OpenAI announced a new model, GPT-4o, which is better and faster than its flagship GPT-4-Turbo. It’s also natively multimodal: it can read and write text, images, audio and video.
(The “o” in 4o stands for “omni”, as in omnimodal, which I take to be a fancier way of saying multimodal.) It can:
see what’s going on around it and understand it
hold full conversations with essentially no latency
be interrupted mid-reply and keep the conversation natural
sound like Scarlett Johansson
translate live conversations across multiple languages
OpenAI’s demo was clearly timed to land just before Google I/O. The next day Google held its conference and demonstrated many of the same capabilities, and then some:
2 million token context window
spatial reasoning (it could remember where the person had left their glasses)
a fully multimodal model too, which can see and read and listen to you
Gemini Nano and Flash, smaller and faster models, with Nano able to run on-device
live scam detection during phone calls
integrated into their products, from Search to Google Cloud to Android to Messages and more
Sam Altman has talked about Universal Basic Compute, a way to provide their flagship model for free to all users with a minimal usage cap. They delivered on that with GPT-4o.
This was my prediction for OpenAI, and between them and Google, this is exactly what we got.
Google was already giving Gemini 1.5 Pro away for a limited time, Meta is doing the same with its own models, and both are integrating them into all their offerings. We’ve entered the Universal Basic Compute era, in some ways at least.
It also means a whole new war has started in AI: this is the year of productisation.
The bull case
The bull case is to look at GPT-4o as the first salvo in creating a fully multimodal model that understands the world. Why is this important? Because until now most advanced models were good in one or two domains (text models for text, video for video, image for image, sometimes crossing boundaries).
And it’s smart! It’s the same model as the “im-also-a-good-gpt2-chatbot” that quietly debuted on lmsys last week.
We have a fully native multimodal model which can work across domains, and it is incredibly fast! Its latency is below the threshold where conversation starts to feel laggy, which is why it feels like “Her”, the Scarlett Johansson movie where a digital assistant comes alive.
This is great, of course, because now we can have a real-time assistant which can not just listen and be interrupted, but also see our screen and look through our camera. It can listen, see, and respond. Same as Gemini, it just seems to work much better.
Which covers almost everything we talked about in the anything-to-anything machine. It doesn’t quite make songs the way Suno does, but it’s close. And it comes with exceptional emotional aptitude. We finally have something that looks close to the paradigm of Artificial General Intelligence everyone recognises from every book or movie they’ve ever seen.
This will also be the element that wows the entire world!
It will revolutionise many (many) domains!
Education just got so much easier. We can now have an infinitely patient tutor which can see us, see our screen, listen to us, and answer questions as they come up.
It means you can now find and nurture talent which might have gone unnoticed, or give help where needed.
If you are a productivity fiend, you can now have an assistant watching your screen and keeping you on track!
You can even create your own version of the Meta Ray-Ban glasses: just hook the model up so it sees what you’re seeing and answers whatever you ask…
You can genuinely have an agent which answers questions on your behalf. Robocallers, beware!
Siri can finally get good and actually perform tasks on your desktop or your phone.
And it’s quite likely that, seen this way, this release is just the first step. If we’ve managed to figure out how to train a fast, capable, useful model that already beats GPT-4-Turbo, that’s a big deal!
But it also means that we can probably now train a single model on a whole lot more data, across more modalities, than we could have imagined before. OpenAI didn’t even mention changes in context size or better embedding models, though of course those too most likely exist behind the scenes.
Which means GPT-5 will have these abilities, but will be trained on much more data and therefore reach a much higher bar of capability. It will be much smarter, and able to understand multiple modalities natively. And if the models also become more reliable across domains, which is likely to emerge as they learn from multiple modalities the way humans do, that would be a true step change.
We might finally have AI agents which can plan and execute more complex tasks, and take the first step towards AGI.
The bear case
We didn’t see anything new.
Native multimodality is great, though it’s not exactly new. Google showed the first version in December. It wasn’t brilliant, but it worked and was a clear proof of concept. Which means OpenAI was beaten to its core capability, even as it stood out in terms of engineering.
The ability to interrupt it mid-answer isn’t that revolutionary either. We can program that in! We could even make the model interrupt itself if it felt the answer wasn’t good enough.
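To make the “we can program that in” point concrete, here is a minimal, hypothetical sketch of interruption as plumbing around a streaming model. This is not OpenAI’s implementation; speak_with_interrupts, user_interrupted and looks_bad are illustrative stand-ins for a token stream, a barge-in detector, and a cheap quality check.

```python
from typing import Callable, Iterable

def speak_with_interrupts(
    tokens: Iterable[str],
    user_interrupted: Callable[[], bool],
    looks_bad: Callable[[str], bool],
) -> str:
    """Emit tokens until the user barges in or a quality check fails."""
    spoken = []
    for tok in tokens:
        if user_interrupted():          # barge-in: stop talking immediately
            break
        spoken.append(tok)
        if looks_bad("".join(spoken)):  # self-interrupt: bail out and rethink
            spoken.append("... actually, let me rethink that.")
            break
    return "".join(spoken)

# Toy usage with stubbed-out pieces.
words = (w + " " for w in "The capital of Australia is Sydney and".split())
print(speak_with_interrupts(
    words,
    user_interrupted=lambda: False,            # nobody interrupted this time
    looks_bad=lambda text: "Sydney" in text,   # stand-in for a real check
))
```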
What about the ability to translate in real time? We’ve had that too. Remember when Microsoft did that with Skype? GPT-4o just links that capability to the same model doing the text generation. What about videos or images? We’ve had those too, though not nearly as fast or as smooth.
Which means what we have is really good engineering, and a great product! It takes the existing pieces of innovation that were lying around and combines them into a smooth, unique experience unlike any that came before.
But it’s also not progress per se towards AGI. At best it’s a step along the way, in the sense that all technological progress is somewhat pointed towards that which comes next. We’re exploring what can be built with the models we already have, and making their capabilities a bit better.
The reasoning is somewhat improved, but hallucinations remain plentiful, perhaps even a bit more common than before
The models are better in some ways, definitely faster, and definitely cheaper
The added ability to recognise emotion is nice
Also, Google has demonstrated most of these abilities, as have Meta, Claude and a few others, with Pi covering the emotional side.
Sure, OpenAI combined existing capabilities seen in multiple places and destroyed many startups in the process, but is that what a research lab is meant to do? To scoop YC companies from 6 months ago?
A slightly better model that’s also faster (and cheaper) is great, but it still has all the same flaws as the old models.
Synthesis
It’s quite likely that even if this isn’t the first salvo towards GPT-5, with a truly extraordinary leap in some aspect (where’s Q*?), it still broadens the base of the pyramid enough that AI will disseminate even deeper into the public consciousness. We can also see this in the broadening of capabilities, as multiple models come close to providing the functionality you’d want.
We’re indisputably living in the age of AI.
Just like the cloud, or mobile, or the internet, or computers before that, AI in this form will seep into every corner of our lives. It will be great, and parts will be terrible, and we will create things that none of us have quite thought of yet. Who knows, maybe the Star Trek holodeck is what our kids will play in.
This also means it’s the era of AI getting productised. We saw that you can distill even more capability into a model than we’d thought, without compromising much on efficiency.
Claude for writing, some reasoning tasks, or sometimes coding; GPT for those if you prefer it, or if you want to create your own personalised GPTs; Gemini if you want it integrated into your Google workspace; Meta’s AI built into your glasses or WhatsApp.
While it seems true that OpenAI still has the lead in terms of the “best” model out there, that lead is no longer commanding. We’re within spitting distance of having multiple models for each use case and you’d have to work hard to a) choose the one you want, and b) work with it to get the output you want.
AIs are becoming true “fuzzy processors”. LLMs are anything-to-anything machines, but they’re incomplete. They take in an input and do their best to convert it to an output, which might be right or might be wrong. They need guidance, guardrails, error correction, and many layers of questioning and answering before they become reliable.
Just like we weren’t satisfied with XOR gates until we managed to get a million of them working in tandem, we won’t be satisfied until we can run LLMs in a loop millions of times so that they can collectively help us create intelligence.
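As a rough illustration of what “LLMs in a loop” with guardrails might look like, here is a minimal sketch: draft an answer, have a checker pass critique it, and revise until it passes or the budget runs out. The llm() function and the prompts are hypothetical placeholders, not any particular vendor’s API.

```python
# A minimal sketch of an LLM run in a loop with guardrails: draft, critique,
# revise. llm() is a hypothetical stand-in for any chat-completion call; the
# prompts are illustrative only.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to whichever model you use")

def answer_with_guardrails(task: str, max_rounds: int = 3) -> str:
    draft = llm(f"Answer the following task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(
            "Check this answer for factual errors or missed requirements. "
            "Reply 'OK' if it is fine, otherwise list the problems.\n\n"
            f"Task: {task}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("OK"):
            break                       # the checker is satisfied
        draft = llm(                    # error correction: revise using the critique
            f"Task: {task}\nPrevious answer: {draft}\n"
            f"Problems found: {critique}\nWrite an improved answer."
        )
    return draft
```

Each round here is several model calls, which is exactly why falling per-call prices and latency matter so much.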
So in some ways GPT-4o and the latest Gemini models are like the creation of mainframes. They are IBM’s System/360 for intelligence: a monumental achievement and incredibly useful, especially to those of us who have some time-shared access via digital punch cards.
The true change will come once we can enable large numbers of them to work together. And we’re getting glimpses of how they can do this across all the modalities that are important to us, whether that’s writing code, seeing something, listening to something, writing or reading something, or a mixture of all of these.
I’ve written before about the tasks that LLMs can’t do. The answer for many of those tasks was to use LLMs in a loop to tackle harder problems. We can’t do that if each LLM call costs $7, but when they cost $0.0007, a lot more will get unlocked.
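To put rough numbers on that (treating the figures above purely as illustrative per-call prices, not any provider’s actual price list):

```python
# The same million-call loop goes from absurd to affordable purely on price.
calls_in_loop = 1_000_000
for price_per_call in (7.0, 0.0007):
    print(f"${price_per_call} per call -> ${calls_in_loop * price_per_call:,.0f} total")
# $7.0 per call -> $7,000,000 total
# $0.0007 per call -> $700 total
```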
The thing about processors (the origin of Silicon Valley) was that they were hard to build, so only a few companies could, and consumers were pretty well abstracted away from the choice. I think the same will be true, to some extent, for the fuzzy processors. We had multiple PC models, but the same few chips. The “computers” we build with these fuzzy processors will look like agents, built to do our tasks.
These have one major difference, in that today they are consumer technologies, so distribution will win out, at least until there’s another paradigm shift. Even with a shift, there will be fast followers. Which is very bullish for existing Big Tech. Yes, even Google.
The era of productisation might seem a little boring at times (remember the cloud?) but it creates the baseline for truly extraordinary things to get built on top! After all, that’s how we got everything from PyTorch to transformers. Scientia potentia est.