Toddler vs AI
A comparison between a three year old with a dinosaur obsession and a Generative Pre-trained Transformer
I
Consider a few example sentences:
There were happy happy shepherds, he's a baby angel wildebeest
Jingle bells, jingle bell, jingle all the warthog
1 Red, 2 Green, 3 Yellow, 4 Blue, 5 Red, 7 Green, 11 Green
The pteranodon and the parasauralophus also share their relationship with their pteranodon but the parasauralophus did not have wings like the other two
The Irish Elk is a beautiful animal, the elk is a most beautiful animal. It is similar to the moose, but has a more larger head, and is stronger
Twinkle twinkle little star looking like a baby
Can you tell which ones were said by my three year old vs a text generation AI derived from GPT-2 that I've been playing with?
Whether you're on Team AI or Team Toddler, two things are for sure. 1) They sound eerily similar, and 2) None of them sound like what an adult would say.
Answers - 1, 4 and 6 are AI. 2 is toddler. 3 and 5 are both!
There has been an extraordinary amount of adulation for GPT-2 and GPT-3. I'd go so far as to say that, even so, they're still underrated. There have also been a fair few negative notes about their limitations.
But those arguments don't actually argue against GPT-3; they argue against treating it as the be-all and end-all of AI. Which is completely fair, and also fighting a weird strawman.
What does it do? From a Forbes article:
GPT-3 can create anything that has a language structure – which means it can answer questions, write essays, summarize long texts, translate languages, take memos, and even create computer code.
In terms of where it fits within the general categories of AI applications, GPT-3 is a language prediction model. This means that it is an algorithmic structure designed to take one piece of language (an input) and transform it into what it predicts is the most useful following piece of language for the user.
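To make that concrete, here's a minimal sketch of what "predicting the most useful following piece of language" looks like with the publicly available GPT-2 weights and the Hugging Face transformers library. The prompt and setup are purely illustrative, not my actual experiment:

```python
# A minimal sketch of next-word prediction with stock GPT-2 (assumes the transformers
# and torch packages are installed); the prompt is just an example.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The parasaurolophus had a crest on its"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # a score for every possible next token

probs = torch.softmax(logits[0, -1], dim=-1)   # probabilities for the next word only
top = torch.topk(probs, 5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item()):>10}  {p.item():.2%}")
```

The model has no opinion about crests; it just keeps picking a plausible next token, over and over, until you tell it to stop.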
That's why comparing it to my three year old is so much fun to me.
If I use the same types of input data, they come out with similar sounding snippets.
I could of course have used better training data to make the sentences sound more polished. To keep the comparison fair, I used nursery rhymes, stories and Wikipedia articles on prehistoric animals, dinosaurs and today's wild animals, the same things my son is obsessed with.
They both produce relatively similar outputs on relatively similar topics. As far as I know, my son hasn't read 8 million documents scraped from the web like GPT-2, or the 45 TB of text sourced from across the internet like GPT-3. The amazing feats that GPTs perform come from having terabytes of data versus, say, the 100 MB that SwiftKey uses to predict the next word you type on your smartphone keyboard.
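For the curious, the plumbing for that kind of small fine-tune isn't elaborate. Here's a rough sketch of how it might be wired up with the Hugging Face transformers library; the corpus file name, output directory and hyperparameters are placeholders, not my actual setup:

```python
# A hedged sketch of fine-tuning GPT-2 on a small, toddler-themed corpus (nursery rhymes,
# stories, Wikipedia dinosaur articles). "toddler_corpus.txt" and all hyperparameters
# below are illustrative placeholders.
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, TextDataset, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="toddler_corpus.txt",  # one big text file of the corpus
                            block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-toddler",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("gpt2-toddler")
```

A few hundred kilobytes of nursery rhymes doesn't teach the model language from scratch; the fine-tune mostly nudges the existing model towards the vocabulary and cadence of that tiny corpus.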
What is incredible about both is that they have somehow internalised the insanities of English grammar. While my son still says "drinked" or "catched", flying in the face of irregular verbs, his sentence structures are coherent and cogent. The same goes for the language model.
II
[Sidenote: Here is where I start demonstrating my ignorance about state of the art AI development.]
What is it that this overrated yet underrated, amazing yet flaw-ridden technology is giving us?
It has somehow structured the corpus of language that we use, latched on to concepts that sound familiar to us, and produced sentences and paragraphs that are eerily familiar.
The eerie part isn't an accident. It's a rather clear-cut case of the uncanny valley, especially as the documents get slightly longer or more complex. But the fact that the algorithms can't solve everything today isn't a flaw; it's step 1 of n steps.
For instance, while it can write funny sentences about parasaurolophuses, it doesn't know that the crest on their heads looks "funny" to us, or that they used to trumpet (probably) like an elephant. Or that trumpeting is a sound, similar to honking but different from roaring.
When a paragraph gets created, the software may well reflect, in the internal connections of its global data model, the implicit associations I made above. Which it very much does. That is how it actually "understands" that a parasaurolophus has a crest in the first place, that the sound it makes is a key interesting characteristic, that the sound is (probably) similar to an elephant's, and that this sound is similar to other sounds mentioned in the compendium of all human knowledge that is Wikipedia.
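One crude way to peek at those internal associations, a sketch rather than a proper probe, is to compare GPT-2's learned token embeddings directly. The word pairs below are just the ones from my example; averaging sub-word embeddings is a rough proxy, nothing more:

```python
# A crude sketch: compare GPT-2's input token embeddings to see which words the model
# places close together. Averaging a word's sub-word embeddings is a rough proxy only.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

def embed(word):
    with torch.no_grad():
        ids = tokenizer(" " + word, return_tensors="pt")["input_ids"][0]
        return model.wte.weight[ids].mean(dim=0)   # average over the word's sub-word pieces

for a, b in [("trumpeting", "honking"), ("trumpeting", "roaring"), ("trumpeting", "chair")]:
    sim = torch.cosine_similarity(embed(a), embed(b), dim=0)
    print(f"{a} vs {b}: cosine similarity {sim.item():.3f}")
```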
What it doesn't have, though, and what the three year old does have, is multiple systems that give him those facts independently. He knows his parasaurolophus because it's one of his toys, which looks orange and kind of funny (though the one in the museum doesn't); because he once saw a video of it trumpeting; because it's kept close to a whole bunch of other dinosaurs at home, and near the other hadrosaurs in the museum; and because we've had several deep and meandering conversations about its crest.
As AI researcher Geoffrey Hinton said:
Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.
The final output might look very similar to what the neural net in GPT produces, but the process of getting there involves a much larger number of models that all share a very similar taxonomy. The mental models are modular. Once you know what animals are, you go deeper into what mammals are, then carnivores and herbivores and omnivores, then weird-looking egg-laying mammals, and so on down the stack. It provides a complex, modular tapestry that other, newer concepts can hang on to.
Not everything is explained from the very start; things are explained by their relationships to each other. While the initial training of GPT and its ilk creates those modular representations somewhat implicitly, it's the explicit representations in legible categories that make the difference.
And it's the fact that there's the same taxonomy that's shared by all of us, the whole world he interacts with, that enables my son to learn more about his dinosaurs. He has a clear concept cloud of what a "dinosaur" is, linked to "extinct animals" and "lizards" and "birds". He can then sub-link concepts from "dinosaur" to different types of dinosaurs like "theropods" and "hadrosaurs" and "flying lizards that are always right next to dinosaurs but for some arcane reason aren't called dinosaurs like quetzalcoatlus and pterosaurs".
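If I had to caricature that concept cloud in code, it would be something like the toy sketch below: explicit, legible links between named concepts. This is emphatically not how either the brain or GPT stores things, but it makes the point:

```python
# A toy sketch of an explicit "concept cloud": each concept points to the concepts it
# links to. Purely illustrative; not a claim about how brains or neural nets store this.
concept_cloud = {
    "dinosaur": ["extinct animals", "lizards", "birds"],
    "theropods": ["dinosaur"],
    "hadrosaurs": ["dinosaur"],
    "parasaurolophus": ["hadrosaurs"],
    "pterosaurs": ["extinct animals", "flying lizards (not dinosaurs)"],
    "quetzalcoatlus": ["pterosaurs"],
}

def context_for(concept):
    """Walk the links so a new fact about `concept` arrives with its related concepts."""
    seen, stack = [], [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(concept_cloud.get(current, []))
    return seen

print(context_for("parasaurolophus"))
# ['parasaurolophus', 'hadrosaurs', 'dinosaur', 'birds', 'lizards', 'extinct animals']
```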
And because the concepts are linked together, any new information that comes in has a context within which it can be analysed. We're starting from a pre-trained network that can keep building on previously trained modules. That's what allows small-sample learning to take place, since we're not immediately trying to build a tool that can create full sentences across all domains of knowledge.
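That "keep the previously trained modules" idea has a rough analogue in how small fine-tunes are often done in practice: freeze most of the pretrained network and let only the top of the stack adapt. A minimal sketch, with the split point chosen arbitrarily:

```python
# A minimal sketch of building on a pretrained network: freeze the first 10 of GPT-2's
# 12 transformer blocks so only the top of the stack changes during any further training.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

for block in model.transformer.h[:10]:     # the lower, already-trained "modules"
    for param in block.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Adapting {trainable:,} of {total:,} parameters")
```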
This isn't new. The original rule-based approach to AI hinted at as much. The difference is that it's bloody difficult to make this come about from a blank slate. Human language and concepts are amorphous and annoyingly imprecise in all the ways a computer would find impossible to interpret.
But if a reasonably coherent cloud of concepts is already being built organically, outside the confines of rule-based programming, with the net being auto-configured through self-learning, that would change the dynamic altogether. It might still not work, but at least there's a path to true understanding.
As we grow, we start with a few pieces of more general knowledge, then concepts, and then layer on other pieces of information linked to them. And growing things one concept at a time, while painful (has your three year old asked you to explain 'mind' yet?), still ends up being a far more flexible way to learn and grow.
The fact that this is how we seem to learn is why I'm reasonably confident in our ability to make bigger strides in AI in the coming years and/or decades. Currently there is no taxonomy, not really, that GPT adheres to. If I rewrite a sentence and try to predict the next word, phrase, paragraph or context, the net needs to rerun itself. There's no comfy quicksave today in the larger loops.
And while Google understanding my garbled English and giving me sensible results seems like magic, it's still not what you'd actually call comprehension. And that's not a fault, it just means the next module has to be built. It would be like getting annoyed at a chassis for not having air-conditioning.
III
Part of the feedback-oriented learning that babies do is therefore dependent on the actual unfolding of "next steps" in the genetic code. But it's not just the genetic code that's important; it's the actual interaction with, and learning from, the environment. This is the theory of embodied intelligence, where cognition and learning come from the way our bodies interact with the world.
Part of the benefit is of course that you can "just react" to the world, rather than performing the complicated mental manoeuvring that's otherwise required to get even the basic amount of actual know-how.
That sounds a lot like the problem with AI today that everyone is simultaneously thunderously optimistic, pessimistic and scared about. It can happily say "the chair sits on the cat" because it doesn't care what a chair is, what a cat is, what sitting is, what the scene might look like, or the implausible physics of it all. With a large enough dataset it might start figuring this out, but that's just brute-forcing the corner cases away. Add an unknown word or concept and we're back to zero.
It feels like the embodied cognition concept and the modular concept-stacking idea both rely on something in common: that to be truly useful, we need to not just pattern-match on wild amounts of data, but actually apply that learning to the real world. Otherwise the algorithm will happily go on thinking that chairs can sit on cats and that pteranodons are similar to parasaurolophuses. Is that the world we live in?
To be fair, I asked my son, and he wasn't sure either.