[D] GPT-3, The $4,600,000 Language Model (self.MachineLearning)
submitted 5 years, 7 months ago (edited 18 hours, 40 minutes later) by mippie_moe to /r/MachineLearning


The human brain has around 86 billion neurons, and it does a whole lot of things other than language. If the claim is that a neural net of the currently favored design would begin to understand language at between 1.75 trillion and 175 trillion parameters, that's a pretty damning indictment of the design.
How would such a thing be trained? Would it have to have read the entire corpus of a language? That isn’t how brains learn.
Anyway, evidence that a neural network of one size can handle a simplified version of a task does not imply that a larger neural network can handle the full task. That's something we know from experience to be true.
Except a parameter and a neuron aren't the same thing, so equating the two is foolish. Geoffrey Hinton has equated parameters with synapses (of which there are up to 1,000 trillion in the brain, so there's plenty of room to scale yet).
They can still scale 6000x more before they reach a brain.
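That back-of-the-envelope figure follows from dividing the upper synapse estimate by GPT-3's parameter count (a rough calculation, using the ~1,000 trillion synapse number quoted above):

```latex
\frac{10^{15}\ \text{synapses}}{1.75\times 10^{11}\ \text{parameters}} \approx 5.7\times 10^{3} \approx 6000\times
```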
Yes, but how many of these neurons/synapses are actually devoted to a given task? Probably a tiny fraction.
Given that no other animal has evolved the ability to use language like humans do, I suspect a "tiny fraction" is probably far from enough.
This. Humans are the only things on this planet capable of conversing intelligently, so I think it is pretty understandable that no natural language model comes close to a human skill level in terms of writing text.
Comparisons to the brain are usually a bad idea, but NN parameters are more closely related to the number of connections in the brain than the number of neurons, and that number is more like 100 trillion.
You’re correct on both counts - but you’re also reinforcing my point.
I don't think he should make the comparison between connections in the brain either.
Even if we let that slide, he did not seem to reinforce your point: if GPT gets comparable to a human at 100 trillion parameters, then I would consider it a good design.
The others here have responded to the fact that it probably has fewer parameters than the brain (as you should be looking at connections between neurons, which number around 100 trillion).
We would train it in the same way we train current neural networks (learning to fill in blanks in sentences); we'd just need more data and more parameters. You are right that that isn't really how humans learn, but that doesn't necessarily mean it's an invalid way to do it.
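As a rough sketch of that self-supervised objective (illustrative toy code only, with made-up sizes; GPT-3 itself predicts the next token rather than filling masked blanks, but the training loop has the same shape):

```python
# Minimal sketch of the self-supervised language-modeling objective.
# Toy model and hyperparameters are invented for illustration only.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64            # toy sizes, not real values
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),       # stand-in for a transformer stack
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 33))    # fake batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens <= t

optimizer.zero_grad()
logits = model(inputs)                            # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```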
I think a model that matches the entropy of the English language will be superior to humans in language generation and understanding. Exactly what that means, I don't know, and maybe there is a fundamental limit that prevents us from getting there. But it'll be interesting to see either way.
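For reference, "matching the entropy of English" means driving the model's per-token cross-entropy down toward the intrinsic entropy of the language (standard definitions, not numbers from the paper; Shannon's classic experiments put printed English at roughly one bit per character):

```latex
H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 q(x_i \mid x_{<i}), \qquad \mathrm{PPL} = 2^{H}
```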
By the way, lateral improvements in models that can get the same perplexity with fewer parameters are still a great idea, and I think even OpenAI supports and makes use of that research as well. These approaches (scaling up and improving the models) work together.
It's better to imagine each of the 86 billion neurons as its own mini neural network.
We’re not talking about intelligence, just language cognition tasks that children find trivial and perform unconsciously.
The state-of-the-art language model in general use has 340 million parameters. This model, at 175 billion parameters, roughly 500x as large, showed only marginal improvements, a couple of percent. The improvement from increasing capacity appears to be growing logarithmically, and may be approaching a limit.
At this rate it wouldn’t matter if you scaled up another 500x and kept going to 100 trillion, as some folks in this thread have suggested; diminishing returns mean you never get there.
This doesn’t imply that we can’t get there with neural networks. I think it does imply that the paradigm in language model design that’s dominated for the past few years does not have a lot of runway left, and people should therefore be thinking about lateral changes in approach rather than ways to keep scaling up transformer models.
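To make the diminishing-returns claim concrete, here is a toy logarithmic scaling curve (the coefficients are invented for illustration, not fitted to GPT-3's reported numbers):

```python
# Illustrative only: a toy logarithmic scaling curve with invented
# coefficients, showing why each 500x increase in parameters buys roughly
# the same small, constant gain.
import math

def toy_accuracy(params, base=0.70, gain_per_decade=0.01):
    """Accuracy that improves by a fixed amount per 10x parameters."""
    return base + gain_per_decade * math.log10(params / 340e6)

for params in (340e6, 175e9, 87.5e12):
    print(f"{params:12.3g} params -> {toy_accuracy(params):.3f}")
# 340 million -> 0.700, 175 billion -> 0.727, 87.5 trillion -> 0.754:
# every 500x step adds ~0.027, so gains shrink relative to the cost of each jump.
```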
AGI isn’t the issue. I think a lot of folks who’ve responded to me are confused about that.
The issue is performance on basic language understanding tasks like anaphoricity. They made essentially no progress there.
The performance on question-answering tasks isn’t meaningful. We know from the many times results like these have been reported before that they’re actually coming from extremely carefully prepared test datasets that won’t carry over to real-world data.
An example is their reported results on simple arithmetic. The model doesn’t know how to do arithmetic. It just happened that its training dataset included texts with arithmetic examples that matched the test corpus. Inferring the answer to “2 + 2 =” based on the statistically most probable word to follow in a sentence is not the same as understanding how to add 2 and 2.
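A toy example of the distinction (a hypothetical memorize-and-look-up predictor, not a claim about GPT-3's internals): it "answers" prompts it has seen, with no procedure for addition behind it.

```python
# Toy illustration (not GPT-3): a lookup of memorized continuations can
# "answer" arithmetic prompts seen in training without any notion of addition.
from collections import Counter

training_text = ["2 + 2 = 4", "2 + 2 = 4", "1 + 1 = 2", "2 + 3 = 5"]

continuations = {}
for line in training_text:
    prompt, answer = line.rsplit("= ", 1)
    continuations.setdefault(prompt + "=", Counter())[answer] += 1

def predict(prompt):
    # Return the statistically most frequent continuation, or give up.
    counts = continuations.get(prompt)
    return counts.most_common(1)[0][0] if counts else "<no idea>"

print(predict("2 + 2 ="))    # "4"        -- looks like arithmetic
print(predict("17 + 26 ="))  # "<no idea>" -- no generalization, no addition
```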
Very little progress. It doesn’t “understand” language at all. It isn’t a “few-shot learner,” but it’s able to infer the answers to some questions because they’re textually similar to material in its training set.
(I’ve seen so many claims about few-shot learning and the like - it always turns out not to really be true.)
You’re right that it could be fine tuned.
But it’s important to keep in mind, this was a model trained and tested on very clean, prepared text. The history of models like this shows that performance drops 20-30% on real-world text. So where they’re saying 83% on anaphoricity, or whatever, I’m reading 60%.
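That discount is just the 20-30% relative drop applied to the reported number (rough arithmetic, not a figure from the paper):

```latex
0.83 \times 0.70 \approx 0.58, \qquad 0.83 \times 0.80 \approx 0.66
```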
I appreciate that my brain reference caused a great deal of confusion, sorry about that.
Now you’re underplaying the model.
There are many, many people who, when confronted with the limitations of BERT-level models, have said “oh we can solve that, we can solve anaphoricity, all of it, we just need a bigger model.” In fact if you search this forum you’ll find an endless stream of that stuff.
In fact I think there may have been a paper called “attention is all you need”...
Well here they went 500x bigger. I don’t think even the biggest pessimists on the current approach (like me) thought this was the only performance improvement you’d eke out. I certainly didn’t.
The model vastly underperforms relative to what was expected of its size and complexity. Attention, as it turns out, is not all you need.
(This is absolutely not to mock the researchers, who have saved us years if this result convinces people to start changing direction.)
I think the fundamental issue here is that you haven’t really been following the debate. I’m sorry but I can’t justify spending the time required to explain it to you on this sub thread.
You should probably start by trying to understand either stance, before you try to understand the criticisms of either, let alone participate.