[D] GPT-3, The $4,600,000 Language Model (self.MachineLearning)
submitted 5 years, 7 months ago (edited 18 hours, 40 minutes after submission) by mippie_moe to /r/MachineLearning


We’re not talking about intelligence, just language cognition tasks that children find trivial and perform unconsciously.
The state-of-the-art language model in general use has 340 million parameters. This model, at 175 billion parameters, roughly 500x as large, showed only marginal improvements, a couple of percentage points. The improvement from added capacity appears to grow only logarithmically, and may be approaching a limit.
At this rate it wouldn’t matter if you scaled up another 500x and kept going to 100 trillion, as some folks in this thread have suggested: diminishing returns mean you never get there.
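To make the shape of that argument concrete, here's a minimal sketch assuming score grows roughly with log10 of parameter count. The constants and the scores are invented for illustration; nothing here is fitted to the paper's benchmarks.

```python
import math

def projected_score(params, a=0.55, b=0.03, ref=340e6):
    """Toy scaling curve: score = a + b * log10(params / ref).

    a, b, and ref are made-up illustration constants, not fitted to any
    real benchmark. The point is the shape: every 10x in parameters buys
    the same small additive bump, so another 500x buys very little.
    """
    return a + b * math.log10(params / ref)

for params in (340e6, 175e9, 100e12):
    print(f"{params:.0e} params -> projected score {projected_score(params):.3f}")
```

Under that (assumed) curve, going from 340M to 175B buys about the same bump as going from 175B to 100T would, which is the diminishing-returns point.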
This doesn’t imply that we can’t get there with neural networks. I think it does imply that the paradigm in language model design that’s dominated for the past few years does not have a lot of runway left, and that people should therefore be thinking about lateral changes in approach rather than ways to keep scaling up transformer models.
AGI isn’t the issue. I think a lot of folks who’ve responded to me are confused about that.
The issue is performance on basic language understanding tasks like anaphora resolution. They made essentially no progress there.
The performance on question-answering tasks isn’t meaningful. We know, from the many times results like these have been reported before, that they actually come from extremely carefully prepared test datasets and won’t carry over to real-world data.
An example is their reported results on simple arithmetic. The model doesn’t know how to do arithmetic. It just happened that its training dataset included texts with arithmetic examples that matched the test corpus. Inferring the answer to “2 + 2 =” based on the statistically most probable word to follow in a sentence is not the same as understanding how to add 2 and 2.
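A toy sketch of that distinction, with everything invented for illustration (this is not how GPT-3 works internally, just the crudest possible "most probable next token" learner): it answers "2 + 2 =" only because that exact string appears in its tiny corpus, and it has nothing to say about "2 + 3 =".

```python
from collections import Counter, defaultdict

# Invented toy corpus; the only "arithmetic" it contains is the literal text.
corpus = "2 + 2 = 4 . 3 + 5 = 8 . 2 + 2 = 4 .".split()

# Count which token follows each 4-token context.
next_token = defaultdict(Counter)
for i in range(len(corpus) - 4):
    context = tuple(corpus[i:i + 4])
    next_token[context][corpus[i + 4]] += 1

def predict(context):
    counts = next_token[tuple(context)]
    return counts.most_common(1)[0][0] if counts else "<unseen>"

print(predict("2 + 2 =".split()))  # '4'        -- memorized verbatim from the corpus
print(predict("2 + 3 =".split()))  # '<unseen>' -- no pattern to copy, no ability to add
```

A real transformer generalizes far more smoothly than a lookup table, but the training objective is still to predict the statistically likely continuation.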
Very little progress. It doesn’t “understand” language at all. It isn’t a “few-shot learner”; it’s able to infer the answers to some questions because they’re textually similar to material in its training set.
(I’ve seen so many claims about few-shot learning and the like; it always turns out not to really be true.)
You’re right that it could be fine tuned.
But it’s important to keep in mind that this was a model trained and tested on very clean, prepared text. The history of models like this shows that performance drops 20-30% on real-world text. So where they’re saying 83% on anaphora resolution, or whatever, I’m reading 60%.
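Back-of-the-envelope on that estimate (the 20-30% drop could be read as absolute points or as a relative cut; either reading lands in the same rough neighbourhood, around 60%):

```python
clean = 0.83                                              # reported score on prepared text
print([round(clean - d, 2) for d in (0.20, 0.30)])        # absolute drop: [0.63, 0.53]
print([round(clean * (1 - d), 2) for d in (0.20, 0.30)])  # relative drop: [0.66, 0.58]
```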
I appreciate that my brain reference caused a great deal of confusion, sorry about that.
Now you’re underplaying the model.
There are many, many people who, when confronted with the limitations of BERT-level models, have said “oh we can solve that, we can solve anaphora resolution, all of it, we just need a bigger model.” In fact, if you search this forum you’ll find an endless stream of that stuff.
I think there may even have been a paper called “Attention Is All You Need”...
Well, here they went 500x bigger. I don’t think even the biggest pessimists on the current approach (like me) thought this was the only performance improvement you’d eke out. I certainly didn’t.
The model vastly underperforms relative to what was expected of its size and complexity. Attention, as it turns out, is not all you need.
(This is absolutely not to mock the researchers, who will have saved us years if this result convinces people to start changing direction.)
I think the fundamental issue here is that you haven’t really been following the debate. I’m sorry but I can’t justify spending the time required to explain it to you on this sub thread.
You should probably start by trying to understand each stance before you try to understand the criticisms of either, let alone participate.
In this case, the errors were on your part.