474 points
Discussion [D] GPT-3, The $4,600,000 Language Model (self.MachineLearning)
submitted 5 years, 7 months ago* (edited 18 hours, 40 minutes after) by mippie_moe to /r/MachineLearning (3m)
217 comments

OpenAI’s GPT-3 Language Model Explained

Some int...

you are viewing a single comment's thread.
[–] FirstTimeResearcher 3 points 5 years, 7 months ago

Wouldn't this be substantially cheaper if AWS spot instances were used?

[–] mippie_moe 12 points 5 years, 7 months ago

The pricing uses $1.50 per V100-hour. Current spot pricing on AWS for a V100 is $0.918 per hour. Using spot would cost about $2.8M. Obviously, the problem with spot instances is that they can be terminated at any moment!
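
Rough sanity check of that figure (my own arithmetic, assuming the $4.6M headline corresponds to on-demand V100-hours at $1.50/hr):

```python
# Back-of-the-envelope: convert the $4.6M on-demand estimate to spot pricing
on_demand_rate = 1.50        # $ per V100-hour (rate behind the $4.6M estimate)
spot_rate = 0.918            # $ per V100-hour (AWS spot price quoted above)
on_demand_total = 4_600_000  # headline cost estimate, dollars

v100_hours = on_demand_total / on_demand_rate  # ~3.07M GPU-hours
spot_total = v100_hours * spot_rate            # ~$2.8M
print(f"{v100_hours:,.0f} V100-hours -> ~${spot_total:,.0f} on spot")
```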

[–] AxeLond 7 points 5 years, 7 months ago

The thing is that you wouldn't be able to train this on any server AWS offers. It's not about whether it's cheaper or faster; it's whether you can load the model into memory and run anything at all, and the answer is no.

In the paper they say the model was trained using V100s and a high-bandwidth cluster provided by Microsoft. Most likely this is something similar to NVSwitch, which links GPUs together and lets them share resources. You can link together the VRAM of 16 GPUs by pairing each GPU with an NVSwitch, and the switch is a huge piece of silicon that costs about the same as the GPU itself. You're looking at a $200,000 server just to load the model. The cost figure is just a simple approximation; it wouldn't actually work that way.

https://www.nvidia.com/en-us/data-center/nvlink/

https://www.nvidia.com/en-us/data-center/dgx-a100/
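
For a rough sense of scale (my own back-of-the-envelope, fp16 weights only; activations, gradients and optimizer state would add several times more during training):

```python
# Weights-only memory estimate for the 175B-parameter model in fp16
params = 175e9
bytes_per_param = 2                          # fp16
weights_gb = params * bytes_per_param / 1e9  # ~350 GB of weights alone

num_gpus = 16
v100_vram_gb = 32                            # V100 32 GB in an NVSwitch box (DGX-2 style)
print(f"weights: ~{weights_gb:.0f} GB vs {num_gpus * v100_vram_gb} GB across {num_gpus} GPUs")
```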

[–] [deleted] 2 points 5 years, 7 months ago

While it would likely be enormously cost-prohibitive, AWS does offer some "private" tiers.

For example, the u-12tb1.metal instance type has 12 TB of RAM and 448 CPU cores. While this one is aimed at in-memory DBs, they do have some other huge cluster offerings.

[–] AxeLond 2 points 5 years, 7 months ago

I don't think many will be running the 175B-parameter model anywhere; even OpenAI is probably hurting a bit after doing it. They also published smaller models, which I think would be enough; the 13B-parameter model is still roughly 10x the size of the largest GPT-2 model. Humans were only 52% accurate at identifying fake articles written by the 175B model, which is pretty much just guessing 50/50, but even for the 13B model people were only 55% accurate.

The 13B model you could probably run reasonably well on a single Tesla A100 with 40 GB of VRAM.
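
Same weights-only fp16 estimate as above, assuming inference only (training would need far more memory):

```python
# Weights-only fp16 footprint of the 13B-parameter model vs. a 40 GB A100
params_13b = 13e9
weights_gb = params_13b * 2 / 1e9  # ~26 GB of fp16 weights
print(f"~{weights_gb:.0f} GB of weights vs 40 GB of VRAM on an A100")
```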

But technology advancements will make these things more accessible as well. Nvidia's NVSwitch solution is incredibly niche and expensive, since it requires a board that wires every GPU to every other GPU in the server.

AMD, with 3rd-gen Infinity Fabric, will try to do that built into the CPU + GPU. Nvidia was limited to PCIe 3.0, and it wasn't fast enough. With Zen 3 or 4, AMD is moving to PCIe 5.0, which can do 63 GB/s compared to 16 GB/s for gen 3. They will be using this to interconnect 8 GPUs and an EPYC processor in El Capitan, the 2-exaflop supercomputer, with full GPU resource sharing. NVSwitch has a port bandwidth of 50 GB/s, so in a few years an off-the-shelf server will be able to do this stuff instead of needing a super-niche product.

https://en.wikichip.org/wiki/nvidia/nvswitch

This thing is absolutely ridiculous, it's a 100W linking cable.
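
For intuition on why those link speeds matter, a naive transfer-time comparison (my own illustration: a single link, ignoring topology, latency, and compute/communication overlap; the payload size is just illustrative):

```python
# Naive time to move a payload over each interconnect mentioned above
payload_gb = 350  # e.g. fp16 weights of the 175B model
links_gb_per_s = {
    "PCIe 3.0 x16 (~16 GB/s)": 16,
    "NVSwitch port (50 GB/s)": 50,
    "PCIe 5.0 x16 (~63 GB/s)": 63,
}
for name, bw in links_gb_per_s.items():
    print(f"{name}: {payload_gb / bw:.1f} s to move {payload_gb} GB")
```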

In 2022, AMD servers will be able to do this without specialized hardware:

https://www.anandtech.com/show/15596/amd-moves-from-infinity-fabric-to-infinity-architecture-connecting-everything-to-everything

That's when models of this size can start to become common.

[–] [deleted] 2 points 5 years, 7 months ago

Thanks for sharing the specifics on this. Very exciting stuff!

[–] [deleted] 2 points 5 years, 7 months ago* (edited 1 week, 6 days after)
[deleted] by user
[–] catandDuck 2 points 5 years, 7 months ago

To be fair, that pricing isn't in the article title, just this post. But it certainly is an "advertisement," considering that the cost is estimated using their own product.

[–] farmingvillein 2 points 5 years, 7 months ago

You are correct for a single instance. But the numbers cited by OP are a better analog for the "true" cost: when you scale up, you can't really use spot instances (without a lot of custom work), because if you have a cluster of 50 machines and 1 of them drops out, the whole thing goes down (at least with common out-of-the-box implementations of scaled GPU training).
