Per https://arxiv.org/pdf/2302.13971.pdf, table 15: 1,770,394 A100-80GB hours to train the entire model suite. At the going rate for cloud 8xA100-80GB nodes (~$12/hr, if you could actually get capacity), that's ~$2.6M, under extremely optimistic assumptions. YMMV on bulk pricing ;) "the more you buy, the more you save"
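Spelling out that arithmetic (the ~$12/hr node rate and 8 GPUs per node are the assumptions here):

```python
# Rough cost of the full LLaMA suite at cloud prices.
TOTAL_A100_HOURS = 1_770_394   # table 15, whole model suite
GPUS_PER_NODE = 8              # 8xA100-80GB cloud node
NODE_RATE_USD = 12             # ~$/hr, optimistic on-demand pricing

node_hours = TOTAL_A100_HOURS / GPUS_PER_NODE
print(f"${node_hours * NODE_RATE_USD:,.0f}")  # → $2,655,591, i.e. ~$2.6M
```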
Hmmm… the values for the 7B model seem feasible. An order of magnitude fewer GPU hours, plus the lower parameter count presumably means it could fit on a 24GB Radeon RX 7900 XTX, which has higher single-precision FLOPS than the A100 and costs $1,000 instead of $15,000.
An order of magnitude fewer GPU-hours, plus training for 210 days instead of 21, means you could do a 7B model with 20 consumer GPUs at $1,000 apiece. $20k, not counting mainboards, etc. Really not bad. Might even be doable as a volunteer project.
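A back-of-envelope check on the "20 cards, 210 days" figure, taking the 82,432 A100-hour number from the 7B row of the same table and (optimistically) assuming rough per-card parity with an A100:

```python
A100_HOURS_7B = 82_432      # 7B row of table 15 (A100-80GB hours)
DAYS = 210
CARD_PRICE_USD = 1_000      # e.g. a Radeon RX 7900 XTX

cards_needed = A100_HOURS_7B / (DAYS * 24)
print(f"cards needed at rough A100 parity: {cards_needed:.1f}")  # → ~16.4
print(f"hardware for 20 cards: ${20 * CARD_PRICE_USD:,}")        # → $20,000
```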
Sure, you would need to rewrite the training code for AMD's ecosystem. If you're using mixed-precision training, I suppose you're right about BF16. That puts the relative performance of the A100 at about 2.5x that of the Radeon RX 7900 XTX. It may be better to go with the Nvidia GeForce RTX 4090 at $1,600 retail.
You would need to compile a few components from source for Navi 31 if you were to try it today, so out-of-the-box is perhaps an overstatement, but it's certainly doable.
> Where is the connection between computational details and the model's high-level behavior? Do we even know?
This is an active area of study ("mechanistic interpretability") and it's very early days. For instance here's a paper I read recently that tries to explain how a very simple transformer learns how to do modular arithmetic: https://arxiv.org/abs/2301.05217
Curious what interesting results people are aware of in this area.
> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embeddings from the transformer paper, but are applied via complex multiplication instead of addition.
> "John is in a train moving west at ten miles per hour. Jill is in a train moving north at twenty miles per hour. How fast are John and Jill moving apart from each other?"
To solve this problem, we first need to understand the concept of speed. Speed is a measure of distance traveled over time. In this problem, we are told that John and Jill are traveling at different speeds in different directions.
Since the two trains are moving in different directions, we cannot simply add or subtract their speeds to find the combined speed at which John and Jill are moving apart from each other. Instead, we need to use the Pythagorean theorem to find the combined speed.
The Pythagorean theorem states that in a right triangle, the square of the length of the hypotenuse (the longest side of the triangle) is equal to the sum of the squares of the lengths of the other two sides. In this problem, we can consider the trains to be moving along the two sides of a right triangle, with the distance between them as the hypotenuse.
We can use the Pythagorean theorem to find the distance between John and Jill as follows:
First, we square the speed of each train and add the results:
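The quoted answer breaks off at that point. Finishing the arithmetic it set up (west and north are perpendicular, so the theorem does apply):

```python
import math

west = 10    # John's speed, mph
north = 20   # Jill's speed, mph

# 10^2 + 20^2 = 100 + 400 = 500, and sqrt(500) ≈ 22.36
separation = math.hypot(west, north)
print(f"{separation:.2f} mph")  # → 22.36 mph
```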
At least on Delta, you can. You need to call the airline and arrange it.
> Oh, and to go through security, you have to take a nervous and scared cat out of their carry case in the middle of the security line, walk through the metal detector, and then they have to test both of your hands for residue while still carrying the cat.
I always refuse the normal line and get a private screening. So far I haven't had any issues doing that.
I can't remember Delta's exact policy, but if I recall correctly, you can call ahead of time, but come flight day they can still refuse you if too many pets have already checked in. Don't quote me on that, though.
That's how it was before Avvinue. We just launched the world's first pet travel booking platform, so you can book your pet 100% online with a confirmed place on the flight, instead of sitting on hold for hours just to be denied.
> This means no closures, which K implementers consider a feature (I don't).
having not touched K in about 15 years, when did this change? in k3:
  K 3.2 2004-09-23 Copyright (C) 1993-2004 Kx Systems
  LIN32 16CPU 15985MB ubuntu 0 EVAL
  f:{a:x+1;{a+x}}
  g:f 1
  g
  {a+x}
  g 2
  4
  a
  value error
  a
  ^
  parse error
This comes as a surprise to me! I thought none of Whitney's Ks had closures (although I did neglect to mention that kuc and oK add them). Digging around and asking on the K Matrix/Discord, I found some posts suggesting that the K3 form is very limited. My read of these is that functions can refer to variables in the immediately surrounding scope (only one level up), and their values will be copied in when the function is reached in the source code. So it would be equivalent to adding extra arguments to g and passing the variable values in that way. And it wouldn't allow the programmer to create object-like things and couldn't create reference loops requiring garbage collection.
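If that reading is right, the k3 form behaves like Python's default-argument snapshot rather than a true closure. A sketch of the distinction (Python here, since k3 can't express the second variant):

```python
def make():
    a = 1
    def copy_in(y, a=a):    # value of a snapshotted at definition time,
        return a + y        # i.e. a hidden extra argument (the k3-like reading)
    def true_closure(y):    # refers to the variable a itself
        return a + y
    a = 100                 # mutate a after both functions exist
    return copy_in, true_closure

copy_in, true_closure = make()
print(copy_in(2))       # → 3   (saw the copy taken at definition)
print(true_closure(2))  # → 102 (sees a's current value)
```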
> their values will be copied in when the function is reached in the source code. So it would be equivalent to adding extra arguments to g and passing the variable values in that way. And it wouldn't allow the programmer to create object-like things and couldn't create reference loops requiring garbage collection.
That is definitely desirable! K (and kin) are fully referentially transparent and have value semantics. It would be bizarre and inconsistent to break referential transparency for closures.