Per https://arxiv.org/pdf/2302.13971.pdf, table 15: 1,770,394 A100-80GB hours to train the entire model suite. At the going rate for cloud 8xA100-80GB nodes (~$12/hr, if you could actually get capacity), that's ~$2.6M, under extremely optimistic assumptions. YMMV on bulk pricing ;) "the more you buy, the more you save"
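Spelling out that arithmetic (the ~$12/hr node rate and 8 GPUs per node are the assumptions here):

```python
# Rough cost of the full LLaMA suite at cloud prices.
TOTAL_A100_HOURS = 1_770_394   # table 15, whole model suite
GPUS_PER_NODE = 8              # 8xA100-80GB cloud node
NODE_RATE_USD = 12             # ~$/hr, optimistic on-demand pricing

node_hours = TOTAL_A100_HOURS / GPUS_PER_NODE
print(f"${node_hours * NODE_RATE_USD:,.0f}")  # → $2,655,591, i.e. ~$2.6M
```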
Hmmm… the values for the 7B model seem feasible. An order of magnitude fewer GPU hours, plus the lower parameter count presumably means it could fit on a 24GB Radeon RX 7900 XTX, which has higher single-precision FLOPS than the A100 and costs $1,000 instead of $15,000.
An order of magnitude fewer GPU-hours, plus training for 210 days instead of 21, means you could do a 7B model with 20 consumer GPUs at $1,000 apiece. $20k, not counting mainboards, etc. Really not bad. Might even be doable as a volunteer project.
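A back-of-envelope check on the "20 cards, 210 days" figure, taking the 82,432 A100-hour number from the 7B row of the same table and (optimistically) assuming rough per-card parity with an A100:

```python
A100_HOURS_7B = 82_432      # 7B row of table 15 (A100-80GB hours)
DAYS = 210
CARD_PRICE_USD = 1_000      # e.g. a Radeon RX 7900 XTX

cards_needed = A100_HOURS_7B / (DAYS * 24)
print(f"cards needed at rough A100 parity: {cards_needed:.1f}")  # → ~16.4
print(f"hardware for 20 cards: ${20 * CARD_PRICE_USD:,}")        # → $20,000
```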
Sure, you would need to rewrite the training code for AMD's ecosystem. If you're using mixed-precision training, I suppose you're right about BF16. That puts the relative performance of the A100 at about 2.5x that of the Radeon RX 7900 XTX. It may be better to go with the Nvidia GeForce RTX 4090 at $1,600 retail.
You would need to compile a few components from source for Navi 31 if you were to try it today, so out-of-the-box is perhaps an overstatement, but it's certainly doable.
> Where is the connection between computational details and the model's high-level behavior? Do we even know?
This is an active area of study ("mechanistic interpretability") and it's very early days. For instance here's a paper I read recently that tries to explain how a very simple transformer learns how to do modular arithmetic: https://arxiv.org/abs/2301.05217
Curious what interesting results people are aware of in this area.
> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embeddings from the transformer paper, but are applied via complex multiplication instead of addition.
> "John is in a train moving west at ten miles per hour. Jill is in a train moving north at twenty miles per hour. How fast are John and Jill moving apart from each other?"
To solve this problem, we first need to understand the concept of speed. Speed is a measure of distance traveled over time. In this problem, we are told that John and Jill are traveling at different speeds in different directions.
Since the two trains are moving in different directions, we cannot simply add or subtract their speeds to find the combined speed at which John and Jill are moving apart from each other. Instead, we need to use the Pythagorean theorem to find the combined speed.
The Pythagorean theorem states that in a right triangle, the square of the length of the hypotenuse (the longest side of the triangle) is equal to the sum of the squares of the lengths of the other two sides. In this problem, we can consider the trains to be moving along the two sides of a right triangle, with the distance between them as the hypotenuse.
We can use the Pythagorean theorem to find the distance between John and Jill as follows:
First, we square the speed of each train and add the results:
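The quoted answer breaks off at that point. Finishing the arithmetic it set up (west and north are perpendicular, so the theorem does apply):

```python
import math

west = 10    # John's speed, mph
north = 20   # Jill's speed, mph

# 10^2 + 20^2 = 100 + 400 = 500, and sqrt(500) ≈ 22.36
separation = math.hypot(west, north)
print(f"{separation:.2f} mph")  # → 22.36 mph
```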
At least on Delta, you can. You need to call the airline and arrange it.
> Oh, and to go through security, you have to take a nervous and scared cat out of their carry case in the middle of the security line, walk through the metal detector, and then they have to test both of your hands for residue while still carrying the cat.
I always refuse the normal line and get a private screening. So far I haven't had any issues doing that.
I can't remember Delta's exact policy, but if I recall correctly, you can call ahead of time, but come flight day they can still refuse you if too many pets have already checked in. Don't quote me on that, though.
That's how it was before Avvinue. We just launched the world's first pet travel booking platform, so you can book your pet 100% online with a confirmed place on the flight, instead of sitting on hold for hours just to be denied.
> This means no closures, which K implementers consider a feature (I don't).
having not touched K in about 15 years, when did this change? in k3:
  K 3.2 2004-09-23 Copyright (C) 1993-2004 Kx Systems
  LIN32 16CPU 15985MB ubuntu 0 EVAL
  f:{a:x+1;{a+x}}
  g:f 1
  g
  {a+x}
  g 2
  4
  a
  value error
  a
  ^
  parse error
This comes as a surprise to me! I thought none of Whitney's Ks had closures (although I did neglect to mention that kuc and oK add them). Digging around and asking on the K Matrix/Discord, I found some posts suggesting that the K3 form is very limited. My read of these is that functions can refer to variables in the immediately surrounding scope (only one level up), and their values will be copied in when the function is reached in the source code. So it would be equivalent to adding extra arguments to g and passing the variable values in that way. And it wouldn't allow the programmer to create object-like things and couldn't create reference loops requiring garbage collection.
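If that reading is right, the k3 form behaves like Python's default-argument snapshot rather than a true closure. A sketch of the distinction (Python here, since k3 can't express the second variant):

```python
def make():
    a = 1
    def copy_in(y, a=a):    # value of a snapshotted at definition time,
        return a + y        # i.e. a hidden extra argument (the k3-like reading)
    def true_closure(y):    # refers to the variable a itself
        return a + y
    a = 100                 # mutate a after both functions exist
    return copy_in, true_closure

copy_in, true_closure = make()
print(copy_in(2))       # → 3   (saw the copy taken at definition)
print(true_closure(2))  # → 102 (sees a's current value)
```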
> their values will be copied in when the function is reached in the source code. So it would be equivalent to adding extra arguments to g and passing the variable values in that way. And it wouldn't allow the programmer to create object-like things and couldn't create reference loops requiring garbage collection.
That is definitely desirable! K (and kin) are fully referentially transparent and have value semantics. It would be bizarre and inconsistent to break referential transparency for closures.