My uneducated guess is that with many layers you can implement something akin to the graph structure in the brain by nulling lots of previous layer outputs. I actually suspect that current models aren't optimal with layers all of the same size, but I know shit.
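A toy sketch of what I mean by nulling (made-up sizes, nothing from an actual model): a dense layer whose weights are mostly masked to zero behaves like an arbitrary sparse graph, while still paying for every weight.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # neurons in one layer

dense_weights = rng.normal(size=(n, n))
mask = rng.random((n, n)) < 0.1         # keep ~10% of connections, null the rest
graph_weights = dense_weights * mask    # effectively an arbitrary sparse graph

x = rng.normal(size=n)
y = graph_weights @ x                   # only the surviving edges affect the output
print(f"{int(mask.sum())} of {n * n} possible connections are actually used")
```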
This is quite intuitive. We know that a biological neural net is a graph data structure, while ML systems on GPUs are more like layers of bitmaps in Photoshop (it's a graphics processor, after all). So if most entries in those layers are effectively zero, like transparent pixels, just so you can build a graph by stacking dense layers, that's hugely memory-inefficient.
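Toy numbers to illustrate (made-up density, not measured from any real net): storing a ~1%-connected graph as a dense matrix versus as an explicit edge list.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000          # nodes
density = 0.01    # only ~1% of possible edges exist

weights = rng.normal(size=(n, n)).astype(np.float32)
weights[rng.random((n, n)) >= density] = 0.0     # the "transparent pixels"

rows, cols = np.nonzero(weights)                 # same graph as an explicit edge list
edge_bytes = rows.nbytes + cols.nbytes + weights[rows, cols].nbytes

print(f"dense matrix: {weights.nbytes / 1e6:.0f} MB")
print(f"edge list:    {edge_bytes / 1e6:.1f} MB for {len(rows):,} edges")
```

Same connectivity, a fraction of the memory, which is basically the point about stacking mostly-empty layers.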
That's interesting. Do you have a rough percentage for this?
Does this mean these connections have no influence at all on the output?