yoquan on March 3, 2022 | on: DeepNet: Scaling Transformers to 1k Layers
Actually, no. Each layer requires the output of the previous one, which forces sequential computation, whereas wider layers can make better use of GPU parallelism. It's a trade-off between less memory (fewer parameters) and longer runtime.
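A minimal sketch of the point (assuming PyTorch; the sizes, ReLU, and plain linear layers are illustrative, not from the DeepNet paper): the deep stack must apply its layers one after another because each consumes the previous output, while the wide layer is a single large matmul the GPU can parallelize internally.

    import torch
    import torch.nn as nn

    # Illustrative sizes only (hypothetical, not from the paper).
    batch, dim, depth = 32, 512, 64

    x = torch.randn(batch, dim)

    # Deep-and-narrow: each layer needs the previous layer's output,
    # so the layers must run sequentially.
    deep_stack = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
    h = x
    for layer in deep_stack:          # `depth` kernel launches, back to back
        h = torch.relu(layer(h))

    # Shallow-and-wide: roughly the same parameter count folded into one
    # layer; a single big matmul that the GPU parallelizes internally.
    wide_layer = nn.Linear(dim, dim * depth)
    w = torch.relu(wide_layer(x))     # one launch, more work per launch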