| | Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer (micahlerner.com) |
| 1 point by mlerner 7 months ago | past |
|
| | ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta (micahlerner.com) |
| 1 point by mlerner on March 31, 2024 | past |
|
| | ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta (micahlerner.com) |
| 4 points by nalgeon on March 30, 2024 | past |
|
| | ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta (micahlerner.com) |
| 4 points by sbdchd on March 29, 2024 | past |
|
| | Understanding Google's File System (2020) (micahlerner.com) |
| 142 points by tosh on March 19, 2024 | past | 41 comments |
|
| | A Cloud-Scale Characterization of Remote Procedure Calls (micahlerner.com) |
| 2 points by mlerner on March 5, 2024 | past |
|
| | A Cloud-Scale Characterization of Remote Procedure Calls (micahlerner.com) |
| 3 points by nalgeon on March 4, 2024 | past |
|
| | A Cloud-Scale Characterization of Google's Remote Procedure Calls (micahlerner.com) |
| 1 point by mlerner on March 4, 2024 | past |
|
| | Gemini, Amazon's system for fast failure recovery in distributed model training (micahlerner.com) |
| 2 points by mlerner on March 2, 2024 | past |
|
| | Defcon: Preventing overload with graceful feature degradation (2023) (micahlerner.com) |
| 237 points by mlerner on Feb 29, 2024 | past | 95 comments |
|
| | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (micahlerner.com) |
| 3 points by mlerner on Feb 4, 2024 | past |
|
| | XFaaS: Hyperscale and Low Cost Serverless Functions at Meta (micahlerner.com) |
| 194 points by greghn on Jan 31, 2024 | past | 75 comments |
|
| | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (micahlerner.com) |
| 4 points by mlerner on Jan 31, 2024 | past |
|
| | XFaaS: Hyperscale and Low Cost Serverless Functions at Meta (micahlerner.com) |
| 3 points by mlerner on Jan 27, 2024 | past |
|
| | XFaaS: Hyperscale and Low Cost Serverless Functions at Meta (micahlerner.com) |
| 3 points by mlerner on Jan 24, 2024 | past | 1 comment |
|
| | Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications (micahlerner.com) |
| 2 points by mlerner on Jan 21, 2024 | past |
|
| | Efficient Memory Management for Large Language Model Serving with PagedAttention (micahlerner.com) |
| 3 points by mlerner on Jan 20, 2024 | past |
|
| | Efficient Memory Management for Large Language Model Serving with PagedAttention (micahlerner.com) |
| 2 points by ingve on Jan 16, 2024 | past |
|
| | Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications (micahlerner.com) |
| 1 point by malphite on Jan 15, 2024 | past |
|
| | Efficient Memory Management for Large Language Model Serving with PagedAttention (micahlerner.com) |
| 1 point by mlerner on Jan 11, 2024 | past |
|
| | Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications (micahlerner.com) |
| 1 point by mlerner on Jan 4, 2024 | past |
|
| | Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications (micahlerner.com) |
| 5 points by mlerner on Jan 2, 2024 | past | 2 comments |
|
| | Defcon: Preventing Overload with Graceful Feature Degradation (micahlerner.com) |
| 4 points by mlerner on July 29, 2023 | past | 1 comment |
|
| | Defcon: Preventing Overload with Graceful Feature Degradation (micahlerner.com) |
| 4 points by mlerner on July 25, 2023 | past |
|
| | Defcon: Preventing Overload with Graceful Feature Degradation (micahlerner.com) |
| 1 point by ingve on July 25, 2023 | past |
|
| | A Decade of Clos Topologies and Centralized Control in Google’s Datacenter (micahlerner.com) |
| 4 points by greghn on July 6, 2023 | past |
|
| | Towards an adaptable systems architecture for memory tiering at warehouse-scale (micahlerner.com) |
| 28 points by mlerner on June 29, 2023 | past | 4 comments |
|
| | Sundial: Fault-Tolerant Clock Synchronization for Datacenters (micahlerner.com) |
| 2 points by mlerner on June 27, 2023 | past | 1 comment |
|
| | Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators (micahlerner.com) |
| 1 point by ingve on June 26, 2023 | past |
|
| | Sundial: Fault-Tolerant Clock Synchronization for Datacenters (micahlerner.com) |
| 3 points by mlerner on June 25, 2023 | past |
|