We often experienced cascading failure, especially during rolling restarts. A no...

ilyt · on April 11, 2023

Ah, the infinite fun of managing distributed systems. I've seen similar failure modes in pretty much anything distributed. In a single-node system a traffic spike just makes it sorta work slowly, while cascading failures caused by latency plague most of the distributed ones.

The trigger can be anything, whether it's process management or just, say, a node having too little memory and spinning in GC too much.
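
A toy sketch of what I mean, not a model of any real cluster: assume the load is spread evenly across live nodes, each node has a fixed capacity, and an overloaded node simply drops out (stuck in GC, timing out, whatever). The function name and all the numbers are made up for illustration.

```python
def simulate_cascade(node_count, total_load, capacity_per_node, nodes_taken_down=1):
    """Return how many nodes are still alive once the load settles.

    Hypothetical model: load is split evenly across live nodes, and any
    time the per-node share exceeds capacity, one more node falls over.
    """
    alive = node_count - nodes_taken_down
    while alive > 0:
        load_per_node = total_load / alive
        if load_per_node <= capacity_per_node:
            return alive  # survivors can absorb the redistributed load
        # Overloaded: another node drops out and its share is redistributed.
        alive -= 1
    return 0  # every node was pushed over capacity


# With headroom, restarting one node out of five is harmless.
print(simulate_cascade(node_count=5, total_load=300, capacity_per_node=100))
# Running close to capacity, the same restart takes the whole cluster down.
print(simulate_cascade(node_count=5, total_load=450, capacity_per_node=100))
```

The point is only that near capacity, taking one node out for a rolling restart can push every remaining node over the edge in turn.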

Mixing the app and the DB (which I guess is what's happening here) can also be fun, as an overloaded app can now overload the DB too. You'd probably be just fine if Infinispan were used as a remote database instead of an embedded one.