
We often experienced cascading failures, especially during rolling restarts. A node would start shutting down and Infinispan would start trying to rebalance. Due to the large volume of sessions, other nodes would become unresponsive and stop replying. Eventually, you'd end up in a situation where it would give up trying to shut the node down cleanly and just kill itself. That wouldn't be a big deal if you weren't doing a rolling restart: when the first node doesn't shut down cleanly, the data should be "safe" since it is replicated to at least N owners. In practice, the other nodes also get restarted and also shut down uncleanly, and sessions are lost. Secondly, as the cluster became unresponsive, requests to refresh sessions would start to time out, which would also cause those sessions to be "lost" since they would eventually hit the maximum idle time.
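To make the "replicated to N owners but still lost" part concrete, here is a toy Python sketch. It is not Infinispan's placement or rebalancing algorithm and uses no Infinispan API: the round-robin `owners_for` placement, the `top_up` helper, and all the numbers are assumptions made up for illustration. It only shows that N copies survive a rolling restart if re-replication completes between restarts, and stop protecting you once several owners go down uncleanly in a row.

```python
def owners_for(session_id, nodes, num_owners):
    """Toy placement (assumption): num_owners consecutive nodes, round-robin."""
    start = session_id % len(nodes)
    return {nodes[(start + i) % len(nodes)] for i in range(num_owners)}


def top_up(holders, live_nodes, num_owners):
    """Re-replicate a session until it has num_owners copies again."""
    for node in live_nodes:
        if len(holders) >= num_owners:
            break
        holders.add(node)


def sessions_lost(num_nodes, num_owners, num_sessions, restart_order,
                  rebalance_completes):
    """Count sessions with no surviving copy after restarting the given nodes."""
    nodes = list(range(num_nodes))
    # copies[s] is the set of nodes currently holding session s
    copies = {s: owners_for(s, nodes, num_owners) for s in range(num_sessions)}
    for node in restart_order:
        live = [n for n in nodes if n != node]
        for holders in copies.values():
            holders.discard(node)  # the restarted node comes back empty
            if rebalance_completes and holders:
                top_up(holders, live, num_owners)
    return sum(1 for holders in copies.values() if not holders)


if __name__ == "__main__":
    # Clean rolling restart: rebalancing finishes before the next node goes down.
    print(sessions_lost(4, 2, 1000, [0, 1, 2, 3], rebalance_completes=True))
    # Two unclean restarts in a row: sessions owned only by nodes 0 and 1 are gone.
    print(sessions_lost(4, 2, 1000, [0, 1], rebalance_completes=False))
    # Every node restarted uncleanly, rebalancing never completes.
    print(sessions_lost(4, 2, 1000, [0, 1, 2, 3], rebalance_completes=False))
```

With two owners per session, a single unclean shutdown loses nothing in this model; the losses start with the second consecutive one, which is exactly what a rolling restart produces.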

As long as we didn't do any restarts, it would sort of work. Problems would pop up when, due to high load, one or more nodes became unresponsive and the liveness probes restarted them. That would often cause the kind of cascading failure described above.

Most of these problems are also the result of running it in Kubernetes. We very quickly learned to remove the liveness probes and to massively increase the grace period. This helped, but only so much. We still had rather frequent failures similar to the one I just described.
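For anyone wondering what "remove the liveness probes and increase the grace period" looks like in a manifest, here is a rough Python sketch that emits a pod spec fragment as JSON (which Kubernetes accepts alongside YAML). `terminationGracePeriodSeconds` and `livenessProbe` are real Kubernetes field names; the container name, image, and the 600-second value are placeholders I made up, not the settings used above.

```python
import json


def pod_spec_fragment(grace_period_seconds):
    """Build a pod spec fragment with a long grace period and no liveness probe."""
    container = {
        "name": "app",                  # hypothetical container name
        "image": "example/app:latest",  # hypothetical image
        # Deliberately no "livenessProbe" key, so the kubelet never
        # restarts the container just because it is slow under load.
    }
    return {
        "spec": {
            # Time the pod gets to shut down cleanly before it is killed.
            "terminationGracePeriodSeconds": grace_period_seconds,
            "containers": [container],
        }
    }


if __name__ == "__main__":
    # 600 is an illustrative value only.
    print(json.dumps(pod_spec_fragment(600), indent=2))
```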

Maybe if we hadn't run it in Kubernetes and had been more knowledgeable about Infinispan, we could've gotten a stable setup. For us, as a small team without that specialized knowledge, we struggled to get a stable setup.



Ah, the infinite fun of managing distributed systems. I've seen similar failure modes in pretty much anything distributed. While in a single-node system a spike of traffic just makes it sorta work slowly, cascading failures caused by latency plague most distributed ones.

The trigger can be process management or just, say, a node having too little memory and spending too much time in GC.

Mixing app and DB (which I guess is what's happening here) can also be fun, as now an overloaded app can overload the DB too. You'd probably be just fine if Infinispan was used as a remote database instead of an embedded one.



