
This is obviously sometimes the case. But more often I’ve seen IO bound apps spending all their time on network roundtrip latency. I.E. not a few poorly performing SQL queries, but a thousand queries which all take a millisecond or two.
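To make the "thousand small queries" pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, the `CountingConnection` wrapper and both fetch functions are hypothetical names made up for illustration. SQLite runs in-process, so there is no real network here; the sketch only counts statements, on the assumption that each statement would cost one round trip against a remote database.

```python
import sqlite3


class CountingConnection:
    """Wraps a DB-API connection and counts execute() calls.

    Assumption: against a networked database, one execute() is one round trip.
    """

    def __init__(self, conn):
        self.conn = conn
        self.round_trips = 0

    def execute(self, sql, params=()):
        self.round_trips += 1
        return self.conn.execute(sql, params)


def make_db(n):
    # Hypothetical schema, in-memory so the sketch is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(n)]
    )
    return CountingConnection(conn)


def fetch_one_by_one(db, ids):
    # One statement per id: len(ids) round trips, each of them individually fast.
    return {
        i: db.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()[0]
        for i in ids
    }


def fetch_batched(db, ids):
    # One statement for all ids: a single round trip.
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", list(ids)
    )
    return dict(rows.fetchall())
```

Both functions return the same mapping; only the number of statements differs. No single query in the first version would show up as slow in a query log, which is why this pattern is easy to miss.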


Totally. I've seen similar things. I've also seen thread contention (such as on a connection pool) that can look a lot like a slow database query or an "IO bound" workflow. I think profiling is just really hard, and lots of code tends to be very inefficient at actually performing IO.


I spent my last 2 weeks optimizing a decade-old SaaS to handle massive traffic spikes from one of our biggest customers. We had other customers serving similar amounts of traffic, but with smaller data sets.

- Increased the number of servers running the app. App connections were still stacking up, but this gave us more headroom before they started to stack, enough to handle small spikes.

- The database seemed very overloaded with so many concurrent connections. I began putting everything I could into memcached (we already had a lot of data in it, but I put more).

- Now we had a cache hotspot. Some digging found an age-old bug in our cache driver: it didn’t actually keep values in process memory after fetching them from memcached, and a medium-sized key was getting fetched hundreds of times per request.

- Days and days of app optimization after profiling. Our average response time improved by more than 50%. The site would still start collapsing under a little load.

- When profiling a single request, all queries would complete very quickly (<50ms). Somehow the DB was still the bottleneck. We overprovisioned it significantly and it would still collapse.

- I started sending counts, cumulative timings, and maximum single-call timings for cache and DB reads and writes to our log stack.

- the bottleneck was clearly still the DB.

- At this point we were desperate. Thinking it might be an issue in the underlying VM, we live-migrated the DB to a new VM.

- the database was still the bottleneck.

In the end the thing that fixed it? A simple OPTIMIZE TABLE.

Somehow ANALYZE TABLE hadn’t detected anything but rebuilding the table still fixed the issue.
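The count and timing collection mentioned in the list above could look roughly like this. It is a minimal sketch, not the code that was actually used: the `IoStats` class and the `"db.read"` / `"cache.write"` labels are hypothetical, and shipping the summary to a log stack is left out.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class IoStats:
    """Per-request counters: call count, cumulative time, slowest single call."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)
        self.longest = defaultdict(float)

    @contextmanager
    def timed(self, kind):
        # kind is a free-form label such as "db.read" or "cache.write".
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.count[kind] += 1
            self.total[kind] += elapsed
            self.longest[kind] = max(self.longest[kind], elapsed)

    def summary(self):
        # One dict per label, ready to be written out as a structured log line.
        return {
            kind: {
                "count": self.count[kind],
                "total_s": self.total[kind],
                "max_s": self.longest[kind],
            }
            for kind in self.count
        }


# Usage: wrap each cache/DB call, then log stats.summary() at the end of the request.
stats = IoStats()
with stats.timed("db.read"):
    pass  # the real query would run here
```

Tracking the count next to the cumulative and maximum times helps separate "one slow call" from "many fast calls", which look identical if only the total is recorded.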

If anyone is looking for a good load testing tool, Vegeta was invaluable. I highly recommend it.
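On the cache driver bug described in the list above, the missing behaviour can be sketched as a request-local layer in front of the remote cache. This is an illustration under assumptions, not the actual driver: `RequestLocalCache` and `FakeRemote` are hypothetical names, and `FakeRemote` stands in for a real memcached client so the example runs without a server.

```python
class RequestLocalCache:
    """Keeps values fetched from a remote cache in process memory.

    Intended to live for a single request, so stale data is dropped
    when the request ends.
    """

    def __init__(self, remote):
        self.remote = remote  # any object with a get(key) method
        self.local = {}

    def get(self, key):
        if key not in self.local:
            # Only the first lookup of a key goes over the network.
            # Misses (None) are remembered too, for the rest of the request.
            self.local[key] = self.remote.get(key)
        return self.local[key]


class FakeRemote:
    """Stand-in for a memcached client that counts how often it is hit."""

    def __init__(self, data):
        self.data = data
        self.fetches = 0

    def get(self, key):
        self.fetches += 1
        return self.data.get(key)


# Usage: hundreds of lookups of the same key cause a single remote fetch.
remote = FakeRemote({"settings": {"theme": "dark"}})
cache = RequestLocalCache(remote)
for _ in range(300):
    cache.get("settings")
```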


These are hinted at by high system% CPU usage when your system is busy (i.e. higher than, say, 10%). If it looks CPU bound but spends a lot of its time in the kernel, then thread switching or synchronization (e.g. on mutexes) is happening too much.


And worth noting that this can be missed during development if you have a good network to the server but the customer is using a not-so-great WiFi network.

We had this at work, where one customer complained some operation was very slow, taking around 30 minutes. Couldn't pin it down, so I copied their database to my machine, where it took only a couple of minutes. A bit of digging and I found that in this case the module caused a few million fairly trivial SQL statements to be executed.

Each took less than a millisecond to execute locally, but round-trip time over WiFi can be 10 milliseconds or more. So suddenly 2 minutes becomes over 20 minutes.

I asked how the client connected to the LAN, and it was indeed via WiFi. As a quick fix we got the customer to use a network cable, which did indeed reduce the running time to a few minutes. The proper fix was to add a bit of caching.
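The effect is easy to estimate with a back-of-the-envelope model, assuming the statements run strictly one after another so that each pays its own round trip. The function name and the numbers in the usage lines are illustrative assumptions, not measurements from this incident.

```python
def estimated_runtime_s(statements, exec_ms, round_trip_ms):
    """Serial statements: each pays its execution time plus one network round trip."""
    return statements * (exec_ms + round_trip_ms) / 1000.0


# Illustrative inputs only: 100,000 statements at 1 ms of execution each.
wired = estimated_runtime_s(100_000, exec_ms=1.0, round_trip_ms=0.1)
wifi = estimated_runtime_s(100_000, exec_ms=1.0, round_trip_ms=10.0)
print(f"wired: {wired / 60:.1f} min, wifi: {wifi / 60:.1f} min")
```

With these made-up inputs the wired run comes out at roughly two minutes and the WiFi run at roughly eighteen, even though the database does exactly the same work in both cases. Once round-trip time dominates execution time, the total scales with the number of statements rather than with how hard the queries are.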


> And worth noting that this can be missed during development if you have a good network

It can be even worse if development is done using a local database, possibly on super fast local SSD. The latencies can be orders of magnitude lower, hiding performance issues that would be obvious even with only a millisecond of additional latency.
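One way to surface this during development is to inject artificial latency in front of the local database. The following is a minimal sketch under assumptions: `SlowConnection` and `timed_run` are hypothetical helpers, sqlite3 stands in for whatever database the application really uses, and a fixed sleep is only a crude approximation of real network behaviour.

```python
import sqlite3
import time


class SlowConnection:
    """Adds a fixed delay before every statement to mimic network round-trip time."""

    def __init__(self, conn, round_trip_s):
        self.conn = conn
        self.round_trip_s = round_trip_s

    def execute(self, sql, params=()):
        time.sleep(self.round_trip_s)  # simulated round trip
        return self.conn.execute(sql, params)


def timed_run(db, n):
    # Runs n trivial statements one after another and returns the wall-clock time.
    start = time.perf_counter()
    for i in range(n):
        db.execute("SELECT ?", (i,)).fetchone()
    return time.perf_counter() - start


# Usage: the same workload with no added latency and with 1 ms per statement.
raw = sqlite3.connect(":memory:")
no_delay = timed_run(SlowConnection(raw, 0.0), 200)
with_delay = timed_run(SlowConnection(raw, 0.001), 200)
print(f"no delay: {no_delay:.3f}s, 1 ms per statement: {with_delay:.3f}s")
```

Running the test suite or a slow code path through a wrapper like this makes chatty query patterns show up on a developer machine instead of at a customer site.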

I've seen many vendors claim that they only "support" monolithic single-machine setups (sometimes even virtualization has "unacceptable overhead") when it's blatantly obvious that the application is just written with the assumption that database latency is approximately zero.



