Hacker News

I skimmed the article, and this seems to explain a few things for me. The image I have in my mind is that Google is Google, masters of the datacenter. If they can build the most reliable website on earth (e.g. Google's down? Internet must be down) then their cloud service must be stellar. And yet, after we switched from AWS to Google a few months ago at work, our experiences have left me sorely disappointed. It's been a nightmare, honestly. Why such a difference between what I expected and what we experienced?

The article seems to suggest that Google's cloud offerings are, like a lot of services they offer, just side projects for them. That would explain why a lot of their platform feels half-baked. And the article (clearly a PR piece) waffles on about how Diane Greene has brought in lots of sales and marketing muscle. I see that clearly in the large number of PR articles about Google Cloud floating around (including on Hacker News) and the sheer quantity of Google employees on social media and on here. I don't see the same from AWS. At first I saw that as a good thing, but it seems the effort that goes into their marketing and customer engagement has not been matched on the technical side of things, despite what the article may say.

Perhaps that's a necessity of business. Maybe Alphabet doesn't want to pour more resources into Google Cloud without seeing more revenue. So Diane Greene is beefing up the numbers so she can get the technical resources she needs allocated. Maybe. My personal opinion remains doubtful, given Google's overarching track record, but time will tell.

Google Cloud is by no means unusable. It's a nightmare to work with, yes, but it does work, and if you've got the engineers on staff to handle the extra workload, then it's fine. We're going to continue to put up with it, because Google's particular mix of offerings allows us to save money and at the end of the day that's the common tongue. I just wish they wouldn't make my life more difficult.




Sorry to hear about your troubles! Assuming you got hit by our painful network outage, we can only repeat: sorry, and we have taken serious action internally to avoid this again.

To explain the difference between your experience (outages taking you out) and "Google.com", I'd guess the difference is that "Google.com" is massively distributed. Perhaps you were running in just one zone or region, or maybe even quite sophisticatedly running across two regions (say us-central1 and us-east1). For Google.com, we have 15 "major" datacenter locations (https://www.google.com/about/datacenters/inside/locations/in...) which are approximately regions in Compute Engine / Cloud parlance.

To your other question though, Cloud is not a side project. Google happens to be enormous, so even though we have thousands of folks across Technical Infrastructure (TI) working on Cloud, thousands divided by tens of thousands is still a "small" percentage (but TI is bigger than say YouTube or Android).

[Edit: And please reach out to support! Don't be silently unhappy, have someone call us up, even if it's just to strangle a PM about how difficult it is to use.]

Disclosure: I work on Compute Engine.


> To explain the difference between your experience (outages taking you out) and "Google.com"

My comment was somewhat terse (to remain as brief as possible) so perhaps it wasn't clear. My logic was "Google.com is supremely reliable, which is a supremely difficult task, so Google must have really good engineering chops, so their products must be really well engineered." In other words, I was saying that I had in my head an image of Google being filled with great engineers building generally well engineered products. I wasn't commenting on the reliability of the service; we haven't had any issues with that, as far as I can presently recall (we're only in us-central1-c).

> [Edit: And please reach out to support! Don't be silently unhappy, have someone call us up, even if it's just to strangle a PM about how difficult it is to use.]

Google provides no way to contact support without paying them for a support contract. It makes no sense for us to pay Google for the privilege of debugging their service/software. And I've crawled my way to support through sales before. For all my effort, and an inability to use the platform for a week, I was left with "Sorry, here's an issue tracker". The same kind of issue tracker that's filled with stagnant issues that are months or a year old.

The same argument could naively be made about AWS, which also charges for the privilege of reporting issues to them. But in all the years that I've used AWS I've only once needed to contact them with a problem. It was a billing issue, which they fixed, and then comp'd us with free service for the trouble. With Google Cloud we would need to diagnose, debug, document, and report an issue roughly every day of development.


Sorry to hear things are more difficult. I'd love to hear specifics, if you're interested in sharing. I'd love to try to get them fixed :)

And yeah, I'm a developer advocate on GCP.


It's a thousand paper cuts scenario; nothing big and specific, just running into little problems constantly. Honestly, that's worse than running into big problems. It makes us go from trying to accomplish what should be a simple change, to spending three hours hammering an API with random combinations of inputs to find the magical incantation that makes it work correctly.

Some examples:

Cloud Storage doesn't support multi-part uploads. The best it seems to have is the ability to compose objects. Honestly that's a better system than dodgy multi-part or resumable uploads, but there are hard restrictions on composing objects. You can't compose more than 32 objects, and you can't compose more than 2 layers deep. So with two iterations you can compose at most 1024 objects. That's not great for uploading large objects through our servers in small chunks. If our chunks are, say, 10MB then the largest final file size we can achieve is only 10 gigs.

On AWS when connecting EC2 to RDS we just threw together the VPC and then configured the EC2 servers with the RDS's hostname. Easy. On GCP we basically _had_ to use Cloud SQL Proxy. Now, again, it seems that Cloud SQL Proxy is a better system overall, but it required fiddling with our server setup, upgrading our MySQL library (which caused other issues), and other random dickery. Another annoyance.

We use Go for our backend servers, and GCP's Go API libraries are all autogenerated, and might as well not be documented. We frequently receive the opaque error "required: required" when trying to blindly figure out the API. It's become an office joke. "Why won't Ubuntu recognize this Wifi card?" "Because required: required man, obviously."

Google App Engine's dev_appserver.py completely broke after an update, caused in part by another Google library being installed (protobuf...). Still not sure if the fix was rolled into a release yet...

The web interface frequently breaks and requires manually refreshing, and it's generally slow and unresponsive on the best of days. It also loves to switch me to my personal account and throw errors because I don't have access to the project I was trying to access...

The "scopes" for launching a compute instance aren't documented to the extent that we know which ones provide what privileges. Really the whole privilege system on GCP is a mess and pales in comparison to AWS. I recall some obvious permissions were just outright missing a few weeks ago.

We have some Go code that uses the API to launch a compute instance. When launching an instance from the command line, the scopes seemed to require being accompanied by the service account "email" address, so in the Go code we specify the service account email and the scopes. One day during development I forgot to set the service account, didn't notice, and everything worked as normal...

I was not able to find an obvious place where preemptible instances report being killed. Not in the activity logs or the serial console log (which is not saved/available when the instance shuts down). *shrugs* I didn't feel like looking deeper into it.

Startup scripts specified when launching a compute instance run every time the instance starts. Makes sense in retrospect given the name, but it's in contrast to AWS, where the script runs once, and in contrast to the example startup script given in the documentation (which installs things, not something a script that runs on every boot should do). And it's not very helpful. A script that runs exactly once is more practical than one that runs on every boot.

Figuring out exactly how to cook up my own compute images in a format that GCP likes required finding a random video on YouTube from a Google developer.

Some of the documentation (this was either for Datastore or some part of App Engine) is actually just a bunch of marketing copy with no technical meat to it, leaving me to just assume how various features work (because they aren't actually documented anywhere else).

Strange new behavior from MySQL running on Cloud SQL (random lock contention) that we still haven't nailed down and never encountered on RDS.

Random networking failures on fresh compute instances.

Random upload failures to Cloud Storage.

Transferring objects from one bucket in Cloud Storage to another bucket using the transfer interface resulted in the ACLs being lost for all the objects.

Random things get deprecated every other week. Image aliases last week, something about the Cloud Storage metadata was weird the week before that, etc.

The CLI randomly failing to query for the list of compute instances for tab completion, instead just tab completing an instance that was deleted 10 minutes ago.


An HN comment was not sufficient space for me to write about the ways AWS drives me insane. So here is my blog post on the 1000 cuts from AWS: https://medium.com/google-cloud/the-future-of-cloud-computin.... I feel Google Cloud is much better engineered, focusing on developer happiness and productivity.

Talking specifically about my field (cloud, big data, and data science), it's so painful to build a decent data stack that can handle a few terabytes of data, let alone petabytes. Google Cloud (Pub/Sub, Dataflow & BigQuery) makes it a breeze to handle petabytes of data. You can literally debug a petabyte-scale pipeline while it's running. Unified logs, metrics, monitoring, and alerting are another feature that shows how well the Google Cloud platform is built with developers in mind.


Totally hear you on the death by a thousand tiny cuts :(

GCS provides multi-part and resumable uploads (https://cloud.google.com/storage/docs/json_api/v1/how-tos/up...), though I agree that the docs make it hard to find given how deeply they are nested. We use resumable uploads in Firebase Storage (mobile GCS: firebase.google.com/docs/storage) to great effect, and routinely upload some pretty huge files with no problems.

Definitely hear you on autogenerated libs sucking: the gcloud-* libs are designed to address some of those issues. gcloud-golang is still under development (https://github.com/googlecloudplatform/gcloud-golang), but might be a good place to start.

GCP is working to address a number of permissions issues with Cloud IAM (https://cloud.google.com/iam), which will provide more fine-grained control over resources. I believe Cloud PubSub already uses this model.

Firebase (which shares certain services with GCP) has free developer support (firebase.google.com/support), and as you can imagine, we're inundated with questions and have two teams working 24/7 to address them. Free developer support is a great thing for developers, but providing high-quality support at Google scale is probably the hardest thing to do; people just don't scale the same way machines do.

That's why so many of us are active on social media/HN/etc.: we want to talk directly to developers and get feedback so we can improve our products. We typically aim for high-quality feedback (like this, thank you :), where we can engage with savvy developers to solve their problems, or at least get actionable feedback to guide our roadmap ("x is a bad experience; have you considered y and z, which would save me n hours"). Ideally, this feedback trickles down into all areas of the product, and even across products (permissions, console changes, docs, etc.), though it can take some time to implement those changes.

(Disclosure: PM on Firebase, and work closely with Cloud)


Re:

> Google App Engine's dev_appserver.py completely broke after an update, caused in part by another Google library being installed (protobuf...). Still not sure if the fix was rolled into a release yet...

We've been having a lot of fun with how tricky namespace packages are in Python. We've got a fix in for this issue that should hopefully be in the next SDK release, and we're looking into ways to better isolate dev_appserver from the OS environment.

A simple workaround is to activate an empty virtualenv before running dev_appserver.


Thanks for the writeup! I'll poke the appropriate teams. Sorry about the paper cuts -_-


Sorry to hear about your bad experience with GCP. In my network there are a lot of people who speak well of it, specifically with an eye to performance. Would you say your difficulties are around the APIs and processes? Or have there been misaligned expectations?




