Speaking as a Linode user with all my eggs in their basket: I'm sad that they are falling so far behind other players like DigitalOcean. I'm looking more and more at DO these days just because of the speed at which they are able to deliver new features.
That being said, this is a great addition! Looking forward to trying it out when it reaches my datacenter. And I'm also looking forward to seeing what their next big project will be.
I don't know, I'm currently using Linode and am trying to keep my stuff as platform-agnostic as possible (just give me dumb Linux boxen).
Most of the great new features proliferating on other providers seem designed to encourage vendor lock-in. On my current project, I made one concession to my usual aversion to proprietary lock-in and went with AWS S3 for storage of user uploads (I also looked at Azure's offering, but my BizSpark application was denied, so meh), since there really is no equivalent. I'd thought about using Linode's beta block storage (which would probably have been more performant), but I was afraid I would be forever fighting with managing volumes and NFS mounts (or massively overbuying capacity).
S3-API-compatible object storage is pretty agnostic. Google supports it, for example. That would be a more meaningful announcement from Linode than this one, at least for me.
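For example, the same client code works against any S3-compatible endpoint just by swapping the URL and credentials. A minimal sketch with boto3 - the endpoint and keys below are placeholders, not any provider's real values:

    # The same boto3 client works against any S3-API-compatible store by
    # changing endpoint_url and credentials. Placeholder endpoint and keys.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.example-provider.com",  # swap per provider
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Upload a user file and hand back a time-limited download link.
    s3.upload_file("avatar.png", "user-uploads", "avatars/avatar.png")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "user-uploads", "Key": "avatars/avatar.png"},
        ExpiresIn=3600,
    )
    print(url)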
Why does block storage seem to take companies so long to implement? DigitalOcean only implemented it in the past year, when they already had hundreds of employees. It seems like it would be a priority feature, so I imagine a sizable team was working on it. Why did it take so long for DigitalOcean, and now Linode, to implement block storage? Are there some inherently architecture-dependent complexities that make it a deceptively difficult project?
I was the Engineering Manager for the Storage team at DigitalOcean that took the Block Storage project from conception to launch (though I no longer work there), so I might be able to shed some light.
In general, it's really hard to do at-scale network-backed storage - by the time your applications get access to the file system, there are a myriad of abstractions that aren't always receptive to the idea of the network "going away", or even a modicum of lag. On top of that, in order for it to be profitable, you need to operate at a massively shared scale. This means expensive SSDs, servers, and switches that require a lot of capex, with no guaranteed revenue because it's a new product. For us, this meant building entirely new network architecture in some places so that the massive amount of data being shared within the storage cluster and across VMs wouldn't overwhelm existing traffic, etc.
In order to provide the network reliability and persistence that a normal application expects, you need extremely strong consistency and low latency. Every data-protection strategy (replication and erasure coding) requires each write to touch more than one SSD/HDD/NVMe device before the write is acknowledged, and all of that needs to happen in a shared system with an immense amount of contention, every time.
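As a toy illustration (not how any particular product implements it), the write path conceptually looks like "fan the block out to several devices and only acknowledge once enough of them have persisted it":

    # Toy sketch of a replicated write path: the guest's write is only
    # acknowledged after a quorum of replica devices has persisted it.
    # Real systems also handle ordering, repair, membership, partial failures.
    import concurrent.futures

    REPLICAS = ["ssd-a", "ssd-b", "ssd-c"]   # hypothetical replica devices
    QUORUM = 2                               # e.g. 2 of 3 must persist

    def persist(replica, offset, data):
        # Stand-in for a network round trip plus fsync on the remote device;
        # in reality this is where most of the latency and contention lives.
        return True

    def replicated_write(offset, data):
        with concurrent.futures.ThreadPoolExecutor(len(REPLICAS)) as pool:
            futures = [pool.submit(persist, r, offset, data) for r in REPLICAS]
            acks = 0
            for f in concurrent.futures.as_completed(futures):
                if f.result():
                    acks += 1
                if acks >= QUORUM:
                    return  # safe to acknowledge the write to the guest
        raise IOError("write failed: quorum not reached")

    replicated_write(4096, b"\x00" * 4096)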
It takes a while because you only get one opportunity to get all of this right - it's one thing if the network has a few more blips in a month, or if there's a bit more CPU contention than you'd like, but you absolutely can't lose people's data.
I can understand why companies are so hesitant to do this - there may be technical debt in their software/network stack that makes it very difficult, or they may not want to proceed unless they have the right set of experts working on the project.
Not the OP, but I've been running production Ceph clusters for the past 4.5 years at two different Fortune 50 companies.
We've had very good success with Ceph for block storage and a fairly rough time with it for object storage. We're currently doing our best to improve it (both through our own upstream contributions and by collaborating with Red Hat).
From a technology standpoint, I think it is very interesting and for the most part has a lot of very good engineering.
However, it is fairly complex, and even today it's very easy to have a hard time with it when starting out. You really need to pay attention to every detail, and your hardware selection is extremely important.
It is extremely resilient and goes to great lengths to preserve your data. Ceph can be performant; however, that requires very good hardware and a very good network.
My experience is limited up to the Jewel release (we haven't upgraded to Luminous and we are not planning on using BlueStore anytime soon).
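For anyone curious what the block side looks like from code, here's a minimal sketch using the python-rados/python-rbd bindings that ship with Ceph. It assumes a reachable cluster, a standard /etc/ceph/ceph.conf, and an existing pool named "rbd"; the image name is made up:

    # Rough sketch: create a Ceph RBD image and write to it via the
    # python-rados / python-rbd bindings. Assumes a working cluster,
    # /etc/ceph/ceph.conf, and an existing pool called "rbd".
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("rbd")
        try:
            rbd.RBD().create(ioctx, "test-volume", 10 * 1024**3)  # 10 GiB image
            with rbd.Image(ioctx, "test-volume") as image:
                image.write(b"hello ceph", 0)   # write 10 bytes at offset 0
                print(image.size())
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()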
Ceph is good! I'm a bit removed from keeping up on the day-to-day, but I always respected it as being a solid and dependable piece of open source software.
There are options that will perform better, but they are almost always considerably more expensive than FOSS, and all have their own weird scaling quirks.
With the launch of BlueStore a few months ago as well as improvements in erasure coding, I wouldn't hesitate to take a look at it again if I was starting a new project.
Reliable storage software is very difficult to develop in general. If you want reliability and low cost, or reliability and performance, it's much harder.
In some sense, "cloud native" architecture has shifted all the hard problems into storage by treating all non-storage resources as transient. So storage is the one place where persistent state exists and you can't just reboot it into a clean state.
In short, at the scale any block service is intended to serve, HDDs/SSDs are going to fail, networks are going to go down, data corruption is going to happen, unpredictable latencies are going to appear, etc. etc. All these failures are destined to happen at a not-so-infrequent rate. Offering the abstraction/illusion of an "HDD in the cloud that never fails" (or a many-9s SLA) despite all of that is pretty hard to pull off.
Passing block-level storage through a virtualization platform is not an easy task, because you're talking about a physical connection to the block storage (i.e. fiber, 10G copper, etc.) passing through the hypervisor to become accessible at the VM level. When dealing with a cluster of hypervisors, making that block-level storage accessible, and portable, across the entire infrastructure requires a serious level of engineering effort.
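For a sense of what that plumbing can look like, here's a rough sketch (not necessarily what Linode or DO actually do) of attaching a network-backed volume - a Ceph RBD image in this example - to a running guest via libvirt. The pool, volume, monitor host, and domain names are all invented, and auth is omitted:

    # Illustrative only: hot-attach a network-backed disk to a guest with
    # libvirt/QEMU. The guest just sees a new virtio disk (/dev/vdb); the
    # hypervisor is the one talking RBD over the network.
    import libvirt

    DISK_XML = """
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='rbd' name='volumes/customer-vol-123'>
        <host name='mon1.example.internal' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
    </disk>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("customer-vm")   # hypothetical guest name
    dom.attachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    conn.close()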
Plus all the existing difficulty of running a highly available, highly performant SAN at scale, before you even start counting the added complexity of virtualization. That alone is a task of a difficulty level that many have forgotten, what with AWS et al. handling most of it for us for the last decade.
Bytemark's Cloud Servers went all-in on network storage in 2012, 10 years after our first platform, which used only local storage. The main motivation was so all our servers would be live-migratable in case of hardware trouble, but it also means customers aren't having to choose between "local and fast" or "remote and flexible" - we've spent years making "remote and fast" work smoothly.
But yes, it was a hard road to get there, particularly in the days when Linux's 10Gbps drivers and btrfs were less good than they are now, and we also needed to write a new NBD server for all the live migration to work smoothly (https://github.com/BytemarkHosting/flexnbd-c).
To pile on what others are saying, implementing reliable, functional distributed networked storage at scale is an insanely difficult engineering task. EBS (AWS Elastic Block Storage) was notorious for years for causing problems both externally (reddit) and internally (causing AWS outages). If anything, one should be impressed that smaller companies like DO and Linode are able to offer block storage at all.
I'm in a similar basket with my non-work projects and to be honest, most of what Linode is missing is available if you use multiple providers.
For instance, I use OVH object storage in conjunction with an image hosting site on Linode. The latency from OVH Canada to Newark is small enough that it's pretty seamless, and if OVH Canada goes down, I can use an EU location with higher latency. Linode fails over to their London location (assuming there is enough VM availability; the spin-up may be automated, but I run 0 webservers there 99.9% of the time).
Personally, I wouldn't pick DO because they tend to have poorly disclosed problems with their "new features" that you really only get an answer to if you contact support or dig through their documentation. For instance, DO's object store doesn't handle index files at all, but everyone else's does. So if you try to switch, you almost immediately end up with an "eh... srsly?" moment.
Linode and other hosts at least deploy the standard feature set when they expect money from you.
I remember when I interviewed with them, they were adamant about catching up with AWS. I asked what they thought of Linode and they said "not on our radar." They billed themselves as an AWS competitor, but I just don't see it.
I create DigitalOcean droplets on-demand via their API and roughly 10% of the requests just fail with a generic error message, and then get into an error state with "failure to create droplet."
It's... pretty frustrating, so I'm looking for a new provider with a simple API.
Hey Justin, Danny from DO here :-). Can you open a ticket and make sure you add "funwithjustin" in the email? We want to make sure you're having fun and not buzzkilled.
@ghshephard @riffic @tyingq Your feedback is really important to us; we internalize it and make sure it is heard loud and clear by our Product leads. Totally appreciate the feedback.
Yup - I've had the same experience with DigitalOcean through their web interface - though it feels closer to 20% for me. It always seems to work the second time I try to create a droplet, so it's never really bothered me - though I did find it rather... odd? One would think that creating new droplets would be absolutely foolproof by now.
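For anyone else hitting this, blindly retrying the create call papers over it well enough. A minimal sketch against the DO v2 droplets API - the token, slugs, and names here are placeholders, swap in whatever your account actually offers:

    # Retry wrapper around droplet creation via the DigitalOcean v2 API.
    # Token, region/size/image slugs, and the droplet name are placeholders.
    import time
    import requests

    API = "https://api.digitalocean.com/v2/droplets"
    HEADERS = {"Authorization": "Bearer YOUR_DO_TOKEN"}
    BODY = {"name": "worker-1", "region": "nyc3",
            "size": "1gb", "image": "ubuntu-16-04-x64"}

    def create_droplet(retries=3):
        for attempt in range(1, retries + 1):
            resp = requests.post(API, headers=HEADERS, json=BODY)
            if resp.status_code == 202:          # DO answers 202 Accepted on success
                return resp.json()["droplet"]
            time.sleep(5 * attempt)              # brief backoff, then try again
        raise RuntimeError(f"create failed after {retries} attempts: {resp.text}")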
I had the worst customer support experience with Digital Ocean in my whole life. Having been a Dreamhost customer for over a decade I thought DO would treat me well. I got really pissed off by how bad they handled things so I tried Linode instead. Man, how happy I am now. I don't care if DO is doing better, I care that Linode treats me with respect as a paying customer.
Yeah, I got turned off DO when a bunch of data was deleted when I resized a node. My fault for not backing up first, but the support folks at DO didn’t have to be quite so snarky about it.
I wonder how difficult it would be to build an S3-like interface on top of this, so you could get coarse pay-for-what-you-use pricing, avoid downtime, and also have a much larger capacity than the 10TB maximum for one drive.
You might be able to build this on top of minio. Start with, say, 4 1GB Linodes (the smallest), with 8 volumes each (the max) at the smallest volume size of 10GB, and a somewhat low 1:3 parity in minio (redundancy is mostly handled by Linode's replication). That would be 320GB * $0.10 + 4 * $5 = $52/mo to start with. Past some utilization threshold, incrementally resize all drives to grow capacity; the parity drives would cover for a drive while it's offline and resizing. The parity is also enough to resize the Linodes one at a time if you need to increase their compute capacity. This system could grow up to 320TB raw / 240TB accessible.
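To spell out the arithmetic (assuming the $0.10/GB/mo volume pricing and $5/mo 1GB Linodes used above):

    # Back-of-envelope cost model for the minio-on-Linode idea above.
    # Assumes $0.10/GB/month block storage, $5/month 1GB Linodes,
    # 4 nodes x 8 volumes, and "1:3 parity" (1 of every 4 volumes is parity).
    NODES, VOLS_PER_NODE = 4, 8
    GB_PRICE, NODE_PRICE = 0.10, 5.00
    PARITY_FRACTION = 1 / 4

    def monthly_cost(gb_per_volume):
        raw_gb = NODES * VOLS_PER_NODE * gb_per_volume
        usable_gb = raw_gb * (1 - PARITY_FRACTION)
        cost = raw_gb * GB_PRICE + NODES * NODE_PRICE
        return raw_gb, usable_gb, cost

    print(monthly_cost(10))       # start: 320 GB raw, 240 GB usable, $52/mo
    print(monthly_cost(10_000))   # max:   320 TB raw, 240 TB usable, ~$32k/mo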
The last time I poked around with this idea, when Linode block storage was introduced, this "should" work with minio, but I got the impression they didn't really consider this kind of use case.
If you run parity at a higher level of abstraction than Linode volumes and can tolerate up to 8 drives / 1 server going down at once (which this minio setup should get you), that would go a long way toward adding one or two 9's.
High availability is just one S3 feature, though. Other important features are paying for what you use and effectively unlimited storage growth (you would probably revisit this before you hit the $32k/mo that the 320TB would cost). Even if this didn't add reliability, those other features still have utility above the raw Block Storage Volumes Linode is providing here.
I wish these block storage services gave you some idea of failure rate/durability and availability. Amazon publishes some rough volume loss rates but not even Google tells you what kind of durability to expect out of a persistent volume. They all say they are tri-replicated, which semi-implies highly durable storage. What about availability?
Lastly, I'd love to know whether DO/Linode have rolled their own solution or are using Ceph or something similar. Not that I don't trust them, but they aren't recruiting the same engineers as Google.
Based on their open jobs listings, DigitalOcean is using Ceph. I really hope (for everyone's sake) that Linode didn't roll their own solution.
As someone who runs ~100 Ceph clusters at multi-petabyte scale: publishing availability and durability SLAs is not an easy task, though not impossible either.
Since they just started offering it, Linode probably doesn't have accurate statistics to share, and most people can't correctly interpret very small probabilities anyway. They'd probably be better off saying something like "you should assume that each volume will fail at some point in its life".
They have been offering it since June, FWIW. And it is worth knowing the order of magnitude of expected failure rates compared to just running against the local SSD.
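To make "order of magnitude" concrete, here's a toy calculation that uses the roughly 0.1-0.2% annualized volume failure rate Amazon publishes for EBS purely as a reference point - Linode publishes nothing comparable, so treat the numbers as illustrative only:

    # Toy order-of-magnitude estimate of expected volume failures per year,
    # using the 0.1%-0.2% AFR range Amazon publishes for EBS as a reference.
    def expected_failures(volumes, afr):
        return volumes * afr

    for afr in (0.001, 0.002):
        print(f"AFR {afr:.1%}: ~{expected_failures(1000, afr):.1f} "
              f"failed volumes/year per 1,000 volumes")

    # Compare with a single local SSD: one device failure takes the data with
    # it unless you run your own replication/backups on top.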