Speaking as a Linode user with all my eggs in their basket: I'm sad that they are falling so far behind other players like DigitalOcean. I'm looking more and more at DO these days just because of the speed at which they are able to deliver new features.
That being said, this is a great addition! Looking forward to trying it out when it reaches my datacenter. And I'm also looking forward to seeing what their next big project will be.
I don't know, I'm currently using Linode and am trying to keep my stuff as platform-agnostic as possible (just give me dumb Linux boxen).
Most of the great new features proliferating on other providers seem designed to encourage vendor lock-in. On my current project, I made one concession to my usual aversion to proprietary lock-in and went with AWS S3 for storage of user uploads (I also looked at Azure's offering, but my BizSpark application was denied, so meh), since there really is no equivalent. I'd thought about using Linode's beta block storage (which would probably have been more performant), but I was afraid I would be forever fighting with managing volumes and NFS mounts (or massively overbuying capacity).
S3-API-compatible object storage is pretty agnostic. Google supports it, for example. That would be a more meaningful announcement from Linode than this one, at least for me.
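For example, the same client code works against any S3-compatible endpoint just by swapping the URL and credentials. A minimal sketch with boto3 - the endpoint and keys below are placeholders, not any provider's real values:

    # The same boto3 client works against any S3-API-compatible store by
    # changing endpoint_url and credentials. Placeholder endpoint and keys.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.example-provider.com",  # swap per provider
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Upload a user file and hand back a time-limited download link.
    s3.upload_file("avatar.png", "user-uploads", "avatars/avatar.png")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "user-uploads", "Key": "avatars/avatar.png"},
        ExpiresIn=3600,
    )
    print(url)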
Why does block storage seem to take companies so long to implement? DigitalOcean only implemented it in the past year, when they already had hundreds of employees. It seems like it would be a priority feature, so I imagine a sizable team was working on it. Why did it take so long for DigitalOcean, and now Linode, to implement block storage? Are there some inherently architecture-dependent complexities that make it a deceptively difficult project?
I was the Engineering Manager for the Storage team at DigitalOcean that took the Block Storage project from conception to launch (though I no longer work there), so I might be able to shed some light.
In general, it's really hard to do at-scale network-backed storage - by the time your applications get access to the file system, there are a myriad of abstractions that aren't always receptive to the idea of the network "going away", or even a modicum of lag. On top of that, in order for it to be profitable, you need to operate at a massively shared scale. This means expensive SSDs, servers, and switches that require a lot of capex, with no guaranteed revenue because it's a new product. For us, this meant building entirely new network architecture in some places so that the massive amount of data being shared within the storage cluster and across VMs wouldn't overwhelm existing traffic, etc.
In order to provide the network reliability and persistence that a normal application expects, you need extremely strong consistency and low latency. Every data-protection strategy (replication and erasure coding) requires each write to touch more than one SSD/HDD/NVMe device before the write is acknowledged, and all of that needs to happen in a shared system with an immense amount of contention, every time.
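As a toy illustration (not how any particular product implements it), the write path conceptually looks like "fan the block out to several devices and only acknowledge once enough of them have persisted it":

    # Toy sketch of a replicated write path: the guest's write is only
    # acknowledged after a quorum of replica devices has persisted it.
    # Real systems also handle ordering, repair, membership, partial failures.
    import concurrent.futures

    REPLICAS = ["ssd-a", "ssd-b", "ssd-c"]   # hypothetical replica devices
    QUORUM = 2                               # e.g. 2 of 3 must persist

    def persist(replica, offset, data):
        # Stand-in for a network round trip plus fsync on the remote device;
        # in reality this is where most of the latency and contention lives.
        return True

    def replicated_write(offset, data):
        with concurrent.futures.ThreadPoolExecutor(len(REPLICAS)) as pool:
            futures = [pool.submit(persist, r, offset, data) for r in REPLICAS]
            acks = 0
            for f in concurrent.futures.as_completed(futures):
                if f.result():
                    acks += 1
                if acks >= QUORUM:
                    return  # safe to acknowledge the write to the guest
        raise IOError("write failed: quorum not reached")

    replicated_write(4096, b"\x00" * 4096)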
It takes a while because you only get one opportunity to get all of this right - it's one thing if the network has a few more blips in a month, or if there's a bit more CPU contention than you'd like, but you absolutely can't lose people's data.
I can understand why companies are so hesitant to do this - there may be technical debt in their software/network stack that makes it very difficult, or they may not want to proceed unless they have the right set of experts working on the project.
Not the OP, but I've been running production Ceph clusters for the past 4.5 years at two different Fortune 50 companies.
We've had very good success with Ceph for block storage and a fairly rough time with it for object storage. We're currently doing our best to improve it (both through our own upstream contributions and by collaborating with Red Hat).
From a technology standpoint, I think it is very interesting and for the most part has a lot of very good engineering.
However, it is fairly complex, and even today it's very easy to have a hard time with it when starting out. You really need to pay attention to every detail, and your hardware selection is extremely important.
It is extremely resilient and goes to great lengths to preserve your data. Ceph can be performant; however, that requires very good hardware and a very good network.
My experience is limited up to the Jewel release (we haven't upgraded to Luminous and we are not planning on using BlueStore anytime soon).
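For anyone curious what the block side looks like from code, here's a minimal sketch using the python-rados/python-rbd bindings that ship with Ceph. It assumes a reachable cluster, a standard /etc/ceph/ceph.conf, and an existing pool named "rbd"; the image name is made up:

    # Rough sketch: create a Ceph RBD image and write to it via the
    # python-rados / python-rbd bindings. Assumes a working cluster,
    # /etc/ceph/ceph.conf, and an existing pool called "rbd".
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("rbd")
        try:
            rbd.RBD().create(ioctx, "test-volume", 10 * 1024**3)  # 10 GiB image
            with rbd.Image(ioctx, "test-volume") as image:
                image.write(b"hello ceph", 0)   # write 10 bytes at offset 0
                print(image.size())
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()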
Ceph is good! I'm a bit removed from keeping up on the day-to-day, but I always respected it as being a solid and dependable piece of open source software.
There are options that will perform better, but they are almost always considerably more expensive than FOSS, and all have their own weird scaling quirks.
With the launch of BlueStore a few months ago as well as improvements in erasure coding, I wouldn't hesitate to take a look at it again if I was starting a new project.
Reliable storage software is very difficult to develop in general. If you want reliability and low cost, or reliability and performance, it's much harder.
In some sense, "cloud native" architecture has shifted all the hard problems into storage by treating all non-storage resources as transient. So storage is the one place where persistent state exists and you can't just reboot it into a clean state.
In short, at the scale any block service is intended to serve, HDDs/SSDs are going to fail, networks are going to go down, data corruption is going to happen, unpredictable latencies are going to appear, etc. etc. All these failures are destined to happen at a not-so-infrequent rate. Offering the abstraction/illusion of an "HDD in the cloud that never fails" (or a many-9s SLA) despite all of that is pretty hard to pull off.
Passing block-level storage through a virtualization platform is not an easy task, because you're talking about a physical connection to the block storage (i.e. fiber, 10G copper, etc.) passing through the hypervisor to become accessible at the VM level. When dealing with a cluster of hypervisors, making that block-level storage accessible, and portable, across the entire infrastructure requires a serious level of engineering effort.
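For a sense of what that plumbing can look like, here's a rough sketch (not necessarily what Linode or DO actually do) of attaching a network-backed volume - a Ceph RBD image in this example - to a running guest via libvirt. The pool, volume, monitor host, and domain names are all invented, and auth is omitted:

    # Illustrative only: hot-attach a network-backed disk to a guest with
    # libvirt/QEMU. The guest just sees a new virtio disk (/dev/vdb); the
    # hypervisor is the one talking RBD over the network.
    import libvirt

    DISK_XML = """
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='rbd' name='volumes/customer-vol-123'>
        <host name='mon1.example.internal' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
    </disk>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("customer-vm")   # hypothetical guest name
    dom.attachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    conn.close()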
Plus all the existing difficulty of running a highly available, highly performant SAN at scale, before you even start counting the added complexity of virtualization. That alone is a task of a difficulty level that many have forgotten, what with AWS et al. handling most of it for us for the last decade.
Bytemark's Cloud Servers went all-in on network storage in 2012, 10 years after our first platform, which used only local storage. The main motivation was so all our servers would be live-migratable in case of hardware trouble, but it also means customers aren't having to choose between "local and fast" or "remote and flexible" - we've spent years making "remote and fast" work smoothly.
But yes, it was a hard road to get there, particularly in the days when Linux's 10Gbps drivers and btrfs were less good than they are now, and we also needed to write a new NBD server for all the live migration to work smoothly (https://github.com/BytemarkHosting/flexnbd-c).
To pile on what others are saying, implementing reliable, functional distributed networked storage at scale is an insanely difficult engineering task. EBS (AWS Elastic Block Storage) was notorious for years for causing problems both externally (reddit) and internally (causing AWS outages). If anything, one should be impressed that smaller companies like DO and Linode are able to offer block storage at all.
I'm in a similar basket with my non-work projects and to be honest, most of what Linode is missing is available if you use multiple providers.
For instance, I use OVH object storage in conjunction with an image hosting site on Linode. The latency from OVH Canada to Newark is small enough that it's pretty seamless, and if OVH Canada goes down, I can use an EU location with higher latency. Linode fails over to their London location (assuming there is enough VM availability; the spin-up may be automated, but I run 0 webservers there 99.9% of the time).
Personally, I wouldn't pick DO because they tend to have poorly disclosed problems with their "new features" that you really only get an answer to if you contact support or dig through their documentation. For instance, DO's object store doesn't handle index files at all, but everyone else's does. So if you try to switch, you almost immediately end up with an "eh... srsly?" moment.
Linode and other hosts at least deploy the standard feature set when they expect money from you.
I remember when I interviewed with them, they were adamant about catching up with AWS. I asked what they thought of Linode and they said "not on our radar." They billed themselves as an AWS competitor, but I just don't see it.
I create DigitalOcean droplets on-demand via their API and roughly 10% of the requests just fail with a generic error message, and then get into an error state with "failure to create droplet."
It's... pretty frustrating, so I'm looking for a new provider with a simple API.
Hey Justin, Danny from DO here :-). Can you open a ticket and make sure you add "funwithjustin" in the email? We want to make sure you're having fun and not buzzkilled.
@ghshephard @riffic @tyingq Your feedback is really important to us; we internalize it and make sure it is heard loud and clear by our Product leads. Totally appreciate the feedback.
Yup - I've had the same experience with DigitalOcean through their web interface - though it feels closer to 20% for me. It always seems to work the second time I try to create a droplet, so it's never really bothered me - though I did find it rather... odd? One would think that creating new droplets would be absolutely foolproof by now.
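For anyone else hitting this, blindly retrying the create call papers over it well enough. A minimal sketch against the DO v2 droplets API - the token, slugs, and names here are placeholders, swap in whatever your account actually offers:

    # Retry wrapper around droplet creation via the DigitalOcean v2 API.
    # Token, region/size/image slugs, and the droplet name are placeholders.
    import time
    import requests

    API = "https://api.digitalocean.com/v2/droplets"
    HEADERS = {"Authorization": "Bearer YOUR_DO_TOKEN"}
    BODY = {"name": "worker-1", "region": "nyc3",
            "size": "1gb", "image": "ubuntu-16-04-x64"}

    def create_droplet(retries=3):
        for attempt in range(1, retries + 1):
            resp = requests.post(API, headers=HEADERS, json=BODY)
            if resp.status_code == 202:          # DO answers 202 Accepted on success
                return resp.json()["droplet"]
            time.sleep(5 * attempt)              # brief backoff, then try again
        raise RuntimeError(f"create failed after {retries} attempts: {resp.text}")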
I had the worst customer support experience with Digital Ocean in my whole life. Having been a Dreamhost customer for over a decade I thought DO would treat me well. I got really pissed off by how bad they handled things so I tried Linode instead. Man, how happy I am now. I don't care if DO is doing better, I care that Linode treats me with respect as a paying customer.
Yeah, I got turned off DO when a bunch of data was deleted when I resized a node. My fault for not backing up first, but the support folks at DO didn’t have to be quite so snarky about it.
I wonder how difficult it would be to build an S3-like interface on top of this, so you could get coarse pay-for-what-you-use pricing, avoid downtime, and also have a much larger capacity than the 10TB maximum for one drive.
You might be able to build this on top of minio. Start with, say, 4 1GB Linodes (the smallest), with 8 volumes each (the max) at the smallest volume size of 10GB, and a somewhat low 1:3 parity in minio (redundancy is mostly handled by Linode's replication). That would be 320GB * $0.10 + 4 * $5 = $52/mo to start with. Past some utilization threshold, incrementally resize all drives to grow capacity; the parity drives would cover for a drive while it's offline and resizing. The parity is also enough to resize the Linodes one at a time if you need to increase their compute capacity. This system could grow up to 320TB raw / 240TB accessible.
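To spell out the arithmetic (assuming the $0.10/GB/mo volume pricing and $5/mo 1GB Linodes used above):

    # Back-of-envelope cost model for the minio-on-Linode idea above.
    # Assumes $0.10/GB/month block storage, $5/month 1GB Linodes,
    # 4 nodes x 8 volumes, and "1:3 parity" (1 of every 4 volumes is parity).
    NODES, VOLS_PER_NODE = 4, 8
    GB_PRICE, NODE_PRICE = 0.10, 5.00
    PARITY_FRACTION = 1 / 4

    def monthly_cost(gb_per_volume):
        raw_gb = NODES * VOLS_PER_NODE * gb_per_volume
        usable_gb = raw_gb * (1 - PARITY_FRACTION)
        cost = raw_gb * GB_PRICE + NODES * NODE_PRICE
        return raw_gb, usable_gb, cost

    print(monthly_cost(10))       # start: 320 GB raw, 240 GB usable, $52/mo
    print(monthly_cost(10_000))   # max:   320 TB raw, 240 TB usable, ~$32k/mo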
The last time I poked around with this idea, when Linode block storage was introduced, this "should" work with minio, but I got the impression they didn't really consider this kind of use case.
If you run parity at a higher level of abstraction than Linode volumes and can tolerate up to 8 drives / 1 server going down at once (which this minio setup should get you), that would go a long way toward adding one or two 9's.
High availability is just one S3 feature, though. Other important features are paying for what you use and effectively unlimited storage growth (you would probably revisit this before you hit the $32k/mo that the 320TB would cost). Even if this didn't add reliability, those other features still have utility above the raw Block Storage Volumes Linode is providing here.
I wish these block storage services gave you some idea of failure rate/durability and availability. Amazon publishes some rough volume loss rates but not even Google tells you what kind of durability to expect out of a persistent volume. They all say they are tri-replicated, which semi-implies highly durable storage. What about availability?
Lastly, I'd love to know whether DO/Linode have rolled their own solution or are using Ceph or something similar. Not that I don't trust them, but they aren't recruiting the same engineers as Google.
Based on their open jobs listings, DigitalOcean is using Ceph. I really hope (for everyone's sake) that Linode didn't roll their own solution.
As someone who runs ~100 Ceph clusters at multi-petabyte scale: publishing availability and durability SLAs is not an easy task, though not impossible either.
Since they just started offering it, Linode probably doesn't have accurate statistics to share, and most people can't correctly interpret very small probabilities anyway. They'd probably be better off saying something like "you should assume that each volume will fail at some point in its life".
They have been offering it since June, FWIW. And it is worth knowing the order of magnitude of expected failure rates compared to just running against the local SSD.
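To make "order of magnitude" concrete, here's a toy calculation that uses the roughly 0.1-0.2% annualized volume failure rate Amazon publishes for EBS purely as a reference point - Linode publishes nothing comparable, so treat the numbers as illustrative only:

    # Toy order-of-magnitude estimate of expected volume failures per year,
    # using the 0.1%-0.2% AFR range Amazon publishes for EBS as a reference.
    def expected_failures(volumes, afr):
        return volumes * afr

    for afr in (0.001, 0.002):
        print(f"AFR {afr:.1%}: ~{expected_failures(1000, afr):.1f} "
              f"failed volumes/year per 1,000 volumes")

    # Compare with a single local SSD: one device failure takes the data with
    # it unless you run your own replication/backups on top.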