Mesosphere Announces First Data Center OS and $36M in Funding

brandonb · on Dec 8, 2014

Congrats to the team! Ben was one of the most brilliant students in my undergrad class and it's awesome to see him employ his talents in such a big way.

I worked on healthcare.gov last year and it's hard to overstate the potential impact of a tool like DCOS. At one point, we had 2000+ VMs, most manually configured, with no monitoring, and completely different configs between dev, test, and prod (not intentionally). Straightforward operations like migrating half of the database servers from one VLAN to the other took months, small mistakes like changing a database password could result in hours-long outages, and simply getting the data that Mesosphere displays automatically would sometimes take weeks and other times simply be impossible.

Of course, clean devops hygiene would have eliminated much of the pain in the first place, but not every organization has the expertise to do things right. In fact, most don't, and the solution for most organizations is good tooling that automates as much of the system as possible and provides good development discipline for the rest.

23david · on Dec 8, 2014

Having 2000+ VMs without automation or monitoring in place is pretty unusual. Especially for a site like healthcare.gov which must have had all sorts of HIPAA and data security requirements.

But it's honestly not hard to hook all those machines up to a system like Chef/Puppet/Saltstack/Ansible and start automating common tasks within a few days.

Migrating databases between networks and rotating passwords would generally be outside of the scope of a tool like Mesosphere. Again, this is something that can be easily handled by existing automation tools. With most databases though, there's nothing straightforward about migrating data to nodes on different vlans or password rotations. If it's a one-time task, I recommend hiring a database consulting firm to do the migration or rotation.

I think that having good defaults and enforcing best practices is a good idea. But I think that a lot of this can be achieved with existing tools. IMO, it makes more sense for organizations to automate and orchestrate Mesos deployments via existing/mature DevOps tools like Chef/Puppet/Ansible/Saltstack. Would also be exciting to see deployments working via NixOps/NixOS.

KaiserPro · on Dec 8, 2014

Older person here.

What's nice is that people are thinking about supercomputers again, even if they insist on calling them "the cloud".

First things first, look up Beowulf (http://en.wikipedia.org/wiki/Beowulf_cluster), which is a suite of tools that implements a multi-machine scheduler and a message-passing interface. What's nice is that if one host is overloaded, it can migrate the process to another. (However, I'm not sure what performance is like nowadays.)

In the world of VFX for movies, we've been dealing with schedulers for years. Programmes like alfred, tractor, qube and deadline can dispatch tasks, and deal with dependencies at massive scale.

The first thing to note is that "DCOS" is really a discrete set of parts; a scheduler, machine state enforcement, network config, Storage, and the underlying OS.

With careful planning, the state enforcement tool (puppet and the like) can take care of all of these tasks except a global scheduler.

The beauty of the VFX schedulers is that they understand dependencies really well (I need x to complete before I run y; I need feature z to run). A lot of newer schedulers really don't understand this concept well.
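To make the dependency point concrete, here is a minimal Python sketch of dependency-aware dispatch. It is not how alfred, tractor, qube or deadline work internally; the task names, the `requires` table and the `machines` table are all made up for illustration. It only shows the two ideas mentioned above: a task becomes ready once everything it depends on has completed, and a task may only land on a machine that offers the features it needs.

```python
from graphlib import TopologicalSorter

# Hypothetical job: task name -> set of tasks that must finish first.
dependencies = {
    "render_frames": {"simulate_cloth", "bake_lighting"},
    "composite": {"render_frames"},
    "simulate_cloth": set(),
    "bake_lighting": set(),
}

# Hypothetical feature requirements ("I need feature z to run").
requires = {"render_frames": {"gpu"}}

# Hypothetical machines and the features each one offers.
machines = {"node-a": {"gpu"}, "node-b": set()}


def eligible_machines(task):
    """Return the machines that offer every feature the task needs."""
    needed = requires.get(task, set())
    return sorted(name for name, feats in machines.items() if needed <= feats)


def dispatch_waves(deps):
    """Yield batches of tasks whose dependencies have all completed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)  # assumption: every task in the wave succeeded


waves = list(dispatch_waves(dependencies))
for wave in waves:
    for task in wave:
        print(task, "->", eligible_machines(task))
```

A real scheduler would also have to handle failures, retries and priorities; the sketch assumes every task succeeds.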

KaiserPro · on Dec 8, 2014

It's really important to understand that puppet and the like cannot (without heavy engineering) act as a task dispatcher. The big feature of "DCOS" is task distribution.

In Linux terms it's like comparing the CPU scheduler to chmod. Yes, you can make a program runnable by chmoding a file to +x, but it's the scheduler that is responsible for making sure the programme gets CPU time.

23david · on Dec 8, 2014

Task distribution primitives are handled by Mesos and implemented in the various apps running on Mesos (Marathon, Chronos, etc.)

It seems inaccurate to call this a feature of DCOS.

Puppet and the like can easily configure and manage Mesos and the various apps running on top, giving all the task dispatch functionality goodness.

DCOS will hopefully be a great integrated Mesos distribution. Seems to me that by supporting machine provisioning and having a per-node licensing model, it's being positioned as a direct competitor to Openstack and VMWare.

kelseyhightower · on Dec 8, 2014

Disclosure: CoreOS employee; Kubernetes contributor.

After reading the title I was a bit tempted to call BS since I think Kubernetes should have the rights to "The first Datacenter OS" tagline[1]. However, after reading up on the project details at https://mesosphere.com/learn I can see how Mesosphere came to the conclusion of being a DCOS, if not necessarily the first. Mesosphere goes a bit further than Kubernetes and offers a solution to the storage problem and attempts to address other "userland" concerns by shipping Apache Spark, Cassandra, Kafka, and Hadoop. So maybe it would be more accurate to call this a datacenter distro on top of the Kubernetes kernel?

Regardless, I think the concept of a datacenter OS will be the key to commoditizing IaaS providers and leveling the playing field in terms of features and usability for those who have not given up on the dream of running a "private" cloud.

Why will the DCOS work where others have failed?

Current solutions aimed at taming the datacenter operate at the machine/VM level, which exposes the OS for each machine, and completely punts on the application. Guess who gets to stitch it all back together? A DCOS is designed to manage applications directly, commonly via application containers, which means we can treat the OS running on the underlying machines like firmware and limit our interactions to basic updates and minimal configuration -- think CoreOS.

What about PaaS?

That's a topic worthy of a lengthy discussion, but I think it boils down to the lack of control found in most PaaS platforms[2]. In order for a PaaS offering to be successful it must make opinionated decisions about how to deploy and run applications; a bit too inflexible for most people. On the other hand, a DCOS seems to hit the sweet spot between IaaS and PaaS.

[1] I'm sure you can make an argument for Joyent's SmartDataCenter (https://www.joyent.com/private-cloud) as well. [2] Deis (http://deis.io/overview) attempts to address this issue.

23david · on Dec 8, 2014

Curious why you wouldn't consider Openstack to be a Datacenter OS? :-)

Zariel · on Dec 8, 2014

Isn't it running on top of the Mesos kernel, which has been around for longer than Kubernetes?

presspot · on Dec 8, 2014

A bit of the history:

The Mesosphere DCOS is built around the Apache Mesos kernel

The Mesos kernel was developed at UC Berkeley in 2009 [1].

Spark was written as a sample app on top of it [2].

Ben Hindman and his colleagues at the UC Berkeley AmpLab had always envisioned Mesos as a kernel inside of a full-blown operating system [3]. They finally brought it to market.

[1] https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/...

[2] "We have implemented Mesos in 10,000 lines of C++. The system scales to 50,000 (emulated) nodes and uses ZooKeeper for fault tolerance. To evaluate Mesos, we have ported three cluster computing systems to run over it: Hadoop, MPI, and the Torque batch scheduler. To validate our hypothesis that specialized frameworks provide value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel operations, and shown that Spark can outperform Hadoop by 10x in iterative machine learning workloads." ibid.

[3] http://people.csail.mit.edu/matei/papers/2011/hotcloud_datac...

Zariel · on Dec 8, 2014

Yep, it also had input from Google around the time they were deploying (or building) Omega, their new scheduler to replace Borg.

presspot · on Dec 8, 2014

So, yes, Mesos predates Kubernetes by many years.

kelseyhightower · on Dec 8, 2014

That was my first question as well, but there must be a reason why Kubernetes was brought in.

kozyraki · on Dec 8, 2014

My view on this is the following: Mesos is similar to the kernel of a conventional operating system (e.g. Linux). It provides very basic services (scheduling, interrupts, device management, etc.) and a syscall API. But nobody wants to program to this API. Hence you need libc or other similar libraries to provide a higher-level API that programmers use to interact with the kernel. Kubernetes, Marathon, Aurora, etc. are such libraries, each optimizing for a different class of applications and providing different functionality. The two (the kernel and the libraries) need each other.

boulos · on Dec 8, 2014

I see what you're saying Christos, but I think I'd prefer "runtime" over library (library sounds like it's just a little convenience).

kozyraki · on Dec 9, 2014

Sure, pthreads and other thread libraries that include a lot of runtime functionality would probably be a better analogy.

presspot · on Dec 8, 2014

I see Kubernetes as more of a programming model, not an operating system. It provides a way to express services and have them scheduled onto a datacenter. The rocket-science of how you schedule those workloads and optimize them in the same partition as other workloads is what you need a DCOS for.

In the Mesosphere world, Kubernetes is a "datacenter service" which is installed on your datacenter so that you can run Kubernetes workloads. You might also want to install DEIS to run DEIS-organized workloads. Or Spark, for Spark workloads... and so on -- all multitenant in the cluster. This is what the DCOS is uniquely good at, and why it qualifies as a true operating system.

fsaintjacques · on Dec 8, 2014

I didn't see any reference to storage solution, am I misreading?

andyidsinga · on Dec 8, 2014

I was just thinking the same thing... how are block and object storage supported under HDFS, etc.?

michaelsbradley · on Dec 8, 2014

So, the commoditization of on-demand and highly scalable virtual computing infrastructures, together with the rising popularity of "containerization" for app and service composition, seems to be creating an "orchestration crisis", or an "orchestration business opportunity," depending on your vantage point.

Are we about to see the emergence of what might be termed "new wave mainframe" computing?

randomsearch · on Dec 8, 2014

> be creating an "orchestration crisis"

Spot on, and very well put.

The problem with existing orchestration tools, and tools like chef, puppet, etc. is that they're all a bit piecemeal and complicated. What we need is a step up the abstraction hierarchy, and some standardisation.

We'll probably know when we've got there if companies no longer have any real idea of how many servers or VMs from different providers that they're utilising. They'll just know which applications they're running and how much it's costing them.

presspot · on Dec 8, 2014

It's very similar to mainframe computing, only now it's available to every business and, frankly, every business needs it to be competitive.

superuser2 · on Dec 8, 2014

"Mainframe" implies sales guys and multi-million-dollar monolithic machines. This might be described as "DIY mainframe."

larryweya · on Dec 8, 2014

I'm a recent Mesos convert but I think "first" is a tad inaccurate if you consider Joyent's SmartOS and its recently open-sourced Smart Data Center.

presspot · on Dec 8, 2014

There are a lot of components to an operating system. It's not just the technology components, it's the product components and the business components. E.g., does it have an API? Does it have an SDK? Does it have a user interface? Does it have an init system, a cron, a storage system, service discovery? Does it have an ecosystem of third-party developers? I posit that the OS Checklist is fairly long and that none of the other systems you mention have the complete OS package.

larryweya · on Dec 8, 2014

If you were to look at SmartOS and Smart Data Center, you'd realize that it does have ALL of those components. The place I feel SDC and SmartOS fall short (of DCOS) is in application deployment/orchestration which I think is huge for devs who've had to manage VMs in the cloud.

Another win for DCOS is that it can run anywhere while SmartOS only runs on baremetal. SmartOS does come with some nice goodies like the Manta object storage platform and Manatee, a Postgres replication and failover platform. Application orchestration would make it a contender IMHO.

dmpk2k · on Dec 8, 2014

> I posit that the OS Checklist is fairly long and that none of the other systems you mention have the complete OS package.

They do.

E.g. SmartOS is a UNIX. It'd be pretty odd if it didn't have an init system. Or storage. Or cron. Or POSIX. Or whatever else you'd like to add to the list that UNIX systems usually have...

justincormack · on Dec 8, 2014

I assumed the OP meant a distributed init system, distributed storage, etc.

23david · on Dec 8, 2014

Master Controller to Minions... configure thyselves...

Datacenter Controller maybe, but calling this an OS?

throwaway923482 · on Dec 8, 2014

That's just like your definition, man. Not everybody is that anal retentive about their definition of 'operating system'. (I doubt Tanenbaum would be in favor of yours, for instance).

throwaway23458 · on Dec 8, 2014

That's just like your definition, man. I doubt Tanenbaum would agree with it.

hendzen · on Dec 8, 2014

Mesosphere should really evangelize libprocess [0] more. Probably one of the cooler C++ libraries out there.

[0] - https://github.com/apache/mesos/tree/master/3rdparty/libproc...

corysama · on Dec 8, 2014

OK. But, what is it? I tried reading the documentation, but all it said was "readme: this is the readme for libprocess."

adamnemecek · on Dec 8, 2014

"Libprocess is a library written in C/C++ that provides an actor style message-passing programming model that leverages efficient operating system event mechanisms. Libprocess is very similar to Erlang's process model, including basic constructs for sending and receiving messages. I'm excited about giving people an opportunity to use this software, so look for lots more details to be added here shortly!"

http://www.eecs.berkeley.edu/~benh/libprocess/
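For anyone wondering what "actor style message-passing" means in practice, here is a rough Python sketch of the general idea. To be clear, this is not libprocess and does not use its API; the `Actor` and `Counter` classes are invented for illustration. Each actor owns its state and a mailbox, and the only way to interact with it is to send it a message, which a single thread processes one at a time.

```python
import queue
import threading


class Actor:
    """A tiny actor: private state, a mailbox, and one thread draining it."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Enqueue a message; the sender never touches the actor's state."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to exit once earlier messages have been handled."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Hypothetical actor that keeps a running total."""

    def __init__(self):
        self.total = 0
        super().__init__()

    def receive(self, message):
        kind, payload = message
        if kind == "add":
            self.total += payload
        elif kind == "get":
            payload.put(self.total)  # reply through the caller's queue


counter = Counter()
for n in (1, 2, 3):
    counter.send(("add", n))
reply = queue.Queue()
counter.send(("get", reply))
result = reply.get(timeout=1)
counter.stop()
print(result)
```

Because only the actor's own thread mutates `total`, no locks are needed around the state; the mailbox is the only shared structure.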

tknaup · on Dec 8, 2014

It's the actor library at the core of Mesos (and many other things). It's what makes it scale to 10,000s of nodes, and probably more. It's just that nobody was able to test it on more nodes without breaking the bank.

on Dec 8, 2014

[deleted]

presspot · on Dec 8, 2014

Mesosphere's stack is in full production at major companies, including one of the largest financial services companies and one of the largest consumer electronics companies. General availability is next year, but paying customers are using it in production today--at very large scale.

23david · on Dec 8, 2014

Chef/Puppet/Saltstack/Ansible/CFEngine/Nix OS

michaelsbradley · on Dec 8, 2014

Nice tools, but (and it feels funny to say this) they seem rather "sticks and stones" in comparison to the integration achieved with open source PaaS layers like deis atop CoreOS; Mesosphere's DCOS even more so.

23david · on Dec 8, 2014

They're general purpose devops automation frameworks... you can choose to take any of them and make this 'deep integrated' layer you speak of.

For example, Openstack deployments are commonly automated using Puppet scripts. (Really amazing stuff if you haven't seen it before...)

DCOS may choose to reinvent the wheel, but unless there's a core innovation in the way they're handling orchestration or automation, I think they're better off if they leverage an existing battle-tested toolchain. Adding another tool will just mean more work for the (underworked and lazy?) DevOps teams who will be responsible for managing DCOS.

KaiserPro · on Dec 8, 2014

Puppet and the like are state enforcement tools, not task schedulers.

As someone who used to run a 6000 core farm in 2007 (it's not 25,000), I can tell you that puppet isn't going to help with task placement. It can create a machine that will run a certain app, but without some heavy programming it'll never balance or detect need and respond sensibly.

23david · on Dec 8, 2014

DevOps tools are rapidly evolving. Now that they've basically finished with machine provisioning, dependency management and orchestration, all major DevOps automation frameworks are going into managing reactive infrastructure. Enforcing a task schedule is just another form of state enforcement.

These are all pretty similar:

- configuration management: ensure package oracle-java-8 is installed on machines A,B,C with this specific configuration.
- orchestration: ensure my-awesome-java-app on machines A,B,C is running and connected to databases on machines D,E,F.
- deployment with constraints: ensure that four instances of my-awesome-java-app are running on at least 2 physical machines with over 4TB free disk space.
- job runner: ensure that script X runs on a cluster every __ minutes. When script X runs, send the output to script Y.
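As a rough illustration of the "deployment with constraints" case, here is a small Python sketch. It is not taken from Salt, Puppet, Mesos or any other tool; the `place` function, the host names and the free-disk figures are all hypothetical. It filters hosts by the disk constraint, checks that enough distinct hosts remain, and spreads instances round-robin.

```python
def place(instances, min_hosts, min_free_tb, hosts):
    """Spread `instances` round-robin over hosts with enough free disk.

    `hosts` maps host name -> free disk in TB. Returns one host name per
    instance, or raises ValueError if the constraints cannot be met.
    """
    candidates = sorted(h for h, free in hosts.items() if free > min_free_tb)
    # Round-robin uses at most min(instances, len(candidates)) distinct hosts.
    if min(instances, len(candidates)) < min_hosts:
        raise ValueError("constraints cannot be satisfied")
    return [candidates[i % len(candidates)] for i in range(instances)]


# Hypothetical inventory: only m1 and m3 have more than 4TB free.
inventory = {"m1": 6.0, "m2": 3.5, "m3": 8.0}
placement = place(instances=4, min_hosts=2, min_free_tb=4.0, hosts=inventory)
print(placement)
```

This only computes a one-off placement. Treating it as state enforcement would mean re-running it whenever the inventory or the set of running instances changes and acting on the difference.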

I think that you'll see task placement and job scheduling primitives being integrated into DevOps tooling in the next 6-12 months.

Saltstack already has many of the primitives in place for building out reactive infrastructure, http://docs.saltstack.com/en/latest/ref/runners/all/salt.run...

Would be great to hear about tools for Puppet, Chef, Ansible, other…

KaiserPro · on Dec 9, 2014

Salt is terribly immature at the moment (I know because I use it professionally), and I really wouldn't trust it for running tasks as well. Out of the box it starts to get horribly slow after around 500 nodes. (You need to spool up 600 tasks on 600 machines? That'll take 10 minutes, guys.)

For distributed cron, we use Jenkins, which has the advantage of keeping "build history".

For task placement we use alfred (https://renderman.pixar.com/resources/RPS_13.5/rps_manuals/a...). Yup, it's old. However, it works like a champ, and it's fast (as in it'll dispatch thousands of tasks a second).

You do hit a limit when you go over 6000 "slots" (each slot accepts one task, and the main dispatcher is single threaded). Dispatching is simple, and task building has a simple syntax that easily grows to thousands of tasks in one job. Monitoring is also simple, as each task ships logs and exit status back to the dispatcher. It also has mechanisms to cope with bad/slow/unhappy machines.
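The slot model is easy to picture with a toy Python sketch. This is not alfred's implementation or syntax; `run_job`, the slot names and the tasks are invented, and tasks are simulated as plain callables run inline. It only shows the shape described above: one dispatcher loop, one task per slot, and every task reporting a log and an exit status back.

```python
from collections import deque


def run_job(tasks, slot_names):
    """Dispatch tasks to slots from one loop; each slot holds one task."""
    pending = deque(tasks)          # (name, callable) pairs awaiting a slot
    free_slots = deque(slot_names)
    busy = {}                       # slot -> (task name, callable)
    report = []                     # (task, slot, exit_status, log) records

    while pending or busy:
        # Fill every free slot with exactly one pending task.
        while pending and free_slots:
            busy[free_slots.popleft()] = pending.popleft()
        # Simulated completion: run what each slot holds, then free the slot.
        for slot in list(busy):
            name, work = busy.pop(slot)
            try:
                log, status = work(), 0
            except Exception as exc:
                log, status = str(exc), 1
            report.append((name, slot, status, log))
            free_slots.append(slot)
    return report


# Hypothetical job with three tasks and two slots; task "b" fails.
job = [("a", lambda: "ok a"), ("b", lambda: 1 / 0), ("c", lambda: "ok c")]
report = run_job(job, ["s1", "s2"])
for record in report:
    print(record)
```

In a real farm the slots live on remote machines and finish asynchronously, so the dispatcher would react to completion events instead of running the work itself.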

cookrn · on Dec 8, 2014

A well discussed and related story from a few days ago: https://news.ycombinator.com/item?id=8694940

preillyme · on Dec 8, 2014

But this is our official announcement of the DCOS project. The other post was about Ben's ideas that helped drive the creation.

throwaway892348 · on Dec 8, 2014

I don't think there was any confusion but thanks for making sure it stays that way.

bc1323 · on Dec 8, 2014

The DCOS project looks amazing. The command line interface looks like a heroku toolbelt for your very own servers. Cool server usage visualizations too.

tomcart · on Dec 8, 2014

DCOS is an interesting description, as the idea of a data centre (to my tiny mind at least) is made more fuzzy by concepts like AWS AZs.

Do people expect that the 'DC' will span AZs, regions even? Or is the separation of these things valuable in some way?

How about the idea of dev vs prod environments? Will the isolation provided be strong enough that we'll happily drop everything onto a single cluster of machines?

dang · on Dec 8, 2014

We changed the url from http://techcrunch.com/2014/12/07/mesosphere-releases-first-d... because this one is a somewhat more substantive article (though not the title). Via https://news.ycombinator.com/item?id=8715055.

23david · on Dec 8, 2014

lol. mind blown.

"Mesosphere Announces First Data Center OS And $36M In Funding" - Techcrunch

"Mesosphere’s new data center mother brain will blow your mind" - GigaOM