This looks interesting (although I'm not in the target market, too small)...
But if I were looking at this, judging from the quality of people they've amassed in their engineering team, is there any chance they won't be acquired in 6 months?
To anyone looking to take a bet on this, what is the answer to "what's your plan for when your stellar team gets acquired?" And what answer will satisfy that buyer?
Update: Adding another question, does this "environment" (where any really great product with great talent in it can be acquired very quickly) have a chilling effect on purchases for products like this?
Hi! So, at every step -- from conception to funding to building the team and now building the product -- we have done so to build a big, successful public company. Not only do we (the founders) share that conviction, but it is shared by our investors and employees as well. For better or for ill, we are -- as we memorably declared to one investor -- ride or die.
Also, if it's of any solace, I really don't think any of the existing players would be terribly interested in buying a company that has so thoroughly and unequivocally rejected so many of their accrued decisions! ;) I'm pretty sure their due diligence would reveal that we have taken a first principles approach here that is anathema to the iterative one they have taken for decades -- and indeed, these companies have shown time and time again that they don't want to risk their existing product lines to a fresh approach, no matter how badly customers want it.
I read through your "Compensation as a Reflection of Values" article [1] and just wanted to say that I love it. It reflects and relates so much to my own values towards work, as a life philosophy, that I felt refreshed knowing others not only think this way but also have the power to implement such a culture. Thanks for trying that, I hope it becomes something more common to workers in general.
Your approach to pay is really refreshing and attractive as an engineer, and also seems like the exact type of thing most VC or larger tech firms would really hate. That alone feels like evidence of your conviction
Ha! Well, I think our investors think we're very idiosyncratic -- but they also can't help but admire the results: a singular team, drawn in part by not having to worry about the Game of Thrones that is most corporate comp structures. ;)
Smaller teams will always win the communication-overhead comparison, even before you think too hard about organizational trees and the indirection they create. Communication is one of the biggest problems in organizations and in society, so more direct and therefore clearer communication can make the organization more efficient and keep spirits high. It also doesn't hurt to have a team made up only of extremely senior engineers or other professionals at the top of their field, and better still if those engineers are great personalities too. There is only one caveat: you need a very capable driver to put this powerful engine to good use, so to speak. If you drive a powerful engine in the wrong direction, you are actually putting more, not less, distance between your current position and the destination. The goal for Oxide Computer seems clear, and I wholeheartedly wish you the best of luck.
I hope you are able to keep the investors convinced and stick with it! I'm a Swedish-American who has mostly lived in the US but has been working back in Sweden for the last 4 years. I'm culturally mostly Californian, but the work atmosphere in Sweden is just less cutthroat and much nicer. You pay for it in salary, sure, but it's definitely worth it. Your descriptions on the website feel a bit similar.
I presume you wouldn't consider European remote given your PST timezone requirement, but I guess I'll consider your company one of those dream places to work were I to make my return to the US!
I was literally remarking to a workmate 'this looks like Sun 2.0', then I see who's on the team :). Congrats, I'll be keeping an eye out if you ever start shipping to Australia.
We're getting there! My second shot was yesterday, and Steve and Jess are both completely done -- so we expect to get back to the garage soon! In the meantime, fellow Oxide engineer Adam Leventhal and I have been doing a Twitter Space every Monday at 5p Pacific; so far, it's been incredible with some amazing people dropping by -- come hang out!
Any thoughts on re-listening to your previous podcasts, finding interesting topics that were either skipped or digressed from and inviting back the guests to do a Q&A style podcast? I feel there are deep wells of interesting topics to be discussed.
See, my immediate reaction was that Oxide is by Sun people who are still scarred by that acquisition; they'll fight tooth and nail to avoid a repeat and if it did get forced through there would be an immediate and complete exodus.
Well, that one too. Honestly, it's one of the consequences of having a team consisting mostly (but not entirely!) of industry vets: we've collectively seen a lot of shit. In fact, a topic I would never want to bring up among Oxide employees: who has had an acquisition in their career go the worst? There are just too many contenders, which is itself a sad commentary on the industry!
Fortunately, some of those same DSSD folks have joined us at Oxide -- and let's just say that they are of like mind with respect to Oxide's approach. ;)
That requires them to be publicly traded (which I don't think Oxide is), or for a majority of the private shareholders to essentially give up on the company.
Now I don't know how Oxide is set up, but I'd assume the founders still retain the large majority of shares.
I agree... what I think they meant to say is something along the lines of: the software defaults are already tuned to take full advantage of the hardware's capabilities, so work is completed faster. The 'with the software baked in' phrasing should be changed to reflect the value proposition that Oxide is alluding to.
Going by that logic, you should never take a chance on a bad company because they are bad, and a good company because they are too good and might get acquired. So should you just never rely on a small company for anything?
That's the question I was genuinely asking. Do longer-term minded buyers think this way? Our company is too small and just uses AWS, so we're not prospective buyers. But I'm trying to understand the mindset of a CapEx-style buyer whose timelines are multiple years.
This team is, by all measures, going to hit it out of the park. There's just a solid amount of talent, experience and insight all-round.
And to be clear, I am not at all disparaging teams that get acquired – that would be silly. I'm just saying that we are in an environment these days where very few of these kinds of companies get a chance to grow before being acquired and WE are the ones that lose even though the people working at the company rightfully earn a nice payout.
I have the same "fear" about Tailscale, a company whose product we love and have started using, and are about to purchase.
But the fact that a member of the founding team themselves answered my message above in plain English (not surprising) is honestly refreshing.
No one is going to bet the farm on that solution. I'd be surprised if big SaaS vendors like Atlassian or DropBox go with it.
But on the other hand I can see F500s (oil & gas companies, big engineering and defense firms, etc.) getting a rack or two to run their cloud-like stuff. They would not be taking much risk; this would be one system among many others they have, and it will have a life of 5 to 7 years anyway (a few million dollars and 7 years is peanuts for an oil & gas or mining company whose CapEx goes into the billions, over 50+ years horizons).
I think the value is in having a cloud-like system that doesn't require an entire IT/Ops team to run.
Private companies can't just get bought out. They have to agree to be acquired. There is not some roaming force of Big Corp M&A people who forcefully acquihire companies.
But second, I'd love to understand the compute vs storage tradeoff chosen here. Looking at the (pretty!) picture [1], I was shocked to see "Wow, it's mostly storage?". Is that from going all flash?
Given how much of the rack is storage, I'm not sure which Milan was chosen (and so whether that's 2048 threads or 4096 [edit: real cores, 4096 threads]), but it seems like visually 4U is compute? [edit: nope] Is that a mistake on my part? Dual-socket Milan at 128 threads per socket is 256 threads per server, so you need at least 8 servers to hit 2048 "somethings" -- or do the storage nodes also have Milans [would make sense] and their compute is included [also fine!] -- and is that similarly how you get a funky 30 TiB of memory?
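For what it's worth, here's the back-of-the-envelope I'm doing, as a quick Python sketch; the sled counts are my guesses, nothing Oxide has published:

    # Milan tops out at 64 cores / 128 threads per socket; dual-socket sleds assumed.
    cores_per_socket, threads_per_socket, sockets_per_sled = 64, 128, 2

    for sleds in (8, 16):
        cores = sleds * sockets_per_sled * cores_per_socket
        threads = sleds * sockets_per_sled * threads_per_socket
        print(f"{sleds} sleds -> {cores} cores / {threads} threads")
    # 8 sleds -> 1024 cores / 2048 threads
    # 16 sleds -> 2048 cores / 4096 threads

So "2048" only works as a core count if all 16 sleds carry compute, which is why I'm asking whether the storage nodes have Milans too.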
[Top-level edit from below: the green stuff are the nodes, including the compute. The 4U near the middle is the fiber]
P.S.: the "NETWORK SPEED 100 GB/S" in all caps / CSS loses the presumably 100 Gbps (though the value in the HTML is 100 gb/s which is also unclear).
Leaving that RAM for ZFS L2 ARC perhaps? I don't think they would use Illumos as the hypervisor OS without also using OpenZFS with it. They also need some for management, the control UI, a DB for metrics and more.
Btw, if I count correctly, they have 20 SSD slots per node (if a node is full width) and 16 nodes. They would need roughly 3.2 TB drives to reach 1 PB of "raw" capacity, before the obvious redundancy overhead of ~20%.
It is also quite possible they don't use ZFS at all and use e.g. Ceph or something like it, but I don't think that is the case, because that wouldn't be Cantrillian. :-) E.g. using MinIO, they could provide something S3-like on top of a cluster of ZFS storage nodes too, but they most likely get better latency with local ZFS than with a distributed filesystem. Financial institutions especially seem to be part of the target here, and there, latency can be king.
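To put rough numbers on the drive math (slot counts are just what I can make out from the rendering, so treat this as a sketch):

    # Raw capacity assuming 16 full-width nodes x 20 slots (equivalently
    # 32 half-width sleds x 10) -- a guess, not a published spec.
    slots = 16 * 20
    for tb_per_drive in (2.0, 3.2, 3.84):
        print(f"{tb_per_drive} TB drives -> {slots * tb_per_drive / 1000:.2f} PB raw")
    # 2.0 TB drives -> 0.64 PB raw
    # 3.2 TB drives -> 1.02 PB raw
    # 3.84 TB drives -> 1.23 PB raw

So something in the ~3.2 TB-per-drive range is what gets you to the advertised 1 PB raw.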
I'm fairly confident the nodes are half width; if you look at the latches, it very much appears you can pull out half of every 2U at once, and if you look at the rear there are 2 network cables going into each side.
Good observation, it looks like it. That probably makes upgrades/maintenance easier since the unit of failure is smaller. Of course, it also means you can only tackle workloads that demand no more than 64 cores before you have to rearchitect your monolith into a distributed system, which has lots of overhead.
Duh! I got tricked by the things near the PDU as "oh, these must be the pure-compute nodes".
So maybe that's the better question: what are the 4U worth of stuff surrounding the power? More networking stuff? Management stuff? (There was some swivel to the back of the rack / with networking, but I can't find it now)
Edit: Ahh! The rotating view is on /product and so that ~4U is the fiber. (Hat tip to Jon Olson, too)
Control-plane most likely, and having a mid-centered PDU probably adds to heat on the upper stack, which shortens life over time.
As someone who has designed quite a few datacenters, what's more interesting to me in this evolution of computing is the reduction in cabling.
Cabling in a DC is a huge suck on all aspects - plastics, power, blah blah blah - the list is long....
But there are a LOT of cabling companies that do LV out there - so the point is that as these types of systems get more "obelisk"-like, are many of those companies going to die? (I'm looking at you, Cray and SGI.)
When I worked at Intel - I had a friend who was a proc designer at MIPS - and we talked about rack insertion and a global back-plane for the rack (which we all know to be common now) - but this was ~1997 or so... but when I built the Brocade HQ - cables were still massive and it was an art to properly dress them.
Lucas was the same - so many human work hours spent on just cable mgmt...
Their diagram of system resiliency is odd in my opinion:
That looks like a ton of failures that they can negotiate...
What's weird is that the SPOF isn't going to be in your DC/HQ/whatever - it's going to be outside - this is why we have always sought 2+ carrier ISPs or built private infra...
A freaking semi truck crashed into a telephone pole in Sacramento the other day and wiped comcast off the map to half the region.
That's ONE fiber line that brought down 100K+ connections...
---
EDIT: I guess what I am actually saying is that this entire marketing strat is to convince any companies that *"failure is imminent and please buy things that are going to fail, but don't worry because you bought plenty more things to live beyond the epic failure that these devices will have"*
---
Not to discredit anything this company has going for its product - but their name is literally "RUST" (*oxide*) --- which we all know is what kills metal.
On the topic of naming, there was thought put into it...
> With accelerating conviction that we would build a company to do this, we needed a name — and once we hit on Oxide, we knew it was us: oxides form much of the earth’s crust, giving a connotation of foundation; silicon, the element that is the foundation of all of computing, is found in nature in its oxide; and (yes!) iron oxide is also known as Rust, a programming language we see playing a substantial role for us. Were there any doubt, that Oxide can also be pseudo-written in hexadecimal — as 0x1de — pretty much sealed the deal!
Power footprint also confirms that the compute density is pretty low.
We built a few racks of Supermicro AMD servers (4 compute nodes in 2U), and we load tested them to 23 kVA peak usage (about half full with that type of node only; our DC would let us go further).
We're also over 1 PB of disk (unclear how much of this is redundancy), also in NVMe (15.36 TB x 24 in 2U is a lot of storage...).
Other than that, not a bad concept; not sure what premium they will charge or what will be comparable on price.
They basically reinvented mainframes.
Seems it has a lot in common with Z series.
Scalable locked-in hardware, virtualization, reliability, engineered for hardware swaps and upgrades.
A proprietary operating system (?) from what someone said - an offshoot of Solaris? By that I mean that most of it, or all of it, might be open-sourced forks, but it will be an OS only meant to run on their systems. (It would be fun to get it working at home, on a couple of PCs or a bunch of Pis.)
They lack specialized processors to offload some workloads to - in modern terms, shelves of GPUs, or a shelf of fast FPGAs or DSPs. The possibilities are huge. I didn't find any mention of that in what I read.
They also lack the gigantic legacy compatibility effort, which is a good thing.
Their approach to reliability isn't quite on par with mainframes, AIUI. At least, not yet. And the programming model is also quite different - a mainframe can seamlessly scale from lots of tiny VM workloads (what Oxide seems to be going for) to large vertically-scaled shared-everything SSI, and anything in between.
Ignoring hardware reliability, thanks to the integration, their solution should be more reliable than whatever byzantine solutions are currently used in their target market. I've worked in a shop (a well-known name that I won't mention) that had a mix of "chat ops" and Perl scripts integrated with JIRA where you could request a Linux VM through a JIRA ticket and get it automatically provisioned, I assume from some big chassis running VMWare, and then use git+Puppet to configure it. It works, but it's a lot of software from different sources and there is always one thing or the other failing. And the security of all that stuff is probably patchy, regardless of audits.
That being said, this solution is the mother of all lock-ins...
I could see it used for the non-critical part of a company's infrastructure. I would not run production stuff on it, but it could work for development systems, test boxes, etc. Basically give developers access and let them create and destroy as many VMs as they need, whenever they need.
Yeah, I noticed that too. The green wireframe looking stuff is actually text in spans/divs next to, or overlayed on pictures. The little "nodes" are this character, for example: ⎕. The effect is pretty unique.
You have to scroll all the way through the page to activate all the gimmicks. Then it never stops, and permanently loads 2.5 cores to 100%, which makes the CPU fan spin to the max.
Oh my god, that's hilarious. I was wondering why my lap was warm all of a sudden. Htop said firefox was the culprit so I closed out all my tabs except this one. Then I read your comment, opened the page, and scrolled all the way to the bottom—my cpu temp just steadily rose till it throttled. All the animations are smooth, though.
Note: I'm typing this from a 9 year old thin-and-light, so that's probably part of the problem.
It's all fun stories from people doing amazing things with computer hardware and low-level software, like ring-sub-zero and DRAM-driver-level software.
> Our firmware is open source. We will be transparent about bug fixes. No longer will you be gaslit by vendors about bugs being fixed but not see results or proof.
There are lots of reasons to be enthusiastic about Oxide but for me, this one takes the cake. I hope they are successful, and I hope this attitude spreads far and wide.
- vendor-locked at the rack - if you have hardware from someone else, it can't live in the same cabinet
I guess if you just want a pretty data center in a box and look like what they consider a 'normal' enterprise to be, it might appeal. But I'm not sure how many people asked for Apple-style hardware in the DC.
Why is it important what kind of virtualization? It works, and since it is built for this hardware it will likely be more reliable than anything you're putting together yourself.
The specs are damn good. When it is all top-of-the-line, inflexibility is kind of a moot point. Where else are you going to go?
> But I'm not sure how many people asked for Apple-style hardware in the DC.
Well-integrated, performant, and reliable hardware that runs VMs you can put anything on is pretty much all anyone running their own hardware is looking for.
Honestly I am surprised how many here completely misunderstand what their value proposition is.
> Why is it important what kind of virtualization?
Because if I ran this, I would have to manage it. Given that I have lots of virtualization to manage already, I would want it to use the same tooling, for rather obvious reasons.
> is pretty much all everyone running their own hardware is looking for.
I don't think you talk to many people who do this, but as someone who manages 8 figures worth of hardware, I can tell you that is absolutely not true.
> The specs are damn good. When it is all top-of-the-line, inflexibility is kind of a moot point. Where else are you going to go?
To some hardware that actually fits my use case and is manageable in an existing environment? Oh wait - I already have that. I mean, seriously - do you think they're the only shop selling nice machines?
The value-add is all wrong, unless you are a greenfield deployment willing to bet it all on this particular single vendor, and your needs match their offering.
> lots of virtualization to manage already, I would want it to use the same tooling
I'm not saying you would want to, but maybe their expectation is that you'd plan to transition everything to their system. Either gradually as part of the normal cycle of replacing old hardware or all at once if you want to be aggressive.
If their way is actually better, then it might make sense. You'd go through an annoying transition period but be better off in the end.
The hardware options do seem limited, but maybe that would change if their business takes off and they get enough customers to justify it. They're definitely saying simplicity is a good thing, but maybe that's just marketing spin that sounds better than the alternative of saying they're not yet in a position to offer that flexibility.
I don't see details on the API, but it seems likely you could write a libvirt provider for it and use existing virsh tooling (Cockpit / CloudStack / ...).
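To make the "existing virsh tooling" point concrete: if someone did write a libvirt driver for the Oxide API, the consumer side would barely change -- only the connection URI would. A minimal sketch using the real libvirt Python bindings (the idea of an Oxide driver is hypothetical; "qemu:///system" is what you'd use against a plain KVM host today):

    import libvirt

    # With a hypothetical Oxide driver, only this URI would change.
    conn = libvirt.open("qemu:///system")
    for dom in conn.listAllDomains():
        print(dom.name(), "running" if dom.isActive() else "stopped")
    conn.close()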
> - vendor-locked at the rack - if you have hardware from someone else, it can't live in the same cabinet
This describes legacy IBM platforms quite well. If they can leverage hyperscaling tech to be better and cheaper than what IBM is currently offering, that's enough to make it worthwhile.
This is a selling point - if it's actually better (which, why not? most of the existing virtualization management solutions either suck or are hugely expensive).
If it's not better, big deal? I'm assuming you could just throw Linux on these things and run on the metal or use something different, right? Given how much bcantrill (and other Oxide team members) have discussed loving open hardware, I seriously doubt they would intentionally try to lock down their own product!
> vendor-locked at the rack - if you have hardware from someone else, it can't live in the same cabinet
This is aimed at players so big that they want to buy at the rack level and have no desire to ever touch or carve up anything. It's a niche market, but for them this is actually a plus.
"But I'm not sure how many people asked for Apple-style hardware in the DC."
It's probably selling to the "Amazon-style hardware in your DC market", which I think should be fairly ripe. Building your own private cloud from parts defeats a lot of the purpose...avoiding your own plumbing.
As I understand it, Oxide is going to have deep software integration into their hardware. So the expectation isn't that the servers in this rack will be running Windows or a generic Linux distribution. In case anyone from Oxide is here, is my understanding correct? And if so, will there be a way to run a smaller version of an Oxide system, say for testing or development, without purchasing an entire rack at a time?
Anyway, glad to finally get a glimpse of what Oxide has to offer. Looking forward to seeing a lot more.
My understanding is you will use an API to provision virtual machines on top of the Oxide hypervisor/software stack, which is bhyve running on Illumos. So you can still just run your favorite Linux distro or Windows or a BSD if you want[1].
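The API itself doesn't seem to be documented publicly yet, so purely as a sketch of what "provision a VM over an API" usually looks like -- every endpoint and field name below is hypothetical, not something Oxide has published:

    import requests

    # Hypothetical control-plane endpoint and payload, for illustration only.
    API = "https://rack.example.internal/api/v1"
    resp = requests.post(
        f"{API}/instances",
        headers={"Authorization": "Bearer <token>"},
        json={
            "name": "build-runner-01",
            "vcpus": 8,
            "memory_gib": 32,
            "disk_gib": 200,
            "image": "debian-11",
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())  # presumably: instance id, state, assigned addresses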
Agreed, I would love to hear more about the management plane. I'm glad it's API-driven, but I still have some questions about things like which hypervisor they are using.
If it's a custom software stack, might be nice to get a miniature dev-kit!
They will use Illumos with bhyve; @bcantrill said it in a podcast just a few months ago. I have linked it somewhere in my comments (look at my profile).
Illumos is the project that multiple distributions build on, from what I understand. In a way, it could be likened to GNU/Linux, as Illumos probably contains not only the kernel but other tools and libraries as well. There is e.g. omniosce.org, perhaps Nexenta, and Joyent/Samsung SmartOS.
In reality I would never want this type of hardware... It reminds me of the old boat anchor bladecenter rigs we used to use. They were great, up until you had to replace one of the blades after the support was up. It's not always practical to replace hardware every 3 years like we're supposed to, so this type of stuff sticks around and gets some barnacles.
What would be fantastic would be if the entire industry committed to an open spec for large chassis like this with a standardized networking and storage overlay... But that would never happen because vendor lock-in is the big money maker in 'enterprise'.
> What would be fantastic would be if the entire industry committed to an open spec for large chassis like this with a standardized networking and storage overlay
Isn't the Open Compute Project supposed to be working on that kind of stuff?
It seems like a lot of Oxide information is currently hiding out in podcasts and other media - does anyone know how the AuthN, AuthZ, ACL system is going to work?
One of the most powerful elements of the trust-root system is auditability and access control, for both service-to-service and human-to-system aspects, and I'm really interested in seeing how this plays out.
For example, a service mesh where hosts can be identified securely and authorized in a specific role unlocks a lot of low-friction service-to-service security. I'm curious what Oxide plans to provide in this space, API- and SDK-wise.
I see some Zanzibar related projects on their GitHub, so it can be assumed the ACL system will be based on the principles there - but that's more a framework than an implementation.
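For anyone who hasn't read the Zanzibar paper: the core of it is relation tuples plus a recursive check, and the rest (namespace configs, rewrite rules, consistency tokens) is built around that. A toy in-memory sketch of just that core, to show the shape:

    # Toy Zanzibar-style check. Tuples are (object, relation, subject); a subject
    # may itself be a "userset" written as "object#relation".
    tuples = {
        ("rack:42", "admin", "group:sre#member"),
        ("group:sre", "member", "user:alice"),
    }

    def check(obj, rel, user):
        for (o, r, s) in tuples:
            if (o, r) != (obj, rel):
                continue
            if s == user:
                return True
            if "#" in s:  # userset: recurse into the referenced relation
                ref_obj, ref_rel = s.split("#")
                if check(ref_obj, ref_rel, user):
                    return True
        return False

    print(check("rack:42", "admin", "user:alice"))  # True, via group:sre#member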
The storage is always the difficult part in these architectures. Are you distributing across all nodes? It appears that each sled is an individual compute unit with 10 drives. Are the drives on a proverbial island and only accessible to that local node, or is there some distributed storage going on that you can talk about?
On paper with RDMA and NVMe-OF you could access any drive from any compute unit... but that's easier said than done :)
Soo... we’re switching back to blade servers again?
The problem with this model is it's no longer commodity hardware. You are kind of locked into their ecosystem of specialized network and server equipment.
And it of course introduces some unique failure modes to mitigate too.
Not to say it's not a cool idea; it's just interesting to see how hardware trends oscillate between commodity and highly specialized proprietary designs.
Congrats Oxide team! More competition in this space is always a good thing.
I'm curious about management. Can the rack operate completely standalone? I assume when you have multiple there will be some management abstraction above the rack layer?
The closest direct equivalent that I can think of to this is AWS outposts. Are there any others that I'm forgetting?
The density they're getting here is significantly higher than AWS Outposts, which is interesting. The top-end (~$600k) AWS Outposts seem to max out at around 1k CPUs and 4.5 TB RAM in a rack (e.g. 12x m5.24xlarge = 12x 384 GB), while this rack can house 2k CPUs and 30 TB (!) RAM.
Outposts seem like a solution to the problem, "for regulatory or compliance reasons we are required for data to reside and be processed within a physical space we control." For that problem, an organization that is otherwise on AWS might find Outposts appealing. I can imagine an engineering team's response to such a requirement as "Oh yeah? Fine, but it's going to cost you $600k per year per rack!"
I believe Oxide is attempting to capture a much broader market than that.
Yes, well it isn't that dense either. As I have written, it's 32 CPUs (16x 2 CPUs). 1 TB of RAM per CPU is not that huge a deal; it's perhaps 16x 64 GB (Milan uses 8 channels, and 2 DIMMs per channel is reasonable), which works out to 16 GB of RAM per core. In HPC, you would probably shrink it to 1/4 of the volume (half-width, 1U dual-socket servers). Oxide probably focuses on optimal thermal efficiency, since their limit isn't space so much as the power density / max power per rack in existing DCs, which they are already pushing hard. (Of course they have lower-power options too, but those probably will not have 2048 cores.)
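Spelled out (8 channels x 2 DIMMs per socket is standard Milan topology; the 64 GB DIMM size is my assumption):

    channels, dimms_per_channel, gib_per_dimm = 8, 2, 64       # per socket
    gib_per_socket = channels * dimms_per_channel * gib_per_dimm
    print(gib_per_socket)              # 1024 GiB, i.e. ~1 TiB per socket
    print(gib_per_socket // 64)        # 16 GiB per core on a 64-core Milan
    print(32 * gib_per_socket / 1024)  # 32 TiB across 32 sockets, vs ~30 TiB quoted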
The problem with pushing higher compute density is you're running into the limits of what most DCs can provide in terms of power and cooling for a single rack. Usually it's specialized HPC facilities or hyperscalers pushing the power and cooling to handle stuff like that. Those people aren't likely Oxide's customers - they've already got their own hardware solutions.
I'm also wondering if the website is unfinished? All the "Read More" links actually hide very little information - if so, why hide it? And it doesn't seem to explain the company very well. It seems like we need to listen to their podcast to find out what is going on. (Edit: Found a YouTube video about it https://www.youtube.com/watch?v=vvZA9n3e5pc )
>Get the most efficient power rating and network speeds of 100GBps without the pain of cable
100GBps would be impressive, 100Gbps would be ... not much?
An interesting thing is that all the terminal-like graphics are actually HTML/CSS and not images.
Literally every single server vendor, and almost all (if not all?) storage vendors on the planet are pushing HCI because that is what mid and large size companies want. This is the fastest growing market segment in hardware (because they now realize that hybrid-cloud is the preferred customer model, and most of their customers now are deploying or already have deployed their own internal cloud). Oxide appears to me to be HCI done correctly.
I currently work for one of their competitors and, for one, am keeping an eye on their careers page!
Also, 100Gb meets requirements for 99.999% of the customers out there.
It's interesting that the RAM/CPU ratio is about double the default shapes from AWS/GCP. In practice I have generally seen those shapes run on the low side of CPU utilization for most workloads, so I think the choice makes sense.
I'm curious if ARC will be running with primarycache=metadata to rely on low latency storage and in-VM cache, otherwise I could see ARC using a fair bit of that RAM overhead in the hosts.
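For reference, that's a per-dataset ZFS property; a minimal sketch of setting it from the host, with a made-up pool/dataset name:

    import subprocess

    # Keep only metadata in the ARC for guest volumes, leaving data caching to the
    # guests themselves. Dataset name is hypothetical.
    subprocess.run(
        ["zfs", "set", "primarycache=metadata", "tank/guest-volumes"],
        check=True,
    )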
Intel is losing on both client and server. Everyone is jumping ship to either ARM or AMD for client/server. Hopefully Intel's new engineer CEO can turn it around the way AMD's engineer CEO (Lisa Su) did.
Few companies buy on pure performance. Right now, AMD has the performance kings and the price/performance kings.
Intel could win price/performance, but they would need to cannibalize their own low-end and mid-range market. If they could make a good bet that they would have high yield in one more cycle, that would make sense. If they don't think that will happen, there's nothing much that will save them, and they're extracting the money that they can right now.
No, CPUs are more relevant. Linux runs on M1, and if they sold CPUs, someone would make a board they could be put on that fits in standard server form factors. For this type of comparison, people want CPUs, not the next version of Xserve.
But Apple doesn't sell the M1 CPU without a Mini or MacBook carrier included.
I think relatively few people want to buy a rackmount server based on a motherboard that hosts a cannibalized M1 limited to the 16GB of RAM that it came with, paying the price premium for the rest of the machine. An M1 seems to be roughly equivalent to a Ryzen 5000-series CPU, and you can get those for $300 (6c/12t) through $1K (16c/32t) without having to go through the labor cost of pulling the CPU out of a $900 carrier.
I'm not sure I follow. The relevant part of the thread as I saw it was "too bad Apple does not sell CPUs." Selling the CPUs would mean you wouldn't need to cannibalize anything, and in a discussion about the merits of AMD vs Intel as the part used in a platform like this, comparing the bare CPU seems more relevant.
Rack Mac Pros are targeted at racks that are mostly filled with audio/video equipment. Apple really doesn't seem to have any interest in selling server products again.
So? It's still rack-mountable workstation-class hardware that will probably be running on Apple Silicon at some point. And it will probably be possible to boot Linux on it, similar to existing M1 Macs. That's pretty indistinguishable from many servers.
You seem to be thinking that a server and a workstation are the same, ignoring that server SKUs need OOB management, APIs, hardware support, and so many other things as table stakes.
Not all of it, and not necessarily well, which was one reason they weren't super popular except when you really needed Apple software. They seemed more aimed at "I want to colocate a box somewhere, and I run Apple in the office and might also want to in the datacenter for some reason or another" rather than "this is a solid platform that offers all the bells and whistles I would expect because I'm deploying tens or hundreds of these".
Intel still has ~90% of the x86 server market in unit shipments, and slightly more in revenue. And their renewed roadmap under Pat Gelsinger seems to bring a lot of their products forward (rightly so).
That is speaking as someone who wants AMD to grab more market share (and who has been saying the same for nearly three years while constantly being told off by AMD fans that they are doing fine).
ARM I've seen evidence of people jumping ship to, but is the same true of AMD? This is the best shot they're going to get at it, and I for one haven't heard all that much pro-AMD noise.
They make really nice chips, but what happens if BigCorpXYZ just gets a quote from AMD and goes straight to Intel to get it matched - i.e. the Cloud isn't that performance-intensive, so now they get to stay on the Intel stack for less money.
This is a solid private-cloud play aimed at those corporations (probably mainly financials, but other sectors too I'm sure...) who don't want to outsource to the likes of AWS / GCP.
Not just “don’t want to”, I’d hope: They should be able to win on the economics, too, assuming customers that care more about TCO for a fixed or steady-growing workload rather than elasticity.
This is not my area of expertise, but it does look like that[0].
That "custom software," though, is where the magic often lies. As a software person that worked at hardware companies for most of my career, I know all too well, how disrespectful hardware people are of software. If they have a good software-respecting management chain, then it might be pretty awesome.
Bingo. I've personally had to deal with this in other high-density systems. Less cooling not only has the obvious effects, but also reduces PS efficiency which can cause other problems. Cosmic-ray-induced memory errors can also be a problem at those altitudes (or even half that). That's a bit easier to deal with in principle, but the rate of ECC scrubbing required can start to impact performance. Stack that on top of thermal CPU throttling, and you'll have a system that's just slower than it should be. Just as importantly, the slowdown will be uneven across components, so it's effectively a different kind of system when you're debugging low-level issues.
I think it's a good sign that they're aware of the additional support issues associated with higher altitude. Shows that they've really thought things through.
FYI: The CTO of the Oxide Computer Company is Bryan Cantrill. (He is responding in this discussion as "bcantrill" -- I assume[!].) You can read about him on Wiki: https://en.wikipedia.org/wiki/Bryan_Cantrill He also has many interesting and thoughtful recorded talks on YouTube. I highly recommend them.
I am posting this info because it seems their "team" page (https://oxide.computer/team/) is no longer working. I thought it was weird there was no way to see the senior leaders from the website. When I first opened the site, I vaguely remembered this brand name, but could not remember who was behind it.
@bcantrill: I assume this is a mistake. Plus I cannot find a 'Team' link anywhere on the current website.
After the acquisition of Joyent by Samsung, who here is interested in buying extremely locked-down hardware that seems to be a next-gen SmartOS (provisioning etc.), i.e. rethinking networking and nodes in a holistic sense?
Well, congrats then! I've been waiting for news (and listening to the "On The Metal" podcast) for a long while now, and this seems like a great way to push the envelope on server hardware.
Looks cool. Please add an RSS feed to your blog, Oxide, so people can keep up. The RSS logo is not a link (at least on mobile) and there is no auto-discovery tag (that my reader can find).
Also your site badly crashes Brave iOS browser fwiw.
I'm an idiot. I thought this was like the SGI UV300 where you'd view the whole thing as a single computer and everything would be NUMA'd away. It looks like it's not like that, though.
However, I also do not see when I would buy a full shelf of gear, even though I would love to. Will they also release a maxi version, a micro version, and a nano version? I.e. a 2U server, a PRO workstation, and a small-form-factor box?
I think the innovations they have brought to these computers deserve to be in more places than just a massive and awesome data-processing rack.
Actually, a standard Azure rack is on the order of $1.1MM of hardware depending on SKU, if I am not mistaken. So I would guess it could be more like $2MM. There is also the aspect of management, and other vendors (Dell/EMC + VMware) like you to pay way more than the hardware cost for e.g. VxRail/vSphere licences. That is the real target.
If someone manages to port bhyve to Linux, they will definitely make a name for themselves.
But honestly, the equivalent is just libvirt on commodity hardware with openzfs storage; the value here is high end hardware with custom firmware and well-integrated software, not really something you can port usefully.
> Some will say that we should be paying people differently based on different geographical locations. I know there are thoughtful people who pay folks differently based on their zip code, but (respectfully), we disagree with this approach. Companies spin this by explaining they are merely paying people based on their cost of living, but this is absurd: do we increase someone's salary when their spouse loses their job or when their kid goes to college? Do we slash it when they inherit money from their deceased parent or move in with someone? The answer to all of these is no, of course not: we pay people based on their work, not their costs. The truth is that companies pay people less in other geographies for a simple reason: because they can. We at Oxide just don't agree with this; we pay people the same regardless of where they pick up their mail.
I have a few legacy HP Proliant (cheap eBay) rackmount servers in my office closet. Oxide looks awesome, but obviously not targeted for home / small business use. I was hoping they would offer single-u servers.
All NVMe seems like a good starting point, but I'd hope that some day there will be a more capacity-oriented variant for people who actually know what they're doing with exabyte-scale storage.
Nothing any more, but I used to work on such systems at Facebook. The public name was Tectonic; there was a paper at FAST this year IIRC. As time goes by, this kind of scale is going to be more common. I still remember when having a single petabyte was something to brag about.
Looks fantastic, and the hardware specs appeal to me greatly - but I'm not sure there is an actual market outside the "cult of personality" bubble. A few SV wannabes will buy into this to trade off a Twitter relationship with the Oxide founders - but does anyone really see the IT teams at Daimler, Procter & Gamble, Morgan Stanley, et al. actually going for this over HPE/Dell and AWS/Azure? We are a long way away from "Nobody ever got fired for buying from Oxide".
You wouldn't have to pitch it initially as a replacement for your on-prem HPE/Dell. It could be pitched as a replacement for the hosted private cloud you have from IBM, Oracle, etc, that you're unhappy with.
Expensive, since they engaged Pentagram for branding.
More seriously, the animations have "ascii-animation" classes in the DOM; "TUI" is probably the more fitting aesthetic label if you ask me. Not sure if a lib is involved or it's custom. Either way, it is done very nicely.
"Only" 2048 CPU cores per rack is actually not that much by nowadays standards - its 16 U of 2x 64 core CPUs. Perhaps is could be more U if they used the lower core counts but e.g. higher frequency per core SKUs but I don't think they do. (And the picture kind of confirms it). They use 2U servers though so they are able to use lower speed but bigger fans and have more expansion cards and 2,5" form factor drives perhaps.
The of course have to fit storage, which needs lots of CPU PCIe lanes for all the NVMe storage and networking (probably 2 or 4 U) and power conversion to power the bus bar and more somewhere. They probably use the 42 U+ standard 19" racks to fit in standard customers DCs. They also don't have such a high power budget as custom DCs for cloud providers do.
1 PB of flash is quite a bit but you could get perhaps 5x as much with HDDs probably (even with a relatively low density of 40x 12 x 12 TB). The problem really is I think, they wouldn't be able to write the HDD firmware in Rust in time (or at all, because no HDD manufacturer would sell an HDD to them without making sure their proprietary firmware is used). SSDs don't necessarily have this property as they are much more like the other components of a modern server.
Sleek AF Pentagram-designed website with a nice balance of style and nerdiness like the ASCII art animations.
Can't miss the Halt and Catch Fire TV references, including Haley Clark alongside Woz and other tech legends, and, in the blog post about the launch, a terminal window with character names like Gordon Clark, etc. Love to see it.
The OS can still do cost-based memory allocation considering the latencies of going between nodes. These Milan chips have tons of memory controllers for local memory, and compute nodes can allocate all those PCIe lanes to talk to a shared memory module (IBM's OMI goes in that direction - a little bit of extra latency, but lots of bandwidth and the ability to go a little further than DDR4/5 can). I think the bigger POWER9 boxes do this kind of thing. Migrating processes to off-board cores is silly in this case, but core/socket/drawer pinning can go a long way towards making this seamless while enabling applications that wouldn't be feasible on more mundane boxes.
> The OS can still do cost-based memory allocation considering the latencies of going between nodes.
That's a rather seamless extension of what OS's have to do already in order to deal with NUMA. Pinning can definitely be worthwhile since the default pointless shuttling of workloads across cores is already killing performance on existing NUMA platforms. But that could be addressed more elegantly by adjusting some tuning knobs and still allowing for migration in rare cases.
Could one reimplement SSI at the OS layer, similar to existing distributed OS's? Distributed shared memory is usually dismissed because of the overhead involved in typical scenarios, but this kind of hardware platform might make it feasible.
There was Mosix at the OS layer in the 1990s and Virtual Iron at the hypervisor layer in the aughts. I think the cost and performance of software SSI just doesn't intersect with demand anywhere.
AIUI, Plan9 is not quite fully SSI. It is a distributed OS, and gets quite close, but it's missing support for distributed memory (i.e. software-implemented shared memory exposing a single global address space); it also does not do process checkpointing and auto-migration, without which you don't really have a "global" system image.
Mosix and VirtualIron worked at a time 1GBps ethernet was in its infancy. Today 10GBps are consumer grade and 40GBps can go over Cat 8 copper, roughly equivalent to a DDR4-4000 channel.
Not great, but this is near COTS hardware. They can do significantly better than that.
Uh no, DDR4-4000 (which servers can't use, BTW) is ~256 gigabits per second. Latency is also a killer; optimized InfiniBand is ~1 µs, which is 10x slower than local RAM at ~100 ns.
Sorry. I wrote GBps when I should have written Gbps and got myself fooled in the process. We can somewhat mitigate the latencies with good caches. The overall machine will suffer from a bad case of NUMA, but it would still behave better than a cluster.
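The numbers being juggled here, spelled out (DDR4-4000 is the theoretical peak of one channel, which is the generous comparison):

    ddr4_4000_gbps = 4000e6 * 64 / 1e9   # 4000 MT/s x 64-bit channel = 256 Gbit/s
    print(ddr4_4000_gbps)                # 256.0
    print(ddr4_4000_gbps / 40)           # ~6.4x a 40 Gbit/s link, bandwidth-wise
    print(1_000 / 100)                   # ~1 us fabric latency vs ~100 ns DRAM: 10x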
How can they offer a secure boot solution under the GPLv3? My understanding is that the anti-tivoization clauses mean they need to release their keys, or allow admins, hackers, and others to escape the secure boot chain if they are physically in front of the machine or own it.
I don't know anything, but I remember hearing it's secure boot with only their releases, if you want to run your own software it's not secure boot anymore, but you're free to run whatever you want.
One issue with this is that it normally requires a bypass-secure-boot option.
So the device is not set-and-forget secure; you have to physically secure it (not that you wouldn't, but now it can really matter if someone has physical access to the machine).
GPLv3 has been used more recently as a block to others taking code and developing things further given some of the clauses and incompatibilities it introduced.
This looks great. From a business perspective, I would be concerned that it would be hard to prevent companies like Dell from entering this space as a competitor quite rapidly.
I think it would more play in the space of a NetApp/Cisco FlexPod or VCE's Vblock, but what those customers are really purchasing is the certified validation of core enterprise apps on a particular hardware/software stack, as well as the massive engineering and support organizations that those companies can bring to bear to validate firmware upgrade and to swoop in in the event of an issue. You also seem to get a LOT more flexibility.
I am not a hater in the least, but I really am failing to understand what is unique about this offering. It seems like you have no options regarding the internals, so scaling compute separately from storage doesn't seem possible. I am also very suspicious of offerings like this that have not yet released a second version of their key parts. Everyone says they are going to be backwards compatible, but then the reality of managing heterogeneous generations of gear in a homogeneous fashion strikes and you get weird behavior all around.
Long story short, I would love to know what a customer of this scale of physical infrastructure is getting with Oxide that they would not be better served by going to one of the major vendors.
We use Nutanix where I work and this has made everyone very excited. Though they would need something similar to Nutanix CE to make us switch entirely (i.e. the ability to run non-production unsupported on commodity hardware).
To me this is a webpage that promises the technology of the days before AWS and GCP, with some cool ASCII-art animation.
Sure, they have a solid team of engineers, but what is the value proposition exactly? A black-box server built to their taste (AMD Milan - what if I want Intel Xeon?), with custom software to manage and monitor server health and notify you if you could upgrade/downgrade to a different size? Oh, and no cables, with lots of aesthetics to make your datacenter look pretty... And?
Cantrill has always said it's for people who want Facebook class on-premise infrastructure but don't have a $900B market cap and a hundred engineers designing and building custom boxes.
Can definitely see it for a company the size of Dropbox: big enough to already be working with ODMs, big enough to be sensitive to the kind of headaches you get from a heterogeneous fleet of ILOM processors designed by deranged engineers.
Dropbox is big enough that they could just acquire Oxide now before they even get to market. That might even be the plan all along. I can't imagine there are more than like a dozen companies that are their target market, i.e. big enough to need Facebook-level datacenters but not big enough (yet) to have that engineering team.
I get the target of HyperConverged infrastructure. It's a pretty big market: potentially all private/colo datacenters. And there are only a few players left. Dell/EMC/VMWare, Nutanix, Cisco's schizophrenic offerings, a waning HP, cloud providers trying to make a 'hybrid' play, etc. And most don't buy one or two of these things. It's rows and rows.
But most of those are so entrenched and wrapped up in their customers. I imagine the target here is actually acquisition, it would just be too hard to get a foothold as an up-and-comer.
Also it usually means giving loaner gear to companies for an extended period for them to evaluate pre-purchase, showing your support, etc. That's a lot of up-front cost for someone without a warchest.
I'm also kind of surprised by the site. It sells to geeks well, but isn't the normal "look at our customers in key industry segments!", "something something Gartner magic quadrant", "whitepapers!" thing. Selling to execs on these things is usually a matter of convincing them they're not making a "bad" decision. They're "cool", but enough industry people agree with them that it's not career limiting if it doesn't pan out.
I like the idea of the product, and it would be nice to have another player. But it's like starting a new car company, and I feel like they're selling to mechanics.
This level of integration surely won't come cheap. From what I recall of server purchasing, a target price of ~$200-500k per rack would be expected, with a TCO of roughly 2x the rack price over 3 years (assuming you are buying from Quanta/Supermicro or another commodity integrator).
It's possible the prices are different now, but you would need customers looking to drop > $1 million in CapEx for the management capabilities they are providing. Possibly the non-cloud Fortune 500?
That reminds me a lot of the Sun Microsystems mega-servers from 20 years ago. Those were kind of the cure-all solution for highly scalable web services before Google et al. pioneered cloud-like services on commodity hardware.
They'd have to sell these at a significant loss to make up for the risk any company would have to take to build out a DC on first generation hardware from a startup.
I mean, how many companies are building out DCs left and right these days? Not many. This will fit in nicely for brand new projects that require nothing more than a rack. Once a company puts something through the paces for 2-3 years, and the engineers managing it love it, then the slow migration from UCS (or Nutanix) to Oxide begins. This is usually how I've seen new hardware architectures introduced at mid and large size companies.
good question. not someone too large to want to pay the per-node margins. not someone too small to want to pay .. not someone who is satisfied with using VMs on a cloud provider. not someone who is selling these as part of a turnkey solution for whatever segment is left.
that said, I do feel persistently sad that we can't fix structural problems because the market is such a gradient descent world.
As for the website, some of the animations could use a spring-like movement profile to feel more physical. The website also isn't reachable over IPv6, so I would be very careful with the promised IPv6 capabilities of the server too ;-)
Bravo! Better servers for people who want to own their infra. Too many people seek out cloud services just to get a modern control plane. Server "UI" has been long neglected.
And finally, it's nice to see people with brains building real things with nary a mention of "blockchain".
Why are we hard coupling the hardware to the software? The whole secret of the success of M1 and ARM in servers is that lots of software has long ago stopped being hyper-aware of what hardware it is running on.
What software are we talking about anyways? It's all incredibly vague, but it seems to reach all the way into the Kubernetes sphere. Why would I run this over something I can use on my next job?
> Why are we hard coupling the hardware to the software? The whole secret of the success of M1 and ARM in servers is that lots of software has long ago stopped being hyper-aware of what hardware it is running on.
The software running on M1 is a bespoke fit for it. That's why the performance in macOS on M1 is phenomenal. It was custom made to execute optimally on it.
It's probs cheaper than AWS if you already have on prem infra. AWS has pretty damn good margins.
And the idea of "these racks are my kubernetes cluster and are supported by the OEM as such" has a lot of value to a lot of the medium sized IT departments I've run across.
Can you expand on what you mean on "coupling the hardware to the software"?
"Attests the software version is the version that is valid and shipped by the Oxide Computer Company"
this makes "Oxide Computer Company" the primary target and point of vulnerability in multiple ways.
1) rogue employees (state-sponsored, corporate espionage) could replace the software. customers could do nothing about it, and might not even be told.
2) sale of the company by the VCs or a Corporate take-over gives no guarantee that what is safe now will be safe in future, no matter what the VCs or the company says right now.
3) whatever expertise "Oxide Computer Company" thinks they have, they're the single-point-of-failure. the larger the number of customers, the less likely that a given vulnerability will be immediately fixed and distributed out.
this is just some of the possibilities. sorry to say that there's so many things wrong with this idea it's really hard to hold back and not say anything.
now, if the full source code right to the bedrock is available, and the CUSTOMER is given FULL CONTROL, THEN we do not have a problem.
by "full control", that includes:
* all DRM keys, including TPM signing private keys
* all peripheral initialisation source code (including DDR4, PCIe and USB3 firmware)
* BMC (Baseboard Management Controller) source code
* BIOS source code
* operating system source code
* full source code for all tools and toolchains for the above, to avoid vendor lock-in and the possibility of the toolchain itself introducing rogue code
this is one hell of a list and it's almost impossible to fulfil with today's "NDA'd proprietary firmware 3rd-party licensing" mindset. the only company in this secure-server space that to my knowledge has achieved it is Raptor Engineering with the TALOS II, when running with the Kestrel BMC replacement on the Lattice ECP5 FPGA.
"Attests the software version is the version that is valid and shipped by the Oxide Computer Company"
So in other words these servers will implement restrictive code signing practices and will be vendor-controlled, not owner-controlled?
This is not my idea of "secure", and really in the wake of things like the Solarwinds or RSA hacks it shouldn't be anyone's idea of secure. Vendor-holds-the-keys is not an acceptable security model.
A comment below mentions open firmware, open firmware is useless without the right to deploy modified versions of it.
I'm familiar with the concept. Does this mean that attestation to a different root of trust than Oxide will also be feasible, and that this is just a default?
Oxide makes the hardware, so it makes sense to use them as the root of trust since you already have to trust them to not make backdoored hardware. Why bother adding more parties? Also, for remote attestation to make sense, it needs to be done in the hardware itself (ie. keys burned into the silicon). I'm not sure how that's supposed to work if you add your own keys, or whether that would even make sense.
"it needs to be done in the hardware itself (ie. keys burned into the silicon)" - this isn't true; this is confusing Trusted Boot and Secure Boot, which are not the same thing (nor is it the only way of implementing Secure Boot).
Owner-controlled remote attestation is entirely viable, e.g. Talos II is capable of this with a FlexVer module.
> "it needs to be done in the hardware itself (ie. keys burned into the silicon)" - this isn't true; this is confusing Trusted Boot and Secure Boot, which are not the same thing (nor is it the only way of implementing Secure Boot).
I meant as opposed to keys/signing done in software.
>Owner-controlled remote attestation is entirely viable, e.g. Talos II is capable of this with a FlexVer module.
I skimmed the product brief[1] and it looks like it's basically a TPM that has a secure communications channel (as opposed to LPC which can be MITMed)? I'm not really sure how this is an improvement, because you're still relying on the hardware vendor to send the PCR values. So at the end of the day you still have to trust the hardware vendor, although the signing is done by you, but I'm not really sure how this adds any benefit.
FlexVer doesn't hardcode any keys - you can fully reinitialize the TPM to your liking, but doing so destroys any secrets already stored. So the trick is that you initialize it for your infrastructure and have to do secure reprovisioning if it ever fails to provide the same key answers (which would indicate tampering)
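To make the owner-controlled version concrete, here's a heavily simplified sketch of verifier-side attestation where the operator, not the vendor, holds the expected measurement and checks the device's signature. Everything here is hypothetical and omits what a real scheme needs (nonces to prevent replay, a certificate chain for the device key, an event log rather than a single hash):

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Device side (normally a root-of-trust chip): measure firmware, sign the hash.
    device_key = Ed25519PrivateKey.generate()      # owner-provisioned identity key
    firmware = b"...firmware image bytes..."
    measurement = hashlib.sha256(firmware).digest()
    quote = device_key.sign(measurement)

    # Verifier side (the operator): compare against a known-good hash they control,
    # then verify the signature with the device's public key.
    expected = hashlib.sha256(firmware).digest()
    assert measurement == expected, "unexpected firmware measurement"
    device_key.public_key().verify(quote, measurement)  # raises InvalidSignature if bad
    print("attestation ok")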
Hopefully some Oxide people can answer :-)