Show HN: SadServers – Test your Linux troubleshooting skills (sadservers.com)
597 points by fduran on Oct 26, 2022 | 128 comments
Hello, I'm building SadServers.com, a SaaS where users can test their Linux troubleshooting skills on real Linux servers in a "Capture the Flag" fashion.

I hope this is useful. To learn more about the project, please see https://github.com/fduran/sadservers




Based on your architecture diagram it looks like you're spinning up an instance per-user? As you're probably finding now, you will hit AWS limits quickly.

You might instead want to have a smaller pool of (larger) servers that you run co-resident VMs on with https://firecracker-microvm.github.io/. That will avoid account limits and also keep your AWS costs more predictable.


That's kinda a nice use case for the WASM machine/Linux emulators: then you just need to provide the image and the user can run it in the browser.

> You might instead want to have a smaller pool of (larger) servers that you run co-resident VMs on with https://firecracker-microvm.github.io/. That will avoid account limits and also keep your AWS costs more predictable.

I'd imagine (still waiting for it to load lmao) most of it could be containers too.


Someone else linked https://github.com/copy/v86 which seems really neat.

I like making jokes with coworkers about implementing this or that bit of infra with WASM-based tools mostly to get a rise out of them but each time I make the joke I look into some of the tools or projects and the balance of joke to "I'm actually serious" shifts a little bit to the right.


So then the user experience will be poor due to the slowness and non-standard implementation. A better solution IMO would be to provide a container with SSH access.


Just run them in Linux VMs with WASM, on the users' browsers. Make them all pay for it with higher utility bills and greater wear & tear on their hardware.

trollface.jpg


This is actually a good idea for this -- the user wants the education, they can pay for it with their own hardware. Keep your costs low!


Probably a better experience for everyone. You just have to distribute the image (rather than running vms) and the user gets instantaneous responses.


If it is hosted on AWS the bandwidth of distributing the images is likely more than the cost of the compute.


Cloudflare exists


I thought Cloudflare only ensures high usage of the free tier for "web"-ish responses, which doesn't even include .txt files. But I suppose this use case is several orders of magnitude away from that of EasyList, at least in request rate.


I mean you could just pay and use R2 directly. I think it would still be much cheaper.


Why not spin up containers instead of VMs? Seems to me containers would fit much better than VMs.


If the goal of the test is to debug a sad Linux server, containers are going to severely limit the ways in which the server can be sad, aren't they?


Can you give me an example of some of the severe limitations you're mentioning?


I can give you a bunch of things that can't be simulated in a container:

* Boot problems, such as: GRUB config/install errors, kernel parameters, init startup errors, blocking processes

* Many network scenarios, such as: PXE issues, multipath, load-balancing, anything requiring configuring network interface settings, firewall configuration.

* Resetting an unknown root password

* Booting directly to bash

* Filesystem mounts through fstab or systemd mounts

There's probably more I could think of, but I think that's a good list.


I don't think the DNS exercise would behave the same, although that probably depends on how the container was set up. Docker usually controls /etc/resolv.conf. Another exercise is "try to figure out if you're in a container or VM", so that'd definitely be different.


The question is not if the exercises would behave identically, but if you can test the objective in a container. For example, you can totally test, screw up, and fix DNS in a container. I would think that "try to figure out if you're in a container or VM" would be exactly the same as it is right now.
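
For reference, that kind of check usually comes down to a few filesystem heuristics. A rough sketch, assuming a Linux guest: the marker files (/.dockerenv for Docker, /run/.containerenv for Podman) and the cgroup hint strings are common conventions, not guarantees, and the `root` parameter is a hypothetical hook added only so the function can be tried against a fake directory tree. On cgroup v2 hosts the /proc/1/cgroup hint often shows nothing useful, so a False result proves little.

```python
import tempfile
from pathlib import Path

# Conventional marker files, relative to the filesystem root.
MARKER_FILES = (".dockerenv", "run/.containerenv")
# Substrings that tend to show up in /proc/1/cgroup on cgroup v1 hosts.
CGROUP_HINTS = ("docker", "kubepods", "lxc", "containerd")


def looks_like_container(root: str = "/") -> bool:
    """Best-effort guess; True means 'probably a container'."""
    base = Path(root)
    if any((base / marker).exists() for marker in MARKER_FILES):
        return True
    try:
        cgroup_text = (base / "proc/1/cgroup").read_text()
    except OSError:
        # File missing or unreadable: no evidence either way.
        return False
    return any(hint in cgroup_text for hint in CGROUP_HINTS)


# Try the heuristic against a throwaway directory instead of the real root.
with tempfile.TemporaryDirectory() as fake_root:
    empty_result = looks_like_container(fake_root)
    (Path(fake_root) / ".dockerenv").touch()
    docker_result = looks_like_container(fake_root)

print(empty_result, docker_result)
```

Calling `looks_like_container()` with no arguments inspects the real root of whatever machine it runs on.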


Containers have a history of escape vulnerabilities, for reasons like sharing a kernel with the host and other containers.

VMs are designed from the ground up to isolate guests, rather than focusing on application deployment.

Firecracker is the modern container alternative in untrusted compute scenarios, with Fly.io even converting container images into Firecracker VMs.


>Containers have a history of escape vulnerabilities

Generally agreed, but for this use-case do we care?


I haven’t gotten any of the challenges to load, but if you’re going to simulate a sysadmin it would make sense to give you high privileges (or even root) on the box. The more privileged you are inside a container, the more attack surface you expose.


Which is why you create a "dummy" host VM that hosts containers. Nobody's saying "host containers on your prod webserver." On the other hand, spinning up a VM for every user seems insane to me.


User mapping is now a standard feature in Kubernetes, so escape vulnerabilities aren't so much an issue anymore. Additionally, you can use gVisor.


User namespaces have resulted in multiple new container breakout CVEs in the last year. Some guides actually recommend disabling user namespaces because they are still somewhat new and perilous.


You're talking about creating new user namespaces inside a container, not running a container in a user namespace. Running a container in a user namespace is strictly a security improvement over running it in the host user namespace.

Also, all container runtimes automatically block unshare(CLONE_NEWUSER) with seccomp already (unless they've disabled seccomp, which I'm not sure if Kubernetes still does).


What are the ones in the last year? They provide security benefits as well. I mean, you could say the Linux kernel is also dangerous and the Windows kernel and pretty much anything that has ever had a CVE. You can also limit it to specific users too if that is a major concern.


Bypassing container security is easier than bypassing VM security.


Then wouldn't that be the ultimate test ;)


I haven't fully grokked this yet, but one trick I've used in the past to get around limits is AWS Organizations, creating a sub-account per property. A bit more setup but can keep things cleaner administratively.


AWS will raise limits if you ask. Increasing EC2 instance limits is usually a quick turn around.


Yes, the default limits are there to prevent abuse and runaway misconfigurations. They won't turn down revenue if you confirm it's intentional.


At least for the tests I've done on a small startup recently, they've also implemented some automatic quota increases for EC2. I ran commands that would have eclipsed (or did eclipse) my quota, and got an email that my quotas were bumped a few minutes later.


Yes thanks!


I'd suggest integrating https://bellard.org/jslinux/ and running the VM in the browser if you can - then you can scale without running out of resources.


Thanks, I've been looking at WASM, for ex https://github.com/snaplet/postgres-wasm/tree/main/packages/... , it would certainly simplify everything to "download a fat file".


Have you seen https://copy.sh/v86/ ? It doesn't run as fast as jslinux but is BSD licensed, on GitHub, and supports resuming the VM from a snapshot.

https://github.com/copy/v86


Didn't know about this, thanks!


or linux kernel port on webassembly.


Very cool! This reminds me of the ops challenge @ Slack. I'm not sure if they still do this, but the SRE/platform infra interview used to involve a VM running a malfunctioning LAMP stack.

You'd get SSH access to the VM, then submit a diagnostic report of what was broken (and how you fixed it).

Reminded me of how Red Hat used to run their certification test (RHCE). I probably still have the live CDs for my RHCE laying around somewhere.


I've had interviews like that in the past, and really enjoyed them. Much better than "Draw an architecture diagram for how you'd handle a serverless IoT application" - where you lose points, silently, because you didn't pick something the interviewer expected you to do.

Usually a simple combination of immutable files, SELinux policies, and typos in configuration files was enough for most of the challenges. Though now and again you'd find they'd given you a server with packages removed, or not yet installed.


Oh that reminds me, I loved the original Stripe CTF, it's been 10 years already! https://twitter.com/fduran/status/240321390698442753


New challenge: Fix SadServers’ sad servers


And while we’re at it, we might as well write a wrapper around low-upvote Server Fault questions in the hope that they attract more attention when the problem is gamified.


Seems like it's out of capacity:

    An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
Maybe something like https://leaningtech.com/webvm-server-less-x86-virtual-machin... would be cheaper and more reliable for this kind of thing?


Yes, HN effect lol-sob.

Mitigation: temporarily reducing server lifetime so more people can try.


Usually I roll my eyes when someone posts their own website to HN and it crashes under load. But given the nature and complexity of yours I think there's room for understanding and patience :)


Thanks, I did some stress-testing and infra is scalable enough but I forgot about the AWS quotas, my bad. Quota increase requested and servers are killed off so hopefully "soon" the issue will go away.


Scaling this service without breaking the bank could become its own "sad server" scenario.

I'd start by moving the test VMs to bare-metal servers running libvirt. You can get a 128GB RAM server for ~110 EUR and that should be able to run around 120 concurrent VMs assuming 1GB of RAM to each (CPU isn't a major issue in this case).
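
A back-of-the-envelope version of that estimate, as a sketch only: the 8 GB held back for the host OS and hypervisor is an assumed figure (it is what makes 128 GB come out at about 120 guests), and treating the ~110 EUR as a monthly price is also an assumption.

```python
def max_concurrent_vms(host_ram_gb: int, ram_per_vm_gb: int, host_reserve_gb: int) -> int:
    """Number of guests that fit in RAM after reserving some for the host."""
    usable_gb = host_ram_gb - host_reserve_gb
    return max(usable_gb // ram_per_vm_gb, 0)


HOST_RAM_GB = 128
RAM_PER_VM_GB = 1
HOST_RESERVE_GB = 8      # assumption: headroom for the host OS and libvirt
SERVER_COST_EUR = 110    # assumption: taken as a monthly price

vms = max_concurrent_vms(HOST_RAM_GB, RAM_PER_VM_GB, HOST_RESERVE_GB)
cost_per_vm_slot = SERVER_COST_EUR / vms

print(f"{vms} concurrent VMs, about {cost_per_vm_slot:.2f} EUR per VM slot")
```

With these inputs the result is 120 VM slots at a little under 1 EUR each; changing the per-VM RAM or the reserve shifts the number proportionally.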


Very cool project!

I was the founder of a school training software engineers; we had an infrastructure track that got a lot of our students to land SRE positions. When asking employers for feedback about our grads, one piece of feedback kept coming up: they lack experience when it comes to troubleshooting.

So I went on a quest to simulate that infra debugging while in an academic context.

I came up with the idea of giving students broken servers. I used Docker containers and would set up a simple workload and mess it up with classic issues.

Needless to say students generally did not like it :) debugging isn’t fun. But it did help a lot.


I'd love to get the actual VM content offline, packaged as Vagrantfiles or Containerfiles. Love the idea though! Go to Pluralsight and pitch it to them :)


A few people have suggested offering content offline as a Docker image etc, good idea, thanks.


The idea is really cool, but all I see is "Waiting for server..." and nothing happens.


That's the trick: you failed the first challenge, "Did you try to turn it off and on again?"


I love this idea, I'll definitely try it out when provisioning for scenario machines is up again. Nice work.


Hack The Box -> Fix The Box


Really cool idea.

After choosing a problem, the endpoint you poll at https://sadservers.com/celery-progress/xxxx repeatedly returns {pending: true, current: 0, total: 100, percent: 0} for me.


Yes, good catch (I should forbid internet access to this endpoint). The poor queue is waiting on a VM to come up, but there's no quota left until other VMs are garbage-collected.


I'm assuming you're spinning up an EC2 instance for each lab. What do you think about using pre-built Docker images for each challenge instead? That way they can spin up in just a couple of seconds. Might also be cheaper?


I wanted to do full VMs rather than Docker images but yes I could do Docker images or dedicated big instances with VMs on top like somebody else is suggesting.


Not a bad idea but something to consider; this limits the options for kernel level things quite considerably


probably lxd would be better.


Completed the first challenge and it was a lot of fun - spoiler I've never had to use the 'lsof' command before.


>Completed the first challenge and it was a lot of fun - spoiler I've never had to use the 'lsof' command before.

I've been waiting a while for the "sad server" to come up for me and read the scenario (saint john) whilst waiting.

lsof was the first thing that came to mind after reading the scenario.

I guess that once I actually get a "sad server" I'll make it "happy" quickly :)


Can I download the images so I can run them on my own machine? I'd really appreciate it, I've got an interview very soon :)


Commenting to give this a try later, I've routinely been the person to get these kinds of gremlins escalated

I've long wished for some sort of mock, "things are broken - I want to see how you think" approach for sysad


In the "tricks of hacker news" -

     188 points by fduran 3 hours ago | unvote | flag | hide | past | favorite | 68 comments
If you click 'favorite' it will save it to your favorites list. This is a publicly visible list - yours is https://news.ycombinator.com/favorites?id=bravetraveler and mine is https://news.ycombinator.com/favorites?id=shagie - which makes it easy to get bookmark-style functionality within HN.

As I tend to favorite less often than I comment, it makes it easier to find those things I want to find again.


Much appreciated! I'm woeful about not using features like this, it's a character fault at this point.

The HN interface too tends to just have my eyes filter out those links... but that's no defense.

Especially good to know that it's publicly viewable!

Not that I'm particularly worried of being outed by anything I favorite here, it's just good to be mindful of the data we make and where it goes.


Can't get to the first problem because of HN hug but anyway there are fake ways to "solve" it like renaming the logfile (what they test for solved is provided).


This is a self-test, not a certification. The goal is not to defeat the verification goal, but to learn something. So yeah, it's perfectly acceptable that the tests are not bullet-proof.


Depends on how the broken program writes to the log.

If it does

    while true; do echo hello >> bad.log; done
Then renaming bad.log will not solve the challenge.


Replace it with a symlink to /dev/null! Or /dev/full if we feel like it.

(Yes, these are bad solutions, since the instructions explicitly said to stop the process which is writing.)


It will still keep writing to an open inode


No, the “while” loop I was commenting on would not: it reopens bad.log by name on every iteration.
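
The two cases being argued here can be reproduced in a few lines. This is a sketch assuming a Linux/POSIX filesystem; the file names are just the ones from the example above, and `append_reopening` is a hypothetical helper that mimics `echo hello >> bad.log`.

```python
import os
import tempfile


def append_reopening(path: str, text: str) -> None:
    """Open the path by name for each write, like `echo ... >> file` in a loop."""
    with open(path, "a") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "bad.log")
    moved = os.path.join(d, "bad.log.old")

    # Case 1: the writer reopens the file on every write.
    append_reopening(log, "hello\n")
    os.rename(log, moved)
    append_reopening(log, "hello\n")            # recreates bad.log
    reopen_recreated = os.path.exists(log)
    reopen_moved_size = os.path.getsize(moved)  # only the first write landed here

    os.remove(log)
    os.remove(moved)

    # Case 2: the writer keeps one descriptor open for its whole life.
    with open(log, "a") as f:
        f.write("hello\n")
        f.flush()
        os.rename(log, moved)
        f.write("hello\n")                      # still goes to the same inode
        f.flush()
    held_recreated = os.path.exists(log)
    held_moved_size = os.path.getsize(moved)    # both writes landed here

print(reopen_recreated, reopen_moved_size, held_recreated, held_moved_size)
```

In the first case the renamed file stops growing and a fresh bad.log appears; in the second, bad.log stays gone but the renamed file keeps growing. Either way the writer is still running.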


There are ways to cheat but not so simple; there's a script that checks for the solution and a hash of the script is checked for modifications.
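
To illustrate the general idea only (this is not the actual SadServers code): a tamper check of this kind can be sketched as comparing a digest of the script against a value recorded ahead of time. SHA-256, the function names, and the placeholder script contents are all assumptions made for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(script_bytes: bytes, expected_digest: str) -> bool:
    """True if the script still matches the digest recorded earlier."""
    return sha256_hex(script_bytes) == expected_digest


# Hypothetical check script; the real logic is not shown in this thread.
CHECK_SCRIPT = b"#!/bin/sh\n# placeholder check script\nexit 1\n"
EXPECTED_DIGEST = sha256_hex(CHECK_SCRIPT)  # recorded when the image is built

original_ok = is_unmodified(CHECK_SCRIPT, EXPECTED_DIGEST)
# A user appending an early "exit 0" changes the digest.
tampered_ok = is_unmodified(CHECK_SCRIPT + b"exit 0\n", EXPECTED_DIGEST)

print(original_ok, tampered_ok)
```

For this to mean anything, the expected digest has to live somewhere the user on the box cannot rewrite.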


Are you familiar with Trueability? https://www.trueability.com/

It seems like this is a similar SaaS.


Didn't know about this one. There are quite a few labs/sandbox SaaS, but what I've seen so far is that they are more for training with a "follow the recipe" model (do this, do that to configure something) rather than "this (real) server is broken, fix it (with possibly different solutions)", which imho is more real-life and useful.


I believe the company was founded by some coworkers of mine way back when at Rackspace who often interviewed Linux admins with a lab VM and I assume they just automated the setup and spun it off as their own business. At least that's what happened as far as I can tell; I didn't know the parties involved.


This looks interesting! But it keeps loading forever saying "Your server is being created" (hit VM limit again?)


"Have you turned it off and on again?"


Well this sucks I wanted to try it lol. It's timing out for me or throws an error.


Interesting idea! Looking forward to trying this once some VMs are available. :-)


I only want to say that I love the name SadServers. Strongly memorable.


The tasks load forever. Is that challenge zero?


did you read up on the problems with leetcode?


Hi, not sure what the question means, I came up with the scenarios not copying from leetcode if that's what you mean.


I think they mean 'are you aware of the limitations of Leetcode-like tests and the downsides of their (over)use in hiring processes?'

(FWIW I think this is a very cool and fun educational project regardless of what usefulness it might or might not have in IT hiring decisions, and I'm looking forward to playing with it)


This is badass, just what I need!


>Practice for your next SRE/DevOps interview.

Are SREs and DevOps tasked with administration of operating systems?


> Are SREs and DevOps tasked with administration of operating systems?

yes, eventually.

you can dress it up in all the fancy terms that you like. but devops and SREs are sysadmins with better PR.

it's critical that SREs understand _how_ to debug a system, so that they can work out how to put in fixes and/or design better systems.


Both SRE and DevOps are essentially evolved sysadmin roles. The DevOps philosophy is cross-functional and many sysadmins have adopted a DevOps approach. The latest edition of the classic sysadmin book "The Practice of System and Network Administration" is now centered around DevOps.


If you have ops somewhere in your responsibilities, then yes.


depends on what layer the issue is happening at. I know everyone thinks the OS has been abstracted away but my ticket queue says otherwise. "yaml engineering" is just a control surface, I still need to pop the hood often.


Yeah. Random data point: One of my most favorite SRE interviews ever (serious fun!) involved hands-on troubleshooting that eventually required gdb.


How do you automate something you can't do manually?


My only feedback is that this is unrealistic because today developers wouldn’t try to debug something, they’d just destroy the instance, push a commit and hope it fixed something infra related then recreate it.

Why would you need to understand how something works? Just use containers. /s


Developers just need to understand everything because we need developers to do everything and meet all deadlines. We wouldn't dare consider a support role that could troubleshoot it because then there would be no point to having developers that can do everything! /s


Support doesn't deliver features, we need new features! /s


If most developers can't debug a VM, then anyone who can will be able to charge a premium. If you have a proficiency in ops, remember that the next time you negotiate a compensation package.

[Edited my compensation numbers to avoid down votes - yikes]


I feel like you definitely have to target particular companies and more specifically specific titles and skills to offer to do so.

My guess is trying to sell high end services as a "principal software engineer" isn't going to be enough to justify that cash comp to a lot of people hiring.


I wouldn't think of it as trying to sell yourself as a "principal software engineer" on an open market.

I'd make a list of the companies where hiring/scaling the ops team will make or break the business's value delivery, and filter by companies aware of this.

You can knock this out at the recruiting step, just by asking about open developer headcount vs. open SRE ops headcount. Ask which direction that ratio seems to be going, and if there's anyone you can talk to whose job it is to change that ratio (director or VP mandate).

The referral network from working at a hyperscaler co in ops is a great way to break into the space.


Thanks for the heads up!


This is so sad but so true!


If it's dumb and it works, it's not dumb.


> It's also my not-so-secret hope that a sophisticated enough version of SadServers could be used by tech companies (or for companies that carry on job interviews on their behalf) to automate or facilitate the Linux troubleshooting interview section.

Yup, that's what I was afraid of.


Why are you afraid of this? My org has run a hands-on technical exam with a stack of linux admin basics (I won't enumerate them here because people do their research) but they are based on real problems we've had and the feedback is overwhelmingly "this was one of the best technical interviews I've ever had."

We ask the engineer who is proctoring the interview to think about the following question: Would you want to pair with that engineer again?

If that answer is no, then we probably won't go further because pairing with engineers to troubleshoot is what we do every day.

Some great resumes have died with not knowing how to see what's running on port 80.


Yeah, we did this at a previous employer.

One example: we had them SSH in, then download and extract a tarball (the Linux source, but the content doesn't matter). Sometimes, they'd gunzip to stdout. The reaction tells you a lot. "lol whoopsie" followed by a quick fix: person knows what they're doing. "uh… what is going on? did I break it?" followed by general cluelessness… maybe not.

That did occasionally break tmux, though.

Part of it was "what are the specs of this thing you're SSH'd into?" and we had one candidate who was adamant the numbers must be wrong: 2 GiB is too little RAM, no machine is that small! Yeah, we didn't spin up a 128 GiB VM for your interview…


I never cease to be amazed at how few people really realize just how little hardware is often required for getting real work done. You'd be surprised just how much that 2GB vm with a couple cores can handle!


I started with a single 1xx MHz core and 16MB of RAM. And I'm sure some with even less, lol.

Supporting your point: Hardware is awesome if you use it wisely.


My first Linux box was a 20mhz 386SX laptop with 3 megs of RAM (1 meg on the motherboard, 2 in an expansion.) I could barely run Linux 0.99.x. The distro was SLS, and it came on 12 or so floppy disks. I quickly upgraded to a 486 with 8 megs RAM, then 20... which seemed incredible at the time (1994-ish.)

It's amazing how bloated today's software is...


If you give the person you're interviewing access to the same tools they'd have in a regular day on the job (Google, manpages, etc.), I'd say that's a fair and probably relatively enjoyable interview.

Rejecting someone because they can't recall the correct netstat syntax doesn't seem like good hiring practice, but I assume in good faith that's not what you meant :)


Yeah, I google, tealdear, "--help", and manpage anything I don't use at least once a week, every time. Usually I don't remember them otherwise, and if I think I do, I don't trust my memory that well. Only exception is if I remember enough to be able to ctrl+r them out of shell history faster than I can do those things—and actually, for some of those, I do use them often, but couldn't possibly tell you how because I only run a couple commands 99% of the time and always pull them out of history unless it's one of the rare exceptional cases—I couldn't rsync for a particular outcome without consulting a reference, to save my life, even though I use it often.

And usually you only use a fairly small set of tools that often, in any job, and which set will depend on the employer, how things are set up, and what exactly you're doing.

Oh and somehow I get "-r" versus "-R" for "recursive" wrong almost every time, even for commands I type almost daily, unless I check first. It's weird. If tools could get on the same damn page about which means "recursive", that'd be great.

TL;DR I do have a pretty good idea what I'm doing, but look like an absolute idiot if anyone watches me do it. Much worse, even, if I know they're watching and we're not in some kind of relatively high-trust relationship (so, definitely not in an interview setting).


Exactly, all man pages and google is fair. We want to see how they think not rote memorization.


I love this point. Joke: are you hiring?

I'm quite happy to try to demonstrate how I think, but I hate hate hate leet code because A) it's not relevant to showing how one thinks and B) I've read so much dunking on it on HN that I'm now stopping interviews when they pull out the hackerrank or live code to say 'without using the library, reverse this linked list'.


That sounds awesome! Wish I got the chance to do more hands-on interviews in the mobile dev space, most of my interviews just end up being run of the mill leetcoding.


People in higher-up positions like yourself will rarely be subjected to testing with tools like this. You are basically trying to remove the human from the equation and industrialize the whole process.


What we're trying to do is respect people's time. We can get more about someone's technical understanding in 30 minutes of hands-on exercises than we can in a full day of panel interviews. It's better for us as we have a much better understanding of where you're at Linux-wise and it's better for you because you only need to come to two hours of interviews, total. Seems like a win-win to me.


In my experience this type of interview (and coding interviews in general) usually falls into one of two categories: 1) "I learned this neat trick and want to show candidates how smart I am" or 2) "I have this bug in prod and I want to see if you can fix it for me."

If the interview was along the lines of upgrading the packages on the system, debugging why nginx was crashing, figuring out the specs of the system, etc. that is totally fine with me and I believe respectful of a candidate's time. Unfortunately it always turns into something else when people need to come up with new "challenges" for candidates.


Framing a question like “a system has a high load average, what commands would you use to begin diagnosing that?” and taking that conversation as deep as the candidate can go is neither time consuming nor requires a panel of people.


No, I'm trying to make sure the person who is interviewing for a job where they will deal with computers on a daily basis appears to have seen a computer at some prior point in their life.

I wouldn't feel the need to do this if so many candidates didn't fail rudimentary tests. A SWE candidate MUST be able to write the function min(), in the language and tooling of their choice. But in an interview, a sizable fraction cannot. (The actual bar is far higher than min(), ofc., but min() ought to be trivial.)
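
For scale, this is roughly what that bar looks like in Python. It is a sketch of one acceptable answer, not a reference solution; the name `minimum` and the choice to raise ValueError on empty input are arbitrary.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def minimum(values: Iterable[T]) -> T:
    """Return the smallest item, using only the < operator."""
    iterator = iter(values)
    try:
        smallest = next(iterator)
    except StopIteration:
        raise ValueError("minimum() of an empty iterable") from None
    for value in iterator:
        if value < smallest:
            smallest = value
    return smallest


smallest_number = minimum([3, 1, 2])
smallest_word = minimum(["pear", "apple", "fig"])
print(smallest_number, smallest_word)
```

Handling the empty-input case at all is usually the part worth talking through with the candidate.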


> Why are you afraid of this?

> My org has run a hands-on technical exam with a stack of linux admin basics ... they are based on real problems we've had and the feedback is overwhelmingly "this was one of the best technical interviews I've ever had."

You essentially answered your own question.

Putting thought into the interview process and working with candidates through real problems is valuable. I cannot say the same for outsourcing or "automating" this portion of an interview using 3rd party SaaS.


We do this in our org as well. 30 minutes of troubleshooting Linux issues is a good way to evaluate a candidate's experience. We run it as a team exercise with the candidate so that we also get the added bonus of seeing how they work in a team setting, how they communicate, etc.


Is it bad though? The problem with Leetcode is that it's an extremely unrealistic test. This on the other hand seems like it actually tests real-world scenarios, and you can get there without grinding. I'm pretty sure I can pass all the tests they've currently got despite having no formal sysadmin experience, just using common developer knowledge, common sense and strategic Google-fu.


The Red Hat Certified System Administrator, Red Hat Certified Engineer and similar tests require practical, general hands-on skills to solve broken systems. The performance tuning and troubleshooting exams go into more detail and more complex scenarios. No internet access, but resources are available if you understand how to use them. Would never suggest people should solely hire on those certs, but if someone takes the time to complete 7 hands-on tests for the certified architect certification, it's a strong indicator they have skills.

Even so, test taking can be stressful but it's arguably less stressful than actual production support with people waiting on the result. Whether people really want to put candidates in a stressful situation is up to them. SadServers seems like it's somewhere in the middle vs some of the things I've seen. One job interview put me in a room with a boot CD, and an ancient computer with a CD-ROM drive so slow you got exactly one chance to boot the media and recover the system in the time limit. But the job was for a trading company, so if you couldn't handle that they didn't want you. It was a fun exercise but would I do that to someone else? Probably not.


Please don't post shallow dismissals, especially of other people's work.

[...] Please don't pick the most provocative thing in an article or post to complain about in the thread.

https://news.ycombinator.com/newsguidelines.html


Already exists. I can't remember the name, but the infra company that I used to work for used one of these as part of their interview loop.


That doesn't mean that I'd charge individual users :-)

Heck, I'm not even asking for an email (and I had to do extra session management coding for that).


but why? it's a real test that is repeatable, realistic and not _overly_ hard. Sure, for a junior software engineer it's a bad fit, but for a devops/sre/sysadmin it's a great fit.

it's certainly better than some crappy whiteboarding session, or worse, a take-home test.


I knew this is where it was headed :/


Cool!



