I've spent the last 7 years using SSH to get onto a prod box a few dozen times. Less than once a month. When I do have to get on the box, we use signed keys that expire in 24 hours (a manager or SRE is required to sign the key if you need to get on a box).
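For anyone unfamiliar with the mechanics, short-lived signed keys are plain SSH certificates. A minimal sketch of the signing flow (all file names and the principal are illustrative):

```shell
# Generate a CA key pair (done once; in practice the CA key lives with
# the manager/SRE tooling, not on a laptop)
ssh-keygen -t ed25519 -f ca_key -N '' -C 'ssh-ca'

# Generate the user's key pair
ssh-keygen -t ed25519 -f user_key -N '' -C 'alice'

# Sign the user's public key; the certificate expires in 24 hours
ssh-keygen -s ca_key -I alice-session -n alice -V +24h user_key.pub

# Inspect the resulting certificate (principals, validity window)
ssh-keygen -L -f user_key-cert.pub
```

Servers then only need to trust the CA (`TrustedUserCAKeys` in sshd_config) rather than carry per-user authorized_keys files.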
The article is talking about the scaling issues of SSH access via jump boxes or bastions. I'd argue that a better solution to scaling SSH access is to invest in tooling that makes SSH access unnecessary for most of your team. Centralized logs, error reporting, distributed traces, etc. are fairly well solved for most installations. Interactive access to prod (e.g. running DB migrations) requires a little more investment, but tools like ECS Exec make that fairly accessible without requiring SSH access.
I agree with you. But you are viewing it from a web-application-centric organisation. There are a lot of other types of organisations out there, e.g. a company that uses vendor products and now has to give the vendor access to servers, databases, etc.
The scenarios where you need access to the downstream resources are mainly for doing ops on them. Other things, such as viewing logs, are predictable and can be exposed to developers via some tool.
None of that takes away the need for tcpdump, strace, core dumps, etc. Surprisingly much troubleshooting needs to be done in production. I have yet to see views to the contrary survive contact with the real world.
The idea of expiring ssh keys seems completely reasonable, but it's also a good idea to have enough centralized access control to be able to actually give someone temporary access allocations. Key validity is evaluated locally, against system time.
tcpdump and co do happen occasionally, but less than I had expected. This isn't a small installation. We support around 30M active users and routinely push 20k qps in web traffic. We run on a few thousand vCPUs in EC2.
It's a pretty big architectural philosophy shift. All our production workload runs on spot instances, unhealthy hosts come and go quickly enough and our traffic rebalances fast enough that we just don't spend that much time debugging low-level issues. Core dumps get shipped to S3. We do continuous profiling with Google Cloud Profiler. There's a lot of tooling required but once that investment is made things run very well.
DB migrations should definitely not be done interactively and definitely not in an SSH session. Write them beforehand, have them reviewed with the rest of the code, and have your deployment process run them.
Just to clarify our process: DB migrations are part of code review, they run as a script, but we don't have that script run automatically as part of CD. We've been bit by enough migration surprises that we require someone watching and able to interrupt and cancel the migration if needed. But that's the extent of action required. Run this command, Ctrl-C if necessary. Definitely not YOLO'ing in a psql in prod.
That's a really good process, but it doesn't scale forever, even if you have a good person for it. Once you want more people doing it, I find it easier to have hard-coded timeouts that cancel and roll back if the migration gets blocked on a lock. I've also done automatic rollback in these cases.
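A minimal sketch of the timeout idea, assuming Postgres and psql (the connection string and migration file are placeholders): `lock_timeout` makes a blocked DDL statement fail fast instead of queueing behind long-running transactions, and `statement_timeout` bounds the whole thing.

```shell
# Run a reviewed migration with hard timeouts; a blocked ALTER TABLE
# aborts after 5s instead of stalling production traffic behind it.
psql "$DATABASE_URL" \
  -v ON_ERROR_STOP=1 \
  -c "SET lock_timeout = '5s'" \
  -c "SET statement_timeout = '10min'" \
  -f migrations/0042_add_index.sql
```

Because all `-c` commands and the `-f` file run on the same connection, the session-level settings apply to the migration script.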
It's still all access control, I don't really see a difference. Yes SSH access could perhaps give you access to more things, with less fine-grained permissions, but it just depends on the replacement(s).
The overall point should be, figure out what your needed access control permissions are and then find and use tools to meet those needs. SSH might be the tool, it might not be.
Shifting from tail -f <logs> to any web UI doesn't remove the access control for logging, it just shifts it into the fancy web UI. That may or may not be better, it's just different. One could certainly argue the fancy web UI is a better UX, but that's a totally different reason for selecting that tool or not. Access Control happens either way.
Er, so most of those supposed problems sound like you could fix them by just using unprivileged users jumping through the jump server via `ssh -J`/ProxyJump? Or a VPN, which is pretty close to what it sounds like the product they're trying to sell is anyways. And if you're going to try and sell a Teleport competitor, you really should do a better job convincing me that it's going to actually be secure in the first place. I don't see source code or audits anywhere on this website.
But it is written on the website: "without security compromise" - so it must be secure!!!
Also there is a lot of low contrast text on that website - this shows that they really know what they are doing and you can trust them. I am sure it is a very good company and you should give all the login credentials of your org to them and also install their binary on all servers.
Isn't "trust in company" the reason why you are using Linux servers?
Even when using a jump server, I've solved the issue of revoking access in the past by using LDAP to control access to servers. Instead of adding user accounts directly to server, the account information is stored in LDAP -- including public SSH keys. If you want to revoke a user's access across the entire infrastructure, then you can do so in one fell swoop.
I also have set this up to restrict SSH access to particular hosts based on the LDAP record.
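One common way to wire this up is sshd's AuthorizedKeysCommand. A sketch assuming an OpenLDAP server with the openssh-lpk schema (the hostname and base DN are made up):

```shell
#!/bin/sh
# /usr/local/bin/ldap-ssh-keys: sshd runs this on each login attempt and
# treats its stdout as the user's authorized_keys.
#
# Requires in /etc/ssh/sshd_config:
#   AuthorizedKeysCommand /usr/local/bin/ldap-ssh-keys %u
#   AuthorizedKeysCommandUser nobody
exec ldapsearch -x -H ldap://ldap.internal -LLL \
    -b "ou=people,dc=example,dc=com" "(uid=$1)" sshPublicKey \
    | sed -n 's/^sshPublicKey: //p'
```

Deleting the LDAP entry (or just its sshPublicKey attribute) then revokes access everywhere on the next login attempt.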
The problem they are describing isn't a problem with Jump servers... it's a problem with distributed authorization.
I haven't seen many organisations actually set SSH PAM via LDAP or another delegated authentication system.
You are right about the problem statement. It is a distributed authorization problem. And it is a very hard problem for a fast-moving company to solve, or even see, until it becomes a problem.
Do you have some data about how many orgs actually set SSH PAM via LDAP?
Where can I see this data? I would like to check your statements.
We are on the internet, so would you please add a source for your statements? There is a very basic and good feature called a "URL", please use it!
Or do you just want to say "I have not seen many orgs with SSH + LDAP in my career because I have never worked in one and from that I conclude that the whole world works like this"?
Most orgs are incompetent. We had the same discussion with many of our clients, whose requirement was stupid shit like "change password every month", while we had to negotiate that no, we don't even use passwords in the first place, we use hardware tokens for SSH keys.
And I'm talking about a few big local banks, where accounts on Red Hat boxes are still created manually by some ops dude according to some docs.
I don't think that I, or anyone reading that statement, would come to the conclusion that the world works the way I suggested in that argument.
I can talk about my interviewing and the empirical data from talking to dozens of companies, from seed to series B, about how they have been managing access to servers. But I won't; I'd rather urge you to do a basic trend search, on Google or your favorite platform, for "SSH PAM via LDAP" or "SSH LDAP" and see for yourself where the world is heading [1].
Oh, I’m sure it’s super rare. It’s actually quite easy to setup, but I’m not sure many people bother with the setup because LDAP (I’m not counting Active Directory) in general isn’t all that common. I know this just from the rarity of articles posted about getting it configured.
But once you do it, it’s something that’s easy to keep using because it’s so useful.
My favorite was setting up LDAP in combination with a jump host where I had a special program as the SSH command shell (like prgmr.com). I had it set up so the user could authenticate with a password, but then upload an SSH key from the custom shell.
Based on the interviewing I did last year, the clear trending solution, for enterprise, is Cyberark. I saw that all over the place for root password management.
Cyberark [1] and Delinea [2] are definitely the leading enterprise solutions right now. Okta also has an offering in this space, but I haven't seen it used widely yet.
But there are quite a few solutions on the market at this point that are trending upward. You'll find Teleport [3] and StrongDM [4] in high-growth companies, whereas Adaptive.live [5], Idemium.io [6] and now hoop.dev show up in early-stage to series B companies.
This isn’t root password management. Or at least, it shouldn’t be. Users shouldn’t have root passwords for end devices. This is about controlling access to remote servers and/or sudo access to those servers. None of which requires the root password on the remote server, unless I’m missing something. Is this for more ephemeral keys?
I set up LDAP with sshPublicKey extension in 2021 together with some scripts to configure new servers to use it and a small command line tool to add and remove GIDs per user (across any infrastructure). It's working great. The biggest effort was learning to manage LDAP, which is a bit anachronistic, but the payoff is worth it I think. Even with <20 users, managing public keys across all of our infrastructure was not sustainable.
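For the curious, the directory entries look roughly like this with the openssh-lpk schema (the DN, IDs, and key material are all illustrative):

```shell
# Add a user entry that carries their SSH public key; the ldapPublicKey
# objectClass provides the sshPublicKey attribute.
ldapadd -x -D "cn=admin,dc=example,dc=com" -W <<'EOF'
dn: uid=alice,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: ldapPublicKey
uid: alice
cn: Alice Example
sn: Example
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/alice
sshPublicKey: ssh-ed25519 AAAA... alice@laptop
EOF
```

Group membership (the GIDs mentioned above) is then just another attribute to add or remove centrally.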
The problem the article is describing is frankly a problem of lacking good configuration management. Hell, you can use SSH keys to authorize via sudo and use a hardware token to store those keys, sidestepping the problem of "user left their id_rsa somewhere".
I am a bit confused about this product. I once saw a product called Runops [1]; the customer list and product testimonials are exactly the same as this product, Hoop.dev [2].
[1] https://runops.io/
Jump hosts seem like an anti-pattern in the era of AWS SSM and Tailscale.
It is far too easy to misconfigure network policies and grant users access to infrastructure that they shouldn't have.
And with Tailscale you can run the agent within SaaS products like Github Actions or Terraform Cloud to securely manage their access into your systems.
I believe you still need a bastion host to query a database, for instance, unless you want to set up SSM on existing hosts - my current project is fully serverless, so I had to set up an EC2 instance to serve as the bastion. The beauty of SSM is that the host can be fully on a private subnet, not exposed to the wider internet as commonly suggested.
Yeah, if you’re using managed services within AWS you need a relay host. It doesn’t need to punch a hole to the outside world (like a bastion host) but it still needs some manner to allow tailscale (an ec2 box) to route to those services.
> Jump Servers must be able to reach to a certain private network and this requires specific configuration for each environment;
Yup. Use VPNs and firewall ACLs. Jump servers are leftovers of bad practices where good practices were too hard to implement.
> Burden of managing SSH keys of users throughout all nodes. Rotation is required when someone leaves or enter the organization;
That is extremely trivial if you have any sensible configuration management in place (and you should). We just store them in LDAP with the user data and distribute them where needed (GitLab, servers).
> Role management requires managing sudoers files, making sure file system permissions are properly configured and users are within their proper groups;
Ah yes, managing a text file, so fucking hard /s
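For what it's worth, the text file in question can even be a per-role drop-in; a sketch (the group and command are made up):

```shell
# Write a sudoers drop-in: members of the deploy group may restart one
# service and nothing else.
cat > /tmp/deploy-sudoers <<'EOF'
%deploy ALL=(root) NOPASSWD: /usr/bin/systemctl restart myapp
EOF

# Always check syntax with `visudo -cf /tmp/deploy-sudoers` before
# installing it as /etc/sudoers.d/deploy (mode 0440); a broken sudoers
# file can lock everyone out of sudo.
```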
> Nodes must be updated with the tooling necessary to interact with internal services.
>Keep a list of updated services (DNS) available to interact with it
see the point about CM
> Usually, infrastructure enginners are a scarce team and keeping all these components updated are hard to tackle. Over time, these nodes will onboard more users and tooling, which will increase the complexity over managing these resources.
Which is why you write it once and use automation. I don't think we touched our sudoers or ssh key management module in years, it was written once then had some small changes but that's about it
AWS Session Manager is great. However, to connect to an RDS/Aurora/Elasticache instance, you must still create an intermediate EC2 instance to run SSM commands against.
We use Basti (https://github.com/BohdanPetryshyn/basti/issues) to set up and manage the jump host. The tool automatically starts/stops the instance, which is excellent for irregular access.
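For reference, the port-forwarding variant looks something like this with the stock AWS-StartPortForwardingSessionToRemoteHost document (the instance ID and RDS endpoint are placeholders):

```shell
# Forward local port 5432 to an RDS endpoint through a private EC2
# instance via SSM -- no open inbound ports, no SSH keys on the box.
aws ssm start-session \
  --target i-0123456789abcdef0 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters 'host="mydb.cluster-xyz.us-east-1.rds.amazonaws.com",portNumber="5432",localPortNumber="5432"'
```

Access is then governed by IAM policy on `ssm:StartSession` rather than by network reachability.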
This is what I was looking for. It seems weird to always have a jump box sitting up, waiting for someone to come mess with it. Part of the tooling should be to spin up / boot an ephemeral instance.
I'd rather patch a bunch of openssh servers than a bunch of proprietary software agents that essentially keep reverse ssh tunnels open for me. Or did I miss something?
Use of SSH agent forwarding is dangerous as it allows an attacker to gain access to more key materials to access more servers. Using it casually in an article about SSH security is a bit worrying.
Not with the confirm option of ssh-add. I've had agent forwarding on for every host (trusted and untrusted) for a decade now, without worry, because my ssh agent confirms with me each use of any ssh key.
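The confirm behaviour is the `-c` flag to ssh-add; any use of the key, including by a forwarded agent on a remote host, pops a prompt on the local machine:

```shell
# Load a key that requires per-use confirmation; ssh-askpass (or an
# equivalent graphical prompter) must be available to show the dialog.
ssh-add -c ~/.ssh/id_ed25519
```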
Interesting. However in practice, I don't ssh-add my keys, they get loaded on first use by the ssh client. Is there a way to make ssh load keys into the agent with that option set?
You need to fully trust your target server, so you need to manage your known_hosts diligently and make sure you trust the host you connect to. If you just accept the host key without checking, you allow any host to use your SSH key for authentication. Any SSH server can accept your private key as authentication. Also, if the target host is infiltrated, it can use your private SSH key for authentication elsewhere without your knowledge.
Explanation: if you use an SSH agent and have agent forwarding set up, you get a channel through SSH that lets you use your local SSH agent on the remote host.
Good side: you can then chain authentication and, say, use the same SSH agent to authorize sudo, getting sudo without a password, secured by your private key instead. Add a hardware token to store said key and you're pretty secure.
Bad side: .... so can any other process with right permissions on the system, therefore compromised system can try to impersonate you.
One way to mitigate this is to make sure servers can't talk to each other via SSH; if a user can access A and B, but A can't access B and vice versa, the escalation is limited.
The other way is to set the agent to ask every time something wants to use the key, which half-solves it (an attacker would need to time the attack to occur right before a "valid" use), but from what I remember it still doesn't show you what is trying to use your key (at least for gpg-agent's ssh-agent functionality), so it's not that useful a feature.
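The sudo-via-agent chaining described above is typically done with the third-party pam_ssh_agent_auth module; a rough sketch (the file paths are illustrative):

```shell
# /etc/pam.d/sudo -- try agent-based auth before falling back to password:
#   auth sufficient pam_ssh_agent_auth.so file=/etc/security/authorized_keys
#
# /etc/sudoers -- sudo must be allowed to see the (possibly forwarded)
# agent socket, which sudo's environment scrubbing strips by default:
#   Defaults env_keep += "SSH_AUTH_SOCK"
```

With this in place, sudo succeeds if the agent can sign a challenge with a key listed in the authorized_keys file above.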
The point of ssh-agent is to use the private key to authenticate yourself.
If you forward it to a remote host, you are granting access to the remote to do that (i.e., the remote can authenticate as you), thus, you must trust the remote.
From the docs,
> Agent forwarding should be enabled with caution. Users with the ability to bypass file permissions on the remote host (for the agent's UNIX-domain socket) can access the local agent through the forwarded connection. An attacker cannot obtain key material from the agent, however they can perform operations on the keys that enable them to authenticate using the identities loaded into the agent. A safer alternative may be to use a jump host (see -J).
You need -A if you're running ssh on the intermediate host, i.e., on the jump host in this case. But -J doesn't run ssh on the intermediate host; it more or less runs two ssh's on your local host: the first from local to the jump, the second from local to the eventual target, through a tunnel forwarding the connection through the jump[1]. But because all the SSH processing is always local, it always has access to the local ssh-agent: you don't need -A.
And, as someone points out upthread, you need to fully trust the remote machine to pass -A. You usually shouldn't, in most cases that I think people using jump hosts in corporate settings would be interacting with jump hosts: it permits other employees to impersonate you, by abusing your forwarded ssh-agent, if they have sufficient access on the jump host.
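Putting that into config form, a sketch with made-up hostnames; note there is no ForwardAgent anywhere:

```shell
# ~/.ssh/config: reach internal hosts through the jump box. All crypto
# terminates on the local machine, so the jump host never sees the agent.
Host internal-*
    ProxyJump jump.example.com

# Equivalent one-off invocation:
#   ssh -J jump.example.com internal-db01
```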