A few years ago, a submarine cable was knocked out between our startup's MVP servers in Singapore AWS and our networked factory stations in Asia.
In lieu of the submarine cable, traffic was being routed over something closer to a wet string, with such high latency and packet loss that a watchdog timer I'd implemented on the stations was timing out.
Fortunately, the remote access we'd built into our stations (SSH and OpenVPN) still worked, albeit at speeds like a 300 baud dialup, with crazy-high latency.
Having occasionally dealt with performance a bit like that as a kid, and knowing my way around Linux, it was like "I've been training my whole life for this moment."
So I just flexed the old command-line-and-editor-when-you-feel-every-byte-transmitted skill and got the stations working before the factory even knew about the submarine cable, saving our infinite-9s uptime.
It was nothing compared to what NASA does, but terminal animation effects would've ended both of our missions.
I'd take a solid 300 baud connection over a spotty cellular connection during most emergencies.
Having worked and played on flaky high-latency and low-throughput networks much of my life, I mostly visualize things in my head—as you likely mean by 'command-line-and-editor-when-you-feel-every-byte-transmitted'. Open a connection and queue your commands; wait for output. It works if you don't make any typos.
Preferably, when the connection is too slow or flaky (bad cellular), I write the script locally, echo it into a file on the remote server, and then execute it there with nohup, with the input that creates and runs the script coming from a pre-made file on my local machine redirected into SSH.
Bad cellular often works in bursts, in my experience. Also, redirect output to a file if needed.
This works nicely in situations where your connection may drop frequently for periods exceeding timeouts. On that note, keep SSH timeouts high. With really spotty cellular you might have the network drop for 5-15 minutes between reliable transmissions. SSH connections can stay alive nearly indefinitely if the timeouts allow for it and IP addresses don't change.
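A minimal sketch of what I mean, assuming bash on both ends; task.sh, the /tmp paths, and the keepalive numbers are just placeholders:

  # One round-trip: stream the pre-made local script into a remote file,
  # then run it detached under nohup so a dropped link doesn't kill it;
  # output goes to a file to be collected later.
  ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=30 user@host \
    'cat > /tmp/task.sh && nohup bash /tmp/task.sh > /tmp/task.log 2>&1 &' \
    < task.sh

  # Once the link comes back, fetch the results.
  scp user@host:/tmp/task.log .

The ServerAliveInterval/ServerAliveCountMax pair is what lets the client sit through a multi-minute dead spell (here up to about 30 minutes) without giving up.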
> local script [...] nohup [...] redirect output to a file
mosh hostname -- screen -S philsnow
I don't think I've had to use this over a truly terrible connection, but mosh worked a treat a ~decade ago while tunneling through DNS from a cruise ship that charged exorbitant rates for wifi while underway but allowed unlimited DNS traffic.
> It works if you don't make any typos
mosh helps a bit with that too: you get local predictive echo of your keystrokes and you can't "recall" keystrokes but you can queue up backspaces to cover up your typos. Doesn't help where a single typo-ed keystroke is a hotkey that does something you didn't want, though.
I did something similar once when something broke while I was on a plane. The plane wifi was blocking ssh (I HATE when they do this), so I opened up an existing DigitalOcean instance, sshed into my home computer from it via the web interface, then uploaded some large files from the home computer and did a kubectl apply. The latency was insane, but luckily the file upload was fast since my home connection is fast.
It was fun to think about the path each packet was taking: from my laptop to the plane router, to a satellite, back to a DigitalOcean machine in the UK, to my home computer in NJ, and back. It wasn't that bad when you think about the magic there.
Smart, and not something someone would realize if they'd been too insulated from how things work.
Regarding planes/cafes/guest/etc. WiFi, I now usually put any "emergency remote plumbing access" on port 443 (though usually not HTTPS), to reduce the likelihood of some random non-SPI ruleset blocking us in an emergency.
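A minimal sketch of one way to do that with stock OpenSSH (this isn't necessarily my exact setup; the hostname, alias, and user are made up):

  # /etc/ssh/sshd_config -- listen on the usual port plus 443
  Port 22
  Port 443

  # ~/.ssh/config on the client, so "ssh plumbing" picks 443 automatically
  Host plumbing
      HostName box.example.com
      Port 443
      User admin

sshd accepts multiple Port lines, so 22 keeps working for everything else.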
I think that by 2050, 443 will be the only port used by applications that require a non-local connection, because the chance that some participant in the network has blocked any other port is simply too high.
Nothing like typing at an ssh prompt where each time you press a key you have to wait a few seconds for the character to appear on the screen, because it has to go to the server and back first... that's the true "blind typing" :)))
That needs a binary to run on the remote host for protocol support, doesn't it? I wonder how long that'd take to get transferred and running, over as slow a link as it sounds like this was.
Since you can reach the host, presumably it has some kind of access to system updates / packages. In that case, ideally you paste a command like "sudo apt update && sudo apt install -y mosh\n" or something. You can even paste the newline at the same time and save a round-trip.
It isn't perfectly clear even in the detailed description whether any other link was available, but I tend to suspect that if anything faster had been available, it would have been used in preference to the one that actually was.
It's a cable going between entire countries that got cut. Anything inside the country should still be full speed.
There is no link that could be used in preference. Any repo that the factory stations could access would be unusable by Singapore AWS. And any repo that Singapore AWS could access would be unusable by the factory stations. But the two ends of the link don't have to use the same repo.
"Be specific" sounds like a high-stakes interview, so I'd better answer. :)
* SITUATION: Factory reports MVP factory stations for pilot customer "not turning on" for the day, reason unknown.
* TASK: Get stations up in time for production line, so startup doesn't go out of business.
* ACTION: Determined the cause was an unexplained networking problem outside our control, which was triggering some startup-time checks; but the stations were resilient enough that I could carefully edit the code on them to relax some timeouts, and enough packets were getting through that we might be able to operate anyway. After that worked for one station, carefully changed the remaining ones.
* RESULT: Factory stations worked for that production day, and our startup was therefore still in business. We were later advised of the submarine cable failure, and the factory acknowledged connectivity problems. Our internal post-mortem discussions included why the newer boot code hadn't been installed, and revisiting backup connectivity options. From there, thanks to various early-startup engineering magic and an exciting mix of good and bad luck, we eventually finished a year-long contract on a high-quality brand's factory production line successfully.
No technical wizardry in the immediate story, but a lot of smart things we'd done proactively (including triaging what we did and didn't spend precious overextended time on), plus some luck (and of course the fact that some Internet infrastructure heroes' work had them routing any packets at all)... all came together... and got us through a freak failure of a submarine cable we'd never heard of, which could've ended our startup right there.
(Details on Action, IIRC: Assumed command of the incident, and activated Astronaut Mode manner. Attempted remote access, and found the network very poor. Alerted the factory of a network problem in their connectivity to AWS Singapore, but they initially thought there was no problem. Could tell there were some very spotty connectivity problems (probably including using `mtr-tiny`), so focused station troubleshooting on that assumption. Realized how the station would behave in this situation, and that the boot-ish checks for various network connectivity were probably timing out. Or, less likely, there could be a bug in handling the exceptional situation. Investigating, very slowly due to the poor Internet, found that the stations didn't have the current version of the boot code, which would've reported diagnostics better on the station display, so factory personnel might see it and tell us. Using `vi`, made careful, minimal changes to the timeouts directly on one station, in the old version of the code there (in either Python or Bash; I forget). Restarted the station, and it worked. Carefully did the same to the other stations. Everything worked, and other parts of the station software had been done smartly enough that they could cope with the production day's demands, despite the poor connectivity and the need to make network requests for each production item that passed through the station before it could move on.)
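(To picture the kind of timeout edit, a made-up sketch in shell; this isn't the actual station code, and the check, URL, and numbers are invented:

  # Before: a boot-time connectivity check tuned for a healthy link
  curl --silent --max-time 5 --retry 1 https://api.example.com/healthz || exit 1

  # After: relaxed enough to ride out a lossy, high-latency link
  curl --silent --max-time 60 --retry 10 --retry-delay 15 \
       https://api.example.com/healthz || exit 1

The real edits were in that spirit: small, careful bumps to timeouts so a mostly-dead link still passed the checks.)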