I always felt nervous about resizing a filesystem that has a swap file on it. Never had a kernel panic but would swapoff the file just in case.
If I create a swap partition in LVM I can pin it and move it around to the best location WRT multiple hard drives and a possible mix of HDD and SSD. With a file it's a little more abstract and I could emulate it with some pointless abstraction, but why bother. A typical "this century" example would be running my boot and swap on local disk and storing all real-world data on the NAS over iSCSI. Obviously now that "Everything" is virtualized this kind of system administration is moot unless you're admin on the cluster itself LOL.
The problem with swap is that memory seems to have gotten cheaper than storage, and swap of any form takes storage space. It's not the capex that's the problem, it's the opex: now you've got an extra 32GB that "has to" be backed up and "has to" be virus scanned or whatever security theatre, and that swap file "has to" be treated as the highest-security PII/HIPAA/PCI category because who knows what's been swapped out of memory onto it; it could be chock full of CC numbers, and how do you know for sure unless it's empty or not there? Paying for more memory doesn't increase your backup storage / security risk / virus scan times, and the latter costs more than the former. I don't think non-cloudy admins realize there are cloudy admins out there with forests of little systems with like 8 GB of RAM and 16 GB of storage, so the "old" rule from the 1980s about provisioning twice your RAM as swap would increase the disk required from 16 GB to 32 GB, doubling my costs for, essentially, nothing. Another problem with swap is that in the old days, if you needed more memory than you had physically, you had to buy more memory; then for a while you could emulate memory very slowly with disk, much faster than calling an IBM CE and buying more memory; now if you need more memory you spin up more instances or rebuild the instance on a larger flavor about as fast as you'd have added swap in the olden days, probably faster, and it can be done dynamically in some architectures.
> The problem with swap is that memory seems to have gotten cheaper than storage, and swap of any form takes storage space. It's not the capex that's the problem, it's the opex: now you've got an extra 32GB that "has to" be backed up and "has to" be virus scanned or whatever security theatre, and that swap file "has to" be treated as the highest-security PII/HIPAA/PCI category because who knows what's been swapped out of memory onto it,
Why on earth would you back up swap?
Also, encrypted swap is pretty easy to set up; then again, most distros don't offer that option on install.
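For reference, the random-key flavour is just a couple of lines (device name is illustrative; exact crypttab option names can vary slightly between distros):

    # /etc/crypttab -- re-key the swap device with a fresh random key every boot
    cryptswap  /dev/sda3  /dev/urandom  swap,cipher=aes-xts-plain64,size=512

    # /etc/fstab -- swap on the mapped device
    /dev/mapper/cryptswap  none  swap  sw  0  0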
We have tiny swap, like 1GB, working basically as an "early warning" of "hey, this server could probably use a few more GB of RAM". There is very little use for any more swap than that.
>I don't think non-cloudy admins realize there are cloudy admins out there with forests of little systems with like 8 GB of RAM and 16 GB of storage, so the "old" rule from the 1980s about provisioning twice your RAM as swap would increase the disk required from 16 GB to 32 GB, doubling my costs for, essentially, nothing.
The "old rules" for swap were never relevant. But dumb myths die hard.
They did make sense in some contexts. Specifically, for laptops expected to hibernate you need to preserve all of RAM, plus extra space for whatever your swap was already holding. For small-RAM systems, 2x the memory size was a good rule of thumb.
It stopped making sense when we stopped using swap and filling memory in normal situations.
Traditional *nix and BSD systems use swap partitions for crash dumps, so giant swap partitions made sense before those systems supported minidumps. Admins who grew up with those systems are used to the "2x RAM" rule of thumb.
I just yesterday had to re-partition a FreeBSD kernel test box with 256GB of RAM and a 2GB swap partition because I desperately needed a crashdump and the mini dump would have been 7GB.
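For anyone hitting the same thing, the crashdump target is just a setting (partition name is illustrative):

    # /etc/rc.conf -- point crash dumps at the (now big enough) swap partition
    dumpdev="/dev/ada0p3"

    # or switch immediately without a reboot
    dumpon /dev/ada0p3
    # after the next panic, savecore(8) writes the dump to /var/crash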
The quotes around "has to", and especially the virus scan, makes me think there's a blanket corporate policy that applies to all files, regardless of file type.
I stopped using earlyoom because I got frustrated with it always seeming to kill exactly the wrong thing for a given moment. Now, though, OOM events seem to turn into situations that require me to force a reboot, because kswapd goes crazy, using all available CPU, and I can barely (if at all) register any keyboard/mouse input.
Seems like a bad reason to disable swap, but I can't find a way to do what I really want, which is simply to reserve some CPU for input. And then I suppose in order to do something useful with it, require that any process that wants it be allowed at least some small allocation. Maybe that's the hard part and why I haven't found a way? It doesn't happen often, it's just really annoying when it does.
In theory this should be possible with cgroups. The mechanism is there but I don’t know of a way to easily set up a policy that does what you want.
It should be possible to allocate say 95% of the system resources to the default cgroup and then you could create a secondary cgroup — recovery — which has access to the last 5% of the system resources which you could use to run commands such as “kill” or “top” to recover the system state.
Additionally you could run a second ssh server in the recovery cgroup which you could ssh into in the case of system lock up.
In reality it is probably easier to just reboot in most cases, or if you are dealing with servers use ipmi.
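If anyone wants to experiment, a rough, untested sketch with systemd (slice name and numbers are mine, not a vetted recipe):

    # /etc/systemd/system/recovery.slice
    [Slice]
    CPUWeight=1000
    MemoryMin=256M

    # start an emergency shell (or a second sshd) inside that slice
    systemd-run --slice=recovery.slice --pty /bin/bash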
I don't understand why this issue still persists on Linux. As far as I can tell, all earlyoom needs to do is kill the process that has eaten the most memory in the last minute or so. On Windows this issue is non-existent.
It's not that simple. Imagine something leaking memory running in parallel with something bursty. For example, your browser leaks, but you run a big grep|sort in the background. Or you have some GC runtime which allocates in batches and just decided it needs another chunk to manage.
On Windows you don't have OOM at all, because it trades that solution for just swapping forever until either you can't do anything or you manage to kill the right app yourself.
People have reported that their machines with small amounts of RAM are now fully usable where previously the system became completely unresponsive when swapping started.
I know; I did; I just didn't find that the answer to 'what would I ideally like/not like killed' was the same on a per-binary basis - it varied, and it'd always, by Sod's law, be wrong. e.g. if Firefox was set not to be killed, it would be a tab misbehaving; if Slack was allowed to be killed, it would be while I was mid-message, and so on.
If it had some concept of 'in-use', for which you could define rules like 'has an active window' or 'is playing media', that might work better for me.
Could the window manager be configured to communicate with this mechanism? The window manager knows what windows you're manipulating at the moment. (I imagine that terminal processes would be somewhat more complicated to handle.)
Windows has been fine with swap files since the age of spinning disks, so I guess using a swap file instead of a swap partition is fine.
Speaking of Windows: How is it that Linux seems to grind to a halt when it is forced to swap, while Windows works just fine in the same scenario? Sure, on Windows a lot of swap file "use" is just bookkeeping of empty pages because Windows doesn't overcommit memory. But even if it actually swaps it stays responsive, while I've had to reboot linux boxes more than once because they stayed unresponsive after memory pressure. Does Windows have a smarter swap algorithm? Is it more proactive about swapping stuff back in once memory pressure is gone? A better scheduler that gives more precedence to "interactive" things? And most importantly: can linux implement those things too?
Ah yes, swappiness. I think Windows is set to something like ... 65% or so. Or was it 85%. I forget, but I looked into this a long time ago when I was having to suffer through having about half as much RAM as was ideal for my use cases on both Windows and Linux. I remember coming across some method to change it on Windows as well, but I forget that method too. Or rather, I think it's some reg edit that needs doing, but I forget what/where.
Based on the explanation in the documentation <https://docs.kernel.org/admin-guide/sysctl/vm.html?highlight...>, I set 'swappiness' to 100. I figure that SSDs are fine with random IO, so swap isn't likely to be much -if at all- slower than pulling pages back in from disk.
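i.e. something like (the sysctl.d file name is arbitrary):

    # current value
    sysctl vm.swappiness

    # set for this boot
    sudo sysctl -w vm.swappiness=100

    # persist across reboots
    echo 'vm.swappiness = 100' | sudo tee /etc/sysctl.d/99-swappiness.conf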
Wrong direction. Bumping swappiness up makes the OS more likely to keep more pages in cache; setting it to a higher value makes the OS treat swap more like RAM, and that's usually slow and jittery.
Using more swap may make the system faster. It means there is more room for file-backed pages in RAM, to cache reads and writes. A swappiness value should be chosen based on the system's IO workload and swap device performance characteristics.
Windows swap never worked well. The OS was always famous for trashing constantly, and only started to lose that fame when computers started coming with way more RAM than any sane OS could use.
These links look like you just googled "windows thrashing" and copied the first 4 results. Did you do that?
Addressing your links in order:
1. I know that the word is thrashing, I was quoting the OP.
2. The existence of someone who thinks they experienced thrashing (but didn't do any investigation to see if they were actually experiencing thrashing), 13 years ago, doesn't seem very definitive to me.
3. Please read this page. You'll see that the first sentence is "hard drive is being continuously thrashed even when no applications are apparently running". How is the system thrashing when there are no applications running? FYI, "thrashing" does not just mean ANY frequent hard drive reads for any reason.
4. I don't know what this article has to do with the question I asked. The article title is "How do I tell if my Windows server is swapping?", which seems completely unrelated to my question about how often Windows swaps.
Eh, are you sure we're really comparing apples to apples here, at least when it comes to server-based applications? I've overloaded Windows multiple times, and yes, I could move the mouse around and maybe open Notepad if it was already cached in memory, but you could get stuck waiting for IOPS for insanely long periods trying to do anything else.
I've not done much in this realm since NVMe SSDs became common, so I'm not sure how this behaves these days.
The VirtualAlloc API in Windows provides explicit control over commitment. You can reserve address space using the MEM_RESERVE flag, and commit with MEM_COMMIT. These can be combined into one.
Therefore, we cannot say that "Windows doesn't overcommit". Windows applications can overcommit. You don't know which ones might do that, for what purpose.
We use servers with swap files. It works fine and does not grind to a halt. Right now one server has 10/15 GB RAM utilization and 3.7/3.7GB swap utilization. It's pretty responsive.
Virtual memory provides a larger memory space than the available RAM, but all of it can be backed by mass storage. In that case, you have storage for all of it; it's just that not all of it is RAM. When not all of it is backed by storage, the situation is overcommitted.
Once you swap a lot out and then get back some free memory, Linux will only swap it back in on demand, which means a lot of micro-lag moments once apps start accessing now-in-swap things.
Linux assumes it's better to have unused things in swap while RAM is used for IO caching; that's why you see this behaviour. There is (AFAIK) no mechanism to go "okay, we have a few gigs free, let's bring stuff back from swap".
The reboot-less method is to add a new LV for swap and swapoff the old one, but that's more of a PITA than a restart...
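For the record, the dance is roughly (VG/LV names are illustrative):

    # create and enable the new swap LV
    lvcreate -L 8G -n swap_new vg0
    mkswap /dev/vg0/swap_new
    swapon /dev/vg0/swap_new

    # drain and remove the old one (the swapoff is the slow part under memory pressure)
    swapoff /dev/vg0/swap_old
    lvremove /dev/vg0/swap_old

    # and update /etc/fstab to point at the new LV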
If you use swap because you don’t have enough RAM for things you need in RAM, you’re going to have a very bad time.
The only time swap isn’t just delaying (and growing) the inevitable shitshow is when, for some reason:
1) some program is consuming a significant amount of RAM, but doesn't need to run for a while, and it's somehow faster/easier to let it swap to disk than shutting it down and restarting it while you do something else that consumes RAM.
2) something you’re running has some amount of bloat or leak that you know will never get exercised, but also never will grow larger than your swap file and take your system out entirely.
Both of those situations are not only rare, but easy to misjudge and end up wedging your system.
Chrome does both of those. It takes gobs of memory that won't be needed for many hours, and it takes gobs of memory that will never be needed but are roughly stable and won't overwhelm a suitable swap file.
I found that gcc works mostly alright on swap for compiling C++. I assume that data structures for template instantiations are never really deallocated, and it often falls into category 2 of your description.
And then compilation eventually finishes and the allocated memory is released on process termination.
Disk cache is really flexible at the OS level. Look up 'vmtouch' on Linux and play with some recently used files; you'd be surprised how many are in memory. If you don't have swap and only have 'X' amount of RAM free, the program will still load and run fine; it'll just do so in a way that more directly thrashes the disk. If you add some swap, the OS will page out some other memory that's not being used, giving 'X+Y' RAM free for disk caching that the OS will use.
In fact this usage of free memory as disk cache is so ingrained that memory used this way doesn't even appear allocated. OSes are pretty much all designed to use whatever free RAM is available as disk cache. The disk cache shrinks if that RAM is actually needed.
Effectively a great way to think of swap is that your OS is always swapping, even without a swapfile! If you read a file there's swapping of that file into memory, it's even done using the standard memory paging that swap files do. Without a swap file you still have swapping but you've just removed the ability for the OS to write out memory that it thinks is less relevant than the files you're currently trying to read. Which is almost always a loss.
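For example, vmtouch makes this visible (paths are arbitrary):

    # show how many pages of a file are currently resident in the page cache
    vmtouch -v /var/log/syslog

    # pull a directory tree into cache, or evict it again
    vmtouch -t ~/project
    vmtouch -e ~/project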
That is temporary and evictable kernel memory usage. It also isn’t allocating more memory than it has while doing it, it evicts things when it runs out of unallocated memory.
Technically a program, but not what I think anyone typically considers one in this context?
What other ones are you aware of?
I ask because generally programs use RAM because it’s fast - otherwise they’d use something else. Having a program intentionally allocate more than is available would almost certainly result in serious performance issues it would have a hard time dealing with predictably - as compared to using disk for the ‘excess’, and then reading/loading what it needs.
Well, simply put, any allocation that's not really needed any time soon but is also not freed. Swap lets you make use of that.
In terms of what benefits from that, most modern programs are actually surprisingly flexible in memory allocation. Anything with a garbage collector behind the scenes will collect more often if needed. Having swap to push the non-needed programs out lets GC-based programs keep allocations longer, which is intended to enable reuse.
Most programs actually. Compilers, databases, even your browser! I mean, you can argue that they should manage paging out manually, but it's so much easier to use swap and an unmodified program.
None of those go out of their way to do so, and certainly not databases. Every database tuning documentation I've ever read says avoid swap at all costs.
Some just allocate when they need it until it doesn't work (cough Chrome cough), but that is a far, far throw from intentionally allocating more memory than the machine has, and none of them seem to do it to intentionally only work on a small subset at a time, even if de-facto that is what happens.
What sort of performance or reliability would you expect anyway from a database that intentionally loaded the entire database into memory first without even caring how much would fit?
Every database I’ve ever dealt with has fixed memory limits that get set.
Otherwise the database server will OOM even with swap if it doesn’t limit memory consumption and how much it loads, on any machine, with a given large enough database. And the size of the database is up to the user.
It’s a fundamental part of the problem.
Speed and reliability to some extent are literally core requirements of database servers, so any database software that doesn’t do it is going to have a bad time.
> The only concrete answer to "why have swap if I have enough RAM" is "well, if you run out of RAM..."
Linux with swap available can swap out never used pages (say a part of library that never gets called) and use freed memory to cache disk IO. We see that on every server that has enough data to fill the remaining RAM with page cache. But you only need like a gig of swap to take advantage of it.
Yep, the lesson a lot of people who claim 'no swap is better' need to learn is that your OS is always swapping, even without a swap file. The lack of a swapfile just removes some flexibility in how that's handled.
The OS will page in files that are being accessed regularly. Without a swap file you don't stop the OS swapping between memory and disk; that'd be pathological, since disk access is that much slower.
Instead, what not having a swapfile really means is that you are telling the OS "anything ever allocated on the system is more important than disk cache".
I read somewhere that Linux hibernation requires swap (1-1.5X RAM?). So I have been configuring my laptops with a 1.5X swap partition. Unfortunately Linux hibernation has not worked for me since the early 2000's, across 10-15 different laptops. Apparently hibernation can use swapfiles, but I've never been able to verify that.
From the kernel docs, `image-size` section, which relates to the hibernation image size.
> Reading from it returns the current image size limit, which is set to around 2/5 of the available RAM size by default.
I have successfully been hibernating a laptop after a few hours in suspend with swap allocated 50% the size of my RAM, so 8GB when I had a 16GB laptop, and 16GB now on a 32GB laptop. Never had any issues.
If it’s seemingly as easy as reading a wiki and configuring a few things I’m curious why it doesn’t just work by default. Does it just not get enough dev love?
My personal experience is that the situation has improved somewhat, but the hardware still plays a large part. A Kaby Lake laptop I had set up to hibernate to encrypted swap (not file but partition, iirc) worked well enough, besides a few times when some peripherals, e.g. wifi, failed to power on after resuming from hibernation. A Zen3+ laptop from the same manufacturer that I got last week, running Arch (with the latest kernel and packages), _always_ has problems resuming: wifi (and who knows what else) never gets turned back on, and the system sometimes enters a weird state showing just a blank screen, not accepting any keypresses, or otherwise gets stuck at some stage of the boot process with no output, so it's hard to debug. So far it's been a real crapshoot.
I use a swapfile, and hibernate frequently (after, iirc, 2h of suspend, so sometimes multiple times a day). It did take a bit of setup (by far the thing I've done that's most supportive of people arguing against Linux in favour of macOS/Windows), but it works well now that it is.
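For the curious, the swapfile route boils down to telling the kernel both the filesystem device and the file's offset on it; a rough sketch assuming an ext4 root (paths, UUID and the offset number are illustrative):

    # find the physical offset of the swap file's first extent
    filefrag -v /swapfile | head -n 4
    # use the first number in the physical_offset column, e.g. 123456

    # kernel command line (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub):
    #   resume=UUID=<uuid-of-filesystem-holding-swapfile> resume_offset=123456
    # then regenerate the grub config / initramfs and test with: systemctl hibernate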
Swap space is used for many other things than just running out of memory. Given the huge number of comments on this post that get it wrong so badly, I'm not going to respond to all of them, so please do your own research.
The short answer is that not having swap causes the system to run with one hand tied behind its back, since it’s forced to store unused/rarely used stuff in RAM, taking up space that could be used more productively (like disk cache).
Yeah, no swap probably makes sense for GCed languages. If you are going to touch all of the memory frequently, it doesn't make sense to try to swap it out. You are better off forcing Linux to swap out filesystem cache, which at least has a chance of not being used in the next minute.
The only real thing that you may be missing is SSH being swapped out and some other system processes but those may be 2MiB altogether.
Maybe an option would be disabling swap just for the JVM. IDK if that is something you can do with cgroups. But if 99% of your anonymous RAM is pinned anyways there is little to gain.
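For what it's worth, cgroup v2 does have a per-group swap cap, so something along these lines should work if the JVM runs as a systemd service (the unit name is made up):

    # drop-in for the service, e.g. via: systemctl edit my-jvm-app.service
    [Service]
    MemorySwapMax=0

    # or with raw cgroup v2, for an existing group:
    echo 0 > /sys/fs/cgroup/my-jvm-app/memory.swap.max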
In every commercial setting I’ve been involved with, something has gone very wrong if you’re swapping, since you spec the system for the use case, rather than a fraction of it. You’ll still have the logs to debug whatever went wrong. What problems do you see?
Yes, exactly. If your machine is starting to use swap, then it's already running away on memory that you hadn't planned for. The swap just delayed the inevitable OOM. You might as well let it OOM immediately.
I tried swap on zram briefly, but then switched back to a swap partition and enabling zswap. It's basically like zram, but when the reserved RAM section runs full it will hit the disk with the LRU pages. A tiered system. That means as long as you're only slightly over on RAM, it will swap into compressed RAM entirely.
So the only use case I see for swap in zram is if you never want to swap to disk, but still want to use swap for some reason.
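For anyone wanting to try the zswap route, it's mostly a couple of module parameters on top of an existing disk swap device (compressor and pool size are illustrative, and zstd needs a kernel built with it):

    # runtime, as root
    echo 1    > /sys/module/zswap/parameters/enabled
    echo zstd > /sys/module/zswap/parameters/compressor
    echo 20   > /sys/module/zswap/parameters/max_pool_percent

    # or persistently on the kernel command line:
    #   zswap.enabled=1 zswap.compressor=zstd zswap.max_pool_percent=20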
Swap on zram seems very interesting. My current home computer is a NUC with 16GiB, but I sometimes build from source and a tmpfs drive is helpful for that so I’m a little concerned I might not have quite enough RAM. Thoughts?
I’ll definitely double my RAM for my next system (in a year or two) and will definitely have to consider swap on zram then.
Not having quite enough RAM is the exact scenario when swap on ZRAM is most useful. Assuming the data you have in RAM is compressible it lets you use a few extra gigabytes before hitting disk swap or OOM (depending on if you have a disk swap device configured in addition to ZRAM).
Zswap is somewhat similar and might also work well for you, it uses disk swap but with a compressed cache in RAM.
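For reference, a minimal manual zram-swap setup looks something like this (size and compression algorithm are illustrative; zram-generator or zram-tools automate it):

    modprobe zram
    zramctl /dev/zram0 --algorithm zstd --size 8G
    mkswap /dev/zram0
    # higher priority than any disk swap so the compressed device fills first
    swapon -p 100 /dev/zram0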
> I sometimes build from source and a tmpfs drive is helpful for that so I’m a little concerned I might not have quite enough RAM.
I used to do this as well, but consider that the VFS cache basically does the same thing. The main difference is that cached data can be dropped or flushed if necessary to free memory, whereas this can't be done automatically with tmpfs.
Benchmark both cases and I'll bet you'd be surprised at how little difference there is.
The nice thing about swap here is that the kernel is under no obligation to sync the data to disk "soon". I have lots of tmp files that are written and read quickly and never hit the disk. With tmpfs+swap the kernel is free to swap out unused files if that memory is better used elsewhere, but otherwise won't bother touching the disk.
I do a lot of builds on my system and most are small enough to entirely live in RAM. But when you build Firefox it is best to start flushing some data to disk.
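e.g. a RAM-backed build directory (the size is only a cap, not a reservation, and its pages can be swapped out when idle; the mount point is arbitrary):

    mount -t tmpfs -o size=16G tmpfs /mnt/build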
That's another great point. With a large build in tmpfs, you can either run into the size limit of the tmpfs, breaking the build and wasting time, or run out of memory and end up with an unresponsive machine.
Entrusting this to the VFS cache has better predictability, and nearly the same performance in many cases.
No. ZRAM is a compressed ramdisk. Darwin “compressed memory” is an intermediate state for pages as they age out of the working set, but before they are written out to a swapfile.
Incidentally, Darwin creates, defragments and reclaims swapfiles dynamically. All this complaining about swap partition / file sizes is kind of … silly. None of that code is rocket science.
I wouldn’t do less than 64GB. I’m running 32 on my current NUC as my main machine with Windows 11 and WSL and a RAM drive. With current source sizes and Electron apps consuming RAM, just 32 GB is no longer a nice to have.
From the 1990s there was a perceptible difference in Linux or Windows if you could dedicate an entire secondary HDD to a swap partition.
The beginning of the HDD was where sectors could be accessed most rapidly, physically.
For SSD there should not be much difference between a separate partition or a file, or whether the reserved sectors the swap occupies are at the beginning of the drive. Should probably be aligned with the block size of the SSD though.
Still may be helpful to have a separate SSD for swap besides the drive the OS is on.
It would be worth trying, but I like a swap file in Linux so I can confine Linux to a single EXT4 partition right next to my NTFS Windows partition.
With Windows it does work well without a swap file, as long as you have plenty of memory left over after Windows and your desired apps are loaded, handling further data loads within reason.
The main problem is my little VM powered by CentOS or Amazon Linux 2 (which is based upon Fedora Linux) runs out of memory when I run "yum update". I have a 1-core CPU with 512MB RAM. The BaseOS repo from RHEL and co is just too big; it requires a minimum of 3 GB RAM. So the workaround is to add swap to the AWS EC2/Lightsail instance. I believe there is a dnf bug open in RHN, but no progress has been made so far: https://bugzilla.redhat.com/show_bug.cgi?id=2040170 Adding a 4/8GB swap file solved my problem with the `yum update`. That is just one example. There are many users with few resources.
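For reference, the usual incantation for adding such a swap file (size and path are illustrative; dd rather than fallocate avoids trouble on some filesystems):

    dd if=/dev/zero of=/swapfile bs=1M count=4096 status=progress
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

    # persist across reboots
    echo '/swapfile none swap sw 0 0' >> /etc/fstab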
It may require more than 512MiB RAM, but it certainly requires nowhere close to 3GiB. I had a 1GiB RAM CentOS 8 VM on Azure until recently and it had no problems updating.
There are VPSes being sold for peanuts. I have a 128MB RAM NAT VPS... bought it out of curiosity, and I concur: I am unable to update that thing because of the out-of-memory issue.
The “128MB NAT VPS” thing is OpenVZ or otherwise container-based.
$5 per YEAR for a container, 128MB RAM, 1GB persistent storage, SolusVM user panel, and 123.45.67.89:16300 - 16320 NATd to your enp0s3. swapon/swapoff don’t work. I tried installing QEMU; that didn’t work either because storage is I/O rate limited. IIRC, OpenVZ has some sort of unified cache control for RAM and storage I/O, and trying to circumvent the quota through “disk” doesn’t work because of it.
I’m not sure what the use case might be, legal or not, but there seems to be dozen or so operators taking our credit cards in the exact same manner. Maybe the operation itself is a money laundering, or maybe it’s to run elaborate offsite contraptions for illegal activities, or to run SEO fake sites? I don’t know.
Sure. I’m familiar with the NAT VPS providers. For the most part they are just providing an ultra cheap service, as there seems to be a market for that. Some people are OK with Ipv6 only or some ipv4 ports and they will just funnel the needful thru Cloudflare and they have an ultra ultra cheap website/service. Even cheaper than the cheapest shared hosting but they still have a box they can control to some extent.
Even with openvz/LXC though, they will generally give you a few choices of distro flavors. So in general if given a memory allowance 256 or under, definitely don’t choose anything Red Hat based or you’ll suffer from yum potentially not working.
Regarding swap partitions: on virtualised systems (like clouds) this is rather easy since you can just edit EBS/LUN/whatever on the provider side, grow, shrink, add completely new ones without doing any of the classical disk management tasks.
On cloud machines, LVM doesn't make a lot of sense, and neither does lots of partitions. Partitions are from a time where you have one disk and want to setup filesystem boundaries. But in a virtualised setup you can slice block storage any way you want, and present it to the VM as completely separate disks. It's a bit like having LVM "on the outside" of the VM. This way, the VM itself becomes simpler and has to care less about the specifics of the disk.
If you are on physical hardware, I'd argue that if you're not using LVM or ZFS, you're probably doing it wrong. Only really small-scale stuff (SD card, eMMC, USB SSD, Raspberry Pi type of stuff) works better without it. That said, even LVM wouldn't be too bad in such cases. (You'd have to get down to JFFS2 and the likes to really not use LVM.)
There should be a set of VFS hooks so that a filesystem can offer up pages of swap to the kernel without an explicit swap file. I remember tinkering with this around 2000, but I didn’t get very far. I would love to see this explored.
I'd love to see this. There appear to be some programs to monitor usage and dynamically add more swap but they all appear buggy and unmaintained.
Of course, if you need more than a bit of swap there is something suspicious about your system, but I have often found this extra RAM useful. For example, I often have many open files in my editor: the ones I am not currently editing are great candidates to swap out. Or /tmp in RAM: I don't need it persisted, but if the RAM could be better used for other things, feel free to swap the unused files out.
There is absolutely no reason to use those over Ubuntu, and many reasons not to use Arch in your critical server infrastructure, but Debian v. Ubuntu would be essentially six of one, half a dozen of the other.
Ubuntu’s primary benefit over other distributions is that the strict release cycle allows for efficient planning of upgrade cycles. Pretty much anything specific to the local workloads I wouldn’t use the distro packages for anyway, so the distro doesn’t matter so much.
Just about the only thing I use swap for on Linux is not bombing out during the fork/exec (even for something like a system call) from large processes. Back when I last looked, it appeared they need enough backable virtual address space, even when they don't use it. The filesystem swap seems good enough for that.
> Swap files appear to work fine, including on a mirrored root filesystem, and I've read that they're basically just as efficient as swap partitions these days
I've been saying this for a while, and I always get berated by angry Linux nerds who didn't do any research. The times they are a-changin'?
We were using swapfiles on Linux at Amazon in 2001 just fine. No idea if someone perf tested it, but we did have pretty decent perf testing (and we did adjust swappiness based on perf testing).
Many of the commenters here seem to think this is about moving away from swap altogether, but according to the article this is switching from a swap partition to a swap file, so still using swap.
Eh, just use LVM. I have no idea why it isn't standard on some distros. Then you can mix, match and resize at will, and as a bonus you can migrate the system live from one device to another without rebooting.
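The live-migration bit is essentially pvmove (device names are illustrative):

    # add the new disk to the volume group
    pvcreate /dev/nvme1n1
    vgextend vg0 /dev/nvme1n1

    # move all extents off the old disk while the system keeps running
    pvmove /dev/sda2

    # retire the old disk
    vgreduce vg0 /dev/sda2
    pvremove /dev/sda2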
> (If you're willing to use swap files anyway you can always add more swap space with a swap file even if you started with a swap partition. However, if your swap partition is too big, shrinking it or limiting how much of it gets used is more annoying.)
You can't shrink a swap file without swapoff & swapon anyway, so it's an illusion of an improvement.
It seems reasonable that a university would take a moment to share their methods and findings so that others who might consider doing the same thing can benefit from the experience of others.
Because recent distributions have enabled it by default. I hadn’t used swap for a decade or so since acquiring 16GB of RAM, but one day noticed a new swapfile in my rootfs.
This article is always thrown around when the topic comes up; at least in my experience it's wrong, or at least not useful for my use cases (general-purpose web hosting stuff in all kinds of flavours):
- swap on spinning rust outright kills the system if memory is low. It's a full-blown crash.
- swap on SSD acts much the same
- zram swap stalls and slows down the system for several minutes
I just raise vm.min_free_kbytes to make the OOM killer kick in faster and disable swap. At least the machines don't crash hard this way.
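i.e. something along these lines (the value is illustrative and depends on RAM size):

    # keep a larger reserve free so reclaim and the OOM killer kick in earlier
    sysctl -w vm.min_free_kbytes=262144
    echo 'vm.min_free_kbytes = 262144' > /etc/sysctl.d/99-minfree.conf

    # and no swap at all
    swapoff -a    # plus drop any swap entry from /etc/fstab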
Which is exactly the situation you never, ever want a server in. So it’s actually worse than a “crash”
Me and my team removed swap on tens of thousands of machines cause we were sick of dealing with this. We wanted the machines to fail hard and fast and not go into a state where it’s doing only a minuscule amount of actual work while trying to recover.
And I think that, sadly, it's not even a full solution, because Linux can manage to get into thrashing even without swap. It pages in and out things like memory-mapped files or the contents of executables of stopped processes. See for instance:
https://serverfault.com/questions/898388/how-to-prevent-kern...
It really doesn't. When the article's author writes about swap helping "under moderate/high memory contention", most people have already power-cycled their device at that point (or monitoring has pulled the struggling node from production). It's utterly useless, due to the huge difference in latency between RAM and storage.
On most production systems, memory allocation is roughly 1% system and 99% application (not counting temporary, evictable allocations like disk cache). Modern applications are not designed to have their pages swapped out to disk and it is not particularly helpful to swap the teeny tiny bits of OS.
There are parts of this article which are true, but the overall picture it paints is not accurate and the conclusion that swap is helpful is not correct for high performance systems - both workstations and servers.
The basic argument is that swap can be preferable to hard eviction for managing disk cache, which can be true, but disk-heavy workloads are less common than ever, especially in general-use clusters.
(The other argument is the usual mix of "but it can swap out unused areas" in which case you should fix the program to not allocate a ton of memory it doesn't need; and "but the OOM killer will come" and yes that's the entire point please move my process elsewhere).
> "but it can swap out unused areas" in which case you should fix the program to not allocate a ton of memory it doesn't need
Isn't this basically asking every application to independently reimplement their own swap-style functionality? Surely it makes more sense to have swap as a system level feature that every application can take advantage of?
That way the system has more information available to make swapping decisions since it can take into account the memory usage of all applications and drivers/etc together, and applications can take advantage of the most advanced swapping algorithms in the latest versions of the system without having to be individually updated.
If you are suggesting that your applications are allocating a bunch of memory which they literally never use in any way, then I don't think that is really a common issue in practice.
In practice I expect the typical scenario you're thinking of is applications are allocating a bunch of memory which they use only briefly or rarely and could easily fetch or recompute that data again later if needed. That's exactly the scenario which swap optimizes for. So why should every application individually implement logic to optimize for it? Fixing these kind of issues is not as simple as "just don't allocate the memory".
> If you are suggesting that your applications are allocating a bunch of memory which they literally never use in any way, then I don't think that is really a common issue in practice.
That is what I'm suggesting, and I suggest you look at how much space is wasted by startup-initialized data in libraries for features you'll never use, JIT representations that never get compiled for more than one shape, class metadata that never gets touched and vtables which don't get used, how easy it is in any GC language to accidentally keep a large buffer alive until the end of a lexical scope rather than its last real use, etc. etc.
All these problems you're describing are way more complicated than "just don't allocate the memory" and would involve sophisticated logic to optimize (like predicting exactly what features are or are not going to be used by an application at start time without significantly increasing the startup latency, etc). Why not just let swap be that optimization?
So why not address those edge cases with something like "madvise" style configuration rather than just saying application developers need to reinvent the wheel any time they do something outside the golden path?
Because that would require the programmer do something explicit to deal with a runtime edgecase they fundamentally don’t have the context to decide the right course of action on?
Especially if it’s a library or the like?
That infrequent call might be key for program stability in one context or pointless (garbage collector whatever- key in a major server program, pointless in a toy app).
The moral of the story is: if, under load, code or data gets moved from where access is fast (RAM) to where it is slow (even SSD and most NVMe count here) and then gets accessed while still under load, it makes the problem much worse.
I don't understand the argument. Performance-wise, swap will obviously cost you, but runtime-wise isn't it going to actually enable your application to continue running rather than crashing, albeit at much lower speeds?
Once you exhaust the RAM and you run with no swap configured, what happens with your application?
I'd argue that the fact such a simple and elegant solution as virtual memory is able to effectively cover such a wide range of use cases while eliminating the need for every application developer to duplicate their efforts is the exact opposite of "mediocre" and in fact it's the exact kind of efficiency-minded thinking that the software industry needs more of.
Actually no - this is the situation that swap kills your system.
That rarely used code path will now take 50 seconds because it got called when it was swapped out (and the system is under memory pressure), instead of either OOM’ing a while ago or completing in a couple of microseconds like it usually does.
And ‘infrequently called’ here could be every couple minutes.
If the system is under such strong memory pressure despite swap then you already screwed up. And trying to run that much without swap would be even worse.
It almost always happens at some point. Someone fat fingers a memory allocation. Memory leak from a long running process finally grows too big, someone installs an update and it doubles memory usage on something, etc.
I had it happen recently that the backup/sync software for a NAS had a constant factor memory consumption based on the number of files it was syncing. Transferred in a bunch of data, and blam. Commercial NAS, and they used swap (shitty, never buy QNAP). Wedged so hard it took a hard power cycle to even get console, AND caused data corruption in the ZFS pools.
It’s what happens next that decides the stability of the system.
If it can grow into swap and keep going, things start crawling, load builds up, buffers expand, and the system eventually grinds to a halt. On Linux, often in a really wedged and irritating/impossible to fix way.
Or, if no swap, OOM killer shoots something in the head (hopefully the offender), and we’re back to normal (minus the thing it killed, which is usually the thing you didn’t want it to - but at least the system works and you know something is wrong).
In either case, caches dropped to near zero a while ago, so performance is already getting bad. The question is whether we enter a death spiral, or something gets killed early enough that the whole system doesn’t spiral.
> In either case, caches have dropped to near zero awhile ago so performance is already getting bad.
Right. You can get pretty deep into a death spiral even without swap. I feel like the better-performing solution is to make OOM trigger earlier, and to then go ahead and have swap be on.
Somewhat tangentially, I'd really like a setting for minimum disk cache, and that would do so much to help prevent thrashing.
Only in a very, very tiny window. I’ve never seen a system sustain it for more than a trivial amount of time before it OOM kills something and then voila, enough memory.
I’m sure someone here will chime in with their example though.
I mean, it happened to me twice in the last couple months on a laptop where I did a lazy install and didn't set up swap. Firefox and 2-3 smaller programs ate up so much memory that it stopped responding for multiple minutes and then went completely dead.
I rarely need or set up much swap on a server, but I've managed to have the same kind of problem with the OOM killer not kicking in very quickly on a server. And on my desktops swap is able to soak up many gigabytes and improve my performance a lot with thrashing basically never happening.
Swap is a performance optimization, so why would you remove it if you care about performance? Contrary to popular belief, swap isn't a way to get extra memory for free: it won't help you if your workload size is bigger than your physical memory capacity and that is not what it is meant to be used for.
I don't find the term "performance" useful outside of a specific context. The main reason we don't use swap is predictability.
We know how much memory a particular VM should be using. If it exceeds that, failing quickly rather than changing behavior (slowing down) is far preferable, and then you correct whatever the problem is.
All the testing we've done with swap has shown it to have a negative impact in server environments. I see it as useful for client machines, and maybe as a bandaid to get by with underpowered systems if you have to for some reason. But that shouldn't happen in prod, especially if you're a public cloud user, as most folks here seem to be.
I think that is where the confusion comes in. Almost everyone agrees that if your workload is slowing down due to paging that is bad.
If your working set is bigger than your RAM you have a problem, swap or not.
However swap helps optimize your RAM usage so that your working set can fit with less waste. It allows the kernel more flexibility with what pages to evict from RAM. Without swap it can only evict pages backed by files. If you start evicting files that are in your working set you are just as screwed as if it starts evicting anonymous pages in your working set. With swap it can evict other unused pages before touching the pages in your working set.
I think there is some truth that it can be harder to notice with swap because there is more buffer between running great and literally crashing. However, in either situation you should be monitoring IO wait and application performance to ensure that you have enough memory for your working set.