Tutorial – Write a System Call

warriorkitty · on Nov 20, 2016

I decided to read ep1[0] too and saw a picture that says "use all the memory". I don't know which is funnier: that I checked whether you had set an "alt" attribute, or that you actually wrote out the text from the picture. People who write alt text are MVPs. :)

[0] - https://brennan.io/2016/10/13/kernel-dev-ep1/

phillc73 · on Nov 20, 2016

Do you remember when mousing over a picture would show the alt tag in something resembling a tooltip? (Maybe it can still be set like this, but isn't default in Chrome or Firefox anymore.) I recall being quite amused sometimes at what people would write as their alt text.

prashnts · on Nov 20, 2016

I think the `title` attribute is used to show the tooltips.

> I recall being quite amused sometimes at what people would write as their alt text.

XKCD :)

brenns10 · on Nov 20, 2016

The alt text came from the link title in the markdown :)

thirdreplicator · on Nov 21, 2016

Thoroughly enjoyed the tutorial, but why would one want to make a custom system call? What superpowers does this give you? Thanks in advance for your answers.

geofft · on Nov 21, 2016

It's your best interface with the kernel. It's simple and high-performance. It's specifically what you want if you want to pass structured data in-memory to the kernel.

In a strict technical sense, there's nothing you need a syscall for; you can just read/write data (or maybe do an ioctl) on a new device node or something. In fact, OpenAFS supports routing its "syscall" on Linux through ioctls on /proc/fs/openafs/syscall, because Linux deliberately makes it annoying to patch the syscall table from a kernel module, so as to make life harder for rootkits.

However, it's simpler to pass data structures if you can use a syscall. It's much higher-performance than opening a file node. And if you expect to run in an environment where you don't know if a particular file will exist (e.g., a chroot), it's useful to use a syscall directly, because that's always available. For instance, getrandom was added in July 2014 partly for this reason, and partly so that if you ran out of file descriptors to open /dev/urandom you could still get randomness.

Here are all the syscalls added in the last two years:

* pkey_mprotect, pkey_alloc, pkey_free: support for a new Intel processor feature, Memory Protection Keys https://lwn.net/Articles/643797/

* preadv2, pwritev2: add a flags argument so you can do a non-blocking preadv or pwritev without opening the file in non-blocking mode https://lwn.net/Articles/670231/

* copy_file_range: copy data between two file descriptors, using filesystem support for efficient copies if possible https://lwn.net/Articles/659523/

* mlock2: add a flags argument so you can mlock memory when it's next accessed https://lwn.net/Articles/650538/

* membarrier: force a memory barrier on all running threads to help with userspace RCU, garbage collection, etc. http://man7.org/linux/man-pages/man2/membarrier.2.html

* userfaultfd: implement userspace paging https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt

* execveat: a version of execve that takes a file descriptor (or a fd and relative path) instead of a string to execute http://man7.org/linux/man-pages/man2/execveat.2.html

eximius · on Nov 21, 2016

Hmm... This is certainly very interesting. Can anyone think of any neat kernel-only things that one might implement for kicks as a learning project? Particularly for someone who hasn't done kernel programming? It could definitely be a silly thing, but probably more useful than printing to the kernel log.

jevinskie · on Nov 21, 2016

Providing guaranteed access to random numbers has been a recent example of a new, badly needed, but fairly simple syscall. With getrandom(), you avoid the complexities of open/read/close and its associated error handling.

https://lwn.net/Articles/606141/
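
To make the "no open/read/close" point concrete, here is a minimal sketch. It assumes Linux 3.17+ and a libc that ships `<sys/random.h>` (glibc 2.25 or later); `fill_random` is a hypothetical helper name, not something from the tutorial or the article.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>  /* getrandom(); assumes glibc 2.25+ */
#include <sys/types.h>   /* ssize_t */

/* Hypothetical helper: fill buf with len random bytes.
 * Returns 0 on success, -1 on error (errno is set by getrandom). */
int fill_random(unsigned char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        /* flags = 0: read from the urandom pool, blocking only until
         * the pool has been initialized once after boot. */
        ssize_t n = getrandom(buf + done, len - done, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)n;
    }
    return 0;
}
```

No file descriptor is involved, so this keeps working inside a chroot without /dev/urandom and when the process has hit its descriptor limit.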

brenns10 · on Nov 21, 2016

Character devices are fertile ground for cool projects, in my opinion. They're not very hard to make (typically just a kernel module, no booting a custom kernel), most Unix tools interact with them naturally because they're just files, and they can do many interesting things within the kernel. One of my recent projects (after system calls) was to create a "chat server" in the kernel with a character device. Good references are Robert Love's Linux Kernel Development, 3rd edition, and The Linux Kernel Module Programming Guide.

voltagex_ · on Nov 20, 2016

Great tutorial. Just a tip - if you change www.kernel.org to cdn.kernel.org you'll get a closer mirror site.

brenns10 · on Nov 20, 2016

Oh very nice, thanks for the tip!

dezgeg · on Nov 21, 2016

One correction to the strncpy_from_user part, specifically this:

> The process could try to read another process’s memory by giving a pointer that maps into another process’s address space.

This cannot happen; there is no such thing as "a pointer that maps into another process's address space". A virtual address in Linux (on x86 and probably almost all arches) accesses either the process's own memory map (where access to unmapped addresses causes a fault even when done from ring 0) or the kernel virtual mapping.
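
The fault behaviour can be observed from userspace. The sketch below is an illustration under assumptions, not the tutorial's code: `errno_for_unmapped_buffer` is a hypothetical helper that hands write(2) a buffer address that has just been unmapped, so the kernel's copy from user memory faults and the call fails with EFAULT instead of reading anything.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: pass an unmapped user address to write(2).
 * Returns the resulting errno, 0 if the write unexpectedly succeeded,
 * or -1 if the setup (pipe/mmap) failed. */
int errno_for_unmapped_buffer(void)
{
    int fds[2];
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || pipe(fds) != 0)
        return -1;

    /* Map one page, then unmap it, so p is known to be unmapped. */
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    munmap(p, (size_t)page);

    /* The kernel has to copy 16 bytes from p; that copy faults. */
    errno = 0;
    ssize_t n = write(fds[1], p, 16);
    int err = (n < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A pipe is used rather than /dev/null because writes to /dev/null never touch the buffer and so would not fault.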

rogerb · on Nov 20, 2016

Really cool tutorial, thanks for writing this up!

xenadu02 · on Nov 20, 2016

I thought Linux uses sysenter/sysexit, not int 0x80/iret?

Still a good tutorial; there is no magic, it's all just software.

conductor · on Nov 20, 2016

You are right, Linux uses INT 0x80 on x86 only when the SYSCALL/SYSENTER and SYSRET/SYSEXIT instructions are not available.
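
For reference, a minimal sketch of issuing a system call with the SYSCALL instruction directly on x86-64. `raw_getpid` is a hypothetical name; the inline assembly assumes GCC or Clang, and other architectures fall back to libc's generic `syscall()` wrapper.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* for syscall() under strict -std= modes */
#endif
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: getpid without the libc getpid() wrapper. */
long raw_getpid(void)
{
#if defined(__x86_64__)
    long ret;
    /* x86-64 Linux convention: syscall number in rax, result in rax.
     * The SYSCALL instruction itself clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Other architectures: let libc pick the right instruction. */
    return syscall(SYS_getpid);
#endif
}
```

The same pattern, with the custom syscall's number in place of `SYS_getpid` and its arguments in rdi, rsi, and so on, is one way to call a newly added syscall that libc has no wrapper for.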

brenns10 · on Nov 20, 2016

Yes, I have a few inaccuracies I'm correcting right now. That is one of them.