Self-contained Linux applications with lone Lisp

JonChesterfield · on Jan 23, 2024

OK, yes, that does work. You can totally put application specific stuff in the elf and reflect on it at runtime. Much easier is to point a symbol value at the start of the data in the elf and refer to that symbol, then you don't have to crawl the headers yourself. The linker does it for you.

See https://github.com/graphitemaster/incbin for a pretty wrapper around that.
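A minimal sketch of the "point a symbol at the data" approach, assuming GCC or Clang targeting ELF with GNU assembler syntax. The names `embedded_start`, `embedded_end` and `embedded_size` are made up for illustration, and the `.ascii` line stands in for a real `.incbin "path"` so that the sketch builds without an external file.

```c
#include <stddef.h>

/* File-scope assembly: place the bytes in .rodata and bracket them with
   two global labels. In real use the .ascii line would be replaced by
   .incbin "some/file" (hypothetical path). */
__asm__(
    ".pushsection .rodata\n"
    ".global embedded_start\n"
    "embedded_start:\n"
    ".ascii \"(print 42)\"\n"
    ".global embedded_end\n"
    "embedded_end:\n"
    ".popsection\n");

/* The labels are ordinary symbols, so C can refer to them directly;
   the linker resolves the addresses, no header crawling needed. */
extern const char embedded_start[];
extern const char embedded_end[];

/* Number of embedded bytes (hypothetical helper). */
size_t embedded_size(void)
{
    return (size_t)(embedded_end - embedded_start);
}
```

Other object formats need small changes (Mach-O prefixes C symbols with an underscore, for example), which is the kind of detail the incbin wrapper takes care of.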

Aside from there being a drastically simpler answer to this problem available in sketchily documented fashion (the asm guys know .incbin is a thing, but higher level languages overcomplicate it), it's a good post. Idea is sound and goal was achieved.

ELF linkers will also create `__start_<name>` and `__stop_<name>` symbols for sections whose names are valid C identifiers, if you want to do that instead of inline asm. Without linker scripts I mean, they just do that out of the box.
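A short sketch of the linker-generated section symbols, assuming GCC or Clang with an ELF linker such as GNU ld or lld. The section name `payload` and the helper names are hypothetical; the `__start_`/`__stop_` prefixes are what the linker itself defines for sections named like C identifiers.

```c
#include <stddef.h>

/* Put some data in a section whose name is a valid C identifier.
   "payload" is an arbitrary name chosen for this sketch; `used` keeps
   the compiler from discarding the otherwise unreferenced variable. */
__attribute__((section("payload"), used))
static const char greeting[] = "(print \"hello\")";

/* These two symbols are defined by the linker, not by any source file. */
extern const char __start_payload[];
extern const char __stop_payload[];

/* Start of the section contents. */
const char *payload_data(void)
{
    return __start_payload;
}

/* Total size of everything placed in the section. */
size_t payload_size(void)
{
    return (size_t)(__stop_payload - __start_payload);
}
```

If the build uses `--gc-sections`, the section may need extra care to survive garbage collection; that depends on the linker and its version.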

zokier · on Jan 23, 2024

Rust has `#[link_section]`, `#[export_name]`, and `include_bytes!()`, which together allow you to put data from a file under a specific symbol in a specific section. Maybe not applicable to the problem in TFA, but it seems to accomplish all that incbin does too.

trealira · on Jan 23, 2024

Also, C23 has #embed included as a preprocessor directive. It serves a similar purpose to include_bytes!(), although I don't think there's a way to specify the section it goes in within the executable.

https://thephd.dev/finally-embed-in-c23

saagarjha · on Jan 26, 2024

I haven't used #embed yet personally, but if it gives you an array or similar back then you should be able to apply attributes to the variable to tell the compiler where to store it?

trealira · on Jan 26, 2024

I know GCC has section attributes, but I don't think there's any standard C23 attribute equivalent.

electroly · on Jan 23, 2024

Good persistence in finding a good solution, but I wish the author found a solution that didn't involve mold. I'm in the same situation; I have a programming language that builds self-contained executables by bundling bytecode with a prebuilt interpreter. But I'm not nearly as smart as the author, so I link a fixed multi-megabyte chunk of sentinel bytes into the interpreter (using xxd to produce a literal array in a .c file) and at embed time I search for the bytes and overwrite them in-place. This works for executables on any platform and doesn't require a special linker, but the hardcoded limit (and wasted space when you come under the limit) is undesirable.
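The sentinel technique described above could look roughly like this at embed time. This is a sketch under assumptions: the executable image is already loaded into memory, the reserved slot begins with the sentinel, and `find_bytes` and `embed_in_slot` are hypothetical names. A real tool would probably also record the payload length inside the slot.

```c
#include <stddef.h>
#include <string.h>

/* Offset of the first occurrence of needle in hay, or -1 if absent. */
long find_bytes(const unsigned char *hay, size_t hay_len,
                const unsigned char *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len)
        return -1;
    for (size_t i = 0; i + needle_len <= hay_len; i++)
        if (memcmp(hay + i, needle, needle_len) == 0)
            return (long)i;
    return -1;
}

/* Overwrite the reserved slot (capacity bytes, starting at the sentinel)
   with the payload, zero-filling the remainder.
   Returns 0 on success, -1 if the sentinel is missing or the payload
   does not fit, which is the hardcoded limit mentioned above. */
int embed_in_slot(unsigned char *image, size_t image_len,
                  const unsigned char *sentinel, size_t sentinel_len,
                  size_t capacity,
                  const unsigned char *payload, size_t payload_len)
{
    long off = find_bytes(image, image_len, sentinel, sentinel_len);
    if (off < 0)
        return -1;
    if (payload_len > capacity || (size_t)off + capacity > image_len)
        return -1;
    memset(image + off, 0, capacity);
    memcpy(image + off, payload, payload_len);
    return 0;
}
```

Because this only edits bytes in place, it does not care whether the image is ELF, PE or Mach-O, which matches the portability claim.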

regularfry · on Jan 23, 2024

If you're happy with a compile and link step, you can embed arbitrary data at link time with an ordinary linker by making yourself a linkable object with objcopy(1). I've played with it in the past as a way to embed an sqlite database into a ruby interpreter, which lets you do funky things like reimplementing `require` to read from the embedded database.

electroly · on Jan 23, 2024

It's a possibility. It's not unreasonable for me to ship builds of objcopy and lld that I can run at embed time. An immediate difficulty is that I support Windows and macOS, so I need a solution for PE and Mach-O executables too. I think a solution probably exists but I may need a separate solution for each platform. Embedding resources into binaries is pretty easy, it's just a matter of how to do it without shipping an entire C toolchain to users of my language.

regularfry · on Jan 23, 2024

I suspect the easier thing to do today is actually just embed tcc, and use its linker. Generating a C source file that embeds whatever binary you want in a string literal is straightforward templating. I couldn't really do that at the time.
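The templating step could be as small as the sketch below, which writes a C array definition (similar in spirit to `xxd -i` output) for a blob. `emit_c_array` is a hypothetical helper, and nothing here is specific to tcc: the generated text is plain C that any compiler should accept.

```c
#include <stddef.h>
#include <stdio.h>

/* Write `const unsigned char <name>[] = {...};` plus a length constant
   to `out`. Returns 0 on success, -1 on I/O error or empty input
   (an empty initializer list is not valid before C23). */
int emit_c_array(FILE *out, const char *name,
                 const unsigned char *data, size_t len)
{
    if (len == 0)
        return -1;
    if (fprintf(out, "const unsigned char %s[] = {", name) < 0)
        return -1;
    for (size_t i = 0; i < len; i++) {
        /* Start a new indented line every 12 bytes. */
        const char *sep = (i % 12 == 0) ? "\n    " : " ";
        if (fprintf(out, "%s0x%02x,", sep, data[i]) < 0)
            return -1;
    }
    if (fprintf(out, "\n};\nconst unsigned long %s_len = %luUL;\n",
                name, (unsigned long)len) < 0)
        return -1;
    return 0;
}
```

The generated file is then compiled and linked alongside the prebuilt interpreter object.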

electroly · on Jan 23, 2024

That's a great idea. I'll have to check and see if some fork of tcc today supports all my target platforms, but I bet it does.

saagarjha · on Jan 26, 2024

The linker that ships on macOS supports embedding data from a file with -sectcreate.

zokier · on Jan 23, 2024

Would using libbfd or LLVM's objcopy library be of use here? https://llvm.org/doxygen/namespacellvm_1_1objcopy.html

pitherpather · on Jan 23, 2024

I don't know how clean or simple or portable you wish your build environment to be, but would it be worth embedding at compile-time?

Thinking of the general need in these situations to produce a custom-named interpreter/executable, could it be worth accessing the program name itself to find a paired source file? E.g., invoking ./foo2.o would make it look for ./foo2.code: a two-file distribution that still lets one double-click on the executable?

Could there be a non-unicode flag at the end of a special elf file which allows arbitrary unicode data to be concatenated after that? I.e., an agreed loader-ignore-hereafter convention or similar? (Asking with no knowledge of ELF internals, besides hints given in the OP.)
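The paired-file idea can be sketched as a small path transformation on the program name. `paired_path` is a hypothetical helper; it assumes the name it is given really is a usable path to the executable, which is not guaranteed when the program is found via `PATH` or started with an arbitrary `argv[0]`.

```c
#include <stdio.h>
#include <string.h>

/* Replace the last extension of the program path with `ext`
   (or append `ext` if there is none), e.g. "./foo2.o" -> "./foo2.code".
   Returns 0 on success, -1 if `out` is too small. */
int paired_path(const char *argv0, const char *ext,
                char *out, size_t out_len)
{
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0;
    const char *dot = strrchr(base, '.');
    /* Ignore a leading dot (hidden file) and names with no extension. */
    size_t stem_len = (dot && dot != base) ? (size_t)(dot - argv0)
                                           : strlen(argv0);
    int n = snprintf(out, out_len, "%.*s%s", (int)stem_len, argv0, ext);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```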

electroly · on Jan 23, 2024

Using two files would definitely work and, honestly, be a lot simpler. But it's a neat trick to make it a single file. For my toy language, it mostly serves to hide the fact that I'm not really compiling to native. People won't ask questions if it's a single executable that file(1) says is a statically linked binary.

Appending to the end of the ELF file does work. It won't mess anything up because your bytes will be outside of any ELF section. You can insert a known sentinel string and then search for it at runtime. The main problem is that you have to open your own executable file up for reading so you can locate the data at the end, and on Linux that requires having /proc, AFAIK. The nice thing about these other techniques is we're not assuming anything about the filesystem we're in. In a chroot environment you might not have /proc.
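A sketch of the append-and-search approach, with hypothetical helper names. It looks for the last occurrence of the marker on the assumption that the marker string also appears earlier in the file, as a literal inside the program's own read-only data. A robust version would need a way to tell "nothing appended" apart from a real payload.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Offset just past the LAST occurrence of marker in image, or -1.
   Appended data is expected to follow that final marker. */
long payload_offset(const unsigned char *image, size_t len,
                    const char *marker)
{
    size_t mlen = strlen(marker);
    if (mlen == 0 || mlen > len)
        return -1;
    for (size_t i = len - mlen + 1; i-- > 0; )
        if (memcmp(image + i, marker, mlen) == 0)
            return (long)(i + mlen);
    return -1;
}

/* Read a whole file into a malloc'd buffer; returns NULL on failure.
   The caller frees the result. */
unsigned char *slurp(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf = NULL;
    size_t used = 0, cap = 0, n;
    if (!f)
        return NULL;
    do {
        if (used == cap) {
            size_t ncap = cap ? cap * 2 : 65536;
            unsigned char *nbuf = realloc(buf, ncap);
            if (!nbuf) { free(buf); fclose(f); return NULL; }
            buf = nbuf;
            cap = ncap;
        }
        n = fread(buf + used, 1, cap - used, f);
        used += n;
    } while (n > 0);
    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    *len = used;
    return buf;
}
```

At runtime the interpreter would call something like `slurp("/proc/self/exe", &len)` and then `payload_offset(...)`, which is exactly where the dependency on /proc comes from.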

pitherpather · on Jan 23, 2024

Given your pursuit of elegance, I imagine you could ultimately have a --clone or --cloner command-line switch which would allow any executable instance based upon your interpreter to create a new executable instance, but encapsulating newly-supplied source code. In this sense your interpreter could go viral. (In tcl/tk context, freeWrap might be an example for study.)

Relatedly, I don't know whether, given argv[0] and your targets, one can at least copy the named file, even if one cannot open it directly for reading.

eptcyka · on Jan 23, 2024

The first arg to your program is a path to the binary that's being executed. No /proc required.

electroly · on Jan 23, 2024

It's usually the path to the binary being executed, but you can pass anything you want when you exec. e.g. execl("/bin/ls", "definitely not /bin/ls", NULL);
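Since `argv[0]` is only a convention, a cautious program might prefer `/proc/self/exe` when it exists and fall back to `argv[0]` otherwise. A sketch assuming Linux and POSIX `readlink`; `self_path` is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Best-effort path to the running executable.
   Returns 1 if it came from /proc/self/exe, 0 if it is just a copy of
   argv0 (which the parent process may have set to anything), or -1 if
   `out` is too small for the fallback. A very long link target is
   silently truncated by readlink; this sketch does not detect that. */
int self_path(const char *argv0, char *out, size_t out_len)
{
    if (out_len == 0)
        return -1;
    ssize_t n = readlink("/proc/self/exe", out, out_len - 1);
    if (n > 0) {
        out[n] = '\0';            /* readlink does not NUL-terminate */
        return 1;
    }
    int m = snprintf(out, out_len, "%s", argv0);
    return (m >= 0 && (size_t)m < out_len) ? 0 : -1;
}
```

Inside a chroot without /proc this degrades to trusting `argv[0]`, with the caveat from the `execl` example above.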

pitherpather · on Jan 23, 2024

> In this article I will demonstrate this capability [lisp code directly embedded], explain how it works and the journey to implementing it. Every script bundling tool I've ever seen unpacks the code to some file system location and then reads it back in. I came up with a different solution.

> The lone-embed tool copies the lisp code into the ELF and then creates a LOAD segment for it. Linux then maps in the embedded code automatically at load time before the interpreter has even started.

And the discussion which follows I found very informative.
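For readers wondering how a running program can find a segment without opening any file: on Linux the kernel passes the address of the program header table through the auxiliary vector. The sketch below only counts headers of a given type; it assumes glibc or musl (`getauxval`, `<link.h>`), `count_segments` is a hypothetical name, and it is not claimed to be the article's actual code.

```c
#include <link.h>       /* ElfW(), pulls in the ELF types and PT_* values */
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval(), AT_PHDR, AT_PHNUM */

/* Count program headers of the given type (e.g. PT_LOAD) in the running
   executable, using the table the kernel already mapped for us. */
size_t count_segments(unsigned long type)
{
    const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)getauxval(AT_PHDR);
    size_t n = (size_t)getauxval(AT_PHNUM);
    size_t count = 0;

    if (phdr == NULL)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (phdr[i].p_type == type)
            count++;
    return count;
}
```

A loader-mapped payload would be located the same way: find the header of interest, then read `p_vaddr` and `p_memsz` (adjusted by the load base for position-independent executables).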

zoom6628 · on Jan 23, 2024

If you aren't aware, have a look at the work of HNer jart, in particular the Cosmopolitan C library and redbean APE. Might be something to learn or share. Just thinking out loud.

lioeters · on Jan 23, 2024

Indeed, Cosmopolitan C is amazing.

> Cosmopolitan Libc makes C a build-anywhere run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS on AMD64 and ARM64 with the best possible performance.

https://justine.lol/cosmopolitan/