From the paper [0], they're using a specialized model structure, so at least they are not part of the LLM hype. That's good. But I still wonder how this compares to existing rule- and manual heuristics-based approaches like github/linguist.
I just compared it to "file" on my downloads folder and here's a summary of the differences (in essentially no particular order, but alphabetically by filename [not extension] if that matters):
1. Magika identified more csv/tsv files than file did, but also had a few false positives, most of which were one-word-per-line files (e.g. m3u, dictionaries, a list of checksums), so I guess technically a CSV with only one entry per line if you want to stretch the definition
2. A few PDF files were missed by file but flagged correctly by Magika; Zathura (muPDF backend) printed warnings about "repairing document" when opening them, so these were malformed, but not so badly that they wouldn't render. There was one PDF that file recognized, and that appears to be well-formed, but Magika missed it
3. Magika completely failed to recognize any old "mod" tracker audio files (.xm .mod and .it .s3m extensions) while file caught them perfectly. Maybe a hole in the training data?
4. Magika completely messed up on an old-school fixed-column field ALL CAPS data file, tagging it as vbscript. File just called it "plain text"
5. Slight nitpick, but Magika flagged all ELF files as executables, even non-executable .ko and .o files
6. File did not flag a single YAML file (instead marking them as plain text); Magika caught at least some of them
7. Magika mis-flagged 3 out of 4 ssh public keyfiles; one as javascript, two as powershell
8. Magika was much better at specifically identifying zip-wrapped formats. One JAR file was missed by file but caught by Magika, and an Android .jar was identified as Android by Magika but just as JAR by file.
9. File incorrectly tagged several raw disk images with a file-format that was near the beginning of the image; magika just gave up and called it an octet-stream
10. Magika missed the only .mobi file, calling it octet-stream while file got it correct
11. Magika missed all of the djvu files (maybe another hole in the training data?); file got them all correct.
12. A source file with a .jsx extension[A] is definitely not the well-known JSX (no XML-like syntax, and it has "final class Foo" declarations). Magika just said plain-text and file said C++, which is definitely wrong, so Magika wins here.
13. A very short asciidoctor file was misidentified as TCL by Magika and plain-text by file
14. An html snippet was called "Twig template" by Magika and plain-text by file
15. A Wikipedia markup file was called "javascript" by Magika and plain-text by file
16. A unix .mbox style e-mail file was correctly identified by Magika and called plain-text by file
17. Magika caught a .mp4 file that file missed, which should have been a slam-dunk for file. I'll dig into this one later and file a PR for file, since the .mp4 doesn't appear to be malformed at all
18. A Project Gutenberg .txt file (The Divine Comedy) was tagged as vbscript; together with the other issues above, this suggests Magika is biased toward classifying plain text as source code
19. A pk-zip file containing only .jpg files was misidentified as a TIFF file by Magika. File got it correct; not sure what went wrong here
20. Magika misidentified an HEIC as a video file, which makes some sense
21. File correctly identified an old ASF video, Magika gave up and just called it an octet-stream
22. Magika correctly identified every iso-9660 file I manually looked for; file missed a couple
23. An html file misidentified as a ruby file; not a small one either, but missing the <html> header
24. What appears to be a whitespace-separated file with "#" line comments was identified as CSV by Magika and plain-text by file
25. Several Gentoo ebuild files were identified as such by file, but as shell scripts by Magika (the syntax is very similar)
26. Magika did not appear to identify any of the zstandard compressed files, file caught all of them
27. A brotli compressed json file was identified by Magika but just as octets by file
28. Two shell scripts were marked plain-text by Magika but caught by file
29. Another plain-text novel was identified as .csv by Magika
30. A zip file was missed by file, but caught by Magika
31. 3 .html files missing html headers were recognized by Magika but not by file
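
For anyone who wants to repeat this kind of comparison, here is a minimal sketch of how the disagreements could be tallied. It is not the exact procedure I used, just an illustration. `file_label` shells out to `file --brief --mime-type`; `tally_disagreements` is a hypothetical helper that takes any two labelling functions, so the Magika side is left to the caller (wrap its CLI or Python API however you prefer; I'm not assuming any particular Magika flags here).

```python
import subprocess
from collections import Counter
from typing import Callable, Iterable


def file_label(path: str) -> str:
    """Return the MIME type reported by the `file` utility for one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def tally_disagreements(
    paths: Iterable[str],
    label_a: Callable[[str], str],
    label_b: Callable[[str], str],
) -> Counter:
    """Count (label_a, label_b) pairs for every path where the two tools differ.

    Hypothetical helper: label_a and label_b are any functions mapping a path
    to a label string, e.g. file_label and a Magika wrapper of your own.
    """
    counts: Counter = Counter()
    for path in paths:
        a, b = label_a(path), label_b(path)
        if a != b:
            counts[(a, b)] += 1
    return counts
```

Sorting the result with `most_common()` puts the most frequent disagreements first (e.g. many files that one tool calls plain text and the other calls source code). It only shows where the tools differ; deciding which one is right still needs a manual look at each file.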
[0]: https://securityresearch.google/magika/2025_icse_magika.pdf