
As someone who has worked in a space that has to deal with uploaded files for the last few years, and who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ), I have to say I really love seeing new entries in the file type detection space.

Though I have to say when looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also, as others have mentioned, the model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Where libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.



Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'


Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.
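
For example, a rough way to measure that marginal cost with the Python package (loading the model once, then timing only the per-file calls) might look like the sketch below -- the `Magika` class and `identify_path()` method are my reading of the project README, so treat the exact names as assumptions:

    # Hedged sketch: load the model once, then time only the per-file
    # identification calls. Requires `pip install magika`; the Magika class
    # and identify_path() method follow the project README (assumption).
    import sys
    import time
    from pathlib import Path

    from magika import Magika

    paths = [Path(p) for p in sys.argv[1:]]

    m = Magika()  # model loading happens here, outside the timed region

    start = time.perf_counter()
    for p in paths:
        m.identify_path(p)
    elapsed = time.perf_counter() - start

    print(f"{len(paths)} files in {elapsed:.3f}s "
          f"({1000 * elapsed / max(len(paths), 1):.2f} ms/file)")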


Going by those numbers, it's taking almost a second to run, not 10s of ms. And going by those numbers, it's doing something massively parallel in that time: basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like the GP has a 12-16 thread CPU, and it is using all of those threads while still being 30 times slower than single-threaded libmagic.

That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).

If I had to wait a second just to open something during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in file replacement, open some folder in your favorite file manager, and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.

It's utterly unsuitable for "interactive" identifications.


My little script is trying to identify in bulk, at least by passing 165 file paths to `magika` and `file`.

Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea on how it generally compares.

Another note: I was trying to be generous to `magika` here, because for single-file identification it's about 160-180ms on my machine vs <1ms for `file`. I realize quite a bit of that number is Python startup, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single-file benchmark as well.


I've updated this script with some single-file CLI numbers, which are (as expected) not good. Mostly just comparing Python startup time for that.

    make
    sqlite3 < analyze.sql
    file_avg              python_avg         python_x_times_slower_single_cli
    --------------------  -----------------  --------------------------------
    0.000874874856301821  0.179884610224334  205.611818568799
    file_avg            python_avg     python_x_times_slower_bulk_cli
    ------------------  -------------  ------------------------------
    0.0231715865881818  0.69613745142  30.0427184289163


We did release the npm package because we indeed created a web demo and thought people might want to use it as well. We know it is not as fast as the Python version or a C++ version -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use it -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files!


Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the python cli, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!


Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model for each file. Other than that, it's just as fast for single-file lookups, which is the most common use case.


Good to know! Thank you. I'll definitely be trying it out. Though, I might download and hardcode the model ;)

I also appreciate the use of ONNX here, as I'm already thinking about using another version of the runtime.

Do you think you'll open source your F1 benchmark?


Can we do the 1600 if known, and if not, let the AI take a guess?


Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.
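
A toy version of that layered approach (purely illustrative -- the accepted-type table and the validators are hypothetical) could combine libmagic via the python-magic bindings with a hand-rolled structural check per accepted format:

    # Illustrative only: coarse MIME guess from libmagic (python-magic),
    # followed by a hand-coded structural check for each accepted type.
    import zipfile
    from pathlib import Path

    import magic  # python-magic: bindings to libmagic


    def validate_zip(path: Path) -> bool:
        # libmagic said "zip"; is the archive actually readable?
        if not zipfile.is_zipfile(path):
            return False
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # None => no corrupt members


    VALIDATORS = {
        "application/zip": validate_zip,
        # "application/pdf": validate_pdf, ...one entry per accepted type
    }


    def accept_upload(path: Path) -> bool:
        mime = magic.from_file(str(path), mime=True)
        validator = VALIDATORS.get(mime)
        return validator is not None and validator(path)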

Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.


> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on `file` to check if an input is PostScript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.
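
As a hedged sketch of that last case -- the "python" label string and the result's attribute path are assumptions based on the magika README and may differ between versions -- the filtering step could be as small as:

    # Keep only files the detector labels as Python; everything else is
    # dropped. identify_path() and output.ct_label are assumptions taken
    # from the magika Python README.
    from pathlib import Path

    from magika import Magika

    m = Magika()
    scraped = Path("scraped_corpus")  # hypothetical directory of scraped files

    kept = [
        p for p in scraped.rglob("*")
        if p.is_file() and m.identify_path(p).output.ct_label == "python"
    ]
    print(f"kept {len(kept)} of the scraped files")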


Okay, but your one file type is more likely to be included in the 1600 that libmagic supports than in Magika's 116?

For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).

I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.


Surely identifying just one file type (or two, as in your example) is a much simpler task that shouldn’t rely on horribly inefficient and imprecise “AI” tools?


It's for researchers, probably.


Yeah, there is this line:

    By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.



