Software dark matter is the enemy of software transparency

pudo · on Nov 13, 2022

After reading this I’m fully confused by how they define dark matter. Stuff that doesn’t come from the distro package manager? Everything installed via other mechanisms? Assets copied into the container as part of the build mechanism?

Wouldn’t it make more sense to define dark matter as all the stuff that is installed in a container but never activated (unless exploited?)

rst · on Nov 13, 2022

That's their explicit definition: "Software dark matter refers to files that are not tracked by operating system (OS) package managers (like `apt` or `apk`), which renders these files and the packages they represent invisible—or at least complicated to find—to software composition analysis and security scanning tools."

That seems to specifically exclude software installed by, say, language-specific package managers (Cargo, Rubygems, npm and derivatives) -- which on the whole seems pretty perverse. Dealing with those does indeed complicate SBOM maintenance -- but people use them anyway for very good reasons (which sometimes include getting more secure versions of the packaged code!), and having tools that work in the real world requires dealing with that complexity, not wishing it away.

hakre · on Nov 13, 2022

This is good to underline, but I found this confusing or contradicting as well.

Also because of as they write about containers. In a container all files are tracked. That's the container.

derefr · on Nov 13, 2022

Different meaning of "tracked." This is about static-analysis systems that seek to understand the "provenance" of the files that go into the container-image, so that they can alert you to vulnerabilities in the container's dependencies.

"Dark matter" here is anything these tools can't see / notice vulnerabilities in.

ilyt · on Nov 13, 2022

So any DB container by definition would have massively high percentage just because DB app itself is few tens of MB but database data is in tens of gigabytes ?

Seems like really useless metric for containers.

I can get it for OSes (some packages there do manage DB data, and even have option to remove it when removing package) but for container it does seem a bit pointless

derefr · on Nov 13, 2022

No...? Again, we're talking about container images, not containers. Specifically, public container images sitting in registries like Docker Hub. People aren't burning their Postgres data into a container image and then pushing it, public-readable, to an image registry.

(But also, even ignoring that, I believe the metric used by the article is number-of-files, not byte-size. A DB might be large in byte-size, but is usually relatively negligible in number-of-files, usually holding individual table chunk files of 1GB or larger.)

hakre · on Nov 14, 2022

As the container is the result of a build process, unless the tools aren't the build tools themselves, the whole container should be treated dark matter and just rebuild. It's process, not state.

derefr · on Nov 14, 2022

It's the build process for the container-image (i.e. the Dockerfile or equivalent) that the tooling being discussed here is analyzing; not the resultant container image, nor containers spawned from said image.

The goal is, presumably, to figure out when a given docker image was created in such a way that it burns in a vulnerable version of some library; so that the author can be alerted that they need to (update their Dockerfile and) rebuild their image.

"Dark matter", under this definition, is anything that gets injected during the build process of the image, that is not itself traceable to some other versioned package management system with vulnerable-version deprecation. Without such information, an automated agent like the one described in the article cannot then propagate deprecations from consumed package-versions to produced image-tags.

A good example of such "dark matter" would be a static binary built outside the Dockerfile using a CI system, where the CI then creates a docker image by running a Dockerfile that simply injects the expected prebuilt binary into an image with an ADD stanza. Does that binary contain vulnerable versions of embedded static libraries? Who knows?

hakre · on Nov 14, 2022

Not sure it is that easy. The Docker API provides introspection for those as well as also there is no Light Matter only because the example project is not using an ADD stanza any longer but the Dockerfile context is from a tar ball created by that project as a reproducible build artefact.

dlor · on Nov 13, 2022

This is basically the definition we used. It's practically important because scanners really do miss software copied in via other mechanisms, and most of them give zero indication about it. For a few basic examples, try running your favorite scanner on the wordpress, node, or busybox images on DockerHub and see what the scanner finds.

For Wordpress, most scanners will miss that PHP or Wordpress are even installed in the image. The scanners spit out lots of data, but it's only about what they can find, offering the illusion of completeness or transparency.

rob74 · on Nov 13, 2022

Well then I guess scanners need to improve... I mean, the current version of Wordpress (and other software) is being made available as a Docker image because this is faster and more convenient than making it available via the package system, so it kinda makes sense that they are not available (or available much later) via apt/apk/whatever. Calling all other methods of distribution (pulling software from Github or via the various language-specific package managers) "dark matter" expresses the desire of not wanting to deal with that stuff, but surely won't make the "problem" go away.

RobotToaster · on Nov 13, 2022

I guess the point is you could have an open source program in the package manager, that then downloads a closed source binary blob component, that could be doing something undesirable.

nerdponx · on Nov 13, 2022

I have the exact same confusions and questions as you. I think maybe they consider "dark matter" to be anything for which the source is not publicly available and so cannot be analyzed by security tools that don't have access to the private sources.

I also agree with your "wouldn't it make more sense" definition. From the perspective of a developer concerned about the security and robustness of their own deployment, "dark matter" would be anything that ends up in my container that I don't actually need to run the app in the container.

hakre · on Nov 13, 2022

I also had problems to learn more about that. For me the article creates more confusion than has some theoretic or practical value.

pabs3 · on Nov 13, 2022

Another tool for identifying software dark matter on a Debian based system is cruft-ng:

https://github.com/a-detiste/cruft-ng/

Eventually it will enable a Debian system to account for (in some way) every single file visible on the live system.

thenoblesunfish · on Nov 13, 2022

A concrete example of files appearing in the "dark matter" would make this clearer.

aghlabid00x0 · on Nov 13, 2022

“dark matter” sounds like marketing speak.

choeger · on Nov 13, 2022

So a container that, say, builds nginx from source and doesn't delete the source tree is considered full of dark matter because so many files are not installed from the package manager?

That appears to be a difficult definition.

dlor · on Nov 13, 2022

Basically yes - unless you also keep enough metadata around somewhere for a scanner to know what version of nginx is installed. This can be done out-of-band with an SBOM, or in-band by using package manager metadata.

Programmatic · on Nov 13, 2022

The answer to that would be to make a builder image and copy out the installed files. That makes a smaller container anyhow so has several advantages.

henrydark · on Nov 13, 2022

Except that dark matter might not actually exist, and it's a change in equations that is needed.

I'm not sure what the software analogy is, except that there's no software dark matter, just people not familiar with their tools.

peter_d_sherman · on Nov 14, 2022

A long time ago, I wrote a program for personal use which I called "FileHasher", or something like that.

FileHasher (or whatever I called it) -- was basically a "poor man's antivirus utility" -- that is, it didn't scan memory, didn't check boot blocks, didn't scan system [E|EE]PROMS like BIOS, and it knew nothing about rootkits -- or how to detect them.

But what FileHasher did do was to take a point-in-time "metadata snapshot" -- of all of the files on my PC -- their path, their filename, their size, their date, and a custom 16 or 32 byte hash of their contents. This data was put into a single simple space or tab or comma delimited text file (a "poor man's database" <g>) which contained in its filename the date and time (as a string) when this file was generated.

The idea was, I'd run a completely fresh OS install. Then, as the absolute first thing I'd do after the OS install, I'd copy "FileHasher" onto my PC via USB drive, and run it to generate a metadata snapshot file of all of the system's files...

FileHasher could then be run at any time subsequent -- to generate an additional "point-in-time" metadata snapshot information file.

Once two such files were created from two points in time -- FileHasher could compare them -- and list ALL files that had been created, deleted, or modified -- since the initial or previous run.

The idea was, that a virus, if it were to exist, would probably create/modify/delete at least one file -- and FileHasher in reporting mode (if used with diligence, say, before and after software installs, and at various other dates/times) -- would help a person with a keen eye -- in finding/identifying/fixing what the problem was, based on the list of created/deleted/modified files...

Tracking the Software Dark Matter in the various layers of container(ized) images -- sounds like a very similar (and good!) idea!

Will it solve every possible container security problem?

Probably not -- but it's a good step in the right direction!

(Was my "virus checker" perfect? No! But it was better than no virus checker! <g> ("A Little Bit Of Something" > "Nothing" -- you know, from Philosophy 101! <g>))

williamcotton · on Nov 13, 2022

I’ve used an approach to copy only the shared libs from the build container to the production container. This not only gets rid of most of this “dark matter” but results in much smaller containers!

https://github.com/williamcotton/express-c/blob/master/demo/...

topaz0 · on Nov 13, 2022

Odd metaphor, given that one of the defining features of dark matter is that it is transparent.

vitiral · on Nov 13, 2022

So the answer is "use bazel" or "use Nixpkg" since then every input is not only known by the build system, but is hashed in the dependency tree?