
GitHub has a feature for downloading the tree as a zip; why is this not used?


There are two main reasons, one bad, one good(-ish):

(1) Traditional autoconf assumed you only had a shell, cc, and make, and so the ./configure script, which is a huge ball of obscure shell commands, is shipped. I think most distros will now delete and recreate these files, which probably should have happened a lot earlier. (Debian has been mostly doing this right for a long time already.)
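Roughly, "delete and recreate" looks like this from a packager's point of view (a sketch only; the exact flags and wrappers vary by distro, e.g. Debian packages typically go through dh-autoreconf):

    # Throw away the shipped, pre-generated autotools output and rebuild it
    # from the checked-in sources (configure.ac, Makefile.am) using the
    # distro's own autoconf/automake, instead of trusting the tarball's copy.
    rm -f configure aclocal.m4
    rm -rf autom4te.cache
    autoreconf --force --install --verbose
    ./configure --prefix=/usr
    make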

(2) Some programs generate a lot of code (e.g. in libnbd we generate thousands of lines of boilerplate C from API descriptions*). To avoid people needing to install the specific tools that we use to generate that code, we distribute the generated files in the tarball, but they're not present in git. You can still build from git directly, and you can also verify that the generated code exactly matches the tarball, but both cases mean extra build dependencies for end users and packagers.
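As a sketch of that verification step (the repository URL, directory names, and generator script below are made up, not libnbd's actual layout), the tarball's generated files can be compared against a fresh run of the generator from git:

    # Hypothetical check: rebuild the generated files from git (this needs
    # the extra generator tooling) and compare them with the release tarball.
    git clone https://example.org/project.git
    (cd project && ./run-generator.sh)
    tar xf project-1.0.tar.gz
    diff -ru project/generated project-1.0/generated \
      && echo "tarball's generated code matches git"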

* Generating boilerplate code is a good thing in general as it reduces systematic errors, which are a vastly more common source of bugs compared to highly targeted supply chain attacks.


I advocate for checking in the auto-generated code. You can see the differences between tool runs, see how changes in tooling affect the generated code, and see what might have caused a regression (hey, it happens).

Sometimes tooling can generate unstable files. I recall there was a time when Eclipse was notorious for this; for example, when saving XML files it liked to reorder all the attributes. But these are bugs that need to be fixed. Tooling should generate perfectly reproducible files.


We started off doing this, but you end up with enormous diffs which are themselves confusing. For example, only about 5% of this change is non-generated:

https://github.com/libguestfs/libguestfs/commit/5186251f8f68...

Probably depends on the project as to whether this is feasible, but for us we intentionally want to generate everything we can in order to reduce systematic errors.


In GitHub, you can mark a file as generated [1], which hides it in the PR view by default.

[1] https://docs.github.com/en/repositories/working-with-files/m...
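Concretely, that's the linguist-generated attribute in .gitattributes (the paths below are just examples):

    # .gitattributes -- mark generator output as generated so GitHub
    # collapses it by default in pull request diffs
    generated/** linguist-generated=true
    *.pb.go      linguist-generated=true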


Wouldn’t an attacker like JiaT75 do exactly that, to increase the odds of reviewers skimming past it?


They might try. That's why, if you're generating and committing generated code, it's important to also have a CI step that runs before anything is merged and rejects any change request where the committed generated code doesn't match a fresh run of the generator.

Mostly this helps with people simply forgetting to re-run the generator in their PR, but it's a useful defence against people trying to smuggle things into the generated files, too!
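A minimal sketch of such a CI step, assuming the project exposes some `make generate`-style target (the target name and path are illustrative):

    # CI check (sketch): re-run the generator and fail the merge request if
    # the committed generated files differ from what it produces right now.
    make generate
    git diff --exit-code -- generated/ || {
      echo "generated files are out of date or were edited by hand" >&2
      exit 1
    }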


Yeah, I guess my general thought is that anything which encourages hiding files is actively risky unless you have some kind of robust validation process. As an example, I was wondering how many people would notice an extra property in a typically gigantic NPM lock file as long as it didn’t break any of the NPM functions.


The same feature was recently added to GitLab.


I disagree - you should ensure your dependencies are clearly listed. Docker excels at this: it's a host-platform-independent way of giving you a text-based representation of an environment.


Docker is a Linux thing, and very much not host-platform independent. It's just "chroot on steroids", and you're essentially just shipping a bunch of Linux binaries in a .tar.gz.

It works on other systems because they emulate or virtualize enough of a Linux system to make it work. That's all fine, but comes with serious trade-offs in terms of performance, system integration, and things like that. A fair trade-off, but absolutely not host-platform independent.


Sort of. I have about 15 containers running on my dev laptop as I type. Which versions of xz are on each of them, and how do I make sure of that?
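For what it's worth, interrogating the running containers is about the best you can do, and it only works if the image actually ships xz (a sketch):

    # Sketch: print whatever xz version (if any) each running container has.
    for c in $(docker ps -q); do
      printf '%s: ' "$c"
      docker exec "$c" xz --version 2>/dev/null || echo "(no xz, or exec failed)"
    done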


Downloading the tarball/zip with just a shell and regular utils is possible. See https://github.com/efrecon/ungit
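For example, GitHub's auto-generated tarball for a tag can be fetched and unpacked with nothing more than curl and tar (OWNER, REPO, and the tag are placeholders), with the stability caveats discussed below:

    # Fetch and unpack the auto-generated source tarball for a tag,
    # no git required.
    curl -L https://github.com/OWNER/REPO/archive/refs/tags/v1.2.3.tar.gz \
      | tar xz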


A minor note: technically there is only a weak guarantee that checksums won't break after a server update and recompression with a different version or an alternative implementation of gzip.

https://github.com/orgs/community/discussions/45830


It's not a minor note - it's a major reason the GitHub auto-generated tarballs are useless as-is, since they are not stable.


This was not GitHub’s fault but Git's, combined with cache pruning. Specifically, GitHub updated to Git 2.38, which changed the algorithm; non-cached tarballs were regenerated on demand, and all hell broke loose: https://github.blog/2023-02-21-update-on-the-future-stabilit...


It was not the first instance of this happening; other times I'm not certain it was git's fault.


GitHub has (for the moment) backed down, and currently the auto-generated tarballs are stable, but they have changed this in the past and may do so again in the future.


Thank you for highlighting this. I've started a new discussion https://github.com/orgs/community/discussions/116557 asking for strong guarantees of checksum stability for autogenerated tarballs attached to releases.


For many projects, the release tarballs only contain the files necessary to build the software, and not the following items that may be present in the same repository:

- scripts used by project developers to import translations from another system and commit those translation changes to the repository

- scripts to build release tarballs and sign them

- continuous integration scripts

- configuration and scripts used to setup a developer IDE environment


You can use .gitattributes export-ignore to influence what gets into the tarballs and what stays in the repository! It's super powerful but not often used.
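A quick sketch (the paths are examples):

    # .gitattributes -- keep developer-only files out of `git archive`
    # output, and therefore out of the release/source tarballs
    .github/                  export-ignore
    ci/                       export-ignore
    po/import-translations.sh export-ignore

Both git archive run locally and GitHub's auto-generated tarballs honour these attributes.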


And export-subst to insert the current tag or git revision into the archive too.

In fact export-subst is powerful enough that there is probably some way to create an exploit triggered by a particular payload inside a commit or tag message? :)

Maybe not triggered, but it could be part of the chain.
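For reference, export-subst expands $Format:...$ placeholders (git log pretty-format codes) in marked files when the archive is created; the file name here is just an example:

    # .gitattributes
    .tarball-version export-subst

    # contents of .tarball-version as committed in git:
    #   commit $Format:%H$ committed on $Format:%cI$
    # git archive replaces the placeholders in the archived copy.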


I smell a new backdooring opportunity: modifying .gitattributes to surreptitiously sneak some binary files into the GitHub release tarballs. Few people would take a look at .gitattributes.


What would be the problem with downloading a few MB more? Aren't these source tarballs just used to build the distro binary and then deleted?


As this backdoor has shown, extra unnecessary files in the sources can make it easier to hide malicious code. If you take Gentoo as an example, when a software package is built, Gentoo creates a sandboxed environment first, preventing the build process from impacting the rest of the operating system.[1] Removing superfluous files from the source tarballs minimises an attacker's ability to get malicious code inside the sandboxed build environment.

Sandboxes for building software are commonly used throughout Linux distributions, but I am unsure how strict those sandboxes are in general, e.g. whether they use seccomp and really tighten what a build script can get up to. At least on Gentoo, there is a subset of packages (such as GNU coreutils) that are always assumed to be available for building software, and they're always present in the sandbox. Build dependencies aren't as granular as "this build needs to use awk but not sed".

[1] https://wiki.gentoo.org/wiki/Sandbox_(Portage)



