
GitHub has a feature for downloading the tree as a zip; why is this not used?


There are two main reasons, one bad, one good(-ish):

(1) Traditional autoconf assumed you only had a shell, cc, and make, and so the ./configure script, which is a huge ball of obscure shell commands, is shipped. I think most distros will now delete and recreate these files, which probably should have happened a lot earlier. (Debian has been mostly doing this right for a long time already.)
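Roughly, "delete and recreate" looks like this from a packager's point of view (a sketch only; the exact flags and wrappers vary by distro, e.g. Debian packages typically go through dh-autoreconf):

    # Throw away the shipped, pre-generated autotools output and rebuild it
    # from the checked-in sources (configure.ac, Makefile.am) using the
    # distro's own autoconf/automake, instead of trusting the tarball's copy.
    rm -f configure aclocal.m4
    rm -rf autom4te.cache
    autoreconf --force --install --verbose
    ./configure --prefix=/usr
    make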

(2) Some programs generate a lot of code (e.g. in libnbd we generate thousands of lines of boilerplate C from API descriptions*). To avoid people needing to install the specific tools that we use to generate that code, we distribute the generated files in the tarball, but they're not present in git. You can still build from git directly, and you can also verify that the generated code exactly matches the tarball, but both cases mean extra build dependencies for end users and packagers.
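As a sketch of that verification step (the repository URL, directory names, and generator script below are made up, not libnbd's actual layout), the tarball's generated files can be compared against a fresh run of the generator from git:

    # Hypothetical check: rebuild the generated files from git (this needs
    # the extra generator tooling) and compare them with the release tarball.
    git clone https://example.org/project.git
    (cd project && ./run-generator.sh)
    tar xf project-1.0.tar.gz
    diff -ru project/generated project-1.0/generated \
      && echo "tarball's generated code matches git"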

* Generating boilerplate code is a good thing in general as it reduces systematic errors, which are a vastly more common source of bugs compared to highly targeted supply chain attacks.


I advocate for checking in the auto-generated code. You can see the differences between tool runs, see how changes in tooling affect the generated code, and see what might have caused a regression (hey, it happens).

Sometimes tooling can generate unstable files. I recall there was a time when Eclipse was notorious for this; for example, when saving XML files it liked to reorder all the attributes. But these are bugs that need to be fixed. Tooling should generate perfectly reproducible files.


We started off doing this, but you end up with enormous diffs which are themselves confusing. For example, only about 5% of this change is non-generated:

https://github.com/libguestfs/libguestfs/commit/5186251f8f68...

Probably depends on the project as to whether this is feasible, but for us we intentionally want to generate everything we can in order to reduce systematic errors.


In GitHub, you can mark a file as generated [1], which hides it in the PR view by default.

[1] https://docs.github.com/en/repositories/working-with-files/m...
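Concretely, that's the linguist-generated attribute in .gitattributes (the paths below are just examples):

    # .gitattributes -- mark generator output as generated so GitHub
    # collapses it by default in pull request diffs
    generated/** linguist-generated=true
    *.pb.go      linguist-generated=true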


Wouldn’t an attacker like JiaT75 do exactly that, to increase the odds of reviewers skimming past it?


They might try. That's why, if you're generating and committing generated code, it's important to also have a CI step that runs before anything is merged and rejects any change request where the committed generated code doesn't match a fresh run of the generator.

Mostly this helps with people simply forgetting to re-run the generator in their PR, but it's a useful defence against people trying to smuggle things into the generated files, too!
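A minimal sketch of such a CI step, assuming the project exposes some `make generate`-style target (the target name and path are illustrative):

    # CI check (sketch): re-run the generator and fail the merge request if
    # the committed generated files differ from what it produces right now.
    make generate
    git diff --exit-code -- generated/ || {
      echo "generated files are out of date or were edited by hand" >&2
      exit 1
    }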


Yeah, I guess my general thought is that anything which encourages hiding files is actively risky unless you have some kind of robust validation process. As an example, I was wondering how many people would notice an extra property in a typically gigantic NPM lock file as long as it didn’t break any of the NPM functions.


The same feature was recently added to GitLab.


I disagree - you should ensure your dependencies are clearly listed. Docker excels at this: it's a host-platform-independent way of giving you a text-based representation of an environment.


Docker is a Linux thing, and very much not host-platform independent. It's just "chroot on steroids", and you're essentially just shipping a bunch of Linux binaries in a .tar.gz.

It works on other systems because they emulate or virtualize enough of a Linux system to make it work. That's all fine, but comes with serious trade-offs in terms of performance, system integration, and things like that. A fair trade-off, but absolutely not host-platform independent.


Sort of. I have about 15 containers running on my dev laptop as I type. Which versions of xz are on each of them, and how do I make sure of that?
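For what it's worth, interrogating the running containers is about the best you can do, and it only works if the image actually ships xz (a sketch):

    # Sketch: print whatever xz version (if any) each running container has.
    for c in $(docker ps -q); do
      printf '%s: ' "$c"
      docker exec "$c" xz --version 2>/dev/null || echo "(no xz, or exec failed)"
    done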


Downloading the tarball/zip with just a shell and regular utils is possible. See https://github.com/efrecon/ungit
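For example, GitHub's auto-generated tarball for a tag can be fetched and unpacked with nothing more than curl and tar (OWNER, REPO, and the tag are placeholders), with the stability caveats discussed below:

    # Fetch and unpack the auto-generated source tarball for a tag,
    # no git required.
    curl -L https://github.com/OWNER/REPO/archive/refs/tags/v1.2.3.tar.gz \
      | tar xz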


A minor note: technically there is only a weak guarantee that checksums won't break after a server update and recompression with a different version or an alternative implementation of gzip.

https://github.com/orgs/community/discussions/45830


It's not a minor note - it's a major reason the GitHub auto-generated tarballs are useless as-is, since they are not stable.


This was not GitHub’s fault but Git's, combined with cache pruning. Specifically, GitHub updated to Git 2.38, which changed the algorithm; non-cached tarballs were regenerated on demand, and all hell broke loose: https://github.blog/2023-02-21-update-on-the-future-stabilit...


It was not the first instance of this happening; other times I'm not certain it was git's fault.


GitHub has (for the moment) backed down, and currently the auto-generated tarballs are stable, but they have changed this in the past and may do so again in the future.


Thank you for highlighting this. I've started a new discussion https://github.com/orgs/community/discussions/116557 asking for strong guarantees of checksum stability for autogenerated tarballs attached to releases.


For many projects, the release tarballs only contain the files necessary to build the software, and not the following items that may be present in the same repository:

- scripts used by project developers to import translations from another system and commit those translation changes to the repository

- scripts to build release tarballs and sign them

- continuous integration scripts

- configuration and scripts used to setup a developer IDE environment


You can use .gitattributes export-ignore to influence what gets into the tarballs and what stays in the repository! It's super powerful but not often used.
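A quick sketch (the paths are examples):

    # .gitattributes -- keep developer-only files out of `git archive`
    # output, and therefore out of the release/source tarballs
    .github/                  export-ignore
    ci/                       export-ignore
    po/import-translations.sh export-ignore

Both git archive run locally and GitHub's auto-generated tarballs honour these attributes.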


And export-subst to insert the current tag or git revision into the archive too.

In fact export-subst is powerful enough that there is probably some way to create an exploit triggered by a particular payload inside a commit or tag message? :)

Maybe not triggered, but it could be part of the chain.
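For reference, export-subst expands $Format:...$ placeholders (git log pretty-format codes) in marked files when the archive is created; the file name here is just an example:

    # .gitattributes
    .tarball-version export-subst

    # contents of .tarball-version as committed in git:
    #   commit $Format:%H$ committed on $Format:%cI$
    # git archive replaces the placeholders in the archived copy.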


I smell a new backdooring opportunity: modifying .gitattributes to surreptitiously sneak some binary files into the GitHub release tarballs. Few people would take a look at .gitattributes.


What would be the problem with downloading a few MB more? Aren't these source tarballs just used to build the distro binary and then deleted?


As this backdoor has shown, extra unnecessary files in the sources can make it easier to hide malicious code. If you take Gentoo as an example, when a software package is built, Gentoo creates a sandboxed environment first, preventing the build process from impacting the rest of the operating system.[1] Removing superfluous files from the source tarballs minimises an attacker's ability to get malicious code inside the sandboxed build environment.

Sandboxes for building software are commonly used throughout Linux distributions, but I am unsure how strict those sandboxes are in general, e.g. whether they use seccomp and really tighten what a build script can get up to. At least on Gentoo, there is a subset of packages (such as GNU coreutils) that are always assumed to be available for building software, and they're always present in the sandbox. Build dependencies aren't as granular as "this build needs to use awk but not sed".

[1] https://wiki.gentoo.org/wiki/Sandbox_(Portage)



