To be honest, this sounds more like a >/dev/null solution to me. The underlying issue is that Debian packages are out of date, and that the repos (PPAs) containing them are heavily unmaintained, to the point where decades-old libraries are required even when upstream has moved on.
Why not go with the approach of trying to fix it upstream, where it belongs, instead of maintaining a set of fixers downstream - which will likely stop working in the foreseeable future?
I mean, if you start building an auto-fixer pipeline for an auto-linted error... then is that error actually valid in the first place? Probably not.
The reason why a lot of developers these days have moved to Arch's/AUR's ecosystem is not the number of available packages (it's probably about the same) - it's the integration of packaging with upstream.
Every time a package upstream is out-of-date or not working, you'll likely find an Arch user trying to fix it with a pull request in the git repos.
The Debian Janitor should've tried to automate pull requests rather than automating self-hosted fixes downstream without even letting upstream know that their packages are wrongly formatted.
But that's just my two cents; I maintained a lot of PPAs in the past for both Debian and Ubuntu and have moved on to Arch/Manjaro, because I ain't got no time for that anymore.
This is fixing issues in the packaging, not the upstream part of the package. I have plans to also have it fix issues in upstreams, but as you say - those will go into the upstream repositories.
It processes the main Debian archive, not PPAs. The bot makes changes to the Git repository with the packaging - developers get a pull request with the changes after verification, and they can just click "Merge" in the GitLab UI. That seems pretty similar to what you're saying would happen in Arch.
> I mean, if you start building an auto-fixer pipeline for an auto-linted error... then is that error actually valid in the first place? Probably not.
Yeah that does sound like a waste of time for a relatively small project.
But I actually think this makes perfect sense once you factor in how massive the effective codebase of Debian is. It's also a much different beast; there are benefits and costs to moving more slowly and methodically. Making seemingly small architectural changes (like the switch from 'extra' to 'optional') is awkward just because of the structure of the project.
Anything to take tedium off the plates of devs who are committing upstream in preparation for the next release has to be a good thing, right?
> I mean, if you start building an auto-fixer pipeline for an auto-linted error... then is that error actually valid in the first place? Probably not.
Just because you can automate it does not mean that the task is worthless.
React provides auto-migration scripts to newer versions, Python had the 2to3 utility, the Linux kernel has patches auto-generated with Coccinelle, editors have "rename this variable and all references" functions. Starting from a certain code base size there are changes that are worthwhile and best tackled with automation.
One change the Janitor can try to make automatically is upgrading the debhelper version. Newer debhelper versions include support for compiling code with additional security measures, so we end up with more secure packages.
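For concreteness, on a package that already uses the newer compat mechanism, such a bump often amounts to a one-line change in debian/control (a purely illustrative diff, not taken from any real package):

    -Build-Depends: debhelper-compat (= 12)
    +Build-Depends: debhelper-compat (= 13)

Packages still carrying an old-style debian/compat file need a slightly bigger migration, which is exactly the kind of mechanical change that is tedious to do by hand across thousands of packages.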
If ~everyone migrates off an older debhelper version, then the debhelper developers don't need to keep maintaining that code and can delete it, leading to a more maintainable code base, less complex documentation that says "X, unless you're on an older version, when it's Y" and so on.
The Janitor is pretty good at trying something like upgrading debhelper, then doing a build, running the package tests, and performing a package diff to make sure nothing unexpected happened, then proposing a merge. Doing this by hand is slow and laborious. If debhelper always tried to auto-upgrade to the newest version when it was run, it would be frustratingly slow.
And once the merge proposal is accepted, then it's clear which packages need human attention, and which were able to be cleanly migrated by automation. We use this a lot in the janitor. Run a fixer over all eligible packages, and see which ones succeed. Then investigate a handful of packages that failed, see if there is a common pattern that can be extracted and fixed. Then we reapply the fixer to all the remaining packages, see how many were fixed, and repeat the process until there are only a very few remaining.
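To make the shape of that loop concrete, here's a rough sketch in Python - every helper here is a hypothetical stand-in, not the Janitor's actual API:

    # Rough sketch of the "apply, verify, propose, iterate" loop described above.
    # All helper functions are hypothetical stand-ins, not the Janitor's real code.

    def run_campaign(packages, apply_fixer, build, run_tests, diff_is_clean, propose_merge):
        needs_human = []
        for pkg in packages:
            if not apply_fixer(pkg):                 # e.g. bump the debhelper version
                needs_human.append(pkg)
                continue
            if not (build(pkg) and run_tests(pkg) and diff_is_clean(pkg)):
                needs_human.append(pkg)              # something unexpected happened
                continue
            propose_merge(pkg)                       # maintainer just clicks "Merge"
        return needs_human

    # Inspect a handful of the failures, teach the fixer about any common pattern,
    # then re-run the campaign on what's left until only a few packages remain.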
A lot of these errors are in the package metadata, but Debian doesn't make it easy for itself: the build system and packaging format are very complex. There are lots of different build system variants, relying on weird, poorly documented helper scripts with strange interactions. In my opinion, there's just too much magic in the debhelper system to understand how to fix problems. You also have the long, complex and constantly changing list of rules which need to be applied for each update. They are very strict about noting down the copyright of all the files, for example (which I can understand to some extent).
I think the biggest issue is the social problem of the long and arduous process to become a trusted Debian Developer. Although there's now an easier Debian Maintainer option for restricted access, it's still not easy to become one. Other people have to go through a developer to update or add packages. If you can't find a responsive developer, then package changes can sit for months or years in a broken state before they are applied (as I have found out). The whole packaging process is a frustrating exercise in bureaucracy compared to other non-Debian distributions (e.g. Fedora), which is no fun to be involved in for most people.
> I think the biggest issue is the social problem of the long and arduous process to become a trusted Debian Developer. Although there's now an easier Debian Maintainer option for restricted access, it's still not easy to become one.
To be fair, with the impact you could have in a role like this, with millions (billions?) of Debian or Debian-based devices and installations depending on it, worldwide...
It IMO makes sense that you only grant access once competence, non-malicious intentions and willingness to stick around are established.
> weird poorly-documented helper scripts which have strange interactions
I've never found a packaging tool without an extensive and clear manpage with examples.
> I think the biggest issue is the social problem of the long and arduous process to become a trusted Debian Developer. Although there's now an easier Debian Maintainer option for restricted access, it's still not easy to become one. Other people have to go through a developer to update or add packages.
There is an arduous process to become a senior engineer in a FAANG, or a tenured professor, or an airline pilot.
> I've never found a packaging tool without an extensive and clear manpage with examples.
That might be fine if you already know which packaging tool to start with. Debian has several and it's not clear upfront which one is preferred. But even then I'd say there is simply too much information.
And the process is complex. If I want to package a standard autoconf/cmake program, I have to do a lot of command typing. If you compare that to Gentoo, Arch or even Homebrew packages, where you essentially edit one file and run one or two commands, it's just cumbersome.
I recently tried to find out what I would have to do for an NMU and I just couldn't find any entry point that made sense to me. It felt like trying to get into a cabal where all the information is encoded in arcane Latin.
The process is indeed complex, but there are efforts to standardize on a single simple way to do packaging. Unfortunately with an archive of 30k source packages, that is a slow process.
A great resource is https://trends.debian.net/, which tracks some of the different ways of doing packaging, as well as the progress on convergence.
With the new style debhelper that we're converging on and a straightforward package, creating a new package should not have to involve a lot of typing - although it's still split across multiple files.
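As a rough illustration of how little boilerplate the new style needs: for a well-behaved package, the entire debian/rules file is typically just

    #!/usr/bin/make -f
    %:
    	dh $@

with the metadata (control, changelog, copyright) living in their own small files - that's the "split across multiple files" part.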
The janitor is trying to help make packaging less toilsome:
- The janitor is trying to automatically migrate people to newer versions of debhelper, reducing the number of different build system variants.
- The janitor attempts to perform validation of all the long, complex, changing rules for you, and if you meet their requirements it will propose an upgrade. If the janitor can automatically bring you into compliance with the rules, it will; otherwise it'll upgrade you as far as the step that requires a human to intervene. You can do that one step manually, and the janitor will loop back around and do the rest for you.
- The janitor will also find suboptimal or unnecessary packaging constructs and improve them for you, hopefully leaving you with much smaller, simpler packaging that's easier to reason about.
If a process doesn't work, adding lots of people to it will not help.
A major part of the problem is scalability. "In the old days" the automated system that eats signed binary packages and outputs a usable apt-get-accessible archive couldn't do much to any individual package if we wanted it to complete in time. Times change, and people seem to want more testing and automation: toss the binary packages and rebuild all archs from scratch, run more automated tests, etc.
In the long run you're basically asking why the archive automation scripts accept incoming packages with "debian-changelog-has-wrong-day-of-week" instead of kicking those package updates back. Adding more people isn't going to fix it, and adding a large system working around the process is only somewhat helpful; really, the archive automation scripts should no more accept packages with "trailing-whitespace" than they'd accept packages with invalid GPG signatures. This will result in more processing time for updates, but that can be moved to the cloud and parallelized, and is fixable in 2020, even if simpler archive processing made engineering sense in 1995.
> You also have the long, complex and constantly changing list of rules which need to be applied for each update. They are very strict about noting down the copyright of all the files, for example (which I can understand to some extent).
Tracking licensing information very strictly makes sense.
Other than that, however, the Debian packaging process is something only a lawyer could love.
Hi, I've been working with Jelmer on the janitor and related infrastructure for over a year now and worked with him writing this blog post.
As others have pointed out, this is used for Debian packaging metadata, so upstreaming doesn't really apply. While "priority extra" might seem trivial, there are other, much more sophisticated examples of what the janitor can do - for example, applying multiarch fixes. Those fixes are a lot more complicated, so we didn't want to show their code and have to explain everything that's going on, as the blog post was already long enough. This was just to give people an idea that it's possible to make large-scale, high-quality fixes relatively easily, and how they might go about it if they also have a fix they want to propose.
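To give a sense of the simple end of the spectrum: a fixer for the deprecated "Priority: extra" value can be little more than a script that runs in the unpacked source tree, rewrites debian/control, and reports a one-line summary. The sketch below is illustrative only - not the exact code from the post or the Janitor:

    #!/usr/bin/python3
    # Illustrative sketch: replace the deprecated "Priority: extra" with
    # "Priority: optional" in debian/control. Not the Janitor's actual fixer.
    import re
    from pathlib import Path

    control = Path("debian/control")
    text = control.read_text()
    fixed = re.sub(r"^Priority: extra$", "Priority: optional", text, flags=re.M)

    if fixed != text:
        control.write_text(fixed)
        # Fixers describe what they changed on stdout, for the changelog entry.
        print("Change priority extra to priority optional.")

The multiarch fixes involve far more than a regex, which is why we left their code out of the post.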
We've been using the lintian metadata fixes as a proving ground to learn how to make safe, high quality changes at scale. Fixing one package is trivial, doing even a mediocre job of fixing 60,000 packages is much much harder. Nobody wants crappy automated junk in the repository.
There are also other easy-to-overlook, massive non-technical problems in this space, like trying to convince people that the Janitor is doing worthwhile work and that they should embrace it. Significantly more thought and effort has gone into making sure people are delighted (or at least not too annoyed) by the janitor than into the actual changes themselves. There are things like making sure you can correctly identify and preserve a large number of idiosyncratic formatting styles. This is open source software, and everyone has their own distinct opinions as to what is "best". You also don't want to overwhelm people with fixes they're not interested in. We've already got sketches of upcoming blog posts that discuss some of these issues and how the janitor handles them.
The janitor can do things other than just lintian fixes, such as multiarch improvements and merging in new upstream releases and even upstream snapshots. These aren't enabled yet, as they are still being polished up to the standards Debian demands. For some of them (like merging new upstreams) there are a lot of very difficult non-technical questions that still need answering. (How do you make sure that a compromised upstream developer submitting a backdoor into a common library doesn't have that code automatically pushed out to everyone?)
The goal of the janitor is to help Debian move forward (more) rapidly, to increase the overall quality of packages in the archive, and to free up the humans so they can concentrate on things that require human judgement (e.g. writing high-quality descriptions).
If the janitor is successful, then you shouldn't have to spend much time working on packages. All of the drudge work that takes human time should be automated, leaving only the things that require actual human discretion. This helps alleviate your "I ain't got no time for that". We're not there yet, but we're working towards it.
We have discussed being able to apply fixes further upstream. It's an even harder problem to get right, and again, most of the complexity is on the non-technical side. So far we've not tackled this, but we have deeply considered it. On the technical side, before we can do any upstream fixes, we need a reliable way of finding the upstream repositories, we need to be working on the latest upstream snapshot, we need to know how to build the package, etc. These are all problems the Janitor is already working on solving cleanly and elegantly first.
Don't worry, we've got some awesome plans for the Janitor. Lintian fixes are just the very first step.