
What I like about git is that it stores only the minimum amount of information, and this makes it easy to explain. A commit hash is a hash of canonical information, not of derived information.

It seems really ugly to store derived information in a commit (specifically, that the hash would be altered by it).

It seems that Jeff has said the same thing, but Linus disagrees. Vocally.

http://www.spinics.net/lists/git/msg161336.html




From my understanding, they're essentially adding this as an additional bit of information that's minimally required. The timestamps currently in use are error-prone and will be replaced by generation numbers, which are more robust. They're still adhering to the principle of storing only the minimum amount of information; they're just adding generation numbers to that set.

In fact, you could make the argument that timestamps are the derived information that git has been storing all along while generation numbers are the canonical information which should have been stored from the beginning. Generation numbers are a result of the state of the tree, while timestamps are derived from the ambient (and potentially incorrect!) environment from which the commit was made.
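To make the clock problem concrete, here is a small sketch in Python. The commit names, timestamps and the one-day skew are all made up for illustration, and the generation values are written in by hand (roots numbered from 0, which is my assumption, not necessarily what the git proposal uses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    name: str
    timestamp: int   # seconds since epoch, as reported by the committer's clock
    generation: int  # longest distance to a root, implied by the parent pointers


# Hypothetical linear history: root -> fix -> release.
# "fix" was committed on a machine whose clock was roughly a day behind.
commits = [
    Commit("root", 1_000_000, 0),
    Commit("fix", 1_000_000 - 86_400 + 600, 1),
    Commit("release", 1_001_200, 2),
]

# Ordering by timestamp puts the child "fix" before its own parent "root".
by_time = [c.name for c in sorted(commits, key=lambda c: c.timestamp)]

# Ordering by generation number cannot do that: a child is always one
# more than its highest-numbered parent, whatever any clock said.
by_generation = [c.name for c in sorted(commits, key=lambda c: c.generation)]

print(by_time)
print(by_generation)
```

The timestamp ordering depends on a value that came from outside the repository; the generation ordering depends only on the parent pointers.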


Well, generation numbers can be determined by counting up through parent commits. So they are derived information, it's just that that takes ages and lots of disk seeks to count through.

Timestamps aren't really needed. They are information that is useful to us and that we want to store, just like the date in an email. Thus they are exactly as required as the names of the author and committer.

The reason for the discussion about the commit timestamps is (AIUI) a heuristic optimisation that works because they happen to be there and happen to (most of the time) be in order.


> Well, generation numbers can be determined by counting up through parent commits. So they are derived information, it's just that that takes ages and lots of disk seeks to count through.

> Timestamps aren't really needed.

First off, timestamps are needed. They're used to order commits in the history. Generation numbers do the same thing, but a bit more elegantly, because they avoid most of the potential clock issues in a distributed environment.

You're making the assumption that the set of derived data and the set containing the absolute minimum amount of data git needs to work are mutually exclusive sets. They're not, especially if the derived data is computationally expensive to get, and is still used for normal operation.

One of git's primary goals is fast, scalable performance. Commit generation numbers help reduce potential errors with the current timestamp approach. However, they're expensive to calculate, and don't scale well at all. Linus' argument is that instead of calculating them every time, it's far simpler to just add them in and be done with it.


By that definition of "derived information", the hash is "derived information" too, since it's based on the changes made to the source data (whatever that data may be).

That said, point taken about the necessity of both generation numbers and timestamps. But that invalidates the OP's comment about git storing "only the minimum amount of information". It sounds like that's never been a hard principle.


git does store "only the minimum amount of information".

Here's what Linus had to say about it:

> Generation numbers are _completely_ redundant with the actual structure of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole history" to that statement.

And we've never parsed the whole history, because it's just too expensive and doesn't scale. So right now we depend on commit dates with a few hacks.

So no, generation numbers are not at all redundant. They are fundamental. It's why we had this discussion six years ago.

From: http://www.spinics.net/lists/git/msg161348.html


Thanks for the background! (Seriously, not trying to be snarky.)

That info does support my original point that generation numbers probably should have been stored from the start, and that timestamps are the more "derivative" bit of information, since they come from the environment and not from the data itself.

Thus, rlpb's concern that storing generation numbers pollutes git's design of storing "only the minimum amount of information" isn't necessarily well founded, since the generation number might be more minimal and more correct than the current timestamp. That was the aim of my original post: generation info is fundamental, not extraneous derived info, and probably should have been stored with commits in the first place.


>It seems really ugly to store derived information in a commit

I don't understand how generation numbers are derived information. They are used to find the position of one commit in relation to another. That makes them information that is essential to the commit. The problem was that, to get around them not being there, timestamps were compared instead, and that is not reliable for obvious reasons. So I really don't understand why anyone would complain about this.


They're derived. You can tell that they're derived information by the fact that you can compute them for old commits, long after commit time, which is exactly what part of this proposal is to do. You can derive them simply by counting the maximum number of steps between a commit and any of its roots. The essential information isn't the generation numbers; it's the structure of the commit history -- the actual chains of commits, with all of the branches and merges. Generation numbers are just an artifact of counting.
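As a rough sketch of that counting (in Python, not git's actual code): given only the parent pointers, you can fill in a generation number for every commit after the fact. The `parents` mapping and the commit names are hypothetical, and numbering roots from 0 is my assumption; the proposal may pick a different base.

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each commit to the length of its longest path down to a root.

    `parents` maps a commit id to the ids of its parents (empty for a root).
    The history is assumed to be acyclic, as a commit graph is.
    """
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]  # explicit stack, so deep histories don't hit recursion limits
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Parents must be numbered before the child can be.
                stack.extend(pending)
            else:
                gen[commit] = max((gen[p] + 1 for p in parents[commit]), default=0)
                stack.pop()
    return gen


# Hypothetical history: a <- b <- d <- m, with a side branch a <- c merged into m.
history = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["b"],
    "m": ["d", "c"],
}
gens = generation_numbers(history)
print(gens)
```

Note that this has to touch every commit reachable from the ones you ask about, which is the "parse the whole history" cost people are objecting to.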

On the other hand, this information is very handy, once you have it, for certain algorithms, and it could be expensive to re-compute all the time, which is why the proposal is to generate and store them explicitly. (This is also the reason that timestamps have been used before, even if they were a bit of a hack: they're readily available, and way faster than recomputing generation numbers all the time.)
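One example of such an algorithm, sketched in Python under my own assumptions (this is not how git implements it): an "is X an ancestor of Y?" walk can stop descending as soon as it reaches a commit whose generation number is below X's, because every ancestor of that commit has an even smaller number. The `parents` and `gen` tables are hypothetical, with roots numbered from 0.

```python
from typing import Dict, List, Tuple


def is_ancestor(
    ancestor: str,
    descendant: str,
    parents: Dict[str, List[str]],
    gen: Dict[str, int],
) -> Tuple[bool, int]:
    """Return (reachable, commits_examined) for an ancestry query.

    Any commit whose generation number is below the target's is skipped,
    since nothing beneath it can be the target.
    """
    target = gen[ancestor]
    seen = set()
    stack = [descendant]
    examined = 0
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if gen[commit] < target:
            continue  # pruned: too close to the roots to contain the target
        examined += 1
        if commit == ancestor:
            return True, examined
        stack.extend(parents[commit])
    return False, examined


# Hypothetical history: a <- b <- c <- d, plus a branch c <- x.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "x": ["c"]}
gen = {"a": 0, "b": 1, "c": 2, "d": 3, "x": 3}

print(is_ancestor("x", "d", parents, gen))
print(is_ancestor("b", "d", parents, gen))
```

In the first query the walk examines only `d` and gives up at `c`; without the stored numbers it would have to walk all the way back to the root before it could answer "no".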



