I haven’t read it yet, but I’m lucky to call the author a friend. I read all of ...

abhijat · on April 1, 2019

Thanks, I bought the book.

Still undecided as to which language to work in, I wanted to use rust but the one part I'm unsure about is possibly where tree like data structures will need to be implemented?

I guess that's tough to do in rust? I will give it a try.

steveklabnik · on April 1, 2019

I personally like to work on one thing at a time. If you're really trying to learn about how Git works, you should, in my opinion, use a language you already know well. That way, you're focused on learning one thing, not two things at once. So I would recommend using the language that you know best.

Not everyone agrees with me, of course.

As I said, I haven't read the book yet, so I can't tell you how easy/difficult it will be in Rust. The author is learning Rust now, incidentally. But if you need a graph (which you probably will), I'd advise you to use something like https://crates.io/crates/petgraph instead of building one yourself. The difficulty is in writing graphs, not using them.

Twisol · on April 1, 2019

Graphs are harder if you try to use pointers to identify neighbors. In Rust, a pointer is not merely an index into memory; it is a statement and guarantee about how that memory will be accessed. These semantics are more than a graph structure needs.

To "get around" this (but see the next paragraph), we usually associate each node to a simple identifier such as an integer (usize), and store all nodes into a Vec. Neighbors are indexed by this identifier rather than a pointer. (Indirection? Maybe. But I'm pretty sure most processors support a base+offset mode just as easily as a direct reference mode.)

Honestly, I think this is good practice even outside of Rust. If you're using pointers, and you want to associate some extra data to a node that isn't part of its graph structure -- a label, say, or some other structure that you're modeling the relationships of -- there's no good way to add that data without modifying the definition of a node. A pointer indexes a _single_ point in memory. An abstract identifier may index any number of points in memory -- just create another table containing the new information you want to associate.

codetrotter · on April 1, 2019

> To "get around" this (but see the next paragraph), we usually associate each node to a simple identifier such as an integer (usize), and store all nodes into a Vec. [...]

> Honestly, I think this is good practice even outside of Rust.

It is, and they use this way of writing their code a lot in game dev. They call it Entity Component System, or ECS for short. The concept of ECS extends a bit beyond what you described above but fundamentally I perceive your description to fall in line with ECS.

https://en.wikipedia.org/wiki/Entity_component_system

An ECS library for Rust exists, named Specs. It might be of interest.

https://slide-rs.github.io/specs/

Twisol · on April 1, 2019

I am indeed alluding to the principles of ECS! I didn’t want to beat readers over the head with what might be perceived as “a whole new way to architect your software!!1!”, but this is the next step in that direction.

skybrian · on April 1, 2019

There are benefits but they come with downsides: you can have node ids pointing to deleted nodes, and collecting unused nodes is up to you. If you want to avoid fragmentation, you need to recycle nodes and possibly node ids as well. These are things garbage collectors handle.

It's not all that different from a database, though.

Twisol · on April 2, 2019

> It's not all that different from a database, though.

Indeed! I'm tempted to coin a variant of Greenspun's Tenth Rule: any sufficiently complicated program contains an ad-hoc, informally specified, bug-ridden, incomplete implementation of a database engine.

> There are benefits but they come with downsides

Downsides relative to what? The downsides listed are all in common with traditional pointer-based references, so I would argue that garbage collection is rather orthogonal to the question of storing indices instead of pointers. Any allocation comes out of a memory arena of some kind, be it an explicit vector of slots or the implicitly-defined standard heap. The tools for solving these problems are the same in all cases.

Certainly, Rust's references avoid all of the problems you listed. Rust pointers essentially embed the semantics of a garbage collector at compile-time [0], in the domain where ownership patterns can be strictly verified. But nodes within a graph are already a poor fit for the ownership model -- the entity that "owns" a node is really the graph itself, not its neighbors -- so that's the level of granularity at which Rust's references are useful. You need something else within the scope of the graph.

EDIT: On reflection, you might be referring to a language like Java which has an ambient global garbage collector. Indeed, using indices instead of pointers means you're on your own -- you've allocated the memory arena through the standard means, but then you take on the responsibility of managing that memory yourself. This is a fair criticism! Purely in my experience, data modeled as a loose graph of directly-related objects is a lot more difficult to understand and maintain than data modeled indirectly using some form of identifier -- mostly because of the effects I mentioned in my earlier post on associating new information to an entity.

[0] https://words.steveklabnik.com/borrow-checking-escape-analys...

skybrian · on April 2, 2019

Yes, that's what I meant. Cleaning up unreferenced nodes (should you want to do that) would require some kind of mini-garbage collection algorithm. And indeed, git has a gc command to do this, even though reference counting would work for a DAG. It's doable, but it's not what I'd call simple.

But if you know through some other means the exact time when a node should be deleted, you can delete it at that time, and anyone following a soft reference will find that it's no longer there, which may be a way of catching a bug. This is how both databases and entity component systems work. But it does mean that resolving a reference can fail, and you have to handle that somehow.

steveklabnik · on April 1, 2019

Recycling IDs is pretty easy with generational indices.

fulafel · on April 4, 2019

Does specs provide answers to these issues?

taktoa · on April 1, 2019

Trees are easy in Rust. It's graphs that are harder.

littlestymaar · on April 1, 2019

Git “trees” are graphs though, since you have merge commits.

FullyFunctional · on April 2, 2019

More to the point, they are DAGs. Git objects cannot form a loop, thus they are acyclic. </pedantic>

littlestymaar · on April 3, 2019

That's absolutely correct. And your emphasis on this is on point, since DAG are relatively easy to implement in Rust: You just need to use ref-counting (instead of plain `Box` for a Tree) whereas general purpose graphs are harder to implement (and can lead to run-time bugs if node deletions aren't managed properly).