I love to see people reimplementing existing tools on their own, because I find that to be a great way to learn more about those tools. I started on a Git implementation in Rust as well, though I haven't worked on it in a while: https://github.com/avik-das/gitters
Just reading source code is a whole different experience than actually implementing something yourself. Implementing something pretty much forces you to "be right" in your understanding, while reading can be anything from "really studying" to skimming.
Completely agree, though sometimes you HAVE to go through the existing source when you get something wrong. I've implemented libraries based on specs and white papers, and even then there's vagueness, and holes that only show up in practice.
Love to see things like this all the same, as they tend to solidify protocols/specifications. That can, of course, cut both ways.
I haven't even looked at the git source code. My implementation was based mostly on the official docs. The docs include a section on the internals: https://git-scm.com/book/en/v1/Git-Internals
I haven’t read it yet, but I’m lucky to call the author a friend. I read all of his tweets about it while he was building it. I expect it to be of extremely high quality. I’ve got a lot of respect for his abilities.
The only reason I haven’t read it yet is that I think it deserves a lot of attention and I haven’t made time yet.
Still undecided as to which language to work in. I wanted to use Rust, but the one part I'm unsure about is where tree-like data structures will need to be implemented.
I guess that's tough to do in Rust? I'll give it a try.
I personally like to work on one thing at a time. If you're really trying to learn about how Git works, you should, in my opinion, use a language you already know well. That way, you're focused on learning one thing, not two things at once. So I would recommend using the language that you know best.
Not everyone agrees with me, of course.
As I said, I haven't read the book yet, so I can't tell you how easy/difficult it will be in Rust. The author is learning Rust now, incidentally. But if you need a graph (which you probably will), I'd advise you to use something like https://crates.io/crates/petgraph instead of building one yourself. The difficulty is in writing graphs, not using them.
Graphs are harder if you try to use pointers to identify neighbors. In Rust, a pointer is not merely an index into memory; it is a statement and guarantee about how that memory will be accessed. These semantics are more than a graph structure needs.
To "get around" this (but see the next paragraph), we usually associate each node to a simple identifier such as an integer (usize), and store all nodes into a Vec. Neighbors are indexed by this identifier rather than a pointer. (Indirection? Maybe. But I'm pretty sure most processors support a base+offset mode just as easily as a direct reference mode.)
Honestly, I think this is good practice even outside of Rust. If you're using pointers, and you want to associate some extra data to a node that isn't part of its graph structure -- a label, say, or some other structure that you're modeling the relationships of -- there's no good way to add that data without modifying the definition of a node. A pointer indexes a _single_ point in memory. An abstract identifier may index any number of points in memory -- just create another table containing the new information you want to associate.
> To "get around" this (but see the next paragraph), we usually associate each node to a simple identifier such as an integer (usize), and store all nodes into a Vec. [...]
> Honestly, I think this is good practice even outside of Rust.
It is, and this way of structuring code is used a lot in game dev, where it's called the Entity Component System pattern, or ECS for short. ECS extends a bit beyond what you described above, but fundamentally your description falls in line with it.
I am indeed alluding to the principles of ECS! I didn’t want to beat readers over the head with what might be perceived as “a whole new way to architect your software!!1!”, but this is the next step in that direction.
There are benefits, but they come with downsides: node ids can point to deleted nodes, and collecting unused nodes is up to you. If you want to avoid fragmentation, you need to recycle nodes, and possibly node ids as well. These are things a garbage collector handles for you.
It's not all that different from a database, though.
> It's not all that different from a database, though.
Indeed! I'm tempted to coin a variant of Greenspun's Tenth Rule: any sufficiently complicated program contains an ad-hoc, informally specified, bug-ridden, incomplete implementation of a database engine.
> There are benefits but they come with downsides
Downsides relative to what? The downsides listed are all in common with traditional pointer-based references, so I would argue that garbage collection is rather orthogonal to the question of storing indices instead of pointers. Any allocation comes out of a memory arena of some kind, be it an explicit vector of slots or the implicitly-defined standard heap. The tools for solving these problems are the same in all cases.
Certainly, Rust's references avoid all of the problems you listed. Rust pointers essentially embed the semantics of a garbage collector at compile-time [0], in the domain where ownership patterns can be strictly verified. But nodes within a graph are already a poor fit for the ownership model -- the entity that "owns" a node is really the graph itself, not its neighbors -- so that's the level of granularity at which Rust's references are useful. You need something else within the scope of the graph.
EDIT: On reflection, you might be referring to a language like Java which has an ambient global garbage collector. Indeed, using indices instead of pointers means you're on your own -- you've allocated the memory arena through the standard means, but then you take on the responsibility of managing that memory yourself. This is a fair criticism! Purely in my experience, data modeled as a loose graph of directly-related objects is a lot more difficult to understand and maintain than data modeled indirectly using some form of identifier -- mostly because of the effects I mentioned in my earlier post on associating new information to an entity.
Yes, that's what I meant. Cleaning up unreferenced nodes (should you want to do that) would require some kind of mini-garbage collection algorithm. And indeed, git has a gc command to do this, even though reference counting would work for a DAG. It's doable, but it's not what I'd call simple.
But if you know through some other means the exact time when a node should be deleted, you can delete it at that time, and anyone following a soft reference will find that it's no longer there, which may be a way of catching a bug. This is how both databases and entity component systems work. But it does mean that resolving a reference can fail, and you have to handle that somehow.
That's absolutely correct. And your emphasis on this is on point, since DAGs are relatively easy to implement in Rust: you just need reference counting (instead of the plain `Box` you'd use for a tree), whereas general-purpose graphs are harder to implement (and can lead to run-time bugs if node deletions aren't managed properly).
It's pretty good, I'd recommend it. I haven't finished the book yet, but what I've read thus far has been good.
Something I wish I'd known (although it wouldn't have changed my decision to purchase) is that it's not an exploration-style book (i.e. "Let's cat this file and find out what it contains and why"); it's more of an explanation (i.e. "When I cat this file, it outputs XYZ, which means ABC, which I know from my research of the git source"). So the author isn't taking you along on their research, but rather coming back to you after the research is done to explain their findings from the ground up.
This means early chapters have a lot of, "You'll just have to trust me XYZ means ABC." But this is also understandable given the complexity of git; there isn't really a square one.
I also would have preferred the author use something like Python instead of Ruby for the reference implementation. IMO Python is a little more ubiquitous and easier to install/set up than Ruby. Ruby also leaves Windows devs at a disadvantage. But that's just me being pedantic.
>It's so much fun, but not that practical for scalable websites.
A Git-based KV store has a somewhat different purpose than regular KV storage. It's intended for communication between entities running in parallel, a sort of transactional memory.
It's not intended for storing users' data.
Not sure, but the idea is that you could not only read and write, but write in parallel, with keys merged according to the merge rule you've provided.
Strongly recommend adopting a standard FOSS license before lots of people add commits and it becomes a big mess to clear up the licensing situation later.
Also, not having a license file isn't a messy situation; it means "this project is protected under Berne Convention copyright": the author is the sole holder of all rights to the code, and any use that is not explicitly allowed is a copyright infringement (unless it's fair use).
The author doesn't hold copyright on code other people submitted unless they explicitly give ownership via a CLA or similar. A licence would make it clearer, or at least a lot easier for other people to consume.
Exactly, but since it's a learning project, it's not obvious it's supposed to receive and accept code contributions anyway (so far it hasn't received any; the only two commits are fixes to typos in the README).
That being said, it would be nice of the author to put the code under a permissive license to allow other people to play with his code too (at the moment, even forking it is a copyright infringement…).
Yes, you're right, GitHub forking is allowed. Pulling the code onto your computer, changing it, and contributing changes to your fork is still illegal, though.
"Any User-Generated Content you post publicly, including issues, comments, and contributions to other Users' repositories, may be viewed by others. By setting your repositories to be viewed publicly, you agree to allow others to view and 'fork' your repositories (this means that others may make their own copies of Content from your repositories in repositories they control)."
(Crucially, it doesn't require an open-source license, though.)
Second: even without that, there's such a thing as an implied license:
Pretty much all the TOS says is there's an implicit reproduction license (other users can see & fork the work) and possibly broadcast (the fork itself has the visibility of the original). Not adaptation, not use, not exploitation, …
And that license grant is solely through github as a service, it's unclear that a local clone is even permitted.
> And that license grant is solely through github as a service, it's unclear that a local clone is even permitted.
That license grant was added specifically to keep GitHub itself legally watertight (AIUI), so it makes sense that it doesn't extend to users' rights. Look, but don't touch.
That's... somewhat true. My main objection is to "the author is the only one holding every rights on the code and every use that is not explicitly allowed is a copyright infringement (unless it's fair use)".
The ToS doesn't say there's an implicit reproduction license, though; it says there's an explicit reproduction license.
The other licenses can still be argued to be implicit. For instance, you have a decent argument that local clones are an implicit license – GitHub provides a "Clone or download" button directly on the repo page, and it's one of the main use cases of GitHub. (Other arguments exist.)
Thanks for the clarification on the "implicit license" point; I glossed over that a bit quickly. I should definitely have said "every use that is not explicitly or implicitly allowed is a copyright infringement".
Your first point doesn't really add much, though, since it falls under the "explicitly allowed" part of my comment.
Overall, my whole point still stands: if anyone went on GitHub, downloaded the project, and did anything with it that went beyond fair use, that would be a copyright infringement, because neither the author nor GitHub granted them any permission to do so.
It isn't called that and doesn't offer the same amount of protection everywhere, but the Berne Convention itself includes copyright exceptions [1] that I lumped under the broad "fair use" phrase.
Not the OP, but it seems your linked bit of the Berne Convention just says that countries are allowed to legislate exceptions. The only hardcoded exception is the short-citation one.
I guess it depends on what you call "use". You can read the code, for sure, and probably even save it to your disk (it may depend on your jurisdiction, though).
But can you compile it? I'm not sure… better ask your lawyer. And what about running the compiled binary? I don't think you're allowed to do that.
That's how copyleft licenses work. Base copyright law is more restrictive and possessing an unlicensed copy can be infringing in certain countries. This is why public domain is sometimes problematic.
> No, you copied on your hard drive something that was offered to you for free on github.
It wasn't offered for free local reproduction, since that right was not explicitly granted, and GitHub's license grant doesn't grant it either (as far as my reading goes). Though the country you're in may have a private-copy exception, in which case I think you'd be in the clear (depending on the specifics of that exception).
That's not how copyright law works. The particulars vary by country, but the share-everything internet culture doesn't automatically grant permission to make copies.
Since you sound knowledgeable: if I go and, without any contract, "donate" some of my code to the repo, what becomes of the rights to that code (the patch I contributed)?
You retain the copyright of the patch and can re-use it somewhere else under a different licence.
And in turn, the project you submitted it to cannot re-license that patched section of code (e.g. make it GPL-licensed) without your permission, as it does not belong to them.
One issue is under the Berne baseline, even given github's license grant[0], there is no license to make adaptation, arrangement or derivative work. So it's unclear that the patch would even be legal to start with, in the sense that it's either a modification / adaptation of the work or a derivative of it.
Yes, the patch would be illegal, but the original author still wouldn't have all the rights to it. And it could also be illegal if the original author distributed the amended code elsewhere.
Edit: I edited my comment to say "could" instead of "would" because the original author could argue that the author of the patch implicitly gave him the right to redistribute the patch by contributing it to a public repository. I'm not sure it would stand in court, but I'd say it would have a non-zero chance of success.
But if the author, who initially claimed the project was a learning project, decided to use it commercially, he clearly wouldn't be allowed to use the patch (and again, it could be different if the patch author willingly contributed to a commercial product).
Git derives patches from files on disk, so a patch inherits the file's license. Logically you commit files, or even the whole source tree; a patch is an optimization detail used to rebuild the source tree at a given point, so it's a matter of what license the source tree has at that point.
This isn't really relevant to the legal status of a change (i.e. patch) to a proprietary work where the source code is public.
The answer is that making the change is already usually copyright infringement (though I think some countries have a concept of private copies being exempt from these types of restrictions). But redistribution of your patch would definitely be copyright infringement because a license to create a derived work was not given to you -- and patches are by definition derived works.
Actually, any high-performance GC'd language would be fine too, because latency is a non-issue for long-running git operations (you won't notice if your git clone pauses for 100ms, whereas you will notice if your UI does). Throughput of malloc() and GC'd languages tends to be similar when latency isn't a concern.
Performance for higher-level languages is usually great insofar as you're able to essentially write C code in that higher-level language. When the language's limitations keep you from writing the C code you want to write, performance usually suffers. In Java's case, the lack of value types and stack allocation can be a major performance hindrance. Boxing is also a problem, although, as the mailing list post notes, it's easily overcome via manual specialization.
I was speaking more to stability. Rust is designed to be an incredibly safe language without sacrificing any performance; that seems like a good match for a version-control system.
IIRC Rust's safety is provided by affine types; all languages with affine or linear types can provide the same guarantees. Clean and Mercury both come to mind off the top of my head (IIRC Clean had "Concurrent" in its name at one point), and I think there are both Haskell and F# variants with either affine or linear types.
In addition there are many other solutions to safe parallelism and/or concurrency, some of which don't require a type system at all; Erlang is famous for safe concurrency and is dynamically typed.
Lastly, there's good old fashioned multiprocessing which can be safe just by not sharing memory.
There is no single feature that is new in Rust, but it has a relatively unique set of features in the non-GC language world; ATS is the only other one coming to mind, though I'm sure there are some niche ones.
I love this combination in Rust because latency-sensitive operations are notoriously hard to achieve in GC'd languages. Lisp was able to be an operating system because nobody needed to run Quake at 100fps on a Lisp machine. With GC you can pick latency or throughput, but you can't reliably get both without coding around the GC.
This does mean that, when I consider things Rust is particularly good at, latency-sensitive applications stand out; that's not to say it's bad at non-latency-sensitive applications, just that one has a lot more choices when latency is a non-issue.
To be honest, I wish there were a movement for something like a "literate changelog", where the changelog was properly linked with something like a blog post and with specific repo versions. I guess pull requests sort of take that role… but sometimes PR diffs aren't the same as annotated code in a blog post.
Can somebody tell me what's behind the recent rewrite-all-the-things-in-Rust craze? I get that it can have some benefits in terms of security, but rewriting so many things just for the sake of it seems a bit excessive.
I understand some of these are very likely for educational purposes (like this one and others; it's good for getting more familiar with the language), but it still seems to be a bit of a strange trend (especially since people who don't need to learn are doing it, seemingly just because "yay rust").
I support RIIR 100%. It's much easier for me to contribute to Rust projects. It's not just the language (though that's definitely part of it). I don't need to worry about shit like "how do I build this project" or "my code might run on platforms x and y, but I'm not sure about platform z". With C/C++, the build process can be very complicated: missing headers, missing binaries. With Rust, I just run `cargo build`.
Furthermore, the resulting code just feels sturdy. I can also expose it through a C interface so it can be used from the likes of Python.
Yeah, I just spent a couple of hours configuring CMake when I was porting a Windows app to Linux, and it was a pain tracking down all of the dependencies. I had a similar app in Rust that worked with a `cargo build` out of the box (actually, my friend rewrote my Rust version in C++ because I was lax in updating it).
Say what you want about Rust vs. C/C++, but you can't tell me that Rust's build process isn't easy. In fact, building for other platforms is pretty trivial: just `rustup target add <target>` and you can target pretty much any common platform and many uncommon ones, and those targets get updated with everything else. It's so nice that I have to convince myself not to use it as the way to distribute CLI apps (`cargo install <tool>`).
Actually, there are a lot of great reasons to rewrite everything in every language. Git is an especially good piece of software to implement everywhere because it's relatively stable and it's pretty useful.
As for actual reasons, one good example is so you can keep your dependencies in the language, using the language package manager. For Go nobody even questions that this is worth it; it enables painless cross compiling and completely static, libc-free binaries. For Rust that may not be a thing, but you do at least get the benefits that you could integrate Git functionality without having to hack around in porcelain.
This one here is a learning experience by its own description, but I would suggest people stop complaining about "rewriting everything" in $LANGUAGE. The opposite complaint is often cited as a reason not to use a language (that, for example, basic programs haven't already been ported). If we did build an alternate world with feature parity, unit testing, and optimizations in a memory-safe language, I doubt many people would still be complaining about the strange trend of rewriting things.
That said, even when using those bindings, I've had to drop down to wrapping the porcelain sometimes. I'd love to have a native rust git implementation. It'd be easier to hack on, and to abuse for the type of git interactions I've written.
And it'd be one less external dependency to worry about when cross-compiling. I love how easy it is to install things from source in Rust, and it's pretty easy to add a flag to make the program use your CPU's special instruction set to make it even faster.
That's a good point; you're right that it's helpful to be able to interface with such a ubiquitous program natively in a language of choice. I did see a mention on the page of deploying as a crate and using in another program; that seems very convenient.
> stop complaining
Not a complaint; more a question as to why there's a specific move around rust. I appreciate the reply; that's exactly the kind of response I was looking for.
In addition to what has already been said, I think there's a lot of interest in reimplementing command-line tools in Rust because there are a lot of useful support libraries for TUI development. Because of the lower barrier to entry (no fussing with build/linker issues to use convenience libraries), you're seeing an increase in people trying things just to try them.
For me, dealing with something written in Rust is less painful than dealing with something with bindings for Rust, and honestly I'd consider rewriting something for that reason. And that's the case for nearly any language.
For example, I rewrote `tar` in JavaScript because I wanted to use it in the browser, and I didn't want to fiddle with trying to compile the existing project to JS. It took me a weekend and it worked pretty well for the project I needed it for. That project has since died (completely redesigned), but the tar stuff still works.
These days I'm getting lazy, so for something like git, I'll often exec out to the CLI app instead of fiddling with bindings, especially if it's not a performance-critical part of my app. However, I'd definitely look for a rewrite first and clean bindings second to use as a lib, especially if the rewrite had a suitable license (anything not copyleft).
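For what it's worth, a minimal sketch of that exec-out approach (assuming `git` is on the PATH and the process runs inside a repository):

```rust
use std::process::Command;

// Shell out to the git CLI instead of linking against bindings.
fn current_branch() -> Option<String> {
    let output = Command::new("git")
        .args(["rev-parse", "--abbrev-ref", "HEAD"])
        .output()
        .ok()?; // git not installed, or couldn't be spawned

    if output.status.success() {
        Some(String::from_utf8_lossy(&output.stdout).trim().to_string())
    } else {
        None // e.g. not inside a git repository
    }
}

fn main() {
    match current_branch() {
        Some(branch) => println!("on branch {}", branch),
        None => println!("not in a git repository"),
    }
}
```

The obvious trade-offs: you pay a process spawn per call and have to parse text output, but you inherit the CLI's battle-tested behavior for free.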
I would like to see some sources like this that are language agnostic that give you the tools needed to implement your own popular tool. For example, where could I look to find a written description of the way git works from the ground up? Kind of like a "guide to implementing X" type of thing, but without code.
Rust is pretty darn portable as it is, though. It runs on all the major platforms. Are you thinking it's assembly or something? Rust has to run everywhere Firefox is compiled for, which is a lot of platforms.
Maybe it won't run on your toaster, but if you're making git commits from your toaster, you've got other issues.
I agree with the sentiment behind your comment. Rust is very portable, but it isn't portable to all platforms. Obviously it works on the big ones like Windows, macOS, most variants of Linux, BSD, etc., but it doesn't on Alpine Linux. IIRC there was an issue compiling it without glibc (which Alpine lacks).
Edit - apparently rustc can now be linked to musl instead of glibc in nightly. Cool!
Compiling most Rust programs with musl is fine, and available on stable. But compiling the Rust compiler itself with musl had some issues; these were worked out very recently, so it hasn't totally ridden the release trains to stable yet. https://github.com/rust-lang/rust/issues/59302
Eh, why not? Seems like a chicken-and-egg thing. If there's no Rust support for some platform, fewer people will want to write things in it because they won't be able to run them where they want. But the less Rust software there is, the less interest anyone would have in writing build tools for less popular platforms.
Break out of these patterns by writing useful software in it that people would like to have on less popular platforms, so more people feel motivated to build and maintain build tools for it.
If you really needed to run Rust code on some platform LLVM doesn’t natively support, it does allow you to compile to C. Presumably you could then compile that using whatever C compiler works for your platform.
Although Rust is not as portable as C, going through these hoops would mean that, modulo codegen bugs, the generated C code should still be as memory-safe as the original Rust code.
You can run Rust code on microcontrollers where 32 MB of RAM is far beyond the amount available. You won't be able to compile on that platform, but you can certainly target it (though git might not fit).