My perspective is that writing this kind of system in a language such as Python is actually a great thing because for myself Python is more widely readable and approachable compared to C++ or C which is what databases are often programmed in. If someone wants to be serious they can port it to a low level language . As it stands it's educational and useful for studying.
I wrote a distributed pseudo multi model SQL/graph Cypher/Document and dyanmodb style database in Python https://GitHub.com/samsquire/hash-db for the same goal of learning how database engines could work in a distributed way.
This is why you see a community of pure Java RDBMSs (Hypersonic, H2, Derby, etc), if you don't need the big iron scale it's easier to ship/use the database or even embed it in memory if needed.
H2 has quite heavy use in production too. For example Apache Ignite uses it as the core of it's distributed SQL engine. Ignite takes a distributed SQL query and splits it up into single node queries, which are executed on each node by H2, and then Ignite merges the results back.
H2 is pretty solid. I used to ship a product with it as a default database. Customers could optionally use MySQL, MSSQL or Oracle, but for small businesses H2 was more than enough.
Idk. Python is just as bad as C / C++, with the downside that you cannot do much of the interesting stuff one would need to if they wanted to learn how to make databases, if they use Python.
Both C and Python are very "approachable" if you ignore the bad language design, inconsistencies and other "gotchas" and only take the "easy" parts, disregarding edge cases. However, with C you could at least have a fighting chance to learn how to do things right, but with Python you'll never even know what the real thing is like.
The real problems you will have to solve when working with a database anyone would want to use are, for example:
* Memory layout of the buffers used to write / read / cache this data. In Python, you don't even have a concept of memory alignment / layout -- it's all happening somewhere in the interpreter.
* How to best service multiple requests concurrently. Again, Python offers nothing here, and nothing to experiment with. But this is a huge part of working with databases. The whole two-stage commit, transaction, consistency guarantees -- it's the central point of databases, but Python gives you no tools to even try anything like that.
* Working with various underlying storage... most of it has no Python interface (but does have C interface). Eg. if you want to understand how to optimize storing data by using some filesystem -- those filesystems do often expose similar interfaces that go beyond VFS, but they won't be immediately available to Python.
* Working with vectorization of queries -- again, Python doesn't have a concept of ISA, doesn't have any way to instruct the code to utilize any particular CPU instructions... but this is where a lot of work is done by people who work on real databases. Not being able to get any meaningful access to query optimization, planning becomes pointless / useless -- what are your concerns going to be when you write a query planner if you still have no idea how it's going to be executed?
* Similar stuff goes for networking -- whenever you need to solve something that goes beyond the absolute trivial you will at best rely on Python bindings to some library that actually does that rather than on Python code proper. In other words, if your goal was to learn how to do it, you will not achieve that goal because the actual work will happen elsewhere.
So... I don't know... there is no way you can really learn how to make databases in Python. You can probably learn something, depending on what's your baseline, but you will not be ready to make a real thing, if all you have is Python. It's a difference between toy doctor set and learning to be a surgeon...
My learning style is piecemeal: I do a bit of that, a bit of another thing with the long term goal to combine everything I did into one project. It's proof of concept or prototype approach.
The experts are doing databases in C and C++ and Rust. But I'm not a C, Rust or C++ expert, so I need to start somewhere, where I am today.
I start small accomplishable goals to get the idea of the problem solution so I'm not distracted by boilerplate C, C++ or Rust. My multiversion concurrency control is in Java.
You might think all the things are easy but they're not easy to everyone. We have to start somewhere and one way to start is to write the parser in a simple language so you're not wrestling with memory management.
If I tried to do all the things you mentioned in C++, Rust or C it would be too much work in one step. I need to start small to have an achievable result.
Guess to get back to the post. He did it for learning, not to be an exact duplicate or competitor to replace other RDBMS.
Often when learning, you do not implement every difficult edge case, the most complex. You are just trying to get the jist of it. You want a smaller problem to solve.
Maybe as a learning project, it isn't important to have concurrency, networking, memory management. Unless, any of those things happens to be of interest to learn about also, then add them back in.
I don't think this is trying to be an argument for Python as good to build an RDB in. (of course it isn't for all the reasons you list)
Python just happens to be an easy language for beginners, so why not build a basic RDB to learn about that too.
Because, especially for learning, you need tools adequate for the task... Masters can be more flexible and use inferior tools and still be successful; students need every bit of help they can get, and they need the tools that aren't going to betray them every step of the way.
Giving someone Python to make a database, is like giving a student in a culinary school a dull knife: it's hard to do it with the right tools, but it becomes mission impossible when you are also crippled by your tools.
It's the same analogy I used before: using toy "doctor set" vs. learning to be a surgeon. There's no path that will bring you from using a toy set to be a surgeon. It serves a different purpose: entertainment / roleplay. You don't mean to roleplay as a programmer by using Python, right?
You need tools adequate to the task, but the task isn't necessarily what other people think it is. In this case, I'd bet the tasks are to learn what's going on inside a database and become better at programming in Python, not to write a high-performance production-ready database implementation.
Cool stuff. I had similar intuitions- Python allow me to focus on the high-level concepts. Albeit, there were times where I wished I had gone with a statically-typed + compiled language.
DSL in a string? Is that 'really nice'? I haven't used or needed this in Python that I can think of, but surely we can do better than that?
Even a dict with expected keys and construction via the bitwise or operator (which would roughly match the form of a lot of the grammar) would be better wouldn't it? Imports could be imports, just mixed in somehow.
This is just first thoughts at a glance, maybe I'm missing something.
Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...
The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.
It’s an existing, standard, language for describing grammars. Quite clearly the trade-off being made here. Seemingly not the one you would’ve made. Don’t act like it’s objectively bad.
in a string. I didn't say the DSL was novel. I learnt BNF too. (It does have extras that are either unique to it or standard beyond my familiarity though.)
I was always surprised at how the Ruby community talks about DSLs when using Ruby classes and catch all methods, overloading, etc. Nothing bad with it, but I feel it is not really a language, just very dynamic Ruby code.
A DSL that is a string, while having some downsides, does not have the arbitrary limitations of the "host" language. Either approach has its pros an cons.
I'm not that versed in Ruby, but I think similarly say the Django ORM in Python borders on DSL - especially with use of bitwise operators to build a query.
It's the tooling aspect that's my gripe with it in a string really, it makes it less likely (and more editor-specific) that I can have syntax highlighting, LSP, etc. That's incidentally the only way in which I don't prefer SQL to Django ORM - i.e. it really isn't that it's a DSL (SQL) that bothers me, it's the string.
But hopefully obviously it's not a big deal, I was just surprised at the look of it when I saw it described as 'really nice' and that the project describes itself as 'focussed on ergonomics'. It just doesn't seem brilliant to me, fine perhaps, par for the course apparently, but not remarkable.
Edit: I see what you mean. I surveyed a bunch a parser generator libraries, and they also seemed to use a text based DSL- rather than DSL based on python structures. What you're describing would have made the grammar development more ergonomic and simple.
I realise that (with some additions, certainly - it has imports) I just don't think embedding it in a string is 'really nice', personally. The 'developer experience' will be crap or reliant on hyper-specific tooling; so likely crap. (The RHS of that 'or' will be too niche to be good and well-maintained, probably.)
Not to be coy or disrespectful( and this is excellent work and a way to learn new things) but if one is generating a parser as a means to an end(the end being execution of the AST on the database), what are the learnings one gets from just the parser bit? Is it some kind of optimization you need to keep doing to the generated parser to make it more efficient?
Would a logical next step be Generate an optimal query plan from the AST(somehow..)?
That's said, I'm curious how complete this LearnDB is. SQLite is hard to read not only it's old but also it covers a lot of SQL and following SQL spec makes hings complicated. SQLite has great test suite so it's nice if you run the suit against this implementation.
This is fantastic, and seems like a great way for someone (me) to learn DS&A better. I can explain how a B+tree works, but if you asked me to code it, I’d freeze up.
I love databases and Python, so this was really interesting to walk through. Thanks for the post.
Most definitely. The b-tree implementation was the first motivation for starting the project. Especially, all the details around node rebalancing and splitting. And the fact that it was an on-disk structure, added another wrinkle to the thinking about the impl
It doesn't have a notion of atomically batching multiple statements, i.e. transaction. But beyond that, it's a single file database,
which can only have a single process (learndb instance) that is operating on the database (file). So you get consistency and isolation via being a single connection database. Durability, you get to the extent that the file system is durable. So it's somewhere on the ACIDity spectrum.
Re: Query planning/optimization
I haven't implemented this; but I've considered where the optimization could module sit:
The parser spits out an AST. This or a derived intermediate representation could be optimized,i.e. the AST could be rewritten or nodes deleted,
before the VM executes the AST.
I was surprised to find out MicroPython had a database module based on Berkley DB in its standard library. It uses Berkley DB 1.x, which is the version of Berkley DB with a permissive license.
I am curious about the benefits and limitations of using Python in this project as opposed to C++. How well is LearnDB able to support low-level concurrency control and storage management?
I began using python as a way to "mock" out the overall design; intending to re-implement it in rust. The main reasoning for using python: was the ability to focus on "high-level" concepts and speed of tinkering.
This implements a single process, single thread, single connection database- so performance and low-level concurrency control were not explicit goals or really optimized for. For those (real-live concerns) rust or C++ are much better; but also come with their set of complexities.
Nice! it is good to see past blind spots e.g. "you must write this kind of app in this kind of language, pure Python is high level it is only for glue". I wouldn't have thought to do a project like this because of all of those kinds of heuristics. And as other have said, a hackable DB in the same language as the app can have a lot of advantages.
Perhaps a more accurate claim would be "SQLite inspired". Calling it a clone is misleading.
Mad props to the author. Many Python programmers never had proper training in computer science, so it is encouraging to see people filling in the gaps of their knowledge.
I think it doesn't even get close to being a criticism, and it's certainly unclear if the goal is to literally clone SQLite or to implement SQLite-ish. This is a fair question.
Just trying to encourage clear and accurate communication. I agree with your first sentence. The only thing unfair is the author claiming it is a SQLite clone. It isn't, as we both seem to agree. It is a form of cheating.
Fair. It’s “inspired by”, not a “clone”. Frankly, I don’t think these terms are that specific, that one couldn’t level the same point against “inspired by”.. in what sense is it inspired?
The project is entitled LearnDB, and this is how they describe their motivation in the first two paragraphs of the README:
> > What I Cannot Create, I Do Not Understand -Richard Feynman
> In the spirit of Feynman's immortal words, the goal of this project is to better understand the internals of databases by implementing a relational database management system (RDBMS) (sqlite clone) from scratch.
> This project was motivated by a desire to: 1) understand databases more deeply and 2) work on a fun project.
It sounds like they couldn't care less if it's production-ready.
Okay this seems like the perfect project to test out the various AI repo update solutions. Anyone want to do this for fun, you can have your own agenda, I'm just bored!
Your comment seems to have nothing to do with the post and mostly has a bunch of links. If there is some relevance please make it clearer by editing your comment.
Your projects might be very cool in which case just submit them to HN on their own instead of commenting.
Commenting on other people's downvotes is usually not helpful. Accept the votes, learn from the experience and take that forward when participating in this community. The guidelines are pretty good, please read them as well: https://news.ycombinator.com/newsguidelines.html
Because you people that bring up the latest overhyped and useless buzzword nonsense on every single post are super tiring.
We do not need a comment about "someone should apply AI/ML to this" on every post any more than we need a "this but NFT" or "this but blockchain" or "this but React" or "this but OOP" comment. Your comment is just noise.
My perspective is that writing this kind of system in a language such as Python is actually a great thing because for myself Python is more widely readable and approachable compared to C++ or C which is what databases are often programmed in. If someone wants to be serious they can port it to a low level language . As it stands it's educational and useful for studying.
I wrote a distributed pseudo multi model SQL/graph Cypher/Document and dyanmodb style database in Python https://GitHub.com/samsquire/hash-db for the same goal of learning how database engines could work in a distributed way.