Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python (github.com/spandanb)
371 points by spanspan on Aug 13, 2023 | hide | past | favorite | 77 comments
I wrote a relational database management system (RDBMS) (sqlite clone) from scratch in pure Python.



Thank you for sharing

My perspective is that writing this kind of system in a language such as Python is actually a great thing because for myself Python is more widely readable and approachable compared to C++ or C which is what databases are often programmed in. If someone wants to be serious they can port it to a low level language . As it stands it's educational and useful for studying.

I wrote a distributed pseudo multi model SQL/graph Cypher/Document and dyanmodb style database in Python https://GitHub.com/samsquire/hash-db for the same goal of learning how database engines could work in a distributed way.


This is why you see a community of pure Java RDBMSs (Hypersonic, H2, Derby, etc), if you don't need the big iron scale it's easier to ship/use the database or even embed it in memory if needed.


H2 has quite heavy use in production too. For example Apache Ignite uses it as the core of it's distributed SQL engine. Ignite takes a distributed SQL query and splits it up into single node queries, which are executed on each node by H2, and then Ignite merges the results back.


Yes. It's an interesting area, irrespective of implementation language.

I had tried out one or two Java-based RDBMSes back in the day, via programs written in Java, for fun.

I think one was HSQLDB.

https://en.m.wikipedia.org/wiki/HSQLDB

There was also another interesting one called PointBase, which was developed by Bruce Scott, an Oracle founder, and others.

https://en.m.wikipedia.org/wiki/PointBase


H2 is pretty great and fully supports JSONB and the like [1]. Full java-only RDBMS.

[1]: https://www.h2database.com/html/main.html


H2 is pretty solid. I used to ship a product with it as a default database. Customers could optionally use MySQL, MSSQL or Oracle, but for small businesses H2 was more than enough.


Thanks, sounds worth checking out.


I totally agree with that. I loved the ugit [1] build Git from scratch in Python series for that.

[1] https://www.leshenko.net/p/ugit/


Thanks for sharing. This looks like another fun project to work through over a weekend.


Idk. Python is just as bad as C / C++, with the downside that you cannot do much of the interesting stuff one would need to if they wanted to learn how to make databases, if they use Python.

Both C and Python are very "approachable" if you ignore the bad language design, inconsistencies and other "gotchas" and only take the "easy" parts, disregarding edge cases. However, with C you could at least have a fighting chance to learn how to do things right, but with Python you'll never even know what the real thing is like.


> but with Python you'll never even know what the real thing is like

What do you mean? Surely the concepts are similar regardless of using Python or C?


Not even close.

The real problems you will have to solve when working with a database anyone would want to use are, for example:

* Memory layout of the buffers used to write / read / cache this data. In Python, you don't even have a concept of memory alignment / layout -- it's all happening somewhere in the interpreter.

* How to best service multiple requests concurrently. Again, Python offers nothing here, and nothing to experiment with. But this is a huge part of working with databases. The whole two-stage commit, transaction, consistency guarantees -- it's the central point of databases, but Python gives you no tools to even try anything like that.

* Working with various underlying storage... most of it has no Python interface (but does have C interface). Eg. if you want to understand how to optimize storing data by using some filesystem -- those filesystems do often expose similar interfaces that go beyond VFS, but they won't be immediately available to Python.

* Working with vectorization of queries -- again, Python doesn't have a concept of ISA, doesn't have any way to instruct the code to utilize any particular CPU instructions... but this is where a lot of work is done by people who work on real databases. Not being able to get any meaningful access to query optimization, planning becomes pointless / useless -- what are your concerns going to be when you write a query planner if you still have no idea how it's going to be executed?

* Similar stuff goes for networking -- whenever you need to solve something that goes beyond the absolute trivial you will at best rely on Python bindings to some library that actually does that rather than on Python code proper. In other words, if your goal was to learn how to do it, you will not achieve that goal because the actual work will happen elsewhere.

So... I don't know... there is no way you can really learn how to make databases in Python. You can probably learn something, depending on what's your baseline, but you will not be ready to make a real thing, if all you have is Python. It's a difference between toy doctor set and learning to be a surgeon...


My learning style is piecemeal: I do a bit of that, a bit of another thing with the long term goal to combine everything I did into one project. It's proof of concept or prototype approach.

The experts are doing databases in C and C++ and Rust. But I'm not a C, Rust or C++ expert, so I need to start somewhere, where I am today.

I start small accomplishable goals to get the idea of the problem solution so I'm not distracted by boilerplate C, C++ or Rust. My multiversion concurrency control is in Java.

You might think all the things are easy but they're not easy to everyone. We have to start somewhere and one way to start is to write the parser in a simple language so you're not wrestling with memory management.

If I tried to do all the things you mentioned in C++, Rust or C it would be too much work in one step. I need to start small to have an achievable result.

Not everyone is Stonebraker or Linus Torvalds.


Guess to get back to the post. He did it for learning, not to be an exact duplicate or competitor to replace other RDBMS.

Often when learning, you do not implement every difficult edge case, the most complex. You are just trying to get the jist of it. You want a smaller problem to solve.

Maybe as a learning project, it isn't important to have concurrency, networking, memory management. Unless, any of those things happens to be of interest to learn about also, then add them back in.

I don't think this is trying to be an argument for Python as good to build an RDB in. (of course it isn't for all the reasons you list)

Python just happens to be an easy language for beginners, so why not build a basic RDB to learn about that too.


Because, especially for learning, you need tools adequate for the task... Masters can be more flexible and use inferior tools and still be successful; students need every bit of help they can get, and they need the tools that aren't going to betray them every step of the way.

Giving someone Python to make a database, is like giving a student in a culinary school a dull knife: it's hard to do it with the right tools, but it becomes mission impossible when you are also crippled by your tools.

It's the same analogy I used before: using toy "doctor set" vs. learning to be a surgeon. There's no path that will bring you from using a toy set to be a surgeon. It serves a different purpose: entertainment / roleplay. You don't mean to roleplay as a programmer by using Python, right?


You need tools adequate to the task, but the task isn't necessarily what other people think it is. In this case, I'd bet the tasks are to learn what's going on inside a database and become better at programming in Python, not to write a high-performance production-ready database implementation.


The concepts and syntax are similar, but Python will always be "riding upon a horse", whereas C can be the horse.


Cool stuff. I had similar intuitions- Python allow me to focus on the high-level concepts. Albeit, there were times where I wished I had gone with a statically-typed + compiled language.


Long long time ago someone rewrote/ported SQLite from C to C# -> https://code.google.com/archive/p/csharp-sqlite/wikis/Letter... - note how welcome Dr. Richard Hipp was on the effort!

Probably here on github -> https://github.com/CsharpDatabase/CsharpSQLite - and possibly some more clones after that.


Kudos, i bet that was a fun and rewarding experience.

I know this was never meant to be fast, but could you produce some benchmarks for shits and giggles?


Off topic but do you know of any good resources/talks/blog posts that cover how to create useful benchmarks?


Not OP, but Brendan Gregg has written a lot about performance and benchmarking.

* https://www.brendangregg.com/activebenchmarking.html

* https://www.brendangregg.com/blog/2018-06-30/benchmarking-ch...


Great, thank you!


It would be a fun exercise to implement something like TPC-C for learndb and see how this done.


Thanks to this post I learned about Lark, which looks like a really nice parser library for Python.

The JSON tutorial on their site is excellent - shows how to build a basic parser for JSON, then goes into some great detail about how to improve its performance: https://lark-parser.readthedocs.io/en/latest/json_tutorial.h...

Here's the grammar used for the RDBMS project: https://github.com/spandanb/learndb-py/blob/master/learndb/l...


Highly recommend Lark for Python projects -- it's easy to use :)

Their IDE was super useful for debugging the grammar: https://www.lark-parser.org/ide/

We use Lark for a SQL-like language tailored for using AI models in EvaDB: https://github.com/georgia-tech-db/evadb/blob/master/evadb/p... https://github.com/georgia-tech-db/evadb/

If you like Lark, please consider sponsoring them: https://github.com/sponsors/lark-parser


DSL in a string? Is that 'really nice'? I haven't used or needed this in Python that I can think of, but surely we can do better than that?

Even a dict with expected keys and construction via the bitwise or operator (which would roughly match the form of a lot of the grammar) would be better wouldn't it? Imports could be imports, just mixed in somehow.

This is just first thoughts at a glance, maybe I'm missing something.


Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...

The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.


It’s an existing, standard, language for describing grammars. Quite clearly the trade-off being made here. Seemingly not the one you would’ve made. Don’t act like it’s objectively bad.


in a string. I didn't say the DSL was novel. I learnt BNF too. (It does have extras that are either unique to it or standard beyond my familiarity though.)


I was always surprised at how the Ruby community talks about DSLs when using Ruby classes and catch all methods, overloading, etc. Nothing bad with it, but I feel it is not really a language, just very dynamic Ruby code.

A DSL that is a string, while having some downsides, does not have the arbitrary limitations of the "host" language. Either approach has its pros an cons.


I'm not that versed in Ruby, but I think similarly say the Django ORM in Python borders on DSL - especially with use of bitwise operators to build a query.

It's the tooling aspect that's my gripe with it in a string really, it makes it less likely (and more editor-specific) that I can have syntax highlighting, LSP, etc. That's incidentally the only way in which I don't prefer SQL to Django ORM - i.e. it really isn't that it's a DSL (SQL) that bothers me, it's the string.

But hopefully obviously it's not a big deal, I was just surprised at the look of it when I saw it described as 'really nice' and that the project describes itself as 'focussed on ergonomics'. It just doesn't seem brilliant to me, fine perhaps, par for the course apparently, but not remarkable.


That's pretty standard for parsing libraries - have you seen any good ones (for Python or other languages) that don't use a DSL like this?

The only one I've seen is this one: https://parsy.readthedocs.io/en/latest/tutorial.html



Here is another PEG one, though Guile allows expressing things as strings and as macro calls / s-expressions:

5: https://www.gnu.org/software/guile/manual/html_node/PEG-Pars...


̶T̶h̶i̶s̶ ̶s̶t̶r̶i̶n̶g̶ ̶i̶s̶ ̶t̶h̶e̶ ̶g̶r̶a̶m̶m̶a̶r̶ ̶t̶h̶a̶t̶ ̶l̶a̶r̶k̶ ̶u̶s̶e̶s̶ ̶t̶o̶ ̶p̶a̶r̶s̶e̶ ̶t̶h̶e̶ ̶u̶s̶e̶r̶ ̶s̶u̶b̶m̶i̶t̶t̶e̶d̶ ̶s̶q̶l̶ ̶i̶n̶t̶o̶ ̶a̶n̶ ̶A̶S̶T̶.̶ ̶T̶h̶i̶s̶ ̶p̶a̶r̶s̶i̶n̶g̶ ̶i̶s̶ ̶f̶a̶r̶ ̶m̶o̶r̶e̶ ̶c̶o̶m̶p̶l̶e̶t̶e̶ ̶a̶n̶d̶ ̶r̶o̶b̶u̶s̶t̶ ̶t̶h̶a̶n̶ ̶w̶h̶a̶t̶ ̶d̶o̶i̶n̶g̶ ̶t̶h̶i̶s̶ ̶i̶n̶ ̶p̶u̶r̶e̶ ̶p̶y̶t̶h̶o̶n̶ ̶w̶o̶u̶l̶d̶ ̶a̶l̶l̶o̶w̶ ̶(̶w̶i̶t̶h̶o̶u̶t̶ ̶o̶f̶ ̶c̶o̶u̶r̶s̶e̶ ̶i̶m̶p̶l̶e̶m̶e̶n̶t̶i̶n̶g̶ ̶t̶h̶e̶ ̶e̶n̶t̶i̶r̶e̶ ̶l̶e̶x̶e̶r̶ ̶a̶n̶d̶ ̶p̶a̶r̶s̶e̶r̶ ̶t̶h̶a̶t̶ ̶l̶a̶r̶k̶ ̶i̶m̶p̶l̶e̶m̶e̶n̶t̶s̶)̶.̶

T̶h̶e̶r̶e̶ ̶m̶a̶y̶ ̶b̶e̶ ̶s̶o̶m̶e̶ ̶p̶o̶s̶t̶ ̶p̶a̶r̶s̶i̶n̶g̶ ̶v̶a̶l̶i̶d̶a̶t̶i̶o̶n̶ ̶t̶h̶a̶t̶ ̶c̶a̶n̶ ̶b̶e̶ ̶d̶o̶n̶e̶ ̶h̶e̶r̶e̶-̶ ̶b̶u̶t̶ ̶t̶h̶a̶t̶ ̶w̶o̶u̶l̶d̶ ̶b̶e̶ ̶s̶o̶m̶e̶t̶h̶i̶n̶g̶ ̶t̶h̶a̶t̶'̶s̶ ̶b̶e̶y̶o̶n̶d̶ ̶t̶h̶e̶ ̶d̶o̶m̶a̶i̶n̶ ̶o̶f̶ ̶t̶h̶e̶ ̶p̶a̶r̶s̶e̶r̶.̶

Edit: I see what you mean. I surveyed a bunch a parser generator libraries, and they also seemed to use a text based DSL- rather than DSL based on python structures. What you're describing would have made the grammar development more ergonomic and simple.


It is https://en.m.wikipedia.org/wiki/Backus–Naur_form or something similar. A python DSL would certainly be less expressive and useful.


I realise that (with some additions, certainly - it has imports) I just don't think embedding it in a string is 'really nice', personally. The 'developer experience' will be crap or reliant on hyper-specific tooling; so likely crap. (The RHS of that 'or' will be too niche to be good and well-maintained, probably.)


Not to be coy or disrespectful( and this is excellent work and a way to learn new things) but if one is generating a parser as a means to an end(the end being execution of the AST on the database), what are the learnings one gets from just the parser bit? Is it some kind of optimization you need to keep doing to the generated parser to make it more efficient?

Would a logical next step be Generate an optimal query plan from the AST(somehow..)?


This is very nice!

SQLite is very hard to read, but this one is actually quite comprehensible. Especially the VM part: https://github.com/spandanb/learndb-py/blob/master/learndb/v...

Compare it with this: https://github.com/sqlite/sqlite/blob/master/src/vdbe.c

That's said, I'm curious how complete this LearnDB is. SQLite is hard to read not only it's old but also it covers a lot of SQL and following SQL spec makes hings complicated. SQLite has great test suite so it's nice if you run the suit against this implementation.


This is fantastic, and seems like a great way for someone (me) to learn DS&A better. I can explain how a B+tree works, but if you asked me to code it, I’d freeze up.

I love databases and Python, so this was really interesting to walk through. Thanks for the post.


Most definitely. The b-tree implementation was the first motivation for starting the project. Especially, all the details around node rebalancing and splitting. And the fact that it was an on-disk structure, added another wrinkle to the thinking about the impl


How much of the SQLite test suite will pass?


Does it support any ACID guarantees? Query planning/optimization?

I’m not asking this to imply that it can/should. I just want to know what all was attempted besides b-trees and SQL.

I’d like to do something like this someday. Nice work!


Re: ACID guarantees

It doesn't have a notion of atomically batching multiple statements, i.e. transaction. But beyond that, it's a single file database, which can only have a single process (learndb instance) that is operating on the database (file). So you get consistency and isolation via being a single connection database. Durability, you get to the extent that the file system is durable. So it's somewhere on the ACIDity spectrum.

Re: Query planning/optimization

I haven't implemented this; but I've considered where the optimization could module sit: The parser spits out an AST. This or a derived intermediate representation could be optimized,i.e. the AST could be rewritten or nodes deleted, before the VM executes the AST.


On a tangent: is there something like mapDB [1] but for Python?

1 - https://mapdb.org


First, great project. Second, the code is very readable. Finally thank you for excellent comments.


long time ago the Python2 stdlib shipped a ton of DB libs like BerkelyDB. https://wiki.python.org/moin/BerkeleyDB


I was surprised to find out MicroPython had a database module based on Berkley DB in its standard library. It uses Berkley DB 1.x, which is the version of Berkley DB with a permissive license.

https://docs.micropython.org/en/v1.20.0/library/btree.html


The documentation is excellent.


Really cool project!

I am curious about the benefits and limitations of using Python in this project as opposed to C++. How well is LearnDB able to support low-level concurrency control and storage management?


I began using python as a way to "mock" out the overall design; intending to re-implement it in rust. The main reasoning for using python: was the ability to focus on "high-level" concepts and speed of tinkering. This implements a single process, single thread, single connection database- so performance and low-level concurrency control were not explicit goals or really optimized for. For those (real-live concerns) rust or C++ are much better; but also come with their set of complexities.


Nice OP. How would you say this compares to the inbuilt 'sqlite3' module in Python? Is it more or less portable? Different features?


the `sqlite3` module provides an interface to `sqlite` (https://www.sqlite.org/index.html), so that comparison doesn't really make sense.

OP rewrote the actual database in Python, so (if it's a 1:1 equivalence), you would still use the python `sqlite3` module to connect to OP's project.

As mentioned, it's an educational project, not really meant to be used as a replacement for sqlite in projects etc though.


Good for you! Want a medal?


i used to use gadfly, aaron watters's sql database in pure python; the first release was in 01994

i'm not sure current python is capable of running it


Is this really a relational database if it does not have JOINs?


It does have joins


The splash page says that Learndb supports the following: select, from, where, group by, having, limit, order by

I dont see "JOIN" there.

So either the DB doesnt support it or the documentation (on the splash page!) is wrong.

Thanks for the downvote.


> Thanks for the downvote.

Please don't do this here.

Additionally, it could have been anyone.


Nice! it is good to see past blind spots e.g. "you must write this kind of app in this kind of language, pure Python is high level it is only for glue". I wouldn't have thought to do a project like this because of all of those kinds of heuristics. And as other have said, a hackable DB in the same language as the app can have a lot of advantages.


In what way is this a "SQLite clone"?


Single file, embedded database with similar logical organization


Perhaps a more accurate claim would be "SQLite inspired". Calling it a clone is misleading.

Mad props to the author. Many Python programmers never had proper training in computer science, so it is encouraging to see people filling in the gaps of their knowledge.


This is a very early release, whereas SQLite has 22 years of releases. In that light, this is about the least charitable take on this.

Someone in our community built something and had the courage to release it. Your criticism is unfair.


> Your criticism is unfair.

I think it doesn't even get close to being a criticism, and it's certainly unclear if the goal is to literally clone SQLite or to implement SQLite-ish. This is a fair question.


Why did they have to use the term misleading? Why not be charitable?


Just trying to encourage clear and accurate communication. I agree with your first sentence. The only thing unfair is the author claiming it is a SQLite clone. It isn't, as we both seem to agree. It is a form of cheating.


Fair. It’s “inspired by”, not a “clone”. Frankly, I don’t think these terms are that specific, that one couldn’t level the same point against “inspired by”.. in what sense is it inspired?


In the way that it's an embedded RDBMS.


I know Python, but little C.

I don’t recommend writing a DB in Python. So much of DB development is dealing with race conditions and consistency.

Python files require an interpreter. Bad for a DB.

And dependency management is a pain. Are you thinking a virtualenv for this?


The project is entitled LearnDB, and this is how they describe their motivation in the first two paragraphs of the README:

> > What I Cannot Create, I Do Not Understand -Richard Feynman

> In the spirit of Feynman's immortal words, the goal of this project is to better understand the internals of databases by implementing a relational database management system (RDBMS) (sqlite clone) from scratch.

> This project was motivated by a desire to: 1) understand databases more deeply and 2) work on a fun project.

It sounds like they couldn't care less if it's production-ready.


Also: what is "production ready"? Not every program is used in a context where performance, concurrency, etc. are relevant...


Okay this seems like the perfect project to test out the various AI repo update solutions. Anyone want to do this for fun, you can have your own agenda, I'm just bored!

https://github.com/paul-gauthier/aider https://www.mentat.codes/ https://www.gitwit.dev/ https://www.second.dev/


Why would you downvote this? Ive already turned it into a viable sqlite competitor. Shame on you.


my 2c:

Your comment seems to have nothing to do with the post and mostly has a bunch of links. If there is some relevance please make it clearer by editing your comment.

Your projects might be very cool in which case just submit them to HN on their own instead of commenting.

Commenting on other people's downvotes is usually not helpful. Accept the votes, learn from the experience and take that forward when participating in this community. The guidelines are pretty good, please read them as well: https://news.ycombinator.com/newsguidelines.html


Because you people that bring up the latest overhyped and useless buzzword nonsense on every single post are super tiring.

We do not need a comment about "someone should apply AI/ML to this" on every post any more than we need a "this but NFT" or "this but blockchain" or "this but React" or "this but OOP" comment. Your comment is just noise.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: