DuckDB 0.8 (duckdb.org)
187 points by nnx on May 18, 2023 | 98 comments



I recently pulled DuckDB out of a project after hitting a memory corruption issue in regular usage. Upon investigating, I found they had an extremely long list of fuzzer-found issues. I just don't understand why someone would start something in a memory unsafe language these days. I cannot in good conscience put that on a customer's machine.

We ended up rewriting a component to drop support for Parquet and to just use SQLite instead. I love the idea of being able to query Parquet files locally and then just ship them up to S3 and continue to use them with something like Athena.

The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality. DuckDB (unironically) needs a rewrite in Rust or a lot more fuzzing hours before I come back to it. While SQLite is not written in a memory safe language, it is probably one of the most fuzzed targets in the world.


Sorry to hear that!

(I work on docs for the DuckDB Foundation)

Starting in this release, the DuckDB team invested significantly in adding memory safety throughout catalog operations. There is more on the roadmap, but I would expect this release and all following to have improved stability!

That said, at my primary company, we have used it in production for years now with great success!


Yeah, it's legitimately a cool project, and if it wasn't on customer machines being passed user-defined SQL it would have just been worked around.


Would you mind explaining your issues in a little more detail?

> being passed user-defined SQL

What does this mean exactly? Customers were writing their own SQL on their own machines? Maybe expensive operations such as:

`SELECT ROW_NUMBER() OVER (PARTITION BY THING ORDER BY TS), * FROM EVENTS`

And since it was on a customer's machine, it became a problem you had no control over?


> why someone would start something in a memory unsafe language these days

You might like what we (Splitgraph) are building with Seafowl [0], a new database which is written in Rust and based on Datafusion and delta-rs [1]. It's optimized for running at the edge and responding to queries via HTTP with cache-friendly semantics.

[0] https://seafowl.io

[1] https://www.splitgraph.com/blog/seafowl-delta-storage-layer


Getting a 5xx error for your site. Using Firefox mobile if that helps


Hmm... thanks. Maybe it was a hiccup? Does it happen when you click these links in my comment? We haven't been able to replicate on Firefox mobile, but we do have an issue with 500 errors in Firefox when double clicking links in the sidebar of the docs (I know, I know...)


>The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality.

It is a limited team. If they feel a feature is causing too much grief, I would rather they drop it than post a "Here be dragons" sign and let users pick up the pieces.

Edit: missed an obvious opportunity to take a shot at MySQL


I think the critique is not that they should have left the thing broken, but that a limited team should limit the work to match the team size so that they do not release broken things in the first place.


It is often difficult to ascertain how much iceberg is beneath the water until you attempt to solve the problem.


- DuckDB is written in C++

- SQLite is written in C

I wouldn't consider either of those a memory safe language, although SQLite has been battle hardened over many years, while DuckDB is a relatively new project.

That being said, there have been efforts to reimplement SQLite in a more memory safe language like Rust.

e.g. https://github.com/epilys/rsqlite3


At the level of engineering of SQLite, the choice of language is almost immaterial. Suggesting a low effort transpilation as a competitive peer seems unserious and vaguely disrespectful.


>Finally, one of the best written software paired with one of the best writable programming language‽ Fearless and memory safe, since the uncountable amount of unsafe {} blocks makes you not care anymore.

Plus, it seems the project is a parody of the RiiR trend.


The CVE list would dispute that assertion. There's a reason Microsoft is rewriting parts of the Windows kernel in Rust, and it isn't trendiness or because the kernel is at a trivial level of engineering.

It's the same reason Torvalds refused to have C++ anywhere near the Linux kernel but is now accepting patches in Rust. The advantage of C is its transparency and simplicity, but its lack of safety has always been a thorn in the industry's side.

RiiR has become a bit of a parody of itself, but there is a large grain of truth from where that sentiment was born.

Life is too short for segfaults and buffer overruns.


to be clear, OP said

> I just don't understand why someone would start something in a memory unsafe language these days.

It takes a lot of time and testing to iron out all the bugs. Not impossible. Just takes a lot of time and testing.


C has known memory footguns. C++, due to its complexity, has footnukes.


No, I actually think the opposite.

In most C++ environments you will have std::string, STL vectors, unique_ptr, and RAII generally. Cleaning up memory via RAII discipline is standard programming practice. Manual frees are not typical these days. std::string manages its own memory, and isn't vulnerable to the same buffer overflow/null terminator safety issues that C-style strings are.

While in C you will probably be using null-terminated strings and probably your own hand-rolled linked lists and vectors. You will not have RAII or destructors, so you will have manual frees all over.

Perhaps the big difference is that due to the nature of the language, C developers on the whole are probably more careful.


I started coding professionally in C++ back around 2000. Many things have improved in C++ such as the items you list above, but C++ remains a viciously complicated language that requires constant vigilance while coding in it, especially when more than one thread is involved.

CloudFlare is not lacking in good engineers with tons of both C and C++ experience. They still chose Rust for their replacement of Nginx. Now their crashes are so few and far between, they uncover kernel bugs rather than app-level bugs.

https://blog.cloudflare.com/how-we-built-pingora-the-proxy-t...

> Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.

I have never heard anything close to that level of reliability from a C or C++ codebase. Never. And I've worked with truly great programmers before on modern C++ projects. C++ may not have limits, but humans writing C++ provably do.


I'm not sure who you're arguing against, but it's not me, or it's off topic completely. The discussion and my reply was not between C/C++ and Rust, but between C and C++.

I'm a full-time Rust developer FWIW. But I also did C++ for 10 years prior and worked in the language on and off since the mid-90s. Nobody is arguing in this sub-thread about Rust vs C++, nor am I interested in getting into your religious war.


DuckDB is a great piece of software, to be clear.

If you are looking for a query engine implemented in a safe language (Rust) I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user defined functions, etc)

https://arrow.apache.org/datafusion/

Disclaimer: I am a maintainer of DataFusion.


I remember chasing memory bugs in C# and Java projects at work. These are usually considered memory-safe languages, yet their sophisticated garbage collectors are not a panacea. There is a reason things like this https://docs.oracle.com/javase/8/docs/api/java/lang/ref/Weak... exist. Or why you might re-use existing objects in certain situations https://www.oreilly.com/library/view/java-performance-tuning... .

On the other hand, there are garbage collectors available for C and C++ programs (they are not part of their standard libraries so you have to choose whether you use them or not). C++ standard library has had smart pointers for some time, they existed in Boost library beforehand and RAII pattern is even older.

Don't put all the blame for memory bugs on languages. C and C++ programs are more prone to memory leaks than programs written in "memory-safe" languages, but those are not safe from memory bugs either.

Disclaimer: I like C (plain C, not C++, though that's not as bad as many people claim) and I hate soydevs.


Memory leaks are annoying and, yes, you can get them in memory safe languages.

But they are way less severe than memory corruption. Memory unsafe languages are liable to undefined behaviour, which is actively dangerous, both in theory and practice.


> I just don't understand why someone would start something in a memory unsafe language these days. I cannot in good conscience put that on a customer's machine. We ended up rewriting a component to drop support for Parquet and to just use SQLite instead.

I am not sure that you realize that SQLite is written entirely in C -- a quintessential memory unsafe language. I guess quality of software depends on many things besides a choice of language.


SQLite also lives under a literal mountain of automated tests, an engineering effort I'm not sure I've ever seen elsewhere. The library code is absolutely dwarfed by the test code.

...and CVEs still pop up occasionally. The point about memory-safe languages still holds, but can be mostly mitigated if you throw enough tests at the problem.


I’m quite sure that the OP DOES realize this.


This would be a good blog post to read. And I’m sure the team behind the project would like to see it, too.


What feature was removed? I'm pretty fine with a project deciding to remove a feature with known footguns, but curious what it was.


Blob to bitstring type casting for Parquet. They were doing a straight reinterpret cast on it which was causing an allocation of 18446744073709551503 bytes.

I was wanting to take a blob from Parquet and bitwise-and it against a bitstring in memory.


Is blob an efficient type?


A serious and curious question: are we close to the point with LLMs where we can just point to the source of something like DuckDB and its suite of tests and say "rewrite this in Rust with this set of libraries, and make sure all these tests pass"? Even if it's not 100% complete and doesn't produce idiomatic code, could it work?


I don’t even have access to regular Claude so can’t confirm this but the 100K token model they released should in theory be able to handle this to a certain degree.


Is anyone having luck with this? It seems too "creative".


I haven't tried Claude but I have been tinkering with a lot of this in my home lab and there are various theories I have:

- GPT4 is not a model, it's a platform. I believe the platform picks the best model for your query in the background and this is part of the magic behind it.

- The platform will also query multiple data sources depending on your prompt if necessary. OpenAI is just now opening up this plugin architecture to the masses but I would think they have been running versions of this internally since last year.

- There is also some sort of feedback loop that occurs before the platform gives you a response.

This is why we can have two different entities use the same open source model yet the quality of the experience can vary significantly. Better models will produce better outputs "by default", but the tooling and process built around it is what will matter more in the future when we may or may not hit some sort of plateau. At some point we're going to have a model trained on all human knowledge current as of Now. It's inevitable right? After that, platform architecture is what will determine who competes.


Interesting speculation, but I don't think GPT-4 chooses any model; I'm pretty sure it's just how good that one model is. I played with a lot of local models but the reality is, even with Wizard-Vicuna, we're at least an order of magnitude away from the size of GPT-4.


It is open source; you can do whatever you want with it, like fix it or rewrite it in Rust. I don't see the point of complaining that it doesn't work for you because it's not in Rust and then not doing anything about it.


> The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality.

Yeah, DuckDB has some very cool features, but I wish the community were less abrasive. I remember someone asking for ORC columnar format support, and DuckDB replied "that is not as popular as Parquet so we're not doing it, issue closed". Same story with Delta vs Iceberg.

Meanwhile ClickHouse supports both, and if you ask for things they might say "that is low priority but we'll take a look". clickhouse-local can work as a CLI (though not in-process) DuckDB too.


> DuckDB has some very cool features, but I wish the community were less abrasive.

I’m on the Discord and the community in my experience has been anything but “abrasive”. I’m just a random guy and yet I’ve received stellar and patient help for many of my naive questions. Saying they are abrasive because they’re not willing to build something seems so entitled to me.

Focused engineering teams have to be willing to say no in order to achieve excellence with limited bandwidth. I’m glad they said no when they did so they could deliver quality on DuckDB.

I certainly think ORC is a good thing to say no to - in my years of working in this space I’ve only rarely encountered ORC files (technically ORC is superior to Parquet in some ways but adoption has never been high).

Also realize that the team is not being paid by the people who ask for new features. If you’re willing to pay them on a retainer through DuckDB labs then you can expect priority, otherwise the sentiment expressed in your comment just seems so uncalled for.


Another cool new feature that's not mentioned in the blog post is function chaining:

https://github.com/duckdb/duckdb/pull/6725

I've been using DuckDB for filtering and post-processing data, especially strings, and this will make writing complex queries easier. By combining nested functions[0] and text functions[1], sometimes I don't even need to go into a Python notebook.

[0]: https://duckdb.org/docs/sql/functions/nested

[1]: https://duckdb.org/docs/sql/functions/char
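
For example, a minimal sketch comparing the nested and chained forms (the input string is made up; lower, string_split and array_to_string are documented DuckDB functions, but this is untested here):

    -- nested calls
    SELECT array_to_string(string_split(lower('Hello World DuckDB'), ' '), '-');
    -- equivalent with function chaining
    SELECT ('Hello World DuckDB').lower().string_split(' ').array_to_string('-');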


The main use case for me for DuckDB is in Data Lake-related tasks where cloud providers often do not provide great/cost-effective/otherwise limited services.

In the case of AWS, repartitioning Parquet files in S3 via Athena CTAS statements is limited to 100 active partitions, which is a bummer to work around. Therefore, I’m using DuckDB with repartitioning queries, because it doesn’t have the 100 partition limit.

I wrote a blog post about it at https://tobilg.com/casual-data-engineering-or-a-poor-mans-da... Additionally, to get started with using DuckDB serverlessly in Lambda functions, you can have a look at https://tobilg.com/using-duckdb-in-aws-lambda
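
As a rough sketch of what such a repartitioning query can look like (the bucket, paths and the event_date column are placeholders, and the httpfs extension is assumed to be loaded for S3 access):

    COPY (
        SELECT * FROM read_parquet('s3://my-bucket/raw/*.parquet')
    ) TO 'repartitioned'
        (FORMAT PARQUET, PARTITION_BY (event_date));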


We extensively explored this approach for a use-case and dropped it in favour of BigQuery external data queries for one core reason: it can't do streaming reads if you have to do any kind of operation on the underlying data; for almost everything it will have to load all the relevant Parquet files locally to do the operations. You might have been able to solve this using row partitions on the underlying files beforehand, but it was too late for us.

Also, it actually turned out to be much more expensive to run the operations in cloud run, than relying on BQ (due to the fact it wouldn't stream as we wished). More than anything it gave me renewed appreciation for BQ.


Very interesting. Would you mind mentioning which external table format you were using? Iceberg by any chance?

I wonder if there wasn't a middle-ground that could still utilize DuckDB or similar. You still rely on BQ to deliver the subset of data needed, and then rely on DuckDB to do the "last-mile" analytics: grouping, filtering, ordering, etc...?


I don't think this is correct, or I didn't really understand your use case. Could you give an example for "any kind of operation on the underlying data"?


I used DuckDB successfully in prod to replace SQL Server. We have a micro batch that generates around 5 billion rows of very wide tables every 3 minutes. These data used to go into SQL Server, only to be replaced by the new batch of 5 billion and marked for deletion. SQL Server was struggling with all these purge activities. Replacing it with DuckDB made things much lighter and faster. The only issue I faced is case sensitivity, where in DuckDB if you ask for your queries to be case insensitive, your results lose their original casing and are returned all lowercased.


I'm curious, what kind of problem needs 5 billion rows every 3 minutes to replace the previous 5 billion rows?

By "micro batch" I understand batch processing of some kind?


financial risk calculations with many sensitivities


How do you insert into DuckDB fast and what settings ("Indices") do you use? As far as I understand DuckDB builds up statistics for each "block" of data (number of different values, ... ). So I assume inserting is slow. There is a paper [0] and a comment [1] that mentions that DuckDB is 10-500 times slower in a write-heavy workload.

[0] https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/ [1] https://vldb.org/pvldb/volumes/15/paper/SQLite%3A%20Past%2C%...


I have a large number of small and frequent batches, think of it like discrete ETL, where each process operates on a pandas DataFrame. This frame ends up being written to disk as Parquet, immediately followed by creating a DuckDB file that imports the Parquet. The DuckDB file from then on will only be opened for reads, no further writes.

I use a Python OData library to convert user queries in REST to SQL similar to Postgres and run it on these DuckDB files to apply any filters where needed.
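
Roughly, each batch boils down to something like this (file and table names are placeholders):

    ATTACH 'batch_2023_05_18.duckdb' AS batch;
    CREATE TABLE batch.risk AS
        SELECT * FROM read_parquet('batch_2023_05_18.parquet');
    DETACH batch;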


I know this is irrelevant, and I'm really not a fan of SQL Server either, but I'm curious - did you try partitioning those tables on the batch ID, and then truncating partitions instead of deleting old rows?


It's a bit more complicated than I put it above. The schema is quite normalised in SQL Server and represents financial risk, with several views on it. The schema makes sense for EOD data but not for live data. The whole performance of EOD was impacted by these live insertions. Moving to another DB (on another server) would incur licence fees and costs. DuckDB simple files are cost free. Many of these files are deleted even before anyone can be bothered to read them.


> results lose their original casing and returned all lowercased.

That's a shame. Have you reported this bug on their repo?



Damn. This database quacks. Huge list of awesome super helpful tools.

Tons of great features & speed for everyone. Also love the cutting-edge stuff, such as Arrow Database Connectivity (ADBC) support:

> From this release, DuckDB natively supports ADBC. We’re happy to be one of the first systems to offer native support, and DuckDB’s in-process design fits nicely with ADBC.


ADBC is going to be huge for BI tooling. Right now, most of them go through ODBC/JDBC which are row-oriented formats. So your columnar data is always reshaped from column-oriented (in your data warehouse / lakehouse) to row-oriented and often back again. The data is also serialized, deserialized on top of that.

We got an 8-20X speed-up replacing JDBC with Arrow Flight Service, which is a more limited version of ADBC - https://www.youtube.com/watch?v=nCxIpMXvCp0


Lack of order-preserving parallel parquet writing was my biggest gripe with DuckDB. Glad to see they implemented it! Between this duckdb release, DataFusion and Apache Arrow (yes, I know that helps power the other projects I listed), I’m hoping I won’t have to touch a spark cluster for a long time.


> DuckDB is an in-process SQL OLAP database management system

> In-process, serverless; C++11, no dependencies, single file build; APIs for Python/R/Java/…

> Transactions, persistence; extensive SQL support; direct Parquet & CSV querying

> Vectorized engine; optimized for analytics; parallel query processing

> Free & Open Source; permissive MIT License


This is the first time I heard about DuckDB. Why someone would use that instead of a mature engine like SQLite?


Great question! For analytical queries, DuckDB is ~100x faster than SQLite. It also has great connectivity with Parquet, Pandas, Arrow, Postgres, and more! Plus, the syntax follows Postgres closely, so you have strong data typing and true timestamps, etc.

All that said - SQLite is still faster for transactions. Now DuckDB can actually read and write from/to SQLite, so you can use SQLite for OLTP and DuckDB for OLAP.

https://duckdb.org/docs/sql/statements/attach
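
For example, something along these lines should work (a sketch; 'app.db' and the users table are placeholders, and the sqlite extension needs to be installed first):

    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'app.db' AS app (TYPE sqlite);
    SELECT count(*) FROM app.users;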


>Now DuckDB can actually read and write from/to SQLite

This feels like it opens up some possibilities, but I am struggling to figure out where. Are you aware of any interesting use cases this has enabled? Does all of the Duck syntax then work against the SQLite database? For example, could I now run a pivot without a bunch of hoops?


You could definitely run a pivot with the SQLite attach! All of the DuckDB syntax works!
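
For instance, assuming an attached SQLite table app.sales(region, year, amount) (made up for illustration), the PIVOT statement would look like:

    PIVOT app.sales ON year USING sum(amount) GROUP BY region;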


Well that is just amazingly interesting. The documentation is failing to do this feature justice! Definitely going to be on my todo list to explore this bridge in the near future.


Columnar database, not row-based. More appropriate for analytical workloads, not as a transactional backend.


Speed and better connectivity.


What a great database.

This and sqlite-utils are my go to tools for ad hoc data wrangling.


Excited to see how it will progress.


Setting up SQL, creating a well-designed schema, and optimizing queries can be a complex, multi-step process. Databases often feel like black boxes, requiring significant training and engineering to get right. One needs to tweak queries, schema, and database configuration to get the best results.

So why do we need SQL? I know NoSQL came and went (was it no-sql or no-relations? not sure...), but honestly I can't think of many things I can do with SQL that I cannot do more easily in any programming language with a good library for indexing and retrieving data.

Any good industrial strength key-value database like rocksdb, lmdb and others would provide these APIs with lots of bindings for multiple programming languages.

Maybe we need a second coming of NoSQL (relational no-sql?) focused in providing good cross-language APIs so instead of compiling SQL into efficient programs that use the right indexes, we could just write our queries manually in code. Bad queries would be a lot easier to debug and fix this way, and it seems like this would require a lot less machinery.


OK, you got a good KV store, and a good indexing library in your favorite language. You now write special-case procedural code for each retrieval and each update. Eventually you get tired of copy-pasting such code with slight modifications; you factor things out. You make them generic. You end up with a poorly specified "language" for making queries, joins, projections. You notice how sometimes performance requires different approaches, and bake the logic to choose them into your library. Then you realise that there are cases when you need two related updates to either succeed or fail together, not halfway. Transactions.

This all has been tried many times in the 1990s when RDBMSes were pricey or otherwise unattainable: the Paradox engine, various stuff on top of Berkeley DB, etc. MySQL is (or at least was) an SQL engine on top of a KV storage engine.

Now we have Postgres, which fills most small- and medium-scale needs out of the box.

A modern RDBMS is one of the most powerful and versatile tools in a developer's hands, and usually works amazingly well even with stock settings. Usually it by far outperforms makeshift replacements. Learn to operate an RDBMS, learn to wield this power. It's very much humanly possible, and is a much nicer experience than learning, say, C++. Then maybe you will have less of a desire to replace it with manually written imperative code.


I think you are drastically overestimating the ability of most devs to write performant code. Do you really think you're going to get better results asking devs who, by your measure cannot write good declarative SQL schemas and queries, to roll their own data storage and retrieval raw?

RDBMS software is complicated in large part due to it needing to serve a wide range of use cases, so a custom-built version might end up simpler to use, but you can't sell me that Joe Backend-Dev is going to outperform however many million dev-hours have gone into Postgres.


I generally agree, but I think there’s a mistake here comparing something like Postgres to hand-rolled code: Postgres has to implement generally safe optimizations and so cannot take advantage of invariants that cannot be expressed in a way it understands. For a specific task, with well-understood data, I’d expect a typical programmer should be able to outperform Postgres with a bit of effort.

The trade-off here is maintenance cost and a lack of flexibility due to specialization. So, IMO, one ought to reach for an SQL database first and then something else to optimize hot paths as determined by a profiler.


I found this comment in a previous discussion insightful (https://news.ycombinator.com/item?id=34580448)

This simple command `UPDATE users SET preference = 'blue' WHERE id = 123` virtually contains:

* concurrency control

* statistics, which will help:

* execution plans evaluation, which will reduce IO cost with the help of:

* (several categories of) indexes

* type checking

* data invariants checking

* point in time recovery

* enables dataset-wide backup strategies

* ISO standard way of structuring data

* an interface that empowers business people

And that's just a few items off the top of my mind.


> Setting up SQL

You'll have to set up something.

> creating a well-designed schema

You'll have to design your schema well. (Or else.)

> optimizing queries can be a complex, multi-step process.

You'll have to optimize your queries. (Or else.)

---

I don't see why SQL is more work than the alternative.


The simple answer to this is that not only engineers write database queries.

Personally, I’d be interested to see a system like the one you describe. Part of me thinks that SQL would still be simpler to reason about (though probably not to maintain).


Polars is an interesting library in this space: a Rust DataFrame library with bindings in other languages, most notably Python.


Imo lazy dataframe libraries (like polars or datafusion) give you most of what you need in OLAP SQL but without the pain.


>relational no-sql

Do you mean something like edgeDB?[0]

Or do you mean some non-declarative language completely? I don't see the latter making much sense. The issue with SQL for me is the "natural language", which quickly loses all intended readability when you have `SELECT col1, col2 FROM (SELECT * FROM ... WHERE 1=0 AND ...`, which is what edgeDB is trying to solve.

[0] https://edgedb.com/


"It's like a relational database with an object-oriented data model"

Dear god. Send these people back to a decent computer science program to study what the relational model actually is. They're embarrassing themselves. Or I'm embarrassed for them.

Talk about abusing terms and throwing the baby out with the bath water.

SQL is awful. It's an aberration (but a successful one, so). BUT the relational model is beautiful. AND these people clearly don't know what relational means. They appear to have made what in the old days would have been called a "network" database, where the relationships are hard coded in the data itself, aka pointers (they're calling them "links"). Which is anathema to the relational model which is based on sets of sets where the relationships emerge from the query (with hints from schema).

The relational model does not have "links" or pointers. Its "keys" are really a suggestion (or constraint) but not a hard link. Relations are sets of sets (tuples) where those tuples can be joined arbitrarily and so it represents a fundamentally more flexible model where the relationships are not fixed, but emergent from the query.

It was the existence of databases like this back in the 60s and 70s, and the intrinsic problems with them, that led to Codd's development of the relational model in the first place, as a way to free data from hardcoded relationships.

"Graph relational" as a term could mean something I guess (see RelationalAI's product for example, though they don't use that term). But this is not this. "Object identity" is fundamentally philosophically opposed to the relational data model. Identity in the relational algebra is available through the comparison of arbitrary tuples emerging from operations, but is not intrinsic to any "row" or "tuple" or "object".

ARGH. Buzzword marketing, incoherent conceptually.

The world needs a successful non-SQL relational database, but this is not what this is.


I thought like this until I discovered Prisma and EdgeDB. They provide all the niceties of a NoSQL / graph databases with the robustness of battle tested SQL databases


The feature to register arbitrary Python functions as scalar functions within SQL queries is pretty insane. You can write custom functions to manipulate data in ways that aren't possible with SQL, basically without fetching all of it into memory at once and then applying them. I hope that's the case.



Minor annoyance:

Yesterday I was reading the docs and one of the first things I tried is the following, copy and pasted, which didn't work, using version 0.7.1:

  COPY (SELECT 42 AS a, 'hello' AS b) TO 'query.json' (FORMAT JSON, ARRAY TRUE);

  >>> Error: Binder Error: Unrecognized option CSV writer "array"

Just tried the above again with this release and it's fixed now.

Ideally, all commands listed in the documentation should be tested or verified to work.

Aside from that, nice work on the release! And I'm learning more and find it pretty cool, especially WASM support.


Howdy! The docs reflect the latest dev branch. For a specific release, you can pick it from the version dropdown in the top right!


I think it would be a better choice to default to showing the current stable release's docs rather than the dev branch?

That way users who download it see the docs they expect, and those who build from the dev branch or download preview builds have to go through an extra step.


Why not show the latest stable release docs by default? I assume that's what most people would use.


Sorry! Just realized that when I checked the docs again. My bad.


No worries! Thanks for reading the docs!


Such an amazing project. Once the geospatial extension lands I'm going to seriously consider switching our geoparquet workflows over.


You mean this one?

DuckDB Spatial Extension https://duckdb.org/2023/04/28/spatial.html


Yes, it's still very alpha


I'm just minorly annoyed that they're not calling it GeoDuck.


Nice to see Parquet support in DuckDB improving with each new release! In earlier versions, I had trouble with enums and nested schemas; these issues have since been resolved.

The only missing feature for me now is full table, column, &c. metadata support in the DuckDB JDBC driver.

Thanks!


I have been wanting to try DuckDB forever and finally took it for a spin recently. It is an amazing technology that I hope to utilize in production one day. Hats off to the UI/UX team behind the website and docs. Love the aesthetic!


Can you share some detailed use cases where I can learn more about how DuckDB does better?


I've been thinking about rebuilding the website analytics service I have, to be around sharded duckdb and just giving people raw SQL access.

ClickHouse is the alternative, but this approach has some interesting advantages, like simplicity.


I find clickhouse equally simple. One command to install and then you can either read from local/remote files, create tables, etc


Clickhouse looks more complicated because instead of

    create table bla (
      id integer,
      xyz varchar
    )
you're doing things like

    create table bla (
      id UInt64 CODEC(DoubleDelta),
      xyz LowCardinality(Nullable(String)) CODEC(ZSTD(1))
    )
    engine=MergeTree
    order by id
but I do wonder what the performance difference is between actually carefully specifying everything and leaving the defaults (except for order by, I guess).


Fair enough. Some of the settings will have performance impact, e.g. the low cardinality type, data skipping indexes, etc. and you have to spend some time to learn and experiment yourself. However it’s a powerful db and even the default settings will work well in the majority of cases.


Just leave everything at the default - no need to specify any codecs, etc. People often try to show off their knowledge by using advanced features, but that's not good.


From the brief time I used it running signoz it used a surprising amount of resources, which requires further investigation, but yes it's a nice stable option.


We at MotherDuck are huge supporters of DuckDB, and it's not a big secret that we're working on a managed DuckDB offering. Feel free to give us a ping to chat :)



