More

mslot · 2025-11-05T16:03:01 1762358581

In principle, Postgres has an infinite number of possible types :).

pg_lake maps types into their Parquet equivalent and otherwise stores as text representation, there are a few limitations like very large numerics.

https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...

mslot · 2025-11-05T09:57:13 1762336633

(1) We've thought about it, no current plans. We'd ideally reimplement DuckLake in Postgres directly such that we can preserve Postgres transaction boundaries, rather than reuse the Ducklake implementation that would run in a separate process. The double-edged sword is that there's a bunch of complexity around things like inlined data and passing the inlined data into DuckDB at query time, though if we can do that then you can get pretty high transaction performance.

(2) In principle, it's a bit easier for pg_duckdb to reuse the existing Ducklake implementation because DuckDB sits in every Postgres process and they can call into each other, but we feel that architecture is less appropriate in terms resource management and stability.

mslot · 2025-11-04T21:29:46 1762291786

It's the same team and same project :). Crunchy Data was acquired by Snowflake.

scirob · 2025-11-04T21:31:43 1762291903

Holy shit thats amazing!! Congrats to the team

mslot · 2025-11-04T20:27:57 1762288077

When we first developed pg_lake at Crunchy Data and defined GTM we considered whether it could be a Snowflake competitor, but we quickly realised that did not make sense.

Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.

Pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses where you ended up overloading your server with slow analytical queries, but you still want to keep using Postgres.

mslot · 2025-11-04T20:01:42 1762286502

Yes, just COPY table TO 's3://mybucket/data.parquet'

Or COPY table TO STDOUT WITH (format 'parquet') if you need it on the client side.

mslot · 2025-11-04T18:47:18 1762282038

DuckLake is pretty cool, and we obviously love everything the DuckDB is doing. It's what made pg_lake possible, and what motivated part of our team to step away from Microsoft/Citus.

DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.

Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.

Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.

There's also the interoperability aspect of Iceberg being supported by other query engines.

jabr · 2025-11-04T19:01:22 1762282882

How does this compare to https://www.mooncake.dev/pgmooncake? It seems there are several projects like this now, with each taking a slightly different approach optimized for different use cases?

mslot · 2025-11-04T19:38:55 1762285135

Definitely similar goals, from the Mooncake author: https://news.ycombinator.com/item?id=43298145

I think pg_mooncake is still relatively early stage.

There's a degree of maturity to pg_lake resulting from our team's experience working on extensions like Citus, pg_documentdb, pg_cron, and many others in the past.

For instance, in pg_lake all SQL features and transactions just work, the hybrid query engine can delegate different fragments of the query into DuckDB if the whole query cannot be handled, and having a robust DuckDB integration with a single DuckDB instance (rather than 1 per session) in a separate server process helps make it production-ready. It is used in heavy production workloads already.

No compromise on Postgres features is especially hard to achieve, but after a decade of trying to get there with Citus, we knew we had to get that right from day 1.

Basically, we could speed run this thing into a comprehensive, production-ready solution. I think others will catch up, but we're not sitting still either. :)

j_kao · 2025-11-04T19:03:54 1762283034

FYI the mooncake team was acquired by Databricks so it's basically vendors trying to compete on features now :)

mritchie712 · 2025-11-04T19:42:20 1762285340

> For instance, it can handle >100k single row inserts/sec.

DuckLake already has data-inlining for the DuckDB catalog, seems this will be possible once it's supported in the pg catalog.

> Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.

This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.

derefr · 2025-11-04T22:09:32 1762294172

> This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.

Think "tiered storage."

See the example under https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...:

   select cron.schedule('flush-queue', '* * * * *', $$
     with new_rows as (
       delete from measurements_staging returning *
     )
     insert into measurements select * from new_rows;
   $$);

The "continuous ETL" process the GP is talking about would be exactly this kind of thing, and just as trivial. (In fact it would be this exact same code, just with your mental model flipped around from "promoting data from a staging table into a canonical iceberg table" to "evicting data from a canonical table into a historical-archive table".)

ijustlovemath · 2025-11-05T01:40:48 1762306848

Not to mention one of my favorite tools for adding a postgres db to your backend service: PostgREST. Insanely powerful DB introspection and automatic REST endpoint. Pretty good performance too!

anktor · 2025-11-04T19:35:55 1762284955

What does data frames mean in this context? I'm used to them in spark or pandas but does this relate to something in how duckDB operates or is it something else?

dunefox · 2025-11-04T22:22:34 1762294954

It's a python data frame.

mslot · 2025-11-04T18:17:10 1762280230

You can use it as a read layer for for specific metadata JSON URL or a table in a REST catalog. The latter got merged quite recently, not yet in docs.

mslot · 2025-11-04T18:09:29 1762279769

I gave a talk on that at Data Council, then still discussing the pg_lake extensions as part of Crunchy Data Warehouse.

https://youtu.be/HZArjlMB6W4?si=BWEfGjMaeVytW8M1

Also, nicer recording from POSETTE: https://youtu.be/tpq4nfEoioE?si=Qkmj8o990vkeRkUa

It comes down to the trade-offs made by operational and analytical query engines being fundamentally different at every level.

mslot · 2025-11-04T17:38:45 1762277925

There are Postgres roles for read/write access to the S3 object that DuckDB has access to. Those roles can create tables from specific files or at specific locations, and can then assign more fine-grained privileges to other Postgres roles (e.g. read access on a specific view or table).

mslot · 2025-11-04T17:34:53 1762277693

You could say

With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.

With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.

Of course, they also use different table formats (though similar in data layer) with different pros and cons, and the query frontends differ in significant ways.

An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.