More

spenczar5 · 2025-11-07T16:54:08 1762534448

"But accepting the full S3Client here ties UploadReport to an interface that’s too broad. A fake must implement all the methods just to satisfy it."

This isn't really true. Your mock inplementation can embed the interface, but only implement the one required method. Calling the unimplemented methods will panic, but that's not unreasonable for mocks.

That is:

    type mockS3 struct {
        S3Client
    }

    func (m mockS3) PutObject(...) {
        ...
    }

You don't have to implement all the other methods.

Defining a zillion interfaces, all the permutations of methods in use, makes it hard to cone up with good names, and thus hard to read.

skybrian · 2025-11-07T19:45:18 1762544718

While you can do that, having unused methods that don't work is a footgun. It's cleaner if they don't exist at all.

lenkite · 2025-11-08T07:29:21 1762586961

Not to mention, introducing all the permutations of methods as separate interfaces on the "consumer side" means extreme combinatorial explosion of interfaces. It is far better to judge the most common patterns and make single-method interfaces for these on the provider side.

Lots of such frequently-quoted Go "principles" are invalid and are regularly broken within the standard library and many popular Go projects. And if you point them out, you will be snootily advised by the Go gurus on /r/golang or even here on HN that every principle has exceptions. (Even if there are tens of thousands of such exceptions).

the_gipsy · 2025-11-07T17:22:27 1762536147

Is this pattern commonly used? Any drawbacks?

Sounds much better than the interface boilerplate if it's just for the sake of testing.

jgdxno · 2025-11-07T18:18:55 1762539535

At work we use it heavily. You don't really see "a zillion interfaces" after a while, only set of dependencies of a package which is easy to read, and easy to understand.

"makes it hard to cone up with good names" is not really a problem, if you have a `CreateRequest` method you name the interface `RequestCreator`. If you have a request CRUD interface, it's probably a `RequestRepository`.

The benefits outweigh the drawbacks 10 to one. The most rewarding thing about this pattern is how easy it is to split up large implementations, and _keep_ them small.

durbatuluk · 2025-11-07T19:37:38 1762544258

Any method you forget to overwrite from the embed struct gives a false "impression" you can call any method from mockS3. Most of time code inside test will be:

    // embedded S3Client not properly initialized
    mock := mockS3{}
    // somewhere inside the business logic
    s3.UploadReport(...) // surprise

Go is flexible, you can define a complete interface at producer and consumers still can use their own interface only with required methods if they want.

spenczar5 · 2025-11-07T03:54:12 1762487652

no, "pdf" is a very typical shortening for "probability density function," its correct.

spenczar5 · 2025-11-07T02:17:23 1762481843

Its JSON schema, well standardized, and predates LLMs: https://json-schema.org/

zahlman · 2025-11-07T05:05:02 1762491902

Ah, so I can specify how I want it to describe the tool request? And it's been trained to just accommodate that?

simonw · 2025-11-07T05:38:44 1762493924

Most LLMs have tool patterns trained into them now, which are then managed for you by the API that the developers run on top of the models.

But... you don't have to use that at all. You can use pure prompting with ANY good LLM to get your own custom version of tool calling:

  Any time you want to run a calculation, reply with:
  {{CALCULATOR: 3 + 5 + 6}}
  Then STOP. I will reply with the result.

Before LLMs had tool calling we called this the ReAct pattern - I wrote up an example of implementing that in March 2023 here: https://til.simonwillison.net/llms/python-react-pattern

spenczar5 · 2025-11-04T18:08:52 1762279732

Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to iceberg directly, without going through pg_lake, will pg_lake's catalog correctly reflect things right away?

pgguru · 2025-11-04T18:16:21 1762280181

We have some level of external iceberg table read-only support, but it is limited at the moment. See this example/caveat: https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/fil...

mslot · 2025-11-04T18:17:10 1762280230

You can use it as a read layer for for specific metadata JSON URL or a table in a REST catalog. The latter got merged quite recently, not yet in docs.

spenczar5 · 2025-11-04T17:55:36 1762278936

I think I don't understand postgres enough, so forgive this naive question, but what does pushing down to the remote tables mean? Does it allow parallelism? If I query a very large iceberg table, will this system fan the work out to multiple duckdb executors and gather the results back in?

pgguru · 2025-11-04T18:06:40 1762279600

In any query engine you can execute the same query in different ways. The more restrictions that you can apply on the DuckDB side the less data you need to return to Postgres.

For instance, you could compute a `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning all the rows, and letting Postgres itself count the number of results, but this is extremely inefficient, since that same value can be computed remotely.

In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.

Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.

There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.

How DuckDB executes the portion of the query it gets is up to it; it often will involve parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` in the incoming query and just skip any files which do not have a min_value/max_value that would contain that.

spenczar5 · 2025-11-04T18:17:41 1762280261

Thanks for the detailed answer!

I use DuckDB today to query Iceberg tables. In some particularly gnarly queries (huge DISTINCTs, big sorts, even just selects that touch extremely heavy columns) I have sometimes run out of memory in that DuckDB instance.

I run on hosts without much memory because they are cheap, and easy to launch, giving me isolated query parallism, which is hard to achieve on a single giant host.

To the extent that its possible, I dream of being able to spread those gnarly OOMing queries across multiple hosts; perhaps the DISTINCTs can be merged for example. But this seems like a pretty complicated system that needs to be deeply aware of Iceberg partitioning ("hidden" in pg_lake's language), right?

Is there some component in the postgres world that can help here? I am happy to continue over email, if you prefer, by the way.

pgguru · 2025-11-04T18:36:14 1762281374

Well, dealing with large analytics queries will always perform better with larger amounts of memory... :D You can perhaps tune things to perform based on the amount of system memory (IME 80% is what DuckDB targets if not otherwise configured). Your proposed system does sounds like it introduces quite a bit of complexity that would be better served just by using hosts with more memory.

As far as Iceberg is concerned, DuckDB has its own implementation, but we do not use that; pg_lake has its own iceberg implementation. The partitioning is "hidden" because it is separated out from the schema definition itself and can be changed gradually without the query engine needing to care about the details of how things are partitioning at read time. (For writes, we respect the latest partitioning spec and always write according to that.)

enether · 2025-11-07T12:12:17 1762517537

What does "remotely" mean in this context? My understanding is that all of this runs on the same machine - your Postgres server machine runs DuckDB on the same machine via the extension.

I assume you simply mean DuckDB, being a columnar engine, is more efficient in doing this work than PG is

spenczar5 · 2025-10-28T14:58:42 1761663522

sure, but that is a fairly trivial tool call too. Ask it to name the distribution family and its parameter values.

spenczar5 · 2025-10-27T06:02:34 1761544954

Frankly, yes.

The models are one part of the story. But the software around it matters at least as much: what tools does the model have access to, like bash or just file reading or (as in your example!) just a cache of files visited by the IDE (!). How does the software decide what extra context to provide to the model, how does it record past learnings from conversations and failed test runs (if at all!) and how are those fed in. And of course, what are the system prompts.

None of this is about the model; its all "plain old" software, and is the stuff around the model. Increasingly, that's where the quality differences lie.

I am sorry to say but Copilot is just sort of shoddy in this regard. I like Claude, some people like Codex, there are a bunch of options.

But my main point is - its probably not about the model, but about the products built on the models, which can vary wildly in quality.

noduerme · 2025-10-27T06:17:42 1761545862

In my experience with both Copilot and Claude, Claude makes subtler mistakes that are harder to spot, which also gobbles up time. Yes, giving it CLI access pretty cool and helps with scaffolding things. But unless you know exactly what you want to write, and exactly how it should work, to the degree that you will notice the footguns it can add deep in your structures, I wouldn't recommend anyone use it to build something professional.

spenczar5 · 2025-09-28T14:54:45 1759071285

Where (vaguely, of course) do you live? That sounds so different from my experience in a city.

mothballed · 2025-09-28T14:55:47 1759071347

Rural AZ

spenczar5 · 2025-09-04T15:07:08 1756998428

I agree, but there are other possibilities in between those two extremes, like Quivr [1]. Schemas are good, but they can be defined in Python and you get a lot more composability and modularity than you would find in SQL (or pandas, realistically).

1: https://github.com/B612-Asteroid-Institute/quivr

spenczar5 · 2025-08-15T18:14:28 1755281668

I dont know, arguing that http/2 is safer overall is a... bold claim. It is sufficiently complex that there is no standard implementation in the Python standard library, and even third party library support is all over the place. requests doesn't support it; httpx has experimental, partial, pre-1.0 support. Python http/2 servers are virtually unavailable at all. And it's not just Python - I remember battling memory leaks, catastrophic deadlocks, and more in the grpc-go implementation of http/2, in its early days.

HTTP 1.1 connection reuse is indeed more subtle than it first appears. But http/2 is so hard to get right.

Bender · 2025-08-15T20:14:57 1755288897

Speaking of http/2 [1] - August 14, 2025

The underlying vulnerability, tracked as CVE-2025-8671, has been found to impact projects and organizations such as AMPHP, Apache Tomcat, the Eclipse Foundation, F5, Fastly, gRPC, Mozilla, Netty, Suse Linux, Varnish Software, Wind River, and Zephyr Project. Firefox is not affected.

[1] - https://www.securityweek.com/madeyoureset-http2-vulnerabilit...

yencabulator · 2025-08-18T19:53:42 1755546822

Protocol smuggling is a lot more severe than DoS.

ameliaquining · 2025-08-15T19:13:13 1755285193

These sound to me like they are mostly problems with protocol maturity rather than with its fundamental design. If hypothetically the whole world decided to move to HTTP/2, there'd be bumps in the road, but eventually at steady state there'd be a number of battle-tested implementations available with the defect rates you'd expect of mature widely used open-source protocol implementations. And programming language standard libraries, etc., would include bindings to them.

jcdentonn · 2025-08-15T18:26:53 1755282413

Not sure about servers, but we had http/2 clients in java for a very long time.

cyberax · 2025-08-15T18:34:37 1755282877

An HTTP/2 client is pretty easy to implement. Built-in framing automatically improves a lot of complexity, and if you don't need multiple streams, you can simplify the overall state machine.

Perhaps something like "HTTP/2-Lite" profile is in order? A minimal profile with just 1 connection, no compression, and so on.

spenczar5 · 2025-08-15T18:37:25 1755283045

Isn't the original post about servers? A minimal client doesn't help with server security.

I would endorse your idea, though, speaking more broadly! That does sound useful.

jiehong · 2025-08-15T18:48:29 1755283709

nghttp2 is a C lib that can be used for serving as a server in many cases. Rust has the http2 crate.

Perhaps it isn’t that easy, but it could be put in common and used a bit everywhere.