Database drivers: Naughty or nice? (prequel.co)
94 points by ctc24 on Dec 12, 2022 | 44 comments


This article would be way more useful if it named the offending drivers and linked to the relevant GitHub issues. As is, this is just a set of useless curiosities.

I came in wanting to check if any of the drivers I use were affected by any of the bugs they found. No idea!


Understand the sentiment. On the flip side, and as noted in the post, we don't want it to turn into a "name and shame" situation. We have a lot of respect for the folks maintaining the drivers we use; they do a pretty hard job, and we don't want to come out swinging and take shots at them. So we're trying to walk that line.

We also think the information provided in the post is valuable as is: it's easy enough to check whether a driver you use faces any of the issues we mention.

With that said, here's a bit more color:

- the native Snowflake driver for GoLang does not implement COPY functionality (or at least it did not when we last tried to use it).

- the memory leaks are pretty prevalent across ODBC drivers. It's worth watching out for that one if you're using any ODBC driver.

- the breaking change on connection string was Databricks' GoLang driver.

- the DECIMAL one is pretty prevalent too. BigQuery only allows you to go up to DECIMAL(38,9) while most other drivers let you go to 18 on scale, and ClickHouse supports precision/scale of up to 76. Redshift complains loudly if you try to insert a DECIMAL(38,17) into a DECIMAL(38,18) column, for instance.
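Since DECIMAL limits differ per engine, a cheap client-side pre-flight check can catch values that won't fit before the driver mangles them or the warehouse rejects them. A minimal Python sketch (the `fits` helper is illustrative, not from any particular driver):

```python
from decimal import Decimal

def fits(value, precision, scale):
    """Check whether `value` fits a DECIMAL(precision, scale) column exactly."""
    sign, digits, exponent = value.as_tuple()
    frac_digits = max(0, -exponent)              # digits after the point
    int_digits = max(0, len(digits) + exponent)  # digits before the point
    return frac_digits <= scale and int_digits <= precision - scale
```

For example, `fits(x, 38, 9)` for BigQuery's limit, or `fits(x, 76, s)` for ClickHouse's widest Decimal.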

Hope the added color is helpful!

edit: formatting


Agreed. Without actual references, this is marketing blather, not a technical article.

The worst part is that it doesn't have to be name and shame. Take the "epoch" discussion, for example. The fact that "epochs" differ in implementation is something that isn't even a bug--it's just different. That alone is likely surprising to a lot of people and would probably be worth an article.
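To make the epoch point concrete (background fact, not from the article): Unix time counts from 1970-01-01, while Postgres's binary timestamps count microseconds from 2000-01-01, so a driver that confuses the two is silently about 30 years off. A tiny Python sketch:

```python
from datetime import datetime, timedelta, timezone

# "Epoch" depends on who you ask: Unix time counts seconds from 1970-01-01,
# Postgres timestamps count microseconds from 2000-01-01, NTP from 1900-01-01.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def pg_micros_to_datetime(micros):
    """Interpret a raw Postgres-style timestamp (microseconds since 2000-01-01)."""
    return PG_EPOCH + timedelta(microseconds=micros)

# Naively treating the same integer as Unix-epoch microseconds would land
# you this many days in the past.
EPOCH_SHIFT_DAYS = (PG_EPOCH - UNIX_EPOCH).days
```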

Of course, the real issue is probably that "boring" databases just work and "exciting" databases are full of bugs. If you're a database SaaS startup, slagging the databases that everybody considers cool and hip isn't going to be good for your exit.


Was wondering the same.

Are they scared of pointing fingers or what?


SQL database drivers have so much in common, yet each represents schema and data table(s) differently on the wire. Some handle cancellation within the protocol; others require a workaround (like a second DB connection that issues a KILL command). TDS (Microsoft SQL Server) is actually one of the better designed and better documented protocols out there (due in large part to historical court orders).
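As a runnable stand-in for the in-protocol style (using SQLite purely for illustration: its `Connection.interrupt()` cancels a running statement from another thread, no second connection or KILL required):

```python
import sqlite3
import threading
import time

# A query that never finishes on its own: an unbounded recursive CTE.
ENDLESS = """
WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c)
SELECT count(*) FROM c
"""

def run_query(conn, results):
    try:
        conn.execute(ENDLESS)
        results.append("completed")
    except sqlite3.OperationalError:  # raised as "interrupted" on cancel
        results.append("cancelled")

results = []
conn = sqlite3.connect(":memory:", check_same_thread=False)
worker = threading.Thread(target=run_query, args=(conn, results))
worker.start()
time.sleep(0.2)    # let the query get going
conn.interrupt()   # cancellation built into the API
worker.join()
```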

My takeaway is that compact, schema-aware streaming of data is not a well-developed field. I think we can do better. Not only that: developing such a schema, a protocol, and the associated tooling is key to significantly better data-centric applications end to end, not just at the database.


Agreed! It's kind of fascinating that a better standard hasn't emerged yet, but not surprising given that each one is developed in isolation by a different organization.

Definitely a lot of room for improvement. Curious if anyone has thoughts about the best route to getting such a standard / protocol in place: it seems like a lot of stars would have to align, but would be invaluable nonetheless.


As mentioned by others elsewhere in the thread, Apache Arrow Flight SQL aims to be a standard wire protocol that multiple databases can implement, using a columnar data representation: https://arrow.apache.org/docs/format/FlightSql.html

However, it's early days yet.


Such a nonproprietary protocol would reduce vendor lock-in, hence vendors have no incentive to develop one.


As mentioned under the other "XML" suggestion, dedicated services can be created to serve up the standard even if DB vendors don't cooperate.


I've had such problems with the official AWS Redshift JDBC driver -- mostly around memory usage compared to the PG 8.4 driver that also works. But also:

- Much slower to cancel queries

- Still reading and buffering most of the result after canceling.

- Weird escaping issues

Glad to be done with that.


I know this isn't a specifically Amazon issue or topic, but I am so disappointed by their SDKs. From Amplify to AppSync, they're just layers of half-baked ideas and implementations. I've hit a lot of bugs and limitations on their platform.

And they still have no concept of native dev tooling for most of their services. It's always "just set up separate environments for testing, CI, development, staging, release" and pay them more money.

I hate AWS, so much.

I am not shocked their DB drivers have issues.


Completely agree with this. The only products I'd even consider by choice from Amazon are the core IaaS products. Everything they built in house on top of their own platform is a buggy shit show.

One recent problem I have been dealing with, which I can't detail because they'll know who I am and I am under NDA, involved a major release last year. I found two critical bugs in the stack, which they had extended from an open standard. As we're a massive company with a big spend, I managed to get the team leads and enterprise support on a call, and I found the whole fucking thing was clearly looked after by two guys who didn't know what the hell they were doing, had never spent any time on the stack they were extending, and had never even considered a real-world use case.

I have considered looking into Azure but I'm not sure that'll be any better.


From my experience, having used each cloud extensively at three different jobs, this is my take. As always, YMMV; that said:

- AWS is the worst of all the clouds: not just (typically) in pricing, but it definitely has the worst DX, the most confusing array of options, and the worst documentation of the big 3. They're flexible, I guess, and have tons of services, but no easy, tight way to integrate them, and they're very configuration heavy. I just didn't care for how they presented their happy paths (WTF is up with VTL?!). I think Lambdas are overrated, and idk why they have cold starts in 2022 when all their competitors don't anymore.

- Azure has some good stuff, including local emulators you can download for some services, plus mock packages for programmatically emulating some of their services. Its biggest downfall for me was the confusing number of options, and sometimes the pricing is opaque about what the final cost of a service will be. Both combined made it harder to use than I would have liked, but it was a step up (to me) from AWS. A note of caution, though: Azure doesn't have the greatest support, per se, for things outside their "stack". .NET of course works well, and Node, Python, and more recently Go have seen good support (particularly Node), but it has less flexible ways to do more customized deployments, in my experience. That doesn't mean you can't, but you pay a lot more for the privilege. They're also more aggressive in the upsell department. Azure is particularly great if you're migrating older Windows Server apps and/or heavily using things like Active Directory. If you're willing to pay, they do have good support. I recall Azure having a pretty decent cloud SDK, and they supported all the cross-cloud tools (Terraform, CF, etc.) decently as well. One thing I like about Azure is that they have the least confusing CLIs, IME.

- GCP has pretty good DX. I think they're more expensive in a lot of ways if you aren't careful, but their "main" services are pretty cost competitive. In particular, I feel like Cloud Run is the underrated alternative to things like AWS Lambda or other FaaS offerings. GCP also has a good number of emulators, and Firebase is still pretty great. GCP is very happy-path oriented (more so than the other cloud providers) in my experience, but generally they seem to have picked good "happy paths", again IME. BigQuery is more byzantine in cost than I'd like, same with Spanner. Their support is pretty bad IME unless you pay for big contracts, though; even then it wasn't the best, but at least we could get a human on the phone. Also, I wouldn't trust their one-click apps: GCP exerts hardly any control over them to ensure quality, so unless you know the vendor is specifically supporting them for your use case, avoid them. I did find their permissions model to be as confusing as AWS's. I didn't like their cloud SDK as much, but it was better than AWS's.

What would I do if I were able to start over again in 2022, though? I'd deploy to something like Supabase and/or Cloudflare if I wanted pretty much 100% managed services. Otherwise, if I just wanted compute and to add in my own services, I'd do a combo of Fly.io with PlanetScale (or a similar DB provider), using Cloudflare services (such as R2 and Workers) for files/caching, or a combo of B2 + Upstash. You could proxy / load-balance your HTTP/TCP traffic via Cloudflare as well (idk that Fly.io auto-proxies provisioned instances for load balancing; I'm pretty sure it doesn't).


Don't start me on VTL. I spent two weeks trying to delete a VTL which was sitting unused at $5000 a month.

On lambda, the issue I mentioned was actually because one of their services was built on top of that and maintained state between lambda runs whereas the open source product they ripped off did not. That meant their entire product was stateful when it should have been stateless. It compromised everything.

I couldn't possibly use Google after a GApps fiasco I had a couple of years back. No way.

Thanks for the rest of the info. I shall carefully decipher it.

For the next platform migration we do, it might be back to a dedicated DC, which is where we came from. We have stable, constrained, predictable load. The TCO was cheaper, the staff were fewer, the performance was better, and the licensing of some commercial products we leverage was much easier to handle and cheaper.


Which Lambda competitors (which aren't v8 isolate based) don't have cold starts?


Hello, maintainer of the pg8000 driver for PostgreSQL here. I'm keen to know of any problems you had if you used pg8000. Thanks! https://github.com/tlocke/pg8000


This is one of the few GitHub repos I have seen that are very, very well documented. Great job; starred.


Sweet! I didn't know there was a pure Python alternative to psycopg2 -- installing that has caused me many headaches.

I've started writing most new code async and switched to asyncpg, but it's good to know there's also an alternative for sync code!


Appreciate that and will do!


I remember dealing with DataDirect on Linux more than 10 years ago. I don't recall the specifics, but under certain error conditions, either the ODBC driver we were using or the driver manager would simply do `exit(0)` with no errors logged anywhere. That was a "fun" one to solve...


Speaking as a professional data yeeter who has been bit by weird driver issues in the past, I’ll go against the grain to say that I found this really helpful as a checklist to pattern match against when troubleshooting, even without calling out specific drivers. Thanks for sharing!


The world would be a better place if database drivers were completely abandoned as a way for clients to connect to databases. A standard API, implemented by multiple vendors, is a vastly preferable solution. Arrow Flight is an example of this.

https://arrow.apache.org/blog/2019/10/13/introducing-arrow-f...


Even within the Arrow project, there's still room for drivers just because not every vendor is going to implement the same wire protocol (at least on a feasible timeline). Hence both "ADBC" [1] and Flight SQL [2] (note: NOT a SQL dialect, it is a wire protocol) coexist in complementary niches.

[1]: https://arrow.apache.org/docs/format/ADBC.html [2]: https://arrow.apache.org/docs/format/FlightSql.html


I generally think that's a good idea, but be aware that the protocols are more interesting than you might first imagine, and that leads to a lot of the differences between drivers for different databases.

For instance, when setting a user's password in Postgres, you can do the hashing on the client side, even for non-trivial schemes like SCRAM. This means that the password itself never needs to move over the network, and that's very desirable. Speaking of authentication methods, that also opens up a big topic.
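A sketch of the salted-verifier derivation SCRAM-SHA-256 uses (the shape Postgres stores server-side; SASLprep normalization omitted), standard library only:

```python
import base64
import hashlib
import hmac
import os

def scram_sha256_verifier(password, salt=None, iterations=4096):
    """Derive a Postgres-style SCRAM-SHA-256 verifier entirely client-side.

    Only this verifier needs to reach the server; the cleartext password
    never crosses the wire.
    """
    salt = salt if salt is not None else os.urandom(16)
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()  # what the server checks against
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    b64 = lambda raw: base64.b64encode(raw).decode("ascii")
    return f"SCRAM-SHA-256${iterations}:{b64(salt)}${b64(stored_key)}:{b64(server_key)}"
```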

There are also important modes. For instance, the client encoding controls how strings are transcoded when they get to the server. That allows the client to not know/care what the encoding of the database is. You could demand that everything is UTF-8, and that's one philosophy, but not everyone agrees.
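The transcoding itself is a one-liner, but it has to happen somewhere, and that somewhere is the driver. A toy Python illustration (encodings picked arbitrarily):

```python
def transcode(db_bytes, db_encoding, client_encoding):
    """Re-encode raw server-side bytes into the client's requested encoding,
    so the client never has to know what the database uses internally."""
    return db_bytes.decode(db_encoding).encode(client_encoding)

# e.g. a LATIN1 database serving a UTF-8 client:
# transcode("café".encode("latin-1"), "latin-1", "utf-8")
```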

In practice, I think it'll be a while before there is consensus on all these points. And even when there is, the standard will need to evolve to handle new auth methods, etc.

If we invent a standard protocol, it will probably be more of a fallback for simple cases when the language framework doesn't offer a driver yet. Still helpful, though.


> For instance, when setting a user's password in Postgres, you can do the hashing on the client side, even for non-trivial schemes like SCRAM. This means that the password itself never needs to move over the network, and that's very desirable.

Off-topic, but I’m surprised more online apps don’t employ something similar.

It would all but eliminate accidental leaks that occur from logs being incorrectly stored / misconfigured, not to mention worries about MITM attacks (useful for corporate networks, or public networks).

Given how many people share usernames, emails, and passwords across sites I find it quite important to mitigate those issues as much as possible.


Nice.

This [1] appears to be the SQL-specific layer on top of Arrow Flight. It seems a bit chatty: if I read it correctly, two network requests are required for each query.

[1] https://arrow.apache.org/docs/format/FlightSql.html


Yup. The chattiness is to account for distributed databases, so you can spread the result set across multiple instances.

That said there is a proposal for base Flight RPC to help allow embedding small results directly into the first response, that mostly needs someone to draft a prototype and push it through. (That doesn't help the case of a large-ish response from a single backend, though; that may also need some work, if we want to get rid of the second request.)


I used to work with Sybase, and its JDBC driver was pure madness. There were times when the same query on the same data would return different results. Never figured out why. We ended up using its ODBC driver for reporting instead.


It is the same shitshow in JDBC land.


I remember something about MySQL not treating localhost the same as 127.0.0.1 (the client uses the local Unix socket for `localhost` but TCP for 127.0.0.1). This led to very different performance. (Long ago, not sure about the details.)

Also surprised by libraries not using the more efficient protocols even though they were defined.

Compression also doesn’t seem to be a thing.


Found out recently that, using Snowflake from Spark, you still can't normally represent anything wider than int32; the driver doesn't support it.

I think it's in the works, but it makes me laugh that the big-data folks never really cared about numbers bigger than ~2.1B or so.


Big numbers would be in numeric formats. Int32/int64 would be bit fields or sequence numbers/row IDs. Given enough time and success you will overflow int32 for sequence number.
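To put a number on that ceiling (illustrative Python, using ctypes to mimic a signed 32-bit field):

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647

def as_int32(n):
    """What a signed 32-bit field would actually store for n (wraps on overflow)."""
    return ctypes.c_int32(n).value
```

One past the limit wraps to the most negative value, which is how a sequence column that "worked fine for years" suddenly goes negative.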


Why isn't there an HTTP protocol like "Query:" or "SQL:" such that drivers and ODBC/JDBC are not required to talk to a typical database? In other words, a database web service standard. Here's a sample of what the XML could look like:

https://www.reddit.com/r/Database/comments/p21u5d/standard_q...


For completeness, Flight SQL (as mentioned elsewhere in this thread) aims to provide such a standard protocol (gRPC over HTTP/2):

https://arrow.apache.org/blog/2019/10/13/introducing-arrow-f...


HTTP is stateless, whereas most database connection protocols are stateful.

So if you use HTTP, you need a lot of boilerplate on the client side to manage that state (eg. prepared queries, transactions, session parameters).


Probably because a bunch of these databases, and ODBC, predate HTTP and "web services" by many years.


It generally takes roughly 5 years for such standards to trickle into common systems. But separate software could provide a server gateway/portal to those database products even without vendor cooperation; in other words, a server-based adaptor. Database-vendor-specific services could fairly easily be built in, say, PHP or Python. They'd still conform to the standard (using XML and/or JSON) but only implement a gateway for specific database brands. It's a start...
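As a toy sketch of what such a gateway could look like, here's a standard-library-only Python version (SQLite standing in for a real backend, JSON instead of XML for brevity; the endpoint shape is made up):

```python
import json
import sqlite3
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class QueryHandler(BaseHTTPRequestHandler):
    """POST a SQL string, get JSON {columns, rows} back."""

    def do_POST(self):
        sql = self.rfile.read(int(self.headers["Content-Length"])).decode("utf-8")
        conn = sqlite3.connect(":memory:")  # a real gateway would pool vendor connections
        try:
            cur = conn.execute(sql)
            cols = [d[0] for d in cur.description or []]
            status, body = 200, {"columns": cols, "rows": cur.fetchall()}
        except sqlite3.Error as exc:
            status, body = 400, {"error": str(exc)}
        finally:
            conn.close()
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

def serve():
    """Start the gateway on an ephemeral port; returns the running server."""
    server = HTTPServer(("127.0.0.1", 0), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```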


Adding a layer slows things down and uses more memory, even if just a tiny bit, but for large result sets, I fear it will be a lot.

In addition, it will likely a) be buggy initially and b) not support every feature of every vendor.

I think that's a hard battle to win.


Sure, that could be done... but it would be dramatically slower than any of the existing drivers. As much as the vendor drivers / ODBC / JDBC have their annoying quirks, the wire protocols are pretty efficient.


It could have a binary format option (in addition to those formats listed). But also many kinds of apps don't need heavily compressed/compact data transfers. The "old" way won't go away, just have an alternative.


Ok, but they already have a binary format...

What is the advantage of your hypothetical binary format (which doesn't exist), in addition to those formats listed, over the binary formats that have existed for decades?


Without actually naming the drivers in question so that people can learn from it, this is pretty pointless.


Presumably, this is a warning for people building a fresh system. Even if the authors called out drivers a, b, and c at versions 1, 2, and 3 (respectively), there is no promise that x, y, and z don't exhibit these issues (or that other versions of a, b, and c do or don't exhibit the behavior, either).

This is a reminder to consider our use cases and vet our dependencies for those use cases carefully.


I will presume that at least half of these are issues with the Snowflake connectors until told otherwise... those guys have a great core product, but the connectors are pretty dodgy in many cases I've seen.

Telling as well that DuckDB fixed your issue in a couple of days... had Snowflake support even demonstrated technical understanding of your request in that time? I guess it doesn't really matter if it takes them six months or more to fix the connector bugs anyway.



