I was under the impression that Yugabyte requires signing a CLA to contribute, which leads me to avoid it for fear of them relicensing the thing when the VCs start squeezing.
Also: it is quite unique and single-vendor driven. Seems like too much of a risk longer term, but that is just my take.
EDIT: in response to your question, I did run a PoC of it, but I wasn't able to create very large indexes without the statement timing out on me. Simple hand-benchmarks of complex joins on very large tables were very slow, if they finished at all. I suppose systems like this and CockroachDB really need short, simple statements and high client concurrency rather than large, complex queries.
That’s normal for building indices on large tables, regardless of the RDBMS. Increase the timeout, and build them with the CONCURRENTLY option.
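For concreteness, here is a rough sketch of what that advice boils down to, written as a small Python helper that just prints the statements. The SQL is plain PostgreSQL syntax; the helper name, the index naming scheme and the example table/column are made up for illustration, and whether the same knobs apply to your Yugabyte version is something I haven't verified.

```python
def index_build_statements(table: str, column: str, timeout: str = "0") -> list[str]:
    """Build the session-level statements for a long-running index build.

    timeout is a PostgreSQL statement_timeout value; "0" disables the timeout.
    Identifiers are interpolated as-is, so only pass trusted names.
    """
    index_name = f"idx_{table}_{column}"  # hypothetical naming scheme
    return [
        f"SET statement_timeout = '{timeout}';",
        # CONCURRENTLY avoids blocking writes during the build;
        # it cannot be run inside a transaction block.
        f"CREATE INDEX CONCURRENTLY {index_name} ON {table} ({column});",
        "RESET statement_timeout;",
    ]


# Example: allow up to four hours for the build in this session only.
for statement in index_build_statements("orders", "customer_id", timeout="4h"):
    print(statement)
```

Setting the timeout per session keeps the stricter default in place for normal application traffic.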
> Query speed
Without knowing your schema and query I can’t say with any certainty, but it shouldn’t be dramatically slower than single-node Postgres, assuming your table statistics are accurate (have you run ANALYZE <table>?), necessary indices are in place, and there aren’t some horrendously wrong parameters set.
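To illustrate the "indices in place, statistics fresh" point, here is a tiny self-contained sketch. It uses SQLite from Python's standard library purely as a stand-in so it runs anywhere; the table and index names are invented, and the plan text is SQLite's, not Postgres's. On Postgres you would run ANALYZE on the table and look at EXPLAIN (ANALYZE) output instead.

```python
import sqlite3


def plan(conn: sqlite3.Connection, sql: str) -> str:
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

query = "SELECT sum(total) FROM orders WHERE customer_id = 42"

# Without an index the planner has to scan the whole table.
plan_before = plan(conn, query)

# Add the index and refresh the planner statistics, then look again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.execute("ANALYZE")
plan_after = plan(conn, query)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan is a full scan and the second uses the index; comparing plans like this is usually the quickest way to tell whether a slow join is a planner problem or a limit of the system itself.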
Not sure about the CLA process, but the database is already under a restrictive, proprietary license:
## Free Trial
Use to evaluate whether the software suits a particular
application for less than 32 consecutive calendar days, on
behalf of you or your company, is use for a permitted purpose.
It's not really clear what this means (what is a permitted purpose?), but it seems the intent is that after 32 days, you are expected to pay up. Or at least prepare for a future when the infrastructure to charge customers is in place (if it isn't there yet).
Thanks. I think that only covers the commercial bits they run themselves though:
"The entire database with all its features (including the enterprise ones) is licensed under the Apache License 2.0
The binaries that contain -managed in the artifact and help run a managed service are licensed under the Polyform Free Trial License 1.0.0."
Index creation should not be controlled by the statement timeout, but by backfill_index_client_rpc_timeout_ms, which defaults to 24 hours. It may have been lower in older versions.
While it is fun to see how to creatively solve such issues, it does raise the question of manageability.
When sharding data into loosely (FDW) coupled silos it would become tricky to make consistent backups, ensure locking mechanisms work when sharded data might sometimes be directly related, handle zone/region failures gracefully, prevent hot spots, perform multi-region schema changes reliably, etc.
I suppose this pattern principally only works when the data is in fact not strongly related and the silos are quite independent. I wouldn't call that a distributed system at all, really. This may be a matter of opinion of course.
It does give a "When all you have is a hammer..." vibe to me and begs the question: why not use a system that's designed for use-cases like this and do it reliably and securely? E.g.: https://www.cockroachlabs.com/docs/stable/multiregion-overvi... (yes, I know full data domiciling requires something even more strict, but I currently don't know of any system that can transparently span the globe and stay performant while not sharing any metadata or caching between regions)
> It does give a "When all you have is a hammer..." vibe to me and begs the question: why not use a system that's designed for use-cases like this and do it reliably and securely?
(disclaimer: blog post author)
A reason would be that you want to stick to pure Postgres, for example because you want to use Postgres extensions, or prefer the liberal Postgres license.
It can also be a matter of performance: distributed transactions are necessarily slower, so if you can almost always avoid them by connecting to a single node that has all the data the transaction needs, you are going to get better performance.
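As a toy illustration of that routing idea (not how any particular system implements it), here is a minimal sketch: given the keys a transaction touches and a function mapping each key to its home node, decide whether the fast single-node path is possible. The key format and both function names are hypothetical.

```python
from typing import Callable, Iterable


def plan_transaction(
    keys: Iterable[str], node_for_key: Callable[[str], str]
) -> tuple[str, list[str]]:
    """Return the execution mode and the sorted list of nodes involved."""
    nodes = sorted({node_for_key(key) for key in keys})
    # One node holds everything: a plain local transaction is enough.
    # More than one node: some form of distributed commit is needed.
    mode = "single-node" if len(nodes) <= 1 else "distributed"
    return mode, nodes


def node_by_prefix(key: str) -> str:
    """Hypothetical placement rule: the node is the prefix before ':'."""
    return key.split(":", 1)[0]


print(plan_transaction(["eu:user-1", "eu:order-9"], node_by_prefix))
print(plan_transaction(["eu:user-1", "us:order-7"], node_by_prefix))
```

The whole argument rests on the first case being by far the most common one for the workload.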
Hi there! Thank you for the post and your work on pgzx!
Though it depends on the system (CockroachDB can place leaders on specific nodes to speed up local queries, it has global tables, and otherwise there are always follower reads), those are of course valid reasons.
Admittedly, if you want to keep data "pinned", you're into manual placement rather than horizontal scaling, but that's nitpicking and there are pros and cons.
I do enjoy the freedom of Postgres and am hopeful that its virtually prehistoric storage design becomes a non-issue thanks to tech such as Neon and OrioleDB. The option to decouple storage would provide wonderful flexibility for solutions like yours too, I think.
Not Postgres-based (but wire- and mostly syntax-compatible): CockroachDB using column families is much like a columnar MPP.
Yugabyte is PG-based and MPP but not columnar.
The presence and use of column families is only half of the puzzle - it doesn't strictly imply that the execution engine is capable of working in a vectorized columnar style (which is necessary for competitive OLAP).
...it seems the distinction here is that the vectorization is only present in the execution layer and not in the storage layer as well. I would guess that from a storage perspective, even with column families in play, everything is being streamed out of a sorted LSM engine regardless. So there isn't additionally some highly-tuned buffer pool serving up batches of compressed column files etc.
Indeed. As I commented elsewhere, this is just about the general design. It is not targeting OLAP in this case (even though I do believe CockroachDB employs vectorization for reads).
They don't optimize for it and I suppose the data distribution is primarily aimed at parallel OLTP rather than OLAP. Just wanted to mention that design-wise it is similar but that's indeed not all there is to it.
I'd be hesitant to store large volumes of data on a single PG instance; don't see how a single-writer, filesystem-based database is suitable at all for data that is large enough to warrant columnar storage
I am more interested in actual OLAP than HTAP, and don't see a strong OSS OLAP offering on the market right now; see my rants in a previous discussion: https://news.ycombinator.com/item?id=36992039
But I should look at TiDB, it looks like an interesting and relatively mature project.
Thank you for the correction. Indeed it is not entirely the same thing. Though I'd expect that at least the benefit of not having to read columns that aren't in the family would still help (haven't tried in earnest). I suppose compression is not an option though.
Not to say that I disagree with you in principle, but I'd think there is a large difference between disagreeing with someone's opinions and disagreeing with someone's values.
You generally can't have a meaningful conversation in the latter case.
Personally I find the idea of a very small cartel of companies holding all the cards quite dystopian. There are certain people that I very much would not like to be in control of all that. Guess I'm not ready to accept the supposed inevitability of this end game.
The way I read it is simply a broadly applicable notion that a certain type of people will demonstrate their power and (perceived) higher status over others by openly breaking the others' rules without consequence.
EDIT for better wording
Ah well, we're all humans with our own blind focus on specific wants and needs.
Most managers are not maniacs. Our capitalistic society just has a lot of bad incentives that are hard to change without damaging its good parts. Spreading power helps to balance things out. I know unions have a bad rep in the US but they do have good merit where government is laissez faire and no, we do not need to go heavy socialist (not meant to imply you were advocating for such). In general I'd wish we'd focus a bit more on societal stability than economic output but we need both.
Honestly I think your comment shows part of the problem here, though I do understand your point. But if "mere" employees feel that they are personally in competition with basically the rest of the world it easily becomes a war of attrition and employees become kind of soldier-ants. I'd rather be my own person thank you very much.
People can in fact be loyal and productive in 8 hours per day without being a slacker and maybe have some energy left to partake in society outside of work. Nobody likes people with a bad work ethic, but if a good work ethic is considered to require regular overtime then maybe it's time to reconsider that concept.
The unfortunate part is that your company is in competition with the rest of the world.
I made the comment about slackers because having moved from retail to military to tech, the amount of privilege shown unironically by tech employees on "crunch" is borderline unbelievable.
52. Not been pulling all-nighters (okay maybe once or twice in exceptional circumstances that were also appreciated as such by the management after) as I've always believed your point to be right. If you're not (in part) owning the company then it should not be your life unless you don't have one and don't care.
To each his own but an environment where overtime is the norm (even if driven by employees themselves) is not a healthy place to be when you have a family.
Besides, when a company is so successful as to need more than the normal hours available, then the onus is on the leadership to hire more people; it's in their best interest to not have to rely on such devotion.
...and if crunch-time is instead caused by the company _not_ being successful and not having the money to hire people then toxicity is pretty much guaranteed in short order.