All of these tools are insanely expensive (from my own experience at companies t...

mritchie712 · 2025-05-01T21:41:39 1746135699

The best open source options are Airbyte and Meltano / Singer. But it's hard to keep them running. If you self-host them, you'll hit issues at least a few times a month which can each take a few hours to solve.

It's not like running Postgres which "just works". When you self-host Airbyte, you're still building a good bit.

I felt the same way about the cost of data tools. Paying $1,000 for Fivetran, $2,000 for Snowflake, $2,000 for Looker seemed crazy. We bundle all three for $500 / month at https://www.definite.app

fblp · 2025-05-02T01:31:28 1746149488

Your comment reads like a pitch, but I checked definite and I've been looking for something like this. What are you using it for? Did you evaluate any other AI analytics tools?

mritchie712 · 2025-05-02T10:55:52 1746183352

Yes, I've looked at them all. Most AI analytics tools are doing "text to SQL", but writing SQL is a small percentage of data work.

We built an entire stack so the agent can operate across that whole stack (e.g. create pipelines, model data, build reports).

banditelol · 2025-05-02T02:29:35 1746152975

Hi, I've been looking for something like this! Do any of your customers have a success story migrating off bigquery to your platform? And how do you compare to motherduck? (Looks like you built some of your stack on top of duckdb.)

mritchie712 · 2025-05-02T10:39:54 1746182394

Yes, we've had many bigquery / snowflake converts. The reality is, most companies don't have 100 TB of data (which is what those platforms are optimized for). Motherduck has a good post[0] on this:

> There were many thousands of customers who paid less than $10 a month for storage, which is half a terabyte. Among customers who were using the service heavily, the median data storage size was much less than 100 GB.

I'm a fan of what motherduck is doing. We're building something different (opinionated, instant data stack), but yes, we both use duckdb under the hood.

0 - https://motherduck.com/blog/big-data-is-dead/

empireofdust · 2025-05-02T12:39:02 1746189542

Airbyte's not an alternative for reverse ETL though. Also, your pricing page says $1k per month.

mritchie712 · 2025-05-03T13:01:48 1746277308

We have a discount for startups, which many on HN would qualify for.

ssharp · 2025-05-01T20:58:18 1746133098

I'm not sure about Census but Fivetran's free plan has met my needs to sync data from different ad platforms to BigQuery pretty well.

One of their pitfalls is charging by the row. If you're cost-conscious, you really need to watch what data you're syncing and you need to pare it down quite a bit during the 2-week period they give you when setting up a new connector. If you do all that though, you can get a lot of mileage out of the free plan for some use cases.

tomrod · 2025-05-01T21:40:13 1746135613

Or batch massive rows? JSON structures in-database go a long way...
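A rough sketch of what batching could look like, assuming the sync tool counts one table row as one billable row regardless of what is inside it (that billing behavior is an assumption here, not something confirmed in this thread). The table name `daily_events`, the `batch_by_day` helper, and the sample data are all made up for illustration; SQLite stands in for whatever source database you actually sync from.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical ad-platform rows: (day, campaign, clicks).
events = [
    ("2025-05-01", "brand", 12),
    ("2025-05-01", "search", 30),
    ("2025-05-02", "brand", 7),
]


def batch_by_day(rows):
    """Group rows by day and pack each group into one JSON array string."""
    grouped = defaultdict(list)
    for day, campaign, clicks in rows:
        grouped[day].append({"campaign": campaign, "clicks": clicks})
    return [(day, json.dumps(items)) for day, items in sorted(grouped.items())]


batched = batch_by_day(events)

# In-memory SQLite as a stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_events (day TEXT PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO daily_events VALUES (?, ?)", batched)

# Only this table would be synced: one row per day instead of one per event.
synced_rows = conn.execute("SELECT COUNT(*) FROM daily_events").fetchone()[0]

# Unpack on the warehouse side to recover the original events.
unpacked = [
    (day, item["campaign"], item["clicks"])
    for day, payload in conn.execute(
        "SELECT day, payload FROM daily_events ORDER BY day"
    )
    for item in json.loads(payload)
]
```

The tradeoff is that any change inside a day's payload rewrites the whole batched row, so this probably suits append-mostly data better than rows that get updated often.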

morkalork · 2025-05-01T20:59:07 1746133147

OK, if you're bootstrapped it probably doesn't make sense, but otherwise fivetran is fantastic for not having to deal with a boatload of third parties' constant API updates and changes. If your core competency is something else entirely and not doing ETL, then it's worth paying for so you're not wasting time on that ETL work.

zoogeny · 2025-05-01T21:31:18 1746135078

Yes, I've used Fivetran at VC funded startups that I worked at and I understand the value of not having to build this piece of common infrastructure. Although we did experience regular (probably once every couple of months) issues with our ETL getting out of sync. We even had to do a full re-sync on a couple of occasions (which to their credit they did for no charge).

As I said, I totally understand this market and why these companies are valuable. I respect the work they do. But while I am a tiny, tiny startup I don't want to lock in to anything and I know I can handle the amount of data myself with little effort if I have a basic open source alternative I can manage myself.

caust1c · 2025-05-01T21:17:02 1746134222

Check out redpanda connect / warpstream bento (depending on your license needs). Both came out of what was benthos.

https://github.com/redpanda-data/connect

https://github.com/warpstreamlabs/bento

zoogeny · 2025-05-01T21:49:37 1746136177

Interesting, it looks like redpanda is a Kafka replacement and redpanda connect is a Kafka connect replacement but with a supported set of connectors (sources and sinks). I (once upon a time) had to write a Kafka connector myself so I get the general idea.

To be honest, I hadn't really given much thought about what event streaming I would use anyway. So I imagine using redpanda along with redpanda connect could be that layer (I was considering just using Redis streams or even PostgreSQL) and then there is just another redpanda connector for the db to add into that mix. If someone is starting from scratch that might be a good path. But I agree the MIT license of warpstream is a bit nicer if all you need is the connectors.

themanmaran · 2025-05-01T20:53:29 1746132809

Airbyte is probably the best open source tool in this space.

iflores12 · 2025-05-01T21:44:24 1746135864

Airbyte gave us more headaches than it was worth. But if you can get it to work for you, it's probably the closest you'll get to Fivetran in the open-source tool space.

zoogeny · 2025-05-01T21:22:42 1746134562

Cheers, that is what I was thinking must exist but didn't know about.

doctorpangloss · 2025-05-01T23:20:49 1746141649

Palantir's market cap is $274b and they make glorified dashboards. There's just too much money in it to spend cycles doing it for free.

paxys · 2025-05-01T20:20:57 1746130857

A bootstrapped startup needs a MySQL database and a bunch of SELECT queries. Everything else is overkill.

zoogeny · 2025-05-01T21:17:17 1746134237

Sure, SQL + something like metabase is a decent starting point (ideally running on a read-only replica). However, there is room to improve over that.

It's like logging. Yeah, there is sentry, papertrail, splunk, datadog and the like. But something better than grepping syslogs is nice, and it's totally reasonable for a startup to stand up Kibana/Elastic running on a tiny instance. That can provide significantly higher value.

There is a middle ground between stone tools and jet aircraft. I was asking: what are the middle ground tools in this space?

banditelol · 2025-05-02T02:32:40 1746153160

I've tried airbyte, sling, and dlt (besides building several tools from scratch).

My best bet for now would be dlt if you have a dedicated DE team, but sling will get you a long way for moving data around your warehouse.

loginx · 2025-05-01T20:17:01 1746130621

Haven't used it personally, but I would suggest looking into Apache Hudi

zoogeny · 2025-05-01T21:25:33 1746134733

Good to know about but looks more like an open source snowflake (e.g. data lake). Fivetran and Census are the in/out process layers that bookend the data lake. Although, Hudi does look like it has some of that functionality baked in.