Launch HN: Castodia (YC W20) – Populate Google Sheets from databases
103 points by jimmyechan on March 13, 2020 | 51 comments
Hey HN, we’re Jimmy and Ayazhan, founders of Castodia (https://www.castodia.com). We help teams pull data from databases directly into Google Sheets. It’s basically a live database connector for Google Sheets that supports data refreshes.

Before we started working on this, we were working on an idea around data pipelines. It didn’t take off, so we had to pivot mid-batch. We had less than a month to do user interviews, build our product, get it approved, and launch it. To start, we knew we wanted to stay in the data space. We spent the first week talking to over 30 people at different companies. In those conversations, we noticed that sales, marketing, and operations teams constantly need to ask developers to help them export data from the database, and they often have to submit follow-on data requests (usually via Jira tickets) because they forgot to add a dimension needed for analysis. This is an inefficient yet surprisingly common situation at companies today. Perhaps even more surprising is that data scientists at big companies like Facebook, which have invested heavily in data infrastructure and analytics tools, often use spreadsheets as a step in their data analysis workflow. While they can pull data for analysis themselves, they too export data into CSV and then open it in a spreadsheet. This means their data doesn’t update automatically and the process has to be repeated each time.

At first, we considered building our own BI tool to solve this problem. However, during our conversations we noticed that people feel new-tool fatigue, especially when their companies rotate through different tools that serve the same purpose. For instance, we’ve heard of companies going from Tableau to Looker and back to Tableau, or from Kibana to Sisense to Looker. Sometimes it feels like companies are paying a lot of money for tools that people aren’t really using, just because it’s the thing everybody “needs to have”. Each one comes with its own structure, data modeling, and steep learning curve. It’s overwhelming. At some point people realize that whatever tool they learn will likely change once a new VP gets hired and wants to do things a new way. In anticipation of that, they default to the one tool they know how to use and that likely won’t be replaced soon. Spreadsheets also happen to be an elegant solution for simple calculations, quick pivots, and high-level data exploration. They are elegant in a way that no $50k-a-year enterprise visualization tool can be - not because those tools can’t do such things, but because people don’t know how to do them quickly. It’s also hard to change workflows for people who live in spreadsheets. Spreadsheets are pretty good for ad-hoc analysis and for summaries used in presentations and reports.

So instead of another BI tool, we built a Google Sheets add-on that connects to a database and lets people search through tables, filter/order the data, and load it into Sheets directly. In the future, we’ll let them schedule data refreshes for any saved query, so all their calculations and pivot tables update with the latest data from the database. We’ve built our first integration with Postgres and are testing MySQL, MongoDB, and public dataset connectors. We plan to add integrations to more data sources, including data warehouses such as Snowflake and BigQuery.

We implemented it to have read-only access to the database, and we don’t store connection credentials or replicate data on our side. We are focused on data analysis at the moment, but a few customers have requested features to also write into the database (i.e. write access). We’re not sure yet how to do this safely with spreadsheets, or whether we should at all. We’ve heard suggestions such as database write rollbacks or an intermediate data queue that can be approved by table owners.
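For illustration, the kind of read-only Postgres role this setup implies - a minimal sketch, where the role, database, and password are placeholders:

    # Minimal sketch: provision a read-only role (assumes psycopg2; all names are placeholders)
    import psycopg2

    conn = psycopg2.connect("dbname=appdb user=admin")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE ROLE castodia_reader LOGIN PASSWORD 'changeme'")
        cur.execute("GRANT CONNECT ON DATABASE appdb TO castodia_reader")
        cur.execute("GRANT USAGE ON SCHEMA public TO castodia_reader")
        cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO castodia_reader")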

Other challenges include: (1) Building data connectors, because every data source spits out data differently, which means we have to parse each one differently to make it accessible from a uniform interface (see the sketch below). (2) We make connections to databases and perform all queries using AWS Lambda functions (thank you, serverless!). Some queries take minutes to execute, but API Gateway drops the connection after 29 seconds. (3) It is hard and, as HN often points out, risky to build a product that depends on Google. We’re also running into Google Sheets’ row limits - still figuring out a way to work around this for large databases. We’ve been bouncing around ideas for a back-end that performs heavy computations while only displaying a sampled subset of data in the spreadsheet. We think it’s worth plunging ahead despite all these issues, though, because having a spreadsheet interface to their data is very much what our users want. We plan to charge users a monthly fee for use of our product, and we’ll have an HN discount on our basic plan.
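To make challenge (1) concrete, here is a rough sketch of the uniform-interface idea: per-source adapters that turn each database’s native output into the same list-of-dicts shape (this assumes psycopg2 and pymongo; connection details are placeholders):

    # Each adapter returns rows as a list of dicts, so downstream code is source-agnostic
    import psycopg2
    from pymongo import MongoClient

    def fetch_postgres(dsn, query):
        conn = psycopg2.connect(dsn)
        with conn.cursor() as cur:
            cur.execute(query)
            cols = [d[0] for d in cur.description]  # column names from cursor metadata
            return [dict(zip(cols, row)) for row in cur.fetchall()]

    def fetch_mongo(uri, db, collection, filt=None):
        docs = MongoClient(uri)[db][collection].find(filt or {})
        return [{k: v for k, v in d.items() if k != "_id"} for d in docs]

    # Both return the same shape, so the Sheets-loading code doesn't care about the source.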

We’d love to hear about your data analysis workflows, the tools you use to do this, any problems you’ve had getting data for analysis, and of course, your thoughts and experiences on this use of spreadsheets! Questions and specific integration requests are also welcome, and if you would like to be beta users, feel free to email us at founders@castodia.com.




So I work at an ecommerce company where sales are taking off, and more and more people from the business are asking for various kinds of dumps from our production database. Sometimes people ask for dumps with PII in them (typically customer contact info). I'm trying to educate people about the risks of dumping our database into loads of Excel files with no access controls that get emailed around.

What kind of data governance does your service support? Can you automatically filter/redact PII? This would be a great feature.


We've had a couple of people ask for that feature. We don't have a solution yet. We could try to control access to specific tables as a first step, but what it would take to solve it properly is access restriction down to the column level.
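For a sense of what column-level restriction could look like in Postgres (a sketch; the table, columns, and role are placeholders):

    # Grant SELECT only on non-sensitive columns; assumes psycopg2
    import psycopg2

    conn = psycopg2.connect("dbname=appdb user=admin")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("REVOKE SELECT ON customers FROM castodia_reader")
        cur.execute("GRANT SELECT (id, created_at, plan) ON customers TO castodia_reader")
    # "SELECT *" now fails for this role; queries must name the permitted columns.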


It might make sense to support (or replicate) Google's DLP functionality. I think that feature is just a blacklist of certain regular expressions.
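In that spirit, a minimal sketch of regex-based redaction (the patterns here are illustrative, not exhaustive):

    import re

    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def redact(value):
        # Replace each match with a labeled placeholder instead of the raw value
        for name, pattern in PII_PATTERNS.items():
            value = pattern.sub(f"[REDACTED {name}]", value)
        return value

    print(redact("Call 555-867-5309 or mail jenny@example.com"))
    # -> Call [REDACTED phone] or mail [REDACTED email]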

Cool product, by the way!


I just took a look - it's really cool. We'll check out Google's DLP in more detail! Thanks!


We looked at building a two-way sync tool in Google Sheets to test out an MVP instead of building a custom tool, and it was a shitshow from a data synchronization standpoint; figuring out how to keep the state of each side in sync is very hard. We abandoned the idea pretty quickly.

For problem (2), I'm not sure if you're returning data directly from Lambda, or using Lambda to call the Sheets API to update the sheet after the data is ready.

If it's the first, there's no way around it, since Lambda functions called through API Gateway have the 30-second timeout you're running into. If it's the second, you should be able to use two Lambdas: one that ingests the request and adds it to a queue, and a second one (which can have up to a 15-minute timeout) that consumes the queue.
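A rough sketch of that two-Lambda pattern (assumes an SQS queue; the queue URL and the query/sheet helpers are placeholders):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/query-jobs"  # placeholder

    def enqueue_handler(event, context):
        # Fast Lambda behind API Gateway: accept the query, queue it, return immediately
        job = {"job_id": event["job_id"], "sql": event["sql"]}
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))
        return {"statusCode": 202, "body": json.dumps({"job_id": job["job_id"]})}

    def worker_handler(event, context):
        # Slow Lambda triggered by SQS: can run up to 15 minutes
        for record in event["Records"]:
            job = json.loads(record["body"])
            rows = run_query(job["sql"])        # placeholder: execute against the database
            push_to_sheet(job["job_id"], rows)  # placeholder: write back via the Sheets API

    def run_query(sql):
        return []  # placeholder

    def push_to_sheet(job_id, rows):
        pass  # placeholder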


Two-way sync without maintaining state is indeed a complex problem. It is loosely a multi-master replication problem.

Shameless plug: we built a multi-directional sync engine, and a company called Syncari around it, to solve exactly this problem. And we added data pipelines around it. It's mostly meant for sales operations/marketing operations folks, self-service style.


Yes, data is returned directly from Lambda. We're working on pro and enterprise versions that will run queries in the backend and update sheets as you described. For that, we'd need to request additional scopes to be able to update sheets remotely. Currently, updates are made by the add-on and only when it is in use.


I'm unfamiliar with the Google Apps Add-On API and how Add-Ons are installed/hosted/executed. What machine is connecting to the database? The user's, your server or Google's server? I'm asking because if the database is required to be publicly accessible over the internet that seems like a deal breaker.


Our servers. We don't require databases to be open to the world, but we do ask that our IP be whitelisted. We use Lambda functions to run the queries.


If you're connecting over the internet without a VPN or SSH tunnel, then you're requiring databases to be open to the world, regardless of any firewalls that sit in front of them. Having worked on the infrastructure side of things, I can't imagine that any serious company would be amenable to such an arrangement.
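For reference, the SSH-tunnel approach looks roughly like this (a sketch using the Python sshtunnel package; hosts and credentials are placeholders):

    import psycopg2
    from sshtunnel import SSHTunnelForwarder

    with SSHTunnelForwarder(
        ("bastion.example.com", 22),                # public jump host
        ssh_username="deploy",
        ssh_pkey="~/.ssh/id_rsa",
        remote_bind_address=("db.internal", 5432),  # database only reachable inside the VPC
    ) as tunnel:
        conn = psycopg2.connect(
            host="127.0.0.1", port=tunnel.local_bind_port,
            dbname="appdb", user="castodia_reader", password="changeme",
        )
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())
        conn.close()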


You might consider building and providing a docker container for on-prem use when a database cannot be exposed to the world.


Yes, we'll likely take this on-prem approach.


I think read-only is what most businesses would want, that seems to be how most BI works.

Other challenges:

1. Building uniform data connectors sucks. GLHF.

2. Long queries: can you return a promise, then call back to an Apps Script endpoint when the data is ready? (Rough sketch after this list.)

3. Row limits: I don't see aggregation in your screenshots; that would certainly help. Spreadsheet users know about row limits, so just give them good UX and control. Sheets gets sluggish pretty quickly on large data sets, so you may want a lower default limit so the UX stays snappy, plus a warning for large queries.
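A sketch of the promise/callback idea in (2): the backend returns a job id immediately, then POSTs results to a published Apps Script web-app URL when the query finishes (the URL and the async helper are placeholders):

    import uuid
    import requests

    CALLBACK_URL = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec"  # placeholder

    def submit(sql):
        job_id = str(uuid.uuid4())
        start_async_query(job_id, sql)  # placeholder: kick off the long-running query
        return {"job_id": job_id}       # the "promise" the caller can correlate later

    def on_query_finished(job_id, rows):
        # An Apps Script doPost(e) handler on the other end writes the rows into the sheet
        requests.post(CALLBACK_URL, json={"job_id": job_id, "rows": rows}, timeout=10)

    def start_async_query(job_id, sql):
        pass  # placeholder: e.g. the queue pattern suggested elsewhere in this thread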


Thanks!

1. Yes, we learned that pretty quickly, especially when working on Postgres and MongoDB. And this is the value we're trying to deliver: as a user, you don't have to know the technical differences between databases; you can query them all the same way.

2. Interesting idea, we’ll give it a try, thanks!

3. Yes, we don't have an aggregate feature just yet; working on it. For large datasets, we're thinking about loading a sampled version of the dataset instead for quick analysis (histograms and so on). Lowering the default limit is interesting, but we're worried it might compromise the experience (if you can only load 10,000 out of 1,000,000 records, you might not be able to do all the analysis you want). Still exploring solutions.
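One way server-side sampling could look (a sketch; assumes Postgres, and the percentage and limit are illustrative):

    import psycopg2

    def sample_rows(conn, table, pct=1.0, limit=10000):
        # Pull a rough random sample instead of the full table using Postgres TABLESAMPLE.
        # SYSTEM sampling is fast but block-based; BERNOULLI is slower but more uniform.
        # The table name is assumed trusted here (e.g. it comes from our own schema browser).
        with conn.cursor() as cur:
            cur.execute(f"SELECT * FROM {table} TABLESAMPLE SYSTEM (%s) LIMIT %s", (pct, limit))
            return cur.fetchall()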


The pricing is confusing and possibly too expensive. For this price you could just buy Office 365 and use the built-in data connections to any SQL server, and companies like MongoDB and Elasticsearch already offer ODBC drivers/DSNs for Excel.

How are you handling MySQL differently than Postgres such that it would warrant a higher price for the consumer?

It's also not super clear what you get with your 100 queries; you can hook up DataGrip or another tool to any of these data sources and get unlimited queries for free.


Singer taps and targets can do this:

    pip install tap-postgres target-gsheet
    tap-postgres --config tap_config.json | target-gsheet --config sheet_config.json
https://www.singer.io/

disclaimer: I used to work at Stitch Data


This is cool! Our vision is to simplify this experience with intuitive visual tools and advanced features, such as versioning, data governance, saved queries, and more.


Interesting product, but $30 per user per month just seems like too much. I don't have that pain though, so is it justifiable? Is it worth it for a data science team of 5+ people to pay over $150 a month for this? I'm asking out of curiosity.


If a 5-person data science team gets any value out of this -- even if it only prevents frustration and literally doesn't save any time -- it's worth it. At a minimum, that team is costing $50k/month, so in theory they should be driving much more than that in value... $150/mo to support that much value is nothing.


Redash offers this as a feature[1]. Their product starts at $49 per month for unlimited users with up to three data sources from a wide range[2], not just Postgres.

I think the Google Sheets export is very valuable: it opens up data to non-developer analysts and general business users. But I also can't see how this product is competitive at this price with only this functionality.

[1] https://redash.io/help/user-guide/integrations-and-api/how-t...

[2] https://redash.io/integrations/


Thanks. We think the price is justified by the time saved on manual exports and by keeping Sheets connected to the database - this way data can easily be refreshed by re-running the query. We looked at similar tools and found our pricing to be fairly competitive.


I think it's competitive. Many businesses run off of spreadsheets, and armies of folks enter data into them manually or with great integration pain. Self-serve business data in sheet form is pretty much the ideal UX for something on the order of millions of business knowledge workers. $30/mo is a pittance for that superpower.

It's certainly more affordable than Tableau and other traditional BI tools.

The userbase is probably not well represented on HN...


You may want to explore a partnership or some other relationship with no-code tool providers like Tray.io, Zapier, and the like. In my experience, users are looking for this sort of functionality from those products, but they're unable to support this use case, as they're designed as event-driven transformation tools, not bulk ETL tools.


I like this idea. I'll play around with those tools and see if it's a viable option - we just want to make sure the UI/UX remains easy for people accessing their data in spreadsheets.


Great work guys! This could be very useful for us. Some feedback for our use case:

1. We’re in healthcare, so data privacy / security and HIPAA compliance are a must. One way to limit this problem might be by architecting it so that our data doesn’t touch your servers.

2. Support for ActiveRecord-like abstraction layers would be great. Our underlying schema is fairly gnarly due to multiple rounds of regulation-driven changes to our data model, so directly querying the Postgres DB isn't really feasible for most of the team that does basic analysis.

Pricing is fine/cheap for what it could make possible for business intelligence/analytics.


Thank you! Those are good insights and feedback for the healthcare use case. We'll look into it!


Cool idea guys, congrats on the launch especially given it was mid-batch.

I'm not in the market for something like this, but it resonates with me since a few times now we've run into the need to jury-rig some scripts to pull data from a MySQL DB to populate a Google Sheet, which then does some simple calc/re-packaging for display on a dashboard or other simple BI use case. But to do it you end up maintaining what's basically a simple "app", even if it's just a cron job hitting a Python script hourly or something.
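The kind of jury-rigged script I mean, roughly (a sketch assuming pymysql and gspread; names and credentials are placeholders; run hourly from cron):

    import pymysql
    import gspread

    def refresh_sheet():
        conn = pymysql.connect(host="db.example.com", user="reporter",
                               password="changeme", database="shop")
        with conn.cursor() as cur:
            cur.execute("SELECT order_date, total FROM orders ORDER BY order_date DESC LIMIT 500")
            rows = [[str(col) for col in row] for row in cur.fetchall()]
        conn.close()

        # Push the rows into a sheet via a service account
        gc = gspread.service_account(filename="service_account.json")
        ws = gc.open("Sales Dashboard").sheet1
        ws.clear()
        ws.update(values=[["order_date", "total"]] + rows, range_name="A1")

    if __name__ == "__main__":
        refresh_sheet()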

I also like the idea of connecting the world's most powerful / most adopted BI tool (spreadsheets) and the more proper data stores one finds in DBs.


Thanks! Would love to understand more about your use case. What do you use to display dashboards? Is it through charts inside Sheets, or do you have a separate tool reading data from the spreadsheet and displaying it? We are working on a MySQL connection and scheduled refreshes - might be able to solve your problem soon.


There's a tool called Presto. It has a Google Sheets connector and many other connectors. Why not use it?


Great advice! We'll give it a try, thank you!


Can you go into the decision to allow collection of:

"Other Data While using our Service, we may also collect the following information: sex, age, date of birth, place of birth, passport details, citizenship, registration at place of residence and actual address, telephone number (work, mobile), details of documents on education, qualification, professional training, employment agreements, non-disclosure agreements, information on bonuses and compensation, information on marital status, family members, social security (or other taxpayer identification) number, office location and other data. " https://www.castodia.com/privacy-policy


Hey, thanks! You actually went through our privacy policy. We don't really need this data - we were working from templates. We'll take it out.


About the issue with AWS Lambda: have you taken a look at other options such as Cloud Run or Fargate? I think they would serve your use case better.

Edit.

To be specific, I mean the limit on container runtime; it's longer on the services mentioned above.


Thank you for recommending these! Our problem isn't related to the Lambda functions themselves, though; they can run for minutes. It's API Gateway that shuts the connection after 30 seconds. We might just need to build our own web server.


Also worth looking into using an ALB instead of API Gateway and having the ALB invoke Lambda (this is possible).


I might be interested in using this, but I'm seeing the Snowflake connection is in the "Enterprise" package which requires an annual contract and is coming soon. Any idea on timing? Is there a beta?


Would you want to be an early beta user for the Snowflake integration? We'd appreciate your insight and would love to learn from your use case. Timeline-wise, we're prioritizing MySQL, MongoDB, and improving our back-end, so I'd say we're another month out before we start on Snowflake, but we'd love to give you early access when we do.


Lemme check with my higher ups, but I think so? We would use this functionality a lot.


Sure. Just reach out to us!


It would be interesting if you could extend this to have a two-way duplex connector from something like Airtable to Postgres. Airtable could represent the last n records at any given point in time. I'd pay for that product. Not so much into Google Sheets.


Y’all should check out the Google Sheets <> BigQuery connector for design and usability ideas


Will do, thanks for the advice!


Congrats on the launch, an important problem to solve. On the other hand, I found your previous idea about pipelines really interesting as well. Do you mind sharing why it did not work out?


This looks great, looking forward to trying it out. I was a heavy Essbase user at a previous analyst role so working with data as a spreadsheet plugin feels very natural.


Thanks! Let us know if you have any feedback or suggestions for improving the product!


I'm curious, what was the original data pipeline idea?


How is your product different from https://seekwell.io/?


Right now we're focused on providing access directly from Sheets, so you don't need to switch back and forth between multiple apps or pages. We also allow UI-based filtering and sorting for those who prefer not to use SQL.


How is it different from an Excel sheet with external data from an ODBC connector?


How does this compare to SuperMetrics?


We focus a lot on our visual query editor, making it easy for people to search tables, select the columns they want, filter/order, and then load into Sheets through the UI. We also allow direct SQL queries for people who already have pre-written queries they want to use.



