Launch HN: Castodia (YC W20) – Populate Google Sheets from databases
103 points by jimmyechan on March 13, 2020 | 51 comments
Hey HN, we’re Jimmy and Ayazhan, founders of Castodia (https://www.castodia.com). We help teams pull data from databases directly into Google Sheets. It’s basically a live database connector for Google Sheets that supports data refreshes.

Before we started working on this, we were working on an idea around data pipelines. It didn’t take off, so we had to pivot mid-batch. We had less than a month to do user interviews, build our product, get it approved, and launch it. To start, we knew we wanted to stay in the data space. We spent the first week talking to over 30 people at different companies. In those conversations, we noticed that sales, marketing, and operations teams constantly need to ask developers to help them export data from the database, and they often have to submit follow-on data requests (usually via Jira tickets) because they forgot to add a dimension needed for analysis. This is an inefficient yet surprisingly common situation at companies today. Perhaps even more surprising is that data scientists at big companies like Facebook, which have invested heavily in data infrastructure and analytics tools, often use spreadsheets as a step in their data analysis workflow. While they can pull data for analysis themselves, they too export data into CSV and then open it in a spreadsheet. This means their data doesn’t update automatically and the process has to be repeated each time.

At first, we considered building our own BI tool to solve this problem. However, during our conversations we noticed that people feel new-tool fatigue, especially when their companies rotate through different tools that serve the same purpose. For instance, we’ve heard of companies going from Tableau to Looker and back to Tableau, or from Kibana to Sisense to Looker. Sometimes it feels like companies are paying a lot of money for tools that people aren’t really using, just because it’s the thing everybody “needs to have”. Each one comes with its own structure, data modeling, and steep learning curve. It’s overwhelming. At some point people realize that whatever tool they learn will likely change once a new VP gets hired and wants to do things a new way. In anticipation of that, they default to the one tool they know how to use and that likely won’t be replaced soon. Spreadsheets also happen to be an elegant solution for simple calculations, quick pivots, and high-level data exploration. They are elegant in a way that no $50k-a-year enterprise visualization tool can be - not because those tools can’t do such things, but because people don’t know how to do them quickly. It’s also hard to change workflows for people who live in spreadsheets. Spreadsheets are pretty good for ad-hoc analysis and for summaries used in presentations and reports.

So instead of another BI tool, we built a Google Sheets add-on that connects to a database and lets people search through tables, filter/order the data, and load it into Sheets directly. In the future, we’ll let them schedule data refreshes for any saved query, so all their calculations and pivot tables update with the latest data from the database. We’ve built our first integration with Postgres and are testing MySQL, MongoDB, and public dataset connectors. We plan to add integrations to more data sources, including data warehouses such as Snowflake and BigQuery.

We implemented it to have read-only access to the database, and we don’t store connection credentials or replicate data on our side. We are focused on data analysis at the moment, but a few customers have requested features to also write into the database (i.e. write access). We’re not sure yet how to do this safely with spreadsheets, or whether we should at all. We’ve heard suggestions such as database write rollbacks or an intermediate data queue that can be approved by table owners.
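For illustration, the kind of read-only Postgres role this setup implies - a minimal sketch, where the role, database, and password are placeholders:

    # Minimal sketch: provision a read-only role (assumes psycopg2; all names are placeholders)
    import psycopg2

    conn = psycopg2.connect("dbname=appdb user=admin")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE ROLE castodia_reader LOGIN PASSWORD 'changeme'")
        cur.execute("GRANT CONNECT ON DATABASE appdb TO castodia_reader")
        cur.execute("GRANT USAGE ON SCHEMA public TO castodia_reader")
        cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO castodia_reader")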

Other challenges include: (1) Building data connectors, because every data source spits out data differently, which means we have to parse each one differently to make it accessible from a uniform interface (see the sketch below). (2) We make connections to databases and perform all queries using AWS Lambda functions (thank you, serverless!). Some queries take minutes to execute, but API Gateway drops the connection after 29 seconds. (3) It is hard and, as HN often points out, risky to build a product that depends on Google. We’re also running into Google Sheets’ row limits - still figuring out a way to work around this for large databases. We’ve been bouncing around ideas for a back-end that performs heavy computations while only displaying a sampled subset of data in the spreadsheet. We think it’s worth plunging ahead despite all these issues, though, because having a spreadsheet interface to their data is very much what our users want. We plan to charge users a monthly fee for use of our product, and we’ll have an HN discount on our basic plan.
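To make challenge (1) concrete, here is a rough sketch of the uniform-interface idea: per-source adapters that turn each database’s native output into the same list-of-dicts shape (this assumes psycopg2 and pymongo; connection details are placeholders):

    # Each adapter returns rows as a list of dicts, so downstream code is source-agnostic
    import psycopg2
    from pymongo import MongoClient

    def fetch_postgres(dsn, query):
        conn = psycopg2.connect(dsn)
        with conn.cursor() as cur:
            cur.execute(query)
            cols = [d[0] for d in cur.description]  # column names from cursor metadata
            return [dict(zip(cols, row)) for row in cur.fetchall()]

    def fetch_mongo(uri, db, collection, filt=None):
        docs = MongoClient(uri)[db][collection].find(filt or {})
        return [{k: v for k, v in d.items() if k != "_id"} for d in docs]

    # Both return the same shape, so the Sheets-loading code doesn't care about the source.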

We’d love to hear about your data analysis workflows, the tools you use to do this, any problems you’ve had getting data for analysis, and of course, your thoughts and experiences on this use of spreadsheets! Questions and specific integration requests are also welcome, and if you would like to be beta users, feel free to email us at founders@castodia.com.




So I work at an ecommerce company where sales are taking off, and more and more people from the business are asking for various kinds of dumps from our production database. Sometimes people ask for dumps with PII in them (typically customer contact info). I'm trying to educate people about the risks of dumping our database into loads of Excel files with no access controls that get emailed around.

What kind of data governance does your service support? Can you automatically filter/redact PII? This would be a great feature.


We've had a couple of people ask for that feature. We don't have a solution yet. We could try to control access to specific tables as a first step, but what it would take to solve it properly is access restriction down to the column level.
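For a sense of what column-level restriction could look like in Postgres (a sketch; the table, columns, and role are placeholders):

    # Grant SELECT only on non-sensitive columns; assumes psycopg2
    import psycopg2

    conn = psycopg2.connect("dbname=appdb user=admin")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("REVOKE SELECT ON customers FROM castodia_reader")
        cur.execute("GRANT SELECT (id, created_at, plan) ON customers TO castodia_reader")
    # "SELECT *" now fails for this role; queries must name the permitted columns.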


It might make sense to support (or replicate) Google's DLP functionality. I think that feature is just a blacklist of certain regular expressions.
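In that spirit, a minimal sketch of regex-based redaction (the patterns here are illustrative, not exhaustive):

    import re

    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def redact(value):
        # Replace each match with a labeled placeholder instead of the raw value
        for name, pattern in PII_PATTERNS.items():
            value = pattern.sub(f"[REDACTED {name}]", value)
        return value

    print(redact("Call 555-867-5309 or mail jenny@example.com"))
    # -> Call [REDACTED phone] or mail [REDACTED email]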

Cool product, by the way!


I just took a look - it's really cool. We'll check out Google's DLP in more detail! Thanks!


We looked at building a two-way sync tool in Google Sheets to test out an MVP instead of building a custom tool, and it was a shitshow from a data synchronization standpoint; figuring out how to keep the state of each side in sync is very hard. We abandoned the idea pretty quickly.

For problem (2), I'm not sure if you're returning data directly from Lambda, or using Lambda to call the Sheets API to update the sheet after the data is ready.

If it's the first, there's no way around it, since Lambda functions called through API Gateway have the 30-second timeout you're running into. If it's the second, you should be able to use two Lambdas: one that ingests the request and adds it to a queue, and a second one (which can have up to a 15-minute timeout) that consumes the queue.
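A rough sketch of that two-Lambda pattern (assumes an SQS queue; the queue URL and the query/sheet helpers are placeholders):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/query-jobs"  # placeholder

    def enqueue_handler(event, context):
        # Fast Lambda behind API Gateway: accept the query, queue it, return immediately
        job = {"job_id": event["job_id"], "sql": event["sql"]}
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))
        return {"statusCode": 202, "body": json.dumps({"job_id": job["job_id"]})}

    def worker_handler(event, context):
        # Slow Lambda triggered by SQS: can run up to 15 minutes
        for record in event["Records"]:
            job = json.loads(record["body"])
            rows = run_query(job["sql"])        # placeholder: execute against the database
            push_to_sheet(job["job_id"], rows)  # placeholder: write back via the Sheets API

    def run_query(sql):
        return []  # placeholder

    def push_to_sheet(job_id, rows):
        pass  # placeholder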


Two-way sync without maintaining state is indeed a complex problem. It is loosely a multi-master replication problem.

Shameless plug: we built a multi-directional sync engine, and a company called Syncari around it, to solve exactly this problem. And we added data pipelines around it. It's mostly meant for sales operations/marketing operations folks, self-service style.


Yes, data is returned directly from Lambda. We're working on pro and enterprise versions that will run queries in the backend and update sheets as you described. For that, we'd need to request additional scopes to be able to update sheets remotely. Currently, updates are made by the add-on and only when it is in use.


I'm unfamiliar with the Google Apps Add-On API and how Add-Ons are installed/hosted/executed. What machine is connecting to the database? The user's, your server or Google's server? I'm asking because if the database is required to be publicly accessible over the internet that seems like a deal breaker.


Our servers. We don't require databases to be open to the world, but we do ask that our IP be whitelisted. We use Lambda functions to run the queries.


If you're connecting over the internet without a VPN or SSH tunnel, then you're requiring databases to be open to the world, regardless of any firewalls that sit in front of them. Having worked on the infrastructure side of things, I can't imagine that any serious company would be amenable to such an arrangement.
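For reference, the SSH-tunnel approach looks roughly like this (a sketch using the Python sshtunnel package; hosts and credentials are placeholders):

    import psycopg2
    from sshtunnel import SSHTunnelForwarder

    with SSHTunnelForwarder(
        ("bastion.example.com", 22),                # public jump host
        ssh_username="deploy",
        ssh_pkey="~/.ssh/id_rsa",
        remote_bind_address=("db.internal", 5432),  # database only reachable inside the VPC
    ) as tunnel:
        conn = psycopg2.connect(
            host="127.0.0.1", port=tunnel.local_bind_port,
            dbname="appdb", user="castodia_reader", password="changeme",
        )
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())
        conn.close()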


You might consider building and providing a docker container for on-prem use when a database cannot be exposed to the world.


Yes, we'll likely take this on-prem approach.


I think read-only is what most businesses would want, that seems to be how most BI works.

Other challenges:

1. Building uniform data connectors sucks. GLHF.

2. Long queries: can you return a promise, then call back to an Apps Script endpoint when the data is ready? (Rough sketch after this list.)

3. Row limits: I don't see aggregation in your screenshots; that would certainly help. Spreadsheet users know about row limits, so just give them good UX and control. Sheets gets sluggish pretty quickly on large data sets, so you may want a lower default limit so the UX stays snappy, plus a warning for large queries.
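A sketch of the promise/callback idea in (2): the backend returns a job id immediately, then POSTs results to a published Apps Script web-app URL when the query finishes (the URL and the async helper are placeholders):

    import uuid
    import requests

    CALLBACK_URL = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec"  # placeholder

    def submit(sql):
        job_id = str(uuid.uuid4())
        start_async_query(job_id, sql)  # placeholder: kick off the long-running query
        return {"job_id": job_id}       # the "promise" the caller can correlate later

    def on_query_finished(job_id, rows):
        # An Apps Script doPost(e) handler on the other end writes the rows into the sheet
        requests.post(CALLBACK_URL, json={"job_id": job_id, "rows": rows}, timeout=10)

    def start_async_query(job_id, sql):
        pass  # placeholder: e.g. the queue pattern suggested elsewhere in this thread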


Thanks!

1. Yes, we learned that pretty quickly, especially when working on Postgres and MongoDB. And this is the value we're trying to deliver: as a user, you don't have to know the technical differences between databases; you can query them all the same way.

2. Interesting idea, we’ll give it a try, thanks!

3. Yes, we don't have an aggregate feature just yet; working on it. For large datasets, we're thinking about loading a sampled version of the dataset instead for quick analysis (histograms and so on). Lowering the default limit is interesting, but we're worried it might compromise the experience (if you can only load 10,000 out of 1,000,000 records, you might not be able to do all the analysis you want). Still exploring solutions.
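One way server-side sampling could look (a sketch; assumes Postgres, and the percentage and limit are illustrative):

    import psycopg2

    def sample_rows(conn, table, pct=1.0, limit=10000):
        # Pull a rough random sample instead of the full table using Postgres TABLESAMPLE.
        # SYSTEM sampling is fast but block-based; BERNOULLI is slower but more uniform.
        # The table name is assumed trusted here (e.g. it comes from our own schema browser).
        with conn.cursor() as cur:
            cur.execute(f"SELECT * FROM {table} TABLESAMPLE SYSTEM (%s) LIMIT %s", (pct, limit))
            return cur.fetchall()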


The pricing is confusing and possibly too expensive. For this price you could just buy Office 365 and use the built-in data connections to any SQL server, and companies like MongoDB and Elasticsearch already offer ODBC drivers/DSNs for Excel.

How are you handling MySQL differently than Postgres such that it would warrant a higher price for the consumer?

It's also not super clear what you get with your 100 queries; you can hook up DataGrip or another tool to any of these data sources and get unlimited queries for free.


Singer taps and targets can do this:

    pip install tap-postgres target-gsheet
    tap-postgres --config tap_config.json | target-gsheet --config sheet_config.json
https://www.singer.io/

disclaimer: I used to work at Stitch Data


This is cool! Our vision is to simplify this experience with intuitive visual tools and advanced features, such as versioning, data governance, saved queries, and more.


Interesting product, but $30 per user per month just seems like too much. I don't have that pain though, so is it justifiable? Is it worth it for a data science team of 5+ people to pay over $150 a month for this? I'm asking out of curiosity.


If a 5-person data science team gets any value out of this -- even if it only prevents frustration and literally doesn't save any time -- it's worth it. At a minimum, that team is costing $50k/month, so in theory they should be driving much more than that in value... $150/mo to support that much value is nothing.


Redash offers this as a feature[1]. Their product starts at $49 per month for unlimited users with up to three data sources from a wide range[2], not just Postgres.

I think the Google Sheets export is very valuable: it opens up data to non-developer analysts and general business users. But I also can't see how this product is competitive at this price with only this functionality.

[1] https://redash.io/help/user-guide/integrations-and-api/how-t...

[2] https://redash.io/integrations/


Thanks. We think the price is justified by the time saved on manual exports and by keeping Sheets connected to the database - this way data can easily be refreshed by re-running the query. We looked at similar tools and found our pricing to be fairly competitive.


I think it's competitive. Many businesses run off of spreadsheets, and armies of folks enter data into them manually or with great integration pain. Self-serve business data in sheet form is pretty much the ideal UX for something on the order of millions of business knowledge workers. $30/mo is a pittance for that superpower.

It's certainly more affordable than Tableau and other traditional BI tools.

The userbase is probably not well represented on HN...


You may want to explore a partnership or some other relationship with no-code tool providers like Tray.io, Zapier, and the like. In my experience, users are looking for this sort of functionality from those products, but they're unable to support this use case, as they're designed as event-driven transformation tools, not bulk ETL tools.


I like this idea. I'll play around with those tools and see if it's a viable option - we just want to make sure the UI/UX remains easy for people accessing their data in spreadsheets.


Great work guys! This could be very useful for us. Some feedback for our use case:

1. We’re in healthcare, so data privacy / security and HIPAA compliance are a must. One way to limit this problem might be by architecting it so that our data doesn’t touch your servers.

2. Support for ActiveRecord-like abstraction layers would be great. Our underlying schema is fairly gnarly due to multiple rounds of regulation-driven changes to our data model, so directly querying the Postgres DB isn't really feasible for most of the team that does basic analysis.

Pricing is fine/cheap for what it could make possible for business intelligence/analytics.


Thank you! Those are good insights and feedback for the healthcare use case. We'll look into it!


Cool idea guys, congrats on the launch especially given it was mid-batch.

I'm not in the market for something like this, but it resonates with me since a few times now we've run into the need to jury-rig some scripts to pull data from a MySQL DB to populate a Google Sheet, which then does some simple calc/re-packaging for display on a dashboard or other simple BI use case. But to do it you end up maintaining what's basically a simple "app", even if it's just a cron job hitting a Python script hourly or something.
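The kind of jury-rigged script I mean, roughly (a sketch assuming pymysql and gspread; names and credentials are placeholders; run hourly from cron):

    import pymysql
    import gspread

    def refresh_sheet():
        conn = pymysql.connect(host="db.example.com", user="reporter",
                               password="changeme", database="shop")
        with conn.cursor() as cur:
            cur.execute("SELECT order_date, total FROM orders ORDER BY order_date DESC LIMIT 500")
            rows = [[str(col) for col in row] for row in cur.fetchall()]
        conn.close()

        # Push the rows into a sheet via a service account
        gc = gspread.service_account(filename="service_account.json")
        ws = gc.open("Sales Dashboard").sheet1
        ws.clear()
        ws.update(values=[["order_date", "total"]] + rows, range_name="A1")

    if __name__ == "__main__":
        refresh_sheet()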

I also like the idea of connecting the world's most powerful / most adopted BI tool (spreadsheets) and the more proper data stores one finds in DBs.


Thanks! Would love to understand more about your use case. What do you use to display dashboards? Is it through charts inside Sheets, or do you have a separate tool reading data from the spreadsheet and displaying it? We are working on a MySQL connection and scheduled refreshes - might be able to solve your problem soon.


There's a tool called Presto. It has a Google Sheets connector and many other connectors. Why not use it?


Great advice! We'll give it a try, thank you!


Can you go into the decision to allow collection of:

"Other Data While using our Service, we may also collect the following information: sex, age, date of birth, place of birth, passport details, citizenship, registration at place of residence and actual address, telephone number (work, mobile), details of documents on education, qualification, professional training, employment agreements, non-disclosure agreements, information on bonuses and compensation, information on marital status, family members, social security (or other taxpayer identification) number, office location and other data. " https://www.castodia.com/privacy-policy


Hey, thanks! You actually went through our privacy policy. We don't really need this data - we were working from templates. We'll take it out.


About the issue with AWS Lambda: have you taken a look at other options such as Cloud Run or Fargate? I think they would serve your use case better.

Edit.

To be specific, I mean the limit on container runtime; it's longer on the services mentioned above.


Thank you for recommending these! Our problem isn't related to the Lambda functions themselves, though; they can run for minutes. It's API Gateway that shuts the connection after 30 seconds. We might just need to build our own web server.


Also worth looking into using an ALB instead of API Gateway and having the ALB invoke Lambda (this is possible).


I might be interested in using this, but I'm seeing the Snowflake connection is in the "Enterprise" package which requires an annual contract and is coming soon. Any idea on timing? Is there a beta?


Would you want to be an early beta user for the Snowflake integration? We'd appreciate your insight and would love to learn from your use case. Timeline-wise, we're prioritizing MySQL, MongoDB, and improving our back-end, so I'd say we're another month out before we start on Snowflake, but we'd love to give you early access when we do.


Lemme check with my higher ups, but I think so? We would use this functionality a lot.


Sure. Just reach out to us!


It would be interesting if you could extend this to have a two-way duplex connector from something like Airtable to Postgres. Airtable could represent the last n records at any given point in time. I'd pay for that product. Not so much into Google Sheets.


Y’all should check out the Google Sheets <> BigQuery connector for design and usability ideas


Will do, thanks for the advice!


Congrats on the launch, an important problem to solve. On the other hand, I found your previous idea about pipelines really interesting as well. Do you mind sharing why it did not work out?


This looks great, looking forward to trying it out. I was a heavy Essbase user at a previous analyst role so working with data as a spreadsheet plugin feels very natural.


Thanks! Let us know if you have any feedback or suggestions for improving the product!


I'm curious, what was the original data pipeline idea?


How is your product different from https://seekwell.io/?


Right now we're focused on providing access directly from Sheets, so you don't need to switch back and forth between multiple apps or pages. We also allow UI-based filtering and sorting for those who prefer not to use SQL.


How is it different from an Excel sheet with external data from an ODBC connector?


How does this compare to SuperMetrics?


We focus a lot on our visual query editor, making it easy for people to search tables, select the columns they want, filter/order, and then load into Sheets through the UI. We also allow direct SQL queries for people who already have pre-written queries they want to use.



