This is a really cool and promising product! So many times, when getting a Google Docs spreadsheet from non-engineering folks, I've wished I could embed a small Jupyter notebook with a few Python cells and a nice-looking pandas chart using that data. Congrats on the launch!
Other arguments aside, Athena costs $5 per TB scanned and also supports predicate pushdown to S3 Select. I wouldn't call this expensive, at least in comparison to self-hosted Presto.
At a certain scale it does become very expensive. It's easy math.
When your monthly Athena bill crosses whatever it would cost to run 5 or 10 EC2 machines, it becomes cheaper to use Trino. At my previous workplace we went from ~$40,000/month to ~$18,000/month by replacing Athena.
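As a rough back-of-the-envelope sketch of that break-even point (the EC2 hourly rate below is just an assumed placeholder, not a quoted AWS price):

    # Break-even: Athena's per-TB-scanned pricing vs. a fixed EC2 cluster.
    athena_price_per_tb = 5.0     # USD per TB scanned (Athena list price)
    ec2_hourly_rate = 2.0         # USD/hour per node - assumed, varies by instance type
    nodes = 10
    hours_per_month = 730

    cluster_cost = nodes * ec2_hourly_rate * hours_per_month   # ~$14,600/month
    break_even_tb = cluster_cost / athena_price_per_tb          # ~2,920 TB scanned/month
    print(f"cluster ~${cluster_cost:,.0f}/mo, break-even ~{break_even_tb:,.0f} TB scanned/mo")

Once your monthly scan volume sits well above that line, the fixed cluster wins (before accounting for the engineering time to run it).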
Athena is a very good tool to start with - unless you operate at very large scale, you'll probably never outgrow it. But when you do, there's Trino.
I do contribute to Trino - although I was merely a user when that cost reduction happened.
I'm not sure the math is so easy. Even knowing the direct cost savings in hindsight, engineers' time is expensive, and it's not obvious that the ongoing engineering cost of maintaining Trino on an EC2 cluster would be that far below $22k/month. Even if you get a net cost savings on an ongoing basis (which, granted, you probably do), you may have a long payback period for the initial engineering time spent evaluating solutions and getting the deployment spun up.
And that's all with the benefit of hindsight - it's hard to know a priori how much cheaper your own deployment will be compared to a managed service or how long it will take to implement. Of course, anecdotes like yours help with that, so thanks for sharing your experience!
Sure, I agree: above a certain usage threshold, self-hosting Trino becomes totally justified. But then some engineering time to maintain the cluster has to be factored in as well. For anything ad-hoc in nature, I would start with Athena by default.
For this particular example, direct initialization, i.e. std::string product("not worked");, would be preferred, as you end up with a single constructor call instead of two operations: default construction followed by a move assignment.
Good progress overall, especially on the Rust side. I wonder whether C++ and Rust will at some point follow the same roadmap when it comes to higher-level compute features, or rather diverge and develop at their own pace.
Special kudos to the Rust team for the Parquet predicate pushdown feature.
Following similar observations, I wondered whether one could actually execute SQL queries inside a Python process with access to native Python functions and NumPy as UDFs. Thanks to Apache Arrow, one can essentially combine the DataFrame API with SQL in data analysis workflows, without needing to copy the data or write operators in a mix of C++ and Python, all within the confines of the same Python process.
So I implemented Vinum, which allows you to execute queries that may invoke NumPy or Python functions available to the interpreter as UDFs.
For example: "SELECT value, np.log(value) FROM t WHERE ..".
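End to end, usage looks roughly like this (data.csv and its value column are placeholder names, and I'm assuming the read_csv/sql_pd entry points here - check the project README for the exact API):

    import vinum as vn

    # 'data.csv' and its 'value' column are placeholders for your own input.
    tbl = vn.read_csv("data.csv")
    # np.* inside the SQL resolves to NumPy functions acting as UDFs;
    # the input table is exposed under the name 't'.
    res = tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE value > 0")
    print(res)  # pandas DataFrame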
Finally, DuckDB is making great progress integrating pandas DataFrames into its API, with UDF support coming soon. I would certainly recommend giving it a shot for OLAP workflows.
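For instance, a quick sketch of querying a pandas DataFrame in place (DuckDB picks up df from the calling scope via its replacement scans):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"value": [1.0, 2.5, 10.0]})
    # The DataFrame is scanned in place - no copy into a separate database first.
    out = duckdb.query("SELECT value, ln(value) AS log_value FROM df").to_df()
    print(out)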
That's correct, but SQLite would have to serialize/deserialize the data sent to a Python function (from C to Python and back), while Arrow lets you get a "view" of the same data without making a copy. That's probably not an issue in OLTP workloads, but it may become more noticeable in OLAP.
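A minimal illustration of that "view" with pyarrow (assuming a numeric column with no nulls, which is when zero-copy is possible):

    import numpy as np
    import pyarrow as pa

    arr = pa.array(np.arange(1_000_000, dtype=np.float64))
    # Exposes the Arrow buffer to NumPy without copying; raises if a copy were needed.
    view = arr.to_numpy(zero_copy_only=True)
    # A NumPy "UDF" can then operate directly on the shared buffer.
    result = np.log(view + 1.0)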
That's one of the core use cases for Vinum - to provide a SQL query engine that complements pandas in data analysis (thanks to Arrow), yet retains the ability to use native Python and NumPy functions as UDFs. Also, it doesn't require the input dataset to fit into memory.