This is a really cool and promising product! So many times, when getting a Google Docs spreadsheet from non-engineering folks, I've wished I could embed a small Jupyter notebook with a few Python cells and a nice-looking pandas chart using that data. Congrats on the launch!
Other arguments aside, Athena costs $5 per TB scanned and also supports predicate pushdown to S3 Select. I wouldn't call this expensive, at least in comparison to self-hosted Presto.
At a certain scale it does become very expensive. It's easy math.
When your monthly Athena bill crosses whatever it would cost to run 5 or 10 EC2 machines, it becomes cheaper to use Trino. At my previous workplace we went from ~$40,000/month to ~$18,000/month by replacing Athena.
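As a rough back-of-the-envelope sketch of that break-even point (the EC2 hourly rate below is just an assumed placeholder, not a quoted AWS price):

    # Break-even: Athena's per-TB-scanned pricing vs. a fixed EC2 cluster.
    athena_price_per_tb = 5.0     # USD per TB scanned (Athena list price)
    ec2_hourly_rate = 2.0         # USD/hour per node - assumed, varies by instance type
    nodes = 10
    hours_per_month = 730

    cluster_cost = nodes * ec2_hourly_rate * hours_per_month   # ~$14,600/month
    break_even_tb = cluster_cost / athena_price_per_tb          # ~2,920 TB scanned/month
    print(f"cluster ~${cluster_cost:,.0f}/mo, break-even ~{break_even_tb:,.0f} TB scanned/mo")

Once your monthly scan volume sits well above that line, the fixed cluster wins (before accounting for the engineering time to run it).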
Athena is a very good tool to start with - unless you operate at very large scale, you'll probably never outgrow it. But when you do, there's Trino.
I do contribute to Trino - although I was merely a user when that cost reduction happened.
I'm not sure the math is so easy. Even knowing the direct cost savings in hindsight, engineers' time is expensive, and it's not obvious that the ongoing engineering cost of maintaining Trino on an EC2 cluster would be that far below $22k/month. Even if you get a net cost savings on an ongoing basis (which, granted, you probably do), you may have a long payback period for the initial engineering time spent evaluating solutions and getting the deployment spun up.
And that's all with the benefit of hindsight - it's hard to know a priori how much cheaper your own deployment will be compared to a managed service or how long it will take to implement. Of course, anecdotes like yours help with that, so thanks for sharing your experience!
Sure, I agree: above a certain usage threshold, self-hosting Trino becomes totally justified. But then some engineering time to maintain the cluster has to be factored in as well. For anything ad-hoc in nature, I would start with Athena by default.
For this particular example, direct initialization, i.e. std::string product("not worked");, would be preferred, as you end up with a single constructor call instead of two operations: default construction followed by a move assignment.
Good progress overall, especially on the Rust side. I wonder whether C++ and Rust will at some point follow the same roadmap when it comes to higher-level compute features, or rather diverge and develop at their own pace.
Special kudos to the Rust team for the Parquet predicate pushdown feature.
Following similar observations, I wondered whether one could actually execute SQL queries inside a Python process with access to native Python functions and NumPy as UDFs. Thanks to Apache Arrow, one can essentially combine the DataFrame API with SQL in data analysis workflows, without needing to copy the data or write operators in a mix of C++ and Python, all within the confines of the same Python process.
So I implemented Vinum, which allows you to execute queries that may invoke NumPy or Python functions available to the interpreter as UDFs.
For example: "SELECT value, np.log(value) FROM t WHERE ..".
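End to end, usage looks roughly like this (data.csv and its value column are placeholder names, and I'm assuming the read_csv/sql_pd entry points here - check the project README for the exact API):

    import vinum as vn

    # 'data.csv' and its 'value' column are placeholders for your own input.
    tbl = vn.read_csv("data.csv")
    # np.* inside the SQL resolves to NumPy functions acting as UDFs;
    # the input table is exposed under the name 't'.
    res = tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE value > 0")
    print(res)  # pandas DataFrame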
Finally, DuckDB is making great progress integrating pandas DataFrames into its API, with UDF support coming soon. I would certainly recommend giving it a shot for OLAP workflows.
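For instance, a quick sketch of querying a pandas DataFrame in place (DuckDB picks up df from the calling scope via its replacement scans):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"value": [1.0, 2.5, 10.0]})
    # The DataFrame is scanned in place - no copy into a separate database first.
    out = duckdb.query("SELECT value, ln(value) AS log_value FROM df").to_df()
    print(out)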
That's correct, but SQLite would have to serialize/deserialize the data sent to a Python function (from C to Python and back), while Arrow lets you get a "view" of the same data without making a copy. That's probably not an issue in OLTP workloads, but it may become more noticeable in OLAP.
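A minimal illustration of that "view" with pyarrow (assuming a numeric column with no nulls, which is when zero-copy is possible):

    import numpy as np
    import pyarrow as pa

    arr = pa.array(np.arange(1_000_000, dtype=np.float64))
    # Exposes the Arrow buffer to NumPy without copying; raises if a copy were needed.
    view = arr.to_numpy(zero_copy_only=True)
    # A NumPy "UDF" can then operate directly on the shared buffer.
    result = np.log(view + 1.0)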
That's one of the core use cases for Vinum - to provide a SQL query engine that complements pandas in data analysis (thanks to Arrow), yet retains the ability to use native Python and NumPy functions as UDFs. Also, it doesn't require the input dataset to fit into memory.