ptrik's comments | Hacker News

How’s this different from explainshell?


Explainshell seems to have parsed manpages and extracted arguments, while this tool uses the fig autocomplete specs


Sounds like explainshell's approach is more robust, no?


> While the supermarket that I was using to test things every step of the way worked fine, one of them didn't. The reason? It was behind Akamai and they had enabled a firewall rule which was blocking requests originating from non-residential IP addresses.

Why did you pick Tailscale as the proxying solution instead of scraping with something like AWS Lambda?


Didn't you answer your own question with the quote? It needs to originate from a residential IP address


> My CI of choice is [Concourse](https://concourse-ci.org/) which describes itself as "a continuous thing-doer". While it has a bit of a learning curve, I appreciate its declarative model for the pipelines and how it versions every single input to ensure reproducible builds as much as it can.

What's the thought process behind using a CI server - which I thought was mainly for builds - for what is essentially a data pipeline?


Well, I'm just thinking of Concourse the same way it describes itself: "a continuous thing-doer".

I want something that will run some code when something happens. In my case that "something" is a specific time of day. The code will spin up a server, connect it to Tailscale, run the 3 scraping jobs, and then tear down the server and parse the data. Then another pipeline runs that loads the data and refreshes the caches.
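In Concourse terms the trigger is just a time resource feeding a job; a rough sketch, with made-up names and times:

    resources:
      - name: every-morning
        type: time
        source:
          start: 6:00 AM
          stop: 7:00 AM

    jobs:
      - name: scrape-and-parse
        plan:
          - get: every-morning
            trigger: true
          - task: scrape    # spins up the server, runs the scrapers, tears it down
            file: ci/tasks/scrape.yml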

Of course I'm also using it for continuously deploying my app across 2 environments, or its monitoring stack, or running terraform etc.

Basically it runs everything for me so that I don't have to.


> The data from the scraping are saved in Cloudflare's R2 where they have a pretty generous 10GB free tier which I have not hit yet, so that's another €0.00 there.

Wondering how the data from R2 is fed into the frontend?


> I went from 4vCPUs and 16GB of RAM to 8vCPUs and 16GB of RAM, which reduced the duration by about ~20%, making it comparable to the performance I get on my MBP. Also, because I'm only using the scraping server for ~2h the difference in price is negligible.

Good lesson in cloud economics. Below a certain threshold you get a roughly linear performance gain from a more expensive instance type. The total spend stays essentially the same, but you save wall-clock time by running the same workload on a pricier machine for a shorter period.
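With made-up numbers, assuming the speedup scales with the hourly price:

    4 vCPU at €0.10/h for 4 h = €0.40
    8 vCPU at €0.20/h for 2 h = €0.40

Same bill, less waiting; the trade-off only stops working once the extra capacity no longer speeds the workload up proportionally.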


This is the main takeaway for me: the decentralized way of doing software development at a large scale. It echoes microservices a lot, but it can be done with a more traditional stack as well. It's ultimately about how you empower teams to develop features in parallel and only coordinate when patterns emerge.


Curious to see how this stacks up against a more specialised HTAP database like SingleStore / TiDB


This depends on the use case. SQL is king for batch processing - queries are declarative, with decades of effort put into optimization.

For real-time / streaming use cases, however, there isn't a mature solution in SQL yet. Flink SQL / Materialize is getting there, but the state-of-the-art approach is still the Flink / Kafka Streams one - put your state in memory / on local disk and mutate it as you consume messages.

This actually echoes the "Operate on data where it resides" principle in the article.
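A toy sketch of that state-in-memory pattern in plain Python with confluent_kafka (not actual Kafka Streams; broker, topic and field names are made up):

    # Keep the state as a local dict and mutate it per consumed message.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "running-totals",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])

    totals = {}  # in-memory state, keyed by customer id

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        customer = msg.key().decode()
        amount = float(msg.value().decode())
        totals[customer] = totals.get(customer, 0.0) + amount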


We do mini-batch processing in SQL: a few hundred milliseconds of latency, a few hundred events consumed from the inbound event table per iteration. We paginate through it using a (Shard, EventSequenceNumber) key; writers to the table synchronize/lock so that this is safe.

Kafka-in-SQL if you wish. Or, homegrown Flink.

(There are many different uses for the events inside our SQL processing pipelines, and we have to store the ingested events in SQL anyway.)

I am sure real Kafka+Flink has some advantages, but...what we do works really well, is simple, and feels right for our scale.

There is enough batching in SQL to get real speed/CPU benefits on inserts/updates (vs e.g. hitting SQL once per consumed event, which would be way worse). And with Azure SQL the infra is extremely simple compared to running a Kafka cluster in our context.
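For the curious, one iteration of that pattern could look roughly like this (Python + pyodbc against SQL Server / Azure SQL; table, column and connection names are made up):

    # Poll the inbound event table in pages keyed by (Shard, EventSequenceNumber).
    import time
    import pyodbc

    conn = pyodbc.connect("DSN=events")  # placeholder connection string
    PAGE = ("SELECT TOP (?) EventSequenceNumber, Payload FROM InboundEvents "
            "WHERE Shard = ? AND EventSequenceNumber > ? "
            "ORDER BY EventSequenceNumber")

    def handle(payload):
        pass  # stand-in for whatever the pipeline does with an event

    shard, cursor_pos = 7, 0  # one consumer per shard; made-up shard id
    while True:
        for seq, payload in conn.execute(PAGE, 500, shard, cursor_pos).fetchall():
            handle(payload)
            cursor_pos = seq  # advance the cursor past processed events
        time.sleep(0.2)       # a few hundred ms per iteration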


Do you find Flink SQL immature? To me it looks a lot like syntactic sugar on top of the DataStream API.

Same thing, less code?


Ditto on the DuckDB point. This looks like an OLAP workload to me, and a columnar database would work wonders: DuckDB if you're going embedded, ClickHouse if you're going with a server.
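"Embedded" here is really just an import (hypothetical file and column names):

    import duckdb  # in-process columnar engine, no server to run

    con = duckdb.connect()  # purely in-memory database
    rows = con.execute(
        "SELECT store, avg(price) FROM 'prices.parquet' GROUP BY store"
    ).fetchall()
    print(rows)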



