Hacker News | _ben_'s comments

At PolyScale [1] we tackle many of the same challenges. Some of this article feels a little dated to me, but the data distribution, connectivity and scaling challenges are valid.

We use caching to store data and run SQL compute at the edge. It is wire-protocol compatible with various databases (Postgres, MySQL, MS SQL, MariaDB), and it dramatically reduces query execution times and lowers latency. It also has a JS driver for SQL over HTTP, as well as connection pooling for both TCP and HTTP.
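Because the proxy speaks the database's own wire protocol, pointing an app at a cache like this is typically just a connection-string change. A minimal sketch (the cache hostname below is a made-up placeholder, not a real PolyScale endpoint):

```python
# Build a libpq-style Postgres DSN that routes through a wire-compatible
# cache proxy instead of the origin database. No code changes beyond the
# hostname are needed, since the proxy speaks the same protocol.

def build_dsn(host: str, port: int, db: str, user: str) -> str:
    """Assemble a libpq-style connection string."""
    return f"host={host} port={port} dbname={db} user={user}"

# Direct connection to the origin database:
origin_dsn = build_dsn("db.example.com", 5432, "app", "app_user")

# Same credentials, routed through a hypothetical edge cache proxy:
cached_dsn = build_dsn("cache.example.net", 5432, "app", "app_user")
```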

https://www.polyscale.ai/


This is interesting! How does PolyScale work, especially this part:

> PolyScale automatically and intelligently caches or invalidates data close to where it is being requested.


This pitch is rather opaque to me. How does cache invalidation actually work?

I don't see how cache invalidation happens at all unless all changes go through PolyScale. What about making a change to the database directly?


Thanks for the questions. At a very high level, the AI uses statistical models that learn in real time and estimate how frequently the data in the database is changing. TTLs are set accordingly, per SQL query. The model looks at many inputs, such as the payload sizes being returned from the database as well as arrival rates.
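As a rough illustration of the idea (this is a toy sketch, not PolyScale's actual model), a per-query TTL could be derived from the observed interval between result changes, smoothed with an exponentially weighted average:

```python
class TtlEstimator:
    """Toy per-query TTL model: watch how often a query's result payload
    changes and suggest a TTL that is a fraction of the estimated change
    interval. Illustrative only, not PolyScale's real algorithm."""

    def __init__(self, alpha: float = 0.2, safety: float = 0.5):
        self.alpha = alpha            # EWMA smoothing factor
        self.safety = safety          # cache for half the estimated interval
        self.avg_interval = None      # estimated seconds between changes
        self.last_change_ts = None
        self.last_payload_hash = None

    def observe(self, ts: float, payload: bytes) -> float:
        """Record a query execution at time `ts`; return a suggested TTL."""
        h = hash(payload)
        if self.last_payload_hash is not None and h != self.last_payload_hash:
            interval = ts - self.last_change_ts
            if self.avg_interval is None:
                self.avg_interval = interval
            else:
                self.avg_interval = (self.alpha * interval
                                     + (1 - self.alpha) * self.avg_interval)
            self.last_change_ts = ts
        if self.last_payload_hash is None:
            self.last_change_ts = ts
        self.last_payload_hash = h
        if self.avg_interval is None:
            return 0.0                # nothing learned yet: don't cache
        return self.avg_interval * self.safety

est = TtlEstimator()
est.observe(0.0, b"v1")
est.observe(10.0, b"v2")          # result changed after 10s
ttl = est.observe(20.0, b"v3")    # changed again after 10s -> TTL of 5s
```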

If PolyScale can see mutation queries (inserts, updates, deletes), it will automatically invalidate just the affected data from the cache, globally.

If you make changes directly to the database, out of band to PolyScale, you have a few options depending on the use case. First, the AI's statistical models will invalidate stale entries. Second, you can purge manually, for example after a scheduled import. Third, you can plug in CDC streams to power the invalidations.

Feel free to ping me if you would like to dig in deeper (ben at) and this document provides more detail on the caching protocol: https://docs.polyscale.ai/how-does-it-work#caching-protocol

This blog also goes into detail on how invalidation works: https://www.polyscale.ai/blog/approaching-cache-invalidation...


PolyScale [1] focuses on many of these issues. It provides a globally distributed database cache at the edge. Writes pass through to the database and reads are cached locally to the app tier. The Smart Invalidation feature inspects updates/deletes/inserts and invalidates just the changed data from the cache, globally.

1. https://www.polyscale.ai/


For database caching outside of PlanetScale, PolyScale.ai [1] provides a serverless database edge cache that is compatible with Postgres, MySQL, MariaDB and MS SQL Server. It requires zero configuration or sizing.

1. https://www.polyscale.ai/


I tried to use PolyScale in the past but had issues with performance because updating a row would invalidate the entire cache. I wonder if that has improved?


Yes, in the early versions of the automated invalidation, the logic cleared all cached data based on tables. That is no longer the case. The invalidations only remove the affected data from the cache, globally. You can read more here: https://docs.polyscale.ai/how-does-it-work#smart-invalidatio...


It didn't impact everything, I think I was hitting this case:

> When a query is deemed too complex to determine what specific cached data may have become invalidated, a fallback to a simple but effective table level invalidation occurs.
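The fallback the docs describe amounts to: try to work out exactly which cached rows a DML statement touches, and when the statement is too complex to analyze, purge everything cached for the affected table. A simplified illustration of the table-level fallback (regex-based table extraction; a real system would use a full SQL parser):

```python
import re

# cache maps (table, query_text) -> cached result rows
cache = {
    ("users", "SELECT * FROM users WHERE id = 1"): [("1", "alice")],
    ("users", "SELECT * FROM users WHERE id = 2"): [("2", "bob")],
    ("orders", "SELECT count(*) FROM orders"): [(42,)],
}

# Pull the target table name out of a simple DML statement.
DML_TABLE = re.compile(
    r"^\s*(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+([A-Za-z_][A-Za-z0-9_]*)",
    re.IGNORECASE,
)

def invalidate(dml: str) -> int:
    """Table-level fallback: drop every cached entry for the table named
    in the DML statement. Returns the number of purged entries."""
    m = DML_TABLE.match(dml)
    if not m:
        return 0
    table = m.group(1).lower()
    stale = [key for key in cache if key[0] == table]
    for key in stale:
        del cache[key]
    return len(stale)

purged = invalidate("UPDATE users SET name = 'carol' WHERE id = 1")
# both cached `users` queries are purged; the `orders` entry survives
```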


We've made a lot of changes - give it another try, or feel free to reach out to support@polyscale.ai and we'd be happy to assist you.


The Noria solution seems superior. It doesn't necessarily have to rerun queries from scratch because a single row changed.


One might argue that one approach is superior to the other; I'd argue they are more like duals of one another. The PolyScale approach analyzes the queries and identifies the semantic and statistical relationships between reads and writes. The Noria approach forgoes analyzing the queries and instead maintains a materialized-view-like representation of what the data should be now.

The PolyScale approach does not maintain or require a separate data representation and so saves space; on the other hand, precisely identifying the relationship between reads and writes is not always possible, so it must sometimes over-invalidate in the interest of accuracy.

There are scenarios in which show-me-the-data (Noria) beats show-me-the-math (PolyScale), for example, running complex queries against a relatively simple schema. There are also scenarios in which the statistical (PolyScale) approach wins, for example if the queries are relatively simple or if not all writes to the underlying data are visible.
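The contrast can be sketched in a few lines (a toy model of both strategies for a `SELECT count(*) ... WHERE story_id = ?` style query; neither class reflects the real PolyScale or Noria implementations):

```python
class InvalidatingCache:
    """Invalidate-on-write: the next read must rerun the query upstream."""
    def __init__(self, db):
        self.db, self.cache = db, {}
    def count(self, story):
        if story not in self.cache:            # cache miss: full recount
            self.cache[story] = self.db.count(story)
        return self.cache[story]
    def on_write(self, story):
        self.cache.pop(story, None)            # drop the stale entry

class MaterializedView:
    """Noria-style: keep the result and apply each write's delta to it."""
    def __init__(self, db):
        self.counts = dict(db.all_counts())
    def count(self, story):
        return self.counts.get(story, 0)       # answered without the db
    def on_write(self, story):
        self.counts[story] = self.counts.get(story, 0) + 1  # +1 per vote

class Db:
    """Stand-in origin database holding one vote row per insert."""
    def __init__(self): self.votes = []
    def insert(self, story): self.votes.append(story)
    def count(self, story): return self.votes.count(story)
    def all_counts(self):
        return {s: self.votes.count(s) for s in set(self.votes)}

db = Db()
mv, ic = MaterializedView(db), InvalidatingCache(db)
db.insert("s1"); mv.on_write("s1"); ic.on_write("s1")
# the view already holds the new answer; the cache goes back to the db
```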

There are additional unique features of PolyScale that set it apart. Full disclosure, I work at PolyScale.


Does CF workers support TCP yet?


They still require external connections to be made over HTTP and not other networking protocols. This is part of why this new driver is useful. Before today, it would have been impossible to use PlanetScale directly from a worker without it.


PolyScale founder here. Assuming you are referring to PolyScale (rather than ClickHouse), the product is aimed at devs who don't want to build data distribution and caching. You can connect your database and then have global low-latency reads, without writing code. Useful for multi-region deployments, serverless/microservices/FaaS, as well as simply scaling your origin db.

ClickHouse is used for computing Observability metrics within the UI. The automated caching algorithms do not use ClickHouse in any way. You can read more about the automation here: https://docs.polyscale.ai/how-does-it-work#caching-protocol or try the live demo here: https://playground.polyscale.ai/


Disclaimer: I am the founder of PolyScale [1].

We see both use cases: single large database vs. multiple small, decoupled ones. I agree with the sentiment that a large database offers simplicity, until access patterns change.

We focus on distributing database data to the edge using caching. Typically this eliminates read-replicas and a lot of the headache that goes with app logic rewrites or scaling "One Big Database".

[1] https://www.polyscale.ai/


PolyScale [1] is a serverless plug-and-play database edge cache. Our goal is for devs to be able to scale reads globally in a few minutes. It's wire compatible with Postgres, MySQL and MS SQL Server (more coming, including NoSQL).

It has a global edge network, so there is no infrastructure to deploy, and an AI-managed cache with automatic invalidation, so no cache configuration is needed.

[1] https://www.polyscale.ai/


PolyScale | Remote (GMT -8 to GMT +3) | Full-time | https://www.polyscale.ai/

Founding team hires.

PolyScale is changing how databases are distributed and scaled. Our mission is to enable edge-first data by simplifying global caching for developers. We provide a smart database edge cache that plugs into your existing database and intelligently caches data globally. No code and no servers to deploy.

We're a small team tackling hard problems and growing fast. If you are passionate about developer experience and data performance, and are a curious problem solver, join us! We are currently hiring for:

* Software Engineer, Full Stack - React & TypeScript - https://www.polyscale.ai/careers/software-engineer-full-stac...

* Software Engineer, C++ Backend Proxy - https://www.polyscale.ai/careers/software-engineer-backend

* Product Marketing - spec coming soon - contact us for more details.


PolyScale | Remote (GMT -8 to GMT +3) | Full-time | https://www.polyscale.ai/

Founding team hires.

PolyScale is changing how databases are distributed and scaled. Our mission is to enable edge-first data by simplifying global caching for developers. We provide a smart database edge cache that plugs into your existing database and intelligently caches data globally. No code and no servers to deploy.

We're a small team tackling hard problems and growing fast. If you are passionate about developer experience and data performance, and are a curious problem solver, join us! We are currently hiring for:

* Developer Advocate - https://www.polyscale.ai/careers/developer-advocate

* Software Engineer, Full Stack - React & TypeScript - https://www.polyscale.ai/careers/software-engineer-full-stac...

* Software Engineer, C++ Backend Proxy - https://www.polyscale.ai/careers/software-engineer-backend

* Product Marketing - spec coming soon.


We're building PolyScale[1] to address this problem. PolyScale is a serverless edge cache for databases so you can easily distribute your reads.

We are opening up early access to our connection pooling features in the next couple of weeks. This allows FaaS platforms like Netlify, Cloudflare, etc. to create large numbers of ephemeral connections without impacting your origin database, while also reducing connection latency significantly.
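The core idea behind pooling for ephemeral FaaS callers is that many short-lived clients borrow from a small set of long-lived upstream connections instead of each dialing the origin database. A generic queue-based sketch (not PolyScale's implementation):

```python
import queue

class ConnectionPool:
    """Keep N long-lived upstream connections behind check_out/check_in
    so ephemeral callers reuse them rather than opening their own."""
    def __init__(self, connect, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())     # pay connection setup cost once

    def check_out(self, timeout=None):
        return self._pool.get(timeout=timeout)

    def check_in(self, conn):
        self._pool.put(conn)

# Demo with a stand-in "connection" that records how often it was opened.
opened = []
def fake_connect():
    opened.append(object())
    return opened[-1]

pool = ConnectionPool(fake_connect, size=2)
for _ in range(100):                      # 100 ephemeral invocations...
    conn = pool.check_out()
    pool.check_in(conn)                   # ...each reuses a pooled conn
```

Even with 100 ephemeral callers, only two upstream connections are ever established, which is where the origin-database protection and latency savings come from.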

[1] https://www.polyscale.ai/


I was looking at the PolyScale docs and found the following:

  PolyScale evaluates each SQL query in real-time and when it detects a DML query i.e. a SQL INSERT, UPDATE or DELETE, it extracts the associated tables for the query. Then, all data for the table(s) in question are purged from the cache, for every region globally.
at https://docs.polyscale.ai/how-does-it-work/#smart-invalidati...

Isn't clearing the cache for entire tables on every DML, which may change only one record, too intensive? How does this affect cache performance when multiple DML queries are being run every minute?

Also, can you please give the docs link for the connection pooling feature?


That’s right. Currently the auto invalidation is somewhat of a blunt instrument in that, by default, it will blow away all cached data related to the table(s). That approach favors consistency over performance, but it is also a natural fit for some query traffic patterns. You can also switch it off if you so desire. The next iteration, which is imminent for release, can be much more surgical, invalidating based on more of the query details.

Connection pooling docs are coming soon as part of the feature's early access launch. Feel free to drop me an email and I can let you know when it's released. I'm ben at our domain.

