The likely endgames of 'no tracking' (accelerated by automation pushing rejection rates towards 100%) are:
- drastically reduced free content (see the rise of paywalls on most news sites)
- an arms race to find other ways to track (see the rise of cookieless tracking and retargeting approaches, first-party cloaking, etc)
Not to say the latter wouldn't always be there, but the fact is a good chunk of the web is free and stuffed with ads because that's the only way to stay afloat; people simply don't want to pay for content.
Silico | Compiler Software Engineer | London, UK | REMOTE (UK) | FULL TIME | https://www.silicoai.com/
Silico is developing a flexible simulation and decision support platform, enabling businesses to better plan and evaluate their decisions.
We're hiring a Compiler Engineer to join our team and work on our core engine toolchain: an interactive compiler for analysing end-user formulae to provide LSP-style editor support, as well as compilation into high-performance code for execution. Another facet is bringing powerful techniques such as auto-differentiation to our users seamlessly via the compilation pipeline.
Our engine is written in Rust, supporting a TypeScript/React SPA, and runs server-side as well as client-side via WASM.
That doesn't force covering "both sides" of every issue, though, in the sense people often mean by "fair and balanced" in these discussions, where every issue needs to have a counterpoint aired.
The UK election rules are more about ensuring all the candidates for election get some representative airtime, and that overtly political advertisements are labelled as such during the election period. A channel is not allowed to promote one party and completely block out another, or be monopolised by advertising funds. (There's also a very short quiet period just before the voting takes place.)
If all parties generally agree on something, the opposing view to that is unlikely to get much airtime, even if someone not running for office would really like to talk about it.
At Silico we're building a simulation platform to enable businesses to use forecasting and simulation to plan for tomorrow. We're still a small team of 5 engineers, so join us early and help us shape our team and product going forwards!
We're VC funded and in the process of starting a new wave of hiring, currently prioritising a Backend Engineer with experience in Kubernetes and Golang and/or Rust, with new frontend roles in TypeScript, React & Rust to follow. Please see our careers site or the linked job posting for more information, or contact me directly at chris@silicoai.com.
Arrow itself is a standardised in-memory columnar data representation. The benefit is that you can then send data between processes without a serialise/deserialise step. There's a growing ecosystem around this, e.g. Flight for making the sending of data easier, DataFusion for querying, etc.
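To make that concrete, here's a minimal sketch using the Rust arrow crate (the column name and values are placeholders, and the Vec<u8> stands in for a socket or shared-memory region): a record batch is written in the Arrow IPC stream format and read back on the "other side" without any bespoke serialisation step.

    use std::io::Cursor;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int32Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::ipc::reader::StreamReader;
    use arrow::ipc::writer::StreamWriter;
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // One Int32 column; the schema travels with the data.
        let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
        let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
        let batch = RecordBatch::try_new(schema.clone(), vec![ids])?;

        // Write the batch in the Arrow IPC *stream* format. The Vec<u8> here
        // is a stand-in for a pipe, socket, or shared memory between processes.
        let mut bytes = Vec::new();
        {
            let mut writer = StreamWriter::try_new(&mut bytes, &schema)?;
            writer.write(&batch)?;
            writer.finish()?;
        }

        // The receiving side maps the same layout straight back into arrays;
        // there's no separate wire format to decode.
        let reader = StreamReader::try_new(Cursor::new(bytes), None)?;
        for maybe_batch in reader {
            let received = maybe_batch?;
            assert_eq!(received.num_rows(), 3);
        }
        Ok(())
    }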
Can you persist Arrow-format data to disk? I see a lot of interest in it, but I can't figure out the use cases. For example, let's say I have a ton (xx TB) of well-structured objects on S3. I want to run queries via Spark/Presto on the data. I still need to deserialize the data from ORC/Parquet into Arrow, right? The advantage with Arrow here is that if Spark/Presto can use this format to pass data between worker nodes, the queries would be faster because we don't need to deserialize/serialize when passing data between nodes? If yes, how do I utilize the format in Spark/Presto?
You can: it is serializable and self-describing. However, unless "disk" is super fast and thus more likely memory, and your data is ephemeral, you probably shouldn't. Instead, we've been happier with parquet/orc: tunable compression, nicer multi-part / parallel readers, and a bit more stable.
There is feather for persistence, but you don't need it: just as you can stream binary arrow buffers between processes, you can write raw arrow to disk. In theory it might give some teams in some setups parallel read/write speedups, but we've been exploring other paths there, e.g., 90+ GB/s per node via GDS https://pavilion.io/nvidia . I'm not aware of feather efforts targeting that kind of perf, but would be curious!
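For instance, with the Rust arrow crate you can point the IPC file writer at an ordinary file. A rough sketch (the path and helper name are made up; the batch is assumed to come from code like the earlier snippet):

    use std::fs::File;

    use arrow::ipc::writer::FileWriter;
    use arrow::record_batch::RecordBatch;

    // Persist a batch in the Arrow IPC file format (what Feather V2 builds on).
    fn persist(batch: &RecordBatch, path: &str) -> Result<(), Box<dyn std::error::Error>> {
        let schema = batch.schema();
        let file = File::create(path)?;
        let mut writer = FileWriter::try_new(file, &schema)?;
        writer.write(batch)?;
        writer.finish()?; // writes the file footer so readers can seek/skip
        Ok(())
    }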
To utilize it w/ Spark... it already does underneath ;-) An increasingly common flow is something like Spark filter -> GPU compute+AI, where the transfer is Spark CPU RDD -> Arrow (Spark-native) -> RAPIDS/TensorFlow
Edit: Arrow dev does seem more active than parquet/orc (and a lot of their dev is _by_ arrow devs!), so give it another couple of years and I can see arrow being stable enough that you can persist data with less fear of having to reprocess older files, and with most of the compression features you'd want!
Got ya. We're sticking with Parquet/ORC for now, but we're running into the scenario where the size of the data is going up while query SLAs are going down. At some point we'll need to look at other technology to reduce cost without sacrificing performance.
Yep. I may have been unclear; they work well together: we'll do a GPU parquet reader that returns an arrow dataframe, which our ETL pipeline then transforms into visual depictions of the correlations+relationships in people's datasets. Stuff on disk is in nice stable formats; stuff crossing our API boundaries & compute frameworks is arrow.
It varies.. a lot of our users look at, say, 50 KB files for quick, small, targeted visual sessions, but when doing something like a log-dump analysis we're working on TB files and 1-2 GB per streaming part is good. CPU arrow people like to do say 10 KB-1 MB per record batch, but GPU land is a lot faster by thinking in terms of bandwidth, so 500 MB-10 GB per contiguous part, depending on GPU memory and working-set size. Likewise, it depends on how compressed the data is, as you ultimately care how much it uncompresses into for the downstream memory pressure. Hope that makes sense!
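If it helps make the trade-off concrete, here's a rough sketch with the Rust arrow crate (the helper name is invented, and rows-per-chunk is just a proxy for bytes-per-part) of splitting one big batch into parts of a chosen size, which is how you'd tune towards smaller interactive chunks or bigger bandwidth-oriented ones:

    use arrow::record_batch::RecordBatch;

    // Split one large batch into record batches of at most `rows_per_chunk` rows.
    // RecordBatch::slice is zero-copy: each chunk is just an offset/length view
    // over the same underlying buffers.
    fn chunk_batch(batch: &RecordBatch, rows_per_chunk: usize) -> Vec<RecordBatch> {
        let mut chunks = Vec::new();
        let mut offset = 0;
        while offset < batch.num_rows() {
            let len = rows_per_chunk.min(batch.num_rows() - offset);
            chunks.push(batch.slice(offset, len));
            offset += len;
        }
        chunks
    }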
> unless "disk" is super fast and thus more likely memory, and your data is ephemeral, you probably shouldn't
Can you elaborate on why Arrow is not a good format for storing to disk? If you're using it for in-memory querying, why would you not want to also serialize it directly to disk instead of using some intermediary format?
Performance: Arrow does not do significant compression. Feather started adding it, but that adds even more change risk. Parquet/ORC/Arrow are all fairly similar, so until Arrow catches up and stabilizes, I'd stick w/ Parquet/ORC. We do GPU stuff and get in-GPU decompression already, so that's been a win/win.
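To illustrate the compression point, here's a rough sketch using the Rust parquet crate's Arrow writer (the codec choice and path are just examples): the on-disk format gets tunable columnar compression that raw Arrow IPC doesn't give you out of the box.

    use std::fs::File;

    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::basic::Compression;
    use parquet::file::properties::WriterProperties;

    // Write an Arrow batch out as Parquet with a per-file compression codec.
    fn write_parquet(batch: &RecordBatch, path: &str) -> Result<(), Box<dyn std::error::Error>> {
        let props = WriterProperties::builder()
            .set_compression(Compression::SNAPPY) // example codec; others are available
            .build();
        let file = File::create(path)?;
        let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
        writer.write(batch)?;
        writer.close()?; // finalises the Parquet footer
        Ok(())
    }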
Sheetless makes it easy to model complex systems and explore decisions before making them in the real world. We're building a platform that allows analysts, data scientists and management to build and collaborate on better decision making, from no-code model building, through scenario simulation and data comparison, to generating actionable insights into the systems at work.
The challenges involve building highly interactive and data intensive interfaces in a Simulation IDE, spanning the gamut from UX concerns through to low-latency compilers. We're also building out the infrastructure to enable simple data connectivity for models, real-time analysis, and scale-out for compute intensive workloads for analysis and optimisation.
We're currently 5 people, have recently secured VC backing (pre-seed) and are building out a core engineering team, with positions open on frontend and backend. Technology stack is TypeScript (Next.js), Rust, Hasura, and deployed on GCP. We're interested in Go/Java engineers with DevOps experience for the backend role.
It's interesting to see companies showing up in this space (e.g., https://hash.ai/), many of them based out of London. I'd consider applying if I was based in the UK, but I'm curious about what inspired the founding team to start the company. Can you share an origin story?
We're building a simulation platform for the modern enterprise, enabling organisations to build digital twins of themselves, to bring together data and human knowledge to give better support and guidance in making decisions. We believe the future is in tools that provide collaborative and explainable predictive power to decision makers.
The challenges involve building highly interactive and data intensive interfaces in a Simulation IDE, spanning the gamut from UX concerns through to low-latency compilers. On the server side we're then providing services around on-demand computation, analysis, and compilation.
We're currently 3 people, have recently secured VC backing (pre-seed) and are building out a core engineering team, with positions open on frontend and backend. Technology stack is TypeScript (Next.js), Rust, Hasura, and deployed on GCP.
Excalidraw can store the files, although it's a little hidden: the link option in export gives a link that downloads an encrypted JSON from their server.
Sheetless | Front-End Engineer | London (Remote) | Full Time | https://sheetless.io
Sheetless is creating a modelling tool to bridge the gap between spreadsheets and specialist packages, helping people to move their knowledge about the systems they know out of their heads and spreadsheets. We're making it easier for people to understand systems and make better decisions to improve them, whether that's a business, or the environment.
We're a SaaS product, developing in TypeScript and Rust. On the frontend we're using a stack of React/Redux/Next.js/Material UI, along with some Rust modules powering the simulations. As a first hire, we're looking for someone comfortable and capable of building out new UI/UX around building simulations, with a focus on making things accessible for non-experts.
We're early stage, with initial funding and eager first customers. We're fully remote, but with a preference for being within a timezone or two of the UK.
I attempted to email the address on the linked page but received the following automatic-reply error:
> We're writing to let you know that the group you tried to contact (hiring) may not exist, or you may not have permission to post messages to the group.
I've found Rust is great for doing heavy lifting (parsing/compiling/graph analysis, etc) on the frontend, as long as you can define the boundary between wasm/JS reasonably cleanly. And in that code I don't think there's a productivity hit, as you benefit a lot from the stricter types (ADT pattern matching, etc) and more reliable performance.
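As a rough illustration of that kind of boundary (assuming wasm-bindgen; the function and its behaviour are invented for the example), the heavy work lives in Rust and only a small result crosses into TypeScript via the generated bindings:

    use wasm_bindgen::prelude::*;

    // Hypothetical formula check: all the parsing/analysis happens on the Rust
    // side; TypeScript just calls check_formula(src) from the generated bindings
    // and gets back a plain string to display.
    #[wasm_bindgen]
    pub fn check_formula(src: &str) -> String {
        // Stand-in for real parsing / graph analysis.
        if src.trim().is_empty() {
            "error: empty formula".to_string()
        } else {
            format!("ok: parsed {} characters", src.len())
        }
    }

Keeping the crossing to simple types like this (strings, numbers, typed arrays) is what makes the boundary cheap and hard to get wrong.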
For more straightforward UI code, I'd agree that TypeScript strikes a much better balance right now, especially for (almost) seamlessly working with the wider ecosystem. I moved away from Scala.js because defining the boundary transition was quite error-prone, but maybe I was missing something.