IMO, a real solution here has to be hybrid, not full LLM, because these sheets can be massive and have very complicated structures. You want the LLM to identify / map column headers, while non-LLM tool calling runs the Excel operations like SUMIFs or VLOOKUPs. One of the most important traits in these systems is consistency under slight variations in file layout, since so much Excel work involves consolidating / reconciling reports that are produced quarterly or by a variety of sources, each with a different reporting structure.
Disclosure: My company builds ingestion pipelines for large multi-tab Excel files, PDFs, and CSVs.
"Hey Claude, what's the total from this file?
> grep for headers
> "Ah, I see column 3 is the price value"
> SUM(C2:C) -> $2020
> "Great! I found your total!"
If you can find me an example of tech that can solve this at scale on large, diverse Excel formats, then I'll concede, but I haven't found anything actually trustworthy for important data sets.
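For contrast, the hybrid flow I'm arguing for looks roughly like the toy TypeScript below: the LLM's only job is mapping messy headers to canonical fields (stubbed out here), and the aggregation itself is plain deterministic code. The names and the canonical field set are made up for illustration, not any product's actual API.

    // The LLM's entire responsibility: header text -> canonical field.
    type HeaderMap = Record<string, "price" | "ignore">;

    // In practice this would be one LLM call over the header row;
    // stubbed with a regex so the sketch stays self-contained.
    function mapHeaders(headers: string[]): HeaderMap {
      return Object.fromEntries(
        headers.map((h) => [h, /price|amount/i.test(h) ? "price" : "ignore"] as const)
      );
    }

    // The actual math is deterministic -- no LLM anywhere near it.
    function sumColumn(rows: string[][], headers: string[], field: "price"): number {
      const map = mapHeaders(headers);
      const idx = headers.findIndex((h) => map[h] === field);
      if (idx === -1) throw new Error(`no column mapped to ${field}`);
      return rows.reduce((acc, row) => acc + (Number(row[idx]) || 0), 0);
    }

    const headers = ["Item", "Unit Amount (USD)", "Qty"];
    const rows = [["widget", "1000", "2"], ["gadget", "1020", "1"]];
    console.log(sumColumn(rows, headers, "price")); // 2020

The point is that a layout change only breaks the cheap, auditable mapping step, not the arithmetic.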
Danya was like the Mr. Rogers of chess. He had a way of making you feel accepted into the chess community even if you were a beginner, and was such a clear thinker. I strive to be more like him, and am devastated by this loss.
The way my company uses Clickhouse is basically that we have one giant flat table, and we've written our own abstraction layer on top of it based around "entities", which are functions of the data in the underlying table, potentially adding in some window functions or joins. Pretty much every query we write with Clickhouse tacks a big "GROUP BY ALL" onto the end, because we are always trying to squash down the number of rows and aggregate as aggressively as possible.
I imagine we're not alone in building this type of abstraction layer, and some type safety would be very welcome there. I tried to build our system on top of Kysely (https://kysely.dev/), but the Clickhouse extension was not far enough along to make sense for our use-case. As such, we basically had to build our own parser that compiles down to SQL, but there are many type-error edge cases, especially when we're joining against data from S3 that could be CSV, Parquet, etc.
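For anyone curious what that kind of layer looks like, here's a stripped-down sketch of the "entity compiles down to SQL" idea. The names (compileEntity, events_flat, the measures) are hypothetical, and the real thing has to handle joins, window functions, and the external-source typing issues mentioned above.

    // An "entity": named expressions over the one flat table, compiled
    // into a ClickHouse query that aggregates as hard as possible.
    interface Entity {
      table: string;
      dimensions: Record<string, string>; // alias -> SQL expression
      measures: Record<string, string>;   // alias -> aggregate expression
    }

    function compileEntity(entity: Entity, where?: string): string {
      const cols = [
        ...Object.entries(entity.dimensions).map(([alias, expr]) => `${expr} AS ${alias}`),
        ...Object.entries(entity.measures).map(([alias, expr]) => `${expr} AS ${alias}`),
      ].join(",\n  ");
      return [
        `SELECT\n  ${cols}`,
        `FROM ${entity.table}`,
        where ? `WHERE ${where}` : "",
        "GROUP BY ALL", // always squash rows down as far as possible
      ].filter(Boolean).join("\n");
    }

    const revenue: Entity = {
      table: "events_flat",
      dimensions: { quarter: "toStartOfQuarter(reported_at)", source: "source" },
      measures: { gross: "sum(amount)", statements: "uniq(statement_id)" },
    };

    console.log(compileEntity(revenue, "asset_type = 'music_right'"));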
Side note: One of the things I love most about Clickhouse is how easy it is to combine data from sources beyond just the source database at query time. I imagine this makes the problem of building an ORM much harder as well, since you could need to build type-checking / ORM support against SQL queries to external sources, rather than against the source table itself.
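As a concrete example of the multi-source point: you can join the flat table against raw files in object storage right inside the query with the s3 table function. The bucket URL and columns below are made up; the relevant part is that the s3() side has no schema the type layer can check up front, which is exactly where our type-error edge cases show up.

    // Query-time join of the warehouse table against a CSV in S3.
    const fxAdjusted = `
      SELECT
        toStartOfQuarter(e.reported_at) AS quarter,
        sum(e.amount * r.rate) AS gross_usd
      FROM events_flat AS e
      LEFT JOIN s3(
        'https://example-bucket.s3.amazonaws.com/fx/*.csv',
        'CSVWithNames'
      ) AS r ON e.currency = r.currency
      GROUP BY ALL
    `;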
Given a model with pre-trained transcoders, it finds the circuit / attribution graph; i.e., it computes the direct effect that each non-zero transcoder feature, transcoder error node, and input token has on each other non-zero transcoder feature and output logit.
Given an attribution graph, it visualizes this graph and allows you to annotate these features.
Enables interventions on a model's transcoder features using the insights gained from the attribution graph; i.e., you can set features to arbitrary values and observe how the model's output changes.
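For readers who haven't seen attribution graphs before, the output is conceptually just a weighted directed graph. A generic sketch of the data shape (not this tool's actual API) might look like:

    // Nodes are active transcoder features, transcoder error nodes,
    // input tokens, and output logits; each edge carries the direct
    // effect of its source on its target.
    type NodeKind = "feature" | "error" | "token" | "logit";

    interface GraphNode { id: string; kind: NodeKind; layer?: number; position: number; }
    interface Edge { source: string; target: string; directEffect: number; }
    interface AttributionGraph { nodes: GraphNode[]; edges: Edge[]; }

    // Typical question asked of the graph: what contributed most,
    // directly, to a given output logit (or feature)?
    function topContributors(g: AttributionGraph, target: string, k = 10): Edge[] {
      return g.edges
        .filter((e) => e.target === target)
        .sort((a, b) => Math.abs(b.directEffect) - Math.abs(a.directEffect))
        .slice(0, k);
    }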
Syncopate | NYC (Hybrid ~3d/week) | Full-time | Senior Full Stack Engineers / Focus on AI + Finance
Syncopate builds tools to help automate financial diligence and management of long-tail financial assets.
We've found product market fit with ETL/analysis tools for niche financial data, starting with music rights, and we're looking to build out our capabilities across more Excel + PDF-based workflows.
What we're looking for: A full-stack engineer with experience building data-heavy applications. Experience with analytics databases like Clickhouse and data pipelining is a plus. Required to be knowledgeable in Typescript.
Big bonus points for:
1) High agency (previously a founder or built side-projects to completion)
2) Some knowledge of finance
3) Skill in Rust
It seems to me like chunking (or some higher-order version of it, like chunking into knowledge graphs) is the highest-leverage thing someone can work on right now if they're trying to improve the intelligence of AI systems for things like code completion, PDF understanding, etc. I'm surprised more people aren't working on this.
Chunking is less important in the long-context era, with most people just pulling in the top 20 chunks. You obviously don't want to butcher it, but you've got a lot of room for error.
We still want chunking in practice, to avoid LLM confusion and undifferentiated embeddings, and to handle large datasets at lower cost and higher volume. Large context means we can now tolerate multi-paragraph or multi-page chunks, so it's more like chunking by coherent section (rough sketch below).
In theory we could do an entire chapter or book, but those other concerns come in, so I only see more niche tools or talk-to-your-PDF products doing that.
At the same time, embedding is often a significant cost in the above scenarios, so I'm curious about the overhead of semantic chunking.
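Rough sketch of what I mean by chunking by coherent section, with a fallback for oversized sections; the heading pattern and token budget are arbitrary placeholders:

    const MAX_TOKENS = 2000;
    const approxTokens = (s: string) => Math.ceil(s.length / 4); // crude heuristic

    // Split on section boundaries first, then only sub-split sections
    // that blow past the budget, packing paragraphs together.
    function chunkBySection(doc: string): string[] {
      const sections = doc.split(/\n(?=#{1,3} )/); // markdown-style headings
      return sections.flatMap((section) => {
        if (approxTokens(section) <= MAX_TOKENS) return [section];
        const chunks: string[] = [];
        let current = "";
        for (const para of section.split(/\n{2,}/)) {
          if (current && approxTokens(current + para) > MAX_TOKENS) {
            chunks.push(current);
            current = "";
          }
          current += (current ? "\n\n" : "") + para;
        }
        if (current) chunks.push(current);
        return chunks;
      });
    }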
In our use-case we have many gigabytes of PDFs that contain some qualitative data but also many pages of inline pdf tables. In an ideal world we’d be “compressing” those embedded tables into some text that says “there’s a table here with these columns, if you want to analyze it you can use this <tool>, but basically the table is talking about X, here are the relevant stats like mean, sum, cardinality.”
In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.
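What I'd like ingestion to emit instead is a placeholder chunk per table: a prose description plus summary stats and a pointer to a tool that can query the full thing. Roughly, with hypothetical field names and a hypothetical queryTable tool:

    // One summary chunk per inline PDF table, instead of embedding raw line items.
    interface TableSummaryChunk {
      kind: "table_summary";
      description: string;      // e.g. "royalty statement, Q3, by territory"
      columns: string[];
      rowCount: number;
      stats: Record<string, { sum?: number; mean?: number; cardinality?: number }>;
      tool: { name: string; tableId: string }; // e.g. { name: "queryTable", tableId: "doc12/p47/t1" }
    }

    // Only this text gets embedded; the retriever either answers from the
    // summary or calls the tool to get the complete table.
    function toEmbeddingText(c: TableSummaryChunk): string {
      const stats = Object.entries(c.stats)
        .map(([col, s]) => `${col}: sum=${s.sum ?? "n/a"}, mean=${s.mean ?? "n/a"}, distinct=${s.cardinality ?? "n/a"}`)
        .join("; ");
      return `Table (${c.rowCount} rows) about ${c.description}. Columns: ${c.columns.join(", ")}. ${stats}. Query in full via ${c.tool.name}("${c.tool.tableId}").`;
    }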
Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high-throughput models, setting up your own infra for long-context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matter a lot in that context.
It splits an input text into equal-sized chunks, using DFS and parallelization (rayon) to do so relatively quickly.
However, the goal for me is to use an LLM to split text by topic. I'm thinking I will implement it as a SaaS API on top of it being OSS. Do you think that's a viable business? You send in a library of text and receive a library of single-topic context chunks as output.
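To give a feel for the DFS splitting, here's a toy TypeScript sketch of one way to do equal-ish splitting by recursive halving; the real implementation is Rust + rayon and the parallelism is omitted, so treat this as the shape of the idea rather than the actual code.

    // Depth-first recursive halving: split at the whitespace nearest the
    // midpoint until every piece fits the target size.
    function splitEqualish(text: string, maxLen: number): string[] {
      if (maxLen < 1) throw new Error("maxLen must be >= 1");
      if (text.length <= maxLen) return [text];
      const mid = Math.floor(text.length / 2);
      const ws = text.lastIndexOf(" ", mid);
      const boundary = ws > 0 ? ws : mid; // keep words intact when possible
      return [
        ...splitEqualish(text.slice(0, boundary), maxLen),
        ...splitEqualish(text.slice(boundary).trimStart(), maxLen),
      ];
    }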
If I were to guess, most (adult) humans could not add two 3-digit numbers together with 100% accuracy. Maybe 99%? Computers can already do 100%, so we should probably be trying to figure out how to use language to extract the numbers from stuff and send them off to computers to do the calculations. Especially because, in the real world, most of the math that matters isn't just two-digit addition.
It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?
Not implying anything's off here, but it's interesting to me that this OpenAI-named entry is one of the few subdomains they have on their site.
Could just be an OpenAI-compatible endpoint too. A lot of LLM tools use OpenAI-compatible APIs, just like a lot of object storage tools use S3-compatible APIs.
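For anyone unfamiliar: "OpenAI-compatible" usually just means you point an existing OpenAI client at a different base URL. For example, with the official Node SDK (the endpoint and model name here are made up):

    import OpenAI from "openai";

    // Any server that speaks the OpenAI chat-completions wire format can
    // sit behind baseURL: vLLM, llama.cpp's server, various gateways, etc.
    const client = new OpenAI({
      baseURL: "https://llm-gateway.example.com/v1", // made-up endpoint
      apiKey: process.env.LLM_API_KEY ?? "",
    });

    const res = await client.chat.completions.create({
      model: "some-local-model",
      messages: [{ role: "user", content: "ping" }],
    });
    console.log(res.choices[0].message.content);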