IMO, a real solution here has to be hybrid, not full LLM, because these sheets can be massive and have very complicated structures. You want the LLM to identify / map column headers, while non-LLM tool calling runs the Excel operations like SUMIFs or VLOOKUPs. One of the most important traits in these systems is consistency under slight variations in file layout, since so much Excel work involves consolidating / reconciling reports that are produced quarterly or by a variety of sources, each with a different reporting structure.
Disclosure: My company builds ingestion pipelines for large multi-tab Excel files, PDFs, and CSVs.
"Hey Claude, what's the total from this file?
> grep for headers
> "Ah, I see column 3 is the price value"
> SUM(C2:C) -> $2020
> "Great! I found your total!"
If you can find me an example of tech that can solve this at scale on large, diverse Excel formats, then I'll concede, but I haven't found anything actually trustworthy for important data sets.
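For contrast, the hybrid flow I'm arguing for looks roughly like the toy TypeScript below: the LLM's only job is mapping messy headers to canonical fields (stubbed out here), and the aggregation itself is plain deterministic code. The names and the canonical field set are made up for illustration, not any product's actual API.

    // The LLM's entire responsibility: header text -> canonical field.
    type HeaderMap = Record<string, "price" | "ignore">;

    // In practice this would be one LLM call over the header row;
    // stubbed with a regex so the sketch stays self-contained.
    function mapHeaders(headers: string[]): HeaderMap {
      return Object.fromEntries(
        headers.map((h) => [h, /price|amount/i.test(h) ? "price" : "ignore"] as const)
      );
    }

    // The actual math is deterministic -- no LLM anywhere near it.
    function sumColumn(rows: string[][], headers: string[], field: "price"): number {
      const map = mapHeaders(headers);
      const idx = headers.findIndex((h) => map[h] === field);
      if (idx === -1) throw new Error(`no column mapped to ${field}`);
      return rows.reduce((acc, row) => acc + (Number(row[idx]) || 0), 0);
    }

    const headers = ["Item", "Unit Amount (USD)", "Qty"];
    const rows = [["widget", "1000", "2"], ["gadget", "1020", "1"]];
    console.log(sumColumn(rows, headers, "price")); // 2020

The point is that a layout change only breaks the cheap, auditable mapping step, not the arithmetic.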
Danya was like the Mr. Rogers of chess. He had a way of making you feel accepted into the chess community even if you were a beginner, and was such a clear thinker. I strive to be more like him, and am devastated by this loss.
The way my company uses Clickhouse is basically that we have one giant flat table, and we've written our own abstraction layer on top of it based around "entities", which are functions of the data in the underlying table, potentially adding in some window functions or joins. Pretty much every query we write with Clickhouse tacks a big "GROUP BY ALL" onto the end, because we are always trying to squash down the number of rows and aggregate as aggressively as possible.
I imagine we're not alone in building this type of abstraction layer, and some type safety would be very welcome there. I tried to build our system on top of Kysely (https://kysely.dev/), but the Clickhouse extension was not far enough along to make sense for our use-case. As such, we basically had to build our own parser that compiles down to SQL, but there are many type-error edge cases, especially when we're joining against data from S3 that could be CSV, Parquet, etc.
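For anyone curious what that kind of layer looks like, here's a stripped-down sketch of the "entity compiles down to SQL" idea. The names (compileEntity, events_flat, the measures) are hypothetical, and the real thing has to handle joins, window functions, and the external-source typing issues mentioned above.

    // An "entity": named expressions over the one flat table, compiled
    // into a ClickHouse query that aggregates as hard as possible.
    interface Entity {
      table: string;
      dimensions: Record<string, string>; // alias -> SQL expression
      measures: Record<string, string>;   // alias -> aggregate expression
    }

    function compileEntity(entity: Entity, where?: string): string {
      const cols = [
        ...Object.entries(entity.dimensions).map(([alias, expr]) => `${expr} AS ${alias}`),
        ...Object.entries(entity.measures).map(([alias, expr]) => `${expr} AS ${alias}`),
      ].join(",\n  ");
      return [
        `SELECT\n  ${cols}`,
        `FROM ${entity.table}`,
        where ? `WHERE ${where}` : "",
        "GROUP BY ALL", // always squash rows down as far as possible
      ].filter(Boolean).join("\n");
    }

    const revenue: Entity = {
      table: "events_flat",
      dimensions: { quarter: "toStartOfQuarter(reported_at)", source: "source" },
      measures: { gross: "sum(amount)", statements: "uniq(statement_id)" },
    };

    console.log(compileEntity(revenue, "asset_type = 'music_right'"));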
Side note: One of the things I love most about Clickhouse is how easy it is to combine data from sources beyond just the source database at query time. I imagine this makes the problem of building an ORM much harder as well, since you could need to build type-checking / ORM support against SQL queries to external sources, rather than against the source table itself.
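As a concrete example of the multi-source point: you can join the flat table against raw files in object storage right inside the query with the s3 table function. The bucket URL and columns below are made up; the relevant part is that the s3() side has no schema the type layer can check up front, which is exactly where our type-error edge cases show up.

    // Query-time join of the warehouse table against a CSV in S3.
    const fxAdjusted = `
      SELECT
        toStartOfQuarter(e.reported_at) AS quarter,
        sum(e.amount * r.rate) AS gross_usd
      FROM events_flat AS e
      LEFT JOIN s3(
        'https://example-bucket.s3.amazonaws.com/fx/*.csv',
        'CSVWithNames'
      ) AS r ON e.currency = r.currency
      GROUP BY ALL
    `;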
Given a model with pre-trained transcoders, it finds the circuit / attribution graph; i.e., it computes the direct effect that each non-zero transcoder feature, transcoder error node, and input token has on each other non-zero transcoder feature and output logit.
Given an attribution graph, it visualizes this graph and allows you to annotate these features.
Enables interventions on a model's transcoder features using the insights gained from the attribution graph; i.e., you can set features to arbitrary values and observe how the model's output changes.
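For readers who haven't seen attribution graphs before, the output is conceptually just a weighted directed graph. A generic sketch of the data shape (not this tool's actual API) might look like:

    // Nodes are active transcoder features, transcoder error nodes,
    // input tokens, and output logits; each edge carries the direct
    // effect of its source on its target.
    type NodeKind = "feature" | "error" | "token" | "logit";

    interface GraphNode { id: string; kind: NodeKind; layer?: number; position: number; }
    interface Edge { source: string; target: string; directEffect: number; }
    interface AttributionGraph { nodes: GraphNode[]; edges: Edge[]; }

    // Typical question asked of the graph: what contributed most,
    // directly, to a given output logit (or feature)?
    function topContributors(g: AttributionGraph, target: string, k = 10): Edge[] {
      return g.edges
        .filter((e) => e.target === target)
        .sort((a, b) => Math.abs(b.directEffect) - Math.abs(a.directEffect))
        .slice(0, k);
    }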
Syncopate | NYC (Hybrid ~3d/week) | Full-time | Senior Full Stack Engineers / Focus on AI + Finance
Syncopate builds tools to help automate financial diligence and management of long-tail financial assets.
We've found product market fit with ETL/analysis tools for niche financial data, starting with music rights, and we're looking to build out our capabilities across more Excel + PDF-based workflows.
What we're looking for: A full-stack engineer with experience building data-heavy applications. Experience with analytics databases like Clickhouse and data pipelining is a plus. Required to be knowledgeable in Typescript.
Big bonus points for:
1) High agency (previously a founder or built side-projects to completion)
2) Some knowledge of finance
3) Skill in Rust
It seems to me like chunking (or some higher-order version of it, like chunking into knowledge graphs) is the highest-leverage thing someone can work on right now if they're trying to improve the intelligence of AI systems for things like code completion, PDF understanding, etc. I'm surprised more people aren't working on this.
Chunking is less important in the long-context era, with most people just pulling in the top 20 chunks. You obviously don't want to butcher it, but you've got a lot of room for error.
We still want chunking in practice, to avoid LLM confusion and undifferentiated embeddings, and to handle large datasets at lower cost and higher volume. Large context means we can now tolerate multi-paragraph or multi-page chunks, so it's more like chunking by coherent section (rough sketch below).
In theory we could do an entire chapter or book, but those other concerns come in, so I only see more niche tools or talk-to-your-PDF products doing that.
At the same time, embedding is often a significant cost in the above scenarios, so I'm curious about the overhead of semantic chunking.
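Rough sketch of what I mean by chunking by coherent section, with a fallback for oversized sections; the heading pattern and token budget are arbitrary placeholders:

    const MAX_TOKENS = 2000;
    const approxTokens = (s: string) => Math.ceil(s.length / 4); // crude heuristic

    // Split on section boundaries first, then only sub-split sections
    // that blow past the budget, packing paragraphs together.
    function chunkBySection(doc: string): string[] {
      const sections = doc.split(/\n(?=#{1,3} )/); // markdown-style headings
      return sections.flatMap((section) => {
        if (approxTokens(section) <= MAX_TOKENS) return [section];
        const chunks: string[] = [];
        let current = "";
        for (const para of section.split(/\n{2,}/)) {
          if (current && approxTokens(current + para) > MAX_TOKENS) {
            chunks.push(current);
            current = "";
          }
          current += (current ? "\n\n" : "") + para;
        }
        if (current) chunks.push(current);
        return chunks;
      });
    }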
In our use-case we have many gigabytes of PDFs that contain some qualitative data but also many pages of inline pdf tables. In an ideal world we’d be “compressing” those embedded tables into some text that says “there’s a table here with these columns, if you want to analyze it you can use this <tool>, but basically the table is talking about X, here are the relevant stats like mean, sum, cardinality.”
In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.
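What I'd like ingestion to emit instead is a placeholder chunk per table: a prose description plus summary stats and a pointer to a tool that can query the full thing. Roughly, with hypothetical field names and a hypothetical queryTable tool:

    // One summary chunk per inline PDF table, instead of embedding raw line items.
    interface TableSummaryChunk {
      kind: "table_summary";
      description: string;      // e.g. "royalty statement, Q3, by territory"
      columns: string[];
      rowCount: number;
      stats: Record<string, { sum?: number; mean?: number; cardinality?: number }>;
      tool: { name: string; tableId: string }; // e.g. { name: "queryTable", tableId: "doc12/p47/t1" }
    }

    // Only this text gets embedded; the retriever either answers from the
    // summary or calls the tool to get the complete table.
    function toEmbeddingText(c: TableSummaryChunk): string {
      const stats = Object.entries(c.stats)
        .map(([col, s]) => `${col}: sum=${s.sum ?? "n/a"}, mean=${s.mean ?? "n/a"}, distinct=${s.cardinality ?? "n/a"}`)
        .join("; ");
      return `Table (${c.rowCount} rows) about ${c.description}. Columns: ${c.columns.join(", ")}. ${stats}. Query in full via ${c.tool.name}("${c.tool.tableId}").`;
    }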
Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high-throughput models, setting up your own infra for long-context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matter a lot in that context.
It splits an input text into equal-sized chunks, using DFS and parallelization (rayon) to do so relatively quickly.
However, the goal for me is to use an LLM to split text by topic. I'm thinking I will implement it as a SaaS API on top of it being OSS. Do you think that's a viable business? You send in a library of text and receive a library of single-topic context chunks as output.
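To give a feel for the DFS splitting, here's a toy TypeScript sketch of one way to do equal-ish splitting by recursive halving; the real implementation is Rust + rayon and the parallelism is omitted, so treat this as the shape of the idea rather than the actual code.

    // Depth-first recursive halving: split at the whitespace nearest the
    // midpoint until every piece fits the target size.
    function splitEqualish(text: string, maxLen: number): string[] {
      if (maxLen < 1) throw new Error("maxLen must be >= 1");
      if (text.length <= maxLen) return [text];
      const mid = Math.floor(text.length / 2);
      const ws = text.lastIndexOf(" ", mid);
      const boundary = ws > 0 ? ws : mid; // keep words intact when possible
      return [
        ...splitEqualish(text.slice(0, boundary), maxLen),
        ...splitEqualish(text.slice(boundary).trimStart(), maxLen),
      ];
    }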
If I were to guess, most (adult) humans could not add two 3-digit numbers together with 100% accuracy. Maybe 99%? Computers can already do 100%, so we should probably be trying to figure out how to use language to extract the numbers from stuff and send them off to computers to do the calculations. Especially because, in the real world, most of the math that matters isn't just two-digit addition.
It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?
Not implying anything's off here, but it's interesting to me that this OpenAI-named entry is one of the few subdomains they have on their site.
Could just be an OpenAI-compatible endpoint too. A lot of LLM tools use OpenAI-compatible APIs, just like a lot of object storage tools use S3-compatible APIs.
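For anyone unfamiliar: "OpenAI-compatible" usually just means you point an existing OpenAI client at a different base URL. For example, with the official Node SDK (the endpoint and model name here are made up):

    import OpenAI from "openai";

    // Any server that speaks the OpenAI chat-completions wire format can
    // sit behind baseURL: vLLM, llama.cpp's server, various gateways, etc.
    const client = new OpenAI({
      baseURL: "https://llm-gateway.example.com/v1", // made-up endpoint
      apiKey: process.env.LLM_API_KEY ?? "",
    });

    const res = await client.chat.completions.create({
      model: "some-local-model",
      messages: [{ role: "user", content: "ping" }],
    });
    console.log(res.choices[0].message.content);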