There are equal-weight S&P ETFs, which avoid having a handful of stocks dominate. However, they do have to do a lot more rebalancing to keep things in line.
I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It's the study of how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it's not about trying to make management a "science" in this case.
Virtually all current tokenization schemes do work at the raw byte level, not the UTF-8 character level. They do this to avoid the Out of Vocabulary (OOV) or unknown-token problem. In older models, if you came across something in the data you couldn't tokenize, you mapped it to an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocabulary. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add all UTF-8 code points to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained. You'd get a lot of glitch tokens (https://arxiv.org/abs/2405.05417). It does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
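A minimal sketch of what byte fallback looks like (the vocab, token ids, and helper name here are all made up for illustration; byte tokens are assumed to occupy ids 0–255, i.e. their raw byte values):

```python
def tokenize_with_byte_fallback(text: str, vocab: dict) -> list:
    """Greedy longest-match tokenization; unknown spans fall back to raw bytes."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest piece first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            # No subword matched: emit the UTF-8 bytes of this character.
            # Since all 256 byte tokens (ids 0..255) are in the vocab,
            # this always succeeds, so there is never an <UNK>.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids

vocab = {"hello": 300, " world": 301}
print(tokenize_with_byte_fallback("hello world!", vocab))  # [300, 301, 33]
```

Note the exact reversibility: concatenating the decoded pieces (subword strings plus raw bytes) always reproduces the original text.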
And in regard to UTF-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1–3 digit groups are in the vocab, it does much better.
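The right-to-left grouping can be sketched in a few lines (a hypothetical helper, not any particular tokenizer's pre-tokenizer):

```python
def group_digits_rtl(s: str, k: int = 3) -> list:
    """Split a digit string into groups of k, anchored at the right edge."""
    first = len(s) % k or k          # size of the (possibly short) leading group
    return [s[:first]] + [s[first + i:first + i + k]
                          for i in range(0, len(s) - first, k)]

print(group_digits_rtl("1234567"))  # ['1', '234', '567']
```

Anchoring at the right means the trailing groups always align with place value (ones/tens/hundreds, thousands, ...), which is exactly what left-to-right greedy splitting destroys.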
I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenizer training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do it. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and the model gets to start from the ones digit. This has been suggested as an approach for LLMs.
Math operations go right to left through the digits, while we write them left to right. So if you see the digits 123... in an autoregressive manner, you don't really know anything, since the number could be 12345 or 1234567. If you flip 12345 to 543..., you know the place value of each digit as you encounter it: the 5 you see first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.
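The flipping trick amounts to a one-line preprocessing step (illustrative only; function name is made up):

```python
import re

def reverse_digit_runs(text: str) -> str:
    """Reverse each run of digits so the ones place comes first."""
    return re.sub(r"\d+", lambda m: m.group()[::-1], text)

print(reverse_digit_runs("12345 + 678 = 13023"))  # "54321 + 876 = 32031"
```

In the reversed form, the model can emit the answer ones-digit first, carrying left to right, which matches how column addition actually proceeds.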
so basically reverse notation has the advantage of keeping magnitude of numbers (digits!) relative to each other constant (or at least anchored to the beginning of the number)
doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)
Attention does help, which is why LLMs can learn arithmetic even with arbitrary tokenization. However, if you put numbers in a standard form, such as right-to-left groups of 3, you make the problem easier to learn: all the examples the model sees are in the same format. The issue here is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to split the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best options.
> Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?
The bitter lesson is that general methods and a system that learns trumps trying to manually embed/program human knowledge into the system, so clever architecture is ok and expected.
Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result, or the usefulness of it, or the baseline tech: it's clear that if we keep having to create situation-specific inputs or processes, we're not at AGI with this baseline tech.
DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.
Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.
Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.
It doesn't feel particularly interesting to keep dismissing "these LLMs" as incapable of reaching AGI.
It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).
What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.
>If you showed it to people from the 90s, it would be mind blowing
90's? It's mind blowing to me now.
My daily driver laptop is (internally) a Thinkpad T480, a very middle of the road business class laptop from 2018.
It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonable advice and write simple programs from a description. It notices my tone and tries to adapt its manner.
All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.
Personally I'm hoping for advancements that will eventually allow us to build vehicles capable of reaching the moon, but do keep me posted on those tree growing endeavors.
It's about the immutability of the network at runtime. But I really don't think this is a big deal. General-purpose computers are immutable after they are manufactured, but can exhibit a variety of useful behaviors when supplied with different data. Human intelligence also doesn't rely on designing and manufacturing revised layouts for the nervous system (within a single human's lifetime, for use by that single human) to adapt to different settings. Is the level of mutability used by humans substantially more expressive than the limits of in-context learning? What about the limits of more unusual in-context learning techniques that are register-like, or that perform steps of gradient descent during inference? I don't know of a good argument that all of these techniques used in ML are fundamentally not expressive enough.
LLMs, considered as a function of input and output, are not immutable at runtime. They create tokens that change the function when it is called again. That breaks most theoretical arguments.
But that view is wrong, the model outputs multiple tokens.
The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens less than (context_len - prefix_len).
There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" is doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter.
The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).
I'm not claiming that compute makes them intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results you think demonstrate the impossibility of "DAG" intelligence and are applicable.
I am not saying it is impossible, I am saying it might be possible, but far from plausible with the current approach of LLMs in my experience with them.
That doesn't matter. Are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not.
What do the vector space embeddings for digit strings even look like? Can you do arithmetic on them? If that's even desirable, then it seems like you could just skip "embedding" altogether and intern all the numbers along one dimension.
Every conference has their own required LaTeX style file that must be used. Unless there is an automated way to convert these exactly, I don't see how LaTeX alternatives can be used.
CS strongly prefers LaTeX [0,1] while broader journals and conferences prefer MS Word over it [2,3]. As long as there is not a solid infrastructure for these other typesetting systems, I never saw the appeal. I think for internal company reports they do have their uses, but other than that, why not use LaTeX or Word? Realistically, any person wanting to submit work will know how to use one or the other.
I also don't see the need for journals and conferences to make a typst template for exactly these reasons. The templates will have to be community-made and then you still run the risk of having a paper rejected a year from now because the template is outdated.
I'm just saying that these systems don't work for me. I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge. I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.
Depends on the user. Basic LaTeX2e/LuaTeX can be learned in 5 days. Guru level, like any programming language, needs its 10K hours. Some people have an aversion to backslashes. The main reason for "\" is that it is perhaps the only character not commonly found in ordinary text; others, like ":", are very common. When parsing LaTeX (and behind it Knuth's original TeX engine), the commands are swimming in a sea of text (as the Dragon book says).
One thing that has helped with ease of use is Overleaf. It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper. It comes with many templates to get you started on a new document. If you're working with collaborators, it has a lock on the market.
LaTeX itself can be easy for simple things (pick a template, and put text in each section). And it can grow into almost anything if you put in enough effort. It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.
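As a small example of the math side, a standard amsmath display equation (nothing exotic, just `\usepackage{amsmath}` in the preamble) is hard to beat for expressiveness:

```latex
% Requires \usepackage{amsmath}
\begin{align}
  \operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}},
  \qquad
  \|x\|_2 = \Bigl(\sum_{i=1}^{n} x_i^2\Bigr)^{1/2}
\end{align}
```

Any alternative typesetting system has to match this kind of notation, and decades of packages built on top of it, before math-heavy authors will switch.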
I settled on LaTeX with Tectonic. You could always leverage Playwright or similar for an easy HTML → print-to-PDF pipeline without any weird libs (not great startup time, but you can batch many ops in one session).
# justfile ── put in repo root
set shell := ["bash", "-cu"]   # one shell → predictable env, pipe-fail, etc.

# Build a PDF once
pdf:
    tectonic -X compile src-v0.1/main.tex --outdir target/pdf   # or swap for typst
Yes, they were concurrent work. (Co-author of BoundlessBPE here). A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
Regarding $O(n L^2)$ vs $O(n L)$, that was because we somewhat sloppily tend to use the term 'tokenization' for both training a tokenizer vocab, and for tokenizing a given document. In the paper, we tried to always call the latter one segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter $O(n L)$. I'll update the README to be more explicit about this.
No, the segmentation algorithm you have implemented has runtime O(N*L^2). In particular, if you want to do a hash table lookup using a string key, that takes time proportional to the length of the string, not constant time.
That's an interesting point. While you're correct, of course, it is so common to consider a hash table lookup an O(1) operation that it never occurred to me. But in this case the loops are actually really tight, and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
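To make the cost concrete, here is a toy greedy segmenter (an illustrative stand-in, not the paper's actual algorithm) showing where the extra factor of L comes from: each vocab membership test hashes a candidate piece of length up to L.

```python
def segment(text: str, vocab: set, max_len: int) -> list:
    """Greedy longest-match segmentation.

    Outer loop: O(N) positions. Inner loop: O(L) candidate lengths.
    Each `in vocab` check hashes a string of length up to L, costing O(L),
    so the worst case is O(N * L^2) rather than O(N * L).
    """
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:       # hashing text[i:j] costs O(j - i)
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])          # single-character fallback
            i += 1
    return out

print(segment("abcd", {"ab", "cd"}, 2))  # ['ab', 'cd']
```

Tries or rolling hashes are the usual ways to drop the per-lookup cost back to amortized O(1) per character if that factor of L ever matters in practice.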