There are equal-weight S&P ETFs, which avoid having a handful of stocks dominate. However, they do have to do a lot more rebalancing to keep things in line.
I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It's the study of how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it's not about trying to make management a "science" in this case.
Virtually all current tokenization schemes do work at the raw byte level, not the UTF-8 character level. They do this to avoid the Out of Vocabulary (OOV) or unknown-token problem. In older models, if you came across something in the data you couldn't tokenize, you mapped it to an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocabulary. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add all UTF-8 code points to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained. You'd get a lot of glitch tokens (https://arxiv.org/abs/2405.05417). It does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
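A minimal sketch of what byte fallback looks like (the vocab, token ids, and helper name here are all made up for illustration; byte tokens are assumed to occupy ids 0–255, i.e. their raw byte values):

```python
def tokenize_with_byte_fallback(text: str, vocab: dict) -> list:
    """Greedy longest-match tokenization; unknown spans fall back to raw bytes."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest piece first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            # No subword matched: emit the UTF-8 bytes of this character.
            # Since all 256 byte tokens (ids 0..255) are in the vocab,
            # this always succeeds, so there is never an <UNK>.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids

vocab = {"hello": 300, " world": 301}
print(tokenize_with_byte_fallback("hello world!", vocab))  # [300, 301, 33]
```

Note the exact reversibility: concatenating the decoded pieces (subword strings plus raw bytes) always reproduces the original text.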
And in regard to UTF-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1–3 digit groups are in the vocab, it does much better.
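The right-to-left grouping can be sketched in a few lines (a hypothetical helper, not any particular tokenizer's pre-tokenizer):

```python
def group_digits_rtl(s: str, k: int = 3) -> list:
    """Split a digit string into groups of k, anchored at the right edge."""
    first = len(s) % k or k          # size of the (possibly short) leading group
    return [s[:first]] + [s[first + i:first + i + k]
                          for i in range(0, len(s) - first, k)]

print(group_digits_rtl("1234567"))  # ['1', '234', '567']
```

Anchoring at the right means the trailing groups always align with place value (ones/tens/hundreds, thousands, ...), which is exactly what left-to-right greedy splitting destroys.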
I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenizer training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do it. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and the model gets to start from the ones digit. This has been suggested as an approach for LLMs.
Math operations go right to left through the digits, while we write them left to right. So if you see the digits 123... in an autoregressive manner, you don't really know anything, since the number could be 12345 or 1234567. If you flip 12345 to 543..., you know the place value of each digit as you encounter it: the 5 you see first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.
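The flipping trick amounts to a one-line preprocessing step (illustrative only; function name is made up):

```python
import re

def reverse_digit_runs(text: str) -> str:
    """Reverse each run of digits so the ones place comes first."""
    return re.sub(r"\d+", lambda m: m.group()[::-1], text)

print(reverse_digit_runs("12345 + 678 = 13023"))  # "54321 + 876 = 32031"
```

In the reversed form, the model can emit the answer ones-digit first, carrying left to right, which matches how column addition actually proceeds.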
so basically reverse notation has the advantage of keeping magnitude of numbers (digits!) relative to each other constant (or at least anchored to the beginning of the number)
doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)
Attention does help, which is why LLMs can learn arithmetic even with arbitrary tokenization. However, if you put numbers in a standard form, such as right-to-left groups of 3, you make the problem easier to learn: all the examples the model sees are in the same format. The issue here is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to split the digits in a way that is easy for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best options.
> Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?
The bitter lesson is that general methods and a system that learns trumps trying to manually embed/program human knowledge into the system, so clever architecture is ok and expected.
Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result, or the usefulness of it, or the baseline tech: it's clear that if we keep having to create situation-specific inputs or processes, we're not at AGI with this baseline tech.
DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.
Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.
Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.
It doesn't feel particularly interesting to keep dismissing "these LLMs" as incapable of reaching AGI.
It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).
What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.
>If you showed it to people from the 90s, it would be mind blowing
90's? It's mind blowing to me now.
My daily driver laptop is (internally) a Thinkpad T480, a very middle of the road business class laptop from 2018.
It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonable advice and write simple programs from a description. It notices my tone and tries to adapt its manner.
All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.
Personally I'm hoping for advancements that will eventually allow us to build vehicles capable of reaching the moon, but do keep me posted on those tree growing endeavors.
It's about the immutability of the network at runtime. But I really don't think this is a big deal. General-purpose computers are immutable after they are manufactured, but can exhibit a variety of useful behaviors when supplied with different data. Human intelligence also doesn't rely on designing and manufacturing revised layouts for the nervous system (within a single human's lifetime, for use by that single human) to adapt to different settings. Is the level of mutability used by humans substantially more expressive than the limits of in-context learning? What about the limits of more unusual in-context learning techniques that are register-like, or that perform steps of gradient descent during inference? I don't know of a good argument that all of these techniques used in ML are fundamentally not expressive enough.
LLMs, considered as a function of input and output, are not immutable at runtime. They create tokens that change the function when it is called again. That breaks most theoretical arguments.
But that view is wrong, the model outputs multiple tokens.
The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens less than (context_len - prefix_len).
There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" is doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter.
The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).
I'm not claiming that compute makes them intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results you think demonstrate the impossibility of "DAG" intelligence and are applicable.
I am not saying it is impossible, I am saying it might be possible, but far from plausible with the current approach of LLMs in my experience with them.
That doesn't matter. Are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not.
What do the vector space embeddings for digit strings even look like? Can you do arithmetic on them? If that's even desirable, then it seems like you could just skip "embedding" altogether and intern all the numbers along one dimension.
Every conference has their own required LaTeX style file that must be used. Unless there is an automated way to convert these exactly, I don't see how LaTeX alternatives can be used.
CS strongly prefers LaTeX [0,1] while broader journals and conferences prefer MS Word over it [2,3]. As long as there is not a solid infrastructure for these other typesetting systems, I never saw the appeal. I think for internal company reports they do have their uses, but other than that, why not use LaTeX or Word? Realistically, any person wanting to submit work will know how to use one or the other.
I also don't see the need for journals and conferences to make a typst template for exactly these reasons. The templates will have to be community-made and then you still run the risk of having a paper rejected a year from now because the template is outdated.
I'm just saying that these systems don't work for me. I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge. I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.
Depends on the user. Basic LaTeX2e/LuaTeX can be learned in 5 days. Guru level, like any programming language, needs its 10K hours. Some people have an aversion to backslashes. The main reason for "\" is that it is perhaps the only character not commonly found in ordinary text; others, like ":", are very common. When parsing LaTeX (and behind it Knuth's original TeX engine), the commands are swimming in a sea of text (as the Dragon book says).
One thing that has helped with ease of use is Overleaf. It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper. It comes with many templates to get you started on a new document. If you're working with collaborators, it has a lock on the market.
LaTeX itself can be easy for simple things (pick a template, and put text in each section). And it can grow into almost anything if you put in enough effort. It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.
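As a small example of the math side, a standard amsmath display equation (nothing exotic, just `\usepackage{amsmath}` in the preamble) is hard to beat for expressiveness:

```latex
% Requires \usepackage{amsmath}
\begin{align}
  \operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}},
  \qquad
  \|x\|_2 = \Bigl(\sum_{i=1}^{n} x_i^2\Bigr)^{1/2}
\end{align}
```

Any alternative typesetting system has to match this kind of notation, and decades of packages built on top of it, before math-heavy authors will switch.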
I settled on LaTeX with Tectonic. You could always leverage Playwright or similar for an easy HTML → print-to-PDF pipeline without any weird libs (not great startup time, but you can batch many ops in one session).
# justfile ── put in repo root
set shell := ["bash", "-cu"]   # one shell → predictable env, pipe-fail, etc.

# Build a PDF once
pdf:
    tectonic -X compile src-v0.1/main.tex --outdir target/pdf   # or swap for typst
Yes, they were concurrent work. (Co-author of BoundlessBPE here). A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.
Regarding $O(n L^2)$ vs $O(n L)$, that was because we somewhat sloppily tend to use the term 'tokenization' for both training a tokenizer vocab, and for tokenizing a given document. In the paper, we tried to always call the latter one segmentation or inference. The former is $O(n L^2)$ per iteration, while the latter $O(n L)$. I'll update the README to be more explicit about this.
No, the segmentation algorithm you have implemented has runtime O(N*L^2). In particular, if you want to do a hash table lookup using a string key, that takes time proportional to the length of the string, not constant time.
That's an interesting point. While you're correct, of course, it is so common to consider a hash table lookup an O(1) operation that it never occurred to me. But in this case the loops are actually really tight, and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
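To make the cost concrete, here is a toy greedy segmenter (an illustrative stand-in, not the paper's actual algorithm) showing where the extra factor of L comes from: each vocab membership test hashes a candidate piece of length up to L.

```python
def segment(text: str, vocab: set, max_len: int) -> list:
    """Greedy longest-match segmentation.

    Outer loop: O(N) positions. Inner loop: O(L) candidate lengths.
    Each `in vocab` check hashes a string of length up to L, costing O(L),
    so the worst case is O(N * L^2) rather than O(N * L).
    """
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:       # hashing text[i:j] costs O(j - i)
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])          # single-character fallback
            i += 1
    return out

print(segment("abcd", {"ab", "cd"}, 2))  # ['ab', 'cd']
```

Tries or rolling hashes are the usual ways to drop the per-lookup cost back to amortized O(1) per character if that factor of L ever matters in practice.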