All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for 0x00 to 0xFF, and you can encode any novel UTF-8 words or structures with it, including emoji and characters that weren't part of the model's initial training, if you show it some examples.
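To make the byte-range idea concrete, here is a minimal sketch of byte fallback. The layout (ids 0 to 255 for raw bytes, learned pieces from 256 up) and the three-entry vocabulary are assumptions for illustration, not any particular tokeniser's real scheme.

```python
BYTE_TOKENS = 256  # assumed layout: ids 0..255 stand for raw bytes 0x00..0xFF

# Hypothetical tiny vocabulary; real tokenisers have tens of thousands of pieces.
VOCAB = {"hello": 256, " ": 257, "world": 258}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text):
    """Use vocabulary pieces where possible, otherwise fall back to UTF-8 bytes."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that starts at position i.
        match = None
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match is not None:
            ids.append(VOCAB[match])
            i += len(match)
        else:
            # Unknown character: emit one byte token per UTF-8 byte.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


def decode(ids):
    """Rebuild the byte stream from byte tokens and vocabulary pieces."""
    out = bytearray()
    for token_id in ids:
        if token_id < BYTE_TOKENS:
            out.append(token_id)
        else:
            out.extend(ID_TO_PIECE[token_id].encode("utf-8"))
    return out.decode("utf-8")


emoji_ids = encode("hello 🙂")  # two vocabulary ids followed by four byte tokens
```

Because every possible byte has a token, any UTF-8 string round-trips through `decode(encode(text))`, whether or not its characters were ever seen in training.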
Pretty sure that we’re talking apples and oranges. Yes to the arbitrary byte sequences used by tokenizers, but that is not the topic of discussion. The question is will the tokenizer come up with words not in the training vocabulary. Word tokenizers don’t, but character tokenizers do.
Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
“If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”
"If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."
Those tokens won't come up during training, but LLMs are capable of in-context learning. If you give a model some examples of how to create new words/characters in this manner as part of the prompt, it will be able to use those tokens at inference time. Show it some examples of how to compose a Thai or Chinese sentence out of byte tokens, give it a description of the hypothetical Unicode range of a custom alphabet, and a sufficiently strong LLM will be able to just output bytes despite those codepoints not technically existing.
And like I said, single-byte tokens very much are a part of word tokenisers, or to be precise, their token selection. "Word tokeniser" is a misnomer in any case - they are word piece tokenisers. English is simple enough that word pieces can be entire words. With languages where you have numerous suffixes, prefixes, and even in-fixes as a part of one "word" (as defined by "one or more characters preceded or followed by a space" - because the truth is more complicated than that), you have not so much "word tokenisers" as "subword tokenisers". A character tokeniser is just a special case of a subword tokeniser where the length of each subword is exactly 1.
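A rough sketch of the "special case" point: the same splitting routine behaves as a subword tokeniser or a character tokeniser depending only on the vocabulary it is given. Greedy longest-match is used here for simplicity (real tokenisers often use learned merge rules instead), and both vocabularies are made up for the example.

```python
def tokenise(text, vocab):
    """Greedy longest-match split of text into pieces drawn from vocab."""
    max_len = max(len(piece) for piece in vocab)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in vocab:
                pieces.append(candidate)
                i += n
                break
        else:
            raise ValueError(f"no piece in vocab covers {text[i]!r}")
    return pieces


# Hypothetical vocabularies: the second is the first restricted to length-1 pieces.
subword_vocab = {"un", "break", "able", "u", "n", "b", "r", "e", "a", "k", "l"}
char_vocab = {piece for piece in subword_vocab if len(piece) == 1}

subword_pieces = tokenise("unbreakable", subword_vocab)
char_pieces = tokenise("unbreakable", char_vocab)
```

With `subword_vocab` the word splits into three pieces; with `char_vocab` the identical function yields one piece per character.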
I just went through the quiz at Mochi and it said I was eligible for their nutrition program but not medication. The FAQ says your BMI has to be over 30 or 27 if you have some other health condition.
Our company never adds comments, because the code speaks for itself.
And with genAI I can have these comments added in very little time, helping me to get an overview of what happens.
But as for the "why are we doing this in the first place": business documentation is usually outside the source code and therefore out of reach of any genAI, for now.
As for what senior devs should do when coding:
> They're constantly:
> Refactoring the generated code into smaller, focused modules
> Adding edge case handling the AI missed
> Strengthening type definitions and interfaces
> Questioning architectural decisions
> Adding comprehensive error handling
Ain't nobody got time for that! The one girl and other guy that could do this, because they know the codebase, have no time to do it. Everyone else works by doing just enough, which is nearly what TDD dictates.
And we have PR code review to scrape up quality to barely get maintainable code. And never overcomplicate things, since writing code that works is good enough. And by the time you want to refactor a module three years later, you would want to use another data flow style or library to do the work altogether.
Yes, there was also a thing known as the Great Media Debate in the 90's iirc. MOOCs have a less than 10% completion rate. Media by itself is not the answer. What makes a big difference is teaching techniques and things like superhuman spaced repetition and system adaptation. See how Math Academy does it: https://www.mathacademy.com/pedagogy
1. Their code was calling a 21-parameter "matcher" function with 20 parameters of data.
2. They didn't notice, because all the matcher rules had "allow anything" for the 21st parameter and so never looked at it.
3. They later published the first list of rules with something other than "allow anything" as the 21st parameter, direct to customers.
4. On customer machines, the first rule with a non "match everything" 21st parameter went to look at the 21st element of the 20-element array. It expected a string pointer, but instead there was random stack data. It tried dereferencing this to read the string it was expecting, which caused the kernel driver to segfault during early startup, putting customer machines in a boot loop.
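The four steps above can be sketched in a few lines. This is an illustration only: the names (`ANY`, `rule_matches`, `rule_matches_checked`) and the sample data are hypothetical, and Python raises an `IndexError` where the real driver read whatever happened to be in memory.

```python
ANY = "*"          # hypothetical wildcard meaning "allow anything"
RULE_FIELDS = 21   # the matcher expects 21 parameters
INPUT_FIELDS = 20  # the caller only supplies 20


def rule_matches(rule, inputs):
    """Unchecked matcher: reads inputs[i] only when rule[i] is not a wildcard."""
    for i, pattern in enumerate(rule):
        if pattern == ANY:
            continue  # wildcard fields are never looked at, so a missing one goes unnoticed
        if inputs[i] != pattern:  # out-of-range read when i >= len(inputs)
            return False
    return True


def rule_matches_checked(rule, inputs):
    """Same matcher, but it rejects a field-count mismatch before reading anything."""
    if len(rule) != len(inputs):
        raise ValueError(f"rule has {len(rule)} fields, input has {len(inputs)}")
    return rule_matches(rule, inputs)


inputs = ["x"] * INPUT_FIELDS
old_rule = [ANY] * RULE_FIELDS                          # 21st field is a wildcard
new_rule = [ANY] * (RULE_FIELDS - 1) + ["some-string"]  # 21st field is a real pattern

old_result = rule_matches(old_rule, inputs)  # succeeds: the 21st input is never read

try:
    rule_matches(new_rule, inputs)
    new_rule_failed = False
except IndexError:
    new_rule_failed = True  # the first non-wildcard 21st field triggers the bad read
```

The checked variant would have flagged the mismatch on the very first rule, wildcard or not, which is why the bug stayed hidden only as long as every rule ignored the last field.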