I read one characterization: LLMs don't produce new information (except from the perspective of the user who is learning); they reorganize old information.


Custodians of human knowledge.


That’s only true if you tokenize words rather than characters. Character tokenization can generate new words outside the training vocabulary.


All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for 0x00 to 0xFF, and you can encode any novel UTF-8 words or structures with it. Including emoji and characters that weren't a part of the model's initial training, if you show it some examples.
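
You can see the byte fallback in practice; a minimal sketch assuming the tiktoken library and its cl100k_base encoding (other tokenisers expose the same mechanism differently):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # A rare emoji usually has no dedicated token, so the tokeniser
    # falls back to its reserved single-byte tokens for the UTF-8 bytes.
    ids = enc.encode("🪿")
    print(ids)
    for t in ids:
        print(t, enc.decode_single_token_bytes(t))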


Pretty sure that we’re talking apples and oranges. Yes to the arbitrary byte sequences used by tokenizers, but that is not the topic of discussion. The question is whether the model will come up with words not in the training vocabulary. Word tokenizers don’t allow that, but character tokenizers do.

Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.

“If you use word tokens: … will never be able to predict words outside of the training vocabulary.”

"If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."


Those tokens won't come up during training, but LLMs are capable of in-context learning. If you give it some examples of how to create new words/characters in this manner as part of the prompt, it will be able to use those tokens at inference time. Show it some examples of how to compose a Thai or Chinese sentence out of byte tokens, and give it a description of the hypothetical Unicode range of a custom alphabet, and a sufficiently strong LLM will be able to just output bytes despite those codepoints not technically existing.
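
Concretely, "composing a sentence out of byte tokens" just means emitting the text's UTF-8 bytes one token at a time; the underlying byte arithmetic looks like this (plain Python, no model involved):

    thai = "สวัสดี"  # "hello" in Thai
    raw = thai.encode("utf-8")
    print(list(raw))            # the 0x00-0xFF values a byte-token range covers
    print(raw.decode("utf-8"))  # decoding the byte sequence recovers the text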

And like I said, single-byte tokens very much are part of word tokenisers, or to be precise, of their token selection. "Word tokeniser" is a misnomer in any case - they are word-piece tokenisers. English is simple enough that word pieces can be entire words. With languages that have numerous suffixes, prefixes, and even infixes as part of one "word" (as defined by "one or more characters preceded or followed by a space" - because the truth is more complicated than that), you have not so much "word tokenisers" as "subword tokenisers". A character tokeniser is just a special case of a subword tokeniser where the length of each subword is exactly 1.
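
You can watch the subword behaviour directly (same tiktoken assumption as the sketch above): common English words often map to a single token, while longer or rarer words come back as several word pieces:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["cat", "unbelievably", "antidisestablishmentarianism"]:
        ids = enc.encode(word)
        print(word, "->", [enc.decode([t]) for t in ids])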


Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.


Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.
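
A rough sense of the cost (this assumes the usual quadratic scaling of transformer attention with sequence length, an architectural assumption rather than a property of tokenisation itself):

    # The same sentence as character steps vs. whitespace-word steps.
    s = "character level models see every letter as a separate step"
    chars, words = len(s), len(s.split())
    print(chars, words)                 # roughly 6x more steps per sentence
    print(round((chars / words) ** 2))  # and ~34x more attention work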


*Your own*, yes, but Cloudflare is extremely easy.


Hayy ibn Yaqdhan: nature vs. nurture and the relative nature of intelligence, iirc.


Disgusting!


Real yoga is possible.


That has what to do with this person saying a death is sad?


Bodies and minds can be transcended beyond suffering.


Mochi / Henry Meds. Mochi is the cheaper of the two.


I just went through the quiz at Mochi and it said I was eligible for their nutrition program but not medication. The FAQ says your BMI has to be over 30, or over 27 if you have some other health condition.
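
For reference, the cutoff is easy to check yourself; BMI is just weight over height squared (the standard formula, nothing Mochi-specific):

    def bmi(weight_kg: float, height_m: float) -> float:
        # BMI = weight (kg) / height (m) squared
        return weight_kg / height_m ** 2

    print(round(bmi(88.0, 1.75), 1))  # 28.7: over the 27 bar, under the 30 bar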


Take my advice at your own risk, but nobody is checking your math.

I was 10 pounds or so from qualifying, so I fudged my numbers a bit. Didn't make sense to force myself to gain weight so I could lose weight.

Places like OrderlyMeds don't even require a telehealth visit, just the questionnaire and a photo.


One thing LLMs can do besides generate code is explain complex code. That is inherently an upskilling feature.


Our company never adds comments, because the code speaks for itself. With genAI I can have comments added in very little time, helping me get an overview of what happens.

But as for the "why are we doing this in the first place": business documentation usually lives outside the source code and is therefore out of reach of any genAI, for now.

As for what senior devs should do when coding:

> They're constantly:

> Refactoring the generated code into smaller, focused modules

> Adding edge case handling the AI missed

> Strengthening type definitions and interfaces

> Questioning architectural decisions

> Adding comprehensive error handling

Ain't nobody got time for that! The one girl and the other guy who could do this, because they know the codebase, have no time to do it. Everyone else works by doing just enough, which is nearly what TDD dictates. And we have PR code review to scrape up quality to barely get maintainable code. And never overcomplicate things, since writing code that works is good enough. And by the time you want to refactor a module three years later, you'd want to use another data-flow style or library to do the work altogether.


Yes, there was also a thing known as the Great Media Debate in the '90s, iirc. MOOCs have a less than 10% completion rate. Media by itself is not the answer. What makes a big difference is teaching technique: things like superhuman spaced repetition and system adaptation. See how Math Academy does it: https://www.mathacademy.com/pedagogy
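
For the spaced-repetition part, the classic baseline is an SM-2-style interval update; a minimal sketch of that family of algorithms (Math Academy's actual scheduler is presumably far more sophisticated):

    def sm2_update(interval_days: float, ease: float, grade: int):
        """One SM-2-style review step. grade: 0 (forgot) to 5 (perfect)."""
        if grade < 3:
            return 1.0, ease  # lapse: restart at a short interval
        # Standard SM-2 ease-factor adjustment, clamped at 1.3.
        ease = max(1.3, ease + 0.1 - (5 - grade) * (0.08 + (5 - grade) * 0.02))
        return interval_days * ease, ease

    interval, ease = 1.0, 2.5
    for grade in [5, 4, 5, 3]:
        interval, ease = sm2_update(interval, ease, grade)
        print(round(interval, 1), round(ease, 2))  # intervals stretch out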


I have yet to take a MOOC that is even marginally engaging and teaches something new.

My only guess is that they are there for the certs.


Psychology of hunger behaviors.


Surprised no one has mentioned the Crowdstrike issue, which was due to NUL characters, wasn't it?


It was not. The Crowdstrike issue was:

1. Their code was calling a 21-parameter "matcher" function with 20 parameters of data.

2. They didn't notice, because all the matcher rules had "allow anything" for the 21st parameter and so never looked at it.

3. They later published the first list of rules with something other than "allow anything" as the 21st parameter, direct to customers.

4. On customer machines, the first rule with a non "match everything" 21st parameter went to look at the 21st element of the 20 element array. It expected a string pointer, but instead there was random stack data. It tried dereferencing this to read the string it was expecting, which caused the kernel driver to segfault during early startup, putting customer machines in a boot loop.

https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann...
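
A toy reconstruction of that failure mode (hypothetical names, and Python instead of the actual C++ driver; here the over-read raises IndexError, whereas the kernel code read random stack memory and dereferenced it):

    def matcher(rule, args):
        # rule has 21 fields; callers historically passed only 20 args
        for i, pattern in enumerate(rule):
            if pattern == "*":      # "allow anything": this field is never read
                continue
            if args[i] != pattern:  # field 21 over-runs the 20-slot array
                return False
        return True

    args = ["foo"] * 20                # the sensor always supplied 20 inputs
    old_rule = ["foo"] * 20 + ["*"]    # wildcard 21st field: bug stays hidden
    new_rule = ["foo"] * 20 + ["bar"]  # real 21st field: bug triggers

    print(matcher(old_rule, args))     # True
    try:
        print(matcher(new_rule, args))
    except IndexError as e:
        print("over-read:", e)         # in the C++ driver: garbage pointer deref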

