You are citing a single use of em-dashes in a single 30-year-old article as proof of something.
If anything, the length of that article shows how rarely em-dashes were used by most writers. They're like exclamatory versions of semicolons, a contrived sudden interruption, a sort of inversion of the three-dot "…" ellipsis. Maybe the em-dash cracked and fell on the floor.
The reason LLMs use a lot of em-dashes is that it's a format they've chosen for output. Thinking that LLMs have a lot of em-dashes because works in the wild have a lot of em-dashes is like thinking that LLM output has a lot of emoticons because a lot of essayists use emoticons to mark subject divisions in the text.
A single one is sufficient evidence that calling out a single em-dash as evidence of AI use is flawed. Especially when it is from the same magazine.
There are also em-dashes in a huge number of their articles. I didn't spend time picking one. I just went back to the oldest article in the first category I picked, and found one on the first try. It's a common style for more "serious" magazines and always has been.
> Thinking that LLMs have a lot of em-dashes because works in the wild have a lot of em-dashes is like thinking that LLM output has a lot of emoticons because a lot of essayists use emoticons to mark subject divisions in the text.
No, thinking they do is like having read a lot of literary text and being aware that the em-dash has a long history of being used in serious writing.
If you read a lot of books, particularly older ones, you'll find em-dashes used often and in all kinds of writing. It's functional punctuation, and once you understand it you may even find yourself using it (and then being accused of being an AI, lol).
If you have n buckets and n+1 items, you're guaranteed to have at least one bucket holding two or more items (the pigeonhole principle).
In the case of hash algorithms you're taking an arbitrary-sized input and "compressing" it (in quotes because this is one-way; you can't decompress it because of collisions) into a fixed size. If you permit more inputs to your hash function than there are hash values, then you will eventually have a collision.
A stupid awful hash function: n mod 100
So long as n is a non-negative integer less than 100, you will never have a collision. But as soon as you compute the hash of n = 100, you will get a collision (with 0 in this case). Now, real-world hash algorithms have larger spaces they map to, and more complicated mappings, but they all have this same problem. The larger the space (like 256 bits versus 64 bits), the less likely collisions become, but they can still happen.
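To make that concrete, here is a minimal Python sketch of the n mod 100 example. The names `toy_hash` and `first_collision` are made up for illustration; it just walks n upward from 0 and reports the first pair of inputs that land on the same value.

```python
def toy_hash(n: int) -> int:
    """The deliberately bad hash from the comment above: n mod 100."""
    return n % 100


def first_collision(limit: int):
    """Hash 0, 1, ..., limit - 1 and return the first colliding pair, or None."""
    seen = {}  # hash value -> first input that produced it
    for n in range(limit):
        h = toy_hash(n)
        if h in seen:
            return seen[h], n
        seen[h] = n
    return None


print(first_collision(100))  # None: inputs 0..99 all map to distinct values
print(first_collision(101))  # (0, 100): 100 collides with 0
```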
Hash functions represent a chunk of data with fewer bits than the original data, hence there's always a _chance_ of a collision. With cryptographic hashes, the output of the hash function is relatively large, making the probability of an accidental hash collision vanishingly small. For example, the SHA-256 hash algorithm can produce over 115 quattuorvigintillion (2^256, roughly 1.16 × 10^77) different values.
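If you want to check that figure yourself, a quick sketch using Python's standard `hashlib` works. The input `b"hello"` is arbitrary, and the variable names are only for illustration; a quattuorvigintillion is taken as 10^75 (short scale).

```python
import hashlib

# Any input produces a digest of the same fixed size.
digest = hashlib.sha256(b"hello").digest()
bits = len(digest) * 8          # 32 bytes * 8 = 256 bits

total_values = 2 ** bits        # number of distinct possible SHA-256 outputs
quattuorvigintillion = 10 ** 75

print(bits)                                   # 256
print(total_values // quattuorvigintillion)   # 115
```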
The hashing functions used with hash tables typically reduce the hashed value to one of only tens or hundreds of values, making collisions unavoidable. Typically, a hash table will try to keep the number of available slots roughly equal to the number of items stored in the table, to get a good balance of lookup time and memory use. For an extreme example, in Ruby, hashes with fewer than a certain number of entries (6, I believe) are just represented internally as a list, because the overhead of using an actual hash table is greater than just iterating through every item in the list.
A hash function has a limited range of outputs (e.g. for a hash table it might be a number only a few bits large), whereas the space of possible inputs is larger or even unrestricted, for example arbitrary text.
If you have a space of possible inputs that isn't larger than the space of outputs, then you can indeed design hash functions that do not collide.
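A rough sketch of that reduction step, using Python's built-in `hash()` and a modulo. This is not how any particular language's hash table is actually implemented; `bucket_keys` and the slot count of 8 are assumptions for illustration only.

```python
from collections import defaultdict


def bucket_keys(keys, num_slots):
    """Group keys by slot index, where slot = hash(key) mod num_slots."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[hash(key) % num_slots].append(key)
    return dict(buckets)


# 20 keys squeezed into 8 slots: some slot must hold at least 3 of them.
keys = [f"user{i}" for i in range(20)]
buckets = bucket_keys(keys, num_slots=8)
largest = max(len(slot) for slot in buckets.values())

for slot, members in sorted(buckets.items()):
    print(slot, members)
```

Which keys end up sharing a slot varies between runs, because Python randomizes string hashes per process, but with more keys than slots some sharing is guaranteed.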
Think about what happens when the number of possible outputs is smaller than the number of possible inputs.
If we have a hash function f(n) that outputs a number between 1 and 100, but n can be any number between 1 and 1000, then some inputs must collide.