So to be "safe" and "secure" we can only have strings 256 characters long, or we need to waste a few bytes repeatedly for short strings. Sounds like the UTF-8 vs UTF-16/32 debate.
The reality is that null-terminated strings are dramatically more expensive than strings with a length counter in every regard other than memory usage, and the memory usage overhead from storing a length value is utterly minuscule compared to the actual size of the string. Even if you ignore all the secondary costs that result from the decision to use null-terminated strings, they're just poor engineering. There are far better ways to save a few bytes.
(By secondary costs I mean things like the myriad bugs caused by null-terminated strings, the severe performance penalties involved in copying and manipulating them, the unfortunate implications they have for file formats and network protocols, etc.)
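To make the copying cost concrete, here's a rough C sketch (names are made up, and both versions assume the caller has allocated enough room): appending to a null-terminated string has to rescan everything already written, while a length-counted string can jump straight to the end.

    #include <string.h>

    /* Appending with strcat() rescans the whole destination every time,
       so building a string out of N pieces costs O(N^2) overall. */
    void append_cstr(char *dst, const char *src) {
        strcat(dst, src);                      /* strcat calls strlen(dst) first */
    }

    /* With a stored length, the write position is already known: O(N). */
    struct lstring { size_t len; char *data; };

    void append_lstring(struct lstring *dst, const char *src, size_t n) {
        memcpy(dst->data + dst->len, src, n);  /* jump straight to the end */
        dst->len += n;
    }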
My "ideal" string would be to store it as a UTF-8 rope, with the additional restriction (doesn't change the interface any) that all characters within a node in the rope have the same length. (You can use overlong encodings internally if it makes sense (One single-byte character in a bunch of longer characters), which is a microoptimization that will in some cases save a few bytes.)
I'd also treat a character + combining characters as a single character.
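For what it's worth, a minimal sketch of what such a node could look like (the field names are made up, and grapheme segmentation is assumed to happen elsewhere):

    #include <stddef.h>
    #include <stdint.h>

    /* Every character stored in a leaf occupies exactly char_width bytes,
       so indexing within the leaf is plain arithmetic. */
    struct rope_node {
        struct rope_node *left, *right;  /* children; NULL for a leaf     */
        size_t char_count;               /* characters held by this leaf  */
        uint8_t char_width;              /* bytes per character in leaf   */
        uint8_t *bytes;                  /* char_count * char_width bytes */
    };

    /* Constant-time access to the i-th character's bytes in a leaf. */
    static const uint8_t *leaf_char(const struct rope_node *leaf, size_t i) {
        return leaf->bytes + i * leaf->char_width;
    }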
>that all characters within a node in the rope have the same length [...] I'd also treat a character + combining characters as a single character.
The problem with this is that Unicode doesn't restrict the number of combining marks. If your hypothetical library wants to offer full Unicode support, your "nodes of same length" idea wouldn't work.
Of course an implementation which makes an arbitrary restriction wouldn't be unusual. In fact, I'm not aware of any application that supports an arbitrary number of combining marks even if the standard allows it.
When it comes to standard conformance, UAX15-D3 [1] is probably the closest we could get. It'd require 128 bytes per character.
The length of each node is not the same; each character within a node is the same number of bytes long.
So you end up with one node in the rope that holds a single logical character which is some absurd number of bytes long. (In reality there'd be a maximum of 2^8-1 or 2^16-1 bytes per character, or something along those lines.)
A length field really is insignificant. On a 64-bit machine, a 32-bit length field (enough for 4GB strings) is half the size of the pointer to the string itself! C++ STL strings already use a length specifier and I don't think anyone is complaining about performance because of it.
Length could be encoded differently, as a varint: as long as the highest bit of a byte is set, the next byte is also part of the length. Left-shift the result so far by seven and add the 7 lower bits; as soon as the highest bit is 0, we have the final length.
The processing overhead is low, and lengths up to 127 cost only one byte...
Not such a big issue.
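A rough decoder for that rule, assuming the most-significant-group-first order described above (the function name is made up, and there's no overflow checking):

    #include <stddef.h>
    #include <stdint.h>

    /* While the high bit is set, shift the value left by seven and add the
       low seven bits; a byte with the high bit clear ends the length. */
    size_t decode_varint_length(const uint8_t *p, size_t *out_len) {
        size_t value = 0;
        size_t consumed = 0;
        uint8_t byte;
        do {
            byte = p[consumed++];
            value = (value << 7) | (byte & 0x7F);  /* add the 7 payload bits   */
        } while (byte & 0x80);                     /* high bit set: keep going */
        *out_len = value;
        return consumed;                           /* bytes used by the prefix */
    }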
The speedup would be well worth it: you could tell the mmc to move 100 bytes from position 20000 to position 40000, instead of "give me 8 bytes from position 20000", then check every one of those bytes, then ask the mmc again for 8 bytes from position 20008, and so on.
If I got to choose in this day and age, I would define a C string/array as a pointer to a long int, with our data following immediately after the long int.
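Roughly like this, as a sketch of the layout rather than a finished design (the helper name is made up):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* The string value is a pointer to a long holding the length,
       and the bytes follow right after it. */
    long *make_prefixed_string(const char *src, long n) {
        long *s = malloc(sizeof(long) + (size_t)n);
        if (!s) return NULL;
        *s = n;                          /* length word first           */
        memcpy(s + 1, src, (size_t)n);   /* data immediately afterwards */
        return s;
    }

    int main(void) {
        long *s = make_prefixed_string("hello", 5);
        if (s) {
            printf("%ld bytes: %.*s\n", *s, (int)*s, (const char *)(s + 1));
            free(s);
        }
        return 0;
    }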