Yeah, if we had only used strings prefixed with 2-byte length fields, everybody would have been happy, because a 64 KB string is enough for everyone. (And let's be realistic, nobody sane would have chosen a 4-byte string length back in the early 70s.)
So, if we had gone down that path, what would we have? All the fun of "legacy" APIs that seem to work but internally only accept strings up to 64 KB and mysteriously chop off excess bytes when you least expect it. It's the Y2K problem all over again.
And just when you finally think you're done with it, memory is cheaper again, size_t is 64-bit, and someone invariably wants to store a binary blob >4 GB as a string. Fun times again.
Have we forgotten how much trouble we went through in the 90s to handle memory on the x86 "640k is enough for everybody" architecture?
This is similar to the "kill Hitler" time travel joke [1]: it's easy to say things would be better, but all we know for sure is that they would be different. Instead of `char` we'd have strings that were `struct`, and we'd STILL have a ton of different string formats because of lengths and character formats. (Bonus problem: are the lengths in bytes, or in chars?)
That's exactly equivalent to having a "size" parameter with the same size as the pointer, except you have to use a subtract instruction when you want to get the length of the string, so I'd say it's inferior to just storing the length of the string.
For instance, if you copy a string you also have to update the end pointer instead of just copying the size attribute in bulk. And you get the same disadvantages of non-portable strings: different representations depending on the architecture/endianness, etc.
I completely agree with the OP, there's no perfect solution. If addr + len was truly superior I'm sure we'd see
struct string { long len; char s[]; };
or for your version
struct string { char *endptr; char s[]; };
everywhere. And the C standard library would have evolved along with it.
Off the top of my head, the only thing that makes '\0'-terminated strings special in C is that it's the way string literals are represented. It would be trivial to recode all of string.h using addr + len instead of NUL termination.
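For illustration, a minimal sketch of what a couple of addr + len replacements for string.h routines could look like (the struct and function names are made up, not from any real library):

    #include <stddef.h>
    #include <string.h>

    struct strview { const char *s; size_t len; };

    /* length is a field read, O(1), instead of a scan for '\0' */
    size_t sv_len(struct strview v) { return v.len; }

    /* equality: compare the lengths first, then the bytes */
    int sv_eq(struct strview a, struct strview b) {
        return a.len == b.len && memcmp(a.s, b.s, a.len) == 0;
    }

    /* string literals still work: wrap them once at the boundary */
    struct strview sv_from_cstr(const char *p) {
        struct strview v = { p, strlen(p) };
        return v;
    }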
> That's exactly equivalent to having a "size" parameter with the same size as the pointer, except you have to use a subtract instruction when you want to get the length of the string, so I'd say it's inferior to just storing the length of the string.
Except that it has the important property that the (effective) length descriptor, being a pointer, would necessarily "grow" over time (across generations of machines, e.g. 16 bit -> 32 bit, etc.) and would thus never impose any artificial restrictions on string length.
>Off the top of my head, the only thing that makes '\0'-terminated strings special in C is that it's the way string literals are represented. It would be trivial to recode all of string.h using addr + len instead of NUL termination.
exactly, anywhere you would have
if (str[i] == '\0')
you replace with
if (str.s + i == str.end)
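As a sketch (the field names just mirror the s/end above; the struct itself is hypothetical):

    #include <stdio.h>

    struct span { const char *s; const char *end; };  /* hypothetical */

    /* NUL-terminated: stop at the terminator */
    void print_cstr(const char *str) {
        for (size_t i = 0; str[i] != '\0'; i++)
            putchar(str[i]);
    }

    /* end pointer: stop when the cursor reaches end */
    void print_span(struct span str) {
        for (const char *p = str.s; p != str.end; p++)
            putchar(*p);
    }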
And, of course, I'm sure it's completely safe to treat something as a pointer but explicitly set it to memory you don't own (i.e. just past the end of the string). I guess we could point to the last byte and doom everyone to inevitable off-by-one errors in string length computation.
It should have made it easier, actually, because with pointers you are usually just doing addition and subtraction, and everything should be independent of the actual value of the pointer.
In the same week that str+len was being abused left and right, someone surfaces an article to the front page about how str+NUL is wrong and everyone should use str+len.
NUL terminated strings were the right decision for C. They’re certainly much simpler than length fields.
Consider using a length field. How big should that field be? If it's fixed size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.
Next, what endianness should that field have? NUL-terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.
Also, consider C-style string parsing, e.g. strtok/strsep. These could not be implemented with length-field strings.
Explicit length is better when you have an enforced abstraction, like std::string, but at that point you’re not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.
So what was the “one-byte mistake?” The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that “mistake,” we would be unable to make a string longer than 65k - a far more serious limitation than anything NUL termination imposes!
No one doubts that there were advantages to NUL-terminated strings, but against them you have to weigh the many thousands of security holes that were thereby created.
UTF-8 got it right with having a variable-length byte representation of numerical values. Seven-bit values unaffected. Longer values use more bytes as necessary.
The C approach takes a whole different philosophy. You want a "string"? NUL terminated. Simple. You want a buffer? Do it yourself.
Instead of a length field, use a pointer to the last character. The length is the difference of these two pointers. The maximum string length is the size of your address space. Problem solved.
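In struct form (a sketch; the type is made up, and note the +1 since the pointer is to the last character rather than one past it):

    #include <stddef.h>

    struct lcstr { const char *first; const char *last; };  /* hypothetical */

    size_t lcstr_len(struct lcstr s) {
        return (size_t)(s.last - s.first) + 1;   /* inclusive range, hence the +1 */
    }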
Hypothetically, C could use the first `sizeof (size_t)` bytes to store the length. Endianness shouldn't be much of an issue; just use the endianness of the current machine. You don't typically write the NUL terminator when writing a string to a file; similarly, you wouldn't typically write the length field when writing a counted string to a file.
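A rough sketch of that layout (cs_new/cs_len/cs_data are made-up names, purely for illustration):

    #include <stdlib.h>
    #include <string.h>

    /* layout: [ size_t length ][ character data ... ], no NUL needed */
    typedef char *cstr;

    cstr cs_new(const char *src, size_t n) {
        char *block = malloc(sizeof(size_t) + n);
        if (!block) return NULL;
        memcpy(block, &n, sizeof(size_t));        /* native endianness, as above */
        memcpy(block + sizeof(size_t), src, n);
        return block;
    }

    size_t cs_len(cstr s) {
        size_t n;
        memcpy(&n, s, sizeof(size_t));
        return n;
    }

    char *cs_data(cstr s) { return s + sizeof(size_t); }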
I agree that NUL-terminated strings were the right decision at the time (and aren't that bad now), but there are sane ways to do counted strings.
You piqued my interest, and it looks like storing NUL-terminated strings in files is actually common. tar, ELF, JPEG, gzip all use them. When the serialization format matches the in-memory format, you can mmap directly into a struct, which is fast and convenient.
A lot of programming decisions were made to save a byte here and there. It's easy to point at them today and say they're "bad", but at the time they were the absolutely correct thing to do. It's hard to imagine now but not saving that byte could mean your program wouldn't fit into RAM. Try telling your management in the 1960s that your program won't load because it's "properly coded" and see how far you get.
What we've failed to do is ever revisit those decisions and change them where we've identified problems. Yes, you can probably compile (with warnings) files from UNIX v7, but we pay for that compatibility. And there's no question that designing, building, and maintaining a libc alternative is a colossal undertaking, not likely to happen on a whim. So here we are.
Well, strings without an explicit length field allow for things like strstr(3) or prefix parsing without performance penalties due to reallocating memory.
Blatantly incorrect. Try thinking over how you might implement those two operations on a string with a length field and see if you can figure out why your statement is wrong.
(referring to an allocation-free strstr if strings had an associated length)
If a string was a (pointer,length) pair, then strstr would return a pair (pointer+offset,length-offset) where, just like in the original strstr, the new pointer points into the existing string. No allocation needed
(But what about the pointer-length pair that has to be created? Well, it will live on the stack just like the return value from the original strstr, which has to be stored somewhere also.)
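A sketch of that (a naive scan, just to make the no-allocation point concrete; the strview type and sv_strstr name are made up):

    #include <stddef.h>
    #include <string.h>

    struct strview { const char *s; size_t len; };

    /* returns a view into haystack starting at the first match of needle;
       a NULL/0 view means "not found"; no allocation anywhere */
    struct strview sv_strstr(struct strview haystack, struct strview needle) {
        struct strview miss = { NULL, 0 };
        if (needle.len > haystack.len)
            return miss;
        for (size_t i = 0; i + needle.len <= haystack.len; i++) {
            if (memcmp(haystack.s + i, needle.s, needle.len) == 0) {
                struct strview hit = { haystack.s + i, haystack.len - i };
                return hit;   /* (pointer+offset, length-offset), pointing into the original */
            }
        }
        return miss;
    }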
Except that's not how K&R would have done it. It would have been Pascal style: one byte of length directly followed by the string itself in memory. The pointer would be the address of the length byte. Now it's hard to create strstr without copying, so they wouldn't have. Instead you would have a new API where every call takes offsets from the beginning, and likely it would be one-based too. Anyway, C does it the way it does because the PDP set the zero condition flag on move.
When someone makes a post demonstrating a complete failure to consider an argument before making it (claiming that a substring search requires allocation of any kind when all it does is scan and return an offset...) I don't think it's necessary for me to explain in elaborate detail why they're wrong.
I don't need to explain how I'd implement it because every obvious implementation of the strstr algorithm doesn't need to allocate because it's a substring search! Search operations are pure and don't mutate!
Every single implementation of strstr or equivalent in every single programming language I have ever seen does not require an allocation. It either returns a pointer to the location of the first matched result, or returns the character index of the first match. That's it.
Any other absurd implementation you can think up to justify null-terminated strings - like returning a copy of the string with a null terminator after the match - just doesn't make any sense. That's a different operation.
int strstr(string haystack, string needle, int offset);
DESCRIPTION
The strstr() function finds the first occurrence of the substring needle
in the string haystack, starting from position offset in haystack.
RETURN VALUE
This function returns the index of the first character in haystack, at or
after offset, where needle can be found, or -1 if the substring is not
found (or if offset is out of bounds for haystack).
Same complexity and memory overhead as C's strstr(). Actually more efficient when needle is longer than what remains of haystack after offset, which can be checked in O(1) on entry.
(I don't have a problem with NUL-terminated strings, but couldn't resist a little sideways thinking on how an alternate library implementation for strings might work.)
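Just to make that concrete, a minimal sketch of an implementation over a (pointer, length) string type (the struct, the str_find name, and the long return type are all hypothetical; it's renamed to avoid colliding with the real strstr):

    #include <stddef.h>
    #include <string.h>

    struct str { const char *s; size_t len; };

    /* index of the first occurrence of needle at or after offset,
       or -1 if not found or offset is out of bounds */
    long str_find(struct str haystack, struct str needle, size_t offset) {
        if (offset > haystack.len)
            return -1;
        if (needle.len > haystack.len - offset)   /* the O(1) early exit */
            return -1;
        for (size_t i = offset; i + needle.len <= haystack.len; i++) {
            if (memcmp(haystack.s + i, needle.s, needle.len) == 0)
                return (long)i;
        }
        return -1;
    }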
That's a far less useful function, since you still have to do a memory copy if you want to pass the return value to a function that operates on strings.
Only if the first occurrence of needle happens to terminate the string. Otherwise the return value of the real version of strstr returns a string that starts with needle but is followed by more text.
OK, so for that one particular use case[0], the result of strstr might not be quite as convenient as the NUL-terminated strstr() case. But that's what you get with a completely different string implementation - some operations are more efficient than others, others less so. It's a bunch of trade-offs.
But strstr() finds the position of a substring within a larger string, or indicates that it doesn't exist. My suggested API for the equivalent functionality does exactly that, with no problems or performance issues for the job strstr() itself needs to do that I can see.
[0] A use case I have to say I find a little contrived. I can say that the majority of the times I've used strstr(), it's either to simply check that needle exists in haystack, or to take the part of the haystack following needle - at which point I normally need to make a copy because I can't guarantee that the source string will be around for the lifetime that I need the remainder for.
Different experiences, I guess. Your "contrived use case" is my "only reason I've ever used strstr". That said, I haven't done a ton of C programming, so maybe my experience is atypical.
When I was at PARC the Mesa guys (who had counted strings) did some analysis and (at least in those days) the counted strings ended up being, in aggregate, faster. I suspect the advantage would be even greater these days since memory allocation was a bigger deal back then.
I wonder if you could do this compatibly in the compiler by adding another primitive type (counted string) which had the length in the bytes before the start of the null-terminated string. You'd need a new type because various routines in the standard library would have to invisibly have two versions for counted and non-counted strings (since if you incremented a string pointer, or used a function like strchr, you'd have to treat it as a regular char). "Safe" code would use a different call (say, cstrchr) that returned an index instead of a char. The compiler could optionally warn on unsafe "legacy" calls as it can with strcpy instead of strncpy.
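Roughly the layout being described, hand-rolled (the compiler/type support is the hypothetical part; make_counted and counted_len are made-up names):

    #include <stdlib.h>
    #include <string.h>

    /* layout: [ size_t length ][ chars ... ][ '\0' ]
       the returned pointer aims at the first char, so legacy code that
       expects a plain NUL-terminated char* keeps working unchanged */
    char *make_counted(const char *src) {
        size_t n = strlen(src);
        char *block = malloc(sizeof(size_t) + n + 1);
        if (!block) return NULL;
        memcpy(block, &n, sizeof(size_t));
        memcpy(block + sizeof(size_t), src, n + 1);
        return block + sizeof(size_t);
    }

    /* "safe" length lookup: read the count stored just before the chars */
    size_t counted_len(const char *s) {
        size_t n;
        memcpy(&n, s - sizeof(size_t), sizeof(size_t));
        return n;
    }

The same pointer still works with strlen, printf("%s"), and friends, because the terminator is kept; the count just sits in front of the characters.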
It's all true, but then again, everything would be better if we'd start from scratch today. Compromises made to tip-toe around technology limitations are what adds complexity to most of today's software, but even tomorrow's software will be influenced by today's limitations. It's best not to dwell on the past.
Does anyone understand what the author meant by the following statement:
${lang} is the language of the future
This looks like a macro for substitution, but maybe it's some hip new term I've never encountered. Is it an actual language, or just a placeholder for a language that hasn't been chosen yet?
So to be "safe" and "secure" we can only have strings up to 255 characters long, or we need to waste a few bytes repeatedly for short strings. Sounds like the UTF-8 vs UTF-16/32 debate...
The reality is that null-terminated strings are dramatically more expensive than strings with a length counter in every regard other than memory usage, and the memory usage overhead from storing a length value is utterly minuscule compared to the actual size of the string. Even if you ignore all the secondary costs that result from the decision to use null-terminated strings, they're just poor engineering. There are far better ways to save a few bytes.
(By secondary costs I mean things like the myriad bugs caused by null-terminated strings, the severe performance penalties involved in copying and manipulating them, the unfortunate implications they have for file formats and network protocols, etc.)
My "ideal" string would be to store it as a UTF-8 rope, with the additional restriction (doesn't change the interface any) that all characters within a node in the rope have the same length. (You can use overlong encodings internally if it makes sense (One single-byte character in a bunch of longer characters), which is a microoptimization that will in some cases save a few bytes.)
I'd also treat a character + combining characters as a single character.
>that all characters within a node in the rope have the same length [...] I'd also treat a character + combining characters as a single character.
The problem with this is that Unicode doesn't restrict the number of combining marks. If your hypothetical library wants to offer full Unicode support, your "nodes of same length" idea wouldn't work.
Of course an implementation which makes an arbitrary restriction wouldn't be unusual. In fact, I'm not aware of any application that supports an arbitrary number of combining marks even if the standard allows it.
When it comes to standard conformance, UAX15-D3 [1] is probably the closest we could get. It'd require 128 bytes per character.
The length of each node is not the same; each character within a node is the same number of bytes long.
So you end up with one node in the rope that has one logical character - that is some absurd number of bytes long. (In reality there'd be a maximum of 2^8-1 or 2^16-1 or something bytes per character)
A length field really is insignificant. On a 64-bit machine, a 4-byte length field (good for 4 GB) is half the size of the pointer to the string itself! C++ STL strings already use a length specifier and I don't think anyone is complaining about performance because of it.
Length could be encoded differently, as a varint: as long as the highest bit is set, the next byte is also part of the length. Just left-shift the result so far by seven and add the 7 lower bits; as soon as the highest bit is 0, we have the final length.
The processing overhead is low, and lengths up to 127 bytes cost only a single header byte.
Not such a big issue.
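A decoder for that scheme really is only a few lines (a sketch; varint_decode/varint_encode are made-up helper names):

    #include <stddef.h>
    #include <stdint.h>

    /* decode: while the high bit is set, keep accumulating 7 bits per byte */
    size_t varint_decode(const uint8_t *p, size_t *len_out) {
        size_t len = 0, i = 0;
        for (;;) {
            len = (len << 7) | (p[i] & 0x7F);
            if ((p[i++] & 0x80) == 0) break;
        }
        *len_out = len;
        return i;                        /* header bytes consumed */
    }

    /* encode: 7 bits per byte, high bit set on every byte except the last */
    size_t varint_encode(size_t len, uint8_t *out) {
        uint8_t tmp[10];                 /* enough for a 64-bit size_t */
        size_t n = 0;
        do { tmp[n++] = len & 0x7F; len >>= 7; } while (len);
        for (size_t i = 0; i < n; i++)   /* reverse into most-significant-first order */
            out[i] = (uint8_t)(tmp[n - 1 - i] | (i + 1 < n ? 0x80 : 0));
        return n;
    }

Lengths up to 127 fit in one header byte; a 4 GB length takes five.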
The speedup would be well worth it: you could tell the memory controller to move 100 bytes from position 20000 to position 40000, instead of "give me 8 bytes from position 20000", then check every one of the bytes, then ask the memory controller again for 8 bytes from position 20008, and so on.
If I got to choose in this day and age, I would define a C string/array as a pointer to a long int, with our data immediately after the long int.