Yeah, if we had only used strings prefixed with 2-byte length fields, everybody would have been happy, because a 64 KB string is enough for everyone. (And let's be realistic, nobody sane would have chosen a 4-byte string length back in the early 70s.)
So, if we had gone down that path, what would we have? All the fun of "legacy" APIs that seem to work but internally only accept strings up to 64 KB and mysteriously chop off excess bytes when you least expect it. It's the Y2K problem all over again.
And just when you finally think you're done with it, memory is cheaper again, size_t is 64-bit, and someone invariably wants to store a binary blob >4 GB as a string. Fun times again.
Have we forgotten how much trouble we went through in the 90s to handle memory on the x86 "640k is enough for everybody" architecture?
This is similar to the "kill Hitler" time travel joke [1]: it's easy to say things would be better, but all we know for sure is that they would be different. Instead of `char` we'd have strings that were `struct`, and we'd STILL have a ton of different string formats because of lengths and character formats. (Bonus problem: are the lengths in bytes, or in chars?)
That's exactly equivalent to having a "size" parameter with the same size as the pointer, except you have to use a subtract instruction when you want to get the length of the string, so I'd say it's inferior to just storing the length of the string.
For instance, if you copy a string you also have to update the end pointer instead of just copying the size attribute in bulk. And you get the same disadvantages of non-portable strings: different representations depending on the architecture/endianness, etc.
I completely agree with the OP, there's no perfect solution. If addr + len was truly superior I'm sure we'd see
struct string { long len; char s[]; };
or for your version
struct string { char *endptr; char s[]; };
everywhere. And the C standard library would have evolved along with it.
Off the top of my head, the only thing that makes '\0'-terminated strings special in C is that it's the way string literals are represented. It would be trivial to recode all of string.h using addr + len instead of NUL termination.
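For illustration, a minimal sketch of what a couple of addr + len replacements for string.h routines could look like (the struct and function names are made up, not from any real library):

    #include <stddef.h>
    #include <string.h>

    struct strview { const char *s; size_t len; };

    /* length is a field read, O(1), instead of a scan for '\0' */
    size_t sv_len(struct strview v) { return v.len; }

    /* equality: compare the lengths first, then the bytes */
    int sv_eq(struct strview a, struct strview b) {
        return a.len == b.len && memcmp(a.s, b.s, a.len) == 0;
    }

    /* string literals still work: wrap them once at the boundary */
    struct strview sv_from_cstr(const char *p) {
        struct strview v = { p, strlen(p) };
        return v;
    }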
> That's exactly equivalent to having a "size" parameter with the same size as the pointer, except you have to use a subtract instruction when you want to get the length of the string, so I'd say it's inferior to just storing the length of the string.
Except that it has the important property that the (effective) length descriptor, being a pointer, would necessarily "grow" over time (across generations of machines, e.g. 16 bit -> 32 bit, etc.) and would thus never impose any artificial restrictions on string length.
>Off the top of my head, the only thing that makes '\0'-terminated strings special in C is that it's the way string literals are represented. It would be trivial to recode all of string.h using addr + len instead of NUL termination.
exactly, anywhere you would have
if (str[i] == '\0')
you replace with
if (str.s + i == str.end)
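As a sketch (the field names just mirror the s/end above; the struct itself is hypothetical):

    #include <stdio.h>

    struct span { const char *s; const char *end; };  /* hypothetical */

    /* NUL-terminated: stop at the terminator */
    void print_cstr(const char *str) {
        for (size_t i = 0; str[i] != '\0'; i++)
            putchar(str[i]);
    }

    /* end pointer: stop when the cursor reaches end */
    void print_span(struct span str) {
        for (const char *p = str.s; p != str.end; p++)
            putchar(*p);
    }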
And, of course, I'm sure it's completely safe to treat something as a pointer but explicitly set it to memory you don't own (i.e. just past the end of the string). I guess we could point to the last byte and doom everyone to inevitable off-by-one errors in string length computation.
It should have made it easier, actually, because with pointers you are usually just doing addition and subtraction, and everything should be independent of the actual value of the pointer.
In the same week that str+len was being abused left and right, someone surfaces an article to the front page about how str+NUL is wrong and everyone should use str+len.
NUL terminated strings were the right decision for C. They’re certainly much simpler than length fields.
Consider using a length field. How big should that field be? If it's fixed size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.
Next, what endianness should that field have? NUL-terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.
Also, consider C-style string parsing, e.g. strtok/strsep. These could not be implemented with length-field strings.
Explicit length is better when you have an enforced abstraction, like std::string, but at that point you’re not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.
So what was the “one-byte mistake?” The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that “mistake,” we would be unable to make a string longer than 65k - a far more serious limitation than anything NUL termination imposes!
No one doubts that there were advantages to NUL-terminated strings, but against them you have to weigh the many thousands of security holes that were thereby created.
UTF-8 got it right with having a variable-length byte representation of numerical values. Seven-bit values unaffected. Longer values use more bytes as necessary.
The C approach takes a whole different philosophy. You want a "string"? NUL terminated. Simple. You want a buffer? Do it yourself.
Instead of a length field, use a pointer to the last character. The length is the difference of these two pointers. The maximum string length is the size of your address space. Problem solved.
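In struct form (a sketch; the type is made up, and note the +1 since the pointer is to the last character rather than one past it):

    #include <stddef.h>

    struct lcstr { const char *first; const char *last; };  /* hypothetical */

    size_t lcstr_len(struct lcstr s) {
        return (size_t)(s.last - s.first) + 1;   /* inclusive range, hence the +1 */
    }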
Hypothetically, C could use the first `sizeof (size_t)` bytes to store the length. Endianness shouldn't be much of an issue; just use the endianness of the current machine. You don't typically write the NUL terminator when writing a string to a file; similarly, you wouldn't typically write the length field when writing a counted string to a file.
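A rough sketch of that layout (cs_new/cs_len/cs_data are made-up names, purely for illustration):

    #include <stdlib.h>
    #include <string.h>

    /* layout: [ size_t length ][ character data ... ], no NUL needed */
    typedef char *cstr;

    cstr cs_new(const char *src, size_t n) {
        char *block = malloc(sizeof(size_t) + n);
        if (!block) return NULL;
        memcpy(block, &n, sizeof(size_t));        /* native endianness, as above */
        memcpy(block + sizeof(size_t), src, n);
        return block;
    }

    size_t cs_len(cstr s) {
        size_t n;
        memcpy(&n, s, sizeof(size_t));
        return n;
    }

    char *cs_data(cstr s) { return s + sizeof(size_t); }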
I agree that NUL-terminated strings were the right decision at the time (and aren't that bad now), but there are sane ways to do counted strings.
You piqued my interest, and it looks like storing NUL-terminated strings in files is actually common. tar, ELF, JPEG, gzip all use them. When the serialization format matches the in-memory format, you can mmap directly into a struct, which is fast and convenient.
A lot of programming decisions were made to save a byte here and there. It's easy to point at them today and say they're "bad", but at the time they were the absolutely correct thing to do. It's hard to imagine now but not saving that byte could mean your program wouldn't fit into RAM. Try telling your management in the 1960s that your program won't load because it's "properly coded" and see how far you get.
What we've failed to do is ever revisit those decisions and change them where we've identified problems. Yes, you can probably compile (with warnings) files from UNIX v7, but we pay for that compatibility. And there's no question that designing, building, and maintaining a libc alternative is a colossal undertaking, not likely to happen on a whim. So here we are.
Well, strings without an explicit length field allow for things like strstr(3) or prefix parsing without performance penalties due to reallocating memory.
Blatantly incorrect. Try thinking over how you might implement those two operations on a string with a length field and see if you can figure out why your statement is wrong.
(referring to an allocation-free strstr if strings had an associated length)
If a string was a (pointer,length) pair, then strstr would return a pair (pointer+offset,length-offset) where, just like in the original strstr, the new pointer points into the existing string. No allocation needed
(But what about the pointer-length pair that has to be created? Well, it will live on the stack just like the return value from the original strstr, which has to be stored somewhere also.)
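A sketch of that (a naive scan, just to make the no-allocation point concrete; the strview type and sv_strstr name are made up):

    #include <stddef.h>
    #include <string.h>

    struct strview { const char *s; size_t len; };

    /* returns a view into haystack starting at the first match of needle;
       a NULL/0 view means "not found"; no allocation anywhere */
    struct strview sv_strstr(struct strview haystack, struct strview needle) {
        struct strview miss = { NULL, 0 };
        if (needle.len > haystack.len)
            return miss;
        for (size_t i = 0; i + needle.len <= haystack.len; i++) {
            if (memcmp(haystack.s + i, needle.s, needle.len) == 0) {
                struct strview hit = { haystack.s + i, haystack.len - i };
                return hit;   /* (pointer+offset, length-offset), pointing into the original */
            }
        }
        return miss;
    }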
Except that's not how K&R would have done it. It would have been Pascal style: one byte of length directly followed by the string itself in memory. The pointer would be the address of the length byte. Now it's hard to create strstr without copying, so they wouldn't have. Instead you would have a new API where every call takes offsets from the beginning, and likely it would be one-based too. Anyway, C does it the way it does because the PDP set the zero condition flag on move.
When someone makes a post demonstrating a complete failure to consider an argument before making it (claiming that a substring search requires allocation of any kind when all it does is scan and return an offset...) I don't think it's necessary for me to explain in elaborate detail why they're wrong.
I don't need to explain how I'd implement it because every obvious implementation of the strstr algorithm doesn't need to allocate because it's a substring search! Search operations are pure and don't mutate!
Every single implementation of strstr or equivalent in every single programming language I have ever seen does not require an allocation. It either returns a pointer to the location of the first matched result, or returns the character index of the first match. That's it.
Any other absurd implementation you can think up to justify null-terminated strings - like returning a copy of the string with a null terminator after the match - just doesn't make any sense. That's a different operation.
int strstr(string haystack, string needle, int offset);
DESCRIPTION
The strstr() function finds the first occurrence of the substring needle
in the string haystack, starting from position offset in haystack.
RETURN VALUE
This function returns the index of the first character in haystack, at or
after offset, where needle can be found, or -1 if the substring is not
found (or if offset is out of bounds for haystack).
Same complexity and memory overhead as C's strstr(). Actually more efficient when needle is longer than what remains of haystack after offset, which can be checked in O(1) on entry.
(I don't have a problem with NUL-terminated strings, but couldn't resist a little sideways thinking on how an alternate library implementation for strings might work.)
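Just to make that concrete, a minimal sketch of an implementation over a (pointer, length) string type (the struct, the str_find name, and the long return type are all hypothetical; it's renamed to avoid colliding with the real strstr):

    #include <stddef.h>
    #include <string.h>

    struct str { const char *s; size_t len; };

    /* index of the first occurrence of needle at or after offset,
       or -1 if not found or offset is out of bounds */
    long str_find(struct str haystack, struct str needle, size_t offset) {
        if (offset > haystack.len)
            return -1;
        if (needle.len > haystack.len - offset)   /* the O(1) early exit */
            return -1;
        for (size_t i = offset; i + needle.len <= haystack.len; i++) {
            if (memcmp(haystack.s + i, needle.s, needle.len) == 0)
                return (long)i;
        }
        return -1;
    }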
That's a far less useful function, since you still have to do a memory copy if you want to pass the return value to a function that operates on strings.
Only if the first occurrence of needle happens to terminate the string. Otherwise the return value of the real version of strstr returns a string that starts with needle but is followed by more text.
OK, so for that one particular use case[0], the result of strstr might not be quite as convenient as the NUL-terminated strstr() case. But that's what you get with a completely different string implementation - some operations are more efficient than others, others less so. It's a bunch of trade-offs.
But strstr() finds the position of a substring within a larger string, or indicates that it doesn't exist. My suggested API for the equivalent functionality does exactly that, with no problems or performance issues for the job strstr() itself needs to do that I can see.
[0] A use case I have to say I find a little contrived. I can say that the majority of the times I've used strstr(), it's either to simply check that needle exists in haystack, or to take the part of the haystack following needle - at which point I normally need to make a copy because I can't guarantee that the source string will be around for the lifetime that I need the remainder for.
Different experiences, I guess. Your "contrived use case" is my "only reason I've ever used strstr". That said, I haven't done a ton of C programming, so maybe my experience is atypical.
When I was at PARC the Mesa guys (who had counted strings) did some analysis and (at least in those days) the counted strings ended up being, in aggregate, faster. I suspect the advantage would be even greater these days since memory allocation was a bigger deal back then.
I wonder if you could do this compatibly in the compiler by adding another primitive type (counted string) which had the length in the bytes before the start of the null-terminated string. You'd need a new type because various routines in the standard library would have to invisibly have two versions for counted and non-counted strings (since if you incremented a string pointer, or used a function like strchr, you'd have to treat it as a regular char). "Safe" code would use a different call (say, cstrchr) that returned an index instead of a char. The compiler could optionally warn on unsafe "legacy" calls as it can with strcpy instead of strncpy.
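Roughly the layout being described, hand-rolled (the compiler/type support is the hypothetical part; make_counted and counted_len are made-up names):

    #include <stdlib.h>
    #include <string.h>

    /* layout: [ size_t length ][ chars ... ][ '\0' ]
       the returned pointer aims at the first char, so legacy code that
       expects a plain NUL-terminated char* keeps working unchanged */
    char *make_counted(const char *src) {
        size_t n = strlen(src);
        char *block = malloc(sizeof(size_t) + n + 1);
        if (!block) return NULL;
        memcpy(block, &n, sizeof(size_t));
        memcpy(block + sizeof(size_t), src, n + 1);
        return block + sizeof(size_t);
    }

    /* "safe" length lookup: read the count stored just before the chars */
    size_t counted_len(const char *s) {
        size_t n;
        memcpy(&n, s - sizeof(size_t), sizeof(size_t));
        return n;
    }

The same pointer still works with strlen, printf("%s"), and friends, because the terminator is kept; the count just sits in front of the characters.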
It's all true, but then again, everything would be better if we'd start from scratch today. Compromises made to tip-toe around technology limitations are what adds complexity to most of today's software, but even tomorrow's software will be influenced by today's limitations. It's best not to dwell on the past.
Does anyone understand what the author meant by the following statement:
${lang} is the language of the future
This looks like a macro for substitution, but maybe it's some hip new term I've never encountered. Is it an actual language, or just a placeholder for a language that hasn't been chosen yet?
So to be "safe" and "secure" we can only have strings up to 255 characters long, or we need to waste a few bytes repeatedly for short strings. Sounds like the UTF-8 vs UTF-16/32 debate...
The reality is that null-terminated strings are dramatically more expensive than strings with a length counter in every regard other than memory usage, and the memory usage overhead from storing a length value is utterly minuscule compared to the actual size of the string. Even if you ignore all the secondary costs that result from the decision to use null-terminated strings, they're just poor engineering. There are far better ways to save a few bytes.
(By secondary costs I mean things like the myriad bugs caused by null-terminated strings, the severe performance penalties involved in copying and manipulating them, the unfortunate implications they have for file formats and network protocols, etc.)
My "ideal" string would be to store it as a UTF-8 rope, with the additional restriction (doesn't change the interface any) that all characters within a node in the rope have the same length. (You can use overlong encodings internally if it makes sense (One single-byte character in a bunch of longer characters), which is a microoptimization that will in some cases save a few bytes.)
I'd also treat a character + combining characters as a single character.
>that all characters within a node in the rope have the same length [...] I'd also treat a character + combining characters as a single character.
The problem with this is that Unicode doesn't restrict the number of combining marks. If your hypothetical library wants to offer full Unicode support, your "nodes of same length" idea wouldn't work.
Of course an implementation which makes an arbitrary restriction wouldn't be unusual. In fact, I'm not aware of any application that supports an arbitrary number of combining marks even if the standard allows it.
When it comes to standard conformance, UAX15-D3 [1] is probably the closest we could get. It'd require 128 bytes per character.
The length of each node is not the same; each character within a node is the same number of bytes long.
So you end up with one node in the rope that has one logical character - that is some absurd number of bytes long. (In reality there'd be a maximum of 2^8-1 or 2^16-1 or something bytes per character)
A length field really is insignificant. On a 64-bit machine, a 4-byte length field (good for 4 GB) is half the size of the pointer to the string itself! C++ STL strings already use a length specifier and I don't think anyone is complaining about performance because of it.
Length could be encoded differently, as a varint: as long as the highest bit is set, the next byte is also part of the length. Just left-shift the result so far by seven and add the 7 lower bits; as soon as the highest bit is 0, we have the final length.
The processing overhead is low, and lengths up to 127 bytes cost only a single header byte.
Not such a big issue.
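A decoder for that scheme really is only a few lines (a sketch; varint_decode/varint_encode are made-up helper names):

    #include <stddef.h>
    #include <stdint.h>

    /* decode: while the high bit is set, keep accumulating 7 bits per byte */
    size_t varint_decode(const uint8_t *p, size_t *len_out) {
        size_t len = 0, i = 0;
        for (;;) {
            len = (len << 7) | (p[i] & 0x7F);
            if ((p[i++] & 0x80) == 0) break;
        }
        *len_out = len;
        return i;                        /* header bytes consumed */
    }

    /* encode: 7 bits per byte, high bit set on every byte except the last */
    size_t varint_encode(size_t len, uint8_t *out) {
        uint8_t tmp[10];                 /* enough for a 64-bit size_t */
        size_t n = 0;
        do { tmp[n++] = len & 0x7F; len >>= 7; } while (len);
        for (size_t i = 0; i < n; i++)   /* reverse into most-significant-first order */
            out[i] = (uint8_t)(tmp[n - 1 - i] | (i + 1 < n ? 0x80 : 0));
        return n;
    }

Lengths up to 127 fit in one header byte; a 4 GB length takes five.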
The speedup would be well worth it: you could tell the memory controller to move 100 bytes from position 20000 to position 40000, instead of "give me 8 bytes from position 20000", then check every one of the bytes, then ask the memory controller again for 8 bytes from position 20008, and so on.
If I got to choose in this day and age, I would define a C string/array as a pointer to a long int, with our data immediately after the long int.