By what means? The only ways I would expect are either unconditionally duplicating the array or taking a mutex, and I don't think the mutex would be simple. Adding a sync block on the input could be done, but that assumes nothing else locks on it (and if something does, that probably points to a race condition elsewhere...).
Unconditionally duplicating the array would use more memory and wouldn’t be faster.
Inlining `StringUTF16.compress` and `StringUTF16.toBytes`, what the code currently does is essentially this (using Python as pseudocode):
```python
LATIN1, UTF16 = 0, 1  # coder constants, as in the JDK

# Byte-order dependent in the JDK; these are the little-endian values.
HI_BYTE_SHIFT, LO_BYTE_SHIFT = 0, 8

def compress_racy(input):
    latin1 = bytearray(len(input))
    for i in range(len(input)):
        c = input[i]
        if c > 255:
            latin1 = None
            break
        latin1[i] = c
    if latin1 is not None:
        return LATIN1, latin1
    # Bailout path: re-reads input from the start.
    utf16 = bytearray(len(input) * 2)
    for i in range(len(input)):
        c = input[i]
        utf16[2*i]   = (c >> HI_BYTE_SHIFT) & 0xFF  # & 0xFF = Java's (byte) cast
        utf16[2*i+1] = (c >> LO_BYTE_SHIFT) & 0xFF
    return UTF16, utf16
```
The issue occurs because `input` can be mutated between the moment it finds a char that's above 255 in the first loop and the moment it visits that same char in the second loop.
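To make that race concrete, here's a deterministic sketch (the `FlippingInput` class and its values are mine, purely hypothetical): it stands in for a concurrent writer by returning a different value on the second read of index 0, and `compress_racy` is the pseudocode above.

```python
class FlippingInput:
    """Simulates a concurrent writer: index 0 reads as 0x100 the first
    time (forcing the latin1 loop to bail) and as 0x41 afterwards."""
    def __init__(self, chars):
        self.chars = list(chars)
        self.first_read = True

    def __len__(self):
        return len(self.chars)

    def __getitem__(self, i):
        if i == 0 and self.first_read:
            self.first_read = False
            return 0x100  # above 255 on the first read only
        return self.chars[i]

coder, data = compress_racy(FlippingInput([0x41, 0x42]))
# coder is UTF16, yet data is b"\x41\x00\x42\x00": every char fits in
# latin1, which is exactly the broken invariant being discussed.
```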
The solution is not to do that, but instead to do something like:
```python
def compress_fixed(input):
    latin1 = bytearray(len(input))
    for i in range(len(input)):
        c = input[i]
        if c <= 255:
            latin1[i] = c
            continue
        # Bailout: found a char above 255, widen what we have so far.
        utf16 = bytearray(len(input) * 2)
        for j in range(i):
            utf16[2*j] = latin1[j]  # low byte; high byte is already 0
        latin1 = None
        # Append the exact char we just tested, without re-reading input[i].
        utf16[2*i]   = c & 0xFF
        utf16[2*i+1] = (c >> 8) & 0xFF
        for j in range(i + 1, len(input)):
            c = input[j]
            utf16[2*j]   = c & 0xFF
            utf16[2*j+1] = (c >> 8) & 0xFF
        return UTF16, utf16
    return LATIN1, latin1
```
This means that if you find a char above 255, you always append that exact char to the UTF16 array; there's no possibility of someone changing it under you, because you append the same char you tested. So you cannot get into the situation the essay describes: a utf16 string will always contain at least one non-latin1 code unit.
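A quick check of that claim, reusing the hypothetical `FlippingInput` from above against the fixed version:

```python
coder, data = compress_fixed(FlippingInput([0x41, 0x42]))
# The char that was actually tested (0x100) is what gets appended, so the
# UTF16 result really does contain a non-latin1 code unit.
assert coder == UTF16
assert data[0] | (data[1] << 8) == 0x100
```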
Non-vectorised performance should improve, since latin1 to utf16 is a trivial operation (just copy every byte of the input to every other byte of the output).
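In Python terms that widening is a single strided copy; a tiny self-contained sketch of the idea:

```python
latin1 = bytearray(b"abc")
i = len(latin1)
utf16 = bytearray(2 * i)  # odd (high) byte slots stay zero
utf16[0:2*i:2] = latin1   # every input byte to every other output byte
assert utf16 == bytearray(b"a\x00b\x00c\x00")
```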
Though if you vectorised the char to utf16 conversion, you'd now need to vectorise two loops on bailout (latin1 -> utf16 up to i, then char -> utf16), which is probably less efficient. I don't know whether the JDK has vectorised optimisations here; the source has `@HotSpotIntrinsicCandidate` annotations, but I don't know how far the intrinsics go.