aerospike/aerospike-client-java

Fully support surrogates symbol in Strings

vladislav-sidorovich opened this issue · 5 comments

    @Test
    void evilString() throws Exception {
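        // The emoji is a supplementary character, stored as a surrogate pair at indices 38 and 39.
        String messageText = "Hey Aerospike! Let's store the string 🙂.";
        // Cutting at index 39 keeps the high surrogate but drops the low one.
        String evilString = messageText.substring(0, 39) + "...";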

        Key key = new Key("TEST", "map-strings", "test-key");

        String binName = "map-test";

        Map<Value,Value> inputMap = new HashMap<Value,Value>();
        inputMap.put(Value.get("text"), Value.get(evilString));
        inputMap.put(Value.get("type"), Value.get("missing data"));

        // Write values to empty map.
        aerospikeClient.operate(new WritePolicy(), key,
                MapOperation.putItems(MapPolicy.Default, binName, inputMap)
        );

        Record record = aerospikeClient.get(new Policy(), key);
        Map<?, ?> storedMap = record.getMap(binName);

        Assert.assertEquals("missing data", storedMap.get("type"));
        Assert.assertEquals(evilString, new String((byte[]) storedMap.get("text")));
    }

The root cause of the issue:

else if (Character.isHighSurrogate(ch)) {

The effect is here: https://github.com/aerospike/aerospike-client-java/blob/8251d673a6ec573e662541cd6f045241db164467/client/src/com/aerospike/client/util/Packer.java#L403,L410

  1. int size = Buffer.estimateSizeUtf8(val) + 1; yields some value X
  2. the value X is packed into the buffer
  3. offset += Buffer.stringToUtf8(val, buffer, offset); returns some value Y
  4. offset is advanced by Y
  5. X <> Y => the declared size disagrees with the bytes actually written, so neighbouring data overlaps and the buffer is corrupted
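
The mismatch in the steps above can be sketched without the client. The estimator below is a simplified, hypothetical stand-in (it is not the code in Buffer.estimateSizeUtf8): it assumes every high surrogate is followed by a low surrogate. The JDK encoder stands in for the write step. The names SizeMismatchDemo, naiveEstimate and bytesWritten are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class SizeMismatchDemo {
    // Hypothetical estimator, NOT the client's code: it counts every high
    // surrogate as the start of a 4-byte sequence and skips the next char.
    static int naiveEstimate(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                count += 1;
            } else if (ch < 0x800) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;
                i++; // skips the next char even if it is not a low surrogate
            } else {
                count += 3;
            }
        }
        return count;
    }

    // Stand-in for the write step: the number of bytes the JDK encoder produces.
    static int bytesWritten(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String evil = "abc\uD83D..."; // high surrogate with no low surrogate
        System.out.println(naiveEstimate(evil)); // 9  (the "X" of step 1)
        System.out.println(bytesWritten(evil));  // 7  (the "Y" of step 3)
    }
}
```

For a well-formed string the two numbers agree; they diverge only when a surrogate is unpaired, which is the case the test at the top of this issue constructs.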

Reference implementation: https://github.com/openjdk/jdk11u-dev/blob/c1411113b396f468963a1deacc3b57ed366e735a/src/java.base/share/classes/java/lang/StringCoding.java#L924-L950
or java.lang.String#encodeUTF8_UTF16 in Amazon Corretto 18

Notes:
What are surrogates? https://unicode.org/faq/utf_bom.html#utf16-2
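
As a quick illustration of the linked FAQ, the sketch below uses only JDK calls to show how one code point outside the Basic Multilingual Plane occupies two Java chars, and how substring can split them. The code point U+1F642 and the class name SurrogateBasics are arbitrary choices for the example.

```java
class SurrogateBasics {
    // Renders each UTF-16 code unit of the string as four hex digits.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F642)); // one code point
        System.out.println(emoji.length());                          // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
        System.out.println(hex(emoji));                              // D83D DE42
        System.out.println(hex(emoji.substring(0, 1)));              // D83D
    }
}
```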

This will be fixed in the next client release.

The code sample creates a malformed string. I think the best solution is to detect and throw an exception in estimateSizeUtf8() when the string is malformed.
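
A check along these lines could look roughly like the sketch below. It is only an illustration: the helper names firstUnpairedSurrogate and requireWellFormed and the use of IllegalArgumentException are assumptions, and the client's actual change may differ.

```java
class Utf16Check {
    // Hypothetical helper, not part of the Aerospike client API.
    // Returns the index of the first unpaired surrogate, or -1 if none exists.
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(ch)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    // Rejects malformed input before any size is computed or bytes are written.
    static void requireWellFormed(String s) {
        int idx = firstUnpairedSurrogate(s);
        if (idx >= 0) {
            throw new IllegalArgumentException("Unpaired surrogate at index " + idx);
        }
    }
}
```

Calling requireWellFormed on the evilString from the test above would throw, because the char at index 38 is a high surrogate followed by ".".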

From my point of view, an exception will be better than a corrupted document.
At the same time, java.lang.String doesn't throw an exception. Also, I can send and receive such a string via REST (HTTP).

So I can send and receive such strings, and I can process them in my service's code, but I can't store them in long-term storage (Aerospike). That is a bit confusing, isn't it?

If aerospike-client could process such strings, that would be the best option for me.

Java's getBytes(StandardCharsets.UTF_8) replaces the unpaired surrogate in a malformed string with a "?" when converting to UTF-8. When the UTF-8 bytes are converted back into a string, the result no longer matches the original string. This will cause problems for applications that test these strings for equality. In the interest of safety, the client will throw an exception when estimateSizeUtf8() encounters a malformed string.
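
The round-trip mismatch described above can be reproduced with the JDK alone. This is a minimal sketch; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes to UTF-8 and decodes back, as a store-then-read cycle would.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wellFormed = "ok \uD83D\uDE42"; // complete surrogate pair
        String malformed = "ok \uD83D";        // high surrogate with no low surrogate

        System.out.println(roundTrip(wellFormed).equals(wellFormed)); // true
        System.out.println(roundTrip(malformed).equals(malformed));   // false
        System.out.println(roundTrip(malformed));                     // ok ?
    }
}
```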