Fully support surrogate symbols in Strings
vladislav-sidorovich opened this issue · 5 comments
@Test
void evilString() throws Exception {
    String messageText = "Hey Aerospike! Let's store the string 😀.";
    // substring(0, 39) cuts between the two UTF-16 code units of the emoji,
    // leaving an unpaired high surrogate at the end of the truncated text.
    String evilString = messageText.substring(0, 39) + "...";
    Key key = new Key("TEST", "map-strings", "test-key");
    String binName = "map-test";
    Map<Value,Value> inputMap = new HashMap<Value,Value>();
    inputMap.put(Value.get("text"), Value.get(evilString));
    inputMap.put(Value.get("type"), Value.get("missing data"));
    // Write values to empty map.
    aerospikeClient.operate(new WritePolicy(), key,
        MapOperation.putItems(MapPolicy.Default, binName, inputMap)
    );
    Record record = aerospikeClient.get(new Policy(), key);
    Map<?, ?> storedMap = record.getMap(binName);
    Assert.assertEquals("missing data", storedMap.get("type"));
    Assert.assertEquals(evilString, new String((byte[]) storedMap.get("text")));
}
The root cause of the issue:
The effect is here: https://github.com/aerospike/aerospike-client-java/blob/8251d673a6ec573e662541cd6f045241db164467/client/src/com/aerospike/client/util/Packer.java#L403,L410
- int size = Buffer.estimateSizeUtf8(val) + 1; yields some value X
- the value X is packed into the buffer as the size of the string
- offset += Buffer.stringToUtf8(val, buffer, offset); returns some value Y
- the offset is advanced by Y
- X != Y => the data in the buffer is corrupted because the written regions overlap
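The mismatch described in the steps above can be reproduced outside the client. The following is a minimal sketch, not the client's actual Buffer code: estimateSize is a hypothetical estimator that assumes every high surrogate starts a valid pair, and the JDK's own UTF-8 encoder stands in for the write step. For a well-formed string the two sizes agree; for a string with an unpaired high surrogate they diverge.

```java
import java.nio.charset.StandardCharsets;

class SizeMismatchDemo {
    // Hypothetical estimator (an assumption for illustration, not the client's code):
    // it treats every high surrogate as the start of a valid 4-byte pair.
    static int estimateSize(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                count += 1;
            } else if (ch < 0x800) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;
                i++; // skips the next char, assuming it is the low surrogate
            } else {
                count += 3;
            }
        }
        return count;
    }

    // Number of bytes the JDK encoder actually produces.
    static int encodedSize(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String wellFormed = "ok \uD83D\uDE00"; // complete surrogate pair
        String malformed = "ok \uD83D...";     // unpaired high surrogate
        System.out.println(estimateSize(wellFormed) + " vs " + encodedSize(wellFormed)); // 7 vs 7
        System.out.println(estimateSize(malformed) + " vs " + encodedSize(malformed));   // 9 vs 7
    }
}
```

Any code that writes a size header from one routine and the payload from another depends on both routines agreeing for every input, including malformed ones.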
Reference implementation: https://github.com/openjdk/jdk11u-dev/blob/c1411113b396f468963a1deacc3b57ed366e735a/src/java.base/share/classes/java/lang/StringCoding.java#L924-L950
or java.lang.String#encodeUTF8_UTF16 in Amazon Corretto 18
Notes:
What are surrogates? https://unicode.org/faq/utf_bom.html#utf16-2
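As a short illustration of the note above: in Java, a character outside the Basic Multilingual Plane occupies two char values (a high and a low surrogate), so substring can split it. The helper names below (endsWithLoneHighSurrogate, safeTruncate) are hypothetical and only sketch one way to avoid creating such a string in the first place.

```java
class SurrogateBasics {
    // True if the last char is a high surrogate with nothing after it.
    static boolean endsWithLoneHighSurrogate(String s) {
        return !s.isEmpty() && Character.isHighSurrogate(s.charAt(s.length() - 1));
    }

    // Truncates to at most maxChars UTF-16 code units without splitting a pair.
    // Assumes maxChars is non-negative.
    static String safeTruncate(String s, int maxChars) {
        if (s.length() <= maxChars) {
            return s;
        }
        int end = maxChars;
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // back off so the pair is dropped as a whole
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, one code point, two chars
        System.out.println(emoji.length());                          // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
        System.out.println(endsWithLoneHighSurrogate(emoji.substring(0, 1))); // true
        System.out.println(safeTruncate("ab" + emoji + "c", 3).length());     // 2
    }
}
```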
This will be fixed in the next client release.
The code sample creates a malformed string. I think the best solution is to detect this in estimateSizeUtf8() and throw an exception when the string is malformed.
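Detecting a malformed string amounts to checking that every high surrogate is immediately followed by a low surrogate and that no low surrogate appears on its own. The sketch below shows that check in isolation; requireWellFormed and the choice of IllegalArgumentException are assumptions for illustration, and the client's actual method and exception type may differ.

```java
class Utf16Validator {
    // Hypothetical helper: throws if the string contains an unpaired surrogate.
    static void requireWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    throw new IllegalArgumentException(
                            "Unpaired high surrogate at index " + i);
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(ch)) {
                throw new IllegalArgumentException(
                        "Unpaired low surrogate at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        requireWellFormed("ok \uD83D\uDE00"); // passes silently
        try {
            requireWellFormed("ok \uD83D...");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```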
From my point of view, an exception is better than a corrupted document.
At the same time, java.lang.String doesn't throw an exception. Also, I can send and receive such a string via REST (HTTP).
So I can send and receive such strings and process them in my service code, but I can't store them in long-term storage (Aerospike). That is a bit confusing, isn't it?
If aerospike-client could process such strings, that would be the best option for me.
Java's getBytes(StandardCharsets.UTF_8) modifies malformed strings by writing a "?" in place of each unpaired surrogate when converting to UTF-8. When the UTF-8 bytes are converted back into a string, the result no longer matches the original string. This will cause problems for applications that test these strings for equality. In the interest of safety, the client will throw an exception when malformed strings are encountered in estimateSizeUtf8().
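The round-trip mismatch is easy to observe with the JDK alone. This minimal sketch uses only standard library calls; roundTrip is a hypothetical helper name.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes to UTF-8 and decodes back, as a store-then-read cycle would.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String malformed = "abc\uD83D..."; // unpaired high surrogate
        String back = roundTrip(malformed);
        System.out.println(back);                   // abc?...
        System.out.println(malformed.equals(back)); // false
    }
}
```

A well-formed string survives the same round trip unchanged, which is why only malformed input needs to be rejected.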
Java client 6.1.3 is released: https://download.aerospike.com/download/client/java/notes.html