Don't drop non-utf8 file paths

Question

fezzzza opened this issue 6 years ago · 2 comments

I can get mirror working fine as a client without watchman, but with watchman installed I get the error shown in the log below.
I am running Linux Mint 19 (roughly Ubuntu 18.04 Bionic).
The result is the same whether I run it as a normal user or as root, and with any of openjdk-8/9/10/11-jre.
A quick Google search suggests it is related to character encodings. It may help to mention that I am in the UK and most of my system defaults to UTF-8, but it may be related to some form of internationalisation.
With reference to your notes about WatchService, I notice that JDK-8145981 is now fixed. Is WatchService still considered buggy in the latest release, and is watchman still recommended/required for stability?

$ mirror client -h localhost -l /var/www/html -r /var/www/html
2018-10-28 16:15:39 INFO Connected, starting session, version unspecified
2018-10-28 16:15:41 INFO Watchman root is /var/www/html
2018-10-28 16:15:41 ERROR Exception starting the client
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:816)
at com.facebook.buck.bser.BserDeserializer.deserializeString(BserDeserializer.java:236)
at com.facebook.buck.bser.BserDeserializer.deserializeRecursiveWithType(BserDeserializer.java:332)
at com.facebook.buck.bser.BserDeserializer.deserializeTemplate(BserDeserializer.java:302)
at com.facebook.buck.bser.BserDeserializer.deserializeRecursiveWithType(BserDeserializer.java:338)
at com.facebook.buck.bser.BserDeserializer.deserializeRecursive(BserDeserializer.java:313)
at com.facebook.buck.bser.BserDeserializer.deserializeObject(BserDeserializer.java:276)
at com.facebook.buck.bser.BserDeserializer.deserializeRecursiveWithType(BserDeserializer.java:336)
at com.facebook.buck.bser.BserDeserializer.deserializeRecursive(BserDeserializer.java:313)
at com.facebook.buck.bser.BserDeserializer.deserializeBserValue(BserDeserializer.java:113)
at mirror.watchman.WatchmanChannelImpl.read(WatchmanChannelImpl.java:93)
at mirror.watchman.WatchmanChannelImpl.query(WatchmanChannelImpl.java:87)
at mirror.watchman.WatchmanFileWatcher.startWatchAndInitialFind(WatchmanFileWatcher.java:197)
at mirror.watchman.WatchmanFileWatcher.performInitialScan(WatchmanFileWatcher.java:140)
at mirror.MirrorSession.calcInitialState(MirrorSession.java:78)
at mirror.MirrorClient.startSession(MirrorClient.java:88)
at mirror.MirrorClient.access$300(MirrorClient.java:27)
at mirror.MirrorClient$SessionStarter.runOneLoop(MirrorClient.java:198)
at mirror.tasks.ThreadBasedTask.run(ThreadBasedTask.java:62)
at mirror.tasks.ThreadBasedTask.lambda$new$0(ThreadBasedTask.java:39)
at java.lang.Thread.run(Thread.java:748)
2018-10-28 16:15:41 INFO Stopping session

Answer 1 · 2018-10-28T20:46:45.000Z

Oh, yes, this is from getting non-UTF8 paths. I ran into this myself but hadn't released the "fix". If you bump to 1.2.1, which I just pushed, it should not blow up.

The disclaimer is that I wasn't sure how to fix it, so for now when watchman says "um, this file path can't be decoded as utf-8", mirror just skips it and does not sync that path.
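To make the "skip it" behaviour concrete, here is a minimal sketch of strict UTF-8 decoding that reports failure instead of throwing. This is not mirror's actual code: the class `StrictPathDecoder` and the method `decodeUtf8` are hypothetical names used only for illustration, and it assumes the raw path bytes are already available as a `byte[]`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration; not part of mirror or buck.
final class StrictPathDecoder {

  /** Decodes raw path bytes as UTF-8, or returns empty if they are not valid UTF-8. */
  static Optional<String> decodeUtf8(byte[] rawPath) {
    try {
      String path = StandardCharsets.UTF_8
          .newDecoder()
          // REPORT makes the decoder throw on bad input instead of substituting characters.
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(rawPath))
          .toString();
      return Optional.of(path);
    } catch (CharacterCodingException e) {
      // MalformedInputException is a subclass; the caller can skip this path.
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    byte[] utf8Name = "caf\u00e9.png".getBytes(StandardCharsets.UTF_8);
    byte[] latin1Name = "caf\u00e9.png".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(decodeUtf8(utf8Name).isPresent());   // true
    System.out.println(decodeUtf8(latin1Name).isPresent()); // false
  }
}
```

A caller could then sync only the paths for which the result is present and log the rest, which is roughly the trade-off described above: no crash, but those files are not mirrored.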

I guess in theory it could transfer the file path as binary (just a byte[]) across the wire ... however all of the Java file system APIs take strings, so once the remote side got it, there is not a (standard) Java API that would accept it. I'd have to do something janky like save it to a temp file (via the Java APIs) and then use a JNI call/something to rename it.
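As a rough illustration of why the bytes cannot simply be pushed through a String, the sketch below round-trips a name through lenient UTF-8 decoding and checks whether the original bytes come back. The class and method names are made up for this example, and it assumes UTF-8 is the encoding used for the conversion; the JVM's actual file name encoding depends on the platform and locale.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class LossyRoundTrip {

  /** Returns true if the bytes survive a bytes -> String -> bytes round trip as UTF-8. */
  static boolean survivesUtf8RoundTrip(byte[] raw) {
    // new String(...) never throws; malformed bytes are replaced with U+FFFD.
    String decoded = new String(raw, StandardCharsets.UTF_8);
    byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(raw, reencoded);
  }

  public static void main(String[] args) {
    byte[] utf8Name = "caf\u00e9.png".getBytes(StandardCharsets.UTF_8);
    byte[] latin1Name = "caf\u00e9.png".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(survivesUtf8RoundTrip(utf8Name));   // true
    System.out.println(survivesUtf8RoundTrip(latin1Name)); // false
  }
}
```

When the round trip fails, the String no longer identifies the original file, so the receiving side would create a differently named file even if the raw bytes were sent across the wire intact.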

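In my case, these were corrupted file paths, so I used env LC_ALL=C find . -name '*[! -~]*' to find them (the pattern matches any name containing a byte outside the printable ASCII range, space through tilde) and then deleted them. But I suppose for you they are real files...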

I'll leave this issue open as "somehow support non-utf8 file names in a way that is not dropping them".

Answer 2 · 2018-10-28T21:11:00.000Z

Ah yes, just to confirm: there are a bunch of image files of international flags that have accented characters in their filenames. That's the way they came from the source. I certainly wouldn't have chosen to use complex characters in the filenames, and I've seen it documented that it's not a good idea, but I wouldn't know how to check whether they are UTF-8, ISO-8859-1, or some other encoding.
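One way to approach the "how would I check" question, sketched under assumptions: given the raw bytes of a file name (obtained by some means outside this example, since Java's own file APIs hand back Strings), classify them as plain ASCII, valid UTF-8, or not UTF-8. The class `FileNameEncodingCheck` and its `Kind` enum are hypothetical. Note that this cannot prove a name is ISO-8859-1, because every byte sequence decodes successfully as ISO-8859-1; it can only say the name is not UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class FileNameEncodingCheck {

  enum Kind { ASCII, UTF_8, NOT_UTF_8 }

  /** Classifies raw file name bytes by whether they are ASCII, valid UTF-8, or neither. */
  static Kind classify(byte[] rawName) {
    boolean allAscii = true;
    for (byte b : rawName) {
      if (b < 0) { // bytes 0x80..0xFF are negative when viewed as signed Java bytes
        allAscii = false;
        break;
      }
    }
    if (allAscii) {
      return Kind.ASCII;
    }
    try {
      // A fresh decoder reports malformed input by throwing.
      StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(rawName));
      return Kind.UTF_8;
    } catch (CharacterCodingException e) {
      return Kind.NOT_UTF_8; // possibly a legacy single-byte encoding such as ISO-8859-1
    }
  }

  public static void main(String[] args) {
    System.out.println(classify("flag.png".getBytes(StandardCharsets.US_ASCII)));        // ASCII
    System.out.println(classify("C\u00f4te.png".getBytes(StandardCharsets.UTF_8)));      // UTF_8
    System.out.println(classify("C\u00f4te.png".getBytes(StandardCharsets.ISO_8859_1))); // NOT_UTF_8
  }
}
```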