Dataflow /var/opt/google/dataflow directory created as /var/opt/google/dataflow/dataflow
junying1 opened this issue · 1 comments
Dataflow is not able to find files packaged with my classes. I use Class.getResource("/data.json"). The Stackdriver log shows it's looking for the file in /var/opt/google/dataflow/some-random-jar-name.jar!/data.json. When I ssh into the VM instance for the worker, the file is actually in /var/opt/google/dataflow/dataflow/some-random-jar-name.jar.jar. This was working as of 5/9/18.
I tested with the WordCount example straight from Apache Beam documentation: https://beam.apache.org/get-started/quickstart-java/
I followed all the steps, then added a "resources/data.json" file under "src/main" and added the following lines to WordCount.ExtractWordsFn's processElement method:
    try {
        String jsonStr = new Scanner(new File(WordCount.class.getResource("/data.json").getFile())).useDelimiter("\\Z").next();
        System.out.println("====================================================");
        System.out.println(jsonStr);
        System.out.println("====================================================");
    } catch (Exception e) {
        e.printStackTrace();
    }
Sure enough, it runs fine locally with DirectRunner, but with DataflowRunner I got the same error in Stackdriver:
message: "java.io.FileNotFoundException: file:/var/opt/google/dataflow/classes-yGX0uczTTR8A8LXakSr0JA.jar!/data.json (No such file or directory)"
While the example batch was still running, I ssh'ed into the worker instance and checked /var/opt/google/dataflow. There is another "dataflow" directory inside it, and the files are copied there. That confirms the double "dataflow" directory issue.
I worked out a workaround: use Class.getResourceAsStream to get an InputStream. For whatever reason, getResourceAsStream works as expected, while getResource still fails. For all of my purposes, an InputStream works just as well as a URL.
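For anyone hitting the same thing, here is a minimal sketch of the getResourceAsStream workaround. The class name ResourceReader and its helper methods are made up for illustration and are not part of Beam or the WordCount example. One thing that may explain the difference: when a resource lives inside a jar, getResource returns a jar: URL, and its getFile() value has the form file:/path/to.jar!/data.json, which java.io.File cannot open as an ordinary path. Reading through a stream avoids building a File at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class ResourceReader {

    // Drains a stream into a UTF-8 string (plain loop, so it works on Java 8).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Reads a classpath resource without ever converting it to a File.
    static String readResource(Class<?> owner, String name) throws IOException {
        try (InputStream in = owner.getResourceAsStream(name)) {
            // getResourceAsStream returns null when the resource is missing.
            if (in == null) {
                throw new FileNotFoundException("Resource not on classpath: " + name);
            }
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes a data.json at the classpath root, as in the report above.
        System.out.println(readResource(ResourceReader.class, "/data.json"));
    }
}
```

Inside processElement, the equivalent call would be something like readResource(WordCount.class, "/data.json") in place of the Scanner/File line.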