deanwampler/spark-scala-tutorial

Files do not get processed in SparkStreaming11Main when dropped manually.


Hi,

I am testing the code on a Windows 7 machine, and the SparkStreaming11Main example seems to work fine. However, when I comment out line 93:

startDirectoryDataThread(in, data)

and drop the files manually into the streaming-input folder, nothing gets processed. Do we have to set anything up to make this work manually? Thanks for your help.
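
For reference, here is roughly the setup I am testing, as a minimal sketch using Spark Streaming's textFileStream (the names and paths here are illustrative, not the tutorial's exact code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectoryStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DirectoryStreamSketch")
    val ssc  = new StreamingContext(conf, Seconds(2))  // 2-second mini-batches

    // Watch a directory; only files that appear with a modification time
    // newer than the last mini-batch are picked up.
    val lines = ssc.textFileStream("tmp/streaming-input")
    lines.count().print()  // per-batch line count

    ssc.start()
    ssc.awaitTermination()
  }
}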

Check the time stamps of the files: they need to be newer than the time stamp of the last Spark mini-batch job, or Spark will ignore them. I don't know Windows, but if you move the file, it might keep its old time stamp; copying the file should work better. That's basically what the startDirectoryDataThread code does. (I'm assuming that's working for you.)
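
If it helps, here is a rough sketch of what that data thread does (the object and method names here are illustrative, not the tutorial's actual API): it writes new files into the watched directory, so each one arrives with a fresh modification time.

import java.io.PrintWriter
import scala.io.Source

object DataThreadSketch {
  def startDataThread(sourcePath: String, watchedDir: String): Thread = {
    val thread = new Thread {
      override def run(): Unit = {
        val source = Source.fromFile(sourcePath)
        try {
          // Write the data in chunks, one newly created file per chunk, so
          // successive mini-batches each see a file with a fresh time stamp.
          for ((group, i) <- source.getLines().grouped(1000).zipWithIndex) {
            val out = new PrintWriter(s"$watchedDir/${i + 1}.txt")
            try group.foreach(out.println) finally out.close()
            Thread.sleep(1000)
          }
        } finally source.close()
      }
    }
    thread.setDaemon(true)
    thread.start()
    thread
  }
}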

Thanks for your help. That does not seem to be the case, because I do copy the files over. Furthermore, I tested it on Ubuntu, and it worked fine there. So it looks like something specific to Windows, but I am not sure how to get it resolved at the moment.

Thanks for investigating further. I'm not sure what to suggest at the moment.

I looked into this in a Windows 8 environment. If you use copy data\kjvdat.txt tmp\streaming-input\1.txt (for example), the target file keeps the source's creation and modification times, so Spark doesn't recognize it as new. However, if you use more < data\kjvdat.txt > tmp\streaming-input\1.txt, the target is written from scratch and gets new creation and modification times, so Spark truly considers it new. This is effectively what DataDirectoryServer does, as well. So, I'm going to close this one.
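
For anyone hitting this on Windows, the same "write fresh bytes" idea in Scala might look like the sketch below (paths are illustrative). It stages the new file outside the watched directory and then moves it in, so the file arrives with fresh time stamps and Spark never sees it half-written.

import java.nio.file.{Files, Paths, StandardCopyOption}

object FreshCopy {
  def main(args: Array[String]): Unit = {
    val src = Paths.get("data/kjvdat.txt")
    val tmp = Paths.get("tmp/1.txt")                  // staging, outside the watched dir
    val dst = Paths.get("tmp/streaming-input/1.txt")  // the watched directory

    // Writing the bytes to a brand-new file makes the OS assign fresh
    // creation and modification times (unlike Windows copy, which preserves
    // the source's times).
    Files.write(tmp, Files.readAllBytes(src))

    // Move into the watched directory; on the same filesystem this is atomic,
    // and the file keeps its fresh time stamps.
    Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING)
  }
}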