dask/hdfs3

Windows Support -- help needed

tadas-subonis opened this issue · 7 comments

Hi all,

What is needed to make this support Windows? I am in need of a Python HDFS client, so I would be willing to develop that if it is reasonable.

hdfs3 is pure Python and so easily transferable, except for the extension of the linked code (.so on Linux would be .dll on Windows).
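To illustrate, here is a minimal sketch of the kind of platform-conditional loading that would be needed; this is a hypothetical approach, not hdfs3's actual loader:

    import ctypes
    import sys

    # Hypothetical sketch: pick a platform-appropriate shared-library
    # name before loading libhdfs3 (hdfs3 itself only ships a Linux .so).
    if sys.platform.startswith("win"):
        lib_name = "libhdfs3.dll"
    elif sys.platform == "darwin":
        lib_name = "libhdfs3.dylib"
    else:
        lib_name = "libhdfs3.so"

    # Raises OSError if the library cannot be found on this system.
    clib = ctypes.cdll.LoadLibrary(lib_name)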

The low-level library, libhdfs3, is only configured to build on Linux, although building on OS X is known to be possible. For Windows, you would have to get all of the dependencies in the recipe built for Windows (some of them already are) and figure out the appropriate way to trigger the build.

In addition, there is a later version of libhdfs3 that we are trying to incorporate from HAWQ, which has more build dependencies such as cmake, gtest, and gmock. It may be possible to simplify that, or to port the newer code back to the previous build system.

@tadas-subonis Any luck on this after all this time? I have spent many hours searching for a Windows-based HDFS client.

Have you tried pyarrow, which is available for Windows? The Java requirements will surely work, but I don't know the status of the Hadoop JNI (native) libraries. You may want to ask on the pyarrow tracker.
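For reference, a minimal sketch of what the pyarrow route looks like; host, port, and user are placeholders, and it assumes the JVM and Hadoop native libraries can be found at runtime:

    import pyarrow as pa

    # Placeholder connection details; adjust for your cluster. This was
    # pyarrow's HDFS entry point at the time of this thread.
    fs = pa.hdfs.connect(host='localhost', port=9000, user='dr.who')
    print(fs.ls('/'))  # list the root directory as a smoke test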

Yes I tried that as well, but I get this error on it:

Traceback (most recent call last):
  File "hdfstest.py", line 6, in <module>
    fs = pa.hdfs.connect(host='localhost', port=9000, user='dr.who')
  File "C:\Program Files\Python36\lib\site-packages\pyarrow\hdfs.py", line 187, in connect
    extra_conf=extra_conf)
  File "C:\Program Files\Python36\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm

I installed Java SE, but I'm not sure why it doesn't recognize the library, and I don't know how to trace the error back, since it occurs in a .pxi file, which I'm not familiar with.
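One possible avenue (an assumption, not a confirmed fix): pyarrow looks for libjvm via the JAVA_HOME environment variable, so setting it before importing pyarrow might help. The paths below are placeholders for a typical Windows install:

    import os

    # Assumption: pyarrow locates libjvm (jvm.dll on Windows) through
    # JAVA_HOME; substitute the real paths for your JDK and Hadoop.
    os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jdk1.8.0_202'
    os.environ['HADOOP_HOME'] = r'C:\hadoop'
    # jvm.dll lives under bin\server in a typical JDK layout.
    os.environ['PATH'] = (os.environ['PATH'] + os.pathsep +
                          os.path.join(os.environ['JAVA_HOME'], 'bin', 'server'))

    import pyarrow as pa
    fs = pa.hdfs.connect(host='localhost', port=9000, user='dr.who')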

Yeah, I think you should ask them. hdfs3 will not be developed further, and as you can see from the conversation above, we had no good idea of how to make a build system for Windows; sorry.

Since I assume Hadoop is not running on your Windows system, would WebHDFS be a possible solution? It may add overhead, but it is perhaps the only thing you can make work. You can use its REST API directly, or a Python wrapper such as pywebhdfs.
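For example, hitting the REST API directly might look like the sketch below; the port and user are assumptions for a default Hadoop 2.x setup with WebHDFS enabled (Hadoop 3.x moved the NameNode web port to 9870):

    import requests

    # Assumption: the NameNode's WebHDFS endpoint is on localhost:50070.
    base = 'http://localhost:50070/webhdfs/v1'

    # LISTSTATUS on the root directory, authenticating as a simple user.
    resp = requests.get(base + '/', params={'op': 'LISTSTATUS', 'user.name': 'dr.who'})
    resp.raise_for_status()
    for status in resp.json()['FileStatuses']['FileStatus']:
        print(status['pathSuffix'], status['type'])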

That's a brilliant idea that I wasn't aware of, thank you so much! I'm testing Hadoop on my local Windows machine before connecting to our clouds. Any way to start HttpFS on my lame Windows box too?

I don't know, but from a 20-second read, HttpFS appears no simpler than WebHDFS.