jingw/pyhdfs

BUG: Chinese characters can't be copied to HDFS


UnicodeEncodeError: 'latin-1' codec can't encode characters in position 2-3: Body ('张三') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1741, in <module>
main()
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1735, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1135, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/main.py", line 73, in <module>
execute(input_json=input1, application_id=input2)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/main.py", line 64, in execute
model(json_info, application_id).run()
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/pipeline/source/write_data.py", line 31, in run
super().upload_to_hdfs(data=input_data, path=data, sep=sep2)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/common/abstract.py", line 79, in upload_to_hdfs
upload_to_hdfs(data, path, fs=self.fsClient, sep=sep)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/common/hdfs.py", line 32, in upload_to_hdfs
overwrite=True)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/pyhdfs.py", line 426, in create
metadata_response.headers['location'], data=data, **self._requests_kwargs)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/requests/api.py", line 131, in put
return request('put', url, data=data, **kwargs)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/Users/zhenyuepeng/dtWorkSpace/dt-center-algorithm/venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1274, in _send_request
body = _encode(body, 'body')
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 160, in _encode
(name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 2-3: Body ('张三') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

Process finished with exit code 1

jingw commented

The problem is that you're sending a Unicode string without specifying how it should be encoded, though I'm a bit surprised Python doesn't default to UTF-8 encoding. In general, files contain bytes. Those bytes might represent plain ASCII, Chinese characters, an image, etc., but at the file system level they're just bytes. The fix is as the error message states: call body.encode('utf-8') (or whatever your preferred encoding is) before uploading.
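A minimal sketch of that fix, assuming the data is held as a Python str just before the upload call. The helper name to_upload_bytes and the sample row are hypothetical, made up for illustration; the pyhdfs call appears only in a comment, modeled on the create(..., overwrite=True) call visible in the traceback.

```python
def to_upload_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode a str explicitly so the HTTP layer receives bytes, not text."""
    return text.encode(encoding)


# Hypothetical row; '张三' sits at positions 2-3, as in the reported error.
row = "1,张三"

# A str body ends up being encoded as Latin-1 by http.client, which cannot
# represent Chinese characters. Reproduce that failure in isolation:
try:
    row.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Encoding up front avoids the implicit Latin-1 step entirely.
payload = to_upload_bytes(row)

# The bytes would then be uploaded in place of the str, for example:
#   client.create(path, payload, overwrite=True)
# where client and path are placeholders for your own HdfsClient and target path.
```

Whoever reads the file back has to decode it with the same encoding, so it helps to pick one (UTF-8 is a common choice) and use it consistently.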

Thank you very much!