getodk/pyodk

Encode Unicode in X-XlsForm-FormId-Fallback


Software and hardware versions

pyodk 0.3.0, Python v3.11.3

Problem description

I'm noticing that pyODK doesn't encode the X-XlsForm-FormId-Fallback header. Central expects the header to be ASCII. Unicode is expected to be URL-encoded. (pyxform-http is the one to decode it.) This came up in Central in getodk/central#196.

That said, I'm not sure to what extent this is a real problem. I tried using client.forms.update() to send an XLSForm with Unicode in its filename, and pyODK seemed happy to send a Unicode header. If the Central API and pyxform-http are happy to receive a Unicode header, then the only issue would be filenames that contain % (filenames for which the filename and the URL-decoded filename are not the same).

@matthew-white thanks for the report - it would help a lot if you could please 1) add example code to reproduce the issue, 2) show the expected result, and 3) show the actual result?

I've uploaded a form with an ID of ✅ here: https://staging.getodk.cloud/#/projects/22/forms/%E2%9C%85. The issue can be reproduced by downloading that form, then running the following:

client.forms.update(project_id=22, form_id='✅', definition='✅.xlsx')

Without changing the version string in the XLSForm, I think I should receive a 409 error response. However, the request doesn't seem to get off the ground. I see the following error:

  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1256, in putheader
    values[i] = one_value.encode('latin-1')
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2705' in position 0: ordinal not in range(256)
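
The failing step in the traceback can be isolated without a Central server or pyodk. This is a minimal sketch that assumes the header value is simply the form's file stem, as in the report; the variable names are illustrative only.

```python
# Assumption: the header value is the bare file stem of '✅.xlsx'.
header_value = "\u2705"  # the ✅ character

# http.client encodes str header values as latin-1 before sending,
# which is the call shown in the traceback above.
try:
    header_value.encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__, error)
```

The request fails locally, before anything is sent, which matches the observation that it never reaches Central.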

I'm not very familiar with Python, but I'm thinking that this could be solved by calling urllib.parse.quote() on file_path.stem here:

"X-XlsForm-FormId-Fallback": file_path.stem,
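
A rough sketch of that suggestion, not the actual pyodk change: `fallback_header_value` is a hypothetical helper name, and passing `safe=""` is an assumption made here so that every reserved character, including `%`, is percent-encoded.

```python
from pathlib import Path
from urllib.parse import quote, unquote


def fallback_header_value(file_path: Path) -> str:
    """Hypothetical helper: percent-encode the file stem so it is pure ASCII."""
    return quote(file_path.stem, safe="")


# "\u2705.xlsx" is the '✅.xlsx' file from the reproduction steps.
value = fallback_header_value(Path("\u2705.xlsx"))
print(value)

# The encoded value is ASCII, so the latin-1 step in http.client succeeds.
encoded_bytes = value.encode("latin-1")

# A receiver that URL-decodes the header recovers the original stem.
round_trip = unquote(value)

# A literal '%' in the file name is escaped too, so decoding stays unambiguous.
percent_value = fallback_header_value(Path("100%.xlsx"))
```

Encoding the stem unconditionally also covers the `%` case raised earlier, since the value that is sent and its URL-decoded form are then always distinguishable.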

That said, I'm not sure to what extent this is a real problem. … pyODK seemed happy to send a Unicode header. … the only issue would be filenames that contain %

Looking at it more, I think I was wrong about this. I think pyODK actually generally isn't willing to send a Unicode X-XlsForm-FormId-Fallback header. I think I got confused by the form at getodk/central#196. When I download that form from GitHub, the resulting file name is tést.xlsx (URL-encoded as te%CC%81st.xlsx). But the form ID in the XLSForm is tést, and even though the two look the same, the latter is URL-encoded as t%C3%A9st. It seems there are multiple ways to represent é: the precomposed character (%C3%A9, U+00E9) can be encoded as latin-1, while e followed by a combining acute accent (e%CC%81, U+0301) cannot. I tried to avoid this issue in my reproduction steps above by using ✅, which is definitely not latin-1.
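
The two spellings of tést can be compared directly. The sketch below uses only the standard library; `can_encode_latin1` is an illustrative helper, not part of pyodk.

```python
import unicodedata
from urllib.parse import quote

# Precomposed form: a single code point U+00E9 (as in the XLSForm form ID).
composed = unicodedata.normalize("NFC", "te\u0301st")
# Decomposed form: 'e' followed by combining acute U+0301 (as in the file name).
decomposed = unicodedata.normalize("NFD", "t\u00e9st")


def can_encode_latin1(text: str) -> bool:
    """Illustrative helper: True if every character fits in latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for label, text in (("composed", composed), ("decomposed", decomposed)):
    print(label, len(text), quote(text), can_encode_latin1(text))
```

The strings render identically but differ in length and in their percent-encoded form, and only the precomposed one survives the latin-1 encoding of the header.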

@matthew-white thanks for these details. I've put together a draft PR (linked above). I haven't tested it against Central yet but if you would like to try it out please let me know how it goes.