freedomofpress/securedrop-client

[0.9.0] Client crashes when using the pt_PT locale and attempting to export a transcript with non-latin-1 chars

zenmonkeykstop opened this issue · 11 comments

Description

The client crashes when running in Portuguese (pt_PT) and exporting a partially downloaded conversation. It looks like an encoding issue in a message string. The same crash happens with Export Conversation. Export works fine with en_US and zh_Hans.

Steps to Reproduce

  1. Start the client using the pt_PT locale.
  2. Choose a source with some files submitted but not downloaded to the client.
  3. Choose Export All (or Export Conversation).
  4. Choose "Continuar".
    The client crashes; the terminal output follows:
$ LANG=pt_PT securedrop-client
/opt/venvs/securedrop-client/lib/python3.9/site-packages/sqlalchemy/orm/query.py:196: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if entities is not ():
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
/opt/venvs/securedrop-client/lib/python3.9/site-packages/sqlalchemy/orm/query.py:196: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if entities is not ():
qt.xkb.compose: failed to create compose table
Traceback (most recent call last):
  File "/opt/venvs/securedrop-client/lib/python3.9/site-packages/securedrop_client/gui/actions.py", line 350, in _on_confirmation_dialog_accepted
    self._prepare_to_export()
  File "/opt/venvs/securedrop-client/lib/python3.9/site-packages/securedrop_client/gui/actions.py", line 315, in _prepare_to_export
    f.write(str(transcript))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 97-98: ordinal not in range(256)

Expected Behavior

Export succeeds

Actual Behavior

Client crashes with error above.

Comments

Adding an encoding='utf-8' argument to this call seems to resolve the crash:

with open(file_path, "w") as f:

The same change will probably also be needed at
with open(file_path, "w") as f:
for printing to work.
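
A minimal sketch of why the explicit argument helps, assuming the pt_PT locale resolves the default text encoding to Latin-1, as the traceback suggests. The helper write_transcript and the sample text are illustrative and are not the client's code; Latin-1 is passed explicitly here only to stand in for the locale-derived default.

```python
import os
import tempfile

# Sample text with characters outside Latin-1 (borrowed from the test transcript).
transcript = "many globule wrote:\nΩ≈ç√∫˜µ≤≥÷\n"


def write_transcript(path, text, encoding):
    """Hypothetical helper: write text to path using the given encoding."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "transcript.txt")

# Stand-in for the failing case: the default encoding resolves to Latin-1.
try:
    write_transcript(path, transcript, "latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# The proposed fix: always pass encoding="utf-8", whatever the locale is.
write_transcript(path, transcript, "utf-8")
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()
```

With the explicit encoding the write no longer depends on $LANG, so the same call behaves identically under en_US, zh_Hans and pt_PT.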

The transcript in question (one of the dev test sources):

many globule wrote:
~!@#$%^&*()_+{}|:"<>?~!@#$%^&*()_+{}|:"<>?~!@#$%^&*()_+{}|:"<>?~!@#$%
------
Ω≈ç√∫˜µ≤≥÷
åß∂ƒ©˙∆˚¬…æ
œ∑´®†¥¨ˆøπ“‘
¡™£¢∞§¶•ªº–≠
¸˛Ç◊ı˜Â¯˘¿
ÅÍÎÏ˝ÓÔÒÚÆ☃
Œ„´‰ˇÁ¨ˆØ∏”
------
many globule sent:
File: memo.txt
------
many globule sent:
File: 4-many_globule-doc.gz.gpg
------
dellsberg wrote:
~!@#$%^&*()_+{}|:"<>?~!@#$%^&*()_+{}|:"<>?~!@#$%^&*()_+{}|:"<>?~!@#$%
------
deleted wrote:
Ω≈ç√∫˜µ≤≥÷
åß∂ƒ©˙∆˚¬…æ
œ∑´®†¥¨ˆøπ“‘
¡™£¢∞§¶•ªº–≠
¸˛Ç◊ı˜Â¯˘¿
ÅÍÎÏ˝ÓÔÒÚÆ☃
Œ„´‰ˇÁ¨ˆØ∏”

cfm commented

Naïvely, I'd think we always want to write the bytes of transcript.encode('utf-8'), regardless of the $LANG in effect.

I think that's a non-naive assumption until Marain enters the chat.

We could do that, but it looks like we're also defining the string-conversion (__str__) behaviour of Transcript, so we could be more careful there as well.

Adding encoding='utf-8' to the initial file open seems to resolve this cleanly.

(That said, we should probably have per-language strings in a test source message somewhere to see how they behave.)

Just tested with zh_Hans and pt_PT test strings and the encoding change above: export works fine and the resulting text file displays correctly.

Probably too big of a change now, but for the next release I think we should just always use UTF-8 via https://peps.python.org/pep-0540/ (it'll become the default in 3.15, see https://peps.python.org/pep-0686/) regardless of locale.
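
For reference, a small sketch of what UTF-8 mode (PEP 540) changes, assuming a CPython 3.7+ interpreter is available as sys.executable. It starts a child interpreter with the -X utf8 option and reports the default encoding a text-mode open() picks up there. The CHILD snippet and variable names are illustrative only.

```python
import subprocess
import sys

# Code run in the child interpreter: report whether UTF-8 mode is on and
# which encoding a text-mode open() without an encoding argument uses.
CHILD = (
    "import codecs, os, sys\n"
    "with open(os.devnull, 'w') as f:\n"
    "    print(sys.flags.utf8_mode, codecs.lookup(f.encoding).name)\n"
)

# -X utf8 enables UTF-8 mode; setting PYTHONUTF8=1 in the environment is
# the equivalent switch.
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", CHILD],
    capture_output=True,
    text=True,
    check=True,
)
child_report = result.stdout.strip()
print(child_report)
```

In this mode the locale no longer decides the default encoding for open(), which is the behaviour proposed here for the whole client.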

Calling out that this is a release blocker for 0.9.0. 🔥

I'm looking into:

  • reproducing the error in the test suite
  • fixing it
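
One possible way to reproduce this in a test without installing the pt_PT locale, sketched under assumptions: builtins.open is patched so that text-mode opens without an explicit encoding fall back to Latin-1, which imitates the failing environment. All function names below are hypothetical and do not come from the client's test suite.

```python
import builtins
import os
import tempfile
from unittest import mock

_real_open = builtins.open


def _latin1_default_open(file, mode="r", *args, **kwargs):
    # Imitate a Latin-1 locale: text-mode opens that give no encoding
    # (positionally or by keyword) get encoding="latin-1".
    if "b" not in mode and len(args) < 2 and kwargs.get("encoding") is None:
        kwargs["encoding"] = "latin-1"
    return _real_open(file, mode, *args, **kwargs)


def export_without_encoding(path, text):
    # Mirrors the failing call from the traceback.
    with open(path, "w") as f:
        f.write(text)


def export_with_utf8(path, text):
    # Mirrors the proposed fix.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def check_export(export):
    """Return True if export writes non-Latin-1 text under the fake locale."""
    text = "File: memo.txt\n☃ Ω\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "transcript.txt")
        with mock.patch("builtins.open", _latin1_default_open):
            try:
                export(path, text)
            except UnicodeEncodeError:
                return False
        # Outside the patch: verify the file really is UTF-8 on disk.
        with open(path, encoding="utf-8") as f:
            return f.read() == text
```

A regression test could then assert that check_export fails for the old call and passes for the fixed one.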

[…] I'd think we always want to write the bytes of transcript.encode('utf-8'), regardless of the $LANG in effect.

I too think we should be creating UTF-8-encoded files regardless of the $LANG. (Because source submissions are not actually restricted in any way by $LANG.)
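
The bytes-based variant suggested above could look roughly like this. It is a sketch only: write_utf8_bytes and the sample text are illustrative, and the real export path may need more care (permissions, error handling).

```python
import os
import tempfile


def write_utf8_bytes(path, text):
    """Hypothetical helper: encode explicitly and write in binary mode."""
    data = text.encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)
    return data


sample = "deleted wrote:\nÅÍÎÏ˝ÓÔÒÚÆ☃\n"
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "transcript.txt")
    written = write_utf8_bytes(out_path, sample)
    with open(out_path, "rb") as f:
        on_disk = f.read()
```

Binary mode bypasses the locale entirely. It also skips newline translation, which is one difference from passing encoding='utf-8' to a text-mode open.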

Probably too big of a change now, but for the next release I think we should just always use UTF-8 via https://peps.python.org/pep-0540/ (it'll become the default in 3.15, see https://peps.python.org/pep-0686/) regardless of locale.

I mistakenly assumed that's what Python was doing by default – so I'd second this suggestion.

(Split the always-use-UTF-8 idea out to #1647.)