Character encoding broken

Question

VladimirFokow opened this issue 5 months ago · 7 comments

When copying non-English characters (specifically, Ukrainian) from/to the clipboard, the encoding is broken.

To reproduce:

pip install pyperclip

Write a simple python script, e.g.:

import pyperclip

text = pyperclip.paste()  # text is whatever is in the clipboard
pyperclip.copy(text)  # save text to clipboard

Copy some Ukrainian characters to your clipboard, e.g.: тест

When the script is run from the terminal, e.g. python script.py, it works fine: it saves to the clipboard the same thing that was there.

When it's run as a Raycast "Script Command", it saves this to the clipboard instead: ????

If I change the script so that it doesn't get text from the clipboard but instead saves "тест" to clipboard directly: pyperclip.copy('тест'), then I get this in my clipboard: —Ç–µ—Å—Ç
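Both symptoms can be reproduced in plain Python without touching the clipboard. The sketch below is only an illustration: the two encodings it names (ASCII with replacement, and Mac OS Roman) are assumptions about what might be happening inside the Script Command, not something confirmed in this thread.

```python
# Hypothetical demonstration of the two symptoms. The codecs used here are
# guesses about the Raycast environment, not confirmed behaviour.
original = "тест"

# Symptom 1: the text is encoded to a charset that has no Cyrillic letters,
# and every character that does not fit is replaced with "?".
question_marks = original.encode("ascii", errors="replace").decode("ascii")

# Symptom 2: the UTF-8 bytes are read back as Mac OS Roman, so each of the
# eight bytes turns into one unrelated character.
mojibake = original.encode("utf-8").decode("mac_roman")

print(question_marks)
print(mojibake)
```

If the second printed line matches the garbled string above, that would suggest the bytes written to the clipboard are valid UTF-8 and are only being interpreted with the wrong encoding on the way in.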

Answer 1 · 2024-08-05T15:53:06.000Z

Hey there @VladimirFokow,

Sorry to hear you are having issues. Sadly, I am not very knowledgeable about Python. We have had similar problems before, but to be honest I don't know what the issue could be. Here is what I said to a previous user.

Answer 2 · 2024-08-05T16:45:55.000Z

Hi, thanks for the reply.
I just tested it with bash, and it has the same problem.

I also printed the environment variables LANG and PATH:

#!/bin/bash

# Required parameters:
# @raycast.schemaVersion 1
# @raycast.title test_cp
# @raycast.mode fullOutput

# Optional parameters:
# @raycast.description Test the clipboard
# @raycast.packageName test_cp
# @raycast.icon 🧪


# Save the content of the clipboard into a variable `text`
text=$(pbpaste)
# Save the content of the variable `text` back into the clipboard
echo "$text" | pbcopy

# Example non-English characters to copy: тест
# Result in the clipboard: ????

echo "$LANG"  # en_DE.UTF-8
echo "$PATH"  # /usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin

# Alternative test:
# text='тест'
# echo "$text" | pbcopy
# Result in the clipboard: —Ç–µ—Å—Ç

You mentioned: "we run the scripts as a subprocess"

Could you please point to the code location where this subprocess is created? (to try isolating the issue)

(unfortunately, I haven't used Swift or Ruby before)
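One thing that may be worth checking is the LANG value printed by the script above. en_DE.UTF-8 is an unusual language/region pair and might not exist as an installed locale on the machine. Whether that affects pbcopy and pbpaste is an assumption here, not something established in this thread. A minimal sketch for testing which locale names the C library accepts (the helper name `locale_is_available` is made up for this example):

```python
import locale


def locale_is_available(name):
    """Return True if the C library accepts `name` for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Always restore whatever was active before the check.
        locale.setlocale(locale.LC_CTYPE, saved)


for candidate in ("en_DE.UTF-8", "en_US.UTF-8", "C"):
    print(candidate, locale_is_available(candidate))
```

If en_DE.UTF-8 comes back False while en_US.UTF-8 comes back True, exporting a known-good locale at the top of the Script Command would be a cheap experiment.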

Answer 3 · 2024-08-06T08:29:05.000Z

@VladimirFokow to me, this sounds very likely to be a UTF-8/Unicode problem.
Using the example you provided:

import pyperclip

text = pyperclip.paste()  # text is whatever is in the clipboard
pyperclip.copy(text)  # save text to clipboard

Give this piece of code a try and let us know if it works for you:

import pyperclip
import os
import chardet

# Ensure the environment uses UTF-8 encoding
os.environ['PYTHONIOENCODING'] = 'utf-8'

# Function to detect and convert encoding
def convert_to_utf8(text):
    result = chardet.detect(text.encode())
    encoding = result['encoding']
    return text.encode(encoding).decode('utf-8')

# Get text from clipboard
text = pyperclip.paste()

# Convert text to UTF-8
text_utf8 = convert_to_utf8(text)

# Copy text back to clipboard
pyperclip.copy(text_utf8)

print("Text successfully copied to clipboard.")
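A note on the approach above: PYTHONIOENCODING is read when the interpreter starts, so assigning it inside an already running script does not change the encoding of that process, and pbcopy and pbpaste do not read it at all. As far as I know, pyperclip on macOS calls those two tools as subprocesses, so their locale is what matters. The sketch below bypasses pyperclip and passes an explicit locale to the child processes. The helper names and the choice of en_US.UTF-8 are assumptions for illustration only.

```python
import os
import shutil
import subprocess


def utf8_env(base=None, locale_name="en_US.UTF-8"):
    """Copy an environment and force a UTF-8 locale in the copy."""
    env = dict(os.environ if base is None else base)
    env["LANG"] = locale_name
    env["LC_ALL"] = locale_name
    return env


def copy_text(text, env):
    """Send UTF-8 bytes to pbcopy, which runs with the given environment."""
    subprocess.run(["pbcopy"], input=text.encode("utf-8"), env=env, check=True)


def paste_text(env):
    """Read the clipboard through pbpaste and decode the bytes as UTF-8."""
    done = subprocess.run(["pbpaste"], stdout=subprocess.PIPE, env=env, check=True)
    return done.stdout.decode("utf-8")


if __name__ == "__main__":
    if shutil.which("pbcopy") and shutil.which("pbpaste"):
        env = utf8_env()
        copy_text("тест", env)
        print(paste_text(env))
    else:
        print("pbcopy/pbpaste not found; this sketch targets macOS")
```

If this round-trips correctly from a Script Command while the pyperclip version does not, the inherited locale would be the likely culprit.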

Answer 4 · 2024-08-06T10:10:59.000Z

Hi @unnamedd, thanks for the idea, but it didn't help.

For experimenting, here is an example "Script Command" which can invoke a python script:

#!/bin/bash

# Required parameters:
# @raycast.schemaVersion 1
# @raycast.title test_cp_py
# @raycast.mode fullOutput

# Optional parameters:
# @raycast.description Test the clipboard (Python)
# @raycast.packageName test_cp_py
# @raycast.icon 🐍

# /path/to/python can be seen by calling: `which python3`
/path/to/python /path/to/script.py

I added some prints:

import pyperclip
import os
import chardet

# Ensure the environment uses UTF-8 encoding
print("PYTHONIOENCODING: ", os.environ.get('PYTHONIOENCODING'))
os.environ['PYTHONIOENCODING'] = 'utf-8'
print("PYTHONIOENCODING: ", os.environ.get('PYTHONIOENCODING'))


# Function to detect and convert encoding
def convert_to_utf8(text):
    result = chardet.detect(text.encode())
    print('result, detected with chardet:', result)
    encoding = result['encoding']
    print('encoding:', encoding)
    return text.encode(encoding).decode('utf-8')



text = pyperclip.paste()
print(text)
text_utf8 = convert_to_utf8(text)
print(text_utf8)
pyperclip.copy(text_utf8)





print('\nencoding of "тест":', chardet.detect('тест'.encode()))
print('  just question marks:')
print('encoding of "????":', chardet.detect('тест'.encode()))
print('  symbols that the script produced to the clipboard:')
print('encoding of "????":', chardet.detect('????'.encode()))

Output:

PYTHONIOENCODING:  None
PYTHONIOENCODING:  utf-8
????
result, detected with chardet: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
encoding: ascii
????

encoding of "тест": {'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
  just question marks:
encoding of "????": {'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
  symbols that the script produced to the clipboard:
encoding of "????": {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Done in 0.17s
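The 'ascii' detection in the output is consistent with the text already being lost before the script sees it: once each character has been replaced by a literal question mark, re-encoding cannot bring it back. The garbled form from the direct-copy test is different, because the original bytes are still there. A small sketch of that distinction, assuming (not confirmed in this thread) that the garbling comes from UTF-8 bytes being read as Mac OS Roman; `repair_mojibake` is a hypothetical helper:

```python
def repair_mojibake(garbled, wrong_codec="mac_roman"):
    """Undo 'UTF-8 bytes decoded with wrong_codec'; return None if impossible."""
    try:
        return garbled.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# Reversible case: the UTF-8 bytes survived, only the interpretation was wrong.
garbled = "тест".encode("utf-8").decode("mac_roman")
print(repair_mojibake(garbled))

# Lossy case: question marks are plain ASCII and stay question marks.
print(repair_mojibake("????"))
```

So a fix has to happen where the clipboard is read, not afterwards in the script.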

It looks like the encoding of the clipboard text doesn't survive the transition to the Raycast process (the one that is spawned to execute the script).
How is this process created?

It could be beneficial to isolate the issue (create a minimal reproducible example of creating this process to see exactly where the encoding problem happens)
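Until the actual spawning code is known, one way to approximate such an isolation test is to start a child process with a controlled environment and have it report what it sees. This is only a sketch: it does not reproduce how Raycast launches scripts, and the `probe` helper and the environment values tried are assumptions for illustration.

```python
import json
import os
import subprocess
import sys

# Code run by the child: report the locale variables and encodings it sees.
CHILD = (
    "import json, locale, os, sys;"
    "print(json.dumps({"
    "'LANG': os.environ.get('LANG'),"
    "'LC_ALL': os.environ.get('LC_ALL'),"
    "'preferred': locale.getpreferredencoding(False),"
    "'stdout': sys.stdout.encoding}))"
)


def probe(overrides):
    """Run a child Python with locale variables replaced by `overrides`."""
    # Start from the current environment, minus any locale settings.
    env = {k: v for k, v in os.environ.items()
           if k != "LANG" and not k.startswith("LC_")}
    env.update(overrides)
    done = subprocess.run([sys.executable, "-c", CHILD], env=env,
                          stdout=subprocess.PIPE, check=True)
    return json.loads(done.stdout.decode("ascii"))


if __name__ == "__main__":
    for label, overrides in (("no locale", {}),
                             ("LANG=en_DE.UTF-8", {"LANG": "en_DE.UTF-8"}),
                             ("LANG=en_US.UTF-8", {"LANG": "en_US.UTF-8"})):
        print(label, probe(overrides))
```

Running this both from the terminal and from a Script Command, then comparing the reported values, could help narrow down whether the difference lies in the environment that the spawned process inherits.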

Answer 5 · 2024-08-17T02:12:18.000Z

Hi, could someone please point me at the code where the subprocess is created? Thanks!

quote:

we run the scripts as a subprocess