twardoch/split-markdown4gpt

Error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 3183: character maps to <undefined>

avfirsov opened this issue · 1 comments

I am trying to split the following Markdown file, Makdyeniyel_M._Zapomnit_Vsyo_Usvoenie_Zn.a6.md, and I get an error:

$ python3 -m split_markdown4gpt ~/Downloads/Makdyeniyel_M._Zapomnit_Vsyo_Usvoenie_Zn.a6/Makdyeniyel_M._Zapomnit_Vsyo_Usvoenie_Zn.a6.md --model gpt-3.5-turbo --limit 4096 --separator "=== SPLIT ==="
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\__main__.py", line 44, in <module>
    cli()
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\__main__.py", line 40, in cli
    fire.Fire(split_md_file)
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\__main__.py", line 34, in split_md_file
    return f"\n{separator}\n".join(md_splitter.split(md_path))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\splitter.py", line 372, in split
    self.load_md(md)
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\splitter.py", line 121, in load_md
    self.load_md_path(md)
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\splitter.py", line 91, in load_md_path
    self.load_md_file(md_file)
  File "C:\Users\curious_andrew\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\split_markdown4gpt\splitter.py", line 100, in load_md_file
    self.load_md_str(md_file.read())
                     ^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 3183: character maps to <undefined>

What could I be doing wrong?

Well, this time it may finally be the tool that is at fault. I'll try to take a look.
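
For context, the last frames of the traceback show `md_file.read()` going through `encodings\cp1251.py`, which suggests the file was opened with the Windows locale encoding rather than UTF-8. Byte 0x98 is unassigned in cp1251, and it appears in the UTF-8 encoding of some Cyrillic letters (for example `И` is `D0 98`). The sketch below reproduces that failure on a tiny sample and shows that reading with an explicit `encoding="utf-8"` avoids it. It assumes the input file is UTF-8; `read_markdown` is a hypothetical helper written for illustration, not part of split_markdown4gpt.

```python
import tempfile
from pathlib import Path


def read_markdown(path, encoding="utf-8"):
    """Hypothetical helper: read a file with an explicit encoding
    instead of relying on the platform's locale default."""
    with open(path, encoding=encoding) as md_file:
        return md_file.read()


# A short Cyrillic sample; "И" encodes to the bytes D0 98 in UTF-8.
sample = "# Имя\n"
cp1251_error = None

with tempfile.TemporaryDirectory() as tmp:
    md_path = Path(tmp) / "sample.md"
    md_path.write_bytes(sample.encode("utf-8"))

    # Decoding the UTF-8 bytes as cp1251 fails on byte 0x98,
    # the same kind of error as in the traceback.
    try:
        md_path.read_text(encoding="cp1251")
    except UnicodeDecodeError as exc:
        cp1251_error = str(exc)

    # Reading with an explicit UTF-8 encoding round-trips the text.
    utf8_text = read_markdown(md_path)

print(cp1251_error)
print(utf8_text == sample)
```

Until the tool itself is changed, a possible workaround is to run Python in UTF-8 mode, either with `python -X utf8 -m split_markdown4gpt ...` or by setting the `PYTHONUTF8=1` environment variable, so that `open()` defaults to UTF-8. This only helps if the file really is UTF-8.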