attardi/wikiextractor

cannot serialize/pickle '_io.TextIOWrapper' object

kwon0408 opened this issue · 5 comments

The input file is https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2 .
The environment is as follows:

  • Windows 10 21H1 (build 19043.1165)
  • tested with two Python versions, both in Windows Terminal
    • Python 3.7.4 on PowerShell ("Env 1")
    • Python 3.9.5 on Anaconda PowerShell ("Env 2")
  • in both environments, the command line was python -m wikiextractor.WikiExtractor ..\assets\kowiki-latest-pages-articles.xml.bz2 -o ..\assets\kowiki-dump\

Output at Env 1:

PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> python -m wikiextractor.WikiExtractor ..\assets\kowiki-latest-pages-articles.xml.bz2 -o ..\assets\kowiki-dump\
INFO: Preprocessing '..\assets\kowiki-latest-pages-articles.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Loaded 56777 templates in 291.7s
INFO: Starting page extraction from ..\assets\kowiki-latest-pages-articles.xml.bz2.
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 621, in <module>
    main()
  File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 617, in main
    args.compress, args.processes)
  File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 357, in process_dump
    reduce.start()
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\spawn.py", line 99, in spawn_main
    new_handle = reduction.steal_handle(parent_pid, pipe_handle)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\reduction.py", line 87, in steal_handle
    _winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] 액세스가 거부되었습니다

Output at Env 2:

(DGAIS2021) PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> python -m wikiextractor.WikiExtractor ..\assets\kowiki-latest-pages-articles.xml.bz2 -o ..\assets\kowiki-dump\
INFO: Preprocessing '..\assets\kowiki-latest-pages-articles.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Loaded 56777 templates in 219.2s
INFO: Starting page extraction from ..\assets\kowiki-latest-pages-articles.xml.bz2.
Traceback (most recent call last):
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 621, in <module>
    main()
  File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 616, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 357, in process_dump
    reduce.start()
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
(DGAIS2021) PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\spawn.py", line 107, in spawn_main
    new_handle = reduction.duplicate(pipe_handle,
  File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\reduction.py", line 79, in duplicate
    return _winapi.DuplicateHandle(
PermissionError: [WinError 5] 액세스가 거부되었습니다

In both outputs the last PermissionError message reads "Access is denied."

None of the following helped:

  • using a PowerShell window with admin permissions
  • not setting an output folder
  • replacing the BZ2 with the only XML file extracted from it
  • running python setup.py install and trying again

The two outputs are almost identical. The most significant difference, I think, is the error message: TypeError: cannot serialize '_io.TextIOWrapper' object in Env 1 vs. TypeError: cannot pickle '_io.TextIOWrapper' object in Env 2.
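The two messages appear to describe the same failure: an open text stream cannot be pickled, and only the wording differs between Python versions. A minimal sketch that reproduces the error outside wikiextractor follows. The helper name pickle_error_for_text_stream is hypothetical, and the in-memory stream stands in for whatever file object the extractor passes to its child process.

```python
import io
import pickle


def pickle_error_for_text_stream():
    """Try to pickle an open text stream and return the error message.

    Returns None if pickling unexpectedly succeeds.
    """
    # An in-memory stand-in for an open text file (assumption: the object
    # that fails in the tracebacks is an ordinary open text stream).
    stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
    try:
        pickle.dumps(stream)
    except TypeError as exc:
        # The exact wording depends on the Python version.
        return str(exc)
    finally:
        stream.close()
    return None


if __name__ == "__main__":
    print(pickle_error_for_text_stream())
```

If this prints a TextIOWrapper error on your interpreter, the difference between "serialize" and "pickle" is most likely cosmetic rather than a sign of two separate bugs.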

Works on Linux

  • Ubuntu 20.04.2 LTS 64bit
  • python 3.8.5

I have encountered the same issue.

  • Windows 10 Home 20H2 (build 19042.1110)
  • Python 3.9.7 via scoop
  • command line was wikiextractor .\jawiki-20210901-pages-articles6.xml-p4307948p4444230.bz2
(base) PS > wikiextractor .\jawiki-20210901-pages-articles6.xml-p4307948p4444230.bz2
INFO: Preprocessing '.\jawiki-20210901-pages-articles6.xml-p4307948p4444230.bz2' to collect template definitions: this m
INFO: Loaded 3237 templates in 17.9s
Traceback (most recent call last):
  File "C:\Users\skytomo\scoop\apps\python\current\Scripts\wikiextractor-script.py", line 33, in <module>
    sys.exit(load_entry_point('wikiextractor==3.0.5', 'console_scripts', 'wikiextractor')())
  File "C:\Users\skytomo\scoop\apps\python\current\lib\site-packages\wikiextractor-3.0.5-py3.9.egg\wikiextractor\WikiExtractor.py", line 636, in main
  File "C:\Users\skytomo\scoop\apps\python\current\lib\site-packages\wikiextractor-3.0.5-py3.9.egg\wikiextractor\WikiExtractor.py", line 364, in process_dump
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
(base) PS D:\skytomo\Documents\何らかのディレクトリ> Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\spawn.py", line 107, in spawn_main
    new_handle = reduction.duplicate(pipe_handle,
  File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\reduction.py", line 79, in duplicate
    return _winapi.DuplicateHandle(
PermissionError: [WinError 5] アクセスが拒否されました。

Same error on a Mac with Python 3.8, so this is not a Windows-only issue.

Same error here too with macOS Big Sur 20G165 and Python 3.8.11

Works fine with macOS Big Sur 20G165 and Python 3.7.11
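The pattern across these reports (fails on Windows, fails on macOS with Python 3.8, works on macOS with Python 3.7 and on Linux) is consistent with the multiprocessing start method. Windows only supports "spawn", and macOS switched its default from "fork" to "spawn" in Python 3.8. With "spawn", the Process object and its arguments are pickled for the child, which is where the tracebacks above fail (reduction.dump). Below is a minimal sketch of choosing a start method explicitly. The helper pick_start_method is hypothetical and is not part of wikiextractor, and "fork" is unavailable on Windows and can be unsafe on macOS, so treat this as a possible workaround under those assumptions rather than a fix.

```python
import multiprocessing


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, else "spawn".

    "spawn" is available on every platform, so it is a safe fallback.
    """
    available = multiprocessing.get_all_start_methods()
    return preferred if preferred in available else "spawn"


if __name__ == "__main__":
    method = pick_start_method()
    # A context object scopes the choice to the processes created from it,
    # instead of changing the global default for the whole program.
    ctx = multiprocessing.get_context(method)
    print("start method:", method)
    # Processes would then be created with ctx.Process(...) rather than
    # multiprocessing.Process(...).
```

On Windows this falls back to "spawn", so the error would remain there unless the code stops passing open file objects to the child process.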