attardi/wikiextractor

Is Windows 10 supported?

rgryta commented

I have noticed that there's an issue with multiprocessing when using Windows. I've patched that up by switching from multiprocessing to multithreading. This makes it SIGNIFICANTLY slower when using CPUs with many cores (~25 times slower on my 3900X) but at least it works.
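For readers wondering what a switch from processes to threads looks like in general, here is a minimal sketch. It is not the actual patch from the pull request: `extract_page` and `extract_all` are hypothetical names, and the per-page work is a placeholder. One relevant fact is that Windows starts worker processes with the "spawn" method, which re-imports the main module in every child; whether that is the exact cause of the problem here is not confirmed in this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page):
    # Hypothetical stand-in for the per-page work; not wikiextractor's real function.
    return page.strip().upper()


def extract_all(pages, workers=4):
    # Threads live in one interpreter, so nothing has to be pickled or
    # re-imported in a child process. The trade-off is that CPU-bound work
    # is serialized by the GIL, which matches the slowdown described above.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page, pages))


if __name__ == "__main__":
    print(extract_all([" alpha ", " beta "]))
```

`pool.map` keeps the input order, so the results line up with the pages that were passed in.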

I've added a Pull Request. In the meantime you can use my fork by adding: git+https://github.com/rgryta/wikiextractor.git@master
to the requirements.txt instead of just wikiextractor.

Thank you. I don't know how to add that.
Could you just send the updated zip file here?

rgryta commented

Wikiextractor project zip file?
You can get it from git: https://github.com/rgryta/wikiextractor/archive/refs/heads/master.zip

If you're using pip then I'd recommend using that though: pip install git+https://github.com/rgryta/wikiextractor.git@master

wikiextractor-master>python setup.py
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help

error: no commands supplied

python L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
Traceback (most recent call last):
File "L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py", line 67, in <module>
from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

rgryta commented

Are you using python 3.6 or higher?

Seems like you're located under some weird path and sys.path is not set properly, causing import errors.
Please ensure that python --version returns python version 3.6 (or better).

Then ensure you have pip installed -> python -m pip --version. If it's an older version (let's say lower than 60.0.0), update pip with: python -m pip install --upgrade pip.
If you don't have pip installed, then download get-pip.py from https://bootstrap.pypa.io/get-pip.py and execute: python get-pip.py
Once you have pip installed, execute python -m pip install git+https://github.com/rgryta/wikiextractor.git@master.

If this won't work then unfortunately I probably won't be able to help.
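The checks above can also be run in one go. This is only a sketch: `check_environment` is a hypothetical helper, and the git check is an addition on my side, based on the fact that pip needs the git executable on PATH to install from a `git+https://` URL.

```python
import importlib.util
import shutil
import sys


def check_environment(min_version=(3, 6)):
    """Report the prerequisites for `pip install git+https://...`."""
    return {
        # Same check as `python --version`, but comparable in code
        "python_ok": sys.version_info >= min_version,
        # True when `python -m pip` would find the pip package
        "pip_installed": importlib.util.find_spec("pip") is not None,
        # pip shells out to git when installing from a git+https URL
        "git_on_path": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

Run it with the same `python` you use for `python -m pip`, otherwise the answers may describe a different installation.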

python --version
Python 3.8.8

python -m pip install -upgrade pip

Usage:
D:\Python3.8.8\python.exe -m pip install [options] <requirement specifier> [package-index-options] ...
D:\Python3.8.8\python.exe -m pip install [options] -r <requirements file> [package-index-options] ...
D:\Python3.8.8\python.exe -m pip install [options] [-e] <vcs project url> ...
D:\Python3.8.8\python.exe -m pip install [options] [-e] <local project path> ...
D:\Python3.8.8\python.exe -m pip install [options] <archive url/path> ...

no such option: -u

L:\data\wikiextractor-master>python -m pip install git+https://github.com/rgryta/wikiextractor.git@master
Collecting git+https://github.com/rgryta/wikiextractor.git@master
Cloning https://github.com/rgryta/wikiextractor.git (to revision master) to c:\users\ni\appdata\local\temp\pip-req-build-z7_jhu8r
error: subprocess-exited-with-error

× git version did not run successfully.
│ exit code: 1
╰─> [2 lines of output]
'git' is not recognized as an internal or external command,
operable program or batch file.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git version did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

L:\data\wikiextractor-master>

rgryta commented

Should be --upgrade, sorry - copied command from somewhere and it truncated double hyphen to something weird.

rgryta commented

From the looks of it, you may also need to install the Git command-line client. There's a compiled installer: https://git-scm.com/download/win
You can of course use a different distribution; that's just the first one I found quickly on Google.

python -m pip install --upgrade pip
Requirement already satisfied: pip in d:\python3.8.8\lib\site-packages (23.1.2)

rgryta commented

In that case you're probably missing git command. You can get it from the link above (git-scm.com). After that just use python -m pip install git+https://github.com/rgryta/wikiextractor.git@master.

Unless the error you've gotten is about something else. Google Translator unfortunately struggled a bit with proper translation.

rgryta commented

You can try another way.

Open the directory with extracted zip file that I sent at first. From the root directory of the project (so where the setup.py is located) execute: pip install .
This SHOULD do the trick.

If it won't work then the pip install git+... is the way to go, but it looks like you have connection issues. Maybe VPN?
See this thread for potential solutions: https://stackoverflow.com/questions/71571965/openssl-ssl-connect-connection-was-reset-in-connection-to-github-com443-while

pip install setup.py
ERROR: Could not find a version that satisfies the requirement setup.py (from versions: none)
ERROR: No matching distribution found for setup.py

rgryta commented

No, not pip install setup.py
Just pip install . with a dot at the end.

wikiextractor-master>pip install .
Processing l:\data\wikiextractor-master
Preparing metadata (setup.py) ... done
ERROR: No .egg-info directory found in C:\Users\Ni\AppData\Local\Temp\pip-pip-egg-info-b84l6p21

This "No .egg-info directory found" error is a common one. I don't know what causes it. I have been seeing it often recently and don't know how to solve it.

rgryta commented

Probably needs setuptools update. Execute pip install --upgrade setuptools
And then retry the pip install .

pip install --upgrade setuptools
Requirement already satisfied: setuptools in d:\python3.8.8\lib\site-packages (67.8.0)

Which packages does that command install?
Maybe I can copy them into the site-packages folder directly.

rgryta commented

I read that uninstalling setuptools may also fix it: pip uninstall setuptools (after installing wikiextractor you should probably reinstall it, though).

As for packages: wikiextractor has no dependencies, so unfortunately there is nothing to copy. There's something odd about your Python configuration. Another package that you could try installing or uninstalling is wheel: pip install wheel. You may have to try different combinations of those two packages (wheel and setuptools both installed, both uninstalled, or just one installed). It's hard to say what's wrong.
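To keep track of which combination is currently in place, a small script can list the installed versions of the packaging tools. This is a sketch only: `packaging_versions` is a hypothetical helper, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8.

```python
from importlib import metadata  # standard library from Python 3.8


def packaging_versions(names=("pip", "setuptools", "wheel")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed for this interpreter
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in packaging_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it before and after each install or uninstall step makes it easier to report exactly which setup produced which error.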

It's 1am here so I'll be leaving at that for now. If you'll still have some problem with it then you can write me in a few hours. Though I think a few queries to Google/Stack Overflow should suffice to fix it.

Good luck!

It installed, but it still fails when I run the code.

pip uninstall setuptools
Found existing installation: setuptools 67.8.0
Uninstalling setuptools-67.8.0:
Would remove:
d:\python3.8.8\lib\site-packages\_distutils_hack\*
d:\python3.8.8\lib\site-packages\distutils-precedence.pth
d:\python3.8.8\lib\site-packages\pkg_resources\*
d:\python3.8.8\lib\site-packages\setuptools-67.8.0.dist-info\*
d:\python3.8.8\lib\site-packages\setuptools\*
Proceed (Y/n)? y
Successfully uninstalled setuptools-67.8.0

L:\data\wikiextractor-master>pip install .
Processing l:\data\wikiextractor-master
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: wikiextractor
Building wheel for wikiextractor (pyproject.toml) ... done
Created wheel for wikiextractor: filename=wikiextractor-3.0.7-py3-none-any.whl size=47887 sha256=89c3060b72af9867ae877b249a60d3ba7fa00f6b194c7b8c467561b07a30c948
Stored in directory: c:\users\ni\appdata\local\pip\cache\wheels\ac\88\3b\0022eef871f6d21b6e24acdd2b6ca634c7b3fb274c1c5c6533
Successfully built wikiextractor
Installing collected packages: wikiextractor
Attempting uninstall: wikiextractor
Found existing installation: wikiextractor 3.0.6
Uninstalling wikiextractor-3.0.6:
Successfully uninstalled wikiextractor-3.0.

python L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
Traceback (most recent call last):
File "L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py", line 67, in <module>
from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

pip list
wikiextractor 3.0.7
wikipedia 1.4.0

C:\Users\Ni>python wikiextractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
python: can't open file 'wikiextractor': [Errno 2] No such file or directory

System path (screenshot)

rgryta commented

Don't use it like that. You're providing the full path to the WikiExtractor file, which is a submodule of wikiextractor (so basically "wikiextractor.WikiExtractor"). Relative imports will be broken when you run it like this.

Use syntax provided in README.md: python -m wikiextractor.WikiExtractor <Wikipedia dump file>
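The difference between the two ways of launching can be reproduced with a throwaway package. This is only an illustration: `demo_pkg`, `helper.py`, `main.py` and `run_both_ways` are hypothetical names and have nothing to do with wikiextractor's real layout.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Build a tiny package, then run one module by file path and with -m."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helper.py").write_text("VALUE = 42\n")
        # main.py uses a relative import, like WikiExtractor.py does
        (pkg / "main.py").write_text("from .helper import VALUE\nprint(VALUE)\n")

        # Run by file path: the module has no parent package, so "." cannot resolve
        by_path = subprocess.run(
            [sys.executable, str(pkg / "main.py")],
            capture_output=True, text=True, cwd=tmp,
        )
        # Run with -m: Python imports demo_pkg first, so "." means demo_pkg
        by_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.main"],
            capture_output=True, text=True, cwd=tmp,
        )
        return by_path, by_module


if __name__ == "__main__":
    by_path, by_module = run_both_ways()
    print("by path:", by_path.returncode, by_path.stderr.strip().splitlines()[-1])
    print("with -m:", by_module.returncode, by_module.stdout.strip())
```

The run by path fails with the same "attempted relative import with no known parent package" message as in the traceback above, while the `-m` run succeeds.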

rgryta commented

As for why python wikiextractor doesn't work -> wikiextractor project is not launchable through __main__.py (for some reason, it just is this way). Syntax from comment above should work.

python -m wikiextractor.WikiExtractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2

This seems to be working now; waiting for the result.

python -m wikiextractor.WikiExtractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
INFO: Preprocessing 'L:\data\kowiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages

rgryta commented

I have not used --json option so I have no idea if it'll work. Fingers crossed that it does. Good luck. I gotta get some sleep.

Thank you. Good night.

I can now extract the results the same way as on Linux.
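For anyone who wants to read the --json output afterwards, here is a minimal sketch. It rests on assumptions that are not confirmed in this thread: that the output directory contains subfolders with files named like wiki_00, that each non-empty line in those files is one JSON object, and that the objects carry keys such as "title" and "text". `iter_articles` is a hypothetical helper name.

```python
import json
from pathlib import Path


def iter_articles(output_dir):
    """Yield one dict per extracted article found under output_dir."""
    # Assumption: files are named wiki_NN somewhere below output_dir
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)


if __name__ == "__main__":
    out = Path("L:/data/testko")  # the -o directory used in the command above
    if out.is_dir():
        for article in iter_articles(out):
            # "title" is an assumed key; .get avoids a KeyError if it is absent
            print(article.get("title"))
            break
```

Because it is a generator, only one line is held in memory at a time, which matters for a full Wikipedia dump.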