yoheikikuta/bert-japanese

WikiExtractor.py - No such file or directory

Tomas0413 opened this issue · 6 comments

@yoheikikuta thanks for creating this repository and sharing the instructions on how to train BERT with Japanese wiki data! I'm trying to reproduce everything from scratch, but I can't find the WikiExtractor.py file.

python3 src/data-download-and-extract.py

100.0% 2906087424 / 2906079739
python3: can't open file '/data/bert-japanese/src/../../wikiextractor/WikiExtractor.py': [Errno 2] No such file or directory

Thank you very much!

I'm going to try a WikiExtractor.py from here:

https://github.com/attardi/wikiextractor

Maybe it's what you used?
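For anyone running into the same error outside of Docker, a quick pre-flight check can confirm whether wikiextractor is where the download script expects it. This is only a sketch: the relative path below is taken from the error message above and may differ in your layout.

```shell
# Check for WikiExtractor.py at the path the download script expects
# (path derived from the error message; adjust for your checkout layout).
EXTRACTOR="../wikiextractor/WikiExtractor.py"
if [ -f "$EXTRACTOR" ]; then
    status="found"
else
    # clone https://github.com/attardi/wikiextractor next to the repo
    status="missing"
fi
echo "WikiExtractor.py: $status"
```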

@Tomas0413
Yes, this repository uses the wikiextractor you mentioned.
It is installed during the docker build.
Did you run the script inside the docker container?

Hi, @yoheikikuta, thanks for the response!

I should have looked at the Dockerfile, indeed. Anyway, I was able to download and extract the data, so this step worked fine. Now I'll move on to SentencePiece. I initially tried running it in a small VM with 8 GB of RAM and ran out of memory, so I'll switch to what you used: n1-standard-8.

@Tomas0413

n1-standard-8 (8 CPUs, 30 GB memory)

You need to increase the instance's memory to avoid the out-of-memory error.
I'm sure 30 GB is enough.

Yep, SentencePiece on 30 GB of RAM finished without problems.

I'll close this issue.
Please open another issue if you have any other problems.