Time to extract citations from S2ORC
MIracleyin opened this issue · 2 comments
Hi @malteos
I am using cli_s2orc.py to get the s2orc_full and s2orc_train_test files for citation graph embeddings.
I have a couple of questions.
First, I used gunzip to decompress all the ".jsonl.gz" files (they come from the s2orc repo), so I think replacing the ".jsonl.gz" glob with "*.jsonl" might be a better option.
Second, I find the function "worker_extract_citations" quite time-consuming; maybe tqdm could be added to it (a sketch of what I mean is below).
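Something like this is what I have in mind (a rough sketch, not the actual cli_s2orc.py code; the pool setup, worker body, and paths here are just assumptions):

```python
from glob import glob
from multiprocessing import Pool

from tqdm import tqdm

def worker_extract_citations(path):
    # placeholder for the existing per-file extraction logic
    return path

if __name__ == "__main__":
    input_files = sorted(glob("s2orc/*.jsonl.gz"))
    with Pool(processes=8) as pool:
        # imap yields results incrementally, so tqdm can show live
        # progress; pool.map would block silently until everything is done
        results = list(tqdm(pool.imap(worker_extract_citations, input_files),
                            total=len(input_files)))
```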
If you also think this is important, may I open a PR in this repo?
BTW, could you share how long it took you to run? I am not sure my setup is correct.
- Decompression with gunzip is not needed (smart_open takes care of this and does not add much delay); see the sketch below.
- Yes, it is indeed time-consuming. On my machine, it took ~5 hrs (60 cores, but I/O speed is probably more important).
Feel free to do a PR :)
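For reference, reading the compressed batches directly looks roughly like this (a minimal sketch; the file name is just an example):

```python
import json

from smart_open import open  # pip install smart_open

# smart_open infers the .gz codec from the file extension,
# so no manual gunzip step is needed
with open("s2orc/metadata_0.jsonl.gz", "r") as f:
    for line in f:
        record = json.loads(line)  # one S2ORC record per line
        # ... process the record here
```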
About 1:
I had already used gunzip to decompress. If I had used your code earlier, I could have saved some storage space, lol.
About 2:
It is also memory-consuming. My VM only has 50 GB of memory, and the script gets stuck for a long time without any feedback.
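In case it is useful for the PR: I suspect the issue is that all worker results are collected in RAM before anything is written out (just my guess, I have not profiled the actual script). A sketch of a variant that streams results to disk instead:

```python
import json
from glob import glob
from multiprocessing import Pool

from tqdm import tqdm

def worker_extract_citations(path):
    # placeholder for the real extraction; assume it returns citation edges
    return []

if __name__ == "__main__":
    input_files = sorted(glob("s2orc/*.jsonl.gz"))
    with Pool(processes=8) as pool, open("citations.jsonl", "w") as out:
        for edges in tqdm(pool.imap_unordered(worker_extract_citations, input_files),
                          total=len(input_files)):
            for edge in edges:
                out.write(json.dumps(edge) + "\n")
            # each batch is written and freed right away, so peak memory
            # stays around one batch instead of the whole corpus
```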
Thanks for your reply, have a nice day :)