malteos/scincl

Time needed to extract citations from S2ORC

MIracleyin opened this issue · 2 comments

Hi @malteos

I am using cli_s2orc.py to generate the s2orc_full and s2orc_train_test files for citation graph embeddings.

I have a few questions.

First, I used gunzip to decompress all the ".jsonl.gz" files (this comes from the S2ORC repo). So I think replacing ".jsonl.gz" with "*.jsonl" might be a better option.

Second, I find that the function "worker_extract_citations" is time-consuming; maybe a tqdm progress bar could be added to it.
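A minimal sketch of what such progress reporting could look like, assuming the per-batch work is dispatched through a pool. `run_with_progress` and the stand-in worker are hypothetical names, not code from this repo, and a thread pool is used only to keep the example self-contained:

```python
import sys
from multiprocessing.pool import ThreadPool


def run_with_progress(worker, tasks, workers=4, out=sys.stderr):
    """Run `worker` over `tasks` in a pool and report each finished task.

    Hypothetical helper: `worker` stands in for a per-batch function such as
    worker_extract_citations, whose real signature is not assumed here.
    """
    tasks = list(tasks)  # materialize so the total is known up front
    results = []
    with ThreadPool(workers) as pool:
        # imap_unordered yields each result as soon as its task completes,
        # so progress can be reported without waiting for the whole job.
        for done, result in enumerate(pool.imap_unordered(worker, tasks), start=1):
            results.append(result)
            print(f"{done}/{len(tasks)} batches done", file=out, flush=True)
    return results


if __name__ == "__main__":
    # Stand-in worker: squares its input instead of extracting citations.
    squares = run_with_progress(lambda batch: batch * batch, range(10))
    print(sorted(squares))
```

With tqdm installed, wrapping the iterator as `tqdm(pool.imap_unordered(worker, tasks), total=len(tasks))` would show a progress bar in place of the print calls. For CPU-bound work, `multiprocessing.Pool` exposes the same `imap_unordered` method, but the worker must then be a picklable top-level function.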

If you also think this is important, may I open a PR for this repo?

BTW, could you give me some details about how long it took you to run it? I am not sure the program is running correctly.

  1. Decompression with gunzip is not needed (smart_open takes care of this and does not add much delay)

  2. Yes, it is indeed time-consuming. On my machine, it took ~5hrs (60 cores but I/O speed is probably more important).

Feel free to do a PR :)
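To illustrate point 1: smart_open's `open` chooses the decompressor from the file extension, so one reading loop can cover both ".jsonl.gz" and ".jsonl" files. The sketch below imitates that behaviour with the standard library only; `open_maybe_gzip` and `iter_jsonl` are hypothetical helpers, not functions from cli_s2orc.py:

```python
import gzip
import json
from pathlib import Path


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if it ends in .gz.

    Standard-library stand-in for the extension-based handling that
    smart_open provides; hypothetical helper, not part of the repo.
    """
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line, reading lazily."""
    with open_maybe_gzip(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the file is read line by line, the compressed shards never need to be unpacked on disk.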

About 1:
I had already used gunzip to decompress the files. If I had used your code earlier, I could have saved storage space, lol.

About 2:
It is also memory-consuming. My VM only has 50 GB of memory, and it has been stuck for a long time without any feedback.
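If memory is the limiting factor, one possible workaround is to write citation pairs to disk as they are found rather than collecting them in memory first. A rough sketch, assuming each metadata record carries `paper_id` and `outbound_citations` fields (an assumption about the S2ORC metadata schema, worth checking against your release); `write_citation_pairs` is a hypothetical helper and not a description of how cli_s2orc.py works:

```python
import csv


def write_citation_pairs(records, out_path):
    """Stream (citing, cited) pairs to a CSV file and return the pair count.

    `records` can be any iterable of dicts, including a generator, so only
    one record needs to be held in memory at a time.
    Field names are assumptions about the metadata schema.
    """
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for rec in records:
            # Treat a missing or null citation list as empty.
            for cited_id in rec.get("outbound_citations") or []:
                writer.writerow([rec["paper_id"], cited_id])
                count += 1
    return count
```

Each worker could write to its own output file in this way, with the files concatenated afterwards.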

Thanks for your reply, have a nice day :)