Is a text transformation step needed?

Question

Closed this issue 3 years ago · 1 comment

In the README it states:

NOTE: If you would like to run our code on your own datasets, there is no need to represent each paper/author/word as a number. Just make sure that (1) each paper/venue/author/word name does not have whitespace inside

I noticed that in vocabulary.txt the words are all lowercase, and many "words" are actually multiple words separated by underscores.

If I'm starting with titles and abstracts that include capitalization and punctuation, do I need to transform them in some way before putting them into the "text" field in the .json file?

Answer 1 · 2021-06-09T21:06:01.000Z

Thank you for your interest in our work.

Our code can take raw text (without any preprocessing) as input. However, it would be better to tokenize the text, remove all punctuation, and convert all characters to lowercase. Phrase discovery (i.e., "multiple words separated by underscores") is not necessary.
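
For anyone landing here later, the recommended cleanup can be sketched roughly as below. This is only an illustration, not the authors' own preprocessing script: the function name `preprocess` is hypothetical, and splitting on whitespace is an assumed stand-in for whatever tokenizer you prefer.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase the text, drop punctuation, and split on whitespace.

    Hypothetical helper: the repository may use a different tokenizer.
    """
    lowered = text.lower()
    # Replace anything that is not a word character or whitespace with a space.
    # Underscores count as word characters, so existing phrase tokens survive.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Splitting on whitespace acts as a very simple tokenizer.
    tokens = no_punct.split()
    return " ".join(tokens)


print(preprocess("Graph Neural Networks: A Review of Methods, and Applications!"))
# prints: graph neural networks a review of methods and applications
```

The returned string is one candidate for the "text" field. Note that hyphenated words are split into separate tokens here (for example, "pre-trained" becomes "pre trained"); adjust the pattern if you would rather keep them together.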