This code is part of the article: "Natural language processing to identify the creation and impact of new technologies in patent text: code, data, and new measures".

Data is available from https://zenodo.org/record/3515985 (DOI: 10.5281/zenodo.3515985).

If you use the code or data, please cite the following paper:

Arts S, Hou J, Gomez JC. (2020). Natural language processing to identify the creation and impact of new technologies in patent text: code, data, and new measures. Forthcoming Research Policy. (https://doi.org/10.1016/j.respol.2020.104144)

To use the code to replicate the results in the paper, the following steps need to be followed:

Create an auxiliary data directory with the original raw data files as input: claim_full_till2018.csv and patent_title_abstract_till_2018.csv
Create a data directory to store the outputs from the code
Run the step_01_concatenate_patents.py code to concatenate the raw files
There are 5 metrics to compute with the code: new_word, new_word_comb, new_bigram, new_trigram and backward_cosine. For each, there is a folder with the required steps to compute it. The metrics new_bigram and new_trigram use the code in the new_ngram folder.
Inside each code folder the requiered steps to compute a metric are numbered sequentially, and they must be run in that order.
Inside the data directory create a folder for each metric to store the corresponding output files.

gpucce/respol_patents_code

This code is part of the article: "Natural language processing to identify the creation and impact of new technologies in patent text: code, data, and new measures".