This repository contains the dataset from "The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications", which was accepted to the NeurIPS 2023 Datasets and Benchmarks Track.
N.B. We will be updating our GitHub repository and website shortly.
- Overview of HUPD
- Usage: Loading the Dataset
- Downloading the Dataset
- Data Fields and Data Format
- Google Colab
- Experiments and Tasks
- Citation
- Licensing and Contact
The Harvard USPTO Dataset (HUPD) is a large-scale, well-structured, and multi-purpose corpus of English-language utility patent applications filed to the United States Patent and Trademark Office (USPTO) between January 2004 and December 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable patent datasets. Unlike previously proposed patent datasets in NLP, it contains the inventor-submitted versions of patent applications, not the final versions of granted patents, allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates.
As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community, namely patent acceptance prediction. We additionally show the structured metadata provided in the dataset allows us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how our dataset can be used for three additional tasks: Multi-class classification of patent subject areas, language modeling, and summarization. Overall, HUPD is one of the largest multi-purpose NLP datasets containing domain-specific textual data, along with well-structured bibliographic metadata, and aims to advance research extending language and classification models to diverse and dynamic real-world data distributions.
The following command can be used to load the sample version of the dataset, which contains all the patent applications filed to the USPTO in January 2016. This small subset of the dataset is useful for debugging and exploration.
```python
from datasets import load_dataset

dataset_dict = load_dataset('HUPD/hupd',
    name='sample',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
    ipcr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)
```
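Once loaded, the result is a standard Hugging Face `DatasetDict`, so the usual indexing applies. A quick sanity check (field names follow the schema listed under "Data Fields and Data Format" below):

```python
# Inspect the splits and peek at a single application.
print(dataset_dict)                # 'train' and 'validation' splits
sample = dataset_dict['train'][0]  # one patent application as a dict
print(sample['title'])
print(sample['decision'])
print(sample['abstract'][:300])    # first 300 characters of the abstract
```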
If you would like to use the full version of the dataset, please make sure to change the `name` field from `sample` to `all`, specify the training and validation start and end dates carefully, and set `force_extract` to `True` (so that you only untar the files you are interested in and do not squander your disk storage space). In the following example, we set the training set year range to [2011, 2016] (inclusive) and the validation set year to 2017.
```python
from datasets import load_dataset

dataset_dict = load_dataset('HUPD/hupd',
    name='all',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
    ipcr_label=None,
    force_extract=True,
    train_filing_start_date='2011-01-01',
    train_filing_end_date='2016-12-31',
    val_filing_start_date='2017-01-01',
    val_filing_end_date='2017-12-31',
)
```
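The loader also accepts an `ipcr_label` argument (set to `None` above) to restrict the subset to a single primary classification code. A sketch, assuming the loading script filters on the primary IPC code; `'G06F'` (computing) is an illustrative value:

```python
# Load only applications whose primary IPC code is 'G06F'
# (illustrative; see the dataset's loading script for the exact filter semantics).
dataset_dict = load_dataset('HUPD/hupd',
    name='all',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
    ipcr_label='G06F',
    force_extract=True,
    train_filing_start_date='2011-01-01',
    train_filing_end_date='2016-12-31',
    val_filing_start_date='2017-01-01',
    val_filing_end_date='2017-12-31',
)
```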
HUPD can be easily accessed through Hugging Face Datasets. To download the raw patent application files in HUPD, please go to this link, uncompress the `all-years.tar` file, and then further uncompress the resulting `[year].tar` files that are of interest to you.
HUPD is also available on Google Drive. This Google Drive folder contains four large tarred files and one large feather file; more than 360 GB of disk space is needed to download and store all the individual files.
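If you prefer to script the decompression step, Python's standard `tarfile` module suffices. A minimal sketch (all paths are illustrative; adjust them to wherever you saved the download):

```python
import tarfile

# Extract the top-level archive, then only the per-year archives you need.
with tarfile.open("all-years.tar") as tar:
    tar.extractall("hupd-raw")

with tarfile.open("hupd-raw/2016.tar") as tar:
    tar.extractall("hupd-raw/2016")
```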
Each patent application is defined by a distinct JSON file, named after its application number, and includes information about the application and publication numbers, title, decision status, filing and publication dates, primary and secondary classification codes, inventor(s), examiner, attorney, abstract, claims, background, summary, and full description of the proposed invention, among other fields. There are also supplementary variables, such as the small-entity indicator (which denotes whether the applicant is considered to be a small entity by the USPTO) and the foreign-filing indicator (which denotes whether the application was originally filed in a foreign country).
- In total, there are 34 data fields for each application:
```json
{
    "application_number": "...",
    "publication_number": "...",
    "title": "...",
    "decision": "...",
    "date_produced": "...",
    "date_published": "...",
    "main_cpc_label": "...",
    "cpc_labels": ["...", "...", "..."],
    "main_ipcr_label": "...",
    "ipcr_labels": ["...", "...", "..."],
    "patent_number": "...",
    "filing_date": "...",
    "patent_issue_date": "...",
    "abandon_date": "...",
    "uspc_class": "...",
    "uspc_subclass": "...",
    "examiner_id": "...",
    "examiner_name_last": "...",
    "examiner_name_first": "...",
    "examiner_name_middle": "...",
    "inventor_list": [
        {
            "inventor_name_last": "...",
            "inventor_name_first": "...",
            "inventor_city": "...",
            "inventor_state": "...",
            "inventor_country": "..."
        }
    ],
    "abstract": "...",
    "claims": "...",
    "background": "...",
    "summary": "...",
    "full_description": "..."
}
```
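Because every application is a standalone JSON file named after its application number, the raw files can be read without any special tooling. A minimal sketch (the path and application number below are illustrative):

```python
import json

# Read one application from the raw download (illustrative path/filename).
with open("hupd-raw/2016/13261748.json") as f:
    application = json.load(f)

print(application["title"])
print(application["decision"])
print(application["main_ipcr_label"])
```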
You can also use the following Google Colab notebooks to explore HUPD.
- HUPD Examples: Loading the Dataset
- HUPD Examples: Loading HUPD By Using HuggingFace's Libraries
- HUPD Examples: Using the HUPD DistilRoBERTa Model
- HUPD Examples: Using the HUPD T5-Small Summarization Model
Let us first provide a brief overview of each task we consider in our paper:
- Patent Acceptance Prediction: Given a section of a patent application (in particular, the abstract, claims, or description), we predict whether the application will be accepted by the USPTO (a label-preparation sketch follows this list).
- Automated Subject (IPC/CPC) Classification: We predict the primary IPC or CPC code of a patent application given (some subset of) the text of the application.
- Language Modeling: We perform masked language modeling on the claims and description sections of patent applications.
- Abstractive Summarization: Each patent contains an abstract section in which the applicant summarizes the content of the patent. We use this section as the ground truth for our abstractive summarization task, and we use either the claims section or the description section as the source text.
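As a concrete starting point for patent acceptance prediction, the `decision` field can be mapped to binary labels. A minimal sketch using the `dataset_dict` loaded earlier, assuming `decision` takes values such as 'ACCEPTED' and 'REJECTED' (applications with other statuses, e.g. still-pending ones, are dropped here):

```python
# Keep only applications with a final decision and map them to binary labels
# (assumes the 'decision' field uses the values below).
label_map = {'REJECTED': 0, 'ACCEPTED': 1}

train = dataset_dict['train'].filter(lambda x: x['decision'] in label_map)
train = train.map(lambda x: {'label': label_map[x['decision']]})
```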
HUPD DistilRoBERTa-Base was fine-tuned on HUPD with a masked language modeling objective. You can use this model directly with the Hugging Face pipeline as follows:
```python
from transformers import pipeline

model = pipeline(task="fill-mask", model="turingmachine/hupd-distilroberta-base")
model("Improved <mask> for playing a game of thumb wrestling.")
```
Here is the output:
```python
[{'score': 0.4274042248725891,
  'sequence': 'Improved method for playing a game of thumb wrestling.',
  'token': 5448,
  'token_str': ' method'},
 {'score': 0.06967400759458542,
  'sequence': 'Improved system for playing a game of thumb wrestling.',
  'token': 467,
  'token_str': ' system'},
 {'score': 0.06849079579114914,
  'sequence': 'Improved device for playing a game of thumb wrestling.',
  'token': 2187,
  'token_str': ' device'},
 {'score': 0.04544765502214432,
  'sequence': 'Improved apparatus for playing a game of thumb wrestling.',
  'token': 26529,
  'token_str': ' apparatus'},
 {'score': 0.025765646249055862,
  'sequence': 'Improved means for playing a game of thumb wrestling.',
  'token': 839,
  'token_str': ' means'}]
```
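The pipeline returns its top five candidates by default; in recent versions of transformers you can request more by passing `top_k`:

```python
model("Improved <mask> for playing a game of thumb wrestling.", top_k=10)
```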
HUPD T5-Small was fine-tuned on the claims (text) and abstract (summary) sections of HUPD. You can use this model directly with the Hugging Face pipeline as follows:
```python
from transformers import pipeline

TEXT = "1. An optical coherent receiver for an optical communication network, said optical coherent receiver being configured to receive a modulated optical signal and to process said modulated optical signal for generating an in-phase component and a quadrature component, said in-phase component and said quadrature component being electrical signals, said optical coherent receiver comprising a power adjuster in turn comprising: a multiplying unit configured to multiply said in-phase component by an in-phase gain thereby providing a power-adjusted in-phase component, and to multiply said quadrature component by a quadrature gain thereby providing a power-adjusted quadrature component; and a digital circuit connected between output and input of said multiplying unit and configured to compute: a common gain indicative of a sum of a power of said power-adjusted in-phase component and a power of said power-adjusted quadrature component, and a differential gain indicative of a difference between said power of said power-adjusted in-phase component and said power of said power-adjusted quadrature component; and said in-phase gain as a product between said common gain and said differential gain, and said quadrature gain as a ratio between said common gain and said differential gain. 2. An optical coherent receiver according to claim 1, wherein it further comprises an analog-to-digital unit connected at the input of said power adjuster, said analog-to-digital unit being configured to ..."

summarizer = pipeline(task="summarization", model="turingmachine/hupd-t5-small")
summarizer(TEXT)
```
Here is the output:
```python
[{'summary_text': 'An optical coherent receiver for an optical communication network includes a power adjuster and a digital circuit connected between output and input of the multiplying unit and configured to compute a common gain indicative of a sum of the power of an in-phase component and the power-adjusted quadrature component, and the differential gain as a product between the common gain and the diffractive gain.'}]
```
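The model can also be driven without the pipeline abstraction, which gives finer control over truncation and decoding. A minimal sketch (the generation parameters below are illustrative, not the settings used in the paper):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("turingmachine/hupd-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("turingmachine/hupd-t5-small")

# Claims sections are long, so truncate to the model's input limit.
inputs = tokenizer(TEXT, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```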
The model weights can also be downloaded from this Google Drive link.
If your research makes use of our dataset, models, or results, please consider citing our paper.
```bibtex
@inproceedings{suzgun2023the,
    title={The Harvard {USPTO} Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications},
    author={Mirac Suzgun and Luke Melas-Kyriazi and Suproteem K Sarkar and Scott Kominers and Stuart Shieber},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2023},
    url={https://openreview.net/forum?id=tk27oD2cBw}
}
```
HUPD is released under the Creative Commons Attribution 4.0 International License. If you have any questions, comments, or suggestions, please feel free to reach out to msuzgun@stanford.edu.