Web Crawler of USPTO PatFT Database

A crawler that fetches information about US patents and downloads their PDFs in batches.

Motivation

I have participated in a patent analysis project since April 2017. Our team needs to run specific queries on PatFT, check whether each resulting patent is relevant to our topic, and then analyze the relevant patents. I found that bulk patent data can only be downloaded by searching for certain words, names, or regions through Download patent data and PAIR Bulk Data on USPTO's Open Data Portal, which is not very useful for us, and the suitable tools I could find on the Internet all charge a fee. So I started writing Python scripts with basic functions, which sped up the project. To make the program more user friendly, I revised the code and built a UI with PyQt5.

Download Executable File

The source code has been packaged with PyInstaller for Windows:
1. Normal package
2. Single executable file

Instructions

You can follow the instructions below or watch this video. It should be easy to learn :).

Patent Fetcher

(1) Insert PN (2) Filtering conditions (3) Information to be fetched (4) PDF type to be downloaded (5) Table

  1. Insert the patent numbers (PNs) to be processed in one of the following ways:
    (a) Choose a CSV file with the PNs in the first column (example).
    (b) Search with a query (test the query on PatFT first). The resulting PNs are shown in the table.

  2. (Optional) Filter the listed PNs by patent type and by range of application date and issue date.
    The filtered-out PNs are still shown in the table but are deleted at the end of this process.

  3. Fetch the information of the patents shown in the table by web crawling.

  4. Download the full-text PDF, the drawing-section PDF, or both at once for the patents shown in the table.

  5. The table can be saved as a CSV file at any time.
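
Steps 1(a), 2, and 5 can be sketched in plain Python as follows. This is only an illustration and not the program's actual code: the `PatentRow` layout and the function names `read_patent_numbers`, `filter_rows`, and `save_rows` are assumptions made for the example, and the real table may hold different columns.

```python
import csv
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Set, Tuple

DateRange = Tuple[Optional[date], Optional[date]]


def read_patent_numbers(csv_path: str) -> List[str]:
    """Return the non-empty values found in the first column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row[0].strip() for row in csv.reader(handle)
                if row and row[0].strip()]


@dataclass
class PatentRow:
    # Hypothetical record layout, chosen only for this sketch.
    pn: str
    patent_type: str
    application_date: date
    issue_date: date


def in_range(value: date, bounds: DateRange) -> bool:
    """True if value lies inside the inclusive range; None means unbounded."""
    start, end = bounds
    return (start is None or value >= start) and (end is None or value <= end)


def filter_rows(rows: Iterable[PatentRow],
                allowed_types: Optional[Set[str]] = None,
                application_range: DateRange = (None, None),
                issue_range: DateRange = (None, None)) -> List[PatentRow]:
    """Keep rows that match the patent types and both date ranges."""
    return [row for row in rows
            if (allowed_types is None or row.patent_type in allowed_types)
            and in_range(row.application_date, application_range)
            and in_range(row.issue_date, issue_range)]


def save_rows(rows: Iterable[PatentRow], csv_path: str) -> None:
    """Write the rows to a CSV file with the PN in the first column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow([row.pn, row.patent_type,
                             row.application_date.isoformat(),
                             row.issue_date.isoformat()])
```

Note that `read_patent_numbers` returns every non-empty first-column value, so a header row, if the file has one, would need to be skipped by the caller.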

Browser

On the second page, you can enter a PN to show the PatFT web page of that patent or open its PDF in your default browser.
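
Opening a page in the default browser can be done with Python's standard `webbrowser` module. The sketch below is an assumption about how such a feature might look, not the program's own code: `URL_TEMPLATE` is a placeholder address, and `build_patent_url` and `open_patent` are hypothetical helper names.

```python
import webbrowser
from urllib.parse import quote

# Placeholder address (an assumption); replace it with the real URL pattern
# of the patent page or PDF you want to open.
URL_TEMPLATE = "https://example.com/patents/{pn}"


def build_patent_url(pn: str, template: str = URL_TEMPLATE) -> str:
    """Insert the trimmed, URL-quoted patent number into the template."""
    return template.format(pn=quote(pn.strip()))


def open_patent(pn: str, template: str = URL_TEMPLATE) -> bool:
    """Open the patent URL in the default browser; returns False on failure."""
    return webbrowser.open(build_patent_url(pn, template))
```

Calling `open_patent("9876543")` would hand the built URL to whichever browser the operating system has registered as the default.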

Caution

  1. The program has some problems fetching information for patents issued before 1976. I am still working on it.
  2. Searching with a long query takes a lot of time, just as it does on PatFT (example). I tried threading in the program, but it made the search slower, and multiprocessing caused connection problems. If you have a long query with fewer than 500 results, copying the patent numbers into a CSV file yourself and loading that file should be faster.
  3. If you encounter any problems or have any suggestions (such as adding other functions), feel free to contact me!
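
For the workaround in caution 2, a short script can turn copied patent numbers into a CSV file with the PNs in the first column. This is a minimal sketch that assumes one patent number per line of copied text; the function names `normalize_patent_numbers` and `write_pn_csv` are hypothetical and not part of the program.

```python
import csv
from typing import Iterable, List


def normalize_patent_numbers(lines: Iterable[str]) -> List[str]:
    """Strip whitespace and comma separators, dropping blanks and duplicates."""
    seen = set()
    result = []
    for line in lines:
        pn = line.replace(",", "").strip().upper()
        if pn and pn not in seen:
            seen.add(pn)
            result.append(pn)
    return result


def write_pn_csv(pns: Iterable[str], csv_path: str) -> None:
    """Write one patent number per row, in the first column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([pn] for pn in pns)
```

For example, `write_pn_csv(normalize_patent_numbers(copied_text.splitlines()), "pns.csv")` would produce a file that can be loaded through option 1(a), where `copied_text` holds the pasted numbers.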