/pagination-prediction-pytorch

A deep learning model for detecting pagination links in a web page

Pagination prediction

Introduction

In this project we have developed a deep learning model that predicts the pagination links on a web page.
The model classifies each <a> and <button> element in a web page into one of the following categories:

  • PREV - previous page link
  • PAGE - a link to a specific page
  • NEXT - next page link
  • OTHER - for elements that are not a pagination link
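
To make the classification targets concrete, here is a minimal sketch of how candidate elements could be collected from a page before being labelled. It uses only the Python standard library and is not taken from this repository; the names `Label`, `CandidateCollector`, and `extract_candidates` are hypothetical, and the actual preprocessing in the project may differ.

```python
from enum import Enum
from html.parser import HTMLParser


class Label(Enum):
    """The four categories listed above (hypothetical enum, for illustration)."""
    PREV = "PREV"
    PAGE = "PAGE"
    NEXT = "NEXT"
    OTHER = "OTHER"


class CandidateCollector(HTMLParser):
    """Collects every <a> and <button> element with its href and visible text."""

    TAGS = ("a", "button")

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._open = []  # stack of candidate elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            node = {"tag": tag, "href": dict(attrs).get("href"), "text": ""}
            self.candidates.append(node)
            self._open.append(node)

    def handle_endtag(self, tag):
        if tag in self.TAGS and self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every candidate element that is still open.
        for node in self._open:
            node["text"] += data


def extract_candidates(html):
    """Return one dict per <a>/<button> element, in document order."""
    parser = CandidateCollector()
    parser.feed(html)
    parser.close()
    for node in parser.candidates:
        node["text"] = " ".join(node["text"].split())  # normalize whitespace
    return parser.candidates
```

Each returned dict is one item the model would assign a label to; for example, an element with the text "Next" and an href pointing at the following page would ideally be labelled `Label.NEXT`.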

The model is based on the research paper "Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction."
To improve on that method, we add two kinds of features to the model: URL features from URLNet [1] and sentence embeddings from SentenceTransformers [2].
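
As a rough illustration of what a URL feature can look like, the sketch below turns a link's URL into a fixed-length sequence of character IDs, the kind of input a character-level embedding layer consumes. URLNet learns from character-level and word-level representations of a URL; this sketch covers only a simplified character-level input step and is not the code used in this repository. The alphabet, the `max_len` default, and the name `encode_url` are assumptions made for the example.

```python
import string

# Assumed alphabet: lowercase letters, digits, and ASCII punctuation.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
PAD_ID, UNK_ID = 0, 1  # reserved IDs for padding and unknown characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=200):
    """Map a URL to a list of exactly max_len integer character IDs."""
    # Lowercase, truncate to max_len, and map unseen characters to UNK_ID.
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in url.lower()[:max_len]]
    # Right-pad with PAD_ID so every URL has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))
```

A sequence such as `encode_url("/events?page=2")` could then be fed to an embedding layer, and the resulting URL representation combined with the sentence embedding of the link text.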

Datasets

We use the same dataset as "Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction."
This dataset extends the original dataset used in Autopager [3] and contains 319 pages extracted from 109 distinct websites.

Installation

Install the required packages with Anaconda

conda env create -f environment.yml

Activate the environment

conda activate pagination-prediction-pytorch

Usage

Training

./train.sh

After training, checkpoint files are written to the ckpt directory. They can then be used by pagination_prediction_api.py for inference.

Test the API

./test_api.sh

Footnotes

  1. URLNet: Source - arXiv:1802.03162

  2. SentenceTransformers: Source - arXiv:1908.10084

  3. Autopager: Source