Publaynet

Company Articles Dataset

Overview

| Category | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Number of Images | 20365 | 500 | 499 |
| Percentage | 95% | 2.5% | 2.5% |
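The percentages above are approximate. As a quick check, the exact share of each split can be recomputed from the image counts; the dictionary keys below are illustrative names, not identifiers from this repository.

```python
# Image counts taken from the overview table.
counts = {"train": 20365, "val": 500, "test": 499}
total = sum(counts.values())  # number of images across all three splits

# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```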

Training Set:

| Category | #Instances |
|---|---|
| chapter | 11312 |
| section | 17471 |
| clause | 106931 |
| total | 135714 |

Validation Set:

| Category | #Instances |
|---|---|
| chapter | 151 |
| section | 246 |
| clause | 3096 |
| total | 3493 |

Test Set:

| Category | #Instances |
|---|---|
| chapter | 151 |
| section | 249 |
| clause | 2947 |
| total | 3347 |
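Per-category counts like these can be reproduced from the annotation files. The sketch below assumes the annotations follow the COCO JSON layout (a `categories` list and an `annotations` list), which the README does not state explicitly; `count_instances` and the toy data are hypothetical and only illustrate the counting.

```python
def count_instances(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    # Start every category at zero so empty categories still show up.
    counts = {name: 0 for name in id_to_name.values()}
    for ann in coco["annotations"]:
        counts[id_to_name[ann["category_id"]]] += 1
    counts["total"] = sum(counts.values())
    return counts


# Tiny made-up example, not the real annotation file.
toy = {
    "categories": [
        {"id": 1, "name": "chapter"},
        {"id": 2, "name": "section"},
        {"id": 3, "name": "clause"},
    ],
    "annotations": [
        {"category_id": 1},
        {"category_id": 3},
        {"category_id": 3},
    ],
}
print(count_instances(toy))
```

For a real file, load it with `json.load` first and pass the resulting dict to the function.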

Download

All Files:

Images

Annotation

Dataset:

Model:

Pretrained on Publaynet Dataset

Trained on Company Articles Dataset

Python Files:

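  • faster_rcnn_resnet101_coco_2018_01_28: pretrained backbone model, used for training on the Publaynet dataset
  • visualizeSet.py: visualizes the dataset
  • build.py: builds the optimizer and the learning-rate schedule
  • utils.py: utilities for working with the Publaynet dataset
  • train.py: training script for the Publaynet dataset
  • test_per_img.py: visualizes the predictions on the test set
  • predict.py: prediction script for the Publaynet dataset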

Requirements

Detectron2

Run on Google Colab:

Install Requirements and Clone Publaynet

!pip install pyyaml==5.1
!pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
!git clone https://github.com/noba1anc3/Publaynet.git
cd Publaynet

Build Detectron2 from Source

With the dependencies above installed and gcc & g++ ≥ 5 available, run:

!git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
!python -m pip install -e .
cd ..

# Or if you are on macOS
# CC=clang CXX=clang++ python -m pip install -e .

Train

Data Preparation

Mount Google Drive

from google.colab import drive
drive.mount('/content/drive/')

Copy Training and Validation Data into the Publaynet Directory

mkdir data

cp -rf ../drive/'My Drive'/train.zip ./data/
cp -rf ../drive/'My Drive'/val.zip ./data/

cd data
!unzip train.zip
!unzip val.zip
cd ..
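If you prefer to unpack the archives from Python rather than with `!unzip`, a minimal sketch using only the standard library follows. `extract_split` is a hypothetical helper, not part of this repository; it also runs a basic integrity check before extracting.

```python
import zipfile
from pathlib import Path


def extract_split(archive, dest="data"):
    """Extract one archive into dest and return the number of members."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        bad = zf.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad}")
        zf.extractall(dest)
        return len(zf.namelist())
```

For example, `extract_split("data/train.zip")` followed by `extract_split("data/val.zip")` mirrors the two `unzip` commands above.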

Fine-tune from Faster_RCNN_X_101_32x8d_FPN_3x

!python train.py -f False

Fine-tune from Publaynet's Pretrained Model

mkdir output
cp -rf ../drive/'My Drive'/model_final.pth ./output/
!python train.py -f True
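How `train.py` parses `-f` is not shown in this README. One thing to keep in mind with flags passed as `True`/`False` strings is that `bool("False")` is `True` in Python, so `argparse` with `type=bool` would not work as intended. The sketch below shows one common way to handle this; the `--finetune` long name and the `str2bool` helper are assumptions for illustration, not the repository's code.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'False' to a bool."""
    text = value.strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    # Hypothetical flag: True resumes from output/model_final.pth,
    # False starts from the Faster R-CNN weights.
    parser.add_argument("-f", "--finetune", type=str2bool, default=False)
    return parser.parse_args(argv)
```

With this helper, `parse_args(["-f", "False"]).finetune` evaluates to `False` rather than being silently treated as true.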

Training Log

Training From Scratch

Training on Faster-RCNN Pretrained Model

Training on Pretrained Model Finetuned on Publaynet Dataset

Comparison

Training From Scratch & Training on Faster RCNN Pretrained Model

Faster RCNN Pretrained Model & Publaynet Pretrained Model

Evaluation Results on the Test Set

Per-class AP

| chapter AP | section AP | clause AP | mAP |
|---|---|---|---|
| 85.180 | 86.641 | 93.367 | 88.396 |
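The mAP reported here is the unweighted mean of the three per-class AP values, which can be checked directly:

```python
# Per-class AP values from the evaluation on the test set.
per_class_ap = {"chapter": 85.180, "section": 86.641, "clause": 93.367}

# mAP is the plain average over classes (each class counts equally,
# regardless of how many instances it has).
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(f"mAP = {mean_ap:.3f}")  # prints "mAP = 88.396"
```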

Average Precision

| AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|
| 88.396 | 99.037 | 98.956 | NaN | 80.382 | 88.964 |

Average Recall

| AR1 | AR10 | AR100 | ARs | ARm | ARl |
|---|---|---|---|---|---|
| 57.0 | 91.4 | 92.0 | NaN | 84.8 | 92.1 |
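The NaN entries for APs and ARs most likely mean that the test set contains no objects in the COCO "small" size range, so those metrics are undefined. COCO-style evaluation groups objects by area: small is below 32² pixels, medium is between 32² and 96², and large is above 96². A minimal sketch of that bucketing follows; `size_bucket` is a hypothetical helper, it uses the box area as a stand-in for the annotation area, and its handling of the exact boundary values is a simplification.

```python
def size_bucket(width, height):
    """Return the COCO size bucket for a box of the given pixel size."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Example boxes (made-up sizes): a short text line and a full-width block.
print(size_bucket(20, 10), size_bucket(60, 40), size_bucket(500, 80))
```

Running a check like this over the test annotations would confirm whether any box actually falls in the small bucket.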