🤗 Hugging Face | 📁 YOLOv8l | 📁 YOLOv8x | 📁 YOLOv9c | 🐙 RapidAPI
Prediction results from yolov8n-doclaynet
👏 Update 6/21/2024 - Added YOLOv9 models.
RAG is very popular these days, and many applications support talking to documents. However, performance drops sharply on complex documents because of their complicated layout structure. Extracting content from a complex document and organizing it into a parsable form is therefore a challenge. This repo aims to solve that challenge with a method that is both fast and accurate.
YOLO is the most advanced detection model developed by Ultralytics. It comes in 5 base model sizes and has a very powerful framework for training and deployment, so I chose YOLO to solve this challenge.
DocLayNet is a human-annotated document layout segmentation dataset containing 80863 pages from a broad variety of document sources. As far as I know, it's the highest-quality document layout analysis dataset.
- Offer a script to turn the DocLayNet dataset into a dataset ready for YOLO detection training.
- Offer train, eval and serve code.
- Train and release 5 different sizes of YOLOv8 models: yolov8n, yolov8s, yolov8m, yolov8l and yolov8x. yolov8n, yolov8s and yolov8m can be found on HuggingFace. yolov8l and yolov8x are only slightly better than yolov8m. If you really want to try them, please buy them via the yolov8l and yolov8x links, as I rent GPUs to train them.
from ultralytics import YOLO
model = YOLO("{path to model file}")
pred = model("{path to test image}")
print(pred)
For the definition of the prediction result, please refer to the doc.
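As a rough illustration of organizing predictions into a parsable form, the sketch below filters detections by confidence and sorts them top-to-bottom, then left-to-right. The function name detections_to_regions, the min_conf threshold and the sample class-id mapping are assumptions made for this example and are not part of this repo. The simple sort also ignores multi-column layouts.

```python
from typing import Dict, List, Sequence


def detections_to_regions(
    boxes_xyxy: Sequence[Sequence[float]],
    class_ids: Sequence[float],
    scores: Sequence[float],
    names: Dict[int, str],
    min_conf: float = 0.25,  # hypothetical cut-off, tune for your documents
) -> List[dict]:
    """Convert raw detections into label/box records in a naive reading order."""
    regions = []
    for box, cls_id, score in zip(boxes_xyxy, class_ids, scores):
        if score < min_conf:
            continue  # drop low-confidence detections
        x1, y1, x2, y2 = (float(v) for v in box)
        regions.append(
            {"label": names[int(cls_id)], "score": float(score), "box": (x1, y1, x2, y2)}
        )
    # Sort by top edge first, then by left edge.
    regions.sort(key=lambda r: (r["box"][1], r["box"][0]))
    return regions


# Hand-written stand-in values; the id-to-name mapping here is illustrative only.
example = detections_to_regions(
    boxes_xyxy=[[50, 200, 500, 400], [50, 40, 500, 90], [50, 420, 500, 440]],
    class_ids=[9, 10, 1],
    scores=[0.91, 0.88, 0.10],
    names={1: "Footnote", 9: "Text", 10: "Title"},
)
for region in example:
    print(region["label"], region["box"])
```

With Ultralytics, the inputs would typically come from the first result, e.g. result = model(path)[0], passing result.boxes.xyxy.tolist(), result.boxes.cls.tolist(), result.boxes.conf.tolist() and result.names.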
You can simply run python main.py to serve the model. Then open http://localhost:8000/redoc to check the API.
More details about DocLayNet, along with the download, can be found at this link. It has 11 labels:
- Text: Regular paragraphs.
- Picture: A graphic or photograph.
- Caption: Special text outside a picture or table that introduces this picture or table.
- Section-header: Any kind of heading in the text, except overall document title.
- Footnote: Typically small text at the bottom of a page, with a number or symbol that is referred to in the text above.
- Formula: Mathematical equation on its own line.
- Table: Material arranged in a grid alignment with rows and columns, often with separator lines.
- List-item: One element of a list, in a hanging shape, i.e., from the second line onwards the paragraph is indented more than the first line.
- Page-header: Repeating elements like page number at the top, outside of the normal text flow.
- Page-footer: Repeating elements like page number at the bottom, outside of the normal text flow.
- Title: Overall title of a document, (almost) exclusively on the first page and typically appearing in large font.
- Download the DocLayNet dataset from this link.
- Unzip it to the datasets folder.
- Use my convert script to make the dataset ready for training.
wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip
mkdir datasets
mv DocLayNet_core.zip datasets/
cd datasets/ && unzip DocLayNet_core.zip && rm DocLayNet_core.zip
cd ../
python convert_dataset.py
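DocLayNet ships its annotations in COCO format, where a box is [x, y, width, height] in pixels measured from the top-left corner, while YOLO label files expect one line per box: the class index followed by the normalized center x, center y, width and height. The following is a minimal sketch of that conversion, not the actual content of convert_dataset.py. The function names, the sample page size and the category_id - 1 class mapping are assumptions.

```python
from typing import Sequence, Tuple


def coco_bbox_to_yolo(
    bbox: Sequence[float], img_w: float, img_h: float
) -> Tuple[float, float, float, float]:
    """Convert a COCO [x, y, w, h] pixel box to normalized YOLO (cx, cy, w, h)."""
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w  # box center, as a fraction of image width
    cy = (y + h / 2) / img_h  # box center, as a fraction of image height
    return cx, cy, w / img_w, h / img_h


def yolo_label_line(category_id: int, bbox: Sequence[float], img_w: float, img_h: float) -> str:
    """Format one annotation as a line of a YOLO label .txt file."""
    # Assumption: COCO category ids start at 1, YOLO class indices at 0.
    class_index = category_id - 1
    cx, cy, w, h = coco_bbox_to_yolo(bbox, img_w, img_h)
    return f"{class_index} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical annotation on a 1025x1025 page image.
line = yolo_label_line(category_id=10, bbox=[102.5, 205.0, 410.0, 102.5], img_w=1025, img_h=1025)
print(line)
```

Each image then gets a text file with the same base name containing one such line per annotated box.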
After preparing the data, training is super easy thanks to Ultralytics. You can choose a base model from this link. I use the YOLOv8 series.
python train.py {base-model}
After training, you can evaluate your best model on the test split.
python eval.py {path-to-your-model}
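The results in this repo are reported as mAP50-95, which averages average precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below only shows the IoU part and at which thresholds a single predicted box would count as a match. It is not the evaluation code itself, and the function name box_iou and the sample boxes are made up for illustration.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds behind mAP50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical prediction and ground-truth box for the same region.
pred = (0, 0, 100, 100)
truth = (0, 0, 100, 72)

iou = box_iou(pred, truth)
matched_at = [t for t in IOU_THRESHOLDS if iou >= t]
print(f"IoU = {iou:.2f}, counts as a match at thresholds {matched_at}")
```

A box that is only roughly right matches at the loose thresholds but not the strict ones, which is why mAP50-95 is lower than mAP50 for the same model.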
- Figure of overall mAP50-95 on test across different models.
- Full table of mAP50-95 on test compared across different models.
label | images | boxes | yolov8n | yolov9t | yolov8s | yolov9s | yolov8m | yolov9m | yolov8l | yolov9c | yolov8x | yolov9e |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Caption | 4983 | 1542 | 0.682 | 0.68 | 0.721 | 0.735 | 0.746 | 0.749 | 0.75 | 0.746 | 0.753 | |
Footnote | 4983 | 387 | 0.614 | 0.638 | 0.669 | 0.684 | 0.696 | 0.693 | 0.702 | 0.689 | 0.717 | |
Formula | 4983 | 1966 | 0.655 | 0.678 | 0.695 | 0.719 | 0.723 | 0.737 | 0.75 | 0.752 | 0.747 | |
List-item | 4983 | 10521 | 0.789 | 0.802 | 0.818 | 0.827 | 0.836 | 0.838 | 0.841 | 0.843 | 0.841 | |
Page-footer | 4983 | 3987 | 0.588 | 0.599 | 0.61 | 0.612 | 0.64 | 0.62 | 0.641 | 0.65 | 0.655 | |
Page-header | 4983 | 3365 | 0.707 | 0.731 | 0.754 | 0.77 | 0.769 | 0.77 | 0.776 | 0.785 | 0.784 | |
Picture | 4983 | 3497 | 0.723 | 0.764 | 0.762 | 0.789 | 0.789 | 0.787 | 0.796 | 0.796 | 0.805 | |
Section-header | 4983 | 8544 | 0.709 | 0.72 | 0.727 | 0.736 | 0.742 | 0.742 | 0.75 | 0.741 | 0.748 | |
Table | 4983 | 2394 | 0.82 | 0.86 | 0.854 | 0.88 | 0.88 | 0.881 | 0.885 | 0.884 | 0.886 | |
Text | 4983 | 29917 | 0.845 | 0.856 | 0.86 | 0.869 | 0.876 | 0.874 | 0.878 | 0.877 | 0.877 | |
Title | 4983 | 334 | 0.762 | 0.778 | 0.806 | 0.81 | 0.83 | 0.836 | 0.846 | 0.838 | 0.84 | |
All | 4983 | 66454 | 0.718 | 0.737 | 0.752 | 0.766 | 0.775 | 0.775 | 0.783 | 0.782 | 0.787 | |
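As a quick sanity check on the table, the sketch below recomputes the All row as the unweighted mean of the 11 per-label scores for three of the YOLOv8 columns, then measures how much yolov8x gains over yolov8m. It assumes the nine filled value columns line up with the headers from yolov8n to yolov8x, leaving yolov9e empty, and it works from the rounded values shown in the table.

```python
# Per-label mAP50-95 copied from the table, in row order from Caption to Title.
PER_LABEL = {
    "yolov8n": [0.682, 0.614, 0.655, 0.789, 0.588, 0.707, 0.723, 0.709, 0.82, 0.845, 0.762],
    "yolov8m": [0.746, 0.696, 0.723, 0.836, 0.64, 0.769, 0.789, 0.742, 0.88, 0.876, 0.83],
    "yolov8x": [0.753, 0.717, 0.747, 0.841, 0.655, 0.784, 0.805, 0.748, 0.886, 0.877, 0.84],
}
# Values from the "All" row of the table.
REPORTED_ALL = {"yolov8n": 0.718, "yolov8m": 0.775, "yolov8x": 0.787}

# Unweighted (macro) mean over labels, rounded like the table.
macro_mean = {name: round(sum(v) / len(v), 3) for name, v in PER_LABEL.items()}

for name, value in macro_mean.items():
    print(f"{name}: mean of labels = {value}, reported All = {REPORTED_ALL[name]}")

# Relative improvement of yolov8x over yolov8m on the overall score.
relative_gain = (REPORTED_ALL["yolov8x"] - REPORTED_ALL["yolov8m"]) / REPORTED_ALL["yolov8m"]
print(f"yolov8x vs yolov8m: +{relative_gain:.1%}")
```

The recomputed means agree with the All row for these columns, and the gain of yolov8x over yolov8m comes out at roughly 1.5%, consistent with the larger models being only slightly better.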