PrettyPaper

Member Info

Google Sheet

Work Group

1. Data Filter (3 people per group)

Determine whether each block of data belongs to the paper.
Remove any data that comes from another paper.
Skill Set: mostly textual analysis, Python PDF parser, possibly some image processing
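
One possible starting point for the filter is a simple token-overlap check: compare each block against a reference text from the paper (for example its title and abstract) and drop blocks that share almost no vocabulary. The sketch below is only an illustration; the function names, the Jaccard measure, and the `threshold` value are assumptions, not the project's chosen method.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_blocks(reference, blocks, threshold=0.1):
    """Keep the blocks whose token overlap with the reference reaches the threshold.

    `threshold` is a hypothetical default and would need tuning on real papers.
    """
    ref_tokens = tokenize(reference)
    return [b for b in blocks if jaccard(ref_tokens, tokenize(b)) >= threshold]


if __name__ == "__main__":
    reference = "Deep learning for document layout analysis"
    blocks = [
        "We apply deep learning to layout analysis of scanned document pages.",
        "The recipe needs two eggs and a cup of flour.",
    ]
    for kept in filter_blocks(reference, blocks):
        print(kept)
```

A bag-of-words overlap like this is cheap but ignores synonyms and stop words, so it is probably best treated as a baseline to compare better text models against.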

2. Normal Paper Paragraph Segmentation (5 people per group)

Extract the metadata, main title, and subtitles of a paper.
Extract all the content under each subtitle.
Skill Set: textual analysis, Python PDF parser
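
Once a PDF parser has produced plain text lines, grouping content under subtitles could look roughly like the sketch below. It assumes subtitles are numbered headings such as "1 Introduction" or "2.1 Related Work"; that pattern, the `HEADING` regex, and the `split_sections` helper are illustrative assumptions, and real papers will need more robust rules.

```python
import re

# Assumption: a subtitle is a section number followed by a capitalised title.
HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+[A-Z].*$")


def split_sections(lines):
    """Return (subtitle, content) pairs in document order.

    Text that appears before the first subtitle is stored under 'preamble'.
    """
    sections = [("preamble", [])]
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if HEADING.match(stripped):
            sections.append((stripped, []))
        else:
            sections[-1][1].append(stripped)
    # Drop the preamble entry when nothing preceded the first subtitle.
    return [
        (title, " ".join(body))
        for title, body in sections
        if body or title != "preamble"
    ]


if __name__ == "__main__":
    text = [
        "A Study of Things",
        "1 Introduction",
        "Papers are long.",
        "They have sections.",
        "2 Method",
        "We parse them.",
    ]
    for title, content in split_sections(text):
        print(title, "->", content)
```

This heuristic will misfire on body lines that happen to start with a number and a capitalised word, so font size or position information from the parser would likely be a better signal where it is available.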

3. Component Object Detection (4 people per group)

Extract the figures, tables, and charts in a paper. The output is image data.
Skill Set: multiple object detection, image processing, machine learning, Python PDF parser, possibly some textual analysis

4. Img Paper Paragraph Segmentation (5 people per group)

Split each page image into parts, one per paragraph.
Segment the subtitle and the content in each block.
Skill Set: image segmentation, image processing, machine learning, Python PDF parser, possibly some textual analysis
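
As a baseline before any learned segmentation model, a page image can be split wherever enough blank rows separate two runs of ink (a horizontal projection profile). The sketch below assumes the page is already binarized into rows of 0/1 values with 1 meaning ink; `split_rows` and `min_gap` are hypothetical names chosen for illustration.

```python
def split_rows(page, min_gap=2):
    """Split a binary page (rows of 0/1, 1 = ink) into (start, end) row ranges.

    A new block starts after at least `min_gap` consecutive blank rows.
    `end` is exclusive, so page[start:end] is the block.
    """
    blocks = []
    start = None     # first row of the current block
    last_ink = None  # most recent row that contained ink
    for y, row in enumerate(page):
        if not any(row):
            continue  # blank row
        if start is None:
            start = y
        elif y - last_ink > min_gap:
            # Enough blank rows since the last ink: close the current block.
            blocks.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


if __name__ == "__main__":
    page = [
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],  # single blank row: line spacing, same paragraph
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],  # three blank rows: paragraph break
        [1, 1, 1, 0],
        [0, 1, 0, 0],
    ]
    print(split_rows(page))
```

This only handles single-column pages with clean gaps; multi-column layouts or skewed scans would need column detection or deskewing first.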

5. OCR, lang_trans (3 people per group)

Extract text from image data with OCR after image pre-processing.
Skill Set: OCR (Tesseract), image processing, machine learning, Python PDF parser, possibly some textual analysis
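
A common pre-processing step before OCR is binarization. The sketch below shows Otsu's method in plain Python on a grayscale image given as rows of 0-255 integers; it is an assumption about what the pre-processing might include, not the group's settled pipeline, and the function names are illustrative. The binarized result would then be handed to an OCR engine such as Tesseract, which is not called here.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels on the dark side yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # all pixels are on the dark side
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a grayscale image (rows of 0-255 ints) to 0/1, with 1 = dark ink."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    image = [[10, 12, 240], [250, 11, 245]]
    t = otsu_threshold(p for row in image for p in row)
    print(t, binarize(image, t))
```

In practice an image library would do this much faster; the pure-Python version is only meant to make the step explicit.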