PrettyPaper

Member Info

Work Group

1. Data Filter (3 people a group)

  • Determine whether each block of data belongs to the paper.
  • Remove any data that comes from another paper.
  • Skill Set: mostly textual analysis, Python PDF parser, possibly some image processing
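
One possible way to approach the relatedness check is a word-overlap score between each block and a reference text taken from the paper. The sketch below is only an illustration under stated assumptions: the reference text (for example the title and abstract), the helper names `tokenize`, `overlap`, and `filter_blocks`, and the 0.2 threshold are all hypothetical and not part of the project.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words longer than 3 letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def overlap(block, reference_tokens):
    """Fraction of the block's words that also occur in the reference."""
    tokens = tokenize(block)
    if not tokens:
        return 0.0
    return len(tokens & reference_tokens) / len(tokens)


def filter_blocks(blocks, reference, threshold=0.2):
    """Keep only the blocks whose overlap with the reference reaches the threshold.

    The threshold is an assumed value and would need tuning on real data.
    """
    reference_tokens = tokenize(reference)
    return [b for b in blocks if overlap(b, reference_tokens) >= threshold]
```

A block that shares most of its vocabulary with the reference is kept, and a block with no shared words is dropped. A real filter would likely need something stronger than raw word overlap, such as TF-IDF weighting.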

2. Normal Paper Paragraph Segmentation (5 people a group)

  • Extract the metadata, main title, and subtitles of a paper.
  • Extract all the content under each subtitle.
  • Skill Set: textual analysis, Python PDF parser
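
As a rough sketch of the subtitle/content split, the code below assumes the page text has already been extracted by a PDF parser and that subtitles are numbered headings such as "1 Introduction" or "2.1 Method". Both the heading pattern and the function name `split_by_subtitle` are hypothetical; real papers will need a more robust rule.

```python
import re

# Assumed heading shape: "1 Introduction", "2.1 Related Work", "3. Method".
HEADING = re.compile(r"^\d+(?:\.\d+)*\.?\s+[A-Z].*$")


def split_by_subtitle(text):
    """Return a list of (subtitle, content) pairs in document order.

    Text that appears before the first heading is stored under "front matter".
    """
    sections = []
    title, body = "front matter", []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            # Close the previous section and start a new one at this heading.
            sections.append((title, "\n".join(body).strip()))
            title, body = stripped, []
        else:
            body.append(line)
    sections.append((title, "\n".join(body).strip()))
    return sections
```

Note that this pattern also matches ordinary sentences that start with a number, so a length limit or font-size information from the parser would probably be needed in practice.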

3. Component Object Detection (4 people a group)

  • Extract figures, tables, and charts from a paper. The output is image data.
  • Skill Set: multiple object detection, image processing, machine learning, Python PDF parser, and possibly some textual analysis

4. Img Paper Paragraph Segmentation (5 people a group)

  • Split each page image into blocks, one per paragraph.
  • Separate the subtitle from the content in each block.
  • Skill Set: image segmentation, image processing, machine learning, Python PDF parser, and possibly some textual analysis
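
A simple baseline for splitting a page image into paragraph blocks is a horizontal projection profile: find rows that contain ink and cut wherever enough blank rows separate them. The sketch below assumes the page is already binarized into a list of rows of 0/1 values (1 = ink). The function name `split_rows` and the `min_gap` parameter are hypothetical, and this baseline is not a substitute for the learned segmentation the group may choose.

```python
def split_rows(page, min_gap=2):
    """Return (start, end) row ranges, end exclusive, one per block.

    Two ink rows belong to different blocks when at least `min_gap`
    blank rows lie between them; smaller gaps are treated as line spacing.
    """
    has_ink = [any(row) for row in page]
    blocks = []
    start = None      # first row of the block being built
    last_ink = None   # most recent row that contained ink
    for y, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank gap is large enough: close the block and open a new one.
            blocks.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks
```

This only handles single-column pages with clean horizontal gaps. Multi-column layouts would need a vertical profile or a learned model as well.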

5. OCR, lang_trans (3 people a group)

  • Extract text from image data with OCR after image pre-processing.
  • Skill Set: OCR (Tesseract), image processing, machine learning, Python PDF parser, and possibly some textual analysis
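
A common pre-processing step before OCR is binarization. The sketch below shows Otsu's method in plain Python, assuming the image is given as a flat list of grey levels from 0 to 255. The function names `otsu_threshold` and `binarize` are hypothetical, the choice of Otsu's method is an assumption rather than the group's decision, and the OCR call itself (for example with Tesseract) is not shown.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises the between-class variance.

    Pixels with level <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0     # number of pixels with level <= t
    sum_low = 0   # sum of the levels of those pixels
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break             # all pixels are in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels):
    """Map dark pixels (ink) to 1 and light pixels (background) to 0."""
    t = otsu_threshold(pixels)
    return [1 if p <= t else 0 for p in pixels]
```

The binarized output could then be passed to the OCR engine. Scans with uneven lighting usually need an adaptive (local) threshold instead of a single global one.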