juzraai/cordis-projects-crawler

v2.0

Closed this issue · 0 comments

This is the plan for v2.0. I'll update this description as things change or as new ideas come up, either on my side or in the comments below.


Features (based on 1.x)

  • Crawl project RCNs: single RCN, RCN list, RCN range
  • Crawl RCNs found in output directory
  • Crawl all available RCNs
  • Crawl RCNs returned by a search URL
  • CSV/TSV export
  • MySQL export
  • RCN list export + seed? (v2.1)
  • Java API for developers
  • CLI for users
  • Fancy documentation with Docsify (MySQL docs, extending docs)
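
The three ways of naming projects to crawl (single RCN, RCN list, RCN range) could be modelled as one small type that always yields a lazy sequence of RCNs. This is only a sketch: `RcnSelection`, `SingleRcn`, `RcnList` and `RcnRange` are hypothetical names invented for illustration, not the crawler's actual API.

```kotlin
// Hypothetical model of the RCN inputs listed above; names are illustrative only.
sealed class RcnSelection {
    // Every selection exposes its RCNs as a lazy sequence.
    abstract fun rcns(): Sequence<Long>
}

// A single project RCN.
data class SingleRcn(val rcn: Long) : RcnSelection() {
    override fun rcns(): Sequence<Long> = sequenceOf(rcn)
}

// An explicit list of RCNs; duplicates are dropped, first occurrence wins.
data class RcnList(val items: List<Long>) : RcnSelection() {
    override fun rcns(): Sequence<Long> = items.asSequence().distinct()
}

// An inclusive range of RCNs.
data class RcnRange(val first: Long, val last: Long) : RcnSelection() {
    init {
        require(first <= last) { "first must not be greater than last" }
    }

    override fun rcns(): Sequence<Long> = (first..last).asSequence()
}
```

With this shape, a caller would iterate `RcnRange(10, 12).rcns()` the same way as `SingleRcn(42).rcns()`, so the crawling code would not need to know which kind of input it was given.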

Improvements

  • Use CORDIS XML and OpenAIRE API
  • Test with both old and new projects
  • Unified view - separate ticket? May need help? (v2.1)

Under the hood

  • JitPack compatible POM
  • Complete rewrite in Kotlin
  • Batch processing pattern (Kotlin sequences)
  • Parse XML with Simple framework
  • Modular design - interfaces and IoC framework
  • Write in batches for better performance
  • database: use an ORM framework (OrmLite? Hibernate? jOOQ?)? - no, stick with plain JDBC
  • database: should it remove relation records before inserting new ones? - I no longer think that's needed
  • use the Spring Boot framework? It would simplify config handling and ORM - no, keep it simple
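
As a rough illustration of the "batch processing pattern (Kotlin sequences)" and "write in batches" items, the sketch below groups a lazy sequence into fixed-size chunks and hands each chunk to a writer. `writeInBatches` and its `writeBatch` callback are hypothetical names; the real exporter could look quite different. The only library call it relies on is the standard `Sequence.chunked`.

```kotlin
// Hypothetical helper: consumes a lazy sequence in fixed-size batches.
// Returns the total number of items passed to the writer.
fun <T> writeInBatches(
    items: Sequence<T>,
    batchSize: Int,
    writeBatch: (List<T>) -> Unit
): Int {
    require(batchSize > 0) { "batchSize must be positive" }
    var written = 0
    // chunked() is lazy on sequences: each batch is built only when requested,
    // so at most one batch is held in memory at a time.
    for (batch in items.chunked(batchSize)) {
        writeBatch(batch)
        written += batch.size
    }
    return written
}
```

A CSV or JDBC exporter could pass its own `writeBatch` function here, for example one that appends the rows to a file or runs a single batched insert per chunk. The last batch may be smaller than `batchSize`.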