你的爬蟲大統領
-
index => 列出所有的 crawlers
- endpoint: /crawlers
- show last_run_at
- show running workers in queue (Sidekiq::Queue find class name)
- [*] show how many courses each crawler had done
-
show => 顯示單一爬蟲的資訊 name / crawling status
- endpoint: /crawlers/ntust, /crawler/{school name}
- track each worker job progress and status
- Start crawler anytime => track job ids => maybe save it to another model?
- ScheduledSet / RetrySet / DeadSet status (filtered by class name)
- Limiting queueing crawler (eg. each class for 5 instances)
- Manage/Track Rufus Scheduled Job
- Unschedule EveryJob / CronJob(EveryJob first)
-
setting => 設定單一爬蟲的 api secrets / retry interval / scheduling
- endpoint: /crawlers/{school name}/setting (ediqt page)
- understanding sidekiq scheduler usage and parameters
- Schedule crawler (whenever, .etc)
I use rufus-scheduler eventually
- initializer setup existing scheduling behavior
-
Course Model
- [*] Copy and Paste from Colorgy/Book :p
- Check data integrity (no blank class name / no blank class period data / no invalid period data......)
- Check course_code
- [*] Sync data to Core
-
後期調教
- Redis Namespace
- queue namespace(Sidekiq::Client push specific queue name)
- Limiting retry count
- limit queue number
- we can't kill workers orz
- sidekiq-limit_fetch set limit
- Check sidekiq proccess
Sidekiq::Client.push('queue' => 'NtustCourseCrawler', 'class' => CourseCrawler::Worker, 'args' => ['NtustCourseCrawler'])
- 有閒套個 AdminLTE 吧 ww