Actor-API Docs
Running Rails 5.2, Ruby 2.5.1
This is a simple project that:
- retrieves short bios for all actors on IMDb born on a specific day and saves the respective HTML pages,
- then retrieves and saves the respective HTML for each actor's "most known work",
- next, parses the HTML pages into CSV files,
- then seeds the data into a MySQL db, and lastly,
- has the machinery to make said data available via API.
The data model is basic: an Actor has_one MostKnownWork, and vice versa.
Organization details
- crawlers/scrapers (originally written purely as scripts) moved into rake tasks,
- organized by function:
- lib/tasks/crawlers.rake
- lib/tasks/scrapers.rake
- meant to be used by objective:
- lib/tasks/seed_actors.rake
- all of this functionality is accessible through a single call to rake db:seed (and, of course, individually)
- see this code in db/seeds.rb
- "ETL" process goes as follows:
- retrieving and saving html files (crawling)
- parsing the html into CSV files (scraping)
- ingesting clean CSV files into database
- CSVs will be written to respective lib/imdb/ folders
- Interesting logic lives in lib/imdber.rb (where crawling and scraping concerns are again kept 'together' but otherwise organized by function), e.g.:
IMDber::Crawler#retrieve_actors_results_pages
IMDber::Crawler#retrieve_best_known_work_pages
IMDber::Scraper#parse_actors_results_pages
IMDber::Scraper#parse_known_work_pages
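As a rough illustration of the "ETL" steps listed above (saved HTML, then CSV, then ingest), here is a minimal sketch in plain Ruby. It is not the code in lib/imdber.rb: the sample markup, the regex-based parsing, and the helper names parse_actors, to_csv and ingest are all assumptions made up for this example, and an array of hashes stands in for the database.

```ruby
require 'csv'

# Hypothetical stand-in for a saved results page (the output of crawling).
# The real IMDb markup is different; this only illustrates the flow.
SAMPLE_HTML = <<~HTML
  <div class="actor"><a href="/name/example-1/">Jane Doe</a><span class="work" data-url="/title/example-1/">Some Film</span></div>
  <div class="actor"><a href="/name/example-2/">John Roe</a><span class="work" data-url="/title/example-2/">Other Film</span></div>
HTML

# Matches one actor entry in the made-up markup above.
ACTOR_ROW = %r{<a href="([^"]+)">([^<]+)</a><span class="work" data-url="([^"]+)">([^<]+)</span>}

# Scraping: turn HTML into an array of row hashes.
def parse_actors(html)
  html.scan(ACTOR_ROW).map do |url, name, work_url, work|
    { name: name, url: url, known_work: work, known_work_url: work_url }
  end
end

# Write the rows out as CSV text, with a header line taken from the hash keys.
def to_csv(rows)
  return '' if rows.empty?

  CSV.generate do |csv|
    csv << rows.first.keys
    rows.each { |row| csv << row.values }
  end
end

# Ingesting: read the clean CSV back into records (string-keyed hashes here,
# where the real seed tasks would create database rows).
def ingest(csv_text)
  CSV.parse(csv_text, headers: true).map(&:to_h)
end

records = ingest(to_csv(parse_actors(SAMPLE_HTML)))
records.each { |r| puts "#{r['name']}: #{r['known_work']}" }
```

Keeping the CSV as an intermediate artifact means the parsing step can be re-run and inspected without hitting the network or the database again.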
Side Note:
To simplify seeding the data, there is a duplicated column: both actors and most_known_works have a column for their respective known-work URL.
I'm temporarily using this as the effective foreign key, only to make life easier while seeding the initial data (1200+ rows per table in my example), because the alternative was a "%fuzzy search%" on the film name, which was far less reliable. This column should be removed after the data is seeded and before deploying.
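To make the idea concrete, here is a small sketch of linking the two sets of rows on the shared URL instead of fuzzy-matching film names. It uses plain Structs rather than the real ActiveRecord models, and the field name known_work_url, the helper link_by_url and the sample rows are assumptions for illustration only.

```ruby
# Plain-Ruby stand-ins for the two tables (not the actual models).
ActorRow = Struct.new(:name, :known_work_url, :most_known_work)
WorkRow  = Struct.new(:title, :rating, :known_work_url)

# Attach each actor to the work that shares its known-work URL.
# An exact match on the URL avoids the ambiguity of matching on film titles.
def link_by_url(actors, works)
  works_by_url = works.map { |work| [work.known_work_url, work] }.to_h
  actors.each { |actor| actor.most_known_work = works_by_url[actor.known_work_url] }
end

# Made-up sample rows.
works = [
  WorkRow.new('Some Film', 7.9, '/title/example-1/'),
  WorkRow.new('Other Film', 6.4, '/title/example-2/')
]
actors = [
  ActorRow.new('Jane Doe', '/title/example-2/'),
  ActorRow.new('John Roe', '/title/example-1/'),
  ActorRow.new('No Match', '/title/example-3/')
]

link_by_url(actors, works)
actors.each do |actor|
  title = actor.most_known_work ? actor.most_known_work.title : '(no work found)'
  puts "#{actor.name}: #{title}"
end
```

Actors whose URL has no counterpart are left unlinked rather than guessed at, which is the main advantage over a fuzzy title search.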
Visuals:
- Actors endpoint:
- Paginated actor results, sorted by the rating of each actor's most-known work
- GET /api/v1/actors/
- Birth-date search endpoint:
- GET /api/v1/actors/search/:birth_month/:birth_day
- Screenshot of actual data (local):
- Respective SQL "search" query
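For readers without the screenshots, the behaviour of the two endpoints above can be approximated in plain Ruby. This is only a sketch: the real endpoints run SQL against the database, while this version works on an in-memory array. The helper names actors_page and born_on, the page size, the descending sort order and the sample rows are all assumptions.

```ruby
require 'date'

# Made-up rows; :rating stands for the rating of the actor's most-known work.
SAMPLE_ACTORS = [
  { name: 'Actor A', birth_date: Date.new(1970, 3, 14), rating: 6.1 },
  { name: 'Actor B', birth_date: Date.new(1985, 3, 14), rating: 8.7 },
  { name: 'Actor C', birth_date: Date.new(1990, 7, 2),  rating: 7.4 }
].freeze

# Roughly what GET /api/v1/actors/ returns: one page of actors,
# highest-rated known work first (the sort direction is an assumption).
def actors_page(actors, page: 1, per_page: 25)
  page = [page.to_i, 1].max
  sorted = actors.sort_by { |actor| -actor[:rating] }
  sorted.drop((page - 1) * per_page).first(per_page)
end

# Roughly what GET /api/v1/actors/search/:birth_month/:birth_day returns.
# Path parameters arrive as strings, hence the to_i conversions.
def born_on(actors, birth_month, birth_day)
  actors.select do |actor|
    actor[:birth_date].month == birth_month.to_i &&
      actor[:birth_date].day == birth_day.to_i
  end
end

puts actors_page(SAMPLE_ACTORS, page: 1, per_page: 2).map { |a| a[:name] }.inspect
puts born_on(SAMPLE_ACTORS, '3', '14').map { |a| a[:name] }.inspect
```

Note that the search matches on month and day only, so actors born in different years on the same date are returned together.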