An IMDb scraper (AJAX, HTML, JSON)

Primary language: C#. License: GNU General Public License v3.0 (GPL-3.0).

tar.IMDbScraper

  • C# .NET Standard v2.1

Function

This library can be used to scrape various IMDb title information via the static Scraper class:

via AJAX
  • all user reviews
via HTML
  • alternate versions page
  • awards page
  • crazy credits page
  • critics reviews page
  • FAQ page
  • full credits page
  • locations page
  • main page
  • parental guide page
  • ratings page
  • reference page
  • soundtrack page
  • taglines page
  • technical page
via JSON
  • all alternate titles ("Also known as" = AKAs)
  • all awards
  • all awards for a particular awards event (via enum)
  • all awards for a particular awards event (via string)
  • all awards events
  • all companies
  • all companies of a particular category (via enum)
  • all connections
  • all connections of a particular category (via enum)
  • all external reviews
  • all external sites
  • all external sites of a particular category (via enum)
  • all filming dates
  • all filming locations
  • all goofs
  • all goofs of a particular category (via enum)
  • all keywords
  • all news
  • all plot summaries
  • all quotes
  • all release dates
  • all seasons
  • all topics
  • all trivia entries
  • episodes card (2 top ranked and 2 most recent episodes, if available)
  • main news (without details)
  • next episode (if available)
  • storyline
  • suggestions (search on IMDb)

You can also use the IMDbTitle class, which encapsulates the title scraping.

For results, see images.

Caution

Some of the methods provide incomplete data:
  • As long as none of the loaded HTML pages shows a "Show more"/"All" button, the scraped info should be complete. Otherwise, use the corresponding JSON method. If no JSON method is implemented yet, please inform the author of this library about the affected title.
  • The full credits page may be incomplete, depending on the production status.
  • The critics reviews page contains only 10 entries from metacritic.com.
  • The locations page has only 5 filming dates and locations (JSON methods are implemented), but it also has production dates (no JSON method is implemented yet).
  • The main page has a lot of info that no other method can provide yet, but some of it is incomplete (e.g. the technical info, for which you need to scrape the technical page).
  • The ratings page has a heatmap of all episode ratings, which is not yet implemented.
  • The reference page (like the main page) has some info that is incomplete.
  • The storyline provides some general plot entries, but not all.

Scraping all information at once is not recommended. Storing everything in your own database does not make sense either: it could be a legal issue, and the stored data is immediately outdated because the IMDb data is updated regularly. Therefore, you should only scrape and store general information (e.g. title(s), year(s), genre(s), plot(s)) and scrape the other info when you really need (to display) it. Scrape duration is another reason: for example, it already takes around 42 seconds to scrape all 37 seasons of "The Simpsons" without detailed information on each episode.
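One way to follow this advice is to put an on-demand cache in front of the expensive scrapes. The following is only a minimal sketch: `OnDemandCache` is a hypothetical helper that is not part of this library, and the scrape itself is passed in as a delegate, so no method signature of the library is assumed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper, not part of tar.IMDbScraper: runs an expensive scrape
// only when a value is requested and reuses the result until it expires.
public sealed class OnDemandCache<TValue>
{
    private readonly ConcurrentDictionary<string, (DateTime FetchedAtUtc, TValue Value)> _entries
        = new ConcurrentDictionary<string, (DateTime, TValue)>();
    private readonly Func<string, Task<TValue>> _scrape;
    private readonly TimeSpan _maxAge;

    public OnDemandCache(Func<string, Task<TValue>> scrape, TimeSpan maxAge)
    {
        _scrape = scrape ?? throw new ArgumentNullException(nameof(scrape));
        _maxAge = maxAge;
    }

    public async Task<TValue> GetAsync(string imdbId)
    {
        // Serve the cached value while it is younger than the allowed age.
        if (_entries.TryGetValue(imdbId, out var entry)
            && DateTime.UtcNow - entry.FetchedAtUtc < _maxAge)
        {
            return entry.Value;
        }

        // Missing or expired: scrape now and remember when it happened.
        // Two concurrent callers for the same id may both scrape; the sketch
        // accepts that for the sake of brevity.
        TValue value = await _scrape(imdbId).ConfigureAwait(false);
        _entries[imdbId] = (DateTime.UtcNow, value);
        return value;
    }
}
```

In an application, the delegate would wrap whichever scrape call is needed for the view being displayed, and the maximum age would be chosen per kind of data (short for news, longer for rarely changing details).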

Hashes

IMDb regularly changes the hashes which are used for most of the requests. Call Scraper.ScrapeAllOperationHashesAsync() once in a while: it automatically updates the hashes via a simulated browser window and stores them in a local .json file. You can adjust the default path [PathToYourApp]\Data\IMDbHashes.json and the DateTime that the last update is compared with.

Furthermore, you can also adjust the .json file manually as follows:

  1. Open the corresponding page listed below in Firefox and press F12 to show the Web Dev Tools window
  2. Go to Network Analysis and sort by Host
  3. On the page, click on "More..." below the corresponding items (see the "How to retrieve" column; some operations need no click)
  4. In the Web Dev Tools window, check the new entry whose File starts with "/?operationName=" to find the corresponding operation
  5. Copy the value from `Header Lines` -> `GET` -> `extensions` -> `sha256Hash` to the .json file

| Operation | GET-Operation-Name | Page | How to retrieve |
| --- | --- | --- | --- |
| AllAwardsEvents | AllEventsPage | https://www.imdb.com/event/all/ | no click necessary |
| AllTopics | TitleAllTopics | https://www.imdb.com/title/tt0068646/keywords/ | no click necessary |
| AlternateTitles | TitleAkasPaginated | https://www.imdb.com/title/tt0068646/releaseinfo/ | click on "More" |
| Awards | TitleAwardsSubPagePagination | https://www.imdb.com/title/tt0068646/awards/ | click on "More" |
| CompanyCredits | TitleCompanyCreditsPagination | https://www.imdb.com/title/tt0068646/companycredits/ | click on "More" |
| Connections | TitleConnectionsSubPagePagination | https://www.imdb.com/title/tt0068646/movieconnections/ | click on "More" |
| EpisodesCard | TMD_Episodes_EpisodesCardContainer | https://www.imdb.com/title/tt0072562/ | no click necessary |
| ExternalReviews | TitleExternalReviewsPagination | https://www.imdb.com/title/tt0068646/externalreviews/ | click on "More" |
| ExternalSites | TitleExternalSitesSubPagePagination | https://www.imdb.com/title/tt0068646/externalsites/ | click on "More" |
| FilmingDates | TitleFilmingDatesPaginated | https://www.imdb.com/title/tt0944947/locations/ | click on "More" |
| FilmingLocations | TitleFilmingLocationsPaginated | https://www.imdb.com/title/tt0068646/locations/ | click on "More" |
| Goofs | TitleGoofsPagination | https://www.imdb.com/title/tt0068646/goofs/ | click on "More" |
| Keywords | TitleKeywordsPagination | https://www.imdb.com/title/tt0068646/keywords/ | click on "More" |
| MainNews | TitleMainNews | https://www.imdb.com/title/tt0072562/ | only scroll down |
| News | TitleNewsPagination | https://www.imdb.com/title/tt0072562/news/ | click on "More" |
| NextEpisode | TMD_Episodes_NextEpisode | https://www.imdb.com/title/tt0072562/ | no click necessary |
| PlotSummaries | TitlePlotSummariesPaginated | https://www.imdb.com/title/tt4154796/plotsummary/ | click on "More" |
| Quotes | TitleQuotesPagination | https://www.imdb.com/title/tt0068646/quotes/ | click on "More" |
| ReleaseDates | TitleReleaseDatesPaginated | https://www.imdb.com/title/tt0068646/releaseinfo/ | click on "More" |
| Storyline | TMD_Storyline | https://www.imdb.com/title/tt0072562/ | only scroll down |
| Trivia | TitleTriviaPagination | https://www.imdb.com/title/tt0068646/trivia/ | click on "More" |
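
If you copy the whole request URL from the Web Dev Tools instead of a single value, the operation name and the hash can also be extracted programmatically. The sketch below is an illustration only: `PersistedQueryUrl` is a hypothetical helper, not part of this library, and it assumes the `extensions` query parameter is URL-encoded JSON that contains a `sha256Hash` string somewhere inside it (the exact nesting is not assumed). It uses System.Text.Json, which is built into current .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of tar.IMDbScraper.
public static class PersistedQueryUrl
{
    // Returns the operation name and the hash found in a captured request URL.
    public static (string OperationName, string Sha256Hash) Parse(string url)
    {
        // Split the query string into decoded key/value pairs.
        var query = new Uri(url).Query.TrimStart('?');
        var parameters = new Dictionary<string, string>();
        foreach (var pair in query.Split('&', StringSplitOptions.RemoveEmptyEntries))
        {
            var parts = pair.Split('=', 2);
            var key = Uri.UnescapeDataString(parts[0]);
            var value = parts.Length > 1
                ? Uri.UnescapeDataString(parts[1].Replace('+', ' '))
                : "";
            parameters[key] = value;
        }

        if (!parameters.TryGetValue("operationName", out var operationName))
            throw new FormatException("URL has no operationName parameter.");
        if (!parameters.TryGetValue("extensions", out var extensionsJson))
            throw new FormatException("URL has no extensions parameter.");

        using var document = JsonDocument.Parse(extensionsJson);
        var hash = FindString(document.RootElement, "sha256Hash")
            ?? throw new FormatException("extensions contains no sha256Hash.");
        return (operationName, hash);
    }

    // Depth-first search through nested JSON objects for a string property.
    private static string FindString(JsonElement element, string propertyName)
    {
        if (element.ValueKind != JsonValueKind.Object) return null;
        foreach (var property in element.EnumerateObject())
        {
            if (property.NameEquals(propertyName)
                && property.Value.ValueKind == JsonValueKind.String)
                return property.Value.GetString();
            var nested = FindString(property.Value, propertyName);
            if (nested != null) return nested;
        }
        return null;
    }
}
```

The returned operation name can be matched against the GET-Operation-Name column above to decide which entry of the .json file the hash belongs to.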

Usage

IMDbExporter