udacity/AIND-NLP

Scraping Does Not Support Javascript Evaluation

petermetz opened this issue · 1 comments

The BeautifulSoup library does not evaluate JavaScript on the pages it scrapes. The notebook's code appears to be broken, probably because the Udacity main site switched to client-side rendering at some point, so the list of courses no longer appears in the HTML that BeautifulSoup parses.

The offending line is this piece of Python code, which finds exactly zero elements:

# Find all course summaries
summaries = soup.find_all("div", class_="course-summary-card")
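To illustrate why the lookup comes back empty, here is a minimal sketch that runs the same `find_all` call against two hand-written HTML strings. Both strings are hypothetical stand-ins, not the actual Udacity markup: `STATIC_SHELL` imitates what a client-side-rendered page might return before any script runs, and `RENDERED` imitates the DOM after a browser has executed the script. The helper `count_summaries` is likewise made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML as served by a client-side-rendered site:
# an empty mount point plus a script tag, no course cards yet.
STATIC_SHELL = """
<html>
  <body>
    <div id="app"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""

# Hypothetical HTML after a browser has executed the script.
RENDERED = """
<html>
  <body>
    <div id="app">
      <div class="course-summary-card">Course A</div>
      <div class="course-summary-card">Course B</div>
    </div>
  </body>
</html>
"""


def count_summaries(html):
    """Count course summary cards using the same lookup as the notebook."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_="course-summary-card"))


print(count_summaries(STATIC_SHELL))  # no cards in the unrendered shell
print(count_summaries(RENDERED))      # cards present once scripts have run
```

BeautifulSoup only parses the markup it is handed, so if the cards are injected by a script after page load, they have to be rendered by something else (a headless browser, for instance) before parsing.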

I recommend updating the notebook code to use PhantomJS for script evaluation during scraping:
https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python

Those who would rather stay with BeautifulSoup and keep close to the structure of the course material can scrape the Harvard University Online Courses site instead of the Udacity site.