aspc/mainsite

Write scraper to scrape reviews from ratemyprofessors.com

ZiqiXiong opened this issue · 11 comments

We are in dire need of reviews for courses offered at the other four colleges. Data from Rate My Professors could be a great boost. It would also give students at the other four colleges a good reason to use ASPC once they can log in.

Check out this existing scraper on npm: https://www.npmjs.com/package/rmp-api. (Although, I am told that all the cool kids are using yarn now.)

I would be seriously concerned about scraping and repackaging their review data though. I think that would open us up to copyright/legal issues.

@mattdahl Thanks! Any suggestion on how to avoid any potential legal issues?

@mattdahl Btw the existing scraper you mentioned is down. If we want to pursue this idea, we need to write our own scraper.

Yeah (hence legal issues), but the code is still available. There are probably some Python packages out there too, but I just thought the API that one exposed was very clean. But as you point out, running a Node instance ourselves might be too much trouble.

My concern with the legality is that RMP's data is surely proprietary. It would be like taking the reviews from Yelp to populate a Yelp clone. If you simply linked to the appropriate RMP page for a course/teacher that was missing reviews, I think that would be okay, but we can't just scrape (steal) their data and serve it ourselves.

@mattdahl How about we serve their reviews but make it clear that they are from ratemyprofessors.com and provide a link to the original page?

I don't know, I would be cautious. See, e.g. https://www.ratemyprofessors.com/TermsOfUse_us.jsp#section4

You shall not, nor will you allow any third party (whether or not for your benefit) to reproduce, modify, create derivative works from, display, perform, publish, distribute, disseminate, broadcast or circulate to any third party (including, without limitation, on or via a third party website), or otherwise use, any Material without the express prior written consent of VII or its owner if VII is not the owner. Any unauthorized or prohibited use of any Material may subject you to civil liability, criminal prosecution, or both, under applicable federal, state and local laws.

Sounds scary, but I feel that in the worst case they will just ask us to take it down, like the npm API thing. @KentShikama Any comment?

How about just a link to the professor's RMP page?

Could ASPC course reviews be opened to all of the 5Cs? Currently I have to inconvenience my PO friend whenever I want to see ASPC course reviews. Furthermore, students from the other colleges would be able to pitch in their reviews for the other 4C courses.

@mfeng1904
A link to the professor's page is okay but really a suboptimal choice, since RMP groups reviews by professor while our course review groups them by course. Ideally we only want the relevant reviews for each course.
Last time I checked with Gloria about 5C login, she said "Disaster but it's getting done".

From my experience with RMP, people don't always thoroughly fill out the "courses taken" box and sometimes just comment on the professor's personality. However, this has still been helpful to me: if a professor is teaching a course I'm interested in for the first time, it's good to know whether the prof is friendly. Implementation-wise, our scraper would probably need some NLP capabilities to accurately group reviews by course, which sounds pretty tough. Maybe we could just have an RMP page link next to the professor's name in the course info.

Good to know it's being worked on :). Can we not use Sakai's 5C login system? I notice it's the same login that lets me access Lynda, Honnold library, etc.

@mfeng1904 I don't think NLP is required. Most reviews have the correct course number (example), so given a review, all we need is to find the course with that number in the professor's department. If the legal issue is really the problem, we can think of a way around that, but the scraper is definitely doable.
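A rough sketch of that lookup, to show it needs no NLP. Everything here is an assumption for illustration: the `Course` class, the `course_number` and `match_review_to_course` helpers, and the sample department codes are all hypothetical, not our actual models or RMP's data format.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Course:
    # Hypothetical stand-in for a course record, not the real model.
    department: str  # e.g. "CSCI"
    number: str      # e.g. "051"
    title: str


def course_number(label: str) -> Optional[str]:
    """Return the first run of digits in a free-form label, without leading zeros.

    "CS51", "cs 051" and "051" all normalize to "51"; a label with no
    digits returns None.
    """
    match = re.search(r"\d+", label)
    if match is None:
        return None
    return match.group().lstrip("0") or "0"


def match_review_to_course(
    review_label: str,
    professor_department: str,
    courses: Iterable[Course],
) -> Optional[Course]:
    """Find the one course in the professor's department with the review's number.

    Returns None when the label has no number, or when zero or several
    courses match, so ambiguous reviews are skipped rather than guessed.
    """
    number = course_number(review_label)
    if number is None:
        return None
    candidates = [
        course
        for course in courses
        if course.department == professor_department
        and course_number(course.number) == number
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Reviews that only talk about the professor (no course number) simply fall through as `None`, and so do ambiguous cases such as a department that has both 051 and 051L.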
If CAS login doesn't work, I think it is indeed time for us to implement our own login. We can allow students from other colleges to register with their school email. @mattdahl ?