BUGS-NYU/schedge

Class name contains extra alternative class codes

zhumingcheng697 opened this issue · 2 comments

There are some cross-listed classes whose name field contains extra alternative class codes.

Example 1: https://nyu.a1liu.com/api/courses/ja2023/STS-UY

{
  "name": "| LA-UY 143 | PL-UY 2064 | STS-UY 2144 Ethics and Technology",
  "deptCourseId": "347",
  "subjectCode": "STS-UY"
}

Example 2: https://nyu.a1liu.com/api/courses/su2022/CS-GY

{
  "name": "| CS-UY 3083 Introduction to Databases",
  "deptCourseId": "608",
  "subjectCode": "CS-GY"
}

I am cleaning the class name by replacing matches to the following regex with an empty string in my client:

^(?:\| +[A-Z][A-Z0-9]+-[A-Z]{2,3} [0-9]+[A-Z0-9]* +)+

Basically, I am assuming that:

  • the “department segment” of the subject code will always match [A-Z][A-Z0-9]+
  • the “school segment” of the subject code will always match [A-Z]{2,3}
  • the “department segment” and “school segment” of the subject code are combined with a hyphen -
  • the department course id will always match [0-9]+[A-Z0-9]*
  • the subject code and department course id are combined with a single space
  • the class codes are separated by |

I intentionally made this regex strict: some class names might not get cleaned up properly, but correct class names are much less likely to be mangled by it.
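For reference, here is a minimal sketch of the cleanup in Python, using the regex above and the two example names from this issue. The names `ALT_CODE_PREFIX` and `clean_class_name` are just illustrative, not anything from schedge or from my actual client code.

```python
import re

# Regex from this issue: one or more leading "| SUBJECT-SCHOOL COURSEID " groups.
ALT_CODE_PREFIX = re.compile(
    r"^(?:\| +[A-Z][A-Z0-9]+-[A-Z]{2,3} [0-9]+[A-Z0-9]* +)+"
)


def clean_class_name(name: str) -> str:
    """Strip leading alternative class codes; names without them are unchanged."""
    return ALT_CODE_PREFIX.sub("", name)


if __name__ == "__main__":
    samples = [
        "| LA-UY 143 | PL-UY 2064 | STS-UY 2144 Ethics and Technology",
        "| CS-UY 3083 Introduction to Databases",
        "Ethics and Technology",  # already clean, should pass through as-is
    ]
    for sample in samples:
        print(repr(clean_class_name(sample)))
```

Because the pattern is anchored at the start and requires the leading `|`, a name that does not begin with a class code is returned untouched.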

What is your opinion on adding a similar cleanup to the scraper?

A1Liu commented

Yeah, this is definitely a cleanup that should happen scraper-side. I'll investigate this further, since it looks like these are examples of cross-subject courses. For now, your regex looks good, since it does something similar to what the scraper does when parsing the scraped source:

[Screenshot: Screen Shot 2022-11-21 at 12 41 04 PM]

A1Liu commented

The first example has been fixed, and the second example is in the process of being re-scraped. I'll be doing a full re-scrape of the entire DB to fix up the remainder of the semesters as well.