Class name contains extra alternative class codes
zhumingcheng697 opened this issue · 2 comments
There are some cross-listed classes whose name field contains extra alternative class codes.
Example 1: https://nyu.a1liu.com/api/courses/ja2023/STS-UY
{
  "name": "| LA-UY 143 | PL-UY 2064 | STS-UY 2144 Ethics and Technology",
  "deptCourseId": "347",
  "subjectCode": "STS-UY"
}
Example 2: https://nyu.a1liu.com/api/courses/su2022/CS-GY
{
  "name": "| CS-UY 3083 Introduction to Databases",
  "deptCourseId": "608",
  "subjectCode": "CS-GY"
}
I am cleaning the class name by replacing matches to the following regex with an empty string in my client:
^(?:\| +[A-Z][A-Z0-9]+-[A-Z]{2,3} [0-9]+[A-Z0-9]* +)+
Basically, I am assuming that:
- the “department segment” of the subject code will always match [A-Z][A-Z0-9]+
- the “school segment” of the subject code will always match [A-Z]{2,3}
- the “department segment” and “school segment” of the subject code are combined with a hyphen (-)
- the department course id will always match [0-9]+[A-Z0-9]*
- the subject code and department course id are combined with a single space
- the class codes are separated by a pipe (|)
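For anyone who wants to try this outside my client, here is a minimal Python sketch of the same cleanup. The regex is the one quoted above; the names `CROSS_LISTED_PREFIX` and `clean_class_name` are made up for illustration and are not part of the scraper or the API.

```python
import re

# Regex copied from the issue text above. The constant and function names
# are hypothetical and exist only for this illustration.
CROSS_LISTED_PREFIX = re.compile(
    r"^(?:\| +[A-Z][A-Z0-9]+-[A-Z]{2,3} [0-9]+[A-Z0-9]* +)+"
)


def clean_class_name(name: str) -> str:
    """Remove leading '| SUBJECT-SCHOOL COURSEID ' groups from a class name."""
    return CROSS_LISTED_PREFIX.sub("", name)


examples = [
    # The two names from the API responses quoted in this issue.
    "| LA-UY 143 | PL-UY 2064 | STS-UY 2144 Ethics and Technology",
    "| CS-UY 3083 Introduction to Databases",
    # A name with no leading class codes is returned unchanged.
    "Introduction to Databases",
]

for raw in examples:
    print(repr(clean_class_name(raw)))
```

Because the pattern is anchored at the start of the string and only accepts uppercase subject codes, a prefix that breaks any of the assumptions above (for example a lowercase code) is left in place rather than partially stripped.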
I intentionally made this regex strict: some class names might not be cleaned up properly, but correct class names are less likely to be mangled by the regex.
What is your opinion on adding a similar cleanup to the scraper?
The first example has been fixed, and the second example is in the process of being re-scraped. I'll be doing a full re-scrape of the entire DB to fix up the remainder of the semesters as well.