coursetable/ferry

Different classes reusing the same course code

codyjlin opened this issue · 13 comments

A very niche issue, brought to our attention by Professor Bensinger. We've only found this issue to affect one course code (CSTC 300). A fix could perhaps cross-validate classes using both the course code and the extended class name; currently we group courses by code alone, on the assumption that Yale does not reuse course codes.
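As a sketch of that cross-validation idea, grouping historical listings by (code, title) rather than by code alone would keep the two CSTC 300 offerings apart. The field names below are illustrative, not the actual ferry schema:

```python
from collections import defaultdict

def group_listings(listings):
    """Group listings by (course code, title) instead of code alone,
    so a reused code with a different title lands in a separate group.
    Field names ("code", "title", "term") are illustrative."""
    groups = defaultdict(list)
    for listing in listings:
        key = (listing["code"], listing["title"])
        groups[key].append(listing)
    return dict(groups)

listings = [
    {"code": "CSTC 300", "title": "Leadership as Behavior", "term": "201903"},
    {"code": "CSTC 300", "title": "Captivity & Law World History", "term": "201401"},
]
# The two CSTC 300 offerings end up in two distinct groups.
```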

CSTC 300 01 - https://coursetable.com/Table/201903/course/CSTC_300_1
2019 Fall - Leadership as Behavior
2014 Spring - Captivity & Law World History

This is a tough one. If we group courses by code and title, then we also have many false positives from courses changing their name.

For instance, CPSC 453 has been offered under three different titles: in Fall 2016 it was "Computational Mthds Biol Data"; in Fall 2017 it was "Machine Learning with Applications in Biology"; and now it's "Unsupervised Learning for Big Data".

The same course code probably wouldn't be recycled for a couple years after the original class was taught - do you think we could build a heuristic based on this assumption?
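One way to sketch that heuristic, assuming terms are the YYYYMM strings used in CourseTable URLs and picking a hypothetical two-year threshold:

```python
def looks_like_reuse(last_term_old_title, first_term_new_title, min_gap_years=2):
    """Heuristic sketch: treat a title change as likely code reuse only
    if at least `min_gap_years` elapsed between the last offering under
    the old title and the first under the new one. Terms are YYYYMM
    strings (as in CourseTable URLs); the 2-year threshold is a guess."""
    gap = int(first_term_new_title[:4]) - int(last_term_old_title[:4])
    return gap >= min_gap_years
```

For the CSTC 300 case (2014 Spring vs. 2019 Fall) the gap is five years, well over the threshold; a title change between consecutive years would not be flagged.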

On a closer look, CSTC corresponds to College Seminar: Trumbull College, which appears to be a special case in which codes are reused. For instance, we see the same thing with CSMC (Morse College Seminars) in Spring '14 (CSMC 300: Real Estate in Econ & Society) and in Fall '19 (CSMC 300: Hip Hop Music and Culture).

Maybe we should just disable historical evaluations for college seminars? Looks like none of them are re-offered. If we want to be more careful, we can also restrict the title+course code matching to college seminars.
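Detecting college seminars could be as simple as checking the subject prefix against a hand-maintained list. Only CSTC and CSMC appear in this thread; the other residential colleges' subject codes would be added the same way:

```python
# Hypothetical hand-maintained set; CSTC and CSMC come from this thread,
# and the remaining residential-college seminar subjects would be added here.
COLLEGE_SEMINAR_SUBJECTS = {"CSTC", "CSMC"}

def is_college_seminar(course_code):
    """Return True if the course code's subject prefix is a known
    residential-college seminar subject."""
    subject = course_code.split()[0]
    return subject in COLLEGE_SEMINAR_SUBJECTS
```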

Yup, that should work for college seminars. I think we'll still need something for instances of this issue outside of college seminars.

If we pull out the course codes that have had multiple titles over the years, we get 4,869 total. For just the college seminars, we find 23 instances.
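Pulling out the multi-title codes amounts to a group-by from code to the set of titles seen under it. The rows below are examples taken from this thread, not real query output; note that CPSC 453 gets flagged too, even though it's just a rename (a false positive):

```python
from collections import defaultdict

# Illustrative rows drawn from examples in this thread.
rows = [
    ("CSMC 300", "Real Estate in Econ & Society"),
    ("CSMC 300", "Hip Hop Music and Culture"),
    ("CPSC 453", "Computational Mthds Biol Data"),
    ("CPSC 453", "Unsupervised Learning for Big Data"),
]

# Map each course code to the set of titles it has appeared under.
titles_by_code = defaultdict(set)
for code, title in rows:
    titles_by_code[code].add(title)

# Codes with more than one title are candidates for reuse (or renames).
multi_title_codes = sorted(c for c, titles in titles_by_code.items() if len(titles) > 1)
```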

See https://gist.github.com/kevinhu/0f558a9236f4514f415eddb4199fc9ff

Hmm so dealing specifically with college seminars won’t yield a meaningful improvement.

Can we take a look at the time intervals between these changes and then inspect manually to see if there’s a pattern?

We’ll also need to think about how to get this logic into the frontend - we’ll probably need to add an additional field with a course exclusion list or something

There doesn't seem to be much of a pattern by time interval among these groups – note that it's especially hard to tell because a lot of these are false positives (at least judging by the description). Maybe we could partition by similar descriptions instead? We could do a text-distance heuristic.
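A minimal version of that text-distance heuristic, using the stdlib's difflib as a stand-in for whatever edit-distance library the notebook uses:

```python
from difflib import SequenceMatcher

def description_similarity(a, b):
    """Similarity ratio in [0, 1] between two course descriptions.
    difflib.SequenceMatcher is a stdlib stand-in here; the notebook
    itself uses a dedicated edit-distance library."""
    return SequenceMatcher(None, a, b).ratio()
```

Descriptions of two genuinely different courses sharing a code should score low, while a lightly reworded description of the same course should score high.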

Yeah I would be curious to see how well that works

I took a look at the maximum text distance between the descriptions in these groups - it looks like there isn't a clear cutoff value we can use, but at edit distances around 256 and above we start to see clear examples of course code reuse. (You can see a few in the updated notebook: https://gist.github.com/kevinhu/0f558a9236f4514f415eddb4199fc9ff.)


Yeah not what we were hoping for there - I guess there’s some low-hanging fruit on either end, but the stuff in the middle will likely be pretty difficult.

A couple questions/suggestions:

  • can you include in your Python notebook the last year a course was taught under one title vs. the first year it had a new title?
  • does the edlib library that we’re using take into account the length of the text itself?

Updated it to include the years - looks like we see code reuse even in the same year. For instance, in 2013, AFAM 353 was the listing for "Black British Art and Culture" as well as "Punishment and Inequality".

The edlib distance is the raw edit distance, but I've now normalized that by the length of the longer of the compared strings.
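The normalization amounts to dividing the raw distance by the longer string's length, so values fall in [0, 1] regardless of description length. A pure-Python Levenshtein stands in here for edlib's alignment call:

```python
def levenshtein(a, b):
    """Raw edit distance (the same quantity edlib computes), written in
    pure Python for illustration."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```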


By the way, here's what the relationship between title and description distance looks like:

[scatter plot: title distance vs. description distance]

We can be pretty confident by just selecting the cluster at the top right, but it looks like there is still no clear cutoff region.
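Selecting that cluster would look something like the following, with placeholder cutoffs that would have to be read off the plot:

```python
# Illustrative thresholds on the normalized distances; the real cutoffs
# would be chosen by inspecting the title-vs-description scatter plot.
TITLE_CUTOFF = 0.8
DESC_CUTOFF = 0.8

def confident_reuse(title_dist, desc_dist):
    """Flag a course code as reused only when both the titles and the
    descriptions are very different (the top-right cluster)."""
    return title_dist >= TITLE_CUTOFF and desc_dist >= DESC_CUTOFF
```

Everything outside that corner would stay untouched, which matches the goal of never hiding an evaluation unnecessarily.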

Yeah I’m not sure how we should proceed here - I definitely don’t want to hide a course’s evaluation unnecessarily, since it would erode user trust in the data.

One potential approach would be to make a change for the ones we’re certain about, and then add a “report a data issue” button to the frontend and handle the rest of the issues as they come up.

I would guess that these aren’t too frequent, and that many of the title changes we do see here are just things like ENGL 114 where the titles change very frequently.