Request for Comments: Data Science Curriculum v2
waciumawanjohi opened this issue · 18 comments
Problem:
The curriculum has not been maintained and does not represent best practice.
Duration:
2020-08-31
Background:
OSSU recommends courses that would constitute an undergraduate major in Data Science. It is our responsibility to ensure that we follow best practice. To do so, we must bring the curriculum into alignment with external guidelines. A candidate set of guidelines has been identified and previously proposed.
In 2017, the Annual Review of Statistics and Its Application published the report "Curriculum guidelines for undergraduate programs in data science." The report was authored by “25 undergraduate faculty from a variety of institutions in the United States, primarily from the disciplines of mathematics, statistics, and computer science.” It had a goal of providing “structure for institutions planning for or revising a major in data science.”
The current state of OSSU Data Science is one of disrepair. The curriculum has had 1 change in 3 years. That change deleted a link to a broken application. But there remained many links to courses that are no longer offered. A list of these can be found here. Prospective students have posted in the issues asking if the Data Science curriculum is still maintained. Updating the curriculum must ensure that all courses are available for students.
Proposal:
OSSU Data Science should adopt “Curriculum guidelines for undergraduate programs in data science” (CGUPDS) as our guidelines. The curriculum should be updated to match. The exact changes can be reviewed in this pull request.
It might be helpful to have a direct link to the guidelines so it's easier to read and comment:
https://www.amstat.org/asa/files/pdfs/EDU-DataScienceGuidelines.pdf
Very short and nice read, just 16 pages.
Looks like you already made all the necessary course changes in the pull request. All the links are alive and the courses cover the Key Competencies on Page 6, and the Six Main Subject Areas & Outline on Page 9 very well. Really excellent work!
The topic progression path in the pull request looks somewhat different than the possible path in Figure 1, Page 12 of the guidelines. I think it's fine, but maybe you could explain a little bit for those that are curious?
One question to which I do not see an immediate answer is, where would the "Capstone Experience" and "Course in an outside discipline" mentioned in the Outline on Page 9 come from? Are they contained in some of the courses in the curriculum?
By the way I don't know much about data science at all, and I don't have a horse in this race. Just trying to be helpful.
For suggested changes at #60 (comment) regarding Algorithms Part 1 and 2
These courses are taught entirely in Java and have a steep learning curve from the start. Those coming straight from Python/R/Julia will have a hard time adjusting to both the course materials and the programming syntax. I suggest an optional course on Java as a prerequisite, specifically Java Programming I and II by the University of Helsinki, which gives college credit to Finland residents.
I certainly think adding resources in Extras for teaching Java would be appropriate, as would adding a note in the main curriculum that those resources are available.
The University of Helsinki courses are high quality and I have no objection to listing them as a resource.
One other option to keep in mind is Computer Science: Programming with Purpose. One thing to recommend this alternative is that it is taught by the same instructor as the Algorithm courses. This could be used instead of Introduction to Computer Science and Programming Using Python and Introduction to Computational Thinking and Data Science, or in addition to them.
Yeah, that's not an easy choice. The MITx pair go well together as a series just like Sedgewick's series.
On one hand, MITx uses Python, which is what most people will be programming with, but Intro to Computational Thinking might not be as rigorous as the alternative and covers a range of things implemented in Python (distributions, Monte Carlo, etc.). This would be helpful for the DS student as it'll give more practice in a language they'll definitely encounter. It'll also reinforce topics covered in probability and statistics.
On the other, Sedgewick's will give you a very thorough understanding of algorithms specifically and the textbooks are available online, with updates and resources. Learning Java will also be good for anyone that will work at larger companies and be exposed to these types of codebases, so this route would be good for something you 'might' encounter.
Personally, I think the Sedgewick combination would be best in the CS curriculum, mainly because it's more aligned with CS than DS in my opinion as I don't think they're as necessary for machine/deep learning. They would be if you were programming the libraries themselves, but that's why I think they're more relevant for CS.
Definitely would suggest having in the DS curriculum as an extra though.
> For suggested changes at #60 (comment) regarding Algorithms Part 1 and 2
> These courses are taught entirely in Java and have a steep learning curve from the start. Those coming straight from Python/R/Julia will have a hard time adjusting to both the course materials and the programming syntax. I suggest an optional course on Java as a prerequisite, specifically Java Programming I and II by the University of Helsinki, which gives college credit to Finland residents.

Head First Java might be an excellent, beginner-friendly option.
Regarding the Algorithms section.
The OSSU route for CS suggests the Algorithms specialization from Stanford on Coursera: https://www.coursera.org/specializations/algorithms
The DS major suggests Algorithms Part I & II from Princeton on Coursera: https://www.coursera.org/learn/algorithms-part1
Would there be value in using the same set of courses to cover algorithms between both programs?
> Would there be value in using the same set of courses to cover algorithms between both programs?
Yes, there would be. While Discord channels for the Data Science individual courses have not been added yet, they will be in the future. If Data Science and Computer Science students are in the same course, they can be in the same discussion rooms, increasing critical mass for productive peer learning.
The natural next question is: Why does the proposal include a different algorithms course for Data Science?
Essentially, computer scientists need to know more about complexity and computability than data scientists do. Some CS2013 requirements are:
- Greedy algorithms
- Dynamic Programming
- Introduction to the P and NP classes and the P vs. NP problem
- Introduction to the NP-complete class and exemplary NP-complete problems (e.g., SAT, Knapsack)
These match up with the 3rd and 4th Stanford algorithms courses, which teach:
- Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming
- Shortest Paths Revisited, NP-Complete Problems and What To Do About Them
The CGUPDS, by contrast, requires:
- Programming concepts and data structures: Students should have the knowledge to implement their algorithms using procedural and functional programming techniques and their associated data structures, including lists, vectors, data frames, dictionaries, trees, and graphs.
This is a decent fit for Princeton's Algorithms, which teaches:
- Chapter 1: Fundamentals introduces a scientific and engineering basis for comparing algorithms and making predictions. It also includes our programming model.
- Chapter 2: Sorting considers several classic sorting algorithms, including insertion sort, mergesort, and quicksort. It also features a binary heap implementation of a priority queue.
- Chapter 3: Searching describes several classic symbol-table implementations, including binary search trees, red–black trees, and hash tables.
- Chapter 4: Graphs surveys the most important graph-processing problems, including depth-first search, breadth-first search, minimum spanning trees, and shortest paths.
- Chapter 5: Strings investigates specialized algorithms for string processing, including radix sorting, substring search, tries, regular expressions, and data compression.
- Chapter 6: Context highlights connections to systems programming, scientific computing, commercial applications, operations research, and intractability.
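As a rough illustration of that fit, the sketch below combines two of the CGUPDS data structures (a dictionary and lists) with one of the Chapter 4 graph algorithms (breadth-first search). It is written in Python rather than the Java the Princeton course uses, and it is not taken from either source; `bfs_order` and the sample `graph` are hypothetical names chosen for this example.

```python
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from `start` in breadth-first order.

    `graph` is an adjacency list stored as a dictionary that maps each
    node to a list of its neighbors.
    """
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical toy graph: A points to B and C, which both point to D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

print(bfs_order(graph, "A"))  # ['A', 'B', 'C', 'D']
```

The same idea carries over to Java with a `HashMap` and a queue, which is roughly the kind of translation a Python-first student would have to make in the Princeton course.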
I think that curricular fit here is an overriding concern, but I'm interested to hear the opposing case.
> I think that curricular fit here is an overriding concern, but I'm interested to hear the opposing case.
I agree with your concern. The default proposed curriculum should cover the material in CGUPDS and not try to cover an inordinate amount of additional material.
I think another possibility would be to list courses that appear in both curricula as alternatives where appropriate.
For example, the DS curriculum could list the Stanford Algorithms Specialization as an alternative that also fulfills the DS algorithms requirement, with the caveats that the Stanford specialization covers more material and requires a larger time commitment.
This may help capture the benefit you mentioned:
> If Data Science and Computer Science students are in the same course, they can be in the same discussion rooms, increasing critical mass for productive peer learning.
If an acceptable alternative is present in the CS program, then listing it would seem to facilitate this goal.
Have you guys seen The Open Source Data Science Masters website, the GitHub of Siraj Raval (a data science YouTuber), and Data Science From Scratch? Maybe they have good guidelines and course options for the new DS curriculum... I don't know... just giving suggestions...
I just found this RFC on Friday 8/28, and I haven't yet had a chance to deep-dive it, but in the spirit of commenting before the close date of the RFC, I have a few thoughts:
- My biggest criticism of the program as it's currently presented is we basically say, "So you want to learn Data Science, do ya? Great -- go take four semesters of math first!" What a disappointment. Calculus doesn't come easily for most, and to suggest you have to be competent in mathematics as a prerequisite is somewhat discouraging to those starting from zero.
- More importantly, it's also unnecessary. One could explore the concepts of data science -- especially classification problems -- with little more than an understanding of how Euclidean distance works. (SVM, k-means, and k-nearest neighbors have some fancy math behind the scenes, but understanding how they work at a naive level is no more complicated than calculating the distance between two points. And you get some really cool results early on.)
- An exploration of the concepts of data science can also get people motivated and willing to take on four to six semesters' worth of math courses via independent study.
- Doing this mirrors what we do in our computer science curriculum: give new entrants a taste of the good stuff, up front. The core of the program is the "How To Design Programs" series, but there's a reason we don't dump new entrants there first.
- Key Recommendation: Find a suitable early-entry data analysis course to offer in parallel with LAFF. MIT's 6.00.2x might fill that role.
- The OSSU curricula (especially for computer science) strive to be platform-agnostic, and tend to be more interested in academic rigor than in practical skills. But in my personal view, data science is at its heart a skill-based discipline, with some sciencey aspects involved to justify assumptions with a repeatable approach. The output of a data science endeavor is inherently based on value -- a "goodness" of model fit, a suitability for a practical purpose.
- With this in mind, should we give some thought to practical skills classes early in the curriculum? Even if they're not in the format of full semester classes, perhaps "lab-based" short courses in day-to-day workflow tasks like repository management, data scraping/munging/cleanup/tidying, "practical" R, an exploration of Wickham's Tidyverse methods, etc.
- This will give students something else to work on during the skill-building phase, while they're trying to climb Mount Mathematics.
- Key Recommendation: Provide skill-based workflow classes to augment the early program experience. An intro to programming in R course, followed by a treatment of Wickham's R For Data Science book, might fill that role.
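To make the Euclidean-distance point concrete, here is a minimal sketch of a naive k-nearest-neighbors classifier in plain Python. It is only an illustration under stated assumptions: the function names (`euclidean_distance`, `knn_predict`) and the toy data are hypothetical and do not come from any course in the curriculum.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training_data, query, k=3):
    """Label `query` by majority vote among its k closest training points.

    `training_data` is a list of (point, label) pairs.
    """
    nearest = sorted(
        training_data,
        key=lambda item: euclidean_distance(item[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in the plane.
training_data = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.5), "red"),
    ((6.0, 6.0), "blue"),
    ((6.5, 7.0), "blue"),
    ((7.0, 6.5), "blue"),
]

print(knn_predict(training_data, (2.0, 2.0)))  # red
print(knn_predict(training_data, (6.0, 7.0)))  # blue
```

Nothing here needs more than the distance formula and a majority vote, which is the kind of early result a student could reach before any calculus.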
Your first point is well taken. And I suspect easily addressed. The curriculum has essentially two parallel tracks:
As stated in the draft:
Order of the classes
Some courses can be taken in parallel, while others must be taken sequentially.
All of the courses within a topic should be taken in the order listed in the curriculum.
The graph below demonstrates how topics should be ordered.
It sounds like your first point could be addressed by simply listing the computer science courses first. The very course that you mention, MIT's 6.00.2x, is already part of the introduction to computer science group.
On your second point, there are also practical tools and methods courses added:
Data Science Tools & Methods
I'm certainly open to suggestions for changing these courses for other ones, or for adjusting their place in the curriculum.
@waciumawanjohi Thank you for your responses here. I should clarify, I was looking at the current V1 curriculum while developing my comments above. I'll take a closer look to see how these ideas are proposed to be implemented in the V2 and comment further.
Great! It sounds like we're thinking in similar directions.
Can I start my first course from the V2 Curriculum or should I wait a little longer?
@EWCunha I would recommend any student that's starting now use V2.
I have reviewed the proposed CGUPDS curriculum guidelines and the candidate V2. Overall, I like the thrust and structure of the new program and I have no exceptions or recommendations for substantive changes at this time. These are good selections, and seem to fit the curriculum guidelines well.
A couple of comments:
- The MIT 6.00.1x and 6.00.2x series are great choices for a second-level programming course sequence. You currently recommend that students new to programming complete the Py4E course, and that "students who already know basic programming in any language can skip this course".
- A common student complaint with the 6.00.1x course is that it doesn't actually teach Python. My experience with 6.00.1x was that the course doesn't focus on building skill with the language, but still expects students to be functionally fluent in Python by the end of Week 2 of the course. For MIT on-campus students, that's probably not too big of an ask. I benefited from having a Python-focused training course before starting 6.00.1x and expect others will too.
- If the point is to shortcut Py4E, I might recommend the Google Crash Course in Python (on Coursera), which can be done in 30 hours of estimated total time... realizing that might not be shaving much off of Py4E. (Currently, only the end-of-module quizzes are paywalled -- though they do appear to be of good practice value.)
- If the point is to not overly bore students with information they've likely seen elsewhere, perhaps we can choose some specific sections of Py4E that better focus on teaching the particulars of the language.
- I noticed that the curriculum borrows heavily from both the IBM Data Science Specialization/Certificate programs on Coursera and the Harvard Data Science Certificate program on edX, but awards neither certificate. Taking the IBM "Databases and SQL For Data Science" course, along with the other courses in the curriculum, would complete the IBM Introduction to Data Science Specialization (and grant a certificate).
- For your Databases section, you do recommend the entirety of the University of Colorado "Data Warehousing for BI" specialization. For that reason, I'm loath to recommend skipping a course there to emphasize a certificate program elsewhere. And I'm sure we're not interested in duplication of effort... it just seems an unfortunate situation.
- Maybe it's enough to point out that the Coursera IBM specialization is one course away, and you could go pick it up for extra credit. Maybe we decide that external validation in the form of certificates is totally irrelevant to the point of OSSU, and that's reason enough to leave it as-is.
- I'd prefer keeping the continuity of the UColorado database courses. You've put a great deal of thought into these course selections and I'm interested to hear your ideas about the database courses.
- I have a general feeling that more of the Data Science coursework could be moved forward in time. But I don't want to create three parallel tracks... the current position is probably better pedagogically. Still, I feel like some of the data science coursework could inform the need / provide the motivation for five courses in data warehousing.
- I don't have a good solution for this. I mostly wanted to see if anybody else felt this way. If it's just me, I'm biased toward accepting the program as-is and letting students stress-test it and see what they think. We can adjust later on if we decide it's necessary.
Close of the Comment Period
Findings:
The proposal does not prepare students to work in Python or Java.
The proposal does not address either a capstone experience, or recommend work in other disciplines.
Response:
We should absolutely give students an approachable onramp for learning the languages of instruction. For Python, it makes sense to use Py4E, which is also the intro to programming class for the CS curriculum. For Java, the U Helsinki course is high quality and free. It has long been part of the CS curriculum's extras pages. The textbook Head First Java was mentioned, but it is not a free text.
CGUPDS makes mention of needing a course in another discipline, but gives this recommendation not even a paragraph of support. It strikes me as similar to a recommendation for a balanced liberal arts education. And while I highly value such an education, that's different from the goal of OSSU. OSSU supports the study of particular domains and leaves the rounding out of other domains as an exercise for the learner.* As such, no work in other disciplines is contained in this revision.
OSSU should recommend how students can undertake a capstone experience. I don't have an answer for this question at the moment. This is left undone. I hope that contributors can propose and discuss options, either in the Issues here or in the OSSU Discord.
Conclusion:
The proposed changes will be merged, with the addition of intro courses for programming in Python and Java.
...In going to add Py4E to V2, I noticed that it is already in the curriculum. oof.