/datascience_resources

List of Data Science resources: useful papers, posts, books and repos

What to read, listen to and watch in Data Science

This is a small, carefully curated collection of resources centered on Data Science, Machine Learning and Artificial Intelligence. These resources vary, but in general they are not a collection of state-of-the-art academic papers. They are more a collection of posts, talks and lectures that have defined and shaped what Data Science and Applied Machine Learning are today. What's important, what's not, how to work, how not to work, etc. Some of these resources have been copied shamelessly all over the internet, but here you can find the original sources. It's more of a personal collection for me to understand the industry that I work in and not to lose sight of what's important and what's just hype. Honest, rigorous, thoughtful and relevant reflections about what Data Science is and should be.

Some of them I read or follow daily or weekly and some I just have a look at every once in a while to see what's up. I hope you find them as interesting as I do.

Books:

The only two technical textbooks that I have read from the first to the last page are the Statistical Learning pair:

By far the most useful. Very well structured, entertaining and located in the sweet spot between introductory and reference textbook. I think I have read the chapter on Model Selection of the first book at least three times. If you have time and interest to read two books regarding ML, just read these two. Period.

Lately I have been trying to learn the basics of Deep Learning in a more rigorous way. I did the Deep Learning specialization on Coursera, which I completely recommend, but for quick questions or to refresh things I usually consult the following two ebooks:

The Nielsen one is my go-to resource when I need to resolve a quick question regarding DL. It's rigorous, practical, thoughtful and damn useful.

In terms of Reinforcement Learning, it will be no surprise that I strongly recommend the Sutton and Barto classic:

  • Reinforcement Learning: An Introduction. I have not really gotten into RL in depth yet, but the chapters that I have read are all thoughtful, entertaining and as clear as a topic this complex can be.

Finally, my favourite technical book I have ever read: Statistical Rethinking by Professor Richard McElreath. I sometimes feel that statistical rigour is a casualty in the hype war of Machine Learning, and it's such a shame. I believe most of the problems we face as data scientists are solvable with coherent data, simple models and rigorous statistical analysis. Professor McElreath focuses on Bayesian statistics, but he starts from very basic concepts such as uncertainty and probability and works from there. Even if in your day-to-day work you mainly use frequentist statistics (which is totally fine, by the way), being able to think in a more Bayesian way is very useful and very healthy.

Posts:

It seems that nowadays everybody writes Data Science posts. About anything really: DL, ML engineering, communication, A/B testing, etc. The vast majority of them are just useless. I don't want to sound harsh, but it's the truth. It looks as if publishing regularly has become a mandatory part of career development. Having something to say, sadly, is not that common. However, among them you can find some useful ones and, over the years, just a few that I find truly exceptional in some sense. I have curated a collection of my personal favourites. In general they are not very technical; most of them focus on the nature of our job, what to do, how to bring real value, how to organize, etc. I hope you find them useful:

  • The AI Hierarchy of Needs by Monica Rogati. You have probably seen her pyramid-shaped figure of DS needs all over the internet, copied shamelessly. This is the original source, accompanied by a wonderful article. To this day I consider it the best description of the DS needs any organization can have.

  • The Data Science Venn Diagram by Drew Conway. Best description of what Data Science really is and what it is composed of: Hacking Skills, Math & Statistics and Substantive Expertise (or business domain). That's about it. Simple, true, elegantly displayed.

  • Information Platforms and the Rise of the Data Scientist by Jeff Hammerbacher. Jeff led the data team at Facebook and later cofounded Cloudera. During his time at Facebook he came up with a great definition of a role that was emerging at the time: the data scientist. A decade later his definition is still on point:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

  • Scaling Knowledge at Airbnb by Chetan Sharma and Jan Overgoor. Communicating Data Science is as important as doing it. IMHO Airbnb is the best company out there at communicating knowledge within its structure. The concept is especially important when a company is in a state of hyper-growth. How do we make the insights we produce from data spread throughout the whole organization? Here's the answer.

  • The Scientific Paper is obsolete by James Somers. This is a more general, less operational post. It deals with the form in which we communicate science and knowledge in general, and how now more than ever it needs an interactive approach where possible. Good examples of this approach are the interactive explanation of the backpropagation algorithm, the redesign of the famous 'Small world' networks paper by Watts and Strogatz, or the Stitch Fix algorithms tour. They are all rigorous, scientific and interactive presentations of their topics.

  • Why Managing Data Scientists is Different by Roger M. Stein. Data Scientists are a curious bunch. If you ever find yourself in, or transition into, a management position, you should be aware of some things.

  • Beware the data science pin factory and Let Curiosity Drive by Eric Colson. The first is a profound and full-on defense of the full-stack Data Scientist: praise for flexibility and an attack on specialization. Really insightful. The second is closely related to the first: it covers the advantages of innovation and experimentation and the importance of trying things out.

  • What is the Most Effective Way to Structure a Data Science Team by Chuong Do. You will read versions of this post all over the internet. The eternal discussion of how to structure a Data Science team: embedded or centralized? This dichotomy is really well explained here.

  • At Airbnb Data Belongs Everywhere by Riley Newman. How do you scale a Data Science team? How do you grow the impact of Data Science within a company? How do you gain the trust of other branches of your organization? This is how Airbnb does it.

  • Engineers Shouldn’t Write ETL by Jeff Magnusson. Who is who in a Data department? Who should do what? It addresses very common misconceptions regarding ETLs, Big Data and a bunch of other buzzwords. Very insightful.

Talks:

There are many, many very interesting talks regarding applied Data Science. This is definitely a WIP, but I consider it mandatory to watch the yearly update lecture on Deep Learning taught by Lex Fridman.

People & Blogs:

I follow several people from the Data Science scene (semi-)regularly. Some are more tech-savvy, some are more focused on scaling knowledge, management, statistics or communication. In no particular order of preference:

Sites:

Besides particular individuals, I read a few sites/blogs every once in a while. Probably the most famous and mainstream one is Towards Data Science, but there are others just as good if not better. Here they are:

Podcasts:

There are just as many data-related podcasts as there are blogs. The only one I listen to frequently is The Artificial Intelligence podcast, because of the extremely interesting guests that Lex Fridman is able to book. Not only people working on Data Science or Artificial Intelligence but also science in general, economics, philosophy, business, etc. The rest I listen to sometimes when I travel.

3Blue1Brown deserves a special mention. It's a YouTube channel really, not a podcast. I have never ever seen complex mathematical topics explained in such an interesting, engaging and plain fun way. Absolutely spectacular content. It never ceases to surprise me.

Reinforcement Learning

RL is one of the most promising and powerful branches of Machine Learning. Unfortunately, there is no single corpus of courses/tutorials that covers the full spectrum from the basics to AlphaStar. Information and knowledge are still very fragmented across a bunch of research groups, companies and institutions. As of February 2020, I think this is the most comprehensive collection of high-quality resources, from DeepMind lectures to Berkeley to the first courses/specializations on the subject on Coursera.

Useful links

Finally, I usually dump here a few miscellaneous links that I find useful about more general coding or data visualization.