Data Science Workstation Setup Guide

Welcome! If you are totally new to coding, what you are reading right now is called a README, and it is part of a larger collection of files called a Git code repository. This particular README lives in a file called README.md. Written using an easy-to-learn syntax called "markdown" (hence the .md file extension), READMEs help other developers learn how to navigate your code repository. In this case, we are using the README to introduce this data science workstation setup guide and to explain how to navigate it.

The best part about everything we'll be installing in this guide is that it is all FREE! Everything we will be downloading is open source, meaning that its source code is publicly available and anybody can propose changes to it. Widely used open source tools tend to be reviewed by many developers, so you can be confident that I'm only having you install things that are well established and trusted by professional software developers. Open source software is generally denoted as such by its license, which can usually be found in a file called LICENSE. Some of the more popular open source licenses include MIT and Apache 2.0. To keep things simple for you in this guide, be assured that I've already reviewed these licenses to ensure you're good to go! 🥳

If you already have some experience doing software development, you may already have some of these tools (or close equivalents) installed. In that case, feel free to jump around to install what you need. If you are totally new to software development, I'd recommend sticking with the path as it is laid out.

Hardware Recommendation

Before moving forward, you might be curious whether you need any specific hardware. When you're first getting up and going, I think any modern PC will work. I do recommend a machine running either the Windows or macOS operating system. Tablets (e.g. iPads) will unfortunately not work, and I can't speak to Chromebooks. But don't worry about things like RAM or processor speed!

In case you are curious what I personally use, I have a 2021 MacBook Pro with M1 Pro chip running the latest stable software, and as a "2-in-1" tablet / laptop combo, I own a Microsoft Surface Pro 9 running Windows 11. Thus, the tutorials in this repository are written for both Mac and Windows users!

What We'll Be Installing

As the name of this guide implies, we're going to focus on getting you what I consider to be the reasonable minimum of installations on your computer. You could certainly go simpler or more complex here or there, but we're hopefully going to strike a balance that feels comfortable without being too bloated.

That said, let's talk about what you'll be installing as part of our collective tech stack. (Don't mind the order of these bullets! We will not necessarily be installing things in this order.)

  • Command Line Interface (CLI): You are probably already familiar with graphical user interface (GUI) apps, like Microsoft Excel. While it can seem daunting at first sight, interacting with the CLI is just like working in your standard app, except all in text form! Because it is in text form, the "scariness" of the CLI comes from the fact that commands are abbreviated for quick typing. For example, one simple command, cd, actually stands for "change directory." The CLI is very important because it is how we will run a lot of our code.
  • Python: Currently the most popular coding language in the AI/ML community, Python is what we will need to run on our computers as we perform our work. Python sees a significant upgrade roughly every year. Generally speaking, I do not recommend being an early adopter. For example, the latest feature release as of this update (Q1 2024) is Python 3.12. I personally have had trouble getting Python 3.12 to work in some nuanced ways, so I would recommend staying one release behind, which in this case would be Python 3.11.
  • Git: We coders keep track of how our code is versioned using a piece of software called Git. Don't worry if you confuse Git with GitHub; it's a very easy mistake to make. Git in and of itself is the software that allows us to version control our code. GitHub is one website (and the most popular of these) that allows people to manage their Git projects on the web. So Git and GitHub aren't exclusively linked! For example, you might be familiar with another platform called GitLab. GitLab is a totally separate entity from GitHub, but users can push Git-managed code to GitLab just as they would to GitHub.
    • Note: As part of our minimalist tech stack, I am recommending Git but not necessarily GitHub. Again, GitHub is not required to use Git, although GitHub is really great. I personally love to use GitHub, but it goes beyond the scope of our minimalist stack here.
  • Integrated Development Environment (IDE): Those are big words that boil down to one question: "Where do you write your code?" An IDE allows us to write code in a way that's clean and easy to keep track of. If you've seen pictures of code on the internet, you'll notice it's often very pretty and colorful. While the prettiness is a nice "side effect", the colorfulness is actually a major help in making sure that you're getting your code's syntax correct. There are also other really cool things IDEs can do. There are a lot of choices for IDEs out there, and to keep things simple, we're going to go with Microsoft's Visual Studio Code (VSCode). VSCode is the most popular choice in the world, and it is very well supported. I personally love using VSCode, and I'll share some of the little bells and whistles that I enjoy in VSCode.
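If you are curious whether any of these tools are already on your machine, here is a minimal sketch of a check you could run once Python itself is available. It is only an illustration: the names TOOLS, python_version_label, and find_tools are made up for this example, and it assumes that Git registers the command git and that VSCode registers the command code. The code command in particular can be missing even when VSCode is installed, so treat "not found" as a hint rather than a verdict.

```python
import shutil
import sys

# Assumption: these are the command names the tools usually register.
# "code" is the VSCode launcher and may be absent even if VSCode is installed.
TOOLS = {"Git": "git", "VSCode": "code"}


def python_version_label(version_info=sys.version_info):
    """Return a Python version as 'X.Y', for example '3.11'."""
    return f"{version_info[0]}.{version_info[1]}"


def find_tools(tools=TOOLS):
    """Map each tool name to the location of its command, or None if missing."""
    return {name: shutil.which(command) for name, command in tools.items()}


if __name__ == "__main__":
    print(f"Python: {python_version_label()}")
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

Anything reported as "not found" is simply something the tutorials in this guide will help you install.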

What If I Get Stuck?

Unfortunately, I'm pretty bad about keeping this repository up to date. 😅 While I'm trying to keep this guide as "timeless" as possible by offering solutions that largely haven't changed in years, the reality is that things change. Or maybe I did a bad job at explaining something. In any case, my recommendation is to use an LLM for assistance, specifically one that is hooked up to the internet. I personally use Perplexity Pro and love it. If you do choose to go that route, remember to be as specific as you can. If you see an error message, quickly check that it doesn't expose any sensitive information, and then provide it to the LLM as well.
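As one way to do that quick check, here is a small sketch that masks two common giveaways in an error message before you paste it anywhere: your home folder (which usually contains your username) and anything that looks like an email address. The function name redact and the EMAIL pattern are made up for this example, and the pattern is a rough approximation. It will not catch passwords, API keys, or other secrets, so still read the message over yourself.

```python
import re
from pathlib import Path

# Rough pattern for things that look like email addresses (an approximation).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(message, home=None):
    """Replace the home folder path with '~' and email addresses with '<email>'."""
    if home is None:
        home = str(Path.home())
    if home:  # guard against replacing an empty string
        message = message.replace(home, "~")
    return EMAIL.sub("<email>", message)


if __name__ == "__main__":
    error = "FileNotFoundError: /Users/alice/data.csv (owner alice@example.com)"
    print(redact(error, home="/Users/alice"))
```

Passing home explicitly, as in the example at the bottom, is just for demonstration. If you leave it out, the function uses the home folder of whoever is running it.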

Suggested Installation Order

Because some of the installations in this tutorial depend on others being installed first, here is the order in which I suggest you install these tools. I have also marked which tools are generally required for your data science work and which are optional. I provide deeper descriptions of each tool in its own linked tutorial.

Required Installations

As we touched on in the previous section, this will be our minimalist tech stack. If you would like to go a bit further with some more advanced or specialized tools, please see the optional installations section. Be aware, though: if you are a beginner, the optional installations may be too much to cover at once. That's okay! Once you feel comfortable with these core tools, you can always revisit this guide to cover that "next level" stuff.

Please check out the following pages in this order. Best wishes as you begin your data science journey!

Optional Installations

This section provides some tips on optional installations. While these are optional for beginners, they eventually become "must haves" for an active practitioner working in the AI/ML space.