This workshop was presented to the UC Berkeley D-Lab Digital Humanities Working Group on March 7, 2019 as an overview of data science tools and how to keep your data science projects organized using RStudio Projects.
Even if you are not an R/RStudio user, I hope that you will find the organization concepts useful along with the list of resources below.
While broad, this list of resources is far from exhaustive. There are many programs available to help manage your Themes, Teams, Tools, and Timelines in terms of your:
- Research Design
- Data Methods and Tools
- Document Preparation
- Presentation
- Organization, and
- Preparation
Before all of the tech stuff, it usually helps to get your project off the ground by flexing your domain expertise. Don't skip the design basics!
Research question: learn how to ask a guided research question.
Literature review: learn how to conduct a literature review.
Hypotheses: learn about hypotheses in scientific research.
Statistical framework: brush up on your statistics.
fread: a favorite function from the "data.table" package for quickly importing data into your R session, whether from the web or from a local file.
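As a minimal sketch of what `fread` does, the example below parses a few lines of made-up CSV supplied through the `text` argument, so it runs without a network connection. The column names and values are invented for illustration, and the sketch assumes the data.table package is installed. Passing a file path or a URL as the first argument follows the same pattern.

```r
library(data.table)

# Made-up CSV lines standing in for a real file or URL
csv_lines <- c(
  "word,count,source",
  "whale,12,chapter_1",
  "ship,7,chapter_1",
  "sea,9,chapter_2"
)

# fread() detects the separator, header, and column types on its own
word_counts <- fread(text = csv_lines)

print(word_counts)

# The result is a data.table, which is also a data.frame
print(class(word_counts))
```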
Be sure to consider data methods and tools when you design your research. This can save you time, energy, and stress, because your results and discussion depend on your broader research framework.
RStudio cheatsheets: quintessential resources for RStudio users.
Importation: learn how to import data using R.
Preprocessing: learn about data preparation with R.
Visualization: learn about data visualization in R.
Statistics: excellent introduction to the quantitative side of things in R.
Inferential Thinking (Python Data8 textbook): excellent introduction to the quantitative side of things in Python.
MS Excel: Microsoft brand spreadsheet program.
Qualtrics: subscription-based software for data collection and analysis; especially good for surveys!
OpenRefine: open source data preparation and transformation.
Salesforce: manage your own business, customer service, marketing, analytics, and application development via dashboards.
Use markdown to produce vibrant documents to help you rock your next presentation or publication!
Markdown: a lightweight language for formatting plain text.
RMarkdown (.Rmd): R-specific flavor of markdown that mixes text with runnable R code.
knitr: R package that, together with rmarkdown and pandoc, "knits" your markdown into an .html, .docx, .pdf, etc. file.
.html: file format for HTML, the markup language used to create web pages.
.docx: Microsoft file format for preparing text.
.pdf: Adobe file format for document preparation.
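To make the knitting step concrete, here is a minimal sketch that writes a tiny .Rmd file to a temporary folder and renders it to .html. It assumes the rmarkdown package (which calls knitr) and pandoc are installed, as they are alongside RStudio. The file name and its contents are made up for illustration.

```r
library(rmarkdown)

# Hypothetical report written to a temporary folder
rmd_file <- file.path(tempdir(), "toy-report.Rmd")

writeLines(c(
  "---",
  "title: \"Toy report\"",
  "---",
  "",
  "This sentence is plain *markdown*.",
  "",
  "The built-in cars data set has `r nrow(cars)` rows."
), rmd_file)

# render() runs the R code with knitr, then converts the result with pandoc
html_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)

# Path of the finished document
print(html_file)
```

Swapping `"html_document"` for `"word_document"` or `"pdf_document"` should produce a .docx or .pdf instead; the .pdf route also needs a LaTeX installation.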
Ensure that your presentation skills are up to speed to secure that next round of funding or wow potential employers and find the job of your dreams.
Ten rules for structuring papers: a short paper that outlines the critical steps you must consider in your scientific writing; rubric included!
MS PowerPoint: fantastic way to make slideshow presentations!
Omeka: super fun online open-source management system for digital collections.
Slideshows: as fun as creating MS PowerPoint slides can be, at some point you should learn to create slideshows directly from R code with tools such as ioslides and Slidify.
Shiny: awesome way to build apps and widgets using R code.
Stay organized! Keep track of your whos, whats, wheres, whens, whys (and especially hows).
Box: a great way to back up your stuff online.
Dropbox: another great way to back your stuff up online.
Google Drive: yet another great way to store your stuff online (notice a trend here?).
G Suite: compiles many "traditional" proprietary programs into a single suite: word processing, spreadsheets, slideshow presentations, drive storage, calendaring, etc.
Asana: web-based platform for team-based project management.
Don't forget to share your stuff! Whether your project is public or private, make sure you are tracking your organization.
RStudio Projects and Software Carpentry intro: an RStudio Project is a folder marked by an .Rproj file, with its own working directory, workspace, and history, and it is what allows you to do all of this!
Dependency management: packrat records the R package versions your project uses so that they stay configured correctly.
GitHub: a leading platform for tracking and sharing code with Git, widely used for open-source projects.
ssh: secure shell; the standard secure way of connecting to other computers through the command line.
Jupyter Notebooks: go-to computational environment for Python programmers.
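The organization idea behind RStudio Projects, one self-contained folder per project with every path built from its root, can be sketched in base R. The folder names and the `survey.csv` file below are one hypothetical layout rather than anything RStudio requires, and the project is created in a temporary directory so the example does not touch your own files.

```r
# Hypothetical project root; in practice this is the folder holding the .Rproj file
project_root <- file.path(tempdir(), "my-project")

# One common layout, not a requirement of RStudio
folders <- c("data/raw", "data/processed", "scripts", "figures", "docs")

for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Build paths from the project root instead of hard-coding absolute paths
raw_data_path <- file.path(project_root, "data", "raw", "survey.csv")
writeLines(c("id,answer", "1,yes"), raw_data_path)

# List everything the project now contains
print(list.files(project_root, recursive = TRUE, include.dirs = TRUE))
```

Keeping raw data, processed data, scripts, and outputs in separate folders makes it easier to track the project with GitHub or to copy it to another machine in one piece.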
Do you need to run a particular version of a program? Learn more about:
Analytics Environments on Demand: "Researchers need easy access to analytic computing environments that are designed to fit their needs. BRC's Analytics Environments on-Demand (AEoD) service is designed for researchers who need to run analytic software packages (such as ArcGIS, Stata, SPSS, R Studio, etc.) on a platform that is scaled up from a standard laptop or workstation, in a Windows-based environment."
VirtualBox: open-source virtualization software for home or enterprise use.
You might also need to package these version configurations as reproducible environments along with your data and code as "containers".
What is a container? "A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings". Check out:
Docker: learn to package your program/application with all the essential dependencies.
Kubernetes: open-source system, originally developed at Google, for deploying and managing containers.
Medium freeCodeCamp intro to Docker, VMs, and Containers: excellent resource for learning the basics of containers and virtual machines.
Click the links below to learn more about accessing greater compute power, memory, and storage:
Benten GPU server - no public link...(yet?).
Savio: UC Berkeley condo cluster maintained by Research IT.
XSEDE: NSF-funded network of supercomputing resources; best of the best?
Bridges: seeks to bring supercomputing to those unfamiliar with it; accessible through the XSEDE User Portal.
Jetstream: easily deploy virtual machines on cloud-based, on-demand systems.
Take your programming to the next level!
Makefiles: compile and run programs automatically so that your entire repository stays up to date!
Travis CI: continuous integration service that automatically tests code pushed to GitHub.
Bash: an essential shell and scripting language for all programmers (even if you do not know what it is yet, you will eventually!).
Bookdown: learn to convert your R work to digital book formats.
R package tutorial: learn how to bundle your code and share.
Pkgdown: build websites for your packages!
Overleaf (LaTeX): web-based LaTeX editor.
7-zip: compress and encrypt your data!
Computer Information Systems: familiarize yourself with systems software.
Securing your computing environment: although a little dated, this guide still covers many topics relevant today in simple language.