Primary language: R. License: BSD 2-Clause "Simplified" (BSD-2-Clause).

First Steps With R

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Cookies and Code event, instructors will be on hand to help you.

What is R?

R is a free, open-source programming language with very strong support for statistics. It was originally developed as an open-source implementation of the S programming language. It is used extensively in research and industry for data analysis, statistics, machine learning, bioinformatics, simulation, linguistics and much more.

With over 8000 freely available add-on packages that provide extensive additional functionality, R will probably have something that can help your research.

Don't just take our word for it though -- here's what others have to say

Installing R and RStudio

Many users of R use it from within another free piece of software called RStudio. RStudio is a powerful and productive user interface for R. It’s free and open source, and works great on Windows, Mac, and Linux.

Our first task, therefore, is to install R and RStudio.

Starting RStudio

When you start RStudio, you'll be greeted with a window like the one below

[Image: RStudio screenshot]

R can be used interactively by typing commands into the Console panel. In this tutorial, everything that is formatted like this:

print("this is an R command")

should be typed into the Console panel. Press Return after every command.

Simple commands and calculations

R is a command-based system, which means that you (usually) interact with it by entering commands rather than using a Graphical User Interface (GUI). Some of these commands are rather straightforward! For example, R can be used to do arithmetic:

1+1
3*9
377/120
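
Beyond the four basic operations, R has operators for powers and integer arithmetic, and it follows the usual precedence rules. A few illustrative lines (the numbers are arbitrary):

```r
2^10         # exponentiation: 1024
17 %% 5      # remainder after division: 2
17 %/% 5     # integer division: 3
2 + 3 * 4    # multiplication binds tighter than addition: 14
(2 + 3) * 4  # parentheses override precedence: 20
```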

R can also do all of the mathematical operations that you'd expect to see on a scientific calculator. For example, to take the square root of two:

sqrt(2)

This is the first time we've entered a function in R so let's discuss some details. In the above, the function name is sqrt and the function argument is 2. In R, all function arguments are enclosed in parentheses ()

R is case sensitive. For example, the correct command for square root is sqrt(2) with everything in lower case. Variations such as Sqrt(2) or SQRT(2) won't work (try it!).

R can also evaluate all the standard trigonometric functions such as sin, cos and tan. These take their arguments in radians rather than degrees. As such, a right angle is pi/2 rather than 90.

sin(pi/2)
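
If your angles are in degrees, convert them to radians first. The sketch below defines a small helper for this; deg2rad is a hypothetical name chosen for illustration, not a function that ships with R.

```r
# deg2rad is a hypothetical helper, not a built-in R function
deg2rad <- function(degrees) degrees * pi / 180

sin(deg2rad(90))   # same as sin(pi/2)
cos(deg2rad(60))   # approximately 0.5
```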

Unlike many scientific calculators, R's log function takes the natural logarithm by default.

log(10)

If you want to calculate a logarithm to base 10, you'll need to specify the base as a second argument.

log(100,base=10)

This shows another feature of R functions -- named arguments. In this case, the named argument is base. Since the second positional argument to log is the base, you could have simply executed

log(100,10)

but the named argument version is more readable.
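
A few more variations on the same calculation may help here. This is only a sketch of how named and positional arguments behave; log10() and log2() are standard R functions for the two most common bases.

```r
log(100, base = 10)      # named second argument
log(base = 10, x = 100)  # named arguments can be given in any order
log10(100)               # dedicated base-10 function, same result
log2(8)                  # dedicated base-2 function
log(exp(1))              # natural logarithm of e
```

The first three lines all return 2.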

Getting help

Built in to R is a large amount of documentation that you can call on any time. For example, if you forget the details about the log function described above, ask R for help

 help(log)
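
There are a few other ways to look things up. The sketch below shows two that print straight to the Console; the commented lines open help pages and are best tried in an interactive session.

```r
args(log)        # show a function's arguments and their default values
apropos("log")   # list objects whose names contain "log"

# In an interactive session you can also try:
# ?log           # shorthand for help(log)
# ??logarithm    # keyword search across the installed help pages
```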

Variables

We'll rarely want to perform a calculation and throw away the result. It is much more likely that we'll want to store the result in R's memory for later use; either as part of future calculations or ready for export to external files.

We do this by assigning the results of calculations to variables. For example,

a <- sin(1)
b <- 10
c <- a+b

In the above, we created three variables called a, b and c. Note that as you create variables, they are shown, along with their values, in RStudio's Environment window. You can also list all variable names that currently exist in R's memory using the command

ls()

To see the value of any given variable, just type its name and press Return

c

To remove a variable from R's memory, we use the rm() command

rm(c)

The rm command can also remove several variables in one go. For example, we could remove all variables in R's memory by passing the result of ls() to its list argument.

rm(list=ls())
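
To check that a removal worked, you can ask R whether a variable exists. A minimal sketch, using made-up variable names and values:

```r
radius <- 2                    # hypothetical values for illustration
circle_area <- pi * radius^2

exists("circle_area")          # TRUE: the variable is in memory
rm(radius, circle_area)        # rm() accepts several names at once
exists("circle_area")          # FALSE: it has been removed
```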

Built in datasets

R comes with a package called datasets that contains a set of classic datasets such as Fisher's Iris data and Anscombe's quartet. This package is one of the few that are loaded when you start R.

To see the full list of available datasets, execute the command

library(help="datasets")

We are going to focus on the iris dataset which is stored as an R object called a Data Frame in the variable name iris. Learn more about this dataset using the help command:

help(iris)

If you run the above command, you'll see that R's documentation tells us that "iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species."

Let's confirm this information for ourselves by introducing a few more R commands. dim() tells us the dimensions of a data frame

dim(iris)

The names() function tells us the column names of a data frame.

names(iris)

We can extract any of the columns by name using the $ operator. For example, to get the petal lengths as a vector:

iris$Petal.Length  
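
Once a column has been extracted, it behaves like any other vector of numbers. The following sketch stores it in a variable (petal_lengths is just a name we picked) and applies a few standard functions to it:

```r
petal_lengths <- iris$Petal.Length

length(petal_lengths)   # one value per row of the data frame
mean(petal_lengths)     # average petal length
range(petal_lengths)    # smallest and largest values

# Double square brackets with the column name select the same column
identical(iris[["Petal.Length"]], petal_lengths)
```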

The str() function gives a compact summary of the structure of its input

str(iris)

The head() function shows us the first 6 rows.

head(iris)

You could display the entire data frame by simply entering

iris

Alternatively, we can obtain some summary statistics about this data frame using the summary() command

summary(iris)
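
summary() describes each column across the whole data frame. If you would rather have a statistic for each species separately, one possible approach uses aggregate() or tapply(), both of which come with R. This is a sketch of one way to do it, not the only one:

```r
# Mean petal length for each species (formula interface: value ~ group)
aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

# The same calculation with tapply(); the results are labelled by species
species_means <- tapply(iris$Petal.Length, iris$Species, mean)
species_means
```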

Plotting data

Let's extract the columns Petal.Length and Petal.Width and plot them against each other

x <- iris$Petal.Length
y <- iris$Petal.Width
plot(x,y)

We add axis labels and titles by supplying named arguments to the plot command

plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data")

Each datapoint has an iris species associated with it - one of setosa, versicolor and virginica. We can see this by asking R what the structure of the iris$Species column is

str(iris$Species)

The output shows that Species is a factor with three levels. Factors are how R represents categorical variables. We can see what the factor levels are with

levels(iris$Species)
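
Behind the scenes, a factor stores each value as an integer code that points to one of its levels. The sketch below counts the rows for each level and shows the codes for one row from each species (rows 1, 51 and 101 are chosen because the iris data is ordered by species):

```r
table(iris$Species)                       # number of rows for each level
nlevels(iris$Species)                     # number of distinct levels
as.integer(iris$Species)[c(1, 51, 101)]   # integer codes behind the labels
```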

We can include this information on the plot by coloring each datapoint according to its species.

plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data",col=iris$Species)

Finally, let's add a legend

plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data",col=iris$Species)
legend(x = 1, y = 2.5, legend = levels(iris$Species), col = c(1:3), pch=1)
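
The colors above come from the factor's integer codes (1, 2 and 3), which is why the legend uses the colors 1 to 3. If you would like to choose the colors yourself, one option is to index a vector of color names by the factor. In this sketch, species_cols and the three color names are our own choices:

```r
# species_cols is a hypothetical palette; any three color names would do
species_cols <- c("steelblue", "darkorange", "forestgreen")

# Indexing the palette by the factor picks one color per data point
point_cols <- species_cols[iris$Species]

plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length", ylab = "Petal Width", main = "Iris Data",
     col = point_cols)
legend("topleft", legend = levels(iris$Species), col = species_cols, pch = 1)
```

Passing a keyword such as "topleft" to legend() saves working out x and y coordinates by hand.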

Exercise - Tooth growth

Try summarising and plotting a different dataset using the commands you've learned. The name of the dataset to investigate is ToothGrowth. Again, you can use help(ToothGrowth) to see contextual information and metadata.

Packages

R has many functions built in, but there are over 8000 freely available add-on packages that provide thousands more. Once you know the name of a package, you can install it very easily.

For example, a package called ggplot2 is widely used to create high quality graphics. To install ggplot2:

install.packages("ggplot2")

We make all of the ggplot2 functions available to our R session with the library command

library(ggplot2)

Among other things, this makes the qplot function available to us. We can use this as an alternative to the basic plot command described above

qplot(iris$Petal.Length, iris$Petal.Width,col=iris$Species)

Alternatively, we can save ourselves typing iris$ a lot by telling qplot that the data we are referring to is the iris data

qplot(data=iris,Petal.Length, Petal.Width,col=Species)
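
Recent releases of ggplot2 mark qplot() as deprecated in favour of the more general ggplot() function, so you may see a warning when you run the commands above. A minimal sketch of the same plot built with ggplot() follows; it assumes ggplot2 is installed and skips the plot otherwise.

```r
# Only run if ggplot2 is installed; see install.packages("ggplot2") above
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # aes() maps columns of the data frame to plot properties
  p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
    geom_point() +
    labs(x = "Petal Length", y = "Petal Width", title = "Iris Data")
  print(p)  # a ggplot object is drawn when it is printed
}
```

Here the plot is built up in layers joined with +, and the legend is added automatically.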

To get help about the functionality in the ggplot2 package:

help(package=ggplot2)

Exercise - Packages

A very popular R package is MASS which was created to support the book Modern Applied Statistics with S. This contains many more classic data sets which can be used to develop your R skills.

  1. Install the MASS package on your machine.
  2. Explore the MASS package's documentation and find a dataset that interests you.
  3. Load the MASS package into your R session.
  4. Take a look at the dataset you chose in part (2) using what you've learned so far.

The current working directory

Working with built-in datasets is great for practice, but for real-life work it's vital that you can import your own data. Before we do this, we must learn where R expects to find your files. It does this using the concept of the current working directory. To see what the current working directory is, execute

getwd()

You can create a new directory using dir.create()

dir.create('mydata')

Move into this new directory using setwd()

setwd('mydata')

See its contents with

dir()

The current working directory is where R is currently looking for files and also where it will put any files it creates unless you tell it otherwise.
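
When you change directory in a longer session, it is easy to lose track of where you are. One possible habit is to remember the starting directory and return to it afterwards. The sketch below works inside R's temporary directory so that it does not clutter your own folders; mydata_demo and notes.txt are made-up names.

```r
old_wd <- getwd()                                  # remember where we started
demo_dir <- file.path(tempdir(), "mydata_demo")    # hypothetical directory name

dir.create(demo_dir, showWarnings = FALSE)
setwd(demo_dir)            # move into the new directory
file.create("notes.txt")   # create an empty file here
dir()                      # the listing now includes notes.txt
setwd(old_wd)              # go back to the original directory
```

file.path() joins directory names with the correct separator for your operating system.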

Importing your own data

In this section, you'll learn how to import data into R from the common .csv (comma separated values) format.

Download the file example_data.csv to your current working directory. You can either do this manually, using your web browser, or you can use the R command download.file

download.file('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/example_data.csv',destfile="example_data.csv")

Ensure that the file is in your current working directory using the dir() function

dir()

Import the .csv file using the read.csv() function

example_data <- read.csv('example_data.csv')

The variable example_data will be an R data frame -- exactly the same type of object as the iris data we looked at earlier.
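
If you want to see how read.csv() interprets a file without downloading anything, you can hand it the file contents directly through its text argument. The data below is invented for illustration and is not the contents of example_data.csv:

```r
# Hypothetical data, supplied inline through the text argument
csv_text <- "x,y
1,2.5
2,3.1
3,4.8"

small_data <- read.csv(text = csv_text)

str(small_data)     # a data frame with two columns and three rows
names(small_data)   # column names come from the header line
```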

Exercise - example_data

  • Show the first few lines of example_data
  • Create a plot of the example_data
  • Show summary statistics of example_data

Scripts

In the simplest terms, a script is just a text file containing a list of R commands. We can run these commands in order with a single call to source().

An alternative way to think of a script is as a permanent, repeatable, annotated, shareable, cross-platform archive [1] of your analysis! Everything required to repeat your analysis is available in a single place. The only extra required ingredient is a computer.

For example, based on the article at http://www.walkingrandomly.com/?p=5254, we have created a script called best_fit.R that finds the parameters p1 and p2 such that the curve p1*cos(p2*xdata) + p2*sin(p1*xdata) is a best fit for the example_data described earlier. The details of this are beyond the scope of this course but you can easily download and run this analysis yourself.

download.file('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/best_fit.R',destfile='best_fit.R')
source('best_fit.R')
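
To see how source() works on something much smaller, you can write a tiny script of your own and run it. This sketch creates the script from within R for convenience (normally you would write it in RStudio's editor); hello_script.R is a made-up file name.

```r
# Write a three-line script to R's temporary directory
script_path <- file.path(tempdir(), "hello_script.R")  # hypothetical file name
writeLines(c(
  "# A tiny example script",
  "total <- sum(1:10)",
  "message('The total is ', total)"
), script_path)

source(script_path)   # runs every line of the script in order
total                 # variables created by the script are now available: 55
```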

By doing this, you have reproduced the analysis that we did. You are able to check and extend our results or apply the code to your own work. Making code and data publicly available like this is the foundation of Open Data Science.

Further reading and next steps

In this session, we told you how to import data from a file but not how to export it. The following link will teach you how to export to .csv.
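
As a quick start, here is a minimal sketch of exporting with write.csv(), the counterpart of read.csv(). It writes the iris data to R's temporary directory and reads it back in; iris_export.csv is a made-up file name.

```r
out_file <- file.path(tempdir(), "iris_export.csv")  # hypothetical file name

# row.names = FALSE stops R writing the row numbers as an extra column
write.csv(iris, file = out_file, row.names = FALSE)

iris_again <- read.csv(out_file)
dim(iris_again)   # same dimensions as dim(iris)
```

To save the file somewhere permanent, replace out_file with a path of your own, or with a bare file name to write to the current working directory.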

There are many online resources for learning R. Here are some we like

References

[1] Getting Started with R - An Introduction for Biologists. Authors: Beckerman and Petchey.