Adapted from Introduction to the Unix shell for biologists by Konrad U. Förstner: https://github.com/konrad/2017-03-29-Software_Carpentry_Munich_Teaching_Material/edit/master/Unix_Shell/Unix_Shell_Handout.md
(This was the first course I attended which really helped me learn how to use the Unix shell)
- Installation instructions
- Background
- Course conventions
- The basic anatomy of a command line call
- How to get help
- Tab Completion
- Files, folders, locations
- Manipulating files and folders
- File contents - Viewing and Editing files
- File content - Sorting, counting and Filtering files
- Connecting tools
- Repeating commands using a for loop
- Shell scripting
- Useful Example of Shell Scripting
- Bonus
- Useful Links
Short Unix shell course Feb 2020 - Haller Lab
To take part in this course you need a Unix/Linux bash shell installed on your computer. Luckily, Windows 10 has made this much easier than it used to be, and you can run an almost complete Linux subsystem on your Windows PC.
Open PowerShell as an administrator (search for "powershell" in the search bar, then right-click on the program) and copy and paste the following command:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
You will then be prompted to restart your computer. After booting up again, open the Microsoft Store app and search for your preferred Linux distribution. We will be using Ubuntu for this course, so I would suggest using that, but if you want to use another one that's also OK.
Once installed, open the terminal and provide a username and password. To download the contents of this course type or copy paste the following into your terminal (recommended that you do this on the day/night before as it is liable to change):
git clone https://github.com/adamsorbie/unix_shell_course-2020-02-14.git
If that doesn't work, you may need to install git first. You can do this by typing:
sudo apt update
followed by:
sudo apt install git
Then repeat the git clone command above.
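Before cloning, it can help to check whether git is already installed. The following is a minimal sketch, assuming a bash-like shell; the variable name git_status is only illustrative.

```shell
# command -v prints the location of a program if the shell can find it
if command -v git >/dev/null 2>&1; then
    git_status="installed"
else
    git_status="missing"
fi
echo "git is $git_status"
```

If it reports "missing", the two sudo apt commands above should install it on Ubuntu.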
In this short course you will learn the very basics of how to use the Unix shell. Most tools used in Bioinformatics are written for command line use and understanding how to use the shell will allow you to use these tools in your own research. Additionally, knowing a little bit of bash scripting is very helpful for automating annoying, repetitive tasks and can help make your analyses more reproducible.
- Anytime you see a path with yourusername in it, e.g. /home/yourusername, please replace "yourusername" with the username you chose for yourself during the installation.
- If you have any questions or something isn't working, you can interrupt me at any time to ask.
Running a tool on the command line follows a simple
pattern. First you write the name of the command.
Some command line tools require additional parameters.
While the parameters are what the program requires,
the actual values we give are called arguments. There are two
ways to pass those arguments to a program: via keyword
parameters (also called named keywords, flags or options) or via
positional parameters. The common pattern looks like this (<>
indicates required items, [] indicates optional items):
<program name> [keyword parameters] [positional parameters]
An example is calling the program ls
which lists the content of a
directory. You can call it without any argument
$ ls
or with one or more keyword arguments
$ ls -l
$ ls -lh
or with one or more positional arguments
$ ls test_folder
or with both keyword and positional arguments
$ ls -l test_folder
The output of a command is usually written to the so-called standard output of the shell, which is shown on the screen when you call the command.
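To make the pattern concrete, here is a small sketch that runs ls with keyword and positional arguments against a throwaway folder. The variable names and the file names a.txt and b.txt are made up for illustration.

```shell
# Create a scratch folder with two empty files (names are hypothetical)
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

# Positional argument only: list the folder given as argument
plain=$(ls "$demo_dir")

# Keyword arguments can be given separately or combined: -l -h equals -lh
separate=$(ls -l -h "$demo_dir")
combined=$(ls -lh "$demo_dir")

echo "$plain"
echo "$combined"
```

Both long listings should be identical, which is why most people type the shorter -lh form.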
Perhaps the most important command to know, although not particularly helpful if you don't
understand anything yet, is man
which stands for manual. Most commands have a manual
and man
allows you to read those. Reading the manual should tell you what the command does and how to use it.
For example, to read the manual for ls
type
$ man ls
To close the manual press q. Note that some tools offer help in a different way; common options are -h, -help or --help. For example, cd:
$ cd --help
One of the most useful things about bash is tab completion. Pressing Tab after typing the first few letters of a command, file or folder name will autocomplete it. Be aware that if multiple commands or files start with the same few letters, bash will not know which one to choose. You can press Tab twice to reveal all of the options, or just continue typing until the name is unique. Programmers are lazy, be lazy too.
- Tab - extend commands and file/folder names
Type the following and press tab to see how it works
$ ls un
If you have downloaded the course folder it should auto-complete to:
$ ls unix_shell_course-2020-02-14/
Topics:
- ls
- pwd
- cd
- mkdir
- Relative vs. absolute path
- ~/
In this part you will learn how to navigate through the Unix file system.
Firstly, it is always useful to know where we are. When you open a new terminal you
should be in your home directory. To test
this, call the program pwd
which stands for print working directory.
$ pwd
/home/adamsorbie
On WSL, the output of pwd should be /home/yourusername, although yours
may differ slightly. The next command we need is ls.
This command simply lists the contents of a folder.
If you call it without any arguments it will output the content
of the current folder.
We can use the ls
command to get a rough overview
of what a common Unix file system tree looks like and learn how to address
files and folders. The root folder of a system starts with /
. Type
$ ls /
to see the content of the root folder. You should see something like
bin c etc home lib media opt root sbin srv tmp var
boot dev h init lib64 mnt proc run snap sys usr
Most of these folders are not particularly important for you right now; they matter more if you are the administrator of the system. Normal users do not have permission to make changes there. Your home directory is where you will work today and probably where you will end up doing most of your work in future.
In this section we will learn how to work with paths. A file or folder can be addressed
by its absolute or its relative path. As you have
downloaded the GitHub repo containing today's course materials, when you type ls
you should see a folder with the name of today's course.
Assuming you are in the home folder (/home/yourusername/) the relative path to
the folder is unix_shell_course-2020-02-14. You can see what's inside the folder by
calling ls
like this:
$ ls unix_shell_course-2020-02-14
This is the so-called relative path, as it is relative to where you are right now
(/home/yourusername/), i.e. your current working directory. The absolute path or full path
starts with a / and is /home/yourusername/unix_shell_course-2020-02-14. Call ls
like this:
$ ls /home/yourusername/unix_shell_course-2020-02-14
You can think of it a little like being given a street address without the city or town. Gregor-Mendel Str. 2 may make sense to you if you are in Freising (relative to where you are), but if you live elsewhere you may not know this street and would also need to know which town it is in.
There are some conventions regarding relative and absolute paths. One
is that a dot (.
) represents the current folder you are in. The command
$ ls ./
should return the same as
$ ls
Two dots (..) represent the folder above your current working directory. If you call
$ ls ../
you should see the content of /home
. If you call
$ ls ../../
you should see the content of the parent folder of the parent folder, which on
a normal Linux system is the root folder (/), assuming you are in /home/yourusername/.
However, on WSL it may be slightly different. Another
convention is that ~/
represents the home directory of the user. The
command
$ ls ~/
should list the content of your home directory independent of where you are in the file system.
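The difference between relative and absolute paths can be tried out safely in a scratch folder. This is a minimal sketch; the folder names course and data and all variable names are invented for illustration.

```shell
# Work in a scratch folder so nothing in your real home directory is touched
base=$(mktemp -d)
mkdir -p "$base/course/data"
cd "$base/course"

# Relative path: interpreted starting from the current working directory
rel_listing=$(ls data/..)
# Absolute path: starts with / and works from anywhere
abs_listing=$(ls "$base/course")

# . is the current folder, .. is its parent
here=$(cd . && pwd)
parent=$(cd .. && pwd)

echo "current: $here"
echo "parent:  $parent"
```

Both listings show the same content because data/.. and the absolute path point to the same folder.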
Now that we know where we are and what's there, we can start to move around the file system.
To do this we use the command cd
(change directory). If
you are in your home directory /home/yourusername/
you can go into the
folder unix_shell_course-2020-02-14
by typing
$ cd unix_shell_course-2020-02-14
After that call pwd
to make sure you're in the correct folder.
$ pwd
/home/yourusername/unix_shell_course-2020-02-14
To go back into your home directory you have a couple of different options.
You could use the absolute path
$ cd /home/yourusername/
or the above mentioned convention for the home directory ~/
:
$ cd ~/
or the relative path, in this case the parent directory of
/home/yourusername/unix_shell_course-2020-02-14
:
$ cd ../
As the home directory is such an important place cd
actually uses this as
a default argument, thus if you just call cd
without anything you will
automatically go to the home directory. Test this behavior by calling
$ cd
You can try now to move around the file system yourself and list the files and folders located there.
Now we will create our own folder using the command mkdir
(make
directory). Go into the home directory and type:
$ mkdir my_first_folder
In Unix shells, and in many programming languages, "no
news is good news". The command successfully created the folder
my_first_folder. You can check by calling ls, but mkdir
did
not tell you this. If you do not get a message this usually means
everything went fine. If you call the same mkdir
command again you
should get an error message:
$ mkdir my_first_folder
mkdir: cannot create directory ‘my_first_folder’: File exists
So if a command does not complain you can usually assume there was no error.
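The shell also records whether the last command succeeded in the special variable $?, which is 0 on success and non-zero on failure. This goes slightly beyond the course text; the sketch below uses a scratch folder, and the variable names are illustrative.

```shell
cd "$(mktemp -d)"

mkdir my_first_folder
first_status=$?        # 0 means success; mkdir printed nothing

# Second call fails because the folder exists; the error message is hidden here
second_status=0
mkdir my_first_folder 2>/dev/null || second_status=$?

echo "first call: $first_status, second call: $second_status"
```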
Topics:
- touch
- cp
- mv
- rm
Now we will learn how to manipulate files and folders. We can create some
empty files with touch. The main purpose of the touch
command is actually to
change the time stamps of files, but you can also use it to create empty files. Let's
use touch to create a file called test_file_1.txt:
$ touch test_file_1.txt
Use ls
to check it worked.
The cp command (copy) can be used to copy files or, with a specific flag, folders.
For this it requires at least two arguments: the source and the target file. In
the following example we generate a copy of the file test_file_1.txt
called a_copy_of_test_file.txt
.
$ cp test_file_1.txt a_copy_of_test_file.txt
Use ls
to confirm that this worked. We can also copy the file to the folder we
created earlier my_first_folder
:
$ cp test_file_1.txt my_first_folder
Now there should also be a file test_file_1.txt
in the folder
my_first_folder. If you want to copy a folder and its content you
have to use the flag -r. The r stands for recursive: cp works its way through the folder
and copies everything inside it, including any subfolders.
$ cp -r my_first_folder a_copy_of_my_first_folder
You can use the mv command (move) to rename or move files
or folders. To rename the file a_copy_of_test_file.txt
to
test_file_with_new_name.txt
call
$ mv a_copy_of_test_file.txt test_file_with_new_name.txt
With mv
you can also move a file into a folder. For this the second
argument must be a folder. For example, to move the file now named
test_file_with_new_name.txt
into the folder my_first_folder
use
$ mv test_file_with_new_name.txt my_first_folder
You are not limited to one file if you want to move them into a
folder. Let's create and move two files file1
and file2
into the
folder my_first_folder
.
$ touch file1 file2
$ mv file1 file2 my_first_folder
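The copy and move commands above can be tried safely in a scratch folder. This sketch repeats them with the same file names and then lists the result; mktemp -d is used here only to get a throwaway location, and the variable names are illustrative.

```shell
cd "$(mktemp -d)"
mkdir my_first_folder
touch test_file_1.txt

cp test_file_1.txt a_copy_of_test_file.txt                # copy a file
cp -r my_first_folder a_copy_of_my_first_folder           # copy a folder
mv a_copy_of_test_file.txt test_file_with_new_name.txt    # rename a file
touch file1 file2
mv file1 file2 my_first_folder                            # move several files at once

top_level=$(ls)
inside=$(ls my_first_folder)
echo "$top_level"
echo "$inside"
```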
Now we can introduce another useful feature most shells offer
called globbing. Imagine you want to apply the same
command to several files. Instead of explicitly writing all the file
names (which could take a long time) you can instead use a globbing pattern
to refer to all of them. There are different "wildcards" that can be used for these patterns.
The most important one is the asterisk (*
). It can replace none, one or more
characters. Let's explain this with a quick example:
$ touch file1.txt file2.txt file3
$ ls *txt
$ mv *txt my_first_folder
The ls
shows the two files matching the given pattern
(i.e. file1.txt
and file2.txt
) while dismissing the one not
matching (i.e. file3
). In this case the *
basically means match
anything before txt.
Similarly for mv
- it will only move the two
files ending with txt
.
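A quick way to see what a pattern matches before using it with mv or rm is to hand it to echo first. A minimal sketch in a scratch folder, with illustrative variable names:

```shell
cd "$(mktemp -d)"
touch file1.txt file2.txt file3

# The shell replaces the pattern with the matching names before the command runs
matched=$(echo *txt)
echo "Pattern *txt expands to: $matched"

# file* matches every name that starts with "file"
all=$(echo file*)
echo "Pattern file* expands to: $all"
```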
We now have several empty test files that we don't need anymore. The last command we
will learn in this section is rm
(remove) which allows you to delete files
and folders.
Danger ahead: there is no trash bin if you remove items using rm.
They will be gone for good and without further notice. It's good practice to use rm -i;
this flag asks before deleting a file, which can be a lifesaver. Note: once you are more
advanced and comfortable using the Unix shell you can modify the default behaviour of rm so
that it will always ask you before deleting.
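The following sketch shows the effect of rm -i without needing to type at a prompt: the answer is fed to rm through its standard input, so you can see that answering n keeps the file. The file name important.txt and the variable names are made up; in normal use you would type y or n yourself. One common way to make this the default, offered here as a suggestion rather than part of the course, is to add the line alias rm='rm -i' to your ~/.bashrc.

```shell
cd "$(mktemp -d)"
touch important.txt

# Answer "n" to the confirmation prompt: the file is kept
echo n | rm -i important.txt
[ -e important.txt ] && kept="yes" || kept="no"

# Answer "y": now the file is really deleted
echo y | rm -i important.txt
[ -e important.txt ] && still_there="yes" || still_there="no"

echo "kept after n: $kept, present after y: $still_there"
```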
To delete a file in my_first_folder
call:
$ rm my_first_folder/file1.txt
To remove a folder use the parameter -r
(recursive):
$ rm -r my_first_folder
Alternatively you can use the command rmdir
:
$ rmdir my_first_folder
However, keep in mind rmdir
will only work on empty directories.
Topics:
- less/more
- cat
- echo
- head
- tail
- cut
We haven't really looked at how to view and edit files yet, so now we will move on
and look at some of the commands we can use for this. Please go into the folder unix_shell_course-2020-02-14
if you aren't there already and unzip files.zip
using the following command:
$ unzip files.zip
You should now see some files there (and by now you should know how to check).
To read the content of files with the ability to scroll around we need
a so called pager program. We will use the tool less
which should be available on
all of your systems. Let's start with the file:
origin_of_species.txt
$ less origin_of_species.txt
The file contains Charles Darwin's On the Origin of Species in plain
text. You can scroll up and down line-wise using the arrow keys or page-wise
using the page-up/page-down keys. To quit press q. With
pager programs you can read file content interactively, but sometimes
you just want the content of a file printed to the screen. The command cat
(concatenate) does just
that for one or more files.
Let us use it to see what is in the example file
lorem_ipsum.txt. Assuming you are still in the folder unix_shell_course-2020-02-14
you can call
$ cat lorem_ipsum.txt
The content of the file is shown to you. You can apply the command to two files and the content is concatenated and returned:
$ cat lorem_ipsum.txt lorem_ipsum_2.txt
This is a good time to introduce standard input and standard
output and what you can do with them. Stdin is the standard input stream
and accepts text as input. Stdout is the standard output stream, and as you might expect,
outputs text. You can redirect the standard output into a file by using the >
operator.
Let's try this to generate a new file that contains the combined content of both files:
$ cat lorem_ipsum.txt lorem_ipsum_2.txt > lorem_ipsum_combined.txt
Use cat again to have a look at the content of this file
$ cat lorem_ipsum_combined.txt
Standard output can also be redirected to other tools as
standard input. We will go into more detail a little later. With cat
we used
existing file content. To create something entirely new we can use the echo
command,
which writes an input string to standard output.
$ echo "Linux is cool"
To redirect the output into a target file use >
.
$ echo "Linux is cool" > super_cool.txt
Please note that this can be dangerous: you can easily overwrite the content of an existing file. For example, if you now call
$ echo "Linux is boring" > super_cool.txt
only the last string will be written to the file and the
previous one will be overwritten (worst of all, the text in your file is now totally wrong ;) ).
To append the output of a command to a file without overwriting it use >>.
$ echo "Linux is cool" > super_cool.txt
$ echo "I love Linux" >> super_cool.txt
Now your file should contain two lines.
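The difference between > and >> is easy to verify in a scratch folder. A minimal sketch; the variable names are illustrative, and wc -l < file (reading the file through standard input) is used only to get a bare line count.

```shell
cd "$(mktemp -d)"

echo "Linux is cool" > super_cool.txt      # creates (or overwrites) the file
echo "Linux is boring" > super_cool.txt    # overwrites: only this line remains
after_overwrite=$(cat super_cool.txt)

echo "Linux is cool" > super_cool.txt      # start again
echo "I love Linux" >> super_cool.txt      # appends a second line
line_count=$(wc -l < super_cool.txt)

echo "After overwrite: $after_overwrite"
echo "Lines after append: $line_count"
```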
Often you just want part of a file, for example just the
first few or last few lines. For this the commands head
and tail
can
be used. By default 10 lines are shown. You can use the parameter -n <NUMBER>
(e.g. -n 20) or just -<NUMBER>
(e.g. -20) to specify the
number of lines you want to be displayed. Test head
and tail
with the file
origin_of_species.txt:
$ head origin_of_species.txt
$ tail origin_of_species.txt
This is super useful if you want to look at large files (e.g. FASTQ), because files that large often cannot be opened comfortably in a text editor.
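To see exactly which lines head and tail return, it helps to use a file whose content is predictable. This sketch assumes the seq tool is available (it is on Ubuntu) and uses it to write the numbers 1 to 100 as stand-in data; the file and variable names are invented.

```shell
cd "$(mktemp -d)"
# seq writes the numbers 1 to 100, one per line
seq 1 100 > numbers.txt

first_three=$(head -n 3 numbers.txt)
last_two=$(tail -n 2 numbers.txt)
default_head=$(head numbers.txt)    # no -n given: 10 lines

echo "$first_three"
echo "$last_two"
```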
You can also extract text horizontally using the
command cut. This is especially useful if you have a file with different columns.
To see how this works, let's extract the first 10 characters of each line
in the file origin_of_species.txt:
$ cut -c 1-10 origin_of_species.txt
Let's now look at a more common usage (at least for me). Have a look at the content of the
file mapping_file.tab. You will see that it contains different columns that are
tab-separated. You can extract selected columns with cut:
$ cut -f 1,4 mapping_file.tab
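As a sketch of what cut -f does, the following builds a tiny tab-separated table and pulls out two columns. The table contents and file names are invented and are not those of mapping_file.tab. The -d flag, which names a different delimiter, is shown for comma-separated files.

```shell
cd "$(mktemp -d)"
# Build a tiny tab-separated table (\t is a tab)
printf 'sample\tgroup\tsex\tage\n' > toy_mapping.tab
printf 'S1\tcontrol\tf\t12\n' >> toy_mapping.tab
printf 'S2\ttreated\tm\t14\n' >> toy_mapping.tab

# Keep columns 1 and 4; cut splits on tabs by default
cols=$(cut -f 1,4 toy_mapping.tab)
echo "$cols"

# For comma-separated files, name the delimiter with -d
printf 'S1,control,f,12\n' > toy_mapping.csv
csv_cols=$(cut -d ',' -f 1,4 toy_mapping.csv)
echo "$csv_cols"
```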
Topics:
- wc
- sort
- uniq
- grep
There are also several tools that allow you to manipulate the content of a
text file or find out information about it. For example, if you would like to
find out the number of characters, words or lines in a file, you can use the
wc
command. Try to count the number of lines in the file
origin_of_species.txt:
$ wc -l origin_of_species.txt
You can use sort
to sort a file alpha-numerically. Try the following commands
on the file unsorted_numbers.txt
$ sort unsorted_numbers.txt
$ sort -n unsorted_numbers.txt
$ sort -rn unsorted_numbers.txt
and see if you can understand the output.
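If the difference is hard to spot in a long file, a three-line example makes it visible. A minimal sketch with invented numbers and illustrative names:

```shell
cd "$(mktemp -d)"
printf '10\n9\n100\n' > toy_numbers.txt

text_order=$(sort toy_numbers.txt)           # compares character by character
numeric_order=$(sort -n toy_numbers.txt)     # compares numeric values
reverse_numeric=$(sort -rn toy_numbers.txt)  # numeric, largest first

echo "$text_order"
echo "$numeric_order"
echo "$reverse_numeric"
```

Without -n, 100 sorts before 9 because the character 1 comes before the character 9.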
uniq
takes a sorted list of lines and returns the unique ones. Let's imagine we have a file
containing a list of bacterial taxa from our sequencing data. Due to the nature of 16S
sequencing we will often have multiple OTUs with the same taxonomic assignment, and let's say in
this case we want to look at the unique assignments. Have a quick look at taxa.txt. Then use uniq
to generate
a non-redundant list of taxa:
$ uniq taxa.txt
If you call uniq
with -c
you can count the number of occurrences of each taxon
$ uniq -c taxa.txt
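uniq only merges identical lines that sit next to each other, so an unsorted file has to be sorted first. A minimal sketch with an invented taxa list, using an intermediate file; all file and variable names are illustrative.

```shell
cd "$(mktemp -d)"
printf 'Bacteroidales\nClostridiales\nBacteroidales\n' > toy_taxa.txt

unsorted_result=$(uniq toy_taxa.txt)         # duplicates are not adjacent: 3 lines remain
sort toy_taxa.txt > toy_taxa_sorted.txt
sorted_result=$(uniq toy_taxa_sorted.txt)    # 2 unique taxa

echo "$unsorted_result"
echo "$sorted_result"
```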
With the tool grep
you can extract lines that match a given
pattern. For instance, let's say we are interested in the order
Bacteroidales. Note that grep is case
sensitive by default; to make the search case insensitive you can use the -i
flag.
$ grep Bacteroidales taxa.txt
If you are only interested in the number of lines that match the pattern
use -c
:
$ grep -c Bacteroidales taxa.txt
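The effect of -i can be checked on a small invented list in which the same name appears with different capitalisation. A sketch, with illustrative file and variable names:

```shell
cd "$(mktemp -d)"
printf 'Bacteroidales\nbacteroidales\nClostridiales\n' > toy_taxa.txt

exact=$(grep -c Bacteroidales toy_taxa.txt)          # case sensitive
any_case=$(grep -i -c bacteroidales toy_taxa.txt)    # -i ignores case

echo "case sensitive: $exact, case insensitive: $any_case"
```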
The real power of Unix is built on its capability to readily connect tools.
For this, so-called pipes are used. To use the standard output of one tool
as the standard input of another we use |. For example, based on what
we just did above, let's say we want to find the unique taxonomic assignments
within the order Bacteroidales and write this output to a file:
$ grep Bacteroidales taxa.txt | uniq > Bacteroidales.txt
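Pipes can be chained as often as needed. As a sketch, the following counts how often each taxon occurs in an invented list and ranks the result, most frequent first. The file and variable names are illustrative.

```shell
cd "$(mktemp -d)"
printf 'Bacteroidales\nClostridiales\nBacteroidales\n' > toy_taxa.txt
printf 'Lactobacillales\nBacteroidales\nClostridiales\n' >> toy_taxa.txt

# sort groups identical lines, uniq -c counts them, sort -rn ranks by count
ranking=$(sort toy_taxa.txt | uniq -c | sort -rn)
echo "$ranking"

# Adding head to the end of the chain keeps only the most frequent taxon
top_line=$(sort toy_taxa.txt | uniq -c | sort -rn | head -n 1)
echo "Most frequent: $top_line"
```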
Now we want to generate a copy of all the text files in your working directory. Running this
$ cp *txt copy_of_*txt
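would not work: the shell expands *txt into the list of matching file names before cp runs, so cp cannot pair each file with a new name.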
With for
loops you can solve this problem. Let's start with a simple
one.
$ for FILE in lorem_ipsum.txt lorem_ipsum_2.txt
> do
> head -n 1 "$FILE"
> done
You are not limited to a single command inside a loop; you can use several. For example, say we now want to look at Clostridiales and Lactobacillales as well as Bacteroidales. To do this we can use a for loop:
$ for taxa in Bacteroidales Clostridiales Lactobacillales
> do
> grep "$taxa" taxa.txt | uniq >> taxa_list.txt
> done
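Coming back to the copying problem described above, a for loop can build a new name for each file in turn. A sketch, assuming a scratch folder and invented file names; the variable name f is arbitrary.

```shell
cd "$(mktemp -d)"
touch notes.txt results.txt

# *.txt is expanded once, before the loop starts, so the new copies are not looped over
for f in *.txt
do
    cp "$f" "copy_of_$f"
done

copies=$(echo copy_of_*)
echo "$copies"
```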
Open a new file in a text editor of your choice. I would recommend nano in this case, but if you are brave you can also try vi/vim. To open an editor, simply type its name into your shell. Add the following text:
echo "Number of lines that contain species:"
wc -l origin_of_species.txt
Save the file (nano: ctrl-o) as count_lines.sh, exit your text editor (nano: ctrl-x), make sure the
file origin_of_species.txt
is in the same folder and run the script:
$ bash count_lines.sh
You should get something like
Number of lines that contain species:
15322 origin_of_species.txt
This is your very first shell script. Now we can make it a little more
flexible. Instead of hard coding the input file for wc -l
(that is, fixing its name within the script itself)
we want to be able to give it as an argument.
To do this we can change the shell script to:
echo "Number of lines in the given file:"
wc -l "$1"
$1
is a variable that represents the first argument given.
Now you can call the script like this:
$ bash count_lines.sh origin_of_species.txt
You should get the same results as before. If you want your script to take
a second argument you can use $2. To use all arguments
given to the shell script use the variable "$@". Change the shell
script to:
echo "Number of lines in the given file(s):"
wc -l "$@"
and run it with several input files:
$ bash count_lines.sh origin_of_species.txt taxa.txt
You should get something like:
Number of lines in the given file(s):
21648 origin_of_species.txt
313 taxa.txt
21961 total
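A script that expects arguments can also check that it actually received some. This sketch writes a variant of count_lines.sh into a scratch folder and runs it with and without arguments. The file name count_lines_checked.sh, the usage message and the toy input files are assumptions made for illustration.

```shell
cd "$(mktemp -d)"
printf 'a\nb\nc\n' > three.txt
printf 'x\n' > one.txt

# Write the script to a file; the quoted 'EOF' keeps $# and $@ from being expanded here
cat > count_lines_checked.sh << 'EOF'
# $# holds the number of arguments given to the script
if [ "$#" -eq 0 ]; then
    echo "Usage: bash count_lines_checked.sh FILE [FILE...]"
    exit 1
fi
echo "Number of lines in the given file(s):"
wc -l "$@"
EOF

with_files=$(bash count_lines_checked.sh three.txt one.txt)
without_files=$(bash count_lines_checked.sh) || true
echo "$with_files"
echo "$without_files"
```

Quoting "$@" keeps file names that contain spaces intact when they are passed on to wc.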
Now we will look at a use for shell scripts that could be really helpful for you (especially if you are studying the microbiome).
After doing your differential abundance analysis you often want to extract the sequences of the OTUs which were identified as differentially abundant and perhaps check their taxonomic assignment using something like BLAST or EZtaxon. This is extremely annoying to do manually and can take a lot of time if you have lots of OTUs, but luckily with a bit of BASH magic you can automate this process easily.
Consider that the headers in a fasta file have a consistent format (i.e. >OTU_4), and also
that grep
can take a txt file as input (use man to check this), can you think of
a way we could do this?
Solution
An easy way to do this (not necessarily the best, but it works):
First you need a text file with your OTUs; you should be able to generate your own, or you can use the one provided.
echo "Extracting OTUs"
grep -f "$1" -w -A8 "$2" >> "$3"
# -f means read the patterns from a file, -w means match whole words only, and -A means
# print that number of lines of context after each match, in this case 8 because our
# sequences are 8 lines long
You may notice that this prints a couple of extra OTU headers. This is because the sequences in the FASTA file are not always exactly 8 lines long. A better, but much more complicated, way to do this is to convert your FASTA file to single-line sequences before extracting them.
This is the code I normally use:
echo "enter sequence filename: "
read filename
echo "enter file containing patterns to match: "
read patterns
echo "enter output filename: "
read output
perl -pe '/^>/ ? print "\n" : chomp' "$filename" | grep -f "$patterns" -w -A1 | grep -v -- "^--$" > "$output"
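To see why the single-line format helps, here is a sketch on an invented FASTA file whose sequences are already on one line each, so it skips the perl step. It assumes GNU grep, which prints a -- line between separate groups of matches when -A is used. The file names toy.fasta, wanted.txt and extracted.fasta are hypothetical.

```shell
cd "$(mktemp -d)"
printf '>OTU_1\nACGT\n>OTU_10\nGGGG\n>OTU_2\nTTTT\n>OTU_3\nCCCC\n' > toy.fasta
printf 'OTU_1\nOTU_3\n' > wanted.txt

# -w stops OTU_1 from also matching OTU_10; -A1 adds the sequence line after each header
# The second grep removes the -- separator lines
grep -f wanted.txt -w -A1 toy.fasta | grep -v -- '^--$' > extracted.fasta

cat extracted.fasta
```

With one sequence line per record, -A1 always returns exactly the header and its sequence, whatever the sequence length.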
I thought it might also be useful to quickly show an example of a bash script used for analysis. This is one I wrote to run Salmon on some RNA-seq data:
#!/bin/bash
for fn in trimmed_reads/F123{54..72};
do
    samp=$(basename "${fn}")
    echo "Processing sample ${samp}"
    echo "${fn}"
    salmon quant -i mouse_index.idx -l A \
        -1 "${fn}_R1_001.qc.fq.gz" \
        -2 "${fn}_R2_001.qc.fq.gz" \
        -p 8 --validateMappings -o "quants/${samp}_quant"
done
Can you understand what it's doing?
Here are some useful links if you want to learn more about working with the Unix shell:
http://www.linuxcommand.org/lc3_learning_the_shell.php
https://www.datacamp.com/courses/introduction-to-shell-for-data-science