This is my ever-growing collection of links, solutions and sources I have discovered and used when trying to learn and teach computational biology. I often use it as a one-stop resource page for whoever asks me about a good book, a website, or that command that lets you execute line 45 from history.
to learn about handling data in shell and R.
If you need a good reference, or just want to persuade your colleague or supervisor that she really needs to get to where the puck is going to be, start with these. Actually, scratch that: this train has been puffing along for quite a while and all we can do now is not get left behind.
- Loman, N. & Watson, M. So you want to be a computational biologist? Nat Biotechnol 31, 996–998 (2013).
- Wilson, G. et al. Best Practices for Scientific Computing. PLoS Biol 12, e1001745 (2014).
- Wilson, G. et al. Good Enough Practices in Scientific Computing. PLoS Comput Biol 13, e1005510 (2017).
- Tippmann, S. Programming tools: Adventures with R. Nature 517, 109–110 (2015).
- Barone, L., Williams, J. & Micklos, D. Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators. PLoS Comput Biol 13, e1005755 (2017).
- Wilson Sayres, M. A. et al. Bioinformatics Core Competencies for Undergraduate Life Sciences Education. PLoS ONE 13, e0196878 (2018).
Also, bioinformatics != computational biology.
- Practical Computing for Biologists by Steven H.D. Haddock and Casey W. Dunn. It covers command line, Python, installing software and manipulation of graphics.
- Bioinformatics Data Skills by Vince Buffalo. Shell, R, Git with emphasis on life science data analysis, including next-generation sequencing file handling.
- R for Data Science by Garrett Grolemund and Hadley Wickham. Solid introduction to `tidyverse` ways of handling data and analysis by the creators and evangelists :-)
- R Graphics Cookbook by Winston Chang. `ggplot2` explained using clear examples akin to recipes ("if you want to plot this, do this and that").
GitHub files from Vince’s book (there are some useful comments about setting up the Terminal etc.): Vince Buffalo’s GitHub account and his book-related files on GitHub.
- Automate the Boring Stuff with Python by Al Sweigart. The link leads to a free online version, but there are also a hard copy and an ebook version available.
In particular, do not export gene IDs and dates to Excel and then import them back into R or other programming tools. You have been warned.
- Zeeberg, B. R. et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5, 80 (2004). Also check this blog post (with comments), from 2012 (sic): Gene name errors and Excel: lessons not learned.
- Mallona, I. & Peinado, M. A. Truke, a web tool to check for and handle excel misidentified gene symbols. 1–3 (2017). doi:10.1186/s12864-017-3631-8
If you have to use Excel for dates, split your date into three numerical columns: year, month and day and use package lubridate to handle the dates after importing to R. Also, here is a good website with tricks for power users and here is a website which explains R data structures for people coming from Excel.
This is essential. A good text editor has to support regular expressions and understand different line ending conventions. All the software below is free to use.
- Notepad++ on Windows
- BBEdit on Macs (free version is powerful enough and entirely sufficient for a start)
- Gedit on Linux (available by default on Ubuntu)
- Atom on everything (it runs as a Chrome-based browser)
Code style guides for R. Pick one and stick to it:
Also important:
- Naming things Jenny Bryan's definitive slides on how to name things FTW
- Full R documentation online (including 13k+ packages)
- How to write a reproducible example. If you need to ask for R help online, this is how you do it. Now in the form of an R package: reprex.
- Reserved words in R. The list is short: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`, `TRUE`, `FALSE`, `NULL`, `Inf`, `NaN`, `NA`, `NA_integer_`, `NA_real_`, `NA_complex_`, `NA_character_`.
- [Make a website with R Blogdown] and share your code with the world
- learnr Interactive tutorials with R Notebook and Shiny - the next big thing for teaching R in my opinion.
- etherpad for collaborative real time editing (a la Google Docs). This is what Software and Data Carpentry use, but you need to host it (there are free public hosts available).
- HackMD a possibly better alternative to etherpad. Does not require hosting and uses Markdown (it formats the text automatically).
- UpDog A website-hosting service (that supports your own domain names) run off your normal Dropbox or Google Drive accounts. The great thing about it is that you can put your R Notebook or text files there to have a refreshing page (30 sec. delay) with a live coding session for your students to follow. Free! (Markdown support costs extra). See also this tweet from Cloudstitch: Power a Jekyll Blog from Google Drive with just a 2 minute setup.
- R Blogdown is a fantastic way to set up your website from within R (this Twitter thread from Dan Quintana is rather useful as well). If you want to write a book or a paper within R, try R Bookdown. Both Bookdown and Blogdown are by Yihui Xie. Hugo + Netlify seem to be the new Jekyll + GitHub Pages.
- Awwapp - web whiteboard You draw/type something on your screen and your students see it and can contribute in real time.
- ASCIinema Recording your shell sessions is useful for your students, and this system lets you select the text in the recording and copy/paste it! What would be super useful though is a real-time shell recording system that would output the recording as-is (both commands and their output) to an accessible location like a website or even a file.
- Choose an open source license: great source to figure out in plain English what license to use for your open source project.
- Software Carpentry's founder Greg Wilson's book on teaching programming: How to Teach Programming (And Other Things). Free versions available on his site, as an epub, mobi or as a low-cost hard copy.
- Brown, N. and Wilson, G. Ten quick tips for teaching programming, PLoS Comput Biol 14(4): e1006023 (2018).
- David Robinson's Teach the tidyverse to beginners. Very sensible, but do check the comments that point out the advantages of `base` R. The complementary `Tidyverse` vs `base` R philosophies are actually a result of the evolution of R and its users, which Roger Peng expertly summarised in his talk Teaching R to New Users - From tapply to the Tidyverse.
- Mine Cetinkaya-Rundel teaches stats with R and Git at Duke and is at the forefront of implementing these tools in a high-throughput teaching context. Check out her paper Infrastructure and tools for teaching computing throughout the statistical curriculum, her talk at the last useR! conference, Teaching data science to new useRs, and the course itself: [STA 199: Intro to Data Science](http://www2.stat.duke.edu/courses/Spring18/Sta199/).
- If you want just one thing to explain to someone why R is super awesome, show them Paul Campbell's presentation A whirlwind tour of working with data in R. You're welcome.
- Pretty much anything Jenny Bryan does, but in particular her UBC course Data wrangling, exploration, and analysis with R and her tutorial on purrr.
- David Robinson's step-by-step demonstrations of exploratory data analysis: Modeling gene expression with broom: a case study in tidy analysis and Cleaning and visualizing genomic data: a case study in tidy analysis.
- Julia Silge's amazing text mining walkthrough. She also has a book: Text Mining with R (free online version), paid hardcopy.
- Mara Averick's collection of purrr tutorials.
- Suzan Baert's crystal clear, in-depth four-part tutorial on dplyr.
- Software and Data Carpentry R lessons are a bit inconsistent in their depth and scope, but I think the Data Carpentry R Ecology Lesson is the best one to start with.
Two classics:
- Code School's Try R - R console is emulated in the browser, no R installation necessary.
- Swirl: Learn R, in R - when you have R installed, try this package first.
The only two things that make @JennyBryan 😤😠🤯. Instead use projects + here::here() #rstats pic.twitter.com/GwxnHePL4n
— Hadley Wickham (@hadleywickham) December 11, 2017
...use the right way to organise your R work:
- Prime Hints For Running A Data Project In R by Kasia Kulma, with tips from commenters incorporated into her post. The best post on the topic that I know of.
- Project-oriented workflow where Jenny Bryan explains what's up with burning of the computers.
- File organisation best practices by Andrew Tran that summarises and builds on Jenny's and Joris Muller's solutions.
- Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten Simple Rules for Reproducible Computational Research PLoS Comput Biol 9, e1003285 (2013).
- http://ryanstutorials.net/linuxtutorial/navigation.php
- http://korflab.ucdavis.edu/Unix_and_Perl/
- Software Carpentry Unix Shell lesson
- explainshell.com will try to give you explanation for every element of a command line expression that you type (try it, it's really cool)
- Take Control: Command Line by Joe Kissell (aimed at Mac users, but good for everyone - as usual ;-)
- The UNIX workbench by Sean Kross (donationware); now with a Coursera course!
Take time to make your terminal window and the font big enough!
- Default (at least on my machine): `\h:\W \u\$`
- How to check what's your current prompt: `echo $PS1`
- How to change your prompt: `PS1="yournewprompt"`. A nice trick is to use `PS1="\n\W \u-$ "` so that you have a new line before your prompt - it's visually separated from the output of a previous command.
Useful link with options to modify your prompt: https://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html
This is relevant for modifying the `$PATH`:
- http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html
- http://stackoverflow.com/questions/9832770/where-is-the-default-terminal-path-located-on-mac
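As a rough illustration of what modifying the `$PATH` usually amounts to, here is a minimal bash sketch. The directory `$HOME/bin` is only an assumed example location for your own scripts; it is not taken from the links above.

```bash
# Hypothetical directory holding your own scripts; any directory works.
my_bin="$HOME/bin"

# Prepend it to PATH only if it is not already there,
# so re-running these lines does not create duplicate entries.
case ":$PATH:" in
  *":$my_bin:"*) ;;                 # already present, do nothing
  *) PATH="$my_bin:$PATH" ;;
esac
export PATH

# Show PATH one entry per line to check the result.
echo "$PATH" | tr ':' '\n'
```

To make the change permanent you would typically put the lines in `~/.bash_profile` or `~/.bashrc`; which of the two is read depends on whether the shell is a login shell, which is what the first link discusses.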
- `control-a`: move cursor to beginning of line
- `control-e`: move cursor to end of line
- `control-c`: cancel input or stop a running command
- `control-k`: delete all text from cursor to end of line
- `control-d`: deletes a character in place
- `option-delete`: delete an entire word (may not work depending on whether your option key is reassigned; this is a preference in your Terminal settings)
- `option-b`: move cursor backwards an entire word (as above)
- `option-f`: move cursor forwards an entire word (as above)
- `up arrow`: access last entered command
- `control-r`: start searching shell history (start typing to search; enter will run the current command; `command-.` will cancel)
- `control-v + [some key]` will literally print `[some key]` - useful if you want to enter a tab and `\t` doesn't work
- `history` lists your previous commands with their numbers; `![some number]`, where `[some number]` is the number of the history entry, executes that command again (no need to copy and paste)
- You can also narrow down the last command selection by including the first letter of the last command you want to use, e.g.: `!d` (if your favourite last command starts with `d`)
- `!$` retrieves the last word of the last command
How to really clear the terminal
- `clear`: clears the screen
- `control-l`: works just like `clear`
- `command-k`: clears the screen and prevents from scrolling back
- `exit`: exit shell (it closes the terminal window)
- `ls [a-z]*.txt` lists every .txt file whose name starts with a lowercase letter
- `ls {pear,peach}.txt` lists pear.txt and peach.txt
- `ls -1` show output in a single column
- `ls -alh` show output including hidden files (`-a`), in a long format (`-l`) and human-readable file sizes (`-h`)
- `history` displays history of the commands (can be piped into a file). If you don't want the terminal to remember the history between sessions, start with this thread on Stack Overflow.
- `cd -`: go to last folder
- `cd .`: go to a current folder
- `cd ..`: go to a parent folder
- `cd`
- `cd ~`
- `cd /Users/Jarek`
- `cd -` (if you were in your home folder in a previous command)
- `\`: will escape the space character (e.g. "My\ folder")
- If you drag your folder from Finder to a Terminal window, it will automatically recognise the path to this folder and escape spaces
- `!!`: works just like the `up arrow`, but you can modify it by adding stuff in front or behind it, e.g.: `!! -h` or `sudo !!`
- You can also narrow down the last command selection by including the first letter of the last command you want to use, e.g.: `!d` (if your favourite last command starts with "d")
- `cat`
- `less`: space to move forward, B to move back, Q to quit
- `more`: `more` on a Mac is the same as `less`
- `head`: show first few lines of the file; parameter `-n` specifies number of lines to show
- `tail`: as above, but for the end of the file
- `(head -n5; tail -n5) < inputfile`: display the first and last 5 lines of the input file
- `touch newfilename`: will create an empty file with a name newfilename
- `touch existingfilename`: will update modification date of the existingfilename
- `head -n[line number]` to display [line number] number of lines (if you want a range use pipes and `tail` after `head -n`)
- `wc` word count (displays line, word and character count); `-l -w -c` limits display to line, word or character only
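The "range of lines" trick with `head` and `tail` mentioned above can be sketched as follows. The file name `numbers.txt` and its contents are made up for the example.

```bash
# Build a small 10-line example file in a temporary directory.
tmpdir=$(mktemp -d)
seq 1 10 > "$tmpdir/numbers.txt"

# Lines 4 to 6: keep the first 6 lines, then the last 3 of those.
middle=$(head -n 6 "$tmpdir/numbers.txt" | tail -n 3)
echo "$middle"

# Count lines only; reading from stdin keeps the file name out of the output.
n_lines=$(wc -l < "$tmpdir/numbers.txt")
echo "$n_lines"

rm -r "$tmpdir"
```

In general, lines M to N are `head -n N | tail -n (N - M + 1)`.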
- `*`: a wildcard for "zero or more" instances (`*og` would catch anything that ends with "og" including just "og")
- `?`: a wildcard for "any single" instance (`?og` would catch: dog, fog, log etc.)
- `{}`: braces will select a range of stuff (`{A..Z}`, `{1..3}`, `{apple,pear,watermelon}`; note: no spaces after the commas) (this is called "brace expansion")
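A small bash sketch of the difference between wildcards and brace expansion, using made-up file names in a throwaway directory. Wildcards only match files that exist, while braces simply generate text.

```bash
# Work in a throwaway directory so the patterns only see our example files.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch og dog fog frog log.txt

star=$(echo *og)      # zero or more characters before "og"
single=$(echo ?og)    # exactly one character before "og"
braces=$(echo {1..3} {pear,peach}.txt)   # brace expansion needs no existing files

echo "$star"
echo "$single"
echo "$braces"

cd - > /dev/null
rm -r "$tmpdir"
```

Using `echo` in front of a pattern is a cheap way to preview what the shell will expand it to before handing it to `rm` or `mv`.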
...but remember that `grep` in Notepad++, Ruby, JavaScript or Mac terminal can have slightly different implementations (i.e. not all functions will work or not all functions will work the same way). When stuff doesn't work, try `egrep` (extended grep) and always RTFM.
A cool regular expression recognition web app - you put in your input and it tries to automatically find a regexp pattern to match it. When it works, it's like magic.
There is now also a way of testing and visualising regular expressions inside RStudio: Regexplain by Garrick Aden-Buie.
- `\w`: Letters, numbers and `_`
- `.`: Any character except `\n` and `\r`
- `\d`: Numerical digits
- `\t`: Tab
- `\r`: Return character. Also used as the generic end-of-line character in BBEdit
- `\n`: Line-feed character. Also used as the generic end-of-line character in Notepad++
- `\s`: Space, tab, or end of line
- `[A-Z]`: A single character of the ranges indicated in square brackets
- `[^A-Z]`: A single character including all characters not in the brackets. Note that this will include `\n` unless otherwise specified, and may cause you to match across lines
- `\`: Used to escape punctuation characters so they are searched for as themselves, not interpreted as wildcards or special symbols
- `\\`: The `\` symbol itself, escaped
- `^`: Match the start of the line, i.e., the position before the first character
- `$`: Match the last position before the end-of-line character
- `+`: Look for the longest possible match of one or more occurrences of the character, wildcard, or bracketed character range immediately preceding. The match will extend as far as it can while still allowing the entire expression to match.
- `*`: As above, matches as many occurrences of the previous character as possible, but allows for the character not to occur at all if the match still succeeds
- `?`: Modifies greediness of `+` or `*` to match the shortest possible match instead of longest
- `{}`: Specify a range of numbers to repeat the match of the previous character. For example: `\d{2,4}` matches between 2 and 4 digits in a row; `[AC]{4,}` matches 4 or more of the letter A or C in a row
- `()`: Capture the search results between the parentheses for use in the replacement term
- `\1` or `$1`: Insert the captured text into the replacement term; the groups are numbered in the order in which they appear. Syntax depends on the text editor or language that you are using.
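To see a few of these pieces working together on the command line, here is a minimal sketch using `grep -E` and `sed -E`. The sample names and values are invented for the example.

```bash
# Toy input: one record per line (made-up sample names and values).
input='sample_01 AACCCCAT 0.98
sample_02 GGTT 0.5
sample_10 ACACACAC 10298'

# {4,} quantifier on a character class: lines with 4 or more A/C in a row.
runs=$(printf '%s\n' "$input" | grep -E '[AC]{4,}')

# Escaped dot: match the literal string 0.98, not "0, any character, 98"
# (an unescaped dot would also match the 0298 inside 10298).
literal=$(printf '%s\n' "$input" | grep -c '0\.98')

# Capture groups: swap the first two columns using \1 and \2 in the replacement.
swapped=$(printf '%s\n' "$input" | sed -E 's/^([a-z_0-9]+) ([ACGT]+)/\2 \1/')

echo "$runs"
echo "$literal"
echo "$swapped"
```

Note that the lazy `+?` and `*?` forms are a Perl-style extension; as far as I know, `grep -E` and `sed -E` do not support them, which is one of those implementation differences worth checking in the manual.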
- `grep "@" [file name]` search for lines that contain "@"
- `grep -c "@" [file name]` count matching lines
- `grep -v "@" [file name]` find non-matching lines
- `grep -v -c "@"` count non-matching lines
- `grep -c "^CGATA" [file name]` count lines beginning with CGATA
- `grep "0\.98"` greps literal dot
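The commands above can be tried on a tiny made-up FASTQ file. This is only a sketch; the reads and quality strings are invented, and one quality line deliberately contains "@" to show why anchoring with `^` matters.

```bash
tmpdir=$(mktemp -d)
# Two made-up FASTQ records (header, sequence, separator, qualities).
printf '%s\n' '@read1' 'CGATATTG' '+' 'IIIIIIII' \
              '@read2' 'TTCGATAA' '+' 'IIII@III' > "$tmpdir/reads.fastq"

with_at=$(grep -c "@" "$tmpdir/reads.fastq")            # lines containing "@"
without_at=$(grep -v -c "@" "$tmpdir/reads.fastq")      # lines not containing "@"
starts_cgata=$(grep -c "^CGATA" "$tmpdir/reads.fastq")  # lines beginning with CGATA
headers=$(grep -c "^@read" "$tmpdir/reads.fastq")       # anchored: header lines only

echo "$with_at $without_at $starts_cgata $headers"
rm -r "$tmpdir"
```

Counting "@" alone overcounts here, because "@" is also a valid quality character; anchoring the pattern to the start of the line (and to a known read-name prefix) is safer.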
- `mkdir -p`: make nested directories at once (missing parent directories are created along the way)
- `tr` to substitute one thing with another or delete a query from a string
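A short sketch of both commands, with an invented folder layout and invented sequences:

```bash
tmpdir=$(mktemp -d)

# -p creates the missing parent folders (project, data) along the way
# and does not complain if the folder already exists.
mkdir -p "$tmpdir/project/data/raw"
mkdir -p "$tmpdir/project/data/raw"
[ -d "$tmpdir/project/data/raw" ] && made=yes

# tr maps characters one-to-one: here, complement a DNA sequence.
complement=$(echo "AACGT" | tr 'ACGT' 'TGCA')

# tr -d deletes characters: here, strip the gaps from an aligned sequence.
ungapped=$(echo "AC--GT-A" | tr -d '-')

echo "$complement $ungapped"
rm -r "$tmpdir"
```

`tr` works on single characters, not on words or patterns; for anything longer, `sed` is the usual next step.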
- `cut` will cut out characters or columns from a delimited file
- `cut -d":" -f2` will first split each line into columns delimited with the ":" and then extract the second (`-f2`) column from each line
- `sort` can use column numbers: `sort -k[number of the column]n` (n is for numerical, r is for reverse). You can combine sorting by column, i.e. first by column 3 then by 2: `sort -k 3 -k 2nr`. Strictly speaking, `-k 3` means "from column 3 to the end of the line"; use `-k3,3 -k2,2nr` to restrict each key to a single column.
- `uniq` will collapse multiple matches, but they have to be next to each other, so the file has to be sorted by `sort` first
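These three commands are most useful chained together. A minimal sketch on made-up, colon-delimited records (sample, species, count):

```bash
# Made-up colon-delimited records: sample:species:count
data='s1:mouse:12
s2:human:7
s3:mouse:30
s4:rat:7'

# Second column only.
species=$(printf '%s\n' "$data" | cut -d":" -f2)

# uniq only collapses adjacent duplicates, so sort first; -c adds counts.
species_counts=$(printf '%s\n' "$data" | cut -d":" -f2 | sort | uniq -c)

# Sort by column 3 numerically, descending, then by column 1 to break ties.
# -t sets the delimiter for sort; -k3,3 restricts the key to that one column.
by_count=$(printf '%s\n' "$data" | sort -t":" -k3,3nr -k1,1)

echo "$species_counts"
echo "$by_count"
```

The `sort | uniq -c` pair is the standard way to build a quick frequency table on the command line.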
- `rm -i`: flag `-i` will prompt you to confirm before proceeding to remove. It can be used with other commands, such as `mv`.
Jenny Bryan's book about Git for R users is great: Happy Git and GitHub for the useR.
- `git init` to initialise repository (a tracked directory)
- `git remote add origin https://github.com/jarekbryk/example_repository.git` to add remote repository link for local tracking
- `git add [files]` to explicitly add [files] to tracking (files can also be explicitly ignored by listing them in a `.gitignore` file)
- `git commit` to "upload" the tracked version to a repository, always with a [comment] on what was done: `git commit -m "[your comment here]"`
- `git status` to check, er, status
- `git diff` to check differences between the tracked version and the current files; it shows changes that are not staged yet, so run it before `git add` (or use `git diff --staged` afterwards)
- `git log` to list all commits in reverse chronological order
- `git push -u origin master` to upload local changes ("master") to GitHub ("origin")
- `git remote -v` to list the configured remotes and their URLs (to see whether everything has been pushed, check `git status`)
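The local part of this workflow can be rehearsed safely in a throwaway folder. The sketch below assumes `git` is installed; the file names, commit messages and the user name and e-mail are placeholders, and nothing is pushed anywhere.

```bash
# Throwaway repository; nothing here touches GitHub.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
git init -q

# Identity is set for this repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

echo "raw_data/" > .gitignore            # ignore rules live in a .gitignore file
echo "first draft" > notes.txt
git add .gitignore notes.txt             # stage the files
git commit -q -m "Add notes and ignore rules"

echo "second line" >> notes.txt
unstaged=$(git diff --name-only)         # changed but not yet staged
git add notes.txt
staged=$(git diff --staged --name-only)  # staged, waiting for the commit
git commit -q -m "Extend notes"

n_commits=$(git log --oneline | wc -l)
echo "$unstaged $staged $n_commits"

cd - > /dev/null
rm -rf "$tmpdir"
```

Once a remote has been added with `git remote add origin ...`, the `git push -u origin master` step from the list would follow the commits.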
- Computational Biology - A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R by Röbbe Wünschiers (Amazon.co.uk), which includes good coverage of awk and sed. The book’s website is at http://www.staff.hs-mittweida.de/~wuenschi/doku.php?id=rwbook2.
And a very good tutorial that lets you use Awk right away: Why you should learn just a little Awk: An Awk tutorial by Example by Greg Grothaus.
- https://www.biostars.org/p/72433/
- http://linuxcommando.blogspot.co.uk/2008/04/using-awk-to-extract-lines-in-text-file.html
- http://bioinformatics.cvr.ac.uk/blog/essential-awk-commands-for-next-generation-sequence-analysis/
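- `cat reads.fastq | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > read_length.txt`: tabulate read lengths in a FASTQ file (every fourth line, starting from line 2, is a sequence)
- `awk '0 == (NR + 1) % 2' inputfile.txt`: print only the odd-numbered lines
- `cat barcount.txt | sed -E -e 's/^ +([0-9]+) [ACGTN]+/\1/' | awk 'BEGIN{total=0} {if ($1>10000) total+=$1} END{print total}'`: keep only the leading count from each "count sequence" line, then sum the counts greater than 10000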
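The one-liners above rely on `NR` (the current line number) and on the `END` block. A minimal sketch on an invented three-read FASTQ file shows both:

```bash
tmpdir=$(mktemp -d)
# Three made-up FASTQ records with reads of length 4, 4 and 6.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' \
              '@r2' 'GGCC' '+' 'IIII' \
              '@r3' 'ACGTAC' '+' 'IIIIII' > "$tmpdir/reads.fastq"

# NR % 4 == 2 selects the sequence lines; uniq -c tabulates their lengths.
read_lengths=$(awk 'NR % 4 == 2 {print length($1)}' "$tmpdir/reads.fastq" | sort -n | uniq -c)

# Sum of all read lengths: accumulate in the main rule, print once in END.
total_bases=$(awk 'NR % 4 == 2 {total += length($1)} END {print total}' "$tmpdir/reads.fastq")

# Odd-numbered lines only (same idea as the (NR + 1) % 2 one-liner).
odd_lines=$(awk 'NR % 2 == 1' "$tmpdir/reads.fastq" | wc -l)

echo "$read_lengths"
echo "$total_bases"
rm -r "$tmpdir"
```

Passing the file name to `awk` directly, as here, avoids the extra `cat`, but both forms give the same result.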
This will let you read and write to a Windows partition from macOS:
- http://www.makeuseof.com/tag/write-ntfs-drives-el-capitan-free/
- http://osxdaily.com/2013/10/02/enable-ntfs-write-support-mac-os-x/
open /Volumes
echo "LABEL=DRIVE_NAME none ntfs rw,auto,nobrowse" | sudo tee -a /etc/fstab
This will let you read from a Linux partition on macOS:
- Install FUSE for macOS
- Install ext4fuse
I am still working on it - I only managed to get read access when root...
This assumes you cannot modify or don’t trust the system-wide settings in Ubuntu/Mac.
- HowTo: Use a Proxy on the Linux Command Line
- How to change proxy setting using Command line in Mac OS?
- `Ctrl-a d` to disconnect from the screen
- `screen -ls` list of screens
- `screen -r [id of the screen]` to reconnect to the screen