Problem Set 1

| Due: 5pm on Feb 9

Workflow

To start, fork the repository.
Clone the repository to your computer.
Modify the files and commit changes to complete your solution.
Push/sync the changes up to GitHub.
Make corrections until the Travis CI build status icon changes to green / passing
Create a pull request on the original repository to turn in the assignment.

For each problem, you should use simple Unix commands to arrive at the correct answer. Put all your work in a run.sh file that generates a answers.yml file.

# ansewer.yml should look like this:
answer-1: 123
answer-2: 456

Example Problem

Which state has the highest popultion?

# zcat == gzcat on Linux
high_pop=$(gzcat states.tab.gz | cut -f1,2 | sort -k2n | tail -n1 | cut -f1)
echo "answer-example: $high_pop

Each problem below is worth 5 points. Use the files in the data-sets repository.

Question 1

Which state in states.tab.gz has the lowest murder rate?

Question 2

How many sequence records are in the sample.fa file?

Question 3

How many unique CpG IDs are in cpg.bed.gz?

Question 4

How many sequence records are in the SP1.fq file?

Question 5

How many words are on lines containing the word bloody in hamlet.txt? (Hint: use wc to count words).

Question 6

What is the length of the sequence in the first record of sample.fa? (Hint: use wc to count characters).

Question 7

What is the name of the longest gene in genes.hg19.bed.gz?

Question 8

How many unique chromosomes are in genes.hg19.bed.gz?

Question 9

How many intervals are associated with CTCF (not CTCFL) in peaks.chr22.bed.gz?

Question 10

On what chromosome is the largest interval in lamina.bed?

ancarr/problem-set-1