Exercise

This repository contains the assignment on Introduction to Linux

Questions and solutions

Create a project directory called Exercise. Which commands did you use?

#creating a directory using mkdir command
mkdir Exercise

In the directory, create all the sub-directories you would need for a Bioinformatics Project

cd Exercise
mkdir -p data scripts results

With the provided dataset files, put them in the appropriate directories you created under your Bioinformatics Project

#Using the mv command to move the datasets from downloads folder to data folder
cd data
mv ../nrf1_seq* ./

Extract the sequence headers and save into a file sequence_names.txt in the appropriate directory

#Extracting sequence headers using grep

grep "^>" nrf1_seq.fa >> ../results/sequence_names.txt
grep "^>" nrf1_seqtemp.fa >> ../results/sequence_names.txt

Save the commands you used in question 4 in a script file extract_seq.sh

nano extract_seq.sh
#!/usr/bin/bash
#Commands to extract sequence headers from a fasta file

grep "^>" nrf1_seq.fa >> ../results/sequence_names.txt
grep "^>" nrf1_seqtemp.fa >> ../results/sequence_names.txt

Count the number of mRNA.

grep -c "mRNA" nrf1_seq.fa
#Output : 90
grep -c "mRNA" nrf1_seqtemp.fa
#Output : 90

What other sequences apart from mRNA?

cDNA

For each category, how many are there?

 grep -v "mRNA" sample_headers.txt |wc
#output : 20

Save the above output to files.

 grep -v "mRNA" sample_headers.txt > non_mrna.txt

How many organisms (create a file with the organisms without duplicates)

#Extracting organisms and saving in file species.txt
grep ">" sequence_names.txt | cut -d, -f1 | sed 's/PREDICTED://g' | awk '{print $2,$3}' >> species.txt

#Counting unique species
sort species.txt | uniq |wc

#Output : 36 unique organisms

How many are predicted?

 grep -c  "PREDICTED" sequence_names.txt
#Output : 156

How many nucleotides are in the file? How many of each of the bases are there?

#Number of nucleotides in nrf1_seq fasta file
#First I extract only the nucleotides and save them in a new file using grep -v
grep -v "^>" nrf1_seq.fa  > nrf1_seq
#Then filter out the last line doesnt contain a valid nucleotide
head -n -1 nrf1_seq > nrf1.seq
#Counting number of nucleotides using wc -m flag which counts characters
wc -m nrf1.seq
#Output : 820493

Number of nucleotides in nrf1_seqtemp.fa

grep -v "^>" nrf1_seqtemp.fa |wc -m
#Output:  820493

Number of each bases

#Counting each number of nucleotide using grep -o flag which show only nonempty parts of lines that match.
#Count of A
grep -o 'A' nrf1_seqtemp |wc -l
#output:215395

#Count of G
grep -o 'G' nrf1_seqtemp |wc -l
#Output: 193881

#Count of C
grep -o 'C' nrf1_seqtemp |wc -l
#output:184571


#Count of T
grep -o 'T' nrf1_seqtemp |wc -l
#Output: 216465

Parcelli/Exercise

Exercise

Questions and solutions