Haruo Suzuki
Last Update: 2016-03-22
Project started 2015-11-15.
The complete genome of Mycoplasma genitalium G37 was retrieved from the NCBI FTP site. The R language was used to assess genome features: sequence length, base composition, GC content, and local variation in GC content and GC skew.
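As a quick cross-check of these genome features outside R, the following is a minimal sketch in shell and awk. It assumes a nucleotide FASTA file with a single record, and it defines GC content as (G + C) / (A + C + G + T). The function name fasta_stats is hypothetical and is not part of this project's scripts.

```shell
# Hypothetical helper: summarise a nucleotide FASTA read from stdin.
# Prints sequence length, counts of A/C/G/T, and GC content as a fraction.
fasta_stats() {
  awk '
    /^>/ { next }                      # skip FASTA header lines
    {
      seq = toupper($0)
      len += length(seq)               # length counts every symbol, including N
      a += gsub(/A/, "", seq)          # gsub returns the number of matches removed
      c += gsub(/C/, "", seq)
      g += gsub(/G/, "", seq)
      t += gsub(/T/, "", seq)
    }
    END {
      printf "length\t%d\n", len
      printf "A\t%d\nC\t%d\nG\t%d\nT\t%d\n", a, c, g, t
      if (a + c + g + t > 0)
        printf "GC\t%.4f\n", (g + c) / (a + c + g + t)
    }
  '
}
```

For example, gzip -dc data/GCF_000027325.1_ASM2732v1_genomic.fna.gz | fasta_stats prints one tab-separated line per statistic.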
mgen/
README.md: project documentation
data/: contains sequence data in FASTA format
scripts/: contains R and Shell scripts
analysis/: contains results of data analyses
FASTA-format nucleic acid (GCF_000027325.1_ASM2732v1_genomic.fna.gz) and amino acid (GCF_000027325.1_ASM2732v1_protein.faa.gz) sequences were downloaded on 2016-01-22 from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000027325.1_ASM2732v1/ into data/:
# MD5 Checksums
MD5 (data/GCF_000027325.1_ASM2732v1_genomic.fna.gz) = 456f609c24abab75d4d593b5c92bd631
MD5 (data/GCF_000027325.1_ASM2732v1_protein.faa.gz) = 02805a93d8dc783967e0a8fdb4511af5
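The checksums above can be re-verified after a fresh download. The following is a minimal sketch, assuming either md5sum (GNU coreutils) or md5 (BSD/macOS) is installed. The function names md5_of and check_md5 are hypothetical.

```shell
# Hypothetical helper: print the MD5 digest of a file using whichever tool exists.
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'     # GNU coreutils (typical on Linux)
  else
    md5 -q "$1"                        # BSD/macOS; -q prints only the digest
  fi
}

# Hypothetical helper: compare a file's digest with an expected value.
# Returns 0 on a match and 1 on a mismatch.
check_md5() {
  actual=$(md5_of "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (got $actual)" >&2
    return 1
  fi
}
```

For example: check_md5 data/GCF_000027325.1_ASM2732v1_genomic.fna.gz 456f609c24abab75d4d593b5c92bd631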
The shell script scripts/run.sh automates every step: creating subdirectories, downloading the data files, and running two R scripts: scripts/my_analysis.R, which calculates the length and base composition of a DNA sequence, and scripts/my_sliding_window.R, which generates sliding-window plots of GC content and GC skew (analysis/Rplots_sliding_window.pdf).
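To illustrate what a sliding-window calculation involves, here is a minimal sketch in shell and awk. It is not the code in scripts/my_sliding_window.R. It assumes a single-record FASTA, takes GC skew as (G - C) / (G + C), and divides GC content by the window size, so ambiguous bases such as N count toward the denominator. The function name sliding_gc and its two parameters (window size and step) are hypothetical.

```shell
# Hypothetical helper: GC content and GC skew in windows along a sequence.
# Usage: sliding_gc WINDOW STEP < genome.fna
sliding_gc() {
  awk -v w="$1" -v s="$2" '
    /^>/ { next }                      # skip FASTA header lines
    { seq = seq toupper($0) }          # join sequence lines into one string
    END {
      n = length(seq)
      print "start\tgc_content\tgc_skew"
      # only full-length windows are reported; a trailing partial window is dropped
      for (i = 1; i + w - 1 <= n; i += s) {
        win = substr(seq, i, w)
        g = gsub(/G/, "", win)         # number of G in the window
        c = gsub(/C/, "", win)         # number of C in the window
        skew = (g + c > 0) ? (g - c) / (g + c) : 0
        printf "%d\t%.4f\t%.4f\n", i, (g + c) / w, skew
      }
    }
  '
}
```

For example, gzip -dc data/GCF_000027325.1_ASM2732v1_genomic.fna.gz | sliding_gc 10000 1000 writes a tab-separated table that can be plotted in R or any other tool.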
Run the driver script from the project's main directory mgen/ with:
bash scripts/run.sh > log.txt 2>&1 &
cd data/
ls -lh
# Working with Gzipped Compressed Files using gzcat, zgrep, and zless
gzcat *.fna.gz | head -n 2
zgrep -c '^>' *.gz
zless *.faa.gz
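gzcat is commonly available on macOS but may be missing on Linux systems. As a portable alternative, the following minimal sketch counts FASTA records with gzip -dc, which decompresses to standard output on both platforms. The function name count_fasta_records is hypothetical.

```shell
# Hypothetical helper: count FASTA records in a gzip-compressed file.
# gzip -dc decompresses to stdout without modifying the input file.
count_fasta_records() {
  gzip -dc "$1" | grep -c '^>'         # each record starts with a ">" header line
}
```

For example: count_fasta_records data/GCF_000027325.1_ASM2732v1_protein.faa.gz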
- 2016-01-22
  - Downloaded data from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000027325.1_ASM2732v1/
- 2016-01-11
  - Created the R script scripts/my_sliding_window.R
- 2015-12-04
  - Downloaded data from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000027325.1_ASM2732v1/
- 2015-11-15
  - Created the project directory using mkdir -p mgen/{data,scripts,analysis}
  - Downloaded data from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
  - Created the shell script scripts/run.sh and the R script scripts/my_analysis.R
- Genomes Download FAQ
  - 12. How can I download RefSeq data for all complete bacterial genomes?
  - 9. How can I find the sequence and annotation of my genome of interest? Using the assembly summary files for bacteria or a species under genbank or refseq
    - ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
    - ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
    - ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Mycoplasma_genitalium/assembly_summary.txt
    - ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Mycoplasma_genitalium/assembly_summary.txt
- Genomic nucleotide composition
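The assembly summary files listed above are tab-delimited, so the FTP directory of a genome of interest can be extracted with awk. The following is a minimal sketch. It assumes the column layout in which field 11 is version_status, field 12 is assembly_level, and field 20 is ftp_path, which should be checked against the header of the downloaded file. The function name complete_genome_paths is hypothetical.

```shell
# Hypothetical helper: list FTP directories of complete, latest assemblies.
# Reads an assembly_summary.txt file from stdin.
# Assumed columns: 11 = version_status, 12 = assembly_level, 20 = ftp_path.
complete_genome_paths() {
  awk -F '\t' '
    /^#/ { next }                      # skip comment and header lines
    $11 == "latest" && $12 == "Complete Genome" { print $20 }
  '
}
```

For example, once a copy of the file has been saved locally: complete_genome_paths < assembly_summary.txt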