Élaboration d'une base de donnée détaillée des îlots génomiques à partir de bases de données existantes.
- IslandViewer
- PAI-DB
- ICEberg
- Islander
# Colonnes de la base de données
ID ACC_NB ORGANISME START STOP SEQUENCE INSERTION* REFERENCES DETECTION*
Inputs :
- page_x.html 467 pages html de la base de données
# ICEberg
python3 parsing.py -i input/iceberg/ -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db iceberg -bp 50
output : table.iceberg.tsv
Islands : 468
Uniques : 354
NA : 258 (39 ont des souches inconnus ('Unknown strain'))
Inputs:
- island_sequence.06.03.2015.fasta
- islander.08.03.2015.sql
# Islander
python3 parsing.py -i input/islander.08.03.2015.sql -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db islander -bp 50
output :table.islander.08.03.2015.sql.tsv
Islands : 4 208
Uniques : 4 208
NA : 1
Inputs :
- all_gis_islander_iv4.txt
- all_gis_islandpick_iv4.txt
- all_gis_islandpath_dimob_iv4.txt
- all_gis_sigi_hmm_iv4.txt
- all_gis_islandviewer_iv4.txt
- Supp1: positive dataset
- Supp2: negative dataset
- Supp3: 80 GIs ranging in size from 3 to 133 kb
- Additional_file_1.xls
- Additional_file_2.xls
- Additional_file_3.xls
- Additional_file_8.xls
# IslandViewer 4 islander
python3 parsing.py -i input/all_gis_islander_iv4.txt -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db iv -bp 50
output : table.all_gis_islander_iv4.txt.tsv
Islands : 3 170
Uniques : 3 170
NA : 0
# IslandViewer 4 islandpath dimob
python3 parsing.py -i input/all_gis_islandpath_dimob_iv4.txt -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db iv -bp 50
output : table.all_gis_islandpath_dimob_iv4.txt.tsv
Islands : 51 425
Uniques : 51 425
NA : 0
# IslandViewer 4 islandpick
python3 parsing.py -i input/all_gis_islandpick_iv4.txt -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db iv -bp 50
output :table.all_gis_islandpick_iv4.txt.tsv
Islands : 43 469 Uniques : 43 469 NA : 0
# IslandViewer 4 islandviewer
python3 parsing.py -i input/all_gis_islandviewer_iv4.txt -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db iv -bp 50
output :table.all_gis_islandviewer_iv4.txt.tsv
Islands :110 914
Uniques :110 914
NA : 0
# IslandViewer 4 sigi hmm
python3 parsing.py -i input/all_gis_sigi_hmm_iv4.txt -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db iv -bp 50
output :table.all_gis_sigi_hmm_iv4.txt.tsv
Islands : 63 917
Uniques : 63 917
NA : 0
Inputs :
- PAIDB_REI.html : REsistance Islands
- PAIDB_PAI.html : PAthogenicity Islands
# PAIDB PAI
python3 parsing.py -i input/PAIDB_PAI.html -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db paidb -bp 50
output :table.PAIDB_PAI.html.tsv
Islands : 986
Uniques : 986
NA : 904
# PAIDB REI
python3 parsing.py -i input/PAIDB_REI.html -pos input/SupTable1.xlsx -neg input/SupTable2.xlsx -db paidb -bp 50
output :table.PAIDB_REI.html.tsv
Islands : 108
Uniques : 108
NA : 95
Temps de calcul : voir log_parsing_1.0
# Nombre d'island
wc -l table.all_gis_islander_iv4.txt.tsv
# Nombre d'island uniques
sort -k3 table.all_gis_islander_iv4.txt.tsv | sort -k1 |uniq | wc -l
# Island avec numéro d'accession et positions
awk '$1=="NA" || $3=="NA" || $4=="NA"' table.all_gis_islander_iv4.txt.tsv | wc -l
mv table.all_gis_islander_iv4.txt.tsv Tables/.
# L'option "cat" permet donner 9 arguments et lance la fonction concatenate
python3 parsing.py -cat Tables/*
output: database_1.0.tsv
Islands : 181 699
NA : 9 221