We introduce BioKlustering, a user-friendly, open-source, and publicly available web app for unsupervised and semi-supervised learning, specialized for cases in which sequence alignment and/or experimental phenotyping of all classes are not possible.
Among its main advantages, BioKlustering
- allows for maximally unbalanced settings of partially observed labels including cases when only one class is observed, which is currently prohibited in most semi-supervised methods,
- takes unaligned sequences as input and thus allows learning for widely diverse sequences (impossible to align), such as viruses and bacteria,
- is easy to use for anyone with little or no programming expertise, and
- works well with small sample sizes.
BioKlustering is browser-based (preferably Google Chrome), and thus, no installation is needed. Users simply need to click on the following link: https://bioklustering.wid.wisc.edu/.
More details are available in the documentation: DOCS.md.
BioKlustering is an open source project, and the source code is available in this repository with the following structure:
- BioKlustering-Website contains all the code for the website and machine-learning models (see the readme.md file inside this folder)
- manuscript contains the reproducible analysis and sample dataset used in the published manuscript (in review)
Users with strong programming skills might like to modify the existing code and run a version of the website locally.
- Clone this repository by typing the following line in the terminal
git clone https://github.com/solislemuslab/bioklustering
- Get inside the bioklustering/BioKlustering-Website folder, then create and activate a Python virtual environment:
cd bioklustering/BioKlustering-Website
python3 -m venv virtual-env
source virtual-env/bin/activate
Note that Mac users might need the whole path to python3: /usr/local/bin/python3.
- Install the necessary packages by typing the following line in the terminal
pip3 install -r requirements.txt
Note that these requirements assume you are using Python 3.8.13. You can manage different Python versions with pyenv.
The list of packages can be found in the requirements.txt file and is reproduced below:
numpy~=1.22
pandas~=2.0.2
bio~=1.5.9
scikit-learn~=1.1.1
plotly~=5.4.0
Django~=3.1.2
django-plotly-dash~=1.4.2
channels~=2.4.0
channels-redis~=3.1.0
django-crispy-forms~=1.9.2
django-redis~=4.12.1
daphne~=2.5.0
redis~=3.5.3
psutil~=5.9.2
kaleido~=0.2.0
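The `~=` operator used in the list above is pip's compatible-release specifier (PEP 440): it accepts the stated version and later releases that only change the last stated component. The sketch below, using only the standard library, shows roughly which range each entry allows. The helper name `compatible_range` is hypothetical and is not part of BioKlustering or pip, and the sketch ignores pre-release and post-release details.

```python
# Illustrative helper (an assumption for this example, not BioKlustering or pip code):
# expands a "name~=X.Y[.Z]" requirement into the approximate range it allows.

def compatible_range(requirement):
    """Return (name, lowest allowed version, first excluded version)."""
    name, version = requirement.split("~=")
    parts = [int(p) for p in version.strip().split(".")]
    if len(parts) < 2:
        raise ValueError("~= needs at least two version components")
    # Drop the last component and bump the one before it to get the upper bound.
    upper = parts[:-1]
    upper[-1] += 1
    return (
        name.strip(),
        ".".join(str(p) for p in parts),
        ".".join(str(p) for p in upper),
    )


if __name__ == "__main__":
    for line in ["numpy~=1.22", "pandas~=2.0.2", "Django~=3.1.2"]:
        name, low, high = compatible_range(line)
        print(f"{name}: >= {low}, < {high}")
```

For example, `numpy~=1.22` allows any 1.x release from 1.22 onward, while `pandas~=2.0.2` only allows 2.0.x releases from 2.0.2 onward.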
- You might also need to install plotly-orca, which is used for writing and saving static plotly images locally. To install it with conda, you can use the following command (or see this link for other alternatives).
conda install -c plotly plotly-orca==1.2.1 psutil requests
To install conda, you can follow the instructions in this link. You might need to add conda to your PATH if it is not already there.
- Run the website with
python3 manage.py makemigrations
python3 manage.py migrate
python3 manage.py runserver
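If the steps above fail, it can help to confirm the environment first. The following is a minimal sketch of such a check, not a script shipped with BioKlustering: the function names are hypothetical, the expected version (3.8) is taken from the note on requirements.txt, and conda is only relevant if you install plotly-orca that way.

```python
import shutil
import sys

# Python major.minor version that requirements.txt assumes (3.8.13 in this README).
EXPECTED_PYTHON = (3, 8)


def python_matches(version_info=None, expected=EXPECTED_PYTHON):
    """True if the interpreter's major.minor version equals the expected one."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == tuple(expected)


def in_virtualenv(prefix=None, base_prefix=None):
    """True if running inside a venv, where sys.prefix differs from sys.base_prefix."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


def missing_tools(tools=("conda",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python 3.8 interpreter:", python_matches())
    print("Inside a virtual environment:", in_virtualenv())
    print("Tools missing from PATH:", missing_tools())
```

Run it with the same `python3` you use for `manage.py`, after activating the virtual environment, so that the check reflects the interpreter that will serve the website.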
Notes:
- Although the web app supports all browsers, we recommend using Google Chrome to render it because other browsers might differ in interface and functionality.
- Sometimes when running python3 manage.py makemigrations, you might get the following warning message:
The dash_core_components package is deprecated. Please replace
`import dash_core_components as dcc` with `from dash import dcc`
The dash_html_components package is deprecated. Please replace
`import dash_html_components as html` with `from dash import html`
You are trying to add the field 'create_date' with 'auto_now_add=True' to fileinfo without a default; the database needs something to populate existing rows.
1) Provide a one-off default now (will be set on all existing rows)
2) Quit, and let me add a default in models.py
Select an option:
If this happens, select option 1 and then press 'Enter' after the message:
Please enter the default value now, as valid Python
You can accept the default 'timezone.now' by pressing 'Enter' or you can provide another value.
The datetime and django.utils.timezone modules are available, so you can do e.g. timezone.now
Type 'exit' to exit this prompt
[default: timezone.now] >>>
- Make sure you are in a virtual environment.
source virtual-env/bin/activate
- Install selenium
pip3 install selenium
- Download webdriver and move it into the git root directory 'bioklustering/'
- Run the following command
python3 manage.py test
Users interested in expanding the functionality of BioKlustering are welcome to do so. See CONTRIBUTING.md for details on how to contribute.
BioKlustering is licensed under the MIT licence. © SolisLemus lab projects (2020)
If you use the BioKlustering website in your work, we ask that you cite the following paper:
@ARTICLE{Ozminkowski2022-bw,
title = "{BioKlustering}: a web app for semi-supervised learning of
maximally imbalanced genomic data",
author = "Ozminkowski, Samuel and Wu, Yuke and Yang, Liule and Xu,
Zhiwen and Selberg, Luke and Huang, Chunrong and
Solis-Lemus, Claudia",
month = sep,
year = 2022,
archivePrefix = "arXiv",
primaryClass = "q-bio.GN",
eprint = "2209.11730"
}
- More details are available in the documentation: DOCS.md.
- Issue reports are encouraged through the GitHub issue tracker.
- Feedback is always welcome via the following Google form.