This repository uses InstaDeep's Nucleotide Transformer to generate positional embeddings for an MPRA (Massively Parallel Reporter Assay) 3' sequence dataset. The goal is to explore the correlation between 3' sequences and gene expression.
- Nucleotide Transformer: GitHub repository (https://github.com/instadeepai/nucleotide-transformer)
- Dataset Source Paper: PLOS Genetics Article
Note: Conda environments and Dockerfiles are currently available for use :)
To set up a quick conda environment for this project, follow these steps:
1. Clone the Repository:
git clone https://github.com/zbates1/zb-deepmind-terminator.git ./zb-deepmind-terminator && cd ./zb-deepmind-terminator
2. Create the conda environment:
conda create -p ./envs/zb-terminator python=3.9
3. Activate the environment:
conda activate ./envs/zb-terminator
4. Install SciPy with conda:
conda install scipy
5. Clone the Nucleotide Transformer Repo
git clone https://github.com/instadeepai/nucleotide-transformer.git ./nucleotide-transformer
6. Run the Nucleotide Transformer repo's own setup.py to install its dependencies:
python3 ./nucleotide-transformer/setup.py install
7. Install the nucleotide-transformer package (version 0.0.1):
cd nucleotide-transformer && pip install . && cd ..
8. Install this project's dependencies:
pip install -r requirements.txt
9. Install CUDA-enabled JAX
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
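After step 9 it can be worth checking that JAX actually sees the GPU before running the pipeline. The snippet below is a minimal sketch, not part of this repository: describe_jax_backend is a hypothetical helper name, and it relies only on jax.devices() and the platform attribute of each device.

```python
def describe_jax_backend():
    """Return a one-line summary of the devices JAX can see."""
    try:
        import jax
    except ImportError:
        # JAX is missing, so steps 8 and 9 probably did not complete.
        return "jax is not installed in this environment"
    devices = jax.devices()
    # Each device reports its platform, for example "cpu" or "gpu".
    platforms = sorted({device.platform for device in devices})
    return f"jax {jax.__version__}: {len(devices)} device(s) on {', '.join(platforms)}"


if __name__ == "__main__":
    print(describe_jax_backend())
```

If the summary lists only cpu, the CUDA-enabled wheel was most likely not picked up, and inference will still run but more slowly.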
Note: The Dockerfile currently supports only the JAX implementation. I used the following commands to build and run the container:
1. Build the image
docker build -t jax_test -f ./Dockerfile.jax .
2. Run the container
docker run -it jax_test:latest /bin/bash -c "conda init bash && conda activate /task/envs/zb-terminator && python3 run_inference.py"
You can now use my pipeline to generate embeddings for the Shalem dataset (provided) and run the rest of the analysis.
Note: To run the pipeline on your own dataset:
1. Format it as a text or CSV file with two columns: 'sequences' and 'gene expression (tpm)'
2. Pass --input_filename /path/to/your/ds/in/./data
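Before passing your own file to the pipeline, a quick format check can catch problems early. The following is a minimal sketch and not part of this repository: validate_dataset is a hypothetical helper, and it assumes a comma-delimited file with a header row, sequences written with the letters A, C, G, T and N, and numeric expression values. Adjust those assumptions to match your data.

```python
import csv

# Column names taken from the dataset description above.
REQUIRED_COLUMNS = ("sequences", "gene expression (tpm)")
# Assumption: sequences are DNA, with N allowed for unknown bases.
VALID_BASES = set("ACGTN")


def validate_dataset(path):
    """Return a list of problems found in the file; an empty list means none."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            return [f"missing column(s): {missing}"]
        # The header is line 1, so the first data row is line 2.
        for line_number, row in enumerate(reader, start=2):
            sequence = (row["sequences"] or "").strip().upper()
            if not sequence or set(sequence) - VALID_BASES:
                problems.append(f"line {line_number}: unexpected sequence {sequence!r}")
            try:
                float(row["gene expression (tpm)"])
            except (TypeError, ValueError):
                problems.append(f"line {line_number}: non-numeric expression value")
    return problems
```

Calling validate_dataset("my_dataset.csv") (a placeholder filename) returns one message per problem row, so an empty list suggests the file matches the expected two-column layout.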