AI2 Dolma Most Frequent Words: a derivative dataset of the most common words

Simple Streamlit app (and blog post) showing word counts for AI2's Dolma dataset.

View the app here: https://dolma-count-levon003.streamlit.app/

Local development setup

First-time setup

This repository uses Conda to manage two dependencies: Python and Poetry. (This Stack Overflow post provides more context on using Conda and Poetry together.)

Install conda or miniconda. Then create the needed environment, called dolma-count:

conda env create -f conda-environment.yml

Note that the environment file can't be called environment.yml because of how Streamlit resolves dependencies.

Python development

Activate the conda environment: conda activate dolma-count
Use make install to install all needed dependencies (including the pre-commit hooks).

Ideally, the Makefile would activate the needed conda environment, but I don't actually know enough make to add that.
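One possible workaround, sketched here and not tested against this repository's Makefile, is conda run, which executes a command inside a named environment without activating it in the current shell. The helper name install_in_env is hypothetical; the environment name and the make install target come from the steps above.

```shell
# Hypothetical helper (not part of this repository): run `make install`
# inside the dolma-count environment without activating it first.
ENV_NAME="dolma-count"

install_in_env() {
    # Fail early with a readable message if conda is not on the PATH.
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found; install conda or miniconda first" >&2
        return 1
    fi
    # `conda run -n <env> <command>` runs the command in the named environment.
    conda run -n "$ENV_NAME" make install
}
```

After sourcing the snippet, calling install_in_env from the repository root should behave like activating the environment and running make install by hand, assuming the environment has already been created.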

Other useful commands

poetry run <command> - Run the given command, e.g. poetry run pytest.
source $(poetry env info --path)/bin/activate - An alternative to poetry shell that's less buggy in conda environments.
poetry add <package> - Add the given package as a dependency. Use flag -G dev to add it as a development dependency.
conda remove -n dolma-count --all - Tear it all down, so first-time setup can be repeated.
