User-friendly implementation and extension of common data streaming applications using Apache Kafka, written in Python
Available on GitHub at https://github.com/openmsi/openmsipython
Developed for Open MSI (NSF DMREF award #1921959)
Programs use the Python implementation of the Apache Kafka API, and are designed to run on Windows machines connected to laboratory instruments. The only base requirements are Python >=3.7, git, and pip.
The quickest way to get started is to use Miniconda3. Miniconda3 installers can be downloaded here, and installation instructions can be found here.
With Miniconda installed, next create and switch to a new environment for Open MSI. In a terminal window (or Anaconda Prompt in admin mode on Windows) type:
conda create -n openmsi python=3
conda activate openmsi
This environment needs a special variable set to allow the Kafka Python code to find its dependencies on Windows (see here for more details), so after you've done the above, type the following commands to set the variable and then refresh the environment:
conda env config vars set CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1
conda deactivate #this command will give a warning, that's normal
conda activate openmsi
You'll need to use that second "activate" command every time you open a Terminal window or Anaconda Prompt to switch to the openmsi environment.
Miniconda installs pip, and if you need to install Git you can do so with
conda install -c anaconda git
(or use the instructions on the website here.)
While in the openmsi environment, navigate to wherever you'd like to store this code, and type:
git clone https://github.com/openmsi/openmsipython.git
cd openmsipython
pip install .
cd ..
This will give you access to all of the console commands discussed below, as well as any of the other modules in the openmsipython package. If you'd like to be able to make changes to the openmsipython code without reinstalling, you can include the --editable flag in the pip install command.
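For example, the editable version of the install sequence above would be:
git clone https://github.com/openmsi/openmsipython.git
cd openmsipython
pip install --editable .
cd ..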
If you like, you can check that the installation worked with:
python
>>> import openmsipython
And if that line runs without any problems then the package was installed correctly.
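You can also ask pip itself for the installed package's metadata:
pip show openmsipython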
Installing the code provides access to several programs that share a basic scheme for user interaction. These programs all have the following attributes:
- Their names correspond to the names of Python classes within the code base
- They can be run from the command line by typing their names
  - i.e. they are provided as "console script entry points"; see the sketch after this list
  - check the relevant section of the setup.py file for a list of all that are available
- They provide helpful logging output when run, and the most relevant of those logging messages are written to files called "[ClassName].log" in the directories relevant to the running programs
- They can be installed as Windows Services instead of being run from the bare command line
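For orientation, a "console script entry point" is declared in setup.py roughly as in the sketch below; the command and module names here are illustrative placeholders, not the package's actual list (setup.py itself is the authoritative source):

# sketch of a console-script declaration in setup.py;
# "MyProgram" and its module path are hypothetical examples
from setuptools import setup, find_packages

setup(
    name="openmsipython",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "MyProgram=openmsipython.my_module:main",
        ],
    },
)

With a declaration like this in place, pip install . puts a MyProgram executable on the PATH that calls the main function in openmsipython/my_module.py.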
The documentation for specific programs can be found in a few locations within the repo.
The readme file here explains the programs used to upload entire arbitrary files by breaking them into chunks and producing those chunks as messages to a Kafka topic, or to download entire files by reading those messages back from the topic and writing the data to disk.
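To make the chunking scheme concrete, here is a minimal sketch of the upload direction, written against the confluent_kafka client; it is illustrative only (the broker address, topic name, chunk size, and keying scheme are all made up here), not the package's actual implementation:

from confluent_kafka import Producer

CHUNK_SIZE = 16384  # bytes per message (illustrative value)
producer = Producer({"bootstrap.servers": "localhost:9092"})  # example broker
with open("my_data_file.bin", "rb") as fp:  # hypothetical file
    index = 0
    while True:
        chunk = fp.read(CHUNK_SIZE)
        if not chunk:
            break
        # the key records which file, and which piece of it, each message holds,
        # so a consumer can reassemble the chunks in order on the other side
        producer.produce("my_topic", key=f"my_data_file.bin_chunk_{index}", value=chunk)
        index += 1
producer.flush()  # block until all queued messages have been delivered

The download direction is the mirror image: a consumer reads the messages back, groups them by key, and writes the reassembled bytes to disk.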
The readme file here explains programs used to upload specific portions of data in Lecroy Oscilloscope files and produce sheets of plots for PDV spall or velocity analyses.
The readme file here gives more details about options for the configuration files used to define which Kafka cluster(s) the programs interact with, and how data are produced to/consumed from the topics within them.
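As a rough illustration of what such a configuration controls, connecting to a secured cluster comes down to a handful of key-value options like the following (a hypothetical example using standard librdkafka option names; see that readme for the actual file format and the full set of recognized options):

# hypothetical connection options for a SASL-secured cluster
cluster_config = {
    "bootstrap.servers": "my-cluster.example.com:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "MY_USERNAME",
    "sasl.password": "MY_PASSWORD",
}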
The readme file here details procedures for installing any available command-line program as a Windows Service and working with it.
The readme file here describes the automatic testing and CI/CD setup for the project, including how to run tests interactively and add additional tests.
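Assuming the tests follow the standard unittest discovery conventions (an assumption on our part; that readme has the authoritative instructions), running the suite interactively from the repository root would look something like:
python -m unittest discover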
The following items are currently planned to be implemented ASAP:
- Adding a safer and more graceful shutdown when stopping Services so that no external lag time needs to be considered
- Allowing watching directories where large files are in the process of being created/saved instead of just directories where fully-created files are being added
- Implementing other data types and serialization schemas, likely using Avro
- Create PyPI and conda installations. The PyPI method using twine is described here: https://github.com/bast/pypi-howto. Putting the package on conda-forge is a heavier lift; we need to decide whether it's worth it, and it probably isn't for such an immature package.
- Re-implement PDV plots from a submodule

Open questions include:
- What are best practices for topic creation and naming? Should we have a new topic for each student, for each instrument, for each “kind” of data, ...?
- Would it be possible to have an environment and dependency definition? YAML??
- How do I know (and trust!) my data made it and is safe?
- What if I forget and write my data to some “wrong” place? What if I write my data to the directory twice?
- Should I clear my data out of the streaming directory once it’s been produced to Kafka?