runrex: A Python repository from Kaiser Permanente Washington Health Research Institute

Runrex

Library to aid in organizing, running, and debugging regular expressions against large bodies of text.

About the Project
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
Contributing
License
Contact
Acknowledgements

About the Project

The goal of this library is to simplify the deployment of regular expressions on large bodies of text, in a variety of input formats.

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

Python 3.8+
runrex package: https://github.com/kpwhri/runrex

Installation

Clone the repo

git clone https://github.com/kpwhri/runrex.git

Install requirements (requirements-dev is for test packages)
```
pip install -r requirements.txt -r requirements-dev.txt
```
If you wish to read text from SAS or SQL, you will need to install additional requirements. These additional requirements files may be of use:
- ODBC-connection: requirements-db.txt
- Postgres: requirements-psql.txt
- SAS: requirements-sas.txt
Run tests.
```
# On Windows, use: set PYTHONPATH=src
export PYTHONPATH=src
pytest tests
```

Usage

Example Implementations

Build Customized Algorithm

Create 4 files:
- patterns.py: defines regular expressions of interest
  - See examples/example_patterns.py for some examples
- test_patterns.py: tests for those regular expressions
  - Why? Make sure the patterns do what you think they do
- algorithm.py: defines algorithm (how to use regular expressions); returns a Result
  - See examples/example_algorithm.py for guidance
- config.(py|json|yaml): various configurations defined in schema.py
  - See example in examples/example_config.py for basic config
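As a rough illustration of how the first three files relate, here is a minimal sketch using only the standard library `re` module. It does not use runrex's own classes (see the files in examples/ for those). The names `COUGH`, `NEGATED_COUGH`, and `classify` are hypothetical and exist only for this example.

```python
import re

# Hypothetical pattern definitions (the kind of thing patterns.py might hold).
# Plain stdlib `re` is used here; runrex's own helpers are not shown.
COUGH = re.compile(r'\bcough(?:s|ed|ing)?\b', re.IGNORECASE)
NEGATED_COUGH = re.compile(
    r'\b(?:no|denies|without)\s+cough(?:s|ed|ing)?\b', re.IGNORECASE
)


def classify(sentence):
    """Toy algorithm step: label a sentence 'negated', 'positive', or None."""
    # Check the more specific (negated) pattern first.
    if NEGATED_COUGH.search(sentence):
        return 'negated'
    if COUGH.search(sentence):
        return 'positive'
    return None


# The kind of check test_patterns.py might contain (normally run with pytest).
def test_patterns():
    assert classify('Patient reports coughing at night.') == 'positive'
    assert classify('Denies cough or fever.') == 'negated'
    assert classify('No complaints today.') is None
```

Keeping the tests next to the patterns makes it cheap to confirm that a change to one expression has not silently changed what another one matches.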

Input Data

A variety of input formats is accepted, but you will need to specify at least a document_id and a document_text. The names of these fields are configurable.
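To make the id/text requirement concrete, the sketch below reads a CSV whose columns are named differently and maps them onto (document_id, document_text) pairs. It uses only the standard library and is not how runrex itself loads data. The column names `note_id` and `note_text` and the helper `read_documents` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical input; in practice this would be a file on disk or a table.
raw = io.StringIO(
    'note_id,note_text\n'
    '1,"No cough.\nReports fever."\n'
    '2,"Follow-up in two weeks."\n'
)


def read_documents(fh, id_col, text_col):
    """Yield (document_id, document_text) pairs from a CSV file handle."""
    for row in csv.DictReader(fh):
        yield row[id_col], row[text_col]


# The caller says which columns hold the id and the text.
docs = list(read_documents(raw, id_col='note_id', text_col='note_text'))
```

Passing the column names as arguments mirrors the idea that the field names are a matter of configuration, not something fixed in the data.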

Sentence Splitting

By default, the input document text is expected to have each sentence on a separate line. If a sentence splitting scheme is desired, it will need to be supplied to the application.
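If the source text is not already one sentence per line, it can be preprocessed before being handed to the application. The following is a deliberately naive sketch of such a step, using a regular expression boundary. The function name `one_sentence_per_line` is hypothetical, and the rule will split incorrectly after abbreviations such as "Dr."; a dedicated sentence splitter is likely a better choice for real clinical text.

```python
import re

# Naive boundary: sentence-ending punctuation, then whitespace,
# then a capital letter starting the next sentence.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def one_sentence_per_line(text):
    """Return text with each detected sentence on its own line."""
    sentences = [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
    return '\n'.join(sentences)


print(one_sentence_per_line(
    'Denies cough. Reports fever since Monday. Follow up in 2 weeks.'
))
```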

Schema/Examples

For more details, see the example config or consult the schema.

Output Format

The recommended output format is jsonl (one JSON object per line).
- The data can be extracted using Python:

import json
with open('output.jsonl') as fh:
    for line in fh:
        data = json.loads(line)  # data is dict

Output variables are configurable and can include:
- id: unique id for line
- name: document name
- algorithm: name of algorithm with finding
- value
- category: name of category (usually the pattern; multiple categories contribute to an algorithm)
- date
- extras
- matches: pattern matches
- text: captured text
- start: start index/offset of match
- end: end index/offset of match
Scripts to accomplish useful tasks with the output are included in the scripts directory.
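As one example of working with the output, the sketch below counts findings per (algorithm, category) pair in a jsonl stream. It assumes the `algorithm` and `category` variables were included in the output configuration. The sample records and the helper `count_findings` are made up for illustration and are not taken from the scripts directory.

```python
import io
import json
from collections import Counter


def count_findings(fh):
    """Count output records per (algorithm, category) pair."""
    counts = Counter()
    for line in fh:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        counts[(record.get('algorithm'), record.get('category'))] += 1
    return counts


# Hypothetical records standing in for an output.jsonl file.
sample = io.StringIO(
    '{"id": 1, "name": "doc1", "algorithm": "cough", "category": "NEGATED"}\n'
    '{"id": 2, "name": "doc2", "algorithm": "cough", "category": "POSITIVE"}\n'
    '{"id": 3, "name": "doc3", "algorithm": "cough", "category": "POSITIVE"}\n'
)
counts = count_findings(sample)
for (algorithm, category), n in sorted(counts.items()):
    print(algorithm, category, n)
```

For a real run, replace `sample` with an open file handle on the jsonl output.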

Versions

Uses semantic versioning (SemVer).

See https://github.com/kpwhri/runrex/releases.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License.

See LICENSE or https://kpwhri.mit-license.org for more information.

Contact

Please use the issue tracker.
