/REmatch

REmatch C++ library

Primary LanguageRoffMIT LicenseMIT

REmatch

Here you can find the main implementation of REmatch library in C++, and its bindings for both Python and JavaScript/WebAssembly. This version has been refactorized, tested, and developed for being ready for production.

Directory structure

  • /src: The C++ implementation.
  • /tests: Contains all automatic tests for the code.
  • /scripts: Contains useful scripts for setting up the project, including compilation and other tasks.

Build instructions

Setup

Dependencies:

  • Clang version 11 or newer
  • CMake version 3.23.2 or newer
  • Catch2 (required for tests only)

The setup of CMake, Catch2 and Clang on a clean ubuntu system can be done using the following script:

./scripts/setup/ubuntu/setup_cpp_to_execute_rematch.sh

Building

The target file is rematch.cpp, which should be located in the src/targets/main/ directory. To build, run:

./scripts/compile.sh

After building, the binary file will be located in the build/Release/bin folder. To try it simply run:

./build/Release/bin/rematch

If you want to use a debugger such as gdb, then you should add -DCMAKE_BUILD_TYPE=Debug in the first CMake command.

Examples

To use REmatch, you have two options. You can create a Query object through the method REMatch::compile and pass the regular expression as argument, or you can directly call the functions provided.

// Compile a regular expression using the compile method and find a match
REMatch::Query query = REMatch::compile("!x{aba}");
std::unique_ptr<REMatch::Match> query_match = query.findone("aabaa");

// Equivalent to calling findone directly
std::unique_ptr<REMatch::Match> direct_match = REMatch::findone("!x{aba}", "aabaa");

The Query provides the methods findone and finditer that evaluate a document and return the matches found. The findone method returns a pointer to the first encountered match, while finditer returns an iterator that allows you to access all matches. To retrieve all the matches at once, you can use the findall method. You can use the start and end methods to obtain the indices of the matched spans or span to get a string representation.

Retrieving an iterator for the pattern aba

#include <memory>
#include <string>
#include "library_interface/rematch.hpp"

int main() {
  std::string document = "abaababa";
  REMatch::Query query = REMatch::compile("!x{aba}");
  std::unique_ptr<REMatch::MatchIterator> iterator = query.finditer(document);
  std::unique_ptr<REMatch::Match> match = iterator->next();
  while (match != nullptr) {
    std::cout << "Span: [" << match->start("x") << ", " << match->end("x") << ">" << std::endl;
    match = iterator->next();
  }
  return 0;
}

Finding the first match for the pattern aba

#include <iostream>
#include <string>
#include "library_interface/rematch.hpp"

int main() {
  std::string document = "abaababa";
  REMatch::Query query = REMatch::compile("!x{aba}");
  std::unique_ptr<REMatch::Match> match = query.findone(document);
  std::cout << "Span: " << match->span("x") << std::endl;
  return 0;
}

Finding a vector with all matches for the pattern aba

#include <iostream>
#include <string>
#include "library_interface/rematch.hpp"

int main() {
  std::string document = "aba";
  std::string query = "!x{aba}";
  std::vector<REMatch::Match> matches = REMatch::library_interface::findall(query, document);
  for (REMatch::Match& match: matches) {
    std::cout << "Span: [" << match.start("x") << ", " << match.end("x") << ">" << std::endl;
  }
  return 0;
}

Testing

We are using Catch2 for unit testing.

To add more tests, create files of the form: tests/<module_name>/<class_tested>.cpp and add these files to the TEST_SOURCES of tests/CMakeLists.txt.

Automatic documentation

To build the automatic documentation two packages are needed: graphviz and doxygen. To build the documentation run doxygen DoxyFile.

Profiler

We are using the Tracy profiler to measure the time spent in specific code segments. To profile the code using the graphical user interface, follow these steps:

  • Set the ENABLE_PROFILING flag to ON in the CMakeLists.txt file.
  • Compile the code with the updated CMakeLists.
  • Navigate to build/Debug/_deps/tracy-src/profiler/build/unix and run make to generate the "Tracy-release" executable.
  • Execute the "Tracy-release" executable.
  • Run REmatch to initiate the profiling.

You should be able to view the results in the graphical interface.

Reference

This implementation is based on the paper REmatch: a novel regex engine for finding all matches by Cristian Riveros, Nicolás Van Sint Jan, and Domagoj Vrgoč.

Building bindings

JavaScript/WASM

To begin, follow the official installation tutorial for Emscripten available at https://emscripten.org/docs/getting_started/downloads.html

Once the installation is complete, execute the following command:

./scripts/compile_emscripten.sh

Python

To install the release version from PyPI, run:

pip install pyrematch

To build from the source code, clone this repository and run:

pip install .