SmartSearch

This project was developed as part of the course "Advanced Software Engineering" at the BHT. It is a command-line tool that searches for one or more search patterns in a raw text string, a text file, or a directory containing several .txt files. As the UML diagrams and the DDD files show, the vision of the project is a tool that searches for information in a smart way by using different search algorithms and NLP techniques.

This tool has been developed and tested using Python 3.9.

Installation

To install this command-line tool, execute the following commands in your terminal:

git clone https://github.com/bogdankostic/SmartSearch.git
cd SmartSearch
sudo -H ./install.sh

Usage

After installation, you can use this tool directly from the command line by executing the following command:

 search [-h] [-n] [-i] SEARCH_PATTERN [SEARCH_PATTERN ...] TEXT_INPUT

Positional Arguments:

  • SEARCH_PATTERN – Search pattern to search for in the provided text inputs
  • TEXT_INPUT – Raw text, text file, or directory containing .txt files

Optional Arguments:

  • -h / --help – Show help message explaining how to use this tool
  • -n / --naive – Use naive string matching algorithm instead of Boyer-Moore algorithm
  • -i / --case-insensitive – Perform case-insensitive search
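
The interface above maps naturally onto Python's argparse module. The following is a minimal sketch of such a parser, not the project's actual implementation: the helper name build_parser and the attribute names search_patterns, text_input, naive, and case_insensitive are assumptions; only the flags and the positional names are taken from the usage line. argparse adds -h / --help automatically.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented `search` interface (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="search")
    parser.add_argument(
        "-n", "--naive", action="store_true",
        help="Use naive string matching algorithm instead of Boyer-Moore algorithm",
    )
    parser.add_argument(
        "-i", "--case-insensitive", action="store_true", dest="case_insensitive",
        help="Perform case-insensitive search",
    )
    # One or more patterns, followed by exactly one text input.
    parser.add_argument(
        "search_patterns", metavar="SEARCH_PATTERN", nargs="+",
        help="Search pattern to search for in the provided text inputs",
    )
    parser.add_argument(
        "text_input", metavar="TEXT_INPUT",
        help="Raw text, text file, or directory containing .txt files",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "software", "engineering", "notes.txt"])
print(args.search_patterns, args.text_input, args.case_insensitive, args.naive)
```

Because TEXT_INPUT is a single positional after a variable-length one, argparse assigns the last positional value to it and all preceding values to the pattern list.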

Output

Each match is printed on its own line in one of the following tab-separated formats:

Format for single text files and raw text input

SEARCH_PATTERN \t POSITION_IN_TEXT/FILE

Format for directories containing .txt files

FILE_NAME \t SEARCH_PATTERN \t POSITION_IN_FILE
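
To make the two formats concrete, here is a minimal sketch of naive string matching (the algorithm selected by -n) together with a formatter for the output lines. It is an illustration under assumptions, not the project's code: the function names naive_search and format_match are hypothetical, and positions are assumed to be 0-based character offsets, which the description above does not specify.

```python
from typing import List, Optional


def naive_search(pattern: str, text: str, case_insensitive: bool = False) -> List[int]:
    """Return the start offsets of all (possibly overlapping) occurrences of pattern."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    if case_insensitive:
        # Simplification: lower() can change the length of a few non-ASCII
        # strings, which would shift the reported offsets.
        pattern, text = pattern.lower(), text.lower()
    positions = []
    # Try every possible alignment of the pattern against the text.
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            positions.append(start)
    return positions


def format_match(pattern: str, position: int, file_name: Optional[str] = None) -> str:
    """Format one match; the file name is prepended only for directory searches."""
    fields = [pattern, str(position)]
    if file_name is not None:
        fields.insert(0, file_name)
    return "\t".join(fields)


for position in naive_search("ana", "banana"):
    print(format_match("ana", position))
```

The naive approach compares the pattern at every offset, so its worst-case running time grows with the product of pattern length and text length; Boyer-Moore, the tool's default, can skip alignments and is usually faster on long texts.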

Software Engineering

1) Git

Throughout the project, Git and GitHub were used as tools for version control. The commit history can be found here.

2) UML

The directory uml contains the following UML diagrams as images and PlantUML files:

  • Class Diagram
  • Sequence Diagram
  • Component Diagram

3) Domain Driven Design

The event storming, the core domain diagram, and the relationship mapping can be found in the domain_driven_design.pdf file.

4) Metrics

Code metrics are tracked on Coveralls (test coverage) and SonarCloud (code quality).

Metric badges:

Coverage Status, Quality Gate Status, Bugs, Code Smells, Duplicated Lines (%), Reliability Rating, Security Rating, Maintainability Rating, Vulnerabilities

5) Clean Code Development

Examples of clean code development principles used in the project:

  • Don't Repeat Yourself: The project uses functions and classes to avoid code duplication.
    Example: Input validation in BaseMatcher.
  • Usage of type hints to specify the types of function arguments and return values.
    Example
  • Usage of docstrings to document functions and classes.
    Example
  • Usage of meaningful variable and function names.
    Example
  • Short functions that do one thing.
    Example

My personal clean code development cheat sheet can be found in the clean_code_cheat_sheet.md file.

6 & 7) Build & Continuous Delivery

The project uses GitHub Actions for continuous integration and delivery. The workflow can be found here. The workflow is triggered on every push to the main branch and runs the tests, measures the test coverage, and uploads the coverage report to Coveralls.

8) Unit Tests

The unit tests for the project can be found in the test directory. The tests can be executed by running the following command:

pytest test/

9) IDE

Throughout the project, the PyCharm IDE was used. My favorite keyboard shortcuts are:

  • ⌘ Command + ⇧ Shift + F: Search in all files
  • ⌘ Command + ⇧ Shift + R: Replace in all files
  • ⌘ Command + /: Comment/uncomment code
  • ⌘ Command + K: Commit changes
  • ⌘ Command + ⇧ Shift + K: Push changes

10) Domain Specific Language

As the main use case for the SmartSearch application is to find information by executing search queries, the Domain Specific Language for the SmartSearch project could be inspired by SQL or a similar query language. An example of a query that would use most of the features of the DSL is:

SELECT 
    document
FROM 
    document_idx
WHERE 
    exact_search('software engineering')
    AND semantic_search('How to construct a DSL?', similarity = 0.8)
    AND meta.year >= 2021 

This query would return the documents that satisfy all three conditions: they contain the exact phrase 'software engineering', they are semantically similar to 'How to construct a DSL?' with a similarity of at least 0.8 (80%), and their meta field 'year' is greater than or equal to 2021.
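
The way the WHERE conditions combine can be sketched in Python as predicates joined with a logical AND. Everything in this sketch is hypothetical, since the DSL is only a proposal: the Document class, the select helper, and meta_at_least are invented for illustration, and token_overlap is a crude word-overlap placeholder standing in for a real semantic similarity model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical document with text and integer meta fields."""
    text: str
    meta: Dict[str, int] = field(default_factory=dict)


Predicate = Callable[[Document], bool]


def exact_search(phrase: str) -> Predicate:
    """Condition: the document contains the exact phrase."""
    return lambda doc: phrase in doc.text


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a placeholder, NOT semantic similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def semantic_search(query: str, similarity: float) -> Predicate:
    """Condition: similarity score reaches the threshold (placeholder scoring)."""
    return lambda doc: token_overlap(query, doc.text) >= similarity


def meta_at_least(key: str, minimum: int) -> Predicate:
    """Condition: meta field exists and is >= minimum (models `meta.year >= 2021`)."""
    def predicate(doc: Document) -> bool:
        value = doc.meta.get(key)
        return value is not None and value >= minimum
    return predicate


def select(documents: List[Document], *conditions: Predicate) -> List[Document]:
    """Keep the documents for which every condition holds (logical AND)."""
    return [doc for doc in documents if all(cond(doc) for cond in conditions)]
```

A call such as select(docs, exact_search('software engineering'), semantic_search('How to construct a DSL?', similarity=0.8), meta_at_least('year', 2021)) then mirrors the structure of the query, although with the placeholder scoring the 0.8 threshold is only met by near-identical wording.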

Using a DSL that is inspired by SQL has many benefits. SQL is a powerful language that already comes with many features that would be useful for the SmartSearch application, such as aggregation functions, sorting, and filtering. Furthermore, many developers are already familiar with SQL, so they would be able to use the DSL without much additional training. They would just need to learn the specific functions and features of the SmartSearch DSL.

11) Functional Programming

The project follows aspects of functional programming, for example:

  • Final data structures: data structures used are immutable, for example tuples (see here).
  • Side-effect free functions: functions are designed to be side-effect free, for example the search function in the NaiveMatcher class (see here).
  • Anonymous functions: lambda functions are used to define a defaultdict (see here).

The project itself had no need for higher-order functions, that is, functions that take other functions as parameters or return them, so I created the file functional_programming.py to demonstrate these aspects of functional programming.
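
For readers unfamiliar with these terms, the following sketch illustrates them in one place: immutable tuples, a side-effect free function, a lambda used to define a defaultdict, a function passed as a parameter, and a function used as a return value. It is not the content of functional_programming.py; all names (Match, make_prefix_filter, group_positions) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List, Tuple

# Immutable (pattern, position) pair.
Match = Tuple[str, int]


def make_prefix_filter(prefix: str) -> Callable[[Match], bool]:
    """Function as return value: build a predicate for patterns starting with prefix."""
    def has_prefix(match: Match) -> bool:
        return match[0].startswith(prefix)
    return has_prefix


def group_positions(
    matches: Iterable[Match], keep: Callable[[Match], bool]
) -> DefaultDict[str, List[int]]:
    """Function as parameter: group the positions of all matches accepted by keep.

    Side-effect free: the input is only read and a new mapping is returned.
    """
    grouped: DefaultDict[str, List[int]] = defaultdict(lambda: [])
    for pattern, position in filter(keep, matches):
        grouped[pattern].append(position)
    return grouped


matches = (("soft", 0), ("ware", 4), ("soft", 12))
grouped = group_positions(matches, make_prefix_filter("so"))
print(dict(grouped))
```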