Build status Build Status codecov

TT1 Context Graph - Lösung Jan Marker

Dependencies

The program is mainly based on the C++ standard library, except for:

The two dependencies are included in 3rdparty/ and are to no concern of a user.

The code uses features of C++17, including:

  • string_view
  • experimental/filesystem

It was tested with Visual Studio 15.0 2017 and GCC 7.2.1.

Usage

--help can be used to gain information about possible parameters:

./tt1_contextgraph --help
Text Technologie 1 - Aufgabe 5 - Lösung Jan Marker
Usage: ./tt1_contextgraph [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i TEXT                     Path to the vector file.
  -o TEXT                     Path to the output GML file.
  -s TEXT                     Search term.
  -t UINT                     Number of neighbors per node.
  -d UINT                     Depth of search

See ./00_EXAMPLE_CALLS.md for some sample invocations.

Code Organisation

clip.cpp

  • orchestrator
  • command line parsing
  • reading of *.vec file (supplied via -i)
  • search for the graph according to -t and -d
  • output of GML to file

database.{h,cpp}

  • storage of vectors read from *.vec file
  • implementation cosine similarity
  • search for t most similar to a given word

file_access.{h,cpp}

  • reading of *.vec file
  • the file is parsed according to its first line which contains vector count and column count
  • if that information is accurate, parsing of the file may fail

gml_generation.{h,cpp}

  • generation of GML structure out of search tree provided by search
  • the Graph type contains pointer to root of search tree and desired depth
  • the depth is required to also emit edges for leaf nodes that connect the leaf node and another existing node

helper_tst.{h,cpp}

helpers to generate test data in various unit tests

main.cpp

connects main with cli

search.{h,cpp}

  • use Database::most_similar(WordView, std::size_t) to build a tree of nodes
  • root is the search term
  • every node has neighbors, neighbors of one level of deepness have the same distance from the root node
  • it is implemented as a tree but models a graph: for every edge a node is duplicated in the tree

performance consideration

  • this has performance drawbacks when searching, because Database::most_similar has a long run time and is called multiple times for the same search term without cache
  • however, for d=2 the implementation is still sufficiently fast

vector.{h,cpp}

  • internal representation of a vector as a Word (string), a vector of doubles (columns) and euclideanNorm (cached)

tst_*.cpp

unit tests for the respective modules