competition-cpp-parallel-indexing

Parallel Indexing Cpp Competition July 2014 Kacper Kokoszka, Bartosz Szurgot

Info: README.md has to be viewed in raw version for proper display

Task: Implementation of collection with simple API:

  • search(word) returns list of files containing provided word
  • index(file) - places new file in collection
  • handle n searching (8 threads) and m (2 threads) indexing clients
  • application has to handle any number of files/words

Target:

  • fast access to collection
  • multithread safety
  • high responsiveness (no unnecessary blocking)

Tools:

  • C++ 14 (gcc-4.9; clang-3.4 - available on wrlinb29.emea.nsn-net.net in /opt/ directory, clang-2.8 for now)
  • large text files available at location: wrlinb29.emea.nsn-net.net -> path: /var/fpwork/comp-cpp-par/ (approximately 100 M unique words)
  • full C++14 (ok, ok - C++1y) is at your disposal
  • no external libraries are allowed

File format:

  • text files
  • one word per line
  • line endings format is not specified

Testing:

  • 60 seconds of pseudo random search and index queries
  • the best result out of 10 runs is taken
  • testing machine: wrlinb29.emea.nsn-net.net

Winning criteria: Sum of number of handled (finished) queries, performed by all clients + total number of files found, by queries. Important: each solution will be checked manually in order to avoid cheating.

Allowed modifications:

  • everything under Impl/ directory (both creating new files and modifying existing ones)
  • CMakeLists.txt: adding new files, extending compilation and linking flags
  • Impl/* contains example implementation (terrible performance, but working)
  • types from types.hpp can be changed (like FilesList)

How to:

  • run cmake: in build dir 'cmake <path_to_cmakelist.txt> [-G Ninja]' (you can try ninja instead of make)
  • use word_list_generator:
    • - seed for generator
    • <number_of_outputs> - amount of in_files for benchmark
    • <words_per_file> - amount of words in each in_file
    • <in_file_1> ... <in_file_N> - dictionaries to generate in_files from
    • example call: ./word_list_generator 60 20 1000 ../simple_test_data/pl_10k.txt (you have much bigger dictionaries at location provided in Tools section)
  • run benchmark:
    • - query file generated by word_list_generator named 'queries.txt'
    • - testing period
    • ... - in_files generated by word_list_generator named dict_[number].txt
    • example call: ./benchmark queries.txt 10 dict_*