- Title: Text Indexer
- Author: Guillem Nicolau Alomar Sitjes
- Initial release: June 16th, 2017
- Code version: 0.1
- Availability: Public
Index
A simple text indexer for fast text searches. It works like grep, but searches for several words at once and returns an ordered list of files, each with the percentage of the input words that appear in it.
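As a rough illustration of the ranking described above, here is a minimal sketch. It is not the project's actual code: the class `SearchSketch`, the methods `matchPercent` and `rank`, and the choice to break ties by file name are all assumptions made for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and structure are assumptions, not the project's API.
class SearchSketch {

    // Percentage (0 to 100) of the distinct query words that appear in fileWords.
    static double matchPercent(Set<String> fileWords, Collection<String> query) {
        Set<String> distinct = new HashSet<>(query);
        if (distinct.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String word : distinct) {
            if (fileWords.contains(word)) {
                hits++;
            }
        }
        return 100.0 * hits / distinct.size();
    }

    // Scores every file in a "file: words in file" index and sorts the result
    // by descending percentage; ties are broken by file name (an assumption).
    static List<Map.Entry<String, Double>> rank(Map<String, Set<String>> index,
                                                Collection<String> query) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            ranked.add(new AbstractMap.SimpleEntry<>(e.getKey(),
                    matchPercent(e.getValue(), query)));
        }
        ranked.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("a.txt", new HashSet<>(Arrays.asList("red", "green")));
        index.put("b.txt", new HashSet<>(Arrays.asList("red")));
        for (Map.Entry<String, Double> e : rank(index, Arrays.asList("red", "green"))) {
            System.out.println(e.getKey() + ": " + e.getValue() + "%");
        }
    }
}
```

Duplicate words in the query are counted once here; whether the real tool does the same is not stated in this README.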
- Java
Currently, the following indexing modes are available:
- Indexing by file (default mode)
  - Data is stored in a "file: words in file" shape
- Indexing by word
  - Data is stored in a "word: files where it appears" shape
- Indexing by file and indexing by word
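The two storage shapes listed above could look roughly like this in memory. This is only a sketch under assumptions: the class `IndexShapesSketch`, its methods, the lower-casing, and the split on non-word characters are illustrative choices, not taken from the project's source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: names and tokenization rules are assumptions.
class IndexShapesSketch {

    // Splits text into distinct lower-case words (assumed tokenization).
    static Set<String> tokenize(String text) {
        Set<String> words = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // "file: words in file" shape. docs maps a file name to its text content.
    static Map<String, Set<String>> indexByFile(Map<String, String> docs) {
        Map<String, Set<String>> byFile = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            byFile.put(doc.getKey(), tokenize(doc.getValue()));
        }
        return byFile;
    }

    // "word: files where it appears" shape (an inverted index).
    static Map<String, Set<String>> indexByWord(Map<String, String> docs) {
        Map<String, Set<String>> byWord = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : tokenize(doc.getValue())) {
                byWord.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return byWord;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "red green");
        docs.put("b.txt", "Red blue");
        System.out.println(indexByFile(docs));
        System.out.println(indexByWord(docs));
    }
}
```

With the by-file shape, a search has to visit every file entry. With the by-word shape, it only looks up the query words, but the index holds one entry per distinct word.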
The mode can be passed as a parameter when executing the application.
It has been tested with more than 1.25 GB of data. With a bigger dataset it might run out of memory. In that case, storing the data in a database would solve the memory problems (as long as the dataset fits on your laptop's disk; otherwise you could store it on a server). The 'IndexableDirectory' folder only contains some small files used for testing (I have not included the 1+ GB files that I used, because of GitHub storage limits).
A good parallel implementation would improve the performance.
More tests will be run to obtain performance data for comparing the indexing modes. I expect that the ratio WordsRange/NumFiles will be the main criterion for choosing between them. Some performance results have already been collected and can be seen in the PerformanceTests folder.