- Implemented a dynamic-programming-based word-break tokenizer for English and Japanese.
- Implemented an LSM-style, disk-based inverted index with a tiered merge policy and positional information, supporting insertion, keyword search, Boolean AND and OR search, and phrase search; further improved performance by compressing the index with gamma encoding.
- Used term frequency-inverse document frequency (TF-IDF) and the PageRank algorithm to enable search across all UCI ICS webpages.
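
To illustrate the word-break tokenizer from the first bullet, here is a minimal sketch of dynamic-programming word segmentation. It is not the project's actual code: the function name `word_break`, the toy vocabularies, and the "fewest tokens" scoring rule are all assumptions chosen for illustration, and a real tokenizer might score candidates by word frequency instead.

```python
from typing import List, Optional, Set


def word_break(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split text into the fewest vocabulary words (assumed scoring rule).

    Returns None when no full segmentation exists.
    """
    n = len(text)
    max_len = max((len(w) for w in vocab), default=0)

    # best[i]: fewest tokens that exactly cover text[:i], None if unreachable.
    best: List[Optional[int]] = [None] * (n + 1)
    # back[i]: start index of the last token in the best cover of text[:i].
    back: List[int] = [-1] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in vocab:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] is None:
        return None

    # Walk the back-pointers from the end to recover the tokens.
    tokens: List[str] = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    tokens.reverse()
    return tokens


# Hypothetical toy vocabularies, for illustration only.
english = word_break("icecream", {"ice", "cream", "icecream"})
japanese = word_break("東京都に住む", {"東京", "東京都", "都", "に", "住む"})
```

Because the table is indexed by character position rather than by whitespace, the same routine applies to unsegmented scripts such as Japanese; only the vocabulary changes.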