/IRWS-Homework

Repository including all the programming assignements given throughout the course of Information Retrieval and Web Search at the University of Mannheim during the Spring Term 2017

Primary LanguageJupyter NotebookApache License 2.0Apache-2.0

IRWS-Homework

Repository including all the programming assignements given throughout the course of Information Retrieval and Web Search at the University of Mannheim during the Spring Term 2017

Homework 1: Minimum Edit Distance

For Homework 1 the (Damerau-)Levensthein distance has been implemented both with dynamic programming and recursions.

There are several flags to customize what happens at runtime

  • original: left side of the comparison
  • compare: right side of the comparison
  • recursive: true if a recursive version shall be used
  • damerau: true if the Damerau-Levensthein Distance shall be used
  • weigths: true if custom weights for transposition/ replacement shall be used

For the implementation golang was used, there are a couple of tests to show sample output and benchmark tests to see the difference in runtime between recursive and dynamic programming versions.

Homework 2: Vector Space and Probabilistic Retrieval

  • Term weighting: Compute TF-IDF for a toy document collection with different definitions for TF and IDF and rank the documents given a query with cosine similarity.
  • Distance/similarity metrics: Ranking of documents given a query and 'raw Euclidean distance', 'normalized Euclidean distance' and 'cosine similarity'
  • Optimizing vector space model: Given a toy collection of TF-IDF vectors perform random projections to reduce computation costs. Do a pre-clustering of the documents using a given set of leader vectors. Finally retrieve top 5 documents for a query vector using the random projection vectors and leader vectors with clusters.
  • Classic probabilistic retrieval: Given a query rank documents with 'Binary independence model', 'Two-Poisson model', 'BM25'
  • Unigram Likelihood Model for Information Retrieval: For the programming assignment the tasks was to build a query likelihood model based on a unigram Likelihood Model for the 20 News corpus, which is able to take ad-hoc queries and rank the documents by relevance based on the unigram model. This part is implemented using Scala and the Spark Api.

Homework 3: Semantic Retrieval, Text Clustering, and IR Evaluation

  • Latent Semantic Indexing: Computing the similarity of latent vectors for a toy collection of documents and a query
  • Text Clustering: Using 'K-Means' and 'Single Pass Clustering' to cluster a toy collection of TF-IDF vectors
  • IR Evaluation: Calculating precision, recall, F1, P@k, R-precision, average precision and mean average precision for a toy collection of retrievals and their relevance rating
  • Semantic Retrieval with Word-Embeddings: Implementation of a simple retrieval engine based on aggregation of word embeddings using the pretrained 'GloVe' word embeddings and a random subsample of 500 documents from the '20 News Groups dataset'