TurkishText_Indexer

This project was created for educational purposes as part of my Information Retrieval course. The primary goal is to implement an indexing engine in Java and use it to process Turkish poems by Nazim Hikmet.

Objective

The objective of this project is to learn and implement the fundamental concepts of information retrieval, including text preprocessing, indexing, and, at a later stage, search.

Features

  • Web scraping to retrieve a Turkish poem from a specific URL.
  • Text preprocessing techniques such as tokenization and lowercasing.
  • Calculation of the cosine similarity between different poems.
  • (Planned) Search functionalities to retrieve information based on user queries.
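The preprocessing and similarity features listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class name PoemSimilarity, its method names, and the sample strings are all assumptions made for the example. It also assumes raw term counts as vector weights and a Turkish locale for lowercasing, so that uppercase "I" maps to the dotless i (U+0131) rather than to "i".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; the name and methods are illustrative only.
public class PoemSimilarity {

    // Turkish casing rules: "I" lowercases to dotless i (U+0131), dotted capital I to "i".
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Lowercases the text with Turkish rules and splits it on runs of non-letter characters. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(TURKISH).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) { // split() can yield an empty leading token
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Counts how often each token occurs. */
    public static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two term-frequency vectors; returns 0.0 if either vector is empty. */
    public static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Made-up sample strings, not lines from the analysed poems.
        List<String> first = tokenize("Mavi deniz, mavi liman.");
        List<String> second = tokenize("Mavi deniz ve beyaz gemi.");
        System.out.println(cosineSimilarity(termFrequencies(first), termFrequencies(second)));
    }
}
```

Raw counts keep the sketch short. A weighting scheme such as TF-IDF could be substituted in termFrequencies without changing cosineSimilarity.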

Technologies Used

  • Java
  • Jsoup (for web scraping)
  • (Planned) Apache Lucene (for advanced indexing and search functionalities)

Analysed Poems

Usage

This project serves as an educational resource to understand and apply information retrieval concepts. To use or contribute to this project, clone the repository and follow the setup instructions.

Acknowledgements

The initial inspiration for this project came from https://www.cs.rpi.edu/~sibel/poetry/nazim_hikmet.html, which is also the source of the Turkish poem.

Status

This project is currently in its early stages, focusing on text retrieval and preprocessing. Contributions and suggestions for improvement are welcome.

Author

Zainab Lawal