IR-CosineSimilarity-vs-Freq


Primary language: TeX. License: MIT.

IR-CosineSimilarity

I create a vector-space Information Retrieval model of the research interests of the faculty members of the Department (a.k.a. DEP, or professors) and use it to suggest possible collaborations between them.

  • Question 1. Finding important terms for each faculty member. In this question you are asked to characterize the research interests of each faculty member with a set of terms. Use the Vector Model with tf-idf weighting to represent each faculty member as a weight vector over the terms contained in the titles of their research articles and of the conferences / journals in which they were published. After doing the above, your implementation should create the file "prof-description.txt", which will have 26 lines (one per faculty member). Each line will contain the last name of the faculty member followed by the N most important terms found in the titles of the articles and the respective journals / conferences, together with their weights, in the form (term, weight). N will be a parameter of your program.

  • Question 2. Finding a faculty member based on a query. In this question you are asked to rank the faculty members by their relevance to a query posed by the user. To do this, characterize the research interests of each faculty member with a set of terms as in Question 1, using the Vector Model and tf-idf weighting. The user will then submit a query of one or more terms (e.g., truth model driven system with enable the nearest database), and you will calculate the similarity of each faculty member to the query and rank them by that similarity. Note that whatever pre-processing you applied to your data must also be applied to the user's queries! After doing the above, your implementation should create the file results-"question-words".txt (e.g., results-truth-model-driven-system-with-enable-the-nearest-database.txt for our example), which will have 26 lines (one per faculty member). Each line will contain the surname of the faculty member followed by the similarity to the query, in the form (surname, similarity). The file should be sorted in descending order of similarity to the query.

  • Question 3. Finding faculty members with close research interests. You are asked to find, for each faculty member, the two colleagues with the closest research interests. To do this, characterize the research interests of each faculty member with a set of terms as in Question 1, using the Vector Model and tf-idf weighting. Then compute the cosine similarity between each faculty member and every other colleague by exhaustive pairwise comparison. After doing the above, your implementation should create the file similar_profs.txt, which will have 26 lines (one per faculty member). Each line will contain the last name of the faculty member followed by the M most similar faculty members along with their degree of similarity, in the form (surname, similarity). M will be a parameter of your program. In addition to this file, you should include in your report a 26x26 table with the similarities of all faculty members with each other (see "Results / All profsesors cosine similarities").

  • Question 4. Change of weight calculation method. Implement the above three questions again, changing the way the term weights are calculated as follows: for tf, simply use the number of occurrences of a term (freq), and use neither idf nor the normalization in the cosine similarity. Describe in your report whether and how the results are affected compared to before. Was this expected? (Of course, cosine similarity works better than simple frequency.)
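
The four questions above share one pipeline: count terms, weight them, and compare vectors. The following is a minimal Python sketch of that pipeline, not the repository's actual implementation. Everything in it is an assumption made for illustration: the whitespace tokenizer, the idf variant ln(N / df), the helper names (`tokenize`, `idf_table`, `weigh`, `rank`, `format_line`), and the three made-up faculty profiles. For Question 4 it reads "no idf, no normalization" as raw term frequencies compared with a plain dot product.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed pre-processing: lowercase and split on whitespace.
    # The same function is applied to titles and to user queries.
    return text.lower().split()


def idf_table(doc_counts):
    # idf(t) = ln(N / df(t)); one common variant, assumed here.
    n_docs = len(doc_counts)
    df = Counter(term for counts in doc_counts.values() for term in counts)
    return {term: math.log(n_docs / d) for term, d in df.items()}


def weigh(counts, idf=None):
    # tf-idf weights when idf is given; raw frequencies otherwise (Question 4).
    if idf is None:
        return {term: float(freq) for term, freq in counts.items()}
    return {term: freq * idf.get(term, 0.0) for term, freq in counts.items()}


def dot(u, v):
    # Un-normalized inner product of two sparse term vectors.
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def cosine(u, v):
    # Inner product divided by the product of the vector lengths.
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def top_terms(vec, n):
    # Question 1: the N highest-weighted terms (ties broken alphabetically).
    return sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def rank(vectors, target, sim=cosine, exclude=None):
    # Questions 2 and 3: score every profile against `target`, best first.
    scores = [(name, sim(vec, target))
              for name, vec in vectors.items() if name != exclude]
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))


def format_line(name, pairs):
    # One output line: a name followed by (key, value) pairs.
    return name + " " + " ".join(f"({k}, {v:.3f})" for k, v in pairs)


# Hypothetical toy data: one string of concatenated titles per faculty member.
docs = {
    "Alpha": "database query optimization database systems",
    "Beta": "distributed database systems",
    "Gamma": "neural network training",
}
doc_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
idf = idf_table(doc_counts)
tfidf_vectors = {name: weigh(c, idf) for name, c in doc_counts.items()}
freq_vectors = {name: weigh(c) for name, c in doc_counts.items()}

query_counts = Counter(tokenize("database systems"))
by_tfidf = rank(tfidf_vectors, weigh(query_counts, idf))         # Question 2
by_freq = rank(freq_vectors, weigh(query_counts), sim=dot)       # Question 4
neighbours = rank(tfidf_vectors, tfidf_vectors["Alpha"], exclude="Alpha")[:1]

print(format_line("Alpha", top_terms(tfidf_vectors["Alpha"], 2)))  # Question 1
print(format_line("query", by_tfidf))
print(format_line("Alpha", neighbours))                            # Question 3
```

In this sketch N and M are simply the slice lengths passed to `top_terms` and applied to the result of `rank`. Passing `sim=dot` together with raw-frequency vectors is what switches the same ranking code to the Question 4 weighting, so the two methods can be compared on identical input.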