In this project, I implemented a content-based recommender as a web app. For each post in a Stackoverflow.com data set (10 posts), it recommends similar content from the Java Programming Wikibooks pages ( https://en.wikibooks.org/wiki/Java_Programming ).
To do this:
- First, crawled the Java Programming Wikibooks pages (crawlingSoup.java).
- Indexed the crawled content with Apache Lucene (Solr or any Lucene-compatible engine would also work).
- Built a simple web app that displays the 10 provided StackOverflow posts and lets the user select each one.
- For the selected post, listed recommendations: the top 10 relevant Wikibooks items from the indexed documents.
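The ranking idea behind the last step can be sketched without any external library. This is not the project's Lucene code and not Lucene's scoring formula: it is a simplified stand-in that scores each page by plain cosine similarity over raw term counts and returns the best N titles. The class name TopNRecommender, its method names, and the sample page texts are all hypothetical, and the sketch assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the Lucene search step (not Lucene's scoring). */
final class TopNRecommender {

    /** Lower-cases the text, splits on non-alphanumerics, and counts each term. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Euclidean length of a term-count vector. */
    private static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two term-count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Returns the titles of the n pages most similar to the post, best first. */
    static List<String> recommend(String post, Map<String, String> pages, int n) {
        Map<String, Integer> postVector = termCounts(post);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            scores.put(page.getKey(), cosine(postVector, termCounts(page.getValue())));
        }
        List<String> titles = new ArrayList<>(pages.keySet());
        // Highest score first; the sort is stable, so ties keep their input order.
        titles.sort(Comparator.comparingDouble((String t) -> scores.get(t)).reversed());
        return new ArrayList<>(titles.subList(0, Math.min(n, titles.size())));
    }

    public static void main(String[] args) {
        // Sample page texts are made up for illustration only.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("Arrays", "An array stores a fixed number of values of one type");
        pages.put("Exceptions", "Use try and catch to handle an exception thrown at runtime");
        pages.put("Loops", "A for loop repeats a block while a condition holds");
        System.out.println(recommend("How do I catch an exception in Java?", pages, 10));
    }
}
```

In the actual project this job is done by the Lucene index and its query scoring, which also apply an analyzer (tokenizing, stop words) that this sketch leaves out.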
Code Run Instructions : run ServeletExample2.java on a Tomcat server (version 7 was used).
The main JSP file, which is called automatically from the servlet, is index.jsp.
External JAR libraries used:
- jsoup-1.8.3.jar
- lucene-analyzers-common-4.8.0.jar
- lucene-core-4.8.0.jar
- lucene-demo-4.8.0.jar
- lucene-queryparser-4.8.0.jar
Java files and their functions
- crawlingSoup.java - Crawls data from the given website using Jsoup and creates 3 folders:
  - textfiles : stores a text file for every indexed page
  - linknames : stores the links to file names
  - htmlfiles : stores the HTML code for the files
- Keyword.java - Contains third-party code that is used for indexing files.
- SampleLuceneIndexing.java - Contains the indexing code, which is a combination of code provided by the professor, third-party code, and self-written code.
- ServeletExample2.java - Contains the servlet code, which acts as the binding place that calls all the other Java files and dynamically invokes the index.jsp page.
- index.jsp - Uses the Foundation responsive framework to render the collapsible panel functionality.
Error Handling
- Crawling is done in the init() function of the servlet. If a directory is sometimes not found, please restart the server and run again.
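
One possible way to avoid the missing-directory case is to create the crawler's output folders before crawling starts. The sketch below is only an assumption about how that could look, not code from this project: the CrawlDirs class and its ensure method are hypothetical, the folder names are taken from the crawlingSoup.java description, and the base path used by the real crawler may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: creates the crawler's output folders if they are missing. */
final class CrawlDirs {

    // Folder names as listed for crawlingSoup.java.
    static final String[] NAMES = {"textfiles", "linknames", "htmlfiles"};

    /** Creates each folder under base; does nothing for folders that already exist. */
    static void ensure(Path base) throws IOException {
        for (String name : NAMES) {
            Files.createDirectories(base.resolve(name));
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the real (unknown) crawl location.
        Path base = Files.createTempDirectory("crawl-output");
        ensure(base);
        System.out.println("Created crawl folders under " + base);
    }
}
```

A call such as CrawlDirs.ensure(basePath) at the start of the servlet's init() would be one place to use it, assuming the crawler writes under a single base folder.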