
Search engine over a Stack Overflow corpus, plus a web crawler that scans the Stack Overflow website to find the most popular technologies

Search engine over a Stack Overflow corpus of about 160k documents

I split the code into three notebook files:

  • DataExtraction: code for collecting the data
  • DataPreprocessing: code for processing the data
  • Retrieval: code for retrieving the top 10 most similar docs for a query
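
The README does not say how Retrieval ranks documents, so the following is only a minimal sketch of one common approach: TF-IDF vectors compared by cosine similarity. It uses only the Python standard library so it stays self-contained (the notebooks themselves list sklearn). The names `tokenize`, `build_index`, `to_vector`, and `top_k`, as well as the sample `docs`, are hypothetical and not taken from the notebooks.

```python
import math
import re
from collections import Counter

# Hypothetical sample corpus; the real corpus has about 160k documents.
docs = [
    "How to merge two dictionaries in Python",
    "Reverse a string in Java",
    "Python list comprehension examples",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def to_vector(tokens, idf):
    """Build an L2-normalised TF-IDF vector (as a dict); unknown terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(documents):
    """Return (idf, vectors) for a list of raw document strings."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms found in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [to_vector(tokens, idf) for tokens in tokenized]

def top_k(query, idf, vectors, k=10):
    """Return up to k (doc_index, score) pairs, best first, skipping zero scores."""
    q = to_vector(tokenize(query), idf)
    # Vectors are normalised, so the dot product is the cosine similarity.
    scored = [
        (sum(w * vec.get(t, 0.0) for t, w in q.items()), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k] if score > 0]

idf, vectors = build_index(docs)
for index, score in top_k("merge dictionaries python", idf, vectors):
    print(f"{score:.3f}  {docs[index]}")
```

With `k=10` this returns at most ten documents, matching the "top 10" behaviour described above; on a corpus of this size a sparse-matrix implementation would likely be preferable to plain dicts.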

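Required libraries and packages:

pandas, numpy, sklearn, nltk, re, os, sys, csv, xml

(re, os, sys, csv, and xml are part of the Python standard library; sklearn is installed as the scikit-learn package.)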


For this project, I collected the data from the Stack Exchange Data Dump.
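
As a rough illustration of what the extraction step might look like, the sketch below streams an XML file with `xml.etree.ElementTree.iterparse` and writes the question rows to CSV. It assumes the dump stores posts as `<row>` elements with `Id`, `PostTypeId`, `Title`, `Body`, and `Tags` attributes, and that `PostTypeId="1"` marks a question; this layout is an assumption and should be checked against the actual dump. The function `extract_questions` and the `SAMPLE` data are hypothetical, not taken from the DataExtraction notebook.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny hypothetical sample in the assumed <row .../> layout.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="How do I parse XML?" Body="&lt;p&gt;Example&lt;/p&gt;" Tags="&lt;python&gt;&lt;xml&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;An answer&lt;/p&gt;" />
</posts>
"""

def extract_questions(source, out_file, fields=("Id", "Title", "Body", "Tags")):
    """Stream <row> elements from source and write question rows to out_file.

    source may be a file path or a binary file object. Returns the number
    of question rows written.
    """
    writer = csv.writer(out_file)
    writer.writerow(fields)
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Assumption: PostTypeId "1" identifies a question.
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            writer.writerow([elem.get(f, "") for f in fields])
            count += 1
        elem.clear()  # release the element's data to keep memory use low
    return count

out = io.StringIO()
n_questions = extract_questions(io.BytesIO(SAMPLE), out)
print(n_questions, "question(s) written")
```

Streaming with `iterparse` avoids loading the whole file into memory, which matters for dump files that can be many gigabytes in size.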


A web crawler that crawls the Stack Overflow website and finds the currently most popular technologies by collecting the tags of the newest questions asked on the site.

webcrawler contains the code for this task.
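
The tag-counting idea can be sketched as follows. To stay self-contained, this example parses a hard-coded HTML string with the standard library's `html.parser` instead of fetching live pages with requests and bs4 as the crawler does. The CSS class `post-tag`, the `TagCollector` class, the `popular_tags` function, and the sample markup are all assumptions made for illustration; the real page structure may differ.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical markup standing in for a page of newest questions.
SAMPLE_HTML = """
<div class="question">
  <a class="post-tag" href="/questions/tagged/python">python</a>
  <a class="post-tag" href="/questions/tagged/pandas">pandas</a>
</div>
<div class="question">
  <a class="post-tag" href="/questions/tagged/python">python</a>
  <a class="post-tag" href="/questions/tagged/javascript">javascript</a>
</div>
<div class="question">
  <a class="post-tag" href="/questions/tagged/javascript">javascript</a>
  <a class="post-tag" href="/questions/tagged/python">python</a>
  <a href="/users/1">someone</a>
</div>
"""

class TagCollector(HTMLParser):
    """Collect the text of every <a> element carrying the tag CSS class."""

    def __init__(self, tag_class="post-tag"):  # class name is an assumption
        super().__init__()
        self.tag_class = tag_class
        self.tags = []
        self._in_tag_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.tag_class in classes:
            self._in_tag_link = True

    def handle_data(self, data):
        if self._in_tag_link and data.strip():
            self.tags.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_tag_link = False

def popular_tags(html, top_n=10):
    """Return the top_n (tag, count) pairs found in the given HTML."""
    collector = TagCollector()
    collector.feed(html)
    return Counter(collector.tags).most_common(top_n)

print(popular_tags(SAMPLE_HTML, top_n=2))
```

Counting tags over several pages of newest questions, rather than a single page, would give a more stable picture of which technologies are currently popular.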

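Required libraries:

urllib3, requests, bs4, zlib, operator, os, sys

(zlib, operator, os, and sys are part of the Python standard library; bs4 is installed as the beautifulsoup4 package.)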

How to run

Download the files and make sure all files and folders are in the same directory.

To run the code on the server:

  • Go to the UI-demo folder
  • Create a virtual environment (command on Windows: virtualenv env)
  • Activate the virtual environment (command on Windows: env\Scripts\activate)
  • Install the dependencies from requirements.txt (command: pip install -r requirements.txt)
  • Run python app.py in the terminal

