stack_analyser

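A Python project for the Business Intelligence course that analyses the questions on Stack Overflow. It has two main parts: a spider based on Scrapy and a web application based on Flask. The spider crawls millions of questions, and the analysis finds the hottest tags for each day and how they change over time.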

Introduction

The project aims to find out which questions are asked most frequently each day on the well-known Q&A site Stack Overflow, and how they change over time. Crawling all the questions is fairly simple: since Stack Overflow has no anti-spider policy, it is easy to collect millions of questions with Scrapy. Storing that much data is harder. Because the data is in JSON format, I chose MongoDB. To present the analysis results, I used Flask to build a simple web application, which uses Highcharts to display the results in different graphs and charts.
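The core analysis described above, counting tags per day, can be sketched in plain Python. This is only an illustration and not the project's actual code: the sample documents are made up, and the field names `tags` and `date` as well as the helper `hottest_tags_by_day` are assumptions about how a crawled question might look.

```python
from collections import Counter, defaultdict

# Made-up sample documents; the field names are assumptions, not the
# project's real schema.
questions = [
    {"title": "How do I merge two dicts?", "tags": ["python", "dictionary"], "date": "2016-05-01"},
    {"title": "What is a closure?", "tags": ["javascript"], "date": "2016-05-01"},
    {"title": "List comprehension vs map", "tags": ["python"], "date": "2016-05-01"},
    {"title": "Async/await basics", "tags": ["javascript", "async-await"], "date": "2016-05-02"},
]


def hottest_tags_by_day(docs, top_n=3):
    """Return {date: [(tag, count), ...]} with the top_n tags for each day."""
    counts = defaultdict(Counter)
    for doc in docs:
        # Each question contributes one count to every tag it carries.
        counts[doc["date"]].update(doc["tags"])
    return {day: counter.most_common(top_n) for day, counter in counts.items()}


result = hottest_tags_by_day(questions)
print(result["2016-05-01"][0])  # the most frequent tag on that day
```

In the real project the documents would come from MongoDB rather than an in-memory list, but the grouping and counting logic would be similar in spirit.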

The following diagram shows the structure of the whole project.

(diagram: structure)

Screenshots

hottest topics (screenshot)

how much share they account for (screenshot)

how they change with time (screenshot)

custom analysis (3 screenshots)

search (3 screenshots)

Setup

  1. Clone the project.
  2. Install MongoDB (skip if already installed) and create the database stack_db in MongoDB.
  3. Install MySQL (skip if already installed), create the database stack_db, and create the table tag(id, tagname, tag_count, date) in stack_db.
  4. Open both projects, stack_analyser and stack_spider, in PyCharm.
  5. Run stack_spider to crawl questions from Stack Overflow.
  6. Run static_cache.py in stack_analyser to compute the statistics and transfer the data.
  7. Run stack_analyser to see the final result.
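As a rough illustration of the tag table from step 3, the sketch below creates a table with the same four columns and queries the top tags for one day. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The column types, the sample rows, and the query are assumptions; the source only names the columns.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL stack_db.
conn = sqlite3.connect(":memory:")

# Column names come from the setup step; the types are assumptions.
conn.execute(
    """
    CREATE TABLE tag (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tagname TEXT NOT NULL,
        tag_count INTEGER NOT NULL,
        date TEXT NOT NULL
    )
    """
)

# Made-up per-day tag counts, as the statistics step might produce them.
rows = [
    ("python", 2, "2016-05-01"),
    ("javascript", 1, "2016-05-01"),
    ("dictionary", 1, "2016-05-01"),
    ("javascript", 1, "2016-05-02"),
]
conn.executemany(
    "INSERT INTO tag (tagname, tag_count, date) VALUES (?, ?, ?)", rows
)
conn.commit()

# Top two tags for one day; ties are broken alphabetically by tag name.
top = conn.execute(
    "SELECT tagname, tag_count FROM tag "
    "WHERE date = ? ORDER BY tag_count DESC, tagname LIMIT 2",
    ("2016-05-01",),
).fetchall()
print(top)
```

With MySQL, the equivalent table would use `AUTO_INCREMENT` and MySQL column types, and the placeholder syntax depends on the driver used.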