- Read the plan below and discuss.
- Choose a module from the Issues section.
- Assign the chosen module to yourself so others know you are working on it.
- Implement the module in the corresponding files; details are given in the comments.
- In a comment at the top of the file, write down your assumptions (expected input, etc.) and briefly describe your approach, so that collaboration is easier.
- Once you complete the implementation, push your code, close the issue, and pick a new one.
- If you have any doubts, let me know. Happy coding :)
- P.S. we need to finish all of this by 1st April.
- First, the extractor extracts tweets as usual.
- Tweets are cleaned and dumped into MongoDB.
- Aggregation is done for the whole day.
- Based on the aggregation, the top 100 entities are found and their respective tweets are clubbed into one collection (a sketch of this step appears after this list).
- Before the tweets are dumped into that collection, sentiment analysis is done on them.
- Using each of the 100 collections as a separate document, LDA is performed; if 100 documents are too few, we can split the big documents into smaller ones (see the LDA sketch after this list).
- The tweets are then iterated over individually to find the topic to which each belongs.
- For each topic, the URLs that seem most relevant are extracted.
- Webpages corresponding to those URLs are downloaded and parsed (a fetch-and-parse sketch appears after this list).
- A portion of the main content can be displayed after extraction.
- The graph is approximated as usual, but the time span still has to be discussed.
- The graph, related tweets, and summaries of the URLs (along with their hyperlinks) are displayed for each topic on the portal.
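A minimal sketch of the aggregation, top-100, and sentiment steps above, assuming each cleaned tweet document carries `text` and `entities` fields; the database, collection, and field names are illustrative, and TextBlob stands in for whatever sentiment library the module actually uses:

```python
from pymongo import MongoClient
from textblob import TextBlob

client = MongoClient("localhost", 27017)
db = client["trends_engine"]  # hypothetical database name

# Count occurrences of each entity across the day's tweets; keep the top 100.
pipeline = [
    {"$unwind": "$entities"},
    {"$group": {"_id": "$entities", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 100},
]
top_entities = [doc["_id"] for doc in db.tweets.aggregate(pipeline)]

# Club each top entity's tweets into one collection, scoring sentiment first.
for entity in top_entities:
    for tweet in db.tweets.find({"entities": entity}):
        tweet.pop("_id", None)  # let MongoDB assign a fresh id
        tweet["entity"] = entity
        tweet["sentiment"] = TextBlob(tweet["text"]).sentiment.polarity
        db.top_tweets.insert_one(tweet)
```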
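The LDA step could look roughly like the following, using gensim. The toy corpus stands in for the 100 entity collections, and `num_topics`, the tokenization, and the `topic_of` helper are placeholders to be tuned, not the engine's actual code:

```python
from gensim import corpora, models

# Toy corpus: in the real pipeline, each token list would be the tokenized,
# concatenated tweets of one entity collection.
docs = [
    ["election", "vote", "poll", "rally"],
    ["cricket", "match", "score", "wicket"],
    ["budget", "tax", "finance", "market"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Iterate over individual tweets, assigning each to its most probable topic.
def topic_of(tweet_tokens):
    bow = dictionary.doc2bow(tweet_tokens)
    dist = lda.get_document_topics(bow)
    return max(dist, key=lambda pair: pair[1])[0] if dist else None

print(topic_of(["election", "rally"]))
```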
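For the download-and-parse step, here is a sketch using requests and BeautifulSoup; joining paragraph text is a crude stand-in for proper main-content extraction, and `fetch_page_summary` is an illustrative name:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_summary(url, max_chars=500):
    """Download a page and return its title plus a portion of the body text."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return title, body[:max_chars]  # portion of the main content to display
```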
- Control of the engine starts with manager.py.
- manager.py makes use of multiprocess and subprocess to spawn the extractor, preprocessor, and postprocessor as separate processes (a minimal sketch follows this list).
- config.py in the utilities package stores tuning parameters such as 'alarm' times, the file limit, etc.
- Refer to this .ppt for further information.
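A minimal sketch of that spawning logic, assuming "multiprocess" refers to the standard multiprocessing module; the script names below are assumptions, not the repository's actual file names:

```python
import multiprocessing
import subprocess

def run_stage(script):
    # Each stage runs as its own Python process; tuning parameters such as
    # 'alarm' times and the file limit would come from utilities/config.py.
    subprocess.call(["python", script])

if __name__ == "__main__":
    stages = ["extractor.py", "preprocessor.py", "postprocessor.py"]  # assumed names
    workers = [multiprocessing.Process(target=run_stage, args=(s,)) for s in stages]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```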
- Download dataset(s) from the Drive folder
- full_dataset.rar contains all 2 million tweets.
- Optionally, you can download parts of this dataset from the Parts folder; each part (dataset*.rar) contains 200,000 tweets.
- Each .json file contains 10,000 tweets
- Clone the git repository
- Run python_path.bat to add the PYTHONPATH environment variable. This needs to be done only once.
- Make the necessary changes in the config.py file in *engine\utilities*.
- Run python init.py in a Command Prompt to start the engine.
- To stop, close all Command Prompt and Python windows
- The portal folder is the Django project for the web portal.
- Create a database called 'trends'
- In the settings file, change the password for the MySQL root user, in case yours is different.
- Run python manage.py createsuperuser to create an admin.
- Create some top trends using the admin site. I have included a screenshot of the UI after creating some sample topics (with ranks); clicking a topic redirects to its details page (see the screenshots).
- The homepage can be opened using the URL 127.0.0.1:8000 or localhost:8000.
- The TopTrends model has a topic and a rank. It will be modified to include graphs and so on once the implementation is done (a rough sketch follows).
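A rough guess at what the current model might look like, with field names and types assumed from the description above rather than taken from the code:

```python
from django.db import models

class TopTrends(models.Model):
    # Assumed fields: the description only says the model has a topic and a rank.
    topic = models.CharField(max_length=200)
    rank = models.PositiveIntegerField()

    def __str__(self):
        return "%d. %s" % (self.rank, self.topic)
```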