This project, which began as an end-of-year project for CSC111 at UofT, focuses on finding gaps in the knowledge captured by Wikipedia. The information on Wikipedia can serve as a microcosm of collective human knowledge. Finding gaps or underdeveloped areas in it suggests directions that we, as a society, should explore.
This project entails running data analysis on the entire Wikipedia dataset, which is, by nature, very large. Running the computations therefore requires substantial processing power and memory (at least 16 GB of RAM, plus a paging scheme such as a swapfile or swap partition). More detailed instructions can be found in the project report.
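Before starting a long run, it may be worth checking the machine against the memory guideline above. The following is only a rough sketch and not part of the project: the function names are hypothetical, `os.sysconf` is available on Unix only, swap is not inspected, and the threshold is taken as decimal gigabytes because machines sold as 16 GB usually report slightly less than 16 GiB.

```python
import os
from typing import Optional

GB = 10 ** 9  # decimal gigabyte; an assumption, see the note above


def total_ram_bytes() -> Optional[int]:
    """Return total physical RAM in bytes, or None if it cannot be determined."""
    try:
        page_size = os.sysconf('SC_PAGE_SIZE')
        page_count = os.sysconf('SC_PHYS_PAGES')
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and some names are not
        # defined on every Unix platform.
        return None
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count


def meets_memory_guideline(ram_bytes: int, required_gb: int = 16) -> bool:
    """Return True if ram_bytes is at least required_gb decimal gigabytes."""
    return ram_bytes >= required_gb * GB


if __name__ == '__main__':
    ram = total_ram_bytes()
    if ram is None:
        print('Could not determine RAM size on this platform.')
    elif meets_memory_guideline(ram):
        print('RAM guideline met.')
    else:
        print('Less than 16 GB of RAM detected; expect heavy swapping.')
```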
- Create a venv:
python -m venv venv
and enter the venv:
source venv/bin/activate
on Unix, or
venv\Scripts\activate.bat
on Windows
- Install requirements:
pip install -r requirements.txt
- Install local package:
pip install -e .
- Place data files in respective locations in
data/
- Run tests:
pytest -v
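If the install steps above seem to put packages in the wrong place, a quick way to confirm that the venv is actually active is to compare `sys.prefix` with `sys.base_prefix`. The helper name below is hypothetical; this is a small sketch, not a script that ships with the project.

```python
import sys


def in_virtualenv(prefix: str, base_prefix: str) -> bool:
    """Return True if the interpreter prefix differs from the base prefix.

    Inside an environment created with `python -m venv`, sys.prefix points
    at the venv directory while sys.base_prefix points at the base install.
    """
    return prefix != base_prefix


if __name__ == '__main__':
    if in_virtualenv(sys.prefix, sys.base_prefix):
        print('venv active:', sys.prefix)
    else:
        print('No venv active; activate it before running pip install.')
```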
Each of these subpoints will be a directory in the repo. Try to keep your code as clean as possible when pushing, and make sure you are not pushing unnecessary files or putting files in the wrong location.
The root directory will contain things like this README, requirements.txt, etc. Try not to clutter it up too much with things that would be better placed in a subdirectory.
This directory is meant for data storage. Its contents will not be pushed, but the directory structure will remain. We don’t push the data because it’s bad practice to push files that are obtainable outside of the project (especially if these files are large).
Raw files that have not yet been processed. This includes the wikidump.
Smaller sections of the wikidump that we can run trials on.
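One simple way to produce such a smaller section is to copy only the first lines of the raw file. The sketch below assumes the dump is a line-oriented file on disk; the function name and parameters are hypothetical, and cutting by line count may split a record in the middle, so the result is suitable for quick trials only.

```python
import itertools
from pathlib import Path


def write_head_sample(source: Path, target: Path, max_lines: int) -> int:
    """Copy at most max_lines lines from source to target.

    The file is streamed line by line in binary mode, so the full dump is
    never loaded into memory and line endings are preserved exactly.
    Returns the number of lines written.
    """
    written = 0
    with open(source, 'rb') as src, open(target, 'wb') as dst:
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```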
This is where output will go. We may push some of these or find some other way to share these as the processing time will be insane.
Directory for the project proposal. Only push tex, pdf, and bib files.
Directory for the project report. Only push tex, pdf, and bib files.
This is where all the Python files will go. There should generally be no subfolders here, but there are some exceptions. This is to allow for proper PATH management (how Python modules are imported, etc.).
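All Python files here will need to include the following:
"""Module docstring"""
import os  # Toward the top of the file

if __name__ == '__main__':
    # Replace 'name of file' with this module's file name so that the
    # slice strips 'wikigraph/<file name>' and leaves the root directory.
    os.chdir(__file__[0:-len('wikigraph/name of file')])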
This code ensures that everything runs relative to the root directory, no matter where you execute it from. This smooths out some differences between VS Code and PyCharm/terminal Python. I know that some of our TAs use VS Code, so this is NECESSARY.
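The same effect can be had without hard-coding the file name into each module. The sketch below is an alternative, not the project's required snippet: it assumes the module sits directly inside wikigraph/, and the helper name `project_root` is hypothetical.

```python
"""Module docstring"""
import os
from pathlib import Path


def project_root(module_file: str) -> Path:
    """Return the directory two levels above module_file.

    For a module at <root>/wikigraph/<module>.py this is <root>.
    resolve() makes the path absolute first, so the result does not
    depend on the directory the script was launched from.
    """
    return Path(module_file).resolve().parent.parent


if __name__ == '__main__':
    os.chdir(project_root(__file__))
```

Because the path is derived from `__file__` itself, renaming a module does not require editing the snippet.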
We should also make sure to document our code very well.
This directory is where we will put unit tests but it is also okay to have random testing for other things. Try to make sure that your code is as clean as possible when you’re pushing things.