Devopedia.org

Chicago Manual Style Citation String

For Devopedia Articles the citation adapted is Chicago style. More on the style is in the following link

https://www.citationmachine.net/chicago

Objective

In this project we have aimed to construct the citation for website given the URL. The examples are in the following link

https://www.chicagomanualofstyle.org/tools_citationguide/citation-guide-1.html#cg-website

The standard format is as follows

Author, year_of_publishing, Title, granular time details, granular content details

We attempt to predict the content in bold given URL

Data

Total Devopedia Articles - About 10K
Web Site Referenced Articles - About 7K
Used Articles

-For Authors- 5064 files

-For Titles- 6220 files

-For YoPs- 3532 files
Train Articles

-For Authors- 4477 files

-For Titles- 5633 files

-For YoPs- 2945 files
Test Articles -587 files for each
Challenges:
- Plenty of .htm files had no relevant content (for example, just script and meta tags)
- Some .htm files did not support 'utf-8' or 'iso' encodings (perhaps in another language script)
- Those .htm files which did not contain even a single Title/Author/YoP encoding label as 1 for all its extracted tag-text rows were not used in their respective field datasets

Methodology

Crawl the URL
Extract Tag and Content
Label the Tag and Content with Author/YoP/Title if the content has Author/YoP/Title respectively
Build Features
1. Number of tokens in text
2. Weights for Tags, say, if title appears in 'h1' tag 10% of the times, the weight for 'h1' is 0.10
3. number of commas as percentage of number of tokens - this is useful to find Author
4. number of tokens have first letter as Capital letter
5. index(position) weightage of Tag/content. the earlier the tag/content, the higher weightage is assigned
6. if the text contains year(4 digit number with first digit either 1 or 0)
Split the URLs(and its content) into Train and Test
Train distinct models for Author/Title/YoP by up-sampling/down-sampling
Since the probable tags are predicted some post processing is done to extract Author/Title/YoP
The dataset is imbalanced because only fewer tags can be labelled as Author/Title/YoP.

Hence False negatives are minimized and False positives are compromised. The idea is to increase recall and compromise on precision

We publish top 3 predictions and expect the exact predictions to be part of the top 3

Reference string is constructed

Process Document

Install requirements

pip install -r requirements.txt

Training

Modify the train_inputs.csv file to configure
html_path - location where crawled html files are present
articles - location where train data is made available in json format
meta_data - location where meta data is present
you can also configure number of layers, number of nodes in each layer, learning rate, number of epoch and batch size towards the end of the file
The output files are named in the configuration. please leave them as it is to avoid errors

python commands.py preprocess train inputs.csv (files_batch_size)
python commands.py train inputs.csv save (model_name)
python commands.py postprocess train inputs.csv (model_name)

The preprocess will generate files in Train_preprocess folder
The model files will be generated in the root folder as .pkl files
The postprocess will generate files Train_diagnostics folder

Diagnostics

Since we ignore false positives on purpose and depend on top 3 predictions post processing a custom metric(based on similarity with original ref string) is worked out to arrive at accuracy

This needs further attention

To get predictions by giving URL as input

python commands.py predict (url) (model_name)

The URL inputted is crawled and saved as file.htm in the Prediction_outputs folder (which is created if not already present in the root direcotry)
This results are displayed in screen and files will be present in Prediction_outputs folder
The results are also stored in Prediction_results.txt file in the Prediction_outputs folder

Example

python commands.py predict https://sunscrapers.com/blog/10-django-packages-you-should-know/ ANN

Further Scope

Reference String construction for PDFs/Text files
Eliminate redundant code for train and predict
A robust mechanism to define accuracy
Model file is big and takes long time to load at the time of prediction
Exploring NER's to narrow down predictions yielding accuracy.
- We attempted, but it slows down training/prediction
Improve features - Tag Weight and index Weight - investigate data and innovate math to represent tag and index with appropriate weights
Constructing Reference String in case of multiple authors

Saranga7/Devopedia-CMS_generator