
Developing a web crawler to create a multi-modal Covid data set


COVID vCues Dataset

This dataset research project supports Professor Ankur Chattopadhyay's COVID vCues research by building a multi-modal dataset of images drawn from reliable and unreliable sources on COVID-19. The dataset will be used to train several AI models: one that classifies images as reliable or unreliable, and others that identify memes, ads, claims, fact-checks, or logos.

To-Do

  • Scrape CoAID (Sarah)
  • Scrape ReCovery (Shreetika)
  • Scrape MM-Covid (Shreetika)
  • Scrape tweets with Twikit
  • Consolidate CoAID, ReCovery, & MM-Covid (Sarah)
  • Remove duplicate images
  • Deep learning reliable vs unreliable model
    - [X] Small test model w/ Keras and Tensorflow (200 images each) (Sarah)
    - [X] Keras neural network using all images - Model is overfit? (Sarah)
    - [X] Small test model w/ SVM (also tried with random images on Google) (Shreetika)
  • Clean dataset
    1. Remove duplicates w/ existing script (Sarah)
    2. Figure out how to remove pixelated images (Shreetika)
    3. Use OpenCV to identify people and manually delete profile pictures (Shreetika)
    4. Remove favicon and icon type images by sorting images by size (Shreetika)
    5. Randomly select the same number of remaining images from each category
  • Redo model training with the cleaned dataset, SVM model (Shreetika)
  • Develop category-identifying models: Method 1
    - [ ] Memes
    - [ ] Ads
    - [ ] Claims
    - [ ] Fact-checks
    - [ ] Logos
    Method 2
    - [ ] Infographics/Diagrams
    - [ ] Photographs
    - [ ] Illustrations
    - [ ] Memes
    - [ ] Advertisements
    - [ ] Misc/Logos
    - Image naming convention idea: (un)reliable.subcategory.####.jpg/png
  • Analysis of dataset breakdown
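
The cleaning steps and the naming idea above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's existing scripts. All function names, the `min_bytes` threshold, and the four-digit zero padding are assumptions made for the example. It treats "size" as file size in bytes, and it only catches byte-identical duplicates, so resized or re-encoded copies would need a perceptual comparison instead.

```python
import hashlib
import random
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def drop_exact_duplicates(paths):
    """Keep the first file seen for each distinct byte content."""
    seen, unique = set(), []
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique


def drop_small_files(paths, min_bytes=2048):
    """Discard files smaller than min_bytes.

    The 2048-byte default is a hypothetical cutoff for favicons and
    icons; a real threshold should come from inspecting the data.
    """
    return [p for p in paths if Path(p).stat().st_size >= min_bytes]


def balanced_sample(by_category, seed=0):
    """Randomly pick the same number of items from every category.

    The sample size is the size of the smallest category.
    """
    rng = random.Random(seed)
    n = min(len(items) for items in by_category.values())
    return {cat: rng.sample(list(items), n) for cat, items in by_category.items()}


def make_name(reliable, subcategory, index, ext):
    """Build a file name in the form (un)reliable.subcategory.####.ext."""
    prefix = "reliable" if reliable else "unreliable"
    return f"{prefix}.{subcategory}.{index:04d}.{ext}"
```

For example, `make_name(False, "memes", 42, "jpg")` yields `unreliable.memes.0042.jpg`. Fixing the seed in `balanced_sample` keeps the selection reproducible between runs.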

Sources

The dataset is based on CoAID (COVID-19 Healthcare Misinformation Dataset), ReCOVery, and MM-COVID.

Citations:
@misc{cui2020coaid,
title={CoAID: COVID-19 Healthcare Misinformation Dataset},
author={Limeng Cui and Dongwon Lee},
year={2020},
eprint={2006.00885},
archivePrefix={arXiv},
primaryClass={cs.SI}
}
https://github.com/apurvamulay/ReCOVery/tree/master
https://github.com/bigheiniu/MM-COVID/blob/main/README.md

Usage

This dataset is still under development and is not yet ready for use.

Authors

Sarah Ogden
Shreetika Poudel

Helpful Tutorials