42-AI/SentimentalBB

feat (data): script to request tweets from twitter API

Closed this issue · 3 comments

Objective: build the database and have data for as many days as possible.

📖 Describe what you want

Update the dataset script to request specific tweets from the Twitter API based on their date or ID.

The script MUST save ALL the tweets received into CSV files in the data/raw/twitter directory, with the date and ID of the first and last tweet specified (in the filename?). Possible formats:

  • data/raw/twitter/[candidat_name]_[startdate]_[enddate].csv
  • data/raw/twitter/[candidat_name]_[first_id_tweet]_[last_id_tweet].csv
  • data/raw/twitter/candidat_name/[startdate]_[enddate]_[first_id_tweet]_[last_id_tweet].csv
  • data/raw/twitter/week_#x/[candidat_name]_[startdate]_[enddate]_[first_id_tweet]_[last_id_tweet].csv
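For illustration, the third format above could be produced with a small helper like this (the function name and signature are hypothetical, nothing like it exists in the codebase yet):

```python
from datetime import date

def make_filename(candidate: str, start: date, end: date,
                  first_id: int, last_id: int) -> str:
    """Build a CSV path following the third proposed format:
    data/raw/twitter/<candidate>/<start>_<end>_<first_id>_<last_id>.csv
    """
    return (f"data/raw/twitter/{candidate}/"
            f"{start.isoformat()}_{end.isoformat()}_{first_id}_{last_id}.csv")

print(make_filename("Melenchon", date(2022, 3, 18), date(2022, 3, 18),
                    1504768000000000000, 1504999999999999999))
```

Whichever format is chosen, keeping the construction in one helper makes it easy to change later without touching the collection logic.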

A particular point must be considered: the script should collect small chunks of results and save them little by little, to avoid issues related to cache memory, disk memory, or anything else. Create a tmp directory where the small portions are stored; after that, the script concatenates them into the final CSV file.
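The chunked-saving idea could be sketched as follows (all names and the tweet column schema are hypothetical, not taken from the existing script):

```python
import csv
import os

def save_in_chunks(tweets_iter, tmp_dir, chunk_size=100):
    """Write tweets to numbered chunk files in tmp_dir as they arrive,
    so an interruption never loses more than one in-memory chunk."""
    os.makedirs(tmp_dir, exist_ok=True)
    chunk, chunk_paths = [], []

    def flush():
        # Write the current chunk to its own numbered CSV file.
        if not chunk:
            return
        path = os.path.join(tmp_dir, f"chunk_{len(chunk_paths):05d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "created_at", "text"])
            writer.writeheader()
            writer.writerows(chunk)
        chunk_paths.append(path)
        chunk.clear()

    for tweet in tweets_iter:
        chunk.append(tweet)
        if len(chunk) >= chunk_size:
            flush()
    flush()  # write any remaining partial chunk
    return chunk_paths
```

Because each chunk is flushed to disk before the next API page is requested, memory usage stays bounded by the chunk size rather than by the total number of tweets.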

This script should be designed to be launched periodically, every week (or every day?), and to collect a specified amount of tweets about each candidate. The amounts of tweets per day and per candidate are yet to be determined.

✔️ Definition of done

  • a functioning script is written,
  • a format for the filename is chosen,
  • the script creates a tmp directory where it saves small chunks of the total results,
  • the script concatenates all the chunks into a final CSV file.
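The final concatenation step could look like this sketch (assuming the chunk naming convention used above; the function name is hypothetical):

```python
import csv
import glob
import os

def concat_chunks(tmp_dir, final_path):
    """Merge all chunk CSVs (sorted by name) into one final CSV,
    writing the header row only once."""
    chunk_files = sorted(glob.glob(os.path.join(tmp_dir, "chunk_*.csv")))
    with open(final_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in chunk_files:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)  # header from the first chunk only
                writer.writerows(reader)
    return final_path
```

Sorting the chunk filenames keeps the final file in collection order, since the zero-padded chunk numbers sort lexicographically.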

This script should FIRST be tested ONLY with small amounts of tweets requested from the API, in order to preserve our request quota: for instance, 1k tweets for 2 or 3 candidates. The person testing the script should be careful to check the points above.

This script will be used for larger amounts after the pull request is validated.

Actually, you do not need to write a new script, only add a feature to the existing one.

The request:

poetry run python -m src data --download twitter --mention Melenchon --start_time '2022-03-18 8:00' --end_time '2022-03-18 22:00'
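For illustration, the flags in this command could be parsed with argparse along these lines (a sketch only; the repository's actual CLI code may be structured differently):

```python
import argparse

# Hypothetical mirror of the flags shown in the command above.
parser = argparse.ArgumentParser(prog="src data")
parser.add_argument("--download", choices=["twitter"])
parser.add_argument("--mention", help="candidate name to search for")
parser.add_argument("--start_time", help="e.g. '2022-03-18 8:00'")
parser.add_argument("--end_time", help="e.g. '2022-03-18 22:00'")

args = parser.parse_args(["--download", "twitter", "--mention", "Melenchon",
                          "--start_time", "2022-03-18 8:00",
                          "--end_time", "2022-03-18 22:00"])
print(args.mention)
```

The new chunking feature would then hang off these existing flags rather than introducing a separate entry point.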

At this time, the part concerning:

A particular point must be considered: the script should collect small chunks of results and save them little by little, to avoid issues related to cache memory, disk memory, or anything else. Create a tmp directory where the small portions are stored; after that, the script concatenates them into the final CSV file.

is not implemented yet.