dvc (data): save tweets to build database & push to DVC
guillaume-salle opened this issue · 2 comments
Describe what you want
This issue must only be started after issue #75 is completely done.
Run the feature built for issue #75 to download tweets locally from the Twitter API.
Make sure you have enough disk space available and that your machine is working correctly before doing so.
Then push these tweets with DVC to the AWS S3 storage service.
Determine the number of tweets to request per day and per candidate after consulting the other members of the project.
You should download tweets for one candidate with start_date set to 7 days ago and end_date set to 6 days ago.
The .csv file must have at least this start_date and the last tweet ID in its name, as advised in issue #75, in order to be able to complete the data for this day and candidate later if possible.
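A minimal sketch of how the date window and the file name could be derived; the build_csv_name helper and its exact naming scheme are illustrative assumptions, not the convention fixed in issue #75:

```python
from datetime import date, timedelta

# Assumed window from this issue: start_date is 7 days ago, end_date 6 days ago.
start_date = date.today() - timedelta(days=7)
end_date = date.today() - timedelta(days=6)

def build_csv_name(candidate: str, start_date: date, last_tweet_id: int) -> str:
    """Hypothetical naming scheme: embeds the start_date and the last tweet ID
    so the data for this day and candidate can be completed later."""
    return f"{candidate}_{start_date.isoformat()}_{last_tweet_id}.csv"

# e.g. candidate_a_2022-03-01_1498765432109876543.csv
print(build_csv_name("candidate_a", start_date, 1498765432109876543))
```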
This dataset should be reproducible: you must save the query sent to the Twitter API somewhere, either in the .csv file or in a dedicated file. The idea is to be able to publish this dataset on the Hugging Face Hub, and that requires reproducibility.
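One possible way to keep the query reproducible is a JSON sidecar file written next to each .csv; the field names below are an illustrative schema, not one defined by the project:

```python
import json
from pathlib import Path

def save_query_metadata(csv_path: str, query: str, start_date: str,
                        end_date: str, max_results: int) -> None:
    """Write the exact Twitter API query next to the .csv it produced,
    so the dataset can be rebuilt reproducibly before publication."""
    meta = {
        "query": query,
        "start_date": start_date,
        "end_date": end_date,
        "max_results": max_results,
    }
    sidecar = Path(csv_path).with_suffix(".query.json")
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps(meta, indent=2))

save_query_metadata(
    "data/raw/twitter/candidate_a_2022-03-01_1498765432109876543.csv",
    query="from:candidate_a -is:retweet",
    start_date="2022-03-01",
    end_date="2022-03-02",
    max_results=1000,
)
```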
✔️ Definition of done
Tweets for ONE candidate and ONE day, in a quantity consistent with our request capacity, are saved in the data/raw/twitter directory (or data/raw/twitter/week_#x/) and pushed to the DVC remote storage.
The script should prompt the user to run the commands that add and push the obtained data to the DVC remote (see the sketch after this list).
All data obtained from the Twitter API should be pushed to the DVC remote.
The query used to obtain the tweets must be saved.
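A minimal sketch of how the script could suggest the DVC commands once a download completes; dvc add and dvc push are the standard DVC CLI commands, but only printing them instead of running them is an assumption about the desired behavior:

```python
def suggest_dvc_commands(csv_path: str) -> None:
    """Print the DVC commands the user should run to version and push the
    freshly downloaded tweets; running them is deliberately left to the user."""
    print("Download finished. To version and push the data, run:")
    print(f"  dvc add {csv_path}")
    print(f"  git add {csv_path}.dvc")
    print("  dvc push")

suggest_dvc_commands("data/raw/twitter/candidate_a_2022-03-01_1498765432109876543.csv")
```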
This is not specific enough @guillaume-salle. What we want is to run the feature of the script described in issue #75 with the correct starting point (start_date or tweet ID). So maybe you should mention that, and the script should suggest that the user perform the dvc add/push commands.
Done in issue #75.