This code takes the XML RSS feed of my favorite podcast, "DarkNet Diaries", translates the data into a pandas DataFrame, stores it locally as a ".pkl" file, and stores it in Google Cloud Platform's "Firebase" NoSQL database. Then, Python loops through the Firebase stream, locally transcribing the podcasts and storing each transcription in Firebase. The ".pkl" files are used for data mining and local analysis.
Table of Contents
- My favorite podcast has more than 100 episodes and counting, all roughly an hour each. After binging it in a month, I found myself wanting to search for episodes to revisit or to solidify interesting facts. This project enables cheap storage in the cloud, transcript searchability, and statistical research and NLP projects in the future.
- I take an RSS XML feed, loop through the podcast MP3 links, transcribe each episode, and store the results locally as .pkl files and in the cloud in Firebase. A sketch of the feed-to-DataFrame step is shown below.
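For illustration, here is a minimal sketch of that first step, assuming the `feedparser` and `pandas` packages; the feed URL and column names are illustrative, not the repo's exact code.

```python
# Minimal sketch (not the repo's exact code): parse the RSS feed into a
# DataFrame and pickle it locally. Assumes feedparser and pandas are installed.
import feedparser
import pandas as pd

FEED_URL = "https://feeds.megaphone.fm/darknetdiaries"  # illustrative URL

feed = feedparser.parse(FEED_URL)

# Pull the title, publish date, and MP3 enclosure link from each entry.
rows = [
    {
        "title": entry.title,
        "published": entry.get("published"),
        "mp3_url": entry.enclosures[0].href if entry.enclosures else None,
    }
    for entry in feed.entries
]

df = pd.DataFrame(rows)
df.to_pickle("darknet_diaries.pkl")  # local .pkl copy for later analysis
```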
- Install all required packages.
pip install -r requirements.txt
- Open a Google Cloud Platform account and a Firebase account.
- Download an admin SDK JSON file to access Firebase, then replace the firebase-adminsdk.json file in your repo with it. Adjust the "cred" variable in the loadToFirebase_gitVersion.py file to match the name of your credentials file (see the sketch below).
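For reference, this is the shape of the credential setup with the `firebase_admin` package; the filename matches the repo's default, everything else is a standard Firestore client bootstrap.

```python
# Initialize the Firebase Admin SDK from the downloaded service-account file.
import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("firebase-adminsdk.json")  # your credentials file
firebase_admin.initialize_app(cred)

db = firestore.client()  # Firestore client used for reads/writes below
```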
- Check access to the RSS feed.
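A quick way to verify the feed is reachable before committing to the long run; the URL here is illustrative, so use the feed URL configured in the repo.

```python
# Sanity check: confirm the RSS feed responds before starting the long job.
import requests

FEED_URL = "https://feeds.megaphone.fm/darknetdiaries"  # illustrative URL

resp = requests.get(FEED_URL, timeout=30)
resp.raise_for_status()  # raises if the feed is unreachable
print(f"Feed OK: status {resp.status_code}, {len(resp.content)} bytes")
```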
- Run loadToFirebase_gitVersion.py in Python. This step took my computer well over 24 hours for the 100+ hours of audio in the DarkNet Diaries podcast.
python loadToFirebase_gitVersion.py
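The repo's exact transcription code isn't reproduced here; as an illustration of the loop's shape, the sketch below streams episode documents from Firestore and transcribes each MP3 locally. The choice of OpenAI's Whisper, the "episodes" collection name, and the field names are all assumptions for the example, not the script's actual internals.

```python
# Illustrative sketch only: the real script may use a different speech-to-text
# library. Assumes openai-whisper, requests, firebase_admin, and a Firestore
# collection named "episodes" whose documents carry an "mp3_url" field.
import requests
import whisper
import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("firebase-adminsdk.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

model = whisper.load_model("base")  # small local model; larger = slower, better

for doc in db.collection("episodes").stream():
    episode = doc.to_dict()
    # Download the episode audio to a local file.
    audio = requests.get(episode["mp3_url"], timeout=120)
    with open("episode.mp3", "wb") as f:
        f.write(audio.content)
    # Transcribe locally and write the text back to the same document.
    result = model.transcribe("episode.mp3")
    doc.reference.update({"transcript": result["text"]})
```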
- Check Firebase to make sure the data went through.
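One quick way to verify from Python, again assuming the illustrative "episodes" collection from the sketches above:

```python
# Sanity check: count documents and peek at one transcript field.
from firebase_admin import firestore

db = firestore.client()  # assumes firebase_admin is already initialized
docs = list(db.collection("episodes").stream())  # "episodes" is illustrative
print(f"{len(docs)} episode documents in Firestore")
print(docs[0].to_dict().get("transcript", "")[:200])
```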
- Use Jupyter Notebook and pandas to play with the pickle data!
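For example, a keyword search over the transcripts; the file and column names below match the earlier sketches and assume transcripts have been merged into the DataFrame.

```python
# Minimal sketch: load the pickled DataFrame and keyword-search transcripts.
# File and column names are illustrative, matching the sketches above.
import pandas as pd

df = pd.read_pickle("darknet_diaries.pkl")

# Find every episode whose transcript mentions a search term.
hits = df[df["transcript"].str.contains("ransomware", case=False, na=False)]
print(hits[["title", "published"]])
```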
Jared Fiacco - jaredfiacco2@gmail.com
Another GCP Project of Mine: Publish Computer Statistics to Pub/Sub, Use Cloud Functions to Store in BigQuery