/FirebasePodcastTranscription

Use Python to: Get RSS feed XML, Download Podcast MP3, Create Transcript, Store data to GCP Firebase

Store Podcast Transcriptions in GCP Firebase

This code takes the RSS XML feed of my favorite podcast, "DarkNet Diaries", converts the data into a pandas DataFrame, stores it locally as a ".pkl" file, and stores it in Google Cloud Platform's "Firebase" NoSQL database. Python then loops through the Firebase stream, transcribing each podcast locally and storing the transcription in Firebase. The ".pkl" files are used for data mining.

Table of Contents
  1. About The Project
  2. Prerequisites/Instructions
  3. Contact

About The Project

  • My favorite podcast has more than 100 episodes and counting, each about an hour long. After binging it in a month, I found myself wanting to search for episodes to revisit or to solidify interesting facts. This project enables cheap cloud storage and transcript searchability, and it lays the groundwork for future statistical research and NLP projects.

  • I take an RSS XML feed, loop through the podcast MP3 links, transcribe each episode, and store the results locally as .pkl files and in the cloud in Firebase.
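
The first step described above, turning the RSS XML into one record per episode, can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the sample feed, the `parse_feed` helper, and the field names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-episode feed, standing in for the real RSS XML.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <pubDate>Mon, 01 Jan 2024 08:00:00 +0000</pubDate>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2</title>
      <pubDate>Mon, 15 Jan 2024 08:00:00 +0000</pubDate>
      <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, date, and MP3 link."""
    root = ET.fromstring(xml_text)
    episodes = []
    for item in root.iter("item"):
        # In RSS 2.0 the audio file is carried by the <enclosure> element.
        enclosure = item.find("enclosure")
        episodes.append({
            "title": item.findtext("title", default=""),
            "published": item.findtext("pubDate", default=""),
            "mp3_url": enclosure.get("url") if enclosure is not None else None,
        })
    return episodes


episodes = parse_feed(SAMPLE_FEED)
for ep in episodes:
    print(ep["title"], ep["mp3_url"])
```

A list of dicts like this can be passed directly to `pandas.DataFrame(episodes)` if a DataFrame is wanted.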

Process Map

Built With

Prerequisites

  1. Install all required packages.
pip install -r requirements.txt
  2. Open a Google Cloud Platform account and a Firebase account.

  3. Download an Admin SDK JSON key file to access Firebase, and use it to replace the firebase-adminsdk.json file in your repo. Adjust the "cred" variable in loadToFirebase_gitVersion.py to match the name of your credentials file.

getFirebaseKey
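
Before starting a long run, it may help to confirm that the downloaded key file looks like a service-account key. The sketch below is only a sanity check under stated assumptions: `missing_credential_fields` is a hypothetical helper, the list of required fields is a minimal assumed set, and the demo writes a deliberately incomplete fake key to a temporary folder.

```python
import json
import os
import tempfile

# Fields assumed to be present in a service-account key file (minimal set).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_credential_fields(path):
    """Return the required fields that are absent from the JSON file at `path`."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return [field for field in REQUIRED_FIELDS if field not in data]


# Demo with an incomplete, fake key file (no real credentials involved).
fake_key = {"type": "service_account", "project_id": "demo-project"}
with tempfile.TemporaryDirectory() as tmp:
    key_path = os.path.join(tmp, "firebase-adminsdk.json")
    with open(key_path, "w", encoding="utf-8") as fh:
        json.dump(fake_key, fh)
    missing = missing_credential_fields(key_path)

print("Missing fields:", missing)
```

An empty list means the file has the expected shape; it does not prove that the key is valid or has access to your project.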

  4. Check access to the RSS feed.

xml
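
One way to check feed access from Python, rather than in a browser, is to download the XML and count its episodes. This is a sketch only: `fetch_feed`, `count_episodes`, and `FEED_URL` are hypothetical names, and the demo runs offline on a tiny made-up feed.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=30):
    """Download the raw feed bytes; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def count_episodes(feed_bytes):
    """Parse the feed and return the number of <item> elements.

    Raises ValueError if the root element is not <rss>, and
    xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    root = ET.fromstring(feed_bytes)
    if root.tag != "rss":
        raise ValueError(f"expected <rss> root, got <{root.tag}>")
    return len(root.findall("./channel/item"))


# Offline demonstration with a tiny hypothetical feed.
sample = b"<rss version='2.0'><channel><item/><item/><item/></channel></rss>"
n_items = count_episodes(sample)
print(n_items, "episodes found")

# Against a live feed (FEED_URL is a placeholder for your feed's address):
# n_items = count_episodes(fetch_feed(FEED_URL))
```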

  5. Run loadToFirebase_gitVersion.py in Python. This step took my computer well over 24 hours for the 100+ hours of audio in the DarkNet Diaries podcast.
python loadToFirebase_gitVersion.py
  6. Check Firebase to make sure the data went through.

firebase

  7. Use Jupyter Notebook and pandas to play with the pickle data!

pandas
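
As a starting point for exploring the pickled data, here is a minimal keyword search over transcripts. To keep it runnable without extra packages, it pickles a plain list of dicts instead of a DataFrame; the records, column names, and the `search_transcripts` helper are assumptions for illustration and will differ from the project's actual .pkl files.

```python
import os
import pickle
import tempfile

# Hypothetical records; the real .pkl files hold the project's own columns.
records = [
    {"title": "Episode 1", "transcript": "A story about a bank heist."},
    {"title": "Episode 2", "transcript": "How a botnet was taken down."},
    {"title": "Episode 3", "transcript": "Inside a ransomware negotiation with a bank."},
]


def search_transcripts(rows, keyword):
    """Return titles whose transcript contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [row["title"] for row in rows if needle in row["transcript"].lower()]


# Round-trip through a pickle file to mimic loading stored data.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "episodes.pkl")
    with open(pkl_path, "wb") as fh:
        pickle.dump(records, fh)
    with open(pkl_path, "rb") as fh:
        loaded = pickle.load(fh)

matches = search_transcripts(loaded, "bank")
print(matches)
```

With pandas installed, a pickled DataFrame can be loaded with `pandas.read_pickle` and filtered in a similar way using string methods on the transcript column.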

Contact

Jared Fiacco - jaredfiacco2@gmail.com

Another GCP Project of Mine: Publish Computer Statistics to Pub/Sub, Use Cloud Functions to Store in BigQuery