The Podcast Transcriber project is designed to automate the process of transcribing podcast episodes and managing those transcripts. It allows users to submit transcription jobs, check their statuses, and handle the resulting transcripts. This tool is particularly useful for podcast creators who need to efficiently manage transcripts, synchronize them with audio, or edit content.
The project relies on two external services:
- AWS Transcribe for transcription
- ChatGPT for polishing the results
- Submit a Transcription Job: Use the `submit_transcribe_job.py` script to submit a new transcription job. The script downloads the audio from YouTube using the provided YouTube video ID, uploads it to an S3 bucket, and starts a transcription job:
  `python submit_transcribe_job.py <youtube_video_id>`
- Check Transcription Job Status: Use the `check_transcribe_jobs.py` script to check the status of ongoing transcription jobs. This lets you monitor whether your submitted jobs are complete and ready for processing:
  `python check_transcribe_jobs.py`
- Process Transcripts: Once transcripts are ready, they are stored in the `transcripts/raw` directory. These files are plain text with time-coded entries to synchronize with the podcast audio.
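The submission step above (upload the audio to S3, then start a transcription job) can be sketched roughly as follows. This is not the code from `submit_transcribe_job.py`: the function names, the `podcast-` job-name prefix, the `audio/` key layout, and the `mp3` format are assumptions made for illustration, and the YouTube download is left out. Only the `boto3` calls `upload_file` and `start_transcription_job` are real API methods.

```python
import re


def build_transcription_request(video_id, bucket, language_code="en-US"):
    """Build keyword arguments for Transcribe's start_transcription_job.

    The job name prefix, S3 key layout, and mp3 format are assumptions.
    """
    # Transcribe job names may only contain letters, digits, '.', '_' and '-'.
    safe_id = re.sub(r"[^0-9A-Za-z._-]", "_", video_id)
    return {
        "TranscriptionJobName": f"podcast-{safe_id}",
        "Media": {"MediaFileUri": f"s3://{bucket}/audio/{safe_id}.mp3"},
        "MediaFormat": "mp3",
        "LanguageCode": language_code,
    }


def submit_job(video_id, bucket, audio_path):
    """Upload a local audio file and start the job.

    Requires boto3 and valid AWS credentials in the environment.
    """
    import boto3  # imported lazily so the request builder works without it

    request = build_transcription_request(video_id, bucket)
    # Recover the object key from the s3:// URI built above.
    key = request["Media"]["MediaFileUri"].split(f"s3://{bucket}/", 1)[1]
    boto3.client("s3").upload_file(audio_path, bucket, key)
    boto3.client("transcribe").start_transcription_job(**request)
    return request["TranscriptionJobName"]
```

Keeping the request construction in a pure function makes it possible to check the job name and S3 URI without touching AWS.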
After transcription jobs are completed, the resulting transcripts are saved in the `transcripts/raw` directory. These files include timecodes that allow for easy navigation and editing. To refine and finalize these transcripts, use ChatGPT as follows:
- Editing with ChatGPT: Open each transcript file and use ChatGPT to refine and edit the text. Follow the "Prompt for Correcting" in the `prompt.md` file to:
  - Remove unnecessary filler words.
  - Correct grammatical errors and improve sentence clarity.
  - Use specific names for the host and guest.
  - Keep the structure with timestamps for synchronization.

  Refer to the Prompt for Correcting section in the `prompt.md` file for detailed instructions on how to use this prompt.
- Creating Titles and Sections: Use ChatGPT to generate a structured outline with titles for key topics discussed in the podcast. Follow the "Prompt for Titles" in the `prompt.md` file to create topic titles and provide timestamps.

  Refer to the Prompt for Titles section in the `prompt.md` file for detailed instructions on how to use this prompt.
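Because the edited text must keep its timestamps for synchronization, it can help to verify that none were lost during editing. The sketch below is one possible check, not part of the project. It assumes each entry starts with a timecode of the form `[HH:MM:SS]`; the actual format in `transcripts/raw` may differ, in which case the pattern needs adjusting.

```python
import re

# Assumed timecode format: "[HH:MM:SS]" at the start of a line.
TIMESTAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]", re.MULTILINE)


def extract_timestamps(text):
    """Return all timecodes found at the start of lines, in order."""
    return TIMESTAMP.findall(text)


def missing_timestamps(raw_text, edited_text):
    """Return timecodes present in the raw transcript but absent from the edited one."""
    edited = set(extract_timestamps(edited_text))
    return [ts for ts in extract_timestamps(raw_text) if ts not in edited]


raw = (
    "[00:00:05] Um, so, welcome to the show.\n"
    "[00:00:12] Thanks, uh, for having me.\n"
)
edited = (
    "[00:00:05] Host: Welcome to the show.\n"
    "Guest: Thanks for having me.\n"
)
print(missing_timestamps(raw, edited))  # ['00:00:12']
```

An empty list means every timecode from the raw transcript survived the edit.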
This project supports automation through GitHub Actions, allowing users to submit transcription jobs and check job statuses directly from the GitHub interface.
- Submit a Transcription Job via GitHub Actions:
  - Navigate to the "Actions" tab in your GitHub repository.
  - Select the "Submit Transcribe Job" workflow from the list of available workflows.
  - Click the "Run workflow" button.
  - Provide the required `YouTube ID` for the audio file you wish to transcribe and start the workflow.

  The workflow is defined in `.github/workflows/submit_transcribe_job.yml` and automatically submits a new transcription job using the provided YouTube ID.
- Check Transcription Job Status via GitHub Actions:
  - Navigate to the "Actions" tab in your GitHub repository.
  - Select the "Check Transcribe Jobs" workflow.
  - Click the "Run workflow" button to manually trigger a check of the transcription job statuses.

  The workflow is defined in `.github/workflows/check_transcribe_jobs.yml` and runs the `check_transcribe_jobs.py` script to update the status of ongoing jobs.
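A status check of the kind this workflow runs could look roughly like the sketch below. It is an illustration rather than the contents of `check_transcribe_jobs.py`: the function names and the example job names are hypothetical. The `get_transcription_job` call and the four status values (`QUEUED`, `IN_PROGRESS`, `COMPLETED`, `FAILED`) come from the Transcribe API.

```python
# Statuses after which a Transcribe job no longer changes.
FINAL_STATUSES = {"COMPLETED", "FAILED"}


def pending_jobs(statuses):
    """Return the names of jobs that are still QUEUED or IN_PROGRESS."""
    return sorted(name for name, status in statuses.items()
                  if status not in FINAL_STATUSES)


def fetch_statuses(job_names):
    """Query Transcribe for each job's status.

    Requires boto3 and valid AWS credentials in the environment.
    """
    import boto3  # imported lazily so pending_jobs works without it

    client = boto3.client("transcribe")
    result = {}
    for name in job_names:
        job = client.get_transcription_job(TranscriptionJobName=name)
        result[name] = job["TranscriptionJob"]["TranscriptionJobStatus"]
    return result


# Example with hypothetical job names and statuses:
statuses = {
    "podcast-abc": "COMPLETED",
    "podcast-def": "IN_PROGRESS",
    "podcast-ghi": "QUEUED",
}
print(pending_jobs(statuses))  # ['podcast-def', 'podcast-ghi']
```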
For both programmatic and GitHub Actions usage, ensure that the following environment variables are set (typically in GitHub Secrets for GitHub Actions):
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`
- `AWS_REGION`
- `S3_BUCKET` (default: `podcast-audio-storage`)

These are used to authenticate and interact with the cloud services handling the transcription jobs.
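A script can fail early with a clear message when one of these variables is missing. The following is a minimal sketch of such a check, assuming the four AWS variables are mandatory and `S3_BUCKET` falls back to the documented default. The `load_settings` name and the returned dictionary shape are illustrative and not taken from the project.

```python
import os

# Variables the README lists as required; S3_BUCKET has a default.
REQUIRED = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_REGION",
)
DEFAULT_BUCKET = "podcast-audio-storage"


def load_settings(env=None):
    """Validate the environment and return the region and bucket to use."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "region": env["AWS_REGION"],
        "bucket": env.get("S3_BUCKET") or DEFAULT_BUCKET,
    }
```

Passing a plain dictionary as `env` makes the check easy to exercise without changing the real environment.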
- `.envrc_template`: Template for environment variables. Copy it to `.envrc` and fill in the necessary configuration details, such as API keys.
- `.github/`: Contains GitHub workflows for automating tasks such as testing and deployment.
- `Pipfile` & `Pipfile.lock`: Define the Python environment and dependencies needed to run the project.
- `check_transcribe_jobs.py`: Script to check the status of transcription jobs submitted to the external service.
- `jobs/`: Directory to store job-related metadata and temporary files.
- `prompt.md`: Contains the prompts and instructions for the transcription process, including sections for correcting transcripts and generating titles.
- `submit_transcribe_job.py`: Script to submit new transcription jobs.
- `transcripts/raw/`: Directory for storing raw text transcripts with timecodes.
- Clone the repository:
  `git clone <repository_url>`
  `cd podcast-transcriber-main`
- Install dependencies using Pipenv:
  `pipenv install`
- Set up environment variables by copying `.envrc_template` to `.envrc` and filling in the required API keys and settings.
- Run transcription jobs and manage transcripts as outlined in the usage section.
Contributions are welcome! Please create a new branch for each feature or bug fix, and submit a pull request for review.