This project extracts transcriptions from YouTube videos using the YouTube Transcript API and generates concise summaries of those transcriptions with the GPT-4 API. The application includes both a terminal-based script and a Django-based web API for generating chapter titles from video transcriptions. It also sets up a data pipeline that synchronizes data between PostgreSQL and AWS DynamoDB, and deploys the application with Docker on AWS Elastic Beanstalk.
- Technologies Used
- Setup and Installation
- Terminal Application
- API Version
- Containerizing with Docker
- Data Pipeline
- Deploying to AWS Elastic Beanstalk
- Conclusion
- Python: Core programming language.
- Django: Web framework for the API version.
- YouTube Transcript API: For extracting video transcriptions.
- OpenAI GPT-4: For generating summaries from transcriptions.
- PostgreSQL: Primary database for storing video transcriptions and summaries.
- AWS DynamoDB: NoSQL database for storing processed data.
- Docker: For containerizing the application.
- AWS Elastic Beanstalk: For deploying the Docker container.
- Docker Compose: For managing multi-container applications.
- AWS Amplify: For real-time messaging, authentication, notifications, and deployment.
- Clone the Repository:

  ```bash
  git clone https://github.com/your-repo/youtube-video-summarizer.git
  cd youtube-video-summarizer
  ```

- Create and Activate a Virtual Environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set Up Environment Variables: Create a `.env` file and add your OpenAI API key, AWS credentials, and other necessary environment variables:

  ```
  OPENAI_API_KEY=your_openai_api_key
  AWS_ACCESS_KEY_ID=your_aws_access_key_id
  AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
  AWS_DEFAULT_REGION=your_aws_default_region
  ```
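One common way to read these values in Python is with the python-dotenv package; the sketch below assumes that package is installed, which is an assumption rather than something this project confirms:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load key-value pairs from the .env file into the process environment.
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
AWS_DEFAULT_REGION = os.getenv("AWS_DEFAULT_REGION")
```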
The terminal application extracts transcriptions from a YouTube video and generates chapter titles using the GPT-4 API.
- Script Explanation: `main.py` contains functions to fetch video transcriptions, group sentences, split the transcript into chapters by topic, and generate chapter titles using GPT-4 (a sketch of these steps follows below).

- Run the Script:

  ```bash
  python main.py <youtube_video_id>
  ```
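Below is a minimal sketch of the two core steps, assuming the youtube-transcript-api package (older `get_transcript`-style interface) and openai >= 1.x; the function names are illustrative and not necessarily those used in `main.py`:

```python
import sys

from openai import OpenAI  # assumes openai >= 1.x
from youtube_transcript_api import YouTubeTranscriptApi  # older get_transcript-style interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def fetch_transcript(video_id: str) -> str:
    """Fetch the transcript segments for a video and join them into plain text."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)


def generate_chapter_titles(transcript: str) -> str:
    """Ask GPT-4 to propose chapter titles for the transcript."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You create concise chapter titles for video transcripts."},
            {"role": "user", "content": f"Suggest chapter titles for this transcript:\n\n{transcript}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(generate_chapter_titles(fetch_transcript(sys.argv[1])))
```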
Transitioning from a terminal application to a web API using Django.
- API Endpoint: `/generate-titles/<video_id>/` generates and returns chapter titles for the given YouTube video ID (a sketch of the view and URL wiring follows below).

- Run the Django Server:

  ```bash
  python manage.py runserver
  ```
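Below is a minimal sketch of how such an endpoint could be wired up in Django; the module layout and helper imports are illustrative assumptions, not the project's actual code:

```python
# views.py (illustrative)
from django.http import JsonResponse

# Reuses the helper functions sketched in the terminal section; the actual module layout may differ.
from main import fetch_transcript, generate_chapter_titles


def generate_titles(request, video_id: str):
    """Return GPT-4 generated chapter titles for the given YouTube video ID as JSON."""
    transcript = fetch_transcript(video_id)
    titles = generate_chapter_titles(transcript)
    return JsonResponse({"video_id": video_id, "titles": titles})


# urls.py (illustrative)
from django.urls import path

urlpatterns = [
    path("generate-titles/<str:video_id>/", generate_titles, name="generate-titles"),
]
```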
Containerizing the application to ensure consistent environments across development, testing, and production.
- Dockerfile:

  ```dockerfile
  FROM python:3.11.4-slim-buster

  WORKDIR /app

  ENV PYTHONDONTWRITEBYTECODE 1
  ENV PYTHONUNBUFFERED 1

  COPY requirements.txt .
  RUN pip install --upgrade pip
  RUN pip install -r requirements.txt

  COPY . .
  ```

- Docker Compose:

  ```yaml
  version: '3.8'

  services:
    web:
      build: .
      command: ["sh", "-c", "python manage.py migrate && python manage.py runserver 0.0.0.0:8000"]
      volumes:
        - .:/app
      ports:
        - "8000:8000"
      env_file:
        - .env
      depends_on:
        - db

    db:
      image: postgres:15
      volumes:
        - postgres_data:/var/lib/postgresql/data/
      environment:
        - POSTGRES_USER=postgres
        - POSTGRES_PASSWORD=secret
        - POSTGRES_DB=transcribed_data

    etl:
      build: .
      command: ["sh", "-c", "sleep 10 && python etl_script.py"]
      volumes:
        - .:/app
      env_file:
        - .env
      depends_on:
        - web

  volumes:
    postgres_data:
  ```

- Build and Run Containers:

  ```bash
  docker-compose up --build
  ```
Setting up an ETL pipeline to synchronize data between PostgreSQL and AWS DynamoDB.
- ETL Script Explanation: `etl_script.py` connects to PostgreSQL, fetches the stored data, and inserts it into DynamoDB (a sketch follows below).
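Below is a minimal sketch of that flow, assuming psycopg2 and boto3 are installed, a DynamoDB table named `transcribed_data` exists, and a hypothetical `videos` table schema in PostgreSQL; all names are illustrative:

```python
import os

import boto3     # AWS SDK; reads credentials and region from the environment
import psycopg2  # PostgreSQL driver


def run_etl():
    """Copy rows from a hypothetical PostgreSQL `videos` table into DynamoDB."""
    conn = psycopg2.connect(
        host="db",  # the Compose service name for PostgreSQL
        dbname="transcribed_data",
        user="postgres",
        password=os.environ.get("POSTGRES_PASSWORD", "secret"),
    )
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("transcribed_data")  # assumed DynamoDB table name

    with conn, conn.cursor() as cur:
        cur.execute("SELECT video_id, transcript, summary FROM videos;")  # hypothetical schema
        for video_id, transcript, summary in cur.fetchall():
            table.put_item(Item={
                "video_id": video_id,
                "transcript": transcript,
                "summary": summary,
            })


if __name__ == "__main__":
    run_etl()
```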
Deploying the Dockerized application to AWS Elastic Beanstalk for scalability and ease of management.
- Initialize Elastic Beanstalk:

  ```bash
  eb init -p docker time-stamp
  ```

- Create and Deploy Environment:

  ```bash
  eb create time-stamp-env
  eb deploy
  ```
- Developed a robust application to summarize YouTube videos using advanced APIs.
- Containerized the application for consistency and ease of deployment.
- Established a reliable data pipeline between PostgreSQL and AWS DynamoDB.
- Successfully deployed the application on AWS Elastic Beanstalk.
- Ensured compatibility between different APIs and services.
- Overcame data consistency issues by implementing a robust ETL pipeline.
- Enhance the summarization algorithm for better accuracy.
- Implement user authentication and access control for the API.
- Integrate additional data sources and processing features.
Try out similar projects and explore the use of APIs and cloud services in your applications. For more details, visit the project repository and check out the documentation.