This project automates the listing of B2B software products, ensuring that new software is promptly and efficiently added to our database. By leveraging advanced web scraping techniques, real-time data streaming, and automated workflows, this system maximizes the visibility and accessibility of new software products.
- Fast and Efficient Listings: Automate the detection and listing of new software products to ensure real-time updates.
- Global Reach: Capture and list software launches worldwide, especially from underrepresented regions.
- Technological Innovation: Utilize modern technologies including web scraping, real-time data streams, and cloud-native services to maintain an efficient workflow.
- Description: These are the primary sources where detailed and technical data about software products can be found. Key sources include software directories, official product pages, and industry-specific news portals.
- Scraping Techniques: Utilize BeautifulSoup to parse HTML from static pages and Selenium to interact with JavaScript-driven dynamic pages, extracting critical data about software releases and updates (see the sketch after this list).
- Websites: ProductHunt, Slashdot, Betalist, and many other tech news sites regularly post about new software products.
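As a rough illustration of the static-page case, the sketch below fetches a listing page with requests and parses it with BeautifulSoup; the URL and CSS selectors are placeholders rather than the selectors used by this project's scrapers, and JavaScript-heavy pages would need Selenium instead.

```python
# Hypothetical static-page scrape; the URL and selectors are placeholders,
# not the ones used by the scrapers in this repo.
import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://example.com/new-software"  # placeholder source

response = requests.get(LISTING_URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for card in soup.select(".product-card"):  # assumed CSS class
    name = card.select_one("h2")
    link = card.select_one("a")
    if name and link:
        print(name.get_text(strip=True), link.get("href"))
```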
- Web Scraping: BeautifulSoup, Selenium
- Data Streaming: Apache Kafka, Spark
- Data Storage and Management: MongoDB, Docker, Kubernetes
- APIs and Advanced Processing: Large Language Models (LLMs)
Extracted data is streamed in real time into Kafka topics designed to segment the data efficiently (a rough producer sketch follows the list):
- `software` for direct product data
- `x-llm` for processed textual data needing further extraction
- `news` for updates from news sources about software products
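For instance, a scraper might publish to these topics roughly as follows. This sketch uses the kafka-python client and a locally reachable broker; both the client library and the message fields are assumptions, not the project's exact schema.

```python
# Hypothetical producer sketch using kafka-python; broker address and
# message fields are assumptions, not the project's actual schema.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker from docker-compose
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A scraped product goes to the `software` topic; raw article text that still
# needs LLM extraction would go to `x-llm` instead.
producer.send("software", {"name": "Example CRM", "url": "https://example.com", "source": "producthunt"})
producer.flush()
```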
Kafka consumers process data on the fly. If new products are detected, they are added to MongoDB.
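The spark-submit command further down runs consumer_software.py; the snippet below is not that file, just a minimal sketch of a Spark Structured Streaming consumer that reads the `software` topic and upserts documents into MongoDB, with the message schema, database, and collection names assumed.

```python
# Minimal sketch of a Structured Streaming consumer; schema, database and
# collection names are assumptions, not the real consumer_software.py.
import os

from pymongo import MongoClient
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("software-consumer").getOrCreate()

# Assumed message schema for records on the `software` topic.
schema = StructType([
    StructField("name", StringType()),
    StructField("description", StringType()),
    StructField("url", StringType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "software")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("product"))
    .select("product.*")
)

def upsert_batch(batch_df, _batch_id):
    # Upsert each micro-batch so re-scraped products are not duplicated.
    client = MongoClient(os.environ["MONGO_CONN_STRING"])
    collection = client["products"]["software"]  # assumed database/collection
    for row in batch_df.collect():
        collection.update_one({"url": row["url"]}, {"$set": row.asDict()}, upsert=True)
    client.close()

stream.writeStream.foreachBatch(upsert_batch).start().awaitTermination()
```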
LLMs analyze textual data from news and social media to extract and verify new product details.
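As one possible shape of this step, the sketch below calls the Gemini API via the google-generativeai client (using the GOOGLE_API_KEY listed in the environment variables below); the model name, prompt, and output handling are illustrative assumptions.

```python
# Hypothetical LLM extraction step using google-generativeai; the prompt,
# model name, and expected output format are illustrative assumptions.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

article_text = "Acme Inc. today announced AcmeBoard, a new kanban tool for remote teams."
prompt = (
    "Extract any newly launched software products from the text below. "
    "Return the product name and a one-line description, or 'none' if there is no launch.\n\n"
    + article_text
)

response = model.generate_content(prompt)
print(response.text)
```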
Run this command from the root directory of the project:
# start zookeeper and kafka
docker-compose up -d
To shut down Kafka and ZooKeeper:
# stop zookeeper and kafka
docker-compose down
# pull spark image
docker pull apache/spark
# open a shell in the running Spark container
docker exec -it spark /bin/bash
pip install python-dotenv pymongo pydantic_core pyspark
spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 \
--master spark://localhost:7077 \
/opt/application/consumer_software.py
# Build the scraper image
docker build -t scrape-products .
# Run the image
docker run --network="host" scrape-products
# Build the software consumer
docker build -t software-consumer .
# Run the image
docker run --network="host" software-consumer
# Build the Twitter consumer
docker build -t twitter-consumer .
# Run the image
docker run --network="host" twitter-consumer
# Build the news consumer
docker build -t news-consumer .
# Run the image
docker run --network="host" news-consumer
MONGO_CONN_STRING=
TWITTER_USER_NAME=
TWITTER_PASSWORD=
# Gemini API key
GOOGLE_API_KEY=
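These values are typically read from a .env file via python-dotenv (included in the pip install list above); a minimal loading sketch, assuming the .env file sits in the working directory of each service:

```python
# Minimal sketch of loading the variables above with python-dotenv;
# where the .env file lives relative to each service is an assumption.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory by default

MONGO_CONN_STRING = os.getenv("MONGO_CONN_STRING")
TWITTER_USER_NAME = os.getenv("TWITTER_USER_NAME")
TWITTER_PASSWORD = os.getenv("TWITTER_PASSWORD")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
```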
- https://hub.docker.com/r/bitnami/spark/
- https://hub.docker.com/r/confluentinc/cp-kafka
- https://hub.docker.com/layers/confluentinc/cp-zookeeper/