This demo project shows how to use LangChain to query a large data source through PySpark, with a Streamlit frontend. It analyzes an eCommerce behavior dataset sourced from Kaggle.
- The dataset used for this project is the eCommerce Behavior Data from a Multi-Category Store, available on Kaggle. Download the dataset and place the CSV file in the project's data directory.
- Python 3.8 or higher
- Conda (for creating the environment)
- Git (for cloning the repository)
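The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes that `conda` and `git` are expected to be reachable on `PATH`.

```python
import shutil
import sys

# Minimum Python version taken from the prerequisites list.
MIN_PYTHON = (3, 8)


def check_prerequisites(tools=("conda", "git"), version_info=None):
    """Return a list of problems; an empty list means every check passed.

    `check_prerequisites` is a hypothetical helper, not part of the project.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    problems = []
    if major_minor < MIN_PYTHON:
        problems.append(
            "Python %d.%d found, 3.8 or higher required" % major_minor
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

Running the script prints one line per missing prerequisite and prints nothing when all checks pass.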
To execute the project, please follow the instructions below:
- Clone the project repository from GitHub:
  git clone https://github.com/rohan-mudaliar/langchian-demo.git
- Change into the project directory:
  cd langchian-demo
- Create a new Conda environment with Python 3.8:
  conda create --name langchain-demo python=3.8
- Activate the newly created Conda environment:
  conda activate langchain-demo
- Install the project dependencies using requirements.txt:
  pip install -r requirements.txt
- Start the Streamlit application by executing the following command:
  streamlit run app.py
- Once the application is running, open your web browser and navigate to the provided URL (usually http://localhost:8501).
- Use the available prompts to interact with the application and query the eCommerce behavior dataset. For example:
  - To count the number of records in the electronics database, enter the following prompt:
    count number of records in electronics database
  - To retrieve the order ID in the electronics database that has the highest price, use the following prompt:
    give me order ID in electronics database that has the highest price
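To sanity-check the answers to the two example prompts, the same questions can be computed directly on a small extract of the data. The sketch below uses only the standard library and is not how the project answers prompts (that goes through LangChain and PySpark). The column names `order_id`, `category_code`, and `price` are assumptions, and the sample rows are made up for illustration.

```python
import csv
import io


def load_rows(handle):
    """Read CSV rows into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def count_records(rows):
    """Number of data rows (the header is not counted)."""
    return len(rows)


def row_with_highest_price(rows, price_column="price"):
    """Return the row whose price column is numerically largest."""
    return max(rows, key=lambda row: float(row[price_column]))


# Made-up sample for illustration only; not taken from the Kaggle dataset.
# The column names are assumptions and may differ from the real CSV header.
SAMPLE = """order_id,category_code,price
1001,electronics.smartphone,199.99
1002,electronics.audio,49.50
1003,electronics.tablet,329.00
"""

rows = load_rows(io.StringIO(SAMPLE))
print(count_records(rows))                       # 3
print(row_with_highest_price(rows)["order_id"])  # 1003
```

This approach loads every row into memory, so it is only suitable for small extracts. Rows with an empty or non-numeric price would raise a `ValueError` and need filtering first.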
The project repository contains the following files and directories:
- app.py: The main entry point of the application, responsible for launching the Streamlit frontend and handling user prompts.
- langchain.py: The Langchain module, which interacts with the eCommerce behavior dataset using PySpark.
- data: A directory where you should place the downloaded eCommerce behavior dataset CSV file (not included in the repository due to size constraints).
- requirements.txt: A file specifying the Python dependencies required for running the project.
- README.md: This README document, providing instructions and information about the project.
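Since the dataset is not shipped with the repository, it can help to confirm the layout before launching the app. The following is a minimal sketch under the assumption that the entries listed above sit at the top level of the project directory; the helper names `missing_entries` and `has_dataset_csv` are hypothetical and not part of the project.

```python
from pathlib import Path

# Entries named in the file listing above.
EXPECTED_ENTRIES = ("app.py", "langchain.py", "data", "requirements.txt", "README.md")


def missing_entries(project_root="."):
    """Return the expected files or directories that are absent from project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def has_dataset_csv(project_root="."):
    """True when the data directory exists and holds at least one .csv file."""
    data_dir = Path(project_root) / "data"
    return data_dir.is_dir() and any(data_dir.glob("*.csv"))


if __name__ == "__main__":
    for name in missing_entries():
        print("Missing:", name)
    if not has_dataset_csv():
        print("No CSV file found in data; download the dataset from Kaggle first.")
```

Run it from the project directory. It prints nothing when every listed entry is present and a CSV file is in place.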
Contributions to this project are welcome! If you find any issues or would like to propose enhancements, please submit a pull request or open an issue on the GitHub repository.