Langchain Demo Project

This demo project showcases how to use Langchain to interact with a large data source using PySpark with a Streamlit frontend. The project focuses on analyzing an eCommerce behavior dataset sourced from Kaggle.

Dataset

The dataset used for this project is the eCommerce Behavior Data from a Multi-Category Store available on Kaggle. Please download the dataset and place it in the project directory.

Prerequisites

Python 3.8 or higher
Conda (for creating the environment)
Git (for cloning the repository)

Getting Started

To execute the project, please follow the instructions below:

Clone the project repository from GitHub:

git clone https://github.com/rohan-mudaliar/langchian-demo.git

Change into the project directory:
```
cd langchian-demo
```

Create a new Conda environment with Python 3.8:

conda create --name langchain-demo python=3.8

Activate the newly created Conda environment:
```
conda activate langchain-demo
```
Install the project dependencies using requirements.txt:
```
pip install -r requirements.txt
```

Running the Application

Start the Streamlit application by executing the following command:

streamlit run app.py
Once the application is running, open your web browser and navigate to the provided URL (usually http://localhost:8501).
Use the available prompts to interact with the application and query the eCommerce behavior dataset. For example:

To count the number of records in the electronics database, enter the following prompt:

count number of records in electronics database
To retrieve the order ID in the electronics database that has the highest price, use the following prompt:

give me order ID in electronics database that has the highest price

Project Structure

The project repository contains the following files and directories:

app.py: The main entry point of the application, responsible for launching the Streamlit frontend and handling user prompts.
langchain.py: The Langchain module, which interacts with the eCommerce behavior dataset using PySpark.
data: A directory where you should place the downloaded eCommerce behavior dataset CSV file (not included in the repository due to size constraints).
requirements.txt: A file specifying the Python dependencies required for running the project.
README.md: This README document, providing instructions and information about the project.

Contributing

Contributions to this project are welcome! If you find any issues or would like to propose enhancements, please submit a pull request or open an issue on the GitHub repository.