SmartList.ai

Getting Started

To get started, open a terminal, create a conda environment, and activate it:

conda create -n mycondaenv python=3.9
conda activate mycondaenv

Dependencies

The project relies on the following Python libraries:

  • numpy
  • pandas
  • scikit-learn
  • matplotlib
  • seaborn
  • sqlite3 (part of the Python standard library)
  • json (part of the Python standard library)
  • openai
  • newsapi
  • googlemaps
  • wordcloud
  • others (see requirements.txt)

Installation

Install the project's dependencies with pip: pip install -r requirements.txt
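After installing, it can help to confirm that the environment actually has the libraries listed under Dependencies. The following is a minimal sketch and not part of the repository: `missing_modules` is a hypothetical helper, and the import names are assumptions based on how these packages are usually imported (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names (not pip package names) for the libraries listed under Dependencies.
# Assumption: scikit-learn imports as "sklearn"; the others import under their listed names.
REQUIRED_MODULES = [
    "numpy", "pandas", "sklearn", "matplotlib", "seaborn",
    "sqlite3", "json", "openai", "newsapi", "googlemaps", "wordcloud",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

An empty result means every listed module can be imported; anything reported as missing can be installed with pip before opening the notebooks.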

Usage

To use the project, open the Jupyter notebooks in the notebooks directory and run them in numerical order, following the instructions in each.

Data

The data used in the project comes from NewsAPI, Google Business listings, and the BCN datasets. These are public APIs and datasets containing world news, businesses listed on Google, and neighborhood data for Barcelona.

Notebooks

  • 01-newsAPI.ipynb: This notebook connects to NewsAPI, a JSON API for locating the most popular articles in the world. In this project it is used to find the top AI articles.
  • 02-OpenAI_Processing.ipynb: This notebook processes the data from (01), using the OpenAI API to summarize each article and create a tagline for it.
  • 03-googleAPI.ipynb: This notebook connects to the Google Maps API and collects a list of businesses around a given latitude and longitude (with a maximum radius of 50,000 meters).
  • 04-bcnEDA.ipynb: This notebook explores the Barcelona neighborhood datasets.
  • 05-merge.ipynb: This notebook combines the dataset from (04) with the Google dataset from (03) so that each Google business is matched with its neighborhood name.
  • 06-bcn_preprocessed.ipynb: This notebook cleans the result in (05) to prepare for the KMeans clustering algorithm.
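The requests behind notebooks (01) and (03) can be sketched as follows. This is an illustration under stated assumptions, not the notebooks' actual code: the notebooks may use the `newsapi` and `googlemaps` client libraries instead of raw URLs, and `build_news_url`, `build_places_url`, the default query string, and the 30-day window are all hypothetical. The sketch only builds the request URLs and makes no network calls.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

NEWSAPI_EVERYTHING_URL = "https://newsapi.org/v2/everything"
PLACES_NEARBY_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
MAX_RADIUS_METERS = 50_000  # upper limit mentioned for notebook (03)


def build_news_url(api_key, today, query="artificial intelligence", days_back=30):
    """Build a NewsAPI URL for articles from the past `days_back` days, sorted by popularity."""
    params = {
        "q": query,
        "from": (today - timedelta(days=days_back)).isoformat(),
        "to": today.isoformat(),
        "sortBy": "popularity",
        "apiKey": api_key,
    }
    return f"{NEWSAPI_EVERYTHING_URL}?{urlencode(params)}"


def build_places_url(api_key, latitude, longitude, radius_meters):
    """Build a Nearby Search URL, clamping the radius to the 50,000 m maximum."""
    params = {
        "location": f"{latitude},{longitude}",
        "radius": min(int(radius_meters), MAX_RADIUS_METERS),
        "key": api_key,
    }
    return f"{PLACES_NEARBY_URL}?{urlencode(params)}"
```

For example, `build_places_url("YOUR_KEY", 41.3874, 2.1686, 80_000)` uses approximate coordinates for central Barcelona and produces a URL whose radius is capped at 50000.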

Scripts

  • '01-AI_News.py': This script combines notebooks (01) and (02) into a single pipeline that collects the most popular AI articles in the world.
  • '02-LeadList_BCN.py': This script uses notebook (03) to collect a lead list of businesses in the BCN area.

Databases

  • 'DSFinal.db': Located in databases/ and deliverables/, this database stores the collected data along with timestamps for the news articles and Google Maps API results. The timestamps are useful for future work on forecasting business growth rates and analyzing top articles over time.
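To illustrate how timestamped records support that kind of analysis over time, here is a minimal sketch using `sqlite3`. The table name `api_pulls`, its columns, and the helper functions are hypothetical and do not reflect the actual schema of 'DSFinal.db'.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical table of timestamped API pulls."""
    conn = sqlite3.connect(path)
    # source: e.g. 'newsapi' or 'googlemaps'; item_name: article title or business name;
    # pulled_at: ISO 8601 timestamp stored as text.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS api_pulls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            item_name TEXT NOT NULL,
            pulled_at TEXT NOT NULL
        )
        """
    )
    return conn


def record_pull(conn, source, item_name, pulled_at=None):
    """Insert one row, stamping it with the current UTC time unless a time is given."""
    if pulled_at is None:
        pulled_at = datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO api_pulls (source, item_name, pulled_at) VALUES (?, ?, ?)",
        (source, item_name, pulled_at.isoformat()),
    )
    conn.commit()


def pulls_per_day(conn, source):
    """Count rows per calendar day for one source, oldest day first."""
    rows = conn.execute(
        "SELECT substr(pulled_at, 1, 10) AS day, COUNT(*) "
        "FROM api_pulls WHERE source = ? GROUP BY day ORDER BY day",
        (source,),
    )
    return rows.fetchall()
```

Storing the timestamp as ISO 8601 text keeps rows sortable and makes per-day grouping a simple prefix operation.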

Models

The model used in this project is a KMeans clustering model (an unsupervised learning method) with the following hyperparameters:

  • n_clusters = 3
  • random_state = 42
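A minimal sketch of fitting a model with these hyperparameters in scikit-learn is shown below. Only `n_clusters` and `random_state` come from this project. The feature columns (rating and review count), the synthetic data, the `StandardScaler` step, the explicit `n_init=10`, and the `score_leads` helper are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def score_leads(features):
    """Assign each row of `features` to one of three KMeans clusters."""
    # Assumption: features are standardized first so no single column dominates distances.
    scaled = StandardScaler().fit_transform(features)
    # n_init is set explicitly because its default differs across scikit-learn versions.
    model = KMeans(n_clusters=3, random_state=42, n_init=10)
    return model.fit_predict(scaled)


# Synthetic stand-in data: three loose groups of (rating, review count) pairs.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=(3.0, 20.0), scale=(0.1, 2.0), size=(20, 2)),
    rng.normal(loc=(4.0, 200.0), scale=(0.1, 2.0), size=(20, 2)),
    rng.normal(loc=(4.8, 900.0), scale=(0.1, 2.0), size=(20, 2)),
])
labels = score_leads(features)
```

The cluster labels are arbitrary integers (0, 1, 2). Turning them into a lead score requires inspecting each cluster's characteristics and deciding which one represents the most promising businesses.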

Deliverables

  • 'lead_list.csv': A lead list of businesses generated by notebook (03).
  • 'lead_list_scored.csv': The lead list of businesses generated by notebook (03), scored by the KMeans model.
  • 'news_article_summaries.csv': A list of the most popular articles on artificial intelligence, covering the period from {last_month} to {today}.
  • 'OpenAITop{N}news_article_summaries{today}.csv': This file includes the top 10 most popular articles on artificial intelligence, summarized and tagged by OpenAI.
  • 'OpenAITop{N}news_article_summaries{today}.pdf': This file is a 'pdf website' that showcases the top 10 most popular articles on artificial intelligence, summarized and tagged by OpenAI.
  • 'DSF_Final_Presentation.pptx': This presentation provides a high-level overview of the steps taken to create this project.
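The {N} and {today} placeholders in the file names above can be filled in along these lines. This is a sketch only: `deliverable_names` is a hypothetical helper, and the ISO date format (YYYY-MM-DD) is an assumption, since the date format the scripts actually use is not specified here.

```python
from datetime import date


def deliverable_names(top_n, today):
    """Return the CSV and PDF file names for a given article count and run date."""
    # Assumption: {today} is rendered as an ISO date such as 2023-05-01.
    stem = f"OpenAITop{top_n}news_article_summaries{today.isoformat()}"
    return f"{stem}.csv", f"{stem}.pdf"
```

Calling `deliverable_names(10, date.today())` returns the pair of names for a top-10 run on the current day.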
