
Tsunami | Auto Scraping / Cleaning / LLM Analysis

Tsunami lets you automate scraping data from numerous sources and then feed it into large language models for analysis.

Tsunami:

  1. Reads your instructions from the project config JSON: data sources, models to use, prompts, etc. (a sketch follows this list)
  2. Downloads data from sources
  3. Cleans the data, formatting documents into readable versions without extra tokens
  4. Sends each doc/file to be analyzed by an LLM (with your specified prompt)
  5. Has a model merge the analyses, n responses at a time
  6. Repeats step 5 until fewer than m responses remain, then merges those final m responses into one final analysis (see the reduction sketch after this list)

A workspace is created in ./workspace/{project_name} containing all doc/data downloads, each response, and the final analysis. Cost data is output after each response completes, including the cumulative cost.
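
To make step 1 concrete, here is a sketch of what a project config might look like. Every field name below is a hypothetical illustration, not the actual schema; see the DOCS-EXAMPLES folder for real configs.

    {
      "project_name": "example_project",
      "models": ["claude-3-haiku"],
      "prompt": "Summarize the key claims in this document.",
      "data_sources": {
        "youtube": ["https://www.youtube.com/watch?v=VIDEO_ID"],
        "arxiv_queries": ["llm data pipelines"],
        "github_repos": ["https://github.com/dnbt777/Tsunami"]
      },
      "merge_batch_size": 4,
      "final_merge_threshold": 8
    }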
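
Steps 5 and 6 amount to a tree-style reduction over the responses. Below is a minimal Python sketch of that control flow, assuming a hypothetical merge_with_llm() helper and n, m >= 2; this illustrates the idea, not Tsunami's actual code.

    def reduce_responses(responses, n, m, merge_with_llm):
        # Merge n responses at a time until fewer than m remain.
        # Assumes n >= 2 and m >= 2 so the loop always terminates.
        while len(responses) >= m:
            responses = [
                merge_with_llm(responses[i:i + n])
                for i in range(0, len(responses), n)
            ]
        # Merge the final (fewer than m) responses into one analysis.
        return merge_with_llm(responses)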

⚠️⚠️WARNING⚠️⚠️

⚠️⚠️⚠️ AUTOMATED ANALYSIS OF LARGE AMOUNTS OF DATA CAN BE EXTREMELY EXPENSIVE ⚠️⚠️⚠️

Make sure you know what you are doing and use cheaper models, such as Haiku, until you are familiar with the program.

Terms/Conditions

A condition of using this program is that you take responsibility for all costs incurred through any and all API usage. Do not use the program if you do not accept these terms.

Quick start

  1. git clone https://github.com/dnbt777/Tsunami
  2. Run pip install -r requirements.txt
  3. Create a file called ".env" in the format below and fill it in with your keys/region/username (see the note after this list):

     AWS_ACCESS_KEY_ID=
     AWS_SECRET_ACCESS_KEY=
     AWS_REGION=
     AWS_USERNAME=

  4. Run the example script with python ./example_project.py -download -analyze
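
For reference, a .env like the one above is usually loaded with python-dotenv, after which boto3 reads the AWS credentials from the environment. A minimal sketch of that common pattern follows; it is an assumption about standard practice, not necessarily Tsunami's exact loading code.

    import os
    import boto3
    from dotenv import load_dotenv

    load_dotenv()  # copies the .env entries into os.environ

    # boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
    # environment automatically; the region is passed explicitly here.
    bedrock = boto3.client("bedrock-runtime", region_name=os.environ["AWS_REGION"])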

Currently supported

Models:

  • Claude (AWS bedrock)

Request a model via DM or by opening an Issue.

Data sources:

  • YouTube
    • Individual video links
    • Playlist links
  • arXiv semantic search queries
  • PubMed semantic search queries
  • GitHub
    • Repo links
    • Repo search queries

Usage - Documentation/Examples

See guides in the DOCS-EXAMPLES folder

Support

Submit an issue, DM me on Twitter (https://twitter.com/dnbt777), or DM me on GitHub.

TODO

  • Documentation
  • RAG
  • Add more models
  • Save logs
  • Add more data sources