
𝕏-Insight💡: Data Scraping, Analysis, Image Captioning, and More

Chinese guide (中文指引)

This project enables you to fetch liked tweets from Twitter (using Selenium), save them to JSON and Excel files, and perform initial data analysis and image captioning.

This is part of the initial steps for a larger personal project involving Large Language Models (LLMs). Stay tuned for more updates!

Example of exported Excel sheets & visualizations: (see the sample images)

Demo video: (see the demo)

Prerequisites

Before running the code, ensure you have the following:

  • Required Python libraries (listed in requirements.txt)
  • Your Twitter auth token (not an API key); it authenticates the Selenium browser session (see the sketch after this list)
    • Quick text instructions:
      • Go to Twitter in a browser where you're already logged in
      • Press F12 (open dev tools) -> Application -> Cookies -> twitter.com -> auth_token
    • Or follow the video demo in the FAQs section.
  • Gemini API key (optional; only needed if you want to try the image captioning feature)
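For context, here is a minimal sketch of how an auth_token cookie can authenticate a Selenium session. The project's scraper handles this internally; the snippet is for illustration only:

    # Minimal sketch: authenticate a Selenium session with the auth_token cookie.
    # The project's scraper does this for you; shown here only to illustrate the idea.
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://twitter.com")  # the domain must be loaded before its cookie can be set
    driver.add_cookie({
        "name": "auth_token",          # the cookie value copied from your browser's dev tools
        "value": "YOUR_AUTH_TOKEN",
        "domain": ".twitter.com",
    })
    driver.get("https://twitter.com/home")  # pages now load as the logged-in account
    driver.quit()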

Setup

  1. Clone the repository (recommended) or download the project files:

    git clone https://github.com/jerlinn/X-Insight

  2. Install the requirements:

    pip install -r requirements.txt

  3. Open the config.py file and replace the placeholders with your actual values (a minimal example follows):
  • Set TWITTER_AUTH_TOKEN to your Twitter auth token (the auth_token cookie value, not an API key).
  • Set GEMINI_API_KEY to your Gemini API key. 🔑 Get your own key here.
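A minimal config.py could look like this (placeholder values only; the variable names are the two listed above):

    # config.py - replace the placeholders with your own credentials
    TWITTER_AUTH_TOKEN = "paste_your_auth_token_cookie_value"  # the auth_token cookie, not an API key
    GEMINI_API_KEY = "paste_your_gemini_api_key"               # optional, only needed for image captioning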

Data Ingestion

To fetch data from Twitter and save it to JSON and Excel files, follow these steps:

  1. Open twitter_data_ingestion.py.
  2. Modify the fetch_tweets function call at the bottom of the script with your desired parameters (see the sketch after these steps):
  • Set the URL of the Twitter page you want to fetch data from (e.g., https://twitter.com/ilyasut/likes).
  • Specify the start and end dates for the data range (in YYYY-MM-DD format).
  3. Run the script with the following command (running it directly from your IDE is recommended):

    python twitter_data_ingestion.py

  4. The script will fetch the data from Twitter, save it to a JSON file, and then export it to an Excel file.
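The real fetch_tweets signature lives at the bottom of twitter_data_ingestion.py; the call below is only a hypothetical sketch of the values described in step 2, and the parameter names are placeholders rather than the actual ones:

    # Hypothetical sketch - check twitter_data_ingestion.py for the actual signature.
    fetch_tweets(
        "https://twitter.com/ilyasut/likes",  # Twitter page to scrape
        start_date="2024-01-01",              # YYYY-MM-DD
        end_date="2024-03-01",                # YYYY-MM-DD
    )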

Data Analysis

To perform initial data analysis on the fetched data, follow these steps:

  1. Open the twitter_data_initial_exploration.ipynb notebook in Jupyter Notebook or JupyterLab.
  2. Run the notebook cells sequentially to load the data from the JSON file and perform various data analysis tasks.

Some sample results:

  • Visualizing likes by media type over time
  • Creating a calendar heatmap of liked tweets per day
  3. The notebook also demonstrates how to use the Gemini API and the Replicate API (with LLaVA v1.6) to generate captions for tweet images, using the tweet metadata.
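If you prefer to explore the data outside the notebook, the exported JSON can be loaded into a pandas DataFrame. A small sketch, assuming the export is a flat list of tweet records (the sample file below ships with the repo):

    import json

    import pandas as pd

    # Load the exported tweets; assumes the file is a flat list of tweet objects.
    with open("sample_output_json.json", "r", encoding="utf-8") as f:
        tweets = json.load(f)

    df = pd.DataFrame(tweets)
    print(df.columns.tolist())  # which fields the export contains
    print(df.head())            # quick sanity check of the first rows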

Sample Output

The project includes sample output files for reference:

  • sample_output_json.json: A sample JSON file containing the fetched Twitter data.
  • sample_exported_excel.xlsx: A sample Excel file exported from the JSON data.
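For reference, recreating the Excel export from the sample JSON takes only a couple of pandas calls. This is a conceptual sketch rather than how the ingestion script necessarily implements it (writing .xlsx requires openpyxl):

    import pandas as pd

    # Read the sample JSON (assumed to be a flat list of records) and write it as an Excel workbook.
    df = pd.read_json("sample_output_json.json")
    df.to_excel("my_exported_tweets.xlsx", index=False)  # output file name is just an example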

Feel free to explore and modify the code to suit your specific data analysis requirements.

FAQs:

  • Will I get banned? Could this affect my account?

    • Selenium is one of the safest scraping methods out there, but it's still best to be cautious when using it for personal projects.
    • I've been using it for quite a while without any issues.
    • (Though, if you've got a spare / alt account, I'd recommend using that one's auth token instead)
  • How do I find the auth token?

    • Check out this video for a step-by-step guide!

Credits


  • Initial structure and parts of the Selenium code inspired by Twitter-Scrapper.
  • The image captioning feature is powered by the OpenAI API. You should be able to achieve similar results using Gemini 1.0.

For any questions or issues, please open an issue in the repository.