Twitter scraper selenium

A Python package to scrape Twitter's front-end easily with Selenium.

Table of Contents
  1. Getting Started
  2. Usage
  3. Privacy
  4. License


Prerequisites

  • Internet Connection
  • Python 3.6+
  • Chrome or Firefox browser installed on your machine

Installation

    Installing from the source

    Download the source code or clone it with:

    git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
    

    Open a terminal inside the downloaded folder and run:


     python3 setup.py install
    

    Installing with PyPI

    pip3 install twitter-scraper-selenium
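
After installing, it can help to confirm that the package is importable before writing a scraper. The following is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and the import name `twitter_scraper_selenium` is taken from the usage examples in this README.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # "twitter_scraper_selenium" is the import name used in this README's examples.
    if is_installed("twitter_scraper_selenium"):
        print("twitter_scraper_selenium is available")
    else:
        print("Not installed; run: pip3 install twitter-scraper-selenium")
```

The check only looks the module up and does not import it, so it does not start a browser or make any network request.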
    

    Usage

    Available Functions in This Package - Summary

    | Function Name | Function Description | Scraping Method | Scraping Speed |
    | --- | --- | --- | --- |
    | scrape_profile() | Scrapes a Twitter user's profile tweets. | Browser Automation | Slow |
    | scrape_keyword() | Scrapes tweets matching the provided keyword. | Browser Automation | Slow |
    | scrape_topic() | Scrapes tweets by URL. It expects the URL of the topic. | Browser Automation | Slow |
    | scrape_keyword_with_api() | Scrapes tweets by query/keywords. For an advanced search, the query can be built from here. | HTTP Request | Fast |
    | get_profile_details() | Scrapes a Twitter user's details. | HTTP Request | Fast |
    | scrape_topic_with_api() | Scrapes tweets by URL. It expects the URL of the topic. | Browser Automation & HTTP Request | Fast |
    | scrape_profile_with_api() | Scrapes tweets by Twitter profile username. It expects the username of the profile. | Browser Automation & HTTP Request | Fast |

    Note: The HTTP Request method sends requests directly to Twitter's API to scrape data, while Browser Automation visits the page and scrolls while collecting the data.



    To scrape twitter profile details:

    from twitter_scraper_selenium import get_profile_details
    
    twitter_username = "TwitterAPI"
    filename = "twitter_api_data"
    get_profile_details(twitter_username=twitter_username, filename=filename)

    Output:

    {
    	"id": 6253282,
    	"id_str": "6253282",
    	"name": "Twitter API",
    	"screen_name": "TwitterAPI",
    	"location": "San Francisco, CA",
    	"profile_location": null,
    	"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    	"url": "https:\/\/t.co\/8IkCzCDr19",
    	"entities": {
    		"url": {
    			"urls": [{
    				"url": "https:\/\/t.co\/8IkCzCDr19",
    				"expanded_url": "https:\/\/developer.twitter.com",
    				"display_url": "developer.twitter.com",
    				"indices": [
    					0,
    					23
    				]
    			}]
    		},
    		"description": {
    			"urls": []
    		}
    	},
    	"protected": false,
    	"followers_count": 6133636,
    	"friends_count": 12,
    	"listed_count": 12936,
    	"created_at": "Wed May 23 06:01:13 +0000 2007",
    	"favourites_count": 31,
    	"utc_offset": null,
    	"time_zone": null,
    	"geo_enabled": null,
    	"verified": true,
    	"statuses_count": 3656,
    	"lang": null,
    	"contributors_enabled": null,
    	"is_translator": null,
    	"is_translation_enabled": null,
    	"profile_background_color": null,
    	"profile_background_image_url": null,
    	"profile_background_image_url_https": null,
    	"profile_background_tile": null,
    	"profile_image_url": null,
    	"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    	"profile_banner_url": null,
    	"profile_link_color": null,
    	"profile_sidebar_border_color": null,
    	"profile_sidebar_fill_color": null,
    	"profile_text_color": null,
    	"profile_use_background_image": null,
    	"has_extended_profile": null,
    	"default_profile": false,
    	"default_profile_image": false,
    	"following": null,
    	"follow_request_sent": null,
    	"notifications": null,
    	"translator_type": null
    }

    get_profile_details() arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | twitter_username | String | Twitter username. |
    | output_filename | String | Name of the file where the output is stored. |
    | output_dir | String | Directory where the output file is saved. |
    | proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |


    Keys of the output:

    Detail of each key can be found here.
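
Once the profile details are available as JSON, a few fields can be pulled out with the standard library. This is a minimal sketch that works on an excerpt of the output shown above, embedded as a string. The helper name `summarize_profile` is hypothetical, and how the JSON is loaded from the saved file is left out because it depends on where the output was written.

```python
import json
from datetime import datetime

# Excerpt of the get_profile_details() output shown above, used as sample data.
sample = """
{
  "screen_name": "TwitterAPI",
  "followers_count": 6133636,
  "friends_count": 12,
  "created_at": "Wed May 23 06:01:13 +0000 2007",
  "verified": true
}
"""


def summarize_profile(raw_json: str) -> dict:
    """Pick a few fields from the profile JSON and parse the creation date."""
    profile = json.loads(raw_json)
    # created_at looks like "Wed May 23 06:01:13 +0000 2007".
    created = datetime.strptime(profile["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "screen_name": profile["screen_name"],
        "followers": profile["followers_count"],
        # friends_count is the number of accounts this profile follows.
        "following": profile["friends_count"],
        "created": created,
    }


summary = summarize_profile(sample)
print(summary["screen_name"], summary["followers"], summary["created"].date())
```

Many keys in the output can be `null`, so code that reads other fields should be prepared for `None` values.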



    To scrape profile's tweets:

    In JSON format:

    from twitter_scraper_selenium import scrape_profile
    
    microsoft = scrape_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
    print(microsoft)

    Output:

    {
      "1430938749840629773": {
        "tweet_id": "1430938749840629773",
        "username": "Microsoft",
        "name": "Microsoft",
        "profile_picture": "https://twitter.com/Microsoft/photo",
        "replies": 29,
        "retweets": 58,
        "likes": 453,
        "is_retweet": false,
        "retweet_link": "",
        "posted_time": "2021-08-26T17:02:38+00:00",
        "content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
        "hashtags": [],
        "mentions": [],
        "images": [],
        "videos": [],
        "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
        "link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
      },...
    }

    In CSV format:

    from twitter_scraper_selenium import scrape_profile
    
    
    scrape_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")
    

    Output:

    tweet_id username name profile_picture replies retweets likes is_retweet retweet_link posted_time content hashtags mentions images videos post_url link
    1430938749840629773 Microsoft Microsoft https://twitter.com/Microsoft/photo 64 75 521 False 2021-08-26T17:02:38+00:00 Easy to use and efficient for all – Windows 11 is committed to an accessible future.

    Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW
    [] [] [] [] https://twitter.com/Microsoft/status/1430938749840629773 https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC

    ...



    scrape_profile() arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | twitter_username | String | Twitter username of the account. |
    | browser | String | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
    | proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
    | tweets_count | Integer | Number of posts to scrape. Default is 10. |
    | output_format | String | The output format, either JSON or CSV. Default is JSON. |
    | filename | String | Filename for the CSV file when output_format is set to CSV. If not passed, the filename will be the same as the username passed. |
    | directory | String | Directory for the CSV file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory. |
    | headless | Boolean | Whether to run the crawler headlessly. Default is True. |
    | browser_profile | String | Path to the browser profile where cookies are stored, which can be used to scrape data in an authenticated way. |


    Keys of the output

    | Key | Type | Description |
    | --- | --- | --- |
    | tweet_id | String | Post identifier (an integer cast to a string). |
    | username | String | Username of the profile. |
    | name | String | Name of the profile. |
    | profile_picture | String | Profile picture link. |
    | replies | Integer | Number of replies to the tweet. |
    | retweets | Integer | Number of retweets of the tweet. |
    | likes | Integer | Number of likes of the tweet. |
    | is_retweet | Boolean | Whether the tweet is a retweet. |
    | retweet_link | String | The retweet link if the tweet is a retweet, otherwise an empty string. |
    | posted_time | String | Time when the tweet was posted, in ISO 8601 format. |
    | content | String | Content of the tweet as text. |
    | hashtags | Array | Hashtags present in the tweet, if any. |
    | mentions | Array | Mentions present in the tweet, if any. |
    | images | Array | Image links, if any are present in the tweet. |
    | videos | Array | Video links, if any are present in the tweet. |
    | tweet_url | String | URL of the tweet. |
    | link | String | Link to an external website, if one is present in the tweet. |
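
The keys above are enough to do simple post-processing, such as ranking tweets by engagement. The following is a minimal sketch, assuming the data is available as a JSON string shaped like the output shown earlier (with fields trimmed here for brevity). The helper name `rank_by_engagement` is hypothetical and not part of the package.

```python
import json
from datetime import datetime

# Sample shaped like the scrape_profile() JSON output above (fields trimmed).
raw = """
{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": false,
    "posted_time": "2021-08-26T17:02:38+00:00",
    "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773"
  }
}
"""


def rank_by_engagement(raw_json: str) -> list:
    """Return tweets sorted by replies + retweets + likes, highest first."""
    tweets = json.loads(raw_json)
    rows = []
    for tweet in tweets.values():
        rows.append({
            "tweet_url": tweet["tweet_url"],
            "engagement": tweet["replies"] + tweet["retweets"] + tweet["likes"],
            # posted_time is ISO 8601, so fromisoformat() can parse it (Python 3.7+).
            "posted": datetime.fromisoformat(tweet["posted_time"]),
        })
    return sorted(rows, key=lambda row: row["engagement"], reverse=True)


ranked = rank_by_engagement(raw)
print(ranked[0]["tweet_url"], ranked[0]["engagement"])
```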


    To scrape tweets using keywords with API:

    from twitter_scraper_selenium import scrape_keyword_with_api
    
    query = "#gaming"
    tweets_count = 10
    output_filename = "gaming_hashtag_data"
    scrape_keyword_with_api(query=query, tweets_count=tweets_count, output_filename=output_filename)

    Output:

    {
      "1583821467732480001": {
        "tweet_url" : "https://twitter.com/yakubblackbeard/status/1583821467732480001",
        "tweet_details":{
          ...
        },
        "user_details":{
          ...
        }
      }, ...
    }

    scrape_keyword_with_api() arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | query | String | Query to search. The query can be built from here for advanced search. |
    | tweets_count | Integer | Number of tweets to scrape. |
    | output_filename | String | Name of the file where the output is stored. |
    | output_dir | String | Directory where the output file is saved. |
    | proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |


    Keys of the output:

    | Key | Type | Description |
    | --- | --- | --- |
    | tweet_url | String | URL of the tweet. |
    | tweet_details | Dictionary | A dictionary containing the data about the tweet. All fields which will be available inside can be checked here. |
    | user_details | Dictionary | A dictionary containing the data about the tweet owner. All fields which will be available inside can be checked here. |
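
Because the output is keyed by tweet ID, collecting the tweet URLs is a short loop. This is a minimal sketch on sample data shaped like the output above. The contents of `tweet_details` and `user_details` are elided in this README, so they are left empty here, and the helper name `collect_tweet_urls` is hypothetical.

```python
import json

# Sample shaped like the scrape_keyword_with_api() output above.
# "tweet_details" and "user_details" are left empty because their fields
# are not listed in this README.
raw = """
{
  "1583821467732480001": {
    "tweet_url": "https://twitter.com/yakubblackbeard/status/1583821467732480001",
    "tweet_details": {},
    "user_details": {}
  }
}
"""


def collect_tweet_urls(raw_json: str) -> dict:
    """Map each tweet ID to its URL, skipping entries without a tweet_url."""
    data = json.loads(raw_json)
    return {
        tweet_id: entry["tweet_url"]
        for tweet_id, entry in data.items()
        if "tweet_url" in entry
    }


urls = collect_tweet_urls(raw)
for tweet_id, url in urls.items():
    print(tweet_id, url)
```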



    To scrape tweets using keywords with browser automation

    In JSON format:

    from twitter_scraper_selenium import scrape_keyword
    # Scrape 10 tweets matching the keyword "india", posted from 30 August 2021 until 31 August 2021
    india = scrape_keyword(keyword="india", browser="firefox",
                           tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-30")
    print(india)

    Output:

    {
      "1432493306152243200": {
        "tweet_id": "1432493306152243200",
        "username": "TOICitiesNews",
        "name": "TOI Cities",
        "profile_picture": "https://twitter.com/TOICitiesNews/photo",
        "replies": 0,
        "retweets": 0,
        "likes": 0,
        "is_retweet": false,
        "posted_time": "2021-08-30T23:59:53+00:00",
        "content": "Paralympians rake in medals, India Inc showers them with rewards",
        "hashtags": [],
        "mentions": [],
        "images": [],
        "videos": [],
        "tweet_url": "https://twitter.com/TOICitiesNews/status/1432493306152243200",
        "link": "https://t.co/odmappLovL?amp=1"
      },...
    }


    In CSV format:

    from twitter_scraper_selenium import scrape_keyword
    
    scrape_keyword(keyword="india", browser="firefox",
                          tweets_count=10, until="2021-08-31", since="2021-08-30",output_format="csv",filename="india")

    Output:
    tweet_id username name profile_picture replies retweets likes is_retweet posted_time content hashtags mentions images videos tweet_url link
    1432493306152243200 TOICitiesNews TOI Cities https://twitter.com/TOICitiesNews/photo 0 0 0 False 2021-08-30T23:59:53+00:00 Paralympians rake in medals, India Inc showers them with rewards [] [] [] [] https://twitter.com/TOICitiesNews/status/1432493306152243200 https://t.co/odmappLovL?amp=1

    ...



    scrape_keyword() arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | keyword | String | Keyword to search on Twitter. |
    | browser | String | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
    | until | String | Optional. End date at which the search stops. Date format is YYYY-MM-DD. |
    | since | String | Optional. Past date from which the search starts. Date format is YYYY-MM-DD. |
    | proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
    | tweets_count | Integer | Number of posts to scrape. Default is 10. |
    | output_format | String | The output format, either JSON or CSV. Default is JSON. |
    | filename | String | Filename for the CSV file when output_format is set to CSV. If not passed, the filename will be the same as the keyword passed. |
    | directory | String | Directory for the CSV file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory. |
    | since_id | Integer | After (NOT inclusive) a specified Snowflake ID. Example here. |
    | max_id | Integer | At or before (inclusive) a specified Snowflake ID. Example here. |
    | within_time | String | Search within the last number of days, hours, minutes, or seconds. Examples: 2d, 3h, 5m, 30s. |
    | headless | Boolean | Whether to run the crawler headlessly. Default is True. |
    | browser_profile | String | Path to the browser profile where cookies are stored, which can be used to scrape data in an authenticated way. |
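
The date and time-window arguments are plain strings, so a typo only shows up once the browser has started. The following is a minimal sketch of checking them up front. The helper names are hypothetical, and the `within_time` pattern (a positive integer followed by `d`, `h`, `m`, or `s`) is an assumption inferred from the examples in the table, not a rule confirmed by the package.

```python
import re
from datetime import datetime

# Assumed pattern, based on the within_time examples 2d, 3h, 5m, 30s.
_WITHIN_TIME = re.compile(r"^[1-9]\d*[dhms]$")


def is_valid_date(value: str) -> bool:
    """Check that value is a real calendar date in zero-padded YYYY-MM-DD form."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts "2021-8-31"; require the zero-padded form.
    return parsed.strftime("%Y-%m-%d") == value


def is_valid_within_time(value: str) -> bool:
    """Check that value looks like 2d, 3h, 5m or 30s."""
    return _WITHIN_TIME.match(value) is not None


def check_date_range(since: str, until: str) -> None:
    """Raise ValueError if a date is malformed or since is later than until."""
    for name, value in (("since", since), ("until", until)):
        if not is_valid_date(value):
            raise ValueError(f"{name} must be a date in YYYY-MM-DD format, got {value!r}")
    # Zero-padded ISO dates sort correctly when compared as strings.
    if since > until:
        raise ValueError("since must not be later than until")


check_date_range(since="2021-08-30", until="2021-08-31")
print(is_valid_within_time("30s"), is_valid_within_time("2w"))
```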

    Keys of the output

    | Key | Type | Description |
    | --- | --- | --- |
    | tweet_id | String | Post identifier (an integer cast to a string). |
    | username | String | Username of the profile. |
    | name | String | Name of the profile. |
    | profile_picture | String | Profile picture link. |
    | replies | Integer | Number of replies to the tweet. |
    | retweets | Integer | Number of retweets of the tweet. |
    | likes | Integer | Number of likes of the tweet. |
    | is_retweet | Boolean | Whether the tweet is a retweet. |
    | posted_time | String | Time when the tweet was posted, in ISO 8601 format. |
    | content | String | Content of the tweet as text. |
    | hashtags | Array | Hashtags present in the tweet, if any. |
    | mentions | Array | Mentions present in the tweet, if any. |
    | images | Array | Image links, if any are present in the tweet. |
    | videos | Array | Video links, if any are present in the tweet. |
    | tweet_url | String | URL of the tweet. |
    | link | String | Link to an external website, if one is present in the tweet. |
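
The `tweet_id` values are Snowflake IDs, the same kind of identifier that `since_id` and `max_id` accept. A Snowflake ID stores a millisecond timestamp in its upper bits, offset from Twitter's epoch of 1288834974657 ms. The sketch below converts between IDs and times, which can help when choosing a `since_id` or `max_id` for a given moment. The helper names are hypothetical, and the conversion applies only to Snowflake-style IDs, not to older sequential tweet IDs.

```python
from datetime import datetime, timezone

# Twitter's Snowflake epoch, in milliseconds since the Unix epoch.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Return the UTC creation time encoded in a Snowflake tweet ID."""
    # The lower 22 bits hold worker and sequence data; the rest is the timestamp.
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def datetime_to_snowflake(moment: datetime) -> int:
    """Return the smallest Snowflake ID that could be created at moment."""
    timestamp_ms = int(moment.timestamp() * 1000)
    return (timestamp_ms - TWITTER_EPOCH_MS) << 22


# tweet_id from the scrape_keyword() sample output above.
created = snowflake_to_datetime(int("1432493306152243200"))
print(created.replace(microsecond=0).isoformat())  # 2021-08-30T23:59:53+00:00

# Lower bound for IDs created on or after 31 August 2021 (UTC).
boundary = datetime(2021, 8, 31, tzinfo=timezone.utc)
print(datetime_to_snowflake(boundary))
```

The decoded time matches the `posted_time` of the same tweet in the sample output.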



    To scrape topic tweets with URL using API

    from twitter_scraper_selenium import scrape_topic_with_api
    
    topic_url = 'https://twitter.com/i/topics/1468157909318045697'
    scrape_topic_with_api(URL=topic_url, output_filename='solana_cryptocurrency', tweets_count=50)

    Output:

    {
      "1584979408338632705": {
        "tweet_url" : "https://twitter.com/AptosBullCNFT/status/1584979408338632705",
        "tweet_details":{
          ...
        },
        "user_details":{
          ...
        }
      }, ...
    }

    scrape_topic_with_api() arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | URL | String | Twitter topic URL. |
    | tweets_count | Integer | Number of tweets to scrape. |
    | output_filename | String | Name of the file where the output is stored. |
    | output_dir | String | Directory where the output file is saved. |
    | proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
    | browser | String | Browser to use for extracting the GraphQL key. Default is Firefox. |
    | headless | Boolean | Whether to run the browser in headless mode. |

    Keys of the output:

    Same as scrape_keyword_with_api



    To scrape topic tweets with URL using browser automation:

    from twitter_scraper_selenium import scrape_topic
    # scrape 10 tweets from steam deck topic on twitter
    data = scrape_topic(filename="steamdeck", url='https://twitter.com/i/topics/1415728297065861123',
                         browser="firefox", tweets_count=10)

    Keys of the output:

    Same as scrape_profile


    scrape_topic() arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | filename | str | Filename to write the output to. |
    | url | str | Topic URL. |
    | browser | str | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
    | proxy | str | Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
    | tweets_count | int | Number of posts to scrape. Default is 10. |
    | output_format | str | The output format, either JSON or CSV. Default is JSON. |
    | directory | str | Directory to save the output file. Default is the current working directory. |
    | browser_profile | str | Path to the browser profile where cookies are stored, which can be used to scrape data in an authenticated way. |
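
Both topic examples in this README use URLs of the form `https://twitter.com/i/topics/<id>`. The following is a minimal sketch that checks a URL against that shape before passing it to a scraping function. The helper name `topic_id_from_url` is hypothetical, and the accepted pattern is an assumption inferred from those two examples; the package itself may accept other forms.

```python
from urllib.parse import urlparse


def topic_id_from_url(url: str) -> str:
    """Extract the numeric topic ID from https://twitter.com/i/topics/<id>."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "twitter.com":
        raise ValueError(f"not a twitter.com HTTPS URL: {url!r}")
    # Expected path segments: ["i", "topics", "<id>"]
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) != 3 or parts[:2] != ["i", "topics"] or not parts[2].isdigit():
        raise ValueError(f"not a topic URL: {url!r}")
    return parts[2]


print(topic_id_from_url("https://twitter.com/i/topics/1415728297065861123"))
```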


    To scrape a profile's tweets with API:

    from twitter_scraper_selenium import scrape_profile_with_api
    
    scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)

    scrape_profile_with_api() Arguments:

    | Argument | Argument Type | Description |
    | --- | --- | --- |
    | username | String | Twitter profile username. |
    | tweets_count | Integer | Number of tweets to scrape. |
    | output_filename | String | Name of the file where the output is stored. |
    | output_dir | String | Directory where the output file is saved. |
    | proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
    | browser | String | Browser to use for extracting the GraphQL key. Default is Firefox. |
    | headless | Boolean | Whether to run the browser in headless mode. |

    Output:

    {
      "1608939190548598784": {
        "tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
        "tweet_details":{
          ...
        },
        "user_details":{
          ...
        }
      }, ...
    }


    Using scraper with proxy (http proxy)

    Pass the proxy argument to the function.

    from twitter_scraper_selenium import scrape_keyword
    
    scrape_keyword(keyword="#india", browser="firefox", tweets_count=10, output_format="csv", filename="india",
                   proxy="66.115.38.247:5678")  # In IP:PORT format

    Proxy that requires authentication:

    from twitter_scraper_selenium import scrape_profile
    
    microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                                    proxy="sajid:pass123@66.115.38.247:5678")  # username:password@IP:PORT
    print(microsoft_data)
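
The two proxy formats above differ only in the optional credentials, so the string can be assembled from its parts. This is a minimal sketch; the helper name `build_proxy` is hypothetical and not part of the package, and the values are the placeholder ones from the examples above.

```python
from typing import Optional


def build_proxy(host: str, port: int, username: Optional[str] = None,
                password: Optional[str] = None) -> str:
    """Format a proxy as host:port or username:password@host:port."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    if (username is None) != (password is None):
        raise ValueError("username and password must be provided together")
    address = f"{host}:{port}"
    if username is None:
        return address
    return f"{username}:{password}@{address}"


print(build_proxy("66.115.38.247", 5678))
print(build_proxy("66.115.38.247", 5678, "sajid", "pass123"))
```

The returned string can then be passed as the `proxy` argument of the scraping functions.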
    


    Privacy

    This scraper only scrapes public data available to unauthenticated users and does not have the capability to scrape anything private.



    LICENSE

    MIT