MatthewWolff/TwitterScraper

TypeError: 'NoneType' object is not subscriptable

Closed this issue · 19 comments

python3 ./scrape.py -u jack

Traceback (most recent call last):
  File "./scrape.py", line 276, in <module>
    begin = datetime.strptime(args.since, DATE_FORMAT) if args.since else get_join_date(args.username)
  File "./scrape.py", line 256, in get_join_date
    date_string = soup.find("span", {"class": "ProfileHeaderCard-joinDateText"})["title"].split(" - ")[1]
TypeError: 'NoneType' object is not subscriptable

Shit... I hope they didn't change up their front end. I just checked the HTML, and it looks like they've discarded the coherent class names for some garbled stuff: class="css-901oao css-16my406 r-1re7ezh r-4qtqp9 r-1qd0xha r-ad9z0x r-zso239 r-bcqeeo r-qvutc0"...

I'll need to investigate this. If they changed this up, the soup parser will need to be updated so it can find the start date.

As a workaround, you can manually specify the --since parameter so that it doesn't try to figure it out.
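The crash above happens because soup.find() returned None once the class name changed, so the ["title"] lookup blows up. Until the parser is updated, the title parsing could at least fail gracefully. This is a hypothetical sketch (parse_join_date isn't a function in the repo), assuming the old UI's title format such as "3:50 PM - 21 Mar 2006":

```python
from datetime import datetime

def parse_join_date(title_attr):
    """Parse the join-date span's "title" attribute, e.g.
    "3:50 PM - 21 Mar 2006". Returns a datetime, or None when the
    attribute is missing or the markup has changed shape."""
    if not title_attr or " - " not in title_attr:
        return None
    return datetime.strptime(title_attr.split(" - ")[1], "%d %b %Y")
```

Returning None lets the caller fall back to requiring --since instead of crashing with a TypeError.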

--since works, and it successfully opens Chrome and shows each page of search results, but it doesn't seem to find any tweets:

 python3 ./scrape.py -u jack --since 2020-01-01
[ scraping user @jack... ]
[ 0 existing tweets in jack.json ]
[ searching for tweets... ]
[ found 0 new tweets ]
[ retrieving new tweets (estimated time: 0 seconds)... ]
[ finished scraping ]
[ stored tweets in jack.json ]
cat jack.json
{}

It's using the old Twitter UI but perhaps something changed?

They've rewritten their front end, so all the HTML parsing in this repo will need to be updated ):

tweet_selector = "li.js-stream-item" will need to be updated with the new identifier for tweets
soup.find("span", {"class": "ProfileHeaderCard-joinDateText"}) will need to be updated to find the classes css-901oao css-16my406 r-1re7ezh r-4qtqp9 r-1qd0xha r-ad9z0x r-zso239 r-bcqeeo r-qvutc0 instead

All right, I updated it to work with the new CSS, both for tweet scraping and for figuring out the user's join date. Let me know if it works for you now.

python3 ./scrape.py -u jack --since 2020-01-01
[ scraping user @jack... ]
[ 0 existing tweets in jack.json ]
[ searching for tweets... ]
Traceback (most recent call last):
  File "./scrape.py", line 279, in <module>
    user.scrape(begin, end, args.by, args.delay)
  File "./scrape.py", line 101, in scrape
    self.__find_tweets(start, end, by, loading_delay)
  File "./scrape.py", line 203, in __find_tweets
    tweet_id = locate_tweet_id(tw)
  File "./scrape.py", line 170, in locate_tweet_id
    tweet_id_link = list(filter(lambda tw: "status" in tw.get_attribute("href"), subelements))[0]
IndexError: list index out of range
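For what it's worth, the IndexError above fires when the filter finds no link containing "status"; get_attribute("href") can also return None for some anchors, which would make the `in` test raise a TypeError first. A hedged sketch of a more defensive lookup (the Selenium method names are taken from the traceback's usage; the rest is an assumption, not the repo's fix):

```python
def locate_tweet_id(tweet_element):
    """Return the numeric tweet id from the first status permalink
    inside a tweet card, or None instead of raising when the links
    have changed or carry no href."""
    for el in tweet_element.find_elements_by_tag_name("a"):
        href = el.get_attribute("href")  # may be None for some anchors
        if href and "/status/" in href:
            return href.rstrip("/").split("/")[-1]
    return None
```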

Just pushed again; should be good now.

> ./scrape.py -u jack --since 2020-01-01
[ scraping user @jack... ]
[ 0 existing tweets in jack.json ]
[ searching for tweets... ]
no tweets in time period 2020-02-05 -- 2020-02-12
no tweets in time period 2020-04-15 -- 2020-04-22
[ found 128 new tweets ]
[ retrieving new tweets (estimated time: 12 seconds)... ]
- batch 1 of 2
- batch 2 of 2
[ finished scraping ]
[ stored tweets in jack.json ]

Working now, thanks! Although I found slightly fewer tweets than you did, so that might be a bug (or something weird on Twitter's end).

Either way I've uploaded jack.json in case you want to look into that:

https://pastebin.com/raw/GeAzTQmN

python3 ./scrape.py -u jack --since 2020-01-01
[ scraping user @jack... ]
[ 0 existing tweets in jack.json ]
[ searching for tweets... ]
no tweets in time period 2020-02-05 -- 2020-02-12
[ found 110 new tweets ]
[ retrieving new tweets (estimated time: 12 seconds)... ]
- batch 1 of 2
- batch 2 of 2
[ finished scraping ]
[ stored tweets in jack.json ]

Quick question, is there support for cookies? I'd like to scrape tweets from a private account but at the moment it doesn't look like that's possible.

Quick question, is there support for cookies? I'd like to scrape tweets from a private account but at the moment it doesn't look like that's possible.

The only way to scrape a private account is to have Twitter API credentials for an account that is following the private account, otherwise the tweets just aren't visible.

I found slightly fewer tweets

I noticed this too... Sadly, scraping tweets can be hit-or-miss. I adapted and rewrote portions of some old code to create this repo, and the only sure-fire way to get better results has been to increase the --delay and decrease the --by parameter such that no tweets are missed.

If you want to further investigate the core logic of the scraping, you could look at these lines of code. I think there's definitely room for improvement, as the increment variable is a bit opaque (in terms of the while condition).
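To make the increment/while logic concrete, here is a minimal restatement of how the search windows presumably advance (the names here are assumptions, not the repo's code). A smaller --by means narrower windows, so fewer tweets land on each results page and there's less chance the infinite scroll skips some:

```python
from datetime import datetime, timedelta

def date_windows(begin, end, by):
    """Yield (since, until) search windows covering [begin, end)
    in steps of `by` days; the final window is clipped to `end`."""
    since = begin
    while since < end:
        until = min(since + timedelta(days=by), end)
        yield since, until
        since = until
```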

The only way to scrape a private account is to have Twitter API credentials for an account that is following the private account, otherwise the tweets just aren't visible.

Ah, you're right. Well, that sucks. The from: operator seems to be limited to just the past week. to: doesn't appear to have that restriction, so it's possible to find some older tweets that way, but that doesn't help for tweets with no replies.
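For reference, the search operators being discussed compose into a query like this. build_search_query is a hypothetical helper for illustration, and the operator behavior (e.g. from: only surfacing recent tweets) is as observed in this thread, not documented:

```python
from urllib.parse import quote

def build_search_query(username, since, until):
    """URL-encode an advanced-search query: from: restricts results to
    the user's own tweets; since/until take YYYY-MM-DD dates."""
    return quote("from:%s since:%s until:%s" % (username, since, until))
```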

If you want to further investigate the core logic of the scraping, you could look at these lines of code. I think there's definitely room for improvement, as the increment variable is a bit opaque (in terms of the while condition).

I'll take a look and see :-)

They've rewritten their front end, so all the HTML parsing in this repo will need to be updated ):

tweet_selector = "li.js-stream-item" will need to be updated with the new identifier for tweets
soup.find("span", {"class": "ProfileHeaderCard-joinDateText"}) will need to be updated to find the classes css-901oao css-16my406 r-1re7ezh r-4qtqp9 r-1qd0xha r-ad9z0x r-zso239 r-bcqeeo r-qvutc0 instead

Hi Matt,

Thanks for the code. Really helpful. Though I'm unable to understand where I need to make these changes in the code. I changed the value of tweet_selector; however, I don't understand where I need to introduce the soup.find code and the classes part. Could you help me out here?

Thanks in advance.

Facing the same problem: the Twitter page with the search query and the tweets shows up, but none of it gets captured in the generated JSON file.

@nmalcolm can you help me out here, if you were able to run the script?

Did it stop working again? If so, they must have changed up the CSS classes again. Sorry for the delayed reply; coursework is hitting hard lately. I'll try to take a peek in the next 24 hours and run some tests.

Facing the same problem: the Twitter page with the search query and the tweets shows up, but none of it gets captured in the generated JSON file.

I think I'm just going to rewrite the Selenium portion for more robust scraping. It's currently using auto-generated CSS tags, which are subject to change; that was a lame fix for when they switched up their front-end CSS. I'll try to hit it this weekend, but I've got a lot on my plate.

Here's a primer on text extraction with Selenium, though.

Thank you @MatthewWolff. Please don't prioritize this, whenever you're done with all the important stuff, and if you can find time for this later, then please help. I will meanwhile try to go through the primer and figure it out by myself. Thanks again for the code.

Thanks,
Sandeep

@sandeep17oct Got around to peeking at it. They changed their CSS again, and it's auto-generated to begin with, so the former solution here wasn't great. I've modified it to be more robust now: I limited Selenium usage, and there's no longer any dependency on CSS classes; it just parses out all of the target user's tweet ids on a page.

Additionally, the driver now runs headless (yay), but you can turn on a debug option if desired.

Let me know if it works better for you.
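The class-free id parsing described above can be sketched roughly like this (extract_tweet_ids is illustrative, not the repo's exact code):

```python
import re

def extract_tweet_ids(page_source, username):
    """Pull @username's tweet ids straight out of the raw page HTML by
    matching status permalinks, with no dependency on CSS classes."""
    pattern = re.compile(r"/%s/status/(\d+)" % re.escape(username),
                         re.IGNORECASE)
    # de-duplicate while preserving first-seen order
    return list(dict.fromkeys(pattern.findall(page_source)))
```

Matching on the /username/status/id permalink shape survives front-end redesigns far better than any auto-generated class name.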

Modified the scraper to no longer search using CSS tags; this bug can no longer occur.

Thank you @MatthewWolff for the quick help. Works like a charm. 💯 👍

Hi @MatthewWolff, quick question: have you tried scraping a handle with a very large number of tweets? I tried scraping the sadhguruJV handle, which has about 10K tweets, and got roughly 2.7K. Also, it seems the variation grows the longer the script runs: year 1 might fetch the accurate number of tweets, year 2 a few missing, year 3 a few more, and so on. Apart from the delay (--delay) and increment (--by) parameters, is there a way to tune this to reduce the misses?