L-Dot/Letterboxd-list-scraper

Any plans to update the app?

meanjoep92 opened this issue · 18 comments

Hey there, absolutely loving the hell out of this.

Is there any plan to add any more features to this app? (Like finding the number of watches, appearances on lists, etc)

Would also love to be able to run the scraper on the main pages.

Keep up the good work!

L-Dot commented

Heya,

Thanks for the kind reply! I'm actually quite surprised (but also very glad) that the app can still be of use today :).

As the app fulfilled my needs at the time I had paused any development, and eventually moved on to other things. Knowing that there is a request for added features gives me motivation to start it up again.

Number of watches, appearances on lists, etc. should all be fairly simple additions, I think. What did you have in mind for running the scraper on main pages? I do like to keep it focused on user-specific content, but I'll see what's possible.

Unfortunately I'm quite busy until the end of August, so any development will be slow up until then.

-Arno

It's definitely been a godsend for me personally...

It makes it so much easier to go through these massive lists, and pick movies easily.

I think I care about the number of watches because it would help me find actual movies. (Sometimes a massive list includes so many shorts, specials, and musicals that they outnumber the actual movies, no matter how strictly you filter with what Letterboxd gives you.)

Another possible request: outputting the Top 4 from a given list of users. (Might be far-fetched, but it would definitely be kinda cool.)

As far as the main pages, I previously used a custom-made scraper that would allow you to scrape, and just grab the pages you want.

https://letterboxd.com/films/popular/decade/1960s/page/[1-3]/

That way you could have it query only the specific pages you wanted.

This scraper has been far more efficient and quicker than the others that I used.

Really hope to see some updates to it soon!

Just created an account to say thanks for making this, I've been using it to scrape lists based on podcasts and making some stats displays around them (https://letterblocksd.com - still WIP, but I'll be adding the letterboxd stats collections too).

I also want to second meanjoep92's request for watchcount. I'm currently pulling "popularity" from TMDB, but that seems kind of fickle and strange. I'd love to have access to the LB watchcount for movies alongside the other info your tool scrapes.

Thanks again!

L-Dot commented

Thank you for the kind words :). Your website looks amazing! Kudos for creating such a creative and interesting site haha. Looking forward to seeing how it develops.

Luckily, I finally have some more time to work on this project again. I've just released an update of the scraper and it now has a lot more functionality (i.e. data columns for watches, likes, list appearances, genres, languages, countries). I hope this can be useful for your own projects in the short term.

In the meantime I'll think of implementing more features that would fit this app. Of course, let me know if you have any requests and/or issues. Thanks!

Very happy to see that this has been updated!

I especially love that there's now ways to see the runtimes, and number watched! (It makes picking films even easier now) ❤️

Would anyone know if it's still possible to scrape certain pages beyond just lists and watchlists?

Similar to what I asked months ago, I had a scraper where you could put your own custom query of pages in the brackets, and it would give you the info similar to the scraper.

https://letterboxd.com/films/popular/decade/1960s/page/[1-3]/

I found that the workaround was to just copy and paste what you see from a desired page, and then put it in a private list to run the scraper on. Just didn't know if there was any capability to do it since manually trying to copy it is always prone to error.

Thank you, and y'all are awesome! ❤️❤️

Thanks for the update! This is great timing for me, as I was planning to try my hand at extending this next week (and that likely wouldn’t have gone well ;)

One thing I was going to add that didn’t get into this update is the fan count (the number displayed above the rating, e.g. ‘1.2k FANS’). It looks like there’s not a more precise number once they go over 1k, and not every movie has them, but that’s an interesting piece of LB data that’s present on the page.

Two small issues that persist with the new version:

  1. The letterboxd links all have an extra / after the tld. Trivial, but I wanted to call it out.
  2. I’ve run into an issue scraping this movie: https://letterboxd.com/film/reel-5/ The title column is empty in the csv.

Thanks again! This is great!

Oh, one other feature I'd really appreciate would be the ability to call the function directly, or pass it a list of urls and a target directory. I'm sure this can be done somehow, I'm going to try and figure it out next week, but considering the programmatic use-case instead of a user with a text prompt would make this tool even more versatile. (Like I said, I'm sure this is already possible somehow, I'm just new ;)

I put in a fix (4 characters!) for the letterboxd url issue.
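
For anyone hitting the same doubled slash in their own code, here is a minimal sketch of one way to avoid it. It is not the actual four-character fix from the repository; `BASE_URL` and `film_url` are hypothetical names, and it assumes the scraped film path may or may not start with a slash.

```python
from urllib.parse import urljoin

# Assumed site root; the trailing slash is deliberate.
BASE_URL = "https://letterboxd.com/"


def film_url(path: str) -> str:
    """Join the site root and a scraped path without doubling the slash."""
    # urljoin resolves the path against the root, so "/film/x/" and
    # "film/x/" both end up with exactly one slash after the domain.
    return urljoin(BASE_URL, path)


print(film_url("/film/reel-5/"))
```

Plain string concatenation of a root ending in `/` and a path starting with `/` is the usual source of this kind of `//`.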

The https://letterboxd.com/film/reel-5/ is actually a problem with letterboxd itself, they're not escaping the quotes correctly in the alt text. I think addressing that edge case is well outside the scope of this project. Thanks again!

I've been fighting with github for a while now, but maybe I submitted a pull request for a larger change? Hopefully?

Your call if you want to merge it, I'm sure there's cleanup that could be done, or best practices to follow, but I wanted to at least share the functionality that I've been using. My fork maintains the original functionality, but if there's a list of letterboxd urls (and optional filenames) in a text file, it will scrape all those lists (up to 4 concurrently) without user input, which is very handy for my use case.
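
A rough sketch of the batch idea described above, not the fork's actual code. The `url[,filename]` line format, the function names, and the placeholder `scrape_list` are all assumptions made for illustration; only the "up to 4 concurrently" limit comes from the description.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_batch_lines(lines):
    """Turn 'url[,filename]' lines into (url, filename or None) pairs."""
    jobs = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        url, _, name = line.partition(",")
        jobs.append((url.strip(), name.strip() or None))
    return jobs


def scrape_list(url, filename):
    """Hypothetical stand-in for the real scraping call."""
    return f"{url} -> {filename or 'default name'}"


def run_batch(lines, max_workers=4):
    """Scrape every listed URL, at most `max_workers` at a time."""
    jobs = parse_batch_lines(lines)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input lines.
        return list(pool.map(lambda job: scrape_list(*job), jobs))
```

Threads suit this kind of work because each job spends most of its time waiting on network responses rather than using the CPU.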

Thanks for the functionality!

Unfortunately, the fancount isn't in the source html, it's generated, so it's probably not something that can be done with this script.

L-Dot commented

Apologies for the delay in responding. I've had quite a busy schedule, but have been able to dedicate some time to this project again since last week. The result has been the new v2.0.0 update.

Regarding the requests of @meanjoep92:

Would anyone know if it's still possible to scrape certain pages beyond just lists and watchlists?

Besides lists and watchlists, the scraper now also reads user films (https://letterboxd.com/{username}/films/) and Letterboxd film lists (https://letterboxd.com/films/).

Similar to what I asked months ago, I had a scraper where you could put your own custom query of pages in the brackets, and it would give you the info similar to the scraper.

With the new program version you can now scrape specific pages of a list, such as https://letterboxd.com/films/popular/decade/1960s/page/[1-3]/, by running the command python -m listscraper -p 1~3 https://letterboxd.com/films/popular/decade/1960s/.
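
To illustrate how a page selection like `1~3` maps onto the `/page/N/` URLs, here is a small sketch. It is not the scraper's own parsing code: `parse_pages` and `page_urls` are hypothetical helpers, and accepting comma-separated parts is an assumption beyond the `1~3` form shown in the command.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as '1~3' (or, by assumption, '1~3,7') into page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "~" in part:
            start, end = part.split("~")
            pages.extend(range(int(start), int(end) + 1))  # inclusive range
        elif part:
            pages.append(int(part))
    return pages


def page_urls(base_url: str, spec: str) -> list[str]:
    """Build one URL per selected page, following the /page/N/ pattern."""
    base = base_url.rstrip("/")
    return [f"{base}/page/{n}/" for n in parse_pages(spec)]


for url in page_urls("https://letterboxd.com/films/popular/decade/1960s/", "1~3"):
    print(url)
```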

Regarding the requests of @BeSweets:

One thing I was going to add that didn’t get into this update is the fan count (the number displayed above the rating, e.g. ‘1.2k FANS’). It looks like there’s not a more precise number once they go over 1k, and not every movie has them, but that’s an interesting piece of LB data that’s present on the page.

The new scraper has functionality for scraping the fan count, although only in the whole hundreds (i.e. 1.2K FANS becomes 1200 in the CSV). Unfortunately I can't seem to find a more detailed number for this count in the HTML code.
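
As an illustration of the conversion described above, this sketch turns an abbreviated label into an integer. It is not the scraper's actual implementation: `parse_fan_count` is a hypothetical name, and support for an `M` suffix is an assumption.

```python
import re


def parse_fan_count(label: str):
    """Convert a label like '1.2K FANS' to an int, or None if no number is found."""
    match = re.search(r"(\d[\d.,]*)([kKmM]?)", label)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    # The suffix must directly follow the digits, so 'fans' is not misread.
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[match.group(2).lower()]
    return round(number * multiplier)


print(parse_fan_count("1.2K FANS"))
```

Because the page only shows one decimal place, the result is only accurate to the nearest hundred once a film passes 1k fans.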

I’ve run into an issue scraping this movie: https://letterboxd.com/film/reel-5/ The title column is empty in the csv.

The film title scraping of e.g. https://letterboxd.com/film/reel-5/ was fixed by changing the way it finds the title in the HTML code. This was luckily quite easy to do and I hope the scraper can now correctly read all film titles. Thanks for this tip!

Oh, one other feature I'd really appreciate would be the ability to call the function directly, or pass it a list of urls and a target directory. I'm sure this can be done somehow, I'm going to try and figure it out next week, but considering the programmatic use-case instead of a user with a text prompt would make this tool even more versatile. (Like I said, I'm sure this is already possible somehow, I'm just new ;)

The scraper can now be called directly from the command line and it is very easy to supply it with a list of URLs from a .txt file by using the -f or --file flag. I can say that this comment was the main inspiration for creating the big overhaul in v2.0.0!

Thank you guys for your involvement and contribution to the project. I'm sorry for the lack of communication and the overall slow development, but I hope the new scraper can still be of (even better) use! :)

(Also please don't hesitate to bring up any new suggestions/issues with the scraper!)

So happy that this scraper has been updated tremendously since I've made this! ❤️ This thing is going to be a blessing for my movie watching. Hope these updates never stop.

Hey there! Might be a silly question, but is there a way to have the synopsis of each film scraped as well? Just curious! Thanks!

Hey! Yes certainly possible and easy to implement (don't know why I have not added this earlier haha).

I added the change to the new release v2.2.0.

Man, you are the absolute best! Thank you so much!!!

This might be another silly request (I greatly apologize for asking so many random things because this scraper has been a life saver for me) but is there a way to possibly grab the URL for the image of the movie poster? (Or is that even possible?) Thanks!!

Hi, certainly no silly requests here! I left this thread open exactly for requests such as this.

Adding the movie poster URL is certainly possible. It has been on the TODO list for quite some time. I'm gonna check it out when I have some time next week :)

So I genuinely tried to get the posters using BeautifulSoup, but to no avail...would Selenium be the only way to do this since it's served through React?

UPDATE:
Managed to find a working solution to get the posters!

At the top of listscraper/scrape_functions.py, import json:

import json

Then, inside the scrape_film function, add:

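    # Get the poster URL from the page's JSON-LD metadata
    try:
        script_w_data = film_soup.select_one('script[type="application/ld+json"]')
        # The JSON sits between /* ... */ comment markers; keep only the part between them
        json_obj = json.loads(script_w_data.text.split(' */')[1].split('/* ]]>')[0])
        film_dict["Poster_URL"] = json_obj['image']
    except (AttributeError, IndexError, KeyError, TypeError, json.JSONDecodeError):
        # Fall back to the scraper's placeholder when the tag or field is missing
        film_dict["Poster_URL"] = not_found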

I'm sure the creators might have a better solution, but only wanted this because I really take my movie-watching seriously. :3
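
For testing the idea outside the scraper, here is a standalone sketch that uses only the standard library. The sample HTML is made up to mirror the comment-wrapped JSON-LD that the snippet above splits on, and `extract_poster_url` is a hypothetical helper rather than part of listscraper.

```python
import json
import re

# Made-up sample mimicking a JSON-LD block wrapped in comment markers.
SAMPLE_HTML = '''
<script type="application/ld+json">
/* <![CDATA[ */
{"image": "https://example.com/poster.jpg", "name": "Example Film"}
/* ]]> */
</script>
'''


def extract_poster_url(html: str):
    """Return the 'image' field of the first JSON-LD block, or None."""
    match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL,
    )
    if not match:
        return None
    # Strip the /* ... */ wrappers around the JSON payload.
    payload = re.sub(r"/\*.*?\*/", "", match.group(1), flags=re.DOTALL).strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    return data.get("image") if isinstance(data, dict) else None


print(extract_poster_url(SAMPLE_HTML))
```

One caveat: the comment-stripping regex would also remove a literal `/* ... */` that happened to appear inside a JSON string, so the split-based approach above may be safer on pages with unusual descriptions.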