📥 Download Webtoons 🎨

Using selenium and BS4 to automatically download Daum and Naver webtoons for offline reading!

Table of Contents

| Section Name | Section Description |
| --- | --- |
| 0 - About this project | Explains what this project is and why it was made. (Also some troubleshooting tips!) |
| 1 - Tech description | Write-up on some of the tech involved in the project |
| 2 - Future plans | Additional features I want to add later down the road! |
| 3 - Legal disclaimer | Just a legal disclaimer |

0. About this project

What are webtoons?

Webtoons are a sort of cultural phenomenon in a lot of Asian countries! As far as I know, Korea's main 'sources' of webtoons are two portal giants: Daum.net and Naver.com. Webtoons are basically scrollable, digital comics that update every week.

Why build this project?

A few years ago a friend asked if it'd be possible to create a program that could download freely available webtoons to read them offline. At the time I didn't really know much about web scraping or coding, so I thought the only way was to painstakingly screenshot or save each of the comic's image files. Thankfully, I now know a little bit more about webpages and coding, and thought I'd try my hand at automatically downloading webcomics!

What does this project do?

| [Input] Link | [Input] Directory | [Output] Dir of .jpeg |
| --- | --- | --- |
| https://webtoon.daum.net/webtoon/... | ~/path/to/output/dir | ~/comic_title/...jpeg |

As of June 23rd 2019, given a link to one Daum or Naver comic episode and a path for the output files, the program downloads the webcomic as jpeg/jpg files into the output directory. Make sure the link is surrounded by single quotes ('). This ensures that the command line arguments are passed correctly to the program, since some links include ampersands (&), which the shell would otherwise interpret.

Now, with setup.py and organized directories, you should be able to install this as a package and run it from the command line!

pip install dlwebtoon
dlwebtoon 'link_to_webtoon' /path/to/output_directory
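
To show why the quoting matters, here is a minimal sketch of how a tool like this could read its two arguments. It is not the project's actual code: the function name parse_args, the argument names, and the example.com link are all assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two positional arguments: episode link and output directory."""
    parser = argparse.ArgumentParser(
        prog="dlwebtoon",
        description="Download one Daum or Naver webtoon episode as .jpeg files.",
    )
    parser.add_argument("link", help="URL of a single webtoon episode")
    parser.add_argument("output_dir", type=Path, help="directory to save the images in")
    return parser.parse_args(argv)


# The shell strips the single quotes, so the program receives the whole link,
# ampersands included, as one argument. The URL below is a made-up example.
args = parse_args(["https://example.com/viewer?titleId=1&no=2", "/tmp/webtoons"])
print(args.link)
print(args.output_dir)
```

Without the quotes, most shells would treat everything after the first & as a separate command, and the program would only see a truncated link.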

The program isn't running!

This may be due to 1) your virtualenv. Try recreating it:

pip install --upgrade virtualenv
find webtoon_dl -type l -delete # webtoon_dl is the name I chose for my virtualenv
virtualenv -p python3 webtoon_dl

Other issues may arise from 2) the Selenium Chrome webdriver. In that case, download the ChromeDriver release that matches your version of Chrome, then drop the chromedriver.exe file into the ~/webtoon_dl folder.

1. Tech description

Virtual Environment

I wanted to use Python 3 for this project, but this conflicted with the Python 2 already installed on my machine. Hence, I opted to use virtualenv.

Creating virtualenv:

pip install --upgrade virtualenv
virtualenv -p python3 envname # webtoon_dl in our case

To activate the environment:

source /path/to/ENV/bin/activate
python -V # should show 3.7

To run the program:

python webtoondl.py 'link_to_comic' /path/to/output

OR

pip install dlwebtoon
dlwebtoon 'link_to_webtoon' /path/to/output_directory

With setup.py, all the necessary packages should be installed when you run 'pip install dlwebtoon'. If you run webtoondl.py directly, you may need to install these packages yourself:

| Package | Version |
| --- | --- |
| beautifulsoup4 | 4.7.1 |
| selenium | 3.141.0 |
| chromedriver | 2.24.1 |
| imageio | 2.5.0 |

pip install package_name

HTML parsing

There were a lot of options for the web scraping tool: Scrapy (a web scraping framework), urllib, and Requests. I ultimately opted for Selenium! I originally tried urllib, simply because it's included in Python's standard library (meaning no extra installs!), but the images I wanted seem to be loaded client-side via JavaScript, so they are missing from the raw HTML that urllib receives. (see get_imgurl.py)

Parsers

I knew of two options: LXML and BeautifulSoup. I chose BeautifulSoup since there seemed to be a lot of online tutorials and documentation on it!
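
The two pieces fit together roughly like this: Selenium renders the page (so the JavaScript-loaded images exist), and BeautifulSoup pulls the image links out of the resulting HTML. This is only a sketch under assumptions, not what get_imgurl.py actually does; the function names and the sample HTML are made up, and a real page would need a site-specific selector plus some waiting for the images to load.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load the page in headless Chrome so JavaScript-inserted images exist."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def extract_image_urls(html):
    """Return the src of every <img> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img", src=True)]


# Made-up HTML standing in for a rendered episode page.
sample = (
    '<div><img src="https://example.com/1.jpg">'
    '<img alt="no src"><img src="https://example.com/2.jpg"></div>'
)
print(extract_image_urls(sample))
```

In practice you would pass the result of fetch_rendered_html(link) to extract_image_urls instead of the sample string.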


Downloading the image

I used imageio to first read each image into an in-memory buffer, then save that buffer to a specific file path.
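
A rough sketch of that read-then-save step could look like the following. The helper names and the zero-padded file naming are assumptions for illustration, not the project's actual scheme. Note that sites that reject requests without browser-like headers would need extra handling that this sketch leaves out.

```python
from pathlib import Path


def image_path(output_dir, index, extension=".jpeg"):
    """Build a zero-padded file path so the images sort in reading order."""
    return Path(output_dir) / f"{index:03d}{extension}"


def download_images(image_urls, output_dir):
    """Read each image into memory with imageio, then write it to disk."""
    import imageio  # imported lazily; only needed when actually downloading

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(image_urls, start=1):
        pixels = imageio.imread(url)  # imageio accepts http(s) URLs as well as paths
        target = image_path(output_dir, index)
        imageio.imwrite(str(target), pixels)
        saved.append(target)
    return saved


print(image_path("comic_title", 1))
```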

The output folder is named after the comic and contains the episode's .jpeg files (~/comic_title/...jpeg).

2. Future plans

DONE (June 24 2019) - Daum / Naver distinguishing

Originally this program only worked for Daum webcomics. I have now implemented support for Naver webtoons too. This was a bit harder, since requests for the images came back with 403 Forbidden errors, but I found an adequate workaround: sending a User-Agent header with each request.
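
For the curious, the User-Agent workaround can be sketched like this. The README does not say which library sends the header, so urllib from the standard library is used here as an assumption; the User-Agent string and function names are also just examples.

```python
import urllib.request

# Hypothetical browser-like value; the exact string the project uses is not documented.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_bytes(url):
    """Download the raw bytes behind url, sending the User-Agent header."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


request = build_request("https://example.com/image.jpg")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Some servers answer 403 to the default Python client identifier, so a browser-like value is often enough to get the image back.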

DONE (June 28 2019) - Setup.py and running in command line

I have now implemented this!

    pip install dlwebtoon
    dlwebtoon 'link_to_comic' /path/to/output
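
For reference, a setup.py that provides a command like this usually registers a console script entry point. The following is a minimal sketch, not the project's real file: the module path dlwebtoon.cli and the function name main are hypothetical, and the dependency list is simply copied from the package table above.

```python
from setuptools import setup, find_packages

setup(
    name="dlwebtoon",
    packages=find_packages(),
    python_requires=">=3",
    # Dependencies as listed in this README's package table.
    install_requires=[
        "beautifulsoup4",
        "selenium",
        "chromedriver",
        "imageio",
    ],
    entry_points={
        # "dlwebtoon.cli:main" is a hypothetical module path and function name.
        "console_scripts": ["dlwebtoon = dlwebtoon.cli:main"],
    },
)
```

The console_scripts entry is what makes pip create the dlwebtoon command on your PATH.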

Combine multiple .jpeg

One 'episode' of a webcomic is around 10 .jpeg files. Later on I hope to implement an optional function that automatically stitches these images together.
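
One possible way to do the stitching, sketched with NumPy (which imageio already uses for its image arrays). This is not implemented in the project; the function name is made up, and the sketch assumes every image is a height x width x channels array with the same number of channels.

```python
import numpy as np


def stitch_vertically(images, fill=255):
    """Stack H x W x C image arrays top to bottom, padding narrower ones on the right."""
    if not images:
        raise ValueError("need at least one image")
    width = max(img.shape[1] for img in images)
    padded = []
    for img in images:
        missing = width - img.shape[1]
        # Pad only the width axis; height and channels stay untouched.
        padded.append(
            np.pad(img, ((0, 0), (0, missing), (0, 0)), mode="constant", constant_values=fill)
        )
    return np.concatenate(padded, axis=0)


top = np.zeros((2, 4, 3), dtype=np.uint8)
bottom = np.zeros((3, 2, 3), dtype=np.uint8)
print(stitch_vertically([top, bottom]).shape)  # (5, 4, 3)
```

The stitched array could then be written out as a single image file with imageio.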

One link to download entire webtoon

It's quite a hassle to download an entire webtoon with the current system: you have to input a new link for every episode of the comic. I want to implement a feature that automatically goes through the entire episode list and downloads all the available episodes. I expect this to be a bit difficult, since there's no easy way to tell how many episodes a comic has or whether the comic is 'completed'.

3. Legal disclaimer

The main purpose of building this program was for me to 1) learn how to build a command line tool and 2) get a small taste of how procedures can be automated.

With these learning objectives in mind, this program was built for people to enjoy free webtoons offline. Please note that using this program to distribute the comics could go against copyright laws.

More details on why "personal use" is legal: Article 30 (Reproduction for Private Use) of the Korean Copyright Act states that a user may reproduce a published work for non-profit personal use, or for use within the home or a similarly limited scope. This does not apply, however, to reproduction by a copying machine installed for public use.