Using selenium and BS4 to automatically download Daum and Naver webtoons for offline reading!
Section Name | Section Description |
---|---|
0 - About this project | Explaining what this project is and why it was made. (Also some troubleshooting tips!) |
1 - Tech description | Write up on some of the tech involved in the project |
2 - Future plans | Additional features I want to add later down the road! |
3 - Legal disclaimer | Just a legal disclaimer |
Webtoons are a sort of cultural phenomenon in a lot of Asian countries! As far as I know, Korea's main 'sources' of webtoons are two portal giants: Daum.net and Naver.com. Webtoons are basically scrollable, digital comics that update every week.
A few years ago a friend actually asked if it'd be possible to create a program that could download freely available webtoons to read them offline. At the time I didn't really know much about webscraping or coding, so I thought the only way was to painstakingly screenshot or save each of the comic's image files. Thankfully, I now know a little bit more about webpages and coding, and thought I'd try my hand at automatically downloading webcomics!
[Input] link | [Input] Directory | [Output] Dir of .jpeg |
---|---|---|
https://webtoon.daum.net/webtoon/... | ~/path/to/output/dir | ~/comic_title/...jpeg |
As of June 23rd 2019, given a link to one Daum or Naver comic episode and a path for the output files, the program downloads the episode as .jpeg/.jpg files into the output directory. Make sure the link is surrounded by single quotes ('). This ensures that the command-line arguments are passed correctly to the program, since some links include ampersands (&), which the shell would otherwise interpret itself.
Now, with setup.py and organized directories, you should be able to install this as a package and run it from the command line!
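To show why the quotes matter, here is a minimal sketch of a two-argument command line interface. The function name `parse_args` and the argument names `link` and `output_dir` are hypothetical; the real webtoondl.py may read its arguments differently. The point is that the program only ever sees what the shell hands over, so an unquoted `&` would cut the link short before Python gets it.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments: episode link and output directory.

    Hypothetical helper for illustration; not taken from the project source.
    """
    parser = argparse.ArgumentParser(
        prog="dlwebtoon",
        description="Download one webtoon episode as image files.",
    )
    parser.add_argument("link", help="URL of one Daum or Naver episode")
    parser.add_argument("output_dir", help="directory to write the images to")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


# Simulate what the shell passes when the link is quoted correctly:
# the whole URL, ampersand included, arrives as a single argument.
args = parse_args(["https://example.com/view?titleId=1&no=2", "/tmp/out"])
print(args.link)
print(args.output_dir)
```

Without the quotes, most shells would treat everything after the `&` as a separate command, and `link` would end at `titleId=1`.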
pip install dlwebtoon
dlwebtoon 'link_to_webtoon' /path/to/output_directory
If the program fails to run, this may be due to 1) your virtualenv. Just try:
pip install --upgrade virtualenv
find webtoon_dl -type l -delete  # webtoon_dl is the name I chose for my virtualenv
virtualenv -p python3 webtoon_dl
Other issues may arise from 2) the Selenium Chrome webdriver. In that case, visit here and download the webdriver that corresponds to your version of Chrome. Then drop the chromedriver.exe file into the ~/webtoon_dl folder.
I wanted to use Python 3 for this project, but this conflicted with the Python 2 already installed. Hence, I opted to use virtualenv.
Creating virtualenv:
pip install --upgrade virtualenv
virtualenv -p python3 envname  # webtoon_dl in our case
To activate the virtualenv:
source /path/to/ENV/bin/activate
python -V  # should show 3.7
To run the program:
python webtoondl.py 'link_to_comic' /path/to/output
OR
pip install dlwebtoon
dlwebtoon 'link_to_webtoon' /path/to/output_directory
With setup.py, all the necessary packages should be installed when you 'pip install dlwebtoon'. If you run the script directly instead, you may need to install the following packages yourself:
Package | Version |
---|---|
beautifulsoup4 | 4.7.1 |
selenium | 3.141.0 |
chromedriver | 2.24.1 |
imageio | 2.5.0 |
pip install package_name
There were a lot of options for which tool to use for webscraping: Scrapy (a webscraping framework), urllib, and Requests. But I ultimately opted for Selenium! I originally tried urllib, simply because it's included in Python's standard library (meaning no extra installs!), but the images I wanted seem to be loaded client-side via JavaScript, so they don't show up in the raw HTML that urllib fetches. Selenium drives a real browser, which runs the JavaScript first. (see get_imgurl.py)
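As a rough illustration of the Selenium approach, here is a minimal sketch that loads a page in headless Chrome and returns the HTML after the scripts have run. This is not the contents of get_imgurl.py: the function name `fetch_rendered_html` and the fixed `wait_seconds` pause are my own assumptions, and it assumes Chrome plus a matching chromedriver are available on your PATH.

```python
import time


def fetch_rendered_html(url, wait_seconds=3.0):
    """Load `url` in headless Chrome and return the page HTML after JS has run.

    Hypothetical helper for illustration. Selenium is imported inside the
    function so this file can still be imported without Selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude fixed pause to give client-side scripts time to insert images.
        # An explicit wait on a known element would be more robust.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling `fetch_rendered_html('link_to_comic')` would then give you HTML that includes the image tags a plain urllib fetch misses.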
For parsing the HTML, I knew of two options: lxml and BeautifulSoup. I chose BeautifulSoup since there seemed to be a lot of online tutorials and documentation on it!
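Here is a small sketch of what the BeautifulSoup step could look like: pulling every image URL out of a chunk of HTML. The function name `extract_image_urls` and the sample HTML are made up for illustration; the real Daum and Naver pages will need more specific selectors than "every img tag".

```python
from bs4 import BeautifulSoup


def extract_image_urls(html):
    """Return the src of every <img> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips <img> tags that have no src attribute at all
    return [img["src"] for img in soup.find_all("img", src=True)]


# Made-up HTML standing in for a rendered episode page
sample_html = """
<div class="viewer">
  <img src="https://example.com/ep1/001.jpg">
  <img alt="placeholder without a source">
  <img src="https://example.com/ep1/002.jpg">
</div>
"""

for url in extract_image_urls(sample_html):
    print(url)
```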
I used imageio to first read each image into an in-memory buffer, then save that buffer to a specific file path.
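The read-then-save step can be sketched roughly like this. The helper name `save_image` is hypothetical, and the demo uses a tiny generated image instead of a real comic panel so it runs without a network connection. `imageio.imread` also accepts a URL in place of a file path, which is what makes it handy for downloads.

```python
import os
import tempfile

import imageio
import numpy as np


def save_image(source, out_path):
    """Decode the image at `source` into an array, then write it to `out_path`.

    `source` can be a local path or an http(s) URL. Returns the array shape.
    """
    pixels = imageio.imread(source)    # image decoded into an in-memory array
    imageio.imwrite(out_path, pixels)  # output format chosen from the extension
    return pixels.shape


# Demo with a generated 4x6 black RGB image standing in for a comic panel
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "panel.png")
dst_path = os.path.join(tmp_dir, "panel.jpg")
imageio.imwrite(src_path, np.zeros((4, 6, 3), dtype=np.uint8))

shape = save_image(src_path, dst_path)
print(shape, os.path.exists(dst_path))
```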
Currently this program only works for Daum webcomics. Later on I plan to make it work for both Daum and Naver webcomics.
I have now implemented support for Naver webtoons too. This was a bit harder since there were some 403 Forbidden errors and such. But I found an adequate workaround in providing a User-Agent header with each request.
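For the curious, the User-Agent workaround looks roughly like this. I'm sketching it with urllib from the standard library, which is not necessarily how the project does it; the names `BROWSER_USER_AGENT`, `build_image_request`, and `download_bytes`, and the exact User-Agent string, are assumptions for illustration.

```python
import urllib.request

# Any realistic browser string; this exact value is just an example.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0 Safari/537.36"
)


def build_image_request(image_url):
    """Build a request that identifies itself as a browser.

    Some servers answer 403 Forbidden to the default Python User-Agent.
    """
    return urllib.request.Request(
        image_url, headers={"User-Agent": BROWSER_USER_AGENT}
    )


def download_bytes(image_url):
    """Fetch the raw bytes of one image using the browser-like request."""
    with urllib.request.urlopen(build_image_request(image_url)) as response:
        return response.read()


request = build_image_request("https://example.com/ep1/001.jpg")
# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```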
I have now implemented this!
pip install dlwebtoon
dlwebtoon 'link_to_comic' /path/to/output
One 'episode' of a webcomic is around 10 .jpeg files. I hope to later on implement an optional function that just stitches these images together automatically.
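This feature isn't implemented yet, but one possible approach is to stack the image arrays top to bottom. The sketch below assumes every panel is already loaded as an RGB array of shape (height, width, 3); the function name `stitch_vertically` and the choice to pad narrower panels with white on the right are my own assumptions.

```python
import numpy as np


def stitch_vertically(panels):
    """Stack RGB panels top to bottom into one tall image.

    Panels narrower than the widest one are padded with white on the right
    so that all rows end up the same width.
    """
    width = max(panel.shape[1] for panel in panels)
    padded = []
    for panel in panels:
        extra = width - panel.shape[1]
        padded.append(
            np.pad(
                panel,
                ((0, 0), (0, extra), (0, 0)),  # pad only the width axis
                mode="constant",
                constant_values=255,           # white filler
            )
        )
    return np.concatenate(padded, axis=0)


# Two made-up black panels of different sizes
top = np.zeros((2, 3, 3), dtype=np.uint8)
bottom = np.zeros((4, 2, 3), dtype=np.uint8)
strip = stitch_vertically([top, bottom])
print(strip.shape)
```

The resulting array could then be written out as a single file with the same imageio call used for individual panels.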
It's quite a hassle to download an entire webtoon with the current system. You have to input a new link for every episode of the comic in order to download everything. I want to implement a feature where it'll automatically go through the entire episode list and download all the episodes available. I expect this to be a bit difficult since there's no easy way to identify how many episodes there are for one comic, or whether the comic is 'completed'.
The main purpose of building this program was for me to 1) learn how to build a command line tool and 2) get a small taste of how procedures can be automated.
With these learning objectives in mind, this program was built for people to enjoy free webtoons offline. Please note that using this program to distribute the comics could go against copyright laws.
More details on why "personal use" is legal. The Korean Copyright Act states (translated): Article 30 (Reproduction for Private Use): Where a published work is used personally, or within a limited scope such as the home or its equivalent, and not for profit, the user may reproduce it. However, this does not apply to reproduction by copying equipment installed for use by the public.