/lihkg-keyword-scrapper

Scrape the URLs of LIHKG threads whose titles include specific keywords, and download the contents of those threads.

Primary language: Python

This is the README file for lihkg-keyword-scrapper by @christinesfkao.

Adapted from: Ho, J.C. & Or, N.H.K. (2020). LIHKGr: An application for scraping LIHKG.

Last updated: Nov 2020

Directory

lihkg-scrapper.py	
lihkg.R	 
README.md

Synopsis

python3 lihkg-scrapper.py [keyword]
Rscript lihkg.R [keyword]

LIHKG, also known as 連登, was the most popular internet forum in Hong Kong in 2019.

lihkg-scrapper.py scrapes the URLs of threads whose titles include specific keywords; the results are then passed to LIHKGr with lihkg.R to download the contents of those threads.
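
To illustrate the first step, here is a minimal sketch of filtering scraped threads by title keyword and reducing their URLs to thread ids. It is not the code in lihkg-scrapper.py: the function names are hypothetical, and the URL shape (`/thread/<numeric id>/...`) is an assumption about how LIHKG links look.

```python
import re

# Assumption: thread URLs look like https://lihkg.com/thread/<numeric id>/page/<n>.
THREAD_URL_PATTERN = re.compile(r"/thread/(\d+)")


def title_matches(title, keywords):
    """Return True if any keyword occurs in the title (case-insensitive)."""
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def thread_id_from_url(url):
    """Extract the numeric thread id from a thread URL, or None if there is none."""
    match = THREAD_URL_PATTERN.search(url)
    return match.group(1) if match else None


def select_thread_ids(threads, keywords):
    """Collect ids of matching threads, keeping order and dropping duplicates.

    `threads` is an iterable of (title, url) pairs (hypothetical input shape).
    """
    seen = set()
    ids = []
    for title, url in threads:
        thread_id = thread_id_from_url(url)
        if thread_id and thread_id not in seen and title_matches(title, keywords):
            seen.add(thread_id)
            ids.append(thread_id)
    return ids
```

Deduplicating by id means a thread that shows up on several result pages is only downloaded once in the later R step.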

Environment

Feel free to set up according to your preferences. The following is what I used.

  • macOS Catalina 10.15.7 (x86_64-apple-darwin17.0)
  • Firefox Browser 83.0 (64-bit)
  • Python 3.9.0
  • R version 4.0.3

Before running

  1. Decide on the keyword(s) you're going to search for

    • for python3 lihkg-scrapper.py [keyword], the keyword is read as sys.argv[1] in the script and typed into the search bar during the automation process
    • for Rscript lihkg.R [keyword], the keyword is read in with commandArgs(trailingOnly = TRUE)
  2. Check your environment setup against the Selenium documentation for Python

    • install the Python bindings for Selenium: pip3 install selenium
    • download (and install) the web browser driver that you have chosen
    • no Java server is needed for this scraper
  3. Put the downloaded geckodriver for Firefox (or the driver for your preferred browser) under your desired directory

    • preferred method: set $PATH to include this directory. Special thanks to @shouko for the advice
  4. Change the constants in lihkg-scrapper.py according to your needs:

    • PATH as your desired directory
    • Account USERNAME and PASSWORD for LIHKG (Preferred: apply for a LIHKG account before scraping!)
    • Or you could choose to input this rather sensitive information manually
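
The keyword and credential handling described in the steps above could look roughly like the following. This is a sketch under assumptions, not the actual contents of lihkg-scrapper.py: `read_keyword` and `read_credentials` are hypothetical helper names, and the empty-string defaults stand in for the USERNAME and PASSWORD constants when you prefer to type them in manually.

```python
import getpass


def read_keyword(argv):
    """Return the keyword passed as the first command-line argument (sys.argv[1])."""
    if len(argv) < 2 or not argv[1].strip():
        raise SystemExit("usage: python3 lihkg-scrapper.py [keyword]")
    return argv[1].strip()


def read_credentials(username="", password="", ask=input, ask_secret=getpass.getpass):
    """Use the given constants if they are set; otherwise prompt for the missing ones.

    `ask` and `ask_secret` are injectable only so the prompts can be replaced in tests.
    """
    if not username:
        username = ask("LIHKG username: ")
    if not password:
        # getpass hides the typed password instead of echoing it to the terminal.
        password = ask_secret("LIHKG password: ")
    return username, password
```

In a script this would be called as `keyword = read_keyword(sys.argv)` followed by `username, password = read_credentials(USERNAME, PASSWORD)`. Prompting at run time keeps the password out of the source file, which matters if the script is ever committed or shared.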

Outputs

  • The Python script outputs thread ids into a .txt file, one id per line
  • The R script outputs the contents of the threads into a .xlsx file
    • the ids saved in the .txt file are read in as a vector and passed to LIHKGr
    • lihkg.R downloads the contents of the threads into an .xlsx file.
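
The one-id-per-line .txt format is simple enough to sketch. The helpers below are hypothetical (the names and the UTF-8 encoding are assumptions, not taken from the script); they show how ids might be written out and read back as a list before being handed to the R step.

```python
from pathlib import Path


def write_thread_ids(ids, path):
    """Write thread ids to a text file, one id per line."""
    text = "".join(f"{thread_id}\n" for thread_id in ids)
    Path(path).write_text(text, encoding="utf-8")


def read_thread_ids(path):
    """Read thread ids back from the file, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Ending every line with a newline, including the last one, keeps the file easy to read line by line from other tools.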

To-dos

I haven't finished the R code for logging in yet. Once that is done, perhaps the entire process can be run from a single R script. PRs welcome!