
Primary language: Go. License: Apache License 2.0 (Apache-2.0).


This project is an agent that scrapes and parses web pages, then writes the results to a storage backend.
Its first target is news from Korean portals (e.g., Naver and Daum).


Target page -------------------------|
  |- 1st Sub ------------------------|
       |- 2nd Sub -------------------|
             |- Nth visitable -------|=> asynchronous scraper (crawler) module
       |- ... -----------------------|                  |
  |- 1st Sub ------------------------|                  |=> asynchronous emitter (store or relay) module
  |- ... ----------------------------|
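
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `runPipeline`, `collect`, and the stand-in `scrape`/`emit` functions are hypothetical names, and the sketch only assumes that scrapers run concurrently and hand their results to emitters through a bounded queue.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runPipeline is a hypothetical helper: it scrapes every page in its own
// goroutine and passes the results to emitter goroutines via a bounded queue.
func runPipeline(pages []string, queueSize, emitters int,
	scrape func(string) string, emit func(string)) {
	queue := make(chan string, queueSize)

	var scrapers sync.WaitGroup
	for _, p := range pages {
		scrapers.Add(1)
		go func(page string) {
			defer scrapers.Done()
			queue <- scrape(page) // blocks while the queue is full
		}(p)
	}

	var consumers sync.WaitGroup
	for i := 0; i < emitters; i++ {
		consumers.Add(1)
		go func() {
			defer consumers.Done()
			for item := range queue {
				emit(item)
			}
		}()
	}

	scrapers.Wait() // all pages scraped
	close(queue)    // lets the emitters finish draining
	consumers.Wait()
}

// collect runs the pipeline with stand-in functions and returns what was
// emitted, sorted because the emission order is not deterministic.
func collect(pages []string) []string {
	var mu sync.Mutex
	var out []string
	runPipeline(pages, 10, 2,
		func(p string) string { return "parsed:" + p },
		func(s string) {
			mu.Lock()
			out = append(out, s)
			mu.Unlock()
		})
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(collect([]string{"https://news.naver.com", "https://news.daum.net"}))
}
```

The bounded channel applies back-pressure: scrapers pause when the emitters cannot keep up.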

Scraper Modules

This is an extendable parsing handler.
Data is crawled through the Scraper interface.


Emitter Modules

This is an extendable output module, accessed through the Emitter interface.
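
As a rough sketch of an extendable output module, the example below defines a hypothetical `Emitter` interface with a single `Emit` method and one implementation that writes to any `io.Writer`. The names `Emitter`, `writerEmitter`, and `emitToString` are assumptions for illustration and are not taken from the project.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Emitter is a hypothetical interface for this sketch; the project's real
// interface may look different.
type Emitter interface {
	Emit(items []string) error
}

// writerEmitter writes one item per line to the wrapped io.Writer.
type writerEmitter struct{ w io.Writer }

func (e writerEmitter) Emit(items []string) error {
	for _, it := range items {
		if _, err := fmt.Fprintln(e.w, it); err != nil {
			return err
		}
	}
	return nil
}

// emitToString shows that the destination can be swapped without touching
// the caller: here the items end up in an in-memory string.
func emitToString(items []string) (string, error) {
	var sb strings.Builder
	var e Emitter = writerEmitter{w: &sb}
	err := e.Emit(items)
	return sb.String(), err
}

func main() {
	var e Emitter = writerEmitter{w: os.Stdout}
	if err := e.Emit([]string{"headline A", "headline B"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

A store such as Elasticsearch, or a relay to another service, would be another type implementing the same method.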


How to build

Preparing dependencies

  • GachiFinder requires Go 1.15 or later (as of the current build version).
  • On Windows, the "make" command is not available by default,
    so you need to download and install GNU Make for Windows.

Run from the source code

Tested operating systems: Linux, macOS (darwin), Windows

# If you're on Windows, run "Git Bash" and type the following.

$ git clone https://github.com/seversky/gachifinder.git
$ cd gachifinder
$ make all # or one of "windows", "darwin" and "linux".

If the build succeeds, you can see the binary (the Windows build is shown here).

$ cd $GACHIFINDER_FOLDER/cmd/gachifinder/windows
$ ls

You can run it by referring to the help options.

$ ./gachifinder.exe -h
  C:\Users\...\go\src\github.com\seversky\gachifinder\cmd\gachifinder\windows\gachifinder.exe [OPTIONS]

Options for GachiFinder

Application Options:
  /c, /config_path:  Path To configure (default: ../config/gachifinder.yml)
  /t, /test          To test for crawling via a scraper only
                     (Without an emitter module)
  /v, /version       Show GachiFinder version and git revision id

Help Options:
  /?                 Show this help message
  /h, /help          Show this help message
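
For readers who want to see how such options map to code, the following is a minimal sketch using Go's standard `flag` package. This is a swapped-in technique chosen for illustration: the help output above does not tell us which option parser GachiFinder actually uses, and `options` and `parseOptions` are hypothetical names. Only the short option names and the default config path are taken from the help text.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the three application options listed in the help text.
type options struct {
	ConfigPath string
	TestOnly   bool
	Version    bool
}

// parseOptions is a hypothetical helper built on the standard flag package.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("gachifinder", flag.ContinueOnError)
	fs.StringVar(&o.ConfigPath, "c", "../config/gachifinder.yml", "path to the configuration file")
	fs.BoolVar(&o.TestOnly, "t", false, "crawl via a scraper only, without an emitter module")
	fs.BoolVar(&o.Version, "v", false, "show version and git revision id")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```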

Options: gachifinder.yml

global:
  max_used_cores: 0 # If zero (0), all cores are used.
  interval: 5 # Crawling interval (unit: min)

  log:
    log_level: debug # one of trace, debug, info, warn[ing], error, fatal or panic
    stdout: true
    format: 'text' # one of "text" or "json"
    # go_time_format: '2006-01-02T15:04:05.999Z07:00' # default=RFC3339, refer to https://golang.org/src/time/format.go
    force_colors: true

    log_path: './log/gachifinder.log'
    max_size: 50 # Max megabytes before log is rotated
    max_age: 7 # Max number of days to retain log files
    max_backups: 3 # Max number of old log files to keep
    compress: true

scraper:
  visit_domains:
    - https://news.naver.com
    - https://news.daum.net
  # allowed_domains:
  #   - https://news.naver.com
  #   - https://news.daum.net
  user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
  max_depth_to_visit: 1

  async: true # Async turns on asynchronous network communication. Use Collector.Wait() to be sure all requests have been finished.

  parallelism: 20 # The maximum number of concurrent requests allowed for the matching domains.
  delay: 1 # The duration to wait before creating a new request to the matching domains (unit: sec).
  random_delay: 5 # The extra randomized duration added to delay before creating a new request (unit: sec).

  consumer_queue_threads: 2 # The number of consumer queue threads
  consumer_queue_max_size: 10 # Max size of consumer queue

emitter:
  elasticsearch:
    hosts:
      - http://elasticsearch:9200
    username: elastic
    password: changeme
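
As an example of how one of these settings might be applied, the sketch below resolves `max_used_cores` to an actual core count. It only assumes what the comment in the configuration states (zero means all cores); treating negative or too-large values as "all cores" is an extra assumption made here, and `resolveCores` is a hypothetical helper, not a function from the project.

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveCores maps the max_used_cores setting to a usable core count.
// Zero means all cores, as the config comment says. Falling back to all
// cores for out-of-range values is an assumption of this sketch.
func resolveCores(maxUsedCores int) int {
	n := runtime.NumCPU()
	if maxUsedCores <= 0 || maxUsedCores > n {
		return n
	}
	return maxUsedCores
}

func main() {
	cores := resolveCores(0)
	runtime.GOMAXPROCS(cores) // cap the number of OS threads running Go code
	fmt.Println("using cores:", cores)
}
```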

Simple test for scraping

# Go into the folder where the built binary of gachifinder is.
$ cd $GACHIFINDER_FOLDER/cmd/gachifinder/windows
$ ./gachifinder.exe -t
INFO[2021-05-28T15:13:41+09:00] Show All Configurations                       config.Emitter.Elasticsearch.Hosts="[http://elasticsearch:9200]" config.Emitter.Elasticsearch.Password=changeme config.Emitter.Elasticsearch.Username=elastic config.Global.Interval=5 config.Global.Log.Compress=true config.Global.Log.ForceColors=true config.Global.Log.Format=text config.Global.Log.GoTimeFormat= config.Global.Log.LogLevel=debug config.Global.Log.LogPath=./log/gachifinder.log config.Global.Log.MaxAge=7 config.Global.Log.MaxBackups=3 config.Global.Log.MaxSize=50 config.Global.Log.Stdout=true config.Global.MaxUsedCores=0 config.Scraper.AllowedDomains="[]" config.Scraper.Async=true config.Scraper.ConsumerQueueMaxSize=10 config.Scraper.ConsumerQueueThreads=2 config.Scraper.Delay=1 config.Scraper.MaxDepthToVisit=1 config.Scraper.Parallelism=20 config.Scraper.RandomDelay=5 config.Scraper.UserAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" config.Scraper.VisitDomains="[https://news.naver.com https://news.daum.net]"
INFO[2021-05-28T15:13:41+09:00] Application Initializing                      1-GO Runtime Version=go1.16.3 2-System Arch=amd64 3-GachiFider version=0.1.0 4-GachiFider revision number=c44ad459 5-Number of used CPUs=8
INFO[2021-05-28T14:56:34+09:00] Begin crawling
INFO[2021-05-28T14:56:34+09:00] visiting https://news.daum.net
INFO[2021-05-28T14:56:34+09:00] visiting https://news.naver.com

Using Docker with Elasticsearch and Kibana for a test environment on Linux.

This assumes Docker and Docker Compose are already installed, so installing them is not covered here.

To run Elasticsearch and Kibana, run the command below.

$ docker-compose up -d --build

If the Elasticsearch account has been changed, you also need to update it in docker/kibana/kibana.yml.