n-gram

Sina News Crawler and Word Segmentation


Sina news crawler + word segmentation

Project Structure

Usage

  • Method 1 (without news material)
    • Set the start date and end date for the news crawler
    • Run the crawler
    • Set the same start date and end date in nGram.py
    • Set the parameters (Frequency, Freedom, Condensation) in nGram.py; a configuration sketch follows this list
    • Run nGram.py and wait for it to finish
    • 1Gram.txt through 5Gram.txt are generated when nGram.py ends
  • Method 2 (with news material)
    • Set the parameters (Frequency, Freedom, Condensation) in nGram.py
    • Run nGram.py and wait for it to finish
    • 1Gram.txt through 5Gram.txt are generated when nGram.py ends
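Both methods assume a handful of settings edited directly in nGram.py. The README does not show the actual variable names, so the following is only a sketch of the kind of configuration involved; every identifier and value here is hypothetical:

```python
# Hypothetical settings -- the real identifiers in nGram.py may differ.

# Date range to process (Method 1 also uses these dates for the crawler).
START_DATE = "2016-05-01"
END_DATE = "2016-05-14"

# Thresholds for the three measurements described under Advantages below.
MIN_FREQUENCY = 5         # minimum raw count of a candidate n-gram
MIN_CONDENSATION = 100.0  # minimum internal cohesion score
MIN_FREEDOM = 1.5         # minimum boundary entropy, in bits
```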

Advantages

  • Several crawling obstacles are handled, such as
    • gzip-compressed responses (see the decompression sketch after this list)
    • Extra HTML attributes inside tags (some pages nest tags more than 1,000 levels deep, which can make a regular expression hang)
    • I use regular expressions for parsing instead of the required HTTPParser
  • n-gram word segmentation
    • references
    • Three measurements decide whether a candidate n-gram is a word (see the sketch after this list)
      • Word frequency
      • Condensation (i.e., "电影院" is not "电" + "影院" or "电影" + "院")
      • Freedom (i.e., "伊拉克" is not "伊拉", nor "拉克")
  • Well commented
    • Almost every line of code has a comment
  • Segmentation quality for 2-character and 3-character words is very good
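For the gzip point above: Sina can serve gzip-compressed response bodies, and a crawler that treats those raw bytes as HTML will fail. A minimal sketch of handling this with Python's standard library, not the project's actual crawler code:

```python
import gzip
import urllib.request

def fetch(url):
    """Fetch a page, transparently decompressing a gzip response body."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        data = response.read()
        # urllib does not decompress for us, so check the response header.
        if response.headers.get("Content-Encoding") == "gzip":
            data = gzip.decompress(data)
    return data
```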
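The three measurements follow the common frequency / cohesion / boundary-entropy approach to unsupervised segmentation. A simplified sketch of the latter two, using the examples from the list above; the scoring functions are illustrative, not the project's actual code:

```python
import math

def condensation(word, prob):
    """Cohesion: P(word) relative to its most plausible two-way split.
    "电影院" scores high because it occurs far more often than the
    products P("电")*P("影院") or P("电影")*P("院") would predict."""
    best_split = max(prob[word[:i]] * prob[word[i:]]
                     for i in range(1, len(word)))
    return prob[word] / best_split

def freedom(neighbor_counts):
    """Boundary entropy of the characters seen next to a candidate.
    "伊拉克" has varied neighbors on both sides, so high entropy; the
    fragment "伊拉" is almost always followed by "克", so its right-side
    entropy is near zero and it is rejected."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())
```

For example, freedom({"克": 98, "是": 2}) is roughly 0.14 bits, far below any sensible threshold, which is why a fragment like "伊拉" does not survive as a word.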

Disadvantages

  • The crawler may hit encoding problems; some are on Sina's side, but some come from my decoding method, since not every page is encoded in GB2312 (a fallback decoding sketch follows this list)
  • n-gram segmentation needs a large amount of memory, even though I've applied some memory-control measures
  • The time complexity of the n-gram segmentation could be improved, though probably at the cost of even more memory
  • Function words are not handled, e.g. the 3-character candidate "激烈的" ("intense" plus the particle 的)
  • Quality for 4-character and 5-character words is relatively poor: no 5-character words were found in 200 MB of news material, even after I lowered the threshold for 5-character words
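For the encoding issue in the first point, one common mitigation is to try GB18030 first, a superset of GB2312 that accepts characters a strict gb2312 codec rejects, and fall back from there. A minimal sketch, not the crawler's actual decoding logic:

```python
def decode_page(raw):
    """Decode raw page bytes, trying the encodings Sina pages tend to use."""
    # GB18030 is a superset of GB2312, so it covers everything gb2312
    # would and more; UTF-8 covers the remaining pages.
    for encoding in ("gb18030", "utf-8"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep crawling instead of crashing on one bad page.
    return raw.decode("gb18030", errors="replace")
```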

Authored by Chen Letian on 2016.05.14