luochonghai/Spider

for question_similarity

Spider_for_baiduzhidao

1.Description

  Written in Python 3, this spider collects text data (questions and answers) from zhidao.baidu.com (spider180311).
  It is now being extended to collect data from wenshu.court.gov.cn, a website that loads its data asynchronously with JavaScript.

2.Migration

  As shown on the website, adjust file_path and target_url to collect different types of writs from wenshu.court.gov.cn.
  PS: The government website's server seems weaker than I expected, and it tends to block your IP after it receives
  just a few requests from it.
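
One way to lower the risk of an IP block is to space out requests and back off after a failure. The sketch below is only an illustration, not the code used in this repository: the function name `fetch_politely`, its parameters, and the default delay are all assumptions, and the actual download is passed in as a `fetch` callable so the pacing logic stays independent of any HTTP library.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, max_retries=3, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests.

    Hypothetical helper: `fetch` is any callable that returns the page
    content and raises OSError on network failure (urllib.error.URLError
    is a subclass of OSError).
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                # Back off: wait longer after each failed attempt.
                sleep(delay_seconds * 2 ** (attempt + 1))
        # Fixed pause before moving on to the next URL.
        sleep(delay_seconds)
    return results
```

With the standard library, `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=10).read()`. A suitable delay has to be found by experiment, since the site's actual limits are not known here.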

3.Advanced

  Use the data crawled from the website to build a corpus, then convert each string to a vector and compute the cosine similarity between them.
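
The string-to-vector step can be sketched as follows. This is a minimal example under stated assumptions, not the method used in this repository: it represents each string as character counts (a simple choice for Chinese text, which has no spaces between words), and the names `to_vector` and `cosine_similarity` are hypothetical. A real pipeline would more likely use word segmentation and TF-IDF weights.

```python
import math
from collections import Counter


def to_vector(text):
    # Assumption: character-level bag of counts, ignoring whitespace.
    return Counter(ch for ch in text if not ch.isspace())


def cosine_similarity(a, b):
    # Dot product over the characters that both vectors share.
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty string has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)
```

Two questions are then compared with `cosine_similarity(to_vector(q1), to_vector(q2))`: values near 1 mean the strings share most of their characters, and 0 means they share none.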