This spider is written in Python 3 and was originally used to collect text data (questions and answers) from zhidao.baidu.com. (spider180311)
The current goal is to collect data from wenshu.court.gov.cn, a website that uses JavaScript to load its data asynchronously.
2.Migration
Adjust file_path and target_url to match what the website shows, and you can fetch different sources of writs from wenshu.court.gov.cn.
PS: The server behind this government website seems weaker than I expected, and it tends to block your IP after
only a few requests.
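
One common way to stay under such a limit is to pause between requests and back off after a failure. The sketch below is only an illustration, not the spider's actual code: fetch_with_throttle, backoff_delay, and every delay value are hypothetical, and whether any of this avoids the block on wenshu.court.gov.cn is untested.

```python
import random
import time


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff: base * 2**attempt seconds, never more than cap."""
    return min(cap, base * (2 ** attempt))


def fetch_with_throttle(fetch, url, retries=3, pause=3.0, sleep=time.sleep):
    """Call fetch(url) politely.

    fetch is any callable that returns the page text or raises an exception
    (hypothetical; plug in whatever the spider uses to download a page).
    """
    last_error = None
    for attempt in range(retries):
        # Wait before every request; the random jitter avoids a fixed rhythm.
        sleep(pause + random.uniform(0.0, 1.0))
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            # The request failed, so wait longer before the next attempt.
            sleep(backoff_delay(attempt))
    raise RuntimeError("gave up on " + url) from last_error
```

Pass your own download function as fetch. The pause and retries defaults are guesses and probably need tuning against the real server.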
3.Advanced
Use the data grabbed from the website to build a corpus, then convert each string to a vector and calculate the cosine similarity between them.
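
The string-to-vector step and the similarity calculation can be sketched as follows. This is a minimal example under stated assumptions, not the project's implementation: to_vector and cosine_similarity are hypothetical names, and the vectors are plain character counts (a simple stand-in that works on unsegmented Chinese text), where a real pipeline might use word segmentation or TF-IDF weights instead.

```python
import math
from collections import Counter


def to_vector(text):
    """Turn a string into a sparse vector of character counts (assumed scheme)."""
    return Counter(ch for ch in text if not ch.isspace())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only characters present in both vectors contribute to the dot product.
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty string has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity(to_vector(doc_a), to_vector(doc_b)) gives a score between 0 and 1 for two documents from the corpus, with higher values meaning more shared characters.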