Labs for the 2019 Web Information Processing and Application course at USTC.
$ git clone https://github.com/IcePear-Jzx/USTC-Web-Info.git --depth 1
Note: Only part of the data is retained in this repository. You need to find and download the complete data yourself.
Douban is a community website where users comment on books, movies, music, and so on. The goal of this lab is to use a web spider to collect book information from Douban Books.
- Get information for the top 250 books.
- Get all books' URLs.
- Design a distributed or parallel crawling technique.
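A minimal sketch of the parallel part, using a thread pool from the Python standard library. This is not the repository's crawler: the paginated list URL (25 books per page, selected with a `start` offset), the `User-Agent` header, and the function names `top250_page_urls`, `fetch`, and `crawl_parallel` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Assumption: the Top 250 list is paginated, 25 books per page, via `start`.
TOP250_URL = "https://book.douban.com/top250?start={}"


def top250_page_urls(page_size=25, total=250):
    """Build the list-page URLs that together cover the top `total` books."""
    return [TOP250_URL.format(start) for start in range(0, total, page_size)]


def fetch(url, timeout=10):
    """Download one page and return its HTML as text."""
    # Assumption: a browser-like User-Agent is sent with each request.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_parallel(urls, fetch_fn=fetch, max_workers=4):
    """Fetch all URLs with a thread pool and return a {url: html} dict.

    pool.map yields results in the order of `urls`, so zip pairs them correctly.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch_fn, urls)))
```

Calling `crawl_parallel(top250_page_urls())` would download the ten list pages concurrently. Threads suit this workload because it is I/O-bound; a real crawler would also need rate limiting, retries, and HTML parsing, which are omitted here.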
More details: https://git.bdaa.pro/yxonic/data-specification/wikis/豆瓣%20书评
Given a set of documents and queries, return the 20 most relevant documents for each query.
- Documents are in `test_docs.csv`; each item has `doc_id`, `doc_url`, `doc_title`, and `content`.
- Queries are in `test_querys.csv`; each item has `query` and `query_id`.
- Submission format is shown in `submission.csv`.
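One possible baseline for this task is TF-IDF weighting with cosine similarity, sketched below in pure Python. The lab does not prescribe this method; the whitespace tokenizer is a placeholder (the real data may need a proper word segmenter), and the names `tokenize`, `build_index`, `cosine`, and `search` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a placeholder for a real tokenizer."""
    return text.lower().split()


def build_index(docs):
    """docs: {doc_id: text}. Returns (TF-IDF vector per document, IDF table)."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    # Smoothed IDF so that every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf, k=20):
    """Return the ids of the k documents most similar to the query."""
    # Query terms that never occur in the collection are ignored.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items()}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    # Highest score first; ties are broken by doc_id for determinism.
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [doc_id for _, doc_id in scored[:k]]
```

With `vectors, idf = build_index(docs)` built once from the documents, `search(query, vectors, idf)` can be called for each query and its result written out in the submission format.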
More details: http://staff.ustc.edu.cn/~tongxu/webinfo/slides/exp1.pdf
A clinical named entity recognition (CNER) task from CCKS 2019.
- Train set is given in `train.txt`; each line is in JSON format with `originalText` and `entities`.
- Test set is given in `test.txt`; each line is in JSON format with `originalText` and `textId`.
- Recognize entities in test set and record them in CSV format; each row includes `textId`, `label_type`, `start_pos`, and `end_pos`.
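The input and output handling described above can be sketched as follows. The recognizer is a toy dictionary matcher standing in for a real model, and several details are assumptions rather than part of the task specification: `end_pos` is treated as exclusive, a header row is written by default, and the names `recognize`, `predictions_to_csv`, and the example lexicon are hypothetical.

```python
import csv
import json


def recognize(text, lexicon):
    """Toy dictionary matcher: find every occurrence of each lexicon term.

    Returns (label_type, start_pos, end_pos) tuples.
    Assumption: end_pos is exclusive, so text[start_pos:end_pos] is the entity.
    """
    spans = []
    for term, label in lexicon.items():
        start = text.find(term)
        while start != -1:
            spans.append((label, start, start + len(term)))
            start = text.find(term, start + 1)
    return sorted(spans, key=lambda span: (span[1], span[2]))


def predictions_to_csv(test_lines, lexicon, out, header=True):
    """Read JSON lines from the test set and write one CSV row per entity."""
    writer = csv.writer(out)
    if header:
        # Assumption: the submission file starts with a header row.
        writer.writerow(["textId", "label_type", "start_pos", "end_pos"])
    for line in test_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for label, start, end in recognize(record["originalText"], lexicon):
            writer.writerow([record["textId"], label, start, end])
```

In practice `test_lines` would be an open handle on `test.txt` and `out` a file opened with `newline=""`, with `recognize` replaced by a trained sequence labeler.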
More details: http://staff.ustc.edu.cn/~tongxu/webinfo/slides/exp2.pdf
The data consists of user ratings of films and books on Douban. Infer user preferences from the rating information in the training data, and predict scores for the user-item pairs in the test data.
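A simple starting point for this kind of rating prediction is a bias baseline: the global mean plus a per-item offset and a per-user offset. The sketch below is one such baseline, not the method required by the lab; the 1 to 5 rating range, the regularization constant `reg`, and the names `fit_bias_baseline` and `predict` are assumptions.

```python
from collections import defaultdict


def fit_bias_baseline(ratings, reg=5.0):
    """Fit mean + item bias + user bias from (user, item, rating) triples.

    `reg` shrinks the biases of users and items with few ratings toward zero.
    """
    ratings = list(ratings)
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: regularized average deviation from the global mean.
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for _, item, r in ratings:
        item_sum[item] += r - mu
        item_cnt[item] += 1
    item_bias = {i: item_sum[i] / (reg + item_cnt[i]) for i in item_sum}

    # User bias: regularized average of what the item bias leaves unexplained.
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for user, item, r in ratings:
        user_sum[user] += r - mu - item_bias[item]
        user_cnt[user] += 1
    user_bias = {u: user_sum[u] / (reg + user_cnt[u]) for u in user_sum}

    return mu, user_bias, item_bias


def predict(user, item, model, lo=1.0, hi=5.0):
    """Score one user-item pair; unseen users or items get a zero bias."""
    mu, user_bias, item_bias = model
    score = mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)
    # Assumption: ratings lie in [lo, hi], so predictions are clipped to it.
    return min(hi, max(lo, score))
```

After `model = fit_bias_baseline(train_triples)`, each test pair is scored with `predict(user, item, model)`. Matrix factorization or neighborhood methods can then be compared against this baseline.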
More details: http://staff.ustc.edu.cn/~tongxu/webinfo/slides/exp3.pdf