USTC-Web-Info

Labs for the 2019 Web Information Processing and Application course at USTC.

Download

$ git clone https://github.com/IcePear-Jzx/USTC-Web-Info.git --depth 1

Notice: Only part of the data remains in this repository. You need to find and download the complete data yourself.

Lab0 Web Spider

Douban is a community website where users comment on books, movies, music and so on. The goal of this lab is to use a web spider to collect book information from Douban Books.

Get the information of the top 250 books.
Get the URLs of all books.
Design a distributed or parallel crawling technique.
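
The parallel step can be sketched with a thread pool from the Python standard library. This is only a minimal illustration, not the lab's actual spider: the listing URL pattern in BASE, the page size of 25 and the User-Agent header are assumptions, and the helper names page_urls, fetch and crawl are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Assumed URL pattern for the top-250 listing: 10 pages of 25 books each.
BASE = "https://book.douban.com/top250?start={}"


def page_urls(total=250, per_page=25):
    """Build the listing page URLs that together cover `total` books."""
    return [BASE.format(start) for start in range(0, total, per_page)]


def fetch(url, timeout=10):
    """Download one page and return its HTML as text."""
    # A browser-like User-Agent is an assumption; the site may require more.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(urls, fetch_fn=fetch, workers=4):
    """Fetch all URLs concurrently and return a {url: html} mapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `urls`.
        return dict(zip(urls, pool.map(fetch_fn, urls)))
```

Passing fetch_fn as a parameter keeps the concurrency logic separate from the network code, so a different downloader (or a stub for offline testing) can be swapped in. A real crawler would also need rate limiting and retry handling.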

More details: https://git.bdaa.pro/yxonic/data-specification/wikis/豆瓣%20书评

Lab1 Search Engine

Given a set of documents and queries, return the 20 most relevant documents for each query.

Documents are in test_docs.csv; each item has doc_id, doc_url, doc_title and content.
Queries are in test_querys.csv; each item has query and query_id.
The submission format is shown in submission.csv.
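
One common way to rank documents for a query is TF-IDF weighting with cosine similarity. The sketch below is an assumed baseline, not necessarily the method used in this repository. It tokenizes on whitespace, which is an assumption that would need replacing with a proper segmenter for Chinese text, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumed tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """docs: {doc_id: text}. Returns (tf-idf vectors per doc, idf table)."""
    tfs = {d: Counter(tokenize(t)) for d, t in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for tf in tfs.values() for term in tf)
    n = len(docs)
    # Smoothed idf, so every indexed term keeps a positive weight.
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vecs = {d: {t: f * idf[t] for t, f in tf.items()} for d, tf in tfs.items()}
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, vecs, idf, k=20):
    """Return the ids of the k highest-scoring documents for the query."""
    q = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), d) for d, v in vecs.items()]
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:k]]
```

With k=20 this returns 20 document ids per query whenever the collection holds at least 20 documents, even if some of them score zero.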

More details: http://staff.ustc.edu.cn/~tongxu/webinfo/slides/exp1.pdf

Lab2 Entity Recognition

A clinical named entity recognition (CNER) task from CCKS 2019.

The training set is given in train.txt; each line is a JSON object with originalText and entities.
The test set is given in test.txt; each line is a JSON object with originalText and textId.
Recognize the entities in the test set and record them in CSV format; each row includes textId, label_type, start_pos and end_pos.
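
The input and output handling can be sketched as follows. This is a deliberately simple dictionary-matching baseline, not the lab's model. It assumes that each training entity carries label_type, start_pos and end_pos, that end_pos is exclusive, and that the CSV starts with a header row; all function names are hypothetical.

```python
import csv
import json


def build_lexicon(train_lines):
    """Map each entity surface string seen in training to its label_type."""
    lexicon = {}
    for line in train_lines:
        record = json.loads(line)
        text = record["originalText"]
        for entity in record["entities"]:
            # Assumption: end_pos is exclusive, so this slice is the mention.
            surface = text[entity["start_pos"]:entity["end_pos"]]
            if surface:
                lexicon[surface] = entity["label_type"]
    return lexicon


def recognize(text, lexicon):
    """Return (label_type, start, end) for non-overlapping lexicon matches."""
    taken = [False] * len(text)
    found = []
    # Try longer strings first so they win over their own substrings.
    for surface in sorted(lexicon, key=len, reverse=True):
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            if not any(taken[start:end]):
                found.append((lexicon[surface], start, end))
                taken[start:end] = [True] * (end - start)
            start = text.find(surface, start + 1)
    return sorted(found, key=lambda match: match[1])


def write_submission(test_lines, lexicon, out):
    """Write one CSV row per recognized entity to the file-like `out`."""
    writer = csv.writer(out)
    writer.writerow(["textId", "label_type", "start_pos", "end_pos"])
    for line in test_lines:
        record = json.loads(line)
        for label, start, end in recognize(record["originalText"], lexicon):
            writer.writerow([record["textId"], label, start, end])
```

A sequence labeling model would replace recognize, while the JSON reading and CSV writing around it stay the same.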

More details: http://staff.ustc.edu.cn/~tongxu/webinfo/slides/exp2.pdf

Lab3 Recommender System

The data consists of ratings of films and books on Douban. Infer each user's preferences from the rating information in the training data, then predict a score for every user-item pair in the test data.
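
A simple starting point for scoring user-item pairs is a bias baseline: the global mean rating plus a per-user and a per-item offset. The sketch below is an assumed baseline rather than the approach used in this lab. The 1 to 5 rating range, the regularization value and the function names are all assumptions.

```python
from collections import defaultdict


def fit_baseline(ratings, reg=5.0):
    """ratings: list of (user, item, score). Returns (mu, user_bias, item_bias)."""
    mu = sum(score for _, _, score in ratings) / len(ratings)

    # Item bias: regularized mean deviation of an item's ratings from mu.
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for _, item, score in ratings:
        item_sum[item] += score - mu
        item_cnt[item] += 1
    item_bias = {i: item_sum[i] / (reg + item_cnt[i]) for i in item_sum}

    # User bias: regularized mean of what is left after mu and the item bias.
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for user, item, score in ratings:
        user_sum[user] += score - mu - item_bias[item]
        user_cnt[user] += 1
    user_bias = {u: user_sum[u] / (reg + user_cnt[u]) for u in user_sum}

    return mu, user_bias, item_bias


def predict(model, user, item, lo=1.0, hi=5.0):
    """Predict a score, clipped to the assumed rating range [lo, hi]."""
    mu, user_bias, item_bias = model
    raw = mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)
    return min(hi, max(lo, raw))
```

Users or items that never appear in the training data fall back to a bias of zero, so their prediction is the global mean. Collaborative filtering or matrix factorization can then be layered on top of these baseline predictions.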

More details: http://staff.ustc.edu.cn/~tongxu/webinfo/slides/exp3.pdf
