/track2

Primary LanguagePython

KDD CUP 2013 - Track 2
======================
Copyright 2013 Cheng-Xia Chang, Wei-Cheng Chang, Wei-Sheng Chin, Kuan-Hao Huang, Yu-Chin Juan,
Tzu-Ming Kuo, Chun-Liang Li, Chih-Jen Lin, Hsuan-Tien Lin, Shan-Wei Lin,
Shou-De Lin, Ting-Wei Lin, Young-San Lin, Yu-Chen Lu, Yu-Chuan Su, Cheng-Hao Tsai,
Hsiao-Yu Tung, Jui-Pin Wang, Cheng-Kuang Wei, Felix Wu, Chun-Pai Yang, Tu-Chun Yin,
Tong Yu, and Yong Zhuang.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.




This package is developed at National Taiwan University.
Our approach includes three different author-matching algorithms,
'main1', 'main2' and 'typo', and the outputs of these three algorithms are merged
using post-processing scripts in 'merge'. To simplify the usage, we integrate
all necessary processes into a Makefile in the top directory, so users could
easily type 'make' to activate the whole process. Moreover, 'buff/main1.csv'
and 'buff/main2.csv' are results of these two algorithms respectively.
Because our algorithm relies on some other open source packages,
please read the following statements before you get started. The detail of
these methods will be introduced in the paper which will be published in mid July.

A. Package organization
    'main1' (One author matching algorithm)
    'main2' (Another author matching algorithm)
    'typo' (A matching algorithm for detecting duplicates that can not be found by 'main1' and 'main2' because of typos)
    'merge' (Scripts for merging the results of above methods)


B. Performance in KDD-Cup 2013(F1 score):
            | main1 | main2 | merge
public(20%) |0.99186|0.99071|0.99195
private(80%)|0.99198|0.99083|0.99202
                

C. Make a prediction step-by-step:
    1. Requirements and dependency:
        Our package runs under Ubuntu 10.04 and requires the following packages:
            1-1. Python2 (test with version 2.6.5)
            1-2. Python3 (test with version 3.3.1)
            1-3. Perl5 (test with version 5.10)
            1-4. Perl module Text::CSV
            1-5. Raw data (dataRev2.zip). The zipped file should be stored at the top directory of this package 

    2. Run:
        Type 'make'.

    3. Result:
        The result file is 'final.csv'. Please notice that algorithm
        may take more than 2 hours to generate the result file.


D. Public resources used. They are included in this package.
    1. Chinese information for 'main1'. This information is used for Eastern and Western name identification. 
        1-1. Chinese family name
            We have two lists of Chinese family names. The smaller one, TW.raw,
            is the official romanization of first 100 common Chinese name in
            Taiwan. The larger one, CN.raw, including 506 common Chinese names and
            their romanization, is downloaded from Wikipedia.
            Links:
                "http://tc.wangchao.net.cn/xinxi/detail_1855256.html" and romanization in "http://www.boca.gov.tw/mp?mp=1"
                https://zh.wikipedia.org/wiki/中文姓氏羅馬字標注
                http://www.greatchinese.com/surname/surname.htm
        1-2. Korean family name
            KR.raw contains 20 common Korean first names and their romanization.
            Links:
                http://mirror.enha.kr/wiki/한국인%20이름의%20로마자%20표기
        1-3. Common romanization of Chinese tokens
            Links:
                http://www.pinyin.info/romanization/compare/gwoyeu_romatzyh.html 
                http://en.wikipedia.org/wiki/Comparison_of_Chinese_romanization_systems
        In these tokens, we manually select 45 tokens frequently appeared in both English and
        Chinese.

    2. Chinese information for 'main2'.
        2-1. Chinese family name
            Link:
                http://www.chineseinla.com/lastname/key_ng.html
        2-2. Common romanization of Chinese tokens
            Link:
                 http://irw.ncut.edu.tw/general/chen813/羅馬拼音/中文羅馬拼音對照表.htm

    3. Nick names. We substitute all nick names before we do any matching.
        Link for 'main1': http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html
                          http://mentalfloss.com/article/24761/origins-10-nicknames
        Link for 'main2': https://code.google.com/p/author-dedupe/

    4. List of stop words used in the merge step
        Links:
            http://nlp.stanford.edu/software/tmt/tmt-0.4/