Plagiarism Detection Approach for PAN 2015 Text Alignment task

Primary language: Python. License: MIT.

plagiarism_detection_pan2015

This system is the implementation detailed in [1] and [2] for the Text Alignment task at PAN 2015.

  1. REQUIREMENTS

To use the algorithm, you need to install the following Python modules:

  1. USAGE

python PAN2015_JCR.py <pairs> <source document folder> <suspicious document folder> <output folder>

Example:

python PAN2015_JCR.py E:/text-alignment/pan13-text-alignment-training-dataset-2013-01-21/pairs E:/text-alignment/pan13-text-alignment-training-dataset-2013-01-21/src E:/text-alignment/pan13-text-alignment-training-dataset-2013-01-21/susp C:/Users/sanchezperez15/Results

  1. INPUT

  • <pairs>: file containing the pairs of documents to be compared
  • <source document folder>: folder with all the source documents mentioned in <pairs>
  • <suspicious document folder>: folder with all the suspicious documents mentioned in <pairs>
  • <output folder>: folder where the resulting XML files will be stored
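
As an illustration of how these inputs fit together, the following minimal sketch reads a pairs file and resolves each pair to file paths. It assumes the usual PAN layout of one pair per line (suspicious file name first, then source file name, separated by whitespace). The function name read_pairs is hypothetical and is not part of this repository.

```python
import os


def read_pairs(pairs_path, susp_dir, src_dir):
    """Return a list of (suspicious_path, source_path) tuples.

    Assumption: each non-empty line of the pairs file holds a suspicious
    file name and a source file name separated by whitespace.
    """
    pairs = []
    with open(pairs_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            susp_name, src_name = parts
            pairs.append((os.path.join(susp_dir, susp_name),
                          os.path.join(src_dir, src_name)))
    return pairs
```

Each returned tuple can then be passed to whatever routine compares a suspicious document against a source document.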

  1. OUTPUT

The results are stored in <output folder> as XML files, in the following format required at PAN 2015 [3]:

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt, and that it is of length 1000 characters, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
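
To make the offset and length attributes concrete, here is a minimal sketch that builds such a report with Python's standard xml.etree.ElementTree module and slices a passage out of a document by character offset. It is not the serializer used by this repository. The helper names build_report and passage, and the dictionary layout of a detection, are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def build_report(susp_name, detections):
    """Build the <document> element for one suspicious document.

    Assumption: `detections` is a list of dicts with the keys
    source_reference, this_offset, this_length, source_offset, source_length.
    """
    root = ET.Element("document", reference=susp_name)
    for det in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(det["this_offset"]),
            "this_length": str(det["this_length"]),
            "source_reference": det["source_reference"],
            "source_offset": str(det["source_offset"]),
            "source_length": str(det["source_length"]),
        })
    return root


def passage(text, offset, length):
    """Return the characters of `text` in the span [offset, offset + length)."""
    return text[offset:offset + length]


# Values taken from the example report above.
report = build_report("suspicious-documentXYZ.txt", [{
    "source_reference": "source-documentABC.txt",
    "this_offset": 5,
    "this_length": 1000,
    "source_offset": 100,
    "source_length": 1000,
}])
xml_text = ET.tostring(report, encoding="unicode")
```

Calling passage(suspicious_text, 5, 1000) and passage(source_text, 100, 1000) on the two document strings would return the aligned passages described by the example report.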

  1. NOTE

In the main method, the following lines compare two documents:

sgsplag_obj = SGSPLAG(read_document(<path_to_suspicious_document>), read_document(<path_to_source_document>), parameters)
type_plag, summary_flag = sgsplag_obj.process()

where the results are stored in <sgsplag_obj.detections>.

This note is included to make it easier to reuse this method outside the PAN requirements.
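
For reuse in other code, the two lines above could be wrapped in a small helper such as the sketch below. The helper name compare_documents is hypothetical. The class and the reader function are passed in as arguments so that the sketch makes no assumption about how the script is imported; the only behaviour it relies on is what the note above states (the constructor arguments, the two values returned by process(), and the detections attribute).

```python
def compare_documents(susp_path, src_path, parameters, sgsplag_cls, read_document):
    """Compare one suspicious document against one source document.

    `sgsplag_cls` is expected to be the SGSPLAG class and `read_document`
    the reader function from this repository. Both are injected here, so
    this helper is only a hypothetical wrapper around the lines shown above.
    """
    sgsplag_obj = sgsplag_cls(read_document(susp_path),
                              read_document(src_path),
                              parameters)
    type_plag, summary_flag = sgsplag_obj.process()
    # The detections are available on the object after process() has run.
    return type_plag, summary_flag, sgsplag_obj.detections
```

With the repository's code loaded, a call might look like compare_documents(susp_path, src_path, parameters, SGSPLAG, read_document), where parameters is whatever the main method builds.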

  1. REFERENCES

[1] Sanchez-Perez, M.A., Gelbukh, A., Sidorov, G.: Adaptive algorithm for plagiarism detection: The best-performing approach at PAN 2014 text alignment competition. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multi-modality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9283, pp. 402–413. Springer (2015)

[2] Sanchez-Perez, M.A., Gelbukh, A.F., Sidorov, G.: Dynamically adjustable approach through obfuscation type recognition. In: Cappellato, L., Ferro, N., Jones, G.J.F., SanJuan, E. (eds.) Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015. CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org (2015), http://ceur-ws.org/Vol-1391/92-CR.pdf

[3] http://pan.webis.de/clef15/pan15-web/plagiarism-detection.html

*** For more questions do not hesitate to contact us! ***

  • Miguel Ángel Sánchez Pérez <miguel.sanchez.nan(?)gmail.com>
  • Alexander Gelbukh <gelbukh(?)gelbukh.com>
  • Grigori Sidorov <sidorov(?)cic.ipn.mx>