From: Jan "Yenya" Kasprzak
Date: Wed, 15 Aug 2012 20:56:53 +0000 (+0200)
Subject: abstract, uvod
X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=commitdiff_plain;h=b66fa2e3f8b93fc0c12a7248848b434d18c1e047;p=pan12-paper.git

abstract, uvod
---

diff --git a/paper.tex b/paper.tex
index dd3c3b7..46c3a65 100755
--- a/paper.tex
+++ b/paper.tex
@@ -23,7 +23,15 @@
 \maketitle
 
 \begin{abstract}
-Briefly describe the main ideas of your approach.
+In this paper, we describe our approach in PAN 2012 competition.
+Our candidate retrieval system is based on TODO Simon.
+
+Our detailed comparison system detects common features of both
+documents, computing valid intervals from them, and then merging
+some detections in the postprocessing phase. We also discuss
+the relevance of current PAN 2012 settings to the real-world
+plagiarism detection systems.
+
 \end{abstract}
 
 
@@ -42,6 +50,7 @@ we need to possess the original and the plagiarized document.
 %document base with potential plagiarized documents and evaluate the amount of plagiarism by detailed document comparison.
 %In this paper we introduce a method which has been used in PAN 2012 competition\footnote{\url{http://pan.webis.de/}}
 %in plagiarism detection.
+
 In the first section we will introduce methods for candidate document retrieval from online sources, which took part in PAN 2012 competition\footnote{\url{http://pan.webis.de/}}
 in plagiarism detection.
 The task was to retrieve a set of candidate source documents that may had served as an original to plagiarize from.
@@ -49,9 +58,11 @@ In the PAN 2012 candidate document retrieval test corpus, there were 32 text doc
 The documents were approximately 30 KB of size, the smallest were 18 KB and the largest were 44 KB.
 
 In the second section we describe our approach of detailed document comparison.
-
-We also discuss the performance ...
-
+We highlight the differences of this approach to the one we used for PAN 2010
+competition. We then provide the outline of the algorithm, and describe
+its steps in detail. We briefly mention the approaches we explored,
+but did not use in the final submission. Finally, we discuss the performance
+of our system (both in terms of the plagdet score, and in terms of CPU time).
 
 \include{simon-searchengine}
 
@@ -65,9 +76,11 @@ The proposed methods are applicable in general to any type of text input with no
 In PAN 2012 competition the proposed methods succeeded with similar amount of plagiarism detected
 with only a small fraction of used queries compared to the others.
 
-
-
-
+We also present a novel approach for detailed (pair-wise) document
+comparison, where we allow the common features of different types
+to be evaluated together into valid intervals, even though the particular
+types of common features can vary to the great extent in their length
+and importance, and do not provide a natural ordering.
 
 \bibliographystyle{splncs03}
 \begin{raggedright}