X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=paper.tex;h=46c3a6561f63eed7513b2757f9b3c6515ab265ce;hb=b66fa2e3f8b93fc0c12a7248848b434d18c1e047;hp=e098f4ae4ceb6da19f51cf9b6876d7e38a20b561;hpb=8bd472fc89fa7f354933fcc568d8ad378c019c39;p=pan12-paper.git diff --git a/paper.tex b/paper.tex index e098f4a..46c3a65 100755 --- a/paper.tex +++ b/paper.tex @@ -4,11 +4,15 @@ \usepackage[utf8]{inputenc} \usepackage{times} \usepackage{graphicx} +\usepackage{algorithm} +\usepackage{algorithmic} +\usepackage{amssymb} +\usepackage{multirow} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} -\title{Your Title} +\title{Three way search engine queries with multi-feature document comparison for plagiarism detection} %%% Please do not remove the subtitle. \subtitle{Notebook for PAN at CLEF 2012} @@ -19,7 +23,15 @@ \maketitle \begin{abstract} -Briefly describe the main ideas of your approach. +In this paper, we describe our approach in the PAN 2012 competition. +Our candidate retrieval system is based on TODO Simon. + +Our detailed comparison system detects common features of both +documents, computes valid intervals from them, and then merges +some detections in the postprocessing phase. We also discuss +the relevance of the current PAN 2012 settings to real-world +plagiarism detection systems. + \end{abstract} @@ -28,9 +40,29 @@ Briefly describe the main ideas of your approach. %The notebooks shall contain a full write-up of your approach, including all details necessary to reproduce your results. -Due to the increasing ease of plagirism the plagiarism detection has nowdays become a need for many instutisions. Especially for universities where modern learning methods include e-learning and a vast document sources are online available. +Due to the increasing ease of plagiarism, plagiarism detection has become a necessity for many institutions. 
+This is especially true for universities, where modern learning methods include e-learning and vast document sources are available online. +%In the Information System of Masaryk University~\cite{ismu} there is also an antiplagiarism tool which is based upon the same principles as are shown in this paper. +The core methods for automatic plagiarism detection, which also work in practice on extensive collections of documents, +are based on computing document similarities. In order to compute a similarity +we need to possess both the original and the plagiarized document. +%The most straightforward method is to use an online search engine in order to enrich +%document base with potential plagiarized documents and evaluate the amount of plagiarism by detailed document comparison. +%In this paper we introduce a method which has been used in PAN 2012 competition\footnote{\url{http://pan.webis.de/}} +%in plagiarism detection. +In the first section we introduce methods for candidate document retrieval from online sources, which took part in +the PAN 2012 competition\footnote{\url{http://pan.webis.de/}} in plagiarism detection. +The task was to retrieve a set of candidate source documents that may have served as an original to plagiarize from. +In the PAN 2012 candidate document retrieval test corpus, there were 32 text documents, all of which contained at least one plagiarism case. +The documents were approximately 30 KB in size; the smallest was 18 KB and the largest was 44 KB. +In the second section we describe our approach to detailed document comparison. +We highlight the differences between this approach and the one we used for the PAN 2010 +competition. We then provide the outline of the algorithm, and describe +its steps in detail. We briefly mention the approaches we explored, +but did not use in the final submission. Finally, we discuss the performance +of our system (both in terms of the plagdet score, and in terms of CPU time). 
\include{simon-searchengine} @@ -38,7 +70,17 @@ Due to the increasing ease of plagirism the plagiarism detection has nowdays bec \section{Conclusions} -Tady napsat zaver +We present methods for candidate document retrieval which lead to +the discovery of a decent amount of plagiarism while minimizing the number of queries used. +The proposed methods are applicable in general to any type of text input with no a priori information about the input document. +In the PAN 2012 competition the proposed methods detected a similar amount of plagiarism while using +only a small fraction of the queries used by the others. + +We also present a novel approach to detailed (pair-wise) document +comparison, where we allow common features of different types +to be evaluated together into valid intervals, even though the particular +types of common features can vary to a great extent in their length +and importance, and do not provide a natural ordering. \bibliographystyle{splncs03} \begin{raggedright}