X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-paper%2Fpan13-notebook.tex;h=2afc33564df960487dfdeae0b258a7fae3c2fff9;hb=ebba97ad24be305e65ceb7cfdbb34d54d9a6bfba;hp=3febdcc6c2cee3222d7038b9e19f165b4878bb69;hpb=479a615009e1e6eaa1efd9714ce65511d40e2e6e;p=pan13-paper.git diff --git a/pan13-paper/pan13-notebook.tex b/pan13-paper/pan13-notebook.tex index 3febdcc..2afc335 100755 --- a/pan13-paper/pan13-notebook.tex +++ b/pan13-paper/pan13-notebook.tex @@ -7,7 +7,7 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} -\title{Improving plagiarism detection} +\title{Diverse Queries and Feature Type Selection for Plagiarism Discovery} %%% Please do not remove the subtitle. \subtitle{Notebook for PAN at CLEF 2013} @@ -21,32 +21,48 @@ This paper describes approaches used for the Plagiarism Detection task in PAN 2013 international competition on uncovering plagiarism, authorship, and social software misuse. We present modified three-way search methodology for Source Retrieval subtask and analyse snippet similarity performance. -Next, we show changes in selected feature for text alignement which led to plagdet score improvement. -The results of source retrieval show, that presented approach is adaptable in real-world plagiarism situations. -Improved results for text alignment achieved in the competition overall third place. +The results show that the presented approach is adaptable to real-world plagiarism situations. +For the Detailed Comparison task, we discuss feature type selection and +global postprocessing. The resulting performance is significantly better +with the described modifications, and further improvement is still possible. \end{abstract} \section{Introduction} In PAN 2013 competition on plagiarism detection we participated in both the Source Retrieval -and the Text Alignment subtask. In both tasks we adapted methodology used in PAN 2012. +and the Text Alignment subtasks. 
In both tasks we adapted the methodology used in PAN 2012\footnote{% +See \cite{pan2012} for an overview of the PAN 2012 plagiarism detection campaign.} \cite{suchomel_kas_12}. Section~\ref{source_retr} describes querying approach for source retrieval, where we used three different types of queries. We present a new type of query based on text paragraphs. -The query execution were controled by its type and by preliminary similarities +The query execution was controlled by its type and by preliminary similarities discovered during the searches. -In section~\ref{text_alignment} we present modified common text feature fot text alignment. -We also compare performance of both the previous and the modified algorithms. - +Section~\ref{text_alignment} describes our approach to the text alignment +(pairwise comparison) subtask. We briefly introduce our system, +and then we discuss the feature types usable for pairwise comparison, +including an evaluation of their feasibility for this purpose. We then describe +the global (corpus-wide) optimizations used, and finally we discuss +the results achieved and further development. \input{simon-source_retrieval} \input{yenya-text_alignment} \section{Conclusions} - -Unfortunately the ChatNoir search engine does not support phrasal search, therefore it +We introduced a querying strategy with a snippet similarity measure. %which approved to be competitive. +In the source retrieval subtask, the strategy achieved the second best ratio +of recall to the number of used queries. +We focused our queries on selected parts of the text +and on parts with no discovered external similarities. +Unfortunately, the ChatNoir search engine currently does not support phrasal search, therefore it is possible that evaluated results may be quite distorted in this manner. +In the text alignment subtask, we have achieved a significant improvement +with respect to our system from PAN 2012. Further development in this +area is still possible. 
For a real-world system, however, a completely +different set of parameters and heuristics needs to be used, because +the plagdet score and the structure of the competition corpus +differ too much from real-world conditions. + \bibliographystyle{splncs03} \begin{raggedright} \bibliography{pan13-notebook}