From 68358470c18055ae0aa29e3fd25a178a6fe79f07 Mon Sep 17 00:00:00 2001 From: "Jan \"Yenya\" Kasprzak" Date: Thu, 9 Aug 2012 18:55:51 +0200 Subject: [PATCH] First version of the paper for the proceedings --- paper.tex | 47 +++++++++++++ simon-searchengine.tex | 4 ++ yenya-detailed.tex | 150 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 201 insertions(+) create mode 100644 paper.tex create mode 100644 simon-searchengine.tex create mode 100644 yenya-detailed.tex diff --git a/paper.tex b/paper.tex new file mode 100644 index 0000000..27db7db --- /dev/null +++ b/paper.tex @@ -0,0 +1,47 @@ +\documentclass{llncs} +\usepackage[american]{babel} +%\usepackage[T1]{fontenc} +\usepackage[utf8]{inputenc} +\usepackage{times} +\usepackage{graphicx} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\begin{document} + +\title{Your Title} +%%% Please do not remove the subtitle. +\subtitle{Notebook for PAN at CLEF 2012} + +\author{\v{S}imon Suchomel \and Jan Kasprzak \and Michal Brandejs} +\institute{Faculty of Informatics, Masaryk University \\ +{\tt\{suchomel,kas,brandejs\}@fi.muni.cz}} + +\maketitle + +\begin{abstract} +Briefly describe the main ideas of your approach. +\end{abstract} + + +\section{Introduction} + +The notebooks shall contain a full write-up of your approach, including all details necessary to reproduce your results. 
+ + +\include{simon-searchengine} +\include{yenya-detailed} + +\section{Conclusions} + +Write the conclusions here + +\bibliographystyle{splncs03} +\begin{raggedright} +\bibliography{paper} +\end{raggedright} + +\end{document} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + diff --git a/simon-searchengine.tex b/simon-searchengine.tex new file mode 100644 index 0000000..0ce5371 --- /dev/null +++ b/simon-searchengine.tex @@ -0,0 +1,4 @@ +\section{Search-Engine Queries} + +% Simon will write this + diff --git a/yenya-detailed.tex b/yenya-detailed.tex new file mode 100644 index 0000000..3615ab9 --- /dev/null +++ b/yenya-detailed.tex @@ -0,0 +1,150 @@ +\section{Detailed Document Comparison} + +\subsection{General Approach} + +The approach that the Masaryk University team used in the detailed +comparison sub-task of PAN 2012 Plagiarism Detection builds on the +approach that we used in PAN 2010 \cite{Kasprzak2010}, enhanced +in several ways. + +The algorithm evaluates the document pair in several stages: + +\begin{itemize} +\item intrinsic plagiarism detection +\item language detection of the source document +\begin{itemize} +\item cross-lingual plagiarism detection, if the source document is not in English +\end{itemize} +\item detecting intervals with common features +\item post-processing phase, which mainly serves to merge nearby common intervals +\end{itemize} + +\subsection{Intrinsic plagiarism detection} + +Our approach is based on character $n$-gram profiles of intervals of +a fixed size (in terms of $n$-grams), and their differences from the +profile of the whole document \cite{pan09stamatatos}. We have further +enhanced the approach by applying Gaussian smoothing to the style-change +function \cite{Kasprzak2010}. + +For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead +of only 3-grams, and with using a different measure of the difference between +the $n$-gram profiles. 
+We have used an approach similar to \cite{ngram}, +where we compute the profile as an ordered set of the 400 most-frequent +$n$-grams in a given text (the whole document or a partial window). Apart +from ordering the set, we have ignored the actual number of occurrences +of a given $n$-gram altogether, and used a value inversely +proportional to the $n$-gram order in the profile, in accordance with +Zipf's law \cite{zipf1935psycho}. + +This approach has provided a more stable style-change function +than the one proposed in \cite{pan09stamatatos}. Because of the pair-wise +nature of the detailed comparison sub-task, we could not use the results +of the intrinsic detection directly, so we wanted to use them +as hints for the external detection. + +\subsection{Cross-lingual detection} + +%For language detection, we used the $n$-gram based categorization \cite{ngram}. +%We have computed the language profiles from the source documents of the +%training corpus (using the annotations from the corpus itself). The result +%of this approach was better than using the stopwords-based detection we have +%used in PAN 2010. However, there were still mis-detected documents, +%mainly the long lists of surnames and other tabular data. We have added +%an ad-hoc fix, where for documents having their profile too distant from all of +%English, German, and Spanish profiles, we have declared them to be in English. + +For cross-lingual plagiarism detection, our aim was to use the public +interface to Google Translate if possible, and use the resulting document +as the source for the standard intra-lingual detector. +Should the translation service not be available, we wanted +to use the fall-back strategy of translating isolated words only, +with the additional exact matching of longer words (we have used words of +five or more characters). +We have supposed that these longer words can be names or specialized terms +present in both languages. 
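The fall-back strategy can be illustrated with a minimal Python sketch. It is not the code of our submission: the function name, the toy dictionary, and the rule that a dictionary entry takes precedence over exact matching are assumptions made for this illustration only.

```python
import re

# From the text: words of five or more characters are matched exactly.
MIN_EXACT_LEN = 5


def fallback_translate(text, dictionary):
    """Word-by-word fall-back translation (hypothetical helper).

    Returns (word, offset, length) triples, where offset and length refer
    to the original, untranslated text:
      * a word with a dictionary entry is replaced by its translation,
      * a longer word without an entry is kept as is (it may be a name
        or a specialized term shared by both languages),
      * a shorter word without an entry is dropped.
    """
    result = []
    for match in re.finditer(r"\w+", text.lower()):
        word = match.group()
        if word in dictionary:
            result.append((dictionary[word], match.start(), len(word)))
        elif len(word) >= MIN_EXACT_LEN:
            result.append((word, match.start(), len(word)))
    return result


# Toy German-to-English dictionary, invented for this example.
toy_dictionary = {"haus": "house", "und": "and"}
translated = fallback_translate("Das Haus und Einstein", toy_dictionary)
```

Keeping the original offsets next to every translated word means that a match found in the translated text can be reported in terms of the original source document.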
+ +We have used dictionaries from several sources, such as +{\tt dicts.info\footnote{\url{http://www.dicts.info/}}}, +{\tt omegawiki\footnote{\url{http://www.omegawiki.org/}}}, +and {\tt wiktionary\footnote{\url{http://en.wiktionary.org/}}}. The source +and translated documents were aligned on a line-by-line basis. + +In the final form of the detailed comparison sub-task, the results of machine +translation of the source documents were provided to the detector programs +by the surrounding environment, so we have discarded the language detection +and machine translation from our submission altogether, and used only +line-by-line alignment of the source and translated document for calculating +the offsets of text features in the source document. + +\subsection{Multi-feature Plagiarism Detection} + +Our pair-wise plagiarism detection is based on finding common passages +of text, present in both the source and the suspicious document. We call them +{\it features}. In PAN 2010, we have used sorted word 5-grams, formed from +words of three or more characters, as features to compare. +Recently, other means of plagiarism detection have been explored: +stop-word $n$-gram detection is one of them +\cite{stamatatos2011plagiarism}. + +We propose a plagiarism detection system based on detecting common +features of various types, such as word $n$-grams, stop-word $n$-grams, +translated words or word bigrams, exact common longer words from document +pairs having each document in a different language, etc. The system +has to be largely independent of the specifics of the individual +feature types. It cannot, for example, use the order of given features +as a measure of distance between the features, because several +word 5-grams can be fully contained inside one stop-word 8-gram. 
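To illustrate why character offsets, rather than feature order numbers, are the natural common denominator of different feature types, the following Python sketch extracts sorted word 5-grams together with their character spans. It is a simplified illustration under our own assumptions (the function name, the regular-expression tokenizer, and lowercasing at extraction time), not the code of our submission.

```python
import re


def word_ngram_features(text, n=5, min_word_len=3):
    """Return (key, offset, length) triples for sorted word n-grams.

    Only words of at least min_word_len characters are used. The key is
    the lowercased, sorted word list, so the same words in a different
    order produce the same key. Offset and length are in characters.
    """
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)
             if len(m.group()) >= min_word_len]
    features = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        key = " ".join(sorted(word for word, _, _ in window))
        start = window[0][1]
        end = window[-1][2]
        features.append((key, start, end - start))
    return features


features = word_ngram_features("The quick brown fox jumps over a lazy dog")
```

A stop-word $n$-gram extractor could return triples of the same shape, which is all the later stages need: features of different types can then be mixed freely, even when their character spans overlap or nest.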
+ +We thus define a {\it common feature} of two documents (susp and src) +as the following tuple: +$$\langle +\hbox{offset}_{\hbox{susp}}, +\hbox{length}_{\hbox{susp}}, +\hbox{offset}_{\hbox{src}}, +\hbox{length}_{\hbox{src}} \rangle$$ + +In our final submission, we have used only the following two types +of common features: + +\begin{itemize} +\item word 5-grams, from words of three or more characters, sorted, lowercased +\item stop-word 8-grams, from the 50 most-frequent English words (including + the possessive suffix 's), unsorted, lowercased, with 8-grams formed + only from the seven most-frequent words ({\it the, of, a, in, to, 's}) + removed +\end{itemize} + +We have gathered all the common features for a given document pair, and formed +{\it valid intervals} from them, as described in \cite{Kasprzak2009a} +(a similar approach is also used in \cite{stamatatos2011plagiarism}). +The algorithm is modified for multi-feature detection to use character offsets +only, instead of feature order numbers. We have used valid intervals +consisting of at least 5 common features, with the maximum allowed gap +inside the interval (characters not belonging to any common feature +of a given valid interval) set to 3,500 characters. + +We have also experimented with modifying the allowed gap size using the +intrinsic plagiarism detection: allowing only a shorter gap if the common +features around the gap belong to different passages detected as plagiarized +in the suspicious document by the intrinsic detector, and allowing a larger gap +if both surrounding common features belong to the same passage +detected by the intrinsic detector. This approach, however, did not show +any improvement over an allowed gap of a static size, so it was omitted +from the final submission. 
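The role of the two parameters (at least 5 common features, gap of at most 3,500 characters) can be illustrated with the following Python sketch. It is a deliberately simplified greedy pass over the common features sorted by their offset in the suspicious document, not the valid-interval algorithm of \cite{Kasprzak2009a}; the function name and the exact gap test are assumptions of this illustration.

```python
MIN_FEATURES = 5   # from the text: at least 5 common features per interval
MAX_GAP = 3500     # from the text: maximum allowed gap in characters


def greedy_intervals(common, min_features=MIN_FEATURES, max_gap=MAX_GAP):
    """Group common features into intervals (simplified greedy sketch).

    common holds (offset_susp, length_susp, offset_src, length_src)
    tuples. Returns (susp_start, susp_end, src_start, src_end, count)
    tuples for every group with at least min_features features.
    """
    intervals = []
    current = None  # [susp_start, susp_end, src_start, src_end, count]
    for so, sl, ro, rl in sorted(common):
        if current is not None:
            susp_gap = so - current[1]
            # Distance of the feature from the current source range;
            # negative when the feature overlaps that range.
            src_gap = max(ro - current[3], current[2] - (ro + rl))
            if susp_gap <= max_gap and src_gap <= max_gap:
                current[1] = max(current[1], so + sl)
                current[2] = min(current[2], ro)
                current[3] = max(current[3], ro + rl)
                current[4] += 1
                continue
            if current[4] >= min_features:
                intervals.append(tuple(current))
        current = [so, so + sl, ro, ro + rl, 1]
    if current is not None and current[4] >= min_features:
        intervals.append(tuple(current))
    return intervals


# Six nearby features plus one isolated feature far away in both documents.
common = [(i * 100, 30, 5000 + i * 100, 30) for i in range(6)]
common.append((10000, 30, 200, 30))
detections = greedy_intervals(common)
```

Because the sketch works with character offsets only, it does not care which feature type produced a given tuple. Unlike a full valid-interval search, the greedy pass closes the current interval as soon as one feature fails the gap test, so it may split intervals that a more careful algorithm would keep together.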
+ +\subsection{Postprocessing} + + +\subsection{Further discussion} + +In the full paper, we will also discuss the following topics: + +\begin{itemize} +\item language detection +\item suitability of the plagdet score~\cite{potthastframework} for performance measurement +\item feasibility of our approach in large-scale systems +\item other possible features to use, especially for cross-lingual detection +\item discussion of parameter settings +\end{itemize} + -- 2.43.5