yenya: next version

diff --git a/yenya-detailed.tex b/yenya-detailed.tex

index 46a2cd5be7a6c041e974d1a71e1ed449972a7979..dd28b4dc2a1be5a038ba996ca4058aa7d07945d9 100755 (executable)
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -9,11 +9,11 @@ The submitted program has been run in a controlled environment
  separately for each document pair, without the possibility of keeping any
  data between runs.
  
-In this section, we describe our approach in the detailed comparison
-task. The rest of this section is organized as follows: in the next
-subsection, we summarise the differences from our previous approach.
-In subsection \ref{sec-alg-overview}, we give an overview of our approach.
-TODO write up how it will look in the end.
+%In this section, we describe our approach in the detailed comparison
+%task. The rest of this section is organized as follows: in the next
+%subsection, we summarise the differences from our previous approach.
+%In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+%TODO write up how it will look in the end.
  
  \subsection{Differences Against PAN 2010}
  
  
@@ -136,6 +136,7 @@ inside the interval (characters not belonging to any common feature
  of a given valid interval) set to 4000 characters.
  
  \subsection{Postprocessing}
+\label{postprocessing}
  
  In the postprocessing phase, we took the resulting valid intervals,
  and attempted to further improve the results. We have firstly
  
@@ -196,47 +197,31 @@ them here is worthwhile nevertheless.
  
  \subsubsection{Intrinsic Plagiarism Detection}
  
  
-Our approach is based on character $n$-gram profiles of the interval of
+We tested an approach based on character $n$-gram profiles of intervals of
  a fixed size (in terms of $n$-grams), and their differences from the
  profile of the whole document \cite{pan09stamatatos}. We have further
  enhanced the approach by applying Gaussian smoothing to the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have compute the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inveresly
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided more stable style-change function than
-than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, therefore we wanted to use them
-as hints to the external detection.
-
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
-
-\subsubsection{Language Detection}
-
-For language detection, we used the $n$-gram based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than using the stopwords-based detection we have
-used in PAN 2010. However, there were still mis-detected documents,
-mainly the long lists of surnames and other tabular data. We have added
-an ad-hoc fix, where for documents having their profile too distant from all of
-English, German, and Spanish profiles, we have declared them to be in English.
+function \cite{Kasprzak2010}. For PAN 2012, we made further improvements
+to the algorithm, resulting in a more stable style-change function for
+both short and long documents.
+
+We tried to use the results of the intrinsic plagiarism detection
+as a hint for the post-processing phase, allowing intervals separated by
+a larger gap to be merged if they both belong to the same passage detected by
+the intrinsic detector. This approach did not provide any improvement
+over the static gap limits described in Section
+\ref{postprocessing}, so we omitted it from our final submission.
+
+%\subsubsection{Language Detection}
+%
+%For language detection, we used the $n$-gram based categorization \cite{ngram}.
+%We computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). The result
+%of this approach was better than using the stopwords-based detection we have
+%used in PAN 2010. However, there were still mis-detected documents,
+%mainly the long lists of surnames and other tabular data. We added
+%an ad-hoc fix, where for documents having their profile too distant from all of
+%English, German, and Spanish profiles, we declared them to be in English.
  
  \subsubsection{Cross-lingual Plagiarism Detection}
  
  
@@ -250,34 +235,61 @@ with the additional exact matching of longer words (we have used words with
  We have supposed that these longer words can be names or specialized terms,
  present in both languages.
  
-We have used dictionaries from several sources, like
+We used dictionaries from several sources, for example
  {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
  {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated document were aligned on a line-by-line basis.
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
  
  
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
+In the final submission, we simply used the machine-translated texts,
+which were provided to the running program by the surrounding environment.
  
  
-\subsection{Performance Notes}
  
  
-We consider comparing the performance of PAN 2012 submissions almost
+\subsection{Further discussion}
+
+From our previous PAN submissions, we knew that the precision of our
+system was good, and because of the way the final score is computed, we
+wanted to trade slightly worse precision for better recall and granularity.
+So we pushed the parameters towards detecting more plagiarized passages,
+even when the number of common features was not especially high.
+
+\subsubsection{Plagdet score}
+
+Our results from tuning the parameters show that the plagdet score~\cite{potthastframework}
+is not a good measure for comparing plagiarism detection systems:
+for example, a gap of 30,000 characters, as described in Section \ref{postprocessing},
+can easily mean several pages of text. And yet the system with this
+parameter set so high achieved a better plagdet score.
+
+Another problem of plagdet can be
+seen in the 01-no-plagiarism part of the training corpus: the border
+between the perfect score 1 and the score 0 is a single false-positive
+detection. Plagdet does not distinguish between a system reporting this
+single false positive and a system reporting the whole data set as plagiarized.
+Both get the score 0. However, our experience with real-world plagiarism detection systems shows that
+plagiarized documents are in a clear minority, so the performance of
+a detection system on non-plagiarized documents is very important.
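+
+To sketch why a single false positive is enough, assume the usual
+definition of plagdet for a set $S$ of actual plagiarism cases and a set $R$
+of reported detections (the symbols $S$, $R$, $\mathit{prec}$, $\mathit{rec}$,
+and $\mathit{gran}$ are notation introduced only for this illustration):
+\[
+\mathit{plagdet}(S,R) = \frac{F_1(S,R)}{\log_2\bigl(1+\mathit{gran}(S,R)\bigr)},
+\qquad
+F_1(S,R) = \frac{2\,\mathit{prec}(S,R)\,\mathit{rec}(S,R)}{\mathit{prec}(S,R)+\mathit{rec}(S,R)}.
+\]
+In a corpus without plagiarism, $S=\emptyset$, so every reported detection
+is a false positive and $\mathit{prec}(S,R)=0$ as soon as $R\neq\emptyset$.
+Taking $F_1=0$ whenever the precision is zero, the score is $0$ regardless of
+whether $R$ contains one short passage or the whole corpus.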
+
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of PAN 2012 submissions almost
  meaningless, because any sane system would precompute features for all
  documents in a given set of suspicious and source documents, and use the
  results for pair-wise comparison, expecting that any document will take
  part in more than one pair.
  
-We did not use this exact split in our submission, but in order to be able
-to evaluate various approaches faster, we have split our computation into
-the following two parts: in the first part, common features have been
-computed, and the results stored into a file\footnote{We have use the
-{\tt Storable.pm} storage available in Perl.}. The second part
-then used this data to compute valid intervals and do post-processing.
+Also, pair-wise comparison without caching any intermediate results
+leads to worse overall detection performance: in our PAN 2010 submission, one of the
+post-processing steps was to remove all overlapping detections
+from a given suspicious document when these detections came from different
+source documents and were short enough. This removed many false positives
+and improved the precision of our system. This kind of heuristic was
+not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two parts:
+(1) finding the common features, and (2) computing valid intervals and
+postprocessing. The first part is more CPU-intensive, and its results
+can be cached. The second part is fast enough to allow us to evaluate
+many combinations of parameters.
  
  We did our development on a machine with four six-core AMD 8139 CPUs
  (2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
  
@@ -289,35 +301,11 @@ When we have tried to use intrinsic plagiarism detection and language
  detection, the first phase took about 12500 seconds. Thus omitting these
  features clearly provided a huge performance improvement.
  
-The code has been written in Perl, and had about 669 lines of code,
+The code was written in Perl, and had about 669 lines of code,
  not counting comments and blank lines.
  
-\subsection{Further discussion}
-
-As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
-detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
-when using it as a hint for the external plagiarism detection.
-
-In the full paper, we will also discuss the following topics:
-
-\begin{itemize}
-\item language detection and cross-language common features
-\item intrinsic plagiarism detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item discussion of parameter settings
-\end{itemize}
-
-\nocite{pan09stamatatos}
-\nocite{ngram}
-
  \endinput
  
-What I want to discuss in the conclusion:
-- it was not possible to cache data
-- it was not possible to exclude overlapping similarities
-- so the run-time figures are completely meaningless
-- 669 lines of code, not counting comments and blank lines
  - the boundary between passages sometimes included whitespace and sometimes did not.
  
  Plagdet discussion: