From f1c54e45e52bb8fb386caa2dc58f7395c8ec6421 Mon Sep 17 00:00:00 2001 From: "Jan \"Yenya\" Kasprzak" Date: Wed, 15 Aug 2012 22:36:24 +0200 Subject: [PATCH] yenya: dalsi verze --- yenya-detailed.tex | 160 +++++++++++++++++++++------------------------ 1 file changed, 74 insertions(+), 86 deletions(-) diff --git a/yenya-detailed.tex b/yenya-detailed.tex index 46a2cd5..dd28b4d 100755 --- a/yenya-detailed.tex +++ b/yenya-detailed.tex @@ -9,11 +9,11 @@ The submitted program has been run in a controlled environment separately for each document pair, without the possibility of keeping any data between runs. -In this section, we describe our approach in the detailed comparison -task. The rest of this section is organized as follows: in the next -subsection, we summarise the differences from our previous approach. -In subsection \ref{sec-alg-overview}, we give an overview of our approach. -TODO napsat jak to nakonec bude. +%In this section, we describe our approach in the detailed comparison +%task. The rest of this section is organized as follows: in the next +%subsection, we summarise the differences from our previous approach. +%In subsection \ref{sec-alg-overview}, we give an overview of our approach. +%TODO napsat jak to nakonec bude. \subsection{Differences Against PAN 2010} @@ -136,6 +136,7 @@ inside the interval (characters not belonging to any common feature of a given valid interval) set to 4000 characters. \subsection{Postprocessing} +\label{postprocessing} In the postprocessing phase, we took the resulting valid intervals, and made attempt to further improve the results. We have firstly @@ -196,47 +197,31 @@ them here is worthwhile nevertheless. \subsubsection{Intrinsic Plagiarism Detection} -Our approach is based on character $n$-gram profiles of the interval of +We tested the approach based on character $n$-gram profiles of the interval of the fixed size (in terms of $n$-grams), and their differences to the profile of the whole document \cite{pan09stamatatos}. 
We have further enhanced the approach with using gaussian smoothing of the style-change -function \cite{Kasprzak2010}. - -For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead -of only 3-grams, and using the different measure of the difference between -the n-gram profiles. We have used an approach similar to \cite{ngram}, -where we have compute the profile as an ordered set of 400 most-frequent -$n$-grams in a given text (the whole document or a partial window). Apart -from ordering the set, we have ignored the actual number of occurrences -of a given $n$-gram altogether, and used the value inveresly -proportional to the $n$-gram order in the profile, in accordance with -the Zipf's law \cite{zipf1935psycho}. - -This approach has provided more stable style-change function than -than the one proposed in \cite{pan09stamatatos}. Because of pair-wise -nature of the detailed comparison sub-task, we couldn't use the results -of the intrinsic detection immediately, therefore we wanted to use them -as hints to the external detection. - -We have also experimented with modifying the allowed gap size using the -intrinsic plagiarism detection: to allow only shorter gap if the common -features around the gap belong to different passages, detected as plagiarized -in the suspicious document by the intrinsic detector, and allow larger gap, -if both the surrounding common features belong to the same passage, -detected by the intrinsic detector. This approach, however, did not show -any improvement against allowed gap of a static size, so it was omitted -from the final submission. - -\subsubsection{Language Detection} - -For language detection, we used the $n$-gram based categorization \cite{ngram}. -We have computed the language profiles from the source documents of the -training corpus (using the annotations from the corpus itself). The result -of this approach was better than using the stopwords-based detection we have -used in PAN 2010. 
However, there were still mis-detected documents, -mainly the long lists of surnames and other tabular data. We have added -an ad-hoc fix, where for documents having their profile too distant from all of -English, German, and Spanish profiles, we have declared them to be in English. +function \cite{Kasprzak2010}. For PAN 2012, we made further improvements +to the algorithm, resulting in a more stable style-change function in +both short and long documents. + +We tried to use the results of the intrinsic plagiarism detection +as a hint for the post-processing phase, allowing us to merge larger +intervals if they both belong to the same passage detected by +the intrinsic detector. This approach did not provide any improvement +when compared to the static gap limits, as described in Section +\ref{postprocessing}, so we omitted it from our final submission. + +%\subsubsection{Language Detection} +% +%For language detection, we used the $n$-gram based categorization \cite{ngram}. +%We computed the language profiles from the source documents of the +%training corpus (using the annotations from the corpus itself). The result +%of this approach was better than using the stopwords-based detection we have +%used in PAN 2010. However, there were still mis-detected documents, +%mainly the long lists of surnames and other tabular data. We added +%an ad-hoc fix, where for documents having their profile too distant from all of +%English, German, and Spanish profiles, we declared them to be in English. \subsubsection{Cross-lingual Plagiarism Detection} @@ -250,34 +235,61 @@ with the additional exact matching of longer words (we have used words with We have supposed that these longer words can be names or specialized terms, present in both languages. 
-We have used dictionaries from several sources, like +We used dictionaries from several sources, for example {\it dicts.info}\footnote{\url{http://www.dicts.info/}}, {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}}, -and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source -and translated document were aligned on a line-by-line basis. +and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. -In the final form of the detailed comparison sub-task, the results of machine -translation of the source documents were provided to the detector programs -by the surrounding environment, so we have discarded the language detection -and machine translation from our submission altogether, and used only -line-by-line alignment of the source and translated document for calculating -the offsets of text features in the source document. We have then treated -the translated documents the same way as the source documents in English. +In the final submission, we simply used the machine-translated texts, +which were provided to the running program from the surrounding environment. -\subsection{Performance Notes} -We consider comparing the performance of PAN 2012 submissions almost +\subsection{Further discussion} + +From our previous PAN submissions, we knew that the precision of our +system was good, and because of the way the final score is computed, we +wanted to exchange slightly worse precision for better recall and granularity. +So we pushed the parameters towards detecting more plagiarized passages, +even when the number of common features was not especially high. + +\subsubsection{Plagdet score} + +Our results from tuning the parameters show that the plagdet score\cite{potthastframework} +is not a good measure for comparing plagiarism detection systems: +for example, the gap of 30,000 characters, described in Section \ref{postprocessing}, +can easily mean several pages of text. 
And still the system with this +parameter set so high resulted in a better plagdet score. + +Another problem of plagdet can be +seen in the 01-no-plagiarism part of the training corpus: the border +between the perfect score 1 and the score 0 is a single false-positive +detection. Plagdet does not distinguish between the system reporting this +single false positive, and the system reporting the whole data as plagiarized. +Both get the score 0. However, our experience from real-world plagiarism detection systems shows that +the plagiarized documents are in a clear minority, so the performance of +the detection system on non-plagiarized documents is very important. + +\subsubsection{Performance Notes} + +We consider comparing the CPU-time performance of PAN 2012 submissions almost meaningless, because any sane system would precompute features for all documents in a given set of suspicious and source documents, and use the results for pair-wise comparison, expecting that any document will take part in more than one pair. -We did not use this exact split in our submission, but in order to be able -to evaluate various approaches faster, we have split our computation into -the following two parts: in the first part, common features have been -computed, and the results stored into a file\footnote{We have use the -{\tt Storable.pm} storage available in Perl.}. The second part -then used this data to compute valid intervals and do post-processing. +Also, the pair-wise comparison without caching any intermediate results +leads to worse overall performance: in our PAN 2010 submission, one of the +post-processing steps was to remove all the overlapping detections +from a given suspicious document, when these detections were from different +source documents, and were short enough. This removed many false positives +and improved the precision of our system. This kind of heuristic was +not possible in PAN 2012. 
+ +As for the performance of our system, we split the task into two parts: +1. finding the common features, and 2. computing valid intervals and +postprocessing. The first part is more CPU intensive, and the results +can be cached. The second part is fast enough to allow us to evaluate +many combinations of parameters. We did our development on a machine with four six-core AMD 8139 CPUs (2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds @@ -289,35 +301,11 @@ When we have tried to use intrinsic plagiarism detection and language detection, the first phase took about 12500 seconds. Thus omitting these featurs clearly provided huge performance improvement. -The code has been written in Perl, and had about 669 lines of code, +The code was written in Perl, and had about 669 lines of code, not counting comments and blank lines. -\subsection{Further discussion} - -As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism -detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement -when using it as a hint for the external plagiarism detection. - -In the full paper, we will also discuss the following topics: - -\begin{itemize} -\item language detection and cross-language common features -\item intrinsic plagiarism detection -\item suitability of plagdet score\cite{potthastframework} for performance measurement -\item feasibility of our approach in large-scale systems -\item discussion of parameter settings -\end{itemize} - -\nocite{pan09stamatatos} -\nocite{ngram} - \endinput -Co chci diskutovat v zaveru: -- nebylo mozno cachovat data -- nebylo mozno vylucovat prekryvajici se podobnosti -- cili udaje o run-time jsou uplne nahouby -- 669 radku kodu bez komentaru a prazdnych radku - hranice mezi pasazema nekdy zahrnovala whitespace a nekdy ne. Diskuse plagdet: -- 2.43.5