X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=yenya-detailed.tex;h=8fe2d8364dc2b25e9b45c14fc22ae1d221975e3b;hb=3e7c5460bb20e9738744447e7926c810951d6a33;hp=9f4034655fbc22a8eaddbddff11839aa70f50541;hpb=f60109308b38984bfe852f58aea209863e482ff3;p=pan12-paper.git
diff --git a/yenya-detailed.tex b/yenya-detailed.tex
index 9f40346..8fe2d83 100755
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -9,11 +9,11 @@
 The submitted program has been run in a controlled environment
 separately for each document pair, without the possibility of keeping any
 data between runs.
 
-In this section, we describe our approach in the detailed comparison
-task. The rest of this section is organized as follows: in the next
-subsection, we summarise the differences from our previous approach.
-In subsection \ref{sec-alg-overview}, we give an overview of our approach.
-TODO: write up how this section will finally look.
+%In this section, we describe our approach in the detailed comparison
+%task. The rest of this section is organized as follows: in the next
+%subsection, we summarise the differences from our previous approach.
+%In subsection \ref{sec-alg-overview}, we give an overview of our approach.
+%TODO: write up how this section will finally look.
 
 \subsection{Differences Against PAN 2010}
 
@@ -39,74 +39,104 @@ merging in the post-processing stage.
 The algorithm evaluates the document pair in several stages:
 
 \begin{itemize}
-\item intrinsic plagiarism detection
-\item language detection of the source document
-\begin{itemize}
-\item cross-lingual plagiarism detection, if the source document is not in English
+\item tokenizing both the suspicious and the source documents
+\item forming {\it features} from some tokens
+\item discovering {\it common features}
+\item forming {\it valid intervals} from the common features
+\item postprocessing
 \end{itemize}
-\item detecting intervals with common features
-\item post-processing phase, mainly serves for merging the nearby common intervals
+
+\subsection{Tokenization}
+
+We tokenize the document into words, where a word is a sequence of one
+or more characters of the {\it Letter} Unicode class.
+Each word carries two additional attributes needed for further
+processing: the offset at which the word begins, and the word length.
+
+The offset at which a word begins is not necessarily the position of
+the first letter character of the word itself. We have discovered that
+in the training corpus, some plagiarized passages were annotated
+including the preceding non-letter characters. We have therefore used
+the following heuristics to attach parts of the inter-word gap to the
+previous or the next adjacent word (a code sketch of these rules
+appears at the end of this subsection):
+
+\begin{itemize}
+\item when the inter-word gap contains punctuation (any of the dot,
+semicolon, colon, comma, exclamation mark, question mark, or quotes),
+add the characters up to and including the punctuation to the previous
+word, ignore the space character(s) after the punctuation, and add
+the rest to the next word;
+\item otherwise, when the inter-word gap contains a newline, add the
+characters before the first newline to the previous word, ignore the
+first newline character, and add the rest to the next word;
+\item otherwise, ignore the inter-word gap characters altogether.
 \end{itemize}
 
-\subsection{Multi-feature Plagiarism Detection}
+When the detection program was presented with three files instead of
+two (the third being a machine-translated version of the second), we
+tokenized the translated document instead of the source one. We then
+made use of the line-by-line alignment of the source and
+machine-translated documents to transform the word offsets and lengths
+in the translated document into the terms of the source document.
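+
+The following sketch illustrates the gap-assignment heuristics above.
+It is an illustration only (our implementation was written in Perl;
+the sketch is Python, and the exact set of quote characters is an
+assumption):
+
+\begin{verbatim}
+import re
+
+# Punctuation that binds part of the gap to the previous word;
+# the quote set is an assumption of this sketch.
+PUNCT = set('.;:,!?"\'')
+
+def tokenize(text):
+    """Return a list of (offset, length) pairs, one per word.
+
+    A word is a maximal run of Unicode Letter characters; parts
+    of each inter-word gap are attached to neighboring words."""
+    # [^\W\d_] approximates the Letter class in Python's re module.
+    spans = [[m.start(), m.end()]
+             for m in re.finditer(r'[^\W\d_]+', text)]
+    for i in range(len(spans) - 1):
+        lo, hi = spans[i][1], spans[i + 1][0]
+        gap = text[lo:hi]
+        p = next((j for j, c in enumerate(gap) if c in PUNCT), None)
+        if p is not None:
+            # Up to and including the punctuation -> previous word;
+            # skip the following space(s); the rest -> next word.
+            spans[i][1] = lo + p + 1
+            q = p + 1
+            while q < len(gap) and gap[q] == ' ':
+                q += 1
+            if q < len(gap):
+                spans[i + 1][0] = lo + q
+        elif '\n' in gap:
+            # Characters before the first newline -> previous word;
+            # skip the newline itself; the rest -> next word.
+            nl = gap.index('\n')
+            spans[i][1] = lo + nl
+            if nl + 1 < len(gap):
+                spans[i + 1][0] = lo + nl + 1
+        # Otherwise the gap is ignored altogether.
+    return [(start, end - start) for start, end in spans]
+\end{verbatim}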
 
-Our pair-wise plagiarism detection is based on finding common passages
-of text, present both in the source and in the suspicious document. We call them
-{\it common features}. In PAN 2010, we have used sorted word 5-grams, formed from
-words of three or more characters, as features to compare.
-Recently, other means of plagiarism detection have been explored:
-stopword $n$-gram detection is one of them
-\cite{stamatatos2011plagiarism}.
+\subsection{Features}
 
-We propose the plagiarism detection system based on detecting common
-features of various types, for example word $n$-grams, stopword $n$-grams,
-translated single words, translated word bigrams,
-exact common longer words from document pairs having each document
-in a different language, etc. The system
-has to be to the great extent independent of the specialities of various
-feature types. It cannot, for example, use the order of given features
-as a measure of distance between the features, as for example, several
-word 5-grams can be fully contained inside one stopword 8-gram.
-
-We therefore propose to describe the {\it common feature} of two documents
-(susp and src) with the following tuple:
-$\langle
-\hbox{offset}_{\hbox{susp}},
-\hbox{length}_{\hbox{susp}},
-\hbox{offset}_{\hbox{src}},
-\hbox{length}_{\hbox{src}} \rangle$. This way, the common feature is
-described purely in terms of character offsets, belonging to the feature
-in both documents. In our final submission, we have used the following two types
-of common features:
+We have used features of two types:
 
 \begin{itemize}
-\item word 5-grams, from words of three or more characters, sorted, lowercased
-\item stopword 8-grams, from 50 most-frequent English words (including
- the possessive suffix 's), unsorted, lowercased, with 8-grams formed
- only from the seven most-frequent words ({\it the, of, a, in, to, 's})
- removed
+\item lexicographically sorted word 5-grams, formed from lowercased
+words at least three characters long, and
+\item unsorted stop-word 8-grams, formed from the 50 most frequent
+English words, as described in \cite{stamatatos2011plagiarism}. We
+have further ignored the 8-grams formed solely from the six most
+frequent English words ({\it the, of, and, a, in, to}) or the
+possessive {\it 's}.
 \end{itemize}
 
-We have gathered all the common features of both types for a given document
-pair, and formed {\it valid intervals} from them, as described
-in \cite{Kasprzak2009a}. A similar approach is used also in
-\cite{stamatatos2011plagiarism}.
-The algorithm is modified for multi-feature detection to use character offsets
+We have represented each feature by the 32 highest-order bits of its
+MD5 digest. This is only a performance optimization, aimed at larger
+systems. The number of features in a document pair is several orders
+of magnitude lower than $2^{32}$, so the probability of a hash
+collision is low. For pair-wise comparison, it would be feasible to
+compare the features directly instead of their MD5 sums.
+
+Each feature also carries offset and length attributes: the offset of
+a feature is the offset of its first word, and its length is the
+offset of the last character of the feature minus the offset of the
+feature itself.
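+
+A minimal sketch of the feature construction follows (Python for
+illustration; the stop-word list is truncated here, the full one
+having the 50 most frequent English words, and the lowercasing
+details are our assumptions):
+
+\begin{verbatim}
+import hashlib
+
+STOP_WORDS = {'the', 'of', 'and', 'a', 'in', 'to', 'is', 'was',
+              'it', 'for', 'with', 'he', 'be', 'on', 'i', 'that'}
+MOST_FREQUENT = {'the', 'of', 'and', 'a', 'in', 'to', "'s"}
+
+def feature_hash(words):
+    """The 32 highest-order bits of the MD5 digest of a feature."""
+    digest = hashlib.md5(' '.join(words).encode('utf-8')).digest()
+    return int.from_bytes(digest[:4], 'big')
+
+def word_5grams(words):
+    """Lexicographically sorted 5-grams of words of 3+ characters."""
+    w = [x.lower() for x in words if len(x) >= 3]
+    return [sorted(w[i:i + 5]) for i in range(len(w) - 4)]
+
+def stopword_8grams(words):
+    """Unsorted stop-word 8-grams, without those consisting solely
+    of the six most frequent words or the possessive 's."""
+    sw = [x.lower() for x in words if x.lower() in STOP_WORDS]
+    grams = [sw[i:i + 8] for i in range(len(sw) - 7)]
+    return [g for g in grams
+            if not all(x in MOST_FREQUENT for x in g)]
+\end{verbatim}
+
+In the real system, each $n$-gram would also carry the offset of its
+first word and the corresponding length, as described above; this
+bookkeeping is omitted from the sketch.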
+
+\subsection{Common Features}
+
+For further processing, we have taken into account only the features
+present in both the source and the suspicious document. For each such
+{\it common feature}, we have created a list of
+$(\makebox{offset}, \makebox{length})$ pairs for the source document,
+and a similar list for the suspicious document. Note that a given
+feature can occur multiple times in both the source and the suspicious
+document.
+
+\subsection{Valid Intervals}
+
+To detect a plagiarized passage, we need to find a set of common
+features which maps to a dense-enough interval both in the source and
+in the suspicious document. In our previous work, we have presented an
+algorithm for discovering these {\it valid intervals} \cite{Kasprzak2009a}.
+A similar approach is also used in \cite{stamatatos2011plagiarism}.
+Both of these algorithms use features of a single type, which allows
+them to use the ordering of features as a measure of distance.
+
+With features of different types, there is no such natural ordering:
+for example, a stop-word 8-gram can span multiple sentences, which in
+turn can contain several word 5-grams. The assumption of both of the
+above algorithms, namely that the last character of the previous
+feature lies before the last character of the current feature, is
+broken.
+
+We have therefore modified the valid-interval algorithm for
+multi-feature detection to use character offsets
 only instead of feature order numbers. We have used valid intervals
-consisting of at least 5 common features, with the maximum allowed gap
+consisting of at least 4 common features, with the maximum allowed gap
 inside the interval (characters not belonging to any common feature
-of a given valid interval) set to 3,500 characters.
-
-We have also experimented with modifying the allowed gap size using the
-intrinsic plagiarism detection: to allow only shorter gap if the common
-features around the gap belong to different passages, detected as plagiarized
-in the suspicious document by the intrinsic detector, and allow larger gap,
-if both the surrounding common features belong to the same passage,
-detected by the intrinsic detector. This approach, however, did not show
-any improvement against allowed gap of a static size, so it was omitted
-from the final submission.
+of a given valid interval) set to 4000 characters.
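+
+A simplified sketch of the modified algorithm follows (Python for
+illustration; the splitting strategy is condensed from
+\cite{Kasprzak2009a}, and details such as the processing order are
+our assumptions):
+
+\begin{verbatim}
+from collections import namedtuple
+
+# A common feature, described purely by character offsets and
+# lengths in the suspicious and the source document.
+Feature = namedtuple('Feature', 'susp_off susp_len src_off src_len')
+
+MAX_GAP = 4000     # longest allowed run of uncovered characters
+MIN_FEATURES = 4   # smallest number of features per valid interval
+
+def split_side(feats, off, length):
+    """Sort features by offset on one document side, and split the
+    list wherever the uncovered gap exceeds MAX_GAP."""
+    feats = sorted(feats, key=off)
+    parts, current = [], [feats[0]]
+    covered = off(feats[0]) + length(feats[0])
+    for f in feats[1:]:
+        if off(f) - covered > MAX_GAP:
+            parts.append(current)
+            current = []
+        current.append(f)
+        covered = max(covered, off(f) + length(f))
+    parts.append(current)
+    return parts
+
+def valid_intervals(feats):
+    """Split on both document sides until every chunk is dense in
+    both documents; drop chunks with too few features."""
+    queue, result = [feats], []
+    while queue:
+        chunk = queue.pop()
+        if len(chunk) < MIN_FEATURES:
+            continue
+        parts = split_side(chunk, lambda f: f.susp_off,
+                           lambda f: f.susp_len)
+        if len(parts) == 1:
+            parts = split_side(chunk, lambda f: f.src_off,
+                               lambda f: f.src_len)
+        if len(parts) == 1:
+            result.append(chunk)
+        else:
+            queue.extend(parts)
+    return result
+\end{verbatim}
+
+Each surviving chunk of common features then maps to one detected
+passage in each of the two documents.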
 
 \subsection{Postprocessing}
+\label{postprocessing}
 
 In the postprocessing phase, we took the resulting valid intervals,
 and made an attempt to further improve the results. We have firstly
@@ -129,9 +159,37 @@ features per character in the adjacent interval was not more than
 three times bigger than number of features per character in the possible
 joined interval.
 \end{itemize}
 
-These parameters were fine-tuned to achieve the best results on the training corpus. With these parameters, our algorithm got the total plagdet score of 0.73 on the training corpus.
-
-\subsection{Other Approaches Tried}
+\subsection{Results}
+
+The above parameters were fine-tuned to achieve the best results on
+the training corpus. With these parameters, our algorithm reached an
+overall plagdet score of 0.7288 on the training corpus. The details
+of the performance of our algorithm are presented in
+Table \ref{table-final}.
+In the PAN 2012 competition, we achieved a plagdet score of 0.6827,
+with precision 0.8932, recall 0.5525, and granularity 1.0000.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+&plagdet&recall&precision&granularity\\
+\hline
+whole corpus&0.7288&0.5994&0.9306&1.0007\\
+\hline
+01-no-plagiarism       &0.0000&0.0000&0.0000&1.0000\\
+02-no-obfuscation      &0.9476&0.9627&0.9330&1.0000\\
+03-artificial-low      &0.8726&0.8099&0.9477&1.0013\\
+04-artificial-high     &0.3649&0.2255&0.9562&1.0000\\
+05-translation         &0.7610&0.6662&0.8884&1.0008\\
+06-simulated-paraphrase&0.5972&0.4369&0.9433&1.0000\\
+\hline
+\end{tabular}
+\end{center}
+\caption{Performance on the training corpus}
+\label{table-final}
+\end{table}
+
+\subsection{Other Approaches Explored}
 
 There are several other approaches which we have evaluated but which
 were omitted from our final submission for various reasons. We think mentioning
@@ -139,38 +197,31 @@ them here is worthwhile nevertheless.
 
 \subsubsection{Intrinsic Plagiarism Detection}
 
-Our approach is based on character $n$-gram profiles of the interval of
+We have tested an approach based on character $n$-gram profiles of an interval of
 a fixed size (in terms of $n$-grams), and their differences to the
 profile of the whole document \cite{pan09stamatatos}. We have further
 enhanced the approach by using Gaussian smoothing of the style-change
-function \cite{Kasprzak2010}.
-
-For PAN 2012, we have experimented with using 1-, 2-, and 3-grams instead
-of only 3-grams, and using the different measure of the difference between
-the n-gram profiles. We have used an approach similar to \cite{ngram},
-where we have compute the profile as an ordered set of 400 most-frequent
-$n$-grams in a given text (the whole document or a partial window). Apart
-from ordering the set, we have ignored the actual number of occurrences
-of a given $n$-gram altogether, and used the value inveresly
-proportional to the $n$-gram order in the profile, in accordance with
-the Zipf's law \cite{zipf1935psycho}.
-
-This approach has provided more stable style-change function than
-than the one proposed in \cite{pan09stamatatos}. Because of pair-wise
-nature of the detailed comparison sub-task, we couldn't use the results
-of the intrinsic detection immediately, therefore we wanted to use them
-as hints to the external detection.
-
-\subsubsection{Language Detection}
-
-For language detection, we used the $n$-gram based categorization \cite{ngram}.
-We have computed the language profiles from the source documents of the
-training corpus (using the annotations from the corpus itself). The result
-of this approach was better than using the stopwords-based detection we have
-used in PAN 2010. However, there were still mis-detected documents,
-mainly the long lists of surnames and other tabular data. We have added
-an ad-hoc fix, where for documents having their profile too distant from all of
-English, German, and Spanish profiles, we have declared them to be in English.
+function \cite{Kasprzak2010}. For PAN 2012, we made further
+improvements to the algorithm, resulting in a more stable style-change
+function for both short and long documents.
+
+We tried to use the results of the intrinsic plagiarism detection as a
+hint for the post-processing phase, allowing larger intervals to be
+merged when both of them belong to the same passage detected by the
+intrinsic detector. This approach did not improve on the static gap
+limits described in Section \ref{postprocessing}, so we have omitted
+it from our final submission.
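+
+For illustration, a sketch of a style-change function of this kind
+follows (Python; the window size, step, profile size, dissimilarity
+measure, and smoothing width are assumptions of the sketch, not the
+exact values and measures of our detector):
+
+\begin{verbatim}
+from collections import Counter
+import numpy as np
+
+def profile(text, n=3, size=400):
+    """The `size` most frequent character n-grams of the text."""
+    counts = Counter(text[i:i + n]
+                     for i in range(len(text) - n + 1))
+    return {g for g, _ in counts.most_common(size)}
+
+def style_change(text, window=5000, step=500, sigma=2.0):
+    """Dissimilarity of each window profile to the whole-document
+    profile, smoothed with a Gaussian kernel."""
+    doc = profile(text)
+    raw = []
+    for start in range(0, max(1, len(text) - window + 1), step):
+        win = profile(text[start:start + window])
+        # Jaccard distance of the two profiles (an assumption;
+        # the measures in the cited papers differ).
+        raw.append(1.0 - len(win & doc) / len(win | doc))
+    raw = np.array(raw)
+    # Gaussian smoothing of the style-change function.
+    radius = int(3 * sigma)
+    xs = np.arange(-radius, radius + 1)
+    kernel = np.exp(-xs ** 2 / (2 * sigma ** 2))
+    kernel /= kernel.sum()
+    return np.convolve(raw, kernel, mode='same')
+\end{verbatim}
+
+Peaks of the smoothed function then indicate suspected style changes,
+i.e.\ candidate passage boundaries.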
+
+%\subsubsection{Language Detection}
+%
+%For language detection, we used the $n$-gram based categorization \cite{ngram}.
+%We computed the language profiles from the source documents of the
+%training corpus (using the annotations from the corpus itself). The result
+%of this approach was better than using the stopwords-based detection we have
+%used in PAN 2010. However, there were still mis-detected documents,
+%mainly the long lists of surnames and other tabular data. We added
+%an ad-hoc fix, where for documents having their profile too distant from all of
+%English, German, and Spanish profiles, we declared them to be in English.
 
 \subsubsection{Cross-lingual Plagiarism Detection}
 
@@ -184,46 +235,77 @@ with the additional exact matching of longer words (we have used words with
 We have supposed that these longer words can be names or specialized terms,
 present in both languages.
 
-We have used dictionaries from several sources, like
+We used dictionaries from several sources, for example
 {\it dicts.info}\footnote{\url{http://www.dicts.info/}},
 {\it omegawiki}\footnote{\url{http://www.omegawiki.org/}},
-and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}. The source
-and translated document were aligned on a line-by-line basis.
-
-In the final form of the detailed comparison sub-task, the results of machine
-translation of the source documents were provided to the detector programs
-by the surrounding environment, so we have discarded the language detection
-and machine translation from our submission altogether, and used only
-line-by-line alignment of the source and translated document for calculating
-the offsets of text features in the source document. We have then treated
-the translated documents the same way as the source documents in English.
-
-\subsection{Further discussion}
+and {\it wiktionary}\footnote{\url{http://en.wiktionary.org/}}.
 
-As in our PAN 2010 submission, we tried to make use of the intrinsic plagiarism
-detection, but despite making further improvements to the intrinsic plagiarism detector, we have again failed to reach any significant improvement
-when using it as a hint for the external plagiarism detection.
+In the final submission, we simply used the machine-translated texts,
+which were provided to the running program by the surrounding
+environment.
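+
+A sketch of the dictionary-based word translation (Python for
+illustration; the length threshold for the exact matching of longer
+words is hypothetical here, and a real dictionary entry may offer
+several translations per word):
+
+\begin{verbatim}
+def translated_words(words, dictionary, min_long=8):
+    """Map words to English through a bilingual dictionary; keep
+    long words without a dictionary entry verbatim, assuming they
+    are names or specialized terms present in both languages."""
+    out = []
+    for w in words:
+        lw = w.lower()
+        if lw in dictionary:
+            out.append(dictionary[lw])  # translated single word
+        elif len(lw) >= min_long:       # hypothetical threshold
+            out.append(lw)              # exact match of a longer word
+    return out
+\end{verbatim}
+
+The resulting word stream would then feed the same feature
+construction as in the monolingual case.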
 
-In the full paper, we will also discuss the following topics:
-\begin{itemize}
-\item language detection and cross-language common features
-\item intrinsic plagiarism detection
-\item suitability of plagdet score\cite{potthastframework} for performance measurement
-\item feasibility of our approach in large-scale systems
-\item discussion of parameter settings
-\end{itemize}
+\subsection{Further discussion}
 
-\nocite{pan09stamatatos}
-\nocite{ngram}
+From our previous PAN submissions, we knew that the precision of our
+system was good, and because of the way the final score is computed,
+we wanted to trade slightly worse precision for better recall and
+granularity. We therefore pushed the parameters towards detecting more
+plagiarized passages, even when the number of common features was not
+especially high.
+
+\subsubsection{Plagdet Score}
+
+Our results from tuning the parameters show that the plagdet
+score \cite{potthastframework} is not a good measure for comparing
+plagiarism detection systems: for example, the gap of 30,000
+characters, described in Section \ref{postprocessing}, can easily mean
+several pages of text, and yet the system with this parameter set so
+high achieved a better plagdet score.
+
+Another problem with plagdet can be seen in the 01-no-plagiarism part
+of the training corpus: the border between the perfect score of 1 and
+a score of 0 is a single false-positive detection. Plagdet does not
+distinguish between a system reporting this single false positive and
+a system reporting the whole data as plagiarized; both get the
+score 0. However, our experience with real-world plagiarism detection
+systems shows that plagiarized documents are in a clear minority, so
+the performance of a detection system on non-plagiarized documents is
+very important.
+
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of the PAN 2012
+submissions almost meaningless, because any sane system would
+precompute the features for all documents in a given set of suspicious
+and source documents, and use the results for the pair-wise
+comparisons, expecting that each document takes part in more than one
+pair.
+
+Also, pair-wise comparison without caching any intermediate results
+leads to worse overall results: in our PAN 2010 submission, one of the
+post-processing steps was to remove all the overlapping detections
+from a given suspicious document when these detections were from
+different source documents and were short enough. This step removed
+many false positives and improved the precision of our system. This
+kind of heuristic was not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two
+parts: (1) finding the common features, and (2) computing the valid
+intervals and postprocessing. The first part is more CPU-intensive,
+and its results can be cached. The second part is fast enough to allow
+us to evaluate many combinations of parameters.
+
+We did our development on a machine with four six-core AMD 8139 CPUs
+(2800 MHz) and 128 GB of RAM. The first phase took about 2500 seconds
+on this host, and the second phase took 14 seconds. Computing the
+plagdet score using the official script in Python took between 120 and
+180 seconds, as there is no parallelism in this script.
+
+When we tried to use the intrinsic plagiarism detection and the
+language detection, the first phase took about 12,500 seconds;
+omitting these features thus provided a huge performance improvement.
+
+The code was written in Perl and consisted of 669 lines of code, not
+counting comments and blank lines.
 
 \endinput
 
-What I want to discuss in the conclusion:
-- it was not possible to cache data
-- it was not possible to exclude the overlapping similarities
-- so the run-time figures are completely worthless
-- 669 lines of code, not counting comments and blank lines
 the boundary between passages sometimes included whitespace and sometimes did not.
 Discussion of plagdet: