From 1898383bd74cec83ebf5106c0a750866ca4f76c1 Mon Sep 17 00:00:00 2001
From: "Jan \"Yenya\" Kasprzak"
Date: Fri, 17 Aug 2012 00:11:43 +0200
Subject: [PATCH] yenya: applied review comments from Simon

---
 yenya-detailed.tex | 57 ++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/yenya-detailed.tex b/yenya-detailed.tex
index 492b38a..9493f75 100755
--- a/yenya-detailed.tex
+++ b/yenya-detailed.tex
@@ -26,11 +26,11 @@ we developed a method of evaluating multiple types of similarities
 density and length. As a proof of concept, we used two types of common
 features: word
-5-grams and stop-word 8-grams, the later being based on the method described in
+5-grams and stop word 8-grams, the latter being based on the method described in
 \cite{stamatatos2011plagiarism}.
 
 In addition to the above, we made several minor improvements to the
-algorithm, such as parameter tuning and improving the detections
+algorithm, such as parameter tuning and improved detection
 merging in the post-processing stage.
 
 \subsection{Algorithm Overview}
@@ -50,25 +50,32 @@ The algorithm evaluates the document pair in several stages:
 
 We tokenize the document into words, where word is a sequence of one
 or more characters of the {\it Letter} Unicode class.
-With each word, two additional attributes, needed for further processing,
+With each word, two additional attributes needed for further processing
 are associated: the offset where the word begins, and the word length.
 
 The offset where the word begins is not necessarily the first letter character
-of the word itself. We discovered that in the training corpus,
+of the word itself. We discovered that in the training corpus
 some plagiarized passages were annotated including the preceding
 non-letter characters.
 We used the following heuristics to add
-parts of the inter-word gap to the previous, or the next adjacent word:
+parts of the inter-word gap to the previous or the next adjacent word:
 \begin{itemize}
 \item When the inter-word gap contains interpunction (any of the dot,
-semicolon, colon, comma, exclamation mark, question mark, or quotes),
-add the characters up to, and including the interpunction, to the previous
-word, ignore the space character(s) after the interpunction, and add
-the rest to the next word.
-\item Otherwise, when the inter-word gap contains newline, add the character
-before the first newline to the previous word, ignore the first newline
-character, and add the rest to the next word.
-\item otherwise, ignore the inter-word gap characters altogether.
+semicolon, colon, comma, exclamation mark, question mark, or quotes):
+\begin{itemize}
+\item add the characters up to and including the interpunction character
+to the previous word,
+\item ignore the space character(s) after the interpunction
+character,
+\item add the rest to the next word.
+\end{itemize}
+\item Otherwise, when the inter-word gap contains a newline:
+\begin{itemize}
+\item add the character before the first newline to the previous word,
+\item ignore the first newline character,
+\item add the rest to the next word.
+\end{itemize}
+\item Otherwise: ignore the inter-word gap characters altogether.
 \end{itemize}
 
 When the detection program was given three different
@@ -84,21 +91,21 @@ We have used features of two types:
 
 \begin{itemize}
 \item Lexicographically sorted word 5-grams, formed of words at least
-three characters long, and
-\item unsorted stop-word 8-grams, formed from 50 most frequent words in English,
+three characters long.
+\item Unsorted stop word 8-grams, formed from the 50 most frequent words in English,
 as described in \cite{stamatatos2011plagiarism}.
 We have further ignored the 8-grams, formed solely
 from the six most frequent English words
-(the, of, and, a, in, to), or the possessive {\it'{}s}.
+({\it the}, {\it of}, {\it and}, {\it a}, {\it in}, {\it to}), or the possessive {\it'{}s}.
 \end{itemize}
 
 We represented each feature with the 32 highest-order bits of its
-MD5 digest. This is only a performance optimization, targeted for
+MD5 digest. This is only a performance optimization targeted at
 larger systems. The number of features in a document pair is several orders
-of magnitude lower than $2^{32}$, so the probability of hash function
+of magnitude lower than $2^{32}$; thus the probability of hash function
 collision is low. For pair-wise comparison, it would be feasible
 to compare the features directly instead of their MD5 sums.
 
-Each feature has also the offset and length attributes.
+Each feature also has two attributes: offset and length.
 Offset is taken as the offset of the first word in a given feature,
 and length is the offset of the last character in a given feature
 minus the offset of the feature itself.
@@ -123,7 +130,7 @@ Both of these algorithms use features of a single type, which allows
 to use the ordering of features as a measure of distance.
 When we use features of different types, there is no natural ordering
-of them: for example, a stop-word 8-gram can span multiple sentences,
+of them: for example, a stop word 8-gram can span multiple sentences,
 which can contain several word 5-grams. The assumption of both of the
 above algorithms, that the last character of the previous feature
 is before the last character of the current feature, is broken.
@@ -138,7 +145,7 @@ of a given valid interval) set to 4000 characters.
 
 \subsection{Postprocessing}
 \label{postprocessing}
 
-In the postprocessing phase, we took the resulting valid intervals,
+In the postprocessing phase we took the resulting valid intervals
 and made attempt to further improve the results.
 We firstly removed overlaps: if both overlapping intervals were
 shorter than 300 characters, we have removed both of them. Otherwise, we
@@ -151,9 +158,9 @@ if at least one of the following criteria were met:
 and it contained at least one feature per 10,000
 characters\footnote{we have computed the length of the gap as the
 number of characters between the detections in the source document, plus the
-number of charaters between the detections in the suspicious document.}, or
+number of characters between the detections in the suspicious document.}
 \item the gap was smaller than 30,000 characters and the size of the adjacent
-valid intervals was at least twice as big as the gap between them, or
+valid intervals was at least twice as big as the gap between them
 \item the gap was smaller than 30,000 characters and the number of common
 features per character in the adjacent interval was not more than three
 times bigger than number of features per character in the possible joined
 interval.
@@ -210,7 +217,7 @@ as hint for the post-processing phase, allowing to merge
 larger intervals, if they both belong to the same passage, detected
 by the intrinsic detector.
 This approach did not provide improvement
 when compared to the static gap limits, as described in Section
-\ref{postprocessing}, so we have omitted it from our final submission.
+\ref{postprocessing}; therefore, we have omitted it from our final submission.
 
 %\subsubsection{Language Detection}
 %
@@ -226,7 +233,7 @@ when compared to the static gap limits, as described in Section
 \subsubsection{Cross-lingual Plagiarism Detection}
 
 For cross-lingual plagiarism detection, our aim was to use the public
-interface to Google translate if possible, and use the resulting document
+interface to Google Translate\footnote{\url{http://translate.google.com/}} if possible, and use the resulting document
 as the source for standard intra-lingual detector.
 Should the translation service not be available, we wanted
 to use the fall-back strategy of translating isolated words only,
-- 
2.43.5
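
The inter-word gap heuristic restructured by the tokenization hunk can be sketched in Python. This is an illustration of the described rules, not the authors' code; the function name is ours, and the exact set of quote characters counted as interpunction is an assumption.

```python
def split_gap(gap):
    """Split an inter-word gap per the heuristics in the patch.

    Returns (tail appended to the previous word, head prepended to the
    next word); characters the heuristic ignores are dropped.
    """
    # Interpunction per the patch: dot, semicolon, colon, comma,
    # exclamation mark, question mark, or quotes (quote set is our guess).
    punct = set(".;:,!?\"'")
    hit = next((i for i, c in enumerate(gap) if c in punct), None)
    if hit is not None:
        # Up to and including the interpunction goes to the previous word;
        # following space character(s) are ignored; the rest goes on.
        return gap[:hit + 1], gap[hit + 1:].lstrip(" ")
    if "\n" in gap:
        # Characters before the first newline go to the previous word;
        # the first newline is ignored; the rest goes to the next word.
        nl = gap.index("\n")
        return gap[:nl], gap[nl + 1:]
    # Otherwise the inter-word gap is ignored altogether.
    return "", ""
```

For example, the gap `", "` attaches the comma to the previous word and discards the space, matching the first rule of the itemized list.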
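
The feature representation touched by the later hunks (lexicographically sorted word 5-grams of words at least three characters long, each feature represented by the 32 highest-order bits of its MD5 digest) can be illustrated with a minimal Python sketch. The function name, the letter-run tokenizer, and the case folding are our assumptions; the 5-gram sorting and 32-bit MD5 truncation follow the patch.

```python
import hashlib
import re

def features_word_5grams(text):
    """Sorted word 5-grams of words >= 3 chars, hashed to 32 bits."""
    # Tokenize on runs of letters (the paper uses the Unicode Letter
    # class); lowercasing is our assumption.
    words = [w.lower() for w in re.findall(r"[^\W\d_]+", text) if len(w) >= 3]
    feats = set()
    for i in range(len(words) - 4):
        # Lexicographically sorted 5-gram, as in the patch.
        gram = " ".join(sorted(words[i:i + 5]))
        # Keep the 32 highest-order bits of the MD5 digest.
        feats.add(int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "big"))
    return feats

a = features_word_5grams("plagiarism detection compares common features between document pairs")
b = features_word_5grams("detection of plagiarism compares common features between other pairs")
print(len(a & b))  # common hashed features of the two documents
```

Because the 5-grams are sorted before hashing, reordered word sequences still produce matching features, which is the point of this representation for pair-wise comparison.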
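
The interval-merging criteria from the postprocessing hunk can be written as a single predicate. The opening condition of the first criterion lies outside this hunk, so only its visible feature-density part is shown; all parameter names are ours, while the 10,000/30,000-character and 2x/3x thresholds are the ones stated in the patch.

```python
def should_join(gap_chars, gap_features, left_len, right_len,
                adjacent_density, joined_density):
    """Sketch: join two adjacent valid intervals if any criterion holds."""
    # Criterion 1 (partial; its opening condition is outside this hunk):
    # the gap contains at least one feature per 10,000 characters.
    dense_gap = gap_features * 10000 >= gap_chars
    # Criterion 2: gap under 30,000 characters and the adjacent valid
    # intervals together are at least twice as big as the gap.
    dominated_gap = gap_chars < 30000 and left_len + right_len >= 2 * gap_chars
    # Criterion 3: gap under 30,000 characters and joining dilutes the
    # per-character feature density by at most a factor of three.
    mild_dilution = gap_chars < 30000 and adjacent_density <= 3 * joined_density
    return dense_gap or dominated_gap or mild_dilution
```

Per the footnote in the patch, `gap_chars` would be the characters between the detections in the source document plus those in the suspicious document.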