\section{Text Alignment}\label{text_alignment}

\subsection{Overview}

Our approach to the text alignment subtask of PAN 2013 uses the same
basic principles as our previous work in this area, described
in \cite{suchomel_kas_12}, which in turn builds on our work for previous
PAN campaigns \cite{Kasprzak2010}, \cite{Kasprzak2009a}:

We detect {\it common features} between the source and suspicious documents,
where the features we currently use are word $n$-grams and stop-word
$m$-grams \cite{stamatatos2011plagiarism}. From these common features (each
of which can occur multiple times in both the source and the suspicious
document), we form {\it valid intervals}\footnote{%
We describe the algorithm for computing valid intervals in \cite{Kasprzak2009a};
a similar approach is also used in \cite{stamatatos2011plagiarism}.}
of characters
in the source and suspicious documents, such that the interval in both
of these documents is covered ``densely enough'' by the common features.

We then postprocess the valid intervals, removing overlapping detections
and merging detections which are close enough to each other.

On the training corpus,
our unmodified software from PAN 2012 gave the following results\footnote{%
See \cite{potthastframework} for the definition of {\it plagdet} and the
rationale behind this type of scoring.}:

\def\plagdet#1#2#3#4{\par{
$\textit{plagdet}=#1, \textit{recall}=#2, \textit{precision}=#3, \textit{granularity}=#4$}\hfill\par}

\plagdet{0.7235}{0.6306}{0.8484}{1.0000}

We take the above as the baseline for further improvements.
In the next sections, we summarize the modifications we made for PAN 2013,
including approaches we tried but did not use in the end.

\subsection{Alternative Features}
\label{altfeatures}

In PAN 2012, we used word 5-grams and stop-word 8-grams.
This year, we experimented with different lengths of word $n$-grams, and also
with contextual $n$-grams as described in \cite{torrejondetailed}.
Modifying the algorithm to use contextual $n$-grams, created as word
5-grams with the middle word removed (i.e.\ two words of context on each
side of the removed word), yielded a better score:

\plagdet{0.7421}{0.6721}{0.8282}{1.0000}

We then ran tests with plain word 4-grams and, to our surprise, they
gave an even better score than the contextual $n$-grams:

\plagdet{0.7447}{0.7556}{0.7340}{1.0000}

It should be noted that these two quite similar approaches (both use
features formed from four words), while having similar plagdet scores,
have completely different precision and recall values. Looking at the
individual parts of the training corpus, plain word 4-grams were better
(in terms of plagdet score) on all parts of the corpus except the
02-no-obfuscation part.

In our final submission, we used word 4-grams and stop-word 8-grams.
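For illustration, the following Python sketch shows how the three feature
types discussed here (plain word $n$-grams, contextual $n$-grams, and
stop-word $m$-grams) can be generated from a tokenized document. This is a
simplified sketch with an abridged stop-word list; our actual implementation
is written in Perl, and the function names here are illustrative only:

\begin{verbatim}
# Abridged, illustrative stop-word list; the real list is longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to",
              "in", "is", "it", "that", "this", "with"}

def word_ngrams(tokens, n=4):
    """Plain word n-grams (word 4-grams in our submission)."""
    return [tuple(tokens[i:i+n])
            for i in range(len(tokens) - n + 1)]

def contextual_ngrams(tokens, n=5):
    """Word n-grams with the middle word removed: for n=5,
    two words of context on each side of the removed word."""
    mid = n // 2
    return [tuple(tokens[i:i+mid] + tokens[i+mid+1:i+n])
            for i in range(len(tokens) - n + 1)]

def stopword_mgrams(tokens, m=8):
    """Keep only stop-words, then form m-grams over the
    resulting stream (stop-word 8-grams in our submission)."""
    sw = [t for t in tokens if t.lower() in STOP_WORDS]
    return [tuple(sw[i:i+m]) for i in range(len(sw) - m + 1)]
\end{verbatim}

The features generated from a source and a suspicious document can then be
matched against each other to obtain the common features described above.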
\subsection{Global Postprocessing}

For PAN 2013, the algorithm had access to all of the source and suspicious
documents at once; unlike in 2012, it was not limited to a single document
pair. Because of this, we rewrote our software to handle
all of the documents in one run, in order to be able to do cross-document
optimizations and postprocessing, similar to what we did for PAN 2010.
This required refactoring most of the code. We are able to handle
most of the computation in parallel in per-CPU threads, with little
synchronization needed. The parallelization was used especially
during development, where it provided a significant performance boost.
The official performance numbers are from a single-threaded run, though.

For PAN 2010, we used the following postprocessing heuristic:
if there are overlapping detections inside a suspicious document,
keep the longer one, provided that it is long enough; for overlapping
detections of up to 600 characters, drop them both. We implemented
this heuristic, but found that it led to a lower score than
without this modification. Further experiments with global postprocessing
of overlaps led to a new heuristic: we unconditionally drop both
overlapping detections when each is at most 250 characters long, but if
at least one of them is longer, we keep both detections. This is probably
a consequence of plagdet being skewed too much towards recall (because the
percentage of plagiarized cases in the corpus is much higher than in the
real world), so it is favourable to keep a detection even though the
evidence for it is rather weak.

The global postprocessing improved the score even more:

\plagdet{0.7469}{0.7558}{0.7382}{1.0000}

\subsection{Evaluation Results and Future Work}

The evaluation on the competition corpus gave the following results:

\plagdet{0.7448}{0.7659}{0.7251}{1.0003}

This is quite similar to what we have seen on the training corpus;
only the granularity value differing from 1.0 is a bit surprising, given
the aggressive joining of neighbouring detections we perform.
Compared to the other participants, our algorithm performs
especially well on human-created plagiarism (the 05-summary-obfuscation
sub-corpus), which is where we want to focus for our production
systems\footnote{Our production systems include the Czech National Archive
of Graduate Theses, \url{http://theses.cz}.}.

After the final evaluation, we did further experiments
with the feature types, and discovered that using stop-word 8-grams,
word 4-grams, {\it and} contextual $n$-grams as described in
Section~\ref{altfeatures} performs even better (on the training corpus):

\plagdet{0.7522}{0.7897}{0.7181}{1.0000}

We plan to experiment further with combining more than two types
of features, be they contiguous $n$-grams or contextual features.
This should allow us to tone down the aggressive heuristic for joining
neighbouring detections, which should lead to higher precision,
hopefully without sacrificing recall.

As for computational performance, it should be noted that
our software is prototyped in a scripting language (Perl), so it is not
the fastest possible implementation of the algorithm used. The code
contains about 800 non-comment lines, including the parallelization
of most parts and the debugging/logging statements. The only
language-dependent part of the code is the list of English stop-words for
the stop-word $n$-grams; we use no stemming or other kinds of
language-dependent processing.
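As an illustration of the overlap rule from the global postprocessing
described above, the following Python sketch shows one possible
implementation. It is a simplified sketch under assumed representations:
detections are modelled as \texttt{(start, end)} character intervals in the
suspicious document, which is an assumption of this illustration rather than
the data structure used in our Perl implementation:

\begin{verbatim}
def overlaps(a, b):
    """Two detections overlap if their character intervals
    in the suspicious document intersect."""
    return a[0] < b[1] and b[0] < a[1]

def drop_short_overlaps(detections, limit=250):
    """Drop both detections of an overlapping pair when each
    is at most `limit` characters long; if at least one of
    them is longer, keep both (the rule described above)."""
    dropped = set()
    for i, a in enumerate(detections):
        for b in detections[i+1:]:
            if (overlaps(a, b)
                    and a[1] - a[0] <= limit
                    and b[1] - b[0] <= limit):
                dropped.add(a)
                dropped.add(b)
    return [d for d in detections if d not in dropped]
\end{verbatim}

With the default \texttt{limit} of 250 characters, this corresponds to the
heuristic used in our final submission.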