X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;ds=sidebyside;f=pan13-paper%2Fyenya-text_alignment.tex;h=e284fe19581f6f382a5e8e3c738e6adc569595e9;hb=9ac1a493410b13697a7dd9a32fc58b894f59e044;hp=c3c6c4984c0f2c6c941a581ceb5f55cdb4ffa835;hpb=b16fe92fb7dd5fd6667718a0fe3d91e7ad95a581;p=pan13-paper.git

diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index c3c6c49..e284fe1 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -1 +1,51 @@
-\section{Text Alignment}
+\section{Text Alignment}~\label{text_alignment}
+
+\subsection{Overview}
+
+Our approach to the text alignment subtask of PAN 2013 uses the same
+basic principles as our previous work in this area, described
+in \cite{suchomel_kas_12}, which in turn builds on our work for previous
+PAN campaigns \cite{Kasprzak2010}, \cite{Kasprzak2009a}:
+
+We detect {\it common features} between source and suspicious documents,
+where the features we currently use are word $n$-grams and stop-word $m$-grams
+\cite{stamatatos2011plagiarism}. From those common features (each of which
+can occur multiple times in both the source and the suspicious document), we form
+{\it valid intervals}\footnote{%
+We describe the algorithm for computing valid intervals in \cite{Kasprzak2009a},
+and a similar approach is also used in \cite{stamatatos2011plagiarism}.}
+of characters
+from the source and suspicious documents, where the interval in both
+of these documents is covered ``densely enough'' by the common features.
+
+We then postprocess the valid intervals, removing overlapping detections,
+and merging detections which are close enough to each other.
+
+In the next sections, we summarize the modifications we made for PAN 2013,
+including approaches that we tried but did not use. On the training corpus,
+our software from PAN 2012 achieved a plagdet score of TODO, which we
+consider the baseline for further improvements.
+
+\subsection{Alternative features}
+
+TODO \cite{torrejondetailed}
+
+\subsection{Global postprocessing}
+
+For PAN 2013, the algorithm had access to all of the source and suspicious
+documents. Because of this, we have rewritten our software to handle
+all of the documents at once, in order to be able to do cross-document
+optimizations and postprocessing, similar to what we did for PAN 2010.
+This required refactoring most of the code. We are able to handle
+most of the computation in parallel in per-CPU threads, with little
+synchronization needed. The parallelization was used mainly
+during development, where it provided a significant performance boost.
+The official performance numbers are from a single-threaded run, however.
+
+For PAN 2010, we used the following postprocessing heuristics:
+If there are overlapping detections inside a suspicious document,
+keep the longer one, provided that it is long enough. For overlapping
+detections up to 600 characters,
+TODO
+
+
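The Overview above mentions merging detections that are close enough to each other. The following is a minimal sketch of that one step, not the authors' implementation. It assumes that a detection is reduced to a single `(start, end)` pair of character offsets in one document, whereas the paper works with paired intervals in the source and suspicious documents. The function name `merge_close_intervals` and the threshold `max_gap` are hypothetical. Overlapping detections are simply merged here, which differs from the "keep the longer one" heuristic described for PAN 2010.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Interval = Tuple[int, int]


def merge_close_intervals(intervals: List[Interval], max_gap: int) -> List[Interval]:
    """Merge intervals that overlap or lie at most max_gap characters apart."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        # The gap to the previous interval is negative when the two overlap.
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

With `max_gap=50`, for instance, the detections `(0, 100)` and `(120, 200)` would be joined into one, while a detection starting at offset 500 would stay separate. The threshold value is an illustration only; the source does not state one for this step.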