diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index c35ba7c..1b54ac7 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -1,7 +1,5 @@
 \section{Text Alignment}~\label{text_alignment}
-
-\subsection{Overview}
-
+%\subsection{Overview}
 Our approach to the text alignment subtask of PAN 2013 uses the same
 basic principles as our previous work in this area, described in
 \cite{suchomel_kas_12}, which in turn builds on our work for previous
@@ -36,8 +34,8 @@ In the next sections, we summarize the modifications we did for PAN 2013.
 \subsection{Alternative Features}
 \label{altfeatures}
 
-In PAN 2012, we have used word 5-grams and stop-word 8-grams.
-This year we have experimented with different word $n$-grams, and also
+In PAN 2012, we used word 5-grams and stop-word 8-grams.
+This year we experimented with different word $n$-grams, and also
 with contextual $n$-grams as described in \cite{torrejondetailed}.
 Modifying the algorithm to use contextual $n$-grams created as word
 5-grams with the middle word removed (i.e. two words before and two words
@@ -45,7 +43,7 @@ after the context) yielded better score:
 
 \plagdet{0.7421}{0.6721}{0.8282}{1.0000}
 
-We have then made tests with plain word 4-grams, and to our surprise,
+We then tested plain word 4-grams, and to our surprise,
 it gave an even better score than contextual $n$-grams:
 
 \plagdet{0.7447}{0.7556}{0.7340}{1.0000}
 
@@ -57,7 +55,7 @@ training corpus parts, plain word 4-grams were better at all parts of the corpus (in terms of plagdet score), except the 02-no-obfuscation part.
 
-In our final submission, we have used word 4-grams and stop-word 8-grams.
+In our final submission, we used word 4-grams and stop-word 8-grams.
 
 \subsection{Global Postprocessing}
 
@@ -72,17 +70,17 @@ optimizations and postprocessing, similar to what we did for PAN 2010.
 %for development, where it has provided a significant performance boost.
 %The official performance numbers are from single-threaded run, though.
 
-For PAN 2010, we have used the following postprocessing heuristics:
+For PAN 2010, we used the following postprocessing heuristic:
 If there are overlapping detections inside a suspicious document,
 keep the longer one, provided that it is long enough. For overlapping
-detections up to 600 characters, drop them both. We have implemented
-this heuristics, but have found that it led to a lower score than
+detections up to 600 characters, drop them both. We implemented
+this heuristic, but found that it led to a lower score than
 without this modification. Further experiments with global postprocessing
 of overlaps led to a new heuristic: we unconditionally drop both
 overlapping detections when each spans at most 250 characters, but if
 at least one of them is longer, we keep both detections. This is
 probably a result of plagdet being skewed too much towards recall
 (because the percentage of
-plagiarized cases in the corpus is way too high compared to real world),
+plagiarized cases in the corpus is far too high compared to the real world),
 so it is favourable to keep the detection even though the evidence
 for it is rather low.
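To make the overlap heuristic above concrete, here is a minimal Perl sketch of it (an illustration of ours, not the code of the submitted system). It assumes each detection carries start and end character offsets within the suspicious document; a pair of overlapping detections is dropped only when both spans are at most 250 characters long, otherwise both are kept.

#!/usr/bin/perl
# Sketch only: drop pairs of overlapping detections when both are short.
# A detection is assumed to be a hash reference with character offsets
# {start, end} within the suspicious document.
use strict;
use warnings;

sub overlaps {
    my ($d1, $d2) = @_;
    return $d1->{start} < $d2->{end} && $d2->{start} < $d1->{end};
}

sub drop_short_overlaps {
    my @det = @_;
    my %drop;
    for my $i (0 .. $#det - 1) {
        for my $j ($i + 1 .. $#det) {
            next unless overlaps($det[$i], $det[$j]);
            my $len_i = $det[$i]{end} - $det[$i]{start};
            my $len_j = $det[$j]{end} - $det[$j]{start};
            # drop both only when each spans at most 250 characters;
            # if at least one of them is longer, keep both
            @drop{$i, $j} = (1, 1) if $len_i <= 250 && $len_j <= 250;
        }
    }
    return map { $det[$_] } grep { !$drop{$_} } 0 .. $#det;
}

my @kept = drop_short_overlaps(
    { start => 100,  end => 300 },    # 200 characters, overlaps the next
    { start => 250,  end => 420 },    # 170 characters -> both are dropped
    { start => 1000, end => 2500 },   # long detection, kept
);
printf "%d detection(s) kept\n", scalar @kept;    # prints: 1 detection(s) kept

In this toy run only the long detection survives; the two short overlapping ones are discarded.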
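Similarly, the contextual $n$-grams from the Alternative Features hunk above (word 5-grams with the middle word left out, i.e. two words of context on each side) can be sketched as follows; whitespace tokenization and lowercasing are simplifying assumptions of this illustration, not a description of the actual preprocessing.

#!/usr/bin/perl
# Sketch only: contextual n-grams built as word 5-grams with the
# middle word removed (two words of context on each side).
use strict;
use warnings;

sub contextual_ngrams {
    my @w = @_;
    my @features;
    for my $i (0 .. $#w - 4) {
        # take the window w[i .. i+4] and leave out the middle word w[i+2]
        push @features, join ' ', @w[$i, $i + 1, $i + 3, $i + 4];
    }
    return @features;
}

my @tokens = split ' ', lc 'The quick brown fox jumps over the lazy dog';
print "$_\n" for contextual_ngrams(@tokens);
# the quick fox jumps
# quick brown jumps over
# ...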
@@ -92,7 +90,7 @@ The global postprocessing improved the score even more:
 
 \subsection{Evaluation Results and Future Work}
 
- The evaulation on the competition corpus had the following results:
+ The evaluation on the competition corpus had the following results:
 
 \plagdet{0.7448}{0.7659}{0.7251}{1.0003}
 
@@ -104,7 +102,7 @@ Compared to the other participants, our algorithm performs especially well for human-created plagiarism (the 05-summary-obfuscation sub-corpus), which is where we want to focus for our production systems\footnote{Our production systems include the Czech National Archive
-of Graduate Theses, \url{http://theses.cz}}.
+of Graduate Theses,\\ \url{http://theses.cz}}.
 
 % After the final evaluation, we did further experiments
 %with feature types, and discovered that using stop-word 8-grams,
 
@@ -115,9 +113,9 @@ We plan to experiment further with combining more than two types of features, be it continuous $n$-grams or contextual features.
-This should allow us to tune down the aggresive heuristics for joining
+This should allow us to tune down the aggressive heuristics for joining
 neighbouring detections, which should lead to higher precision,
-hopefully without sacrifying recall.
+hopefully without sacrificing recall.
 
 As for computational performance, it should be noted that our software
 is prototyped in a scripting language (Perl), so it is not