Final version 3

diff --git a/pan13-paper/yenya-text_alignment.tex b/pan13-paper/yenya-text_alignment.tex
index 1f4f5cf58a37c6e09a7a2d609cbcc340fd16c0ab..1b54ac7bc478f5558345f4fea44e697e2bed93e6 100755
--- a/pan13-paper/yenya-text_alignment.tex
+++ b/pan13-paper/yenya-text_alignment.tex
@@ -34,8 +34,8 @@ In the next sections, we summarize the modifications we did for PAN 2013.
  \subsection{Alternative Features}\r
  \label{altfeatures}\r
  \r
-In PAN 2012, we have used word 5-grams and stop-word 8-grams.\r
-This year we have experimented with different word $n$-grams, and also\r
+In PAN 2012, we used word 5-grams and stop-word 8-grams.\r
+This year we experimented with different word $n$-grams, and also\r
  with contextual $n$-grams as described in \cite{torrejondetailed}.\r
  Modifying the algorithm to use contextual $n$-grams created as word\r
  5-grams with the middle word removed (i.e. two words before and two words\r
@@ -43,7 +43,7 @@ after the context) yielded better score:
  \r
  \plagdet{0.7421}{0.6721}{0.8282}{1.0000}\r
  \r
-We have then made tests with plain word 4-grams, and to our surprise,\r
+We then made tests with plain word 4-grams, and to our surprise,\r
  it gave even better score than contextual $n$-grams:\r
  \r
  \plagdet{0.7447}{0.7556}{0.7340}{1.0000}\r
@@ -55,7 +55,7 @@ training corpus parts, plain word 4-grams were better at all parts
  of the corpus (in terms of plagdet score), except the 02-no-obfuscation\r
  part.\r
  \r
-In our final submission, we have used word 4-grams and stop-word 8-grams.\r
+In our final submission, we used word 4-grams and stop-word 8-grams.\r
  \r
  \subsection{Global Postprocessing}\r
  \r
@@ -70,17 +70,17 @@ optimizations and postprocessing, similar to what we did for PAN 2010.
  %for development, where it has provided a significant performance boost.\r
  %The official performance numbers are from single-threaded run, though.\r
  \r
-For PAN 2010, we have used the following postprocessing heuristics:\r
+For PAN 2010, we used the following postprocessing heuristics:\r
  If there are overlapping detections inside a suspicious document,\r
  keep the longer one, provided that it is long enough. For overlapping\r
-detections up to 600 characters, drop them both. We have implemented\r
-this heuristics, but have found that it led to a lower score than\r
+detections up to 600 characters, drop them both. We implemented\r
+this heuristics, but found that it led to a lower score than\r
  without this modification. Further experiments with global postprocessing\r
  of overlaps led to a new heuristics: we unconditionally drop overlapping\r
  detections with up to 250 characters both, but if at least one of them\r
  is longer, we keep both detections. This is probably a result of\r
  plagdet being skewed too much towards recall (because the percentage of\r
-plagiarized cases in the corpus is way too high compared to real world),\r
+plagiarized cases in the corpus is way too high compared to real-world),\r
  so it is favourable to keep the detection even though the evidence\r
  for it is rather low.\r
  \r
@@ -90,7 +90,7 @@ The global postprocessing improved the score even more:
  \r
  \subsection{Evaluation Results and Future Work}\r
  \r
-       The evaulation on the competition corpus had the following results:\r
+       The evaluation on the competition corpus had the following results:\r
  \r
  \plagdet{0.7448}{0.7659}{0.7251}{1.0003}\r
  \r
@@ -102,7 +102,7 @@ Compared to the other participants, our algorithm performs
  especially well for human-created plagiarism (the 05-summary-obfuscation\r
  sub-corpus), which is where we want to focus for our production\r
  systems\footnote{Our production systems include the Czech National Archive\r
-of Graduate Theses, \url{http://theses.cz}}.\r
+of Graduate Theses,\\ \url{http://theses.cz}}.\r
  \r
  %      After the final evaluation, we did further experiments\r
  %with feature types, and discovered that using stop-word 8-grams,\r
@@ -113,9 +113,9 @@ of Graduate Theses, \url{http://theses.cz}}.
  \r
  We plan to experiment further with combining more than two types\r
  of features, be it continuous $n$-grams or contextual features.\r
-This should allow us to tune down the aggresive heuristics for joining\r
+This should allow us to tune down the aggressive heuristics for joining\r
  neighbouring detections, which should lead to higher precision,\r
-hopefully without sacrifying recall.\r
+hopefully without sacrificing recall.\r
  \r
         As for the computational performance, it should be noted that\r
  our software is prototyped in a scripting language (Perl), so it is not\r