+Our results from tuning the parameters show that the plagdet score~\cite{potthastfamework}
+is not a good measure for comparing plagiarism detection systems.
+For example, the gap of 30,000 characters described in Section~\ref{postprocessing}
+can easily span several pages of text, and yet the system with this
+parameter set so high achieved a better plagdet score.
+
+Another problem with plagdet can be
+seen in the 01-no-plagiarism part of the training corpus: the border
+between the perfect score of 1 and the score of 0 is a single false-positive
+detection. Plagdet does not distinguish between a system reporting this
+single false positive and a system reporting the whole data set as plagiarized:
+both get the score 0. However, our experience with real-world plagiarism detection systems shows that
+plagiarized documents are in a clear minority, so the performance of
+a detection system on non-plagiarized documents is very important.
+
+\subsubsection{Performance Notes}
+
+We consider comparing the CPU-time performance of PAN 2012 submissions almost
+meaningless, because any sane system would precompute features for all
+documents in a given set of suspicious and source documents, and use the
+results for pair-wise comparison, expecting that any document will take
+part in more than one pair.
+
+Also, pair-wise comparison without caching any intermediate results
+leads to worse overall detection performance: in our PAN 2010 submission, one of the
+post-processing steps was to remove all overlapping detections
+from a given suspicious document when these detections came from different
+source documents and were short enough. This removed many false positives
+and improved the precision of our system. This kind of heuristic was
+not possible in PAN 2012.
+
+As for the performance of our system, we split the task into two parts:
+(1) finding the common features, and (2) computing valid intervals and
+post-processing. The first part is more CPU intensive, and its results
+can be cached. The second part is fast enough to allow us to evaluate
+many combinations of parameters.
+
+We did our development on a machine with four six-core AMD 8139 CPUs
+(2800 MHz), and 128 GB RAM. The first phase took about 2500 seconds
+on this host, and the second phase took 14 seconds. Computing the
+plagdet score using the official script in Python took between 120 and
+180 seconds, as there is no parallelism in this script.
+
+When we tried to use intrinsic plagiarism detection and language
+detection, the first phase took about 12500 seconds. Thus omitting these
+features clearly provided a huge performance improvement.
+
+The code was written in Perl and had 669 lines of code,
+not counting comments and blank lines.
+
+\endinput
+
+- the boundary between passages sometimes included whitespace and sometimes did not.
+
+Plagdet discussion:
+- users want "odevzdej to show 0\% similarity"; they do not care
+ what the number means
+- the exact boundaries of a detected passage do not matter
+- false positives are far worse
+- granularity is evil
+
+Final results on the test corpus:
+
+0.7288 0.5994 0.9306 1.0007 2012-06-16 02:23 plagdt recall precis granul
+ 01-no-plagiarism 0.0000 0.0000 0.0000 1.0000
+ 02-no-obfuscation 0.9476 0.9627 0.9330 1.0000
+ 03-artificial-low 0.8726 0.8099 0.9477 1.0013
+ 04-artificial-high 0.3649 0.2255 0.9562 1.0000
+ 05-translation 0.7610 0.6662 0.8884 1.0008
+ 06-simulated-paraphr 0.5972 0.4369 0.9433 1.0000
+
+Results on the competition data:
+plagdet precision recall granularity
+0.6826726 0.8931670 0.5524708 1.0000000
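+
+The row above is consistent with the usual definition of plagdet: the
+harmonic mean (F1) of precision and recall, divided by log2(1 + granularity).
+The sketch below only illustrates that formula. The function name and the
+handling of an undefined F1 are our own assumptions, not the behaviour of
+the official evaluation script.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """F1 of precision and recall, discounted by log2(1 + granularity)."""
    if precision + recall == 0.0:
        # Assumption: an undefined F1 (no correct detections) counts as 0.
        return 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return f1 / math.log2(1.0 + granularity)


# Precision, recall, and granularity copied from the competition-data row.
score = plagdet(precision=0.8931670, recall=0.5524708, granularity=1.0)
print(f"{score:.7f}")
```

+With granularity 1 the denominator is log2(2) = 1, so the score reduces
+to plain F1. Any granularity above 1 lowers the score.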
+
+Run-time:
+12500 seconds: tokenization including sc and language detection
+2500 seconds: without sc and language detection
+14 seconds: evaluation of valid intervals and postprocessing
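+
+The timings above reflect a split into a slow phase whose results can be
+cached and a fast phase that can be repeated for many parameter settings.
+The following is a minimal sketch of that idea, not our Perl code. The
+word 5-gram features, all names, and the min_shared threshold are
+hypothetical and serve only as an illustration.

```python
def word_ngrams(text, n=5):
    """Hypothetical feature extractor: the set of word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


class FeatureCache:
    """Computes the features of each document once and reuses them."""

    def __init__(self, n=5):
        self.n = n
        self.computed = 0  # number of documents actually processed
        self._cache = {}

    def features(self, doc_id, text):
        if doc_id not in self._cache:
            self._cache[doc_id] = word_ngrams(text, self.n)
            self.computed += 1
        return self._cache[doc_id]


def common_features(cache, docs, pairs):
    """Phase 1 (slow, cacheable): shared n-grams for every document pair."""
    return {
        (susp, src): cache.features(susp, docs[susp]) & cache.features(src, docs[src])
        for susp, src in pairs
    }


def sweep(common, min_shared_values):
    """Phase 2 (fast): re-evaluate the cached results under several thresholds."""
    return {
        m: sorted(pair for pair, shared in common.items() if len(shared) >= m)
        for m in min_shared_values
    }


docs = {
    "susp1": "a b c d e f g",
    "src1": "x a b c d e f",
    "src2": "completely different words here in this text",
}
pairs = [("susp1", "src1"), ("susp1", "src2")]

cache = FeatureCache()
common = common_features(cache, docs, pairs)  # run once
results = sweep(common, [1, 3])               # cheap to repeat
print(cache.computed, results)
```

+Although susp1 takes part in two pairs, its features are computed only
+once. This is the saving that a strictly pair-wise invocation cannot exploit.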
+
+
+TODO:
+- determine the boundary by match density
+- sort the XML by this_offset
+
+Here is the content of the JOURNAL file, showing how I measured some of the improvements:
+=================================================================
+baseline.py
+0.1250 0.1259 0.9783 2.4460 2012-05-03 06:02 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.8608 0.8609 0.8618 1.0009
+ 03_artificial_low 0.1006 0.1118 0.9979 2.9974
+ 04_artificial_high 0.0054 0.0029 0.9991 1.0778
+ 05_translation 0.0003 0.0002 1.0000 1.2143
+ 06_simulated_paraphr 0.0565 0.0729 0.9983 4.3075
+
+valid_intervals without postprocessing (this is what I submitted first)
+0.3183 0.2034 0.9883 1.0850 2012-05-25 15:25 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9861 0.9973 0.9752 1.0000
+ 03_artificial_low 0.4127 0.3006 0.9975 1.1724
+ 04_artificial_high 0.0008 0.0004 1.0000 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3470 0.2248 0.9987 1.0812
+
+postprocessed (merging of nearby intervals)
+0.3350 0.2051 0.9863 1.0188 2012-05-25 15:27 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9863 0.9973 0.9755 1.0000
+ 03_artificial_low 0.4541 0.3057 0.9942 1.0417
+ 04_artificial_high 0.0008 0.0004 1.0000 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3702 0.2279 0.9986 1.0032
+
+whitespace (whitespace handling adjusted)
+0.3353 0.2053 0.9858 1.0188 2012-05-31 17:57 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9865 0.9987 0.9745 1.0000
+ 03_artificial_low 0.4546 0.3061 0.9940 1.0417
+ 04_artificial_high 0.0008 0.0004 1.0000 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.3705 0.2281 0.9985 1.0032
+
+gap_100: whitespace, + allow a gap of 100 5-tuples in a valid interval instead of 50
+0.3696 0.2305 0.9838 1.0148 2012-05-31 18:07 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9850 0.9987 0.9717 1.0000
+ 03_artificial_low 0.5423 0.3846 0.9922 1.0310
+ 04_artificial_high 0.0058 0.0029 0.9151 1.0000
+ 05_translation 0.0001 0.0000 1.0000 1.0000
+ 06_simulated_paraphr 0.4207 0.2667 0.9959 1.0000
+
+gap_200: whitespace, + allow a gap of 200 5-tuples in a valid interval instead of 50
+0.3906 0.2456 0.9769 1.0070 2012-05-31 18:09 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9820 0.9987 0.9659 1.0000
+ 03_artificial_low 0.5976 0.4346 0.9875 1.0139
+ 04_artificial_high 0.0087 0.0044 0.9374 1.0000
+ 05_translation 0.0001 0.0001 1.0000 1.0000
+ 06_simulated_paraphr 0.4360 0.2811 0.9708 1.0000
+
+gap_200_int_10: gap_200, + a valid interval has at least 10 5-tuples instead of 20
+0.4436 0.2962 0.9660 1.0308 2012-05-31 18:11 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9612 0.9987 0.9264 1.0000
+ 03_artificial_low 0.7048 0.5808 0.9873 1.0530
+ 04_artificial_high 0.0457 0.0242 0.9762 1.0465
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+no_trans: gap_200_int_10, + do not detect translations at all, to avoid false positives
+0.4432 0.2959 0.9658 1.0310 2012-06-01 16:41 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9608 0.9980 0.9263 1.0000
+ 03_artificial_low 0.7045 0.5806 0.9872 1.0530
+ 04_artificial_high 0.0457 0.0242 0.9762 1.0465
+ 05_translation 0.0000 0.0000 0.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+
+swng_unsorted with the same postprocessing as "whitespace" above
+0.2673 0.1584 0.9281 1.0174 2012-05-31 14:20 plagdt recall precis granul
+ 01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
+ 02_no_obfuscation 0.9439 0.9059 0.9851 1.0000
+ 03_artificial_low 0.3178 0.1952 0.9954 1.0377
+ 04_artificial_high 0.0169 0.0095 0.9581 1.1707
+ 05_translation 0.0042 0.0028 0.0080 1.0000
+ 06_simulated_paraphr 0.1905 0.1060 0.9434 1.0000
+
+swng_sorted
+0.2550 0.1906 0.4067 1.0253 2012-05-30 16:07 plagdt recall precis granul
+ 01_no_plagiarism 0.0000 0.0000 0.0000 1.0000
+ 02_no_obfuscation 0.6648 0.9146 0.5222 1.0000
+ 03_artificial_low 0.4093 0.2867 0.8093 1.0483
+ 04_artificial_high 0.0454 0.0253 0.4371 1.0755
+ 05_translation 0.0030 0.0019 0.0064 1.0000
+ 06_simulated_paraphr 0.1017 0.1382 0.0814 1.0106
+
+sort_susp: gap_200_int_10, + in postprocessing, sort intervals by offset in susp rather than in src
+0.4437 0.2962 0.9676 1.0308 2012-06-01 18:06 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9641 0.9987 0.9317 1.0000
+ 03_artificial_low 0.7048 0.5809 0.9871 1.0530
+ 04_artificial_high 0.0457 0.0242 0.9762 1.0465
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5123 0.3485 0.9662 1.0000
+
+post_gap2_16000: sort_susp, + merge two intervals when < 16000 characters and the gap is only half the size of those intervals (was 4000)
+0.4539 0.2983 0.9642 1.0054 2012-06-01 18:09 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9631 0.9987 0.9300 1.0000
+ 03_artificial_low 0.7307 0.5883 0.9814 1.0094
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5133 0.3487 0.9721 1.0000
+
+post_gap2_32000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
+0.4543 0.2986 0.9638 1.0050 2012-06-01 18:12 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9628 0.9987 0.9294 1.0000
+ 03_artificial_low 0.7315 0.5893 0.9798 1.0085
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
+
+post_gap2_64000: sort_susp, + merge intervals < 32000 characters and the gap at least half the size
+0.4543 0.2988 0.9616 1.0050 2012-06-01 18:21 plagdt recall precis granul
+ 01_no_plagiarism 1.0000 1.0000 1.0000 1.0000
+ 02_no_obfuscation 0.9603 0.9987 0.9248 1.0000
+ 03_artificial_low 0.7316 0.5901 0.9782 1.0085
+ 04_artificial_high 0.0480 0.0247 0.9816 1.0078
+ 05_translation 0.0008 0.0004 1.0000 1.0000
+ 06_simulated_paraphr 0.5138 0.3487 0.9763 1.0000
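+
+The post_gap2 variants above merge neighbouring detections. Below is a
+minimal sketch of such a merging step under one possible reading of the
+rule: two neighbouring intervals are merged when the gap between them is
+below max_gap characters and at most half of their combined length.
+The function name, the exact condition, and the restriction to offsets in
+the suspicious document are assumptions made for illustration only.

```python
def merge_close_intervals(intervals, max_gap=16000):
    """Merge neighbouring (start, end) character intervals.

    Assumed rule: merge when the gap is below max_gap characters and
    at most half of the combined length of the two intervals.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            combined = (prev_end - prev_start) + (end - start)
            if gap < max_gap and 2 * gap <= combined:
                # Extend the previous interval instead of opening a new one.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


detections = [(0, 20000), (25000, 45000)]
print(merge_close_intervals(detections, max_gap=4000))
print(merge_close_intervals(detections, max_gap=16000))
```

+With max_gap=4000 the two detections stay separate, while with
+max_gap=16000 they collapse into one interval. Fewer, longer detections
+lower the granularity, which is how a very permissive gap can raise plagdet.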