\usepackage{amsmath}\r
\usepackage{amssymb}\r
\usepackage{multicol}\r
-\usepackage{bera}\r
\usepackage[utf8]{inputenc}\r
%\usepackage{fancybullets}\r
%\usepackage{floatflt}\r
%\usepackage{graphics}\r
+\usepackage{fontspec}\r
+\usepackage{xunicode}\r
+\setmainfont[Mapping=tex-text]{DejaVu Sans}\r
+\setsansfont[Mapping=tex-text]{DejaVu Sans}\r
+\setmonofont[Mapping=tex-text]{DejaVu Sans Mono}\r
\r
\definecolor{BoxCol}{rgb}{0.9,0.9,1}\r
% uncomment for light blue background to \section boxes \r
\definecolor{ReallyEmph}{rgb}{0.7,0,0}\r
\r
\renewcommand{\titlesize}{\Huge}\r
-\title{Diverse Queries and Feature Type Selection \\ for Plagiarism Discovery}\r
+\title{Diverse Queries and Feature Type Selection for Plagiarism Discovery}\r
\r
% Note: only give author names, not institute\r
\author{Šimon Suchomel, Jan Kasprzak, and Michal Brandejs}\r
\r
\setlength{\figbotskip}{\smallskipamount}\r
\r
+\renewcommand{\SubSection}[2][?]{\r
+ \vspace{0.5\secskip}\r
+ \refstepcounter{subsection}\r
+ {\bf \subsectionsize \textcolor{SectionCol}{\arabic{section}.\arabic{subsection}~#2}}\r
+ \par\vspace{0.375\secskip}\r
+}\r
+\r
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
%%% Begin of Document\r
\r
\begin{multicols}{2}\setlength{\columnseprule}{0pt}\r
\section{Introduction}\r
%\r
-PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum \r
+A program that helps deter real-world plagiarism needs to accomplish many tasks.\r
+The original documents from which the plagiarized text was created must be retrieved, and the suspicious passages in the\r
+input document must be highlighted. This poster presents the methodology used during the PAN 2013 competition on uncovering plagiarism.\r
+\r
+The whole process is depicted in Figure~\ref{fig:process}. The source retrieval task is divided into\r
+two subtasks, Querying and Selecting, during which the software utilizes a given search engine. The retrieved\r
+sources must be examined in detail in order to highlight as many plagiarism cases as possible. This step is shown\r
+as Text Alignment. Its results are called {\em detections}, i.e.~pairs of passages from the {\em source document} and the {\em suspicious document} that are similar enough to each other to serve as a basis for further manual examination for possible plagiarism.\r
%\r
\vfill\r
\columnbreak\r
%\r
\begin{figure}\r
\centering\r
- \includegraphics[width=0.6\textwidth]{img/source_retrieval_process.pdf}\r
+ \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}\r
\caption{Plagiarism discovery process.}\r
\label{fig:process}\r
\end{figure} \r
%\rm\r
%%% Introduction\r
\section{Querying}\r
-Querying means to effectively utilize the search engine in order to retrieve as many relevant\r
+Querying means utilizing a search engine effectively in order to retrieve as many relevant\r
documents as possible with the minimum number of queries.\r
%We consider the resulting document relevantif it shares some of text characteristics with the suspicious document.\r
-In real-world queries as such represent appreciable cost, therefore their minimization should be one of the top priorities. \r
+In the real world, queries represent an appreciable cost; therefore, minimizing their number should be one of the top priorities.\r
%\subsection{Types of Queries}\r
-From the suspicious document, there were three diverse types of queries extracted.\\\r
+During the initial phase, three diverse types of queries were extracted from each suspicious document.\\\r
\begin{minipage}{0.55\linewidth}\r
\subsection{Keywords Based Queries}\r
\begin{ytemize}\r
\begin{minipage}{0.55\linewidth}\r
\subsection{Intrinsic Plagiarism Based Queries}\r
\begin{ytemize}\r
-\item Averaged Word Frequency Class based chunking~\cite{AWFC};\r
+\item Averaged Word Frequency Class based chunking~\cite{awfc};\r
\item Random sentence selection from the chunk;\r
\item Non-deterministic;\r
\item Positional;\r
\section{Selecting}\r
Document snippets were used to decide whether to download a document for text alignment.\r
We used a 2-tuple measure, which counts how many neighbouring word pairs occur both in the snippet and in the suspicious document.\r
-Performance of this measure is depicted at picture~\ref{fig:snippet_graph}.\r
+The performance of this measure is depicted in Figure~\ref{fig:snippet_graph}.\r
Given this measure, a threshold for the download decision needs to be set so as to maximize the discovered similarities\r
while minimizing the total number of downloads.\r
A profitable threshold is the one at which the distance between those two curves is largest.\r
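
The snippet measure and the threshold choice described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the competition: the tokenization, the function names, and the idea that both curves are given as lists indexed by candidate threshold are all hypothetical.

```python
import re


def word_pairs(text):
    """Return the set of neighbouring word pairs (2-tuples) in a text."""
    # Assumption: lowercase alphanumeric tokens; the real tokenizer may differ.
    words = re.findall(r"\w+", text.lower())
    return set(zip(words, words[1:]))


def shared_pair_count(snippet, suspicious):
    """Count the word pairs that occur in both the snippet and the document."""
    return len(word_pairs(snippet) & word_pairs(suspicious))


def should_download(snippet, suspicious, threshold):
    """Download the source only if enough word pairs are shared."""
    return shared_pair_count(snippet, suspicious) >= threshold


def best_threshold(similarities, downloads):
    """Pick the index where the two curves are farthest apart.

    Assumption: both lists are indexed by candidate threshold and hold
    values on a comparable scale (e.g. fractions of the maximum).
    """
    gaps = [s - d for s, d in zip(similarities, downloads)]
    return max(range(len(gaps)), key=gaps.__getitem__)
```

With curves measured on a training corpus, `best_threshold` would return the position of the largest gap, and that value would then be passed to `should_download`.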
\caption{Downloads and similarities performance.}\r
\label{fig:snippet_graph}\r
\end{figure}\r
-\r
-\r
%\r
% Yenya's part\r
%\r
-\r
\section{Text Alignment}\r
\r
+The system uses the same basic principles as in \cite{suchomel_kas_12}:\r
+\r
+\begin{ytemize}\r
+\item{\cemph{common features} between source and suspicious documents}\r
+\begin{ytemize}\r
+\item{word 5-grams}\r
+\item{stop-word 8-grams \cite{stamatatos2011plagiarism}}\r
+\end{ytemize}\r
+\item{\cemph{valid intervals} of characters covered by common features\r
+ ``densely enough''}\r
+\item{\cemph{postprocessing}---remove overlapping detections,\r
+ join neighbouring detections}\r
+\end{ytemize}\r
+\r
+\subsection{Alternative Features}\r
+\r
+\begin{ytemize}\r
+\item{\cemph{contextual n-grams} \cite{torrejondetailed}}\r
+\begin{ytemize}\r
+\item{\cemph{The quick} brown \cemph{fox jumped} over the lazy dogs.}\r
+\item{The \cemph{quick brown} fox \cemph{jumped over} the lazy dogs.}\r
+\end{ytemize}\r
+\item{plain word 4-grams}\r
+\begin{ytemize}\r
+\item{\cemph{The quick brown fox} jumped over the lazy dogs.}\r
+\item{The \cemph{quick brown fox jumped} over the lazy dogs.}\r
+\end{ytemize}\r
+\end{ytemize}\r
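
The two feature types listed above can be illustrated with a short sketch. It is inferred only from the example sentences on this poster: the "contextual" variant here simply drops the middle word of a five-word window, which is an assumption and may omit steps of the cited method. All function names are hypothetical.

```python
def word_ngrams(words, n):
    """Plain word n-grams: every run of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def contextual_ngrams(words, window=5):
    """Window of `window` words with the middle word left out.

    Assumption: this mirrors the highlighted examples above, nothing more.
    """
    mid = window // 2
    return [
        tuple(words[i:i + mid] + words[i + mid + 1:i + window])
        for i in range(len(words) - window + 1)
    ]


def common_features(source_features, suspicious_features):
    """Features present in both the source and the suspicious document."""
    return set(source_features) & set(suspicious_features)


sentence = "The quick brown fox jumped over the lazy dogs".split()
plain = word_ngrams(sentence, 4)
contextual = contextual_ngrams(sentence)
```

For the example sentence, the first two entries of `plain` and `contextual` correspond to the highlighted words in the two pairs of examples above.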
+\r
+\begin{table}\r
+\r
+\begin{center}\r
+\begin{tabular}{|l|r|r|r|r|}\r
+\hline\r
+\bf feature & \bf recall & \bf precision & \bf granularity & \bf plagdet \\\r
+\hline\r
+plain 5-grams & 0.6306 & 0.8484 & 1.0000 & \cemph{0.7235} \\\r
+contextual 4-grams & 0.6721 & \cemph{0.8282} & 1.0000 & \cemph{0.7421} \\\r
+plain 4-grams & \cemph{0.7556} & 0.7340 & 1.0000 & \cemph{0.7447} \\\r
+\hline\r
+\end{tabular}\r
+\end{center}\r
+\r
+\caption{Comparison of plain 5-grams, contextual 4-grams, and plain word 4-grams}\r
+\end{table}\r
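
For reference, the plagdet column can be recomputed from the other three columns. The sketch below assumes the usual PAN definition of plagdet, i.e. the harmonic mean of precision and recall divided by $\log_2(1 + \text{granularity})$; the function name is hypothetical.

```python
import math


def plagdet(recall, precision, granularity):
    """F1 of precision and recall, discounted by log2(1 + granularity)."""
    if recall + precision == 0:
        return 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return f1 / math.log2(1 + granularity)


# Rows of the table above: (recall, precision, granularity).
rows = {
    "plain 5-grams": (0.6306, 0.8484, 1.0),
    "contextual 4-grams": (0.6721, 0.8282, 1.0),
    "plain 4-grams": (0.7556, 0.7340, 1.0),
}
scores = {name: plagdet(*values) for name, values in rows.items()}
```

With granularity equal to 1 the denominator is 1, so plagdet reduces to the F1 score; the recomputed values agree with the table up to rounding of the inputs.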
+\r
+\subsection{Global Postprocessing}\r
+\r
+\begin{ytemize}\r
+\item{Similar to PAN 2010 \cite{Kasprzak2010}}\r
+\item{Overlapping detections removal}\r
+\item{\cemph{Result:} improvement, but not as significant as in 2010}\r
+\end{ytemize}\r
+\r
%\r
% Common part\r
%\r
\r
\section{Conclusion}\r
\r
-Some conclusion\r
+\subsection{Candidate retrieval}\r
\r
-%%% References\r
+\begin{ytemize}\r
+\item{Second best ratio of recall to the number of queries}\r
+\item{Missing support for phrasal search in ChatNoir is a big stumbling block}\r
+\end{ytemize}\r
\r
-%% Note: use of BibTeX also works!!\r
+\subsection{Text alignment}\r
\r
-\bibliographystyle{plain}\r
-\begin{thebibliography}{1}\r
+\begin{ytemize}\r
+\item{Significant improvement over PAN 2013}\r
+\item{Word 4-grams are better than contextual 4-grams}\r
+\item{We need a better ranking system than plagdet!}\r
+\end{ytemize}\r
\r
-\bibitem{ISMU}\r
-\cemph{Masaryk University Information System}\\\r
-{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.\r
+%%% References\r
\r
-\bibitem{Theses}\r
-\cemph{Czech National Archive of Graduate Theses}\\\r
-{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.\r
+%% Note: use of BibTeX also works!!\r
\r
-\bibitem{AWFC}\r
-\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\\r
-{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}\r
+\bibliographystyle{plain}\r
+\bibliography{pan13-notebook}\r
+\nocite{awfc}\r
\r
-\end{thebibliography}\r
+%\begin{thebibliography}{1}\r
+%\r
+%\bibitem{ISMU}\r
+%\cemph{Masaryk University Information System}\\\r
+%{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.\r
+%\r
+%\bibitem{Theses}\r
+%\cemph{Czech National Archive of Graduate Theses}\\\r
+%{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.\r
+%\r
+%\bibitem{AWFC}\r
+%\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\\r
+%{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}\r
+%\r
+%\end{thebibliography}\r
\r
\smallskip\r
\hrule height .1em\r
\r
% \sffamily\r
\r
-QR code?\r
\r
+\hbox to \hsize{\r
+ {\hsize=0.5\hsize\vbox{\r
\cemph{Contact information:}\\\r
- Šimon Suchomel {\tt suchomel@fi.muni.cz},\\\r
- Jan Kasprzak, {\tt kas@fi.muni.cz}.\r
-\r
+ Šimon Suchomel {\tt suchomel@fi.muni.cz}\\\r
+ Jan Kasprzak {\tt kas@fi.muni.cz}\\\r
+ {\cemph{\tt http://www.fi.muni.cz/\~{}kas/pan13/}}\r
+}\r
+ \hfill\r
+ {\hsize=0.4\hsize\vbox{\r
+ \includegraphics[width=\hsize]{qrcode.png}\r
+}}}}\r
+ \r
\r
\end{multicols}\r
\r
\end{document}\r
-\r