\usepackage{amsmath}\r
\usepackage{amssymb}\r
\usepackage{multicol}\r
-\usepackage{bera}\r
\usepackage[utf8]{inputenc}\r
%\usepackage{fancybullets}\r
%\usepackage{floatflt}\r
%\usepackage{graphics}\r
+\usepackage{fontspec}\r
+\usepackage{xunicode}\r
+\setmainfont[Mapping=tex-text]{DejaVu Sans}\r
+\setsansfont[Mapping=tex-text]{DejaVu Sans}\r
+\setmonofont[Mapping=tex-text]{DejaVu Sans Mono}\r
\r
\definecolor{BoxCol}{rgb}{0.9,0.9,1}\r
% uncomment for light blue background to \section boxes \r
\r
\setlength{\figbotskip}{\smallskipamount}\r
\r
+\renewcommand{\SubSection}[2][?]{\r
+ \vspace{0.5\secskip}\r
+ \refstepcounter{subsection}\r
+ {\bf \subsectionsize \textcolor{SectionCol}{\arabic{section}.\arabic{subsection}~#2}}\r
+ \par\vspace{0.375\secskip}\r
+}\r
+\r
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
%%% Begin of Document\r
\r
input document must be highlighted. This poster presents the methodology used during the PAN2013 competition on uncovering plagiarism.\r
\r
The whole process is depicted in Figure~\ref{fig:process}. The source retrieval task is divided into\r
-2 subtasks: Quering and Selecting, during which the software utilizes given search engine. The retrieved\r
+2 subtasks: Querying and Selecting, during which the software utilizes a given search engine. The retrieved\r
sources must be examined in detail in order to highlight as many plagiarism cases as possible. This process is depicted\r
-as Text Alignment.\r
-\r
+as Text Alignment. The results of this process are called {\em detections}, i.e.~passages of the {\em source document} and the {\em suspicious document} that are similar enough to each other to serve as a basis for further manual examination for possible plagiarism.\r
%\r
\vfill\r
\columnbreak\r
%\r
\begin{figure}\r
\centering\r
- \includegraphics[width=0.7\textwidth]{img/source_retrieval_process.pdf}\r
+ \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}\r
\caption{Plagiarism discovery process.}\r
\label{fig:process}\r
\end{figure} \r
%\rm\r
%%% Introduction\r
\section{Querying}\r
-Querying means to effectively utilize the search engine in order to retrieve as many relevant\r
+Querying means to effectively utilize a search engine in order to retrieve as many relevant\r
documents as possible with the minimum number of queries.\r
%We consider the resulting document relevant if it shares some text characteristics with the suspicious document.\r
-In real-world queries as such represent appreciable cost, therefore their minimization should be one of the top priorities. \r
+In real-world use, queries themselves represent an appreciable cost; minimizing their number should therefore be one of the top priorities. \r
%\subsection{Types of Queries}\r
-From the suspicious document, there were three diverse types of queries extracted.\\\r
+During the initial phase, three diverse types of queries were extracted from each suspicious document.\\\r
\begin{minipage}{0.55\linewidth}\r
\subsection{Keywords Based Queries}\r
\begin{ytemize}\r
\caption{Downloads and similarities performance.}\r
\label{fig:snippet_graph}\r
\end{figure}\r
-\r
-\r
%\r
% Yenyova cast\r
%\r
-\r
\section{Text Alignment}\r
\r
The system uses the same basic principles as in \cite{suchomel_kas_12}:\r
\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{\cemph{common features} between source and suspicious documents}\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{word 5-grams}\r
\item{stop-word 8-grams \cite{stamatatos2011plagiarism}}\r
-\end{itemize}\r
+\end{ytemize}\r
\item{\cemph{valid intervals} of characters covered by common features\r
``densely enough''}\r
\item{\cemph{postprocessing}---remove overlapping detections,\r
join neighbouring detections}\r
-\end{itemize}\r
+\end{ytemize}\r
\r
\subsection{Alternative Features}\r
\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{\cemph{contextual n-grams} \cite{torrejondetailed}}\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{\cemph{The quick} brown \cemph{fox jumped} over the lazy dogs.}\r
\item{The \cemph{quick brown} fox \cemph{jumped over} the lazy dogs.}\r
-\end{itemize}\r
+\end{ytemize}\r
\item{plain word 4-grams}\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{\cemph{The quick brown fox} jumped over the lazy dogs.}\r
\item{The \cemph{quick brown fox jumped} over the lazy dogs.}\r
-\end{itemize}\r
-\end{itemize}\r
+\end{ytemize}\r
+\end{ytemize}\r
\r
\begin{table}\r
\r
\r
\subsection{Global Postprocessing}\r
\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{Similar to PAN 2010 \cite{Kasprzak2010}}\r
\item{Overlapping detections removal}\r
-\item{\cemph{Result:} improvement, but not as big as in 2010}\r
-\end{itemize}\r
+\item{\cemph{Result:} improvement, but not as significant as in 2010}\r
+\end{ytemize}\r
\r
%\r
% Spolecna cast\r
\r
\subsection{Candidate retrieval}\r
\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{Second best ratio of recall to the number of queries}\r
\item{Missing support for phrasal search in ChatNoir is a big stumbling block}\r
-\end{itemize}\r
+\end{ytemize}\r
\r
\subsection{Text alignment}\r
\r
-\begin{itemize}\r
+\begin{ytemize}\r
\item{Significant improvement against PAN 2013}\r
\item{Word 4-grams are better than contextual 4-grams}\r
\item{We need a better ranking system than plagdet!}\r
-\end{itemize}\r
+\end{ytemize}\r
\r
%%% References\r
\r