\r
\r
\begin{multicols}{2}\setlength{\columnseprule}{0pt}\r
-\r
-\r
\section{Introduction}\r
-\r
+%\r
PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum \r
-\r
-\r
+%\r
\vfill\r
\columnbreak\r
-\r
+%\r
\begin{figure}\r
\centering\r
- \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}\r
+ \includegraphics[width=0.6\textwidth]{img/source_retrieval_process.pdf}\r
\caption{Plagiarism discovery process.}\r
\label{fig:process}\r
\end{figure} \r
-\r
-\r
\end{multicols}\r
-\r
-\r
-\r
\begin{multicols}{2}\r
-\r
%\rm\r
-\r
%%% Introduction\r
\section{Querying}\r
Querying means using the search engine effectively in order to retrieve as many relevant\r
documents as possible with the minimum number of queries.\r
%We consider the resulting document relevant if it shares some text characteristics with the suspicious document.\r
-In real-world queries as such represent appreciable cost, therefore their minimization should be one of the top priorities. \\\r
-\subsection{Types of Queries}\r
-From the suspicious document, there were three diverse types of queries extracted.\r
-\subsubsection{Keywords Based Queries}\r
+In the real world, queries as such represent an appreciable cost; their minimization should therefore be one of the top priorities. \r
+%\subsection{Types of Queries}\r
+Three different types of queries were extracted from the suspicious document.\\\r
+\begin{minipage}{0.55\linewidth}\r
+\subsection{Keywords Based Queries}\r
\begin{ytemize}\r
\item TF--IDF based automated keyword extraction;\r
\item 5-token long; \r
\item Non-positional;\r
\item Non-phrasal.\r
\end{ytemize}\r
-\r
+\end{minipage}\r
+\begin{minipage}{0.45\linewidth}\r
+\begin{figure}[h]\r
+ %\centering\r
+ \includegraphics[width=1\linewidth]{img/document_keywords.pdf}\r
+\end{figure}\r
+\end{minipage}\r
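As a rough illustration of the keyword-based queries listed above, the following Python sketch ranks the terms of a document by TF--IDF and groups the top-ranked terms into 5-token, non-phrasal queries. The reference corpus used for the IDF part, the smoothed IDF formula, the tokenizer, and the names \texttt{keyword\_queries}, \texttt{query\_len}, and \texttt{n\_queries} are all assumptions made for this sketch; it is not the extraction procedure used in the experiments.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_queries(document, corpus, query_len=5, n_queries=2):
    """Build non-positional, non-phrasal queries from the top TF-IDF terms.

    corpus is a hypothetical list of reference documents used only for IDF.
    """
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(doc_sets)

    def idf(term):
        # Smoothed IDF (an assumption): never zero, defined for unseen terms.
        df = sum(1 for s in doc_sets if term in s)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    # Highest TF-IDF first; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    top = ranked[: query_len * n_queries]
    # Consecutive groups of query_len keywords form one query each.
    return [" ".join(top[i:i + query_len]) for i in range(0, len(top), query_len)]
```

Word order within a query carries no meaning here, which matches the non-positional and non-phrasal properties listed above.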
\begin{minipage}{0.55\linewidth}\r
-\subsubsection{Intrinsic Plagiarism Based Queries}\r
+\subsection{Intrinsic Plagiarism Based Queries}\r
\begin{ytemize}\r
\item Averaged Word Frequency Class based chunking~\cite{AWFC};\r
\item Random sentence selection from the chunk;\r
\includegraphics[width=1\linewidth]{img/document_awfc.pdf}\r
\end{figure}\r
\end{minipage}\r
-\r
-\subsubsection{Paragraph Based Queries}\r
+\begin{minipage}{0.55\linewidth}\r
+\subsection{Paragraph Based Queries}\r
\begin{ytemize}\r
\item Longest sentences from miscellaneous paragraphs;\r
\item Deterministic;\r
\item Positional;\r
\item Phrasal.\r
\end{ytemize}\r
+\end{minipage}\r
+\begin{minipage}{0.45\linewidth}\r
+\begin{figure}[h]\r
+ %\centering\r
+ \includegraphics[width=1\linewidth]{img/document_paragraphs.pdf}\r
+\end{figure}\r
+\end{minipage}\r
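The paragraph-based queries above can be illustrated with a short Python sketch that takes the longest sentence of each paragraph as a phrasal query. The paragraph separator (a blank line), the naive sentence splitter, the use of character length, and the name \texttt{paragraph\_queries} are assumptions for this sketch only, not the actual implementation.

```python
import re


def paragraph_queries(document):
    """Return the longest sentence of each paragraph, in document order."""
    queries = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", document):
        # Naive split after '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        sentences = [s.strip() for s in parts if s.strip()]
        if sentences:
            # max() returns the first longest sentence, so the choice is deterministic.
            queries.append(max(sentences, key=len))
    return queries
```

Because each query is a verbatim sentence tied to a known paragraph, it is both phrasal and positional in the sense of the list above.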
+\r
+\begin{figure}[h]\r
+ \centering\r
+ \includegraphics[width=0.8\linewidth]{img/queryprocess.pdf}\r
+  \caption{Stepwise query execution process.}\r
+\end{figure}\r
\r
\section{Selecting}\r
+Document snippets were used to decide whether to download a document for text alignment.\r
+We used a 2-tuple measure, which counts how many neighbouring word pairs occur in both the snippet and the suspicious document.\r
+The performance of this measure is shown in Figure~\ref{fig:snippet_graph}.\r
+Given this measure, a threshold for the download decision must be set so as to maximize the number of discovered similarities\r
+and minimize the total number of downloads.\r
+A profitable threshold is the one at which the distance between those two curves is largest.\r
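The measure and the threshold choice described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions: tokens are lowercased word characters, each distinct word pair is counted once, and the two curves are taken to be the share of relevant documents still downloaded and the share of all candidate documents downloaded at a given threshold. The names \texttt{shared\_pairs} and \texttt{best\_threshold} and the input format are hypothetical.

```python
import re


def word_pairs(text):
    """Return the set of neighbouring word pairs (2-tuples) in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return set(zip(tokens, tokens[1:]))


def shared_pairs(snippet, suspicious):
    """Count distinct word pairs present in both the snippet and the suspicious document."""
    return len(word_pairs(snippet) & word_pairs(suspicious))


def best_threshold(scored):
    """Pick the threshold with the largest gap between the two assumed curves.

    scored is a list of (pair_count, is_relevant) tuples, one per candidate document.
    A document is downloaded when its pair_count is at least the threshold.
    """
    total = len(scored)
    relevant = sum(1 for _, rel in scored if rel)
    best_t, best_gap = None, float("-inf")
    for t in sorted({count for count, _ in scored}):
        kept = [rel for count, rel in scored if count >= t]
        found = sum(1 for rel in kept if rel) / relevant if relevant else 0.0
        downloads = len(kept) / total
        gap = found - downloads
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t
```

For instance, \texttt{shared\_pairs("the quick brown fox", "a quick brown fox jumps")} counts the two pairs (quick, brown) and (brown, fox).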
\begin{figure}\r
\centering\r
\includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}\r
\label{fig:snippet_graph}\r
\end{figure}\r
\r
+\r
%\r
% Yenya's part\r
%\r