X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-poster%2Fposter.tex;h=a01b220400aeb1d6e42f7b2fa44eead3b1b26c76;hb=b079a98b8d46aee8c68ff3d968873eae403b25e1;hp=453c32ad31a459fb3c559e6acc77f088023ea266;hpb=91d4f5116c1cef56b06a6929ccd277d265f4f2bf;p=pan13-paper.git

diff --git a/pan13-poster/poster.tex b/pan13-poster/poster.tex
index 453c32a..a01b220 100755
--- a/pan13-poster/poster.tex
+++ b/pan13-poster/poster.tex
@@ -116,41 +116,40 @@
 
 
 \begin{multicols}{2}\setlength{\columnseprule}{0pt}
-
-
 \section{Introduction}
+%
+A program that helps deter real-world plagiarism needs to accomplish many tasks.
+The original documents that served as sources of the plagiarism must be retrieved, and the suspicious passages of the
+input document must be highlighted. This poster presents the methodology used in the PAN 2013 competition on uncovering plagiarism.
 
-PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum 
-
+The whole process is depicted in Figure~\ref{fig:process}. The source retrieval task is divided into
+two subtasks, Querying and Selecting, during which the software utilizes the given search engine. The retrieved
+sources must be examined in detail in order to highlight as many plagiarism cases as possible. This step is depicted
+as Text Alignment.
 
+%
 \vfill
 \columnbreak
-
+%
 \begin{figure}
  \centering
-  \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}
+  \includegraphics[width=0.7\textwidth]{img/source_retrieval_process.pdf}
   \caption{Plagiarism discovery process.}
   \label{fig:process}
 \end{figure} 
-
-
 \end{multicols}
-
-
-
 \begin{multicols}{2}
-
 %\rm
-
 %%% Introduction
 \section{Querying}
 Querying means to effectively utilize the search engine in order to retrieve as many relevant
 documents as possible with the minimum amount of queries.
 %We consider the resulting document relevantif it shares some of text characteristics with the suspicious document.
-In real-world queries as such represent appreciable cost, therefore their minimization should be one of the top priorities. \\
-\subsection{Types of Queries}
-From the suspicious document, there were three diverse types of queries extracted.
-\subsubsection{Keywords Based Queries}
+In the real world, queries as such represent an appreciable cost; therefore, their minimization should be one of the top priorities.
+%\subsection{Types of Queries}
+Three diverse types of queries were extracted from the suspicious document.\\
+\begin{minipage}{0.55\linewidth}
+\subsection{Keywords Based Queries}
 \begin{ytemize}
 \item TF--IDF based automated keyword extraction;
 \item 5-token long; 
@@ -158,29 +157,58 @@ From the suspicious document, there were three diverse types of queries extracte
 \item Non-positional;
 \item Non-phrasal.
 \end{ytemize}
-\subsubsection{Intrinsic Plagiarism Based Queries}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+  \includegraphics[width=1\linewidth]{img/document_keywords.pdf}
+\end{figure}
+\end{minipage}
+\begin{minipage}{0.55\linewidth}
+\subsection{Intrinsic Plagiarism Based Queries}
 \begin{ytemize}
-\item Averaged Word Frequency Class based chunking~\cite{AWFC};
+\item Averaged Word Frequency Class based chunking~\cite{awfc};
 \item Random sentence selection from the chunk;
 \item Non-deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
-
-\begin{figure}[r]{100pt}
- \centering
-  \includegraphics[width=0.4\textwidth]{img/document_awfc.pdf}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+  \includegraphics[width=1\linewidth]{img/document_awfc.pdf}
 \end{figure}
-
-\subsubsection{Paragraph Based Queries}
+\end{minipage}
+\begin{minipage}{0.55\linewidth}
+\subsection{Paragraph Based Queries}
 \begin{ytemize}
 \item Longest sentences from miscellaneous paragraphs;
 \item Deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+ %\centering
+  \includegraphics[width=1\linewidth]{img/document_paragraphs.pdf}
+\end{figure}
+\end{minipage}
+
+\begin{figure}[h]
+ \centering
+  \includegraphics[width=0.8\linewidth]{img/queryprocess.pdf}
+   \caption{Stepwise queries execution process.}
+\end{figure}
 
 \section{Selecting}
+Document snippets were used to decide whether to download a document for text alignment.
+We used a 2-tuple measure, which indicates how many pairs of neighbouring words occur both in the snippet and in the suspicious document.
+The performance of this measure is depicted in Figure~\ref{fig:snippet_graph}.
+Given this measure, a threshold for the download decision needs to be set so as to maximize the number of discovered similarities
+and minimize the total number of downloads.
+A profitable threshold is the one at which the distance between the two curves is the largest.
 \begin{figure}
   \centering
   \includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}
@@ -188,40 +216,59 @@ From the suspicious document, there were three diverse types of queries extracte
   \label{fig:snippet_graph}
 \end{figure}
 
+
 %
 % Yenya's part
 %
 
 \section{Text Alignment}
 
+The system uses the same basic principles as in \cite{suchomel_kas_12}.
+
 %
 % Common part
 %
 
 \section{Conclusion}
 
-Nějaký závěr
+\subsection{Candidate retrieval}
 
-%%% References
+\begin{itemize}
+\item{Second best ratio of recall to the number of queries}
+\item{Missing support for phrasal search in ChatNoir is a big stumbling block}
+\end{itemize}
 
-%% Note: use of BibTeX als works!!
+\subsection{Text alignment}
 
-\bibliographystyle{plain}
-\begin{thebibliography}{1}
+\begin{itemize}
+\item{Significant improvement against PAN 2013}
+\item{Word 4-grams are better than contextual 4-grams}
+\item{We need a better ranking system than plagdet!}
+\end{itemize}
 
-\bibitem{ISMU}
-\cemph{Masaryk University Information System}\\
-{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
+%%% References
 
-\bibitem{Theses}
-\cemph{Czech National Archive of Graduate Theses}\\
-{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
+%% Note: use of BibTeX also works!!
 
-\bibitem{AWFC}
-\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
-{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}
+\bibliographystyle{plain}
+\bibliography{pan13-notebook}
+\nocite{awfc}
 
-\end{thebibliography}
+%\begin{thebibliography}{1}
+%
+%\bibitem{ISMU}
+%\cemph{Masaryk University Information System}\\
+%{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
+%
+%\bibitem{Theses}
+%\cemph{Czech National Archive of Graduate Theses}\\
+%{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
+%
+%\bibitem{AWFC}
+%\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
+%{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}
+%
+%\end{thebibliography}
 
 \smallskip
 \hrule height .1em
@@ -229,14 +276,20 @@ Nějaký závěr
 
 % \sffamily
 
-QR kód?
 
+\hbox to \hsize{
+	{\hsize=0.5\hsize\vbox{
 \cemph{Contact information:}\\
-	Šimon Suchomel {\tt suchomel@fi.muni.cz},\\
-	Jan Kasprzak, {\tt kas@fi.muni.cz}.
-
+	Šimon Suchomel {\tt suchomel@fi.muni.cz}\\
+	Jan Kasprzak {\tt kas@fi.muni.cz}\\
+	{\cemph{\tt http://www.fi.muni.cz/\~{}kas/pan13/}}
+}
+	\hfill
+	{\hsize=0.4\hsize\vbox{
+	\includegraphics[width=\hsize]{qrcode.png}
+}}}}
+	
 
 \end{multicols}
 
 \end{document}
-