X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=pan13-poster%2Fposter.tex;h=a01b220400aeb1d6e42f7b2fa44eead3b1b26c76;hb=b079a98b8d46aee8c68ff3d968873eae403b25e1;hp=453c32ad31a459fb3c559e6acc77f088023ea266;hpb=91d4f5116c1cef56b06a6929ccd277d265f4f2bf;p=pan13-paper.git

diff --git a/pan13-poster/poster.tex b/pan13-poster/poster.tex
index 453c32a..a01b220 100755
--- a/pan13-poster/poster.tex
+++ b/pan13-poster/poster.tex
@@ -116,41 +116,40 @@
 \begin{multicols}{2}\setlength{\columnseprule}{0pt}
-
-
 \section{Introduction}
+%
+A program for deterring real-world plagiarism needs to accomplish many tasks.
+The original documents that served as sources of the plagiarism must be retrieved, and the suspicious passages of the
+input document must be highlighted.
 This poster presents the methodology used during the PAN 2013 competition on uncovering plagiarism.
-PAN 2013 LOrem ipsum Lorem ipsum Lorem ipsumLorem ipsumLorem ipsumLorem ipsumLorem ipsum
-
+The whole process is depicted in Figure~\ref{fig:process}. The source retrieval task is divided into
+two subtasks, Querying and Selecting, during which the software uses the given search engine. The retrieved
+sources must be examined in detail in order to highlight as many plagiarism cases as possible. This process is depicted
+as Text Alignment.
+%
 \vfill
 \columnbreak
-
+%
 \begin{figure}
   \centering
-  \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}
+  \includegraphics[width=0.7\textwidth]{img/source_retrieval_process.pdf}
   \caption{Plagiarism discovery process.}
   \label{fig:process}
 \end{figure}
-
-
 \end{multicols}
-
-
-
 \begin{multicols}{2}
-
 %\rm
-
 %%% Introduction
 \section{Querying}
 Querying means using the search engine effectively in order to retrieve as many relevant documents as possible with the minimum number of queries.
 %We consider the resulting document relevant if it shares some text characteristics with the suspicious document.
-In the real world, queries as such represent an appreciable cost; therefore, their minimization should be one of the top priorities. \\
-\subsection{Types of Queries}
-From the suspicious document, there were three diverse types of queries extracted.
-\subsubsection{Keywords Based Queries}
+In the real world, queries as such represent an appreciable cost; therefore, their minimization should be one of the top priorities.
+%\subsection{Types of Queries}
+From the suspicious document, there were three diverse types of queries extracted.\\
+\begin{minipage}{0.55\linewidth}
+\subsection{Keywords Based Queries}
 \begin{ytemize}
 \item TF--IDF based automated keyword extraction;
 \item 5-token long;
@@ -158,29 +157,58 @@ From the suspicious document, there were three diverse types of queries extracte
 \item Non-positional;
 \item Non-phrasal.
 \end{ytemize}
-\subsubsection{Intrinsic Plagiarism Based Queries}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+  %\centering
+  \includegraphics[width=1\linewidth]{img/document_keywords.pdf}
+\end{figure}
+\end{minipage}
+\begin{minipage}{0.55\linewidth}
+\subsection{Intrinsic Plagiarism Based Queries}
 \begin{ytemize}
-\item Averaged Word Frequency Class based chunking~\cite{AWFC};
+\item Averaged Word Frequency Class based chunking~\cite{awfc};
 \item Random sentence selection from the chunk;
 \item Non-deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
-
-\begin{figure}[r]{100pt}
-  \centering
-  \includegraphics[width=0.4\textwidth]{img/document_awfc.pdf}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+  %\centering
+  \includegraphics[width=1\linewidth]{img/document_awfc.pdf}
 \end{figure}
-
-\subsubsection{Paragraph Based Queries}
+\end{minipage}
+\begin{minipage}{0.55\linewidth}
+\subsection{Paragraph Based Queries}
 \begin{ytemize}
 \item Longest sentences from miscellaneous paragraphs;
 \item Deterministic;
 \item Positional;
 \item Phrasal.
 \end{ytemize}
+\end{minipage}
+\begin{minipage}{0.45\linewidth}
+\begin{figure}[h]
+  %\centering
+  \includegraphics[width=1\linewidth]{img/document_paragraphs.pdf}
+\end{figure}
+\end{minipage}
+
+\begin{figure}[h]
+  \centering
+  \includegraphics[width=0.8\linewidth]{img/queryprocess.pdf}
+  \caption{Stepwise query execution process.}
+\end{figure}
 \section{Selecting}
+Document snippets were used to decide whether to download a document for text alignment.
+We used a 2-tuple measure, which indicates how many neighbouring word pairs occur in both the snippet and the suspicious document.
+The performance of this measure is depicted in Figure~\ref{fig:snippet_graph}.
+Given this measure, a threshold for the download decision needs to be set in order to maximize the number of discovered similarities
+and minimize the total number of downloads.
+A profitable threshold is one that corresponds to the largest distance between those two curves.
 \begin{figure}
   \centering
   \includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}
@@ -188,40 +216,59 @@ From the suspicious document, there were three diverse types of queries extracte
   \label{fig:snippet_graph}
 \end{figure}
+
 %
 % Yenya's part
 %
 \section{Text Alignment}
+The system uses the same basic principles as in \cite{suchomel_kas_12}.
+
 %
 % Common part
 %
 \section{Conclusion}
-Some conclusion
+\subsection{Candidate retrieval}
-%%% References
+\begin{itemize}
+\item{Second-best ratio of recall to the number of queries}
+\item{Missing support for phrasal search in ChatNoir is a big stumbling block}
+\end{itemize}
-%% Note: use of BibTeX also works!!
+\subsection{Text alignment}
-\bibliographystyle{plain}
-\begin{thebibliography}{1}
+\begin{itemize}
+\item{Significant improvement against PAN 2013}
+\item{Word 4-grams are better than contextual 4-grams}
+\item{We need a better ranking system than plagdet!}
+\end{itemize}
-\bibitem{ISMU}
-\cemph{Masaryk University Information System}\\
-{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
+%%% References
-\bibitem{Theses}
-\cemph{Czech National Archive of Graduate Theses}\\
-{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
+%% Note: use of BibTeX also works!!
-\bibitem{AWFC}
-\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
-{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}
+\bibliographystyle{plain}
+\bibliography{pan13-notebook}
+\nocite{awfc}
-\end{thebibliography}
+%\begin{thebibliography}{1}
+%
+%\bibitem{ISMU}
+%\cemph{Masaryk University Information System}\\
+%{\tt http://is.muni.cz/}, contact: {\tt iscor@fi.muni.cz}.
+%
+%\bibitem{Theses}
+%\cemph{Czech National Archive of Graduate Theses}\\
+%{\tt http://theses.cz/}, contact: {\tt theses@fi.muni.cz}.
+%
+%\bibitem{AWFC}
+%\cemph{Sven Meyer Zu Eissen and Benno Stein: Intrinsic Plagiarism Detection}\\
+%{\tt Proceedings of the European Conference on Information Retrieval (ECIR-06)}, {\tt 2006}
+%
+%\end{thebibliography}
 \smallskip
 \hrule height .1em
@@ -229,14 +276,20 @@ Some conclusion
 %
 \sffamily
-QR code?
+\hbox to \hsize{
+  {\hsize=0.5\hsize\vbox{
 \cemph{Contact information:}\\
-  Šimon Suchomel {\tt suchomel@fi.muni.cz},\\
-  Jan Kasprzak, {\tt kas@fi.muni.cz}.
-
+  Šimon Suchomel {\tt suchomel@fi.muni.cz}\\
+  Jan Kasprzak {\tt kas@fi.muni.cz}\\
+  {\cemph{\tt http://www.fi.muni.cz/\~{}kas/pan13/}}
+}
+  \hfill
+  {\hsize=0.4\hsize\vbox{
+  \includegraphics[width=\hsize]{qrcode.png}
+}}}}
+
 \end{multicols}
 \end{document}
-