X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;ds=sidebyside;f=pan13-paper%2Fsimon-source_retrieval.tex;h=d5b338b948a6cc4a13fac8319d8bb51328861b7e;hb=eafe3e22e26382588563ac39d6f88dd022c740da;hp=2cb1a8f9c4f32945ef00a87e94c8fcc2fd9730c3;hpb=c89f0f7c72770832556ef260515fa625cae4190a;p=pan13-paper.git

diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex
index 2cb1a8f..d5b338b 100755
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -50,7 +50,8 @@ of those parts is done.
 \subsection{Querying}
 Querying means to effectively utilize the search engine in order to retrieve as many relevant documents as possible with the minimum amount of queries. We consider the resulting document relevant
-if it shares some of text characteristics with the suspicious document.
+if it shares some text characteristics with the suspicious document. In real-world use, such queries
+represent an appreciable cost; therefore, their minimization should be one of the top priorities.
 We used 3 different types of queries\footnote{We used similar three-way based methodology in PAN 2012 Candidate Document Retrieval subtask. However, this time we completely replaced the headers based queries
@@ -143,8 +144,31 @@ discovered search engine results were evaluated, but there were executed no more
 \subsection{Result Selection}
+The second main decision in the source retrieval task is which of the search engine results to download.
+This process is represented in figure~\ref{fig:source_retr_process} as 'Selecting'.
+Nowadays, downloading is a very cheap and quick operation in the real world. There can be some disk space considerations
+if there is a need to store the original downloaded documents. The main cost lies in document post-processing.
+On the Internet in particular, there is a wide range of file formats, which must be
+converted into plain text for text alignment.
This can be time-consuming and computationally expensive. For example, plain text is hard to extract from many
+PDF documents, so one needs to use optical character recognition methods.
+
+ChatNoir offers snippets for discovered documents. Snippet generation is considered a costless
+operation. The purpose of a snippet is to give a quick glance at a small extract of the resulting page.
+The extract is at most 500 characters long and is a portion of the document around the given keywords.
+On the basis of the snippet, we needed to decide whether to actually download the result or not.
+
+Since the snippet is relatively small and can be a discontinuous part of the text, the
+text alignment methods described in section~\ref{text_alignment} were insufficient for
+
+
 \subsection{Snippet Control}
+\begin{figure}
+  \centering
+  \includegraphics[width=1.00\textwidth]{img/snippets_graph.pdf}
+  \caption{Downloads and similarities performance.}
+  \label{fig:snippet_graph}
+\end{figure}
 \subsection{Source Retrieval Results}