X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;ds=sidebyside;f=pan13-paper%2Fsimon-source_retrieval.tex;h=d5b338b948a6cc4a13fac8319d8bb51328861b7e;hb=eafe3e22e26382588563ac39d6f88dd022c740da;hp=2cb1a8f9c4f32945ef00a87e94c8fcc2fd9730c3;hpb=c89f0f7c72770832556ef260515fa625cae4190a;p=pan13-paper.git

diff --git a/pan13-paper/simon-source_retrieval.tex b/pan13-paper/simon-source_retrieval.tex
index 2cb1a8f..d5b338b 100755
--- a/pan13-paper/simon-source_retrieval.tex
+++ b/pan13-paper/simon-source_retrieval.tex
@@ -50,7 +50,8 @@ of those parts is done.
 \subsection{Querying}
 Querying means to effectively utilize the search engine in order to retrieve as many relevant
 documents as possible with the minimum amount of queries. We consider the resulting document relevant 
-if it shares some of text characteristics with the suspicious document.  
+if it shares some text characteristics with the suspicious document. In real-world settings, queries
+represent an appreciable cost; therefore, their minimization should be one of the top priorities.
 
 We used 3 different types of queries\footnote{We used similar three-way based methodology in PAN 2012 
 Candidate Document Retrieval subtask. However, this time we completely replaced the headers based queries
@@ -143,8 +144,31 @@ discovered search engine results were evaluated, but there were executed no more
 
 
 \subsection{Result Selection}
+The second main decision in the source retrieval task is which of the search engine results to download.
+This process is represented in figure~\ref{fig:source_retr_process} as `Selecting'.
+In real-world settings, downloading is nowadays a very cheap and quick operation. Disk space may become a concern
+if the original downloaded documents need to be stored. The main cost lies in document post-processing.
+The Internet in particular offers a wide range of file formats, which must be
+converted into plain text for text alignment. This can be time-consuming and computationally expensive. For example,
+plain text is hard to extract from many PDF documents, so one needs to use optical character recognition methods.
+
+ChatNoir offers snippets for discovered documents. Snippet generation is considered a costless
+operation. The purpose of a snippet is to give a quick glance at a small extract of the resulting page.
+The extract is at most 500 characters long and is a portion of the document around the given keywords.
+On the basis of the snippet, we needed to decide whether to actually download the result or not.
+
+Since the snippet is relatively small and may be a discontinuous part of the text, the 
+text alignment methods described in section~\ref{text_alignment} were insufficient for 
+
+
 
 \subsection{Snippet Control}
+\begin{figure}
+  \centering
+  \includegraphics[width=1.00\textwidth]{img/snippets_graph.pdf}
+  \caption{Downloads and similarities performance.}
+  \label{fig:snippet_graph}
+\end{figure}
 \subsection{Source Retrieval Results}