+\subsubsection{Keyword-Based Queries.}\r
+The keyword-based queries are composed of keywords automatically extracted from the whole suspicious document.\r
+Their purpose is to retrieve documents concerning the same theme.\r
+%Two documents discussing the same theme usually share a set of overlapping keywords. Also the combination of keywords in query matters. \r
+For automated keyword extraction, we used the frequency-based approach described in~\cite{suchomel_kas_12}.\r
+The method combines term frequency analysis with the TF-IDF score~\cite{Introduction_to_information_retrieval}. As the reference\r
+corpus we used an English web corpus~\cite{ententen} crawled by SpiderLink~\cite{SpiderLink} in 2012, which contains 4.65 billion tokens.\r
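+For orientation, one common formulation of the TF-IDF score from~\cite{Introduction_to_information_retrieval} is sketched here. The symbols $\mathrm{tf}_{t,d}$, $\mathrm{df}_t$ and $N$ are notation introduced only for this illustration, and the exact weighting variant used in~\cite{suchomel_kas_12} may differ:\r
+\[\r
+\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t},\r
+\]\r
+where $\mathrm{tf}_{t,d}$ is the number of occurrences of term $t$ in the suspicious document $d$, $\mathrm{df}_t$ is the number of reference corpus documents containing $t$, and $N$ is the total number of documents in the reference corpus.\r
+A term that is frequent in the suspicious document but rare in the reference corpus therefore receives a high score.\r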
+\r
+Each keyword-based query was constructed from five consecutive keywords taken from the top of the ranking, and each keyword was\r
+used in only one query. Overly long keyword-based queries would be overspecific and would result\r
+in low recall. On the other hand, overly short queries (one or two tokens) would result\r
+in low precision and possibly also low recall, since they would be too general.\r
+In order to direct the search more towards the highest ranked keywords, we also extracted their\r
+most frequent two-term and three-term collocations. These were also combined into queries of five words.\r
+As a result, the four top ranked keywords can each appear in two different queries: one built from the keywords\r
+alone and one built from the collocations.\r
+%Collocation describes its keyword better than the keyword alone. \r
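+Assuming the ranked keywords are denoted $k_1, k_2, \dots, k_n$ (notation introduced here only for illustration), the grouping of keywords into queries described above can be sketched as\r
+\[\r
+q_j = (k_{5j-4}, k_{5j-3}, k_{5j-2}, k_{5j-1}, k_{5j}), \qquad j = 1, 2, \dots, \lfloor n/5 \rfloor,\r
+\]\r
+so that $q_1$ consists of the five highest ranked keywords and no keyword is shared between two queries $q_i$ and $q_j$ for $i \neq j$.\r
+How a possible remainder of fewer than five keywords is handled is not covered by this sketch.\r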
+\r
+The keyword-based queries are non-positional, since they represent the whole document. They are also non-phrasal, since\r
+they are constructed from tokens gathered from different parts of the text. Finally, they are deterministic: for a given input\r
+document the extractor always returns the same keywords.\r
+\r
+\subsubsection{Intrinsic Plagiarism-Based Queries.}\r
+The purpose of the second type of queries is to retrieve pages which contain text detected\r
+as different in writing style from other parts of the suspicious document.\r
+Such a change may indicate a plagiarized passage incorporated into the text.\r
+We implemented the vocabulary richness method, which computes the average word frequency class value for\r
+a given part of the text. The method is described in~\cite{awfc}.\r
+%The problem is that generally methods based on the vocabulary statistics work better for longer texts.\r
+%According to authors this method scales well for shorter texts than other text style detection methods. \r
+In our case the usability of this method is limited by the relatively short texts.\r
+It is also difficult to determine\r
+which parts of the text to compare. Therefore we used the sliding window concept for text chunking with the\r
+same settings as described in~\cite{suchomel_kas_12}.\r
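+For reference, the average word frequency class is commonly defined as follows; the notation is introduced here only as a sketch, and~\cite{awfc} remains the authoritative source. Let $f(w)$ be the frequency of word $w$ in a reference corpus and let $w^*$ be the most frequent word of that corpus. Then\r
+\[\r
+c(w) = \left\lfloor \log_2 \frac{f(w^*)}{f(w)} \right\rfloor, \qquad \mathrm{AWFC}(T) = \frac{1}{|T|} \sum_{w \in T} c(w),\r
+\]\r
+where $T$ is the sequence of words in the examined text part, for instance one position of the sliding window.\r
+Frequent words have a low class, so a text part whose average differs markedly from that of the rest of the document uses a noticeably richer or poorer vocabulary.\r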
+\r
+A representative sentence longer than 6 words was randomly selected from the eligible sentences in the suspicious part of the document.\r
+The query was created from the representative sentence by leaving out stop words.\r
+The intrinsic plagiarism-based queries are positional: they carry the position of the representative sentence in the document.\r
+They are phrasal, since they represent a search for a specific sentence. Finally, they are\r
+nondeterministic, because the representative sentence is selected randomly.\r
+ \r
+\subsubsection{Paragraph-Based Queries.}\r
+The purpose of paragraph-based queries is to check some parts of the text in more depth,\r
+namely those parts for which no similarity was found during previous searches.\r
+For this purpose we considered a paragraph to be the minimum text chunk in which plagiarism can occur.\r
+%It is discussible whether a plagiarist would be persecuted for plagiarizing only one sentence in a paragraph.\r
+%A detection of a specific sentence is very difficult if we want to avoid exhaustive search approach.\r
+%If someone is to reuse some peace of continuous text, it would probably be no shorter than a paragraph. \r
+Although paragraphs differ in length, we represent each paragraph by only one query.\r
+\r
+%The paragraph based query was created from each paragraph of suspicious document.\r
+From each paragraph we extracted the longest sentence, from which the query was constructed.\r
+Ideally, the extracted sentence should carry the highest information gain.\r
+The query was constructed from the selected sentence by omitting stop words\r
+and was at most 10 words long, which is the upper bound imposed by ChatNoir.\r