+\subsubsection{Keyword-Based Queries.}\r
+The keyword-based queries are composed of keywords automatically extracted from the whole suspicious document.\r
+Their purpose is to retrieve documents concerning the same theme, since two documents discussing the\r
+same theme usually share a set of overlapping keywords. The combination of keywords in a\r
+query also matters.\r
+For automated keyword extraction, we used the frequency-based approach described in~\cite{suchomel_kas_12}.\r
+The method combines term frequency analysis with the TF-IDF score~\cite{Introduction_to_information_retrieval}. As a reference\r
+corpus we used an English web corpus~\cite{ententen} crawled by SpiderLink~\cite{SpiderLink} in 2012, which contains 4.65 billion tokens.\r
+\r
+Each keyword-based query was constructed from five consecutive keywords in the ranked list, starting from the top. Each keyword was\r
+used in only one query. Overly long keyword-based queries would be over-specific and would result\r
+in low recall. On the other hand, overly short queries (one or two tokens) would result\r
+in low precision and possibly also low recall, since they would be too general.\r
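
The grouping of ranked keywords into queries can be sketched as follows. This is a minimal illustration and not the extractor from~\cite{suchomel_kas_12}: the function name \texttt{build\_keyword\_queries} is hypothetical, the input is assumed to be already sorted by rank, and keeping a shorter final query when fewer than five keywords remain is an assumption, since the text does not specify that case.

```python
from typing import List


def build_keyword_queries(ranked_keywords: List[str], size: int = 5) -> List[List[str]]:
    """Split a ranked keyword list into consecutive, non-overlapping queries.

    Each keyword ends up in exactly one query. A trailing group with fewer
    than `size` keywords is kept as a shorter query (an assumption).
    """
    queries = []
    for start in range(0, len(ranked_keywords), size):
        queries.append(ranked_keywords[start:start + size])
    return queries


if __name__ == "__main__":
    # Hypothetical ranked keywords, best first.
    ranked = ["plagiarism", "detection", "query", "corpus", "retrieval",
              "keyword", "sentence", "paragraph", "document", "search"]
    for query in build_keyword_queries(ranked):
        print(" ".join(query))
```

Non-overlapping slices keep the queries disjoint, which matches the requirement that each keyword is used in only one query.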
+\r
+In order to direct the search more at the highest ranked keywords, we also extracted their\r
+most frequent two- and three-term collocations. These were also combined into queries of five words.\r
+As a result, each of the four top-ranked keywords can appear in two different queries: one built from the keywords\r
+alone and one built from the collocations. A collocation describes its keyword better than the keyword alone.\r
+\r
+The keyword-based queries are non-positional, since they represent the whole document. They are also non-phrasal, since\r
+they are constructed of tokens gathered from different parts of the text. Finally, they are deterministic: for a given input\r
+document the extractor always returns the same keywords.\r
+\r
+\subsubsection{Intrinsic-Plagiarism-Based Queries.}\r
+The second type of query aims to retrieve pages containing text similar to passages that were detected\r
+as differing in writing style from the other parts of the suspicious document.\r
+Such a change may point to a plagiarized passage which is intrinsically bound up with the text.\r
+We implemented a vocabulary richness method which computes the average word frequency class value for\r
+a given part of the text. The method is described in~\cite{awfc}. In general, methods\r
+based on vocabulary statistics work better for longer texts. According to its authors, this method\r
+scales better to shorter texts than other style detection methods.\r
+Still, its use is in our case limited by the relatively short texts. It is also difficult to determine\r
+which parts of the text to compare. Therefore, we used a sliding window for text chunking, with the\r
+same settings as described in~\cite{suchomel_kas_12}.\r
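
A rough sketch of scoring sliding windows by average word frequency class is given below. It assumes the common definition of the frequency class of a word $w$ as $\lfloor \log_2 (f(w^*)/f(w)) \rfloor$, where $w^*$ is the most frequent word in the reference corpus. The frequency table, the window size, the step, and the treatment of unseen words (frequency 1) are all illustrative assumptions; the actual settings are those of~\cite{suchomel_kas_12} and are not reproduced here.

```python
import math
from typing import Dict, Iterator, List, Tuple


def word_frequency_class(word: str, freqs: Dict[str, int], max_freq: int) -> int:
    """Frequency class: floor(log2(f(most frequent word) / f(word))).

    Words missing from the reference table get frequency 1 (an assumption).
    """
    return math.floor(math.log2(max_freq / freqs.get(word, 1)))


def average_wfc(tokens: List[str], freqs: Dict[str, int]) -> float:
    """Average word frequency class over a non-empty list of tokens."""
    if not tokens:
        raise ValueError("cannot score an empty window")
    max_freq = max(freqs.values())
    total = sum(word_frequency_class(t, freqs, max_freq) for t in tokens)
    return total / len(tokens)


def sliding_windows(tokens: List[str], size: int, step: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start offset, window) pairs; a short text yields one window."""
    if not tokens:
        return
    for start in range(0, max(len(tokens) - size, 0) + 1, step):
        yield start, tokens[start:start + size]


def window_scores(tokens: List[str], freqs: Dict[str, int],
                  size: int, step: int) -> List[Tuple[int, float]]:
    """Score every window; the start offset keeps the result positional."""
    return [(start, average_wfc(window, freqs))
            for start, window in sliding_windows(tokens, size, step)]


if __name__ == "__main__":
    # Toy reference frequencies (hypothetical, not taken from the web corpus).
    toy_freqs = {"the": 1024, "cat": 64, "sat": 16}
    print(window_scores(["the", "cat", "sat", "zyx"], toy_freqs, size=2, step=1))
```

Windows whose score deviates from the rest of the document are the candidates for a style change; how large a deviation counts as suspicious is not specified in this section and is left out of the sketch.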
+\r
+A representative sentence longer than 6 words was randomly selected from among the qualifying sentences in the suspicious part of the document.\r
+An intrinsic-plagiarism-based query was then created from the representative sentence by leaving out stop words.\r
+\r
+The intrinsic-plagiarism-based queries are positional: they carry the position of the representative sentence in the document.\r
+They are phrasal, since they represent a search for a specific sentence. Finally, they are\r
+nondeterministic, because the representative sentence is selected randomly.\r
+ \r
+\subsubsection{Paragraph-Based Queries.}\r
+The purpose of paragraph-based queries is to check in more depth those parts of the text\r
+for which no similarity has been found during previous searches.\r
+\r
+For this purpose we considered a paragraph to be the minimum text chunk in which plagiarism can occur.\r
+It is debatable whether a plagiarist would be prosecuted for plagiarizing only one sentence in a paragraph.\r
+Moreover, detecting a specific sentence is very difficult if we want to avoid an exhaustive search.\r
+If someone is to reuse a piece of continuous text, it would probably be no shorter than a paragraph.\r
+Although paragraphs differ in length, we represent each paragraph by one query.\r
+\r
+\r
+A paragraph-based query was created from each paragraph of the suspicious document.\r
+From each paragraph we extracted the longest sentence, from which the query was constructed.\r
+Ideally, the extracted sentence should carry the highest information gain.\r
+The query was constructed from the selected sentence by omitting stop words and was at most 10 words long,\r
+which is the maximum query length accepted by ChatNoir.\r
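
The construction of a paragraph-based query can be sketched as follows. The sketch makes several assumptions that the text leaves open: sentences are split with a simple punctuation rule, sentence length is measured in words, the stop word list is a small illustrative sample, and an over-long query is cut down by keeping its first 10 content words. The names \texttt{paragraph\_query} and \texttt{STOP\_WORDS} are hypothetical.

```python
import re
from typing import List

# Small illustrative stop word list (an assumption, not the list used in the system).
STOP_WORDS = frozenset({
    "the", "a", "an", "of", "in", "on", "at", "and", "or", "to", "is", "are",
    "was", "were", "over", "near", "it", "for", "with", "that", "this", "by",
})

MAX_QUERY_WORDS = 10  # query length limit stated in the text


def split_sentences(paragraph: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def paragraph_query(paragraph: str, max_words: int = MAX_QUERY_WORDS) -> List[str]:
    """Build one query from the longest sentence of a paragraph.

    Stop words are omitted and the result is truncated to `max_words`
    by keeping the leading words (truncation strategy is an assumption).
    """
    sentences = split_sentences(paragraph)
    if not sentences:
        return []
    longest = max(sentences, key=lambda s: len(s.split()))
    words = re.findall(r"[A-Za-z0-9']+", longest)
    content = [w for w in words if w.lower() not in STOP_WORDS]
    return content[:max_words]


if __name__ == "__main__":
    text = ("Short one. The quick brown fox jumps over the lazy dog "
            "near the old river bank at dawn today. Tiny.")
    print(" ".join(paragraph_query(text)))
```

Because the sentence choice depends only on the paragraph, this construction is deterministic, in contrast to the random choice of the representative sentence in the intrinsic-plagiarism-based queries.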