X-Git-Url: https://www.fi.muni.cz/~kas/git//home/kas/public_html/git/?a=blobdiff_plain;f=simon-searchengine.tex;h=22f0bffae4892793366573d149950358c1c1a87d;hb=37df9e83bf6f4596a950ea41859af30538c89d46;hp=6098bb61c353156dc2397d927e396e4d5dd8ac38;hpb=43cb4f147600ce6eff97e0e30f4cfdf2b9b519af;p=pan12-paper.git diff --git a/simon-searchengine.tex b/simon-searchengine.tex old mode 100644 new mode 100755 index 6098bb6..22f0bff --- a/simon-searchengine.tex +++ b/simon-searchengine.tex @@ -1,4 +1,227 @@ -\section{Search-Engine Queries} +\section{Candidate document retrieval}~\label{simon} +The basic concept of candidate document retrieval is to use a web search +engine to find suitable documents. In PAN 12 competition we used ChatNoir search +engine~\cite{chatnoir} which indexes the entire English part of the ClueWeb09 corpus. +We can now reduce the problem to constructing appropriate queries for ChatNoir search engine. +The goal is to retrieve similar documents, which contains the same passages as the source document +or similar documents dealing with the same theme. +The strongest emphasis has been places on minimizing the number of executed queries +since they may be charged or time limited or number limited in the real world. -% zkouska Gitu +In the PAN 12 test corpus, there were 32 text documents all contained at least one plagiarism case. +The documents were approximately 30 KB of size, the smallest were 18 KB and the largest were 44 KB. +We have used methods which are applicable in general to any type of text input with no apriori information about the input document. +%All queries from a given document were firstly preconstructed and they were afterwards sequentially executed +%according to their priority. After each executed query the intermediate results of plagiarism +%were calculated using the multi feature document comparison techniques described in section~\ref{yenya}. +%This process leads to continuing lowering the number of executed queries, for more details see the following passages. + + +We can divide constructed queries into three main groups according to their origin: + i) document keywords based, ii) intrinsic plagiarism based, and iii) headers based. +%Please note two major attributes of the extracted keywords. Firstly we say + +\subsection{Keywords based queries} +The document keywords based queries are constructed from automatically extracted keywords from the given document. +As for keyword extraction we used methodology based on word repetition similar to method described by Scott~\cite{text_patterns}. +The basic principle claims that the more frequent a single word in a given document is, the more likely it is to be key in it. +However the most frequent words usually do not carry any meaning. +For example articles (the), conjunctions (as) or prepositions (at). +These types of words are usually refered as the stop words. +Thus we firstly filter out stop words according to general English stop words list. +Secondly we used reference corpus word list as an additional filter. +We used words in their simple verbatim occurrence allied to a +statistical estimate of likelihood. As a result of this, a word to become a key +must not exists in the stop list and must occur in the suspicious document relatively much more frequently +than in the reference corpus. We used a corpus created from common English web sources, which contains +more then 4 billion tokens~\cite{ententen}. +The statistical measure TF-IDF~\cite{Introduction_to_information_retrieval} +was applied in order to decide whether the processed word shall become a keyword. +The TF-IDF combines term frequency (TF) and inverse document frequency (IDF) to produce composite weight of a given term. +Taking into account the average length of the suspicious document and minimizing the number of used queries we set TF-IDF weight threshold to 0.02, +which resulted in about 10 to 20 terms identified as keywords per document. + +Before executing web search tasks we must combine extracted keywords into befitting queries. +All queries were constructed from 5 keywords according to their TF-IDF weight consecutively. +In other words 5 top ranked keywords combine into the first query the other 5 top ranked keywords combine into the second query. +By using this method we usually obtained from 2 to 3 initial queries per document. + +\subsubsection{Collocations} +In addition to this we enhanced subsequent queries by the most frequent collocations of top 4 keywords. +For those 4 keywords we extracted the most frequent two-word and three-word collocation occurring within the single suspicious document. +A scope for collocates was a single sentence. We counted collocates as the most frequent neighbouring words within the scope also omitting stop words. +To take advantage of extracted collocations we combined them among selfs also into 5 word queries. +After omitting queries containing duplicated words we added, on average, two new queries to each suspicious document. +Despite of being composed from collocations we count those queries also as the keywords based. Together with +the queries created only from bare keywords, those queries were unconditionally executed as the initial queries. \\ + +Queries containing around 5 words should be optimal in order to retrieve highest mean average precision of results. +It means we would like to extract from results of those queries as many documents as possible but still dealing +with the similar theme as the source document. Expecting the order of words within a single query has small +impact on results, no other heuristics of combining keywords into queries were applied also +in order to keep the query amount minimal. In 32 from 33 input test documents were more than a hundred resulting documents found using +those initial queries, which gave a decent document base covering the source document theme for document comparison. + + + +%using less kw in q would result into more general or underspecifi + +All those previously mentioned queries characterize the document as a whole. +They should cover the theme of the document expecting the single document is actually dealing with a one theme. + +\subsection{Intrinsic plagiarism based queries} +Intrinsic plagiarism queries were constructed from a representative sentence residing in +the selected part of the document. Representative sentence is any sentence from the selected part of the document +which is greater then eight words in length. +Such a sentence should produce the least searching false-positives~\cite{knight}. + +The main task concerning representative sentences is to locate the suitable text part. +This part should be located by a method leading to discover potential plagiarism inside the document of its own, +which is called intrinsic plagiarism detection. Such a method is based on for example text statistics or syntactic features. +We chose statistical vocabulary richness method called average word frequency class described in~\cite{awfc}. +Using this method we can compute average word frequency class values (AWFC) for arbitrary parts of the text. The change +of this value between two adjacent text windows indicates change of writing style, which can be caused by a plagiarism. +We computed AWFC value for every text window of size 45 words, which was shifted through the text by 40 words span. +The values of window size and span of shifting was set empirically during training. Windows with +largest change of AWFC compared to AWFC of their neighbouring previous window were selected as candidate for query extraction. + + +The query from the representative sentence was composed by 6 words from the sentence beginning, also omitting stop words. +We chose 6 words to cover higher portion of selected sentence and not to exceed +the phrasal ChatNoir query limit. +This type of query is actually looking for documents containing the very same or at least the very similar sentence, +therefore we choose more than 5 word queries in order to narrow the search to more specific +results. Unfortunately in the time of PAN 12 competition the ChatNoir search engine did not +support phrasal search. Since querying a single sentence is phrasal search, one might have +expected even better results from those queries. + %Due to the fact, that ChatNoir does not currently support phrasal search, + +The intrinsic plagiarism based queries are document positionable, meaning that we can store +their position within the document. It is due to the fact, that +those queries are created from more or less subsequent words. As a result, they are also executed conditionally, for more +information see section~\ref{process}. +The position within the document were set to the +start position of the representative sentence. +They characterize only a specific portion of the suspicious document on the contrary to keyword based queries. + +\subsection{Headers based queries} +The last type of queries were constructed from local headers of the document. +A header usually characterize briefly and accurately the following section, thus it +can be used as a search query. In real scenarios the headers should +be located using some document metadata, in that way we can distinguish for example more header levels. +In the case of PAN 12 competition we have detected headers according to several +basic rules. Those basic rules applied on most headers from given test corpus text files. +As a header is considered any line which has an empty line above and bellow the actual line +and is from 2 to 6 words in length also omitting stop words. + +Note that length of header based queries are variable, thus they can be both specific +and general according to their length. They are also document positionable. +We calculated their position as start position of the following text. + +\subsection{Combining and executing queries}~\label{process} + + +After each executed query the intermediate results of plagiarism +were calculated using the multi feature document comparison techniques described in section~\ref{yenya}. +This process leads to continuing lowering the number of executed queries, for more details see the following passages. + +All queries from a given document were firstly preconstructed and they were afterwards sequentially executed +according to their priority. +The first all keyword based queries, the second +intrinsic plagiarism based and the last headers based, whilst we could omit some of the positionable queries during the process, +which is described further in this section. +The queries processing is outlined as Algorithm~\ref{alg:queries}, where for simplicity the snippet calculation and a web source download is omitted. +After executing a query the intermediate results were always +calculated as follows: For every query result in result page based +on a snippet to input suspicious document similarities, we decided whether to actually +download the resulting URL or not. Since the snippet contains portions of Web document to each +given keyword, we calculated pre-similarity as apportionment of every word match between the snipped and the suspicious document. +If there were at least 80\% match, we downloaded the web source for thorough investigation. +For each downloaded web document we calculated similarities with the input suspicious document +in the same way as described in section~\ref{yenya}. All similarities has been stored in form of intervals from within +the input suspicious document. In other words for every similar part between the downloaded +document and the suspicious document we stored the beginning position and the ending position of that part in +the suspicious document. + +As a result of that, we can during processing further queries, which haven't been executed yet, omit those +of which document position intersects with any of already found similarities intervals. This procedure +leads us to lowering the total number of executed queries. +All downloaded documents, which have at least one similarity interval, are declared +as possible source of the plagiarism. + + +\renewcommand{\algorithmicrequire}{\textbf{Input:}} +\renewcommand{\algorithmicensure}{\textbf{Output:}} + + +\algsetup{indent=2em} +\begin{algorithm}[h!] +\caption{Queries execution}\label{alg:queries} +\begin{algorithmic}[1] +\REQUIRE suspicious document $d$, queries $Q_{KW}, Q_{Intrinsic}, Q_{Headers}$ +%keyword, intrinsic and headers based queries $Q_{KW}, Q_{Intrinsic}, Q_{Headers}$, + +\ENSURE plagiarism source candidate web documents $W_{plag}$ +\medskip + +\STATE $I \leftarrow \emptyset; I \in \mathbb{N}^2$ +\STATE $W \leftarrow \emptyset$ + +\FORALL{$q \in (Q_{KW} \cup Q_{Intrinsic} \cup Q_{Headers}) $} + \IF{$position(q)$ do not intersects with any interval $ \vec{i} \in I$} + \STATE $W \leftarrow execute(q)$ + \FORALL{$w \in W$} + \STATE $I_{sub} \leftarrow get\_similarities(w, d)$ %\COMMENT{output is set of intervals similarities $[d_{start},d_{end}]$} + \STATE $I \leftarrow I \cup I_{sub}$ + \IF{$I_{sub} \neq \emptyset$} + \STATE $W_{plag} \leftarrow W_{plag} \cup \{w\}$ + \ENDIF + \ENDFOR + \ENDIF +\ENDFOR +\RETURN $W_{plag}$ + +\end{algorithmic} +\end{algorithm} + +\subsection{Queries comparison} +During the test phase there were extracted 133 keywords based queries, 165 intrinsic plagiarism +based queries and 331 headers based queries in total. Table~\ref{querycount} shows the arithmetic mean +of the query count per document. We can see that nearly half of the prepared queries were +omitted due to the fact, that there had been found a similarity including their document position. + +\begin{center} +\begin{table}[h] +\begin{center} +\caption{\footnotesize The arithmetic mean of the query count and the omitted query count per document} +\vspace{0.3 cm} +\begin{tabular}{|c|c|c|} +\hline +\label{querycount} +{\bf Query type} & {\bf Extracted}& {\bf Omitted} \\ \hline \hline +KW & 4.16 & N/A \\ \hline +Intrinsic & 5.16 & 2.35 \\ \hline +Headers & 10.34 & 4.75 \\ \hline +\end{tabular}\\ +\end{center} +\end{table} +\end{center} + + +\begin{center} +\begin{table}[h] +\begin{center} +\caption{\footnotesize The total portion of similarities found, taking into account number of similarities regardless of interval sizes.} +\vspace{0.3 cm} +\begin{tabular}{|c|c|} +\hline +\label{querySuccess} +{\bf Query type} & {\bf Similarities portion } \\ \hline \hline +KW & 72.5 \% \\ \hline +Intrinsic & 24.3 \% \\ \hline +Headers & 10.34 \% \\ \hline +\end{tabular}\\ +\end{center} +\end{table} +\end{center}