\section{Source Retrieval}\label{source_retr}
Source retrieval is a subtask of the plagiarism detection process during
which only a relatively small subset of documents is retrieved from a
large corpus. Those candidate documents are usually further compared in detail with the
suspicious document. In the PAN 2013 source retrieval subtask the main goal was to
identify web pages which had been used as sources of plagiarism during the test corpus creation.

The test corpus contained 58 documents, each discussing only one topic.
Those documents were created intentionally by
semiprofessional writers, thus they feature nearly realistic plagiarism cases~\cite{plagCorpus}.
Resources were looked up in the ClueWeb\footnote{\url{http://lemurproject.org/clueweb09.php/}} corpus.
Such conditions are similar to a realistic plagiarism detection scenario.%, such as for state of the art commercial plagiarism detection systems or the anti-plagiarism service developed on and utilized at the Masaryk University.
The main difference between a real-world corpus
of suspicious documents, such as a corpus created from theses stored in the Information System of Masaryk University,
and the corpus of suspicious documents used during the PAN 2013 competition is that in the PAN
corpus each document contains plagiarized passages. Therefore we can expect the existence of
a plagiarized passage and deepen the search
in those parts of the document where no similar passage has yet been found.
% This is the main idea of improving recall of detected plagiarism in a suspicious document.

Online plagiarism detection can be viewed as a reverse engineering task in which
we need to find the original documents from which the plagiarized document was created.
During its creation, the plagiarist locates original documents with the use of a search engine.
The plagiarist decides which queries to ask the search engine and which of the results from the result page to use.

%In the real-world scenario the corpus is the whole Web and the search engine can be a contemporary commercial search engine which scales to the size of the Web.

The same methodology, utilizing a search engine, is used for source retrieval.
This approach is based on the fact that we do not
possess enough resources to download and effectively process the whole corpus.
In the case of the PAN 2013 competition we utilized the ChatNoir~\cite{chatnoir}
search engine, which indexes the English subset of the ClueWeb.

\begin{figure}
  \centering
  \includegraphics[width=0.8\textwidth]{img/source_retrieval_process.pdf}
  \caption{Source retrieval process.}
  \label{fig:source_retr_process}
\end{figure}

The reverse engineering decision process consists of creating suitable queries on the basis of the suspicious document
and of deciding what to actually download and what to report as a plagiarism case from the search results.

These first two stages are shown in Figure~\ref{fig:source_retr_process} as Querying and Selecting. Selected results
from the search engine are then textually aligned with the suspicious document (see Section~\ref{text_alignment} for more details).
%This is the last decision phase -- what to report.
If any continuous passage of reused text is detected, the result document is reported,
the corresponding passages in the suspicious document are marked as {\it discovered}, and no further processing
of those parts is done.

\subsection{Querying}
Querying means to effectively utilize the search engine in order to retrieve as many relevant
documents as possible with the minimum number of queries. We consider a resulting document relevant
if it shares some text characteristics with the suspicious document. In the real world, queries
represent an appreciable cost, therefore their minimization should be one of the top priorities.

We used 3 different types of queries\footnote{We used a similar three-way methodology in the PAN 2012
Candidate Document Retrieval subtask. However, this time we completely replaced the header based queries
with paragraph based queries, since the header based queries did not pay off in the overall process.}:
i) keywords based queries, ii) intrinsic plagiarism
based queries, and iii) paragraph based queries. Three main properties distinguish the query types: i) positional, ii) phrasal, and iii) deterministic.
Positional queries carry extra information about the textual interval in the suspicious document which the query represents.
A phrasal query aims at the retrieval of documents containing the same small piece of text; such queries are usually created from closely coupled words.
Deterministic queries for a specific suspicious document are always the same, no matter how many times we run the software.
%On the contrary the software can create in two runs potentially different nondeterministic queries.

\subsubsection{Keywords Based Queries.}
The keywords based queries are composed of keywords automatically extracted from the whole suspicious document.
Their purpose is to retrieve documents concerning the same theme.
%Two documents discussing the same theme usually share a set of overlapping keywords. Also the combination of keywords in query matters.
As a method for automated keyword extraction, we used a frequency based approach described in~\cite{suchomel_kas_12}.
The method combines term frequency analysis with the TF-IDF score.
As a reference corpus we used an English web corpus~\cite{ententen} crawled by SpiderLink~\cite{SpiderLink} in 2012, which contains 4.65 billion tokens.

The keywords based queries were constructed consecutively, each from five top ranked keywords.
Each keyword was used in one query only.
%Too long keywords based queries would be overspecific and it would have resulted in a low recall.
%On the other hand having constructed too short queries (one or two tokens) would have resulted in a low precision and also possibly low recall since they would be too general.
In order to direct the search more at the highest ranked keywords, we also extracted their
most frequent two and three term collocations.
These were also combined into queries of 5 words.
As a result, the 4 top ranked keywords can appear in two different queries: once in a query composed of the keywords
alone and once in a query composed of their collocations.
%Collocation describes its keyword better than the keyword alone.
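
The assembly of the keywords based queries can be illustrated by the following minimal sketch in Python.
The ranked keyword list and the collocation list are assumed to be the output of the extractor described above;
the helper \texttt{keyword\_queries} and its interface are hypothetical and serve only to illustrate the grouping
into five-word queries, not our exact implementation.
\begin{verbatim}
def keyword_queries(keywords, collocations, per_query=5):
    # 'keywords' are ranked by the TF-IDF based extractor,
    # 'collocations' are the most frequent 2- and 3-term
    # collocations of the highest ranked keywords (assumed inputs).
    queries = []
    # Plain keyword queries: five consecutive keywords each,
    # every keyword used in one query only.
    for i in range(0, len(keywords), per_query):
        queries.append(" ".join(keywords[i:i + per_query]))
    # Collocation queries: the collocation tokens are again
    # combined into queries of five words.
    tokens = [t for coll in collocations for t in coll.split()]
    for i in range(0, len(tokens), per_query):
        queries.append(" ".join(tokens[i:i + per_query]))
    return queries
\end{verbatim}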

The keywords based queries are non-positional, since they represent the whole document. They are also non-phrasal, since
they are constructed from tokens gathered from different parts of the text. And they are deterministic: for a certain input
document the extractor always returns the same keywords.

\subsubsection{Intrinsic Plagiarism Based Queries.}
The purpose of the second type of queries is to retrieve pages which contain text detected
as different, in terms of writing style, from other parts of the suspicious document.
%Such a change may point out plagiarized passage which is intrinsically bound up with the text.
%We implemented vocabulary richness method which computes average word frequency class value for
%a given text part. The method is described in~\cite{awfc}.
For this purpose we implemented the vocabulary richness method~\cite{awfc} together with
the sliding window concept for text chunking, as described in~\cite{suchomel_kas_12}.
%The problem is that generally methods based on the vocabulary statistics work better for longer texts.
%According to authors this method scales well for shorter texts than other text style detection methods.
%The usage of this method is in our case limited by relatively short texts.
%It is also difficult to determine
%what parts of text to compare. Therefore we used sliding window concept for text chunking with the
%same settings as described in~\cite{suchomel_kas_12}.

A representative sentence longer than 6 words was randomly selected from the suspicious part of the document.
The query was created from the representative sentence by leaving out stop words.
The intrinsic plagiarism based queries are positional: they carry the position of the representative sentence.% in the document.
They are phrasal, since they represent a search for a specific sentence. And they are
nondeterministic, because the representative sentence is selected at random.

\subsubsection{Paragraph Based Queries.}
The purpose of the paragraph based queries is to check some parts of the text in more depth.
Those are the parts for which no similarity has been found during previous searches.
For this case we considered a paragraph to be the minimum text chunk in which plagiarism can occur.
%It is discussible whether a plagiarist would be persecuted for plagiarizing only one sentence in a paragraph.
%A detection of a specific sentence is very difficult if we want to avoid exhaustive search approach.
%If someone is to reuse some peace of continuous text, it would probably be no shorter than a paragraph.
Despite the fact that paragraphs differ in length, we represent each paragraph by only one query.

%The paragraph based query was created from each paragraph of suspicious document.
From each paragraph we extracted the longest sentence, from which the query was constructed.
Ideally, the extracted sentence should carry the highest information gain.
The query was at most 10 words long, which is the upper bound of ChatNoir,
and was constructed from the selected sentence by omitting stop words.

\subsection{Search Control}
For each suspicious document we prepared all three types of queries at once, during the first phase.
Queries were executed stepwise.
After processing each query, the results were evaluated. %(see the following subsection~\ref{resSelection} for more details)
From all textual similarities between each result and the suspicious document, the document intervals of those similarities
were marked as {\it discovered}.
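
The following minimal sketch in Python illustrates the stepwise execution together with the {\it discovered} interval bookkeeping.
The \texttt{search} and \texttt{evaluate} helpers and the \texttt{position} attribute of a query are hypothetical placeholders;
the sketch only captures the control flow described in this and the following paragraphs.
\begin{verbatim}
def overlaps(interval, discovered):
    # True if the query's document interval intersects
    # any already discovered interval.
    start, end = interval
    return any(s < end and start < e for s, e in discovered)

def run_queries(queries, search, evaluate):
    # 'queries' are ordered: keywords based first, then intrinsic
    # plagiarism based, then paragraph based (their creation order).
    discovered = []   # intervals where a similarity was already found
    for query in queries:
        # Positional queries carry the interval they represent;
        # keywords based queries have no position (None).
        if query.position is not None and overlaps(query.position,
                                                   discovered):
            continue  # similarity already found there, drop the query
        for result in search(query):
            # 'evaluate' selects and aligns the result against the
            # suspicious document and returns the aligned intervals.
            discovered.extend(evaluate(result))
    return discovered
\end{verbatim}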

%At first there were processed the keywords based queries.
Firstly, all of the keywords based queries were always executed.
%After having all the keywords based queries processed,
Secondly, the intrinsic plagiarism based queries were executed according to
their creation sequence.
%Since they carry its position, not all of them were carried out.
During the execution, if the position of a query intersected with any of the {\it discovered} intervals, the
query was dropped. Analogically, the paragraph based queries were processed last.

This search control results in two major properties. Firstly, the source retrieval effort was increased
in those parts of the suspicious document where no textual similarity had been found yet.
%It was increased especially by the paragraph based queries.
And secondly, after a similarity had been detected for a certain part of the text,
no more intentional retrieval attempts for that part were made. This means that all already
retrieved search engine results were evaluated, but no more queries regarding that passage were executed.


\subsection{Result Selection}\label{resSelection}
The second main decision in the source retrieval task is which of the search engine results to download.
This process is represented in Figure~\ref{fig:source_retr_process} as Selecting.
%Nowadays in the real-world a download is cheap and quick operation.
%There can be some disk space considerations if there is a need to store original downloaded documents.
% The main cost probably represents document post processing.
%Mainly on the Internet there is a wide range of file formats, which for text alignment must be
%converted into plaintext. This can be time and computational-consuming.
%For example from many pdf documents the plain text is hardly extractable, thus one need to use optical character recognition methods.

ChatNoir offers snippets for the discovered documents. The snippet generation is considered a costless
operation. The purpose of a snippet is to give a quick glance at a small extract of the resulting page.
The extract is at most 500 characters long and it is a portion of the document around the given keywords.
On the basis of the snippet, we needed to decide whether to actually download the result or not.

\subsection{Snippet Control}
\begin{figure}
  \centering
  \includegraphics[width=0.8\textwidth]{img/snippets_graph.pdf}
  \caption{Downloads and similarities performance.}
  \label{fig:snippet_graph}
\end{figure}

Since the snippet is relatively small and it can be a discontinuous part of the text, the
text alignment methods described in Section~\ref{text_alignment} were insufficient
for deciding about the document download. Therefore we chose to compare the existence of snippet word tuples
in the suspicious document.
%For 1-tuples the measure means how many words from the snippet
%also exist in the suspicious document. If the snippet contains many common words they may
%also occur in many documents. In this case the 1-tuple measurement is little decisive.

We used the 2-tuples measurement, which indicates how many neighbouring word pairs coexist in the snippet and in the suspicious document.
According to this value we decided whether to download the source or not.
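
This 2-tuple check can be written down as the following minimal sketch in Python. It assumes the snippet and
the suspicious document are already tokenized into word lists; the function name and the tokenization are illustrative only.
\begin{verbatim}
def bigram_similarity(snippet_tokens, document_tokens):
    # Fraction of the snippet's neighbouring word pairs (2-tuples)
    # that also occur as neighbouring pairs in the suspicious document.
    snippet_pairs = list(zip(snippet_tokens, snippet_tokens[1:]))
    if not snippet_pairs:
        return 0.0
    document_pairs = set(zip(document_tokens, document_tokens[1:]))
    shared = sum(1 for pair in snippet_pairs if pair in document_pairs)
    return shared / len(snippet_pairs)

# A result is downloaded only if the similarity reaches a tuned
# threshold; the deduction of the threshold is described below.
\end{verbatim}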

For the deduction of the threshold value we used 4413 search results from various queries over the documents
in the training corpus. Each resulting document was textually aligned with its corresponding suspicious document.
%One similarity represents continuous passage of text alignment similarity as is described in the following section~\ref{text_alignment}.
In this way we obtained 248 similarities in total after downloading all of the 4431 documents.

The 2-tuples similarity performance is depicted in Figure~\ref{fig:snippet_graph}.
The horizontal axis represents the threshold of the 2-tuples similarity percentage between the snippet and the suspicious document.
The curves show the percentage of downloads performed and the percentage of similarities discovered as functions of the snippet similarity threshold.
A profitable threshold is the one with the largest distance between those two curves.
We set the snippet similarity threshold to 20\%, which in the graph corresponds to 20\% of all
downloads and simultaneously to 70\% of discovered similarities.

\subsection{Source Retrieval Results}
In the PAN 2013 Source Retrieval subtask we competed with 8 other teams.
The best approach cannot be singled out, because there were several independent
performance measures. Possibly each approach has its pros and cons and many approaches
are usable in different situations.

We believe that in realistic plagiarism detection the most important thing is to keep the number of
queries low and simultaneously to maximize recall.
% It is often some tradeoff between cost and efectivness.
It is also advisable to keep the number of downloads low; on the other hand, downloading
is a relatively cheap and easily scalable operation.

Our approach had the second best ratio of recall to the number of used queries, which
reflects the query efficacy. The approach with the best ratio used few queries (4.9 queries per document, which
was 0.4 of the amount we used), but it also obtained the lowest recall (0.65 of our recall).
The approach with the highest recall (and also the lowest precision) achieved 2.8 times higher recall with 3.9 times more queries compared to ours.

Our approach also achieved a low precision, which means we reported many results which
were not considered correct hits. On the other hand, each reported result contained some
textual similarity according to the text alignment subtask score, which we believe is still worthwhile to report.