+Since the snippet is relatively small and may be a discontinuous part of the text, the
+text alignment methods described in section~\ref{text_alignment} were insufficient
+for deciding whether to download a document. Therefore, we chose to compare the existence of snippet word tuples
+in the suspicious document. For 1-tuples, the measure expresses how many words from the snippet
+also occur in the suspicious document. If the snippet contains many common words, they may
+occur in many documents, in which case the 1-tuple measure is not very decisive.
+\r
+We used the 2-tuples measure, which indicates how many neighbouring word pairs occur in both the snippet and the suspicious document.
+According to this value we decided whether to download the source or not. To deduce
+the threshold value, we used 4413 search results from various queries based on documents
+in the training corpus. Each resulting document was textually aligned with its corresponding suspicious document.
+One similarity represents a continuous passage of text alignment similarity, as described in the following section~\ref{text_alignment}.
+In this way we obtained 248 similarities in total after downloading all of the 4431 documents.
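The 2-tuple decision described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in our system; the function names and the word tokenization are our own assumptions:

```python
import re

def bigrams(text):
    """Set of lowercase word 2-tuples (neighbouring word pairs) of a text."""
    words = re.findall(r"\w+", text.lower())
    return {(a, b) for a, b in zip(words, words[1:])}

def snippet_similarity(snippet, document):
    """Fraction of the snippet's 2-tuples that also occur in the document."""
    snippet_pairs = bigrams(snippet)
    if not snippet_pairs:
        return 0.0
    return len(snippet_pairs & bigrams(document)) / len(snippet_pairs)

def should_download(snippet, suspicious_document, threshold=0.20):
    """Download the source only if the 2-tuple similarity reaches the threshold."""
    return snippet_similarity(snippet, suspicious_document) >= threshold
```

Comparing sets of word pairs rather than single words suppresses the influence of common words, which is exactly why the 2-tuple measure is more decisive than the 1-tuple one.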
+\r
+The 2-tuples similarity performance is depicted in Figure~\ref{fig:snippet_graph}.
+The horizontal axis represents the threshold of the 2-tuples similarity percentage between the snippet and the suspicious document.
+The graph curves represent the percentage of obtained resources according to the snippet similarity threshold.
+A profitable threshold is one with the largest distance between those two curves.
+We set the snippet similarity threshold to 20\%, which in the graph corresponds to 20\% of all
+downloads and simultaneously to 70\% of discovered similarities.
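The threshold selection can be sketched as maximizing the gap between the two curves. The numbers below are purely illustrative stand-ins for the curves in Figure~\ref{fig:snippet_graph}, not measured values:

```python
# Illustrative curve samples (NOT the real measured data): for each candidate
# threshold, the fraction of documents downloaded and the fraction of
# similarities still discovered.
thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
downloads  = [1.00, 0.45, 0.20, 0.10, 0.05, 0.02]
discovered = [1.00, 0.90, 0.70, 0.45, 0.25, 0.10]

def best_threshold(thresholds, downloads, discovered):
    """Pick the threshold with the largest distance between the
    discovered-similarities curve and the downloads curve."""
    gaps = [disc - dl for disc, dl in zip(discovered, downloads)]
    return thresholds[gaps.index(max(gaps))]
```

With these illustrative samples the largest gap (0.70 discovered vs. 0.20 downloaded) falls at a threshold of 0.2, matching the 20\% choice described above.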
+ \r
+\subsection{Source Retrieval Results}\r
+In the PAN 2013 Source Retrieval subtask we competed with 8 other teams.
+No single best approach can be selected, because there were several independent
+performance measures. Each approach has its pros and cons, and many approaches
+are usable in different situations.
+\r
+We believe that in realistic plagiarism detection the most important goal is keeping the number of
+queries low while simultaneously maximizing recall.
+% It is often a tradeoff between cost and effectiveness.
+It is also advisable to keep the number of downloads low, but on the other hand,
+downloading is a relatively cheap and easily scalable operation.
+\r
+Our approach had the second best ratio of recall to the number of queries used, which
+indicates query efficiency. The approach with the best ratio used few queries (4.9 queries per document, which
+was 0.4 of the amount we used), but it also obtained the lowest recall (0.65 of our recall).
+The approach with the highest recall (and also the lowest precision) achieved 2.8 times higher recall with 3.9 times more queries compared to ours.
+\r
+Our approach also achieved low precision, which means we reported many results that
+were not considered correct hits. On the other hand, each reported result contained some
+textual similarity according to the text alignment subtask score, which we believe is still worthwhile to report.