PV030 -- Textual Information Systems (spring 2013)
News |
Lectures |
References |
Tests |
- Nearest exam term will take place on May 29th, 10AM, C522.
For others see IS.muni.cz
- All-in-one slides from lectures in version suitable for printing:
one slide per A4 (PDF from 2013),
four slides per A4 (full PDF from 2013).
Slides used form IR book are not included, though, they are available
from Stanford page (.ppts).
- There is discussion group for the course in IS.muni.cz, as
official information and communication channel of the
course: please watch frequently!
- Study materials from several IR books available:
- First part of a FAQ:-) .
- 29. 5. 2013 10AM Final exam (70 min in C522)
- 14. 5. 2013 (3+ hours in C511)
Dictionary implementations.
Syntactical methods of compression.
Models of TIS: boolean, vector and probability.
IR evaluation (TREC et al.).
Document similarity, gensim.
Automatic structuring of texts.
Signature methods.
Compression with neural networks.
slides
Q&A session: all you wanted to know, but feared to ask.
- 7. 5. 2013 (3+ hours in C511)
Evaluation
in information retrieval
Coding theory: basic notions. Entropy, redundancy.
Universal encoding of natural numbers.
Introduction to compression. Statistical methods of compression.
shannon-Fano, Hufmann and arithmetic coding.
Compression dictionary methods. Adaptive dictionary compression methods.
slides
FGK (Dvorak)
LZ77 (Dvorak),
LZ78 (Dvorak),
PPM (Dvorak)
Methods with dictionary restructuralisation. Dictionary implementations.
Syntactical methods of compression.
slides
Homework: try document similarities on documents in projects
DML-CZ, EuDML computed by
gensim
- 30. 4. 2013 (3 hours in C411)
Scoring, term weighting and vector space model.
Computing scores in a complete search system.
Basics of corpus linguistics as an example of textual information system.
Indexing with natural language processing and its implementation.
slides
Homework: a) have a look at
Touchgraph as
a possible TIS interface; b) try wordnet (module add wn) on
aisa or elsewhere;
c) Mathematical Information Retrieval as
example of IR systems: EuDML,
MIaS/WebMIaS
- 23. 4. 2013 (3 hours in C411)
Index Construction.
Index Compression.
- 16. 4. 2013 (2 hours in D1, 12:00-12:50 B3511)
Brainstorming on anatomy of Google: Google paper on WWW7 conference,
Jeff Dean's video lecture,
Google File System,
Google executive,
PageRank Calculator
About Google in Czech,
Google
Gives Search a Refresh,
slides
- 9. 4. 2013 (2 hours in D1, 12:00-12:50 B311)
Information Retrieval (cont.):
Dictionaries and tolerant retrieval.
Homework: See Jeff Dean's lecture about Google scalability
Homework: readings till next week:
first paper about Google
- 2. 4. 2013 (2 hours in D1, 12:00-12:50 B311);
řešení úloh
Information
Retrieval Indexing (cont.):
The term vocabulary and postings lists.
Homework: Questions about Google from slides plus
read this article.
- 26. 3. 2013 (2+ hours in C416, 12:20-12:50 exercises in B311 [SkE])
Introduction to Information
Retrieval. Slides (Manning):
Boolean retrieval.
News: Infosys
award.
Homework is due Thursday, March 28th!
- 19. 3. 2013 (2 hours in D1, 12-13 exercises in B311)
Proximity search. Search classification: sixdimensional space of search
problems.
slides
Exercises (12 to 13) in B311: Visualisation of search engines (Pojer).
Sketch Engine motivation examples.
Index methods: preliminaries. Implementation of indexes.
Automatic indexing, thesauri construction.
- 13. 3. 2013 (3 hours in C416)
Search methods from right to left (variants of Boyer Moore,
Commentz-Walter, Buczilowski). Twoway automata with jumps:
generalization of exact search algorithms.
Hierarchy of search engines.
slides
animations
of algorithms Boyer-Moore, KMP (Buehler)
taxonomy
of search automaton constructions
reformulation of CW algo by L. Riedel
(in Czech)
homework:
Let we have patterns P= {tis, ti, iti}
1) Create NFA for searching P without epsilon transitions.
2) Create DFA equivalent to the NFA from 1)
3) Minimize DFA created in 2)
4) Compare the search by 3) with AC
5) You may experiment with finite automata and JFLAP
- 6. 3. 2013 (3 hours in C416)
Exact search of several patterns (AC),
regular expressions, exact search of infinitely many patterns.
slides
animations
of algorithms Boyer-Moore, KMP (Buehler)
taxonomy
of search automaton constructions
- 26. 2. 2013 (3 hours in C416)
Discussion about Watson readings.
Naive search optimisation.
Exact search of one pattern
(Shift-Or, Karp-Rabin, MP, KMP) and more patterns (AC).
Exact search of several patterns (AC),
regular expressions, exact search of infinite many patterns.
slides
animation
of
Aho-Corasick algorithm, and
implementation
in C#.
Animations:
String
matching algorithms (with animations, Lecroq),
Interactive Pattern Matching Animation (Goodrich),
animation
of algoritm KMP (Buehler)
- 19. 2. 2013 in C416.
Introduction, basic notions, classification of search problems.
slides
Watson,
paper
about Watson, (local copy)
Žákovi, který se hrozil chyb, Mistr řekl: "Ti, kdo nedělají chyby,
chybují nejvíc ze všech - nepokoušejí se o nic nového." Anthony de
Mello: O cestě.
sojka at fi dot muni dot cz --