PV211 -- Introduction to Information Retrieval (Spring 2019)
The course is based on the textbook
Manning, Raghavan and
Schütze: Introduction to Information Retrieval, which is taught
at Stanford, Munich and other places. In the course you will,
among other things, learn how it is possible that
Google is able to respond to 10,000+ queries per second
from different places on the globe within milliseconds.
There are numerous, rich and detailed materials available
on
Coursera. Several copies of the textbook are available
in the library at FI. This year, too, parts on machine (deep) learning
will be added, together with topics such as image or XML retrieval.
Students are encouraged to try active/flipped learning approaches
wherever possible.
- 21. 2. 2019 10:00 D1: Introduction to IR, Boolean Retrieval.
Boolean retrieval slides 1,
IIR chapter 1
Exercises 1 (IS)
- 28. 2. 2019 10:00 D1: Dictionary and Postings'
storage (Indexing). Tolerant Retrieval.
Readings:
ternary trees,
Soundex demo.
Explore Google datacenters (YouTube video).
Term vocabulary and postings lists
slides 2,
IIR chapter 2
Dictionaries and tolerant retrieval
slides 3,
IIR chapter 3
Exercises 2 (IS)
- 7. 3. 2019 10:00 D1:
Index construction, MapReduce, Compression.
Readings:
Index construction slides 4,
IIR chapter 4
Compression slides 5,
IIR chapter 5
Exercises 3 (IS)
Relevant (voluntary) lectures by Tomáš Mikolov:
11. 3. 2019 14:00 ESF MU, P201,
and
12. 3. 2019 14:00 FI MU D2 (Informatics Colloquium)
on Recent progress in AI research.
Tomáš Mikolov
is a senior researcher at Facebook (AI team), previously at Google
and Microsoft. He holds the Neuron 2018 prize. He earned his doctoral
degree from FIT VUT in 2013.
- 14. 3. 2019 10:00 D1:
Vector Space Model, IR system architecture.
Readings:
Scoring, term weighting, the vector space model slides 6,
Vector space model (slides Arguello),
IIR chapter 6
Scoring slides 7,
IIR chapter 7
slides Google architecture (Ed Austin),
slides Google infrastructure (Jeff Dean),
Jeff Dean (YouTube video),
Google Anatomy paper from 1998,
Google File System,
About Google [searches],
Jak funguje Google (How Google Works; YouTube video).
Complete search system
Challenges in Building Google... (slides by Jeff Dean from the Stanford CS276 course in 2015).
Readings (for exercises): Levenshtein demo, Edit distance (YouTube video
in Czech), Levenshtein computation (YouTube video in Czech),
Exercises 4 (IS)
- 21. 3. 2019 10:00 D1:
Evaluation, Relevance feedback and Query expansion.
Readings: Evaluation and result summaries slides 8,
IIR chapter 8.
Query expansion slides 9,
IIR chapter 9.
Exercises 5 (IS):
Midterm TEST #1
- 28. 3. 2019 10:00 D1:
Probabilistic Information Retrieval slides 11,
IIR chapter 11.
Language Models for IR slides 12,
IIR chapter 12.
Exercises 6 (IS)
- 4. 4. 2019 10:00 D1:
Classification, SVM.
Readings: Text Classification and Naive
Bayes slides 13,
IIR chapter 13.
Vector Space Classification slides 14,
IIR chapter 14.
Support Vector Machines slides 15a,
Learning to Rank slides 15b,
(IIR chapter 15).
Exercises 7 (IS)
- 11. 4. 2019 10:00 D3 (not in D1!): Clustering, machine learning.
Readings:
Clustering Introduction slides Cvinčeková,
Flat Clustering slides 16,
IIR chapter 16.
Hierarchical Clustering slides 17,
IIR chapter 17.
Exercises 8 (IS)
- 18. 4. 2019 10:00 D1: Web search, Link Analysis.
Readings:
Web search slides 19,
IIR chapter 19.
Link Analysis slides 21,
IIR chapter 21,
How Google finds a needle....
Exercises 9 (IS)
- 25. 4. 2019 10:00 D1:
Crawling. Link Analysis. XML retrieval.
Readings: Crawling slides 20,
IIR chapter 20,
Sketch Engine.
XML retrieval slides 10,
IIR chapter 10,
MathML retrieval by MIaS in EuDML: slides.
Latent Dirichlet Allocation, topic similarity by LDA: intro,
LDA slides by Blei,
LDA visual browser demo.
Exercises 10 (IS): Midterm TEST #2
- 2. 5. 2019 10:00 D1:
Latent Semantic Indexing, LDA, Semantic indexing and segmentation.
Readings: Latent Semantic Indexing slides 18,
IIR chapter 18,
Gensim,
Semantic indexing in ScaleText,
paper on ScaleText's design.
Exercises 11 (IS): Path similarity, PageRank, Hubs and authorities
- 9. 5. 2019 10:00 D1:
10:00 Agenda, warm-up (a special kind of review of IR topics)
10:30 Invited lecture: Seznam.cz Fulltext Architecture by Vladimír Kadlec
(LinkedIn).
Abstract: The talk covers all the basic building blocks of a web search engine:
crawling, indexing, query reformulation, and relevance. It explains the inner
workings of user-interface components such as the autocompleter, query corrector,
and suggested searches, and presents real statistics from Seznam's traffic.
As a bonus: the latest machine learning advances in query correction.
Vladimír is currently
the head of the whole research team at Seznam.cz.
He earned his doctoral degree from FI MU in 2008. All of his research has been
related to natural language processing or information retrieval.
At Seznam.cz he designs and improves algorithms for the fulltext search engine.
Vladimír loves (almost) all sports, from snowboarding to cycling.
His team works on various machine learning tasks such as fulltext search, text and web
page analysis, recommendation systems, and image recognition.
Exercises 12: Similarity search with Gensim
- 16. 5. 2019 10:00 D1:
Dies Academicus, no contact teaching. The planned question-and-answer session, discussion, and feedback
will take place in the discussion forum instead.
I will be glad if the course topics engage you and you decide
to gain deeper insight into them by solving [mini]projects.
Such activities will be rewarded with a nontrivial number of
premium points towards your grade.
The number of stars below estimates project
difficulty, from a miniproject [(*), 10 points] to a big project [(*****), 30+ points].
I am open to assigning or extending a project as a Bachelor's, Master's, or dissertation thesis;
just contact me.
- (*)+ Pointing out any (factual, typographical) errors
in the course materials.
- (**)+ Preparation of hot-topic slides, production
of a motivating Khan Academy-style video, or other course materials in LaTeX.
- (**)+ Presentation or teaching video on topics relevant to the course.
Possible topics: Sketch Engine, search with linguistic attributes,
random walks in texts, topic search and corpora,
time-constrained search, topic modelling with gensim, LDA,
Wolfram Alpha, specifics of searching structured data (chemical
and mathematical formulae, linguistic trees, whether syntactic or
dependency), etc.
- (***) Participation in an IR competition at
Kaggle.com.
- (***) Participation in IR research on
Math Information Retrieval,
Gait Recognition, or
the ScaleText project.
- (***)+ Evaluation of Math Information Retrieval in the
MIaS system, possible
as a Dean project under the supervision of
Vít Novotný or
Dávid Lupták or
Michal Růžička,
or as a Bachelor's, Master's, or dissertation thesis.
To a pupil who was in danger, the Master said, "Those who do not make mistakes
are the most mistaken of all: they do not try anything new."
Anthony de Mello
sojka at fi dot muni dot cz