PV211 -- Introduction to Information Retrieval (Spring 2020)
The course is based on the textbook by
Manning, Raghavan and
Schütze, Introduction to Information Retrieval (hard copies are available
in the library at FI), which is also taught at Stanford, Munich and other places.
In the course you will, among other things, learn how it is possible
to satisfy searchers' information needs at global web scale, at a rate of
10,000+ queries per second, with each query answered within milliseconds.
This year, parts on machine (deep) learning
have been added, while some others have been dropped for capacity reasons.
Students will be encouraged to try active/flipped learning approaches
wherever possible.
Course trailer (in Czech)
- 19. 2. 2020 12:00 D3:
Introduction to IR, Boolean Retrieval.
Boolean retrieval slides 1,
IIR chapter 1
Exercises
(week 1+2, auth IS)
- 26. 2. 2020 12:00 D3:
Dictionary and postings storage (indexing). Tolerant retrieval.
Readings:
ternary trees,
Soundex demo.
Explore Google datacenters (YouTube video).
Term vocabulary and postings lists
slides 2,
IIR chapter 2
Dictionaries and tolerant retrieval
slides 3,
IIR chapter 3
Exercises
(week 1+2, auth IS)
- 4. 3. 2020 12:00 D3:
Tolerant retrieval (cont.), Index construction, MapReduce.
Readings:
Index construction slides 4,
IIR chapter 4
Exercises
(week 3+4, auth IS)
- 11. 3. 2020 12:00 D3:
Index Compression, Scoring.
Readings:
Compression slides 5,
IIR chapter 5
Scoring, term weighting, the vector space model slides 6,
IIR chapter 6
Exercises
(week 3+4, auth IS)
- 18. 3. 2020 12:00 D3:
Vector Space Model, Anatomy of the web scale IR system.
Readings:
Vector space model (slides Arguello)
Google:
Anatomy
paper from 1998 (PDF),
(HTML),
slides
Google infrastructure by Jeff Dean,
Jeff Dean (YouTube video),
Google File System,
How Google works
(YouTube in Czech),
Challenges
in Building Google... (slides by Jeff Dean from Stanford CS276 course
in 2015),
Google crash course (in Czech),
slides
Google architecture (Ed Austin).
Exercises
(week 5+6, auth IS),
Sketch Engine
- 25. 3. 2020
Distributed Word Representations for Information Retrieval.
Readings: slides,
main "word2vec" paper,
Building scalable systems that understand content.
Exercises
(week 5+6, auth IS)
- 1. 4. 2020
Computing scores in a complete search system. Ranking.
Scoring slides 7,
IIR chapter 7
Exercises
(week 7+8, auth IS)
- 8. 4. 2020
Evaluation in IR and result summaries
Readings: slides 8,
IIR chapter 8
Exercises
(week 7+8, auth IS)
- 15. 4. 2020
Relevance feedback and Query expansion.
Readings: Query expansion slides 9,
IIR chapter 9
Exercises
(week 9+10, auth IS)
- 22. 4. 2020
Text classification, Naive Bayes, Evaluation, Clustering, kNN.
Readings: Text Classification and Naive Bayes slides 13,
IIR chapter 13
Clustering Introduction slides Cvinčeková,
Flat Clustering slides 16,
IIR chapter 16
Exercises
(week 9+10, auth IS),
Term project information: deadline April 29th.
- 29. 4. 2020
Vector Space Classification.
Readings: slides 14,
IIR chapter 14.
Exercises
(week 11+12, auth IS)
- 6. 5. 2020
Latent Semantics Models.
Readings: Latent Semantic Indexing slides 18,
IIR chapter 18,
Gensim,
Latent Dirichlet Allocation,
Topic similarity by LDA: intro,
LDA slides by Blei,
LDA visual browser demo
Similarity search with Gensim (Exercise materials from 2019)
Exercises
(week 11+12, auth IS)
- 13. 5. 2020
Web search.
Readings: slides 19,
IIR chapter 19.
Exercises
(week 13+14, auth IS)
- 20. 5. 2020
Link Analysis.
Readings: slides 21,
IIR chapter 21,
How Google finds a needle....
Exercises
(week 13+14, auth IS)
Due to coronavirus-related limitations, the following topics will not be
covered in the 2020 run of the course:
- XML retrieval.
slides 10,
IIR chapter 10,
MathML retrieval by MIaS in EuDML: slides
- Crawling. slides 20,
IIR chapter 20
- SVM, Learning to Rank.
Support Vector Machines slides 15a,
Learning to Rank slides 15b,
(IIR chapter 15)
- Probabilistic Information Retrieval slides 11,
IIR chapter 11
- Language Models for IR slides 12,
IIR chapter 12
- Hierarchical Clustering slides 17,
IIR chapter 17
PV211 course pages from previous years:
2019, 2018, 2017, 2016, 2015, and 2014.
Web pages of similar courses at Stanford and Munich.
I will be glad if the course topics catch your interest and you decide
to gain deeper insight into them by solving [mini]projects.
Activities in this direction will be rewarded with a nontrivial number of
bonus points toward your final grade.
The number of stars below is an estimate of project
difficulty, from a miniproject [(*), 10 points] to a big project [(*****), 30+ points].
I am also open to assigning or extending a project as a Bachelor's, Master's, or Dissertation thesis.
- (*)+ Pointing out any (factual, typographical) errors
in the course materials.
- (**)+ Preparation of hot-topic slides, production of a motivating
Khan Academy-style video, or preparation of other course materials in LaTeX.
- (**)+ Presentation or teaching video on topics relevant to the course.
Possible topics: Sketch Engine, search with linguistic attributes,
random walks in texts, topic search and corpora,
time-constrained search, topic modelling with gensim, LDA,
Wolfram Alpha, specifics of searching structured data (chemical
and mathematical formulae, linguistic trees, whether syntactic or
dependency), etc.
- (***) Participation in an IR competition at
Kaggle.com.
- (***)+ Participation in IR research in our
Math Information Retrieval group,
on its research agendas, the
ARQMath task, the
EuDML project, or the
DML project.
- (***)+ Evaluation of Math Information Retrieval in the
MIaS system; possible
as a Dean project under the supervision of
Vít Novotný,
Dávid Lupták, or
Michal Štefánik,
or as a Bachelor's, Master's, or Dissertation thesis.
To a pupil who was in danger, the Master said, "Those who do not make mistakes
are the most mistaken of all – they do not try anything new."
Anthony de Mello
sojka at fi dot muni dot cz --