PV211 -- Introduction to Information Retrieval (Spring 2017)
The course is based on the book
Manning, Raghavan and
Schütze: Introduction to Information Retrieval, which is used in courses
at Stanford, Munich and other places. In the course you will,
among other things, learn how it is possible that
Google can respond to 10,000+ queries per second
from different places on the globe within milliseconds.
Numerous rich and detailed materials are available
on
Coursera. Several copies of the textbook are available
in the library at FI. This year, parts about machine (deep) learning
will be added, together with topics such as image, gait or XML retrieval.
Students are encouraged to try active/flipped learning approaches
wherever possible.
- We have received an FR MU grant to improve the course materials and exercises (ROPOTs to test
your knowledge) and to host invited speakers.
- Invited lectures by Lukáš Vrábel (Seznam), Tomáš Mikolov (Facebook) and Michal
Balážia (FI MU) will take place as part of the course; others are being negotiated, e.g. with
Radim Řehůřek (RaRe Technologies).
All planned lectures are listed below and are subject to change.
- The IS MU discussion group for
PV211 is available for discussions. You are encouraged to post your questions there!
Course trailer (in Czech).
- 25. 4.: Exam dates posted in the IS.
- 21. 2. 2017 12:00 D3: Introductory lecture, Boolean Retrieval.
Boolean retrieval slides 1,
IIR chapter 1
Term vocabulary and postings lists
slides 2,
IIR chapter 2
Exercise 1 (IS syllabus)
- 28. 2. 2017 12:00 D3: Dictionary and Postings' storage (Indexing).
Readings:
ternary trees,
Soundex demo.
Explore Google datacenters (YouTube video).
Term vocabulary and postings lists
slides 2,
IIR chapter 2
Dictionaries and tolerant retrieval
slides 3,
IIR chapter 3
Exercise 2 (IS syllabus)
- 7. 3. 2017 12:00 D3: Index construction, MapReduce, Compression.
Readings:
Index construction slides 4,
IIR chapter 4
Compression slides 5,
IIR chapter 5
Exercise 3 (IS syllabus)
- 14. 3. 2017 12:00 D3: Vector Space Model, IR system architecture.
Readings:
Scoring, term weighting, the vector space model slides 6,
Vector space model (slides Arguello),
IIR chapter 6
Scoring slides 7,
IIR chapter 7
slides Google architecture (Ed Austin),
slides Google infrastructure (Jeff Dean),
Jeff Dean (YouTube video),
Google Anatomy paper from 1998,
Google File System,
About Google [searches],
PageRank Calculator,
Jak funguje Google (How Google Works; YouTube video).
Complete search system
Challenges
in Building Google... (slides by Jeff Dean from Stanford CS276 course in 2015).
Exercise 4 (IS syllabus)
- 21. 3. 2017 12:15 D3: Invited lectures.
Principles of Convolutional Neural Nets by Lukáš Vrábel. slides (PDF)
IS MU recorded video, 633 MB, MP4
Abstract: An interactive introduction to the basic principles behind deep
convolutional neural networks for image processing. We will discuss the
inner workings of neural networks, how we can use them for image classification,
and how we can improve image processing by using convolution.
Lukáš Vrábel (LinkedIn)
works as a research team lead at Seznam.cz. His team works on various machine
learning tasks such as text and web page analysis, recommendation systems, and image recognition.
ca. 13:30 D3: Q&A session, then move to D2:
14:00 D2 (Informatics Colloquium):
Neural Networks for Natural Language Processing by Tomáš Mikolov.
info
recorded video in IS MU, 505 MB, MP4
Abstract: Artificial neural networks are currently very successful in various machine learning tasks that involve natural language. In this talk, I will describe how recurrent neural network language models have been developed, as well as their most frequent applications to speech recognition and machine translation. I will also talk about distributed word representations, their interesting properties, and efficient ways to compute them. Finally, I will describe our latest efforts to create a novel dataset that could be used to develop machines that can truly communicate with human users in natural language.
Tomáš Mikolov
is a senior researcher at Facebook (AI team), previously at Google.
He earned his doctoral degree from FIT VUT in 2013.
Readings:
main "word2vec" paper,
Building scalable systems that understand content.
MIDTERM test #1
- 28. 3. 2017 12PM D3: Evaluation, Relevance feedback and Query expansion.
Readings: Evaluation and result summaries slides 8,
IIR chapter 8.
Query expansion slides 9,
IIR chapter 9.
Exercise 6 (interactive IS syllabus)
- 4. 4. 2017 12PM D3: Classification, SVM.
Readings: Text Classification and Naive Bayes slides 13,
IIR chapter 13.
Vector Space Classification slides 14,
IIR chapter 14.
Support Vector Machines slides 15a,
Learning to Rank slides 15b,
IIR chapter 15.
Exercise 7 (interactive IS syllabus)
- 11. 4. 2017 12:15 D3: Image Search at Seznam by
Lukáš Vrábel (LinkedIn).
slides (67 MB PDF),
recorded lecture in IS MU, 624 MB, MP4,
lecture poster, event image
Abstract: Introduction to the Seznam.cz
image search architecture.
We will talk about the system overview and basic signals used in machine learning
algorithms for relevance computation. We will cover the effect of user feedback
on quality of results, the technology behind user query understanding and the
deep convolutional neural networks for computer vision and image understanding.
Lukáš works as a research team lead at Seznam.cz.
His team works on various machine learning tasks such as text and web
page analysis, recommendation systems, and image recognition.
Exercise 8 (interactive IS syllabus)
- 18. 4. 2017 12PM D3: Clustering, machine learning.
Flat Clustering slides 16,
IIR chapter 16.
Hierarchical Clustering slides 17,
IIR chapter 17.
Latent Dirichlet Allocation Topic similarity by LDA: intro,
LDA slides by Blei,
LDA visual browser demo
Exercise 9 (interactive IS syllabus)
- 25. 4. 2017 12PM D3: Web search, Crawling
Readings:
Web search slides 19,
IIR chapter 19.
MIDTERM test #2
- 2. 5. 2017 12PM D3: Gait Recognition, Link analysis
(by Michal Balážia)
Gait Recognition (as an example of classification and clustering) slides
Readings: Gait Recognition gait.fi.muni.cz
Readings: Link Analysis slides 21,
IIR chapter 21,
How
Google finds a needle....
Exercise (interactive IS syllabus)
- 9. 5. 2017 12PM D3: Crawling. Link Analysis. XML retrieval
Crawling slides 20,
IIR chapter 20,
Sketch Engine,
Specific Crawling Techniques (Hudák)
Link Analysis (HITS) slides 21 (cont.)
Exercises are held this week, contrary to the dean's and rector's announcements; details on miniprojects and exams are given (see the lecture videos).
Exercise (interactive IS syllabus)
- 16. 5. 2017 12PM D3: Latent Dirichlet Allocation (cont.),
Latent Semantic Indexing, Semantic indexing. MathML retrieval.
Readings: Latent Semantic Indexing slides 18,
IIR chapter 18,
Gensim,
Semantic indexing in ScaleText,
paper on ScaleText's design.
Readings: XML retrieval slides 10,
IIR chapter 10,
MathML retrieval by MIaS in EuDML: slides
Questions and answers session.
Exercise (interactive IS syllabus)
I will be glad if the course topics engage you and you decide
to gain deeper insight into them by solving [mini]projects.
Such activities will be rewarded with a nontrivial number of
bonus points toward your final grade.
The number of stars below is an estimate of project
difficulty, from a miniproject [(*), 10 points] to a big project [(*****), 30+ points].
I am open to assigning or extending a project as a Bachelor's or Master's thesis;
just contact me.
- (*)+ Preparing solutions to exercises or slides
in LaTeX. Pointing out any (factual or typographical) errors in the course materials.
- (*)+ Help with producing or preparing a motivational Khan Academy-style
video similar to those on Effa Academy.
- (**)+ Presentation or teaching video on topics relevant to
the course. Possible topics: Sketch Engine, search with
linguistic attributes, random walks in texts, topic search and corpora,
time-constrained search, topic modelling with gensim, LDA,
Wolfram Alpha, specifics of search of structured data (chemical
and mathematical formulae, linguistic trees - syntactic or
dependency),...
- (***) Participation in an IR competition at
Kaggle.com.
- (***) Participation in IR research on
Math Information Retrieval,
Gait Recognition or
the ScaleText project.
To a disciple who dreaded making mistakes, the Master said: "Those who make no mistakes
make the biggest mistake of all - they are not attempting anything new." Anthony de
Mello: O cestě (in Czech).
sojka at fi dot muni dot cz