PV211 -- Introduction to Information Retrieval
(Spring 2016)
The course is based on the book by
Manning, Raghavan and
Schütze, Introduction to Information Retrieval, which is used in courses
at Stanford, Munich and other places. Numerous
rich and detailed materials are available
on
Coursera. Several copies of the textbook are available
in the library at FI.
In the course you will, among other things, learn how it is possible that
Google is able to respond to 10,000+ questions per second
from different places on the globe within milliseconds.
It satisfies most searchers with ranked lists of
documents chosen from 100,000,000,000 documents in
such a way that they find the answer within the first 10 hits
and their information needs are met.
I will try to encourage `flipped learning' approaches where
possible, e.g. through the creation of
learning resources such as Khan Academy-style lessons: have a look at the
intro PV211
trailer about the history and importance of information retrieval.
- 22.2.: Negotiations about lectures by Seznam
specialists have been finalized. The two talks will cover:
a) Full Text Engine (1.3., Vláďa Kadlec),
b) Web Search, Spam, Advertisements,... (26.4., Vláďa Kadlec with Petr Sojka).
- As an experiment this year, there will be a
Piazza.com PV211 playground for
your (anonymous) discussions related to the course.
(The IS MU discussion group for
PV211 can be used as well.) Enjoy and use!
- 2.5.: Exam terms posted in the IS.
- 23. 2. 2016 8AM D2:
Introductory lectures.
video (course
trailer, in Czech)
Boolean retrieval slides 1,
IIR chapter 1.
- 1. 3. 2016 8AM D2:
Seznam.cz Fulltext Architecture by Vladimír Kadlec.
slides,
mp4 video (518 MB, IS authentication required)
Introduction to the Seznam.cz fulltext search architecture. The talk
covers all basic web search engine blocks: crawling, indexing, query
reformulation, relevance. It explains the inner workings of user
interface components such as the autocompleter, query corrector, and suggested
searches, and presents real statistics from Seznam's traffic. As a bonus:
Image/video search.
Vladimír Kadlec has been a senior researcher at Seznam.cz since 2011.
He earned his doctoral degree from FI MU in 2008.
All of his research has been related to natural language processing
or information retrieval. At Seznam.cz he designs and improves
algorithms for the fulltext search engine. Vladimír loves
(almost) all sports from snowboarding to cycling.
Readings (for exercises):
Levenshtein demo,
Edit distance (YouTube video in Czech),
Levenshtein computation
(YouTube video in Czech).
- 8. 3. 2016 8AM D2: Indexing
Readings:
ternary trees,
Soundex demo.
Explore Google datacenters (YouTube video).
Term vocabulary and postings lists
slides 2,
IIR chapter 2
Dictionaries and tolerant retrieval
slides 3,
IIR chapter 3.
- 15. 3. 2016 8AM D2: Vector Space Model, Google architecture
Readings:
Index construction slides 4,
IIR chapter 4
slides Google architecture (Ed Austin),
slides Google infrastructure (Jeff Dean),
Jeff Dean (YouTube video),
Google Anatomy paper from 1998,
Google File System,
Google executives,
PageRank Calculator,
Jak funguje Google [How Google Works] (YouTube video).
Complete search system
Challenges
in Building Google... (slides by Jeff Dean from Stanford CS276 course in 2015).
- 22. 3. 2016 8AM D2: Compression,
Ranking
Compression slides 5,
IIR chapter 5
Scoring, term weighting, the vector space model slides 6,
Vector space model (slides Arguello),
IIR chapter 6.
- 29. 3. 2016 8AM D2:
Ranking (cont.), Scoring
Scoring slides 7,
IIR chapter 7
- 5. 4. 2016 8AM D2:
Evaluation, Relevance feedback and query expansion
Evaluation and result summaries
slides 8,
IIR chapter 8.
Relevance feedback and query expansion slides 9,
IIR chapter 9.
XML retrieval slides 10,
IIR chapter 10.
MathML retrieval by MIaS in EuDML: slides
- 12. 4. 2016 8AM D2:
Query Expansion slides 9,
IIR chapter 9.
XML retrieval slides 10,
IIR chapter 10.
MathML retrieval by MIaS in EuDML: slides
- 19. 4. 2016 8AM D2:
MathML retrieval by MIaS in EuDML: slides
Crawling slides 20,
IIR chapter 20.
- 26. 4. 2016 8AM D2:
Web search slides 19,
IIR chapter 19.
- 3. 5. 2016 8AM D2:
Latent Semantic Indexing slides 18,
IIR chapter 18,
Gensim.
Latent Dirichlet Allocation
Topic similarity by LDA: intro,
LDA slides by Blei,
LDA visual browser demo
- 10. 5. 2016 8AM D2:
Website Visibility - The theory and practice of improving rankings
by Melius Weideman (5 book copies are at the FI MU library)
Resources: Online Marketing Essentials 1.0
Meta tag Best Practice,
Web Visibility Principles,
Search Basics.
- 17. 5. 2016 8AM D2:
Link Analysis, PageRank slides 21,
IIR chapter 21.
Readings: How
Google finds a needle....
Questions and answers session.
- Not covered this year, sorry:
Text Classification and Naive Bayes slides 13,
IIR chapter 13.
Vector Space Classification slides 14,
IIR chapter 14.
Probabilistic Information Retrieval slides 11,
IIR chapter 11.
Language Models for IR slides 12,
IIR chapter 12.
Support Vector Machines slides 15a,
Learning to Rank slides 15b,
IIR
chapter 15.
Flat Clustering slides 16,
IIR chapter 16.
Hierarchical Clustering slides 17,
IIR chapter 17.
I will be glad if the course topics catch your interest and you decide
to gain deeper insight into them by solving [mini]projects.
Such activities will be rewarded with a nontrivial number of
premium points towards your grade.
The number of stars below is an estimate of project
difficulty, from a miniproject [(*), 10 points] to Master's project
size [(*****), 50+ points].
I am open to assigning or extending a project as a Bachelor's or Master's thesis;
just contact me.
- (*)+ Help with producing or preparing a motivating Khan Academy-style
video similar to those on Effa Academy.
- (*)+ Preparing solutions to exercises or slides
in LaTeX. Pointing out any errors in the course materials.
- (***)+ Presentation or teaching video on topics relevant to
the course. Possible topics: Sketch Engine, search with
linguistic attributes, random walks in texts, topic search and corpora,
time-constrained search, topic modelling with gensim, LDA,
Wolfram Alpha, specifics of searching structured data (chemical
and mathematical formulae, linguistic trees - syntactic or
dependency),...
- (****)+ Evaluation of Math Information Retrieval in the
MIaS system -- possible
as a Dean project under the supervision of
Martin Líška or
Michal Růžička,
or as a thesis.
- (****)+ Application of an information `robot' based
on the Intel Galileo SoC (hardware is available).
- (*****) Participation in an IR competition at
Kaggle.com.
- ...
To a disciple who dreaded mistakes, the Master said: "Those who make no
mistakes make the biggest mistake of all - they attempt nothing new." Anthony de
Mello: O cestě.
This page intentionally does not contain any logos
of any [European] project programme, as the preparation of the course
has not been subsidized by any external party.
sojka at fi dot muni dot cz --