PV211 -- Introduction to Information Retrieval
(Spring 2015)
The course is based on the book by
Manning, Raghavan and
Schütze, Introduction to Information Retrieval, which is taught
at Stanford, Munich and other places. Rich and
detailed materials are available
on
Coursera. Several copies of the textbook are available
in the library at FI.
In the course you will learn, among other things, how
Google is able to respond to 10,000+ queries per second
from different places on the globe within milliseconds,
and how it satisfies most users with ranked lists of
documents chosen from 100,000,000,000 documents in
such a way that they find
the answer within the first 10 hits.
I will encourage `flipped learning' approaches where
possible, e.g. by supporting the creation of
learning resources such as Khan Academy style lessons: have a look at the
intro PV211
trailer about the history and importance of information retrieval.
- 13.5.: A test ROPOT has been posted. It should be answered by May 20th, 8 AM (30 points max.).
- 13.5.: Posted solutions of PV211 exercises (in Czech, sorry) prepared by
Dominik Szalai, Michal Krajčovič
and Lukáš Daubner. Use them to
prepare for the exams.
- 30.4.: Exam dates posted in the IS.
- 24.3.: Negotiations about lectures by Seznam
specialists have been finalized. There will be three talks, on these topics:
a) Full Text Engine (8.4., Roman Rožnik),
b) Query Reformulation and Intent (15.4., Vláďa Kadlec),
c) Mining Big Data (22.4., Jan Lukavský).
- 28.2.: 53 students enrolled this year.
- 25.2.: No contact teaching today.
Posted lectures from last week in the IS.
- 18.2. 8am, D3: We start with the first lectures.
- 17.2.: Lecture slides posted for the whole course, together
with study
materials in the form of Khan Academy style videos.
- There is a discussion group for the PV211 course
in IS.muni.cz, which serves as the official information and communication
channel of the course: watch it frequently!
- First part of a FAQ :-).
- 18. 2. 2015
Introductory lectures.
video (course
trailer, in Czech)
Boolean retrieval slides 1,
IIR chapter 1.
Term vocabulary and postings lists
slides 2,
IIR chapter 2.
- 4. 3. 2015
Dictionaries and tolerant retrieval
slides 3,
IIR chapter 3,
Levenshtein demo,
Edit distance (YouTube video in Czech),
Levenshtein computation
(YouTube video in Czech),
ternary trees,
Soundex demo.
- 11. 3. 2015
Index construction slides 4,
IIR chapter 4,
Explore Google datacenters (YouTube video).
Compression slides 5,
IIR chapter 5.
- 18. 3. 2015
Scoring, term weighting, the vector space model slides 6,
Vector space model (slides Arguello),
IIR chapter 6.
Complete search system
Challenges in Building Google... (slides by Jeff Dean from 2014 at Stanford course).
Additional materials and readings: slides Google architecture (Ed Austin),
slides Google infrastructure (Jeff Dean),
Jeff Dean (YouTube video),
Google Anatomy paper from 1998,
Google File System,
Google executives,
PageRank Calculator,
Jak funguje Google (`How Google works', YouTube video).
- 25. 3. 2015
Scoring slides 7,
IIR chapter 7.
Evaluation and result summaries
slides 8,
IIR chapter 8.
- 1. 4. 2015
Relevance feedback and query expansion
slides 9,
IIR chapter 9.
XML retrieval slides 10,
IIR chapter 10.
MathML retrieval by MIaS in EuDML: slides
- 8. 4. 2015
8:00 Web search slides 19,
IIR chapter 19.
9:10-10:40 D3: Fulltext Architecture (Roman Rožnik)
slides (link Prezi),
slides (12 MB PDF),
video, mp4 (623 MB)
How does a search engine give a user quite
good results in a split second? An introduction to the
architecture of a search engine. What happens when a user enters a query?
How do robots crawl data from the web? The talk covers the process of indexing, searching
in an inverted index, storing data in RAM, creating snippets,
spreading the load across servers, collecting data for training
the ranking function, the training process, etc.
Roman Rožník
is currently a senior researcher at Seznam.cz. Roman succeeded
in introducing supervised machine learning to the problem of ranking
fulltext results and became the head of the relevance research team
(4 people in total). The main focus of the team is to develop and
maintain its own supervised ranker (based on boosted regression
trees) and to use it as a tool for the various problems that life
at Seznam.cz comes up with. Through this work, Roman sees
any problem as a machine learning problem and any data as
training data or features.
- 15. 4. 2015
8:00-9:00 D3 Web search (cont.) slides 19,
IIR chapter 19.
9:10-10:40 D3: Query Reformulation in a Search Engine (RNDr. Vladimír Kadlec, Ph.D.)
slides (PDF),
video part1 (462 MB),
video part2, mp4 (220 MB)
Modern fulltext search engines (almost) never limit the search to the
keywords that the user types in. The input query is reformulated,
supplemented by additional words and then searched
in the document index. The talk introduces several
query reformulation techniques, from diacritics
restoration to acronym expansion. All presented
algorithms are implemented in the Seznam.cz search engine.
Vladimir Kadlec has been a senior researcher at Seznam.cz since 2011.
He earned his doctoral degree from FI MUNI in 2008.
All of his research has been related to natural language processing
or information retrieval. At Seznam.cz he designs and improves
algorithms for the fulltext search engine. Vladimir loves
(almost) all sports from snowboarding to cycling.
- 22. 4. 2015
8:00-9:00 D3
Crawling slides 20,
IIR chapter 20.
9:10-10:40 D3: Mining Useful and Big Data
from Internet (Ing. Jan Lukavský),
slides,
video (467 MB).
Would you like to hear about the problems that need to be
solved in order to obtain really valuable data from the
Internet? What are the biggest problems you will face
when implementing a web crawling system focused on a
specific purpose (i.e. not general crawling, but
'focused' crawling)? How do you design scalable systems
using modern data-processing tools? How can 'big data'
help you in designing better and more functional systems?
These questions will be answered in Jan's presentation.
Jan is a development team leader at Seznam.cz.
His team works on the backend of the fulltext search engine, mainly
web-scale crawling and large-scale
data processing. Currently, the team mostly uses Hadoop
MapReduce as a processing engine and Apache HBase for
primary data storage. Its algorithms are designed
to run on hundreds of computational
nodes and to process hundreds of terabytes of data
stored in tens of billions of records.
- 29. 4. 2015
Link Analysis, PageRank slides 21,
IIR chapter 21.
How
Google finds a needle....
- 6. 5. 2015
Dies Academicus -- rector's day: no contact lessons
- 13. 5. 2015
Latent Semantic Indexing slides 18,
IIR chapter
18, Gensim.
Possibly also some important topics from chapters 11--17, on demand.
Q&A session: please prepare your questions!
- Not covered this year, sorry:
Text Classification and Naive Bayes slides 13,
IIR chapter 13.
Vector Space Classification slides 14,
IIR
chapter 14.
Probabilistic Information Retrieval slides 11,
IIR chapter 11.
Language Models for IR slides 12,
IIR chapter 12.
Support Vector Machines slides 15a,
Learning to Rank slides 15b,
IIR
chapter 15.
Flat Clustering slides 16,
IIR chapter 16.
Hierarchical Clustering slides 17,
IIR chapter 17.
I will be glad if the course topics catch your interest and you decide
to gain deeper insight into them by solving [mini]projects.
Such activities will be rewarded with a nontrivial number of
bonus points towards your grade.
The number of stars below is an estimate of project
difficulty, from a miniproject [(*), 10 points] to a project of Master's
thesis size [(*****), 50+ points].
I am open to assigning or extending a project as a Bachelor's or Master's thesis,
just contact me.
- (*)+ Help with the production or preparation of a motivating Khan Academy
style video similar to those on Effa Academy.
- (*)+ Preparing solutions to exercises or slides
in LaTeX. Pointing out any errors in the course materials.
- (***)+ Presentation or teaching video on topics relevant to
the course. Possible topics: Sketch Engine, search with
linguistic attributes, random walks in texts, topic search and corpora,
time-constrained search, topic modelling with gensim, LDA,
Wolfram Alpha, specifics of searching structured data (chemical
and mathematical formulae, linguistic trees - syntactic or
dependency), ...
- (****)+ Evaluation of Math Information Retrieval in the
MIaS system -- possible
as a Dean project under the supervision of
Martin Líška or
Michal Růžička
as a thesis.
- (****)+ An application of an information `robot' based
on the Intel Galileo SoC (hardware is available).
- (*****) Participation in an IR competition at
Kaggle.com.
- ...
To a disciple who dreaded making mistakes, the Master said: "Those who make no mistakes
are making the biggest mistake of all - they are attempting nothing new." Anthony de
Mello: O cestě.
This page intentionally does not contain logos
of any [European] project programme, as the preparation of the course
has not been subsidized by any external party.
sojka at fi dot muni dot cz