ILP team at FI MU
Current activities
The ILP group at FI MU currently focuses on two application areas: natural
language processing and knowledge discovery in geographic data.

Natural language processing
Groundwork
DESAM [5], a corpus of Czech newspaper texts, was built at the Natural Language
Processing Laboratory, Faculty of Informatics, Masaryk University. It
contains more than 1 000 000 word positions, about 130 000 different word
forms, about 65 000 of them occurring more than once, and 1665 different
tags.
DIS, a semi-automatic disambiguator of Czech noun groups, was developed [16].
The ratio of unambiguous word forms increased from 50.6% to 58.5% after
processing by DIS. The number of tags per ambiguous token decreased from
3.3% to 2.7%.
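The two kinds of figures reported for DIS can be computed for any corpus in which every token carries its set of candidate tags. The following is only a minimal sketch and not the DIS implementation; the function name `ambiguity_stats` and the toy tags are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def ambiguity_stats(tokens: List[Set[str]]) -> Tuple[float, float]:
    """Return (share of unambiguous tokens, mean tags per ambiguous token).

    Each element of `tokens` is the set of candidate tags of one token.
    """
    if not tokens:
        raise ValueError("empty corpus")
    ambiguous = [tags for tags in tokens if len(tags) > 1]
    unambiguous_share = (len(tokens) - len(ambiguous)) / len(tokens)
    # The mean is taken over ambiguous tokens only; 0.0 if there are none.
    mean_tags = (
        sum(len(tags) for tags in ambiguous) / len(ambiguous) if ambiguous else 0.0
    )
    return unambiguous_share, mean_tags


# Toy input with made-up tags: two unambiguous tokens and two ambiguous ones.
corpus = [{"N1"}, {"N1", "N4"}, {"V"}, {"A1", "A4", "A5"}]
share, mean_tags = ambiguity_stats(corpus)
```

Running the function before and after disambiguation gives a pair of values for each state, which is how a change in the share of unambiguous items can be reported.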
In [17] a method for automatically finding compound verb groups in a Czech
sentence is introduced. The method yields a definite clause grammar rule,
called a verb rule, that contains information about the components of the
verb group and their tags.
Lemma disambiguator for Czech
-
A lemma disambiguator for Czech was developed [12, 13] using Progol. A
disambiguation method was introduced that combines ILP and
instance-based learning. The algorithm reached accuracy greater than 90%, leaving
less than 15% of words ambiguous. Lemma disambiguation of unknown words was
described in [6]. Progol was also tested in tag disambiguation of Czech
nouns [12]. The first results for tag disambiguation reach an average accuracy
of 91.5%.
GRIND system [3, 4]
-
The GRIND system was implemented. It can learn a sequence of
context-dependent parse actions from a set of
syntactically annotated sentences. In the first step, GRIND constructs
a sequence of 'deepening operators'. Then, in the second learning phase,
constraints on the application of these operators are induced by
means of ILP: so-called 'forbidding predicates' are learned.
Automatic tagging of compound verb groups
-
Finding all parts of a compound verb group in a Czech sentence and tagging the
group as a whole is necessary groundwork for any subsequent (semantic)
analysis. From the annotated corpus DESAM, 126 DCG rules were extracted that cover
all frequent verb groups in Czech [17, 18].
Using these rules we are able to recognise compound verb groups in unannotated
Czech texts with 93% accuracy.
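The idea of recognising a verb group whose parts are separated by other words can be illustrated as follows. The extracted rules are DCG rules; the Python sketch below only mimics matching the components by tag while allowing gaps between them. The tag names (`AUX_FUT`, `INF`, `ADV`), the rule and the function `match_verb_rule` are hypothetical and are not taken from DESAM or from [17, 18].

```python
from typing import List, Optional, Sequence, Tuple

Token = Tuple[str, str]  # (word form, tag); the tags used below are invented


def match_verb_rule(
    sentence: Sequence[Token], rule: Sequence[str]
) -> Optional[List[int]]:
    """Return the positions of the rule's components, in order, or None.

    Other words may stand between components, so discontinuous groups match.
    The leftmost possible position is chosen for each component.
    """
    positions: List[int] = []
    start = 0
    for wanted_tag in rule:
        for i in range(start, len(sentence)):
            if sentence[i][1] == wanted_tag:
                positions.append(i)
                start = i + 1
                break
        else:
            return None  # a component of the rule is missing
    return positions


# "Zítra budu doma pracovat" ("Tomorrow I will work at home").
sentence = [("Zítra", "ADV"), ("budu", "AUX_FUT"), ("doma", "ADV"), ("pracovat", "INF")]
future_rule = ["AUX_FUT", "INF"]  # hypothetical rule: future auxiliary + infinitive
found = match_verb_rule(sentence, future_rule)
group = [sentence[i][0] for i in found] if found else []
```

Here the group "budu ... pracovat" is found although "doma" stands between its two parts; once the positions are known, the group can be tagged as a whole.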
Part-of-Speech Tagging by Means of Shallow Parsing, ILP
and Active Learning [19]
-
A part-of-speech tagger for Czech is described that employs
the DIS shallow
parser for Czech, manually coded rules and inductive logic programming.
The active learning method used reduced
the number of training examples to label and shortened the
learning time without a decrease in recall or accuracy.
Compared with the previous work, both recall and
accuracy increased and the number of training examples to label decreased.
The method was tested on ambiguities that are frequent in Czech. The
accuracy reached was higher than 96% with recall higher than 95%.
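The text above does not say which active learning strategy was used. As an assumption, the sketch below shows the selection step of uncertainty sampling, one common pool-based strategy: the examples on which the current model is least sure are the ones sent for labelling. The function name `select_most_uncertain`, the context identifiers and the scores are made up for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_most_uncertain(
    pool: Sequence[T], positive_prob: Callable[[T], float], k: int
) -> List[T]:
    """Pick the k unlabeled examples whose predicted probability is closest to 0.5."""
    ranked = sorted(pool, key=lambda x: abs(positive_prob(x) - 0.5))
    return list(ranked[:k])


# Hypothetical scores of the current model for five unlabeled contexts.
scores = {"c1": 0.98, "c2": 0.55, "c3": 0.10, "c4": 0.47, "c5": 0.80}
to_label = select_most_uncertain(list(scores), scores.__getitem__, k=2)
```

Only the selected examples are labelled and added to the training set before the learner is run again, which is how such a loop can reduce the number of examples to label.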

Knowledge discovery in geographic data
- GWiM, a system for mining in geographic data
-
An interpreter of an inductive query language for knowledge discovery in
geographic data [13] was implemented using the WiM [7, 8, 15] system. Three
kinds of inductive queries were implemented. Two of them, which ask for
characteristic and discriminant rules, are adaptations of GeoMiner (Han et
al., SIGMOD'97) rules. The dependency rules add a new capability to the
inductive query language.
- New inductive query language
-
An extension of GWiM has been developed [1].
Neighbourhood graphs [11] are
used to describe spatial relations. The inductive query language is
fully integrated with the PostgreSQL database system. C4.5, RT4 and Progol
are used to compute inductive queries.
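A neighbourhood graph has spatial objects as nodes and links two objects when a chosen neighbourhood relation holds between them. The text does not say which relation GWiM uses; the sketch below assumes the simplest one, "lies within a given distance", and uses made-up object names and coordinates. The function `neighbourhood_graph` is hypothetical.

```python
from itertools import combinations
from math import dist
from typing import Dict, Set, Tuple

Point = Tuple[float, float]


def neighbourhood_graph(
    objects: Dict[str, Point], max_distance: float
) -> Dict[str, Set[str]]:
    """Link two objects when their Euclidean distance is at most max_distance."""
    graph: Dict[str, Set[str]] = {name: set() for name in objects}
    for a, b in combinations(objects, 2):
        if dist(objects[a], objects[b]) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Made-up coordinates for three objects.
objects = {"school": (0.0, 0.0), "river": (3.0, 4.0), "forest": (10.0, 0.0)}
graph = neighbourhood_graph(objects, max_distance=5.0)
```

Other relations, such as topological or directional ones, could be plugged in by replacing the distance test; the resulting edges can then serve as background facts for a learner.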

Previous ILP research
The ILP system WiM [7, 8, 15] was designed and implemented at FI MU,
Brno, and CTU, Prague, over the last few years with support from ESPRIT ILP. WiM
extends Markus by shifting bias, generating negative examples and employing
oracles. An important feature of WiM is its ability to learn a logic program
from a small set of examples. If necessary, it poses a query to the user. WiM
uses a specific strategy for choosing this query, the aim of which is to
decrease the number of negative examples as much as possible. Within this
project, several versions of WiM dedicated to specialised
applications were developed, e.g. for object-oriented analysis and design [10] and knowledge
discovery in geographic data [9].

Members

Courses
Courses relevant to ILP taught by members of the group
- ILP - one-semester course (3 hours per week)
- KDD - one-semester course (3 hours per week)
- KDD project (one-semester project)

Participation in conferences

Participation in projects
- ESPRIT METAL - combination of statistical methods with machine
learning, multistrategy learning
- Natural Language Processing Laboratory (with applications supporting
education of people with limited sight) (Ministry of Education, CZ) -
automatic recognition of noun phrases [16], synthesis of verb rules
[17], syntax analysis by means of machine learning [4], automatic
tagging of compound verb groups [18];
- ILP - WiM system [7, 15], applications of WiM in software engineering
[10] and KDD [9];
- Expressivity of ophthalmology diseases in descendent populations of a
rural region (IGA MZ CR 4377-3 Ministry of Health, CZ) - collaboration
with Health of Child Research Institute in Brno.

Contact address
Lubos Popelínský,
popel@informatics.muni.cz
Faculty of Informatics, Masaryk University
Botanická 68a
CZ - 602 00 Brno
Czech Republic

References
- Kuba P.: Knowledge discovery in spatial data. Master thesis FI MU
Brno, 2000 (in Czech).
- Kuba P., Popelínský L.: Automatic classification of spatial data. 7th
Conference on GIS GIS...2000, Ostrava 2000 (in Czech).
- Nepil M.: Automatic construction of natural language grammar. Master
Thesis, FI MU 2000 (in Czech).
- Nepil M.: Learning Parse Actions from Annotated Sentences (submitted
to TSD'00)
- Pala K., Rychlý P., Smrz P.: DESAM - annotated corpus for Czech.
In Plásil F., Jeffery K.G. (eds.): Proceedings of SOFSEM'97, Milovy,
Czech Republic. LNCS 1338, Springer-Verlag 1997. (modified version of
this paper is available as
technical report FI MU)
- Pavelek T., Popelínský L.: Towards lemma disambiguation: Similarity
classes. In Proc. of Summer School on Information Systems, Ruprechtov
1999 (in Czech)
- Flener P., Popelínský L., Stepánková O.: ILP and Automatic Programming:
Towards three approaches. Proc. of 4th Workshop on Inductive Logic
Programming (ILP'94), Bad Honnef, Germany, 1994.
- Popelínský L.: Towards Program Synthesis From A Small Example Set.
Proceedings of 21st Czech-Slovak conference on Computer Science
SOFSEM'94, pp. 91-96, Czech Society for Comp. Sci. Brno 1993. (See also
Proceedings of 10th WLP'94, Zuerich 1994, Switzerland.)
- Popelínský L.: Knowledge Discovery in Spatial Data by Means of ILP.
In: Zytkow J.M., Quafafou M.(Eds.): Principles of Data Mining and
Knowledge Discovery. Proc. of 2nd European Symposium PKDD'98, Nantes
France 1998. LNCS 1510, Springer-Verlag 1998.
- Popelínský L.: Inductive inference to support object-oriented analysis
and design. In: Proc. of 3rd Conf on Knowledge-Based Software
Engineering, Smolenice 1998, IOS Press.
- Popelínský L.: Approaches to Spatial Data Mining. In Proceedings of
GIS... Ostrava'99 Conference, ISSN 1211-4855, 1999.
- Popelínský L., Pavelek T., Ptácník T.: Towards disambiguation in Czech
corpora. In Proc. of LLL Workshop Bled, 1999
- Popelínský L., Pavelek T.: Mining lemma disambiguation rules from
Czech corpora. In Rauch J., Zytkow J.M. (Eds.): Principles and Practice
of Knowledge Discovery in Databases. Proc. of 3rd European Conference
PKDD'99, Prague Czech Republic 1999. LNCS 1704, Springer-Verlag 1999.
- Popelínský L.: Towards practical inductive logic programming. PhD
thesis FEL CTU Prague 2000.
- Smrz P., Zácková E.: New Tools for Disambiguation of Czech Texts. In
Sojka P., Matousek V., Pala K., Kopecek I.: Text, Speech, Dialogue.
Proceedings of the 1st Workshop on Text, Speech, Dialogue - TSD'98,
Brno, Czech Republic, Sept. 1998.
- Zácková E., Pala K.: Corpus-Based Rules for Czech Verb Discontinuous
Constituents. Proceedings of TSD'99, Springer Verlag 1999, LNAI 1692,
pp. 325-328. (extended and modified version of this paper is available
as technical report FI MU)
- Zácková E., Popelínský L., Nepil M.: Automatic
Tagging of Compound Verb Groups in Czech Corpora. In Proceedings of
TSD 2000, LNAI 1902, Springer Verlag 2000, pp. 115-120.
- Zácková E., Popelínský L., Nepil M.: Recognition and
Tagging of Compound Verb Groups in Czech. In Proceedings of
CoNLL and LLL 2000, Lisbon, Portugal, Sept. 2000.
- Nepil M., Popelínský L., Zácková E.:
Part-of-Speech Tagging by Means of Shallow Parsing, ILP and Active Learning.
In Proceedings of 3rd Workshop on Learning Language in Logic (LLL), Strasbourg, 2001.