Accepted Demonstrations
Demonstrations accepted for TSD 2002, with abstracts
Paper ID: 134
Type: Demonstration
Title: A Large-Scale Corpus of Polish and Tools for its Annotation
Contact author: Agnieszka Mykowiecka
Topic: Text - text corpora
Abstract: The aim of this demonstration is to introduce a project financed by the State Committee for Scientific Research (a Polish government body; grant number 7 T11C 043 20) aiming at the construction of a large corpus of written Polish for NLP applications. We briefly present the following characteristics of the corpus:
- aims (NLP applications, but also with lexical, theoretical-linguistic, language-teaching and sociolinguistic applications in mind);
- intended size and make-up of the corpus;
- the original system of morphosyntactic annotation;
- the system of structural and metadata annotation;
- the XML (XCES) standards adopted;
- original tools for the linguistic annotation of the corpus:
  - a morphological analyser;
  - a statistical tagger;
- intended ways of making the corpus publicly available.
We also demonstrate similarities and differences between this and similar corpus initiatives for other languages, and justify the current project in terms of the lack of publicly available and/or linguistically annotated corpora for Polish.
Various aspects of this project will be presented in more detail in related demonstrations (depending on the TSD organisers' decision to accept them).
Related link:
Paper ID: 135
Type: Demonstration
Title: Isolated Word Recognition and Visualization Software
Contact author: Antanas Lipeika
Topic: Speech - automatic speech recognition
Abstract:
Isolated word recognition and visualization software, "Recognition", has been developed. Isolated word recognition is based on dynamic time warping. Three types of local continuity constraints are available: III, V and VIII (Itakura). Relaxed endpoint constraints are also possible. An energy-based algorithm is used for endpoint detection. LPC features are used for pattern comparison, and a symmetric likelihood-ratio distance is used for distance calculation.
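As a rough illustration of the matching step, the sketch below implements plain dynamic time warping with the basic symmetric local constraint over generic feature frames. The function names, the squared-difference frame distance and the simple constraint are illustrative stand-ins, since the demonstrated system uses Itakura-type constraints, LPC features and a symmetric likelihood-ratio distance.

import numpy as np

def dtw_distance(ref, test, dist=lambda a, b: np.sum((a - b) ** 2)):
    """Align two feature sequences (frames x coefficients) with dynamic
    time warping and return the accumulated distance, normalized by the
    combined length.  Uses the basic symmetric local constraint; the
    demonstrated system offers Itakura-style constraints instead."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            # local continuity: diagonal, vertical, horizontal predecessors
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)

def recognize(test, templates):
    """Pick the reference word whose template warps to `test` most cheaply.
    `templates` maps a word to its reference feature sequence."""
    return min(templates, key=lambda w: dtw_distance(templates[w], test))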
Visualization of every step of the speech recognition process (endpoint detection, speech pattern matching, decision making) is possible. Every speaker can create his own dictionary (a collection of reference words) and use it for recognition. An on-line recognition mode is also available: one can press the spacebar, say a word, and the recognition result will be displayed on the screen.
Another program, "Palyginimas", has been developed for recognition performance evaluation using speech databases. One can open lists of test and reference speech files and perform recognition experiments. Recognition results are displayed on the screen and finally written to a text file. From the on-screen results it is possible to bring up the decision-making details for a particular decision.
A number of recognition experiments were performed using this software. This research was supported by the Lithuanian Language in the Information Society 2000-2006 program.
Related link:
Paper ID: 136
Type: Demonstration
Title: Computer-based Translation from English to Lithuanian
Contact author: Bronius Tamulynas
Topic: Text - machine translation
Abstract: A conceptual hierarchical model for a computer-based translation (CBT) system from English into Lithuanian is proposed. It is based on a hierarchical blackboard architecture and includes a virtual dictionary and several knowledge sources. It is shown that such a model, with a special set of knowledge sources with grammatical components, may reduce the complexity of the translation problem and improve translation quality. Following the CBT model paradigm, a system for specialized translation from English into Lithuanian has been created. It includes a user interface, a virtual dictionary, text parsing, a translation engine and several knowledge-source modules. A direct translation strategy with some transfer elements for syntactic sentence groups is used, which allows better translation quality for more complicated sources.
Related link:
Paper ID: 137
Type: Demonstration
Title: IBM Dictionary and Linguistic Tools system “Frost”
Contact author: Alexander Troussov
Topic: Text - other
Abstract: The IBM Dictionary and Linguistic Tools system, codenamed “Frost”, will eventually support over 30 languages, including some Western European languages, thus consolidating the results of more than 20 years of development of lexical data and morphological analysers. The product is under development by IBMers from several countries; cooperation with academic communities is used for data development and for providing linguistic expertise.
The Frost architecture provides a modular, cross-linguistic, cross-platform and high-performance (several gigabytes per hour) basis for industrial applications in Information Retrieval and Extraction, providing shallow parsing, part-of-speech tagging, morphological analysis and synonym support.
To increase performance and shorten the development cycle, specific linguistic phenomena are generalized and classified according to the computational models most suitable for their processing. For example, clitic processing in Romance languages, decomposition of solid compounds in Germanic languages and Chinese word segmentation are all treated in Frost with one formal computational tool. This tool is based on a special implementation of non-deterministic finite-state processing in which the back-tracking logic is extracted from the finite-state machine into a separate module. The separated programming logic gives flexibility, while finite-state processing ensures high-speed string matching. Finite-state processing in this scheme is reduced to finding the hierarchy of prefixes in a deterministic finite-state dictionary, which contains word-formation elements provided with morphological, morphotactic and statistical information.
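A minimal sketch of this division of labour, under the assumption that the deterministic dictionary can be modelled as a trie of word-formation elements: the finite-state part only enumerates the hierarchy of prefixes, while a separate backtracking routine composes them into full analyses (e.g. compound decompositions). All names and structures here are invented for illustration; they are not Frost internals.

def build_trie(lexicon):
    """Deterministic finite-state dictionary modelled as a nested-dict
    trie.  `lexicon` maps word-formation elements to their annotations."""
    root = {}
    for element, info in lexicon.items():
        node = root
        for ch in element:
            node = node.setdefault(ch, {})
        node[None] = info  # end-of-element marker carrying the annotation
    return root

def prefixes(trie, s, start):
    """Enumerate the hierarchy of lexicon prefixes of s[start:]; this part
    is pure deterministic finite-state matching."""
    node, out = trie, []
    for i in range(start, len(s)):
        node = node.get(s[i])
        if node is None:
            break
        if None in node:
            out.append((i + 1, node[None]))
    return out

def segment(trie, s, start=0):
    """Separate backtracking module: compose matched elements into full
    segmentations (e.g. decomposition of a solid compound)."""
    if start == len(s):
        yield []
        return
    for end, info in prefixes(trie, s, start):
        for rest in segment(trie, s, end):
            yield [(s[start:end], info)] + rest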
Morphological analysis in Frost is based on finite-state automata and transducers. Although finite-state devices have been present since the emergence of computer science and are extensively used in natural language processing (including speech processing), the focus has been on mathematical and algorithmic approaches to their “topology”, leaving a gap between industrial and academic research. The IBM Dictionary and Linguistic Tools team has developed new approaches to the analysis of finite-state device performance which provide a several-fold improvement in run time.
Frost exploits a variable node format, which allows the use of binary search, hash tables and other programming techniques in addition to the previously widespread linear search and trie structures. A format is assigned to a node according to graph-theoretic analysis and statistics on the usage of that particular node in corpus processing. In addition to the performance advantages, variable node formats open the way to the efficient application of finite-state processing to non-alphabetical languages.
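The variable-node-format idea can be sketched as follows; the fan-out thresholds and class layout are assumptions chosen purely for illustration.

import bisect

class Node:
    """Sketch of a variable-format node: low fan-out nodes use linear
    search, medium ones binary search over a sorted array, large ones a
    hash table.  The thresholds below are arbitrary illustrations."""
    def __init__(self, edges):                 # edges: label -> child node
        self.labels = sorted(edges)
        self.edges = edges
        if len(edges) <= 4:
            self.find = self._linear
        elif len(edges) <= 64:
            self.find = self._binary
        else:
            self.find = edges.get              # hash-table lookup

    def _linear(self, label):
        for l in self.labels:
            if l == label:
                return self.edges[l]
        return None

    def _binary(self, label):
        i = bisect.bisect_left(self.labels, label)
        if i < len(self.labels) and self.labels[i] == label:
            return self.edges[label]
        return None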
Another aspect of the Frost finite-state tools is that their implementation takes into account the architecture of modern computers; specifically, we use a cache- and prefetching-friendly memory representation.
Finite-state processing typically has simple access code, so it is the speed of memory access that can be crucial for performance. Modern processors and computers include a hierarchy of data storage devices to cache frequently used data, and operating systems provide prefetching. Finite-state processing is a highly irregular type of computation, however, so progress in standard hardware and software caching can hardly be expected to eliminate the need to make finite-state processing cache-friendly.
Related link:
Paper ID: 138
Type: Demonstration
Title: XML architecture for a modern corpus
Contact author: Piotr Banski
Topic: Text - text corpora
Abstract: The presentation concerns the XML architecture assumed for the IPI PAN Corpus of written Polish, being created at the Institute of Computer Science of the Polish Academy of Sciences. The design of this corpus implements the principles laid out by the XCES Guidelines (see e.g. Ide et al. 2000), featuring in particular so-called stand-off annotation, whereby each text is split into several components residing in separate XML files that may instantiate various DTDs or XML Schemas. More specifically, each text in the IPI PAN Corpus will be composed of three layers, as sketched below.
main file <-
sentence segmentation file <-
morphosyntactic annotation file
The main file contains the actual text with gross structural markup down to the level of the paragraph, and with the addition of tags signalling quotations and various aspects of text highlighting (italics, small caps, etc.). The next document establishes the sentence boundaries and may act as a base for another layer of annotation, or for a document that aligns sentences from different versions of the base text (for the purpose of creating a parallel corpus and thus reusing the text resources in the future). It also resolves the tags indicating highlighted text in the main file, classifying the information conveyed by e.g. italics as 'foreign text', 'proper name', 'emphasis', etc. The third layer of annotation contains the (often multiple) outputs of the morphological analyser (morphosyntactic information, lemmas) and the output of the disambiguator.
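To make the stand-off linking concrete, here is a hedged miniature in Python: three toy "layers" whose elements point at one another by id, resolved outward-in. The element names, attributes and reference scheme are invented for illustration and do not reproduce the actual XCES DTDs.

import xml.etree.ElementTree as ET

# Hypothetical miniature of the three stand-off layers.
MAIN = '<body><p id="p1">Ala ma kota.</p></body>'
SEGM = '<segm><s id="s1" target="p1" from="0" to="12"/></segm>'
MORPH = '<morph><tok target="s1" orth="Ala" lemma="Ala" tag="subst"/></morph>'

def resolve(main, segm, morph):
    """Follow references inward: tokens point at sentences, sentences at
    paragraphs in the main file, so each layer stays a separate document."""
    paras = {p.get('id'): p.text for p in ET.fromstring(main)}
    sents = {}
    for s in ET.fromstring(segm):
        text = paras[s.get('target')]
        sents[s.get('id')] = text[int(s.get('from')):int(s.get('to'))]
    for t in ET.fromstring(morph):
        print(t.get('orth'), t.get('lemma'), t.get('tag'),
              'in sentence:', sents[t.get('target')])

resolve(MAIN, SEGM, MORPH)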
This setup raises interesting questions, relating e.g. to the cost-effectiveness of corpus creation, or to structuring the data model that underlies the various possible types of annotation. Thus, the multiple-layer architecture of the corpus does not only reflect the logical divisions among the kinds of data visualised by the relevant annotation files -- it is also geared towards the reusability and extensibility of corpus annotation. Additionally, as the creation of the outer levels of annotation proceeds, the need for human intervention increases, which means that at early stages of the project the corpus will consist of the innermost layer of annotation only ('main files'), as these can sometimes be created fully automatically. The additional layers will be provided later, at an obvious cost.
The presentation will be illustrated with fragments of a mini-corpus of Polish designed to incorporate the major architectural features of the IPI PAN Corpus.
Related link:
Paper ID: 139
Type: Demonstration
Title: The MITRE Audio Hot Spotting Prototype - Using Multiple Speech and Natural Language Processing Technologies
Contact author: Qian Hu
Topic: Speech - other
Abstract: The MITRE Audio Hot Spotting Prototype - Using Multiple Speech and Natural Language Processing Technologies
Qian Hu, Stanley Boykin, Fred Goodman, Warren Greiff, Margot Peet
The MITRE Corporation
Audio contains more information than is conveyed by the text transcript produced by an automatic speech recognizer. Information such as a) who is speaking, b) the vocal effort used by each speaker, and c) the presence of certain non-speech background sounds is lost in a simple speech transcript. In addition, due to the variability of noise conditions, speaker variance, and the limitations of automatic speech recognizers, speech transcripts can be full of errors. Deletion errors can prevent users from finding what they are looking for in audio or video data, while insertion and substitution errors can be misleading and/or confusing. Audio Hot Spotting technology permits a user to automatically locate regions of interest in an audio/video file that meet his/her specified criteria. In the query, users may search for keywords or phrases, speakers, both keywords and speakers, non-verbal speech characteristics, or non-speech signals of interest. In order to provide more and better information from multimedia data, we have incorporated multiple speech technologies and natural language processing techniques in the MITRE Audio Hot Spotting prototype currently under development.
We focused on finding words that are information-rich and machine-recognizable (i.e. content words). The MITRE Audio Hot Spotting prototype examines the speech recognizer output and creates an index list of content words: short, weakly stressed words, for example, are much more likely to be mis-recognized. To eliminate words that are information-poor and prone to mis-recognition, our index-generation algorithm takes the following factors into consideration: a) absolute word length, b) the number of syllables, c) the recognizer's own confidence score, d) the part of speech (i.e. verb, noun), using a POS tagger with some heuristic rules, and e) the word's frequency of occurrence. Experiments we have conducted indicate that the index list produced typically covers less than 10% of the total words spoken, while more than 90% of the indexed words are actually spoken and correctly recognized.
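A schematic sketch of such an index-generation filter is given below; the thresholds, the crude syllable counter and the tagger interface are illustrative placeholders, not the prototype's actual values or components.

def count_syllables(word):
    """Crude vowel-group count standing in for a real syllabifier."""
    groups, prev = 0, False
    for ch in word.lower():
        v = ch in 'aeiouy'
        groups += v and not prev
        prev = v
    return groups

def index_words(tokens, tagger, freq, min_len=4, min_syll=2,
                min_conf=0.9, max_freq=1e-4):
    """Keep only information-rich, reliably recognized words, using the
    five factors listed above; all thresholds are placeholders."""
    content_pos = {'NOUN', 'VERB', 'ADJ'}
    index = set()
    for word, conf in tokens:           # recognizer output: (word, confidence)
        if (len(word) >= min_len                       # a) absolute word length
                and count_syllables(word) >= min_syll  # b) number of syllables
                and conf >= min_conf                   # c) recognizer confidence
                and tagger(word) in content_pos        # d) part of speech
                and freq.get(word, 0.0) <= max_freq):  # e) corpus frequency
            index.add(word)
    return index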
The prototype allows the user to query the system by keywords or phrases, either by selecting them from the index list or by manual entry. If matches are found, the system displays the recognized text and allows the user to play the audio or video in the vicinity of the match. In addition, the user can query and retrieve segments spoken by a particular speaker. We achieved this capability by integrating and extending a research speaker-identification algorithm. Based on the speaker identification results, the system automatically computes the number of times and the total duration the speaker spoke. We combined large-vocabulary, speaker-independent, continuous-speech recognition and speaker identification to refine lexical queries by a particular speaker. For example, the user can ask for instances of the word "terrorism" spoken only by the President. More recently, we have experimented with algorithms that detect information-bearing background sounds, such as applause and laughter, which can be queried and retrieved by users.
Related link:
Paper ID: 141
Type: Demonstration
Title: AlfaNum System for Continuous Speech Recognition
Contact author: Pekar Darko
Topic: Speech - automatic speech recognition
Abstract: This demo gives a brief presentation of a program package for continuous speech recognition which is, so far, successful with small and medium dictionaries. The package is very large because it contains modules for both training and recognition. Each of these modules consists of several submodules and a variety of classes and functions. It includes two libraries developed over the last two years by the same authors: the slib library for digital signal processing and the general-purpose an_misc library, both available at www.alfanum.com. The program package is the product of several years of work on automatic speech recognition, starting from isolated-word recognition, through connected-word recognition, to continuous speech recognition using phonemes in context, on which this system is based. Since the system is based on phoneme-in-context recognition, it supports recognition of any set of words (grammar). Changing the grammar requires no additional training or speech database recording, only the building of a new trellis, which takes no more than a few seconds. The entire program is written in the C++ programming language and is fully developed by the authors, which means that it does not rely on any third-party specialized library. The software is mostly independent of the platform and the operating system (except for the part which requires communication with hardware).
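The grammar-to-trellis step can be sketched roughly as follows; the data structures and the function name are assumptions made for illustration and do not reflect the AlfaNum implementation.

def build_trellis(grammar, pron_dict):
    """Expand a word-level grammar (allowed word-to-word transitions) into
    a phoneme-level search graph over already-trained phoneme models; no
    retraining is needed when the grammar changes."""
    arcs = []                 # (source_state, target_state, phoneme_model)
    word_entry, word_exit = {}, {}
    state = 0
    for word in grammar:      # one left-to-right phoneme chain per word
        word_entry[word] = state
        for ph in pron_dict[word]:
            arcs.append((state, state + 1, ph))
            state += 1
        word_exit[word] = state
        state += 1
    for src, dests in grammar.items():   # null arcs for allowed transitions
        for dst in dests:
            arcs.append((word_exit[src], word_entry[dst], None))
    return arcs

# e.g. grammar = {'call': ['john', 'mary'], 'john': [], 'mary': []},
# with pron_dict giving each word's phoneme sequence.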
For the purpose of demonstrating the ASR engine, a small GUI application was made. This application simply prints recognized words in a text box. These words and their transitions must, of course, be defined in a grammar file given on the command line. Besides the grammar file, it needs a file containing the HMM models and a pronunciation dictionary. If all the files are provided, the application initializes its structures in a few seconds and stands by for recognition. Recognition can be started and stopped by clicking the appropriate buttons.
The system is trained to be speaker-independent, so anyone can address it with similar accuracy. Although text is printed out with some delay, the engine is fast and can process more than ten recognitions in parallel, depending on grammar complexity.
Related link: www.alfanum.com
Paper ID: 142
Type: Demonstration
Title: Romanian word inflection
Contact author: Elena Boian
Topic: Text - automatic morphology
Abstract: Romanian word inflection
S. Cojocaru, E. Boian
Being a highly inflectional language, Romanian makes the problem of word-form generation genuinely difficult.
The main part of Romanian inflecting words has been classified according to the way their inflected forms are created. This classification proved useful and led to the idea of introducing a special scattered-context grammar formalizing word-form production [1]. Using these grammar rules, we can formalize the inflection process. This method (let us call it static) is based on knowledge of the morphological group of the given word.
Nevertheless, it is necessary to be able to obtain a new set of word forms for a given item without this knowledge; the group number must be detected dynamically. First of all, the word forms themselves should be produced, and a special program (the "dynamic method") facilitates this tedious work.
The dynamic method starts from the base word and its morphological category (part of speech, gender for nouns, etc.) and makes it possible to determine criteria for classifying Romanian words into three inflection groups: automatic, partially automatic and irregular [2]. To inflect a word it is necessary to know the vowel and consonant alternations, the contexts in which the alternation rules apply, and the affix series. The tables of affix series, the set of alternations and their admissible combinations form the basis of the inflection programs. We consider these processes for each part of speech.
Knowing all the word forms, we can determine the inflection group number, and thus reduce the presentation of all word forms to the static method.
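A toy sketch of the affix-series-plus-alternations idea; the entries are invented ASCII-only approximations (real Romanian needs diacritics), not the authors' actual tables.

def inflect(base, stem_alternations, affix_series):
    """Generate word forms from a base: apply each admissible stem
    alternation in its context, then attach the affix series."""
    forms = {}
    for slot, affix in affix_series.items():
        stem = base
        for old, new, context in stem_alternations:
            if slot in context and stem.endswith(old):
                stem = stem[: len(stem) - len(old)] + new
        forms[slot] = stem + affix
    return forms

# Toy example in the style of 'fata' ~ 'fete' (a/e alternation before -e):
print(inflect('fat',
              stem_alternations=[('at', 'et', {'pl'})],
              affix_series={'sg': 'a', 'pl': 'e'}))
# -> {'sg': 'fata', 'pl': 'fete'}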
Bibliography
[1] E. Boian, S. Cojocaru, L. Malahova. Instruments pour applications linguistiques. La terminologie en Roumanie et en Republique de Moldova, Hors serie, No. 4, 2000, pp. 42-44.
[2] E. Boian, S. Cojocaru. The inflexion regularities for the Romanian language. Computer Science Journal of Moldova, Vol. 4, No. 1, 1996, Chisinau.
Related link:
Paper ID: 143
Type: Demonstration
Title: GoDiS - Issue-based dialogue management in a multi-domain, multi-language dialogue system
Contact author: Elena Karagjosova
Topic: Dialogue - dialogue systems
Abstract: GoDiS (Gothenburg Dialogue System) is an experimental dialogue system utilizing dialogue management based on Ginzburg's concept of Questions Under Discussion (QUD). GoDiS is implemented using the TrindiKit, a toolkit for implementing dialogue move engines and dialogue systems based on the Information State approach. While originally built for fairly simple information-exchange dialogue, it is being extended to handle action-oriented and negotiative dialogue.
One of the goals of the information state approach is to encourage modularity, reusability and plug-and-play; to demonstrate this, GoDiS has been adapted to several different dialogue types, domains, and languages. It has also been enhanced with speech input and output, and is able to switch languages in an ongoing dialogue. The current range of applications includes the following (a schematic information-state sketch follows the list):
-- information-exchange dialogue in a travel agency: English (with speech input/output), Swedish (with speech input/output), German (with text input/output; speech output under development), Spanish (with text input/output; speech output under development)
-- action-oriented dialogue for a mobile phone interface: English (with speech input/output), Swedish (with speech input/output)
-- action-oriented dialogue for a VCR interface: English (with speech input/output), Swedish (with speech input/output)
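The toy sketch below illustrates the QUD-based information-state update idea in miniature; it is a schematic illustration only, not TrindiKit or GoDiS code, and the move types and field names are invented.

from dataclasses import dataclass, field

@dataclass
class InfoState:
    """Toy information state: a QUD stack plus shared commitments."""
    qud: list = field(default_factory=list)      # questions under discussion
    commitments: dict = field(default_factory=dict)

def update(state, move):
    """Integrate a dialogue move: 'ask' pushes a question onto QUD,
    'answer' resolves the topmost question."""
    kind, content = move
    if kind == 'ask':
        state.qud.append(content)
    elif kind == 'answer' and state.qud:
        question = state.qud.pop()
        state.commitments[question] = content
    return state

s = InfoState()
update(s, ('ask', 'destination?'))
update(s, ('answer', 'Paris'))
print(s.commitments)   # {'destination?': 'Paris'}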
Related link: http://www.ling.gu.se/~stinae/TSD2002.html
Paper ID: 144
Type: Demonstration
Title: Hyphenation algorithm for Romanian language words
Contact author: Demidova Valentina
Topic: Text - parsing and part-of-speech tagging
Abstract: Hyphenation algorithm for Romanian language words
V. Demidova, T. Verlan
The problem of correctly hyphenating Romanian words is pressing, given the lack of such an algorithm for Romanian in many widely used automated text-processing systems. Our algorithm is based on the classical rules for dividing words into syllables, which rest on the phonetic value of letters. The classical rules operate on sequences of vowels (simple and complex, i.e. semivowels) and consonants (also simple and complex) located between two vowels. The algorithm also handles exceptions to these rules, such as the following consonant combinations: "bl", "br", "cl", "cr", "fl", "fr", "hl", "pl", "pr", "tl", "tr", "vl", "vr", "lpt", "mpt", "nct", "ndv", "rtf", "stm", "ngstr", etc.
The rules described above were taken into consideration as far as possible. However, the specific character of Romanian does not permit their complete formalization. Vowels present the main difficulty for syllable division in Romanian: they can be simple or complex (so-called semivowels), stressed or unstressed, and the division rules depend on the category to which a given vowel belongs. Further ambiguity arises from the way different vowel combinations are perceived by ear. When a word is entered, all we can determine about it is its sequence of vowels and consonants; the phonetic information is not accessible to us. Therefore we cannot implement the rules above in their full completeness, though many situations can still be resolved, if in a somewhat artificial way.
The problem of diphthongs and triphthongs is the most difficult one in dividing Romanian words into syllables. Prefixes also create a difficult situation: when one of the combinations "an" or "in" occurs at the beginning of a word and is followed by a vowel, an ambiguity appears. To avoid this ambiguity, and primarily in view of hyphenation from line to line, where a single letter must not be left alone on a line, we have decided to reject the first hyphen.
Thus we obtain a syllabification algorithm for a rather extensive class of Romanian words. We certainly do not claim completeness, but testing showed that 70% of the words in scientific and literary texts are divided correctly, which is substantial. The algorithm can be developed further, but including additions requires further analysis of the database, routine work which nevertheless takes a lot of time. The main and most frequent letter combinations, however, are processed correctly by our algorithm, so it is effective enough.
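For illustration, here is a drastically simplified sketch of the classical V-CV / VC-CV rules with inseparable clusters, over a toy ASCII vowel set; real Romanian requires diacritics, semivowel handling and the further exceptions discussed above.

INSEPARABLE = {'bl', 'br', 'cl', 'cr', 'fl', 'fr', 'hl', 'pl', 'pr',
               'tl', 'tr', 'vl', 'vr'}
VOWELS = set('aeiou')   # toy vowel set, no diacritics or semivowels

def hyphenate(word):
    """Split between each pair of vowels: V-V for hiatus, V-CV before a
    single consonant, V-CCV before an inseparable cluster, otherwise
    VC-CV after the first consonant of the cluster."""
    syllables, start = [], 0
    vowels = [i for i, ch in enumerate(word) if ch in VOWELS]
    for a, b in zip(vowels, vowels[1:]):
        cluster = word[a + 1:b]          # consonants between the two vowels
        if len(cluster) <= 1:
            cut = a + 1                  # hiatus or V-CV
        elif cluster[-2:] in INSEPARABLE:
            cut = b - 2                  # keep the inseparable pair together
        else:
            cut = a + 2                  # VC-CV (and VC-CCV for longer runs)
        syllables.append(word[start:cut])
        start = cut
    syllables.append(word[start:])
    return '-'.join(syllables)

print(hyphenate('codru'))   # co-dru  ('dr' is inseparable)
print(hyphenate('munte'))   # mun-te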
Related link:
Paper ID: 145
Type: Demonstration
Title: Semantic Knowledge in an Information Retrieval System
Contact author: Styve Jaumotte
Topic: Text - information retrieval
Abstract: We present in this demonstration a system under development aimed at extracting relevant information as answers to questions from a huge corpus of texts. This prototype proceeds in two stages:
- first, it selects, like a classic textual Information Retrieval System (IRS), a set of documents likely to contain relevant knowledge;
- second, an extraction process determines which parts of the text from this set of documents answer the user's query.
The IRS is based on a Boolean model extended with a "Distance" operator to allow search within a restricted passage of N terms. The system can work either with the information need expressed in natural language or with a Boolean query. If the question is written in natural language, the category of the question is first determined, then a reformulating module transforms it into a Boolean expression. This rewriting depends on the conceptual category of the question, which comes from a syntactic analysis by the Link Grammar Parser.
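A minimal sketch of such a Distance operator over a positional inverted index follows; the index layout is an assumption made for illustration, not the system's internals.

def near(index, t1, t2, n):
    """Distance operator sketch: return documents in which t1 and t2 occur
    within a window of n terms.  `index` maps term -> {doc_id: [positions]}."""
    hits = set()
    docs = index.get(t1, {}).keys() & index.get(t2, {}).keys()
    for doc in docs:
        p1, p2 = index[t1][doc], index[t2][doc]
        if any(abs(a - b) <= n for a in p1 for b in p2):
            hits.add(doc)
    return hits

index = {'speech': {1: [3, 40], 2: [7]}, 'corpus': {1: [5], 2: [90]}}
print(near(index, 'speech', 'corpus', 10))   # {1}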
Our tool, which sits between an IRS and a traditional Question Answering system, stresses semantic rewriting of the initial query using WordNet as a thesaurus. We distinguish three kinds of semantic knowledge in the thesaurus, each leading to a specific method of semantic rewriting: definitional knowledge (dictionary-like definitions of concepts), relational knowledge (systemic characterization of concepts through lexical relations between concepts), and document-like characterization of concepts (where thesaurus term descriptions are treated as documents amenable to IRS techniques).
The last step is the extraction process, working on the concordance of terms. This approach seems limited to a few categories (date/time queries) and gives fair results. The next step of development is to introduce a syntactic filter to improve the quality of answers.
To ensure good performance in spite of the large amount of data processed, we have developed the core of the system in ANSI C, which also provides platform independence of the source code. The graphical user interface has been programmed in Tcl/Tk. The system is tested on the TREC (Text REtrieval Conference) data, which is made up of about 3 GB of articles coming from different newspapers and more than 1500 questions.
Related link: http://www.info.univ-angers.fr/pub/jaumotte/qa_irs
Paper ID: 146
Type: Demonstration
Title: Keyword spotting system
Contact author: Petr Schwarz
Topic: Speech - automatic speech recognition
Abstract: A keyword spotting system, developed at the Brno University of Technology, Faculty of Information Technology, will be presented. This system is designed mainly for detecting keywords in long speech records. Keywords are modeled with triphone Hidden Markov Models. The models were trained on the Czech SpeechDat-E speech database [1]. For recognition, a modified Viterbi algorithm [2] is used.
A new procedure to determine optimal thresholds for keyword acceptance, based on a statistical analysis of false acceptances and false alarms on a large database, was developed.
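Schematically, per-keyword thresholds of this kind can be chosen by minimizing total error on labelled development data, as in the following illustrative sketch (invented function names, not the actual statistical procedure):

def optimal_threshold(true_scores, false_scores, grid):
    """Pick the score threshold minimizing total error (missed detections
    plus false alarms) for one keyword on a labelled development set."""
    def total_error(t):
        misses = sum(s < t for s in true_scores)
        false_alarms = sum(s >= t for s in false_scores)
        return misses + false_alarms
    return min(grid, key=total_error)

def spot(candidates, thresholds):
    """Accept a detected keyword only if its score clears its threshold.
    `candidates` holds (keyword, score, start_time, end_time) tuples."""
    return [(kw, t0, t1) for kw, score, t0, t1 in candidates
            if score >= thresholds[kw]]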
All components of the system are in-house, but the models are compatible with the widely used HTK toolkit. The system runs on MS Windows. A public evaluation version for Czech will be available from http://www.fit.vutbr.cz/research/groups/speech/sw.html
[1] H. Heuvel, J. Boudy, Z. Bakcsi, J. Cernocky, V. Galunov, J. Kochanina, W. Majewski, P. Pollak, M. Rusko, J. Sadowski, P. Staroniewicz, H. Tropf: SpeechDat-East: Five multilingual speech databases for voice-operated teleservices completed. In: Proc. EUROSPEECH 2001, Aalborg, 2001.
[2] J. Junkawitsch, L. Neubauer, H. Hoege and G. Ruske: A new keyword-spotting algorithm with pre-calculated optimal thresholds. In: Proc. Intl. Conference on Spoken Language Processing (ICSLP), 1996.
Related link: http://www.fit.vutbr.cz/research/groups/speech
Paper ID: 201
Type: Demonstration
Title: Generator module for InBASE NL data base Interface system
Contact author: Michael Bodasov
Topic: Text - information retrieval
Abstract: We want to present InBASE, a system for understanding NL queries to databases, and especially the new generator module developed for it.
InBASE (a demo is available at http://www.inbase.artint.ru/nl/nllist-eng.asp) is a commercially oriented system offered to e-shops to enable NL access to their content, and also used in the project management and document management areas. InBASE can also be thought of as an educational and personal information system.
The generator block is used to rephrase user queries; it lets the user check the correctness of the InBASE analyzer's understanding. Problems can arise when the system misunderstands the user query, so that irrelevant information is delivered and the user never learns about it. The most general way to resolve these problems is to construct an NLG module generating an NL query from the internal query representation. The system generates a rephrased query, and the user can verify that the query was understood correctly. The rephrased query can also be shown together with the relevant fragment of the DB to explain how the structure of the data was derived. Moreover, the new generation facilities will be a step towards an interactive mode for the InBASE system.
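As a rough illustration of rephrasing from an internal representation, consider this toy template-based sketch; the representation and templates are invented for illustration and do not reflect InBASE's internals.

TEMPLATES = {
    'select': 'Show {attributes} of {entity} where {conditions}.',
}

def rephrase(query):
    """Render an internal query representation back into NL so the user
    can check how the analyzer understood the question."""
    conds = ' and '.join(f"{a} {op} {v}" for a, op, v in query['conditions'])
    return TEMPLATES[query['kind']].format(
        attributes=', '.join(query['attributes']),
        entity=query['entity'],
        conditions=conds or 'any')

q = {'kind': 'select', 'entity': 'notebooks',
     'attributes': ['price'], 'conditions': [('RAM', '>=', '256MB')]}
print(rephrase(q))   # Show price of notebooks where RAM >= 256MB.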
At present we have a working version of the module, which operates in close relationship with the InBASE system. The module is implemented independently of the domain area (the domain area is set and tuned in the InBASE system) for the Russian language. We intend to implement switching between Russian and English as generation languages by conference time.
Detailed information about the project is available in our full paper at this conference (ID 17): Michael V. Boldasov, User query understanding by the InBASE system as a source for a multilingual NLG module (first step).
Related link:
Paper ID: 202
Type: Demonstration
Title: Visualisation Techniques for Analysing Meaning
Contact author: Beate Dorow
Topic: Text - knowledge representation and reasoning
Abstract: Many ways of dealing with large collections of linguistic information involve the general principle of mapping words, larger terms and documents into some sort of abstract space. The inherent structure of these spaces is often difficult to grasp directly. Visualisation tools can help to uncover the relationships between meanings in these spaces, giving a clearer picture of the natural structure of linguistic information. We demonstrate a variety of tools for visualising word meanings in vector spaces and graph models, derived from co-occurrence information and local syntactic analysis.
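One common recipe behind such tools, sketched here under the assumption of a simple co-occurrence space: build word vectors from windowed co-occurrence counts and project them to two dimensions with a rank-2 SVD for plotting. The function names and parameters are illustrative, not the demonstrated tools' actual code.

import numpy as np

def cooccurrence_vectors(sentences, window=2):
    """Build a word-by-word co-occurrence matrix from tokenized sentences."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    M[idx[w], idx[s[j]]] += 1
    return vocab, M

def project_2d(M):
    """Rank-2 SVD projection: a simple way to map the space onto a plot."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :2] * S[:2]

vocab, M = cooccurrence_vectors([['apples', 'and', 'oranges'],
                                 ['oranges', 'and', 'lemons']])
print(dict(zip(vocab, project_2d(M).round(2).tolist())))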
The tools presented in this demonstration are all available for public use on our website.
Related link: http://infomap.stanford.edu
Paper ID: 203
Type: Demonstration
Title: Deploying Web-based Question Answering System to Local Archive
Contact author: Zhiping Zheng
Topic: Text - information retrieval
Abstract: AnswerBus (http://www.answerbus.com/, [2,3]) is a Web-based open-domain Question-Answering (QA) system. It successfully uses NLP/IR techniques and reaches very high correct-answer rates. Although it is not designed for TREC, it still correctly answers over 70% of TREC-8 questions with Web resources. The question remains whether the techniques of a Web-based QA system can be deployed to a local archive.
In experiments with Xtramind (http://www.xtramind.com/) to answer this question, we chose part of the DUC conference corpus as the local archive. The selected corpus is in the general business domain, but we treat it as an open-domain archive. We tried to keep all the techniques used in AnswerBus, including the QA-specific dictionary, dynamic named-entity extraction, the answer-candidate ranking system, pseudo-answer detection, etc. But the sentence-question matching formula, proven effective in AnswerBus, is not suitable for a QA system on a local archive, so we developed a new algorithm to judge whether a sentence is a possible answer or not.
As the first step in deploying AnswerBus to a local archive, a local search engine is needed to do the search. We use the Seven Tones (http://www.seventones.com/, [4]) search engine but drop the features related to specific domains. The benefits of this search engine include: 1) indexing is very fast; 2) the index can be partially and logically modified; and 3) it scales to a large size.
To evaluate the new system, we refer to the milestones described in [1] and provided questions covering Arthur Graesser's 16 question categories, which range from easy to very difficult. The test result is very encouraging, and Top-1 accuracy is 72%.
References:
[1] John Burger et al. Issues, Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A). NIST, 2001.
[2] Zhiping Zheng. AnswerBus Question Answering System. Human Language Technology Conference (HLT 2002). San Diego, CA. March 24-27, 2002.
[3] Zhiping Zheng. Developing a Web-based Question Answering System. The Eleventh World Wide Web Conference (WWW 2002). Honolulu, HI. May 7-11, 2002.
[4] Zhiping Zheng. Seven Tones: Search for Linguistics and Languages. The 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001). Pittsburgh, PA. June 2-7, 2001.
Related link: http://www.coli.uni-sb.de/~zheng/qa-local/
Paper ID: 204
Type: Demonstration
Title: Advanced concordances with Bonito
Contact author: Pavel Rychly
Topic: Text - text corpora
Abstract: A `corpus' is a large collection of texts in electronic form. `Corpus managers' are tools or sets of tools for coping with corpora: they can encode, query and visualize texts. Manatee is a powerful corpus manager; it is a modular framework with several types of interfaces. Bonito is a graphical user interface (GUI) to the Manatee system. It enables queries to be formed and put to various corpora. The result of a corpus query is the so-called concordance list, which contains all corpus positions matching the query. The concordance list is clearly displayed in `key word(s) in context' (KWIC) format. Statistics can also be computed on the result.
The Manatee system handles `annotated corpora', that is, corpora containing not only a sequence of words but also additional information. Typically this includes linguistic information associated with the particular word forms in the corpus: the basic word form (lemma), part of speech (POS) and the respective grammatical categories (tags). Another type of annotation is structure tags, such as sentence boundaries or document boundaries. All types of annotation can be used in queries.
The demonstration will provide an overview of Bonito functions in real-world examples. It will cover: the query language (from simple queries to complex queries combining many types of annotation), positive and negative filtering of concordances, computing frequency distributions and collocation candidates, locating contexts of interest with different sort functions, creating and using subcorpora, and more.
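To show concretely what a concordance list is, here is a minimal KWIC sketch; the one-token predicate below is a stand-in for illustration only, as Manatee's query language is far richer.

def kwic(tokens, query, width=4):
    """List every corpus position matching `query` (a predicate over one
    token) and show it in key-word-in-context format."""
    lines = []
    for i, tok in enumerate(tokens):
        if query(tok):
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append(f'{left:>30}  [{tok}]  {right}')
    return lines

text = 'the cat sat on the mat and the dog sat too'.split()
for line in kwic(text, lambda t: t == 'sat'):
    print(line)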
Related link:
Paper ID: 205
Type: Demonstration
Title: Demonstration of multi-modal applications on IPAQ
Contact author: Jan Šedivý
Topic: Speech - automatic speech recognition
Abstract: We will show how the latest progress in speech technology can tremendously improve the usability of PDA and SmartPhone applications. The demonstrated multi-modal applications are built on IBM Embedded ViaVoice technology running on a Compaq IPAQ. Most demos were shown at CeBIT'02 in Hannover. Voice-activated Jukebox is a voice-enabled MP3 player in which the user can select from more than 1500 songs by speech. A voice-enabled map helps users quickly find one of several thousand streets in Prague. SMS Composer is a multi-modal application built on VoiceXML technology that allows the user to quickly construct and customize SMS messages from a large list of pre-defined and user-defined templates.
Related link: