Accepted Demonstrations
Demonstrations accepted for TSD 2002, with abstracts
Paper ID: 134
Type: Demonstration
Title: A Large-Scale Corpus of Polish and Tools for its Annotation
Contact author: Agnieszka Mykowiecka
Topic: Text - text corpora
Abstract: The aim of this demonstration is to introduce a project financed by the State Committee for Scientific Research (a Polish government body; grant number 7 T11C 043 20) aiming at the construction of a large corpus of written Polish for NLP applications. We briefly present the following characteristics of the corpus:
- aims (NLP applications, but also with lexical, theoretical-linguistic, language-teaching and sociolinguistic applications in mind);
- intended size and make-up of the corpus;
- the original system of morphosyntactic annotation;
- the system of structural and metadata annotation;
- the XML (XCES) standards adopted;
- original tools for the linguistic annotation of the corpus:
  - a morphological analyser;
  - a statistical tagger;
- intended ways of making the corpus publicly available.
We also demonstrate similarities and differences between this and similar corpus initiatives for other languages, and justify the current project in terms of the lack of publicly available and/or linguistically annotated corpora for Polish.
Various aspects of this project will be presented in more detail in related demonstrations (depending on the TSD organisers' decision to accept them).
Related link:
Paper ID: 135
Type: Demonstration
Title: Isolated Word Recognition and Visualization Software
Contact author: Antanas Lipeika
Topic: Speech - automatic speech recognition
Abstract:
Isolated word recognition and visualization software, "Recognition", has been developed. Isolated word recognition is based on dynamic time warping. Three types of local continuity constraints are available: III, V and VIII (Itakura). Relaxed endpoint constraints are also possible. An energy-based algorithm is used for endpoint detection. LPC features are used for pattern comparison, and a symmetric likelihood-ratio distance is used for distance calculation.
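As a rough illustration of the matching step, the sketch below implements plain dynamic time warping with the basic symmetric local constraint over generic feature frames. The function names, the squared-difference frame distance and the simple constraint are illustrative stand-ins, since the demonstrated system uses Itakura-type constraints, LPC features and a symmetric likelihood-ratio distance.

import numpy as np

def dtw_distance(ref, test, dist=lambda a, b: np.sum((a - b) ** 2)):
    """Align two feature sequences (frames x coefficients) with dynamic
    time warping and return the accumulated distance, normalized by the
    combined length.  Uses the basic symmetric local constraint; the
    demonstrated system offers Itakura-style constraints instead."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            # local continuity: diagonal, vertical, horizontal predecessors
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)

def recognize(test, templates):
    """Pick the reference word whose template warps to `test` most cheaply.
    `templates` maps a word to its reference feature sequence."""
    return min(templates, key=lambda w: dtw_distance(templates[w], test))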
Visualization of every step of the speech recognition process (endpoint detection, speech pattern matching, decision making) is possible. Every speaker can create his own dictionary (a collection of reference words) and use it for recognition. An on-line recognition mode is also available: one can press the spacebar, say a word, and the recognition result will be displayed on the screen.
Another program, "Palyginimas", has been developed for recognition performance evaluation using speech databases. One can open lists of test and reference speech files and perform recognition experiments. Recognition results are displayed on the screen and finally written to a text file. From the on-screen results it is possible to bring up the decision-making details for a particular decision.
A number of recognition experiments were performed using this software. This research was supported by the Lithuanian Language in the Information Society 2000-2006 program.
Related link:
Paper ID: 136
Type: Demonstration
Title: Computer-based Translation from English to Lithuanian
Contact author: Bronius Tamulynas
Topic: Text - machine translation
Abstract: A conceptual hierarchical model for a computer-based translation (CBT) system from English into Lithuanian is proposed. It is based on a hierarchical blackboard architecture and includes a virtual dictionary and several knowledge sources. It is shown that such a model, with a special set of knowledge sources with grammatical components, may reduce the complexity of the translation problem and improve translation quality. Following the CBT model paradigm, a system for specialized translation from English into Lithuanian has been created. It includes a user interface, a virtual dictionary, text parsing, a translation engine and several knowledge-source modules. A direct translation strategy with some transfer elements for syntactic sentence groups is used, which allows better translation quality for more complicated sources.
Related link:
Paper ID: 137
Type: Demonstration
Title: IBM Dictionary and Linguistic Tools system “Frost”
Contact author: Alexander Troussov
Topic: Text - other
Abstract: The IBM Dictionary and Linguistic Tools system, codenamed “Frost”, will eventually support over 30 languages, including some Western European languages, thus consolidating the results of more than 20 years of development of lexical data and morphological analysers. The product is under development by IBMers from several countries; cooperation with academic communities is used for data development and for providing linguistic expertise.
The Frost architecture provides a modular, cross-linguistic, cross-platform and high-performance (several gigabytes per hour) basis for industrial applications in Information Retrieval and Extraction, providing shallow parsing, part-of-speech tagging, morphological analysis and synonym support.
To increase performance and shorten the development cycle, specific linguistic phenomena are generalized and classified according to the computational models most suitable for their processing. For example, clitic processing in Romance languages, decomposition of solid compounds in Germanic languages and Chinese word segmentation are all treated in Frost with one formal computational tool. This tool is based on a special implementation of non-deterministic finite-state processing in which the back-tracking logic is extracted from the finite-state machine into a separate module. The separated programming logic gives flexibility, while finite-state processing ensures high-speed string matching. Finite-state processing in this scheme is reduced to finding the hierarchy of prefixes in a deterministic finite-state dictionary, which contains word-formation elements provided with morphological, morphotactic and statistical information.
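A minimal sketch of this division of labour, under the assumption that the deterministic dictionary can be modelled as a trie of word-formation elements: the finite-state part only enumerates the hierarchy of prefixes, while a separate backtracking routine composes them into full analyses (e.g. compound decompositions). All names and structures here are invented for illustration; they are not Frost internals.

def build_trie(lexicon):
    """Deterministic finite-state dictionary modelled as a nested-dict
    trie.  `lexicon` maps word-formation elements to their annotations."""
    root = {}
    for element, info in lexicon.items():
        node = root
        for ch in element:
            node = node.setdefault(ch, {})
        node[None] = info  # end-of-element marker carrying the annotation
    return root

def prefixes(trie, s, start):
    """Enumerate the hierarchy of lexicon prefixes of s[start:]; this part
    is pure deterministic finite-state matching."""
    node, out = trie, []
    for i in range(start, len(s)):
        node = node.get(s[i])
        if node is None:
            break
        if None in node:
            out.append((i + 1, node[None]))
    return out

def segment(trie, s, start=0):
    """Separate backtracking module: compose matched elements into full
    segmentations (e.g. decomposition of a solid compound)."""
    if start == len(s):
        yield []
        return
    for end, info in prefixes(trie, s, start):
        for rest in segment(trie, s, end):
            yield [(s[start:end], info)] + rest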
Morphological analysis in Frost is based on finite-state automata and transducers. Although finite-state devices have been present since the emergence of computer science and are extensively used in natural language processing (including speech processing), the focus has been on mathematical and algorithmic approaches to their “topology”, leaving a gap between industrial and academic research. The IBM Dictionary and Linguistic Tools team has developed new approaches to the analysis of finite-state device performance which provide a several-fold improvement in run time.
Frost exploits a variable node format, which allows the use of binary search, hash tables and other programming techniques in addition to the previously widespread linear search and trie structures. A format is assigned to a node according to graph-theoretic analysis and statistics on the usage of that particular node in corpus processing. In addition to the performance advantages, variable node formats open the way to the efficient application of finite-state processing to non-alphabetical languages.
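The variable-node-format idea can be sketched as follows; the fan-out thresholds and class layout are assumptions chosen purely for illustration.

import bisect

class Node:
    """Sketch of a variable-format node: low fan-out nodes use linear
    search, medium ones binary search over a sorted array, large ones a
    hash table.  The thresholds below are arbitrary illustrations."""
    def __init__(self, edges):                 # edges: label -> child node
        self.labels = sorted(edges)
        self.edges = edges
        if len(edges) <= 4:
            self.find = self._linear
        elif len(edges) <= 64:
            self.find = self._binary
        else:
            self.find = edges.get              # hash-table lookup

    def _linear(self, label):
        for l in self.labels:
            if l == label:
                return self.edges[l]
        return None

    def _binary(self, label):
        i = bisect.bisect_left(self.labels, label)
        if i < len(self.labels) and self.labels[i] == label:
            return self.edges[label]
        return None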
Another aspect of the Frost finite-state tools is that their implementation takes into account the architecture of modern computers; specifically, we use a cache- and prefetching-friendly memory representation.
Finite-state processing typically has simple access code, so it is the speed of memory access that can be crucial for performance. Modern processors and computers include a hierarchy of data storage devices to cache frequently used data, and operating systems provide prefetching. Finite-state processing is a highly irregular type of computation, however, so progress in standard hardware and software caching can hardly be expected to eliminate the need to make finite-state processing cache-friendly.
Related link:
Paper ID: 138
Type: Demonstration
Title: XML architecture for a modern corpus
Contact author: Piotr Banski
Topic: Text - text corpora
Abstract: The presentation concerns the XML architecture assumed for the IPI PAN Corpus of written Polish, being created at the Institute of Computer Science of the Polish Academy of Sciences. The design of this corpus implements the principles laid out by the XCES Guidelines (see e.g. Ide et al. 2000), featuring in particular so-called stand-off annotation, whereby each text is split into several components residing in separate XML files that may instantiate various DTDs or XML Schemas. More specifically, each text in the IPI PAN Corpus will be composed of three layers, as sketched below.
main file <-
sentence segmentation file <-
morphosyntactic annotation file
The main file contains the actual text with gross structural markup down to the level of the paragraph, and with the addition of tags signalling quotations and various aspects of text highlighting (italics, small caps, etc.). The next document establishes the sentence boundaries and may act as a base for another layer of annotation, or for a document that aligns sentences from different versions of the base text (for the purpose of creating a parallel corpus and thus reusing the text resources in the future). It also resolves the tags indicating highlighted text in the main file, classifying the information conveyed by e.g. italics as 'foreign text', 'proper name', 'emphasis', etc. The third layer of annotation contains the (often multiple) outputs of the morphological analyser (morphosyntactic information, lemmas) and the output of the disambiguator.
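To make the stand-off linking concrete, here is a hedged miniature in Python: three toy "layers" whose elements point at one another by id, resolved outward-in. The element names, attributes and reference scheme are invented for illustration and do not reproduce the actual XCES DTDs.

import xml.etree.ElementTree as ET

# Hypothetical miniature of the three stand-off layers.
MAIN = '<body><p id="p1">Ala ma kota.</p></body>'
SEGM = '<segm><s id="s1" target="p1" from="0" to="12"/></segm>'
MORPH = '<morph><tok target="s1" orth="Ala" lemma="Ala" tag="subst"/></morph>'

def resolve(main, segm, morph):
    """Follow references inward: tokens point at sentences, sentences at
    paragraphs in the main file, so each layer stays a separate document."""
    paras = {p.get('id'): p.text for p in ET.fromstring(main)}
    sents = {}
    for s in ET.fromstring(segm):
        text = paras[s.get('target')]
        sents[s.get('id')] = text[int(s.get('from')):int(s.get('to'))]
    for t in ET.fromstring(morph):
        print(t.get('orth'), t.get('lemma'), t.get('tag'),
              'in sentence:', sents[t.get('target')])

resolve(MAIN, SEGM, MORPH)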
This setup raises interesting questions, relating e.g. to the cost-effectiveness of corpus creation, or to structuring the data model that underlies the various possible types of annotation. Thus, the multiple-layer architecture of the corpus does not only reflect the logical divisions among the kinds of data visualised by the relevant annotation files -- it is also geared towards the reusability and extensibility of corpus annotation. Additionally, as the creation of the outer levels of annotation proceeds, the need for human intervention increases, which means that at early stages of the project the corpus will consist of the innermost layer of annotation only ('main files'), as these can sometimes be created fully automatically. The additional layers will be provided later, at an obvious cost.
The presentation will be illustrated with fragments of a mini-corpus of Polish designed to incorporate the major architectural features of the IPI PAN Corpus.
Related link:
Paper ID: 139
Type: Demonstration
Title: The MITRE Audio Hot Spotting Prototype - Using Multiple Speech and Natural Language Processing Technologies
Contact author: Qian Hu
Topic: Speech - other
Abstract: The MITRE Audio Hot Spotting Prototype - Using Multiple Speech and Natural Language Processing Technologies
Qian Hu, Stanley Boykin, Fred Goodman, Warren Greiff, Margot Peet
The MITRE Corporation
Audio contains more information than is conveyed by the text transcript produced by an automatic speech recognizer. Information such as a) who is speaking, b) the vocal effort used by each speaker, and c) the presence of certain non-speech background sounds is lost in a simple speech transcript. In addition, due to the variability of noise conditions, speaker variance, and the limitations of automatic speech recognizers, speech transcripts can be full of errors. Deletion errors can prevent users from finding what they are looking for in audio or video data, while insertion and substitution errors can be misleading and/or confusing. Audio Hot Spotting technology permits a user to automatically locate regions of interest in an audio/video file that meet his/her specified criteria. In the query, users may search for keywords or phrases, speakers, both keywords and speakers, non-verbal speech characteristics, or non-speech signals of interest. In order to provide more and better information from multimedia data, we have incorporated multiple speech technologies and natural language processing techniques in the MITRE Audio Hot Spotting prototype currently under development.
We focused on finding words that are information-rich and machine-recognizable (i.e. content words). The MITRE Audio Hot Spotting prototype examines the speech recognizer output and creates an index list of content words: short, weakly stressed words, for example, are much more likely to be mis-recognized. To eliminate words that are information-poor and prone to mis-recognition, our index-generation algorithm takes the following factors into consideration: a) absolute word length, b) the number of syllables, c) the recognizer's own confidence score, d) the part of speech (i.e. verb, noun), using a POS tagger with some heuristic rules, and e) the word's frequency of occurrence. Experiments we have conducted indicate that the index list produced typically covers less than 10% of the total words spoken, while more than 90% of the indexed words are actually spoken and correctly recognized.
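A schematic sketch of such an index-generation filter is given below; the thresholds, the crude syllable counter and the tagger interface are illustrative placeholders, not the prototype's actual values or components.

def count_syllables(word):
    """Crude vowel-group count standing in for a real syllabifier."""
    groups, prev = 0, False
    for ch in word.lower():
        v = ch in 'aeiouy'
        groups += v and not prev
        prev = v
    return groups

def index_words(tokens, tagger, freq, min_len=4, min_syll=2,
                min_conf=0.9, max_freq=1e-4):
    """Keep only information-rich, reliably recognized words, using the
    five factors listed above; all thresholds are placeholders."""
    content_pos = {'NOUN', 'VERB', 'ADJ'}
    index = set()
    for word, conf in tokens:           # recognizer output: (word, confidence)
        if (len(word) >= min_len                       # a) absolute word length
                and count_syllables(word) >= min_syll  # b) number of syllables
                and conf >= min_conf                   # c) recognizer confidence
                and tagger(word) in content_pos        # d) part of speech
                and freq.get(word, 0.0) <= max_freq):  # e) corpus frequency
            index.add(word)
    return index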
The prototype allows the user to query the system by keywords or phrases, either by selecting them from the index list or by manual entry. If matches are found, the system displays the recognized text and allows the user to play the audio or video in the vicinity of the match. In addition, the user can query and retrieve segments spoken by a particular speaker. We achieved this capability by integrating and extending a research speaker-identification algorithm. Based on the speaker identification results, the system automatically computes the number of times and the total duration the speaker spoke. We combined large-vocabulary, speaker-independent, continuous-speech recognition and speaker identification to refine lexical queries by a particular speaker. For example, the user can ask for instances of the word "terrorism" spoken only by the President. More recently, we have experimented with algorithms that detect information-bearing background sounds, such as applause and laughter, which can be queried and retrieved by users.
Related link:
Paper ID: 141
Type: Demonstration
Title: AlfaNum System for Continuous Speech Recognition
Contact author: Pekar Darko
Topic: Speech - automatic speech recognition
Abstract: This demo gives a brief presentation of a program package for continuous speech recognition which is, so far, successful with small and medium dictionaries. The package is very large because it contains modules for both training and recognition. Each of these modules consists of several submodules and a variety of classes and functions. It includes two libraries developed over the last two years by the same authors: the slib library for digital signal processing and the general-purpose an_misc library, both available at www.alfanum.com. The program package is the product of several years of work on automatic speech recognition, starting from isolated-word recognition, through connected-word recognition, to continuous speech recognition using phonemes in context, on which this system is based. Since the system is based on phoneme-in-context recognition, it supports recognition of any set of words (grammar). Changing the grammar requires no additional training or speech database recording, only the building of a new trellis, which takes no more than a few seconds. The entire program is written in the C++ programming language and is fully developed by the authors, which means that it does not rely on any third-party specialized library. The software is mostly independent of the platform and the operating system (except for the part which requires communication with hardware).
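The grammar-to-trellis step can be sketched roughly as follows; the data structures and the function name are assumptions made for illustration and do not reflect the AlfaNum implementation.

def build_trellis(grammar, pron_dict):
    """Expand a word-level grammar (allowed word-to-word transitions) into
    a phoneme-level search graph over already-trained phoneme models; no
    retraining is needed when the grammar changes."""
    arcs = []                 # (source_state, target_state, phoneme_model)
    word_entry, word_exit = {}, {}
    state = 0
    for word in grammar:      # one left-to-right phoneme chain per word
        word_entry[word] = state
        for ph in pron_dict[word]:
            arcs.append((state, state + 1, ph))
            state += 1
        word_exit[word] = state
        state += 1
    for src, dests in grammar.items():   # null arcs for allowed transitions
        for dst in dests:
            arcs.append((word_exit[src], word_entry[dst], None))
    return arcs

# e.g. grammar = {'call': ['john', 'mary'], 'john': [], 'mary': []},
# with pron_dict giving each word's phoneme sequence.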
For the purpose of demonstrating the ASR engine, a small GUI application was made. This application simply prints recognized words in a text box. These words and their transitions must, of course, be defined in a grammar file given on the command line. Besides the grammar file, it needs a file containing the HMM models and a pronunciation dictionary. If all the files are provided, the application initializes its structures in a few seconds and stands by for recognition. Recognition can be started and stopped by clicking the appropriate buttons.
The system is trained to be speaker-independent, so anyone can address it with similar accuracy. Although text is printed out with some delay, the engine is fast and can process more than ten recognitions in parallel, depending on grammar complexity.
Related link: www.alfanum.com
Paper ID: 142
Type: Demonstration
Title: Romanian word inflection
Contact author: Elena Boian
Topic: Text - automatic morphology
Abstract: Romanian word inflection
S. Cojocaru, E. Boian
Being a highly inflectional language, Romanian makes the problem of word-form generation genuinely difficult.
The main part of Romanian inflecting words has been classified according to the way their inflected forms are created. This classification proved useful and led to the idea of introducing a special scattered-context grammar formalizing word-form production [1]. Using these grammar rules, we can formalize the inflection process. This method (let us call it static) is based on knowledge of the morphological group of the given word.
Nevertheless, it is necessary to be able to obtain a new set of word forms for a given item without this knowledge; the group number must be detected dynamically. First of all, the word forms themselves should be produced, and a special program (the "dynamic method") facilitates this tedious work.
The dynamic method starts from the base word and its morphological category (part of speech, gender for nouns, etc.) and makes it possible to determine criteria for classifying Romanian words into three inflection groups: automatic, partially automatic and irregular [2]. To inflect a word it is necessary to know the vowel and consonant alternations, the contexts in which the alternation rules apply, and the affix series. The tables of affix series, the set of alternations and their admissible combinations form the basis of the inflection programs. We consider these processes for each part of speech.
Knowing all the word forms, we can determine the inflection group number, and thus reduce the presentation of all word forms to the static method.
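A toy sketch of the affix-series-plus-alternations idea; the entries are invented ASCII-only approximations (real Romanian needs diacritics), not the authors' actual tables.

def inflect(base, stem_alternations, affix_series):
    """Generate word forms from a base: apply each admissible stem
    alternation in its context, then attach the affix series."""
    forms = {}
    for slot, affix in affix_series.items():
        stem = base
        for old, new, context in stem_alternations:
            if slot in context and stem.endswith(old):
                stem = stem[: len(stem) - len(old)] + new
        forms[slot] = stem + affix
    return forms

# Toy example in the style of 'fata' ~ 'fete' (a/e alternation before -e):
print(inflect('fat',
              stem_alternations=[('at', 'et', {'pl'})],
              affix_series={'sg': 'a', 'pl': 'e'}))
# -> {'sg': 'fata', 'pl': 'fete'}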
Bibliography
[1] E. Boian, S. Cojocaru, L. Malahova. Instruments pour applications linguistiques. La terminologie en Roumanie et en Republique de Moldova, Hors serie, No. 4, 2000, pp. 42-44.
[2] E. Boian, S. Cojocaru. The inflexion regularities for the Romanian language. Computer Science Journal of Moldova, Vol. 4, No. 1, 1996, Chisinau.
Related link:
Paper ID: 143
Type: Demonstration
Title: GoDiS - Issue-based dialogue management in a multi-domain, multi-language dialogue system
Contact author: Elena Karagjosova
Topic: Dialogue - dialogue systems
Abstract: GoDiS (Gothenburg Dialogue System) is an experimental dialogue system utilizing dialogue management based on Ginzburg's concept of Questions Under Discussion (QUD). GoDiS is implemented using the TrindiKit, a toolkit for implementing dialogue move engines and dialogue systems based on the Information State approach. While originally built for fairly simple information-exchange dialogue, it is being extended to handle action-oriented and negotiative dialogue.
One of the goals of the information state approach is to encourage modularity, reusability and plug-and-play; to demonstrate this, GoDiS has been adapted to several different dialogue types, domains, and languages. It has also been enhanced with speech input and output, and is able to switch languages in an ongoing dialogue. The current range of applications includes the following (a schematic information-state sketch follows the list):
-- information-exchange dialogue in a travel agency: English (with speech input/output), Swedish (with speech input/output), German (with text input/output; speech output under development), Spanish (with text input/output; speech output under development)
-- action-oriented dialogue for a mobile phone interface: English (with speech input/output), Swedish (with speech input/output)
-- action-oriented dialogue for a VCR interface: English (with speech input/output), Swedish (with speech input/output)
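The toy sketch below illustrates the QUD-based information-state update idea in miniature; it is a schematic illustration only, not TrindiKit or GoDiS code, and the move types and field names are invented.

from dataclasses import dataclass, field

@dataclass
class InfoState:
    """Toy information state: a QUD stack plus shared commitments."""
    qud: list = field(default_factory=list)      # questions under discussion
    commitments: dict = field(default_factory=dict)

def update(state, move):
    """Integrate a dialogue move: 'ask' pushes a question onto QUD,
    'answer' resolves the topmost question."""
    kind, content = move
    if kind == 'ask':
        state.qud.append(content)
    elif kind == 'answer' and state.qud:
        question = state.qud.pop()
        state.commitments[question] = content
    return state

s = InfoState()
update(s, ('ask', 'destination?'))
update(s, ('answer', 'Paris'))
print(s.commitments)   # {'destination?': 'Paris'}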
Related link: http://www.ling.gu.se/~stinae/TSD2002.html
Paper ID: 144
Type: Demonstration
Title: Hyphenation algorithm for Romanian language words
Contact author: Demidova Valentina
Topic: Text - parsing and part-of-speech tagging
Abstract: Hyphenation algorithm for Romanian language words
V. Demidova, T. Verlan
The problem of correctly hyphenating Romanian words is pressing, given the lack of such an algorithm for Romanian in many widely used automated text-processing systems. Our algorithm is based on the classical rules for dividing words into syllables, which rest on the phonetic value of letters. The classical rules operate on sequences of vowels (simple and complex, i.e. semivowels) and consonants (also simple and complex) located between two vowels. The algorithm also handles exceptions to these rules, such as the following consonant combinations: "bl", "br", "cl", "cr", "fl", "fr", "hl", "pl", "pr", "tl", "tr", "vl", "vr", "lpt", "mpt", "nct", "ndv", "rtf", "stm", "ngstr", etc.
The rules described above were taken into consideration as far as possible. However, the specific character of Romanian does not permit their complete formalization. Vowels present the main difficulty for syllable division in Romanian: they can be simple or complex (so-called semivowels), stressed or unstressed, and the division rules depend on the category to which a given vowel belongs. Further ambiguity arises from the way different vowel combinations are perceived by ear. When a word is entered, all we can determine about it is its sequence of vowels and consonants; the phonetic information is not accessible to us. Therefore we cannot implement the rules above in their full completeness, though many situations can still be resolved, if in a somewhat artificial way.
The problem of diphthongs and triphthongs is the most difficult one in dividing Romanian words into syllables. Prefixes also create a difficult situation: when one of the combinations "an" or "in" occurs at the beginning of a word and is followed by a vowel, an ambiguity appears. To avoid this ambiguity, and primarily in view of hyphenation from line to line, where a single letter must not be left alone on a line, we have decided to reject the first hyphen.
Thus we obtain a syllabification algorithm for a rather extensive class of Romanian words. We certainly do not claim completeness, but testing showed that 70% of the words in scientific and literary texts are divided correctly, which is substantial. The algorithm can be developed further, but including additions requires further analysis of the database, routine work which nevertheless takes a lot of time. The main and most frequent letter combinations, however, are processed correctly by our algorithm, so it is effective enough.
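For illustration, here is a drastically simplified sketch of the classical V-CV / VC-CV rules with inseparable clusters, over a toy ASCII vowel set; real Romanian requires diacritics, semivowel handling and the further exceptions discussed above.

INSEPARABLE = {'bl', 'br', 'cl', 'cr', 'fl', 'fr', 'hl', 'pl', 'pr',
               'tl', 'tr', 'vl', 'vr'}
VOWELS = set('aeiou')   # toy vowel set, no diacritics or semivowels

def hyphenate(word):
    """Split between each pair of vowels: V-V for hiatus, V-CV before a
    single consonant, V-CCV before an inseparable cluster, otherwise
    VC-CV after the first consonant of the cluster."""
    syllables, start = [], 0
    vowels = [i for i, ch in enumerate(word) if ch in VOWELS]
    for a, b in zip(vowels, vowels[1:]):
        cluster = word[a + 1:b]          # consonants between the two vowels
        if len(cluster) <= 1:
            cut = a + 1                  # hiatus or V-CV
        elif cluster[-2:] in INSEPARABLE:
            cut = b - 2                  # keep the inseparable pair together
        else:
            cut = a + 2                  # VC-CV (and VC-CCV for longer runs)
        syllables.append(word[start:cut])
        start = cut
    syllables.append(word[start:])
    return '-'.join(syllables)

print(hyphenate('codru'))   # co-dru  ('dr' is inseparable)
print(hyphenate('munte'))   # mun-te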
Related link:
Paper ID: 145
Type: Demonstration
Title: Semantic Knowledge in an Information Retrieval System
Contact author: Styve Jaumotte
Topic: Text - information retrieval
Abstract: We present in this demonstration a system under development aimed at extracting relevant information as answers to questions from a huge corpus of texts. This prototype proceeds in two stages:
- first, it selects, like a classic textual Information Retrieval System (IRS), a set of documents likely to contain relevant knowledge;
- second, an extraction process determines which parts of the text from this set of documents answer the user's query.
The IRS is based on a Boolean model extended with a "Distance" operator to allow search within a restricted passage of N terms. The system can work either with the information need expressed in natural language or with a Boolean query. If the question is written in natural language, the category of the question is first determined, then a reformulating module transforms it into a Boolean expression. This rewriting depends on the conceptual category of the question, which comes from a syntactic analysis by the Link Grammar Parser.
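A minimal sketch of such a Distance operator over a positional inverted index follows; the index layout is an assumption made for illustration, not the system's internals.

def near(index, t1, t2, n):
    """Distance operator sketch: return documents in which t1 and t2 occur
    within a window of n terms.  `index` maps term -> {doc_id: [positions]}."""
    hits = set()
    docs = index.get(t1, {}).keys() & index.get(t2, {}).keys()
    for doc in docs:
        p1, p2 = index[t1][doc], index[t2][doc]
        if any(abs(a - b) <= n for a in p1 for b in p2):
            hits.add(doc)
    return hits

index = {'speech': {1: [3, 40], 2: [7]}, 'corpus': {1: [5], 2: [90]}}
print(near(index, 'speech', 'corpus', 10))   # {1}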
Our tool, which sits between an IRS and a traditional Question Answering system, stresses semantic rewriting of the initial query using WordNet as a thesaurus. We distinguish three kinds of semantic knowledge in the thesaurus, each leading to a specific method of semantic rewriting: definitional knowledge (dictionary-like definitions of concepts), relational knowledge (systemic characterization of concepts through lexical relations between concepts), and document-like characterization of concepts (where thesaurus term descriptions are treated as documents amenable to IRS techniques).
The last step is the extraction process, working on the concordance of terms. This approach seems limited to a few categories (date/time queries) and gives fair results. The next step of development is to introduce a syntactic filter to improve the quality of answers.
To ensure good performance in spite of the large amount of data processed, we have developed the core of the system in ANSI C, which also provides platform independence of the source code. The graphical user interface has been programmed in Tcl/Tk. The system is tested on the TREC (Text REtrieval Conference) data, which is made up of about 3 GB of articles coming from different newspapers and more than 1500 questions.
Related link: http://www.info.univ-angers.fr/pub/jaumotte/qa_irs
Paper ID: 146
Type: Demonstration
Title: Keyword spotting system
Contact author: Petr Schwarz
Topic: Speech - automatic speech recognition
Abstract: A keyword spotting system, developed at the Brno University of Technology, Faculty of Information Technology, will be presented. This system is designed mainly for detecting keywords in long speech records. Keywords are modeled with triphone Hidden Markov Models. The models were trained on the Czech SpeechDat-E speech database [1]. For recognition, a modified Viterbi algorithm [2] is used.
A new procedure to determine optimal thresholds for keyword acceptance, based on a statistical analysis of false acceptances and false alarms on a large database, was developed.
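Schematically, per-keyword thresholds of this kind can be chosen by minimizing total error on labelled development data, as in the following illustrative sketch (invented function names, not the actual statistical procedure):

def optimal_threshold(true_scores, false_scores, grid):
    """Pick the score threshold minimizing total error (missed detections
    plus false alarms) for one keyword on a labelled development set."""
    def total_error(t):
        misses = sum(s < t for s in true_scores)
        false_alarms = sum(s >= t for s in false_scores)
        return misses + false_alarms
    return min(grid, key=total_error)

def spot(candidates, thresholds):
    """Accept a detected keyword only if its score clears its threshold.
    `candidates` holds (keyword, score, start_time, end_time) tuples."""
    return [(kw, t0, t1) for kw, score, t0, t1 in candidates
            if score >= thresholds[kw]]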
All components of the system are in-house, but the models are compatible with the widely used HTK toolkit. The system runs on MS Windows. A public evaluation version for Czech will be available from http://www.fit.vutbr.cz/research/groups/speech/sw.html
[1] H. Heuvel, J. Boudy, Z. Bakcsi, J. Cernocky, V. Galunov, J. Kochanina, W. Majewski, P. Pollak, M. Rusko, J. Sadowski, P. Staroniewicz, H. Tropf: SpeechDat-East: Five multilingual speech databases for voice-operated teleservices completed. In: Proc. EUROSPEECH 2001, Aalborg, 2001.
[2] J. Junkawitsch, L. Neubauer, H. Hoege and G. Ruske: A new keyword-spotting algorithm with pre-calculated optimal thresholds. In: Proc. Intl. Conference on Spoken Language Processing (ICSLP), 1996.
Related link: http://www.fit.vutbr.cz/research/groups/speech
Paper ID: 201
Type: Demonstration
Title: Generator module for InBASE NL data base Interface system
Contact author: Michael Bodasov
Topic: Text - information retrieval
Abstract: We want to present InBASE, a system for understanding NL queries to databases, and especially the new generator module developed for it.
InBASE (a demo is available at http://www.inbase.artint.ru/nl/nllist-eng.asp) is a commercially oriented system offered to e-shops to enable NL access to their content, and also used in the project management and document management areas. InBASE can also be thought of as an educational and personal information system.
The generator block is used to rephrase user queries; it lets the user check the correctness of the InBASE analyzer's understanding. Problems can arise when the system misunderstands the user query, so that irrelevant information is delivered and the user never learns about it. The most general way to resolve these problems is to construct an NLG module generating an NL query from the internal query representation. The system generates a rephrased query, and the user can verify that the query was understood correctly. The rephrased query can also be shown together with the relevant fragment of the DB to explain how the structure of the data was derived. Moreover, the new generation facilities will be a step towards an interactive mode for the InBASE system.
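As a rough illustration of rephrasing from an internal representation, consider this toy template-based sketch; the representation and templates are invented for illustration and do not reflect InBASE's internals.

TEMPLATES = {
    'select': 'Show {attributes} of {entity} where {conditions}.',
}

def rephrase(query):
    """Render an internal query representation back into NL so the user
    can check how the analyzer understood the question."""
    conds = ' and '.join(f"{a} {op} {v}" for a, op, v in query['conditions'])
    return TEMPLATES[query['kind']].format(
        attributes=', '.join(query['attributes']),
        entity=query['entity'],
        conditions=conds or 'any')

q = {'kind': 'select', 'entity': 'notebooks',
     'attributes': ['price'], 'conditions': [('RAM', '>=', '256MB')]}
print(rephrase(q))   # Show price of notebooks where RAM >= 256MB.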
At present we have a working version of the module, which operates in close relationship with the InBASE system. The module is implemented independently of the domain area (the domain area is set and tuned in the InBASE system) for the Russian language. We intend to implement switching between Russian and English as generation languages by conference time.
Detailed information about the project is available in our full paper at this conference (ID 17): Michael V. Boldasov, User query understanding by the InBASE system as a source for a multilingual NLG module (first step).
Related link:
Paper ID: 202
Type: Demonstration
Title: Visualisation Techniques for Analysing Meaning
Contact author: Beate Dorow
Topic: Text - knowledge representation and reasoning
Abstract: Many ways of dealing with large collections of linguistic information involve the general principle of mapping words, larger terms and documents into some sort of abstract space. The inherent structure of these spaces is often difficult to grasp directly. Visualisation tools can help to uncover the relationships between meanings in these spaces, giving a clearer picture of the natural structure of linguistic information. We demonstrate a variety of tools for visualising word meanings in vector spaces and graph models, derived from co-occurrence information and local syntactic analysis.
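One common recipe behind such tools, sketched here under the assumption of a simple co-occurrence space: build word vectors from windowed co-occurrence counts and project them to two dimensions with a rank-2 SVD for plotting. The function names and parameters are illustrative, not the demonstrated tools' actual code.

import numpy as np

def cooccurrence_vectors(sentences, window=2):
    """Build a word-by-word co-occurrence matrix from tokenized sentences."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    M[idx[w], idx[s[j]]] += 1
    return vocab, M

def project_2d(M):
    """Rank-2 SVD projection: a simple way to map the space onto a plot."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :2] * S[:2]

vocab, M = cooccurrence_vectors([['apples', 'and', 'oranges'],
                                 ['oranges', 'and', 'lemons']])
print(dict(zip(vocab, project_2d(M).round(2).tolist())))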
The tools presented in this demonstration are all available for public use on our website.
Related link: http://infomap.stanford.edu
Paper ID: 203
Type: Demonstration
Title: Deploying Web-based Question Answering System to Local Archive
Contact author: Zhiping Zheng
Topic: Text - information retrieval
Abstract: AnswerBus (http://www.answerbus.com/, [2,3]) is a Web-based open-domain Question-Answering (QA) system. It successfully uses NLP/IR techniques and reaches very high correct-answer rates. Although it is not designed for TREC, it still correctly answers over 70% of TREC-8 questions with Web resources. The question remains whether the techniques of a Web-based QA system can be deployed to a local archive.
In experiments with Xtramind (http://www.xtramind.com/) to answer this question, we chose part of the DUC conference corpus as the local archive. The selected corpus is in the general business domain, but we treat it as an open-domain archive. We tried to keep all the techniques used in AnswerBus, including the QA-specific dictionary, dynamic named-entity extraction, the answer-candidate ranking system, pseudo-answer detection, etc. But the sentence-question matching formula, proven effective in AnswerBus, is not suitable for a QA system on a local archive, so we developed a new algorithm to judge whether a sentence is a possible answer or not.
As the first step in deploying AnswerBus to a local archive, a local search engine is needed to do the search. We use the Seven Tones (http://www.seventones.com/, [4]) search engine but drop the features related to specific domains. The benefits of this search engine include: 1) indexing is very fast; 2) the index can be partially and logically modified; and 3) it scales to a large size.
To evaluate the new system, we refer to the milestones described in [1] and provided questions covering Arthur Graesser's 16 question categories, which range from easy to very difficult. The test result is very encouraging, and Top-1 accuracy is 72%.
References:
[1] John Burger et al. Issues, Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A). NIST, 2001.
[2] Zhiping Zheng. AnswerBus Question Answering System. Human Language Technology Conference (HLT 2002). San Diego, CA. March 24-27, 2002.
[3] Zhiping Zheng. Developing a Web-based Question Answering System. The Eleventh World Wide Web Conference (WWW 2002). Honolulu, HI. May 7-11, 2002.
[4] Zhiping Zheng. Seven Tones: Search for Linguistics and Languages. The 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001). Pittsburgh, PA. June 2-7, 2001.
Related link: http://www.coli.uni-sb.de/~zheng/qa-local/
Paper ID: 204
Type: Demonstration
Title: Advanced concordances with Bonito
Contact author: Pavel Rychly
Topic: Text - text corpora
Abstract: A `corpus' is a large collection of texts in electronic form. `Corpus managers' are tools or sets of tools for coping with corpora: they can encode, query and visualize texts. Manatee is a powerful corpus manager; it is a modular framework with several types of interfaces. Bonito is a graphical user interface (GUI) to the Manatee system. It enables queries to be formed and put to various corpora. The result of a corpus query is the so-called concordance list, which contains all corpus positions matching the query. The concordance list is clearly displayed in `key word(s) in context' (KWIC) format. Statistics can also be computed on the result.
The Manatee system handles `annotated corpora', that is, corpora containing not only a sequence of words but also additional information. Typically this includes linguistic information associated with the particular word forms in the corpus: the basic word form (lemma), part of speech (POS) and the respective grammatical categories (tags). Another type of annotation is structure tags, such as sentence boundaries or document boundaries. All types of annotation can be used in queries.
The demonstration will provide an overview of Bonito functions in real-world examples. It will cover: the query language (from simple queries to complex queries combining many types of annotation), positive and negative filtering of concordances, computing frequency distributions and collocation candidates, locating contexts of interest with different sort functions, creating and using subcorpora, and more.
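To show concretely what a concordance list is, here is a minimal KWIC sketch; the one-token predicate below is a stand-in for illustration only, as Manatee's query language is far richer.

def kwic(tokens, query, width=4):
    """List every corpus position matching `query` (a predicate over one
    token) and show it in key-word-in-context format."""
    lines = []
    for i, tok in enumerate(tokens):
        if query(tok):
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append(f'{left:>30}  [{tok}]  {right}')
    return lines

text = 'the cat sat on the mat and the dog sat too'.split()
for line in kwic(text, lambda t: t == 'sat'):
    print(line)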
Related link:
Paper ID: 205
Type: Demonstration
Title: Demonstration of multi-modal applications on IPAQ
Contact author: Jan Šedivý
Topic: Speech - automatic speech recognition
Abstract: We will show how the latest progress in speech technology can tremendously improve the usability of PDA and SmartPhone applications. The demonstrated multi-modal applications are built on IBM Embedded ViaVoice technology running on a Compaq IPAQ. Most demos were shown at CeBIT'02 in Hannover. Voice-activated Jukebox is a voice-enabled MP3 player in which the user can select from more than 1500 songs by speech. A voice-enabled map helps users quickly find one of several thousand streets in Prague. SMS Composer is a multi-modal application built on VoiceXML technology that allows the user to quickly construct and customize SMS messages from a large list of pre-defined and user-defined templates.
Related link: