The paper describes prosodic annotation procedures of the GOPOLIS Slovenian speech database and methods for the automatic classification of different prosodic events. Several statistical parameters concerning the duration and loudness of words, syllables and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach.
The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multi-modal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (for example, automatic dialogue and translation systems). The definition of standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools \cite{EAGLESHandbook}. This paper presents the MobiLuz spoken resources of the Slovene language, which will be made freely available for research purposes in speech technology and linguistics.
Technological developments in telephone-based dialogue systems have led to a situation where the main hindrance to progress is our lack of understanding of how dialogues work. The challenge is to understand dialogues in order to design efficient automated systems which take account of what users instinctively need. Two areas are addressed. Firstly, users automatically relate to the interpersonal aspect of each other's participant role. Secondly, dialogue sequences are joint productions, and grammatical expectations are exchanged in a way that is not immediately intuitive to the observer. Examples are presented and possible ways forward are discussed.
Facial expressions and speech are means to convey information. Facial expressions can be used to reinforce speech or even to complement it. The main goal of our research is to investigate how facial expressions can be associated with text-based speech in an automated way. As a first step, we studied how people attach smileys to text in chat sessions and facial expressions to text balloons in cartoons. We developed an expert system with a set of rules that describe dependencies between text and facial expressions. The specific facial expressions are stored in a nonverbal dictionary, and we developed a search engine for that dictionary. Finally, we present a tool to generate 3D facial animations.
The main topic of this paper is modelling a human operator in the dialogue manager of the Alparon system. First, a corpus of 200 human-human dialogues has been analysed by applying an approach that goes beyond a finite state automaton approach. The corpus analysis resulted in a set of common strategies applied by professional human operators in similar situations. Secondly, a prototype system has been built based on the Alparon dialogue manager. This has been done by translating the strategies into knowledge rules and heuristics as used by the dialogue control modules in the Alparon dialogue manager.
This paper presents the result of an experimental system aimed at performing a robust semantic analysis of analysed speech input in the area of information system access. The goal of this experiment was to investigate the effectiveness of such a system in a pipelined architecture, where no control is possible over the morpho-syntactic analysis which precedes the semantic analysis and query formation.
This paper describes the conception of a rule-based tagger (part-of-speech disambiguator) of Czech currently being developed for tagging the Czech National Corpus (cf. \cite{cnk}). The input of the tagger consists of sentences whose words are assigned all possible morphological analyses. The tagger disambiguates this input by successive elimination of tags which are syntactically implausible in the sentential context of the particular word. Due to this, the tagger promises substantially higher accuracy than current stochastic taggers for Czech. This is documented by the results concerning the disambiguation of the most frequent ambiguous word form in Czech, the word se.
This paper presents work in progress, the goal of which is to develop a module for automatic transition from analytic tree structures to tectogrammatical tree structures within the Prague Dependency Treebank project. Several rule-based and dictionary-based methods were combined in order to be able to make maximal use of both information extractable from the training set and a priori knowledge. The implementation of this approach was verified on a testing set, and a detailed evaluation of the results achieved so far is presented.
A method for stochastically modelling bidirectionality in chart parsing is presented. A bidirectional parser, which starts analysis from certain dynamically determined positions of the sentence (the islands), has been built. This island-driven parser uses the stochastic model to guide the recognition process. The system has been trained and tested on two wide-coverage corpora: the Spanish Lexesp and the English Penn Treebank. The results of comparing our approach with the basic bottom-up approach are encouraging.
In this paper we apply the ensemble approach to the identification of incorrectly annotated items (noise) in a training set. In a controlled experiment, memory-based, decision tree-based and transformation-based classifiers are used as a filter to detect and remove noise deliberately introduced into a manually tagged corpus. The results indicate that the method can be successfully applied to automatically detect errors in a corpus.
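The filtering step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: it assumes that each classifier has already produced one prediction per training item (for example by cross-validation), and the function name `flag_suspect_items` and the two voting modes are hypothetical choices, since the abstract does not say how the classifiers' votes are combined.

```python
def flag_suspect_items(gold_labels, classifier_predictions, mode="consensus"):
    """Return indices of items whose annotated label the ensemble rejects.

    gold_labels: annotated labels, one per item.
    classifier_predictions: one list of predicted labels per classifier.
    mode: "consensus" flags an item only if every classifier disagrees with
          the annotation; "majority" flags it if more than half disagree.
          (Both voting schemes are assumptions made for this sketch.)
    """
    n_classifiers = len(classifier_predictions)
    suspects = []
    for i, gold in enumerate(gold_labels):
        # Count the classifiers that contradict the annotated label.
        disagreements = sum(
            1 for preds in classifier_predictions if preds[i] != gold
        )
        if mode == "consensus":
            is_suspect = disagreements == n_classifiers
        else:
            is_suspect = disagreements > n_classifiers / 2
        if is_suspect:
            suspects.append(i)
    return suspects


if __name__ == "__main__":
    gold = ["N", "V", "N", "D"]
    predictions = [
        ["N", "N", "V", "D"],  # e.g. a memory-based classifier
        ["N", "N", "V", "D"],  # e.g. a decision tree
        ["N", "V", "V", "N"],  # e.g. a transformation-based classifier
    ]
    print(flag_suspect_items(gold, predictions, "consensus"))
    print(flag_suspect_items(gold, predictions, "majority"))
```

A consensus filter removes fewer items and so risks keeping some noise, while a majority filter is more aggressive and risks discarding correct but difficult items.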
In the paper we present a method of syntactic parsing for inflectional languages. This method consists of several steps including morphological and syntactic levels of analysis. We propose a bottom-up model of syntactic analysis of the sentence. Its advantage shows in the case of ill-formed sentences, because the analyser is still able to parse at least parts of the sentence. We also describe an experimental implementation of the proposed method, which is based on the use of XML and regular expressions.
The detection of speech from silence (actually background noise) is essential in many speech-processing systems. In real-world applications, the correct determination of voice segments greatly improves the overall system accuracy and minimises the total computation time. This paper presents a novel, robust and reliable speech detection algorithm to be used in a speaker recognition system. The paper first introduces some basic concepts of speech activity detection and reviews the techniques currently used in speech detection tasks. Then, the proposed speech/non-speech detection algorithm is described and experimental results are discussed. Finally, conclusions about the algorithm's performance are presented. (This work was sponsored by Graphco Technologies Inc., Newtown, Pennsylvania, USA.)
This paper presents an application of a method for improving fundamental frequency detection from speech. The method is based on searching for the best pitch paths over one or more words. It uses the idea that the fundamental frequency of a speaker cannot change sharply in a short time, so the pitch should not vary rapidly over one (or a few) words. The technique does not detect the pitch itself; it relies on existing pitch detectors and improves their output. We compare several such detectors here and try to determine which is the most suitable one for our method.
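The idea of a best pitch path can be illustrated with a small dynamic-programming sketch in Python. It is an assumption-laden example rather than the method of the paper: the candidate format `(f0_hz, local_cost)`, the linear jump penalty and the parameter `jump_weight` are all hypothetical, and every frame is assumed to contain at least one candidate supplied by an external pitch detector.

```python
def best_pitch_path(candidates, jump_weight=1.0):
    """Pick one F0 candidate per frame so that the total cost is minimal.

    candidates: list of frames; each frame is a non-empty list of
                (f0_hz, local_cost) pairs from some external pitch detector.
    jump_weight: penalty per Hz of change between consecutive frames
                 (a hypothetical smoothness term).
    """
    if not candidates:
        return []
    # best[t][k]: minimal cost of any path ending at candidate k of frame t.
    best = [[cost for _, cost in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, len(candidates)):
        row, pointers = [], []
        for f0, cost in candidates[t]:
            # Cost of reaching this candidate from each previous candidate.
            options = [
                best[t - 1][j] + jump_weight * abs(f0 - prev_f0)
                for j, (prev_f0, _) in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row.append(options[j_best] + cost)
            pointers.append(j_best)
        best.append(row)
        back.append(pointers)
    # Trace the cheapest path back from the last frame.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][k][0])
        if t > 0:
            k = back[t][k]
    return path[::-1]


if __name__ == "__main__":
    frames = [
        [(100.0, 0.1), (200.0, 0.0)],  # the detector slightly prefers 200 Hz
        [(102.0, 0.2), (204.0, 0.3)],
        [(101.0, 0.0), (50.0, 0.0)],
    ]
    print(best_pitch_path(frames, jump_weight=0.01))
```

In the example, the first frame's locally cheapest candidate (200 Hz) is rejected because the smoother path around 100 Hz is cheaper overall, which is the kind of octave error such a path search is meant to suppress.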
In this paper, we present the results of applying two different centring algorithms \cite{Brennanetal:1987,Strube:1998} to Danish discourses. Then we describe how we have adapted the algorithm for resolving anaphora referring to both individual NPs and discourse deictics presented in \cite{EckertetStrube:1999} so that it covers Danish discourse deictics. The modified algorithm has been manually tested on Danish dialogues and the obtained results have been evaluated.
FlexVoice, an integrated text-to-speech (TTS) system, is presented in this paper. Its most distinctive feature is its low memory and CPU load while preserving the high quality of leading TTS systems. FlexVoice uses a hybrid approach that combines diphone concatenation with LPC-based parametric synthesis. Major improvements in speech quality are achieved by the careful design of each module at all synthesis levels (such as the selection of training data for the various machine learning methods and of the basic synthesis units for the parametric synthesiser). FlexVoice currently supports US English with two male and two female voices.
This paper argues that a dynamic, "left-to-right" approach to modelling syntax is best suited to the demands of the language modelling task. An outline of a dynamic grammar is presented, which is based on word-by-word transitions between incomplete "prefix" semantic structures. It is argued that a further advantage of this approach is that it dispenses with the need for any notion of syntactic structure, whether based on constituents or dependencies, and is thus preferable by the argument of Occam's razor.
This paper outlines a novel architecture for the development of a word sense disambiguation (WSD) system. It is based on the premiss that one way to improve the performance of such systems is through increased, and more flexible, human intervention. To this end a human-WSD program interface, WASPS (A Semi-Automatic Lexicographer's Workbench for Writing Word Sense Profiles, funded by EPSRC) is being developed for use by lexicographers in organising corpus data in the drawing up of new dictionary entries. A by-product of this activity will be an accurate sense disambiguation program.
In this paper, a partial prosodic analysis of two sets of speech styles in Czech is described. Four styles have been chosen as representatives of discontinuous styles (Address, Spelling, Headings, Dates and Prices); three styles represent continuous styles (Weather Forecast, Sport News, News). Based on this analysis, the prosody of the selected speech styles has been simulated using the Epos \cite{tsdhaha} speech system.
The paper describes the ITC-irst approach for handling spoken dialogue interactions over the telephone network. Barge-in and utterance verification capabilities are going to be introduced into the developed software architecture. Some research activities that should enable access to information in a new, large application domain (i.e. the tourism domain) have been started. The objectives of the research are language model adaptation and efficient information presentation, using a mixed representation approach.
In this paper, we present a corpus-based approach for tagging and chunking. The formalism used is based on stochastic finite-state automata. Therefore, it can include $n$-gram models or any stochastic finite-state automata learnt using grammatical inference techniques. As the models involved in our system are learnt automatically, the system is very flexible and portable across different languages and chunk definitions. In order to show the viability of our approach, we present results for tagging and chunking using different combinations of bigrams and other more complex automata learnt by means of the Error Correcting Grammatical Inference (ECGI) algorithm. The experiments were carried out on the Wall Street Journal corpus for English and on the LexEsp corpus for Spanish.
In text processing systems, German words require special treatment because compound words can be formed by combining existing words. To this end, a universal word analysis system is introduced which analyses all words in German texts according to their atomic components. A recursive decomposition algorithm, following the rules for word inflection, derivation, and compound generation in the German language, splits words into their smallest relevant parts (= atoms), which are stored in an atom table. The system is based on the foundations described in this article, and is being used for reliable, sense-conveying hyphenation, as well as for sense-conveying full text search, and in limited form also as a spelling checker.
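A recursive decomposition against an atom table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny atom set, the list of linking elements in `linkers` and the longest-prefix-first strategy are hypothetical, and the sketch ignores the inflection and derivation rules that the described system follows.

```python
def decompose(word, atoms, linkers=("", "s", "es", "n", "en")):
    """Return one split of `word` into known atoms, or None if there is none.

    atoms: set of lower-case atoms (a stand-in for the atom table).
    linkers: linking elements allowed between two atoms (an assumption).
    """
    word = word.lower()
    if not word:
        return []
    # Try the longest known prefix first, then back off to shorter ones.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in atoms:
            continue
        rest = word[end:]
        if not rest:
            return [head]
        for linker in linkers:
            if rest.startswith(linker):
                tail = decompose(rest[len(linker):], atoms, linkers)
                # An empty or missing tail means this linker choice failed.
                if tail:
                    return [head] + tail
    return None


if __name__ == "__main__":
    atom_table = {"donau", "dampf", "schiff", "fahrt", "arbeit", "zimmer"}
    print(decompose("Donaudampfschifffahrt", atom_table))
    print(decompose("Arbeitszimmer", atom_table))
```

The plain recursion may revisit the same suffix many times on long words; memoising on the remaining suffix would avoid that, at the cost of a slightly longer example.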
Resolution of referential ambiguity is one of the most challenging problems of natural language processing. It arises especially frequently in dialogues. We present a heuristic algorithm for detecting indirect antecedents for dialogue phrases, based on the use of a dictionary of prototypic scenarios associated with each headword as well as of a thesaurus of the standard type. The conditions for filtering the candidates for the antecedent are presented. We also present a similar algorithm for the reconstruction of elliptical phrases of a special kind using a combinatory dictionary.
The paper discusses combining transparent intensional logic with (dependency-based) categorial grammars, based on the idea of a Curry-Howard correspondence between categories and semantic types.
Parsing coordination units is important as part of the preprocessing phase of an NLP tool. In this paper, a method for modelling and recognising the coordination of some noun phrase classes in Serbo-Croatian is presented. The model is based on local grammars and can be applied to a digital text by using an equivalent finite-state transducer. Therefore, the implementation of the model is efficient and can be embedded as a preprocessor in a system that performs further linguistic analysis of a digital text. In addition to the recognition of a coordination of noun phrases, we discuss possibilities for checking some agreements between such a unit and other constituents of a sentence.
We discuss the automated construction of a translation mechanism capable of translating from any given textual input into some preferred logical notation. By constructing a syntax tree for the input text, we can augment it with semantic features through a data-flow analysis, in such a way that the parser, in the shape of a (DCG) Prolog program, may be extracted immediately. Over the years, our methods building on those principles have been developed to the point where they seem mature enough to handle ordinary, complicated texts meant for human communication. Scientific abstracts seem to be particularly interesting here. We shall discuss the analysis of a particular scientific abstract, namely that from an article by M. Sergot \etal{} \cite{MS86}. A highly tentative comparison is made with several important alternative approaches known from the scientific literature.
Two phases of an evaluation of annotating a Czech text corpus on an underlying syntactic level are described and the results are compared and analysed.
In this paper we present the results of our work on implementing a fast head-driven chart parser for the Czech language and on constructing an appropriate grammar covering all prevailing grammatical phenomena of Czech. We build on our previous work on syntactic analysis, which was based on the GLR mechanism. We have extended our metagrammar formalism so as to reinforce the declarativeness of the linguistic description. Because of the massive ambiguity of the grammar, we have enriched the head-driven chart parsing mechanism with probabilities obtained from a training treebank corpus.
This article presents automatic phonetic segmentation of natural speech based on the use of a speech synthesiser and the dynamic time warping (DTW) algorithm. The speech synthesiser is used to create a synthetic reference speech pattern with phonetic segmentation information (phonemes, diphones, syllables, intonation units, etc.). The reference synthetic speech pattern is then used in the alignment process. The main motivation for this work lay in the lack of usable segmentation tools for Czech, especially for the creation of prosodically labelled databases. The segmentation system has been developed for Czech and it uses the Czech TTS system.
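The alignment step can be illustrated with a textbook DTW sketch in Python. It is not the system described above: frames are reduced to single numbers instead of spectral feature vectors, the local distance is a plain absolute difference, and the helper `transfer_boundaries` is a hypothetical way of carrying segment boundaries from the synthetic reference over to the natural utterance. Both sequences are assumed to be non-empty.

```python
def dtw_path(reference, natural, dist=lambda a, b: abs(a - b)):
    """Align two non-empty frame sequences with dynamic time warping.

    Returns the optimal warping path as a list of
    (reference_index, natural_index) pairs.
    """
    n, m = len(reference), len(natural)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning the first i
    # reference frames with the first j natural frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(reference[i - 1], natural[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return path[::-1]


def transfer_boundaries(path, reference_boundaries):
    """Map each reference boundary frame to the first aligned natural frame."""
    first_match = {}
    for ref_index, nat_index in path:
        first_match.setdefault(ref_index, nat_index)
    return [first_match[b] for b in reference_boundaries]


if __name__ == "__main__":
    synthetic = [1, 1, 5, 5]       # two "phones", boundary at frame 2
    spoken = [1, 1, 1, 5, 5, 5]    # same phones, slower speech
    path = dtw_path(synthetic, spoken)
    print(transfer_boundaries(path, [0, 2]))
```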
Wide-frequency spectral analysis, autoregressive hidden Markov models (ARHMM) and self-organising neural networks (SOM) have been used for high-accuracy modelling of speaker features. An initial estimation of the ARHMM parameters based on the Kalman filter is proposed. A five-keyword speaker identification system has been built and tested. The experiments show that this approach provides high accuracy of speaker identification even if the same words are pronounced by different speakers.
In our paper we propose a new technique for language modelling of highly inflectional languages such as Czech, Russian and other Slavic languages. Our aim is to alleviate the main problem encountered in these languages, which is the enormous vocabulary growth caused by the great number of different word forms derived from one word (lemma). We reduced the size of the vocabulary by decomposing words into stems and endings and storing these sub-word units (morphemes) in the vocabulary separately. Then we trained a morpheme-based language model on the decomposed corpus. This paper reports perplexities, OOV rates and some speech recognition results obtained with the new language model.
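The effect of the decomposition on vocabulary size can be shown with a toy Python sketch. The splitting rule used here (strip the longest matching ending from a fixed list, keep a stem of at least two letters, mark endings with a leading `+`) and the ending list itself are assumptions made for illustration; the abstract does not specify how stems and endings are actually determined.

```python
def split_word(word, endings):
    """Split `word` into (stem, ending) using the longest matching ending."""
    for ending in sorted(endings, key=len, reverse=True):
        # Keep at least a two-letter stem so that very short words stay
        # intact (a hypothetical safeguard, not taken from the paper).
        if ending and word.endswith(ending) and len(word) - len(ending) >= 2:
            return word[: -len(ending)], "+" + ending
    return word, ""


def decompose_corpus(tokens, endings):
    """Replace every token by its stem followed by its marked ending."""
    units = []
    for token in tokens:
        stem, ending = split_word(token, endings)
        units.append(stem)
        if ending:
            units.append(ending)
    return units


if __name__ == "__main__":
    endings = {"a", "y", "u", "ou"}  # toy list of Czech noun endings
    tokens = ["žena", "ženy", "ženu", "ženou", "ryba", "ryby", "rybu", "rybou"]
    units = decompose_corpus(tokens, endings)
    print(len(set(tokens)), "word forms ->", len(set(units)), "sub-word units")
```

With only two lemmas the saving is small, but the number of distinct endings stays roughly constant while the number of stems grows with the corpus, which is where the vocabulary reduction comes from.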
We describe an implemented approach to text and sentence planning in a system for automatic generation of instruction texts. The approach aims at planning good quality texts through maximising coherence and cohesion.
In this paper, we propose two methods for speeding up discrete-utterance recognition in vocabularies with hundreds to several thousands of words. We show that acceptable results as well as a short response time can be achieved if the words are represented by concatenated monophone models (multi-mixture HMMs). In such a case, the computation load of the classic Viterbi procedure can be reduced significantly if a proper caching scheme is used. In several experiments done with test vocabularies containing hundreds and thousands of Czech words, we demonstrate that the recognition procedures can be sped up by a factor of 50 to 100 without a loss of accuracy. The method is well suited for voice-controlled systems with a large branching factor and little syntax, e.g. voice portals, telephone directory assistance, etc.
This paper presents ARTIC, a brand-new Czech text-to-speech (TTS) system. ARTIC (Artificial Talker in Czech) is a concatenation-based system which consists of three main, relatively independent components: a speech segment database (SSD), a text analyser and a speech synthesiser. A statistical approach to speech segment database construction is used. The text processing module includes the phonetic transcription of written text and the conversion to synthetic units. Speech processing is performed using a PSOLA algorithm.
Departing from a dialogue model that uses \colinterm{discourse obligations} as basic expressive means, we propose a set of inference rules that assign intentional structures to sequences of dialogue moves. We can then demonstrate that, from our point of view, \colinterm{conversational games} can be seen as macro-structures which are decomposable into smaller functional units, where the coherence between the latter is explained in terms of obligations.
An application of dialogue systems for developing computer programs is described and discussed in this paper. The concept of generating program source code by means of a dialogue involves combining strategies with system and user initiative. The strategy with system initiative safely navigates the user, whereas the strategy with user initiative enables a quick and effective creation of the desired constructions of the source code and collaboration with the system using obtained knowledge to increase the effectiveness of the dialogue. The described system, which was initially developed for visually impaired users, can also be used by novice programmers as a tool for learning programming languages.
Fifteen short sentences spoken by four male speakers have been used as the test material. Each speaker has been asked to pronounce the sentences at three different rates: fast, normal, and slow. For the perceptual experiments, two kinds of segmentation have been made: 1) one-syllable segmentation and 2) two-syllable segmentation. In the one-syllable segmentation, individual CV-syllables have been taken out of their contexts and presented to listeners. In the two-syllable segmentation, every two consecutive syllables have been isolated from the running speech, and the listeners have to identify each of the two syllables. In the first experiment, the results reveal that individual syllables do not carry enough phonetic information to be correctly identified, especially in fast speech. The average identification rate of syllables in fast speech is 35%, and even vowels are identified correctly in less than 60% of cases. In the second experiment, however, syllable identification rose to a certain extent: for fast speech, 54% for the first syllable and 73% for the second syllable. For normal speech, the rates were 50% and 88%, respectively, and for slow speech, they were 56% and 90%, respectively.
The increasing problem of information overload can be reduced by the improvement of information access tasks like Information Retrieval. Relevance Feedback plays a key role in this task, and is typically based only on the information extracted from documents judged by the user for a given query. We propose to make use of a thesaurus to complement this information to improve RF. This must be done by means of a Word Sense Disambiguation process that correctly identifies the suitable information from the thesaurus {\sc WordNet}. The results of our experiments show that the utilisation of a thesaurus requires Word Sense Disambiguation, and that with this process, Relevance Feedback is substantially improved.
We present ongoing work on prosody prediction for speech synthesis. This approach considers sentences as tree-like structures and decides on the prosody from a corpus of such structures using machine learning techniques. The prediction is derived from the prosody of the closest sentence in the corpus through tree similarity measurements in a nearest neighbour context. We introduce a syntactic structure and a performance structure representation, describe the tree similarity metrics considered, and then discuss the prediction method. Experiments are currently in progress to evaluate this approach.
Our experiments in Speaker Recognition showed that the combination of Speaker Verification and Utterance Verification techniques is an efficient way of improving the performance of a Speaker Authentication System. This technique is now implemented in the speaker authentication module of TelCorreo, an e-mail client that allows Internet users to read their e-mail using speech through the telephone.
In this paper we present TelCorreo \cite{telcorreo}: an e-mail client that allows Galician Internet users to read their e-mail messages from locations where there is no web-connected computer available. This task is performed using speech through the telephone, and the system uses speech technology developed for Galician, including a speech recognition system and a text-to-speech converter.
The linguistic methods providing grammatical and lexical control of the acoustic recognition results are described. The Parser (NL-Processor) uses an original local grammar, an ATN-type grammar, and comprehensive computer dictionaries (Linguistic Knowledge Base, LKB). To choose the best of the plausible strings, a special meta-level is introduced that deals with the multi-criterion choice problem. An important feature of the techniques employed is that the best string does not necessarily have to be a grammatical one. An original approach to the correction of the lexical n-gram dictionary is described too. This approach was later generalised and is now considered the basis of a new computer-aided technology for LKB construction. The main component of this approach is the METAMODEL, based on UML and fuzzy mathematics. Both human and computer should participate in the LKB construction process: the human contributes intelligence and language intuition, the computer its speed, memory and computational capabilities.
The study of pronunciation variability is an important phonetic task which has many applications (in speech synthesis and recognition systems, language teaching, forensic phonetics, etc.), and the well-known variability of Russian speech deserves special investigation. Generally, one may distinguish geographical, functional, social and national variability of pronunciation. At the same time, any type of pronunciation is characterised by its own internal variability. This also applies to standard speech. A database of contemporary Russian speech variability is currently being created in the Laboratory of Experimental Phonetics of St. Petersburg University in order to investigate the pronunciation variance of different speech types. The database of sound material is formed by recordings of the Phonetically Representative Texts pronounced by 1) standard Russian speakers, 2) regional Russian speakers, 3) speakers from the former USSR republics for whom Russian is a second language, and 4) foreigners from European, Asian and American countries. Along with the sound database, an expert linguistic system is being created, which allows analysis and investigation of particular speech samples, types and corresponding parameters. The system of speech transcription, which should adequately reflect the actual variability, plays an important role here.
Statistical parameters commonly used in diagnostic procedures often cannot be considered consistent from the statistical point of view, since they depend strongly on sample size. This considerably devalues diagnostic results. This paper concerns the problem of verifying the consistency of parameters in the initial (pre-classification) stage of research. A complete list of parameters which may be useful for describing the lexicostatistical structure of a text was determined. Each of these parameters was subjected to the justifiability test. As a result, a number of consistent parameters have been selected, which provide a tool for describing the system characteristics of any text or corpus. Because they converge rapidly to their limit values, they can be used effectively in classification procedures on text data of arbitrary size. The proposed approximation model also makes it possible to forecast the values of all parameters for any sample size.
In this paper we present the method we have used to collect a large database from live recordings of a radio transmission and of an announcer. Our selection method uses phonetic and prosodic context information and frequency domain measurements in a search algorithm based on minimising two cost functions: the concatenation cost between adjacent speech units and the distance between the selected unit and a previously calculated target segment. For each basic unit we have collected, on average, around 100 similar units in different phonetic and prosodic contexts. To segment the basic units, we have used our highly accurate speech recognition system described in \cite{45bib1}. This collection of units has been used in our TTS system, giving a marked improvement in naturalness. With this new system, we hope to help people with disabilities who need to listen to or speak with artificial speech.
The paper proposes a new framework for constructing topic-sensitive language models for large vocabulary speech recognition. Once a domain of discourse has been identified, a model appropriate for the current domain can be built. In our experiments, the target domain was represented by a piece of text. By using appropriate features, a sub-corpus of a large collection of training text was extracted. Our feature selection process was especially suited to languages where words are formed with many different inflectional affixes. All words with the same meaning (but different grammatical form) were collected in one cluster and represented as one feature. We used the heuristic word weighting classifier $\mathit{TFIDF}$ (term frequency / inverse document frequency) to further shrink the feature vector. The final language model was built by interpolating topic-specific models and a general model. Experiments have been carried out using English and Slovenian corpora.
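To make the TFIDF weighting concrete, the following Python sketch computes one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and keeps the highest-weighted features. The exact variant used in the paper is not stated, so this formula, the function names and the toy documents are assumptions; tokens are assumed to have already been mapped to their word-form clusters.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF weights for a list of tokenised documents.

    Returns one {term: weight} dictionary per document, using
    weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_features(doc_weights, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    docs = [
        ["speech", "model", "speech", "the"],
        ["the", "weather", "model"],
        ["the", "speech", "corpus"],
    ]
    weights = tfidf(docs)
    print(top_features(weights[0], 2))
```

A term that occurs in every document, such as "the" in the example, receives weight zero and therefore drops out of the feature vector first.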
This paper concerns a speaker independent recognition engine of Czech continuous speech designed for Czech telephone applications and describes the recognition module as an important component of a telephone dialogue system being designed and constructed at the Department of Cybernetics, the University of West Bohemia. The recognition is based on a statistical approach. Left-to-right three-state HMMs with an output probability density function expressed as a multivariate Gaussian mixture are used to model triphones as basic units in acoustic modelling, and stochastic regular grammars are implemented to reduce the task perplexity. A real-time recognition process is supported by an approach that substantially reduces the computation cost of estimating log-likelihood scores of Gaussian mixtures, and also by beam pruning used during Viterbi decoding. The present paper concerns the main part of the engine - a speaker independent recognition engine for continuous Czech speech.
An algorithm for modelling and generating prosody, based on a syntactical analysis, is described in this paper. The model provides a common framework for including prosody in spoken dialogue systems. The algorithm has been used to develop a system for natural language interaction with a mobile robot.
This paper investigates dynamic semantics of conversations from the point of view of semantical closedness, presuppositions and shared belief/common knowledge updates, by analysing the meta-expression "what do you mean (by X)?" into three major usages: semantic repair initiation, intentional repair initiation, and inferential repair initiation, since these three usages are deeply related to three types of semantical closedness: closedness of denotations, closedness of intention and closedness of inference of conversations. As a result, the proposed dynamic semantics of conversations is semantically closed in terms of shared beliefs of the conversants.
This paper presents a simplified approach to processing ambiguous requests in spoken dialogues in train timetable information service systems. The simplified processing of ambiguous and incomplete utterances is based on a simple representation of the semantics of analyzed utterances by specially constructed frames, on the special representation and storage of facts extracted from the previous steps of the dialogue (dialogue path, or dialogue history), and on the creation of a simple knowledge base containing the reasoning rules that link the meaning detected in the analyzed user's utterance with the facts stored in the dialogue path or in the system knowledge base. Some aspects of implementing the completion of the frames that internally represent utterances, and of their evaluation, are discussed in the concluding part of this paper.
Much can be said about the importance of speaker identification for people, but nothing is as telling as imagining a life without any speaker identification ability. For example, if we could not identify people from their voices, it would be impossible, without additional information, to decide whom we are talking to on the telephone. Of course, this ability seems simple to us, but computer-based implementations are still far from human abilities. Furthermore, no computer-based speaker identification system can be designed as an optimum solution. It is known that there is no optimum feature set definition for speaker identification systems. In this work, we study the dependency of speaker identification performance on the choice of frequency bands.
This paper describes an efficient algorithm for Japanese sentence compaction. First, a measure of grammatical goodness of phrase sequences is defined on the basis of a Japanese dependency grammar. A measure of topical importance of phrase sequences is also given. Then the problem of sentence compaction is formulated as an optimisation problem of selecting a subsequence of phrases from the original sentence that maximises the sum of the grammatical goodness and the topical importance. A recurrence equation is derived by using the principle of dynamic programming, which is then translated into an algorithm to solve the problem. The algorithm runs in polynomial time with respect to the original sentence length. Finally, an example of sentence compaction is presented.
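The dynamic-programming formulation in the preceding abstract can be illustrated with a small sketch. It is a simplification and not the authors' algorithm: it assumes a per-phrase importance score and a pairwise goodness score between consecutive selected phrases (the names `importance`, `goodness` and `compact` are hypothetical), whereas the paper derives its goodness measure from a Japanese dependency grammar.

```python
from typing import List, Tuple


def compact(importance: List[float],
            goodness: List[List[float]],
            k: int) -> Tuple[float, List[int]]:
    """Pick k phrase indices (in order) maximising the sum of importance
    scores plus goodness[i][j] over consecutive selected phrases i < j.

    Assumption: goodness is a chain-style pairwise score, used here only
    as a stand-in for a dependency-based measure.
    """
    n = len(importance)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of phrases")
    neg_inf = float("-inf")
    # best[m][j]: best score of a subsequence of m + 1 phrases ending at j
    best = [[neg_inf] * n for _ in range(k)]
    back = [[-1] * n for _ in range(k)]
    for j in range(n):
        best[0][j] = importance[j]
    for m in range(1, k):
        for j in range(m, n):
            for i in range(m - 1, j):
                cand = best[m - 1][i] + goodness[i][j] + importance[j]
                if cand > best[m][j]:
                    best[m][j] = cand
                    back[m][j] = i
    # Choose the best end position, then follow back-pointers.
    end = max(range(k - 1, n), key=lambda j: best[k - 1][j])
    score = best[k - 1][end]
    path = []
    j = end
    for m in range(k - 1, -1, -1):
        path.append(j)
        j = back[m][j]
    path.reverse()
    return score, path


if __name__ == "__main__":
    imp = [1.0, 0.2, 0.8, 0.5]
    good = [[0.0, 0.1, 0.9, 0.2],
            [0.0, 0.0, 0.3, 0.1],
            [0.0, 0.0, 0.0, 0.7],
            [0.0, 0.0, 0.0, 0.0]]
    print(compact(imp, good, 3))
```

The three nested loops give a running time of O(k n^2) for n phrases, which is polynomial in the sentence length, in line with the complexity claim in the abstract.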
While current speech recognisers give acceptable performance in carefully controlled environments, their performance degrades rapidly when they are applied in more realistic situations. Generally, environmental noise may be classified into two classes: wide-band noise and narrow-band noise. While the multi-band model has been shown to be capable of dealing with speech corrupted by narrow-band noise, it is ineffective for wide-band noise. In this paper, we suggest a combination of the frequency-filtering technique with the probabilistic union model in the multi-band approach. The new system has been tested on the TIDIGITS database, corrupted by white noise, noise collected from a railway station, and narrow-band noise, respectively. The results have shown that this approach is capable of dealing with noise of narrow-band or wide-band characteristics, assuming no knowledge about the noisy environment.
In this paper, we offer one possible way to bridge the gap existing between the lexical and logical semantic analysis of natural language expressions. We look for a link that would allow us to perform a more adequate and integrated semantic analysis of natural language expressions. The solution is to combine the descriptions of lexical units as they are developed within the area of lexical semantics (e.g. WordNet) with logical analysis of sentence meanings worked out within the Transparent Intensional Logic framework. The data structures based on both approaches may take the form of richer dictionary entries that together would form a dictionary of the new type - a Lexico-Logical Dictionary (LLD).
Anaphora resolution researchers usually focus on defining a set of sources from which to extract information that helps anaphora resolution. Other works extract relationships between utterances and anaphors, obtaining satisfactory results on languages with short anaphoric accessibility spaces, such as English. In this work, we state that anaphora resolution in Spanish needs an adequate definition of accessibility spaces, and we then propose an annotation scheme for dialogue structure to define these spaces. This proposal has been tested, achieving successful results when applied to real human-to-human telephone conversations.
The speech synthesis system developed at the Department of Phonetics of St. Petersburg University combines the advantages of two methods of concatenative speech synthesis: the diphone-based and the allophone-based ones. In the new synthesis system, we use physical realizations of allophones to form the consonant inventory. The physical realizations of allophones are chosen with regard to their right phonetic context. To form the vowel set, we used a method that combines allophone-based and diphone-based synthesis principles: the database contains halves of vowel allophones (from the physical beginning of the allophone up to its middle, and from the middle of the allophone up to its right physical boundary).
In Czech corpora, compound verb groups are usually tagged in a word-by-word manner. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. We present an improved method for automatic synthesis of verb rules. These rules describe all compound verb groups that are frequent in Czech. Using these rules, we can find compound verb groups in unannotated texts with high accuracy. The system for tagging compound verb groups in an annotated corpus that exploits the verb rules is described. \\[1mm] {\bf Keywords:} compound verb groups, corpora, morphosyntactic tagging, inductive logic programming
In this paper, we present some aspects of a cooperative web information retrieval system in the law domain. Our system is able to infer the user intentions and to keep the context of the user interaction in order to supply suggestions for further refinement of the user query. One important aspect of our system is its ability to compute clusters of documents associating a keyword to each cluster. A detailed example of an interaction with the system is presented.
This paper follows up on our papers presented at the previous TSD workshops \cite{62ref2,62ref3} and concerns the Czech speech corpus which is being developed at the Department of Cybernetics, University of West Bohemia in Pilsen. It describes the procedures of corpus recording and annotation.
Filled pauses are normally used as a planning strategy: they signal the speaker's intention to hold the floor in a conversation. They are normally realised by inserting a vowel (optionally followed by a nasal), but in Italian they can be produced by lengthening the final vowel of a word. Word-final lengthening filled pauses are thus an intermediate category between lexical and non-lexical speech events. In human-machine interaction, the system should be able to discriminate between a "default" lexical speech event and one characterised by a word-final lengthening used as a planning strategy: in this second case, the related communicative intention has to be additionally recognised. Our preliminary investigation shows that duration and F0 shape are reliable acoustic cues for identifying word-final lengthening filled pauses in a variety of Italian.
Sensitive words are compound words whose syntactic category is different from those of their components. Depending on the segmentation, a sensitive word may play different roles, leading to significantly different syntactic structures. If a syntactic analysis fails for a Chinese sentence, instead of examining each segmentation alternative in turn, sensitive words should be examined first in order to change the syntactic structure of the sentence. This will lead to higher efficiency. Our examination of a machine-readable dictionary shows that there are a great number of such words, which indicates that sensitive words are a widespread phenomenon in Chinese.
This paper describes a Unit Selection system based on diphones that was developed by the Speech Technology Group of the Enginyeria Arquitectura La Salle School, Universitat Ramon Llull. This system works with a PSOLA synthesiser for the Catalan language, which is used in an Oral Synthesised Message Editor (EMOVS) and in Windows applications developed using Microsoft SAPI. Some common questions about Unit Selection are formulated in order to find solutions and achieve a better segmental speech quality.
Previous work analyzed the information in speech using analysis of variance (ANOVA). ANOVA assumes that sources of information (phone, speaker, and channel) are univariate Gaussian. The sources of information, however, are not unimodal Gaussian. Phones in speech recognition, for example, are generally modeled using a multi-state, multi-mixture model. Therefore, this work extends ANOVA by assuming phones with 3-state, single-mixture distributions and 5-state, single-mixture distributions. This multi-state model was obtained by extracting variability due to position within the phone from the error term in ANOVA. Further, linear discriminant analysis (LDA) is used to design discriminant features that better represent both the phone-induced variability and the position-within-phone variability. These features perform significantly better than conventional discriminant features obtained from a 1-state phone model on a continuous digit recognition task.
In this article, we present a test environment for a word analysis system that is used for reliable and sense-conveying hyphenation of German words. A crucial task is the hyphenation of compound words, a huge number of which can readily be formed from existing words. Because of this, testing and checking all existing words for correct hyphenation is infeasible. Therefore we have developed special test methods for large text files which filter the few problematic cases from the complete set of analysed words. These methods include detecting unknown or ambiguous words, comparing the output of different versions of the word analysis system, and choosing dubious words according to other special criteria. The test system is also suited for testing other applications that are based on word analysis, such as full text search.
The paper presents the technology of building a large German-French parallel corpus consisting of official documents of the European Union and Switzerland, and private and public organisations in France and Germany. The texts are morphosyntactically annotated, aligned at the sentence level and marked up in conformance with the TEI guidelines for standardised representation. The multi-level alignment method is applied; its precision is improved due to the correlation with the constraints of the classical alignment method of Gale and Church. The alignment information is encoded externally to the parallel text documents. The process of creating the corpus is an interesting algorithm of applying a number of software tools and adjusting intermediate production results.
This article describes an architectural framework of a multi-modal dialogue system. The framework is based on separation between the semantic and the syntactic parts of the dialogue. The semantics of the human-computer conversation is captured using a formal language. It describes the conversation by means of sequences of dialogue elements. The paper further elaborates how to derive the syntactic features of the conversation. A two-layer architecture has been proposed for the dialogue system. The upper layer, called sequencer, works with the description of the whole dialogue. The lower layer (driver dock) deals with individual dialogue elements. A prototype has been implemented to demonstrate the main benefits of our framework. These are adaptability and extensibility. The adaptability involves multiple modes of communication, where the modalities can be changed even during the course of a dialogue. Due to layering, the applications can be easily extended to additional modes and user interface devices.
This study examines the textual relationships involved in the use of italics, which is considered by the researchers to be the author's way of textually signalling what in speech would be realized with phonetic prominence. Four Swedish children's stories were studied. Each instance of italics usage was analysed. Italics usage seemed to fall into two categories, contrast or emphasis, and phonetic focus seems to be the most appropriate way for the author to signal this. In order to interpret the examples, it was necessary to take the beliefs of the individual characters and the common ground into consideration. The authors argue that processing some examples involved accommodating a contrast set and that this process often characterised cases of emphasis.
In this paper we describe a method of effective storage of linguistic data by means of covering and inhibiting patterns. A methodology for developing such patterns is outlined, and results in the areas of morphology and hyphenation are given.
Information about the dialogue state can be integrated into language models to improve the performance of the speech recogniser in a dialogue system. A dialogue state is defined in this paper as the question the user is replying to. One of the main problems in dialogue-state dependent language modelling is the limitation of training data. In order to obtain robust models, we use the method of rational interpolation to smooth between a dialogue-state dependent and a general language model. In contrast to linear interpolation methods, rational interpolation weights the different predictors according to their reliability. Semantic-pragmatic knowledge is used to enlarge the training data of the language models. Both methods reduce perplexity and word error rate significantly.
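The contrast drawn in the preceding abstract, between a fixed interpolation weight and a weight that reflects predictor reliability, can be sketched on unigram counts. The abstract does not give the rational interpolation formula, so the count-dependent weight n / (n + c) below is only an illustrative assumption and not the authors' method; the function names and the constant `c` are hypothetical.

```python
from collections import Counter


def ml_prob(counts: Counter, word: str) -> float:
    """Maximum-likelihood unigram probability; 0.0 if there is no data."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def linear_interp(state_counts: Counter, general_counts: Counter,
                  word: str, lam: float = 0.5) -> float:
    """Linear interpolation with a fixed weight lam on the state model."""
    return (lam * ml_prob(state_counts, word)
            + (1.0 - lam) * ml_prob(general_counts, word))


def reliability_interp(state_counts: Counter, general_counts: Counter,
                       word: str, c: float = 10.0) -> float:
    """Interpolation whose weight grows with the amount of state data.

    Assumption: lam = n / (n + c), where n is the number of training
    tokens seen in this dialogue state. This is a stand-in for a
    reliability-based weight, not the paper's rational interpolation.
    """
    n = sum(state_counts.values())
    lam = n / (n + c)
    return (lam * ml_prob(state_counts, word)
            + (1.0 - lam) * ml_prob(general_counts, word))


if __name__ == "__main__":
    general = Counter({"yes": 5, "no": 5, "prague": 10})
    state = Counter({"yes": 3, "no": 1})  # replies to a yes/no question
    print(linear_interp(state, general, "yes"))
    print(reliability_interp(state, general, "yes", c=12.0))
```

With little state-specific data the second estimator leans on the general model, and with no state data at all it reduces to the general model.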
The need for the detection of ambiguous prepositional groups (Pg's) is discussed in the paper. Criteria for automatic Pg-disambiguation are presented, with special emphasis on word order patterns and verbonominal collocations, the main source of non-projective constructions caused by non-congruent Pg-modifiers of nouns. The usefulness of the proposed criteria and their ordering within an analysis by reduction are presented.
In this paper, an approach to speaker identification based on an estimation of the parameters of a linear speech-production model is presented. The estimation is based on the discrete Kalman estimator. It is generally supposed that the vocal tract can be modelled by a system with constant parameters over short intervals. Taking this assumption into account, we can derive a special form of the discrete Kalman estimator for the model of speech production. The parameters of the vocal tract model obtained by this Kalman estimation are then used to compute a new type of cepstral coefficients, which we call Kalman cepstral coefficients (KCCs). These coefficients were used in text-independent speaker identification experiments based on discrete vector quantisation. The results achieved were then compared with results obtained by using the LPC-derived cepstral coefficients (LPCCs). The experiments were performed in a closed group of 591 speakers (312 male, 279 female).
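The vector-quantisation step of closed-set speaker identification mentioned in the preceding abstract can be sketched as follows. This is a generic minimum-distortion classifier under stated assumptions, not the authors' implementation: the feature vectors stand in for cepstral coefficients such as KCCs or LPCCs, the per-speaker codebooks are assumed to have been trained beforehand, and all names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def avg_distortion(frames: List[Vector], codebook: List[Vector]) -> float:
    """Mean distance from each frame to its nearest codeword."""
    return sum(min(sq_dist(f, c) for c in codebook) for f in frames) / len(frames)


def identify(frames: List[Vector], codebooks: Dict[str, List[Vector]]) -> str:
    """Return the enrolled speaker whose codebook quantises the test
    frames with the lowest average distortion (closed-set decision)."""
    return min(codebooks, key=lambda spk: avg_distortion(frames, codebooks[spk]))


if __name__ == "__main__":
    # Toy 2-dimensional "cepstral" vectors; real features have more dimensions.
    books = {
        "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
        "speaker_b": [(5.0, 5.0), (6.0, 6.0)],
    }
    test_frames = [(0.1, 0.2), (0.9, 1.1)]
    print(identify(test_frames, books))
```

Because the decision ignores the order of the frames, it does not depend on what is said, which is what makes this kind of classifier usable for text-independent identification.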
One of the factors complicating work with speech signals is their large degree of acoustic variability. To decrease the influence of this acoustic variability, we propose using genetic algorithms in speech processing systems. We constructed a program model which implements speech recognition using genetic algorithms. We ran experiments on this model with a database of isolated Belarusian words and achieved optimal results.
This paper presents a stochastic segmental speech recogniser that models the a posteriori probabilities directly. The main issues concerning the system are segmental phoneme classification, utterance-level aggregation and the pruning of the search space. For phoneme classification, artificial neural networks and support vector machines are applied. Phonemic segmentation and utterance-level aggregation are performed with the aid of anti-phoneme modelling. At the phoneme level, the system convincingly outperforms the HMM system trained on the same corpus, while at the word level it attains the performance of the HMM system trained without embedded training.
Information extraction is a key component in dialogue systems. Knowledge about the world as well as knowledge specific to each word should be used for robust semantic processing. An intelligent agent is necessary for a dialogue system when meanings are strictly defined by using a world state model. An extended concept structure is proposed to represent the knowledge associated with each word in a "speech-friendly" way. By considering the knowledge stored in the word concept model as well as the knowledge base of the world model, the meaning of a given sentence can be correctly identified. This paper describes the extended concept structure and how knowledge about words can be stored in this concept model.
This paper describes SYSTAR (SYntactic Simplification of Text for Aphasic Readers), the PSet module which splits compound sentences, activises seven agentive passive clause types, and resolves and replaces eight frequently occurring anaphoric pronouns. We describe our techniques and strategies, report on the results obtained after evaluation of a corpus of 100 newspaper articles downloaded from the website of a daily provincial, and briefly report on experimental studies with aphasic participants.
This paper describes results achieved with a text-document classification tool, TEA (TExt Analyzer), based on the naive Bayes algorithm. TEA also provides a set of additional functions that can assist users in fine-tuning the text classifiers and improving the classification accuracy, mainly through modifications of the dictionaries generated during the training phase. The experiments described in the paper were aimed at supporting work with unstructured medical text documents downloaded from the Internet. Good and stable results (a classification accuracy of around 97%) were achieved for selecting documents in a certain area of interest among a large number of documents from different areas.
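The naive Bayes classification named in the preceding abstract can be illustrated with a minimal multinomial sketch. It is not the TEA tool: whitespace tokenisation, Laplace smoothing with `alpha`, and the `drop_words` helper (a hypothetical stand-in for the dictionary modifications the abstract mentions) are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, List


class NaiveBayesText:
    """Multinomial naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_docs = Counter()             # documents seen per class
        self.word_counts = defaultdict(Counter) # word counts per class
        self.vocab = set()                      # dictionary built in training

    def fit(self, docs: List[str], labels: List[str]) -> "NaiveBayesText":
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def drop_words(self, words: Iterable[str]) -> None:
        """Remove words from the dictionary (hypothetical tuning step)."""
        for w in words:
            self.vocab.discard(w)
            for counts in self.word_counts.values():
                counts.pop(w, None)

    def predict(self, text: str) -> str:
        # Words outside the dictionary are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_docs.values())
        v = len(self.vocab)
        best_label, best_lp = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * v
            lp = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                lp += math.log((counts[t] + self.alpha) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label


if __name__ == "__main__":
    clf = NaiveBayesText().fit(
        ["patient treatment dose therapy", "clinical therapy patient trial",
         "football match goal team", "team wins match in final"],
        ["medical", "medical", "sport", "sport"],
    )
    print(clf.predict("new therapy for the patient"))
```

Working in log space avoids numerical underflow on long documents, and dropping uninformative words from the dictionary is one simple way a user could tune such a classifier after training.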