Sound-to-Gesture Inversion in Speech: Mapping of Action and Perception

Description

Project Title:
Sound-to-Gesture Inversion in Speech: Mapping of Action and Perception
Acronym:
SPEECH MAPS
Number:
6975
Work Area:
Speech & Natural Language
Coordinator:
INPG - Université de Stendhal
Institut de la Communication Parlée
URA-CNRS 368, B.P. 25
F - 38040 GRENOBLE CEDEX 09
Coordinator Country:
F
Partners
University of Leeds UK
Telecom Paris/Arecom F
Institut Estudis Catalans E
KTH S
Université de Lausanne CH
Associate partners
Universität Köln G
Université de Strasbourg II F
University of Southampton UK
Dublin City University IRL
Trinity College Dublin IRL
Università di Genova I
University of Lund S
Contact Point:
Dr. C. Abry and P. Badin
Telephone:
+33/76 82 43 37 and 76 57 48 26
Fax:
+33/76 82 43 35 and 76 57 47 10
E-Mail:
badin@icp.grenet.fr
Keywords:
speech inverse acoustics, speech production, speech audiovisual integration, speech robotics
Start Date:
1 September 92
Duration:
36 months
Status:
running
Abstract:
SPEECH MAPS aims to answer, both theoretically and technologically, a basic question in speech inverse acoustics: Can an articulatory robot learn to produce articulatory gestures from sounds? The robotics approach allows the mapping of action and perception in speech. The building of a dedicated learning architecture, Articulotron, incorporating an audiovisual perceptron as a front end, has been undertaken to bring a decisive advance towards solving the speech inverse problem. This could lead to major spinoffs for synthesis and recognition applications.

AIMS

Inverse mapping from speech sounds to articulatory gestures is a difficult problem, primarily because of the nonlinear, many-to-one relationship of articulation to acoustics: several different articulatory configurations can produce the same sound. So far, it has been an ill-posed problem, in the mathematical sense. Due to recent outstanding progress in robotics, it is now possible to answer, both theoretically and technologically, a basic question in speech inverse acoustics: Can an articulatory robot learn to produce articulatory gestures from sounds?

APPROACH AND METHODS

One can conceive of two complementary approaches to the speech inversion problem. The first uses all the available knowledge in signal processing to identify the characteristics of the sources and filters corresponding to the vocal tract that produced the speech signal. The second is borrowed from control theory, and aims at determining inverse kinematics and/or dynamics for an articulatory robot with excess degrees of freedom. In both approaches, there is a clear need for knowledge of the direct mapping (from articulation to acoustics), in order to find constraints that regularise the solution.
Following basic schemes in robotics, the speech production model is represented here by a realistic articulatory model, the plant, driven by a controller, i.e. a sequential network capable of synthesising motor sequences from sound prototypes. This ensemble, called Articulotron, displays fundamental spatio-temporal properties of serial ordering in speech (coarticulation phenomena) and adaptive behaviour to compensate for perturbations.
The robotics approach for speech allows the unification of Action and Perception. If speech communication is conceived of as a trade-off between the cost of production and the benefit of understanding, the constraints will be borrowed from the articulatory level, and the specific low-level processing from auditory and visual perception. Using an Audiovisual Perceptron to incorporate vision will lead to a more comprehensive formulation of the inversion problem: How can articulatory gestures be learned from hearing and seeing speech?
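
The role of regularisation in inverting a many-to-one mapping with excess degrees of freedom can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the linear map A, the names forward and invert, the "rest posture" penalty and the parameter values are hypothetical and are not the project's plant, controller or method. The sketch only shows that many articulatory vectors give the same acoustic output, and that a penalty pulling towards a preferred posture selects one of them.

```python
# Toy forward map (hypothetical): 3 articulatory parameters -> 2 acoustic ones.
# With more inputs than outputs, the map is many-to-one.
A = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]


def forward(x):
    """Direct mapping: articulatory vector x -> acoustic vector A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def invert(target, rest, lam=1e-3, step=0.2, iters=500):
    """Regularised inversion by gradient descent.

    Minimises 0.5*||A x - target||^2 + 0.5*lam*||x - rest||^2,
    starting from the rest posture. The second term is the
    regularising constraint (an assumed choice for this sketch).
    """
    x = list(rest)
    for _ in range(iters):
        err = [f - t for f, t in zip(forward(x), target)]
        grad = [
            sum(A[i][j] * err[i] for i in range(len(A))) + lam * (x[j] - rest[j])
            for j in range(len(x))
        ]
        x = [xj - step * g for xj, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    # Two different articulatory vectors with the same acoustic output.
    print(forward([1.0, 0.0, 1.0]), forward([2.0, -2.0, 2.0]))
    # The rest posture decides which solution the inversion returns.
    print(invert([1.0, 1.0], rest=[0.0, 0.0, 0.0]))
    print(invert([1.0, 1.0], rest=[1.0, 0.0, 1.0]))
```

In this sketch the acoustic target alone does not determine the articulation; the choice of rest posture does. A realistic plant would be nonlinear, so the gradient would have to be obtained from the model or learned rather than written in closed form.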

POTENTIAL

The integrated approach propounded in this project should lead (together with the Articulotron, the Audiovisual Perceptron and other tools for speech processing) to major "spinoffs" in R&D. Speech synthesis will greatly benefit from the learning ability of a robot taking advantage of adaptive biological principles. Low bit-rate transmission of speech can also be developed from this approach, through access to articulatory codebooks. Finally, speech recognition that uses vision to enhance the acoustic signal in noise would also benefit from this low-level inverse mapping.

PROGRESS AND RESULTS

Available deliverables cover the four main areas of research in the project, i.e. sources and vocal tract modelling, motor control, and audio and visual processing:
- Aerodynamic, acoustic and laryngograph data have been recorded in order to study the generation of excitation sources (noise and voice sources).
- A voice source model (Liljencrants-Fant) has been assessed by comparison with inverse filtered natural speech.
- The dynamics of voice and noise sources have been studied, especially glottis-constriction coordination for fricatives, and variations of the voice source in vowel-consonant sequences.
- For vocal tract geometric and acoustic data, scanner and video measurements of the vocal tract have been made, and software for digitising labial and X-ray films has been developed. Vocal tract bioacoustic measurements have been performed using a new technique and compared with a database of reference transfer functions.
- Articulatory-to-acoustic modelling has resulted in acoustic vocal tract simulation software that includes several new features (sources, VT energy loss mechanisms). An articulatory-acoustic codebook has been generated with a first version of the Speech Maps Interactive Plant (SMIP).
- A first set of data on articulatory timing has been recorded for the study of vocalic and consonantal coarticulation.
- A speech timing model was developed as a first step towards modelling motor encoding-programming.
- Methods for the recovery of articulatory trajectories of vowel-vowel (VV) gestures have been tested, together with inverse dynamics for selected articulators. Self-organised motor relaxation nets have been used to study trajectory formation. Learning of coarticulation and compensation phenomena has been tested for selected VV sequences with a control model.
- A method for the recovery of undershoot vocalic targets from acoustic parameters has been developed using principles of dynamics.
- To obtain visual input data for audiovisual integration, a set of labial gestures in vowels and consonants has been recorded and processed.
- Visual perception of labial anticipation has been tested, and four audiovisual integration models have been implemented and assessed.

LATEST PUBLICATIONS

INFORMATION DISSEMINATION ACTIVITIES

- Significant participation in the 3rd Seminar on Speech Production, 11-13 May 1993, Old Saybrook, CT, USA (about 30% of papers were presented by consortium partners).
- 125th meeting of the Acoustical Society of America (ASA).
- World Congress on Neural Networks, Portland, USA, July 11-15 1993.
- Significant participation (more than 15 communications) in the Eurospeech conference, Berlin, 20 Sept. 1993.
- Several communications will be presented at the 3rd French Congress of Acoustics, at ASA meetings, and at the 1994 International Conference on Spoken Language Processing.
- An international workshop on the special issue of Speech Robot Learning is planned.



Sven Müßig, last update 07-nov-1995.