Preliminary speaker verification experiments with a realistic database. José Antonio Rubio García, José Manuel Pardo Muñoz, Ricardo de Córdoba.

Presentation transcript:

Preliminary speaker verification experiments with a realistic database. José Antonio Rubio García, José Manuel Pardo Muñoz, Ricardo de Córdoba Herralde, Javier Macías Guarasa. Grupo de Tecnología del Habla, Dpto. de Ingeniería Electrónica, E.T.S. de Ingenieros de Telecomunicación, U. Politécnica de Madrid

Outline
Introduction
Database
Baseline system
Improvements
Conclusions

Introduction
Objectives:
Develop a system for speaker verification/identification under realistic conditions. The tests are text-independent.
Evaluate the system on speech recorded under different conditions.
Optimize the system in terms of both computational cost and error rate.

Database
26 speakers. Each speaker recorded the following:
5 minutes of normal conversation (T1 and T2)
Reading of 5 sentences at a normal rate (FL1 and FL2)
Reading of the same 5 sentences at a fast rate (FR1 and FR2)
All of it was recorded twice, separated in time (first and second recording sessions, denoted 1 and 2).

Database
The conversation is divided into nine text blocks to allow text-independent tests.
The sentences differ in text from one another and from the conversation segments.

Baseline system
Speakers are modeled with multi-Gaussian mixture models (64 mixtures).
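As a minimal sketch of such a baseline, assuming per-speaker matrices of short-time spectral features (frames x coefficients) and scikit-learn as the GMM implementation (the slides specify neither the toolkit nor the features, so both are assumptions), training and scoring could look like this:

```python
# Hedged sketch of a 64-mixture GMM speaker-verification baseline.
# Feature extraction, toolkit, and the threshold value are assumptions;
# only "64-mixture multi-Gaussian models" comes from the slides.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_mixtures=64):
    """Fit a 64-mixture GMM to one speaker's training frames (frames x dims)."""
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(features)
    return gmm

def score_utterance(gmm, features):
    """Mean per-frame log-likelihood of the test frames under the model."""
    return gmm.score(features)

def verify(gmm, features, threshold=-45.0):
    """Accept the claimed identity if the score clears a tuned threshold."""
    return score_utterance(gmm, features) > threshold
```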

Results
(Results table shown on the slide; not captured in the transcript.)

Results
(Results table shown on the slide; only the condition label FL2 survives in the transcript.)

Results II
The best results are obtained with the normal read-sentence mode.
Data from the same session (with different text) give better results, even under a different speaking mode, than data from the other session.
The speaking mode becomes more relevant as the error rates get lower.

Improvement 1
Normalization of the distance measures, where Sc is the test speaker and Si ranges over all the speakers. Finally: (the equations were shown as slide images and are not captured in the transcript; a hedged reconstruction follows).
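Since the slide's equations were images, the following is only a hedged reconstruction: a cohort-style score normalization consistent with the slide text (the test speaker S_c scored against all speakers S_i), written in the log-likelihood form common in speaker verification of that period:

```latex
% Hedged reconstruction; the actual equation on the slide was not transcribed.
\[
  \tilde{L}(X \mid S_c) \;=\; \log p(X \mid S_c)
  \;-\; \log\!\Bigl(\frac{1}{N}\sum_{i=1}^{N} p(X \mid S_i)\Bigr)
\]
% S_c: claimed (test) speaker model; S_1, ..., S_N: all speaker models in
% the database, as named on the slide. X: the test feature sequence.
```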

Results
(Results table shown on the slide; not captured in the transcript.)

Results
(Results table shown on the slide; not captured in the transcript.)

Improvement 2
Normalization of the distance measures with a global model, where Sb is a single model trained on all the speakers in the database. (The equation was shown as a slide image and is not captured in the transcript; a hedged reconstruction follows.)
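Again as a hedged reconstruction (the slide's equation was not transcribed): normalization against a single global background model S_b is usually written as a log-likelihood ratio:

```latex
% Hedged reconstruction; the actual equation on the slide was not transcribed.
\[
  \tilde{L}(X \mid S_c) \;=\; \log p(X \mid S_c) \;-\; \log p(X \mid S_b)
\]
% S_b: a single model trained on all speakers in the database, as described
% on the slide; S_c and X as in Improvement 1.
```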

Model 3
Each speaker is modeled by the covariance matrix of its features.
The distance is a distance between covariance matrices.
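The slides do not say which covariance distance was used, so the sketch below is an assumption: it models each speaker by the covariance of its training frames and compares covariances with the arithmetic-harmonic sphericity measure, one distance used for exactly this purpose in the speaker-verification literature. Note that fitting a covariance matrix is far cheaper than EM training of a 64-mixture GMM, which is consistent with the training-time gains reported in the conclusions.

```python
# Hedged sketch of Model 3: speaker = covariance matrix of training frames,
# scored by a covariance-to-covariance distance. The specific distance is an
# assumption; the slides only say "a distance between covariance matrices".
import numpy as np

def covariance_model(features):
    """Speaker model: covariance of the feature frames (frames x dims)."""
    return np.cov(features, rowvar=False)

def ahs_distance(c_test, c_model):
    """Arithmetic-harmonic sphericity distance between two covariance
    matrices; zero iff the matrices are equal, larger when they differ."""
    d = c_test.shape[0]
    return np.log(np.trace(c_test @ np.linalg.inv(c_model)) *
                  np.trace(c_model @ np.linalg.inv(c_test)) / d**2)
```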

Model 3
(Results table shown on the slide; not captured in the transcript.)

Training time comparison
(Timing table shown on the slide; not captured in the transcript.)

Conclusions
Improvement 1 offers substantial gains over the baseline (in 17 of the 36 tests).
Improvement 2 improves the average error rate in all cases, significantly so in 12 of the 36 cases.
Model 3 does not improve the error rate, but it does reduce training and test time.

Conclusions
Experiments run with less training material degrade the system significantly.
The next experiments will use speaker-dependent thresholds.