One of the problems in text-to-speech (TTS) systems and
speech-to-text (STT) systems is pronunciation estimation of unknown words. In this paper, we propose a method for extracting
unknown words and their pronunciations from similar sets of
Japanese text data and speech data. Out-of-vocabulary words
are extracted from text with a stochastic model and pronunciation hypotheses are generated. These entries are verified by
conducting automatic speech recognition on audio data. In this
work, we use news articles and broadcast TV news covering
similar topics. Most extracted pairs turned out to be correct
according to human judges. We also tested the TTS front-end enhanced with these entries on other web news articles, and
observed an improvement in the pronunciation estimation accuracy of 9.2% (relative). The proposed method can be used to
realize a spoken language processing system that acquires and
updates its lexicon automatically.
Recent advances in spoken language processing (SLP) techniques have given rise to a number of practical applications.
One of these applications is text-to-speech (TTS), which converts written text into speech. One of the largest obstacles in a
TTS system is the existence of unknown words. Usually TTS
systems are equipped with a module that estimates the pronunciation of unknown words from their spelling. However, the
accuracy of this module is not sufficiently high, especially in
languages which use ideograms such as Japanese and Chinese.
Unknown words or out-of-vocabulary words are also problematic in speech-to-text (STT) systems.
In this paper, we propose a method for extracting unknown
words and their pronunciations automatically from comparable
sets of text data and speech data. The main idea is to compare
a collection of text data and a collection of speech data talking
about the same topics. Our method is summarized as follows:
1. Extract unknown word candidates from the text data.
2. Enumerate possible pronunciations for each word candidate.
3. Search for pronunciations in the speech data.
The search is executed by using an automatic speech recognizer
(ASR). Unless the searched pronunciation is very long, a possible pronunciation may be matched not only with correct words
but also at incorrect positions in speech data. Thus, when we
search for a possible pronunciation of an unknown word candidate, it is essential to check its context. This context
can be calculated from sentences in text data.
In some languages such as Japanese, the target language of
this research, words are not separated by whitespace. Thus,
first of all, word boundaries must be identified by an automatic word segmenter. However, automatic word segmenters
tend to make errors at unknown words and output incorrect
word boundaries. We therefore regard a text as a stochastically segmented corpus (SSC), in which sentences are segmented into
word sequences stochastically, not deterministically as in ordinary methods. The ASR system searches for all possible pronunciations of unknown word candidates in speech data, representing contexts with a word n-gram model estimated from an
In the experiment, we extract word-pronunciation pairs
from broadcast TV news and web news articles in the same
period. Evaluation is done using a different set of web news
The method we propose in this paper for extracting unknown
words and their pronunciations uses an ASR coupled with a language model (LM) describing the contexts of the unknown word
candidates. In this section, we explain a TTS front-end based
on n-gram modeling.
2.1. Text-to-Speech Front-end
In the stochastic approach to pronunciation estimation, a
sentence is regarded as a sequence of pairs $u$, each consisting of
the spelling of a word $w$ and a phoneme sequence $y$, that is, $u = \langle w, y \rangle$.
Using an n-gram model based on this unit, $M_{u,n}$, the
probability of a unit sequence $\boldsymbol{u} = (u_1 u_2 \cdots u_h)$ is calculated as

$$M_{u,n}(\boldsymbol{u}) = \prod_{i=1}^{h+1} P(u_i \mid u_{i-n+1}^{i-1}),$$

where $u_i$ ($i \le 0$) and $u_{h+1}$ are a special symbol BT (boundary token).
Given a character sequence $x$ as an input sentence, the
front-end outputs $\hat{\boldsymbol{u}}$, a sequence of units with the highest
probability, under the constraint that the concatenation of the
spellings is equal to the input sentence:

$$\hat{\boldsymbol{u}} = \operatorname*{argmax}_{w_1 w_2 \cdots w_h = x} M_{u,n}(u_1 u_2 \cdots u_h),$$

where $w_i$ is the spelling of the pair $u_i$.
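The constrained argmax above can be illustrated with a small dynamic-programming search. The following is only a minimal sketch under simplifying assumptions: it uses a unigram model (n = 1) instead of a general n-gram, ignores the BT symbol, and works on a toy lexicon with made-up probabilities. The function name `best_unit_sequence` and the lexicon are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict

def best_unit_sequence(x, unit_prob):
    """Return the most probable list of (spelling, pronunciation) units whose
    spellings concatenate to x, assuming a unigram model over units."""
    # Index the units by spelling so each substring of x can be looked up.
    by_spelling = defaultdict(list)
    for (w, y), p in unit_prob.items():
        by_spelling[w].append((y, math.log(p)))
    max_len = max(len(w) for w in by_spelling)

    # best[i] holds (log-probability, units) for the best analysis of x[:i].
    best = [(-math.inf, [])] * (len(x) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(x) + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_units = best[start]
            if prev_score == -math.inf:
                continue  # x[:start] has no analysis
            w = x[start:end]
            for y, logp in by_spelling.get(w, []):
                score = prev_score + logp
                if score > best[end][0]:
                    best[end] = (score, prev_units + [(w, y)])
    return best[len(x)][1]

# Toy lexicon with hypothetical probabilities (not taken from the paper).
TOY_UNITS = {
    ("a", "A"): 0.5,
    ("b", "B"): 0.1,
    ("c", "C"): 0.3,
    ("ab", "AB"): 0.2,
    ("bc", "BC"): 0.4,
}
result = best_unit_sequence("abc", TOY_UNITS)
print(result)
```

With a real n-gram model, the search state would also have to carry the previous n-1 units, which this sketch deliberately omits.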
2.2. Pronunciation Estimation for Unknown Word
In order to handle unknown words, a special symbol UU is introduced to represent all units outside of vocabulary U, a set of
word-pronunciation pairs. When a UU is predicted by Mu,n, a
In the original paper, the unit is a quadruplet of the spelling of a
word, its part-of-speech, its phoneme sequence, and its accent sequence.
1. Decompose the spelling into a character sequence and
generate all possible pronunciations for the characters
from the dictionary
ex.) 守 (mo ri, ma mo, shu), 屋 (o ku, ya)
2. List all pronunciations of the word candidate by taking
one possible pronunciation for each character
ex.) mo ri o ku, mo ri ya, ma mo o ku,
ma mo ya, shu o ku, shu ya
3. For each possible pronunciation, calculate the joint probability in which the candidate word has the pronunciation
using the n-gram model based on word-pronunciation
pairs expressed by Equation (2).
ex.) P(mo ri o ku,守屋) = 0.65
P(mo ri ya,守屋) = 0.12
Note that in this example the correct pronunciation of the word
“守屋” is “mo ri ya,” the second most probable one; thus the TTS
front-end fails to produce a correct pronunciation of this word.
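Steps 1 and 2 of the enumeration amount to a Cartesian product over the per-character readings. The sketch below reproduces the 守屋 example under that reading; the dictionary literal and the function name `enumerate_pronunciations` are illustrative assumptions, and the probability scoring of step 3 is not implemented here.

```python
from itertools import product

# Character-level readings copied from the example in the text.
CHAR_PRONUNCIATIONS = {
    "守": ["mo ri", "ma mo", "shu"],
    "屋": ["o ku", "ya"],
}

def enumerate_pronunciations(spelling, char_prons):
    """List every pronunciation obtained by picking one reading per character."""
    per_char = [char_prons[ch] for ch in spelling]
    return [" ".join(choice) for choice in product(*per_char)]

candidates = enumerate_pronunciations("守屋", CHAR_PRONUNCIATIONS)
for pron in candidates:
    print(pron)
```

The number of candidates is the product of the number of readings per character, so it grows quickly for long words. This is one reason the hypotheses need to be filtered, here by their probabilities and later by the speech data.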
4.3. Searching for Pronunciation in Speech
The last step is to check if these hypothesized pronunciations for
word candidates are observed in speech data. Since speech data
have no clear word boundary information and contain pronunciation fluctuations and noise, a pronunciation may also match at an improper position. For example, let us assume that speech
data contain the pronunciation of a word “memorial park” as
··· me mo ri a ru pa a ku···.
A pronunciation “mo ri ya” for a word candidate “守屋” may
mistakenly match at the position of “mo ri a” when the pronunciation of the word “memorial park” fluctuates. Therefore,
it is important to check the contexts of word candidates when
we search for pronunciations in speech data. So we propose
to use an ASR system coupled with an LM estimated from our
The following process counts the frequencies of
candidate word-pronunciation pairs that appear at phonetically and linguistically proper positions in speech data.
1. Prepare an ASR system with a proper acoustic model for
the speech data.
2. Add extracted word candidates to the vocabulary of the
3. Re-estimate an LM of the ASR system from the pseudo-SSC used for word candidate extraction.
4. Execute speech recognition on the speech data talking
about comparable topics to the text data.
5. Count the frequencies of word-pronunciation pairs in the
ASR system results.
As a result of the above processes, we expect to obtain correct
word-pronunciation pairs with their frequencies from text data
and speech data.
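The final counting step can be sketched as follows. This is a minimal illustration that assumes the recognizer output has already been converted into lists of (word, pronunciation) tuples, one list per utterance. That format, the function name `count_candidate_pairs`, and the sample data are assumptions made for this example and are not the output format of any particular ASR system.

```python
from collections import Counter

def count_candidate_pairs(asr_results, candidates):
    """Count how often each candidate (word, pronunciation) pair is recognized."""
    counts = Counter()
    for utterance in asr_results:
        for pair in utterance:
            if pair in candidates:
                counts[pair] += 1
    return counts

# Hypothesized pairs for one word candidate (from the enumeration step).
candidate_pairs = {("守屋", "mo ri o ku"), ("守屋", "mo ri ya")}

# Hypothetical recognition results, one list of tokens per utterance.
asr_results = [
    [("守屋", "mo ri ya"), ("は", "wa")],
    [("守屋", "mo ri ya")],
    [("は", "wa")],
]

pair_counts = count_candidate_pairs(asr_results, candidate_pairs)
print(pair_counts.most_common())
```

Pairs that never appear in the recognition results simply receive no count, so a frequency threshold could be applied afterwards to select the entries to keep.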
As an evaluation of our method for extracting word-pronunciation pairs, we measured the pronunciation estimation accuracy of a TTS front-end with and without the extracted pairs.
5.1. Experiment Conditions
We prepared an annotated corpus composed of articles extracted
from newspapers and example sentences in a dictionary of daily
conversation. Each sentence in the corpus is segmented into
words and each word is annotated with a phoneme sequence.
Table 1 shows the corpus size. The ME-model for WBP estimation and a stochastic TTS front-end are built from this corpus.
Our method uses text data and speech data to extract
word-pronunciation pairs. The text data we used come from two sources: one is newspaper articles, which are distinct
from the corpus used to build the ME-model; the other is web
news articles crawled four times a day for 68 days (02/11/2007
– 08/01/2008). Table 2 shows the corpus size. We extracted
word-pronunciation pairs from the text data. As speech
data, we recorded 30-minute TV news broadcasts for 34 days (05/12/2007 –
We then tested the TTS front-end on 250 sentences of web news articles
from the day after the above period (09/01/2008).
5.2. Parameters and Other Features
We also used the pseudo-SSCs derived from the text data to build the LM of the ASR system. To choose the number of pseudo-SSCs, we conducted preliminary experiments in which we calculated the perplexities of LMs built
from N pseudo-SSCs while varying the multiplier N. The result
showed that the LM built from 10 pseudo-SSCs had a similar
perplexity to the LM built from the SSC. Thus we set N to 10.
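To make the role of the multiplier N concrete, the sketch below shows one way N pseudo-SSCs could be drawn from a sentence. It assumes that each sample is produced by placing a word boundary independently at every character position with a given boundary probability. This sampling procedure is not described in this section, so it should be read as an assumption, and the names `sample_segmentation` and `build_pseudo_ssc` are hypothetical.

```python
import random

def sample_segmentation(sentence, boundary_probs, rng):
    """Draw one segmentation of a non-empty sentence.

    boundary_probs[i] is the assumed probability of a word boundary
    between sentence[i] and sentence[i + 1].
    """
    words, current = [], sentence[0]
    for ch, p in zip(sentence[1:], boundary_probs):
        if rng.random() < p:
            words.append(current)  # close the current word at this boundary
            current = ch
        else:
            current += ch
    words.append(current)
    return words

def build_pseudo_ssc(sentence, boundary_probs, n, seed=0):
    """Return n independently sampled segmentations of the same sentence."""
    rng = random.Random(seed)
    return [sample_segmentation(sentence, boundary_probs, rng) for _ in range(n)]

# Hypothetical boundary probabilities for a four-character sentence.
samples = build_pseudo_ssc("abcd", [0.9, 0.2, 0.6], n=10)
for words in samples:
    print(" ".join(words))
```

Under this assumption, a larger N gives n-gram counts that approximate the expected counts of the SSC more closely, which is consistent with the reported perplexity becoming similar at N = 10.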
20 TOEIC Tips
- Set a goal
So, you’ve decided to take the TOEIC test. Congratulations! The first thing you should do is set a goal. If you are taking the test in order to apply for a job, find out what proficiency level is required.
Choose a goal that is achievable. If you aim too high, you will be disappointed. Remember, you can take the test as often as you want if you don’t mind paying the fee.
- Understand the test
Before you start studying for the test, make sure you understand the format of each section. You will be tested on your listening and reading comprehension skills. By doing model or practice tests, you will become very familiar with the TOEIC. The test should become “second nature” to you before you attempt the real thing.
- Make a study plan
Procrastination is one of the key reasons students fail the TOEIC test. You may book your TOEIC test months in advance. However, the day you decide to take the TOEIC test should be the day you start to study.
You will have to decide whether or not you are going to teach yourself the TOEIC with reliable resources or whether you are going to take a TOEIC preparation class. In order to get the best results, you should do both. If you cannot afford to take a TOEIC class, make sure to choose a TOEIC textbook that has explanatory answers. You will also want to have a teacher or tutor that you can go to from time to time with questions.
If you choose a TOEIC class, make sure that you trust your teacher and feel comfortable in his or her class. Take a class with a friend and make a commitment to study together in and outside of class.
Studying at the same time every day is a great way to improve your score. Write down your study plan and sign it!
- Divide study time appropriately
Each section is worth a certain number of points. Don’t spend too much time studying one section. Many students make the mistake of studying the section that they enjoy the most. This is the section you should spend the least amount of time on.
You might want to divide your study week by focusing on a certain section each day. Remember, if Sunday is your day to practice Part VII (40 questions on the test), you might have to study twice as long as you would on Monday when you focus on Part I (20 questions on the test).
- Build a strong vocabulary
Another reason students fail the TOEIC test is that they have a very limited vocabulary. The day you decide to take the TOEIC test you should make yourself a blank dictionary. Use a notebook (an address book works great because it is divided into letters) and keep track of all of the new words you learn along the way. It is not useful to study vocabulary lists. You will only remember words that you have seen in context. For each entry, write the word and use it in a sentence. At the end of each week you should write a short letter or composition using as many of the words as you can.
This might also be the time to stop using your translation dictionary. Electronic dictionaries make things too simple! You will not remember the word if it doesn’t take any effort to understand it.
Keep in mind that the TOEIC test has a business theme. You should study vocabulary from topics such as travel, banking, health, restaurants, offices, etc. You will also want to learn everyday idiomatic expressions.
- Isolate your weak points
After you have been studying the TOEIC for a while, you will find out which parts give you the most trouble. You might want to change how you divide your time. There are certain grammar points that many students have trouble with. If you are taking a TOEIC class, ask your teacher to bring in extra homework help on problems like these. If you are studying by yourself, find a good reference book in the library and look up your question. There may also be help on the Internet. For example, type “gerunds” into a search engine and you will probably find a useful exercise.
- Eliminate distractors
In every TOEIC question, there are at least two distractors (wrong answers that the test writer uses to trick you). It is much easier to choose the correct answer when you have only two to choose from. (The third choice is often impossible and easy to spot.) There are many types of distractors, such as similar sounds, homonyms, and repeated words. As you study, make yourself a list of distractors. When you come across them, you will be able to eliminate them more easily.
- Trust your instincts
Sometimes an answer will jump out at you as either correct or incorrect. If you have been studying hard, chances are that your brain is telling you which choice to pick. Don’t change your answers after following your instinct. If you do decide to change an answer, make sure that you erase very carefully. A machine will be marking your test. Be sure to use a pencil and fill in your circle choice completely. Bring extra pencils, erasers, and a pencil sharpener!
- Don’t try to translate
Translating vocabulary and sentences wastes a lot of time. It is very rare that students have extra time during the TOEIC test. If you don’t know a word, look at the context of the sentence and the words around it. You will not be allowed to use a dictionary when you take the test.
- Guess as a last resort
On test day, if you don’t know the answer, and you have eliminated all of the distractors you can, don’t leave the space blank. There is a good chance you will not have time to go back to this question. You still have a 25% chance of getting the answer right if you guess. If you are sure that one or two of the answers are incorrect, your guess is even more likely to be correct!
- Be aware of time management
When you are doing practice tests, you should always be aware of the time. Never allow yourself an open-ended study session. You will have to learn to work efficiently.
On test day, you should be especially careful in the Reading section. You will have 75 minutes to complete Parts V, VI and VII. Many students spend too long on Part V or VI because they find these the most difficult. Don’t spend more than 30 minutes on the first two parts. Part VII will take you at least 40 minutes, and it is worth a lot of points, especially if you find it an easier section.
- Listen quickly
When you are studying for the TOEIC test, do not get in the habit of rewinding the tape. On test day you won’t have any control over the speed of the listening section. You will not even have time to think for very long between questions. Make sure that you do not get behind during the real test. If you do not know the answer, take your best guess. Then continue to follow along. Don’t look back at questions when you are waiting for another question to start.
- Practise reading aloud
Reading out loud will help your listening and reading comprehension skills. In order to comprehend English more quickly, it is important that you understand the rhythm of the language. Read from textbooks, pamphlets, newspapers, and even children’s novels. You might want to tape yourself and listen to how you sound.
- Use mass media
One of the best ways to prepare for the TOEIC test is to study real English. Watch television, listen to radio reports, and read newspapers and magazines. Pay special attention to ads, letters, weather and traffic reports, coupons, and special announcements. Do this with a friend, and write out questions for each other to answer. This is a great way to practice your wh-questions. It is also a great way to learn common idiomatic expressions.
- Use free web sites
There are many web sites that offer free model tests and samples. Type TOEIC into your search engine and start practising! Surfing the web is a great way to practise your reading and listening. If you are interested in a certain topic, such as snowboarding, type that into a search engine. You might want to reserve an hour a day for Internet studying. Just make sure to study English and don’t get caught wasting hours playing games!
- Teach a native English speaker your language
If you can’t afford a tutor, you might know a native English speaker who would be interested in learning your first language. Tell him you will teach him for free for one hour a week! You will have to use English to teach him, and you will learn many new English words and expressions at each session. Forcing yourself to teach someone a language will help you to understand English grammatical rules as well. Do anything you can to speak with native English speakers.
- Keep an English journal
Keeping a journal doesn’t have to be an account of your daily activities. You can write anything in a journal, such as how your studying is coming along, what your new favourite word is and why, or which teacher you admire. If you are studying TOEIC with a friend, make a list of writing topics for each other. You might decide to write a paragraph three times a week. Get your friend to try to find your mistakes. Finding your partner’s writing errors is great practice for Parts V and VI.
- Ask questions
Never hesitate to ask lots of questions. In a TOEIC class, all of the students will benefit from your question. If you don’t understand something, such as conditionals, you may lose ten points on a TOEIC exam. A teacher is not always available, but students are everywhere! Sometimes other students can help you with a grammar problem even better than a teacher.
- Manage your stress
If you are feeling stressed about taking the TOEIC, you may be studying too hard or expecting too much of yourself. Like everything else in life, balance is the key. Remind yourself that you will try to do your best. Before the test, take deep breaths and remember that you can always improve your score in a few months’ time. In between the listening and reading sections, take a few deep breaths again to get focused.
- Don’t cram
You should never cram (study extremely hard in a short period of time) the night or even week before the TOEIC test. There is so much to learn when you study the TOEIC. The last week should be for reviewing and practising rather than learning new things. Make sure to get plenty of sleep the night before the test. On the day of the test, have a good meal and relax for a few hours before going to the testing centre. Plan to reward yourself when the test is over!