One of the problems in text-to-speech (TTS) systems and speech-to-text (STT) systems is pronunciation estimation of unknown words. In this paper, we propose a method for extracting unknown words and their pronunciations from similar sets of Japanese text data and speech data. Out-of-vocabulary words are extracted from text with a stochastic model, and pronunciation hypotheses are generated. These entries are verified by conducting automatic speech recognition on audio data. In this work, we use news articles and broadcast TV news covering similar topics. Most extracted pairs turned out to be correct according to human judges. We also tested the TTS front-end enhanced with these entries on other web news articles, and observed a relative improvement of 9.2% in pronunciation estimation accuracy. The proposed method can be used to realize a spoken language processing system that acquires and updates its lexicon automatically.

1. Introduction

Recent advances in spoken language processing (SLP) techniques have given rise to a number of practical applications. One of these applications is text-to-speech (TTS), which converts written text into speech. One of the largest obstacles in a TTS system is the existence of unknown words. TTS systems are usually equipped with a module that estimates the pronunciation of unknown words from their spelling. However, the accuracy of this module is not sufficiently high, especially in languages that use ideograms, such as Japanese and Chinese. Unknown words, or out-of-vocabulary words, are also problematic in speech-to-text (STT) systems.

In this paper, we propose a method for extracting unknown words and their pronunciations automatically from comparable sets of text data and speech data. The main idea is to compare a collection of text data and a collection of speech data talking about the same topics. Our method is summarized as follows:

1. Extract unknown word candidates from the text data.

2. Enumerate possible pronunciations for each word candidate.

3. Search for pronunciations in the speech data.

The search is executed by using an automatic speech recognizer (ASR). Unless the searched pronunciation is very long, a possible pronunciation may be matched not only with correct words but also at incorrect positions in the speech data. Thus, when we search for a possible pronunciation of an unknown word candidate, it is essential to check its context. This context can be calculated from sentences in the text data.

In some languages such as Japanese, the target language of this research, words are not separated by whitespace. Thus, first of all, word boundaries must be identified by an automatic word segmenter. However, automatic word segmenters tend to make errors at unknown words and output incorrect word boundaries. So we regard a text as a stochastically segmented corpus (SSC) [1] in which sentences are segmented into word sequences stochastically, not deterministically as in ordinary methods. The ASR system searches for all possible pronunciations of unknown word candidates in speech data, representing contexts with a word n-gram model estimated from an


In the experiment, we extract word-pronunciation pairs from broadcast TV news and web news articles in the same period. Evaluation is done using a different set of web news


2. Language Model for TTS Front-end

The method we propose in this paper for extracting unknown words and their pronunciations uses an ASR coupled with a language model (LM) describing the contexts of the unknown word candidates. In this section, we explain a TTS front-end based on n-gram modeling.

2.1. Text-to-Speech Front-end

In the stochastic approach for pronunciation estimation [2], a sentence is regarded as a sequence of pairs $u$, each consisting of the spelling of a word $w$ and a phoneme sequence $y$, that is, $u = \langle w, y \rangle$. Using an n-gram model based on this unit, $M_{u,n}$, the probability of a unit sequence $\boldsymbol{u} = (u_1 u_2 \cdots u_h)$ is calculated as follows:

$$M_{u,n}(\boldsymbol{u}) = \prod_{i=1}^{h+1} P(u_i \mid u_{i-n+1} \cdots u_{i-1}),$$

where $u_i$ ($i \le 0$) and $u_{h+1}$ are a special symbol BT (boundary token).

Given a character sequence $\boldsymbol{x}$ as an input sentence, the front-end outputs $\hat{\boldsymbol{u}}$, a sequence of units with the highest probability, under the constraint that the concatenation of the spellings is equal to the input sentence:

$$\hat{\boldsymbol{u}} = \operatorname*{argmax}_{w_1 w_2 \cdots w_h = \boldsymbol{x}} M_{u,n}(u_1 u_2 \cdots u_h), \tag{1}$$

where $w_i$ is the spelling of the pair $u_i$.
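The search in Equation (1) can be illustrated with a small dynamic program. The following is only a minimal sketch, not the front-end of [2]: it assumes a bigram model (n = 2), and the function names, the toy lexicon, the toy probabilities, and the flooring of unseen bigrams are all hypothetical choices made for illustration.

```python
import math

BT = ("<BT>", "<BT>")  # boundary token used at both ends of a sentence


def make_bigram(table, floor=1e-6):
    """Wrap a dict {(prev_unit, unit): probability} as a log-probability function.
    Unseen bigrams get a small floor probability (an assumption, not from the paper)."""
    def logprob(prev, unit):
        return math.log(table.get((prev, unit), floor))
    return logprob


def best_unit_sequence(x, lexicon, bigram_logprob):
    """Return (units, log_probability) for the most probable sequence of
    (spelling, pronunciation) units whose spellings concatenate to x,
    or None if x cannot be covered by the lexicon."""
    # best[i] maps each unit ending at character position i
    # to (best score so far, (start position, previous unit)).
    best = [dict() for _ in range(len(x) + 1)]
    best[0][BT] = (0.0, None)
    for end in range(1, len(x) + 1):
        for start in range(end):
            spelling = x[start:end]
            for pron in lexicon.get(spelling, []):
                unit = (spelling, pron)
                for prev, (score, _) in best[start].items():
                    cand = score + bigram_logprob(prev, unit)
                    if unit not in best[end] or cand > best[end][unit][0]:
                        best[end][unit] = (cand, (start, prev))
    # Close the sentence with the transition to the final boundary token.
    final = None
    for unit, (score, _) in best[len(x)].items():
        total = score + bigram_logprob(unit, BT)
        if final is None or total > final[0]:
            final = (total, unit)
    if final is None:
        return None
    # Follow the back-pointers from the last unit to the sentence start.
    units, pos, unit = [], len(x), final[1]
    while unit != BT:
        units.append(unit)
        _, (pos, unit) = best[pos][unit]
    units.reverse()
    return units, final[0]


# Toy data (hypothetical values chosen only to exercise the code).
lexicon = {"東": ["hi ga shi", "to u"], "京": ["kyo u"], "東京": ["to u kyo u"]}
table = {
    (BT, ("東京", "to u kyo u")): 0.5,
    (("東京", "to u kyo u"), BT): 0.9,
    (BT, ("東", "hi ga shi")): 0.3,
    (("東", "hi ga shi"), ("京", "kyo u")): 0.01,
    (("京", "kyo u"), BT): 0.5,
}
logprob = make_bigram(table)
print(best_unit_sequence("東京", lexicon, logprob))
```

The constraint that the spellings concatenate to the input is enforced by only extending a path ending at position `start` with a unit whose spelling is exactly `x[start:end]`.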

2.2. Pronunciation Estimation for Unknown Word

In order to handle unknown words, a special symbol UU is introduced to represent all units outside of the vocabulary U, a set of word-pronunciation pairs. When a UU is predicted by $M_{u,n}$, a

In the original paper [2], the unit is a quadruplet of the spelling of a word, its part-of-speech, its phoneme sequence, and its accent sequence.

1. Decompose the spelling into a character sequence and generate all possible pronunciations for each character from the dictionary.

ex.) (mo ri, ma mo, shu), (o ku, ya)

2. List all pronunciations of the word candidate by taking one possible pronunciation for each character.

ex.) mo ri o ku, mo ri ya, ma mo o ku, ma mo ya, shu o ku, shu ya

3. For each possible pronunciation, calculate the joint probability that the candidate word has that pronunciation, using the n-gram model based on word-pronunciation pairs expressed by Equation (2).

ex.) P(mo ri o ku, 守屋) = 0.65
P(mo ri ya, 守屋) = 0.12

Note that in this example the correct pronunciation of the word 守屋 is “mo ri ya,” the second most probable one; thus the TTS front-end fails to produce a correct pronunciation of this word.
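Steps 1 and 2 above amount to taking a Cartesian product over the per-character readings. The sketch below reproduces the 守屋 example; the function name and the dictionary layout are assumptions made for illustration, and the ranking at the end uses only the two probabilities quoted in the example, since the remaining values are not given.

```python
from itertools import product


def enumerate_pronunciations(spelling, char_prons):
    """List every pronunciation obtained by picking one reading per character.
    char_prons maps a character to its list of readings (hypothetical layout)."""
    per_char = [char_prons[ch] for ch in spelling]
    return [" ".join(combo) for combo in product(*per_char)]


# Per-character readings taken from the example in the text.
char_prons = {"守": ["mo ri", "ma mo", "shu"], "屋": ["o ku", "ya"]}
candidates = enumerate_pronunciations("守屋", char_prons)
print(candidates)

# Step 3 ranks the candidates by joint probability. Only the two values
# quoted in the text are used here; the other four are not available.
quoted = {"mo ri o ku": 0.65, "mo ri ya": 0.12}
ranked = sorted(quoted, key=quoted.get, reverse=True)
print(ranked)
```

The number of candidates is the product of the number of readings per character (3 × 2 = 6 here), so the list grows quickly for long words with many readings.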

4.3. Searching for Pronunciation in Speech

The last step is to check whether these hypothesized pronunciations for word candidates are observed in speech data. Since speech data have no clear word boundary information and contain pronunciation fluctuations and noise, a pronunciation may also match at an improper position. For example, let us assume that speech data contain the pronunciation of a word “memorial park” as

··· me mo ri a ru pa a ku ···.

A pronunciation “mo ri ya” for a word candidate “守屋” may match by mistake at the position of “mo ri a” when the pronunciation of the word “memorial park” fluctuates. Therefore it is important to check the contexts of word candidates when we search for pronunciations in speech data. So we propose to use an ASR system coupled with an LM estimated from our

The following process counts the frequencies of candidate word-pronunciation pairs appearing at phonetically and linguistically proper positions in speech data.

1. Prepare an ASR system with a proper acoustic model for the speech data.

2. Add extracted word candidates to the vocabulary of the ASR system.

3. Re-estimate an LM of the ASR system from the pseudo-SSC used for word candidate extraction.

4. Execute speech recognition on the speech data covering topics comparable to those of the text data.

5. Count the frequencies of word-pronunciation pairs in the ASR system results.

As a result of the above processes, we expect to obtain correct word-pronunciation pairs with their frequencies from text data and speech data.
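The final counting step can be sketched as follows. This is an illustration only: it assumes the recognizer output is already available as one list of (word, pronunciation) tokens per utterance, which is a hypothetical format, and the toy recognition results are invented for the example.

```python
from collections import Counter


def count_candidate_pairs(asr_results, candidates):
    """Count how often each candidate (word, pronunciation) pair appears
    in the recognition results. Tokens that are not candidates are ignored."""
    counts = Counter()
    for utterance in asr_results:
        for pair in utterance:
            if pair in candidates:
                counts[pair] += 1
    return counts


# Candidate pairs hypothesized from the text data.
candidates = {("守屋", "mo ri ya"), ("守屋", "mo ri o ku")}

# Hypothetical recognition results, one token list per utterance.
asr_results = [
    [("守屋", "mo ri ya"), ("氏", "shi")],
    [("記念", "ki ne n")],
    [("守屋", "mo ri ya")],
]

counts = count_candidate_pairs(asr_results, candidates)
for pair, freq in counts.most_common():
    print(pair, freq)
```

Pairs that never appear in the recognition results simply keep a count of zero, so a frequency threshold can be applied afterwards to decide which pairs to add to the lexicon.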

5. Evaluation

As an evaluation of our method for extracting word-pronunciation pairs, we measured pronunciation estimation accuracies of a TTS front-end with and without extracted pairs.

5.1. Experiment Conditions

We prepared an annotated corpus composed of articles extracted from newspapers and example sentences in a dictionary of daily conversation. Each sentence in the corpus is segmented into words and each word is annotated with a phoneme sequence. Table 1 shows the corpus size. The ME-model for WBP estimation and a stochastic TTS front-end are built from this corpus.

Our method uses text data and speech data to extract word-pronunciation pairs. The text data we used are composed of two sources: one is newspapers, which are different from the corpus used for building the ME-model; the other is web news articles crawled 4 times a day for 68 days (02/11/2007 – 08/01/2008). Table 2 shows the corpus size. We extracted word-pronunciation pairs from the text data. As for speech data, we recorded 30 minutes of TV news for 34 days (05/12/2007 –


Then we tested the TTS front-end on 250 sentences of web news articles from the day after the above period (09/01/2008).

5.2. Parameters and Other Features

We also used the pseudo-SSCs derived from the text data for building an LM of the ASR. We therefore conducted preliminary experiments in which we calculated the perplexities of LMs built from N pseudo-SSCs while changing the multiplier N. The result showed that the LM built from 10 pseudo-SSCs had a perplexity similar to that of the LM built from the SSC. Thus we set N to 10.
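To make the role of the multiplier N concrete, the sketch below draws N random segmentations of one sentence. It is a guess at the general idea rather than the procedure used here: it assumes that a pseudo-SSC is obtained by sampling each word boundary independently from an estimated word boundary probability, and the function names, the probabilities, and the seed are all hypothetical.

```python
import random


def sample_segmentation(sentence, boundary_probs, rng):
    """Draw one segmentation. boundary_probs[i] is the assumed probability of a
    word boundary between sentence[i] and sentence[i + 1]."""
    if not sentence:
        return []
    if len(boundary_probs) != len(sentence) - 1:
        raise ValueError("need one probability per character boundary")
    words, start = [], 0
    for i, p in enumerate(boundary_probs):
        # rng.random() is in [0, 1), so p = 1.0 always splits and p = 0.0 never does.
        if rng.random() < p:
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])
    return words


def pseudo_ssc(sentence, boundary_probs, n, seed=0):
    """Return n independently sampled segmentations of the same sentence."""
    rng = random.Random(seed)
    return [sample_segmentation(sentence, boundary_probs, rng) for _ in range(n)]


# Hypothetical boundary probabilities for a four-character sentence.
for words in pseudo_ssc("abcd", [0.9, 0.1, 0.5], n=10):
    print(words)
```

With a larger N, the relative frequency of each sampled word approaches its expected frequency under the boundary probabilities, which is consistent with the observation that an LM built from enough pseudo-SSCs behaves like one built from the SSC itself.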











20 TOEIC Tips

  1. Set a goal

    So, you’ve decided to take the TOEIC test. Congratulations! The first thing you should do is set a goal. If you are taking the test in order to apply for a job, find out what proficiency level is required.

    Choose a goal that is achievable. If you aim too high, you will be disappointed. Remember, you can take the test as often as you want if you don’t mind paying the fee.

  2. Understand the test

    Before you start studying for the test, make sure you understand the format of each section. You will be tested on your listening and reading comprehension skills. By doing model or practice tests, you will become very familiar with the TOEIC. The test should become “second nature” to you before you attempt the real thing.

  3. Make a study plan

    Procrastination is one of the key reasons students fail the TOEIC test. You may book your TOEIC test months in advance. However, the day you decide to take the TOEIC test should be the day you start to study.

    You will have to decide whether or not you are going to teach yourself the TOEIC with reliable resources or whether you are going to take a TOEIC preparation class. In order to get the best results, you should do both. If you cannot afford to take a TOEIC class, make sure to choose a TOEIC textbook that has explanatory answers. You will also want to have a teacher or tutor that you can go to from time to time with questions.

    If you choose a TOEIC class, make sure that you trust your teacher and feel comfortable in his or her class. Take a class with a friend and make a commitment to study together in and outside of class.

    Studying at the same time every day is a great way to improve your score. Write down your study plan and sign it!

  4. Divide study time appropriately

    Each section is worth a certain amount of points. Don’t spend too much time studying one section. Many students make the mistake of studying the section that they enjoy the most. This is the section you should spend the least amount of time on.

    You might want to divide your study week by focusing on a certain section each day. Remember, if Sunday is your day to practice Part VII (40 questions on the test), you might have to study twice as long as you would on Monday when you focus on Part I (20 questions on the test).

  5. Build a strong vocabulary

    Another reason students fail the TOEIC test is that they have a very limited vocabulary. The day you decide to take the TOEIC test you should make yourself a blank dictionary. Use a notebook (an address book works great because it is divided into letters) and keep track of all of the new words you learn along the way. It is not useful to study vocabulary lists. You will only remember words that you have seen in context. For each entry, write the word and use it in a sentence. At the end of each week you should write a short letter or composition using as many of the words as you can.

    This might also be the time to stop using your translation dictionary. Electronic dictionaries make things too simple! You will not remember the word if it doesn’t take any effort to understand it.

    Keep in mind that the TOEIC test has a business theme. You should study vocabulary from topics such as travel, banking, health, restaurants, offices, etc. You will also want to learn everyday idiomatic expressions.

  6. Isolate your weak points

    After you have been studying the TOEIC for a while, you will find out which parts give you the most trouble. You might want to change how you divide your time. There are certain grammar points that many students have trouble with. If you are taking a TOEIC class, ask your teacher to bring in extra homework help on problems like these. If you are studying by yourself, find a good reference book in the library and look up your question. There may also be help on the Internet. For example, type “gerunds” into a search engine and you will probably find a useful exercise.

  7. Eliminate distractors

    In every TOEIC question, there are at least two distractors (wrong answers that the test writer uses to trick you). It is much easier to choose the correct answer when you have only two to choose from. (The third choice is often impossible and easy to spot.) There are many types of distractors, such as similar sounds, homonyms, repeated words, etc. As you study, make yourself a list of distractors. When you come across them you will be able to eliminate them more easily.

  8. Trust your instincts

    Sometimes an answer will jump out at you as either correct or incorrect. If you have been studying hard, chances are that your brain is telling you which choice to pick. Don’t change your answers after following your instinct. If you do decide to change an answer, make sure that you erase very carefully. A machine will be marking your test. Be sure to use a pencil and fill in your circle choice completely. Bring extra pencils, erasers, and a pencil sharpener!

  9. Don’t try to translate

    Translating vocabulary and sentences wastes a lot of time. It is very rare that students have extra time during the TOEIC test. If you don’t know a word, look at the context of the sentence and the words around it. You will not be allowed to use a dictionary when you take the test.

  10. Guess as a last resort

    On test day, if you don’t know the answer, and you have eliminated all of the distractors you can, don’t leave the space blank. There is a good chance you will not have time to go back to this question. You still have a 25% chance of getting the answer right if you guess. If you are sure that one or two of the answers are incorrect, your guess is even more likely to be correct!

  11. Be aware of time management

    When you are doing practice tests, you should always be aware of the time. Never allow yourself an open ended study session. You will have to learn to work efficiently.

    On test day, you should be especially careful in the Reading section. You will have 75 minutes to complete Parts V, VI and VII. Many students spend too long on section V or VI because they find these the most difficult. Don’t spend more than 30 minutes on the first two parts. Part VII will take you at least 40 minutes, and it is worth a lot of points, especially if you find it an easier section.

  12. Listen quickly

    When you are studying for the TOEIC test, do not get in the habit of rewinding the tape. On test day you won’t have any control over the speed of the listening section. You will not even have time to think for very long between questions. Make sure that you do not get behind during the real test. If you do not know the answer, take your best guess. Then continue to follow along. Don’t look back at questions when you are waiting for another question to start.

  13. Practise reading aloud

    Reading out loud will help your listening and reading comprehension skills. In order to comprehend English more quickly, it is important that you understand the rhythm of the language. Read from textbooks, pamphlets, newspapers, and even children’s novels. You might want to tape yourself and listen to how you sound.

  14. Use mass media

    One of the best ways to prepare for the TOEIC test is to study real English. Watch television, listen to radio reports, and read newspapers and magazines. Pay special attention to ads, letters, weather and traffic reports, coupons, and special announcements. Do this with a friend, and write out questions for each other to answer. This is a great way to practice your wh-questions. It is also a great way to learn common idiomatic expressions.

  15. Use free web sites

    There are many web sites that offer free model tests and samples. Type TOEIC into your search engine and start practising! Surfing the web is a great way to practise your reading and listening. If you are interested in a certain topic, such as snowboarding, type that into a search engine. You might want to reserve an hour a day for Internet studying. Just make sure to study English and don’t get caught wasting hours playing games!

  16. Teach a native English speaker your language

    If you can’t afford a tutor, you might know a native English speaker who would be interested in learning your first language. Tell him you will teach him for free for one hour a week! You will have to use English to teach him, and you will learn many new English words and expressions at each session. Forcing yourself to teach someone a language will help you to understand English grammatical rules as well. Do anything you can to speak with native English speakers.

  17. Keep an English journal

    Keeping a journal doesn’t have to be an account of your daily activities. You can write anything in a journal, such as how your studying is coming along, what your new favourite word is and why, or which teacher you admire. If you are studying TOEIC with a friend, make a list of writing topics for each other. You might decide to write a paragraph three times a week. Get your friend to try to find your mistakes. Finding your partner’s writing errors is great practice for Part V and VI.

  18. Ask questions

    Never hesitate to ask lots of questions. In a TOEIC class, all of the students will benefit from your question. If you don’t understand something, such as conditionals, you may lose ten points on a TOEIC exam. A teacher is not always available, but students are everywhere! Sometimes other students can help you with a grammar problem even better than a teacher.

  19. Manage your stress

    If you are feeling stressed about taking the TOEIC, you may be studying too hard or expecting too much of yourself. Like everything else in life, balance is the key. Remind yourself that you will try to do your best. Before the test, take deep breaths and remember that you can always improve your score in a few months’ time. In between the listening and reading sections, take a few deep breaths again to get focused.

  20. Don’t cram

    You should never cram (study extremely hard in a short period of time) the night or even week before the TOEIC test. There is so much to learn when you study the TOEIC. The last week should be for reviewing and practising rather than learning new things. Make sure to get plenty of sleep the night before the test. On the day of the test, have a good meal and relax for a few hours before going to the testing centre. Plan to reward yourself when the test is over!



Are you interested in learning abbreviations and acronyms in English?

Teacher “ALVIN”

Come to the Philippines to learn “ENGLISH”

1) Are you interested in learning abbreviations and acronyms in English?
2) Do you know lots of abbreviations and acronyms?
3) Do you think abbreviations and acronyms are useful?
4) Do you have a favorite abbreviation or acronym?
5) Do you think alphabets that don’t have an English script (Chinese, Russian, Arabic, etc.) use abbreviations and acronyms?
6) Do you know the punctuation rules for abbreviations and acronyms?
7) What do you think are the world’s most common abbreviations and acronyms?
8) Do you ever invent your own acronyms to help you study?
9) The website says “RSVP” is one of the most popular queries. Do you know what it means?
10) What acronym could you create for your name?
1) What is the difference between an abbreviation and an acronym?
2) How many United Nations abbreviations and acronyms do you know?
3) Are there many abbreviations and acronyms in your language?
4) What do you think about spending a whole English lesson on abbreviations and acronyms?
5) Can you keep up to date with computer and technology abbreviations and acronyms (WIFI, WAP, ISP, WWW, etc.)?
6) Do you think abbreviations in e-mail and text messages are adding to or ruining the English (or your) language?
7) The website lists 33 different meanings for “BYOB”. Can you think of any?
8) “Scuba”, “modem”, “radar”, “laser” and “NATO” are all acronyms. Do you know what they mean?
9) Would you like to study for an MA or PhD?
10) What do ante meridiem and post meridiem refer to?
A.B. Artium Baccalaureus [Bachelor of Arts]
abbr. abbreviation(s), abbreviated
Acad. Academy
A.D. anno Domini [in the year of the Lord]
alt. altitude
A.M. ante meridiem [before noon]; Artium Magister [Master of Arts]
AM amplitude modulation
Assn. Association
at. no. atomic number
at. wt. atomic weight
Aug. August
Ave. Avenue
AWOL absent without leave
b. born, born in
B.A. Bachelor of Arts
B.C. Before Christ
b.p. boiling point
B.S. Bachelor of Science
Btu British thermal unit(s)
C Celsius (centigrade)
c. circa [about]
cal calorie(s)
Capt. Captain
cent. century, centuries
cm centimeter(s)
co. county
Col. Colonel; Colossians
Comdr. Commander
Corp. Corporation
Cpl. Corporal
cu cubic
d. died, died in
D.C. District of Columbia
Dec. December
dept. department
dist. district
div. division
Dr. doctor
E east, eastern
ed. edited, edition, editor(s)
est. established; estimated
et al. et alii [and others]
F Fahrenheit
Feb. February
fl. floruit [flourished]
fl oz fluid ounce(s)
FM frequency modulation
ft foot, feet
gal. gallon(s)
Gen. General; Genesis
GMT Greenwich mean time
GNP gross national product
GOP Grand Old Party (Republican Party)
Gov. governor
grad. graduated, graduated at
H hour(s)
Hon. the Honorable
hr hour(s)
i.e. id est [that is]
in. inch(es)
inc. incorporated
Inst. Institute, Institution
IRA Irish Republican Army
IRS Internal Revenue Service
Jan. January
Jr. Junior
K Kelvin
kg kilogram(s)
km kilometer(s)
£ libra [pound], librae [pounds]
lat. latitude
lb libra [pound], librae [pounds]
Lib. Library
long. longitude
Lt. Lieutenant
Ltd. Limited
m meter(s)
M minute(s)
M.D. Medicinae Doctor [Doctor of Medicine]
mg milligram(s)
mi mile(s)
min minute(s)
mm millimeter(s)
mph miles per hour
Mr. Mister (always abbreviated)
Mrs. Mistress (always abbreviated)
Msgr Monsignor
mt. Mount, Mountain
mts. mountains
Mus. Museum
N north; Newton(s)
NAACP National Association for the Advancement of Colored People
NASA National Aeronautics and Space Administration
NATO North Atlantic Treaty Organization
NE northeast
no. number
Nov. November
OAS Organization of American States
Oct. October
Op. Opus [work]
oz ounce(s)
pl. plural
pop. population
pseud. pseudonym
pt. part(s)
pt pint(s)
pub. published; publisher
qt quart(s)
Rev. Revelation; the Reverend
rev. revised
R.N. registered nurse
rpm revolution(s) per minute
RR railroad
S south
S second(s)
SEATO Southeast Asia Treaty Organization
SEC Securities and Exchange Commission
sec second(s); secant
Sept. September
Ser. Series
Sgt. Sergeant
sq square
Sr. Senior
SSR Soviet Socialist Republic
St. Saint; Street
UNICEF United Nations Children’s Fund
uninc. unincorporated
Univ. University
U.S. United States
USA United States Army
USAF United States Air Force
USCG United States Coast Guard
USMC United States Marine Corps
USN United States Navy
USSR Union of Soviet Socialist Republics
VFW Veterans of Foreign Wars
VISTA Volunteers in Service to America
vol. volume(s)
vs. versus
W west; watt(s)
WHO World Health Organization
wt. weight
yd yard(s)
YMCA Young Men’s Christian Association
YWCA Young Women’s Christian Association