TTS Systems for Android

ABSTRACT

There are various TTS (Text to Speech) systems currently available for personal computers and web applications.

On smartphone platforms, however, only a few TTS systems are available for the Bangla language. Android is currently the most popular smartphone platform. Several Bangla TTS systems exist, built with different mechanisms, techniques, and tools. In this article, we present a comprehensive overview of these mechanisms and summarize the existing systems.

Introduction

There are more than 250 million speakers of Bengali in four states across two countries worldwide.

In order to read Bengali text aloud, the most widely available device today is the mobile phone. In Bangladesh, more than 14 million people use mobile phones, and 30% of them use smartphones. The increasing popularity of smartphones can be attributed to their reliability, numerous functions, ability to access faster internet, and compatibility with open-source applications. These capabilities have significantly enhanced communication, making text messaging the primary means of communication.

Various TTS engines are available for English and other languages, simplifying everyday tasks, and a few TTS systems are also available for Bangla on smartphones. Text and speech are both effective communication methods, and converting text to speech and vice versa greatly enhances the communication cycle, making it more convenient than ever before. Thanks to this technology, individuals can express themselves by simply texting on their mobile phones. Speech remains the most natural form of interaction and communication.

Speech Synthesis is an important component of a TTS engine and has been approached by several authors in various ways for the Bengali language. By examining different approaches, we can gain a fundamental understanding of the techniques employed for Speech Synthesis. Currently, TTS engines predominantly utilize pre-recorded voices but also generate a symbolic linguistic representation. Thus, in this discussion, our emphasis will be on existing systems and their potential to produce more lifelike voices. Ultimately, the concatenation of final speech tokens should imitate natural communication patterns.

Recorded voices are stored in a database. Different systems use different sizes for the stored units, whether full utterances or individual words recorded by humans, and the clarity of these recordings varies. The authors of these systems have focused on optimizing the code and compressing the database as much as possible.

Many new methods of Speech Synthesis have been attempted. Android is a popular operating system for smartphones because it allows the installation and use of open-source applications, letting anyone create better applications for personal or business use. It is therefore important to develop a Bangla TTS for Android. Our research aims to survey the best existing Bangla TTS systems on the Android platform and to highlight their research outputs, findings, and potential future work. We discuss the main points presented by various authors and provide a comparison of all the systems.

Research and development of the Bangla TTS engine have advanced greatly in recent years, and numerous publications target the Android mobile platform. This discussion will focus on a few of these publications.

Case Study 1:

After reviewing the paper titled "A Bengali Speech Synthesizer on Android OS" by Sankar Mukherjee and Shyamal Kumar Das Mandal, it was revealed that their objective was to create a Bengali speech synthesizer for mobile devices. They utilized the Epoch Synchronous Non Overlap Add (ESNOLA) based concatenative speech synthesis technique to generate speech. Efforts were made to compress the database, as limited space was a constraint. In the past, small diphone databases were used, leading to a decrease in the quality of synthesized speech.

On the other hand, Pucher and Frohlich (2005) utilized a large unit selection database and employed a server to generate synthesized speech output. A network was required to transfer the waveform to a mobile device. Their goal was to achieve high-quality output in near real-time on mobile devices. Speech synthesis is the process of converting textual data into speech waveforms.

The vocabulary size determines the Synthesis method. To model speech utterances, various speech synthesis techniques are utilized, including rule-based, articulatory modeling, and concatenative techniques. In this context, the ESNOLA concatenative speech synthesis method was employed to develop their synthesizer. ESNOLA enables adequate processing for ensuring proper matching between segments during concatenation and allows for an unlimited vocabulary without compromising quality. Therefore, this method could be suggested as a beneficial approach for Speech Synthesis.

The full operation of the system is divided into four parts: the input text state, the output speech state, and, between these two, the Text Analysis module and the Synthesizer module, where the major operations are performed.

Comprehensive speech output requires multiple components including intonation, prosody, and phonological words, and it is vital to handle exceptions when converting text to speech. This paper addresses all these aspects through the Text Analysis module in their system model. This module consists of two sections: the phonological analysis module and the analysis of the text for prosody and intonation. Exceptional words are dealt with in the initial phonological rule section.

They have developed and implemented phonological rule analysis of the text for prosody and intonation, as mentioned by Basu et al. (2009). Additionally, they have also worked with an exceptional dictionary for language analysis requirements. Consequently, the phonological analysis module completes the overall processing of the text-related components. The subsequent module will be responsible for synthesizing the information.

The synthesizer module is responsible for producing a realistic and high-quality speech. It functions by receiving the finalized text from the text analysis module, generating a token, and then combining splices of pre-recorded speech to generate a synthesized voice output. This is done using the ESNOLA approach, as explained in the study by Shyamal Kr Das Mandal et al. (2007). In this approach, the synthesized speech output is created by concatenating basic signal segments from a signal dictionary at specific epoch positions. The application described in the study was implemented according to the following system specifications. However, memory management poses a significant challenge in the Android platform, limiting its widespread use.
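As an illustration of this concatenation step, the sketch below joins two pre-recorded PCM segments at their epoch (pitch-mark) positions with a short crossfade. It is a minimal sketch, not the authors' actual ESNOLA implementation; the segment arrays, epoch indices, and fade length are assumed inputs coming from a partneme database prepared offline.

```java
// Minimal sketch of joining two PCM segments at epoch (pitch-mark) positions
// with a short crossfade. Not the authors' actual ESNOLA implementation;
// segments and epoch indices are assumed to come from the partneme database.
public final class EpochConcatenator {
    public static short[] join(short[] a, int epochA, short[] b, int epochB, int fade) {
        int lenA = epochA;               // keep segment A up to its epoch mark
        int lenB = b.length - epochB;    // keep segment B from its epoch mark onward
        short[] out = new short[lenA + lenB];
        System.arraycopy(a, 0, out, 0, lenA);
        for (int i = 0; i < lenB; i++) {
            double w = i < fade ? (double) i / fade : 1.0;           // ramp B in
            double tail = epochA + i < a.length ? a[epochA + i] : 0; // ramp A out
            double v = w * b[epochB + i] + (1.0 - w) * tail;
            out[lenA + i] = (short) Math.max(Short.MIN_VALUE,
                                             Math.min(Short.MAX_VALUE, Math.round(v)));
        }
        return out;
    }
}
```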

The paper discusses the application context, which remains active as long as the application is running, regardless of any activity's life cycle; it is obtained by calling Activity.getApplication(). The partneme database is stored on an external storage card, and the final speech file is deleted after output is generated. The partneme database contains a total of 596 sound files, with a database size of 1.0 MB and an application size of 2.26 MB.
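The following hedged sketch shows the Android pattern the paper alludes to: obtaining the long-lived application context via Activity.getApplication() and playing a synthesized .wav from external storage, deleting it afterwards. The file name and path are illustrative assumptions, not taken from the paper.

```java
// Hedged sketch of the Android pieces the paper mentions: the application
// context (outliving any single activity) and playback of a synthesized .wav
// from external storage, deleted after playing. Path and file name are illustrative.
import android.app.Activity;
import android.content.Context;
import android.media.MediaPlayer;
import android.net.Uri;
import android.os.Environment;
import java.io.File;

public class SpeechPlayback {
    public static void playSynthesized(Activity activity) {
        Context appContext = activity.getApplication();  // long-lived context, as described
        File wav = new File(Environment.getExternalStorageDirectory(),
                            "tts_output.wav");           // illustrative file name
        MediaPlayer player = MediaPlayer.create(appContext, Uri.fromFile(wav));
        player.setOnCompletionListener(mp -> {
            mp.release();
            wav.delete();  // the paper notes the final speech file is deleted after output
        });
        player.start();
    }
}
```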

The TTS system has two main capabilities: reading Bengali messages from a phone's inbox and generating speech from Bengali words typed in the English alphabet. The performance and quality of the system are evaluated by calculating the total processing time, measured from when the button is pressed to initiate speech until the first sound is pronounced. The application underwent various tests, and an audience also assessed its performance. To evaluate the output speech quality, five subjects were chosen: three males (L1, L2, L3) and two females (L4, L5), aged between 24 and 50.
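A minimal sketch of that latency measurement might look as follows; the two hook methods are hypothetical names for the button-press and first-audio events, not APIs from the paper.

```java
// Hedged sketch of the latency metric described above: time from the speak
// button being pressed until the first audible sample. Hook names are hypothetical.
public final class LatencyMeter {
    private long pressedAt;

    public void onSpeakPressed() { pressedAt = System.nanoTime(); }

    public void onFirstSound() {
        long ms = (System.nanoTime() - pressedAt) / 1_000_000;
        System.out.println("Total processing time: " + ms + " ms");
    }
}
```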

In an experiment, 10 original sentences and their synthesized versions were presented in random order for listening and judgment; the sentences were spoken by a human speaker and by the Android synthesizer. Participants rated the naturalness of the sentences on a 5-point scale (1 = least natural, 5 = most natural). The average score for the original sentences was 4.72, while the synthesized sentences averaged 2.88. In their paper, the authors discuss implementing a Bengali speech synthesizer on a mobile device.

Their objective was to modify components of ESNOLA for Android compatibility in order to develop a real-time text-to-speech (TTS) application.

Case Study 2:

The main goal of any good TTS engine is to convert written text in a specific language into spoken words using a series of modules; language modeling and speech synthesis play crucial roles in achieving this.

According to the paper "Text to Speech for Bangla Language using Festival" by Firoj Alam, Promila Kanti Nath, and Dr. Mumit Khan, the authors utilized the open-source third-party Festival TTS engine. Festival is a framework for constructing speech synthesis systems. The Festival system is coded in C++ and relies on the Edinburgh Speech Tools Library for its underlying structure; it also uses a Scheme (SIOD) based command interpreter for control and offers API documentation. The authors implemented two different concatenative methods, unit selection and multisyn unit selection, both of which are supported by Festival.

The researchers have discussed various aspects of speech analysis and synthesis, including Text Analysis, Phonetic Analysis, Grapheme to Phoneme Conversion, Prosodic Analysis, Speech Database or Waveform Synthesis, Speech Output, and Analysis of the output result. One issue they faced was dealing with non-standard words in the input text. To address this problem, they utilized the text analysis component to convert all non-standard words into standard ones. Their grapheme-to-phoneme module generates phonemic symbol strings based on the information within the written text. To achieve the final speech synthesis, they employed concatenative unit selection technique and multisyn unit selection technique. In their proposed system, the first step is text analysis. The main goal of a TTS engine is to convert the input text into speech, which is why the input text needs to be in a standard format.

There is a probability that the input text contains words of NSW (Non-Standard Word) type. The authors provide a list of NSW categories such as numbers (year, time, ordinal, cardinal, floating point), abbreviations, acronyms, currency, dates, and URLs. They used text normalization to convert NSW tokens into SW (Standard Word) form and resolved ambiguous tokens using rules. In their study, they did not work directly with Unicode due to the lack of Unicode support in Festival; instead, they converted Unicode text into ASCII. During the text analysis phase, they divided the text into tokens by separating white space and punctuation marks.

In Festival, white space is treated as a separator and punctuation can split off raw tokens; the Festival ordered list includes tokens with white-space and punctuation features. White space is commonly used for tokenization. The researchers found that the Bangla language has over 10 types of non-standard words (NSW), and each NSW can be identified as a separate token according to token identifier rules. Token identification in Festival is performed with regular-expression schemes.
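To make this concrete, here is an illustrative sketch of regex-based NSW classification in the spirit of Festival's token identifier rules. The patterns and category names are simplified assumptions, not the authors' actual rules (Festival itself expresses such rules in Scheme).

```java
// Illustrative NSW token classification with regular expressions.
// Patterns and categories are toy assumptions, not the authors' rules.
import java.util.regex.Pattern;

public final class TokenClassifier {
    private static final Pattern TIME     = Pattern.compile("\\d{1,2}:\\d{2}");
    private static final Pattern FLOAT    = Pattern.compile("\\d+\\.\\d+");
    private static final Pattern YEAR     = Pattern.compile("\\d{4}");
    private static final Pattern CARDINAL = Pattern.compile("\\d+");

    public static String classify(String token) {
        if (TIME.matcher(token).matches())     return "TIME";
        if (FLOAT.matcher(token).matches())    return "FLOAT";
        if (YEAR.matcher(token).matches())     return "YEAR";     // ambiguous with CARDINAL;
        if (CARDINAL.matcher(token).matches()) return "CARDINAL"; // real systems use context
        return "WORD";
    }
}
```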

After identifying all NSW tokens, the text is converted to standard words using a pronunciation lexicon or letter-to-sound (LTS) rules. Sometimes a word's written form does not match its pronunciation, but this problem is solved by combining a lexicon list with the LTS rules; 900 lexical entries with their pronunciations were inserted into the lexicon dictionary (a minimal sketch of this lookup follows the list below). The steps of phonetic analysis within Festival are as follows:

  1. Building a large lexicon.
  2. Building letter-to-sound rules.
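A minimal sketch of this lexicon-lookup-with-LTS-fallback flow is shown below; the lexicon entry and the one-phone-per-letter rule are toy assumptions, since real LTS rules are context-sensitive or trained from data.

```java
// Minimal sketch of pronunciation lookup with letter-to-sound (LTS) fallback.
// The lexicon entry and the toy LTS rule are illustrative assumptions.
import java.util.HashMap;
import java.util.Map;

public final class Pronouncer {
    private final Map<String, String> lexicon = new HashMap<>();

    public Pronouncer() {
        lexicon.put("bangla", "b a ng l a"); // hypothetical lexicon entry
    }

    public String pronounce(String word) {
        String w = word.toLowerCase();
        if (lexicon.containsKey(w)) return lexicon.get(w); // exact lexicon hit
        return ltsRules(w);                                // fall back to LTS rules
    }

    // Toy letter-to-sound rule: one phone per letter. Real systems use
    // context-sensitive rules or trained models.
    private String ltsRules(String w) {
        StringBuilder phones = new StringBuilder();
        for (char c : w.toCharArray()) phones.append(c).append(' ');
        return phones.toString().trim();
    }
}
```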

For concatenative synthesis, three techniques can be used: diphone, unit selection, and multisyn unit selection. Based on articulatory analysis, the authors identified 45 phones, excluding 31 diphthongs, along with their features.

To create a diphone database, it was necessary to include diphthongs; however, during the system's implementation the diphthongs were excluded. The phone durations were borrowed from the Kiswahili TTS system and do not accurately match the durations of Bangla phones. Around 500-900 utterances were recorded to cover commonly used words in the language. The performance of the system was assessed for acceptability/naturalness and intelligibility, with synthesized speech evaluated at three levels: sentence level, word level, and phrase level.

Intelligibility was rated at 85% at the sentence level, 83.33% at the phrase level, and 56.66% at the word level. In another experiment, the naturalness of the synthesized speech was rated at 90% for sentences, 85% for phrases, and 65% for words.

Case Study 3:

This paper describes a model consisting of three parts. The first part is the "LINGUISTIC MODULE," which generates a linguistic representation from text. The second part is the "ACOUSTIC MODULE," which converts the linguistic representation into speech. The third and final part is the "VISUAL MODULE," which uses the linguistic representation to animate a talking head.
They built a relational lexical database by combining three source lexica: The Carnegie Mellon Pronouncing Dictionary, Moby Pronunciation II, and COMLEX English pronouncing lexicon. This database contains approximately 200,000 words, including over 1500 non-homophonous homographs. An interesting aspect of their project is the use of animated images that move in relation to the subject.

In their Linguistic Module, they tokenize textual input and look up word pronunciations and tags in the lexical database. For words that are not present in the lexical database, they use a dynamic programming alignment algorithm, originally described for comparing sequences, to align letters with phones. In the letter-to-sound neural network, the features of a letter are defined as the union of the features of the phones it might represent. Having achieved competitive results, they believe that further improvement will come from simplifying the phonological representations found in the dictionary. Using this approach they create a preliminary linguistic representation of the utterance, which is then submitted to a postlexical module, where lexical pronunciations derived from the lexicon are converted to the postlexical pronunciations typically used by the speaker.
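The alignment step can be illustrated with a standard edit-distance dynamic program over letters and phones, sketched below; the unit costs and the letter-phone match test are illustrative assumptions, not the authors' trained setup.

```java
// Hedged sketch of dynamic-programming alignment (edit distance) between a
// word's letters and its phone sequence. Costs and the match test are toy choices.
public final class Aligner {
    public static int alignmentCost(char[] letters, String[] phones) {
        int n = letters.length, m = phones.length;
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;  // letter deletions
        for (int j = 0; j <= m; j++) d[0][j] = j;  // phone insertions
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = d[i - 1][j - 1] + (matches(letters[i - 1], phones[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[n][m]; // backtracking over d recovers the letter-phone pairing
    }

    // Toy match test: a letter "matches" a phone that starts with it.
    private static boolean matches(char letter, String phone) {
        return !phone.isEmpty()
            && Character.toLowerCase(phone.charAt(0)) == Character.toLowerCase(letter);
    }
}
```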

They take into account proximity to word, phrase, clause, and sentence boundaries. The converted linguistic representation is then forwarded to the Acoustic Module, which consists of three stages: 1. a Duration Neural Network, 2. a Phonetic Neural Network, and 3. a Waveform Synthesizer. The Acoustic Module determines the timing of the speech signal by associating a segment duration with each phone in the linguistic representation. An acoustic representation is generated for each ten-millisecond frame of speech, providing input parameters for the synthesis portion of a vocoder.
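Structurally, the three-stage acoustic module can be sketched as follows; the interfaces are placeholders standing in for the neural networks and vocoder, with only the ten-millisecond framing taken from the text.

```java
// Structural sketch of the three-stage acoustic module described above:
// duration prediction, per-frame acoustic parameters, then vocoder synthesis.
// Interfaces are placeholders, not the actual neural networks.
import java.util.ArrayList;
import java.util.List;

public final class AcousticModule {
    interface DurationModel { int framesFor(String phone); }                    // stage 1
    interface PhoneticModel { double[] paramsFor(String phone, int frameIdx); } // stage 2
    interface Vocoder       { short[] synthesize(List<double[]> frames); }      // stage 3

    public static short[] run(List<String> phones,
                              DurationModel dur, PhoneticModel ac, Vocoder voc) {
        List<double[]> frames = new ArrayList<>();
        for (String p : phones) {
            int n = dur.framesFor(p);                        // duration in 10-ms frames
            for (int f = 0; f < n; f++) frames.add(ac.paramsFor(p, f));
        }
        return voc.synthesize(frames);                       // vocoder synthesis
    }
}
```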

Finally, the synthesis part of the vocoder is used to create speech from the given acoustic descriptions. An intriguing aspect of their module is the accompanying video for the speech, which results in a more natural appearance; to achieve this, they gather animated images from nature. The video subsystem uses the outputs of both the linguistic module and the duration neural network to drive an animated figure via an additional neural network.

Case Study 4:

Sanghamitra Mohanty has developed an intelligent tool called Priyambada that offers speech output in four Indian languages simultaneously: Hindi, Odiya, Bengali, and Telugu. She implemented a common system for all four languages as part of her solution.

The researcher observed that Indian languages have a phonetic structure in which the mapping of graphemes to phonemes follows a linear pattern. As a result, the vowels and consonants in these languages are very similar, with a few exceptions. The researcher took these exceptions into consideration and developed an algorithm to address them. The TTS system involved three stages.

The first task is to create speech corpora. The researcher identifies speakers for the four native languages and records their voices in a laboratory environment using a noise-cancelling microphone. The recording is done in a single channel at a 16,000 Hz sampling frequency with 16-bit resolution. Through this process, the researcher collects voice samples from the speakers.
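For illustration, that recording format (mono, 16 kHz, 16-bit) can be expressed with the standard javax.sound.sampled API; this is an assumption about how one might encode it, not the researcher's actual tooling, which was written in C++.

```java
// Illustrative encoding of the recording format described above
// (single channel, 16,000 Hz sampling frequency, 16-bit samples).
import javax.sound.sampled.AudioFormat;

public final class CorpusFormat {
    public static AudioFormat format() {
        // 16 kHz sampling rate, 16-bit samples, mono, signed, little-endian
        return new AudioFormat(16000f, 16, 1, true, false);
    }
}
```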

The second task involves creating a database of the different syllables extracted from the text, storing the individual polysyllables for the different languages in .wav file format.

Finally, the synthesized output is produced by playing back the stored .wav files.

She does not, however, provide a solution for new words that are absent from her current database. The tool itself was implemented in C++.

Case Study 5:

The authors primarily concentrate on normalizing the text. Their work involves tokenization, token classification, token sense disambiguation, and standard word representation. They identified several sources of ambiguous tokens in Bangla, including the mixing of multiple languages (English, Arabic, Hindi, etc.) within Bangla text.

The most challenging part of tokenizing involves dealing with numbers, dates, years, times, and multi-text genres in the Bangla language. To address this, two approaches were identified: the first tokenizes regular Bangla text, while the second manages ambiguous words. The process entails three stages.

The first stage uses a Tokenizer to tokenize English and other South Asian scripts. The second stage uses a Bangla Splitter to handle punctuation and delimiters, and to tokenize phone numbers, years, times, and floating point numbers. Lastly, a Classifier is employed to verify contextual rules and remove the various delimiters; in this stage, regular expressions written in .jflex format check each type of token.
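A compact sketch of this three-stage pipeline is given below; the stage internals are toy placeholders (the actual system writes its rules in .jflex format), and the punctuation set and tag names are assumptions.

```java
// Compact sketch of the three-stage normalization pipeline described above:
// (1) whitespace tokenizer, (2) splitter detaching punctuation/delimiters,
// (3) classifier tagging each token. All internals are toy placeholders.
import java.util.ArrayList;
import java.util.List;

public final class NormalizationPipeline {
    static List<String> tokenize(String text) {                 // stage 1: Tokenizer
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) if (!t.isEmpty()) out.add(t);
        return out;
    }

    static List<String> split(List<String> tokens) {            // stage 2: Splitter
        List<String> out = new ArrayList<>();
        for (String t : tokens)
            for (String p : t.split("(?=[,;:!?])|(?<=[,;:!?])")) // detach punctuation
                if (!p.isEmpty()) out.add(p);
        return out;
    }

    static List<String> classify(List<String> tokens) {         // stage 3: Classifier
        List<String> out = new ArrayList<>();
        for (String t : tokens)
            out.add(t + "/" + (t.matches("\\d+") ? "NUM" : "WORD")); // toy rule
        return out;
    }

    public static List<String> run(String text) {
        return classify(split(tokenize(text)));
    }
}
```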

Overall, this process ensures that ambiguous tokens appear more natural in their context.

To ensure that ambiguous words such as non-natural-number cardinals, ordinals, acronyms, and abbreviations sound natural, the following stages are used: traverse the number from right to left and map the first two digits with a lexicon to obtain their expanded form. After the expanded form of the third digit, the token "hundred" is inserted. The lexicon is then used to obtain the expanded form of each subsequent pair of digits: the token "thousand" is inserted after the expanded form of the fourth and fifth digits, and "lakh" after the expanded form of the sixth and seventh digits. These stages continue, and after each further two-digit block the token "koti" is inserted so that the result sounds natural.
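As a sketch of this right-to-left expansion under the South Asian grouping (hundred, thousand, lakh, koti), consider the following; the digit "lexicon" here is a toy stub that just prints the numeric value, where a real system would map Bangla digit groups to their word forms.

```java
// Sketch of the right-to-left expansion: last two digits, then "hundred"
// after the third digit, then digit pairs with "thousand", "lakh", and
// "koti" (repeated for further pairs). The digit "lexicon" is a toy stub.
import java.util.ArrayDeque;
import java.util.Deque;

public final class NumberExpander {
    private static String words(String digits) {         // toy lexicon lookup
        int v = Integer.parseInt(digits);
        return v == 0 ? "" : String.valueOf(v);
    }

    public static String expand(String num) {
        Deque<String> parts = new ArrayDeque<>();
        int end = num.length();

        int cut = Math.max(0, end - 2);                   // last two digits
        push(parts, words(num.substring(cut, end)), "");
        end = cut;

        if (end > 0) {                                    // third digit -> "hundred"
            cut = end - 1;
            push(parts, words(num.substring(cut, end)), "hundred");
            end = cut;
        }

        String[] scales = {"thousand", "lakh", "koti"};   // then pairs of digits
        for (int s = 0; end > 0; s++) {
            cut = Math.max(0, end - 2);
            push(parts, words(num.substring(cut, end)), scales[Math.min(s, 2)]);
            end = cut;
        }
        return String.join(" ", parts);
    }

    private static void push(Deque<String> parts, String w, String scale) {
        if (w.isEmpty()) return;                          // skip zero groups
        if (!scale.isEmpty()) parts.addFirst(scale);
        parts.addFirst(w);
    }
}
// expand("123456789") -> "12 koti 34 lakh 56 thousand 7 hundred 89"
```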

References

  1. Francesc Alias, Xavier Sevillano, Joan Claudi Socoro and Xavier Gonzalvo, "Towards High-Quality Next Generation Text-to-Speech Synthesis: A Multidomain Approach by Automatic Domain Classification", IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 7, September 2008.
  2. Qing Guo, Jie Zhang, Nobuyuki Katae and Hao Yu, "High-Quality Prosody Generation in Mandarin Text-to-Speech System", Fujitsu Sci. Tech. J., vol. 46, no. 1, pp. 40-46, 2010.
  3. Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R. N. V. Sitaram and S. P. Kishore, "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems".
  4. A. Black, H. Zen and K. Tokuda, "Statistical parametric speech synthesis", in Proc. ICASSP, Honolulu, HI, 2007, vol. IV, pp. 1229-1232.
  5. G. Bailly, N. Campbell and B. Mobius, "ISCA special session: Hot topics in speech synthesis", in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 37-40.
  6. M. Ostendorf and I. Bulyko, "The impact of speech recognition on speech synthesis", in Proc. IEEE Workshop on Speech Synthesis, Santa Monica, 2002, pp. 99-106.
  7. Jaibatrik Dutta, "Text To Speech Synthesis", knol.
  8. Silvio Ferreira, Celine Thillou and Bernard Gosselin, "From Picture to Speech: An Innovative Application for Embedded Environment".
  9. M. Nageshwara Rao, Samuel Thomas, T. Nagarajan and Hema A. Murthy, "Text-to-Speech Synthesis Using Syllable-Like Units".
  10. Jindrich Matousek, Josef Psutka and Jiri Kruta, "Design of Speech Corpus for Text-to-Speech Synthesis".
  11. M. Beckman and G. Elam, "Guidelines for ToBI Labeling", manuscript, version 3, 1997.
  12. G. Corrigan, N. Massey and O. Karaali, "Generating Segment Durations in a Text-to-Speech System: A Hybrid Rule-Based/Neural Network Approach", in Proc. Eurospeech '97, Rhodes, September 1997.
  13. I. Gerson, O. Karaali, G. Corrigan and N. Massey, "Neural Network Speech Synthesis", Speech Science and Technology Conference (SST-96), Australia, 1996.
  14. O. Karaali, G. Corrigan and I. Gerson, "Speech Synthesis with Neural Networks", World Congress on Neural Networks (WCNN-96), San Diego, September 1996.
  15. O. Karaali, G. Corrigan, I. Gerson and N. Massey, "Text-to-Speech Conversion with Neural Networks: A Recurrent TDNN Approach", in Proc. Eurospeech '97, September 1997.
  16. P. Kiparsky, "Lexical phonology and morphology", in Linguistics in the Morning Calm, ed. I. S. Yang, Seoul: Hanshin, 1982.
  17. J. Kruskal, "An overview of sequence comparison", in Time Warps, String Edits, and Macromolecules, eds. J. Kruskal and D. Sankoff, Reading, MA: Addison-Wesley, 1983.
  18. Linguistic Data Consortium, "COMLEX English Pronouncing Lexicon", version 0.2, Trustees of the University of Pennsylvania, 1995.
  19. C. Miller, O. Karaali and N. Massey, "Variation and Synthetic Speech", NWAVE 26, Quebec, October 1997.
  20. H. Nusbaum, A. Francis and T. Luks, "Comparative evaluation of the quality of synthetic speech produced at Motorola", research report, Spoken Language Research Laboratory, University of Chicago, 1995.
  21. D. O'Shaughnessy, "Modeling fundamental frequency, and its relationship to syntax, semantics, and phonetics", Ph.D. thesis, M.I.T., 1976.
  22. T. Sejnowski and C. Rosenberg, "NETtalk: a parallel network that learns to pronounce English text", Complex Systems, 1987.
  23. S. Seneff and V. Zue, "Transcription and alignment of the TIMIT database", M.I.T., 1988.
  24. C. Tuerk and T. Robinson, "Speech synthesis using artificial neural networks trained on cepstral coefficients", in Proc. Eurospeech '93, Berlin, September 1993.
  25. G. Ward, "Moby Pronunciator II", 1996.
  26. R. Weide, "The Carnegie Mellon Pronouncing Dictionary cmudict.0.4", 1995.
