Image To Voice Converter Is Software Computer Science Essay Example

  • Pages: 9 (2290 words)
  • Published: August 9, 2018
  • Type: Case Study

An Image to Voice converter is software or a device that recognizes an image and converts it into a human voice. The purpose of the conversion is to provide a communication aid that helps blind people sense what object is in their hand or in front of them. The converter is also suitable for the early education of children aged three to six.

The converter in this project consists of image processing and sound generation. Image processing is a series of computational techniques for analyzing, reconstructing, compressing, and enhancing images. An image of the input object is captured through scanning or a webcam; it is then analyzed and manipulated using specialized software applications such as MATLAB, and the result is sent to an output device such as a printer or a monitor.

Image processing has several techniques, including template matching, KNN (K-Nearest Neighbour), and thresholding. Template matching is a technique for finding small parts of an image that match a template image; it is also used to identify printed characters, numbers, and other small, simple objects. KNN (K-Nearest Neighbour) is an algorithm that works very well in practice and is easy to understand. It is also a lazy algorithm: it does not use the training data points to do any generalization. Thresholding is one of the most important approaches to image segmentation. It is a non-linear operation that converts a gray-scale image into a binary image.

The purpose of image processing in this project is to analyze a picture using techniques that can identify shades, colours and relationships that cannot be observed by the human eye. Image processing is also used to solve identification problems, e.g. in forensic medicine or in establishing weather maps from satellite photos. It deals with images in bitmapped graphics form that have been scanned in or taken with digital cameras. Sound generation produces a sound through the Windows sound library or by playing a WAV file from the computer.
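The sound generation step can be illustrated with a minimal sketch. It is written in Python rather than the project's MATLAB, and it only builds the bytes of a WAV file in memory; the tone frequencies and the `LABEL_TO_FREQ` mapping are hypothetical stand-ins for whatever recorded voice clips the real converter would play.

```python
import io
import math
import struct
import wave


def tone_wav_bytes(freq_hz=440.0, seconds=0.2, rate=8000):
    """Return the bytes of a mono, 16-bit WAV file holding a sine tone."""
    n_frames = round(seconds * rate)
    frames = b"".join(
        # Half-amplitude sine sample packed as a little-endian signed 16-bit integer.
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


# Hypothetical mapping from a recognized label to a tone frequency.
LABEL_TO_FREQ = {"circle": 440.0, "square": 660.0}


def sound_for(label):
    """Return WAV bytes for a recognized label (assumed label names)."""
    return tone_wav_bytes(LABEL_TO_FREQ[label])
```

Writing the result of `sound_for("circle")` to a `.wav` file gives something any media player can open; a real system would more likely look up a pre-recorded spoken word per label.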

Problem Statement

Nowadays, many visually impaired people still rely on a white cane to sense the direction of the road and the objects in front of them. With only a plain stick and no sight, it is difficult for a person to get a sense of direction, and they will probably not know what the objects around them are. As the economy worsens, most people and family members are busy with their working lives and have no extra time to spend caring for people with disabilities. People with disabilities, especially blind people, therefore have to get used to managing on their own. In addition, this product can help small children improve their ability to distinguish everyday objects. This is the reason why the product mentioned above was developed.

Project Aim and Objective

The aim of this project is to develop an Image to Voice converter that is able to recognize an image from the webcam and then convert it into sound, using the Windows sound library or a WAV file, with good performance. To achieve the main objective of this project, the following sub-objectives need to be carried out:

  • To develop a unique image recognition algorithm for shapes and colours for real-time application using MATLAB.
  • To analyze the performance of the image recognition algorithm in terms of accuracy and processing time.
  • To develop an algorithm to convert a recognized image to voice using MATLAB.
  • To analyze the performance of the image to voice conversion algorithm.
  • To test the performance of the closed-loop interface for the image and sound processing converter system.
  • To develop a Graphical User Interface (GUI) for the image to voice converter for ease of use.

Project Scope/Limitation

The scope of this project is to construct a unique image to voice converter within a set period of time at a cost not exceeding RM200. The project consists of hardware, which is a webcam, and software, which is MATLAB. The system captures an image using the webcam, then recognizes the image and generates a sound using MATLAB with several techniques. The product is specially created for visually impaired people and to improve small children’s learning capability. The project has a few limitations, as follows:

  • Shape limitation
  • Colour limitation
  • Resolution limitation
  • Distance limitation

Literature Review

Image processing is a technique to convert an image into digital form and perform operations on it, so as to get an enhanced image or to extract useful information from it. It is a kind of signal processing in which the input is an image, such as a video frame or photograph, and the output may be an image or features related to that image. Image processing systems usually treat images as two-dimensional signals and apply established signal processing techniques to them[1].

The image recognition process can be divided into several stages: image acquisition, image pre-processing, image segmentation, image representation and image classification. In image acquisition, a digital image is captured by one or several image sensors, such as various types of light-sensitive cameras, range sensors, tomography devices, radar, and ultrasonic cameras. Depending on the type of sensor, the resulting image data is an ordinary two-dimensional image, a three-dimensional volume, or an image sequence. The pixel values usually correspond to light intensity in one or several spectral bands, but can also be related to other physical measures, such as depth, absorption or reflectance of sonic or electromagnetic waves, or nuclear magnetic resonance.

Image pre-processing is one of the stages that can increase the reliability of an optical inspection. It can be divided into two categories: image enhancement and image restoration. Image enhancement intensifies particular features of an image, either for display or for analysis. Enhancement techniques include edge enhancement, noise filtering, magnifying and sharpening an image. Several filter operations that intensify or reduce certain image features allow an easier or faster evaluation, for example the mean filter, median filter, and Wiener filter. With continuous use, an image becomes degraded and accumulates many errors. Image restoration is the process used to restore the degraded image. This process is also used to correct images read from different sensors that show up murky or out of focus[2].

Next, image segmentation is performed to group pixels into salient image regions, for example, regions corresponding to specific surfaces, objects, or natural parts of objects. Segmentation can be used for object recognition, occlusion boundary estimation within motion or stereo systems, image compression, image editing, or image database look-up. Traditional image segmentation methods include gray threshold segmentation, edge extraction, region growing, and split-and-merge. The threshold technique was applied in this project. It is a technique that deals with gray-scale images. Ignoring, for the moment, the influence of noise or illumination, it can be assumed that the majority of pixels belonging to the objects will have a relatively low gray-level, whereas the background pixels will have a relatively high gray-level. For example, black is represented by a gray-level of 0, and white by a gray-level of 255. Based on this observation, we can divide the pixels in the image into two dominant groups according to their gray-level. These gray-levels may serve as “detectors” to distinguish between background and objects in the image. On the other hand, if the image is one of smooth-edged objects, it will not be a pure black-and-white image; hence it will not be possible to find two distinct gray-levels characterizing the background and the objects. This problem intensifies with the existence of noise[3]. Two methods can be used to overcome the ill influence of noise and shading: Otsu’s method, a “global threshold”, and the neighbourhood method, an “adaptive threshold”.
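The difference between a global and a neighbourhood (adaptive) threshold can be shown with a small sketch. It is written in plain Python instead of the project's MATLAB, and the function names, the `radius` and `offset` parameters and the sample pixel values are all assumptions made for illustration. Following the convention above, dark pixels are treated as object (1) and bright pixels as background (0).

```python
def global_threshold(img, t):
    """Mark every pixel darker than the single threshold t as object (1)."""
    return [[1 if p < t else 0 for p in row] for row in img]


def adaptive_threshold(img, radius=1, offset=0):
    """Mark a pixel as object (1) if it is darker than its local mean minus offset.

    The local mean is taken over a (2*radius+1) x (2*radius+1) window,
    clipped at the image border.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if img[y][x] < local_mean - offset else 0
    return out


# One image row with uneven lighting: the left half is in shadow.
# The two "object" pixels are the local dips at index 2 and index 6.
shaded = [[50, 60, 20, 75, 150, 160, 110, 170]]

print(global_threshold(shaded, 100))              # [[1, 1, 1, 1, 0, 0, 0, 0]]
print(adaptive_threshold(shaded, radius=1, offset=10))  # [[0, 0, 1, 0, 0, 0, 1, 0]]
```

The single global threshold labels the whole shaded half as object, while the neighbourhood version picks out only the two pixels that are dark relative to their surroundings.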

For image representation, all information is commonly represented in binary. This is true of images as well as numbers and text. However, an important distinction needs to be made between how image data is displayed and how it is stored. Displaying involves a bitmap representation, while storing as a file involves one of many image formats, such as JPEG and PNG[4]. Techniques for image representation include the roundness ratio, also known as circularity, and Fourier descriptors.

The intent of the image classification procedure is to sort all pixels in a digital image into one of several land cover categories, or “themes”. This categorized data may then be used to produce thematic maps of the land cover present in an image. Ordinarily, multispectral data are used to carry out the classification and, indeed, the spectral pattern present within the data for each pixel is used as the numerical basis for categorization. The purpose of image classification is to identify and portray, as a distinct gray level or colour, the features occurring in an image in terms of the object or type of land cover these features actually represent on the ground[5]. The techniques used for this stage are template matching and KNN (K-Nearest Neighbour).

Analysis on Similar Products and Paper Literatures

Oral Image to Voice Converter by Takaaki HASEGAWA and Keiichi OHTANI[16]:

In this paper, the authors propose a new speech communication system that converts an oral image into voice, the “Image input Microphone”. This system synthesizes the voice from the oral image only. It provides high security and is not affected by acoustic noise, because actual utterance is not always required as input. Moreover, since the voice is synthesized without recognition, the system is independent of language.

Simulations converting oral images to voice for the five Japanese vowels are carried out as a basic investigation. A vocal tract area function is estimated from the oral image, and a PARCOR synthesis filter is obtained from the vocal tract area function. The PARCOR synthesis filter is driven by a pulse train. The performance of the system is evaluated by hearing tests of the synthesized voice. As a result, audible voice was synthesized and the mean recognition rate for the five Japanese vowels was 91%.

This paper describes a system that converts an oral image into voice, taking human lip-reading ability into consideration. In the proposed system, the voice is synthesized directly from the oral image alone, without recognition, and actual utterance is not always required as input. The authors use features of both the tongue and the lips obtained from the oral image. The system is therefore not affected by acoustic noise and, at the same time, provides high security because no utterance needs to be input.

The system uses, as a parameter, a vocal tract area function, which is equivalent to the transfer function of the vocal tract. “Indirect” means synthesis via the vocal tract area function. The vocal tract area function is obtained from the PARCOR analysis of speech signals, and speech signals are synthesized by inverse processing of the PARCOR analysis. Therefore, if the vocal tract area function is estimated from oral image signals, the oral image can be converted to the corresponding voice. Humans utter various voices by changing the shape of the vocal tract, and the articulators move not independently but cooperatively during utterance. It is generally known that information about articulation can be obtained from lip-reading.

Project’s Method

Median Filter

Median filters are nonlinear rank-order filters based on replacing each element of the source vector with the median value, taken over the fixed neighbourhood of the processed element.

These filters are widely used in image and signal processing applications. The purpose of median filtering is to remove impulsive noise while keeping blurring of the signal to a minimum[18].
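A minimal one-dimensional sketch of this idea follows, in plain Python rather than the project's MATLAB. The function name, the `radius` parameter and the choice to repeat the edge values at the borders are assumptions for illustration; an image version would apply the same rule over a two-dimensional window.

```python
from statistics import median


def median_filter(signal, radius=1):
    """Replace each element with the median of its (2*radius+1)-wide neighbourhood.

    The first and last values are repeated at the borders so that every
    window has the same odd length.
    """
    padded = [signal[0]] * radius + list(signal) + [signal[-1]] * radius
    width = 2 * radius + 1
    return [median(padded[i:i + width]) for i in range(len(signal))]


# A flat signal with one impulsive spike (255) followed by a genuine step to 12.
noisy = [10, 10, 255, 10, 10, 12, 12]
print(median_filter(noisy))  # [10, 10, 10, 10, 10, 12, 12]
```

The spike is removed entirely, while the genuine step from 10 to 12 stays sharp, which is the behaviour a mean filter would not give.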

Otsu’s Method

Otsu’s method is a widely used method of segmentation, also known as the maximum inter-class variance method or, equivalently, the minimum intra-class variance method. This method involves iterating through all the possible threshold values and calculating a measure of spread for the pixel levels on each side of the threshold, i.e. the pixels that fall in either the foreground or the background. The aim is to find the threshold value where the weighted sum of foreground and background spreads is at its minimum[11].
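The exhaustive search described above can be sketched as follows, in plain Python rather than the project's MATLAB. The sketch takes variance as the measure of spread and weights each side by its pixel count; the function name and the convention that pixels below the threshold form one class are assumptions for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that minimizes the weighted within-class variance.

    Class 0 holds gray levels below t and class 1 holds gray levels t and above.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    best_t, best_score = 0, float("inf")
    for t in range(1, levels):
        low, high = hist[:t], hist[t:]
        w0, w1 = sum(low), sum(high)
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so this is not a real split
        mean0 = sum(g * c for g, c in enumerate(low)) / w0
        mean1 = sum((g + t) * c for g, c in enumerate(high)) / w1
        var0 = sum(c * (g - mean0) ** 2 for g, c in enumerate(low)) / w0
        var1 = sum(c * (g + t - mean1) ** 2 for g, c in enumerate(high)) / w1
        score = (w0 * var0 + w1 * var1) / total  # weighted sum of the two spreads
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Dark object pixels and bright background pixels (made-up values).
dark = [20, 22, 25, 30] * 10
bright = [200, 210, 220] * 10
t = otsu_threshold(dark + bright)
print(all(p < t for p in dark) and all(p >= t for p in bright))  # True
```

For clarity the sketch recomputes both class statistics for every candidate threshold; practical implementations update running sums instead.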

Roundness Ratio/Circularity

Roundness is defined as a condition of a surface of revolution, such as a cylinder, cone or sphere, where all points of the surface intersected by any plane perpendicular to a common axis (in the case of a cylinder or cone) or passing through a common centre (in the case of a sphere) are equidistant from the axis or centre. Since the axis and centre do not exist physically, measurements have to be made with reference to the surfaces of the figures of revolution only. When measuring roundness, it is only the circularity of the contour that is determined[12].
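The essay does not give a formula for the roundness ratio. A commonly used definition for a two-dimensional contour, assumed here, is 4πA/P², where A is the enclosed area and P the perimeter; it equals 1 for a perfect circle and falls for less round shapes. The plain-Python sketch below (the project itself used MATLAB) applies that assumed formula to a contour given as polygon vertices; the function name is hypothetical.

```python
import math


def circularity(points):
    """Return 4*pi*area / perimeter**2 for a closed polygon given as (x, y) vertices."""
    n = len(points)
    twice_area = 0.0
    perimeter = 0.0
    for k in range(n):
        x0, y0 = points[k]
        x1, y1 = points[(k + 1) % n]           # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0        # shoelace formula term
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = abs(twice_area) / 2
    return 4 * math.pi * area / perimeter ** 2


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
near_circle = [
    (math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
    for k in range(360)
]

print(round(circularity(square), 3))   # 0.785, i.e. pi / 4
print(circularity(near_circle) > 0.999)  # True
```

A shape classifier could use this single number as a feature: values close to 1 suggest a circle, while a square sits near 0.785.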

Template Matching

The classical template matching method is characterized by a simple mechanism and high detection accuracy, and is used as a general tool for model evaluation and error estimation. It therefore plays a very important role in image processing, and is widely used in object detection and recognition. It is a technique for finding small parts of an image that match a database image[14].
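As a rough sketch of the mechanism, the plain-Python function below (standing in for the project's MATLAB code) slides a template over an image and scores every position by the sum of absolute differences. That similarity measure is an assumption chosen for simplicity, not necessarily the one used in the project, and the function name and sample data are hypothetical.

```python
def match_template(image, template):
    """Return ((row, col), cost) of the best match of template inside image.

    The cost is the sum of absolute pixel differences; 0 means an exact match.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_cost = None, float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            cost = sum(
                abs(image[y + j][x + i] - template[j][i])
                for j in range(th)
                for i in range(tw)
            )
            if cost < best_cost:
                best_pos, best_cost = (y, x), cost
    return best_pos, best_cost


image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 1, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 1],
    [1, 9],
]
print(match_template(image, template))  # ((1, 2), 0)
```

To recognize one of several objects, the same search could be run once per database template and the label of the lowest-cost template reported.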

K-Nearest Neighbour (KNN)

K-Nearest Neighbour (KNN) is a family of simple classification and regression algorithms. It can be described as a lazy method: it does not use the training data points to do any generalization.

Although classification remains the primary application of KNN, it can also be used for density estimation. Since KNN is non-parametric, it can do estimation for arbitrary distributions[19].
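The "lazy" behaviour is easy to see in code: there is no training step, and all the work happens when a query arrives. The plain-Python sketch below (the project itself used MATLAB) classifies a query by majority vote among its k nearest stored examples. The two features, the labels and every sample value are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features per example: (circularity, normalized hue).
train = [
    ((0.98, 0.10), "circle"),
    ((0.95, 0.20), "circle"),
    ((0.97, 0.15), "circle"),
    ((0.78, 0.10), "square"),
    ((0.80, 0.20), "square"),
    ((0.76, 0.15), "square"),
]

print(knn_classify(train, (0.96, 0.12)))  # circle
print(knn_classify(train, (0.79, 0.18)))  # square
```

Because the stored examples are compared directly with each query, adding a new object class only requires appending labelled examples to `train`.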
