Design And Implementation Of Voice Recognition System

 

ABSTRACT

This project attempted to design and implement a voice recognition system that would identify different users based on previously stored voice samples. Each user inputs audio samples containing a keyword of his or her choice. This input was gathered, but successful processing to extract meaningful spectral coefficients was not achieved. These coefficients were to be stored in a database for later comparison with future audio inputs. Afterwards, the system was meant to capture an input from any user and match its spectral coefficients against all previously stored coefficients in the database, in order to identify the unknown speaker. Since the spectral coefficients were not acquired adequately, the system as a whole did not recognize any speakers, although we believe that modifying the system’s structure to decrease timing dependencies between the subsystems would make the implementation more feasible and less complex.

TABLE OF CONTENTS

COVER PAGE

TITLE PAGE

APPROVAL PAGE

DEDICATION

ACKNOWLEDGEMENT

ABSTRACT

CHAPTER ONE

INTRODUCTION

  • BACKGROUND OF THE PROJECT
  • PROBLEM STATEMENT
  • OBJECTIVES OF THE PROJECT
  • SCOPE OF THE PROJECT
  • SIGNIFICANCE OF THE PROJECT
  • APPLICATION OF THE PROJECT
  • COMPONENTS OF THE PROJECT
  • PROJECT ORGANISATION

CHAPTER TWO

LITERATURE REVIEW

  • HISTORICAL BACKGROUND OF THE STUDY
  • MODELS, METHODS, AND ALGORITHMS
  • PERFORMANCE OF THE SYSTEM
  • ACCURACY OF THE SYSTEM

CHAPTER THREE

SYSTEM DESIGN

  • DESIGN OVERVIEW
  • BLOCK DIAGRAM
  • MODULE DESCRIPTION AND IMPLEMENTATION
  • DISTANCE PROCESSOR SUBSYSTEM (J)
  • SYSTEM OPERATION (R)

CHAPTER FOUR

4.1     TESTING AND DEBUGGING

CHAPTER FIVE

  • CONCLUSION
  • RECOMMENDATION
  • REFERENCES

CHAPTER ONE

1.0                INTRODUCTION

Speech recognition (SR) is the inter-disciplinary sub-field of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as “automatic speech recognition” (ASR), “computer speech recognition”, or just “speech to text” (STT). It incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields.

Some SR systems use “training” (also called “enrollment”) where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person’s specific voice and uses it to fine-tune the recognition of that person’s speech, resulting in increased accuracy. Systems that do not use training are called “speaker independent”[1] systems. Systems that use training are called “speaker dependent”.
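
To make the enrollment idea above concrete, the short Python sketch below (an illustration only, with hypothetical names, assuming feature vectors have already been extracted from the enrollment recordings) averages a speaker’s feature frames into a single template that can later be compared against unknown input:

    import numpy as np

    def enroll_speaker(feature_frames):
        """Average per-frame feature vectors into one speaker template.

        feature_frames: array of shape (num_frames, num_coefficients), e.g.
        spectral coefficients extracted from the enrollment recordings.
        """
        return np.mean(np.asarray(feature_frames, dtype=float), axis=0)

    # Hypothetical usage: one template per enrolled user, kept in a dictionary
    # that stands in for a persistent database.
    templates = {
        "alice": enroll_speaker(np.random.randn(200, 13)),  # placeholder features
        "bob": enroll_speaker(np.random.randn(180, 13)),    # placeholder features
    }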

Speech recognition applications include voice user interfaces such as voice dialing (e.g. “Call home”), call routing (e.g. “I would like to make a collect call”), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input).

The term voice recognition[2][3][4] or speaker identification[5][6] refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person’s voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.
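
Building on the hypothetical templates above, speaker identification can be sketched (again only as an illustration, not the method used by any particular product) by scoring an unknown utterance against every stored template and returning the closest match:

    import numpy as np

    def identify_speaker(unknown_frames, templates):
        """Return the enrolled speaker whose template is nearest to the input.

        unknown_frames: (num_frames, num_coefficients) features of the unknown voice.
        templates: dict mapping speaker name -> template vector (see the sketch above).
        """
        probe = np.mean(np.asarray(unknown_frames, dtype=float), axis=0)
        distances = {name: np.linalg.norm(probe - template)
                     for name, template in templates.items()}
        return min(distances, key=distances.get)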

From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, SoundHound, IflyTek, and CDAC, many of which have publicized the core technology in their speech recognition systems as being based on deep learning.

1.1                                          BACKGROUND OF THE STUDY

Alternatively referred to as speech recognition, voice recognition is a computer software program or hardware device with the ability to decode the human voice. Voice recognition is commonly used to operate a device, perform commands, or write without having to use a keyboard, mouse, or press any buttons. Today, this is done on a computer with automatic speech recognition (ASR) software programs. Many ASR programs require the user to “train” the ASR program to recognize their voice so that it can more accurately convert the speech to text. For example, you could say “open Internet” and the computer would open the Internet browser.
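
The “open Internet” example above could be approximated with the third-party Python package speech_recognition and the standard webbrowser module; this is a hedged sketch that assumes a working microphone, the PyAudio backend, and network access for the free Google Web Speech API:

    import webbrowser

    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:      # microphone capture requires PyAudio
        print("Say a command, for example 'open internet'...")
        audio = recognizer.listen(source)

    try:
        text = recognizer.recognize_google(audio).lower()
        if "open internet" in text:
            webbrowser.open("https://www.google.com")  # placeholder start page
    except sr.UnknownValueError:
        print("Speech was not understood.")
    except sr.RequestError as err:
        print("Recognition service error:", err)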

The first ASR device was used in 1952 and recognized single digits spoken by a user (it was not computer driven). Today, ASR programs are used in many industries, including healthcare, the military (e.g. F-16 fighter jets), telecommunications, and personal computing (e.g. hands-free computing).

1.2                                                    PROBLEM STATEMENT

Nearly 20% of the world’s population lives with some form of disability; many of these people are blind or unable to use their hands effectively. Speech recognition systems provide significant help in such cases, allowing users to share information with others by operating a computer through voice input. This project is designed and developed with that factor in mind. Our project is capable of recognizing speech and converting the input audio into text; it also enables a user to perform operations such as “save, open, exit” on a file by providing voice input.
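
The “save, open, exit” operations mentioned above could, for example, be wired to recognized text with a simple keyword-to-action table; the sketch below is hypothetical and assumes the recognized text is already available as a string and that an editor object with these methods exists:

    def handle_command(recognized_text, editor):
        """Map a recognized phrase to a hypothetical editor action."""
        text = recognized_text.lower()
        actions = {
            "save": editor.save,   # assumed methods on a hypothetical editor object
            "open": editor.open,
            "exit": editor.exit,
        }
        for keyword, action in actions.items():
            if keyword in text:
                return action()
        return None  # no known command was found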

1.3                                                             PROJECT OBJECTIVE

  1. To understand speech recognition and its fundamentals.
  2. To examine its working and its applications in different areas.
  3. To implement it as a desktop application.
  4. To develop software that can be used for: a) speech recognition, b) speech generation, c) text editing, and d) operating the machine through voice (a minimal speech-generation sketch follows this list).
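
As a minimal sketch of the speech generation part of objective 4 (assuming the third-party pyttsx3 package, which drives the operating system’s offline text-to-speech voices):

    import pyttsx3  # pip install pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # speaking rate in words per minute
    engine.say("Welcome to the voice recognition system.")
    engine.runAndWait()               # block until the utterance has been spoken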

1.4                                                           SCOPE OF THE STUDY

Voice recognition is “the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned” [ADA90]. While the concept could more generally be called “sound recognition”, we focus here on the human voice because we most often and most naturally use our voices to communicate our ideas to others in our immediate surroundings. In the context of a virtual environment, the user would presumably gain the greatest feeling of immersion, or being part of the simulation, if they could use their most common form of communication, the voice. The difficulty in using voice as an input to a computer simulation lies in the fundamental differences between human speech and the more traditional forms of computer input. While computer programs are commonly designed to produce a precise and well-defined response upon receiving the proper (and equally precise) input, the human voice and spoken words are anything but precise. Each human voice is different, and identical words can have different meanings if spoken with different inflections or in different contexts. Several approaches have been tried, with varying degrees of success, to overcome these difficulties.
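
To make the notion of “coding patterns” above concrete, the numpy sketch below (an illustration only, not the exact front end of any particular recognizer) splits a digitized signal into short overlapping frames and computes log-magnitude spectral coefficients for each frame:

    import numpy as np

    def spectral_coefficients(signal, frame_len=400, hop=160, num_coeffs=13):
        """Frame the signal and return log-magnitude spectral coefficients.

        signal: 1-D array of audio samples (e.g. 16 kHz mono speech).
        frame_len/hop: 25 ms frames with a 10 ms hop at 16 kHz.
        """
        signal = np.asarray(signal, dtype=float)
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            spectrum = np.abs(np.fft.rfft(frame))
            frames.append(np.log(spectrum[:num_coeffs] + 1e-10))  # keep the first coefficients
        return np.array(frames)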

1.5                                            SIGNIFICANCE OF THE PROJECT

The system also helps the user to open different system software, such as MS Paint, Notepad, and Calculator. At the initial level, effort is made to support the basic operations discussed above, but the software can be further updated and enhanced to cover more operations.

 1.6                                               TYPES OF VOICE RECOGNITION SYSTEMS

Automatic speech recognition is just one example of voice recognition; below are other types of voice recognition systems.

Speaker dependent system – The voice recognition software requires training before it can be used; during training, you read a series of words and phrases.

Speaker independent system – The voice recognition software recognizes most users’ voices with no training.

Discrete speech recognition – The user must pause between each word so that the speech recognition can identify each separate word.

Continuous speech recognition – The voice recognition can understand a normal rate of speaking.

Natural language – The speech recognition can not only understand the voice but also return answers to questions or other queries that are being asked.

1.7                                              APPLICATION OF THE STUDY

As voice recognition improves, it is being implemented in more places and it’s very likely you have already used it. Below are some good examples of where you might encounter voice recognition.

Automated phone systems – Many companies today use phone systems that help direct the caller to the correct department. If you have ever been asked something like “Say or press number 2 for support” and you say “2,” you used voice recognition.

Google Voice – Google Voice is a service that allows you to search and ask questions on your computer, tablet, and phone.

Siri – Apple’s Siri is another good example of voice recognition that helps answer questions on Apple devices.

Car Bluetooth – For cars with Bluetooth or hands-free phone pairing, you can use voice recognition to give commands such as “call my wife” to make calls without taking your eyes off the road.

In-car systems – Typically a manual control input, for example by means of a finger control on the steering wheel, enables the speech recognition system, and this is signalled to the driver by an audio prompt. Following the audio prompt, the system has a “listening window” during which it may accept a speech input for recognition. Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words.

Health care – In the health care sector, speech recognition can be implemented at the front end or back end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. Back-end or deferred speech recognition is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and the report finalized. Deferred speech recognition is widely used in the industry currently.

One of the major issues relating to the use of speech recognition in healthcare is that the American Recovery and Reinvestment Act of 2009 (ARRA) provides for substantial financial benefits to physicians who utilize an EMR according to “Meaningful Use” standards. These standards require that a substantial amount of data be maintained by the EMR (now more commonly referred to as an Electronic Health Record or EHR). The use of speech recognition is more naturally suited to the generation of narrative text, as part of a radiology/pathology interpretation, progress note or discharge summary: the ergonomic gains of using speech recognition to enter structured discrete data (e.g., numeric values or codes from a list or a controlled vocabulary) are relatively minimal for people who are sighted and who can operate a keyboard and mouse.

A more significant issue is that most EHRs have not been expressly tailored to take advantage of voice-recognition capabilities. A large part of the clinician’s interaction with the EHR involves navigation through the user interface using menus, and tab/button clicks, and is heavily dependent on keyboard and mouse: voice-based navigation provides only modest ergonomic benefits. By contrast, many highly customized systems for radiology or pathology dictation implement voice “macros”, where the use of certain phrases – e.g., “normal report”, will automatically fill in a large number of default values and/or generate boilerplate, which will vary with the type of the exam – e.g., a chest X-ray vs. a gastrointestinal contrast series for a radiology system.
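
The voice “macro” behaviour described above can be pictured as a phrase-to-boilerplate lookup; the trigger phrase and template text below are invented purely for illustration:

    # Hypothetical macro table: a recognized trigger phrase expands to boilerplate text.
    REPORT_MACROS = {
        "normal report": (
            "The lungs are clear. The cardiac silhouette is within normal limits. "
            "No acute osseous abnormality is identified."
        ),
    }

    def expand_macro(recognized_phrase):
        """Return boilerplate for a known trigger phrase, else the phrase unchanged."""
        return REPORT_MACROS.get(recognized_phrase.lower().strip(), recognized_phrase)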

Therapeutic use – Prolonged use of speech recognition software in conjunction with word processors has shown benefits to short-term-memory restrengthening in brain AVM patients who have been treated with resection. Further research needs to be conducted to determine cognitive benefits for individuals whose AVMs have been treated using radiologic techniques.

Military (High-performance fighter aircraft) – Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note have been the US program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France for Mirage aircraft, and other programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight display.

 1.8                                                    COMPONENTS OF THE STUDY

For voice recognition to work, you must have a computer with a sound card and either a microphone or a headset. Other devices, such as smartphones, have all of the necessary hardware built in. The software you use also needs voice recognition support; alternatively, if you want to use voice recognition everywhere, you need a program such as Nuance Naturally Speaking installed.
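
As a quick way to confirm that the hardware described above is visible to the software, the speech_recognition package (one assumed choice; other libraries expose similar queries) can list the audio input devices it detects:

    import speech_recognition as sr  # pip install SpeechRecognition

    # Print every audio input device the library can see; requires PyAudio.
    for index, name in enumerate(sr.Microphone.list_microphone_names()):
        print(index, name)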

 

1.9                                              PROJECT ORGANISATION

The work is organized as follows: Chapter One discusses the introductory part of the work, Chapter Two presents the literature review of the study, Chapter Three describes the methods applied, Chapter Four discusses the results of the work, and Chapter Five summarizes the research outcomes and the recommendations.

 
