WO2009064054A1 - Method and apparatus to detect voice activity - Google Patents

Method and apparatus to detect voice activity

Info

Publication number
WO2009064054A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
audio signal
noise
signal
random
Prior art date
Application number
PCT/KR2008/003231
Other languages
French (fr)
Inventor
Jae-Youn Cho
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2009064054A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • The extracted voice detection parameters are compared with a predetermined threshold value in operation 450.
  • Depending on the result of the comparison, a current frame is determined as voice activity in operation 460 or as non-voice activity in operation 470.
  • When the zero-crossing rate of a frame is less than the predetermined threshold value, a current frame is determined as voice activity, and when the zero-crossing rate of a frame is greater than the predetermined threshold value, a current frame is determined as non-voice activity.
  • When the frame power is greater than the predetermined threshold, a current frame is determined as voice activity, and when the frame power is less than the predetermined threshold, a current frame is determined as non-voice activity.
  • Voice and non-voice activities are determined according to the comparison between the voice detection parameters and the predetermined threshold value, and thus detection of voice activity of one frame is completed.
  • FIGS. 5A and 5B are graphs illustrating an audio signal and a zero-crossing rate for detecting voice activity according to an embodiment of the present invention.
  • FIG. 5A illustrates a graph (a) of plots of a general audio signal and a graph (b) of a zero-crossing rate of the audio signal.
  • In graph (a) of FIG. 5A, an x-coordinate indicates time and a y-coordinate indicates size.
  • In graph (b) of FIG. 5A, an x-coordinate indicates an order of a frame and a y-coordinate indicates a zero-crossing rate.
  • The zero-crossing rate is lower in voice activity.
  • The zero-crossing rate is greater due to unknown signal components, for example, background noise.
  • The zero-crossing rate may appear low. Accordingly, in plots of a general audio signal, non-voice activity cannot be identified.
  • FIG. 5B illustrates a graph (a) of plots of an audio signal to which a random signal having a small amount of energy is added and a graph (b) of a zero-crossing rate of the audio signal.
  • In graph (a) of FIG. 5B, an x-coordinate indicates time and a y-coordinate indicates size.
  • In graph (b) of FIG. 5B, an x-coordinate indicates an order of a frame and a y-coordinate indicates a zero-crossing rate.
  • VAD: Voice Activity Detection
  • EPD: End Point Detection
  • The invention can also be embodied as computer readable codes on a computer readable recording medium.
  • The computer readable recording medium is any data storage device that can store programs or data which can be thereafter read by a computer system. Examples of the computer readable recording medium include read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, hard disks, floppy disks, flash memory, optical data storage devices, and carrier waves (such as data transmission through the Internet).
  • The computer readable recording medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

Abstract

A method and apparatus to detect voice activity by using a zero-crossing rate includes removing noise included in an audio signal, adding a random signal having energy of a predetermined size to the audio signal from which noise is removed, extracting predetermined voice detection parameters from the audio signal to which the random signal is added, and comparing the extracted predetermined voice detection parameters with a threshold value and determining voice and non-voice activities.

Description

METHOD AND APPARATUS TO DETECT VOICE ACTIVITY
Technical Field
[1] The present general inventive concept relates to an audio processing system, and more particularly, to a method and apparatus to detect voice activity by using a zero-crossing rate.
Background Art
[2] In general, Voice Activity Detection (VAD) or End Point Detection (EPD) is used as a method of extracting voice activity in speech coding or speech recognition. In a conventional method of detecting voice activity, voice activity, or a starting point and an end point of a voice signal, is detected by using the energy of a frame and a zero-crossing rate of a frame. For example, a frame is determined to be voice activity when its zero-crossing rate is low, and to be non-voice activity when its zero-crossing rate is high.
[3] Here, since some types of noise or a null signal lower the zero-crossing rate, zero-crossing rates for voice activity may not be distinguishable from those for non-voice activity.
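The failure mode described in [3] can be reproduced with a small sketch. The 8 kHz sampling rate, the 160-sample frame, the 0.25 threshold, and the helper names are illustrative assumptions and do not come from the specification.

```python
import math

def zero_crossing_rate(frame):
    """Number of sign changes between adjacent samples, divided by the frame length."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / len(frame)

def conventional_is_voice(frame, zcr_threshold=0.25):
    # Conventional rule from the background section: a low ZCR means voice.
    # The threshold value is an assumption chosen for illustration.
    return zero_crossing_rate(frame) < zcr_threshold

# A null frame (no signal at all) and a faint 50 Hz hum, both non-voice.
silence = [0.0] * 160
hum = [0.01 * math.sin(2 * math.pi * 50 * n / 8000) for n in range(160)]

silence_flag = conventional_is_voice(silence)
hum_flag = conventional_is_voice(hum)
print(silence_flag, hum_flag)  # both non-voice frames are flagged as voice
```

Both frames have a zero-crossing rate near zero, so a rule that relies on a low zero-crossing rate alone cannot separate them from voiced frames.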
Disclosure of Invention
Technical Problem
[4] In other words, even though voice activity is detected using a zero-crossing rate in a conventional method, the detection may be false when some types of noise are added or there is no signal at all.
Technical Solution
[5] The present general inventive concept provides a method and apparatus to detect voice activity which enable robust detection of voice activity and lessen the drawback of using a zero-crossing rate.
[6] The present general inventive concept also provides an audio processing device employing an apparatus to detect voice activity.
[7] Additional aspects and utilities of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Advantageous Effects
[8] According to the present general inventive concept, artificial random noise is added to an audio signal before a zero-crossing rate is obtained, so that identification of voice and non-voice activities can be improved.
[9] In addition, a zero-crossing rate due to random noise can be used in VAD or EPD.
[10] Moreover, a noise removal algorithm is applied to an audio signal before obtaining a zero-crossing rate so that a VAD or EPD system that is robust against noise can be established.
Description of Drawings
[11] The above and other features and advantages of the present general inventive concept will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
[12] FIGS. 1A and 1B are block diagrams illustrating respective audio processing systems including a function of detecting voice activity, according to an embodiment of the present general inventive concept;
[13] FIG. 2A is a detailed block diagram illustrating a voice activity detector of the audio processing system of FIGS. 1A and 1B, and FIG. 2B is a detailed block diagram illustrating a voice activity detector of the audio processing system of FIGS. 1A and 1B;
[14] FIG. 3 is a block diagram illustrating a noise removal unit of the voice activity detector of FIG. 2A;
[15] FIG. 4 is a flowchart illustrating a method of detecting voice activity according to an embodiment of the present general inventive concept; and
[16] FIGS. 5A and 5B are graphs illustrating an audio signal and a zero-crossing rate for detecting voice activity according to an embodiment of the present general inventive concept.
Best Mode
[17] The foregoing and/or other aspects and utilities of the present general inventive concept may be achieved by providing a method of detecting voice activity, the method including adding a random signal having energy of a predetermined size to an audio signal, extracting predetermined voice detection parameters from the audio signal to which the random signal is added, and comparing the extracted predetermined voice detection parameters with a threshold value and determining voice and non-voice activities.
[18] The audio signal may have stationary or non-stationary noise.
[19] The random signal may have a zero-crossing rate that is larger than a standard value.
[20] The random signal may be white Gaussian noise having a normal distribution.
[21] The predetermined voice detection parameters may include frame power.
[22] The method may further include removing a noise from an input audio signal to generate a noise removed signal as the audio signal.
[23] The removing of the noise may include predicting noise properties of the audio signal, and subtracting the predicted noise properties from the audio signal to remove noise from the audio signal.
[24] The foregoing and/or other aspects and utilities of the present general inventive concept may also be achieved by providing an apparatus to detect voice activity, the apparatus including a noise removal unit which removes noise included in an audio signal, a random signal generator which generates a random noise signal having energy of a determined size, an addition unit which adds the random signal generated by the random signal generator to the audio signal from which noise is removed by the noise removal unit, a voice determination parameter extracting unit which extracts predetermined voice detection parameters from the audio signal to which the random signal is added by the addition unit, and a voice determination unit which detects voice and non-voice activities by using the voice detection parameters extracted by the voice determination parameter extracting unit.
[25] The apparatus may further include a noise removal unit which removes noise included in an input audio signal to generate the noise removed signal as the audio signal.
[26] The random signal generator may generate an energy corresponding to the non-voice activity as the random signal.
[27] The random signal generator may generate an energy varying to correspond to a characteristic of the audio signal as the random signal.
[28] The addition unit may selectively add the random signal to the audio signal according to a characteristic of the audio signal.
[29] The foregoing and/or other aspects and utilities of the present general inventive concept may also be achieved by providing an audio processing device including a voice activity detector which adds a random signal having energy of a determined size to an audio signal to extract one or more predetermined voice detection parameters and compares the extracted predetermined voice detection parameters with a threshold value to determine voice and non-voice activities, and an audio signal processing unit which performs voice coding and a voice recognizing process according to information about voice and non-voice activities detected by the voice activity detector.
[30] The foregoing and/or other aspects and utilities of the present general inventive concept may also be achieved by providing a computer readable recording medium having embodied thereon a computer program for executing a method of detecting voice activity including removing noise included in an audio signal, adding a random signal having energy of a predetermined size to the audio signal from which noise is removed, extracting predetermined voice detection parameters from the audio signal to which the random signal is added, and comparing the extracted predetermined voice detection parameters with a threshold value and determining voice and non-voice activities.
Mode for Invention
[31] Reference will now be made in detail to the embodiments of the present general inventive concept, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present general inventive concept by referring to the figures.
[32] FIGS. 1A and 1B illustrate respective audio processing systems including a function of detecting voice activity, according to an embodiment of the present general inventive concept.
[33] FIG. 1A illustrates an audio processing system when an analog audio signal is input thereto.
[34] The audio processing system of FIG. 1A includes an Analog/Digital (A/D) converter 110, a voice activity detector 120, an audio signal processing unit 130, and a Digital/Analog (D/A) converter 140.
[35] The A/D converter 110 converts an analog audio signal into a digital audio signal.
[36] The voice activity detector 120 adds a random signal having energy of a determined level to the audio signal output from the A/D converter 110, extracts one or more determined voice detection parameters, such as a zero-crossing rate of a frame or the power of a frame, from the audio signal to which the random signal is added, and compares the extracted voice detection parameters with a threshold value to determine voice and non-voice activities.
[37] Here, the random signal may have an energy corresponding to a predetermined noise level. The random signal may be a signal having a predetermined voltage, and the predetermined voltage may have amplitude in positive and/or negative directions with respect to a reference. The random signal may be a variable energy signal corresponding to an energy level of the audio signal, so that the random signal varies according to the energy level of the audio signal. The random signal may be selectively applied or added to the audio signal according to a characteristic of the audio signal, e.g., a level, amount, or amplitude of the audio signal.
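The three options in [37] (a fixed energy, an energy that follows the audio level, and selective addition) can be sketched as one function. The parameter names `fixed_power`, `power_ratio`, and `skip_above`, and all numeric values, are hypothetical; the specification gives no concrete levels.

```python
import math
import random

def frame_power(frame):
    """Mean of the squared sample values in a frame."""
    return sum(x * x for x in frame) / len(frame)

def add_random_signal(frame, rng, fixed_power=1e-6, power_ratio=None, skip_above=None):
    """Return the frame with a random signal added under one of three policies.

    fixed_power: noise power used when power_ratio is None (fixed policy).
    power_ratio: if set, noise power is power_ratio times the frame power (variable policy).
    skip_above:  if set, frames whose power exceeds it are returned unchanged (selective policy).
    """
    power = frame_power(frame)
    if skip_above is not None and power > skip_above:
        return list(frame)
    noise_power = fixed_power if power_ratio is None else power_ratio * power
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in frame]

rng = random.Random(1)
loud = [0.5, -0.5] * 80   # high-level frame
quiet = [0.0] * 160       # null frame
loud_out = add_random_signal(loud, rng, skip_above=0.01)    # left untouched
quiet_out = add_random_signal(quiet, rng, skip_above=0.01)  # receives the random signal
```

Selective addition leaves frames that already carry a strong signal unmodified, which is one plausible reading of adding the random signal "according to a characteristic of the audio signal".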
[38] The zero-crossing rate may be a rate or a ratio of changing a level of an audio signal. The zero-crossing rate changes between voice activities and non-voice activities. Because the random signal is added to the audio signal, the zero-crossing rate according to the present embodiment can show a difference between boundaries of the voice activities and corresponding non-voice activities.
[39] The audio signal processing unit 130 performs voice coding and a voice recognizing process according to information about voice and non-voice activities detected by the voice activity detector 120.
[40] The D/A converter 140 converts the audio signal processed in the audio signal processing unit 130 into an analog audio signal.
[41] FIG. 1B illustrates an audio processing system when a digital audio signal is input thereto.
[42] The audio processing system of FIG. 1B includes an audio decoder 110-1, a voice activity detector 120-1, an audio signal processing unit 130-1, and a D/A converter 140-1.
[43] The audio decoder 110-1 restores digital audio data according to a predetermined decoding algorithm.
[44] Functions of the voice activity detector 120-1, the audio signal processing unit 130-1, and the D/A converter 140-1 are respectively the same as those of the voice activity detector 120, the audio signal processing unit 130, and the D/A converter 140 of FIG. 1A.
[45] FIG. 2A is a detailed block diagram illustrating the voice activity detectors 120 and 120-1 of FIGS. 1A and 1B.
[46] The voice activity detector of FIG. 2A includes a noise removal unit 210, a random signal generator 220, an addition unit 230, a voice determination parameter extracting unit 240, and a voice determination unit 250.
[47] In order to accurately extract a zero-crossing rate, the noise removal unit 210 removes stationary noise included in an audio signal. For example, the noise removal unit 210 removes stationary noise by using a spectral subtraction filter, a Wiener filter, or another noise reduction filter.
[48] The random signal generator 220 generates a random noise signal having energy of a predetermined size (level or amount) that is not harsh to the ears. The random signal may be white Gaussian noise having a normal distribution, or may have a higher zero-crossing rate than that of a speech signal.
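A generator of the kind described in [48] might look like the following sketch. It produces zero-mean white Gaussian noise whose average power matches a chosen target. The -60 dBFS level used in the example and the function names are assumptions; the specification only requires an energy that is "not harsh to the ears".

```python
import math
import random

def dbfs_to_power(level_dbfs):
    """Convert a level in dB relative to full scale (power 1.0) into linear power."""
    return 10.0 ** (level_dbfs / 10.0)

def white_gaussian_noise(num_samples, target_power, seed=None):
    """Zero-mean Gaussian samples whose expected mean-square value is target_power."""
    rng = random.Random(seed)
    sigma = math.sqrt(target_power)  # for zero-mean noise, power equals variance
    return [rng.gauss(0.0, sigma) for _ in range(num_samples)]

# Example: one block of very quiet noise (assumed level of -60 dBFS).
noise = white_gaussian_noise(10000, dbfs_to_power(-60.0), seed=0)
measured_power = sum(x * x for x in noise) / len(noise)
```

Because adjacent samples of white noise are independent, their signs differ about half of the time, which gives the high zero-crossing rate the paragraph asks for.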
[49] The addition unit 230 adds the random signal generated by the random signal generator 220 to the audio signal from which the stationary noise is removed by the noise removal unit 210.
[50] Ultimately, when noise is removed from an audio signal, a zero-crossing rate of non-voice activity may be close to '0.' Accordingly, since random noise is added to the audio signal, identification of non-voice activity can be improved by the resulting higher zero-crossing rate.
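The effect described in [50] can be checked numerically. In this sketch a null frame and a strongly voiced frame receive the same low-level noise. The 200 Hz tone standing in for voice, the noise level, and the frame size are illustrative assumptions.

```python
import math
import random

def zero_crossing_rate(frame):
    """Number of sign changes between adjacent samples, divided by the frame length."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / len(frame)

def add_noise(frame, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in frame]

rng = random.Random(0)
silence = [0.0] * 160  # denoised non-voice frame
voiced = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]

zcr_silence_before = zero_crossing_rate(silence)
zcr_silence_after = zero_crossing_rate(add_noise(silence, 0.001, rng))
zcr_voiced_after = zero_crossing_rate(add_noise(voiced, 0.001, rng))
print(zcr_silence_before, zcr_silence_after, zcr_voiced_after)
```

The null frame moves from a zero-crossing rate of zero to roughly one half, while the voiced frame stays low because the added noise is far smaller than the signal, so a single threshold between the two values separates them.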
[51] The voice determination parameter extracting unit 240 extracts one or more predetermined voice detection parameters from the audio signal to which the random signal is added by the addition unit 230.
[52] The predetermined voice detection parameters may be a zero-crossing rate (ZCR), frame power, and a Line Spectrum Frequency (LSF). The zero-crossing rate refers to how often the sign changes between samples in a frame, and the LSF refers to frequency properties of signals.
[53] The voice determination unit 250 detects voice and non-voice activities using voice detection parameters, such as the ZCR and the LSF, extracted by the voice determination parameter extracting unit 240.
[54] For example, when the ZCR is less than a threshold value, the voice determination unit 250 determines a frame as voice activity, and when the ZCR is greater than the threshold value, the voice determination unit 250 determines a frame as non-voice activity.
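The decision rule in [54] reduces to a per-frame comparison. The sketch below also derives a start and end frame from the labels, in the spirit of the End Point Detection mentioned in the background. Treating a ZCR equal to the threshold as non-voice, and the helper names, are assumptions, since the specification does not cover those details.

```python
def classify_frames(zcr_values, threshold):
    """Label each frame 'voice' when its ZCR is below the threshold, else 'non-voice'."""
    # Assumption: a ZCR exactly equal to the threshold counts as non-voice.
    return ["voice" if z < threshold else "non-voice" for z in zcr_values]

def end_points(labels):
    """Return (first, last) indices of frames labeled 'voice', or None if there are none."""
    voiced = [i for i, label in enumerate(labels) if label == "voice"]
    if not voiced:
        return None
    return voiced[0], voiced[-1]

labels = classify_frames([0.50, 0.05, 0.04, 0.48], threshold=0.25)
span = end_points(labels)
print(labels, span)
```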
[55] FIG. 2B is a detailed block diagram illustrating the voice activity detectors 120 and 120-1 of FIGS. 1A and 1B.
[56] The voice activity detector of FIG. 2B includes a random signal generator 220-1, an addition unit 230-1, a voice determination parameter extracting unit 240-1, and a voice determination unit 250-1.
[57] The addition unit 230-1 adds the random signal generated by the random signal generator 220-1 to the audio signal.
[58] Functions of the random signal generator 220-1, the addition unit 230-1, the voice determination parameter extracting unit 240-1, and the voice determination unit 250-1 are respectively the same as those of the random signal generator 220, the addition unit 230, the voice determination parameter extracting unit 240, and the voice determination unit 250.
[59] FIG. 3 is a block diagram illustrating the noise removal unit 210 of FIG. 2A.
[60] The noise removal unit 210 includes a noise prediction unit 310 and a noise removal filter unit 320.
[61] The noise prediction unit 310 predicts noise properties from an input audio signal. As an example of predicting noise, the input frame power is first compared with a determined threshold value. When the input frame power is less than the determined threshold value, the input frame is predicted as noise and a property value (for example, a spectrum) of the input frame is predicted as a noise property.
[62] The noise removal filter unit 320 subtracts the noise property value predicted by the noise prediction unit 310 from the audio signal so as to remove noise from the input audio signal.
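The noise prediction and subtraction steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses magnitude spectral subtraction with the most recent low-power frame as the noise estimate, a naive DFT to stay self-contained, and the hypothetical names `remove_noise` and `power_threshold`, none of which are specified by the source.

```python
import cmath
import math

def dft(frame):
    """Naive O(n^2) discrete Fourier transform; adequate for short example frames."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def frame_power(frame):
    """Mean of the squared sample values in the frame."""
    return sum(s * s for s in frame) / len(frame)

def remove_noise(frames, power_threshold):
    """Hypothetical helper: predict noise from low-power frames and subtract it."""
    noise_mag = None  # latest predicted noise magnitude spectrum
    cleaned = []
    for frame in frames:
        spectrum = dft(frame)
        if frame_power(frame) < power_threshold:
            # Low-power frame: predict it as noise and keep its spectrum as the noise property.
            noise_mag = [abs(c) for c in spectrum]
        if noise_mag is None:
            cleaned.append(list(frame))  # no noise estimate yet, pass the frame through
            continue
        out = []
        for c, nm in zip(spectrum, noise_mag):
            mag = max(abs(c) - nm, 0.0)                  # subtract noise magnitude, floor at zero
            out.append(cmath.rect(mag, cmath.phase(c)))  # keep the original phase
        cleaned.append(idft(out))
    return cleaned
```

A frame that is itself predicted as noise is reduced to silence by this sketch, which is the situation in which the zero-crossing rate of non-voice activity may fall close to 0.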
[63] FIG. 4 is a flowchart illustrating a method of detecting voice activity according to an embodiment of the present general inventive concept.
[64] Referring to FIG. 4, one or more audio signals are input in units of frames.
[65] Here, the level of noise is generally different in each input audio signal.
[66] Accordingly, regardless of the level of noise, stationary noise included in the audio signals is removed in order to perform regular voice activity identification, in operation 410.
[67] For example, stationary noise included in the audio signals is removed using a Wiener filter or a spectral subtraction filter.
[68] Then, a random noise signal having energy with a determined size that is not harsh to the ears is added to the audio signals from which stationary noise is removed, in operation 420. In addition, the random noise signal has a zero-crossing rate that is larger than a standard value, in order to improve identification (detection) of voice/non-voice activities.
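A minimal sketch of operation 420, assuming the random signal is white noise drawn uniformly from a small amplitude range. The function name `add_random_signal`, the default amplitude, and the uniform distribution are illustrative assumptions; the source only requires a small, fixed energy and a high zero-crossing rate.

```python
import random

def add_random_signal(frame, amplitude=1e-3, seed=None):
    """Add low-level uniform white noise to every sample of a frame.

    Uniform noise in [-amplitude, amplitude] has average power amplitude**2 / 3,
    and independent zero-mean samples change sign about half the time,
    so the added signal has a high zero-crossing rate of its own.
    """
    rng = random.Random(seed)  # seed is only for reproducible examples
    return [s + rng.uniform(-amplitude, amplitude) for s in frame]
```

The amplitude would be chosen small relative to speech so that the added signal stays inaudible while still dominating frames that are silent after noise removal.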
[69] Voice detection parameters, such as the zero-crossing rate of a frame or the power of a frame, are then extracted from the audio signals to which the random signal is added, in operation 430. For example, the zero-crossing rate of a frame is obtained by dividing the number of sign changes between consecutive samples in the frame by the number of samples. The frame power is obtained by dividing the sum of the squared magnitudes of the samples in the frame by the number of samples.
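The two parameter definitions in operation 430 can be written directly as code. This sketch assumes a non-empty frame given as a list of numbers and treats a sample equal to zero as non-negative, a convention the source does not specify; the function names are hypothetical.

```python
def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples, divided by the sample count."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / len(frame)

def frame_power(frame):
    """Sum of squared sample values, divided by the sample count."""
    return sum(s * s for s in frame) / len(frame)

# A frame that alternates in sign crosses zero between every pair of samples.
print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0]))  # 0.75
print(frame_power([1.0, -1.0, 1.0, -1.0]))         # 1.0
```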
[70] Then, the extracted voice detection parameters are compared with a predetermined threshold value in operation 450.
[71] Here, when the voice detection parameters are less than the predetermined threshold value, a current frame is determined as voice activity in operation 460. When the voice detection parameters are greater than the predetermined threshold value, a current frame is determined as non-voice activity in operation 470.
[72] For example, when the zero-crossing rate of a frame is less than the predetermined threshold value, a current frame is determined as voice activity and when the zero-crossing rate of a frame is greater than the predetermined threshold value, a current frame is determined as non-voice activity.
[73] Also, when the frame power is greater than the predetermined threshold, a current frame is determined as voice activity and when the frame power is less than the predetermined threshold, a current frame is determined as non-voice activity.
[74] Accordingly, voice and non-voice activities are determined according to the comparison between the voice detection parameters and the predetermined threshold value and thus detection of voice activity of one frame is completed.
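The threshold comparisons of operations 450 to 470 might be sketched as below. The source states the rule for each parameter separately and does not say how they are combined, so requiring both conditions is an assumption of this example, as are the function name and the default threshold values.

```python
def is_voice_frame(zcr, power, zcr_threshold=0.25, power_threshold=1e-4):
    """Return True for voice activity, False for non-voice activity.

    Per the text: a zero-crossing rate below its threshold indicates voice,
    and a frame power above its threshold indicates voice.
    Combining the two with a logical AND is an assumption of this sketch.
    """
    low_zcr = zcr < zcr_threshold
    high_power = power > power_threshold
    return low_zcr and high_power
```

For instance, a frame with a zero-crossing rate of 0.05 and a power of 0.1 would be marked as voice, while a frame with a zero-crossing rate near 0.5 would be marked as non-voice regardless of its power.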
[75] FIGS. 5A and 5B are graphs illustrating an audio signal and a zero-crossing rate for detecting voice activity according to an embodiment of the present invention.
[76] FIG. 5A illustrates a graph (a) of plots of a general audio signal and a graph (b) of a zero-crossing rate of the audio signal. In the graph (a), the x-axis indicates time and the y-axis indicates amplitude. In the graph (b), the x-axis indicates the frame index and the y-axis indicates the zero-crossing rate.
[77] Referring to FIG. 5A, in general, the zero-crossing rate is lower in voice activity because of a strong low-frequency signal component. In non-voice activities 510 and 520, the zero-crossing rate is higher due to unknown signal components, for example, background noise. However, under abnormal circumstances, such as complete silence or direct current components in a microphone, the zero-crossing rate of non-voice activity may also be low. Accordingly, in plots of a general audio signal, non-voice activity cannot always be identified.
[78] FIG. 5B illustrates a graph (a) of plots of an audio signal to which a random signal having a small amount of energy is added and a graph (b) of a zero-crossing rate of the audio signal. In the graph (a), the x-axis indicates time and the y-axis indicates amplitude. In the graph (b), the x-axis indicates the frame index and the y-axis indicates the zero-crossing rate.
[79] Referring to FIG. 5B, when the random signal having a small amount of energy is added to the audio signal according to the present embodiment, a high zero-crossing rate appears in non-voice activities 530 and 540. Accordingly, a frame whose zero-crossing rate is greater than a threshold value is determined as non-voice activity, and a frame whose zero-crossing rate is less than the threshold value is determined as voice activity.
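The contrast between FIG. 5A and FIG. 5B can be reproduced with a small experiment. The sketch below is illustrative only: the frame length, the 8 kHz sampling rate, the 100 Hz sine used as a stand-in for voiced speech, and the noise amplitude are all assumed values, and zero-valued samples are treated as non-negative.

```python
import math
import random

def zero_crossing_rate(frame):
    """Sign changes between consecutive samples, divided by the sample count."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / len(frame)

def add_random_signal(frame, amplitude, rng):
    """Add low-level uniform white noise to each sample."""
    return [s + rng.uniform(-amplitude, amplitude) for s in frame]

rng = random.Random(0)
frame_len = 160  # assumed: 20 ms at 8 kHz
silence = [0.0] * frame_len
voiced = [0.5 * math.sin(2 * math.pi * 100 * t / 8000) for t in range(frame_len)]

# Without the random signal, complete silence never changes sign.
zcr_silence_plain = zero_crossing_rate(silence)
# With the random signal, silence shows a high zero-crossing rate,
# while the low-frequency voiced frame keeps a low one.
zcr_silence_dithered = zero_crossing_rate(add_random_signal(silence, 1e-3, rng))
zcr_voiced_dithered = zero_crossing_rate(add_random_signal(voiced, 1e-3, rng))

print(zcr_silence_plain, zcr_silence_dithered, zcr_voiced_dithered)
```

With a single zero-crossing-rate threshold between the two dithered values, the silent frame is classified as non-voice activity, whereas without the random signal its rate of zero would fall on the voice side of the threshold.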
[80] Ultimately, voice and non-voice activities can be easily identified in Voice Activity Detection (VAD) or End Point Detection (EPD) by using the zero-crossing rate of the audio signal to which the random signal is added.
Industrial Applicability
[81] The invention can also be embodied as computer readable codes on a computer readable recording medium. The computer readable recording medium is any data storage device that can store programs or data which can be thereafter read by a computer system. Examples of the computer readable recording medium include read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, hard disks, floppy disks, flash memory, optical data storage devices, and carrier waves (such as data transmission through the Internet). The computer readable recording medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
[82] While the present general inventive concept has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present general inventive concept as defined by the following claims.

Claims
[1] 1. A method of detecting voice activity, the method comprising: adding a random signal having energy of a predetermined size to an audio signal; extracting one or more predetermined voice detection parameters from the audio signal to which the random signal is added; and comparing the extracted predetermined voice detection parameters with a threshold value and determining voice and non-voice activities of the audio signal.
[2] 2. The method of claim 1, wherein the audio signal has stationary noise or non-stationary noise.
[3] 3. The method of claim 1, wherein the random signal has a zero-crossing rate that is larger than the standard value.
[4] 4. The method of claim 1, wherein the predetermined voice detection parameters comprise a zero-crossing rate of a frame.
[5] 5. The method of claim 1, wherein the predetermined voice detection parameters comprise frame power.
[6] 6. The method of claim 1, further comprising: removing a noise from an input audio signal to generate a noise removed signal as the audio signal.
[7] 7. The method of claim 6, wherein the removing of the noise comprises: predicting noise properties of the audio signal; and subtracting the predicted noise properties from the audio signal and removing noise from the audio signal.
[8] 8. The method of claim 6, wherein the noise corresponds to the voice activity of the audio signal.
[9] 9. An apparatus to detect voice activity, comprising: a random signal generator which generates a random noise signal having energy of a determined size; an addition unit which adds the random signal generated by random signal generator to the audio signal; a voice determination parameter extracting unit which extracts predetermined voice detection parameters from the audio signal to which the random signal is added by the addition unit; and a voice determination unit which detects voice and non-voice activities by using the voice detection parameters extracted by the voice determination parameter extracting unit.
[10] 10. The apparatus of claim 9, wherein the noise removal unit comprises: a noise prediction unit which compares power of an audio frame with a predetermined threshold value and predicts noise properties of the audio signal; and a noise removal filter unit which subtracts noise properties predicted by the noise prediction unit from the audio signal and removes noise from the audio signal.
[11] 11. The apparatus of claim 9, further comprising: a noise removal unit which removes noise included in an input audio signal to generate the noise removed signal as the audio signal.
[12] 12. The apparatus of claim 9, wherein the random signal generator generates an energy corresponding to the non-voice activity as the random signal.
[13] 13. The apparatus of claim 9, wherein the random signal generator generates an energy varying to correspond to a characteristic of the audio signal as the random signal.
[14] 14. The apparatus of claim 9, wherein the adding unit selectively adds the random signal to the audio signal according to a character of the audio signal.
[15] 15. An audio processing device comprising: a voice activity detector which adds a random signal having energy of a determined size to an audio signal to extract one or more predetermined voice detection parameters and compares the extracted predetermined voice detection parameters with a threshold value to determine voice and non-voice activities; and an audio signal processing unit which performs voice coding and a voice recognizing process according to information about voice and non-voice activities detected by the voice activity detector.
[16] 16. A computer readable recording medium having embodied thereon a computer program for executing a method of detecting voice activity comprising: adding a random signal having energy of a predetermined size to an audio signal; extracting predetermined voice detection parameters from the audio signal to which the random signal is added; and comparing the extracted predetermined voice detection parameters with a threshold value and determining voice and non-voice activities.
[17] 17. The computer readable recording medium of claim 16, wherein the method further comprises removing noise included in an input audio signal to generate the noise removed signal as the audio signal.
PCT/KR2008/003231 2007-11-13 2008-06-11 Method and apparatus to detect voice activity WO2009064054A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020070115501A KR101444099B1 (en) 2007-11-13 2007-11-13 Method and apparatus for detecting voice activity
KR10-2007-0115501 2007-11-13

Publications (1)

Publication Number Publication Date
WO2009064054A1 true WO2009064054A1 (en) 2009-05-22

Family

ID=40624587

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2008/003231 WO2009064054A1 (en) 2007-11-13 2008-06-11 Method and apparatus to detect voice activity

Country Status (3)

Country Link
US (1) US8046215B2 (en)
KR (1) KR101444099B1 (en)
WO (1) WO2009064054A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7807971B2 (en) 2008-11-19 2010-10-05 The Boeing Company Measurement of moisture in composite materials with near-IR and mid-IR spectroscopy

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120091068A (en) * 2009-10-19 2012-08-17 텔레폰악티에볼라겟엘엠에릭슨(펍) Detector and method for voice activity detection
HUE053127T2 (en) * 2010-12-24 2021-06-28 Huawei Tech Co Ltd Method and apparatus for adaptively detecting a voice activity in an input audio signal
US8700406B2 (en) * 2011-05-23 2014-04-15 Qualcomm Incorporated Preserving audio data collection privacy in mobile devices
CN107978325B (en) * 2012-03-23 2022-01-11 杜比实验室特许公司 Voice communication method and apparatus, method and apparatus for operating jitter buffer
US9305317B2 (en) 2013-10-24 2016-04-05 Tourmaline Labs, Inc. Systems and methods for collecting and transmitting telematics data from a mobile device
US9467569B2 (en) * 2015-03-05 2016-10-11 Raytheon Company Methods and apparatus for reducing audio conference noise using voice quality measures
US20170365249A1 (en) * 2016-06-21 2017-12-21 Apple Inc. System and method of performing automatic speech recognition using end-pointing markers generated using accelerometer-based voice activity detector
KR20180082033A (en) * 2017-01-09 2018-07-18 삼성전자주식회사 Electronic device for recogniting speech
CN108831508A (en) * 2018-06-13 2018-11-16 百度在线网络技术(北京)有限公司 Voice activity detection method, device and equipment
US11170760B2 (en) * 2019-06-21 2021-11-09 Robert Bosch Gmbh Detecting speech activity in real-time in audio signal
CN111951834A (en) * 2020-08-18 2020-11-17 珠海声原智能科技有限公司 Method and device for detecting voice existence based on ultralow computational power of zero crossing rate calculation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6349278B1 (en) * 1999-08-04 2002-02-19 Ericsson Inc. Soft decision signal estimation
KR20040000004A (en) * 2002-06-19 2004-01-03 엘지전자 주식회사 Apparatus of inspection for back light unit

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07113840B2 (en) * 1989-06-29 1995-12-06 三菱電機株式会社 Voice detector
JP2609752B2 (en) * 1990-10-09 1997-05-14 三菱電機株式会社 Voice / in-band data identification device
US5991718A (en) * 1998-02-27 1999-11-23 At&T Corp. System and method for noise threshold adaptation for voice activity detection in nonstationary noise environments
US6453285B1 (en) * 1998-08-21 2002-09-17 Polycom, Inc. Speech activity detector for use in noise reduction system, and methods therefor
US7423983B1 (en) * 1999-09-20 2008-09-09 Broadcom Corporation Voice and data exchange over a packet based network
US6560332B1 (en) * 1999-05-18 2003-05-06 Telefonaktiebolaget Lm Ericsson (Publ) Methods and apparatus for improving echo suppression in bi-directional communications systems
DE19935808A1 (en) * 1999-07-29 2001-02-08 Ericsson Telefon Ab L M Echo suppression device for suppressing echoes in a transmitter / receiver unit
KR200173377Y1 (en) 1999-09-28 2000-03-15 박정환 A sticker wallpaper for switch cover
KR100345402B1 (en) * 1999-11-12 2002-07-26 한국전자통신연구원 An apparatus and method for real - time speech detection using pitch information
KR100312335B1 (en) 2000-01-14 2001-11-03 대표이사 서승모 A new decision criteria of SID frame of Comfort Noise Generator of voice coder
US20030179888A1 (en) * 2002-03-05 2003-09-25 Burnett Gregory C. Voice activity detection (VAD) devices and methods for use with noise suppression systems
US6691085B1 (en) * 2000-10-18 2004-02-10 Nokia Mobile Phones Ltd. Method and system for estimating artificial high band signal in speech codec using voice activity information
US20020054685A1 (en) * 2000-11-09 2002-05-09 Carlos Avendano System for suppressing acoustic echoes and interferences in multi-channel audio systems
US6993481B2 (en) * 2000-12-04 2006-01-31 Global Ip Sound Ab Detection of speech activity using feature model adaptation
KR20020095502A (en) * 2001-06-14 2002-12-27 엘지전자 주식회사 Method for detecting end point of noise surroundings
US7330812B2 (en) * 2002-10-04 2008-02-12 National Research Council Of Canada Method and apparatus for transmitting an audio stream having additional payload in a hidden sub-channel
KR100463657B1 (en) * 2002-11-30 2004-12-29 삼성전자주식회사 Apparatus and method of voice region detection
ATE373302T1 (en) * 2004-05-14 2007-09-15 Loquendo Spa NOISE REDUCTION FOR AUTOMATIC SPEECH RECOGNITION
US7917356B2 (en) * 2004-09-16 2011-03-29 At&T Corporation Operating method for voice activity detection/silence suppression system
US7447279B2 (en) * 2005-01-31 2008-11-04 Freescale Semiconductor, Inc. Method and system for indicating zero-crossings of a signal in the presence of noise
KR100956876B1 (en) * 2005-04-01 2010-05-11 콸콤 인코포레이티드 Systems, methods, and apparatus for highband excitation generation
DK1760696T3 (en) * 2005-09-03 2016-05-02 Gn Resound As Method and apparatus for improved estimation of non-stationary noise to highlight speech
KR101334366B1 (en) * 2006-12-28 2013-11-29 삼성전자주식회사 Method and apparatus for varying audio playback speed
KR101437830B1 (en) * 2007-11-13 2014-11-03 삼성전자주식회사 Method and apparatus for detecting voice activity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AHMAD ET AL.: "An isolated speech endpoint detector using multiple speech features", TENCON 2004, vol. 2, 21 November 2004 (2004-11-21), pages 403 - 406 *
QIANG ET AL.: "On Prefiltering and Endpoint Detection of Speech Signal", PROCEEDINGS OF ICSP 1998, vol. 1, 12 October 1998 (1998-10-12), pages 749 - 752 *

Also Published As

Publication number Publication date
KR20090049298A (en) 2009-05-18
US20090125304A1 (en) 2009-05-14
KR101444099B1 (en) 2014-09-26
US8046215B2 (en) 2011-10-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08766193

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08766193

Country of ref document: EP

Kind code of ref document: A1