ABSTRACT
A method for producing vowel sounds in a waveform generator using non-stationary additive noise (NSAN) can include computing a frequency spectrum for a selected group of pitch pulses in a recorded sample of a spoken vowel; identifying a set of formant values in the computed frequency spectrum and creating an all-pole filter for the set of identified formant values; populating a zero-padded matrix with the selected group of pitch pulses and applying the all-pole filter to the matrix, the application of the filter producing a set of NSAN vectors; synthesizing a vowel sound in the waveform generator, the synthesis producing a further group of pitch pulses; and, adding the NSAN vectors to the further group of pitch pulses.
DETAILED DESCRIPTION OF THE INVENTION
The present invention is a method and apparatus for generating non-stationary additive noise (NSAN) for addition to synthesized speech produced in a speech synthesizer. Notably, the speech synthesizer can be included as part of a TTS engine for converting computer-readable text to synthesized speech. The method of the invention can produce NSAN from recorded speech and, subsequently, can apply the NSAN to vowel sounds produced in the speech synthesizer. In consequence, the application of the NSAN to the vowel sounds can mask the buzzing quality typically associated with the conventional speech synthesis of vowel sounds. Thus, synthesized speech produced using the inventive method can have a perceived naturalness not typically associated with synthesized speech containing conventionally produced vowel sounds.
FIG. 1 illustrates a TTS engine suitable for use in the present invention. As shown in FIG. 1, the TTS engine can include a text processor and a speech processor. The text processor can parse input text into a set of linguistic units, for instance phonemes. The speech processor can receive the phonemes and can generate the synthesized speech waveform. Notably, the synthesized speech waveform can be in the form of a digital waveform suitable for use by audio circuitry, for example a sound card. Still, the invention is not limited in this regard and the synthesized speech waveform also can be a digital representation of synthesized speech suitable for further processing by a TTS-aware application.
The text processor can include a pre-processing module, a normalization module, a root analysis module, a spelling-to-sound module, and a prosody module. In the pre-processing module, the text input can be scanned for pre-defined strings, annotations and phonetic spellings. In particular, during pre-processing, user dictionaries can be consulted, in consequence of which suitable replacements can be substituted for the pre-defined strings, annotations and phonetic spellings in the text input. Subsequently, in the normalization module, each character string not identified as an annotation or phonetic spelling can be converted into a word or series of words, spelled with letters of a selected alphabet, for example the English alphabet. For instance, during normalization, the text string “32” can be converted to “thirty-two” and the text string “=” can be converted to “equals”.
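The normalization example above can be illustrated with a minimal Python sketch. The lookup tables and the function names `number_to_words` and `normalize` are hypothetical and are not part of the described TTS engine; the sketch handles only whitespace-separated tokens, the integers 0 through 99, and a handful of symbols.

```python
# Hypothetical lookup tables for this sketch; a real normalizer covers far more.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SYMBOLS = {"=": "equals", "+": "plus", "%": "percent"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only spells out 0 through 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Replace one- and two-digit numbers and known symbols with words."""
    words = []
    for token in text.split():
        if token.isascii() and token.isdigit() and len(token) <= 2:
            words.append(number_to_words(int(token)))
        else:
            # Unknown tokens pass through unchanged.
            words.append(SYMBOLS.get(token, token))
    return " ".join(words)


print(normalize("32"))      # thirty-two
print(normalize("x = 32"))  # x equals thirty-two
```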
The root analysis module can analyze each word in the pre-processed and normalized text input and can characterize each word in terms of roots and affixes. In particular, a roots dictionary can be consulted to retrieve any user-specified pronunciations of roots. In the spelling-to-sound module, the spelled words can be converted into a phonetic representation of the speech (phonemes) using pre-defined spelling-to-sound rules. Finally, the prosody module can include prosody rules which can determine appropriate timing and melody for the speech converted text. Upon completion of prosody processing, an abstract linguistic representation of the speech can be provided to the speech processor in which the abstract linguistic representation can be converted into actual acoustic values.
The speech processor can include three components: an acoustic processor, a voice processor, and a waveform generator. The acoustic processor can generate acoustic values for the abstract linguistic representation. The acoustic values can be used to produce the phonemes and prosodic patterns specified by the text processor. Subsequently, the voice processor can supplement the acoustic values with voice characteristics. Finally, the waveform generator can produce the synthesized speech waveform which can be transmitted to a TTS-aware application or directly to audio circuitry, for example a sound card. Notably, in one aspect of the present invention, the waveform generator can be a Klatt type synthesizer as described in D. H. Klatt, 53 J. Acoust. Soc. Am. at 8-16 (1980), incorporated herein by reference.
Significantly, vowel sounds produced by the TTS Engine, in the absence of the present invention, can have a buzzy quality as perceived by a listener. Hence, to mask the buzzy quality of speech synthesized vowels and to produce a perceived naturalness of speech synthesized vowel sounds, NSAN can be generated and applied to speech synthesized vowels produced by the waveform generator in the speech processor of the TTS Engine. Specifically, FIG. 2 is a diagram of a process for generating NSAN for addition to synthesized vowels in the TTS Engine.
As shown in FIG. 2, the process can include a recording step in which a spoken vowel can be recorded. The spoken vowel can be recorded while in a steady state, producing a recorded sample. Specifically, the spoken vowel can be recorded when the fundamental frequency of the spoken vowel is not changing (the fundamental frequency, that is, the pitch of a sound, can be estimated by observing the rate of occurrence of the peaks in a waveform). Additionally, the spoken vowel can be recorded when the vowel value also is not changing. In consequence, the recorded sample can contain an optimal specification of corresponding formant values and spoken vowel bandwidth. In particular, if the spoken vowel drifts in fundamental frequency or vowel value during recording, the formant values derived therefrom can be inaccurate.
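The peak-rate estimate of the fundamental frequency mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_f0_from_peaks` and the `min_height_ratio` threshold are hypothetical, and the threshold is a simple device for ignoring the smaller secondary peaks that a real vowel waveform shows within each pitch period.

```python
import numpy as np


def estimate_f0_from_peaks(signal, sample_rate, min_height_ratio=0.5):
    """Estimate pitch (Hz) from the average spacing of prominent peaks."""
    x = np.asarray(signal, dtype=float)
    threshold = min_height_ratio * np.max(x)
    # A sample is a peak if it exceeds its left neighbour, is not below its
    # right neighbour, and rises above the (assumed) height threshold.
    is_peak = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]) & (x[1:-1] > threshold)
    peak_idx = np.flatnonzero(is_peak) + 1
    if len(peak_idx) < 2:
        raise ValueError("need at least two peaks to estimate a period")
    mean_period_s = np.mean(np.diff(peak_idx)) / sample_rate
    return 1.0 / mean_period_s


# Usage: a synthetic 100 Hz tone stands in for a steady-state vowel.
fs = 8000
t = np.arange(800) / fs
f0 = estimate_f0_from_peaks(np.sin(2 * np.pi * 100 * t), fs)
```

A steady-state check could compare this estimate across successive windows of the recording and reject takes in which it drifts, although the source does not specify how drift is detected.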
In step , a center section of the recorded sample can be selected. More particularly, a section of the recorded sample can be selected which can include a set of pitch pulses suitable for identifying the vowel. In one aspect of the invention, twenty (20) pitch pulses can be selected in a steady state portion of the recorded sample. In some cases, the steady state portion of the recorded sample can appear near the center of the recorded sample. Still, the invention is limited neither in regard to the particular number of pitch pulses selected nor in regard to the location of the pitch pulses. Rather, only a set of pitch pulses selected from a steady state portion of the recorded sample is necessary in the present invention.
To determine the phonetic properties of the selected portion of the recorded sample, the selected portion can be decomposed from a complex waveform into individual waveforms comprising the complex waveform. This spectrographic analysis can reveal that the vowel has certain frequency bands with markedly high amplitudes or energy. These bands of high energy frequencies that occur in vowels are frequently referred to as formants. As is well known in the art, formants correspond to certain resonances of the vocal tract.
Hence, in step , a linear predictive coding (LPC) vocoder can compute an LPC spectrum for the selected portion of the recorded sample. Similar to conventional formant vocoders, using an LPC vocoder, predictor coefficients representing pitch, loudness and vocal tract shape can be extracted from the selected portion of the recorded sample.
By processing the selected portion of the recorded sample in the LPC vocoder, an LPC frequency spectrum can be produced. As is well known in the art, most of the information in a speech signal is contained in the first three formants. That is, a particular vowel can be identified by the first three formants. Accordingly, in step , the first three formant values (frequencies) can be selected in the LPC frequency spectrum. Notably, false formants are possible which can be caused by diplophonia. As such, in step , the selected formant values can be verified against standard formant values for the recorded vowel.
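As a rough illustration of the LPC analysis and formant selection described above, the following sketch estimates predictor coefficients with the autocorrelation method and then reads formant candidates from the roots of the prediction polynomial. Root solving is a substitute chosen here for brevity; the text instead selects formants in the LPC frequency spectrum. The function names, the optional Hamming window, and the bandwidth formula are assumptions of this sketch.

```python
import numpy as np


def lpc_coefficients(frame, order, window=True):
    """Autocorrelation-method LPC; returns [1, a1, ..., ap] of A(z)."""
    x = np.asarray(frame, dtype=float)
    if window:
        x = x * np.hamming(len(x))
    full = np.correlate(x, x, mode="full")
    r = full[len(x) - 1:len(x) + order]           # lags 0..order
    # Solve the Toeplitz normal equations R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], a))


def formants_from_lpc(a, sample_rate, count=3):
    """Return up to `count` (frequency_hz, bandwidth_hz) pairs, lowest first."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]              # one root per conjugate pair
    freqs = np.angle(roots) * sample_rate / (2 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * sample_rate / np.pi
    order = np.argsort(freqs)
    return [(float(freqs[i]), float(bandwidths[i])) for i in order[:count]]
```

A verification step against standard formant values, as the text describes, could then compare each returned frequency with a reference table for the recorded vowel; no such table is given in the source.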
Turning our attention to step , the recorded sample can be low-pass filtered using a cut-off frequency below the frequency of the selected first formant and above the fundamental frequency. In consequence, a filtered sample can be produced. Significantly, the low-pass filter can filter the recorded sample both forwards and backwards in order to eliminate a shift in the timing of the filtered sample. Additionally, by filtering the recorded sample both forwards and backwards, the time alignment can be preserved between the recorded sample and the filtered sample.
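The forwards-and-backwards filtering described above can be sketched with a simple one-pole low-pass filter, which is an assumption of this example since the text does not name a filter design. Running the filter once forwards and once over the time-reversed result cancels the phase delay, so peaks stay aligned with the unfiltered recording. The helper `pick_cutoff`, which places the cut-off at the geometric mean of the fundamental and the first formant, is likewise hypothetical. SciPy's `scipy.signal.filtfilt` offers the same forward-backward pattern for general filters.

```python
import numpy as np


def one_pole_lowpass(x, cutoff_hz, sample_rate):
    """Causal one-pole low-pass: y[n] = (1 - alpha) * x[n] + alpha * y[n-1]."""
    alpha = np.exp(-2 * np.pi * cutoff_hz / sample_rate)
    y = np.empty(len(x))
    prev = 0.0
    for n, value in enumerate(x):
        prev = (1 - alpha) * value + alpha * prev
        y[n] = prev
    return y


def zero_phase_lowpass(x, cutoff_hz, sample_rate):
    """Filter forwards, then backwards, so the net phase shift is zero."""
    x = np.asarray(x, dtype=float)
    forward = one_pole_lowpass(x, cutoff_hz, sample_rate)
    backward = one_pole_lowpass(forward[::-1], cutoff_hz, sample_rate)
    return backward[::-1]


def pick_cutoff(f0_hz, first_formant_hz):
    """Hypothetical rule: geometric mean, which lies between f0 and F1."""
    if not 0 < f0_hz < first_formant_hz:
        raise ValueError("expected 0 < f0 < first formant")
    return float(np.sqrt(f0_hz * first_formant_hz))
```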
In step , a section of the filtered sample can be selected. Specifically, a center section of the filtered sample which corresponds to the center section of the selected portion of the recorded sample can be selected. Thus, where twenty pitch pulses have been selected in step , in step , a corresponding twenty pitch pulses can be selected in the filtered sample. In step , each individual pitch pulse in the selected portion of the filtered sample can be copied into a cell of a zero-padded matrix of filtered pitch pulses. In particular, each pitch pulse can be identified by a leading and trailing zero crossing, which, if the cut-off frequency of the low-pass filter has been set to a low enough value, should be unambiguous. Notably, the pitch pulses need not be truncated to a uniform length.
Correspondingly, in step , each individual pitch pulse in the selected portion of the recorded sample can be copied into a cell of a second zero-padded matrix of unfiltered pitch pulses. Specifically, each unfiltered pitch pulse can correspond to the same interval as the corresponding filtered pitch pulse. Hence, there can be a one-to-one correspondence of filtered and unfiltered pitch pulses. Each pitch pulse pair can share the same number of sample points, albeit the number of sample points can vary from pair to pair.
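The two zero-padded matrices can be sketched as below. The sketch assumes that the low-pass filtered signal has exactly one rising zero crossing per pitch period, and it takes pulses from the start of the arrays passed in rather than locating a center section; both choices, and the function names, are assumptions of this example. Pulse boundaries found in the filtered signal are reused for the unfiltered signal, which gives the one-to-one correspondence described above.

```python
import numpy as np


def rising_zero_crossings(filtered):
    """Indices where the signal goes from negative to non-negative."""
    s = np.asarray(filtered, dtype=float)
    return np.flatnonzero((s[:-1] < 0) & (s[1:] >= 0)) + 1


def pulse_matrices(filtered, unfiltered, max_pulses=20):
    """Return (filtered_matrix, unfiltered_matrix, lengths), one pulse per row."""
    filtered = np.asarray(filtered, dtype=float)
    unfiltered = np.asarray(unfiltered, dtype=float)
    bounds = rising_zero_crossings(filtered)[:max_pulses + 1]
    if len(bounds) < 2:
        raise ValueError("need at least two zero crossings to delimit a pulse")
    lengths = np.diff(bounds)
    width = int(lengths.max())
    filt_m = np.zeros((len(lengths), width))
    unf_m = np.zeros((len(lengths), width))
    for row, (start, length) in enumerate(zip(bounds[:-1], lengths)):
        # Pulses keep their own length; the remainder of each row stays zero.
        filt_m[row, :length] = filtered[start:start + length]
        unf_m[row, :length] = unfiltered[start:start + length]
    return filt_m, unf_m, lengths
```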
Turning now to step , an all-zero filter can be derived from an all-pole filter created using the formant values (frequencies) selected in step . Notably, all-pole digital filters focus on spectral maxima of a signal. Accordingly, all-pole digital filters can be particularly sensitive to formants in a vowel sound. The predictor coefficients of step can be used to control the all-zero digital filter in such a way as to replicate the formants and other frequency variations in the recorded sample. Methods for creating an all-pole filter are well-known in the art and are described in detail in Klatt. Moreover, methods for deriving an all-zero filter therefrom also are well-known in the art and are described in Klatt.
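One common way to relate the two filters is sketched below, assuming each formant is modeled as a two-pole digital resonator. The denominator polynomial A(z) of the all-pole filter 1/A(z) is built from formant frequencies and bandwidths, and the all-zero filter is simply A(z) applied as an FIR filter. The bandwidth inputs and all function names are assumptions of this sketch; the text specifies only that formant values are used.

```python
import numpy as np


def formant_polynomial(formants_hz, bandwidths_hz, sample_rate):
    """Coefficients [1, a1, ..., a2K] of A(z) for K two-pole resonators."""
    a = np.array([1.0])
    for freq, bw in zip(formants_hz, bandwidths_hz):
        radius = np.exp(-np.pi * bw / sample_rate)   # pole radius from bandwidth
        theta = 2 * np.pi * freq / sample_rate       # pole angle from frequency
        section = [1.0, -2 * radius * np.cos(theta), radius * radius]
        a = np.convolve(a, section)                  # cascade the sections
    return a


def all_zero_filter(x, a):
    """FIR inverse filter A(z): y[n] = sum_k a[k] * x[n-k]."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, a)[:len(x)]


def all_pole_filter(x, a):
    """IIR filter 1/A(z): y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y
```

Because the all-zero filter uses the same coefficients as the all-pole denominator, applying one after the other returns the original signal, which is the sense in which the all-zero filter acts as an inverse filter.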
In step , the all-zero filter created in step can be applied to the matrix of unfiltered pitch pulses. By applying the all-zero filter to the matrix of unfiltered pitch pulses, each unfiltered pitch pulse in the matrix of unfiltered pitch pulses can be individually filtered. This is equivalent to inverse filtering each pitch pulse in the matrix of unfiltered pitch pulses. Notably, the inverse filtering process of step is analogous to deriving an LPC model of each individual unfiltered pitch pulse. However, in the analogous case, the residue of the LPC analysis is white noise, whereas the residue of the inverse filtering process of step is a set of NSAN vectors. Significantly, the set of NSAN vectors produced by the inverse filtering process of step is not white noise because the order of the inverse filter is deliberately kept low. Thus, unlike white noise traditionally found in conventional waveform generators, the set of NSAN vectors produced by the method of the invention can retain some of the temporal structure of the original recorded sample.
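The row-by-row inverse filtering can be sketched as follows. The function name `nsan_vectors`, the `lengths` argument (the true length of each zero-padded pulse), and the choice to start every pulse from a zero filter state are assumptions of this example. With a low-order `a`, for instance the seven coefficients that three formants give, the residual keeps whatever structure the filter cannot model.

```python
import numpy as np


def nsan_vectors(unfiltered_matrix, lengths, a):
    """Inverse-filter each pulse (row) with the all-zero filter A(z)."""
    pulses = np.asarray(unfiltered_matrix, dtype=float)
    residuals = np.zeros_like(pulses)
    for row, length in enumerate(lengths):
        pulse = pulses[row, :length]
        # FIR filtering: residual[n] = sum_k a[k] * pulse[n-k], zero history.
        residuals[row, :length] = np.convolve(pulse, a)[:length]
    return residuals


# Usage: a one-pole "vocal tract" 1/(1 - 0.9 z^-1) excited by a short burst.
a = np.array([1.0, -0.9])
excitation = [1.0, 0.5, -0.25, 0.1]
pulse, prev = [], 0.0
for e in excitation:
    prev = e + 0.9 * prev
    pulse.append(prev)
residual = nsan_vectors([pulse], [4], a)   # recovers the excitation
```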
Finally, in step , during speech synthesis, the vowel sound can be resynthesized by adding the low-pass filtered pitch pulses to the corresponding NSAN vectors. In one aspect of the invention, the ratio between the amplitude of each filtered pitch pulse and the corresponding NSAN vector can be 3:1. The resulting composite pulses can be concatenated in random order. Notably, any number of composite pulses can be concatenated. Finally, the concatenated pulses can be passed through the all-pole filter of step in order to produce the synthesized vowel. Thus, by substituting the set of NSAN vectors for white noise (breathiness) produced by conventional waveform generators, the buzzing quality of the vowel sound can be masked.
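The resynthesis described above can be sketched as three small functions. Several details are assumptions of this example rather than statements from the text: "amplitude" is read as peak absolute value when enforcing the 3:1 ratio, pulses are drawn at random with replacement so that any number can be concatenated, and the function names are hypothetical.

```python
import numpy as np


def compose_pulses(filtered_matrix, nsan_matrix, lengths, ratio=3.0):
    """Add each NSAN vector to its filtered pulse at a pulse:noise peak ratio."""
    filtered = np.asarray(filtered_matrix, dtype=float)
    nsan = np.asarray(nsan_matrix, dtype=float)
    composites = []
    for row, length in enumerate(lengths):
        pulse = filtered[row, :length]
        noise = nsan[row, :length]
        noise_peak = np.max(np.abs(noise))
        if noise_peak > 0:
            # Rescale so that peak(pulse) / peak(noise) equals `ratio`.
            noise = noise * (np.max(np.abs(pulse)) / (ratio * noise_peak))
        composites.append(pulse + noise)
    return composites


def concatenate_random(composites, n_out, rng):
    """Join n_out composite pulses drawn at random (with replacement)."""
    picks = rng.integers(0, len(composites), size=n_out)
    return np.concatenate([composites[i] for i in picks])


def all_pole_filter(x, a):
    """IIR filter 1/A(z): y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y
```

A caller would pass the output of `concatenate_random` through `all_pole_filter` with the formant-derived coefficients to obtain the vowel waveform.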
BRIEF DESCRIPTION OF THE DRAWINGS
There are presently shown in the drawings embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
FIG. 1 is a schematic representation of a Text-to-Speech (TTS) Engine suitable for producing synthesized speech in accordance with the inventive arrangements.
FIG. 2 is a diagram of a process of generating non-stationary additive noise (NSAN) for addition to synthesized speech produced in the TTS Engine of FIG. .
CLAIMS
1. A method for generating non-stationary additive noise (NSAN) comprising: selecting a group of pitch pulses in a recorded sample of a spoken vowel; computing a frequency spectrum for said selected group of pitch pulses; identifying formant values in said computed frequency spectrum; creating an all-zero filter based upon said identified formant values; populating a zero-padded matrix with said selected group of pitch pulses; and, applying said all-zero filter to said matrix, wherein said application of said all-zero filter to said matrix produces NSAN vectors, each said NSAN vector corresponding to a pitch pulse in said group of pitch pulses.
2. The method of claim 1, wherein said step of selecting a group of pitch pulses comprises: selecting twenty pitch pulses in said recorded sample of speech.
3. The method of claim 2, wherein said twenty pitch pulses are positioned in the center of said recorded sample.
4. The method of claim 1, wherein said step of computing a frequency spectrum comprises: applying a linear predictive coding (LPC) process to said selected group of pitch pulses; said LPC process extracting predictive coefficients from said selected group of pitch pulses.
5. The method of claim 1, wherein said identifying step comprises identifying the first three formant values in said computed frequency spectrum.
6. The method of claim 1, wherein said step of creating an all-pole filter further comprises: configuring said all-zero filter with said extracted predictive coefficients.
7. The method of claim 1, further comprising: low-pass filtering the recorded sample, selecting a group of filtered pitch pulses in said filtered sample, each filtered pitch pulse in said selected group of said filtered sample corresponding to a pitch pulse in said selected group of said recorded sample, and adding each NSAN vector to a corresponding filtered pitch pulse in said selected group of said filtered sample, each added NSAN vector corresponding to a filtered pitch pulse which corresponds to a pitch pulse in said recorded sample having a correspondence with said added NSAN vector.
8. The method of claim 7, wherein said step of low-pass filtering comprises: determining a fundamental frequency for said recorded sample; and, passing said recorded sample through a low-pass cut-off filter configured with cut-off frequencies corresponding to said first formant and said fundamental frequency.
9. The method of claim 8, wherein said step of passing comprises: passing said recorded sample through said low-pass cut-off filter both forwards and backwards.
10. A method for producing vowel sounds in a waveform generator using non-stationary additive noise (NSAN) comprising: computing a frequency spectrum for a selected group of pitch pulses in a recorded sample of a spoken vowel; identifying a set of formant values in said computed frequency spectrum and creating an all-zero filter for said set of identified formant values; populating a zero-padded matrix with said selected group of pitch pulses and applying said all-zero filter to said matrix, said application of said filter producing a set of NSAN vectors; synthesizing a vowel sound in the waveform generator, said synthesis producing a further group of pitch pulses; and, adding said NSAN vectors to said further group of pitch pulses.
11. The method of claim 10, wherein said step of computing a frequency spectrum comprises: applying a linear predictive coding (LPC) process to said selected group of pitch pulses; said LPC process extracting predictive coefficients from said selected group of pitch pulses.
12. The method of claim 10, wherein said identifying step comprises identifying the first three formant values in said computed frequency spectrum.
13. The method of claim 11, wherein said step of creating an all-zero filter further comprises: configuring said all-zero filter with said extracted predictive coefficients.
14. The method of claim 10, where said adding step comprises: sampling said synthesized vowel sound and selecting a group of pitch pulses in said sampled vowel sound; and, for each pitch pulse in said sample, re-sampling a corresponding NSAN vector to the length of said pitch pulse, multiplying said re-sampled NSAN vector by a scaling factor and adding said NSAN vector to said pitch pulse.
15. A machine readable storage, having stored thereon a computer program having a plurality of code sections for generating non-stationary additive noise (NSAN) for addition to synthesized speech, said code sections executable by a machine for causing the machine to perform the steps of: selecting a group of pitch pulses in a recorded sample of a spoken vowel; computing a frequency spectrum for said selected group of pitch pulses; identifying formant values in said computed frequency spectrum; creating an all-zero filter based upon said identified formant values; populating a zero-padded matrix with said selected group of pitch pulses; and, applying said all-zero filter to said matrix as an all-zero filter, wherein said application of said all-zero filter to said matrix produces NSAN vectors, each said NSAN vector corresponding to a pitch pulse in said group of pitch pulses.
16. The machine readable storage of claim 15, wherein said step of selecting a group of pitch pulses comprises: selecting twenty pitch pulses in said recorded sample of speech.
17. The machine readable storage of claim 16, wherein said twenty pitch pulses are positioned in the center of said recorded sample.
18. The machine readable storage of claim 15, wherein said step of computing a frequency spectrum comprises: applying a linear predictive coding (LPC) process to said selected group of pitch pulses; said LPC process extracting predictive coefficients from said selected group of pitch pulses.
19. The machine readable storage of claim 15, wherein said identifying step comprises identifying the first three formant values in said computed frequency spectrum.
20. The machine readable storage of claim 15, wherein said step of creating an all-pole filter further comprises: configuring said all-zero filter with said extracted predictive coefficients.
21. The machine readable storage of claim 15, further comprising: low-pass filtering the recorded sample, selecting a group of filtered pitch pulses in said filtered sample, each filtered pitch pulse in said selected group of said filtered sample corresponding to a pitch pulse in said selected group of said recorded sample, and adding each NSAN vector to a corresponding filtered pitch pulse in said selected group of said filtered sample, each added NSAN vector corresponding to a filtered pitch pulse which corresponds to a pitch pulse in said recorded sample having a correspondence with said added NSAN vector.
22. The machine readable storage of claim 21, wherein said step of low-pass filtering comprises: determining a fundamental frequency for said recorded sample; and, passing said recorded sample through a low-pass cut-off filter configured with cut-off frequencies corresponding to said first formant and said fundamental frequency.
23. The machine readable storage of claim 22, wherein said step of passing comprises: passing said recorded sample through said low-pass cut-off filter both forwards and backwards.
24. A machine readable storage, having stored thereon a computer program having a plurality of code sections for producing vowel sounds in a waveform generator using non-stationary additive noise (NSAN), said code sections executable by a machine for causing the machine to perform the steps of: computing a frequency spectrum for a selected group of pitch pulses in a recorded sample of a spoken vowel; identifying a set of formant values in said computed frequency spectrum and creating an all-pole filter for said set of identified formant values; populating a zero-padded matrix with said selected group of pitch pulses and applying said all-pole filter to said matrix, said application of said filter producing a set of NSAN vectors; synthesizing a vowel sound in the waveform generator, said synthesis producing a further group of pitch pulses; and, adding said NSAN vectors to said further group of pitch pulses.
25. The machine readable storage of claim 24, wherein said step of computing a frequency spectrum comprises: applying a linear predictive coding (LPC) process to said selected group of pitch pulses; said LPC process extracting predictive coefficients from said selected group of pitch pulses.
26. The machine readable storage of claim 24, wherein said identifying step comprises identifying the first three formant values in said computed frequency spectrum.
27. The machine readable storage of claim 25, wherein said step of creating an all-zero filter further comprises: configuring said all-zero filter with said extracted predictive coefficients.
28. The machine readable storage of claim 24, where said adding step comprises: sampling said synthesized vowel sound and selecting a group of pitch pulses in said sampled vowel sound; and, for each pitch pulse in said sample, re-sampling a corresponding NSAN vector to the length of said pitch pulse, multiplying said re-sampled NSAN vector by a scaling factor and adding said NSAN vector to said pitch pulse.
