US3532821A - Speech synthesizer - Google Patents

Speech synthesizer

Info

Publication number
US3532821A
US3532821A (application US778560A)
Authority
US
United States
Prior art keywords
speech
recorded
consonants
signal
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US778560A
Inventor
Kazuo Nakata
Akira Ichikawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Application granted granted Critical
Publication of US3532821A publication Critical patent/US3532821A/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • The consonant selected by the matrix 414 is added to the vowel and transient sound in a summing amplifier 440, after the consonant is given an appropriate amplitude control, relative to the vowel and transient sound, in an analogue multiplier 424 with reference to the control signal 468.
  • FIG. 7 shows in more detail a part of one of the recorded elements selecting gate matrixes 411, 412, 413 and 414 shown in FIG. 6.
  • Since the gate matrixes 411, 412, 413 and 414 are substantially the same in operation, the following description will be made as to only one of them.
  • Suppose that l recorded channels 1, 2, ..., l on the magnetic drum are to be selectively read out by N reading heads 1, 2, ..., N.
  • Signal 451 (for the matrixes 411, 412 and 413) or signal 452 (for the matrix 414) which specifies the heads by which the recorded signals are to be read out, is led to a decoding buffer 500 to be decoded therein.
  • The decoding buffer 500 supplies an output 1 to the output lines leading to the specified heads and an output 0 to all of the remaining lines out of the lines 501 to 50N.
  • signal 461 (for 411), signal 462 (for 412) or signal 463 (for 413) which specifies the channels of which the outputs are to be taken, is led to another decoding buffer 600 to be decoded therein.
  • The decoding buffer 600 supplies a signal 1 to the selected lines and a signal 0 to the remaining lines out of the lines 601, 602, ..., 60l.
  • Outputs from the channels associated with the 1st head are connected to terminals 11, 12, ..., 1l respectively; outputs from the channels associated with the 2nd head are connected to terminals 21, 22, ..., 2l; and outputs from the channels associated with the N-th head are connected to terminals N1, N2, ..., Nl respectively.
  • Gate selecting signals 501, 502, ..., 50N and 601, 602, ..., 60l are first connected to the selecting digital AND gates 111, 121, ..., 1l1; 211, 221, ..., 2l1; ...; and N11, N21, ..., Nl1 respectively, as shown in the figure.
  • Of the N x l gates, only the one gate which receives the specified signal 1 opens, giving an output 1 to the associated gate among the ensuing analogue gates 112, 122, ..., 1l2; 212, 222, ..., 2l2; ...; N12, N22, ..., Nl2.
  • the output of the specified head read from the specified channel is selected.
  • The decoded output from the decoding buffer 500 specifies not only the head to be selected, but also the time at which the signal is read out by that head. (As the signal is always read out from the starting point of the record, the starting time can be easily determined from the timing pulse on the drum.) Therefore, assuming that the digital AND gates 111, 121, ..., Nl1, once they are opened, maintain the output 1 during a complete revolution of the drum (the period being T ms., for example, 20 ms.), this selecting gate matrix allows a read-out as shown in FIG. 4.
  • the read-out outputs are summed and let out from the output amplifier 700. This output corresponds to either one of the outputs 471, 472 or 473 in FIG. 6.
  • In the case of consonants, the read-out from a specified channel by a specified head is required to continue for the duration inherent to the particular consonant. This is achieved by controlling the duration with the decoded signal from the decoding buffer 500, whereas the duration is constant (for example, 20 ms.) in the case of vowels. This output corresponds to 474 in FIG. 6.
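A minimal sketch of the selection logic just described, assuming one-hot decoder outputs; the function and variable names are invented for illustration and are not part of the patent:

```python
def select_output(head_lines, channel_lines, recorded):
    """Gate-matrix sketch: the AND of head line n (decoder 500) and
    channel line i (decoder 600) opens exactly one analogue gate;
    recorded[n][i] is the sample read by head n from channel i."""
    total = 0.0
    for n, h in enumerate(head_lines):
        for i, c in enumerate(channel_lines):
            if h and c:                      # digital AND gate (n, i)
                total += recorded[n][i]      # analogue gate passes the signal
    return total                             # output amplifier 700 sums the result
```

With one-hot selecting signals, exactly one of the N x l gates opens, so the sum reduces to the single selected read-out.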
  • 1. A speech synthesizer of the pre-record and compilation type comprising: a memory on which a number of damped sinusoidal waves of different frequencies are recorded; means for selectively reading out at least one of said sinusoidal waves periodically according to a control signal, said period of reading out being variable; a memory on which a number of continuous signals having features of respective consonants are recorded; and means for selectively reading out at least one of said continuous signals at a specified time according to a control signal.
  • 2. A speech synthesizer according to claim 1, which further comprises means for compiling the outputs of the first and second mentioned reading-out means together.

Description

United States Patent 3,532,821
SPEECH SYNTHESIZER
Kazuo Nakata, Kokubunji-shi, and Akira Ichikawa, Musashino-shi, Japan, assignors to Hitachi, Ltd., Tokyo, Japan, a corporation of Japan
Filed Nov. 25, 1968, Ser. No. 778,560
Claims priority, application Japan, Nov. 29, 1967, 42/76,093
Int. Cl. G10l 1/00
U.S. Cl. 179-1
2 Claims
Patented Oct. 6, 1970
(Seven sheets of drawings, filed Nov. 25, 1968, accompany the patent.)

ABSTRACT OF THE DISCLOSURE

A speech synthesizer for compiling a speech from prerecorded acoustic elements, wherein said acoustic elements are classified into two groups, one being a number of damped sinusoidal waves of different frequencies from which vowels and transient sounds of speech are produced according to control signals, the other being continuous signals having features of respective consonants, and the speech being synthesized by combining selected ones of said vowels and transient sounds with selected ones of said continuous signals corresponding to consonants.
This invention relates to a speech synthesizer, particularly to a system for artificially reproducing speech by compiling pre-recorded acoustic elements (hereafter referred to as a pre-record and compilation system).
In a pre-record and compilation system, a word is usually employed as the pre-recorded unit. Therefore, in order to increase the amount of synthesizable speech and to expand its application range from a limited particular field to a more generalized scope, a drastic increase in the number of units of speech (or words) to be pre-recorded is required. Such an increase of the pre-recorded words inevitably makes the system bulky and complicated and, further, increases the access time required for reading out wanted words.
One approach to solving the above problem may be to store mono-syllables, instead of words, as the pre-recorded units or acoustic elements. With this method, however, it is known that the quality of the compiled speech is poor both in clearness and in naturalness. A reason for this inferior quality is that a word made by combining syllables differs greatly, in the characteristic features of the component syllables such as the frequencies of the formants, the intensity of the envelope, the frequency of the pitch, and the duration, from the same word naturally pronounced in an integral speech uttered with a particular meaning. The only way to solve this is to increase the number of pre-recorded units or acoustic elements, which contradicts the purpose for which syllables were adopted as the elements to be stored.
The main object of this invention is to provide an improved speech synthesizer of the pre-record and compilation system in which the above defects have been removed. Namely, the objects of this invention are to increase the variety of the synthesizable speech, to minimize the number of the acoustic elements to be stored as basic units or constituents of synthesized speech, to improve the quality of the synthesized speech, especially in the naturalness of the speech, and to reduce the size of the system.
According to this invention, the acoustic elements to be stored are voiced sounds, each having a constant repetition rate, and consonants, including nasal consonants, unvoiced consonants and voiced consonants. Each voiced sound is produced by selectively reading out and compiling together, at varied intervals determined by a control signal, a number of damped sinusoids of different frequencies which have been pre-recorded on a recording medium. On the other hand, the consonant part is compiled from a number of naturally pronounced consonants, or synthesized consonants which represent the characteristic features of the natural consonants. These constituent consonants are pre-recorded on a recording medium and read out under control of a control signal, with the timing of the read-out and the duration being controlled.
This invention will be described in detail with reference to the accompanying drawings, in which FIGS. 1a, 1b and 1c show a waveform of speech and the spectrum characteristics thereof;
FIGS. 2a, 2b, 2c and 2d show a waveform of a particular oscillation and the spectrum characteristics thereof;
FIGS. 3 and 4 are schematic diagrams illustrating the synthesis of waveforms by means of a magnetic drum;
FIG. 5 is a block diagram of an embodiment of a speech synthesizer according to this invention; and
FIGS. 6 and 7 are diagrams for explaining the operation of the essential portions of the above embodiments.
Fundamentally, a voice is produced when either a voiced sound, caused by the vibration of the vocal cords and consisting of almost periodically repeated intermittent triangular waves, or an unvoiced sound, caused by a turbulent flow produced by constriction of the vocal tract and consisting of almost white random noise, is passed through a cavity formed in the vocal tract, that is, an articulatory organ extending from the glottis to the lips. In FIG. 1a, which shows a part of a waveform of a speech, the section indicated by reference numeral 1 represents a voiced sound in which the repetition rate of the vocal source is constant, and the section 2 a consonant. The frequency spectrums of the above sounds 1 and 2 are characterized, as shown in FIGS. 1b and 1c respectively, by spectrum envelopes 3, which are indications of the resonant characteristics of the articulator, and by the inner structure of the spectrum, which indicates the features of the vocal source; the former is characterized mainly by several single-resonance characteristics (that is, formants) 4, 4', 4", 5, 5', and the latter mainly by a harmonic line spectrum 6 (periodicity) or by the randomness of a continuous spectrum.
According to this invention, a voiced sound of a constant repetition rate, for example one having a characteristic spectrum as shown in FIG. 1b, can readily be synthesized from a number of pre-recorded damped sinusoidal waves of different frequencies. The principle of this synthesis will be explained hereunder.
A damped sinusoidal oscillation as shown in FIG. 2a gives a single-resonant frequency spectrum as shown in FIG. 2b, said damped sinusoidal oscillation being represented by the formula e^(-at) sin w0t, where a is the damping factor, t the time, and w0 the angular frequency of the oscillation. If the damped sinusoidal oscillations are repeated at a constant period T as shown in FIG. 2c, the frequency spectrum thereof will become a harmonic line spectrum as shown in FIG. 2d. It is known in the acoustical theory on the production of voice that the spectrum envelope 3 as shown in FIG. 1b is produced by continuously cascading the single-resonant features as shown in FIG. 2b. Therefore, such a voiced sound can be synthesized by summing damped sinusoids at the formant frequencies and repeating the sum at the pitch period; relative to the first formant, the relative amplitude of the second formant is (w1/w2) and that of the third formant is (w1/w3), where w1, w2 and w3 indicate respectively the angular frequencies of the first, the second and the third formants of the voice.
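The repetition scheme above can be sketched in a few lines of code. This is an illustrative reconstruction, not the patent's circuitry: `damped_sinusoid` stands in for one pre-recorded drum track, `repeat_at_pitch` for the periodic re-triggering of its read-out, and all names and parameter values are assumptions.

```python
import math

def damped_sinusoid(alpha, freq_hz, duration_s, fs):
    """One recorded element: e^(-alpha*t) * sin(2*pi*freq*t), sampled at fs."""
    n = int(duration_s * fs)
    return [math.exp(-alpha * i / fs) * math.sin(2 * math.pi * freq_hz * i / fs)
            for i in range(n)]

def repeat_at_pitch(element, pitch_period_s, duration_s, fs):
    """Re-trigger the element every pitch period, summing overlapping tails
    (the constant-period repetition that yields a harmonic line spectrum)."""
    out = [0.0] * int(duration_s * fs)
    step = int(pitch_period_s * fs)
    for start in range(0, len(out), step):
        for i, s in enumerate(element):
            if start + i < len(out):
                out[start + i] += s
    return out
```

With a 300 Hz element re-triggered every 10 ms, successive 10 ms windows of the output differ only by the decaying tail of the oldest copy, so the waveform settles toward the periodic signal whose spectrum is the line spectrum of FIG. 2d.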
Further, a transient sound from a voiced sound of a constant repetition rate, that is, from a sound having a particular frequency spectrum to another sound having another frequency spectrum, can be synthesized with sufficient smoothness according to the following steps: quantizing the variation in the frequencies of the characteristic formants between the two sounds; synthesizing sounds by adding the damped sinusoidal oscillations as described above; and then joining the sounds consecutively.
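The quantization step can be sketched as follows; `formant_trajectory` is a hypothetical name, and linear interpolation is assumed purely for illustration, since the patent does not fix the interpolation rule:

```python
def formant_trajectory(start_hz, end_hz, steps):
    """Quantize the formant-frequency variation between two sounds into a
    sequence of intermediate frequencies; each one selects a damped sinusoid
    to synthesize, and the resulting sounds are joined consecutively."""
    return [start_hz + (end_hz - start_hz) * k / (steps - 1) for k in range(steps)]
```

Each intermediate frequency would be rounded to the nearest recorded element, so the stored inventory of Table 1 suffices for transients as well.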
Accordingly, in the speech synthesizer of this invention, the number of the acoustical elements to be pre-recorded is required only to be enough to cover, with an appropriate spacing, the frequency bands which are essential for constituting a speech (including the first, the second and the third formants). An example of such a number as realized in an embodiment of this invention is shown in the following Table 1.
TABLE 1. AN EXAMPLE OF THE NUMBER OF THE RECORDED ACOUSTICAL ELEMENTS OF DAMPED SINUSOIDAL OSCILLATION

First formant range: 200 to 950 Hz., 16 elements
Second formant range: 800 to 2,400 Hz., 16 elements
Third formant range: 2,200 to 3,500 Hz., 8 elements

As to the consonant portions of the voice (nasal consonants, unvoiced consonants and vocal or voiced consonants), it is only required to pre-record signals corresponding to the features of the respective consonants. The number of such signals is at most 16, as shown in the following Table 2.

TABLE 2. AN EXAMPLE OF THE NUMBER OF THE CONSONANTAL ELEMENTS TO BE RECORDED (columns for fricative sounds, plosive sounds and nasal sounds, each listing the consonants and the number of elements)

Therefore, the total number of the acoustical elements to be recorded will be of the order of fifty. In order to improve the naturalness of the thus compiled speech, it is required to control the period of the above-described repetitive reproduction of the damped sinusoidal oscillations in accordance with the pitch period of the speech to be synthesized. A tangible method of such control will be described hereunder with reference to FIG. 3, which illustrates schematically a magnetic drum on which the above-mentioned damped sinusoidal oscillations are recorded.
Assuming that the lowest frequency of the pitch of the speech to be synthesized is 50 Hz., the damped sinusoidal oscillations are recorded for 20 ms., which corresponds to one revolution period of the drum. (This means that the time constant of damping is assumed to be about 20 ms. at most. This assumption will be appropriate in view of the bandwidth of vowel formants.) If, for example, ten read-out heads are disposed at equal spacing along the circumference of the drum, the time difference between two adjacent heads will be 2 ms. This is the minimum controllable step of the pitch period, and the pitch frequency is controlled according to the selection of read-out with the following ten steps: 50 Hz., 55.5 Hz., 62.5 Hz., 71.5 Hz., 83.5 Hz., 100 Hz., 125 Hz., 166 Hz., 250 Hz. and 500 Hz. It will be understood that these steps can be made finer by increasing the number N of the heads.
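The ten pitch steps quoted above follow directly from the head geometry; a quick arithmetic check (illustrative code, not part of the patent):

```python
# A 20 ms drum revolution with 10 equally spaced heads gives 2 ms spacing.
# Starting the next read-out k head positions later gives a pitch period
# of 2*k ms, i.e. a pitch frequency of 1000/(2*k) Hz.
DRUM_REVOLUTION_MS = 20.0
NUM_HEADS = 10
spacing_ms = DRUM_REVOLUTION_MS / NUM_HEADS  # 2 ms between adjacent heads

pitch_steps_hz = [1000.0 / (spacing_ms * k) for k in range(NUM_HEADS, 0, -1)]
```

This reproduces 50, 55.6, 62.5, 71.4, 83.3, 100, 125, 166.7, 250 and 500 Hz; the patent quotes the same ladder with its own rounding.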
Referring to FIG. 3, it is assumed that the I-th head is reading at an instant, and that T is the time interval between readings with two adjacent heads. If the next reading is started when the beginning of the recorded signal comes to the position of the (I+k)-th head, the interval between two readings will become longer by Tk seconds, while if it is started at the position of the (I-k)-th head, the interval will become shorter by Tk seconds. (T indicates the time that the rotating drum takes to move from one head to the next.) Assuming that the recorded signal is read out by a head continuously for one revolution of the magnetic drum, that is, for 20 ms., it will be seen from FIG. 4 that the beginning portion of a reading period overlaps a portion of the signal read by the preceding head and the ending portion overlaps a portion of the signal read by the ensuing head; thus the transition of physical features is achieved more smoothly, resulting in an improved quality of the synthesized speech.
In the following paragraphs, the pre-record and compilation type speech synthesizer of this invention will be described in detail in connection with an embodiment of the invention.
In FIG. 5, which is a block diagram of an embodiment of this invention, a multiple output system of n channels is shown. Constituents of the sentence to be converted to speech, which are selected in the main apparatus 10 of the information processing system (usually a common large high-speed electronic computer), are immediately converted into speech output control signals 11, 12, ..., 1n, with reference to a magnetic drum 20 which contains a pronouncing dictionary (a stock of control signals for speech units to be articulated), and then are distributed to control signal decoders 101, 102, ..., 10n for the respective channels, where the distributed control signals are decoded to a group of more tangible control signals 21, 22, ..., 2n for reading the recorded acoustical elements. A part of the decoded signals is led to the recorded elements selecting gate matrixes 201, 202, ..., 20n, while the remaining part is led to groups of controlling analogue multipliers (311, 312, 313), (321, 322, 323), ..., (3n1, 3n2, 3n3) for controlling the relative amplitudes of the read-out signals. Thus, a specified acoustical element is read out through a specified head on the elements storage drum 30 at a specified time, and then its relative amplitude is controlled as required.
The amplitude-controlled outputs are led to summing amplifiers 314, 324, ..., 3n4 in the respective channels and are added to each other, and then are controlled in intensity in multipliers 315, 325, ..., 3n5 as required for a phoneme and integral speech. After that, the outputs are combined with consonants in summing amplifiers 316, 326, ..., 3n6 to become the resultant vocal outputs 31, 32, ..., 3n. The above-described process is repeated at short regular intervals, thereby producing a continuous speech output.
Next, the essential components of the system will be described in more detail. As has been explained already, according to this invention, a voice is separated into two parts, that is: (1) vowels and transient sounds (including semivowels and fluent sounds) and (2) consonants (unvoiced consonants, voiced consonants and nasal consonants). In synthesizing speech, part (1) is produced by repeatedly reading out, at varied periods, the recorded damped sinusoidal waveforms, and part (2) by directly reading out the required waveforms from among the recorded consonantal ones; finally both parts are combined. It is known already that the fricative sounds and plosive sounds can be produced by increasing the overlap of the consonant part (2) and the vowel and transient part, and the plosive sounds also by making the vowel and transient part steep. Therefore, any syllable can be synthesized from the above-described parts (1) and (2).
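The combination rule (more overlap for fricatives, a steeper vowel onset for plosives) can be sketched with a simple overlap-add; `combine_parts` and its arguments are illustrative assumptions, not the patent's circuit:

```python
def combine_parts(consonant, vowel, overlap):
    """Join part (2) (consonant) to part (1) (vowel/transient), summing
    `overlap` samples where the two waveforms coexist; a larger overlap
    gives a more fricative-like transition."""
    head = consonant[:len(consonant) - overlap]
    joint = [c + v for c, v in
             zip(consonant[len(consonant) - overlap:], vowel[:overlap])]
    return head + joint + vowel[overlap:]
```

The combined length is len(consonant) + len(vowel) - overlap, so increasing the overlap shortens and blends the transition.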
Of the parts (1) and (2), only part (1) needs to be read out repeatedly at varied periods, and the variable periods are common to all of the first, second and third formants.
Therefore, the read-out of the recorded acoustical elements will be explained hereunder with reference to a particular channel. The acoustical elements recorded on the magnetic drum are classified into two categories, that is: a group of damped sinusoidal waves used for the synthesis of the above-described part (1), and a group of consonantal waves. The first group is divided into three ranges overlapping each other in their fringe portions, that is: the first formant range (16 channels from 200 to 950 Hz.), the second formant range (16 channels from 800 to 2,400 Hz.) and the third formant range (8 channels from 2,200 to 3,500 Hz.). In order to simplify the structure for control, channels on the magnetic drum are divided corresponding to the above two categories, the first category being further divided into three zones, namely, the first, second and third zones. Thus, the recording channels of the drum are divided into four zones. That is, as shown in FIG. 6, the elements storage drum 400 is divided into four zones 401, 402, 403 and 404. Outputs of the reading heads for the respective channels in said four zones are led to the gate matrixes 411, 412, 413 and 414 for selecting outputs. Of said four gate matrixes, the matrixes 411, 412 and 413 for composing the formants are supplied commonly with a head selecting signal 451, while the remaining matrix 414 is supplied with a signal 452 for selecting the consonant reading head.
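Choosing a recorded channel for a desired formant frequency amounts to quantizing that frequency onto the zone's channel grid. The patent gives only the ranges and channel counts above; uniform spacing within each range is an assumption made for this sketch, as is the function name.

```python
FORMANT_ZONES = {          # zone: (low Hz, high Hz, number of recorded channels)
    1: (200.0, 950.0, 16),
    2: (800.0, 2400.0, 16),
    3: (2200.0, 3500.0, 8),
}

def channel_for(formant, freq_hz):
    """0-based index of the recorded channel nearest the requested formant
    frequency, assuming uniformly spaced channels within the zone."""
    low, high, n = FORMANT_ZONES[formant]
    step = (high - low) / (n - 1)
    idx = round((freq_hz - low) / step)
    return max(0, min(n - 1, idx))      # clamp to the zone's range
```

A frequency selecting signal such as 461 would then carry this index to the corresponding gate matrix.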
In order to determine which channel (frequency) should be selected in the respective zones, frequency selecting signals 461, 462 and 463 are given to the respective matrixes, since the first, second and third formants must be independently controlled. According to these control signals, damped sinusoidal waves of different frequencies (corresponding to the formant frequencies), repeatedly read out at particular periods (corresponding to the pitch periods), are obtained at output terminals 471, 472 and 473 of the gate matrixes 411, 412 and 413. The outputs from matrixes 412 and 413 are controlled as to their amplitudes relative to the output from matrix 411 in analogue multipliers 422 and 423 with reference to control signals 465 and 466, and then added to the latter output in a summing amplifier 431. The output from the summing amplifier 431 is further controlled as to amplitude in an analogue multiplier 441 with reference to a control signal 481, so as to give a good effect of vocal sound and speech, and is then let out through an output terminal 490 as continuous speech.
If a consonant is required, the consonant selected by the matrix 414 is added to the vowel and transient sound in a summing amplifier 440, after the consonant has been given appropriate amplitude control relative to the vowel and transient sound in an analogue multiplier 424 with reference to the control signal 468.
FIG. 7 shows in more detail a part of one of the recorded elements selecting gate matrixes 411, 412, 413 and 414 shown in FIG. 6. As the gate matrixes 411, 412, 413 and 414 are substantially the same in operation, the following description will be made as to only one of them.
In FIG. 7, it is assumed that l recorded channels 1, 2, ... l on the magnetic drum are to be selectively read out by N reading heads 1, 2, ... N.
Signal 451 (for the matrixes 411, 412 and 413) or signal 452 (for the matrix 414), which specifies the heads by which the recorded signals are to be read out, is led to a decoding buffer 500 to be decoded therein. The decoding buffer 500 supplies output 1 to the output lines leading to the specified heads and output 0 to all of the remaining lines among the lines 501 to 50N.
Meanwhile, signal 461 (for 411), signal 462 (for 412) or signal 463 (for 413), which specifies the channels whose outputs are to be taken, is led to another decoding buffer 600 to be decoded therein. The decoding buffer 600 supplies signal 1 to the selected lines and signal 0 to the remaining lines among the lines 601, 602, ... 60l. As to the analogue read-out from each channel on the magnetic drum, outputs from the channels associated with the 1st head are connected to terminals 11, 12, ... 1l respectively, outputs from the channels associated with the 2nd head are connected to terminals 21, 22, ... 2l, and outputs from the channels associated with the N-th head are connected to terminals N1, N2, ... Nl respectively.
Gate selecting signals 501, 502, ... 50N and 601, 602, ... 60l are first connected to selecting digital AND gates 111, 121, ... 1l1; 211, 221, ... 2l1; and N11, N21, ... Nl1 respectively, as shown in the figure. As a result, among the N × l gates only the one gate which receives the specified signal 1 opens, giving an output 1 to the associated gate among the ensuing analogue gates 112, 122, ... 1l2; 212, 222, ... 2l2; ... N12, N22, ... Nl2. Thus, the output of the specified head read from the specified channel is selected.
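The selection logic of the gate matrix — two one-hot decoder outputs whose AND opens exactly one of the N × l gates — can be sketched as follows; the function names and the toy drum contents are assumptions for the illustration.

```python
def one_hot(index, size):
    """Decoder output: signal 1 on the selected line, 0 on the rest."""
    return [1 if i == index else 0 for i in range(size)]

def select_output(drum, head_lines, channel_lines):
    """Among the N x l gate pairs, only the gate whose head line AND
    channel line both carry 1 opens its analogue gate and passes the
    corresponding channel read-out; all other gates stay closed."""
    total = 0.0
    for h, h_open in enumerate(head_lines):
        for c, c_open in enumerate(channel_lines):
            if h_open and c_open:      # the digital AND gate
                total += drum[h][c]    # the ensuing analogue gate
    return total

drum = [[10.0, 20.0], [30.0, 40.0]]    # N = 2 heads, l = 2 channels
out = select_output(drum, one_hot(1, 2), one_hot(0, 2))  # 2nd head, 1st channel
```

With both decoder outputs one-hot, the double loop passes exactly one drum reading, mirroring the single open gate in the hardware matrix.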
Further, the decoded output from the decoding buffer 500 specifies not only the head to be selected, but also the time at which the signal is read out from the head. (As the signal is always read out from the starting point of the record, the starting time can be easily determined from the timing pulse on the drum.) Therefore, assuming that the digital AND gates 111, 121, ... Nl1, once opened, maintain the output 1 during a complete revolution of the drum (the period being T ms., for example 20 ms.), this selecting gate matrix allows a read-out as shown in FIG. 4.
The read-out outputs are summed and let out from the output amplifier 700. This output corresponds to either one of the outputs 471, 472 or 473 in FIG. 6.
In the consonant selecting gate matrix, the read-out of a specified head from a specified channel is required to continue for the duration inherent to the particular consonant. This is achieved by controlling the duration with the decoded signal from the decoding buffer 500, whereas the duration is constant (for example, 20 ms.) in the case of vowels. This output corresponds to 474 in FIG. 6.
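The gating-duration bookkeeping implied here can be sketched as follows; rounding a consonant's duration up to whole drum revolutions, and the function name, are assumptions of this sketch rather than details from the patent.

```python
VOWEL_FRAME_MS = 20   # one drum revolution, the T ms. of the text

def revolutions_gated(kind, duration_ms=None):
    """How many consecutive drum revolutions a selected head stays gated:
    vowel elements are re-selected every revolution, while a consonant's
    gate is held for its inherent duration (rounded up here to whole
    revolutions)."""
    if kind == "vowel":
        return 1
    return -(-duration_ms // VOWEL_FRAME_MS)   # ceiling division
```

A long fricative would thus hold its gate across several revolutions, while vowel selection is refreshed each frame.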
It will be obvious that the above-described principle of this invention applies equally to a digital recording of the acoustic elements or to a cyclic memory consisting of a group of shift registers. However, it will be understood that with digital recording, a digital-to-analogue converter is required for converting the read-out output to an analogue waveform.
What we claim is:
1. A speech synthesizer of the pre-record and compilation type comprising: a memory on which a number of damped sinusoidal waves of different frequencies are recorded; means for selectively reading out at least one of said sinusoidal waves periodically according to a control signal, said period of reading out being variable; a memory on which a number of continuous signals having features of respective consonants are recorded; and means for selectively reading out at least one of said continuous signals at a specified time according to a control signal.
2. A speech synthesizer according to claim 1, which further comprises means for compiling the outputs of the first and second mentioned reading out means together.
References Cited
UNITED STATES PATENTS
KATHLEEN H. CLAFFY, Primary Examiner
C. W. JIRAUCH, Assistant Examiner
US778560A 1967-11-29 1968-11-25 Speech synthesizer Expired - Lifetime US3532821A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP7609367 1967-11-29

Publications (1)

Publication Number Publication Date
US3532821A true US3532821A (en) 1970-10-06

Family

ID=13595216

Family Applications (1)

Application Number Title Priority Date Filing Date
US778560A Expired - Lifetime US3532821A (en) 1967-11-29 1968-11-25 Speech synthesizer

Country Status (4)

Country Link
US (1) US3532821A (en)
DE (1) DE1811040C3 (en)
FR (1) FR1593788A (en)
GB (1) GB1225142A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3689696A (en) * 1970-01-09 1972-09-05 Inoue K Speech synthesis from a spectrographic trace
US3723667A (en) * 1972-01-03 1973-03-27 Pkm Corp Apparatus for speech compression
US3798372A (en) * 1972-05-12 1974-03-19 D Griggs Apparatus and method for retardation of recorded speech
US3828132A (en) * 1970-10-30 1974-08-06 Bell Telephone Labor Inc Speech synthesis by concatenation of formant encoded words
US3830977A (en) * 1971-03-26 1974-08-20 Thomson Csf Speech-systhesiser
US3865982A (en) * 1973-05-15 1975-02-11 Belton Electronics Corp Digital audiometry apparatus and method
US3905030A (en) * 1970-07-17 1975-09-09 Lannionnais Electronique Digital source of periodic signals
DE3024062A1 (en) * 1980-06-26 1982-01-07 Siemens AG, 1000 Berlin und 8000 München Semiconductor module for speech synthesis - has speech units stored in analogue form in charge coupled devices
US4658374A (en) * 1979-02-28 1987-04-14 Sharp Kabushiki Kaisha Alphabetizing Japanese words in a portable electronic language interpreter

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE1297311B (en) * 1964-03-18 1969-06-12 Krefft Gmbh W Equipment for preparing, portioning and distributing food
US3998045A (en) * 1975-06-09 1976-12-21 Camin Industries Corporation Talking solid state timepiece

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2243089A (en) * 1939-05-13 1941-05-27 Bell Telephone Labor Inc System for the artificial production of vocal or other sounds
US2771509A (en) * 1953-05-25 1956-11-20 Bell Telephone Labor Inc Synthesis of speech from code signals
US2793249A (en) * 1953-12-04 1957-05-21 Vilbig Friedrich Synthesizer for sound or voice reproduction
US3158685A (en) * 1961-05-04 1964-11-24 Bell Telephone Labor Inc Synthesis of speech from code signals
US3398241A (en) * 1965-03-26 1968-08-20 Ibm Digital storage voice message generator



Also Published As

Publication number Publication date
GB1225142A (en) 1971-03-17
FR1593788A (en) 1970-06-01
DE1811040A1 (en) 1969-07-24
DE1811040C3 (en) 1974-02-14
DE1811040B2 (en) 1973-07-12
