Daisy Bell – ankurb

Daisy Bell is a song very close to my heart. It was composed by Harry Dacre in 1892, reportedly as an ode to the beauty of a real-life “Daisy”, Frances Evelyn Maynard. However, this is not what makes it special for me. It was the first song ever to be sung by electronic speech synthesis, programmed at Bell Labs by John Kelly in 1961 on an IBM 704 computer. And guess who visited this facility? Arthur C. Clarke, who then used this very song in 2001: A Space Odyssey during HAL’s famous deactivation scene. The whole point is that it opened up a whole new frontier, speech synthesis, which eventually became mainstream after this pioneering stunt took place.

There are several ways of doing speech synthesis. One of them is known as concatenative synthesis, which means stringing together clips of recorded speech taken from a huge speech database. Each clip is stored with an associated label saying which keyword, phone, phrase, etc. it contains, so the right clips can be looked up for the sentence being said. The downside is that the result doesn’t sound continuous unless one uses algorithms to level out the differences between adjacent clips: the volume might jump, or the human might have spoken in a slightly different style from one clip to the next. If these glitches are taken care of, it is one of the most convincing forms of synthesis. Then again, the system relies on a huge database, which takes up space and needs computing power to search, so it can be inefficient for small-scale implementations. But then again, there is miniaturization.
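To make the "level out the volume difference" idea concrete, here is a minimal sketch in plain Python. It assumes clips are just lists of float samples, and everything in it (the function names, the toy sine-wave "database", the overlap length and target loudness) is made up for illustration, not taken from any real synthesizer. It matches each clip to a common RMS loudness and then crossfades the joins so there is no hard click between clips.

```python
import math

def rms(clip):
    """Root-mean-square level of a clip (a rough measure of loudness)."""
    return math.sqrt(sum(s * s for s in clip) / len(clip))

def match_level(clip, target_rms):
    """Scale a clip so its RMS level equals target_rms."""
    current = rms(clip)
    if current == 0:
        return list(clip)
    gain = target_rms / current
    return [s * gain for s in clip]

def crossfade_concat(a, b, overlap):
    """Join two clips, blending the last `overlap` samples of a into b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    start = len(a) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps from a towards b
        blended.append(a[start + i] * (1 - w) + b[i] * w)
    return list(a[:start]) + blended + list(b[overlap:])

def synthesize(units, database, overlap=32, target_rms=0.1):
    """Look up each unit in the database, level it, and crossfade the joins."""
    out = []
    for name in units:
        clip = match_level(database[name], target_rms)
        out = crossfade_concat(out, clip, overlap) if out else clip
    return out

def tone(freq, amp, n=400, rate=8000):
    """Stand-in for a recorded clip: a short sine wave (hypothetical data)."""
    return [amp * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

# Two "recordings" with very different volumes.
database = {"hel": tone(200, 0.8), "lo": tone(300, 0.05)}
speech = synthesize(["hel", "lo"], database)
```

A real system would also have to smooth pitch and speaking style across the join, which is the harder part; this only covers the volume jump.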

One could also use diphones, which are nothing but the transitions between two sounds (phones), and chain these together one after another to create speech. Personally I like this method: it uses less space, and in my opinion it will be extremely convincing once we make it work like it should. The problem is that it’s pretty hard to do that…
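As a rough sketch of what "using diphones" means in practice, the snippet below turns a sequence of phones into the list of diphone units you would have to look up and play. The phone labels, the `_` silence marker and the function names are my own assumptions for illustration. The second function gives a loose upper bound on how many units you would need to record, which is why the inventory stays small compared to a database of whole words or phrases.

```python
def to_diphones(phones, silence="_"):
    """Turn a phone sequence into diphone names, padded with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def max_inventory(n_symbols):
    """Upper bound on distinct diphones: every symbol followed by every symbol."""
    return n_symbols ** 2

# "Daisy", written with illustrative phone labels.
units = to_diphones(["d", "ey", "z", "iy"])
print(units)  # ['_-d', 'd-ey', 'ey-z', 'z-iy', 'iy-_']
```

Not every pair actually occurs in a given language, so the real number of recordings is usually lower than the bound.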

The rest of them are, in my opinion, needlessly complex, such as formant synthesis:

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
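Here is a very small sketch of that idea, again in plain Python with no recorded speech anywhere. It assumes a simple source-filter setup: an impulse train at the fundamental frequency stands in for the voice source, and a cascade of second-order resonators stands in for the formants. The formant frequencies and bandwidths are rough illustrative values for an "ah"-like vowel, and all function names are hypothetical. A real synthesizer would vary these parameters over time; this one holds them fixed for a single vowel.

```python
import math

def resonator(signal, freq, bandwidth, rate):
    """Second-order digital resonator: boosts energy around `freq` (Hz)."""
    t = 1.0 / rate
    c = -math.exp(-2 * math.pi * bandwidth * t)
    b = 2 * math.exp(-math.pi * bandwidth * t) * math.cos(2 * math.pi * freq * t)
    a = 1 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def glottal_source(f0, n, rate):
    """Crude voiced source: one impulse per pitch period."""
    period = int(rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def vowel(f0, formants, duration=0.3, rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    n = round(duration * rate)
    wave = glottal_source(f0, n, rate)
    for freq, bw in formants:
        wave = resonator(wave, freq, bw, rate)
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]  # normalize to the range [-1, 1]

# Assumed, approximate formants for an "ah"-like vowel at a 120 Hz pitch.
ah = vowel(120, [(700, 80), (1200, 90), (2600, 120)])
```

Writing `ah` out as 16 kHz audio should give a buzzy, robotic vowel, which is a fair preview of how this method tends to sound.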

Formant synthesis doesn’t produce convincing speech in any case, possibly because our acoustic models are not that thorough. It currently does have a few advantages, but then again it’s needlessly complicated. The thing is that every type of system one can think of already exists; the real challenge is to make them seamless and reliable. It may be like early computer graphics: given enough time and computational power, we may finally reach HAL…
