WaveNet: A System for Generative Audio Modelling

Introduction

In this blog, we will discuss WaveNet, a system for generative audio modelling. WaveNet is a deep neural network that generates raw audio. It was developed by researchers at DeepMind, the London-based AI company. By directly modelling waveforms with a neural network trained on recordings of real speech, the system can produce voices that sound reasonably human. Although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech, tests with US English and Mandarin showed that it outperformed Google's best text-to-speech (TTS) systems at the time. Because it generates raw waveforms, WaveNet can model any type of audio, including music.

What is WaveNet: A System for Generative Audio Modelling?

WaveNet generates the waveforms of speech by determining which sounds are most likely to follow one another. Each waveform is built one sample at a time, at up to 24,000 samples per second. Because the model learns from human speech, WaveNet automatically picks up natural-sounding components that earlier text-to-speech systems missed, such as lip-smacking and breathing rhythms. It gives computer-generated voices richness and depth by incorporating intonation, accents, emotion, and other crucial elements of communication that prior systems lacked. For instance, when WaveNet was originally released, it was trained on American English and Mandarin Chinese speech and reduced the gap between human and artificial voices by 50%.
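The sample-at-a-time generation described above can be written as a chain-rule factorisation of the waveform's probability. Here is a small illustrative sketch: the 24,000 samples-per-second figure is from the text, while the per-sample log-likelihood values are made up for the example.

```python
import math

# WaveNet factorises the joint probability of a waveform x = (x_1, ..., x_T)
# into a product of conditionals, one per audio sample:
#     p(x) = prod_t p(x_t | x_1, ..., x_{t-1})
# In log space that product becomes a sum, which is what training maximises.
log_probs = [-2.1, -0.3, -1.7]        # toy per-sample log-likelihoods
log_p_x = sum(log_probs)              # log p(x) for this toy 3-sample waveform

# At 24,000 samples per second, generating audio one sample at a time means
# 24,000 sequential network evaluations per second of speech:
steps_for_ten_seconds = 24_000 * 10   # 240,000 forward passes
```

This sequential dependency, where every sample conditions on all previous ones, is what makes generation slow but also what lets the model capture fine-grained structure like breaths and lip noise.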

It is a fully convolutional neural network, and the different dilation factors used in the convolutional layers allow the receptive field to grow exponentially with depth and cover many timesteps. The input sequences used during training are real waveforms recorded from human speakers. After training, we can sample from the network to produce synthetic speech. At each step of the sampling process, a value is drawn from the probability distribution computed by the network. This value is then fed back into the input, and a new prediction is made for the following step. Building up samples one at a time like this is computationally expensive, but the researchers found it essential for producing complex, realistic-sounding audio.
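A minimal numpy sketch of these two ideas, assuming a 2-tap filter per layer and 256 quantised amplitude levels (random logits stand in for a trained network; all names are illustrative, not WaveNet's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution with a 2-tap filter: the output at time t
    depends only on x[t] and x[t - dilation], never on the future."""
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[:-dilation] + w[1] * padded[dilation:]

# Dilations 1, 2, 4, ..., 512: each 2-tap layer widens the receptive
# field by its dilation, so depth buys exponentially more context.
dilations = [2 ** i for i in range(10)]
receptive_field = 1 + sum(dilations)      # 1 + 1023 = 1024 past samples

# Toy autoregressive sampling loop: at each step, draw the next sample
# from a softmax over 256 quantised amplitude levels, then feed it back.
generated = []
for _ in range(5):
    logits = rng.normal(size=256)         # a trained network would compute these
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    generated.append(int(rng.choice(256, p=probs)))
```

With ten layers and dilations doubling from 1 to 512, the receptive field reaches 1,024 samples, which is why stacking dilated layers is far cheaper than widening ordinary convolutions to cover the same context.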

Uses of WaveNet

1. WaveNets reduce the gap between the state of the art and human-level performance by more than 50% for both US English and Mandarin Chinese.

For example, listen to the audio clips below:

US English

Parametric – Link

WaveNet – Link

Mandarin Chinese

Parametric – Link

WaveNet – Link

2. We can use WaveNet to say the same thing in many voices by changing the speaker's identity.

For example:

Audio clip 1 – Link

Audio clip 2 – Link

Audio Clip 3 – Link

3. We can also generate music with WaveNet. The clips below were produced by WaveNet trained on a dataset of classical piano music.

Music Clip 1 – Link

Music Clip 2 – Link

Music Clip 3 – Link

Also read – AlphaFold: Accurate protein structure prediction

 
