Fork me on GitHub

⏩ ForwardTacotron

Inspired by Microsoft’s FastSpeech we modified Tacotron to generate speech in a single forward pass using a duration predictor to align text and generated mel spectrograms.

ForwardTacotron + MelGAN Vocoder

The samples are generated with a model trained 400K steps on LJSpeech together with the pretrained MelGAN vocoder provided by the MelGAN repo.

Scientists at the CERN laboratory say they have discovered a new particle.

normal speed faster (1.25) slower (0.85)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

ForwardTacotron + WaveRNN Vocoder

The samples are generated with a model trained 100K steps on LJSpeech together with the pretrained WaveRNN vocoder provided by the WaveRNN repo.

Scientists at the CERN laboratory say they have discovered a new particle.

normal speed faster (1.25) slower (0.8)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

ForwardTacotron + Griffin-Lim

The Senate's bill to repeal and replace the Affordable Care-Act is now imperiled.

normal speed faster (1.4) slower (0.6)

Generative adversarial network or variational auto-encoder.

Basilar membrane and otolaryngology are not auto-correlations.

 

Synthetic speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.