⏩ ForwardTacotron

Inspired by Microsoft’s FastSpeech, we modified Tacotron to generate speech in a single forward pass, using a duration predictor to align the text with the generated mel spectrograms.
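To make the role of the duration predictor concrete, here is a minimal sketch of duration-based length regulation in plain Python. It is an illustration under stated assumptions, not the project's actual code: the function names `scale_durations` and `length_regulate` are hypothetical, the per-symbol vectors stand in for encoder outputs, and dividing the predicted durations by a speed factor is an assumed way to obtain the faster and slower variants of the samples.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[float], speed: float = 1.0) -> List[int]:
    """Turn predicted per-symbol durations into whole frame counts.

    Assumption: a speed factor above 1.0 shortens every duration
    (faster speech) and a factor below 1.0 lengthens it (slower speech).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(0, int(round(d / speed))) for d in durations]


def length_regulate(
    encoded: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each per-symbol vector durations[i] times along the time axis.

    The result has one vector per output mel frame, so a decoder can
    produce the whole spectrogram in a single forward pass.
    """
    if len(encoded) != len(durations):
        raise ValueError("need exactly one duration per input symbol")
    frames: List[List[float]] = []
    for vector, n_frames in zip(encoded, durations):
        # Copy the vector for every frame so the frames stay independent.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Hypothetical example: three input symbols with 2-dimensional encodings.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted = [4.0, 2.0, 6.0]  # predicted frames per symbol

normal = length_regulate(encoded, scale_durations(predicted, speed=1.0))
faster = length_regulate(encoded, scale_durations(predicted, speed=2.0))
```

In this sketch the speed factor only rescales the durations before expansion, so the same encoded text yields a shorter or longer frame sequence without any other change to the input.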

NEW (14.05.2021): Forward Tacotron V2 (Energy + Pitch) + HiFiGAN Vocoder

The samples were generated with a model trained for 80K steps on LJSpeech, combined with the pretrained HiFiGAN vocoder provided by the HiFiGAN repo.

Scientists at the CERN laboratory say they have discovered a new particle.

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

In a statement announcing his resignation, Mr Ross, said: "While the intentions may have been well meaning, the reaction to this news shows that Mr Cummings interpretation of the government advice was not shared by the vast majority of people who have done as the government asked."

Forward Tacotron + MelGAN Vocoder

The samples were generated with a model trained for 400K steps on LJSpeech, combined with the pretrained MelGAN vocoder provided by the MelGAN repo.

Scientists at the CERN laboratory say they have discovered a new particle.

normal speed	faster (1.25)	slower (0.85)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

Forward Tacotron + WaveRNN Vocoder

The samples were generated with a model trained for 100K steps on LJSpeech, combined with the pretrained WaveRNN vocoder provided by the WaveRNN repo.

Scientists at the CERN laboratory say they have discovered a new particle.

normal speed	faster (1.25)	slower (0.8)

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

Forward Tacotron + Griffin-Lim

The Senate's bill to repeal and replace the Affordable Care-Act is now imperiled.

normal speed	faster (1.4)	slower (0.6)

Generative adversarial network or variational auto-encoder.

Basilar membrane and otolaryngology are not auto-correlations.

Synthetic speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.