Text

class LanguageTokenizer

Simple tokenizer for language to index mapping.

__init__

def __init__(languages)

Initializes a language tokenizer for a list of languages.

Args
  • languages (List[str]): List of languages, e.g. ['de', 'en'].

__call__

def __call__(lang)

Maps the language to an index.

Args
  • lang (str): Language to be mapped, e.g. 'de'.
Returns
  • int: Index of language.

decode

def decode(index)

Inverts the index mapping of a language.

Args
  • index (int): Index of language.
Returns
  • str: Language for the given index.
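The mapping and its inverse can be sketched with a pair of dictionaries. This is an illustrative sketch of the documented behavior, not the library's actual implementation:

```python
# Sketch of a language-to-index tokenizer (illustrative only; the real
# class may differ in details such as error handling).
class LanguageTokenizer:
    def __init__(self, languages):
        # One stable index per language, in the order given.
        self.lang_index = {lang: i for i, lang in enumerate(languages)}
        self.index_lang = {i: lang for lang, i in self.lang_index.items()}

    def __call__(self, lang):
        # Maps the language to an index.
        return self.lang_index[lang]

    def decode(self, index):
        # Inverts the index mapping.
        return self.index_lang[index]

tok = LanguageTokenizer(['de', 'en'])
print(tok('en'))      # 1
print(tok.decode(0))  # de
```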

class SequenceTokenizer

Tokenizes text and optionally prepends a language-specific start index (and appends a language-independent end index).

__init__

def __init__(symbols, languages, char_repeats, lowercase, append_start_end, pad_token, end_token)

Initializes a SequenceTokenizer object.

Args
  • symbols (List[str]): Character (or phoneme) symbols.

  • languages (List[str]): List of languages.

  • char_repeats (int): Number of times each character is repeated, allowing the forward model to map to longer phoneme sequences. Example: with char_repeats=2, the input 'ab' is expanded to 'aabb'.

  • lowercase (bool): Whether to lowercase the input word.

  • append_start_end (bool): Whether to add special start and end tokens; the start token index depends on the chosen language.

  • pad_token (str): Special pad token for index 0.

  • end_token (str): Special end of sequence token.

__call__

def __call__(sentence, language)

Maps a sequence of symbols for a language to a sequence of indices.

Args
  • sentence (Iterable[str]): Sentence (or word) as a sequence of symbols.

  • language (str): Language for the mapping that defines the start and end token indices.

Returns
  • List[int]: Sequence of token indices.

decode

def decode(sequence, remove_special_tokens)

Maps a sequence of indices to a sequence of symbols.

Args
  • sequence (Iterable[int]): Encoded sequence to be decoded.

  • remove_special_tokens (bool): Whether to remove special tokens such as pad or start and end tokens. (Default value = False)

Returns
  • List[str]: Decoded sequence of symbols.
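The encode/decode round trip can be sketched as below. The vocabulary layout (pad at index 0, then one start token per language, then the symbols, then the end token) and the special-token names are assumptions for illustration; the real class may order and name these differently:

```python
# Illustrative sketch of a sequence tokenizer, not the actual implementation.
class SequenceTokenizer:
    def __init__(self, symbols, languages, char_repeats=1, lowercase=True,
                 append_start_end=True, pad_token='_', end_token='<end>'):
        self.char_repeats = char_repeats
        self.lowercase = lowercase
        self.append_start_end = append_start_end
        # Assumed vocabulary layout: pad, per-language start tokens,
        # symbols, end token.
        start_tokens = [f'<{lang}>' for lang in languages]
        tokens = [pad_token] + start_tokens + list(symbols) + [end_token]
        self.token_to_idx = {t: i for i, t in enumerate(tokens)}
        self.idx_to_token = {i: t for t, i in self.token_to_idx.items()}
        self.end_index = self.token_to_idx[end_token]
        self.special = {pad_token, end_token, *start_tokens}

    def __call__(self, sentence, language):
        if self.lowercase:
            sentence = [s.lower() for s in sentence]
        # Repeat each symbol char_repeats times.
        seq = [self.token_to_idx[s]
               for s in sentence for _ in range(self.char_repeats)]
        if self.append_start_end:
            seq = ([self.token_to_idx[f'<{language}>']] + seq
                   + [self.end_index])
        return seq

    def decode(self, sequence, remove_special_tokens=False):
        decoded = [self.idx_to_token[i] for i in sequence]
        if remove_special_tokens:
            decoded = [t for t in decoded if t not in self.special]
        return decoded

tok = SequenceTokenizer(['a', 'b'], ['de', 'en'], char_repeats=2)
encoded = tok(['a', 'b'], 'en')
print(encoded)                                      # [2, 3, 3, 4, 4, 5]
print(tok.decode(encoded, remove_special_tokens=True))  # ['a', 'a', 'b', 'b']
```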

class Preprocessor

Preprocesses data for a phonemizer training session.

__init__

def __init__(lang_tokenizer, text_tokenizer, phoneme_tokenizer)

Initializes a preprocessor object.

Args
  • lang_tokenizer (LanguageTokenizer): Tokenizer for input language.

  • text_tokenizer (SequenceTokenizer): Tokenizer for input text.

  • phoneme_tokenizer (SequenceTokenizer): Tokenizer for output phonemes.

__call__

def __call__(item)

Preprocesses a data point.

Args
  • item (Tuple): Data point consisting of (language, input text, output phonemes).

from_config

def from_config(cls, config)

Initializes a preprocessor from a config.

Args
  • config (Dict[str, Any]): Dictionary containing preprocessing hyperparams.
Returns
  • Preprocessor: Preprocessor object.
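How the three tokenizers compose on a data point can be sketched as follows; the dict- and ord-based stand-in tokenizers replace the real tokenizer classes purely for illustration:

```python
# Sketch of the preprocessor's composition logic (illustrative only).
class Preprocessor:
    def __init__(self, lang_tokenizer, text_tokenizer, phoneme_tokenizer):
        self.lang_tokenizer = lang_tokenizer
        self.text_tokenizer = text_tokenizer
        self.phoneme_tokenizer = phoneme_tokenizer

    def __call__(self, item):
        # A data point is (language, input text, output phonemes).
        lang, text, phonemes = item
        return (self.lang_tokenizer(lang),
                self.text_tokenizer(text, lang),
                self.phoneme_tokenizer(phonemes, lang))

# Stand-in tokenizers; the real ones are LanguageTokenizer and
# SequenceTokenizer instances.
lang_tokenizer = {'de': 0, 'en': 1}.__getitem__
text_tokenizer = lambda sentence, lang: [ord(c) for c in sentence]
phoneme_tokenizer = lambda phonemes, lang: [{'a': 1, 'b': 2}[p] for p in phonemes]

pre = Preprocessor(lang_tokenizer, text_tokenizer, phoneme_tokenizer)
lang_idx, text_idx, phon_idx = pre(('en', 'ab', ['a', 'b']))
print(lang_idx, text_idx, phon_idx)  # 1 [97, 98] [1, 2]
```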