Text
class LanguageTokenizer
Simple tokenizer for language-to-index mapping.
__init__
def __init__(languages)
Initializes a language tokenizer for a list of languages.
Args
- languages (List[str]): List of languages, e.g. ['de', 'en'].
__call__
def __call__(lang)
Maps the language to an index.
Args
- lang (str): Language to be mapped, e.g. 'de'.
Returns
- int: Index of language.
decode
def decode(index)
Inverts the index mapping of a language.
Args
- index (int): Index of language.
Returns
- str: Language for the given index.
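The class is a bidirectional lookup between language codes and indices. A minimal sketch of the documented behavior (not the library's actual implementation; the assumption here is that indices follow the order of the given language list):

```python
from typing import List

class LanguageTokenizer:
    """Sketch: bidirectional language <-> index mapping."""

    def __init__(self, languages: List[str]) -> None:
        # Assumption: indices follow the order of the given language list.
        self.lang_to_index = {lang: i for i, lang in enumerate(languages)}
        self.index_to_lang = {i: lang for lang, i in self.lang_to_index.items()}

    def __call__(self, lang: str) -> int:
        return self.lang_to_index[lang]

    def decode(self, index: int) -> str:
        return self.index_to_lang[index]

tokenizer = LanguageTokenizer(['de', 'en'])
assert tokenizer('de') == 0
assert tokenizer.decode(1) == 'en'
```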
class SequenceTokenizer
Tokenizes text and optionally attaches a language-specific start index (and a non-specific end index).
__init__
def __init__(symbols, languages, char_repeats, lowercase, append_start_end, pad_token, end_token)
Initializes a SequenceTokenizer object.
Args
- symbols (List[str]): Character (or phoneme) symbols.
- languages (List[str]): List of languages.
- char_repeats (int): Number of repeats for each character to allow the forward model to map to longer phoneme sequences. Example: for char_repeats=2 the tokenizer maps 'hi' to 'hhii'.
- lowercase (bool): Whether to lowercase the input word.
- append_start_end (bool): Whether to append special start and end tokens. Start and end tokens are index mappings of the chosen language.
- pad_token (str): Special pad token for index 0.
- end_token (str): Special end-of-sequence token.
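A construction sketch using the documented signature. The import path, the symbol set, and the pad/end token strings are assumptions; adjust them to your setup:

```python
# Assumption: the tokenizer classes are importable from dp.preprocessing.text.
from dp.preprocessing.text import SequenceTokenizer

text_tokenizer = SequenceTokenizer(
    symbols=list('abcdefghijklmnopqrstuvwxyz'),  # hypothetical input alphabet
    languages=['de', 'en'],
    char_repeats=2,          # e.g. 'hi' is expanded to 'hhii' before indexing
    lowercase=True,
    append_start_end=True,   # wrap sequences in the language start index and the end index
    pad_token='_',
    end_token='<end>',
)
```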
__call__
def __call__(sentence, language)
Maps a sequence of symbols for a language to a sequence of indices.
Args
- sentence (Iterable[str]): Sentence (or word) as a sequence of symbols.
- language (str): Language for the mapping that defines the start and end token indices.
Returns
- List[int]: Sequence of token indices.
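Encoding then looks like this, continuing the sketch above (the input word is hypothetical):

```python
tokens = text_tokenizer('Hallo', language='de')
# With lowercase=True and char_repeats=2, 'Hallo' is expanded to 'hhaalllloo'
# before indexing. With append_start_end=True, the first index is the
# 'de'-specific start token and the last index is the shared end token.
print(tokens)  # List[int], here of length 2 * len('hallo') + 2
```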
decode
def decode(sequence, remove_special_tokens)
Maps a sequence of indices to a sequence of symbols.
Args
- sequence (Iterable[int]): Encoded sequence to be decoded.
- remove_special_tokens (bool): Whether to remove special tokens such as pad or start and end tokens. (Default: False)
Returns
- List[str]: Decoded sequence of symbols.
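Decoding inverts the mapping, continuing the sketch above:

```python
symbols = text_tokenizer.decode(tokens, remove_special_tokens=True)
# remove_special_tokens=True drops pad, start, and end tokens,
# so only the symbol tokens remain.
print(symbols)  # List[str]
```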
class Preprocessor
Preprocesses data for a phonemizer training session.
__init__
def __init__(lang_tokenizer, text_tokenizer, phoneme_tokenizer)
Initializes a preprocessor object.
Args
- lang_tokenizer (LanguageTokenizer): Tokenizer for the input language.
- text_tokenizer (SequenceTokenizer): Tokenizer for the input text.
- phoneme_tokenizer (SequenceTokenizer): Tokenizer for the output phonemes.
__call__
def __call__(item)
Preprocesses a data point.
Args
- item (Tuple): Data point comprised of (language, input text, output phonemes).
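A sketch wiring the three tokenizers together and preprocessing one data point, reusing the text_tokenizer constructed earlier; the phoneme inventory and the example word/phoneme pair are hypothetical:

```python
# Assumption: the classes are importable from dp.preprocessing.text.
from dp.preprocessing.text import LanguageTokenizer, Preprocessor, SequenceTokenizer

lang_tokenizer = LanguageTokenizer(['de', 'en'])
phoneme_tokenizer = SequenceTokenizer(
    symbols=['a', 'h', 'l', 'o'],  # hypothetical phoneme set
    languages=['de', 'en'],
    char_repeats=1,                # assumption: no repeats needed on the output side
    lowercase=False,
    append_start_end=True,
    pad_token='_',
    end_token='<end>',
)
preprocessor = Preprocessor(lang_tokenizer, text_tokenizer, phoneme_tokenizer)

# Data point: (language, input text, output phonemes).
processed = preprocessor(('de', 'hallo', ['h', 'a', 'l', 'o']))
```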
from_config
def from_config(cls, config)
Initializes a preprocessor from a config.
Args
- config (Dict[str, Any]): Dictionary containing preprocessing hyperparams.
Returns
- Preprocessor: Preprocessor object.
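from_config builds the same three tokenizers from a config dictionary. The exact keys are not documented here, so the layout below is purely an assumption that mirrors the constructor arguments above:

```python
# Assumption: hyperparams live under a 'preprocessing' key with names
# mirroring the tokenizer constructor arguments.
config = {
    'preprocessing': {
        'languages': ['de', 'en'],
        'text_symbols': list('abcdefghijklmnopqrstuvwxyz'),
        'phoneme_symbols': ['a', 'h', 'l', 'o'],
        'char_repeats': 2,
        'lowercase': True,
    }
}
preprocessor = Preprocessor.from_config(config)
```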