Preprocessor
class Preprocessor
__init__
def __init__(start_token, end_token, punctuation_pattern, filter_pattern, add_input_start_end, lower_case, hash_numbers)
Initializes the preprocessor.
Args
- start_token: Unique start token to be inserted at the beginning of the target text.
- end_token: Unique end token to be appended to the end of the target text.
- punctuation_pattern: Regex pattern for punctuation that is split off from the tokens.
- filter_pattern: Regex pattern for characters to be removed from the text.
- add_input_start_end: Whether to add the start and end tokens to the input sequence.
- lower_case: Whether to convert the text to lower case.
- hash_numbers: Whether to replace numbers with a #.
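To make the effect of the cleaning arguments concrete, here is a minimal sketch of how they might act on a single string. It is not the library's implementation: the function name `clean_text`, the example patterns, and the order of the steps are all assumptions chosen for illustration.

```python
import re


def clean_text(text, punctuation_pattern, filter_pattern,
               lower_case=True, hash_numbers=True):
    """Hypothetical helper mirroring the documented arguments."""
    if lower_case:
        text = text.lower()
    # Remove every character matched by the filter pattern.
    text = re.sub(filter_pattern, "", text)
    if hash_numbers:
        # Assumption: each run of digits becomes a single '#'.
        text = re.sub(r"\d+", "#", text)
    # Surround punctuation with spaces so each mark becomes its own token.
    text = re.sub(punctuation_pattern, lambda m: f" {m.group(0)} ", text)
    # Collapse the extra whitespace introduced above.
    return " ".join(text.split())


# Example patterns are assumptions, not documented defaults.
example = clean_text(
    "Hello, *World*! It costs 42 dollars.",
    punctuation_pattern=r"[,.!?]",
    filter_pattern=r"[*_~]",
)
print(example)  # hello , world ! it costs # dollars .
```

Splitting punctuation into separate tokens keeps a downstream whitespace tokenizer from treating "world!" and "world" as different vocabulary entries.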
__call__
def __call__(data)
Cleans the text using the configured regex patterns and attaches the start and end tokens to it.
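The call-time behaviour can be sketched with a small stand-in class. This is an illustration under stated assumptions, not the actual `Preprocessor`: the class name `SketchPreprocessor`, the default argument values, and the assumption that `data` is an `(input_text, target_text)` pair of strings are not confirmed by the documentation above.

```python
import re
from typing import Tuple


class SketchPreprocessor:
    """Illustrative stand-in; defaults below are hypothetical."""

    def __init__(self, start_token="<start>", end_token="<end>",
                 punctuation_pattern=r"[,.!?]", filter_pattern=r"[*_~]",
                 add_input_start_end=True, lower_case=True,
                 hash_numbers=True):
        self.start_token = start_token
        self.end_token = end_token
        self.punctuation_pattern = re.compile(punctuation_pattern)
        self.filter_pattern = re.compile(filter_pattern)
        self.add_input_start_end = add_input_start_end
        self.lower_case = lower_case
        self.hash_numbers = hash_numbers

    def _clean(self, text: str) -> str:
        if self.lower_case:
            text = text.lower()
        text = self.filter_pattern.sub("", text)
        if self.hash_numbers:
            # Assumption: a run of digits is replaced by one '#'.
            text = re.sub(r"\d+", "#", text)
        text = self.punctuation_pattern.sub(
            lambda m: f" {m.group(0)} ", text)
        return " ".join(text.split())

    def _wrap(self, text: str) -> str:
        return f"{self.start_token} {text} {self.end_token}"

    def __call__(self, data: Tuple[str, str]) -> Tuple[str, str]:
        # Assumption: data is an (input_text, target_text) pair.
        input_text, target_text = data
        input_text = self._clean(input_text)
        target_text = self._clean(target_text)
        if self.add_input_start_end:
            input_text = self._wrap(input_text)
        # The target always receives the start and end tokens.
        return input_text, self._wrap(target_text)


preprocessor = SketchPreprocessor(add_input_start_end=False)
result = preprocessor(("Stocks fell 3 points today!", "Stocks Fall"))
print(result)
# ('stocks fell # points today !', '<start> stocks fall <end>')
```

In this sketch the tokens are attached after cleaning, so a filter pattern that removes characters such as `<` or `>` cannot damage them.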