Preprocessor

class Preprocessor

__init__

def __init__(start_token, end_token, punctuation_pattern, filter_pattern, add_input_start_end, lower_case, hash_numbers)

Initializes the preprocessor.

Args
  • start_token: Unique start token to be inserted at the beginning of the target text.

  • end_token: Unique end token to be appended to the end of the target text.

  • punctuation_pattern: Regex pattern for punctuation that is split from the tokens.

  • filter_pattern: Regex pattern for characters to be removed from the text.

  • add_input_start_end: Whether to add the start and end tokens to the input sequence.

  • lower_case: Whether to perform lower casing.

  • hash_numbers: Whether to replace numbers with a #.
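The three text-rewriting options above can be illustrated in isolation with Python's standard re module. This is only a sketch of the described behavior, not the library's code: the pattern values and the helper names split_punctuation, remove_filtered, and hash_digits are assumptions chosen for the example, and treating each run of digits as one number is likewise an assumption.

```python
import re

# Hypothetical pattern values, chosen only for illustration.
PUNCTUATION_PATTERN = r"([!.?,])"
FILTER_PATTERN = r"[<>()\[\]]"


def split_punctuation(text: str) -> str:
    """Surround each punctuation match with spaces so it becomes its own token."""
    spaced = re.sub(PUNCTUATION_PATTERN, r" \1 ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(spaced.split())


def remove_filtered(text: str) -> str:
    """Delete every character matched by the filter pattern."""
    return re.sub(FILTER_PATTERN, "", text)


def hash_digits(text: str) -> str:
    """Replace each run of digits with a single '#' (assumed granularity)."""
    return re.sub(r"\d+", "#", text)


print(split_punctuation("Hello, world!"))
print(remove_filtered("a (b) [c]"))
print(hash_digits("call 911 now 24/7"))
```

Note that the punctuation pattern in this sketch uses a capture group so the matched character can be re-inserted with surrounding spaces; whether the real class expects a group in punctuation_pattern is not stated here.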

__call__

def __call__(data)

Applies the regex-based cleaning steps to the text and attaches the start and end tokens.
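As a rough sketch of how such a callable could fit together, the stand-in class below mirrors the documented constructor arguments. Several things are assumptions rather than facts about the real class: the name SketchPreprocessor, the default values, the idea that data is an (input text, target text) pair, the order of the cleaning steps, and the _clean helper.

```python
import re
from typing import Tuple


class SketchPreprocessor:
    """Illustrative stand-in; not the library's Preprocessor implementation."""

    def __init__(self,
                 start_token: str = "<start>",
                 end_token: str = "<end>",
                 punctuation_pattern: str = r"([!.?,])",
                 filter_pattern: str = r"[<>()\[\]]",
                 add_input_start_end: bool = True,
                 lower_case: bool = True,
                 hash_numbers: bool = True) -> None:
        self.start_token = start_token
        self.end_token = end_token
        self.punctuation_pattern = re.compile(punctuation_pattern)
        self.filter_pattern = re.compile(filter_pattern)
        self.add_input_start_end = add_input_start_end
        self.lower_case = lower_case
        self.hash_numbers = hash_numbers

    def __call__(self, data: Tuple[str, str]) -> Tuple[str, str]:
        # Assumption: data is an (input text, target text) pair.
        text, target = self._clean(data[0]), self._clean(data[1])
        if self.add_input_start_end:
            text = f"{self.start_token} {text} {self.end_token}"
        # The target text always receives the start and end tokens.
        target = f"{self.start_token} {target} {self.end_token}"
        return text, target

    def _clean(self, text: str) -> str:
        # Hypothetical helper; the step order is an assumption.
        if self.lower_case:
            text = text.lower()
        text = self.filter_pattern.sub("", text)
        if self.hash_numbers:
            text = re.sub(r"\d+", "#", text)
        text = self.punctuation_pattern.sub(r" \1 ", text)
        return " ".join(text.split())


preprocess = SketchPreprocessor()
print(preprocess(("Hello, World (2020)!", "A Greeting.")))
```

Cleaning runs before the tokens are attached, so a filter pattern that strips angle brackets does not damage tokens such as <start> in this sketch.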