bardi.nlp_engineering package

Steps

bardi normalizer module

Clean text with custom sets of regular expressions

class bardi.nlp_engineering.normalizer.CPUNormalizer(*args, **kwargs)[source]

Bases: Normalizer

Normalizer class for cleaning and standardizing text input using regular expression substitutions.

Note

This implementation of the Normalizer is specific for CPU computation.

fields

The name of the column(s) containing text to be normalized.

Type:

Union[str, List[str]]

regex_set

A list of dictionaries with keys ‘regex_str’ and ‘sub_str’, used to perform regular expression substitutions on the text.

Type:

List[RegexSubPair]

lowercase

If True, lowercasing will be applied during normalization. Default is True.

Type:

Optional[bool]

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: normalizer__<field>

Type:

Optional[bool]

run(data: Table, artifacts: dict | None = None) Tuple[Table, dict][source]

Run the CPU-based normalizer method based on the configuration used to create the object of the CPUNormalizer class.

Parameters:
  • data (pyarrow.Table) – A pyarrow Table containing at least one text column of type string or large_string.

  • artifacts (Optional[dict]) – Artifacts are not used in this run method but must be received to operate correctly in the pipeline run method.

Returns:

A tuple containing the pyarrow Table of cleaned data and an empty dictionary.

Return type:

Tuple[pyarrow.Table, dict]
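
For illustration, a minimal usage sketch based on the documented signature; the column name, sample text, and regex pair are illustrative, not part of the library:

import pyarrow as pa
from bardi.nlp_engineering.normalizer import CPUNormalizer

# illustrative input table with a single text column
data = pa.Table.from_pydict({"text": ["INVASIVE:\nNegative  IN SITU:\nN/A"]})

normalizer = CPUNormalizer(
    fields="text",
    regex_set=[{"regex_str": r"\s+", "sub_str": " "}],  # illustrative: collapse whitespace
    lowercase=True,
)
clean_data, artifacts = normalizer.run(data)  # artifacts is an empty dict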

class bardi.nlp_engineering.normalizer.Normalizer(fields: str | List[str], regex_set: List[RegexSubPair], lowercase: bool = True, retain_input_fields: bool = False)[source]

Bases: Step

Normalizer cleans and standardizes text input using regular expression substitutions. Lowercasing is also applied if desired.

Note

Avoid the direct instantiation of the Normalizer class and instead instantiate one of the child classes depending on hardware configuration.

fields

The field or fields to be normalized.

Type:

Union[str, List[str]]

regex_set

List of regex substitutions to be applied.

Type:

List[RegexSubPair]

lowercase

If True, lowercasing will be applied during normalization, defaults to True.

Type:

Optional[bool]

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: normalizer__<field>

Type:

Optional[bool]

abstract run()[source]

Abstract method

bardi pre_tokenizer module

Split text columns into lists of tokens using simple patterns

class bardi.nlp_engineering.pre_tokenizer.CPUPreTokenizer(*args, **kwargs)[source]

Bases: PreTokenizer

The pre-tokenizer breaks down text into smaller units before further tokenization is applied.

Note

This implementation of the PreTokenizer is specific for CPU computation.

fields

The name of the column(s) containing text.

Type:

Union[str, List[str]]

split_pattern

A specific pattern of characters used to divide a string into smaller segments or tokens. By default, the split is done on a single space character.

Type:

str

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: pretokenizer__<field>

Type:

Optional[bool]

run(data: Table, artifacts: dict | None = None) Tuple[Table, dict | None][source]

Runs a CPU-based pre-tokenizer method based on the configuration used to create the object of the CPUPreTokenizer class.

Parameters:
  • data (pyarrow.Table) – A pyarrow Table containing at least one text column of type string or large_string.

  • artifacts (Optional[dict]) – Artifacts are not used in this run method but must be received to operate correctly in the pipeline run method.

Returns:

The first position is a pyarrow.Table of pre-tokenized data. The second position is a dictionary of artifacts. No artifacts are produced in this run method, so the second position will return None.

Return type:

Tuple[pa.Table, Union[dict, None]]
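
A minimal usage sketch based on the documented signature; the column name and sample text are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.pre_tokenizer import CPUPreTokenizer

data = pa.Table.from_pydict({"text": ["invasive negative in situ n/a"]})

pre_tokenizer = CPUPreTokenizer(fields="text", split_pattern=" ")
tokenized_data, artifacts = pre_tokenizer.run(data)  # artifacts is None
# tokenized_data["text"] now holds lists of tokens, e.g. ["invasive", "negative", ...]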

class bardi.nlp_engineering.pre_tokenizer.PreTokenizer(fields: str | List[str], split_pattern: str = ' ', retain_input_fields: bool = False)[source]

Bases: Step

The pre-tokenizer breaks down text into smaller units before further tokenization is applied.

Note

Avoid the direct instantiation of the PreTokenizer class and instead instantiate one of the child classes depending on hardware configuration.

fields

The name of the column(s) containing text.

Type:

Union[str, List[str]]

split_pattern

A specific pattern of characters used to divide a string into smaller segments or tokens. By default, the split is done on a single space character.

Type:

str

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: pretokenizer__<field>

Type:

Optional[bool]

abstract run()[source]

Abstract method

bardi tokenizer_trainer module

Train a tokenizer for transformer-based models

class bardi.nlp_engineering.tokenizer_trainer.CPUTokenizerTrainer(*args, **kwargs)[source]

Bases: TokenizerTrainer

TokenizerTrainer implementation specific for CPU computation.

Note

This implementation of the TokenizerTrainer is specific for CPU computation.

field

the name of the column containing text

Type:

str

tokenizer_type

The type of tokenizer to train from scratch. Currently supported: WordPiece, BPE, Unigram, WordLevel.

Type:

str

vocab_size

number of tokens in a trained tokenizer

Type:

int

hf_cache_dir

path to a folder where hf tokenizers are stored

Type:

str

from_old_flag

if True, use pre-trained tokenizer as a template.

Type:

bool

checkpoint_path

path to pretrained tokenizer model.

Type:

str

tokenizer_fname

name for the file or folder where the trained tokenizer will be stored

Type:

str

corpus_gen_batch_size

batch size for generating the tokenizer training data corpus. Defaults to 1000.

Type:

int

get_parameters() dict[source]

Retrieve the tokenizer trainer object configuration

Returns:

a dictionary representation of the tokenizer trainer object’s attributes

Return type:

dict

run(data: Table, artifacts: dict) Tuple[Table, dict][source]

Runs the tokenizer trainer based on the configuration used to create the object of the CPUTokenizerTrainer class

Parameters:
  • data (pyarrow.Table) – a pyarrow Table containing at least one list column containing text

  • artifacts (dict) – artifacts are not consumed in this run method, but must be received to operate correctly in the pipeline run method

Returns:

The first position is the pyarrow.Table of data. The second position is a dictionary of artifacts produced by training the tokenizer.

Return type:

Tuple[pyarrow.Table, dict]

write_artifacts(write_path: str, artifacts: dict | None) None[source]

Write the artifacts produced by the tokenizer trainer

Parameters:
  • write_path (str) – Path is a directory where files will be written

  • artifacts (Union[dict, None]) – Artifacts is a dictionary of artifacts produced in this step (the trained tokenizer and its vocab)
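
A hedged usage sketch based on the documented signature; the field name, tokenizer type, and vocab size are illustrative choices:

import pyarrow as pa
from bardi.nlp_engineering.tokenizer_trainer import CPUTokenizerTrainer

data = pa.Table.from_pydict({"text": ["pre-cleaned report text", "more text"]})

trainer = CPUTokenizerTrainer(
    fields="text",
    tokenizer_type="WordPiece",  # one of: WordPiece, BPE, Unigram, WordLevel
    vocab_size=1000,
)
data, artifacts = trainer.run(data, artifacts={})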

class bardi.nlp_engineering.tokenizer_trainer.TokenizerTrainer(fields: str | List[str], tokenizer_type: str = '', vocab_size: int = 1000, hf_cache_dir: str = '', from_old_flag: bool = False, checkpoint_path: str = None, tokenizer_fname: str = 'tokenizer', corpus_gen_batch_size: int = 1000, special_tokens: List[str] = None)[source]

Bases: Step

The TokenizerTrainer class provides the ability to train:

  1. A NEW TOKENIZER FROM AN OLD ONE

    Train a new tokenizer based on a provided tokenizer (from_old_flag). Provide a trained tokenizer associated with a given architecture (BERT, LLAMA) and train a new tokenizer from scratch that is configured to match the provided architecture.

  2. A NEW ARCHITECTURE-AGNOSTIC TOKENIZER

    Use one of the supported tokenizer algorithms to train a new tokenizer from scratch.

Note

Avoid the direct instantiation of the TokenizerTrainer class and instead instantiate one of the child classes depending on hardware configuration.

field

the name of the column containing text

Type:

str

tokenizer_type

The type of tokenizer to train from scratch. Currently supported: WordPiece, BPE, Unigram, WordLevel.

Type:

str

vocab_size

number of tokens in a trained tokenizer

Type:

int

hf_cache_dir

path to a folder where hf tokenizers are stored

Type:

str

from_old_flag

if True, use pre-trained tokenizer as a template.

Type:

bool

checkpoint_path

path to pretrained tokenizer model.

Type:

str

tokenizer_fname

name for the file or folder where the trained tokenizer will be stored

Type:

str

corpus_gen_batch_size

batch size for generating the tokenizer training data corpus. Defaults to 1000.

Type:

int

abstract run()[source]

Abstract method

set_write_config(data_config: DataWriteConfig = None, artifacts_config: TokenizerTrainerArtifactsWriteConfig = None)[source]

Overwrite the default file writing configurations

class bardi.nlp_engineering.tokenizer_trainer.TokenizerTrainerArtifactsWriteConfig[source]

Bases: TypedDict

Indicates the keys and data types expected in an artifacts write config dict for the tokenizer trainer if overwriting the default configuration.

vocab_format: str
vocab_format_args: dict

bardi embedding_generator module

Train a Word2Vec model and create a vocab and word embeddings

class bardi.nlp_engineering.embedding_generator.CPUEmbeddingGenerator(*args, **kwargs)[source]

Bases: EmbeddingGenerator

The embedding generator provides an interface to create word embeddings or vector representations of words (tokens).

The embedding generator uses the Word2Vec model from the Gensim library.

Note

This implementation of the EmbeddingGenerator is specific for CPU computation.

fields

The name of the column(s) containing text to be considered in the vocab and used in Word2Vec.

Type:

Union[str, List[str]]

load_saved_model

If True, use a pre-trained Word2Vec model.

Type:

bool

checkpoint_path

Path to the Word2Vec model checkpoint.

Type:

str

cores

Number of cores to run the Word2Vec model on.

Type:

int

min_word_count

Ignores all words with total frequency lower than this.

Type:

int

window

Maximum distance between the current and predicted word.

Type:

int

vector_size

Output embedding size.

Type:

int

sample

The threshold for configuring which high-frequency words are randomly downsampled; useful range is (0, 1e-5).

Type:

float

min_alpha

Learning rate will linearly drop to min_alpha as training progresses.

Type:

float

negative

If > 0, negative sampling will be used.

Type:

int

epochs

Total number of iterations of all training data in the training of the Word2Vec model.

Type:

int

seed

Seed for the random number generator. For a fully deterministic run, you need to limit training to a single worker thread (cores = 1) and set the PYTHONHASHSEED environment variable.

Type:

int

vocab_exclude_list

Provide a list of tokens that may be present in the text that you would like to exclude from the vocab and from Word2Vec.

Type:

List[str]

get_parameters() dict[source]

Retrieve the embedding generator object configuration.

Returns:

A dictionary representation of the EmbeddingGenerator object’s attributes.

Return type:

dict

run(data: Table, artifacts: dict) Tuple[Table, dict][source]

Runs the CPU-based embedding generator based on the configuration used to create the object of the CPUEmbeddingGenerator class

Parameters:
  • data (pyarrow.Table) – A pyarrow Table containing at least one list column containing text.

  • artifacts (dict) – Artifacts are not consumed in this run method, but must be received in the method to operate correctly in the pipeline run method.

Returns:

The first position is a pyarrow.Table of pre-tokenized data. The second position is a dictionary of artifacts. The dict will contain keys for “embedding_matrix” and “id_to_token”.

Return type:

Tuple[pyarrow.Table, dict]

write_artifacts(write_path: str, artifacts: dict) None[source]

Write the artifacts produced by the embedding generator.

Parameters:
  • write_path (str) – Path is a directory where files will be written.

  • artifacts (dict) – Artifacts is a dictionary of artifacts produced in this step. Expected keys are: “id_to_token” and “embedding_matrix”.
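
A minimal usage sketch based on the documented signature; the column name and the reduced training parameters are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.embedding_generator import CPUEmbeddingGenerator

# the run method expects at least one list column of pre-tokenized text
data = pa.Table.from_pydict({"text": [["tumor", "size", "2", "cm"]]})

embedding_generator = CPUEmbeddingGenerator(
    fields="text",
    min_word_count=1,  # illustrative; the default is 10
    vector_size=50,    # illustrative; the default is 300
)
data, artifacts = embedding_generator.run(data, artifacts={})
# artifacts contains "embedding_matrix" and "id_to_token"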

class bardi.nlp_engineering.embedding_generator.EmbeddingGenerator(fields: str | List[str], load_saved_model: bool = False, checkpoint_path: str | None = None, cores: int = 4, min_word_count: int = 10, window: int = 5, vector_size: int = 300, sample: float = 6e-05, min_alpha: float = 0.007, negative: int = 20, epochs: int = 30, seed: int = 42, vocab_exclude_list: List[str] = [])[source]

Bases: Step

The embedding generator provides an interface to create word embeddings or vector representations of words (tokens).

The embedding generator uses the Word2Vec model from the Gensim library.

Note

Avoid the direct instantiation of the EmbeddingGenerator class and instead instantiate one of the child classes depending on hardware configuration.

fields

The name of the column(s) containing text from which to generate embeddings.

Type:

Union[str, List[str]]

load_saved_model

Whether to load a saved Word2Vec model or train a new one.

Type:

bool

checkpoint_path

Path to the saved model checkpoint if load_saved_model is True.

Type:

str

cores

Number of CPU cores to use for training.

Type:

int

min_word_count

Ignore all words with a total frequency lower than this.

Type:

int

window

Maximum distance between the current and predicted word within a sentence.

Type:

int

vector_size

Dimensionality of the word vectors.

Type:

int

sample

The threshold for configuring which higher-frequency words are randomly downsampled.

Type:

float

min_alpha

Learning rate will linearly drop to min_alpha as training progresses.

Type:

float

negative

If > 0, specifies how many “noise words” should be drawn.

Type:

int

epochs

Number of iterations (epochs) over the corpus.

Type:

int

seed

Seed for the random number generator.

Type:

int

vocab_exclude_list

List of words to force exclude from the vocabulary.

Type:

List[str]

abstract run()[source]

Abstract method

set_write_config(data_config: DataWriteConfig | None = None, artifacts_config: EmbeddingGeneratorArtifactsWriteConfig | None = None)[source]

Overwrite the default file writing configurations

class bardi.nlp_engineering.embedding_generator.EmbeddingGeneratorArtifactsWriteConfig[source]

Bases: TypedDict

Indicates the keys and data types expected in an artifacts write config dict for the embedding generator if overwriting the default configuration.

embedding_matrix_format: str
embedding_matrix_format_args: dict
vocab_format: str
vocab_format_args: dict

bardi vocab_encoder module

Apply a vocab mapping converting a list of tokens in a column into a list of integers

class bardi.nlp_engineering.vocab_encoder.CPUVocabEncoder(*args, **kwargs)[source]

Bases: VocabEncoder

The vocab encoder applies a vocab mapping, converting lists of tokens into lists of integers

Note

This implementation of the VocabEncoder is specific for CPU computation.

fields

the name of the column containing a list of tokens that will be mapped to integers using a vocab

Type:

Union[str, List[str]]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

str

id_to_token

optional vocabulary in the form of {id: token} that will be used to map the tokens to integers. This is optional for the construction of the object, and can alternatively be provided in the run method. This flexibility handles the use of a pre-existing vocab versus creating a vocab during a pipeline run.

Type:

dict

concat_fields

indicate if you would like for fields to be concatenated into a single column or left as separate columns

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: vocabencoder__<field>

Type:

Optional[bool]

get_parameters()[source]

Retrieve the vocab encoder object configuration. Does not return the mapping (vocab) as it can be large

Returns:

a dictionary representation of the vocab encoder’s attributes

Return type:

dict

run(data: Table, artifacts: dict = None, id_to_token: dict = None) Table[source]

Run a vocab encoder using CPU computation

The vocab encoder relies on receiving a vocab to map. The vocab can be supplied in multiple ways:

  • id_to_token at object creation

  • contained in the pipeline artifacts dictionary passed to the run method referenced by the key, ‘id_to_token’

  • id_to_token in the run method

Parameters:
  • data (PyArrow Table) – The data to be processed. The data must contain the column specified by field at object creation

  • artifacts (dict) – A dictionary of pipeline artifacts which contains a vocab referenced by the key, ‘id_to_token’

  • id_to_token (dict) – If a vocab wasn’t passed at object creation or through the pipeline artifacts dict, then it must be passed here as a final option

Returns:

A tuple with:
  • the first element holding a PyArrow Table of data processed with the vocab encoder

  • the second element of the tuple, intended for artifacts, being None

Return type:

Tuple[PyArrow Table, dict]

Raises:
  • AttributeError – The vocab (id_to_token) wasn’t supplied either at object creation or to the run method

  • TypeError – The run method was not supplied a PyArrow Table
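
A minimal usage sketch based on the documented signature, supplying the vocab via id_to_token at object creation; the column name and vocab are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.vocab_encoder import CPUVocabEncoder

data = pa.Table.from_pydict({"text": [["tumor", "size"]]})

vocab_encoder = CPUVocabEncoder(
    fields="text",
    id_to_token={0: "tumor", 1: "size"},  # illustrative {id: token} vocab
)
encoded_data, artifacts = vocab_encoder.run(data)
# encoded_data["text"] now holds lists of integer ids, e.g. [0, 1]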

class bardi.nlp_engineering.vocab_encoder.VocabEncoder(fields: str | List[str], field_rename: str = None, id_to_token: dict = None, concat_fields: bool = False, retain_input_fields: bool = False)[source]

Bases: Step

The vocab encoder applies a vocab mapping, converting lists of tokens into lists of integers

Note

Avoid the direct instantiation of the VocabEncoder class and instead instantiate one of the child classes depending on hardware configuration

fields

the name of the column containing a list of tokens that will be mapped to integers using a vocab

Type:

Union[str, List[str]]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

str

id_to_token

optional vocabulary in the form of {id: token} that will be used to map the tokens to integers. This is optional for the construction of the object, and can alternatively be provided in the run method. This flexibility handles the use of a pre-existing vocab versus creating a vocab during a pipeline run.

Type:

dict

concat_fields

indicate if you would like for fields to be concatenated into a single column or left as separate columns

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: vocabencoder__<field>

Type:

Optional[bool]

abstract run()[source]

Abstract method

bardi tokenizer_encoder module

Apply a tokenizer to the provided text fields

class bardi.nlp_engineering.tokenizer_encoder.CPUTokenizerEncoder(*args, **kwargs)[source]

Bases: TokenizerEncoder

Implementation of the TokenizerEncoder for CPU computation

Note

This implementation of the TokenizerEncoder is specific for CPU computation.

fields

the name of the column(s) containing text to which the tokenizer will be applied

Type:

Union[str, List[str]]

return_tensors

type of tensors to return, ‘np’ for Numpy arrays, ‘pt’ for PyTorch tensors or ‘tf’ for TensorFlow

Type:

str

concat_fields

whether the text fields should be concatenated into a single text field, defaults to False

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: CPUTokenizerEncoder_input__<field>

Type:

Optional[bool]

retain_concat_field

If True, will retain the concatenation of the fields specified in fields under the name specified in field_rename or ‘text’ if not specified. If concat_fields is not True, this parameter will have no effect

Type:

Optional[bool]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

Optional[str]

hf_cache_dir

local directory where the HF pretrained tokenizers are stored

Type:

Optional[str]

model_name

name of a tokenizer file or folder. Not required if TokenizerTrainer is a prior step - tokenizer will be passed in artifacts

Type:

Optional[str]

cores

number of CPU cores for multithreading the tokenizer

Type:

Optional[int]

tokenizer_params

provide fine-grained settings for applying tokenizer

Type:

Optional[TokenizerConfig]

tokenizer_model

Tokenizer object passed through artifacts from TokenizerTrainer or read from file specified in model_name

Type:

transformers.PreTrainedTokenizerBase

get_parameters()[source]

Retrieve the tokenizer encoder object configuration. Does not return the tokenizer model as it can be large

Returns:

a dictionary representation of the tokenizer encoder’s attributes

Return type:

dict

run(data: Table, artifacts: dict = None) Table[source]

Run a tokenizer encoder based on provided configuration

The tokenizer encoder relies on receiving a tokenizer to apply to the text. The tokenizer can be supplied in multiple ways:

  • referencing the model_name at object creation

  • contained in the pipeline artifacts dictionary passed to the run method referenced by the key, ‘tokenizer_model’

Parameters:
  • data (PyArrow Table) – The data to be processed. The data must contain the column specified by field at object creation

  • artifacts (dict) – A dictionary of pipeline artifacts which contains a tokenizer referenced by the key, ‘tokenizer_model’

Returns:

The first element holding a PyArrow Table of data processed with the tokenizer encoder. The second element of the tuple intended for artifacts is None.

Return type:

Tuple[PyArrow Table, dict]

Raises:
  • AttributeError – The tokenizer wasn’t supplied either at object creation or to the run method

  • TypeError – The run method was not supplied a PyArrow Table
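
A hedged usage sketch based on the documented signature; the column name and tokenizer file name are illustrative, and model_name can be omitted when a TokenizerTrainer step runs earlier in the pipeline:

import pyarrow as pa
from bardi.nlp_engineering.tokenizer_encoder import CPUTokenizerEncoder

data = pa.Table.from_pydict({"text": ["some report text"]})

tokenizer_encoder = CPUTokenizerEncoder(
    fields="text",
    model_name="tokenizer",  # hypothetical tokenizer file/folder name
    return_tensors="np",     # NumPy output
)
encoded_data, artifacts = tokenizer_encoder.run(data)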

class bardi.nlp_engineering.tokenizer_encoder.TokenizerEncoder(fields: str | List[str], return_tensors: str = 'np', concat_fields: bool = False, retain_input_fields: bool = False, retain_concat_field: bool = False, field_rename: str | None = None, hf_cache_dir: str | None = None, model_name: str | None = None, cores: int | None = None, tokenizer_params: dict | None = None)[source]

Bases: Step

The tokenizer encoder uses a trained tokenizer to split text into tokens.

Note

Avoid the direct instantiation of the TokenizerEncoder class and instead instantiate one of the child classes depending on hardware configuration.

fields

the name of the column(s) containing text to which the tokenizer will be applied

Type:

Union[str, List[str]]

return_tensors

type of tensors to return, ‘np’ for Numpy arrays, ‘pt’ for PyTorch tensors or ‘tf’ for TensorFlow

Type:

str

concat_fields

whether the text fields should be concatenated into a single text field, defaults to False

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: TokenizerEncoder__<field>

Type:

Optional[bool]

retain_concat_field

If True, will retain the concatenation of the fields specified in fields under the name specified in field_rename or ‘text’ if not specified. If concat_fields is not True, this parameter will have no effect

Type:

Optional[bool]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

Optional[str]

hf_cache_dir

local directory where the HF pretrained tokenizers are stored

Type:

Optional[str]

model_name

name of a tokenizer file or folder. Not required if TokenizerTrainer is a prior step - tokenizer will be passed in artifacts

Type:

Optional[str]

cores

number of CPU cores for multithreading the tokenizer

Type:

Optional[int]

tokenizer_params

provide fine-grained customization for any valid HuggingFace Tokenizer parameter through a dictionary

Type:

Optional[dict]

tokenizer_model

Tokenizer object passed through artifacts from TokenizerTrainer or read from file specified in model_name

Type:

transformers.PreTrainedTokenizerBase

abstract run()[source]

Abstract method

bardi label_processor module

Encode label columns into numerical representations

class bardi.nlp_engineering.label_processor.CPULabelProcessor(*args, **kwargs)[source]

Bases: LabelProcessor

The label processor creates and maps a label vocab.

Note

This implementation of the LabelProcessor is specific for CPU computation.

fields

The name(s) of label column(s) of which the values are used to generate a standardized mapping and then that mapping is applied to the column(s).

Type:

Union[str, List[str]]

method

Currently only a default ‘unique’ method is supported which maps each unique value in the column to an id.

Type:

str

mapping

Mapping dict of the form {label: id} used to convert labels in the column to ids.

Type:

dict

id_to_label

The reverse of mapping. Of the form {id: label} used downstream to map the ids back to the original label values.

Type:

dict

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: labelprocessor__<field>

Type:

Optional[bool]

id_to_label

If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}

Type:

Optional[dict]

get_parameters() dict[source]

Retrieve the label processor object configuration

Does not return the mapping (vocab), but does return the id_to_label dict. This is because the mapping is just the reverse of id_to_label.

Returns:

a dictionary representation of the label processor object’s attributes

Return type:

dict

run(data: Table, artifacts: dict | None = None, id_to_label: dict | None = None) Tuple[Table, dict][source]

Run a label processor using CPU computation

Parameters:
  • data (PyArrow Table) – The data to be processed. The data must contain the column specified by ‘field’ at object creation

  • artifacts (dict) – artifacts are not used in this run method, but must be received to operate correctly in the pipeline run method

  • id_to_label (Optional[dict]) – If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}

Returns:

The first position is a pyarrow.Table of processed data. The second position is a dictionary of artifacts. The dict will contain a key for “id_to_label”.

Return type:

Tuple[pyarrow.Table, dict]

Raises:
  • NotImplementedError – A value other than ‘unique’ was provided for the label processor’s method

  • TypeError – The run method was not supplied a PyArrow Table

write_artifacts(write_path: str, artifacts: dict) None[source]

Write the outputs produced by the label_processor

Parameters:
  • write_path (str) – Path is a directory where files will be written

  • artifacts (dict) – Artifacts is a dictionary of artifacts produced in this step. Expected key is: “id_to_label”
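
A minimal usage sketch based on the documented signature; the column name and label values are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.label_processor import CPULabelProcessor

data = pa.Table.from_pydict({"reportability": ["yes", "no", "yes"]})

label_processor = CPULabelProcessor(fields="reportability")
data, artifacts = label_processor.run(data)
# artifacts["id_to_label"] maps the generated ids back to the original labels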

class bardi.nlp_engineering.label_processor.LabelProcessor(fields: str | List[str], method: str = 'unique', retain_input_fields: bool = False, id_to_label: dict = None)[source]

Bases: Step

The label processor encodes label columns into numerical representations and provides a mapping for each label to its respective representation.

Note

Avoid the direct instantiation of the LabelProcessor class and instead instantiate one of the child classes.

fields

The name of a label column of which the values are used to generate a standardized mapping and then that mapping is applied to the column.

Type:

Union[str, List[str]]

method

Currently only a default ‘unique’ method is supported which maps each unique value in the column to an id.

Type:

str

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: labelprocessor__<field>

Type:

Optional[bool]

id_to_label

If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}

Type:

Optional[dict]

abstract run()[source]

Abstract method

set_write_config(data_config: DataWriteConfig | None = None, artifacts_config: LabelProcessorArtifactsWriteConfig | None = None)[source]

Overwrite the default file writing configurations

class bardi.nlp_engineering.label_processor.LabelProcessorArtifactsWriteConfig[source]

Bases: TypedDict

Indicates the keys and data types expected in an artifacts write config dict for the label processor if overwriting the default configuration.

id_to_label_args: dict | None
id_to_label_format: str

bardi splitter module

Segment the dataset into splits, such as test, train, and val

class bardi.nlp_engineering.splitter.CPUSplitter(*args, **kwargs)[source]

Bases: Splitter

The splitter adds a ‘split’ column to the data assigning each record to a particular split for downstream model training.

Two split types are available - creating a new random split from scratch and assigning previously created splits. The second option is helpful when running comparisons with other methods of data processing, ensuring that splits are exactly the same.

Note

This implementation of the Splitter is specific for CPU computation.

To create a splitter, pass the appropriate set of parameters through a defined NamedTuple for the type of split you want to create. i.e.,

CPUSplitter(split_method=NewSplit(
    split_proportions={
        'train': 0.75,
        'test': 0.15,
        'val': 0.15
    },
    unique_record_cols=[
        'document_id'
    ],
    group_cols=[
        'patient_id_number',
        'registry'
    ],
    label_cols=[
        'reportability'
    ],
    random_seed=42
))

Splitter Method Named Tuples:

split_method

A named tuple of either MapSplit type or NewSplit type. Each contains a different set of values used to create the splitter depending upon split type

Type:

Union[MapSplit, NewSplit]

split_type
The type of split to be performed:
  • new - create a new random data split

  • map - reproduce an existing data split by mapping unique IDs

Type:

str

unique_record_cols

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.

Type:

List[str]

split_mapping

Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,

hash(concat(*unique_record_cols))

The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.

Type:

dict[str, str]

split_proportions

Only used for new split type. Mapping of split names to split proportions. i.e.,

{'train': 0.75, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}

Note: values must add to 1.0.

Type:

dict[str, float]

num_splits

The number of splits contained in split_proportions.

Type:

int

group_cols

List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.

Type:

List[str]

label_cols

List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.

Type:

List[str]

random_seed

Required for reproducibility. If you have no preference, try on 42 for size.

Type:

int

run(data: Table, artifacts: dict = None) Tuple[Table, dict][source]

Runs a splitter using CPU computation based on the configuration used to create the object of the CPUSplitter class

Parameters:
  • data (PyArrow Table) – The data to be split

  • artifacts (dict) – artifacts are not consumed in this run method, but must be received to operate correctly in the pipeline run method

Returns:

A tuple with:
  • the first element holding a PyArrow Table of data including the new split column

  • the second element of the tuple intended for artifacts is None

Return type:

Tuple(PyArrow.Table, dict)

Raises:

TypeError – The run method was not supplied a PyArrow Table

class bardi.nlp_engineering.splitter.MapSplit(unique_record_cols: List[str], split_mapping: Dict[str, str], default_split_value: str)

Bases: tuple

Specify the requirements for splitting data exactly in line with an existing data split

default_split_value: str

Only used for map split type. A value to be used for split when the unique_record_cols cannot be found in the mapping.

split_mapping: Dict[str, str]

Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,

hash(concat(*unique_record_cols))

The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.

unique_record_cols: List[str]

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.
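
By analogy with the CPUSplitter example above, a hedged sketch of constructing a map split; the mapping keys shown are placeholders, since the real keys are the hashes bardi computes from the concatenated unique_record_cols values:

CPUSplitter(split_method=MapSplit(
    unique_record_cols=[
        'document_id'
    ],
    split_mapping={
        '<hash-of-doc-1>': 'train',  # placeholder keys
        '<hash-of-doc-2>': 'test',
    },
    default_split_value='test'
))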

class bardi.nlp_engineering.splitter.NewSplit(split_proportions: Dict[str, float], unique_record_cols: List[str], group_cols: List[str], label_cols: List[str], random_seed: int)

Bases: tuple

Specify the requirements for splitting data with a new split from scratch.

group_cols: List[str]

List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.

label_cols: List[str]

List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.

random_seed: int

Required for reproducibility. If you have no preference, try on 42 for size.

split_proportions: Dict[str, float]

Only used for new split type. Mapping of split names to split proportions. i.e.,

{'train': 0.75, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}

Note: values must add to 1.0.

unique_record_cols: List[str]

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.

class bardi.nlp_engineering.splitter.Splitter(split_method: MapSplit | NewSplit)[source]

Bases: Step

The splitter adds a ‘split’ column to the data assigning each record to a particular split for downstream model training.

Note

Avoid the direct instantiation of the Splitter class and instead instantiate one of the child classes.

split_method

A named tuple of either MapSplit type or NewSplit type. Each contains a different set of values used to create the splitter depending upon split type

Type:

Union[MapSplit, NewSplit]

split_type
The type of split to be performed:
  • new - create a new random data split

  • map - reproduce an existing data split by mapping unique IDs

Type:

str

unique_record_cols

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.

Type:

List[str]

split_mapping

Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,

hash(concat(*unique_record_cols))

The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.

Type:

dict[str, str]

split_proportions

Only used for new split type. Mapping of split names to split proportions. i.e.,

{'train': 0.75, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}

Note: values must add to 1.0.

Type:

dict[str, float]

num_splits

The number of splits contained in split_proportions.

Type:

int

group_cols

List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.

Type:

List[str]

label_cols

List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.

Type:

List[str]

random_seed

Required for reproducibility. If you have no preference, try on 42 for size.

Type:

int

abstract run()[source]

Abstract method

RegEx Package

bardi regex_set module

Define RegexSet and RegexSubPair blueprints

class bardi.nlp_engineering.regex_library.regex_set.RegexSet[source]

Bases: object

Blueprint for creating a configurable, domain-specific regular expression set

regex_set

a list of regular expression substitution pairs

Type:

List[RegexSubPair]

get_regex_set(lowercase_substitution=False, no_substitution=False) List[RegexSubPair][source]

Return the ordered set of regular expressions

lowercase_substitution

If True, all substitution tokens like DATETOKEN will be returned in lowercase, e.g. datetoken. Defaults to False.

Type:

Optional[bool]

no_substitution

If True, all regular expressions that remove a matched pattern will replace it with a space instead of a special token. Defaults to False.

Type:

Optional[bool]

Returns:

a list of regular expression substitution pairs

Return type:

List[RegexSubPair]
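
A hedged sketch of these two flags in use, taking the PathologyReportRegexSet documented below as the concrete RegexSet subclass:

from bardi.nlp_engineering.regex_library.pathology_report import PathologyReportRegexSet

regex_set = PathologyReportRegexSet()
pairs = regex_set.get_regex_set()                                   # tokens like DATETOKEN
pairs_lower = regex_set.get_regex_set(lowercase_substitution=True)  # tokens like datetoken
pairs_plain = regex_set.get_regex_set(no_substitution=True)         # spaces instead of tokens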

class bardi.nlp_engineering.regex_library.regex_set.RegexSubPair[source]

Bases: TypedDict

Dictionary used for regular expression string substitutions

Example of a regex sub pair dictionary:

{
    "regex_str": r"\s",
    "sub_str": "WHITESPACE"
}
regex_str

regular expression pattern

Type:

str

sub_str

replacement value for matched string

Type:

str

regex_str: str
sub_str: str
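
The pair’s semantics follow Python’s built-in re.sub, so a single pair can be applied like this (a minimal stdlib illustration, not a bardi API):

import re

regex_sub_pair = {"regex_str": r"\s", "sub_str": "WHITESPACE"}
result = re.sub(regex_sub_pair["regex_str"], regex_sub_pair["sub_str"], "a b")
# result == "aWHITESPACEb"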

bardi provided regex sets

Curated set of regular expressions for cleaning text from pathology reports.

class bardi.nlp_engineering.regex_library.pathology_report.PathologyReportRegexSet(convert_escape_codes: bool = True, handle_whitespaces: bool = True, remove_urls: bool = True, remove_special_punct: bool = True, remove_multiple_punct: bool = True, handle_angle_brackets: bool = True, replace_percent_sign: bool = True, handle_leading_digit_punct: bool = True, remove_leading_punct: bool = True, remove_trailing_punct: bool = True, handle_words_with_punct_spacing: bool = True, handle_math_spacing: bool = True, handle_dimension_spacing: bool = True, handle_measure_spacing: bool = True, handle_cassettes_spacing: bool = True, handle_dash_digit_spacing: bool = True, handle_literals_floats_spacing: bool = True, fix_pluralization: bool = True, handle_digits_words_spacing: bool = True, remove_phone_numbers: bool = True, remove_dates: bool = True, remove_times: bool = True, remove_addresses: bool = True, remove_dimensions: bool = True, remove_specimen: bool = True, remove_decimal_seg_numbers: bool = True, remove_large_digits_seq: bool = True, remove_large_floats_seq: bool = True, trunc_decimals: bool = True, remove_cassette_names: bool = True, remove_duration_time: bool = True, remove_letter_num_seq: bool = True)[source]

Bases: RegexSet

The PathologyReportRegexSet includes a set of standard regular expressions to normalize a pathology report.

Note

The set of regular expressions tailored for pathology reports was crafted with the understanding that dividing text based on punctuation often results in the loss of crucial information. E.g. terms like “her-2” should not be split. However, to ensure that the number of unique tokens remains manageable, we employ a number of regular expressions to separate some tokens around punctuation. E.g. 22-years becomes 22 years. This consideration is particularly important when employing the word2vec algorithm, as an excessive number of tokens can impede the model’s effectiveness by diluting the representation of key concepts.

convert_escape_codes

Removes escape codes such as x0d, x0a, etc.

Type:

bool

handle_whitespaces

Removes extra whitespace: any newline, carriage return, or tab.

Type:

bool

remove_urls

Removes URLs found in the text that match the pattern.

Type:

bool

remove_special_punct

Removes special punctuation like (?,$).

Type:

bool

remove_multiple_punct

Removes duplicated punctuation. E.g. ---

Type:

bool

handle_angle_brackets

Removes angle brackets. E.g. <title> becomes title.

Type:

bool

replace_percent_sign

Replaces a percent sign with a ‘percent’ word.

Type:

bool

handle_leading_digit_punct

Removes punctuation when digit is attached to word. E.g. 22-years becomes 22 years.

Type:

bool

remove_leading_punct

Removes leading punctuation from words. E.g. -result becomes result.

Type:

bool

remove_trailing_punct

Removes trailing punctuation from words. E.g. result- becomes result.

Type:

bool

handle_words_with_punct_spacing

Matches words with hyphen, colon or period and splits them.

Type:

bool

handle_math_spacing

Matches math operator symbols like ><=%: and adds spaces around them.

Type:

bool

handle_dimension_spacing

Matches digits and x and adds spaces between them.

Type:

bool

handle_measure_spacing

Matches measurements in mm, cm and ml and provides proper spacing between the digits and the measure.

Type:

bool

handle_cassettes_spacing

Matches patterns like 5e-6f and adds spaces around them.

Type:

bool

handle_dash_digit_spacing

Matches dashes around digits and adds spaces around the dashes.

Type:

bool

handle_literals_floats_spacing

Matches a character followed by a float and a word. This is a common formatting problem. E.g. r18.0admission becomes r18.0 admission.

Type:

bool

fix_pluralization

Matches s character after a word and attaches it back to the word. This restores plural nouns damaged by removed punctuation.

Type:

bool

handle_digits_words_spacing

Matches digits that are attached to the beginning of a word.

Type:

bool

remove_phone_numbers

Matches any phone number that consists of 10 digits with delimiters.

Type:

bool

remove_dates

Removes dates of prespecified format.

Type:

bool

remove_times

Matches time of format 11:20 am or 1.30pm or 9:52:07AM.

Type:

bool

remove_addresses

Matches any address of the format: number, street name (1 to 6 words), 2-letter state, and a short or long zip code.

Type:

bool

remove_dimensions

Matches 2D or 3D dimension measurements and replaces them with a token.

Type:

bool

remove_specimen

Matches marking of a pathology specimen.

Type:

bool

remove_decimal_seg_numbers

Matches combinations of digits and periods or dashes. E.g. 1.78.9.87.

Type:

bool

remove_large_digits_seq

Matches large sequences of digits (3 or more) and replaces them.

Type:

bool

remove_large_floats_seq

Matches large floats and replaces them.

Type:

bool

trunc_decimals

Matches floats and keeps only the first decimal.

Type:

bool

remove_cassette_names

Removes pathology samples’ markings. E.g. 1-e.

Type:

bool

remove_duration_time

Removes the duration a specimen was treated. E.g. 32d09090301.

Type:

bool

remove_letter_num_seq

Removes a character followed directly by 6 to 10 digits.

Type:

bool
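
A hedged sketch wiring this regex set into the CPUNormalizer documented earlier; the disabled flag and field name are illustrative:

from bardi.nlp_engineering.normalizer import CPUNormalizer
from bardi.nlp_engineering.regex_library.pathology_report import PathologyReportRegexSet

# build the curated regex set, optionally toggling individual groups off
pathology_regex_set = PathologyReportRegexSet(remove_phone_numbers=False)

normalizer = CPUNormalizer(
    fields="text",
    regex_set=pathology_regex_set.get_regex_set(),
)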

bardi provided regex library

Library of pre-defined regular expression substitution pairs.

bardi.nlp_engineering.regex_library.regex_lib.get_address_regex() RegexSubPair[source]

Matches any address of the format: number, street name (1 to 6 words), 2-letter state, and a short or long zip code.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

1034 north 500 west provo ut 84604-3337

Output string:

ADDRESSTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_angle_brackets_regex() RegexSubPair[source]

Matches content between matching angle brackets, keeps the content, and removes the brackets.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

<This should be fixed> But not this >90

Output string:

This should be fixed But not this >90
bardi.nlp_engineering.regex_library.regex_lib.get_cassette_name_regex() RegexSubPair[source]

Matches cassette markings of the specified format.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

block:  1-e

Output string:

block:  CASSETTETOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_cassettes_spacing_regex() RegexSubPair[source]

Matches patterns like 5e-6f and adds spaces around them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Examples

Input string:

3e-3f

Output string:

3e - 3f
bardi.nlp_engineering.regex_library.regex_lib.get_dash_digits_spacing_regex() RegexSubPair[source]

Matches dashes around digits and adds spaces around the dashes.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

right 1:30-2:30 1.5-2.0 cm 0.9 cm for the 7-6

Output string:

right 1:30 - 2:30 1.5 - 2.0 cm 0.9 cm for the 7 - 6
bardi.nlp_engineering.regex_library.regex_lib.get_dates_regex() RegexSubPair[source]

Matches dates of specified formats.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

co: 03/09/2021 1015 completed: 03/10/21 at 3:34.

Output string:

co:  DATETOKEN completed:  DATETOKEN .
bardi.nlp_engineering.regex_library.regex_lib.get_decimal_segmented_numbers_regex() RegexSubPair[source]

Matches combinations of digits and periods or dashes.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

1.78.9.87

Output string:

DECIMALSEGMENTEDNUMBERTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_digits_words_spacing_regex() RegexSubPair[source]

Matches digits that are attached to the beginning of a word.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

9837648admission

Output string:

9837648 admission
bardi.nlp_engineering.regex_library.regex_lib.get_dimension_spacing_regex() RegexSubPair[source]

Matches digits and x and adds spaces between them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

measuring 1.3x0.7x0.1 cm

Output string:

measuring 1.3 x 0.7 x 0.1 cm
bardi.nlp_engineering.regex_library.regex_lib.get_dimensions_regex() RegexSubPair[source]

Matches 2D or 3D dimension measurements and replaces them with a token.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

3.5 x 2.5 x 9.0 cm and 33 x 6.5 cm

Output string:

DIMENSIONTOKEN  cm and  DIMENSIONTOKEN  cm
bardi.nlp_engineering.regex_library.regex_lib.get_duration_regex() RegexSubPair[source]

Matches the duration a specimen was treated.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

32d0909091

Output string:

DURATIONTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_escape_code_regex() RegexSubPair[source]

Matches escape codes such as x0d, x0a, etc.

Returns:

{regex pattern, replacement string}

Return type:

RegexSubPair

Example

Input string:

Codes\x0d\x0a\x0d \r30

Output string:

Codes      30
bardi.nlp_engineering.regex_library.regex_lib.get_fix_pluralization_regex() RegexSubPair[source]

Matches s character after a word and attaches it back to the word. This restores plural nouns damaged by removed punctuation.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

specimen s code s

Output string:

specimens codes
bardi.nlp_engineering.regex_library.regex_lib.get_large_digits_seq_regex() RegexSubPair[source]

Matches large sequences of digits and replaces them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

456123456

Output string:

DIGITSEQUENCETOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_large_float_seq_regex() RegexSubPair[source]

Matches large floats and replaces them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

456 123456.783

Output string:

456 LARGEFLOATTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_leading_digit_punctuation_regex() RegexSubPair[source]

Matches numeric digits at the start of a word, followed by punctuation and additional characters. Proceeds to eliminate the punctuation and inserts a space between the digits and the word.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

13-unremarkable 1-e 22-years

Output string:

13 unremarkable   1 e   22 years
bardi.nlp_engineering.regex_library.regex_lib.get_leading_punctuation_regex() RegexSubPair[source]

Matches leading punctuation and removes it.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

-3a -anterior -result- :cassette

Output string:

3a  anterior  result-  cassette
bardi.nlp_engineering.regex_library.regex_lib.get_letter_num_seq_regex() RegexSubPair[source]

Matches a character followed directly by 6 to 10 digits.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

c001234567

Output string:

LETTERDIGITSTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_literals_floats_spacing_regex() RegexSubPair[source]

Matches a character followed by a float and a word. This is a common formatting problem, e.g. r18.0admission -> r18.0 admission.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

r18.0admission diagnosis: bi n13.30admission

Output string:

r18.0 admission diagnosis: bi n13.30 admission
bardi.nlp_engineering.regex_library.regex_lib.get_math_spacing_regex() RegexSubPair[source]

Matches math operator symbols like ><=%: and adds spaces around them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

This is >95% 3+3=8  6/7

Output string:

This is  > 95 %  3 + 3 = 8  6 / 7
bardi.nlp_engineering.regex_library.regex_lib.get_measure_spacing_regex() RegexSubPair[source]

Matches measurements in mm, cm and ml and provides proper spacing between the digits and the measure. Also provides spacing, e.g. 11th -> 11 th.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

10mm histologic type 2 x 3cm. this is 3.0-cm

Output string:

10 mm  histologic type 2 x 3 cm . this is 3.0 cm
bardi.nlp_engineering.regex_library.regex_lib.get_multiple_punct_regex() RegexSubPair[source]

Matches multiple occurrences of symbols like -, . and _ and replaces them with a single space.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

-----this is report ___ signature

Output string:

this is report   signature
bardi.nlp_engineering.regex_library.regex_lib.get_percent_sign_regex() RegexSubPair[source]

Matches the % sign and replaces it with the word ‘percent’.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

strong intensity >95%

Output string:

strong intensity >95 percent
bardi.nlp_engineering.regex_library.regex_lib.get_phone_number_regex() RegexSubPair[source]

Matches any phone number that consists of 10 digits with delimiters.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

Ph: (123) 456 7890. It is (123)4567890.

Output string:

Ph:  PHONENUMTOKEN . It is  PHONENUMTOKEN .
bardi.nlp_engineering.regex_library.regex_lib.get_spaces_regex() RegexSubPair[source]

Matches additional spaces (an artifact of applying other regexes) and unneeded periods that can be removed.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

located around lower arm specimen   date

Output string:

located around lower arm specimen date
bardi.nlp_engineering.regex_library.regex_lib.get_special_punct_regex() RegexSubPair[source]

Matches a set of chosen punctuation symbols _,();[]#{}*"'~?!|^ and replaces them with a single space.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

wt-1, ck-7 (focal) negative; [sth] ab|cd"

Output string:

wt-1  ck-7  focal  negative   sth  ab cd
bardi.nlp_engineering.regex_library.regex_lib.get_specimen_regex() RegexSubPair[source]

Matches marking of a pathology specimen.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

for s-21-009345 sh-22-0011300

Output string:

for  SPECIMENTOKEN   SPECIMENTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_time_regex() RegexSubPair[source]

Matches time of format 11:20 am or 1.30pm or 9:52:07AM.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

at 11:12 pm or 11.12am

Output string:

at  TIMETOKEN  or  TIMETOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_trailing_punctuation_regex() RegexSubPair[source]

Matches trailing punctuation and removes it.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

-3a -anterior -result- :cassette

Output string:

-3a -anterior  -result :cassette
bardi.nlp_engineering.regex_library.regex_lib.get_trunc_decimals_regex() RegexSubPair[source]

Matches floats and keeps only the first decimal.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

1.78  9.87 - 8.99

Output string:

1.7  9.8 - 8.9
bardi.nlp_engineering.regex_library.regex_lib.get_urls_regex() RegexSubPair[source]

Matches a URL and replaces it with a URLTOKEN.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

Source: https://www.merck.com/keytruda_pi.pdf

Output string:

Source: URLTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_whitespace_regex() RegexSubPair[source]

Matches any newline, carriage return, tab, or multiple spaces and replaces them with a single space.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

INVASIVE:\nNegative    IN SITU:\nN/A  IN \tThe result\

Output string:

INVASIVE: Negative IN SITU: N/A IN The result
bardi.nlp_engineering.regex_library.regex_lib.get_words_with_punct_spacing_regex() RegexSubPair[source]

Matches words with hyphen, colon or period and splits them. Requires the words to be at least two characters in length to avoid splitting words like ph.d.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

this-that her-2 tiff-1k description:gleason

Output string:

this that her-2 tiff-1k description gleason