bardi.nlp_engineering package

Steps

bardi normalizer module

Clean text with custom sets of regular expressions

class bardi.nlp_engineering.normalizer.CPUNormalizer(*args, **kwargs)[source]

Bases: Normalizer

Normalizer class for cleaning and standardizing text input using regular expression substitutions.

Note

This implementation of the Normalizer is specific for CPU computation.

fields

The name of the column(s) containing text to be normalized.

Type:

Union[str, List[str]]

regex_set

A list of dictionaries with keys ‘regex_str’ and ‘sub_str’, used to perform regular expression substitutions on the text.

Type:

List[RegexSubPair]

lowercase

If True, lowercasing will be applied during normalization. Default is True.

Type:

Optional[bool]

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: normalizer__<field>

Type:

Optional[bool]

run(data: Table, artifacts: dict | None = None) Tuple[Table, dict][source]

Run the CPU-based normalizer method based on the configuration used to create the object of the CPUNormalizer class.

Parameters:
  • data (pyarrow.Table) – A pyarrow Table containing at least one text column of type string or large_string.

  • artifacts (Optional[dict]) – Artifacts are not used in this run method but must be received to operate correctly in the pipeline run method.

Returns:

A tuple containing the pyarrow Table of cleaned data and an empty dictionary.

Return type:

Tuple[pyarrow.Table, dict]
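
For illustration, a minimal usage sketch based on the documented signature; the column name, sample text, and regex pair are illustrative, not part of the library:

import pyarrow as pa
from bardi.nlp_engineering.normalizer import CPUNormalizer

# illustrative input table with a single text column
data = pa.Table.from_pydict({"text": ["INVASIVE:\nNegative  IN SITU:\nN/A"]})

normalizer = CPUNormalizer(
    fields="text",
    regex_set=[{"regex_str": r"\s+", "sub_str": " "}],  # illustrative: collapse whitespace
    lowercase=True,
)
clean_data, artifacts = normalizer.run(data)  # artifacts is an empty dict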

class bardi.nlp_engineering.normalizer.Normalizer(fields: str | List[str], regex_set: List[RegexSubPair], lowercase: bool = True, retain_input_fields: bool = False)[source]

Bases: Step

Normalizer cleans and standardizes text input using regular expression substitutions. Lowercasing is also applied if desired.

Note

Avoid the direct instantiation of the Normalizer class and instead instantiate one of the child classes depending on hardware configuration.

fields

The field or fields to be normalized.

Type:

Union[str, List[str]]

regex_set

List of regex substitutions to be applied.

Type:

List[RegexSubPair]

lowercase

If True, lowercasing will be applied during normalization, defaults to True.

Type:

Optional[bool]

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: normalizer__<field>

Type:

Optional[bool]

abstract run()[source]

Abstract method

bardi pre_tokenizer module

Split text columns into lists of tokens using simple patterns

class bardi.nlp_engineering.pre_tokenizer.CPUPreTokenizer(*args, **kwargs)[source]

Bases: PreTokenizer

The pre-tokenizer breaks down text into smaller units before further tokenization is applied.

Note

This implementation of the PreTokenizer is specific for CPU computation.

fields

The name of the column(s) containing text.

Type:

Union[str, List[str]]

split_pattern

A specific pattern of characters used to divide a string into smaller segments or tokens. By default, the split is done on a single space character.

Type:

str

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: pretokenizer__<field>

Type:

Optional[bool]

run(data: Table, artifacts: dict | None = None) Tuple[Table, dict | None][source]

Runs a CPU-based pre-tokenizer method based on the configuration used to create the object of the CPUPreTokenizer class.

Parameters:
  • data (pyarrow.Table) – A pyarrow Table containing at least one text column of type string or large_string.

  • artifacts (Optional[dict]) – Artifacts are not used in this run method but must be received to operate correctly in the pipeline run method.

Returns:

The first position is a pyarrow.Table of pre-tokenized data. The second position is a dictionary of artifacts. No artifacts are produced in this run method, so the second position will return None.

Return type:

Tuple[pa.Table, Union[dict, None]]
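
A minimal usage sketch based on the documented signature; the column name and sample text are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.pre_tokenizer import CPUPreTokenizer

data = pa.Table.from_pydict({"text": ["invasive negative in situ n/a"]})

pre_tokenizer = CPUPreTokenizer(fields="text", split_pattern=" ")
tokenized_data, artifacts = pre_tokenizer.run(data)  # artifacts is None
# tokenized_data["text"] now holds lists of tokens, e.g. ["invasive", "negative", ...]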

class bardi.nlp_engineering.pre_tokenizer.PreTokenizer(fields: str | List[str], split_pattern: str = ' ', retain_input_fields: bool = False)[source]

Bases: Step

The pre-tokenizer breaks down text into smaller units before further tokenization is applied.

Note

Avoid the direct instantiation of the PreTokenizer class and instead instantiate one of the child classes depending on hardware configuration.

fields

The name of the column(s) containing text.

Type:

Union[str, List[str]]

split_pattern

A specific pattern of characters used to divide a string into smaller segments or tokens. By default, the split is done on a single space character.

Type:

str

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: pretokenizer__<field>

Type:

Optional[bool]

abstract run()[source]

Abstract method

bardi tokenizer_trainer module

Train a tokenizer for transformer-based models

class bardi.nlp_engineering.tokenizer_trainer.CPUTokenizerTrainer(*args, **kwargs)[source]

Bases: TokenizerTrainer

TokenizerTrainer implementation specific for CPU computation.

Note

This implementation of the TokenizerTrainer is specific for CPU computation.

field

the name of the column containing text

Type:

str

tokenizer_type

The type of tokenizer to train from scratch. Currently supported: WordPiece, BPE, Unigram, WordLevel.

Type:

str

vocab_size

number of tokens in a trained tokenizer

Type:

int

hf_cache_dir

path to a folder where hf tokenizers are stored

Type:

str

from_old_flag

if True, use pre-trained tokenizer as a template.

Type:

bool

checkpoint_path

path to pretrained tokenizer model.

Type:

str

tokenizer_fname

name for the file or folder where the trained tokenizer will be stored

Type:

str

corpus_gen_batch_size

batch size for generating the tokenizer training data corpus. Defaults to 1000.

Type:

int

get_parameters() dict[source]

Retrieve the tokenizer trainer object configuration

Returns:

a dictionary representation of the tokenizer trainer object’s attributes

Return type:

dict

run(data: Table, artifacts: dict) Tuple[Table, dict][source]

Runs the tokenizer trainer based on the configuration used to create the object of the CPUTokenizerTrainer class

Parameters:
  • data (pyarrow.Table) – a pyarrow Table containing at least one list column containing text

  • artifacts (dict) – artifacts are not consumed in this run method, but must be received to operate correctly in the pipeline run method

Returns:

The first position is the pyarrow.Table of data. The second position is a dictionary of artifacts produced by training the tokenizer.

Return type:

Tuple[pyarrow.Table, dict]

write_artifacts(write_path: str, artifacts: dict | None) None[source]

Write the artifacts produced by the tokenizer trainer

Parameters:
  • write_path (str) – Path is a directory where files will be written

  • artifacts (Union[dict, None]) – Artifacts is a dictionary of artifacts produced in this step (the trained tokenizer and its vocab)
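
A hedged usage sketch based on the documented signature; the field name, tokenizer type, and vocab size are illustrative choices:

import pyarrow as pa
from bardi.nlp_engineering.tokenizer_trainer import CPUTokenizerTrainer

data = pa.Table.from_pydict({"text": ["pre-cleaned report text", "more text"]})

trainer = CPUTokenizerTrainer(
    fields="text",
    tokenizer_type="WordPiece",  # one of: WordPiece, BPE, Unigram, WordLevel
    vocab_size=1000,
)
data, artifacts = trainer.run(data, artifacts={})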

class bardi.nlp_engineering.tokenizer_trainer.TokenizerTrainer(fields: str | List[str], tokenizer_type: str = '', vocab_size: int = 1000, hf_cache_dir: str = '', from_old_flag: bool = False, checkpoint_path: str = None, tokenizer_fname: str = 'tokenizer', corpus_gen_batch_size: int = 1000, special_tokens: List[str] = None)[source]

Bases: Step

The TokenizerTrainer class provides the ability to train:

  1. A NEW TOKENIZER FROM AN OLD ONE

    Train a new tokenizer based on a provided tokenizer (from_old_flag). Provide a trained tokenizer associated with a given architecture (BERT, LLAMA) and train a new tokenizer from scratch that is configured to match the provided architecture.

  2. A NEW ARCHITECTURE-AGNOSTIC TOKENIZER

    Use one of the supported tokenizer algorithms to train a new tokenizer from scratch.

Note

Avoid the direct instantiation of the TokenizerTrainer class and instead instantiate one of the child classes depending on hardware configuration.

field

the name of the column containing text

Type:

str

tokenizer_type

The type of tokenizer to train from scratch. Currently supported: WordPiece, BPE, Unigram, WordLevel.

Type:

str

vocab_size

number of tokens in a trained tokenizer

Type:

int

hf_cache_dir

path to a folder where hf tokenizers are stored

Type:

str

from_old_flag

if True, use pre-trained tokenizer as a template.

Type:

bool

checkpoint_path

path to pretrained tokenizer model.

Type:

str

tokenizer_fname

name for the file or folder where the trained tokenizer will be stored

Type:

str

corpus_gen_batch_size

batch size for generating the tokenizer training data corpus. Defaults to 1000.

Type:

int

abstract run()[source]

Abstract method

set_write_config(data_config: DataWriteConfig = None, artifacts_config: TokenizerTrainerArtifactsWriteConfig = None)[source]

Overwrite the default file writing configurations

class bardi.nlp_engineering.tokenizer_trainer.TokenizerTrainerArtifactsWriteConfig[source]

Bases: TypedDict

Indicates the keys and data types expected in an artifacts write config dict for the tokenizer trainer if overwriting the default configuration.

vocab_format: str
vocab_format_args: dict

bardi embedding_generator module

Train a Word2Vec model and create a vocab and word embeddings

class bardi.nlp_engineering.embedding_generator.CPUEmbeddingGenerator(*args, **kwargs)[source]

Bases: EmbeddingGenerator

The embedding generator provides an interface to create word embeddings or vector representations of words (tokens).

The embedding generator uses the Word2Vec model from the Gensim library.

Note

This implementation of the EmbeddingGenerator is specific for CPU computation.

fields

The name of the column(s) containing text to be considered in the vocab and used in Word2Vec.

Type:

Union[str, List[str]]

load_saved_model

If True, use a pre-trained Word2Vec model.

Type:

bool

checkpoint_path

Path to the Word2Vec model checkpoint.

Type:

str

cores

Number of cores to run the Word2Vec model on.

Type:

int

min_word_count

Ignores all words with total frequency lower than this.

Type:

int

window

Maximum distance between the current and predicted word.

Type:

int

vector_size

Output embedding size.

Type:

int

sample

The threshold for configuring which high-frequency words are randomly downsampled; useful range is (0, 1e-5).

Type:

float

min_alpha

Learning rate will linearly drop to min_alpha as training progresses.

Type:

float

negative

If > 0, negative sampling will be used.

Type:

int

epochs

Total number of iterations of all training data in the training of the Word2Vec model.

Type:

int

seed

Seed for the random number generator. For a fully deterministic run, you need to limit training to a single worker thread (cores = 1) and set the PYTHONHASHSEED environment variable.

Type:

int

vocab_exclude_list

Provide a list of tokens that may be present in the text that you would like to exclude from the vocab and from Word2Vec.

Type:

List[str]

get_parameters() dict[source]

Retrieve the embedding generator object configuration.

Returns:

A dictionary representation of the EmbeddingGenerator object’s attributes.

Return type:

dict

run(data: Table, artifacts: dict) Tuple[Table, dict][source]

Runs the CPU-based embedding generator based on the configuration used to create the object of the CPUEmbeddingGenerator class

Parameters:
  • data (pyarrow.Table) – A pyarrow Table containing at least one list column containing text.

  • artifacts (dict) – Artifacts are not consumed in this run method, but must be received in the method to operate correctly in the pipeline run method.

Returns:

The first position is a pyarrow.Table of pre-tokenized data. The second position is a dictionary of artifacts. The dict will contain keys for “embedding_matrix” and “id_to_token”.

Return type:

Tuple[pyarrow.Table, dict]

write_artifacts(write_path: str, artifacts: dict) None[source]

Write the artifacts produced by the embedding generator.

Parameters:
  • write_path (str) – Path is a directory where files will be written.

  • artifacts (dict) – Artifacts is a dictionary of artifacts produced in this step. Expected keys are: “id_to_token” and “embedding_matrix”.
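
A minimal usage sketch based on the documented signature; the column name and the reduced training parameters are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.embedding_generator import CPUEmbeddingGenerator

# the run method expects at least one list column of pre-tokenized text
data = pa.Table.from_pydict({"text": [["tumor", "size", "2", "cm"]]})

embedding_generator = CPUEmbeddingGenerator(
    fields="text",
    min_word_count=1,  # illustrative; the default is 10
    vector_size=50,    # illustrative; the default is 300
)
data, artifacts = embedding_generator.run(data, artifacts={})
# artifacts contains "embedding_matrix" and "id_to_token"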

class bardi.nlp_engineering.embedding_generator.EmbeddingGenerator(fields: str | List[str], load_saved_model: bool = False, checkpoint_path: str | None = None, cores: int = 4, min_word_count: int = 10, window: int = 5, vector_size: int = 300, sample: float = 6e-05, min_alpha: float = 0.007, negative: int = 20, epochs: int = 30, seed: int = 42, vocab_exclude_list: List[str] = [])[source]

Bases: Step

The embedding generator provides an interface to create word embeddings or vector representations of words (tokens).

The embedding generator uses the Word2Vec model from the Gensim library.

Note

Avoid the direct instantiation of the EmbeddingGenerator class and instead instantiate one of the child classes depending on hardware configuration.

fields

The name of the column(s) containing text from which to generate embeddings.

Type:

Union[str, List[str]]

load_saved_model

Whether to load a saved Word2Vec model or train a new one.

Type:

bool

checkpoint_path

Path to the saved model checkpoint if load_saved_model is True.

Type:

str

cores

Number of CPU cores to use for training.

Type:

int

min_word_count

Ignore all words with a total frequency lower than this.

Type:

int

window

Maximum distance between the current and predicted word within a sentence.

Type:

int

vector_size

Dimensionality of the word vectors.

Type:

int

sample

The threshold for configuring which higher-frequency words are randomly downsampled.

Type:

float

min_alpha

Learning rate will linearly drop to min_alpha as training progresses.

Type:

float

negative

If > 0, specifies how many “noise words” should be drawn.

Type:

int

epochs

Number of iterations (epochs) over the corpus.

Type:

int

seed

Seed for the random number generator.

Type:

int

vocab_exclude_list

List of words to force exclude from the vocabulary.

Type:

List[str]

abstract run()[source]

Abstract method

set_write_config(data_config: DataWriteConfig | None = None, artifacts_config: EmbeddingGeneratorArtifactsWriteConfig | None = None)[source]

Overwrite the default file writing configurations

class bardi.nlp_engineering.embedding_generator.EmbeddingGeneratorArtifactsWriteConfig[source]

Bases: TypedDict

Indicates the keys and data types expected in an artifacts write config dict for the embedding generator if overwriting the default configuration.

embedding_matrix_format: str
embedding_matrix_format_args: dict
vocab_format: str
vocab_format_args: dict

bardi vocab_encoder module

Apply a vocab mapping converting a list of tokens in a column into a list of integers

class bardi.nlp_engineering.vocab_encoder.CPUVocabEncoder(*args, **kwargs)[source]

Bases: VocabEncoder

The vocab encoder applies a vocab mapping, converting lists of tokens into lists of integers

Note

This implementation of the VocabEncoder is specific for CPU computation.

fields

the name of the column containing a list of tokens that will be mapped to integers using a vocab

Type:

Union[str, List[str]]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

str

id_to_token

optional vocabulary in the form of {id: token} that will be used to map the tokens to integers. This is optional for the construction of the object, and can alternatively be provided in the run method. This flexibility handles the use of a pre-existing vocab versus creating a vocab during a pipeline run.

Type:

dict

concat_fields

indicate if you would like for fields to be concatenated into a single column or left as separate columns

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: vocabencoder__<field>

Type:

Optional[bool]

get_parameters()[source]

Retrieve the vocab encoder object configuration. Does not return the mapping (vocab) as it can be large

Returns:

a dictionary representation of the vocab encoder’s attributes

Return type:

dict

run(data: Table, artifacts: dict = None, id_to_token: dict = None) Table[source]

Run a vocab encoder using CPU computation

The vocab encoder relies on receiving a vocab to map. The vocab can be supplied in multiple ways:

  • id_to_token at object creation

  • contained in the pipeline artifacts dictionary passed to the run method referenced by the key, ‘id_to_token’

  • id_to_token in the run method

Parameters:
  • data (PyArrow Table) – The data to be processed. The data must contain the column specified by field at object creation

  • artifacts (dict) – A dictionary of pipeline artifacts which contains a vocab referenced by the key, ‘id_to_token’

  • id_to_token (dict) – If a vocab wasn’t passed at object creation or through the pipeline artifacts dict, then it must be passed here as a final option

Returns:

A tuple with:
  • the first element holding a PyArrow Table of data processed with the vocab encoder

  • the second element of the tuple, intended for artifacts, being None

Return type:

Tuple[PyArrow Table, dict]

Raises:
  • AttributeError – The vocab (id_to_token) wasn’t supplied either at object creation or to the run method

  • TypeError – The run method was not supplied a PyArrow Table
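
A minimal usage sketch based on the documented signature, supplying the vocab via id_to_token at object creation; the column name and vocab are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.vocab_encoder import CPUVocabEncoder

data = pa.Table.from_pydict({"text": [["tumor", "size"]]})

vocab_encoder = CPUVocabEncoder(
    fields="text",
    id_to_token={0: "tumor", 1: "size"},  # illustrative {id: token} vocab
)
encoded_data, artifacts = vocab_encoder.run(data)
# encoded_data["text"] now holds lists of integer ids, e.g. [0, 1]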

class bardi.nlp_engineering.vocab_encoder.VocabEncoder(fields: str | List[str], field_rename: str = None, id_to_token: dict = None, concat_fields: bool = False, retain_input_fields: bool = False)[source]

Bases: Step

The vocab encoder applies a vocab mapping, converting lists of tokens into lists of integers

Note

Avoid the direct instantiation of the VocabEncoder class and instead instantiate one of the child classes depending on hardware configuration

fields

the name of the column containing a list of tokens that will be mapped to integers using a vocab

Type:

Union[str, List[str]]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

str

id_to_token

optional vocabulary in the form of {id: token} that will be used to map the tokens to integers. This is optional for the construction of the object, and can alternatively be provided in the run method. This flexibility handles the use of a pre-existing vocab versus creating a vocab during a pipeline run.

Type:

dict

concat_fields

indicate if you would like for fields to be concatenated into a single column or left as separate columns

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: vocabencoder__<field>

Type:

Optional[bool]

abstract run()[source]

Abstract method

bardi tokenizer_encoder module

Apply a tokenizer to the provided text fields

class bardi.nlp_engineering.tokenizer_encoder.CPUTokenizerEncoder(*args, **kwargs)[source]

Bases: TokenizerEncoder

Implementation of the TokenizerEncoder for CPU computation

Note

This implementation of the TokenizerEncoder is specific for CPU computation.

fields

the name of the column(s) containing text to which the tokenizer will be applied

Type:

Union[str, List[str]]

return_tensors

type of tensors to return, ‘np’ for Numpy arrays, ‘pt’ for PyTorch tensors or ‘tf’ for TensorFlow

Type:

str

concat_fields

whether the text fields should be concatenated into a single text field, defaults to False

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: CPUTokenizerEncoder_input__<field>

Type:

Optional[bool]

retain_concat_field

If True, will retain the concatenation of the fields specified in fields under the name specified in field_rename or ‘text’ if not specified. If concat_fields is not True, this parameter will have no effect

Type:

Optional[bool]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

Optional[str]

hf_cache_dir

local directory where the HF pretrained tokenizers are stored

Type:

Optional[str]

model_name

name of a tokenizer file or folder. Not required if TokenizerTrainer is a prior step - tokenizer will be passed in artifacts

Type:

Optional[str]

cores

number of CPU cores for multithreading the tokenizer

Type:

Optional[int]

tokenizer_params

provide fine-grained settings for applying tokenizer

Type:

Optional[TokenizerConfig]

tokenizer_model

Tokenizer object passed through artifacts from TokenizerTrainer or read from file specified in model_name

Type:

transformers.PreTrainedTokenizerBase

get_parameters()[source]

Retrieve the tokenizer encoder object configuration. Does not return the tokenizer model as it can be large

Returns:

a dictionary representation of the tokenizer encoder’s attributes

Return type:

dict

run(data: Table, artifacts: dict = None) Table[source]

Run a tokenizer encoder based on provided configuration

The tokenizer encoder relies on receiving a tokenizer to apply to the text. The tokenizer can be supplied in multiple ways:

  • referencing the model_name at object creation

  • contained in the pipeline artifacts dictionary passed to the run method referenced by the key, ‘tokenizer_model’

Parameters:
  • data (PyArrow Table) – The data to be processed. The data must contain the column specified by field at object creation

  • artifacts (dict) – A dictionary of pipeline artifacts which contains a tokenizer referenced by the key, ‘tokenizer_model’

Returns:

The first element holding a PyArrow Table of data processed with the tokenizer encoder. The second element of the tuple intended for artifacts is None.

Return type:

Tuple[PyArrow Table, dict]

Raises:
  • AttributeError – The tokenizer wasn’t supplied either at object creation or to the run method

  • TypeError – The run method was not supplied a PyArrow Table
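
A hedged usage sketch based on the documented signature; the column name and tokenizer file name are illustrative, and model_name can be omitted when a TokenizerTrainer step runs earlier in the pipeline:

import pyarrow as pa
from bardi.nlp_engineering.tokenizer_encoder import CPUTokenizerEncoder

data = pa.Table.from_pydict({"text": ["some report text"]})

tokenizer_encoder = CPUTokenizerEncoder(
    fields="text",
    model_name="tokenizer",  # hypothetical tokenizer file/folder name
    return_tensors="np",     # NumPy output
)
encoded_data, artifacts = tokenizer_encoder.run(data)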

class bardi.nlp_engineering.tokenizer_encoder.TokenizerEncoder(fields: str | List[str], return_tensors: str = 'np', concat_fields: bool = False, retain_input_fields: bool = False, retain_concat_field: bool = False, field_rename: str | None = None, hf_cache_dir: str | None = None, model_name: str | None = None, cores: int | None = None, tokenizer_params: dict | None = None)[source]

Bases: Step

The tokenizer encoder uses a trained tokenizer to split text into tokens.

Note

Avoid the direct instantiation of the TokenizerEncoder class and instead instantiate one of the child classes depending on hardware configuration.

fields

the name of the column(s) containing text to which the tokenizer will be applied

Type:

Union[str, List[str]]

return_tensors

type of tensors to return, ‘np’ for Numpy arrays, ‘pt’ for PyTorch tensors or ‘tf’ for TensorFlow

Type:

str

concat_fields

whether the text fields should be concatenated into a single text field, defaults to False

Type:

bool

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: TokenizerEncoder__<field>

Type:

Optional[bool]

retain_concat_field

If True, will retain the concatenation of the fields specified in fields under the name specified in field_rename or ‘text’ if not specified. If concat_fields is not True, this parameter will have no effect

Type:

Optional[bool]

field_rename

optional ability to rename the supplied field with the field_rename value

Type:

Optional[str]

hf_cache_dir

local directory where the HF pretrained tokenizers are stored

Type:

Optional[str]

model_name

name of a tokenizer file or folder. Not required if TokenizerTrainer is a prior step - tokenizer will be passed in artifacts

Type:

Optional[str]

cores

number of CPU cores for multithreading the tokenizer

Type:

Optional[int]

tokenizer_params

provide fine-grained customization for any valid HuggingFace Tokenizer parameter through a dictionary

Type:

Optional[dict]

tokenizer_model

Tokenizer object passed through artifacts from TokenizerTrainer or read from file specified in model_name

Type:

transformers.PreTrainedTokenizerBase

abstract run()[source]

Abstract method

bardi label_processor module

Encode label columns into numerical representations

class bardi.nlp_engineering.label_processor.CPULabelProcessor(*args, **kwargs)[source]

Bases: LabelProcessor

The label processor creates and maps a label vocab.

Note

This implementation of the LabelProcessor is specific for CPU computation.

fields

The name(s) of label column(s) of which the values are used to generate a standardized mapping and then that mapping is applied to the column(s).

Type:

Union[str, List[str]]

method

Currently only a default ‘unique’ method is supported which maps each unique value in the column to an id.

Type:

str

mapping

Mapping dict of the form {label: id} used to convert labels in the column to ids.

Type:

dict

id_to_label

The reverse of mapping. Of the form {id: label} used downstream to map the ids back to the original label values.

Type:

dict

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: labelprocessor__<field>

Type:

Optional[bool]

id_to_label

If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}

Type:

Optional[dict]

get_parameters() dict[source]

Retrieve the label processor object configuration

Does not return the mapping (vocab), but does return the id_to_label dict. This is because the mapping is just the reverse of id_to_label.

Returns:

a dictionary representation of the label processor object’s attributes

Return type:

dict

run(data: Table, artifacts: dict | None = None, id_to_label: dict | None = None) Tuple[Table, dict][source]

Run a label processor using CPU computation

Parameters:
  • data (PyArrow Table) – The data to be processed. The data must contain the column specified by ‘field’ at object creation

  • artifacts (dict) – artifacts are not used in this run method, but must be received to operate correctly in the pipeline run method

  • id_to_label (Optional[dict]) – If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}

Returns:

The first position is a pyarrow.Table of processed data. The second position is a dictionary of artifacts. The dict will contain a key for “id_to_label”.

Return type:

Tuple[pyarrow.Table, dict]

Raises:
  • NotImplementedError – A value other than ‘unique’ was provided for the label processor’s method

  • TypeError – The run method was not supplied a PyArrow Table

write_artifacts(write_path: str, artifacts: dict) None[source]

Write the outputs produced by the label_processor

Parameters:
  • write_path (str) – Path is a directory where files will be written

  • artifacts (dict) – Artifacts is a dictionary of artifacts produced in this step. Expected key is: “id_to_label”
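
A minimal usage sketch based on the documented signature; the column name and label values are illustrative:

import pyarrow as pa
from bardi.nlp_engineering.label_processor import CPULabelProcessor

data = pa.Table.from_pydict({"reportability": ["yes", "no", "yes"]})

label_processor = CPULabelProcessor(fields="reportability")
data, artifacts = label_processor.run(data)
# artifacts["id_to_label"] maps the generated ids back to the original labels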

class bardi.nlp_engineering.label_processor.LabelProcessor(fields: str | List[str], method: str = 'unique', retain_input_fields: bool = False, id_to_label: dict = None)[source]

Bases: Step

The label processor encodes label columns into numerical representations and provides a mapping for each label to its respective representation.

Note

Avoid the direct instantiation of the LabelProcessor class and instead instantiate one of the child classes.

fields

The name of a label column of which the values are used to generate a standardized mapping and then that mapping is applied to the column.

Type:

Union[str, List[str]]

method

Currently only a default ‘unique’ method is supported which maps each unique value in the column to an id.

Type:

str

retain_input_fields

If True, will retain the original contents of the fields specified in fields under the new names of: labelprocessor__<field>

Type:

Optional[bool]

id_to_label

If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}

Type:

Optional[dict]

abstract run()[source]

Abstract method

set_write_config(data_config: DataWriteConfig | None = None, artifacts_config: LabelProcessorArtifactsWriteConfig | None = None)[source]

Overwrite the default file writing configurations

class bardi.nlp_engineering.label_processor.LabelProcessorArtifactsWriteConfig[source]

Bases: TypedDict

Indicates the keys and data types expected in an artifacts write config dict for the label processor if overwriting the default configuration.

id_to_label_args: dict | None
id_to_label_format: str

bardi splitter module

Segment the dataset into splits, such as test, train, and val

class bardi.nlp_engineering.splitter.CPUSplitter(*args, **kwargs)[source]

Bases: Splitter

The splitter adds a ‘split’ column to the data assigning each record to a particular split for downstream model training.

Two split types are available - creating a new random split from scratch and assigning previously created splits. The second option is helpful when running comparisons with other methods of data processing, ensuring that splits are exactly the same.

Note

This implementation of the Splitter is specific for CPU computation.

To create a splitter, pass the appropriate set of parameters through a defined NamedTuple for the type of split you want to create. i.e.,

CPUSplitter(split_method=NewSplit(
    split_proportions={
        'train': 0.75,
        'test': 0.15,
        'val': 0.15
    },
    unique_record_cols=[
        'document_id'
    ],
    group_cols=[
        'patient_id_number',
        'registry'
    ],
    label_cols=[
        'reportability'
    ],
    random_seed=42
))

Splitter Method Named Tuples:

split_method

A named tuple of either MapSplit type or NewSplit type. Each contains a different set of values used to create the splitter depending upon split type

Type:

Union[MapSplit, NewSplit]

split_type
The type of split to be performed:
  • new - create a new random data split

  • map - reproduce an existing data split by mapping unique IDs

Type:

str

unique_record_cols

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.

Type:

List[str]

split_mapping

Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,

hash(concat(*unique_record_cols))

The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.

Type:

dict[str, str]

split_proportions

Only used for new split type. Mapping of split names to split proportions. i.e.,

{'train': 0.75, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}

Note: values must add to 1.0.

Type:

dict[str, float]

num_splits

The number of splits contained in split_proportions.

Type:

int

group_cols

List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.

Type:

List[str]

label_cols

List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.

Type:

List[str]

random_seed

Required for reproducibility. If you have no preference, try on 42 for size.

Type:

int

run(data: Table, artifacts: dict = None) Tuple[Table, dict][source]

Runs a splitter using CPU computation based on the configuration used to create the object of the CPUSplitter class

Parameters:
  • data (PyArrow Table) – The data to be split

  • artifacts (dict) – artifacts are not consumed in this run method, but must be received to operate correctly in the pipeline run method

Returns:

A tuple with:
  • the first element holding a PyArrow Table of data including the new split column

  • the second element of the tuple intended for artifacts is None

Return type:

Tuple(PyArrow.Table, dict)

Raises:

TypeError – The run method was not supplied a PyArrow Table

class bardi.nlp_engineering.splitter.MapSplit(unique_record_cols: List[str], split_mapping: Dict[str, str], default_split_value: str)

Bases: tuple

Specify the requirements for splitting data exactly in line with an existing data split

default_split_value: str

Only used for map split type. A value to be used for split when the unique_record_cols cannot be found in the mapping.

split_mapping: Dict[str, str]

Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,

hash(concat(*unique_record_cols))

The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.

unique_record_cols: List[str]

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.
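
By analogy with the CPUSplitter example above, a hedged sketch of constructing a map split; the mapping keys shown are placeholders, since the real keys are the hashes bardi computes from the concatenated unique_record_cols values:

CPUSplitter(split_method=MapSplit(
    unique_record_cols=[
        'document_id'
    ],
    split_mapping={
        '<hash-of-doc-1>': 'train',  # placeholder keys
        '<hash-of-doc-2>': 'test',
    },
    default_split_value='test'
))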

class bardi.nlp_engineering.splitter.NewSplit(split_proportions: Dict[str, float], unique_record_cols: List[str], group_cols: List[str], label_cols: List[str], random_seed: int)

Bases: tuple

Specify the requirements for splitting data with a new split from scratch.

group_cols: List[str]

List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.

label_cols: List[str]

List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.

random_seed: int

Required for reproducibility. If you have no preference, try on 42 for size.

split_proportions: Dict[str, float]

Only used for new split type. Mapping of split names to split proportions. i.e.,

{'train': 0.75, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}

Note: values must add to 1.0.

unique_record_cols: List[str]

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.

class bardi.nlp_engineering.splitter.Splitter(split_method: MapSplit | NewSplit)[source]

Bases: Step

The splitter adds a ‘split’ column to the data assigning each record to a particular split for downstream model training.

Note

Avoid the direct instantiation of the Splitter class and instead instantiate one of the child classes.

split_method

A named tuple of either MapSplit type or NewSplit type. Each contains a different set of values used to create the splitter depending upon split type

Type:

Union[MapSplit, NewSplit]

split_type
The type of split to be performed:
  • new - create a new random data split

  • map - reproduce an existing data split by mapping unique IDs

Type:

str

unique_record_cols

List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields is required.

Note: This set of columns MUST create a unique record or the program will crash.

Type:

List[str]

split_mapping

Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,

hash(concat(*unique_record_cols))

The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.

Type:

dict[str, str]

split_proportions

Only used for new split type. Mapping of split names to split proportions. i.e.,

{'train': 0.75, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}

Note: values must add to 1.0.

Type:

dict[str, float]

num_splits

The number of splits contained in split_proportions.

Type:

int

group_cols

List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.

Type:

List[str]

label_cols

List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.

Type:

List[str]

random_seed

Required for reproducibility. If you have no preference, try on 42 for size.

Type:

int

abstract run()[source]

Abstract method

RegEx Package

bardi regex_set module

Define RegexSet and RegexSubPair blueprints

class bardi.nlp_engineering.regex_library.regex_set.RegexSet[source]

Bases: object

Blueprint for creating a configurable, domain-specific regular expression set

regex_set

a list of regular expression substitution pairs

Type:

List[RegexSubPair]

get_regex_set(lowercase_substitution=False, no_substitution=False) List[RegexSubPair][source]

Return the ordered set of regular expressions

lowercase_substitution

If True, all substitution tokens like DATETOKEN will be returned in lowercase, e.g. datetoken. Defaults to False.

Type:

Optional[bool]

no_substitution

If True, all regular expressions that remove a matched pattern will replace it with a space instead of a special token. Defaults to False.

Type:

Optional[bool]

Returns:

a list of regular expression substitution pairs

Return type:

List[RegexSubPair]
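
A hedged sketch of these two flags in use, taking the PathologyReportRegexSet documented below as the concrete RegexSet subclass:

from bardi.nlp_engineering.regex_library.pathology_report import PathologyReportRegexSet

regex_set = PathologyReportRegexSet()
pairs = regex_set.get_regex_set()                                   # tokens like DATETOKEN
pairs_lower = regex_set.get_regex_set(lowercase_substitution=True)  # tokens like datetoken
pairs_plain = regex_set.get_regex_set(no_substitution=True)         # spaces instead of tokens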

class bardi.nlp_engineering.regex_library.regex_set.RegexSubPair[source]

Bases: TypedDict

Dictionary used for regular expression string substitutions

Example of a regex sub pair dictionary:

{
    "regex_str": r"\s",
    "sub_str": "WHITESPACE"
}
regex_str

regular expression pattern

Type:

str

sub_str

replacement value for matched string

Type:

str

regex_str: str
sub_str: str
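
The pair’s semantics follow Python’s built-in re.sub, so a single pair can be applied like this (a minimal stdlib illustration, not a bardi API):

import re

regex_sub_pair = {"regex_str": r"\s", "sub_str": "WHITESPACE"}
result = re.sub(regex_sub_pair["regex_str"], regex_sub_pair["sub_str"], "a b")
# result == "aWHITESPACEb"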

bardi provided regex sets

Curated set of regular expressions for cleaning text from pathology reports.

class bardi.nlp_engineering.regex_library.pathology_report.PathologyReportRegexSet(convert_escape_codes: bool = True, handle_whitespaces: bool = True, remove_urls: bool = True, remove_special_punct: bool = True, remove_multiple_punct: bool = True, handle_angle_brackets: bool = True, replace_percent_sign: bool = True, handle_leading_digit_punct: bool = True, remove_leading_punct: bool = True, remove_trailing_punct: bool = True, handle_words_with_punct_spacing: bool = True, handle_math_spacing: bool = True, handle_dimension_spacing: bool = True, handle_measure_spacing: bool = True, handle_cassettes_spacing: bool = True, handle_dash_digit_spacing: bool = True, handle_literals_floats_spacing: bool = True, fix_pluralization: bool = True, handle_digits_words_spacing: bool = True, remove_phone_numbers: bool = True, remove_dates: bool = True, remove_times: bool = True, remove_addresses: bool = True, remove_dimensions: bool = True, remove_specimen: bool = True, remove_decimal_seg_numbers: bool = True, remove_large_digits_seq: bool = True, remove_large_floats_seq: bool = True, trunc_decimals: bool = True, remove_cassette_names: bool = True, remove_duration_time: bool = True, remove_letter_num_seq: bool = True)[source]

Bases: RegexSet

The PathologyReportRegexSet includes a set of standard regular expressions to normalize a pathology report.

Note

The set of regular expressions tailored for pathology reports was crafted with the understanding that dividing text based on punctuation often results in the loss of crucial information. E.g. terms like “her-2” should not be split. However, to ensure that the number of unique tokens remains manageable, we employ a number of regular expressions to separate some tokens around punctuation. E.g. 22-years becomes 22 years. This consideration is particularly important when employing the word2vec algorithm, as an excessive number of tokens can impede the model’s effectiveness by diluting the representation of key concepts.

convert_escape_codes

Removes escape codes such as x0d, x0a, etc.

Type:

bool

handle_whitespaces

Removes extra whitespace: any newline, carriage return, or tab.

Type:

bool

remove_urls

Removes URLs found in the text that match the pattern.

Type:

bool

remove_special_punct

Removes special punctuation like (?,$).

Type:

bool

remove_multiple_punct

Removes duplicated punctuation. E.g. ---

Type:

bool

handle_angle_brackets

Removes angle brackets. E.g. <title> becomes title.

Type:

bool

replace_percent_sign

Replaces a percent sign with a ‘percent’ word.

Type:

bool

handle_leading_digit_punct

Removes punctuation when digit is attached to word. E.g. 22-years becomes 22 years.

Type:

bool

remove_leading_punct

Removes leading punctuation from words. E.g. -result becomes result.

Type:

bool

remove_trailing_punct

Removes trailing punctuation from words. E.g. result- becomes result.

Type:

bool

handle_words_with_punct_spacing

Matches words with hyphen, colon or period and splits them.

Type:

bool

handle_math_spacing

Matches math operator symbols like ><=%: and adds spaces around them.

Type:

bool

handle_dimension_spacing

Matches digits and x and adds spaces between them.

Type:

bool

handle_measure_spacing

Matches measurements in mm, cm and ml and provides proper spacing between the digits and the measure.

Type:

bool

handle_cassettes_spacing

Matches patterns like 5e-6f and adds spaces around them.

Type:

bool

handle_dash_digit_spacing

Matches dashes around digits and adds spaces around the dashes.

Type:

bool

handle_literals_floats_spacing

Matches a character followed by a float and a word. This is a common formatting problem. E.g. r18.0admission becomes r18.0 admission.

Type:

bool

fix_pluralization

Matches s character after a word and attaches it back to the word. This restores plural nouns damaged by removed punctuation.

Type:

bool

handle_digits_words_spacing

Matches digits that are attached to the beginning of a word.

Type:

bool

remove_phone_numbers

Matches any phone number that consists of 10 digits with delimiters.

Type:

bool

remove_dates

Removes dates of prespecified format.

Type:

bool

remove_times

Matches time of format 11:20 am or 1.30pm or 9:52:07AM.

Type:

bool

remove_addresses

Matches any address of the format: number, street name (1 to 6 words), 2-letter state, and a short or long zip code.

Type:

bool

remove_dimensions

Matches 2D or 3D dimension measurements and replaces them with a token.

Type:

bool

remove_specimen

Matches marking of a pathology specimen.

Type:

bool

remove_decimal_seg_numbers

Matches combinations of digits and periods or dashes. E.g. 1.78.9.87.

Type:

bool

remove_large_digits_seq

Matches large sequences of digits (3 or more) and replaces them.

Type:

bool

remove_large_floats_seq

Matches large floats and replaces them.

Type:

bool

trunc_decimals

Matches floats and keeps only the first decimal.

Type:

bool

remove_cassette_names

Removes pathology samples’ markings. E.g. 1-e.

Type:

bool

remove_duration_time

Removes the duration a specimen was treated. E.g. 32d09090301.

Type:

bool

remove_letter_num_seq

Removes a character followed directly by 6 to 10 digits.

Type:

bool
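
A hedged sketch wiring this regex set into the CPUNormalizer documented earlier; the disabled flag and field name are illustrative:

from bardi.nlp_engineering.normalizer import CPUNormalizer
from bardi.nlp_engineering.regex_library.pathology_report import PathologyReportRegexSet

# build the curated regex set, optionally toggling individual groups off
pathology_regex_set = PathologyReportRegexSet(remove_phone_numbers=False)

normalizer = CPUNormalizer(
    fields="text",
    regex_set=pathology_regex_set.get_regex_set(),
)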

bardi provided regex library

Library of pre-defined regular expression substitution pairs.

bardi.nlp_engineering.regex_library.regex_lib.get_address_regex() RegexSubPair[source]

Matches any address of the format: number, street name (1 to 6 words), 2-letter state, and a short or long zip code.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

1034 north 500 west provo ut 84604-3337

Output string:

ADDRESSTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_angle_brackets_regex() RegexSubPair[source]

Matches content between matching angle brackets, keeps the content, and removes the brackets.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

<This should be fixed> But not this >90

Output string:

This should be fixed But not this >90
bardi.nlp_engineering.regex_library.regex_lib.get_cassette_name_regex() RegexSubPair[source]

Matches cassette markings of the specified format.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

block:  1-e

Output string:

block:  CASSETTETOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_cassettes_spacing_regex() RegexSubPair[source]

Matches patterns like 5e-6f and adds spaces around them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Examples

Input string:

3e-3f

Output string:

3e - 3f
bardi.nlp_engineering.regex_library.regex_lib.get_dash_digits_spacing_regex() RegexSubPair[source]

Matches dashes around digits and adds spaces around the dashes.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

right 1:30-2:30 1.5-2.0 cm 0.9 cm for the 7-6

Output string:

right 1:30 - 2:30 1.5 - 2.0 cm 0.9 cm for the 7 - 6
bardi.nlp_engineering.regex_library.regex_lib.get_dates_regex() RegexSubPair[source]

Matches dates of specified formats.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

co: 03/09/2021 1015 completed: 03/10/21 at 3:34.

Output string:

co:  DATETOKEN completed:  DATETOKEN .
bardi.nlp_engineering.regex_library.regex_lib.get_decimal_segmented_numbers_regex() RegexSubPair[source]

Matches combinations of digits and periods or dashes.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

1.78.9.87

Output string:

DECIMALSEGMENTEDNUMBERTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_digits_words_spacing_regex() RegexSubPair[source]

Matches digits that are attached to the beginning of a word.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

9837648admission

Output string:

9837648 admission
bardi.nlp_engineering.regex_library.regex_lib.get_dimension_spacing_regex() RegexSubPair[source]

Matches digits and x and adds spaces between them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

measuring 1.3x0.7x0.1 cm

Output string:

measuring 1.3 x 0.7 x 0.1 cm
bardi.nlp_engineering.regex_library.regex_lib.get_dimensions_regex() RegexSubPair[source]

Matches 2D or 3D dimension measurements and replaces them with a token.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

3.5 x 2.5 x 9.0 cm and 33 x 6.5 cm

Output string:

DIMENSIONTOKEN  cm and  DIMENSIONTOKEN  cm
bardi.nlp_engineering.regex_library.regex_lib.get_duration_regex() RegexSubPair[source]

Matches the duration a specimen was treated.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

32d0909091

Output string:

DURATIONTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_escape_code_regex() RegexSubPair[source]

Matches escape codes such as x0d, x0a, etc.

Returns:

{regex pattern, replacement string}

Return type:

RegexSubPair

Example

Input string:

Codes\x0d\x0a\x0d \r30

Output string:

Codes      30
bardi.nlp_engineering.regex_library.regex_lib.get_fix_pluralization_regex() RegexSubPair[source]

Matches s character after a word and attaches it back to the word. This restores plural nouns damaged by removed punctuation.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

specimen s code s

Output string:

specimens codes
bardi.nlp_engineering.regex_library.regex_lib.get_large_digits_seq_regex() RegexSubPair[source]

Matches large sequences of digits and replaces them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

456123456

Output string:

DIGITSEQUENCETOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_large_float_seq_regex() RegexSubPair[source]

Matches large floats and replaces them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

456 123456.783

Output string:

456 LARGEFLOATTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_leading_digit_punctuation_regex() RegexSubPair[source]

Matches numeric digits at the start of a word, followed by punctuation and additional characters. Proceeds to eliminate the punctuation and inserts a space between the digits and the word.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

13-unremarkable 1-e 22-years

Output string:

13 unremarkable   1 e   22 years
bardi.nlp_engineering.regex_library.regex_lib.get_leading_punctuation_regex() RegexSubPair[source]

Matches leading punctuation and removes it.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

-3a -anterior -result- :cassette

Output string:

3a  anterior  result-  cassette
bardi.nlp_engineering.regex_library.regex_lib.get_letter_num_seq_regex() RegexSubPair[source]

Matches a character followed directly by 6 to 10 digits.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

c001234567

Output string:

LETTERDIGITSTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_literals_floats_spacing_regex() RegexSubPair[source]

Matches a character followed by a float and a word. This is a common formatting problem, e.g. r18.0admission -> r18.0 admission.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

r18.0admission diagnosis: bi n13.30admission

Output string:

r18.0 admission diagnosis: bi n13.30 admission
bardi.nlp_engineering.regex_library.regex_lib.get_math_spacing_regex() RegexSubPair[source]

Matches math operator symbols like ><=%: and adds spaces around them.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

This is >95% 3+3=8  6/7

Output string:

This is  > 95 %  3 + 3 = 8  6 / 7
bardi.nlp_engineering.regex_library.regex_lib.get_measure_spacing_regex() RegexSubPair[source]

Matches measurements in mm, cm and ml and provides proper spacing between the digits and the measure. Also provides spacing, e.g. 11th -> 11 th.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

10mm histologic type 2 x 3cm. this is 3.0-cm

Output string:

10 mm  histologic type 2 x 3 cm . this is 3.0 cm
bardi.nlp_engineering.regex_library.regex_lib.get_multiple_punct_regex() RegexSubPair[source]

Matches multiple occurrences of symbols like -, . and _ and replaces them with a single space.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

-----this is report ___ signature

Output string:

this is report   signature
bardi.nlp_engineering.regex_library.regex_lib.get_percent_sign_regex() RegexSubPair[source]

Matches the % sign and replaces it with the word ‘percent’.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

strong intensity >95%

Output string:

strong intensity >95 percent
bardi.nlp_engineering.regex_library.regex_lib.get_phone_number_regex() RegexSubPair[source]

Matches any phone number that consists of 10 digits with delimiters.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

Ph: (123) 456 7890. It is (123)4567890.

Output string:

Ph:  PHONENUMTOKEN . It is  PHONENUMTOKEN .
bardi.nlp_engineering.regex_library.regex_lib.get_spaces_regex() RegexSubPair[source]

Matches additional spaces (an artifact of applying other regexes) and unneeded periods that can be removed.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

located around lower arm specimen   date

Output string:

located around lower arm specimen date
bardi.nlp_engineering.regex_library.regex_lib.get_special_punct_regex() RegexSubPair[source]

Matches a set of chosen punctuation symbols _,();[]#{}*"'~?!|^ and replaces them with a single space.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

wt-1, ck-7 (focal) negative; [sth] ab|cd"

Output string:

wt-1  ck-7  focal  negative   sth  ab cd
bardi.nlp_engineering.regex_library.regex_lib.get_specimen_regex() RegexSubPair[source]

Matches marking of a pathology specimen.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

for s-21-009345 sh-22-0011300

Output string:

for  SPECIMENTOKEN   SPECIMENTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_time_regex() RegexSubPair[source]

Matches time of format 11:20 am or 1.30pm or 9:52:07AM.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

at 11:12 pm or 11.12am

Output string:

at  TIMETOKEN  or  TIMETOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_trailing_punctuation_regex() RegexSubPair[source]

Matches trailing punctuation and removes it.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

-3a -anterior -result- :cassette

Output string:

-3a -anterior  -result :cassette
bardi.nlp_engineering.regex_library.regex_lib.get_trunc_decimals_regex() RegexSubPair[source]

Matches floats and keeps only the first decimal.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

1.78  9.87 - 8.99

Output string:

1.7  9.8 - 8.9
bardi.nlp_engineering.regex_library.regex_lib.get_urls_regex() RegexSubPair[source]

Matches a URL and replaces it with a URLTOKEN.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

Source: https://www.merck.com/keytruda_pi.pdf

Output string:

Source: URLTOKEN
bardi.nlp_engineering.regex_library.regex_lib.get_whitespace_regex() RegexSubPair[source]

Matches any newline, carriage return, tab, or multiple spaces and replaces them with a single space.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

INVASIVE:\nNegative    IN SITU:\nN/A  IN \tThe result\

Output string:

INVASIVE: Negative IN SITU: N/A IN The result
bardi.nlp_engineering.regex_library.regex_lib.get_words_with_punct_spacing_regex() RegexSubPair[source]

Matches words with hyphen, colon or period and splits them. Requires the words to be at least two characters in length to avoid splitting words like ph.d.

Return type:

RegexSubPair - (regex pattern, replacement string)

Example

Input string:

this-that her-2 tiff-1k description:gleason

Output string:

this that her-2 tiff-1k description gleason