bardi.nlp_engineering package
Steps
bardi normalizer module
Clean text with custom sets of regular expressions
- class bardi.nlp_engineering.normalizer.CPUNormalizer(*args, **kwargs)[source]
Bases:
Normalizer
Normalizer class for cleaning and standardizing text input using regular expression substitutions.
Note
This implementation of the Normalizer is specific for CPU computation.
- fields
The name of the column(s) containing text to be normalized.
- Type:
Union[str, List[str]]
- regex_set
A list of dictionaries with keys, ‘regex_str’ and ‘sub_str’, used to perform regular expression substitutions of the text.
- Type:
List[RegexSubPair]
- lowercase
If True, lowercasing will be applied during normalization. Default is True.
- Type:
Optional[bool]
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: normalizer__<field>
- Type:
Optional[bool]
- run(data: Table, artifacts: dict | None = None) Tuple[Table, dict] [source]
Run the CPU-based normalizer method based on the configuration used to create the object of the CPUNormalizer class.
- Parameters:
data (pyarrow.Table) – A pyarrow Table containing at least one text column of type string or large_string.
artifacts (Optional[dict]) – Artifacts are not used in this run method but must be received to operate correctly in the pipeline run method.
- Returns:
A tuple containing the pyarrow Table of cleaned data and an empty dictionary.
- Return type:
Tuple[pyarrow.Table, dict]
- class bardi.nlp_engineering.normalizer.Normalizer(fields: str | List[str], regex_set: List[RegexSubPair], lowercase: bool = True, retain_input_fields: bool = False)[source]
Bases:
Step
Normalizer cleans and standardizes text input using regular expression substitutions. Lowercasing is also applied if desired.
Note
Avoid the direct instantiation of the Normalizer class and instead instantiate one of the child classes depending on hardware configuration.
- fields
The field or fields to be normalized.
- Type:
Union[str, List[str]]
- regex_set
List of regex substitutions to be applied.
- Type:
List[RegexSubPair]
- lowercase
If True, lowercasing will be applied during normalization, defaults to True.
- Type:
Optional[bool]
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: normalizer__<field>
- Type:
Optional[bool]
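A minimal usage sketch, assuming a toy pyarrow Table with a single text column named "text" (the column name and the regex pairs shown are illustrative, not part of the API):

    import pyarrow as pa

    from bardi.nlp_engineering.normalizer import CPUNormalizer

    # Toy data: one text column of type string
    data = pa.Table.from_pydict(
        {"text": ["RESULT:   Negative!!", "Dx:  her-2  positive"]}
    )

    # Each RegexSubPair is a dict with 'regex_str' and 'sub_str' keys
    regex_set = [
        {"regex_str": r"[!?]+", "sub_str": " "},  # drop chosen punctuation
        {"regex_str": r"\s+", "sub_str": " "},    # collapse whitespace
    ]

    normalizer = CPUNormalizer(fields="text", regex_set=regex_set, lowercase=True)
    clean_data, _ = normalizer.run(data)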
bardi pre_tokenizer module
Split text columns into lists of tokens using simple patterns
- class bardi.nlp_engineering.pre_tokenizer.CPUPreTokenizer(*args, **kwargs)[source]
Bases:
PreTokenizer
The pre-tokenizer breaks down text into smaller units before further tokenization is applied.
Note
This implementation of the PreTokenizer is specific for CPU computation.
- fields
The name of the column(s) containing text.
- Type:
Union[str, List[str]]
- split_pattern
A specific pattern of characters used to divide a string into smaller segments or tokens. By default, the split is done on a single space character.
- Type:
str
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: pretokenizer__<field>
- Type:
Optional[bool]
- run(data: Table, artifacts: dict | None = None) Tuple[Table, dict | None] [source]
Runs a CPU-based pre-tokenizer method based on the configuration used to create the object of the CPUPreTokenizer class.
- Parameters:
data (pyarrow.Table) – A pyarrow Table containing at least one text column of type string or large_string.
artifacts (Optional[dict]) – Artifacts are not used in this run method but must be received to operate correctly in the pipeline run method.
- Returns:
The first position is a pyarrow.Table of pre-tokenized data. The second position is a dictionary of artifacts. No artifacts are produced in this run method, so the second position will return None.
- Return type:
Tuple[pa.Table, Union[dict, None]]
- class bardi.nlp_engineering.pre_tokenizer.PreTokenizer(fields: str | List[str], split_pattern: str = ' ', retain_input_fields: bool = False)[source]
Bases:
Step
The pre-tokenizer breaks down text into smaller units before further tokenization is applied.
Note
Avoid the direct instantiation of the PreTokenizer class and instead instantiate one of the child classes depending on hardware configuration.
- fields
The name of the column(s) containing text.
- Type:
Union[str, List[str]]
- split_pattern
A specific pattern of characters used to divide a string into smaller segments or tokens. By default, the split is done on a single space character.
- Type:
str
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: pretokenizer__<field>
- Type:
Optional[bool]
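A minimal sketch, again assuming a toy table with a "text" column; after the run, each cell of "text" holds a list of tokens:

    import pyarrow as pa

    from bardi.nlp_engineering.pre_tokenizer import CPUPreTokenizer

    data = pa.Table.from_pydict({"text": ["result negative", "her-2 positive"]})

    # Split on a single space (the default split_pattern)
    pre_tokenizer = CPUPreTokenizer(fields="text", split_pattern=" ")
    tokenized_data, _ = pre_tokenizer.run(data)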
bardi tokenizer_trainer module
Train a tokenizer for transformer-based models
- class bardi.nlp_engineering.tokenizer_trainer.CPUTokenizerTrainer(*args, **kwargs)[source]
Bases:
TokenizerTrainer
TokenizerTrainer specific for CPU computation.
Note
This implementation of the TokenizerTrainer is specific for CPU computation.
- fields
The name of the column(s) containing text.
- Type:
str
- tokenizer_type
The type of tokenizer to train from scratch. Currently supported: WordPiece, BPE, Unigram, WordLevel.
- Type:
str
- vocab_size
The number of tokens in the trained tokenizer's vocabulary.
- Type:
int
- hf_cache_dir
Path to a folder where Hugging Face tokenizers are stored.
- Type:
str
- from_old_flag
If True, use a pre-trained tokenizer as a template.
- Type:
bool
- checkpoint_path
Path to the pre-trained tokenizer model.
- Type:
str
- tokenizer_fname
name for the file or folder where the trained tokenizer will be stored
- Type:
str
- corpus_gen_batch_size
Batch size for generating the tokenizer training corpus. Defaults to 1000.
- Type:
int
- get_parameters() dict [source]
Retrieve the tokenizer trainer object configuration.
- Returns:
A dictionary representation of the tokenizer trainer object’s attributes.
- Return type:
dict
- run(data: Table, artifacts: dict) Tuple[Table, dict] [source]
Runs the tokenizer trainer based on the configuration used to create the object of the CPUTokenizerTrainer class
- Parameters:
data (pyarrow.Table) – a pyarrow Table containing at least one list column containing text
artifacts (dict) – artifacts are not consumed in this run method, but must be received to operate correctly in the pipeline run method
- Returns:
The first position is the pyarrow.Table of input data. The second position is a dictionary of artifacts containing the trained tokenizer under the key “tokenizer_model”.
- Return type:
Tuple[pyarrow.Table, dict]
- write_artifacts(write_path: str, artifacts: dict | None) None [source]
Write the artifacts produced by the tokenizer trainer
- Parameters:
write_path (str) – Path is a directory where files will be written
artifacts (Union[dict, None]) – Artifacts is a dictionary of artifacts produced in this step. Expected key is: “tokenizer_model”
- class bardi.nlp_engineering.tokenizer_trainer.TokenizerTrainer(fields: str | List[str], tokenizer_type: str = '', vocab_size: int = 1000, hf_cache_dir: str = '', from_old_flag: bool = False, checkpoint_path: str = None, tokenizer_fname: str = 'tokenizer', corpus_gen_batch_size: int = 1000, special_tokens: List[str] = None)[source]
Bases:
Step
The TokenizerTrainer class provides the ability to train:
- A NEW TOKENIZER FROM AN OLD ONE
Train a new tokenizer based on a provided tokenizer (from_old_flag). Provide a trained tokenizer associated with a given architecture (BERT, LLaMA) and train a new tokenizer from scratch that is configured to the provided architecture.
- A NEW ARCHITECTURE-AGNOSTIC TOKENIZER
Use one of the supported tokenizer algorithms to train a new tokenizer from scratch.
Note
Avoid the direct instantiation of the TokenizerTrainer class and instead instantiate one of the child classes depending on hardware configuration.
- fields
The name of the column(s) containing text.
- Type:
str
- tokenizer_type
The type of tokenizer to train from scratch. Currently supported: WordPiece, BPE, Unigram, WordLevel.
- Type:
str
- vocab_size
The number of tokens in the trained tokenizer's vocabulary.
- Type:
int
- hf_cache_dir
Path to a folder where Hugging Face tokenizers are stored.
- Type:
str
- from_old_flag
If True, use a pre-trained tokenizer as a template.
- Type:
bool
- checkpoint_path
Path to the pre-trained tokenizer model.
- Type:
str
- tokenizer_fname
name for the file or folder where the trained tokenizer will be stored
- Type:
str
- corpus_gen_batch_size
Batch size for generating the tokenizer training corpus. Defaults to 1000.
- Type:
int
- set_write_config(data_config: DataWriteConfig = None, artifacts_config: TokenizerTrainerArtifactsWriteConfig = None)[source]
Overwrite the default file writing configurations
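A minimal sketch of training a tokenizer from scratch, assuming a toy table with a "text" column and an output directory "./artifacts" (both illustrative):

    import pyarrow as pa

    from bardi.nlp_engineering.tokenizer_trainer import CPUTokenizerTrainer

    data = pa.Table.from_pydict({"text": ["result negative", "her-2 positive"]})

    trainer = CPUTokenizerTrainer(
        fields="text",
        tokenizer_type="WordPiece",        # or BPE, Unigram, WordLevel
        vocab_size=1000,
        special_tokens=["[UNK]", "[PAD]"],
        tokenizer_fname="tokenizer",
    )
    data, artifacts = trainer.run(data, artifacts={})
    trainer.write_artifacts("./artifacts", artifacts)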
bardi embedding_generator module
Train a Word2Vec model and create a vocab and word embeddings
- class bardi.nlp_engineering.embedding_generator.CPUEmbeddingGenerator(*args, **kwargs)[source]
Bases:
EmbeddingGenerator
The embedding generator provides an interface to create word embeddings or vector representations of words (tokens).
The embedding generator uses the Word2Vec model from the Gensim library.
Note
This implementation of the EmbeddingGenerator is specific for CPU computation.
- fields
The name of the column(s) containing text to be considered in the vocab and used in Word2Vec.
- Type:
Union[str, List[str]]
- load_saved_model
If True, use a pre-trained Word2Vec model.
- Type:
bool
- checkpoint_path
Path to the Word2Vec model checkpoint.
- Type:
str
- cores
Number of cores to run the Word2Vec model on.
- Type:
int
- min_word_count
Ignores all words with total frequency lower than this.
- Type:
int
- window
Maximum distance between the current and predicted word.
- Type:
int
- vector_size
Output embedding size.
- Type:
int
- sample
The threshold for configuring which high-frequency words are randomly downsampled, use range (0, 1e-5).
- Type:
float
- min_alpha
Learning rate will linearly drop to min_alpha as training progresses.
- Type:
float
- negative
If > 0, negative sampling will be used.
- Type:
int
- epochs
Total number of iterations of all training data in the training of the Word2Vec model.
- Type:
int
- seed
Seed for the random number generator. For a fully deterministic run, use a single worker thread (cores = 1) and set the PYTHONHASHSEED environment variable.
- Type:
int
- vocab_exclude_list
Provide a list of tokens that may be present in the text that you would like to exclude from the vocab and from Word2Vec.
- Type:
List[str]
- get_parameters() dict [source]
Retrieve the embedding generator object configuration.
- Returns:
A dictionary representation of the EmbeddingGenerator object’s attributes.
- Return type:
dict
- run(data: Table, artifacts: dict) Tuple[Table, dict] [source]
Runs the CPU-based embedding generator based on the configuration used to create the object of the CPUEmbeddingGenerator class
- Parameters:
data (pyarrow.Table) – A pyarrow Table containing at least one list column containing text.
artifacts (dict) – Artifacts are not consumed in this run method, but must be received in the method to operate correctly in the pipeline run method.
- Returns:
The first position is a pyarrow.Table of pre-tokenized data. The second position is a dictionary of artifacts. The dict will contain keys for “embedding_matrix” and “id_to_token”.
- Return type:
Tuple[pyarrow.Table, dict]
- write_artifacts(write_path: str, artifacts: dict) None [source]
Write the artifacts produced by the embedding generator.
- Parameters:
write_path (str) – Path is a directory where files will be written.
artifacts (dict) – Artifacts is a dictionary of artifacts produced in this step. Expected keys are: “id_to_token” and “embedding_matrix”.
- class bardi.nlp_engineering.embedding_generator.EmbeddingGenerator(fields: str | List[str], load_saved_model: bool = False, checkpoint_path: str | None = None, cores: int = 4, min_word_count: int = 10, window: int = 5, vector_size: int = 300, sample: float = 6e-05, min_alpha: float = 0.007, negative: int = 20, epochs: int = 30, seed: int = 42, vocab_exclude_list: List[str] = [])[source]
Bases:
Step
The embedding generator provides an interface to create word embeddings or vector representations of words (tokens).
The embedding generator uses the Word2Vec model from the Gensim library.
Note
Avoid the direct instantiation of the EmbeddingGenerator class and instead instantiate one of the child classes depending on hardware configuration.
- fields
The name of the column(s) containing text from which to generate embeddings.
- Type:
Union[str, List[str]]
- load_saved_model
Whether to load a saved Word2Vec model or train a new one.
- Type:
bool
- checkpoint_path
Path to the saved model checkpoint if load_saved_model is True.
- Type:
str
- cores
Number of CPU cores to use for training.
- Type:
int
- min_word_count
Ignore all words with a total frequency lower than this.
- Type:
int
- window
Maximum distance between the current and predicted word within a sentence.
- Type:
int
- vector_size
Dimensionality of the word vectors.
- Type:
int
- sample
The threshold for configuring which higher-frequency words are randomly downsampled.
- Type:
float
- min_alpha
Learning rate will linearly drop to min_alpha as training progresses.
- Type:
float
- negative
If > 0, specifies how many “noise words” should be drawn.
- Type:
int
- epochs
Number of iterations (epochs) over the corpus.
- Type:
int
- seed
Seed for the random number generator.
- Type:
int
- vocab_exclude_list
List of words to force exclude from the vocabulary.
- Type:
List[str]
- set_write_config(data_config: DataWriteConfig | None = None, artifacts_config: EmbeddingGeneratorArtifactsWriteConfig | None = None)[source]
Overwrite the default file writing configurations
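A minimal sketch, assuming the "text" column already holds lists of tokens (e.g., produced by the pre-tokenizer); the column name and hyperparameter values are illustrative:

    import pyarrow as pa

    from bardi.nlp_engineering.embedding_generator import CPUEmbeddingGenerator

    data = pa.Table.from_pydict(
        {"text": [["result", "negative"], ["her-2", "positive"]]}
    )

    generator = CPUEmbeddingGenerator(
        fields="text",
        cores=2,
        min_word_count=1,  # toy corpus, so keep every token
        vector_size=50,
        epochs=5,
    )
    data, artifacts = generator.run(data, artifacts={})
    # artifacts now contains "embedding_matrix" and "id_to_token"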
- class bardi.nlp_engineering.embedding_generator.EmbeddingGeneratorArtifactsWriteConfig[source]
Bases:
TypedDict
Indicates the keys and data types expected in an artifacts write config dict for the embedding generator if overwriting the default configuration.
- embedding_matrix_format: str
- embedding_matrix_format_args: dict
- vocab_format: str
- vocab_format_args: dict
bardi vocab_encoder module
Apply a vocab mapping converting a list of tokens in a column into a list of integers
- class bardi.nlp_engineering.vocab_encoder.CPUVocabEncoder(*args, **kwargs)[source]
Bases:
VocabEncoder
The vocab encoder uses a vocab to map lists of tokens to lists of integer IDs.
Note
This implementation of the VocabEncoder is specific for CPU computation.
- fields
the name of the column containing a list of tokens that will be mapped to integers using a vocab
- Type:
Union[str, List[str]]
- field_rename
optional ability to rename the supplied field with the field_rename value
- Type:
str
- id_to_token
optional vocabulary in the form of {id: token} that will be used to map the tokens to integers. This is optional for the construction of the object, and can alternatively be provided in the run method. This flexibility handles the use of a pre-existing vocab versus creating a vocab during a pipeline run.
- Type:
dict
- concat_fields
indicate if you would like for fields to be concatenated into a single column or left as separate columns
- Type:
bool
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: vocabencoder__<field>
- Type:
Optional[bool]
- get_parameters()[source]
Retrieve the vocab encoder object configuration. Does not return the mapping (vocab), as it can be large.
- Returns:
a dictionary representation of the vocab encoder’s attributes
- Return type:
dict
- run(data: Table, artifacts: dict = None, id_to_token: dict = None) Table [source]
Run a vocab encoder using CPU computation
The vocab encoder relies on receiving a vocab to map. The vocab can be supplied in multiple ways:
id_to_token at object creation
contained in the pipeline artifacts dictionary passed to the run method referenced by the key, ‘id_to_token’
id_to_token in the run method
- Parameters:
data (PyArrow Table) – The data to be processed. The data must contain the column specified by field at object creation
artifacts (dict) – A dictionary of pipeline artifacts which contains a vocab referenced by the key, ‘id_to_token’
id_to_token (dict) – If a vocab wasn’t passed at object creation or through the pipeline artifacts dict, then it must be passed here as a final option
- Returns:
- A tuple with:
the first element holding a PyArrow Table of data processed with the vocab encoder
the second element of the tuple, intended for artifacts, is None
- Return type:
Tuple(PyArrow Table, dict)
- Raises:
AttributeError – The vocab (id_to_token) wasn’t supplied either at object creation or to the run method
TypeError – The run method was not supplied a PyArrow Table
- class bardi.nlp_engineering.vocab_encoder.VocabEncoder(fields: str | List[str], field_rename: str = None, id_to_token: dict = None, concat_fields: bool = False, retain_input_fields: bool = False)[source]
Bases:
Step
The vocab encoder uses a vocab to map lists of tokens to lists of integer IDs.
Note
Avoid the direct instantiation of the VocabEncoder class and instead instantiate one of the child classes depending on hardware configuration
- fields
the name of the column containing a list of tokens that will be mapped to integers using a vocab
- Type:
Union[str, List[str]]
- field_rename
optional ability to rename the supplied field with the field_rename value
- Type:
str
- id_to_token
optional vocabulary in the form of {id: token} that will be used to map the tokens to integers. This is optional for the construction of the object, and can alternatively be provided in the run method. This flexibility handles the use of a pre-existing vocab versus creating a vocab during a pipeline run.
- Type:
dict
- concat_fields
indicate if you would like for fields to be concatenated into a single column or left as separate columns
- Type:
bool
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: vocabencoder__<field>
- Type:
Optional[bool]
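A minimal sketch, assuming the vocab is supplied through the pipeline artifacts dict under the key 'id_to_token' (e.g., produced by the embedding generator); the toy table and vocab are illustrative:

    import pyarrow as pa

    from bardi.nlp_engineering.vocab_encoder import CPUVocabEncoder

    data = pa.Table.from_pydict({"text": [["result", "negative"]]})
    id_to_token = {0: "result", 1: "negative"}

    encoder = CPUVocabEncoder(fields="text")
    encoded_data, _ = encoder.run(data, artifacts={"id_to_token": id_to_token})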
bardi tokenizer_encoder module
Apply a tokenizer to the provided text fields
- class bardi.nlp_engineering.tokenizer_encoder.CPUTokenizerEncoder(*args, **kwargs)[source]
Bases:
TokenizerEncoder
Implementation of the TokenizerEncoder for CPU computation
Note
This implementation of the TokenizerEncoder is specific for CPU computation.
- fields
the name of the column(s) containing text to be tokenized
- Type:
Union[str, List[str]]
- return_tensors
type of tensors to return, ‘np’ for NumPy arrays, ‘pt’ for PyTorch tensors or ‘tf’ for TensorFlow
- Type:
str
- concat_fields
whether the text fields should be concatenated into a single text field, defaults to False
- Type:
bool
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: CPUTokenizerEncoder_input__<field>
- Type:
Optional[bool]
- retain_concat_field
If True, will retain the concatenation of the fields specified in fields under the name specified in field_rename or ‘text’ if not specified. If concat_fields is not True, this parameter will have no effect
- Type:
Optional[bool]
- field_rename
optional ability to rename the supplied field with the field_rename value
- Type:
Optional[str]
- hf_cache_dir
local directory where the HF pretrained tokenizers are stored
- Type:
Optional[str]
- model_name
name of a tokenizer file or folder. Not required if TokenizerTrainer is a prior step - tokenizer will be passed in artifacts
- Type:
Optional[str]
- cores
number of CPU cores for multithreading the tokenizer
- Type:
Optional[int]
- tokenizer_params
provide fine-grained settings for applying tokenizer
- Type:
Optional[TokenizerConfig]
- tokenizer_model
Tokenizer object passed through artifacts from TokenizerTrainer or read from file specified in model_name
- Type:
transformers.PreTrainedTokenizerBase
- get_parameters()[source]
Retrieve the tokenizer encoder object configuration. Does not return the tokenizer model, as it can be large.
- Returns:
a dictionary representation of the tokenizer encoder’s attributes
- Return type:
dict
- run(data: Table, artifacts: dict = None) Table [source]
Run a tokenizer encoder based on provided configuration
The tokenizer encoder relies on receiving a tokenizer to apply to the text. The tokenizer can be supplied in multiple ways:
referencing the model_name at object creation
contained in the pipeline artifacts dictionary passed to the run method referenced by the key, ‘tokenizer_model’
- Parameters:
data (PyArrow Table) – The data to be processed. The data must contain the column specified by field at object creation
artifacts (dict) – A dictionary of pipeline artifacts which contains a tokenizer referenced by the key, ‘tokenizer_model’
- Returns:
The first element holding a PyArrow Table of data processed with the tokenizer encoder. The second element of the tuple intended for artifacts is None.
- Return type:
Tuple[PyArrow Table, dict]
- Raises:
AttributeError – The tokenizer wasn’t supplied either at object creation or to the run method
TypeError – The run method was not supplied a PyArrow Table
- class bardi.nlp_engineering.tokenizer_encoder.TokenizerEncoder(fields: str | List[str], return_tensors: str = 'np', concat_fields: bool = False, retain_input_fields: bool = False, retain_concat_field: bool = False, field_rename: str | None = None, hf_cache_dir: str | None = None, model_name: str | None = None, cores: int | None = None, tokenizer_params: dict | None = None)[source]
Bases:
Step
The tokenizer encoder uses a trained tokenizer to split text into tokens.
Note
Avoid the direct instantiation of the TokenizerEncoder class and instead instantiate one of the child classes depending on hardware configuration.
- fields
the name of the column(s) containing text to be tokenized
- Type:
Union[str, List[str]]
- return_tensors
type of tensors to return, ‘np’ for NumPy arrays, ‘pt’ for PyTorch tensors or ‘tf’ for TensorFlow
- Type:
str
- concat_fields
whether the text fields should be concatenated into a single text field, defaults to False
- Type:
bool
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: TokenizerEncoder__<field>
- Type:
Optional[bool]
- retain_concat_field
If True, will retain the concatenation of the fields specified in fields under the name specified in field_rename or ‘text’ if not specified. If concat_fields is not True, this parameter will have no effect
- Type:
Optional[bool]
- field_rename
optional ability to rename the supplied field with the field_rename value
- Type:
Optional[str]
- hf_cache_dir
local directory where the HF pretrained tokenizers are stored
- Type:
Optional[str]
- model_name
name of a tokenizer file or folder. Not required if TokenizerTrainer is a prior step - tokenizer will be passed in artifacts
- Type:
Optional[str]
- cores
number of CPU cores for multithreading the tokenizer
- Type:
Optional[int]
- tokenizer_params
provide fine-grained customization for any valid HuggingFace Tokenizer parameter through a dictionary
- Type:
Optional[dict]
- tokenizer_model
Tokenizer object passed through artifacts from TokenizerTrainer or read from file specified in model_name
- Type:
transformers.PreTrainedTokenizerBase
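A minimal sketch, assuming a trained tokenizer saved at "./artifacts/tokenizer" (e.g., written by a prior TokenizerTrainer step; the path and column name are illustrative):

    import pyarrow as pa

    from bardi.nlp_engineering.tokenizer_encoder import CPUTokenizerEncoder

    data = pa.Table.from_pydict({"text": ["result negative", "her-2 positive"]})

    encoder = CPUTokenizerEncoder(
        fields="text",
        model_name="./artifacts/tokenizer",  # not needed if TokenizerTrainer ran earlier in the pipeline
        return_tensors="np",
    )
    encoded_data, _ = encoder.run(data)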
bardi label_processor module
Encode label columns into numerical representations
- class bardi.nlp_engineering.label_processor.CPULabelProcessor(*args, **kwargs)[source]
Bases:
LabelProcessor
The label processor creates and maps a label vocab.
Note
This implementation of the LabelProcessor is specific for CPU computation.
- fields
The name(s) of label column(s) of which the values are used to generate a standardized mapping and then that mapping is applied to the column(s).
- Type:
Union[str, List[str]]
- method
Currently only a default ‘unique’ method is supported which maps each unique value in the column to an id.
- Type:
str
- mapping
Mapping dict of the form {label: id} used to convert labels in the column to ids.
- Type:
dict
- id_to_label
The reverse of mapping. Of the form {id: label} used downstream to map the ids back to the original label values.
- Type:
dict
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: labelprocessor__<field>
- Type:
Optional[bool]
- id_to_label
If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}
- Type:
Optional[dict]
- get_parameters() dict [source]
Retrieve the label processor object configuration.
Does not return the mapping (vocab), but does return the id_to_label dict. This is because the mapping is just the reverse of id_to_label.
- Returns:
a dictionary representation of the label processor object’s attributes
- Return type:
dict
- run(data: Table, artifacts: dict | None = None, id_to_label: dict | None = None) Tuple[Table, dict] [source]
Run a label processor using CPU computation
- Parameters:
data (PyArrow Table) – The data to be processed. The data must contain the column specified by ‘field’ at object creation
artifacts (dict) – artifacts are not used in this run method, but must be received to operate correctly in the pipeline run method
id_to_label (Optional[dict]) – If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}
- Returns:
The first position is a pyarrow.Table of processed data. The second position is a dictionary of artifacts. The dict will contain a key for “id_to_label”.
- Return type:
Tuple[pyarrow.Table, dict]
- Raises:
NotImplementedError – A value other than ‘unique’ was provided for the label processor’s method
TypeError – The run method was not supplied a PyArrow Table
- write_artifacts(write_path: str, artifacts: dict) None [source]
Write the outputs produced by the label_processor
- Parameters:
write_path (str) – Path is a directory where files will be written
artifacts (dict) – Artifacts is a dictionary of artifacts produced in this step. Expected key is: “id_to_label”
- class bardi.nlp_engineering.label_processor.LabelProcessor(fields: str | List[str], method: str = 'unique', retain_input_fields: bool = False, id_to_label: dict = None)[source]
Bases:
Step
The label processor encodes label columns into numerical representations and provides a mapping for each label to its respective representation.
Note
Avoid the direct instantiation of the LabelProcessor class and instead instantiate one of the child classes.
- fields
The name of a label column of which the values are used to generate a standardized mapping and then that mapping is applied to the column.
- Type:
Union[str, List[str]]
- method
Currently only a default ‘unique’ method is supported which maps each unique value in the column to an id.
- Type:
str
- retain_input_fields
If True, will retain the original contents of the fields specified in fields under the new names of: labelprocessor__<field>
- Type:
Optional[bool]
- id_to_label
If an id_to_label already exists, it can be directly applied. id_to_label is a dict of the form {field: {id: label}}
- Type:
Optional[dict]
- set_write_config(data_config: DataWriteConfig | None = None, artifacts_config: LabelProcessorArtifactsWriteConfig | None = None)[source]
Overwrite the default file writing configurations
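A minimal sketch, assuming a toy table with a single "label" column (the column name and values are illustrative):

    import pyarrow as pa

    from bardi.nlp_engineering.label_processor import CPULabelProcessor

    data = pa.Table.from_pydict({"label": ["benign", "malignant", "benign"]})

    processor = CPULabelProcessor(fields="label")
    data, artifacts = processor.run(data)
    # artifacts["id_to_label"] maps the generated ids back to the original labels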
- class bardi.nlp_engineering.label_processor.LabelProcessorArtifactsWriteConfig[source]
Bases:
TypedDict
Indicates the keys and data types expected in an artifacts write config dict for the label processor if overwriting the default configuration.
- id_to_label_args: dict | None
- id_to_label_format: str
bardi splitter module
Segment the dataset into splits, such as test, train, and val
- class bardi.nlp_engineering.splitter.CPUSplitter(*args, **kwargs)[source]
Bases:
Splitter
The splitter adds a ‘split’ column to the data assigning each record to a particular split for downstream model training.
Two split types are available - creating a new random split from scratch and assigning previously created splits. The second option is helpful when running comparisons with other methods of data processing, ensuring that splits are exactly the same.
Note
This implementation of the Splitter is specific for CPU computation.
To create a splitter, pass the appropriate set of parameters through a defined NamedTuple for the type of split you want to create. E.g.,

CPUSplitter(split_method=NewSplit(
    split_proportions={'train': 0.70, 'test': 0.15, 'val': 0.15},
    unique_record_cols=['document_id'],
    group_cols=['patient_id_number', 'registry'],
    label_cols=['reportability'],
    random_seed=42
))
Splitter Method Named Tuples:
- split_method
A named tuple of either MapSplit type or NewSplit type. Each contains a different set of values used to create the splitter depending upon split type
- split_type
- The type of split to be performed:
new - create a new random data split
map - reproduce an existing data split by mapping unique IDs
- Type:
str
- unique_record_cols
List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields are required.
Note: This set of columns MUST create a unique record or the program will crash.
- Type:
List[str]
- split_mapping
Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,
hash(concat(*unique_record_cols))
The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.
- Type:
dict[str, str]
- split_proportions
Only used for new split type. Mapping of split names to split proportions. i.e.,
{'train': 0.70, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}
Note: values must add to 1.0.
- Type:
dict[str, float]
- num_splits
The number of splits contained in split_proportions.
- Type:
int
- group_cols
List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.
- Type:
List[str]
- label_cols
List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.
- Type:
List[str]
- random_seed
Required for reproducibility. If you have no preference, try on 42 for size.
- Type:
int
- run(data: Table, artifacts: dict = None) Tuple[Table, dict] [source]
Runs a splitter using CPU computation based on the configuration used to create the object of the CPUSplitter class
- Parameters:
data (PyArrow Table) – The data to be split
artifacts (dict) – artifacts are not consumed in this run method, but must be received to operate correctly in the pipeline run method
- Returns:
- A tuple with:
the first element holding a PyArrow Table of data including the new split column
the second element of the tuple intended for artifacts is None
- Return type:
Tuple(PyArrow.Table, dict)
- Raises:
TypeError – The run method was not supplied a PyArrow Table
- class bardi.nlp_engineering.splitter.MapSplit(unique_record_cols: List[str], split_mapping: Dict[str, str], default_split_value: str)
Bases:
tuple
Specify the requirements for splitting data exactly in line with an existing data split
- default_split_value: str
Only used for map split type. A value to be used for split when the unique_record_cols cannot be found in the mapping.
- split_mapping: Dict[str, str]
Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,
hash(concat(*unique_record_cols))
The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.
- unique_record_cols: List[str]
List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields are required.
Note: This set of columns MUST create a unique record or the program will crash.
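A schematic sketch of reproducing an existing split; the dictionary keys shown are placeholders standing in for the actual hashes of the concatenated unique_record_cols values, which must match however the original split was produced:

    from bardi.nlp_engineering.splitter import CPUSplitter, MapSplit

    split_mapping = {
        "<hash of 'doc-001'>": "train",  # placeholder keys; real keys are
        "<hash of 'doc-002'>": "test",   # hash(concat(*unique_record_cols))
    }

    splitter = CPUSplitter(split_method=MapSplit(
        unique_record_cols=["document_id"],
        split_mapping=split_mapping,
        default_split_value="train",
    ))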
- class bardi.nlp_engineering.splitter.NewSplit(split_proportions: Dict[str, float], unique_record_cols: List[str], group_cols: List[str], label_cols: List[str], random_seed: int)
Bases:
tuple
Specify the requirements for splitting data with a new split from scratch.
- group_cols: List[str]
List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.
- label_cols: List[str]
List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.
- random_seed: int
Required for reproducibility. If you have no preference, try on 42 for size.
- split_proportions: Dict[str, float]
Only used for new split type. Mapping of split names to split proportions. i.e.,
{'train': 0.70, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}
Note: values must add to 1.0.
- unique_record_cols: List[str]
List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields are required.
Note: This set of columns MUST create a unique record or the program will crash.
- class bardi.nlp_engineering.splitter.Splitter(split_method: MapSplit | NewSplit)[source]
Bases:
Step
The splitter adds a ‘split’ column to the data assigning each record to a particular split for downstream model training.
Note
Avoid the direct instantiation of the Splitter class and instead instantiate one of the child classes.
- split_method
A named tuple of either MapSplit type or NewSplit type. Each contains a different set of values used to create the splitter depending upon split type
- split_type
- The type of split to be performed:
new - create a new random data split
map - reproduce an existing data split by mapping unique IDs
- Type:
str
- unique_record_cols
List of column names of which the combination forms a unique identifier in the dataset. This can be a single column name if that creates a unique identifier, but oftentimes in datasets a combination of fields are required.
Note: This set of columns MUST create a unique record or the program will crash.
- Type:
List[str]
- split_mapping
Only used for map split type. A dictionary mapping where the keys are the hash of the concatenated values from unique_record_cols, or represented by the following pseudocode,
hash(concat(*unique_record_cols))
The values are the corresponding split value (train, test, val) or (fold1, fold2, fold3), etc.
- Type:
dict[str, str]
- split_proportions
Only used for new split type. Mapping of split names to split proportions. i.e.,
{'train': 0.70, 'test': 0.15, 'val': 0.15}
{'fold1': 0.25, 'fold2': 0.25, 'fold3': 0.25, 'fold4': 0.25}
Note: values must add to 1.0.
- Type:
dict[str, float]
- num_splits
The number of splits contained in split_proportions.
- Type:
int
- group_cols
List of column names that form a ‘group’ that you would like to keep in discrete splits. E.g., if you had multiple medical notes for a single patient, you may desire that all notes for a single patient end up in the same split to prevent potential information leakage. In this case you would provide something like a patient_id.
- Type:
List[str]
- label_cols
List of column names containing labels. Efforts are made to balance label distribution across splits, but this is not guaranteed.
- Type:
List[str]
- random_seed
Required for reproducibility. If you have no preference, try on 42 for size.
- Type:
int
RegEx Package
bardi regex_set module
Define RegexSet and RegexSubPair blueprints
- class bardi.nlp_engineering.regex_library.regex_set.RegexSet[source]
Bases:
object
Blueprint for creating a configurable, domain-specific regular expression set
- regex_set
a list of regular expression substitution pairs
- Type:
List[RegexSubPair]
- get_regex_set(lowercase_substitution=False, no_substitution=False) List[RegexSubPair] [source]
Return the ordered set of regular expressions
- lowercase_substitution
If True, all substitution tokens like DATETOKEN will be returned in lowercase, e.g., datetoken. Defaults to False.
- Type:
Optional[bool]
- no_substitution
If True, all regular expressions that remove a matched pattern will replace it with a space instead of a special token. Defaults to False.
- Type:
Optional[bool]
- Returns:
a list of regular expression substitution pairs
- Return type:
List[RegexSubPair]
- class bardi.nlp_engineering.regex_library.regex_set.RegexSubPair[source]
Bases:
TypedDict
Dictionary used for regular expression string substitutions
Example of a regex sub pair dictionary:
{ "regex_str": r"\s", "sub_str": "WHITESPACE" }
- regex_str
regular expression pattern
- Type:
str
- sub_str
replacement value for matched string
- Type:
str
- regex_str: str
- sub_str: str
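Since a RegexSubPair is just a typed dict, an individual pair can be applied directly with Python's re module; a small self-contained illustration:

    import re

    # Collapse runs of whitespace into a single space
    regex_sub_pair = {"regex_str": r"\s+", "sub_str": " "}

    cleaned = re.sub(
        regex_sub_pair["regex_str"],
        regex_sub_pair["sub_str"],
        "too   many\t spaces",
    )
    # cleaned == "too many spaces"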
bardi provided regex sets
Curated set of regular expressions for cleaning text from pathology reports.
- class bardi.nlp_engineering.regex_library.pathology_report.PathologyReportRegexSet(convert_escape_codes: bool = True, handle_whitespaces: bool = True, remove_urls: bool = True, remove_special_punct: bool = True, remove_multiple_punct: bool = True, handle_angle_brackets: bool = True, replace_percent_sign: bool = True, handle_leading_digit_punct: bool = True, remove_leading_punct: bool = True, remove_trailing_punct: bool = True, handle_words_with_punct_spacing: bool = True, handle_math_spacing: bool = True, handle_dimension_spacing: bool = True, handle_measure_spacing: bool = True, handle_cassettes_spacing: bool = True, handle_dash_digit_spacing: bool = True, handle_literals_floats_spacing: bool = True, fix_pluralization: bool = True, handle_digits_words_spacing: bool = True, remove_phone_numbers: bool = True, remove_dates: bool = True, remove_times: bool = True, remove_addresses: bool = True, remove_dimensions: bool = True, remove_specimen: bool = True, remove_decimal_seg_numbers: bool = True, remove_large_digits_seq: bool = True, remove_large_floats_seq: bool = True, trunc_decimals: bool = True, remove_cassette_names: bool = True, remove_duration_time: bool = True, remove_letter_num_seq: bool = True)[source]
Bases:
RegexSet
The PathologyReportRegexSet includes a set of standard regular expressions to normalize a pathology report.
Note
The set of regular expressions tailored for pathology reports was crafted with the understanding that dividing text based on punctuation often results in the loss of crucial information, e.g., terms like “her-2” should not be split. However, to ensure that the number of unique tokens remains manageable, we employ a number of regular expressions to separate some tokens around punctuation, e.g., 22-years becomes 22 years. This consideration is particularly important when employing the word2vec algorithm, as an excessive number of tokens can impede the model’s effectiveness by diluting the representation of key concepts.
- convert_escape_codes
Removes escape codes such as x0d, x0a, etc.
- Type:
bool
- handle_whitespaces
Removes extra whitespaces: any new line, carriage return, or tab.
- Type:
bool
- remove_urls
Removes URLs found in the text that match the pattern.
- Type:
bool
- remove_special_punct
Removes special punctuation like (?,$).
- Type:
bool
- remove_multiple_punct
Removes duplicated punctuation. E.g.
---
- Type:
bool
- handle_angle_brackets
Removes angle brackets. E.g. <title> becomes title.
- Type:
bool
- replace_percent_sign
Replaces a percent sign with a ‘percent’ word.
- Type:
bool
- handle_leading_digit_punct
Removes punctuation when a digit is attached to a word. E.g. 22-years becomes 22 years.
- Type:
bool
- remove_leading_punct
Removes leading punctuation from words. E.g. -result becomes result.
- Type:
bool
- remove_trailing_punct
Removes trailing punctuation from words. E.g. result- becomes result.
- Type:
bool
- handle_words_with_punct_spacing
Matches words with hyphen, colon or period and splits them.
- Type:
bool
- handle_math_spacing
Matches “math operator symbols” like ><=%: and adds spaces around them.
- Type:
bool
- handle_dimension_spacing
Matches digits and x and adds spaces between them.
- Type:
bool
- handle_measure_spacing
Matches measurements in mm, cm, and ml and provides proper spacing between the digits and the measure.
- Type:
bool
- handle_cassettes_spacing
Matches patterns like 5e-6f and adds spaces around them.
- Type:
bool
- handle_dash_digit_spacing
Matches dashes around digits and adds spaces around the dashes.
- Type:
bool
- handle_literals_floats_spacing
Matches a character followed by a float and a word. This is a common formatting problem. E.g. r18.0admission becomes r18.0 admission.
- Type:
bool
- fix_pluralization
Matches an s character after a word and attaches it back to the word. This restores plural nouns damaged by punctuation removal.
- Type:
bool
- handle_digits_words_spacing
Matches digits that are attached to the beginning of a word.
- Type:
bool
- remove_phone_numbers
Matches any phone number that consists of 10 digits with delimiters.
- Type:
bool
- remove_dates
Removes dates of prespecified format.
- Type:
bool
- remove_times
Matches time of format 11:20 am or 1.30pm or 9:52:07AM.
- Type:
bool
- remove_addresses
Matches any address of the format: number, street name (1 to 6 words), 2-letter state, and a short or long zip code.
- Type:
bool
- remove_dimensions
Matches 2D or 3D dimension measurements and adds spaces around them.
- Type:
bool
- remove_specimen
Matches marking of a pathology specimen.
- Type:
bool
- remove_decimal_seg_numbers
Matches combinations of digits and periods or dashes. E.g. 1.78.9.87.
- Type:
bool
- remove_large_digits_seq
Matches large sequences of digits (3 or more) and replaces them.
- Type:
bool
- remove_large_floats_seq
Matches large floats and replaces them.
- Type:
bool
- trunc_decimals
Matches floats and keeps only the first decimal.
- Type:
bool
- remove_cassette_names
Removes pathology samples’ markings. E.g. 1-e.
- Type:
bool
- remove_duration_time
Removes the duration a specimen was treated. E.g. 32d09090301.
- Type:
bool
- remove_letter_num_seq
Removes a character followed directly by 6 to 10 digits.
- Type:
bool
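A minimal sketch of pairing the curated set with the normalizer; every toggle defaults to True, so only deviations need to be named (the disabled flag and the field name here are illustrative):

    from bardi.nlp_engineering.normalizer import CPUNormalizer
    from bardi.nlp_engineering.regex_library.pathology_report import PathologyReportRegexSet

    # Build the curated set, keeping URLs intact for this run
    regex_set = PathologyReportRegexSet(remove_urls=False).get_regex_set()

    normalizer = CPUNormalizer(fields="text", regex_set=regex_set, lowercase=True)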
bardi provided regex library
Library of pre-defined regular expression substitution pairs.
- bardi.nlp_engineering.regex_library.regex_lib.get_address_regex() RegexSubPair [source]
Matches any address of the format: number, street name (1 to 6 words), 2-letter state, and a short or long zip code.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
1034 north 500 west provo ut 84604-3337
Output string:
ADDRESSTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_angle_brackets_regex() RegexSubPair [source]
Matches content between matching angle brackets, keeps the content only, and removes the brackets.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
<This should be fixed> But not this >90
Output string:
This should be fixed But not this >90
- bardi.nlp_engineering.regex_library.regex_lib.get_cassette_name_regex() RegexSubPair [source]
Matches cassette markings of the specified format.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
block: 1-e
Output string:
block: CASSETTETOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_cassettes_spacing_regex() RegexSubPair [source]
Matches patterns like 5e-6f and adds spaces around them.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Examples
Input string:
3e-3f
Output string:
3e - 3f
- bardi.nlp_engineering.regex_library.regex_lib.get_dash_digits_spacing_regex() RegexSubPair [source]
Matches dashes around digits and adds spaces around the dashes.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
right 1:30-2:30 1.5-2.0 cm 0.9 cm for the 7-6
Output string:
right 1:30 - 2:30 1.5 - 2.0 cm 0.9 cm for the 7 - 6
- bardi.nlp_engineering.regex_library.regex_lib.get_dates_regex() RegexSubPair [source]
Matches dates of specified formats.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
co: 03/09/2021 1015 completed: 03/10/21 at 3:34.
Output string:
co: DATETOKEN completed: DATETOKEN .
- bardi.nlp_engineering.regex_library.regex_lib.get_decimal_segmented_numbers_regex() RegexSubPair [source]
Matches combinations of digits and periods or dashes.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
1.78.9.87
Output string:
DECIMALSEGMENTEDNUMBERTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_digits_words_spacing_regex() RegexSubPair [source]
Matches digits that are attached to the beginning of a word.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
9837648admission
Output string:
9837648 admission
- bardi.nlp_engineering.regex_library.regex_lib.get_dimension_spacing_regex() RegexSubPair [source]
Matches digits and x and adds spaces between them.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
measuring 1.3x0.7x0.1 cm
Output string:
measuring 1.3 x 0.7 x 0.1 cm
- bardi.nlp_engineering.regex_library.regex_lib.get_dimensions_regex() RegexSubPair [source]
Matches 2D or 3D dimension measurements and adds spaces around them.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
3.5 x 2.5 x 9.0 cm and 33 x 6.5 cm
Output string:
DIMENSIONTOKEN cm and DIMENSIONTOKEN cm
- bardi.nlp_engineering.regex_library.regex_lib.get_duration_regex() RegexSubPair [source]
Matches the duration a specimen was treated.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
32d0909091
Output string:
DURATIONTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_escape_code_regex() RegexSubPair [source]
Matches escape codes such as x0d, x0a, etc.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
Codes\x0d\x0a\x0d \r30
Output string:
Codes 30
- bardi.nlp_engineering.regex_library.regex_lib.get_fix_pluralization_regex() RegexSubPair [source]
Matches an s character after a word and attaches it back to the word. This restores plural nouns damaged by punctuation removal.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
specimen s code s
Output string:
specimens codes
- bardi.nlp_engineering.regex_library.regex_lib.get_large_digits_seq_regex() RegexSubPair [source]
Matches large sequences of digits and replaces them.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
456123456
Output string:
DIGITSEQUENCETOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_large_float_seq_regex() RegexSubPair [source]
Matches large floats and replaces them.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
456 123456.783
Output string:
456 LARGEFLOATTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_leading_digit_punctuation_regex() RegexSubPair [source]
Matches numeric digits at the start of a word, followed by punctuation and additional characters. Proceeds to eliminate the punctuation and inserts a space between the digits and the word.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
13-unremarkable 1-e 22-years
Output string:
13 unremarkable 1 e 22 years
- bardi.nlp_engineering.regex_library.regex_lib.get_leading_punctuation_regex() RegexSubPair [source]
Matches leading punctuation and removes it.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
-3a -anterior -result- :cassette
Output string:
3a anterior result- cassette
- bardi.nlp_engineering.regex_library.regex_lib.get_letter_num_seq_regex() RegexSubPair [source]
Matches a character followed directly by 6 to 10 digits:
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
c001234567
Output string:
LETTERDIGITSTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_literals_floats_spacing_regex() RegexSubPair [source]
Matches a character followed by a float and a word. This is a common formatting problem. E.g. r18.0admission -> r18.0 admission
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
r18.0admission diagnosis: bi n13.30admission
Output string:
r18.0 admission diagnosis: bi n13.30 admission
- bardi.nlp_engineering.regex_library.regex_lib.get_math_spacing_regex() RegexSubPair [source]
Matches “math operator symbols” like ><=%: and adds spaces around them.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
This is >95% 3+3=8 6/7
Output string:
This is > 95 % 3 + 3 = 8 6 / 7
- bardi.nlp_engineering.regex_library.regex_lib.get_measure_spacing_regex() RegexSubPair [source]
Matches measurements in mm, cm, and ml and provides proper spacing between the digits and the measure. Also provides spacing for ordinals, e.g. 11th -> 11 th.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
10mm histologic type 2 x 3cm. this is 3.0-cm
Output string:
10 mm histologic type 2 x 3 cm . this is 3.0 cm
- bardi.nlp_engineering.regex_library.regex_lib.get_multiple_punct_regex() RegexSubPair [source]
Matches multiple occurrences of symbols like -, . and _ and replaces them with a single space.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
-----this is report ___ signature
Output string:
this is report signature
- bardi.nlp_engineering.regex_library.regex_lib.get_percent_sign_regex() RegexSubPair [source]
Matches the % sign and replaces it with a word ‘percent’.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
strong intensity >95%
Output string:
strong intensity >95 percent
- bardi.nlp_engineering.regex_library.regex_lib.get_phone_number_regex() RegexSubPair [source]
Matches any phone number that consists of 10 digits with delimiters.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
Ph: (123) 456 7890. It is (123)4567890.
Output string:
Ph: PHONENUMTOKEN . It is PHONENUMTOKEN .
- bardi.nlp_engineering.regex_library.regex_lib.get_spaces_regex() RegexSubPair [source]
Matches additional spaces (an artifact of applying other regexes) and unneeded periods that can be removed.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
located around lower arm specimen date
Output string:
located around lower arm specimen date
- bardi.nlp_engineering.regex_library.regex_lib.get_special_punct_regex() RegexSubPair [source]
Matches a set of chosen punctuation symbols _,();[]#{}*”’~?!|^ and replaces them with a single space.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
wt-1, ck-7 (focal) negative; [sth] ab|cd"
Output string:
wt-1 ck-7 focal negative sth ab cd
- bardi.nlp_engineering.regex_library.regex_lib.get_specimen_regex() RegexSubPair [source]
Matches marking of a pathology specimen.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
for s-21-009345 sh-22-0011300
Output string:
for SPECIMENTOKEN SPECIMENTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_time_regex() RegexSubPair [source]
Matches time of format 11:20 am or 1.30pm or 9:52:07AM.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
at 11:12 pm or 11.12am
Output string:
at TIMETOKEN or TIMETOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_trailing_punctuation_regex() RegexSubPair [source]
Matches trailing punctuation and removes it.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
-3a -anterior -result- :cassette
Output string:
-3a -anterior -result :cassette
- bardi.nlp_engineering.regex_library.regex_lib.get_trunc_decimals_regex() RegexSubPair [source]
Matches floats and keeps only the first decimal.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
1.78 9.87 - 8.99
Output string:
1.7 9.8 - 8.9
- bardi.nlp_engineering.regex_library.regex_lib.get_urls_regex() RegexSubPair [source]
Matches a url and replaces it with a URLTOKEN.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
Source: https://www.merck.com/keytruda_pi.pdf
Output string:
Source: URLTOKEN
- bardi.nlp_engineering.regex_library.regex_lib.get_whitespace_regex() RegexSubPair [source]
Matches any new line, carriage return, tab, and multiple spaces and replaces them with a single space.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
INVASIVE:\nNegative IN SITU:\nN/A IN \tThe result\
Output string:
INVASIVE: Negative IN SITU: N/A IN The result
- bardi.nlp_engineering.regex_library.regex_lib.get_words_with_punct_spacing_regex() RegexSubPair [source]
Matches words with hyphen, colon or period and splits them. Requires the words to be at least two characters in length to avoid splitting words like ph.d.
- Return type:
RegexSubPair - (regex pattern, replacement string)
Example
Input string:
this-that her-2 tiff-1k description:gleason
Output string:
this that her-2 tiff-1k description gleason
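Each function in this library returns a ready-to-use RegexSubPair, so an individual pair can be tested in isolation with Python's re module; a small sketch using the whitespace pair (the sample string is illustrative):

    import re

    from bardi.nlp_engineering.regex_library import regex_lib

    pair = regex_lib.get_whitespace_regex()
    result = re.sub(pair["regex_str"], pair["sub_str"], "INVASIVE:\nNegative\tresult")
    # Expected, per the example above: "INVASIVE: Negative result"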