Welcome to bardi’s documentation!
bardi (Batch-processing Abstraction for Raw Data Integration) is a framework for building reproducible data pre-processing pipelines within machine learning workflows.
It emphasizes the following key aspects:
- Abstraction: By packaging common data pre-processing operations as modular components, bardi simplifies both the development and maintenance of complex data pipelines.
- Efficiency: Using Apache Arrow’s columnar memory model for data storage and Polars for computation, bardi speeds up processing through multithreading, making full use of available CPU resources.
- Modularity: bardi’s component-driven architecture lets users pick only the modules their workflow requires. Each module works both as a standalone unit and as a step within a complete pipeline.
- Extensibility: Designed with future growth in mind, bardi makes it straightforward to add new custom steps, extending its functionality to cover evolving data processing needs.
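The component-driven pattern described above can be sketched in plain Python. The class and method names below are illustrative assumptions for exposition, not bardi's actual API; see the tutorial sections for the real interfaces.

```python
from abc import ABC, abstractmethod


class Step(ABC):
    """One pre-processing component (hypothetical interface)."""

    @abstractmethod
    def run(self, data: list[str]) -> list[str]:
        ...


class Lowercase(Step):
    """Usable standalone or as a pipeline step."""

    def run(self, data: list[str]) -> list[str]:
        return [text.lower() for text in data]


class StripWhitespace(Step):
    def run(self, data: list[str]) -> list[str]:
        return [text.strip() for text in data]


class Pipeline:
    """Chains steps in order; each step sees the previous step's output."""

    def __init__(self) -> None:
        self.steps: list[Step] = []

    def add_step(self, step: Step) -> None:
        self.steps.append(step)

    def run(self, data: list[str]) -> list[str]:
        for step in self.steps:
            data = step.run(data)
        return data


pipeline = Pipeline()
pipeline.add_step(Lowercase())
pipeline.add_step(StripWhitespace())
result = pipeline.run(["  Hello World  ", "  BARDI  "])
# result == ["hello world", "bardi"]
```

Because every step shares one interface, a pipeline can be extended with custom steps without touching existing ones, which is the extensibility property described above.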
- Installation
- Basic Tutorial
- Preparing a Sample Set of Data
- Register the Sample Data as a Bardi Dataset
- Initialize a Pre-Processing Pipeline
- Adding a Normalizer to our Pipeline
- Adding a PreTokenizer
- Adding an EmbeddingGenerator
- Adding a VocabEncoder
- Adding a LabelProcessor
- Running the Pipeline
- Results
- Collecting Metadata
- Full Tutorial Script
- Advanced Tutorials
- bardi.pipeline
- bardi.data package
- bardi.nlp_engineering package