bardi.data package
bardi.data.data_handlers module
Dataset class definition and data handler functions for loading datasets from various sources
- class bardi.data.data_handlers.Dataset[source]
Bases: object
A dataset object handles data in the form of columns and rows
Under the hood it uses a PyArrow Table as it is a modern and efficient starting point for both CPU & GPU workflows.
- data
The data table
- Type:
PyArrow.Table | List[PyArrow.Table]
- origin_query
If a SQL data source was used by a data_handler function, the SQL query is recorded here for reproducibility and data provenance.
- Type:
str
- origin_file_path
If a file was used as the data source by a data_handler function, the filepath is recorded for reproducibility and data provenance.
- Type:
str
- origin_format
The format of the data source
- Type:
str
- origin_row_count
The total row count of the original dataset
- Type:
int
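For example, a minimal sketch of inspecting a Dataset’s provenance attributes (the data and column names are illustrative, and the Dataset is built here with the from_pandas handler documented below):
>>> import pandas as pd
>>> from bardi.data import data_handlers
>>> df = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})  # illustrative data
>>> dataset = data_handlers.from_pandas(df)
>>> type(dataset.data)
<class 'pyarrow.lib.Table'>
>>> dataset.origin_row_count
2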
- bardi.data.data_handlers.from_duckdb(path: str, query: str, min_batches: int = None) → Dataset [source]
Create a bardi Dataset object using data returned from a custom query on a DuckDB database
- Parameters:
path (str) – A filepath to the DuckDB database file
query (str) – A valid SQL query adhering to DuckDB syntax specifications
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes. This will typically align with the number of worker nodes.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
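A minimal usage sketch; the database file, table name, and query below are assumptions for illustration:
>>> from bardi.data import data_handlers
>>> dataset = data_handlers.from_duckdb(
...     path="example.duckdb",  # hypothetical DuckDB database file
...     query="SELECT id, text FROM records",  # hypothetical table
... )
>>> dataset.origin_query  # the query is retained for provenance
'SELECT id, text FROM records'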
- bardi.data.data_handlers.from_file(source: str | List[str], format: str, min_batches: int = None, *args, **kwargs) → Dataset [source]
Create a bardi Dataset object from a file source
Accepted file types are: parquet, ipc, arrow, feather, csv, and orc
The function utilizes PyArrow’s dataset API to read files, so additional keyword arguments supported by that API can be passed through here.
- Parameters:
source (str, List[str]) – Path to a single file, or list of paths
format (str) – Currently [“parquet”, “ipc”, “arrow”, “feather”, “csv”, “orc”] are supported
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes in distributed computing environments.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
- Raises:
ValueError – If the supplied file path does not contain a filetype of an accepted format.
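A minimal usage sketch for a parquet source; the file path is an assumption:
>>> from bardi.data import data_handlers
>>> dataset = data_handlers.from_file(
...     source="example.parquet",  # hypothetical file path
...     format="parquet",
... )
>>> dataset.origin_file_path  # recorded for provenance
'example.parquet'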
- bardi.data.data_handlers.from_json(json_data: Union[str, dict, List[dict]]) → Dataset [source]
Create a bardi Dataset object from JSON data
- Parameters:
json_data (str, dict, List[dict]) – An object of name/value pairs, or a list of such objects. Names will become columns in the PyArrow Table.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
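A minimal sketch building a Dataset from a list of records (the records themselves are illustrative):
>>> from bardi.data import data_handlers
>>> records = [
...     {"id": 1, "text": "first note"},
...     {"id": 2, "text": "second note"},
... ]
>>> dataset = data_handlers.from_json(records)
>>> dataset.data.column_names  # names become columns
['id', 'text']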
- bardi.data.data_handlers.from_pandas(df: DataFrame, min_batches: int = None) → Dataset [source]
Create a bardi Dataset object from a Pandas DataFrame using PyArrow’s built-in Pandas conversion
- Parameters:
df (Pandas DataFrame) – A Pandas DataFrame containing data intended to be passed into a bardi pipeline
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes. This will typically align with the number of worker nodes.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
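A minimal sketch, assuming an in-memory DataFrame; the min_batches value of 2 is an arbitrary illustration:
>>> import pandas as pd
>>> from bardi.data import data_handlers
>>> df = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["a", "b", "c", "d"]})
>>> dataset = data_handlers.from_pandas(df, min_batches=2)  # split for 2 workers
>>> dataset.origin_row_count
4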
- bardi.data.data_handlers.from_pyarrow(table: Table, min_batches: int = None) → Dataset [source]
Create a bardi dataset object from an existing PyArrow Table
- Parameters:
table (PyArrow Table) – A PyArrow Table containing data intended to be passed into a bardi pipeline
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes. This will typically align with the number of worker nodes.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
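A minimal sketch wrapping an existing table (the table contents are illustrative):
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> dataset = data_handlers.from_pyarrow(table)
>>> dataset.origin_row_count
2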
- bardi.data.data_handlers.to_pandas(table: Table) → DataFrame [source]
Return data as a pandas DataFrame
- Parameters:
table (PyArrow.Table) – Table of data you want to convert to a Pandas DataFrame
- Returns:
The same data as the input table, converted into a DataFrame
- Return type:
pandas.DataFrame
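A minimal sketch converting a table back to Pandas (illustrative data):
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> df = data_handlers.to_pandas(table)
>>> df.shape
(2, 2)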
- bardi.data.data_handlers.to_polars(table: Table) → DataFrame [source]
Return data as a polars DataFrame
- Parameters:
table (PyArrow.Table) – Table of data you want to convert to a Polars DataFrame
- Returns:
The same data as the input table, converted into a DataFrame
- Return type:
polars.DataFrame
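A minimal sketch converting a table to Polars (illustrative data):
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> pl_df = data_handlers.to_polars(table)
>>> pl_df.shape
(2, 2)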
- bardi.data.data_handlers.write_file(data: Table, path: str, format: str, *args, **kwargs) → None [source]
Write data to a file
Note
Only a subset of possible arguments are presented here. Additional arguments can be passed for specific file types. Reference PyArrow documentation for additional arguments.
- Parameters:
data (PyArrow Table) – Table of data to be written to the file
path (str) – Path in the filesystem where the data will be written
format (str) – Filetype, one of “parquet”, “feather”, “csv”, “orc”, “json”, “npy”
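A minimal sketch writing a table to parquet; the output path is an assumption:
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> data_handlers.write_file(
...     data=table,
...     path="output.parquet",  # hypothetical output path
...     format="parquet",
... )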