bardi.data package
bardi.data.data_handlers module
Dataset class definition and data handler functions for loading datasets from various sources
- class bardi.data.data_handlers.Dataset[source]
Bases: object
A dataset object handles data in the form of columns and rows
Under the hood it uses a PyArrow Table as it is a modern and efficient starting point for both CPU & GPU workflows.
- data
The data table
- Type:
PyArrow.Table | List[PyArrow.Table]
- origin_query
If a SQL data source was used by a data_handler function, the SQL query is recorded here for reproducibility and data provenance.
- Type:
str
- origin_file_path
If a file was used as the data source by a data_handler function, the filepath is recorded for reproducibility and data provenance.
- Type:
str
- origin_format
The format of the data source
- Type:
str
- origin_row_count
The total row count of the original dataset
- Type:
int
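For example, a minimal sketch of inspecting a Dataset’s provenance attributes (the data and column names are illustrative, and the Dataset is built here with the from_pandas handler documented below):
>>> import pandas as pd
>>> from bardi.data import data_handlers
>>> df = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})  # illustrative data
>>> dataset = data_handlers.from_pandas(df)
>>> type(dataset.data)
<class 'pyarrow.lib.Table'>
>>> dataset.origin_row_count
2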
- bardi.data.data_handlers.from_duckdb(path: str, query: str, min_batches: int = None) → Dataset [source]
Create a bardi Dataset object using data returned from a custom query on a DuckDB database
- Parameters:
path (str) – A filepath to the DuckDB database file
query (str) – A valid SQL query adhering to DuckDB syntax specifications
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes. This will typically align with the number of worker nodes.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
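A minimal usage sketch; the database file, table name, and query below are assumptions for illustration:
>>> from bardi.data import data_handlers
>>> dataset = data_handlers.from_duckdb(
...     path="example.duckdb",  # hypothetical DuckDB database file
...     query="SELECT id, text FROM records",  # hypothetical table
... )
>>> dataset.origin_query  # the query is retained for provenance
'SELECT id, text FROM records'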
- bardi.data.data_handlers.from_file(source: str | List[str], format: str, min_batches: int = None, *args, **kwargs) → Dataset [source]
Create a bardi Dataset object from a file source
Accepted file types are: parquet, ipc, arrow, feather, csv, and orc
The function utilizes PyArrow’s dataset API to read files, so additional keyword arguments supported by that API can be passed through here.
- Parameters:
source (str, List[str]) – Path to a single file, or list of paths
format (str) – Currently [“parquet”, “ipc”, “arrow”, “feather”, “csv”, “orc”] are supported
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes in distributed computing environments.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
- Raises:
ValueError – If the supplied file path does not contain a filetype of an accepted format.
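A minimal usage sketch for a parquet source; the file path is an assumption:
>>> from bardi.data import data_handlers
>>> dataset = data_handlers.from_file(
...     source="example.parquet",  # hypothetical file path
...     format="parquet",
... )
>>> dataset.origin_file_path  # recorded for provenance
'example.parquet'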
- bardi.data.data_handlers.from_json(json_data: Union[str, dict, List[dict]]) → Dataset [source]
Create a bardi Dataset object from JSON data
- Parameters:
json_data (str, dict, List[dict]) – An object of name/value pairs, or a list of such objects. Names will become columns in the PyArrow Table.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
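A minimal sketch building a Dataset from a list of records (the records themselves are illustrative):
>>> from bardi.data import data_handlers
>>> records = [
...     {"id": 1, "text": "first note"},
...     {"id": 2, "text": "second note"},
... ]
>>> dataset = data_handlers.from_json(records)
>>> dataset.data.column_names  # names become columns
['id', 'text']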
- bardi.data.data_handlers.from_pandas(df: DataFrame, min_batches: int = None) → Dataset [source]
Create a bardi Dataset object from a Pandas DataFrame using PyArrow’s built-in Pandas conversion
- Parameters:
df (Pandas DataFrame) – A Pandas DataFrame containing data intended to be passed into a bardi pipeline
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes. This will typically align with the number of worker nodes.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
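A minimal sketch, assuming an in-memory DataFrame; the min_batches value of 2 is an arbitrary illustration:
>>> import pandas as pd
>>> from bardi.data import data_handlers
>>> df = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["a", "b", "c", "d"]})
>>> dataset = data_handlers.from_pandas(df, min_batches=2)  # split for 2 workers
>>> dataset.origin_row_count
4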
- bardi.data.data_handlers.from_pyarrow(table: Table, min_batches: int = None) → Dataset [source]
Create a bardi dataset object from an existing PyArrow Table
- Parameters:
table (PyArrow Table) – A PyArrow Table containing data intended to be passed into a bardi pipeline
min_batches (int) – The number of smaller tables to split the data into for distribution to worker nodes. This will typically align with the number of worker nodes.
- Returns:
bardi Dataset object with the data attribute referencing the data that was supplied after conversion to a PyArrow Table.
- Return type:
Dataset
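A minimal sketch wrapping an existing table (the table contents are illustrative):
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> dataset = data_handlers.from_pyarrow(table)
>>> dataset.origin_row_count
2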
- bardi.data.data_handlers.to_pandas(table: Table) → DataFrame [source]
Return data as a pandas DataFrame
- Parameters:
table (PyArrow.Table) – Table of data you want to convert to a Pandas DataFrame
- Returns:
The same data as the input table, converted into a DataFrame
- Return type:
pandas.DataFrame
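A minimal sketch converting a table back to Pandas (illustrative data):
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> df = data_handlers.to_pandas(table)
>>> df.shape
(2, 2)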
- bardi.data.data_handlers.to_polars(table: Table) → DataFrame [source]
Return data as a polars DataFrame
- Parameters:
table (PyArrow.Table) – Table of data you want to convert to a Polars DataFrame
- Returns:
The same data as the input table, converted into a DataFrame
- Return type:
polars.DataFrame
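A minimal sketch converting a table to Polars (illustrative data):
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> pl_df = data_handlers.to_polars(table)
>>> pl_df.shape
(2, 2)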
- bardi.data.data_handlers.write_file(data: Table, path: str, format: str, *args, **kwargs) → None [source]
Write data to a file
Note
Only a subset of possible arguments are presented here. Additional arguments can be passed for specific file types. Reference PyArrow documentation for additional arguments.
- Parameters:
data (PyArrow Table) – Table of data to be written to the file
path (str) – Path in the filesystem where the data will be written
format (str) – Filetype, one of “parquet”, “feather”, “csv”, “orc”, “json”, “npy”
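A minimal sketch writing a table to parquet; the output path is an assumption:
>>> import pyarrow as pa
>>> from bardi.data import data_handlers
>>> table = pa.table({"id": [1, 2], "text": ["a", "b"]})
>>> data_handlers.write_file(
...     data=table,
...     path="output.parquet",  # hypothetical output path
...     format="parquet",
... )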