Core

blocks.core.assemble(path, cgroups=None, rgroups=None, read_args={}, cgroup_args={}, merge='inner', filesystem=<blocks.filesystem.GCSFileSystem object at 0x7ff7a4de5050>)

Assemble multiple dataframe blocks into a single frame

Each file included in the path (or subdirs of that path) is combined into a single dataframe by first concatenating over row groups and then merging over cgroups. The merges are performed in the order of the listed cgroups if provided, otherwise in alphabetical order. Files are opened by a method inferred from their extension.

Parameters:

path : str: The glob-able path to all data files to assemble into a frame (e.g. gs://example//, gs://example//part.0.pq, gs://example/c[1-2]/). See the README for a more detailed explanation.
cgroups : list of str, optional: The list of cgroups (folder names) to include from the glob path
rgroups : list of str, optional: The list of rgroups (file names) to include from the glob path
read_args : dict, optional: Any additional keyword args to pass to the read function
cgroup_args : {cgroup: kwargs}, optional: Any cgroup specific read arguments, where each key is the name of the cgroup and each value is a dictionary of keyword args
merge : one of ‘left’, ‘right’, ‘outer’, ‘inner’, default ‘inner’: The merge strategy to pass to pandas.merge
filesystem : blocks.filesystem.FileSystem or similar: A filesystem object that implements the blocks.FileSystem API

Returns:

data : pd.DataFrame: The combined dataframe from all the blocks
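
The order of operations described for assemble (concatenate row groups, then merge cgroups in order) can be illustrated with a plain-Python sketch. This is not the library's implementation: lists of row dicts stand in for dataframes, and the names assemble_sketch, concat_rgroups, inner_merge, the explicit key argument, and the sorting of row groups by file name are all assumptions made for illustration.

```python
from functools import reduce


def concat_rgroups(rgroups):
    """Stack the rows of every row group (assumed here: in file-name order)."""
    rows = []
    for name in sorted(rgroups):
        rows.extend(rgroups[name])
    return rows


def inner_merge(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def assemble_sketch(blocks, key, cgroups=None):
    """Concatenate row groups per cgroup, then merge the cgroups in order."""
    # Listed order if cgroups is given, otherwise alphabetical order.
    order = cgroups if cgroups is not None else sorted(blocks)
    frames = [concat_rgroups(blocks[c]) for c in order]
    return reduce(lambda l, r: inner_merge(l, r, key), frames)


# Hypothetical layout: {cgroup name: {rgroup file name: rows}}
blocks = {
    "c2": {"part.0.pq": [{"id": 1, "y": 10}], "part.1.pq": [{"id": 2, "y": 20}]},
    "c1": {"part.0.pq": [{"id": 1, "x": "a"}], "part.1.pq": [{"id": 3, "x": "c"}]},
}

result = assemble_sketch(blocks, key="id")
print(result)
```

With the inner merge used here, only rows whose key appears in every cgroup survive, which is why the join keys must be present in each cgroup.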

blocks.core.divide(df, path, n_rgroup=1, rgroup_offset=0, cgroup_columns=None, extension='.pq', convert=False, filesystem=<blocks.filesystem.GCSFileSystem object at 0x7ff7a25efd50>, prefix=None, **write_args)

Split a dataframe into rgroups/cgroups and save to disk

Note that this splitting does not preserve the original index, so make sure to have another column to track values

Parameters:

df : pd.DataFrame: The data to divide
path : str: Path to the directory (possibly on GCS) in which to place the columns
n_rgroup : int, default 1: The number of row groups to partition the data into. The rgroups will have approximately equal sizes.
rgroup_offset : int, default 0: The index to start from in the names of file parts, e.g. if rgroup_offset=10 then the first file will be part_00010.pq
cgroup_columns : {cgroup: list of column names}: The column lists to form cgroups; if None, do not make cgroups. Each key is the name of the cgroup, and each value is the list of columns to include. To reassemble later, make sure to include join keys in each cgroup.
extension : str, default '.pq': The file extension for the dataframe files; the file type is inferred from this extension
convert : bool, default False: If True, attempt to coerce types to numeric. This can avoid issues with ambiguous object columns but requires additional time.
filesystem : blocks.filesystem.FileSystem or similar: A filesystem object that implements the blocks.FileSystem API
write_args : dict: Any additional args to pass to the write function
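
How n_rgroup, rgroup_offset and cgroup_columns interact can be sketched in plain Python. The helper names (rgroup_bounds, rgroup_names, select_columns) are hypothetical, and so is the exact way rows are distributed. The five-digit zero padding is inferred from the single part_00010.pq example above, so treat it as an assumption rather than the library's guaranteed naming scheme.

```python
def rgroup_bounds(n_rows, n_rgroup):
    """Return (start, stop) row offsets for n_rgroup near-equal slices."""
    base, extra = divmod(n_rows, n_rgroup)
    bounds, start = [], 0
    for i in range(n_rgroup):
        # The first `extra` groups take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def rgroup_names(n_rgroup, rgroup_offset=0, extension=".pq"):
    """File names for each row group (assumed five-digit zero padding)."""
    return [f"part_{rgroup_offset + i:05d}{extension}" for i in range(n_rgroup)]


def select_columns(rows, columns):
    """Keep only the listed columns of each row dict."""
    return [{c: row[c] for c in columns} for row in rows]


# Rows stand in for a dataframe; "id" is the join key kept in every cgroup.
rows = [{"id": i, "x": i * 2, "y": i % 2} for i in range(5)]
cgroup_columns = {"features": ["id", "x"], "labels": ["id", "y"]}

names = rgroup_names(2, rgroup_offset=10)
bounds = rgroup_bounds(len(rows), 2)

# Resulting layout: {cgroup folder: {file name: rows}}
layout = {
    cgroup: {
        name: select_columns(rows[start:stop], columns)
        for name, (start, stop) in zip(names, bounds)
    }
    for cgroup, columns in cgroup_columns.items()
}
print(names, bounds)
```

Because both cgroups keep the id column, the pieces can later be merged back together on that key.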

blocks.core.iterate(path, axis=-1, cgroups=None, rgroups=None, read_args={}, cgroup_args={}, merge='inner', filesystem=<blocks.filesystem.GCSFileSystem object at 0x7ff7a25efc90>)

Iterate over dataframe blocks

Each file included in the path (or subdirs of that path) is opened as a dataframe and returned in a generator of (cname, rname, dataframe). Files are opened by a method inferred from their extension.

Parameters:

path : str: The glob-able path to all data files to assemble into a frame (e.g. gs://example//, gs://example//part.0.pq, gs://example/c[1-2]/). See the README for a more detailed explanation.
axis : int, default -1: The axis to iterate along. If -1 (the default), iterate over both columns and rows. If 0, iterate over the rgroups, combining any cgroups. If 1, iterate over the cgroups, combining any rgroups.
cgroups : list of str, or {str: args}, optional: The list of cgroups (folder names) to include from the glob path
rgroups : list of str, optional: The list of rgroups (file names) to include from the glob path
read_args : dict, optional: Any additional keyword args to pass to the read function
cgroup_args : {cgroup: kwargs}, optional: Any cgroup specific read arguments, where each key is the name of the cgroup and each value is a dictionary of keyword args
merge : one of ‘left’, ‘right’, ‘outer’, ‘inner’, default ‘inner’: The merge strategy to pass to pandas.merge, only used when axis=0
filesystem : blocks.filesystem.FileSystem or similar: A filesystem object that implements the blocks.FileSystem API

Returns:

data : generator: A generator of (cname, rname, dataframe) for each collected path. If axis=0, yields (rname, dataframe). If axis=1, yields (cname, dataframe).
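
The three tuple shapes produced for the different axis values can be shown with a small plain-Python generator. This is only a sketch of the documented behavior, not the library's code: iterate_sketch is a hypothetical name, lists of row dicts stand in for dataframes, and it assumes every cgroup contains the same row group files and that the key column is unique within each block.

```python
def iterate_sketch(blocks, axis=-1, key="id"):
    """Yield blocks from {cname: {rname: rows}} in the documented tuple shapes."""
    cnames = sorted(blocks)
    if axis == -1:
        # One item per file: (cname, rname, block)
        for c in cnames:
            for r in sorted(blocks[c]):
                yield c, r, blocks[c][r]
    elif axis == 0:
        # One item per row group, with cgroups merged on the key: (rname, block)
        for r in sorted(blocks[cnames[0]]):
            merged = blocks[cnames[0]][r]
            for c in cnames[1:]:
                lookup = {row[key]: row for row in blocks[c][r]}
                merged = [
                    {**row, **lookup[row[key]]}
                    for row in merged
                    if row[key] in lookup  # inner-merge semantics
                ]
            yield r, merged
    elif axis == 1:
        # One item per cgroup, with row groups concatenated: (cname, block)
        for c in cnames:
            yield c, [row for r in sorted(blocks[c]) for row in blocks[c][r]]
    else:
        raise ValueError("axis must be -1, 0 or 1")


blocks = {
    "c1": {"part.0.pq": [{"id": 1, "x": "a"}], "part.1.pq": [{"id": 2, "x": "b"}]},
    "c2": {"part.0.pq": [{"id": 1, "y": 10}], "part.1.pq": [{"id": 2, "y": 20}]},
}

per_file = list(iterate_sketch(blocks))
per_rgroup = list(iterate_sketch(blocks, axis=0))
per_cgroup = list(iterate_sketch(blocks, axis=1))
print(per_rgroup)
```

Iterating with axis=0 is the shape to reach for when each row group should be processed as a complete set of columns without loading the whole dataset at once.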

blocks.core.partitioned(path, cgroups=None, rgroups=None, read_args={}, cgroup_args={}, merge='inner', filesystem=<blocks.filesystem.GCSFileSystem object at 0x7ff7a25efcd0>)

Return a partitioned dask dataframe, where each partition is a row group

The results are the same as iterate with axis=0, except that it returns a dask dataframe instead of a generator. Note that this requires dask to be installed

Parameters:

path : str: The glob-able path to all data files to assemble into a frame (e.g. gs://example//, gs://example//part.0.pq, gs://example/c[1-2]/). See the README for a more detailed explanation.
cgroups : list of str, or {str: args}, optional: The list of cgroups (folder names) to include from the glob path
rgroups : list of str, optional: The list of rgroups (file names) to include from the glob path
read_args : dict, optional: Any additional keyword args to pass to the read function
cgroup_args : {cgroup: kwargs}, optional: Any cgroup specific read arguments, where each key is the name of the cgroup and each value is a dictionary of keyword args
merge : one of ‘left’, ‘right’, ‘outer’, ‘inner’, default ‘inner’: The merge strategy to pass to pandas.merge when combining cgroups
filesystem : blocks.filesystem.FileSystem or similar: A filesystem object that implements the blocks.FileSystem API

Returns:

data : dask.dataframe: A dask dataframe partitioned by row groups, with all cgroups merged

blocks.core.place(df, path, filesystem=<blocks.filesystem.GCSFileSystem object at 0x7ff7a25efd10>, **write_args)

Place a dataframe block onto the filesystem at the specified path

Parameters:

df : pd.DataFrame: The data to place
path : str: Path to the directory (possibly on GCS) in which to place the columns
write_args : dict: Any additional args to pass to the write function
filesystem : blocks.filesystem.FileSystem or similar: A filesystem object that implements the blocks.FileSystem API