Core
- blocks.core.assemble(path, cgroups=None, rgroups=None, read_args={}, cgroup_args={}, merge='inner', filesystem=<blocks.filesystem.base.FileSystem object>, tmpdir=None)
Assemble multiple dataframe blocks into a single frame
Each file included in the path (or subdirs of that path) is combined into a single dataframe by first concatenating over row groups and then merging over column groups. A row group is a subset of rows of the data stored in different files. A column group is a subset of columns of the data stored in different folders. The merges are performed in the order of listed cgroups if provided, otherwise in alphabetic order. Files are opened by a method inferred from their extension.
- Parameters
- path : str
The glob-able path to all data files to assemble into a frame, e.g. gs://example/*/*, gs://example/*/part.0.pq, gs://example/c[1-2]/*; see the README for a more detailed explanation
- cgroups : list of str, optional
The list of cgroups (folder names) to include from the glob path
- rgroups : list of str, optional
The list of rgroups (file names) to include from the glob path
- read_args : dict, optional
Any additional keyword args to pass to the read function
- cgroup_args : {cgroup: kwargs}, optional
Any cgroup specific read arguments, where each key is the name of the cgroup and each value is a dictionary of keyword args
- merge : one of ‘left’, ‘right’, ‘outer’, ‘inner’, default ‘inner’
The merge strategy to pass to pandas.merge
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
- Returns
- data : pd.DataFrame
The combined dataframe from all the blocks
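The assembly order described above (concatenate over row groups, then merge over column groups) can be sketched with plain pandas. The cgroup names, file names, and columns below are hypothetical stand-ins; the real function discovers the blocks from the glob path and infers readers from file extensions:

```python
import pandas as pd

# Hypothetical blocks: two cgroups (column folders), each holding two rgroups (row files)
features = {  # cgroup "features"
    "part_00000.pq": pd.DataFrame({"id": [1, 2], "x": [0.1, 0.2]}),
    "part_00001.pq": pd.DataFrame({"id": [3, 4], "x": [0.3, 0.4]}),
}
labels = {  # cgroup "labels"
    "part_00000.pq": pd.DataFrame({"id": [1, 2], "y": [0, 1]}),
    "part_00001.pq": pd.DataFrame({"id": [3, 4], "y": [1, 0]}),
}

# Step 1: concatenate each cgroup's row groups (in alphabetic file order)
frames = [
    pd.concat([grp[name] for name in sorted(grp)], ignore_index=True)
    for grp in (features, labels)
]

# Step 2: merge the concatenated cgroups on their shared key (default merge='inner')
data = frames[0]
for frame in frames[1:]:
    data = data.merge(frame, how="inner")
```

This is only a model of the semantics; the library handles file discovery, extension-based readers, and remote filesystems for you.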
- blocks.core.divide(df, path, n_rgroup=1, rgroup_offset=0, cgroup_columns=None, extension='.pq', convert=False, filesystem=<blocks.filesystem.base.FileSystem object>, prefix=None, tmpdir=None, **write_args)
Split a dataframe into rgroups/cgroups and save to disk
Note that this splitting does not preserve the original index, so make sure to have another column to track values
- Parameters
- df : pd.DataFrame
The data to divide
- path : str
Path to the directory (possibly on GCS) in which to place the columns
- n_rgroup : int, default 1
The number of row groups to partition the data into; the rgroups will have approximately equal sizes
- rgroup_offset : int, default 0
The index to start from in the name of file parts, e.g. if rgroup_offset=10 then the first file will be part_00010.pq
- cgroup_columns : {cgroup: list of column names}
The column lists to form cgroups; if None, do not make cgroups. Each key is the name of the cgroup, and each value is the list of columns to include. To reassemble later, make sure to include join keys for each cgroup
- extension : str, default .pq
The file extension for the dataframe (the file type is inferred from this extension)
- convert : bool, default False
If True, attempt to coerce types to numeric. This can avoid issues with ambiguous object columns but requires additional time
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
- prefix : str, optional
Prefix to add to written filenames
- write_args : dict
Any additional args to pass to the write function
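The splitting described above can be sketched with numpy and pandas alone. The directory layout and part_XXXXX naming follow the pattern documented here, but the CSV writer and local temp directory are stand-ins for the extension-based dispatch and filesystem handling of the real function:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": range(10), "x": np.arange(10) * 0.5, "y": np.arange(10) % 2})
cgroup_columns = {"features": ["id", "x"], "labels": ["id", "y"]}  # join key in each cgroup
n_rgroup, rgroup_offset = 2, 0

outdir = tempfile.mkdtemp()
for cgroup, cols in cgroup_columns.items():
    os.makedirs(os.path.join(outdir, cgroup), exist_ok=True)
    # Split this cgroup's columns into approximately equal row groups
    for i, part in enumerate(np.array_split(df[cols], n_rgroup)):
        name = "part_%05d.csv" % (rgroup_offset + i)
        part.to_csv(os.path.join(outdir, cgroup, name), index=False)
```

Note the written parts drop the original index (as the warning above says), so the id column here acts as the tracking key.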
- blocks.core.iterate(path, axis=-1, cgroups=None, rgroups=None, read_args={}, cgroup_args={}, merge='inner', filesystem=<blocks.filesystem.base.FileSystem object>, tmpdir=None)
Iterate over dataframe blocks
Each file included in the path (or subdirs of that path) is opened as a dataframe and returned in a generator of (cname, rname, dataframe). Files are opened by a method inferred from their extension
- Parameters
- path : str
The glob-able path to all files to assemble into a frame, e.g. gs://example/*/*, gs://example/*/part.0.pq, gs://example/c[1-2]/*; see the README for a more detailed explanation
- axis : int, default -1
The axis to iterate along: if -1 (the default), iterate over both columns and rows; if 0, iterate over the rgroups, combining any cgroups; if 1, iterate over the cgroups, combining any rgroups
- cgroups : list of str or {str: args}, optional
The list of cgroups (folder names) to include from the glob path
- rgroups : list of str, optional
The list of rgroups (file names) to include from the glob path
- read_args : dict, optional
Any additional keyword args to pass to the read function
- cgroup_args : {cgroup: kwargs}, optional
Any cgroup specific read arguments, where each key is the name of the cgroup and each value is a dictionary of keyword args
- merge : one of ‘left’, ‘right’, ‘outer’, ‘inner’, default ‘inner’
The merge strategy to pass to pandas.merge, only used when axis=0
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
- Returns
- data : generator
A generator of (cname, rname, dataframe) for each collected path; if axis=0, yields (rname, dataframe); if axis=1, yields (cname, dataframe)
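The yield shapes for the different axis values can be modeled with an in-memory sketch. The block dictionary and `iterate_sketch` helper are hypothetical; on disk the keys would be folders and files discovered from the glob path:

```python
import pandas as pd

# Hypothetical blocks keyed by (cgroup, rgroup)
blocks_ = {
    ("features", "part_00000.pq"): pd.DataFrame({"id": [1], "x": [0.1]}),
    ("labels", "part_00000.pq"): pd.DataFrame({"id": [1], "y": [0]}),
}

def iterate_sketch(blocks_, axis=-1):
    """Yield blocks in the shapes described above (axis=-1 and axis=1 only)."""
    if axis == -1:
        # Iterate over both columns and rows: (cname, rname, dataframe)
        for (cname, rname), df in sorted(blocks_.items()):
            yield cname, rname, df
    elif axis == 1:
        # Iterate over cgroups, combining any rgroups: (cname, dataframe)
        for cname in sorted({c for c, _ in blocks_}):
            parts = [df for (c, r), df in sorted(blocks_.items()) if c == cname]
            yield cname, pd.concat(parts, ignore_index=True)

out = list(iterate_sketch(blocks_))
```

The axis=0 case (combining cgroups per rgroup with a merge) is omitted here; it mirrors the merge step shown for assemble.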
- blocks.core.partitioned(path, cgroups=None, rgroups=None, read_args={}, cgroup_args={}, merge='inner', filesystem=<blocks.filesystem.base.FileSystem object>, tmpdir=None)
Return a partitioned dask dataframe, where each partition is a row group
The results are the same as iterate with axis=0, except that it returns a dask dataframe instead of a generator. Note that this requires dask to be installed
- Parameters
- path : str
The glob-able path to all files to assemble into a frame, e.g. gs://example/*/*, gs://example/*/part.0.pq, gs://example/c[1-2]/*; see the README for a more detailed explanation
- cgroups : list of str or {str: args}, optional
The list of cgroups (folder names) to include from the glob path
- rgroups : list of str, optional
The list of rgroups (file names) to include from the glob path
- read_args : dict, optional
Any additional keyword args to pass to the read function
- cgroup_args : {cgroup: kwargs}, optional
Any cgroup specific read arguments, where each key is the name of the cgroup and each value is a dictionary of keyword args
- merge : one of ‘left’, ‘right’, ‘outer’, ‘inner’, default ‘inner’
The merge strategy to pass to pandas.merge when combining cgroups
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
- Returns
- data : dask.dataframe
A dask dataframe partitioned by row groups, with all cgroups merged
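Since each partition is one row group with the cgroups merged, the per-partition frame can be modeled with pandas alone (dask is only needed by the real function, which would wrap frames like these via dask.dataframe). The cgroup dictionaries and `load_partition` helper are hypothetical:

```python
import pandas as pd

# Hypothetical cgroups sharing the same rgroup names
features = {"part_00000.pq": pd.DataFrame({"id": [1, 2], "x": [0.1, 0.2]}),
            "part_00001.pq": pd.DataFrame({"id": [3, 4], "x": [0.3, 0.4]})}
labels = {"part_00000.pq": pd.DataFrame({"id": [1, 2], "y": [0, 1]}),
          "part_00001.pq": pd.DataFrame({"id": [3, 4], "y": [1, 0]})}

def load_partition(rname, merge="inner"):
    """One partition: the rgroup's block from every cgroup, merged."""
    return features[rname].merge(labels[rname], how=merge)

# One pandas frame per row group; the real function assembles these
# lazily into a partitioned dask dataframe
partitions = [load_partition(rname) for rname in sorted(features)]
```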
- blocks.core.pickle(obj, path, filesystem=<blocks.filesystem.base.FileSystem object>)
Save a pickle of obj at the specified path
- Parameters
- obj : object
Any pickle-compatible object
- path : str
The path to the location to save the pickle file; supports GCS paths
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
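For a local path the behavior reduces to the standard library's pickle module; the path and object below are arbitrary examples, and a gs:// path would be routed through the filesystem object instead:

```python
import os
import pickle
import tempfile

obj = {"weights": [0.1, 0.2], "name": "model"}  # any pickle-compatible object
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the pickle at the given path (the real function routes the bytes
# through the filesystem object so remote paths also work)
with open(path, "wb") as f:
    pickle.dump(obj, f)
```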
- blocks.core.place(df, path, filesystem=<blocks.filesystem.base.FileSystem object>, tmpdir=None, **write_args)
Place a dataframe block onto the filesystem at the specified path
- Parameters
- df : pd.DataFrame
The data to place
- path : str
Path to the directory (possibly on GCS) in which to place the columns
- write_args : dict
Any additional args to pass to the write function
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
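Placing a single block amounts to picking a writer from the file extension and writing to the target path. The dispatch table below is a hypothetical stand-in for the library's extension-based inference, shown with a local CSV so it runs without extra dependencies:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
path = os.path.join(tempfile.mkdtemp(), "part_00000.csv")

# Hypothetical extension dispatch; the real function infers the writer from
# the extension and routes the bytes through the filesystem object
writers = {".csv": lambda d, p, **kw: d.to_csv(p, index=False, **kw)}
ext = os.path.splitext(path)[1]
writers[ext](df, path)
```

Extra `write_args` would be forwarded as the keyword arguments of the selected writer.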
- blocks.core.unpickle(path, filesystem=<blocks.filesystem.base.FileSystem object>)
Load an object from the pickle file at path
- Parameters
- path : str
The path to the location of the saved pickle file; supports GCS paths
- filesystem : blocks.filesystem.FileSystem or similar
A filesystem object that implements the blocks.FileSystem API
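A local round trip shows the intended pairing with pickle above; the file name and object are arbitrary examples, and a gs:// path would go through the filesystem object instead:

```python
import os
import pickle
import tempfile

# Save a pickle first so there is something to load (local stand-in
# for a remote path handled by the filesystem object)
path = os.path.join(tempfile.mkdtemp(), "obj.pkl")
original = {"threshold": 0.5}
with open(path, "wb") as f:
    pickle.dump(original, f)

# unpickle: load the object back from the saved file
with open(path, "rb") as f:
    restored = pickle.load(f)
```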