interfaces

  • ArrayInterface(*args, **kwargs) – Provides numpy.array concepts.

  • BasicInterface(bucket_name, ACCESS_KEY, …) – Basic cottoncandy interface to the cloud.

  • DefaultInterface(*args, **kwargs) – Default cottoncandy interface to the cloud.

  • EncryptedInterface(bucket, access, secret, url) – Interface that transparently encrypts everything uploaded to the cloud.

  • FileSystemInterface(*args, **kwargs) – Emulate some file system functionality.

  • InterfaceObject()

ArrayInterface

class cottoncandy.interfaces.ArrayInterface(*args, **kwargs)

Bases: cottoncandy.interfaces.BasicInterface

Provides numpy.array concepts.

__init__(*args, **kwargs)
Parameters
  • bucket_name (str) –

  • ACCESS_KEY (str) –

  • SECRET_KEY (str) –

  • endpoint_url (str) – The URL for the S3 gateway

  • force_bucket_creation (bool) – Create requested bucket if it doesn’t exist

Returns

cci – Cottoncandy interface object

Return type

ccio

cloud2dataset(object_root, **metadata)

Get a dataset representation of the object branch.

Parameters

object_root (str) – The branch to create a dataset from

Returns

cc_dataset_object – This can be conceptualized as implementing an h5py/pytables object with load() and keys() methods.

Return type

cottoncandy.BrowserObject
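
Example

A minimal sketch (object names are hypothetical; assumes an existing interface cci):

>>> dataset = cci.cloud2dataset('experiment/session01')
>>> dataset.keys()
['experiment/session01/responses', 'experiment/session01/stimuli']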

cloud2dict(object_root, verbose=True, keys=None, **metadata)

Download all the arrays of the object branch and return a dictionary. This is the complement of dict2cloud.

Parameters
  • object_root (str) – The branch to create the dictionary from

  • verbose (bool) – Whether to print object_root after completion

  • keys (list of str, optional) – Specify which keys to download

Returns

datadict – An arbitrary depth dictionary.

Return type

dict

dict2cloud(object_name, array_dict, acl='authenticated-read', verbose=True, **metadata)

Upload an arbitrary depth dictionary containing arrays

Parameters
  • object_name (str) –

  • array_dict (dict) – An arbitrary depth dictionary of arrays. This can be conceptualized as implementing an HDF-like group

  • verbose (bool) – Whether to print object_name after completion
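
Example

A round-trip sketch pairing dict2cloud with cloud2dict (names are hypothetical; assumes an existing interface cci and numpy imported as np):

>>> data = {'model': {'weights': np.random.randn(10, 5),
...                   'bias': np.zeros(5)}}
>>> _ = cci.dict2cloud('experiment/fits', data)
>>> restored = cci.cloud2dict('experiment/fits', verbose=False)
>>> restored['model']['weights'].shape
(10, 5)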

download_dask_array(object_name, dask_name='array')

Downloads a split matrix as a dask.array.Array object

This uses the stored object metadata to reconstruct the full n-dimensional array uploaded using upload_dask_array.

Examples

>>> s3_response = cci.upload_dask_array('test_dim', arr, axis=-1)
>>> dask_object = cci.download_dask_array('test_dim')
>>> dask_object
dask.array<array, shape=(100, 600, 1000), dtype=float64, chunksize=(100, 600, 100)>
>>> dask_slice = dask_object[..., :200]
>>> dask_slice
dask.array<getitem..., shape=(100, 600, 200), dtype=float64, chunksize=(100, 600, 100)>
>>> downloaded_data = np.asarray(dask_slice) # this downloads the array
>>> downloaded_data.shape
(100, 600, 200)
download_npy_array(object_name)

Download a np.ndarray that was uploaded using np.save; the object is loaded back with np.load.

Parameters

object_name (str) –

Returns

array

Return type

np.ndarray

download_raw_array(object_name, buffersize=65536, **kwargs)

Download a binary np.ndarray and return an np.ndarray object. This method downloads the array without any disk or memory overhead.

Parameters
  • object_name (str) –

  • buffersize (int, optional (default: 2^16)) –

Returns

array

Return type

np.ndarray

Notes

The object must have metadata containing: shape, dtype and a gzip boolean flag. This is all automatically handled by upload_raw_array.

download_sparse_array(object_name)

Downloads a scipy.sparse array

Parameters

object_name (str) – The object name for the sparse array to be retrieved.

Returns

arr – The array stored at the location given by object_name

Return type

scipy.sparse.spmatrix

upload_dask_array(object_name, arr, axis=-1, buffersize=104857600, **metakwargs)

Upload an array in chunks and store the metadata to reconstruct the complete matrix with dask.

Parameters
  • object_name (str) –

  • arr (np.ndarray) –

  • axis (int or None (default: -1)) – The axis along which to slice the array. If None, the array is chunked into approximately isotropic blocks. axis=None is a work in progress and currently works well only for nearly isotropic matrices.

  • buffersize (int (default: 100 MB)) – Byte size of the desired array chunks

Returns

response

Return type

boto3 response

Notes

Each array chunk is uploaded as a raw np.array with the prefix “pt%04i”. The metadata is stored as a json file metadata.json. For example, if an array is uploaded with the name “my_array_name” and split into 2 parts, the following objects are created:

  • my_array_name/pt0000

  • my_array_name/pt0001

  • my_array_name/metadata.json

upload_npy_array(object_name, array, acl='authenticated-read', **metadata)

Upload a np.ndarray using np.save

This method creates a copy of the array in memory before uploading since it relies on np.save to get a byte representation of the array.

Parameters
  • object_name (str) –

  • array (numpy.ndarray) –

  • acl (str) – ACL for this object

  • **metadata (optional) – Extra keyword arguments are stored as object metadata

Returns

response

Return type

boto3 upload response
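
Example

A round-trip sketch with download_npy_array (object name is hypothetical; assumes an existing interface cci and numpy imported as np):

>>> arr = np.arange(12).reshape(3, 4)
>>> _ = cci.upload_npy_array('scratch/arange', arr)
>>> loaded = cci.download_npy_array('scratch/arange')
>>> np.array_equal(arr, loaded)
True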

upload_raw_array(object_name, array, compression=True, acl='authenticated-read', **metadata)

Upload a binary representation of a np.ndarray

This method reads the array contents directly from memory for the upload, so it incurs no disk or extra-copy overhead.

Parameters
  • object_name (str) –

  • array (np.ndarray) –

  • compression (str, bool) – True uses the configuration defaults. False is no compression. Available options are: ‘gzip’, ‘LZ4’, ‘Zlib’, ‘Zstd’, ‘BZ2’ (case-sensitive). NB: Zstd appears to be the only one that supports >2GB arrays.

  • acl (str) – ACL for the object

  • **metadata (optional) –

Notes

This method also uploads the array dtype, shape, and gzip flag as metadata
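
Example

A round-trip sketch with download_raw_array (object name is hypothetical; assumes an existing interface cci and numpy imported as np):

>>> arr = np.random.randn(100, 100).astype(np.float32)
>>> _ = cci.upload_raw_array('scratch/raw_example', arr, compression='Zstd')
>>> loaded = cci.download_raw_array('scratch/raw_example')
>>> np.allclose(arr, loaded)
True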

upload_sparse_array(object_name, arr)

Uploads a scipy.sparse array as a folder of array objects

Parameters
  • object_name (str) – The name of the object to be stored.

  • arr (scipy.sparse.spmatrix) – A scipy.sparse array to be saved. If the type is DOK or LIL, it will be converted to CSR before saving
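
Example

A round-trip sketch with download_sparse_array (object name is hypothetical; assumes an existing interface cci):

>>> from scipy import sparse
>>> mat = sparse.random(1000, 1000, density=0.01, format='csr')
>>> _ = cci.upload_sparse_array('scratch/sparse_example', mat)
>>> loaded = cci.download_sparse_array('scratch/sparse_example')
>>> (mat != loaded).nnz
0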

BasicInterface

class cottoncandy.interfaces.BasicInterface(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)

Bases: cottoncandy.interfaces.InterfaceObject

Basic cottoncandy interface to the cloud.

__init__(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)
Parameters
  • bucket_name (str) –

  • ACCESS_KEY (str) – The S3 access key, or client secrets json file

  • SECRET_KEY (str) – The S3 secret key, or client credentials file

  • url (str) – The URL for the S3 gateway

  • force_bucket_creation (bool) – Create the bucket if it does not exist

  • verbose (bool) – Whether to print status messages

  • backend ('s3'|'gdrive') – Which backend to use: S3 or Google Drive

  • kwargs (dict) – S3 only; passed to the backend.

Returns

cci – Cottoncandy interface object

Return type

ccio

property bucket_name
create_bucket(bucket_name, acl='authenticated-read')

Create a new bucket

download_json(object_name)

Download a JSON object

Parameters

object_name (str) –

Returns

json_data – Dictionary representation of JSON file

Return type

dict

download_object(object_name)

Download object raw data. This simply calls the object body read() method.

Parameters

object_name (str) –

Returns

byte_data – Object byte contents

Return type

bytes

download_pickle(object_name)

Download a pickle object

Parameters

object_name (str) –

Returns

data_object

Return type

object

download_stream(object_name)

Return a CloudStream object for the named object.

Parameters

object_name (str) –

Returns

Return type

CloudStream object

download_to_file(object_name, file_name)

Download cloud object to a file

Parameters
  • object_name (str) –

  • file_name (str) – Absolute path where the data will be downloaded on disk

exists_bucket(bucket_name)

Check whether the bucket exists

exists_object(object_name, bucket_name=None, raise_err=False)

Check whether object exists in bucket

Parameters
  • object_name (str) – The object name

  • raise_err (boolean) – If set to True, this function will throw an exception if the object does not exist.
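
Example

A minimal sketch (object names are hypothetical; assumes an existing interface cci):

>>> cci.exists_object('data/experiment/file01.txt')
True
>>> cci.exists_object('data/no/such/object')
False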

get_bucket()

Get bucket boto3 object

get_bucket_objects(**kwargs)

Get list of objects from the bucket.

This is a wrapper to self.get_bucket().bucket.objects

Parameters
  • limit (int, 1000) – Maximum number of items to return

  • page_size (int, 1000) – The page size for pagination

  • filter (dict) – A dictionary with key ‘Prefix’, specifying a prefix string. Only return objects matching this string. Defaults to ‘/’ (i.e. all objects).

  • kwargs (optional) – Dictionary of {method:value} for bucket.objects

Returns

objects_list

Return type

list (boto3 objects)

Notes

If you get a ‘PaginationError’, it means the bucket holds many objects and you should increase page_size.
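
Example

A sketch of listing objects under a prefix (prefix is hypothetical; assumes an existing interface cci):

>>> objects = cci.get_bucket_objects(filter=dict(Prefix='data/experiment'))
>>> [obj.key for obj in objects]
['data/experiment/file01.txt', 'data/experiment/file02.txt']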

get_bucket_size(limit=1000000, page_size=1000000)

Counts the size of all objects in the current bucket.

Parameters
  • limit (int, 10^6) – Maximum number of items to return

  • page_size (int, 10^6) – The page size for pagination

Returns

total_bytes – The byte count of all objects in the bucket.

Return type

int

Notes

Because paging does not work properly, if the bucket contains more than limit or page_size objects, this function will underestimate the total size. Check the printed number of objects for suspiciously round numbers. TODO(anunez): Remove this note when the bug is fixed.

get_object(object_name, bucket_name=None)

Get a boto3 object, creating it if it doesn’t exist.

get_objects(**kwargs)

Like get_bucket_objects, but named more appropriately for the generic interface.

Parameters

kwargs (optional) –

get_size()

Get the total size of the current container of objects (generic naming).

mpu_fileobject(object_name, file_object, buffersize=104857600, verbose=True, acl='authenticated-read', **metadata)

Multi-part upload for a file-object.

This automatically creates a multipart upload of an object. Useful for large objects that are loaded in memory. This avoids having to write the file to disk and then using upload_from_file.

Parameters
  • object_name (str) –

  • file_object – file-like object (e.g. StringIO, file, etc)

  • buffersize (int, (defaults to 100MB)) – Byte size of the individual parts to create.

  • verbose (bool) – verbosity flag of whether to print mpu information to stdout

  • **metadata (optional) – Metadata to store along with MPU object
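
Example

A sketch of uploading an in-memory file object (object name is hypothetical; assumes an existing interface cci):

>>> from io import BytesIO
>>> file_object = BytesIO(b'a very large byte payload ...')
>>> _ = cci.mpu_fileobject('scratch/big_blob', file_object)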

pathjoin(a, *p)
rm_bucket(bucket_name)

Remove an empty bucket. Throws an exception when bucket is not empty.

set_bucket(bucket_name)

Set the bucket to use.

show_buckets()

Show available buckets

show_objects(limit=1000, page_size=1000)

Print objects in the current bucket

upload_from_directory(disk_path, cloud_path=None, recursive=False, ExtraArgs={'ACL': 'authenticated-read'})

Upload a directory to the cloud

upload_from_file(flname, object_name=None, ExtraArgs={'ACL': 'authenticated-read'})

Upload a file to the cloud.

Parameters
  • flname (str) – Absolute path of the file to upload

  • object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.

  • ExtraArgs (dict) – Defaults dict(ACL=DEFAULT_ACL)

Returns

response

Return type

boto3 response
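
Example

A sketch pairing upload_from_file with download_to_file (paths are hypothetical; assumes an existing interface cci):

>>> _ = cci.upload_from_file('/home/user/data.csv', object_name='raw/data.csv')
>>> cci.download_to_file('raw/data.csv', '/tmp/data_copy.csv')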

upload_json(object_name, ddict, acl='authenticated-read', **metadata)

Upload a dict as a JSON using json.dumps

Parameters
  • object_name (str) –

  • ddict (dict) – Dictionary to upload

  • metadata (dict, optional) –
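
Example

A round-trip sketch with download_json (object name is hypothetical; assumes an existing interface cci):

>>> info = {'subject': 'S01', 'n_trials': 120}
>>> _ = cci.upload_json('experiment/info.json', info)
>>> cci.download_json('experiment/info.json')
{'subject': 'S01', 'n_trials': 120}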

upload_object(object_name, body, acl='authenticated-read', **metadata)
upload_pickle(object_name, data_object, acl='authenticated-read', **metadata)

Upload an object serialized with pickle.dumps

Parameters
  • object_name (str) –

  • data_object (object) –
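
Example

A round-trip sketch with download_pickle (object name is hypothetical; assumes an existing interface cci):

>>> state = {'labels': ['a', 'b'], 'threshold': 0.5}
>>> _ = cci.upload_pickle('scratch/state.pkl', state)
>>> cci.download_pickle('scratch/state.pkl')
{'labels': ['a', 'b'], 'threshold': 0.5}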

DefaultInterface

class cottoncandy.interfaces.DefaultInterface(*args, **kwargs)

Bases: cottoncandy.interfaces.FileSystemInterface, cottoncandy.interfaces.ArrayInterface, cottoncandy.interfaces.BasicInterface

Default cottoncandy interface to the cloud

This includes numpy.array and file-system-like concepts for easy data I/O and bucket/object exploration.

__init__(*args, **kwargs)
Parameters
  • bucket_name (str) –

  • ACCESS_KEY (str) –

  • SECRET_KEY (str) –

  • endpoint_url (str) – The URL for the S3 gateway

  • force_bucket_creation (bool) – Create requested bucket if it doesn’t exist

  • backend ('s3'|'gdrive') – Which backend to use

Returns

cci

Return type

cottoncandy.InterfaceObject
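
Example

A construction sketch using the cc.get_interface helper (credentials and endpoint are hypothetical):

>>> import cottoncandy as cc
>>> cci = cc.get_interface('mybucket',
...                        ACCESS_KEY='AKXXXXXXXXXXXXXXXXXX',
...                        SECRET_KEY='MY_SECRET_KEY',
...                        endpoint_url='https://s3.example.com')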

EncryptedInterface

class cottoncandy.interfaces.EncryptedInterface(bucket, access, secret, url, encryption='AES', key=None, *args, **kwargs)

Bases: cottoncandy.interfaces.DefaultInterface

Interface that transparently encrypts everything uploaded to the cloud

__init__(bucket, access, secret, url, encryption='AES', key=None, *args, **kwargs)
Parameters
  • bucket (str) –

  • access (str) –

  • secret (str) –

  • url (str) –

  • encryption ('AES' | 'RSA') –

  • key (str) – If AES, the key itself; if RSA, the filename of a .pem-format key

  • backend ('s3'|'gdrive') – Which backend to use

  • args

  • kwargs
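
Example

A construction sketch (credentials, endpoint, and key path are hypothetical):

>>> from cottoncandy.interfaces import EncryptedInterface
>>> cci = EncryptedInterface('mybucket', 'AKXXXXXXXXXXXXXXXXXX', 'MY_SECRET_KEY',
...                          'https://s3.example.com',
...                          encryption='RSA', key='/path/to/key.pem')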

download_stream(object_name)

Return a CloudStream object for the named object.

Parameters

object_name (str) –

Returns

Return type

CloudStream object

download_to_file(object_name, file_name)

Download cloud object to a file

Parameters
  • object_name (str) –

  • file_name (str) – Absolute path where the data will be downloaded on disk

upload_from_file(local_file_name, object_name=None, ExtraArgs={'ACL': 'authenticated-read'})

Upload a file to the cloud.

Parameters
  • local_file_name (str) – Absolute path of the file to upload

  • object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.

  • ExtraArgs (dict) – Defaults dict(ACL=DEFAULT_ACL)

Returns

response

Return type

boto3 response

upload_object(object_name, body, acl='authenticated-read', **metadata)

FileSystemInterface

class cottoncandy.interfaces.FileSystemInterface(*args, **kwargs)

Bases: cottoncandy.interfaces.BasicInterface

Emulate some file system functionality.

__init__(*args, **kwargs)
Parameters
  • bucket_name (str) –

  • ACCESS_KEY (str) –

  • SECRET_KEY (str) –

  • endpoint_url (str) – The URL for the S3 gateway

  • force_bucket_creation (bool) – Create requested bucket if it doesn’t exist

Returns

cci – Cottoncandy interface object

Return type

ccio

cp(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)

Copy an object

Parameters
  • source_name (str) – Name of object to be copied

  • dest_name (str) – Copy name

  • source_bucket (str) – If copying from a bucket different from the default. Defaults to self.bucket_name

  • dest_bucket (str) – If copying to a bucket different from the source bucket. Defaults to source_bucket

  • overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists
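
Example

A minimal sketch (object names are hypothetical; assumes an existing interface cci):

>>> _ = cci.cp('data/run01/results', 'backup/run01/results', overwrite=True)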

download_directory(directory, disk_name)

Download an entire directory. Note: currently only tested on S3.

Parameters
  • directory (str) – Directory on S3 to download

  • disk_name (str) – Name of the directory on disk to download to

get_browser()

Return an object which can be tab-completed to browse the contents of the bucket as if it were a file-system

See documentation for cottoncandy.get_browser

get_object_owner(object_name)
glob(pattern, **kwargs)

Return a list of object names in the cloud storage that match the glob pattern.

Parameters
  • pattern (str) – A glob pattern string

  • verbose (bool, optional) – If True, also print object name and creation date

  • limit (None, int, optional) –

  • page_size (int, optional) –

Returns

object_names

Return type

list

Example

>>> cci.glob('/path/to/*/file01*.grp/image_data')
['/path/to/my/file01a.grp/image_data',
 '/path/to/my/file01b.grp/image_data',
 '/path/to/your/file01a.grp/image_data',
 '/path/to/your/file01b.grp/image_data']
>>> cci.glob('/path/to/my/file02*.grp/*')
['/path/to/my/file02a.grp/image_data',
 '/path/to/my/file02a.grp/text_data',
 '/path/to/my/file02b.grp/image_data',
 '/path/to/my/file02b.grp/text_data',]

Some gotchas

limit: None, int, optional

The maximum number of objects to return

page_size: int, optional

This is important for buckets with loads of objects. By default, glob will download a maximum of 10^6 object names and perform the search. If more objects exist, the search might not find them and the page_size should be increased.

Notes

If the bucket contains more than 10^6 objects, pass the page_size=10**7 kwarg.

glob_google_drive(pattern)

Globbing on google drive

Parameters

pattern

glob_s3(pattern, **kwargs)

Globbing on S3

Parameters
  • pattern

  • kwargs

ls(pattern, page_size=1000, limit=1000, verbose=False)

File-system like search for S3 objects

Parameters
  • pattern (str) – An ls-style, command-line-like query

  • page_size (int (default: 1,000)) –

  • limit (int (default: 1,000)) –

Returns

object_names – Object names that match the search pattern

Return type

list

Notes

Increase page_size and limit if you have a lot of objects; otherwise, the search might not return all matching objects in the store.
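
Example

A minimal sketch (object names are hypothetical; assumes an existing interface cci):

>>> cci.ls('data/experiment')
['data/experiment/file01.txt', 'data/experiment/file02.txt']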

lsdir(path='/', limit=1000)

List the contents of a directory

Parameters

path (str (default: "/")) –

Returns

matches – The children of the path.

Return type

list

mv(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)

Move an object (make copy and delete old object)

Parameters
  • source_name (str) – Name of object to be moved

  • dest_name (str) – New object name

  • source_bucket (str) – If moving object from a bucket different from the default. Defaults to self.bucket_name

  • dest_bucket (str (defaults to None)) – If moving to another bucket, provide the bucket name. Defaults to source_bucket

  • overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists.

rm(object_name, recursive=False, delete=True)

Delete an object, or a subtree (‘path/to/stuff’).

Parameters
  • object_name (str) – The name of the object to delete. It can also be a subtree

  • recursive (bool) – When deleting a subtree, set recursive=True. This is similar in behavior to ‘rm -r /path/to/directory’.

  • delete (bool) – Google Drive only: if True, actually delete the file; if False, only move it to the trash.

Example

>>> import cottoncandy as cc
>>> cci = cc.get_interface('mybucket', verbose=False)
>>> response = cci.rm('data/experiment/file01.txt')
>>> cci.rm('data/experiment')
cannot remove 'data/experiment': use `recursive` to remove branch
>>> cci.rm('data/experiment', recursive=True)
deleting 15 objects...
search(pattern, **kwargs)

Print the objects matching the glob pattern

See glob documentation for details

InterfaceObject

class cottoncandy.interfaces.InterfaceObject

Bases: object

__init__()