interfaces¶

| ArrayInterface | Provides numpy.array concepts. |
| BasicInterface | Basic cottoncandy interface to the cloud. |
| DefaultInterface | Default cottoncandy interface to the cloud. |
| EncryptedInterface | Interface that transparently encrypts everything uploaded to the cloud. |
| FileSystemInterface | Emulate some file system functionality. |
ArrayInterface¶
- class cottoncandy.interfaces.ArrayInterface(*args, **kwargs)¶
Bases: cottoncandy.interfaces.BasicInterface
Provides numpy.array concepts.
- __init__(*args, **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) –
SECRET_KEY (str) –
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
- Returns
cci – Cottoncandy interface object
- Return type
ccio
- cloud2dataset(object_root, **metadata)¶
Get a dataset representation of the object branch.
- Parameters
object_root (str) – The branch to create a dataset from
- Returns
cc_dataset_object – This can be conceptualized as implementing an h5py/pytables object with load() and keys() methods.
- Return type
cottoncandy.BrowserObject
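Example
A minimal sketch, assuming an existing interface object cci and a previously uploaded branch named 'my_dataset' (the name is illustrative; output omitted):
>>> dataset = cci.cloud2dataset('my_dataset')
>>> dataset.keys()  # browse the arrays under the branch, h5py-style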
- cloud2dict(object_root, verbose=True, keys=None, **metadata)¶
Download all the arrays of the object branch and return a dictionary. This is the complement to dict2cloud.
- Parameters
object_root (str) – The branch to create the dictionary from
verbose (bool) – Whether to print object_root after completion
keys (list of str) – Specify which keys to download
- Returns
datadict – An arbitrary depth dictionary.
- Return type
dict
- dict2cloud(object_name, array_dict, acl='authenticated-read', verbose=True, **metadata)¶
Upload an arbitrary depth dictionary containing arrays
- Parameters
object_name (str) –
array_dict (dict) – An arbitrary depth dictionary of arrays. This can be conceptualized as implementing an HDF-like group
verbose (bool) – Whether to print object_name after completion
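Example
A minimal round-trip sketch pairing dict2cloud with cloud2dict, assuming an existing interface object cci and numpy imported as np (object names are illustrative):
>>> arrays = {'exp1': {'responses': np.zeros((10, 5)), 'stimuli': np.ones(10)}}
>>> cci.dict2cloud('my_dataset', arrays)
>>> datadict = cci.cloud2dict('my_dataset', verbose=False)
>>> datadict['exp1']['responses'].shape
(10, 5)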
- download_dask_array(object_name, dask_name='array')¶
Downloads a split matrix as a dask.array.Array object.
This uses the stored object metadata to reconstruct the full n-dimensional array uploaded using upload_dask_array.
Examples
>>> s3_response = cci.upload_dask_array('test_dim', arr, axis=-1)
>>> dask_object = cci.download_dask_array('test_dim')
>>> dask_object
dask.array<array, shape=(100, 600, 1000), dtype=float64, chunksize=(100, 600, 100)>
>>> dask_slice = dask_object[..., :200]
>>> dask_slice
dask.array<getitem..., shape=(100, 600, 200), dtype=float64, chunksize=(100, 600, 100)>
>>> downloaded_data = np.asarray(dask_slice)  # this downloads the array
>>> downloaded_data.shape
(100, 600, 200)
- download_npy_array(object_name)¶
Download a np.ndarray uploaded using np.save with np.load.
- Parameters
object_name (str) –
- Returns
array
- Return type
np.ndarray
- download_raw_array(object_name, buffersize=65536, **kwargs)¶
Download a binary np.ndarray and return an np.ndarray object. This method downloads the array without any disk or memory overhead.
- Parameters
object_name (str) –
buffersize (int, optional (default: 2**16)) –
- Returns
array
- Return type
np.ndarray
Notes
The object must have metadata containing: shape, dtype and a gzip boolean flag. This is all automatically handled by upload_raw_array.
- download_sparse_array(object_name)¶
Downloads a scipy.sparse array
- Parameters
object_name (str) – The object name for the sparse array to be retrieved.
- Returns
arr – The array stored at the location given by object_name
- Return type
scipy.sparse.spmatrix
- upload_dask_array(object_name, arr, axis=-1, buffersize=104857600, **metakwargs)¶
Upload an array in chunks and store the metadata to reconstruct the complete matrix with dask.
- Parameters
object_name (str) –
arr (np.ndarray) –
axis (int or None (default: -1)) – The axis along which to slice the array. If None is given, the array is chunked into ideal isotropic voxels. axis=None is a work in progress and currently works well only for near-isotropic matrices.
buffersize (int (default: 100MB)) – Byte size of the desired array chunks
- Returns
response
- Return type
boto3 response
Notes
Each array chunk is uploaded as a raw np.array with the prefix “pt%04i”. The metadata is stored as a JSON file metadata.json. For example, if an array is uploaded with the name “my_array_name” and split into 2 parts, the following objects are created:
my_array_name/pt0000
my_array_name/pt0001
my_array_name/metadata.json
- upload_npy_array(object_name, array, acl='authenticated-read', **metadata)¶
Upload a np.ndarray using np.save.
This method creates a copy of the array in memory before uploading, since it relies on np.save to get a byte representation of the array.
- Parameters
object_name (str) –
array (numpy.ndarray) –
acl (ACL for this object) –
**metadata (extra kwargs are uploaded to object metadata) –
- Returns
response
- Return type
boto3 upload response
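Example
A minimal round-trip sketch with download_npy_array, assuming an existing interface object cci and numpy imported as np (the object name is illustrative):
>>> arr = np.arange(12).reshape(3, 4)
>>> response = cci.upload_npy_array('data/arr.npy', arr)
>>> np.array_equal(cci.download_npy_array('data/arr.npy'), arr)
True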
- upload_raw_array(object_name, array, compression=True, acl='authenticated-read', **metadata)¶
Upload a binary representation of a np.ndarray.
This method reads the array content directly from memory to upload, so it incurs no additional disk or memory overhead.
- Parameters
object_name (str) –
array (np.ndarray) –
compression (str, bool) – True uses the configuration defaults. False means no compression. Available options are: ‘gzip’, ‘LZ4’, ‘Zlib’, ‘Zstd’, ‘BZ2’ (case-sensitive). NB: Zstd appears to be the only one that supports >2GB arrays.
acl (str) – ACL for the object
**metadata (optional) –
Notes
This method also uploads the array dtype, shape, and gzip flag as metadata.
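Example
A minimal round-trip sketch with download_raw_array, assuming an existing interface object cci and numpy imported as np (the object name is illustrative):
>>> arr = np.random.randn(100, 50).astype('float32')
>>> response = cci.upload_raw_array('data/arr', arr, compression='Zstd')
>>> downloaded = cci.download_raw_array('data/arr')
>>> np.array_equal(arr, downloaded)
True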
- upload_sparse_array(object_name, arr)¶
Uploads a scipy.sparse array as a folder of array objects
- Parameters
object_name (str) – The name of the object to be stored.
arr (scipy.sparse.spmatrix) – A scipy.sparse array to be saved. If the type is DOK or LIL, it will be converted to CSR before saving.
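Example
A minimal round-trip sketch with download_sparse_array, assuming an existing interface object cci and scipy.sparse imported as sparse (the object name is illustrative):
>>> mat = sparse.random(1000, 1000, density=0.01, format='csr')
>>> cci.upload_sparse_array('data/sparse_mat', mat)
>>> loaded = cci.download_sparse_array('data/sparse_mat')
>>> (mat != loaded).nnz  # no differing entries
0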
BasicInterface¶
- class cottoncandy.interfaces.BasicInterface(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)¶
Bases: cottoncandy.interfaces.InterfaceObject
Basic cottoncandy interface to the cloud.
- __init__(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) – The S3 access key, or client secrets json file
SECRET_KEY (str) – The S3 secret key, or client credentials file
url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create the bucket if it does not exist
verbose (bool) – Print status messages
backend ('s3'|'gdrive') – Access S3 or Google Drive
kwargs (dict) – S3 only. Passed to the backend.
- Returns
cci – Cottoncandy interface object
- Return type
ccio
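Example
A minimal construction sketch; the bucket name, credentials, and endpoint URL below are placeholders:
>>> from cottoncandy.interfaces import BasicInterface
>>> cci = BasicInterface('my_bucket',
...                      ACCESS_KEY='<access key>',
...                      SECRET_KEY='<secret key>',
...                      url='https://s3.amazonaws.com',
...                      force_bucket_creation=False)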
- property bucket_name¶
- create_bucket(bucket_name, acl='authenticated-read')¶
Create a new bucket
- download_json(object_name)¶
Download a JSON object
- Parameters
object_name (str) –
- Returns
json_data – Dictionary representation of JSON file
- Return type
dict
- download_object(object_name)¶
Download object raw data. This simply calls the object body read() method.
- Parameters
object_name (str) –
- Returns
byte_data – Object byte contents
- Return type
str
- download_pickle(object_name)¶
Download a pickle object
- Parameters
object_name (str) –
- Returns
data_object
- Return type
object
- download_stream(object_name)¶
Returns the CloudStream object for an object.
- Parameters
object_name (str) –
- Returns
- Return type
CloudStream object
- download_to_file(object_name, file_name)¶
Download cloud object to a file
- Parameters
object_name (str) –
file_name (str) – Absolute path where the data will be downloaded on disk
- exists_bucket(bucket_name)¶
Check whether the bucket exists
- exists_object(object_name, bucket_name=None, raise_err=False)¶
Check whether object exists in bucket
- Parameters
object_name (str) – The object name
raise_err (boolean) – If set to True, this function will throw an exception if the object does not exist.
- get_bucket()¶
Get bucket boto3 object
- get_bucket_objects(**kwargs)¶
Get list of objects from the bucket.
This is a wrapper around self.get_bucket().bucket.objects.
- Parameters
limit (int, 1000) – Maximum number of items to return
page_size (int, 1000) – The page size for pagination
filter (dict) – A dictionary with key ‘Prefix’, specifying a prefix string. Only return objects matching this string. Defaults to ‘/’ (i.e. all objects).
kwargs (optional) – Dictionary of {method: value} for bucket.objects
- Returns
objects_list
- Return type
list (boto3 objects)
Notes
If you get a ‘PaginationError’, this means you have a lot of items in your bucket and should increase page_size.
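Example
A minimal sketch listing object names under a prefix, assuming an existing interface object cci (the prefix is illustrative):
>>> objects = cci.get_bucket_objects(filter={'Prefix': 'data/'}, limit=100)
>>> names = [obj.key for obj in objects]  # boto3 ObjectSummary keys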
- get_bucket_size(limit=1000000, page_size=1000000)¶
Counts the size of all objects in the current bucket.
- Parameters
limit (int, 10^6) – Maximum number of items to return
page_size (int, 10^6) – The page size for pagination
- Returns
total_bytes – The byte count of all objects in the bucket.
- Return type
int
Notes
Because paging does not work properly, if there are more than limit * page_size objects in the bucket, this function will underestimate the total size. Check the printed number of objects for suspicious round numbers. TODO(anunez): Remove this note when the bug is fixed.
- get_object(object_name, bucket_name=None)¶
Get a boto3 object. Create it if it doesn’t exist
- get_objects(**kwargs)¶
Like get_bucket_objects, but more aptly named for the generic interface.
- Parameters
kwargs (optional) – Passed to get_bucket_objects
- get_size()¶
Gets the total size of the current container of objects. Generic naming.
- mpu_fileobject(object_name, file_object, buffersize=104857600, verbose=True, acl='authenticated-read', **metadata)¶
Multi-part upload for a file-object.
This automatically creates a multipart upload of an object. Useful for large objects that are loaded in memory. This avoids having to write the file to disk and then using upload_from_file.
- Parameters
object_name (str) –
file_object – file-like object (e.g. StringIO, file, etc)
buffersize (int, (defaults to 100MB)) – Byte size of the individual parts to create.
verbose (bool) – verbosity flag of whether to print mpu information to stdout
**metadata (optional) – Metadata to store along with MPU object
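Example
A minimal sketch uploading an in-memory file object, assuming an existing interface object cci (the object name and payload are illustrative):
>>> from io import BytesIO
>>> fobj = BytesIO(b'\x00' * (250 * 2**20))  # ~250MB; ~3 parts at the default buffersize
>>> response = cci.mpu_fileobject('big/blob', fobj)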
- pathjoin(a, *p)¶
- rm_bucket(bucket_name)¶
Remove an empty bucket. Throws an exception when bucket is not empty.
- set_bucket(bucket_name)¶
Bucket to use
- show_buckets()¶
Show available buckets
- show_objects(limit=1000, page_size=1000)¶
Print objects in the current bucket
- upload_from_directory(disk_path, cloud_path=None, recursive=False, ExtraArgs={'ACL': 'authenticated-read'})¶
Upload a directory to the cloud
- upload_from_file(flname, object_name=None, ExtraArgs={'ACL': 'authenticated-read'})¶
Upload a file to the cloud.
- Parameters
flname (str) – Absolute path of the file to upload
object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.
ExtraArgs (dict) – Defaults to dict(ACL=DEFAULT_ACL)
- Returns
response
- Return type
boto3 response
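Example
A minimal sketch pairing upload_from_file with download_to_file, assuming an existing interface object cci (paths and object names are placeholders):
>>> response = cci.upload_from_file('/tmp/data.csv', object_name='data/data.csv')
>>> cci.download_to_file('data/data.csv', '/tmp/data_copy.csv')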
- upload_json(object_name, ddict, acl='authenticated-read', **metadata)¶
Upload a dict as JSON using json.dumps.
- Parameters
object_name (str) –
ddict (dict to upload) –
metadata (dict, optional) –
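Example
A minimal round-trip sketch with download_json, assuming an existing interface object cci (the object name is illustrative):
>>> cci.upload_json('config/params.json', {'alpha': 0.5, 'niter': 100})
>>> params = cci.download_json('config/params.json')
>>> params['alpha']
0.5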
- upload_object(object_name, body, acl='authenticated-read', **metadata)¶
- upload_pickle(object_name, data_object, acl='authenticated-read', **metadata)¶
Upload an object using pickle: pickle.dumps
- Parameters
object_name (str) –
data_object (object) –
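Example
A minimal round-trip sketch with download_pickle, assuming an existing interface object cci (the object name is illustrative):
>>> cci.upload_pickle('models/params.pkl', {'weights': [1, 2, 3]})
>>> cci.download_pickle('models/params.pkl')
{'weights': [1, 2, 3]}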
DefaultInterface¶
- class cottoncandy.interfaces.DefaultInterface(*args, **kwargs)¶
Bases: cottoncandy.interfaces.FileSystemInterface, cottoncandy.interfaces.ArrayInterface, cottoncandy.interfaces.BasicInterface
Default cottoncandy interface to the cloud
This includes numpy.array and file-system-like concepts for easy data I/O and bucket/object exploration.
- __init__(*args, **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) –
SECRET_KEY (str) –
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
backend ('s3'|'gdrive') – which backend to hook on to
- Returns
cci
- Return type
cottoncandy.InterfaceObject
EncryptedInterface¶
- class cottoncandy.interfaces.EncryptedInterface(bucket, access, secret, url, encryption='AES', key=None, *args, **kwargs)¶
Bases: cottoncandy.interfaces.DefaultInterface
Interface that transparently encrypts everything uploaded to the cloud
- __init__(bucket, access, secret, url, encryption='AES', key=None, *args, **kwargs)¶
- Parameters
bucket –
access –
secret –
url –
encryption ('AES' | 'RSA') –
key (str) – For AES, the key itself; for RSA, the filename of a .pem-format key
backend ('s3'|'gdrive') – which backend to hook on to
args –
kwargs –
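Example
A minimal construction sketch; the bucket name, credentials, endpoint URL, and key file below are placeholders:
>>> from cottoncandy.interfaces import EncryptedInterface
>>> cci = EncryptedInterface('my_bucket', '<access key>', '<secret key>',
...                          'https://s3.amazonaws.com',
...                          encryption='RSA', key='/path/to/key.pem')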
- download_stream(object_name)¶
Returns the CloudStream object for an object.
- Parameters
object_name (str) –
- Returns
- Return type
CloudStream object
- download_to_file(object_name, file_name)¶
Download cloud object to a file
- Parameters
object_name (str) –
file_name (str) – Absolute path where the data will be downloaded on disk
- upload_from_file(local_file_name, object_name=None, ExtraArgs={'ACL': 'authenticated-read'})¶
Upload a file to the cloud.
- Parameters
local_file_name (str) – Absolute path of the file to upload
object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.
ExtraArgs (dict) – Defaults to dict(ACL=DEFAULT_ACL)
- Returns
response
- Return type
boto3 response
- upload_object(object_name, body, acl='authenticated-read', **metadata)¶
FileSystemInterface¶
- class cottoncandy.interfaces.FileSystemInterface(*args, **kwargs)¶
Bases: cottoncandy.interfaces.BasicInterface
Emulate some file system functionality.
- __init__(*args, **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) –
SECRET_KEY (str) –
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
- Returns
cci – Cottoncandy interface object
- Return type
ccio
- cp(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)¶
Copy an object
- Parameters
source_name (str) – Name of object to be copied
dest_name (str) – Copy name
source_bucket (str) – If copying from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str) – If copying to a bucket different from the source bucket. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists
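Example
A minimal sketch, assuming an existing interface object cci (object names are illustrative):
>>> cci.cp('data/raw/scan01', 'backup/scan01', overwrite=True)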
- download_directory(directory, disk_name)¶
Download an entire directory. NOTE: currently only tested on S3.
- Parameters
directory (str) – Directory on S3 to download
disk_name (str) – Name of the directory on disk to download to
- get_browser()¶
Return an object which can be tab-completed to browse the contents of the bucket as if it were a file-system
See the documentation for cottoncandy.get_browser.
- get_object_owner(object_name)¶
- glob(pattern, **kwargs)¶
Return a list of object names in the cloud storage that match the glob pattern.
- Parameters
pattern (str) – A glob pattern string
verbose (bool, optional) – If True, also print object name and creation date
limit (None, int, optional) –
page_size (int, optional) –
- Returns
object_names
- Return type
list
Example
>>> cci.glob('/path/to/*/file01*.grp/image_data')
['/path/to/my/file01a.grp/image_data',
 '/path/to/my/file01b.grp/image_data',
 '/path/to/your/file01a.grp/image_data',
 '/path/to/your/file01b.grp/image_data']
>>> cci.glob('/path/to/my/file02*.grp/*')
['/path/to/my/file02a.grp/image_data',
 '/path/to/my/file02a.grp/text_data',
 '/path/to/my/file02b.grp/image_data',
 '/path/to/my/file02b.grp/text_data']
Some gotchas
- limit: None, int, optional
The maximum number of objects to return
- page_size: int, optional
This is important for buckets with loads of objects. By default, glob will download a maximum of 10^6 object names and perform the search on those. If more objects exist, the search might not find them all, and page_size should be increased.
Notes
If the bucket contains more than 10^6 objects, provide the page_size=10**7 kwarg.
- glob_google_drive(pattern)¶
Globbing on Google Drive
- Parameters
pattern –
- glob_s3(pattern, **kwargs)¶
Globbing on S3
- Parameters
pattern –
kwargs –
- ls(pattern, page_size=1000, limit=1000, verbose=False)¶
File-system like search for S3 objects
- Parameters
pattern (str) – An ls-style, command-line-like query
page_size (int (default: 1,000)) –
limit (int (default: 1,000)) –
- Returns
object_names – Object names that match the search pattern
- Return type
list
Notes
Increase page_size and limit if you have a lot of objects; otherwise, the search might not return all matching objects in the store.
- lsdir(path='/', limit=1000)¶
List the contents of a directory
- Parameters
path (str (default: "/")) –
- Returns
matches – The children of the path.
- Return type
list
- mv(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)¶
Move an object (make copy and delete old object)
- Parameters
source_name (str) – Name of object to be moved
dest_name (str) – New object name
source_bucket (str) – If moving an object from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str (defaults to None)) – If moving to another bucket, provide the bucket name. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists.
- rm(object_name, recursive=False, delete=True)¶
Delete an object, or a subtree (‘path/to/stuff’).
- Parameters
object_name (str) – The name of the object to delete. It can also be a subtree
recursive (bool) – When deleting a subtree, set recursive=True. This is similar in behavior to ‘rm -r /path/to/directory’.
delete (bool) – On Google Drive, whether to actually delete the file or only move it to the trash
Example
>>> import cottoncandy as cc
>>> cci = cc.get_interface('mybucket', verbose=False)
>>> response = cci.rm('data/experiment/file01.txt')
>>> cci.rm('data/experiment')
cannot remove 'data/experiment': use `recursive` to remove branch
>>> cci.rm('data/experiment', recursive=True)
deleting 15 objects...
- search(pattern, **kwargs)¶
Print the objects matching the glob pattern
See the glob documentation for details.