interfaces¶

| ArrayInterface | Provides numpy.array concepts. |
| BasicInterface | Basic cottoncandy interface to the cloud. |
| DefaultInterface | Default cottoncandy interface to the cloud. |
| EncryptedInterface | Interface that transparently encrypts everything uploaded to the cloud. |
| FileSystemInterface | Emulate some file system functionality. |
ArrayInterface¶
- class cottoncandy.interfaces.ArrayInterface(*args, **kwargs)¶
Bases: cottoncandy.interfaces.BasicInterface
Provides numpy.array concepts.
- __init__(*args, **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) –
SECRET_KEY (str) –
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
- Returns
cci – Cottoncandy interface object
- Return type
ccio
- cloud2dataset(object_root, **metadata)¶
Get a dataset representation of the object branch.
- Parameters
object_root (str) – The branch to create a dataset from
- Returns
cc_dataset_object – This can be conceptualized as implementing an h5py/pytables object with load() and keys() methods.
- Return type
cottoncandy.BrowserObject
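Example
A minimal sketch, assuming an existing interface object cci and a previously uploaded branch named 'my_dataset' (the name is illustrative; output omitted):
>>> dataset = cci.cloud2dataset('my_dataset')
>>> dataset.keys()  # browse the arrays under the branch, h5py-style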
- cloud2dict(object_root, verbose=True, keys=None, **metadata)¶
Download all the arrays of the object branch and return a dictionary. This is the complement to dict2cloud.
- Parameters
object_root (str) – The branch to create the dictionary from
verbose (bool) – Whether to print object_root after completion
keys (list of str) – Specify which keys to download
- Returns
datadict – An arbitrary depth dictionary.
- Return type
dict
- dict2cloud(object_name, array_dict, acl='authenticated-read', verbose=True, **metadata)¶
Upload an arbitrary depth dictionary containing arrays
- Parameters
object_name (str) –
array_dict (dict) – An arbitrary depth dictionary of arrays. This can be conceptualized as implementing an HDF-like group
verbose (bool) – Whether to print object_name after completion
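Example
A minimal round-trip sketch pairing dict2cloud with cloud2dict, assuming an existing interface object cci and numpy imported as np (object names are illustrative):
>>> arrays = {'exp1': {'responses': np.zeros((10, 5)), 'stimuli': np.ones(10)}}
>>> cci.dict2cloud('my_dataset', arrays)
>>> datadict = cci.cloud2dict('my_dataset', verbose=False)
>>> datadict['exp1']['responses'].shape
(10, 5)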
- download_dask_array(object_name, dask_name='array')¶
Downloads a split matrix as a dask.array.Array object.
This uses the stored object metadata to reconstruct the full n-dimensional array uploaded using upload_dask_array.
Examples
>>> s3_response = cci.upload_dask_array('test_dim', arr, axis=-1)
>>> dask_object = cci.download_dask_array('test_dim')
>>> dask_object
dask.array<array, shape=(100, 600, 1000), dtype=float64, chunksize=(100, 600, 100)>
>>> dask_slice = dask_object[..., :200]
>>> dask_slice
dask.array<getitem..., shape=(100, 600, 200), dtype=float64, chunksize=(100, 600, 100)>
>>> downloaded_data = np.asarray(dask_slice)  # this downloads the array
>>> downloaded_data.shape
(100, 600, 200)
- download_npy_array(object_name)¶
Download a np.ndarray uploaded using np.save with np.load.
- Parameters
object_name (str) –
- Returns
array
- Return type
np.ndarray
- download_raw_array(object_name, buffersize=65536, **kwargs)¶
Download a binary np.ndarray and return an np.ndarray object. This method downloads the array without any disk or memory overhead.
- Parameters
object_name (str) –
buffersize (int, optional (default: 2**16)) –
- Returns
array
- Return type
np.ndarray
Notes
The object must have metadata containing: shape, dtype and a gzip boolean flag. This is all automatically handled by upload_raw_array.
- download_sparse_array(object_name)¶
Downloads a scipy.sparse array
- Parameters
object_name (str) – The object name for the sparse array to be retrieved.
- Returns
arr – The array stored at the location given by object_name
- Return type
scipy.sparse.spmatrix
- upload_dask_array(object_name, arr, axis=-1, buffersize=104857600, **metakwargs)¶
Upload an array in chunks and store the metadata to reconstruct the complete matrix with dask.
- Parameters
object_name (str) –
arr (np.ndarray) –
axis (int or None (default: -1)) – The axis along which to slice the array. If None is given, the array is chunked into ideal isotropic voxels. axis=None is a work in progress and currently works well only for near-isotropic matrices.
buffersize (int (default: 100MB)) – Byte size of the desired array chunks
- Returns
response
- Return type
boto3 response
Notes
Each array chunk is uploaded as a raw np.array with the prefix “pt%04i”. The metadata is stored as a JSON file metadata.json. For example, if an array is uploaded with the name “my_array_name” and split into 2 parts, the following objects are created:
my_array_name/pt0000
my_array_name/pt0001
my_array_name/metadata.json
- upload_npy_array(object_name, array, acl='authenticated-read', **metadata)¶
Upload a np.ndarray using np.save.
This method creates a copy of the array in memory before uploading, since it relies on np.save to get a byte representation of the array.
- Parameters
object_name (str) –
array (numpy.ndarray) –
acl (ACL for this object) –
**metadata (extra kwargs are uploaded to object metadata) –
- Returns
response
- Return type
boto3 upload response
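Example
A minimal round-trip sketch with download_npy_array, assuming an existing interface object cci and numpy imported as np (the object name is illustrative):
>>> arr = np.arange(12).reshape(3, 4)
>>> response = cci.upload_npy_array('data/arr.npy', arr)
>>> np.array_equal(cci.download_npy_array('data/arr.npy'), arr)
True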
- upload_raw_array(object_name, array, compression=True, acl='authenticated-read', **metadata)¶
Upload a binary representation of a np.ndarray.
This method reads the array content directly from memory to upload, so it incurs no additional disk or memory overhead.
- Parameters
object_name (str) –
array (np.ndarray) –
compression (str, bool) – True uses the configuration defaults. False means no compression. Available options are: ‘gzip’, ‘LZ4’, ‘Zlib’, ‘Zstd’, ‘BZ2’ (case-sensitive). NB: Zstd appears to be the only one that supports >2GB arrays.
acl (str) – ACL for the object
**metadata (optional) –
Notes
This method also uploads the array dtype, shape, and gzip flag as metadata.
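Example
A minimal round-trip sketch with download_raw_array, assuming an existing interface object cci and numpy imported as np (the object name is illustrative):
>>> arr = np.random.randn(100, 50).astype('float32')
>>> response = cci.upload_raw_array('data/arr', arr, compression='Zstd')
>>> downloaded = cci.download_raw_array('data/arr')
>>> np.array_equal(arr, downloaded)
True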
- upload_sparse_array(object_name, arr)¶
Uploads a scipy.sparse array as a folder of array objects
- Parameters
object_name (str) – The name of the object to be stored.
arr (scipy.sparse.spmatrix) – A scipy.sparse array to be saved. If the type is DOK or LIL, it will be converted to CSR before saving.
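Example
A minimal round-trip sketch with download_sparse_array, assuming an existing interface object cci and scipy.sparse imported as sparse (the object name is illustrative):
>>> mat = sparse.random(1000, 1000, density=0.01, format='csr')
>>> cci.upload_sparse_array('data/sparse_mat', mat)
>>> loaded = cci.download_sparse_array('data/sparse_mat')
>>> (mat != loaded).nnz  # no differing entries
0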
BasicInterface¶
- class cottoncandy.interfaces.BasicInterface(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)¶
Bases: cottoncandy.interfaces.InterfaceObject
Basic cottoncandy interface to the cloud.
- __init__(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) – The S3 access key, or client secrets json file
SECRET_KEY (str) – The S3 secret key, or client credentials file
url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create the bucket if it does not exist
verbose (bool) – Print status messages
backend ('s3'|'gdrive') – Access S3 or Google Drive
kwargs (dict) – S3 only. Passed to the backend.
- Returns
cci – Cottoncandy interface object
- Return type
ccio
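Example
A minimal construction sketch; the bucket name, credentials, and endpoint URL below are placeholders:
>>> from cottoncandy.interfaces import BasicInterface
>>> cci = BasicInterface('my_bucket',
...                      ACCESS_KEY='<access key>',
...                      SECRET_KEY='<secret key>',
...                      url='https://s3.amazonaws.com',
...                      force_bucket_creation=False)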
- property bucket_name¶
- create_bucket(bucket_name, acl='authenticated-read')¶
Create a new bucket
- download_json(object_name)¶
Download a JSON object
- Parameters
object_name (str) –
- Returns
json_data – Dictionary representation of JSON file
- Return type
dict
- download_object(object_name)¶
Download object raw data. This simply calls the object body read() method.
- Parameters
object_name (str) –
- Returns
byte_data – Object byte contents
- Return type
str
- download_pickle(object_name)¶
Download a pickle object
- Parameters
object_name (str) –
- Returns
data_object
- Return type
object
- download_stream(object_name)¶
Returns the CloudStream object for an object.
- Parameters
object_name (str) –
- Returns
- Return type
CloudStream object
- download_to_file(object_name, file_name)¶
Download cloud object to a file
- Parameters
object_name (str) –
file_name (str) – Absolute path where the data will be downloaded on disk
- exists_bucket(bucket_name)¶
Check whether the bucket exists
- exists_object(object_name, bucket_name=None, raise_err=False)¶
Check whether object exists in bucket
- Parameters
object_name (str) – The object name
raise_err (boolean) – If set to True, this function will throw an exception if the object does not exist.
- get_bucket()¶
Get bucket boto3 object
- get_bucket_objects(**kwargs)¶
Get list of objects from the bucket.
This is a wrapper around self.get_bucket().bucket.objects.
- Parameters
limit (int, 1000) – Maximum number of items to return
page_size (int, 1000) – The page size for pagination
filter (dict) – A dictionary with key ‘Prefix’, specifying a prefix string. Only return objects matching this string. Defaults to ‘/’ (i.e. all objects).
kwargs (optional) – Dictionary of {method: value} for bucket.objects
- Returns
objects_list
- Return type
list (boto3 objects)
Notes
If you get a ‘PaginationError’, this means you have a lot of items in your bucket and should increase page_size.
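Example
A minimal sketch listing object names under a prefix, assuming an existing interface object cci (the prefix is illustrative):
>>> objects = cci.get_bucket_objects(filter={'Prefix': 'data/'}, limit=100)
>>> names = [obj.key for obj in objects]  # boto3 ObjectSummary keys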
- get_bucket_size(limit=1000000, page_size=1000000)¶
Counts the size of all objects in the current bucket.
- Parameters
limit (int, 10^6) – Maximum number of items to return
page_size (int, 10^6) – The page size for pagination
- Returns
total_bytes – The byte count of all objects in the bucket.
- Return type
int
Notes
Because paging does not work properly, if there are more than limit * page_size objects in the bucket, this function will underestimate the total size. Check the printed number of objects for suspicious round numbers. TODO(anunez): Remove this note when the bug is fixed.
- get_object(object_name, bucket_name=None)¶
Get a boto3 object. Create it if it doesn’t exist
- get_objects(**kwargs)¶
Like get_bucket_objects, but more aptly named for the generic interface.
- Parameters
kwargs (optional) – Passed to get_bucket_objects
- get_size()¶
Gets the total size of the current container of objects. Generic naming.
- mpu_fileobject(object_name, file_object, buffersize=104857600, verbose=True, acl='authenticated-read', **metadata)¶
Multi-part upload for a file-object.
This automatically creates a multipart upload of an object. Useful for large objects that are loaded in memory. This avoids having to write the file to disk and then using upload_from_file.
- Parameters
object_name (str) –
file_object – file-like object (e.g. StringIO, file, etc)
buffersize (int, (defaults to 100MB)) – Byte size of the individual parts to create.
verbose (bool) – verbosity flag of whether to print mpu information to stdout
**metadata (optional) – Metadata to store along with MPU object
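Example
A minimal sketch uploading an in-memory file object, assuming an existing interface object cci (the object name and payload are illustrative):
>>> from io import BytesIO
>>> fobj = BytesIO(b'\x00' * (250 * 2**20))  # ~250MB; ~3 parts at the default buffersize
>>> response = cci.mpu_fileobject('big/blob', fobj)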
- pathjoin(a, *p)¶
- rm_bucket(bucket_name)¶
Remove an empty bucket. Throws an exception when bucket is not empty.
- set_bucket(bucket_name)¶
Bucket to use
- show_buckets()¶
Show available buckets
- show_objects(limit=1000, page_size=1000)¶
Print objects in the current bucket
- upload_from_directory(disk_path, cloud_path=None, recursive=False, ExtraArgs={'ACL': 'authenticated-read'})¶
Upload a directory to the cloud
- upload_from_file(flname, object_name=None, ExtraArgs={'ACL': 'authenticated-read'})¶
Upload a file to the cloud.
- Parameters
flname (str) – Absolute path of the file to upload
object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.
ExtraArgs (dict) – Defaults to dict(ACL=DEFAULT_ACL)
- Returns
response
- Return type
boto3 response
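Example
A minimal sketch pairing upload_from_file with download_to_file, assuming an existing interface object cci (paths and object names are placeholders):
>>> response = cci.upload_from_file('/tmp/data.csv', object_name='data/data.csv')
>>> cci.download_to_file('data/data.csv', '/tmp/data_copy.csv')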
- upload_json(object_name, ddict, acl='authenticated-read', **metadata)¶
Upload a dict as JSON using json.dumps.
- Parameters
object_name (str) –
ddict (dict to upload) –
metadata (dict, optional) –
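Example
A minimal round-trip sketch with download_json, assuming an existing interface object cci (the object name is illustrative):
>>> cci.upload_json('config/params.json', {'alpha': 0.5, 'niter': 100})
>>> params = cci.download_json('config/params.json')
>>> params['alpha']
0.5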
- upload_object(object_name, body, acl='authenticated-read', **metadata)¶
- upload_pickle(object_name, data_object, acl='authenticated-read', **metadata)¶
Upload an object using pickle: pickle.dumps
- Parameters
object_name (str) –
data_object (object) –
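Example
A minimal round-trip sketch with download_pickle, assuming an existing interface object cci (the object name is illustrative):
>>> cci.upload_pickle('models/params.pkl', {'weights': [1, 2, 3]})
>>> cci.download_pickle('models/params.pkl')
{'weights': [1, 2, 3]}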
DefaultInterface¶
- class cottoncandy.interfaces.DefaultInterface(*args, **kwargs)¶
Bases: cottoncandy.interfaces.FileSystemInterface, cottoncandy.interfaces.ArrayInterface, cottoncandy.interfaces.BasicInterface
Default cottoncandy interface to the cloud
This includes numpy.array and file-system-like concepts for easy data I/O and bucket/object exploration.
- __init__(*args, **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) –
SECRET_KEY (str) –
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
backend ('s3'|'gdrive') – which backend to hook on to
- Returns
cci
- Return type
cottoncandy.InterfaceObject
EncryptedInterface¶
- class cottoncandy.interfaces.EncryptedInterface(bucket, access, secret, url, encryption='AES', key=None, *args, **kwargs)¶
Bases: cottoncandy.interfaces.DefaultInterface
Interface that transparently encrypts everything uploaded to the cloud
- __init__(bucket, access, secret, url, encryption='AES', key=None, *args, **kwargs)¶
- Parameters
bucket –
access –
secret –
url –
encryption ('AES' | 'RSA') –
key (str) – For AES, the key itself; for RSA, the filename of a .pem-format key
backend ('s3'|'gdrive') – which backend to hook on to
args –
kwargs –
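Example
A minimal construction sketch; the bucket name, credentials, endpoint URL, and key file below are placeholders:
>>> from cottoncandy.interfaces import EncryptedInterface
>>> cci = EncryptedInterface('my_bucket', '<access key>', '<secret key>',
...                          'https://s3.amazonaws.com',
...                          encryption='RSA', key='/path/to/key.pem')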
- download_stream(object_name)¶
Returns the CloudStream object for an object.
- Parameters
object_name (str) –
- Returns
- Return type
CloudStream object
- download_to_file(object_name, file_name)¶
Download cloud object to a file
- Parameters
object_name (str) –
file_name (str) – Absolute path where the data will be downloaded on disk
- upload_from_file(local_file_name, object_name=None, ExtraArgs={'ACL': 'authenticated-read'})¶
Upload a file to the cloud.
- Parameters
local_file_name (str) – Absolute path of the file to upload
object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.
ExtraArgs (dict) – Defaults to dict(ACL=DEFAULT_ACL)
- Returns
response
- Return type
boto3 response
- upload_object(object_name, body, acl='authenticated-read', **metadata)¶
FileSystemInterface¶
- class cottoncandy.interfaces.FileSystemInterface(*args, **kwargs)¶
Bases: cottoncandy.interfaces.BasicInterface
Emulate some file system functionality.
- __init__(*args, **kwargs)¶
- Parameters
bucket_name (str) –
ACCESS_KEY (str) –
SECRET_KEY (str) –
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
- Returns
cci – Cottoncandy interface object
- Return type
ccio
- cp(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)¶
Copy an object
- Parameters
source_name (str) – Name of object to be copied
dest_name (str) – Copy name
source_bucket (str) – If copying from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str) – If copying to a bucket different from the source bucket. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists
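Example
A minimal sketch, assuming an existing interface object cci (object names are illustrative):
>>> cci.cp('data/raw/scan01', 'backup/scan01', overwrite=True)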
- download_directory(directory, disk_name)¶
Download an entire directory. NOTE: currently only tested on S3.
- Parameters
directory (str) – Directory on S3 to download
disk_name (str) – Name of the directory on disk to download to
- get_browser()¶
Return an object which can be tab-completed to browse the contents of the bucket as if it were a file-system
See the documentation for cottoncandy.get_browser.
- get_object_owner(object_name)¶
- glob(pattern, **kwargs)¶
Return a list of object names in the cloud storage that match the glob pattern.
- Parameters
pattern (str) – A glob pattern string
verbose (bool, optional) – If True, also print object name and creation date
limit (None, int, optional) –
page_size (int, optional) –
- Returns
object_names
- Return type
list
Example
>>> cci.glob('/path/to/*/file01*.grp/image_data')
['/path/to/my/file01a.grp/image_data',
 '/path/to/my/file01b.grp/image_data',
 '/path/to/your/file01a.grp/image_data',
 '/path/to/your/file01b.grp/image_data']
>>> cci.glob('/path/to/my/file02*.grp/*')
['/path/to/my/file02a.grp/image_data',
 '/path/to/my/file02a.grp/text_data',
 '/path/to/my/file02b.grp/image_data',
 '/path/to/my/file02b.grp/text_data']
Some gotchas
- limit: None, int, optional
The maximum number of objects to return
- page_size: int, optional
This is important for buckets with loads of objects. By default, glob will download a maximum of 10^6 object names and perform the search on those. If more objects exist, the search might not find them all, and page_size should be increased.
Notes
If the bucket contains more than 10^6 objects, provide the page_size=10**7 kwarg.
- glob_google_drive(pattern)¶
Globbing on Google Drive
- Parameters
pattern –
- glob_s3(pattern, **kwargs)¶
Globbing on S3
- Parameters
pattern –
kwargs –
- ls(pattern, page_size=1000, limit=1000, verbose=False)¶
File-system like search for S3 objects
- Parameters
pattern (str) – An ls-style, command-line-like query
page_size (int (default: 1,000)) –
limit (int (default: 1,000)) –
- Returns
object_names – Object names that match the search pattern
- Return type
list
Notes
Increase page_size and limit if you have a lot of objects; otherwise, the search might not return all matching objects in the store.
- lsdir(path='/', limit=1000)¶
List the contents of a directory
- Parameters
path (str (default: "/")) –
- Returns
matches – The children of the path.
- Return type
list
- mv(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)¶
Move an object (make copy and delete old object)
- Parameters
source_name (str) – Name of object to be moved
dest_name (str) – New object name
source_bucket (str) – If moving an object from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str (defaults to None)) – If moving to another bucket, provide the bucket name. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists.
- rm(object_name, recursive=False, delete=True)¶
Delete an object, or a subtree (‘path/to/stuff’).
- Parameters
object_name (str) – The name of the object to delete. It can also be a subtree
recursive (bool) – When deleting a subtree, set recursive=True. This is similar in behavior to ‘rm -r /path/to/directory’.
delete (bool) – On Google Drive, whether to actually delete the file or only move it to the trash
Example
>>> import cottoncandy as cc
>>> cci = cc.get_interface('mybucket', verbose=False)
>>> response = cci.rm('data/experiment/file01.txt')
>>> cci.rm('data/experiment')
cannot remove 'data/experiment': use `recursive` to remove branch
>>> cci.rm('data/experiment', recursive=True)
deleting 15 objects...
- search(pattern, **kwargs)¶
Print the objects matching the glob pattern
See the glob documentation for details.