`interfaces`¶

`ArrayInterface`(*args, **kwargs)	Provides numpy.array concepts.
`BasicInterface`(bucket_name, ACCESS_KEY, ...)	Basic cottoncandy interface to the cloud.
`DefaultInterface`(*args, **kwargs)	Default cottoncandy interface to the cloud
`FileSystemInterface`(*args, **kwargs)	Emulate some file system functionality.
`InterfaceObject`()

`ArrayInterface`¶

class cottoncandy.interfaces.ArrayInterface(*args, **kwargs)¶

Bases: BasicInterface

Provides numpy.array concepts.

__init__(*args, **kwargs)¶

Parameters:

bucket_name (str)
ACCESS_KEY (str)
SECRET_KEY (str)
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist

Returns:

cci – Cottoncandy interface object

Return type:

ccio

cloud2dataset(object_root, **metadata)¶

Get a dataset representation of the object branch.

Parameters:

object_root (str) – The branch to create a dataset from

Returns:

cc_dataset_object – This can be conceptualized as implementing an h5py/pytables object with load() and keys() methods.

Return type:

cottoncandy.BrowserObject

cloud2dict(object_root, verbose=True, keys=None, threads=4, **metadata)¶

Download all the arrays of the object branch and return a dictionary. This is the complement to dict2cloud

Parameters:

object_root (str) – The branch to create the dictionary from
verbose (bool) – Whether to print object_root after completion
keys (A list of strings) – Specify which keys to download
threads (int) – number of connection threads to use

Returns:

datadict – An arbitrary depth dictionary.

Return type:

dict

dict2cloud(object_name, array_dict, acl='authenticated-read', verbose=True, threads=4, **metadata)¶

Upload an arbitrary depth dictionary containing arrays

Parameters:

object_name (str)
array_dict (dict) – An arbitrary depth dictionary of arrays. This can be conceptualized as implementing an HDF-like group
verbose (bool) – Whether to print object_name after completion
threads (int) – number of connection threads to use
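
The mapping from an arbitrary depth dictionary to object names can be pictured with the sketch below. It is a stand-alone illustration, not cottoncandy's code: `flatten_branch` is a hypothetical helper, and it assumes each leaf array ends up under a slash-joined name built from `object_name` and the dictionary keys.

```python
def flatten_branch(object_name, array_dict):
    """Map each leaf of a nested dict to a slash-joined object name.

    Hypothetical helper for illustration; nothing is uploaded.
    """
    flat = {}
    for key, value in array_dict.items():
        child_name = f"{object_name}/{key}"
        if isinstance(value, dict):
            # A nested dict behaves like an HDF-like group: recurse into it.
            flat.update(flatten_branch(child_name, value))
        else:
            # A leaf stands in for an array that would be stored as one object.
            flat[child_name] = value
    return flat


nested = {"run1": {"responses": [1, 2, 3], "stimuli": [4, 5]}, "mask": [0, 1]}
flat = flatten_branch("subject01", nested)
print(sorted(flat))
```

Under this assumption, calling cloud2dict on the same root would walk the names back into the nested structure.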

download_dask_array(object_name, dask_name='array', threads=4)¶

Downloads a split matrix as a dask.array.Array object

This uses the stored object metadata to reconstruct the full n-dimensional array uploaded using upload_dask_array.

Examples

>>> s3_response = cci.upload_dask_array('test_dim', arr, axis=-1)
>>> dask_object = cci.download_dask_array('test_dim')
>>> dask_object
dask.array<array, shape=(100, 600, 1000), dtype=float64, chunksize=(100, 600, 100)>
>>> dask_slice = dask_object[..., :200]
>>> dask_slice
dask.array<getitem..., shape=(100, 600, 200), dtype=float64, chunksize=(100, 600, 100)>
>>> downloaded_data = np.asarray(dask_slice) # this downloads the array
>>> downloaded_data.shape
(100, 600, 200)
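
The slice in the example covers only part of the last axis, which is split into chunks of 100. The arithmetic below shows which chunks such a slice overlaps. It is a plain-Python sketch with a hypothetical helper (`chunks_touched`); dask performs its own chunk bookkeeping and scheduling, so this only illustrates the idea.

```python
def chunks_touched(axis_length, chunk_length, start, stop):
    """Return the indices of the chunks overlapping the half-open range [start, stop)."""
    if not 0 <= start < stop <= axis_length:
        raise ValueError("expected 0 <= start < stop <= axis_length")
    first = start // chunk_length
    last = (stop - 1) // chunk_length
    return list(range(first, last + 1))


# Last axis of the example: length 1000, chunks of 100, slice [..., :200].
touched = chunks_touched(axis_length=1000, chunk_length=100, start=0, stop=200)
print(touched)
```

Only the first two of the ten chunks overlap the slice in this calculation.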

download_npy_array(object_name, threads=4)¶

Download a np.ndarray uploaded using np.save with np.load.

Parameters:

object_name (str)
threads (int) – number of connection threads to use

Returns:

array

Return type:

np.ndarray

download_raw_array(object_name, buffersize=65536, threads=4, **kwargs)¶

Download a binary np.ndarray and return an np.ndarray object. This method downloads an array without any disk or memory overhead.

Parameters:

object_name (str)
buffersize (int, optional) – Defaults to 65536 (2^16)
threads (int) – number of connection threads to use

Returns:

array

Return type:

np.ndarray

Notes

The object must have metadata containing: shape, dtype and a gzip boolean flag. This is all automatically handled by upload_raw_array.
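
The sketch below illustrates why shape, dtype and the gzip flag are enough to rebuild an array from raw bytes. It uses only the standard library rather than numpy. The `decode_raw` helper, the `STRUCT_CODES` table and the little-endian byte order are assumptions made for this illustration, not cottoncandy's implementation.

```python
import gzip
import struct
from math import prod

# Hypothetical mapping from dtype names to struct format codes (illustration only).
STRUCT_CODES = {"float64": "d", "float32": "f", "int32": "i", "int64": "q"}


def decode_raw(body, metadata):
    """Turn raw bytes plus metadata into a flat list of values and a shape."""
    if metadata["gzip"]:
        body = gzip.decompress(body)
    shape = tuple(metadata["shape"])
    # Assumes little-endian data; the element count comes from the shape.
    fmt = "<%d%s" % (prod(shape), STRUCT_CODES[metadata["dtype"]])
    if len(body) != struct.calcsize(fmt):
        raise ValueError("byte count does not match shape and dtype")
    return list(struct.unpack(fmt, body)), shape


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
body = gzip.compress(struct.pack("<6d", *values))
metadata = {"shape": [2, 3], "dtype": "float64", "gzip": True}
decoded, shape = decode_raw(body, metadata)
print(decoded, shape)
```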

download_sparse_array(object_name, threads=4)¶

Downloads a scipy.sparse array

Parameters:

object_name (str) – The object name for the sparse array to be retrieved.
threads (int) – number of connection threads to use

Returns:

arr – The array stored at the location given by object_name

Return type:

scipy.sparse.spmatrix

upload_dask_array(object_name, arr, axis=-1, buffersize=104857600, threads=4, **metakwargs)¶

Upload an array in chunks and store the metadata to reconstruct the complete matrix with dask.

Parameters:

object_name (str)
arr (np.ndarray)
axis (int or None (default: -1)) – The axis along which to slice the array. If None is given, the array is chunked into ideal isotropic voxels. axis=None is a work in progress and currently works well only for near-isotropic arrays
buffersize (scalar (default: 100MB)) – Byte size of the desired array chunks
threads (int) – number of connection threads to use

Returns:

response

Return type:

boto3 response

Notes

Each array chunk is uploaded as a raw np.array with the prefix “pt%04i”. The metadata is stored as a json file metadata.json. For example, if an array is uploaded with the name “my_array_name” and split into 2 parts, the following objects are created:

my_array_name/pt0000
my_array_name/pt0001
my_array_name/metadata.json
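
The naming scheme can be reproduced with ordinary string formatting. The helper below (`chunk_object_names`, a hypothetical name) only builds the object names described in the note; it does not upload anything.

```python
def chunk_object_names(object_name, n_parts):
    """List the object names for an array split into n_parts chunks."""
    names = ["%s/pt%04i" % (object_name, index) for index in range(n_parts)]
    names.append("%s/metadata.json" % object_name)
    return names


names = chunk_object_names("my_array_name", 2)
print(names)
```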

upload_npy_array(object_name, array, acl='authenticated-read', threads=4, **metadata)¶

Upload a np.ndarray using np.save

This method creates a copy of the array in memory before uploading since it relies on np.save to get a byte representation of the array.

Parameters:

object_name (str)
array (numpy.ndarray)
acl (ACL for this object)
threads (int) – number of threads to use for uploading
**metadata (extra kwargs are uploaded to object metadata)

Returns:

response

Return type:

boto3 upload response

`BasicInterface`¶

class cottoncandy.interfaces.BasicInterface(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)¶

Bases: InterfaceObject

Basic cottoncandy interface to the cloud.

__init__(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)¶

Parameters:

bucket_name (str)
ACCESS_KEY (str) – The S3 access key, or client secrets json file
SECRET_KEY (str) – The S3 secret key, or client credentials file
url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create the bucket if it does not exist
verbose (bool) – Whether to print informational output
backend ('s3'|'gdrive') – Which backend to use: S3 or Google Drive
kwargs (dict) – S3 only. Passed to the backend.

Returns:

cci – Cottoncandy interface object

Return type:

ccio

property bucket_name¶

create_bucket(bucket_name, acl='authenticated-read')¶: Create a new bucket

download_json(object_name, threads=1)¶

Download a JSON object

Parameters:

object_name (str)
threads (int)

Returns:

json_data – Dictionary representation of JSON file

Return type:

dict

download_object(object_name, threads=4)¶

Download object raw data. This simply calls the object body read() method.

Parameters:

object_name (str)
threads (int) – number of threads to use for downloading

Returns:

byte_data – Object byte contents

Return type:

str

download_pickle(object_name, threads=4)¶

Download a pickle object

Parameters:

object_name (str)
threads (int) – number of threads to use for downloading

Returns:

data_object

Return type:

object

download_stream(object_name, threads=4)¶

Return the CloudStream object for an object.

Parameters:

object_name (str) – Name of the object to download
threads (int) – Number of threads to use for downloading

Return type:

CloudStream object

download_to_file(object_name, file_name, threads=4)¶

Download cloud object to a file

Parameters:

object_name (str)
file_name (str) – Absolute path where the data will be downloaded on disk
threads (int) – number of threads to use for downloading

exists_bucket(bucket_name)¶: Check whether the bucket exists

exists_object(object_name, bucket_name=None, raise_err=False)¶

Check whether object exists in bucket

Parameters:

object_name (str) – The object name
raise_err (boolean) – If set to True, this function will throw an exception if the object does not exist.

get_bucket()¶: Get bucket boto3 object

get_bucket_objects(**kwargs)¶

Get list of objects from the bucket.

This is a wrapper to self.get_bucket().bucket.objects

Parameters:

limit (int, 1000) – Maximum number of items to return
page_size (int, 1000) – The page size for pagination
filter (dict) – A dictionary with key ‘Prefix’, specifying a prefix string. Only return objects matching this string. Defaults to ‘/’ (i.e. all objects).
kwargs (optional) – Dictionary of {method:value} for bucket.objects

Returns:

objects_list

Return type:

list (boto3 objects)

Notes

If you get a ‘PaginationError’, this means you have a lot of items on your bucket and should increase page_size

get_bucket_size(limit=1000000, page_size=1000000)¶

Counts the size of all objects in the current bucket.

Parameters:

limit (int, 10^6) – Maximum number of items to return
page_size (int, 10^6) – The page size for pagination

Returns:

total_bytes – The byte count of all objects in the bucket.

Return type:

int

Notes

Because paging does not work properly, this function underestimates the total size when the bucket holds more objects than limit or page_size allow. Check the printed number of objects for suspiciously round numbers. TODO(anunez): Remove this note when the bug is fixed.
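
The underestimate can be reproduced with a capped sum over a list of object sizes. This is a toy model with a hypothetical helper (`total_size`) and does not call any cloud API. It only shows why a suspiciously round object count hints at a truncated listing.

```python
def total_size(object_sizes, limit):
    """Sum the sizes of at most `limit` objects, mimicking a capped listing."""
    counted = object_sizes[:limit]
    return sum(counted), len(counted)


sizes = [10] * 2500  # 2500 objects of 10 bytes each
capped_bytes, capped_count = total_size(sizes, limit=1000)
full_bytes, full_count = total_size(sizes, limit=len(sizes))
print(capped_count, capped_bytes)
print(full_count, full_bytes)
```

In this model the capped call counts exactly 1000 objects, the round number the note warns about.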

get_object(object_name, bucket_name=None)¶: Get a boto3 object. Create it if it doesn’t exist

get_objects(**kwargs)¶: Like get_bucket_objects, but named to match the generic interface.

get_size()¶: Gets the total size of the current container of objects. Generic naming.

pathjoin(a, *p)¶

rm_bucket(bucket_name)¶: Remove an empty bucket. Throws an exception when bucket is not empty.

set_bucket(bucket_name)¶: Bucket to use

show_buckets()¶: Show available buckets

show_objects(limit=1000, page_size=1000)¶: Print objects in the current bucket

upload_from_directory(disk_path, cloud_path=None, recursive=False, ExtraArgs={'ACL': 'authenticated-read'}, threads=4)¶: Upload a directory to the cloud

upload_from_file(flname, object_name=None, ExtraArgs={'ACL': 'authenticated-read'}, threads=4)¶

Upload a file to the cloud.

Parameters:

flname (str) – Absolute path of file to upload
object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.
ExtraArgs (dict) – Defaults dict(ACL=DEFAULT_ACL)
threads (int) – number of threads to use for uploading

Returns:

response

Return type:

boto3 response

upload_json(object_name, ddict, acl='authenticated-read', threads=1, **metadata)¶

Upload a dict as a JSON using json.dumps

Parameters:

object_name (str)
ddict (dict to upload)
metadata (dict, optional)
threads (int)
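
Since the dictionary is serialized with json.dumps, only JSON-compatible values survive a round trip unchanged. The sketch below shows the serialization step alone, with the bytes kept in a local variable instead of being uploaded. The UTF-8 encoding is an assumption of this illustration.

```python
import json

payload = {"subject": "S01", "shape": (2, 3), "gzip": True}

# What would be sent as the object body (UTF-8 encoding assumed here).
body = json.dumps(payload).encode("utf-8")

# What a later download and json.loads would give back.
restored = json.loads(body.decode("utf-8"))
print(restored)
```

The tuple comes back as a list, so values that must keep their exact Python type are better suited to upload_pickle.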

upload_object(object_name, body, acl='authenticated-read', threads=4, **metadata)¶

upload_pickle(object_name, data_object, acl='authenticated-read', threads=4, **metadata)¶

Upload an object using pickle: pickle.dumps

Parameters:

object_name (str)
data_object (object)
threads (int) – number of threads to use for uploading

`DefaultInterface`¶

class cottoncandy.interfaces.DefaultInterface(*args, **kwargs)¶

Bases: FileSystemInterface, ArrayInterface, BasicInterface

Default cottoncandy interface to the cloud

This includes numpy.array and file-system-like concepts for easy data I/O and bucket/object exploration.

__init__(*args, **kwargs)¶

Parameters:

bucket_name (str)
ACCESS_KEY (str)
SECRET_KEY (str)
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
backend ('s3'|'gdrive') – which backend to hook on to

Returns:

cci

Return type:

cottoncandy.InterfaceObject

`FileSystemInterface`¶

class cottoncandy.interfaces.FileSystemInterface(*args, **kwargs)¶

Bases: BasicInterface

Emulate some file system functionality.

__init__(*args, **kwargs)¶

Parameters:

bucket_name (str)
ACCESS_KEY (str)
SECRET_KEY (str)
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist

Returns:

cci – Cottoncandy interface object

Return type:

ccio

cp(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)¶

Copy an object

Parameters:

source_name (str) – Name of object to be copied
dest_name (str) – Copy name
source_bucket (str) – If copying from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str) – If copying to a bucket different from the source bucket. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists

download_directory(directory, disk_name)¶

Download an entire directory. NOTE: currently only tested on S3.

Parameters:

directory (str) – Directory on S3 to download
disk_name – Name of the directory on disk to download to

get_browser()¶

Return an object which can be tab-completed to browse the contents of the bucket as if it were a file-system

See documentation for cottoncandy.get_browser

get_object_owner(object_name)¶

glob(pattern, **kwargs)¶

Return a list of object names in the cloud storage that match the glob pattern.

Parameters:

pattern (str) – A glob pattern string
verbose (bool, optional) – If True, also print object name and creation date
limit (None, int, optional)
page_size (int, optional)

Returns:

object_names

Return type:

list

Example

>>> cci.glob('/path/to/*/file01*.grp/image_data')
['/path/to/my/file01a.grp/image_data',
 '/path/to/my/file01b.grp/image_data',
 '/path/to/your/file01a.grp/image_data',
 '/path/to/your/file01b.grp/image_data']
>>> cci.glob('/path/to/my/file02*.grp/*')
['/path/to/my/file02a.grp/image_data',
 '/path/to/my/file02a.grp/text_data',
 '/path/to/my/file02b.grp/image_data',
 '/path/to/my/file02b.grp/text_data']

Notes

limit (None, int, optional) – The maximum number of objects to return.
page_size (int, optional) – Matters for buckets with very many objects. By default, glob downloads at most 10^6 object names and searches among them. If the bucket holds more objects, the search may miss some and page_size should be increased, e.g. by passing page_size=10**7.
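
The matching step can be tried locally on a list of names with the standard library's fnmatch module. This is a sketch of glob-style filtering in general, not of cottoncandy's internals. Note that `*` in fnmatch also matches `/`, which may differ from how the library treats path separators.

```python
from fnmatch import fnmatchcase

# Stand-in for object names already listed from a bucket.
object_names = [
    "/path/to/my/file01a.grp/image_data",
    "/path/to/my/file01a.grp/text_data",
    "/path/to/my/file02a.grp/image_data",
    "/path/to/your/file01b.grp/image_data",
]

pattern = "/path/to/*/file01*.grp/image_data"
matches = [name for name in object_names if fnmatchcase(name, pattern)]
print(matches)
```

A filter like this can only see names that were listed in the first place, which is why limit and page_size matter for large buckets.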

glob_google_drive(pattern)¶

Globbing on google drive

Parameters:

pattern

glob_s3(pattern, **kwargs)¶

Globbing on S3

Parameters:

pattern
kwargs

ls(pattern, page_size=1000, limit=1000, verbose=False)¶

File-system like search for S3 objects

Parameters:

pattern (str) – A ls-style command line like query
page_size (int (default: 1,000))
limit (int (default: 1,000))

Returns:

object_names – Object names that match the search pattern

Return type:

list

Notes

Increase page_size and limit if you have a lot of objects; otherwise, the search might not return all matching objects in the store.

lsdir(path='/', limit=1000)¶

List the contents of a directory

Parameters:

path (str (default: "/"))

Returns:

matches – The children of the path.

Return type:

list
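
Object stores keep flat names, so a directory listing has to be derived from name prefixes. The sketch below shows one way to compute the immediate children of a path from a list of names. `list_children` is a hypothetical helper and the library's own logic may differ.

```python
def list_children(object_names, path="/"):
    """Return the sorted immediate children of `path` among flat object names."""
    prefix = path.strip("/")
    prefix = prefix + "/" if prefix else ""
    children = set()
    for name in object_names:
        name = name.lstrip("/")
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if remainder:
            # Keep only the first path component below the prefix.
            children.add(remainder.split("/", 1)[0])
    return sorted(children)


names = ["data/exp/file01.txt", "data/exp/file02.txt", "data/notes.md", "readme.txt"]
root_children = list_children(names, "/")
data_children = list_children(names, "data")
print(root_children, data_children)
```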

mv(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)¶

Move an object (make copy and delete old object)

Parameters:

source_name (str) – Name of object to be moved
dest_name (str) – New object name
source_bucket (str) – If moving object from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str (defaults to None)) – If moving to another bucket, provide the bucket name. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists.
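
The copy-then-delete behavior and the role of overwrite can be modeled on an in-memory dictionary standing in for a bucket. This is a sketch under stated assumptions: raising FileExistsError when the destination exists is a choice made for this illustration, since the documentation above does not say how the library reports that case.

```python
def mv(store, source_name, dest_name, overwrite=False):
    """Move an entry in `store` by copying it and deleting the original."""
    if source_name not in store:
        raise KeyError(source_name)
    if source_name == dest_name:
        return  # nothing to move
    if dest_name in store and not overwrite:
        # Assumption for this sketch: refuse rather than silently replace.
        raise FileExistsError(dest_name)
    store[dest_name] = store[source_name]  # make the copy
    del store[source_name]                 # delete the old object


store = {"a.txt": b"1", "b.txt": b"2"}
mv(store, "a.txt", "c.txt")
try:
    mv(store, "b.txt", "c.txt")
    blocked = False
except FileExistsError:
    blocked = True
mv(store, "b.txt", "c.txt", overwrite=True)
print(store, blocked)
```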

rm(object_name, recursive=False, delete=True)¶

Delete an object, or a subtree (‘path/to/stuff’).

Parameters:

object_name (str) – The name of the object to delete. It can also be a subtree
recursive (bool) – When deleting a subtree, set recursive=True. This is similar in behavior to ‘rm -r /path/to/directory’.
delete (bool) – On Google Drive, whether to permanently delete the file (True) or only move it to the trash (False)

Example

>>> import cottoncandy as cc
>>> cci = cc.get_interface('mybucket', verbose=False)
>>> response = cci.rm('data/experiment/file01.txt')
>>> cci.rm('data/experiment')
cannot remove 'data/experiment': use `recursive` to remove branch
>>> cci.rm('data/experiment', recursive=True)
deleting 15 objects...
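
The difference between deleting a single object and deleting a branch can be modeled on an in-memory dictionary. The sketch below mirrors the behavior shown in the example, with a dict standing in for the bucket. It is an illustration only and returns a count instead of a cloud response.

```python
def rm(store, object_name, recursive=False):
    """Delete one entry, or a whole branch when recursive=True.

    Returns the number of entries removed from `store`.
    """
    if object_name in store:
        del store[object_name]
        return 1
    prefix = object_name.rstrip("/") + "/"
    branch = [name for name in store if name.startswith(prefix)]
    if branch and not recursive:
        print("cannot remove %r: use `recursive` to remove branch" % object_name)
        return 0
    for name in branch:
        del store[name]
    return len(branch)


store = {
    "data/experiment/file01.txt": b"a",
    "data/experiment/file02.txt": b"b",
    "data/other.txt": b"c",
}
removed_single = rm(store, "data/experiment/file01.txt")
refused = rm(store, "data/experiment")
removed_branch = rm(store, "data/experiment", recursive=True)
print(removed_single, refused, removed_branch, sorted(store))
```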

search(pattern, **kwargs)¶

Print the objects matching the glob pattern

See glob documentation for details

`InterfaceObject`¶

class cottoncandy.interfaces.InterfaceObject¶

Bases: object

__init__()¶
