interfaces
ArrayInterface | Provides numpy.array concepts.
BasicInterface | Basic cottoncandy interface to the cloud.
DefaultInterface | Default cottoncandy interface to the cloud
FileSystemInterface | Emulate some file system functionality.
ArrayInterface
- class cottoncandy.interfaces.ArrayInterface(*args, **kwargs)
Bases: BasicInterface
Provides numpy.array concepts.
- __init__(*args, **kwargs)
- Parameters:
bucket_name (str)
ACCESS_KEY (str)
SECRET_KEY (str)
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
- Returns:
cci – Cottoncandy interface object
- Return type:
ccio
- cloud2dataset(object_root, **metadata)
Get a dataset representation of the object branch.
- Parameters:
object_root (str) – The branch to create a dataset from
- Returns:
cc_dataset_object – This can be conceptualized as implementing an h5py/pytables object with load() and keys() methods.
- Return type:
cottoncandy.BrowserObject
- cloud2dict(object_root, verbose=True, keys=None, threads=4, **metadata)
Download all the arrays of the object branch and return a dictionary. This is the complement to dict2cloud.
- Parameters:
object_root (str) – The branch to create the dictionary from
verbose (bool) – Whether to print object_root after completion
keys (list of str) – Specify which keys to download
threads (int) – number of connection threads to use
- Returns:
datadict – An arbitrary depth dictionary.
- Return type:
dict
- dict2cloud(object_name, array_dict, acl='authenticated-read', verbose=True, threads=4, **metadata)
Upload an arbitrary-depth dictionary containing arrays.
- Parameters:
object_name (str)
array_dict (dict) – An arbitrary depth dictionary of arrays. This can be conceptualized as implementing an HDF-like group
verbose (bool) – Whether to print object_name after completion
threads (int) – number of connection threads to use
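To picture how an arbitrary-depth dictionary can map onto flat object names, here is a minimal sketch in plain Python. The helper flatten_dict is hypothetical and not part of cottoncandy, and both the "/" separator and the resulting names are assumptions rather than the library's documented layout. Lists stand in for arrays so the sketch has no dependencies.

```python
def flatten_dict(object_name, array_dict, sep="/"):
    """Hypothetical helper: map a nested dict to flat {object_path: leaf} pairs."""
    flat = {}
    for key, value in array_dict.items():
        path = f"{object_name}{sep}{key}"
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the object path.
            flat.update(flatten_dict(path, value, sep=sep))
        else:
            # Leaves stand in for arrays (plain lists here, not np.ndarray).
            flat[path] = value
    return flat


nested = {"session1": {"run1": [1, 2], "run2": [3, 4]}, "mask": [0, 1]}
flat = flatten_dict("experiment", nested)
for path in sorted(flat):
    print(path)
```

Each leaf ends up under its own path, which is the sense in which the dictionary behaves like an HDF-like group.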
- download_dask_array(object_name, dask_name='array', threads=4)
Downloads a split matrix as a dask.array.Array object.
This uses the stored object metadata to reconstruct the full n-dimensional array uploaded using upload_dask_array.
Examples
>>> s3_response = cci.upload_dask_array('test_dim', arr, axis=-1)
>>> dask_object = cci.download_dask_array('test_dim')
>>> dask_object
dask.array<array, shape=(100, 600, 1000), dtype=float64, chunksize=(100, 600, 100)>
>>> dask_slice = dask_object[..., :200]
>>> dask_slice
dask.array<getitem..., shape=(100, 600, 200), dtype=float64, chunksize=(100, 600, 100)>
>>> downloaded_data = np.asarray(dask_slice) # this downloads the array
>>> downloaded_data.shape
(100, 600, 200)
- download_npy_array(object_name, threads=4)
Download a np.ndarray that was uploaded using np.save, reading it back with np.load.
- Parameters:
object_name (str)
threads (int) – number of connection threads to use
- Returns:
array
- Return type:
np.ndarray
- download_raw_array(object_name, buffersize=65536, threads=4, **kwargs)
Download a binary np.ndarray and return an np.ndarray object. This method downloads an array without any disk or memory overhead.
- Parameters:
object_name (str)
buffersize (optional, defaults to 2^16)
threads (int) – number of connection threads to use
- Returns:
array
- Return type:
np.ndarray
Notes
The object must have metadata containing: shape, dtype and a gzip boolean flag. This is all automatically handled by upload_raw_array.
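The role of the shape, dtype and gzip metadata can be sketched with the standard library alone. The helpers encode_raw and decode_raw below are hypothetical, they use the array module as a stand-in for numpy, and they handle only 2-D float64 data; cottoncandy's actual byte layout and metadata keys may differ.

```python
import gzip
from array import array


def encode_raw(values, shape, compress=True):
    """Hypothetical: turn flat float64 values into (body, metadata)."""
    body = array("d", values).tobytes()
    if compress:
        body = gzip.compress(body)
    # The metadata carries everything needed to rebuild the array later.
    metadata = {"shape": list(shape), "dtype": "float64", "gzip": compress}
    return body, metadata


def decode_raw(body, metadata):
    """Hypothetical: rebuild nested rows from raw bytes and metadata."""
    if metadata["gzip"]:
        body = gzip.decompress(body)
    flat = array("d")
    flat.frombytes(body)
    nrows, ncols = metadata["shape"]
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


body, metadata = encode_raw([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3))
rows = decode_raw(body, metadata)
print(rows)
```

Without the metadata, the bytes alone say nothing about the shape or about whether they are compressed, which is why the download depends on it.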
- download_sparse_array(object_name, threads=4)
Downloads a scipy.sparse array
- Parameters:
object_name (str) – The object name for the sparse array to be retrieved.
threads (int) – number of connection threads to use
- Returns:
arr – The array stored at the location given by object_name
- Return type:
scipy.sparse.spmatrix
- upload_dask_array(object_name, arr, axis=-1, buffersize=104857600, threads=4, **metakwargs)
Upload an array in chunks and store the metadata to reconstruct the complete matrix with dask.
- Parameters:
object_name (str)
arr (np.ndarray)
axis (int or None (default: -1)) – The axis along which to slice the array. If None is given, the array is chunked into ideal isotropic voxels. axis=None is a work in progress and currently works well only for near-isotropic matrices.
buffersize (scalar (default: 100MB)) – Byte size of the desired array chunks
threads (int) – number of connection threads to use
- Returns:
response
- Return type:
boto3 response
Notes
Each array chunk is uploaded as a raw np.array with the prefix “pt%04i”. The metadata is stored as a json file metadata.json. For example, if an array is uploaded with the name “my_array_name” and split into 2 parts, the following objects are created:
my_array_name/pt0000
my_array_name/pt0001
my_array_name/metadata.json
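The naming scheme in the notes above can be reproduced with ordinary string formatting. The helper chunk_object_names is hypothetical and only illustrates the “pt%04i” prefix and the metadata.json object; it does not upload anything.

```python
def chunk_object_names(object_name, nparts):
    """Hypothetical helper: list the object names described in the notes."""
    # One zero-padded part name per chunk, e.g. pt0000, pt0001, ...
    names = ["%s/pt%04i" % (object_name, part) for part in range(nparts)]
    # The metadata needed to reassemble the array is stored alongside.
    names.append("%s/metadata.json" % object_name)
    return names


names = chunk_object_names("my_array_name", 2)
for name in names:
    print(name)
```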
- upload_npy_array(object_name, array, acl='authenticated-read', threads=4, **metadata)
Upload a np.ndarray using np.save.
This method creates a copy of the array in memory before uploading since it relies on np.save to get a byte representation of the array.
- Parameters:
object_name (str)
array (numpy.ndarray)
acl (str) – ACL for this object
threads (int) – number of threads to use for uploading
**metadata – Extra kwargs are uploaded as object metadata
- Returns:
response
- Return type:
boto3 upload response
- upload_raw_array(object_name, array, compression=True, acl='authenticated-read', threads=4, **metadata)
Upload a binary representation of a np.ndarray.
This method reads the array content from memory to upload. It does not have any overhead.
- Parameters:
object_name (str)
array (np.ndarray)
compression (str, bool) – True uses the configuration defaults. False is no compression. Available options are: ‘gzip’, ‘LZ4’, ‘Zlib’, ‘Zstd’, ‘BZ2’ (the names are case-sensitive). NB: Zstd appears to be the only one that supports >2GB arrays.
acl (str) – ACL for the object
threads (int) – number of connection threads to use
**metadata (optional)
Notes
This method also uploads the array dtype, shape, and gzip flag as metadata.
- upload_sparse_array(object_name, arr, threads=4)
Uploads a scipy.sparse array as a folder of array objects
- Parameters:
object_name (str) – The name of the object to be stored.
arr (scipy.sparse.spmatrix) – A scipy.sparse array to be saved. If type is DOK or LIL, it will be converted to csr before saving
threads (int) – number of connection threads to use
BasicInterface
- class cottoncandy.interfaces.BasicInterface(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)
Bases: InterfaceObject
Basic cottoncandy interface to the cloud.
- __init__(bucket_name, ACCESS_KEY, SECRET_KEY, url=None, force_bucket_creation=False, verbose=True, backend='s3', **kwargs)
- Parameters:
bucket_name (str)
ACCESS_KEY (str) – The S3 access key, or client secrets json file
SECRET_KEY (str) – The S3 secret key, or client credentials file
url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create the bucket if it does not exist
verbose (bool) – Whether to print status messages
backend ('s3'|'gdrive') – Which backend to access: S3 or Google Drive
kwargs (dict) – S3 only. Passed to the backend.
- Returns:
cci – Cottoncandy interface object
- Return type:
ccio
- property bucket_name
- create_bucket(bucket_name, acl='authenticated-read')
Create a new bucket
- download_json(object_name, threads=1)
Download a JSON object
- Parameters:
object_name (str)
threads (int)
- Returns:
json_data – Dictionary representation of JSON file
- Return type:
dict
- download_object(object_name, threads=4)
Download object raw data. This simply calls the object body read() method.
- Parameters:
object_name (str)
threads (int) – number of threads to use for downloading
- Returns:
byte_data – Object byte contents
- Return type:
str
- download_pickle(object_name, threads=4)
Download a pickle object
- Parameters:
object_name (str)
threads (int) – number of threads to use for downloading
- Returns:
data_object
- Return type:
object
- download_stream(object_name, threads=4)
Returns the CloudStream object for an object.
- Parameters:
object_name (str) – Name of the object to download.
threads (int) – Number of threads to use for downloading.
- Return type:
CloudStream object
- download_to_file(object_name, file_name, threads=4)
Download cloud object to a file
- Parameters:
object_name (str)
file_name (str) – Absolute path where the data will be downloaded on disk
threads (int) – number of threads to use for downloading
- exists_bucket(bucket_name)
Check whether the bucket exists
- exists_object(object_name, bucket_name=None, raise_err=False)
Check whether object exists in bucket
- Parameters:
object_name (str) – The object name
raise_err (boolean) – If set to True, this function will throw an exception if the object does not exist.
- get_bucket()
Get bucket boto3 object
- get_bucket_objects(**kwargs)
Get list of objects from the bucket.
This is a wrapper to self.get_bucket().bucket.objects
- Parameters:
limit (int, 1000) – Maximum number of items to return
page_size (int, 1000) – The page size for pagination
filter (dict) – A dictionary with key ‘Prefix’, specifying a prefix string. Only return objects matching this string. Defaults to ‘/’ (i.e. all objects).
kwargs (optional) – Dictionary of {method:value} for bucket.objects
- Returns:
objects_list
- Return type:
list (boto3 objects)
Notes
If you get a ‘PaginationError’, this means you have a lot of items on your bucket and should increase page_size.
- get_bucket_size(limit=1000000, page_size=1000000)
Counts the size of all objects in the current bucket.
- Parameters:
limit (int, 10^6) – Maximum number of items to return
page_size (int, 10^6) – The page size for pagination
- Returns:
total_bytes – The byte count of all objects in the bucket.
- Return type:
int
Notes
Because paging does not work properly, this function will underestimate the total size if the bucket holds more objects than limit or page_size. Check the printed number of objects for suspicious round numbers. TODO(anunez): Remove this note when the bug is fixed.
- get_object(object_name, bucket_name=None)
Get a boto3 object. Create it if it doesn’t exist
- get_objects(**kwargs)
Like get_bucket_objects, but named to match the generic interface.
- get_size()
Gets the total size of the current container of objects. Generic naming.
- pathjoin(a, *p)
- rm_bucket(bucket_name)
Remove an empty bucket. Throws an exception when bucket is not empty.
- set_bucket(bucket_name)
Bucket to use
- show_buckets()
Show available buckets
- show_objects(limit=1000, page_size=1000)
Print objects in the current bucket
- upload_from_directory(disk_path, cloud_path=None, recursive=False, ExtraArgs={'ACL': 'authenticated-read'}, threads=4)
Upload a directory to the cloud
- upload_from_file(flname, object_name=None, ExtraArgs={'ACL': 'authenticated-read'}, threads=4)
Upload a file to the cloud.
- Parameters:
flname (str) – Absolute path of file to upload
object_name (str, None) – Name of uploaded object. If None, use the full file name as the object name.
ExtraArgs (dict) – Defaults to dict(ACL=DEFAULT_ACL)
threads (int) – number of threads to use for uploading
- Returns:
response
- Return type:
boto3 response
- upload_json(object_name, ddict, acl='authenticated-read', threads=1, **metadata)
Upload a dict as a JSON using json.dumps
- Parameters:
object_name (str)
ddict (dict) – The dict to upload
metadata (dict, optional)
threads (int)
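Since the upload relies on json.dumps, it can help to check locally that a dictionary survives the JSON round trip before uploading it. The sketch below uses only the standard library and makes no cottoncandy calls; the variable names are illustrative.

```python
import json

ddict = {"subject": "S01", "runs": [1, 2, 3], "params": {"tr": 2.0}}

# Serialize the dict to text, as json.dumps would before an upload.
body = json.dumps(ddict)

# Parsing the text back yields an equal dictionary.
restored = json.loads(body)
print(restored == ddict)

# JSON object keys are always strings, so non-string keys change type.
int_keys = json.loads(json.dumps({1: "a"}))
print(int_keys)
```

Values that JSON cannot represent, such as arrays or arbitrary Python objects, raise a TypeError in json.dumps and need a different upload method.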
- upload_object(object_name, body, acl='authenticated-read', threads=4, **metadata)
- upload_pickle(object_name, data_object, acl='authenticated-read', threads=4, **metadata)
Upload an object using pickle: pickle.dumps
- Parameters:
object_name (str)
data_object (object)
threads (int) – number of threads to use for uploading
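The pickle round trip that this method builds on can be tried locally with the standard library. This is a sketch of pickle.dumps and pickle.loads only, with illustrative variable names, and it does not contact any bucket.

```python
import pickle

data_object = {"labels": ("a", "b"), "weights": [0.1, 0.9]}

# pickle.dumps produces the bytes that would form the object body.
body = pickle.dumps(data_object)

# pickle.loads rebuilds the Python object, preserving types such as tuple.
restored = pickle.loads(body)
print(type(restored["labels"]).__name__, restored == data_object)
```

Unpickling can execute arbitrary code, so pickled objects should only be downloaded from buckets whose contents are trusted.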
DefaultInterface
- class cottoncandy.interfaces.DefaultInterface(*args, **kwargs)
Bases: FileSystemInterface, ArrayInterface, BasicInterface
Default cottoncandy interface to the cloud
This includes numpy.array and file-system-like concepts for easy data I/O and bucket/object exploration.
- __init__(*args, **kwargs)
- Parameters:
bucket_name (str)
ACCESS_KEY (str)
SECRET_KEY (str)
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
backend ('s3'|'gdrive') – Which backend to use
- Returns:
cci
- Return type:
cottoncandy.InterfaceObject
FileSystemInterface
- class cottoncandy.interfaces.FileSystemInterface(*args, **kwargs)
Bases: BasicInterface
Emulate some file system functionality.
- __init__(*args, **kwargs)
- Parameters:
bucket_name (str)
ACCESS_KEY (str)
SECRET_KEY (str)
endpoint_url (str) – The URL for the S3 gateway
force_bucket_creation (bool) – Create requested bucket if it doesn’t exist
- Returns:
cci – Cottoncandy interface object
- Return type:
ccio
- cp(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)
Copy an object
- Parameters:
source_name (str) – Name of object to be copied
dest_name (str) – Name of the copy
source_bucket (str) – If copying from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str) – If copying to a bucket different from the source bucket. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists
- download_directory(directory, disk_name)
Download an entire directory. NOTE: currently only tested on s3
- Parameters:
directory (str) – directory on s3 to download
disk_name – name of directory on disk to download to
- get_browser()
Return an object which can be tab-completed to browse the contents of the bucket as if it were a file-system
See documentation for cottoncandy.get_browser
- get_object_owner(object_name)
- glob(pattern, **kwargs)
Return a list of object names in the cloud storage that match the glob pattern.
- Parameters:
pattern (str) – A glob pattern string
verbose (bool, optional) – If True, also print object name and creation date
limit (None, int, optional)
page_size (int, optional)
- Returns:
object_names
- Return type:
list
Example
>>> cci.glob('/path/to/*/file01*.grp/image_data')
['/path/to/my/file01a.grp/image_data',
 '/path/to/my/file01b.grp/image_data',
 '/path/to/your/file01a.grp/image_data',
 '/path/to/your/file01b.grp/image_data']
>>> cci.glob('/path/to/my/file02*.grp/*')
['/path/to/my/file02a.grp/image_data',
 '/path/to/my/file02a.grp/text_data',
 '/path/to/my/file02b.grp/image_data',
 '/path/to/my/file02b.grp/text_data']
Extended Summary
Some gotchas
- limit: None, int, optional
The maximum number of objects to return
- page_size: int, optional
This is important for buckets with loads of objects. By default, glob will download a maximum of 10^6 object names and perform the search. If more objects exist, the search might not find them and the page_size should be increased.
Notes
If the bucket holds more than 10^6 objects, provide the page_size=10**7 kwarg.
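The page_size gotcha follows from matching a pattern against names that have already been listed. The sketch below illustrates this with fnmatch from the standard library. The helper match_names is hypothetical, and fnmatch semantics are an assumption: its "*" also matches "/", so cottoncandy's own matching may behave differently.

```python
from fnmatch import fnmatchcase

object_names = [
    "/path/to/my/file01a.grp/image_data",
    "/path/to/my/file01b.grp/image_data",
    "/path/to/my/file02a.grp/text_data",
    "/path/to/your/file01a.grp/image_data",
    "/path/to/your/file01b.grp/image_data",
]


def match_names(pattern, names):
    """Hypothetical stand-in: filter already-listed names by a glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


pattern = "/path/to/*/file01*.grp/image_data"
matches = match_names(pattern, object_names)

# If the listing is truncated (here to 2 names), later matches are missed.
truncated = match_names(pattern, object_names[:2])
print(len(matches), len(truncated))
```

The second call shows why a too-small listing silently drops results: names that were never downloaded cannot be matched.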
- glob_google_drive(pattern)
Globbing on google drive
- Parameters:
pattern
- glob_s3(pattern, **kwargs)
Globbing on S3
- Parameters:
pattern
kwargs
- ls(pattern, page_size=1000, limit=1000, verbose=False)
File-system like search for S3 objects
- Parameters:
pattern (str) – An ls-style, command-line-like query
page_size (int (default: 1,000))
limit (int (default: 1,000))
- Returns:
object_names – Object names that match the search pattern
- Return type:
list
Notes
Increase page_size and limit if you have a lot of objects; otherwise, the search might not return all matching objects in store.
- lsdir(path='/', limit=1000)
List the contents of a directory
- Parameters:
path (str (default: "/"))
- Returns:
matches – The children of the path.
- Return type:
list
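Object stores keep a flat list of names, so a directory listing has to be derived from name prefixes. The following is a minimal sketch of that idea. The helper children is hypothetical, and whether lsdir returns bare names or full paths is not stated here, so the output format is an assumption.

```python
def children(path, object_names):
    """Hypothetical helper: immediate children of `path` among flat names."""
    prefix = path.strip("/")
    prefix = prefix + "/" if prefix else ""
    found = set()
    for name in object_names:
        name = name.lstrip("/")
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if rest:
                # Keep only the first path component below the prefix.
                found.add(rest.split("/", 1)[0])
    return sorted(found)


names = ["data/exp1/a.npy", "data/exp1/b.npy", "data/exp2/c.npy", "README.txt"]
top = children("/", names)
sub = children("data", names)
leaf = children("data/exp1", names)
print(top, sub, leaf)
```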
- mv(source_name, dest_name, source_bucket=None, dest_bucket=None, overwrite=False)
Move an object (make copy and delete old object)
- Parameters:
source_name (str) – Name of object to be moved
dest_name (str) – New object name
source_bucket (str) – If moving object from a bucket different from the default. Defaults to self.bucket_name
dest_bucket (str (defaults to None)) – If moving to another bucket, provide the bucket name. Defaults to source_bucket
overwrite (bool (defaults to False)) – Whether to overwrite the dest_name object if it already exists.
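The "copy, then delete" behavior of a move can be sketched against an in-memory dictionary standing in for a bucket. The functions below are hypothetical and share only their names with the methods documented here. Raising FileExistsError when overwrite is False is an assumption; the library may handle an existing destination differently.

```python
def cp(store, source_name, dest_name, overwrite=False):
    """Hypothetical in-memory copy; `store` maps object names to bytes."""
    if dest_name in store and not overwrite:
        # Assumed behavior: refuse to clobber an existing object.
        raise FileExistsError(dest_name)
    store[dest_name] = store[source_name]


def mv(store, source_name, dest_name, overwrite=False):
    """Move = make a copy, then delete the old object."""
    cp(store, source_name, dest_name, overwrite=overwrite)
    del store[source_name]


store = {"a.txt": b"hello", "b.txt": b"world"}
mv(store, "a.txt", "c.txt")
print(sorted(store))
```

Because the delete happens only after the copy succeeds, a refused copy leaves the source object in place.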
- rm(object_name, recursive=False, delete=True)
Delete an object, or a subtree (‘path/to/stuff’).
- Parameters:
object_name (str) – The name of the object to delete. It can also be a subtree
recursive (bool) – When deleting a subtree, set recursive=True. This is similar in behavior to ‘rm -r /path/to/directory’.
delete (bool) – On Google Drive, whether to actually delete the file rather than only trash it
Example
>>> import cottoncandy as cc
>>> cci = cc.get_interface('mybucket', verbose=False)
>>> response = cci.rm('data/experiment/file01.txt')
>>> cci.rm('data/experiment')
cannot remove 'data/experiment': use `recursive` to remove branch
>>> cci.rm('data/experiment', recursive=True)
deleting 15 objects...
- search(pattern, **kwargs)
Print the objects matching the glob pattern
See glob documentation for details