S3
S3 is a popular key-value storage service provided by Amazon Web Services (AWS). It offers cheap and moderately fast (not as fast as a hard-drive, but faster than deep archive systems) mechanism for the storage and retrieval of arbitrary files.
S3 Doesn’t Have Directories, But It Does?
One major point of confusion when beginning to use S3 is the appearance of directories. It is important to remember that S3 is purely a key-value type storage system. This means if you want to store an file (e.g. myfile.txt
) on S3, you provide a key to store it at (e.g. this_is_the_key
). Later, you can retrieve the object with the same key. That’s all there is to it. The confusion arises because it is common practise to use path-looking keys to store objects. For example, I could choose to say myfile.txt
to the key user/my_usr/myfile.txt
. Further adding to the confusion is that when navigating through a bucket using the S3 web interface, it will detect path-like keys and show the user a directory tree structure.
However, once this is understood, this directory-like structure of the key-value storage (could we call it a key/directory duality?) is very useful for both intuitively grouping objects together and for providing a seamless mapping from a collection of files in a file system to their respective objects in S3 (you can sync
a whole directory to S3 with the command aws s3 sync ...
). The AWS CLI natively supports “directly” like operations (although it refers to a “directory”-like path as a prefix), through operations such as show all objects under a certain prefix, delete all objects under a certain prefix, e.t.c.
sync
AWS supports the concept of syncing “directories” between S3 or between your local file system and S3. The syncing functionality can be done through the AWS CLI. Unfortunately as of Feb 2020 it is not yet available through boto3
(see this GitHub issue for the status).
For example, the following command will sync all files and directories under the local path ~/my-dir/
to the prefix my-dir/
in the bucket my-bucket
:
Note that aws s3 sync
will not re-sync a changed file if it has the exact same filesize as before, even though the timestamps are different. This could happen quite easily if for example you changed a single character in a text file and re-saved it. The sync
command does not use checksums to verify the content is the same. This default behaviour is not what you would expect from a sync
command, and deviates from the behaviour of other sync tools such as rsync
. You can force sync
to also look at the timestamps by adding the command-line option --exact-timestamps
.
Data Consistency
AWS S3 provides read after write consistency for when you put an object under a new key into a bucket for the first time, as long as you haven’t queried for that key in the past (if you have, then only eventual consistency is guaranteed).
S3 provides eventual consistency when you overwrite an existing key in a bucket, or when you delete a key from a bucket. This means that if you update a key with a new object and then try and retrieve that object, you are not guaranteed to get the new version of the object.
Buckets have a very similar consistency model to the keys mentioned above. Bear in mind that if you delete a bucket and then immediately list all your buckets, the deleted bucket may still in the list! Buckets which have public access seem to take longer to disappear from listings after they have been deleted, they can take many hours (or even days!) to disappear. Note you can’t do anything with the bucket (if you try and copy an object into it an error will be thrown).
See AWS S3: Introduction - Consistency Model for more information on the S3 data consistency model.
Tags
S3 allows you to add tags to each object. The group of tags that belong to a specific object in S3 is called the tag set. Each tag consists of both a Key
(think of this as the tag name) and a Value
. A tag set is just an array of tags. Each tag set may contain up to 10 tags.
Tags can be used for a variety of reasons:
- You can use them to categorize your data
- You can apply lifecycling rules to specific tags
Using boto3
in Python, you can specify a tag set when you upload the object using put_object()
1:
Or you can specify a tag set for an object that already exists in S3 using the put_object_tagging()
function2:
To perform the above actions you will need the following S3 permissions:
By default, the bucket owner has these permissions and can grant them to other people.
Tag Limitations
As mentioned above, each object tag set may contain up to 10 tags. The Key
for each tag can be up to 128 characters, and the Value
for each tag can be up to 256 characters.
Tag Costs
While on first thought you might think that S3 object tags are essentially free, they do actually incur quite a significant cost at scale. As of April 2020, the going rate is US$0.01 per 10,000 tags per month. This is absolutely nothing for a small number of objects, but the cost can get significant with larger number of objects and/or multiple tags per object.