Amazon Glacier and glaciercmd

I'm pretty sure that everyone who reads this blog is familiar with Amazon Web Services. If not, here's the rundown. Amazon Web Services (AWS) is a HUGE web platform created and managed by Amazon. The biggest selling point of this platform is that anyone with a credit card can make an account and start using it immediately. If you use only a few resources, you pay a small fee; if you use a lot, you pay a lot. AWS has become big with this pay-as-you-go method of billing, and it has its charms. However, it also means you can rack up thousands of dollars in bills by accidentally DDoSing yourself.

As said, the AWS platform is massive, with datacenters in the USA, the EU and Asia. Even though it offers a plethora of services, only three are interesting to me: EC2, S3 and the new Glacier. In this article, I'll be looking at the last two specifically.

Amazon S3

Amazon Simple Storage Service (S3) is a cloud storage service. What does that mean? Think of Dropbox, and it's basically just that. The main differences are the pay-as-you-go model, and that S3 is designed to store lots of files and serve them fast. S3 can be used to store the assets of websites, like images and files, and serve them to end users. Just pay Amazon and they'll make sure your files get served.

Now, I find that Amazon S3 works great, especially with the s3cmd tool, which can be found in the Debian and Ubuntu repositories. s3cmd enables you to upload files to your S3 buckets from the command line. I use it for backups myself, and upload a copy of every backup I make to S3. At the moment, I have about 20G of backups in S3, and it's expanding rapidly. This brings me to the main drawback of S3.

It can get expensive, fast. For the 20G of backups in my example above, I pay \(20G * 0.125 = 2.5\) USD a month. Not a whole lot. But backups are meant to be stored permanently, off-site. I currently still have about 400G of on-site backups as well. If I stored them all in S3, I would pay \(420G * 0.125 = 52.5\) USD a month.

To be fair, storing backups isn't really the way S3 is intended to be used. Backups are a store-once, read-never-until-something-goes-wrong type of data. Typically, you don't want to give anyone else access to your backups either, because they might contain sensitive data. S3 is meant to store a lot of files and serve them to a large audience, like photos on a website. Enter Amazon Glacier.

Amazon Glacier

Amazon Glacier (what, no fancy acronym this time?) is meant precisely for storing lots of data that does not have to be accessed frequently or quickly. The way it's set up is entirely different from S3. Sure, you still have an API you can use, and you're still storing files in the cloud. But apart from that, Amazon Glacier is:

Slow
Has no fancy AWS console
Slow
Cheap
Did I mention slow?

Like the natural phenomenon it is named after, Glacier is REALLY slow. Whereas S3 has buckets to store files in, Glacier has vaults to store archives in. You can put stuff in your vaults, but getting it back out is another thing entirely. Vault file lists are only generated about once every 24 hours, and requesting one for download takes about four hours. Archives you put in vaults lose all their metadata, and are assigned an ID instead of a file name. Luckily, you can still attach a description to an archive, which is included in the vault file lists. Archives can be downloaded from vaults, but again, this takes about four hours.
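Since archives only keep an ID and a description, one way to find a file again is to put the file name in the description and look it up in the vault file list later. The Python sketch below illustrates that idea on a made-up inventory. The field names (ArchiveList, ArchiveId, ArchiveDescription) are what I understand the Glacier inventory JSON to use, so check them against the Glacier documentation; the helper name and the sample IDs are purely hypothetical.

```python
import json


def index_by_description(inventory_json):
    """Map each archive description to its archive ID.

    Assumes the inventory is a JSON document with an "ArchiveList" array
    whose entries carry "ArchiveId" and "ArchiveDescription" fields.
    """
    inventory = json.loads(inventory_json)
    return {
        archive["ArchiveDescription"]: archive["ArchiveId"]
        for archive in inventory.get("ArchiveList", [])
    }


# Made-up inventory for illustration only; real archive IDs are long opaque strings.
sample_inventory = json.dumps({
    "ArchiveList": [
        {"ArchiveId": "EXAMPLE-ID-1", "ArchiveDescription": "backup-monday.tar.gz"},
        {"ArchiveId": "EXAMPLE-ID-2", "ArchiveDescription": "backup-tuesday.tar.gz"},
    ]
})

index = index_by_description(sample_inventory)
print(index["backup-monday.tar.gz"])
```

Note that if two archives share a description, this simple index only keeps the last one, so unique descriptions (a file name plus a date, for instance) work best.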

The characteristics of Glacier may seem really poor, but think about it. Glacier is meant for archiving stuff: stuff you usually don't access, but have to store safely and securely anyway. Stuff like backups, photos (as in backups or unprocessed RAW images), financial backlogs (to comply with government regulations), and so on. For data like this, it doesn't really matter if you have to wait a few hours before your download is ready, because the data access does not have to be real time. As long as the data is there when you need it.

Finally, unlike S3, Glacier is really cheap. If I were to store the entire 420G of backup data in Glacier, this would cost me \(420G * 0.011 = 4.62\) USD per month. A lot less than 52.5 USD.
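To double-check the arithmetic, here is a small Python sketch that compares the two services for both backup sizes mentioned in this article. It only uses the per-G storage prices quoted here (0.125 and 0.011 USD per month) and ignores any request or transfer fees; the function and constant names are my own, and actual prices may differ.

```python
from decimal import Decimal

# Per-G monthly storage prices as quoted in this article (USD).
S3_PRICE_PER_GB = Decimal("0.125")
GLACIER_PRICE_PER_GB = Decimal("0.011")


def monthly_cost(size_gb, price_per_gb):
    """Return the monthly storage cost in USD for size_gb gigabytes."""
    return Decimal(size_gb) * price_per_gb


for size_gb in (20, 420):
    s3 = monthly_cost(size_gb, S3_PRICE_PER_GB)
    glacier = monthly_cost(size_gb, GLACIER_PRICE_PER_GB)
    print(f"{size_gb}G: S3 {s3} USD, Glacier {glacier} USD per month")
```

Decimal is used instead of floats so that the results come out exactly as in the hand calculation.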

glaciercmd

The one thing that made S3 great to use was the availability of the s3cmd tool. Being able to use S3 from any script is great. To my dismay, no such tool existed for Glacier. So I decided to write one, the result of which can be found on the glaciercmd GitHub page.

glaciercmd supports all basic actions that can be done with Glacier.

It can list all your vaults in a given region
It can request a vault inventory, and poll the job until it's finished, presenting you with the archives in the vault
It can upload files to a vault
It can request the download of archives from a vault, by creating the job and polling it until the data is ready
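The "create a job, then poll until it's ready" pattern from the list above can be sketched in a few lines of Python. This is a generic illustration, not the actual glaciercmd code: poll_until_complete, its parameters, and the fake job check are all hypothetical names, and a real check would ask the Glacier API whether the job has completed.

```python
import time


def poll_until_complete(check_job, interval_seconds=900, max_attempts=48,
                        sleep=time.sleep):
    """Call check_job() until it returns True.

    Returns the number of attempts needed, or raises TimeoutError when
    max_attempts is exhausted. Sleeps interval_seconds between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if check_job():
            return attempt
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")


# Stand-in for a real status check: reports "done" on the third call.
fake_statuses = iter([False, False, True])

attempts = poll_until_complete(
    lambda: next(fake_statuses),
    interval_seconds=1,
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(f"job finished after {attempts} checks")
```

With jobs that take about four hours, there is little point in polling every few seconds, which is why the default interval in this sketch is 15 minutes.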

The only feature it's really missing in my eyes is a fancy progress bar, or indeed any feedback on how fast you're uploading or downloading. That's for another day.

Let me know in the comments if you've tried glaciercmd, and what you think of it. If you want to contribute, just fork it on GitHub and send me a pull request.
