How I automated my backups to Amazon S3 using rsync and s3fs.
The following is how I automated my backups to Amazon S3 in about 5 minutes.
A lot has changed since my original post on automating my backups to S3 using s3sync. There are more mature and easier-to-use solutions now. I am switching because s3fs gives you many more options for using S3, is easier to set up, and is faster.
I now use s3fs to mount an S3 bucket to a local directory, then use rsync to keep the bucket up to date with my files. The following directions are geared towards Ubuntu Linux, but could be modified for any Linux distribution or Mac OS X.
STEP 1: Install s3fs
The first step is to install s3fs dependencies. (Assuming Ubuntu)
sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev
Next, install the most recent version of s3fs. As of now the most recent is r177, but a quick check of s3fs downloads will show the most recent.
wget http://s3fs.googlecode.com/files/s3fs-r177-source.tar.gz
tar -xzf s3fs*
cd s3fs
make
sudo make install
sudo mkdir /mnt/s3
sudo chown yourusername:yourusername /mnt/s3
STEP 2: Create script to mount your Amazon s3 bucket using s3fs and sync files.
The following assumes you already have a bucket created on Amazon S3. If this is not the case, you can use a tool like s3Fox to create one.
Open a text editor and write a shell script that mounts your bucket, runs rsync, then unmounts. It is not necessary to unmount your S3 directory after each rsync, but I prefer to be safe: one mistake, like an ‘rm’ on your root directory, could wipe all of the files on your machine and on your S3 mount. You should probably start with a test directory to be safe.
Make the file s3fs.sh
#!/bin/bash
/usr/bin/s3fs yourbucket -o accessKeyId=yourS3key -o secretAccessKey=yourS3secretkey /mnt/s3
/usr/bin/rsync -avz --delete /home/username/dir/you/want/to/backup /mnt/s3
/bin/umount /mnt/s3
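One thing the script above does not guard against is a failed mount: if s3fs cannot mount the bucket, rsync would write into the plain local /mnt/s3 directory instead. A minimal sketch of a guard follows, assuming the mountpoint utility (part of util-linux on Ubuntu) is installed. The helper names is_mounted and safe_sync are hypothetical and not part of the original script.

```shell
# Sketch: refuse to sync unless something is really mounted at the target.
# is_mounted and safe_sync are hypothetical helper names.

# Succeeds only if the given directory is a mount point.
is_mounted() {
    mountpoint -q "$1"
}

# Run rsync only when the mount check passes; otherwise report and fail.
safe_sync() {
    local src="$1" dest="$2"
    if ! is_mounted "$dest"; then
        echo "ERROR: $dest is not a mount point, skipping rsync" >&2
        return 1
    fi
    rsync -avz --delete "$src" "$dest"
}
```

In the script, the rsync line would then become safe_sync /home/username/dir/you/want/to/backup /mnt/s3, placed after the s3fs line.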
Note the --delete option: it deletes any files from the S3 mount that have been removed from the source.
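Because --delete removes files on the destination, it can be worth previewing a run first. The sketch below uses rsync's -n (--dry-run) flag, which lists what would be copied or deleted without changing anything. The function name preview_sync is hypothetical.

```shell
# Sketch: show what rsync would copy or delete, without touching any files.
# preview_sync is a hypothetical helper name.
preview_sync() {
    # -n is rsync's dry-run flag; -a and -v match the flags used in the script.
    rsync -avn --delete "$1" "$2"
}
```

For example: preview_sync /home/username/dir/you/want/to/backup /mnt/s3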
Change permissions to make it executable
chmod 700 s3fs.sh
Before you run the entire script, you might want to run each line separately to make sure everything is working properly. The paths to rsync, umount might be different on your system. (Use ‘which rsync’ to check) Just for fun, I did a ‘df -h’, which showed I now have 256 Terabytes available on the s3 mount!
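The path check can also be scripted. The following is a small sketch that uses the shell built-in command -v to print where each tool lives and to fail if one is missing. The function name require_cmds is hypothetical.

```shell
# Sketch: print the resolved path of each required tool; fail if any is missing.
# require_cmds is a hypothetical helper name.
require_cmds() {
    local cmd path missing=0
    for cmd in "$@"; do
        if path="$(command -v "$cmd")"; then
            echo "$cmd -> $path"
        else
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example: require_cmds s3fs rsync umount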
Next, run the script and let it do its work. This could take a long time depending on how much data you are uploading initially. Your internet upload speed will be the bottleneck.
sudo ./s3fs.sh
That’s it! You are backing up to Amazon S3. You probably want to automate this using cron once you are sure everything is running OK. To keep this tutorial simple, let’s assume you are setting up the cron job as root so we don’t need to worry about permissions for mounting and unmounting the directory.
STEP 3: Automate it with cron
sudo su
crontab -e
0 0 * * * /path/to/s3fs.sh # this runs it every day at midnight
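For an unattended job it helps to keep the script's output somewhere you can read it later. The sketch below builds the crontab non-interactively and appends a logging redirect to the entry. The log path /var/log/s3fs-backup.log is an assumed location and build_crontab is a hypothetical helper name.

```shell
# Sketch: add the nightly entry without opening an editor.
# The log file location is an assumption; adjust it to taste.
CRON_LINE='0 0 * * * /path/to/s3fs.sh >> /var/log/s3fs-backup.log 2>&1'

# Print the current crontab plus CRON_LINE, adding the line only if it is missing.
build_crontab() {
    local current
    # crontab -l exits non-zero when no crontab exists yet, hence "|| true".
    current="$(crontab -l 2>/dev/null || true)"
    if printf '%s\n' "$current" | grep -qxF "$CRON_LINE"; then
        printf '%s\n' "$current"
    else
        if [ -n "$current" ]; then
            printf '%s\n' "$current"
        fi
        printf '%s\n' "$CRON_LINE"
    fi
}
```

Running build_crontab | crontab - as root would install the result. Running it a second time leaves the crontab unchanged.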
p.s. I use this in combination with hourly backups to a second local machine using git to have revision history. I only back up nightly to S3, without revision history, in case my house burns down etc. If you would like to know how I set up my git backups locally, just leave a comment and I can make a follow-up post.

Hi John- great write up! just an FYI in this case the rsync -z switch (compression) has no effect because there is no remote rsync server; if required the http://www.subcloud.com version provides compression (and encryption)
Would be great if someone made a .deb and a gui for this.
Yes, me lazy…
Thanks Randy, I have updated the post.
this is awesome! And it even works. thank you so much!
This is fantastic — exactly what I’m looking to back-up my home server. One question: how would you go about adding e-mail verification? i.e. If the back-up is successful it sends an e-mail confirmation.
Thanks for the excellent article. I’m running into a “fuse: device not found, try ‘modprobe fuse’ first”. I’ve tried everything I can think of with no luck. sudo modprobe fuse runs (no output). Anyone else run into this or have any idea what’s wrong?
Dave, I had the same problem with my Gutsy EC2 instance; you might want to check out this thread: http://groups.google.com/group/ec2ubuntu/browse_thread/thread/9093236bc07d220b/2bf41010b95f8646?hl=en&lnk=gst
I installed fuse:
apt-get install -y fuse-utils encfs
and it worked for me. Not sure if I needed encfs but installed it anyway.
BTW - Great post John - keep up the awesome work!
I for one would like to see more information on how you set up your computer to perform hourly backups using git to have a revision history.
As a second part to my post:
I added a few lines to the backup script described above to provide email support alerting me that the backup took place and describing the backup procedure. Here is an abbreviated sample of the script:
#!/bin/bash
SENDMAIL=/usr/sbin/sendmail
EMAIL=jay@localhost
# script to upload local directory up to s3
# change to directory containing script
cd /jdata/s3sync
# jdata Directory
export AWS_ACCESS_KEY_ID=88888888
export AWS_SECRET_ACCESS_KEY=88888888
export SSL_CERT_DIR=/jdata/s3sync/certs
echo -e "To: ${EMAIL}\nSubject: s3backup results\nContent-type: text/plain\n\n" > /tmp/s3backup.log
# add -n for dry run
# append (>>) so the mail headers written above are kept
ruby s3sync.rb -r -v --ssl --delete /jdata/ jayNewBucket:/jdata >> /tmp/s3backup.log
# copy and modify line above for each additional folder to be synced
# home directory
ruby s3sync.rb -r -v --ssl --delete /home/ jayNewBucket:/home >> /tmp/s3backup.log
# copy and modify line above for each additional folder to be synced
cat /tmp/s3backup.log | ${SENDMAIL} "${EMAIL}"
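The same idea can be adapted to the s3fs/rsync script from the post, with the subject line reporting whether the run succeeded. This is only a sketch: it assumes a working local sendmail at /usr/sbin/sendmail as in the script above, the address is a placeholder, and run_and_mail is a hypothetical helper name.

```shell
# Sketch: run a command, capture its output, and mail a success/failure report.
# SENDMAIL and EMAIL are placeholders; run_and_mail is a hypothetical helper.
SENDMAIL="${SENDMAIL:-/usr/sbin/sendmail}"
EMAIL="${EMAIL:-you@localhost}"

run_and_mail() {
    local log status=0 result
    log="$(mktemp)"
    # Capture stdout and stderr of the backup command and remember its exit code.
    "$@" > "$log" 2>&1 || status=$?
    if [ "$status" -eq 0 ]; then
        result="OK"
    else
        result="FAILED (exit $status)"
    fi
    # Mail headers, a blank line, then the captured output.
    {
        printf 'To: %s\nSubject: s3 backup %s\nContent-Type: text/plain\n\n' "$EMAIL" "$result"
        cat "$log"
    } | "$SENDMAIL" "$EMAIL"
    rm -f "$log"
    return "$status"
}
```

For example, cron could call a wrapper that runs run_and_mail /path/to/s3fs.sh instead of calling s3fs.sh directly.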
Backing up to S3 isn’t necessarily the hard part. Backing up to S3 securely and efficiently is. Two things should be addressed in the intro to this howto: 1. Does using rsync in this fashion take full advantage of rsync? In other words, does s3fs permit rsync to obtain a hash of a portion of a file and update a portion of a file, or do those operations require the transfer of the entire file? 2. While S3 may encrypt things on their end, some users would prefer a solution where encryption happens locally, so the data is safe over the wire as well as in storage. Where, if anywhere, does s3fs encrypt the data?
Just curious, how do your S3 charges look?
I have 16 GBs of storage. Below is my cost for the past month
Greetings from Amazon Web Services,
This e-mail confirms that your latest billing statement is available on the AWS web site. Your account will be charged the following:
Total: $2.52
Please see the Account Activity area of the AWS web site for detailed account information:
And what does your restore procedure look like? Backing up data is one thing; getting it back in a decent manner is another.
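The post does not describe a restore. Since s3fs exposes the bucket as an ordinary directory, one plausible approach is to mount it and run rsync in the opposite direction. The sketch below assumes the bucket is already mounted; restore_dir is a hypothetical helper name and the paths in the example are placeholders.

```shell
# Sketch: copy a backed-up directory from the mounted bucket to a local target.
# restore_dir is a hypothetical helper; the bucket must already be mounted.
restore_dir() {
    local mount_dir="$1" name="$2" target="$3"
    mkdir -p "$target"
    # No --delete here, so nothing already present in the target is removed.
    rsync -av "$mount_dir/$name" "$target"
}
```

For example: restore_dir /mnt/s3 backup /home/username/restore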
I’ve got the same issue. No solution found yet.
Have to use s3cmd.
I ran into a few problems on three different boxes setting this up. I never got one of them working but the other two are working fine. See this thread http://groups.google.com/group/s3fs-devel/browse_thread/thread/34df46c5ca90560b
Very nice article, thank you for your time and detailed script. If you could help me, I am trying to figure out something and you might already have the answer.
Ok, this is what I understand from the documentation of rsync/s3fs and s3sync:
- s3sync uses MD5 checksum to check if a file has changed on your disk. This md5 is provided in the file listing from s3 (i.e. LIST request)
- rsync compares the actual content of the files (doing md5 on portions of files) to determine which parts of the file have changed, and only uploads what is really needed. However, s3 doesn’t allow retrieval of blocks but will send the whole file to you. s3fs actually keeps a cache of the files to limit the bandwidth, but comparing files with rsync will still require downloading what is already on s3 to this cache.
So now I wonder whether there is any big difference in bandwidth usage between s3fs/rsync and s3sync.
Did you evaluate the difference of bandwidth usage/price between when you had the s3sync backup and now?
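A note on how rsync decides what to send: by default it uses a "quick check" of file size and modification time, and only reads file contents for comparison when asked to (for example with -c/--checksum). When both source and destination are local paths, as with an s3fs mount, rsync also defaults to --whole-file, so a changed file is copied whole rather than as a delta. The sketch below shows the quick check using two throwaway local directories. It does not involve S3, and it does not measure how s3fs handles modification times or how many requests its stat calls generate. The helper names are hypothetical.

```shell
# Sketch: count the regular files rsync transfers between two directories.
# count_transfers and quick_check_demo are hypothetical helper names.
count_transfers() {
    # --itemize-changes prints one line per updated item; lines that start
    # with ">f" are file transfers. grep -c exits non-zero on a zero count,
    # hence the "|| true".
    rsync -a --itemize-changes "$1" "$2" | grep -c '^>f' || true
}

# Run the same sync twice between two throwaway local directories.
quick_check_demo() {
    local src dest
    src="$(mktemp -d)"
    dest="$(mktemp -d)"
    echo "some data" > "$src/a.txt"
    echo "first run:  $(count_transfers "$src/" "$dest/") file(s) transferred"
    echo "second run: $(count_transfers "$src/" "$dest/") file(s) transferred"
    rm -rf "$src" "$dest"
}
```

Calling quick_check_demo shows the file being transferred on the first run and skipped on the second, because size and modification time then match.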
The rsync --delete is not working properly for me. When I delete a single file, it works well: the file is also deleted on the S3 bucket. But when I delete a folder, both the folder and the files contained in it are still on the S3 bucket when I do an “s3cmd ls”. Do you have the same problem?
The rsync was really slow with s3fs, so searching around I found that Duplicity supports S3 backup. It was easy to configure and it works really well for me: embedded compression to save space in S3, and encryption with gpg. I did quite a few trials and the speed is also good: a 140 MB backup in 15 min.
Hi all.
I have successfully installed s3fs on Ubuntu 8.04 2.6.15-51-server.
The thing is, for any I/O operation on the mounted dir /mnt/dir-bkp I get an I/O error. Same thing for rsync,
e.g. rsync -va /home/dir1 /mnt/dir1-bkp/
Output
rsync: recv_generator: mkdir "/mnt/expo-bkp/dir1" failed: Input/output error (5)
*** Skipping any contents from this failed directory ***
Any ideas ?
Hey John,
Happened upon your blog via google
Do you have any experience syncing the other way around? I would like to keep a copy of our s3 assets in sync on the server. I have changed Paperclip to use the file system in development mode, and downloading GB’s of data from S3 is error prone.
Anyone else having problems with s3fs going ballistic, even when idle, and using 100% CPU? My laptop almost caught fire!
There’s a note on 177 that it’s fixed but not for me.
Anyone else?
Forgot to mention OS X 10.5 Intel.
I know it won’t help if you have decided to use s3fs, but as an alternative for backups
I use http://s3sync.net/ or, if you want a more stable and professional solution, you can use an EC2 image acting as an rsync server.
Thanks for the great article! I would love to hear how you set up your automatic backups with git. We are using git in our company and have found it to be a great resource. We have contemplated using git as a backup solution, but have been reluctant due to the unknown complexity in the event of conflicts. I would be very interested in hearing your solution.
I successfully installed s3fs as you described on 3 different systems running hardy (8.04), but cannot successfully compile on a box running dapper (6.06). All systems are kept fully up to date with apt-get update/upgrade.
First, apt-get couldn’t find (or install, obviously) libcurl4-openssl-dev. The dapper repositories only had libcurl3-openssl-dev, so I apt-got that instead. However, the compilation step still fails. Output from make (some redundant stuff removed) is:
g++ -ggdb -Wall -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -lfuse -lpthread -lcurl -lgssapi_krb5 -lkrb5 -lk5crypto -lkrb5support -lcom_err -lresolv -lidn -ldl -lssl -lcrypto -lz -I/usr/include/libxml2 -L/usr/lib -lxml2 -lz -lm -lcrypto s3fs.cpp -o s3fs
s3fs.cpp:1648:74: error: macro “fuse_main” passed 4 arguments, but takes just 3
s3fs.cpp: In function ‘int s3fs_statfs(const char*, statvfs*)’:
s3fs.cpp:1141: error: invalid use of undefined type ’struct statvfs’
s3fs.cpp:1139: error: forward declaration of ’struct statvfs’
…
s3fs.cpp: In function ‘int s3fs_readdir(const char*, void*, int (*)(void*, const char*, const stat*, off_t), off_t, fuse_file_info*)’:
s3fs.cpp:1364: error: ‘curl_multi_timeout’ was not declared in this scope
s3fs.cpp: In function ‘int my_fuse_opt_proc(void*, const char*, int, fuse_args*)’:
s3fs.cpp:1531: error: ‘FUSE_OPT_KEY_NONOPT’ was not declared in this scope
s3fs.cpp:1543: error: ‘FUSE_OPT_KEY_OPT’ was not declared in this scope
…
s3fs.cpp: In function ‘int main(int, char**)’:
s3fs.cpp:1587: error: variable ‘fuse_args custom_args’ has initializer but incomplete type
s3fs.cpp:1587: error: ‘FUSE_ARGS_INIT’ was not declared in this scope
…
s3fs.cpp: At global scope:
s3fs.cpp:440: warning: ’size_t readCallback(void*, size_t, size_t, void*)’ defined but not used
make: *** [all] Error 1
I know this isn’t a support site, but I would appreciate any help anyone can provide. I *cannot* dist-upgrade the machine in question. (That is, I cannot risk screwing this production machine up.)
In the past I have been using s3sync or s3fs to back up my data files to Amazon’s S3 storage. Recently I switched to using Amazon’s Elastic Compute Cloud (EC2) to mirror my data files using rsync. It works very well. It is much, much faster than using s3sync or s3fs to back up to Amazon S3. What normally took all night using s3sync or s3fs was accomplished in a few hours with EC2 using the method described in the HowTo:
http://www.freewisdom.org/en/all/entries/2008/09/17/backup_with_rsync/
Some comments:
1. you need to use Sun’s version of Java. For Ubuntu I did the following:
apt-get install -y sun-java6-bin unzip
sudo update-java-alternatives -s java-6-sun
2. It was necessary for me to provide the full path to the file id_rsa-keypair
The only problem that I ran into is how to use the ssh commands in scripts and cron. Each time that I run the script, it is necessary to interactively respond to the question:
RSA key fingerprint is cb:79:eb:b5:40:2d:9a:2b:20:47:53:c8:09:4c:54:57.
Are you sure you want to continue connecting (yes/no)?
What is the password:
The RSA fingerprint and IP address change each time that I run the script because it creates and terminates an EC2 Instance—each of which have their own unique DNS name.
The only way that I could get the script to work without interactively responding to the ssh prompts is to set up the key without a passphrase, which is the way that it is set up in the HowTo. This reduces the security of ssh and makes it easier for man-in-the-middle attacks. It was also necessary for me to modify the ssh commands which were described in the HowTo by adding an additional option to the ssh command:
‘ssh -o StrictHostKeyChecking=no …..’
This further reduces the security of the system, but I can see no other way to run the scripts.
Another concern that I have is that the ‘known_hosts’ file, which stores the host fingerprints, will become increasingly large with each run of the script.
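On the growing known_hosts file: OpenSSH has a UserKnownHostsFile option, and pointing it at /dev/null means host keys for throwaway instances are never recorded. The sketch below combines it with StrictHostKeyChecking=no. It still performs no host verification, so the man-in-the-middle concern above remains. The wrapper name ssh_ephemeral and the KEY_FILE variable are hypothetical.

```shell
# Sketch: wrapper for connecting to short-lived EC2 hosts from scripts and cron.
# KEY_FILE is a placeholder for the full path to your private key file.
KEY_FILE="${KEY_FILE:-/full/path/to/id_rsa-keypair}"

ssh_ephemeral() {
    # StrictHostKeyChecking=no     : do not prompt about unknown host keys
    # UserKnownHostsFile=/dev/null : do not record host keys, so nothing accumulates
    ssh -o StrictHostKeyChecking=no \
        -o UserKnownHostsFile=/dev/null \
        -i "$KEY_FILE" "$@"
}
```

For example: ssh_ephemeral root@your-instance-dns-name uptime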
I want to know how you setup your git backups as well. Please do write up a post.
Thanks.
Every time a local file has been changed, it uploads the whole new file, not just what has changed. That means that if you are using s3sync for regular backups you are wasting bandwidth.
Most of the daily change in user data is in files that have been updated, not in new files.
You can bypass this limitation using rsync and a third-party gateway like: http://www.s3rsync.com/
Hi,
I followed the instructions and installed everything.
I can mount the s3 drive, I can see I have 256T available. I have a bucket available.
When I run the command I get the following:
[root@localhost /]# rsync -avz --delete /home/mysqldumps /mnt/s3
building file list … done
mysqldumps/
rsync: recv_generator: failed to stat “/mnt/s3/mysqldumps/backup.txt”: Not a directory (20)
sent 85 bytes received 26 bytes 74.00 bytes/sec
total size is 124 speedup is 1.12
rsync error: some files could not be transferred (code 23) at main.c(892) [sender=2.6.8]
Any help would be appreciated.
it worked for me as well.. Thanks
Great work. I will keep following your articles.
I think this is a very relevant question; if rsync needs to download the files from S3 in order to see if the file was updated, a bandwidth charge is incurred. In a scenario where a large volume of data is being backed up, this is important. In your experience, does this occur?
Yes, I was also under the impression that rsync + s3fs incurs a lot of bandwidth overhead, since rsync needs to download the entire original file before doing a compare. That’s why there are commercial services that perform the rsync for you on an EC2 instance (e.g. iirc that’s what JungleDisk does).
For a different solution that may interest you, check out my S3 backup script: http://dev.davidsoergel.com/trac/s3napback/. It’s very easy to use and handles backup rotation, incremental backups, compression, encryption, and MySQL and Subversion dumps. In my case the incremental-ness is per file, so you get to keep a history of prior versions, not just the latest one. Enjoy!