How I automated my backups to Amazon S3 using rsync and s3fs.

October 27th, 2008

The following is how I automated my backups to Amazon S3 in about 5 minutes.

A lot has changed since my original post on automating my backups to S3 using s3sync. There are now more mature, easier-to-use solutions. I am switching because s3fs gives you many more options for using S3, is easier to set up, and is faster.

I now use s3fs to mount an S3 bucket on a local directory, then use rsync to keep the bucket up to date with my files. The following directions are geared towards Ubuntu Linux, but could be modified for any Linux distribution or Mac OS X.


STEP 1: Install s3fs

The first step is to install s3fs dependencies. (Assuming Ubuntu)

sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev

Next, install the most recent version of s3fs. As of this writing that is r177, but a quick check of the s3fs downloads will show the latest.

wget http://s3fs.googlecode.com/files/s3fs-r177-source.tar.gz
tar -xzf s3fs*
cd s3fs
make
sudo make install
sudo mkdir /mnt/s3
sudo chown yourusername:yourusername /mnt/s3
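After the install, a quick sanity check can confirm that the commands the backup script relies on are actually on your PATH. This is only a minimal sketch: the function name check_tools is made up for this example, and the list of tools is simply the ones used later in this post.

```shell
# Report any required command that is not on the PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools s3fs rsync umount || echo "install the missing tools before continuing"
```

If nothing is printed, all three commands were found.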

STEP 2: Create script to mount your Amazon s3 bucket using s3fs and sync files.

The following assumes you already have a bucket created on Amazon S3. If this is not the case, you can use a tool like s3Fox to create one.

Use the text editor of your choice to make a shell script that mounts your bucket, runs rsync, then unmounts. It is not necessary to unmount your S3 directory after each rsync, but I prefer to be safe. One mistake like an ‘rm’ on your root directory could wipe all of the files on your machine and on your S3 mount. You should probably start with a test directory to be safe.

Make the file s3fs.sh

#!/bin/bash
/usr/bin/s3fs yourbucket -o accessKeyId=yourS3key -o secretAccessKey=yourS3secretkey /mnt/s3
/usr/bin/rsync -avz --delete /home/username/dir/you/want/to/backup /mnt/s3
/bin/umount /mnt/s3

Note the --delete option. This will delete any files from the backup that have been removed on the ‘source’.

Change permissions to make the script executable

chmod 700 s3fs.sh
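Because --delete removes files from the destination, it can be worth previewing a sync before running it for real. The sketch below assumes rsync's standard -n (--dry-run) flag; the function name preview_sync is made up for this example and the paths in the usage line are the placeholders from the script above.

```shell
# Preview a sync without changing anything: -n (--dry-run) makes rsync
# only report what it would copy or delete.
preview_sync() {
  rsync -avn --delete "$1" "$2"
}

# Example: preview_sync /home/username/dir/you/want/to/backup /mnt/s3
```

Files that would be removed from the destination show up in the output as lines starting with "deleting".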

Before you run the entire script, you might want to run each line separately to make sure everything is working properly. The paths to rsync and umount might be different on your system (use ‘which rsync’ to check). Just for fun, I did a ‘df -h’, which showed I now have 256 terabytes available on the S3 mount!
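One way to check the whole script before trusting it is to have it print its commands instead of running them. The following is a sketch of a slightly more defensive variant, not a drop-in replacement: the DRY_RUN variable and the run and backup_to_s3 function names are invented for this example, and the bucket, keys and paths are the same placeholders as above.

```shell
#!/bin/bash
# Sketch of a more defensive s3fs.sh. BUCKET, SRC and MNT are placeholders,
# and the function names are made up for this example.
BUCKET="yourbucket"
SRC="/home/username/dir/you/want/to/backup"
MNT="/mnt/s3"

# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

backup_to_s3() {
  # Stop if s3fs reports an error, so rsync never writes into the bare local directory.
  run /usr/bin/s3fs "$BUCKET" -o accessKeyId=yourS3key -o secretAccessKey=yourS3secretkey "$MNT" || return 1
  # -z is left out: the destination is a local mount point, so rsync compression has no effect.
  run /usr/bin/rsync -av --delete "$SRC" "$MNT"
  status=$?
  # Always unmount, then report rsync's exit status.
  run /bin/umount "$MNT"
  return "$status"
}

# Print the three commands without running them.
DRY_RUN=1 backup_to_s3
```

Once the printed commands look right, calling backup_to_s3 without DRY_RUN=1 runs them for real.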

Next, run the script and let it do its work. This could take a long time depending on how much data you are uploading initially. Your internet upload speed will be the bottleneck.

sudo ./s3fs.sh

That’s it! You are backing up to Amazon S3. You probably want to automate this using cron once you are sure everything is running OK. To keep this tutorial simple, let’s assume you are setting up the cron job as root so we don’t need to worry about permissions for mounting and unmounting the directory.

STEP 3: Automate it with cron

sudo su
crontab -e
0 0 * * * /path/to/s3fs.sh # this runs it every day at midnight
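Cron jobs run unattended, so it may help to keep the script's output somewhere you can read it later. As a sketch, the helper below builds a crontab line that appends both stdout and stderr to a log file; the function name cron_entry and the log path are assumptions for this example, not part of the original setup.

```shell
# Build a crontab line that runs the backup at midnight and appends all
# output (stdout and stderr) to a log file. Both paths are placeholders.
cron_entry() {
  printf '0 0 * * * %s >> %s 2>&1\n' "$1" "$2"
}

cron_entry /path/to/s3fs.sh /var/log/s3fs-backup.log
```

The printed line can be pasted into the editor that ‘crontab -e’ opens.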

P.S. I use this in combination with hourly backups to a second local machine using git to have revision history. I only back up nightly to S3, without revision history, in case my house burns down, etc. If you would like to know how I set up my git backups locally, just leave a comment and I can make a follow-up post.

  1. October 28th, 2008 at 07:05

    Hi John- great write up! just an FYI in this case the rsync -z switch (compression) has no effect because there is no remote rsync server; if required the http://www.subcloud.com version provides compression (and encryption)

  2. October 28th, 2008 at 07:29

    Would be great if someone made a .deb and a gui for this.

    Yes, me lazy…

  3. October 28th, 2008 at 08:51

    Thanks Randy, I have updated the post.

  4. Richard
    October 30th, 2008 at 13:57

    this is awesome! And it even works. thank you so much!

  5. Shamus R
    November 5th, 2008 at 13:03

    This is fantastic — exactly what I’m looking to back-up my home server. One question: how would you go about adding e-mail verification? i.e. If the back-up is successful it sends an e-mail confirmation.

  6. Dave
    November 6th, 2008 at 10:11

    Thanks for the excellent article. I’m running into a “fuse: device not found, try ‘modprobe fuse’ first”. I’ve tried everything I can think of with no luck. sudo modprobe fuse runs (no output). Anyone else run into this or have any idea what’s wrong?

  7. November 6th, 2008 at 21:31

    Dave, I had the same problem with my Gutsy EC2 instance you might want to check out this thread http://groups.google.com/group/ec2ubuntu/browse_thread/thread/9093236bc07d220b/2bf41010b95f8646?hl=en&lnk=gst

    I installed fuse:
    apt-get install -y fuse-utils encfs

    and it worked for me. Not sure if I needed encfs but installed it anyway.

    BTW - Great post John - keep up the awesome work!

  8. Jay
    November 14th, 2008 at 05:22

    I for one would like to see more information on how you set up your computer to perform hourly backups using git to have a revision history.

    As a second part to my post:
    I added a few lines to the backup script described above to provide email support alerting me that the backup took place and describing the backup procedure. Here is an abbreviated sample of the script:

    #!/bin/bash
    SENDMAIL=/usr/sbin/sendmail
    EMAIL=jay@localhost
    # script to upload local directory upto s3
    #change to directory containing script
    cd /jdata/s3sync
    # jdata Directory
    export AWS_ACCESS_KEY_ID=88888888
    export AWS_SECRET_ACCESS_KEY=88888888
    export SSL_CERT_DIR=/jdata/s3sync/certs

    echo -e "To: ${EMAIL}\nSubject: s3backup results\nContent-type: text/plain\n\n" > /tmp/s3backup.log

    # and -n for dry run
    # append (>>) so the mail headers written above are not overwritten
    ruby s3sync.rb -r -v --ssl --delete /jdata/ jayNewBucket:/jdata >> /tmp/s3backup.log
    # copy and modify line above for each additional folder to be synced

    # home directory
    ruby s3sync.rb -r -v --ssl --delete /home/ jayNewBucket:/home >> /tmp/s3backup.log
    # copy and modify line above for each additional folder to be synced

    cat /tmp/s3backup.log | ${SENDMAIL} "${EMAIL}"

  9. Tom Metro
    November 14th, 2008 at 08:37

    Backing up to S3 isn’t necessarily the hard part. Backing up to S3 securely and efficiently, is. Two things should be addressed in the intro to this howto: 1. Does using rsync in this fashion take full advantage of rsync? In other words, does s3fs permit rsync to obtain a hash of a portion of a file, and update a portion of a file, or do those operations require the transfer of an entire file. 2. While S3 may encrypt things on their end, some users would prefer a solution where encryption happens locally, so the data is safe over the wire, as well as when in storage. Where, if anywhere, does s3fs encrypt the data?

  10. Jack
    November 17th, 2008 at 09:52

    Just curious, how do your S3 charges look?

  11. Jay
    November 18th, 2008 at 07:06

    I have 16 GBs of storage. Below is my cost for the past month

    Greetings from Amazon Web Services,

    This e-mail confirms that your latest billing statement is available on the AWS web site. Your account will be charged the following:

    Total: $2.52

    Please see the Account Activity area of the AWS web site for detailed account information:

  12. Chris
    November 19th, 2008 at 08:13

    And what does your restore procedure look like? Backing up data is one thing; getting it back in a decent manner is another.

  13. November 20th, 2008 at 05:12

    Dave :Thanks for the excellent article. I’m running into a “fuse: device not found, try ‘modprobe fuse’ first”. I’ve tried everything I can think of with no luck. sudo modprobe fuse runs (no output). Anyone else run into this or have any idea what’s wrong?

    I’ve got the same issue. No solution found yet.
    Have to use s3cmd.

  14. Dave
    November 20th, 2008 at 07:08

    I ran into a few problems on three different boxes setting this up. I never got one of them working but the other two are working fine. See this thread http://groups.google.com/group/s3fs-devel/browse_thread/thread/34df46c5ca90560b

  15. Anonymous
    November 24th, 2008 at 08:38

    Very nice article, thank you for your time and detailed script. If you could help me, I am trying to figure out something and you might already have the answer.

    Ok, this is what I understand from the documentation of rsync/s3fs and s3sync:
    - s3sync uses MD5 checksum to check if a file has changed on your disk. This md5 is provided in the file listing from s3 (i.e. LIST request)
    - rsync compares the actual content of the files (doing md5 on portion of files) to determine what parts of the file has changed and only upload what is really needed. However, s3 doesn’t allow retrieval of blocks but will send the whole file to you. s3fs actually does a cache of the files to limit the bandwidth, but comparing files with rsync will still require download of what is already on s3 to this cache.

    So, now, I wonder if there is no big bandwidth usage difference between using s3fs/rsync instead of s3sync?
    Did you evaluate the difference of bandwidth usage/price between when you had the s3sync backup and now?

  16. Bertrand
    November 24th, 2008 at 08:40

    The rsync --delete is not working properly for me. When I delete a single file, it works well: the file is also deleted on the S3 bucket. But when I delete a folder, both folders and files contained in it are still on the S3 bucket when I do a “s3cmd ls”. Do you have the same problem?

  17. Bertrand
    November 26th, 2008 at 08:20

    The rsync was really slow with s3fs, so searching around I found that Duplicity supports S3 backup. It was easy to configure and it works really well for me: embedded compression to save on space in S3 and encryption with gpg. I did quite a few trials and speed is also good: 140 MB backed up in 15 min.

  18. Dragos
    November 27th, 2008 at 10:52

    Hi all.

    I have successfully installed s3fs on Ubuntu 8.04 2.6.15-51-server.

    The thing is, for any I/O operation on the mounted dir /mnt/dir-bkp I get an I/O error. Same thing for rsync

    eg rsync -va /home/dir1 /mnt/dir1-bkp/

    Output
    rsync: recv_generator: mkdir "/mnt/expo-bkp/dir1" failed: Input/output error (5)
    *** Skipping any contents from this failed directory ***

    Any ideas ?

  19. ChristianD
    December 16th, 2008 at 13:27

    Hey John,

    Happened upon your blog via google :)

    Do you have any experience syncing the other way around? I would like to keep a copy of our s3 assets in sync on the server. I have changed Paperclip to use the file system in development mode, and downloading GB’s of data from S3 is error prone.

  20. January 3rd, 2009 at 13:38

    Anyone else having problems with s3fs going ballistic, even when idle, and using 100% CPU? My laptop almost caught fire!

    There’s a note on 177 that it’s fixed but not for me.

    Anyone else?

  21. January 3rd, 2009 at 13:38

    Forgot to mention OS X 10.5 Intel.

  22. Dragos
    January 3rd, 2009 at 14:20

    I know it won’t help if you have decided to use s3fs, but as an alternative for backups
    I use http://s3sync.net/ or, if you want a more stable and professional solution, you can use an EC2 image acting as an rsync server.

  23. January 22nd, 2009 at 22:08

    Thanks for the great article! I would love to hear how you set up your automatic backups with git. We are using git in our company and have found it to be a great resource. We have contemplated using git as a backup solution, but have been reluctant due to the unknown complexity in the event of conflicts. I would be very interested in hearing your solution.

  24. ThomasC
    January 28th, 2009 at 06:54

    I successfully installed s3fs as you described on 3 different systems running hardy (8.04), but cannot successfully compile on a box running dapper (6.06). All systems are kept fully up to date with apt-get update/upgrade.

    First, apt-get couldn’t find (or install, obviously) libcurl4-openssl-dev. The dapper repositories only had libcurl3-openssl-dev, so I apt-got that instead. However, the compilation step still fails. Output from make (some redundant stuff removed) is:

    g++ -ggdb -Wall -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse -lfuse -lpthread -lcurl -lgssapi_krb5 -lkrb5 -lk5crypto -lkrb5support -lcom_err -lresolv -lidn -ldl -lssl -lcrypto -lz -I/usr/include/libxml2 -L/usr/lib -lxml2 -lz -lm -lcrypto s3fs.cpp -o s3fs

    s3fs.cpp:1648:74: error: macro "fuse_main" passed 4 arguments, but takes just 3
    s3fs.cpp: In function ‘int s3fs_statfs(const char*, statvfs*)’:
    s3fs.cpp:1141: error: invalid use of undefined type ‘struct statvfs’
    s3fs.cpp:1139: error: forward declaration of ‘struct statvfs’

    s3fs.cpp: In function ‘int s3fs_readdir(const char*, void*, int (*)(void*, const char*, const stat*, off_t), off_t, fuse_file_info*)’:
    s3fs.cpp:1364: error: ‘curl_multi_timeout’ was not declared in this scope
    s3fs.cpp: In function ‘int my_fuse_opt_proc(void*, const char*, int, fuse_args*)’:
    s3fs.cpp:1531: error: ‘FUSE_OPT_KEY_NONOPT’ was not declared in this scope
    s3fs.cpp:1543: error: ‘FUSE_OPT_KEY_OPT’ was not declared in this scope

    s3fs.cpp: In function ‘int main(int, char**)’:
    s3fs.cpp:1587: error: variable ‘fuse_args custom_args’ has initializer but incomplete type
    s3fs.cpp:1587: error: ‘FUSE_ARGS_INIT’ was not declared in this scope

    s3fs.cpp: At global scope:
    s3fs.cpp:440: warning: ‘size_t readCallback(void*, size_t, size_t, void*)’ defined but not used
    make: *** [all] Error 1

    I know this isn’t a support site, but I would appreciate any help anyone can provide. I *cannot* dist-upgrade the machine in question. (That is, I cannot risk screwing this production machine up.)

  25. Jay Kramer
    January 28th, 2009 at 12:45

    In the past I have been using s3sync or s3fs to back up my data files to Amazon’s S3 storage. Recently I switched to using Amazon’s Elastic Compute Cloud (EC2) to mirror my data files using rsync. It works very well. It is much, much faster than using s3sync or s3fs to back up to Amazon S3. What normally took all night using s3sync or s3fs was accomplished in a few hours with EC2 using the method described in the HowTo:

    http://www.freewisdom.org/en/all/entries/2008/09/17/backup_with_rsync/

    Some comments:
    1. you need to use Sun’s version of Java. For Ubuntu I did the following:
    apt-get install -y sun-java6-bin unzip
    sudo update-java-alternatives -s java-6-sun

    2. It was necessary for me to provide the full path to the file id_rsa-keypair

    The only problem that I ran into is how to use the ssh commands in scripts and cron. Each time that I run the script, it is necessary to interactively respond to the question:

    RSA key fingerprint is cb:79:eb:b5:40:2d:9a:2b:20:47:53:c8:09:4c:54:57.
    Are you sure you want to continue connecting (yes/no)?

    What is the password:

    The RSA fingerprint and IP address change each time that I run the script because it creates and terminates an EC2 Instance—each of which have their own unique DNS name.

    The only way that I could get the script to work without interactively responding to the ssh prompts is to set up the key pair without a passphrase, which is the way that it is set up in the HowTo. This reduces the security of ssh and makes it easier to mount man-in-the-middle attacks. It was also necessary for me to modify the ssh commands described in the HowTo by adding an additional option to the ssh command:
    ‘ssh -o StrictHostKeyChecking=no …..’
    This further reduces the security of the system, but I can see no other way to run the scripts.

    Another concern that I have is that the ‘known_hosts’ file, which stores the host fingerprints, will become increasingly large with each run of the script.

  26. Matt
    February 5th, 2009 at 14:19

    I want to know how you setup your git backups as well. Please do write up a post.
    Thanks.

  27. addady
    February 28th, 2009 at 12:43

    Every time a local file has been changed, it will upload the whole new file, not just what has changed. That means that if you are using s3sync for regular backups you are using bandwidth unnecessarily.

    Most of the daily change in user data is files that have been updated, not new files.

    You can bypass this limitation using rsync and 3rd party gateway like: http://www.s3rsync.com/

  28. Jhon
    March 19th, 2009 at 08:55

    Hi,
    I followed the instructions and installed everything.

    I can mount the s3 drive, I can see I have 256T available. I have a bucket available.

    When i run the command I get the following:

    [root@localhost /]# rsync -avz --delete /home/mysqldumps /mnt/s3
    building file list ... done
    mysqldumps/
    rsync: recv_generator: failed to stat "/mnt/s3/mysqldumps/backup.txt": Not a directory (20)

    sent 85 bytes received 26 bytes 74.00 bytes/sec
    total size is 124 speedup is 1.12
    rsync error: some files could not be transferred (code 23) at main.c(892) [sender=2.6.8]

    Any help would be appreciated.

  29. April 15th, 2009 at 06:25

    it worked for me as well.. Thanks

  30. April 15th, 2009 at 06:26

    Great work. I will keep following your articles.

  31. Anonymous
    May 3rd, 2009 at 12:07

    Anonymous :Very nice article, thank you for your time and detailed script. If you could help me, I am trying to figure out something and you might already have the answer.
    Ok, this is what I understand from the documentation of rsync/s3fs and s3sync:
    - s3sync uses MD5 checksum to check if a file has changed on your disk. This md5 is provided in the file listing from s3 (i.e. LIST request)
    - rsync compares the actual content of the files (doing md5 on portion of files) to determine what parts of the file has changed and only upload what is really needed. However, s3 doesn’t allow retrieval of blocks but will send the whole file to you. s3fs actually does a cache of the files to limit the bandwidth, but comparing files with rsync will still require download of what is already on s3 to this cache.
    So, now, I wonder if there is no big bandwidth usage difference between using s3fs/rsync instead of s3sync?
    Did you evaluate the difference of bandwidth usage/price between when you had the s3sync backup and now?

    I think this is a very relevant question; if rsync needs to download the files from S3 in order to see if the file was updated, a bandwidth charge is incurred. In a scenario where a large volume of data is being backed up, this is important. In your experience, does this occur?

  32. David Soergel
    June 7th, 2009 at 20:17

    Yes, I was also under the impression that rsync + s3fs incurs a lot of bandwidth overhead, since rsync needs to download the entire original file before doing a compare. That’s why there are commercial services that perform the rsync for you on an EC2 instance (e.g. iirc that’s what JungleDisk does).

    For a different solution that may interest you, check out my S3 backup script: http://dev.davidsoergel.com/trac/s3napback/. It’s very easy to use and handles backup rotation, incremental backups, compression, encryption, and MySQL and Subversion dumps. In my case the incremental-ness is per file, so you get to keep a history of prior versions, not just the latest one. Enjoy!
