Wednesday, February 13, 2013

Oracle database on BTRFS - reduce costs through deduplication

I recently had a customer who wanted to reduce storage costs.  They had a 13TB PROD database, and needed 40 copies of the database to support DEV and TEST activities, totalling over 500TB.

We did some diagnostics on the databases, and worked out that only about 10% of data blocks in a clone were ever updated, so deduplicating at filesystem or storage block level could result in 90% saving.  The client was a large conservative bank, so the preference was to use existing products from large vendors.  We evaluated products from Delphix, NetApp SnapManager for Oracle, and Oracle ZFS appliance, and ended up choosing the NetApp product due to a combination of functionality, cost, and in-house skills.

Since then I have been wondering if we could have done it cheaper, and have done a quick evaluation of the BTRFS filesystem which does allow deduplication at filesystem level.  

BTRFS is a "copy on write" filesystem and is listed as EXPERIMENTAL, so I don't think it is ready to use in a PROD environment, but is worth testing to support DEV and TEST environments depending on the cost of downtime.

You can find more information here :-  http://btrfs.wiki.kernel.org

The following test was done using Oracle VirtualBox running Oracle Linux 6 2.6.39-300.28.1.el6uek.x86_64.  BTRFS is included in the Linux kernel, but I needed to install package btrfs-progs :-

yum install btrfs-progs

Format the storage device using mkfs.btrfs :-

[root@btrfs ~]# mkfs.btrfs /dev/sdb

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/sdb
        nodesize 4096 leafsize 4096 sectorsize 4096 size 12.00GB
Btrfs Btrfs v0.19

You can see that we formatted a 12GB device with devault 4K block size.  Block size can impact performance, so if the BTRFS filesystem will only contain database files with 8K blocks, I suspect you would see better performance if the filesystem was also formatted to use 8K blocks.

Now mount the filesystem :-

[root@btrfs ~]# mount /dev/sdb /btrfs1

And add an entry to /etc/fstab :-

/dev/sdb          /btrfs1     btrfs   defaults        0 0

Using the "btrfs filesystem show" and "btrfs filesystem df" commands we can see the usage.  Note that the filesystem has reserved 2GB out of 12GB.

[root@btrfs /]# btrfs filesystem show /dev/sdb
Label: none  uuid: f6f7d6c7-14ca-4c01-8c33-3066fc7f4ad3
        Total devices 1 FS bytes used 92.00KB
        devid    1 size 12.00GB used 2.04GB path /dev/sdb

Btrfs Btrfs v0.19

[root@btrfs /]# btrfs filesystem df /btrfs1
Data: total=8.00MB, used=64.00KB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=24.00KB
Metadata: total=8.00MB, used=0.00

We can also see that usage at OS level.

[root@btrfs /]# df -k /btrfs1
Filesystem         1K-blocks      Used Available Use% Mounted on
/dev/sdb           12582912       120  10457024   1% /btrfs1

Create some directories for the Oracle binaries and datafiles :-

[root@btrfs ~]# cd /btrfs1
[root@btrfs u01]# mkdir app
[root@btrfs u01]# mkdir data
[root@btrfs u01]# chown oracle:dba app data

And now we can start using some of the BTRFS functionality. Create subvolumes for the source database binaries and datafiles. BTRFS allows us to take snapshots of the subvolumes at a later time, so it is best to plan subvolume layout before installing files.

Create subvolumes using "btrfs subvolume create" :-

[root@btrfs btrfs1]# su - oracle
[oracle@btrfs app]$ cd /btrfs1/app
[oracle@btrfs app]$ btrfs subvolume create source
Create subvolume './source'
[oracle@btrfs app]$ cd /btrfs1/data
[oracle@btrfs data]$ btrfs subvolume create source
Create subvolume './source'

Install the Oracle binaries and database in the subvolumes.

Oracle binaries in /btrfs1/app/source and create database "source" in /btrfs1/data/source.
This install used about 5.6GB of disk.

[oracle@btrfs source]$ df /btrfs1
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sdb          12582912   5661424   4938480  54% /btrfs1

Database source currently has 1.1GB of datafiles, using total 1.3GB in the filesystem.

SQL> select sum(bytes)/1024/1024/1024 from dba_data_files;

SUM(BYTES)/1024/1024/1024
-------------------------
               1.11816406

[oracle@btrfs source]$ pwd

/btrfs1/data/source

[oracle@btrfs source]$ du -k .

1346188 .

[oracle@btrfs source]$ ls

control01.ctl redo01.log  redo03.log  system01.dbf  undotbs01.dbf

control02.ctl  redo02.log  sysaux01.dbf temp01.dbf users01.dbf


If we want to take a cold backup of the source database before we run some tests, all we need to do is shutdown and create a snapshot.  First check how much space is available in the filesystem and how much is used by the source database.

[oracle@btrfs data]$ pwd
/btrfs1/data
[oracle@btrfs data]$ df -k .
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sdb          12582912   5661820   4938124  54% /btrfs1
[oracle@btrfs data]$ du -sk *
1346188 source

Create the snapshot using "btrfs subvolume snapshot".

[oracle@btrfs data]$ btrfs subvolume snapshot source backup1
Create a snapshot of 'source' in './backup1'

And we can see that the snapshot did not use any additional disk, but we now have a second set of datafiles in directory backup1. 

[oracle@btrfs data]$ df -k .
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sdb           12582912   5661820   4938124  54% /btrfs1
[oracle@btrfs data]$ du -sk *
1346188 source
1346188 backup1

We can now start the database and run our tests, knowing we have a backup available.  Because BTRFS is a "copy on write" filesystem, any block that is changed in the source database will be copied to a new location, and the original left in place as part of the backup1 snapshot.  Since the binaries are also installed in a BTRFS subvolume, we could also take a snapshot of them in case we wanted to test patching or relinking.

If we needed additional test environments we can use the files from a snapshot, recreate the controlfile, and open resetlogs.

To create a new clone database called "target", take a snapshot :-

[oracle@btrfs data]$ btrfs subvolume snapshot source target
Create a snapshot of 'source' in './target'

Then create the parameter file inittarget.ora, audit directories, and script to recreate the controlfile.

[oracle@btrfs data]$ export ORACLE_SID=target
[oracle@btrfs data]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on Thu Feb 14 11:44:17 2013

Copyright (c) 1982, 2009, Oracle.  All rights reserved.
Connected to an idle instance.

SQL> @/home/oracle/recrctl_target.sql
ORACLE instance started.

Total System Global Area 1570009088 bytes
Fixed Size                  2213696 bytes
Variable Size             922749120 bytes
Database Buffers          637534208 bytes
Redo Buffers                7512064 bytes

Control file created.
Database altered.
Tablespace altered.
SQL> select name from v$database;

NAME
---------
TARGET

SQL> select name from v$datafile;

NAME
-----------------------------------------------------------------
/btrfs1/data/target/system01.dbf
/btrfs1/data/target/sysaux01.dbf
/btrfs1/data/target/undotbs01.dbf
/btrfs1/data/target/users01.dbf

At the OS level we can see both databases running, against their own files, with very little additional storage consumed.

[oracle@btrfs data]$ ps -ef | grep pmon
oracle    5581     1  0 11:44 ?        00:00:00 ora_pmon_target
oracle    5708     1  0 11:42 ?        00:00:00 ora_pmon_source

[oracle@btrfs data]$ df /btrfs1
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sdb           12582912   5792420   4808972  55% /btrfs1

[oracle@btrfs data]$ du -sk *
1346188 source
1346188 backup1
1346828 target

The result :-
Using BTRFS snapshots, we now have two running databases, and one cold backup, with deduplication at filesystem block level.  If we had created the backup and clone in a normal filesystem (ext3 etc) we would have used almost 2.7GB of disk, but with deduplication we have only used about 0.1GB.  We now have 4.8GB free on the filesystem, so could create many more databases to keep the developers happy without making the accountants unhappy.   Disk usage will increase as blocks are updated, but unless 100% of the blocks are updated we will still see an advantage from deduplication.

Obviously this was a small example to demonstrate the use of BTRFS snapshots and deduplication, but the same will apply with large databases, with considerable savings.   In the example above we shutdown the source database to take a snap, but if the database is in archivelog mode we could just take a hot backup, or we could run a Dataguard Standby, and defer recovery to take a snapshot.  My tests were only run on a small VM which is not suitable for performance testing, so additional testing on physical hardware would be useful if database performance is important.

The benefits of block level deduplication are obvious, with significant cost savings on storage, but we should also consider the downside.  There is an overhead to deduplication, so database users are likely to see a degrade in performance.  It should also be noted that BTRFS is currently listed as experimental, so a bit more work needs to be done before it is ready for important systems, but it is worth testing to see what is possible.  In addition to the deduplication features offered by BTRFS, the filesystem also supports compression which would further reduce disk usage.