Scalable Open Source Storage
OPEN ARCHIVE
Mainframe Tape Libraries
Open System Tape Libraries
Archive Systems
FMA Software
HP OEM FSE & FMA
ARCHIVEMANAGER
OPENARCHIVE
OPENARCHIVE - Homepage
Start of development in 2000
100 man-years of development
Scalable archive solution from 1 TB up to several PB
Common code base for Windows and Linux
HSM approach: data gets migrated to tapes or disks
Filesystem interface (NTFS, POSIX) simplifies integration
Support for all kinds of SCSI backend devices (FC, iSCSI)
API only necessary for special purposes
> 150 customers worldwide
References: Bundesarchiv, Charité, many DMS vendors
(Architecture diagram: HSMnet Server with Management Interface and Library Agent; CIFS and NFS access nodes each connect through the HSMnet I/F and a Partition Manager)
Archiving and recalling of many files in parallel is slow
Cluster file systems are not supported
Performance is not appropriate for HPC environments
Scalability might not be appropriate for cloud environments
Transparent block-level caching for SSD
High-speed iSCSI initiator (MP for performance and HA)
RTSadmin for easy administration
iSCSI numbers measured with the standard Debian client; the RTS iSCSI initiator plus kernel patches gives much higher results.
Reliable
No single point of failure
All data is replicated
Self-healing
Self-managing
Automatically (re)distributes stored data
Object storage
Objects
Alphanumeric name
Data blob (bytes to gigabytes)
Named attributes (foo=bar)
Object pools
Separate flat namespace
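As a rough illustration of this object model, the rados command-line tool can store a blob under a name in a pool and attach named attributes; the pool name 'data', the object name 'myobject' and the payload file below are assumed example values:

$ rados -p data put myobject ./payload.bin     # data blob stored under an alphanumeric name
$ rados -p data setxattr myobject foo bar      # named attribute (foo=bar)
$ rados -p data getxattr myobject foo
bar
$ rados -p data ls                             # flat namespace, separate per pool
myobject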
Distribution: Objects -> PGs -> OSDs (grouped by failure domain)
Fast: O(log n) calculation, no lookups
Reliable: replicas span failure domains
Stable: adding/removing OSDs moves few PGs
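The placement calculation can be inspected directly; a sketch using the ceph tool to map an object name (here the assumed example 'myobject' in pool 'data') to its PG and OSDs, with the output line shown only as an illustration:

$ ceph osd map data myobject
osdmap e42 pool 'data' (0) object 'myobject' -> pg 0.7f3a (0.2) -> up [3,7] acting [3,7]

No central lookup table is consulted; any client can compute the same mapping from the object name and the current OSD map.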
cosd on btrfs
Why btrfs?
Featureful
Copy on write, snapshots, checksumming, multi-device
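The same features can be exercised directly with the btrfs user-space tools; a minimal sketch, with the device names and mount point as assumed example values:

# mkfs.btrfs /dev/sdb /dev/sdc                   # one file system spanning multiple devices
# mount /dev/sdb /data                           # data and metadata checksums are on by default
# btrfs subvolume snapshot /data /data/snap-1    # cheap copy-on-write snapshot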
Thinly provisioned
Consume disk only when image is written to
Simple administration
CLI, librbd
$ rbd create foo --size 20G
$ rbd list
foo
$ rbd snap create --snap=asdf foo
$ rbd resize foo --size=40G
$ rbd snap create --snap=qwer foo
$ rbd snap ls foo
2 asdf 20971520
3 qwer 41943040
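Beyond the rbd tool and librbd, an image can also be used as a regular block device through the in-kernel rbd driver; a sketch reusing the image 'foo' from above (older rbd versions expose mapping via /sys/bus/rbd/add instead of 'rbd map', so exact syntax may vary):

# modprobe rbd
# rbd map foo                  # image appears as a block device, e.g. /dev/rbd0
# rbd showmapped
# mkfs.ext4 /dev/rbd0          # objects are created in RADOS only as blocks are written
# mount /dev/rbd0 /mnt
# umount /mnt
# rbd unmap /dev/rbd0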
Object classes
POSIX filesystem
Create file system hierarchy on top of objects
Cluster of cmds daemons
No local storage: all metadata stored in objects
Lots of RAM: functions as a large, distributed, coherent cache
Dynamic cluster
New daemons can be started up dynamically
Automagically load balanced
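Operationally that amounts to simply starting another metadata server; a sketch, assuming ceph.conf already contains an entry for the new daemon, using 'b' as an example id (command names and syntax from this era of Ceph may differ by version):

# cmds -i b -c /etc/ceph/ceph.conf     # start an additional MDS daemon
# ceph mds stat                        # the new daemon joins; load is rebalanced automatically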
POSIX example
Client MDS Cluster
fd=open("/foo/bar", O_RDONLY)
Client: requests open from MDS
MDS: reads directory /foo from object store
MDS: issues capability for file content
close(fd)
Client: relinquishes capability to MDS
MDS out of I/O path
Object locations are well known: calculated from object name
Object Store
Scalable
Arbitrarily partition metadata, 10s-100s of nodes
Adaptive
Move work from busy to idle servers
Replicate popular metadata on multiple nodes
Workload adaptation
(figure: subtree partitioning across MDS nodes for a "many directories" workload vs. a "same directory" workload)
Metadata scaling
Up to 128 MDS nodes, and 250,000 metadata ops/second
I/O rates of potentially many terabytes/second
File systems containing many petabytes of data
Recursive accounting
Subtree-based usage accounting
Recursive file, directory, byte counts, mtime
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
Fine-grained snapshots
Snapshot arbitrary directory subtrees
Volume or subvolume granularity is cumbersome at petabyte scale
Simple interface
$ mkdir foo/.snap/one             # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776                # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one             # remove snapshot
Efficient storage
Leverages copy-on-write at storage layer (btrfs)
# modprobe ceph
# mount -t ceph 10.3.14.95:/ /mnt/ceph
# df -h /mnt/ceph
Filesystem    Size  Used  Avail  Use%  Mounted on
10.3.14.95:/   95T   29T    66T   31%  /mnt/ceph
Userspace client
cfuse: FUSE-based client
libceph library (ceph_open(), etc.)
Hadoop, Hypertable client modules (libceph)
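For hosts without the Ceph kernel module, the FUSE client provides the same view of the file system; a sketch reusing the monitor address from the mount example (the mount point and port are assumptions, 6789 being the default monitor port):

$ mkdir -p /mnt/ceph
$ cfuse -m 10.3.14.95:6789 /mnt/ceph     # FUSE-based client (later renamed ceph-fuse)
$ df -h /mnt/ceph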
I have a dream...
(Vision diagram: OpenArchive + CEPH combined, covering cluster file system (ClusterFS), SAN, and a tape/disk archive)
Contact
Thomas Uhl
Cell: +49 170 7917711
[email protected]
[email protected]
www.twitter.com/tuhl
de.wikipedia.org/wiki/Thomas_Uhl