Epic-IBM Best Practices 2012 Final

INTRODUCTION  4
    A General Description of the Epic Product  4
    Tiered database architecture utilizing Caché Enterprise Cache Protocol (ECP) technology  6
THE EPIC HARDWARE PLATFORM SIZING PROCESS  12
A DESCRIPTION OF THE INTERSYSTEMS CACHÉ DATABASE ENGINE  13
GENERAL GUIDELINES FOR STORAGE HARDWARE  14
    General Concepts  15
    The Use of RAID  16
    How Data is processed through the Storage System  17
    A Typical Layout of the Epic Production Caché data volumes  18
    FlashCopy  20
    EasyTier  20
CONFIGURATION GUIDELINES FOR THE DS8000 SERIES ENTERPRISE STORAGE SYSTEM  20
SVC AND THE EPIC STORAGE CONFIGURATION  22
CONFIGURATION GUIDELINES FOR THE DS5000 SERIES MID-RANGE STORAGE SYSTEM  28
CONFIGURATION GUIDELINES FOR THE XIV STORAGE SYSTEM  29
CONFIGURATION GUIDELINES FOR THE N-SERIES STORAGE SYSTEM  29
CONFIGURING THE POWER SYSTEMS AIX SERVER  29
    POWER7  29
    Mounting With Concurrent I/O  30
    Creation of Volume Groups, Logical Volumes, and File Systems for use by Caché  30
    Additional System Settings  32
ADDITIONAL RECOMMENDATIONS  33
    What Information Should Be Collected When A Problem Occurs  38
INTRODUCTION
Epic is a Healthcare Information System (HIS) provider which delivers a comprehensive Electronic Medical Recordkeeping System covering all aspects of the medical healthcare profession. The Epic solution includes a variety of applications which cover such areas as medical billing, emergency room, radiology, outpatient, inpatient and ambulatory care. The Epic product relies almost exclusively on a database management system called Caché, produced by InterSystems Corporation. Epic has two main database technologies: the on-line transactional production (OLTP) DB runs Caché as its database engine, and the analytical DB runs MS-SQL or Oracle. The analytical DB has the highest bandwidth requirements, but the Caché OLTP DB is by far the most critical to end-user performance and is consequently where most of the attention needs to be focused. This Best Practices guide is therefore centered on the Caché OLTP DB.

A General Description of the Epic Product
There are two fundamental architecture models which Epic uses:
(1) Single Symmetric Multiprocessing (SMP)
(2) Enterprise Cache Protocol (ECP)
The majority of customers use the SMP architecture. Each architecture has a production database server that is clustered in an active-passive configuration with a failover server. The Epic production database server runs a post-relational database developed by InterSystems Corporation called Caché. The Caché language is a modern implementation of M (formerly MUMPS), a language originally created for healthcare applications.
[Figure: Functional Layers of the Epic Architecture - Epic Applications, Epic Chronicles, OS, Hardware]
Single symmetric multiprocessing (SMP) database server
The single database server architecture provides the greatest ease of administration. The SMP model today scales well up to the 16 to 24 processor range. Beyond this point, the ECP model is required.
Tiered database architecture utilizing Caché Enterprise Cache Protocol (ECP) technology
The tiered architecture retains a central database server with a single data storage repository. Unlike the SMP architecture, most processing needs are offloaded to application servers. The application servers contain no permanent data. This architecture offers increased scaling over the SMP architecture.
Production database server (Epicenter OLTP)
Caché and the Chronicles data repositories (CDR) live here, including clinical, financial and operational data. The UNIX server hardware is clustered (see failover server) and is SAN-attached. The production database is replicated to the disaster recovery (DR) site via the Caché Shadow service or array-based replication.

Failover server
Used only when production has problems; it then takes over the functionality of the production database server. The switch from production to failover happens in minutes. The UNIX server has the same configuration as the production database server and is connected to the same SAN volumes. The cluster software is provided by the OS vendor and determines when production should fail over. Epic scripts are added to the cluster scripts to automatically move the application from the production to the failover hardware.

Application server (app server)
The Caché service runs on these UNIX systems. User processing load is distributed via content switches across the application servers, rather than directly accessing the production database server. All permanent data lives on the database server, but temporary data is created for local activities on the app servers. App servers can be added to or removed from the network for maintenance when necessary.
Scaling performance is accomplished by adding additional app servers. App servers cache block information brought from the database server so network traffic is not incurred for each request for data. App servers also run ECP (Enterprise Cache Protocol), which allows the app server to access the production database server directly over redundant, dedicated GigE networks. If an app server fails, the client (or clients) must reconnect and restart any unsaved activities.

(Reporting) Shadow server
A near-real-time copy of production, or a delayed mirror of what is in production, based on the Caché journaling process. The replicated data is used for off-loading production reporting needs, such as Clarity. Shadow servers can also be used for disaster recovery purposes rather than host-based or array-based replication. The shadow server is SAN-attached.

Clarity server
An OLAP Oracle or SQL RDBMS storing data extracted daily from the Reporting Shadow database server via an Extract, Transform, Load (ETL) process. The Clarity server is SAN-attached.

BusinessObjects
Windows servers that host Crystal and run the reports that connect to the Clarity database. The results of the reports are typically distributed out to a file server.

HA BLOB / file server cluster
Used more by clinicals to store images, scans, voice files and dictation files. (These can be stored on the same cluster, but some customers wish to separate them.) The HA file server cluster is SAN-attached.

Web server
Connects to either the application servers or the production database server. Used for the Web applications: MyChart, EpicCare Link, EpicWeb, etc. Depending on the service functionality, it is linked to either the production app servers or the production database server via TCP/IP.

Print format server (EPS)
Converts RTF (rich-text format) to PCL/PS and controls routing, batching, and archiving of printouts.

Print relay server
Can be run on the same server with the print format server. Used for downtime reporting. Information from here is sent to the DR PC, where users can access downtime reports.

Full client workstation
An x86-based PC that runs the client software (Hyperspace) and communicates with a production application server using TCP/IP. When you set up the client on the workstation, there is an EpicComm configuration where you define the environments (production, training, test and so forth) to which the workstation can connect.
1. If you choose not to use Citrix XenApp to present Epic's Hyperspace client, or if you require third-party devices that aren't fully supported through XenApp, you will need some number of full client workstations. See the Citrix XenApp Farm section for further details on the tradeoffs between full client and thin client implementations of Epic.
2. For each Epic software version, we publish workstation specifications which you can use to determine whether or not your existing workstations are adequate to run Hyperspace.
3. If you require new workstations, we publish Workstation Purchasing Guidelines which are reviewed regularly and are expected to exceed Epic's minimum requirements for the next several years. The current workstation purchasing guidelines document appears as an appendix at the end of this document.
4. The number of workstations required will be determined in the early stages of your Epic implementation. This depends on your facility layout, the number of staff working in a given area, and the workflows performed in that area throughout the day. As a guideline, most organizations choose to have enough workstations so that one is readily available to every user at the busiest time of the day, in the user's preferred work area.
5. Epic Monitor is optional functionality that you may choose to deploy in patient rooms in intensive care settings. Epic Monitor requires Windows Presentation Foundation; consequently, Citrix XenApp is not a viable option for it at this time. Epic Monitor has the following display requirements:
a. 24" touch screen monitor or larger
b. Native resolution of at least 1680 x 1050
c. Resistive touch technology, which allows the use of gloves
d. For usability reasons, a stable wall mount is an absolute necessity
6. We require a round-trip network latency of 40 ms or less between full client workstations and the Epic database server.
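A quick way to sanity-check the 40 ms requirement in item 6 is to sample the round-trip latency from a representative full client workstation to the database server with a standard ping; the host name below is a hypothetical placeholder, not an actual Epic host.

ping -n 20 epicdb.example.com

Review the average round-trip time in the summary; it should remain at or below 40 ms, including during busy periods.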
DR PC: Houses the downtime reports.

CL/EMFI server (community lead / enterprise master file infrastructure)
The community lead manages and maintains collaboratively built data shared across all instances in the
community. The enterprise server is used as a mechanism to move static master files between deployments in a community (a group of affiliated deployments). A common build / vocabulary can be distributed across the organization for ease of maintenance and consistent enterprise reporting. Essentially, EMFI is an internal interface broker. The server is not critical for real-time operations, but is needed to make community configuration changes. The EMFI server is SAN-attached and will use array-based replication to the DR facility.
Initially, the IBM account representative must request a copy of the sizing guide provided to the customer by Epic. This document provides everything that is needed to arrive at a working hardware configuration. The IBM account representative or business partner can communicate with the Epic IBM Alliance team via this email address: [email protected], and can send the copy of the sizing guide to this address.
This is not the case during the flush, or write burst, which is initiated every eighty seconds. While Caché continues to issue 100% read requests, the DB engine also generates a large quantity of write requests in a very short amount of time. Epic has strict read latency guidelines to avoid degrading user performance. Write latencies also become increasingly important for high-end scalability. For large implementations this can lead to a clear conflict: optimal read performance is required while, at the same time, optimal write performance is demanded during intense write bursts. The reality is that no storage system can complete both 100% reads and 100% writes simultaneously.

The performance metric which Epic uses to determine adequate user response time is the time required for a read request to complete. The acceptable threshold is 15 ms or less; that is, the interval between the time a request is generated and the time the I/O request returns control to the user with the requested data available. Caché also keeps a time-sequenced log of database changes, known as the Caché journal files. The Caché journal daemon writes out journal updates in a sequential manner every two seconds, or when a journal buffer becomes full, whichever happens sooner. The amount of journal updates is insignificant compared to the amount of database updates during each write daemon cycle. Therefore, we normally consider the I/O operations to be mostly random reads when the write daemons are not active. Overall, the I/O access pattern of an Epic/Caché system is expected to consist of continuous random read operations punctuated by write bursts at 80-second intervals.

In order to meet Epic's response time requirements, the read service time measured at the application level needs to be 15 ms or less in an SMP configuration and 12 ms or less in an ECP configuration. Without adequate storage resources and diligent configuration of these resources, the read response time will degrade during the write burst period. Read response times which exceed 15 ms will be perceived by the end user as an unacceptable delay in overall performance. Depending on how under-configured or incorrectly tuned a storage system is, response times as slow as 300+ ms have been observed. The information provided in the next sections will outline the steps needed to mitigate slow read response times. This document will not provide information regarding the makeup of the IBM storage system family. However, we will provide references which furnish complete background information about IBM storage systems in general, or provide details about specific hardware which is addressed in this document.
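One way to observe read service times at the AIX level, as a rough proxy for the application-level measurement, is the extended disk statistics reported by iostat; the hdisk names and sampling interval below are illustrative placeholders for the production database LUNs.

iostat -D hdisk10 hdisk11 5 12     # extended statistics every 5 seconds, 12 samples

In the read section of the output, the average service time (avg serv) should stay well below the 15 ms threshold, including during the 80-second write bursts.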
General Concepts
Much of today's storage technology was designed to solve two general problems: (1) safe, redundant and recoverable storage of large amounts of data and (2) rapid retrieval of the stored data. An assumption regarding access to the data is that reading and updating of the information will occur at relatively constant rates. For example, data might be read from the storage system about 70% of the time, and written to the storage system about 30% of the time, on a fixed basis. This ratio can vary widely depending on the end-user application. In the case of Caché, the I/O consists of 100% reads for most of the time. Every 80 seconds, in addition to the requests for data, a large set of write requests to storage is introduced. This burst can consist of several hundred megabytes of 8K data blocks which must be written twice: once to the Caché write image journal (WIJ) and then to the actual random access database. Most storage subsystems are not really optimized for this type of burst I/O behavior. Moreover, it was assumed that the ratio of reads to writes would remain relatively constant across a fixed period of time, and most best practices have assumed these constant-ratio read/write conditions. In the case of Caché, the storage system must first operate in a read-only mode, followed by a simultaneous read and intensive write mode. This cycle is repeated every eighty seconds. If the cache associated with the storage system is not large enough and becomes rapidly filled with data to be written to disk, the storage system's algorithm will direct the storage controller to de-stage the write cache. This operation supersedes the priority of any read requests which are being processed.

The two storage resources which most often limit read performance are (1) the storage cache and (2) the ratio of the number of disk spindles to total user data. The greater the number of spindles that are available during a write operation, the faster the writes to physical disk can be completed. Associated with the number of spindles is the cache size available to the storage system. Data which is destined to be written to physical disk must wait in the write cache until the physical disk resources become available. The most significant limiting factor across all of the storage system components is the physical disk. Reading or writing to the disk is limited by the rotation speed of the platter and the time required to start and stop the read/write head movement. No matter what other factors are considered to maximize throughput, the wait time for an I/O request to be serviced by the disk will ultimately determine overall response time.

Most first-time Epic users will want to fill the existing disk arrays to their maximum available capacity. This is especially true since, under RAID 10, half of the spindles are already used simply to mirror the user data; thus, from the end user's perspective, only half of the spindles are available. Completely filling the useable RAID 10 formatted disk
space translates into more data that must be accessed by the single read/write head on an individual spindle. Newer technology such as Solid State Drives (SSDs), which are not limited by the rotation speed of a drive, can handle the load even with RAID 5. SSDs also allow us to leverage newer technologies such as EasyTier or the advanced caching on the XIV.

Another problem with completely filling the disks is that, at the logical volume storage level, certain JFS metadata information must be retained on the same disk volumes. The metadata includes journal logs which allow the filesystem to recover from a logical volume failure. To protect the JFS metadata, it is advisable not to exceed 97% of the disk capacity for this reason alone. Rapid database growth is the norm and is to be expected in a health care setting. Epic therefore sizes for three years of growth when providing the hardware sizing guide. Unexpected addition of new patients or patient data can rapidly consume large amounts of reserve capacity. For all of these reasons, it is recommended that physical disk utilization not exceed 60-70% over the three years.

The Use of RAID
The striping of logical data across multiple spindles is an obvious way to evenly distribute the load across all available disks. Consider ten simultaneous requests for data. If all of the data were located on only one platter, then nine of the ten requests would remain queued while the first request was being serviced. Each read or write requires the disk to rotate and the read/write head to move to a new position, and all of this is completed sequentially. Now consider the same data spread, or striped, evenly across ten disk platters. The ten read requests can be serviced in parallel, and the only limitation is the latency required to move the data from the storage cache to the requesting server. This time can be measured in units of hundredths of milliseconds. Depending on the type of disk drive, a physical read operation can consume between 3 and 5 milliseconds.

The Caché I/O requests to the production data files are about 99% random in nature. This means that for almost every I/O request, the read/write heads will perform a seek operation. If the data were sequential, the read operation would require little or no seek activity: as one block of data is written, the read/write head will most likely already be positioned to write the next block.

One method of striping is the use of RAID (Redundant Array of Independent Disks). RAID not only provides data striping for faster disk access but provides protection against data loss as well. The two most widely used types of RAID are 5 and 1+0 (or 10). Based on testing done with Epic, we have determined that RAID 10 provides better performance compared with RAID 5. There are documented reasons why RAID 10 is
superior to RAID 5, particularly when multiple random writes of small blocks are required. The following reference provides additional details about RAID types and their respective performance: http://www.redbooks.ibm.com/abstracts/sg247146.html?Open

RAID 10 provides data redundancy by way of mirroring each disk. If one disk fails, a duplicate copy of that disk will provide the same data. When the failed disk is replaced, the system will rebuild the new platter with a copy of the data located on the mirrored drive. Besides striping at the storage level using RAID, striping is also done at the SVC level and at the logical volume level. These striping methods will be covered in later sections.

How Data is processed through the Storage System
There are multiple logical and physical stages within a storage system that data must pass through. These stages include:
(1) The physical disk drives, where data is actually written or read. The drives are arranged in groups of 16 disk units within a physical tray. We are currently recommending the use of 146 GB drives or smaller. However, future disk density and access speed technology may allow for larger capacity drives.
(2) The RAID array; RAID 10 is used in this layout. Each array consists of 8 disks from an array site. The RAID 10 array will consist of either 4+4, or 3+3 plus 2 spares. We especially recommend using 4+4 ranks for production on DS8000 storage.
(3) The strip size used for the RAID 0 portion of RAID 1+0 (RAID 10) is 128 KB.
(4) The stripe size is 1 MB (128 KB strip * 8 disks).
(5) There are N sets of 8+8 disks which make up a RAID 10 rank. The 8+8 array can be split between the Epic production and WIJ volumes into 6+6 and 2+2 respectively. The 6+6 consists of disks from one RAID 10 array and the 2+2 consists of disks from another RAID 10 array. The extent size is typically set at 1 GB. Multiple extents are used to create a LUN.
(6) Depending on the size of the production database, multiple LUNs should be created. The minimum number is two and the maximum recommendation is 32 LUNs.
(7) The storage cache size on the DS8K is a minimum of 32 GB per controller. This value may change depending on the model of the storage system and the total size of the Epic database being used. For the DS8000 series storage systems, 1/32 of the total storage cache is dedicated to write I/Os. Therefore, a sufficient total amount of cache must be available to ensure that the write cache can handle the data coming from the Caché write burst.

These values are the typical recommended configuration for a standard Epic installation. The values, however, may vary based on recommendations made by Epic or depending on the total size requirements of the database. Based on empirical evidence, these values seem to provide the best overall performance.

A Typical Layout of the Epic Production Caché data volumes
Although there are any number of ways to configure the LUNs for use by the Caché DB product, the following configuration seems to provide acceptable results. The Caché production database file systems/logical volumes, prd01-prd08, should be spread across as many ranks as possible. These ranks should be made up of spindles from multiple and diverse RAID 10 arrays. Selection of the arrays should be evenly distributed across both storage system controllers as well as the fibre channel adapters. The WIJ should also be allocated from disks belonging to the same arrays as the disks used for production. This keeps the WIJ volume spread across multiple spindles as much as possible. The WIJ and the production volumes are accessed at separate times, so there will be no simultaneous contention between the WIJ and the production data for the ranks at any time during the write burst process. The Transaction Journal should be created from ranks separate from the database volumes to give extra protection against disk array failure. In the event that the production database arrays experience a catastrophic failure, the journal files will be used to recover any lost transactions. Here is a sample schematic representation of the disk layout for a DS8K system:
[Figure: sample DS8100 disk layout showing RAID 10 database file disks and journal file disks, RAID 5 FlashCopy disks, and hot spare disks distributed across DA pairs 0, 2 and 3]
Below is an example of the commands to set up the volume groups, volumes and file systems for a single Epic instance:
mkvg -f -S -s 16 -y epicvg1 hdisk13 hdisk14 hdisk15 hdisk16
mklv -a e -b n -y prlv11 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv12 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv13 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv14 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv15 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv16 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv17 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y prlv18 -e x -w n -x 35192 -t jfs2 epicvg1 35082
mklv -a e -b n -y wijlv1 -e x -w n -x 800 -t jfs2 epicvg1 795

crfs -v jfs2 -d prlv11 -m /epic/prd11 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv11 /epic/prd11
crfs -v jfs2 -d prlv12 -m /epic/prd12 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv12 /epic/prd12
crfs -v jfs2 -d prlv13 -m /epic/prd13 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv13 /epic/prd13
crfs -v jfs2 -d prlv14 -m /epic/prd14 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv14 /epic/prd14
crfs -v jfs2 -d prlv15 -m /epic/prd15 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv15 /epic/prd15
crfs -v jfs2 -d prlv16 -m /epic/prd16 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv16 /epic/prd16
crfs -v jfs2 -d prlv17 -m /epic/prd17 -A yes -p rw -a logname=INLINE
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv17 /epic/prd17
crfs -v jfs2 -d prlv18 -m /epic/prd18 -A yes -p rw -a logname=INLINE -a options=rw,rbrw,cio
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/prlv18 /epic/prd18

mkdir /epic/prd1
crfs -v jfs2 -d wijlv1 -m /epic/prd1 -A yes -p rw -a logname=INLINE -a options=rw,rbrw,cio
mount -v jfs2 -o rw,rbrw,cio -o log=INLINE /dev/wijlv1 /epic/prd1
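If a file system was created without the cio-related options, the stanza recorded in /etc/filesystems can be adjusted afterwards with chfs rather than re-creating the file system. The following is a hedged illustration using one of the file systems above:

chfs -a options=rw,rbrw,cio /epic/prd11     # updates the stanza in /etc/filesystems
umount /epic/prd11; mount /epic/prd11       # remount so the new options take effect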
And here is an example of the resulting filesystem layout from running the commands above:
Filesystem      1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/prlv11      1293778944  86438760   94%        8     1% /epic/prd11
/dev/prlv12      1293778944  86438640   94%        8     1% /epic/prd12
/dev/prlv13      1293778944  86438512   94%        8     1% /epic/prd13
/dev/prlv14      1293778944  86438568   94%        8     1% /epic/prd14
/dev/prlv15      1293778944  86438568   94%        8     1% /epic/prd15
/dev/prlv16      1293778944  86438680   94%        8     1% /epic/prd16
/dev/prlv17      1293778944  86438936   94%        8     1% /epic/prd17
/dev/prlv18      1293778944  86438776   94%        8     1% /epic/prd18
/dev/wijlv1        26050560  21903352   16%        6     1% /epic/prd1
/epic/prd11  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd12  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd13  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd14  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd15  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd16  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd17  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd18  jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
/epic/prd1   jfs2  Nov 10 15:30  rw,rbw,rbr,cio,log=INLINE
For more information please refer to Epic's File System Layout Recommendations document.

FlashCopy
Except for the XIV, FlashCopy is used for creating point-in-time copies of the production database. The Caché database writes are momentarily suspended while the FlashCopy command completes. We recommend using incremental FlashCopy. The target drives for the FlashCopy can be different from the source drives; for example, 15K vs. 10K RPM or RAID 5 vs. RAID 10 differences are acceptable. However, SATA (or nearline) drives are not recommended.

EasyTier
EasyTier can be used within an Epic production environment, but the customer must continue to use at least the number of 15K RPM spindles recommended by the Epic Hardware Configuration Guide.
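For DS8000-class storage, incremental FlashCopy is typically established once with change recording and persistence enabled, then refreshed before each backup window. The DSCLI sketch below is illustrative only; the volume IDs are hypothetical and the exact parameters should be confirmed against the DSCLI documentation for your code level.

mkflash -record -persist -nocp 1000:1100      # establish the incremental relationship (source:target)
resyncflash -record -persist 1000:1100        # before each nightly backup, copy only the changed tracks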
When configuring an IBM DS8000 series storage unit, it is important to have production data LUNs from multiple ranks that are then assigned to different controllers. This is needed to work around the 25% NVS cache per rank limit. The ranks should be divided as evenly as possible between the available DA (Device Adapter) pairs. With 15,000 RPM disks configured in RAID 10, the DS8000 will present two types of arrays, 4+4 or 3+3 plus 2 spares, depending on the number of disks per Device Adapter. The 4+4 arrays seem to have slightly better performance, so we recommend using only the 4+4 arrays for the Epic production volumes. The 3+3 arrays can be used for shadow as well as other non-production activities.

We recommend using one extent pool per rank for the DS8000 to simplify management. The striping of the LUNs will be done at the AIX (LVM) level. When creating the extent pools, they need to be spread evenly across both servers (internal controllers) of the DS8000. As a general rule, we always recommend using 4+4 arrays if possible on the DS8000 storage for the Epic production instance. If a multiple-array extent pool is required, it is preferable to create the extent pool with as many 4+4 arrays as possible. A minimum of two extent pools is required, and these two extent pools should be associated with the two controllers. Volume groups should be created from one LUN per extent pool in the DS8000, in order to spread every AIX logical volume across every AIX physical volume in the volume group. When the DS8000 is shared with applications other than Epic, the Epic production database arrays should be put on their own Device Adapter and, if possible, should not share the Device Adapter with other applications.

FlashCopy is mandatory for the nightly backup, and it is strongly recommended to use incremental FlashCopy. If you need more than one incremental FlashCopy, to create the daily support database for example, it is possible to do an incremental FlashCopy from the Epic Reporting Shadow database. Please contact [email protected] for more information. The FlashCopy repository does not require the same types of spindles or geometry as the source disks. For example, RAID 5, 10K RPM drives could be used for the FlashCopy repository instead of higher-performance drives.

For optimal performance of the Epic production environments, it is best to have at least 4 fibre channel ports on the DS8000 connected to a minimum of four HBAs on the server per production instance.
An Epic optimization package has been available since mid-November 2011 which significantly improves the performance of the DS8000 by reducing the peak read I/Os. This applies to the following code levels: DS8100/DS8300 R4.3 or higher, DS8700 R5.1 or higher, and DS8800 R6.1 or higher. Please contact your IBM representative for the process to obtain the Epic optimization package.
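As an illustration of the one-extent-pool-per-rank layout described above, the following DSCLI sketch creates two extent pools (one per internal server), adds a 4+4 rank to each, and carves two LUNs from each pool. All IDs, names and capacities are hypothetical, and the exact parameters should be verified against the DSCLI documentation for your DS8000 code level.

mkextpool -rankgrp 0 -stgtype fb epic_p0      # extent pool served by server 0
mkextpool -rankgrp 1 -stgtype fb epic_p1      # extent pool served by server 1
mkrank -array A0 -stgtype fb -extpool P0      # one 4+4 RAID 10 array per extent pool
mkrank -array A1 -stgtype fb -extpool P1
mkfbvol -extpool P0 -cap 400 1000-1001        # two LUNs from pool P0
mkfbvol -extpool P1 -cap 400 1100-1101        # two LUNs from pool P1

The LUNs from both pools are then placed in a single AIX volume group so that LVM striping spreads each logical volume across both controllers.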
3. As mentioned previously, when the SVC is shared with storage systems that are not Epic related, we recommend dedicating one IOgroup (pair of nodes) of the cluster to the Epic production MDisks, and assigning the rest of the load to the remainder of the cluster (the SVC supports up to 4 IOgroups).
4. Because of the Write Cache Partitioning feature (which prevents cache starvation), the SVC will not allocate more than 25% of the write cache per storage pool (mdisk group) if there are more than 5 mdisk groups in the system. To get access to the full write cache of the IOgroup, we recommend creating at least
four storage pools (mdisk groups). We recommend that the production OLTP data be spread over at least 4 mdisk groups to allow access to the full write cache.
5. When the SVC is used in conjunction with DS4000/DS5000 series storage, the DS4000/DS5000 write cache should be entirely disabled. Tests have determined that having both the storage cache and the SVC write cache enabled results in poor response times for both read and write I/O.

Figure I - SVC and Storage Configuration (summary): with the SVC (or a V7000 in gateway mode) in front of DS4000/DS5000 series storage, disable the storage write cache for the production volumes; without the SVC, set the storage write cache flush levels to 5% lower and 5% upper.
6. Please note that the SVC has been tested with the Epic production database on all IBM-supported platforms. The cache should now be on for both the SVC and the DS8000 series.
7. It is important to tune the queue_depth setting for each hdisk according to the SVC performance guide, especially when using relatively large VDisks at the SVC level (see the example below).
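A hedged example of the queue_depth tuning referred to in item 7; the hdisk name is a placeholder for a disk backing an SVC VDisk, and the value of 64 is illustrative (the appropriate value comes from the SVC performance guidelines and the number and size of the VDisks).

lsattr -El hdisk10 -a queue_depth         # display the current setting
chdev -l hdisk10 -P -a queue_depth=64     # stage the new value; it takes effect at the next reboot or device reconfiguration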
3. We recommend setting the FlashCopy grain size to 64 KB when using SSDs for the production OLTP data.
4. Storwize V7000 with 15K and 10K RPM SAS drives
a. Epic recommends 15K RPM drives for production storage and has live production experience with 15K RPM drives. 15K RPM (but not 10K RPM) drives provide the level of performance required for Epic production. However, for non-production use on the Storwize V7000, 10K RPM drives are acceptable.
5. The Storwize V7000 offers an easy configuration tool in the GUI (wizard); the array for the Epic production database should be configured as RAID 10.
6. The spare disks do not have to be created with the array, but need to be added once all the arrays have been created. By this method, you can control the location and the number of spare disks.
7. Just like the SVC, it is recommended to use at least 4 storage pools (mdisk groups), if the number of disks permits, in order to have access to 100% of the write cache. Epic provides a cache requirement for the production database, so if the write cache is sufficiently large, additional storage pools may not be needed.
8. We have noticed that it is easier to manage groups of 4+4 RAID 10 arrays. This is not mandatory.
9. Storwize V7000 with SSDs (Solid State Drives)
a. Although SSDs perform significantly better than spinning drives, testing has shown that the write cycle length can be the limiting factor for SSD performance. As a general rule of thumb, it is possible to replace six spinning drives with a single SSD if capacity permits. Additional SSDs are not expected to significantly change the write service times.
b. RAID 5 is recommended for SSDs rather than RAID 10 due to the cost-performance benefit of SSDs over spinning drives.

V7000 Configuration Screenshots
The IBM Storwize V7000 GUI provides an array configuration wizard with logic that ensures new RAID arrays are created using appropriate candidate disks that will provide the best performance and spare coverage.
The following example shows the 6 mdisks that comprise a 48-disk, RAID 10 SAS configuration:
Finally, here is a view of the 8 logical volumes (LUNs) that have been mapped to our AIX server. These LUNs were added to a common Epic volume group (VG) and divided into 9 filesystems against which our database and WIJ simulations were executed.
Figure 4 - Sample LUN configuration

New RAID arrays on the IBM Storwize V7000 were created using the interactive interface. The Storwize V7000 includes the capability to build optimal arrays through wizard-driven array definition panels. The array configuration panels will select the drives most suited to your storage requirement based on the settings you choose. While there is no need to manually configure the storage to guarantee a balanced RAID array, there is still the option to create arrays from the graphical interface or using the command line interface.
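For administrators who prefer the command line over the wizard, a storage pool and a RAID 10 array can also be created with the Storwize V7000 CLI. The sketch below is illustrative; the pool name, extent size and drive IDs are hypothetical and should be adapted to the actual drive layout.

mkmdiskgrp -name EpicPool1 -ext 256                        # storage pool with 256 MB extents
mkarray -level raid10 -drive 0:1:2:3:4:5:6:7 EpicPool1     # 4+4 RAID 10 array added to the pool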
CONFIGURATION GUIDELINES FOR THE DS5000 SERIES MIDRANGE STORAGE SYSTEM (Note: These configuration guidelines apply to the DS4000 series also)
The DS5000 Series Storage System is sufficiently different from the DS8000 such that some additional consideration must be made in order to obtain the best possible performance from this mid-range system. The following section provides specific details regarding the configuration of the DS5000 Storage System. As was mentioned in section
V., the description of the storage layout in Section IV was intended to be a generic starting point. The write cache flush should be set at 5% maximum and 5% minimum. When using the IBM SVC to manage IBM DS5000 series storage units, optimal results were achieved by disabling the write cache (not the read cache) at the DS5000 unit level, for the LUNs that will hold the database files only, and using the read/write cache at the SVC level for the VDisks that were constructed from those DS5000 LUNs.
uses more than 16 cores and crosses a CEC boundary, will encounter performance issues due to the L3 cache-to-cache communication latency. If using a 795 Power Systems server for the Epic production instance, please refer to IBM's 795 Cross-Book Guidelines whitepaper. The 750 server consists of a single 32-core CEC (or book); the POWER7 L3 cache latency is therefore not an issue, and Epic can scale up to 28 cores.

Mounting With Concurrent I/O
The primary change to a default AIX system is invoking the use of concurrent I/O, or CIO. By default AIX uses the JFS2 filesystem, and CIO bypasses the caching features which are enabled within JFS2. The principal reason for disabling the JFS2 cache is that the Caché DB application is already caching the needed data blocks. Caché determines what data needs to be written to permanent storage and what data should remain in the Caché global buffers. Having the JFS2 cache also make this determination will typically cause unnecessary extra work to be performed by the system. In addition, the JFS2 cache requires real memory which could otherwise be used by the Caché global buffers. CIO is invoked via the -o cio mount option. This option should be used on the database-only file systems, typically /epic/prd01 through /epic/prd08. These filesystems host the CACHE.DAT files, which are exclusively random access in nature. The Caché Write Image Journal should be mounted with the default JFS2 mount options.

Creation of Volume Groups, Logical Volumes, and File Systems for use by Caché
Following are the steps necessary to create and mount the volumes which will host the Epic data. It is assumed that the storage LUNs which correspond to the volumes have already been created, either via the storage system or the SVC if available.

Step 0. Make a top-level root directory for the Epic/Caché volumes.
EXAMPLE:
mkdir /epic
mkdir /epic/prd01

Step 1. Create the Volume Groups
EXAMPLE:
mkvg -S -y epicprvg -s 16 hdisk1 hdisk2 hdisk3 .....
Step 2. Create the Logical Volumes
EXAMPLE:
mklv -a e -b n -e x -t jfs2 -y prdlv01
mklv -a e -b n -e x -t jfs2 -y prdlv02

Step 3: Create the File Systems
EXAMPLE:
crfs -v jfs2 -d prdlv01 -m /epic/prd01 -A yes -a logname=INLINE -a options=cio
(crfs -v jfs2 -d prdlv -m /epic/prd -A yes -a logname=INLINE -a options=rw)

Step 4: Mount the File Systems
EXAMPLE:
mount /epic/prd
mount /epic/prd01

Step 5: Check that the appropriate entries and options have been added to /etc/filesystems (see the sample stanza below).

These steps should be repeated for the eight production volumes, the WIJ and the Journal files. The WIJ should share the same LUNs as the production volumes. The Journal files should utilize a separate set of LUNs under a separate volume group. When the volumes are mounted, the results from the mount command, the df command and the path command should resemble the following:

# df /epic/prd0*
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/prdlv01     78577664   1995268   98%       12     1% /epic/prd01
/dev/prdlv02     78577664   1995268   98%       12     1% /epic/prd02
/dev/prdlv03     78577664   1995260   98%       12     1% /epic/prd03
/dev/prdlv04     78577664   1995264   98%       12     1% /epic/prd04
/dev/prdlv05     78577664   1995292   98%       12     1% /epic/prd05
/dev/prdlv06     78577664   1995280   98%       12     1% /epic/prd06
/dev/prdlv07     78577664   1995376   98%       12     1% /epic/prd07
/dev/prdlv08     78577664   1985984   98%       12     1% /epic/prd08

epicprvg 10G hdisk1 hdisk2 hdisk3 .....
epicprvg 10G hdisk2 hdisk3 hdisk4 ..... hdisk1
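For Step 5, the stanza written to /etc/filesystems for each production file system should resemble the following (shown for /epic/prd01 as an illustration; the stanza is normally created by crfs and adjusted with chfs rather than edited by hand):

/epic/prd01:
        dev       = /dev/prdlv01
        vfs       = jfs2
        log       = INLINE
        mount     = true
        options   = rw,rbrw,cio
        account   = false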
Additional System Settings
Following are the recommended changes to a subset of the AIX system tunables. A brief description of the reason for each change is included.

# vmo
vmo -p -o lru_file_repage=0 -- Determines which type of pages are replaced during a paging operation, based on file repage and computational repage values.
vmo -p -o maxclient%=90 -- Specifies that the number of client pages cannot exceed 90% of real memory.
vmo -p -o maxperm%=90 -- Specifies that the number of file pages should not exceed 90% of real memory.
vmo -p -o vmm_mpsize_support=0 -- Use 4K memory pages only.

# ioo
Required ioo parameters:
ioo -p -o lvm_bufcnt=64
ioo -p -o sync_release_ilock=1 -- Allows inodes to be unlocked after an I/O operation update.
ioo -p -o numfsbufs=4096 -- Sets the number of available file system buffers.
ioo -p -o pv_min_pbuf=4096 -- Specifies the minimum number of physical I/O buffers per physical volume.

These j2_xxx settings improve the performance of JFS2 filesystems (optional):
ioo -p -o j2_dynamicBufferPreallocation=256 -- Specifies the number of 16k slabs to preallocate when the filesystem is running low on bufstructs.
ioo -p -o j2_maxPageReadAhead=2 -- Specifies the maximum number of pages to be read ahead when processing a sequentially accessed file on Enhanced JFS.
ioo -p -o j2_maxRandomWrite=512 -- Specifies a threshold for random writes to accumulate in RAM before subsequent pages are flushed to disk by the Enhanced JFS write-behind algorithm. The random write-behind threshold is on a per-file basis.
ioo -p -o j2_minPageReadAhead=1 -- Specifies the minimum number of pages to be read ahead when processing a sequentially accessed file on Enhanced JFS.
ioo -p -o j2_nBufferPerPagerDevice=2048 -- Specifies the minimum number of file system bufstructs for Enhanced JFS.
ioo -p -o j2_nPagesPerWriteBehindCluster=2 -- Specifies the number of pages per cluster processed by the Enhanced JFS write-behind algorithm.
ioo -p -o j2_nRandomCluster=1 -- Specifies the distance apart (in clusters) that writes have to exceed in order for them to be considered random by the Enhanced JFS random write-behind algorithm.

# additional required parameters
lvmo -v epicprvg -o pv_pbuf_count=4096 -- Increases the number of PV buffers for the production volume group.
chdev -l hdisk5 -P -a queue_depth=64 -- Sets the hdisk queue depth to 64 (default is 20).
chdev -l sys0 -a maxuproc=32767 -- Sets the maximum number of processes per user to 32767.
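After applying the settings above, the effective values can be reviewed to confirm they are in place; the following commands are a hedged example, with placeholder hdisk names for the disks backing the Epic LUNs.

vmo -L lru_file_repage                    # shows the current, default and reboot values
ioo -L j2_maxRandomWrite
lvmo -v epicprvg -a                       # shows pv_pbuf_count for the production volume group
for d in hdisk10 hdisk11 hdisk12; do      # placeholder hdisk names
    lsattr -El $d -a queue_depth
done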
ADDITIONAL RECOMMENDATIONS
1. Boot from SAN
Boot from SAN is not recommended when running the Epic environment. Both Caché and PowerHA depend on the O/S running correctly. If a SAN failure occurs such that the O/S can no longer communicate with the rootvg volume, even for a brief interval, the condition of the O/S is suspect. The system may appear to be operating correctly; however, if any O/S-specific data was lost during transfer between RAM and disk, the O/S is no longer viable. Since all software running on the system depends entirely on the O/S, end-user products or supporting middleware may no longer function correctly. Epic recommends that customers do not boot from SAN so that Epic can log into the system following a failure to troubleshoot. However, PowerHA 7.1 recommends that the customer boot from SAN, partly because of the Live Partition Mobility feature. The decision to boot from SAN should be discussed with your Epic representative.

2. PowerHA (formerly known as HACMP)
There are multiple resources for information regarding the best method of configuring a PowerHA (i.e. HACMP) failover cluster. Epic will provide their customers with PowerHA-callable scripts which contain the necessary instructions to cleanly shut down and start up the Epic and Caché environment. Most IT system administrators view PowerHA as being capable of recovering from any and all events that could occur in an Epic environment. As much as we would like to imagine such a safety mechanism, it doesn't exist.

What PowerHA will do: Recover from any type of real hardware failure. This includes the servers, switches, disk systems and any other type of device which could experience a physical failure due to power loss, electronic component failure or a catastrophic event.

What PowerHA will not do: Recover from user errors, either intentional or accidental. Since PowerHA depends on the operating system, it is assumed that if the operating system started running the Epic environment without a problem, it should continue to support the environment without a problem. There are two conditions where the O/S could fail: (a) a hardware failure, or (b) a change made to the O/S environment by a user. In case (a), PowerHA will recognize the hardware failure and initiate a failover. PowerHA, however, will not cover case (b).

PowerHA requires diligent administration and monitoring. PowerHA cannot be installed and left alone to run by itself; taking this approach will certainly result in the eventual failure of the correct operation of PowerHA. All of the available PowerHA documentation makes two major recommendations:
1. Whenever a change is made to the cluster that is being managed by PowerHA, no matter how trivial it might seem, PowerHA must always be re-tested to ensure that nothing was modified in such a way that PowerHA can no longer function properly.
2. Regardless of whether the system was modified or not, a manual PowerHA failover should be conducted at regular intervals (for example, every three months).
Item 2 provides two benefits: it gives confirmation that a PowerHA failover will work when an unexpected failure occurs, and by executing a planned failover, any problems can quickly be identified and resolved.

PowerHA depends greatly on the environment that it is assigned to manage. Due to its flexibility, there are many ways to mis-configure a PowerHA environment. There is only
one way to be certain that PowerHA has been configured to run successfully: test, test and re-test.

3. PowerHA and SPOF (Single Point of Failure)
In order for PowerHA to work, it must not be limited by Single Points Of Failure, or SPOFs. For example, in order for PowerHA to maintain inter-nodal communication within the HA cluster, there must exist more than a single communication path. This requires the availability of completely redundant switches, cables and adapters from one end to the other. Having 8 communication adapters on each node does no good if the two nodes are connected via a single data path (Ethernet cable). Having multiple redundant zones on a switch won't help if the switch loses power. Therefore, building in redundancy is a must. This means that half of the equipment may sit idle until a failure occurs, which, unfortunately, is a cost of maintaining a High Availability environment.

4. PowerHA and ECVG
Customers who are using Epic are required to provide a fail-over system which will take over in the event of a primary OLTP system failure. This is of obvious necessity in a health care related environment. IBM offers this facility on POWER based systems through the use of PowerHA. Should the active compute system which is running Epic encounter a failure, PowerHA will recognize the loss of the active system. The fail-over process causes the resources (primarily the attached storage system) being used by the primary system to be acquired by the take-over system. The backup system will then attempt to start the same Epic environment. Although the takeover is not instantaneous, it does provide an automated method to recover from a catastrophic hardware failure.

In more recent versions of PowerHA, IBM has introduced the use of Enhanced Concurrent Volume Groups (ECVG). The advantage of ECVG is primarily that the Epic database volumes are already varied on to both PowerHA nodes (active and standby). In the event of a failure, the time required for the take-over node to acquire the Epic volumes is greatly reduced. Therefore IBM has encouraged their PowerHA customers to take advantage of ECVG-mounted volumes associated with a PowerHA cluster. In the unlikely event that PowerHA itself fails, ECVG can potentially cause a split-brain event. When both nodes in the cluster can no longer communicate, or, especially, if the takeover node believes that the primary node has failed, it is possible for both nodes to become active. Therefore, it is possible that the Epic software could start running on the takeover node while the primary node is still in play. Recent versions of PowerHA (versions 6.1 and 7.1) have significantly reduced the possibility of a split-brain event occurring. In PowerHA version 6.1, ECVG can safely be used in the Epic environment.
In PowerHA version 7.1, ECVG is mandatory anyway; therefore ECVG can safely be used in an Epic environment. When logical volumes are mounted concurrently, they can be accessed from more than one compute node simultaneously. Therefore, when a volume group is mounted concurrently, data on the volumes can be updated by both nodes.

5. Micro Partitioning
Micro Partitioning, or SPLPAR, is currently not supported within an Epic production environment. DLPAR, however, is supported. There are several reasons why Epic does not support the use of SPLPAR.
(a) Epic expects no more than a 15 ms response latency from the Caché-based DB server. If both CPU and memory resources were to be shared between Epic and other applications, there is always a possibility that a non-Epic application could take resources away from Epic during a critical time.
(b) When Epic provides the sizing information, the assumption is that the Epic products are the only ones actively running on the system. Therefore, at a minimum, the Epic partition would need to be fully configured with the Epic-required resources. Epic provides discounts to their customers if the customer has followed Epic's recommendation regarding configuration. It is assumed that those resources are available at all times. Thus, in effect, the Epic LPAR would really be regarded as a fixed resource LPAR, or DLPAR. Epic sizes the DB server so that the customer is not running above 70% CPU utilization under normal load. We don't know how quickly a shared partition can obtain resources from another shared partition before those resources can actually begin to provide some relief during a sudden and unplanned increase in resource demand originating from the Epic partition. In any case, spare capacity for the Epic partition would require top priority over all other partitions, thereby, once again, making the Epic partition an effectively independent DLPAR.
(c) Epic provides their customers a guarantee of performance. This is available to the customer on condition that the customer has followed the Epic recommended guidelines. Should a performance related problem occur, Epic will want to be able to reproduce the problem. If performance was degraded due to shared resources being unavailable, it would be more difficult for Epic (or IBM) to identify whether the cause was something that happened within the Epic partition, or whether an external load-driven event was the cause.
(d) At this time, we have not adequately tested the interaction between SPLPAR and PowerHA. As an example, what would happen, or what would we expect to have
happen, were the system to experience a physical CPU failure? What should PowerHA do if Epic happened to be using one tenth or more of the physical CPU at the time? Normally, loss of a resource would trigger a fail-over; however, this CPU is now a virtual resource. Epic, however, has no objection to the use of SPLPAR in a non-production environment, so long as performance is not being evaluated within that environment.
6. VIRTUAL I/O
Virtual I/O (VIO) may be used in the Epic environment. Although Virtual I/O may provide better use of existing hardware resources, the performance impacts must be considered in the production environment. The adapters that are included in a Virtual I/O environment must continuously provide the same level of performance as in a non-VIO environment. NPIV virtualizes a physical fibre channel adapter, thereby allowing the assignment of multiple WWNs (World Wide Names). Again, the total load of the multiple LPARs being supported by a physical adapter must be considered. Epic prefers the use of physical adapters over VIO servers for the production OLTP system. If VIO servers are desired for enterprise virtualization/consolidation practices, the following considerations apply when using VIO with the production OLTP LPAR and its failover LPAR.
a. Please follow IBM's best practices to set up sufficient redundancy at the VIO layer to avoid single points of failure.
b. Please follow IBM's recommendation to properly size the VIO servers for the overall activities on the server frame.
i. When using Oracle on an IBM Power Systems server as the Clarity RDBMS, the Clarity RDBMS Oracle server should be on separate VIO servers from the production OLTP LPAR and its failover LPAR.
ii. You should employ redundant VIO servers. Each VIO server must have sufficient CPU and memory resources to support the full load expected. If they are in a shared processor pool, the VIO servers should have the highest weight within the pool to avoid being starved by activities from other application LPARs.
iii. Each VIO server must have a total of at least 4 ports from at least 2 physical HBAs. The total I/O bandwidth provided by the HBAs must accommodate the total IOPS projection from all LPARs, with sufficient redundancy. The IOPS projections for the main Epic components can be found in the previous I/O projection and requirements section.
iv. The total network bandwidth provided by the Ethernet adapters must accommodate the network traffic expected from all LPARs, with sufficient redundancy. 10 Gbit interfaces are generally more appropriate for large-scale systems. If using 1 Gbit interfaces, multiple interfaces may have to be aggregated to provide adequate bandwidth and acceptable latency. The Ethernet network must provide a sufficient amount of bandwidth for all of the Epic functional
requirements (e.g., Shadow, backup, etc.). You may still find it beneficial to use separate NICs for traffic that may have unbounded bandwidth usage patterns.
a. There are two technologies available to provide I/O access via VIO: virtual SCSI and NPIV. Please discuss with your IBM support team which technology best suits your needs.
i. Be aware that queue_depth needs to be properly tuned at both the VIO server layer and the production LPAR layer when using virtual SCSI.
ii. Epic has conducted performance tests with NPIV and found the results acceptable.
There could be different VIO considerations for SAN boot. If you desire to use SAN boot, please follow IBM's best practices for SAN boot.

7. Live Partition Mobility
Live Partition Mobility (LPM) provides the ability to move an existing running Epic instance from one Power Systems frame to another. During a migration, an impact on performance may be observed depending on the size of the Epic environment being migrated. The database activity may be momentarily suspended, which may result in end-user clients being disconnected temporarily. The alternative for migrating an Epic production instance from one Power Systems frame to another is to initiate a manual PowerHA failover. Using PowerHA would result in at least a 5 to 15 minute outage, versus a brief end-user client disconnect of less than a minute when using Live Partition Mobility. Live Partition Mobility requires VIO servers on both the source and target Power Systems frames. Use of NPIV is strongly recommended to support Live Partition Mobility. An LPM migration must be done only during low-use hours, whenever there is minimal use of the Epic production database.

What Information Should Be Collected When A Problem Occurs
The Epic environment is complex, given that there are many moving parts. A performance issue can be caused by any part of either the server system or the storage system. Because each stage of the computational process depends on all the others, it can often be difficult to identify the true culprit that is causing a problem. For example, although obtaining data from storage appears slow, it may in fact be the case that the server is running out of I/O buffers or disk queues needed to handle the incoming data from the storage system. Therefore each stage of the process must be analyzed and diagnosed. The primary task is to determine whether a stage in the process is waiting for something (starving), or whether the stage is overloaded. The disk I/O throughput may seem reasonable for the given configuration; however, users are noting a substandard response time. Upon further investigation, it is determined
that the Logical Volume Manager (LVM) has run out of resources on the server. This may not be immediately evident, since we don't see large amounts of CPU being consumed. However, a lack of certain JFS buffers could result in a bottleneck.

Following is a partial list of information which should be collected when reporting a problem, either to IBM support or to anyone involved in technical support of Epic.
(1) Has a PMR been filed with IBM? If so, provide the PMR number.
(2) Has Epic Systems been made aware of the problem? Who is the primary Epic contact being dealt with?
(3) Type of System p server, model, number of CPUs, total memory, DLPARs, SPLPARs, etc.
(4) Type of storage: number of spindles, storage configuration (e.g., RAID 5, RAID 10, stripe size, number of ranks, LUNs, etc.).
(5) Is SVC being used?
(6) Is the storage or SVC being shared with other, non-Epic applications?
(7) What was changed prior to experiencing the performance problem? For example, increased users, a change in the storage configuration, additional workloads, etc.
(8) Did the performance degrade suddenly, or was it a slow degradation over time?
(9) Is there a particular hour of day or night when the performance degrades? Is it constant?
(10) Can the customer provide results from the Epic RanRead facility?
(11) Does the performance degradation occur during a FlashCopy or other back-end copy procedure?
Also, if one is available, provide a topology diagram showing the OLTP, Shadow and failover servers, the storage switches, and the associated interconnects to each component which supports the entire Epic environment.
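To accompany the checklist above, a baseline set of AIX performance data is usually requested as well. The commands below are a hedged example of what to gather during the problem window; the durations and output file names are illustrative.

nmon -f -s 30 -c 240            # roughly two hours of system-wide data at 30-second intervals
iostat -D 5 60  > iostat.out    # extended disk statistics, including read and write service times
vmstat 5 60     > vmstat.out    # memory and CPU pressure
errpt -a        > errpt.out     # AIX error log, for hardware or adapter events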