IBM Storage Scale

Big Data and Analytics Guide

IBM

SC27-9284-14
Note: Before using this information and the product it supports, read the information in “Notices” on page 505.

This edition applies to Version 5 release 1 modification 9 of the following products, and to all subsequent releases and
modifications until otherwise indicated in new editions:
• IBM Storage Scale Data Management Edition ordered through Passport Advantage® (product number 5737-F34)
• IBM Storage Scale Data Access Edition ordered through Passport Advantage (product number 5737-I39)
• IBM Storage Scale Erasure Code Edition ordered through Passport Advantage (product number 5737-J34)
• IBM Storage Scale Data Management Edition ordered through AAS (product numbers 5641-DM1, DM3, DM5)
• IBM Storage Scale Data Access Edition ordered through AAS (product numbers 5641-DA1, DA3, DA5)
• IBM Storage Scale Data Management Edition for IBM® ESS (product number 5765-DME)
• IBM Storage Scale Data Access Edition for IBM ESS (product number 5765-DAE)
• IBM Storage Scale Backup ordered through Passport Advantage® (product number 5900-AXJ)
• IBM Storage Scale Backup ordered through AAS (product numbers 5641-BU1, BU3, BU5)
• IBM Storage Scale Backup for IBM® Storage Scale System (product number 5765-BU1)
Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change.
IBM welcomes your comments; see the topic “How to send your comments” on page xxx. When you send information
to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without
incurring any obligation to you.
© Copyright International Business Machines Corporation 2017, 2023.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
Contents

Tables................................................................................................................. vii
About this information.......................................................................................... ix
Prerequisite and related information...................................................................................................... xxix
Conventions used in this information......................................................................................................xxix
How to send your comments....................................................................................................................xxx
Summary of changes......................................................................................... xxxi

Chapter 1. Big data and analytics support.............................................................. 1

Chapter 2. IBM Storage Scale support for Hadoop...................................................3


Overview.......................................................................................................................................................3
Hadoop IBM Storage Scale Architecture............................................................................................... 4
HDFS Transparency overview.............................................................................................................. 10
Planning......................................................................................................................................................11
Hadoop cluster planning...................................................................................................................... 11
Hadoop distribution support................................................................................................................24
HDFS Transparency planning...............................................................................................................25
HDFS Transparency support matrix.....................................................................................................27
HDFS Transparency download.............................................................................................................28
Installing.................................................................................................................................................... 29
Installation prerequisites.....................................................................................................................30
Using installation toolkit...................................................................................................................... 34
Manual installation............................................................................................................................... 42
Uninstalling HDFS Transparency cluster............................................................................................. 46
Upgrading................................................................................................................................................... 47
Installation toolkit upgrade process for HDFS Transparency............................................................. 47
Manual rolling upgrade for HDFS Transparency..................................................................................51
Configuring................................................................................................................................................. 52
Password-less ssh access....................................................................................................................53
OS tuning for all nodes in HDFS Transparency.................................................................................... 55
Configure NTP to synchronize the clock in HDFS Transparency......................................................... 56
Configure Hadoop nodes......................................................................................................................56
Configure HDFS Transparency nodes.................................................................................................. 57
Cluster and file system information configuration.............................................................................. 62
HDFS auditing.......................................................................................................................................63
Administering.............................................................................................................................................64
Managing HDFS Transparency cluster................................................................................................. 64
Monitoring HDFS Transparency status using the mmhealth command............................................ 79
Monitoring HDFS Transparency status using IBM Storage Scale GUI................................................ 80
Recovering an HDFS Transparency cluster..........................................................................................80
Kerberos............................................................................................................................................... 81
TLS...................................................................................................................................................... 129
Apache Ranger................................................................................................................................... 138
Hadoop Storage Tiering with IBM Storage Scale HDFS Transparency............................................. 139
Hadoop distcp support.......................................................................................................................189
Multiple IBM Storage Scale File System support.............................................................................. 191
HDFS encryption................................................................................................................................ 192
Remote mount at fileset level............................................................................................................192
High availability configuration........................................................................................................... 193
Short-circuit read configuration.........................................................................................................200

mmhadoopctl supports dual network............................................................................................... 204
Short circuit write...............................................................................................................................205
Multiple Hadoop clusters over the same file system........................................................................ 207
Automatic Configuration Refresh ......................................................................................................207
Rack locality support for shared storage...........................................................................................208
Accumulo support.............................................................................................................................. 210
Zero shuffle support...........................................................................................................................211
Troubleshooting.......................................................................................................................................212
HDFS Transparency protocol troubleshooting.................................................................................. 212
Limitations and differences from native HDFS..................................................................................234
HDFS Transparency limitations and recommendations................................................................... 250

Chapter 3. IBM Storage Scale Hadoop performance tuning guide........................ 253


Overview.................................................................................................................................................. 253
Introduction....................................................................................................................................... 253
Hadoop over IBM Storage Scale........................................................................................................ 253
Spark over IBM Storage Scale........................................................................................................... 253
Performance overview.............................................................................................................................253
MapReduce.........................................................................................................................................253
Hive.....................................................................................................................................................256
Hadoop performance planning over IBM Storage Scale........................................................................ 256
Storage model.................................................................................................................................... 256
Hardware configuration planning...................................................................................................... 258
Performance guide.................................................................................................................................. 259
How to change the configuration for tuning...................................................................................... 259
System tuning.....................................................................................................................................260
HDFS Transparency Tuning................................................................................................................262
Hadoop/Yarn tuning........................................................................................................................... 269
Performance sizing.............................................................................................................................275
Teragen............................................................................................................................................... 275
TeraSort.............................................................................................................................................. 276
DFSIO................................................................................................................................................. 277
TPC-H and TPC-DS for Hive............................................................................................................... 277
HBase/YCSB....................................................................................................................................... 281
Spark...................................................................................................................................................284
Workloads/Benchmarks information.................................................................................................285

Chapter 4. Cloudera Data Platform (CDP) Private Cloud Base.............................. 287


Overview.................................................................................................................................................. 287
Architecture........................................................................................................................................288
Alternative architectures................................................................................................................... 291
Planning................................................................................................................................................... 292
Hardware and software requirements...............................................................................................292
Preparing the environment................................................................................................................ 296
Installing.................................................................................................................................................. 299
Overview.............................................................................................................................................299
Downloads..........................................................................................................................................300
Shared storage setup......................................................................................................................... 300
CES HDFS........................................................................................................................................... 301
Installing Cloudera Data Platform Private Cloud Base with IBM Storage Scale.............................. 304
Verifying installation.......................................................................................................................... 309
Configuring...............................................................................................................................................310
Enabling NameNode HA.....................................................................................................................310
Dual-network deployment................................................................................................................. 311
Updating configuration.......................................................................................................................311
Administering...........................................................................................................................................314
Adding nodes......................................................................................................................................314

Kerberos............................................................................................................................................. 315
Ranger................................................................................................................................................ 316
Transport Layer Security (TLS).......................................................................................................... 322
Configuring Apache Knox................................................................................................................... 327
HDFS encryption................................................................................................................................ 328
Rolling restart..................................................................................................................................... 330
Multiple IBM Storage Scale file system support............................................................................... 331
IBM Storage Scale service management...........................................................................................332
Verify HDFS Transparency version.................................................................................................... 332
Verify IBM Storage Scale service CSD version.................................................................................. 332
Verifying the CDP upgrade................................................................................................................. 332
Monitoring.......................................................................................................................................... 333
Monitoring................................................................................................................................................ 340
Upgrading.................................................................................................................................................341
Upgrading CDP................................................................................................................................... 341
Upgrading IBM Storage Scale............................................................................................................ 342
Limitations............................................................................................................................................... 343
Problem determination........................................................................................................................... 344

Chapter 5. Cloudera HDP 3.X............................................................................. 349


Planning................................................................................................................................................... 349
Hardware requirements..................................................................................................................... 349
Preparing the environment................................................................................................................ 349
Installation...............................................................................................................................................359
ESS setup............................................................................................................................................359
Adding Services..................................................................................................................................360
Create HDP cluster.............................................................................................................................360
Establish an IBM Spectrum Scale cluster on the Hadoop cluster.................................................... 364
Configure remote mount access........................................................................................................365
Install Mpack package....................................................................................................................... 366
Deploy the IBM Spectrum Scale service........................................................................................... 367
Verifying installation.......................................................................................................................... 372
Upgrading and uninstallation.................................................................................................................. 372
Upgrading HDP overview................................................................................................................... 373
Mpack package directories for HDP 3.x and Mpack stack................................................................ 375
Upgrading HDP 3.1.x HA and Mpack stack........................................................................................375
Post update process for HDP 3.x and Mpack stack...........................................................................381
Upgrading HDP 3.1.x non-HA............................................................................................................ 381
Upgrading HDFS Transparency..........................................................................................................385
Upgrading IBM Spectrum Scale file system...................................................................................... 387
HDP 2.6.4 to HDP 3.1.0.0.................................................................................................................. 388
HDP to CDP migration........................................................................................................................ 393
Uninstalling IBM Spectrum Scale Mpack and service.......................................................................393
Configuration........................................................................................................................................... 394
Setting up High Availability [HA]........................................................................................................ 394
IBM Spectrum Scale configuration parameter checklist.................................................................. 395
Dual-network deployment................................................................................................................. 396
Manually starting services in Ambari.................................................................................................397
Setting up local repository................................................................................................................. 398
Configuring LogSearch....................................................................................................................... 402
Hadoop Kafka/Zookeeper and IBM Spectrum Scale Kafka/Zookeeper........................................... 403
Create Hadoop local directories in IBM Spectrum Scale..................................................................403
Deploy HDP or IBM Spectrum Scale service on pre-existing IBM Spectrum Scale file system...... 404
Deploy FPO......................................................................................................................................... 406
Hadoop Storage Tiering..................................................................................................................... 407
Limited Hadoop nodes as IBM Spectrum Scale nodes..................................................................... 407
Configuring multiple file system mount point access....................................................................... 408

Support for Big SQL............................................................................................................................ 412
Administration......................................................................................................................................... 414
IBM Spectrum Scale-FPO deployment............................................................................................. 414
Ranger................................................................................................................................................ 417
Kerberos............................................................................................................................................. 423
Short-circuit read (SSR)..................................................................................................................... 427
Disabling short circuit write............................................................................................................... 428
IBM Spectrum Scale service management IBM Spectrum Scale.....................................................428
Ambari node management................................................................................................................ 435
Ambari maintenance mode support for IBM Spectrum Scale service............................................. 448
Restricting root access.......................................................................................................................450
IBM Spectrum Scale management GUI............................................................................................ 454
IBM Spectrum Scale versus Native HDFS......................................................................................... 455
Limitations............................................................................................................................................... 457
Limitations and information...............................................................................................................458
Problem determination........................................................................................................................... 461
Snap data collection...........................................................................................................................461
General............................................................................................................................................... 462
Troubleshooting Ambari.....................................................................................................................472

Chapter 6. Apache Hadoop.................................................................................489


Apache Hadoop 3.0.x Support................................................................................................................ 489
Enabling Kerberos with Apache Hadoop and CES HDFS........................................................................491
Setting up the Kerberos server.......................................................................................................... 491
Setting up Kerberos for HDFS Transparency nodes..........................................................................492
Configuring YARN and MapReduce....................................................................................................499
HDFS clients configuration...................................................................................................................... 500
MapReduce/YARN clients configuration................................................................................................. 501
Add HDFS client to CES HDFS nodes ..................................................................................................... 502

Accessibility features for IBM Storage Scale....................................................... 503


Accessibility features.............................................................................................................................. 503
Keyboard navigation................................................................................................................................ 503
IBM and accessibility...............................................................................................................................503

Notices..............................................................................................................505
Trademarks.............................................................................................................................................. 506
Terms and conditions for product documentation................................................................................. 506

Glossary............................................................................................................ 509

Index................................................................................................................ 517

Tables

1. IBM Storage Scale library information units................................................................................................. x

2. Conventions...............................................................................................................................................xxix

3. IBM Storage Scale License requirement.................................................................................................... 12

4. For HDFS Transparency 3.1.1-8 or earlier, and 3.3.0-0 and later............................................................. 16

5. For HDFS Transparency 3.1.1-9 and later, and 3.2.2-0 and later............................................................. 16

6. Recommended port number settings for HDFS Transparency.................................................................. 18

7. CDP Private Cloud Base support.................................................................................................................24

8. HDP support................................................................................................................................................ 25

9. HDFS Transparency support matrix............................................................................................................27

10. Open-source Apache Hadoop support matrix......................................................................................... 27

11. Configurations for data replication........................................................................................................... 59

12. Generating internal configuration files..................................................................................................... 62

13. initmap.sh script command syntax...........................................................................................................62

14. Internal configuration files and location information.............................................................................. 63

15. Configuration parameters for gpfs-site.xml........................................................................................... 244

16. Sharing nothing -vs- Shared storage...................................................................................................... 258

17. Hardware configuration for FPO model..................................................................................................258

18. Hardware configuration for IBM Storage Scale client node in shared storage model..........................259

19. System Memory Allocation.....................................................................................................................261

20. How to change the memory size............................................................................................................ 261

21. Tuning configurations for Transparency over IBM Storage Scale FPO..................................................262

22. Tuning configurations for Transparency over IBM ESS or shared storage ........................................... 264

23. Configurations in hdfs-site.xml...............................................................................................................266

24. Tuning MapReduce2............................................................................................................................... 269

25. Configurations for tuning Yarn................................................................................................................ 270

26. Hive’s Tuning........................................................................................................................................... 278

27. HBase Configuration Tuning................................................................................................................... 281

28. IBM Storage Scale Tuning.......................................................................................................................282

29. YCSB Configuration Tuning..................................................................................................................... 282

30. Support matrix........................................................................................................................................ 294

31. HDFS Transparency and CSD version for specific IBM Storage Scale version......................................294

32. Upgrade support..................................................................................................................................... 295

33. Example showing hdfs-site.xml parameters management................................................................... 312

34. Hadoop distribution support matrix....................................................................................................... 350

35. Packages required for upgrading to HDP 3.1......................................................................................... 388

36. IBM Spectrum Scale partitioning function matrix................................................................................. 416

37. NATIVE HDFS AND IBM SPECTRUM SCALE DIFFERENCES.................................................................. 456

About this information
This edition applies to IBM Storage Scale version 5.1.9 for AIX®, Linux®, and Windows.
IBM Storage Scale is a file management infrastructure, based on IBM General Parallel File System (GPFS)
technology, which provides unmatched performance and reliability with scalable access to critical file
data.
To find out which version of IBM Storage Scale is running on a particular AIX node, enter:

lslpp -l gpfs\*

To find out which version of IBM Storage Scale is running on a particular Linux node, enter:

rpm -qa | grep gpfs (for SLES and Red Hat Enterprise Linux)

dpkg -l | grep gpfs (for Ubuntu Linux)

To find out which version of IBM Storage Scale is running on a particular Windows node, open Programs
and Features in the control panel. The IBM Storage Scale installed program name includes the version
number.
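If you administer a mixed environment, the checks above can be combined into a single script. The following is a minimal illustrative sketch only, not part of the product; it assumes the standard lslpp, rpm, and dpkg tools are on the PATH of the node where it runs:

#!/bin/sh
# Print the installed IBM Storage Scale (gpfs) package version on AIX or Linux.
case "$(uname -s)" in
AIX)
    lslpp -l gpfs\*                              # AIX filesets
    ;;
Linux)
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa | grep gpfs                      # SLES and Red Hat Enterprise Linux
    elif command -v dpkg >/dev/null 2>&1; then
        dpkg -l | grep gpfs                      # Ubuntu Linux
    fi
    ;;
*)
    echo "Unsupported platform; on Windows, check Programs and Features."
    ;;
esac

On a Windows node, continue to use Programs and Features in the control panel as described above.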

Which IBM Storage Scale information unit provides the information you need?
The IBM Storage Scale library consists of the information units listed in Table 1 on page x.
To use these information units effectively, you must be familiar with IBM Storage Scale and the AIX,
Linux, or Windows operating system, or all of them, depending on which operating systems are in use at
your installation. Where necessary, these information units provide some background information relating
to AIX, Linux, or Windows. However, more commonly they refer to the appropriate operating system
documentation.
Note: Throughout this documentation, the term "Linux" refers to all supported distributions of Linux,
unless otherwise specified.



Table 1. IBM Storage Scale library information units

Information unit: IBM Storage Scale: Concepts, Planning, and Installation Guide
Intended users: System administrators, analysts, installers, planners, and programmers of IBM Storage Scale clusters who are very experienced with the operating systems on which each IBM Storage Scale cluster is based.
Type of information: This guide provides the following information:
Product overview
• Overview of IBM Storage Scale
• GPFS architecture
• Protocols support overview:
Integration of protocol access
methods with GPFS
• Active File Management
• AFM-based Asynchronous
Disaster Recovery (AFM DR)
• Introduction to AFM to cloud
object storage
• Introduction to system health and
troubleshooting
• Introduction to performance
monitoring
• Data protection and disaster
recovery in IBM Storage Scale
• Introduction to IBM Storage Scale
GUI
• IBM Storage Scale management
API
• Introduction to Cloud services
• Introduction to file audit logging
• Introduction to clustered watch
folder
• Understanding call home
• IBM Storage Scale in an
OpenStack cloud deployment
• IBM Storage Scale product
editions
• IBM Storage Scale license
designation
• Capacity-based licensing
• Dynamic pagepool

Planning
• Planning for GPFS
• Planning for protocols
• Planning for cloud services
• Planning for IBM Storage Scale on
Public Clouds
• Planning for AFM
• Planning for AFM DR
• Planning for AFM to cloud object
storage
• Planning for performance
monitoring tool
• Planning for UEFI secure boot

• Firewall recommendations
• Considerations for GPFS applications
• Security-Enhanced Linux support
• Space requirements for call home data upload

Installing
• Steps for establishing and starting your IBM Storage Scale cluster
• Installing IBM Storage Scale on Linux nodes and deploying protocols
• Installing IBM Storage Scale on
public cloud by using cloudkit
• Installing IBM Storage Scale on
AIX nodes
• Installing IBM Storage Scale on
Windows nodes
• Installing Cloud services on IBM
Storage Scale nodes
• Installing and configuring IBM
Storage Scale management API
• Installing GPUDirect Storage for
IBM Storage Scale
• Installation of Active File
Management (AFM)
• Installing AFM Disaster Recovery
• Installing call home
• Installing file audit logging
• Installing clustered watch folder
• Installing the signed kernel
modules for UEFI secure boot
• Steps to permanently uninstall
IBM Storage Scale
Upgrading
• IBM Storage Scale supported
upgrade paths
• Online upgrade support for
protocols and performance
monitoring
• Upgrading IBM Storage Scale
nodes

• Upgrading IBM Storage Scale non-protocol Linux nodes
• Upgrading IBM Storage Scale protocol nodes
• Upgrading IBM Storage Scale on cloud
• Upgrading GPUDirect Storage
• Upgrading AFM and AFM DR
• Upgrading object packages
• Upgrading SMB packages
• Upgrading NFS packages
• Upgrading call home
• Upgrading the performance
monitoring tool
• Upgrading signed kernel modules
for UEFI secure boot
• Manually upgrading pmswift
• Manually upgrading the IBM
Storage Scale management GUI
• Upgrading Cloud services
• Upgrading to IBM Cloud Object
Storage software level 3.7.2 and
above
• Upgrade paths and commands for
file audit logging and clustered
watch folder
• Upgrading IBM Storage Scale
components with the installation
toolkit
• Protocol authentication
configuration changes during
upgrade
• Changing the IBM Storage Scale
product edition
• Completing the upgrade to a new
level of IBM Storage Scale
• Reverting to the previous level of
IBM Storage Scale

• Coexistence considerations
• Compatibility considerations
• Considerations for IBM Storage
Protect for Space Management
• Applying maintenance to your
IBM Storage Scale system
• Guidance for upgrading the
operating system on IBM Storage
Scale nodes
• Considerations for upgrading
from an operating system not
supported in IBM Storage Scale
5.1.x.x
• Servicing IBM Storage Scale
protocol nodes
• Offline upgrade with complete
cluster shutdown

Information unit: IBM Storage Scale: Administration Guide
Intended users: System administrators or programmers of IBM Storage Scale systems.
Type of information: This guide provides the following information:
Configuring
• Configuring the GPFS cluster
• Configuring GPUDirect Storage for
IBM Storage Scale
• Configuring the CES and protocol
configuration
• Configuring and tuning your
system for GPFS
• Parameters for performance
tuning and optimization
• Ensuring high availability of the
GUI service
• Configuring and tuning your
system for Cloud services
• Configuring IBM Power Systems
for IBM Storage Scale
• Configuring file audit logging
• Configuring clustered watch
folder
• Configuring the cloudkit
• Configuring Active File
Management
• Configuring AFM-based DR
• Configuring AFM to cloud object
storage
• Tuning for Kernel NFS backend on
AFM and AFM DR
• Configuring call home
• Integrating IBM Storage Scale
Cinder driver with Red Hat
OpenStack Platform 16.1
• Configuring Multi-Rail over TCP
(MROT)
• Dynamic pagepool configuration

Administering
• Performing GPFS administration tasks
• Performing parallel copy with mmxcp command
• Protecting file data: IBM Storage
Scale safeguarded copy
• Verifying network operation with
the mmnetverify command
• Managing file systems
• File system format changes
between versions of IBM Storage
Scale
• Managing disks

• Managing protocol services
• Managing protocol user authentication
• Managing protocol data exports
• Managing object storage
• Managing GPFS quotas
• Managing GUI users
• Managing GPFS access control
lists
• Native NFS and GPFS
• Accessing a remote GPFS file
system
• Information lifecycle
management for IBM Storage
Scale
• Creating and maintaining
snapshots of file systems
• Creating and managing file clones
• Scale Out Backup and Restore
(SOBAR)
• Data Mirroring and Replication
• Implementing a clustered NFS
environment on Linux
• Implementing Cluster Export
Services
• Identity management on
Windows / RFC 2307 Attributes
• Protocols cluster disaster
recovery
• File Placement Optimizer
• Encryption
• Managing certificates to secure
communications between GUI
web server and web browsers
• Securing protocol data
• Cloud services: Transparent cloud
tiering and Cloud data sharing
• Managing file audit logging
• RDMA tuning
• Configuring Mellanox Memory
Translation Table (MTT) for GPFS
RDMA VERBS Operation
• Administering cloudkit
• Administering AFM
• Administering AFM DR

• Administering AFM to cloud object storage
• Highly available write cache
(HAWC)
• Local read-only cache
• Miscellaneous advanced
administration topics
• GUI limitations

Information unit: IBM Storage Scale: Problem Determination Guide
Intended users: System administrators of GPFS systems who are experienced with the subsystems used to manage disks and who are familiar with the concepts presented in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
Type of information: This guide provides the following information:
Monitoring
• Monitoring system health by using IBM Storage Scale GUI
• Monitoring system health by using the mmhealth command
• Dynamic pagepool monitoring
• Performance monitoring
• Monitoring GPUDirect storage
• Monitoring events through
callbacks
• Monitoring capacity through GUI
• Monitoring AFM and AFM DR
• Monitoring AFM to cloud object
storage
• GPFS SNMP support
• Monitoring the IBM Storage Scale
system by using call home
• Monitoring remote cluster through
GUI
• Monitoring file audit logging
• Monitoring clustered watch folder
• Monitoring local read-only cache
Troubleshooting
• Best practices for troubleshooting
• Understanding the system
limitations
• Collecting details of the issues
• Managing deadlocks
• Installation and configuration
issues
• Upgrade issues
• CCR issues
• Network issues
• File system issues
• Disk issues
• GPUDirect Storage
troubleshooting
• Security issues
• Protocol issues
• Disaster recovery issues
• Performance issues

• GUI and monitoring issues
• AFM issues
• AFM DR issues
• AFM to cloud object storage
issues
• Transparent cloud tiering issues
• File audit logging issues
• Cloudkit issues
• Troubleshooting mmwatch
• Maintenance procedures
• Recovery procedures
• Support for troubleshooting
• References

Information unit: IBM Storage Scale: Command and Programming Reference Guide
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
Type of information: This guide provides the following information:
Command reference
• cloudkit command
• gpfs.snap command
• mmaddcallback command
• mmadddisk command
• mmaddnode command
• mmadquery command
• mmafmconfig command
• mmafmcosaccess command
• mmafmcosconfig command
• mmafmcosctl command
• mmafmcoskeys command
• mmafmctl command
• mmafmlocal command
• mmapplypolicy command
• mmaudit command
• mmauth command
• mmbackup command
• mmbackupconfig command
• mmbuildgpl command
• mmcachectl command
• mmcallhome command
• mmces command
• mmchattr command
• mmchcluster command
• mmchconfig command
• mmchdisk command
• mmcheckquota command
• mmchfileset command
• mmchfs command
• mmchlicense command
• mmchmgr command
• mmchnode command
• mmchnodeclass command
• mmchnsd command
• mmchpolicy command
• mmchpool command
• mmchqos command
• mmclidecode command

• mmclone command
• mmcloudgateway command
• mmcrcluster command
• mmcrfileset command
• mmcrfs command
• mmcrnodeclass command

• mmcrnsd command
• mmcrsnapshot command
• mmdefedquota command
• mmdefquotaoff command
• mmdefquotaon command
• mmdefragfs command
• mmdelacl command
• mmdelcallback command
• mmdeldisk command
• mmdelfileset command
• mmdelfs command
• mmdelnode command
• mmdelnodeclass command
• mmdelnsd command
• mmdelsnapshot command
• mmdf command
• mmdiag command
• mmdsh command
• mmeditacl command
• mmedquota command
• mmexportfs command
• mmfsck command
• mmfsckx command
• mmfsctl command
• mmgetacl command
• mmgetstate command
• mmhadoopctl command
• mmhdfs command
• mmhealth command
• mmimgbackup command
• mmimgrestore command
• mmimportfs command
• mmkeyserv command

• mmlinkfileset command
• mmlsattr command
• mmlscallback command
• mmlscluster command
• mmlsconfig command
• mmlsdisk command

• mmlsfileset command
• mmlsfs command
• mmlslicense command
• mmlsmgr command
• mmlsmount command
• mmlsnodeclass command
• mmlsnsd command
• mmlspolicy command
• mmlspool command
• mmlsqos command
• mmlsquota command
• mmlssnapshot command
• mmmigratefs command
• mmmount command
• mmnetverify command
• mmnfs command
• mmnsddiscover command
• mmobj command
• mmperfmon command
• mmpmon command
• mmprotocoltrace command
• mmpsnap command
• mmputacl command
• mmqos command
• mmquotaoff command
• mmquotaon command
• mmreclaimspace command
• mmremotecluster command
• mmremotefs command
• mmrepquota command
• mmrestoreconfig command
• mmrestorefs command
• mmrestrictedctl command
• mmrestripefile command

• mmrestripefs command
• mmrpldisk command
• mmsdrrestore command
• mmsetquota command
• mmshutdown command
• mmsmb command

• mmsnapdir command
• mmstartup command
• mmstartpolicy command
• mmtracectl command
• mmumount command
• mmunlinkfileset command
• mmuserauth command
• mmwatch command
• mmwinservctl command
• mmxcp command
• spectrumscale command
Programming reference
• IBM Storage Scale Data
Management API for GPFS
information
• GPFS programming interfaces
• GPFS user exits
• IBM Storage Scale management
API endpoints
• Considerations for GPFS
applications

Information unit: IBM Storage Scale: Big Data and Analytics Guide
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
Type of information: This guide provides the following information:
Summary of changes
Big data and analytics support
Hadoop Scale Storage Architecture
• Elastic Storage Server
• Erasure Code Edition
• Share Storage (SAN-based
storage)
• File Placement Optimizer (FPO)
• Deployment model
• Additional supported storage
features
IBM Spectrum® Scale support for
Hadoop
• HDFS transparency overview
• Supported IBM Storage Scale
storage modes
• Hadoop cluster planning
• CES HDFS
• Non-CES HDFS
• Security
• Advanced features
• Hadoop distribution support
• Limitations and differences from
native HDFS
• Problem determination
IBM Storage Scale Hadoop
performance tuning guide
• Overview
• Performance overview
• Hadoop Performance Planning
over IBM Storage Scale
• Performance guide

Cloudera Data Platform (CDP) Private Cloud Base
• Overview
• Planning
• Installing
• Configuring
• Administering
• Monitoring
• Upgrading
• Limitations
• Problem determination

Cloudera HDP 3.X
• Planning
• Installation
• Upgrading and uninstallation
• Configuration
• Administration
• Limitations
• Problem determination
Open Source Apache Hadoop
• Open Source Apache Hadoop
without CES HDFS
• Open Source Apache Hadoop with
CES HDFS

Information unit: IBM Storage Scale Erasure Code Edition Guide
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
Type of information: IBM Storage Scale Erasure Code Edition
• Summary of changes
• Introduction to IBM Storage Scale Erasure Code Edition
• Planning for IBM Storage Scale Erasure Code Edition
• Installing IBM Storage Scale
Erasure Code Edition
• Uninstalling IBM Storage Scale
Erasure Code Edition
• Creating an IBM Storage Scale
Erasure Code Edition storage
environment
• Using IBM Storage Scale Erasure
Code Edition for data mirroring
and replication
• Deploying IBM Storage Scale
Erasure Code Edition on VMware
infrastructure
• Upgrading IBM Storage Scale
Erasure Code Edition
• Incorporating IBM Storage Scale
Erasure Code Edition in an Elastic
Storage Server (ESS) cluster
• Incorporating IBM Elastic Storage
Server (ESS) building block in an
IBM Storage Scale Erasure Code
Edition cluster
• Administering IBM Storage Scale
Erasure Code Edition
• Troubleshooting
• IBM Storage Scale RAID
Administration

Information unit: IBM Storage Scale Container Native Storage Access
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
Type of information: This guide provides the following information:
• Overview
• Planning
• Installation prerequisites
• Installing the IBM Storage Scale container native operator and cluster
• Upgrading
• Configuring IBM Storage Scale
Container Storage Interface (CSI)
driver
• Using IBM Storage Scale GUI
• Maintenance of a deployed cluster
• Cleaning up the container native
cluster
• Monitoring
• Troubleshooting
• References

Information unit: IBM Storage Scale Data Access Service
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
Type of information: This guide provides the following information:
• Overview
• Architecture
• Security
• Planning
• Installing and configuring
• Upgrading
• Administering
• Monitoring
• Collecting data for support
• Troubleshooting
• The mmdas command
• REST APIs

Information unit: IBM Storage Scale Container Storage Interface Driver Guide
Type of information: This guide provides the following information:
• Summary of changes
• Introduction
• Planning
• Installation
• Upgrading
• Configurations
• Using IBM Storage Scale Container Storage Interface Driver
• Managing IBM Storage Scale when used with IBM Storage Scale Container Storage Interface driver
• Cleanup
• Limitations
• Troubleshooting
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard

Prerequisite and related information


For updates to this information, see IBM Storage Scale in IBM Documentation.
For the latest support information, see the IBM Storage Scale FAQ in IBM Documentation.

Conventions used in this information


Table 2 on page xxix describes the typographic conventions used in this information. UNIX file name
conventions are used throughout this information.
Note: Users of IBM Storage Scale for Windows must be aware that on Windows, UNIX-style
file names need to be converted appropriately. For example, the GPFS cluster configuration data
is stored in the /var/mmfs/gen/mmsdrfs file. On Windows, the UNIX namespace starts under
the %SystemDrive%\cygwin64 directory, so the GPFS cluster configuration data is stored in the
C:\cygwin64\var\mmfs\gen\mmsdrfs file.

Table 2. Conventions
Convention Usage
bold Bold words or characters represent system elements that you must use literally,
such as commands, flags, values, and selected menu options.
Depending on the context, bold typeface sometimes represents path names,
directories, or file names.

bold underlined      bold underlined keywords are defaults. These take effect if you do not specify a different keyword.

constant width Examples and information that the system displays appear in constant-width
typeface.
Depending on the context, constant-width typeface sometimes represents path
names, directories, or file names.

italic Italic words or characters represent variable values that you must supply.
Italics are also used for information unit titles, for the first use of a glossary term,
and for general emphasis in text.

<key> Angle brackets (less-than and greater-than) enclose the name of a key on the
keyboard. For example, <Enter> refers to the key on your terminal or workstation
that is labeled with the word Enter.
\ In command examples, a backslash indicates that the command or coding example
continues on the next line. For example:

mkcondition -r IBM.FileSystem -e "PercentTotUsed > 90" \
-E "PercentTotUsed < 85" -m p "FileSystem space used"

{item} Braces enclose a list from which you must choose an item in format and syntax
descriptions.
[item] Brackets enclose optional items in format and syntax descriptions.
<Ctrl-x> The notation <Ctrl-x> indicates a control character sequence. For example,
<Ctrl-c> means that you hold down the control key while pressing <c>.
item... Ellipses indicate that you can repeat the preceding item one or more times.
| In synopsis statements, vertical lines separate a list of choices. In other words, a
vertical line means Or.
In the left margin of the document, vertical lines indicate technical changes to the
information.

Note: CLI options that accept a list of option values are delimited with a comma and no space between
values. As an example, to display the state on three nodes, use mmgetstate -N NodeA,NodeB,NodeC.
Exceptions to this syntax are listed specifically within the command.
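For example, the rule above looks like this on the command line (taken directly from the note):

# Comma-separated node list, no spaces between values
mmgetstate -N NodeA,NodeB,NodeC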

How to send your comments


Your feedback is important in helping us to produce accurate, high-quality information. If you have any
comments about this information or any other IBM Storage Scale documentation, send your comments to
the following e-mail address:
[email protected]
Include the publication title and order number, and, if applicable, the specific location of the information
about which you have comments (for example, a page number or a table number).
To contact the IBM Storage Scale development organization, send your comments to the following e-mail
address:
[email protected]

Summary of changes
This topic summarizes changes to IBM Storage Scale Big Data and Analytics (BDA) support section.
For information about IBM Storage Scale changes, see the IBM Storage Scale Summary of changes.
For information about BDA feature support, see the List of stabilized, deprecated, and discontinued
features section under the Summary of changes.
For information about the resolved IBM Storage Scale APARs, see IBM Storage Scale APARs Resolved.
For information about supported HDFS Transparency versions with IBM Storage Scale, see “HDFS
Transparency support matrix” on page 27.
For information about supported Cloudera Data Platform (CDP) versions with IBM Storage Scale, see
“Support Matrix” on page 294.

Summary of changes as updated, February 2024


Changes in IBM Storage Scale 5.1.9-2
• Includes HDFS Transparency 3.1.1-17 and HDFS Transparency 3.2.2-7.
Changes in HDFS Transparency 3.1.1-17 in IBM Storage Scale 5.1.9-2
• Updated several JavaScript files related to the NameNode and DataNode GUI.
• Fixed multiple issues that occurred while stopping or starting HDFS Transparency roles when the IBM
Storage Scale file system was respectively unmounted or remounted.
Note: HDFS Transparency 3.2.2-7 supports an upgrade only from HDFS Transparency 3.2.2-5.

Summary of changes as updated, December 2023


Changes in IBM Storage Scale 5.1.9-1
• Includes HDFS Transparency 3.1.1-16 and HDFS Transparency 3.2.2-7.
Changes in HDFS Transparency 3.1.1-16 in IBM Storage Scale 5.1.9-1
• Fixed an issue where the reinstallation of the same HDFS Transparency rpm version failed and could not
be recovered.
• Included runLog4jV1Patcher.sh in /usr/lpp/mmfs/hadoop/scripts/ to patch a user provided
log4j JAR.
Changes in HDFS Transparency 3.2.2-7 in IBM Storage Scale 5.1.9-1
• Fixed an issue where the reinstallation of the same HDFS Transparency rpm version failed and could not
be recovered.
• Included runLog4jV1Patcher.sh in /usr/lpp/mmfs/hadoop/scripts/ to patch a log4j JAR
provided by a user.
• Fixed an issue where too many lookups and log entries for missing UID and GID would impact the HDFS
Transparency performance.
• Improved hdfs dfs ls to use IBM Storage Scale ls as input, instead of caching the metadata and
synchronizing with IBM Storage Scale regularly.
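For reference, the listing operation affected by this change is the standard Hadoop shell command shown below; the path is a placeholder only.

# Standard Hadoop listing command whose metadata handling is improved in HDFS Transparency 3.2.2-7
hdfs dfs -ls /path/to/directory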
Changes in the documentation
• Restructured the "IBM Storage Scale support for Hadoop" chapter.
• Moved "Hadoop IBM Storage Scale Architecture" to "IBM Storage Scale support for Hadoop" >
"Overview".

Note: HDFS Transparency 3.2.2-7 supports an upgrade only from HDFS Transparency 3.2.2-5.

Summary of changes as updated, November 2023


Changes in Cloudera Data Platform Private Cloud Base
• From IBM Storage Scale 5.1.8.0, CDP Private Cloud Base 7.1.9-CHF1 is certified with IBM Storage Scale
on x86 and Power LE. For more information, see “Support Matrix” on page 294.
Changes in IBM Storage Scale 5.1.9-0
• Includes HDFS Transparency 3.1.1-15 and HDFS Transparency 3.2.2-6.
Changes in HDFS Transparency 3.1.1-15 in IBM Storage Scale 5.1.9-0
• Added the mmhdfs config dump subcommand. For more information, see mmhdfs command.
• Increased the performance for recursive deletions of snapshot-enabled directories by avoiding the
mmlssnapshot dependency.
• Improved internal data structures to avoid directory lock contentions.
• Fixed an issue where the log includes many messages like aclutil.cc get_file failed [No
such file or directory].
• Fixed an issue where the getContentSummary returns inconsistent results if multiple files in the same
directory are removed at the same time.
• Added buffered logging and log filtering, which increases HDFS Transparency I/O throughput. For more
information, see “Buffered logging and filtering” on page 268.
• Changed the installation process to use self-provided JAR files. For more information, see “Installation
prerequisites” on page 30.
Changes in HDFS Transparency 3.2.2-6 in IBM Storage Scale 5.1.9-0
• Rebranded scripts in HDFS Transparency 3.2.2.-6 from "IBM Spectrum Scale" to "IBM Storage Scale".
• Added the mmhdfs config dump subcommand. For more information, see mmhdfs command
• Increased the performance for recursive deletions of snapshot-enabled directories by avoiding the
mmlssnapshot dependency.
• Improved internal data structures to avoid directory lock contentions.
• Fixed an issue where the log includes many messages like aclutil.cc get_file failed [No
such file or directory].
• Fixed an issue where the getContentSummary returns inconsistent results if multiple files in the same
directory are removed at the same time.
• Added buffered logging and log filtering, which increases HDFS Transparency I/O throughput. For more
information, see “Buffered logging and filtering” on page 268.
• Changed the installation process to use self-provided JAR files. For more information, see “Installation
prerequisites” on page 30.
Note: In upcoming IBM Storage Scale versions, HDFS Transparency 3.2.2-x will be replaced by an HDFS
Transparency 3.3.5-x version based on the corresponding Apache Hadoop 3.3.5 version. The Apache
Hadoop version used as the basis for the HDFS Transparency version will be supported by Apache Bigtop.

Summary of changes as updated, July 2023


Changes in IBM Storage Scale 5.1.8-1
• Includes HDFS Transparency 3.1.1-14, HDFS Transparency 3.2.2-5, and HDFS Transparency 3.3.0-2.
Changes in HDFS Transparency 3.1.1-14 in IBM Storage Scale 5.1.8-1
• Rebranded the documentation and scripts in HDFS Transparency 3.1.1-14.

• Added the "validity" argument to gpfs_tls_configuration.py to define the time for which TLS certificates are valid (see the sketch after this list).
• Changed the previous default TLS certificate validity of 90 days to 1826 days if no "validity" argument is passed to gpfs_tls_configuration.py.
• Improved error handling for misconfigurations in gpfs_tls_configuration.py.
• Improved the documentation around the TLS certificate setup. See “TLS” on page 129.
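As an illustration of the validity argument described above, an invocation of gpfs_tls_configuration.py might look like the following sketch. The script path and the option spellings shown here are assumptions based on the descriptions in this section, not the documented syntax; see “TLS” on page 129 for the authoritative usage.

# Hypothetical invocation; the path and option names are assumptions, not the documented syntax
python3 /usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py enable-tls --validity 1826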

Summary of changes as updated, April 2023


Changes in IBM Storage Scale 5.1.7-1
• Includes HDFS Transparency 3.1.1-13, HDFS Transparency 3.2.2-5, and HDFS Transparency 3.3.0-2.
Changes in HDFS Transparency 3.1.1-13 in IBM Storage Scale 5.1.7-1
• Fixed an issue where appending to an existing file in an encryption zone failed (APAR IJ45843).
• Improved parallel data access by reducing the locking scope on directory-level to avoid parent directory
locking.
• Fixed an issue where the rm and du commands would fail with NoSuchFileException.
• Reduced exception to warning when a file lease cannot be found while creating a file, in order to prevent
application-side failures.
• Improved overall performance by changing the update process for NameNode metadata and reducing
the syncChildren calls.
• Fixed an issue where the NameNode crashes by failing to finalize the shared edit log on NameNode
failover.
• Improved the listing performance by changing the way stat is called and avoiding stat oscillation
behavior.
• Changed the RSA Key strength in the TLS enablement script from 1024 to 2048.
• Added an AccessControlException in the put command if used for deleted users.
• Fixed an issue where the TLS script fails with enable-tls option if the dfs.namenode.http-
address parameter is missing in the configuration.
• Realigned the usage text of hdfs getconf.
• Fixed an issue where the NameNode will not start because of a missing dependent jar. For resolution in
the affected HDFS Transparency versions 3.1.1-11, 3.1.1-12, 3.2.2-2 and 3.2.2-3, see NameNode fails
to start in HDFS Transparency 3.1.1-11, 3.1.1-12, 3.2.2-2 or 3.2.2-3.
Changes in HDFS Transparency 3.2.2.-5 in IBM Storage Scale 5.1.7-1
• Fixed an issue where parallel move or rename and listing operations on the same directory can lead to a
deadlock situation.

Summary of changes as updated, March 2023


Changes in IBM Storage Scale 5.1.7-0
• Includes HDFS Transparency 3.1.1-12, HDFS Transparency 3.2.2-4, and HDFS Transparency 3.3.0-2.
Changes in HDFS Transparency 3.2.2-4 in IBM Storage Scale 5.1.7-0
• Fixed an issue where gpfs_kerberos_configuration.py fails to run.
• Improved parallel data access by reducing the locking scope on directory level to avoid parent directory
locking.
• Fixed an issue where the rm and du commands fail with NoSuchFileException.
• Reduced exception to warning when a file lease cannot be found while creating a file, in order to prevent
application side failures.

• Improved overall performance by changing the update process for NameNode metadata and reducing
syncChildren calls.
• Fixed an issue where the NameNode crashes by failing to finalize the shared edit log on NameNode
failover.
• Improved the listing performance by changing the way stat is called and avoiding stat oscillation
behavior.
• Fixed an issue where the TLS script fails with enable-tls option if the dfs.namenode.http-
address parameter is missing in the configuration.
• Changed the RSA key strength in the TLS enablement script to 2048.
• Fixed an issue where the NameNode will not start because of a missing dependent jar. For resolution in
the affected HDFS Transparency versions 3.1.1-11, 3.1.1-12, 3.2.2-2 and 3.2.2-3, see NameNode fails
to start in HDFS Transparency 3.1.1-11, 3.1.1-12, 3.2.2-2 or 3.2.2-3.

Summary of changes as updated, January 2023


Changes in IBM Storage Scale 5.1.6-1
• Includes HDFS Transparency 3.1.1-12, HDFS Transparency 3.2.2-3 and HDFS Transparency 3.3.0-2.
Changes in IBM Storage Scale 5.1.2-9
• Includes HDFS Transparency 3.1.1-12 and HDFS Transparency 3.3.0-2.
Changes in HDFS Transparency 3.1.1-12 in IBM Storage Scale 5.1.2-9 and IBM Storage Scale 5.1.6-1
• Added security fix for CVE-2022-25168.

Summary of changes as updated, December 2022


Changes in IBM Storage Scale 5.1.6-0
• Includes HDFS Transparency 3.1.1-11, HDFS Transparency 3.2.2-3, and HDFS Transparency 3.3.0-2.
Changes in HDFS Transparency 3.1.1-11 in IBM Storage Scale 5.1.6-0
• Fixed the issue where a ticket expiration in an AD Kerberos environment can lead to two active
NameNodes.
• Included fine-grained read/write locking of file lease manager to improve the performance.
• Fixed the issue where mmhdfs config import ignored ranger-hdfs-policymgr-ssl.xml.
• Added general security fixes.
Changes in HDFS Transparency 3.2.2-3 in IBM Storage Scale 5.1.6-0
• Added general security fixes.
• Added a fix for the scripts in /usr/lpp/mmfs/hadoop/scripts/ to run with Python 3.8.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.8-0 in IBM Storage Scale 5.1.6-0
• Fixed an issue that caused the IBM Storage Scale installation toolkit to fail if the name of the file system configured for HDFS Transparency includes an underscore.
Changes in Cloudera Data Platform Private Cloud Base
• From IBM Storage Scale 5.1.4.0, CDP Private Cloud Base 7.1.8 is certified with IBM Storage Scale on
Power. For more information, see “Support Matrix” on page 294.
For Hue to work properly, Cloudera Manager 7.7.1+ requires Python version to be at v3.8 on the Hue
nodes.

Summary of changes as updated, October 2022
Changes in IBM Storage Scale 5.1.5-1
• Includes HDFS Transparency 3.1.1-10, HDFS Transparency 3.2.2-2 and HDFS Transparency 3.3.0-2.
Changes in HDFS Transparency 3.2.2-2 in IBM Storage Scale 5.1.5-1
• Fixed the issue where a ticket expiration in an AD Kerberos environment can lead to two active
NameNodes.
• Included fine-grained read/write locking of file lease manager to improve the performance.
• Added general security fixes.
• Added security fix for CVE-2022-25168.
Changes in the documentation
• Added “Remote mount at fileset level” on page 192.
• Added HDFS Transparency to IBM Storage Scale support matrix on “HDFS Transparency support
matrix” on page 27.

Summary of changes as updated, September 2022


Changes in IBM Storage Scale 5.1.5
• Includes HDFS Transparency 3.1.1-10, HDFS Transparency 3.2.2-1 and HDFS Transparency 3.3.0-2.
Changes in Cloudera Data Platform Private Cloud Base
• From IBM Storage Scale 5.1.4.0, CDP Private Cloud Base 7.1.8 is certified with IBM Storage Scale on
x86. For more information, see “Support Matrix” on page 294.

Summary of changes as updated, August 2022


Changes in IBM Storage Scale 5.1.2.6
• Includes HDFS Transparency 3.1.1-10 and HDFS Transparency 3.3.0-2.
Note: IBM Storage Scale 5.1.3.0, IBM Storage Scale 5.1.3.1 and IBM Storage Scale 5.1.4.0 include
earlier versions of HDFS Transparency and an upgrade must be considered to IBM Storage Scale 5.1.4.1
or later.
Added support for Red Hat IPA Kerberos for HDFS Transparency.

Summary of changes as updated, July 2022


Changes in HDFS Transparency 3.1.1-10 in IBM Storage Scale 5.1.4.1
• Fixed the issue where a fast repetitive usage of mmces service stop hdfs and mmces service
start hdfs can lead to two standby NameNodes.
• Added security fix for CVE-2022-23305, CVE-2022-23307, CVE-2022-23302 and CVE-2020-9488.
Changes in HDFS Transparency 3.3.0-2 in IBM Storage Scale 5.1.4.1
• Added security fix for CVE-2022-23305, CVE-2022-23307, CVE-2022-23302, CVE-2020-9488.

Summary of changes as updated, June 2022


Changes in HDFS Transparency 3.2.2-1 in IBM Storage Scale 5.1.4.0
• Supports CES HDFS Transparency 3.2.2-1 for Open Source Apache Hadoop 3.2.2 distribution on RH 7.9
on x86_64.
Changes in HDFS Transparency 3.1.1-9 in IBM Storage Scale 5.1.4.0

• Optimized the internal metadata data structures for the NameNode for improved memory efficiency. For
more information, see “Recommended hardware resource configuration” on page 16.
• Fixed the parsing problem of hadoop-env.sh that used to skip the last line and therefore might miss
configuration key-value pairs on the last line of the file.

Summary of changes as updated, May 2022


Changes in HDFS Transparency 3.2.2-0 in IBM Storage Scale 5.1.3.2
• IBM Storage Scale 5.1.3 PTF2 is a technology preview version specifically for Hadoop users who want
to try out HDFS Transparency 3.2.2 for Open-source Apache Hadoop 3.2.2 during a limited download
period in the Fix Central. This technology preview is only available for Data Management Edition on
RHEL 7.9 on x86_64 with a limited-time period for nonproduction usage. IBM Storage Scale 5.1.3
PTF2 contains the additional HDFS Transparency 3.2.2 with the IBM Storage Scale 5.1.3 PTF1 content.
Therefore, this technology preview cannot be installed if IBM Storage Scale 5.1.3 PTF1 is already
installed.

Summary of changes as updated, April 2022


Changes in Cloudera Data Platform Private Cloud Base
• CDP Private Cloud Base 7.1.7 SP1 is certified with IBM Storage Scale starting from version 5.1.2.2. For
more information, see “Support Matrix” on page 294.

Summary of changes as updated, March 2022


Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.5.0 in IBM Storage Scale 5.1.3
• Supports the parallel offline upgrade.
The parallel offline upgrade support will change the current offline upgrade process from sequential to
parallel. This will significantly reduce the upgrade time in the offline mode.
Changes in IBM Storage Scale file system core configuration in IBM Storage Scale 5.1.3
• For updates to the tscCmdAllowRemoteConnections parameter, see the File system core
improvements section under the IBM Storage Scale Summary of changes documentation.

Summary of changes as updated, January 2022


Changes in HDFS Transparency 3.1.1-8 in IBM Storage Scale 5.0.5.12
Changes in HDFS Transparency 3.1.1-8 and 3.3.0-1 in IBM Storage Scale 5.1.2.2
• Added security fix for CVE-2021-4104 and CVE-2019-17571.
Changes in HDFS Transparency 3.1.0-10 in IBM Fix Central
• Added security fix for CVE-2021-4104 and CVE-2019-17571.
• Fixed the timing rename failures.
Note that HDFS Transparency 3.1.0-10 is the last PTF in the 3.1.0.x stream.
For more information, see IBM Security Bulletin.

Summary of changes as updated, December 2021


Changes in HDFS Transparency 3.1.0-9
• Optimized the handling of the metadata for NameNode for improved memory efficiency.

To ensure that the data on IBM Storage Scale that is to be processed with HDFS Transparency is up to date, the IBM Storage Scale exact mtime mount option (-E yes, the default value) must be set so that accurate file modification times are always returned (see the example after this list).
• Optimized parallelism for DataNode request processing for the performance improvement. This
includes the ports of HDFS-15150 and HDFS-15160 that introduces three DataNode configuration
parameters. For more information, see “Configuration options for HDFS Transparency” on page 242.
• The IBM Storage Scale file system is now explicitly checked in mount and unmount callbacks during
HDFS Transparency startup and shutdown. Unrelated IBM Storage Scale file systems no longer affect
HDFS Transparency. This means that HDFS Transparency will start only if the relevant mount point
is properly mounted and will stop if the relevant mount point is unmounted based on the HDFS
Transparency status checking in the IBM Storage Scale event callback process.
• Fixed intermittent issues in date and size output when listing files.
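The following is a minimal sketch, using standard IBM Storage Scale commands, of how the exact mtime option mentioned in the first item of this list could be checked and, if needed, set; <filesystem> is a placeholder for your file system name.

# Check whether exact mtime reporting (-E) is enabled for the file system
mmlsfs <filesystem> -E

# Set it to yes (the default) if it is not
mmchfs <filesystem> -E yes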

Summary of changes as updated, November 2021


Changes in HDFS Transparency 3.1.1-7 in IBM Storage Scale 5.1.2.1
• Support added for Java 11.

Summary of changes as updated, October 2021


Changes in Mpack version 2.7.0.10
• The IBM Storage Scale service can now be deployed or upgraded in a single or multiple HDFS
namespace configuration. This includes adding DataNode using Ambari in multiple HDFS namespaces.
• Decommissioning DataNodes using the Ambari HDFS service is now supported.
• Fixed NamenodeHAState init arguments after 1 retry failure during HDP upgrading with Ambari
2.7.5.17-6 and Mpack 2.7.0.9 at the HDFS service upgrade step.
• The IBM Storage Scale service can now be deployed in Ambari in remote cluster mount configuration for
non-root Ambari and IBM Storage Scale environment.
• The MoveNameNodeTransparency.py script now supports moving the HDFS Transparency NameNode
when Kerberos is enabled.
Changes in Cloudera Data Platform Private Cloud Base
• CDP Private Cloud Base 7.1.7 is certified with IBM Storage Scale from version 5.1.1.2 on Power LE
platform. For more information, see “Support Matrix” on page 294.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.4-0 in IBM Storage Scale 5.1.2
• The cleanup -n option of installation toolkit will now clear only the configuration of a single
HDFS cluster instead of clearing the configurations of all the HDFS clusters in a multi-HDFS cluster
environment from the toolkit's metadata.
Changes in HDFS Transparency 3.1.1-6
• Optimized the handling of the metadata for the NameNode performance improvement.
• Optimized parallelism for DataNode request processing for performance improvement. This includes
ports of HDFS-15150 and HDFS-15160 that introduces three DataNode configuration parameters. For
more information, see “Configuration options for HDFS Transparency” on page 242.
• Fixed the getListing RPC to handle the remaining files correctly when block locations are requested; previously, higher-level services could get an incomplete directory listing.
• Support for decommissioned DataNodes is enabled. For more information, see “Decommissioning
DataNodes” on page 77.
• Fixed metadata handling when a listing would not show the correct creation time.
Documentation update

• Added configuration parameters for gpfs-site.xml that describes the specific IBM Storage Scale
parameters. For more information, see “Configuration parameters for gpfs-site.xml” on page 244.
• Moved the Configuration options for HDFS Transparency information to “Configuration parameters” on
page 242.

Summary of changes as updated, August 2021


Changes in Cloudera Data Platform Private Cloud Base
• CDP Private Cloud Base 7.1.7 is certified with IBM Storage Scale 5.1.1.2 on x86_64 platform.
• CDP 7.1.7 supports the upgrade path from CDP 7.1.6 with CSD 1.1.0-0 on IBM Storage Scale 5.1.1.1 to
CDP 7.1.7 with CSD 1.2.0-0 on IBM Storage Scale 5.1.1.2. For more information, see “Upgrading CDP”
on page 341.

Summary of changes as updated, July 2021


Changes in HDFS Transparency 3.3.0-0 in IBM Storage Scale 5.1.1.2
• Supports CES HDFS Transparency 3.3 for Open Source Apache Hadoop 3.3 distribution on RH 7.9 on
x86_64.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.3-2 in IBM Storage Scale 5.1.1.2
• Supports new installation of CES HDFS Transparency 3.3 through the IBM Storage
Scale installation toolkit on RH 7.9 on x86_64 when the environment variable
SCALE_HDFS_TRANSPARENCY_VERSION_33_ENABLE=True is exported. For more information, see
“Steps for install toolkit” on page 32.
Changes in HDFS Transparency 3.1.0-8
• Optimized the handling of the metadata for the NameNode performance improvement.
• Fixed the getListing RPC to handle the remaining files correctly when block locations are requested; previously, higher-level services could get an incomplete directory listing.
• Backported the fix for a race condition that caused parsing error of java.io.BufferedInputStream
in org.apache.hadoop.conf.Configuration class (HADOOP-15331).
• Fixed the handling of the file listing so that the java.nio.file.NoSuchFileException warning
messages do not occur.
• Fixed the handling of the getBlockLocation RPC on files that do not exist. Previously, this issue prevented the YARN ResourceManager from starting after the node labels directory was configured.
• Support for decommissioned DataNodes is enabled. For more information, see “Decommissioning
DataNodes” on page 77.
• General security fixes and CVE-2020-9492 in IBM Support.
Changes in Cloudera HDP
• The --sync-hdp option used for upgrading HDP is now deprecated.

Summary of changes as updated, June 2021


Changes in Cloudera Data Platform Private Cloud Base
• CDP Private Cloud Base 7.1.6 is now certified on ppc64le.
Changes in HDFS Transparency 3.1.1-5 in IBM Storage Scale 5.1.1.1
• Fixed the handling of the file listing. Therefore, the java.nio.file.NoSuchFileException warning
messages will no longer occur.
• Fixed the handling of the getBlockLocation RPC on files that do not exist. Previously, this issue prevented the YARN ResourceManager from starting after the node labels directory was configured.

• From HDFS Transparency 3.1.1-5, the gpfs_tls_configuration.py script automates the
configuration of Transport Layer Security (TLS) on the CES HDFS Transparency cluster.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.3.1 in IBM Storage Scale 5.1.1.1
• From Toolkit version 1.0.3.1, creating multiple CES HDFS clusters using the IBM Storage Scale
installation toolkit during the same deployment run is supported.

Summary of changes as updated, May 2021


Changes in Cloudera Data Platform Private Cloud Base
• From CDP Private Cloud Base 7.1.6, Impala is certified on IBM Storage Scale 5.1.1 on x86_64.

Summary of changes as updated, April 2021


Changes in Cloudera Data Platform Private Cloud Base
• CDP Private Cloud Base 7.1.6 is certified with IBM Storage Scale 5.1.1.0. This CDP Private Cloud Base
version supports Transport Layer Security (TLS) and HDFS encryption.
Changes in HDFS Transparency 3.1.1-4
• Fixed the mmhdfs command to recognize short host name configurations for NameNodes and DataNodes. Therefore, the "The node is not a namenode or datanode" error message no longer occurs.
• The IBM Storage Scale file systems are now explicitly checked in mount and unmount callbacks during
HDFS Transparency startup and shutdown process. Unrelated IBM Storage Scale file systems no longer
affect HDFS Transparency. This means that HDFS Transparency will start only if the relevant mount
point is properly mounted and will stop if the relevant mount point is unmounted based on the HDFS
Transparency status checking in the IBM Storage Scale event callback process.
• HDFS Transparency NameNode log now contains the HDFS Transparency full version information and
the gpfs.encryption.enable value.
• Added general security fixes and CVE-2020-4851 in IBM Support.
• Added a new custom json file method for the Kerberos script. For more information, see “Configuring
Kerberos using the Kerberos script provided with IBM Storage Scale” on page 117.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.3.0
• IBM Storage Scale installation toolkit now uses Ansible® configuration.
• Creating multiple CES HDFS clusters in the installation toolkit at the same deployment run is not
supported under Ansible-based toolkit. For workaround, see Multi-HDFS cluster deployment through
IBM Storage Scale 5.1.1.0 installation toolkit is not supported.

Summary of changes as updated, March 2021


Changes in IBM Storage Scale CES HDFS Transparency
• IBM Storage Scale CES HDFS Transparency now supports both the NameNode HA and non-HA options.
Also, DataNode can now have Hadoop services colocated within the same node. For more information,
see “Alternative architectures” on page 291.
Changes in Mpack version 2.7.0.9
• The Ambari maintenance mode for clusters is now supported by the IBM Storage Scale service on gpfs.storage.type with shared or remote environments. Earlier, when the user performed a Start all or Stop all operation from the Ambari GUI, the IBM Storage Scale service or its components would start or stop respectively even when they were set to maintenance mode.

• The Mpack upgrade process does not reinitialize the following HDFS parameters to the Mpack’s
recommended settings:
– dfs.client.read.shortcircuit
– dfs.datanode.hdfs-blocks-metadata.enabled
– dfs.ls.limit
– dfs.datanode.handler.count
– dfs.namenode.handler.count
– dfs.datanode.max.transfer.threads
– dfs.replication
– dfs.namenode.shared.edits.dir
Earlier any updates to these parameters by the end user were overwritten. As this issue is now fixed,
any customized hdfs-site.xml configuration will not be changed during the upgrade process.
• In addition to Check Integration Status option in the Ambari service, you can now view the Mpack
version/build information in version.txt in the Mpack tar.gz package.
• The hover message for the GPFS Quorum Nodes text field within the IBM Storage Scale service GUI
has been updated. The hostnames to be entered for the Quorum Nodes should be from the IBM Storage
Scale Admin network hostnames.
• The Mpack uninstaller script cleans up the IBM Storage Scale Ambari stale link that is no longer
required. Therefore, the Ambari server restart will not fail because of the Mpack dependencies.
• The Mpack installation, upgrade, and uninstall script now supports the sudo root permission.
• The anonymous UID verification is checked only if hadoop.security.authentication is not set to
Kerberos.
• The IBM Storage Scale service can now monitor the status of configured file system mount point
(gpfs.mnt.dir).

In earlier releases of Mpack, the IBM Storage Scale service was able to monitor only the status of the
IBM Storage Scale runtime daemon.
If any of the configured file systems is not mounted on the IBM Storage Scale node, the status for the GPFS_NODE component for that node now appears as down in the Ambari GUI.

Summary of changes as updated, January 2021


Changes in Cloudera Data Platform Private Cloud Base
Cloudera Data Platform Private Cloud Base with IBM Storage Scale is supported on Power®. For more
information, see “Support Matrix” on page 294.
Changes in HDFS Transparency 3.1.0-7
• Fixed the NullPointerException error message that appeared in the NameNode logs.
• Fixed the JMX output to correctly report "open" operations when the gpfs.ranger.enabled
parameter is set to scale.
• A vulnerability in IBM Storage Scale allows injecting malicious content into the log files. For the security
fix information, see IBM Support.
Documentation update
Configuration options for using multiple threads to list a directory and load the metadata of its children
are provided for HDFS Transparency 3.1.1-3 and 3.1.0-6. For more information, see the list option.

Summary of changes as updated, December 2020


Changes in HDFS Transparency 3.1.1-3

• HDFS Transparency implements performance enhancement by using fine-grained file system locking
mechanism. After HDFS Transparency 3.1.1-3 is installed, ensure that the gpfs.ranger.enabled
field is set to scale in /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml. For more information, see
“Setting configuration options in CES HDFS” on page 69.
• The create Hadoop users and groups script and the create Kerberos principals and keytabs script in IBM
Storage Scale now reside in the /usr/lpp/mmfs/hadoop/scripts directory.
• Requires Python 3.6 or later.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.2-1
• The toolkit installation failure due to nodes that are not a part of the CES HDFS cluster and does not
have JAVA installed and do not have JAVA_HOME set is now fixed.
• The following proxyuser configurations were added into core-site.xml by the installation toolkit to
configure a CES HDFS cluster:

hadoop.proxyuser.livy.hosts=*
hadoop.proxyuser.livy.groups=*
hadoop.proxyuser.hive.hosts=*
hadoop.proxyuser.hive.groups=*
hadoop.proxyuser.oozie.hosts=*
hadoop.proxyuser.oozie.groups=*
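For reference, the first of these entries expressed in core-site.xml property form would look like the following sketch; the remaining entries follow the same pattern. This is an illustration only, not an additional configuration step.

<!-- Illustration of the key=value entry above in core-site.xml property form -->
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>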

Changes in IBM Storage Scale Cloudera Custom Service Descriptor (CDP CSD) 1.0.0-0
• Integrates IBM Storage Scale service into CDP Private Cloud Base Cloudera Manager.

Summary of changes as updated, November 2020


Changes in HDFS Transparency 3.1.1-2
• Supports CDP Private Cloud Base. For more information, see “Support Matrix” on page 294.
• Includes Hadoop sample scripts to create users and groups in IBM Storage Scale and set up the
Kerberos principals and keytabs. Requires Python 3.6 or later.
• Summary operations (for example, du, count, and so on) in HDFS Transparency can be now done
multi-threaded based on the number of files and subdirectories. It improves the performance when
performing the operation on a path that contains numerous files and subdirectories. The performance
improvement depends on the system environment. For more information, see Functional limitations.
Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.2-0
• Added support to deploy CES HDFS in SLES 15 and Ubuntu 20.04 on x86_64 platforms.
• Package was renamed from bda_integration-<version>.noarch.rpm to gpfs.bda-
integration-<version>.noarch.rpm .
• Requires Python 3.6 or later.
Changes in IBM Storage Scale Cloudera Custom Service Descriptor (CDP CSD) 1.0.0-0 EA
• Integrates IBM Storage Scale service into CDP Private Cloud Base Cloudera Manager.

Summary of changes as updated, October 2020


Changes in HDFS Transparency 3.1.0-6
• HDFS Transparency now implements performance enhancement by using the fine-grained file system
locking mechanism instead of using the Apache Hadoop global file system locking mechanism. From
HDFS Transparency 3.1.0-6, set gpfs.ranger.enabled to scale from the HDP Ambari GUI under the

IBM Storage Scale service configuration page. If you are not using Ambari, set gpfs.ranger.enabled
in /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml as follows:

<property>
  <name>gpfs.ranger.enabled</name>
  <value>scale</value>
  <final>false</final>
</property>

Note: The scale option replaces the original true/false values.


• Summary operations (for example, du, count, and so on) in HDFS Transparency can be now done
multi-threaded based on the number of files and subdirectories. It improves the performance when
performing the operation on a path that contains numerous files and subdirectories. The performance
improvement depends on the system environment. For more information, see Functional limitations.

Summary of changes as updated, August 2020


Changes in Mpack version 2.7.0.8
For Mpack 2.7.0.7 and earlier, a restart of the IBM Storage Scale service would overwrite the IBM Storage
Scale customized configuration if the gpfs.storage.type parameter was set to shared.
From Mpack 2.7.0.8, if the gpfs.storage.type parameter is set to shared or shared,shared, the IBM
Storage Scale service will not set the IBM Storage Scale tunables, that are seen under the IBM Storage
Scale service, back to the IBM Storage Scale cluster or file system.

Summary of changes as updated, July 2020


Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.1.1
• Supports rolling upgrade of HDFS Transparency through installation toolkit.
Note: If the SMB protocol is enabled, all protocols are required to be offline for some time because the
SMB does not support the rolling upgrade.
• Requires IBM Storage Scale 5.0.5.1 and HDFS Transparency 3.1.1-1. For more information, see CES
HDFS “HDFS Transparency support matrix” on page 27.
• From IBM Storage Scale 5.0.5.1, only one CES-IP is needed for one HDFS cluster during installation
toolkit deployment.
Changes in HDFS Transparency 3.1.0-5
• When gpfs.replica.enforced is set to gpfs, the client replica setting is not honored. Converted the WARN message namenode.GPFSFs (GPFSFs.java:setReplication(123)) - Set replication operation invalid when gpfs.replica.enforced is set to gpfs to Debug, because this message can occur many times in the NameNode log.
• Fixed NameNode hangs that occurred when running MapReduce jobs because of a lock synchronization issue.
• From IBM Storage Scale 5.0.5, the gpfs.snap --hadoop can access the HDFS Transparency logs
from the user configured directories.
• From HDFS Transparency 3.1.0-5, the default value for dfs.replication is 3 and the default for gpfs.replica.enforced is gpfs. Therefore, the IBM Storage Scale file system replication is used instead of the Hadoop HDFS replication (see the sketch after this list). Also, increasing the dfs.replication value to 3 helps the HDFS client tolerate DataNode failures.
Note: You need at least three DataNodes when you set dfs.replication to 3.
• Changed permission mode for editlog files to 640.
• For two file systems, HDFS Transparency ensures that the NameNodes and DataNodes are stopped
before unmounting the second file system mount point.

Note: The local directory path for the second file system mount usage is not removed. Ensure this local
directory path is empty before starting the NameNode.
• HDFS Transparency does not manage the storage, so the Apache Hadoop block function calls that are used for native HDFS give false metric information. Therefore, HDFS Transparency does not run the Apache Hadoop block function calls.
• Delete operations in HDFS Transparency can be now done multi-threaded based on the number of
files and subdirectories. It improves performance when deleting a path that contains numerous files
and subdirectories. The performance improvement depends on the system environment. For more
information, see Functional limitations.
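As a minimal illustration of the gpfs.replica.enforced default described in the list above, the property can be written in gpfs-site.xml in the same form as the other gpfs-site.xml examples in this guide; treat this as a sketch rather than a complete configuration.

<!-- Sketch of the default gpfs.replica.enforced setting in gpfs-site.xml -->
<property>
  <name>gpfs.replica.enforced</name>
  <value>gpfs</value>
</property>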
Changes in Mpack version 2.7.0.7
• Supports HDP upgrade with Mpack 2.7.0.7 without unintegrating HDFS Transparency.
• The Mpack 2.7.0.7 supports Ambari version 2.7.4 or later.
• The installation and upgrade scripts now support complex KDC password when Kerberos is enabled.
• You can now upgrade from older Mpacks (versions 2.7.0.x) to Mpack 2.7.0.7 with Kerberos enabled without using the workaround.
• The upgrade postEU process is now simplified and can now automatically accept the user agreement
license.
• The upgrade postEU option now requests the user inputs only once during the upgrade process.
• During the Mpack installation or upgrade process, the backup directory that is created by the Mpack
installer now includes a date timestamp added to the directory name.
• The Check Integration Status UI action in IBM Storage Scale service now shows the unique Mpack build
ID.
• If you are enabling Kerberos after integrating IBM Storage Scale service, ZKFC initialization used to fail
because the hdfs_jaas.conf file was missing. A workaround is no longer required.
• Ambari now supports rolling restart for NameNodes and DataNodes.
• The configuration changes will be in effect after you restart the NameNodes and DataNodes and do not
require all the HDFS Transparency nodes to be restarted.
• If the SSL is enabled, the upgrade script asks for the hostname instead of the IP address.
• The true/false inputs requested by the upgrade script are no longer case-sensitive.
• When deployment type is set to gpfs.storage.type=shared, a local GPFS cluster would be created
even if the bidirectional passwordless ssh was not set up properly between the GPFS Master and the
ESS contact node. This issue is now fixed. The deployment fails in such scenarios and an error message
is displayed.
• If you are using IBM Storage Scale 4.2.3.2, Ambari service hangs because the mmchconfig would
be prompting for an ENTER feedback for the LogFileSize parameter. From Mpack 2.7.0.7, the
LogFileSize configuration cannot be modified. The LogFileSize parameter can be configured only
through the command line by using the mmchconfig command.

Summary of changes as updated, May 2020


Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.1.0
• Supports offline upgrade of HDFS Transparency.
• Requires IBM Storage Scale 5.0.5 and HDFS Transparency 3.1.1-1. For more information, see CES HDFS
“HDFS Transparency support matrix” on page 27.
Changes in HDFS Transparency 3.1.1-1
• A check is performed while you are running the mmhdfs config upload command to ensure that the
ces_group_name is consistent with the HDFS Transparency dfs.nameservices values.

• From IBM Storage Scale 5.0.5, the gpfs.snap --hadoop can now access the HDFS Transparency logs
from the user-configured directories.
• From HDFS Transparency 3.1.1-1, the default value for dfs.replication is 3 and the default for gpfs.replica.enforced is gpfs. Therefore, the IBM Storage Scale file system replication is used instead of the Hadoop HDFS replication. Also, increasing the dfs.replication value to 3 helps the HDFS client tolerate DataNode failures.
Note: You need at least three DataNodes when you set dfs.replication to 3.
• Fixed NameNode hangs that occurred when running MapReduce jobs because of a lock synchronization issue.
CES HDFS changes
• From IBM Storage Scale 5.0.5, HDFS Transparency version 3.1.1-1 and Big Data Analytics Integration
Toolkit for HDFS Transparency (Toolkit for HDFS) version 1.0.1.0, HDFS Transparency and Toolkit for
HDFS packages are signed with a GPG (GNU Privacy Guard) key and can be deployed by the IBM
Storage Scale installation toolkit.
For more information, go to IBM Storage Scale documentation and see the following topics:
– Installation toolkit changes subsection under the Summary of changes topic.
– Limitations of the installation toolkit topic under the Installing > Installing IBM Spectrum Scale on
Linux nodes and deploying protocols > Installing IBM Spectrum Scale on Linux nodes with the
installation toolkit.

Summary of changes as updated, March 2020


Changes in IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS) 1.0.0.1
• Supports deployment on ESS.
• Supports remote mount file system only for CES HDFS protocol.
• Requires IBM Storage Scale 5.0.4.3 and HDFS Transparency 3.1.1-0. For more information, see CES
HDFS “HDFS Transparency support matrix” on page 27.

Summary of changes as updated, January 2020


Changes in HDFS Transparency 3.1.1-0
• Integrates with CES protocol and IBM Storage Scale installation toolkit.
• Supports Open Source Apache Hadoop distribution and Red Hat Enterprise Linux operating systems.
Changes in HDFS Transparency 3.1.0-4
• Exported a commented NODE_HDFS_MAP_GPFS line into the hadoop-env.sh file for mmhadoopctl multi-network usage.
• Fixed data replication with AFM DR disk usage due to shrinkfit.
• Fixed an issue so that a job does not fail if one DataNode fails when gpfs.replica.enforced=gpfs and dfs.replication > 1 are used with gpfs.storage.type in shared mode.
• Changed to log warning messages for outdated clusterinfo and diskinfo files.
• Fixed an issue with deleting a file on the second file system when trash is enabled in a two file system configuration.
• Uses the default community-defined port numbers for dfs.datanode (address, ipc.address, and http.address) to reduce port conflicts with ephemeral ports.
• Fixed the hadoop df output that was earlier not consistent with the POSIX df output when two file systems are configured.
• Fixed dfs -du, which earlier displayed a wrong free space value.

Changes in Mpack version 2.7.0.6
• Supports HDP 3.1.5.

Summary of changes as updated, November 2019


Changes in Mpack version 2.7.0.5
• The Mpack installation script SpectrumScaleMPackInstaller.py no longer asks for the KDC credentials, even when the HDP Hadoop cluster is Kerberos enabled. The KDC credentials are required to be set up only before running the IBM Storage Scale service action "Unintegrate Transparency".
• If you are deploying the IBM Storage Scale service in a shared storage configuration
(gpfs.storage.type=shared), the Mpack will check for consistency of UID, GID of the anonymous
user only on the local GPFS nodes. The Mpack will not perform this check on the ESS nodes.
• If you are deploying the IBM Storage Scale service with two file system support with gpfs.storage.type=shared,shared or gpfs.storage.type=remote,remote, then the block replication in HDFS (dfs.replication) defaults to 1.
• From Mpack 2.7.0.5, the issue where all the nodes managed by Ambari had to be set as GPFS nodes during deployment is fixed. For example, if you set some nodes as Hadoop client nodes and some nodes as GPFS nodes for the HDFS Transparency NameNode and DataNodes, the deployment succeeds.
• In Mpack 2.7.0.4, if the gpfs.storage.type was set to shared, stopping the Scale service from
Ambari would report a failure in the UI even if the operation had succeeded internally. This issue has
been fixed in Mpack 2.7.0.5.
• IBM Storage Scale Ambari deployment can now support gpfs.storage.type=shared,shared
mode.

Summary of changes as updated, October 2019


IBM Erasure Code Edition (ECE) is supported as shared storage mode for Hadoop with HDFS
Transparency 3.1.0-3 and IBM Storage Scale 5.0.3.

Summary of changes as updated, September 2019


Changes in HDFS Transparency 3.1.0-3
• Validate open file limit when starting Transparency.
• mmhadoopctl supports dual network configuration when NODE_HDFS_MAP_GPFS is set in /var/
mmfs/hadoop/etc/hadoop/hadoop-env.sh. See section “mmhadoopctl supports dual network” on
page 204 for more details.
Changes in Mpack version 2.7.0.4
• For FPO clusters, the restripeOnDiskFailure value will be set to NO regardless of the original
set value during the stopping of GPFS main components. After the GPFS main stop completes, the
restripeOnDiskFailure value will be set back to its original value.
• The IBM Storage Scale service will do a graceful shutdown and will no longer do a force unmount of the
GPFS file system via mmunmount -f.
• Fixed an intermittent failure of one of the HDFS Transparency NameNodes at startup that was caused by a timing issue when both NameNode HA and Kerberos are enabled.
• The HDFS parameter dfs.replication is set to the mmlsfs -r value (default number of data replicas) of the GPFS file system for gpfs.storage.type=shared instead of the Hadoop replication value of 3 (see the sketch after this list).
• The Mpack installer (*.bin) file can now accept the license silently when the --accept-licence option
is specified.
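The following one-line sketch shows how the default number of data replicas that dfs.replication is matched to can be inspected; <filesystem> is a placeholder for your file system name.

# Display the file system's default number of data replicas (-r), which dfs.replication is set to match
mmlsfs <filesystem> -r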

Summary of changes as updated, May 2019
Changes in HDFS Transparency 3.1.0-2
• Issue fixed when a map reduce task fails after running for one hour when the Ranger is enabled.
• Issue fixed when Hadoop permission settings do not work properly in a kerberized environment.
Documentation updates
• Updated the Migrating IOP to HDP for BI 4.2.5 and HDP 2.6 information.

Summary of changes as updated, March 2019


Changes in Mpack version 2.7.0.3
• Supports dual network configuration
• Fixed an issue to look only at the first line in the shared_gpfs_node.cfg file to get the host name for shared storage so that the deployment of a shared file system does not hang.
• Removed gpfs_base_version and gpfs_transparency_version fields from the IBM Storage
Scale service configuration GUI. This removes the restart all that is required after IBM Storage Scale is
deployed.
• Mpack can now find the correct installed HDP version when multiple HDP versions are seen.
• IBM Storage Scale service is now able to handle hyphenated file system names so that the service will
be able to start properly during file system mount.
• Fixed the IBM Storage Scale entry in system_action_definitions.xml so that the IBM Storage Scale </actionDefinition> ending tag is no longer on the same line as the </actionDefinitions> tag. Previously, this could cause an installation issue when a new service was added after the IBM Storage Scale service, because the new service was added in between the IBM Storage Scale entry and the </actionDefinition></actionDefinitions> line.
HDFS Transparency 3.1.0-1
• Fixed Hadoop du to calculate all files under all subdirectories for the user even when the files have not
been accessed.
• Supports ViewFS in HDP 3.1 with Mpack 2.7.0.3.

Summary of changes as updated, February 2019


Changes in Mpack version 2.7.0.2
• Supports HDP 3.1.
• SLES 12 SP3 support for new installations on x86_64 only.
• Upgrade the HDFS Transparency on all nodes in the IBM Storage Scale cluster instead of just upgrading
it only on the NameNode and DataNodes.

Summary of changes as updated, December 2018


Changes in Mpack version 2.7.0.1
• Supports HDP 3.0.1.
• Supports preserving Kerberos token delegation during NameNode failover.
• IBM Storage Scale service Stop All/Start All service actions now support the best practices for IBM
Storage Scale stop/start as per Restarting a large IBM Storage Scale cluster topic in the IBM Storage
Scale: Administration Guide.
• The HDFS Block Replication parameter, dfs.replication, is automatically set to match the actual
value of the IBM Storage Scale Default number of data replicas parameter, defaultDataReplicas,
when adding the IBM Storage Scale service for remote mount storage deployment model.

HDFS Transparency 3.1.0-0
• Supports preserving Kerberos token delegation during NameNode failover.
• Fixed CWE/SANS security exposures in HDFS Transparency.
• Supports Hadoop 3.1.1

Summary of changes as updated, October 2018


Changes in Mpack version 2.4.2.7
• Supports preserving Kerberos token delegation during NameNode failover.
• IBM Storage Scale service Stop All/Start All service actions now support the best practices for IBM
Storage Scale stop/start as per Restarting a large IBM Storage Scale cluster topic in the IBM Storage
Scale: Administration Guide.
HDFS Transparency 2.7.3-4
• Supports preserving Kerberos token delegation during NameNode failover.
• Supports native HDFS encryption.
• Fixed CWE/SANS security exposures in HDFS Transparency.

Summary of changes as updated, August 2018


Changes in Mpack version 2.7.0.0
• Supports HDP 3.0.
Changes in HDFS Transparency version 3.0.0-0
• Supports HDP 3.0 and Mpack 2.7.0.0.
• Supports Apache Hadoop 3.0.x.
• Support native HDFS encryption.
• Changed IBM Storage Scale configuration location from /usr/lpp/mmfs/hadoop/etc/ to /var/
mmfs/hadoop/etc/ and default log location for open source Apache from /usr/lpp/mmfs/hadoop/
logs to /var/log/transparency.
New documentation sections
• Hadoop Scale Storage Architecture
• Hadoop Performance tuning guide
• Hortonworks Data Platform 3.X for HDP 3.0
• Open Source Apache Hadoop

Summary of changes as updated, July 2018


Changes in Mpack version 2.4.2.6
• HDP 2.6.5 is supported.
• Mpack installation resumes from the point of failure when the installation is re-run.
• The Collect Snap Data action in the IBM Storage Scale service in the Ambari GUI can capture the
Ambari agents' logs in to a tar package under the /var/log/ambari.gpfs.snap* directory.
• Use cases where the Ambari server and the GPFS main are colocated on the same host but are
configured with multiple IP addresses are handled within the IBM Storage Scale service installation.
• On starting IBM Storage Scale from Ambari, if a new kernel version is detected on the IBM Storage Scale
node, the GPFS portability layer is automatically rebuilt on that node.

• On deploying the IBM Storage Scale service, the Ambari server restart is not required. However, the
Ambari server restart is still required when running the Service Action > Integrate Transparency or
Unintegrate Transparency from the Ambari UI.

Summary of changes as updated, May 2018


Changes in HDFS Transparency 2.7.3-3
• Non-root password-less login of contact nodes for remote mount is supported.
• When the Ranger is enabled, uid greater than 8388607 is supported.
• Hadoop storage tiering is supported.
Changes in Mpack version 2.4.2.5
• HDP 2.6.5 is supported.

Summary of changes as updated, February 2018


Changes in HDFS Transparency 2.7.3-2
• Snapshot from a remote-mounted file system is supported.
• IBM Storage Scale fileset-based snapshot is supported.
• HDFS Transparency and IBM Storage Scale Protocol SMB can coexist without the SMB ACL controlling
the ACL for files or directories.
• HDFS Transparency rolling upgrade is supported.
• Zero shuffle for IBM ESS is supported.
• Manual update of file system configurations when root password-less access is not available for remote
cluster is supported.
Changes in Mpack version 2.4.2.4
• HDP 2.6.4 is supported.
• IBM Storage Scale admin mode central is supported.
• The /etc/redhat-release file workaround for CentOS deployment is removed.

Summary of changes as updated, January 2018


Changes in Mpack version 2.4.2.3
• HDP 2.6.3 is supported.

Summary of changes as updated, December 2017


Changes in Mpack version 2.4.2.2
• The Mpack version 2.4.2.2 does not support migration from IOP to HDP 2.6.2. For migration, use the
Mpack version 2.4.2.1.
• From IBM Storage Scale Mpack version 2.4.2.2, new configuration parameters have been added to the
Ambari management GUI. These configuration parameters are as follows:
gpfs.workerThreads defaults to 512.
NSD threads per disk defaults to 8.
For IBM Storage Scale version 4.2.0.3 and later, gpfs.workerThreads field takes effect and
gpfs.worker1Threads field is ignored. For versions lower than 4.2.0.3, gpfs.worker1Threads
field takes effect and gpfs.workerThreads field is ignored.
Verify if the disks are already formatted as NSDs - defaults to yes
• The default values of the following parameters have changed. The new values are as follows:

gpfs.supergroup defaults to hdfs,root now instead of hadoop,root.
gpfs.syncBuffsPerIteration defaults to 100. Earlier it was 1.
Percentage of Pagepool for Prefetch defaults to 60 now. Earlier it was 20.
gpfs.maxStatCache defaults to 512 now. Earlier it was 100000.
• The default maximum log file size for IBM Storage Scale has been increased to 16 MB from 4 MB.

Summary of changes as updated, October 2017


Changes in Mpack version 2.4.2.1 and HDFS Transparency 2.7.3-1
• The GPFS Ambari integration package is now called the IBM Storage Scale Ambari management pack (in
short, management pack or MPack).
• Mpack 2.4.2.1 is the last supported version for BI 4.2.5.
• IBM Storage Scale Ambari management pack version 2.4.2.1 with HDFS Transparency version 2.7.3.1
supports BI 4.2/BI 4.2.5 IOP migration to HDP 2.6.2.
• The remote mount configuration in Ambari is supported. (For HDP only)
• Support for two IBM Storage Scale file systems/deployment models under one Hadoop cluster/Ambari
management. (For HDP only)
This allows you to have a combination of IBM Storage Scale deployment models under one Hadoop
cluster. For example, one file system with shared-nothing storage (FPO) deployment model along with
one file system with shared storage (ESS) deployment model under single Hadoop cluster.
• Metadata operation performance improvements for Ranger enabled configuration.
• Introduction of Short circuit write support for improved performance where HDFS client and Hadoop
DataNodes are running on the same node.



Chapter 1. Big data and analytics support
Analytics is defined as the discovery and communication of meaningful patterns in data. Big data
analytics is the use of advanced analytic techniques against very large, diverse data sets (structured or
unstructured) that can be processed through streaming or batch. Big data is a term applied to data sets
whose size or type is beyond the ability of traditional data processing to capture, manage, and process.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions
using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text
analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing,
businesses can analyze previously untapped data sources independently of or together with their existing
enterprise data to gain new insights, resulting in significantly better and faster decisions.
IBM Storage Scale is an enterprise class software-defined storage for high performance, large scale
workloads on-premises or in the cloud with flash, disk, tape, local, and remote storage in its storage
portfolio. IBM Storage Scale unifies data silos, including those across multiple geographies and around
the globe using Active File Management (AFM) to help ensure that data is always available in the right
place at the right time with synchronous and asynchronous disaster recovery (AFM DR).
IBM Storage Scale is used for diverse workloads across every industry to deliver performance, reliability,
and availability of data which are essential to the business.
This scale-out storage solution provides file, object and integrated data analytics for:
• Compute clusters (technical computing)
• Big data and analytics
• Hadoop Distributed File System (HDFS)
• Private cloud
• Content repositories
• Simplified data management and integrated information lifecycle management (ILM)
IBM Storage Scale enables you to build a data ocean solution to eliminate silos, improve infrastructure
utilization, and automate data migration to the best location or tier of storage anywhere in the world to
help lower latency, improve performance or cut costs. You can start small with just a few commodity
servers fronting commodity storage devices and then grow to a data lake architecture or even an ocean of
data. IBM Storage Scale is a proven solution in some of the most demanding environments with massive
storage capacity under a single global namespace. Furthermore, your data ocean can store either files or
objects and you can run analytics on the data in-place, which means that there is no need to copy the data
to run your jobs. This design provides anytime, anywhere access to data and enables files and objects to
be managed together with standardized interfaces such as POSIX, OpenStack Swift, NFS, SMB/CIFS, and
extended S3 API interfaces, delivering a true data-without-borders capability for your environments.
Decision making is a critical function in any enterprise. The decision-making process that is enhanced
by analytics can be described as consuming and collecting data, detecting relationships and patterns,
applying sophisticated analysis techniques, reporting, and automation of the follow-on action. The IT
system that supports decision making is composed of the traditional "systems of record" and “systems of
engagement” and IBM Storage Scale brings all the diverse data types of structured and unstructured data
seamlessly to create a “systems of insight” for enterprise systems.



Evolving alongside Big Data Analytics, IBM Storage Scale can improve time to insight by supporting
Hadoop and non-Hadoop application data sharing. Avoiding data replication and movement can reduce
costs, simplify workflows, and add enterprise features to business-critical data repositories. Big Data
Analytics on IBM Storage Scale can help reduce costs and increase security with data tiering, encryption,
and support across multiple geographies.



Chapter 2. IBM Storage Scale support for Hadoop
IBM Storage Scale provides integration with Hadoop applications that use the Hadoop connector.
If you plan to use a Hadoop distribution with the Hadoop connector, see the chapter that corresponds
to your Cloudera distribution (CDP Private Cloud Base) or Chapter 6, “Apache Hadoop,” on page 489
in the big data and analytics support documentation.
Different Hadoop connectors
• Second generation HDFS Transparency
– IBM Storage Scale HDFS Transparency (also known as, HDFS Protocol) offers a set of interfaces that
allows applications to use HDFS Client to access IBM Storage Scale through HDFS RPC requests.
HDFS Transparency implementation integrates both the NameNode and the DataNode services and
responds to the request as if it were HDFS.
• First generation Hadoop connector
– The IBM Storage Scale Hadoop connector implements Hadoop file system APIs and the FileContext
class so that it can access the IBM Storage Scale.

Overview
All data transmission and metadata operations in HDFS are through the RPC mechanism and processed
by the NameNode and the DataNode services within HDFS.
IBM Storage Scale HDFS protocol implementation integrates both the NameNode and the DataNode
services and responds to the request as if it were HDFS. Advantages of the HDFS transparency are as
follows:
• HDFS-compliant APIs and shell-interface commands.
• Application client isolation from storage. Application clients can access data in the IBM Storage Scale
file system without a GPFS client installed.
• Improved security management through Kerberos authentication and encryption for RPCs.
• Simplified file system monitoring through Hadoop Metrics2 integration.

In the following sections, a DFS client is a node with the HDFS client package installed. A Hadoop node is
a node with any Hadoop-based components installed (such as Hive, HBase, Pig, and Ranger). A Hadoop
service is a Hadoop-based application or component. An HDFS Transparency node is a node that runs the
HDFS Transparency NameNode or DataNode.



Integration of Cluster Export Services (CES) protocol and deployment toolkit with HDFS Transparency are
supported starting with HDFS Transparency 3.1.1 and IBM Storage Scale 5.0.4.2. For more information,
see “HDFS Transparency overview” on page 10.
For information about downloading the HDFS Transparency package, see “HDFS Transparency download”
on page 28.

Hadoop IBM Storage Scale Architecture


IBM Storage Scale allows Hadoop applications to access centralized storage or local storage data. All
Hadoop nodes can access the storage as a GPFS™ client. You can share a cluster between Hadoop and any
other application.
IBM Storage Scale has the following supported storage modes that Hadoop can access:
• Centralized Storage Mode:
– IBM Elastic Storage® Server
– IBM Erasure Code Edition
– Shared Storage (SAN-based storage)
• Local Storage Mode:
– File Placement Optimizer

Elastic Storage Server


IBM Elastic Storage Server is an optimized disk storage solution that is bundled with IBM hardware
and innovative IBM Storage Scale RAID technology (based on erasure coding). It protects against
hardware failure without relying on data replication and offers better storage efficiency than local
storage.
It performs fast background disk rebuilds in minutes without affecting application performance. HDFS
Transparency (2.7.0-1 and later) allows the Hadoop or Spark applications to access the data stored in
IBM Elastic Storage Server, as illustrated in the following figure:

Figure 1. HDFS Transparency for IBM Elastic Storage Server

For more information, see Elastic Storage Server documentation.



Erasure Code Edition
IBM Storage Scale Erasure Code Edition (ECE) provides IBM Storage Scale RAID as software and it allows
customers to create IBM Storage Scale clusters that use scale-out storage on any hardware that meets
the minimum hardware requirements.
All the benefits of IBM Storage Scale and IBM Storage Scale RAID can be achieved by using existing
commodity hardware.
IBM Storage Scale Erasure Code Edition provides the following features:
• Reed Solomon highly fault tolerant declustered Erasure Coding that protects against individual drive
failures and node failures.
• Disk Hospital to identify issues before they become disasters.
• End-to-end checksum to identify and correct errors that are introduced by network, media, or both.
• Fast background disk rebuilds in minutes without affecting application performance.
HDFS Transparency version 3.1.0-3 and later allows the Hadoop or Spark applications to access the data
stored in IBM ECE.
Note: The ECE storage must be configured as shared storage to be used in the Hadoop environment.

Figure 2. HDFS Transparency over ECE as shared storage mode

For more information, see the IBM Storage Scale Erasure Code Edition guide.

Shared Storage (SAN-based storage)


HDFS Transparency (2.7.0-1 or later) allows Hadoop and Spark applications to access data stored in
shared storage mode, such as IBM Storwize® V7000.
This is illustrated in the following figure:



Figure 3. HDFS Transparency over IBM Storage Scale NSD for shared storage

File Placement Optimizer (FPO)


HDFS transparency allows big data applications to access IBM Storage Scale local storage - File
Placement Optimizer (FPO) mode.
This is illustrated in the following figure:

Figure 4. HDFS Transparency over IBM Storage Scale FPO

For more information, see the File Placement Optimizer topic in the IBM Storage Scale: Administration
Guide.

Deployment model
A deployment model must be considered from two levels: IBM Storage Scale level and HDFS
Transparency level.
From IBM Storage Scale level, the following two deployment models are available:



• Remote mount mode
• Single cluster mode
From HDFS Transparency level, the following two deployment models are available:
• All Hadoop nodes as IBM Storage Scale nodes
• Limited Hadoop nodes as IBM Storage Scale nodes
Note: Hadoop services or HDFS Transparency cannot be colocated with the ESS EMS, I/O nodes, or ECE
nodes.

Model 1: Remote mount with all Hadoop nodes as IBM Storage Scale nodes
Use this model if you are using IBM Elastic Storage Server and you have a small Hadoop cluster, typically
fewer than 50 Hadoop nodes.
This is illustrated in the following figure:

Figure 5. Remote mount with all Hadoop nodes as IBM Storage Scale nodes

This model consists of two IBM Storage Scale clusters. Configure Hadoop on the IBM Storage Scale HDFS
Transparency nodes. The Hadoop and HDFS Transparency nodes make up one IBM Storage Scale cluster,
and this local cluster acts as the IBM Storage Scale clients to the IBM Elastic Storage Server when
remote mount is configured. The IBM Elastic Storage Server is the second IBM Storage Scale cluster. All
the Hadoop and Spark services run on the IBM Storage Scale Hadoop local cluster.
With this model, one IBM Elastic Storage Server can be shared by different groups, and the remote mount
mode isolates the storage management from the IBM Storage Scale local cluster. Some operations from the
local cluster (for example, mmshutdown -a) do not impact the storage-side IBM Storage Scale cluster. In
addition, you can enable Hadoop short-circuit read/write to gain better I/O performance for Hadoop and
Spark jobs.

Model 2: Remote mount with limited Hadoop nodes as IBM Storage Scale nodes
Use this model if you are using IBM Elastic Storage Server and you have a large Hadoop cluster, typically
more than 1000 Hadoop nodes.
This is illustrated by the following figure:



Figure 6. Remote mount with limited Hadoop nodes as IBM Storage Scale nodes

This deployment model is used for a large number of nodes in the Hadoop cluster (for example, more than
1000 nodes). Creating a large IBM Storage Scale cluster requires careful planning and places increased
demands on the network. The deployment model in Figure 6 on page 8 limits the IBM Storage Scale
deployment to just the nodes that are running the HDFS Transparency service rather than the entire
Hadoop cluster. The data traffic flows from the Hadoop nodes, over network RPC, to the HDFS Transparency
nodes and IBM Storage Scale clients, and then over network RPC to the IBM Storage Scale NSD servers and
the SAN storage. A short-circuit read/write configuration does not improve the data reading performance
in this model.

Model 3: Single cluster with all Hadoop nodes as IBM Storage Scale nodes
Use this model if you are using IBM Storage Scale FPO.
This is illustrated in the following figure:

Figure 7. Single cluster with all Hadoop nodes as IBM Storage Scale nodes



In this deployment model, Hadoop/Spark jobs can leverage the data locality from IBM Storage Scale FPO.
If you are using IBM Elastic Storage Server storage, you can consider the model that is illustrated in the
following figure:

Figure 8. Single cluster with all Hadoop nodes as IBM Storage Scale nodes (Elastic Storage Server)

If you use SAN-based storage, you can consider the model that is illustrated in the following figure:

Figure 9. Single cluster with all Hadoop nodes as IBM Storage Scale nodes (SAN storage)

Model 4: Single cluster with limited Hadoop nodes as IBM Storage Scale nodes
Use this model if you are using a SAN-based storage.
This is illustrated in the following figure:



Figure 10. Single cluster with limited Hadoop nodes as IBM Storage Scale nodes

In this deployment model, HDFS Transparency services run on IBM Storage Scale NSD servers that have a
local connection path to the SAN storage. All the other Hadoop/Spark services run on the Hadoop nodes
and use network RPC to read data from or write data into IBM Storage Scale.

Additional supported storage features


This section describes the Hadoop Storage Tiering and Multiple IBM Storage Scale file system support
features.
Hadoop Storage Tiering
A Hadoop Storage Tiering setup can run jobs on a Hadoop cluster with a native HDFS cluster and can
read and write data from IBM Storage Scale in real time. For more information, see Hadoop
Storage Tiering.
Multiple IBM Storage Scale file system support
If you use multiple IBM Storage Scale clusters and you want to access them from the local IBM
Storage Scale Hadoop cluster, see “Multiple IBM Storage Scale File System support” on page 191. If
Ambari is available, see “Configuring multiple file system mount point access” on page 408.

HDFS Transparency overview


Starting from HDFS Transparency 3.1.1 and IBM Storage Scale 5.0.4.2, HDFS Transparency is integrated
with the IBM Storage Scale installation toolkit and the Cluster Export Services (CES) protocol.
The installation toolkit automates the steps that are required to install GPFS, deploy protocols, and install
updates and patches. CES provides highly available file and object services to a GPFS cluster like NFS,
Object and SMB protocol support.
With the CES HDFS integration, the installation toolkit can now install HDFS Transparency as part of the
CES protocol stack. The CES interface can now control and configure HDFS Transparency using the same
interfaces as with the other protocols.
With the integration of HDFS into CES protocol, the use of the protocol server function requires extra
licenses that need to be accepted.



For more information about the installation toolkit and CES protocol, see the Overview of the installation
toolkit and Protocols support overview: Integration of protocol access methods with GPFS topic in the IBM
Storage Scale: Concepts, Planning, and Installation Guide.

CES HDFS integration


• The installation toolkit can install and configure NameNodes and DataNodes.
• CES configures and manages only the NameNodes. A CES IP will be assigned for every CES HDFS
cluster.
• Multiple HDFS clusters can be supported on the same IBM Storage Scale cluster.
• Each HDFS cluster must have its own CES group and cluster name, where the CES group is the cluster
name prefixed with "hdfs".
• CES HDFS NameNode failover does not use ZKFailoverController because CES elects a new node to host
the CES IP by using its own failover mechanism. HDFS clients always talk to the same CES IP, so
NameNode failover happens transparently. The Hadoop clients must be configured so that they know about
only one NameNode in order to work properly with the CES HDFS protocol failover functionality.
• The CES HDFS protocol is installed only if it is enabled.
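For example, after the HDFS protocol is enabled, you can check which protocol services are enabled on
the CES nodes and which nodes belong to the CES cluster. The following commands are illustrative; the
output depends on your environment:

# /usr/lpp/mmfs/bin/mmces service list -a
# /usr/lpp/mmfs/bin/mmces node list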

Planning
Learn about Hadoop distributions supported on IBM Storage Scale and aspects to consider while planning
your integration with HDFS Transparency.

Hadoop cluster planning


In a Hadoop cluster that runs the HDFS protocol, a node can be a DFS Client, a NameNode, or a
DataNode, or all of them. The Hadoop cluster might contain nodes that are all part of an IBM Storage
Scale cluster or where only some of the nodes belong to the IBM Storage Scale cluster.

NameNode
You can specify a single NameNode or multiple NameNodes to protect against a single point of failure in
the cluster. For more information, see “High availability configuration” on page 193. The NameNode must
be a part of an IBM Storage Scale cluster and must have a robust configuration to reduce the chances of a
single-node failure. The NameNode is defined by setting the fs.defaultFS parameter to the hostname
of the NameNode in the core-site.xml file.
Note: The Secondary NameNode in native HDFS is not needed for HDFS Transparency because the HDFS
Transparency NameNode is stateless and does not maintain an FSImage like state information.
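For example, on a node that has the Hadoop client configuration in place and the hdfs command on the
PATH, you can confirm which NameNode address the clients resolve. The hostname and port shown are
placeholders:

# hdfs getconf -confKey fs.defaultFS
hdfs://nn-host1:8020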

DataNode
You can specify multiple DataNodes in a cluster. The DataNodes must be a part of an IBM Storage Scale
cluster. The DataNodes are specified by listing their hostnames in the workers configuration file.
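As a minimal sketch, a workers file simply lists one DataNode hostname per line. The directory shown is
the HDFS Transparency configuration directory referenced later in this chapter, and the hostnames are
placeholders:

# cat /var/mmfs/hadoop/etc/hadoop/workers
dn1.example.com
dn2.example.com
dn3.example.com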

DFS Client
The DFS Client can be a part of an IBM Storage Scale cluster. When the DFS Client is a part of an IBM
Storage Scale cluster, it can read data from IBM Storage Scale through an RPC or use the short-circuit
mode. Otherwise, the DFS Client can access data from IBM Storage Scale only through an RPC. You can
specify the NameNode address in DFS Client configuration so that DFS Client can communicate with the
appropriate NameNode service.
In a production cluster, it is recommended to configure NameNode HA: one active NameNode and one
standby NameNode. The active NameNode and the standby NameNode must be located on two different nodes.
For a small test or POC cluster, such as a 2-node or 3-node cluster, you can configure one node as both
NameNode and DataNode. However, in a production cluster, it is not recommended to configure the same
node as both NameNode and DataNode.
The purpose of cluster planning is to define the node roles: Hadoop node, HDFS transparency node, and
GPFS node.

License planning
HDFS Transparency does not require an additional license. If you have an IBM Storage Scale license, you
can get the HDFS Transparency package from the IBM Storage Scale package, or see the HDFS Transparency
download section.
Any IBM Storage Scale license works with HDFS Transparency. However, consider IBM Storage Scale Standard
Edition, Advanced Edition, or Data Management Edition so that you can leverage advanced enterprise
features (such as IBM Storage Scale storage pools, filesets, encryption, or AFM) to power your Hadoop
data platform.
First go through the “Hadoop IBM Storage Scale Architecture” on page 4 section, select the mode that
you are planning to use, and then refer to the license requirements in the following table:

Table 3. IBM Storage Scale License requirement

Storage Category: IBM Storage Scale FPO
Deployment Mode: Illustrated in Figure 7 on page 8
License requirement:
• 3+ IBM Storage Scale server licenses for manager/quorum (3 quorum nodes tolerate 1 quorum node
  failure. If you want higher quorum node failure tolerance, you need to configure more quorum
  nodes/licenses, maximally up to 8 quorum nodes in one cluster).
• All other nodes take the IBM Storage Scale FPO license.
Note: If you purchase an IBM Storage Scale capacity license, you do not need to purchase the additional
licenses mentioned above.

Storage Category: IBM Storage Scale + SAN storage
Deployment Mode: Illustrated in Figure 9 on page 9
License requirement:
• All NSD servers must have an IBM Storage Scale server license.
• At least 2 NSD servers are required for HDFS Transparency (1 NameNode and 1 DataNode); it is
  recommended to take 4+ NSD servers for HDFS Transparency (1 active NameNode, 1 standby NameNode,
  2 DataNodes).
Note: If you purchase an IBM Storage Scale capacity license, you do not need to purchase the additional
licenses mentioned above.

Storage Category: IBM Storage Scale + SAN storage
Deployment Mode: Illustrated in Figure 8 on page 9 (configure one IBM Storage Scale cluster)
License requirement:
• 2+ IBM Storage Scale server licenses for quorum/NSD servers with tiebreaker disks for quorum. If you
  want to tolerate more quorum node failures, configure more IBM Storage Scale NSD servers/quorum nodes.
  – All HDFS Transparency nodes should take an IBM Storage Scale server license under this
    configuration.
Note: If you purchase an IBM Storage Scale capacity license, you do not need to purchase the additional
licenses mentioned above.

Storage Category: IBM Storage Scale + SAN storage
Deployment Mode: Illustrated in Figure 6 on page 8 (configure IBM Storage Scale Multi-cluster)
License requirement:
• For the home IBM Storage Scale cluster (NSD server cluster), 2+ NSD servers (IBM Storage Scale server
  license) configured with tiebreaker disks for quorum. If you want to tolerate more quorum node
  failures, configure more IBM Storage Scale NSD servers/quorum nodes.
• For the local IBM Storage Scale cluster (all as IBM Storage Scale clients), 3+ IBM Storage Scale
  server licenses for quorum/manager (configure more IBM Storage Scale server license nodes to tolerate
  more quorum node failures). All HDFS Transparency nodes take an IBM Storage Scale server license;
  other nodes can take an IBM Storage Scale client license.
Note: If you purchase an IBM Storage Scale capacity license, you do not need to purchase the additional
licenses mentioned above.

Storage Category: IBM Storage Scale + SAN storage
Deployment Mode: Illustrated in Figure 5 on page 7 (configure IBM Storage Scale Multi-cluster)
License requirement:
• For the home IBM Storage Scale cluster (NSD server cluster), 2+ NSD servers (IBM Storage Scale server
  license) configured with tiebreaker disks for quorum. If you want to tolerate more quorum node
  failures, configure more IBM Storage Scale NSD servers/quorum nodes.
• For the local IBM Storage Scale cluster (all as IBM Storage Scale clients), 3+ IBM Storage Scale
  server licenses for quorum/manager are required (configure more IBM Storage Scale server license
  nodes to tolerate more quorum node failures). All other HDFS Transparency nodes need an IBM Storage
  Scale server license.
Note: If you purchase an IBM Storage Scale capacity license, you do not need to purchase the additional
licenses mentioned above.

Storage Category: IBM ESS
Deployment Mode: Illustrated in Figure 8 on page 9 (configure one IBM Storage Scale cluster)
License requirement:
• If you take the ESS nodes as the quorum nodes, you do not need to purchase new IBM Storage Scale
  licenses. Note: Purchasing ESS gives you the license rights to use the nodes as quorum.
• All other nodes take an IBM Storage Scale server license.
Note: If you purchase IBM ESS with an IBM Storage Scale capacity license, you do not need to purchase
the additional licenses mentioned above.

Storage Category: IBM ESS
Deployment Mode: Illustrated in Figure 5 on page 7 (configure IBM Storage Scale Multi-cluster)
License requirement:
• Create the ESS nodes as the home cluster (you do not need to purchase new IBM Storage Scale licenses
  after you purchase IBM ESS).
• For the local IBM Storage Scale cluster (all as IBM Storage Scale clients), 3+ IBM Storage Scale
  server licenses for quorum/manager (configure more IBM Storage Scale server license nodes to tolerate
  more quorum node failures); all other nodes take an IBM Storage Scale server license.
Note: If you purchase IBM ESS with an IBM Storage Scale capacity license, you do not need to purchase
the additional licenses mentioned above.

Storage Category: IBM ESS
Deployment Mode: Illustrated in Figure 6 on page 8 (configure IBM Storage Scale Multi-cluster)
License requirement:
• Create the ESS nodes as the home cluster (you do not need to purchase new IBM Storage Scale licenses
  after you purchase IBM ESS).
• For the local IBM Storage Scale cluster (all as IBM Storage Scale clients), 3+ IBM Storage Scale
  server licenses for quorum/manager (configure more IBM Storage Scale server license nodes to tolerate
  more quorum node failures); all other HDFS Transparency nodes take an IBM Storage Scale server
  license.
Note: If you purchase IBM ESS with an IBM Storage Scale capacity license, you do not need to purchase
the additional licenses mentioned above.

Note: If you plan to configure IBM Storage Scale protocol, you need to configure IBM Storage Scale
services over nodes with IBM Storage Scale server license but no NSD disks in the file system. If you
purchase IBM Storage Scale capacity license, you do not need to purchase additional licenses for IBM
Storage Scale Protocol nodes.



Node roles planning
This section describes the node roles planning in FPO mode and shared storage mode and the integration
with various hadoop distributions.

Node roles planning in FPO mode


In the FPO mode, all nodes are IBM Storage Scale nodes, Hadoop nodes, and HDFS Transparency nodes.

In this figure, one node is selected as the HDFS Transparency NameNode. All the other nodes are
HDFS Transparency DataNodes. Also, the HDFS Transparency NameNode can be an HDFS Transparency
DataNode. Any one node can be selected as HDFS Transparency HA NameNode. The administrator must
ensure that the primary HDFS Transparency NameNode and the standby HDFS Transparency NameNode
are not the same node.
In this mode, Hadoop cluster must be larger than or equal to the HDFS transparency cluster.
Note: The Hadoop cluster might be smaller than HDFS transparency cluster but this configuration is not
typical and not recommended. Also, the HDFS transparency cluster must be smaller than or equal to IBM
Storage Scale cluster because the HDFS transparency must read and write data to the local mounted
file system. Usually, in the FPO mode, the HDFS transparency cluster is equal to the IBM Storage Scale
cluster.
Note: Some nodes in the IBM Storage Scale (GPFS) FPO cluster might be GPFS clients without any disks
in the file system.

The shared storage mode or IBM ESS


Among these nodes, you need to define at least one NameNode and one DataNode. If NameNode HA is
configured, you need at least two nodes for NameNode HA and one DataNode.
In production, you need at least two DataNodes to tolerate one DataNode failure if your file system uses
a data replica count of 1. If your file system uses a data replica count of 2 (for example, IBM Storage
Scale over shared storage), you need at least three DataNodes to tolerate one DataNode failure.
After HDFS transparency nodes are selected, see “Installing” on page 29 and “Configuring” on page 52
to configure HDFS Transparency on these nodes.

Integration with Hadoop distributions


If you deploy HDFS Transparency with a Hadoop distribution, such as IBM BigInsights® IOP or
HortonWorks HDP, configure the native HDFS NameNode as the HDFS Transparency NameNode and
configure the native HDFS DataNodes as HDFS Transparency DataNodes. Add these nodes into the IBM
Storage Scale cluster. This setup results in fewer configuration changes. Therefore, before installing
the Hadoop distribution, you need to plan which nodes will be NameNodes and which nodes will be
DataNodes.
If the HDFS Transparency NameNode is not the same as the native HDFS NameNode, some services
might fail to start and can require additional configuration changes.

Hardware and software requirements

Hardware & OS matrix support


In addition to the normal operating system, IBM Storage Scale, and Hadoop requirements, the
Transparency connector has minimum hardware requirements of 1 CPU (processor core) and 4 GB to
8 GB physical memory on each node where it is running. This is stated as a general guideline and actual
configuration may vary.
For information about Hadoop distribution support, see “Hadoop distribution support” on page 24.

Recommended hardware resource configuration


10Gb Ethernet network is the minimum recommended configuration for Hadoop nodes. Higher speed
networks, such as 25Gb/40Gb/100Gb/InfiniBand, can provide overall better performance. Hadoop nodes
should have a minimum of 100GB memory and at least four physical cores. If Hadoop services are
running with the same nodes as the HDFS Transparency service, a minimum of 8 physical cores is
recommended. If an IBM Storage Scale FPO deployment pattern is used, 10-20 internal SAS/SATA disks
per node are recommended.
In a production cluster, the minimal number of nodes for HDFS Transparency is 3: the first node as the
active NameNode, the second node as the standby NameNode, and the third node as a DataNode. In a
testing cluster, one node is sufficient for the HDFS Transparency cluster, and that node can be
configured as both NameNode and DataNode.
HDFS Transparency is a lightweight daemon, and one modern logical processor is usually sufficient (for
example, a 4-core or 8-core CPU with a 2+ GHz frequency).
For memory requirements, see the following tables:

Table 4. For HDFS Transparency 3.1.1-8 or earlier, and 3.3.0-0 and later
Ranger Support HDFS Transparency NameNode HDFS Transparency DataNode
Ranger support is off [1] 2GB or 4GB 2GB
Ranger support is on (by default) Depends on the file number that the Hadoop 2GB
                                  applications will access [2]:
                                  1024 bytes * inode number

Table 5. For HDFS Transparency 3.1.1-9 and later, and 3.2.2-0 and later
HDFS Transparency NameNode HDFS Transparency DataNode
Depends on the file number that the Hadoop 2GB
applications will access: 700 bytes * inode number.

Note: The file number means the total inode number under /gpfs.mnt.dir/gpfs.data.dir
(refer /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 2.7.3-x)
or /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 3.0.x)).
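As a worked example of the sizing rules in the tables above, assume 10 million inodes under
gpfs.mnt.dir/gpfs.data.dir:
• HDFS Transparency 3.1.1-9 and later, and 3.2.2-0 and later: 10,000,000 inodes * 700 bytes is
  approximately 7 GB of NameNode memory.
• HDFS Transparency 3.1.1-8 or earlier, and 3.3.0-0 and later, with Ranger support on: 10,000,000
  inodes * 1024 bytes is approximately 10 GB of NameNode memory.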
For SAN-based storage or IBM ESS, the number of Hadoop nodes required for scaling depends on the
workload type. If the workload is I/O sensitive, you can calculate the Hadoop node number according to
the bandwidth of the ESS head nodes and the bandwidth of each Hadoop node. For example, if the network
bandwidth from your ESS head nodes is 100 Gb and each Hadoop node is configured with a 10 Gb network,
then for I/O-sensitive workloads, 10 Hadoop nodes (100 Gb / 10 Gb) will drive the full network bandwidth
of your ESS head nodes. Considering that most Hadoop workloads are not pure I/O reading/writing
workloads, you can take 10-15 Hadoop nodes in this configuration.

IBM ECE minimum hardware requirements


At a high level, between 4 and 32 storage servers per recovery group (RG) are required, and each server
must be an x86_64 server running Red Hat® Enterprise Linux version 7.5 or 7.6. The storage configuration
must be identical for all the storage servers. The supported storage types are SAS-attached HDD or SSD
drives using specified LSI adapters, or enterprise-class NVMe drives. Each storage server must have at
least one SSD or NVMe drive, which is used as a fast write cache as well as for user data storage.
For more information about hardware requirement, see ECE Minimum hardware requirements in the IBM
Storage Scale Erasure Code Edition guide.

IBM Storage Scale software requirements


This section describes the software requirements for HDFS Transparency on IBM Storage Scale.
Ensure that the required packages needed by GPFS are installed on all the HDFS Transparency nodes. For
more information, see the Software Requirements topic in the IBM Storage Scale: Concepts, Planning, and
Installation Guide.

Kernel
IBM Storage Scale requires the Kernel packages.
Installation of Kernel packages
1. On all the IBM Storage Scale nodes, confirm that the output of rpm -qa |grep kernel includes the
following:
• kernel-headers
• kernel-devel
• kernel
If any of the kernel RPM is missing, install it. If the kernel packages do not exist, run the following yum
install command:

yum -y install kernel kernel-headers kernel-devel

2. Check the installed kernel rpms. Unlike HDFS, IBM Storage Scale is a kernel-level file system that
integrates with the operating system. This is a critical dependency. Ensure that the environment has
the matching kernel, kernel-devel, and kernel-headers. The following example uses RHEL 7.4:

[root@c902f05x01 ~]# uname -r


3.10.0-693.11.6.el7.x86_64 <== Find kernel-devel and kernel-headers to match this

[root@c902f05x01 ~]# rpm -qa | grep kernel


kernel-tools-3.10.0-693.el7.x86_64
kernel-headers-3.10.0-693.11.6.el7.x86_64 <== kernel-headers matches
kernel-tools-libs-3.10.0-693.el7.x86_64
kernel-debuginfo-3.10.0-693.11.6.el7.x86_64
kernel-devel-3.10.0-693.11.6.el7.x86_64 <== kernel-devel matches
kernel-3.10.0-693.el7.x86_64
kernel-debuginfo-common-x86_64-3.10.0-693.11.6.el7.x86_64
kernel-3.10.0-693.11.6.el7.x86_64
kernel-devel-3.10.0-693.el7.x86_64
[root@c902f05x01 ~]#

Warning: Kernels are updated after the original operating system installation. Ensure that the
active kernel version matches the installed version of both kernel-devel and kernel-headers.
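If the active kernel and the kernel-devel or kernel-headers packages are out of sync, one way to align
them is to install the packages that match the running kernel explicitly. The following command is a
hedged example; package availability depends on your subscribed repositories:

# yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)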

SELinux
This topic gives information about SELinux.
If you are using HDFS Transparency, from IBM Storage Scale 5.0.5, SELinux is supported in permissive or
enforcing mode on Red Hat Enterprise Linux.
If you are using Hortonworks Data Platform (HDP) with any IBM Storage Scale release, SELinux must be
disabled.
If you are using Cloudera Data Platform (CDP) Private Cloud Base from IBM Storage Scale 5.1, SELinux is
supported in permissive or enforcing mode on Red Hat Enterprise Linux.
For more information, see:
• Security-Enhanced Linux support section in the IBM Storage Scale: Concepts, Planning, and Installation
Guide.
• Cloudera HDP Disable SELinux and PackageKit and check the umask Value documentation.
• Cloudera CDP Private Cloud Base Setting SELinux Mode documentation.

NTP
This topic gives information about Network Time Protocol (NTP).
Configure NTP on all the nodes in your system to ensure that the clocks of all the nodes are synchronized.
On Red Hat Enterprise Linux nodes

# yum install -y ntp
# ntpdate <NTP_server_IP>
# systemctl enable ntpd
# systemctl start ntpd
# timedatectl list-timezones
# timedatectl set-timezone <time_zone>
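To confirm that the node is synchronizing with the NTP server, you can check the peer status. The
following commands are examples; the output depends on your environment:

# ntpq -p
# timedatectl status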

Firewall recommendations for HDFS Transparency


Firewalls that are associated with open systems are specific to deployments and operating systems, and
they vary from customer to customer. It is the responsibility of the system administrator or Lab Service
(LBS) to set up the firewall accordingly, similar to what Linux distributions do presently. For
information on the IBM Storage Scale firewall, see the IBM Storage Scale system using firewall section
in the IBM Storage Scale: Administration Guide.
This section describes only the recommendations for HDFS Transparency firewall settings.

Table 6. Recommended port number settings for HDFS Transparency


HDFS Transparency Property Port Number Comments
dfs.namenode.rpc-address nn-host1:8020 RPC address that handles all client
requests. In the case of HA/Federation where multiple NameNodes exist, the
name service ID is added to the name, for example, dfs.namenode.rpc-address.ns1
or dfs.namenode.rpc-address.EXAMPLENAMESERVICE. The value of this property
takes the form nn-host1:rpc-port. The NameNode's default RPC port is 8020.

dfs.namenode.http-address 0.0.0.0:9870 The address and the base port on which
the DFS NameNode web UI listens.

dfs.datanode.address 0.0.0.0:9866 The DataNode server address
and port for data transfer.
dfs.datanode.http.address 0.0.0.0:9864 The DataNode HTTP server
address and port.
dfs.datanode.ipc.address 0.0.0.0:9867 The DataNode IPC server
address and port.

Setting the firewall policies for HDFS Transparency


1. Run the firewall-cmd to add and reload the recommended ports.
On each of the HDFS Transparency NameNodes, set the NameNode server port.
The following example uses 8020:

# firewall-cmd --add-port=8020/tcp --permanent

On each of the HDFS Transparency NameNodes, set the NameNode webui port:

# firewall-cmd --add-port=9870/tcp --permanent

On each of the HDFS Transparency DataNodes, set the following ports:

# firewall-cmd --add-port=9864/tcp --permanent


# firewall-cmd --add-port=9866/tcp --permanent
# firewall-cmd --add-port=9867/tcp --permanent

For all HDFS Transparency that ran --add-port, run reload and check the ports:

# firewall-cmd --reload
# firewall-cmd --zone=public --list-ports

For example:

[root@c8f2n01 webhdfs]# firewall-cmd --zone=public --list-ports


1191/tcp 60000-61000/tcp 8020/tcp 9870/tcp 9864/tcp 9866/tcp 9867/tcp

2. For the changes to reflect, restart HDFS Transparency.


If HDFS Transparency is running, find the standby NameNode and restart the services followed by a
failover.
a. Get the standby NameNode.

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

For example:

[root@c8f2n01 webhdfs]# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState


c8f2n01:8020 active
c8f2n05:8020 standby

b. Restart the Standby NameNode (for example, on c8f2n05).


For HDFS Transparency 3.1.0 or earlier, run the following command:

# /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector restart

For HDFS Transparency 3.1.1 or later, run the following command:



# /usr/lpp/mmfs/bin/mmces service stop HDFS
# /usr/lpp/mmfs/bin/mmces service start HDFS

c. Transition standby to active NameNode.


For example: nn1 is c8f2n01 and nn2 is c8f2n05.
For HDFS Transparency 3.1.0 and earlier, run the following command:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -transitionToActive nn2


# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

For HDFS Transparency 3.1.1 and later, run the following command:

# /usr/lpp/mmfs/bin/mmces address move --ces-ip x.x.x.x --ces-node nn2


# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

d. The original NameNode is now the standby NameNode.


Restart the new Standby NameNode (for example, c8f2n01).
For HDFS Transparency 3.1.0 and earlier, run the following command:

# /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector restart

For HDFS Transparency 3.1.1 and later, run the following command:

# /usr/lpp/mmfs/bin/mmces service stop HDFS


# /usr/lpp/mmfs/bin/mmces service start HDFS

e. You can now transition back to the original NameNode by running the following command:
For HDFS Transparency 3.1.0 and earlier, run the following command:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -transitionToActive nn1


# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

For HDFS Transparency 3.1.1 and later, run the following command:

# /usr/lpp/mmfs/bin/mmces address move --ces-ip x.x.x.x --ces-node nn1


# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

3. Restart all Hadoop services on all the nodes.


For example, on node with Yarn service:

/opt/hadoop-3.1.3/sbin/stop-yarn.sh
/opt/hadoop-3.1.3/sbin/start-yarn.sh

Hadoop service roles


In a Hadoop ecosystem, there are a lot of different roles for different components. For example, HBase
Master Server, Yarn Resource Manager and Yarn Node Manager.
You need to plan to distribute these master roles over different nodes as evenly as possible. If you put all
these master roles onto a single node, memory might become an issue.
When running Hadoop over IBM Storage Scale, it is recommended that up to 25% of the physical memory
is reserved for GPFS pagepool with a maximum of 20 GB. If HBase is being used, it is recommended that
up to 30% of the physical memory be reserved for the GPFS pagepool. If the node has less than 100
GB of physical memory, then the heap size for Hadoop Master services needs to be carefully planned. If
HDFS transparency NameNode service and HBase Master service are resident on the same physical node,
HBase workload stress may result in Out of Memory (OOM) exceptions.
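As a sketch of the pagepool guidance above, on Hadoop nodes with 64 GB of physical memory, 25% is 16 GB,
and nodes with more memory are capped at the 20 GB maximum. The pagepool can be set with mmchconfig; the
node list below is a placeholder, and the new value takes effect after GPFS is restarted on those nodes
unless you apply it immediately with the options described in the IBM Storage Scale documentation:

# /usr/lpp/mmfs/bin/mmchconfig pagepool=16G -N <hadoop_node_list>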



Dual network interfaces
This section explains the FPO mode and the IBM ESS or SAN-based storage mode.

FPO mode
If the FPO cluster has a dual 10 Gb network, you have the following two configuration options:
• The first option is to bind the two network interfaces and deploy the IBM Storage Scale cluster and the
Hadoop cluster over the bonded interface.
• The second option is to configure one network interface for the Hadoop services including the HDFS
transparency service and configure the other network interface for IBM Storage Scale to use for data
traffic. This configuration can minimize interference between disk I/O and application communication.
To ensure that the Hadoop applications use data locality for better performance, perform the following
steps:
1. Configure the first network interface with one subnet address (for example, 192.0.2.0). Configure the
second network interface as another subnet address (for example, 192.0.2.1).
2. Create the IBM Storage Scale cluster and NSDs with the IP or hostname from the first network
interface.
3. Install the Hadoop cluster and HDFS transparency services by using IP addresses or hostnames
from the first network interface.
4. Run mmchconfig subnets=192.0.2.1 -N all.
Note: 192.0.2.1 is the subnet used for IBM Storage Scale data traffic.
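To verify the subnets setting from step 4, you can list the active configuration. The output shown is
illustrative:

# /usr/lpp/mmfs/bin/mmlsconfig | grep subnets
subnets 192.0.2.1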
For Hadoop map/reduce jobs, the scheduler Yarn checks the block location. HDFS Transparency returns
the hostname that is used to create the IBM Storage Scale cluster, as block location to Yarn. If the
hostname is not found within the NodeManager list, Yarn cannot schedule the tasks according to the data
locality. The suggested configuration can ensure that the hostname for block location can be found in
Yarn's NodeManager list and therefore it can schedule the task according to the data locality.
For a Hadoop distribution like IBM BigInsights IOP, all Hadoop components are managed by Ambari™. In
this scenario, all Hadoop components, HDFS transparency and IBM Storage Scale cluster must be created
using one network interface. The second network interface must be used for GPFS data traffic.

Centralized Storage Modes (ESS, ECE, SAN-based)


For Centralized Storage, you have two configuration options:
• The first option is to configure the two adapters as bond adapter and then, deploy HortonWorks HDP
and IBM Storage Scale over the bond adapters.
• The second option is to configure one adapter for IBM Storage Scale cluster and HortonWorks HDP
and configure another adapter as subnets of IBM Storage Scale. For more information on subnets, see
GPFS and network communication in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
Perform the following steps:
1. Configure the first network interface with one subnet address (for example, 192.0.2.0). Configure the
second network interface as another subnet address (for example, 192.0.2.1).
2. Create the IBM Storage Scale cluster with the IP or hostname from the first network interface.
3. Install the Hadoop cluster and HDFS transparency services by using the IP addresses or hostnames
from the first network interface.
4. Run mmchconfig subnets=192.0.2.1 -N all.
Note: 192.0.2.1 is the subnet used for IBM Storage Scale data traffic.



Setting up local repository
Mirror repository server
IBM Storage Scale requires a local repository. Therefore, select a server to act as the mirror repository
server. This server requires the installation of the Apache HTTP server or a similar HTTP server.
Every node in the Hadoop cluster must be able to access this repository server. This mirror server can be
defined in the DNS, or you can add an entry for the mirror server in /etc/hosts on each node of the
cluster.
• Create an HTTP server on the mirror repository server, such as Apache httpd. If the Apache httpd is not
already installed, install it with the yum install httpd command. You can start the Apache httpd by
running one of the following commands:
– apachectl start
– service httpd start
• [Optional]: Ensure that the http server starts automatically on reboot by running the following
command:
– chkconfig httpd on
• Ensure that the firewall settings allow inbound HTTP access from the cluster nodes to the mirror web
server.
• On the mirror repository server, create a directory for your repositories, such as <document root>/
repos. For Apache httpd with document root /var/www/html, type the following command:
– mkdir -p /var/www/html/repos
• Test your local repository by browsing the web directory:
– http://<yum-server>/repos
For example:

# rpm -qa | grep httpd


# service httpd start
# service httpd status
Active: active (running) <== Check to ensure that the service is active
# systemctl enable httpd

Local OS repository
You must create the operating system repository because some of the IBM Storage Scale files, such as
rpms have dependencies on all nodes.
1. Create the repository path:

mkdir /var/www/html/repos/<rhel_OSlevel>

2. Synchronize the local directory with the current yum repository:

cd /var/www/html/repos/<rhel_OSlevel>

Note: Before going to the next step, ensure that you have registered your system. For instructions
to register a system, refer to Get Started with Red Hat Subscription Manager. Once the server is
subscribed, run the following command: subscription-manager repos --enable=<repo_id>
3. Run the following command:

reposync --gpgcheck -l --repoid=rhel-7-server-rpms --download_path=/var/www/html/repos/<rhel_OSlevel>

4. Create a repository for this node:

createrepo -v /var/www/html/repos/<rhel_OSlevel>



5. Ensure that all the firewalls are disabled or that you have the httpd service port open, because yum
uses http to get the packages from the repository.
6. On all nodes in the cluster that require the repositories, create a file in /etc/yum.repos.d called
local_<rhel_OSlevel>.repo.
7. Copy this file to all nodes. The contents of this file must look like the following:

[local_rhel_version]
name=local_rhel_version
enabled=1
baseurl=http://<internal IP that all nodes can reach>/repos/<rhel_OSlevel>
gpgcheck=0

8. Run the yum repolist and yum install rpms without external connections.
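After the .repo file is in place on a node, a quick check that the local repository resolves without
external connections might look like the following; the package name is a placeholder:

# yum clean all
# yum repolist
# yum install <package_name>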

Local IBM Storage Scale repository


This section describes how to configure a local IBM Storage Scale repository for manual installation.
The following table lists the IBM Storage Scale 5.0.5 and later supported editions for the HDFS
Transparency clusters:

IBM Storage Scale Comments


Edition
Data Management See the Capacity-based licensing topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
Data Access See the Capacity-based licensing topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
Advanced Edition Legacy edition replaced by Data Management edition.
Standard Edition Legacy edition replaced by Data Access edition.

The following example uses IBM Storage Scale 5.1.2.2:


1. On the repository web server, create a directory for your IBM Storage Scale repos, such as <document
root>/repos/GPFS. For Apache httpd with document root /var/www/html, type the following
command:

mkdir -p /var/www/html/repos/GPFS/5.1.2.2

2. Obtain the IBM Storage Scale software. If you have already installed IBM Storage Scale manually, skip
this step. Download the IBM Storage Scale package. In this example, IBM Storage Scale 5.1.2.2 is
downloaded from Fix Central, the package is unzipped, and the installer is extracted.
For example, as root or a user with sudo privileges, run the installer to get the IBM Storage Scale
packages into a user-specified directory via the --dir option:

chmod +x Spectrum_Scale_Data_Management-5.1.2.2-x86_64-Linux-install
./Spectrum_Scale_Data_Management-5.1.2.2-x86_64-Linux-install --silent --dir /var/www/html/
repos/GPFS/5.1.2.2

Note: The --silent option is used to accept the software license agreement, and the --dir
option places the IBM Storage Scale rpms into the directory /var/www/html/repos/GPFS/
5.1.2.2/gpfs_rpms. Without specifying the --dir option, the default location is /usr/lpp/mmfs/
gpfs_rpms/5.1.2.2/gpfs_rpms.
3. If the packages are extracted into the IBM Storage Scale default directory, /usr/lpp/mmfs/
5.1.2.2/gpfs_rpms, copy all the IBM Storage Scale files that are required for your installation
environment into the IBM Storage Scale repository path:

cd /usr/lpp/mmfs/5.1.2.2/gpfs_rpms

cp -R * /var/www/html/repos/GPFS/5.1.2.2/gpfs_rpms



4. Copy the HDFS Transparency package to the IBM Storage Scale repo path that you want to install
manually.
Note: The repo must contain only one HDFS Transparency package. Remove all old transparency
packages.

cp gpfs.hdfs-protocol-3.1.1-(version) /var/www/html/repos/GPFS/5.1.2.2/gpfs_rpms

5. Create a yum repository:

# cd /var/www/html/repos/GPFS/5.1.2.2/gpfs_rpms
# createrepo .

6. Access the repository at http://<yum-server>/repos/GPFS/5.1.2.2/gpfs_rpms.
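On the nodes that install from this repository, you can create a .repo file similar to the local OS
repository file shown earlier. The repository ID and server address below are examples:

[local_gpfs_5.1.2.2]
name=local_gpfs_5.1.2.2
enabled=1
baseurl=http://<internal IP that all nodes can reach>/repos/GPFS/5.1.2.2/gpfs_rpms
gpgcheck=0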

Hadoop distribution support


Cloudera distributions and Open Source Apache Hadoop are the officially supported Hadoop distributions.
For more information, contact [email protected].

OS and Arch support


HDFS Transparency supports a subset of the operating systems and architecture platforms that IBM Storage
Scale supports.
IBM Storage Scale aligns with the OS vendor life cycle support statement. For RHEL, see Red Hat
Enterprise Linux Life Cycle.

Java support
HDFS Transparency requires Java™ OpenJDK 8 or OpenJDK 11.
OpenJDK 11 is supported from HDFS Transparency 3.1.1-8.

CDP Private Cloud Base support


Table 7. CDP Private Cloud Base support
CDP Private Cloud Base version HDFS Transparency version
See “Support Matrix” on page 294 3.1.1-X stream, 3.1.1-2 and later

Open Source Apache Hadoop support


• Open Source Apache Hadoop support is based on Cloudera's Hadoop supported versions. For more
information, see CDP Private Cloud Base support matrix and HDP support matrix.
• In the IBM Storage Scale 5.1.3.2 technical preview release, CES HDFS Transparency 3.2.2 is supported
for limited-time usage only on non-production clusters with Open Source Apache Hadoop 3.2.2 on RH 7.9
on x86_64.
• From IBM Storage Scale 5.1.4.0, CES HDFS Transparency 3.2.2-0 is supported for Open Source Apache
Hadoop 3.2.2 on RH 7.9 on x86_64.
• From IBM Storage Scale 5.1.1.2, CES HDFS Transparency 3.3.x-x is supported for Open Source Apache
Hadoop 3.3 on RH 7.9 on x86_64.

BigInsights IOP support


Support for IBM BigInsights is discontinued.



HDP support
Support for Cloudera HDP is discontinued.

Table 8. HDP support


HDP version HDFS Transparency version
HDP 3.1 3.1.0-X stream

HDFS Transparency planning


The recommended configuration is to configure CES HDFS (NameNodes and DataNodes) as IBM Storage Scale
client nodes that remotely mount the centralized storage.
In each of the following architecture figures, these remote mounts are represented by the ESS blue
boxes. For information about the centralized storage mode, see “Hadoop IBM Storage Scale Architecture”
on page 4.
Note: File Placement Optimizer (FPO) is not a supported storage for the CES HDFS configuration.
The Hadoop master and clients are outside of the IBM Storage Scale cluster but the IBM Storage Scale
NameNodes and DataNodes are part of the Hadoop cluster; in the following architecture diagrams, a
Hadoop cluster is represented by a Hadoop cluster green box. The installer node does not need to be a
part of the IBM Storage Scale cluster.
To add other protocols like SMB, NFS or OBJ to the cluster, ensure that the other protocol requirements
are met. Before you add these protocols, see CES HDFS Limitations and recommendations.
For more information, see the following topics in the IBM Storage Scale: Concepts, Planning, and
Installation Guide:
• Planning for GPFS
• Preparing to use the installation toolkit
• Planning for Protocols
Note:
• If you are using Cloudera CDP distribution, see “Support Matrix” on page 294, “Architecture” on page
288 and “Alternative architectures” on page 291.
• CES HDFS is not supported with the Cloudera HDP distribution.
• The NameNode cannot be colocated with the DataNode or with any other Hadoop services.

The following figures show the different architecture configuration layouts:



Figure 11. CES HDFS single HDFS configuration

Figure 12. CES HDFS multiple HDFS configuration

Figure 13. CES HDFS with other protocols configurations layout to the ESS



HDFS Transparency support matrix
HDFS Transparency and the IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency
(Toolkit for HDFS) are bundled together to work with the supported IBM Storage Scale release, as shown
in the following support matrix.

Table 9. HDFS Transparency support matrix


IBM Storage Scale version    HDFS Transparency version
                             3.1.1-x stream      3.2.2-x stream      3.3.0-x stream
5.1.9.2 3.1.1-17 3.2.2-7 N/A
5.1.9.1 3.1.1-16 3.2.2-7 N/A
5.1.9.0 3.1.1-15 3.2.2-6 N/A
5.1.8.1 3.1.1-14 3.2.2-5 3.3.0-2
5.1.7.1 - 5.1.8.0 3.1.1-13 3.2.2-5 3.3.0-2
5.1.7 3.1.1-12 3.2.2-4 3.3.0-2
5.1.6.1 3.1.1-12 3.2.2-3 3.3.0-2
5.1.6 3.1.1-11 3.2.2-3 3.3.0-2
5.1.5.1 3.1.1-10 3.2.2-2 3.3.0-2
5.1.5 3.1.1-10 3.2.2-1 3.3.0-2
5.1.4.1 3.1.1-10 3.2.2-1 3.3.0-2
5.1.4 3.1.1-9 3.2.2-1 3.3.0-1
5.1.3.2 3.1.1-8 3.2.2-0 3.3.0-1
5.1.3 - 5.1.3.1 3.1.1-8 - 3.3.0-1
5.1.2.9 3.1.1-12 - 3.3.0-2
5.1.2.6 - 5.1.2.8 3.1.1-10 - 3.3.0-2
5.1.2.2 - 5.1.2.5 3.1.1-8 - 3.3.0-1
5.1.2.1 3.1.1-7 - 3.3.0-0
5.1.2 3.1.1-6 - 3.3.0-0
5.1.1.2 - 5.1.1.4 3.1.1-5 - 3.3.0-0
5.1.1.1 3.1.1-5 - -
5.1.1.0 3.1.1-4 - -
5.1.0.1 - 5.1.0.3 3.1.1-3 - -

Open-source Apache Hadoop support is certified on HDFS, Yarn, and MapReduce components with the
following configurations:

Table 10. Open-source Apache Hadoop support matrix

Open-source Apache Hadoop version    HDFS Transparency version    OS                    Platform
3.1.3                                3.1.1-0 - 3.1.1-14           RHEL 7.9 and later    x86_64, ppc64le
3.2.2                                3.2.2-x                      RHEL 7.9, RHEL 9.x    x86_64
3.3.0                                3.3.0-x                      RHEL 7.9              x86_64

For more information about CDP Private Cloud Base support, see CDP Private cloud base support matrix.
Note:
• CES HDFS is not supported on Cloudera HDP distribution.
• Unlike previous versions of HDFS Transparency, HDFS Transparency 3.1.1-x, 3.2.2-x, and 3.3.0-x are
tightly coupled with IBM Storage Scale. You need to upgrade the IBM Storage Scale package to get the
correct supported versions for CES HDFS.
• Support for CES HDFS started from IBM Storage Scale 5.0.4.2 with HDFS Transparency 3.1.1-0 and
Toolkit for HDFS 1.0.0.0.
• Support for CDP Private Cloud Base with CES HDFS started from IBM Storage Scale 5.1.0 with HDFS
Transparency 3.1.1-2 package with bda_integration toolkit version 1.0.2.0.
• If the OS is not supported for a specific IBM Storage Scale release, then it is also not supported for
HDFS Transparency. For more information, see “OS and Arch support” on page 24.
• Unlike previous versions of HDFS Transparency 3.1.1-x, HDFS Transparency 3.1.1-15+ is delivered
without dependent JAR files. For more information about the installation process, see “Installation
prerequisites” on page 30.
• Unlike previous versions of HDFS Transparency 3.2.2-x, HDFS Transparency 3.2.2-6+ is delivered
without dependent JAR files. For more information about the installation process, see “Installation
prerequisites” on page 30.

HDFS Transparency download


The download source and the contents of an installation package vary depending on the HDFS
Transparency version.
1. Visit IBM Fix Central to download the HDFS Transparency package.
2. For HDFS Transparency 3.1.0 and earlier:
a. Search for Spectrum_Scale_HDFS_Transparency-<version>-<arch>-Linux to find the correct
package.
b. Untar the downloaded package:

tar zxvf Spectrum_Scale_HDFS_Transparency-<version>-<arch>-Linux.tgz

For HDFS Transparency version 3.1.1 or later, the HDFS Transparency package is available through the
IBM Storage Scale software. The IBM Storage Scale software is delivered in a self-extracting archive.
This self-extracting archive can be downloaded from the Fix Central. For more information, see the
Extracting the IBM Storage Scale software on Linux nodes topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
For example, Spectrum_Scale_Advanced-5.1.0.0-x86_64-Linux is the fix pack in the
Fix Central and Spectrum_Scale_Advanced-5.1.0.0-x86_64-Linux-install is the self-
extracting archive package.



For IBM Storage Scale 5.1.0.0 on RHEL7, the self-extracting installation package places the packages
in the following default directory:

/usr/lpp/mmfs/5.1.0.0/hdfs_rpms/rhel7/hdfs_3.1.1.x

Packages information
• For HDFS Transparency 3.1.0 stream:
– IBM Storage Scale HDFS Transparency
For example, gpfs.hdfs-protocol-3.1.0-5.x86_64.rpm.
• For HDFS Transparency 3.1.1-0 and 3.1.1-1:
– IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit for HDFS)
For example, bda_integration-1.0.1-1.noarch.rpm.
– IBM Storage Scale HDFS Transparency
For example, gpfs.hdfs-protocol-3.1.1-1.x86_64.rpm.
• For HDFS Transparency 3.1.1-2 and later:
– IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit for HDFS)
For example, gpfs.bda-integration-1.0.2-0.noarch.rpm.
– IBM Storage Scale HDFS Transparency
For example, gpfs.hdfs-protocol-3.1.1-2.x86_64.rpm.
Note:
• From IBM Storage Scale 5.1.0, the BDA Toolkit for HDFS is named gpfs.bda-integration. In IBM
Storage Scale 5.0.4 and 5.0.5, it was named bda_integration.
• In Fix Central, the fix pack name for HDFS Transparency version syntax is "x.x.x.x".
For example, Spectrum_Scale_HDFS_Transparency-3.1.0.5-x86_64-Linux.
• Because of signed repository checking, the installation toolkit cannot be used if the HDFS Transparency
package is not part of the IBM Storage Scale software self-extracting archive package (for example, if it is
delivered as a patch or efix package). In that case, you must use the manual installation method to install or
upgrade.
To verify whether the HDFS Transparency can be used in your environment, see the following sections:
• The HDFS Transparency “Hardware & OS matrix support” on page 16
• HDFS Transparency matrix support
• “Hadoop distribution support” on page 24
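In addition to checking the support matrices, you can confirm which HDFS Transparency and Toolkit for HDFS
packages are installed on a node. The following is a minimal, optional check; the package names follow the
examples above, and the bda_integration name applies only to the older 5.0.4 and 5.0.5 releases:

rpm -qa | grep gpfs.hdfs-protocol
rpm -qa | grep -E 'gpfs.bda-integration|bda_integration'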

Installing
This section describes the steps to install the HDFS Transparency nodes (NameNodes and DataNodes)
as GPFS client nodes that are added to the centralized storage system to form a single GPFS cluster. All
other Hadoop nodes (masters and clients) are set up outside of the GPFS cluster.
Before you proceed, see the following sections:
• “HDFS Transparency planning” on page 25
• “HDFS Transparency support matrix” on page 27
• “HDFS Transparency limitations and recommendations” on page 250
• “Installation prerequisites” on page 30
Note: For Cloudera® HDP distribution, CES HDFS is not supported.
Note:



• The centralized storage file system needs to be available before setting up the CES HDFS protocol
nodes.
• The CES shared root (cesSharedRoot) file system must be created.
• Do not follow steps that deploy NSDs on the HDFS Transparency nodes, because centralized storage
mode is currently the only supported mode.
• FPO is not supported.
• HDFS Transparency does not require the Hadoop distribution to be installed on the IBM Storage
Scale HDFS Transparency nodes. However, if the HDFS client is not installed on the CES HDFS
NameNodes and DataNodes, functions such as distcp do not work because HDFS Transparency does
not include the bin/hadoop command.
• When you add the HDFS protocol into CES, the other protocols (NFS, SMB, Object), the GUI, and the
performance monitor can be configured and deployed at the same time.
• SMB requires the NFSv4 ACL setting, while HDFS requires the ALL ACL setting. Therefore, a warning
is shown after the installation toolkit deployment if the HDFS protocol is added to the protocol nodes
and the ACL setting is not correct. If the HDFS protocol is used, always set the file system ACL to ALL
after the protocols are deployed.

Installation prerequisites
Set up the basic IBM Storage Scale installation prerequisites before installing CES HDFS.
See the Installation prerequisites section in the IBM Storage Scale: Concepts, Planning, and Installation
Guide for base Scale installation requirements.
• NTP setup
It is recommended that Network Time Protocol (NTP) is configured on all the nodes in
your system to ensure that the clocks of all the nodes are synchronized. Unsynchronized clocks
cause debugging issues and authentication problems with the protocols. Across all the
HDFS Transparency and Hadoop nodes, follow the steps that are listed in “Configure NTP to synchronize
the clock in HDFS Transparency” on page 56.
• SSH and network setup
Set up passwordless SSH as follows:
– From the admin node to the other nodes in the cluster.
– From protocol nodes to other nodes in the cluster.
– From every protocol node to the rest of the protocol nodes in the cluster.
– On fresh Red Hat Enterprise Linux 8 installations, you must create passwordless SSH keys by using
the ssh-keygen -m PEM command.
• CES public IP
– A set of CES public IPs (or Export IPs) is required. These IPs are used to export data using the
protocols. Export IPs are shared among all protocols and are organized in a public IP pool. See
Adding export IPs section under Deploying protocols topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
– If you are using only the HDFS protocol, it is sufficient to have just one CES Public IP.
– The CES IP/hostname used for CES HDFS must be resolved by the DNS service and not just by an
entry in your /etc/hosts file. Otherwise, you might encounter errors when you add the Hadoop
services.
Note: This is a Java requirement.
• ACL
In general, the recommendation is to configure the file system to support NFSv4 ACLs. NFSv4 ACLs are
a requirement for ACL usage with the SMB and NFS protocols. However, the ALL ACL setting is a requirement for
ACL usage with the HDFS protocol. If the protocol nodes run multiple protocols, the final ACL setting after
deployment must be -k ALL if you are using the HDFS protocol.
For more information, see examples under the mmchfs command topic in the IBM Storage Scale:
Command and Programming Reference Guide.
• Packages
The corresponding kernel-headers, kernel-devel, gcc, cpp, gcc-c++, binutils, and make packages must be installed.
yum install kernel-devel cpp gcc gcc-c++ binutils make
Note: If you are using CDP Private Cloud Base, you need to install Python 2.7 on Red Hat Enterprise
Linux 8.0 nodes. By default, Python 3 might be installed on Red Hat Enterprise Linux 8.0 nodes. CDP
Private Cloud Base with CES HDFS requires the nodes to have both Python 2.7 and Python 3.
• UID/GID consistency value under IBM Storage Scale
Ensure that all the user IDs and group IDs that are used in the IBM Storage Scale cluster for running jobs,
accessing the IBM Storage Scale file system, or running the Hadoop services are created with the
same values across all the IBM Storage Scale nodes. This is a requirement for IBM Storage Scale.
You can also use the /usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py
script that is provided with HDFS Transparency 3.1.1-3 and later. Any users or groups that are created
with this script are guaranteed to have consistent UID/GID across all the nodes.
• Starting with HDFS Transparency 3.1.1-15 and HDFS Transparency 3.2.2-6, the dependent JAR files are
not shipped with the HDFS Transparency rpm. The dependent JAR files need to be provided before an
installation or upgrade.
For 3.1.1-x on all HDFS Transparency nodes, complete the following steps:
1. If it does not exist, create the path /opt/hadoop/jars by using the command:

$ mkdir -p /opt/hadoop/jars

2. Download hadoop-3.1.4.tar.gz from Apache by issuing the following commands:

$ cd /opt/hadoop/jars
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.4/hadoop-3.1.4.tar.gz

3. Extract the content of the tar files by using this command:

$ tar -xvf hadoop-3.1.4.tar.gz

4. Download additional JAR files from the maven repository and save them in /opt/hadoop/jars.
The additional JAR files that are needed are:
– curator-client-2.12.0.jar
– curator-framework-2.12.0.jar
– curator-recipes-2.12.0.jar
– guava-11.0.2.jar
– hadoop-annotations-3.1.1.jar
– hadoop-auth-3.1.1.jar
– jsch-0.1.54.jar
– jsr305-3.0.0.jar
– xz-1.0.jar
Alternatively, download hadoop-3.1.1.tar.gz from Apache and extract it in /opt/hadoop/
jars.
5. Proceed with the installation or upgrade.
For 3.2.2-x on all HDFS Transparency nodes, complete the following steps:



1. If it does not exist, create the path /opt/hadoop/jars by using the command:

$ mkdir -p /opt/hadoop/jars

2. Download hadoop-3.2.4.tar.gz from Apache by issuing the following commands:

$ cd /opt/hadoop/jars
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-3.2.4/hadoop-3.2.4.tar.gz

3. Extract the content of the tar files by using this command:

$ tar -xvf hadoop-3.2.4.tar.gz

4. Download additional JAR files from the maven repository and save them in /opt/hadoop/jars.
The additional JAR files that are needed are:
– accessors-smart-1.2.jar
– hadoop-annotations-3.2.2.jar
– hadoop-auth-3.2.2.jar
– jetty-xml-9.4.20.v20190813.jar
– jul-to-slf4j-1.7.25.jar
– log4j-1.2.17.jar
– slf4j-api-1.7.25.jar
– slf4j-log4j12-1.7.25.jar
– stax2-api-3.1.4.jar
Alternatively, download hadoop-3.2.2.tar.gz from Apache and extract it in /opt/hadoop/
jars.
5. Proceed with the installation or upgrade.
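As an illustration, the JAR staging for the 3.1.1-x stream can be scripted by using the simpler alternative of
extracting both Apache Hadoop tarballs instead of fetching the individual JAR files from the Maven repository.
The following is a minimal sketch that assumes internet access from the node and that the hadoop-3.1.1 tarball
is available under the same archive.apache.org layout as the hadoop-3.1.4 URL shown above; adapt it for the
3.2.2-x stream by substituting the 3.2.4 and 3.2.2 versions:

#!/bin/bash
# Sketch: stage dependent JAR files for HDFS Transparency 3.1.1-x on one node.
# Run on every HDFS Transparency node before an installation or upgrade.
set -e
mkdir -p /opt/hadoop/jars
cd /opt/hadoop/jars

# Base Hadoop tarball required for the 3.1.1-x stream
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.4/hadoop-3.1.4.tar.gz
tar -xvf hadoop-3.1.4.tar.gz

# Alternative to downloading the additional JAR files individually from the Maven repository:
# extract hadoop-3.1.1 as well, as described in step 4 above.
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xvf hadoop-3.1.1.tar.gz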
The following sections are steps for installation with snips from the IBM Storage Scale installation
documentation:
• If you are planning to use the installation toolkit, follow the “Steps for install toolkit” on page 32
section.
• If you are planning to install manually, follow the “Steps for manual installation” on page 33 section.

Steps for install toolkit


This section lists the setup steps that must be completed before you use the installation toolkit to deploy CES HDFS.
Note: Ensure that the steps in the “Installation prerequisites” on page 30 section are completed before
you proceed with the steps listed in this section.
1. Install the following packages for the installation toolkit:
• python-2.7
• net-tools
• elfutils-libelf-devel [Only on Red Hat Enterprise Linux 8.0 nodes with kernel version 4.15 or later]
2. Install the JAVA openjdk-devel on all the nodes

yum install java-1.8.0-openjdk-devel

3. Export Java home in root profile

# vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin

4. Obtain and run the IBM Storage Scale self-extracting installation package.



Run the self-extracting installation package:

# ./Spectrum_Scale_Advanced-5.1.0.0-x86_64-Linux-install

After the IBM Storage Scale package is expanded, the hdfs_3.1.1.x folder in the /usr/lpp/mmfs/5.1.0.0/
hdfs_rpms/rhel7 directory contains the two packages that are required for installation toolkit usage:
HDFS Transparency and the IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS
Transparency (Toolkit for HDFS). Use the package versions listed in the “HDFS Transparency support
matrix” on page 27.
Installation Toolkit supports the deployment of the following versions of HDFS Transparency:
a. From IBM Storage Scale 5.1.1.2 through 5.1.3.1:
• HDFS Transparency 3.1.1-x
• HDFS Transparency 3.3.x-x
b. From IBM Storage Scale 5.1.3.2 (Technical preview release)/5.1.4:
• HDFS Transparency 3.1.1-x
• HDFS Transparency 3.2.2-x
• HDFS Transparency 3.3.x-x
By default, HDFS Transparency 3.1.1-x is deployed.
If you want to deploy HDFS Transparency 3.2.2-x, set the following environment variable before
running the installation toolkit command:

# export SCALE_HDFS_TRANSPARENCY_VERSION_322_ENABLE=True

If you want to deploy HDFS Transparency 3.3.x-x, you need to set the following environment variable
before running the installation toolkit command:

# export SCALE_HDFS_TRANSPARENCY_VERSION_33_ENABLE=True

You can set the SCALE_HDFS_TRANSPARENCY_VERSION_<version>_ENABLE variable in ~/.bashrc.


Here, <version> is 322 or 33 without any "." between the numbers.
For more information on the IBM Storage Scale software package, see Extracting the IBM Storage Scale
software on Linux nodes in IBM Storage Scale: Concepts, Planning, and Installation Guide and “HDFS
Transparency download” on page 28 section.
5. Install the required packages for Ansible toolkit deployment.
From IBM Storage Scale 5.1.1, the IBM Storage Scale installation toolkit uses the Ansible deployment.
For Red Hat 7 and 8, the installation toolkit installs the supported version of Ansible on the installer
node when you run the ./spectrumscale setup -s InstallNodeIP command.
To manually install the correct Ansible for your environment, see the Preparing to use the installation
toolkit topic in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
6. After the setup is complete, see “Installing” on page 29 followed by “Using installation toolkit” on
page 34.
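Before you run the installation toolkit, you can confirm that the Java environment and, if applicable, the HDFS
Transparency version selection from step 4 are in effect in the current shell. A quick, optional check, assuming
the OpenJDK path shown above:

echo $JAVA_HOME
$JAVA_HOME/bin/java -version
# Empty output below means the default HDFS Transparency 3.1.1-x stream will be deployed
env | grep SCALE_HDFS_TRANSPARENCY_VERSION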

Steps for manual installation


This section lists the steps to manually install the IBM Storage Scale packages on the nodes that are designated for CES HDFS.
Note: Ensure that the steps in the “Installation prerequisites” on page 30 section are completed before
you proceed with the steps listed in this section.



1. On the nodes designated for CES HDFS, to extract the software, follow the steps in the Preparing
to install the GPFS software on Linux nodes topic in the IBM Storage Scale: Concepts, Planning, and
Installation Guide.
2. To install the packages, follow the steps listed in the Installing IBM Storage Scale packages on Linux
systems topic in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
3. After the GPFS packages are installed, run /usr/lpp/mmfs/bin/mmbuildgpl to build the
portability layer on each node.
4. Install the JAVA openjdk-devel on all the nodes by executing the following command:

yum install java-1.8.0-openjdk-devel

5. Export Java home in root profile:

# vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin

6. After the setup is complete, see “Installing” on page 29, followed by “Manual installation” on page
42 to install and configure CES HDFS.

Using installation toolkit


This section describes how to install CES HDFS using the installation toolkit.
Run these steps after the setup in “Steps for install toolkit” on page 32 is completed.
The installation toolkit requires two packages to perform the installation for CES HDFS:
• HDFS Transparency
• IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit for HDFS)
The installation toolkit must be run as the root user.
On the installer node, where the self-extracting IBM Storage Scale package resides, the installation
toolkit default extraction path starting from IBM Storage Scale 5.1.1.x is /usr/lpp/mmfs/
package_code_version/ansible-toolkit. For IBM Storage Scale version earlier than 5.1.1, the
default extraction path is /usr/lpp/mmfs/package_code_version/installer.
There are two modes to set up CES HDFS nodes with the centralized file system:
• Adding the CES HDFS nodes into the same GPFS cluster as the centralized file system.
• Creating the CES HDFS nodes as a separate GPFS cluster from the centralized file system. In this case,
you must first set up the remote mount configuration. For more information, see the Mounting a remote GPFS file
system topic in the IBM Storage Scale: Administration Guide.

Adding CES HDFS nodes into the centralized file system


This topic lists the steps to add the CES HDFS nodes into the same GPFS cluster as the centralized file
system.
1. Ensure that the centralized file system is already installed, configured and active. For example, the
ESS.
2. Create the CES shared root file system which will be used by CES installation.
Note: The recommendation for CES shared root is a dedicated file system. A dedicated file system
can be created with the mmcrfs command. The CES shared root must reside on GPFS and must be
available when it is configured through mmchconfig command.
For more information, see the Setting up Cluster Export Services shared root file system topic in IBM
Storage Scale: Administration Guide.
3. Change to the installer directory to run the spectrumscale commands:



For IBM Storage Scale 5.1.1 and later:

# cd /usr/lpp/mmfs/5.1.1.0/ansible-toolkit

For IBM Storage Scale 5.1.0 and earlier:

# cd /usr/lpp/mmfs/5.0.4.2/installer

4. Instantiate the installer node (chef zero server)


To configure the installer node, issue the following command:

./spectrumscale setup -s InstallNodeIP -i SSHIdentity

The -s argument identifies the IP that the nodes will use to retrieve their configuration. This IP will be
the one associated with a device on the installer node. This is automatically validated during the setup
phase.
Optionally, you can specify a private SSH key to be used to communicate with the nodes in the cluster
definition file, using the -i argument.
In an Elastic Storage Server (ESS) cluster, if you want to use the installation toolkit to install GPFS and
deploy protocols, you must specify the setup type as ess while setting up the installer node:

./spectrumscale setup -s InstallNodeIP -i SSHIdentity -st ess

5. Use the installation toolkit to populate the cluster definition file from the centralized storage.
Re-populate the cluster definition file with the current cluster state by issuing the ./spectrumscale
config populate --node Node command.
In a cluster containing ESS, you must specify the EMS node with the config populate command.
For example:

./spectrumscale config populate --node EMSNode

6. Add the nodes that will be used for CES HDFS into the existing centralized file system. The additional
nodes are added into the same GPFS cluster.

./spectrumscale node add FQDN

Deployment of protocol services is performed on a subset of the cluster nodes that have been
designated as protocol nodes using the ./spectrumscale node add FQDN -p command.
NameNodes are protocol nodes and require the -p option during the node add operation.
DataNodes are not protocol nodes.
For example:
For non-HA

# NameNodes (Protocol node)


./spectrumscale node add c902f05x05.gpfs.net -p

For HA

# NameNodes (Protocol node)


./spectrumscale node add c902f05x05.gpfs.net -p
./spectrumscale node add c902f05x06.gpfs.net -p

# DataNodes
./spectrumscale node add c902f05x07.gpfs.net
./spectrumscale node add c902f05x08.gpfs.net
./spectrumscale node add c902f05x09.gpfs.net
./spectrumscale node add c902f05x10.gpfs.net



7. If call home is enabled in the cluster definition file, specify the minimum call home configuration
parameters.

./spectrumscale callhome config -n CustName -i CustID -e CustEmail -cn CustCountry

For more information, see the Enabling and configuring call home using the installation toolkit topic in
the IBM Storage Scale: Concepts, Planning, and Installation Guide.
8. Do environment checks before initiating the installation procedure.

./spectrumscale install -pr

9. Start the IBM Storage Scale installation and add the nodes into the existing cluster.

./spectrumscale install
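After the installation completes, you can confirm that the new CES HDFS nodes joined the existing GPFS cluster
and that GPFS is active on them. An optional check using standard GPFS commands (shown here as a suggestion,
not part of the documented procedure):

/usr/lpp/mmfs/bin/mmlscluster
/usr/lpp/mmfs/bin/mmgetstate -a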

Enable and deploy CES HDFS


Before you deploy the protocols, there must be a GPFS cluster that has GPFS started with at least one file
system for the CES shared root. Protocol nodes require at least two GPFS file systems to be
mounted: one for CES shared root and one for data.
1. Enable HDFS.

./spectrumscale enable hdfs

2. Set the CES IPs.


Data is served through these protocols from a pool of addresses designated as Export IP addresses
or CES public IP addresses. This example uses 192.0.2.2 and 192.0.2.3.

./spectrumscale config protocols -e 192.0.2.2, 192.0.2.3

Note: For IBM Storage Scale releases earlier than 5.0.5.1, a minimum of two CES IPs is required as
input for configuring protocols when HDFS is enabled through the installation toolkit, even though the
HDFS protocol requires only one IP address.
From IBM Storage Scale 5.0.5.1, only one CES-IP is needed for one HDFS cluster during installation
toolkit deployment.
3. Configure the shared root directory.
Get the CES shared root file system that was created from the step in “Adding CES HDFS nodes into
the centralized file system” on page 34 and configure the protocols to point to a file system that will
be used as the shared root using the following command:

./spectrumscale config protocols -f FS_Name -m FS_Mountpoint

For example:

./spectrumscale config protocols -f cesSharedRoot -m /gpfs/cesSharedRoot

For more information, see the Defining a shared file system for protocols section in IBM Storage Scale:
Concepts, Planning, and Installation Guide.
4. Create the NameNodes and DataNodes for a new CES HDFS cluster.

./spectrumscale config hdfs new -n NAME -nn NAMENODES -dn DATANODES -f FILESYSTEM -d DATADIR

The -f is the gpfs.mnt.dir value and -d DATADIR is the gpfs.data.dir value as seen in the
HDFS Transparency configuration files. Therefore, each new HDFS Transparency cluster requires its
own -d DATADIR value.
For example:



For non-HA

# ./spectrumscale config hdfs new -n myhdfscluster -nn c902f05x05 -dn c902f05x07,c902f05x08,c902f05x09,c902f05x10 -f gpfs -d gpfshdfs

For HA

# ./spectrumscale config hdfs new -n myhdfscluster -nn c902f05x05,c902f05x06 -dn c902f05x07,c902f05x08,c902f05x09,c902f05x10 -f gpfs -d gpfshdfs

Where

-n NAME, --name NAME                       HDFS cluster name.
-nn NAMENODES, --namenodes NAMENODES       NameNode hostnames (comma separated).
-dn DATANODES, --datanodes DATANODES       DataNode hostnames (comma separated).
-f FILESYSTEM, --filesystem FILESYSTEM     Spectrum Scale file system name.
-d DATADIR, --datadir DATADIR              Spectrum Scale data directory name.

Note: The -n NAME is the HDFS cluster name. The CES group contains the HDFS cluster name prefix
with hdfs.
The -d DATADIR is a unique 32-character name required for each HDFS cluster to be created on the
same centralized storage.
To configure multiple HDFS clusters, see “Adding a new HDFS cluster into existing HDFS cluster on
the same GPFS cluster (Multiple HDFS clusters)” on page 73 section.
5. List the configured HDFS cluster by running the following command:

./spectrumscale config hdfs list

For example:
Single HDFS cluster list:

Cluster Name  : mycluster
NameNodesList : [c902f09x11kvm1],[c902f09x11kvm2]
DataNodesList : [c902f09x11kvm3],[c902f09x11kvm4]
FileSystem    : gpfs1
DataDir       : datadir1

Multi-HDFS cluster list:

Cluster Name  : mycluster1
NameNodesList : [c902f09x11kvm1],[c902f09x11kvm2]
DataNodesList : [c902f09x11kvm3],[c902f09x11kvm4]
FileSystem    : gpfs1
DataDir       : datadir1

Cluster Name  : mycluster2
NameNodesList : [c902f09x11kvm5],[c902f09x11kvm6]
DataNodesList : [c902f09x11kvm7],[c902f09x11kvm8]
FileSystem    : gpfs1
DataDir       : datadir2

Note: Multi-HDFS cluster is not supported in IBM Storage Scale Big Data Analytics Integration Toolkit
for HDFS Transparency (Toolkit for HDFS) version 1.0.3.0 under IBM Storage Scale 5.1.1.0.
6. Do environment checks before initiating the installation procedure.

./spectrumscale deploy --pr

7. Start the IBM Storage Scale installation and the creation of the CES HDFS nodes.

./spectrumscale deploy

8. Verify CES HDFS service after deployment is completed.



/usr/lpp/mmfs/bin/mmces service list -a

9. Check whether the CES HDFS protocol IPs values are configured properly.

/usr/lpp/mmfs/bin/mmces address list

For more information, see Listing CES HDFS IPs.


10. After the CES HDFS nodes are installed, create the HDFS client nodes manually. For more
information, see Chapter 6, “Apache Hadoop,” on page 489.
For information on the spectrumscale, mmces, and mmhdfs commands, see the IBM Storage Scale:
Command and Programming Reference Guide.
Note: If HDFS Transparency is a part of the protocols used in the cluster, ensure that the ACL for
GPFS file system is set to -k ALL after all the protocols are installed.
mmlsfs to check the -k value.
mmchfs to change the -k value.
Restart all services and IBM Storage Scale to pick up the -k changes.
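The ACL check and change that is mentioned in the note can be done as follows. This is a sketch that assumes a
file system named gpfs1; substitute your own file system name and restart the services afterward:

# Check the current ACL semantics of the file system
/usr/lpp/mmfs/bin/mmlsfs gpfs1 -k

# Set the ACL semantics to ALL so that the HDFS protocol works alongside SMB and NFS
/usr/lpp/mmfs/bin/mmchfs gpfs1 -k all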

Separate CES HDFS cluster remote mount into the centralized file system
This topic lists the steps to create the CES HDFS nodes in a separate GPFS cluster from the centralized
file system. It is mandatory to set up the remote mount configuration between the CES HDFS GPFS cluster
and the centralized file system GPFS cluster before you can deploy the CES HDFS configuration through
the installation toolkit.
Use the installation toolkit to create a Scale cluster for the nodes that are designated as CES HDFS NameNodes
and DataNodes. This local GPFS cluster is the accessing cluster. The local GPFS cluster requires NSDs to be
created for the CES shared root file system.

Preparing installer node on the local GPFS cluster


1. Change to the installer directory to run the spectrumscale commands.
For IBM Storage Scale 5.1.1 and later:

# cd /usr/lpp/mmfs/5.1.1.0/ansible-toolkit

For IBM Storage Scale 5.1.0 and earlier:

# cd /usr/lpp/mmfs/5.0.4.2/installer

2. Instantiate the installer node (chef zero server).


To configure the installer node, issue the following command:

./spectrumscale setup -s InstallNodeIP -i SSHIdentity

The -s argument identifies the IP that the nodes will use to retrieve their configuration. This IP will be
the one associated with a device on the installer node. This is automatically validated during the setup
phase.
Optionally, you can specify a private SSH key to be used to communicate with the nodes in the cluster
definition file, using the -i argument.

Configuring local GPFS cluster


1. Create the CES shared root file system on the local GPFS cluster that will be used by the CES
installation.



The local GPFS cluster requires a minimum of two nodes with one disk per node to create the NSDs
that are used for the CES shared root file system.
Set up a minimum of two nodes as NSD servers by using the -n option.

./spectrumscale node add <NSD server Node1> -n
./spectrumscale node add <NSD server Node2> -n

./spectrumscale nsd add -p Node1 -fs <local CES shared root filesystem name> -fg 1 <device>
./spectrumscale nsd add -p Node2 -fs <local CES shared root filesystem name> -fg 2 <device>

For example,

./spectrumscale node add c902f05x05.gpfs.net -n
./spectrumscale node add c902f05x06.gpfs.net -n

./spectrumscale nsd add -p c902f05x05.gpfs.net -fs cesSharedRoot -fg 1 "/dev/sdk"
./spectrumscale nsd add -p c902f05x06.gpfs.net -fs cesSharedRoot -fg 2 "/dev/sdl"

2. Add the NameNodes that were created in the local GPFS cluster as protocol nodes, and add the DataNodes.
Deployment of protocol services is performed on a subset of the cluster nodes that have been
designated as protocol nodes by using the ./spectrumscale node add FQDN -p command.
NameNodes are protocol nodes and require the -p option during the node add operation.
DataNodes are not protocol nodes.
For example:
For non-HA:

# NameNodes (Protocol node)


./spectrumscale node add c902f05x05.gpfs.net -p

For HA

# NameNodes (Protocol node)


./spectrumscale node add c902f05x05.gpfs.net -p
./spectrumscale node add c902f05x06.gpfs.net -p

# DataNodes
./spectrumscale node add c902f05x07.gpfs.net
./spectrumscale node add c902f05x08.gpfs.net
./spectrumscale node add c902f05x09.gpfs.net
./spectrumscale node add c902f05x10.gpfs.net

3. If call home is enabled in the cluster definition file, specify the minimum call home configuration
parameters.

./spectrumscale callhome config -n CustName -i CustID -e CustEmail -cn CustCountry

For more information, see the Enabling and configuring call home using the installation toolkit topic in
the IBM Storage Scale: Concepts, Planning, and Installation Guide.
4. Perform the environment checks before initiating the installation procedure.

./spectrumscale install -pr

5. Start the IBM Storage Scale installation to create the local cluster with NameNodes and DataNodes.

./spectrumscale install

For information on deploying a Scale cluster through the installation toolkit, see the Using the
installation toolkit to perform installation tasks: Explanations and examples topic in the IBM Storage
Scale: Concepts, Planning, and Installation Guide.



Setting up remote mount access
1. After the local GPFS cluster is installed, set up the remote mount file system on the local GPFS cluster
to the owning cluster. The owning cluster is the centralized file system (for example, ESS).
For more information, see the Mounting a remote GPFS file system topic in the IBM Storage Scale:
Administration Guide.
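As an orientation only, the remote mount setup generally involves exchanging authentication keys between the
two clusters and then registering the owning cluster and its file system on the accessing (local) cluster. The
following outline is a hedged sketch that uses placeholder cluster names, contact nodes, and key file paths,
together with the remotefs, /remoteFS2, and gpfs1 names from the example later in this section; the Mounting a
remote GPFS file system topic remains the authoritative procedure. The commands assume that /usr/lpp/mmfs/bin
is in the PATH.

# On each cluster (once per cluster): generate and enable an authentication key
mmauth genkey new
mmauth update . -l AUTHONLY

# On the owning (centralized storage) cluster: authorize the accessing cluster and grant file system access
mmauth add accessCluster.example.com -k /tmp/accessCluster_id_rsa.pub
mmauth grant accessCluster.example.com -f gpfs1

# On the accessing (CES HDFS) cluster: register the owning cluster and the remote file system
mmremotecluster add owningCluster.example.com -n contactNode1,contactNode2 -k /tmp/owningCluster_id_rsa.pub
mmremotefs add remotefs -f gpfs1 -C owningCluster.example.com -T /remoteFS2
mmmount remotefs -a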

Enabling and deploying CES HDFS


Before you deploy the protocols in remote mount mode, the local GPFS cluster and the centralized file
system GPFS cluster must be up and active, and remote mount access must be set up and
configured. Protocol nodes require at least two GPFS file systems to be mounted: one for CES shared root
and one for data.
1. Enable HDFS.

./spectrumscale enable hdfs

2. Set the CES IPs.


Data is served through these protocols from a pool of addresses designated as Export IP addresses or
CES public IP addresses. This example uses 192.0.2.2 and 192.0.2.3.

./spectrumscale config protocols -e 192.0.2.2, 192.0.2.3

Note: For IBM Storage Scale releases earlier than 5.0.5.1, a minimum of two CES IPs is required as input
for configuring protocols when HDFS is enabled through the installation toolkit, even though the HDFS
protocol requires only one IP address.
From IBM Storage Scale 5.0.5.1, only one CES-IP is needed for one HDFS cluster during installation
toolkit deployment.
3. Configure the shared root directory.
Get the CES shared root file system and configure the protocols to point to a file system that will be
used as the shared root using the following command:

./spectrumscale config protocols -f cesSharedRoot -m FS_Mountpoint

For example:

./spectrumscale config protocols -f cesSharedRoot -m /gpfs/cesSharedRoot

For more information, see the Defining a shared file system for protocols topic in the IBM Storage Scale:
Concepts, Planning, and Installation Guide.
Note: Use a dedicated file system for CES shared root. A dedicated file system can be created with
the mmcrfs command. The CES shared root must reside on GPFS and must be available when it is
configured through the mmchconfig command.
For more information, see the Setting up Cluster Export Services shared root file system topic in IBM
Storage Scale: Administration Guide.
4. Set up the NameNodes and DataNodes for a new CES HDFS cluster.

./spectrumscale config hdfs new -n NAME -nn NAMENODES -dn DATANODES -f FILESYSTEM -d DATADIR

The -f option is the file system name that corresponds to the gpfs.mnt.dir mount point value, and the
-d DATADIR option is the gpfs.data.dir value, as seen in the HDFS Transparency configuration files.
Therefore, each new HDFS Transparency cluster requires its own -d DATADIR value.
From IBM Storage Scale 5.0.4.3, the -f option can take a remote mount file system only if the file
system is already configured as remote mount and is shown in the mmremotefs command.
For example:



# /usr/lpp/mmfs/bin/mmremotefs show
Local Name Remote Name Cluster name Mount Point Mount Options Automount Drive Priority
remotefs gpfs504-FS2 c550f6u34.pok.stglabs.ibm.com /remoteFS2 rw no - 0

where, remotefs is the remote file system name.


For non-HA

# ./spectrumscale config hdfs new -n myhdfscluster -nn c902f05x05 -dn c902f05x07,c902f05x08,c902f05x09,c902f05x10 -f remotefs -d gpfshdfs

For HA

# ./spectrumscale config hdfs new -n myhdfscluster -nn c902f05x05,c902f05x06 -dn c902f05x07,c902f05x08,c902f05x09,c902f05x10 -f remotefs -d gpfshdfs

where,

-n NAME, --name NAME                       HDFS cluster name.
-nn NAMENODES, --namenodes NAMENODES       NameNode hostnames (comma separated).
-dn DATANODES, --datanodes DATANODES       DataNode hostnames (comma separated).
-f FILESYSTEM, --filesystem FILESYSTEM     Spectrum Scale file system name.
-d DATADIR, --datadir DATADIR              Spectrum Scale data directory name.

Note: The -n NAME is the HDFS cluster name. The CES group contains the HDFS cluster name prefix
with “hdfs”.
The -d DATADIR is a unique 32-character name required for each HDFS cluster to be created on the
same centralized storage.
To configure multiple HDFS clusters, see the “Adding a new HDFS cluster into existing HDFS cluster on
the same GPFS cluster (Multiple HDFS clusters)” on page 73 section.
5. List the configured HDFS cluster by running the following command:

./spectrumscale config hdfs list

6. From IBM Storage Scale 5.1.1.2, Installation Toolkit supports deployment of the following two
versions of HDFS Transparency:
• HDFS Transparency 3.1.1.x
• HDFS Transparency 3.3.x
By default, HDFS Transparency 3.1.1.x is deployed. If you want to deploy HDFS Transparency 3.3.x,
you need to set the following environment variable before running the installation toolkit command:

#export SCALE_HDFS_TRANSPARENCY_VERSION_33_ENABLE=True

You can set the SCALE_HDFS_TRANSPARENCY_VERSION_33_ENABLE variable in ~/.bashrc.


7. Perform the environment checks before initiating the installation procedure.

./spectrumscale deploy --pr

8. Start the IBM Storage Scale installation and the creation of the CES HDFS nodes.

./spectrumscale deploy

Verify cluster
1. Verify CES HDFS service after deployment is completed.

/usr/lpp/mmfs/bin/mmces service list -a

2. Check if the CES HDFS protocol IPs values are configured properly.



/usr/lpp/mmfs/bin/mmces address list

For more information, see Listing CES HDFS IPs.


3. After the CES HDFS nodes are installed, create the HDFS client nodes manually. For more information,
see Chapter 6, “Apache Hadoop,” on page 489.
For information on the spectrumscale, mmces and mmhdfs commands, see the IBM Storage Scale:
Command and Programming Reference Guide guide.
Note: If HDFS Transparency is a part of the protocols used in the cluster, ensure that the ACL for GPFS
file system is set to -k ALL after all the protocols are installed. Otherwise, the HDFS NameNodes
would fail to start.
mmlsfs to check the -k value.
mmchfs to change the -k value.
Restart all services and IBM Storage Scale to pick up the -k changes.
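As an additional sanity check after deployment, you can run a simple HDFS operation against the CES IP from
one of the HDFS Transparency nodes. A hedged example, assuming the CES IP 192.0.2.2 from the earlier example
and the default NameNode RPC port 8020:

/usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls hdfs://192.0.2.2:8020/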

Manual installation
This section describes how to manually install and create the CES HDFS into a centralized file system.
Run these steps after the steps in “Steps for manual installation” on page 33 are completed.

Adding CES HDFS nodes into the centralized file system


1. Ensure that the centralized file system is already installed, configured and active. For example, the
ESS.
2. Create a CES shared root file system which will be used by CES installation.
Note: The recommendation for CES shared root is a dedicated file system. A dedicated file system
can be created with the mmcrfs command. The CES shared root must reside on GPFS and must be
available when it is configured through mmchconfig.
For more information, see the Setting up Cluster Export Services shared root file system topic in the IBM
Storage Scale: Administration Guide.
3. Add nodes designated for CES HDFS to the existing GPFS cluster.
On a node that already belongs to the GPFS cluster issue the following command:

mmaddnode -N c16f1n07.gpfs.net,c16f1n08.gpfs.net,c16f1n09.gpfs.net,c16f1n10.gpfs.net,c16f1n11.gpfs.net,c16f1n12.gpfs.net

Run mmlscluster to ensure that the nodes are added.


4. After the CES HDFS nodes are added to the existing cluster follow the “Enable and Configure CES
HDFS” on page 42 section to manually setup non-HA and HA HDFS Transparency cluster.

Enable and Configure CES HDFS


This section describes how to enable and configure CES HDFS manually using the IBM Storage Scale
commands.
1. Install HDFS protocol packages on all the CES HDFS nodes.
On Red Hat Enterprise Linux issue the following command:

# rpm -ivh gpfs.hdfs-protocol-<version>.<arch>.rpm

For example:

rpm -ivh gpfs.hdfs-protocol-3.1.1-0.ppc64.rpm



2. Configure CES shared root.
On the CES node, follow the steps in the Setting up Cluster Export Services shared root file system
topic in the IBM Storage Scale: Administration Guide to configure the cesShareRoot directory.
a. Create cesSharedRoot using the following command:

mmchconfig cesSharedRoot=/gpfs/cessharedroot

Note: The CES shared root must reside on GPFS and must be available when it is configured
through mmchconfig.
3. Enable CES on the required nodes.
Users must assign CES nodes belonging to one HDFS cluster to a CES group. If all the CES nodes
(NameNodes) belong to a single HDFS cluster, they must be assigned to one CES group for that HDFS
cluster. If different CES nodes belong to different HDFS clusters, they must be assigned to different
CES groups accordingly in order to differentiate them.
Note: Every HDFS cluster must have a CES group defined.
The CES group name should be the hdfs cluster name with a 'hdfs' prefix and will be used as
the name of the configuration tar in Clustered Configuration Repository (CCR). For example, if the
CES group name is hdfsmycluster, the configuration tar in CCR will be hdfsmycluster.tar and the
hdfs cluster name will be mycluster. For more information on CCR, see the Clustered configuration
repository topic in the IBM Storage Scale: Concepts, Planning, and Installation Guide.

mmchnode --ces-enable --ces-group=[clustername] -N [Namenode list]

For non-HA

mmchnode --ces-enable --ces-group [groupname] -N [NameNode]

For HA

mmchnode --ces-enable --ces-group [groupname] -N [NameNode1,NameNode2]

For example, with HDFS HA cluster NameNodes as c16f1n07 and c16f1n08, run the following
command:

mmchnode --ces-enable --ces-group hdfsmycluster -N c16f1n07,c16f1n08

4. Define CES IP for CES HA failover.


A CES address that is associated with a group must be assigned only to a node that is also associated
with the same group. For CES HDFS, NameNodes belonging to the same HDFS Transparency cluster
belong to the same group.
A CES HDFS group can be assigned a CES IP address so the HDFS clients can be configured using that
CES IP to access the HDFS cluster. This is to configure IP failover provided by CES.
For example, two HDFS Transparency clusters will have two CES groups (grp1, grp2).
Each group has two CES nodes (NameNodes).
Group grp1 will be assigned a CES IP address ip1 and group grp2 will be assigned a CES IP address
ip2.
If the CES node serving the CES IP ip1 fails, the CES IP ip1 will fail over to the other CES node in the
group grp1 and the HDFS Transparency service on the 2nd CES node can continue to provide service
for that group.
You can run the following command to assign a CES IP to a CES group:

mmces address add --ces-group [groupname] --ces-ip [ip]

For non-HA and HA



mmces address add --ces-group [groupname] --ces-ip x.x.x.x

For example, create CES group named as hdfsmycluster for the HDFS HA cluster:

mmces address add --ces-group hdfsmycluster --ces-ip 192.0.2.4

5. Configure the HDFS configuration file settings core-site.xml, hdfs-site.xml, gpfs-site.xml, and
hadoop-env.sh in /var/mmfs/hadoop/etc/hadoop. Ensure that fs.defaultFS is configured without the
hdfs prefix in the cluster name.
a. hadoop-env.sh
For non-HA and HA:
First set the JAVA_HOME configuration to the correct JAVA home path on the node before
executing any other mmhdfs commands. (Replace with your Java version)

mmhdfs config set hadoop-env.sh -k JAVA_HOME=/usr/jdk64/jdk1.8.0_112

b. core-site.xml
For non-HA and HA

mmhdfs config set core-site.xml -k fs.defaultFS=hdfs://mycluster

c. hdfs-site.xml
For non-HA
mmhdfs config set hdfs-site.xml -k dfs.blocksize=134217728 \
  -k dfs.nameservices=mycluster -k dfs.ha.namenodes.mycluster=nn1 \
  -k dfs.namenode.rpc-address.mycluster.nn1=c16f1n07.gpfs.net:8020 \
  -k dfs.namenode.http-address.mycluster.nn1=c16f1n07.gpfs.net:50070 \
  -k dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  -k dfs.namenode.rpc-bind-host=0.0.0.0 -k dfs.namenode.servicerpc-bind-host=0.0.0.0 \
  -k dfs.namenode.lifeline.rpc-bind-host=0.0.0.0 -k dfs.namenode.http-bind-host=0.0.0.0 \
  -k gpfs.ranger.enabled=scale

Note: For a non-HA cluster, the dfs.namenode.shared.edits.dir property in the hdfs-site.xml
configuration file is not needed. Delete this property; otherwise, the NameNode fails to start.

mmhdfs config del hdfs-site.xml -k dfs.namenode.shared.edits.dir

For HA
mmhdfs config set hdfs-site.xml -k dfs.blocksize=134217728 -k dfs.nameservices=mycluster \
  -k dfs.ha.namenodes.mycluster=nn1,nn2 \
  -k dfs.namenode.rpc-address.mycluster.nn1=c16f1n07.gpfs.net:8020 \
  -k dfs.namenode.http-address.mycluster.nn1=c16f1n07.gpfs.net:50070 \
  -k dfs.namenode.rpc-address.mycluster.nn2=c16f1n08.gpfs.net:8020 \
  -k dfs.namenode.http-address.mycluster.nn2=c16f1n08.gpfs.net:50070 \
  -k dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  -k dfs.namenode.shared.edits.dir=file:///gpfs/HA-mycluster \
  -k dfs.namenode.rpc-bind-host=0.0.0.0 -k dfs.namenode.servicerpc-bind-host=0.0.0.0 \
  -k dfs.namenode.lifeline.rpc-bind-host=0.0.0.0 -k dfs.namenode.http-bind-host=0.0.0.0 \
  -k dfs.ha.fencing.methods='shell(/bin/true)' -k gpfs.ranger.enabled=scale

Note: For IBM Storage Scale over shared storage or ESS, the recommended value for
dfs.blocksize is 536870912. For more information on tuning, see “HDFS Transparency
Tuning” on page 262.
d. gpfs-site.xml
For non-HA and HA

mmhdfs config set gpfs-site.xml -k gpfs.mnt.dir=/gpfs/fs0 -k gpfs.data.dir=cluster-data -k gpfs.storage.type=shared -k gpfs.replica.enforced=gpfs

6. Remove the localhost value from the DataNode list by running the following command:



mmhdfs worker remove localhost

If the localhost value is not removed, then mmhdfs hdfs status later will show the following
errors:
c16f1n13.gpfs.net: This node is not a datanode
mmdsh: c16f1n13.gpfs.net remote shell process had return code 1.
7. Add DataNodes.
Run the following command on the CES transparency node to add DataNodes into an HDFS
Transparency cluster.

mmhdfs worker add/remove [dn1,dn2,...dnN]

For example:

mmhdfs worker add c16f1n07.gpfs.net,c16f1n08.gpfs.net,c16f1n09.gpfs.net

8. Enable proxyuser settings for HDFS Transparency.


If you are planning to use Hive, Livy or Oozie services with CDP Private Cloud Base, configure the
proxyuser settings for those services by running the following commands:

mmhdfs config set core-site.xml -k hadoop.proxyuser.hive.groups=*
mmhdfs config set core-site.xml -k hadoop.proxyuser.hive.hosts=*
mmhdfs config set core-site.xml -k hadoop.proxyuser.livy.hosts=*
mmhdfs config set core-site.xml -k hadoop.proxyuser.livy.groups=*
mmhdfs config set core-site.xml -k hadoop.proxyuser.oozie.hosts=*
mmhdfs config set core-site.xml -k hadoop.proxyuser.oozie.groups=*

9. Upload the configuration into CCR.


Run the following command to upload the configuration into CCR.

mmhdfs config upload

Note: Remember to run this command after the HDFS transparency configuration is changed.
Otherwise, the modified configuration will be overwritten when HDFS service restarts.
10. For an HA environment, the shared edits log must be initialized from one of the NameNodes by running
the following command:

/usr/lpp/mmfs/hadoop/bin/hdfs namenode -initializeSharedEdits

11. Enable HDFS and start NameNode service.


Once the config upload completes, the configuration will be pushed to all the NameNodes and
DataNodes when enabling the HDFS service.

mmces service enable HDFS

This command will start HDFS NameNode service on ALL the CES nodes.
Note: If the configuration is not correct at this time, the command will print an error message for
HDFS Transparency related important settings that are not set properly.
12. Start all the DataNodes by running the following command from one node in the new HDFS
Transparency cluster.

mmhdfs hdfs-dn start

13. Check the NameNodes and DataNodes status in the new HDFS Transparency cluster.

mmhdfs hdfs status

14. Verify the HDFS Transparency cluster.



For HDFS CES NON-HA, run the hdfs shell command to check that the HDFS Transparency cluster
is working.
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /
For HDFS CES HA, if the NameNodes are started, verify that one NameNode is in Active status and the
other is in Standby status. Otherwise, HDFS Transparency is not in a healthy state.
Run the following command to retrieve the status of the all the HDFS NameNodes and check the
state:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

Run the hdfs shell command to confirm the HDFS HA cluster is working.

/usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /

15. Check if the CES HDFS protocol IPs values are configured properly.

/usr/lpp/mmfs/bin/mmces address list

For more information, see Listing CES HDFS IPs.


16. After the CES HDFS nodes are installed and verified, create the HDFS client nodes manually. For more
information, see Chapter 6, “Apache Hadoop,” on page 489.
For information about the spectrumscale, mmces, mmhdfs commands, see IBM Storage Scale:
Command and Programming Reference Guide.
Note: If HDFS Transparency is a part of the protocols used in the cluster, ensure that the ACL for
GPFS file system is set to -k ALL after all protocols are installed.
mmlsfs to check the -k value.
mmchfs to change the -k value.
Restart all services and IBM Storage Scale.
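Beyond the hdfs dfs -ls / check in step 14, a short end-to-end smoke test can confirm that writes go through
HDFS Transparency into the IBM Storage Scale file system. This is an illustrative sketch only; the
/tmp/smoketest path is an arbitrary example:

# Create a test directory, write a small file, read it back, and clean up
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -mkdir -p /tmp/smoketest
echo "hello from CES HDFS" > /tmp/hello.txt
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -put /tmp/hello.txt /tmp/smoketest/
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -cat /tmp/smoketest/hello.txt
/usr/lpp/mmfs/hadoop/bin/hdfs dfs -rm -r /tmp/smoketest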

Uninstalling HDFS Transparency cluster


This section describes how to manually uninstall CES HDFS.
HDFS Transparency maintains various files that contain configuration and data related to the file
system. Because these files are critical for the proper functioning of HDFS Transparency and must
be preserved across releases, they are not automatically removed when you uninstall HDFS Transparency.
Follow these steps if you do not intend to use HDFS Transparency on any of the nodes in your cluster.
1. Stop the HDFS Transparency cluster by using the following command.

# mmhdfs hdfs stop

2. Disable the HDFS service from the CES protocols by issuing the next command:

# mmces service disable hdfs

3. To remove the assigned CES IP for the HDFS Transparency cluster, use the following command:

# mmces address remove --ces-ip <CES_IP>

4. To disable CES HDFS on the assigned CES HDFS node, issue the next command:

# mmchnode --ces-disable -N <NameNode1,NameNode2>

5. If no other CES protocols exist, clear the cesSharedRoot configuration:

# mmchconfig cesSharedRoot=DEFAULT

6. Uninstall the HDFS Transparency package:



# rpm -e gpfs.hdfs-protocol
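After the uninstallation, you can verify that the HDFS protocol is no longer configured and that the package is
removed. An optional check:

# HDFS should no longer appear in the CES service list
/usr/lpp/mmfs/bin/mmces service list -a

# The query should return no gpfs.hdfs-protocol package
rpm -qa | grep gpfs.hdfs-protocol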

Upgrading
This section describes the process to upgrade CES HDFS Transparency.
Note: Starting with HDFS Transparency 3.1.1-15 and HDFS Transparency 3.2.2-6, dependent JAR files
need to be provided. This is also required as a prerequisite for an upgrade. For more information, see the
instructions to provide dependent JAR files.
If Kerberos is enabled, see “Prerequisites for Kerberos” on page 64 before you proceed to the upgrade
sections.

Installation toolkit upgrade process for HDFS Transparency


From IBM Storage Scale 5.0.5.0, the installation toolkit supports offline upgrade for HDFS protocol.
From IBM Storage Scale 5.0.5.1, the installation toolkit supports online upgrade for HDFS protocol.

Online installation toolkit upgrade


From IBM Storage Scale 5.0.5.1 and BDA integration 1.0.1.1, the installation toolkit supports the CES
HDFS Transparency online upgrade.
1. After the IBM Storage Scale install package is extracted, starting from IBM Storage Scale 5.0.5.1
with Toolkit for HDFS at 1.0.1.1 and HDFS Transparency 3.1.1-1, the default location (/usr/lpp/
mmfs/5.0.5.1/) for the files will contain the correct packages to do the online upgrade. Ensure
that the HDFS Transparency and Toolkit for HDFS residing in /usr/lpp/mmfs/<Scale version>/
hdfs_rpms/rhel7/hdfs_3.1.1.x (Default Red Hat location) have the support combination
versions as stated in the CES HDFS “HDFS Transparency support matrix” on page 27 section.
2. For IBM Storage Scale 5.1.1 and later:

# cd /usr/lpp/mmfs/5.1.1.0/ansible-toolkit

For IBM Storage Scale 5.1.0 and earlier:

# cd /usr/lpp/mmfs/5.0.5.1/installer

Run the following command:

# ./spectrumscale setup -s <Installer IP>

For ESS:

# ./spectrumscale setup -s <EMS Node> -st ess

3. Populate the existing configuration:

# ./spectrumscale config populate -N <HDFS Node>

where <HDFS Node> is any node in the HDFS Transparency cluster.
For example:
./spectrumscale config populate -N c902f09x11.gpfs.net

4. The installation toolkit automatically updates only the HDFS package when HDFS protocol is enabled
in the toolkit. To check if HDFS is enabled, run the following command:

./spectrumscale node list

5. Run upgrade precheck.

# ./spectrumscale upgrade precheck

6. Deploy the upgrade if the precheck is successful.



# ./spectrumscale upgrade run

For more information about the installation toolkit online upgrade process, see the following topics in the
IBM Storage Scale: Concepts, Planning, and Installation Guide:
• Upgrading IBM Storage Scale components with the installation toolkit
• Upgrade process flow
• Performing online upgrade by using the installation toolkit
For CES HDFS, the online upgrade follows the upgrade process flow that is described in the topics above.

Offline installation toolkit upgrade


From IBM Storage Scale 5.0.5, with Toolkit for HDFS at 1.0.1.0 and HDFS Transparency at 3.1.1-1, the
installation toolkit supports only the offline upgrade process for HDFS protocol. If other protocols (for
example, SMB, NFS) are also configured along with HDFS, those protocols will also be updated.
Ensure that you review the Performing offline upgrade or excluding nodes from upgrade using installation
toolkit documentation in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
The installation toolkit will update the nodes in the upgrade config offline node list only if those nodes
have been shut down and are suspended. Ensure that there are sufficient quorum nodes available to run
GPFS before shutting down the CES NameNodes and DataNodes.
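Before you shut down the NameNodes and DataNodes for an offline upgrade, you can review the node states and
the quorum summary. A minimal check using a standard GPFS command (shown here as a suggestion, not a
documented step):

/usr/lpp/mmfs/bin/mmgetstate -a -s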



Offline HDFS installation upgrade procedure
The following process uses IBM Storage Scale 5.0.5 as an example.
1. After the IBM Storage Scale installation package is extracted, starting from IBM Storage Scale 5.0.5
with Toolkit for HDFS at 1.0.1.0 and HDFS Transparency 3.1.1-1, the default location (/usr/lpp/
mmfs/5.0.5.0/) for the files will contain the correct packages to do the offline upgrade. Ensure
that the HDFS Transparency and Toolkit for HDFS residing in /usr/lpp/mmfs/<Scale version>/
hdfs_rpms/rhel7/hdfs_3.1.1.x (Default Red Hat location) have the support combination
versions as stated in the CES HDFS “HDFS Transparency support matrix” on page 27 section.
2. For IBM Storage Scale 5.1.1 and later:

# cd /usr/lpp/mmfs/5.1.1.0/ansible-toolkit

For IBM Storage Scale 5.1.0 and earlier:

# cd /usr/lpp/mmfs/5.0.5.1/installer

Run the following command:

# ./spectrumscale setup -s <Installer IP>

For ESS:

# ./spectrumscale setup -s <EMS Node> -st ess

3. Populate the existing configuration:

# ./spectrumscale config populate -N <HDFS Node>

where <HDFS Node> is any node in the HDFS Transparency cluster.
For example:
./spectrumscale config populate -N c902f09x11.gpfs.net

4. Shut down the NameNodes and DataNodes.

# /usr/lpp/mmfs/bin/mmshutdown -N <NameNode and DataNodes list>

For example,
# /usr/lpp/mmfs/bin/mmshutdown -N c902f09x09,c902f09x10,c902f09x11,c902f09x12
Fri Feb 21 00:09:11 EST 2020: mmshutdown: Starting force unmount of GPFS file systems
Fri Feb 21 00:09:56 EST 2020: mmshutdown: Shutting down GPFS daemons
Fri Feb 21 00:10:04 EST 2020: mmshutdown: Finished

Note: Only the HDFS NameNodes and DataNodes need to be shut down. The other CES protocol
nodes do not need to be shut down.
5. Suspend CES service on all the NameNodes.

# /usr/lpp/mmfs/bin/mmces node suspend -N <NameNodes>

For example,
[root@c902f09x09 installer]# /usr/lpp/mmfs/bin/mmces node suspend -N c902f09x11,c902f09x12
Node c902f09x11.gpfs.net now in suspended state.
Node c902f09x12.gpfs.net now in suspended state.

Note:
• Shutting down GPFS using the mmshutdown command will stop the CES HDFS NameNodes and
DataNodes.
• For HDFS Transparency to be upgraded, all the CES NameNodes are required to be suspended.
6. The installation toolkit automatically updates only the HDFS package when HDFS protocol is enabled
in the toolkit. To check if HDFS is enabled, run the following command:

./spectrumscale node list



7. Run upgrade configuration in offline mode for the NameNodes and DataNodes.

# ./spectrumscale upgrade config offline -N <List of NameNodes and DataNodes>

For example,
[root@c902f09x09 installer]# ./spectrumscale upgrade config offline -N
c902f09x09,c902f09x10,c902f09x11,c902f09x12
[ INFO ] The node c902f09x09.gpfs.net is added as offline.
[ INFO ] The node c902f09x10.gpfs.net is added as offline.
[ INFO ] The node c902f09x11.gpfs.net is added as offline.
[ INFO ] The node c902f09x12.gpfs.net is added as offline.

Note:
• This will only upgrade the HDFS Transparency nodes.
• Ensure that you list all the NameNodes and DataNodes in the HDFS Transparency cluster into the
offline list.
8. Check the protocol configuration list to ensure that they are set to “offline".

# ./spectrumscale upgrade config list

For example:
The NameNodes and other protocol nodes are shown under Phase2: Protocol Nodes Upgrade. The
DataNodes and other nodes are shown under Phase1: Non Protocol Nodes Upgrade.
In the following example, there are two NameNodes (c902f09x11.gpfs.net, c902f09x12.gpfs.net)
and two DataNodes (c902f09x09.gpfs.net,c902f09x10.gpfs.net):

# ./spectrumscale upgrade config list


[ INFO ] GPFS Node SMB NFS OBJ HDFS GPFS
[ INFO ]
[ INFO ] Phase1: Non Protocol Nodes Upgrade
[ INFO ] c902f09x09.gpfs.net - - - - offline
[ INFO ] c902f09x10.gpfs.net - - - - offline
[ INFO ]
[ INFO ] Phase2: Protocol Nodes Upgrade
[ INFO ] c902f09x11.gpfs.net offline offline offline offline offline
[ INFO ] c902f09x12.gpfs.net offline offline offline offline offline
[ INFO ]

9. Run the upgrade precheck.

# ./spectrumscale upgrade precheck

10. Deploy the upgrade if the precheck is successful.

# ./spectrumscale upgrade run

11. After the upgrade completes successfully, start the HDFS Transparency cluster.
a. Start the NameNodes.

/usr/lpp/mmfs/bin/mmces service start hdfs -a

b. Start the DataNodes.

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn start
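After the NameNodes and DataNodes are started again, you can verify that the upgrade left the cluster healthy.
For example, using the commands that are described elsewhere in this chapter:

# Confirm that the HDFS protocol is running on the CES nodes
/usr/lpp/mmfs/bin/mmces service list -a

# Confirm the NameNode and DataNode status reported by HDFS Transparency
/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status

# Confirm the installed HDFS Transparency package level
rpm -qa | grep gpfs.hdfs-protocol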



Manual rolling upgrade for HDFS Transparency
HDFS Transparency 3.1.1-x is the version for CES HDFS integration. HDFS Transparency supports rolling
upgrades when the commands are executed manually on the command line and not through the
installation toolkit.

Manual rolling upgrade for CES HDFS Transparency NameNode


This topic lists the steps to manually perform a rolling upgrade for the NameNodes.
As root, follow the steps listed below to perform the rolling upgrade for the NameNode(s):
1. If NameNode HA is configured on the standby HDFS Transparency NameNode, stop the standby
NameNode from the bash console with the following command:

/usr/lpp/mmfs/bin/mmces service stop hdfs

Note: If CES protocols such as SMB co-existed with HDFS, then the CES IP of SMB will failover from
the standby NameNode to the active NameNode.
If HDFS Transparency NameNode HA was not configured, then go to step 5.
Note: When you upgrade the HDFS Transparency NameNode with non-HA configured, HDFS
Transparency service gets interrupted.
2. Upgrade the standby HDFS Transparency NameNode.
cd to the directory where the upgrade HDFS Transparency package resides.
Run the following command from the bash console to update the HDFS Transparency package:

rpm -Uvh gpfs.hdfs-protocol-3.1.1-<version>.<os>.rpm

3. Start the standby NameNode.


Run the following command from the bash console:

/usr/lpp/mmfs/bin/mmces service start hdfs

4. Move the CES IP of HDFS from the active NameNode to the standby NameNode.
This does a failover of the current active NameNode to become the new standby NameNode.
Run the following command from the bash console:

/usr/lpp/mmfs/bin/mmces address move --ces-ip x.x.x.x --ces-node <standby_namenode_host>

The original standby NameNode is now the active NameNode after the CES IP is moved successfully.
5. Check to see if the new active NameNode is active.
Run the following commands from the bash console:

/usr/lpp/mmfs/bin/mmces service list -a


/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-nn status
/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

6. Stop the new standby HDFS Transparency NameNode, for which the status changed from active to
standby, so that the HDFS Transparency package can be upgraded.
Run the following command from the bash console:

/usr/lpp/mmfs/bin/mmces service stop hdfs

Note: If other CES protocols such as SMB co-existed with HDFS, then the CES IP of SMB will failover to
the new active NameNode.
7. Upgrade the new standby HDFS Transparency NameNode.

Run the following command from the bash console to upgrade the HDFS Transparency package:

rpm -Uvh gpfs.hdfs-protocol-3.1.1-<version>.<os>.rpm

8. Start the new standby NameNode


Run the following command from the bash console:

/usr/lpp/mmfs/bin/mmces service start hdfs

Note: If other CES protocols such as SMB co-existed with HDFS, then the CES IP of SMB will fail over
back to the new standby NameNode.

Manual rolling upgrade for CES HDFS Transparency DataNode


This topic lists the steps to manually perform a rolling upgrade for the DataNodes.
Note: This is an online upgrade. Connected clients will wait for the DataNode to restart and continue
the operation. The default timeout is 30 seconds and can be modified on the client side by setting
dfs.client.datanode-restart.timeout to a higher value. If the DataNode is not restarted
within the specified time, the client considers the DataNode as dead (by default for 10 minutes,
see dfs.client.write.exclude.nodes.cache.expiry.interval.millis) and will fail over to
another DataNode if dfs.replication is greater than 1.
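For example, a minimal client-side sketch of raising the restart timeout; the 90s value is only an illustration, and the property belongs in the hdfs-site.xml that is used by the Hadoop clients:

<property>
<name>dfs.client.datanode-restart.timeout</name>
<value>90s</value>
</property>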
1. Copy the latest gpfs.hdfs-protocol-<VERSION>.<ARCH> package to the DataNode that you want to
upgrade.
2. Log in to the DataNode.
3. Upgrade the RPM by running the following command:

rpm -Uvh gpfs.hdfs-protocol-<VERSION>.<ARCH>

4. Shut down the DataNode by running the following command:

/usr/lpp/mmfs/hadoop/bin/hdfs dfsadmin -shutdownDatanode <HOST>:<IPC_PORT> upgrade

Note: The default IPC port is 9867. You can see the IPC port value in dfs.datanode.ipc.address
under the /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml file.
5. Wait for the DataNode to shut down. Run the following command to check that the DataNode status is
set to dead:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode status

For example:

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode status

c902f08x04.gpfs.net: cescluster1: datanode is dead, previous pid is 26077

6. Start the DataNode again by running the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode start

Configuring
The following configurations are for manually configuring HDFS Transparency. For example, set up HDFS
Transparency for open-source Apache or Cloudera CDP stack.
Before configuring, the Hadoop distribution must be installed under $YOUR_HADOOP_PREFIX on each
machine in the Hadoop cluster. The configurations for IBM Storage Scale HDFS transparency are
located under /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS Transparency 2.7.3-x) or /var/mmfs/
hadoop/etc/hadoop (for HDFS Transparency 3.0.x) for any Hadoop distribution. Configuration files for
Hadoop distribution are located in different locations. For example, /etc/hadoop/conf for Cloudera
CDP.
The core-site.xml and hdfs-site.xml configuration files should be synchronized between all the
nodes and kept identical for the IBM Storage Scale HDFS Transparency and Hadoop Distribution. The
log4j.properties configuration file can differ between the IBM Storage Scale HDFS Transparency and
the open-source Apache Hadoop distribution.

Password-less ssh access


Ensure that root password-less ssh access does not prompt the user for a response. If root password-less
access is not set up, HDFS Transparency fails to start. The mmhadoopctl and mmhdfs commands require
password-less ssh to all the nodes, including the node on which they are run.
If the IBM Storage Scale cluster is configured as adminMode=central, HDFS Transparency NameNodes
can be configured on the management nodes of the IBM Storage Scale cluster. To check if the IBM
Storage Scale cluster is configured as adminMode=central, run mmlsconfig adminMode.
If the IBM Storage Scale cluster is configured in sudo wrapper mode, IBM Storage Scale requires the
user to have password-less root access to all the other nodes as a common user. To check if the IBM
Storage Scale cluster is configured in sudo wrapper mode, log in as a root user in the node and execute
ssh <non-root>@<other-node> in the password-less mode. With IBM Storage Scale in sudo wrapper
mode, HDFS Transparency still requires the node to have root access to all the other nodes including itself
to run the mmhadoopctl and mmhdfs commands.
HDFS Transparency provides the following options for root password-less requirement:
1. Local cluster options
For the local cluster, follow one of the following options for the root password-less requirement:
a. By default, HDFS Transparency requires root password-less access between any two nodes in the
HDFS Transparency cluster.
b. If the above option is not feasible, you need at least one node with root password-less access
to all the other HDFS Transparency nodes and to itself. In such a case, mmhadoopctl/mmhdfs
command can be run only on this node and this node should be configured as HDFS Transparency
NameNodes. If NameNode HA is configured, all NameNodes should be configured with root
password-less access to all DataNodes.
Note:
• If you configure the IBM Storage Scale cluster in admin central mode (mmchconfig
adminMode=central), you can configure HDFS Transparency NameNodes on the IBM
Storage Scale management nodes. Therefore, you have root password-less access from these
management nodes to all the other nodes in the cluster.
• If the file system is remotely mounted, HDFS Transparency requires two password-less access
configurations: one is for the local cluster (configure HDFS Transparency according to this option
for password-less access in the local cluster) and the other is for remote file system.
2. Remote cluster options
For the remote file system, follow one of the following options for the root password-less requirement:
a. By default, HDFS Transparency NameNodes require root password-less access to at least one of
the contact nodes (the 1st contact node is recommended if you cannot configure all contact nodes
as password-less access) from the remote cluster.
For example, in the following cluster, ess01-dat.gpfs.net and ess02-dat.gpfs.net are
contact nodes. ess01-dat.gpfs.net is the first contact node because it is listed first in the
property Contact nodes:

# /usr/lpp/mmfs/bin/mmremotecluster show all


Cluster name: test01.gpfs.net
Contact nodes: ess01-dat.gpfs.net,ess02-dat.gpfs.net
SHA digest: abe321118158d045f5087c00f3c4b0724ed4cfb8176a05c348ae7d5d19b9150d
File systems: latestgpfs (gpfs0)

Note: HDFS Transparency DataNodes do not require root password-less access to the contact
nodes.
b. From HDFS Transparency 2.7.3-3, HDFS Transparency supports non-root password-less access to
one of the contact nodes as a common user (instead of root user).
First, on the HDFS Transparency NameNodes, configure password-less access for the root user, as a
non-privileged user, to the contact nodes from the remote cluster (at least one contact node; the first
contact node is recommended). Here, the gpfsadm user is used as an example.
Add the following into the /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for
HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS
Transparency 3.1.x) file on HDFS Transparency NameNodes.

<property>
<name>gpfs.ssh.user</name>
<value>gpfsadm</value>
</property>

On one of the contact nodes (the first contact node is recommended), edit /etc/sudoers using
visudo and add the following to the sudoers file.

gpfsadm ALL=(ALL) NOPASSWD: /usr/lpp/mmfs/bin/mmlsfs, /usr/lpp/mmfs/bin/mmlscluster, /usr/lpp/mmfs/bin/mmlsnsd, /usr/lpp/mmfs/bin/mmlsfileset, /usr/lpp/mmfs/bin/mmlssnapshot, /usr/lpp/mmfs/bin/mmcrsnapshot, /usr/lpp/mmfs/bin/mmdelsnapshot, /usr/lpp/mmfs/bin/tslsdisk

The gpfsadm user can run these IBM Storage Scale commands for any filesets in the file system
using the sudo configurations above.
Note: Comment out Defaults requiretty. Otherwise, the sudo: sorry, you must have a
tty to run sudo error occurs.

#
# Disable "ssh hostname sudo <cmd>", because it will show the password in clear.
# You have to run "ssh -t hostname sudo <cmd>".
#
#Defaults requiretty

Note: Before you start HDFS Transparency, log in to the HDFS Transparency NameNodes as root and run
ssh gpfsadm@<the configured contact node> /usr/lpp/mmfs/bin/mmlsfs <fs-name> to confirm
that it works.
c. Manually generate the internal configuration files from the contact node and copy them onto the
local nodes so that you do not require root or user password-less ssh to the contact nodes.
From HDFS transparency 2.7.3-2, you can configure gpfs.remotecluster.autorefresh as
false in /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 2.7.3-
x) or /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 3.1.x).
Manually copy the /usr/lpp/mmfs/hadoop/sbin/initmap.sh script from the NameNode to
one of the contact nodes. The script can be copied to any directory.
Create the /var/mmfs/hadoop/etc/hadoop directory on the contact node and copy the
contents of the /var/mmfs/hadoop/etc/hadoop directory from the NameNode to the directory
created on the contact node.
Log on the contact node as root and run the initmap.sh command.
For example, to get the initmap files for two file systems on the contact node, run the following
command:

/<savedir>/initmap.sh -i all <fs1>,<fs2>

Note: Do not use the -d option when running on the contact node.
Copy the generated internal configuration files to all the HDFS Transparency nodes.
The initmap.sh script must be re-run on the remote system if any of the following are
changed:
• There are updates to the dataReplica configuration values for the filesystem.
• The gpfs cluster name (from the mmlscluster output) is changed through the mmchcluster
command on the remote system.
• There are updates to the filesystem name in either the remote or local clusters.
• There are updates to the contact nodes information from the local cluster to the remote cluster.
For the initmap.sh script command syntax and generated internal configuration files, see
“Cluster and file system information configuration” on page 62.
Note: If gpfs.remotecluster.autorefresh is configured as false, the snapshot from Hadoop
interface is not supported against the remote mounted file system.
If the IBM Storage Scale cluster is configured as adminMode=central (check by executing
mmlsconfig adminMode), HDFS Transparency NameNodes can be configured on the
management nodes of the IBM Storage Scale cluster.

OS tuning for all nodes in HDFS Transparency


This topic describes the ulimit tuning.

ulimit tuning
For all nodes, ulimit -n and ulimit -u must be larger than or equal to 65536. A smaller value can
cause the Hadoop Java processes to report unexpected exceptions.
In Red Hat, add the following lines at the end of /etc/security/limits.conf file:

* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536

For other Linux distributions, see the relevant documentation.


After the above change, all the Hadoop services must be restarted for the change to take effect.
If you are using Ambari, ensure that you restart each ambari-agent and then restart HDFS Transparency in
order to pick up the changes in the /etc/security/limits.conf.
Note:
• This must be done on all nodes including the Hadoop client nodes and the HDFS Transparency nodes.
• If the ambari agent is restarted using the command line (for example, ambari-agent restart), the
ulimit inherited by the DataNode process is from the /etc/security/limits.conf file.
If the ambari agent is started using systemd (for example, after a server reboot), the ulimit inherited by the
DataNode process is from the systemd config file (for example, the LimitNOFILE value), as shown in the
sketch after this note.
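The following is a minimal sketch of a systemd drop-in that raises the limits inherited from systemd; the ambari-agent.service unit name and the drop-in path are assumptions and might differ in your environment:

mkdir -p /etc/systemd/system/ambari-agent.service.d
cat > /etc/systemd/system/ambari-agent.service.d/limits.conf << 'EOF'
[Service]
# Raise the limits inherited by processes started through systemd (assumed unit name)
LimitNOFILE=65536
LimitNPROC=65536
EOF
systemctl daemon-reload
systemctl restart ambari-agent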
kernel.pid_max
Usually, the default value is 32K. If you see an allocate memory error or an unable to
create new native thread error, you can try to increase kernel.pid_max by adding
kernel.pid_max=99999 at the end of /etc/sysctl.conf and then running the sysctl -p
command.
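For example, a minimal way to apply this change:

echo "kernel.pid_max=99999" >> /etc/sysctl.conf
sysctl -p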

Configure NTP to synchronize the clock in HDFS Transparency
For a distributed cluster, configure NTP to synchronize the clocks in the cluster. If the cluster can access the
internet, use public NTP servers to synchronize the clocks on the nodes. If the cluster does not have access
to the internet, configure one of the nodes as the NTP server so that all the other nodes synchronize their
clocks against that NTP server.
Refer to CONFIGURING NTP USING NTPD to configure NTP on RHEL 7.
HDFS Transparency requires the clocks to be synchronized on all nodes. Otherwise, issues might
occur.
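A minimal sketch for checking the clock synchronization, assuming ntpd is used on RHEL 7:

# List the NTP peers and their current offsets
ntpq -p
# Confirm that the system clock is reported as synchronized
timedatectl status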

Configure Hadoop nodes


On Hortonworks HDP, you can configure Hadoop through the Ambari GUI. If you are not familiar with HDFS/
Hadoop, set up native HDFS first by referring to the Hadoop cluster setup guide. Setting up HDFS
Transparency to replace native HDFS is easier after you have set up HDFS/Hadoop.
Hadoop and HDFS Transparency must use the same core-site.xml, hdfs-site.xml, slaves (Hadoop
2.7.x) or workers (Hadoop 3.0.x+), hadoop-env.sh, and log4j.properties files on both the Hadoop nodes
and the HDFS Transparency nodes. This means that native HDFS in Hadoop and HDFS Transparency must
use the same NameNodes and DataNodes.
Note:
1. For HortonWorks HDP, the configuration files above are located under /etc/hadoop/
conf. For open source Apache Hadoop, the configuration files are located under
$YOUR_APACHE_HADOOP_HOME/etc/hadoop and /usr/lpp/mmfs/hadoop/etc/hadoop for
HDFS Transparency 2.7.3-x.
2. From HDFS Transparency 2.7.3-3, the configurations are located under /usr/lpp/mmfs/
hadoop/etc/hadoop. From HDFS Transparency 3.0.0, the configurations are located under /var/
mmfs/hadoop/etc/hadoop.
If your native HDFS NameNodes are different from the HDFS Transparency NameNodes, you need to update
fs.defaultFS in your Hadoop configuration (for Hortonworks HDP, it is located under /etc/hadoop/
conf; for open source Apache Hadoop, it is located under $YOUR_HADOOP_PREFIX/etc/hadoop/):

<property>
<name>fs.defaultFS</name>
<value>hdfs://hs22n44:8020</value>
</property>

For HDFS Transparency 2.7.0-x, 2.7.2-0, 2.7.2-1, do not export the Hadoop environment variables on the
HDFS Transparency nodes because this can lead to issues when the HDFS Transparency uses the Hadoop
environment variables to map to its own environment. The following Hadoop environment variables can
affect HDFS Transparency:
• HADOOP_HOME
• HADOOP_HDFS_HOME
• HADOOP_MAPRED_HOME
• HADOOP_COMMON_HOME
• HADOOP_COMMON_LIB_NATIVE_DIR
• HADOOP_CONF_DIR
• HADOOP_SECURITY_CONF_DIR
For HDFS Transparency versions 2.7.2-3+, 2.7.3-x and 3.0.x+, the environmental variables listed above
can be exported except for HADOOP_COMMON_LIB_NATIVE_DIR. This is because HDFS Transparency
uses its own native .so library.
For HDFS Transparency versions 2.7.2-3+ and 2.7.3-x:

• If you did not export HADOOP_CONF_DIR, HDFS Transparency will read all the configuration files
under /usr/lpp/mmfs/hadoop/etc/hadoop such as the gpfs-site.xml file and the hadoop-
env.sh file.
• If you export HADOOP_CONF_DIR, HDFS Transparency will read all the configuration files under
$HADOOP_CONF_DIR. As gpfs-site.xml is required for HDFS Transparency, it will only read the
gpfs-site.xml file from the /usr/lpp/mmfs/hadoop/etc/hadoop directory.
For questions or issues with HDFS Transparency configuration, send an email to [email protected].

Configure HDFS Transparency nodes


This section provides information on configuring HDFS transparency nodes.

Hadoop configurations files


This topic lists the Hadoop configuration files.
By default, HDFS Transparency 2.7.3-x uses the following configuration files located under /usr/lpp/
mmfs/hadoop/etc/hadoop:
• core-site.xml
• hdfs-site.xml
• slaves
• log4j.properties
• hadoop-env.sh
HDFS Transparency 3.0.0+ uses the following configuration files located under /var/mmfs/
hadoop/etc/hadoop:
• core-site.xml
• hdfs-site.xml
• workers
• log4j.properties
• hadoop-env.sh

Configure the storage mode


Use this procedure to configure the storage mode.
Modify the /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml file on the
hdfs_transparency_node1 node:

<property>
<name>gpfs.storage.type</name>
<value>local</value>
</property>

The property gpfs.storage.type is used to specify the storage mode: local or shared. Local is for IBM
Storage Scale FPO file system and shared is for IBM Storage Scale over Centralized Storage or remote
mounted file system. This is a required configuration parameter and the gpfs-site.xml configuration
file must be synchronized with all the HDFS Transparency nodes after the modification.

Update other configuration files


Use this procedure to update the configuration files.
Note: To configure Hadoop HDFS, Yarn, etc. refer to the hadoop.apache.org website.

Configuring Apache Hadoop
Modify the /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml file on the
hdfs_transparency_node1 node:

<property>
<name>gpfs.mnt.dir</name>
<value>/gpfs_mount_point</value>
</property>

<property>
<name>gpfs.data.dir</name>
<value>data_dir</value>
</property>

<property>
<name>gpfs.supergroup</name>
<value>hdfs,root</value>
</property>

<property>
<name>gpfs.replica.enforced</name>
<value>dfs</value>
</property>

In gpfs-site.xml, all the Hadoop data is stored under the /gpfs_mount_point/data_dir directory.
You can have two Hadoop clusters over the same file system and these clusters are isolated from
each other. One limitation is that if a link under the /gpfs_mount_point/data_dir directory points to
a file outside the /gpfs_mount_point/data_dir directory, Hadoop reports an exception when it
operates on that file because the file is not accessible by Hadoop.
If you do not want to explicitly configure the gpfs.data.dir parameter, leave it as null. For example,
keep its value as <value></value>.
Note: Do not configure it as <value>/</value>.
The gpfs.supergroup parameter must be configured according to your cluster. You need to add the
Hadoop users, such as hdfs, yarn, hbase, hive, and oozie, to the same group, for example a group
named hadoop, and configure gpfs.supergroup as hadoop. You might specify two or more
comma-separated groups as gpfs.supergroup. For example, group1,group2,group3.
Note: Users in gpfs.supergroup are super users and they can control all the data in /
gpfs_mount_point/data_dir directory. This is similar to the user root in Linux. Since HDFS
Transparency 2.7.3-1, gpfs.supergroup could be configured as hdfs,root.
The gpfs.replica.enforced parameter is used to control the replica rules. Hadoop controls the
data replication through the dfs.replication parameter. When running Hadoop over IBM Storage
Scale, IBM Storage Scale has its own replication rules. If you configure gpfs.replica.enforced
as dfs, the dfs.replication setting takes effect unless you override dfs.replication in the command
options when submitting jobs. If gpfs.replica.enforced is set to gpfs, all the data is replicated
according to the IBM Storage Scale configuration settings. The default value for this parameter is dfs.
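For example, a hedged sketch of overriding the replication factor at submission time from the HDFS client; the sample file and target directory names are only illustrations:

/usr/lpp/mmfs/hadoop/bin/hdfs dfs -D dfs.replication=2 -put /tmp/sample.txt /user/hadoop/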
Usually, you should not change core-site.xml and hdfs-site.xml located under /var/mmfs/
hadoop/etc/hadoop/. These two files must be kept consistent with the files used by the Hadoop nodes.
You need to modify /var/mmfs/hadoop/etc/hadoop/workers to add all the HDFS Transparency
DataNode hostnames, one hostname per line. For example:

# cat /var/mmfs/hadoop/etc/hadoop/workers
hs22n44
hs22n54
hs22n45

You might check /var/mmfs/hadoop/etc/hadoop/log4j.properties and modify it accordingly.
This file might be different from the log4j.properties used by Hadoop nodes.

After you finish the configurations, use the following command to sync it to all IBM Storage Scale HDFS
transparency nodes:
hdfs_transparency_node1#/usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop

Configure storage type data replication


To get the file system data replica values, run the mmlsfs <fsName> -r -R command to review the
output values. The value of -r is the default number of data replicas and the value of -R is the maximum
number of data replicas.
Important: The value of -R cannot be changed after the file system creation. Usually, the value 3 is the
recommended value for both -r and -R if you are using IBM Storage Scale FPO, and the values 1 for -r and
2 for -R are recommended for production when you are using Centralized Storage.
For different storage modes, refer to the following table for recommended combination for
dfs.replication, gpfs.replica.enforced and file system data replica.

Table 11. Configurations for data replication

• Configuration #1: FPO (gpfs.storage.type=local)
  - dfs.replication: 3
  - gpfs.replica.enforced: gpfs or dfs
  - File system data replica: -r = 3, -R = 3
  - Comments: Other combinations are not recommended.

• Configuration #2: ESS (gpfs.storage.type=shared)
  - dfs.replication: 1
  - gpfs.replica.enforced: dfs
  - File system data replica: -r = 1, -R = 2 or -r = 1, -R = 3
  - Comments: Follows the HDFS protocol, but the job will fail if one DN is down after getBlockLocation is returned. Potential issue: does not show the advantage that all DNs can access the blocks. If you are using this configuration, you must use the mmlsattr command to check the file replication value. If the set file replication value is less than the dfs.replication value, the HDFS interface cannot be used to check the file replication value because the NameNode returns at least the dfs.replication value in the shared storage mode.

• Configuration #3: ESS (gpfs.storage.type=shared)
  - dfs.replication: 2 or 3
  - gpfs.replica.enforced: gpfs
  - File system data replica: -r = 1, -R = 2 or -r = 1, -R = 3
  - Comments: Follows the HDFS protocol (returns 2 or 3 DNs) but does not match the real storage usage on the GPFS level. The job will not fail if one DN is down after getBlockLocation is returned. Potential risk: upper-layer applications calculate the disk space consumption as replication * file size, thinking a file takes more storage space than it actually does. HDFS Transparency will still use the actual disk space correctly.

• Configuration #4: ESS (gpfs.storage.type=shared)
  - dfs.replication: 1
  - gpfs.replica.enforced: gpfs
  - File system data replica: -r = 1, -R = 2 or -r = 1, -R = 3
  - Comments: Do not use if the application wants to set the replication value from the HDFS protocol.

• Configuration #5: ESS (gpfs.storage.type=shared)
  - dfs.replication: 2 or 3
  - gpfs.replica.enforced: dfs
  - File system data replica: -r = 1, -R = 2 or -r = 1, -R = 3
  - Comments: All the data will be set as replica 2 or 3, which will not take advantage of using IBM ESS or SAN storage. If you are using this configuration, you must use the mmlsattr command to check the file replication value. If the set file replication value is less than the dfs.replication value, the HDFS interface cannot be used to check the file replication value because the NameNode returns at least the dfs.replication value in the shared storage mode.


Note:
• The dfs.replication is defined in the hdfs-site.xml file. The gpfs.storage.type and
gpfs.replica.enforced are defined in the gpfs-site.xml file.
• Starting from HDFS Transparency version 3.1.1-1, the default value for dfs.replication is 3 in
hdfs-site.xml and gpfs.replica.enforced is gpfs in gpfs-site.xml.
• The dfs.replication value should be smaller than or equal to the DataNode count.

Update environment variables for HDFS transparency service


Use the following procedure to update the environment variables for HDFS transparency service.
The administrator might need to update some environment variables for the HDFS Transparency service.
For example, change JVM options or Hadoop environment variables like HADOOP_LOG_DIR.
To update this, follow these steps:
1. On the HDFS Transparency NameNode, modify the /usr/lpp/mmfs/hadoop/etc/hadoop/
hadoop-env.sh (for HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/hadoop-
env.sh (for HDFS Transparency 3.0.0+) and other files as necessary.
2. Sync the changes to all the HDFS Transparency nodes. For information on synching the HDFS
Transparency configurations, refer “Sync HDFS Transparency configurations” on page 61.

Sync HDFS Transparency configurations


Usually, all the HDFS Transparency nodes take the same configurations. So, if you change
the configurations of HDFS Transparency on one node, you need to run /usr/lpp/mmfs/bin/
mmhadoopctl on the node to sync the changed configurations into all other HDFS Transparency nodes.
For example, hdfs_transparency_node1 is the node where you update your HDFS Transparency
configurations:
For HDFS Transparency 2.7.3-x:

hdfs_transparency_node1#/usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop

For HDFS Transparency 3.0.0-x ~ 3.1.0-x:

hdfs_transparency_node1#/usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop

For HDFS Transparency 3.1.1-x +:

hdfs_transparency_node1#/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

Note: If you are using HDP with Mpack 2.4.2.1 or later, ensure you change configurations through Ambari
only. If you change configurations by using mmhadoopctl syncconf, the changes get overwritten by
Ambari integration after HDFS service restart.

Cluster and file system information configuration
After HDFS Transparency is started successfully for the first time, it executes a script called initmap.sh
to automatically generate internal configuration files which contain the GPFS cluster information, disk-to-
hostname map information and ip-to-hostname map information.

Table 12. Generating internal configuration files

• HDFS Transparency 2.7.3-0 and earlier: If new disks are added in the file system or if the file systems are
recreated, the initmap.sh script must be executed manually on the HDFS Transparency NameNode so that
internal configuration files are updated with the new information.
• HDFS Transparency 2.7.3-1 and later: The NameNode will run the initmap.sh script every time it starts, so
the script does not need to be run manually.
• HDFS Transparency 2.7.3-2 and later: The internal configuration files will be generated automatically if they
are not detected and will be synched to all other HDFS Transparency nodes when HDFS Transparency is
started.

Table 13. initmap.sh script command syntax

• HDFS Transparency 2.7.3-1 and earlier: /usr/lpp/mmfs/hadoop/sbin/initmap.sh <fsName> diskinfo nodeinfo clusterinfo
• HDFS Transparency 2.7.3-2: /usr/lpp/mmfs/hadoop/sbin/initmap.sh true <fsName> diskinfo nodeinfo clusterinfo
• HDFS Transparency 2.7.3-3+ and 3.0.0+:
  - Local cluster mode: /usr/lpp/mmfs/hadoop/sbin/initmap.sh -d -i all [fsName]
  - Remote Mount mode: /usr/lpp/mmfs/hadoop/sbin/initmap.sh -d -r -u [gpfs.ssh.user] -i all [fsName]

Note:
• -d option propagates the generated files to all the nodes. This option should only be used when running
on the NameNode.
• -r option requests to run the necessary commands on the contact nodes to generate the config files if
this is a remote mounted file system.
• -u [gpfs.ssh.user] option is not necessary if gpfs.ssh.user is not set in /usr/lpp/
mmfs/hadoop/etc/hadoop/gpfs-site.xml (HDFS Transparency 2.7.3-3) or in /var/mmfs/
hadoop/etc/hadoop/gpfs-site.xml (HDFS Transparency 3.0.0+).
• -i all option generates all the required configuration files.

Table 14. Internal configuration files and location information

• HDFS Transparency 2.7.3-0 and earlier: Generated internal configuration files diskid2hostname,
nodeid2hostname, and clusterinfo4hdfs are under the /var/mmfs/etc/ directory.
• HDFS Transparency 2.7.3-1 and later: Generated internal configuration files diskid2hostname.<fs-name>,
nodeid2hostname.<fs-name>, and clusterinfo4hdfs.<fs-name> are under /var/mmfs/etc/hadoop for
HDFS Transparency 2.7.3-x or under /var/mmfs/hadoop/init for HDFS Transparency 3.0.0+.

The following are examples of the internal configuration files:

# pwd
/var/mmfs/hadoop/init

# cat clusterinfo4hdfs.latestgpfs
clusterInfo:test01.gpfs.net:8833372820164647001
fsInfo:1:2:1048576:8388608
remoteInfo:gpfs0:ess01-dat.gpfs.net,ess02-dat.gpfs.net

# cat diskid2hostname.latestgpfs
1:ess01-dat.gpfs.net:30.1.1.11
2:ess01-dat.gpfs.net:30.1.1.11
3:ess01-dat.gpfs.net:30.1.1.11
4:ess01-dat.gpfs.net:30.1.1.11
5:ess02-dat.gpfs.net:30.1.1.12
6:ess02-dat.gpfs.net:30.1.1.12
7:ess02-dat.gpfs.net:30.1.1.12
8:ess02-dat.gpfs.net:30.1.1.12

# cat nodeid2hostname.latestgpfs
1:ess01-dat.gpfs.net:30.1.1.11
2:ess02-dat.gpfs.net:30.1.1.12

Note:
• The internal configuration files can be removed from all the nodes and regenerated when the HDFS
Transparency is restarted or when the initmap.sh command is executed depending on the HDFS
Transparency version you are at. If the internal configuration files are missing, HDFS transparency
re-runs the script and will take longer to start.
• If stopping and restarting only the DataNode hits an error because the initmap files are outdated and
the configurations failed to refresh, use the touch command on the initmap files on that DataNode so
that their modification times are updated and the DataNode can come up properly, as shown in the
example after this note. See Table 14 on page 63 for the initmap config file locations.
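For example, a minimal sketch for HDFS Transparency 3.0.0+, assuming the default /var/mmfs/hadoop/init location and the file names listed in Table 14:

touch /var/mmfs/hadoop/init/clusterinfo4hdfs.<fs-name> \
/var/mmfs/hadoop/init/diskid2hostname.<fs-name> \
/var/mmfs/hadoop/init/nodeid2hostname.<fs-name>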

HDFS auditing
By default, the HDFS audit logs are not enabled in HDFS Transparency.
To enable the log4j-based HDFS audit logs, perform the following:
1. Log in to one of the CES HDFS NameNode hosts and change the value of HDFS_AUDIT_LOGGER from
INFO,NullAppender to INFO,RFAAUDIT in the /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh
file.
2. Upload the configurations to IBM Storage Scale CCR by running the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

3. Start the HDFS Transparency services.

4. The HDFS audit logs will be created in the default log directory location (/var/log/transparency)
of HDFS Transparency.
If you want to change the default location for the audit logs, update the HADOOP_LOG_DIR parameter
in hadoop-env.sh to point to the new directory.
For example:
export HADOOP_LOG_DIR=/audit/logs

Administering
Different configurations, features, and tools are supported for managing, monitoring, or automating HDFS
Transparency on IBM Storage Scale.

Managing HDFS Transparency cluster


Operations to administer and manage an HDFS Transparency cluster and its nodes.

Prerequisites for Kerberos


This topic lists the prerequisites for administering a Kerberos enabled CES HDFS cluster.
Note: Only MIT Kerberos is supported.
If you are adding a new NameNode or DataNode, execute step 1 and step 2. For all other administrative
operations, go to step 3.
1. On the new node, create the Hadoop users and groups by following the instructions in “Configuring
users, groups and file system access for IBM Storage Scale” on page 302.
2. Initialize Kerberos on the new node by running the Kerberos configuration script /usr/lpp/mmfs/
hadoop/scripts/gpfs_create_hadoop_users_dirs.py as mentioned in “Configuring Kerberos
using the Kerberos script provided with IBM Storage Scale” on page 117. This will create the principals
and keytabs specific to the new node.
3. Obtain a Kerberos token for the hdfs user to administer CES HDFS when using either the installation
toolkit method or the manual method. Run the following command:

# kinit -kt /etc/security/keytab/hdfs.headless.keytab hdfs@<Realm Name>

Note: The previous command needs to be executed on all the CES HDFS NameNodes.
4. Verify that there is a valid token by running the following command:

# klist

Listing CES HDFS IPs


The mmces address list command displays all the currently configured protocol IPs.
For more information, see the Protocol node IP further configuration topic in the IBM Storage Scale:
Concepts, Planning, and Installation Guide.
To list the CES HDFS protocol IPs, run the mmces address list command on a CES HDFS NameNode
as root:

root@c902f09x09# mmces address list

Address        Node                 Ces Group        Attributes
-------------  -------------------  ---------------  ---------------
192.0.2.5      c902f09x09.gpfs.net  hdfsmycluster1   hdfsmycluster1
192.0.2.6      c902f09x11.gpfs.net  hdfsmycluster2   hdfsmycluster2
192.0.2.7      c902f09x10.gpfs.net  none             none
192.0.2.8      c902f09x12.gpfs.net  none             none

Note: The CES group and Attributes values contain the HDFS cluster name prefixed with hdfs.
This example uses the CES Cluster name/group mycluster1 for the first HDFS Transparency cluster and
mycluster2 for the second HDFS Transparency cluster where 192.0.2.5 and 192.0.2.6 CES IPs belong to
mycluster1 and mycluster2 HDFS Transparency clusters respectively.
For installation toolkit, the spectrumscale config hdfs new -n <CLUSTER_NAME> command
automatically adds the hdfs prefix to the input CLUSTER_NAME which then becomes the CES Group and
Attribute values.
For example, spectrumscale config hdfs new -n mycluster1.
For more information, see “Adding CES HDFS nodes into the centralized file system” on page 34 or
“Separate CES HDFS cluster remote mount into the centralized file system” on page 38.
For manual installation, you need to manually set the hdfs prefix.
Run mmchnode --ces-enable --ces-group hdfsmycluster1 command to set the CES Cluster
Name/Group with hdfs prefix.

Setting CES HDFS configuration files


This section describes the configuration files settings that will be changed in the “Enable and Configure
CES HDFS” on page 42 section when using the mmhdfs command while you are manually trying to setup
the CES HDFS cluster.
Before enabling HDFS Transparency, some configuration must be set. Some of them can be done
automatically and some must be set manually.
Edit config fields
Use the following command on one CES transparency node to edit the config fields locally one at a time.
After modifying the config fields, ensure that you upload to CCR (See Edit config files and upload section):

mmhdfs config set [config file] -k [key1=value] -k [key2=value] ... -k [keyX-value]
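For example, a minimal sketch that sets two hdfs-site.xml fields locally; the values shown are only illustrations:

mmhdfs config set hdfs-site.xml -k dfs.replication=3 -k dfs.namenode.rpc-bind-host=0.0.0.0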

Edit config files and upload


Use the following command on one CES transparency node to download the configuration files, edit them
and then upload the changes into CCR:

mmhdfs config import/export [a local config dir] [config_file1,config_file2,...]

mmhdfs config upload
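For example, a minimal end-to-end sketch, assuming /tmp/hdfsconf is used as the local working directory:

mkdir /tmp/hdfsconf
/usr/lpp/mmfs/hadoop/sbin/mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml
# Edit the exported files under /tmp/hdfsconf as needed, then import and upload them
/usr/lpp/mmfs/hadoop/sbin/mmhdfs config import /tmp/hdfsconf core-site.xml,hdfs-site.xml
/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload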

Configuration file settings


The following configurations should be set to proper value to support CES IP failover:
For hadoop_env.sh:
JAVA_HOME: Set the correct java home path for the node.
For hdfs-site.xml:
• dfs.nameservices: Set to the logical name of the cluster. This must be equal to the CES group name
without the hdfs prefix.
In the following example, we use hdfscluster as the CES group name where hdfs is the prefix, and
cluster is the cluster name:

<property>
<name>dfs.nameservices</name>
<value>cluster</value>
</property>

• dfs.ha.namenodes.[nameservice ID]: Set to a list of comma-separated NameNode IDs.
For example:

<property>
<name>dfs.ha.namenodes.cluster</name>
<value>nn1,nn2</value>
</property>

If there is only one NameNode (Only one CES node which means no CES HA) the list should contain only
one ID.
For example:

<property>
<name>dfs.ha.namenodes.cluster</name>
<value>nn1</value>
</property>

• dfs.namenode.rpc-address.[nameservice ID].[namenode ID]: Set to the fully qualified RPC address for each NameNode to listen on.
For example:

<property>
<name>dfs.namenode.rpc-address.cluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>

• dfs.namenode.http-address.[nameservice ID].[namenode ID]: Set to the fully qualified HTTP address for each NameNode to listen on.
For example:

<property>
<name>dfs.namenode.http-address.cluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>

• dfs.namenode.shared.edits.dir: Set to a directory which will be used to store shared editlogs for
this HDFS HA cluster. The recommendation is to use a name like HA-[dfs.nameservices].
For example:

<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///gpfs/HA-cluster</value>
</property>

Note: If there is only one NameNode (Only one CES node which means no CES HA), do not set this
property. Otherwise, NameNode will fail to start. The NameNode shared edit dir is used for HA.
• dfs.client.failover.proxy.provider.[nameservice ID]: Set to the class that HDFS clients use to determine the active NameNode. For example:

<property>
<name>dfs.client.failover.proxy.provider.cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

• dfs.namenode.rpc-bind-host: This should be set to 0.0.0.0.
For example:

<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>

• dfs.namenode.servicerpc-bind-host: This should be set to 0.0.0.0.
For example:

<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
</property>

• dfs.namenode.lifeline.rpc-bind-host: This should be set to 0.0.0.0.
For example:

<property>
<name>dfs.namenode.lifeline.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>

• dfs.namenode.http-bind-host: This should be set to 0.0.0.0.
For example:

<property>
<name>dfs.namenode.http-bind-host</name>
<value>0.0.0.0</value>
</property>

For core-site.xml:
fs.defaultFS: This should be set to the value of the dfs.nameservices. For CES HDFS, this must be
the CES HDFS group name without the hdfs prefix.
For example:

<property>
<name>fs.defaultFS</name>
<value>hdfs://cluster</value>
</property>

Follow the “Enable and Configure CES HDFS” on page 42 section to set the configuration values for
non-HA and HA CES HDFS Transparency cluster.

Change CES HDFS NON-HA cluster into CES HDFS HA cluster

Changing CES HDFS Non-HA cluster into CES HDFS HA cluster using install toolkit
This topic lists the steps to change CES HDFS Non-HA cluster into CES HDFS HA cluster using install
toolkit.
To add another NameNode to an existing CES HDFS cluster and to set it to HA configuration, follow the
steps below:
1. If the new NameNode is not already a CES node, add it as a protocol node:
a. Add the new NameNode by running the following command:

./spectrumscale node add NAMENODE -p

b. Before you initiate the installation procedure for the new NameNode, run the following command to
perform the environment checks:

./spectrumscale install -pr

c. To set up the new NameNode, run the following command:

./spectrumscale install

2. Add the new NameNode into the HDFS cluster by running the following command:

./spectrumscale config hdfs add -n CLUSTER_NAME -nn NAMENODE

3. Disable HDFS by running the following command:

/usr/lpp/mmfs/bin/mmces service disable hdfs

4. Before you initiate the deployment, run the following command to perform the environment checks:

./spectrumscale deploy -pr

5. To deploy the new configuration, run the following command:

./spectrumscale deploy

Note: After the deployment completes, HDFS is automatically enabled.

Manually change CES HDFS NON-HA cluster into CES HDFS HA cluster
This topic lists the steps to manually change CES HDFS NON-HA cluster into CES HDFS HA cluster.
1. If the new NameNode is already a part of your IBM Storage Scale cluster, go to the next step.
Otherwise, install IBM Storage Scale on that node by following “Steps for manual installation” on
page 33. Then add the new nodes into the existing IBM Storage Scale cluster by running the following
command:

mmaddnode -N <new_namenode>

2. Log in to the new NameNode as root and install the HDFS Transparency package into the new
NameNode. Issue the following command on RHEL:

# rpm -ivh gpfs.hdfs-protocol-<version>.<arch>.rpm

3. Stop the existing HDFS Transparency NameNode.

mmces service stop hdfs

4. Enable CES on the new NameNode by giving the same CES group name as that of the existing HDFS
cluster.

mmchnode --ces-enable --ces-group hdfscluster -N c16f1n08

5. For the new NameNode, add related property into the hdfs-site.xml.
For example, the existing HDFS NON-HA cluster NameNode is c16f1n07, add another NameNode
c16f1n08 to the cluster.

mmhdfs config set hdfs-site.xml \
-k dfs.namenode.shared.edits.dir=file:///gpfs/HA-cluster \
-k dfs.ha.namenodes.cluster=nn1,nn2 \
-k dfs.namenode.rpc-address.cluster.nn2=c16f1n08.gpfs.net:8020 \
-k dfs.namenode.http-address.cluster.nn2=c16f1n08.gpfs.net:50070

6. Upload the configuration into CCR.

mmhdfs config upload

7. Initialize the NameNode shared directory to store HDFS cluster HA info from one NameNode.

/usr/lpp/mmfs/hadoop/bin/hdfs namenode -initializeSharedEdits

8. Start the existing HDFS Transparency NameNode.

mmces service start hdfs -N c16f1n07,c16f1n08

or

mmces service start hdfs -a

9. If the CES HDFS cluster is Kerberos enabled, ensure that you configure Kerberos for the new
NameNode by following “Setting up Kerberos for HDFS Transparency nodes” on page 109.
10. Check the status of the added NameNodes in the HDFS cluster.

mmces service list -a

11. Check the status of both the NameNodes. One should be Active and the other should be in standby.

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

12. Restart the DataNodes for the changes to take effect.

mmhdfs hdfs-dn restart

13. Check the status of the DataNodes.

mmhdfs hdfs-dn status

Setting configuration options in CES HDFS


This section lists the steps to set the configuration options in the CES HDFS.
To set configurations in the CES HDFS environment, run the following steps:
1. Stop HDFS Transparency.
2. Get the configuration file that you want to change.
3. Update the configuration file.
4. Import the file to CES HDFS.
5. Upload the changes to CES HDFS.
6. Start HDFS Transparency.

Setting up the gpfs.ranger.enabled field


From HDFS Transparency 3.1.1-3, ensure that the gpfs.ranger.enabled field is set to scale. The
scale option replaces the original true/false values.
1. Stop HDFS Transparency.
If you are using CDP Private Cloud Base, stop HDFS Transparency from the Cloudera Manager GUI.
Otherwise, on the CES HDFS Transparency, run the following:

/usr/lpp/mmfs/bin/mmces service stop hdfs -a
/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn stop

2. After HDFS Transparency has completely stopped, on the CES HDFS node, run the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status

3. Update the HDFS Transparency configuration files and upload the changes. Get the config files by
running the following commands:

mkdir /tmp/hdfsconf
/usr/lpp/mmfs/hadoop/sbin/mmhdfs config export /tmp/hdfsconf gpfs-site.xml
cd /tmp/hdfsconf/

4. Update the config files in /tmp/hdfsconf with the following changes:

<property>
<name>gpfs.ranger.enabled</name>
<value>scale</value>
<final>false</final>
</property>

Note: From HDFS Transparency 3.1.0-6 and 3.1.1-3, ensure that the gpfs.ranger.enabled field is
set to scale. The scale option replaces the original true/false values.
5. Import the files into the CES HDFS cluster by running the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config import /tmp/hdfsconf gpfs-site.xml

6. Upload the changes to the CES HDFS cluster by running the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

7. Start HDFS Transparency.


If you are using CDP Private Cloud Base, start HDFS Transparency from the Cloudera Manager GUI.
Click IBM Spectrum Scale > Actions > Start.
Otherwise, on the CES HDFS Transparency node, run the following:

/usr/lpp/mmfs/bin/mmces service start hdfs -a
/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn start

8. After HDFS Transparency has completely started, on the CES HDFS node, run the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status

Setting the Java heap size for NameNode/DataNode


HDFS Transparency does not set the Java heap size value in hadoop_env.sh for NameNode or
DataNode. Therefore, the JVM autoscales based on the machine memory size.
If you need to set the Java heap size, perform the following:
1. Stop HDFS Transparency.
2. Ensure that HDFS Transparency has stopped by running the following command:

mmhdfs hdfs status

3. Get the config file by running the following command:

mkdir /tmp/hdfsconf
/usr/lpp/mmfs/hadoop/sbin/mmhdfs config export /tmp/hdfsconf hadoop_env.sh
cd /tmp/hdfsconf

4. In /tmp/hdfsconf, update the hadoop_env.sh to set the -Xmx and -Xms options for
HDFS_NAMENODE_OPTS and/or HDFS_DATANODE_OPTS.
For example:

SHARED_HDFS_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
-XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=1248m -XX:MaxNewSize=1248m
-Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly -Xms9984m -Xmx9984m -Dhadoop.security.logger=INFO,DRFAS
-Dhdfs.audit.logger=INFO,DRFAAUDIT"

export HDFS_NAMENODE_OPTS="${SHARED_HDFS_NAMENODE_OPTS}
-XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node\"
-Dorg.mortbay.jetty.Request.maxFormContentSize=-1 ${HDFS_NAMENODE_OPTS}"

export HDFS_DATANODE_OPTS="-server -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC
-XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-datanode/bin/kill-data-node\"
-XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m
-Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m
-Dhadoop.security.logger=INFO,DRFAS
-Dhdfs.audit.logger=INFO,DRFAAUDIT ${HDFS_DATANODE_OPTS}
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

Note: You can set the -Xmx and -Xms options directly in the HDFS_NAMENODE_OPTS and
HDFS_DATANODE_OPTS export options.
5. Import the files into the CES HDFS cluster.
6. Upload the changes to the CES HDFS cluster.
7. Start HDFS Transparency.
8. Check the status of the HDFS Transparency cluster by running the following command:

mmhdfs hdfs status

Enabling and disabling CES HDFS


This section lists the steps to enable and disable CES HDFS.
CES HDFS NameNodes are CES protocol nodes.

Enabling CES HDFS


The following steps are relevant only when CES HDFS is disabled and you want to re-enable CES HDFS.
To enable CES HDFS, run the following steps:
1. Check the CES information by running the following commands:

# mmces node list
# mmces service list
# mmces address list

2. Reload the existing HDFS configuration by running the following command:

# mmhdfs config upload

3. Enable HDFS by running the following command:

# mmces service enable hdfs

Note: Running this command will start the NameNodes.


4. Start the DataNodes by running the following command:

# mmhdfs hdfs-dn start

5. Verify the status of CES HDFS by running the following command:

# mmces node list
# mmces service list
# mmces address list
# mmhdfs hdfs status

Disabling CES HDFS


The following steps are relevant only when CES HDFS is enabled and you want to disable CES HDFS.
Note that running the following steps will not delete but only disable the CES HDFS protocol.
To disable CES HDFS, run the following steps:
1. Check the CES information by running the following command:

# mmces node list
# mmces service list
# mmces address list

2. Disable CES HDFS by running the following command:

# mmces service disable hdfs

Note: Running this command will stop the NameNodes.
3. Stop the DataNodes by running the following command:

# mmhdfs hdfs-dn stop

4. Verify the status of CES HDFS by running the following command:

# mmces service list
# mmces address list
# mmces node list

Note: In the output of the mmces node list command, the Node Flags column might be set to
Failed. This output occurs because HDFS is disabled.

Removing a NameNode from existing HDFS HA cluster

Removing a NameNode from existing HDFS HA cluster using install toolkit


Removing the NameNodes and DataNodes using the install toolkit is not supported.

Manually remove a NameNode from existing HDFS HA cluster


This topic lists the steps to manually remove a NameNode from existing HDFS HA cluster.
1. Stop the existing HDFS Transparency NameNodes.

mmces service stop hdfs -N c16f1n07,c16f1n08

or

mmces service stop hdfs -a

2. Disable the CES HDFS services on the NameNode that you want to remove.

mmchnode --ces-disable -N c16f1n08

3. Remove the NameNode related property from the hdfs-site.xml.


For example, the existing HDFS HA cluster NameNodes are c16f1n07 (nn1) and c16f1n08 (nn2). The
NameNode that will be removed is c16f1n08.

mmhdfs config del hdfs-site.xml \
-k dfs.namenode.shared.edits.dir \
-k dfs.namenode.rpc-address.cluster.nn2 \
-k dfs.namenode.http-address.cluster.nn2

mmhdfs config set hdfs-site.xml -k dfs.ha.namenodes.cluster=nn1

4. Upload the configuration into CCR.

mmhdfs config upload

5. Start the existing HDFS Transparency NameNode.

mmces service start hdfs

6. Check the HDFS NameNode status in the existing HDFS cluster.

mmces service list -a

7. Restart the DataNodes for the changes to take effect.

mmhdfs hdfs-dn restart

8. Check the DataNode status.

mmhdfs hdfs-dn status

Adding a new HDFS cluster into existing HDFS cluster on the same GPFS
cluster (Multiple HDFS clusters)
This section describes how to add a new HDFS Transparency cluster onto the same GPFS cluster that
already has an existing HDFS Transparency cluster. This will create multiple HDFS clusters onto the same
GPFS cluster.

Adding a new HDFS cluster into existing HDFS cluster on the same GPFS cluster using
install toolkit
The “Using installation toolkit” on page 34 section describes how to add in a new HDFS cluster into the
environment.
The difference when creating another HDFS cluster into an existing HDFS cluster on the same GPFS
cluster is to create a different cluster name for the new HDFS cluster.
For example, use CLUSTER2 as the cluster name for the second HDFS cluster to be added into the existing
1st HDFS cluster:
1. Add the new 2nd HDFS cluster nodes into the GPFS cluster.
Ensure that the nodes are new nodes and not a part of the existing HDFS cluster.

# NameNodes (Protocol node)
./spectrumscale node add c902f09x01.gpfs.net -p
./spectrumscale node add c902f09x02.gpfs.net -p

# DataNodes
./spectrumscale node add c902f09x03.gpfs.net
./spectrumscale node add c902f09x04.gpfs.net
./spectrumscale node add c902f09x05.gpfs.net
./spectrumscale node add c902f09x06.gpfs.net

2. Configure the 2nd cluster CES HDFS cluster.

./spectrumscale config hdfs new -n CLUSTER2 -nn NAMENODES -dn DATANODES -f FILESYSTEM -d DATADIR
./spectrumscale config hdfs new -n CLUSTER2 -nn c902f09x01.gpfs.net,c902f09x02.gpfs.net -dn c902f09x03.gpfs.net,c902f09x04.gpfs.net,c902f09x05.gpfs.net,c902f09x06.gpfs.net -f gpfs -d gpfshdfs2

3. Deploy the 2nd cluster.

./spectrumscale deploy -pr
./spectrumscale deploy

Note:
• Ensure that there are sufficient free CES-IPs available for usage.
• Ensure that the new cluster NameNodes and DataNodes are not the same nodes as the existing
HDFS cluster.
• Ensure that the DATADIR is unique to host the second cluster’s data.

Manually adding a new HDFS cluster into existing HDFS cluster on the same GPFS
cluster (Multiple HDFS clusters)
This topic lists the steps to manually add a new HDFS cluster into existing HDFS cluster on the same GPFS
cluster (Multiple HDFS clusters).
1. Create different CES groups for different HDFS clusters and ensure that the existing HDFS cluster
nodes are different than the new HDFS cluster nodes that will be added.

2. For the first HDFS cluster, the HDFS configuration, including core-site.xml, hdfs-site.xml, gpfs-site.xml,
and hadoop-env.sh, must be set on a NameNode belonging to that cluster. Run the following
command to upload the configuration into CCR:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

3. Start the NameNodes service of the first HDFS cluster.

mmces service start hdfs -a

4. Check that the NameNodes service status is running for the first HDFS cluster.

mmces service list -a

5. For the second HDFS cluster, the HDFS configuration, including core-site.xml, hdfs-site.xml, gpfs-site.xml,
and hadoop-env.sh, must be set on a NameNode belonging to the second cluster.
The value of dfs.nameservices should be set to the cluster name of the second HDFS cluster.
Run the following command to upload the configuration into CCR. The uploaded configuration will be
pushed to all the NameNodes of the new HDFS cluster.

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

6. Enable CES for the NameNodes of the new HDFS cluster and set the CES group name to the CES group
name of the new HDFS cluster.

mmchnode --ces-enable --ces-group=[groupname_addedhdfscluster] -N NewClusterNameNode1,NewClusterNameNode2

7. Start the NameNodes service of the new HDFS cluster, if not started already.

mmces service start hdfs -a

8. Check if the NameNodes service status is running for the new HDFS cluster.

mmces service list -a

9. Log in to one of the newly added NameNodes and start the DataNodes of the new HDFS cluster.

mmhdfs hdfs-dn start

Removing an existing HDFS cluster from multiple HDFS clusters of the same
GPFS cluster

Removing an existing HDFS cluster from multiple HDFS clusters of the same GPFS
cluster using install toolkit
Removing an existing HDFS cluster is not supported by the installation toolkit command line interface.

Manually remove an existing HDFS cluster from multiple HDFS clusters of the same
GPFS cluster
This topic lists the steps to manually remove an existing HDFS cluster from multiple HDFS clusters of the
same GPFS cluster.
1. Stop all the Hadoop services and CES HDFS services.

mmces service stop HDFS -N nn1,nn2


mmhdfs hdfs-dn stop

2. Disable the CES HDFS service on the NameNodes of the HDFS cluster that is being removed, to stop the
NameNode service.



mmchnode --noces-group [groupname_removedhdfscluster] -N removeNameNode1,removeNameNode2

3. Remove the configuration files from CCR. The [clustername] value should be the value of
dfs.nameservices that corresponds to the CES group name or the hostname of the corresponding CES
IP (see the example after this procedure).

mmccr fdel [clustername].tar

4. Check that the removed NameNode service is not running in the existing HDFS cluster.

mmces service list -a
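If you are not sure of the exact file name to delete in step 3, the files that are currently stored in CCR can
be listed first and the [clustername].tar entry identified from the output, for example:

/usr/lpp/mmfs/bin/mmccr flist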

Adding DataNodes using installation toolkit


The CES HDFS NameNodes and DataNodes do not need to be stopped when adding or deleting
DataNodes from the cluster.
The following are the two ways to add DataNodes into an existing CES HDFS cluster:
• Add new DataNodes into an existing CES HDFS cluster.
• Add existing IBM Storage Scale nodes that are already a part of the IBM Storage Scale cluster as new
DataNodes, to the CES HDFS cluster.
Adding new DataNodes into an existing CES HDFS cluster
1. On the new nodes, ensure that the prerequisites are installed so that the nodes can be deployed by the
installation toolkit.
For more information on basic IBM Storage Scale requirements, see “Installation prerequisites” on
page 30.
2. Log into the existing CES HDFS cluster installer node as root and change to the installer directory to
run the spectrumscale commands.
For IBM Storage Scale 5.1.1 and later:

# cd /usr/lpp/mmfs/5.1.1.0/ansible-toolkit

For IBM Storage Scale 5.1.0 and earlier:

# cd /usr/lpp/mmfs/5.0.4.2/installer

3. Check the current CES HDFS cluster host information.

# ./spectrumscale config hdfs list

4. Add the new nodes (DataNodes) into an IBM Storage Scale cluster.

# ./spectrumscale node add <hostname>

5. Perform environment checks before initiating the installation procedure.

# ./spectrumscale install -pr

Start the IBM Storage Scale installation and add the nodes into the existing cluster.

# ./spectrumscale install

6. To add a new DataNode into an existing CES HDFS cluster, run the following command:

# ./spectrumscale config hdfs add -n <Existing HDFS cluster name> -dn <new_DataNode_hostname>

7. Check the CES HDFS host list to ensure that the new hosts have been added.

# ./spectrumscale config hdfs list



8. Perform environment checks before initiating the deployment procedure.

# ./spectrumscale deploy -pr

9. Start the IBM Storage Scale installation and the creation of the new CES HDFS nodes.

# ./spectrumscale deploy

Adding the existing IBM Storage Scale nodes to an existing CES HDFS cluster
1. Log in to the existing CES HDFS cluster installer node and change to the installer directory to run the
spectrumscale commands.
For IBM Storage Scale 5.1.1 and later:

# cd /usr/lpp/mmfs/5.1.1.0/ansible-toolkit

For IBM Storage Scale 5.1.0 and earlier:

# cd /usr/lpp/mmfs/5.0.4.2/installer

2. Check the current CES HDFS cluster host information.

# ./spectrumscale config hdfs list

3. Add a new DataNode into an existing CES HDFS cluster.

# ./spectrumscale config hdfs add -n <Existing HDFS cluster name> -dn <new_DataNode_hostname>

4. Check the CES HDFS host list to ensure that the new hosts are added.

# ./spectrumscale config hdfs list

5. Perform environment checks before initiating the deployment procedure.

# ./spectrumscale deploy -pr

6. Start the IBM Storage Scale installation and the creation of the new CES HDFS nodes.

# ./spectrumscale deploy

Adding DataNodes manually


The CES HDFS NameNodes and DataNodes do not need to be stopped when adding or deleting
DataNodes from the cluster.
The following are the two ways to add DataNodes into an existing CES HDFS cluster:
• Add new DataNodes into an existing CES HDFS cluster.
• Add existing IBM Storage Scale nodes that are already a part of the IBM Storage Scale cluster as new
DataNodes, to the CES HDFS cluster.
Adding new DataNodes into an existing CES HDFS cluster
1. If you have a new DataNode, install IBM Storage Scale by following the “Steps for manual installation”
on page 33 topic and then add the new nodes into the existing IBM Storage Scale cluster by using the
mmaddnode -N command.
If you have existing IBM Storage Scale nodes that already have the IBM Storage Scale packages
installed and configured, go to the next step.
2. Log in to the new DataNode as root.
3. Install the HDFS Transparency package on the new DataNode.



On Red Hat Enterprise Linux, issue the following command:

# rpm -ivh gpfs.hdfs-protocol-<version>.<arch>.rpm

4. On the NameNode as root, edit the workers configuration file to add the new DataNode.

# vi /var/mmfs/hadoop/etc/hadoop/workers

5. On the NameNode as root, upload the modified configuration.

# mmhdfs config upload

6. On the NameNode as root, copy the init directory to the DataNode.

# scp -r /var/mmfs/hadoop/init [datanode]:/var/mmfs/hadoop/

7. If the CES HDFS cluster is Kerberos enabled, ensure that you configure Kerberos for the new DataNode
by following “Setting up Kerberos for HDFS Transparency nodes” on page 109.
8. On the DataNode as root, start the DataNode.

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode start

9. On the NameNode, confirm that the DataNode is shown in the DataNode list with the correct status by
running the following command:

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn status

Removing DataNodes manually


The CES HDFS NameNodes and DataNodes do not need to be stopped when adding or deleting
DataNodes from the cluster.
To remove a DataNode from the CES HDFS cluster, perform the following steps:
1. On the DataNode as root, stop the DataNode service.

# mmhdfs datanode stop

2. On the NameNode as root, edit the workers configuration file to remove the DataNode from the
DataNode list.

# vi /var/mmfs/hadoop/etc/hadoop/workers

3. On the NameNode as root, upload the modified configuration into CES.

# mmhdfs config upload

4. On the NameNode, confirm that the DataNode is not shown in the DataNode list by running the
following command:

# mmhdfs hdfs-dn status

Note: The HDFS Transparency NameNodes must be restarted to fetch the information about the removed
DataNode. Before this restart is completed, the removed DataNode is listed as "dead datanode" if you
run the hdfs dfsadmin -report command.

Decommissioning DataNodes
This section lists the steps to decommission DataNodes.
To decommission a DataNode from the HDFS cluster, perform the following steps:



1. Modify the dfs.exclude file as specified under the dfs.hosts.exclude value in hdfs-site.xml
by adding the nodes to be decommissioned.

dfs.exclude file
<hostname1>
<hostname2>

2. Run the following command for the changes to take effect:

hdfs dfsadmin -refreshNodes

3. Monitor the decommissioning process by running the following command:

hdfs dfsadmin -report

Note: When the DataNode is DECOMMISSIONED, you can stop the DataNode.
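The location of the dfs.exclude file depends on what dfs.hosts.exclude is set to in your hdfs-site.xml. One
way to check the configured path before editing the file, using the mmhdfs config get form shown
elsewhere in this guide, is:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config get hdfs-site.xml -k dfs.hosts.exclude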

Frequently used commands


This section lists the commonly used commands and their options.
mmhdfs
• To check the status of the NameNodes and DataNodes in the HDFS Transparency cluster, run the
following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status

• To start the DataNode, run the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn start

• To stop the DataNode, run the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn stop

mmces
• To start the HDFS Transparency NameNodes, run the following command:

/usr/lpp/mmfs/bin/mmces service start hdfs -a

Note: Do not use the mmhdfs command to start the HDFS Transparency NameNodes.


• To stop the HDFS Transparency NameNodes, run the following command:

/usr/lpp/mmfs/bin/mmces service stop hdfs -a

Note: Do not use the mmhdfs command to stop the HDFS Transparency NameNodes.


• To check the value of the CES HDFS protocol IPs, run the following command:

/usr/lpp/mmfs/bin/mmces address list

• To verify the CES HDFS service, run the following command:

/usr/lpp/mmfs/bin/mmces service list -a

mmhealth
• To show the health status of the node, run the following command:

/usr/lpp/mmfs/bin/mmhealth node show

• To show the NameNode health status, run the following command on the NameNode:

/usr/lpp/mmfs/bin/mmhealth node show HDFS_Namenode -v



• To show the DataNode health status, run the following command on the DataNode:

/usr/lpp/mmfs/bin/mmhealth node show HDFS_Datanode -v

hdfs
• To retrieve the status and check the state of all the HDFS NameNodes, run the following command:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

• To manually trigger the HDFS Transparency cluster to failover to the standby NameNode, run the
following commands:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -failover nn1 nn2


/usr/lpp/mmfs/bin/mmces address move --ces-ip x.x.x.x --ces-node node.example.com

Note: Replace the --ces-ip parameter with the CES IP address of the HDFS Transparency cluster and
the --ces-node parameter with the name of the new active NameNode.
For more information on the mmhdfs, mmces, and mmhealth commands, see the respective command in the
IBM Storage Scale: Command and Programming Reference Guide.

Monitoring HDFS Transparency status using the mmhealth command


The mmhealth command helps in monitoring the status of HDFS Transparency by using the PID.
To ensure that the mmhealth command picks up the correct HDFS Transparency PID, modify the
pidfilepath and slavesfile fields in the mmsysmonitor.conf file to match your environment
setup.
As root, perform the following steps on each HDFS Transparency node:
1. In the /var/mmfs/mmsysmon/mmsysmonitor.conf file, go to the [hadoopconnector] section. If the
pidfilepath and slavesfile fields do not match your environment setup, modify these fields as
follows:

[hadoopconnector]
monitorinterval = 30
monitoroffset = 0
clockalign = false
# Optional entries to override the current defaults in the HadoopConnector monitor
# Path to the PID file directory
# default is /var/run/hadoop/root
pidfilepath =      <--- Edit this value and keep a blank after the "=" sign before the actual value
# Path to hadoop binaries
# default is /usr/lpp/mmfs/hadoop/bin
binfilepath =
# Full path to the hadoop-env.shl file
# default is /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh
envfile =
# Full path to the slaves file
# default is /var/mmfs/hadoop/etc/hadoop/slaves
slavesfile =       <--- Edit this value if needed. The filename can be "slaves" or "workers"
# Full path to the core-site.xml file
# default is /var/mmfs/hadoop/etc/hadoop/core-site.xml
coresitefile =
# Full path to the GPFS binary path (mmhadoopctl program)
# default is /usr/lpp/mmfs/bin
gpfsbinfilepath =

2. Restart the system health monitor by running the following command:

/usr/lpp/mmfs/bin/mmsysmoncontrol restart

For example:
a. Edit the /var/mmfs/mmsysmon/mmsysmonitor.conf file on all the HDFS Transparency nodes
as follows:



For Open Source Apache:

pidfilepath = /tmp
binfilepath = /usr/lpp/mmfs/hadoop/bin
envfile = /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh
slavesfile = /var/mmfs/hadoop/etc/hadoop/workers
coresitefile = /var/mmfs/hadoop/etc/hadoop/core-site.xml
gpfsbinfilepath = /usr/lpp/mmfs/bin

b. On all the HDFS Transparency nodes, run the following command:

mmsysmoncontrol restart

c. Check the HDFS Transparency status by running the following command:

mmhealth node show hadoopconnector -v

Monitoring HDFS Transparency status using IBM Storage Scale GUI


The IBM Storage Scale GUI can be used to monitor the state of the HDFS NameNodes and DataNodes. For
CES HDFS integration with the IBM Storage Scale GUI, the GUI displays CES-specific information such as the
CES status, CES network states, CES node group, and CES network addresses.
To access the CES HDFS information through the GUI:
1. Log in to the IBM Storage Scale GUI to access the CES HDFS state information.
2. From the left-hand navigation, click Services.
3. Click the HDFS Transparency service.
The HDFS Transparency service contains the NameNodes, DataNodes, and Events tabs, which contain the
information about the HDFS Transparency cluster node states.
Note: When a protocol is enabled, it is enabled on all the nodes that are configured as protocol nodes.
However, only the nodes specified as the NameNode(s) are enabled as CES HDFS nodes.
Therefore, the HDFS Transparency NameNodes list might be shorter than the overall list of nodes shown in
the CES Nodes service panel.

Recovering an HDFS Transparency cluster


Learn how to bring an HDFS Transparency cluster back online.
Disk failures or other unforeseen storage issues sometimes cause IBM Storage Scale file systems to
be unmounted. In such cases, HDFS Transparency automatically shuts itself down, and workloads
might report an exception. When the IBM Storage Scale cluster is functioning again, follow this recovery
procedure to bring HDFS Transparency back online.
1. Shut down the HDFS Transparency NameNodes and DataNodes by using the following commands.

# mmces node suspend --stop -N <NameNode1_HOST>,<NameNode2_HOST>


# mmhdfs hdfs-dn stop

Use the mmces node suspend command to stop the NameNodes. This command is needed to
ensure that the root directory shared with CES gets unlocked.
2. To retrieve the mount points that HDFS Transparency uses for the IBM Storage Scale file system, run
the mmhdfs config get command as shown in the following example.
Example:

# mmhdfs config get gpfs-site.xml -k gpfs.mnt.dir


gpfs.mnt.dir=/gpfs1,/gpfs2

3. If a secondary file system is configured, unmount that one first. Then, unmount the primary file
system.



Example:

# mmumount gpfs2 -a
# mmumount gpfs1 -a

4. Check the status of HDFS Transparency to ensure that all the NameNodes and DataNodes are down.

# mmhdfs hdfs status

5. Remount the IBM Storage Scale file systems.

# mmmount gpfs1 -a
# mmmount gpfs2 -a

Make sure that all the file systems are successfully mounted. Use the mmlsmount and mount
commands to verify (see the example after this procedure).
6. Start the HDFS Transparency NameNodes and DataNodes.

# mmces node resume --start -N <NameNode1_HOST>,<NameNode2_HOST>


# mmhdfs hdfs-dn start

7. Check the status of HDFS Transparency to confirm that all the NameNodes and DataNodes are
running.

# mmhdfs hdfs status
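As an additional check for step 5, the mount state of each file system can be listed across the cluster
before the services are restarted; the following sketch reuses the example file system names gpfs1 and
gpfs2:

# Show the nodes on which each file system is currently mounted
/usr/lpp/mmfs/bin/mmlsmount gpfs1 -L
/usr/lpp/mmfs/bin/mmlsmount gpfs2 -L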

Kerberos
This section describes how to set up Kerberos under CES HDFS.
Note:
• MIT Kerberos and Red Hat IPA Kerberos are supported.
• As per the Kerberos prerequisites, the hostnames of the nodes that belong to the cluster must be in
lowercase.
• If you need to set up more than one HDFS Transparency cluster by using a common KDC server, note the
following limitation: if Kerberos was configured on multiple HDFS Transparency clusters by using a
common KDC server and the supplied gpfs_kerberos_configuration.py script, kinit with the hdfs
user principal fails for all the clusters except the most recent one.
This limitation has been fixed in HDFS Transparency 3.1.1.6.

Prerequisites
Learn the prerequisites that you must comply with before you enable Kerberos.
• Configure FQDN for all the hostname entries in your environment for consistency before you enable
Kerberos. The hostname and hostname -f command outputs should have FQDN information.
• For all the hostname entries that are being replaced, ensure that you use FQDN hostnames from your
environment.
• It is recommended to have hostname resolution through a working DNS setup.
• Synchronize clocks by using chronyd. Use the following command to check the time on all the IBM
Storage Scale nodes:

# mmdsh -N all "date +%m%d%H%M%S%N"

• Change hostnames before you enable Kerberos. If you must change a hostname after enabling Kerberos,
re-create the principals and keytabs as well.
• If you need to set up more than one HDFS Transparency cluster by using a common KDC server, see
Note.



• If you use Microsoft Active Directory based Kerberos with version 8u241 or a higher version of the Java
Development Kit (JDK), for example OpenJDK 1.8.0u242+, you must disable referrals by making the
following configuration:

sun.security.krb5.disableReferrals=true

Otherwise, HDFS Transparency services could fail to authenticate with each other through Kerberos.
For more information about Java requirements, see Cloudera Docs.
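Where exactly this property is set depends on your JDK packaging. As a hedged sketch, it is typically
appended to the JDK's java.security file on every node that runs Hadoop or HDFS Transparency services;
the path below is an example only and must be adjusted to your JDK installation:

# Example path for an OpenJDK 1.8 installation; adjust to your environment.
# If the property already exists in java.security, edit it there instead of appending.
echo "sun.security.krb5.disableReferrals=true" >> /usr/lib/jvm/jre-1.8.0-openjdk/lib/security/java.security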

Kerberos authentication with Active Directory (AD) support

Overview
Kerberos is a network authentication protocol. It is designed to provide strong authentication for client/
server applications by using secret-key cryptography.
An active directory is a database that keeps track of all the user accounts and passwords in your
organization. By using an active directory, you can store your user accounts and passwords in one
protected location, which can improve the security of your organization. Network administrators can use
active directories to allow or deny access to specific applications by end users through the trees in the
network.
This section shows how to set up Kerberos authentication with the Windows AD service on the HDP and
HDFS Transparency cluster.

Prerequisites

Simplify Windows computer name


This topic lists the steps to simplify the Windows computer name.
Simplify the Windows computer name through the following sequence of operations (optional step):
1. On the Start screen, type Control Panel, and press ENTER.
2. Navigate to System and Security, and then click System.
3. Click Advanced system settings.
4. Under Computer name click Change settings.
5. On the Computer Name tab, add a simple computer name such as “adserver” and click OK.
6. Restart the computer.

Add Windows IP/hostname


Add the Windows IP address and full hostname into the /etc/hosts file on all the Hadoop nodes. This is
required if the nodes (Hadoop nodes and the Windows computer) cannot be resolved through DNS.
This example adds the following Windows IP address and hostname:

192.0.2.10 adserver.ad.gpfs.net

Synchronize Linux and Windows time


On all the Hadoop and GPFS nodes, run the ntpdate command to synchronize the time across all nodes.

[root@c902f14x13 ~]# ntpdate adserver.ad.gpfs.net


13 May 12:52:32 ntpdate[3753]: step time server 192.0.2.10 offset 342.048227 sec

Add all Hadoop nodes into Windows hosts


On the Windows server, add all the Hadoop nodes in C:\Windows\System32\drivers\etc\hosts if
the Hadoop node's IPs are not resolvable by the Windows server.

# localhost name resolution is handled within DNS itself.


# 192.0.2.22 localhost
# ::1 localhost



192.0.2.11 c902f05x04.gpfs.net
192.0.2.12 c902f05x05.gpfs.net
192.0.2.13 c902f05x06.gpfs.net
192.0.2.14 c902f05x01.gpfs.net
192.0.2.15 c902f05x02.gpfs.net
192.0.2.16 c902f05x03.gpfs.net
192.0.2.17 c902f14x01.gpfs.net
192.0.2.18 c902f14x02.gpfs.net
192.0.2.19 c902f14x03.gpfs.net

Install and Configure Active Directory


A Domain Controller (DC) allows the creation of logical containers. These containers consist of users,
computers, and groups. Domain Controllers also help in organizing and managing the servers.
Active Directory is a service that runs on the Domain Controller and is used to create these logical
containers.
Follow the steps below to set up the Active Directory services and promote the server to a Domain Controller.
This example uses the following information:

Root domain name: AD.GPFS.NET


Password: Admin1234
NetBIOS domain name: AD0

1. Navigate to the Windows Server Manager.


2. Click Add Roles and Features.

3. The Add Roles and Features wizard opens. Click Next.



4. Select the server from the server pool and click Next.

5. Select the Active Directory Domain Services check box.



6. On the pop-up window, click Add Features.

7. On the description window of Active Directory Domain Services, click Next.



8. Click Install on the Confirmation window.

9. Installation process begins.



10. After installing the AD DS role, you can promote this server to a Domain Controller.

11. Select Add a new forest and enter the Root domain name, ad.gpfs.net. Click Next.



12. Enter the Directory Services Restore Mode password.

13. Ignore the warning message.



14. Use the default NetBIOS domain name and click Next.

15. Use the default paths and click Next.



16. Review and click Next if no errors.

17. Click Install and wait for the installation to finish.



18. The Domain Controller is now set up.

Enable Kerberos on existing Active Directory


This section lists the steps to enable Kerberos on existing Active Directory.

Installation and Configuration of Active Directory Certificate Services


Active Directory Certificate Services is one of the essential services that is required for certificate
management within the organization.
If the domain is up and running, then the ADCS is successfully installed and configured.

Note: This is required only if you are generating your own certificates for Active Directory.

Create AD user and delegate control


Create a container, Kerberos admin, and set permissions for the cluster.
1. Navigate to Server Manager > Tools > Active Directory Users and Computers.
2. Click View and check Advanced Features.



3. Create a container. This example uses the name IBM. Navigate to Action > New > Organizational
Unit.

4. Specify the container name (Example uses name “IBM”).

5. Create a user named hdpad. Navigate to Action > New > User.



6. Specify the User logon name.

7. Delegate control of the container to hdpad. Right-click on the new container (IBM), and select
Delegate Control.



8. In the Delegation of Control Wizard, enter hdpad and click Check Names.

9. Confirm that the hdpad name is listed and click Next.



10. In the Tasks to Delegate field, check Create, delete, and manage user accounts.

11. Navigate to AD.GPFS.NET > Properties > Security and add the hdpad user.



Adding the domain of your Linux host(s) to be recognized by Active Directory
This topic lists the steps to add the domain of your Linux host(s) to be recognized by Active Directory.
Note: This step is required only if the domain of the Linux servers is different than the Active Directory.
1. On the Windows Host, navigate to Server Manager > Tools > Active Directory Domains and Trusts.
2. Click Actions > Properties > UPN Suffixes.
3. Add the alternative UPN suffix. This is determined by running hostname -d on your Linux server.

Configuring AD in Ambari
This section describes how to configure Kerberos with existing AD through the Ambari GUI.

Configuring Secure LDAP connection


The Lightweight Directory Access Protocol (LDAP) is used to read from and write to Active Directory.
By default, LDAP traffic is transmitted unsecured. To make LDAP traffic confidential and secure, use
Secure Sockets Layer (SSL) / Transport Layer Security (TLS) technology. Enable LDAP over SSL (LDAPS)
by installing a properly formatted certificate from either a Microsoft certification authority (CA) or a
non-Microsoft CA.



Follow the Microsoft documentation to configure a secure LDAP connection on Windows Server 2016 and
verify that secure LDAP is working before you continue.

Trust the Active Directory certificate


Note: This is required for self-signed certificates. This step can be skipped if a purchased SSL certificate is
in use.
On the Windows host:
1. Navigate to Server Manager > Tools > Certificate Authority.
2. Click Action > Properties.
3. Click General Tab > View Certificate > Details > Copy to File.
4. Choose the format: Base-64 encoded X.509 (.CER).
5. Choose a file name. For example, hdpad.cer, and save.
6. Open with Notepad and copy contents.
On the Linux host:
1. Create file /etc/pki/ca-trust/source/anchors/hdpad.cer and paste in the certificate
contents.



2. Trust the CA certificate:

yum install openldap-clients ca-certificates


update-ca-trust enable
update-ca-trust extract
update-ca-trust check

3. Trust the CA certificate in Java:

/usr/jdk64/jdk1.8.0_112/bin/keytool -importcert -noprompt -storepass changeit -file /etc/pki/ca-trust/source/anchors/hdpad.crt -alias ad -keystore /etc/pki/java/cacerts

[root@c902f14x12 ~]# ambari-server setup-security


Using python /usr/bin/python
Security setup options...
===========================================================================
Choose one of the following options:
[1] Enable HTTPS for Ambari server.
[2] Encrypt passwords stored in ambari.properties file.
[3] Setup Ambari kerberos JAAS configuration.
[4] Setup truststore.
[5] Import certificate to truststore.
===========================================================================
Enter choice, (1-5): 4
Do you want to configure a truststore [y/n] (y)? y
TrustStore type [jks/jceks/pkcs12] (jks):jks
Path to TrustStore file :/etc/ambari-server/conf/hdpad.jks
Password for TrustStore:
Re-enter password:
Ambari Server 'setup-security' completed successfully.

[root@ c902f14x12 ~]# ambari-server setup-security


Using python /usr/bin/python
Security setup options...
===========================================================================
Choose one of the following options:
[1] Enable HTTPS for Ambari server.
[2] Encrypt passwords stored in ambari.properties file.
[3] Setup Ambari kerberos JAAS configuration.
[4] Setup truststore.
[5] Import certificate to truststore.
===========================================================================
Enter choice, (1-5): 5
Do you want to configure a truststore [y/n] (y)? y
Do you want to import a certificate [y/n] (y)? y
Please enter an alias for the certificate: ad
Enter path to certificate: /etc/pki/ca-trust/source/anchors/hdpad.crt
Ambari Server 'setup-security' completed successfully.

[root@c902f14x12 ~]# ambari-server restart


Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start................
Server started listening on 8080

DB configs consistency check: no errors and warnings were found.

Enable Kerberos in Ambari


1. Open Ambari in your browser.
2. Ensure that all services are working before proceeding.
3. Click Admin > Kerberos.



4. On the Getting Started page, choose Existing Active Directory and make sure that all of the
requirements are met.

5. On the Configure Kerberos page, set the configurations that are required for your Active Directory
environment, and follow the Ambari GUI to complete the Kerberos setup.



Create a one-way trust from an MIT KDC to Active Directory
Instead of using the KDC of Active Directory server to manage service principals, use a local MIT KDC
in the Hadoop cluster to manage the service principals while using a one-way trust to allow AD users to
utilize the Hadoop environment.

Prerequisites
Before setting up a one-way trust from an MIT KDC to Active Directory, ensure the following:
1. The existing HDP cluster has Kerberos enabled with an MIT KDC.
2. An existing AD server (or a newly created one) is running and promoted to a Domain Controller.
In this example, the following information is used:

MIT KDC realm name: IBM.COM


MIT KDC server name: c902f05x04.gpfs.net
AD domain/realm: AD.GPFS.NET

Configure the Trust in Active Directory


On the AD server, run the following command in a command window with Administrator privilege and
create a definition for the KDC of the MIT realm.

ksetup /addkdc IBM.COM c902f05x04.gpfs.net

On the AD server, create an entry for the one-way trust.


Note: The password used here will be used later in the MIT KDC configuration of the trust.

netdom trust IBM.COM /Domain:AD.GPFS.NET /add /realm /passwordt:Admin1234
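To review what was registered on the AD server, both settings can be queried afterwards; this is only a
verification sketch and makes no changes:

REM Displays the configured realms and KDC mappings, including IBM.COM
ksetup

REM Lists the trusts known to this domain, which should now include the IBM.COM realm trust
netdom query trust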

Configure Encryption Types


The encryption types between both KDCs (AD KDC and MIT KDC) must be compatible, so that the tickets
generated by AD KDC can be trusted by the MIT realm. There must be at least one encryption type that is
accepted by both KDCs.
Review the encryption types in Local Security Policy > Local Policies > Security Options > Network
security: Configure encryption types allowed for Kerberos > Local Security Setting.



On the AD server, specify which encryption types are acceptable for communication with the MIT realm.

ksetup /SetEncTypeAttr IBM.COM AES256-CTS-HMAC-SHA1-96 AES128-CTS-HMAC-SHA1-96 DES-CBC-MD5 DES-CBC-CRC RC4-HMAC-MD5

On the MIT KDC server, change the /etc/krb5.conf file to specify encryption types in MIT KDC. By default,
all of the encryption types are accepted by the MIT KDC.

[libdefaults]
permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1 rc4 des-cbc-
md5

Enable Trust in MIT KDC


Add the trust to MIT KDC to complete the trust configuration.
In the /etc/krb5.conf file, add the AD domain.
In this example, domain AD.GPFS.NET is the added AD domain.

[realms]
IBM.COM = {
admin_server = c902f05x04.gpfs.net
kdc = c902f05x04.gpfs.net
}

AD.GPFS.NET = {
kdc = adserver.ad.gpfs.net
admin_server = adserver.ad.gpfs.net
}

On the MIT KDC server, create a principal that combines the realms in the trust.
Note: The password for this principal must be the same as the password used to create the trust on the
AD server.

[root@c902f05x04 ~]# kadmin.local


Authenticating as principal nn/[email protected] with password.
kadmin.local: addprinc krbtgt/IBM.COM@AD.GPFS.NET
WARNING: no policy specified for krbtgt/IBM.COM@AD.GPFS.NET; defaulting to no policy
Enter password for principal "krbtgt/IBM.COM@AD.GPFS.NET":
Re-enter password for principal "krbtgt/IBM.COM@AD.GPFS.NET":
add_principal: Principal or policy already exists while creating "krbtgt/IBM.COM@AD.GPFS.NET".
kadmin.local:

Configure AUTH_TO_LOCAL
In Ambari or in the core-site.xml file, add Auth_To_Local rules to properly convert the user principals from
the AD domain to usable usernames in the Hadoop cluster.

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1@$0](^.*@AD.GPFS.NET$)s/^(.*)@AD.GPFS.NET$/$1/g
RULE:[2:$1@$0](^.*@AD.GPFS.NET$)s/^(.*)@AD.GPFS.NET$/$1/g
DEFAULT</value>
</property>
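The resulting mapping can be checked without restarting any service: Hadoop ships a helper class that
evaluates the configured auth_to_local rules for a given principal. A sketch, where the principal is only an
example AD user:

# Prints the short name that the rules produce for the given principal
hadoop org.apache.hadoop.security.HadoopKerberosName testuser@AD.GPFS.NET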

Configure Transparency with Active Directory


This section describes how to configure HDFS Transparency (without HDP) with Active Directory.
To enable Kerberos with existing AD for Transparency, follow “Enable Kerberos in Ambari” on page 98
section.

Create Domain Users and export keytab file


On the AD server, create Domain users for all the DataNodes and NameNodes for the HDFS Transparency
cluster.
For example, the cluster contains:

• Two NameNodes: c902f08x06 and c902f08x07
• Four DataNodes: c902f08x05 to c902f08x08

Create the following users in AD for the HDFS Transparency cluster:

• nn1 with Display name nn/c902f08x06.gpfs.net and User logon name nn/c902f08x06.gpfs.net@AD.GPFS.NET
• nn2 with Display name nn/c902f08x07.gpfs.net and User logon name nn/c902f08x07.gpfs.net@AD.GPFS.NET
• dn1 with Display name dn/c902f08x05.gpfs.net and User logon name dn/c902f08x05.gpfs.net@AD.GPFS.NET
• dn2 with Display name dn/c902f08x06.gpfs.net and User logon name dn/c902f08x06.gpfs.net@AD.GPFS.NET
• dn3 with Display name dn/c902f08x07.gpfs.net and User logon name dn/c902f08x07.gpfs.net@AD.GPFS.NET
• dn4 with Display name dn/c902f08x08.gpfs.net and User logon name dn/c902f08x08.gpfs.net@AD.GPFS.NET



For the Account options, ensure that you do the following:
• Un-select “User must change password at next logon”
• Select “Password never expires”
• Select “This account supports Kerberos AES 128 bit encryption”
• Select “This account supports Kerberos AES 256 bit encryption”

Export keytab files


Each node requires its own corresponding keytab to be generated.
The example below generates the keytab only for the host c902f08x06. For another host, such as
c902f08x07, you need to generate a new keytab with a new name, such as dn3.key.
In Windows PowerShell, use the ktpass command to generate the principals and keytab files for all
the Domain Users of the HDFS Transparency cluster.

PS C:\files> ktpass /princ dn/c902f08x06.gpfs.net@AD.GPFS.NET /mapuser dn/c902f08x06.gpfs.net /pass Admin1234 /out dn2.key /ptype KRB5_NT_SRV_INST /crypto all
Targeting domain controller: adserver.ad.gpfs.net
Successfully mapped dn/c902f08x06.gpfs.net to dn_c902f08x06.gpfs.n.
Password successfully set!
WARNING: pType and account type do not match. This might cause problems.
Key created.
Output keytab to dn2.key:
Keytab version: 0x502
keysize 69 dn/c902f08x06.gpfs.net@AD.GPFS.NET ptype 2 (KRB5_NT_SRV_INST)
vno 3 etype 0x17 (RC4-HMAC) keylength 16 (0xdac3a2930fc196001f3aeab959748448)
PS C:\files>

Note: The /crypto specifies the keys that are generated in the keytab file. The default settings are based
on older MIT versions. Therefore, /crypto should always be specified.
Distribute all the keytab files to the HDFS Transparency nodes and rename them to nn.service.keytab (for
NameNode service) and dn.service.keytab (for DataNode service).

PS C:\files> .\pscp.exe dn2.key root@c902f08x06:/etc/security/keytabs/dn.service.keytab

Note: Ensure that the “dn2.key” exported corresponds to the host c902f08x06. Otherwise, the service
will fail to start.
On the Linux nodes, change the owner and permissions for all the keytab files.

chown hdfs:hadoop /etc/security/keytabs/nn.service.keytab


chown hdfs:hadoop /etc/security/keytabs/dn.service.keytab
chmod 400 /etc/security/keytabs/nn.service.keytab
chmod 400 /etc/security/keytabs/dn.service.keytab

Install Kerberos clients and configure onto all the Linux nodes
On all the nodes, run the yum install krb5-workstation command to install the Kerberos workstation.
Add the following configuration information to the /etc/krb5.conf file on all the nodes.

[realms]
AD.GPFS.NET = {
kdc = adserver.ad.gpfs.net
admin_server = adserver.ad.gpfs.net
}

Configure Transparency to use Kerberos authentication


In /usr/lpp/mmfs/hadoop/etc/hadoop/core-site.xml, add or modify the configuration fields
below based on your environment:

<property>
<name>hadoop.security.auth_to_local</name>
<value>RULE:[2:$1@$0](nn@AD.GPFS.NET)s/.*/hdfs/
RULE:[2:$1@$0](dn@AD.GPFS.NET)s/.*/hdfs/
DEFAULT</value>
</property>

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>

In /usr/lpp/mmfs/hadoop/etc/hadoop/hdfs-site.xml, add the configurations below based on


your environment:

<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@AD.GPFS.NET</value>
</property>

<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/_HOST@AD.GPFS.NET</value>
</property>

<property>



<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

Use mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS Transparency
2.7.3-x) or mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop (for HDFS Transparency
3.0.x/3.1.x) to sync the configuration files onto all cluster nodes.
Note:
• For HDFS Transparency 2.7.3-x, the configuration is stored in /usr/lpp/mmfs/hadoop/etc/hadoop.
• For HDFS Transparency 3.1.x, the configuration is stored in /var/mmfs/hadoop/etc/hadoop.
For more information, see the Sync HDFS Transparency section.

Start Transparency
As root on one of the HDFS Transparency nodes, run /usr/lpp/mmfs/bin/mmhadoopctl connector
start to start HDFS Transparency.

Configure SSSD for Transparency


The System Security Services Daemon (SSSD) provides a set of daemons to manage access to remote
directories and authentication mechanisms.
It provides Name Service Switch (NSS) and Pluggable Authentication Modules (PAM) interfaces toward
the system and a pluggable back end system to connect to multiple different account sources.

SSSD installation and configuration


For Red Hat 7, install the following packages

yum -y install sssd realmd oddjob oddjob-mkhomedir adcli samba-common

Connect to an Active Directory Domain


Use the realmd to connect to an Active Directory Domain. The realmd system provides a clear and simple
way to discover and join identity domains to achieve direct domain integration.
It configures underlying Linux system services, such as SSSD or Winbind, to connect to the domain.

realm join adserver.ad.gpfs.net -U Administrator


realm permit -g [email protected]

Configure the sudoers file to add the line below:

%[email protected] ALL=(ALL) ALL

Configure /etc/sssd/sssd.conf file with the following changes:

use_fully_qualified_names = False
fallback_homedir = /home/%u

Restart sssd service after changing the configuration file.

systemctl restart sssd
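Before testing user lookups, it can be useful to confirm that the host actually joined the domain. realmd
can report the currently joined realms, for example:

# The ad.gpfs.net realm should be listed as a configured Kerberos member
realm list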



Test the integration of SSSD and AD
Create a User Account hdfs in AD and add it to group hadoop.



On a Linux shell, run the following command to verify that SSSD works properly.

[root@c902f05x04 ~]# id hdfs


uid=537601119(hdfs) gid=537601123(hadoop) groups=537601123(hadoop),537600513(domain users)
[root@c902f05x04 ~]#

MIT Kerberos

Manually configuring Kerberos

Setting up the Kerberos server


This topic lists the steps to set up the Kerberos server.
Before following these steps, see the “Prerequisites” on page 81 topic.
1. Install and configure the Kerberos server.

yum install krb5-server krb5-libs krb5-workstation

2. Create /etc/krb5.conf with the following contents:

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log



[libdefaults]
default_realm = IBM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[realms]
IBM.COM = {
kdc = {KDC_HOST_NAME}
admin_server = {KDC_HOST_NAME}
}

[domain_realm]
.ibm.com = IBM.COM
ibm.com = IBM.COM

Note: The KDC_HOST_NAME and IBM.COM values should reflect the correct host and REALM based on
your environment.
3. Set up the server.

kdb5_util create -s

systemctl start krb5kdc


systemctl start kadmin
chkconfig krb5kdc on
chkconfig kadmin on

4. Add the admin principal, and set the password.

kadmin.local -q "addprinc root/admin"

Check the kadm5.acl to ensure that the entry is correct.

cat /var/kerberos/krb5kdc/kadm5.acl
*/admin@IBM.COM

systemctl restart krb5kdc.service

systemctl restart kadmin.service

5. Ensure that the password is working by running the following command:

kadmin -p root/admin@IBM.COM
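As an additional sanity check of the new KDC, a ticket can be requested for the admin principal and then
displayed; this sketch prompts for the password that was set in step 4:

kinit root/admin@IBM.COM
klist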

Setting up Kerberos for HDFS Transparency nodes


This topic lists the steps to set up the Kerberos clients on the HDFS Transparency nodes. These
instructions work for both Cloudera Private Cloud Base and Apache Hadoop distributions.
Before following these steps, see the “Prerequisites” on page 81 topic.
1. Install the Kerberos clients package on all the HDFS Transparency nodes.

yum install -y krb5-libs krb5-workstation

2. Copy the /etc/krb5.conf file to the Kerberos client hosts on the HDFS Transparency nodes.
3. Create a directory for the keytab directory and set the appropriate permissions on each of the HDFS
Transparency node.

mkdir -p /etc/security/keytabs/
chown root:root /etc/security/keytabs
chmod 755 /etc/security/keytabs

4. Create KDC principals for the components, corresponding to the hosts where they are running, and
export the keytab files as follows:



Service: HDFS (User:Group root:root)
• NameNode: principal nn/<NN_Host_FQDN>@<REALM-NAME>, keytab file nn.service.keytab
• NameNode HTTP: principal HTTP/<NN_Host_FQDN>@<REALM-NAME>, keytab file spnego.service.keytab
• NameNode HTTP (CES hostname): principal HTTP/<CES_HDFS_Host_FQDN>@<REALM-NAME>, keytab file spnego.service.keytab
• DataNode: principal dn/<DN_Host_FQDN>@<REALM-NAME>, keytab file dn.service.keytab

Replace the <NN_Host_FQDN> with the HDFS Transparency NameNode hostname and
the <DN_Host_FQDN> with the HDFS Transparency DataNode hostname. Replace the
<CES_HDFS_Host_FQDN> with the CES hostname configured for your CES HDFS cluster.
You need to create one principal for each HDFS Transparency DataNode and two principals for each
HDFS Transparency NameNode.
Note: If you are using CDP Private Cloud Base, Cloudera Manager creates the principals and keytabs
for all the services except the IBM Storage Scale service. Therefore, you can skip the create service
principals section below and go directly to step a.
If you are using Apache Hadoop, you need to create service principals for YARN and Mapreduce
services as shown in the following table:

Service: YARN (User:Group yarn:hadoop)
• ResourceManager: principal rm/<Resource_Manager_FQDN>@<REALM-NAME>, keytab file rm.service.keytab
• NodeManager: principal nm/<Node_Manager_FQDN>@<REALM-NAME>, keytab file nm.service.keytab

Service: Mapreduce (User:Group mapred:hadoop)
• MapReduce Job History Server: principal jhs/<Job_History_Server_FQDN>@<REALM-NAME>, keytab file jhs.service.keytab

Replace the <Resource_Manager_FQDN> with the Resource Manager hostname, the
<Node_Manager_FQDN> with the Node Manager hostname, and the <Job_History_Server_FQDN>
with the Job History Server hostname.
a. Create service principals for each service. Refer to the sample table above.

kadmin.local -q "addprinc -randkey -maxrenewlife 7d +allow_renewable {Principal}"

For example:

kadmin.local -q "addprinc -randkey -maxrenewlife 7d +allow_renewable nn/[email protected]"



b. Create host principals for each Transparency host.

kadmin.local -q "addprinc -randkey host/{HOST_NAME}@<Realm Name>"

For example:

kadmin.local -q "addprinc -randkey host/[email protected]"

c. If you are using RHEL 9.1+ for Power LE, update the principals to include the
+requires_preauth attribute.
For all the host and service principals created under the previous steps 4.a and 4.b, update the
principals to include the +requires_preauth flag, as shown in the following example:

# kadmin.local: modify_principal +requires_preauth nn/[email protected]


Principal nn/[email protected] modified

d. For each service on each Transparency host, create a keytab file by exporting its service principal
into a keytab file:

kadmin.local ktadd -k /etc/security/keytabs/{SERVICE_NAME}.service.keytab {Principal}

For example:
DataNode:

kadmin.local ktadd -k /etc/security/keytabs/dn.service.keytab dn/[email protected]

NameNode:

kadmin.local ktadd -k /etc/security/keytabs/nn.service.keytab nn/[email protected]

NameNode HTTP:
The keytab for this service needs an additional step as it contains entries for two principals – one
corresponding to the actual NameNode hostname and another for the CES IP hostname.
• First create the keytab file for HTTP service corresponding to the NameNode host.

kadmin.local ktadd -k /etc/security/keytabs/spnego.service.keytab HTTP/[email protected]

• Create a temporary keytab file for HTTP service corresponding to the CES HDFS IP hostname.

kadmin.local ktadd -norandkey -k /etc/security/keytabs/myceshdfs.service.keytab HTTP/myceshdfs.gpfs.net@IBM.COM

• Merge the above two keytabs with the ktutil utility to create an updated spnego.service.keytab:

#ktutil
ktutil: rkt /etc/security/keytabs/myceshdfs.service.keytab
ktutil: wkt /etc/security/keytabs/spnego.service.keytab
exit

Note: myceshdfs.gpfs.net is an example of the CES IP hostname configured for your CES
HDFS service.
• Repeat the “4.a” on page 110, “4.b” on page 111, and “4.d” on page 111 steps for every
required keytab file.
Note:
• The filename for a service is common (for example, dn.service.keytab) across hosts but
the contents would be different because every keytab would have a different host principal
component.



• After a keytab is generated, move the keytab to the appropriate host immediately or move it into
a different location to avoid the keytab from getting overwritten.
5. For CES HDFS NameNode HA, an HDFS admin user and its Kerberos user principal and keytab must be
created and set up for the CES NameNodes. These credentials are used by the CES
framework to elect an active NameNode.
This principal should map to an existing OS user on the NameNode hosts.
In this example, the OS user is hdfs. You will configure this principal/keytab into hadoop-env.sh in
step 8.
a. First create a Hadoop supergroup.
Set the dfs.permissions.superusergroup parameter to supergroup by running the following
command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config set hdfs-site.xml -k dfs.permissions.superusergroup=supergroup

b. Create the hdfs user, belonging to the supergroup Hadoop super group, on all the HDFS Transparency
nodes by using the supplied gpfs_create_hadoop_users_dirs.py command.
The command ensures that the custom user/group is created with consistent UID/GID across all
the nodes.

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-
user-group hdfs:supergroup

Note: If you are going to use CDP, you can skip this step. You will create this user as part of the
CDP specific configuration workflow.
c. Create the user principal.

# kadmin.local "addprinc -randkey -maxrenewlife 7d +allow_renewable ces-<clustername>@IBM.COM"
# kadmin.local "ktadd -k /etc/security/keytabs/ces-<clustername>.headless.keytab ces-<clustername>@IBM.COM"

where, <clustername> is the name of your CES HDFS cluster. In case there are multiple CES
HDFS clusters sharing a common KDC server, having the cluster name as part of the principal
helps to create a user principal unique to each CES HDFS cluster.
d. Copy the /etc/security/keytabs/ces-<clustername>.headless.keytab file to all the
NameNodes and change the owner permission of the file to root:

# chown root:root /etc/security/keytabs/ces-<clustername>.headless.keytab


# chmod 400 /etc/security/keytabs/ces-<clustername>.headless.keytab

6. Copy the appropriate keytab file to each host. If a host runs more than one component (for example,
both NameNode and DataNode), copy the keytabs for both these components.
7. Set the appropriate permissions for the keytab files.
On the HDFS Transparency NameNode hosts:

chown root:root /etc/security/keytabs/nn.service.keytab


chmod 400 /etc/security/keytabs/nn.service.keytab
chown root:root /etc/security/keytabs/spnego.service.keytab
chmod 440 /etc/security/keytabs/spnego.service.keytab

On the HDFS Transparency DataNode hosts:

chown root:root /etc/security/keytabs/dn.service.keytab


chmod 400 /etc/security/keytabs/dn.service.keytab

On the Yarn resource manager hosts:



chown yarn:hadoop /etc/security/keytabs/rm.service.keytab
chmod 400 /etc/security/keytabs/rm.service.keytab

On the Yarn node manager hosts:

chown yarn:hadoop /etc/security/keytabs/nm.service.keytab


chmod 400 /etc/security/keytabs/nm.service.keytab

On Mapreduce job history server hosts:

chown mapred:hadoop /etc/security/keytabs/jhs.service.keytab


chmod 400 /etc/security/keytabs/jhs.service.keytab

8. Update the HDFS Transparency configuration files and upload the changes.
• Get the config files

mkdir /tmp/hdfsconf
mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

• Configurations in core-site.xml and hdfs-site.xml are different for HDFS Transparency 3.1.x
and HDFS Transparency 3.2.2-x/3.3.x. The configurations are as follows:
– For HDFS Transparency 3.1.x use the following configurations in core-site.xml and hdfs-
site.xml:
File: core-site.xml

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>

If you are using Cloudera Private Cloud Base cluster, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](.*@IBM.COM)s/@.*//
DEFAULT
</value>
</property>

Otherwise, if you are using Apache Hadoop, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](nm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](rm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](jhs/.*@.*IBM.COM)s/.*/mapred/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
DEFAULT
</value>
</property>

In the above example, replace IBM.COM with your Realm name and <clustername> parameter
with your actual CES HDFS cluster name.



File: hdfs-site.xml

<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>

<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>

<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>

<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>

<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@IBM.COM</value>
</property>

<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.encrypt.data.transfer</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@IBM.COM</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/_HOST@IBM.COM</value>
</property>

<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>*</value>
</property>

– For HDFS Transparency 3.2.2-x and 3.3.x use the following configurations in core-site.xml
and hdfs-site.xml:
File: core-site.xml

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>

<property>
<name>hadoop.http.authentication.type</name>



<value>kerberos</value>
</property>

<property>
<name>hadoop.http.authentication.kerberos.principal</name>
<value>*</value>
</property>

<property>
<name>hadoop.http.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

If you are using Cloudera Private Cloud Base cluster, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](.*@IBM.COM)s/@.*//
DEFAULT
</value>
</property>

Otherwise, if you are using Apache Hadoop, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](nm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](rm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](jhs/.*@.*IBM.COM)s/.*/mapred/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
DEFAULT
</value>
</property>

In the above example, replace IBM.COM with your Realm name and <clustername> parameter
with your actual CES HDFS cluster name.
File: hdfs-site.xml

<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>

<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>

<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>

<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>

<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@IBM.COM</value>
</property>

<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.encrypt.data.transfer</name>



<value>false</value>
</property>

<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@IBM.COM</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/_HOST@IBM.COM</value>
</property>

<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>

• File: hadoop-env.sh

KINIT_KEYTAB=/etc/security/keytabs/ces-<clustername>.headless.keytab
KINIT_PRINCIPAL=ces-<clustername>@IBM.COM

where, <clustername> is the name of your CES HDFS cluster.


9. Stop the HDFS Transparency services for the cluster.
a. Stop the DataNodes.
On any HDFS Transparency node, run the following command:

mmhdfs hdfs-dn stop

b. Stop the NameNodes.


On any CES HDFS NameNode, run the following command:

mmces service stop HDFS -N <NN1>,<NN2>

10. Import the files.

mmhdfs config import /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

11. Upload the changes.

mmhdfs config upload

12. Start the HDFS Transparency services for the cluster.


a. Start the DataNodes.
On any HDFS Transparency node, run the following command:

mmhdfs hdfs-dn start

b. Start the NameNodes.


On any CES HDFS NameNode, run the following command:

mmces service start HDFS -N <NN1>,<NN2>

c. Verify that the services have started.


On any CES HDFS NameNode, run the following command:

mmhdfs hdfs status
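After the services are back up, the Kerberos setup can be spot-checked from one of the NameNodes. The
following is a hedged sketch that inspects one of the keytabs created in step 4 and authenticates with the
headless credentials configured in step 8; replace <clustername> with your CES HDFS cluster name:

# List the principals stored in the NameNode service keytab
klist -kt /etc/security/keytabs/nn.service.keytab

# Authenticate with the CES headless principal and confirm that a ticket was granted
kinit -kt /etc/security/keytabs/ces-<clustername>.headless.keytab ces-<clustername>@IBM.COM
klist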



Configuring Kerberos using the Kerberos script provided with IBM Storage Scale
From HDFS Transparency 3.1.1-3, IBM Storage Scale provides a Kerberos configuration
script /usr/lpp/mmfs/hadoop/scripts/gpfs_kerberos_configuration.py to help with setting
up Kerberos for HDFS Transparency interactively.
From HDFS Transparency 3.1.1-4, a non-interactive version of the automation script is also supported.
The input parameters can be specified through a customized json input file.
The output of the script is logged to /var/log/kerberos_configuration_setup.log file.
Note: If you need to set up more than one HDFS Transparency cluster using a common KDC server, see
the Limitation in the “Kerberos” on page 81 topic.
Before following these steps, see the Prerequisites topic.
There are two methods to use the Kerberos script:
1. Interactive method
2. Custom json file method

Interactive method
You can perform the following using the interactive method:
1. Set up a new KDC server. If you already have a KDC server, go to step 2.
Setting up a new KDC server helps with the following:
a. Install and configure a new Kerberos server on the host being run. Create or update the /var/
kerberos/krb5kdc/kdc.conf and /etc/krb5.conf files.
b. By default, the principals are configured such that ticket_lifetime is set to 24h and
renew_lifetime is set to 7d. If needed, update these default values.
2. Configure Kerberos for HDFS Transparency.
Configuring Kerberos helps with the following:
a. Install and configure Kerberos client on the HDFS Transparency nodes.
b. Create host principals.
c. Create NameNode and DataNode principals and keytabs for HDFS Transparency.
d. Create hdfs user principal and keytab.
e. Apply the Kerberos configurations for hdfs-site.xml, core-site.xml and hadoop-env.sh for HDFS
Transparency.
3. Clear Kerberos configuration from HDFS Transparency.
Clearing Kerberos configuration helps with the following:
a. Disable the Kerberos configurations from HDFS Transparency.
b. In case you want to re-enable Kerberos at a later time, the existing principals and keytabs created
for NameNodes and DataNodes are retained.
Perform the following to run the gpfs_kerberos_configuration.py script:
• For HDFS Transparency-3.1.1-3:

# /usr/lpp/mmfs/hadoop/scripts/gpfs_kerberos_configuration.py
MIT Kerberos configuration:
1: Setup a new KDC server.
[Run the script on the KDC server host]
2: Configure Kerberos for HDFS Transparency.
[Run the script on a CES-HDFS cluster node that has password-less SSH access to
the other HDFS Transparency nodes]
3: Clear Kerberos configuration from HDFS Transparency.
[This option will remove the Kerberos configurations from your HDFS Transparency
cluster.
This will not remove the existing principals and keytabs for NameNodes and
DataNodes]

Choose option 1/2/3:

• For HDFS Transparency-3.1.1-4:

# /usr/lpp/mmfs/hadoop/scripts/gpfs_kerberos_configuration.py
MIT Kerberos configuration:
1: Setup a new KDC server.
[Run the script on the KDC server host]
2: Configure Kerberos for HDFS Transparency.
[Run the script on a CES-HDFS cluster node that has password-less SSH access to
the other HDFS Transparency nodes]
3: Clear Kerberos configuration from HDFS Transparency.
[This option will remove the Kerberos configurations from your HDFS Transparency
cluster.
This will not remove the existing principals and keytabs for NameNodes and
DataNodes]
4: Exit.

Choose option 1/2/3/4:

Custom json file method


For this method, the user needs to update the custom json file (/usr/lpp/mmfs/hadoop/scripts/
gpfs_kerberos_config_metadata.json) with inputs specific to the environment. Then run the
gpfs_kerberos_configuration.py script as follows:

[root@scripts]# ./gpfs_kerberos_configuration.py -h
usage: gpfs_kerberos_configuration.py [-h] [-c CONFIG]

Create Kerberos configuration

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Provide 'gpfs_kerberos_config_metadata.json' config path.
                        Help: The sample config template file can be found in
                        '/usr/lpp/mmfs/hadoop/scripts/gpfs_kerberos_config_metadata.json'

Example:

[root@scripts]# ./gpfs_kerberos_configuration.py -c /usr/lpp/mmfs/hadoop/scripts/gpfs_kerberos_config_metadata.json

Verifying Kerberos
For information about verifying Kerberos, see “Verifying Kerberos” on page 127.

Workaround for the Power LE platform


On RHEL 9.1 and later for Power LE, you must modify all of the MIT Kerberos principals that are
generated by HDFS Transparency so that the new and mandatory +requires_preauth attribute is added to them.
This attribute is added by default when principals are created by using the addprinc command.
However, +requires_preauth is omitted when the +allow_renewable flag is also passed, which is how
the HDFS Transparency script gpfs_kerberos_configuration.py creates the NameNode principals:

addprinc -randkey -maxrenewlife 7d +allow_renewable {Principal}

Solution
Modify the principals as in the following example.
Make sure to repeat this solution for all these principals:
• nn/host
• HTTP/host
• dn/host
• <hostname>/host
• HTTP/<cesip-host>

Example:
# kadmin.local: modify_principal +requires_preauth nn/[email protected]
Principal nn/[email protected] modified.

# kadmin.local: getprinc nn/[email protected]


Principal: nn/[email protected]
Expiration date: [never]
Last password change: Tue Aug 08 09:45:21 EDT 2023
Password expiration date: [never]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 7 days 00:00:00
Last modified: Wed Aug 09 01:18:00 EDT 2023 (root/[email protected])
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 5
Key: vno 4, aes256-cts-hmac-sha1-96
Key: vno 4, aes128-cts-hmac-sha1-96
Key: vno 4, DEPRECATED:arcfour-hmac
Key: vno 4, camellia256-cts-cmac
Key: vno 4, camellia128-cts-cmac
MKey: vno 1
Attributes: REQUIRES_PRE_AUTH
Policy: [none]
kadmin.local:
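
If many principals need the same change, the modification can also be scripted. The following is a minimal sketch only; it assumes that kadmin.local is available on the KDC host, and the principal names and realm are placeholders that must be replaced with the ones listed above for your cluster.

# Illustrative loop: add +requires_preauth to each affected principal (run on the KDC host)
for PRINC in nn/nn01.gpfs.net HTTP/nn01.gpfs.net dn/dn01.gpfs.net HTTP/myceshdfs.gpfs.net; do
    kadmin.local -q "modify_principal +requires_preauth ${PRINC}@IBM.COM"
done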

Red Hat IPA Kerberos


Setting up the IPA Kerberos server
This topic lists the steps to set up the IPA Kerberos server.
Before following these steps, see the “Prerequisites” on page 81 topic.
For the complete procedure to setup your IPA environment, see the Red Hat documentation specific to
your OS version. For example, Options for the ipa-server-install and ipa-replica-install commands.
1. IPA server installation and setup.
RHEL7 and RHEL8 configure the IPA server differently.
• Example of RHEL7:
Install and configure the IPA server by running the following commands:

# yum install ipa-server

# ipa-server-install

• Example of RHEL8:
In RHEL8, the ipa-server package is not provided in the default repositories. For the setup steps, see Preparing the
system for IdM server installation.
2. Set up the IPA server by running the ipa-server-install command.
3. Verify that the IPA services are up by running the ipactl status command as follows:

# ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
ntpd Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa: INFO: The ipactl command was successful



4. Ensure that the Administrator (for example, admin) is able to obtain tickets by running the following
command:

# kinit admin

Setting up IPA Kerberos for HDFS Transparency nodes


This topic lists the steps to set up the IPA Kerberos clients on the HDFS Transparency nodes.
Before following these steps, see the “Prerequisites” on page 81 topic.
For the complete procedure to setup your IPA client environment, see the Red Hat documentation
specific to your OS version. For example, Preparing the system for IdM client installation.
1. Install and setup the IPA Kerberos client on all the HDFS Transparency nodes by running the
following commands:

# yum install ipa-client


# ipa-client-install --server=<IPA server FQDN> --domain=<IPA domain name> --realm=<IPA Realm> --hostname=<This hostname> --force-ntpd

For example,

# ipa-client-install --server=ipaserver.gpfs.net --domain=gpfs.net --realm=IBM.COM --hostname=dn01.gpfs.net --force-ntpd

2. Update the following configurations:


Copy the /etc/krb5.conf file from the IPA server to one of IPA client nodes (i.e. HDFS
Transparency nodes). Then update the local /etc/krb5.conf as follows:

default_ccache_name = KEYRING:persistent:%{uid}

to

default_ccache_name = /tmp/krb5cc_%{uid}

If the /etc/krb5.conf.d/kcm_default_ccache file exists, disable the KCM credential cache by commenting out the following lines in that file:

[libdefaults]
default_ccache_name = KCM:

3. Distribute the configuration files that were updated in the previous step to the remaining IPA client
nodes.
4. Create a directory for the keytabs and set the appropriate permissions on each of the HDFS
Transparency node by running the following commands:

# mkdir -p /etc/security/keytabs/
# chown root:root /etc/security/keytabs
# chmod 755 /etc/security/keytabs

5. Create KDC principals for the components, corresponding to the hosts where they are running, and
create the keytab files as follows:



Creating keytab files for HDFS

Service   User:Group   Daemons         Principal                                 Keytab File Name
HDFS      root:root    NameNode        nn/<NN_Host_FQDN>@<REALM-NAME>            nn.service.keytab
                       NameNode HTTP   HTTP/<NN_Host_FQDN>@<REALM-NAME>          spnego.service.keytab
                       NameNode HTTP   HTTP/<CES_HDFS_Host_FQDN>@<REALM-NAME>    spnego.service.keytab
                       DataNode        dn/<DN_Host_FQDN>@<REALM-NAME>            dn.service.keytab

Replace the < NN_Host_FQDN > with the HDFS Transparency NameNode hostname and
the <DN_Host_FQDN> with the HDFS Transparency DataNode hostname. Replace the
<CES_HDFS_Host_FQDN> with the CES hostname configured for your CES HDFS cluster.
You need to create the following:
• One principal/keytab for each HDFS Transparency DataNode.
• Two principals/keytabs for each HDFS Transparency NameNode.
– nn.service.keytab - For the nn/<NameNode host> principal.
– spnego.service.keytab - This keytab contains entries for two principals (HTTP principal for
the actual NameNode hostname and HTTP principal for the CES IP hostname).
If you are using Open source Apache Hadoop, you need to create service principals for the YARN and
Mapreduce services as shown in the following table:

Creating service principals for YARN and Mapreduce

Service     User:Group      Daemons                        Principal                                      Keytab File Name
YARN        yarn:hadoop     ResourceManager                rm/<Resource_Manager_FQDN>@<REALM-NAME>        rm.service.keytab
                            NodeManager                    nm/<Node_Manager_FQDN>@<REALM-NAME>            nm.service.keytab
Mapreduce   mapred:hadoop   MapReduce Job History Server   jhs/<Job_History_Server_FQDN>@<REALM-NAME>     jhs.service.keytab

Replace the <Resource_Manager_FQDN> with the Resource Manager hostname, the <Node_Manager_FQDN> with the Node Manager hostname, and the <Job_History_Server_FQDN> with the Job History Server hostname.



Run the following commands on the IPA server node:
a. Get a ticket for the IPA Admin.
For example,

kinit admin

b. For the HTTP/<CES_HDFS_Host_FQDN> principal, create a host entry for the CES IP hostname as
follows:

# ipa host-add <CES_HDFS_Host_FQDN>

For example,

# ipa host-add myceshdfs.gpfs.net

where, myceshdfs.gpfs.net is an example of the CES IP hostname configured for the CES HDFS
service.
c. Create the service principals for each service mentioned in Table 1.

# ipa service-add <Principal Name>

For example:

# ipa service-add nn/nn01.gpfs.net


# ipa service-add nn/nn02.gpfs.net
# ipa service-add HTTP/nn01.gpfs.net
# ipa service-add HTTP/nn02.gpfs.net
# ipa service-add dn/dn01.gpfs.net
# ipa service-add HTTP/myceshdfs.gpfs.net

d. Create the following IPA rules:


Rule to allow the IPA Admin to retrieve the HTTP/<CES HDFS hostname> principal.

# ipa service-allow-retrieve-keytab HTTP/<CES HDFS hostname> --users=<IPA admin user>

For example,

# ipa service-allow-retrieve-keytab HTTP/myceshdfs.gpfs.net --users=admin

Rule to allow the IPA Admin to retrieve the NameNode host principals. Repeat the command for
all NameNode hosts.

# ipa host-allow-retrieve-keytab --users=admin


Host name:

Enter the NameNode FQDN at the prompt. These rules are needed to create the
spnego.service.keytab files properly in the next step.
e. For each service on each HDFS Transparency host, create a keytab file by exporting its service
principal into a keytab file.
Note:
• If the file name for a service is common (for example, dn.service.keytab) across the
hosts, the contents would be different because every keytab would have different principal
components.
• As soon as a keytab file is generated, move (scp) the keytab to the appropriate host
immediately or move it into a different location to avoid the keytab from getting overwritten.

# ipa-getkeytab -s <IPA server FQDN> -k /etc/security/keytabs/{SERVICE_NAME}.service.keytab -p {Principal}

For example:



DataNode:

# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/dn.service.keytab -p dn/dn01.gpfs.net
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/dn.service.keytab -p host/dn01.gpfs.net -r

Move (scp) the generated dn.service.keytab to the corresponding DataNode host under /etc/security/keytabs/. Then remove the local dn.service.keytab file immediately so that it is not overwritten by the next DataNode's keytab.
NameNode1:

# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/nn.service.keytab -p nn/nn01.gpfs.net
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/nn.service.keytab -p host/nn01.gpfs.net -r

# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/spnego.service.keytab -p HTTP/nn01.gpfs.net
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/spnego.service.keytab -p HTTP/myceshdfs.gpfs.net
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/spnego.service.keytab -p host/nn01.gpfs.net -r

Move (scp) the generated nn.service.keytab and spnego.service.keytab to the corresponding NameNode host under /etc/security/keytabs/. Remove the local nn.service.keytab and spnego.service.keytab files immediately so that they are not overwritten by the next NameNode's keytabs.
NameNode 2:

# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/nn.service.keytab -p nn/nn02.gpfs.net
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/nn.service.keytab -p host/nn02.gpfs.net -r

# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/spnego.service.keytab -p HTTP/nn02.gpfs.net
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/spnego.service.keytab -p HTTP/myceshdfs.gpfs.net -r
# ipa-getkeytab -s ipaserver.gpfs.net -k /etc/security/keytabs/spnego.service.keytab -p host/nn02.gpfs.net -r

Move (scp) the generated nn.service.keytab and spnego.service.keytab to the corresponding NameNode host under /etc/security/keytabs/. Remove the local nn.service.keytab and spnego.service.keytab files immediately so that they are not overwritten by the next NameNode's keytabs.
Note: Notice the additional -r option used with some of the ipa-getkeytab commands that export the keytab files. This option is needed when the keytab for a particular principal has already been initialized. Without it, the existing keytab for that principal is invalidated, which can lead to authentication errors.
6. For CES HDFS NameNode HA, an HDFS admin user, together with its Kerberos user principal and keytab, must be created and set up for the CES NameNodes. These credentials are used by the CES framework to elect an active NameNode.
This principal should map to an existing OS user on the NameNode hosts.
In this example, the OS user is hdfs. You will configure this principal/keytab in hadoop-env.sh in step 10.
a. Create a Hadoop supergroup.



Set the dfs.permissions.superusergroup parameter to supergroup by running the following
command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config set hdfs-site.xml -k dfs.permissions.superusergroup=supergroup
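
Optionally, you can confirm that the value is in place. This is a hedged check that reuses the mmhdfs config get syntax shown later in this guide:

# Confirm the Hadoop supergroup setting
/usr/lpp/mmfs/hadoop/sbin/mmhdfs config get hdfs-site.xml -k dfs.permissions.superusergroup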

b. Create the OS user hdfs on all the HDFS Transparency nodes that belongs to the Hadoop super
group supergroup by using the supplied gpfs_create_hadoop_users_dirs.py command.
This command ensures that the custom user and group is created with consistent UID/GID across
all the nodes.

# /usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group hdfs:supergroup

Otherwise, if you use FreeIPA to manage all users/groups and SSSD, you should create the hdfs
user and supergroup group in FreeIPA and ensure that you can see the hdfs user and supergroup
group on all the CES HDFS nodes.
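
As an optional sanity check (a minimal sketch; run it on each HDFS Transparency node), confirm that the user and group resolve with the same UID/GID on every node:

# The UID/GID pair printed here must match on all HDFS Transparency nodes
id hdfs
getent group supergroup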
c. Create the IPA user principal and keytab for your HDFS Transparency cluster.
Create a unique user principal such as ces-<clustername> where, <clustername> is the name
of your CES HDFS cluster. In case there are multiple CES HDFS clusters sharing a common IPA
KDC server, having the cluster name as part of the principal helps to create a user principal unique
to each CES HDFS cluster.

# ipa user-add ces-<clustername>

# ipa-getkeytab -s ipaserver.gpfs.net -p ces-<clustername> -k /etc/security/keytabs/ces-<clustername>.headless.keytab

For example,

# ipa user-add ces-scaleces

# ipa-getkeytab -s ipaserver.gpfs.net -p ces-scaleces -k /etc/security/keytabs/ces-scaleces.headless.keytab

d. Copy the /etc/security/keytabs/ces-<clustername>.headless.keytab file to all the NameNodes and change the owner permission of the file to root.

# chown root:root /etc/security/keytabs/ces-<clustername>.headless.keytab
# chmod 400 /etc/security/keytabs/ces-<clustername>.headless.keytab

7. Verify the keytabs. The contents should appear as follows:

# klist -kte /etc/security/keytabs/nn.service.keytab


Keytab name: FILE:nn.service.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/06/2022 08:02:17 nn/[email protected] (aes256-cts-hmac-sha1-96)
1 07/06/2022 08:02:17 nn/[email protected] (aes128-cts-hmac-sha1-96)
7 07/06/2022 08:02:17 host/[email protected] (aes256-cts-hmac-sha1-96)
7 07/06/2022 08:02:17 host/[email protected] (aes128-cts-hmac-sha1-96

# klist -kte /etc/security/keytabs/spnego.service.keytab


Keytab name: FILE:spnego.service.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/06/2022 08:02:17 HTTP/[email protected] (aes256-cts-hmac-sha1-96)
1 07/06/2022 08:02:17 HTTP/[email protected] (aes128-cts-hmac-sha1-96)
1 07/06/2022 08:02:17 HTTP/[email protected] (aes256-cts-hmac-sha1-96)
1 07/06/2022 08:02:17 HTTP/[email protected] (aes128-cts-hmac-sha1-96)
7 07/06/2022 08:02:17 host/[email protected] (aes256-cts-hmac-sha1-96)
7 07/06/2022 08:02:17 host/[email protected] (aes128-cts-hmac-sha1-96

# klist -kte /etc/security/keytabs/dn.service.keytab


Keytab name: FILE:dn.service.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/06/2022 08:02:18 dn/[email protected] (aes256-cts-hmac-sha1-96)
1 07/06/2022 08:02:18 dn/[email protected] (aes128-cts-hmac-sha1-96)
7 07/06/2022 08:02:18 host/[email protected] (aes256-cts-hmac-sha1-96)
7 07/06/2022 08:02:18 host/[email protected] (aes128-cts-hmac-sha1-96



# klist -kte /etc/security/keytabs/ces-scaleces.headless.keytab
Keytab name: FILE:ces-scaleces.headless.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/05/2022 05:08:22 [email protected] (aes256-cts-hmac-sha1-96)
1 07/05/2022 05:08:22 [email protected] (aes128-cts-hmac-sha1-96)

8. Copy the appropriate keytab files from the IPA server to each HDFS Transparency host.
9. Set the appropriate permissions for the keytab files.
On the HDFS Transparency NameNode hosts, run the following command:

# chown root:root /etc/security/keytabs/nn.service.keytab


# chmod 400 /etc/security/keytabs/nn.service.keytab
# chown root:root /etc/security/keytabs/spnego.service.keytab
# chmod 440 /etc/security/keytabs/spnego.service.keytab

On the HDFS Transparency DataNode hosts, run the following command:

# chown root:root /etc/security/keytabs/dn.service.keytab


# chmod 400 /etc/security/keytabs/dn.service.keytab

On the Yarn resource manager hosts, run the following command:

# chown yarn:hadoop /etc/security/keytabs/rm.service.keytab


# chmod 400 /etc/security/keytabs/rm.service.keytab

On the Yarn node manager hosts, run the following command:

# chown yarn:hadoop /etc/security/keytabs/nm.service.keytab


# chmod 400 /etc/security/keytabs/nm.service.keytab

On Mapreduce job history server hosts, run the following command:

# chown mapred:hadoop /etc/security/keytabs/jhs.service.keytab


# chmod 400 /etc/security/keytabs/jhs.service.keytab
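
As a quick, illustrative check after setting the permissions, list the keytab directory on each host and confirm that the owner and mode of every file match the commands above:

# Example check on an HDFS Transparency node
ls -l /etc/security/keytabs/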

10. Update the HDFS Transparency configuration files and upload the changes.
• Obtain the config files by running the following commands:

# mkdir /tmp/hdfsconf
# mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

Use the following configurations in core-site.xml and hdfs-site.xml, corresponding to HDFS Transparency 3.1.1.x.
File: core-site.xml

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>

If you are using a Cloudera Private Cloud Base cluster, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](.*@IBM.COM)s/@.*//
DEFAULT
</value>
</property>



Otherwise, if you are using Open source Apache Hadoop, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](nm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](rm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](jhs/.*@.*IBM.COM)s/.*/mapred/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
DEFAULT
</value>
</property>

In the above example, replace IBM.COM with your Realm name and the <clustername> parameter with your actual CES HDFS cluster name.
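
If you want to confirm how the rules map a principal to a local user, Hadoop provides a kerbname helper for this purpose. The following is a hedged sketch only; it assumes that the hadoop command is available under /usr/lpp/mmfs/hadoop/bin and that the updated core-site.xml has already been imported and uploaded in the later steps, and the principal shown is a placeholder:

# Print the local user that a principal maps to under the configured auth_to_local rules
/usr/lpp/mmfs/hadoop/bin/hadoop kerbname nn/[email protected]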
File: hdfs-site.xml

<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>

<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>

<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>

<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>

<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/[email protected]</value>
</property>

<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.encrypt.data.transfer</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/[email protected]</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/[email protected]</value>
</property>

<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>*</value>
</property>



<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>

In the above example, replace IBM.COM with your Realm name.


File: hadoop-env.sh:

export KINIT_KEYTAB=/etc/security/keytabs/ces-<clustername>.headless.keytab
export KINIT_PRINCIPAL=ces-<clustername>@IBM.COM

where, <clustername> is the name of your CES HDFS cluster.


11. Stop the HDFS Transparency services for the cluster.
a. Stop the DataNodes.
On any HDFS Transparency node, run the following command:

# mmhdfs hdfs-dn stop

b. Stop the NameNodes.


On any CES HDFS NameNode, run the following command:

# mmces service stop HDFS -N <NN1>,<NN2>

12. Import the files by running the following command:

# mmhdfs config import /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

13. Upload the changes by running the following command:

# mmhdfs config upload

14. Start the HDFS Transparency services for the cluster.


a. Start the DataNodes.
On any HDFS Transparency node, run the following command:

# mmhdfs hdfs-dn start

b. Start the NameNodes.


On any CES HDFS NameNode, run the following command:

# mmces service start HDFS -N <NN1>,<NN2>

c. Verify that the services have started.


On any CES HDFS NameNode, run the following command:

# mmhdfs hdfs status

Verifying Kerberos
For information about verifying Kerberos, see “Verifying Kerberos” on page 127.

Verifying Kerberos
This topic lists the steps to verify Kerberos.
1. Verify that the HDFS Transparency has started.
On CES HDFS cluster admin node, run the following command:



# /usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status

2. Verify if Kerberos is enabled by running the following command on the CES HDFS cluster admin node:

mmhdfs config get core-site.xml -k hadoop.security.authentication

If Kerberos is enabled, the output will display kerberos, else it will display simple.
hadoop.security.authentication=kerberos
3. Verify NameNode service principal
The NameNode service principals configured for HDFS Transparency are internally used by Cloudera
Manager to create home directories for Cloudera services. Follow these steps to ensure that these
principals are configured properly and are functional:
a. Log in to one of the CES HDFS NameNode hosts.
b. Get a token for the NameNode service principal. For example,

# kinit -kt /etc/security/keytabs/nn.service.keytab nn/<NameNode hostname>@<Realm Name>

c. Verify that you can access IBM Storage Scale file system by running the following command:

# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /

d. Ensure that the token can be renewed by running the following command:

# kinit -R

e. After you have verified, drop the token by running the following command:

# kdestroy

If possible, repeat this for all the NameNode principals.


4. Verify the hdfs admin user principal.
The hdfs admin user principal (configured in hadoop-env.sh) is used by HDFS Transparency internally
to run the HDFS administrative (hdfs haadmin) commands. Follow these steps to ensure that these
principals are configured properly and are functional:
a. Log in to the CES HDFS cluster admin node.
b. Get a token for the hdfs admin user. For example,

# kinit -kt /etc/security/keytabs/ces-<clustername>.headless.keytab ces-<clustername>@<Realm name> -c /var/mmfs/tmp/krb5cc_ces

where, <clustername> is the cluster name of your CES HDFS and <Realm> is the Realm name of
your Kerberos. For example, IBM.COM.
Note: The non-default ticket cache location for this principal is /var/mmfs/tmp/krb5cc_ces.
CES uses this location for ticket cache.
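
Optionally (a minimal check that uses standard MIT Kerberos tooling), confirm that the ticket was created in that cache:

# List the CES admin ticket cache created by the kinit command above
env KRB5CCNAME=/var/mmfs/tmp/krb5cc_ces klist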
c. Verify that you can access the IBM Storage Scale file system by running the following command:

env KRB5CCNAME=/var/mmfs/tmp/krb5cc_ces /usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /

d. Verify that there is one active NameNode.

# env KRB5CCNAME=/var/mmfs/tmp/krb5cc_ces /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState
Namenode1.gpfs.net:8020 active
Namenode2.gpfs.net:8020 standby

e. Run hdfs and webhdfs -ls/-put/-get commands to confirm that I/O to HDFS Transparency succeeds over both the hdfs RPC and webhdfs protocols.
f. Ensure that the hdfs token can be renewed by running the following command:



env KRB5CCNAME=/var/mmfs/tmp/krb5cc_ces kinit -R

TLS
Transport Layer Security (TLS)/Secure Sockets Layer (SSL) provides privacy and data integrity between
applications communicating over a network by encrypting the packets transmitted between the
endpoints. Configuring TLS/SSL for any system typically involves creating a private key and public key
for use by server and client processes to negotiate an encrypted connection at runtime.
This section describes how to configure TLS for CES HDFS Transparency.

Prerequisites
This topic lists the prerequisites before enabling TLS for HDFS Transparency.
• Ensure that the webHDFS client is working properly before you enable TLS for HDFS Transparency by
following the steps in the Verifying installation topic.
• The keytool utility ($JAVA_HOME/bin/keytool) should be available on all the HDFS Transparency nodes.
• Ensure that both the HDFS Transparency server and the HDFS client are configured with JDK 8u141 or later.
• Before enabling TLS, confirm whether Kerberos is enabled and verified on the HDFS Transparency
cluster by following the steps in “Kerberos” on page 81 and “Verifying Kerberos” on page 127 topics.
• Before enabling TLS, HDFS Transparency services must be stopped.
Note: For TLS to function properly, CES HDFS hostname must be DNS resolved. Remove any entries in
the /etc/hosts file for the CES HDFS IP from all the cluster nodes.
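
The following is a minimal pre-flight sketch of these checks. The CES host name is a placeholder; run the commands on each HDFS Transparency node:

# keytool from the configured JDK must be present
ls -l "$JAVA_HOME/bin/keytool"

# the CES HDFS host name must resolve through DNS (placeholder host name)
getent hosts cesip09x10.gpfs.net

# there must be no /etc/hosts entry for the CES HDFS IP; expect no output
grep -n cesip09x10 /etc/hosts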

Enabling TLS for HDFS Transparency using the automation script


This section lists the steps to enable TLS for CES HDFS Transparency by using the
gpfs_tls_configuration.py script.
The gpfs_tls_configuration.py script is available from HDFS Transparency 3.1.1-5 under IBM
Storage Scale 5.1.1.1. This script performs all the steps that are documented in “Manually enabling TLS
for HDFS Transparency” on page 131. The actions performed are logged in the logfile /var/log/
transparency/tls_configuration.log file.
Before following these steps, see the “Prerequisites” on page 129 topic. Passwordless ssh is required
from the NameNode to all HDFS Transparency nodes for the script to run.
Note: In HDFS Transparency 3.1.1-13 and earlier, the TLS certificates created by the script /usr/lpp/
mmfs/hadoop/scripts/gpfs_tls_configuration.py would have validity of 90 days only. From
HDFS Transparency v3.1.1-14 onwards, default validity has been changed to five years and can be
overridden by passing the --validity NUMBER-OF-DAYS flag.
The gpfs_tls_configuration.py automation script performs the following steps:
1. Creates key-pairs for NameNode(s) and DataNodes.
2. Distributes key-pairs to the HDFS Transparency nodes.
3. Creates a common truststore for IBM Storage Scale and distributes among the HDFS Transparency
nodes.
4. Updates CES HDFS configurations and uploads it to CCR.
To enable TLS, run the following command as root from one of the NameNodes:

/usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py enable-tls [--password-file KEYSTORE_PASSWORD_FILE] CES-HDFS-HOSTNAME DISTINGUISHED-NAME

where,



CES-HDFS-HOSTNAME
Is the fully qualified domain name of CES hostname. For example, cesip09x10.gpfs.net.
DISTINGUISHED-NAME
The script uses X.500 Distinguished Names (DN) to create the key-pairs. The following sub parts can
be set for Distinguished Names:
• organizationUnit: (OU) - Department or division name. For example, "Systems".
• organizationName: (O) - Large organization name. For example, "IBM".
• localityName: (L) - Locality (city) name. For example, "SanJose".
• stateName: (S) - State or province name. For example, "California".
• country: (C) - Two letter country code. For example, "US".
For example, --dname 'OU=Systems,O=IBM,L=SanJose,S=California,C=US'
Note: Common Name (CN) cannot be set with this argument. CN is chosen based on the hostname
of the server for which the key pair will be created.
--password-file KEYSTORE_PASSWORD_FILE
This is an optional argument. The password file that is denoted by <KEYSTORE_PASSWORD_FILE>
must be created in the following JSON format to contain the passwords for the keystore and the
truststore:

{
"keystore-password": "CHANGE-ME",
"keystore-keypassword": "CHANGE-ME",
"truststore-password": "CHANGE-ME"
}

If the --password-file file is not specified or if one or more passwords are not specified in the
JSON file, random passwords are generated against the fields that do not have the key:value pair. The
random generated passwords can be found in the ssl-server.xml file.
Note: After the script has successfully run, the password file will be automatically deleted for security
reason.
--validity NUMBER-OF-DAYS
This is an optional argument that takes a number of days as input; the certificates are created with that many days of validity. If this parameter is not used, certificates are created with 5 years of validity from the current date.
For example:

/usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py enable-tls cesip09x10.gpfs.net OU=Systems,O=IBM,L=SanJose,S=California,C=US --password-file /tmp/passwordfile.txt
[ INFO ] Performing validations
[ INFO ] Keystore password file is provided, using passwords from '/tmp/passwordfile.txt'
[ INFO ] Creating spectrum_scale_ces_hdfs_keystore.jks keystore with CES IP host-name:
'cesip09x10.gpfs.net'
[ INFO ] Distributing spectrum_scale_ces_hdfs_keystore.jks containing CES host name key-pair
on NameNodes
[ INFO ] Creating key-pairs for NameNode(s) and DataNodes hosts and adding it to IBM Spectrum
Scale common trust store file: spectrum_scale_ces_hdfs_truststore.jks
[ INFO ] Distributing IBM Spectrum Scale trust store file:
spectrum_scale_ces_hdfs_truststore.jks to all the NameNode(s) and DataNodes
[ INFO ] Performing configuration changes
[ INFO ] Uploading configurations to CCR
[ INFO ] Enablement of TLS on CES HDFS Transparency is completed
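
On HDFS Transparency 3.1.1-14 and later, the same call can also request a custom certificate lifetime with the --validity flag described above. For example, a hedged variant that requests one-year certificates (the host name and Distinguished Name are placeholders):

/usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py enable-tls cesip09x10.gpfs.net OU=Systems,O=IBM,L=SanJose,S=California,C=US --validity 365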

Disabling TLS for HDFS Transparency using the automation script


This section lists the steps to disable TLS for HDFS Transparency cluster using the
gpfs_tls_configuration.py script.
Note: TLS cannot be disabled on Cloudera Manager. It can only be disabled on HDFS Transparency
cluster.



1. Stop the HDFS Transparency services by running the following commands:
a. Stop the DataNodes by running the following command as root on any HDFS Transparency node:

mmhdfs hdfs-dn stop

b. Stop the NameNodes by running the following command as root on a CES HDFS NameNode:

mmces service stop HDFS -N <NN1>,<NN2>

2. Run the following command on a CES HDFS NameNode to disable TLS for the HDFS Transparency
cluster:

/usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py disable-tls

Running this script performs the following:


• Clears the configuration changes that were performed while enabling TLS.
• Deletes the TLS certificate files created for HDFS Transparency on all the HDFS Transparency nodes.
• Uploads the configuration to IBM Storage Scale CCR.
3. Delete the IBM Storage Scale Trust store file manually from all the Cloudera managed nodes. On all the
Cloudera managed nodes run the following command:

rm -f /var/lib/cloudera-scm-agent/agent-cert/spectrum_scale_ces_hdfs_truststore.jks

4. Start the HDFS Transparency services by running the following commands:


a. Start the NameNodes by running the following command as root on a CES HDFS NameNode:

mmces service start HDFS -N <NN1>,<NN2>

b. Start the DataNodes by running the following command as root on any HDFS Transparency node:

mmhdfs hdfs-dn start

Manually enabling TLS for HDFS Transparency


This section lists the steps to manually enable TLS for HDFS Transparency.
Overview:
For enabling TLS on the CES HDFS Transparency cluster, the following must be created on the CES HDFS
nodes:
• Truststores
• Certificates
• Keystores
You must then verify that the IBM Storage Scale HDFS clients can securely access the IBM Storage Scale
file system.
Before following these steps, see the Prerequisites topic.
Procedure:
1. On all the HDFS Transparency nodes, as root, create a directory /etc/security/serverKeys/
where the TLS keys and certificates can be stored. On all the HDFS Transparency nodes, run the
following command:

# mkdir -p /etc/security/serverKeys/

2. Log in to one of the CES HDFS NameNode as root (for example, NameNode1) and run all the following
commands from that one node:



a. Create a keystore specific to CES HDFS IP, using the keytool -genkey command.

# keytool -genkey -alias <CES_HOSTNAME_FQDN> -keyalg RSA -keysize 2048 -validity 1800
-keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks

where, <CES_HOSTNAME_FQDN> is the FQDN hostname corresponding to the CES IP configured for your CES HDFS cluster.
After you run the command, enter a password for the keystore, a password for the key, your first
and last name, and your organization and location details.
Note:
• Your first name and last name must be same as your CES IP hostname.
• The keystore password and the key password will be needed for later configuration steps.
Therefore, keep the passwords in a safe place.
For example,

# keytool -genkey -alias cesip09x15.gpfs.net -keyalg RSA -keysize 2048 -validity 1800
-keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks

Enter keystore password:


Re-enter new password:
What is your first and last name?
[Unknown]: cesip09x15.gpfs.net
What is the name of your organizational unit?
[Unknown]: IBM
What is the name of your organization?
[Unknown]: IBM
What is the name of your City or Locality?
[Unknown]: Poughkeepsie
What is the name of your State or Province?
[Unknown]: New York
What is the two-letter country code for this unit?
[Unknown]: US
Is CN=cesip09x15.gpfs.net, OU=IBM, O=IBM, L=Poughkeepsie, ST=New York, C=US correct?
[no]: yes

Enter key password for <cesip09x15.gpfs.net>


(RETURN if same as keystore password):

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which
is an industry standard format using "keytool -importkeystore -srckeystore /etc/security/
serverKeys/spectrum_scale_ces_hdfs_keystore.jks -destkeystore /etc/security/serverKeys/
spectrum_scale_ces_hdfs_keystore.jks -deststoretype pkcs12".
#

b. For the keystore that is created in the above step, export the certificate public key to a certificate
file.

# keytool -export -alias <CES_HOSTNAME_FQDN> -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks -rfc -file /etc/security/serverKeys/<CES_HOSTNAME_FQDN>.pem

where, <CES_HOSTNAME_FQDN> is the FQDN hostname corresponding to the CES IP configured for
your CES HDFS cluster.
For example,

# keytool -export -alias cesip09x15.gpfs.net -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks -rfc -file /etc/security/serverKeys/cesip09x15.gpfs.net.pem
Enter keystore password:
Certificate stored in file </etc/security/serverKeys/cesip09x15.gpfs.net.pem>

c. Distribute the generated keystore to all the HDFS Transparency NameNodes. These keystores will be updated in the next step.



Run the following command for each <HDFS Transparency NameNode>:

# scp /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks root@<HDFS Transparency NameNode>:/etc/security/serverKeys/

Note: Do not copy the Keystore to DataNodes. DataNodes have their own Keystores that will be
created in step 4.
3. On each CES HDFS NameNode, run the following commands as a root user:
a. Update the keystore specific to this NameNode hostname.

# keytool -genkey -alias <NN_FQDN_HOSTNAME> -keyalg RSA -keysize 2048 -validity 1800
-keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks

where, <NN_FQDN_HOSTNAME> is the FQDN hostname of this NameNode. When you are prompted, use the keystore password from step 2.a. This command adds the generated key pair to the existing keystore that was created for the CES HDFS IP hostname.
b. For the keystore that is created in the above step, export the certificate public key to a certificate
file.

# keytool -export -alias <NN_FQDN_HOSTNAME> -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks -rfc -file /etc/security/serverKeys/<NN_FQDN_HOSTNAME>.pem

where, <NN_FQDN_HOSTNAME> is the FQDN hostname of this NameNode. When you are
prompted, use the keystore password from step 2.a.
4. On each HDFS Transparency DataNode, run the following commands as a root user:
a. Create a keystore specific to this DataNode.

# keytool -genkey -alias <DN_HOSTNAME_FQDN> -keyalg RSA -keysize 2048 -validity 1800
-keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks

where, <DN_HOSTNAME_FQDN> is the FQDN hostname of this DataNode. When prompted, assign a password for the keystore.
b. For the keystore that is created in the above step, export the certificate public key to a certificate
file specific to this DataNode.

# keytool -export -alias <DN_HOSTNAME_FQDN> -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks -rfc -file /etc/security/serverKeys/<DN_HOSTNAME_FQDN>.pem

where, <DN_HOSTNAME_FQDN> is the FQDN hostname of this DataNode. When you are prompted, use the keystore password from step 4.a.
5. Create a Master Truststore for HDFS Transparency. Log into the same CES HDFS NameNode as in step
2 and run the following commands from that node:
a. Create a /etc/security/serverKeys/trust_store/ directory where the created
Truststore .jks files corresponding to all NameNodes, DataNodes, and the CES IP will be merged
together to create a master Truststore.

# mkdir -p /etc/security/serverKeys/trust_store/

b. From the NameNode and DataNodes hosts, copy all the .pem files from the /etc/security/
serverKeys/ directory to the /etc/security/serverKeys/trust_store/ directory on this
NameNode.
For each <HDFS Transparency node>, run the following command:

# scp <HDFS Transparency node>:/etc/security/serverKeys/*.pem /etc/security/serverKeys/trust_store/



Ensure that all the .pem files corresponding to every NameNode, DataNode hostname as well as
the CES IP hostname are copied over to /etc/security/serverKeys/trust_store/.
c. For each <.pem file> in the /etc/security/serverKeys/trust_store/ directory, run the
following command:

# keytool -import -noprompt -alias <FQDN hostname corresponding to the .pem file>
-file /etc/security/serverKeys/trust_store/<.pem file> -keystore /etc/security/serverKeys/
spectrum_scale_ces_hdfs_truststore.jks

This will import the certificates into a Master Truststore /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks for HDFS Transparency:
For example,
If namenode1.gpfs.net.pem is the certificate file corresponding to the host
namenode1.gpfs.net, run the following command:

# keytool -import -noprompt -alias namenode1.gpfs.net -file /etc/security/serverKeys/trust_store/namenode1.gpfs.net.pem -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks

Note: When you import the first .pem file, you need to set a password for the Master Truststore. Use
this password for importing the remaining .pem files.
d. Repeat the keytool -import command for all the .pem files.
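
If you prefer to script steps 5.c and 5.d, the following is a minimal sketch; it assumes that every .pem file in the trust_store directory is named after the FQDN of the host (or the CES IP hostname) it was exported for, and that you supply the Master Truststore password when keytool prompts for it:

# Import every exported certificate into the master Truststore
cd /etc/security/serverKeys/trust_store/
for PEM in *.pem; do
    keytool -import -noprompt -alias "${PEM%.pem}" -file "$PEM" \
        -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks
done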
6. Distribute the master Truststore and master certificate to all the HDFS Transparency nodes
(NameNodes and DataNodes).
Run the following command specific to each <HDFS Transparency node>:

# scp /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks root@<HDFS Transparency node>:/etc/security/serverKeys/

7. Create and update the CES HDFS Transparency configuration files.


a. Create a directory to host the configuration files. This directory will host existing and newly created
configuration files.

# mkdir /tmp/hdfsconf

b. Get the existing configuration files from CCR into that directory.

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml,ssl-server.xml,ssl-client.xml

c. Update the existing config files with the following changes based on your environment:
File: core-site.xml

<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>
</property>

<property>
<name>hadoop.ssl.hostname.verifier</name>
<value>DEFAULT</value>
</property>

<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
</property>

<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
</property>

<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
</property>

File: hdfs-site.xml

<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>
<property>
<name>dfs.client.https.need-auth</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.https-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.https-address.<cluster name>.nn1</name>
<value><NameNode 1 hostname>:50470</value>
</property>
<property>
<name>dfs.namenode.https-address.<cluster name>.nn2</name>
<value><NameNode 2 hostname>:50470</value>
</property>

where,
<NameNode 1 hostname> and <NameNode 2 hostname> are the actual CES HDFS NameNode
FQDN hostnames.
<cluster name> is the name of your CES HDFS cluster that is also your HDFS Namespace.
If you want both the secure and unsecure http connections, set dfs.http.policy to HTTP_AND
_HTTPS.
File: ssl-server.xml

<property>
<name>ssl.server.truststore.location</name>
<value>/etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks</value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value><truststore_password></value>
</property>
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.reload.interval</name>
<value>10000</value>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value>/etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value><keystore_password></value>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value><key_password></value>
</property>
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>

where,
<keystore_password> and <key_password> are the corresponding actual passwords from “2.a” on
page 132.
<truststore_password> is the corresponding actual password from “5.c” on page 134.



File: ssl-client.xml

<property>
<name>ssl.client.truststore.location</name>
<value>/etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks</value>
</property>

<property>
<name>ssl.client.truststore.password</name>
<value><truststore_password></value>
</property>
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>

where, <truststore_password> is the corresponding actual password from “5.c” on page 134.
8. Update the CES HDFS configuration to IBM Storage Scale CCR repository and restart the HDFS
Transparency services.
a. Stop HDFS Transparency services for the cluster.
i) On any CES HDFS NameNode, run the following commands:

# mmhdfs hdfs-dn stop


# mmces service stop HDFS -N <NN1>,<NN2>

b. Import the existing and new configuration files to /var/mmfs/hadoop/etc/hadoop by running the following command:

# mmhdfs config import /tmp/hdfsconf core-site.xml,hdfs-site.xml,ssl-server.xml,ssl-client.xml

c. Upload the changes to CCR repository.

# mmhdfs config upload

d. Start the HDFS Transparency services for the cluster.


i) On any CES HDFS NameNode, run the following commands:

# mmhdfs hdfs-dn start


# mmces service start HDFS -N <NN1>,<NN2>

e. Verify that the CES HDFS services have started.


On any CES HDFS NameNode, run the following command:

# mmhdfs hdfs status

Verifying TLS for HDFS Transparency


This section describes the steps to verify TLS security on the HDFS Transparency nodes.
Run kinit with a valid keytab to obtain a Kerberos ticket. For more information, see “Verifying
installation” on page 309.
To list the files under IBM Storage Scale Hadoop root directory, run the following commands:
1. Verify the secure HDFS Java (swebhdfs) client by running the following command:

# echo "hello world" > /tmp/hello

# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls swebhdfs://<HDFS HA Namespace>/
# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -put /tmp/hello swebhdfs://<HDFS HA Namespace>/tmp
# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -cat swebhdfs://<HDFS HA Namespace>/tmp/hello

where, <HDFS HA Namespace> is defined by the fs.defaultFS parameter in your /var/mmfs/hadoop/etc/hadoop/core-site.xml.



2. Verify the https client by running the following command:

# curl --cacert /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem --negotiate -u: https://<CES_HOSTNAME>:50470/webhdfs/v1/?op=LISTSTATUS

where, <CES_HOSTNAME> is the FQDN hostname corresponding to the CES IP configured for your CES
HDFS cluster.
The following command may also be used to verify. However, it bypasses the CA certificate checking.
Therefore, it is not recommended other than for troubleshooting purposes.

# curl -ku: --negotiate https://<CES_HOSTNAME>:50470/webhdfs/v1/?op=LISTSTATUS

Note:
• For Non-HA CES HDFS clusters, use the <CES_HOSTNAME>:<port> format instead of Namespace for
the hdfs commands.
• For curl commands, always use the <CES_HOSTNAME>:<port> format. For Kerberos enabled
clusters, substituting <CES_HOSTNAME> with <CES-IP> will fail with HTTP 401 (Auth) error, as
the Kerberos principal is created only for the CES hostname.

Rotating TLS certificates


You may rotate TLS certificates that were created for IBM Storage Scale HDFS Transparency if your
existing certificates are going to expire or have already expired.
To rotate the TLS certificates created for IBM Storage Scale HDFS Transparency, ensure that you
complete the following steps:
• Stop all HDFS Transparency services.
• Follow the steps under the “Disabling TLS for HDFS Transparency using the automation script” on page
130 topic to disable TLS for HDFS Transparency by using the automation script. This action will remove
all the existing certificates, keystores, and truststores created for HDFS Transparency.
• Follow the steps under the “Enabling TLS for HDFS Transparency using the automation script” on
page 129 topic to enable TLS for HDFS Transparency by using the automation script again. This action
regenerates keystores, truststores, and certificates for HDFS Transparency.
• If Cloudera CDP is integrated, exchange the certificates between Cloudera and IBM Storage Scale
truststores by following the Update Cloudera and IBM Storage Scale truststores step. As a result,
following actions are completed:
– Cloudera Manager public certificates are imported to IBM Storage Scale truststore.
– The older IBM Storage Scale certificates are removed from Cloudera Manager truststore.
– The new IBM Storage Scale certificates are imported to Cloudera Manager truststore.
• If Ranger is enabled, the Ranger TLS plugin file rangerpluginssl.jceks needs to be re-created. For
more information, see the Additional configurations when TLS is enabled step.
• Start HDFS Transparency services:
– If Ranger is enabled together with Cloudera, start the NameNodes by using the workaround in the
Ranger issues with TLS enabled step.
Note: To see the validity of your current certificates, use the keytool -list command with the -v
option. For example:

# keytool -list -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks -v

The Valid from: field in the output shows the validity information.
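
As a convenience (an illustrative one-liner), you can limit the keytool output to the alias and validity lines for each certificate in the truststore:

# Show only the alias and validity period of each certificate
keytool -list -v -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks | grep -E 'Alias name|Valid from'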



Apache Ranger
Learn how to enable Apache Ranger plug-in for HDFS Transparency.
Make sure to meet the following prerequisites before you enable the Apache Ranger plug-in for HDFS
Transparency:
• Set up Apache Ranger according to its installation instructions.
• Install a relational database management system (RDBMS) supported by Apache Ranger, such as
MySQL or MariaDB.
• Verify that Ranger Admin, Ranger Usersync, and Ranger TagSync are successfully installed and without
errors.
• Even though it is not mandatory for installing and using Apache Ranger, it is strongly recommended to enable Kerberos in your Hadoop cluster. Kerberos ensures that all requests are authenticated, which is essential for authorization and auditing. Without Kerberos, users would be able to impersonate other users and work around any authorization policies.
• Make sure that Apache Solr is working well for Apache Ranger. When properly configured, Apache Solr
is used by Apache Ranger to store audit logs; Apache Solr also provides a search capability of the audit
logs through the Ranger Admin GUI.
1. Stop the HDFS Transparency by using the following command:

# mmhdfs hdfs stop

2. To enable Apache Ranger for HDFS Transparency, log in to one of the HDFS Transparency nodes and
change the configuration as described in this step.
For hadoop-env.sh, set the following configuration:
Note: Based on your environment, substitute the right path to the Apache Ranger ranger-hdfs-
plugin library.

for f in <ranger_hdfs_plugin_directory>/lib/*.jar; do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

for f in /usr/share/java/mysql-connector-java.jar; do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

For core-site.xml, set the following configuration:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0](rangeradmin@<REALM_NAME>)s/(.*)@<REALM_NAME>/ranger/
RULE:[2:$1@$0](rangertagsync@<REALM_NAME>)s/(.*)@<REALM_NAME>/rangertagsync/
RULE:[2:$1@$0](rangerusersync@<REALM_NAME>)s/(.*)@<REALM_NAME>/rangerusersync/
……
DEFAULT
</value>
<final>false</final>
</property>

For hdfs-site.xml, set the following configuration:

<property>
<name>dfs.namenode.inode.attributes.provider.class</name>
<value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value>
<final>false</final>
</property>

3. Copy the following configuration files from the Apache Ranger installation directory to an HDFS
Transparency node configuration directory (/var/mmfs/hadoop/etc/hadoop). These configuration
files are generated by the enable-hdfs-plugin.sh script when the Apache Ranger plug-in is
enabled.



• ranger-hdfs-audit.xml
• ranger-hdfs-security.xml
• ranger-policymgr-ssl.xml
4. To synchronize the configuration in all the HDFS Transparency nodes, issue the following command:

# mmhdfs config upload

5. Create the ranger, rangertagsync, and rangerusersync users by using the gpfs_create_hadoop_users_dirs.py script.
Log in to a CES HDFS NameNode and run the following commands:

# /usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group ranger

# /usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group rangertagsync

# /usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group rangerusersync

6. To ensure that changes are effective, start the HDFS Transparency by using the following command:

# mmhdfs hdfs start

Verifying Apache Ranger policy for HDFS Transparency


Learn how to verify the Apache Ranger policy for HDFS Transparency.
After the Apache Ranger HDFS plug-in for HDFS Transparency is configured, you must verify that the
Ranger-based resource control is working properly.
The following example assumes that Kerberos is enabled and shows how to verify an HDFS resource policy by using testuser (an operating system user ID).
1. Log in to the Ranger Admin GUI, at http://<Ranger Admin host>:6080.
2. Create a new policy and set the testuser's READ permission as deny for HDFS Resource Path /tmp.
3. Log in to any CDP node and obtain a Kerberos token for the testuser principal.

# kinit -kt /etc/security/keytabs/testuser.headless.keytab testuser@<REALM_NAME>

4. Make sure that the testuser cannot read from the /tmp directory. The output must be similar to the
following example:

# testuser can write data into /tmp
# hdfs dfs -put /etc/hosts /tmp/rangertest
# testuser cannot read data from /tmp
# hdfs dfs -cat /tmp/rangertest
cat: Permission denied: user=testuser, access=READ, inode="/tmp/rangertest"

Hadoop Storage Tiering with IBM Storage Scale HDFS Transparency

Overview
IBM Storage Scale HDFS Transparency, also known as HDFS Protocol, offers a set of interfaces that allows
applications to use HDFS clients to access IBM Storage Scale through HDFS RPC requests.
For more information about HDFS Transparency, see Chapter 2, “IBM Storage Scale support for Hadoop,”
on page 3.
Currently, if the jobs running on the native HDFS cluster plan to access data from IBM Storage Scale, the
option is to use distcp or Hadoop Storage Tiering mode with native HDFS federation.



Using Hadoop distcp requires the data to be copied between the native HDFS and the IBM Storage Scale HDFS Transparency cluster before it can be accessed, which leaves two copies of the same data consuming storage space. For more information, see “Hadoop distcp support” on page 189.
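
For illustration only (the namespaces and paths below are placeholders), a distcp copy from the native HDFS cluster into the IBM Storage Scale HDFS Transparency cluster looks like this:

# Copy a directory from native HDFS to the remote HDFS Transparency cluster
hadoop distcp hdfs://<native HDFS namespace>/user/data hdfs://<CES HDFS namespace>/user/data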
If you are using Hadoop Storage Tiering mode with native HDFS federation to federate native HDFS and
IBM Storage Scale HDFS Transparency, the jobs running on the Hadoop cluster with native HDFS can
read and write the data from IBM Storage Scale in real time. There would be only one copy of the data.
However, the ViewFs schema used in federation with native HDFS is not certified by the Hive community
and HDFS federation is not supported by Hortonworks HDP 2.6.
Note: Hadoop Storage Tiering mode with native HDFS federation is not supported in HDFS Transparency.

Hadoop Storage Tiering mode without native HDFS federation


This topic shows how to architect and configure a Hadoop Storage Tiering solution with a suite of test
cases executed based on this configuration.
The Hadoop Storage Tiering with IBM Storage Scale architecture is shown in Figure 14 on page 140 and
Figure 15 on page 140:

Figure 14. Hadoop Storage Tiering with IBM Storage Scale without HDP cluster

Figure 15. Hadoop Storage Tiering with IBM Storage Scale with HDP clusters

The architecture for Hadoop Storage Tiering has a native HDFS cluster (local cluster), shown on the left-hand side, and an IBM Storage Scale HDFS Transparency cluster (remote cluster), shown on the right-hand side. The jobs running on the native HDFS cluster can access the data from the native HDFS or from the IBM Storage Scale HDFS Transparency cluster according to the input or output data path or the metadata path. For example, a Hive job follows the Hive metadata path.
Note: The Hadoop cluster deployed on the IBM Storage Scale HDFS Transparency cluster side is not a requirement for the Hadoop Storage Tiering with IBM Storage Scale solution. It is shown only to illustrate that a Hadoop cluster can access data from the IBM Storage Scale file system through HDFS or POSIX.
The configuration in this documentation was set up without the HDP components on the remote cluster.
This document used the following software versions for testing:

Clusters                                Stack                                        Version

HDP cluster                             Ambari                                       2.6.1.0
                                        HDP                                          2.6.4.0
                                        HDP-Utils                                    1.1.0.22

IBM Storage Scale & HDFS                IBM Storage Scale                            5.0.0
Transparency cluster                    HDFS Transparency                            2.7.3-2
                                        IBM Storage Scale Ambari management pack     2.4.2.4

Common configuration

Setup local native HDFS cluster


To set up the local native HDFS cluster:
• Follow the HDP guide from Hortonworks to set up the native HDFS cluster.
• Refer to the Enable Kerberos section to set up Kerberos and to the Enable Ranger section to set up Ranger in a Hadoop Storage Tiering configuration.

Setup remote HDFS Transparency cluster


This topic lists the steps to set up the remote HDFS Transparency cluster.
To set up the remote IBM Storage Scale HDFS Transparency cluster, follow one of the following options:
Option 1: IBM Storage Scale and HDFS Transparency cluster
This configuration is just for storage and does not have any Hadoop components.
1. Follow “Installing” on page 29 and “Configuring” on page 52 to set up the IBM Storage Scale HDFS
Transparency cluster.
2. Refer to “Enable Kerberos” on page 143 section to setup Kerberos and “Enable Ranger” on page 147
section to setup Ranger in a Hadoop Storage Tiering configuration.
Option 2: HDP with IBM Storage Scale and HDFS Transparency integrated cluster
1. Follow the “Installation” on page 359 topic to setup HDP and IBM Storage Scale HDFS Transparency
cluster.
2. Refer to “Enable Kerberos” on page 143 section to setup Kerberos and “Enable Ranger” on page 147
section to setup Ranger in a Hadoop Storage Tiering configuration.

Fixing Hive schema on local Hadoop cluster
After the local native HDFS cluster and the remote IBM Storage Scale HDFS Transparency cluster are deployed, follow these steps on the local native HDFS cluster to prevent the Hive schema from being changed when Hive Server2 (a component of the Hive service) is restarted.
Replace the following <> with your cluster specific information:
• <ambari-user>:<ambari-password> - Login and password used for Ambari
• <ambari-server>:<ambari-port> - The URL used to access the Ambari UI
• <cluster-name> Refers to the cluster name. The cluster name is located at the top left side of the
Ambari panel in between the Ambari logo and the Background Operations (ops) icon.
On the local Hadoop cluster:
1. Get the cluster environment tag version.

curl -u <ambari-user>:<ambari-password> -H "X-Requested-By: ambari" -X GET \
"http://<ambari-server>:<ambari-port>/api/v1/clusters/<cluster-name>/configurations?type=cluster-env"

Note: By default, the cluster-env tag is at version1 if the cluster-env was never updated.
However, if the cluster-env was updated, you need to check manually the latest version to use.
2. Save the specific tag version cluster environment into the cluster_env.curl file by running the following
command:

curl -u <ambari-user>:<ambari-password> -H "X-Requested-By: ambari" -X GET \
"http://<ambari-server>:<ambari-port>/api/v1/clusters/<cluster-name>/configurations?type=cluster-env&tag=<tag_version_found>" > cluster_env.curl

For example, running the command on the Ambari server host:

[root@c16f1n07 ~]# curl -u admin:admin -H "X-Requested-By: ambari" -X GET \
"http://localhost:8080/api/v1/clusters/hdfs264/configurations?type=cluster-env&tag=version1"

3. Copy the cluster_env.curl file into cluster_env.curl_new and modify cluster_env.curl_new with the following information:
a. Set the manage_hive_fsroot field to false.
b. If Kerberos is enabled, set the security_enabled field to true.
c. Modify the beginning of the cluster_env.curl_new
From:

{
"href" : "http://localhost:8080/api/v1/clusters/hdfs264/configurations?type=cluster-
env&tag=version1",
"items" : [
{
"href" : "http://localhost:8080/api/v1/clusters/hdfs264/configurations?type=cluster-
env&tag=version1",
"tag" : "version1",
"type" : "cluster-env",
"version" : 1,
"Config" : {
"cluster_name" : "hdfs264",
"stack_id" : "HDP-2.6"
},

To:

{
"tag" : "version2",
"type" : "cluster-env",
"version" : 2,
"Config" : {
"cluster_name" : "hdfs264",

142 IBM Storage Scale: Big Data and Analytics Guide


"stack_id" : "HDP-2.6"
},

Note: If the cluster-env was updated from the default value of 1, increment the tag and version accordingly based on the last numeric value.
d. Remove the last symbol ] and } at the end of the cluster_env.curl_new file.
4. Run the following command after replacing the “/path/to” with the real path to the
cluster_env.curl_new file to POST the update:

curl -u <ambari-user>:<ambari-password> -H "X-Requested-By: ambari" -X POST \
"http://<ambari-server>:<ambari-port>/api/v1/clusters/<cluster-name>/configurations" --data @/path/to/cluster_env.curl_new

For example, on the Ambari server host, run:

[root@c16f1n07 ~]# curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
"http://localhost:8080/api/v1/clusters/hdfs264/configurations" --data @cluster_env.curl_new

5. Run the following command to PUT the update:

curl -u <ambari-user>:<ambari-password> -H "X-Requested-By: ambari" -X PUT \
"http://<ambari-server>:<ambari-port>/api/v1/clusters/<cluster-name>" -d $'{
"Clusters": {
"desired_config": {
"type": "cluster-env",
"tag": "version2"
}
}
}'

For example, on the Ambari server host, run:

[root@c16f1n07 ~]# curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
"http://localhost:8080/api/v1/clusters/hdfs264" -d $'{
> "Clusters": {
> "desired_config": {
> "type": "cluster-env",
> "tag": "version2"
> }
> }
> }'

Verifying environment
Refer to the “Hadoop test case scenarios” on page 150 on how to test and leverage Hadoop Storage
Tiering with IBM Storage Scale.

Enable Kerberos
To enable Kerberos on the native HDFS cluster, the native HDFS cluster and the remote IBM Storage Scale HDFS Transparency cluster must have the same Kerberos principals for the HDFS service.
After setting up the local native HDFS cluster and the remote HDFS Transparency cluster based on the Common configuration section, follow these additional steps to configure Kerberos:
1. Enable Kerberos on the local native HDFS/HDP cluster by installing a new MIT KDC by following the
Hortonworks documentation for Configuring Ambari and Hadoop for Kerberos.
2. Perform the following configuration changes on the remote HDFS Transparency cluster:
For cluster with Ambari:
a. Follow the “Setting up KDC server and enabling Kerberos” on page 425, using the MIT KDC server that was already set up above, so that the same test user account (such as hdp-user1 in the examples below) principal and keytab can be managed on both the local native HDFS cluster and the remote IBM Storage Scale HDFS Transparency cluster.
b. By default, HDP appends the cluster name to the service principals. If the remote HDFS Transparency cluster has the same cluster name as the local native HDFS/HDP cluster, the Kerberos principals on one of the clusters must be changed manually so that the two clusters use different service principal names.
For example, if the remote HDFS Transparency cluster's name is REMOTE, the default principal for the remote HDFS Transparency HDFS service is set to hdfs-REMOTE@{REALMNAME}. If the local native HDFS/HDP cluster has the same cluster name (REMOTE), the HDFS service on the native HDFS/HDP cluster fails to start.
If you do not change the remote HDFS Transparency cluster service principals, you can change the local native HDFS/HDP cluster default service principal to another value such as hdfs@{REALMNAME}.
c. If the remote HDFS Transparency cluster has a different cluster name than the local native
HDFS/HDP cluster, Kerberos can be enabled by following the “Setting up KDC server and enabling
Kerberos” on page 425 section. After enabling Kerberos, add all the service principal rules from the
local native cluster to the remote HDFS Transparency cluster.
For example, go to Ambari > Services > HDFS > CONFIGS > Advanced core-site >
hadoop.security.auth_to_local to find all the service principals and copy them to the remote
Transparency cluster.

RULE:[1:$1@$0]([email protected])s/.*/accumulo/
RULE:[1:$1@$0]([email protected])s/.*/ambari-qa/
RULE:[1:$1@$0]([email protected])s/.*/druid/
RULE:[1:$1@$0]([email protected])s/.*/hbase/
RULE:[1:$1@$0]([email protected])s/.*/hdfs/
RULE:[1:$1@$0]([email protected])s/.*/spark/
RULE:[1:$1@$0]([email protected])s/.*/accumulo/
RULE:[1:$1@$0]([email protected])s/.*/yarn-ats/
RULE:[1:$1@$0]([email protected])s/.*/zeppelin/

For cluster without Ambari:


Note: From HDFS Transparency version 3.x, the HDFS Transparency configuration directory is changed
from /usr/lpp/mmfs/hadoop/etc/hadoop to /var/mmfs/hadoop/etc/hadoop. Ensure that the
correct directory paths are used with the corresponding changes when manually configuring HDFS
Transparency.
a. Do not copy the hadoop-env.sh from the local native HDFS/HDP cluster to the HDFS
Transparency cluster.
b. If dfs.client.read.shortcircuit is true, run the following command on one of the HDFS
Transparency nodes. Otherwise, the HDFS Transparency DataNode fails to start.

/usr/lpp/mmfs/bin/mmdsh -N all “chown root:root -R /var/lib/hadoop-hdfs”

No change is required on the HDFS Transparency cluster if the


dfs.client.read.shortcircuit is set to false in the hdfs-site.xml on the local native
HDFS cluster.
c. Copy the configuration files, core-site.xml and hdfs-site.xml, located in /etc/hadoop/conf from the local native HDFS cluster to /usr/lpp/mmfs/hadoop/etc/hadoop on one of the nodes in the HDFS Transparency cluster.
d. Change the NameNode value from the local native HDFS cluster NameNode to the HDFS
Transparency NameNode on the HDFS Transparency node selected in “2.c” on page 144 for both
the core-site.xml and hdfs-site.xml files.
e. Remove the property net.topology.script.file.name in /usr/lpp/
mmfs/hadoop/etc/hadoop/core-site.xml and remove the property
dfs.hosts.exclude and secondary NameNode related properties
dfs.namenode.secondary.http-address, dfs.namenode.checkpoint.dir,
dfs.secondary.namenode.kerberos.internal.spnego.principal,
dfs.secondary.namenode.kerberos.principal,
dfs.secondary.namenode.keytab.file in /usr/lpp/mmfs/hadoop/etc/hadoop/hdfs-
site.xml on the HDFS Transparency node selected in “2.c” on page 144.

f. On the HDFS Transparency node selected in “2.c” on page 144, run /usr/lpp/mmfs/bin/
mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop to sync all
these changes into the other HDFS Transparency nodes.
3. Enable Kerberos on the remote HDFS Transparency cluster.
For cluster with Ambari
a. Follow the “Enabling Kerberos when the IBM Spectrum Scale service is integrated” on page 425 to
enable Kerberos on IBM Storage Scale HDFS Transparency cluster.
For cluster without Ambari:
a. Ensure the HDFS Transparency cluster is not in running status.

/usr/lpp/mmfs/bin/mmhadoopctl connector status

b. Use the same KDC server as the local native HDFS/HDP cluster.
c. Install the Kerberos clients package on all the HDFS Transparency nodes.

yum install -y krb5-libs krb5-workstation

d. Sync the KDC Server config, /etc/krb5.conf, to the Kerberos clients (All the HDFS Transparency
nodes).
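For example, a simple way to distribute the file is with scp from a node that already has the final /etc/krb5.conf. This is only a sketch; the host names are placeholders for your HDFS Transparency nodes:

# for host in transparency-node1 transparency-node2 transparency-node3; do scp /etc/krb5.conf ${host}:/etc/krb5.conf; done
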
HDFS Transparency principals and keytabs list information:

Component          Principal name                  Keytab File Name

NameNode           nn/$NN_Host_FQDN@REALMS         nn.service.keytab
NameNode HTTP      HTTP/$NN_Host_FQDN@REALMS       spnego.service.keytab
DataNode           dn/$DN_Host_FQDN@REALMS         dn.service.keytab

Note: Replace the NN_Host_FQDN with your HDFS Transparency NameNode hostname and
replace the DN_Host_FQDN with your HDFS Transparency DataNode hostname. If HDFS
Transparency NameNode HA is configured, you need to have two principals for both NameNodes.
It is required to have one principal for each HDFS Transparency DataNode.
e. Add the principals above to the Kerberos database on the KDC Server.

#kadmin.local
#kadmin.local: add_principal -randkey nn/$NN_Host_FQDN@REALMS
#kadmin.local: add_principal -randkey HTTP/$NN_Host_FQDN@REALMS
#kadmin.local: add_principal -randkey dn/$DN_Host_FQDN@REALMS

Note: Replace the NN_Host_FQDN and DN_Host_FQDN with your cluster information. It is
required to have one principal for each HDFS Transparency DataNode.
f. Create a directory for the keytab directory and set the appropriate permissions on each of the
HDFS Transparency node.

mkdir -p /etc/security/keytabs/
chown root:root /etc/security/keytabs
chmod 755 /etc/security/keytabs

g. Generate the keytabs for the principals.

#xst -norandkey -k /etc/security/keytabs/nn.service.keytab nn/$NN_Host_FQDN@REALMS

#xst -norandkey -k /etc/security/keytabs/spnego.service.keytab HTTP/$NN_Host_FQDN@REALMS

#xst -norandkey -k /etc/security/keytabs/dn.service.keytab dn/$DN_Host_FQDN@REALMS

Note: Replace the NN_Host_FQDN and DN_Host_FQDN with your cluster information. It is
required to have one principal for each HDFS Transparency DataNode.
h. Copy the appropriate keytab file to each host. If a host runs more than one component (for
example, both NameNode and DataNode), copy the keytabs for both components.
i. Set the appropriate permissions for the keytab files.
On the HDFS Transparency NameNode host(s):

chown root:hadoop /etc/security/keytabs/nn.service.keytab


chmod 400 /etc/security/keytabs/nn.service.keytab
chown root:hadoop /etc/security/keytabs/spnego.service.keytab
chmod 440 /etc/security/keytabs/spnego.service.keytab

On the HDFS Transparency DataNode hosts:

chown root:hadoop /etc/security/keytabs/dn.service.keytab


chmod 400 /etc/security/keytabs/dn.service.keytab

j. Start the HDFS Transparency service from any one of the HDFS Transparency nodes that has passwordless root SSH access to all the other HDFS Transparency nodes:

/usr/lpp/mmfs/bin/mmhadoopctl connector start
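
To confirm that the NameNode and DataNodes started with the new Kerberos configuration, you can check the connector state again with the same status command that is used earlier in this procedure:

/usr/lpp/mmfs/bin/mmhadoopctl connector status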

4. Validate the local native HDFS cluster when Kerberos is enabled by running a MapReduce wordcount
workload.
a. Create users such as hdp-user1 and hdp-user2 on all the nodes of the local native HDFS cluster and the remote HDFS Transparency cluster (for example, c16f1n07.gpfs.net is the local native HDFS cluster NameNode and c16f1n03.gpfs.net is the remote HDFS Transparency cluster NameNode).

kinit -k -t /etc/security/keytabs/hdptestuser.headless.keytab [email protected]

b. The MapReduce wordcount workload run by hdp-user1 and hdp-user2 fails on the local native HDFS cluster node because these users do not yet have Kerberos credentials:

[root@c16f1n07 ~]# su hdp-user2


[hdp-user2@c16f1n07 root]$ klist
klist: Credentials cache file '/tmp/krb5cc_11016' not found
[hdp-user2@c16f1n07 root]$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-
mapreduce-examples.jar
wordcount hdfs://c16f1n07.gpfs.net:8020/user/hdp-user1/redhat-release
hdfs://c16f1n03.gpfs.net:8020/user/hdp-user1/redhat-release-wordcount
18/03/05 22:29:26 INFO client.RMProxy: Connecting to ResourceManager at c16f1n08.gpfs.net/
192.0.2.1:8050
18/03/05 22:29:27 INFO client.AHSProxy: Connecting to Application History server at
c16f1n08.gpfs.net/192.0.2.1:10200
18/03/05 22:29:27 WARN ipc.Client: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid
credentials
provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: Failed on local exception: java.io.IOException:
javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level:
Failed to find any Kerberos tgt)]; Host Details : local host is: "c16f1n07/192.0.2.0";
destination host is: "c16f1n03.gpfs.net":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:785)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
at org.apache.hadoop.ipc.Client.call(Client.java:1498)
at org.apache.hadoop.ipc.Client.call(Client.java:1398)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy10.getDelegationToken(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken
(ClientNamenodeProtocolTranslatorPB.java:985)

c. To fix the MapReduce wordcount workload error, generate the principal and keytab for user hdp-
user1 on the KDC server.

# kadmin.local
#kadmin.local: add_principal -randkey hdp-user1
WARNING: no policy specified for [email protected]; defaulting to no policy
Principal "[email protected]" created.
kadmin.local: xst -norandkey -k /etc/security/keytabs/hdptestuser.headless.keytab hdp-
[email protected]
Entry for principal [email protected] with kvno 1, encryption type aes256-cts-hmac-
sha1-96
added to keytab WRFILE:/etc/security/keytabs/hdptestuser.headless.keytab.
Entry for principal [email protected] with kvno 1, encryption type aes128-cts-hmac-
sha1-96
added to keytab WRFILE:/etc/security/keytabs/hdptestuser.headless.keytab.
Entry for principal [email protected] with kvno 1, encryption type des3-cbc-sha1 added to
keytab WRFILE:/etc/security/keytabs/hdptestuser.headless.keytab.
Entry for principal [email protected] with kvno 1, encryption type arcfour-hmac added to
keytab WRFILE:/etc/security/keytabs/hdptestuser.headless.keytab.
kadmin.local:

d. Copy the hdp-user1 keytab to all the nodes of the local native HDFS cluster and the remote HDFS
Transparency cluster and change the permission for the hdp-user1 keytab file.

[root@c16f1n07 keytabs]#pwd
/etc/security/keytabs
[root@c16f1n07 keytabs]# chown hdp-user1 /etc/security/keytabs/hdptestuser.headless.keytab
[root@c16f1n07 keytabs]# chmod 400 /etc/security/keytabs/hdptestuser.headless.keytab

e. Re-run the MapReduce wordcount workload by user hdp-user1 to ensure that no errors are seen.
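As a sketch of the re-run, using the same hosts, paths, and keytab as the earlier failing example:

[hdp-user1@c16f1n07 root]$ kinit -k -t /etc/security/keytabs/hdptestuser.headless.keytab hdp-user1
[hdp-user1@c16f1n07 root]$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
wordcount hdfs://c16f1n07.gpfs.net:8020/user/hdp-user1/redhat-release \
hdfs://c16f1n03.gpfs.net:8020/user/hdp-user1/redhat-release-wordcount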

Enable Ranger
To enable Ranger on the native HDFS cluster, use the Ranger instance on the native HDFS cluster to control the policies for both the local native HDFS cluster and the remote IBM Storage Scale HDFS Transparency cluster.
After setting up the local native HDFS cluster and the remote HDFS Transparency cluster based on the Common configuration section, follow these additional steps to configure Ranger:
1. Install Ranger by following the Hortonworks documentation for Installing Ranger Using Ambari.
2. Perform the following configuration changes on the remote HDFS Transparency cluster:
For cluster with Ambari:
a. To enable Ranger on IBM Storage Scale HDFS Transparency cluster, see “Enabling Ranger” on page
417.
For cluster without Ambari:
Note: From HDFS Transparency version 3.x, the HDFS Transparency configuration directory is changed
from /usr/lpp/mmfs/hadoop/etc/hadoop to /var/mmfs/hadoop/etc/hadoop. Ensure that the
correct directory paths are used with the corresponding changes when manually configuring HDFS
Transparency.
a. Do not copy the hadoop-env.sh from HDP cluster to the HDFS Transparency cluster.
b. If dfs.client.read.shortcircuit is true, run the following command on one of the HDFS
Transparency nodes. Otherwise, the HDFS Transparency DataNode fails to start.

/usr/lpp/mmfs/bin/mmdsh -N all “chown root:root /var/lib/hadoop-hdfs”

No change is required on the HDFS Transparency cluster if the


dfs.client.read.shortcircuit is set to false in the hdfs-site.xml on the local native
HDFS cluster.
c. Copy the configuration files, core-site.xml and hdfs-site.xml, located in /etc/hadoop/conf from the local native HDFS cluster to /usr/lpp/mmfs/hadoop/etc/hadoop on one of the nodes in the HDFS Transparency cluster.

d. Change the NameNode value from the local native HDFS cluster NameNode to the HDFS
Transparency NameNode on the HDFS Transparency node selected in step 2.c for both the core-
site.xml and hdfs-site.xml files.
e. Remove the property net.topology.script.file.name in /usr/lpp/mmfs/hadoop/etc/
hadoop/core-site.xml and remove the property dfs.hosts.exclude in /usr/lpp/
mmfs/hadoop/etc/hadoop/hdfs-site.xml and secondary NameNode related properties
dfs.namenode.secondary.http-address, dfs.namenode.checkpoint.dir on the HDFS
Transparency node selected in step 2.c.
f. On the HDFS Transparency node selected in step 2.c, run /usr/lpp/mmfs/bin/mmhadoopctl
connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop to sync all these changes into
the other HDFS Transparency nodes.
3. Enable Ranger on the remote HDFS Transparency cluster.
For cluster with Ambari:
After Ranger is configured, ensure that all services in the Ambari GUI start successfully and run a Service Check to verify that no issues are caused by enabling Ranger.
For cluster without Ambari:
a. Ensure that the HDFS Transparency cluster is not in running status.

/usr/lpp/mmfs/bin/mmhadoopctl connector status

b. From the /etc/hadoop/conf directory, copy the ranger-hdfs-audit.xml, ranger-hdfs-security.xml, ranger-policymgr-ssl.xml, and ranger-security.xml files from the local native HDFS cluster into the /usr/lpp/mmfs/hadoop/etc/hadoop directory on all the nodes in the HDFS Transparency cluster.
c. Check the value of gpfs.ranger.enabled in gpfs-site.xml. The default value is true even if the property is not configured in the /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml file. If it is false, set it to true.
Note: From HDFS Transparency 3.1.0-6 and 3.1.1-3, ensure that the gpfs.ranger.enabled field
is set to scale. The scale option replaces the original true/false values.
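As a quick check, assuming the property is stored as a standard Hadoop-style property entry in gpfs-site.xml, you can grep for the current value:

# grep -A 1 "gpfs.ranger.enabled" /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml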
d. Add the following to the hadoop-env.sh file on the HDFS Transparency NameNode:
For HDP 2.6.x:

for f in /usr/hdp/2.6.4.0-65/ranger-hdfs-plugin/lib/*.jar; do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

for f in /usr/share/java/mysql-connector-java.jar; do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

For HDP 3.x:

for f in /usr/hdp/<your-HDP-version>/ranger-hdfs-plugin/lib/*.jar;
do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/share/java/mysql-connector-java.jar

for f in /usr/hdp/<your-HDP-version>/hadoop/client/jersey-client.jar;
do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

e. As root, create the directories that are referenced by the preceding classpath entries on the HDFS Transparency NameNode.

mkdir -p /usr/hdp/2.6.4.0-65/ranger-hdfs-plugin/lib
mkdir -p /usr/share/java/

Note: Change the version string 2.6.4.0-65 value based on your HDP stack version.

f. Copy the Ranger enablement dependency files from any one node in the local native HDFS cluster to the HDFS Transparency NameNode:

scp -r {$NATIVE_HDFS_NAMENODE}:/usr/share/java/* {$HDFS_Trans_NAMENODE}:/usr/share/java/

scp -r {$NATIVE_HDFS_NAMENODE}:/usr/hdp/2.6.4.0-65/ranger-hdfs-plugin/lib/* {$HDFS_Trans_NAMENODE}:/usr/hdp/2.6.4.0-65/ranger-hdfs-plugin/lib/

Note: Replace the NATIVE_HDFS_NAMENODE with your hostname of the local native HDFS
NameNode.
Replace the HDFS_Trans_NAMENODE with your hostname of the HDFS Transparency NameNode.
g. To start the HDFS Transparency cluster, issue the /usr/lpp/mmfs/bin/mmhadoopctl
connector start command.
4. Validate the local native HDFS and the HDFS Transparency cluster when Ranger is enabled.
a. Create a user such as hdp-user1 on all nodes of the local native HDFS cluster and the HDFS Transparency cluster (for example, c16f1n07.gpfs.net is the local native HDFS cluster NameNode and c16f1n03.gpfs.net is the remote HDFS Transparency cluster NameNode).
b. The /user/hive directory in the remote HDFS Transparency cluster is created with rwxr-xr-x permissions and is owned by root, so the hdp-user1 user does not have write access to it:

[hdp-user1@c16f1n07 root]$ hadoop fs -ls -d hdfs://c16f1n03.gpfs.net:8020/user/hive


drwxr-xr-x - root root 0 2018-03-05 04:23 hdfs://c16f1n03.gpfs.net:8020/user/
hive

c. The Hive CLI command fails to write data to the local native HDFS cluster or to the HDFS Transparency cluster due to a permission error:
hive> CREATE DATABASE remote_db_gpfs_2 COMMENT 'Holds the tables data in remote location GPFS cluster'
LOCATION
'hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db_gpfs_2';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.security.AccessControlException:
Permission denied: user=hdp-user1, access=WRITE, inode="/user/hive":root:root:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219
)
at
org.apache.hadoop.hdfs.server.namenode.GPFSPermissionChecker.checkPermission(GPFSPermissionChecker.java
:86)
at
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkDefaultEnf
orcer
(RangerHdfsAuthorizer.java:428)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.
checkPermission(RangerHdfsAuthorizer.java:304)

d. Log in to the Ranger admin web URL and create a policy that assigns RWX on the /user/hive directory to the hdp-user1 user on both the local native HDFS cluster and the HDFS Transparency cluster.
e. Re-run the Hive CLI command to ensure that no errors are seen.
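For example, re-running the statement that failed earlier (a sketch only, run as the hdp-user1 user with the host names from this example) should now complete without an AccessControlException:

$ hive -e "CREATE DATABASE remote_db_gpfs_2 COMMENT 'Holds the tables data in remote location GPFS cluster' LOCATION 'hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db_gpfs_2'"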

Hadoop test case scenarios
This section describes test cases that were run on the local Hadoop cluster with the Hadoop Storage Tiering configuration.

MapReduce cases without Kerberos

Test case name Step Description


Word count 1 Put the local file /etc/redhat-
release into native HDFS.
2 Put the local file /etc/redhat-
release into IBM Storage Scale
HDFS Transparency cluster
3 Run the MapReduce WordCount
job with input from the native
HDFS and generate output
to IBM Storage Scale HDFS
Transparency cluster.
4 Run the MapReduce WordCount
job with input from the
IBM Storage Scale HDFS
Transparency cluster and
generate output to the native
HDFS.

Running MapReduce without Kerberos test


1. Run a MapReduce WordCount job with input from the local native HDFS cluster and generate the
output to the remote HDFS Transparency cluster.

sudo -u hdfs yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar


wordcount hdfs://c16f1n07.gpfs.net:8020/tmp/mr/passwd hdfs://c16f1n03.gpfs.net:8020/tmp/mr/

sudo -u hdfs hadoop fs -ls -R hdfs://c16f1n03.gpfs.net:8020/tmp/mr


-rw-r--r-- 3 hdfs root 0 2018-03-11 23:13 hdfs://c16f1n03.gpfs.net:8020/tmp/mr/
_SUCCESS
-rw-r--r-- 1 hdfs root 3358 2018-03-11 23:13 hdfs://c16f1n03.gpfs.net:8020/tmp/mr/
part-r-00000

2. Run a MapReduce WordCount job with input from the remote HDFS Transparency cluster and generate
output to the local native HDFS cluster.

sudo -u hdfs yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar


wordcount hdfs://c16f1n03.gpfs.net:8020/tmp/mr/passwd hdfs://c16f1n07.gpfs.net:8020/tmp/mr/

hadoop fs -ls -R hdfs://c16f1n07.gpfs.net:8020/tmp/mr/


-rw-r--r-- 3 hdfs hdfs 0 2018-03-11 23:30 hdfs://c16f1n07.gpfs.net:8020/tmp/mr/
_SUCCESS
-rw-r--r-- 3 hdfs hdfs 68 2018-03-11 23:30 hdfs://c16f1n07.gpfs.net:8020/tmp/mr/
part-r-00000

Spark cases without Kerberos

Test case name Step Description


Line count and word count 1 Put the local file /etc/passwd
into native HDFS.
2 Put the local file /etc/passwd
into IBM Storage Scale HDFS
Transparency cluster.
3 Run the Spark LineCount/
WordCount job with input from
the native HDFS and generate
output to the IBM Storage Scale
HDFS Transparency cluster.
4 Run the Spark LineCount/
WordCount job with input
from the IBM Storage Scale
HDFS Transparency cluster and
generate output to the native
HDFS.

Running Spark test


Run the Spark shell to perform a word count with input from the local native HDFS and generate output to
the remote HDFS Transparency cluster.
This example uses the Spark Shell (spark-shell).
1. Read the text file from the local native HDFS cluster.

val lines = sc.textFile("hdfs://c16f1n07.gpfs.net:8020/tmp/passwd")

2. Split each line into words and flatten the result.

val words = lines.flatMap(_.split("\\s+"))

3. Map each word into a pair and count them by word (key).

val wc = words.map(w => (w, 1)).reduceByKey(_ + _)

4. Save the result in text files on the remote HDFS Transparency cluster.

wc.saveAsTextFile("hdfs://c16f1n03.gpfs.net:8020/tmp/passwd_sparkshell")

5. Review the contents of the /tmp/passwd_sparkshell output directory.


hadoop fs -ls -R hdfs://c16f1n03.gpfs.net:8020/tmp/passwd_sparkshell
-rw-r--r-- 3 hdfs root 0 2018-03-11 23:58 hdfs://c16f1n03.gpfs.net:8020/tmp/passwd_sparkshell/
_SUCCESS
-rw-r--r-- 1 hdfs root 1873 2018-03-11 23:58 hdfs://c16f1n03.gpfs.net:8020/tmp/passwd_sparkshell/
part-00000
-rw-r--r-- 1 hdfs root 1679 2018-03-11 23:58 hdfs://c16f1n03.gpfs.net:8020/tmp/passwd_sparkshell/
part-00001

Hive-MapReduce/Tez without Kerberos cases

Test case name Step Descriptions


DDL operations                      1    Drop remote database if EXISTS
(1. LOAD data local inpath,              cascade.
 2. INSERT into table,              2    Create remote_db with Hive
 3. INSERT Overwrite TABLE)              warehouse on the IBM Storage
                                         Scale HDFS Transparency cluster.
3 Create internal nonpartitioned
table on remote_db.
4 LOAD data local inpath into table
created in the above step.
5 Create internal nonpartitioned
table on the remote IBM Storage
Scale HDFS Transparency cluster.
6 LOAD data local inpath into table
created in the above step.
7 Create internal transactional
table on remote_db.
8 INSERT into table from internal
nonpartitioned table.
9 Create internal partitioned table
on remote_db.
10 INSERT OVERWRITE TABLE from
internal nonpartitioned table.
11 Create external nonpartitioned
table on remote_db.
12 Drop local database if EXISTS
cascade.
13 Create local_db with, Hive
warehouse on local DAS Hadoop
cluster.
14 Create internal nonpartitioned
table on local_db.
15 LOAD data local inpath into table
created in preceding step.
16 Create internal nonpartitioned
table into the local native HDFS
cluster.
17 LOAD data local inpath into table
created in the above step.
18 Create internal transactional
table on local_db.
19 INSERT into table from internal
nonpartitioned table.

20 Create internal partitioned table
on local_db.
21 INSERT OVERWRITE TABLE from
internal nonpartitioned table.
22 Create external nonpartitioned
table on local_db.
DML operations                      1    Query data from local external
(1. Query local database tables,         nonpartitioned table.
 2. Query remote database tables)   2    Query data from local internal
                                         nonpartitioned table.
3 Query data from local
nonpartitioned remote data
table.
4 Query data from local internal
partitioned table.
5 Query data from local internal
transactional table.
6 Query data from remote external
nonpartitioned table.
7 Query data from remote internal
nonpartitioned table.
8 Query data from remote
nonpartitioned remote data
table.
9 Query data from remote internal
partitioned table.
10 Query data from remote internal
transactional table.
JOIN tables in local database 1 JOIN external nonpartitioned
table with internal nonpartitioned
table.
2 JOIN internal nonpartitioned
table with internal nonpartitioned
remote table.
3 JOIN internal nonpartitioned
remote table with internal
partitioned table.
4 JOIN internal partitioned table
with internal transactional table.
5 JOIN internal transactional table
with external nonpartitioned
table.

JOIN tables in remote database 1 JOIN external nonpartitioned
table with internal nonpartitioned
table.
2 JOIN internal nonpartitioned
table with internal nonpartitioned
remote table.
3 JOIN internal nonpartitioned
remote table with internal
partitioned table.
4 JOIN internal partitioned table
with internal transactional table.
5 JOIN internal transactional table
with external nonpartitioned
table.
JOIN tables between local_db        1    JOIN local_db external
and remote_db                            nonpartitioned table
                                         with remote_db internal
                                         nonpartitioned table.
2 JOIN local_db internal
nonpartitioned table
with remote_db internal
nonpartitioned remote table.
3 JOIN local_db internal
nonpartitioned remote table with
remote_db internal partitioned
table.
4 JOIN local_db internal
partitioned table with remote_db
internal transactional table.
5 JOIN local_db internal
transactional table with
remote_db external
nonpartitioned table.
Local temporary table from          1    Create temporary table on
remote_db table                          local_db AS select query from
                                         remote_db table.
                                    2    Query data from temporary table.

IMPORT and EXPORT operations 1 EXPORT local_db internal
partitioned table to the remote
IBM Storage Scale HDFS
Transparency cluster.
2 List the directory/file created on
the remote HDFS Transparency
cluster by EXPORT operation.
3 IMPORT table to create table
in local_db from the EXPORT
data from the above step into
the remote HDFS Transparency
cluster.
4 List the directory/file created on
the local native HDFS cluster by
IMPORT operation.
5 Query data from local_db table
created by IMPORT operation.
6 EXPORT remote_db external table
to the local native HDFS cluster
location.
7 List the directory/file created
on the local Hadoop cluster by
EXPORT operation.
8 IMPORT table to create table on
remote_db from the EXPORT data
from the preceding step on the
local native HDFS cluster.
9 List directory/file created on
the remote HDFS Transparency
cluster by preceding IMPORT
operation.
10 Query data from remote_db table
created by preceding IMPORT
operation.
Table-level and column-level        1    Run table-level statistics
statistics                               command on external
                                         nonpartitioned table.
2 Run DESCRIBE EXTENDED to
check the statistics of the
nonpartitioned table.
3 Run column-level statistics
command on internal partitioned
table.
4 Run DESCRIBE EXTENDED
command to check the statistics of
the partitioned table.

Running Hive on MapReduce execution engine test
This section lists the steps to run Hive on MapReduce execution engine test.
Note: Apache Tez replaces MapReduce as the default Hive execution engine in Apache Hive 3.
MapReduce is no longer supported when using Apache Hive 3.
1. Open the MapReduce execution engine interface.

hive -hiveconf hive.execution.engine=mr

2. Create a local database location on the local native HDFS and create an internal nonpartitioned remote
table.

Hive> CREATE database local_db COMMENT 'Holds all the tables data in local Hadoop cluster'
LOCATION 'hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db'
OK
Time taken: 0.066 seconds

hive> USE local_db;


OK
Time taken: 0.013 seconds
hive> CREATE TABLE passwd_int_nonpart_remote (user_name STRING, password STRING, user_id
STRING,
group_id STRING,
user_id_info STRING, home_dir STRING, shell STRING) ROW FORMAT DELIMITED FIELDS TERMINATED
BY ':'
LOCATION 'hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db/passwd_int_nonpart_remote'
OK
Time taken: 0.075 seconds

3. Create an external nonpartitioned table on the local native HDFS cluster.

hive> CREATE EXTERNAL TABLE passwd_ext_nonpart (user_name STRING, password STRING,


user_id STRING, group_id STRING, user_id_info STRING, home_dir STRING, shell STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ':' LOCATION
'hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db/passwd_in t_nonpart_remote'
OK
Time taken: 0.066 seconds

Running Hive on Tez execution engine test


This section lists the steps to run Hive on Tez execution engine test.
1. Open the Tez execution engine interface.

hive -hiveconf hive.execution.engine=tez

2. Create a remote database location on the remote HDFS Transparency cluster and create an internal
partitioned table.

Hive> CREATE database remote_db COMMENT 'Holds all the tables data in remote HDFS
Transparency cluster'
LOCATION 'hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db'
OK
Time taken: 0.08 seconds
hive> USE remote_db;
OK
Time taken: 0.238 seconds
hive> CREATE TABLE passwd_int_part (user_name STRING, password STRING, user_id STRING,
user_id_info STRING,
home_dir STRING, shell STRING) PARTITIONED BY (group_id STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ':';
OK
Time taken: 0.218 seconds

3. Create a local database location on the local native HDFS cluster and create an internal transactional
table.

hive> CREATE database local_db COMMENT 'Holds all the tables data in local Hadoop cluster'
LOCATION 'hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db';
OK
Time taken: 0.035 seconds
hive> USE local_db ;
OK

Time taken: 0.236 seconds
hive> CREATE TABLE passwd_int_trans (user_name STRING, password STRING, user_id STRING,
group_id STRING,
user_id_info STRING, home_dir STRING, shell STRING) CLUSTERED by(user_name) into 3 buckets
stored as orc
tblproperties ("transactional"="true");
OK
Time taken: 0.173 seconds

Running Hive import and export operations test


This section lists the steps for running Hive import and export operations test.
1. On the local HDFS cluster, EXPORT local_db internal partitioned table to the remote HDFS
Transparency cluster.

hive> EXPORT TABLE local_db.passwd_int_part TO


'hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export';
OK
Time taken: 0.986 seconds

2. On the local HDFS cluster, list the directory/file that was created on the remote HDFS Transparency
cluster using the EXPORT operation.

hive> dfs -ls hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export;


Found 2 items
-rw-r--r-- 1 hdp-user1 root 2915 2018-03-19 21:43
hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export/_metadata
drwxr-xr-x - hdp-user1 root 0 2018-03-19 21:43
hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export/group_id=2011-12-14

3. On the local HDFS cluster, IMPORT table to create a table in the local_db from the EXPORT data from
the above step on the remote HDFS Transparency cluster.
hive> IMPORT TABLE local_db.passwd_int_part_import FROM
'hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export'
LOCATION 'hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db/passwd_int_part_import';
Copying data from hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export/group_id=2011-12-14
Copying file: hdfs://c16f1n03.gpfs.net:8020/user/hive/remote_db/passwd_int_part_export/group_id=2011-12-14/lt101.sorted.txt
Loading data to table local_db.passwd_int_part_import partition (group_id=2011-12-14)
OK
Time taken: 1.166 seconds

4. List the directory/file created on the local native HDFS cluster by using the IMPORT operation.

hive> dfs -ls hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db/passwd_int_part_import;


Found 1 items
drwxr-xr-x - hdp-user1 hdfs 0 2018-03-19 21:59
hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db/passwd_int_part_import/group_id=2011-12-14

5. Query data from the local_db table created by the IMPORT operation.

hive> select * from local_db.passwd_int_part_import;


OK
0 val_0 NULL NULL NULL NULL NULL 2011-12-14
0 val_0 NULL NULL NULL NULL NULL 2011-12-14
0 val_0 NULL NULL NULL NULL NULL 2011-12-14
10 val_10 NULL NULL NULL NULL NULL 2011-12-14
11 val_11 NULL NULL NULL NULL NULL 2011-12-14
12 val_12 NULL NULL NULL NULL NULL 2011-12-14
15 val_15 NULL NULL NULL NULL NULL 2011-12-14
17 val_17 NULL NULL NULL NULL NULL 2011-12-14
18 val_18 NULL NULL NULL NULL NULL 2011-12-14
24 val_24 NULL NULL NULL NULL NULL 2011-12-14
35 val_35 NULL NULL NULL NULL NULL 2011-12-14
35 val_35 NULL NULL NULL NULL NULL 2011-12-14
37 val_37 NULL NULL NULL NULL NULL 2011-12-14
…...
Time taken: 0.172 seconds, Fetched: 84 row(s)

TPC-DS cases

Test case name Step Description


Prepare Hive-testbench 1 Download latest Hive-testbench
from Hortonworks github
repository.
2 Run tpcds-build.sh to build
TPC-DS data generator.
3 Run tpcds-setup to set up the
testbench database and load
the data into created tables.
Database on remote Hadoop           1    Create LLAP database on
cluster and load data                    remote IBM Storage Scale HDFS
                                         Transparency cluster.
2 Create 24 tables in LLAP
database required to run the Hive
test benchmark queries.
3 Check the Hadoop file system
location for the 24 table
directories created on the
remote IBM Storage Scale HDFS
Transparency cluster.
TPC-DS benchmarking 1 Switch from default database to
LLAP database.
2 Run query52.sql script.
3 Run query55.sql script.
4 Run query91.sql script.
5 Run query42.sql script.
6 Run query12.sql script.
7 Run query73.sql script.
8 Run query20.sql script.
9 Run query3.sql script.
10 Run query89.sql script.
11 Run query48.sql script.

Running TPC-DS test


This topic lists the steps to run a TPC-DS test.
1. Prepare Hive-testbench by running the tpcds-build.sh script to build the TPC-DS data generator. Run tpcds-setup.sh to set up the testbench database and load the data into the created tables.

cd ~/hive-testbench-hive14/

./tpcds-build.sh

./tpcds-setup.sh 2 (A map reduce job runs to create the data and load the data into hive.
This will take some time to complete. The last line in the script is: Data loaded into
database tpcds_bin_partitioned_orc_2.)

2. Create a new remote Low Latency Analytical Processing (LLAP) database on the remote HDFS
Transparency cluster.

hive> DROP database if exists llap CASCADE;


hive> CREATE database if not exists llap LOCATION 'hdfs://c16f1n03.gpfs.net:8020/user/hive/
llap.db';

3. Create the 24 tables and load data into them from the tpcds_text_2 tables.

hive> DROP table if exists llap.call_center;


hive> CREATE table llap.call_center stored as orc as select * from tpcds_text_2.call_center;

4. Run the benchmark queries on the tables that you created on the remote LLAP database.

hive> use llap;


hive> source query52.sql;
hive> source query55.sql;
hive> source query91.sql;
hive> source query42.sql;
hive> source query12.sql;
hive> source query73.sql;
hive> source query20.sql;
hive> source query3.sql;
hive> source query89.sql;
hive> source query48.sql;

For more information, refer to the Apache Hive SQL document.

Kerberos security cases

Test case name Step Description


Kerberos user setup and testing 1 Create user hdp-user1 on all the
nodes of HDP (local native HDFS)
cluster.
2 Add hdp-user1 principal in the
Kerberos KDC server and assign
password.
3 Create home directory and assign
permission for hdp-user1 in local
native HDFS and IBM Storage
Scale HDFS Transparency cluster
with hadoop dfs interface.
4 Switch to hdp-user1 in Hadoop
client node and query data from
the local native HDFS cluster and
the remote IBM Storage Scale
HDFS Transparency cluster.
5 Put local file /etc/redhat-
release on HDP (local native
HDFS) file system with hadoop
dfs -put.
6 Put local file /etc/redhat-
release on IBM Storage Scale
HDFS Transparency cluster with
hadoop dfs -put.
7 Run MapReduce WordCount job
with input from HDP (local native
HDFS) and generate output
to IBM Storage Scale HDFS
Transparency cluster.
8 Run MapReduce WordCount job
with input from IBM Storage
Scale HDFS Transparency cluster
and generate output to HDP
(local native HDFS).

Non-Kerberos user setup and         1    Create user hdp-user2 in Hadoop
testing                                  client node.
2 Switch to hdp-user2 in Hadoop
client node and query data from
the local native HDFS and the
remote IBM Storage Scale HDFS
Transparency cluster.
3 Create home directory and
assign permission for hdp-user2
in the local native and the
remote IBM Storage Scale HDFS
Transparency cluster.
4 Put local file /etc/redhat-
release on HDP (local native
HDFS) file system.
5 Put local file /etc/redhat-
release on IBM Storage Scale
HDFS Transparency cluster.
6 Run MapReduce WordCount job
with input from HDP (local native
HDFS) and generate output to
the IBM Storage Scale HDFS
Transparency cluster.
7 Run MapReduce WordCount job
with input from IBM Storage
Scale HDFS Transparency cluster
and generate output to HDP
(local native HDFS).

Ranger policy cases

Test case name Step Description


Access and restriction policy 1 Create directory GRANT_ACCESS
on remote IBM Storage Scale
HDFS Transparency cluster.
2 Create directory
RESTRICT_ACCESS on remote
IBM Storage Scale HDFS
Transparency cluster.
3 Create hdp-user1 on all the
nodes of both the Hadoop cluster
(HDP local HDFS) and IBM
Storage Scale.
4 Assign RWX access for the hdp-
user1 on GRANT_ACCESS from
Ranger UI under hdp3_hadoop
Service Manager.
5 Put local file /etc/redhat-
release into GRANT_ACCESS
folder.
6 Put local file /etc/
redhat-release into
RESTRICT_ACCESS folder.
7 Assign only read/write
access for hdp-user1 on
RESTRICT_ACCESS folder from
Ranger UI.
8 Copy file from GRANT_ACCESS to
RESTRICT_ACCESS folder.
9 Assign only read access for hdp-
user1 on RESTRICT_ACCESS
folder from Ranger UI.
10 Delete GRANT_ACCESS and
RESTRICT_ACCESS folders.

Ranger policy cases with Kerberos security cases

Test case name Step Description


MapReduce (word count) 1 Create hdp-user1 home directory
on HDP (local HDFS) and HDFS
Transparency IBM Storage Scale.
2 Assign RWX on /user/hdp-
user1 directory for hdp-user1
on HDP (local HDFS) and
IBM Storage Scale HDFS
Transparency cluster using
Ranger UI.
3 Put local file /etc/redhat-
release on HDP (local HDFS)
file system.
4 Put local file /etc/redhat-
release on IBM Storage Scale
HDFS Transparency cluster.
5 Run MapReduce WordCount job
with input from HDP (local
HDFS), and generate output to
the remote IBM Storage Scale
HDFS Transparency cluster.
6 Run MapReduce WordCount job
with input from IBM Storage
Scale HDFS Transparency cluster
and generate output to HDP
(local native HDFS).
Spark (line count and               1    Put local file /etc/passwd into
word count)                              HDP (local native HDFS) file
                                         system.
2 Put local file /etc/passwd
into IBM Storage Scale HDFS
Transparency cluster.
3 Run Spark LineCount/WordCount
job with input from primary HDP
(local HDFS) and generate output
to the IBM Storage Scale HDFS
Transparency cluster.
4 Run Spark LineCount/WordCount
job with input from remote
IBM Storage Scale HDFS
Transparency cluster and
generate output to the primary
HDP (local native HDFS) HDFS.

Ranger policy with Kerberos security on Hive warehouse cases

Test case name Step Description


Hive data warehouse Ranger          1    Assign RWX on the /user/hive
policy setup                             directory for hdp-user1
                                         on HDP (local native HDFS)
                                         and IBM Storage Scale HDFS
                                         Transparency cluster using
                                         Ranger UI.
DDL operations                      1    Drop remote database if EXISTS
(1. LOAD data local inpath,              cascade.
 2. INSERT into table,              2    Create remote_db with
 3. INSERT Overwrite TABLE)              hive warehouse on remote
                                         IBM Storage Scale HDFS
                                         Transparency cluster.
3 Create internal nonpartitioned
table on remote_db.
4 LOAD data local inpath into table
created in the above step.
5 Create internal nonpartitioned
table on remote IBM Storage
Scale HDFS Transparency cluster.
6 LOAD data local inpath into table
created in the above step.
7 Create internal transactional
table on remote_db.
8 INSERT into table from internal
nonpartitioned table.
9 Create internal partitioned table
on remote_db.
10 INSERT OVERWRITE TABLE from
internal nonpartitioned table.
11 Create external nonpartitioned
table on remote_db.

DDL operations                      12   Drop local database if EXISTS
(1. LOAD data local inpath,              cascade.
 2. INSERT into table,              13   Create local_db with hive
 3. INSERT Overwrite TABLE)              warehouse on local native HDFS
                                         cluster.
14 Create internal nonpartitioned
table on local_db.
15 LOAD data local inpath into table
created in the above step.
16 Create internal nonpartitioned
table on local native HDFS
cluster.
17 LOAD data local inpath into table
created in the above step.
18 Create internal transactional
table on local_db.
19 INSERT into table from internal
nonpartitioned table.
20 Create internal partitioned table
on local_db.
21 INSERT OVERWRITE TABLE from
internal nonpartitioned table.
22 Create external nonpartitioned
table on local_db.

DML operations                      1    Query data from local external
(1. Query local database tables,         nonpartitioned table.
 2. Query remote database tables)   2    Query data from local internal
                                         nonpartitioned table.
                                    3    Query data from local
                                         nonpartitioned remote data
                                         table.
                                    4    Query data from local internal
                                         partitioned table.
5 Query data from local internal
transactional table.
6 Query data from remote external
nonpartitioned table.
7 Query data from remote internal
nonpartitioned table.
8 Query data from remote
nonpartitioned remote data
table.
9 Query data from remote internal
partitioned table.
10 Query data from remote internal
transactional table.
JOIN tables in local database 1 JOIN external nonpartitioned
table with internal nonpartitioned
table.
2 JOIN internal nonpartitioned
table with internal nonpartitioned
remote table.
3 JOIN internal nonpartitioned
remote table with internal
partitioned table.
4 JOIN internal partitioned table
with internal transactional table.
5 JOIN internal transactional table
with external nonpartitioned
table.

JOIN tables in remote database 1 JOIN external nonpartitioned
table with internal nonpartitioned
table.
2 JOIN internal nonpartitioned
table with internal nonpartitioned
remote table.
3 JOIN internal nonpartitioned
remote table with internal
partitioned table.
4 JOIN internal partitioned table
with internal transactional table.
5 JOIN internal transactional table
with external nonpartitioned
table.
JOIN tables between local_db        1    JOIN local_db external
and remote_db                            nonpartitioned table
                                         with remote_db internal
                                         nonpartitioned table.
2 JOIN local_db internal
nonpartitioned table
with remote_db internal
nonpartitioned remote table.
3 JOIN local_db internal
nonpartitioned remote table with
remote_db internal partitioned
table.
4 JOIN local_db internal
partitioned table with
remote_db internal
transactional table.
5 JOIN local_db internal
transactional table with
remote_db external
nonpartitioned table.
Local temporary table from          1    Create temporary table on
remote_db table                          local_db AS select query from
                                         remote_db table.
2 Query data from temporary table.


IMPORT and EXPORT operations        1    EXPORT local_db internal
                                         partitioned table to remote
                                         IBM Storage Scale HDFS
                                         Transparency cluster location.
2 List the directory/file created
on remote Hadoop cluster by
EXPORT operation.
3 IMPORT table to create table in
local_db from the EXPORT data
on remote HDFS Transparency
cluster.
4 List the directory/file created
on the local Hadoop cluster by
IMPORT operation.
5 Query data from local_db table
created by IMPORT operation.
6 EXPORT remote_db external
table to the local HDP (local
native HDFS) Hadoop cluster
location.
7 List the directory/file created
on the local Hadoop cluster by
EXPORT operation.
8 IMPORT table to create table
on remote_db from the EXPORT
data on the local Hadoop cluster.
9 List directory/file created on
remote HDFS Transparency
cluster by IMPORT operation.
10 Query data from remote_db
table created by the IMPORT
operation.
Table-level and column-level        1    Run table-level
statistics                               statistics command on
                                         external nonpartitioned table.
2 Run DESCRIBE EXTENDED to
check the statistics of the
nonpartitioned table.
3 Run column-level
statistics command on
internal partitioned table.
4 Run DESCRIBE EXTENDED to
check the statistics of the
partitioned table.

DistCp in Kerberized and non-Kerberized cluster cases

Test Case Step Description


Distcp 1 Use distcp to copy sample
file from native HDFS to
remote IBM Storage Scale/HDFS
Transparency cluster.
2 Use distcp to copy sample file
from remote HDFS Transparency
cluster to local native HDFS
cluster.

Running distcp in Kerberized and non-Kerberized cluster test


1. Run distcp to copy a sample file from the local native HDFS to the remote HDFS Transparency in a
non-Kerberized cluster:

[hdfs@c16f1n07 root]$ hadoop distcp -skipcrccheck -update


hdfs://c16f1n07.gpfs.net/tmp/redhat-release hdfs://c16f1n03.gpfs.net:8020/tmp

[hdfs@c16f1n07 root]$ hadoop fs -ls -R hdfs://c16f1n03.gpfs.net:8020/tmp


-rw-r--r-- 1 hdfs root 52 2018-03-19 23:26 hdfs://c16f1n03.gpfs.net:8020/tmp/
redhat-release

2. Run distcp to copy a sample file from the remote HDFS Transparency to the local native HDFS in a
Kerberized cluster:

[hdp-user1@c16f1n07 root]$ klist


Ticket cache: FILE:/tmp/krb5cc_11015
Default principal: [email protected]
Valid starting Expires Service principal
03/19/2018 22:54:03 03/20/2018 22:54:03 krbtgt/[email protected]

[hdp-user1@c16f1n07 root]$ hadoop distcp -pc


hdfs://c16f1n03.gpfs.net:8020/tmp/redhat-release hdfs://c16f1n07.gpfs.net:8020/tmp

[hdp-user1@c16f1n07 root]$ hadoop fs -ls hdfs://c16f1n07.gpfs.net:8020/tmp/redhat-release


-rw-r--r-- 3 hdp-user1 hdfs 52 2018-03-20 01:30 hdfs://c16f1n07.gpfs.net:8020/tmp/
redhat-release

Hadoop Storage Tiering mode with native HDFS federation


The Hadoop ViewFs support is available from HDP 3.0. HDFS Transparency support of Hadoop ViewFs is
available from HDP 3.1.
Note:
• Hadoop Storage Tiering mode with native HDFS federation is not supported in HortonWorks HDP 2.6.x.
• ViewFs does not support Hive.
• For CES HDFS, see Limitations and Recommendations.
For information on how to architect and configure a Hadoop Storage Tiering solution with a suite of test
cases executed based on this configuration, see Managing and Monitoring a Cluster.
The Hadoop Storage Tiering with IBM Storage Scale architecture is shown in the following figures:

Figure 16. Hadoop Storage Tiering with IBM Storage Scale w/o HDP cluster

Figure 17. Hadoop Storage Tiering with IBM Storage Scale with HDP cluster

The architecture for the Hadoop Storage Tiering has one or more native HDFS clusters (local cluster)
as shown on the left side of these figures. The IBM Storage Scale HDFS Transparency cluster (remote
cluster), is shown on the right side of the figures. The jobs running on the native HDFS cluster can access
the data from the native HDFS or from the IBM Storage Scale HDFS Transparency cluster according to the
input or output data path or from the metadata path. For example, Hive job using the Hive metadata path.
Note: The Hadoop cluster deployed on the IBM Storage Scale HDFS Transparency cluster side is not a
requirement for Hadoop Storage Tiering with IBM Storage Scale solution as shown in Figure 17 on page
170. This Hadoop cluster deployed on the IBM Storage Scale HDFS Transparency cluster side shows that
a Hadoop cluster can access data via HDFS or POSIX from the IBM Storage Scale file system.
The setup in the following section was done without the HDP components on the remote cluster, as shown in Figure 16 on page 170.
This section used the following software versions for testing:

Clusters                                Stack                                        Version

HDP cluster                             Ambari                                       2.7.3.0
                                        HDP                                          3.1.0.0
                                        HDP-UTILS                                    1.1.0.22

IBM Storage Scale & HDFS                IBM Storage Scale                            5.0.2.2
Transparency cluster                    HDFS Transparency                            3.1.0-0
                                        IBM Storage Scale Ambari management pack     2.7.0.2

Common configuration

Setup local native HDFS cluster


This section lists the steps to set up the local native HDFS cluster.
To set up the local native HDFS cluster:
• Follow the HDP guide from Hortonworks to set up the native HDFS cluster.
• Refer to the “Enable Kerberos” on page 143 section to set up Kerberos and the “Enable Ranger” on page
  147 section to set up Ranger in a Hadoop Storage Tiering configuration.
Configure ViewFs on the native HDFS cluster by following the “Configuring ViewFs on HDFS cluster with
HA” on page 173 section.

Setup remote HDFS Transparency cluster


This section lists the steps to set up the remote HDFS Transparency cluster.
Follow one of the following sets of steps to set up the remote IBM Storage Scale HDFS Transparency cluster:
• Option 1: Configure IBM Storage Scale and HDFS Transparency cluster
This configuration does not have any Hadoop components. This is a storage-only configuration.
– Follow “Installing” on page 29 and “Configuring” on page 52 to set up the IBM Storage Scale HDFS
Transparency cluster.
– Refer to the “Enable Kerberos” on page 143 section to set up Kerberos and the “Enable Ranger” on
page 147 section to set up Ranger in a Hadoop Storage Tiering configuration.
• Option 2: Configure HDP with IBM Storage Scale and HDFS Transparency integrated cluster
– Follow the “Installation” on page 359 to set up the HDP and IBM Storage Scale HDFS Transparency
cluster.
– Refer to the “Enable Kerberos” on page 143 section to set up Kerberos and the “Enable Ranger” on
page 147 section to set up Ranger in a Hadoop Storage Tiering configuration.

Verify environment
Refer to the Hadoop test case scenarios on how to test and leverage Hadoop Storage Tiering with IBM
Storage Scale.

Configure ViewFs on HDFS clusters without HA


The local native HDFS cluster is set up through Ambari.
Neither the local HDFS nor the remote HDFS has HA set up. Either cluster might or might not have the
IBM Storage Scale service added. This section describes how to configure ViewFs on the local HDFS
cluster through Ambari.
Setup instructions
1. Go to Ambari > HDFS > CONFIGS > ADVANCED > Advanced core-site
and change the fs.defaultFS from hdfs://<namenode_host>:<port> to viewfs://
<federation_cluster_name>.



For example, the property fs.defaultFS field is changed to viewfs://federationcluster.
2. Go to Ambari > HDFS > CONFIGS > Custom hdfs-site and add or change all the following properties:
For example, the c16f1n03.gpfs.net is the local native HDFS cluster and c16f1n10.gpfs.net is
the remote IBM Storage Scale cluster.
Note: Configure the key=value pair properties based on your cluster information.

dfs.namenode.http-address.nn1=c16f1n03.gpfs.net:50070
dfs.namenode.http-address.nn2=c16f1n10.gpfs.net:50070
dfs.namenode.rpc-address.nn1=c16f1n03.gpfs.net:8020
dfs.namenode.rpc-address.nn2=c16f1n10.gpfs.net:8020
dfs.nameservices=nn1, nn2

3. Go to Ambari > HDFS > CONFIGS > ADVANCED > Advanced viewfs-mount-table and specify the
ViewFs mount table entries. You must define mount table entries that map the physical locations
to the corresponding mount points in the federated cluster. You must consider factors such as data
access, mount levels, and application requirements while defining ViewFs mount table entries.
Note: Place the <configuration> </configuration> entries with the corresponding
<property></property> values based on your environment into the Advanced viewfs-mount-table
field using the XML format as shown below:

<configuration>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./app-logs</name>
<value>hdfs://c16f1n03.gpfs.net:8020/app-logs</value>
</property>
<property>
<name>fs.viewfs.mounttable.federationcluster.link./apps</name>
<value>hdfs://c16f1n03.gpfs.net:8020/apps</value>
</property>
<property>
<name>fs.viewfs.mounttable.federationcluster.link./ats</name>
<value>hdfs://c16f1n03.gpfs.net:8020/ats</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./atsv2</name>
<value>hdfs://c16f1n03.gpfs.net:8020/atsv2</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./hdp</name>
<value>hdfs://c16f1n03.gpfs.net:8020/hdp</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./livy2-recovery</name>
<value>hdfs://c16f1n03.gpfs.net:8020/livy2-recovery</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./mapred</name>
<value>hdfs://c16f1n03.gpfs.net:8020/mapred</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./mr-history</name>
<value>hdfs://c16f1n03.gpfs.net:8020/mr-history</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./services</name>
<value>hdfs://c16f1n03.gpfs.net:8020/services</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./spark2-history</name>
<value>hdfs://c16f1n03.gpfs.net:8020/spark2-history</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./tmp</name>
<value>hdfs://c16f1n03.gpfs.net:8020/tmp</value>



</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./user</name>
<value>hdfs://c16f1n03.gpfs.net:8020/user</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./warehouse</name>
<value>hdfs://c16f1n03.gpfs.net:8020/warehouse</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./gpfs</name>
<value>hdfs://c16f1n10.gpfs.net:8020/gpfs</value>
</property>

</configuration>

4. Save the configuration and restart all the required services.
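After the services restart, you can optionally verify the ViewFs configuration from any client node. This is
a minimal check that assumes the federated cluster name viewfs://federationcluster and the /gpfs
mount point used in the example above; the first listing should show the mount points that you defined:

   hdfs dfs -ls viewfs://federationcluster/
   hdfs dfs -ls /gpfs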

Configuring ViewFs on HDFS cluster with HA


The local native HDFS HA cluster is set up through Ambari. Both the local HDFS and the remote HDFS have HA
set up. Either cluster can have the IBM Storage Scale service added. This section describes how to configure
ViewFs on the local HDFS cluster with HA so that it can access data on the remote HDFS HA cluster through Ambari.
Suppose there are two clusters, ClusterX and ClusterY, and each cluster is configured with NameNode HA.
If applications running on ClusterX need to access data from ClusterY, configure ViewFs on ClusterX.
If applications running on ClusterY do not need access to the data on ClusterX, do not configure
ViewFs on ClusterY. In the following example, ViewFs is configured on ClusterX and all configuration is
done through ClusterX's Ambari GUI.
Setup instructions
1. Go to Ambari > HDFS > CONFIGS > ADVANCED > Advanced core-site and change the
fs.defaultFS from hdfs://<fs_name> to viewfs://<federation_name>.
For example, the property fs.defaultFS field is changed to viewfs://federationcluster.
2. Go to Ambari > HDFS > CONFIGS > Custom hdfs-site and add or change all the properties as shown
below.
Note: Configure the key=value pair properties based on your cluster information.

Check/modify existing ClusterX information
-----------------------------------------------
dfs.client.failover.proxy.provider.ClusterX=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.namenodes.ClusterX=nn1,nn2
dfs.internal.nameservices=ClusterX
dfs.namenode.http-address.ClusterX.nn1=c902f14x01.gpfs.net:50070
dfs.namenode.http-address.ClusterX.nn2=c902f14x03.gpfs.net:50070
dfs.namenode.rpc-address.ClusterX.nn1=c902f14x01.gpfs.net:8020
dfs.namenode.rpc-address.ClusterX.nn2=c902f14x03.gpfs.net:8020
dfs.nameservices=ClusterX, ClusterY

ADD new ClusterY key=value pair information
----------------------------------------------
dfs.client.failover.proxy.provider.ClusterY=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.namenodes.ClusterY=nn1,nn2
dfs.namenode.http-address.ClusterY.nn1=c902f14x04.gpfs.net:50070
dfs.namenode.http-address.ClusterY.nn2=c902f14x06.gpfs.net:50070
dfs.namenode.rpc-address.ClusterY.nn1=c902f14x04.gpfs.net:8020
dfs.namenode.rpc-address.ClusterY.nn2=c902f14x06.gpfs.net:8020

3. Go to Ambari > HDFS > CONFIGS > ADVANCED > Advanced viewfs-mount-table and specify the
ViewFs mount table entries. You must define mount table entries that map the physical locations
to the corresponding mount points in the ViewFs cluster. You must consider factors such as data
access, mount levels, and application requirements while defining ViewFs mount table entries.
Note: Place the <configuration> </configuration> entries with the corresponding
<property></property> values based on your environment into the Advanced viewfs-mount-table
field using the XML format.



For example, in ClusterX add the following properties into the viewfs-mount-table tab:

<configuration>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./app-logs</name>
<value>hdfs://ClusterX/app-logs</value>
</property>
<property>
<name>fs.viewfs.mounttable.federationcluster.link./apps</name>
<value>hdfs://ClusterX/apps</value>
</property>
<property>
<name>fs.viewfs.mounttable.federationcluster.link./ats</name>
<value>hdfs://ClusterX/ats</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./atsv2</name>
<value>hdfs://ClusterX/atsv2</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./hdp</name>
<value>hdfs://ClusterX/hdp</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./livy2-recovery</name>
<value>hdfs://ClusterX/livy2-recovery</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./mapred</name>
<value>hdfs://ClusterX/mapred</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./mr-history</name>
<value>hdfs://ClusterX/mr-history</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./services</name>
<value>hdfs://ClusterX/services</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./spark2-history</name>
<value>hdfs://ClusterX/spark2-history</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./tmp</name>
<value>hdfs://ClusterX/tmp</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./user</name>
<value>hdfs://ClusterX/user</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./warehouse</name>
<value>hdfs://ClusterX/warehouse</value>
</property>

<property>
<name>fs.viewfs.mounttable.federationcluster.link./gpfs</name>
<value>hdfs://ClusterY/gpfs</value>
</property>

</configuration>

4. Save the configuration and restart all the required services.
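After the restart, a quick way to confirm that the mount table resolves to both clusters is to list a path
that is mapped to each cluster from a ClusterX node. This is a minimal check based on the example mount
table above, where /tmp maps to ClusterX and /gpfs maps to ClusterY:

   hdfs dfs -ls /tmp
   hdfs dfs -ls /gpfs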



Configure Spark2 for ViewFs support on local Hadoop cluster
The Spark History Server provides application history from event logs stored in the file system.
It periodically checks in the background for applications that have completed and renders a UI that shows
the history of applications by parsing the associated event logs. When fs.defaultFS is changed to the
ViewFs schema, you must also update the Spark2 event log directories to use ViewFs. Otherwise, the Spark2
History Server fails to start with the following exception:

19/01/07 03:16:04 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///spark2-history
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:171)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
        at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:115)
        at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:84)

Setup instructions
1. Go to Ambari > Services > Spark2 > CONFIGS > Advanced > Advanced spark2-defaults and change
the following two properties from the HDFS schema to the ViewFs schema:

   Spark Eventlog directory = viewfs:///spark2-history/
   Spark History FS Log directory = viewfs:///spark2-history/

2. Restart the Spark2 service for the changes to take effect.
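To confirm that the event log location resolves correctly through the mount table, you can list it from
a client node. This is a minimal check that assumes the /spark2-history mount point shown in the
earlier mount table examples:

   hdfs dfs -ls /spark2-history/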

Enable Kerberos
Refer to “Enable Kerberos” on page 143 section to enable Kerberos on the local Native HDFS cluster and
the remote IBM Storage Scale HDFS Transparency cluster.

Enable Ranger
Refer to “Enable Ranger” on page 147 section to enable Ranger on the local Native HDFS cluster and the
remote IBM Storage Scale HDFS Transparency cluster.
A Ranger policy is effective only on its own HDP cluster, so Ranger must be enabled individually
on each cluster to control directory and file authorization and authentication.

Hadoop test case scenarios


This section describes the test cases that were run on the local Hadoop cluster with the Hadoop Storage
Tiering configuration.
Refer to the “Hadoop test case scenarios” on page 150 section for how to test and leverage Hadoop
Storage Tiering with IBM Storage Scale.
Note: If the remote IBM Storage Scale cluster shared path is configured in the ViewFs mount table of the
local native HDFS, you do not need to give the full schema path. Specifying only the directory path value
is sufficient.



For example, a MapReduce WordCount job that reads input from the local native HDFS cluster and writes
output to the remote HDFS Transparency cluster only needs to specify /path instead of the full
hdfs://<namenode_host>:<port>/path schema path.

   sudo -u hdfs yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /tmp/redhat-release /gpfs/mapred/wordcount_hdfs

   [root@c16f1n03 ~]# sudo -u hdfs hdfs dfs -ls -R hdfs://c16f1n10.gpfs.net:8020/gpfs/mapred/wordcount_hdfs
   -rw-r--r--   3 hdfs root    0 2019-01-07 02:23 hdfs://c16f1n10.gpfs.net:8020/gpfs/mapred/wordcount_hdfs/_SUCCESS
   -rw-r--r--   3 hdfs root   68 2019-01-07 02:23 hdfs://c16f1n10.gpfs.net:8020/gpfs/mapred/wordcount_hdfs/part-r-00000

Known limitation
In a Kerberos-enabled environment, creating a table on the remote HDFS Transparency cluster by selecting
data from the local native HDFS cluster fails. The Hive query that invokes the MapReduce job fails as
follows:

hive> CREATE database if not exists gpfsdb LOCATION 'hdfs://c16f1n03.gpfs.net:8020/tmp/hdp-user1/gpfsdb';
OK
Time taken: 0.099 seconds
hive> describe database gpfsdb;
OK
gpfsdb hdfs://c16f1n03.gpfs.net:8020/tmp/hdp-user1/gpfsdb hdp-user1 USER
Time taken: 0.157 seconds, Fetched: 1 row(s)

hive> create table gpfsdb.call_center stored as orc as select * from tpcds_text_5.call_center;


Query ID = hdp-user1_20180319044819_f3d3f976-5d30-4bce-9b7b-bcb6fd5c8e00
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1520923434038_0020,
Tracking URL = http://c16f1n08.gpfs.net:8088/proxy/application_1520923434038_0020/
Kill Command = /usr/hdp/2.6.4.0-65/hadoop/bin/hadoop job -kill job_1520923434038_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-03-19 04:48:34,360 Stage-1 map = 0%, reduce = 0%
2018-03-19 04:49:01,733 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1520923434038_0020 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1520923434038_0020_m_000000 (and more) from job job_1520923434038_0020

Task with the most failures(4):


-----
Task ID:
task_1520923434038_0020_m_000000

URL:
http://c16f1n08.gpfs.net:8088/taskdetails.jsp?
jobid=job_1520923434038_0020&tipid=task_1520923434038_0020_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive
Runtime Error while
processing row
{"cc_call_center_sk":1,"cc_call_center_id":"AAAAAAAABAAAAAAA","cc_rec_start_date":"1998-01-01",
"cc_rec_end_date":"","cc_closed_date_sk":null,"cc_open_date_sk":2450952,"cc_name":"NY Metro",
"cc_class":"large","cc_employees":135,"cc_sq_ft":76815,"cc_hours":"8AM-4PM","cc_manager":"Bob
Belcher",
"cc_mkt_id":6,"cc_mkt_class":"More than other authori","cc_mkt_desc":"Shared others could not
count fully
dollars. New members ca","cc_market_manager":"Julius
Tran","cc_division":3,"cc_division_name":"pri",
"cc_company":6,"cc_company_name":"cally","cc_street_number":"730","cc_street_name":"Ash Hill",
"cc_street_type":"Boulevard","cc_suite_number":"Suite
0","cc_city":"Fairview","cc_county":"Williamson County",
"cc_state":"TN","cc_zip":"35709","cc_country":"United
States","cc_gmt_offset":-5.0,"cc_tax_percentage":0.11}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)



at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row
{"cc_call_center_sk":1,"cc_call_center_id":"AAAAAAAABAAAAAAA","cc_rec_start_date":"1998-01-01",
"cc_rec_end_date":"","cc_closed_date_sk":null,"cc_open_date_sk":2450952,"cc_name":"NY Metro",
"cc_class":"large","cc_employees":135,"cc_sq_ft":76815,"cc_hours":"8AM-4PM","cc_manager":"Bob
Belcher",
"cc_mkt_id":6,"cc_mkt_class":"More than other authori","cc_mkt_desc":"Shared others could not
count
fully dollars. New members ca","cc_market_manager":"Julius
Tran","cc_division":3,"cc_division_name":
"pri","cc_company":6,"cc_company_name":"cally","cc_street_number":"730","cc_street_name":"Ash
Hill",
"cc_street_type":"Boulevard","cc_suite_number":"Suite 0","cc_city":"Fairview","cc_county":
"Williamson County","cc_state":"TN","cc_zip":"35709","cc_country":"United
States","cc_gmt_offset":
-5.0,"cc_tax_percentage":0.11}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN,
KERBEROS];
Host Details : local host is: "c16f1n07.gpfs.net/192.0.2.0"; destination host is:
"c16f1n03.gpfs.net":8020;
at
org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:582)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:680)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:170)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:555)
... 9 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:
[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:762)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:758)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:758)
... 40 more

Container killed by the ApplicationMaster.


Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask


MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

However, creating a table on the local native HDFS by selecting data from the remote HDFS Transparency
cluster works when Kerberos is enabled.
In this example, the database local_db_hdfs_ranger is stored on the local HDFS cluster and the
database remote_db_gpfs_ranger is stored on the remote HDFS Transparency cluster.

hive> create table local_db_hdfs_ranger.localtbl1 as select * from remote_db_gpfs_ranger.passwd_int_part;
Query ID = hdp-user1_20180319000550_ba08afa2-bbc3-4636-a6dd-c2c9564bfaf3
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1520923434038_0018)

--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------



VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 4.07 s
--------------------------------------------------------------------------------
Moving data to directory hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db_hdfs_ranger/localtbl1
Table local_db_hdfs_ranger.localtbl1 stats: [numFiles=1, numRows=84, totalSize=3004,
rawDataSize=2920]
OK
Time taken: 6.584 seconds

hive> create table local_db_hdfs_ranger.localtbl2 as select * from local_db_hdfs_ranger.passwd_int_part;
Query ID = hdp-user1_20180319000658_90631a73-e34e-4919-a30a-05a66769ab41
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1520923434038_0018)

--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 6.16 s
--------------------------------------------------------------------------------
Moving data to directory hdfs://c16f1n07.gpfs.net:8020/user/hive/local_db_hdfs_ranger/localtbl2
Table local_db_hdfs_ranger.localtbl2 stats: [numFiles=1, numRows=84, totalSize=3004,
rawDataSize=2920]
OK
Time taken: 7.904 seconds

Apache Hadoop ViewFs support

The NameNode federation feature in Hadoop 3.x enables adding multiple native HDFS namespaces under a
single Hadoop cluster. This feature was added to improve HDFS NameNode horizontal scaling.
Deployments where the IBM Storage Scale file system is used with HDFS Transparency can also use ViewFs
to have more than one IBM Storage Scale file system along with a native HDFS file system under a single
Hadoop cluster. The Hadoop applications can read input data from the native HDFS, analyze the input, and
write the output to the IBM Storage Scale file system. This feature is available from HDFS Transparency
version 2.7.0-2 (gpfs.hdfs-protocol-2.7.0-2) and later.
Note: If you want your applications running in clusterA to process the data in clusterB, only update the
configuration for federation in clusterA. This is called federating clusterB with clusterA. If you want your
applications running in clusterB to process data from clusterA, you must update the configuration for
federation in clusterB. This is called federating clusterA with clusterB.

Single viewfs namespace between IBM Storage Scale and Native HDFS
Before configuring viewfs support, ensure that you have configured the HDFS Transparency cluster (see
“Hadoop cluster planning” on page 11, “Installing” on page 29, and “Configuring” on page 52).



See the following sections to configure viewfs for IBM Storage Scale and Native HDFS.

Single viewfs namespace between IBM Storage Scale and native HDFS –Part I
This topic describes the steps to get a single namespace by federating the native HDFS namespace
into the HDFS Transparency namespace. If you use HortonWorks HDP, you can change the configurations from the
Ambari GUI > HDFS > Configs.
In this mode, nn1_host is the HDFS Transparency NameNode and there is another, separate native HDFS
cluster. After you federate native HDFS into HDFS Transparency, your applications can access the data from both
HDFS Transparency and native HDFS with the fs.defaultFS schema defined in the HDFS Transparency
cluster. All configuration changes are done on the HDFS Transparency side and on your Hadoop client nodes.
This mode does not require configuration changes on the native HDFS cluster.
1. Shut down the HDFS Transparency cluster daemon by running the following command from one of the
HDFS transparency nodes in the cluster:

# mmhadoopctl connector stop

2. On the nn1_host, add the following configuration settings in /usr/lpp/mmfs/hadoop/etc/hadoop/core-site.xml
(for HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/core-site.xml (for HDFS Transparency 3.0.x):

<configuration>
<property>
<name>fs.defaultFS</name>
<value>viewfs://<viewfs_clustername></value>
<description>The name of the namespace</description>
</property>

<property>
<name>fs.viewfs.mounttable.<viewfs_clustername>.link./<viewfs_dir1></name>
<value>hdfs://nn1_host:8020/<mount_dir></value>
<description>The name of the Spectrum Scale file system</description>
</property>

<property>
<name>fs.viewfs.mounttable.<viewfs_clustername>.link./<viewfs_dir2></name>
<value>hdfs://nn2_host:8020/<mount_dir></value>
<description>The name of the hdfs file system</description>
</property>
</configuration>



Note: Change <viewfs_clustername> and <mount_dir> according to your cluster configuration.
In this example, the nn1_host refers to the HDFS transparency NameNode and the nn2_host refers to
the native HDFS NameNode.
Once the federation configuration changes are in effect on the node, the node will only see the
directories that are specified in the core-site.xml file. For the above configurations, you can only
see the two directories /<viewfs_dir1> and /<viewfs_dir2>.
3. On nn1_host, add the following configuration settings in /var/mmfs/hadoop/etc/hadoop/hdfs-
site.xml (for HDFS Transparency 3.0.x).

<configuration>
<property>
<name>dfs.nameservices</name>
<value>nn1,nn2</value>
</property>

<property>
<name>dfs.namenode.rpc-address.nn1</name>
<value>nn1-host:8020</value>
</property>

<property>
<name>dfs.namenode.rpc-address.nn2</name>
<value>nn2-host:8020</value>
</property>

<property>
<name>dfs.namenode.http-address.nn1</name>
<value>nn1-host:50070</value>
</property>

<property>
<name>dfs.namenode.http-address.nn2</name>
<value>nn2-host:50070</value>
</property>
</configuration>

4. On nn1_host, synchronize the configuration changes with the other HDFS transparency nodes by
running the following command:
For HDFS Transparency 2.7.3-x:

# mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop/

For HDFS Transparency 3.0.x:

# mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop/

Note: The following output messages from the above command for the native HDFS NameNode,
nn2-host, can be seen:

scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory


scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory
scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory
scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory
scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory
scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory
scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory
scp: /usr/lpp/mmfs/hadoop/etc/hadoop//: No such file or directory

The output messages above are seen because during the synchronization of the configuration to all
the nodes in the cluster, the /usr/lpp/mmfs/Hadoop/etc/hadoop directory does not exist in the
nn2-host native HDFS NameNode. This is because the HDFS Transparency is not installed on the
native HDFS NameNode. Therefore, these messages for the native HDFS NameNode can be ignored.
Another way to synchronize the configuration files is by using the scp command to copy
the following files under /usr/lpp/mmfs/hadoop/etc/hadoop/ into all the other nodes in
HDFS Transparency cluster: workers, log4j.properties, hdfs-site.xml, hadoop-policy.xml,
hadoop-metrics.properties, hadoop-metrics2.properties, core-site.xml, and gpfs-
site.xml.



For HDFS Transparency 3.0.x, the corresponding configuration files are under /var/mmfs/hadoop/etc/hadoop.

5. On nn1_host, start all the HDFS transparency cluster nodes by running the following command:
# mmhadoopctl connector start
Note: The following warning output messages from the above command for the native HDFS
NameNode, nn2-host can be seen:

nn2-host: bash: line 0: cd: /usr/lpp/mmfs/hadoop: No such file or directory


nn2-host: bash: /usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh: No such file or directory

These messages are displayed because HDFS Transparency is not installed on the native HDFS
NameNode. Therefore, these messages can be ignored.
To avoid the above messages, run the following commands:
a. On nn1-host, run the following command as root to start the HDFS Transparency NameNode:

# cd /usr/lpp/mmfs/hadoop; /usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh
--config /usr/lpp/mmfs/hadoop/etc/hadoop
--script /usr/lpp/mmfs/hadoop/sbin/gpfs start namenode

b. On nn1-host, run the following command as root to start the HDFS Transparency DataNode:

# cd /usr/lpp/mmfs/hadoop; /usr/lpp/mmfs/hadoop/sbin/hadoop-daemons.sh
--config /usr/lpp/mmfs/hadoop/etc/hadoop
--script /usr/lpp/mmfs/hadoop/sbin/gpfs start datanode

Note: If you deployed IBM BigInsights IOP, the IBM Storage Scale Ambari integration module
(gpfs.hdfs-transparency.ambari-iop_4.1-0) does not support viewfs configuration in Ambari.
Therefore, starting the HDFS Transparency service or other services will regenerate the core-
site.xml and hdfs-site.xml from the Ambari database and will overwrite the changes that were
done from Step 1 to Step 4. HDFS Transparency and all other services will have to be started in the
command mode.
6. Update the configuration changes in Step 2 and Step 3 in your Hadoop client configurations so that the
Hadoop applications can view all the directories in viewfs.
Note: If you deployed IBM BigInsights IOP, update the core-site.xml and the hdfs-site.xml in
Step 2 and Step 3 accordingly from the /etc/hadoop/conf directory on each of the node so that the
Hadoop applications can see the directories in viewfs.
If you deployed Open Source Apache Hadoop, then update the core-site.xml and the hdfs-
site.xml according to the Apache Hadoop location configured in your site.
7. From one of the Hadoop clients, verify that the viewfs directories are available by running the following
command:

hadoop fs -ls /
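To additionally confirm that each mount point resolves to the intended back-end namespace, you can list
the individual ViewFs directories. The placeholders below are the same ones used in the core-site.xml
example above; replace them with your own directory names:

   hadoop fs -ls /<viewfs_dir1>
   hadoop fs -ls /<viewfs_dir2>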

Single viewfs namespace between IBM Storage Scale and native HDFS –Part II
This topic describes the steps to get a single namespace by joining HDFS transparency namespace with
native HDFS namespace.
1. Stop the hadoop applications and the native HDFS services on the native HDFS cluster.
The detailed command is dependent on the Hadoop distro. For example, for IBM BigInsights IOP, stop
all services from the Ambari GUI.
2. Perform Step 2 and Step 3 in the section “Single viewfs namespace between IBM Storage Scale and
native HDFS –Part I” on page 179 on the node nn2-host with the correct path for core-site.xml and
the hdfs-site.xml according to the Hadoop distribution.
If running with the open source Apache Hadoop, the location of the core-site.xml and the
hdfs-site.xml is in $YOUR_HADOOP_PREFIX/etc/hadoop/. The $YOUR_HADOOP_PREFIX is the



location of the Hadoop package. If running with IBM BigInsights IOP, then Ambari currently does
not support viewfs configuration. You will have to manually update the configurations under /etc/
hadoop/conf/.
Note: If you want all the directories from the native HDFS to show up in viewfs, define all the
native HDFS directories in the core-site.xml file.
If you have a secondary NameNode configured in native HDFS, update the following configuration in
the hdfs-site.xml:

<property>
<name>dfs.namenode.secondary.http-address.nn2-host</name>
<value>secondaryNameNode-host:50090</value>
</property>

<property>
<name>dfs.secondary.namenode.keytab.file.nn2-host</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

Note: If you have deployed IBM BigInsights IOP, it will generate the key
dfs.namenode.secondary.http-address and dfs.secondary.namenode.keytab.file by
default. For viewfs, modify the hdfs-site.xml file with the correct values according to your
environment.
3. Synchronize the updated configurations from the nn2-host node to all the other native HDFS nodes and
start the native HDFS services.
If running with open source Apache Hadoop, you need to use the scp command to synchronize the
core-site.xml and the hdfs-site.xml from the host nn2-host to all the other native HDFS nodes.
Start the native HDFS service by running the following command:

$YOUR_HOME_PREFIX/sbin/start-dfs.sh

If IBM BigInsights IOP is running, synchronize the updated configurations manually to avoid the
updated viewfs configurations overwritten by Ambari.
Note: Check the configurations under /etc/hadoop/conf to ensure that all the changes have been
synchronized to all the nodes.
4. Start the native HDFS service.
If you are running open source Hadoop, start the native HDFS service on the command line:

$YOUR_HADOOP_PREFIX/bin/start-dfs.sh

If you deployed IBM BigInsights IOP, Ambari does not support viewfs configuration. Therefore, you
must start the native HDFS services manually.
a. Start native HDFS NameNode.
Log in to nn2-host as root, run su - hdfs to switch to the hdfs UID and then run the following
command:

/usr/iop/current/hadoop-client/sbin/hadoop-daemon.sh --config
/usr/iop/current/hadoop-client/conf start namenode

b. Start the native HDFS DataNode.


Log in to the DataNode, run su - hdfs to switch to the hdfs UID and then run the following
command:

/usr/iop/current/hadoop-client/sbin/hadoop-daemon.sh --config
/usr/iop/current/hadoop-client/conf start datanode

Note: Run the above command on each DataNode.



Log in to the Secondary NameNode, run su - hdfs to switch to the hdfs UID and run the following
command to start Secondary NameNode:

/usr/iop/current/hadoop-client/sbin/hadoop-daemon.sh --config
/usr/iop/current/hadoop-client/conf start secondarynamenode

5. Update the core-site.xml and hdfs-site.xml used by the Hadoop clients on which the Hadoop
applications will run over viewfs.
If the open source Apache Hadoop is running, the location of core-site.xml and hdfs-site.xml
is in $YOUR_HADOOP_PREFIX/etc/hadoop/. The $YOUR_HADOOP_PREFIX is the location of the
Hadoop package. If another Hadoop distro is running, see “Known limitations” on page 186.
If IBM BigInsights IOP is running, core-site.xml and hdfs-site.xml are located at /etc/
hadoop/conf/.
6. To ensure that the viewfs file system is functioning correctly, run the following command:

hadoop fs -ls /

Single viewfs namespace between two IBM Storage Scale file systems
You can get a single viewfs namespace by joining two IBM Storage Scale file systems from different clusters
or from the same cluster.
Irrespective of the mode that you select, configure one HDFS transparency cluster for each IBM Storage
Scale file system (see “Hadoop cluster planning” on page 11, “Installing” on page 29, and “Configuring”
on page 52), and then join the two HDFS transparency clusters together.



To join two file systems from the same cluster, select nodes that can provide HDFS transparency services
for the first file system and the second file system separately.

Configuration
This topic describes the steps to configure viewfs between two IBM Storage Scale file systems.
Before configuring the viewfs, see “Hadoop cluster planning” on page 11, “Installing” on page 29, and
“Configuring” on page 52 to configure HDFS transparency cluster 1 and HDFS transparency cluster 2 for
each file system.
1. To stop the HDFS transparency services, run the mmhadoopctl connector stop on both HDFS
transparency clusters.
2. On the nn1 host, add the following configuration settings in /usr/lpp/mmfs/hadoop/etc/
hadoop/core-site.xml (for HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/
core-site.xml (for HDFS Transparency 3.0.x):

<configuration>
<property>
<name>fs.defaultFS</name>
<value>viewfs://<viewfs_clustername></value>
<description>The name of the viewfs file system</description>
</property>

<property>
<name>fs.viewfs.mounttable.<viewfs_clustername>.link./<mount_dir1></name>
<value>hdfs://nn1_host:8020/<mount_dir1></value>
<description>The mount point for the first IBM Storage Scale file system</description>
</property>

<property>
<name>fs.viewfs.mounttable.<viewfs_clustername>.link./<mount_dir2></name>
<value>hdfs://nn2_host:8020/<mount_dir2></value>
<description>The mount point for the second IBM Storage Scale file system</description>
</property>
</configuration>

Note: Change <viewfs_clustername>, <mount_dir1>, and <mount_dir2> according to your cluster. Change
nn1_host and nn2_host accordingly.



3. On nn1_host, add the following configuration settings in hdfs-site.xml.

<configuration>
<property>
<name>dfs.nameservices</name>
<value>nn1,nn2</value>
</property>

<property>
<name>dfs.namenode.rpc-address.nn1</name>
<value>nn1:8020</value>
</property>

<property>
<name>dfs.namenode.rpc-address.nn2</name>
<value>nn2:8020</value>
</property>

<property>
<name>dfs.namenode.http-address.nn1</name>
<value>nn1:50070</value>
</property>

<property>
<name>dfs.namenode.http-address.nn2</name>
<value>nn2:50070</value>
</property>
</configuration>

4. On nn1_host, run mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop (for
HDFS Transparency 2.7.3-x) or mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop
(for HDFS Transparency 3.0.x) to synchronize the configurations from nn1_host to all other HDFS
Transparency nodes.
Note: Every node in HDFS Transparency Cluster1 must receive the updated configuration. If you copy the
files manually instead, change the target host accordingly and repeat the copy for each node in Cluster1.
5. On nn2_host, perform Step 1 through Step 4.
Note: If you only want to federate HDFS Transparency Cluster2 into HDFS Transparency Cluster1,
Step 5 is not needed.
6. On nn1_host, start the HDFS transparency cluster:
a. On nn1, run the following command as root to start HDFS Transparency Cluster1 NameNode:

#cd /usr/lpp/mmfs/hadoop; /usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh


--config /usr/lpp/mmfs/hadoop/etc/hadoop --script /usr/lpp/mmfs/hadoop/sbin/gpfs start
namenode

b. On nn1, run the following command as root to start HDFS Transparency Cluster1 DataNode:

cd /usr/lpp/mmfs/hadoop; /usr/lpp/mmfs/hadoop/sbin/hadoop-daemons.sh
--config /usr/lpp/mmfs/hadoop/etc/hadoop --script /usr/lpp/mmfs/hadoop/sbin/gpfs start
datanode

Note: If you deployed IBM BigInsights IOP, IBM Storage Scale Ambari integration package
gpfs.hdfs-transparency.ambari-iop_4.1-0 does not support federation configuration on
Ambari. Therefore, starting the HDFS Transparency service will re-generate the core-site.xml
and hdfs-site.xml from the Ambari database and overwrite the changes you made from Step1 to
Step4. Repeat Step 6.1 and Step 6.2 to start HDFS Transparency in the command mode.
7. On nn2, start the other HDFS transparency cluster:
a. On nn2, run the following command as root to start the HDFS Transparency Cluster2 NameNode:

#cd /usr/lpp/mmfs/hadoop; /usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh


--config /usr/lpp/mmfs/hadoop/etc/hadoop --script /usr/lpp/mmfs/hadoop/sbin/gpfs start
namenode

b. On nn2, run the following command as root to start the HDFS Transparency Cluster2 DataNode:



cd /usr/lpp/mmfs/hadoop; /usr/lpp/mmfs/hadoop/sbin/hadoop-daemons.sh
--config /usr/lpp/mmfs/hadoop/etc/hadoop --script /usr/lpp/mmfs/hadoop/sbin/gpfs start
datanode

Note: If you deployed IBM BigInsights IOP, IBM Storage Scale Ambari integration package
gpfs.hdfs-transparency.ambari-iop_4.1-0 does not support viewfs configuration on
Ambari. Therefore, starting the HDFS Transparency service will re-generate the core-site.xml
and hdfs-site.xml from the Ambari database and overwrite the changes you made from Step1 to
Step4. Repeat the Step 7.1 and Step 7.2 to start HDFS Transparency in the command mode.
8. Update core-site.xml and hdfs-site.xml for the Hadoop clients on which the Hadoop
applications run over viewfs.
If you use open source Apache Hadoop, the location of core-site.xml and hdfs-site.xml
is $YOUR_HADOOP_PREFIX/etc/hadoop/. The $YOUR_HADOOP_PREFIX is the location of the
Hadoop package. If you use another Hadoop distro, see “Known limitations” on page 186.
9. Restart the Hadoop applications on both clusters.
Note: Always keep the native HDFS service stopped if you use HDFS Transparency.
10. To ensure that the viewfs is functioning correctly, run the hadoop fs -ls / command.
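As an additional sanity check, you can copy a small file between the two ViewFs mount points; the copy
is resolved to the two underlying HDFS Transparency clusters. This is a minimal sketch that assumes a
small local file such as /etc/redhat-release (used as a sample file elsewhere in this guide) and the
<mount_dir1> and <mount_dir2> placeholders from the example core-site.xml:

   hadoop fs -put /etc/redhat-release /<mount_dir1>/
   hadoop fs -cp /<mount_dir1>/redhat-release /<mount_dir2>/
   hadoop fs -ls /<mount_dir2>/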

Known limitations
This topic lists the known limitations for viewfs support.
• All the changes in /usr/lpp/mmfs/hadoop/etc/hadoop/core-site.xml and /usr/lpp/mmfs/
hadoop/etc/hadoop/hdfs-site.xml must also be made in the configuration files that are used by the
Hadoop distribution. However, Hadoop distributions often manage their own configuration, and the
management interface might not support the keys used for viewfs. For example, IBM BigInsights IOP
uses Ambari, and the Ambari GUI does not support some property names.
Similarly, HortonWorks HDP 2.6.x does not support the keys used for viewfs.
• Native HDFS and HDFS Transparency cannot run on the same node because of network port number conflicts.
• If you choose to join native HDFS and HDFS Transparency, first configure the native HDFS cluster and
verify that the native HDFS service is functional, and then configure viewfs for native HDFS and HDFS Transparency.
For a new native HDFS cluster, the DataNode registers itself with the NameNode when the service starts for
the first time. The HDFS Transparency NameNode does not accept any registration from a native
HDFS DataNode. Therefore, an exception occurs if you configure a new native HDFS cluster, federate it
with HDFS Transparency, and then try to make both clusters (one native HDFS cluster and one HDFS
Transparency cluster) function at the same time.
• Start and stop the native HDFS cluster or the HDFS Transparency cluster separately if you want to
maintain them.

Problem determination
1. ERROR: Requested user hdfs is banned while running MapReduce jobs as user hdfs in native
HDFS cluster.
Solution:
For solution, see https://my.cloudera.com/knowledge/LinuxTaskController-job-fails-with-error-
Requested-user-hdfs?id=275909.
2. IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] when
running any hadoop fs command as a specified user.
Solution:
You must change to the appropriate principal and keytab for the specified user.



kinit -k -t /usr/lpp/mmfs/hadoop/tc/hadoop/keytab/hdptestuser.headless.keytab hdp-
[email protected]

3. hive> CREATE database remote_db2 COMMENT 'Holds all the tables data in remote HDFS
Transparency cluster' LOCATION hdfs://c16f1n13.gpfs.net:8020/user/hive/remote_db2;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException
(message:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.Authorizatio
nException): Unauthorized connection for super-user: hive/[email protected] from IP
192.0.2.1)
Solution:
Change the below custom core-site properties on all the nodes of the remote HDFS Transparency
cluster:
hadoop.proxyuser.hive.hosts=*
hadoop.proxyuser.hive.groups=*
4. Hive Import and Export cases are not supported in ViewFS schema. The following exception will be
thrown:

0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> EXPORT TABLE local_db_hdfs.passwd_int_part TO 'viewfs://federationcluster/gpfs/hive/remote_db_gpfs/passwd_int_part_export';
Error: Error while compiling statement: FAILED: SemanticException Invalid path only the
following file systems accepted for export/import : hdfs,pfile,file,s3,s3a,gs
(state=42000,code=40000)

Solution:
Change the schema from ViewFS://xx to hdfs://xx.

0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> EXPORT TABLE local_db_hdfs.passwd_int_part TO 'hdfs://c16f1n10:8020/gpfs/hive/remote_db_gpfs/passwd_int_part_export';
INFO : Compiling command(queryId=hive_20190110021038_7f5d37d6-f6e6-488a-b7ee-99261fc946e3):
EXPORT
TABLE local_db_hdfs.passwd_int_part TO 'hdfs://c16f1n10:8020/gpfs/hive/remote_db_gpfs/
passwd_int_part_export'
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20190110021038_7f5d37d6-f6e6-488a-
b7ee-99261fc946e3);
Time taken: 0.125 seconds
INFO : Executing command(queryId=hive_20190110021038_7f5d37d6-f6e6-488a-b7ee-99261fc946e3):
EXPORT
TABLE local_db_hdfs.passwd_int_part TO 'hdfs://c16f1n10:8020/gpfs/hive/remote_db_gpfs/
passwd_int_part_export'
INFO : Starting task [Stage-0:REPL_DUMP] in serial mode
INFO : Completed executing command(queryId=hive_20190110021038_7f5d37d6-f6e6-488a-
b7ee-99261fc946e3);
Time taken: 0.19 seconds
INFO : OK
No rows affected (0.343 seconds)

5. Hive LOAD DATA INPATH cases failed in ViewFS schema with the following exception:
0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> LOAD DATA INPATH '/tmp/2012.txt'
INTO TABLE db_bdpbase.Employee PARTITION(year=2012);
INFO : Compiling command(queryId=hive_20190110024717_b6a0b5a0-d8a1-42e3-a6bb-40ca297d97dd):
LOAD DATA INPATH '/tmp/2012.txt' INTO TABLE db_bdpbase.Employee PARTITION(year=2012)
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20190110024717_b6a0b5a0-d8a1-42e3-a6bb-40ca297d97dd);
Time taken: 0.168 seconds
INFO : Executing command(queryId=hive_20190110024717_b6a0b5a0-d8a1-42e3-a6bb-40ca297d97dd):
LOAD DATA INPATH '/tmp/2012.txt' INTO TABLE db_bdpbase.Employee PARTITION(year=2012)
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table db_bdpbase.employee partition (year=2012) from
viewfs://federationcluster/tmp/2012.txt
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask.
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source
viewfs://federationcluster/tmp/2012.txt to destination
viewfs://federationcluster/warehouse/tablespace/managed/hive/db_bdpbase.db/employee/year=2012/delta_0000011_0000011_0000
INFO : Completed executing command(queryId=hive_20190110024717_b6a0b5a0-d8a1-42e3-a6bb-40ca297d97dd); Time taken: 0.117
seconds
Error: Error while processing statement: FAILED: Execution Error,
return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to move source viewfs://federationcluster/tmp/2012.txt to destination



viewfs://federationcluster/warehouse/tablespace/managed/hive/db_bdpbase.db/employee/year=2012/delta_0000011_0000011_0000
(state=08S01,code=1)

Solution:
Change the load path so that it uses the same mount point directory. Here, the table Employee is located
at /warehouse/tablespace/managed/hive/db_bdpbase/Employee, so the data inpath must be
located under the same ViewFs mount point, /warehouse.

0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> LOAD DATA INPATH '/warehouse/2012.txt' INTO TABLE db_bdpbase.Employee PARTITION(year=2012);
INFO : Compiling command(queryId=hive_20190110024734_b5c59f01-9f08-4e47-8884-bb7dd8e131ca):
LOAD DATA INPATH '/warehouse/2012.txt' INTO TABLE db_bdpbase.Employee PARTITION(year=2012)
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20190110024734_b5c59f01-9f08-4e47-8884-
bb7dd8e131ca);
Time taken: 0.118 seconds
INFO : Executing command(queryId=hive_20190110024734_b5c59f01-9f08-4e47-8884-bb7dd8e131ca):
LOAD DATA INPATH '/warehouse/2012.txt' INTO TABLE db_bdpbase.Employee PARTITION(year=2012)
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table db_bdpbase.employee partition (year=2012) from
viewfs://federationcluster/warehouse/2012.txt
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Completed executing command(queryId=hive_20190110024734_b5c59f01-9f08-4e47-8884-
bb7dd8e131ca);
Time taken: 0.358 seconds
INFO : OK
No rows affected (0.501 seconds)

6. Permission denied: Principal [name=hive, type=USER] does not have following privileges for
operation DFS [ADMIN] is shown in the Hive Beeline console when running HiveQL.

0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> load data local inpath '/tmp/hive/kv2.txt' into table local_db_hdfs.passwd_ext_nonpart;
Error: Error while compiling statement: FAILED: HiveAccessControlException
Permission denied: Principal [name=hive, type=USER] does not have following
privileges for operation LOAD [ADMIN] (state=42000,code=40000)
0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> dfs -ls /gpfs;
Error: Error while processing statement: Permission denied: Principal [name=hive, type=USER]
does not have following privileges for operation DFS [ADMIN] (state=,code=1)
0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n>

Solution:
Go to Ambari > Hive > CONFIGS > ADVANCED > Custom hive-site and add
hive.users.in.admin.role to the list of comma-separated users who require admin role
authorization (such as the user hive). Restart the Hive services for the changes to take effect.
The permission denied error is fixed after adding hive.users.in.admin.role=hive.

0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> dfs -ls /gpfs;


+----------------------------------------------------+
| DFS Output |
+----------------------------------------------------+
| drwxr-xr-x - nobody root 0 2019-01-08 02:32 /gpfs/passwd_sparkshell |
| -rw-r--r-- 2 hdfs root 52 2019-01-08 02:37 /gpfs/redhat-release |
+----------------------------------------------------+
25 rows selected (0.123 seconds)
0: jdbc:hive2://c16f1n03.gpfs.net:2181,c16f1n> load data local inpath
'/tmp/hive/kv2.txt' into table local_db_hdfs.passwd_ext_nonpart;
INFO : Compiling command(queryId=hive_20190111020239_bb71b8c0-1b00-4f96-bec2-e0e899de62df):
load data local inpath '/tmp/hive/kv2.txt' into table local_db_hdfs.passwd_ext_nonpart
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20190111020239_bb71b8c0-1b00-4f96-bec2-
e0e899de62df);
Time taken: 0.285 seconds
INFO : Executing command(queryId=hive_20190111020239_bb71b8c0-1b00-4f96-bec2-e0e899de62df):
load data local inpath '/tmp/hive/kv2.txt' into table local_db_hdfs.passwd_ext_nonpart
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table local_db_hdfs.passwd_ext_nonpart from file:/tmp/hive/kv2.txt
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Completed executing command(queryId=hive_20190111020239_bb71b8c0-1b00-4f96-bec2-
e0e899de62df);
Time taken: 0.545 seconds



INFO : OK
No rows affected (0.947 seconds)

7. The OOZIE Service check failed with error: Error: E0904: Scheme [viewfs] not supported in uri
[viewfs://hdpcluster/user/ambari-qa/examples/apps/no-op]
Solution:
Go to Ambari > Oozie > CONFIGS > ADVANCED > Custom oozie-site and add the following property:

<property>
<name>oozie.service.HadoopAccessorService.supported.filesystems</name>
<value>hdfs,viewfs</value>
</property>

Hadoop distcp support


The hadoop distcp command is used for data migration from HDFS to the IBM Storage Scale file
system and between two IBM Storage Scale file systems.

No additional configuration changes are required. The hadoop distcp command is supported in HDFS
Transparency 2.7.0-2 (gpfs.hdfs-protocol-2.7.0-2) and later.

hadoop distcp hdfs://nn1_host:8020/source/dir hdfs://nn2_host:8020/target/dir

Known Issues and Workaround



Issue 1: Permission is denied when the hadoop distcp command is run with the
root credentials.
The super user root in Linux is not the super user for Hadoop. If you do not add the super user account to
gpfs.supergroup, the system displays the following error message:

org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user/root/.staging":hdfs:hdfs:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319).
Workaround
Configure root as a super user by adding the root account to gpfs.supergroup in gpfs-site.xml, or run
the hadoop distcp command with super-user credentials. This is applicable only for HDFS Transparency
2.7.3-x; from HDFS Transparency 3.0.x, the gpfs.supergroup configuration has been removed from HDFS
Transparency.
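For example, a simple way to avoid the error on any release is to run the copy as the hdfs super user;
this is a sketch that reuses the placeholder paths from the earlier distcp example:

   sudo -u hdfs hadoop distcp hdfs://nn1_host:8020/source/dir hdfs://nn2_host:8020/target/dir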

Issue 2: Access time exception while copying files from IBM Storage Scale to HDFS
with the -p option
[hdfs@c8f2n03 conf]$ hadoop distcp -overwrite -p hdfs://c16f1n03.gpfs.net:8020/testc16f1n03/ hdfs://c8f2n03.gpfs.net:8020/testc8f2n03

Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Access time for HDFS is not configured. Set the dfs.namenode.accesstime.precision configuration parameter
        at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:101)
Workaround
Change the dfs.namenode.accesstime.precision value from 0 to a value such as 3600000 (1 hour)
in hdfs-site.xml for the HDFS cluster.
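After changing the value and restarting the NameNode, you can verify the effective setting from a node
that has the updated hdfs-site.xml (hdfs getconf reads the client-side configuration on that node):

   hdfs getconf -confKey dfs.namenode.accesstime.precision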

Issue 3: The distcp command fails when the source directory is root.
[hdfs@c16f1n03 root]$ hadoop distcp hdfs://c16f1n03.gpfs.net:8020/ hdfs://c8f2n03.gpfs.net:8020/test5

16/03/03 22:27:34 ERROR tools.DistCp: Exception encountered
java.lang.NullPointerException
        at org.apache.hadoop.tools.util.DistCpUtils.getRelativePath(DistCpUtils.java:144)
        at org.apache.hadoop.tools.SimpleCopyListing.writeToFileListing(SimpleCopyListing.java:353)
Workaround
Specify at least one directory or file at the source directory.
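For example, instead of copying from the root of the namespace, name at least one directory under it.
This sketch reuses the host names from the failing command above and assumes the /tmp directory as the
source:

   hadoop distcp hdfs://c16f1n03.gpfs.net:8020/tmp hdfs://c8f2n03.gpfs.net:8020/test5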

Issue 4: The distcp command throws NullPointerException when the target directory is root in the federation configuration but the job is completed.
This is not a real issue. For more information, see https://issues.apache.org/jira/browse/HADOOP-11724.
Note: This will not impact your data copy.



Multiple IBM Storage Scale File System support
This feature is available from HDFS Transparency version 2.7.3-1.
The NameNode federation feature in Hadoop 3.x enables adding multiple native HDFS namespaces under
a single Hadoop cluster. This feature was added to improve HDFS NameNode horizontal scaling. If
you want to use a native HDFS namespace and IBM Storage Scale namespaces under the same Hadoop
cluster, you can either use Hadoop Storage Tiering with IBM Storage Scale or ViewFS support. With ViewFS
support, you can have multiple IBM Storage Scale file systems under a single Hadoop cluster. However,
federation is not officially supported by HortonWorks HDP, nor is it certified with Hive in the Hadoop
community. The multiple IBM Storage Scale file system support is designed to help resolve this federation
issue.
If you want to enable this feature, configure the following properties in /usr/lpp/mmfs/
hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/
hadoop/gpfs-site.xml (for HDFS Transparency 3.0.x).
Configure HDFS Transparency on the primary file system, which is the first file system defined in the
gpfs.mnt.dir field.
Once the primary file system configuration is working properly, enable the multiple IBM Storage Scale file
system support.
Note:
• Currently, multiple IBM Storage Scale file system support is limited to two file systems.
• Do not put the NameNode and a DataNode onto the same node when you are using a multiple file system
configuration. When either the NameNode or the DataNode is stopped, the second file system gets
unmounted, which leaves the other (still running) daemon non-functional.
Example of the gpfs-site.xml configuration for multiple file systems:

<property>
<name>gpfs.mnt.dir</name>
<value>/path/fpo,/path/ess1</value>
</property>
<property>
<name>gpfs.storage.type</name>
<value>local,shared</value>
</property>
<property>
<name>gpfs.replica.enforced</name>
<value>dfs,gpfs</value>
</property>

The gpfs.mnt.dir is a comma delimited string used to define the mount directory for each file system.
In the above configuration, there are two file systems with mount points /path/fpo and /path/ess1. The
first file system, /path/fpo, is considered the primary file system.
The gpfs.storage.type is a comma delimited string used to define the storage type for each
file system defined by gpfs.mnt.dir. Currently, only two file systems are supported, with mount access
configured as local,shared or as shared,shared. The value local means the file system with the same
index in gpfs.mnt.dir is an IBM Storage Scale FPO file system with locality. The value shared means the
file system with the same index in gpfs.mnt.dir is a SAN-based file system or IBM ESS. You need
to configure the gpfs.storage.type values correctly; otherwise, performance is impacted. To
check whether a file system is local or shared, run mmlspool <fs-name> all -L and see whether the
allowWriteAffinity of the file system data pool is yes or no. If the value is yes, configure local for
this file system. If the value is no, configure shared for this file system.
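
For example, a hedged check (the file system name gpfs1 is only an illustration):

# Inspect the data pool attributes; allowWriteAffinity=yes means FPO (configure "local"),
# allowWriteAffinity=no means SAN-based or IBM ESS shared storage (configure "shared")
/usr/lpp/mmfs/bin/mmlspool gpfs1 all -L | grep -i allowWriteAffinity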
The gpfs.replica.enforced is a comma delimited string used to define the replica enforcement
policy for all file systems defined by gpfs.mnt.dir.
Sync the above changes to all the HDFS Transparency nodes and restart HDFS Transparency. HDFS
Transparency will mount all the non-primary file systems with bind mode into the primary file system.
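
A minimal sketch of that sync and restart, assuming HDFS Transparency 2.7.3-x paths (for 3.0.x, use /var/mmfs/hadoop/etc/hadoop as the configuration directory):

# Push gpfs-site.xml to all HDFS Transparency nodes, then restart the services
/usr/lpp/mmfs/bin/mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop
/usr/lpp/mmfs/bin/mmhadoopctl connector stop
/usr/lpp/mmfs/bin/mmhadoopctl connector start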

In the above example, the primary file system is /path/fpo, and /path/fpo/<gpfs.data.dir>
is the root directory for HDFS Transparency. The secondary file system /path/ess1 is
virtually mapped as the /path/fpo/<gpfs.data.dir>/ess1 directory. Therefore, if you
have /path/fpo/<gpfs.data.dir>/file1, /path/fpo/<gpfs.data.dir>/file2, and /path/fpo/
<gpfs.data.dir>/dir1, after mapping, the hadoop dfs -ls / command shows /file1, /file2, /dir1
and /ess1. Use hadoop dfs -ls /ess1 to list all files and directories under /path/ess1.
Note:
1. The gpfs.data.dir is a single directory and it is always configured for the primary file system.
2. In the example, if the directory /path/fpo/<gpfs.data.dir>/ess1 exists and is not
empty, HDFS Transparency reports an exception on startup stating that the /path/fpo/
<gpfs.data.dir>/ess1 directory is not empty, and fails to start. To resolve this issue, rename
the directory /path/fpo/<gpfs.data.dir>/ess1 or remove all the files under the /path/fpo/
<gpfs.data.dir>/ess1/ directory so that the directory does not contain any contents.
3. When HDFS Transparency is stopped, any file system that is not the primary (first) file system gets
unmounted. The unmount removes the GPFS file system but not the local directory name used for the
mount. Ensure that this local directory used for mounting the second file system does not contain any
data when you start HDFS Transparency. Otherwise, HDFS Transparency will not be able to restart.

HDFS encryption
IBM Storage Scale already offers built-in encryption support. HDFS-level encryption for the IBM Storage Scale
HDFS Transparency connector is also supported.
It is important to understand the difference between HDFS-level encryption and built-in encryption with
IBM Storage Scale. HDFS-level encryption is per-user based, whereas built-in encryption is per-node
based. Therefore, if the use case demands more fine-grained control at the user level, use HDFS-level
encryption. However, if you enable HDFS-level encryption, you will not be able to get in-place analytics
benefits such as accessing the same data with HDFS and POSIX/NFS.
This is supported since HDFS Transparency 3.0.0-0 and 2.7.3-4. It requires Ranger and Ranger KMS
and has only been tested over the HortonWorks stack. If you plan to enable this for open source Apache
Hadoop, enable it on native HDFS first and confirm that it is working before you switch native HDFS to
HDFS Transparency.
To enable HDFS-level encryption, configure gpfs.ranger.enabled=true in gpfs-site.xml and
set the following values for gpfs-site.xml from the Ambari GUI:

Configuration               Value (default)
gpfs.encryption.enabled     true (false)
gpfs.ranger.enabled         true (true)

Note: From HDFS Transparency 3.1.0-6 and 3.1.1-3, ensure that the gpfs.ranger.enabled field is set
to scale. The scale option replaces the original true/false values.
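
As a quick hedged sanity check after the change is pushed out, you can inspect the effective values on an HDFS Transparency node (the path shown assumes HDFS Transparency 3.x; on 3.1.0-6/3.1.1-3 and later, expect gpfs.ranger.enabled to be scale):

grep -A 1 -E "gpfs\.(encryption|ranger)\.enabled" /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml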

Known limits
• CES HDFS does not support hdfs crypto -listZones. For a workaround with the GPFS policy
engine, see the step 31 in the Second generation HDFS Transparency Protocol troubleshooting topic.
• Files in HDFS snapshot against encryption zone files are not supported for read/write.

Remote mount at fileset level


This topic lists the steps to configure HDFS Transparency to mount a remote fileset.
To configure HDFS Transparency to mount a remote fileset, perform the following:

1. From the owning cluster, add the remote fileset into the allowed fileset list. For more information, see
Fileset access control for remote clusters.

[root@owningcluster ~]# mmauth grant accessingcluster -f gpfs --fileset fset1,root


mmauth: [I] Collecting fileset information ...
mmauth: Granting cluster accessingcluster access to file system gpfs:
access type rw; root credentials will not be remapped.
mmauth: Propagating the cluster configuration data to all affected nodes.
mmauth: Command successfully completed

Note: For the command to run successfully, ensure that the root fileset is included in the allow list
when you are enabling the remote fileset access control.
2. Ensure that only the granted filesets are visible from the accessing cluster.

[root@accessingcluster ~]# mmlsfileset remotefs


Filesets in file system 'remotefs':
Name Status Path
root Linked /remotefs
indepfset1 Linked /remotefs/indepfset1
depfset1 Linked /remotefs/depfset1

3. Stop the HDFS Transparency cluster.

mmhdfs hdfs stop

4. For remote mount, set the granted fileset as the gpfs.data.dir parameter in gpfs-site.xml.
Both independent and dependent filesets are supported. If you are operating with 2fs mode, no
configuration changes are needed.

<property>
<name>gpfs.mnt.dir</name>
<value>/remotefs</value>
<description>gpfs mount point</description>
</property>

<property>
<name>gpfs.data.dir</name>
<value>indepfset1</value>
<description>Setting this to a subdirectory of the gpfs mount point makes that subdirectory
the root directory from the Hadoop client point of view. Leave it empty to make the whole gpfs
file system visible to the Hadoop client. When specifying the subdirectory, the gpfs mount point
should not be included in the string.</description>
</property>

5. Upload and sync the changes across all the transparency nodes.

mmhdfs config upload

6. Start all the transparency nodes and check the status of namenode.

mmhdfs hdfs start; hdfs haadmin -getAllServiceState

7. Change the fileset permissions from 700 to 755.

hdfs dfs -chmod 755 /
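
As a final hedged check (the scratch directory name is only an illustration), confirm that the granted fileset is served as the Hadoop root directory and is writable:

# The listing should show the contents of the granted fileset (indepfset1)
hdfs dfs -ls /
# Optional write test with a temporary directory
hdfs dfs -mkdir /smoketest && hdfs dfs -rm -r /smoketest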

High availability configuration


For HortonWorks HDP, you can configure HA from the Ambari GUI directly. For open source Apache Hadoop,
refer to the following sections.

Manual HA switch configuration


High Availability (HA) is implemented in HDFS Transparency by using a shared directory in the IBM
Storage Scale file system.
If your HDFS Transparency version is 2.7.0-2+, configure automatic HA according to “Automatic
NameNode service HA” on page 196.

In the following configuration example, the HDFS nameservice ID is mycluster and the NameNode IDs
are nn1 and nn2.
1. Define the nameservice ID in the core-site.xml file that is used by the Hadoop distribution. If you
are using IBM BigInsights IOP or Hortonworks HDP, change this configuration in the Ambari GUI and
restart the HDFS services to synchronize it with all the Hadoop nodes.

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

2. Configure the hdfs-site.xml file that is used by the Hadoop distro. If you are using IBM BigInsights
IOP or Hortonworks HDP, change these configurations in the Ambari GUI and restart the HDFS services
to synchronize it with all the Hadoop nodes.

<property>
<!--define dfs.nameservices ID-->
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>

<property>
<!--define NameNodes ID for HA-->
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>

<property>
<!--Actual hostname and rpc address of NameNode ID-->
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>c8f2n06.gpfs.net:8020</value>
</property>

<property>
<!--Actual hostname and rpc address of NameNode ID-->
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>c8f2n07.gpfs.net:8020</value>
</property>

<property>
<!--Actual hostname and http address of NameNode ID-->
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>c8f2n06.gpfs.net:50070</value>
</property>

<property>
<!--Actual hostname and http address of NameNode ID-->
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>c8f2n07.gpfs.net:50070</value>
</property>

<property>
<!--Shared directory used for status sync up-->
<!--Shared directory should be under gpfs.mnt.dir but not gpfs.data.dir-->
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///<gpfs.mnt.dir>/HA</value>
</property>

<property>
<name>dfs.ha.standby.checkpoints</name>
<value>false</value>
</property>

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

The native HDFS uses the directory specified by dfs.namenode.shared.edits.dir configuration
parameter to save information shared between the active NameNode and the standby NameNode.
HDFS Transparency 2.7.3-4 and 3.1.0-0 uses this directory to save Kerberos delegation token
related information. If you revert from HDFS Transparency back to the native HDFS, please revert
dfs.namenode.shared.edits.dir configuration parameter back to the one used for the native
HDFS. In Ambari Mpack 2.4.2.7 and Mpack 2.7.0.1, the dfs.namenode.shared.edits.dir
parameter is set automatically when integrating or unintegrating IBM Storage Scale service.
The dfs.namenode.shared.edits.dir configuration parameter must be consistent with
gpfs.mnt.dir defined in /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml. You can
create the directory /<gpfs.mnt.dir>/HA and change the ownership to hdfs:hadoop before starting
the HDFS transparency services.
For HDFS Transparency releases earlier than 2.7.3-4 and also the 3.0.0 release,
dfs.ha.standby.checkpoints must be set to false. If not, you will see a lot of exceptions
in the standby NameNode logs. However, for HDFS Transparency 2.7.3-4 and 3.1.0-0,
dfs.ha.standby.checkpoints must be set to true.
For example,

ERROR ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(371)) - Exception in doCheckpoint

If you are using Ambari, search under HDFS > Configs to see if the dfs.ha.standby.checkpoints
field is set to false. If the field is not found, add the field to the Custom hdfs-site > Add property and
set it to false. Save Config, and restart HDFS and any services that you might see with restart icon in
Ambari.
Note: This disables the exception written to the log.
To not show the alert from Ambari, click on the alert and disable the alert from Ambari GUI.
HDFS Transparency does not have fsImage and editLogs. Therefore, do not perform checkpoints from
the standby NameNode service.
From 2.7.3-4 of 2.x and 3.1.0-0 of 3.x, HDFS Transparency has minimal EditLogs support
and therefore needs checkpoints (the dfs.ha.standby.checkpoints=true setting) to clean up
outdated EditLogs. The 3.0.0 release of HDFS Transparency does not support EditLogs
(the dfs.ha.standby.checkpoints=false setting).
The dfs.client.failover.proxy.provider.mycluster configuration parameter must be
changed according to the name service ID. In the above example, the name service
ID is configured as mycluster in core-site.xml. Therefore, the configuration name is
dfs.client.failover.proxy.provider.mycluster.
Note: If you enable Short Circuit Read in the Short Circuit Read
Configuration section, the value of the configuration parameter must be
org.apache.hadoop.gpfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
3. Follow the guide in the Sync Hadoop configurations section to synchronize core-site.xml and
hdfs-site.xml from the Hadoop distribution to any one node that is running HDFS transparency
services. For example, HDFS_Transparency_node1.
4. For HDFS Transparency 2.7.0-x, on HDFS_Transparency_node1, modify /usr/lpp/mmfs/
hadoop/etc/hadoop/hdfs-site.xml:

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.gpfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

With this configuration, WebHDFS service functions correctly when NameNode HA is enabled.
Note: On HDFS transparency nodes, the configuration value of the key
dfs.client.failover.proxy.provider.mycluster in hdfs-site.xml is different from that
in Step 2.

Note: This step should not be performed for HDFS Transparency 2.7.2-x and later.
5. On HDFS_Transparency_node1, run the command as the root user to synchronize the HDFS
Transparency configuration to all the HDFS transparency nodes:
# mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop
6. If you are using HDFS Transparency version 2.7.3-4 or 3.1.0-0, before you start HDFS
transparency for the first time, run the /usr/lpp/mmfs/hadoop/bin/hdfs namenode
-initializeSharedEdits command to initialize the shared edits directory. Run this command only
after configuring HA and before starting the service.
7. Start the HDFS transparency service by running the mmhadoopctl command:
# mmhadoopctl connector start
8. After the service starts, both NameNodes are in the standby mode by default. You can activate one
NameNode by using the following command so that it responds to the client:
# /usr/lpp/mmfs/hadoop/bin/gpfs haadmin -transitionToActive --forceactive
[namenode ID]
For example, you can activate the nn1 NameNode by running the following command:
# /usr/lpp/mmfs/hadoop/bin/gpfs haadmin -transitionToActive -forceactive nn1
If the nn1 NameNode fails, you can activate another NameNode and relay the service by running the
following command:
# /usr/lpp/mmfs/hadoop/bin/gpfs haadmin -transitionToActive -forceactive nn2
Note: The switch must be done manually. Automatic switch will be supported in the future releases.
Use the following command to view the status of the NameNode:
# /usr/lpp/mmfs/hadoop/bin/gpfs haadmin -getServiceState [namenode ID]
You could check your /usr/lpp/mmfs/hadoop/etc/hadoop/hdfs-site.xml or run the following
commands to figure out the [NameNode ID]:

#/usr/lpp/mmfs/hadoop/bin/gpfs getconf -confKey fs.defaultFS


hdfs://mycluster
#hdfs getconf -confKey dfs.ha.namenodes.mycluster
nn1,nn2

After one NameNode becomes active, you can start the other Hadoop components, such as hbase and
hive and run your Hadoop jobs.
Note: When HA is enabled for HDFS transparency, you might see the following exception in the logs:
Get corrupt file blocks returned error: Operation category READ is not supported in state standby.
These are known HDFS issues: HDFS-3447 and HDFS-8910.

Automatic NameNode service HA


Automatic NameNode Service HA is supported in gpfs.hdfs-protocol 2.7.0-2 and later. The
implementation of high availability (HA) is the same as NFS-based HA in native HDFS, except that the
NFS shared directory required by native HDFS is not needed for HDFS Transparency.
The prerequisite for configuring automatic NameNode HA is to have ZooKeeper services running in the
cluster.

Configuring Automatic NameNode Service HA


If you use a Hadoop distribution, such as Hortonworks Data Platform, the ZooKeeper service is deployed by
default.
If you use open-source Apache Hadoop, you must set up the ZooKeeper service by following the
instructions on the ZooKeeper website.

After you set up the Zookeeper service, perform the following steps to configure automatic NameNode
HA.
Note: In the following configuration example, HDFS Transparency NameNode service ID is mycluster
and NameNode IDs are nn1 and nn2. Zookeeper server zk1.gpfs.net, zk2.gpfs.net and
zk3.gpfs.net are configured to support automatic NameNode HA. The ZooKeeper servers must be
started before starting the HDFS Transparency cluster.
1. Define the NameNode service ID in the core-site.xml that is used by your Hadoop distribution.
Note: If you are using IBM BigInsights IOP or Hortonworks HDP, you can change this configuration in
Ambari GUI and restart the HDFS services to synchronize it with all the Hadoop nodes.

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

2. Configure the hdfs-site.xml file used by your Hadoop distribution:


Note: If you are using IBM BigInsights IOP or Hortonworks HDP, you can change this configuration in
Ambari GUI and restart the HDFS services to synchronize it with all the Hadoop nodes.

<property>
<!--define dfs.nameservices ID-->
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>

<property>
<!--define NameNodes ID for HA-->
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>

<property>
<!--Actual hostname and rpc address of NameNode ID-->
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>c8f2n06.gpfs.net:8020</value>
</property>

<property>
<!--Actual hostname and rpc address of NameNode ID-->
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>c8f2n07.gpfs.net:8020</value>
</property>

<property>
<!--Actual hostname and http address of NameNode ID-->
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>c8f2n06.gpfs.net:50070</value>
</property>

<property>
<!--Actual hostname and http address of NameNode ID-->
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>c8f2n07.gpfs.net:50070</value>
</property>

<property>
<!--Shared directory used for status sync up-->
<!--Shared directory should be under gpfs.mnt.dir but not gpfs.data.dir-->
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///<gpfs.mnt.dir>/HA</value>
</property>

<property>
<name>dfs.ha.standby.checkpoints</name>
<value>false</value>
</property>

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.gpfs.net:2181,zk2.gpfs.net:2181,zk3.gpfs.net:2181</value>
</property>

The native HDFS uses the directory specified by dfs.namenode.shared.edits.dir configuration
parameter to save information shared between the active NameNode and the standby NameNode.
HDFS Transparency 2.7.3-4 and 3.1.0-0 uses this directory to save Kerberos delegation token
related information. If you revert from HDFS Transparency back to the native HDFS, please revert
dfs.namenode.shared.edits.dir configuration parameter back to the one used for the native
HDFS. In Ambari Mpack 2.4.2.7 and Mpack 2.7.0.1, the dfs.namenode.shared.edits.dir
parameter is set automatically when integrating or unintegrating IBM Storage Scale service.
The configuration dfs.namenode.shared.edits.dir must be consistent with gpfs.mnt.dir
defined in /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 2.7.x)
or /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS Transparency 3.x). You could
create the /<gpfs.mnt.dir>/HA directory and change the ownership to hdfs:hadoop before
starting the HDFS transparency services.
For HDFS Transparency releases earlier than 2.7.3-4 of HDP 2.x and also the 3.0.0 release of HDP 3.x,
dfs.ha.standby.checkpoints must be set to false. If not, you will see a lot of exceptions
in the standby NameNode logs. However, for 2.7.3-4 of HDP 2.x and 3.1.0-0 of HDP 3.x,
dfs.ha.standby.checkpoints must be set to true.
For example,

ERROR ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(371)) - Exception in doCheckpoint

HDFS Transparency does not have fsImage and editLogs. Therefore, do not perform checkpoints from
the standby NameNode service.
From 2.7.3-4 for the HDFS Transparency 2.7.x release or 3.1.0-0 for the HDFS Transparency 3.x
release, HDFS Transparency has minimal EditLogs support and therefore needs checkpoints
(the dfs.ha.standby.checkpoints=true setting) to clean up outdated EditLogs. HDFS
Transparency releases earlier than 2.7.3-4 or 3.0.0-0 do not support EditLogs
(the dfs.ha.standby.checkpoints=false setting).
The configuration name dfs.client.failover.proxy.provider.mycluster must be
changed according to the nameservice ID. In the above example, the nameservice
ID is configured as mycluster in core-site.xml. Therefore, the configuration name is
dfs.client.failover.proxy.provider.mycluster.
Note: If you enable Short Circuit Read in the “Short-circuit read
configuration” on page 200, the value of this configuration must be
org.apache.hadoop.gpfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
3. To synchronize core-site.xml and hdfs-site.xml from your Hadoop distribution to any one
node that is running HDFS Transparency services, you can use scp to copy them to one of your
HDFS Transparency nodes, assuming it is HDFS_Transparency_node1.
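
For example, a hedged sketch (/etc/hadoop/conf is an assumed distribution path; the target directory is /usr/lpp/mmfs/hadoop/etc/hadoop for HDFS Transparency 2.7.x or /var/mmfs/hadoop/etc/hadoop for 3.x):

scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
    HDFS_Transparency_node1:/usr/lpp/mmfs/hadoop/etc/hadoop/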
4. For HDFS Transparency 2.7.0-x, on HDFS_Transparency_node1, modify the /usr/lpp/mmfs/
hadoop/etc/hadoop/hdfs-site.xml:

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.gpfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

With this configuration, the WebHDFS service functions properly when NameNode HA is enabled.
Note: On HDFS Transparency nodes, the above configuration value in hdfs-site.xml is different from
that in Step 2.
Note: This step should not be performed for HDFS Transparency 2.7.2-0 and later.
5. On HDFS_Transparency_node1, run the following command as the root user to synchronize HDFS
Transparency configuration with all HDFS transparency nodes:
For HDFS Transparency 2.7.3-x, run the following command:

mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop

For HDFS Transparency 3.0.x or 3.1.x, run the following command:

mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop

6. Before you start HDFS Transparency for the first time, you should run the /usr/lpp/mmfs/
hadoop/bin/hdfs namenode -initializeSharedEdits command.
This initializes the shared edits directory.
Note: Run this command only once, after configuring HA and before starting the service.
7. Start the HDFS Transparency service by running the mmhadoopctl command:

mmhadoopctl connector start

8. Format the zookeeper data structure:


For HDFS Transparency 2.7.3-x:

/usr/lpp/mmfs/hadoop/bin/gpfs --config /usr/lpp/mmfs/hadoop/etc/hadoop/ zkfc -formatZK

For HDFS Transparency 3.0.x or 3.1.x:

/usr/lpp/mmfs/hadoop/bin/hdfs --config /var/mmfs/hadoop/etc/hadoop/ zkfc -formatZK

This step is only needed when you start HDFS Transparency service for the first time. After that, this
step is not needed when restarting HDFS Transparency service.
9. Start the zkfc daemon:

/usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh start zkfc -formatZK

Run jps on the nn1 and nn2 NameNodes to check if the DFSZKFailoverController process has been
started.
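
For example, a quick hedged check on each NameNode:

# A DFSZKFailoverController entry indicates that the zkfc daemon is running
jps | grep DFSZKFailoverController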
Note: If the -formatZK option is not added, the system displays the following exception:
FATAL org.apache.hadoop.ha.ZKFailoverController: Unable to start failover
controller. Parent znode does not exist
10. Check the state of the services
Run the following command to check that all NameNode services and DataNode services are up:

# mmhadoopctl connector getstate

11. Run the following command to check the state of NameNode services:

/usr/lpp/mmfs/hadoop/bin/gpfs haadmin -getServiceState [namenode ID]

You could check your /usr/lpp/mmfs/hadoop/etc/hadoop/hdfs-site.xml (for HDFS
Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml (for HDFS
Transparency 3.0.x) or run the following commands to figure out the [namenode ID]:

#/usr/lpp/mmfs/hadoop/bin/gpfs getconf -confKey fs.defaultFS


hdfs://mycluster
#hdfs getconf -confKey dfs.ha.namenodes.mycluster
nn1,nn2

Note: When HA is enabled for HDFS transparency, the following exception might be logged:
Get corrupt file blocks returned error: Operation category READ is not
supported in state standby. These are unfixed HDFS issues: HDFS-3447 and HDFS-8910.

Short-circuit read configuration


In HDFS, read requests go through the DataNode. When the client requests the DataNode to read a
file, the DataNode reads that file off the disk and sends the data to the client over a TCP socket. The
short-circuit read obtains the file descriptor from the DataNode, allowing the client to read the file
directly.
The short-circuit read feature works only when the Hadoop client and the HDFS Transparency DataNode are
colocated on the same node. For example, if the Yarn NodeManagers and HDFS Transparency
DataNodes are on the same nodes, short-circuit read is effective when running Yarn jobs.
Note: The admin must enable short-circuit read in advance. For HortonWorks and the IBM Storage Scale
Mpack (Ambari integration), you can enable or disable short-circuit read from the Ambari GUI. For Apache
Hadoop, refer to the following sections.
Short-circuit reads provide a substantial performance boost to many applications.

For HDFS Transparency version 2.7.0-x


Short-circuit local read can only be enabled on Hadoop 2.7.0. HDFS Transparency version 2.7.0-x does
not support this feature in Hadoop 2.7.1/2.7.2. IBM BigInsights IOP 4.1 uses Hadoop version 2.7.1.
Therefore, short-circuit read cannot be enabled over IBM BigInsights IOP 4.1 if HDFS Transparency 2.7.0-x
is used. For more information on how to enable short-circuit read on other Hadoop versions, contact
[email protected].

Configuring short-circuit local read


To configure short-circuit local reads, enable libhadoop.so and use the DFS Client shipped by the
IBM Storage Scale HDFS transparency. The package name is gpfs.hdfs-protocol. You cannot use
standard HDFS DFS Client to enable the short-circuit mode over the HDFS transparency.
To enable libhadoop.so, compile the native library on the target machine or use the library shipped
by IBM Storage Scale HDFS transparency. To compile the native library on the specific machine, do the
following steps:
1. Download the Hadoop source code from Hadoop community. Unzip the package and cd to that
directory.
2. Build by mvn: $ mvn package -Pdist,native -DskipTests -Dtar
3. Copy hadoop-dist/target/hadoop-2.7.1/lib/native/libhadoop.so.* to
$YOUR_HADOOP_PREFIX/lib/native/
To use the libhadoop.so delivered by the HDFS transparency,
copy /usr/lpp/mmfs/hadoop/lib/native/libhadoop.so to $YOUR_HADOOP_PREFIX /lib/
native/libhadoop.so.
The shipped libhadoop.so is built on x86_64, ppc64 or ppc64le respectively.
Note: This step must be performed on all nodes running the Hadoop tasks.

Enabling DFS Client
To enable DFS Client, perform the following procedure:
1. On each node that accesses IBM Storage Scale in the short-circuit mode, back up
hadoop-hdfs-2.7.0.jar using $ mv $YOUR_HADOOP_PREFIX/share/hadoop/hdfs/hadoop-
hdfs-2.7.0.jar
$YOUR_HADOOP_PREFIX/share/hadoop/hdfs/hadoop-hdfs-2.7.0.jar.backup.
2. Link hadoop-gpfs-2.7.0.jar to classpath using $ln -s /usr/lpp/mmfs/hadoop/
share/hadoop/hdfs/hadoop-gpfs-2.7.0.jar $YOUR_HADOOP_PREFIX/share/hadoop/
hdfs/hadoop-gpfs-2.7.0.jar
3. Update the core-site.xml file with the following information:

<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.gpfs.DistributedFileSystem</value>
</property>

Short-circuit reads make use of a UNIX domain socket. This is a special path in the file system that allows
the client and the DataNodes to communicate. You need to set a path to this socket. The DataNode needs
to be able to create this path. However, users other than the HDFS user or root must not be able to create
this path. Therefore, paths under /var/run or /var/lib folders are often used.
The client and the DataNode exchange information through a shared memory segment on the /dev/shm
path. Short-circuit local reads need to be configured on both the DataNode and the client. Here is an
example configuration.

<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>

Synchronize all these changes on the entire cluster and if needed, restart the service.
Note: The /var/lib/hadoop-hdfs and dfs.domain.socket.path must be created manually by the
root user before running the short-circuit read. The /var/lib/hadoop-hdfs must be owned by the root
user. If not, the DataNode service fails when starting up.

#mkdir -p /var/lib/hadoop-hdfs
#chown root:root /var/lib/hadoop-hdfs
#touch /var/lib/hadoop-hdfs/${dfs.domain.socket.path}
#chmod 666 /var/lib/hadoop-hdfs/${dfs.domain.socket.path}

The permission control in short-circuit reads is similar to the common user access in HDFS. If you have
the permission to read the file, then you can access it through short-circuit read.

For HDFS Transparency version 2.7.2-x/2.7.3-x/3.x


The short-circuit read configuration described in this section is only applicable to Apache Hadoop 2.7.1+.
Therefore, if you are using Apache Hadoop, you can follow the steps below to enable it. For HortonWorks
Data Platform (HDP), you could enable/disable short circuit read from the Ambari GUI. The following
steps describe how to manually enable the native library (libhadoop.so), how to replace the client JARs
with the versions provided by HDFS Transparency and how to change the configuration in hdfs-site.xml
with the correct values for dfs.client.read.shortcircuit and dfs.domain.socket.path.
Note: For configuring short-circuit read, glibc version must be at least version 2.14.
To configure short-circuit read, the libhadoop.so library must be enabled and the HDFS client must use
the HDFS client JAR file shipped with IBM Storage Scale HDFS Transparency.

To enable libhadoop.so, you can:
1. Use the pre-compiled libhadoop.so shipped with IBM Storage Scale HDFS transparency.
2. Compile libhadoop.so manually.
To use the pre-compiled libhadoop.so, follow the steps listed below:
Note: This example uses Hadoop 3.1.3. If you are using any other version, change the paths accordingly.

1. HADOOP_DISTRO_HOME=/opt/hadoop/hadoop-3.1.3/
TRANSPARENCY_HOME=/usr/lpp/mmfs/hadoop/

Create a backup of any existing libhadoop.so:

mv $HADOOP_DISTRO_HOME/lib/native/libhadoop.so $HADOOP_DISTRO_HOME/lib/native/libhadoop.so.backup

2. Link to the libhadoop.so library shipped with HDFS Transparency:

ln -s $TRANSPARENCY_HOME/lib/native/libhadoop.so $HADOOP_DISTRO_HOME/lib/native/libhadoop.so

3. Repeat the same for libhadoop.so.1.0.0

mv $HADOOP_DISTRO_HOME/lib/native/libhadoop.so.1.0.0 $HADOOP_DISTRO_HOME/lib/native/
libhadoop.so.1.0.0.backup
ln -s $TRANSPARENCY_HOME/lib/native/libhadoop.so.1.0.0 $HADOOP_DISTRO_HOME/lib/native/
libhadoop.so.1.0.0

Note: These steps must be performed on all nodes that are DataNodes and HDFS Clients (for example,
NodeManagers).
To manually compile libhadoop.so, follow the steps listed below:
Note: This example uses Hadoop 2.7.2. If you are using another version, change the paths accordingly.

1. HADOOP_DISTRO_HOME=/opt/hadoop/hadoop-2.7.2/

Download the Hadoop source code from Hadoop community. Unzip the package under <target-
hadoop-path>.
2. cd <target-hadoop-path>/hadoop-2.7.2-src/hadoop-common-project/hadoop-common/

3. mvn package -Pdist,native -DskipTests -Dtar

4. cp <target-hadoop-path>/hadoop-common-2.7.2/lib/native/libhadoop.so.* to
$HADOOP_DISTRO_HOME/lib/native/

Using the HDFS Transparency Client JAR in version 2.7.2-x/2.7.3-x


Note: The following example uses Hadoop 2.7.2. If you are using another version, change the paths
accordingly.

1. HADOOP_DISTRO_HOME=/opt/hadoop/hadoop-2.7.2/
HADOOP_DISTRO_VERSION=2.7.2
TRANSPARENCY_HOME=/usr/lpp/mmfs/hadoop/
TRANSPARENCY_VERSION=2.7.2

Create a backup of the existing client JAR:

mv $HADOOP_DISTRO_HOME/share/hadoop/hdfs/hadoop-hdfs-$HADOOP_DISTRO_VERSION.jar
$HADOOP_DISTRO_HOME/share/hadoop/hdfs/hadoop-hdfs-$HADOOP_DISTRO_VERSION.jar.backup

2. Link to the client JAR shipped with HDFS Transparency:

ln -s $TRANSPARENCY_HOME/share/hadoop/hdfs/hadoop-hdfs-$TRANSPARENCY_VERSION.jar
$HADOOP_DISTRO_HOME/share/hadoop/hdfs/hadoop-hdfs-$HADOOP_DISTRO_VERSION.jar

Note: These steps must be performed on all nodes that are DataNodes and HDFS Clients (for example,
NodeManagers).
Using the HDFS Transparency Client JAR in version 3.1.0-x/3.1.1-x
Note: The following example uses Apache Hadoop 3.1.3. If you are using another version, change the
paths accordingly.

1. HADOOP_DISTRO_HOME=/opt/hadoop/hadoop-3.1.3/
HADOOP_DISTRO_VERSION=3.1.3
TRANSPARENCY_HOME=/usr/lpp/mmfs/hadoop/
TRANSPARENCY_VERSION=3.1.1

Create a backup of the existing client JAR:

mv $HADOOP_DISTRO_HOME/share/hadoop/hdfs/hadoop-hdfs-client-$HADOOP_DISTRO_VERSION.jar
$HADOOP_DISTRO_HOME/share/hadoop/hdfs/hadoop-hdfs-client-$HADOOP_DISTRO_VERSION.jar.backup

2. Link to the client JAR shipped with HDFS Transparency:

ln -s $TRANSPARENCY_HOME/share/hadoop/hdfs/hadoop-hdfs-client-$TRANSPARENCY_VERSION.jar
$HADOOP_DISTRO_HOME/share/hadoop/hdfs/hadoop-hdfs-client-$HADOOP_DISTRO_VERSION.jar

Note: These steps must be performed on all nodes that are DataNodes and HDFS Clients (for example,
NodeManagers).
Enable short-circuit read in hdfs-site.xml
If you are running HDP, use Ambari to enable short-circuit read. For more information, see HDP 3.X
“Short-circuit read (SSR)” on page 427.
Enabling short-circuit read manually:
Note:
1. These steps must be performed on all nodes that are DataNodes and HDFS Clients (for example,
NodeManagers).
2. Short-circuit reads make use of a UNIX domain socket. This is a special path in the file system that
allows the client and the DataNodes to communicate. You need to set a path to this socket. The
DataNode needs to be able to access this path. However, users other than the root user should not be
able to access this path. Therefore, paths under /var/run or /var/lib are often used.
3. The directory /var/lib/hadoop-hdfs and socket file must be created manually by the root user
before enabling short-circuit read in the configuration. The directory must be owned by the root user.
Otherwise the DataNode service fails when starting up.

mkdir -p /var/lib/hadoop-hdfs
chown root:root /var/lib/hadoop-hdfs
touch /var/lib/hadoop-hdfs/dn_socket
chmod 666 /var/lib/hadoop-hdfs/dn_socket

In addition, the client and the DataNode exchange information through a shared memory segment on
the /dev/shm path.
1. Add or change the following parameters in /var/mmfs/hadoop/etc/hadoop/hdfs-client.xml
and the hdfs-client.xml used by your Apache Hadoop distro:

<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>

2. Synchronize the changes in the entire cluster and restart the Hadoop client cluster to ensure that all
the services are aware of this configuration change.
3. Restart the HDFS Transparency cluster or follow the “Automatic Configuration Refresh ” on page 207
section to refresh the configuration without interrupting the HDFS Transparency service.
The permission control in short-circuit reads is similar to the common user access in HDFS. If you have
the permission to read the file, then you can access it through short-circuit read.
Note: For Apache Hadoop, if you run other components, such as Oozie, Solr, or Spark, these components
are also packaged with a hadoop-hdfs-<version>.jar in Apache Hadoop 2.7.x or a hadoop-hdfs-
client-<version>.jar in Apache Hadoop 3.x. When enabling short-circuit read/write for these
additional clients, you need to re-package them as well with the hadoop-hdfs-2.7.3.jar from HDFS
Transparency 2.7.3-x or the hadoop-hdfs-client-<version>.jar from HDFS Transparency 3.x.
When you disable short-circuit read/write, you need to re-package them again with the original hadoop-
hdfs-2.7.3.jar from Apache Hadoop 2.7.x or hadoop-hdfs-client-<version>.jar from Apache
Hadoop 3.x.

Verify that short-circuit read/write is working


After enabling short-circuit read, short-circuit write is also enabled. This is because short-circuit write is
enabled by default internally with the gpfs.short-circuit-write.enabled field set to yes. For more
information, see “Short circuit write” on page 205.
To verify that short circuit read and write is working properly, run the following commands on all the
DataNodes:

grep "Listening on UNIX domain socket" /var/log/transparency/*

When a workload is running:

grep "REQUEST_SHORT_CIRCUIT_FDS" /var/log/transparency/*

Note: The location of the DataNode log files differ depending on your HDFS Transparency version and
Hadoop distribution (for example, HDP).

mmhadoopctl supports dual network


The HDFS Transparency mmhadoopctl command supports a dual network configuration.
The mmhadoopctl dual network setup is not used in Ambari. Therefore, if you are using Ambari, see
“Dual-network deployment” on page 396 for setup.
The HDFS Transparency mmhadoopctl command requires the NameNode and DataNodes to have password-
less ssh access set up for the network.
The HDFS Transparency dual network setup is for the case where the HDFS Transparency node names are on
a private network for which password-less ssh access cannot be configured.
For mmhadoopctl to work properly over a network without password-less ssh access configured, the
NODE_HDFS_MAP_GPFS export variable needs to be set. This converts the HDFS Transparency node names
set in the HDFS Transparency configuration files to the IBM Storage Scale admin node names that have
password-less ssh set up.
Scenario:
IBM Storage Scale admin network is configured on network 1.
HDFS Transparency NameNode and DataNode and all the Hadoop nodes are configured to use network 2.
Note:
• IBM Storage Scale requires only the admin network to have password-less ssh access.

• It is required to use the export command to export the NODE_HDFS_MAP_GPFS variable in the
hadoop-env.sh file to generate the mapping file correctly.
• Delete the /var/mmfs/hadoop/init/nodemap mapping file on all nodes if needed to regenerate this
file when HDFS Transparency restarts.
• Ensure that you delete the nodemap file on all the nodes before doing a syncconf.
• In order to run the mmhadoopctl connector start/stop command on the node in a dual network
environment, the export NODE_HDFS_MAP_GPFS=yes is required to be set so that the nodemap file is
created for the node.
Steps:
1. Edit configuration.
Manually add the export line 'export NODE_HDFS_MAP_GPFS=yes' in the /var/mmfs/hadoop/etc/
hadoop/hadoop-env.sh file.

# cat hadoop-env.sh | tail -2


export NODE_HDFS_MAP_GPFS=yes

This will generate a request for HDFS Transparency to convert the node names used in HDFS
Transparency config files to the IBM Storage Scale admin node names.
A mapping file /var/mmfs/hadoop/init/nodemap will be created.
If the Hadoop configuration hosts is changed (add/delete), then the mapping file /var/mmfs/
hadoop/init/nodemap will need to be deleted so that restarting the HDFS Transparency can re-
create a new mapping file with the correct host configuration entries.
2. Sync the configuration.
• Ensure that you remove all existing /var/mmfs/hadoop/init/nodemap files from all the nodes.
• Run mmhadoopctl syncconf to sync the configuration files in the cluster. For syncconf syntax, see
“Sync HDFS Transparency configurations” on page 61. A combined sketch of steps 2 and 3 is shown after this list.
3. Start Transparency.
The mmhadoopctl will now be set to use the Scale admin node names.
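
A minimal hedged sketch that combines steps 2 and 3 (the node names and the 3.x configuration path are assumptions for illustration; they must be reachable over the IBM Storage Scale admin network):

# Remove the stale mapping file on every HDFS Transparency node
for node in adm-node1 adm-node2 adm-node3; do
    ssh $node "rm -f /var/mmfs/hadoop/init/nodemap"
done
# Sync the configuration, then start HDFS Transparency
mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop
mmhadoopctl connector start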

Short circuit write


Short circuit write is supported since HDFS Transparency 2.7.3-1.
If the HDFS client and an HDFS Transparency DataNode are located on the same node, short circuit write
writes data from the HDFS client directly into the IBM Storage Scale file system instead of writing the data
through RPC. This reduces the RPC latency through the local loopback network adapter and thus
enhances write performance.

Figure 18. Short Circuit Write Logic

In Figure 18 on page 206, (A) shows the original logic for data writes. With short circuit write enabled, the
data write logic is shown in (B): the data is written directly into the IBM Storage Scale file system.
If you want to enable this feature, refer to “Short-circuit read configuration” on page 200 to enable short
circuit read first. By default, when short circuit read is enabled, short circuit write is also enabled. When
short circuit read is disabled, short circuit write is also disabled.
If you want to disable short circuit write when short circuit read is enabled:
1. Add the following configuration in hdfs-site.xml for the Hadoop client.
If you use HortonWorks HDP, change this in Ambari/HDFS/Configs and restart the HDFS service.
If you use open source Apache Hadoop, change this in <Hadoop-home-dir>/etc/hadoop/hdfs-
site.xml on all Hadoop nodes.

<property>
<name>gpfs.short-circuit-write.enabled</name>
<value>false</value>
</property>

2. Add the same configuration into gpfs-site.xml.
If you use HortonWorks HDP, change this in Ambari/Spectrum Scale/Configs/custom gpfs-
site and restart the IBM Storage Scale service from Ambari.
If you use open source Apache Hadoop, change this in /usr/lpp/mmfs/hadoop/etc/hadoop/
gpfs-site.xml (for HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/gpfs-
site.xml (for HDFS Transparency 3.0.x) and run /usr/lpp/mmfs/bin/mmhadoopctl connector
syncconf /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS Transparency 2.7.3-x) or /usr/lpp/
mmfs/bin/mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop (for HDFS
Transparency 3.0.x) to sync the change to all HDFS Transparency nodes.

Multiple Hadoop clusters over the same file system
By using HDFS transparency, you can configure multiple Hadoop clusters over the same IBM Storage
Scale file system. For each Hadoop cluster, you need one HDFS transparency cluster to provide the file
system service.

Figure 19. Two Hadoop Clusters over the same IBM Storage Scale file system

You can configure Node1 to Node6 as an IBM Storage Scale cluster (FPO or shared storage
mode). Then configure Node1 to Node3 as one HDFS transparency cluster and Node4 to Node6
as another HDFS transparency cluster. HDFS transparency cluster1 and HDFS transparency cluster2
take different configurations by changing /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml
(for HDFS Transparency 2.7.3-x) or /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml (for HDFS
Transparency 3.0.x):
1. Change the gpfs-site.xml for HDFS transparency cluster1 to store the data under /<gpfs-mount-
point>/<hadoop1> (gpfs.data.dir=hadoop1 in gpfs-site.xml).
2. Run mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS
Transparency 2.7.3-x) or mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/
hadoop (for HDFS Transparency 3.0.x) to synchronize the gpfs-site.xml from Step1 to all other
nodes in HDFS transparency cluster1.
3. Change the gpfs-site.xml for HDFS transparency cluster2 to store the data under /<gpfs-mount-
point>/<hadoop2> (gpfs.data.dir=hadoop2 in gpfs-site.xml).
4. Run mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS
Transparency 2.7.3-x) or mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/
hadoop (for HDFS Transparency 3.0.0) to synchronize the gpfs-site.xml from Step3 to all other
nodes in HDFS transparency cluster2.
5. Restart the HDFS transparency services.

Automatic Configuration Refresh


The Automatic configuration refresh feature is supported in gpfs.hdfs-protocol 2.7.0-2 and later.
After making configuration changes in /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS Transparency
2.7.3-x) or /var/mmfs/hadoop/etc/hadoop (for HDFS Transparency 3.0.0) or in the IBM Storage Scale
file system, such as maximum number of replica and NSD server, run the following command to refresh
HDFS transparency without restarting the HDFS transparency services:

/usr/lpp/mmfs/hadoop/bin/gpfs dfsadmin -refresh


<namenode_hostname>:<port> refreshGPFSConfig

Run the command on any HDFS transparency node and change <namenode_hostname>:<port>
according to the HDFS transparency configuration. For example, if fs.defaultFS is hdfs://
c8f2n03.gpfs.net:8020 in /usr/lpp/mmfs/hadoop/etc/hadoop/core-site.xml, replace
<namenode_hostname> with c8f2n03.gpfs.net and <port> with 8020. HDFS transparency
synchronizes the configuration changes with the HDFS transparency services running on the HDFS
transparency nodes and makes it immediately effective.
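
For example, with the fs.defaultFS value shown above, the refresh command becomes:

/usr/lpp/mmfs/hadoop/bin/gpfs dfsadmin -refresh c8f2n03.gpfs.net:8020 refreshGPFSConfig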

Rack locality support for shared storage


From HDFS Transparency 2.7.2-0, rack locality is supported for shared storage including IBM ESS.
If your cluster meets the following conditions, you can enable this feature:
• There is more than one rack in the IBM Storage Scale cluster.
• Each rack has its own ToR (Top of Rack) Ethernet switch and there are rack-to-rack switches between
the two racks.
Otherwise, enabling this feature will not benefit your Hadoop applications. The key advantage of the
feature is to reduce the network traffic over the rack-to-rack Ethernet switch and make as many map/
reduce tasks as possible to read data from the local rack.
The typical topology is shown by the following figure:

Figure 20. Topology of rack awareness locality for shared storage

For IBM Storage Scale over shared storage or IBM ESS, there is no data locality in the file system. The
maximum file system block size from the IBM Storage Scale file system is 16M bytes. However, on the Hadoop
level, the dfs.blocksize is 128M bytes by default. The dfs.blocksize on the Hadoop level is
split into multiple 16MB blocks stored on the IBM Storage Scale file system. After enabling this feature,
HDFS Transparency considers the location of the 8 blocks (16M bytes * 8 = 128M bytes), including replicas (if
you use replica 2 for your file system), and returns to the application the hostname that holds most of the data
from those blocks, so that the application can read most of the data from the local rack and reduce the
rack-to-rack switch traffic. If there is more than one HDFS Transparency DataNode in the selected rack,
HDFS Transparency randomly returns one of them as the DataNode of the block location for that replica.
Enabling rack-awareness locality for shared storage

1. Select the HDFS Transparency nodes from the Hadoop node in Figure 20 on page 208. You can select
all of the Hadoop nodes as the HDFS Transparency nodes, or part of them as the HDFS Transparency
nodes.
All of the selected HDFS Transparency nodes must be installed with IBM Storage Scale and can
mount the file system locally. Select at least one of the Hadoop node from each of the rack for HDFS
Transparency.
Select all Hadoop Yarn Node Managers as the HDFS Transparency nodes to avoid data transfer delays
from the HDFS Transparency node to the Yarn Node Manager node for Map/Reduce jobs.
2. On the HDFS Transparency NameNode, modify the /usr/lpp/mmfs/hadoop/etc/hadoop/core-
site.xml (for HDFS Transparency 2.7.x) or /var/mmfs/hadoop/etc/hadoop/core-site.xml
(for HDFS Transparency 3.0.x):

<property>
<name>net.topology.table.file.name</name>
<value>/usr/lpp/mmfs/hadoop/etc/hadoop/topology.data</value>
</property>
<property>
<name>net.topology.node.switch.mapping.impl</name>
<value>org.apache.hadoop.net.TableMapping</value>
</property>

3. On the HDFS Transparency NameNode, create the topology in /usr/lpp/mmfs/hadoop/etc/
hadoop/topology.data (for HDFS Transparency 2.7.x) or /var/mmfs/hadoop/etc/hadoop/
topology.data (for HDFS Transparency 3.0.x):

# vim topology.data

192.0.2.0 /dc1/rack1
192.0.2.1 /dc1/rack1
192.0.2.2 /dc1/rack1
192.0.2.3 /dc1/rack1
192.0.2.4 /dc1/rack2
192.0.2.5 /dc1/rack2
192.0.2.6 /dc1/rack2
192.0.2.7 /dc1/rack2

Note: The topology.data file uses IP addresses. To configure two IP addresses, see the “Dual network
interfaces” on page 21 section. The IP addresses here must be the IP addresses used for Yarn services
and the IBM Storage Scale NSD server.
Also, it is required to specify the IP addresses for the IBM Storage Scale NSD servers. For Figure 20 on
page 208, specify the IP and corresponding rack information for NSD Server 1/2/3/4/5/6.
4. On the HDFS Transparency NameNode, modify the /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-
site.xml (for HDFS Transparency 2.7.x) or /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml
(for HDFS Transparency 3.0.x):

<property>
<name>gpfs.storage.type</name>
<value>rackaware</value>
</property>

5. On the HDFS Transparency NameNode, run the mmhadoopctl connector syncconf /usr/lpp/
mmfs/hadoop/etc/hadoop (for HDFS Transparency 2.7.x) or mmhadoopctl connector
syncconf /var/mmfs/hadoop/etc/hadoop (for HDFS Transparency 3.0.x) command to
synchronize the configurations to all the HDFS Transparency nodes.
Note: If you have HDP with Ambari Mpack 2.4.2.1 and later, the connector syncconf cannot be
executed. Ambari manages the configuration syncing through the database.
6. (optional): To configure multi-cluster between IBM Storage Scale NSD servers and an IBM
Storage Scale HDFS Transparency cluster, you must configure password-less access from the HDFS
Transparency NameNode to at least one of the contact nodes from the remote cluster. For 2.7.3-2,
HDFS Transparency supports only the root password-less ssh access. From 2.7.3-3, support of non-
root password-less ssh access is added.

If password-less ssh access configuration cannot be set up, starting from HDFS transparency
2.7.3-2, you can configure gpfs.remotecluster.autorefresh as false in the /usr/lpp/mmfs/
hadoop/etc/hadoop/gpfs-site.xml. This prevents Transparency from automatically accessing
the remote cluster to retrieve information.
a. If you are using Ambari, add the gpfs.remotecluster.autorefresh=false field in IBM
Spectrum Scale service > Configs tab > Advanced > Custom gpfs-site.
b. Stop and Start all the services.
c. Manually generate the mapping files and copy them to all the HDFS Transparency nodes. For more
information, see option 3 under the “Password-less ssh access” on page 53 section.

Accumulo support

Native HDFS and HDFS Transparency


Apache Accumulo is fully tested over HDFS Transparency. See Installing Apache Accumulo for
Accumulo configuration information.
The Hadoop community addressed the NameNode bottleneck issue with HDFS federation, which allows a
DataNode to serve up blocks for multiple NameNodes. Additionally, ViewFS allows clients to
communicate with multiple NameNodes by using a client-side mount table.
Multi-Volume support (MVS™), included in 1.6.0, includes the changes that allow Accumulo to work across
multiple clusters such as Native HDFS and IBM Storage Scale HDFS Transparency (called volumes in
Accumulo) while you continue to use a single HDFS directory. A new property, instance.volumes, can
be configured with multiple HDFS nameservices. Accumulo uses them to balance out the NameNode
operations.
You can include multiple NameNode namespaces into Accumulo for greater scalability of Accumulo
instances by using federation.
Federation ViewFS has its own configuration settings to put in core-site.xml and hdfs-
site.xml. You must also specify the namespaces in Accumulo configuration that has its setting in
$ACCUMULO_HOME/conf/accumulo-site.xml:

instance.volumes=hdfs://nn1:port1/path/accumulo/data1, hdfs://nn2:port2/path/accumulo/data2
instance.namespaces=hdfs://nn1:port1,hdfs://nn2:port2

Following is an example:

<property>
<name>instance.namespaces</name>
<value>hdfs://c16f1n10.gpfs.net:8020,hdfs://c16f1n13.gpfs.net:8020</value>
</property>

<property>
<name>instance.volumes</name>
<value>hdfs://c16f1n10.gpfs.net:8020/apps/accumulo/data1,
hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data2
</value>
</property>

The instance.volumes must specify the full path of each separate namespace, not the viewfs://
schema, as tracked by https://issues.apache.org/jira/browse/ACCUMULO-3006.
After you start the federated multiple clusters, start the Accumulo service. Run accumulo init on the
Accumulo client during the Accumulo start if the following error occurs.

2017-11-02 05:46:49,954 [fs.VolumeManagerImpl]


WARN : dfs.datanode.synconclose set to false in hdfs-site.xml:
data loss is possible on hard system reset or power loss
2017-11-02 05:46:49,955 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose
set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss
2017-11-02 05:46:50,038 [zookeeper.ZooUtil] ERROR:
unable obtain instance id at hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data/instance_id
2017-11-02 05:46:50,039 [start.Main] ERROR: Thread
'org.apache.accumulo.server.util.ZooZap' died.
java.lang.RuntimeException: Accumulo not initialized, there is no instance
id at hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data/instance_id
at org.apache.accumulo.core.zookeeper.ZooUtil.getInstanceIDFromHdfs(ZooUtil.java:66)
at org.apache.accumulo.core.zookeeper.ZooUtil.getInstanceIDFromHdfs(ZooUtil.java:51)
at
org.apache.accumulo.server.client.HdfsZooInstance._getInstanceID(HdfsZooInstance.java:137)
at org.apache.accumulo.server.client.HdfsZooInstance.getInstanceID(HdfsZooInstance.java:121)
at org.apache.accumulo.server.util.ZooZap.main(ZooZap.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.accumulo.start.Main$2.run(Main.java:130)
at java.lang.Thread.run(Thread.java:745)

After Accumulo is configured correctly, run the following command to ensure that the multiple
volumes are set up successfully.

$ accumulo admin volumes --list


2017-11-07 22:40:09,043 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose
set to false in hdfs-site.xml: data loss is possible on hard
system reset or power loss
2017-11-07 22:40:09,044 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose
set to false in hdfs-site.xml: data loss is possible on hard
system reset or power loss
Listing volumes referenced in zookeeper
Volume : hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data2

Listing volumes referenced in accumulo.root tablets section


Volume : hdfs://c16f1n10.gpfs.net:8020/apps/accumulo/data1
Volume : hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data2
Listing volumes referenced in accumulo.root deletes section (volume replacement occurs at
deletion time)
Volume : hdfs://c16f1n10.gpfs.net:8020/apps/accumulo/data1
Volume : hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data2

Listing volumes referenced in accumulo.metadata tablets section


Volume : hdfs://c16f1n10.gpfs.net:8020/apps/accumulo/data1
Volume : hdfs://c16f1n13.gpfs.net:8020/apps/accumulo/data2
Listing volumes referenced in accumulo.metadata deletes section (volume replacement occurs at
deletion time)
Volume : hdfs://c16f1n10.gpfs.net:8020/apps/accumulo/data1

Special configuration on IBM Storage Scale


By default, the property tserver.wal.blocksize is not configured and its default value is 0. Accumulo
then calculates the block size itself and sets the block size of the file in the distributed file system. For
IBM Storage Scale, the valid block size can only be an integral multiple of the file system block size, which
is one of 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, 8MB, or 16MB. Otherwise, HDFS Transparency throws an exception.
To avoid this exception, configure tserver.wal.blocksize as the file system data block size. Use the
mmlspool <fs-name> all -L command to check the value.
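For example, a minimal sketch of this check (assuming a file system named bdafs and a 2 MB data block size; substitute your own file system name and the value that is reported for your data pool) might look like the following:

# Check the block size of the data pool in the IBM Storage Scale file system
/usr/lpp/mmfs/bin/mmlspool bdafs all -L
# Note the block size reported for the data pool, for example 2M, and set
# tserver.wal.blocksize to that value in $ACCUMULO_HOME/conf/accumulo-site.xml
# (or in Ambari under Accumulo > Configs > Custom accumulo-site):
#   tserver.wal.blocksize=2M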

Zero shuffle support


Zero shuffle is the ability for the map tasks to write data into the file system and for the reduce tasks to
read that data from the file system directly, without first transferring the data between the map tasks and
the reduce tasks.
Do not use this feature if you are using IBM Storage Scale FPO mode (internal disk-based deployment).
For IBM ESS or SAN-based storage, the recommendation is to use local disks on the compute nodes to
store the intermediate shuffle data.
Zero shuffle should be used only by IBM ESS or SAN-based customers who cannot make local disks
available for shuffle. For these customers, the previous solution was to store the shuffle data in the IBM Storage
Scale file system with replica 1. If you enable zero shuffle, the Map/Reduce jobs store the shuffled
data in the IBM Storage Scale file system and read it directly during the reduce phase. This is supported
from HDFS Transparency 2.7.3-2.
To enable zero shuffle, you need to configure the following values for mapred-site.xml from the
Ambari GUI:

Configuration                                        Value
mapreduce.task.io.sort.mb                            <=1024
mapreduce.map.speculative                            false
mapreduce.reduce.speculative                         false
mapreduce.job.map.output.collector.class             org.apache.hadoop.mapred.SharedFsPlugins$MapOutputBuffer
mapreduce.job.reduce.shuffle.consumer.plugin.class   org.apache.hadoop.mapred.SharedFsPlugins$Shuffle

Also, enable short circuit read for HDFS from the Ambari GUI.
If you take open source Apache Hadoop, you should put the /usr/lpp/mmfs/hadoop/share/hadoop/
hdfs/hadoop-hdfs-<version>.jar (for HDFS Transparency 2.7.3-x) or /usr/lpp/mmfs/hadoop/
share/hadoop/hdfs/hadoop-hdfs-client-<version>.jar (for HDFS Transparency 3.0.x) into
your mapreduce class path.
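As an illustration only, one way to do this for open source Apache Hadoop is to extend HADOOP_CLASSPATH (a hedged sketch; the <version> placeholder and where you export the variable, for example hadoop-env.sh, depend on your installation):

# HDFS Transparency 3.0.x example; use the hadoop-hdfs-<version>.jar path for 2.7.3-x instead
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/usr/lpp/mmfs/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-<version>.jar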
Important:
• Zero shuffle does not impact teragen-like workloads because this kind of workloads do not involve using
shuffle.
• mapreduce.task.io.sort.mb should be <=1024. Therefore, the data size for each map task must
not be larger than 1024MB.
• Zero shuffle creates one file from each map task for each reduce task. Assuming your job has 1000
map tasks and 300 reduce tasks, it will create at least 300K intermediate files. Considering spilling, it
might create around one million intermediate inodes and remove them after the job is done. Therefore,
if the reduce-task-number*map-task-number is more than 300,000, it is not recommended to use zero
shuffle.

Troubleshooting
Consult solutions or workarounds for HDFS Transparency protocol issues, learn about limitations and
differences among HDFS distributions, or see limitations and recommendations for CES HDFS.

HDFS Transparency protocol troubleshooting


This topic contains information on troubleshooting the Second generation HDFS Transparency protocol
issues.
Note:
• For HDFS Transparency 3.1.0 and earlier, use the mmhadoopctl command.
• For CES HDFS (HDFS Transparency 3.1.1 and later), use the corresponding mmhdfs and mmces
commands.
• gpfs.snap --hadoop is used for all HDFS Transparency versions.
• From HDFS Transparency 3.1.0-6 and 3.1.1-3, ensure that the gpfs.ranger.enabled field is set to
scale. The scale option replaces the original true/false values.
1. Enable Debugging
Gather the following for problem determination:

• IBM Storage Scale and HDFS Transparency version
• NameNodes (Primary & HA), DataNodes and application service logs
• gpfs.snap with the --hadoop [-a] option (The --hadoop option is only available for IBM
Storage Scale version 4.2.2 and later). For more information, see the Data gathered for hadoop
on Linux topic under Troubleshooting > Collecting details of the issues > CLI commands for
collecting issue details > Using the gpfs.snap command > Data gathered by gpfs.snap on Linux
for protocols path in IBM Storage Scale documentation.
Note:
– The gpfs.snap --hadoop captures only the logs in the default log settings as specified in the
Data gathered for hadoop on Linux topic in the IBM Storage Scale: Problem Determination Guide.
– From IBM Storage Scale 5.0.5, gpfs.snap --hadoop is able to capture the HDFS Transparency
logs from the user configured directories.
To enable the debug information for HDFS Transparency version 3.0.0-x/2.7.3-x/2.7.2-x, set the
following fields to DEBUG as seen in the Fields box below into the log4j.properties file.
If you are using Ambari and you want to enable the log fields, add the fields through the Ambari GUI
HDFS service and restart the HDFS service.
If you are not using Ambari and you want to enable the log fields, change the fields in the
<GPFS_CONFIG_PATH>/log4j.properties file, run the /usr/lpp/mmfs/bin/mmhadoopctl
connector syncconf <GPFS_CONFIG_PATH> command, and then restart HDFS Transparency.
Restart the HDFS Transparency by running the following commands:

/usr/lpp/mmfs/bin/mmhadoopctl connector stop;


/usr/lpp/mmfs/bin/mmhadoopctl connector start

For HDFS Transparency 2.7.3-x, the <GPFS_CONFIG_PATH> is /usr/lpp/mmfs/hadoop/etc/hadoop,
so the command is mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop.
For HDFS Transparency 3.0.x, the <GPFS_CONFIG_PATH> is /var/mmfs/hadoop/etc/hadoop,
so the command is mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop.
Log and configuration location:
For HDFS Transparency version 3.0.0-x:
• With Hortonworks HDP 3.0, the HDFS Transparency logs are located at /var/log/hadoop/root
by default.
• For Open Source Apache, the HDFS Transparency logs are located at /var/log/transparency.
• Configuration is moved from /usr/lpp/mmfs/hadoop/etc to /var/mmfs/hadoop/etc.
For HDFS Transparency version 2.7.3-x with HortonWorks HDP 2.6.x or BI IOP 4.2.5, the HDFS
Transparency logs are located at /var/log/hadoop/root by default.
For HDFS Transparency version 2.7.2-x with IBM BigInsights IOP 4.0/4.1/4.2.x, the HDFS
Transparency logs are located at /usr/lpp/mmfs/hadoop/logs by default.
For HDFS Transparency version 2.7.x-x with Open Source Apache, the HDFS Transparency logs are
located at /usr/lpp/mmfs/hadoop/logs.
Fields:

log4j.logger.BlockStateChange=DEBUG

log4j.logger.org.apache.hadoop.hdfs.StateChange=DEBUG

log4j.logger.org.apache.hadoop.hdfs.server.namenode.GPFSNative=DEBUG

log4j.logger.org.apache.hadoop.hdfs.server.namenode=DEBUG

log4j.logger.org.apache.hadoop.hdfs.protocol.datatransfer=DEBUG

log4j.logger.org.apache.hadoop.hdfs.server.namenode.top.metrics=ERROR

log4j.logger.org.apache.hadoop.hdfs.server.namenode.top.window=ERROR

log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager=INFO

log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault=INFO

log4j.logger.org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl=DEBUG

log4j.logger.org.apache.hadoop.hdfs.server=DEBUG

log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG

log4j.logger.org.apache.hadoop.ipc=DEBUG

log4j.logger.org.apache.hadoop.fs=DEBUG

log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN

The logs are located at /usr/lpp/mmfs/hadoop/logs.


To enable the debug information for HDFS Transparency version 2.7.0-x, set the following fields in
the /usr/lpp/mmfs/hadoop/etc/hadoop/log4j.properties file to DEBUG:

log4j.logger.org.apache.hadoop.gpfs.server=DEBUG

log4j.logger.org.apache.hadoop.ipc=DEBUG

log4j.logger.org.apache.hadoop.hdfs.protocol=DEBUG

log4j.logger.org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol=DEBUG

log4j.logger.org.apache.hadoop.hdfs.StateChange=DEBUG

Dynamically changing debug log settings


Find the active NameNode and port to set the daemon logging level dynamically. The setting will be in
effect until HDFS restarts or you reset the Debug Log Level to INFO.
Run the following command:

hadoop daemonlog -setlevel <Active Namenode>:<Active Namenode port> <Daemon to set Log
level> <Debug Log Level>

For example:
• To get the debug level:

# hadoop daemonlog -getlevel c902f08x01.gpfs.net:50070


org.apache.hadoop.hdfs.server.namenode

Connecting to http://c902f08x01.gpfs.net:50070/logLevel?
log=org.apache.hadoop.hdfs.server.namenode.
Submitted Class Name: org.apache.hadoop.hdfs.server.namenode
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective Level: INFO
• To set the debug level:

# hadoop daemonlog -setlevel c902f08x01.gpfs.net:50070


org.apache.hadoop.hdfs.server.namenode DEBUG

Connecting to http://c902f08x01.gpfs.net:50070/logLevel?
log=org.apache.hadoop.hdfs.server.namenode&level=DEBUG.
Submitted Class Name: org.apache.hadoop.hdfs.server.namenode
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG

Setting Level to DEBUG ...
Effective Level: DEBUG
Get jps and jstack information
Run jps to get the PID of the daemon, then run jstack <PID> and pipe the output to a file.
For example:

# jps | grep NameNode


15548 NameNode

# jstack -l 15548 > jstack_15548_NameNode.output

Debug Time issues


Set the debug flags from DEBUG to ALL. Do not change the other flags.
Debugging HDFS NameNode using failover
If you cannot stop NameNode in a running cluster and you are not able to dynamically change the
debug setting for the NameNode in the cluster due to the security settings, then change the xml file
and manually perform the NameNode failover.
For example, the security settings were set:

hadoop.http.authentication.simple.anonymous.allowed=true
hadoop.http.authentication.type=simple

Note: In an HDP with Mpack environment, manual edits to the xml files are lost after HDFS restarts
because Ambari saves the configuration in its database.
For example, during Ranger debugging:
• Checked if user group is in dfs.permissions.superusergroup
• Checked xasecure.add-hadoop-authorization = true
• Checked dfs.namenode.inode.attributes.provider.class =
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer and dfs.permissions.enabled
= true
• GPFS directory is set to 700
• Not able to dynamically set the debug command:

hadoop daemonlog -setlevel [active namenode]:50070


org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer DEBUG

Edit /var/mmfs/hadoop/etc/hadoop/log4j.properties to add in the DEBUG flag on the


standby NameNode.

log4j.logger.org.apache.hadoop.security.UserGroupInformation=DEBUG
log4j.logger.org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer=DEBUG

#On the standby NameNode, stop

/usr/lpp/mmfs/hadoop/bin/hdfs --config /var/mmfs/hadoop/etc/hadoop --daemon stop namenode

#Check is stopped

jps

#Start standby NameNode to pick up the debug changes

/usr/lpp/mmfs/hadoop/bin/hdfs --config /var/mmfs/hadoop/etc/hadoop --daemon start namenode

#Failover primary NameNode to standby

hdfs haadmin -failover nn2 nn1

#Recreate issue and look into the NameNode log

2020-02-17 07:47:06,331 ERROR util.RangerRESTClient ….

2. A lot of "topN size for comm" in NameNode logs


The NameNode log contains a lot of topN size entries as shown below:
2016-11-20 10:02:27,259 INFO window.RollingWindowManager
(RollingWindowManager.java:getTopUsersForMetric(247)) - topN size for command rename is: 0

2016-11-20 10:02:27,259 INFO window.RollingWindowManager


(RollingWindowManager.java:getTopUsersForMetric(247)) - topN size for command mkdirs is: 1

......

Solution:
In the log4j.properties file, set the following field:

log4j.logger.org.apache.hadoop.hdfs.server.namenode.top.window=WARN

3. webhdfs is not working


Solution:
On the node running the NameNode service, check whether the port (50070 by default) is up. If it is
up, check if the dfs.webhdfs.enabled is set to true in your configuration. If not, configure it to true
in your hadoop configuration and sync it to the HDFS transparency nodes.
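For example, a quick check on the NameNode host might look like the following (50070 is the default port; adjust it if your configuration differs):

# Check whether the NameNode HTTP port is listening
ss -lntp | grep 50070
# Check the effective value of dfs.webhdfs.enabled
/usr/lpp/mmfs/hadoop/bin/hdfs getconf -confKey dfs.webhdfs.enabled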
4. Could not find or load main class org.apache.hadoop.gpfs.tools.GetConf
Solution:
Check all the nodes to see whether any of the bash variables
(HADOOP_HOME, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME,
HADOOP_COMMON_LIB_NATIVE_DIR, HADOOP_CONF_DIR, HADOOP_SECURITY_CONF_DIR) were
exported. If it was exported, unexport or change it to another name. Setting these variables will
result in the failure of some of the HDFS Transparency commands.
5. org.apache.hadoop.hdfs.server.namenode.GPFSRunTimeException:
java.io.IOException: Invalid argument: anonymous
If hive.server2.authentication is configured as LDAP or Kerberos, the anonymous user
is not used by Hive. However, the default setting for hive.server2.authentication is NONE.
Therefore, no authentication is done for Hive's requests to the HiveServer2 (metadata), which means
that all the requests are done as the anonymous user.
This exception is only seen when you run HIVE (HIVE supports anonymous authentication):
2016-08-09 22:53:07,782 WARN ipc.Server (Server.java:run(2068)) -
IPC Server handler 143 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs
from 192.0.2.10:37501 Call#19 Retry#0

org.apache.hadoop.hdfs.server.namenode.GPFSRunTimeException: java.io.IOException: Invalid argument: anonymous

at org.apache.hadoop.hdfs.server.namenode.SerialNumberManager.getUserSerialNumber(SerialNumberManager.java:59)

at org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields$PermissionStatusFormat.toLong(INodeWithAdditionalFields.java:64)

at org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.<init>(INodeWithAdditionalFields.java:116)

at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.<init>(INodeDirectory.java:77)

at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.unprotectedMkdir(FSDirMkdirOp.java:234)

at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.createSingleDirectory(FSDirMkdirOp.java:191)

at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.createChildrenDirectories(FSDirMkdirOp.java:166)

at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:97)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3928)

at org.apache.hadoop.hdfs.server.namenode.GPFSNamesystem.mkdirs(GPFSNamesystem.java:1254)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:993)

at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)

at

org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

Caused by: java.io.IOException: Invalid argument: anonymous

at org.apache.hadoop.hdfs.server.namenode.GPFSNative.getUid(Native Method)

at org.apache.hadoop.hdfs.server.namenode.SerialNumberManager.getUserSerialNumber(SerialNumberManager.java:51)

... 20 more

Solutions:
There are two solutions to fix this issue:
a. Create the user and group "anonymous" with the same uid/gid on all the nodes, as shown in the sketch after these options.
b. Configure HIVE hive.server2.authentication as LDAP or Kerberos assuming cluster is already
LDAP/Kerberos enabled.
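A minimal sketch of option (a), assuming that UID and GID 3000 are unused in your environment (pick free values of your own):

# Create the anonymous group and user with the same gid/uid on every node
mmdsh -N all "groupadd -g 3000 anonymous"
mmdsh -N all "useradd -u 3000 -g anonymous anonymous"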
6. javax.security.sasl.SaslException when Kerberos is enabled

'''

16/09/21 01:40:30 WARN ipc.Client: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid
credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Operation failed: Failed on local exception: java.io.IOException:


javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level:
Failed to find any Kerberos tgt)]; Host Details : local host is: "c8f2n13.gpfs.net/
192.0.2.11";
destination host is: "c8f2n13.gpfs.net":8020;

'''

'''

16/09/21 01:42:20 WARN ipc.Client: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException:
No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

'''

Solution:
For IBM BigInsights IOP 4.0/4.1/4.2, run the kinit -kt /etc/security/keytabs/
nn.service.keytab nn/[email protected] command.
Note: Replace the hostname c8f2n13.gpfs.net with the node on which you will run the kinit
command.
For other Hadoop distributions, check the value of the dfs.namenode.kerberos.principal
configuration.
7. NameNode failed to start because the null pointer was encountered when SSL was configured with
HDFS Transparency version 2.7.2-0.
The NameNode will fail to start when SSL is configured because a null pointer exception is
encountered due to missing ssl files for HDFS Transparency version 2.7.2-0.
Hadoop NameNode log:
STARTUP_MSG: Starting NameNode
.....
2016-11-29 18:51:45,572 INFO namenode.NameNode (LogAdapter.java:info(47)) -
registered UNIX signal handlers for [TERM, HUP, INT]

2016-11-29 18:51:45,575 INFO namenode.NameNode (NameNode.java:createNameNode(1438)) -


createNameNode []

...

2016-11-29 18:51:46,417 INFO http.HttpServer2 (NameNodeHttpServer.java:initWebHdfs(86)) -


Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)

2016-11-29 18:51:46,418 INFO http.HttpServer2 (HttpServer2.java:addJerseyResourcePackage(609)) -


addJerseyResourcePackage:
packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources,
pathSpec=/webhdfs/v1/*

2016-11-29 18:51:46,434 INFO http.HttpServer2 (HttpServer2.java:openListeners(915)) -


Jetty bound to port 50070

2016-11-29 18:51:46,462 WARN mortbay.log (Slf4jLog.java:warn(76)) -


java.lang.NullPointerException

2016-11-29 18:51:46,462 INFO http.HttpServer2 (HttpServer2.java:start(859)) -


HttpServer.start() threw a non Bind IOException

java.io.IOException: !JsseListener: java.lang.NullPointerException

at org.mortbay.jetty.security.SslSocketConnector.newServerSocket(SslSocketConnector.java:516)

at org.apache.hadoop.security.ssl.SslSocketConnectorSecure.newServerSocket(SslSocketConnectorSecure.java:46)

at org.mortbay.jetty.bio.SocketConnector.open(SocketConnector.java:73)

at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:914)

at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:856)

at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)

at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:773)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:647)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:832)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:816)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1509)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1575)

2016-11-29 18:51:46,465 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) -


Stopping NameNode metrics system...

2016-11-29 18:51:46,466 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) -


timeline thread interrupted.

2016-11-29 18:51:46,466 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) -


NameNode metrics system stopped.

2016-11-29 18:51:46,466 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)) -


NameNode metrics system shutdown complete.

2016-11-29 18:51:46,466 ERROR namenode.NameNode (NameNode.java:main(1580)) - Failed to start namenode.

java.io.IOException: !JsseListener: java.lang.NullPointerException

at org.mortbay.jetty.security.SslSocketConnector.newServerSocket(SslSocketConnector.java:516)

at org.apache.hadoop.security.ssl.SslSocketConnectorSecure.newServerSocket(SslSocketConnectorSecure.java:46)

at org.mortbay.jetty.bio.SocketConnector.open(SocketConnector.java:73)

at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:914)

at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:856)

at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)

at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:773)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:647)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:832)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:816)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1509)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1575)

2016-11-29 18:51:46,468 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1

2016-11-29 18:51:46,469 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at <NameNodeHost>/<IPADDRESS>

************************************************************/

Solution:
The workaround solution for HDFS Transparency version 2.7.2-0 with SSL configured is to copy
the /etc/hadoop/conf/ssl-client.xml and the /etc/hadoop/conf/ssl-server.xml files
into /usr/lpp/mmfs/hadoop/etc/hadoop on all the nodes.

8. org.apache.hadoop.ipc.RemoteException(java.io.IOException):
blocksize(xxxxx) should be an integral mutiple of dataBlockSize(yyyyy)
With HDFS Transparency and IBM Storage Scale, the file system block size can only be 64KB, 128KB,
256KB, 512KB, 1MB, 2MB, 4MB, 8MB, or 16MB. The dataBlockSize(yyyyy) is therefore the
block size of your IBM Storage Scale file system. If your Hadoop workload requests a block size that is
not an integral multiple of dataBlockSize, you will see this issue. A typical example is Accumulo:
Accumulo TServer failed to start with the below error.
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): blocksize(1181115904) should be an integral mutiple of
dataBlockSize(1048576)

at org.apache.hadoop.hdfs.server.namenode.GPFSDetails.verifyBlockSize(GPFSDetails.java:230)

at org.apache.hadoop.hdfs.server.namenode.GPFSNamesystem.startFile(GPFSNamesystem.java:254)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:632)

at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)

at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

Solution:
For Accumulo, add the configuration in Accumulo > Configs > Custom accumulo-site, set
tserver.wal.blocksize = <value-of-your-gpfs-filesystem> (the data block size of your IBM Storage Scale file system), and then restart the service.
9. Got exception: java.io.IOException No FileSystem for scheme: hdfs
Running a service failed.
The service log shows Got exception: java.io.IOException No FileSystem for
scheme: hdfs message.
The issue might occur because of a Maven-assembly issue (refer to hadoop No FileSystem for
scheme: file) with duplicated file system in the classpath.
Solution:
If this exception is seen, add fs.hdfs.impl = org.apache.hadoop.hdfs.DistributedFileSystem into
the core-site.xml to resolve the issue.
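For example, on CES HDFS (HDFS Transparency 3.1.1 and later) one way to add the property is with the mmhdfs config syntax shown later in this topic; this is a sketch only, assuming mmhdfs config set accepts core-site.xml in the same way as hdfs-site.xml. On earlier releases, add the same property directly to core-site.xml and synchronize it with mmhadoopctl connector syncconf.

mmhdfs config set core-site.xml -k fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
# Restart HDFS Transparency so that the clients pick up the new setting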
10. java.io.IOException: Couldn't clear parent znode /hadoop-ha/<HA-cluster-ID> when NameNode HA
is enabled.
This exception can be seen when you start the HDFS Transparency NameNode through the Ambari
GUI or from the command line which indicates that there are some files or directories under /
hadoop-ha/<HA-cluster-ID> that are preventing the NameNode from starting up.
Solution:
Manually remove the directory by running <zookeeper-home-dir>/bin/zkCli.sh -server
<one-zookeeper-server-hostname>:<network-port-number> rmr /hadoop-ha from any
one Hadoop node. For example, the command will be /usr/iop/4.2.0.0/zookeeper/bin/
zkCli.sh -server c16f1n02:2181 rmr /hadoop-ha for IOP 4.2 with hostname c16f1n02
and default zookeeper network port number 2181.
11. datanode.DataNode (BPServiceActor.java:run(840)) - ClusterIds are not matched, existing.
If a DataNode did not come up and the DataNode's log (located under /usr/lpp/mmfs/hadoop/
logs) contains the above error message, then the HDFS Transparency DataNode might have been at
one time used for a different HDFS Transparency cluster.
Solution:

Run the /usr/lpp/mmfs/hadoop/sbin/initmap.sh <your-file-system> diskmap
nodemap clusterinfo command on the node and try to start it again. The command will update
the following three files and will place the node into the current HDFS Transparency cluster:
• /var/mmfs/etc/clusterinfo4hdfs
• /var/mmfs/etc/diskid2hostname
• /var/mmfs/etc/nodeid2hostname
12. All blocks to be queried must be in the same block pool
When you run workloads (such as Impala or IBM BigSQL), you might hit the following exception:

All blocks to be queried must be in the same block pool:


BP-gpfs_data,/gpfs_data/CDH5/user/root/POC2/2014-07.txt:blk_318148_0 and
LocatedBlock{BP-gpfs_data,/gpfs_data/CDH5/user/root/POC2/2014-09.txt:blk_318150_0;
getBlockSize()=134217728; corrupt=false; offset=0;
locs=[DatanodeInfoWithStorage[192.0.2.12:50010,,DISK]]} are from different pools.

Solution:
Change the dfs.datanode.hdfs-blocks-metadata.enabled field to false in hdfs-site.xml and
restart HDFS Transparency and Impala/BigSQL.
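For HDFS Transparency 3.1.0-x and earlier, a sketch of applying this change from the command line (after editing hdfs-site.xml under the HDFS Transparency configuration directory for your release to set dfs.datanode.hdfs-blocks-metadata.enabled to false) might be:

# <GPFS_CONFIG_PATH> is the HDFS Transparency configuration directory for your release
/usr/lpp/mmfs/bin/mmhadoopctl connector syncconf <GPFS_CONFIG_PATH>
/usr/lpp/mmfs/bin/mmhadoopctl connector stop
/usr/lpp/mmfs/bin/mmhadoopctl connector start
# Then restart Impala or BigSQL so that the clients reread the value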
13. Exception: Failed to load the configurations of Core GPFS (/<gpfs mount point>)
When you start the HDFS Transparency NameNode or DataNode, the daemon is not up and you get
the following exceptions:

Exception: Failed to load the configurations of Core GPFS (/<gpfs mount point>) :
org.apache.hadoop.hdfs.server.namenode.GPFSDetails.refreshCoreGPFSConfig()

Solution:
Check the /var/mmfs/etc/clusterinfo4hdfs, /var/mmfs/etc/diskid2hostname
and /var/mmfs/etc/nodeid2hostname on all the Transparency nodes. If they are of 0
size, run /usr/lpp/mmfs/hadoop/sbin/initmap.sh <your-file-system-name> diskmap
nodemap clusterinfo on all the Transparency nodes.
14. UID/GID failed with illegal value “Illagal value: USER = xxxxx > MAX = 8388607"
Solution:
If you have installed Ranger and need to leverage Ranger capabilities, you need to make the UID/GID
less than 8388607.
If you do not need Ranger, then follow these steps to disable Ranger from HDFS Transparency:
a. Stop HDFS Transparency

mmhadoopctl connector stop -N all

b. On the NameNode, set the gpfs.ranger.enabled = false in /usr/lpp/mmfs/hadoop/etc/


hadoop/gpfs-site.xml

<property>
<name>gpfs.ranger.enabled</name>
<value>false</value>
</property>

c. Sync the HDFS Transparency configuration

mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop

d. Start HDFS Transparency

mmhadoopctl connector start -N all

Note: For HDFS Transparency 2.7.3-3 and later, when Ranger is enabled, uid greater than
8388607 is supported.

15. Issues that can be encountered if the IBM Storage Scale ACL is not set properly
a. hadoop fs -ls command failed with Operation not supported message.
Even when the Hadoop command failed, the POSIX command can be executed successfully.

# hadoop fs -ls /

ls: Operation not supported: /bigpfs/mapred


b. hdfs dfs -mkdir and hdfs dfs -ls commands are not working after installation.
The hdfs dfs -mkdir command fails with NullPointerExecption and the hdfs dfs
-ls / command fails with No such file or directory message.

# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -mkdir -p /user

mkdir: java.lang.NullPointerException

# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /

ls: `/': No such file or directory


Solution:
These issues occur when the ACL mode on the IBM Storage Scale file system is set to nfs4
instead of all.
If IBM Storage Scale CES (SMB and NFS) is used, ACL semantics are required to be set to nfsv4.
Note: When HDFS is used do not set the ACL semantics to nfsv4 because HDFS and IBM Storage
Scale HDFS Transparency support only POSIX ACLs. Therefore, the -k option can be set to value
posix or all.
To display the type of authorization currently set on the file system, issue the following command:

# /usr/lpp/mmfs/bin/mmlsfs bdafs -k
flag value description
------------------- ------------------------ -----------------------------------
-k nfs4 ACL semantics in effect

If the value is nfs4, change it to all.


To change from nfs4 to all, issue the following command:

# /usr/lpp/mmfs/bin/mmchfs bdafs -k all

Verify the change by running mmlsfs again:

# /usr/lpp/mmfs/bin/mmlsfs bdafs -k
flag value description
------------------- ------------------------ -----------------------------------
-k all ACL semantics in effect

Note: -k all means that any supported ACL type is permitted. This includes traditional GPFS
(posix) and NFS V4 and Windows ACLs (nfsv4).
When you are in a HDFS Transparency environment, ensure that the -k value is set to all.
Restart HDFS Transparency after ACL is set to ALL.
16. User permission denied when Ranger is disabled
If Kerberos is enabled and Ranger flag is disabled, then the user might get permission denied errors
when accessing the file system.
Solution:
Check the Kerberos principal mapping hadoop.security.auth_to_local field in /usr/lpp/
mmfs/hadoop/etc/hadoop/core-site.xml or in Ambari under HDFS Config to ensure that the
NameNode and DataNode are mapped to root instead of hdfs.
For example, change
From:

RULE:[2:$1@$0]([email protected])s/.*/hdfs/
RULE:[2:$1@$0]([email protected])s/.*/hdfs/

To:

RULE:[2:$1@$0]([email protected])s/.*/root/
RULE:[2:$1@$0]([email protected])s/.*/root/

Then restart the HDFS service in Ambari or HDFS Transparency by executing the following
commands:

/usr/lpp/mmfs/bin/mmhadoopctl connector stop

/usr/lpp/mmfs/bin/mmhadoopctl connector start

17. IBM Storage Scale NSD is not able to be recovered in FPO.


In an FPO environment, the IBM Storage Scale NSD is not able to be recovered after the node got
expelled or the NSD went down and the auto recovery failed.
Note: This can also occur while performing a STOP ALL/Start ALL from Ambari.
If you see the Attention: Due to an earlier configuration change the file system
is no longer properly replicated message on executing the mmlsdisk <fs> command,
mmrestripefs -R command is needed to re-replicate the files after all the disks are brought back.
IBM Storage Scale will also do a special check of file system recovery log file. To guarantee the
consistency of the file system, if the number of failure groups with down disk is greater than or equal
to the current log file replication, IBM Storage Scale will also panic the file system.
Solution:
a. Run mmrestripefs <fs> -R to restore all the files (including the file system recovery log files) to their
designated degree of replication, because they were not properly replicated when there were too many
down disks. This also helps to avoid a recurrence of the file system panic on a single metadata
disk failure.
b. After the command completes its execution, run the following command to double check the log
file's replica restored to the expected value 3. (FPO replica is set to 3 usually)

# mmdsh -N all mmfsadm dump log | egrep "Dump of LogFile|nReplicas"

Dump of LogFile at 0x3FFEE900A990 stripe group mygpfs state 'open for append' type 'striped':
nReplicas 3 logGroupInconsistent 0 whichPartial 0 whichDesc 0
Note:
• The example shows that the nReplicas is set to 3.
• Upgrade to GPFS 5.0.1.2 or later. In the newer release, GPFS does not trigger mmrestripefs -r
during the auto recovery if there are not enough failure groups (FGs) to satisfy the needed replica,
which helps to avoid the issue described here. A side effect is that GPFS does not attempt the
auto recovery (including the attempts to start disks) if not enough FGs are left. Therefore, if there
are disks down after many NSD servers restart, you might need to manually run mmchdisk start
to start the disks when the NSD servers are back.
• Do not execute mmrestripefs -r manually if there are not enough FGs left to satisfy the desired
replica. If you have already executed it, then run mmrestripefs -R after the down disks are
brought back up.
To avoid this issue, follow the FPO Maintenance procedure.
If you are using Ambari, do not perform a STOP ALL, START ALL or shutdown/restart of the IBM
Storage Scale service from the Ambari GUI.
See the following topics in the IBM Storage Scale KC under the File Placement Optimizer subsection
under the Administering section, to perform the maintenance of an FPO cluster. Temporarily disable
auto recovery before the planned maintenance. In particular, do not let the disks fail and the automatic
recovery get initiated.
• Restarting a large IBM Storage Scale cluster
• Upgrade FPO
• Rolling upgrade
• Handling disk failures
• Handling node failures
18. DataNode reports exceptions after Kerberos is enabled on RHEL7.5
If your Kerberos KDC or your Kerberos client are on RHEL7.5 at version 1.15.1-18 (default version
shipped in RHEL7.5), you might hit the following exceptions when you start the DataNodes:

2018-08-06 14:10:56,812 WARN ipc.Client (Client.java:run(711)) -


Couldn't setup connection for dn/[email protected] to 139.bd/**.**.**.**:8020
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException:
No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)

Solution:
Upgrade the Kerberos KDC (krb5-libs and krb5-server) from version 1.15.1-18 to version 1.15.1-19
or upgrade the Kerberos client (krb5-workstation and krb5-libs) from version 1.15.1-18 to version
1.15.1-19.
19. NameNode fails to start - Cannot find
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer
With the Ranger enabled, the NameNode fails to start and HDFS Transparency shows the cannot
find org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer error.
Solution:
The hadoop-env.sh file did not have the proper Ranger setup. See “Apache Ranger” on page 138
on how to add the Apache Ranger files properly within the hadoop-env.sh file as well as ensure that
the 4 ranger-*.xml files are copied to the proper HDFS Transparency directory.
If you are using Ambari, restart the HDFS service. Otherwise, restart HDFS Transparency by executing
the /usr/lpp/mmfs/bin/mmhadoopctl connector stop/start command.
For example:
The /usr/hdp/current/ranger-hdfs-plugin path was not used and the /usr/hdp/<your-
HDP-version>/ranger-hdfs-plugin path was used. Therefore, set up the hadoop-env.sh file to
use the path in your environment.
20. Read from HDFS interface hangs (for example, hadoop dfs -get, hadoop dfs -copyToLocal)
When you are reading data from the HDFS interface, the read command will hang. If you are using
Ambari, the service checks from the Ambari GUI will be timed out.
Debug output from the command shows timeout after 60s:

DEBUG hdfs.DFSClient: Connecting to datanode x.x.x.x:50010


DEBUG sasl.SaslDataTransferClient: SASL encryption trust check:
localHostTrusted = false, remoteHostTrusted = false
DEBUG sasl.SaslDataTransferClient: SASL client skipping handshake in
unsecured configuration for addr = /x.x.x.x, datanodeId = DatanodeInfoWithStorage[xxxxx]
WARN hdfs.DFSClient: Exception while reading from .....

java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel


to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:37058
remote=/x.x.x.x:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)

However, when the HDFS Transparency debug is set, no exceptions are seen in the logs. Increasing
the timeout value of the HDFS client socket does not resolve the issue.
Note: Writing in the HDFS interface works. The issue is only while reading from the HDFS interface.
Solution:
Configure dfs.datanode.transferTo.allowed as false in the hdfs-site.xml and restart
HDFS Transparency. When dfs.datanode.transferTo.allowed is true by default, some
transfers on the socket might hang on some platforms (OS/JVM).
If you are using Ambari, ensure that you have set dfs.datanode.transferTo.allowed to false
in the HDFS service configuration. If the field does not exist, add a new field and restart the HDFS
service.
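If you manage the configuration from the command line instead, a minimal CES HDFS sketch (reusing the mmhdfs config syntax shown later in this topic) would be:

mmhdfs config set hdfs-site.xml -k dfs.datanode.transferTo.allowed=false
# Restart HDFS Transparency so that the DataNodes pick up the change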
21. Hive LLAP queries failed
By default, LLAP uses inode paths to access data from the file system because of the
hive.llap.io.use.fileid.path=true setting, which does not work on the IBM Storage Scale
file system.
Solution:
Configure hive.llap.io.use.fileid.path=false to have LLAP access the file from file path
instead of inode number.
If you are using Ambari, then go to Ambari GUI > Hive > Configs - Advanced Tab.
a. Search for hive.llap.io.use.fileid.path field.
b. If you find it and it is not set to false, change it to false. If you do not find it, add
hive.llap.io.use.fileid.path=false under Custom hive-site Add Property.
c. Save configuration and restart Hive.
22. DataNode failed with error Failed to refresh configuration
The DataNode will not be able to come up if the initmap.sh internal generated configuration files
are stale or incorrect.
The following error can be seen:
ERROR datanode.DataNode (DataNode.java:secureMain(2922)) - Exception in secureMain
java.io.IOException: Failed to refresh configurations
at org.apache.hadoop.hdfs.server.namenode.GPFSDetails.refreshCoreGPFSConfig(GPFSDetails.java:613)
at org.apache.hadoop.hdfs.server.namenode.GPFSDetails.init(GPFSDetails.java:175)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2821)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2713)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2766)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2915)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2939)
Caused by: java.io.IOException: /var/mmfs/etc/hadoop/clusterinfo4hdfs.fpo is outdated because initmap.sh
failed probably.
at org.apache.hadoop.hdfs.server.namenode.GPFSFs.collectClusterInfo(GPFSFs.java:414)
at org.apache.hadoop.hdfs.server.namenode.GPFSFs.collectInfo(GPFSFs.java:551)
at org.apache.hadoop.hdfs.server.namenode.GPFSDetails.refreshCoreGPFSConfig(GPFSDetails.java:608)
... 7 more

Solution:
The DataNode internal configuration files need to be regenerated. If possible, restart the NameNode,
or restart the Standby NameNode if HA is configured, or touch those internal configuration files and
then restart the DataNode from the Ambari GUI or from the command line.
For more information, see “Cluster and file system information configuration” on page 62.
23. Viewfs hadoop dfs -copyFromLocal -l fails
With ViewFs configuration, running the copyFromLocal -l will generate failure.
$ hadoop fs -copyFromLocal -f -l /etc/hosts
/TestDir/Test_copyFromLocal.1545110085.28040/copyFromLocal -copyFromLocal: Fatal internal error

org.apache.hadoop.fs.viewfs.NotInMountpointException: getDefaultBlockSize on empty path is invalid


at org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultBlockSize(ViewFileSystem.java:695)
at org.apache.hadoop.fs.FilterFileSystem.getDefaultBlockSize(FilterFileSystem.java:420)

at
org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.create(CommandWithDestination.java:505
)
at
org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestinati
on.java:484)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
at org.apache.hadoop.fs.shell.CopyCommands$CopyFromLocal.copyFile(CopyCommands.java:357)
at org.apache.hadoop.fs.shell.CopyCommands$CopyFromLocal.copyFileToTarget(CopyCommands.java:365)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
at org.apache.hadoop.fs.shell.CopyCommands$Put.processArguments(CopyCommands.java:295)
at org.apache.hadoop.fs.shell.CopyCommands$CopyFromLocal.processArguments(CopyCommands.java:385)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)

Solution:
HDP SPEC-56 has a fix that IBM can distribute to customers. The fix is to replace the jar files in the
environment.
Contact IBM service if you want the fix for SPEC-56/RTC Defect 20924.
24. numNode=0 due to time synchronization issue
When numNode is 0, the number of DataNode available for use is 0.
The NameNode Log will show the following error:
Caused by: org.apache.hadoop.hdfs.server.namenode.GPFSRunTimeException:
110259f803894787b8d364019aa4106f fileid=259093 block=blk_259093_0, numNodes=0
at
org.apache.hadoop.hdfs.server.blockmanagement.GPFSBlockManager.getStorages(GPFSBlockManager.java:102)
at
org.apache.hadoop.hdfs.server.blockmanagement.GPFSBlockManager.createLocatedBlock(GPFSBlockManager.java:2
20)

Solution:
Check whether the times on the Scale cluster nodes are in sync.
Run: mmdsh -N all "date +%m%d:%H%M%S-%s"
If they are not, synchronize the time on all nodes.
25. How to enable audit logging in HDFS Transparency
Solution:
a. Manually enable the audit log by editing the hadoop-env.sh file.
• Edit the /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh file to include the following
entry:

export HDFS_AUDIT_LOGGER=INFO,RFAAUDIT

• Synchronize the configuration files by using the mmhadoopctl command (for HDFS
Transparency 3.1.0-x and earlier) or upload the configuration files into CCR by using the mmhdfs
command (for CES HDFS Transparency 3.1.1-x).
• Restart HDFS Transparency.
• Verify looking into the ${hadoop.log.dir}/hdfs-audit.log file for audit information such
as:

INFO FSNamesystem.audit: allowed=true


cmd=getfileinfo src=

cmd=listStatus src
(auth:KERBEROS)

b. Manually enable the audit log by editing the log4j.properties file.


• Edit the /var/mmfs/hadoop/etc/hadoop/log4j.properties file to change the following
entry:
From

hdfs.audit.logger=INFO,NullAppender

To

hdfs.audit.logger=INFO,RFAAUDIT

• Synchronize the configuration files by using the mmhadoopctl command (for HDFS
Transparency 3.1.0-x and earlier) or upload the configuration files into CCR by using the mmhdfs
command (for CES HDFS Transparency 3.1.1-x).
• Restart HDFS Transparency.
• Verify by looking into the ${hadoop.log.dir}/hdfs-audit.log file for the audit
information such as:

INFO FSNamesystem.audit: allowed=true


cmd=getfileinfo src=
cmd=listStatus src
(auth:KERBEROS)

c. If you are using Ambari, go to HDFS service > Configs > Advanced tab > Advanced hdfs-
log4j, set hdfs.audit.logger=INFO,RFAAUDIT, save the configuration, and restart the HDFS
service.
Advanced hdfs-log4j

#
# hdfs audit logging
#
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=$
{hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false

26. Ulimit issues


a. Java exception: All datanodes DatanodeInfoWithStorage are bad error
Solution:
If you see similar exception errors of java.io.IOException: All datanodes
DatanodeInfoWithStorage are bad when running the map/reduce jobs, you must increase
your ulimit -n and ulimit -u to 64K.
15/12/30 07:09:16 INFO mapreduce.Job: Task Id : attempt_1450797405784_0281_m_000005_0, Status : FAILED

c902f10x09: Error: java.io.IOException: All datanodes DatanodeInfoWithStorage


[192.0.2.13:50010,DS-1ca06221-194c-47d2-82d0-8b602a64921b,DISK] are bad. Aborting...

c902f10x09: at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)

c902f10x09: at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)

c902f10x09: at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)

b. Job getting IOException BlockMissingException could not obtain block error and cannot read the
file using HDFS until HDFS Transparency restarts. The file can be read using POSIX.
Error:

org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
...

Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block
...

Solution
Check the DataNode log for errors:

If you see an IOException "don't match block" error, search the previous messages for that
block (blk_<value>). If you also see "FileNotFoundException" and "Too many open files" errors
earlier, the DataNode does not have a high enough open files limit.

2020-03-17 12:22:46,638 INFO datanode.DataNode (DataXceiver.java:writeBlock(1023))


- opWriteBlock BP-XXXXXXXX:blk_5079204_0 received exception
java.io.FileNotFoundException: /dev/null (Too many open files)

2020-03-17 13:06:50,384 ERROR datanode.DataNode (DataXceiver.java:run(326)) -


<HOST>.com:1019:DataXceiver error processing READ_BLOCK operation src: /<IP>:49771
dst: /<IP>:1019
java.io.IOException: Offset 0 and length 60986628 don't match block BP-
XXXXXXXX:blk_5079204_0 ( blockLen 0 )
at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:429)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:684)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:163)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:111)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:293)
at java.lang.Thread.run(Thread.java:748)

The issue occurs when the open file limit is set too low. The FileNotFoundException
is thrown and the internal data structure is left in an uninitialized state. The block length
is not initialized to the correct value and remains 0. Therefore, when the block is read later, the
sanity check fails in HDFS Transparency.
To fix this issue, increase the open file limit.
To increase the open file limit, see “OS tuning for all nodes in HDFS Transparency” on page 55.
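As an illustrative sketch only (the exact values and method depend on your OS and on the referenced tuning topic), you can check the current limits and raise them persistently, for example:

# Check the current limits for the user that runs the HDFS Transparency daemons
ulimit -n
ulimit -u
# One common way to raise them is an entry in /etc/security/limits.conf, for example:
#   root  -  nofile  65536
#   root  -  nproc   65536
# Restart the daemons (or log in again) so that the new limits take effect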
27. HDFS Transparency fails to start if the Java version is upgraded.
When you are starting HDFS Transparency in CES HDFS or non-CES HDFS you get the following error:
ERROR: JAVA_HOME /usr/lib/jvm/java-1.8.0-
openjdk-1.8.0.242.b08-0.el7_7.x86_64 does not exist.
Solution
If the Java version was upgraded directly, or a kernel patch or OS upgrade updated the Java version,
the Java path changes to the path of the updated version. Therefore, the JAVA_HOME setting in the user
profile (.bashrc) and the JAVA_HOME in /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh need
to be updated to reflect the updated Java directory.
The updated Java directory can be seen under /etc/alternatives/java.

# ls -l /etc/alternatives/jre
lrwxrwxrwx 1 root root 62 Jan 21 10:05 /etc/alternatives/jre -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre

This path has to be set in /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh and then the user


profile (.bashrc):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre

Instead of using the versioned path to the JRE, you can also use the version agnostic symbolic link
that is automatically updated with every upgrade:

export JAVA_HOME=/etc/alternatives/jre

28. In the NameNode log, when you are appending a file, you get NullPointerException on the file
operations. The exception stack trace includes the following:

java.lang.NullPointerException
at

org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget()
at
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget()
at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode()
at org.apache.hadoop.hdfs.server.namenode.GPFSNamesystemV0.getAdditionalDatanode()
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode()
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditional
Datanode()
(...)

Solution
The problem occurs when there are fewer DataNodes available than the value of the specified
replication factor (dfs.replication). Add more DataNodes to the cluster or reduce the value of
the dfs.replication parameter to resolve the error.
To get the existing replication value, run the following command:

mmhdfs config get hdfs-site.xml -k dfs.replication

The output for the above command is displayed as follows:

dfs.replication=3

To reduce the replication value, run the following command:

mmhdfs config set hdfs-site.xml -k dfs.replication=2

29. Write operations (put/copyFromLocal) fail with the following error message:
File [FILENAME] could only be written to 0 of the X minReplication nodes.
There are Y datanode(s) running and no node(s) are excluded in this
operation.
Solution
The problem occurs when the dfs.datanode.du.reserved value is set too large. Ensure that it is
set to the recommended value or reduce the value to resolve the error.
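For example, on CES HDFS you can inspect and adjust the value with the mmhdfs config commands (the 10737418240 value shown here, 10 GB, is only an illustration; choose a value appropriate for your disks):

mmhdfs config get hdfs-site.xml -k dfs.datanode.du.reserved
mmhdfs config set hdfs-site.xml -k dfs.datanode.du.reserved=10737418240
# Restart HDFS Transparency so that the DataNodes apply the new reservation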
30. NameNode fails to start in HDFS Transparency 3.1.1-11, 3.1.1-12, 3.2.2-2 or 3.2.2-3 with the
following stack trace:

2023-01-25 17:58:22,926 ERROR namenode.NameNode (NameNode.java:main(1828)) - Failed to


start namenode.
java.lang.NoClassDefFoundError: org/codehaus/jackson/Versioned
...
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.Versioned

Solution
Add jackson-core-asl-1.9.13.jar from the Maven repository, or from HDFS Transparency
version 3.1.1-10 or 3.2.2-1, to the following paths on every HDFS node in the cluster.
a. /usr/lpp/mmfs/hadoop/share/hadoop/hdfs/lib/
b. /usr/lpp/mmfs/hadoop/share/hadoop/common/lib/
The jackson-core-asl-1.9.13.jar has an MD5 checksum of
319c49a4304e3fa9fe3cd8dcfc009d37.
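A sketch of distributing the jar, assuming that you downloaded jackson-core-asl-1.9.13.jar to /tmp on an administration node and that the node names used in the loop are placeholders for your own HDFS nodes:

# Verify the checksum of the downloaded jar (expect 319c49a4304e3fa9fe3cd8dcfc009d37)
md5sum /tmp/jackson-core-asl-1.9.13.jar
# Copy the jar to both library directories on every HDFS node
for node in nn1 nn2 dn1 dn2; do
  scp /tmp/jackson-core-asl-1.9.13.jar ${node}:/usr/lpp/mmfs/hadoop/share/hadoop/hdfs/lib/
  scp /tmp/jackson-core-asl-1.9.13.jar ${node}:/usr/lpp/mmfs/hadoop/share/hadoop/common/lib/
done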
31. To list all the HDFS encryption zones in an IBM Storage Scale file system using the
GPFS policy engine, use the following steps to create a policy rule that matches the
encryption zone directory structure and prints the names of the directories containing the
system.raw.hdfs.crypto.encryption.zone attribute. These steps are a workaround as the
hdfs crypto -listZones command is not supported in HDFS Transparency.
It is important to note that these steps should be executed on the node where the file system is
locally mounted:

a. Create a policy rule file that matches the encryption zone directory structure (for example,
list_encryption_zones.pol):

RULE 'dirsRule' LIST 'dirs'


DIRECTORIES_PLUS
SHOW(varchar(mode) || ' ' || varchar(XATTR('system.raw.hdfs.crypto.encryption.zone')))
where (XATTR('system.raw.hdfs.crypto.encryption.zone') IS NOT NULL)

In this rule, the system.raw.hdfs.crypto.encryption.zone extended attribute matches


any directory in IBM Storage Scale to identify encryption zones.
b. To run this policy rule, save it to a file (follow step 1) and apply the policy to the file system using
the following command:

#/usr/lpp/mmfs/bin/mmapplypolicy GPFS -P list_encryption_zones.pol -I defer -f /tmp/


encryption_zone

Note: It is important to keep in mind that the GPFS policy engine applies policies to the local
file system on the node where the file system is mounted. To ensure accurate and consistent
policy application, it is recommended to run policies on the node where the file system is locally
mounted.

[root@]# cat list_encryption_zones.pol


RULE 'dirsRule' LIST 'dirs'
DIRECTORIES_PLUS
SHOW(varchar(mode) || ' ' || varchar(XATTR('system.raw.hdfs.crypto.encryption.zone')))
where (XATTR('system.raw.hdfs.crypto.encryption.zone') IS NOT NULL)

[root@]# mmapplypolicy GPFS -P list_encryption_zones.pol -I defer -f /tmp/


encryption_zone
[I] GPFS Current Data Pool Utilization in KB and %
Pool_Name KB_Occupied KB_Total Percent_Occupied
system 393849856 1992294400 19.768657484%
[I] 199168 of 307968 inodes used: 64.671654%.
[W] Attention: In RULE 'dirsRule' LIST name 'dirs' appears but there is no corresponding
"EXTERNAL LIST 'dirs' EXEC ... OPTS ..." rule to specify a program to process the
matching files.
[I] Loaded policy rules from list_encryption_zones.pol.
Evaluating policy rules with CURRENT_TIMESTAMP = 2023-05-04@12:07:36 UTC
Parsed 1 policy rules.
RULE 'dirsRule' LIST 'dirs'
DIRECTORIES_PLUS
SHOW(varchar(mode) || ' ' || varchar(XATTR('system.raw.hdfs.crypto.encryption.zone')))
where (XATTR('system.raw.hdfs.crypto.encryption.zone') IS NOT NULL)
[I] 2023-05-04@12:07:37.298 Directory entries scanned: 195159.
[I] Directories scan: 191829 files, 3294 directories, 36 other objects, 0 'skipped'
files and/or errors.
[I] 2023-05-04@12:07:39.739 Parallel-piped sort and policy evaluation. 195159 files
scanned.
[I] 2023-05-04@12:07:39.772 Piped sorting and candidate file choosing. 2 records scanned.
[I] Summary of Rule Applicability and File Choices:
Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule
0 2 0 2 0 0
RULE 'dirsRule' LIST 'dirs' DIRECTORIES_PLUS SHOW(.) WHERE(.)

[I] Filesystem objects with no applicable rules: 195138.

[I] GPFS Policy Decisions and File Choice Totals:


Chose to list 0KB: 2 of 2 candidates;
Predicted Data Pool Utilization in KB and %:
Pool_Name KB_Occupied KB_Total Percent_Occupied
system 393879552 1992294400 19.770148026%
[I] 2023-05-04@12:07:39.776 Policy execution. 0 files dispatched.
[I] A total of 0 files have been migrated, deleted or processed by an EXTERNAL EXEC/
script;
0 'skipped' files and/or errors.

The list of encryption zone directories is collected in the /tmp/encryption_zone.list.dirs file.

[root@]# cat /tmp/encryption_zone.list.dirs


10041 1253024653 0 drwxr-xr-xkey -- /gpfs/datadir_regr32-01/zone
179202 968077442 0 drwxr-xr-xkey -- /gpfs/datadir_regr32-01/test_zone



32. To address any failures that may occur during the installation or upgrade of HDFS versions 3.1.1-15
or 3.2.2-6, follow these steps before a retry.
a. Execute one of the following cleanup commands to remove the RPM package from all nodes.
To remove gpfs.hdfs-protocol-3.1.1-15.x86_64 (HDFS 3.1.1-15), issue the next
command:

mmdsh -N all "rpm -e gpfs.hdfs-protocol-3.1.1-15.x86_64"

To remove gpfs.hdfs-protocol-3.2.2-6.x86_64 (HDFS 3.2.2-6), issue the next command:

mmdsh -N all "rpm -e gpfs.hdfs-protocol-3.2.2-6.x86_64"

b. Delete the Hadoop directory at the specified location on all nodes by using the next command:

mmdsh -N all "rm -rf /usr/lpp/mmfs/hadoop/share/hadoop"

33. Debug, trace, and logs.


Solution:
To check the state of the CES HDFS cluster, see the mmhealth command documentation in the IBM Storage Scale: Command and Programming Reference Guide.
To determine the status of the CES HDFS NameNodes state, run the following command:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -checkHealth -scale -all

For more information, see the “hdfs haadmin” on page 241 command.
For HDFS Transparency, see “HDFS Transparency protocol troubleshooting” on page 212 for information on how to enable debugging.
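For example, to check the NameNode component health on the CES nodes, you can run:

/usr/lpp/mmfs/bin/mmhealth node show -v HDFS_NAMENODE -N cesNodes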
34. CES HDFS Transparency cluster failed to start.

mmces service enable HDFS


or
mmces service start hdfs -a

Solution:
Note: Run /usr/lpp/mmfs/hadoop/bin/hdfs namenode -initializeSharedEdits, if the
NameNode failed to start with the following exception:

2019-11-22 01:02:01,925 ERROR namenode.FSNamesystem (FSNamesystem.java:<init>(911)) - GPFSNamesystem initialization failed.
java.io.IOException: Invalid configuration: a shared edits dir must not be specified if HA
is not enabled.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:789)
at
org.apache.hadoop.hdfs.server.namenode.GPFSNamesystemBase.<init>(GPFSNamesystemBase.java:49)
at
org.apache.hadoop.hdfs.server.namenode.GPFSNamesystem.<init>(GPFSNamesystem.java:74)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:706)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:968)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:947)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747)

35. MapReduce container job exits with return code 1.


Solution:



If the error Container exited with a non-zero exit code 1. Error file: prelaunch.err occurs while running MapReduce workloads, add the following property to the mapred-site.xml file to resolve the issue:

<property>
<name>mapreduce.application.classpath</name>
<value>/usr/hadoop-3.1.2/share/hadoop/mapreduce/*, /usr/hadoop-3.1.2/share/hadoop/
mapreduce/lib/*</value>
</property>

36. mmhdfs hdfs status shows node is not a DataNode.


The command mmhdfs hdfs status shows the following errors:

c16f1n13.gpfs.net: This node is not a datanode


mmdsh: c16f1n13.gpfs.net remote shell process had return code 1.

Solution:
Remove the localhost entry from the worker configuration.
On the worker node, run:

mmhdfs worker remove localhost

37. All the NameNodes show a standby status after mmhdfs start/stop/restart commands.
Solution:
Use the mmces service command to start and stop the NameNodes so that the proper state is reflected for the NameNodes.
If the mmhdfs start/stop/restart command was executed against the NameNodes, run mmces service stop hdfs followed by mmces service start hdfs to fix the issue.
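For example, to restart the NameNodes on all the CES nodes:

mmces service stop HDFS -a
mmces service start HDFS -a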
38. hdfs dfs -ls or another operation fails with a StandbyException.
Running the hdfs dfs -ls command fails with a StandbyException exception:
[root@scale12 transparency]# /usr/lpp/mmfs/hadoop/bin/hdfs dfs -ls /HDFS
2020-04-06 16:26:25,891 INFO retry.RetryInvocationHandler:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2010)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1447)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3129)
at org.apache.hadoop.hdfs.server.namenode.GPFSNamesystem.getFileInfo(GPFSNamesystem.java:494)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1143)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenode
ProtocolServerSideTranslatorPB.java:939)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingM
ethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over scale12/192.0.2.21:8020 after 1
failover attempts. Trying to failover after sleeping for 1157ms.
^C2020-04-06 16:26:27,097 INFO retry.RetryInvocationHandler: java.io.IOException:
The client is stopped, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo
over scale11/192.0.2.20:8020 after 2 failover attempts. Trying to failover after
sleeping for 2591ms.

Both the NameNodes are in standby and the CES has failed to select one as active. To verify, run the
following command:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

scale01:8020 standby
scale02:8020 standby



Solution:
a. Check the NameNode that should be active by running the following command:

/usr/lpp/mmfs/bin/mmhealth node show -v HDFS_NAMENODE -N cesNodes

b. For one of the nodes, the output shows the hdfs_namenode_wrong_state event.
c. ssh to that node and set it manually to active by running the following command:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -transitionToActive -scale

d. Wait for 30 seconds and verify if the NameNode is now active by running the following commands:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

and

/usr/lpp/mmfs/bin/mmhealth node show -v HDFS_NAMENODE -N cesNodes

39. CES HDFS Transparency fails to start if the Java version is upgraded.
Solution
For information on troubleshooting this issue, see HDFS Transparency fails to start if the Java version
is upgraded.
40. The mmhdfs command cannot recognize the FQDN hostnames if the NameNodes or DataNodes were
added with short hostname.
If IBM Storage Scale and HDFS Transparency are set up with short hostname then there is no issue
with using a short hostname.
If IBM Storage Scale is set up with FQDN and HDFS Transparency is set up with short hostname then
mmhdfs does not recognize the node as a NameNode or DataNode.
For example, the mmhdfs hdfs status command will state that this is not a NameNode and will
exit with a return code 1.
Solution:
Set up Transparency to use FQDN by updating the hdfs-site.xml to set the NameNodes to FQDN
and the worker file hostnames to FQDN.
41. Multi-HDFS cluster deployment through IBM Storage Scale 5.1.1.0 installation toolkit is not
supported.
Solution:
If you want to create multiple HDFS clusters on the same IBM Storage Scale cluster, perform the following steps:
a. Clear the installation toolkit HDFS metadata, by running the following command:

/spectrumscale config hdfs clear

b. Follow “Adding a new HDFS cluster into existing HDFS cluster on the same GPFS cluster using
install toolkit” on page 73.
Note: Ensure that the fields for the new HDFS cluster are unique from the already existing HDFS cluster. The installation toolkit cannot check for duplicate values. The installation toolkit HDFS metadata is regenerated after the CES HDFS cluster is deployed, but it will only contain the new HDFS cluster information.
42. mmhealth node shows CES in Degraded state.
When you are creating a CES HDFS cluster, mmhealth node shows CES -v as degraded and with
hdfs_namenode_wrong_state message.

[root@scale-31 ~]# mmhealth node show CES -v


Node name: scale-31.openstacklocal



Component Status Status Change Reasons
--------------------------------------------------------------------------------------------
-----------------
CES DEGRADED 2021-05-05 09:52:29
hdfs_namenode_wrong_state(hdfscluster3)
AUTH DISABLED 2021-05-05 09:49:28 -
AUTH_OBJ DISABLED 2021-05-05 09:49:28 -
BLOCK DISABLED 2021-05-05 09:49:27 -
CESNETWORK HEALTHY 2021-05-05 09:49:58 -
eth1 HEALTHY 2021-05-05 09:49:44 -
HDFS_NAMENODE DEGRADED 2021-05-05 09:52:29
hdfs_namenode_wrong_state(hdfscluster3)
NFS DISABLED 2021-05-05 09:49:25 -
OBJECT DISABLED 2021-05-05 09:49:28 -
SMB DISABLED 2021-05-05 09:49:26 -

[root@scale-31 ~]# mmhealth event show hdfs_namenode_wrong_state


Event Name: hdfs_namenode_wrong_state
Event ID: 998178
Description: The HDFS NameNode service state is not as expected (e.g. is in
STANDBY but is supposed to be ACTIVE or vice versa)
Cause: The command /usr/lpp/mmfs/hadoop/sbin/mmhdfs monitor checkHealth
-Y returned serviceState which does not match the expected state when looking at the
assigned ces IP attributes
User Action: N/A
Severity: WARNING
State: DEGRADED

[root@scale-31 ~]# hdfs haadmin -getAllServiceState


scale-31.openstacklocal:8020 active
scale-32.openstacklocal:8020 standby
[root@scale-31 ~]#

[root@scale-31 ~]# mmces address list


Address Node Ces Group Attributes
----------- ----------------------- ------------------ ------------------
192.0.2.0 scale-32.openstacklocal hdfshdfscluster3 hdfshdfscluster3
192.0.2.1 scale-32.openstacklocal none none
192.0.2.2 scale-32.openstacklocal none none
192.0.2.3 scale-31.openstacklocal none none
192.0.2.4 scale-31.openstacklocal none none
192.0.2.5 scale-31.openstacklocal none none
[root@scale-31 ~]#

The issue here is that the CES IP is assigned to the Standby NameNode instead of the Active
NameNode.
Solution:
The following are the three solutions for this problem:
• Manually set the active NameNode to standby on the node by running the /usr/lpp/mmfs/
hadoop/bin/hdfs haadmin -transitionToStandby -scale command. Then on the other
node, set the standby NameNode to active by running the /usr/lpp/mmfs/hadoop/bin/hdfs
haadmin -transitionToActive -scale command.
• Move the CES IP to the active NameNode by running the mmces address move --ces-ip <CES
IP> --ces-node <node name> command.
• Restart the CES HDFS NameNodes by running the following commands:

mmces service stop HDFS -a


mmces service start HDFS -a

43. Kerberos principal update not taking effect on changing KINIT_PRINCIPAL in hadoop-env.sh.
Solution:
The CES HDFS Kerberos information is cached at /var/mmfs/tmp/krb5cc_ces. Delete this file to
force the update.
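For example, to remove the cached credentials so that the updated principal takes effect:

rm -f /var/mmfs/tmp/krb5cc_ces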
44. If Kerberos was configured on multiple HDFS Transparency clusters using a common KDC server and
the supplied gpfs_kerberos_configuration.py script, kinit with the hdfs user principal fails
for all the clusters except the most recent one.



The Kerberos configuration script gpfs_kerberos_configuration.py generates a keytab file for the hdfs user under the default path /etc/security/keytabs/hdfs.headless.keytab. The kinit error occurs because the gpfs_kerberos_configuration.py script updated the keytab file and invalidated the copies of the keytab on the previous cluster.
Solution:
From the most recent HDFS Transparency cluster that the script was run, copy the keytab file to all
the other HDFS Transparency cluster nodes where the script was run.
For example:
If Hadoop cluster A ran the gpfs_kerberos_configuration.py script which created the hdfs
user principal and Hadoop cluster B ran the gpfs_kerberos_configuration.py script which
then updated the original hdfs user keytab, copy the hdfs keytab from Hadoop cluster B to Hadoop
cluster A to ensure that the Hadoop cluster A kinit works properly.
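A minimal sketch of that copy, assuming the default keytab path generated by the script; the node name clusterA-node1 is a placeholder, and the copy must be repeated on every node of cluster A that holds the keytab:

# Run from a node in Hadoop cluster B (the most recently configured cluster)
scp /etc/security/keytabs/hdfs.headless.keytab clusterA-node1:/etc/security/keytabs/hdfs.headless.keytab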
This limitation has been fixed in HDFS Transparency 3.1.1.6.
45. DataNodes are down after system reboot.
Solution:
HDFS Transparency DataNodes may not start automatically after a system reboot. As a workaround,
you can manually start the DataNodes after the system reboot by using the following command from
one of the CES nodes as root:

#/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn start

46. HDFS administrative commands, such as hdfs haadmin and hdfs groups, cannot be executed from HDFS clients where Kerberos is enabled. The CES HDFS service principal uses the CES host name instead of the NameNode hostname, so the client-side principal check fails. The administrative commands fail with the following error:

Caused by: java.lang.IllegalArgumentException: Server has invalid Kerberos principal:
nn/[email protected], expecting:
nn/[email protected]
at org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:337)
at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:234)
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:160)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
... 15 more

To resolve this, add the following key in the core-site.xml file on the client (see the XML snippet at the end of this item):

hadoop.security.service.user.name.key.pattern=*

While using Cloudera Manager:


a. Go to Clusters > IBM Spectrum Scale > Configuration > Cluster-wide Advanced Configuration
Snippet (Safety Valve) for the core-site.xml file.
b. Add the hadoop.security.service.user.name.key.pattern=* parameter and restart
related services.
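One way to express this key as a standard Hadoop property entry in core-site.xml (a sketch; verify against your client configuration layout):

<property>
  <name>hadoop.security.service.user.name.key.pattern</name>
  <value>*</value>
</property>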

Limitations and differences from native HDFS


This topic lists the limitations and the differences from the native HDFS.
IBM Storage Scale information lifecycle management (ILM) is supported in the Hadoop environment.
• If you are planning for stretch cluster with Hadoop, contact [email protected].



• If you are planning for Hadoop Disaster Recovery with AFM, contact [email protected].

Application interaction with HDFS transparency


Hadoop applications interact with HDFS Transparency in the same way as they interact with native HDFS. They can access data in the IBM Storage Scale file system by using the Hadoop file system APIs and the Distributed File System APIs.
An application might have its own cluster that is larger than the HDFS Transparency cluster. However, all the nodes within the application cluster must be able to connect to all nodes in the HDFS Transparency cluster over RPC.
Yarn can define the nodes in its cluster by using worker files. However, HDFS Transparency can use a set of configuration files that is different from Yarn's. In that case, the worker files in HDFS Transparency can be different from the ones in Yarn.



Application interface of HDFS transparency
In HDFS transparency, applications can use the APIs defined in org.apache.hadoop.fs.FileSystem
class and org.apache.hadoop.fs.AbstractFileSystem class to access the file system.

Command line for HDFS Transparency


The HDFS shell command line can be used with HDFS Transparency.

hdfs dfs -xxx / hadoop dfs -xxx: Supported
   Comments: hdfs dfs -du does not report the exact total value for a directory for all HDFS Transparency versions before HDFS Transparency 3.1.0-1.
   Note:
   • For HDFS Transparency, hdfs dfs -du /path/to/target needs to recursively go through all the directories and files under <gpfs.mnt.dir>/<gpfs.data.dir>/path/to/target. If there are many subdirectories and files under <gpfs.mnt.dir>/<gpfs.data.dir>/path/to/target, it is recommended to not run this command frequently.
   • If you are using the hadoop dfs -du command to get the output size of a file, the size value might not correspond to the du command output for the same file on the IBM Storage Scale file system. The IBM Storage Scale file system takes the replication factor and snapshots into account in the reported value. Therefore, use the POSIX ls -l output file size on the IBM Storage Scale file system to compare with the Hadoop du command output file size.

hdfs envvars: Supported
hdfs getconf: Supported
hdfs groups: Supported
hdfs jmxget: Supported
hdfs haadmin: Supported
hdfs zkfc: Supported
hdfs crypto: Supported since HDFS Transparency 3.0.0+
   Comments: CES HDFS does not support hdfs crypto -listZones. For a workaround with the GPFS policy engine, see step 31 in the Second generation HDFS Transparency Protocol troubleshooting topic.
hdfs httpfs: Not tested
   Comments: Use HDFS WebHDFS for the REST API instead.
hdfs lsSnapshottableDir: Not supported
hdfs oev: Not supported
hdfs oiv: Not supported
hdfs oiv_legacy: Not supported
hdfs snapshotDiff: Not supported
hdfs balancer: Not supported
hdfs cacheadmin: Not supported
hdfs diskbalancer: Not supported
hdfs ec: Not supported
hdfs journalnode: Not supported
hdfs mover: Not supported
hdfs namenode: Not supported
hdfs nfs3: Not supported
hdfs portmap: Not supported
hdfs secondarynamenode: Not supported
hdfs storagepolicies: Not supported
hdfs dfsadmin: Not supported

Note: HDFS administrative commands, such as hdfs haadmin and hdfs groups, cannot be executed from HDFS clients where Kerberos is enabled. The CES HDFS service principal uses the CES host name instead of the NameNode hostname, so the administrative commands fail while doing the hostname matching.
To resolve this, add the following key in the core-site.xml file on the client:

hadoop.security.service.user.name.key.pattern=*



Snapshot support
Native HDFS can create a snapshot against a single directory. IBM Storage Scale supports two kinds of snapshots: file system snapshots (global snapshots) and independent fileset snapshots.
Before HDFS Transparency 2.7.3-1, HDFS Transparency implemented the snapshot from the Hadoop
interface as a global snapshot and creating snapshot from a remote mounted file system was not
supported.
HDFS Transparency 2.7.3-2 and later supports creating snapshot from a remote mounted file system.
The snapshot interface from the Hadoop shell is as follows:

hadoop dfs -createSnapshot /path/to/directory <snapshotname>

For /path/to/directory, HDFS Transparency checks the parent directories from right to left. If one of the directories is linked with an IBM Storage Scale fileset (check the Path column in the output of mmlsfileset <fs-name>) and the fileset is an independent fileset, HDFS Transparency creates the snapshot against that independent fileset. For example, if /path/to/directory is linked with fileset1 and fileset1 is an independent fileset, the above command creates the snapshot against fileset1. If not, Transparency checks /path/to, then /path, followed by / (which is <gpfs.mnt.dir>/<gpfs.data.dir> in the IBM Storage Scale file system). If Transparency cannot find any independent fileset linked with the path /path/to/directory, Transparency creates the <snapshotname> against the fileset root in the IBM Storage Scale file system.
Limitation of snapshot capability for HDFS Transparency 2.7.3-2 and later:
• Do not create snapshots frequently (for example, do not create more than one snapshot every hour) because creating a snapshot holds all in-flight I/O operations. One independent fileset on an IBM Storage Scale file system supports only 256 snapshots. When you delete snapshots, it is better to remove them from the oldest snapshot to the latest snapshot.
• On IBM Storage Scale level, only the root user and the owner of the linked
directory of independent fileset can create snapshot for IBM Storage Scale fileset. On
HDFS interface from HDFS Transparency, only super group users (all users belong to
the groups defined by gpfs.supergroup in /usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-
site.xml and dfs.permissions.superusergroup in /usr/lpp/mmfs/hadoop/etc/hadoop/
hdfs-site.xml) and the owner of directory can create snapshot against the /path/to/directory.
For example, if the userA is the owner of /path/to/directory and /path/to/directory is the
linked directory of one independent fileset or /path/to/directory is the child directory under
the linked directory of one independent fileset, userA can create the snapshot against /path/to/
directory.
• Currently, Transparency caches all fileset information when Transparency is started. After
Transparency is started, newly created filesets will not be detected automatically. You
need to run /usr/lpp/mmfs/hadoop/bin/gpfs dfsadmin -refresh hdfs://<namenode-
hostname>:8020 refreshGPFSConfig to refresh the newly created filesets or you can restart the
HDFS Transparency.
• Do not use nested filesets, such as /gpfs/dependent_fileset1/independent_fileset1/dependent_fileset2/independent_fileset2. Transparency creates the snapshot against the first independent fileset found by checking the path from right to left. Also, the snapshots for independent filesets are independent. For example, the snapshot of independent fileset1 has no relationship with any other independent fileset.
• hadoop fs -renameSnapshot is not supported.
• Do not run hadoop dfs -createSnapshot or Hadoop dfs -deleteSnapshot under
the .snapshots directory that is located in IBM Storage Scale file system. Otherwise, error such as
Could not determine current working directory occurs.
For example,

[root@dn01 .snapshots]# hadoop fs -deleteSnapshot / snap1


Error occurred during initialization of VM



java.lang.Error: Properties init: Could not determine current working directory.
at java.lang.System.initProperties(Native Method)
at java.lang.System.initializeSystemClass(System.java:1166)

• HDFS Transparency does not need to run hdfs dfsadmin --allowSnapshot or hdfs dfsadmin
-disallowSnapshot commands.
• Snapshot is supported similarly for multiple IBM Storage Scale file systems.
• Snapshot for remote mounted file system is not supported if gpfs.remotecluster.autorefresh
(/usr/lpp/mmfs/hadoop/etc/hadoop/gpfs-site.xml) is configured as false. By default, it is
true.
• HDFS Transparency supports only the Hadoop snapshot create and delete functions. Hadoop snapshot
list command will only list the path of the snapshot directory name and not the snapshot contents
because the snapshots are created by the IBM Storage Scale snapshot commands and are stored in the
Scale snapshot root directory where the Hadoop environment does not have access.
• To perform snapshot restores, see the following topics under the Administering section in IBM Storage
Scale documentation:
– Restoring a file system from a snapshot topic under the Creating and maintaining snapshots of file
systems section
– Restoring a subset of files or directories from a local file system snapshot topic under the Managing file
systems section
– Restoring a subset of files or directories from local snapshots using the sample script topic under the
Managing file systems section

Hadoop ACL and IBM Storage Scale protocols


Hadoop supports only POSIX ACLs. Therefore, HDFS Transparency supports only POSIX ACLs. If your Hadoop applications involve ACL operations, you need to configure the IBM Storage Scale file system authorization type (the -k option) as all or posix. Otherwise, the ACL operations from Hadoop report exceptions.
If you want to run IBM Storage Scale protocol NFS, you need to configure the -k as all. When HDFS
Transparency accesses the file configured with NFSv4 ACL, NFSv4 ACL does not usually take effect
(NFSv4 ACL is configured to control the access from NFS clients and usually NFS clients are not Hadoop
nodes).
If you want to run both HDFS Transparency and IBM Storage Scale protocol SMB, SMB requires IBM
Storage Scale file system to be configured with -k nfs4. The workaround is to configure -k nfs4
to enable CES/SMB and then change it into -k all after the SMB service is up (after enablement,
SMB service can be started successfully with the file system configured with -k all when failover is
triggered). This allows both SMB and HDFS to coexist on the same IBM Storage Scale file system. However, even with this workaround, you cannot use the SMB client to control the ACLs of the files and directories in IBM Storage Scale. It is verified that the SMB ACL does not work properly over directories when the file system is configured as -k all. For more information on limitations, see CES HDFS Limitations and Recommendations.
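For example, to check and change the ACL semantics of the file system (the device name gpfs0 is a placeholder):

# Show the current ACL semantics (-k) of the file system
/usr/lpp/mmfs/bin/mmlsfs gpfs0 -k
# Set the ACL semantics to all
/usr/lpp/mmfs/bin/mmchfs gpfs0 -k all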
Note:
• Ensure that the file system is set to ACL ALL in a multi-protocol environment. Also, do not set the ACL to
NFSv4 format when ingesting data into it. If ACL is set to NFSv4 format, HDFS will not be able to access
the data.
• In a multi-protocol environment, only one protocol should modify the ACLs and write to the directory/
container while all the other protocols should only read in a staged manner.
For example:

NFS ingests the data and HDFS reads the data to run the analytics.



The difference between HDFS Transparency and native HDFS
The following configurations differ between HDFS Transparency and native HDFS.

dfs.storage.policy.enabled
   Value: True/false
   New definition or limitation: Not supported by HDFS Transparency. This means that storage policy commands like hdfs storagepolicies and configuration like fs.setStoragePolicy are not supported.

dfs.permissions.enabled
   Value: True/false
   New definition or limitation: For HDFS protocol, permission check is always done.

dfs.namenode.acls.enabled
   Value: True/false
   New definition or limitation: For native HDFS, the NameNode manages all metadata including the ACL information. HDFS can use this information to turn on or off the ACL checking. However, for IBM Storage Scale, HDFS protocol will not save the metadata. When ACL checking is on, the ACL will be set and stored in the IBM Storage Scale file system. If the admin turns ACL checking off, the ACL entries set before are still stored in IBM Storage Scale and remain effective. This will be improved in the next release.

dfs.blocksize
   Value: Long digital
   New definition or limitation: Must be an integer multiple of the IBM Storage Scale file system blocksize (mmlsfs -B); the maximal value is 1024 * file-system-data-block-size (mmlsfs -B). See the example following this table.

dfs.namenode.fs-limits.max-xattrs-per-inode
   Value: INT
   New definition or limitation: Does not apply to HDFS Transparency.

dfs.namenode.fs-limits.max-xattr-size
   Value: INT
   New definition or limitation: Does not apply to HDFS Transparency.

dfs.namenode.fs-limits.max-component-length
   Value: Not checked
   New definition or limitation: Does not apply to HDFS Transparency; the file name length is controlled by IBM Storage Scale. Refer to the IBM Storage Scale FAQ for the file name length limit (255 unicode-8 chars).

Native HDFS caching
   Value: Not supported
   New definition or limitation: IBM Storage Scale has its own caching mechanism.

NFS Gateway
   Value: Not supported
   New definition or limitation: IBM Storage Scale provides a POSIX interface, and taking the IBM Storage Scale protocol could give you better performance and scaling.
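For example, to check the file system data block size that dfs.blocksize must be an integer multiple of (gpfs0 is a placeholder device name):

/usr/lpp/mmfs/bin/mmlsfs gpfs0 -B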

Functional limitations
• The maximum number of Extended Attributes (EA) is limited by IBM Storage Scale and the total size of
the EA key. Also, the value must be less than a metadata block size in IBM Storage Scale.
• The EA operation on snapshots is not supported.
• Raw namespace is not implemented because it is not used internally.
• If gpfs.replica.enforced is configured as gpfs, the Hadoop shell command hadoop dfs -setrep does not take effect. Also, hadoop dfs -setrep -w stops functioning and does not exit. In addition, if a file is smaller than the inode size (by default, 4 KB per inode), IBM Storage Scale stores the file as data-in-inode. For these small files, the data replication of the data-in-inode file follows the metadata replication instead of the data replication.
• HDFS Transparency NameNode does not provide safemode because it is stateless.
• HDFS Transparency NameNode does not need the second NameNode like native HDFS because it is
stateless.
• The maximum number of replicas for IBM Storage Scale is 3.
• hdfs fsck does not work against HDFS Transparency. Instead, run mmfsck.
• IBM Storage Scale has no ACL entry number limit (maximal entry number is limited by Int32).
• distcp --diff is not supported over snapshot.
• A + character in a file name is not supported with the hftp:// schema. Without hftp://, + in file names works.
• In HDFS, files can only be appended. If a file is uploaded into the same location with the same file name to overwrite the existing file, HDFS can detect this according to the inode change. However, IBM Storage Scale supports the POSIX interface and other protocol interfaces (for example, NFS/SMB), and a file could be changed from a non-HDFS interface. Therefore, files loaded for Hadoop services to process cannot be modified until the process completes. Otherwise, the service or job fails.
• To view a list of HDFS 3.1.1 community fixes that are not yet supported by HDFS Transparency, see
HDFS 3.1.1 community fixes.
• GPFS quota is not supported in HDFS Transparency.
• The -px option for the hadoop fs -cp -px command is not supported when SELinux is enabled. This
is because HDFS cannot handle the system extended attribute (xattr) operation.
• hadoop namenode -recover is not supported.
• hdfs haadmin -refreshNodes is not supported.

hdfs haadmin
For CES HDFS integration, from IBM Storage Scale version 5.0.4.2, the hdfs haadmin command has the
following new options:
• -checkHealth -scale
• -transitionToActive/-transitionToStandby
The option scale is used to retrieve the health and service state of the NameNodes on IBM Storage
Scale.
The option -transitionToActive/-transitionToStandby is used to change the state of the local
NameNode to active or standby.



Usage:

haadmin [-ns <nameserviceId>]
    [-checkHealth -scale [-all] [-Y]]                  # For Spectrum Scale usage only
    [-transitionToActive [--forceactive] -scale]       # For Spectrum Scale usage only
    [-transitionToStandby -scale]                      # For Spectrum Scale usage only
    [-transitionToActive [--forceactive] <serviceId>]

Examples:
On the NameNode, check NameNode status by running the following command:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -checkHealth -scale -all

On the first NameNode, transition first NameNode to ACTIVE by running the following command:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -transitionToActive --forceactive -scale

On the second NameNode, transition second NameNode to ACTIVE by running the following command:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -transitionToActive --forceactive -scale

On the secondary NameNode, transition second NameNode to STANDBY by running the following
command:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -transitionToStandby -scale

Configuration parameters
This section lists and describes the configuration parameters for HDFS Transparency and gpfs-
site.xml.

Configuration options for HDFS Transparency


The following configuration options are provided for HDFS Transparency:
1. delete option:
Native HDFS deletes the metadata in the memory using the single threaded mechanism while HDFS
Transparency deletes it under IBM Storage Scale distributed metadata using the same single threaded
mechanism. From HDFS Transparency 3.1.0-5 and 3.1.1-2, HDFS Transparency will delete the
metadata using the multi-threaded mechanism based on the sub-directories and files. The following
parameters can be used to tune the delete operation threads under gpfs-site.xml:

gpfs.parallel.deletion.max-thread-count
   Specifies the number of threads used for parallel deletion. Default is 512.
gpfs.parallel.deletion.per-dir-threshold
   Specifies the number of entries in a single directory that are handled by a single thread. If this threshold is reached a new thread is started. Default is 10000.
gpfs.parallel.deletion.sub-dir-threshold
   Specifies the number of sub-directories (the number of all children, sub-children, sub-sub-children, and so on) that are handled by a single thread. If this threshold is reached a new thread is started. Default is 1000.
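For example, a gpfs-site.xml sketch that sets one of these deletion tuning options explicitly; the value shown is the default:

<property>
  <name>gpfs.parallel.deletion.max-thread-count</name>
  <value>512</value>
</property>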
2. du option:
Native HDFS collects the metadata statistics (for example, disk usage statistics, hdfs dfs -du, or
count files and directories, hdfs dfs -count) in the memory using the single threaded mechanism



while HDFS Transparency collects the metadata statistics under IBM Storage Scale distributed
metadata using the same single threaded mechanism. From HDFS Transparency 3.1.0-6 and 3.1.1-2,
HDFS Transparency will collect the metadata statistics using the multi-threaded mechanism based
on the sub-directories and files. The following parameters can be used to tune the operation threads
under gpfs-site.xml:

gpfs.parallel.summary.max-thread-count
   Specifies the number of threads used for parallel directory summary. Default is 512.
gpfs.parallel.summary.per-dir-threshold
   Specifies the number of entries in a single directory that are handled by a single thread. If this threshold is reached a new thread is started. Default is 10000.
gpfs.parallel.summary.sub-dir-threshold
   Specifies the number of sub-directories (the number of all children, sub-children, sub-sub-children, and so on) that are handled by a single thread. If this threshold is reached a new thread is started. Default is 1000.
3. list option:
From HDFS Transparency 3.1.0-6 and 3.1.1-3, the following configuration options for using multiple
threads to list a directory and load the metadata of its children are provided:

gpfs.inode.update-thread-count
   Specifies the total count of the threads that are used for running statistics on directory entries. The default value is 100. Therefore, by default, the NameNode will create a thread pool with 100 threads and use the thread pool to execute the statistics on directory entries.
gpfs.inode.max-update-thread-count-per-dir
   Specifies the max count of the threads that are used to list a single directory. The default value is 8. Therefore, by default, no matter how big the directory is, at most 8 threads will be used to list the directory and load its children.
gpfs.inode.update-thread-count-factor-per-dir
   Specifies the count of the children of a directory that are handled by a single directory-listing thread. The default value is 5000. Therefore, by default, if a directory has less than 5000 children, only 1 thread will be used to list the directory and load its children. If the directory has children that are more than or equal to 5000 but less than 10000, two threads will be used to list the directory and load its children, and so on. The total number of threads for a directory cannot exceed the gpfs.inode.max-update-thread-count-per-dir value.
gpfs.scandir-due-to-lookup-threshold
   Specifies the threshold that is used to identify a large directory. If the number of children for a directory is greater than the gpfs.inode.max-update-thread-count-per-dir value, it is identified as a large directory. While listing this directory, the NameNode will try to prefetch the metadata of its children to speed up the listing process. The default value is 10000.
gpfs.parallel.ls.max.invocation.max-thread-count (starting with 3.2.2-7)
   Specifies the total count of the threads that are used for listing directory contents. The default value is 512. Therefore, by default, the NameNode will create a thread pool with 512 threads and use the thread pool to execute the listing directory contents.
4. DataNode option:
From HDFS Transparency 3.1.0-9 and 3.1.1-6, the following configuration options for the DataNode
locking mechanism are provided:
• dfs.datanode.lock.read.write.enabled: If this parameter is set to true, the FsDataset lock
will be a read/write lock. If it is set to false, all locks will be write locks. The default value is true.
• dfs.datanode.lock.fair: If this parameter is set to true, the Datanode FsDataset lock will be
used in the Fair mode. This helps to prevent the writer threads from being starved, but can lower the
lock throughput. The default value is true.
• dfs.datanode.lock-reporting-threshold-ms: When thread waits to obtain a lock, or if a
thread holds a lock for more than the threshold, a log message will be written. The default value is
300.
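For example, a sketch of how these options might be set, assuming they are placed in hdfs-site.xml like other dfs.datanode properties; the values shown are the documented defaults:

<property>
  <name>dfs.datanode.lock.read.write.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.lock-reporting-threshold-ms</name>
  <value>300</value>
</property>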

Configuration parameters for gpfs-site.xml


Table 15. Configuration parameters for gpfs-site.xml

gpfs.data.dir
   Default value: (not set)
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Setting this to a subdirectory of the gpfs mount point will make this a subdirectory to the root directory from a hadoop client point of view. Leave it empty to make the whole gpfs file system visible to the hadoop client. When specifying the subdirectory, the gpfs mount point should not be included in the string.

gpfs.edit.log.retain.not-modified-for-seconds
   Default value: 2592000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: The edit log files that have not been accessed for gpfs.edit.log.retain.not-modified-for-seconds will be automatically deleted by the NameNode.

gpfs.encryption.enabled
   Default value: false
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Enable or disable the encryption. It is important to understand the difference between HDFS-level encryption and in-built encryption with IBM Storage Scale. HDFS-level encryption is per user based whereas in-built encryption is per node based. Therefore, if the use case demands more fine-grained control at the user level, use HDFS-level encryption. However, if you enable HDFS-level encryption, you will not be able to get in-place analytics benefits such as accessing the same data with HDFS and POSIX/NFS. This requires Ranger and Ranger KMS. If you plan to enable this, you should enable it on the native HDFS first and confirm it is working before you switch native HDFS into HDFS Transparency.

gpfs.fileset.use-global-snapshot
   Default value: false
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Use global snapshot or fileset snapshot. For more information, see the mmcrsnapshot command in IBM Storage Scale: Command and Programming Reference Guide.

gpfs.inode.file-per-thread
   Default value: 1000
   Supported versions: 3.1.0-x
   Description: This parameter tells how many files a thread should handle when using multiple threads to load metadata of the files under a directory.

gpfs.inode.getdirent-max-buffer-size
   Default value: 64 * 1024 * 1024
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: This parameter indicates the size of the buffer used when reading the directory entries (readdir syscall).

gpfs.inode.max-update-thread-count-per-dir
   Default value: 8
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the max count of the threads that are used to list a single directory. The default value is 8. Therefore, by default, no matter how big the directory is, at most 8 threads will be used to list the directory and load its children.

gpfs.inode.perf-log-duration-threshold
   Default value: 10000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the threshold to identify if some log messages should be printed for some time-consuming operations.

gpfs.inode.update-thread-count
   Default value: 100
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the total count of the threads that are used for directory listing. The default value is 100. Therefore, by default, the NameNode will create a thread pool with 100 threads and use the thread pool to execute the directory-listing threads.

gpfs.inode.update-thread-count-factor-per-dir
   Default value: 5000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the count of the children of a directory that are handled by a single directory-listing thread. The default value is 5000. Therefore, by default, if a directory has less than 5000 children, only 1 thread will be used to list the directory and load its children. If the directory has children that are more than or equal to 5000 but less than 10000, two threads will be used to list the directory and load its children, and so on. The total number of threads for a directory cannot exceed the gpfs.inode.max-update-thread-count-per-dir value.

gpfs.mnt.dir
   Default value: (not set)
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the gpfs mount point.

gpfs.parallel.deletion.sub-dir-threshold
   Default value: 1000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the number of sub-directories (the number of all children, sub-children, sub-sub-children, and so on) that are handled by a single thread. If this threshold is reached a new thread is started.

gpfs.parallel.deletion.max-thread-count
   Default value: 512
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the number of threads used for parallel deletion.

gpfs.parallel.deletion.per-dir-threshold
   Default value: 10000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the number of entries in a single directory that are handled by a single thread. If this threshold is reached a new thread is started.

gpfs.parallel.summary.max-thread-count
   Default value: 512
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the number of threads that are used for parallel directory summary.

gpfs.parallel.summary.per-dir-threshold
   Default value: 10000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the number of entries in a single directory that are handled by a single thread. If this threshold is reached a new thread is started.

gpfs.parallel.summary.sub-dir-threshold
   Default value: 1000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the number of sub-directories (the number of all children, sub-children, sub-sub-children, and so on) that are handled by a single thread. If this threshold is reached a new thread is started.

gpfs.ranger.enabled
   Default value: scale
   Supported versions: 3.1.0-x, 3.1.1-x
   Description: Specifies the value to enable and disable the Ranger support. Set it to true to enable and false to disable. From HDFS Transparency 3.1.0-6 and 3.1.1-3, Ranger is always supported and this value should be set to scale. This parameter is removed in HDFS Transparency 3.3.x-x.

gpfs.remotecluster.autorefresh
   Default value: true
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Enables auto refresh of the GPFS configuration details in HDFS Transparency when the NameNode is started. When set to false, the user must manually run the initmap.sh script.

gpfs.replica.enforced
   Default value: gpfs
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Set this parameter to dfs if you want to use dfs.replication. Set this parameter to gpfs if you want to use the default replication of gpfs. If you set this to gpfs, the setReplication API will not take effect anymore and would break the -setrep command of the fs shell.

gpfs.scandir-due-to-lookup-threshold
   Default value: 10000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the threshold that is used to identify a large directory. If the number of children for a directory is greater than the gpfs.inode.max-update-thread-count-per-dir value, it is identified as a large directory. While listing this directory, the NameNode will try to prefetch the metadata of its children to speed up the listing process. The default value is 10000.

gpfs.slowop.count.threshold
   Default value: 50000
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: If an operation object count exceeds this threshold, a log entry will be recorded to track the occurrence.

gpfs.slowop.duration.threshold
   Default value: 30
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: If an operation exceeds the threshold in seconds specified, a log entry will be recorded to track the slow operation.

gpfs.slowop.message.interval
   Default value: 30
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the rate (in seconds) to log a long running operation.

gpfs.ssh.user
   Default value: root
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the user for the NameNode to connect to the nodes in the remote mount mode.

gpfs.storage.type
   Default value: shared
   Supported versions: 3.1.0-x, 3.1.1-x, 3.2.2-x, 3.3.0-0
   Description: Specifies the type of storage. Set this to shared if you are using a shared-storage cluster.

gpfs.limit.max.os-uid-gid.lookup
   Default value: false
   Supported versions: 3.2.2-7+
   Description: When enabled, limits the OS uid/gid lookups.

Note: The -x under the Supported versions column indicates the latest version.
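For example, a minimal gpfs-site.xml sketch that sets the two path-related parameters from the table; the values are placeholders for a file system mounted at /gpfs with a data directory named datadir:

<property>
  <name>gpfs.mnt.dir</name>
  <value>/gpfs</value>
</property>
<property>
  <name>gpfs.data.dir</name>
  <value>datadir</value>
</property>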

HDFS Transparency limitations and recommendations


This topic lists the limitations and recommendations for CES HDFS.
• In IBM Storage Scale 5.0.4.2, the installation toolkit does not support the ESS deployment for CES
HDFS. For efix support, send an email to [email protected] or upgrade to IBM Storage Scale 5.0.4.3 or
later with the correct BDA integration toolkit “HDFS Transparency support matrix” on page 27.
• Red Hat Enterprise Linux is supported.
• CES HDFS supports installation of HDFS Transparency 3.1.1/3.2.2/3.3.0.
• Upgrade path from HDFS Transparency 3.1.0 and earlier to CES HDFS Transparency 3.1.1 is not
supported.
• Upgrade path from HDFS Transparency 3.1.1 and earlier to CES HDFS Transparency 3.2.2 is not
supported.
• Upgrade path from HDFS Transparency 3.2.2 and earlier to CES HDFS Transparency 3.3.0 is not
supported.
• In IBM Storage Scale 5.0.4.2, remote mount setup is not supported through the installation toolkit.
From IBM Storage Scale 5.0.4.3, remote mount setup is supported with the correct BDA integration
toolkit “HDFS Transparency support matrix” on page 27.
• Ambari is not supported for CES HDFS integration.
• CES only monitors and does failover for HDFS Transparency NameNodes.
• ZKFailoverController is not used for CES HDFS HA.
• It is required to setup a CES shared root file system (cesSharedRoot) for storing CES shared
configuration data, for protocol recovery and for other protocol specific purposes.
• In general, to install and configure CES, all protocols packages available on your platform must be
installed. Installing a subset of available protocols is not supported.
However, CES HDFS protocol is an exception. HDFS will only be installed if it is enabled.
• The mmuserauth is not used by CES HDFS. Manual Kerberos configuration is required to configure
Hadoop services.
• The mmhadoopctl command is for HDFS Transparency 3.1.0 and earlier. Do not use mmhadoopctl in
the CES HDFS environment.
• Do not use the mmhdfs start/stop/restart command on the NameNodes. Instead use the mmces
service start/stop hdfs command to start and stop the NameNodes.
• For generic HDFS Transparency Hadoop distribution support information, see “Hadoop distribution
support” on page 24.
• The mmhdfs import/export/upload/config set commands are based on the HDFS
Transparency configuration directory: /var/mmfs/hadoop/etc/hadoop.



• The file system ACL setting for -k nfs4 is not supported for the Hadoop environment. Set the file
system ACL -k value to all. For more information, see “Hadoop ACL and IBM Storage Scale protocols”
on page 239 and Issues that can be encountered if the IBM Storage Scale ACL is not set properly.
• CES HDFS does not support viewfs.
• CES HDFS does not support hdfs crypto -listZones. For a workaround with the GPFS policy
engine, see the step 31 in the Second generation HDFS Transparency Protocol troubleshooting topic.
• CES HDFS does not support short-circuit mode.
• For remote mount limitation information, see Limitations of protocols on remotely mounted file systems
in IBM Storage Scale: Administration Guide.
• For IBM Storage Scale software and scaling limitations, see Software questions and Scaling questions in
IBM Storage Scale FAQ.
• For additional recommendations and restrictions, see CES HDFS Planning.



Chapter 3. IBM Storage Scale Hadoop performance tuning guide

Overview

Introduction
The Hadoop ecosystem consists of many open source projects. One of the central components is the
Hadoop Distributed File System (HDFS).
HDFS is a distributed file system designed to run on the commodity hardware. Other related projects
facilitate workflow and the coordination of jobs, support data movement between Hadoop and other
systems, and implement scalable machine learning and data mining algorithms. HDFS lacks the
enterprise class functions necessary for reliability, data management, and data governance. IBM’s
General Parallel File System (GPFS) is a POSIX compliant file system that offers an enterprise class
alternative to HDFS.
There are many similar tuning guides available for native HDFS. However, when you apply those tuning steps over IBM Storage Scale, you usually cannot get the best performance because of the natural design differences between HDFS and IBM Storage Scale. This section helps customers tune the different Hadoop components when they run Hadoop over IBM Storage Scale and HDFS Transparency so that they get good performance.
All the tuning configurations mentioned in this section are for Hadoop 2.7.x and Hadoop 3.0.x.

Hadoop over IBM Storage Scale


By default, Hadoop uses HDFS schema (hdfs://<namenode>:<portnumber>) for all the components
to read data from HDFS or write data into HDFS.
Hadoop also supports local file schema (file:///) and other HCFS schema for Hadoop components to
read data from or write data into other distributed file systems.
IBM Storage Scale HDFS Transparency follows the Hadoop HCFS specification and provides the HDFS RPC level implementation for Hadoop components to read data from or write data into IBM Storage Scale. It also supports Kerberos, federation, and distcp.
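For example, a core-site.xml sketch that points the Hadoop components at HDFS Transparency through the HDFS schema; the host name and port are placeholders for your NameNode (or CES) address:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>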

Spark over IBM Storage Scale


Apache Spark is a fast and general engine for large scale data processing.
Spark supports multiple ways (such as hdfs://, file:///) for applications running over Spark to
access the data in distributed file systems.

Performance overview
This section describes the types of configurations that impact job execution for MapReduce and Hive.

MapReduce
This section describes Yarn's job management and execution flow and their impact on MapReduce job performance.
The topology of job management in Yarn is illustrated by Figure 21 on page 254. The picture is from
Apache Hadoop YARN website:



Figure 21. Topology of Job management in Yarn

After a client submits one job into Yarn, the Resource Manager receives the request and puts it into
Resource Manager queue for scheduling. For each job, it has one App Master that coordinates with the
Resource Manager and the Node Manager to request the resource for other tasks of the job and start
these tasks after their resource is allocated.
From Figure 21 on page 254, all tasks are run over the nodes on which there is Node Manager service.
Only these tasks will read data from distributed file system or write data into distributed file systems.
For native HDFS, if HDFS DataNode services are not running over these Node Manager nodes, all the Map/
Reduce jobs cannot leverage data locality for high read performance because all data must be transferred
over network.
Also, the App Master of each job consumes the configured CPU and memory resources.
For a specific MapReduce job, the execution flow is illustrated by Figure 22 on page 255:



Figure 22. MapReduce Execution Flow

For each map task, it reads one split of input data from distributed file system. The split
size is controlled by dfs.blocksize, mapreduce.input.fileinputformat.split.minsize and
mapreduce.input.fileinputformat.split.maxsize. If data locality could be met, this improves
the data read time. Map task spills the input data into Yarn’s local directories when its buffer
is filled up according to Yarn’s configuration (controlled by mapreduce.task.io.sort.mb and
mapreduce.map.sort.spill.percent). For spilled out data, map task needs to read them into
memory, merge them, and write out again into Yarn’s local directories as one file. This is controlled by
mapreduce.task.io.sort.factor and mapreduce.map.combine.minspills. Therefore, if a spill
happens, we need to write them, read them, and write them again. It is 3x times of IO time. Therefore, we
need to tune Yarn’s configuration to avoid spill.
After map task is done, its output is written onto the local directories. After that, if there are
reduce tasks up, reduce task will take HTTP requests to fetch these data into local directories
of the node on which the reduce task is running. This phase is called copy. If thread number
of copy is too small, the performance is impacted. The thread number of copy is controlled by
mapreduce.reduce.shuffle.parallelcopies and mapreduce.tasktracker.http.threads.
At the merge phase in reduce task, it keeps fetching data from different map task output
into its local directories. If these data fill up reduce task’s buffer, this data is written into
local directories (this is controlled by mapreduce.reduce.shuffle.input.buffer.percent and
mapreduce.reduce.shuffle.merge.percent). Also, if the spilled data file number on local disks
exceed the mapreduce.reduce.merge.inmem.threshold, reduce task will also merge these files
into larger ones. After all map data are fetched, reduce task will enter the sort phase which merges
spilled files maintaining their sort order. This is done in rounds and the file number merged per round
is controlled by mapreduce.task.io.sort.factor. Finally, one reduce task will generate one file in
distributed file system.
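For example, a mapred-site.xml sketch that raises the map-side sort buffer and merge factor to reduce spills; the values are illustrative only and must be sized to the map output and the available task memory:

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value>
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>100</value>
</property>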



Hive
Apache Hive is designed for traditional data warehousing tasks and is not intended for OLTP workloads.
Apache Hive comprises several components, illustrated in Figure 23 on page 256:

Figure 23. Framework of Hive (from Apache Hive website)

In a Hadoop distro such as IBM BigInsights, the Hive Metastore server provides metadata-related
operations (such as database, table, and partition operations). The WebHCat server provides a web
interface for clients to execute queries against Hive. HiveServer2 is a server interface that enables remote
clients to execute queries against Hive and retrieve the results.

Hadoop performance planning over IBM Storage Scale

Storage model

Internal disks (FPO)


Similar to Hadoop HDFS, IBM Storage Scale FPO uses a shared-nothing architecture. Every server in the
cluster is both a computing node and a storage node. IBM Storage Scale manages the local disks of each
server and unifies them into one distributed file system for the applications running on these nodes.
Figure 24 on page 257 illustrates a typical deployment:



Figure 24. Hadoop over IBM Storage Scale FPO

If you are deploying Hadoop over IBM Storage Scale FPO, perform the following steps to build a better
performing cluster:
1. Configure IBM Storage Scale FPO.
Refer to the Configuring FPO sub-section under the Administering > File Placement Optimizer
section in the IBM Storage Scale documentation.
Important configuration decisions, such as the data replica count, metadata replica count, disk layout,
and failure group definitions, must be made in this step, and the cluster configuration must be tuned
accordingly. You also need to consider the shuffle directory for intermediate data. You can verify the
resulting settings with the commands shown after this list.
Note:
• Because of limited network bandwidth, do not run IBM Storage Scale FPO over a 1 Gb Ethernet adapter
in a production environment.
• 8 to 12 SAS or SATA disks per node are recommended with one 10 Gb Ethernet adapter. For metadata,
SSDs are recommended, especially for workloads that are sensitive to file and directory operations.
2. HDFS Transparency nodes.
Install HDFS Transparency on all FPO nodes. Avoid placing HDFS Transparency nodes and other Hadoop
services (especially Yarn's NodeManager and Resource Manager, Hive's HiveServer2, and the HBase
Master and Region Servers) on IBM Storage Scale clients, because all the data for these services would
then go over the network and degrade performance.
Assign the HDFS Transparency NameNode to an IBM Storage Scale node with metadata disks.
3. Server nodes for Yarn, HBase, Hive, and Spark.
Assign the Yarn ResourceManager to a node running HDFS Transparency. Assign the HBase Master
and the HDFS Transparency NameNode to different nodes because of memory allocation.
The HBase Master needs a large memory allocation for better performance. If you have many files,
assign more memory to the Transparency NameNode. If the node running the Transparency NameNode
does not have enough memory for the HBase Master, assign them to different nodes.
4. Server nodes for other components.
Aim for even I/O workloads and CPU utilization across the nodes.
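The following sketch shows one way to verify the replica and failure-group decisions from step 1 with standard IBM Storage Scale commands; the file system name bigfs is a placeholder.

# Show the default and maximum data/metadata replication settings of the file system:
/usr/lpp/mmfs/bin/mmlsfs bigfs -m -M -r -R

# Show the failure group that each disk belongs to:
/usr/lpp/mmfs/bin/mmlsdisk bigfs -L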

ESS or shared storage


For Hadoop over shared storage (such as SAN storage), use the deployment shown in Figure 10 on page 10.
In this mode, HDFS Transparency reads the data from the IBM Storage Scale NSD servers and sends it
directly through RPC to the Hadoop nodes. All data reads and writes on the Transparency service nodes go
directly to the file system through local disk paths (usually Fibre Channel connections), which gives better
performance. In this mode, the data read flow is: FC/local disks > NSD server > Transparency > network
> Hadoop client.
Note: In this mode, do not deploy any Hadoop component on the IBM Storage Scale NSD servers.
If you cannot install Transparency on the IBM Storage Scale NSD servers (for example, on ESS), you must
install Transparency on IBM Storage Scale clients. In that case, the deployment shown in Figure 5 on page 7
is recommended.
Compared to Figure 10 on page 10, the data read flow in Figure 5 on page 7 is: FC channel/disks >
NSD server > network > Transparency > loopback (lo) adapter > Hadoop client.

Storage model consideration


This topic lists the advantages and disadvantages of different storage models.

Table 16. Shared-nothing -vs- shared storage

Shared-nothing model
  Advantages:
  • Scale computing and storage together by adding more nodes
  • High I/O bandwidth
  • Data locality for fast reads
  Disadvantages:
  • Additional network bandwidth for data protection against disk failure or node failure
  • Low storage efficiency from 3 replicas (~33% storage utilization)

Shared storage model
  Advantages:
  • No additional network bandwidth for data protection against disk/node failure
  • 1 replica, high storage efficiency
  • Scale computing and storage separately and flexibly
  Disadvantages:
  • No data locality
  • Limited I/O bandwidth

Consider the following factors when selecting a storage model:
• If your data is larger than 1 PB, select shared storage or ESS. With a shared-nothing framework,
scanning such a large amount of data to protect it against disk failure or node failure takes a long time.
• If the IBM Storage Scale cluster size is greater than 100 nodes, select shared storage or ESS. The more
nodes there are, the higher the possibility of a node going down and the more data restriping is needed
for protection.
• If your data size is less than 500 TB but will be scaled to more than 1 PB in the short term, select the
IBM Storage Scale shared-nothing model.
• If your data size is between 500 TB and 1 PB, refer to Table 16 on page 258 and your workloads when
making the decision.

Hardware configuration planning


This topic describes the recommended hardware configuration for the FPO model and for IBM Storage
Scale client nodes in the shared storage model.

Table 17. Hardware configuration for FPO model

Configuration  Recommended configuration
Network        At least one 10 Gb+ Ethernet adapter, or InfiniBand adapters
Disk           SAS disks, 8 to 12 disks per node
Memory         100+ GB per node
CPU            8+ logical processors

Note: For x86_64 with hyper-threading enabled (the default), one physical core maps to two logical
processors. For ppc64le, SMT-4 is recommended, and one physical core maps to four logical processors.

Table 18. Hardware configuration for IBM Storage Scale client node in shared storage model

Configuration  Recommended configuration
Network        At least one 10 Gb+ Ethernet adapter, or InfiniBand adapters
Disk           SAS disks, about 6 disks for shuffle if possible
Memory         100+ GB per node
CPU            8+ logical processors

Note: It is best to use the same hardware configuration for all Hadoop nodes.

Performance guide

How to change the configuration for tuning


This topic lists the steps to change the configuration for tuning.
Refer to the following table to find the correct configuration directory and the corresponding steps to
update the configuration for tuning:

HortonWorks HDP
  Location: /etc/hadoop/conf
  How to change the configuration:
  • Change the configuration from Ambari for HDFS, Yarn, MapReduce2, Hive, and so on.
  • Restart the services to sync the configuration into /etc/hadoop/conf on each Hadoop node.

Community Apache Hadoop
  Location: $HADOOP_HOME/etc/hadoop
  How to change the configuration:
  • Modify the configuration XML files directly.
  • Use scp to sync the configurations to all other nodes.

Standalone Spark distro
  Location: Refer to the guide from your Spark distro.
  How to change the configuration: Usually, if Spark accesses data through the HDFS schema, the
  hdfs-site.xml and core-site.xml locations are defined by HADOOP_CONF_DIR in
  $SPARK_HOME/spark-env.sh. If so, modify hdfs-site.xml and core-site.xml accordingly for tuning and
  sync the changes to all other Spark nodes.

HDFS Transparency
  Location: /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS Transparency 2.7.x) or
  /var/mmfs/hadoop/etc/hadoop (for HDFS Transparency 3.0.x)
  How to change the configuration:
  • hdfs-site.xml and core-site.xml should be the same as the configuration for Hadoop or Spark.
  • If you use HortonWorks HDP, modify the configuration for gpfs-site.xml from Ambari/IBM Storage
    Scale and restart the HDFS service to sync the changes to all HDFS Transparency nodes.
  • If you use community Apache Hadoop, manually update gpfs-site.xml on one HDFS Transparency
    node and then run mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop (for HDFS
    Transparency 2.7.x) or mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop (for HDFS
    Transparency 3.0.x) to sync the changes to all HDFS Transparency nodes.

System tuning

Tuning over IBM Storage Scale FPO


Refer to the following sub-sections under the Configuring FPO section in the Administering section in IBM
Storage Scale documentation:
• Basic Configuration Recommendations
• Configuration and tuning of Hadoop workloads
• Configuration and tuning of database workloads
• Configuration and tuning of Spark workloads



Tuning over ESS/Shared Storage
If you deploy ESS, the IBM team tunes it after it is set up.

Memory tuning
This topic describes memory tuning.
Refer to the following table to plan your system memory per node:

Table 19. System Memory Allocation

Total Memory   Recommended Reserved   Recommended Reserved   IBM Storage Scale   Transparency NameNode   Transparency DataNode
per Node       System Memory          HBase Memory           Pagepool            Heap Size               Heap Size
16 GB          2 GB                   2 GB                   4 GB                2 GB+                   2 GB
24 GB          4 GB                   4 GB                   6 GB                2 GB+                   2 GB
48 GB          6 GB                   8 GB                   12 GB               2 GB+                   2 GB
64 GB          8 GB                   8 GB                   16 GB               2 GB+                   2 GB
72 GB          8 GB                   8 GB                   18 GB               2 GB+                   2 GB
96 GB          12 GB                  16 GB                  20 GB               2 GB+                   2 GB
128 GB         20 GB                  24 GB                  20 GB               2 GB+                   2 GB
256 GB         32 GB                  32 GB                  20 GB               2 GB+                   2 GB
512 GB         64 GB                  64 GB                  20 GB               2 GB+                   2 GB

Note: For detailed memory requirements, see “Recommended hardware resource configuration” on page
16.
The HDFS Transparency DataNode service is a lightweight daemon and does not need a lot of memory.
For the HDFS Transparency NameNode, when Ranger support is enabled (it is enabled by default and can
be turned off by setting gpfs.ranger.enabled=false in gpfs-site.xml), the Transparency NameNode caches
inode information. Therefore, if the Transparency NameNode heap size is very small, JVM garbage
collection runs frequently. Usually 2 GB is enough, and you can increase it up to 4 GB if you have a large
set of files in your file system.
Note:
• From HDFS Transparency 3.1.0-6 and 3.1.1-3, ensure that the gpfs.ranger.enabled field is set to
scale. The scale option replaces the original true/false values.
• The Transparency NameNode does not manage an FSImage as native HDFS does, so it does not need as
much memory for a large number of files as native HDFS.
The pagepool (the memory cache for IBM Storage Scale) sizes above are suitable for most production
cases. However, if you mainly run HBase and want the best HBase performance, follow the “Tuning for
YCSB/HBase” section in this chapter.

Table 20. How to change the memory size

IBM Storage Scale pagepool size
  Run mmchconfig pagepool=XG -N <node1,node2,...>.
  Restart the IBM Storage Scale daemon to make the change effective.

Transparency NameNode heap size
  In the Ambari GUI, go to HDFS > Configs and change the value of NameNode Java heap size.
  If you use community Hadoop, modify the variable HADOOP_NAMENODE_OPTS in
  $HADOOP_HOME/etc/hadoop/hadoop-env.sh. For example, add -Xms2048m -Xmx2048m to set the heap
  size to 2 GB. If the -Xms and -Xmx options are already present, modify the numbers (2048 in this
  example) directly.
  Restart HDFS Transparency to make the change effective.

Transparency DataNode heap size
  In the Ambari GUI, go to HDFS > Configs and change the value of DataNode maximum Java heap size.
  If you use community Hadoop, modify the variable HADOOP_DATANODE_OPTS in
  $HADOOP_HOME/etc/hadoop/hadoop-env.sh. For example, add -Xms2048m -Xmx2048m to set the heap
  size to 2 GB. If the -Xms and -Xmx options are already present, modify the numbers (2048 in this
  example) directly.
  Restart HDFS Transparency to make the change effective.

HDFS Transparency Tuning

Tuning for IBM Storage Scale FPO


If you deploy the Hadoop cluster through Ambari (for both HortonWorks and IBM BigInsights IOP),
Ambari does some default tuning according to your cluster, and you do not need to do any special tuning
for HDFS Transparency.
The following table lists the most important configurations to check when running HDFS Transparency
over IBM Storage Scale FPO:

Table 21. Tuning configurations for Transparency over IBM Storage Scale FPO

dfs.replication
  Default value: 3
  Recommended value: Keep consistent with your file system default data replica.
  Comments: Usually, it is 3.

dfs.blocksize
  Default value: 134217728
  Recommended value: The chunk size, that is, blockGroupFactor * blocksize. Usually, 128 * 2 MB or
  256 * 2 MB is recommended.

io.file.buffer.size
  Default value: 4096
  Recommended value: The data block size of your file system, or an integral multiple of it, but <= 1 M.
  Comments: The maximum value should be <= 1 M. If the value is too high, more JVM GC operations
  will occur.

dfs.datanode.handler.count
  Default value: 10
  Recommended value: 200
  Comments: If you have more than 10 physical disks per node, you can increase this. For example, if
  you have 20 physical disks per node, increase it to 400.

dfs.namenode.handler.count
  Default value: 10
  Recommended value: Refer to the comments.
  Comments: This depends on the resources of the NameNode and the number of Hadoop nodes. If you
  use the IBM Storage Scale Ambari integration, 100 * loge(DataNode-Number) is used to calculate the
  value for dfs.namenode.handler.count. If you do not use the IBM Storage Scale Ambari integration, use
  400 for ~10 Hadoop nodes, and 800 or higher for ~20 Hadoop nodes.

dfs.ls.limit
  Default value: 1000
  Recommended value: 100000

dfs.client.read.shortcircuit.streams.cache.size
  Default value: 4096
  Recommended value: Refer to the comments.
  Comments: Change it to the IBM Storage Scale file system data blocksize.

dfs.datanode.transferTo.allowed
  Default value: true
  Recommended value: false
  Comments: If this is true, the I/O will be 4K mmap() for GPFS.

Note: If dfs.datanode.handler.count is very small, you might see socket timeouts when the HDFS client
connects to the DataNode.
If the nofile and noproc values from ulimit are less than 64K, you might see socket connection
timeouts. By default, dfs.client.socket-timeout is 60000 ms. If your cluster is busy
(for example, when running benchmarks), you can set it to 300000 ms and set
dfs.datanode.socket.write.timeout to 600 seconds (600000 ms; the default is 480000 ms).
The same tuning should also be done for the Hadoop HDFS client. If you use HortonWorks HDP, change
the above configuration in the Ambari GUI. After these changes, restart all services and ensure that the
changes are synced into /etc/hadoop/conf/hdfs-site.xml and /usr/lpp/mmfs/hadoop/etc/
hadoop/hdfs-site.xml (for HDFS Transparency 2.7.x) or /var/mmfs/hadoop/etc/hadoop/hdfs-
site.xml (for HDFS Transparency 3.0.x). If you use open source Apache Hadoop, update
these configurations for the Hadoop clients ($HADOOP_HOME/etc/hadoop/hdfs-site.xml) and
use /usr/lpp/mmfs/bin/mmhadoopctl to sync your changes to the HDFS Transparency configuration on
all HDFS Transparency nodes.
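For example, on open source Apache Hadoop with HDFS Transparency 2.7.x, the update and sync can look like the following sketch; dn1, dn2, and dn3 are placeholder host names.

# Apply the settings from Table 21 to the Hadoop client configuration:
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

# Copy the updated file to the other Hadoop client nodes:
for node in dn1 dn2 dn3; do
  scp $HADOOP_HOME/etc/hadoop/hdfs-site.xml ${node}:$HADOOP_HOME/etc/hadoop/
done

# Sync the configuration to all HDFS Transparency nodes:
/usr/lpp/mmfs/bin/mmhadoopctl connector syncconf /usr/lpp/mmfs/hadoop/etc/hadoop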

Tuning for IBM Storage Scale over shared storage or ESS


If you deploy the Hadoop cluster through Ambari (for both HortonWorks and IBM BigInsights IOP),
Ambari does some default tuning according to your cluster.
The following table lists the most important configurations to check when running HDFS Transparency
over IBM Storage Scale ESS or shared storage:

Table 22. Tuning configurations for Transparency over IBM ESS or shared storage

dfs.replication
  Default value: 1
  Recommended value: 3, at least more than 1.
  Comments: Also set gpfs.storage.type to shared. For more information, see “Configure storage type
  data replication” on page 59.

dfs.blocksize
  Default value: 134217728
  Recommended value: 536870912

io.file.buffer.size
  Default value: 4096 (bytes)
  Recommended value: The data block size of your file system, or an integral multiple of it, but <= 1 M.
  Comments: The maximum value should be <= 1 M. If the value is too high, more JVM GC operations
  will occur.

dfs.datanode.handler.count
  Default value: 10
  Recommended value: Refer to the comments.
  Comments: Calculate this according to the Hadoop node number and the Transparency DataNode
  number: (40 * HadoopNodes)/TransparencyDataNodeNumber.

dfs.namenode.handler.count
  Default value: 10
  Recommended value: Refer to the comments.
  Comments: This depends on the resources of the NameNode and the number of Hadoop nodes. If you
  use the IBM Storage Scale Ambari integration, 100 * loge(DataNodeNumber) is used to calculate the
  value for dfs.namenode.handler.count. If you do not use the IBM Storage Scale Ambari integration, use
  400 for ~10 Hadoop nodes, and 800 or higher for ~20 Hadoop nodes.

dfs.ls.limit
  Default value: 1000
  Recommended value: 100000

dfs.client.read.shortcircuit.streams.cache.size
  Default value: 4096
  Recommended value: Refer to the comments.
  Comments: Change it to the IBM Storage Scale file system data blocksize.

dfs.datanode.transferTo.allowed
  Default value: true
  Recommended value: false
  Comments: If this is true, the I/O will be 4K mmap() for GPFS.

The same tuning should also be done for the Hadoop HDFS client. If you use HortonWorks HDP, change the
above configuration in the Ambari GUI. After these changes, restart all services and ensure
that the changes are synced into /etc/hadoop/conf/hdfs-site.xml and /usr/lpp/mmfs/
hadoop/etc/hadoop/hdfs-site.xml (for HDFS Transparency 2.7.x) or /var/mmfs/hadoop/etc/
hadoop/hdfs-site.xml (for HDFS Transparency 3.0.x). If you use open source Apache Hadoop,
update these configurations for the Hadoop clients ($HADOOP_HOME/etc/hadoop/hdfs-
site.xml) and use /usr/lpp/mmfs/bin/mmhadoopctl to sync your changes to the HDFS Transparency
configuration on all HDFS Transparency nodes.

Special tuning for IBM Storage Scale


For better performance, if all the Hadoop nodes are in the IBM Storage Scale cluster (as IBM Storage Scale
clients, NSD servers, or FPO nodes), enable Hadoop short-circuit read, which is available from HDFS
Transparency 2.7.3 and improves data read performance. From HDFS Transparency 2.7.3-1, short-circuit
write is enabled to improve data write performance when short-circuit read is enabled.
If you use a Hadoop distro, enable short-circuit write from Ambari/HDFS. If you use open source Apache
Hadoop or a standalone Spark distro, see Chapter 2, “IBM Storage Scale support for Hadoop,” on page 3 to
enable short-circuit read.
If you do not run Apache Ranger, disable the Ranger support; see Chapter 2, “IBM Storage Scale
support for Hadoop,” on page 3.
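After short-circuit read is enabled, you can verify that the Hadoop client actually picks it up. The following is a minimal check, assuming the standard Apache HDFS property names; the socket path value depends on your deployment.

# Both commands print the effective client configuration values:
hdfs getconf -confKey dfs.client.read.shortcircuit
hdfs getconf -confKey dfs.domain.socket.path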
If you enable HA and the I/O stress from the Hadoop cluster is heavy, configure a new service RPC port to
avoid unnecessary active NameNode switches. Change the following configurations in hdfs-site.xml:


Table 23. Configurations in hdfs-site.xml

HA or non-HA
  dfs.namenode.service.handler.count: 80

Non-HA
  dfs.namenode.servicerpc-address: <namenode>:8060

NameNode HA
  dfs.namenode.lifeline.rpc-address.<yourcluster>.nn1: <namenode1>:8052
    Change <yourcluster> into your HA cluster ID; nn1 must match dfs.ha.namenodes.<yourcluster>.
  dfs.namenode.lifeline.rpc-address.<yourcluster>.nn2: <namenode2>:8052
    Change <yourcluster> into your HA cluster ID; nn2 must match dfs.ha.namenodes.<yourcluster>.
  dfs.namenode.servicerpc-address.<yourcluster>.nn1: <namenode1>:8060
    Change <yourcluster> into your HA cluster ID; nn1 must match dfs.ha.namenodes.<yourcluster>.
  dfs.namenode.servicerpc-address.<yourcluster>.nn2: <namenode2>:8060
    Change <yourcluster> into your HA cluster ID; nn2 must match dfs.ha.namenodes.<yourcluster>.

Setting up the NameNode lifeline and servicerpc


To set up the NameNode lifeline and servicerpc, perform the following:
1. Stop all the Hadoop services. If you are using Ambari UI, stop the services using the Ambari GUI.
2. Add the following parameters to hdfs-site. If you are using Ambari, under custom hdfs-site,
set the dfs.namenode.lifeline and dfs.namenode.servicerpc-address ports to ports that are
available on the host.
For example, by default the Yarn Resource Manager
(yarn.resourcemanager.address=<resource_mgr_hostname>:8050) uses port 8050 in
the Ambari GUI. If the Yarn Resource Manager is located on the same node as the NameNode, the
dfs.namenode.lifeline and servicerpc-address ports must be set to unused ports, such as
8051.
Set the appropriate values for your environment.
In hdfs-site:

dfs.namenode.lifeline.rpc-address.x86.nn1 = c902f10x13.gpfs.net:8052
dfs.namenode.lifeline.rpc-address.x86.nn2 = c902f10x14.gpfs.net:8052
dfs.namenode.lifeline.handler.count = 80
dfs.namenode.lifeline.rpc-bind-host = 0.0.0.0
dfs.namenode.servicerpc-address.x86.nn1 = c902f10x13.gpfs.net:8062
dfs.namenode.servicerpc-address.x86.nn2 = c902f10x14.gpfs.net:8062
dfs.namenode.service.handler.count = 80
dfs.namenode.servicerpc-bind-host = 0.0.0.0

3. Start only the zookeeper servers.



4. As HDFS on a NameNode, run the following command:

sudo su - hdfs
ktname=/etc/security/keytabs/hdfs.headless.keytab; kinit -kt ${ktname} `klist -k ${ktname} | tail -1 | awk '{print $2}'`
/usr/lpp/mmfs/hadoop/bin/hdfs --config /var/mmfs/hadoop/etc/hadoop/ zkfc -formatZK

By default, the HDFS Transparency hdfs shell (/usr/lpp/mmfs/hadoop/bin/hdfs) uses
the HDFS Transparency configuration (/var/mmfs/hadoop/etc/hadoop). If needed, you can set
the --config parameter when you execute the /usr/lpp/mmfs/hadoop/bin/hdfs
zkfc -formatZK command. The /usr/lpp/mmfs/hadoop/bin/hdfs --config /var/mmfs/
hadoop/etc/hadoop/ zkfc -formatZK command formats the ZooKeeper instance for the
zkfc (ZKFailoverController) component, not the HDFS Transparency NameNodes and DataNodes.
The newly added parameters in hdfs-site.xml take effect only after HDFS Transparency is
started in the next step.
5. Start HDFS.

[root@c902f10x13 ~]# mmhadoopctl connector status


c902f10x14.gpfs.net: namenode pid is 27467
c902f10x13.gpfs.net: namenode pid is 2215
c902f10x14.gpfs.net: datanode pid is 24480
c902f10x16.gpfs.net: datanode pid is 4891
c902f10x13.gpfs.net: datanode pid is 31069
c902f10x15.gpfs.net: datanode pid is 19421
c902f10x14.gpfs.net: zkfc pid is 24841
c902f10x13.gpfs.net: zkfc pid is 31646

6. Validate the Lifeline and Servicerpc setup.

[root@c902f10x13 ~]# lsof -P -p2215 | grep LISTEN


java 2215 root 204u IPv4 1809998714 0t0 TCP
c902f10x13.gpfs.net:50070 (LISTEN)
java 2215 root 224u IPv4 1810038823 0t0 TCP *:8062 (LISTEN)
java 2215 root 234u IPv4 1810038827 0t0 TCP *:8052 (LISTEN)
java 2215 root 244u IPv4 1810038831 0t0 TCP
c902f10x13.gpfs.net:8020 (LISTEN)

#From the active NameNode log:


STARTUP_MSG: build = https://github.com/apache/hadoop -r
16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d; compiled by 'centos' on 2018-03-30T00:00Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
2020-09-03 22:54:59,780 INFO namenode.NameNode (LogAdapter.java:info(51)) - registered UNIX
signal handlers for [TERM, HUP, INT]
2020-09-03 22:54:59,784 INFO namenode.NameNode (NameNode.java:createNameNode(1654)) -
createNameNode []
2020-09-03 22:55:00,016 INFO impl.MetricsConfig (MetricsConfig.java:loadFirst(121)) -
loaded properties from hadoop-metrics2.properties
2020-09-03 22:55:00,247 INFO timeline.HadoopTimelineMetricsSink
(HadoopTimelineMetricsSink.java:init(85)) - Initializing Timeline metrics sink.
2020-09-03 22:55:00,248 INFO timeline.HadoopTimelineMetricsSink
(HadoopTimelineMetricsSink.java:init(105)) - Identified hostname = c902f10x13.gpfs.net,
serviceName = namenode
2020-09-03 22:55:00,340 INFO timeline.HadoopTimelineMetricsSink
(HadoopTimelineMetricsSink.java:init(138)) - No suitable collector found.
2020-09-03 22:55:00,342 INFO timeline.HadoopTimelineMetricsSink
(HadoopTimelineMetricsSink.java:init(190)) - RPC port properties configured: {8020=client,
8062=datanode, 8052=healthcheck}
……
2020-09-03 22:55:04,503 INFO namenode.NameNode (NameNode.java:setServiceAddress(515)) -
Setting ADDRESS c902f10x13.gpfs.net:8062
2020-09-03 22:55:04,503 INFO namenode.NameNode (NameNodeRpcServer.java:<init>(400)) -
Lifeline RPC server is binding to 0.0.0.0:8052

# If enabled related DEBUG log:


2020-09-03 23:00:36,445 DEBUG ipc.Server (Server.java:processResponse(1461)) -
IPC Server handler 7 on 8062: responding to Call#222 Retry#0
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 172.16.1.95:42340
2020-09-03 23:00:36,445 DEBUG ipc.Server (Server.java:processResponse(1480)) -
IPC Server handler 7 on 8062: responding to Call#222 Retry#0
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 172.16.1.95:42340
Wrote 50 bytes.
2020-09-03 23:00:36,494 DEBUG ipc.Server (Server.java:processOneRpc(2348)) - got #680
2020-09-03 23:00:36,494 DEBUG ipc.Server (Server.java:run(2663)) - IPC Server handler
20 on 8052: Call#680 Retry#0 org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from

172.16.1.91:53314 for RpcKind RPC_PROTOCOL_BUFFER
2020-09-03 23:00:36,495 DEBUG namenode.GPFSPermissionChecker
(GPFSPermissionChecker.java:<init>(64)) - caller user: root isSuper: true fsowner: root
supergroups: [] + hdfs
2020-09-03 23:00:36,495 DEBUG ipc.Server (ProtobufRpcEngine.java:call(549)) - Served:
getServiceStatus, queueTime= 0 procesingTime= 1
2020-09-03 23:00:36,495 DEBUG ipc.Server (Server.java:processResponse(1461)) -
IPC Server handler 20 on 8052: responding to Call#680 Retry#0
org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 172.16.1.91:53314
2020-09-03 23:00:36,495 DEBUG ipc.Server (Server.java:processResponse(1480)) -
IPC Server handler 20 on 8052: responding to Call#680 Retry#0
org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 172.16.1.91:53314 Wrote 37
bytes.

Note:
1. The dfs.namenode.lifeline.rpc-address configuration requires you to restart the NameNodes,
DataNodes and ZooKeeper Failover Controllers. For more information, see Scaling the HDFS
NameNode (part 2).
2. The dfs.namenode.servicerpc-address configuration requires you to reset the ZooKeeper
Failover Controllers as per the following Cloudera documentation: How do you enable NameNode
service RPC port without HDFS service downtime?
Format the zookeeper data structure by using the following command:

/usr/lpp/mmfs/hadoop/bin/hdfs --config /var/mmfs/hadoop/etc/hadoop/ zkfc -formatZK

Buffered logging and filtering


Learn how to configure the Apache Log4j utility's buffering and filtering features, which reduce the load
on the logging subsystem and indirectly increase HDFS Transparency throughput.
Enabling HDFS Transparency log buffering increases I/O performance because HDFS workloads are no
longer blocked by log messages being flushed to disk. In a busy cluster, flushing log messages to disk can
adversely affect workload performance.
To mitigate these adverse effects, buffered logging based on Apache Log4j was introduced in HDFS
Transparency versions 3.2.2-6 and 3.1.1-15. If logging must stay enabled, for example to debug an issue,
buffered logging provides better HDFS Transparency throughput than unbuffered logging.

Steps to configure buffered logging


1. In the /var/mmfs/hadoop/etc/hadoop/log4j.properties configuration file, add or update the
following lines:

log4j.appender.RFA=org.apache.hadoop.hdfs.server.namenode.GPFSRollingFileAppender
log4j.appender.RFA.bufferedIO=true
log4j.appender.RFA.bufferSize=8192

Note: Comment out the earlier Apache Log4j appender.


2. Upload configurations to IBM Storage Scale CCR by issuing the following command:

# mmhdfs config upload

3. If Cloudera is integrated, restart HDFS Transparency from Cloudera Manager.


Otherwise, restart HDFS by using this command:

# mmhdfs hdfs restart



Steps to configure filtering
You can selectively filter out log messages that are written to the log file. The filtering feature allows you
to suppress messages that are usually not needed, for example:

log4j.appender.RFA.filter.1=org.apache.log4j.varia.StringMatchFilter
log4j.appender.RFA.filter.1.StringToMatch=Removing expired token
log4j.appender.RFA.filter.1.AcceptOnMatch=false

With this configuration, the log messages that include the text "Removing expired token" are not logged.
For information about other filters, see the Apache Log4j template file (/usr/lpp/mmfs/hadoop/
template/log4j.properties.template).

Hadoop/Yarn tuning

Common Tuning
Tuning for Yarn comprises two parts: tuning MapReduce2 and tuning Yarn.
Follow the configurations in Table 24 on page 269 to tune MapReduce2 and in Table 25 on page 270 to
tune Yarn.

Table 24. Tuning MapReduce2

mapreduce.map.memory.mb
  Default: 1024 MB. Recommended: 2048 MB.

mapreduce.reduce.memory.mb
  Default: 1024 MB. Recommended: 4096 MB or larger.
  Choose the value according to yarn.nodemanager.resource.memory-mb and the number of concurrent
  tasks on one node.

mapreduce.map.java.opts
  Recommended: 75% or 80% of mapreduce.map.memory.mb.

mapreduce.reduce.java.opts
  Recommended: 75% or 80% of mapreduce.reduce.memory.mb.

mapreduce.job.reduce.slowstart.completedmaps
  Default: 0.05.
  Different Yarn jobs can take different values for this configuration. You can specify the value when
  submitting a Yarn job if the job needs a different value.

mapreduce.map.cpu.vcores
  1

mapreduce.reduce.cpu.vcores
  1

mapreduce.reduce.shuffle.parallelcopies
  Default: 5. Recommended: 30+.

mapreduce.tasktracker.http.threads
  Default: 40.
  If your cluster has more than 40 nodes, increase this to ensure that the reduce tasks on each host have
  at least one thread for shuffle data copy.

yarn.app.mapreduce.am.job.task.listener.thread-count
  Default: 30.
  If you have a larger cluster for the job (for example, more than 20 nodes and 16 logical processors per
  node), you can try increasing this.

mapreduce.task.io.sort.mb
  Default: 100 (MB). Recommended: 70% * mapreduce.map.java.opts.

mapreduce.map.sort.spill.percent
  Default: 0.80. Keep the default value and do not change it.

mapreduce.client.submit.file.replication
  Default: 10.
  Change it to the default replica of your IBM Storage Scale file system (check this with
  mmlsfs <your-fs-name> -r).

mapreduce.task.timeout
  Default: 300000 ms. Change it to 600000 ms if you are running benchmarks.

The following configurations are not used by Yarn and you do not need to change them:

mapreduce.jobtracker.handler.count
mapreduce.cluster.local.dir
mapreduce.cluster.temp.dir

Tuning Yarn
This section describes the configurations to be followed for tuning Yarn.

Table 25. Configurations for tuning Yarn

Resource Manager Heap Size (resourcemanager_heapsize)
  Default value: 1024. Recommended: 1024.

NodeManager Heap Size (nodemanager_heapsize)
  Default value: 1024. Recommended: 1024.

yarn.nodemanager.resource.memory-mb
  Default value: 8192. Recommended: Refer to the comments.
  Comments: The total memory that can be allocated to Yarn jobs.

yarn.scheduler.minimum-allocation-mb
  Default value: 1024. Recommended: Refer to the comments.
  Comments: This value should not be greater than mapreduce.map.memory.mb and
  mapreduce.reduce.memory.mb. Also, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb
  must be multiples of this value. For example, if this value is 1024 MB, you cannot configure
  mapreduce.map.memory.mb as 1536 MB.

yarn.scheduler.maximum-allocation-mb
  Default value: 8192. Recommended: Refer to the comments.
  Comments: This value should not be smaller than mapreduce.map.memory.mb and
  mapreduce.reduce.memory.mb.

yarn.nodemanager.local-dirs
  Default value: ${hadoop.tmp.dir}/nm-local-dir. Recommended: Refer to the comments.
  Comments: hadoop.tmp.dir is /tmp/hadoop-${user.name} by default. This affects shuffle performance.
  If you have multiple disks/partitions for shuffle, configure this as, for
  example: /hadoop/local/sd1,/hadoop/local/sd2,/hadoop/local/sd3. The more disks you use for this
  configuration, the more I/O bandwidth you have for Yarn's intermediate data.

yarn.nodemanager.log-dirs
  Default value: ${yarn.log.dir}/userlogs. Recommended: Refer to the comments.
  Comments: These directories are where Yarn jobs write their task logs. They do not need a lot of
  bandwidth, so you can configure a single directory, for example /hadoop/local/sd1/logs.

yarn.nodemanager.resource.cpu-vcores
  Default value: 8. Recommended: The logical processor number.
  Comments: Configure this as the logical processor number (check it in /proc/cpuinfo). This is the
  common rule. However, if you run a job that takes all vcores of all nodes and the CPU utilization is not
  higher than 80%, you can increase this (for example, 1.5 x the logical processor number). If you run
  CPU-sensitive workloads, keep the ratio of physical CPUs to vcores at 1:1. If you run I/O-bound
  workloads, you can change it to 1:4. If you do not know your workloads, keep it at 1:1.

yarn.scheduler.minimum-allocation-vcores
  Default value: 1. Recommended: 1.

yarn.scheduler.maximum-allocation-vcores
  Default value: 32. Recommended: 1.

yarn.app.mapreduce.am.resource.mb
  Default value: 1536. Recommended: Refer to the comments.
  Comments: Configure this as the value of yarn.scheduler.minimum-allocation-mb. Usually, 1 GB or
  2 GB is enough. Note: This is under MapReduce2.

yarn.app.mapreduce.am.resource.cpu-vcores
  Default value: 1. Recommended: 1.

Compression:

mapreduce.map.output.compress
  Default value: false. Recommended: true.
  Comments: Compress the output of map tasks. Usually, this means compressing Yarn's intermediate
  data.

mapreduce.map.output.compress.codec
  Default value: org.apache.hadoop.io.compress.DefaultCodec.
  Recommended: org.apache.hadoop.io.compress.SnappyCodec.

mapreduce.output.fileoutputformat.compress
  Default value: false. Recommended: true.
  Comments: Compress the job output.

mapreduce.output.fileoutputformat.compress.codec
  Default value: org.apache.hadoop.io.compress.DefaultCodec.
  Recommended: org.apache.hadoop.io.compress.GzipCodec.

mapreduce.output.fileoutputformat.compress.type
  Default value: RECORD. Recommended: BLOCK.
  Comments: Specifies the compression type for the compressed job output.

Scheduling:

yarn.scheduler.capacity.resource-calculator
  Default value: org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.
  Recommended: Refer to the comments.
  Comments: By default, it is org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, which
  schedules jobs according to memory calculation only. If you want to schedule jobs according to memory
  and CPU, you can enable CPU scheduling from Ambari > Yarn > Config. After you enable CPU
  scheduling, the value is org.apache.hadoop.yarn.util.resource.DominantResourceCalculator. If you find
  that your cluster nodes reach 100% CPU with the default configuration, you can try to limit the
  concurrent tasks by enabling CPU scheduling. Otherwise, there is no need to change this.

Specific tuning of Yarn for ESS/Shared Storage


This section describes the Yarn configuration to be tuned for ESS or shared storage.
If you are running Hadoop over shared storage or ESS, tune the following:

yarn.scheduler.capacity.node-locality-delay
  Default: 40. Recommended: -1.
  Comments: Change this to -1 to disable locality-based scheduling. If you do not change it, you will have
  only 40 concurrent tasks running over the cluster, regardless of the number of nodes in your cluster.
  Change this from Ambari GUI > Yarn > Config > Advanced > Scheduler; Capacity Scheduler.

Maximal Map and Reduce Task Calculation


This topic describes how to calculate the maximum number of map and reduce tasks.

MaxMapTaskPerWave_mem
  (yarn.nodemanager.resource.memory-mb * YarnNodeManagerNumber - yarn.app.mapreduce.am.resource.mb) / mapreduce.map.memory.mb

MaxMapTaskPerWave_vcore
  (yarn.nodemanager.resource.cpu-vcores * YarnNodeManagerNumber - yarn.app.mapreduce.am.resource.cpu-vcores) / yarn.scheduler.minimum-allocation-vcores

TotalMapTaskPerWave
  Equal to MaxMapTaskPerWave_mem by default

MaxReduceTaskPerNode_mem
  (yarn.nodemanager.resource.memory-mb * YarnNodeManagerNumber - yarn.app.mapreduce.am.resource.mb) / mapreduce.reduce.memory.mb

MaxReduceTaskPerNode_vcore
  (yarn.nodemanager.resource.cpu-vcores * YarnNodeManagerNumber - yarn.app.mapreduce.am.resource.cpu-vcores) / yarn.scheduler.minimum-allocation-vcores

TotalReduceTaskPerWave
  Equal to MaxReduceTaskPerNode_mem by default

If yarn.scheduler.capacity.resource-calculator is not changed, MaxMapTaskPerWave_mem
takes effect by default. In this situation, if MaxMapTaskPerWave_vcore is more than two times
MaxMapTaskPerWave_mem, you still have a lot of CPU resources and you can increase
MaxMapTaskPerWave_mem by either increasing yarn.nodemanager.resource.memory-mb or
decreasing mapreduce.map.memory.mb. However, if MaxMapTaskPerWave_vcore is smaller than
MaxMapTaskPerWave_mem, more than one task will run on the same logical processor, which might
bring additional context-switch cost.
The same applies to MaxReduceTaskPerNode_mem and MaxReduceTaskPerNode_vcore.
TotalMapTaskPerWave is the total number of map tasks that can run concurrently in one wave.
TotalReduceTaskPerWave is the total number of reduce tasks that can run concurrently in one wave.
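The following sketch works through these formulas with illustrative numbers only (8 NodeManagers, 96 GB and 32 vcores usable per node for Yarn, 2 GB map containers, 4 GB reduce containers, and a 2 GB Application Master):

# All values are examples; substitute your own Yarn settings.
NODES=8; NM_MEM_MB=98304; NM_VCORES=32
AM_MEM_MB=2048; AM_VCORES=1
MAP_MEM_MB=2048; REDUCE_MEM_MB=4096; MIN_ALLOC_VCORES=1

echo "MaxMapTaskPerWave_mem      = $(( (NM_MEM_MB * NODES - AM_MEM_MB) / MAP_MEM_MB ))"
echo "MaxMapTaskPerWave_vcore    = $(( (NM_VCORES * NODES - AM_VCORES) / MIN_ALLOC_VCORES ))"
echo "MaxReduceTaskPerNode_mem   = $(( (NM_MEM_MB * NODES - AM_MEM_MB) / REDUCE_MEM_MB ))"
echo "MaxReduceTaskPerNode_vcore = $(( (NM_VCORES * NODES - AM_VCORES) / MIN_ALLOC_VCORES ))"

With these example numbers, roughly 383 map tasks or 191 reduce tasks can run in one wave.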



Performance sizing
Many factors, such as the number of logical processors, memory size, network bandwidth, storage
bandwidth, and the IBM Storage Scale deployment mode, can affect performance sizing. This section gives
a brief throughput sizing estimate for the teragen and terasort workloads; the query and transaction
workload types (HBase, Hive) have too many factors to give sizing rules.
Sizing the throughput of an HDFS Transparency cluster is done in two steps:
1. Size the throughput of the IBM Storage Scale POSIX interface.
2. Calculate the throughput of HDFS Transparency.
To size the throughput of the IBM Storage Scale POSIX interface, use the open source IOR benchmark to
measure the read and write throughput of the POSIX interface. If you are not able to use the IOR
benchmark, estimate the throughput of the IBM Storage Scale POSIX interface as follows:
• For IBM ESS, get the throughput number from the IBM product guide.
• For IBM Storage Scale FPO:
  – If the network bandwidth is greater than (disk-number-per-node * disk-bandwidth), calculate:
    ((disk-number * disk-bandwidth / replica-number) * 0.7)
  – If the network bandwidth is smaller than (disk-number-per-node * disk-bandwidth), calculate:
    ((network-bandwidth-per-node * node-number) * 0.7)
Usually, it is recommended to use SSDs for metadata so that the metadata operations in IBM Storage
Scale FPO do not become the bottleneck. Under this condition, the HDFS Transparency interface yields
approximately 70% to 80% of the POSIX interface throughput. The benchmark throughput is also
affected by the number of Hadoop nodes and the Hadoop-level configuration settings.
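A worked sketch of the FPO estimate above follows; every hardware number in it is illustrative (10 nodes, 10 disks per node at about 150 MB/s each, 3 data replicas, one 10 Gb adapter per node, roughly 1250 MB/s).

# Illustrative sizing helper; substitute your own numbers:
NODES=10; DISKS_PER_NODE=10; DISK_MBS=150; REPLICAS=3; NET_MBS=1250
if [ $(( DISKS_PER_NODE * DISK_MBS )) -gt "$NET_MBS" ]; then
  # Network is the smaller limit: ((network-bandwidth-per-node * node-number) * 0.7)
  echo "POSIX throughput estimate: $(( NET_MBS * NODES * 7 / 10 )) MB/s"
else
  # Disks are the smaller limit: ((disk-number * disk-bandwidth / replica-number) * 0.7)
  echo "POSIX throughput estimate: $(( DISKS_PER_NODE * NODES * DISK_MBS / REPLICAS * 7 / 10 )) MB/s"
fi

The HDFS Transparency throughput would then be roughly 70% to 80% of the printed estimate.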
To size the Hadoop node number:
• For ESS:
Calculate the Hadoop node number by using the official ESS throughput value and the client network
bandwidth. For example, if you use ESS GL4s with 36 GB/s of throughput from the IBM product guide and
each client has 10 Gb of network bandwidth, you need 36 GB/s / ((10 Gb/8) * 0.8) clients to drive the
throughput.
OR
Calculate based only on the network bandwidth of the ESS configuration and the client network adapter
throughput. For example, with 100 GB/s of ESS network bandwidth and 10 GB/s of client network adapter
throughput, you need (100 GB / 10 GB) = 10 clients.
• For FPO:
All Hadoop nodes should be IBM Storage Scale FPO nodes.
Teragen
When benchmarking Teragen, execute the following command:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen \
  -Dmapreduce.job.maps=<JOB_MAP_NUMBER> -Ddfs.blocksize=<BLOCKSIZE> \
  <DATA_RECORDS> /<OUTPUT_PATH>

In the above command, <DATA_RECORDS> specifies the number of records for your evaluation. One
record is 100 bytes, so 10,000,000,000 records correspond to 10,000,000,000 * 100 bytes, which is close
to 1 TB.
<OUTPUT_PATH> is the data output directory. Change it accordingly.



Teragen has no reduce tasks, and <JOB_MAP_NUMBER> is the value that you need to plan
carefully. Refer to yarn.nodemanager.resource.memory-mb, yarn.scheduler.minimum-allocation-vcores,
yarn.app.mapreduce.am.resource.cpu-vcores, and yarn.app.mapreduce.am.resource.mb in Table 25 on
page 270.
<JOB_MAP_NUMBER> should be equal to (MaxTaskPerNode_mem * YarnNodeManagerNumber - 1).
Also, the value ((<DATA_RECORDS> * 100)/(1024*1024))/<JOB_MAP_NUMBER> should not be very
small; it should be close to dfs.blocksize or a multiple of dfs.blocksize. If that value is very small, your
<DATA_RECORDS> is too small for your cluster.
If yarn.scheduler.capacity.resource-calculator is changed by enabling CPU scheduling from Ambari, the
smaller value between MaxTaskPerNode_mem and MaxTaskPerNode_vcore takes effect. In this situation,
try to make MaxTaskPerNode_vcore and MaxTaskPerNode_mem close to each other; if they are not, that
usually indicates that you have free resources that are not yet utilized.
If you use the same block size (the default dfs.blocksize from your configuration) for your Teragen job,
you do not need to specify -Ddfs.blocksize=<BLOCKSIZE>. If you want to use a different dfs.blocksize
for the job, you can specify the -Ddfs.blocksize=<BLOCKSIZE> option.
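A concrete invocation might look like the following sketch; the record count (about 1 TB of data), the map task number, the block size, and the output path are all illustrative values to replace with numbers derived from your own cluster.

# 10,000,000,000 records x 100 bytes ~= 1 TB; 383 maps and a 512 MB block size are examples:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen \
  -Dmapreduce.job.maps=383 \
  -Ddfs.blocksize=536870912 \
  10000000000 /benchmarks/teragen-1TB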

TeraSort
TeraSort is a typical Map/Reduce job: it runs both map tasks and reduce tasks.
Use the following command to run TeraSort:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort \
  -Dmapreduce.job.reduces=<REDUCE_TASKS> \
  -Ddfs.blocksize=<DFS_BLOCKSIZE> \
  -Dmapreduce.input.fileinputformat.split.minsize=<DFS_BLOCKSIZE> \
  -Dio.file.buffer.size=<IO_BUFFER_SIZE> \
  -Dmapreduce.map.sort.spill.percent=0.8 \
  -Dmapreduce.reduce.shuffle.merge.percent=0.96 \
  -Dmapreduce.reduce.shuffle.input.buffer.percent=0.7 \
  -Dmapreduce.reduce.input.buffer.percent=0.96 \
  /<TERAGEN_DATA_INPUT> \
  /<TERASORT_DATA_OUTPUT>

<IO_BUFFER_SIZE> must be equal to your IBM Storage Scale data pool block size (check this by
mmlspool <fs-name> all -L).
<TERAGEN_DATA_INPUT> and <TERASORT_DATA_OUTPUT> must be specified according to your
requirements.
<DFS_BLOCKSIZE> can be the default dfs.blocksize from hdfs-site.xml. If you want to use a different block
size, specify it here. For IBM Storage Scale ESS or shared storage, <DFS_BLOCKSIZE> should not be set
equal to the data pool block size (usually, a 1 GB block size gives good performance). For IBM Storage
Scale FPO, <DFS_BLOCKSIZE> must be equal to your file system data blocksize * blockGroupFactor
(check these two values with mmlspool <fs-name> all -L).
<DFS_BLOCKSIZE> determines the map task number in your cluster. For the above command
(mapreduce.input.fileinputformat.split.minsize is specified as <DFS_BLOCKSIZE>), the final map task
number is calculated according to <DFS_BLOCKSIZE>. If you have only one 512 MB file and you specify
500 MB as <DFS_BLOCKSIZE>, you get two splits, that is, two map tasks. If you have two 512 MB files
and you specify 500 MB as <DFS_BLOCKSIZE>, you get four splits, that is, four map tasks.
Note: The total split number is the final map task number and cannot be changed by TeraSort options. If
you do not follow the above guidance, you might get an unexpected split number, which affects the map
phase in TeraSort.
Ideally, the data size processed by each map task should be close to, but not larger than, (70% *
mapreduce.map.java.opts * 80%); this keeps the intermediate data size in the job's shuffle as small as
possible.



A very small <DFS_BLOCKSIZE> gives you more map tasks. The map task number should allow the
cluster to execute the tasks in one or two waves. Do not execute map tasks in three or more waves
because this slows down performance.
If the map task number is ((MaxTaskPerNode_mem * YarnNodeManagerNumber) - 1), all the map
tasks can be handled in one wave. If the map task number is larger than ((MaxTaskPerNode_mem
* YarnNodeManagerNumber) - 1), it should be between 1.75 * (MaxTaskPerNode_mem *
YarnNodeManagerNumber) and 1.9 * (MaxTaskPerNode_mem * YarnNodeManagerNumber). You
can try different map task numbers by changing the number of files in <TERAGEN_DATA_INPUT> and the
<DFS_BLOCKSIZE>.
Usually, <REDUCE_TASKS> should be executed in one wave. That means <REDUCE_TASKS> should be
equal to (TotalReduceTaskPerWave - 1).
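For example, a run over the Teragen output above could look like the following sketch; the reduce task count, block size, buffer size, and paths are illustrative and must be derived from your own cluster as described in this section.

# 190 reducers and a 1 MiB io.file.buffer.size are examples; the buffer size assumes
# a 1 MiB data pool block size:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort \
  -Dmapreduce.job.reduces=190 \
  -Ddfs.blocksize=536870912 \
  -Dmapreduce.input.fileinputformat.split.minsize=536870912 \
  -Dio.file.buffer.size=1048576 \
  -Dmapreduce.map.sort.spill.percent=0.8 \
  /benchmarks/teragen-1TB /benchmarks/terasort-1TB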

DFSIO
DFSIO is shipped with Hadoop distro.
The following are the options for DFSIO:

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO

Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] |
  -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N]
  [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]

Usually, only the -read, -write, -nrFiles, -size, and -bufferSize options matter. The -read option evaluates
read performance, and the -write option evaluates write performance. The -nrFiles option specifies the
number of files to generate, and the -size option specifies the size of each file.
Therefore, the total amount of data that TestDFSIO reads or writes is (nrFiles * size). DFSIO is simple in its
logic: it starts nrFiles map tasks running over the whole Hadoop/Yarn cluster.
1st tuning guide: nrFiles and task number
When evaluating TestDFSIO, consider Yarn's configuration. If the maximum number of tasks per wave is
TotalMapTaskPerWave, your nrFiles should be TotalMapTaskPerWave.
For IBM Storage Scale FPO, the file size (-size) should be at least 512 MB (try 1 GB, 2 GB, and 4 GB). For
shared storage or ESS, the file size should be 1 GB or more (try 1 GB, 2 GB, and 4 GB).
Usually, based on experience, use as many map tasks as possible for DFSIO read. For DFSIO write, try a
map task count that matches the number of logical processors, even if you have more free memory.
2nd tuning guide: nrFiles * size
The total data size (nrFiles * size) should be at least four times the total physical memory of all HDFS
nodes. For example, if you want to compare DFSIO performance between native HDFS and IBM Storage
Scale and you have 10 native HDFS DataNodes with 100 GB of physical memory each, your (nrFiles * size)
over native HDFS should be 4 * (100 GB per node * 10 DataNodes), about 4000 GB. Then use the same
(nrFiles * size) for IBM Storage Scale.
3rd tuning guide: -bufferSize
Try setting -bufferSize to the block size of IBM Storage Scale. This is the I/O buffer size that each task
uses to write and read data.
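For example, a pair of write and read runs could look like this sketch; the file count, file size, and buffer size are illustrative, and the buffer size assumes a 1 MiB file system block size.

# Write test, then read test over the same files:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar \
  TestDFSIO -write -nrFiles 64 -size 4GB -bufferSize 1048576
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar \
  TestDFSIO -read -nrFiles 64 -size 4GB -bufferSize 1048576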

TPC-H and TPC-DS for Hive



Tuning for Hive
Hive is Hadoop's SQL interface over HDFS, so tuning Hive is very similar to tuning it over native HDFS. For
more information, refer to Hive's manual.
The following tunings are related to I/O and are therefore affected by the underlying distributed file
system:

Table 26. Hive's Tuning

hive.exec.compress.output
  Default value: False. Recommended value: True.
  Comments: Compress Hive's execution results before writing them out.

hive.exec.compress.intermediate
  Default value: False. Recommended value: True.
  Comments: Compress Hive's intermediate data before writing it out.

hive.intermediate.compression.codec
  Default value: N/A. Recommended value: org.apache.hadoop.io.compress.SnappyCodec.

hive.intermediate.compression.type
  Default value: N/A. Recommended value: BLOCK.

hive.exec.reducers.max
  Default value: 1009. Recommended value: Variable.
  Comments: Decide this according to your cluster and Yarn's configuration.

hive.optimize.sort.dynamic.partition
  Default value: False. Recommended value: True.
  Comments: When enabled, the dynamic partitioning column is globally sorted. Therefore, only one
  record writer needs to be kept open for each partition value in the reducer, which reduces the memory
  pressure on reducers.

hive.llap.io.use.fileid.path
  Default value: True. Recommended value: False.
  Comments: This configuration controls whether LLAP uses a fileId (inode)-based path to ensure better
  consistency for file overwrite cases.
  Note: HDFS Transparency does not support LLAP inode-based lookup because inode values are reused
  in IBM Storage Scale. Therefore, you must set this configuration parameter to False.

If a Hive job invokes a lot of Map/Reduce work and generates a lot of intermediate or output data, the
above configuration improves the Hive job execution.
If you configure hive.exec.compress.output as true, check the following configuration in Hadoop Yarn:
• mapreduce.output.fileoutputformat.compress=true
• mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Snap
pyCodec
• mapreduce.output.fileoutputformat.compress.type=BLOCK
You can modify hive-site.xml and restart the Hive service to make these settings effective for all Hive
jobs. Alternatively, you can set them in the Hive console, in which case they are effective only for the
commands and jobs invoked from that Hive console:

#hive
hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
hive>

Running TPC-H/Hive
This section lists the steps for running TPC-H for Hive.
Execute the following steps to run TPC-H for Hive:
1. Download TPC-H from the official website and TPC-H_on_Hive from Running TPC-H queries on Hive.



Download TPC-H_on_Hive_2009-08-14.tar.gz.
2. Untar the above TPC-H_on_Hive_2009-08-14.tar.gz into $TPC_H_HIVE_HOME
3. Download DBGEN from the TPC-H website:
Note: Register your information and download the TPC-H_Tools_v<version>.zip.
4. Unzip TPC-H_Tools_v<version>.zip and build dbgen:

#unzip TPC-H_Tools_v<version>.zip

Assuming it is located under $TPCH_DBGEN_HOME

#cd $TPCH_DBGEN_HOME
#cd dbgen
#cp makefile.suite makefile

Update the following values in $TPCH_DBGEN_HOME/dbgen/makefile accordingly:

#vim makefile

CC = gcc
DATABASE = SQLSERVER
MACHINE=LINUX
WORKLOAD = TPCH

Modify the $TPCH_DBGEN_HOME/dbgen/tpcd.h:

vim $TPCH_DBGEN_HOME/dbgen/tpcd.h

#ifdef SQLSERVER
#define GEN_QUERY_PLAN "EXPLAIN;"
#define START_TRAN "START TRANSACTION;\n"
#define END_TRAN "COMMIT;\n"
#define SET_OUTPUT ""
#define SET_ROWCOUNT "limit %d;\n"
#define SET_DBASE "use %s;\n"
#endif

Execute make to build dbgen:

#cd $TPCH_DBGEN_HOME/dbgen/
#make

gcc -g -DDBNAME=\"dss\" -DLINUX -DSQLSERVER -DTPCH -DRNG_TEST
-D_FILE_OFFSET_BITS=64 -O -o qgen build.o bm_utils.o qgen.o
rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm

5. Generate the transaction data:

#cd $TPCH_DBGEN_HOME/dbgen/
#./dbgen -s 500    # -s 500 generates 500 GB of data; change the scale factor accordingly for your benchmark

The generated data is stored under $TPCH_DBGEN_HOME/dbgen/ and all files are named with the
suffix .tbl:

# ls -la *.tbl
-rw-r--r-- 1 root root 24346144 Jul 5 08:41 customer.tbl
-rw-r--r-- 1 root root 759863287 Jul 5 08:41 lineitem.tbl
-rw-r--r-- 1 root root 2224 Jul 5 08:41 nation.tbl
-rw-r--r-- 1 root root 171952161 Jul 5 08:41 orders.tbl
-rw-r--r-- 1 root root 118984616 Jul 5 08:41 partsupp.tbl
-rw-r--r-- 1 root root 24135125 Jul 5 08:41 part.tbl
-rw-r--r-- 1 root root 389 Jul 5 08:41 region.tbl
-rw-r--r-- 1 root root 1409184 Jul 5 08:41 supplier.tbl

Note: You will need all the above files. If you find any file missing, you need to regenerate the data.
6. Prepare the data for TPC-H:



#cd $TPCH_DBGEN_HOME/dbgen/
#mv *.tbl $TPC_H_HIVE_HOME/data/

#cd $TPC_H_HIVE_HOME/data/
#./tpch_prepare_data.sh

Note: Before executing tpch_prepare_data.sh, you need to ensure that IBM Storage
Scale is active (/usr/lpp/mmfs/bin/mmgetstate -a), file system is mounted (/usr/lpp/
mmfs/bin/mmlsmount <fs-name> -L) and HDFS Transparency is active (/usr/lpp/mmfs/bin/
mmhadoopctl connector getstate).
7. Check your prepared data:

#hadoop dfs -ls /tpch

Check that all the files listed in Step 5 are present.


8. Run TPC-H:

# cd $TPC_H_HIVE_HOME/
# export HADOOP_HOME=/usr/hdp/current/hadoop-client
# export HADOOP_CONF_DIR=/etc/hadoop/conf
# export HIVE_HOME=/usr/hdp/current/hive-client

# vim benchmark.conf    # change the variables, such as LOG_FILE, if you want

# ./tpch_prepare_data.sh

HBase/YCSB
YCSB is a benchmark developed by Yahoo! to test the performance of HBase.
When running YCSB to evaluate HBase performance over IBM Storage Scale, apply the tuning described in
the following sections.

Tuning for YCSB/HBase


HBase configuration
1. Change the hbase-site.xml from Ambari if you take HortonWorks or IBM BigInsights. If you take
open source HBase, you could modify $HBASE_HOME/conf/hbase-site.xml directly.

Table 27. HBase Configuration Tuning

Java Heap
  Default value: N/A. Recommended value: Refer to the “Memory tuning” on page 261 section.
  Comments: HBase Master server heap size; HBase Region Server heap size.

hbase.regionserver.handler.count
  Default value: 30. Recommended value: 60.

zookeeper.session.timeout
  Default value: N/A. Recommended value: 180000.

hbase.hregion.max.filesize
  Default value: 10737418240. Recommended value: 10737418240.
  Comments: Check the default value. If it is not 10 GB, change it to 10 GB.

hbase.hstore.blockingStoreFiles
  Default value: 10. Recommended value: 50.

hbase.hstore.compaction.max
  Default value: 10. Recommended value: 10.

hbase.hstore.compaction.max.size
  Default value: LONG.MAX_VALUE. Recommended value: Variable.
  Comments: If you see a lot of compaction, you can set this to 1 GB to exclude those HFiles from
  compaction.

hbase.hregion.majorcompaction
  Default value: 604800000. Recommended value: 0.
  Comments: Turn off major compaction when running benchmarks to ensure that the results are stable.
  In production, this should not be changed.

hbase.hstore.compactionThreshold
  Default value: 3. Recommended value: 3.

hbase.hstore.compaction.max
  Default value: 10. Recommended value: 3.

Table 28. IBM Storage Scale Tuning


Configuration Default Value Recommended Value Comments
pagepool 1GB 30% of physical memory 30% of physical memory

Note: The 30% of physical memory recommendation is only for running HBase/YCSB. In production, you
need to consider the memory allocation for other workloads. If you run Map/Reduce jobs and Hive jobs
over the same cluster, you need to trade off the performance of these different workloads: if you allocate
more memory to the pagepool because of HBase, you have less memory for Map/Reduce jobs and
therefore degrade their performance.

Table 29. YCSB Configuration Tuning

writebuffersize
  Default value: 12 MB. Recommended value: 12 MB.

clientbuffering
  Default value: False. Recommended value: True.
  Comments: For benchmarks, keep this the same as what you use when running YCSB over native HDFS.

recordcount
  Default value: 1000. Recommended value: 1000000.

operationcount
  Default value: N/A. Recommended value: N/A.
  Comments: Depends on the number of operations that you want to benchmark, for example, 20M
  operations.

threads
  Default value: N/A. Recommended value: Variable.
  Comments: Depends on the number of threads that you want to benchmark.

requestdistribution
  Default value: zipfian. Recommended value: Not changed.

recordsize
  Default value: 100*10. Recommended value: Not changed.
  Comments: YCSB for HBase uses 100 bytes per field and 10 fields per record.

Important: While creating the HBase table before running YCSB, you need to pre-split the table
accordingly. For example, pre-split the table into 100 partitions for ~10 HBase Region servers. If there
are more than 10 HBase Region servers, increase the pre-split partition number. If you do not pre-split
the table, all requests are handled by only a few HBase Region servers and the performance of YCSB is
impacted.
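
The following is a minimal pre-split sketch that assumes the YCSB default table name usertable, the
column family family used by the scripts in this section, and 100 split points (roughly 10 per Region
server); adjust the table name, column family, and split count to your cluster:

# hbase shell <<'EOF'
n_splits = 100
create 'usertable', 'family', {SPLITS => (1..n_splits).map {|i| "user#{1000+i*(9999-1000)/n_splits}"}}
EOF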

Running YCSB/HBase
This section lists the steps to run YCSB/ HBASE.
1. Download YCSB from YCSB 0.14.0.
2. Untar the YCSB package into $YCSB_HOME.
3. Remove all the libraries shipped by YCSB:

#rm -fr $YCSB_HOME/hbase10-binding/lib/*

4. Copy the libraries from your Hadoop distro:


The following is for HortonWorks HDP 2.6:

cp /usr/hdp/2.6.0.3-8/hbase/lib/* $YCSB_HOME/hbase10-binding/lib/

cp /usr/hdp/2.6.0.3-8/hadoop/*.jar $YCSB_HOME/hbase10-binding/lib/

5. Create the following script accordingly:


The script for loading:

# vim ycsb_workload_load.sh
#!/bin/bash
set -x

# <script> <thread-number> <result sub dir>

${YCSB_HOME}/bin/ycsb load hbase10 -P ${YCSB_HOME}/workloads/workloada \
-p columnfamily=family -p recordcount=${RECORD_COUNT} -s -threads $1 \
-p measurementtype=timeseries -p timeseries.granularity=2000 2>&1 > \
${YCSB_RESULT_HOME}/$2/workload_load.output.thread-$1-.`date "+%y%m%d_%H%M%S"`

In the previous script, RECORD_COUNT is the number of records that you want to load into HBase.
RECORD_COUNT should be 1M. <thread-number> depends on your run.
If you want to change writebuffersize and clientbuffering, you can add -p writebuffersize=<value>
-p clientbuffering=<true|false> to the previous YCSB command.
The script for YCSB workload A/B/C/D/E/F:

# vim ycsb_workload_a.sh
#!/bin/bash
# <script> <workloadA-recordcount> <thread-number> <result sub dir>

${YCSB_HOME}/bin/ycsb run hbase10 -P ${YCSB_HOME}/workloads/workloada \
-p columnfamily=family -p operationcount=${OPERATION_COUNT} \
-p recordcount=$1 -s -threads $2 -p measurementtype=timeseries \
-p timeseries.granularity=2000 2>&1 | \
tee ${YCSB_RESULT_HOME}/$3/workload_a.output.thread-$2.`date "+%y%m%d_%H%M%S"`

Similarly create the other scripts for workload B/C/D/E/F.



You need to first run the ycsb_workload_load.sh script. After it loads data into HBase, run
ycsb_workload_a.sh or other scripts.
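
For example, assuming the environment variables used by these scripts are exported and a run1 result
subdirectory is used, a run with 32 client threads might look like the following; the paths, thread count,
and operation count are illustrative assumptions:

# export YCSB_HOME=/opt/ycsb-0.14.0
# export YCSB_RESULT_HOME=/var/ycsb-results
# export RECORD_COUNT=1000000
# export OPERATION_COUNT=20000000
# mkdir -p ${YCSB_RESULT_HOME}/run1
# ./ycsb_workload_load.sh 32 run1
# ./ycsb_workload_a.sh ${RECORD_COUNT} 32 run1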

Spark

Spark tuning (HortonWorks or IBM Spectrum Conductor)

If you run a standalone Spark distribution (for example, community Spark, or IBM Spectrum Conductor™
with Spark), it is recommended to use the IBM Storage Scale POSIX interface.
If you run a standalone Spark distribution on IBM Storage Scale FPO, see the “Tuning for IBM Storage Scale
FPO” on page 262 section.
If you run a standalone Spark distribution on IBM ESS, no further tuning is needed.
If you use a Hadoop distribution (for example, HortonWorks HDP), it is recommended to use IBM Storage
Scale HDFS Transparency. See the “System tuning” on page 260 and “HDFS Transparency Tuning” on page
262 sections.
At the Spark level, the following settings should be tuned to make Spark work well on IBM Storage Scale:

Configuration: spark.shuffle.file.buffer ($SPARK_HOME/conf/spark-defaults.conf)
Default value: 32K
Recommended value: The IBM Storage Scale data blocksize, for example:
spark_shuffle_file_buffer=$(/usr/lpp/mmfs/bin/mmlsfs <filesystem_name> -B | tail -1 | awk ' { print $2} ')
If the blocksize of the file system is larger than 2MB, configure 2MB for spark.shuffle.file.buffer.

Configuration: spark.local.dir
Default value: /tmp
Recommended value: Configure a local directory for this (do not configure this with an IBM Storage Scale
directory).
Note: This configuration will be overridden by either of the following environment variables set by the
cluster manager:
• SPARK_LOCAL_DIRS (Standalone)
• MESOS_SANDBOX (Mesos)
• LOCAL_DIRS (YARN)

Configuration: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
Default value: 1
Recommended value: Changing this to 2 can make Spark job commits faster.
For other Spark-level tuning, see the Spark configuration and Tuning Spark documentation.



Benchmarking Spark
For benchmarking Spark, follow the guides of popular benchmarks such as spark-bench and
BigDataBench-Spark.

Workloads/Benchmarks information
This topic describes the Workloads/Benchmarks information.
Teragen
Shipped with Hadoop release.

Terasort
Terasort is shipped with the Hadoop package. It is located at $BI_HOME/IHC/hadoop-example.jar.

TPC-H
Usually, TPC-H for big data is done with Hive. Some users also run TPC-H for big data with Pig. The former
is more widely used.

TPC-H data generation (DBGEN) should be downloaded from the TPC-H website.

TPC-H over Hive:


Running TPC-H queries on Hive
TPC-H-Hive
Imperative and Declarative Hadoop: TPC-H in Pig and Hive

Hive is shipped with BigInsights. For more information, see APACHE HIVE TM.

YCSB
YCSB (Yahoo! Cloud Serving Benchmark) is widely used to benchmark NoSQL databases, such as HBase.
You can download it from YCSB.

HiBench
HiBench is a mixed-workload benchmark. It consists of nine different workloads (such as TeraGen,
Terasort, DFSIO, and Hive). It is used to evaluate cluster performance under many different workloads.
HiBench

Hive
Shipped with BigInsights (v2.1). You can download the benchmark for Hive from Running TPC-H queries
on Hive.
For more information, see APACHE HIVE TM.



Chapter 4. Cloudera Data Platform (CDP) Private
Cloud Base

Overview
CDP Private Cloud Base is the on-premises version of Cloudera Data Platform. This new product combines
the best of Cloudera Enterprise Data Hub and Hortonworks Data Platform Enterprise along with new
features and enhancements across the stack. This unified distribution is a scalable and customizable
platform where you can securely run many types of workloads.
CDP Private Cloud Base supports a variety of hybrid solutions where compute tasks are separated
from data storage and where data can be accessed from remote clusters, including workloads created
using CDP Private Cloud Experiences. This hybrid approach provides a foundation for containerized
applications by managing storage, table schema, authentication, authorization, and governance. A high-
level architecture of CDP Private Cloud base with IBM Storage Scale is shown in the following figure:

Figure 25. High-level architecture of Cloudera Data Platform Private Cloud Base with IBM Storage Scale

CDP Private Cloud Base is comprised of a variety of components such as Apache Spark, Apache Hive
3 and Apache HBase along with many other components for specialized workloads. You can select any
combination of these services to create clusters that address your business requirements and workloads.
Several pre-configured packages of services are also available for common workloads.
This Cloudera Data Platform (CDP) Private Cloud Base section describes the deployment of Cloudera Data
Platform Private Cloud Base with IBM Storage Scale CES HDFS Transparency (CES HDFS) by using the
Cloudera Manager (CM) Custom Service Descriptor (CSD) framework.
Follow this guide when you are using IBM Storage Scale as the storage for CDP Private Cloud Base because
it contains deviation procedures for CDP Private Cloud Base.



Architecture
This topic describes the architecture of Cloudera Data Platform (CDP) Private Cloud Base with IBM
Storage Scale.
As shown in Figure 26 on page 288 and Figure 27 on page 289, CDP Private Cloud Base can be deployed
with IBM Storage Scale using Remote mount or single IBM Storage Scale cluster.
The benefits of separation of the Hadoop cluster hosts (master hosts, utility hosts, gateway hosts, or
worker hosts) from the storage hosts (HDFS Transparency NameNodes and DataNodes) are as follows:
• The Hadoop layer and the storage layer can be managed separately and by different teams.
• As IBM Storage Scale is not installed on the Hadoop cluster hosts, you do not need specific Kernel
levels on the Hadoop cluster hosts.
• Only the IBM Storage Scale hosts must have the same value for uid/gid.
• Only IBM Storage Scale requires password-less ssh for either a root or a non-root user with sudo
privileges on all the nodes.
Recommended configuration
• Storage hosts:
– NameNode HA (2 NameNodes)
– DataNode resiliency (3 DataNodes)
Note: The performance also depends on the network and the number of DataNodes that can drive the
bandwidth of the storage (ESS) and the number of Hadoop worker hosts.
• Hadoop cluster hosts:
– For role assignments information about Hadoop cluster hosts, see Cloudera Runtime Cluster Hosts
and Role Assignments.
• The NameNode cannot be colocated with the DataNode or with any other Hadoop services.

Figure 26. Deploying Cloudera Data Platform (CDP) Private Cloud Base with IBM Storage Scale using
Remote mount



Figure 27. Deploying Cloudera Data Platform (CDP) Private Cloud Base with IBM Storage Scale using single
IBM Storage Scale cluster

Cloudera Data Platform (CDP) consists of CDP Private Cloud Base cluster, IBM Storage Scale CES HDFS
Transparency cluster and the shared storage layer.
CDP Private Cloud Base cluster
The CDP Private Cloud Base cluster consists of CDP nodes. One of these nodes hosts the Cloudera
Manager where the IBM Storage Scale CSD will be placed into the CM directory for CSD jar files.
For CDP Private Cloud Base node roles recommendations, see Runtime Cluster Hosts and Role
Assignments under the CDP Private Cloud Cloudera documentation.
IBM Storage Scale CES HDFS transparency cluster
The IBM Storage Scale CES HDFS Transparency cluster consists of NameNodes (CES protocol node
and IBM Storage Scale client) and DataNodes (IBM Storage Scale client). The minimum requirement is
to have two IBM Storage Scale HDFS Transparency NameNodes (HA) and three or more IBM Storage
Scale HDFS Transparency DataNodes. The NameNodes are a part of the CES protocol nodes while the
DataNodes are not a part of the CES protocol nodes. The CES HDFS Transparency nodes also consist
of the IBM Storage Scale native clients. The Cloudera Manager Agent (CM agent) is also present in
the IBM Storage Scale CES HDFS transparency cluster. The function of the CM agent is to facilitate
the management of HDFS transparency NameNodes and HDFS transparency DataNodes from the
Cloudera Manager in the CDP Private Cloud Base cluster.
The following figure shows the Cloudera and IBM Storage Scale/HDFS Transparency components on
the CES HDFS nodes:



Figure 28. Cloudera and IBM Storage Scale/HDFS Transparency components on CES HDFS nodes

Cloudera and IBM Storage Scale/HDFS Transparency components on the CES HDFS nodes are
described in the following list:
1. Cloudera Manager agent: The Cloudera Manager agent is a python-based agent. It consists
of cloudera-manager-agent and cloudera-manager-daemons as its components. The
Cloudera Manager agent can be installed through Cloudera Manager or you can also install it
manually. If you are installing the Cloudera Manager agent directly on the hosts through Cloudera
Manager, you need to provide the password or the ssh-private key of the managed host. You do not
need the password or the ssh-private key of the managed host if you are installing manually.
2. CDP Private Cloud Base parcels: CDP parcels contains the installable for the CDP Private Cloud
Base services. Hosts download the parcel using HTTP (wget) from Cloudera Manager.
3. CDP Private Cloud Base Java: Cloudera Manager requires the same version of Java on all the
managed nodes. If the CM agent is installed using CM, the CDP Private Cloud Base version of Java will
also be installed using CM. For information on the Java level support, see “Hardware and software
requirements” on page 292.
4. Ranger plug-in for HDFS: The Ranger plug-in for HDFS is needed for the NameNode to cache the
Ranger policies.
5. Java for HDFS Transparency: A version of Java must already be installed on the HDFS Transparency
nodes before they are managed by Cloudera Manager.
6. Kerberos client: For supported Kerberos distributions, see “Kerberos” on page 81.
IBM Storage Scale cluster
The IBM Storage Scale cluster as shown at the bottom of the Figure 26 on page 288 can either be IBM
Elastic Storage system or any other shared storage system.
CES HDFS Transparency is remote mounted to ESS as shown in Figure 26 on page 288.
For information on Dual network deployment, see “Dual-network deployment” on page 311.
Note: If you plan to use object protocol, select the single IBM Storage Scale cluster architecture as shown
in Figure 27 on page 289. For more information, see the Limitations of protocols on remotely mounted file
systems topic in the IBM Storage Scale: Administration Guide.



Alternative architectures
This topic describes the alternative architectures that you can use when you do not require high
availability (HA) for HDFS Transparency NameNode and when you want Hadoop services to be installed
on the HDFS Transparency DataNode while knowing the limitations for these use cases.
• Non-HA HDFS Transparency NameNode architecture: CDP Private Cloud Base with IBM Storage Scale
supports both NameNode HA and non-HA modes. You must use the non-HA NameNode option only for
dev, test and non-production use cases.

Figure 29. Non-HA HDFS Transparency NameNode architecture


• HDFS Transparency DataNode colocation architecture: A DataNode can have other Hadoop services
colocated within the same node. Cloudera recommends that the DataNode (Worker) have specific
services installed onto it. For a list of services that must be installed, see the Worker Hosts column in
Cloudera Runtime Cluster Hosts and Role Assignments documentation.
Note: The NameNode cannot be colocated with the DataNode.
The HDFS Transparency DataNode colocation architecture has the following limitations:
– Because IBM Storage Scale is installed on the Hadoop cluster hosts, it is not possible to manage the
Hadoop cluster hosts and the storage hosts separately.
– Requires specific Kernel levels on the Hadoop cluster hosts.
– The IBM Storage Scale hosts must have the same value for all the uid/gid.
– IBM Storage Scale requires password-less ssh for either a root or a non-root user with sudo privileges
on all nodes.



Figure 30. HA and DataNode colocation architecture

Figure 31. Non-HA and DataNode colocation architecture

Planning
Review the “Hadoop IBM Storage Scale Architecture” on page 4 to determine which configuration setup is
to be used in your environment.

Hardware and software requirements


This section specifies the minimum hardware and software requirements for Cloudera Data Platform
(CDP) Private Cloud Base, IBM Storage Scale CES HDFS transparency and IBM Storage Scale.
Hardware requirements
Following are the Hardware requirements for each product and component:



Cloudera Data Platform Private Cloud Base
For Cloudera Data Platform Private Cloud Base hardware requirements information for a specific
release, see Hardware Requirements.
CES HDFS
For production, it is recommended to have two NameNodes and three or more DataNodes.
For production, it is recommended that each NameNode x86 server should have a minimum of two
sockets with at least eight cores each with 128 GB memory.
For production, it is recommended that each DataNode x86 server should have a minimum of two
sockets with at least eight cores each with 64 GB memory.
For better performance, reserve 20% of the system physical memory, subject to a maximum of 20 GB per
node, for the IBM Storage Scale pagepool.
IBM Storage Scale
For IBM Storage Scale hardware requirements information, see the Hardware requirements topic in
IBM Storage Scale: Concepts, Planning, and Installation Guide.
Shared storage
Shared storage can either be ESS, ECE or SAN based storage. Hadoop services including HDFS
Transparency cannot be installed onto the storage nodes (SAN, ECE, ESS).
If you have ESS as shared storage, see the documentation for your model under IBM Elastic Storage
System documentation. ESS is set up and tuned by IBM Lab Services.
If you have ECE as shared storage, see IBM Storage Scale Erasure Code Edition Hardware requirements
in the IBM Storage Scale Erasure Code Edition Guide.
Note:
• Ensure that the CDP Private Cloud Base cluster and the CES HDFS cluster are on the same OS and
architecture platform. Cloudera requires nodes to install its component software on the same OS and
architecture platform. CES protocol does not support mixed architecture levels. For example, if CES
HDFS protocol is on x86_64 platform, then all the CES protocol nodes and CDP Private Cloud Base must
be on the x86_64 platform. The shared storage can be on a different architecture platform. For example, if
the CES protocol nodes are on x86_64, ESS can be on Power.
• FPO is not supported for CDP Private Cloud Base with IBM Storage Scale.
Software requirements
Following are the Software requirements for each product and component:
Cloudera Data Platform Private Cloud Base
OpenJDK 8
PostgreSQL 10 server
Python 2
CES HDFS
OpenJDK 8
Python 2 and Python 3.6 or later
IBM Storage Scale
For IBM Storage Scale software requirements information, see the Software requirements topic in IBM
Storage Scale: Concepts, Planning, and Installation Guide.
Note: Ensure that the CDP Private Cloud Base cluster and the CES HDFS cluster are at the same Operating
System level. The shared storage can be at a different Operating System level. For example, if CES
protocol nodes are on RH 8.2, ESS can be on RH 7.7.
For more information on requirements for Hardware, Operating system, Database, and Java, see CDP
Private Cloud Base Requirements and Supported Versions.
For information on limitations for CDP Private Cloud Base with IBM Storage Scale, see “Limitations” on
page 343.



Cluster host and role assignments
Cloudera Data Platform Private Cloud Base
See the Runtime Cluster Hosts and Role Assignments section of the CDP Private Cloud Base
documentation for the specific CDH release.
CES HDFS
NameNodes - 2 (HA)
DataNodes - 3 or more

Support Matrix
This section lists the support matrix for CDP Private Cloud Base.

Table 30. Support matrix

CDP Private Cloud Base | Cloudera Manager (CM) | Cloudera Runtime (CDH) | IBM Storage Scale (1) | RHEL x86_64        | RHEL ppc64le
7.1.9                  | 7.11.3-CHF1           | 7.1.9-CHF1             | 5.1.8.0+              | 7.9, 8.4, 8.6, 9.1 | 7.9, 8.4, 8.6, 9.1
7.1.8                  | 7.7.1 (2)             | 7.1.8                  | 5.1.4.0+              | 7.9, 8.4, 8.6      | 7.9, 8.4, 8.6
7.1.7 SP 2             | 7.6.7                 | 7.1.7.2                | 5.1.4.0+              | 7.9, 8.4, 8.6      | 7.9, 8.4, 8.6
7.1.7 SP 1             | 7.6.1                 | 7.1.7.1                | 5.1.2.2+              | 7.9, 8.2, 8.4      | 7.9, 8.2, 8.4
7.1.7                  | 7.4.4                 | 7.1.7                  | 5.1.1.2+              | 7.9, 8.2           | 7.9, 8.2
7.1.6                  | 7.3.1                 | 7.1.6                  | 5.1.1.0, 5.1.1.1      | 7.7, 7.9           | 7.7, 7.9
NA                     | 7.2.3                 | 7.1.4                  | 5.1.0.1-5.1.0.3       | 7.7                | 7.7

(1) To ensure the version supports your environment and use case, see the Third-party filesystems
documentation by Cloudera for IBM Storage Scale support information, before selecting a Cloudera
version.
(2) Requires Python v3.8 for Hue.

Table 31. HDFS Transparency and CSD version for specific IBM Storage Scale version

IBM Storage Scale version | HDFS Transparency version | CSD version
5.1.9.0                   | 3.1.1-15                  | 1.2.1-0
5.1.8.1                   | 3.1.1-14                  | 1.2.0-0
5.1.7.1 - 5.1.8.0         | 3.1.1-13                  | 1.2.0-0
5.1.7                     | 3.1.1-12                  | 1.2.0-0
5.1.6.1                   | 3.1.1-12                  | 1.2.0-0
5.1.6                     | 3.1.1-11                  | 1.2.0-0
5.1.5 - 5.1.5.1           | 3.1.1-10                  | 1.2.0-0
5.1.4.1                   | 3.1.1-10                  | 1.2.0-0
5.1.4                     | 3.1.1-9                   | 1.2.0-0
5.1.3.0 - 5.1.3.2         | 3.1.1-8                   | 1.2.0-0
5.1.2.9                   | 3.1.1-12                  | 1.2.0-0
5.1.2.6 - 5.1.2.8         | 3.1.1-10                  | 1.2.0-0
5.1.2.2 - 5.1.2.5         | 3.1.1-8                   | 1.2.0-0
5.1.2.1                   | 3.1.1-7                   | 1.2.0-0
5.1.2                     | 3.1.1-6                   | 1.2.0-0
5.1.1.2 - 5.1.1.4         | 3.1.1-5                   | 1.2.0-0
5.1.1.1                   | 3.1.1-5                   | 1.1.0-0
5.1.1.0                   | 3.1.1-4                   | 1.1.0-0
5.1.0.1 - 5.1.0.3         | 3.1.1-3                   | 1.0.0-0

From CDP Private Cloud Base 7.1.6, you can upgrade CDP Private Cloud Base to a newer CDP Private
Cloud Base version on IBM Storage Scale.

Table 32. Upgrade support

Existing stack version:
  CDP Private Cloud Base: 7.1.6
  IBM Storage Scale CSD: 1.1.0.0
  IBM Storage Scale: 5.1.1.0, 5.1.1.1

Stack version for upgrade:
  CDP Private Cloud Base: 7.1.7
  IBM Storage Scale CSD: 1.2.0.0
  IBM Storage Scale: 5.1.1.2+

Note:
• For generic support information, see “Hadoop distribution support” on page 24.
• Click on the Cloudera Manager and Cloudera Runtime version to get to the Cloudera download
information.
• The Cloudera download is behind a paywall and requires a username/password that you obtain with
the license file. If you are not able to locate the original license email, the download username/password
can be extracted from the license file on the CDP Private Cloud Base download.
• Unlike previous versions of HDFS Transparency, HDFS Transparency 3.1.1 is tightly coupled with IBM
Storage Scale. You need to upgrade the IBM Storage Scale package to get the correct supported
versions to IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit
for HDFS), IBM Storage Scale HDFS Transparency and IBM Storage Scale Cloudera Custom Service
Descriptor (CDP CSD).
• For the location of the IBM Storage Scale packages, see “Downloads” on page 300.
• Support for CDP Private Cloud Base started with CDP Private Cloud Base CM 7.2.3 and CDH 7.1.4
certification with IBM Storage Scale 5.1.0.1.
• From CDP Private Cloud Base 7.1.6, TLS and HDFS encryption are supported with IBM Storage Scale.
• For CES HDFS limitations, see CES HDFS Limitations and Recommendations.
• The following components are not supported with IBM Storage Scale:



– Apache Kudu provides an emulated storage layer that is typically configured over Linux native
storage, for example, ext4 filesystem. Instead of running Kudu, which is a storage layer, over IBM
Storage Scale, a storage provider, users might consider using IBM BigSQL, HBase, or Impala.
– Apache Ozone is a separate filesystem from Cloudera so it is not relevant in the context of IBM
Storage Scale filesystem.
• Starting from CDP 7.1.6, Impala is supported on the x86 platform. Impala is not supported on IBM
Power.
• For best practices for deployment on Power, see the Cloudera Data Platform (CDP) Private Cloud Base
on IBM Power and IBM Elastic Storage System (ESS) white paper.
• Java OpenJDK 11 is supported from CDP Private Cloud Base 7.1.7 SP1 with HDFS Transparency
3.1.1-8.

Preparing the environment


This section describes how to prepare the environment to install Cloudera Data Platform (CDP) Private
Cloud Base, IBM Storage Scale CES HDFS Transparency and the shared storage system.

HDFS Transparency package


This topic helps in the preparation to install HDFS Transparency package.
IBM Storage Scale HDFS Transparency (HDFS Protocol) offers a set of interfaces that allow applications to
use HDFS Client to access IBM Storage Scale through HDFS RPC requests.
All data transmission and metadata operations in HDFS are done through the RPC mechanism, and
processed by the NameNode and the DataNode services within HDFS.
IBM Storage Scale HDFS Transparency is part of the IBM Storage Scale self-extracting archive package.
For information on the supported version for your CDP Private Cloud Base and IBM Storage Scale
environment, see “Support Matrix” on page 294.
For more information, see “HDFS Transparency download” on page 28 section.
The module name is gpfs.hdfs-protocol-3.1.1-(version).
The IBM Storage Scale installation toolkit can only install HDFS Transparency from the IBM Storage Scale
software self-extracting archive extracted directory.
If the HDFS Transparency package is not from the IBM Storage Scale software self-extracting archive
directory, you need to manually install the HDFS Transparency package.
Note:
• For manual install, ensure that there is only one package of HDFS Transparency in the IBM Storage
Scale repository. Rebuild the repository by executing the "createrepo . " command to update the
repository metadata.
• Properly review and set “OS tuning for all nodes in HDFS Transparency” on page 55 and “Configure NTP
to synchronize the clock in HDFS Transparency” on page 56 for all nodes.

IBM Storage Scale file system


This topic helps in the preparation to install IBM Storage Scale file system.
For IBM Storage Scale overview, see the Product overview section in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
If you have purchased the IBM Storage Scale license, you can download the IBM Storage Scale base
installation package files from IBM Passport Advantage.
For IBM Storage Scale version 5.1.0, the full image is available at Fix Central.
For IBM Storage Scale trial and purchase licenses, see https://www.ibm.com/us-en/marketplace/scale-
out-file-and-object-storage/purchase.



For ordering IBM Storage Scale, see Question 1.1 in IBM Storage Scale FAQ documentation.

Kernel, SELinux and NTP


This topic gives information about Kernel, SELinux and NTP.
Kernel
See Installation of Kernel packages under the IBM Storage Scale support for Hadoop Kernel section.
SELinux
For information on SE Linux, see “SELinux” on page 17.
NTP
For information on NTP, see “NTP” on page 18.
Note: Ensure that the date and time of the CDP Private Cloud Base cluster, CES HDFS cluster and ESS are
in synchronization.
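
One way to spot-check clock synchronization is shown in the following sketch, which assumes that chrony
is in use and uses placeholder hostnames; adapt it if you use ntpd or different node names:

# chronyc tracking | grep -E 'Reference|System time'
# for h in cdpnode1 nn01.gpfs.net ess-io1; do ssh $h date +%s; done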

Network validation
While using a private network for Hadoop or IBM Storage Scale nodes, ensure that all the nodes, including
the management nodes, have hostnames bound to the faster internal network or the data network.
On all the nodes, the hostname -f command must return the FQDN of the faster internal network.
This network can be a bonded network. If the nodes do not return the FQDN, modify /etc/sysconfig/
network and use the hostname command to change the FQDN of the node.
In the /etc/hosts file, the long (fully qualified) hostname must be listed before the short hostname, as
shown in the following example.
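
The following is a sketch of such /etc/hosts entries; the IP addresses and hostnames are placeholders:

192.0.2.11   nn01.gpfs.net   nn01
192.0.2.12   nn02.gpfs.net   nn02
192.0.2.21   dn01.gpfs.net   dn01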
If the nodes in your cluster have two network adapters, see “Dual-network deployment” on page 311.

Setting password-less ssh access for root


IBM Storage Scale Master is a role designated to the host on which the Master component of the IBM
Storage Scale service is installed. It should be a part of the administrator nodes set. All the IBM Storage
Scale cluster wide administrative commands including those for creation of the IBM Storage Scale cluster
and the file-system are run from this host.
Password-less ssh access for root must be configured from the IBM Storage Scale Master node to all
the other IBM Storage Scale nodes. This is needed for IBM Storage Scale to work. For non-adminMode
central clusters, ensure that you have bi-directional password-less setup for the fully qualified and short
names for all the GPFS™ nodes in the cluster. This must be done for the root user. For non-root Ambari
environment, ensure that the non-root ID can perform bi-directional password-less SSH between all the
GPFS nodes.
Note: BDA Ambari integration supports the admin mode central configuration of IBM Storage Scale. See
adminMode configuration attribute in IBM Storage Scale: Administration Guide.
In this configuration, one or more hosts could be designated as IBM Storage Scale Administration (or
Admin) nodes. By default, the GPFS Master is an Admin node. In Admin mode central configuration, it is
sufficient to have only uni-directional password-less ssh for root from the Admin nodes to the non-admin
nodes. This configuration ensures better security by limiting the password-less ssh access for root.
An example on setting up password-less access for root from one host to another:
1. Define Node1 as the IBM Storage Scale master.
2. Log on to Node1 as the root user.

# cd /root/.ssh

3. Generate a pair of public authentication keys. Do not type a passphrase.

# ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:

Note: During ssh-keygen -t rsa, accept the default for all the prompts (press Enter, leaving the
passphrase empty).
4. Set the public key to the authorized_keys file.

# cd /root/.ssh/; cat id_rsa.pub > authorized_keys

5. For clusters with adminMode as allToAll, copy the generated public key file to nodeX.

# scp /root/.ssh/* root@nodeX:/root/.ssh

where, nodeX is all the nodes.


For clusters with adminMode as central, copy the generated public key file to nodeX.

# scp /root/.ssh/* root@nodeX:/root/.ssh

nodeX is all the nodes chosen for administration.


Configure password-less ssh with the non-admin nodes (nodeY) in the cluster.

# ssh-copy-id root@nodeY

nodeY is the rest of the cluster nodes.


6. Ensure that the public key file permission is correct.

# ssh root@nodeX "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

7. Check password-less access

# ssh node2

[root@node1 ~]# ssh node2


The authenticity of host 'gpfstest9 (192.0.2.0)' can't be established.
RSA key fingerprint is 03:bc:35:34:8c:7f:bc:ed:90:33:1f:32:21:48:06:db.
Are you sure you want to continue connecting (yes/no)?yes

Note: You also need to run ssh node1 to add the key into /root/.ssh/known_hosts for password-
less access.

IBM Storage Scale local repository


IBM Storage Scale supports installation using the IBM Storage Scale installation toolkit or configuring a
local IBM Storage Scale repository.
Note: If you have already setup an IBM Storage Scale file system, you can skip this section.
1. Ensure there is a Mirror repository server created before proceeding.
2. Set up the Local OS repository if needed.
3. Set up the Local IBM Storage Scale repository. This section helps you to set up the IBM Storage Scale
and HDFS Transparency local repository.



ACL support
Ensure that the IBM Storage Scale file system ACL setting is set to ALL.
For more information, see HDFS and IBM Storage Scale filesystem ACL support.
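
For example, you can check the current ACL setting of the file system and change it if needed. The
following is a sketch in which gpfs1 is a placeholder file system name:

# /usr/lpp/mmfs/bin/mmlsfs gpfs1 -k
# /usr/lpp/mmfs/bin/mmchfs gpfs1 -k all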

Installing
This section lists the steps to install CES HDFS cluster and CDP Private Cloud Base cluster.
It is recommended to install the CDP Private Cloud Base cluster and the CES HDFS cluster on separate
sets of nodes so that these clusters follow the separation of data and compute.

Overview
This topic lists the high-level workflow for deploying an IBM Storage Scale CES HDFS based environment
with CDP Private Cloud Base.

Figure 32. Overview of CDP Private Cloud Base installation

• Designate separate nodes for IBM Storage Scale CES HDFS and CDP Private Cloud Base clusters.
• Set up the CES HDFS cluster:
– Install the CES HDFS HA cluster.
– If you need Kerberos, enable Kerberos on the CES HDFS cluster.
– Verify the installation.
• Set up the CDP Private Cloud Base cluster:
– Install the Cloudera Manager.
– Optional: Enable Kerberos on the Cloudera Manager.
– Optional: Enable Auto-TLS on the Cloudera Manager (TLS is supported from CDP Private Cloud Base
7.1.6).
– Deploy IBM Storage Scale CSD and restart the Cloudera Manager.
– Create a new CDP Private Cloud Base cluster. Select the IBM Storage Scale service along with Yarn,
Zookeeper, and other services as needed.
– Enable NameNode HA in the IBM Storage Scale service and restart all the services.



– Verify the installation.
Important: Adding the IBM Storage Scale service to an HDFS based CDP Private Cloud Base cluster is not
supported. You must choose IBM Storage Scale service when you are creating a new CDP Private Cloud
Base cluster.

Downloads
This section helps with download information for CDP Private Cloud Base and IBM Storage Scale.

Cloudera downloads
For Cloudera download information, see the “Support Matrix” on page 294.

IBM Storage Scale downloads


For IBM Storage Scale, BDA toolkit for HDFS, HDFS Transparency and CDP CSD download information,
download the IBM Storage Scale self-extracting installation package. For more information, see “IBM
Storage Scale file system” on page 296 and HDFS Transparency download.
This self-extracting installation package contains IBM Storage Scale, BDA Toolkit for HDFS, HDFS
Transparency and CDP CSD package.
For example, for IBM Storage Scale 5.1.0.1 on RHEL7, the self-extracting installation package will place
the packages into the following default directory:

/usr/lpp/mmfs/5.1.0.1/hdfs_rpms/rhel7/hdfs_3.1.1.x

Packages information
• IBM Storage Scale Big Data Analytics Integration Toolkit for HDFS Transparency (Toolkit for HDFS)
• IBM Storage Scale HDFS Transparency
• IBM Storage Scale Cloudera Custom Service Descriptor (CDP CSD)
Note:
• CDP Private Cloud Base with IBM Storage Scale is first certified with HDFS Transparency 3.1.1-3, BDA
toolkit 1.0.2.1 and IBM Storage Scale CDP CSD 1.0.0-0 in the IBM Storage Scale 5.1.0.1 self-extracting
installation package.
• The installation toolkit cannot be used if the HDFS Transparency package is downloaded as a
patch (efix) package. Therefore, you must use the manual installation method to install the HDFS
Transparency package.
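
For example, to confirm that the packages are present after extraction, you can list the directory shown
earlier in this section; the path corresponds to the IBM Storage Scale 5.1.0.1 on RHEL7 example, so adjust
it to your version and OS level:

# ls /usr/lpp/mmfs/5.1.0.1/hdfs_rpms/rhel7/hdfs_3.1.1.x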

Shared storage setup


Ensure that you have set up the shared storage that you are using.
If you have ESS as a shared storage, refer to the IBM Elastic Storage System documentation for your
model. The ESS is setup and tuned by IBM Lab Services.
If you have ECE as a shared storage, see IBM Storage Scale Erasure Code Edition Hardware requirements
in the IBM Storage Scale Erasure Code Edition Guide.
Ensure that the ACL is properly set up for the storage to be used for POSIX, HDFS and CES protocols. For
information on the IBM Storage Scale ACL, see “ACL support” on page 299.



CES HDFS
The HDFS Transparency for CDP Private Cloud Base uses CES HDFS protocol cluster setup.

Setup the HDFS Transparency cluster


Setting up the HDFS Transparency for CDP Private Cloud Base using CES HDFS protocol cluster setup.

Figure 33. CES HDFS cluster installation overview

Prerequisite:
• Ensure that the shared storage is set up and available.
Steps:
1. Configure IBM Storage Scale on the CES HDFS cluster to access the shared storage. For example, the
IBM Storage Scale cluster accesses ESS through the remote mount. After the storage is accessible,
create the CES HDFS NameNodes and DataNodes onto the CES HDFS cluster. For information about
setting up and installing CES HDFS, see “Installing” on page 29.
2. Optional: Enable Kerberos. To enable Kerberos, see “Enabling Kerberos for CES HDFS and CDP Private
Cloud Base” on page 315.
3. Optional: Enable TLS. To enable TLS, see Enable TLS.
4. Set up the users and groups in the CES HDFS cluster. All user and group IDs must be the same across
an IBM Storage Scale cluster to work properly. To set up the user and group IDs on the IBM Storage
Scale CES HDFS cluster, see “Configuring users, groups and file system access for IBM Storage Scale”
on page 302.



5. Verify the CES HDFS cluster to ensure you can read and write to the storage file system.
After the CES HDFS cluster is configured, you can set up the CDP Private Cloud cluster.
Note:
• When IBM Storage Scale is installed, the IBM Storage Scale CSD (gpfs.hdfs.cloudera.cdp.csd-
<version-number>.noarch.rpm) package resides in /usr/lpp/mmfs/<IBM Storage Scale
version>/hdfs_rpms/rhel/hdfs_3.1.1.x. You need to move the package to the CDP Private
Cloud cluster.
• If NameNode HA, Kerberos, and/or TLS are enabled on the CES HDFS cluster, set up the CDP Private
Cloud Base cluster with the same configuration.

Configuring users, groups and file system access for IBM Storage Scale
This section shows how to create CDP Private Cloud Base users and groups on the HDFS Transparency
nodes, and also how to configure the IBM Storage Scale file system access.
Overview
1. Check if you have a Windows AD or LDAP-based network
2. Create CDP Private Cloud Base users and groups
3. (Optional): Create CDP Private Cloud Base users and groups on a new node being added (only for Add
Node)
4. Verify the users and groups
5. Create any custom user or group (Optional)
6. Configure Hadoop supergroup for HDFS Transparency
7. Set ownership for the IBM Storage Scale file system Hadoop root directory
When you register hosts to Cloudera Manager, Cloudera Manager creates Hadoop users and groups
corresponding to the services on all the managed hosts. These users and groups must be manually
created on the IBM Storage Scale HDFS Transparency hosts before registering these hosts to the
Cloudera Manager. Because IBM Storage Scale is a POSIX file system, it is required that any common
system user and group must have the same UID and GID across all the IBM Storage Scale nodes.
1. If you are using Windows AD or LDAP-based network, users and groups for Hadoop users on your
HDFS Transparency nodes should have consistent UID and GID across all the HDFS Transparency
nodes. In that case, skip the next step and go to step 3.
2. If you are using local users, run the following command on one of the HDFS Transparency nodes to
create these users and groups. The command can dynamically fetch the list of the NameNode and
DataNode hosts from the cluster configuration and add the Hadoop users and groups to those hosts.
A password-less SSH channel should exist for root from that host to all the other HDFS Transparency
nodes for the command to run.

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-users-and-groups

This command performs the following actions:


• Creates the CDP Private Cloud Base users and groups as defined in the
user_grp_dir_metadata.json file.
• Ensures that all such users and groups have consistent UID and GID across the hosts.
• Creates a system group called supergroup to be used as Hadoop supergroup. hdfs, mapred and yarn
users are added as members of this supergroup.
• The output of the command is logged to /var/log/user_group_configuration.log file.



Note: If you are using HDFS Transparency 3.1.1-5 or earlier, the command should be used with the
--hadoop-hosts option as follows:

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-users-and-groups --hadoop-hosts <comma-separated list of HDFS Transparency NameNodes and DataNodes>

For example:

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-users-and-groups --hadoop-hosts nn01.gpfs.net,nn02.gpfs.net,dn01.gpfs.net,dn02.gpfs.net

3. (Optional): If you want to add only one NameNode or DataNode to an existing HDFS Transparency
cluster, you need to input all the existing hostnames for the HDFS Transparency cluster and the new
hostnames to the gpfs_create_hadoop_users_dirs.py script. This script ensures that all the
values of UID/GID for the users and groups in the IBM Storage Scale cluster are consistent.
For example:
Existing HDFS Transparency hosts:

nn01.gpfs.net,nn02.gpfs.net, dn01.gpfs.net,dn02.gpfs.net

New DataNode being added:

dn03.gpfs.net

Run the following command to create the Hadoop users/groups on dn03.gpfs.net:

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-users-and-groups --hadoop-hosts nn01.gpfs.net,nn02.gpfs.net,dn01.gpfs.net,dn02.gpfs.net,dn03.gpfs.net

4. Verify the users and groups by running the following command:

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --verify-users-and-groups

5. (Optional): You can also create any custom user/group on the IBM Storage Scale nodes using the
--create-custom-hadoop-user-group user-name[:group1[,group2..]] command. The
command ensures that such a user/group is created with consistent UID/GID across all the nodes.
• In the following example we create a user called testuser across the HDFS Transparency nodes. The
user is created as a part of the hadoop group:

# /usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group testuser:hadoop
Checking current state of the system..
Group: hadoop already present on host nn01.gpfs.net
Group: hadoop already present on host dn01.gpfs.net
Group: testuser(10030) added successfully on host nn01.gpfs.net
Group: testuser(10030) added successfully on host dn01.gpfs.net
User: testuser(10028) added successfully on host nn01.gpfs.net
User: testuser(10028) added successfully on host dn01.gpfs.net

• On every CDP Private Cloud Base nodes which is not an IBM Storage Scale node, run the following
command:

# /usr/bin/useradd testuser

6. Configure Hadoop supergroup for HDFS Transparency.


Ensure that HDFS Transparency is stopped by running the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status

If HDFS Transparency is still running, stop it by using the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs stop



Set the dfs.permissions.superusergroup parameter to supergroup by running the following
command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config set hdfs-site.xml -k dfs.permissions.superusergroup=supergroup

Upload the configuration by running the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

7. Set ownership for the IBM Storage Scale Hadoop root directory.
It is recommended to set the ownership for the IBM Storage Scale Hadoop root directory as
hdfs:supergroup with 755 (rwxr-xr-x) permissions. By default, it is set to root:root.

# /usr/bin/chown hdfs:supergroup <IBM Storage Scale mount directory>/<IBM Storage Scale Hadoop data directory>

For example:

# /usr/bin/chown hdfs:supergroup /ibm/gpfs/datadir1

You can retrieve the IBM Storage Scale mount directory (gpfs.mnt.dir) and the IBM Storage Scale
Hadoop data directory (gpfs.data.dir) using the following commands on a CES HDFS cluster node:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config get gpfs-site.xml -k gpfs.mnt.dir -k gpfs.data.dir

where, gpfs.data.dir=datadir1 and gpfs.mnt.dir=/ibm/gpfs.
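
The recommendation above mentions 755 (rwxr-xr-x) permissions but shows only the chown command. A
corresponding chmod, using the same example path, is:

# /usr/bin/chmod 755 /ibm/gpfs/datadir1

You can verify the result with ls -ld /ibm/gpfs/datadir1; the listing should show drwxr-xr-x with
hdfs:supergroup ownership.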

Installing Cloudera Data Platform Private Cloud Base with IBM Storage
Scale
This section describes the steps to create a new CDP Private Cloud Base cluster with the IBM Storage
Scale file system specific configuration.
For Cloudera documentation and download information, see “Support Matrix” on page 294.
Note: Before implementation ensure that you first read the entire section because there are deviations
from CDP Private Cloud Base installation documentation when IBM Storage Scale is integrated.



Figure 34. Overview of CDP Private Cloud Base cluster installation with IBM Storage Scale integration

Note: Ensure that the CES HDFS cluster configuration is completed and the cluster is up. For more
information, see “CES HDFS” on page 301. If NameNode HA, Kerberos, and/or TLS are enabled on the CES
HDFS cluster, the CDP Private Cloud Base cluster must be set up with the same configuration.
1. Install the Cloudera Manager (CM). For more information on the CDP Private Cloud Base version, see
the CDP Private Cloud Base Installation Guide.
Cloudera has the following two types of installation methods:
• Trial Installation: The trial installation is to install the trial version of CDP Private Cloud Base in
a non-production environment for demonstration and proof-of-concept use cases. This installation
method is recommended for trial deployments but is not supported for production deployments
because it is not designed to scale.
• Production Installation: This topic describes the information for installing CDP Private Cloud Base
using the Production Installation method.
a. Stop the HDFS Transparency services.
On the CES HDFS cluster, stop the HDFS Transparency NameNodes and DataNodes by running the
following commands:

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs-dn stop


# /usr/lpp/mmfs/bin/mmces service stop HDFS -a

Note: You need to stop the HDFS Transparency nodes because the Cloudera Manager can only
manage HDFS Transparency NameNodes and DataNodes if they are started using the Cloudera
Manager.
b. On the Cloudera Manager node, install Cloudera Manager and ensure that you can log in to the
Cloudera Manager GUI.
Perform the following steps to create a new CDP Private Cloud Base cluster:



• Log in to the Cloudera Manager GUI using the following credentials:

username: admin
password: admin

• Upload the CDP Private Cloud Base license.


Note: If you are planning to enable Kerberos and TLS in Cloudera Manager in step 2 and step 3
respectively, then Kerberos and TLS should already be enabled on the CES HDFS cluster.
2. Optional: Enable Kerberos in Cloudera Manager.
It is recommended to enable Kerberos before you proceed to the next step.
To enable Kerberos, see “Kerberos” on page 315.
3. Optional: Enable Auto-TLS in Cloudera Manager.
Ensure that you enable Kerberos in Cloudera Manager before you enable auto-TLS.
To enable TLS, see “Enabling TLS” on page 322.
In Cloudera Manager > Enable Auto-TLS > Generate CA > Trusted CA Certificates Location,
you could leave the Trusted CA Certificates Location field blank so that Cloudera Manager can auto-
generate a new certificate.
If you already have a certificate, enter its path.
4. Deploy the IBM Storage Scale CSD.
• From the IBM Storage Scale cluster, get the gpfs.hdfs.cloudera.cdp.csd-<version-
number>.noarch.rpm package and copy it to the Cloudera Manager node. To get the package
from the self-extracting installed path, see “Downloads” on page 300.
For example,

/usr/lpp/mmfs/5.1.1.0/hdfs_rpms/rhel7/hdfs_3.1.1.x

• As root, log in to the Cloudera Manager node and install the IBM Storage Scale Cloudera Custom
Service Descriptor (CDP CSD) package by running the following command:

# rpm -ivh /root/gpfs.hdfs.cloudera.cdp.csd-<version-number>.noarch.rpm

• Restart the Cloudera Manager server by running the following command:

# systemctl restart cloudera-scm-server.service

• Check for any errors in the /var/log/cloudera-scm-server/cloudera-scm-server.log file.


5. Add a cluster in Cloudera Manager.
a. Ensure the CES HDFS NameNodes and DataNodes are not running.
b. Click Add cluster.
c. Enter the CDP Private Cloud Base cluster name and click Continue.
d. Register Hosts, click Search and then click Continue.
Note:
• Before registering the hosts, ensure that the DNS names across the cluster are resolvable. All
the hostnames from CES HDFS and CDP Private Cloud Base nodes must return FQDN values.
• In addition to registering CDP Private Cloud Base hosts, you must register the HDFS
Transparency NameNode and DataNode hosts from your CES HDFS cluster. To find the HDFS
Transparency nodes, run the following command on the CES HDFS cluster:

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status



• For registering the HDFS Transparency hosts to Cloudera Manager, the ssh private key for root
or the root password of the CES HDFS nodes must be provided in the Cloudera Manager wizard.
After the hosts are registered, you can change the password or remove the private key.
e. Select Repository location. It can be a public Cloudera repository or your own local repository. For
local repository, select Custom Repository and enter the required details.
f. Under CDH and other software, select Use Parcels (Recommended) and click Parcel
Repositories and Network Settings.
g. Add the relevant parcel information in the Remote Parcel Repository URLs tab and remove all the
URLs that are not relevant. Click Close and then click Continue.
h. Under Select JDK, select Install a Cloudera-provided version of OpenJDK and click Continue.
Note: Cloudera needs to have the same version of all the common software, including Java™, on
all the managed hosts. Otherwise, the hosts might report bad health.
i. Under Enter Login Credentials, enter a common userid/password or the ssh private key for the
managed hosts and click Continue.
j. Under Install Agents, wait for all the installations to complete successfully and for the Install
Parcels window to show up.
k. Under Install Parcels, wait for the packages to be downloaded, distributed, unpacked and
activated. Click Continue.
l. Under Inspect Cluster, click Inspect Network Performance and Inspect Hosts. Click Continue.
6. Install services for CDP Private Cloud Base cluster.
During the installation of a new CDP Private Cloud Base cluster, you can specify the services that you
want to install.
a. Select the services that you want to install on your CDP Private Cloud Base cluster. IBM Storage
Scale service must be included as part of this initial cluster creation.
Note:
• CDP Private Cloud Base with IBM Storage Scale is supported only with a new installation setup
with IBM Storage Scale service as the file system. The minimum services required to be selected
are Zookeeper, Yarn and IBM Storage Scale service. If you are planning to use the Ranger, Solr
and Atlas services, it is recommended to include them at the time of initial cluster creation.
• Creating a CDP Private Cloud Base cluster without the IBM Storage Scale service and then adding
the IBM Storage Scale service later is not supported.
• While creating the new CDP Private Cloud Base cluster, do not include the HDFS service. If you
include the HDFS service, unrecoverable errors might occur.
• Do not place any Hadoop services, other than the IBM Storage Scale service, on the CES HDFS
cluster hosts. However, it is permitted and recommended to add Gateway roles for other Hadoop
services on the CES HDFS cluster hosts.
• If you need Ranger, add Solr and Ranger services together with the IBM Storage Scale service at
the time of initial cluster creation. Otherwise, if you want to add Ranger later, a workaround is
needed for Solr as mentioned in Solr does not start after adding Ranger.
• If you need Hive, Livy or Oozie, enable the proxyuser settings for HDFS Transparency for these
CDP Private Cloud Base services by following Enable proxyuser settings for HDFS Transparency.
This step is not needed if the CES HDFS cluster was created using the IBM Storage Scale Install
Toolkit.
• Hive on Tez should be installed for HiveServer2 (HS2) for all the Hive tables (managed
and external tables). For more information, see Hive on Tez introduction in the Cloudera
documentation.
• If you need Oozie, configure dfs.namenode.fs-limits.min-block-size = <dfs.blocksize>
on the client side through the Cloudera Manager. For more information, see item “15” on page
348 on CDP troubleshooting.



b. In the IBM Storage Scale service installation wizard, perform the following:
i) Assign the NameNode and DataNode roles based on the actual CES HDFS NameNode and
DataNode hosts.
ii) Assign the Gateway roles to one or more CDP Private Cloud Base nodes. Assigning these roles
help in creating the HDFS client config xmls under /etc/hadoop. These xmls are required to
run the HDFS client commands.
iii) Set the following IBM Storage Scale parameters:
• default_fs_name to hdfs://<myceshost>:8020
• webhdfs_url to http://<myceshost>:50070/webhdfs/v1
• transparency.namenode.http.port to 50070. This is default NameNode JMX metrics
port.
• transparency.datanode.http.port to 1006. This is DataNode JMX metrics port. The
default value is 9864 if Kerberos is disabled and 1006 if Kerberos is enabled.
Note:
– In this example, the hostname corresponding to CES IP configured on HDFS Transparency is
<myceshost>.
– 8020/50070 are the default RPC and HTTP ports for NameNodes. If you are not using these
default ports, update the parameters accordingly.
c. If you are adding Ranger, additional configurations are needed for HDFS Transparency. For
information on configuring HDFS Transparency and the required configuration parameters needed
for the Ranger service, see “Enabling Ranger” on page 316.
d. Save the changes and then proceed to configure other services, followed by starting all the
services. Ensure that all the services have started successfully and that there are no errors.
e. After Cloudera Private Cloud Base is created, additional Kerberos specific inputs may be required
for the IBM Storage Scale service. Set the following IBM Storage Scale parameters:
• Go to IBM Spectrum Scale service > Configuration > HDFS Client Advanced Configuration
Snippet (Safety Valve) for hdfs-site.xml and add the following custom parameters:
– Add the dfs.namenode.kerberos.principal.pattern parameter and set its value as
NameNode principal regular expression. This could be as open as *.
– Add the hadoop.security.service.user.name.key.pattern parameter and set its
value as *.
• spectrumscale_keytab is the actual path of the keytab file configured for HDFS Transparency
NameNode. The default value is /etc/security/keytabs/nn.service.keytab. Update this
parameter if the default path is not used.
• scale_hdfs_principal_name is the actual Kerberos principal configured for HDFS
Transparency NameNode. The default value is nn. Update this parameter if the default path is
not used.
7. Enable NameNode HA.
Enable HA if the CES HDFS cluster NameNode HA is enabled.
To enable NameNode HA, see “Enabling NameNode HA” on page 310.
8. Verify the CDP Private Cloud Base with IBM Storage Scale environment.
To verify the cluster, see “Verifying installation” on page 309.



Verifying installation
After you have deployed CDP Private Cloud Base with the IBM Storage Scale service, verify the installation
setup.
1. If the CES HDFS cluster is Kerberos enabled, ensure that Kerberos is properly configured for HDFS
Transparency. For more information, see “Verifying Kerberos” on page 127.
2. On the CES HDFS cluster, ensure that there is an active NameNode by running the following
commands. If Kerberos is enabled, a Kerberos token for the HDFS user is needed.

# kinit -kt /etc/security/keytabs/ces-<clustername>.headless.keytab ces-<clustername>@<Realm name> -c /var/mmfs/tmp/krb5cc_ces

where <clustername> is the cluster name of your CES HDFS cluster and <Realm name> is the realm name
of your Kerberos, for example, IBM.COM.

# export KRB5CCNAME=/var/mmfs/tmp/krb5cc_ces
# hdfs haadmin -getAllServiceState
Namenode1.gpfs.net:8020 active
Namenode2.gpfs.net:8020 standby

You must see one active NameNode.


3. To verify the regular HDFS client, run the following command:

# hdfs dfs -ls /

4. Log in to one of the CDP Private Cloud Base cluster nodes that is set up with a Gateway role for the IBM
Storage Scale service. These gateway nodes have the configuration files required by the HDFS client.
5. If Kerberos is enabled, the HDFS client needs a valid Kerberos token before it can access the IBM
Storage Scale file system. You can create a regular OS user across all the nodes and then define a user
principal and optionally a keytab file corresponding to that user on your KDC server.
In the following example, a user named testuser is created with the principal name testuser@<Realm
Name> and the keytab filename testuser.headless.keytab.
• Create a regular OS user on all the nodes by following step 5 in “Configuring users, groups and file
system access for IBM Storage Scale” on page 302.
• Log into the KDC server and create a Kerberos principal and keytab for the testuser.

# kadmin.local addprinc -randkey -maxrenewlife 7d +allow_renewable testuser


# kadmin.local ktadd -k /etc/security/keytabs/testuser.headless.keytab testuser@<Realm Name>

• Obtain a token for the testuser.

# kinit -kt /etc/security/keytabs/testuser.headless.keytab testuser@<Realm Name>


# klist

6. To verify the regular HDFS client, run the following command:

# echo "hello world" > /tmp/hello


# hdfs dfs -ls /
# hdfs dfs -put /tmp/hello /tmp
# hdfs dfs -cat /tmp/hello

7. To verify WebHDFS client, run the following command:

# echo "hello world" > /tmp/hello


# hdfs dfs -ls webhdfs://<CES HDFS Cluster-name>/
# hdfs dfs -put /tmp/hello webhdfs://<CES HDFS Cluster-name>/tmp/
# hdfs dfs -cat webhdfs://<CES HDFS Cluster-name>/tmp/hello

where, <CES HDFS Cluster-name> is the name of your HDFS namespace.



Configuring

Enabling NameNode HA
This topic lists the steps to enable NameNode HA within CDP Private Cloud Base with IBM Storage Scale
service in Cloudera Manager.
Note: Because the compute and storage architecture are decoupled, the server-side administration of
NameNode HA is managed by the IBM Storage Scale CES protocol. Unlike native HDFS, Zookeeper/zkfc is
not used for IBM Storage Scale NameNode HA.
The following are the two steps to the HA enablement process:
• Server side: NameNode HA can be enabled in the CES HDFS cluster during the installation and
deployment using the IBM Storage Scale installation toolkit. However, if NameNode HA is not enabled
on your CES HDFS cluster, follow “Change CES HDFS NON-HA cluster into CES HDFS HA cluster” on
page 67 to enable it.
• Client side: Enable NameNode HA for the IBM Storage Scale service in Cloudera Manager by enabling NameNode HA for the CDP Private Cloud Base cluster. This ensures that when a NameNode failover event occurs in the IBM Storage Scale CES HDFS cluster, HDFS clients and Hadoop workloads running on the CDP Private Cloud Base cluster retry the connection in the same way as in a native HDFS HA environment.
In the following procedure, the HDFS Transparency cluster name is <cluster-name> and the hostname
corresponding to CES IP configured on HDFS Transparency is <myceshost>:
1. From the Cloudera Manager GUI, stop all services.
2. In the Cloudera Manager GUI, modify default_fs_name (Default File System URL) from hdfs://
<myceshost>:8020 to hdfs://<cluster-name>.
3. In the Cloudera Manager GUI, add the following configurations:
• Click SpectrumScale > Actions Configuration > Cluster-wide Advanced Configuration Snippet
(Safety Valve) for core-site.xml and set the custom parameter to the following value:
fs.defaultFS to hdfs://<cluster-name>
• Click Spectrum Scale > Configuration > Transparency NameNode Advanced Configuration
Snippet (Safety Valve) for core-site.xml and set the same custom fs.defaultFS parameter to the
following value:
fs.defaultFS to hdfs://<cluster-name>
• Click SpectrumScale > Configuration > HDFS Client Advanced Configuration Snippet (Safety
Valve) for hdfs-site.xml and set the custom parameters to the following values:
– dfs.nameservices to <cluster-name>
– dfs.ha.namenodes.<cluster-name> to nn1
– dfs.namenode.rpc-address.<cluster-name>.nn1 to <myceshost>:8020
– dfs.namenode.http-address.<cluster-name>.nn1 to <myceshost>:50070
– dfs.client.failover.proxy.provider.<cluster-name> to
org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
– If the CES HDFS cluster is NameNode HA enabled, set dfs.namenode.https-
address.<cluster-name>.nn1 to <myceshost>:50470.
– Under SpectrumScale > Configuration, search for webhdfs_url parameter and set the value to
blank.
Note: 8020, 50070 and 50470 are the default RPC, HTTP and HTTPS ports for NameNode. If you are
not using these default ports, you must update the parameters accordingly.
4. Save the changes.

5. Restart the services with stale client configuration.
6. To view the NameNode Active/Standby states using IBM Storage Scale CLI, see “Monitoring
NameNodes” on page 340 and to view the NameNode Active/Standby states using Cloudera Manager,
see “Monitoring” on page 340.
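For a quick command-line check, the same haadmin query used during installation verification can be run from a CES HDFS NameNode; for example:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

You must see one active NameNode and, with HA enabled, one standby NameNode.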

Dual-network deployment
The installation toolkit does not support dual-network configuration. Therefore, you need to configure the IBM Storage Scale dual network manually.
This section describes the recommended network configuration for CDP Private Cloud Base, HDFS
Transparency and IBM Storage Scale cluster if more than one network is configured in the environment.
For example:

[root@c902f09x05 ~]# mmlscluster


GPFS cluster information
========================
GPFS cluster name: SS5022.gpfs.net
…………….
…………….
Node Daemon node name IP address Admin node name Designation
-------------------------------------------------------------------------------
1 c902f09x05-eth4.gpfs.net 128.20.1.26 c902f09x05.gpfs.net quorum
2 c902f09x07-eth4.gpfs.net 128.20.1.28 c902f09x07.gpfs.net quorum
3 c902f09x08-eth4.gpfs.net 128.20.1.29 c902f09x08.gpfs.net quorum
4 c902f09x06-eth4.gpfs.net 128.20.1.27 c902f09x06.gpfs.net

In the above example, the Daemon node name and IP address fields correspond to the Daemon
network used for data traffic in IBM Storage Scale and the Admin node name corresponds to
the network used for running IBM Storage Scale administration commands (such as mmlscluster,
mmgetstate etc).
The Admin node name and Daemon node name can be changed by using the mmchnode command. For
more information, see GPFS node adapter interface names in the IBM Storage Scale: Concepts, Planning,
and Installation Guide.
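As an illustrative sketch, using the example hostnames from the mmlscluster output above (substitute your own node and interface names, and check the mmchnode prerequisites for your release, such as stopping GPFS on the affected node, before changing interfaces):

## sketch: point the daemon (data) network interface of a node at the high-speed adapter
# /usr/lpp/mmfs/bin/mmchnode --daemon-interface=c902f09x05-eth4.gpfs.net -N c902f09x05.gpfs.net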
In a dual network environment, there are two networks: Network 1 and Network 2. The following are the
recommended network setup configuration options for the IBM Storage Scale cluster:
1. Deploy Cloudera components, HDFS Transparency, CES IP/Hostname and IBM Storage Scale Admin
network in a common network. For example, Network 1.
The CDP Private Cloud Base service daemons (for example, Yarn ResourceManager) and HDFS
Transparency daemons (for example, NameNode) should be in the same network to be able to
communicate with each other over RPC.
2. Deploy IBM Storage Scale daemon network on the other network. For example, Network 2. Usually this
is the high-speed network for IBM Storage Scale data traffic.
The separation of IBM Storage Scale admin and daemon networks offers the following benefits:
• If the cluster is busy with heavy I/O, using the same network for admin and daemon can cause
administrative commands to run slower.
• IBM Storage Scale Admin network requires passwordless-ssh from one central node to the other nodes.
The daemon network does not require ssh. This can help to restrict ssh access to only one selected
network to address security concerns.

Updating configuration
This section provides a high-level overview on making the configuration changes applicable to HDFS
Transparency in a CDP integrated environment.

Manually updating the configuration
HDFS has both server and client configurations. Unlike Cloudera Hortonworks Data Platform (HDP) Ambari, which manages the HDFS server and client configurations with a common set of .xml files, Cloudera Manager (CM) configures the HDFS server and client configurations separately.
For more information, see Cloudera Manager server and client configuration section in the CDP Private
Cloud Base documentation.
After IBM Storage Scale is integrated, Cloudera Manager no longer manages the HDFS server-side
configurations. Cloudera Manager manages only the HDFS client-side configurations. This aligns with
the separation of compute and storage architecture between Cloudera CDP Private Cloud Base and IBM
Storage Scale.
The following table lists an example of how the hdfs-site.xml parameters specific to the server and
the client roles are managed:

Table 33. Example showing hdfs-site.xml parameters management

HDFS server configuration:
• The IBM Storage Scale CCR repository will sync the .xml files into the /var/mmfs/hadoop/etc/hadoop directory.
• The values in Cloudera Manager GUI > IBM Spectrum Scale service > Transparency NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml are synced to a private per-process directory, under /var/run/cloudera-scm-agent/process/process-name. For more information, see Cloudera Manager server and client configuration. (Comment: These configurations are presented in the Cloudera Manager GUI but are not used by IBM Storage Scale HDFS Transparency.)

HDFS client configurations:
• The values in Cloudera Manager GUI > IBM Spectrum Scale service > Configuration > HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml are synced to the /etc/hadoop/conf directory on the nodes that have the IBM Storage Scale Gateway roles so that they can be consumed by other Cloudera services and HDFS clients. (Comment: This is the HDFS client shipped with CDP Private Cloud Base that is leveraged by other CDP services.)
• The HDFS .xml files are located in the /var/mmfs/hadoop/etc/hadoop directory. (Comment: This is the HDFS client shipped with HDFS Transparency. This HDFS client is not commonly used in a CDP integrated environment.)

Important: The procedure to update the configuration varies depending on which HDFS component is
being changed.
• HDFS server components:
– HDFS Transparency NameNodes
– HDFS Transparency DataNodes
• HDFS client components:
– The hdfs, webhdfs commands and Java APIs from the HDFS client shipped with CDP Private Cloud Base.
– The hdfs, webhdfs commands and Java APIs from IBM Storage Scale under /usr/lpp/mmfs/
hadoop/bin/ directory.
Follow the specific process to update the configuration based on the server or client configurations from
CDP Private Cloud Base or IBM Storage Scale HDFS Transparency:

• “Updating only the HDFS server configurations” on page 313
• “Updating only the HDFS client (CDP Private Cloud Base) configurations” on page 313
• “Updating Ranger configurations” on page 313
• “Updating Kerberos configurations” on page 314

Updating only the HDFS server configurations


To update the server-side configuration (for example, dfs.namenode.handler.count value), run the
following steps:
1. Stop the HDFS Transparency services from the Cloudera Manager GUI by clicking Cloudera Manager >
IBM Spectrum Scale service > Stop.
2. Log in to one of the CES HDFS NameNodes and update the server-side configuration by running the
mmhdfs config set command and then uploading the changed configuration to the IBM Storage
Scale CCR repository using the mmhdfs config upload command.

# mmhdfs config set hdfs-site.xml -k dfs.namenode.handler.count=800


# /usr/lpp/mmfs/hadoop/bin/mmhdfs config upload

For more information on the mmhdfs command, see IBM Storage Scale: Command and Programming
Reference Guide.
3. Start the HDFS Transparency services from the Cloudera Manager GUI by clicking Cloudera Manager >
IBM Spectrum Scale service > Start.
Note: You must start the HDFS Transparency services from Cloudera Manager so that Cloudera can
display the states of the NameNodes and DataNodes properly.
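To double-check that the uploaded value is what the NameNodes will read, the config export subcommand used elsewhere in this chapter can be used; a minimal sketch:

# mkdir /tmp/checkconf
# mmhdfs config export /tmp/checkconf hdfs-site.xml
# grep -A1 dfs.namenode.handler.count /tmp/checkconf/hdfs-site.xml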

Updating only the HDFS client (CDP Private Cloud Base) configurations
To change the client-only configuration (for example, adding or updating the dfs.client.* values in hdfs-
site.xml), run the following steps:
• On the Cloudera Manager GUI, click Cloudera Manager > IBM Spectrum Scale service > Configuration
> HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml > Update the
configuration > Save.
• On the Cloudera Manager GUI, click Deploy the client configuration to propagate the updated client
configuration from the Cloudera Manager database to /etc/hadoop/conf.
Note: You do not need to restart the HDFS Transparency service for the changes to take effect.
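To confirm that the deployed client configuration reached a gateway node, the updated file can be inspected directly; a sketch (the grep pattern is only an example):

# grep -A1 dfs.client /etc/hadoop/conf/hdfs-site.xml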

Updating Ranger configurations


Ranger is closely integrated with the HDFS server and client. The Ranger plug-in runs within the NameNode process space. When the Ranger plug-in is enabled for HDFS, Ranger-specific .xml files (ranger-hdfs-security.xml, ranger-hdfs-policymgr-ssl.xml, and ranger-hdfs-audit.xml) are generated within a private directory specific to the HDFS Transparency NameNode process, under the /var/run/cloudera-scm-agent/process/process-name directory. For more information, see Cloudera Manager server and client configuration.
When you restart the HDFS Transparency NameNodes from the Cloudera Manager GUI, the
Ranger configuration files are synced to the /var/mmfs/hadoop/etc/hadoop HDFS Transparency
configuration directory. But these updates are not uploaded to the IBM Storage Scale CCR repository.
Therefore, any Ranger specific configuration changes require a workaround to get into the CCR.
To start the HDFS Transparency NameNodes correctly, update the IBM Storage Scale CCR by following the steps in NameNodes do not start after updating the Ranger configuration.
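After such a restart, a quick way to confirm that the Ranger files were synced into the HDFS Transparency configuration directory is to list them on a NameNode; a sketch:

# ls -l /var/mmfs/hadoop/etc/hadoop/ranger-hdfs-*.xml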

Updating Kerberos configurations
Cloudera Manager does not manage Kerberos for the IBM Storage Scale service because the CDP Private
Cloud Base cluster and the CES HDFS cluster are different clusters and are loosely integrated. Therefore,
Kerberos setup needs to be manually enabled first on the CES HDFS cluster.
In the Cloudera Manager GUI, the Enable Kerberos action under Cluster name > Action has no effect on
the HDFS server-side configuration. This enablement only enables Kerberos for the HDFS client-side. The
HDFS client-side configuration files under /etc/hadoop/conf are updated to reflect the updates from
the Cloudera Manager Kerberos enablement.
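A simple sketch to confirm that the client-side configuration was switched to Kerberos (the property is the standard Hadoop authentication setting and its expected value is kerberos):

# grep -A1 hadoop.security.authentication /etc/hadoop/conf/core-site.xml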
To make Kerberos-specific changes to HDFS, see “Updating only the HDFS server configurations” on page
313.
Note:
• Cloudera Manager requires the following Kerberos-specific information from HDFS Transparency during
the initial deployment to create the configuration files and directories properly:
– NameNode keytab location (parameters: spectrumscale_keytab, service-wide)
– HDFS Principal Name (parameters: scale_hdfs_principal_name, service-wide)
• If the default NameNode principal name (nn) or NameNode keytab path (/etc/security/keytabs/nn.service.keytab) on the HDFS server side is changed, the corresponding parameters must also be changed in Cloudera Manager.

Administering

Adding nodes
Perform the following steps before adding the new NameNodes and DataNodes to the CDP Private Cloud
Base cluster:
1. Before adding the new nodes to the CDP Private Cloud Base cluster, you must add them to the
IBM Storage Scale HDFS Transparency cluster. For information on adding the nodes to the HDFS
Transparency cluster, see Administration guide.
2. If Kerberos is enabled, see additional steps under “Prerequisites for Kerberos” on page 64.
3. If TLS is enabled, see step 2.d under Enabling TLS for HDFS Transparency to copy
the IBM Storage Scale certificates files (spectrum_scale_ces_hdfs_truststore.jks,
spectrum_scale_ces_hdfs_keystore.jks and spectrum_scale_ces_hdfs_cacerts.pem)
from the existing HDFS Transparency cluster to the new node.
4. As IBM Storage Scale is a POSIX file system, Hadoop users and groups needed by CDP Private Cloud
Base cluster must have the same UID and GID across all the IBM Storage Scale nodes. To create the
users and groups on the new node, run step 1 through step 3 under “Configuring users, groups and file
system access for IBM Storage Scale” on page 302.
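A quick sketch to spot-check UID and GID consistency for a Hadoop user across nodes (the host names are placeholders; passwordless ssh is assumed):

# for h in newnode1.gpfs.net newnode2.gpfs.net; do echo "== $h =="; ssh $h id hdfs; done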

Adding a new NameNode


This topic lists the steps to add a new NameNode.
On the CES HDFS cluster, perform the following:
1. Add the new NameNode to your CES HDFS cluster by following the steps in CES HDFS Administration
section.
2. Stop the new HDFS Transparency NameNode by running the following command on the new node:

# /usr/lpp/mmfs/bin/mmces service stop HDFS

You need to stop the HDFS Transparency NameNode because the Cloudera Manager can only manage
the HDFS Transparency nodes when they are started using the Cloudera Manager.

On the Cloudera Manager GUI, perform the following:
1. Click IBM Spectrum Scale > Instances > Add Role Instances to add the new NameNode.
2. Restart all the services in the Cloudera Manager.

Adding a new DataNode


This topic lists the steps to add a new DataNode.
On the CES HDFS cluster, perform the following:
1. Add the new DataNode to your CES HDFS cluster by following the steps in CES HDFS Administration
section.
2. Stop the new HDFS Transparency DataNode by running the following command on the new node:

#/usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode stop

You need to stop the HDFS Transparency DataNode because the Cloudera Manager can only manage
the HDFS Transparency nodes when they are started using the Cloudera Manager.
On the Cloudera Manager GUI, perform the following:
1. Click IBM Spectrum Scale > Instances > Add Role Instances to add the new DataNode.
2. Start the new DataNode from the Cloudera Manager.

Kerberos
This section lists the information on enabling and disabling Kerberos.

Enabling Kerberos for CES HDFS and CDP Private Cloud Base
This topic lists the steps to enable Kerberos on the CES HDFS and CDP Private Cloud Base clusters.
Cloudera Manager does not manage Kerberos for the IBM Storage Scale service because CDP Private
Cloud Base cluster and CES HDFS cluster are different clusters and are loosely integrated. Therefore,
Kerberos setup needs to be manually enabled first on the CES HDFS cluster. To enable Kerberos on
the CES HDFS cluster, you need to create the service principals and keytabs for HDFS Transparency
NameNodes and DataNodes and also ensure that the HDFS client can access the IBM Storage Scale file
system. Then enable Kerberos for the CDP Private Cloud Base cluster, which kerberizes the rest of the CDP Private Cloud Base services, for example, YARN, Zookeeper, and Hive.
Overview flow for Kerberos is as follows:
1. Enable Kerberos for HDFS Transparency
2. Enable Kerberos for Cloudera Manager
3. Enable Kerberos for IBM Storage Scale service
Enabling Kerberos for HDFS Transparency
1. From Cloudera Manager stop all the services including the IBM Storage Scale service.
2. You can use the KDC server that is set up for your CDP Private Cloud Base cluster. If the KDC server
has not been set up, create a new KDC server by following the “Setting up the Kerberos server”
on page 108 section for manual setup or follow the “Configuring Kerberos using the Kerberos script
provided with IBM Storage Scale” on page 117 section.
3. Enable Kerberos configurations for HDFS Transparency by following the “Setting up Kerberos for HDFS
Transparency nodes” on page 109 section for manual setup or follow the “Configuring Kerberos using
the Kerberos script provided with IBM Storage Scale” on page 117 section.
4. Follow the “Verifying Kerberos” on page 127 section to ensure that Kerberos is properly configured for
HDFS Transparency.
Enabling Kerberos for Cloudera Manager

1. On the Cloudera Manager GUI, click Cluster name > Action > Enable Kerberos and enter the following
details:
a. Select aes256-cts-hmac-sha1-96 as the encryption type.
b. Enter the KDC Admin principal (for example, root/admin) and password.
c. Enter the KDC Realm.
d. Enter the KDC hostname FQDN.
e. Do not manage the krb5.conf file through Cloudera Manager. Therefore, uncheck the box.
f. Copy the /etc/krb5.conf file from your KDC server host to all the CDP Private Cloud Base cluster
hosts.
g. For Cloudera Manager to create principals for the Hadoop services, enter account credentials for the KDC account that has the permissions to create other principals.
Enabling Kerberos for IBM Storage Scale service
1. Add the IBM Storage Scale parameters needed for Kerberos by following the steps under “Installing
Cloudera Data Platform Private Cloud Base with IBM Storage Scale” on page 304.
2. Verify that the HDFS client from the CDP Private Cloud Base nodes can access the storage through
HDFS Transparency.
a. Verify that you can list the directories/files from the HDFS client. Run the following command from
any CDP Private Cloud Base host:

hadoop fs -ls /

b. Verify that the WebHDFS client can also list the directories/files. Run the following command from
any CDP Private Cloud Base host:

hadoop fs -ls webhdfs://<cluster-name>

Disabling Kerberos
You cannot disable Kerberos for the CDP Private Cloud Base cluster after enabling it. Cloudera does not provide an option to disable Kerberos.

Ranger
Ranger provides centralized authorization through an extensible plug-in-based model. Ranger plug-ins run in the host components, for example, NameNode, HiveServer2, and Kafka Broker.
Ranger provides the following functionalities:
• Provides role-based access control to HDFS resources. For example, it provides rwx policies to HDFS
files and directories. Also, these policies can be defined on the Ranger Admin GUI.
• Leverages Solr to store temporary audit logs.
• Leverages HDFS Transparency to store persistent audit logs.
The Ranger plug-in is loaded by the NameNode at startup. The plug-in downloads Ranger policies from
the Ranger Admin server periodically and caches them locally. This ensures better performance as the
NameNode can look up the policies locally rather than having to connect to the Ranger server every time
an authorization check is needed.
The ranger.plugin.hdfs.policy.pollIntervalMs parameter in ranger-hdfs-security.xml determines how frequently the plug-in polls the Ranger Admin server to refresh its cached policies. It defaults to 30 seconds.
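To see the poll interval that a running NameNode was handed by Cloudera Manager, the generated file in its per-process directory can be inspected; a sketch (the directory glob is an assumption about the process-name layout described above):

# grep -A1 pollIntervalMs /var/run/cloudera-scm-agent/process/*/ranger-hdfs-security.xml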

Enabling Ranger
This topic lists the steps to enable Ranger.
Prerequisites

• Ensure that the CES HDFS cluster is functional. If the IBM Storage Scale service is already added to
Cloudera Manager, ensure that the CDP Private Cloud Base cluster that is integrated with the IBM
Storage Scale service is functional as well. For verifying, follow the steps listed in “Verifying installation”
on page 309.
• Before you enable Ranger, Kerberos must be enabled on CES HDFS and the CDP Private Cloud Base
clusters.
• All the CDP Private Cloud Base services should have Kerberos enabled. Click Administration > Security
> Status tab > Kerberos > Enabled check-box. You must see all the services enabled for Kerberos. For
more information, see “Problem determination” on page 344.
• Before creating the database for Ranger, ensure that you perform a workaround for MySQL/MariaDB by
following Installing Ranger service may fail with the following SQL error from MySQL/MariaDB.
It is recommended to add the Solr and Ranger services together with the IBM Storage Scale service at the
time of initial CDP Private Cloud Base cluster creation. However, you can add these services later as well.
Overall flow required to enable Ranger is as follows:
1. If the CDP Private Cloud Base cluster with IBM Storage Scale already exists, then stop all the cluster
services.
2. Add the Solr and Ranger services in Cloudera Manager.
3. Configure the IBM Storage Scale service for Ranger.
4. Configure the CES HDFS cluster for Ranger.
5. Start all CDP Private Cloud Base cluster services.
Procedure
1. Stop all the CDP Private Cloud Base cluster services from Cloudera Manager.
If the HDFS Transparency services were started manually using the mmhdfs/mmces commands, stop
them as well.
2. Add the Solr and Ranger services.
If Solr is being added to an existing CDP Private Cloud Base cluster with IBM Storage Scale, then the
following workaround is needed for Solr to start properly:
• In Cloudera Manager console, click Solr > Configuration, search for ZNode and set the value of the
Solr configuration parameter ZooKeeper ZNode to /solr-infra.
• Ensure that the Kerberos configuration checkbox is enabled for the Solr and Ranger services.
• After adding Ranger, the Solr service changes its name to CDP-INFRA-SOLR.
3. Configure the IBM Storage Scale service for Ranger.
a. Click IBM Spectrum Scale > Configuration. Search for hadoop.security.authorization and
enable this option by clicking on the checkbox.
b. Go to Cloudera Manager, click IBM Spectrum Scale > Configuration > Transparency NameNode
Advanced Configuration Snippet (Safety Valve) for ranger-hdfs-security.xml, and add the
following custom configuration:

Name: ranger.plugin.hdfs.policy.rest.ssl.config.file
Value: ranger-hdfs-policymgr-ssl.xml

c. Click IBM Spectrum Scale > Configuration > Transparency NameNode Advanced Configuration
Snippet (Safety Valve) for ranger-hdfs-policymgr-ssl.xml and add the following custom
configurations:

<property>
  <name>xasecure.policymgr.clientssl.truststore</name>
  <value>/var/lib/cloudera-scm-agent/agent-cert/spectrum_scale_ces_hdfs_truststore.jks</value>
</property>
<property>
  <name>xasecure.policymgr.clientssl.truststore.type</name>
  <value>jks</value>
</property>
<property>
  <name>xasecure.policymgr.clientssl.truststore.credential.file</name>
  <value>jceks://file/var/lib/cloudera-scm-agent/agent-cert/rangerpluginssl.jceks</value>
</property>
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://file/var/lib/cloudera-scm-agent/agent-cert/rangerpluginssl.jceks</value>
</property>

d. If you are using CDP Private Cloud Base 7.1.8 or later, in Cloudera Manager navigate to Core
Configuration Service > Configuration and search for Additional Rules to Map Kerberos
Principals to Short Names.
If you are using the earlier versions of CDP Private Cloud Base, in Cloudera Manager navigate to
IBM Spectrum Scale > Configuration and search for Additional Rules to Map Kerberos
Principals to Short Names.
Then, change the {DEFAULT_RULES} value to the following set of rules as needed for Ranger and
HDFS Transparency:

RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/ranger/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/rangertagsync/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/rangerusersync/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/keyadmin/
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0]([email protected])s/@.*//
RULE:[1:$1@$0](.*@IBM.COM)s/@.*//

Note: Additional Rules to Map Kerberos Principals to Short Names is translated to auth_to_local rules in core-site.xml. Cloudera Manager appends some more rules followed by 'DEFAULT' as the last line. Do not add 'DEFAULT' as the last line in this text field because it prevents the translation of the rules appended by Cloudera Manager.
Replace IBM.COM with your Kerberos Realm name in the above example rules.
e. Save and deploy the client configuration. Do not start any services yet.
4. Configure HDFS Transparency for Ranger.
a. Enabling Ranger requires setting the server-side configuration on the HDFS Transparency
NameNodes.
i) Update the HDFS Transparency configuration files and upload the changes.
• Log in to one of the CES HDFS NameNode.
• Get the config files by running the following commands:

# mkdir /tmp/hdfsconf
# mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh,gpfs-
site.xml
# cd /tmp/hdfsconf/

• Update the config files in /tmp/hdfsconf with the following changes based on your
environment:
File: core-site.xml

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/ranger/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/rangertagsync/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/rangerusersync/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/keyadmin/
….. <other existing rules>
DEFAULT

</value>
</property>

File: hdfs-site.xml

<property>
<name>dfs.permissions</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.ContentSummary.subAccess</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.inode.attributes.provider.class</name>
<value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value>
</property>

File: gpfs-site.xml

<property>
<name>gpfs.ranger.enabled</name>
<value>scale</value>
</property>

File: hadoop-env.sh
Note: Based on your environment, substitute the right path to the CDH ranger-hdfs-plugin
library.

for f in /opt/cloudera/parcels/CDH/lib/ranger-hdfs-plugin/lib/*.jar;
do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

for f in /opt/cloudera/parcels/CDH/lib/hadoop/client/jersey-client.jar;
do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p0.15945976/lib/hadoop/client/jackson-core-asl.jar

When Auto-TLS is enabled in Cloudera Manager and any of the following conditions is true, additional configuration is needed:
– Ranger service is enabled after the creation of the initial IBM Storage Scale integrated CDP
cluster
– auto-TLS is enabled in Cloudera manager after the creation of the initial IBM Storage Scale
integrated CDP cluster
– Ranger High Availability (HA) is enabled
File: hadoop-env.sh

export HADOOP_CREDSTORE_PASSWORD=none

ii) Import the files into CES HDFS cluster by running the following command:

# mmhdfs config import /tmp/hdfsconf core-site.xml


# mmhdfs config import /tmp/hdfsconf hdfs-site.xml
# mmhdfs config import /tmp/hdfsconf hadoop-env.sh

iii) Upload the changes to CES HDFS cluster by running the following command.

# mmhdfs config upload

iv) Additional configurations when TLS is enabled:

• Ranger needs the IBM Storage Scale CES HDFS Truststore password to be encrypted in
a jceks file. On a CES NameNode, create the jceks file for Ranger with the following
command:

# java -cp "/opt/cloudera/parcels/CDH/lib/ranger-hdfs-plugin/install/lib/*" org.apache.ranger.credentialapi.buildks create "sslTrustStore" -value <truststore_password> -provider "jceks://file/var/lib/cloudera-scm-agent/agent-cert/rangerpluginssl.jceks" -storetype "jceks"

Replace <truststore_password> with the corresponding actual password from under “Manually enabling TLS for HDFS Transparency” on page 131, or if the automation script was used, retrieve the same from the /var/mmfs/hadoop/etc/hadoop/ssl-server.xml file.
• Validate the above jceks file (optional):

# HADOOP_CREDSTORE_PASSWORD=none java -cp /opt/cloudera/cm/lib/security-*.jar com.cloudera.enterprise.crypto.GenericKeyStoreTypePasswordExtractor "jceks" "/var/lib/cloudera-scm-agent/agent-cert/rangerpluginssl.jceks" "sslTrustStore"

• Distribute the above jceks file to all the CES HDFS NameNodes.
For each CES HDFS NameNode, run the following command:

scp /var/lib/cloudera-scm-agent/agent-cert/rangerpluginssl.jceks root@<CES HDFS Host>:/var/lib/cloudera-scm-agent/agent-cert/

v) Ensure that you create ranger, rangertagsync, rangerusersync and keyadmin user using the
gpfs_create_hadoop_users_dirs.py script. Log in to a CES HDFS NameNode and run the
following commands:

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group ranger
/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group rangertagsync
/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group rangerusersync
/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-user-group keyadmin

5. Start services:
• If TLS is enabled, start the NameNodes using the workaround mentioned in the known issue, and then start all the services from Cloudera Manager.
• If TLS is not enabled, start all the services from Cloudera Manager as usual.

Verifying Ranger policy


After Ranger is configured, you should verify that the Ranger-based resource control is working properly.
Go to Cloudera Manager > Ranger Admin GUI > HDFS service > Plugin Status.
If the Ranger HDFS plug-in running with the HDFS Transparency NameNode and the Ranger Admin
service are able to communicate with each other successfully, you should see an active HDFS plug-in
under this tab.
This section shows an example (Kerberos enabled) of how to verify an HDFS resource policy using the testuser OS user ID that is created in step 5 of “Verifying installation” on page 309.
1. Log in to any CDP node and obtain a Kerberos token for the testuser principal.

# kinit -kt /etc/security/keytabs/testuser.headless.keytab testuser@<Realm Name>

2. Create an HDFS resource and ensure that testuser can write to and append to the HDFS resource:

# echo "hello test ranger policy" > /tmp/rangertest


# hdfs dfs -put /tmp/rangertest /tmp
# hadoop fs -appendToFile /tmp/rangertest /tmp/rangertest

3. Create a Ranger policy for the /tmp/rangertest HDFS resource:
a. Log into the Ranger Admin GUI
• No SSL: http://<Ranger Admin host>:6080
• SSL: https://<Ranger Admin host>:6182
b. Create a policy for HDFS resource /tmp/rangertest to allow only read access for the testuser.
c. Set Deny All Other Accesses to TRUE.
4. Save the policy and wait for 20 seconds (configurable using
ranger.plugin.hdfs.policy.pollIntervalMs) for the policy to be downloaded to the
NameNode.
5. Ensure that as testuser you can READ the file. As the Ranger policy is in place, WRITE must be denied.

# hdfs dfs -cat /tmp/rangertest


# hdfs dfs -appendToFile /tmp/rangertest /tmp/rangertest
appendToFile: Permission denied: user=testuser, access=WRITE, inode=/tmp/rangertest

Note: Ranger policies are not enforced for users that are a part of the HDFS supergroup
dfs.permissions.superusergroup, as they are considered as superusers of the HDFS file
system. Therefore, use a regular user ID to validate the Ranger policies.
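As a cross-check that the denial comes from the Ranger policy rather than from POSIX permissions, the same append performed as a superuser (for example, the hdfs user) is expected to succeed; a sketch:

# kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@<Your Realm Name>
# hdfs dfs -appendToFile /tmp/rangertest /tmp/rangertest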

Disabling Ranger
This topic lists the steps to disable Ranger.
Perform the following steps on the CDP Private Cloud Base cluster and the CES HDFS cluster to disable
Ranger:
1. Configure CES HDFS cluster.
a. Stop all the services from the Cloudera Manager by clicking Cluster-name > Actions > Stop.
b. Update the HDFS Transparency configuration files and upload the changes.
• Get the config files by running the following commands:

# mkdir /tmp/hdfsconf
# mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

• Remove the following Ranger specific configurations from the following config files in /tmp/
hdfsconf:
File: core-site.xml

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/ranger/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/rangertagsync/
RULE:[2:$1@$0]([email protected])s/(.*)@IBM.COM/rangerusersync/
….. <other existing rules>
DEFAULT
</value>
</property>

File: hdfs-site.xml

<property>
<name>dfs.permissions</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.ContentSummary.subAccess</name>
<value>true</value>
</property>

<property>
<name>dfs.namenode.inode.attributes.provider.class</name>
<value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value>
</property>

File: hadoop-env.sh
Note: Based on your environment, substitute the right path to the CDH ranger-hdfs-plugin library.

for f in /opt/cloudera/parcels/CDH/lib/ranger-hdfs-plugin/lib/*.jar;
do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/root/postgres-jar/postgresql-42.1.4.jre7.jar

for f in /opt/cloudera/parcels/CDH/lib/hadoop/client/jersey-client.jar;
do
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
done

• Import the files into CES HDFS cluster by running the following command:

# mmhdfs config import /tmp/hdfsconf core-site.xml


# mmhdfs config import /tmp/hdfsconf hdfs-site.xml
# mmhdfs config import /tmp/hdfsconf hadoop-env.sh

• Upload the changes to CES HDFS cluster by running the following command.

# mmhdfs config upload

2. Configure the IBM Storage Scale service to disable Ranger.


• Click IBM Spectrum Scale > Configuration. Then search for hadoop.security.authorization
and disable this option by unchecking the check box.
• Save and deploy the client configuration.
• Start all the services from the Cloudera Manager.

Transport Layer Security (TLS)


This section describes how to configure TLS on CDP Private Cloud Base clusters integrated with CES
HDFS.
Transport Layer Security (TLS)/Secure Sockets Layer (SSL) provides privacy and data integrity between
applications communicating over a network by encrypting the packets transmitted between the
endpoints. Configuring TLS/SSL for any system typically involves creating a private key and a public key for
use by the server and client processes to negotiate an encrypted connection at runtime.

Enabling TLS
This topic lists the steps to enable Transport Layer Security (TLS) on CDP Private Cloud Base clusters
integrated with CES HDFS.
Following are the steps to enable TLS on CDP Private Cloud Base clusters. Each of these steps is
explained below in detail.
1. Enable TLS for HDFS Transparency
2. Enable Auto-TLS for Cloudera Manager
3. Create Cloudera Data Platform Private Cloud Base with IBM Storage Scale
4. Stop all services from IBM Cloudera Manager
5. Update Cloudera and IBM Storage Scale Trust Stores
6. Update TLS configurations for IBM Storage Scale service
7. Update Metrics configurations for IBM Storage Scale service.
8. Start all services from IBM Cloudera Manager

Note:
• The CDP Private Cloud Base cluster and the CES HDFS Transparency cluster must have Kerberos
enabled before you enable auto-TLS on Cloudera Manager.
• The CES HDFS Transparency cluster must be TLS enabled before you enable auto-TLS on Cloudera Manager.
1. Enable TLS for HDFS Transparency.
Cloudera Manager does not manage TLS for the IBM Storage Scale service because CDP Private Cloud
Base cluster and CES HDFS cluster are loosely integrated, and are considered different clusters even
if the DataNodes are colocated with the CDP Private Cloud Base nodes. Therefore, TLS needs to be
manually enabled on the CES HDFS cluster. Also, verify that TLS is working on CES HDFS before
proceeding to enable TLS on the Cloudera Manager.
For more information, see “Enabling TLS for HDFS Transparency using the automation script” on page
129 and “Manually enabling TLS for HDFS Transparency” on page 131.
2. Enable Auto-TLS for Cloudera Manager.
Enable auto-TLS on Cloudera Manager before creating CDP Private Cloud Base clusters. This enables
TLS for the rest of the CDP Private Cloud Base services (for example, YARN, Zookeeper, Hive, etc).
However, enabling auto-TLS does not automatically enable TLS for IBM Storage Scale. Therefore, TLS
for IBM Storage Scale must first be enabled independently of Cloudera.
3. Create Cloudera Data Platform Private Cloud Base with IBM Storage Scale.
If the CDP Private Cloud Base cluster with IBM Storage Scale is already created, proceed to the next
step. Otherwise, create it now, by following the instructions in “Installing Cloudera Data Platform
Private Cloud Base with IBM Storage Scale” on page 304.
4. Stop all services from IBM Cloudera Manager.
In order to update the certificates, all the services in Cloudera and in IBM Storage Scale CES HDFS
Transparency need to be stopped.
5. Update Cloudera and IBM Storage Scale truststores.
Click Action > Deploy Client configuration from the main cluster view in Cloudera Manager. This propagates ssl-client.xml and the other configuration files to the /etc/hadoop/conf directory. The ssl-client.xml file is referenced later in this step.
Then exchange Cloudera and IBM Storage Scale certificates to each other's truststore. There are two
options to do so, either by using the provided automation or by using the manual procedure.
a. Using Automation -
From IBM Storage Scale 5.1.1.1 and HDFS Transparency 3.1.1-5, you can use the
gpfs_tls_configuration.py script to update Cloudera Manager and the CES HDFS certificates
in each other's trust store so that Cloudera services can work properly with HDFS Transparency
in a TLS-enabled environment. This can be done using the following automation script, or you can
import the certificates manually as mentioned in the following step:
The gpfs_tls_configuration.py script can be used to exchange the certificates.
This script performs following steps:
i) Imports the Cloudera Manager public certificates to IBM Storage Scale trust store.
ii) Imports the IBM Storage Scale certificates to the Cloudera Manager trust store.
Run the following command to exchange the certificates from one of the CES NameNodes:

/usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py integrate-with-cdp CDP-TRUSTSTORE-PASSWORD-FILE

where, CDP-TRUSTSTORE-PASSWORD-FILE is the JSON file containing the cdp-truststore-password.

{
"cdp-truststore-password": "PASSWORD"
}

The CDP trust store password is located in /etc/hadoop/conf/ssl-client.xml under the ssl.client.truststore.password parameter.
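A sketch to look up that value on a CDP node:

# grep -A1 ssl.client.truststore.password /etc/hadoop/conf/ssl-client.xml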
Note: After the script has run successfully, the password file will be automatically deleted for
security reason.
For example:

/usr/lpp/mmfs/hadoop/scripts/gpfs_tls_configuration.py integrate-with-cdp /tmp/cdppassword.json
[ INFO ] Importing Cloudera Manager public certificate (cm-auto-global_cacerts.pem)
to /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks
[ INFO ] Copying updated /etc/security/serverKeys/
spectrum_scale_ces_hdfs_truststore.jks to /var/lib/cloudera-scm-agent/agent-cert/
[ INFO ] Distributing /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks
to all the Transparency nodes
[ INFO ] Adding all scale nodes certificates to cm-auto-global_cacerts.pem
[ INFO ] NOTE: You will need to manually distribute the following files to all the CDP
nodes :
1. /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem
2. /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks

Distribute the following files to all the CDP nodes:

1. /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem
2. /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks

b. Using Manual procedure -
For IBM Storage Scale 5.1.1.0 and HDFS Transparency 3.1.1-4:
Update Cloudera Manager and CES HDFS certificates in each other's trust store so that Cloudera
services can work properly with HDFS Transparency in a TLS enabled environment.
Log into the CES HDFS NameNode containing the HDFS Transparency certificate (.pem) files. To find
that NameNode, see step 2 in “Manually enabling TLS for HDFS Transparency” on page 131.
Log into that NameNode and run the following commands:
• Add Cloudera Manager global CA certificate to CES HDFS Truststore
spectrum_scale_ces_hdfs_truststore.jks.

# keytool -noprompt -importcert -alias cloudera-agents -file /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_truststore.jks

This command prompts the user to enter the CES HDFS Master Truststore password. To find the
Truststore password, see step 5.c of “Manually enabling TLS for HDFS Transparency” on page
131.
• Append the HDFS Transparency certificates to Cloudera global CA certificate agent-cert/cm-
auto-global_cacerts.pem by running the following command:

# cat /etc/security/serverKeys/trust_stores/*.pem >> /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem

• Import the HDFS Transparency certificates to Cloudera global Truststore cm-auto-global_truststore.jks.

For each <.pem file> in the /etc/security/serverKeys/trust_store/ directory, run
the following command:

keytool -noprompt -importcert -alias <FQDN hostname corresponding to the .pem file> -file /etc/security/serverKeys/trust_store/<name of .pem file> -keystore /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks

For example,
If namenode1.gpfs.net.pem is the certificate file corresponding to the host namenode1.gpfs.net,
then run the following command:

keytool -noprompt -importcert -alias namenode1.gpfs.net -file /etc/security/serverKeys/trust_store/namenode1.gpfs.net.pem -keystore /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks

This command prompts the user to enter the Cloudera Manager global Truststore password.
To find the Truststore password, see /etc/hadoop/conf/ssl-client.xml on any Cloudera
node.
• Run the keytool -importcert command for all the .pem files, until all the certificates have
been imported to Cloudera global Truststore.
• Distribute the CES HDFS Truststore to all the HDFS Transparency nodes (NameNodes and
DataNodes). For ease of use, you may use the following bash shell code snippet:

cd /etc/security/serverKeys
export HDFS_TRANSPARENCY_NODES="<space separated list of all HDFS Transparency hostnames FQDN>"
## example: export HDFS_TRANSPARENCY_NODES="nn01.gpfs.net dn01.gpfs.net dn03.gpfs.net"
for hosts in $HDFS_TRANSPARENCY_NODES
do
    scp spectrum_scale_ces_hdfs_truststore.jks ${hosts}:/etc/security/serverKeys
done

• Distribute all the modified IBM Storage Scale and Cloudera Truststore files and certificates (.pem
files) to all the CDP Private Cloud Base cluster nodes, including all the IBM Storage Scale nodes
registered to Cloudera Manager. For ease of use, you may use the following bash shell code
snippet:

export CMSTORES_DIR="/var/lib/cloudera-scm-agent/agent-cert/"
cd /etc/security/serverKeys
cp spectrum_scale_ces_hdfs_truststore.jks ${CMSTORES_DIR}/
cd ${CMSTORES_DIR}/
export ALL_NODES="<space separated list of all Cloudera hostnames FQDN>"
## example: export ALL_NODES="cldr1.gpfs.net cldr2.gpfs.net"
for hosts in $ALL_NODES
do
    scp cm-auto-global_truststore.jks ${hosts}:${CMSTORES_DIR}
    scp cm-auto-global_cacerts.pem ${hosts}:${CMSTORES_DIR}
done

6. Update TLS configurations for IBM Storage Scale service.


a. Go to Cloudera Manager.
b. Click IBM Spectrum Scale > Configuration > HDFS Client Advanced Configuration Snippet
(Safety Valve) for hdfs-site.xml and add the following custom configuration:

<property>
<name>dfs.namenode.https-address.<cluster name>.nn1</name>
<value><CES_HOSTNAME>:50470</value>
</property>
<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>
<property>
<name>dfs.client.https.need-auth</name>
<value>false</value>
</property>

where, <CES_HOSTNAME> is the FQDN Hostname corresponding to the CES IP configured for your
CES HDFS cluster.
<cluster name> is the name of your CES HDFS cluster that is also your HDFS namespace.
If you want both secure and non-secure HTTP connections, set dfs.http.policy to HTTP_AND_HTTPS.
c. Save and deploy the client configuration. Do not start any services yet.
7. Update Metrics configurations for IBM Storage Scale service.
a. Click IBM Spectrum Scale > Configuration and search for the following configurations and update
them as follows:

Spectrum Scale TLS Enabled = true


HDFS Transparency DataNode HTTP Port = 1006
HDFS Transparency NameNode HTTP Port = 50470

Note: These configurations are needed for HDFS metrics to appear in Cloudera Manager when TLS
is enabled.
b. Save and deploy the client configuration.
8. Start services:
• If Ranger is enabled, start the NameNodes using the workaround mentioned in the known issue. Then start all the services from Cloudera Manager.
• If Ranger is not enabled, start all the services from Cloudera Manager as usual.
Tip: Consider the following useful commands:
• To view the contents of a particular keystore, use the keytool -list command. For example, use the
next command to view the certificates in the IBM Storage Scale keystore on the NameNodes:

# keytool -list -keystore /etc/security/serverKeys/spectrum_scale_ces_hdfs_keystore.jks

You should see two entries in this keystore.


• To ensure that the https services are active, use the openssl s_client -connect command. For
example:

openssl s_client -connect <Active NameNode hostname>:50470
openssl s_client -connect <Active NameNode hostname>:50470 -servername <Active NameNode hostname>
openssl s_client -connect <DataNode hostname>:9869

Replace the port numbers used here with the actual configured values as applicable to your
environment.

Verifying TLS
This section describes the steps to verify TLS security for CDP Private Cloud Base clusters with IBM
Storage Scale.
Run kinit with a valid keytab to obtain a Kerberos ticket first. For more information, see “Verifying
installation” on page 309.
1. Verify the secure HDFS Java™ (swebhdfs) client provided by Cloudera to perform simple I/O operations
with HDFS Transparency by running the following commands:

# echo "hello world" > /tmp/hello
# /usr/bin/hdfs dfs -ls swebhdfs://<HDFS HA Namespace>/
# /usr/bin/hdfs dfs -put /tmp/hello swebhdfs://<HDFS HA Namespace>/tmp/
# /usr/bin/hdfs dfs -cat swebhdfs://<HDFS HA Namespace>/tmp/hello

where, <HDFS HA Namespace> is defined by the fs.defaultFS parameter in your /etc/hadoop/conf/core-site.xml.

2. Verify the https client by running the following command:

# curl -ku: --negotiate https://<CES_HOSTNAME>:50470/webhdfs/v1/?op=LISTSTATUS

where, <CES_HOSTNAME> is the FQDN hostname corresponding to the CES IP configured for your CES
HDFS cluster.
Note:
• For Non-HA CES HDFS clusters, use the <CES_HOSTNAME>:<port> format instead of Namespace for
the hdfs commands.
• For curl commands, always use the <CES_HOSTNAME>:<port> format. For Kerberos enabled
clusters, substituting <CES_HOSTNAME> with <CES-IP> will fail with HTTP 401 (Auth) error, as the
Kerberos principal is created only for the CES hostname.

Rotating Auto-TLS certificates


Security policies of an organization might require the security administrator to rotate the Auto-TLS
certificates from Cloudera Manager.
You need to complete the following steps to rotate the Auto-TLS certificates from Cloudera Manager when
IBM Storage Scale is integrated:
1. Stop all the services from IBM Cloudera Manager.
In order to update the certificates, all the services in Cloudera and IBM Storage Scale CES HDFS
Transparency must be stopped.
2. Recycle the Auto-TLS certificates.
Click Cloudera Manager > Security > Rotate Auto-TLS certificates.
This action deletes the /var/lib/cloudera-scm-agent/agent-cert directory from all the CDP
nodes and regenerates them. As a result, the new CDP truststore will no longer contain the IBM
Storage Scale entries.
3. Remove the stale Cloudera certificates entries from the IBM Storage Scale truststore. Run the
following commands as root on one of the CES HDFS NameNodes:

cd /etc/security/serverKeys/
keytool -delete -alias cm-auto-global_cacerts -keystore
spectrum_scale_ces_hdfs_truststore.jks

4. From the same NameNode, follow the Update Cloudera and IBM Storage Scale truststores step to
exchange certificates between them. As a result, the following actions are completed.
• Cloudera Manager public certificates are imported to IBM Storage Scale truststore.
• The older IBM Storage Scale certificates are removed from Cloudera Manager truststore.
• The new IBM Storage Scale certificates are imported to Cloudera Manager truststore.
5. If Ranger is enabled, the Ranger TLS plugin file rangerpluginssl.jceks needs to be re-created as
well. Re-create the same by following Additional configurations when TLS is enabled.
6. Start the services:
• If Ranger is enabled, start the NameNodes by using the workaround that is mentioned in Ranger
issues with TLS enabled and then start all the services from Cloudera Manager.
• If Ranger is not enabled, start all the services from Cloudera Manager as usual.

Configuring Apache Knox


This section describes how to configure Apache Knox on CDP Private Cloud Base clusters integrated with
CES HDFS.
The Apache Knox Gateway (or simply, Apache Knox) extends the perimeter security for Hadoop. By
encapsulating Kerberos, Apache Knox eliminates the need for client software or client configuration for

Kerberos; and thus simplifies the access model. For more information, see Apache Knox Gateway in
Cloudera documentation.
To be able to use Apache Knox with HDFS Transparency, more configurations are needed, as described in
the following steps.
1. Configure HDFS Transparency:
Set the hadoop.proxyuser.knox.groups parameter to * by using the following command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config set core-site.xml -k hadoop.proxyuser.knox.groups=*

To upload the configuration, issue the next command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config upload

Then, restart HDFS Transparency services.


2. Go to Cloudera Manager > Knox service, and set the WEBHDFS:url parameter to https://<myceshost>:50470/webhdfs.
Where myceshost is the hostname corresponding to the CES IP configured for HDFS Transparency and
50470 is the default HTTPS port for NameNodes.
3. Apache Knox uses Linux PAM authentication. The Apache Knox user should have permission for
the /etc/shadow file for PAM to be able to authenticate local users to Apache Knox. For more
information, see An introduction to Pluggable Authentication Modules (PAM) in Linux in Red Hat
documentation.
To configure the Apache Knox user, perform the following steps.
1. Make the following configurations:

groupadd shadow
chgrp shadow /etc/shadow
usermod -a -G shadow knox
chmod 600 /etc/shadow
setfacl -m "u:knox:r--" /etc/shadow
rm -f /var/run/nologin

2. Validate that users can access HDFS Transparency through Apache Knox by using a file listing
command:

curl -ikv -u <user> https://<knox-host>:<knox port>/gateway/cdp-proxy-api/webhdfs/v1/?op=LISTSTATUS

HDFS encryption
HDFS implements transparent, end-to-end encryption. After configuring HDFS, data read from and
written to special HDFS directories is transparently encrypted and decrypted without requiring changes
to the user application code. This encryption is also end-to-end, which means that the data can only be
encrypted and decrypted by the client.
For more information, see “HDFS encryption” on page 192.
This section describes how to enable HDFS encryption for CDP Private Cloud Base clusters integrated
with IBM Storage Scale.

Enabling HDFS encryption


This topic lists the steps to enable HDFS encryption for CES HDFS and CDP Private Cloud Base clusters.
Note:
• Enabling HDFS encryption for CES HDFS and CDP Private Cloud Base clusters requires that the Ranger,
TLS and Kerberos are enabled. Before you proceed to enable HDFS encryption, ensure that Ranger, TLS
and Kerberos are fully functional.

• Ensure Ranger policies are working properly by following the steps in “Verifying Ranger policy” on page
320. Otherwise Ranger KMS server might fail to start.
• HDFS encryption requires the Ranger-KMS with Key Trustee Server Cloudera service and the
Key Trustee Server (KTS) service. These two services can be added only after the CDP Private Cloud
Base cluster is deployed and not during the first time the cluster is created.
1. Configure Key Trustee Server (KTS) repository in Cloudera Manager.
a. Go to Cloudera Manager > Parcels > Parcel Repositories & Network Settings.
b. Add the repository URL where KEYTRUSTEE_SERVER-X.X.X.X-X.keytrusteeX.X.X.X.pX.XXXXXXX-XXX.parcel is present to Remote Parcel Repository URLs.
c. Click Save & Verify Configuration.
d. Click Close.
e. On KEYTRUSTEE_SERVER, click Download > Distribute > Activate.
2. Add Key Trustee Server (KTS) service to CDP cluster.
a. Go to Cloudera Manager, click Cluster > Actions > Add Service and install the Key Trustee Server
(KTS) service. Follow the wizard and install the service.
Note: While on the Setup Entropy screen, if sufficient entropy is available, you can skip the
installation of rng-tools. Otherwise, if entropy drops while generating the secrets, the Cloudera
Manager wizard might become unresponsive. In that case, go back to the Setup Entropy screen to
install and configure the rng-tools package.
b. Start Key Trustee Server from the Cloudera Manager UI.
3. Add Ranger KMS with Key Trustee Server to CDP cluster.
a. Go to Cloudera Manager, click Cluster > Actions > Add Service and install the Ranger KMS with Key
Trustee Server service.
b. Follow the wizard and complete the installation.
4. Update CES HDFS configuration to enable native HDFS encryption.
a. Stop the IBM Storage Scale service from Cloudera Manager. The NameNodes and DataNodes
should be stopped.
b. Log into a CES HDFS node and run the following commands to update the CES HDFS configuration
to enable encryption:

# mmhdfs config set gpfs-site.xml -k gpfs.encryption.enabled=true -k gpfs.ranger.enabled=scale
# mmhdfs config set core-site.xml -k hadoop.security.key.provider.path=kms://https@<RANGER_KMS_HOST>:9494/kms

Replace <RANGER_KMS_HOST> with the FQDN hostname of your Ranger KMS server.
Note: If you had moved your Ranger KMS service from one host to another, ensure that you update
the hadoop.security.key.provider.path parameter to the correct host.
c. Upload the configuration to the CCR:

# mmhdfs config upload

d. Go to Cloudera Manager, click Cluster > Actions menu.


e. Stop all the services.
f. Deploy client configuration.
g. Start all the services.
Note:
• If the NameNodes or Ranger KMS service fails to start because of a known issue, see Ranger
workaround.

• If you see an authorization exception in the Ranger KMS GUI, see While creating the encryption key you
see an authorization exception in the Ranger KMS GUI.
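
Optionally, before you move on to the verification steps, you can confirm that the Hadoop client can reach the Ranger KMS key provider. This is a minimal sketch that assumes the same <RANGER_KMS_HOST> placeholder as above and a valid Kerberos ticket for a user that is authorized in the cm_kms policy; the key list might be empty on a new setup:

# hadoop key list -provider kms://https@<RANGER_KMS_HOST>:9494/kms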

Verifying HDFS encryption


This section describes the steps to verify HDFS encryption on CDP Private Cloud Base with IBM Storage
Scale.
1. Log in to Ranger GUI as keyadmin.
2. To create keys, select the cm_kms policy > Edit the policy > add a role for a regular user.
You can add more roles as needed. In this example, we use the testuser user created in “Verifying
installation” on page 309.
3. Get a Kerberos token for testuser and create a new encryption key.

# kinit -kt /etc/security/keytabs/testuser.headless.keytab testuser@<Your Realm name>


# hadoop key create mykey

4. Create an empty directory to be used as an encryption zone. Then, designate the /tmp/myzone
directory as an encryption zone.
For this purpose, this example uses the hdfs user, which is a part of the Hadoop supergroup.

# kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@<Your Realm Name>


# hadoop fs -mkdir /tmp/myzone
# hadoop fs -chown testuser:testuser /tmp/myzone
# hdfs crypto -createZone -keyName mykey -path /tmp/myzone

5. Log in as testuser and verify the zone.


For this test, use an input file (for example, /tmp/helloWorld). Run the following commands:

# kinit -kt /etc/security/keytabs/testuser.headless.keytab testuser@<Your Realm name>


# hadoop fs -put /tmp/helloWorld /tmp/myzone/
# hdfs crypto -getFileEncryptionInfo -path /tmp/myzone/helloWorld
console output: {cipherSuite: {name: AES/CTR/NoPadding, algorithmBlockSize:
16}, cryptoProtocolVersion: CryptoProtocolVersion{description='Encryption zones',
version=1, unknownValue=null}, edek: 2010d301afbd43b58f10737ce4e93b39, iv:
ade2293db2bab1a2e337f91361304cb3, keyName: mykey, ezKeyVersionName: mykey@0}
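
As an additional check, you can list the configured encryption zones as the hdfs superuser. This is a minimal sketch; the output line shown is illustrative for the /tmp/myzone zone created above:

# kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@<Your Realm Name>
# hdfs crypto -listZones
/tmp/myzone  mykey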

Rolling restart
This topic describes how to perform a rolling restart of the HDFS Transparency components in a Cloudera
CDP managed environment.
Configuration updates to HDFS Transparency may be applied at runtime, without causing downtime to
HDFS Transparency. This is achieved by performing a rolling restart of the NameNodes and DataNodes as
follows:
1. Log in to an HDFS Transparency node as root and apply the configuration updates using mmhdfs
config set command. However, if you are updating hadoop-env.sh, edit the /var/mmfs/
hadoop/etc/hadoop/hadoop-env.sh file directly rather than using the mmhdfs config set
command.
2. Upload the configuration to IBM Storage Scale CCR by running the mmhdfs config upload
command.
3. Start with the rolling restart of the NameNodes. Follow this procedure to make the NameNode state
transitions quicker (a worked example is shown after this procedure):
• Log in to one of the CES HDFS NameNodes as root.
• Find out which hosts are running the current active and standby NameNode, by running the following
command:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

• From Cloudera Manager > IBM Spectrum Scale service > instances, restart the currently Standby
NameNode.
• Verify that the standby NameNode is running and is in good health. This should ensure there was no
error with the applied configurations.
• On the active CES HDFS NameNode host as root, manually failover the active NameNode to the
standby node (which was just restarted), by running following command:

/usr/lpp/mmfs/bin/mmces address move --ces-ip x.x.x.x --ces-node <standby_namenode_host>

• Verify and ensure that the active/standby nodes are now swapped, by running the following
command as before:

/usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState

• From Cloudera Manager > IBM Spectrum Scale service > instances, restart the other NameNode
(which would now be in standby).
4. Continue with the rolling restart of DataNodes by following Rolling Restart in the Cloudera Manager
documentation.
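
The following is a worked example of the NameNode portion of the procedure. The hostnames, CES IP address, and port are hypothetical and the command output is illustrative only:

# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState
nn1.example.com:8020          active
nn2.example.com:8020          standby

After restarting the standby NameNode (nn2 in this sketch) from Cloudera Manager and verifying its health, move the CES IP to it and confirm that the roles are swapped:

# /usr/lpp/mmfs/bin/mmces address move --ces-ip 192.0.2.10 --ces-node nn2.example.com
# /usr/lpp/mmfs/hadoop/bin/hdfs haadmin -getAllServiceState
nn1.example.com:8020          standby
nn2.example.com:8020          active

Finally, restart the other NameNode (now standby) from Cloudera Manager.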

Multiple IBM Storage Scale file system support


Set up a CES HDFS cluster with storage type set to shared,shared. The storage setup must consist of
two file systems.
To add support for multiple IBM Storage Scale file system, follow the steps listed below:
1. Set up the two file systems.
2. Set up CES HDFS cluster using the first mounted file system for HDFS and cesSharedRoot.
3. Verify the NameNode state and the health of CES HDFS nodes using the first mounted file system.
4. Add the second mounted file system using the following steps:
a. Stop all CES HDFS services in the CES HDFS cluster by running the following command:

mmhdfs hdfs stop

b. Edit the values of the gpfs.storage.type, gpfs.mnt.dir, and gpfs.replica.enforced
parameters in /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml as follows (an equivalent
command-line alternative is sketched after this procedure):

<property>
<name>gpfs.mnt.dir</name>
<value>/fs1,/fs2</value>
</property>
<property>
<name>gpfs.storage.type</name>
<value>shared,shared</value>
</property>
<property>
<name>gpfs.replica.enforced</name>
<value>gpfs,gpfs</value>
</property>

5. Upload the configuration to update the CCR by running the following command:

mmhdfs config upload

6. Start the CES HDFS services and HDFS Transparency by running the following command:

mmhdfs hdfs start

7. Verify the state of the NameNode and DataNode by running the following commands:

mmhdfs hdfs-nn status


mmhdfs hdfs-dn status

Note: To list all the files/directories, run the following command:

hadoop dfs -ls /
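
As an alternative to editing gpfs-site.xml by hand in step 4b, the same keys can typically be set with the mmhdfs config set command used elsewhere in this chapter. The following is a minimal sketch of steps 4 through 6, assuming the same /fs1 and /fs2 mount points; if the comma-separated values are not accepted by your level of mmhdfs, edit the file directly as shown above:

# mmhdfs hdfs stop
# mmhdfs config set gpfs-site.xml -k gpfs.mnt.dir=/fs1,/fs2 -k gpfs.storage.type=shared,shared -k gpfs.replica.enforced=gpfs,gpfs
# mmhdfs config upload
# mmhdfs hdfs start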

IBM Storage Scale service management


Cloudera Manager offers a basic IBM Storage Scale service management.
You can use the IBM Storage Scale service for the following:
• Start and stop the HDFS Transparency NameNode and DataNode.
• Monitor metrics from HDFS Transparency.
• Put the HDFS Transparency NameNode and DataNodes into the maintenance mode.
You cannot use the IBM Storage Scale service for the following:
• Start and stop the IBM Storage Scale file system and daemons.
• Manage Kerberos for HDFS Transparency.
Note: These tasks must be done independently of Cloudera Manager.

Verify HDFS Transparency version


To check the version of HDFS Transparency, you need to run manual commands on the CES HDFS cluster
node from the command line interface.
Run the following commands to check the HDFS Transparency version:

# rpm -qa | grep gpfs.hdfs-protocol

or

# mmdsh -N c902f10x05,c902f10x06,c902f10x07,c902f10x08 "rpm -qa | grep gpfs.hdfs-protocol"

Note:
• If the password-less ssh is not set up, you need to go to each node and run the manual commands.
• If the password-less ssh is set up, you can run a password-less ssh query to check the rpm that is
installed.

Verify IBM Storage Scale service CSD version


To check the version of CSD, you need to run the rpm command on the Cloudera Manager server from the
command line interface.
Run the following rpm command to check the CSD version:

# rpm -qa | grep gpfs.hdfs.cloudera.cdp.csd

Verifying the CDP upgrade


This topic lists the steps to verify the CDP upgrade.
1. Check that the updated CSD is loaded in Cloudera Manager by viewing the Cloudera Manager
server log file. In /var/log/cloudera-scm-server/cloudera-scm-server.log, check for the
following:
main:com.cloudera.csd.components.CsdRegistryImpl: Installed 1 handlers for
CSD [SPECTRUMSCALE_C<CM_VERSION>-<CSD_VERSION>]
where <CM_VERSION> and <CSD_VERSION> are the updated Cloudera Manager and CSD versions.
For example:

For CDP stack 7.1.7 and CSD version 1.2.0, ensure that SPECTRUMSCALE_C717-1.2.0 is loaded.

# grep -ir 'Installed 1 handlers for CSD \[SPECTRUM' /var/log/cloudera-scm-server/cloudera-scm-server.log
2021-08-13 06:05:21,012 INFO main:com.cloudera.csd.components.CsdRegistryImpl: Installed 1 handlers for CSD [SPECTRUMSCALE_C716-1.2.0]
2021-08-13 06:05:21,034 INFO main:com.cloudera.csd.components.CsdRegistryImpl: Installed 1 handlers for CSD [SPECTRUMSCALE_C717-1.2.0]

2. Verify the HDFS Transparency version on all CES HDFS nodes by running the following command:

# rpm -q gpfs.hdfs-protocol
gpfs.hdfs-protocol-3.1.1-5.x86_64

3. Verify that the IBM Storage Scale file system (for example, “gpfs1”) is mounted correctly from one
HDFS node:

# mmlsmount gpfs1 -L
File system gpfs1 is mounted on the following four nodes:
172.16.1.67 c902f10x01
172.16.1.69 c902f10x02
172.16.1.73 c902f10x04
172.16.1.71 c902f10x03

4. Verify the HDFS status from one HDFS node:

# /usr/lpp/mmfs/hadoop/sbin/mmhdfs hdfs status


c902f10x02.gpfs.net: scaleces: namenode pid is 2267
c902f10x01.gpfs.net: scaleces: namenode pid is 7792
c902f10x04.gpfs.net: scaleces: datanode pid is 6231
c902f10x03.gpfs.net: scaleces: datanode pid is 20365

5. Run mapreduce teragen / terasort job as shown in the following example:

# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 500 test_run
# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort test_run test_run_sorted
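
Optionally, validate the sorted output with the teravalidate example job, which is part of the same hadoop-mapreduce-examples JAR; the output directory name test_run_validate is arbitrary and chosen here for illustration:

# yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate test_run_sorted test_run_validate
# hadoop fs -ls test_run_validate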

Monitoring
For information on monitoring the IBM Storage Scale cluster, see the Monitoring section in the IBM
Storage Scale: Problem Determination Guide.

Monitoring IBM Storage Scale service using Cloudera Manager


This section lists the steps to monitor the IBM Storage Scale service using Cloudera Manager.
Prerequisites:
Ensure that the transparency.namenode.http.port and transparency.datanode.http.port
parameters are correctly set within the IBM Storage Scale service as described in “Installing Cloudera
Data Platform Private Cloud Base with IBM Storage Scale” on page 304.
Steps
1. Go to the Cloudera Manager GUI > Clusters > Your cluster view.
2. Click the drop-down on the right side of the page and select Add from Chart Builder.
3. To list all the graphs for DataNode, in the query box enter the following:

select * where roleType=TRANSPARENCY_DATANODE

4. Click Build Chart.


5. In Facets, select All Separate to see all the attributes in individual graphs.
6. You can write the same query for NameNode as follows:

select * where roleType=TRANSPARENCY_NAMENODE

For more information on the TSQuery format, see tsquery Syntax.
Following are the NameNode and DataNode graph lists with their meanings:

Attribute Name | Meaning | Regular expression matching to the JMX bean
spectrumscale_hdfs_block_checksum_op_avg_time | Block Checksum Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::BlockChecksumOpAvgTime
spectrumscale_hdfs_block_checksum_op_num_ops | Block Checksum Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::BlockChecksumOpNumOps
spectrumscale_hdfs_block_reports_avg_time | Block Reports Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::BlockReportsAvgTime
spectrumscale_hdfs_block_reports_num_ops | Block Reports Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::BlockReportsNumOps
spectrumscale_hdfs_block_verification_failures | Block Verification Failures | Hadoop:service=DataNode,name=DataNodeActivity-*::BlockVerificationFailures
spectrumscale_hdfs_blocks_cached | The total number of HDFS blocks cached over the lifetime of the process. | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksCached
spectrumscale_hdfs_blocks_get_local_path_info | Blocks Get Local Path Info | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksGetLocalPathInfo
spectrumscale_hdfs_blocks_read | Blocks Read | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksRead
spectrumscale_hdfs_blocks_removed | Blocks Removed | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksRemoved
spectrumscale_hdfs_blocks_replicated | Blocks Replicated | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksReplicated
spectrumscale_hdfs_blocks_uncached | The total number of HDFS blocks uncached over the lifetime of the process. | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksUncached
spectrumscale_hdfs_blocks_verified | Blocks Verified | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksVerified
spectrumscale_hdfs_blocks_written | Blocks Written | Hadoop:service=DataNode,name=DataNodeActivity-*::BlocksWritten
spectrumscale_hdfs_bytes_read | Number of bytes read | Hadoop:service=DataNode,name=DataNodeActivity-*::BytesRead
spectrumscale_hdfs_bytes_written | Bytes Written | Hadoop:service=DataNode,name=DataNodeActivity-*::BytesWritten
spectrumscale_hdfs_cache_reports_avg_time | The average time to generate cache reports on the DataNode. | Hadoop:service=DataNode,name=DataNodeActivity-*::CacheReportsAvgTime
spectrumscale_hdfs_cache_reports_num_ops | The total number of generate cache reports operations on the DataNode. | Hadoop:service=DataNode,name=DataNodeActivity-*::CacheReportsNumOps
spectrumscale_hdfs_copy_block_op_avg_time | Copy Block Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::CopyBlockOpAvgTime
spectrumscale_hdfs_copy_block_op_num_ops | Copy Block Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::CopyBlockOpNumOps
spectrumscale_hdfs_flush_nanos_avg_time | Average Disk Flush Time | Hadoop:service=DataNode,name=DataNodeActivity-*::FlushNanosAvgTime
spectrumscale_hdfs_flush_nanos_num_ops | Disk Flushes | Hadoop:service=DataNode,name=DataNodeActivity-*::FlushNanosNumOps
spectrumscale_hdfs_fsync_nanos_avg_time | Average Disk Fsync Time | Hadoop:service=DataNode,name=DataNodeActivity-*::FsyncNanosAvgTime
spectrumscale_hdfs_fsync_nanos_num_ops | Disk Fsyncs | Hadoop:service=DataNode,name=DataNodeActivity-*::FsyncNanosNumOps
spectrumscale_hdfs_fsync_num_ops | Fsync Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::FsyncCount
spectrumscale_hdfs_heartbeats_avg_time | Heartbeat Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::HeartbeatsAvgTime
spectrumscale_hdfs_heartbeats_num_ops | Heartbeats | Hadoop:service=DataNode,name=DataNodeActivity-*::HeartbeatsNumOps
spectrumscale_hdfs_send_data_packet_blocked_on_network_nanos_avg_time | Send Data Packet Blocked On Network Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::SendDataPacketBlockedOnNetworkNanosAvgTime
spectrumscale_hdfs_send_data_packet_blocked_on_network_nanos_num_ops | Send Data Packet Blocked On Network Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::SendDataPacketBlockedOnNetworkNanosNumOps
spectrumscale_hdfs_send_data_packet_transfer_nanos_avg_time | Send Data Packet Transfer Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::SendDataPacketTransferNanosAvgTime
spectrumscale_hdfs_send_data_packet_transfer_nanos_num_ops | Send Data Packet Transfer Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::SendDataPacketTransferNanosNumOps
spectrumscale_hdfs_write_block_op_avg_time | Write Block Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::WriteBlockOpAvgTime
spectrumscale_hdfs_write_block_op_num_ops | Write Block Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::WriteBlockOpNumOps
spectrumscale_hdfs_writes_from_local_client | Writes From Local Clients | Hadoop:service=DataNode,name=DataNodeActivity-*::WritesFromLocalClient
spectrumscale_hdfs_writes_from_remote_client | Writes From Remote Clients | Hadoop:service=DataNode,name=DataNodeActivity-*::WritesFromRemoteClient
spectrumscale_hdfs_packet_ack_round_trip_time_nanos_avg_time | Packet Ack Round Trip Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::PacketAckRoundTripTimeNanosAvgTime
spectrumscale_hdfs_packet_ack_round_trip_time_nanos_num_ops | Packet Ack Round Trip Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::PacketAckRoundTripTimeNanosNumOps
spectrumscale_hdfs_read_block_op_avg_time | Read Block Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::ReadBlockOpAvgTime
spectrumscale_hdfs_read_block_op_num_ops | Read Block Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::ReadBlockOpNumOps
spectrumscale_hdfs_reads_from_local_client | Reads From Local Clients | Hadoop:service=DataNode,name=DataNodeActivity-*::ReadsFromLocalClient
spectrumscale_hdfs_reads_from_remote_client | Reads From Remote Clients | Hadoop:service=DataNode,name=DataNodeActivity-*::ReadsFromRemoteClient
spectrumscale_hdfs_replace_block_op_avg_time | Replace Block Operation Average Time | Hadoop:service=DataNode,name=DataNodeActivity-*::ReplaceBlockOpAvgTime
spectrumscale_hdfs_replace_block_op_num_ops | Replace Block Operations | Hadoop:service=DataNode,name=DataNodeActivity-*::ReplaceBlockOpNumOps
spectrumscale_hdfs_jvm_blocked_threads | Blocked threads | Hadoop:service=DataNode,name=JvmMetrics::ThreadsBlocked
spectrumscale_hdfs_jvm_gc_count | Number of garbage collections | Hadoop:service=DataNode,name=JvmMetrics::GcCount
spectrumscale_hdfs_jvm_gc_time_ms | Total time spent garbage collecting. | Hadoop:service=DataNode,name=JvmMetrics::GcTimeMillis
spectrumscale_hdfs_jvm_heap_committed_mb | Total amount of committed heap memory. | Hadoop:service=DataNode,name=JvmMetrics::MemHeapCommittedM
spectrumscale_hdfs_jvm_heap_used_mb | Total amount of used heap memory. | Hadoop:service=DataNode,name=JvmMetrics::MemHeapUsedM
spectrumscale_hdfs_jvm_max_memory_mb | Maximum allowed memory. | Hadoop:service=DataNode,name=JvmMetrics::MemMaxM
spectrumscale_hdfs_jvm_new_threads | New threads | Hadoop:service=DataNode,name=JvmMetrics::ThreadsNew
spectrumscale_hdfs_jvm_non_heap_committed_mb | Total amount of committed non-heap memory. | Hadoop:service=DataNode,name=JvmMetrics::MemNonHeapCommittedM
spectrumscale_hdfs_jvm_non_heap_used_mb | Total amount of used non-heap memory. | Hadoop:service=DataNode,name=JvmMetrics::MemNonHeapUsedM
spectrumscale_hdfs_jvm_pause_time | The amount of extra time the JVM was paused above the requested sleep time. The JVM pause monitor sleeps for 500 milliseconds and any extra time it waited above this is counted in the pause time. | Hadoop:service=DataNode,name=JvmMetrics::GcTotalExtraSleepTime
spectrumscale_hdfs_jvm_pauses_info_threshold_count | Number of JVM pauses longer than the info threshold but shorter than the warning threshold. By default the info threshold is set to 1 second. To change it, use the configuration key JvmPauseMonitorService.info-threshold.ms. | Hadoop:service=DataNode,name=JvmMetrics::GcNumInfoThresholdExceeded
spectrumscale_hdfs_jvm_pauses_warn_threshold_count | Number of JVM pauses longer than the warning threshold. By default the warning threshold is set to 10 seconds. To change it, use the configuration key JvmPauseMonitorService.warn-threshold.ms. | Hadoop:service=DataNode,name=JvmMetrics::GcNumWarnThresholdExceeded
spectrumscale_hdfs_jvm_runnable_threads | Runnable threads | Hadoop:service=DataNode,name=JvmMetrics::ThreadsRunnable
spectrumscale_hdfs_jvm_terminated_threads | Terminated threads | Hadoop:service=DataNode,name=JvmMetrics::ThreadsTerminated
spectrumscale_hdfs_jvm_timed_waiting_threads | Timed waiting threads | Hadoop:service=DataNode,name=JvmMetrics::ThreadsTimedWaiting
spectrumscale_hdfs_jvm_waiting_threads | Waiting threads | Hadoop:service=DataNode,name=JvmMetrics::ThreadsWaiting
spectrumscale_hdfs_log_error | Logged Errors | Hadoop:service=DataNode,name=JvmMetrics::LogError
spectrumscale_hdfs_log_fatal | Logged Fatals | Hadoop:service=DataNode,name=JvmMetrics::LogFatal
spectrumscale_hdfs_log_info | Logged Infos | Hadoop:service=DataNode,name=JvmMetrics::LogInfo
spectrumscale_hdfs_log_warn | Logged Warnings | Hadoop:service=DataNode,name=JvmMetrics::LogWarn
spectrumscale_hdfs_login_failure_avg_time | Average Failed Login Time | Hadoop:service=DataNode,name=UgiMetrics::LoginFailureAvgTime
spectrumscale_hdfs_login_failure_num_ops | Login Failures | Hadoop:service=DataNode,name=UgiMetrics::LoginFailureNumOps
spectrumscale_hdfs_login_success_avg_time | Average Successful Login Time | Hadoop:service=DataNode,name=UgiMetrics::LoginSuccessAvgTime
spectrumscale_hdfs_login_success_num_ops | Login Successes | Hadoop:service=DataNode,name=UgiMetrics::LoginSuccessNumOps
spectrumscale_hdfs_metrics_dropped_pub_all | Dropped Metrics Updates By All Sinks | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::DroppedPubAll
spectrumscale_hdfs_metrics_num_active_sinks | Active Metrics Sinks Count | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::NumActiveSinks
spectrumscale_hdfs_metrics_num_active_sources | Active Metrics Sources Count | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::NumActiveSources
spectrumscale_hdfs_metrics_num_all_sinks | All Metrics Sinks Count | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::NumAllSinks
spectrumscale_hdfs_metrics_num_all_sources | All Metrics Sources Count | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::NumAllSources
spectrumscale_hdfs_metrics_publish_avg_time | Metrics Publish Average Time | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::PublishAvgTime
spectrumscale_hdfs_metrics_publish_num_ops | Metrics Publish Operations | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::PublishNumOps
spectrumscale_hdfs_metrics_snapshot_avg_time | Metrics Snapshot Average Time | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::SnapshotAvgTime
spectrumscale_hdfs_metrics_snapshot_num_ops | Metrics Snapshot Average Operations | Hadoop:service=DataNode,name=MetricsSystem,sub=Stats::SnapshotNumOps
spectrumscale_hdfs_rpc_authentication_failures | RPC Authentication Failures | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcAuthenticationFailures
spectrumscale_hdfs_rpc_authentication_successes | RPC Authentication Successes | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcAuthenticationSuccesses
spectrumscale_hdfs_rpc_authorization_failures | RPC Authorization Failures | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcAuthorizationFailures
spectrumscale_hdfs_rpc_authorization_successes | RPC Authorization Successes | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcAuthorizationSuccesses
spectrumscale_hdfs_rpc_call_queue_length | RPC Call Queue Length | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::CallQueueLength
spectrumscale_hdfs_rpc_num_open_connections | Open RPC Connections | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::NumOpenConnections
spectrumscale_hdfs_rpc_processing_time_avg_time | Average RPC Processing Time | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcProcessingTimeAvgTime
spectrumscale_hdfs_rpc_processing_time_num_ops | RPCs Processed | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcProcessingTimeNumOps
spectrumscale_hdfs_rpc_queue_time_avg_time | Average RPC Queue Time | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcQueueTimeAvgTime
spectrumscale_hdfs_rpc_queue_time_num_ops | RPCs Queued | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::RpcQueueTimeNumOps
spectrumscale_hdfs_rpc_received_bytes | RPC Received Bytes | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::ReceivedBytes
spectrumscale_hdfs_rpc_sent_bytes | RPC Sent Bytes | Hadoop:service=DataNode,name=RpcActivityForPort\\d+::SentBytes
spectrumscale_hdfs_xceivers | Transceivers | Hadoop:service=DataNode,name=DataNodeInfo::XceiverCount
spectrumscale_hdfs_connections | Current number of connections to the NameNode | Hadoop:service=NameNode,name=FSNamesystem::TotalLoad
spectrumscale_hdfs_fsnamesystem_lockqueuelength | Number of threads waiting to acquire the FSNamesystem lock | Hadoop:service=NameNode,name=FSNamesystem::LockQueueLength
spectrumscale_hdfs_active_connection_holdinglease | Number of active clients holding a lease | Hadoop:service=NameNode,name=FSNamesystem::NumActiveClients
spectrumscale_hdfs_state | Current state of the file system: Safemode or Operational | Hadoop:service=NameNode,name=FSNamesystem::FSState
spectrumscale_hdfs_ha_state | Current state of the NameNode: initializing, active, standby, or stopping | Hadoop:service=NameNode,name=FSNamesystem::tag.HAState
spectrumscale_hdfs_rpc_queue_time_num_ops | RPCs Queued | Hadoop:service=NameNode,name=RpcActivityForPort\\d+::RpcQueueTimeNumOps
spectrumscale_hdfs_rpc_queue_time_avg_time | Average RPC Queue Time | Hadoop:service=NameNode,name=RpcActivityForPort\\d+::RpcQueueTimeAvgTime
spectrumscale_hdfs_rpc_processing_time_num_ops | RPCs Processed | Hadoop:service=NameNode,name=RpcActivityForPort\\d+::RpcProcessingTimeNumOps
spectrumscale_hdfs_rpc_processing_time_avg_time | Average RPC Processing Time | Hadoop:service=NameNode,name=RpcActivityForPort\\d+::RpcProcessingTimeAvgTime
spectrumscale_hdfs_rpc_call_queue_length | RPC Call Queue Length | Hadoop:service=NameNode,name=RpcActivityForPort\\d+::CallQueueLength
spectrumscale_hdfs_rpc_num_open_connections | Open RPC Connections | Hadoop:service=NameNode,name=RpcActivityForPort\\d+::NumOpenConnections

IBM Storage Scale management GUI


The IBM Storage Scale management GUI provides an easy way to configure and manage various features
that are available with the IBM Storage Scale system.
For information on the IBM Storage Scale management GUI, see Introduction to IBM Storage Scale GUI in
IBM Storage Scale: Concepts, Planning, and Installation Guide.

Monitoring CES HDFS cluster


The mmhealth command displays the results of the background monitoring for the health of a node and
the services that are hosted on the node. You can use the mmhealth command to view the health status
of a whole cluster in a single view.
For information on monitoring the CES HDFS cluster, see Monitoring system health by using the mmhealth
command in IBM Storage Scale: Problem Determination Guide.
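
For example, the following commands, run on a CES HDFS node, give a quick overview; the exact set of components and states that are shown depends on your configuration and monitoring setup:

# mmhealth node show
# mmhealth cluster show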

Monitoring NameNodes
Presently, Cloudera Manager does not have the ability to display Active/Standby status of the HDFS
Transparency NameNodes.
To check the status of NameNodes, see step 2 in “Verifying installation” on page 309.

Monitoring
This topic describes how to monitor HDFS Transparency in a Cloudera managed environment.
The IBM Storage Scale service subscribes to the standard Cloudera monitoring infrastructure. For details
on Cloudera monitoring, see Accessing the Cloudera Manager Admin Console.
The time series data from HDFS Transparency on the Cloudera Manager GUI may be leveraged to monitor
usage, performance, and other aspects. To view time series data, click Cloudera Manager home page >
Charts > Chart Builder. In the Chart Builder query box, type the following to list all the graphs for HDFS
Transparency NameNode.

select * where roleType=TRANSPARENCY_NAMENODE

To view all the attributes in individual graphs, click All Separate under Facets.

The query can be written for DataNode by replacing the roleType to TRANSPARENCY_DATANODE.
For more details on Cloudera TSQuery format and syntax, see tsquery Language.
You might query for individual parameters as well. For example, to find out the Active and Standby
NameNodes, run the following query:

select spectrumscale_hdfs_ha_state where roleType=TRANSPARENCY_NAMENODE

In the resulting graph, a value of 2 indicates Standby and a value of 3 indicates Active.
Similarly, run the following tsquery to see the status of the file system:

select spectrumscale_hdfs_state
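
For example, to chart the read and write throughput of the HDFS Transparency DataNodes in a single query (the attribute names are those listed in the monitoring table earlier in this chapter), a query such as the following can be used:

select spectrumscale_hdfs_bytes_read, spectrumscale_hdfs_bytes_written where roleType=TRANSPARENCY_DATANODE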

Upgrading

Upgrading CDP
This section lists the steps to upgrade a CDP Private Cloud Base cluster integrated with IBM Storage Scale
from an existing CDP Private Cloud Base version to a newer CDP Private Cloud Base version.
Note: To ensure that the correct software stack compatibility versions are adhered to, see the CDP Private
Cloud Base support matrix.

Offline upgrade procedure


1. Stop all the CDP cluster services from Cloudera Manager. This will stop CDP Hadoop services and CES
HDFS Transparency services.
2. Upgrade the IBM Storage Scale software.
You can upgrade either manually or by using the installation toolkit.
• For installation toolkit upgrade process, see Performing online upgrade by using the installation
toolkit.

Note: The installation toolkit automatically updates only the HDFS package when the HDFS protocol
is enabled. To check if HDFS protocol is enabled, run the ./spectrumscale node list
command.
• For manual upgrade process, see the following sections:
– Upgrading IBM Storage Scale non-protocol Linux nodes
– Upgrading IBM Storage Scale protocol nodes.
– Upgrade the HDFS Transparency package on each node by running the following command:

rpm -Uvh gpfs.hdfs-protocol-3.1.1-<version>.<os>.rpm

Note: The IBM Storage Scale extract package will contain the new IBM Storage Scale CSD
package. If you are using the installation toolkit, it will be on the installer node.
3. Upgrade the IBM Storage Scale CSD by running the following steps:
a. Log in to the Cloudera Manager node.
b. Copy the latest IBM Storage Scale CSD package from the IBM Storage Scale Installer node to a
local directory. For example, /root/latest_csd/.

# scp <Spectrum Scale Installer node>:/usr/lpp/mmfs/<Spectrum Scale version>/hdfs_rpms/rhel/hdfs_3.1.1.x/gpfs.hdfs.cloudera.cdp.csd-<Latest CSD version>.noarch.rpm /root/latest_csd/

For example:

# scp <Spectrum Scale Installer node>:/usr/lpp/mmfs/5.1.2.0/hdfs_rpms/rhel/hdfs_3.1.1.x/gpfs.hdfs.cloudera.cdp.csd-1.2.0-0.noarch.rpm /root/latest_csd/

c. Upgrade the IBM Storage Scale CSD package by running the following command:

# cd /root/latest_csd
# rpm -Uvh gpfs.hdfs.cloudera.cdp.csd-<Latest CSD version>.noarch.rpm

For example:

# rpm -Uvh gpfs.hdfs.cloudera.cdp.csd-1.2.0-0.noarch.rpm

d. Restart the Cloudera Manager. The restart picks up the new IBM Storage Scale CSD for Cloudera
Manager.
4. Upgrade the CDP stack by following the steps mentioned in Upgrading CDP Private Cloud Base to a
higher version.
5. From Cloudera Manager, start all the services. This will start the CDP Hadoop services and the CES
HDFS Transparency NameNodes and DataNodes.
6. Verify the upgrade. To verify, see “Verifying the CDP upgrade” on page 332.

Upgrading IBM Storage Scale


This section lists the steps for upgrading IBM Storage Scale.
The upgrade process is based on the CDP Private Cloud Base support matrix to ensure that the correct
software stack compatibility versions are adhered to.
For example, CDP 7.1.7 supports IBM Storage Scale 5.1.1.2 and 5.1.1.3. Both these IBM Storage Scale
releases contain the same IBM Storage Scale CSD and HDFS Transparency versions. If you want to
upgrade from IBM Storage Scale 5.1.1.2 to 5.1.1.3 while staying at CDP 7.1.7, you only need to upgrade
the IBM Storage Scale version on the CES HDFS Transparency IBM Storage Scale client nodes.
IBM Storage Scale upgrade process
You can upgrade IBM Storage Scale either manually or by using the installation toolkit.
Note: Starting with HDFS Transparency 3.1.1-15 and HDFS Transparency 3.2.2-6, dependent JAR
files need to be provided. This is also required as a prerequisite for an upgrade. For more information,
see the instructions to provide dependent JAR files.

Upgrading IBM Storage Scale by using the Installation toolkit
For upgrading IBM Storage Scale by using the installation toolkit, see “Installation toolkit upgrade
process for HDFS Transparency” on page 47. This will upgrade all the IBM Storage Scale nodes in
the IBM Storage Scale cluster.
Note: The installation toolkit automatically updates only the HDFS package when the HDFS
protocol is enabled. To check if HDFS protocol is enabled, run the ./spectrumscale node
list command.
Upgrading IBM Storage Scale manually
1. For upgrading IBM Storage Scale manually, see the following:
• Upgrading IBM Storage Scale non-protocol Linux nodes
• Upgrading IBM Storage Scale protocol nodes
2. Upgrade the HDFS Transparency package on each node by running the rpm -Uvh
gpfs.hdfs-protocol-3.1.1-<version>.<os>.rpm command or by following “Manual
rolling upgrade for HDFS Transparency” on page 51.
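
After either upgrade path, you can confirm that all CES HDFS nodes run the same HDFS Transparency level. This sketch assumes password-less ssh between the nodes; otherwise, run the rpm query on each node individually:

# mmdsh -N <CES HDFS node list> "rpm -q gpfs.hdfs-protocol"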

Limitations
This topic lists the limitations for CDP Private Cloud Base with IBM Storage Scale.
• The TLS certificates that are created by using the automation script /usr/lpp/mmfs/hadoop/
scripts/gpfs_tls_configuration.py have only 90 days validity and would expire thereafter. This
will be fixed in a future release of HDFS Transparency. For an interim fix, contact IBM support.
• Only OpenJDK is supported. Oracle JDK is not supported.
• TLS/SSL is supported from CDP Private Cloud Base 7.1.6.
• IPV6 is not supported.
• Short-circuit read and Short-circuit write are not supported.
• FPO is not supported.
• HDFS encryption is supported from CDP Private Cloud Base 7.1.6.
• Upgrading from HDP to CDP Private Cloud Base is not supported.
• Kudu and Ozone are not supported.
• For production, it is recommended to have a minimum of 2 NameNodes (HA) and 3 DataNodes for the
CES HDFS cluster setup.
• For Hive Warehouse Connector to work in Spark client mode, the
spark.driver.log.persistToDfs.enabled parameter must be set to false. Therefore, the logs
are written to the local storage and not IBM Storage Scale.
• The installation toolkit cannot be used if the CES HDFS cluster is kerberized. Instead, use manual
installation.
• Ubuntu and SLES are not supported.
• Do not use Java 1.8.0_242 and later when Kerberos ticket_lifetime and/or renew_lifetime is set. Using a
higher version results in a failure for HDFS Transparency to start.
• Hadoop services and CES HDFS cannot be colocated on the same node as the ECE node.
• The NameNode cannot be colocated with the DataNode or with any other Hadoop services.
• Starting from CDP 7.1.8, Impala is supported only on the x86 platform. Impala is not supported on IBM
Power.
For more information, see “Limitations and differences from native HDFS” on page 234.

Problem determination
This topic contains information on troubleshooting the CES HDFS and CDP Private Cloud Base issues.
For CES HDFS known problem determination, see “Troubleshooting” on page 212.
For information on troubleshooting the CDP Private Cloud Base issues, see the following workarounds:
1. If Kerberos is enabled, sometimes Zookeeper or Yarn might not start successfully due to issues with
the keytab generation.
Solution:
Zookeeper
a. In the Cloudera Manager, click Zookeeper > Configuration and search for kerberos.
b. Enable the check boxes for Enable Kerberos Authentication and Enable Server to
Server SASL Authentication.
Yarn
a. In the Cloudera Manager, click Yarn > Configuration and search for kerberos.
b. Enable the check box for Enable Kerberos Authentication for HTTP Web-Consoles.
2. Impersonation error with Hive/Oozie/Livy even when the proxyuser settings are set already within the
IBM Storage Scale service in the Cloudera Manager.
For example, the following error is seen:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationExce
ption):
User: <component>/<host>@<realm> is not allowed to impersonate <user>

Solution:
a. Stop the IBM Storage Scale service in Cloudera Manager.
b. Enable proxyuser settings for HDFS Transparency as mentioned in “Enable and Configure CES
HDFS” on page 42.
c. Restart the IBM Storage Scale service from Cloudera Manager.
d. In the Cloudera Manager Hive service, for Hiveserver2 to start and function normally, set
hive.metastore.event.db.notification.api.auth to false in hive-site.xml.
3. Solr does not start after adding Ranger.
Solution:
It is recommended to add the Solr and Ranger services together with the IBM Storage Scale service at
the time of initial CDP Private Cloud Base cluster creation. However, if Ranger and Solr were added
later, the following workaround is needed for Solr to start properly.
a. Log in to the Cloudera Manager console.
b. While adding the Solr service, set the ZooKeeper ZNode parameter to solr-infra in the
configuration wizard or if you had already added Solr but Solr does not come up properly, click
Solr > Configuration and search for ZNode and set the value of the Solr configuration ZooKeeper
ZNode parameter to solr-infra.
c. Ensure that Kerberos checkbox is enabled in the Solr configuration.
d. Continue to add the Solr and Ranger services. Skip if you have already added the services.
e. After adding Ranger, the Solr service changes its name to CDP-INFRA-SOLR.
f. Restart Solr and Ranger.
g. Ensure that Solr and Ranger have started successfully. Do not proceed unless Solr and Ranger
appear healthy.
4. Ranger issues with TLS enabled.

You may encounter one of the following issues if Ranger is enabled together with TLS security:
a. NameNodes do not start after updating the configurations and throw the following error:

org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.


java.lang.IllegalArgumentException: bound must be positive

b. NameNodes can start but Ranger policies do not work. NameNode log shows the following error
message:

org.apache.ranger.admin.client.RangerAdminRESTClient: Failed to get response,


Error is : TrustManager is not specified2022-08-23 13:16:37,809 ERROR
org.apache.ranger.admin.client.RangerAdminRESTClient: Error getting Roles; Received NULL
….
Error getting policies; Received NULL response!!. secureMode=true, ....

Root cause of the issue:


At the time of starting the IBM Storage Scale service from Cloudera Manager, there are three
additional configuration files (ranger-hdfs-security.xml, ranger-hdfs-policymgr-ssl.xml and ranger-
hdfs-audit.xml) generated for HDFS Transparency. In certain scenarios, these three files do not get
propagated to IBM Storage Scale CCR, causing the above problems.
Solution:
Start the NameNodes as following:
a. To obtain the latest spectrumscale-TRANSPARENCY_NAMENODE directory under /run/
cloudera-scm-agent/process/ on the NameNode, start the IBM Storage Scale service from
Cloudera Manager.
b. Stop the IBM Storage Scale service from Cloudera Manager.
c. Log into any one of the HDFS Transparency NameNode hosts.
d. Run the following commands to find the current <NameNode configuration directory>:

# cd /run/cloudera-scm-agent/process/
# ls -lrt| grep spectrumscale-TRANSPARENCY_NAMENODE | tail -n 1

e. Run the following commands to upload the Ranger configuration files from the <NameNode
configuration directory>, as found in the above step, to IBM Storage Scale CCR:

# cd <NameNode configuration directory>
# mmhdfs config import --nocheck . ranger-hdfs-security.xml,ranger-hdfs-policymgr-ssl.xml,ranger-hdfs-audit.xml
# mmhdfs config upload

f. Start the IBM Storage Scale service from Cloudera Manager.


5. In the IBM Storage Scale service, metrics show NO DATA after enabling Kerberos.
After enabling Kerberos, the HTTP port value for DataNode changes to less than 1024. Therefore,
metrics starts showing NO DATA.
Solution:
a. Go to Cloudera Manager GUI, click IBM Spectrum Scale > Configuration and type HTTP Port in
filter.
b. Set transparency.datanode.http.port to 1006.
6. If Solr is pre-installed, the Solr znode changes to /solr-infra when you add Ranger, or when you add
Atlas on a cluster with Ranger.
Solution:
a. Rename znode back to /solr.
b. Renaming the znode causes an Atlas initialization issue. To address this issue, restart Atlas on
correct znode as follows:

i) Stop Atlas.
ii) Go to Atlas Service Actions, click Initialize Atlas > Start Atlas.
7. Installing Ranger service may fail with the following SQL error from MySQL/MariaDB:

SQLException : SQL state: HY000 java.sql.SQLException: This function has none of DETERMINISTIC,
NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want
to use the less safe log_bin_trust_function_creators variable) ErrorCode: 1418

Solution:
a. Before creating the database for Ranger, run the following command on the SQL prompt. The user
account running the command must have MySQL or MariaDB administrator privilege:

SET GLOBAL log_bin_trust_function_creators = 1;

b. After the Ranger installation completes, run the following command to roll back the above value to
its default setting of 0:

SET GLOBAL log_bin_trust_function_creators = 0;

Note: The above operations might make the MySQL or MariaDB database less secure and less
robust during this duration. Other options that you can try are as follows:
• Use a commercial database for Ranger backend storage rather than an open source option such
as MySQL or MariaDB.
• Use a separate MySQL or MariaDB instance exclusively for Ranger.
8. Hive service shows a health check issue in Hive Metastore Canary.
Hive metastore canary fails with following error:

2020-11-02 17:44:53,054 WARN com.cloudera.cmon.firehose.polling.CreateDirectoryTask:
Exception while creating directory '/user/hue', for 'hue:hue', with permission: 775
org.apache.hadoop.security.AccessControlException: Permission denied: user=hue,
access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

Solution:
a. Add hue to the Hadoop supergroup.
b. On all the HDFS Transparency nodes, run the following command:

usermod -G supergroup hue

With the supergroup permission, the hue user can now create the /user/hue directory, which had
previously failed during the Hive Metastore Canary health check of the Hive service.
c. After the health check is resolved, remove the hue user from the Hadoop supergroup.
9. Webhdfs does not work with the CES HDFS IP/hostname. For example, the hdfs dfs -ls
webhdfs://<hdfs namespace>/ command throws an authentication error.
Solution:
a. Stop the IBM Storage Scale service using Cloudera Manager.
b. Create an additional NameNode HTTP principal with the CES HDFS IP/hostname.
For example:

kadmin.local -q "addprinc -randkey HTTP/<myceshost>@IBM.COM

c. To update the spnego.service.keytab keytab files, see “Setting up Kerberos for HDFS
Transparency nodes” on page 109. Update the spnego.service.keytab files for each
NameNode.
i) Backup /etc/security/keytabs/spnego.service.keytab.

ii) Update the spnego.service.keytab keytab file by importing the above HTTP principal.
For example:

kadmin.local -q "ktadd -k /etc/security/keytabs/spnego.service.keytab HTTP/<myceshost>@IBM.COM"

iii) If you have used the supplied Kerberos script with HDFS Transparency v3.1.1-3, the
NameNode host principal might be missing in the spnego.service.keytab file. In that case,
also import the host principal into the spnego.service.keytab file.
For example:

kadmin.local -q "ktadd -k /etc/security/keytabs/spnego.service.keytab host/<NameNode host>@IBM.COM"

iv) Move the spnego.service.keytab file to the corresponding NameNode host.


d. Set the dfs.web.authentication.kerberos.principal parameter to *:

<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>*</value>
</property>

e. Ensure that the CES HDFS hostname resolves from DNS and not just from an entry in the /etc/
hosts file.
f. Start the IBM Storage Scale service using Cloudera Manager
10. Creating a Livy interactive session using REST API as follows might fail or hang:

curl -u : --negotiate -X POST --data '{"kind" : "spark"}' -H "Content-Type: application/json" <livy host>:8998/sessions

This is particularly observed on the IBM Power platform.


Solution:
a. Disable the Livy recovery mode within Livy service by setting livy.server.recovery.mode to
off.
b. Recreate the session.
11. DataNode colocation
The Zeppelin service fails to start when the Zeppelin server is colocated with the HDFS Transparency
DataNode. The CM agent creates the /var/lib/zeppelin directory with root:root permissions.

Solution:
a. Change the permission of the directory to zeppelin:zeppelin.
b. Restart Zeppelin.
12. An error is seen when you try to create a Livy interactive session using REST API.
When you are trying to execute Livy REST API (for example: curl -u : --negotiate -X
POST --data '{"kind" : "spark"}' -H "Content-Type: application/json" <livy
host>:8998/sessions), the following error occurs:

org.apache.hadoop.ipc.RemoteException(java.lang.ArithmeticException): / by zero
org.apache.hadoop.hdfs.server.namenode.GPFSNamesystemV0.getAdditionalBlock(GPFSNamesystemV0.
java:711)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:864
)
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(Clie
ntNamenodeProtocolServerSideTranslatorPB.java:549)
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.
callBlockingMethod(ClientNamenodeProtocolProtos.java)
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.jav
a:523) org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)

org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)

Solution:
a. Ensure that the dfs.blocksize parameter in the Cloudera Manager GUI in the IBM Storage
Scale service matches the dfs.blocksize parameter in the CES HDFS configuration in /var/
mmfs/hadoop/etc/hadoop.
b. Restart the Livy service after the dfs.blocksize parameter in the Cloudera Manager GUI
matches the dfs.blocksize parameter in the CES HDFS configuration.
13. Unable to create the managed Hive tables.
All the tables that are created are external tables even when you explicitly requested to create
managed tables.
Solution:
For creating the managed Hive tables, install the Hive on Tez service in CDP Private Cloud Base. For
information on adding the Hive on Tez service, see Installing Hive on Tez and adding a HiveServer
role.
14. While creating the encryption key you see an authorization exception in the Ranger KMS GUI.
Solution:
Add the following parameters in kms-site.xml from Cloudera Manager GUI and retry:
• hadoop.kms.proxyuser.rangeradmin.hosts=*
• hadoop.kms.proxyuser.rangeradmin.groups=*
• hadoop.kms.proxyuser.rangeradmin.users=*
15. When uploading the Oozie shareLib in IBM Storage Scale, you might see the error blocksize(xxxx)
should be an integral mutiple of dataBlockSize(yyyy). This error occurs because
Oozie always tries different blocksize values when uploading the shareLib.
Solution:
a. Go to Cloudera Manager GUI > IBM Storage Scale > Configuration.
b. Search the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
and add or update dfs.namenode.fs-limits.min-block-size = <dfs.blocksize>
c. Save and deploy the client configuration.
d. Restart IBM Storage Scale and Oozie services.
16. The hadoop_secure_web_ui configuration parameter is not effective for the Yarn service when
IBM Storage Scale is integrated.
In the Yarn service, even if the hadoop_secure_web_ui configuration parameter is set to Enabled,
the Resource Manager and the History Server web user interfaces still use simple authentication.
Solution:
This issue is fixed in Cloudera Manager 7.6.1+ for IBM Storage Scale as HDFS provider. Upgrade
Cloudera Manager to 7.6.1.

Chapter 5. Cloudera HDP 3.X
This section describes the deployment of Cloudera Hortonworks Data Platform (HDP®) 3.X on the IBM
Spectrum Scale™ file system by using the Apache Ambari framework.
Note: Starting with IBM Spectrum Scale 5.1.1.2, Cloudera Hortonworks Data Platform (HDP) is no longer
supported. This content is kept in the documentation set merely for reference.
This section specifies the HDP deviation requirements for IBM Spectrum Scale integration and IBM
Spectrum Scale-specific information. Read it beforehand to understand any deviations that are required
when you follow the Hortonworks installation documentation for your specific platform.
Note: HDP 3.x is certified with IBM Spectrum Scale. This certification is for IBM Spectrum Scale software.
Therefore, it applies to all the deployment models of IBM Spectrum Scale, including IBM Elastic Storage
Server. The certification applies to HDP with HDF running on x86 or Power servers.

Planning
Review the “Hadoop IBM Storage Scale Architecture” on page 4 to determine which configuration setup is to be
used in your environment.

Hardware requirements
This section specifies the hardware requirements to install Hortonworks Data Platform (HDP®), IBM
Spectrum Scale Ambari management pack and HDFS Transparency on IBM Spectrum Scale.
In addition to the normal operating system, IBM Spectrum Scale and Hadoop requirements, the
Transparency connector has minimum hardware requirements of one CPU (processor core) and 4GB to
8GB physical memory on each node where it is running. This is a general guideline and might vary. For
more planning information, see Chapter 2, “IBM Storage Scale support for Hadoop,” on page 3.

Preparing the environment


This section describes how to prepare the environment to install Hortonworks Data Platform (HDP®), IBM
Spectrum Scale Ambari management pack, and HDFS Transparency on IBM Spectrum Scale.
The IBM Spectrum Scale Ambari management pack can be used for Ambari 2.7 for Hortonworks HDP to
setup the IBM Spectrum Scale Service.
HDP requires specific version combinations for the Ambari, Mpack and HDFS Transparency. The IBM
Spectrum Scale file system is independent of the versioning for HDP, Mpack and HDFS Transparency.
Hortonworks has been certified to work with IBM Spectrum Scale 4.2.3 and later.

Support matrix
This section describes the Hadoop distribution support matrix.
The Cloudera Hortonworks Data Platform (HDP) release has reached end of support. For more information, see the Cloudera
lifecycle policy.

Table 34. Hadoop distribution support matrix

IBM Spectrum Scale Ambari management pack (Mpack) | HDFS Transparency | Hadoop distribution | Ambari version | RHEL x86_64 | RHEL ppc64le | SLES x86
Mpack 2.7.0.10 | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.5, HDP 3.1.4 | Ambari 2.7.5.17-6 | 7.9 | NA | NA
Mpack 2.7.0.9 | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.5, HDP 3.1.4 | Ambari 2.7.5.0, Ambari 2.7.4.0 | 7.6/7.7/7.8/7.9 | 7.6/7.7 | SLES 12 SP3/SLES 12 SP4 (1)
Mpack 2.7.0.8 | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.5, HDP 3.1.4 | Ambari 2.7.5.0, Ambari 2.7.4.0 | 7.2/7.3/7.4/7.5/7.6/7.7/7.8/7.9 | 7.4/7.5/7.6/7.7 | SLES 12 SP3/SLES 12 SP4 (1)
Mpack 2.7.0.7 (superseded by Mpack 2.7.0.8) | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.5, HDP 3.1.4 | Ambari 2.7.5.0, Ambari 2.7.4.0 | 7.2/7.3/7.4/7.5/7.6/7.7 | 7.4/7.5/7.6/7.7 | SLES 12 SP3/SLES 12 SP4 (1)
Mpack 2.7.0.6 (superseded by Mpack 2.7.0.8) | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.5 | Ambari 2.7.5.0 | 7.2/7.3/7.4/7.5/7.6/7.7 | 7.4/7.5/7.6/7.7 | SLES 12 SP3/SLES 12 SP4 (1)
Mpack 2.7.0.5 (superseded by Mpack 2.7.0.8) | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.4 | Ambari 2.7.4.0 | 7.2/7.3/7.4/7.5/7.6 | 7.4/7.5/7.6 | SLES 12 SP3 (1)
Mpack 2.7.0.4 (superseded by Mpack 2.7.0.8) | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.4 | Ambari 2.7.4.0 | 7.2/7.3/7.4/7.5/7.6 | 7.4/7.5/7.6 | SLES 12 SP3 (1)
Mpack 2.7.0.3 | HDFS Transparency 3.1.0-1 to latest 3.1.0-x stream | HDP 3.1.0 | Ambari 2.7.3.0 | 7.2/7.3/7.4/7.5/7.6 | 7.4/7.5/7.6 | SLES 12 SP3 (1)
Mpack 2.7.0.2 | HDFS Transparency 3.1.0-0 | HDP 3.1.0 | Ambari 2.7.3.0 | 7.2/7.3/7.4/7.5/7.6 | 7.4/7.5/7.6 | SLES 12 SP3 (1)
Mpack 2.7.0.1 | HDFS Transparency 3.1.0-0 | HDP 3.0.1 | Ambari 2.7.0.1 | 7.2/7.3/7.4/7.5 | 7.4/7.5 | -
Mpack 2.7.0.0 | HDFS Transparency 3.0.0-0 | HDP 3.0.0 | Ambari 2.7.0.0 | 7.2/7.3/7.4/7.5 | 7.4/7.5 | -

1 For SLES12 environment, Mpack 2.7.0.2 only supports new HDP installations.
Note:
• HDP and Mpack require Python 2.7.
• HDP Java Development Kits (JDKs): OpenJDK on ppc64le and Oracle JDK on x86_64.
• IBM Spectrum Scale file system releases supported: 4.1.1.3+, 4.2.2.3+, 4.2.3.1+, 5.0.X+ on ppc64le and
x86_64.
RH 7.5 is supported from V4.2.3.9 and V5.0.1.1.
RH 7.6 is supported from V4.2.3.13 and V5.0.2.2.
• IBM Spectrum Scale Management GUI function requires RHEL 7.2+ at IBM Spectrum Scale version
4.2.X+.
• IBM Spectrum Scale snap data for Hadoop function requires RHEL 7.2+ at IBM Spectrum Scale version
4.2.2.X+.
• Preserving Kerberos token delegation during NameNode failover with HDP is supported when Mpack
2.7.0.1 and HDFS Transparency 3.1.0-0 are applied together.
• HDP 3.1 with Mpack 2.7.0.3 and HDFS Transparency 2.7.0.3 supports ViewFS in Ambari. For more
information, see “Configure ViewFs on HDFS clusters without HA” on page 171 and “Configuring ViewFs
on HDFS cluster with HA” on page 173.
• HDP supports HDFS Transparency 3.1.0-x stream and earlier.
• CES HDFS does not support HDP.
• With Mpack 2.7.0.7, you can upgrade HDP 3.1.x HA without unintegrating HDFS Transparency. For more
information, see “Upgrading HDP overview” on page 373.
• From Mpack 2.7.0.8, to change the IBM Spectrum Scale tunables, run the corresponding mm* command
for that specific tunable on the command line. Therefore, the tunables seen in the IBM Spectrum Scale
service panel and the IBM Spectrum Scale cluster/file system will not be the same. Refer to the IBM
Spectrum Scale cluster/file system information from command line for current values of the tunables for
the node/cluster.
When you are deploying Mpack, upgrading Mpack, reintegrating HDFS Transparency, and restarting the
IBM Spectrum Scale service, if the gpfs.storage.type parameter is set to shared, the IBM Spectrum
Scale configurations will not be overwritten.

Setup
This section gives the setup information for Hortonworks Data Platform (HDP) and IBM Spectrum Scale
Hadoop integration support.
Local Repository server
Set up a local repository server to be used for Ambari, IBM Spectrum Scale, and for the OS repository.
1. Set up the Mirror repository server.
2. Set up the Local OS repository if needed.
Base packages
The following packages must be installed on all IBM Spectrum Scale nodes:

$ yum -y install kernel-devel cpp gcc gcc-c++ binutils

For more information, see IBM Spectrum Scale Software requirements topic in the IBM Storage Scale:
Concepts, Planning, and Installation Guide.
HDP packages:
• Install libtirpc-devel package (From RH optional packages) on all nodes.
• MySQL community edition
For the new database install option through HDP for the Hive Metastore, MySQL Community Edition requires
internet access or a local repository set up to deploy on the Hive Metastore host. For more
information, see MySQL Community Edition repository.
The following recommended packages can be downloaded to all nodes:
acl, libacl – to enable Hadoop ACL support
libattr – to enable Hadoop extended attributes
java-<version>-openjdk-devel – Development tool-kit required for short circuit
Some of these packages are installed by default while installing the operating system.
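
For example, on RHEL the recommended packages can be installed as follows; the OpenJDK package name shown assumes Java 8 and might differ on your distribution level:

$ yum -y install acl libacl libattr java-1.8.0-openjdk-devel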

Create the anonymous user id


If the anonymous user id does not exist already, create it on all the nodes of the IBM Spectrum Scale
cluster that are part of the Hadoop cluster.
In non-Kerberos clusters, the anonymous user is mandatory while starting the IBM Spectrum Scale
Service.
Ensure that the UID and GID for the anonymous user have the same value for all the nodes in the IBM
Spectrum Scale cluster.
The anonymous user id is used for Hive.
For more information, see “User and group ids” on page 355.
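
For example, the user can be created as follows on every node; the UID and GID values of 2099 are arbitrary example values, shown only to illustrate that the same values must be used on all nodes:

# groupadd -g 2099 anonymous
# useradd -u 2099 -g anonymous -m anonymous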

Hortonworks Data Platform (HDP)


This topic helps in the preparation to install Hortonworks Data Platform (HDP).
If you are installing the HDP package, follow the Getting Ready, Meet Minimum System Requirements,
Prepare the Environment, Using a Local Repository and Obtaining Public Repositories sections from the
installation guide of your platform. For the Apache Ambari Installation for IBM Power Systems guide and
Apache Ambari Installation guide for the x86 platform, see Hortonworks Documentation website for the
HDP version you are using.
Scale integration deviation requirement:
• Maximum Open Files Requirements: Set the maximum number of open file descriptors to 65535.
• When HDP is integrated with IBM Spectrum Scale, there is no Secondary NameNode.
• For specific changes related to Kernel, SELinux and NTP, password-less ssh and User/group id, see
“IBM Spectrum Scale file system” on page 353.
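The open file limit can be raised persistently through /etc/security/limits.conf. The wildcard entries below are a sketch; they can be narrowed to the Hadoop service accounts if you prefer.

# /etc/security/limits.conf
*    soft    nofile    65535
*    hard    nofile    65535

Log in again and run ulimit -n to verify that the new limit is in effect.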

HDFS Transparency package


This topic helps in the preparation to install HDFS Transparency package.
IBM Spectrum Scale HDFS Transparency (HDFS Protocol) offers a set of interfaces that allows
applications to use HDFS Client to access IBM Spectrum Scale through HDFS RPC requests.
All data transmission and metadata operations in HDFS are done through the RPC mechanism, and
processed by the NameNode and the DataNode services within HDFS.
From IBM Spectrum Scale 5.0.3, the IBM Spectrum Scale gpfs_rpms directory will no longer host a
version of the HDFS Transparency. Multiple versions of HDFS Transparency will reside in the hdfs_rpms
directory. Choose the HDFS Transparency version corresponding to the support matrix and copy it to
the location where the GPFS rpms reside.
IBM Spectrum Scale HDFS Transparency is independently installed from IBM Spectrum Scale and
provided as an rpm package. HDFS Transparency supports both local and shared storage modes.
You can download the IBM Spectrum Scale HDFS Transparency from the “HDFS Transparency download”
on page 28 section.
The module name is gpfs.hdfs-protocol-3.<release>.0-(version).
Save this module in the IBM Spectrum Scale repository.
Note:
• Ensure that there is only one HDFS Transparency package in the IBM Spectrum Scale repository. Rebuild
the repository by running the "createrepo ." command to update the repository metadata (an example is shown after this list).
• Properly review and set “OS tuning for all nodes in HDFS Transparency” on page 55 and “Configure NTP
to synchronize the clock in HDFS Transparency” on page 56 for all nodes.
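The following sketch shows copying the HDFS Transparency rpm into the IBM Spectrum Scale repository and rebuilding the repository metadata. The repository path and the rpm file name are examples only; they must match your local repository setup and the version from the support matrix.

$ cp gpfs.hdfs-protocol-3.1.0-3.x86_64.rpm /var/www/html/repos/GPFS/x86_64/rhel/5.0.1/
$ cd /var/www/html/repos/GPFS/x86_64/rhel/5.0.1/
$ createrepo .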

IBM Spectrum Scale file system


This topic helps in the preparation to install IBM Spectrum Scale file system.
For IBM Spectrum Scale overview, see the Product overview section in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
If you have purchased the IBM Spectrum Scale license, you can download the IBM Spectrum Scale base
installation package files from the IBM Passport Advantage website.
For IBM Spectrum Scale version 4.1.1.7 and later or version 4.2.0.1 and later, full images are available
through Fix Central.
For IBM Spectrum Scale trial and purchase licenses: https://www.ibm.com/us-en/marketplace/scale-out-
file-and-object-storage/purchase
To order IBM Spectrum Scale, see Question 1.1 in IBM Spectrum Scale FAQ documentation.
The latest IBM Spectrum Scale update package (PTF) files can be obtained from Fix Central.
Note: Starting with release 4.2.3, IBM Spectrum Scale Express Edition is no longer available.

Kernel, SELinux and NTP


This topic gives information about Kernel, SELinux and NTP.
Kernel
See Installation of Kernel packages under the IBM Spectrum Scale support for Hadoop Kernel section.
SELinux



SELinux must be in disabled mode.
NTP
For Hadoop section, see “NTP” on page 18 under the IBM Spectrum Scale software requirement section.

Network validation
While using a private network for Hadoop DataNodes, ensure that all nodes, including the management
nodes, have hostnames bound to the faster internal network or the data network.
On all nodes, the hostname -f must return the FQDN of the faster internal network. This network can be a
bonded network. If the nodes do not return the FQDN, modify /etc/sysconfig/network and use the
hostname command to change the FQDN of the node.
In the /etc/hosts file, the long (fully qualified) hostname must be listed before the short hostname.
Otherwise, the HBase service check in Ambari can fail.
If the nodes in your cluster have two network adapters, see Dual Network Deployment.
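For example, a minimal /etc/hosts layout with the fully qualified name before the short name might look like the following. The addresses and hostnames are placeholders.

192.0.2.11   compute001.private.dns.zone   compute001
192.0.2.12   compute002.private.dns.zone   compute002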

Setting password-less ssh access for root


IBM Spectrum Scale Master is a role designated to the host on which the Master component of the
IBM Spectrum Scale service is installed. It should be a part of the administrator nodes set. All the IBM
Spectrum Scale cluster wide administrative commands including those for creation of the IBM Spectrum
Scale cluster and the file-system are run from this host.
Password-less ssh access for root must be configured from the IBM Spectrum Scale Master node
to all the other IBM Spectrum Scale nodes. This is needed for IBM Spectrum Scale to work. For
non-adminMode central clusters, ensure that you have bi-directional password-less setup for the fully
qualified and short names for all the GPFS™ nodes in the cluster. This must be done for the root user. For
non-root Ambari environment, ensure that the non-root ID can perform bi-directional password-less SSH
between all the GPFS nodes.
Note: BDA Ambari integration supports the admin mode central configuration of IBM Spectrum Scale
(adminMode configuration attribute topic in the IBM Storage Scale: Administration Guide).
In this configuration, one or more hosts could be designated as IBM Spectrum Scale Administration (or
Admin) nodes. By default, the GPFS Master is an Admin node. In Admin mode central configuration, it is
sufficient to have only uni-directional password-less ssh for root from the Admin nodes to the non-admin
nodes. This configuration ensures better security by limiting the password-less ssh access for root.
An example on setting up password-less access for root from one host to another:
1. Define Node1 as the IBM Spectrum Scale master.
2. Log on to Node1 as the root user.

# cd /root/.ssh

3. Generate a pair of public authentication keys. Do not type a passphrase.

# ssh-keygen -t rsa

Generate the public-private rsa key pair.


Type the name of the file in which you want to save the key (/root/.ssh/id_rsa):
Type the passphrase.
Type the passphrase again.
The identification has been saved in /root/.ssh/id_rsa.
The public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:



Note: During ssh-keygen -t rsa, accept the default for all.
4. Set the public key to the authorized_keys file.

# cd /root/.ssh/; cat id_rsa.pub > authorized_keys

5. For clusters with adminMode as allToAll, copy the generated public key file to nodeX.

# scp /root/.ssh/* root@nodeX:/root/.ssh

where, nodeX is all the nodes.


For clusters with adminMode as central, copy the generated public key file to nodeX.

# scp /root/.ssh/* root@nodeX:/root/.ssh

nodeX is all the nodes chosen for administration.


Configure the password less ssh with non admin nodes (nodeY) in the clusters.

# ssh-copy-id root@nodeY

nodeY is rest of the cluster nodes.


6. Ensure that the public key file permission is correct.

# ssh root@nodeX "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

7. Check password-less access

# ssh node2

[root@node1 ~]# ssh node2


The authenticity of host 'gpfstest9 (192.0.2.0)' can't be established.
RSA key fingerprint is 03:bc:35:34:8c:7f:bc:ed:90:33:1f:32:21:48:06:db.
Are you sure you want to continue connecting (yes/no)?yes

Note: You also need to run ssh node1 to add the key into /root/.ssh/known_hosts for password-
less access.

User and group ids


All user IDs and group IDs that are used in the cluster for running jobs, for accessing the IBM Spectrum
Scale file system, or for the Hadoop services must be created with the same values across all the IBM
Spectrum Scale nodes. This is required by IBM Spectrum Scale.
The recommendation is to create the uid/gid manually on all the nodes for the service before deploying
the service. This is because Ambari creates inconsistent uid/gid across the cluster after the initial
deployment on a fresh cluster. See Jira AMBARI-12616. For a list of the service user and group account
uid/gid see the Default user accounts and Default group accounts sections under Understanding service
users and groups.
Note: Ensure that these user and group accounts uid and gid have the same values across all the IBM
Spectrum Scale nodes.
• If you are using LDAP, create the IDs and groups on the LDAP server and ensure that all nodes can
authenticate the users.
• If you are using local IDs, the IDs must be the same on all nodes with the same ID and group values
across the nodes.
• If you set up remote mount access for IBM Spectrum Scale, the owning cluster does not need the
Hadoop uid and gid configured because no applications run on those nodes. However, if the owning
cluster has other applications from non-Hadoop clients, ensure that the uid and gid values used by the
Hadoop cluster are not the same as those used by the non-Hadoop clients.
• The anonymous user is not used by Hive if hive.server2.authentication is configured as
LDAP or if Kerberos is enabled. However, the default setting for hive.server2.authentication is
NONE, so no authentication is done for Hive's requests to HiveServer2 (metadata), and all such requests
are completed as the anonymous user. For more information, see the “Create the
anonymous user id” on page 352 section.
Note: UID or GID is the common way for a Linux system to control access from users and groups. For
example, if the user Yarn UID=100 on node1 generates data and the user Yarn UID=200 on node2 wants
to read this data, the read operation fails because of permission issues.
Keeping a consistent UID and GID for all users on all nodes is important to avoid unexpected issues.
For the initial installation through Ambari, the UID or GID of users are consistent across all nodes.
However, if you deploy the cluster for the second time, the UID or GID of these users might be
inconsistent over all nodes (as per the AMBARI-10186 issue that was reported to the Ambari community).
After deployment, check whether the UID is consistent across all nodes. If it is not, you must fix it by
running the following commands on each node, for each user or group that must be fixed:

##### Change UID of one account:


usermod -u <NEWUID> <USER>

##### Change GID of one group:


groupmod -g <NEWGID> <GROUP>

##### Update all files with old UID to new UID:


find / -user <OLDUID> -exec chown -h <NEWUID> {} \;

##### Update all files with old GID to new GID:


find / -group <OLDGID> -exec chgrp -h <NEWGID> {} \;

##### Update GID of one account:


usermod -g <NEWGID> <USER>
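After fixing any mismatches, a quick cross-node check such as the following sketch can confirm consistency. The user list and node names are examples.

for u in hdfs yarn hive anonymous; do
    for node in node1 node2 node3; do
        echo -n "$node "; ssh root@$node "id $u"
    done
done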

IBM Spectrum Scale local repository


IBM Spectrum Scale only supports installation through a local repository.
Note: If you have already setup an IBM Spectrum Scale file system, you can skip this section.
1. Ensure there is a Mirror repository server created before proceeding.
2. Setup the Local OS repository if needed.
3. Setup the Local IBM Spectrum Scale repository. This section helps you to set up the IBM Spectrum
Scale and HDFS Transparency local repository.

IBM Spectrum Scale service (Mpack)


The IBM Spectrum Scale Ambari management pack is an Ambari service for IBM Spectrum Scale.
For traditional Hadoop clusters that use HDFS, an HDFS service is displayed in the Ambari console to
provide a graphical management interface for the HDFS configuration (hdfs-site.xml) and the Hadoop
cluster (core-site.xml). Through the Ambari HDFS service, you can start and stop the HDFS service, make
configuration changes, and implement the changes across the cluster.
The management pack creates an Ambari IBM Spectrum Scale service to start, stop, and make
configuration changes to IBM Spectrum Scale and HDFS Transparency. Once the IBM Spectrum Scale
HDFS Transparency is integrated, the HDFS service manages the HDFS Transparency NameNodes and
DataNodes instead of the native HDFS components.
To download the IBM Spectrum Scale management pack, see the “Mpack download” on page
357 topic.
The management pack version 2.7.x.x contains the following files:



• SpectrumScaleIntegrationPackageInstaller-2.7.x.x.bin
• SpectrumScaleMPackInstaller.py
• SpectrumScaleMPackUninstaller.py
• SpectrumScale_UpgradeIntegrationPackage [Upgrade IBM Spectrum Scale Mpack]
• mpack_utils.py
• sum.txt
Note: Ensure that all the packages reside in the same directory before executing the executables.

Mpack download
Visit IBM Fix Central to download the IBM Spectrum Scale Management Pack for Ambari (Mpack) and
search for Spectrum_Scale_Management_Pack_MPACK_for_Ambari-<version>-noarch-Linux to find the
correct package.
Untar the download package:

tar zxvf Spectrum_Scale_Management_Pack_MPACK_for_Ambari-<version>-noarch-Linux.tgz

Ensure that you review the “Support matrix” on page 349 section to get the correct Mpack for your
environment.

Ambari user id for IBM Spectrum Scale deployment


The user id to deploy and manage the IBM Spectrum Scale Mpack should be able to invoke the Ambari
REST API. It is also recommended that the user id be a local Ambari user id and not the Domain user id.
The default is usually the Ambari user id.
If the user id is not able to invoke the Ambari REST API, then the GPFS slave fails to get a response from
Ambari with the following error:

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain


result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Server Error

In the ambari-server.log, there are messages stating that the domain user is being added to various
groups and that permission checks are performed before the URL request is processed. This later causes
an internal server error in the Ambari server.
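A quick way to confirm that the user id can invoke the Ambari REST API is to query the clusters endpoint, as in the following sketch. The host, port, and credentials are placeholders for your environment.

curl -u admin:<password> -H 'X-Requested-By: ambari' http://<ambari-server>:8080/api/v1/clusters

A JSON response listing the cluster indicates that the user id can call the REST API; an HTTP 500 error indicates the problem described above.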

Preparing for FPO environment


This section provides the information for preparing for an FPO deployment. To create an FPO cluster, you
need to create an NSD file beforehand.

Preparing a stanza file


The Ambari install process can install and configure a new IBM Spectrum Scale cluster file system and
configure it for Hadoop workloads. To support this task, the installer must know the disks available in the
cluster and how you want to use them. If you do not indicate preferences, intelligent defaults are used.
The stanza file is used for new FPO deployment through Ambari by setting the GPFS NSD file field under
the IBM Spectrum Scale Standard Configs panel.

Simple NSD File


This section describes a simple NSD file with an example.
A simple NSD file can be used only for full disks that have not already been partitioned as input to Ambari.
All disks of the GPFS NSD server nodes must be listed in the NSD stanza file.
The following is an example of a preferred simple IBM Spectrum Scale NSD file:



There are 7 nodes, each with 6 disk drives to be defined as NSDs. All information must be continuous with
no extra spaces.

$ cp /var/lib/ambari-server/resources/gpfs_nsd.sample /var/lib/ambari-server/resources/gpfs_nsd
$ cat /var/lib/ambari-server/resources/gpfs_nsd

DISK|compute001.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg
DISK|compute002.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg
DISK|compute003.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg
DISK|compute005.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg
DISK|compute006.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg
DISK|compute007.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg

If you want to select disks, such as SSD drives, for metadata, add the -meta label to those disks.
In a simple NSD file, add the -meta label to the disks that you want to use as metadata disks, as shown in
the following example. If -meta is used, the partition algorithm is ignored.

$ cat /var/lib/ambari-server/resources/gpfs_nsd

DISK|compute001.private.dns.zone:/dev/sdb-meta,/dev/sdc,/dev/sdd
DISK|compute002.private.dns.zone:/dev/sdb-meta,/dev/sdc,/dev/sdd
DISK|compute003.private.dns.zone:/dev/sdb-meta,/dev/sdc,/dev/sdd
DISK|compute005.private.dns.zone:/dev/sdb-meta,/dev/sdc,/dev/sdd
DISK|compute006.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd
DISK|compute007.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd

In the simple NSD file, /dev/sdb from compute001, compute002, compute003, and compute005 are
specified as meta disks in the IBM Spectrum Scale file system.
The partition algorithm is ignored if the nodes listed in the simple NSD file do not match the set of nodes
that will be used for the NodeManager service. If nodes that are not NodeManagers are in the NSD file or
nodes that will be NodeManagers are not in the NSD file, no partitioning will be done.

Standard NSD file


This section describes a standard NSD file with an example.
The following is an example of a Standard IBM Spectrum Scale NSD File.

%pool: pool=system blockSize=256K layoutMap=cluster allowWriteAffinity=no
%pool: pool=datapool blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=256

# gpfstest9
%nsd: nsd=node9_meta_sdb device=/dev/sdb servers=gpfstest9 usage=metadataOnly failureGroup=101 pool=system
%nsd: nsd=node9_meta_sdc device=/dev/sdc servers=gpfstest9 usage=metadataOnly failureGroup=101 pool=system
%nsd: nsd=node9_data_sde2 device=/dev/sde2 servers=gpfstest9 usage=dataOnly failureGroup=1,0,1 pool=datapool
%nsd: nsd=node9_data_sdf2 device=/dev/sdf2 servers=gpfstest9 usage=dataOnly failureGroup=1,0,1 pool=datapool

# gpfstest10
%nsd: nsd=node10_meta_sdb device=/dev/sdb servers=gpfstest10 usage=metadataOnly failureGroup=201 pool=system
%nsd: nsd=node10_meta_sdc device=/dev/sdc servers=gpfstest10 usage=metadataOnly failureGroup=201 pool=system
%nsd: nsd=node10_data_sde2 device=/dev/sde2 servers=gpfstest10 usage=dataOnly failureGroup=2,0,1 pool=datapool
%nsd: nsd=node10_data_sdf2 device=/dev/sdf2 servers=gpfstest10 usage=dataOnly failureGroup=2,0,1 pool=datapool

# gpfstest11
%nsd: nsd=node11_meta_sdb device=/dev/sdb servers=gpfstest11 usage=metadataOnly failureGroup=301 pool=system
%nsd: nsd=node11_meta_sdc device=/dev/sdc servers=gpfstest11 usage=metadataOnly failureGroup=301 pool=system
%nsd: nsd=node11_data_sde2 device=/dev/sde2 servers=gpfstest11 usage=dataOnly failureGroup=3,0,1 pool=datapool
%nsd: nsd=node11_data_sdf2 device=/dev/sdf2 servers=gpfstest11 usage=dataOnly failureGroup=3,0,1 pool=datapool

Note: Starting with IBM Spectrum Scale version 5.0.0, the default block size is 4M.
Type the /var/lib/ambari-server/resources/gpfs_nsd filename in the NSD stanza field.
Because of the limitations of the Ambari framework, the NSD file must be copied to the Ambari server in
the /var/lib/ambari-server/resources/ directory. Ensure that the correct file name is specified
on the IBM Spectrum Scale Customize Services page.
If you are using a standard NSD stanza file and only one datapool is defined, you can either specify the
policy file or let it be done by IBM Spectrum Scale. However, if you have more than one data pool, you
should specify a policy to define the location of the data in the data pool. If there is no policy specified, by
default the data will be stored to the first data pool only.

Policy File
This section describes a policy file with an example.
The bigpfs.pol is an example of a policy file.

RULE 'default' SET POOL 'datapool'
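After the policy file is created, it can be applied to the file system. The following sketch assumes the file system is named bigpfs and that the policy file was saved as /var/lib/ambari-server/resources/bigpfs.pol; both names are examples only.

# Install the placement policy and then list the active policy rules
/usr/lpp/mmfs/bin/mmchpolicy bigpfs /var/lib/ambari-server/resources/bigpfs.pol
/usr/lpp/mmfs/bin/mmlspolicy bigpfs -L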

Preparing for the ECE environment


The ECE storage is required to be configured as shared storage to be used in the Hadoop environment.
Ensure that the “Recommended hardware resource configuration” on page 16 for ECE is met and follow
the IBM Spectrum Scale Erasure Code Edition documentation for installation.
Note: Ensure that the IBM Spectrum Scale file system ACL setting is set to ALL. For more information, see
HDFS and IBM Spectrum Scale file system ACL support section.
After the ECE storage is set up and configured, add the ECE storage to the Hadoop environment by using
the same method as the shared storage mode. See “Deploy HDP or IBM Spectrum Scale service on
pre-existing IBM Spectrum Scale file system” on page 404 section.
Note: FPO is not supported on ECE.

Installation
This section describes the installation and deployment of HDP and IBM Spectrum Scale™ Hadoop
integration that consists of the management pack and HDFS Transparency connector.
This installation section describes how to install the ESS remote mount deployment with all Hadoop nodes
as IBM Spectrum Scale nodes, which is the first deployment model described in “Hadoop IBM Storage Scale
Architecture” on page 4.
Before starting the software deployment, follow the Planning section to setup the environment and
download the software.
Note:
• This chapter describes how to add IBM Spectrum Scale service as a root user. If you plan to restrict root
access, review the “Restricting root access” on page 450.
• To install the IBM Spectrum Scale service, an existing HDFS cluster is required. This can be created by
installing the HDP stack with native HDFS.
For other installation scenarios, see “Configuration” on page 394.

ESS setup
The ESS is setup and tuned by IBM Lab Services.
Note: Ensure that the IBM Spectrum Scale file system ACL setting is set to ALL.



For more information, see the HDFS and IBM Spectrum Scale file system ACL support section.

Adding Services
HDFS and IBM Spectrum Scale are different file systems. If services are added only to one of the file
systems, the other file system does not have the data for that service. Therefore, on switching from native
HDFS to IBM Spectrum Scale or vice versa, the service cannot provide the data that you entered before
switching the file system.
Ensure that you follow the “User and group ids” on page 355 section for uid/gid consistency setup and
verification.
The following are the minimum services required to be installed before you install the IBM Spectrum
Scale service:
• HDFS
• Yarn
• Mapreduce2
• Zookeeper
• SmartSense (HDP)
When adding new services to HDP, ensure that you review all configurations in the Customize Services
wizard tabs, especially the fields that set directories, because IBM Spectrum Scale is a shared
file system. Ensure that no service uses a shared file system mount point where it requires a
local directory. See the Create HDP cluster > Installing, Configuring, and Deploying a Cluster > Customize
Services > Directories section.
If the service is already added with the remote mount point set as one of the directories, and you want to
modify or remove the remote mount point directory from the service, ensure that you restart the service
to pick up the new configuration changes.
If you are planning to install High Availability (HA), ensure that you setup the HA in native HDFS mode. Do
not install HA when IBM Spectrum Scale Service is integrated.
If IBM Spectrum Scale is integrated, see “HDFS NameNode High Availability [HA]” on page 394.

Create HDP cluster


Follow the Installing Ambari section in Hortonworks HDP installation documentation (https://
docs.hortonworks.com) for your specific platform.
This section will mention the deviations required in the setup while integrating with IBM Spectrum Scale.
Note: Ambari uses the mount points it finds on the Ambari server to set up the services during
deployment. If you do not want to use the mount points, ensure that you unmount the mount points
on the Ambari server node before starting Ambari or find and change the mount point in the configuration
setting for each service that uses it. See Create HDP cluster>Installing, Configuring, and Deploying a
Cluster >Customize Services>Directories for a list of services that use the mount point value.
Ensure that you follow the “User and group ids” on page 355 section for uid/gid consistency setup and
verification.

Install Ambari bits


Ensure that the ambari.repo is setup on /etc/yum.repos.d on the Ambari server host.
On the Ambari server host, run:

yum install ambari-server



• Update the /etc/ambari-server/conf/ambari.properties file to point to the correct Open JDK
and JCE files.

$ vi /etc/ambari-server/conf/ambari.properties
jdk1.8.jcpol-url to point to the correct jce_policy-8.zip file.
jdk1.8.url to point to the correct jdk-8u112-linux-x64.tar.gz file.

• Update the number of threads in /etc/ambari-server/conf/ambari.properties file.


The size of the threadpool must be set to the number of logical cpus on the node on which the Ambari
server is running. When the number of threads is not enough in Ambari, the system might suffer a
heartbeat loss and the DataNodes might go down. The Ambari GUI might not be able to start if enough
threads are not available. This is especially true for Power system.
Threadpool values requiring to be modified in /etc/ambari-server/conf/ambari.properties:

$ vi /etc/ambari-server/conf/ambari.properties
server.execution.scheduler.maxThreads=<number of logical cpu's>
client.threadpool.size.max=<number of logical cpu's>
agent.threadpool.size.max=<number of logical cpu's>

To calculate the number of logical cpus:

$ lscpu
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 20

Number of logical cpu's = Thread(s) per core x Core(s) per socket x Socket(s) = 8 x 1 x 20 =
160

Set up the Ambari server


This topic describes how to set up the Ambari server.
On the Ambari server host, run:

ambari-server setup

Note: If you are using Hive MySQL, you also need to set up the MySQL connector
and run ambari-server setup --jdbc-db=mysql --jdbc-driver=/path/to/mysql/mysql-
connector-java.jar.
See Hortonworks Using Hive with MySQL.
If you plan to use a non-root user id, see “Restricting root access” on page 450.

Installing, configuring, and deploying a cluster


This section states the deviations from the Hortonworks documentation, Installing, Configuring, and
Deploying a Cluster section when integrating IBM Spectrum Scale service.

Start the Ambari server


This section describes the procedure to start the Ambari server.
On the Ambari server host, run: ambari-server start.

Log in to Apache Ambari


Open http://<your.ambari.server>:8080 in your web browser.

Select version
Select the software version and method of delivery for your cluster.
Using a Local Repository requires you to configure the software in a repository available in your network.
For this deployment:



• Select Use Local Repository.
• Set HDP and HDP UTILS repository for redhat7/redhat-ppc7.

Install options
The cluster wizard prompts for general information on how to setup the cluster.
In the Target hosts section:
• The ESS I/O servers must not be a part of the Ambari cluster.
• Ambari requires a list of fully qualified domain names (FQDNs) of the nodes in the cluster.
• Verify that the list of the host names used in the Ambari Target Hosts section are the data network
addresses that IBM Spectrum Scale uses for the cluster setup. Otherwise, during the installation of the
IBM Spectrum Scale service, the installation fails and gives an Incorrect hostname error.
• Ensure that the hostname does not have mixed case. Otherwise, failures might occur when starting
services. It is recommended to use all lower case.
For this deployment, the ssh user is set to root.
If you plan to use a non-root user id, see “Restricting root access” on page 450.

Choose services
Choose the services to install into the cluster.
Note: For this configuration, Ranger is unchecked. You can add the Ranger at a later point.
For Scale integration specific information regarding adding services, see “Adding Services” on page 360
section.

Assign masters
The Cluster Install wizard assigns the master components for selected services.
Note:
• GPFS Master node should be colocated on the Ambari server host.
• The native HDFS NameNode will also become the IBM Spectrum Scale HDFS Transparency NameNode.
• It is recommended to assign the Yarn Resource Manager onto the native HDFS NameNode.

Assign slaves and clients


The Cluster Install wizard assigns the slave components to appropriate hosts in the cluster.
Note: Select all nodes to be DataNodes. This installs HDFS Transparency on all the IBM Spectrum Scale
nodes on the Hadoop cluster.

Customize services
Use the Customize Services set of tabs to review and modify your cluster setup. The wizard sets defaults
for each option.
Ensure that you review these settings carefully.
Note: Accumulo and HiveServer2, if installed on the same host, will have port binding conflicts.
To work around the port conflict when starting the services:
• Put Accumulo and HiveServer2 on different hosts, or
• Use a non-default port for either of the services.
Hive Database
In Hive panel, if you are using MySQL you need to set up the MySQL connector.



To use MySQL with Hive, you must download the MySQL Connector/J. After downloading it to the Ambari
Server host, run:

ambari-server setup --jdbc-db=mysql --jdbc-driver=/path/to/mysql/mysql-connector-java.jar

Ambari SSL configuration


The Knox gateway server uses the 8443 port by default. The Knox gateway fails to start if the Ambari
HTTPS uses the same port. Ensure that the Ambari HTTPS port and the Knox gateway port are unique,
and are not used by other processes.
Directories
If the cluster has a mounted file system, Ambari can select mounted paths other than / as the default
directory value for some of its services, when the local file system must be used. This might include a
GPFS mounted directory. Either unmount the mount points on the Ambari server before starting Ambari,
or you must manually find all the places in the Ambari installation configuration and set it to a local
directory. Otherwise HDP services will not start or run correctly as the nodes in the cluster are accessing
the same directories.
Following is the list of service configurations that need the local directory to be configured:

Each entry lists the service, the field name, the GUI field name, and the local directory setting:
• HDFS, dfs.namenode.name.dir (NameNode directories): /hadoop/hdfs/namenode
• HDFS, dfs.datanode.data.dir (DataNode directories): /hadoop/hdfs/data
• Yarn, yarn.nodemanager.local-dirs (YARN NodeManager Local directories): /hadoop/yarn/local
• Yarn, yarn.nodemanager.log-dirs (YARN NodeManager Log directories): /hadoop/yarn/log
• Ambari Metrics, hbase.rootdir (HBase root directory): file:///var/lib/ambari-metrics-collector/hbase
• Kafka, log.dirs (Log directories): /kafka-logs
• Oozie, oozie_data_dir (Oozie Data Dir): /hadoop/oozie/data
• Zookeeper, dataDir (ZooKeeper directory): /hadoop/zookeeper

Install, start and test


Ambari installs, starts, and runs a simple test on each component.
In the event of any failure during the initial cluster deployment, it is a good practice to go through each
service one by one by running its service check command. Ambari runs all the service checks as part of
the installation wizard, but if anything were to fail, Ambari might not have run all the service checks. On
the dashboard page for each service in the Ambari GUI, go to each service's panel and click Actions > Run
Service Check.

Summary - Complete
After the summary page shows a list of accomplished tasks, choose Complete to open the Ambari
dashboard.
Note:
• Ensure that all services are up.
• Ensure that all service checks passed.



• Ensure that all the UID and GID have the same value across all the IBM Spectrum Scale cluster hosts
after HDP is deployed.

Establish an IBM Spectrum Scale cluster on the Hadoop cluster


Establish a local IBM Spectrum Scale cluster on the Hadoop cluster. This local IBM Spectrum Scale
cluster accesses the ESS through a remote mount. This creates a multi-cluster Scale environment in which
one IBM ESS storage system can be shared among different groups, while the remote mount mode isolates
storage management from the local IBM Spectrum Scale cluster.
Note:
• Ensure that the version of IBM Spectrum Scale on the local cluster is higher than or same as the version
on the file system owning the cluster (ESS).
• The maxblocksize value requires to be the same on the local IBM Spectrum Scale cluster and the ESS
cluster. The maxblocksize value can be set up during the installation of the local IBM Spectrum Scale
cluster to be the same value as the ESS cluster. If the maxblocksize is not set, it defaults to 1 MB for
releases prior to IBM Spectrum Scale version 5.0.0, however from 5.0.0 release onwards it is set to 4
MB.
Steps
1. Ensure that the gpfs.repo is in the /etc/yum.repos.d directory on all the nodes in the Hadoop
cluster. On each node, run: yum clean all; yum makecache.
For example, the /etc/yum.repos.d/gpfs.repo file contains:

[GPFS-5.0.1]
name=gpfs-5.0.1
baseurl=http://60.2.0.229/repos/rhel/5.0.1/GPFS_5.0.1
enabled=1
gpgcheck=0

Note: Ensure that the gpgcheck value is set to zero.


2. Install the IBM Spectrum Scale on your cluster. See the Manually installing the IBM Spectrum Scale
software packages on Linux nodes topic in the IBM Storage Scale: Concepts, Planning, and Installation
Guide.
For example, on each of the Hadoop nodes, run:

yum -y install gpfs.adv* gpfs.base* gpfs.crypto* gpfs.ext* gpfs.gpl* gpfs.gskit* gpfs.lice* gpfs.msg*

3. Build the kernel portability layer on each node by issuing the following command:

/usr/lpp/mmfs/bin/mmbuildgpl

4. Follow the Steps for establishing and starting your IBM Spectrum Scale cluster topic in the IBM Storage
Scale: Concepts, Planning, and Installation Guide for your specific Scale version.
Note:
• Do not create NSD (mmcrnsd) or a file system (mmcrfs) because this is a remote mount
environment.
• Ensure that the Ambari server is set as the IBM Spectrum Scale quorum node. The IBM Spectrum
Scale Master node resides on the Ambari server node and requires to be set as a quorum node.
5. Check the maxblocksize value on the local cluster and ESS cluster by running the following
command:

/usr/lpp/mmfs/bin/mmlsconfig | grep maxblocksize



If maxblocksize value on the local cluster is not set or not the same as the ESS cluster, then on the
local cluster, run the following command:

/usr/lpp/mmfs/bin/mmchconfig maxblocksize=<ESSmaxblocksizevalue>

6. Start IBM Spectrum Scale by issuing the mmstartup command.

/usr/lpp/mmfs/bin/mmstartup -a

7. Ensure that all the IBM Spectrum Scale nodes are in active state.

/usr/lpp/mmfs/bin/mmgetstate -a

8. Tune the local cluster as an ESS client:


For remote mount mode for the Hadoop cluster (1st model: Remote mount with all Hadoop nodes as IBM
Spectrum Scale nodes), run the following commands.
On the ESS, run:

scp /usr/lpp/mmfs/samples/gss/gssClientConfig.sh root@<Hadoop_local_scale_cluster_host>:</path-to-gssclient>

On the Hadoop local scale cluster host, run:

<path-to-gssclient>/gssClientConfig.sh all

However, if the IBM Spectrum Scale client nodes and the ESS nodes are in the same cluster (3rd model:
Single cluster with all Hadoop nodes as IBM Spectrum Scale nodes), then run the gssClientConfig.sh
script from the ESS node as <path-to-gssclient>/gssClientConfig.sh <gpfs-client-node1,gpfs-client-node2,gpfs-client-node3,...>.
For additional information, see the Adding IBM Spectrum Scale nodes to the
ESS cluster topic in the Elastic Storage Server: Quick Deployment Guide.
After running this script, restart GPFS™ on the affected nodes for the optimized configuration settings
to take effect.

Configure remote mount access


To configure remote mount access, an existing local IBM Spectrum Scale cluster is required. This cluster
must be a different cluster from the ESS-based cluster.
See “Establish an IBM Spectrum Scale cluster on the Hadoop cluster” on page 364 on how to create the
local IBM Spectrum Scale cluster.
Note:
• The Hadoop local IBM Spectrum Scale cluster is the accessingCluster.
• The ESS IBM Spectrum Scale cluster is the owningCluster.
• Ensure that the local clusters have password-less ssh to the first node or all the nodes listed in the
contact node list. To see the contact node list, run the mmremotecluster show all command.
Follow the Mounting a remote GPFS file system topic in the IBM Storage Scale: Administration Guide to
configure remote mount access between the multiple IBM Spectrum Scale clusters. After the remote
mount is configured, ensure that the accessing cluster can read/write to the mount point using POSIX.
Note: Remote GPFS file system mount is the preferred deployment model for Cloudera HDP
based deployments. For other deployment options such as single scale cluster configuration
(gpfs.storage.type=shared), see “Deploy HDP or IBM Spectrum Scale service on pre-existing IBM
Spectrum Scale file system” on page 404.
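The following is a minimal sketch of the remote mount setup commands, assuming the owning (ESS) cluster exposes a file system named essfs that the accessing (Hadoop) cluster mounts at /essfs. The cluster names, contact nodes, key file locations, and mount point are placeholders; follow the Mounting a remote GPFS file system topic for the authoritative procedure (for example, GPFS may need to be down when the cipher list is first set).

# On one node of each cluster: generate the cluster key and enable authentication
/usr/lpp/mmfs/bin/mmauth genkey new
/usr/lpp/mmfs/bin/mmauth update . -l AUTHONLY

# On the owning (ESS) cluster: register the accessing cluster and grant access to the file system
# (the key file is the accessing cluster's /var/mmfs/ssl/id_rsa.pub copied over)
/usr/lpp/mmfs/bin/mmauth add hadoop.cluster -k /tmp/hadoop_id_rsa.pub
/usr/lpp/mmfs/bin/mmauth grant hadoop.cluster -f essfs

# On the accessing (Hadoop) cluster: define the remote cluster and file system, then mount it
/usr/lpp/mmfs/bin/mmremotecluster add ess.cluster -n essio1,essio2 -k /tmp/ess_id_rsa.pub
/usr/lpp/mmfs/bin/mmremotefs add essfs -f essfs -C ess.cluster -T /essfs
/usr/lpp/mmfs/bin/mmmount essfs -a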



Install Mpack package
This topic lists the steps to install the management pack.
Note: Before you proceed, ensure that you review the “Support matrix” on page 349 section to download
the correct package for your environment.
1. Ensure that the management pack, “IBM Spectrum Scale service (Mpack)” on page 356, is
downloaded and unzipped into a local directory on the Ambari server node. This example uses the /
root/GPFS_Ambari directory.

$ cd /root/GPFS_Ambari
$ tar -xvzf SpectrumScaleMPack-2.7.X.X.noarch.tar.gz

2. Stop all services:


Log in to Ambari. Click Actions > Stop All.
3. On the Ambari server node, as root, install the Management Pack for IBM Spectrum Scale by running
the SpectrumScaleIntegrationPackageInstaller-2.7.X.X.bin executable:
Note: The Management Pack can only be executed as root.
• On the Ambari server node, run cd /root/GPFS_Ambari to enter the directory.
• Run the installer bin to accept the license. The Mpack will be automatically generated and installed
on the Ambari server, and the Ambari server will be restarted after the executable completes.

$ cd /root/GPFS_Ambari
$ ./SpectrumScaleIntegrationPackageInstaller-2.7.0.0.bin

If you want the installer to automatically accept the license, run the installer bin as follows:

$ cd /root/GPFS_Ambari
$ ./SpectrumScaleIntegrationPackageInstaller-2.7.0.0.bin --accept-licence

This will run the Installer non-interactively. This may be useful to users who might want to automate
the Mpack installation.
Ensure that you know the input values before running the installer script.
Input fields:
• Ambari server port number: The port that was set up during the Ambari installation.
• Ambari server IP address: The Ambari server IP address used during Ambari installation. If a node
has multiple networks, specifying the IP address guarantees that the address is used.
• Ambari server username: The Ambari server admin user name.
• Ambari server password: The Ambari server admin user password.
• Kerberos settings:
– Enter kdc principal: The kdc server principal if Kerberos is enabled
– Enter kdc password: The kdc password if Kerberos is enabled
The Mpack does not save the KDC principal and password. The information is used only for validation
checking to ensure that one can connect and authenticate with the Kerberos server. If the validation
fails, the user is notified using a warning message and the Mpack continues to setup the add Scale
service option. The failure does not affect the adding of the Scale service option.
Note: This script automatically restarts the Ambari server.
To complete this installation, “Deploy the IBM Spectrum Scale service” on page 367 in Ambari GUI.



Deploy the IBM Spectrum Scale service
This section lists the steps to add and deploy the IBM Spectrum Scale™ service through Ambari.
These steps are based on the architecture stated in the “Deployment model” on page 6 section.
• Before you proceed, ensure that you review the Preparing the environment section.
• Review the “Limitations” on page 457 section.
• The IBM Spectrum Scale cluster running the application must have a consistent uid/gid
configuration. Ensure that all user IDs and group IDs, including the Hadoop service user IDs and group IDs,
are the same across all the IBM Spectrum Scale nodes on the local Hadoop cluster. The owning cluster of the
remote mount does not require uid/gid setup. For more information, see “User and group ids” on page
355.
The user id anonymous is required. See “Create the anonymous user id” on page 352.
• If configured as non-root, see “Restricting root access” on page 450 section.
• For cluster with IBM Spectrum Scale installed:
– Ensure that IBM Spectrum Scale is active and mounted.

/usr/lpp/mmfs/bin/mmgetstate -a
/usr/lpp/mmfs/bin/mmlsmount all or
/usr/lpp/mmfs/bin/mmmount <fs-name> -a

– Ensure the local Hadoop IBM Spectrum Scale cluster set the IBM Spectrum Scale Master node (which
is the Ambari server node) as a quorum node.
• Secondary NameNode:
– The Secondary NameNode in native HDFS is not needed for HDFS Transparency because the HDFS
Transparency NameNode is stateless and does not maintain FSImage-like or EditLog information.
– The Secondary NameNode should not be shown in the HDFS service GUI when the IBM Spectrum
Scale service is integrated.
• Ambari uses the mount points it finds on the Ambari server to set up the services during deployment.
If you do not want to use the mount points, ensure that you unmount the mount points on the Ambari
server node before starting Ambari or find and change the mount point in the configuration setting for
each service that uses it. See Create HDP cluster > Installing, Configuring, and Deploying a Cluster >
Customize Services > Directories section.
• For HDP 3.X using Ambari 2.7.X, Ambari will add directories in addition to the default /hadoop/hdfs
directory path.
Ensure that you review the HDFS NameNode and DataNode directories and Yarn local directories and
other directories listed in the Customize Services Directories to ensure that only the required directories
are listed.

Log in to Apache Ambari


Log back into Ambari to add the IBM Spectrum Scale service.
Note:
• All services should be down. If not, ensure to stop the services.
• For cluster with IBM Spectrum Scale file system installed, ensure IBM Spectrum Scale is active and
mounted. Ensure that you review the Ambari mount point usage bullet under the “Deploy the IBM
Spectrum Scale service” on page 367 section before proceeding.
Click Services > Add service.



Choose services
On the Add Service Wizard, choose services panel, select the IBM Spectrum Scale package and click
Next.
Note: The actual version for IBM Spectrum Scale deployment is based on the IBM Spectrum Scale
repository. Hence the value shown in the Choose Services panel will not correspond to your actual IBM
Spectrum Scale version.

Assign masters
Colocate the GPFS™ Master component to the same host as the Ambari-server. Click Next.

Assign slaves and clients


Select the GPFS Node components check box on ALL hosts on the Assign Slaves and Agents page. Click
Next.
Note:
• For client-only nodes where you do not want IBM Spectrum Scale, do not select the GPFS Node option.
• Review the GPFS Node column for the NameNode and DataNodes hosts that are part of the HDFS
cluster.
• Selecting the GPFS Node column for those nodes means that they run IBM Spectrum Scale and IBM
Spectrum Scale HDFS Transparency.
• The GPFS Master node is a GPFS Node which is the Ambari Server node.
• HDFS Transparency DataNode is required to be a Hadoop DataNode, a NodeManager, and a GPFS Node.

Customize services
Configuration fields on both standard and advanced tabs are populated with values that are taken from
the IBM Spectrum Scale Hadoop performance tuning guide.
For all setups, the parameters with a lock icon must not be changed after deployment. These include
parameters like the cluster name, remote shell, file system name, and max data replicas. Review the IBM
Spectrum Scale configuration parameter checklist.
Review whether all the services configuration are correct, especially the ones listed in the “Create HDP
cluster” on page 360 > “Installing, configuring, and deploying a cluster” on page 361 > “Customize
services” on page 362 > Directories section. When adding the IBM Spectrum Scale services, the service
configuration directories might change to include the mount points that are now mounted.
For HDP 3.X using Ambari 2.7.X, Ambari will add directories in addition to the default /hadoop/hdfs
directory path.
Ensure that you review the HDFS NameNode and DataNode directories and Yarn local directories and
other directories listed in the Customize Services Directories to ensure that only the required directories
are listed.
Note:
• Assign the metadata disks to the HDFS transparency NameNode running over a GPFS node.
• Assign Yarn ResourceManager on the node running HDFS Transparency NameNode.
Ensure that you check the values are correct for your environment before clicking Next.
Standard Tab
Review GPFS Environment Definition



• GPFS Cluster Name: Name of IBM Spectrum Scale on the local Hadoop cluster. See the mmlscluster
command topic in the IBM Storage Scale: Command and Programming Reference Guide.
• GPFS quorum nodes: Nodes that are IBM Spectrum Scale quorum nodes. See the mmlscluster
command topic in the IBM Storage Scale: Command and Programming Reference Guide.
• gpfs.storage.type: remote (for ESS remote mount deployment)
• gpfs.remotecluster.autorefresh: True (for ESS remote mount deployment)
• gpfs.ssh.user: User id that IBM Spectrum Scale is configured as. This example uses the root user id.

Review GPFS file system Definition

• GPFS file system Name: Local name of the remote mounted file system. Only one local name can be
configured.
• gpfs.mnt.dir: Local mount point name. Only one mount point name can be configured.
• Verify if the disks are already formatted as NSDs: A value of Yes specifies that the NSDs are to be
created only if each disk has not been formatted as a NSD by a previous invocation of the mmcrnsd
command. The default value is Yes. A value of No specifies that the disks are to be created irrespective
of their previous state.
Important: Setting this field to No would result in data stored over those NSDs being erased during
the service deployment process. Therefore, set this field to No only when intended.

For example, the fields would then be set as follows based on the “Configure remote mount access” on
page 365 configuration settings:

gpfs.storage.type: Remote
GPFS file system Name: essfs
gpfs.mnt.dir: /essfs

Note:
• In Ambari, the IBM Spectrum Scale configuration value for gpfs.storage.type is set as remote.
However, the gpfs.storage.type value that is seen in the /var/mmfs/hadoop/etc/hadoop/
gpfs-site.xml is set as shared. Ensure that you update only through Ambari.
• Do not set the GPFS NSD stanza file for ESS or Share mode.
Create Hadoop local cache disks



[Optional] For ESS/shared storage, create the local cache disk for Hadoop usage.
1. Create the Hadoop local cache disk stanza file, hadoop_disk, in the /var/lib/ambari-server/
resources directory.
Hadoop local cache disk stanza file example:

# cat /var/lib/ambari-server/resources/hadoop_disk
DISK|compute001.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,
/dev/sdg,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp
DISK|compute002.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,
/dev/sdg,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp
DISK|compute003.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,
/dev/sdg,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp
DISK|compute005.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,
/dev/sdg,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp
DISK|compute006.private.dns.zone:/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,
/dev/sdg,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp

2. Add the filename, hadoop_disk, to the Hadoop local cache disk stanza file field in the
Standard config tab.
Advanced Tab
Review Advanced gpfs-ambari-server-env Definition

Field Value
AMBARI_USER_PASSWORD Ambari user password
GPFS_REPO_URL There must be no leading or trailing spaces in the
repo name. This directory should contain the IBM
Spectrum Scale file system rpms and the HDFS
Transparency rpm. Only one HDFS Transparency
rpm should reside in this directory.
For example: http://60.2.0.229/repos/GPFS/
x86_64/rhel/5.0.1

Kerberos setting
The Kerberos principal and password can be set through the IBM Spectrum Scale service in Ambari
and saved into the Ambari database. After the IBM Spectrum Scale service is deployed, the Kerberos
credentials are required only for the Unintegrate Transparency Scale service UI action. Therefore,
arbitrary values can be used when setting up the Kerberos principal and password during deployment.
However, valid credentials are required to be saved before executing the Unintegrate Transparency Scale
action.

KDC_Principal Kerberos Principal


KDC_PRINCIPAL_PASSWORD Kerberos Principal password

If Kerberos is disabled, ignore the KDC_PRINCIPAL and KDC_PASSWORD fields under the Customize
Services panel.
If Kerberos is already enabled, then enter the KDC_PRINCIPAL and KDC_PASSWORD fields under the
Customize Services panel.
In a Kerberos environment, verify all the configuration information in the Customize Services panel before
clicking NEXT to go to configure the Configure Identities panel.
Note: Under Advanced gpfs-env, the IBM Spectrum Scale version and HDFS Transparency version will be
set automatically once deployment is completed. These fields are set by Ambari whenever HDFS service
is restarted. All the IBM Spectrum Scale nodes must have the same version of HDFS Transparency and
IBM Spectrum Scale file system installed.



Review
Review the configuration before installation. Click Deploy.

Install, start and test


Selected services will be installed and started. Click Next.
Note:
• If deployment fails, fix any issues and click on the Retry button.
• If some services failed to start, you can Click Next to complete the deployment and start the service up
manually on the Ambari dashboard.

Summary > Complete


After the summary page shows a list of accomplished tasks, choose Complete to open the Ambari
dashboard.
If the IBM Spectrum Scale mount point was unmounted on the Ambari server, after the ADD Service
deployment completes, the IBM Spectrum Scale mount point is automatically remounted.
Note: After the IBM Spectrum Scale service is integrated, the HDFS scripts are modified so that the HDFS
service in Ambari will operate on the IBM Spectrum Scale HDFS Transparency components instead of
native HDFS.

Post setup
This section lists the steps to be performed post setup.
1. Log in to Ambari GUI.
2. On a separate console, check the /usr/lpp/mmfs/bin/mmlsfs -r replication value and update
the HDFS config panel dfs.replication field in Ambari if the value does not match the mmlsfs
command output. For more information, see the mmlsfs command topic in the IBM Storage Scale:
Command and Programming Reference Guide.
Note: For the IBM® ESS file system, if the file system replication is set to 1, the HDFS dfs.replication
field should be set to 1 in the HDFS config panel in Ambari to ensure that the services do not get a
misleading error message about not having enough space (for example, the spark2 thrift server going
down with a "do not have enough space" error).
3. IBM Spectrum Scale service is up with Restart Required icon.
Restart IBM Spectrum Scale service.
Note: From Mpack 2.7.0.3, the Restart Required icon will not appear after the IBM Spectrum Scale
service is deployed.
Run service check for IBM Spectrum Scale service.
4. HDFS Transparency is now integrated into HDFS service.
5. Start all other services from Ambari. Click Ambari GUI > Services > Start All. If some services did not
start properly, start them by going to the host dashboard, and restarting each service individually.
Note: HDFS will get alerts on the NameNodes.
a. HDFS Transparency does not perform the checkpointing because IBM Spectrum Scale is stateless.
Disable the alert as NameNode checkpoint is not relevant. From HDFS panel > Alert > NameNode
Last Checkpoint > State:Enabled > Confirmation panel > Confirm Disable.
b. Disable the NameNode Blocks Health as this value is not relevant when IBM Spectrum Scale
is integrated. From HDFS panel > Alert > NameNode Blocks Health > State:Enabled >
Confirmation panel > Confirm Disable.



c. Disable the NameNode HDFS Pending Deletion Blocks as this value is not relevant when IBM
Spectrum Scale is integrated. From HDFS panel > Alert > NameNode > HDFS Pending Deletion
Blocks > State:Enabled > Confirmation panel > Confirm Disable.
Important:
• Restart all affected components with Stale Configs when the dashboard displays the request.
• If any configuration in the gpfs-site is changed in the IBM Spectrum Scale dashboard in Ambari, a
restart required alert is displayed for the IBM Spectrum Scale service and the HDFS service. Check your
environment to ensure that the changes made are in effect.
• IBM Spectrum Scale service must be restarted and only then can the HDFS service be restarted.

Verifying installation
After HDP with IBM Spectrum Scale service is deployed, verify the installation setup.
Note: To run the IBM Spectrum Scale commands, add the /usr/lpp/mmfs/bin directory to the
environment PATH.
1. Verify all uid/gid values for each user to ensure that they are consistent across all IBM Spectrum Scale
nodes. Check by using mmdsh -N all id <user-name> to see whether the UID is consistent across
all nodes.
2. Check the IBM Spectrum Scale installed packages on all nodes by using rpm -qa | grep gpfs to
verify that all base IBM Spectrum Scale packages have been installed.
3. As user, ambari-qa, check the user id and access to the file system.

HDFS commands:
$ hadoop fs -ls /user

POSIX commands:
$ pwd; ls -ltr

# Create a file with an entry
$ echo "My test" > mytest
$ cat mytest

HDFS commands:
$ hadoop fs -cat mytest

POSIX commands:
$ rm mytest
$ ls -ltr

4. Run wordcount as user.


a. Create a mywordcountfile file.
b. Use the mywordcountfile file as input to the wordcount program.

$ yarn jar /usr/hdp/3.0.0.0-1634/hadoop-mapreduce/hadoop-mapreduce-examples-3.1.0.3.0.0.0-1634.jar
wordcount mywordcountfile wc_output

c. Check the output in the output directory:

$ hadoop fs -ls wc_output

5. Run teragen/terasort. See Hortonworks Smoke Test Mapreduce.

Upgrading and uninstallation



Upgrading HDP overview
When IBM Spectrum Scale service is integrated with HDP, there is a specific set of steps that you need to
perform to finish the update.
There are four steps for upgrading HDP and IBM Spectrum Scale service stack:
1. Ambari upgrade
2. IBM Spectrum Scale Mpack upgrade
3. HDFS Transparency upgrade
4. HDP upgrade
The upgrading process is different depending on the HDP and IBM Spectrum Scale stack versions and
the NameNode HA enablement. Ensure that you check the “Support matrix” on page 349 and follow the
correct upgrade procedure.
Note:
• The HDP stack comprises Ambari and HDP.
• The Mpack stack comprises the IBM Spectrum Scale service and HDFS Transparency.
Upgrade procedures
From Mpack 2.7.0.7, a new set of procedure is introduced. Therefore, there are two upgrade paths
depending on your environment:

• Upgrade path 1: HDP 3.1.x HA enabled environment
• Upgrade path 2: HDP 3.1.x non-HA environment, or HDP 3.0.x environment

Upgrading HDP 3.1.x HA environment


Starting from Mpack 2.7.0.7, a new procedure is introduced for the NameNode HA environment so that
the Unintegrate Transparency action is not executed. The Hadoop cluster always uses the HDFS Transparency
protocol and the IBM Spectrum Scale file system, and no longer needs to revert to native HDFS. To
revert to native HDFS, you need to set the Kerberos admin principal information in the IBM Spectrum
Scale config panel (if Kerberos is enabled) and also need to bring back the same journal nodes that were
used before the IBM Spectrum Scale service was integrated.

Figure 35. High-level HDP 3.1.x HA and Mpack upgrade process

Prerequisites:
1. Current HDP environment is configured with NameNode HA.



2. Current HDP version is 3.1.0.0 or later.
3. Current Ambari version is 2.7.3.0 or later.
4. Current Mpack version is 2.7.0.2 or later.
5. The upgrade from these lower versions must go to the supported HDP 3.1.4 or HDP 3.1.5
with Mpack 2.7.0.7, which contains this new capability.
If your environment fulfills all the requirements, then you need to follow the new upgrade procedure
introduced with Mpack 2.7.0.7 under the “Upgrading HDP 3.1.x HA and Mpack stack” on page 375
section.

Upgrade paths to Mpack 2.7.0.7


Based on the “Support matrix” on page 349, the following are the supported upgrade paths from Mpack
level 2.7.0.6 and earlier to Mpack 2.7.0.7:

HDP level   Ambari level   Mpack level        Upgrade component levels to               Comments
3.1.5       2.7.5          2.7.0.6            Mpack 2.7.0.7                             Only upgrading Mpack
3.1.4       2.7.4          2.7.0.5, 2.7.0.4   Mpack 2.7.0.7                             Only upgrading Mpack
3.1.4       2.7.4          2.7.0.5, 2.7.0.4   Ambari 2.7.5, Mpack 2.7.0.7, HDP 3.1.5    Upgrading entire stack
3.1.0       2.7.3          2.7.0.3, 2.7.0.2   Ambari 2.7.4, Mpack 2.7.0.7, HDP 3.1.4    Support only upgrading entire stack
3.1.0       2.7.3          2.7.0.3, 2.7.0.2   Ambari 2.7.5, Mpack 2.7.0.7, HDP 3.1.5    Support only upgrading entire stack

Upgrading HDP 3.1.x non-HA environment


For all non-NameNode HA environments, you cannot directly upgrade the HDP version without first performing the
Unintegrate Transparency action. This action must be followed by saving the IBM Spectrum Scale
configuration and removing the IBM Spectrum Scale service.

Figure 36. High-level HDP 3.1.x non HA and Mpack upgrade process

You need to follow the “Upgrading HDP 3.1.x non-HA” on page 381 to upgrade HDP and IBM Spectrum
Scale service stack.

Upgrading HDP 3.0.x environment
For HDP 3.0 in HA or non-HA mode, follow the “Upgrading HDP 3.1.x non-HA environment” on page 374.

Mpack package directories for HDP 3.x and Mpack stack


Ensure that you review the “Support matrix” on page 349 before you proceed.
As the root user, download the target Mpack in a directory on the Ambari server node. For information on
downloading the management packs, see the “IBM Spectrum Scale service (Mpack)” on page 356.
Note: The downloaded management pack should be stored and unzipped in a directory that is different
from the directory where the currently installed Mpack resides. Ensure that the Mpack directory is
preserved for future upgrade use.
For example, the currently installed Mpack is at 2.7.0.3 version and the plan is to upgrade to Mpack
2.7.0.7 version.
In this example, the new (target) Mpack has been downloaded in the /root/GPFS_Ambari/upgrade_Mpack
directory. The Mpack contains the upgrade script
(SpectrumScale_UpgradeIntegrationPackage) to upgrade the MPack. The upgrade script must be
run from the directory that contains the new Mpack.
Ensure that the current Mpack installable package resides on a separate directory on the Ambari server
node. This example uses the /root/GPFS_Ambari/currently_installed_Mpack directory. The
SpectrumScaleMPackUninstaller.py script used as part of this procedure must be run from the
directory that contains the existing Mpack.

Upgrading HDP 3.1.x HA and Mpack stack


This section describes the HDP, Ambari, IBM Spectrum Scale MPack and HDFS Transparency upgrade
process for HDP 3.1.x HA environment going to Mpack 2.7.0.7.
In the “Upgrading HDP overview” on page 373, see Figure 35 on page 373 for the upgrade flow.
You must plan a cluster maintenance window and prepare for cluster downtime when you perform this
upgrade.
Note:
• Ensure that you check the “Support matrix” on page 349 and also ensure that the target HDP, Ambari
and Mpack versions are compatible.
• To see the most recently updated default configuration modifications for IBM Spectrum Scale Mpacks,
refer to the “Summary of changes” on page xxxi.
• For HDP upgrade, only the express upgrade is supported.
• Upgrading MPack does not affect the IBM Spectrum Scale file system.

Upgrading Ambari for HDP 3.1.x HA and Mpack stack


Ensure that the instructions in the following sections are followed before proceeding:
• Review “Upgrading HDP 3.1.x HA and Mpack stack” on page 375.
• Review “Support matrix” on page 349.
• Follow “Mpack package directories for HDP 3.x and Mpack stack” on page 375.
Ambari can be upgraded with HDFS Transparency in integrated state when the services are up. This new
procedure requires the Mpack to be upgraded to Mpack 2.7.0.7 and later.

Figure 37. Ambari upgrade flow for HDP 3.x HA and Mpack stack

1. Log in to Ambari.
2. Upgrade Ambari by following the upgrade process section in the Hortonworks documentation.
3. After Ambari is upgraded, either of the following cases is possible:
• Case 1: Mpack is already at version 2.7.0.7 or later.
If the Mpack does not require an upgrade, then after Ambari is upgraded, sync the HDFS
Transparency files by running the following command:

$ cd <current_Mpack>
$ ./SpectrumScale_UpgradeIntegrationPackage --sync-hdfs-transparency

• Case 2: Mpack needs to be upgraded.


If the Mpack needs to be upgraded, the HDFS Transparency files are synced automatically when you
execute the following steps:
a. If Mpack needs to be upgraded, follow the “Upgrading IBM Spectrum Scale service (Mpack) for
HDP 3.1.x HA and Mpack stack” on page 376.
b. If HDFS Transparency needs to be upgraded, follow the “Upgrading HDFS Transparency for HDP
3.1.x HA and Mpack stack” on page 379.
c. Ensure that the required HDFS Transparency is updated before proceeding to HDP upgrade.
4. If upgrading HDP, follow the “Upgrading HDP for HDP 3.1.x HA and Mpack stack” on page 379.

Upgrading IBM Spectrum Scale service (Mpack) for HDP 3.1.x HA and Mpack
stack
Ensure that the instructions in the following sections are followed before proceeding:
• Review “Upgrading HDP 3.1.x HA and Mpack stack” on page 375.
• Review “Support matrix” on page 349.
• Follow “Mpack package directories for HDP 3.x and Mpack stack” on page 375.
• Follow “Upgrading Ambari for HDP 3.1.x HA and Mpack stack” on page 375, if Ambari upgrade is
required.
The Mpack can be upgraded with HDFS Transparency in the integrated state, but all the services need to be
stopped, including the IBM Spectrum Scale (file system) service.

Figure 38. IBM Storage Scale Mpack upgrade flow for HDP 3.x HA and Mpack stack

Note: During Mpack upgrade, the cluster cannot be used as it will neither be in native HDFS state nor in
HDFS Transparency state until the Mpack upgrade is completed.
1. Stop all the services by clicking Ambari > Actions > Stop All.
2. As the root user on the Ambari server node, from the /root/GPFS_Ambari/upgrade_Mpack
directory, run the SpectrumScale_UpgradeIntegrationPackage script with the --preEU
option.
The --preEU option helps with the following:
a. Saves the existing IBM Spectrum Scale service configuration information into the JSON files in the
local directory where the script was executed.
b. For Mpack 2.7.0.8 and earlier, the --preEU option removes the IBM Spectrum Scale service from
the Ambari server so that you can upgrade to the newer IBM Spectrum Scale Mpack. For Mpack
2.7.0.9 and later, after the --preEU step, the user needs to log in to the Ambari Server UI and
manually delete the IBM Spectrum Scale service.
Before you proceed, review the following for the upgrade script:

$ cd /root/GPFS_Ambari/upgrade_Mpack
$ ./SpectrumScale_UpgradeIntegrationPackage --preEU
************************************************************
***STARTING WITH PRE EXPRESS UPGRADE STEPS***
************************************************************
Enter the Ambari server host name or IP address. If SSL is configured, enter host name,
to verify the SSL certificate. Default=192.0.2.22 :
Enter Ambari server port number. If it is not entered, the installer will take default port
8080 :
Enter the Ambari server username, default=admin :
Enter the Ambari server password :
SSL Enabled (True/False) (Default False):

# Note: If Kerberos is enabled, then the KDC principal and password information are
required.
Enter kdc principal:
Enter kdc password:

3. On the Ambari server node, run the following command:

$ rm /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-MPack-
<MPACK-VERSION>/extensions/SpectrumScaleExtension/<MPACK-VERSION>/services/GPFS/package/
scripts/.integration_completed

where, <MPACK-VERSION> is your currently installed Mpack version (for example, 2.7.0.5).

4. As a root user, on the Ambari server, run the MPack uninstaller script
(SpectrumScaleMPackUninstaller.py), from the currently installed Mpack directory, to remove
the existing MPack from Ambari.

$ cd /root/GPFS_Ambari/currently_installed_Mpack
$./SpectrumScaleMPackUninstaller.py
INFO: ***Starting the MPack Uninstaller***

Enter Ambari Server Port Number.
If it is not entered, the uninstaller will take default port 8080:
INFO: Taking default port 8080 as Ambari Server Port Number.
Enter Ambari Server IP Address : 192.0.2.22
Enter Ambari Server Username, default=admin :
INFO: Taking default username "admin" as Ambari Server Username.
Enter Ambari Server Password :
INFO: Verifying Ambari Server Address, Username and Password.
INFO: Verification Successful.
INFO: Spectrum Scale Service is not added to Ambari.
INFO: Spectrum Scale MPack Exists. Removing the MPack.
INFO: Reverting back Spectrum Scale Changes performed while MPack installation.
INFO: Deleted the Spectrum Scale Link Successfully.
INFO: Removing Spectrum Scale MPack.
INFO: Performing Ambari Server Restart.
INFO: Ambari Server Restart Completed Successfully.
INFO: Spectrum Scale MPack Removal Successfully Completed.

5. At this point, the HDP cluster is neither in the Native HDFS nor the HDFS Transparency state.
Therefore, the HDP cluster should not be used.
Note: Removing the Mpack does not restore the native HDFS Journal Nodes.
6. As a root user, on the Ambari server node, from the new Mpack directory (/root/GPFS_Ambari/
upgrade_Mpack), run the SpectrumScale_UpgradeIntegrationPackage script with the --
postEU option.
Note: If Kerberos is enabled, more inputs are required.
Before you proceed, for the --postEU option, review the following:

$ cd /root/GPFS_Ambari/upgrade_Mpack
$ ./SpectrumScale_UpgradeIntegrationPackage --postEU
Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)? (Default Y):
*************************************************************
***STARTING POST EXPRESS UPGRADE STEPS***
*************************************************************
INFO: Found ambari version '2.7.5.7', proceeding to install Mpack version '2.7.0.7'.
Starting post Express Upgrade steps.
Enter the Ambari server host name or IP address. If SSL is configured,
enter host name, to verify the SSL certificate. Default=192.0.2.22 :
Enter Ambari server port number. If it is not entered, the installer will take default port
8080 :
Enter the Ambari server username, default=admin :
Enter the Ambari server password :
SSL Enabled (True/False) (Default False):
Enter kdc principal:
Enter kdc password:
…….
INFO: ***Starting the Spectrum Scale Mpack Installer v2.7.0.7***

INFO: Adding Spectrum Scale MPack : ambari-server install-mpack
--mpack=./SpectrumScaleExtension-MPack-2.7.0.7.tar.gz -v
….
Starting to deploy the Spectrum Scale service in Ambari via REST call.
……
Upgrade of the Spectrum Scale Service completed. From the Ambari GUI,
check the IBM Spectrum Scale installation progress through the background operations panel.
….
IMPORTANT: You need to ensure that the HDFS Transparency package, gpfs.hdfs-protocol-3.X,
is updated in the Spectrum Scale repository.
Then follow the "Upgrade Transparency" service action in the Spectrum Scale service
UI panel to propagate the package to all the GPFS Nodes.
After that is completed, invoke the "Start All" services in Ambari.
$

7. The Mpack is now upgraded and the IBM Spectrum Scale service reappears in the Ambari GUI.
The following steps are for the HDFS Transparency and HDP upgrades.

8. If HDFS Transparency needs to be upgraded, follow the “Upgrading HDFS Transparency for HDP 3.1.x
HA and Mpack stack” on page 379.
Note:
• Ensure that the IBM Spectrum Scale rpms in the IBM Spectrum Scale service configuration panel,
under Advanced gpfs-ambari-server-env > GPFS_REPO_URL, are of the same version as the IBM
Spectrum Scale version currently installed on the system.
• Do not change the GPFS_REPO_URL to point to a new URL during the upgrade phase.
9. Ensure that the required HDFS Transparency is updated before you proceed to HDP upgrade.
10. If HDP needs to be upgraded, follow the instructions in the section “Upgrading HDP for HDP 3.1.x HA
and Mpack stack” on page 379.

Upgrading HDFS Transparency for HDP 3.1.x HA and Mpack stack


Ensure that the instructions in the following sections are followed before proceeding:
• Review “Upgrading HDP 3.1.x HA and Mpack stack” on page 375.
• Review “Support matrix” on page 349.
• Follow “Upgrading Ambari for HDP 3.1.x HA and Mpack stack” on page 375, if Ambari upgrade is
required.
• Follow “Upgrading IBM Spectrum Scale service (Mpack) for HDP 3.1.x HA and Mpack stack” on page
376, if Mpack upgrade is required.
HDFS Transparency can be upgraded while it is in the integrated state, but you need to
stop HDFS Transparency to perform the upgrade.
1. If HDFS Transparency needs to be upgraded, there are a few upgrade options:
a. To upgrade using the Ambari GUI, follow the “Upgrading HDFS Transparency” on page 385 section.
Note:
• Ensure that the IBM Spectrum Scale rpms in the IBM Spectrum Scale service configuration panel
under Advanced gpfs-ambari-server-env > GPFS_REPO_URL, are of the same version as the
IBM Spectrum Scale version currently installed on the system.
• Do not change the GPFS_REPO_URL to point to a new URL during the upgrade phase.
b. Manually perform an rpm or yum update on each node (a hedged example is shown after this list).
2. If you are performing the entire HDP and Mpack upgrade process, upgrade HDFS Transparency
through the Ambari GUI, as HDFS Transparency is to be upgraded all at once and the services are
required to be down. See Note.
3. If you are just upgrading HDFS Transparency and do not want to have down time and are not upgrading
the other services and components in the upgrade flow, upgrade using the Rolling upgrade process.
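For option 1b, a minimal sketch of the manual per-node update follows; the package file name is an assumption for your environment, and the commands mirror those in the “Upgrading HDFS Transparency” on page 385 section:

# run on each GPFS node; the rpm file name is illustrative
$ yum erase gpfs.hdfs-protocol
$ yum clean all
$ yum install gpfs.hdfs-protocol-3.1.0-x.<arch>.rpm
$ rpm -qa | grep gpfs.hdfs-protocol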

Upgrading HDP for HDP 3.1.x HA and Mpack stack


Ensure that the instructions in the following sections are followed before proceeding:
• Review “Upgrading HDP 3.1.x HA and Mpack stack” on page 375.
• Review “Support matrix” on page 349.
• Follow “Mpack package directories for HDP 3.x and Mpack stack” on page 375.
• Follow “Upgrading Ambari for HDP 3.1.x HA and Mpack stack” on page 375, if Ambari upgrade is
required.
• Follow “Upgrading IBM Spectrum Scale service (Mpack) for HDP 3.1.x HA and Mpack stack” on page
376, if required.
• Follow “Upgrading HDFS Transparency for HDP 3.1.x HA and Mpack stack” on page 379, if required.
HDP can be upgraded when the services are up and HDFS Transparency is in the integrated state.

Figure 39. HDP upgrade flow for HDP 3.x HA and Mpack stack

1. Start all services and perform the service checks. This is a prerequisite for the HDP upgrade.
2. Upgrade HDP by following the Hortonworks HDP upgrade process documentation.
The HDP upgrade process happens in two phases:
a. In the Install phase, the newer versions of the binaries corresponding to the HDP services are
installed.
b. In the Upgrade phase, the services are restarted one by one.
3. After the Install phase is finished and before the Upgrade phase begins, the latest mapreduce
framework file needs to be copied to the IBM Spectrum Scale file system. Run the following
commands on the Ambari server as a root user:
a. Create the following IBM Spectrum Scale directory to store the latest mapreduce framework file
(mapreduce.tar.gz):

# mkdir -p <Spectrum Scale filesystem mount directory>/<HDFS Transparency data directory>/hdp/apps/<Target HDP version>/mapreduce

b. Copy the latest mapreduce framework file from HDP Hadoop path to the corresponding IBM
Spectrum Scale path:

# cp -p /usr/hdp/<Target HDP version>/hadoop/mapreduce.tar.gz <Spectrum Scale filesystem mount directory>/<HDFS Transparency data directory>/hdp/apps/<Target HDP version>/mapreduce/mapreduce.tar.gz

where,
• <Target HDP version> is the target HDP version you are upgrading to. This can be found by
running the hdp-select versions command. For example:

# hdp-select versions
3.1.4.72-2
3.1.4.95-3

The higher version from the command output is your target HDP version. In the above example, it
is 3.1.4.95-3.
• <Spectrum Scale filesystem mount directory> is the same as the value of the
gpfs.mnt.dir parameter from /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml.
• <HDFS Transparency data directory> is the same as the value of the gpfs.data.dir
parameter from /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml.
For example:

# mkdir -p /ibm/gpfs1/datadir1/hdp/apps/3.1.4.95-3/mapreduce
# cp -p /usr/hdp/3.1.4.95-3/hadoop/mapreduce.tar.gz /ibm/gpfs1/datadir1/hdp/apps/
3.1.4.95-3/mapreduce/mapreduce.tar.gz

Note: If you have multiple IBM Spectrum Scale file systems configured, repeat steps “3.a” on page
380 and “3.b” on page 380 for each file system mount point defined in /var/mmfs/hadoop/etc/
hadoop/gpfs-site.xml.
4. Perform the Upgrade phase for the HDP upgrade process as specified in the Cloudera HDP upgrade
documentation.

Post update process for HDP 3.x and Mpack stack


1. Restart all the components displaying the restart icon.
2. If the Start All fails, try to start each service individually. Ensure that you first start the services
manually in the Ambari order. For more information, see “Manually starting services in Ambari” on page
397.
3. The NameNode Last Checkpoint alert can be ignored and can be disabled.
4. If the HBase master failed to start with the FileAlreadyExistsException error, restart HDFS and
then restart the HBase master.

Upgrading HDP 3.1.x non-HA


This section describes the HDP and IBM Spectrum Scale MPack upgrade process for HDP 3.1.x non-HA
and HDP 3.0.x and earlier.
In the “Upgrading HDP overview” on page 373 section, see Figure 36 on page 374 for the flow.
You must plan a cluster maintenance window and prepare for cluster downtime when you upgrade the
IBM Spectrum Scale MPack.
Note:
• You must perform the Mpack upgrade only if the target Mpack version is supported on your HDP level.
Ensure that you check the support matrix and verify whether the Mpack version is supported with your
HDP level.
• To see the default configuration modifications under IBM Spectrum Scale Mpacks, refer to the Big data
and analytics section under the Big Data and Analytics - summary of changes.
• For HDP upgrade, only express upgrade is supported.
• The cluster must be at management pack version 2.7.0.0 or later.
• Upgrading MPack does not affect the IBM Spectrum Scale file system.
• Ensure that the anonymous user id is created and has the same uid/gid on all nodes in your cluster before
upgrading. From Mpack 2.4.2.6, having an anonymous user id is mandatory. For more information,
see “Create the anonymous user id” on page 352. An illustrative consistency check is shown after this list.
• Before you proceed with the upgrade process in a Kerberized environment, you need to
set the KDC_PRINCIPAL and KDC_PRINCIPAL_PASSWORD values in the IBM Spectrum Scale
services > Configs > Advanced section and save the configuration. If the environment is
Kerberized, the unintegrate HDFS Transparency service action requires the KDC_PRINCIPAL and
KDC_PRINCIPAL_PASSWORD values to be configured in advance.
• If you are planning to migrate from a Mpack version 2.7.0.3 or earlier to Mpack version 2.7.0.4 or later, a
workaround solution is required. For information, see Upgrade failures from Mpack 2.7.0.3 or earlier to
Mpack 2.7.0.4 - 2.7.0.6.
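As an illustrative check only, the anonymous user consistency can be verified (and, if missing on a node, the user created) as follows; the user name follows “Create the anonymous user id” on page 352, and the uid/gid value 2999 is an assumption that must match your cluster's convention:

$ mmdsh -N all "id anonymous"
# on any node where the user is missing, create it with the same uid/gid:
$ groupadd -g 2999 anonymous
$ useradd -u 2999 -g 2999 anonymous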
Procedure
1. As the root user, download a management pack at a higher PTF version than the version of IBM
Spectrum Scale service installed on your system, onto a directory on the Ambari server node. For
information on downloading the management packs, see “IBM Spectrum Scale service (Mpack)” on
page 356.
Note: The downloaded management pack should be stored and unzipped in a directory different than
the currently installed version of the Mpack.

In this example, the management pack has been downloaded to the /root/GPFS_Ambari/upgrade_Mpack
directory. The management pack contains the upgrade script to
upgrade the MPack.
For example, if the currently installed Mpack is at version 2.7.0.0, then plan to upgrade to Mpack
version 2.7.0.1.
The SpectrumScale_UpgradeIntegrationPackage script used for upgrade and migration is run
from the /root/GPFS_Ambari/upgrade_Mpack directory.
Ensure that the current Mpack installable package resides on a separate directory on the
Ambari server node. This example uses the /root/GPFS_Ambari/currently_installed_Mpack
directory.
The SpectrumScaleMPackUninstaller.py script used as part of this procedure would have to be
run from the /root/GPFS_Ambari/currently_installed_Mpack directory.
2. Log in to Ambari.
3. Stop all the services. Click Ambari > Actions > Stop All.
Note: For an FPO cluster, do not run Stop All from the Ambari GUI. Refer to the Limitations > General
sections on how to properly stop IBM Spectrum Scale.
Note: Ensure that the IBM Spectrum Scale file system is not being accessed using either HDFS or
POSIX so that it can be unmounted and stopped properly. For more information, see Why did the
IBM Spectrum Scale service not stop or restart properly? in the General Problem determination
section.
4. After all the services have stopped, unintegrate the transparency.
Note: If you run the Unintegrate Transparency action more than once consecutively, unpredictable
errors can occur and cause the cluster to be in an unusable state. In such cases, contact
[email protected].
To unintegrate the transparency, run the following steps:
a. Click Spectrum Scale > Service Actions > Unintegrate Transparency.
b. On the Ambari server node, run the ambari-server restart command to restart the Ambari
server.
Note: Do not start any services.
5. If the IBM Spectrum Scale service is not already stopped, stop the IBM Spectrum Scale service by
clicking Ambari > Spectrum Scale > Service Actions > Stop.
6. As the root user on the Ambari server node, from the /root/GPFS_Ambari/upgrade_Mpack
directory, run the SpectrumScale_UpgradeIntegrationPackage script with the --preEU option.
The --preEU option saves the existing IBM Spectrum Scale service information into JSON files in the
local directory where the script was run. It also removes the IBM Spectrum Scale service from the
Ambari cluster so that the cluster can be properly migrated. This does not affect the IBM Spectrum
Scale file system.
Note: If you are migrating from Mpack version 2.7.0.3 or earlier to Mpack version 2.7.0.4 or later, run
the SpectrumScale_UpgradeIntegrationPackage script with the --preEU option from
the currently_installed_Mpack directory instead. Then copy the generated files specified in Upgrade
failures from Mpack 2.7.0.3 or earlier to Mpack 2.7.0.4 - 2.7.0.6 to the upgrade_Mpack directory.
Before you proceed, review the following questions for the upgrade script and have the information
for your environment handy. If Kerberos is enabled, more inputs are required.

$ cd /root/GPFS_Ambari/upgrade_Mpack
$ ./SpectrumScale_UpgradeIntegrationPackage --preEU
Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)? (Default Y):
************************************************************
***STARTING WITH SPECTRUM SCALE EXPRESS UPGRADE PRE STEPS***
************************************************************
Enter the Ambari server User:(Default admin ):
Enter the password for the Ambari server.
Password:
Retype password:
SSL Enabled (True/False) (Default False):
Enter the Ambari server Port. (Default 8080):
...
# Note: If Kerberos is enabled, then the KDC principal and password information are
required.
Kerberos is Enabled. Proceeding with Configuration
Enter kdc principal:
Enter kdc password:
...

7. As a root user on the Ambari server, run the MPack uninstaller script,
SpectrumScaleMPackUninstaller.py, from the currently installed Mpack directory, to remove
the existing MPack link in Ambari.
The removal of the IBM Spectrum Scale service during the
SpectrumScale_UpgradeIntegrationPackage --preEU does not remove the Mpack link in
the Ambari database. After the service is removed, remove the link.

$ cd /root/GPFS_Ambari/currently_installed_Mpack
$./SpectrumScaleMPackUninstaller.py
INFO: ***Starting the MPack Uninstaller***

Enter Ambari Server Port Number. If it is not entered, the uninstaller will take
default port 8080:
INFO: Taking default port 8080 as Ambari Server Port Number.
Enter Ambari Server IP Address : 192.0.2.22
Enter Ambari Server Username, default=admin :
INFO: Taking default username "admin" as Ambari Server Username.
Enter Ambari Server Password :
INFO: Verifying Ambari Server Address, Username and Password.
INFO: Verification Successful.
INFO: Spectrum Scale Service is not added to Ambari.
INFO: Spectrum Scale MPack Exists. Removing the MPack.
INFO: Reverting back Spectrum Scale Changes performed while MPack installation.
INFO: Deleted the Spectrum Scale Link Successfully.
INFO: Removing Spectrum Scale MPack.
INFO: Performing Ambari Server Restart.
INFO: Ambari Server Restart Completed Successfully.
INFO: Spectrum Scale MPack Removal Successfully Completed.

8. After you are in native HDFS, log in to the Ambari server and perform the following checks:
a. Check the Directories section under “Customize services” on page 362 to ensure that the service
field values do not contain any IBM Spectrum Scale directory paths. If there are any,
remove those paths and save the configuration. For example, check the values for the following fields:

dfs.datanode.data.dir
dfs.namenode.name.dir
yarn.nodemanager.log-dirs
yarn.nodemanager.local-dirs

b. The IBM Spectrum Scale service does not have journal nodes. However, after the cluster is back in native
HDFS, the journal nodes are restored. If you are using Kerberos, the journal nodes are required
to have the proper principals configured. If they are not configured, you need to create the principals for them after
unintegrating HDFS Transparency.
9. HDP is now in the native HDFS mode.
• If you plan to upgrade HDP to a newer level, follow the process defined in the Hortonworks
documentation to upgrade the HDP and the Ambari versions that the Mpack level supports.
• After HDP and Ambari are upgraded, ensure that you stop all the services before you proceed to
re-deploy the IBM Spectrum Scale service.
10. Ensure all services have stopped.

11. On the Ambari server node as root, from the /root/GPFS_Ambari/upgrade_Mpack directory,
run the SpectrumScale_UpgradeIntegrationPackage script with the --postEU option in the
directory where the --preEU step was run and where the JSON configurations were stored.
Note: If you are migrating from Mpack version 2.7.0.3 or earlier to Mpack version 2.7.0.4 or
later, ensure that the generated files specified in Upgrade failures from Mpack 2.7.0.3 or earlier
to Mpack 2.7.0.4 - 2.7.0.6. are copied to the upgrade_Mpack directory before running the
SpectrumScale_UpgradeIntegrationPackage script --postEU command.
Before you proceed, for the --postEU option, review the following questions and have the
information for your environment handy. If Kerberos is enabled, more inputs are required.

$ cd /root/GPFS_Ambari/upgrade_Mpack
$ ./SpectrumScale_UpgradeIntegrationPackage --postEU
Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)?
(Default Y):
*************************************************************
***STARTING WITH SPECTRUM SCALE EXPRESS UPGRADE POST STEPS***
*************************************************************
Starting Post Express Upgrade Steps. Enter Credentials
Enter the Ambari server User:(Default admin ):
Enter the password for the Ambari server.
Password:
Retype password:
SSL Enabled (True/False) (Default False):
Enter the Ambari server Port. (Default 8080):
....
# Accept License
Do you agree to the above license terms? [yes or no]
yes
Installing...
Enter Ambari Server Port Number. If it is not entered, the installer will take
default port 8080 :
INFO: Taking default port 8080 as Ambari Server Port Number.
Enter Ambari Server IP Address :
192.0.2.22
Enter Ambari Server Username, default=admin :
INFO: Taking default username "admin" as Ambari Server Username.
Enter Ambari Server Password :
...
Enter kdc principal:
Enter kdc password:
...
From the Ambari GUI, check the IBM Spectrum Scale installation progress through
the background
operations panel.
Enter Y only when installation of the Spectrum Scale service using REST call
process is completed.
(Default N)Y ** SEE NOTE BELOW **
Waiting for the Spectrum Scale service to be completely installed.
...
Waiting for server start....................
Ambari Server 'start' completed successfully.
*************************************************************
Upgrade of the Spectrum Scale Service completed successfully.
*************************************************************
*********************************************************************************
**************
IMPORTANT: You need to ensure that the HDFS Transparency package, gpfs.hdfs-
protocol-2.7.3.X,
is updated in the Spectrum Scale repository. Then follow the "Upgrade
Transparency" service
action in the Spectrum Scale service UI panel to propagate the package to all
the GPFS Nodes.
After that is completed, invoke the "Start All" services in Ambari.
*********************************************************************************
**************

Note: If the Mpack requires a corresponding HDFS Transparency update version, ensure that the
process in the “Upgrading HDFS Transparency” on page 385 is done before doing a Start All in the
next step.
12. Start all the services.
Click Ambari > Actions > Start All.
Restart all the components by using the restart icon.
Note:
• If the Start All fails, try starting each of the services individually. Ensure that the services are started
manually in the Ambari order first. For more information, see “Manually starting services
in Ambari” on page 397.
• If the IBM Spectrum Scale service is restarted by using the restart icon, the HDFS service also
needs to be restarted.
• The NameNode Last Checkpoint alert can be ignored and can be disabled.
• If the HBase master failed to start with FileAlreadyExistsException error, restart HDFS and
then restart the HBase master.

Upgrading HDFS Transparency


You must plan a cluster maintenance window, and prepare for the cluster downtime while upgrading the
HDFS Transparency.
You can update HDFS Transparency as follows:
• Using the Upgrade Transparency action in the Ambari GUI under the IBM Spectrum Scale service
• Using yum update on the nodes
• Using rpm update on the nodes
The IBM Spectrum Scale update package and the HDFS Transparency package are upgraded separately.
This section describes how to update HDFS Transparency through Ambari.
1. Save the new IBM Spectrum Scale HDFS Transparency package into the existing IBM Spectrum Scale
yum repository. Ensure that this IBM Spectrum Scale GPFS yum repository is the same repository as
the one specified in the GPFS_REPO_URL. Click Ambari IBM Spectrum Scale > Configs > Advanced
gpfs-ambari-server-env > GPFS_REPO_URL.
If the yum repository is different, see GPFS yum repo directory to update the yum repository.
Remove the old version of the IBM Spectrum Scale HDFS Transparency or save it at another location.
Note:
• Ensure that the IBM Spectrum Scale rpms in the IBM Spectrum Scale service configuration
panel under Advanced gpfs-ambari-server-env, GPFS_REPO_URL, are the same as the IBM
Spectrum Scale version currently installed on the system.
• Do not change the GPFS_REPO_URL to point to a new URL during the upgrade phase.
• Ensure that only one HDFS Transparency version is in the GPFS_REPO_URL.
• Update only the HDFS Transparency version in the GPFS_REPO_URL if needed and run
"createrepo . " to update the repolist.
2. Go to the IBM Spectrum Scale yum directory, and rebuild the yum database by running the
createrepo command.

$ cd /var/www/html/repos/GPFS/<Scale_version>/gpfs_rpms
$ createrepo .
Spawning worker 0 with 2 pkgs
Spawning worker 1 with 2 pkgs
Spawning worker 2 with 2 pkgs
Spawning worker 3 with 2 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete

3. From the dashboard, select Actions > Stop All to stop all the services.
Note: For an FPO cluster, do not run Stop All from the Ambari GUI. Refer to the Limitations > General section
on how to properly stop IBM Spectrum Scale.
Note: To upgrade the HDFS Transparency, the IBM Spectrum Scale file system does not need to be
stopped. If you do not want to stop the IBM Spectrum Scale file system, do not select Actions > Stop
All. Instead, stop all the services individually by going into each service panel, and clicking Actions >
Stop for all, except the IBM Spectrum Scale service.
4. From the dashboard, click Spectrum Scale > Actions > Upgrade Transparency.

5. Check to see if the correct version of the HDFS Transparency is installed on all the GPFS nodes.

$ rpm -qa | grep hdfs-protocol
gpfs.hdfs-protocol-3.0.0-0.x86_64

If the HDFS Transparency version is not correct on a specific node, then manually install the correct
version onto that node.
To manually install the HDFS Transparency on a specific node:
a. Remove the existing HDFS Transparency package by running the following command:

$ yum erase gpfs.hdfs-protocol<Old version>
$ yum clean all

b. yum install the new package.

$ yum install gpfs.hdfs-protocol-3.0.0.0.<OS>.rpm

6. On the dashboard, click Actions > Start All.


Note: If you are performing a BI/HDP upgrade or migrate procedure, do not start any services.
7. Check that the HDFS Transparency connector NameNode and DataNode are functioning.

$ /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector getstate
c902f05x01.gpfs.net: namenode running as process 18150.
c902f05x01.gpfs.net: datanode running as process 22958.
c902f05x02.gpfs.net: datanode running as process 26416.
c902f05x03.gpfs.net: datanode running as process 17275.
c902f05x04.gpfs.net: datanode running as process 15560

Upgrading IBM Spectrum Scale file system
You must plan a cluster maintenance window, and prepare for cluster downtime while upgrading the IBM
Spectrum Scale file system. Ensure that all the services are stopped, and that no processes are accessing
the IBM Spectrum Scale file system.
Follow the Upgrading section in the IBM Storage Scale: Concepts, Planning, and Installation Guide to
upgrade to a newer release version of IBM Spectrum Scale. For additional information on upgrading IBM
Spectrum Scale, see Upgrade on Linux nodes topic under the Quick reference section in the IBM Spectrum
Scale documentation.
To upgrade to a PTF version of IBM Spectrum Scale™, consider the version that you are upgrading from,
along with any co-existence and compatibility issues. You can then use the Upgrade IBM Spectrum Scale
action in Ambari to update to the IBM Spectrum Scale PTF level.
You can update the IBM Spectrum Scale packages through the Ambari GUI. This function upgrades
the IBM Spectrum Scale packages and builds the GPFS portability layer on all the Ambari GPFS nodes. The
packages follow the same rules as specified in Local IBM Spectrum Scale repository.
Note: Only offline upgrade of IBM Spectrum Scale is supported through this Ambari interface.
Upgrading the IBM Spectrum Scale file system package and Upgrading HDFS Transparency are done
separately.
You can get the PTF packages from IBM Fix Central and extract the packages as stated in the README
file.
You must put all the update packages (PTF) into a yum repository. If the Yum repository is not the existing
IBM Spectrum Scale yum repository path specified in Ambari, add the yum repository URL to Ambari
IBM Spectrum Scale configuration. For information on updating the yum repository, see GPFS yum repo
directory.
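As a minimal sketch, staging the extracted PTF packages into a yum repository might look like the following; the repository path and extraction directory are assumptions for your environment:

$ mkdir -p /var/www/html/repos/GPFS/<new_PTF_version>/gpfs_rpms
$ cp /tmp/scale_ptf_extract/gpfs_rpms/*.rpm /var/www/html/repos/GPFS/<new_PTF_version>/gpfs_rpms/
$ cd /var/www/html/repos/GPFS/<new_PTF_version>/gpfs_rpms
$ createrepo .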
Note: Only a root Ambari installation can upgrade IBM Spectrum Scale through Ambari. Under
a non-root Ambari installation, the IBM Spectrum Scale file system must be upgraded manually as stated in the
README file for the specific PTF package in Fix Central. Ensure that all services are stopped, and that no
processes are accessing the file system before proceeding with the upgrade.
1. Go to the IBM Spectrum Scale yum directory and rebuild the yum database by using the createrepo
command.

$ createrepo .
Spawning worker 0 with 4 pkgs
Spawning worker 1 with 4 pkgs
Spawning worker 2 with 4 pkgs
Spawning worker 3 with 4 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete

2. From the dashboard, select Actions > Stop All to stop all services.
Note: For an FPO cluster, do not run Stop All from the Ambari GUI. Refer to the Limitations > General section
on how to properly stop IBM Spectrum Scale.
3. If you are doing a release upgrade, follow the Upgrading topic in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
If you are doing a PTF upgrade, from the dashboard, select Spectrum Scale > Actions > Upgrade
Spectrum Scale.

4. Verify that the selected IBM Spectrum Scale PTF packages are installed on the nodes.

$ xdsh c902f05[x01-x04] "rpm -qa | grep gpfs"

Note: Check that the gpfs versions installed on the nodes are the updated ones.
5. From the dashboard, select Actions > Start All.
Note: IBM Spectrum Scale starts with the latest PTF packages. Verify that HDFS Transparency
NameNode and DataNodes are functioning.

$ /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector getstate
c902f05x01.gpfs.net: namenode running as process 18150.
c902f05x01.gpfs.net: datanode running as process 22958.
c902f05x04.gpfs.net: datanode running as process 15560.
c902f05x03.gpfs.net: datanode running as process 17275.
c902f05x02.gpfs.net: datanode running as process 26416.

Note: After all nodes are upgraded to the new level of IBM Spectrum Scale, use the cluster for a
while with the new level installed. Then follow the instructions defined under the topic Completing
the upgrade to a new level of IBM Spectrum Scale in the IBM Storage Scale:
Concepts, Planning, and Installation Guide to finalize the upgrade.

HDP 2.6.4 to HDP 3.1.0.0


For migrating from HDP 2.6.4 to HDP 3.1, ensure that you first map out and download all the software
prerequisites. You must plan a cluster maintenance window and prepare for cluster downtime during the
upgrade.

Table 35. Packages required for upgrading to HDP 3.1

Package                    Version
HDP                        3.1.0.0
Ambari                     2.7.3.0
Management Pack (Mpack)    2.7.0.3
HDFS Transparency          3.1.0-1 to latest 3.1.0-x stream

Note:
• Upgrading is supported only from HDP 2.6.4 to the HDP 3.1.0.0 version and not to a higher HDP 3.1.x version.
• Check the OS and IBM Spectrum Scale version in your current environment to ensure that those
versions are compatible with the HDP, Mpack and HDFS Transparency version. See Table 34 on page
350 and FAQ Q2.2 Which Linux distributions are supported by IBM Spectrum Scale.

• Ensure that the other packages in your environment are compatible with the support matrix.
• If you do not have Kerberos enabled before upgrade, then do not enable Kerberos until the entire
migration process is completed and IBM Spectrum Scale service is added back. For more information,
see Enabling Kerberos.
• Migrating to HDP with IBM Spectrum Scale service does not affect the IBM Spectrum Scale file system.
• Ensure that an anonymous user id is created and has the same uid/gid in your cluster before upgrading.
1. As the root user, download the management pack (as stated in Table 35 on page 388) onto a
directory on the Ambari server node. Ensure that the management pack is at a higher PTF version
than the version of IBM Spectrum Scale service installed on your system. For information on
downloading the management packs, see the topic “IBM Spectrum Scale service (Mpack)” on page
356.
Note: The downloaded management pack should be stored and unzipped in a different directory than
the currently installed version of the Mpack.
In this example, the management pack has been downloaded to the /root/GPFS_Ambari/upgrade_Mpack
directory. The management pack contains the upgrade script to
upgrade the Mpack.
For example, if the currently installed Mpack is at 2.4.2.7 version, plan to upgrade to Mpack 2.7.0.3
version.
The SpectrumScale_UpgradeIntegrationPackage script used for upgrade and migration is run
from the /root/GPFS_Ambari/upgrade_Mpack directory.
Ensure that the current Mpack installable package resides on a separate directory on the
Ambari server node. This example uses the /root/GPFS_Ambari/currently_installed_Mpack
directory.
The SpectrumScaleMPackUninstaller.py script used as part of this procedure would have to be
run from the /root/GPFS_Ambari/currently_installed_Mpack directory.
2. Log in to Ambari.
3. Disable short circuit if enabled.
For more information, see “Short-circuit read (SSR)” on page 427.
4. Generate an IBM Spectrum Scale snapshot.
To create a snapshot, ensure that all POSIX and HDFS application and directory/file accesses are
stopped.
Ensure that IBM Spectrum Scale is active.
If you are using a shared file system via remote mount, execute the snapshot command on the Owning
cluster.
Check whether /gpfs.mnt.dir/gpfs.data.dir is an independent fileset.
Run mmlsfileset <filesystem> -L to check the InodeSpace value. If the InodeSpace is 0, then
this is the root fileset. If the InodeSpace is a unique non-zero number, then this is an independent fileset. An
illustrative example is shown at the end of this step.
If this is an independent fileset, create the snapshot using the following command:

mmcrsnapshot fsname snapshotname -j filesetname

If this is not an independent fileset or if the gpfs.data.dir value is blank, then create a global file
system snapshot using the following command:

mmcrsnapshot fsname snapshotname
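For illustration only, the fileset check and snapshot creation might look like the following; the file system name (gpfs1), fileset name (hadoopfs), snapshot name, and the output shown are assumptions:

$ /usr/lpp/mmfs/bin/mmlsfileset gpfs1 -L
Filesets in file system 'gpfs1':
Name       Id  RootInode  ParentId Created                  InodeSpace MaxInodes AllocInodes Comment
root        0          3        -- Wed Jan 15 10:00:00 2020          0   4000000      500000 root fileset
hadoopfs    1     524291         0 Wed Jan 15 10:05:00 2020          1   1000000      100000

# InodeSpace 1 (a unique non-zero value) indicates an independent fileset, so a fileset snapshot is created:
$ /usr/lpp/mmfs/bin/mmcrsnapshot gpfs1 premigration_snap -j hadoopfs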

5. Stop all the services. Click Ambari > Actions > Stop All.
Note: For an FPO cluster, do not run Stop All from the Ambari GUI. Refer to the General section on how to
properly stop IBM Spectrum Scale.

6. After all the services have stopped, unintegrate the transparency.
Follow the steps in Unintegrating Transparency and ensure that the ambari-server restart
command is run.
Note: Do not start the services.
7. If the IBM Spectrum Scale service is not already stopped, click Ambari > Spectrum Scale > Service
Actions > Stop.
8. On the Ambari server node as root, from the /root/GPFS_Ambari/upgrade_Mpack directory, run
the SpectrumScale_UpgradeIntegrationPackage script with the --preEU option.
The --preEU option saves the existing IBM Spectrum Scale service information into JSON files in the
local directory where the script was run. It also removes the IBM Spectrum Scale service from the
Ambari cluster so that the cluster can be properly migrated. This does not affect the IBM Spectrum
Scale file system.
Before you proceed, review the following questions for the upgrade script and have the information
for your environment handy. If Kerberos is enabled, more inputs are required:

Where the upgradeMpack=mpack2703

[root@c902f10x09 mpack2703]# ./SpectrumScale_UpgradeIntegrationPackage --preEU


Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)? (Default Y):
************************************************************
***STARTING WITH PRE EXPRESS UPGRADE STEPS***
************************************************************
Enter the Ambari server username:(Default admin ):
Enter the password for the Ambari server.
Password:
Retype password:
SSL Enabled (True/False) (Default False):
Enter the Ambari server Port. (Default 8080):
http://c902f10x09.gpfs.net:8080
{
"href" : "http://c902f10x09.gpfs.net:8080/api/v1/clusters",

Service STATEtrue
Successfully completed DELETE call to remove the Spectrum Scale service.

Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
No errors were found.
Ambari database consistency check finished
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.
[root@c902f10x09 upgradeMpack]#

9. As a root user on the Ambari server, run the Mpack uninstaller script,
SpectrumScaleMPackUninstaller.py, from the currently installed Mpack directory, to remove
the existing Mpack link in Ambari.

Where the currently_installed_Mpack=mpack2427

[root@c902f10x09 mpack2427]# ./SpectrumScaleMPackUninstaller.py


INFO: ***Starting the Mpack Uninstaller***

Enter Ambari Server Port Number. If it is not entered, the uninstaller will take default
port 8080 :
INFO: Taking default port 8080 as Ambari Server Port Number.
Enter Ambari Server IP Address : 192.0.2.22
Enter Ambari Server Username, default=admin :
INFO: Taking default username "admin" as Ambari Server Username.
Enter Ambari Server Password :
INFO: Verifying Ambari Server Address, Username and Password.
INFO: Verification Successful.
INFO: Spectrum Scale Service is not added to Ambari.
INFO: Spectrum Scale MPack Exists. Removing the MPack.
INFO: Reverting back Spectrum Scale Changes performed while mpack installation.
INFO: Deleted the Spectrum Scale Link Successfully.
INFO: Removing Spectrum Scale MPack.
INFO: Performing Ambari Server Restart.
INFO: Ambari Server Restart Completed Successfully.
INFO: Spectrum Scale Mpack Removal Successfully Completed.
[root@c902f10x09 mpack2420]#

10. Start all services. Click Ambari > Actions > Start All.
Wait for all the services to start. At this stage, native HDFS is used.
Check to ensure that the HDFS Transparency is not active, by executing the following commands:
• On NameNodes: ps -eaf | grep namenode | grep -v mmfs
• On DataNodes: ps -eaf | grep datanode | grep -v mmfs
Now, HDP is in the native HDFS mode.
11. To upgrade from HDP 2.6.4 to HDP 3.1, refer to the Hortonworks migration guide for the following
procedures depending on the specific architecture:
Upgrading to HDP 3.1 on Power
Upgrading to HDP 3.1 for x86_64
Note:
• When migrating to HDP in an x86 environment, ensure that the procedure given in the Switch from
IBM Open JDK to Oracle JDK section is completed.
• Ensure that you properly follow the Hortonworks HDP migration guide. Some steps that need extra
attention:
– In “Preparing to Upgrade Ambari”:
- Put the ambari-metrics into maintenance mode.
- Make a safe copy of the Ambari server configuration file (/etc/ambari-server/conf/
ambari.properties.3)
– Ensure that all the services are up and active, all the critical alerts are resolved, and all the
service checks have passed before performing the express upgrade from HDP 2.6.4.
– In “Upgrade Ambari":
If you are using the default Postgres database for Ambari server, you need to upgrade Postgres to
a supported version. For more information, see Hortonworks documentation.
Back up your existing Ambari database before upgrading the Ambari server database. For
example, HDP 3.1.0 requires Postgres 9.6 or 10.2.
Postgres must be upgraded before the Ambari server is upgraded.
On Power systems, Postgres has dependencies on the advance-toolchain-at*-runtime,
advance-toolchain-at*-devel, and advance-toolchain-at*-perf packages. Install
the advance-toolchain before upgrading Postgres.
Note: Remove the current Postgres version and re-install with a new Postgres version on the
Power systems to avoid the following error:
"Checking cluster versions /usr/bin/pg_ctl-orig: relocation error: /opt/
at10.0/lib64/power8/libpthread.so.0: symbol __libc_vfork, version
GLIBC_PRIVATE not defined in file libc.so.6 with link time reference
could not get pg_ctl version data using “/usr/bin/pg_ctl” --version: No
such file or directory."
This is due to dependencies issues with the advance-toolchain on Power systems.
– In “Post-upgrade Tasks”:
Hive Post-upgrade Tasks: If the data directory resides in IBM Spectrum Scale, then the Hive
directory changes would need to be run when the IBM Spectrum Scale service is re-integrated.

12. After Ambari and HDP are upgraded, ensure that you stop all the services before you proceed to
re-deploy the IBM Spectrum Scale service.
Click Ambari > Actions > Stop All.
Wait until all services have stopped. Ensure that the native HDFS has stopped running.
13. HDP 3.1.x supports HDFS Transparency version 3.1.0-x and later. Only HDFS Transparency 3.1.0
stream is supported by HDP.
Add the HDFS Transparency version as stated in Table 35 on page 388 into the GPFS repo directory.
Ensure that the older HDFS Transparency version is removed from the repo directory because only
one HDFS Transparency rpm can reside in the GPFS repo directory.
Run "createrepo . " to update the repo metadata.
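A minimal sketch of this repository update follows; the repository path mirrors the example in the “Upgrading HDFS Transparency” on page 385 section, and the rpm file names are assumptions:

$ cd /var/www/html/repos/GPFS/<Scale_version>/gpfs_rpms
$ rm -f gpfs.hdfs-protocol-2.7.*.rpm          # remove the old HDFS Transparency rpm
$ cp /tmp/gpfs.hdfs-protocol-3.1.0-x.<arch>.rpm .
$ createrepo .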
14. Add the IBM Spectrum Scale service.
On the Ambari server node as root, from the /root/GPFS_Ambari/upgrade_Mpack directory,
run the SpectrumScale_UpgradeIntegrationPackage script with the --postEU option in the
directory where the --preEU step was run and where the JSON configurations were stored.
Before you proceed, for the --postEU option, review the following questions, and have the
information for your environment handy. If Kerberos is enabled, more inputs are required.

Where the upgradeMpack=mpack2703

[root@c902f10x09 mpack2703]# ./SpectrumScale_UpgradeIntegrationPackage --postEU


Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)? (Default Y):
*************************************************************
***STARTING WITH SPECTRUM SCALE EXPRESS UPGRADE POST STEPS***
*************************************************************
Starting Post Express Upgrade Steps. Enter Credentials
Enter the Ambari server User:(Default admin ):
Enter the password for the Ambari server.
Password:
Retype password:
SSL Enabled (True/False) (Default False):
Enter the Ambari server Port. (Default 8080):
....
# Accept License
Do you agree to the above license terms? [yes or no]
yes
Installing...
Enter Ambari Server Port Number. If it is not entered, the installer will take default port
8080:
INFO: Taking default port 8080 as Ambari Server Port Number.
Enter Ambari Server IP Address : 192.0.2.22
Enter Ambari Server Username, default=admin :
INFO: Taking default username "admin" as Ambari Server Username.
Enter Ambari Server Password :
...
Enter kdc principal:
Enter kdc password:
...
From the Ambari GUI, check the IBM Spectrum Scale installation progress through the
background
operations panel.
Enter Y only when installation of the Spectrum Scale service using REST call process is
completed.
(Default N)Y ** SEE NOTE BELOW **
Waiting for the Spectrum Scale service to be completely installed.
...
Waiting for server start....................
Ambari Server 'start' completed successfully.
*************************************************************
Upgrade of the Spectrum Scale Service completed successfully.
*************************************************************
********************************************************************************************
***
IMPORTANT: You need to ensure that the HDFS Transparency package, gpfs.hdfs-protocol-3.0.x,
is updated in the Spectrum Scale repository. Then follow the "Upgrade Transparency" service
action in the Spectrum Scale service UI panel to propagate the package to all the GPFS
Nodes.
After that is completed, invoke the "Start All" services in Ambari.
********************************************************************************************
***

15. Update the HDFS Transparency package on all the GPFS nodes.
HDP 3.1.x requires HDFS Transparency version 3.1.0-x. Update the HDFS Transparency package
before you start any services.
Ensure that the HDFS Transparency package, gpfs.hdfs-protocol-3.1.0.X, is updated in the
IBM Spectrum Scale repository as stated in Step 11.
From Ambari GUI, go to Upgrade Transparency service action in the Spectrum Scale service UI
window to propagate the new package to all the GPFS Nodes. For more information, see Upgrading
Transparency.
Ensure that all the GPFS nodes have HDFS Transparency upgraded to the correct version
by running the following command:

mmdsh -N all "rpm -qa | grep gpfs.hdfs-protocol"

16. Start all services.


Click Ambari > Actions > Start All.
Restart all components by using the Restart icon.
17. If short circuit was disabled earlier and needs to be enabled, enable it now.
See “Short-circuit read (SSR)” on page 427.
Note:
• If the IBM Spectrum Scale service is restarted by using the restart icon, the HDFS service also
needs to be restarted.
• The NameNode Last Checkpoint alert, NameNode Blocks Health alert and NameNode
HDFS Pending Deletion Blocks alert can be ignored and disabled.
• If the HBase master failed to start with FileAlreadyExistsException error, restart HDFS and
then restart the HBase master.

HDP to CDP migration


For enterprises that want to migrate from their Hadoop cluster to CDP Private Cloud Base with IBM
Spectrum Scale, contact IBM account team/Lab Based Services (LBS) or Cloudera professional services
(PS) to help determine the correct approach for your environment.
For Hadoop clusters with IBM Spectrum Scale FPO environment, you must use the side-by-side migration
to migrate to CDP Private Cloud Base with IBM Spectrum Scale.

Uninstalling IBM Spectrum Scale Mpack and service


Before you upgrade Mpack on an HDP cluster, see “Preparing the environment” on page 349 to check
whether the new Mpack upgrade supports the HDP version installed on the cluster.
This topic lists the steps to uninstall IBM Spectrum Scale Mpack and service.
1. Follow the steps in “Unintegrating HDFS Transparency” on page 433 and ensure that the ambari-
server restart is run.
2. Uninstall the management pack and the IBM Spectrum Scale service.
On the Ambari server host, as root, execute the following script:

$ python SpectrumScaleMPackUninstaller.py

Follow the script and enter the required details.

The SpectrumScaleMPackUninstaller.py script is in the download package along with the
Mpack and license bin executables.
The SpectrumScaleMPackUninstaller.py uninstalls the IBM Spectrum Scale Mpack and the
IBM Spectrum Scale service from Ambari. It verifies the settings before it uninstalls the Mpack and
IBM Spectrum Scale service. If the services are still running, and if the HDFS Transparency is not
unintegrated, the SpectrumScaleMPackUninstaller.py script will exit and request user action.
If you face a Kerberos credential error when you are uninstalling the Mpack, see When you are
uninstalling the Mpack, the Mpack uninstaller script might throw a Kerberos credential error even when
the correct credentials were provided.
Note: Running this script does not remove the IBM Spectrum Scale packages and disks. The IBM
Spectrum Scale file system is preserved as is. For the FPO cluster created through Ambari, the mounted
local disks /opt/mapred/local* and entries in /etc/fstab are preserved as is.

Configuration

Setting up High Availability [HA]


This section contains information on how to set up high availability to protect against planned and
unplanned events.

HDFS NameNode High Availability [HA]


This process sets up a standby NameNode configuration so that failover can happen automatically.
In order to set up HA, IBM Spectrum Scale HDFS Transparency has to be unintegrated and reverted back to
native HDFS mode.
Follow these steps to configure the High Availability option when the IBM Spectrum Scale™ service is integrated:
1. Log in to the Ambari GUI.
2. If the Ambari GUI has IBM Spectrum Scale service deployed, and HDFS Transparency is integrated,
follow the steps to “Unintegrating HDFS Transparency” on page 433.
Ensure that you run the ambari-server restart.
3. Verify that the IBM Spectrum Scale HDFS Transparency integration state is in unintegrated state. For
more information, see Verifying Transparency integration state.
4. From the Ambari dashboard, click the HDFS service.
Select Actions > Enable NameNode HA and follow the steps.
Note: NameNode needs to be deployed onto a GPFS™ Node.
5. If the Ambari GUI has IBM Spectrum Scale service deployed, follow the steps under “Integrating HDFS
Transparency” on page 431 to integrate HDFS Transparency.
Ensure that you run the ambari-server restart.
For more information on setting up the NameNode HA, see Hortonworks Configuring NameNode HA.

Yarn Resource Manager HA


This process sets up a Resource Manager configuration so that failover can happen automatically.
This topic describes how to enable ResourceManager High Availability if the IBM Spectrum Scale service
is already integrated with HDP.
1. Turn on Maintenance Mode on IBM Spectrum Scale.
2. Click Ambari Spectrum Scale service > Actions > Turn On Maintenance Mode. This will ensure that
IBM Spectrum Scale is not stopped while Resource Manager HA is enabled in the next step.



3. Enable Resource Manager HA as per Configuring ResourceManager High Availability.
4. Turn off Maintenance Mode on IBM Spectrum Scale service.
5. Restart HDFS service.

IBM Spectrum Scale configuration parameter checklist


The IBM Spectrum Scale checklist shows the parameters that affect the system on the Standard and
Advanced tabs of the Ambari wizard.
The important IBM Spectrum Scale parameters are as follows:

Standard tab | Rule | Advanced tab | Rule
Cluster Name | | Advanced core-site: fs.defaultFS | Ensure that hdfs://localhost:8020 is used.
File system Name | | Advanced gpfs-advance: gpfs.quorum.nodes | The node number must be odd. Note: In the SAN storage deployment model with a tiebreaker disk, quorum nodes can be even.
File system Mount Point | | |
NSD stanza file | For more information, see "Preparing for FPO environment" on page 357. | |
Policy file | For more information, see Policy File. | |
Hadoop local cache disk stanza file | For more information, see "Deploy HDP or IBM Spectrum Scale service on pre-existing IBM Spectrum Scale file system" on page 404. | |
Default Metadata Replicas | <= Max Metadata Replicas | |
Default Data Replicas | <= Max Data Replicas | |
Max Metadata Replicas | | |
Max Data Replicas | | |

HDFS and IBM Spectrum Scale file system ACL support


HDFS and IBM Spectrum Scale HDFS Transparency support only POSIX ACLs.
Ensure that you check the ACL value on the IBM Spectrum Scale file system and set it to all.
Otherwise, Hadoop commands fail with an Operation not supported error, although POSIX commands
can still be executed successfully.
To check the ACL value, run:

# /usr/lpp/mmfs/bin/mmlsfs <filesystem> -k



flag value description
----- ----------- ----------------------------
-k nfs4 ACL semantics in effect

To change the ACL value to all, run:

# /usr/lpp/mmfs/bin/mmchfs <filesystem> -k all

To check the ACL value after modification, run:

# /usr/lpp/mmfs/bin/mmlsfs <filesystem> -k

flag value description
----- ----------- ----------------------------
-k all ACL semantics in effect

Dual-network deployment
This section describes the recommended network configuration for Ambari to manage the Hadoop
and IBM Spectrum Scale cluster if more than one network is configured in the environment. It is
recommended to map out how Ambari and IBM Spectrum Scale network configurations are able to
communicate before deploying HDP or IBM Spectrum Scale in your environment.
For a Hadoop distribution like HDP®, all Hadoop components are managed by Ambari™. IBM Spectrum
Scale has GPFS Daemon node name (daemon network) and GPFS Admin node name (admin network)
adapter interface names. For more information, see the GPFS node adapter interface names topic in the
IBM Storage Scale: Concepts, Planning, and Installation Guide.
For example:

[root@c902f09x05 ~]# mmlscluster


GPFS cluster information
========================
GPFS cluster name: SS5022.gpfs.net
…………….
…………….
Node Daemon node name IP address Admin node name Designation
-------------------------------------------------------------------------------
1 c902f09x05-eth4.gpfs.net 128.20.1.26 c902f09x05.gpfs.net quorum
2 c902f09x07-eth4.gpfs.net 128.20.1.28 c902f09x07.gpfs.net quorum
3 c902f09x08-eth4.gpfs.net 128.20.1.29 c902f09x08.gpfs.net quorum
4 c902f09x06-eth4.gpfs.net 128.20.1.27 c902f09x06.gpfs.net

In the above example, Daemon node name and IP address are the daemon network used for data traffic
in IBM Spectrum Scale. The Admin node name is the admin network used for IBM Spectrum Scale
commands (such as mmlscluster and mmgetstate). The IBM Spectrum Scale Admin node name and
Daemon node name can be changed with the mmchnode command.
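For illustration only, the following commands sketch how these settings are typically applied. The host names and subnet are placeholders based on the example cluster above; verify the options against the mmchnode and mmchconfig documentation before use.

# Give a node different admin and daemon interfaces
# (GPFS must be down on the node while its daemon interface is changed)
/usr/lpp/mmfs/bin/mmshutdown -N c902f09x05.gpfs.net
/usr/lpp/mmfs/bin/mmchnode --admin-interface=c902f09x05.gpfs.net --daemon-interface=c902f09x05-eth4.gpfs.net -N c902f09x05.gpfs.net
/usr/lpp/mmfs/bin/mmstartup -N c902f09x05.gpfs.net

# Alternatively, keep one host name and route IBM Spectrum Scale data traffic over the
# second network by listing its subnet (example subnet shown)
/usr/lpp/mmfs/bin/mmchconfig subnets=128.20.1.0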
In a dual network environment, there are two networks: Network 1 and Network 2.
The following are the recommended network setup configuration options for the local IBM Spectrum
Scale cluster:
1. IBM Spectrum Scale cluster using the same hostname for both Admin node name and Daemon node
name.
Configure Admin node name and Daemon node name on Network 1 so that IBM Spectrum Scale
service can be managed from the Ambari GUI. It is also possible to configure Network 2 as the subnets
for IBM Spectrum Scale so that all the IBM Spectrum Scale data traffic will go over Network 2. For
setting up the cluster, see IBM Spectrum Scale support for Hadoop > Dual network interfaces.
2. IBM Spectrum Scale cluster using different hostnames for both Admin node name and Daemon node
name.



a. Define the IBM Spectrum Scale Admin node name (admin network) on Network 1 and the Daemon
node name (daemon network) on Network 2 for IBM Spectrum Scale cluster that uses different
admin and daemon hosts.
or
b. Define the IBM Spectrum Scale Admin node name (admin network) on Network 2 and the Daemon
node name (daemon network) on Network 1 for IBM Spectrum Scale cluster that uses different
admin and daemon hosts.

Figure 40. Example of network setup configuration

The above figure shows the 2(a) network setup configuration option, where Ambari and the IBM
Spectrum Scale admin network are on Network 1 and the IBM Spectrum Scale daemon network is on Network 2.
When the network is set up this way, the IBM Spectrum Scale service, once deployed, can
recognize the network connection between Ambari and the admin or daemon network,
so that it can communicate with the IBM Spectrum Scale cluster and manage it properly.
The separation of IBM Spectrum Scale admin and daemon networks offers the following benefits:
• IBM Spectrum Scale uses the admin network for running cluster-wide administrative commands (for
example, mmgetstate -a). When the cluster is busy with heavy I/O, using the same network for admin
and daemon traffic can cause administrative commands to run slower.
• The IBM Spectrum Scale admin network requires ssh to be open; the daemon network does not require
ssh. ssh access can therefore be restricted to a single selected network to address security concerns.
Note:
• IBM Spectrum Scale service configures HDFS Transparency to use the same network interface as
Ambari and native HDFS.
• For FPO configuration, the FPO file system will be created, if not pre-existing, during the IBM Spectrum
Scale service deployment.
• IBM Spectrum Scale cluster-wide administrative commands are run via SSH over the Admin interface.
Hence, the GPFS master (which is also the Ambari server node) must have root password-less
SSH to all the GPFS Nodes in the cluster. If the IBM Spectrum Scale cluster is non-root (that is, using the
sudo wrapper), password-less ssh is required for the non-root user. For more information, see "Restricting
root access" on page 450. The Daemon interface is used for GPFS data traffic over RPC. For more
information, see "Password-less ssh access" on page 53.

Manually starting services in Ambari


If you do not do a Start All and plan to start each service individually, the following sequence must be
followed:
1. IBM Spectrum Scale service



2. Zookeeper, if HA is configured
3. HDFS
4. Yarn
5. Mapreduce2
Then other services can be started.

Setting up local repository


Mirror repository server
IBM Storage Scale requires a local repository. Therefore, select a server to act as the mirror repository
server. This server requires the installation of the Apache HTTP server or a similar HTTP server.
Every node in the Hadoop cluster must be able to access this repository server. This mirror server can be
defined in the DNS, or you can add an entry for the mirror server in /etc/hosts on each node of the
cluster.
• Create an HTTP server on the mirror repository server, such as Apache httpd. If the Apache httpd is not
already installed, install it with the yum install httpd command. You can start the Apache httpd by
running one of the following commands:
– apachectl start
– service httpd start
• [Optional]: Ensure that the http server starts automatically on reboot by running the following
command:
– chkconfig httpd on
• Ensure that the firewall settings allow inbound HTTP access from the cluster nodes to the mirror web
server.
• On the mirror repository server, create a directory for your repositories, such as <document root>/
repos. For Apache httpd with document root /var/www/html, type the following command:
– mkdir -p /var/www/html/repos
• Test your local repository by browsing the web directory:
– http://<yum-server>/repos
For example:

# rpm -qa | grep httpd


# service httpd start
# service httpd status
Active: active (running)   <== Check to ensure that the httpd service is active
# systemctl enable httpd
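For example, on a RHEL 7 system that uses firewalld, inbound HTTP access can be opened as follows. This is a sketch only; adjust the zone and service names to your environment:

# firewall-cmd --permanent --add-service=http
# firewall-cmd --reload
# curl http://<yum-server>/repos/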

Local OS repository
You must create the operating system repository because some of the IBM Storage Scale files, such as
the rpms, have dependencies that must be resolvable on all nodes.
1. Create the repository path:

mkdir /var/www/html/repos/<rhel_OSlevel>

2. Synchronize the local directory with the current yum repository:

cd /var/www/html/repos/<rhel_OSlevel>

Note: Before going to the next step, ensure that you have registered your system. For instructions
to register a system, refer to Get Started with Red Hat Subscription Manager. Once the server is
subscribed, run the following command: subscription-manager repos --enable=<repo_id>



3. Run the following command:

reposync --gpgcheck -l --repoid=rhel-7-server-rpms --download_path=/var/www/html/repos/<rhel_OSlevel>

4. Create a repository for this node:

createrepo -v /var/www/html/repos/<rhel_OSlevel>

5. Ensure that all the firewalls are disabled or that you have the httpd service port open, because yum
uses http to get the packages from the repository.
6. On all nodes in the cluster that require the repositories, create a file in /etc/yum.repos.d called
local_<rhel_OSlevel>.repo.
7. Copy this file to all nodes. The contents of this file must look like the following:

[local_rhel_version]
name=local_rhel_version
enabled=1
baseurl=http://<internal IP that all nodes can reach>/repos/<rhel_OSlevel>
gpgcheck=0

8. Run the yum repolist and yum install rpms without external connections.
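For example, the following commands (the package name is a placeholder) confirm from a cluster node that the local repository is usable without external connections:

# yum clean all
# yum repolist
# yum install <package-name>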

Local IBM Spectrum Scale repository


The following list of rpm packages for IBM Spectrum Scale 4.1.1 and later can help verify the edition of
IBM Spectrum Scale:

IBM Spectrum Scale rpm package list


Edition | Package list / description
Data Management (available from version 4.2.3 and later) | This edition provides identical functionality as IBM Spectrum Scale Advanced Edition under capacity-based licensing. For more information, see the Capacity-based licensing topic in the IBM Storage Scale: Concepts, Planning, and Installation Guide.
Standard Edition | <Express Edition rpm list> + gpfs.ext
Advanced Edition | <Standard Edition rpm list> + gpfs.crypto. For the IBM Spectrum Scale 4.2 release and later, add gpfs.adv to the list above.

Note: From IBM Spectrum Scale 5.0.2, the gpfs.ext package is not available.
The following example uses IBM Spectrum Scale 5.0.1.1 version.
1. On the repository web server, create a directory for your IBM Spectrum Scale repos, such as
<document root>/repos/GPFS. For Apache httpd with document root /var/www/html, type the
following command:

mkdir -p /var/www/html/repos/GPFS/5.0.1.1

2. Obtain the IBM Spectrum Scale software. If you have already installed IBM Spectrum Scale manually,
skip this step. Download the IBM Spectrum Scale package. In this example, IBM Spectrum Scale
5.0.1.1 is downloaded from Fix Central, the package is unzipped, and the installer is extracted.
For example, as root or a user with sudo privileges, run the installer to get the IBM Spectrum Scale
packages into a user-specified directory via the --dir option:

chmod +x Spectrum_Scale_Advanced-5.0.1.1-x86_64-Linux-install
./Spectrum_Scale_Advanced-5.0.1.1-x86_64-Linux-install --silent --dir /var/www/html/repos/GPFS/5.0.1.1



Note: The --silent option is used to accept the software license agreement, and the --dir
option places the IBM Spectrum Scale rpms into the directory /var/www/html/repos/GPFS/
5.0.1.1/gpfs_rpms. Without specifying the --dir option, the default location is /usr/lpp/mmfs/
5.0.1.1/gpfs_rpms.
3. If the packages are extracted into the IBM Spectrum Scale default directory, /usr/lpp/mmfs/
5.0.1.1/gpfs_rpms, copy all the IBM Spectrum Scale files that are required for your installation
environment into the IBM Spectrum Scale repository path:

cd /usr/lpp/mmfs/5.0.1.1/gpfs_rpms

cp -R * /var/www/html/repos/GPFS/5.0.1.1/gpfs_rpms

4. The following packages will not be installed by Ambari:


• gpfs.crypto
• gpfs.gui
• gpfs.scalemgmt
• gpfs.tct
Ambari requires only the following packages:
• gpfs.base
• gpfs.gpl
• gpfs.docs
• gpfs.gskit
• gpfs.msg.en_US
• gpfs.ext (Not available from IBM Spectrum Scale 5.0.2 version)
• gpfs.crypto (if Advanced edition is used)
• gpfs.adv (if IBM Spectrum Scale 4.2 Advanced edition is used)
The IBM Spectrum Scale repo will not install the protocol and transparent cloud tier (gpfs.tct)
packages when installing through Ambari.
5. Copy the HDFS Transparency package to the IBM Spectrum Scale repo path.
Note: The repo must contain only one HDFS Transparency package. Remove all old transparency
packages.

cp gpfs.hdfs-protocol-3.0.0-(version) /var/www/html/repos/GPFS/5.0.1.1/gpfs_rpms

6. Check for IBM Spectrum Scale packages in the /root/ directory. If any packages exist, relocate
them to a subdirectory. There are known issues with IBM Spectrum Scale packages in /root that
cause the Ambari installation to fail.
7. Create the yum repository:

# cd /var/www/html/repos/GPFS/5.0.1.1/gpfs_rpms
# createrepo .

8. Access the repository at http://<yum-server>/repos/GPFS/5.0.1.1/gpfs_rpms.
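If you want to verify this repository manually from a cluster node with yum, a repo file similar to the local OS repository file can be used. The file name below is an example only:

# cat /etc/yum.repos.d/local_gpfs.repo
[local_gpfs_rpms]
name=local_gpfs_rpms
enabled=1
baseurl=http://<yum-server>/repos/GPFS/5.0.1.1/gpfs_rpms
gpgcheck=0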

MySQL community edition repository


If you are using the new database option for the Hive Metastore through HDP, HDP creates MySQL
community edition repositories on the Hive Metastore host, which requires internet access to
download the packages.
On a host with internet access, use the repo information to obtain a local copy of the packages to create a
local repository.



If you have a local MySQL repo, create the mysql-community.repo file to point to the local repo on the
Hive Metastore host.
For example:

[mysql56-community]
name=MySQL 5.6 Community Server
baseurl=http://<REPO_HOST>/repos/MySQL_community
enabled=1
gpgcheck=0

Only the following MySQL packages are required for HDP:

mysql-community-libs
mysql-community-common
mysql-community-client
mysql-community-server
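As a sketch of how a local copy of these packages can be built on the host with internet access (the destination directory is an example, yumdownloader is provided by the yum-utils package, and the MySQL community repository must be reachable from that host):

# yum install yum-utils createrepo
# mkdir -p /var/www/html/repos/MySQL_community
# yumdownloader --resolve --destdir=/var/www/html/repos/MySQL_community \
    mysql-community-libs mysql-community-common mysql-community-client mysql-community-server
# createrepo /var/www/html/repos/MySQL_community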

HDP creates the following MySQL community repos:


HDP Power: Creates 1 repo file
mysql-community.repo:

# Enable to use MySQL 5.6


[mysql56-community]
name=MySQL 5.6 Community Server
baseurl=http://s3.amazonaws.com/dev.hortonworks.com/HDP-UTILS-1.1.0.21/repos/mysql-ppc64le/
enabled=1
gpgcheck=0
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

HDP x86: Creates 2 repo files


mysql-community.repo:

[mysql-connectors-community]
name=MySQL Connectors Community
baseurl=http://repo.mysql.com/yum/mysql-connectors-community/el/7/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

[mysql-tools-community]
name=MySQL Tools Community
baseurl=http://repo.mysql.com/yum/mysql-tools-community/el/7/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

# Enable to use MySQL 5.5


[mysql55-community]
name=MySQL 5.5 Community Server
baseurl=http://repo.mysql.com/yum/mysql-5.5-community/el/7/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

# Enable to use MySQL 5.6


[mysql56-community]
name=MySQL 5.6 Community Server
baseurl=http://repo.mysql.com/yum/mysql-5.6-community/el/7/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

# Note: MySQL 5.7 is currently in development. For use at your own risk.
# Please read with sub pages: https://dev.mysql.com/doc/relnotes/mysql/5.7/en/
[mysql57-community-dmr]
name=MySQL 5.7 Community Server Development Milestone Release
baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/7/$basearch/
enabled=0



gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

mysql-community-source.repo:

[mysql-connectors-community-source]
name=MySQL Connectors Community - Source
baseurl=http://repo.mysql.com/yum/mysql-connectors-community/el/7/SRPMS
enabled=0
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

[mysql-tools-community-source]
name=MySQL Tools Community - Source
baseurl=http://repo.mysql.com/yum/mysql-tools-community/el/7/SRPMS
enabled=0
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

[mysql55-community-source]
name=MySQL 5.5 Community Server - Source
baseurl=http://repo.mysql.com/yum/mysql-5.5-community/el/7/SRPMS
enabled=0
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

[mysql56-community-source]
name=MySQL 5.6 Community Server - Source
baseurl=http://repo.mysql.com/yum/mysql-5.6-community/el/7/SRPMS
enabled=0
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

[mysql57-community-dmr-source]
name=MySQL 5.7 Community Server Development Milestone Release - Source
baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/7/SRPMS
enabled=0
gpgcheck=1
gpgkey=file:/etc/pki/rpm-gpg/RPM-GPG-KEY-mysql

Configuring LogSearch
To set up LogSearch when IBM Spectrum Scale is integrated in an HDP environment, configuration
changes are required to point to the correct NameNode, DataNode, and ZKFC logs associated with IBM
Spectrum Scale.
1. Click Ambari GUI > Log Search service > Quick Links > Log Search UI.
2. Log in to the Log Search UI and click the button in the top right corner. Choose the Configuration Editor
option.
3. In the Configuration Editor page, change the log path for the following components:
a. hdfs_datanode
b. hdfs_namenode
c. hdfs_zkfc
Example (existing) entry for NameNode:

"type": "hdfs_namenode",
"rowtype": "service",
"path": "/var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log"

change to:

"type": "hdfs_namenode",
"rowtype": "service",
"path": "/var/log/hadoop/root/hadoop-root-namenode-*.log"

Make similar changes for DataNode and zkfc.


4. Restart HDFS service.



Hadoop Kafka/Zookeeper and IBM Spectrum Scale Kafka/Zookeeper
IBM Spectrum Scale has a new audit logging capability, File Audit Logging (FAL), that uses its own libraries
for Kafka and Zookeeper. IBM Spectrum Scale FAL installs the Kafka and Zookeeper packages that are
shipped with IBM Spectrum Scale.
To use both the IBM Spectrum Scale Kafka and Zookeeper and the Hadoop Kafka and Zookeeper from HDP, you
must install Kafka and Zookeeper on different nodes for IBM Spectrum Scale and Hadoop.
In a new environment, IBM Spectrum Scale FPO is installed by the IBM Spectrum Scale Ambari Mpack.
The packages for Kafka and zookeeper can be installed when HDP is installed or after the Mpack is
installed. If it is required to install IBM Spectrum Scale audit logging, follow the IBM Spectrum Scale
documentation to install the Kafka and Zookeeper on different nodes other than the ones installed for the
HDP Hadoop cluster.
When you are installing HDP, if IBM Spectrum Scale is already installed with audit logging, check where
the zookeeper and Kafka are to be installed. Ensure that the HDP zookeeper and Kafka are on nodes that
are not where the IBM Spectrum Scale Kafka and zookeeper reside. If needed, change to different hosts
in Ambari.
For example, if node1/node2/node3 are configured as IBM Spectrum Scale FAL message queue nodes, do
not install HDP or community versions of Kafka/Zookeeper onto node1/node2/node3.
Note: When you stop IBM Spectrum Scale FAL message queue service, all Kafka and Zookeeper daemons
on the nodes are stopped. This includes any HDP or community Kafka/Zookeeper daemons running on the
same nodes.

Create Hadoop local directories in IBM Spectrum Scale


If you did not create local disks to host the Yarn and Mapreduce local temporary files, and want to use the
existing IBM Spectrum Scale file system, then you need to create directories under IBM Spectrum Scale
file system for each node.
Note: Performance might be impacted if the local disks are not used.
Recommendation: Create two partitions, one for local directories and one for IBM Spectrum Scale.
If you need to use the IBM Spectrum Scale file system, then create a fileset within IBM Spectrum Scale to
host the local directories.
This section details the steps for creating a fileset within IBM Spectrum Scale to host the local directories:
1. Ensure that IBM Spectrum Scale is active and mounted.
Start the IBM Spectrum Scale cluster on the console of any one node in the IBM Spectrum Scale
cluster, by running the following command:

/usr/lpp/mmfs/bin/mmstartup -a

Mount the file system over all nodes by running the following command:

/usr/lpp/mmfs/bin/mmmount <fs-name> -a

2. Create a fileset in IBM Spectrum Scale and set a policy to use only one replica:

# Create a GPFS fileset


$ mkdir /bigpfs/hadoop
$ export PATH=$PATH:/usr/lpp/mmfs/bin
$ mmcrfileset bigpfs local
$ mmlinkfileset bigpfs local -J /bigpfs/hadoop/local

# Create policy file


$ vi hadoop.policy
rule 'fset_local' set pool 'datapool' REPLICATE (1,1) FOR FILESET (local)
rule 'default' set pool 'datapool'
$ mmchpolicy bigpfs hadoop.policy

# Verify fileset
$ cd /bigpfs/hadoop/local



$ dd if=/dev/zero of=log bs=1M count=100
$ mmlsattr -d -L log
# Verify the output for data replication is 1

3. Create local directories for each host using the host name as the directory name for simplicity, and
then change the permission. Run the following command to create the local directory from one of the
IBM Spectrum Scale node.

for host in <your host name list>


do
echo "$host"
mkdir -p /bigpfs/hadoop/local/$host
done
chown -R yarn:hadoop /bigpfs/hadoop/local
chmod -R 755 /bigpfs/hadoop/local/*

4. For each node, link a local directory to its corresponding node directory named with its host name:

for host in <your host name list>


do
echo "$host"
mmdsh -N $host "ln -s /bigpfs/hadoop/local/$host /hadoop/yarn/local"
mmdsh -N $host "chown -R yarn:hadoop /hadoop/yarn/local"

// If additional user created directories are configured under Yarn in the shared storage,
// then ensure to create the corresponding user created directories in the local host and
// link them to the share storage directories.
// For example: yarn.nodemanager.log-dirs is set to /hadoop/yarn/log
// mmdsh -N $host "mkdir -p /hadoop/yarn/local/log"
// mmdsh -N $host "chown -R yarn:hadoop /hadoop/yarn/log"
done

5. After installation, click Ambari GUI > Yarn > CONFIGS to set the Yarn’s
configuration yarn.nodemanager.local-dirs as /hadoop/yarn/local.

Deploy HDP or IBM Spectrum Scale service on pre-existing IBM Spectrum


Scale file system
If you have one of the following configurations, you can deploy HDP with the IBM Spectrum Scale service:
1. Pre-existing IBM Spectrum Scale cluster in FPO configuration.
2. Pre-existing IBM Spectrum Scale cluster with remote mount file system configuration.
3. Pre-existing IBM Spectrum Scale cluster in which the GPFS client nodes belongs to an ESS-based IBM
Spectrum Scale cluster.
The steps for deployment are as follows:
1. A quorum node on the pre-existing cluster must be selected as the IBM Spectrum Scale Master node.
2. Ensure that IBM Spectrum Scale is active and mounted.

[root@c902f09x13 ~]# mmgetstate -a

Node number Node name GPFS state


-------------------------------------------
1 c902f09x13 active
2 c902f09x16 active
3 c902f09x14 active
4 c902f09x15 active
[root@c902f09x13 ~]# mmlsmount bigpfs -L

File system bigpfs is mounted on 4 nodes:


192.0.2.0 c902f09x13
192.0.2.1 c902f09x14
192.0.2.2 c902f09x15
192.0.2.3 c902f09x16
[root@c902f09x13 ~]#



Note: Ensure that the FPO or the local Hadoop IBM Spectrum Scale cluster is set to automount on
reboot by running the following command:

/usr/lpp/mmfs/bin/mmchfs <filesystem name> -A yes

3. Follow “Create HDP cluster” on page 360.


4. Follow “Install Mpack package” on page 366.
5. Follow the “Deploy the IBM Spectrum Scale service” on page 367, with the following deviations:
• If you have not started the IBM Spectrum Scale cluster on the Ambari Assign Slaves and Clients
page, click the Previous button to go back to Assign Master page in Ambari. Then start the IBM
Spectrum Scale cluster, and mount the file system onto all the nodes. Go back to the Ambari GUI to
continue to the Assign Slaves and Client page.
• Verify that the gpfs.storage.type is set to
– local for FPO
– shared for Single cluster with all Hadoop nodes as IBM Spectrum Scale nodes
– remote for Remote mount with all Hadoop nodes as IBM Spectrum Scale nodes
• Verify the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs values are set
to an available mounted local partitioned directories that already exist in your file system.
For example:
Mounted local partitioned directories - /opt/mapred/local<NUM>
yarn.nodemanager.local-dirs=/opt/mapred/local1/yarn, /opt/mapred/local2/
yarn, /opt/mapred/local3/yarn
yarn.nodemanager.log-dirs=/opt/mapred/local1/yarn/logs, /opt/mapred/local2/
yarn/logs, /opt/mapred/local3/yarn/logs
• Do not set the GPFS NSD stanza file field.
For FPO, the IBM Spectrum Scale NSD stanza file is not required because the file system already
exists. Because Ambari does not allow a blank value, leave the default value of IBM Spectrum Scale
NSD stanza file.
Note: If you accidentally place a value in the GPFS NSD stanza file field which was originally blank,
and then try to remove it, you must leave in a “blank” character for Ambari to proceed.
• Single Scale cluster configuration
For gpfs.storage.type=shared the local cluster hosts with GPFS components (GPFS_Master or
GPFS_Node) selected in the UI, are added on to the ESS/Shared IBM Spectrum Scale cluster.
– Setting gpfs.storage.type=shared for Shared storage means this will create a single scale
cluster configuration.
– Setting gpfs.storage.type=shared for ESS and creating the /var/lib/ambari-server/
resources/shared_gpfs_node.cfg file on the Ambari server will create a single scale cluster
configuration. The file must contain only one FQDN of a node in the shared management host
cluster, and password-less SSH must be configured from the Ambari server to this node. Ambari
uses this one node to join the GNR/ESS cluster. Ensure that the file has at least 444 permission.
(A minimal example of creating this file is shown at the end of this list.)
– [Optional] To create local cache disks, see Deploy the IBM Spectrum Scale service>Customize
Services>Create Hadoop local cache disks section.
Note: If you are not using shared storage, you do not need this configuration, and you can leave
this local cache disk parameter unchanged in the Ambari GUI.
- Verify the following fields have the correct information that match your preinstalled IBM
Spectrum Scale file system (GPFS) cluster.
• GPFS cluster name
• GPFS quorum nodes



• GPFS File System Name
• gpfs.mnt.dir
• gpfs.storage.type
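The following is a minimal sketch of creating the shared_gpfs_node.cfg file that is mentioned in the deviations above; the FQDN is a placeholder for one node of your ESS or shared IBM Spectrum Scale cluster:

# On the Ambari server node
echo "essnode1.example.com" > /var/lib/ambari-server/resources/shared_gpfs_node.cfg
chmod 444 /var/lib/ambari-server/resources/shared_gpfs_node.cfg
# Confirm that password-less SSH from the Ambari server to this node works
ssh essnode1.example.com hostname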

Deploy FPO
In File Placement Optimizer (FPO) mode, data blocks are stored in chunks in IBM Spectrum Scale, and
replicated to protect against disk and node failure. DFS clients run on the storage node so that they can
leverage the data locality for executing the tasks quickly.
For the local storage mode configuration, “Short-circuit read (SSR)” on page 427 is recommended to
improve the access efficiency.
Note: Ambari only supports creating an IBM Spectrum Scale FPO file system.
Follow the Installation procedure, but to create a new FPO cluster, apply the following deviations:
• Skip “ESS setup” on page 359.
• Follow “Create HDP cluster” on page 360.
• Skip “Establish an IBM Spectrum Scale cluster on the Hadoop cluster” on page 364.
• Skip “Configure remote mount access” on page 365.
• Follow “Install Mpack package” on page 366.
• Follow the “Deploy the IBM Spectrum Scale service” on page 367 with the following deviations:
Under Assign Masters:
• All the Yarn’s NodeManager nodes should be FPO nodes with the same number of disks for each node
specified in the NSD stanza.
Under Customize Services:
• Configuration fields on both standard and advanced tabs are populated with values taken from the
Hadoop performance tuning guide.
• Verify that the gpfs.storage.type is set to local.
• If you do not plan to have a sub-directory under the IBM Spectrum Scale mount point, do not click
the gpfs.data.dir field, so that the field remains empty.
• Ensure that yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs are set to
a dummy local directory initially. When a new FPO file system is deployed, partitioned local directories
dynamically replace the ones in yarn.nodemanager.local-dirs after the FPO file system is created.
Manually check that the yarn.nodemanager.log-dirs value is set correctly. For more
information, see "Disk-partitioning algorithm" on page 414.
• Create an NSD file, gpfs_nsd, and place it into the /var/lib/ambari-server/resources directory. Ensure
that the permission on the file is at least 444. Add the NSD filename, gpfs_nsd, to the GPFS File system
> GPFS NSD stanza file field in the Standard Config tab.
Two types of NSD files are supported for file system auto creation. One is the preferred simple format and
another is the standard IBM Spectrum Scale NSD file format for IBM Spectrum Scale experts.
Simple NSD
If a simple NSD file is used, Ambari selects the proper metadata and data ratio for you. If possible,
Ambari creates partitions on some disks for the Hadoop intermediate data, which improves the Hadoop
performance. Simple NSD does not support existing partitioned disks in the cluster.
• Disk partitioning under Ambari would happen only if the following conditions are met:
1. The NSD stanza requires all GPFS nodes to be specified in the NSD stanza file.
2. Each of those nodes should have the same number of disks specified in the stanza file.
3. The number of host entries in the stanza file (NSD servers) must be equal to the number of
NodeManagers. This requires all hosts running a GPFS node to be set up as a NodeManager too. If you do
not want Hadoop jobs to run on a specific GPFS node host (for example, on the Ambari server host),
you can remove the NodeManager component from that host after deploying the IBM Spectrum
Scale service.
For more details on disk partitioning, see the following:
• “Disk-partitioning algorithm” on page 414.
• “Partitioning function matrix in automatic deployment” on page 415.
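For illustration, a simple NSD file without the -meta label uses the same DISK| format that is shown later in "Partitioning function matrix in automatic deployment". The host names and device names below are placeholders; Ambari chooses the metadata disks and partitions the data disks itself:

$ cat /var/lib/ambari-server/resources/gpfs_nsd
DISK|hdpnode01.example.com:/dev/sdb,/dev/sdc,/dev/sdd
DISK|hdpnode02.example.com:/dev/sdb,/dev/sdc,/dev/sdd
DISK|hdpnode03.example.com:/dev/sdb,/dev/sdc,/dev/sdd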
Standard NSD
If the cluster has a partitioned file system, only a Standard NSD file can be used.

If a standard IBM Spectrum Scale NSD file is used, administrators are responsible for the storage space
arrangement.
• Apply the partition algorithm.
Apply the algorithm for system pool and usage.
• Apply the failure group selection rule.
Failure groups are created based on the rack location of the node.
• Define the Rack mapping file.
Nodes can be defined to belong to racks.
• Partition the function matrix.
A disk is divided into two partitions so that one partition can be used for an ext3 or ext4 file system
that stores the map or reduce intermediate data, while the other partition is used as a data disk in
the IBM Spectrum Scale file system. Also, only data disks can be partitioned; metadata disks cannot be
partitioned.
• A policy file is required when a standard IBM Spectrum Scale NSD file is used.
A policy file, gpfs_fs.pol, must be created and placed into the /var/lib/ambari-server/resources
directory. Add the policy filename, gpfs_fs.pol, into the GPFS policy file field in the Standard Config
tab.
For more information on creating policy files, see “Policy File” on page 359.
For more information on each of the set-up points for standard NSD file, see “IBM Spectrum Scale-FPO
deployment” on page 414.
Note: Deploying HDP over an existing IBM Spectrum Scale FPO cluster through Ambari requires you to either
store the Yarn intermediate data in the IBM Spectrum Scale file system, or use idle disks formatted
as a local file system. It is recommended to use the idle disks formatted as a local file system. For more
information, see "Deploy HDP or IBM Spectrum Scale service on pre-existing IBM Spectrum Scale file
system" on page 404.

Hadoop Storage Tiering


When using Hadoop Storage Tiering configuration, the jobs running on the Hadoop cluster with native
HDFS can read and write the data from IBM Spectrum Scale in real time.
There would be only one copy of the data. For more information, see Hadoop Storage Tiering.
For information on viewfs, see “Apache Hadoop ViewFs support” on page 178.

Limited Hadoop nodes as IBM Spectrum Scale nodes


For any deployment model, you do not have to put all the Hadoop cluster nodes as GPFS nodes.
1. On all the Hadoop hosts that are either a NameNode or a DataNode in the native HDP cluster, assign
a GPFS node so that the Transparency NameNodes and DataNodes are able to do a RPC in IBM
Spectrum Scale.



2. If you need to configure additional Transparency DataNodes other than the native DataNodes, assign a
GPFS Node on them as well.
3. You could also have GPFS Nodes that do not use Transparency service. For example, if you want to use
a host GPFS protocol node, assign GPFS Node on that host.

Configuring multiple file system mount point access


Currently, the IBM Spectrum Scale service GUI can support only up to two file systems.
Note: The gpfs.storage.type has to be configured during the initial deployment of the IBM Spectrum
Scale service and it cannot be changed later.
During the IBM Spectrum Scale Ambari deployment, the following fields are required for setting up the
multiple file system access:

Fields and their descriptions:

gpfs.storage.type
    Type of storage. Comma-delimited string. The first value is treated as the primary file system and the values after that are treated as the secondary file systems. Only the following combinations of file system values are supported:
    gpfs.storage.type=local,remote
    gpfs.storage.type=remote,remote
    gpfs.storage.type=shared,shared

gpfs.mnt.dir
    Mount point directories for the file systems. Comma-delimited string. The first entry is for the primary file system. The second entry is for the secondary file system.

gpfs.replica.enforced
    Replication type for each file system (dfs or gpfs). Comma-delimited string. The first entry is for the primary file system. The second entry is for the secondary file system.

gpfs.data.dir
    Only one value must be specified. Null is a valid value. The data directory is created only for the primary file system.

GPFS file system name
    Names of the file systems. Comma-delimited string. The first entry is for the primary file system. The second entry is for the secondary file system.

Note:
1. If gpfs.storage.type has a local value, a pre-existing IBM Spectrum Scale cluster is required. If
an FPO file system is not created, it can be created if the NSD stanza files are specified. If an FPO file
system is created, the information is propagated in Ambari.
2. If gpfs.storage.type has remote value, the pre-existing IBM Spectrum Scale remote mounted file
system is required. For information on how to configure remote mount file system, see “Configure
remote mount access” on page 365.
Follow the instructions based on the type of deployment model that you have:
1. Add remote mount file systems access to existing HDP and an FPO file system that was deployed by
IBM Spectrum Scale Ambari service.



Prerequisites:
• Deployed HDP.
• Deployed FPO file system via IBM Spectrum Scale service through Ambari. The Ambari server
requires to be on the GPFS master node.
• Pre-existing remote mount file system.
Use the gpfs.storage.type=local,remote configuration setting.
On the Ambari server node on the local FPO file system:
• Stop All services.
On the Ambari UI, click Actions > Stop All1 to stop all the services.
• On the owning IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmlsmount all
command to ensure that the file system is mounted.
This step is needed for the IBM Spectrum Scale deploy wizard to automatically detect the existing
file systems.
• Update the IBM Spectrum Scale configuration:
Click Ambari GUI > Spectrum Scale > Configs tab and update the following fields:
– gpfs.storage.type
– gpfs.mnt.dir
– gpfs.replica.enforced
– gpfs.data.dir
– GPFS FileSystem Name
In this example, the primary file system mount point is /localfs and the secondary file system
mount point is /remotefs.
Setting of the fields would be as follows:

gpfs.storage.type=local,remote
gpfs.mnt.dir=/localfs,/remotefs
gpfs.replica.enforced=dfs,dfs
gpfs.data.dir=myDataDir OR gpfs.data.dir=
GPFS FileSystem Name=localfs,remotefs

• Restart the IBM Spectrum Scale service.


• Restart any service with Restart Required icon.
• Click Ambari > Actions > Start All to start all the services.
2. Add remote mount file systems access to existing HDP and an IBM Spectrum Scale FPO file system
that was deployed manually.
Prerequisites:
• An FPO file system that is manually created.
• Deployed HDP on the manually created FPO file system. The Ambari server requires to be on the
GPFS master node.
• Pre-existing remote mount file system.
Use the gpfs.storage.type=local,remote configuration setting.
On the Ambari server node on the local FPO file system, perform the following:
• Stop All services.
On the Ambari UI, click Actions > Stop All1 to stop all the services.
• Start the IBM Spectrum Scale service cluster.
On the local IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmstartup -a command.



• Ensure that all the remote mount file systems are active and mounted.
• On each IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmgetstate -a command to
ensure it is started.
This step is needed for the IBM Spectrum Scale deploy wizard to automatically detect the existing
file systems.
• Deploy the IBM Spectrum Scale service on the pre-existing file system.
During deployment, the wizard would detect both the file systems and would populate the IBM
Spectrum Scale config UI with recommended values for gpfs.storage.type, gpfs.mnt.dir
gpfs.replica.enforced, gpfs.data.dir and GPFS FileSystem Name fields. Review the
recommendations and correct them as needed before you continue to deploy the service.
In this example, the primary file system mount point is /localfs and the secondary file system
mount point is /remotefs.
Setting of the fields would be as follows:

gpfs.storage.type=local,remote
gpfs.mnt.dir=/localfs,/remotefs
gpfs.replica.enforced=dfs,dfs
gpfs.data.dir=myDataDir OR gpfs.data.dir=
GPFS FileSystem Name=localgpfs,remotegpfs

• Click Ambari > Actions > Start All to start all the services.
3. Add remote mount file systems access to existing HDP and a manually created IBM Spectrum Scale
cluster.
Create the FPO file system onto the local IBM Spectrum Scale cluster.
Prerequisites:
• A manual IBM Spectrum Scale cluster is created.
• No FPO file system was created.
• Deployed HDP onto the manual IBM Spectrum Scale cluster. The Ambari server requires to be on the
GPFS master node.
• Pre-existing remote mount file system.
Use gpfs.storage.type=local,remote configuration setting.
On the Ambari server node on the local cluster:
• Stop All services.
On the Ambari UI, click Actions > Stop All1 to stop all the services.
• Start IBM Spectrum Scale service cluster.
On the local IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmstartup -a command.
• Ensure that all the remote mount file systems are active and mounted.
• On each IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmgetstate -a command to
ensure it is started.
This step is needed for the IBM Spectrum Scale deploy wizard to automatically detect the existing
file systems.
• Deploy the IBM Spectrum Scale service.
During deployment, the wizard would detect both the file systems and would populate the IBM
Spectrum Scale config UI with recommended values for gpfs.storage.type, gpfs.mnt.dir,
gpfs.replica.enforced, gpfs.data.dir and GPFS FileSystem Name fields. Review the
recommendations and correct them as needed before you continue to deploy the service.
In this example, the primary file system mount point is /localfs and the secondary file system
mount point is /remotefs.



– Configure fields for FPO cluster:
- Update the NSD stanza file.
If this is a standard stanza file, update the policy file field.
- Review the replication fields. Default is set to 3.
In this example, the primary file system mount point is /localfs and the secondary file system
mount point is /remotefs.
Setting of the fields would be as follows:

gpfs.storage.type=local,remote
gpfs.mnt.dir=/localfs,/remotefs
gpfs.replica.enforced=dfs,dfs
gpfs.data.dir=myDataDir OR gpfs.data.dir=
GPFS FileSystem Name=localfs,remotefs

Note: The newly created FPO cluster is set as the primary file system. The remote mounted file
system is set as the secondary file system.
• Restart IBM Spectrum Scale service.
• Restart any service with the Restart Required icon.
• On the Ambari UI, click Actions > Start All to start all the services.
4. Add only the remote mount file systems access to existing HDP and a manually created IBM Spectrum
Scale cluster.
Prerequisites:
• A manual IBM Spectrum Scale cluster is created.
• Deployed HDP onto the manual IBM Spectrum Scale cluster. The Ambari server node requires to be
on the GPFS master node.
• Pre-existing remote mount file systems.
Use gpfs.storage.type=remote,remote configuration setting.
On the Ambari server node, on the local cluster:
• Stop All services.
On the Ambari UI, click Actions > Stop All1 to stop all the services.
• Start the IBM Spectrum Scale cluster.
On the local IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmstartup -a command.
• Ensure that all the remote mount file systems are active and mounted.
• On each IBM Spectrum Scale cluster, run the /usr/lpp/mmfs/bin/mmgetstate -a command to
ensure it is started.
This step is needed for the IBM Spectrum Scale deploy wizard to automatically detect the existing
file systems.
• Deploy the IBM Spectrum Scale service.
During deployment, the wizard would detect both the file systems and would populate the IBM
Spectrum Scale config UI with recommended values for gpfs.storage.type, gpfs.mnt.dir
gpfs.replica.enforced, gpfs.data.dir and GPFS FileSystem Name fields. Review the
recommendations and correct them as needed before you continue to deploy the service.
In this example, the primary file system mount point is /remotefs1 and the secondary file system
mount point is /remotefs2.
Setting of the fields would be as follows:

gpfs.storage.type=remote,remote
gpfs.mnt.dir=/remotefs1,/remotefs2
gpfs.replica.enforced=dfs,dfs



gpfs.data.dir=myDataDir OR gpfs.data.dir=
GPFS FileSystem Name=remotefs1,remotefs2

• On the Ambari UI, click Actions > Start All to start all the services.
5. Add the shared file system access to an existing HDP Scale cluster from an ESS or IBM Spectrum Scale
cluster. Shared file system mode is a single GPFS cluster where the Hadoop Scale cluster is part of the
existing ESS or IBM Spectrum Scale cluster.
Prerequisites:
• Deployed HDP cluster.
• Two pre-existing GPFS file system.
Use the gpfs.storage.type=shared,shared configuration setting.
On the Ambari server node, on the local cluster:
• Stop All services.
On the Ambari UI, click Actions > Stop All to stop all the services.
• Ensure that each of the file systems is active and mounted on the ESS or the IBM Spectrum Scale
cluster.
• Deploy the IBM Spectrum Scale service. During deployment, the wizard would detect both the
file systems and would populate the IBM Spectrum Scale config UI with recommended values for
gpfs.storage.type, gpfs.mnt.dir, gpfs.replica.enforced, gpfs.data.dir and GPFS
file system Name fields. Review the recommendations and correct them as needed before you
continue to deploy the service.
In this example, the primary file system mount point is /essfs1 and the secondary file system
mount point is /essfs2.
Setting of the fields would be as follows:

gpfs.storage.type=shared,shared
gpfs.mnt.dir=/essfs1,/essfs2
gpfs.replica.enforced=dfs,dfs
gpfs.data.dir=myDataDir OR gpfs.data.dir=
GPFS FileSystem Name=essfs1,essfs2

• After the IBM Spectrum Scale service is deployed successfully, on the Ambari UI, click Actions >
Start All to start all the services.
1 For FPO cluster, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on
how to properly stop IBM Spectrum Scale.

Support for Big SQL


Big SQL V 6.0.0 is supported from HDP 3.1 with Mpack 2.7.0.3 and HDFS Transparency 3.1.0-1.
The IBM Db2 Big SQL service > Configs > Advanced bigsql-env > Database path cannot be changed
after Big SQL is installed. Therefore, Big SQL must be installed after IBM Spectrum Scale service is
installed so that the Database path is set correctly to use the IBM Spectrum Scale storage.
For more information about IBM Db2® Big SQL V 6.0.0, see IBM Db2 Big SQL V6.0 documentation.
Issue:
All the HDP services are up, but the Big SQL job might fail with the following error if the hdp_version (after
Big SQL integration) is not set up properly:

[04:16:25 ERROR ] <Stream 1-Main> Caused by:


[04:16:25 ERROR ] <Stream 1-Main> com.ibm.utm.core.sql.UtmSqlException: Error
executing statement:
LOAD HADOOP USING FILE URL '/user/user1/biga_load//data/small_types_pos_data.unl' WITH SOURCE
PROPERTIES
('field.delimiter'='|', 'date.time.format'='yyyy-MM-dd-HH.mm.ss.SSSSSSSSS',
'date.time.format'='yyyy-MM-dd-HH.mm.ss.SSSSSS', 'date.time.format'='MM-dd-yyyy',
'escape.char'='\\') INTO TABLE ALL_TYPES_TAB



Sql Code: -5111
IsamCode: 0
Sql State: 58005
[04:16:25 ERROR ] <Stream 1-Main> Error code: -5111
[04:16:25 ERROR ] <Stream 1-Main> SQLSTATE : 58005
[04:16:25 ERROR ] <Stream 1-Main> Message : The LOAD HADOOP statement failed
because of an
error with a component. Component name: "". Reason code: "4:URISyntaxExce". Log entry
identifier:
"[BSL-0-1d2d2361d]". Job identifier: "".. SQLCODE=-5111, SQLSTATE=58005, DRIVER=3.71.22
[04:16:25 ERROR ] <Stream 1-Main> Caused by:
[04:16:25 ERROR ] <Stream 1-Main> com.ibm.db2.jcc.am.SqlException: The LOAD HADOOP
statement
failed because of an error with a component. Component name: "". Reason code:
"4:URISyntaxExce". Log entry
identifier: "[BSL-0-1d2d2361d]". Job identifier: "".. SQLCODE=-5111, SQLSTATE=58005,
DRIVER=3.71.22

The JobHistory will have the following error:

Log Type: prelaunch.err


Log Upload Time: Wed Apr 03 04:22:00 -0400 2019
Log Length: 1027
/hadoop/yarn/local/usercache/bigsql/appcache/application_1554277915663_0004
/container_e28_1554277915663_0004_02_000001/launch_container.sh: line 38: $PWD:$PWD
/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/
mapreduce/lib/*:
$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/
common/lib/*:
$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:
$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:
$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:
/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:
/etc/hadoop/conf/secure:/usr/ibmpacks/bigsql/6.0.0.0/bigsql/hive-client/lib/*:
/usr/ibmpacks/bigsql/6.0.0.0/bigsql/hive-client/lib/*:/usr/hdp/3.1.0.0-78/atlas/hook/hive/*:
/usr/ibmpacks/bigsql/6.0.0.0/bigsql/hive-client/lib/hive-hcatalog-
core-3.1.0.3.1.0.0-78.jar:job.jar/*:
job.jar/classes/:job.jar/lib/*:$PWD/*:$PWD/atlas-application.properties: bad substitution

Log Type: prelaunch.out


Log Upload Time: Wed Apr 03 04:22:00 -0400 2019
Log Length: 25
Setting up env variables

To fix the issue:


1. On an HDP node, run the hdp-select versions command to get the HDP version.

[root@c902f14x01 ~]# hdp-select versions


3.1.0.0-78
[root@c902f14x01 ~]#

This example found 3.1.0.0-78 as the HDP version value.


2. Go to the Ambari GUI, select MapReduce2 service > Configs and search for hdp.version.

3. Replace all the ${hdp.version} values with the actual HDP version number.
For example, change all fields that contain the ${hdp.version} tag to the value 3.1.0.0-78.



4. Restart all the required services.

Administration

IBM Spectrum Scale-FPO deployment


This section provides the information for FPO deployment.

Disk-partitioning algorithm
If a simple NSD file is used without the -meta label, Ambari assigns metadata and data disks and
partitions the disks according to the following rules:
1. If the number of nodes is less than or equal to four:
• If the number of disks per node is less than or equal to three, put all disks in the system pool and set
usage = metadataanddata. Partitioning is not done.
• If the number of disks per node is greater than or equal to four, assign metaonly and dataonly disks
based on a 1:3 ratio on each node. The maximum number of metadata disks per node is four. Partitioning
is done if all NodeManager nodes are also NSD nodes and have the same number of NSD disks.
2. If the number of nodes is equal to or greater than five:
• If the number of disks per node is less than or equal to two, put all disks in the system pool, and
usage is metadataanddata. Partitioning is not done.
• Set four nodes as metadata nodes where the metadata disks are located. The others are data nodes.
• Failure groups are created based on the failure group selection rule.
• Assign metadata disks and data disks to the metadata nodes. Assign only data disks to the data nodes.
The ratio follows best practice and falls between 1:3 and 1:10.
• If all GPFS nodes have the same number of NSD disks, create a local partition on the data disks for
Hadoop intermediate data.

Failure Group selection rules


Failure groups are created based on the rack allocation of the nodes. One rack mapping file is supported
(Rack Mapping File).
Ambari reads this rack mapping file and assigns one failure group per rack. The number of racks must be
three or more. If a rack mapping file is not provided, virtual racks are created for data fault
tolerance.
1. If the number of nodes is less than five, each node is placed in its own virtual rack.



2. If the node number is greater than or equal to five, and node number is less than 10, every two nodes
are put in one virtual rack.
3. If the node number is greater than or equal to ten and node number is less than 21, every three nodes
are put in one virtual rack.
4. If the node number is greater than or equal to 21, every 10 nodes are put in one virtual rack.

Rack mapping file


Nodes can be defined to belong to racks. For three or more racks, the failure groups of the NSD will
correspond to the rack the node is in.
A sample file is available on the Ambari
server at /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-MPack-
<version>/extensions/SpectrumScaleExtension/<version>/services/GPFS/package/
templates/racks.sample. To use, copy the racks.sample file to the /var/lib/ambari-
server/resources directory.

$ cat /var/lib/ambari-server/resources/racks.sample

#Host/Rack map configuration file


#Format:
#[hostname]:/[rackname]
#Example:
#mn01:/rack1
#NOTE:
#The first character in rack name must be "/"
mn03:/rack1
mn04:/rack2
dn02:/rack3

Figure 41. AMBARI RACK MAPPING

Partitioning function matrix in automatic deployment


Each data disk is divided into two parts. One part is used for an ext4 file system to store the map or
reduce intermediate data, while the other part is used as a data disk in the IBM Spectrum Scale file
system. Only the data disks can be partitioned. Meta disks cannot be partitioned.
If a node is not selected as a NodeManager for Yarn, no map or reduce tasks will run on that
node. In this case, partitioning the disks of that node is not favorable because the local partition will not be
used.
The following table describes the partitioning function matrix:



Table 36. IBM Spectrum Scale partitioning function matrix

#1: <node manager host list> == <IBM Spectrum Scale NSD server nodes>
(The node manager host list is equal to the IBM Spectrum Scale NSD server nodes.)
• Standard NSD file: No partitioning. Create an NSD directly with the NSD file.
• Simple NSD file without the -meta label: Partition and select the meta disks for the customer according to the Disk-partitioning algorithm and the Failure Group selection rules.
• Simple NSD file with the -meta label: No partitioning. All disks marked with the -meta label are used for metadata NSD disks. All others are marked as data NSDs.

#2: <node manager host list> > <IBM Spectrum Scale NSD server nodes>
(Some node manager hosts are not in the IBM Spectrum Scale NSD server nodes, but all IBM Spectrum Scale NSD server nodes are in the node manager host list.)
• Standard NSD file: No partitioning. Create the NSD directly with the specified NSD file.
• Simple NSD file without the -meta label: No partitioning, but select the meta disks for the customer according to the Disk-partitioning algorithm and the Failure Group selection rules.
• Simple NSD file with the -meta label: No partitioning. All disks marked with the -meta label are used for metadata NSD disks. All others are marked as data NSDs.

#3: <node manager host list> < <IBM Spectrum Scale NSD server nodes>
(Some IBM Spectrum Scale NSD server nodes are not in the node manager host list, but all node manager hosts are in the IBM Spectrum Scale NSD server nodes.)
• Standard NSD file: No partitioning. Create the NSD directly with the specified NSD file.
• Simple NSD file without the -meta label: No partitioning, but select the meta disks for the customer according to the Disk-partitioning algorithm and the Failure Group selection rules.
• Simple NSD file with the -meta label: No partitioning. All disks marked with the -meta label are used for metadata NSD disks. All others are marked as data NSDs.

For standard NSD files, or simple NSD files with the -meta label, the IBM Spectrum Scale NSD and file
system are created directly.
To specify the disks that must be used for metadata, and have data disks partitioned, use the
partition_disks_general.sh script to partition the disks first, and specify the partition that is used
for GPFS NSD in a simple NSD file.
Send an email to [email protected] to request the partition_disks_general.sh script.
For example:

$ cat /var/lib/ambari-server/resources/gpfs_nsd

DISK|compute001.private.dns.zone:/dev/sdb-meta,/dev/sdc2,/dev/sdd2
DISK|compute002.private.dns.zone:/dev/sdb-meta,/dev/sdc2,/dev/sdd2
DISK|compute003.private.dns.zone:/dev/sdb-meta,/dev/sdc2,/dev/sdd2
DISK|compute005.private.dns.zone:/dev/sdb-meta,/dev/sdc2,/dev/sdd2
DISK|compute006.private.dns.zone:/dev/sdb,/dev/sdc2,/dev/sdd2
DISK|compute007.private.dns.zone:/dev/sdb,/dev/sdc2,/dev/sdd2



After deployment is done in this mode, manually update the yarn.nodemanager.local-dirs and
yarn.nodemanager.log-dirs settings to contain the directory list from the disk partitions that are used
for map or reduce intermediate data.

Ranger

Enabling Ranger
This section provides instructions to enable Ranger.
Ranger can be configured before or after IBM Spectrum Scale service is deployed. The HDFS
Transparency does not need to be in an unintegrated state.

Ranger procedure
This topic lists the steps to install Ranger
Follow these steps to enable Ranger:
• Configuring MySQL for Ranger
• Installing Ranger through Ambari
• Enabling Ranger HDFS plugin
• Logging into Ranger UI

Configuring MySQL for Ranger


Prepare the environment by configuring MySQL to be used for Ranger.
1. Create a non-root user to create the Ranger databases.
In this example, the username rangerdba with password rangerdba is used.
a. Log in as the root user to the DB host node and ensure that the DB is running. This is the node that has
MySQL installed, which is usually the Hive server node. Open a MySQL shell as the root database user
(for example, mysql -u root -p), and use the following commands to create the rangerdba user and grant
the user adequate privileges:

CREATE USER 'rangerdba'@'localhost' IDENTIFIED BY 'rangerdba';

GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost';

CREATE USER 'rangerdba'@'%' IDENTIFIED BY 'rangerdba';

GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%';

GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost' WITH GRANT OPTION;

GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%' WITH GRANT OPTION;

FLUSH PRIVILEGES;

After setting the privileges, use the exit command to exit MySQL.
b. Reconnect to the database as user rangerdba by using the following command:

mysql -u rangerdba -prangerdba

After testing the rangerdba login, use the exit command to exit MySQL.
2. Check MySQL Java connector.
a. Run the following command to confirm that the mysql-connector-java.jar file is in the Java
share directory. This command must be run on the Ambari server node.

ls /usr/share/java/mysql-connector-java.jar

Note: If the /usr/share/java/mysql-connector-java.jar file is not found, install the mysql-connector-java package on the Ambari server node:

$ yum install mysql-connector-java

b. Use the following command to set the jdbc/driver/path based on the location of the MySQL
JDBC driver .jar file. This command must be run on the Ambari server node.

ambari-server setup --jdbc-db={database-type} --jdbc-driver={/jdbc/driver/path}

For example:

ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
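As an optional check, you can confirm that the rangerdba grants are in place and that the JDBC driver path registered with Ambari exists; this is a sketch that assumes the default credentials and paths used above:

# Confirm that the rangerdba user can log in and has the expected privileges
mysql -u rangerdba -prangerdba -e "SHOW GRANTS FOR CURRENT_USER();"

# Confirm that the JDBC driver file registered with Ambari exists
ls -l /usr/share/java/mysql-connector-java.jar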

Installing Ranger through Ambari


This topic lists the steps to install Ranger through Ambari.
1. Log in to Ambari UI.
2. Add the Ranger service. Click Ambari dashboard > Actions > Add Service.

Figure 42. Ambari dashboard


3. On the Choose Services page, select Ranger and Ranger KMS.

4. Customize the services. In the Ranger Admin dashboard, configure the following:
• Under "DB Flavor", select MYSQL.
• For the Ranger DB host, the host name must be the host where MySQL is installed.
• For Ranger DB username, set the value to rangeradmin.
• For Ranger DB password, set the value to rangeradmin.
• For the Database Administrator (DBA) username, set the value to root.
• For the Database Administrator (DBA) password, set the value to the root user's password.
• Click Test Connection and ensure that the connection result is OK.
• In the ADVANCED tab, set passwords for the following:

a. Ranger Usersync user's password
b. Ranger Tagsync user's password
c. Ranger KMS keyadmin user's password

• In the Advanced ranger-env tab, provide the Ranger Admin user's password.

• In the Ranger Audit tab, ensure that the Audit to Solr option is disabled.

5. Customize the services. In the RANGER KMS tab, configure the following:


a. Under “DB Flavor”, select MYSQL.
b. For the 'Ranger KMS DB host', the host name must be the host where MySQL is installed.
c. For Ranger KMS DB username, set the value to rangerkms.
d. For Ranger KMS DB password, set the value to rangerkms.

e. For 'Database Administrator (DBA) password', set the root user's password.
f. For 'KMS master key password', set a new password.

g. Deploy and complete the installation. For better performance, assign the Ranger server to the same node as the HDFS Transparency NameNode. Then select Next > Next > Deploy.

Enabling Ranger HDFS plug-in


This topic lists the steps to enable Ranger HDFS plug-in
1. From the HDFS dashboard, click the Configs tab > Advanced tab > Advanced ranger-hdfs-plugin-properties, and check the Enable Ranger for HDFS box.

2. Save the configuration. The Restart required message is displayed at the top of the page. Click
Restart, and select Restart All Affected to restart the HDFS service, and load the new configuration.
Note: After each step, save the configuration and click Restart All Affected for any services that request it so that the new configuration is loaded.
After the HDFS restarts, the Ranger plug-in for HDFS is enabled.

Logging into Ranger UI


This topic provides instructions to log in to the Ranger UI.
To log in to the Ranger UI, open http://<gateway>:6080 and use the following credentials:
User ID/Password: admin/admin
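As an optional command-line check before logging in through a browser, you can confirm that the Ranger admin UI is listening; this is a sketch that assumes the default port 6080 shown above:

# A response code such as 200 or 302 indicates that the Ranger UI is up
curl -s -o /dev/null -w "%{http_code}\n" http://<gateway>:6080/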

Disabling Ranger
If you do not want to use Ranger any more, do the following:
From the HDFS dashboard, click the Configs tab > Advanced tab > Advanced ranger-hdfs-plugin-properties, and uncheck the Enable Ranger for HDFS box.
To increase performance, you could also disable Ranger in HDFS Transparency by executing the following
steps:
1. Log in to the Ambari GUI.
2. Select the IBM Spectrum Scale service > Configs and set the value of gpfs.ranger.enabled to
false.
3. Save the configuration.
4. Restart IBM Spectrum Scale service, then restart HDFS to sync this value to all the nodes.
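To confirm that the value has been synced, you can inspect gpfs-site.xml on an HDFS Transparency node; this is a sketch that assumes the HDFS Transparency configuration directory /var/mmfs/hadoop/etc/hadoop used elsewhere in this chapter:

# Show the gpfs.ranger.enabled property and the line that follows it
grep -A 1 gpfs.ranger.enabled /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml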

Kerberos
Enabling Kerberos
Only MIT KDC is supported for IBM Spectrum Scale service through Ambari.
If you are using a Kerberos implementation other than MIT KDC:
1. Disable Kerberos.
2. Install the IBM Spectrum Scale service.
3. Enable Kerberos.
Note:
• If Kerberos is not disabled, then the IBM Spectrum Scale service can hang.
• For Kerberos issues, see “Service fails to start” on page 480.

Enabling Kerberos when the IBM Spectrum Scale service is not integrated
IBM Spectrum Scale service is not integrated into Ambari.
1. Follow “Setting up KDC server and enabling Kerberos” on page 425 to enable Kerberos. This is before
deploying IBM Spectrum Scale service in “Install Mpack package” on page 366 and “Deploy the IBM
Spectrum Scale service” on page 367.
Once Kerberos is enabled, the KDC information must be set in the IBM Spectrum Scale Customizing Services panel during deployment.
2. During the IBM Spectrum Scale service deployment phase of Customizing Services:
When adding the IBM Spectrum Scale service to a Kerberos-enabled system into Ambari, the
KDC_PRINCIPAL and the KDC_PRINCIPAL_PASSWORD fields seen in the Customize Services screen
must be updated with the actual values.
Input the correct KDC admin principal and KDC admin principal password into the fields:

After all the required fields are set for the customized services panel, review all the fields in
Customizing Services before clicking NEXT.
Note: The Admin principal and Admin password are the same as the corresponding KDC_PRINCIPAL
and KDC_PRINCIPAL_PASSWORD values.

KDC_PRINCIPAL=Admin principal
KDC_PRINCIPAL_PASSWORD=Admin password

The KDC admin principal and KDC admin principal password are generated when the KDC server is set
up.

3. Continue in Customizing Services tab to continue installation of the IBM Spectrum Scale service.

Enabling Kerberos when the IBM Spectrum Scale service is integrated
If Kerberos is to be enabled after IBM Spectrum Scale service is already integrated, the KDC_PRINCIPAL
and KDC_PRINCIPAL_PASSWORD is required to be set in the IBM Spectrum Scale Configuration panel.
Add the principal and password before enabling Kerberos. While enabling or disabling Kerberos, you do
not need to stop the IBM Spectrum Scale service.
1. Follow the steps in “Setting up KDC server and enabling Kerberos” on page 425 to enable Kerberos.
Note: During the enable Kerberos process, the Start and Test service is done. If the check services fail,
you must exit, and go to the next step to add the KDC principal and password into IBM Spectrum Scale.
2. From Ambari, click Spectrum Scale service > Configs tab > Advanced > Advanced gpfs-ambari-
server-env.
Type the KDC principal values and the KDC principal password values. Save the configuration.

3. Restart all the services. Click Ambari panel > Service Actions > Stop All and Start All.
Note: For FPO clusters, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.

Setting up KDC server and enabling Kerberos


This topic provides steps to set up KDC server and enable Kerberos.
1. To set up the Key Distribution Center (KDC) server:
• For HDP, follow the Install a new MIT KDC documentation.
Note: If the KDC server is already implemented, skip this step.
2. On the Ambari GUI, click Admin > Kerberos, and follow the GUI panel guide to enable the Kerberos
service.

Kinit on the NameNodes


This topic describes kinit on the NameNodes. These commands are run internally during HDFS service
start up when the IBM Spectrum Scale service is integrated.
On the NameNodes, run: # kinit -kt /etc/security/keytabs/nn.service.keytab nn/
NN_HOSTNAME@REALM_NAME
where,
• NN_HOSTNAME is the NameNode host name (FQDN).
• REALM_NAME is the KDC Realm.
• nn is the Kerberos NameNode naming convention created during Kerberos setup.
For example: kinit -kt /etc/security/keytabs/nn.service.keytab nn/
[email protected].

Note: If you are in a non-root environment, this command is internally run with sudo privilege.
If HA is enabled, run the command on both the NameNodes.

Kinit on the DataNodes


This topic describes kinit on the DataNodes.
On the DataNodes, run: # kinit -kt /etc/security/keytabs/dn.service.keytab dn/
DN_HOSTNAME@REALM_NAME.
where,
• DN_HOSTNAME is the DataNode host name (FQDN).
• REALM_NAME is the KDC Realm.
• dn is the Kerberos DataNode naming convention created during Kerberos setup.
For example: kinit -kt /etc/security/keytabs/dn.service.keytab dn/
[email protected].
If in a non-root environment, ensure that you run this command with sudo privilege.
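After running kinit on a NameNode or DataNode, you can optionally confirm that a valid ticket was obtained with the standard klist command; the keytab path shown is the DataNode keytab used above:

# List the cached Kerberos tickets for the current user
klist

# Or inspect the keytab to confirm that the expected principals are present
klist -kt /etc/security/keytabs/dn.service.keytab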

Issues in Kerberos enabled environment


This section lists the issues in the kerberos enabled environment and their workarounds.

Bad local directories in Yarn


If Yarn shows an alert for bad local directories when IBM Spectrum Scale is integrated, and the Yarn service check fails, Yarn does not have the correct permissions to access the local mounted directories created by IBM Spectrum Scale. Click Ambari > Yarn > Configs > Advanced > Node Manager, and review yarn.nodemanager.local-dirs for the local directory values.
Workaround
Fix the local directory permissions on all nodes so that the directories are owned by the yarn user and the hadoop group (yarn:hadoop). Restart all services, or go back to the previous step and continue with the process.
For example,

# Local directories under /opt/mapred


/dev/sdf1 on /opt/mapred/local1 type ext4 (rw,relatime,data=ordered)
/dev/sdg1 on /opt/mapred/local2 type ext4 (rw,relatime,data=ordered)
/dev/sdh1 on /opt/mapred/local3 type ext4 (rw,relatime,data=ordered)

# Check the directories under /opt/mapred


In /opt/mapred directory:
drwxrwxrwx 6 root root 4096 Mar 8 23:19 local3
drwxrwxrwx 6 root root 4096 Mar 8 23:19 local2
drwxrwxrwx 6 root root 4096 Mar 8 23:19 local1

# Workaround:
# Change permission from root:root to yarn:hadoop for all the local* directories under /opt/
mapred
# for all the nodes.

Under /opt/mapred directory:


chown yarn:hadoop local*

drwxrwxrwx 6 yarn hadoop 4096 Mar 8 23:19 local3


drwxrwxrwx 6 yarn hadoop 4096 Mar 8 23:19 local2
drwxrwxrwx 6 yarn hadoop 4096 Mar 8 23:19 local1

# Restart all services (Or go back to your previous step and continue with the process).

Nodemanager failure due to device busy


Nodemanager fails to start due to local directory error:

OSError: [Errno 16] Device or resource busy: '/opt/mapred/local1'

To fix this issue:
1. Go to Yarn > Configs > Search for yarn.nodemanager.local-dirs.
2. Check the values for the Yarn local directories.
3. The correct local directory values must contain the Yarn directory in the local directory path. For
example:

yarn.nodemanager.local-dirs="/opt/mapred/local1/yarn,/opt/mapred/local2/yarn,/opt/mapred/local3/yarn"

4. If the <local-dir>/yarn is not specified in yarn.nodemanager.local-dirs, add the path, and save the configuration.
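If the error persists after correcting the property, it can help to confirm what is mounted at the local directories and which processes are holding them open; the following sketch uses standard Linux commands and the example mount points from this chapter:

# Confirm that the ext4 partitions are mounted where expected
mount | grep /opt/mapred

# Show any processes that currently hold /opt/mapred/local1 open
fuser -vm /opt/mapred/local1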

Journal nodes not installed in the native HDFS HA


If you unintegrate from a Kerberos-enabled NameNode HA environment back to native HDFS, you might find that the JournalNode components are missing in HDFS. In such cases, manually install the JournalNodes.

Disabling Kerberos
This topic lists the steps to disable Kerberos.
Note: While enabling or disabling Kerberos, you do not need to stop the IBM Spectrum Scale service.
To disable Kerberos from Ambari:
1. Go to Ambari GUI > Admin > Kerberos > Disable Kerberos.

Short-circuit read (SSR)


In HDFS, read requests go through the DataNode. When the client requests the DataNode to read a
file, the DataNode reads that file off the disk, and sends the data to the client over a TCP socket. The
short-circuit read (SSR) obtains the file descriptor from the DataNode, allowing the client to read the file
directly.
This is possible only in cases where the client is colocated with the data, and is used in the FPO mode.
The short-circuit reads provide a substantial performance boost to many applications.
Prerequisite: Install the Java OpenJDK development tool-kit package, java-<version>-openjdk-
devel, on all nodes.
The short-circuit read is disabled by default in IBM Spectrum Scale Ambari management pack.
To disable or enable the short-circuit read in Ambari with IBM Spectrum Scale:
You must plan a cluster maintenance window, and prepare for cluster downtime when disabling or
enabling short circuit.
• Check (enable) or uncheck (disable) the HDFS Short-circuit read box from the Ambari HDFS dash-
board > Configs tab > Advanced tab > Advanced hdfs-site panel. Save the configuration.
• Stop all services. Click Ambari > Actions > Stop All.
• Start all services. Click Ambari > Actions > Start All.
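After the services are restarted, you can optionally confirm the effective client-side setting from a node that has the hdfs command on its PATH; this is a sketch that assumes the standard HDFS configuration key controls short-circuit reads in your setup:

# Print the effective value of the short-circuit read switch
hdfs getconf -confKey dfs.client.read.shortcircuit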

Disabling short circuit write
This section describes how to disable short circuit write.
Note: By default, the short circuit write is enabled only if the short circuit read is enabled.
1. Go to Ambari GUI > Spectrum Scale > Custom gpfs-site, add the gpfs.short-circuit-
write.enabled=false property, and save the configuration.
2. Restart IBM Spectrum Scale service.
3. Restart HDFS service.
4. Restart any services that are down.
Note: If gpfs.short-circuit-write.enabled is disabled, there will be a lot of traffic over the local loopback (lo) network adapter when you run a teragen job.

IBM Spectrum Scale service management

IBM Spectrum Scale


Manage the IBM Spectrum Scale through the IBM Spectrum Scale dashboard. The status and utilization
information of IBM Spectrum Scale and HDFS Transparency can be viewed on this panel.

Actions dropdown list


To go to the Service Actions dropdown list, click Spectrum Scale > Actions.

Note: Do not use the Delete Service action from the Actions dropdown menu. If you want to remove the IBM Spectrum Scale service, follow the procedure in "Uninstalling IBM Spectrum Scale Mpack and service" on page 393.
When the IBM Spectrum Scale Mpack and service is deleted, the IBM Spectrum Scale file system and
packages are preserved as is. For an FPO cluster created through Ambari, the mounted local disks /opt/
mapred/local* and entries in /etc/fstab are preserved as is.

Running the service check
To check the status and stability of the service, run a service check on the IBM Spectrum Scale dashboard
by clicking Run Service Check in the Actions dropdown menu.

• Review the service check output logs for any issues.


• To manually check the HDFS Transparency NameNodes and DataNodes state, run the following
command:

/usr/lpp/mmfs/bin/mmhadoopctl connector getstate

$ /usr/lpp/mmfs/bin/mmhadoopctl connector getstate


c902f05x01.gpfs.net: namenode running as process 4749.
c902f05x01.gpfs.net: datanode running as process 10214.
c902f05x02.gpfs.net: datanode running as process 4767.
c902f05x03.gpfs.net: datanode running as process 8204.

Stop all without stopping IBM Spectrum Scale service


To prevent IBM Spectrum Scale service from being stopped when you click ACTION > STOP ALL, place
the IBM Spectrum Scale service into maintenance mode.
Click Ambari GUI > Spectrum Scale service > Actions > Turn on Maintenance Mode.
This prevents any Ambari actions from occurring on the service that is in maintenance mode.
To get out of Maintenance Mode, click Ambari GUI > Spectrum Scale service > Actions > Turn off
Maintenance Mode.
Note: For FPO cluster, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General
section on how to properly stop IBM Spectrum Scale.

Modifying IBM Spectrum Scale service configurations


The IBM Spectrum Scale service has standard and advanced configuration panels.
Click Ambari GUI > Spectrum Scale > Configs tab.

Limitation
Key value pairs that are newly added into the IBM Spectrum Scale management pack GUI Advanced
configuration Custom Add Property panel do not become effective in the IBM Spectrum Scale file
system. Therefore, any values not seen in the Standard or Advanced configuration panel need to be
set manually on the command line using the IBM Spectrum Scale /usr/lpp/mmfs/bin/mmchconfig
command.
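For example, an attribute that is not exposed in the panels can be set and then verified from the command line; this is a sketch only, using pagepool as an illustrative attribute and value:

# Set a configuration attribute that is not exposed in the Ambari panels (illustrative value)
/usr/lpp/mmfs/bin/mmchconfig pagepool=4G

# Confirm the current configuration values
/usr/lpp/mmfs/bin/mmlsconfig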

In Ambari, if any configuration in the gpfs-site is changed in the IBM Spectrum Scale dashboard, a Stop All of the services followed by a Start All is required. Check your environment to ensure that the changes made are in effect.
Note: For FPO clusters, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.
Note: You must plan a cluster maintenance window and prepare for cluster downtime when restarting
the IBM Spectrum Scale service and the HDFS service. Ensure that no I/O activities are active on the IBM
Spectrum Scale file system before shutting down IBM Spectrum Scale. If the I/O activities are active, IBM
Spectrum Scale fails to shut down as the kernel extension cannot be unloaded.

GPFS yum repo directory


If the IBM Spectrum Scale yum repo directory is changed, you need to update the GPFS_REPO_URL in
Ambari for the upgrade process to know where the packages are located.
To update the GPFS_REPO_URL in Ambari:
1. Log in to Ambari.

2. Click IBM Spectrum Scale service > Configs > Advanced > Advanced gpfs-ambari-server-env >
GPFS_REPO_URL, update the GPFS_REPO_URL value.
Syntax: http://<yum-server>/<REPO_DIR_LOCATION_OF_PACKAGES>
For example, http://c902mnx09.gpfs.net/repos/GPFS/5.0.1/gpfs_rpms
3. Save the GPFS_REPO_URL configuration.
4. The GPFS_REPO_URL becomes effective during the Upgrading IBM Spectrum Scale and Upgrading
HDFS Transparency process.
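Before starting an upgrade, you can optionally confirm that the new repo URL is reachable from the nodes; the following sketch uses the example URL shown above:

# A reachable yum repo directory typically returns HTTP status 200
curl -s -o /dev/null -w "%{http_code}\n" http://c902mnx09.gpfs.net/repos/GPFS/5.0.1/gpfs_rpms/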

HDFS and IBM Spectrum Scale restart order


When the IBM Spectrum Scale service is integrated, HDFS Transparency NameNodes and DataNodes are
managed as HDFS NameNode and DataNode components respectively in the Ambari HDFS service.
When configuration is changed in IBM Spectrum Scale™, the following restart order should be followed:
1. Stop the HDFS service.
2. Restart the IBM Spectrum Scale service.
3. Restart the HDFS service.

Integrating HDFS Transparency


You must plan a cluster maintenance window, and prepare for cluster downtime when integrating HDFS Transparency with the native HDFS. After each integration, you must run ambari-server restart on the Ambari server node. Ensure that all the services are stopped.
To integrate the HDFS Transparency with the native HDFS:
1. On the dashboard, click Services > Stop All to stop all services. Verify that all services are stopped. If not, stop the services.
Note: For FPO clusters, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.
2. Click Spectrum Scale > Actions > Integrate Transparency.

Figure 43. IBM SPECTRUM SCALE INTEGRATE TRANSPARENCY

3. On the Ambari server node, run the ambari-server restart command to restart the Ambari
server.
4. Log back in to the Ambari GUI.
5. Start all the services from Ambari GUI. The Hadoop cluster starts using IBM Spectrum Scale and the
HDFS Transparency. The HDFS dashboard displays the NameNode and DataNode status of the HDFS
Transparency.
On the HDFS dashboard, check the NameNode and DataNodes status.
Note: JournalNodes are not used when IBM Spectrum Scale service is integrated.

Command verification
To verify that the HDFS Transparency is available, use the following command to check the connector
state:

# Ensure all node GPFS state are active

/usr/lpp/mmfs/bin/mmgetstate -a

# Ensure all the NameNode and DataNodes are running.

/usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector getstate

For more information on how to verify the HDFS transparency integration state, see “Verifying
Transparency integration state” on page 434.
Cluster environment
After the IBM Spectrum Scale service is deployed, IBM Spectrum Scale HDFS Transparency is used
instead of HDFS. HDFS Transparency inherits the native HDFS configuration and adds the additional
changes for the HDFS Transparency to function correctly.
After IBM Spectrum Scale is deployed, a new HDFS configuration set V2 is created, and is visible in the
HDFS Service Dashboard > CONFIG HISTORY.

Unintegrating HDFS Transparency
You must plan a cluster maintenance window, and prepare for the cluster downtime while unintegrating
the HDFS Transparency back to native HDFS. After each unintegration, you need to run the ambari-
server restart on the Ambari server node. Ensure that all the services are stopped.
1. Log in to the Ambari GUI with the same Ambari user ID that was used during the deployment of the IBM Spectrum Scale service. Usually, the "admin" user ID has the administrative privileges. If a different Ambari user ID is used, the Unintegrate Transparency action fails.
2. On the dashboard, click Actions > Stop All to stop all services.
Note: For FPO clusters, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.
3. If Kerberos is enabled, set the KDC_PRINCIPAL and KDC_PRINCIPAL_PASSWORD values in the IBM
Spectrum Scale services > Configs > Advanced. Save the configuration changes.
4. Click Spectrum Scale > Actions > Unintegrate Transparency.

Figure 44. IBM SPECTRUM SCALE UNINTEGRATE TRANSPARENCY


5. On the Ambari server node, run the ambari-server restart command to restart the Ambari
server.
6. Log back in to the Ambari GUI.
7. Start all services from the Ambari GUI. The Hadoop cluster starts using native HDFS. The IBM
Spectrum Scale service is not removed from the Ambari panel, and will be displayed in GREEN. IBM
Spectrum Scale will function, but the HDFS Transparency will not function.
Note: When unintegrated back to native HDFS, the HDFS configuration used remains the same as the
HDFS configuration used by the IBM Spectrum Scale prior to unintegration. If you must revert to the
original HDFS configuration, go to the HDFS dashboard, and make the configuration changes in the
Configs tab.
Command verification
To verify that the HDFS Transparency is not available, use the following command to check the connector
state:

# Ensure all node GPFS state are active

/usr/lpp/mmfs/bin/mmgetstate -a

# Ensure no Transparency NameNode and DataNodes are running.

/usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector getstate

Cluster environment
After using the IBM Spectrum Scale Unintegrate Transparency function, the native HDFS will be in
effect. The configuration from HDFS service before the unintegrate phase will still be in effect. The IBM
Spectrum Scale configuration will not affect the native HDFS functionality. If you must revert back to the
original native HDFS configuration, go to the HDFS dashboard, and select the V1 configuration version
under the Configs tab.
For information on verifying the Transparency integration state, see “Verifying Transparency integration
state” on page 434.

Verifying Transparency integration state


To verify the HDFS Transparency integration state, click Ambari GUI > Spectrum Scale > Actions >
Check Integration Status.

After the process completes, check the output log for the state information.

Verify IBM Spectrum Scale Mpack version


This topic describes how to verify the IBM Spectrum Scale Mpack version.
To verify the IBM Spectrum Scale Mpack version, click Ambari GUI > Spectrum Scale > Actions > Check
Integration Status.

Ambari node management
This section provides information to add, delete, move, and set up a node in Ambari.

Adding a host
This topic provides information to add a new IBM Spectrum Scale node. The IBM Spectrum Scale node
can be an IBM Spectrum Scale client, HDFS Transparency NameNode or DataNode.
See Preparing the environment section to prepare the new nodes.
Note:
• Ensure that the IBM Spectrum Scale service is in integrated state before adding the node.
• On the new host being added, create a userid and groupid called anonymous with the same value as all
the other GPFS nodes. For more information, see “Create the anonymous user id” on page 352.
• If you are adding new nodes to an existing cluster, and if the nodes being added already have IBM
Spectrum Scale installed on them, ensure that the new nodes are at the same version of IBM Spectrum
Scale as the existing cluster. Do not mix GPFS Nodes with different versions of IBM Spectrum Scale
software in a GPFS cluster.
If you are adding a new node to an existing cluster with inconsistent IBM Spectrum Scale versions, the
new node will not install even if the failed installed node might still be displayed in the cluster list in
Ambari. To delete the failed node from the cluster in Ambari, see “Deleting a host” on page 442.
The new nodes can then be added to the Ambari cluster by using the Ambari web interface.
For more information, see “Adding GPFS node component” on page 441.
• If the IBM Spectrum Scale cluster is configured in admin mode central, the following steps need to be performed as prerequisites:
1. On the node to be added, execute:
a. Install all the GPFS packages.
b. Build the GPL layer using: /usr/lpp/mmfs/bin/mmbuildgpl.
2. On the admin node (usually the ambari server node) execute:
a. /usr/lpp/mmfs/bin/mmaddnode -N <FQDN-of-new-node>
b. /usr/lpp/mmfs/bin/mmchlicense server --accept -N <FQDN-of-new-node>
c. /usr/lpp/mmfs/bin/mmmount all
d. /usr/lpp/mmfs/bin/mmlsmount all
• If the host that is to be added already has HDFS Transparency installed and configured, and you want
to add this host to an existing HDFS Transparency cluster through Ambari, you need to erase the HDFS
Transparency packages and configuration files on the host that is to be added. This is to ensure that the
GPFS_Node component install step does not fail because of the stale configuration information.
Perform the following steps for cleaning up stale configuration on the host to be added:
1. Uninstall the existing HDFS Transparency package by running the following command:

# yum erase gpfs.hdfs-protocol

2. Remove all the HDFS Transparency configuration XML files under the /var/mmfs/hadoop/etc/
hadoop/ directory.
The new nodes can then be added to the Ambari cluster by using the Ambari web interface.
1. On the Ambari dashboard, click Hosts > Actions > Add New Hosts.

2. Specify the new node information, and click Registration and Confirm.
Note:
• The SSH Private Key is the key of the user on the Ambari Server.
• If the warning is because a user ID already exists and these are the user IDs that were predefined for the cluster, the warning can be ignored. Otherwise, if there are other host check failures, check the failure by clicking the link and follow the directions in the pop-up window.

3. Select the services that you want to install on the new node.
Note:
• If HDFS Transparency DataNode is needed on a host, select DataNode, NodeManager, and GPFS
Node components for that host.
• If you want only the IBM Spectrum Scale client and not the HDFS Transparency components on a
host, select only the GPFS Node component.

For more information, see “Adding GPFS node component” on page 441.
4. If several configuration groups are created, select one of them for the new node.

5. Review the information and start the deployment by clicking Deploy.

6. Install, Start and Test panel.

7. After the Install, Start and Test wizard finishes, click Complete.

8. A new node is added to the Ambari cluster.
From Hosts dashboard, the new node is added to the host list.

9. For any service with the restart required icon, go to the service dashboard, select Restart > Restart
All Affected.

Note: Ambari does not create NSDs on the new nodes, including in FPO clusters. To create IBM Spectrum Scale NSDs and add NSDs to the file system, follow the steps under the Adding disks to a file system topic in the IBM Storage Scale: Administration Guide.
10. Restart the HDFS service in Ambari.
Check the cluster information. For example:

[root@c902f05x01 ~]# /usr/lpp/mmfs/bin/mmlscluster

GPFS cluster information


========================
GPFS cluster name: bigpfs.gpfs.net

GPFS cluster id: 8678991139790049774

GPFS UID domain: bigpfs.gpfs.net
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR

Node Daemon node name IP address Admin node name Designation


--------------------------------------------------------------------------
1 c902f05x01.gpfs.net 192.0.2.11 c902f05x01.gpfs.net quorum
2 c902f05x04.gpfs.net 192.0.2.17 c902f05x04.gpfs.net quorum
3 c902f05x03.gpfs.net 192.0.2.15 c902f05x03.gpfs.net quorum
4 c902f05x02.gpfs.net 192.0.2.13 c902f05x02.gpfs.net
5 c902f05x05.gpfs.net 192.0.2.19 c902f05x05.gpfs.net

[root@c902f05x01 ~]#

[root@c902f05x01 ~]# /usr/lpp/mmfs/bin/mmgetstate -a

Node number Node name GPFS state


------------------------------------------
1 c902f05x01 active
2 c902f05x04 active
3 c902f05x03 active
4 c902f05x02 active
5 c902f05x05 active

[root@c902f05x01 ~]#

[root@c902f05x01 ~]# /usr/lpp/mmfs/bin/mmlsnsd

File system Disk name NSD servers


---------------------------------------------------------------------------
bigpfs gpfs1nsd c902f05x01.gpfs.net
bigpfs gpfs2nsd c902f05x02.gpfs.net
bigpfs gpfs3nsd c902f05x03.gpfs.net
bigpfs gpfs4nsd c902f05x04.gpfs.net
bigpfs gpfs5nsd c902f05x03.gpfs.net
bigpfs gpfs6nsd c902f05x02.gpfs.net
bigpfs gpfs7nsd c902f05x01.gpfs.net
bigpfs gpfs8nsd c902f05x04.gpfs.net
bigpfs gpfs9nsd c902f05x02.gpfs.net
bigpfs gpfs10nsd c902f05x03.gpfs.net
bigpfs gpfs11nsd c902f05x04.gpfs.net
bigpfs gpfs12nsd c902f05x01.gpfs.net
bigpfs gpfs13nsd c902f05x02.gpfs.net
bigpfs gpfs14nsd c902f05x03.gpfs.net
bigpfs gpfs15nsd c902f05x04.gpfs.net
bigpfs gpfs16nsd c902f05x01.gpfs.net

[root@c902f05x01 ~]#

[root@c902f05x05 ~]# mount | grep bigpfs


bigpfs on /bigpfs type gpfs (rw,relatime)
[root@c902f05x05 ~]#

[root@c902f05x01 ~]# /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector getstate

c902f05x01.gpfs.net: namenode running as process 17599.


c902f05x01.gpfs.net: datanode running as process 21978.
c902f05x05.gpfs.net: datanode running as process 5869.
c902f05x04.gpfs.net: datanode running as process 25002.
c902f05x03.gpfs.net: datanode running as process 10908.
c902f05x02.gpfs.net: datanode running as process 6264.
[root@c902f05x01 ~]#

Adding GPFS node component


The GPFS node component setting in the Ambari Assign Slaves and Clients panel is to set the host to
install the IBM Spectrum Scale packages.
This setting should be enabled in the following scenarios:
• While adding new IBM Spectrum Scale clients, NameNodes, or DataNodes.
• The GPFS Node component was not selected next to the host during the deployment and the host
requires the IBM Spectrum Scale packages.
For information on adding hosts through Ambari, see “Adding a host” on page 435.

1. On Ambari dashboard, select Hosts > Choose host > Components > Add (Choose GPFS Node
component).
2. Log back into Ambari.
3. From the dashboard, select HDFS > Actions > Restart All.

Deleting a host
This topic provides information on how to delete a node.
1. Stop all the components on the node to be deleted.
For example: c902f09x16.gpfs.net
2. From Ambari dashboard, click Hosts tab, and select the host that must be removed and then click Host
Actions > Stop All Components.

3. Stop the Ambari agent on the host to be deleted.

[root@c902f09x16 ~]
# ambari-agent stop
Verifying Python version compatibility...
Using python /usr/bin/python2
Found ambari-agent PID: 22182
Stopping ambari-agent
Removing PID file at
/var/run/ambari-agent/ambari-agent.pid
ambari-agent successfully stopped
[root@c902f09x16 ~]#

4. To delete the host, click Host Actions > Delete Hosts.

The system displays a Warning message.

5. Click OK.
The node is deleted from the Hosts list.

6. Restart the HDFS service.


You must plan a cluster maintenance window, and prepare for the cluster downtime when restarting
the HDFS service.
To restart the HDFS service, follow the steps listed below:
a. From the dashboard, select HDFS > Actions > Restart All.
b. After the HDFS service restarts, the deleted host is removed from HDFS Transparency. In this example, the DataNodes status shows 4/4 started and 4 live.
Note: This does not remove the Ambari packages and the IBM Spectrum Scale™ packages and NSD
disks. It does not remove the node from the GPFS cluster either. Follow the IBM Spectrum Scale
documentation on removing disks and packages from the environment.

[root@c902f10x13 ~]# /usr/lpp/mmfs/bin/mmgetstate -a

Node number Node name GPFS state


-------------------------------------------
1 c902f10x13 active
2 c902f10x14 active
3 c902f10x15 active
4 c902f09x16 down
[root@c902f10x13 ~]#
[root@c902f10x13 ~]# /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl connector getstate
c902f10x13.gpfs.net: namenode running as process 3413.
c902f10x14.gpfs.net: namenode running as process 5237.
c902f10x15.gpfs.net: datanode running as process 24456.
c902f10x13.gpfs.net: datanode running as process 15439.
c902f10x14.gpfs.net: datanode running as process 17884.
[root@c902f10x13 ~]#

Moving a NameNode
IBM Spectrum Scale HDFS Transparency NameNode is stateless, and does not maintain the FSimage-
like information. The move NameNode option is not supported by the Ambari HDFS GUI when HDFS
Transparency is integrated with the installed management pack version 4.1-X and later.
Note:
• The move NameNode script can be run only as the root user.
• The move NameNode script can be executed in a Kerberized environment when the IBM Spectrum
Scale service is integrated.
Note: When the HDFS Transparency is integrated, the Move NameNode option sets the new NameNode to
be the same value for both the HDFS NameNode and the HDFS Transparency NameNode.
For example,
Environment
HDFS Transparency = Integrated
HDFS NameNode = c902f09x02
HDFS Transparency NameNode = c902f09x02
• Execute Move NameNode:
Current NameNode (c902f09x02) will be moved to a new NameNode (c902f09x03)
Environment
HDFS Transparency = Integrated
HDFS NameNode = c902f09x03
HDFS Transparency NameNode = c902f09x03
Note: If the HDFS Transparency is unintegrated, the native HDFS NameNode must still have the same
Move NameNode host value as when it was integrated. Therefore, do not run the Move NameNode service
after the HDFS Transparency is unintegrated for the same Move NameNode host. For instructions on
how to properly use native HDFS after unintegration, see section “Revert to native HDFS after move
NameNode” on page 445.

Move NameNode in integrated state


This section provides the steps to move NameNode when the IBM Spectrum Scale service is integrated.

Instructions for HA cluster


This topic describes the steps to manually move a NameNode when HDFS Transparency is in integrate
state.
1. From the dashboard, select Actions > Stop All.
Note: For FPO clusters, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.
2. On the Ambari server host, run the following command:

python /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-MPack-
<version>/extensions/SpectrumScaleExtension/<version>/
services/GPFS/package/files/COMMON/MoveNameNodeTransparency.py

Follow the command prompts and type the required input.

$ python /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-MPack-2.7.0.0/
extensions/
SpectrumScaleExtension/2.7.0.0/services/GPFS/package/files/COMMON/MoveNameNodeTransparency.py
Enter the Ambari Server User:(Default User admin ):
Enter the Password for Ambari Server.
Password:
Retype password:

SSL Enabled (True/False) (Default False):
Enter the Ambari Server Port.(Default 8080)
Enter the Fully Qualified HostName of the Source NameNode which has to be Removed:-
c902f09x02.gpfs.net
Enter the Fully Qualified HostName of the Destination NameNode has to be Added:
c902f09x03.gpfs.net

Note:
• SSL Enabled means Ambari HTTPS.
• The source NameNode must be one of the NameNodes when HA is enabled, and the destination
must be one of HDFS Transparency node.
3. From the dashboard, select Actions > Start All.
4. The process of moving the NameNode is now completed. Verify that the Active NameNode and the
Standby NameNode are correct.

Instructions for a non-HA cluster


This topic provides the steps to manually move the NameNode when HDFS Transparency is in integrate
state.
1. On the dashboard, click Actions > Stop All.
Note: For FPO clusters, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.


2. On the Ambari server host, run the following command:

python /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-MPack-
<version>/extensions/SpectrumScaleExtension/<version>/services/GPFS/package/files/COMMON/
MoveNameNodeTransparency.py

Follow the command prompts and type the required input.

$ python /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-MPack-2.7.0.0/
extensions/SpectrumScaleExtension/2.7.0.0/services/GPFS/package/files/COMMON/
MoveNameNodeTransparency.py
Enter the Ambari Server User:(Default User admin ):
Enter the Password for Ambari Server.
Password:
Retype password:
SSL Enabled (True/False) (Default False):
Enter the Ambari Server Port.(Default 8080)
Enter the Fully Qualified HostName of the Source
NameNode which has to be Removed:c902f09x02.gpfs.net
Enter the Fully Qualified HostName of the Destination
NameNode has to be Added:c902f09x03.gpfs.net

Note:
• SSL Enabled means Ambari HTTPS.
• The destination node must be one of the HDFS Transparency node.
3. From the dashboard, select Actions > Start All.
4. Moving the NameNode process is now completed. Verify that the Active NameNode and the Standby
NameNode are correct.

Revert to native HDFS after move NameNode


The move NameNode operation is executed when IBM Spectrum Scale is integrated using HDFS Transparency. However, you can later choose to use native HDFS instead by unintegrating HDFS Transparency. In that case, you must follow these steps to ensure that native HDFS has the correct NameNode setting.
1. Follow steps 1-3 of the "Unintegrating HDFS Transparency" on page 433 section to revert to native HDFS mode.
2. Do not start all the services after unintegrating.

3. Ensure ambari-server restart was run on the Ambari server.
4. If you have a NameNode HA enabled environment, then follow the HA steps listed below. Else, follow
the non-HA environment steps.

If HA is enabled, perform the following steps, for example:


NameNode being moved: c902f09x02
Execute the Move NameNode service during the HDFS Transparency Integration to the new NameNode
c902f09x04.
NameNode not moved: c902f09x03
1. Start the Zookeeper Server from the Ambari GUI.
2. Start the NameNode that was not moved (c902f09x03) from the Hosts dashboard, clicking the
NameNode that was not moved > Summary tab > Components > NameNode / HDFS (Active or
Standby) > Start. This will start only the NameNode. Do not start any other services or hosts.
3. Format the ZKFC on the NameNode that was not moved (c902f09x03) by running the following
command:

sudo su hdfs -l -c 'hdfs zkfc -formatZK'

4. On the new NameNode (c902f09x04), run the following command:

sudo su hdfs -l -c 'hdfs namenode -bootstrapStandby'

5. Start All services.


From the dashboard, select Actions > Start All. The Hadoop cluster will now use native HDFS.

If HA is not enabled, perform the following steps, for example:


NameNode being moved: c902f09x02
Execute the Move Namenode service during the HDFS Transparency integration to a new NameNode
(c902f09x03).
1. Copy the contents of /hadoop/hdfs/namenode from the NameNode being moved (c902f09x02)
to /hadoop/hdfs/namenode on the new NameNode (c902f09x03).
2. On the new NameNode (c902f09x03), run the following commands:
a. chown -R hdfs:hadoop /hadoop/hdfs/namenode
b. mkdir -p /var/lib/hdfs/namenode/formatted
3. Start All services.
From the dashboard, select Actions > Start All. The Hadoop cluster will now use native HDFS.

Moving the Ambari server


This section describes how to move the Ambari server onto a new host.
Moving the Ambari server is supported only if the current Ambari server host and the IBM Spectrum Scale
service are active and functional.
The IBM Spectrum Scale master component is tightly integrated with the Ambari server; therefore, the Ambari server move cannot be run while IBM Spectrum Scale is in the integrated state.
Plan a cluster maintenance window and prepare for cluster downtime.
Note:
• The new host has to be a GPFS node.

• Requires the SpectrumScale_UpgradeIntegrationPackage script which is packaged with the
Mpack package. This script is used to remove the Scale Mpack and service, and reinstall it to a
different location. The software stack is not upgraded. Ignore the STARTING WITH SPECTRUM SCALE
EXPRESS UPGRADE POST STEPS and STARTING WITH SPECTRUM SCALE EXPRESS UPGRADE PRE
STEPS outputs from the script.
1. Log in to Ambari.
2. Stop all the services by clicking Ambari > Actions > Stop All.
3. After all the services have stopped, unintegrate the transparency.
Follow the steps in Unintegrating Transparency, and ensure that the ambari-server restart is
run.
Note: Do not start the services.
4. Check if the IBM Spectrum Scale has stopped by running /usr/lpp/mmfs/bin/mmgetstate -a.
If the IBM Spectrum Scale service has not stopped, stop it by clicking Ambari > Spectrum Scale >
Actions > Stop.
5. On the Ambari server node as root, run the SpectrumScale_UpgradeIntegrationPackage
script with the --preEU option.
The --preEU option saves the existing IBM Spectrum Scale service information into JSON files in the
local directory where the script was run. It also removes the IBM Spectrum Scale service from the
Ambari cluster. This does not affect the IBM Spectrum Scale file system.
Before proceeding, review the following questions and have the information ready for your
environment. If Kerberos is enabled, more inputs are required.

$ cd /root/GPFS_Ambari
$ ./SpectrumScale_UpgradeIntegrationPackage --preEU
Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)? (Default Y):
***STARTING WITH PRE EXPRESS UPGRADE STEPS***
************************************************************
Enter the Ambari server username:
Enter the password for the Ambari server.
SSL Enabled (True/False) (Default False):
Enter the Ambari server Port. (Default 8080):
...
# Note: If Kerberos is enabled, then the KDC principal and password information are
required.
Kerberos is Enabled. Proceeding with Configuration
Enter kdc principal:
Enter kdc password:

6. Run the Mpack uninstaller script to remove the existing Mpack.


$./SpectrumScaleMPackUninstaller.py
7. Move the Ambari server to the new host. For more information, see Moving the Ambari Server.
8. Move the directory that contains the Mpack and the JSON configurations files to the new host where
the SpectrumScale_UpgradeIntegrationPackage --preEU setp was run.
9. Modify the existing Ambari Server name with the new host name in the gpfs-master-node.txt
file.
10. Modify the key value gpfs.webui.address with the new host name in the gpfs-
advance.json file. Replace the gpfs.webui.address:https://<existing ambari server>
with gpfs.webui.address:https://<new host name>.
11. On the Ambari server node as root, run the SpectrumScale_UpgradeIntegrationPackage
script with the --postEU option in the directory where the --preEU step was run and where the
JSON configurations were stored.
Before proceeding, review the following questions and have the information ready for your
environment. If Kerberos is enabled, more inputs are required.

$ ./SpectrumScale_UpgradeIntegrationPackage --postEU
Are you sure you want to upgrade the GPFS Ambari integration package (Y/N)? (Default Y):
*************************************************************
***STARTING WITH SPECTRUM SCALE EXPRESS UPGRADE POST STEPS***

*************************************************************
Starting Post Express Upgrade Steps. Enter Credentials
Enter the Ambari server User:(Default admin ):
Enter the password for the Ambari server.
Password:
Retype password:
SSL Enabled (True/False) (Default False):
Enter the Ambari server Port. (Default 8080):
....
# Accept License
Do you agree to the above license terms? [yes or no]
yes
Installing...
Enter Ambari Server Port Number. If it is not entered, the installer will take default port
8080 :
INFO: Taking default port 8080 as Ambari Server Port Number.
Enter Ambari Server IP Address :
192.0.2.17
Enter Ambari Server Username, default=admin :
INFO: Taking default username "admin" as Ambari Server Username.
Enter Ambari Server Password :
...
Enter kdc principal:
Enter kdc password:
...

*********************************************************************************************************
*****************
Upgrade of the Spectrum Scale Service completed successfully.
From the Ambari GUI, check the IBM Spectrum Scale installation progress through the background
operations panel.
*********************************************************************************************************
*****************
*********************************************************************************************************
******************
IMPORTANT: You need to ensure that the HDFS Transparency package, gpfs.hdfs-protocol-3.0.X, is updated
in the Spectrum
Scale repository. Then follow the "Upgrade Transparency" service action in the Spectrum Scale service UI
panel to propagate
the package to all the GPFS Nodes.
After that is completed, invoke the "Start All" services in Ambari.
*********************************************************************************************************
******************

12. Start all services by clicking Ambari > Actions > Start All.
Restart all components by using the restart icon.
Note:
• If the IBM Spectrum Scale service is restarted by using the restart icon, HDFS service also needs to
be restarted.
• The NameNode last checkpoint alert can be ignored and can be disabled.
• If the HBase master fails to start with a FileAlreadyExistsException error, restart HDFS and then the HBase master.

Ambari maintenance mode support for IBM Spectrum Scale service


When managing and monitoring a cluster, you might need to perform hardware, firmware, or OS maintenance on a host, or test a service configuration change. The Ambari maintenance mode can help to suppress alerts and omit bulk operations for specific services while hardware or software maintenance is being performed.
There are different scopes to the Ambari maintenance mode:
• Service level - All the components of an Ambari service are excluded from a cluster-wide Start All/Stop All operation.
• Component level - A specific component of the service can be put in the maintenance mode. In such
a case, that specific component is excluded from a service-level start/stop operation or a cluster-wide
Start All/Stop All operation.
• Host level - When a particular Ambari host is put in the maintenance mode, all the components located
on that host will be put in maintenance mode. In such a case, these components are excluded from a
service-level start/stop operation or a cluster-wide Start All/Stop All operations.

For more information on Ambari Maintenance mode, see Cloudera Managing and Monitoring a Cluster.
Prior to Mpack 2.7.0.9, service-level maintenance mode was honored for the IBM Spectrum Scale service.
However, component-level and host-level maintenance modes were not supported for the IBM Spectrum
Scale service components. From Mpack 2.7.0.9, this support for component and host-level maintenance
modes is added but is subject to the following constraints:
1. The IBM Spectrum Scale service represents the IBM Spectrum Scale file system, and it contains the GPFS_MASTER and GPFS_NODE components. Every node that contains a NameNode or DataNode also has a GPFS_NODE component, which is the file system component on that node. Therefore, if the GPFS_NODE is stopped or set to maintenance mode on a node, the NameNode or DataNode on that node must be stopped or set to maintenance mode as well, because the NameNode or DataNode component cannot access the file system while the GPFS_NODE component is stopped.
2. The GPFS_MASTER and GPFS_NODE in the Ambari server cannot be in the maintenance mode. This
is because the GPFS_MASTER is responsible for executing the IBM Spectrum Scale commands in
the cluster and it requires the GPFS_NODE component to be up and running in order to run those
commands.
3. In the case of a START ALL operation, if fewer than three IBM Spectrum Scale quorum nodes are in non-maintenance mode, the Ambari maintenance mode is disabled at the IBM Spectrum Scale component and host levels and START ALL starts all the IBM Spectrum Scale nodes. This is because IBM Spectrum Scale cannot function properly if enough quorum nodes are not available.
4. For FPO and local clusters, if gpfs.storage.type is set to local, special processing is required for file system mounting and unmounting operations that the Ambari maintenance mode cannot address. Therefore, the component-level and host-level Ambari maintenance modes are not supported for FPO and local clusters.
5. With Ambari maintenance mode, you must use the Ambari GUI and not the IBM Spectrum Scale CLI
commands to stop and start the HDFS Transparency nodes.

Maintenance procedure for IBM Spectrum Scale node using Ambari


maintenance mode
If you need to do maintenance (servicing) of the IBM Spectrum Scale nodes, you can set the Ambari
maintenance mode for the GPFS_NODE and the corresponding NameNode and DataNode components
residing on the same node.
Note: The Ambari maintenance mode is not supported for FPO (local) file systems.

General procedure for maintenance


The following procedure is for GPFS_NODE that is not colocated with the Ambari server:
1. In an HA environment, if the active NameNode 1 requires to be serviced, ensure that the failover
of NameNode 1 has completed and that NameNode 1 is now on standby. NameNode 2 should now
become the active NameNode.
2. To service only a subset of the IBM Spectrum Scale nodes, ensure that there are enough IBM
Spectrum Scale quorum nodes for IBM Spectrum Scale to stay healthy.
Run the following command and check the Designation field on the quorum node:

# mmlscluster

To avoid losing quorum, if not enough quorum nodes would remain available, move the quorum designation to other IBM Spectrum Scale nodes that are not being serviced (a sketch of this is shown after the procedure).
For information on checking and setting the quorum nodes, see the Which nodes in my cluster are quorum nodes? topic in IBM Storage Scale: Problem Determination Guide.
3. In Ambari, to set the component-level maintenance mode for each node, perform the following:

• Stop the GPFS_NODE and the NameNode or DataNode components on the node that is to be
serviced.
• Set the GPFS_NODE and the NameNode and DataNode components to the Ambari maintenance
mode on the node to be serviced.
4. Service the IBM Spectrum Scale nodes that were set to the Ambari maintenance mode.
5. From Ambari GUI, disable the Ambari maintenance mode for the GPFS_NODE and the NameNode or
DataNode components on the node that was serviced. Perform this for each node.
6. Start the GPFS_NODE and the NameNode or DataNode components on each node or perform an
Ambari START ALL.
7. If the quorum designation was moved in step 2, you can move it back to the original designated IBM
Spectrum Scale node.
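The following is a minimal sketch of moving a quorum designation as described in step 2, assuming node1.gpfs.net is about to be serviced and node5.gpfs.net is a healthy node that can take over the quorum role; the node names are placeholders, and you should confirm the exact procedure in the IBM Storage Scale documentation referenced above:

# Give the quorum role to a node that is not being serviced
/usr/lpp/mmfs/bin/mmchnode --quorum -N node5.gpfs.net

# Remove the quorum role from the node that is about to be serviced
/usr/lpp/mmfs/bin/mmchnode --nonquorum -N node1.gpfs.net

# Verify that the Designation column reflects the change
/usr/lpp/mmfs/bin/mmlscluster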

Maintenance procedure for the Ambari server node with colocated GPFS_NODE and
GPFS_MASTER
The GPFS_MASTER and GPFS_NODE components on the Ambari server cannot honor the Ambari maintenance mode. The GPFS_MASTER is also responsible for executing the IBM Spectrum Scale commands in the cluster. Therefore, this node must be serviced on its own.
1. In order to avoid quorum loss, if the GPFS_MASTER and GPFS_NODE on the Ambari server have
quorum designation, move the quorum designation to other IBM Spectrum Scale nodes that are not
being serviced.
For information on checking and setting the quorum nodes, see the Which nodes in my cluster are
quorum nodes? topic in IBM Storage Scale: Problem Determination Guide.
If there is an Active NameNode present on the GPFS_MASTER node, initiate a failover and ensure that
the current NameNode is on standby.
2. To set the component-level maintenance mode on the Ambari server, perform the following:
• Stop the GPFS_MASTER, GPFS_NODE and NameNode or DataNode components.
• Set the GPFS_MASTER, GPFS_NODE and NameNode or DataNode components to the maintenance
mode.
Note: After you stop the IBM Spectrum Scale service components, you cannot manage the IBM
Spectrum Scale service from the Ambari server.
3. On the Ambari server, service the IBM Spectrum Scale node.
4. From the Ambari GUI, disable the maintenance mode for GPFS_MASTER, GPFS_NODE and NameNode
or DataNode components on the Ambari server that was serviced.
5. On the Ambari server, start the GPFS_MASTER, GPFS_NODE and the NameNode or DataNode
components.
6. If the quorum designation was moved in step 1, you can move it back to the original designated
Ambari server node.

Restricting root access


For secure environments that require restricted access and limit the services that run as the root user, Ambari must be configured to operate without direct root access.
First follow the “Planning” on page 349 section, and ensure that the kernel* packages are installed
beforehand as root.
Perform the following steps to set up Ambari and IBM Spectrum Scale for a non-root user:
1. Create a user ID that can perform passwordless ssh between all the nodes in the cluster. This non-root
user ID is required to configure the Ambari server and agents when setting up the Ambari cluster in
step 3.
2. Verify that the root ID and the Ambari server non-root ID can perform passwordless SSH.

Bi-directional passwordless SSH must work for the non-root ID from the GPFS Master node (Ambari
server) to all the GPFS nodes and to itself (Ambari server node).
Root ID must be able to perform passwordless SSH from the GPFS Master node (Ambari server) to all
the GPFS nodes and to itself (Ambari server node), uni-directional only.
The BI example uses am_agent as the non-root id for the Ambari server, the Ambari agents, and the
IBM Spectrum Scale cluster user.
The HDP example uses ambari-server as the non-root id for the Ambari server, and am_agent as the
non-root id for the Ambari agents and the IBM Spectrum Scale cluster.
The user ID and group ID of the non-root ID must be the same.
For example,
As root: ssh am_agent@<ambari-agent-host> must work without a password.
As am_agent: ssh am_agent@<ambari-agent-host> must work without a password.
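
A minimal sketch of one way to set up the passwordless SSH that this step describes, assuming am_agent as the non-root ID and placeholder host names; skip the key generation if a key already exists for that user:

As am_agent on the Ambari server (GPFS Master), generate a key and copy it to the am_agent account on every node, including the Ambari server itself:

$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ for host in ambari-server-host gpfsnode01 gpfsnode02; do ssh-copy-id am_agent@$host; done

As root on the Ambari server, do the same so that root can reach am_agent on every node without a password:

# ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# for host in ambari-server-host gpfsnode01 gpfsnode02; do ssh-copy-id am_agent@$host; done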
3. Set up an Ambari cluster as the non-root user.
For BI, follow the steps in the IBM BigInsights Installation documentation under Configuring Ambari
for non-root access.
For HDP, follow the steps in the Hortonworks Installation documentation under Configuring Ambari for
non-root.
Note: Once you are at the Host Registration wizard, ensure the following:
• The SSH User Account specifies the non-root user ID.
• The manual host registration radio button in the Ambari UI is set. This will ensure that the Ambari
agent processes will run as the non-root user, and execute the IBM Spectrum Scale service
integration code.

4. Configure IBM Spectrum Scale without remote root by following the steps in the Configuring sudo topic
in the IBM Storage Scale: Administration Guide.
The non-root user/group id used in the Configuring sudo section of the IBM Spectrum Scale document
is the Ambari agent non-root user/group id.
5. Additionally, on each host, modify the /etc/sudoers file to include the following changes:
• Add the list of allowed commands for the non-root user:
/usr/bin/cd /usr/lpp/mmfs/src, /usr/bin/curl, /usr/bin/make Autoconfig,
/usr/bin/make World, /usr/bin/make InstallImages, /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl,
/usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh, /usr/lpp/mmfs/hadoop/sbin/gpfs_hdfs_pkg.sh,
/usr/sbin/parted, /usr/sbin/partprobe, /sbin/mkfs.ext4
BI sudoers
The Ambari Server user and group is am_agent:am_agent.
The Ambari Agent user and group is am_agent:am_agent.
The IBM Spectrum Scale cluster user and group is am_agent:am_agent.
Example of /etc/sudoers file added entries in BI environment:

# Ambari IOP Customizable Users


am_agent ALL=(ALL) NOPASSWD:SETENV: /bin/su hdfs *, /bin/su ambari-qa *, /bin/su zookeeper *,
/bin/su knox *, /bin/su ams *, /bin/su flume *, /bin/su hbase *, /bin/su spark *,
/bin/su hive *, /bin/su hcat *, /bin/su kafka *, /bin/su mapred *, /bin/su oozie *,
/bin/su sqoop *, /bin/su storm *, /bin/su yarn *, /bin/su solr *, /bin/su titan *,
/bin/su ranger *, /bin/su kms *

# Ambari value-adds Customizable Users


am_agent ALL=(ALL) NOPASSWD:SETENV: /bin/su - bigsheets *, /bin/su uiuser *,
/bin/su tauser *, /bin/su - bigr *

#Ambari Non-Customizable Users


am_agent ALL=(ALL) NOPASSWD:SETENV: /bin/su mysql *

# Ambari IOP Commands


am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/bin/yum, /usr/bin/zypper, /usr/bin/apt-get,
/bin/mkdir, /usr/bin/test, /bin/ln, /bin/chown, /bin/chmod, /bin/chgrp, /usr/sbin/groupadd,
/usr/sbin/groupmod, /usr/sbin/useradd, /usr/sbin/usermod, /bin/cp, /usr/sbin/setenforce,
/usr/bin/stat, /bin/mv, /bin/sed, /bin/rm, /bin/kill, /bin/readlink, /usr/bin/pgrep, /bin/cat,
/usr/bin/unzip, /bin/tar, /usr/bin/tee, /bin/touch, /usr/bin/iop-select, /usr/bin/conf-select,
/usr/iop/current/hadoop-client/sbin/hadoop-daemon.sh, /usr/lib/hadoop/bin/hadoop-daemon.sh,
/usr/lib/hadoop/sbin/hadoop-daemon.sh, /sbin/chkconfig gmond off, /sbin/chkconfig gmetad off,
/etc/init.d/httpd *, /sbin/service iop-gmetad start, /sbin/service iop-gmond start,
/usr/sbin/gmond, /usr/sbin/update-rc.d ganglia-monitor *, /usr/sbin/update-rc.d gmetad *,
/etc/init.d/apache2 *, /usr/sbin/service iop-gmond *, /usr/sbin/service iop-gmetad *,
/sbin/service mysqld *, /sbin/service mysql *,
/usr/bin/python2.6 /var/lib/ambari-agent/data/tmp/validateKnoxStatus.py *,
/usr/iop/current/knox-server/bin/knoxcli.sh *, /usr/bin/dpkg *, /bin/rpm *, /usr/sbin/hst *,
/usr/sbin/service mysql *, /usr/sbin/service mariadb *, /usr/bin/ambari-python-wrap,
/usr/bin/cd /usr/lpp/mmfs/src, /usr/bin/curl, /usr/bin/make Autoconfig, /usr/bin/make World,
/usr/bin/make InstallImages, /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl,
/usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh, /usr/lpp/mmfs/hadoop/bin/gpfs,
/usr/lpp/mmfs/hadoop/sbin/gpfs_hdfs_pkg.sh, /usr/sbin/parted, /usr/sbin/partprobe,
/sbin/mkfs.ext4

# Ambari value-adds Commands


am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/bin/updatedb *, /usr/bin/sh *, /usr/bin/scp *,
/usr/bin/pkill *, /bin/unlink *, /usr/bin/mysqld_safe, /usr/bin/mysql_install_db, /usr/bin/R,
/usr/bin/Rscript, /bin/bash, /usr/bin/kinit, /usr/bin/hadoop, /usr/bin/mysqladmin,
/usr/sbin/userdel, /usr/sbin/groupdel, /usr/sbin/ambari-server, /usr/bin/klist
Cmnd_Alias BIGSQL_SERVICE_AGNT=/var/lib/ambari-agent/cache/stacks/BigInsights/*/services/BIGSQL/package/scripts/*
Cmnd_Alias BIGSQL_SERVICE_SRVR=/var/lib/ambari-server/resources/stacks/BigInsights/*/services/BIGSQL/package/scripts/*
Cmnd_Alias BIGSQL_DIST_EXEC=/usr/ibmpacks/current/bigsql/bigsql/bin/*,
/usr/ibmpacks/current/bigsql/bigsql/libexec/*,
/usr/ibmpacks/current/bigsql/bigsql/install/*, /usr/ibmpacks/current/IBM-DSM/ibm-datasrvrmgr/bin/*,
/usr/ibmpacks/bin/*/*
Cmnd_Alias BIGSQL_OS_CALLS=/bin/su, /usr/bin/getent, /usr/bin/id, /usr/bin/ssh, /bin/echo,
/usr/bin/scp, /bin/find, /usr/bin/du, /sbin/mkhomedir_helper, /bin/curl

am_agent ALL=(ALL) NOPASSWD:SETENV: /bin/*, /usr/bin/*, /usr/sbin/*, /usr/bin/R, /usr/bin/Rscript,
BIGSQL_SERVICE_AGNT, BIGSQL_SERVICE_SRVR, BIGSQL_DIST_EXEC, BIGSQL_OS_CALLS

Defaults exempt_group = am_agent


Defaults !env_reset,env_delete-=PATH
Defaults: am_agent !requiretty

#GPFS cluster non-root added
# Preserve GPFS environment variables:
Defaults env_keep += "MMMODE environmentType GPFS_rshPath GPFS_rcpPath mmScriptTrace GPFSCMDPORTRANGE GPFS_CIM_MSG_FORMAT"

# Allow members of the gpfs group to run all commands but only selected commands without a password:
%am_agent ALL=(ALL) PASSWD: ALL, NOPASSWD: /usr/lpp/mmfs/bin/mmremote, /usr/bin/scp,
/bin/echo, /usr/lpp/mmfs/bin/mmsdrrestore

# Disable requiretty for group gpfs:


Defaults:%am_agent !requiretty

HDP sudoers
The Ambari Server user and group is ambari-server:hadoop.
The Ambari Agent user and group is am_agent:am_agent.
The IBM Spectrum Scale cluster user and group is am_agent:am_agent.
Example of /etc/sudoers file added entries in HDP environment:

# Ambari Commands
ambari-server ALL=(ALL) NOPASSWD:SETENV: /bin/mkdir -p /etc/security/keytabs, /bin/chmod *
/etc/security/keytabs/*.keytab, /bin/chown * /etc/security/keytabs/*.keytab, /bin/chgrp *
/etc/security/keytabs/*.keytab, /bin/rm -f /etc/security/keytabs/*.keytab, /bin/cp -p -f
/var/lib/ambari-server/data/tmp/* /etc/security/keytabs/*.keytab

# Sudo Defaults - Ambari Server (In order for the agent to run its commands non-interactively, some defaults need to be overridden)
Defaults exempt_group = ambari-server
Defaults !env_reset,env_delete-=PATH
Defaults: ambari-server !requiretty

# Ambari Agent non root configuration


# Ambari Customizable Users
am_agent ALL=(ALL) NOPASSWD:SETENV: /bin/su hdfs *,/bin/su ambari-qa *,/bin/su ranger *,
/bin/su zookeeper *,/bin/su knox *,/bin/su falcon *,/bin/su ams *,/bin/su flume *,/bin/su hbase *,
/bin/su spark *,/bin/su accumulo *,/bin/su hive *,/bin/su hcat *,/bin/su kafka *,/bin/su mapred *,
/bin/su oozie *,/bin/su sqoop *,/bin/su storm *,/bin/su tez *,/bin/su atlas *,/bin/su yarn *,
/bin/su kms *,/bin/su activity_analyzer *,/bin/su livy *,/bin/su zeppelin *,/bin/su infra-solr *,
/bin/su logsearch *

# Ambari: Core System Commands

am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/bin/yum, /usr/bin/zypper, /usr/bin/apt-get, /bin/mkdir,
/usr/bin/test, /bin/ln, /bin/ls, /bin/chown, /bin/chmod, /bin/chgrp, /bin/cp, /usr/sbin/setenforce,
/usr/bin/test, /usr/bin/stat, /bin/mv, /bin/sed, /bin/rm, /bin/kill, /bin/readlink, /usr/bin/pgrep,
/bin/cat, /usr/bin/unzip, /bin/tar, /usr/bin/tee, /bin/touch, /usr/bin/mysql, /sbin/service mysqld *,
/usr/bin/dpkg *, /bin/rpm *, /usr/sbin/hst *, /sbin/service rpcbind *, /sbin/service portmap *,
/usr/bin/cd, /usr/lpp/mmfs/src, /usr/bin/curl, /usr/bin/make Autoconfig, /usr/bin/make World,
/usr/bin/make InstallImages, /usr/lpp/mmfs/hadoop/sbin/mmhadoopctl,
/usr/lpp/mmfs/hadoop/sbin/hadoop-daemon.sh, /usr/lpp/mmfs/hadoop/bin/gpfs,
/usr/lpp/mmfs/hadoop/sbin/gpfs_hdfs_pkg.sh, /usr/sbin/parted, /usr/sbin/partprobe, /sbin/mkfs.ext4

# Ambari: Hadoop and Configuration Commands


am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/bin/hdp-select, /usr/bin/conf-select,
/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh, /usr/lib/hadoop/bin/hadoop-daemon.sh,
/usr/lib/hadoop/sbin/hadoop-daemon.sh, /usr/bin/ambari-python-wrap *

# Ambari: System User and Group Commands


am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/sbin/groupadd, /usr/sbin/groupmod,
/usr/sbin/useradd, /usr/sbin/usermod

# Ambari: Knox Commands


am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/bin/python2.6 /var/lib/ambari-agent/data/tmp/validateKnoxStatus.py *,
/usr/hdp/current/knox-server/bin/knoxcli.sh

# Ambari: Ranger Commands
am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/hdp/*/ranger-usersync/setup.sh, /usr/bin/ranger-usersync-stop,
/usr/bin/ranger-usersync-start, /usr/hdp/*/ranger-admin/setup.sh *,
/usr/hdp/*/ranger-knox-plugin/disable-knox-plugin.sh *,
/usr/hdp/*/ranger-storm-plugin/disable-storm-plugin.sh *,
/usr/hdp/*/ranger-hbase-plugin/disable-hbase-plugin.sh *,
/usr/hdp/*/ranger-hdfs-plugin/disable-hdfs-plugin.sh *,
/usr/hdp/current/ranger-admin/ranger_credential_helper.py,
/usr/hdp/current/ranger-kms/ranger_credential_helper.py,
/usr/hdp/*/ranger-*/ranger_credential_helper.py

# Ambari Infra and LogSearch Commands


am_agent ALL=(ALL) NOPASSWD:SETENV: /usr/lib/ambari-infra-solr/bin/solr *,
/usr/lib/ambari-logsearch-logfeeder/run.sh *, /usr/sbin/ambari-metrics-grafana *,
/usr/lib/ambari-infra-solr-client/solrCloudCli.sh *

# Sudo Defaults - Ambari Agent (In order for the agent to run its commands non-interactively, some defaults need to be overridden)
Defaults exempt_group = am_agent
Defaults !env_reset,env_delete-=PATH
Defaults: am_agent !requiretty

#GPFS cluster non-root added


# Preserve GPFS environment variables:
Defaults env_keep += "MMMODE environmentType GPFS_rshPath GPFS_rcpPath mmScriptTrace GPFSCMDPORTRANGE GPFS_CIM_MSG_FORMAT"

# Allow members of the gpfs group to run all commands but only selected commands without a password:
%am_agent ALL=(ALL) PASSWD: ALL, NOPASSWD: /usr/lpp/mmfs/bin/mmremote, /usr/bin/scp,
/bin/echo, /usr/lpp/mmfs/bin/mmsdrrestore

# Disable requiretty for group gpfs:


Defaults:%am_agent !requiretty

6. Perform the steps from “Deploy the IBM Spectrum Scale service” on page 367 to add the module as
the root user.
Note: You must restart Ambari as root. Exceptions occur for a non-root user. However, this issue is not
seen on Ambari 2.5.0.3 when the Ambari server is restarted as a non-root user.
7. Perform the steps from “Deploy the IBM Spectrum Scale service” on page 367.
This requires restarting Ambari as root. Exceptions occur for a non-root user. However, this issue is not
seen on Ambari 2.5.0.3 when the Ambari server is restarted as a non-root user.
Note:
• There might be an issue with HBase stopping in a non-root environment. For more information, see
the “Troubleshooting Ambari” on page 472 section.
• In non-root Ambari environment, the Hive service check might fail. For resolution, see the
“Troubleshooting Ambari” on page 472 section.

IBM Spectrum Scale management GUI


The IBM Spectrum Scale management GUI must be manually installed and accessed.
Installation instructions for IBM Spectrum Scale management GUI are available in the Manually installing
IBM Spectrum Scale management GUI topic in the IBM Storage Scale: Concepts, Planning, and Installation
Guide.
The IBM Spectrum Scale management GUI can be accessed as a quick link URL within the IBM Spectrum
Scale Ambari service.
If you are running IBM Spectrum Scale 4.2.0 or later, the rpms required to install the GUI are included
in Standard and Advanced Editions for Linux on x86 and Power (Big Endian or Little Endian). The GUI
requires RHEL 7.
Procedure
1. Deploy IBM Spectrum Scale Management GUI.
2. Set the gpfs.webui.address field in the IBM Spectrum Scale Service configuration advanced panel.

For example, https://<ambari_server_fully_qualified_hostname>/gui
From the Ambari GUI > IBM Spectrum Scale service, select Configs > Advanced > Advanced gpfs-advance.
Add the URL to the gpfs.webui.address field.

3. Restart the IBM Spectrum Scale service.


4. Sync the configuration.
Click Ambari GUI > IBM Spectrum Scale service > Actions > Set Management GUI URL.
5. Restart Ambari server.

IBM Spectrum Scale versus Native HDFS


When the IBM Spectrum Scale service is added, native HDFS is no longer used. Hadoop applications
interact with HDFS Transparency in the same way as they interact with native HDFS.
The application can access HDFS by using Hadoop file system APIs and Distributed File System APIs. The
application can have its own cluster that is larger than the HDFS protocol cluster. However, all the nodes
within the application cluster must be able to connect to all the nodes in the HDFS protocol cluster by
RPC.
Note: The Secondary NameNode and Journal nodes in native HDFS are not needed for HDFS
Transparency because of the following reasons:
• The HDFS Transparency NameNode is stateless.
• Metadata are distributed.
• The NameNode does not maintain the FSImage-like or EditLog information.
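
As an illustration of this transparency, the standard Hadoop file system commands work unchanged against the HDFS Transparency NameNode; the path /tmp/transparency-test used here is only an example:

$ hdfs dfs -mkdir -p /tmp/transparency-test
$ hdfs dfs -put /etc/hosts /tmp/transparency-test/
$ hdfs dfs -ls /tmp/transparency-test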

Functional limitations
This topic lists the functional limitations.
General
• The maximum number of Extended Attributes (EA) is limited by IBM Spectrum Scale. The total size of
the EA key and value must be less than a metadata block size in IBM Spectrum Scale.
• The EA operation on snapshots is not supported.
• Raw namespace is not implemented because it is not used internally.
• If gpfs.replica.enforced is configured as gpfs, the Hadoop shell command hadoop dfs
-setrep does not take effect. Also, hadoop dfs -setrep -w stops functioning and does not exit.
• HDFS Transparency NameNode does not provide safemode because it is stateless.

• HDFS Transparency NameNode does not need a secondary NameNode like native HDFS because it is
stateless.
• The maximum replica count for IBM Spectrum Scale is 3 in the code. However, the maximum replica count
for your file system might be less than 3. You can check this by running /usr/lpp/mmfs/bin/mmlsfs
<fs-name> -R.
• IBM Spectrum Scale has no fixed ACL entry number limit. The maximum entry number is limited by Int32.
• SendPacketDownStreamAvgInfo and SlowPeersReport from http://<namenode/datanode:port>/jmx
are not supported.
• On ESS, the GPFS file data replication factor must be set to 1, and dfs.replica should be set to 1.
• The supported interface of the hdfs command is hdfs dfs. Other hdfs subcommands are considered
native HDFS specific and are not used by HDFS Transparency.
These are some examples of what is not supported:
– fsck
– dfsadmin (including -safemode)
– Native HDFS caching (cacheadmin)
– NameNode format, which is not needed to run (namenode -format)
• Distcp over snapshot is not supported.
• For HDFS Transparency version 3.0.x, the environment variables above can be exported, except for
HADOOP_COMMON_LIB_NATIVE_DIR.
This is because HDFS Transparency uses its own native .so library.
For HDFS Transparency version 3.0.x:
– If you did not export HADOOP_CONF_DIR, then HDFS Transparency will read all the configuration files
under /var/mmfs/hadoop/etc/hadoop such as the gpfs-site.xml file and the hadoop-env.sh
file.
– If you export HADOOP_CONF_DIR, then HDFS Transparency will read all the configuration files under
$HADOOP_CONF_DIR. Since gpfs-site.xml is required for HDFS Transparency, it will only read the
gpfs-site.xml file from the /var/mmfs/hadoop/etc/hadoop directory.
For HDP
• The “+” is not supported when using hftp://namenode:50070.

Functional differences
This topic lists the functional differences.
• The number of ACL entries is limited to 32 in native HDFS but not in IBM Spectrum Scale.
• File name length is limited in native HDFS, while IBM Spectrum Scale allows a maximum of 255 UTF-8 characters.
• The hdfs fsck is not supported in HDFS Transparency. Instead, use the IBM Spectrum Scale mmfsck
command. If your file system is mounted, run /usr/lpp/mmfs/bin/mmfsck -o -y. If your file
system is not mounted, run /usr/lpp/mmfs/bin/mmfsck -y.

Configuration that differs from native HDFS in IBM Spectrum Scale


This topic lists the differences between native HDFS and IBM Spectrum Scale.

Table 37. Native HDFS and IBM Spectrum Scale differences

Property name: dfs.permissions.enabled
Value: True/false
New definition or limitation: For the HDFS protocol, the permission check is always done.

Property name: dfs.namenode.acls.enabled
Value: True/false
New definition or limitation: For native HDFS, the NameNode manages all metadata, including the ACL
information. HDFS can use this setting to turn the ACL checking on or off. However, for IBM Spectrum
Scale, the HDFS protocol does not hold the metadata. When the setting is on, the ACL is set and stored
in the IBM Spectrum Scale file system. If the administrator turns it off later, the ACL entries that are
set and stored in IBM Spectrum Scale still take effect. This will be improved in the next release.

Property name: dfs.blocksize
Value: Long digital
New definition or limitation: Must be a multiple of the IBM Spectrum Scale file system block size
(mmlsfs -B). The maximum value is 1024 * file-system-data-block-size (mmlsfs -B).

Property name: gpfs.data.dir
Value: String
New definition or limitation: A user in Hadoop must have full access to this directory. If this
configuration is omitted, a user in Hadoop must have full access to gpfs.mount.dir.

Property name: dfs.namenode.fs-limits.max-xattrs-per-inode
Value: INT
New definition or limitation: Does not apply to the HDFS protocol.

Property name: dfs.namenode.fs-limits.max-xattr-size
Value: INT
New definition or limitation: Does not apply to the HDFS protocol.

Property name: dfs.namenode.fs-limits.max-component-length
Value: INT
New definition or limitation: Does not apply to HDFS Transparency. The file name length is controlled
by IBM Spectrum Scale. Refer to the IBM Spectrum Scale FAQ for the file name length limit.

Property name: Native HDFS encryption
Value: Supported
New definition or limitation: For more information, see “HDFS encryption” on page 192.

Property name: Native HDFS caching
Value: Not supported
New definition or limitation: Not supported by IBM Spectrum Scale.

Property name: NFS Gateway
Value: Not supported
New definition or limitation: IBM Spectrum Scale provides a POSIX interface, and using the IBM
Spectrum Scale protocol could give you better performance and scaling.
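
For example, a minimal check of the file system block size before choosing a dfs.blocksize value, assuming a file system named bigpfs (substitute your own device name):

# /usr/lpp/mmfs/bin/mmlsfs bigpfs -B

The dfs.blocksize value must then be a multiple of the reported block size, up to 1024 times that size.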

Limitations

Limitations and information
Known information, limitations and workarounds for IBM Spectrum Scale and HDFS Transparency
integration are stated in this section.
General
• The IBM Spectrum Scale service does not support the rolling upgrade of IBM Spectrum Scale and
Transparency from the Ambari GUI.
• The rolling upgrade of Hortonworks HDP cluster is not supported if the IBM Spectrum Scale service is
still integrated.
• The minimum recommended IBM Spectrum Scale version is 4.1. HDFS Transparency is not dependent
on the IBM Spectrum Scale version.
• Manual Kerberos setup requires the Kerberos setting in Ambari to be disabled before the IBM
Spectrum Scale Mpack is deployed. If the IBM Spectrum Scale service is already installed, HDFS
Transparency must be unintegrated before Kerberos is enabled in Ambari.
• Federation Support
Federation is supported for open source Apache Hadoop stack. The HDFS Transparency connector
supports two or more IBM Spectrum Scale file systems to act as one uniform file system for Hadoop
applications. For more information, see “Overview” on page 139.
• The latest JDK supported version for Ambari is 1.8.0.77.
• In a non-root environment, Ambari must be restarted as root to avoid exceptions.
• All configuration changes must be made through the Ambari GUI, and not manually set into the
HDFS configuration files or into the HDFS Transparency configuration files. This is to ensure that the
configuration changes are propagated properly.
• In your existing cluster, if the HDFS settings in the HDFS Transparency configuration files were manually
changed (For example: settings in core-site, hdfs-site, or log4j.properties in /var/mmfs/hadoop/etc/
hadoop) and these changes were not implemented in the existing native HDFS configuration files,
during the deployment of Ambari IOP or HDP and IBM Spectrum Scale service, the HDFS Transparency
configuration is replaced by the Ambari UI HDFS configurations. Therefore, save changes that are set for
the HDFS Transparency configuration files so that these values can later be applied through the Ambari
GUI.
• For FPO systems, ensure that you follow the proper steps to stop/start IBM Spectrum Scale. Otherwise,
restarting the IBM Spectrum Scale NSD might not be possible after NSDs go down and auto recovery
fails. This can occur when doing STOP ALL/START ALL from Ambari which stops IBM Spectrum Scale
without properly handling the NSDs in FPO mode for Mpack 2.4.2.6 and earlier and for Mpack 2.7.0.0.
For more information, see IBM Spectrum Scale NSD are not able to be recovered in FPO clusters (Stop/
Start of Scale service via Ambari GUI).
Installation
• Ambari only supports the creation of IBM Spectrum Scale FPO file system.
• While creating an Ambari IOP or HDP cluster, you do not need to create a local partition file system to
be used for HDFS if you plan to install IBM Spectrum Scale FPO through Ambari. IBM Spectrum Scale
Ambari management pack will create the recommended partitions for the local temp disks and IBM
Spectrum Scale disks. The local temp disks are mounted and used for the Yarn local directories.
• If disks are partitioned before the IBM Spectrum Scale FPO file system is created through Ambari, the
standard NSD must be used.
• Ensure that the GPFS Master and the Ambari server are colocated. The Ambari server must be part of
the Ambari and GPFS cluster. This implies that the Ambari server host is defined as an Ambari agent
host in the Add Hosts UI panel while setting up the Hadoop cluster. Otherwise, IBM Spectrum Scale
service fails to install if the nodes are not colocated.
• If you need to deploy the IOP or HDP over an existing IBM Spectrum Scale FPO cluster, either store the
Yarn’s intermediate data into the IBM Spectrum Scale file system, or use idle disks formatted as a local
file system. It is recommended to use the latter method. If a new IBM Spectrum Scale cluster is created
through the Ambari deployment, all the Yarn’s NodeManager nodes should be FPO nodes with the same
number of disks for each node specified in the NSD stanza.
• If you are deploying Ambari HDP on top of an existing IBM Spectrum Scale and HDFS Transparency
cluster:
– Perform a backup of the existing HDFS and HDFS Transparency configuration before proceeding to
deploy Ambari IOP or HDP, or deploy the IBM Spectrum Scale service with Ambari on a system that
has HDFS Transparency installed on it.
– Ensure that the HDFS configuration provided through the Ambari UI is consistent with the existing
HDFS configuration.
- The existing HDFS NameNode and DataNode values must match the Ambari HDFS UI NameNode
and DataNode values. Otherwise, the existing HDFS configuration will be overwritten by the default
Ambari UI HDFS parameters after the Add Service Wizard completes.
- The HDFS DataNodes being assigned in the Assign Slaves and Clients page in Ambari must
contain the existing HDFS Transparency DataNodes. If the host did not have HDFS DataNode and
GPFS Node set in Ambari, data on that host is not accessible, and cluster might be under replicated.
If the node was not configured as an HDFS DataNode and GPFS node during the Assign Slaves and
Clients, the host can add those components through the HOSTS component panel to resolve those
issues. For more information, see “Adding GPFS node component” on page 441.
– The HDFS NameNodes specified in the Ambari GUI during configuration must match the existing
HDFS Transparency NameNodes.
– Verify that the host names that are used are the data network addresses that IBM Spectrum Scale
uses for its cluster setup. Otherwise in an existing or shared file system, the IBM Spectrum Scale
service fails during installation because of a wrong host name.
– While deploying HDP over an existing IBM Spectrum Scale file system, the IBM Spectrum Scale
cluster must be started, and the file system must be mounted on all the nodes before starting the
Ambari deployment.
• When deploying the Ambari IOP or HDP cluster, ensure there are no mount points in the cluster.
Otherwise, the Ambari will take the shared mount point directory as the directory for the open source
services. This will cause the different nodes to write to the same directory.
• Ensure that all the hosts for the IBM Spectrum Scale cluster contain the same domain name while
creating the cluster through Ambari.
• IBM Spectrum Scale service requires that all the NameNodes and DataNodes are GPFS nodes.
• The IBM Spectrum Scale Ambari management pack uses the manual installation method and not the
IBM Spectrum Scale installation toolkit.
• If installing a new FPO cluster through Ambari, Ambari creates the IBM Spectrum Scale with the
recommended settings for FPO, and builds the GPFS portability layer on each node.
• It is recommended to run the HDFS Transparency NameNode on a GPFS node with metadata disks.
• It is recommended to assign the Yarn ResourceManager to the node that runs the HDFS Transparency
NameNode.
• When you are deploying the IBM Spectrum Scale service, the gpfs.replica.enforced parameter
might appear as dfs in the Ambari Scale service GUI, even though HDFS Transparency (3.1.0.5 and
later) sets it to gpfs by default. Therefore, it is important to set the gpfs.replica.enforced
parameter value to gpfs in Ambari. Otherwise, HDFS Transparency will use dfs as the value for the
gpfs.replica.enforced parameter instead of gpfs.
Update the gpfs.replica.enforced parameter to gpfs in the service wizard and proceed with the
deployment.
Configuration

• After adding and removing nodes from Ambari, some aspects of the IBM Spectrum Scale configuration,
such as page pool, as seen by running the mmlsconfig command, are not refreshed until after the next
restart of the IBM Spectrum Scale Ambari service. However, this does not impact the functionality.
• Short circuit is disabled when IBM Spectrum Scale service is installed. For information on how to enable
or disable Short Circuit, see Short Circuit Read Configuration.
Ambari GUI
• If any GPFS node other than the GPFS Master is stopped, the IBM Spectrum Scale panel does not
display any alert.
• The NFS gateway is displayed on the HDFS dashboard but is not used by HDFS Transparency. NFS
gateway is not supported. Use IBM Spectrum Scale protocol for better scaling if your application
requires NFS interface.
• The IBM Spectrum Scale Service UI Panel > Actions > Collect_Snap_Data does not work if you
configure an optional argument file (/var/lib/ambari-server/resources/gpfs.snap.args).
• For the IBM Spectrum Scale GUI quick link, you must initialize the IBM Spectrum Scale management
GUI before accessing it through Ambari quick links. See IBM Spectrum Scale management GUI.
Node management
• Ambari adds nodes and installs the IBM Spectrum Scale software on the existing IBM Spectrum Scale
cluster, but does not create or add NSDs to the existing file system.
• Adding a node in Ambari fails if the node to be added does not have the same IBM Spectrum Scale
version or the same HDFS Transparency version as the version currently installed on the Ambari IBM
Spectrum Scale HDFS Transparency cluster. Ensure that the node to be added is at the same IBM
Spectrum Scale level as the existing cluster.
• Decommissioning a DataNode is not supported when IBM Spectrum Scale is integrated.
• Moving a NameNode from the Ambari HDFS UI when HDFS Transparency is integrated is not supported.
To manually move the NameNode, see Moving a NameNode.
• New key-value pairs added to the IBM Spectrum Scale Ambari management pack GUI Advanced
configuration Custom Add Property panel are not effective in the IBM Spectrum Scale file system.
Therefore, any values not seen in the Standard or Advanced configuration panel need to be set
manually on the command line by using the IBM Spectrum Scale /usr/lpp/mmfs/bin/mmchconfig
command.
IBM Spectrum Scale
• Ensure that bi-directional password-less SSH is set up between all GPFS Nodes.
• The Hadoop service IDs and groups must have the same values across the cluster. Any user
name needs a user ID in the OS or Active Directory service when writing to the file system. This is
required by IBM Spectrum Scale.
– If you are using LDAP/AD: Create the IDs and groups on the LDAP server, and ensure that all nodes
can authenticate the users.
– If you are using local IDs: The IDs must be the same on all nodes with the same ID and group values
across the nodes.
• IBM Spectrum Scale only supports installation through a local repository.
• The management pack does not support IBM Spectrum Scale protocol and Transparent Cloud Tiering
(TCT) packages.
• Ensure that in an HDFS Transparency environment, the IBM Spectrum Scale file system is set to permit
any supported POSIX ACL type. Issue mmlsfs <Device> -k to ensure that the -k value is set to all,
as shown in the example after this list.
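
A minimal sketch of this check and the corresponding change, assuming a file system named bigpfs; substitute your own device name:

Check the current ACL setting:

# /usr/lpp/mmfs/bin/mmlsfs bigpfs -k

If the reported -k value is not all, change it:

# /usr/lpp/mmfs/bin/mmchfs bigpfs -k all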
HDP
• The Manage JournalNodes is shown in HDFS > Actions submenu. This function should not be used
when IBM Spectrum Scale service is deployed.
• The + is not supported when using hftp://namenode:50070.

Problem determination

Snap data collection


You can collect the IBM Spectrum Scale snap data from the Ambari GUI. The command is run by the IBM
Spectrum Scale Master, and the snap data is saved to /var/log/ambari.gpfs.snap.<timestamp>
on the IBM Spectrum Scale Master node.

By default, the IBM Spectrum Scale Master runs the following command:
/usr/lpp/mmfs/bin/gpfs.snap -d /var/log/ambari.gpfs.snap.<timestamp> -N <all
nodes> --check-space --timeout 600
Where <all nodes> is the list of nodes in the IBM Spectrum Scale cluster and in the Ambari cluster. The
external nodes in a shared cluster, such as ESS servers, are not included.
Note:
• If your cluster has IBM Spectrum Scale file system version 4.2.2.0 and later, gpfs.snap will include
the --hadoop option.
• If you run the Collect Snap Data through Ambari GUI, the Ambari logs will be captured into a tar
package under the /var/log directory. The base gpfs.snap --hadoop option command does not
capture the Ambari logs. The Ambari logs are only captured by clicking, IBM Spectrum Scale Service >
Actions > Collect Snap Data in the Ambari GUI.
You can also override the default behavior of this snap command by providing the arguments for the
gpfs.snap command in the file /var/lib/ambari-server/resources/gpfs.snap.args. This
works only if you are running gpfs.snap on the command line. For example, if you want to write
the snap data to a different location, collect the snap data from all nodes in the cluster, and increase the
timeout, you can provide a gpfs.snap.args file similar to the following example:

# cat /var/lib/ambari-server/resources/gpfs.snap.args
-d /root/gpfs.snap.out -a --timeout 1200

The Ambari snap data tar package is stored as /var/log/ambari.mpack.snap.<TIMESTAMP>.tar.gz
on the IBM Spectrum Scale Master node.
The Ambari snap captures the following information:
1. From all Ambari clients:

- /var/log/hadoop/root/*
- /var/lib/ambari-agent/data/
- /var/log/ambari-agent/ambari-agent.log

2. From Ambari-server:

- /var/log/ambari-server/ambari-server.log
- /var/run/ambari-server/stack-recommendations/

Figure 45. Ambari Collect Snap Data

General
Note: For known HDFS Transparency issues, see “HDFS Transparency protocol troubleshooting” on page
212.
1. Data capturing for problem determination
Solution:
Capture the following data for problem determination:
• Failed services and HDFS service logs from the Ambari log outputs from the Ambari UI. The log
outputs are seen in the operations logs in the Ambari UI from the output*.txt and error*.txt
outputs.
• Ambari server and agent log /var/log/ambari-server/ambari-server.log and /var/log/
ambari-agent/ambari-agent.log.
• Transparency NameNode and DataNode logs.
• ZKFC log from NameNode host - /var/log/hadoop/root/hadoop-root-zkfc*.log.
• The following software versions:
– Management pack installed on the Ambari server node, go to the /var/lib/ambari-server/
resources/mpacks directory and get the directory package installed.
– HDFS Transparency version: rpm -qa | grep gpfs.hdfs-protocol.
– IBM Spectrum Scale version.
• SpectrumScaleMPackInstaller.py/SpectrumScaleMPackUninstaller.py/
SpectrumScale_UpgradeIntegrationPackage scripts failure: Capture
the SpectrumScale* log from the directory where the
script is located. Any produced *.json files will also reside in this directory.
Find Ambari Mpack version
From Mpack 2.7.0.0, get the Mpack version through Ambari GUI service action.
Otherwise, get the Mpack version through the Ambari directory under /var/lib/ambari-server.

For example:

/var/lib/ambari-server/resources/extensions/SpectrumScaleExtension/2.4.2.0/services/GPFS

This example is using Mpack 2.4.2.0.


2. What IBM Spectrum Scale edition is required for the Ambari deployment?
Solution: If you want to perform a new installation, including cluster creation and file system
creation, use the Standard or Advanced edition because the IBM Spectrum Scale file system policy is
used by default. If you only have the Express Edition, select Deploy HDP over existing IBM Spectrum
Scale file system.
3. Why do I fail in registering the Ambari agent?
Solution: Run ps -elf | grep ambari on the failing agent node to see what it is running. Usually,
while registering in the agent node, there must be nothing under /etc/yum.repos.d/. If there is
an additional repository that does not work because of an incorrect path or yum server address, the
Ambari agent register operation will fail.
4. Which yum repository must be under /etc/yum.repos.d?
Solution: Before registering, on the Ambari server node, under /etc/yum.repos.d, there is only
one Ambari repository file that you create in Installing the Ambari server rpm. On the Ambari agent,
there must be no repository files related with Ambari. After the Ambari agent has been registered
successfully, the Ambari server copies the Ambari repository to all Ambari agents. After that, the
Ambari server creates the HDP and HDP-UTILS repository over the Ambari server and agents,
according to your specification in the Ambari GUI in Select Stack section.
If you interrupt the Ambari deployment, clean the files before starting up Ambari the next time,
especially when you specify a different IBM Spectrum Scale, HDP, or HDP-UTILS yum URL.
5. Must all nodes have the same root password?
Solution: No, this is unnecessary. You only need to specify the ssh key file for root on the Ambari
server.
6. How to check the superuser and the supergroup?
Solution:
For HortonWorks HDP 3.0, HDFS Transparency 3.0 has removed the configuration
gpfs.supergroup defined in /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml.
By default, the groups from the configuration dfs.permissions.superusergroup in /var/
mmfs/hadoop/etc/hadoop/hdfs-site.xml and the group root are super groups.
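A minimal sketch of one way to check these values on an HDFS Transparency node; the user name hdfs is only an example:

Show the configured superuser group:

# grep -A 1 dfs.permissions.superusergroup /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml

List the groups that a user belongs to, to see whether the user is in that group:

# id -Gn hdfs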
7. Why am I unable to connect to the Ambari Server through the web browser?
Solution: If you cannot connect to the Ambari Server through the web browser, check to see if the
following message is displayed in the Ambari Server log which is in /var/log/ambari-server:

WARN [main] AbstractConnector:335 - insufficient threads configured for SelectChannelConnector@0.0.0.0:8080

The size of the thread pool can be increased to match the number of CPUs on the node where the
Ambari Server is running.
For example, if you have 160 CPUs, add the following properties to /etc/ambari-server/conf/
ambari.properties:

server.execution.scheduler.maxThreads=160
agent.threadpool.size.max=160
client.threadpool.size.max=160

8. HDFS Download Client Configs does not contain HDFS Transparency configuration.
Solution: In the HDFS dashboard, go to Service Actions > Download Client Configs, the tar
configuration downloaded does not contain the HDFS Transparency information.

The workaround is to tar up the HDFS Transparency directory.
Run the following command on a HDFS Transparency host to tar up the HDFS Transparency directory
into /tmp:

# cd /var/mmfs/hadoop/etc/
# tar -cvf /tmp/hdfs.transparency.hadoop.etc.tar hadoop

9. HDFS checkpoint confirmation warning message from Actions > Stop All1 when integrated with IBM
Spectrum Scale.
Solution: When IBM Spectrum Scale is integrated, the NameNode is stateless. The HDFS
Transparency does not support the HDFS dfsadmin command.

Therefore, when doing Ambari dashboard > Actions > Stop All1, Ambari generates a confirmation
box that asks the user to do an HDFS checkpoint by using the hdfs dfsadmin -safemode commands. This
is not needed when HDFS Transparency is integrated, and this step can be skipped. Click Next to
skip this step.

10. What happens if the Ambari admin password is modified after installation?
Solution: When the Ambari admin password was modified, the new password is required to be set in
the IBM Spectrum Scale service.
To change the Ambari admin password in IBM Spectrum Scale, follow these steps:
• Log in to the Ambari GUI.
• Click Spectrum Scale > Configs tab > Advanced tab > Advanced gpfs-ambari-server-env >
AMBARI_USER_PASSWORD to update the Ambari admin password.
If the Ambari admin password is not modified in the IBM Spectrum Scale Advanced configuration
panel, starting Ambari services might fail. For example, Hive starting fails with exception errors.
11. Kerberos authentication error during Unintegrate Transparency action

ERROR: Kerberos Authentication Not done Successfully. Exiting Unintegration.


Enter Correct Credentials of Kerberos KDC Server in Spectrum Scale Configuration.

Solution:
If this error occurs in a Kerberos environment, ensure that the KDC_PRINCIPAL and
KDC_PRINCIPAL_PASSWORD values in the Spectrum Scale service > Configs > Advanced tab are
correct. Save the configuration changes.

12. NameNodes and DataNodes failed with the error Fail to replace Transparency jars with
hadoop client jars when short-circuit is enabled.

Solution: Install the Java OpenJDK development tool-kit package, java-<version>-openjdk-
devel, on all the Transparency nodes. Ensure that the version is compatible with your existing JDK
version. See HDFS Transparency package.
13. ssh rejects additional ssh connections which causes the HDFS Transparency syncconf connection to
be rejected.
Solution: If the ssh maxstartups value is too low, ssh connections can be rejected.
Review the ssh configuration values and increase the maxstartups value.
For example:
Review ssh configuration:

# sshd -T | grep -i max


maxauthtries 6
maxsessions 10
clientalivecountmax 3
maxstartups 10:30:100

Modify the ssh configuration: Edit the /etc/ssh/sshd_config file to set the maxstartups value.

maxstartups 1024:30:1024

Restart the ssh daemon:

# service sshd restart

14. Not able to view Solr audits in Ranger.


Solution: To resolve this issue:
a. Remove the solr ranger audit write lock file if it exists as root or as the owner of the file.

$ ls /bigpfs/apps/solr/data/ranger_audits/core_node1/data/index/write.lock
$ rm /bigpfs/apps/solr/data/ranger_audits/core_node1/data/index/write.lock

b. Restart HDFS and Solr.


Click Ambari GUI > HDFS > Actions > Restart All

Click Ambari GUI > Solr > Actions > Restart All
15. Restarting the service fails because a network port is in use; the NameNode is still up after
doing a STOP ALL1 from the Ambari GUI or HDFS service > STOP.
Solution: As a root user, ssh to the NameNode to check if the NameNode is up:

# ps -ef | grep namenode

If it exists, then kill the NameNode pid

# kill -9 namenode_pid

Restart the service.


16. UID/GID failed with illegal value Illegal value: USER = xxxxx > MAX = 8388607
Solution: If you have installed Ranger, and you need to leverage Ranger capabilities, then you need to
make the UID/GID less than 8388607.
If you do not need Ranger, follow these steps to disable Ranger from HDFS Transparency:
a. On the Ambari GUI, click IBM Spectrum Scale > Configs and add the gpfs.ranger.enabled
property set to false.
b. Save the configuration.
c. Restart IBM Spectrum Scale.
d. Restart HDFS.
17. What to do when I see performance degradation when using HDFS Transparency version 2.7.3-0 and
earlier?
Solution:
For HDFS Transparency version 2.7.3-0 and below, if you see performance degradation and you are
not using Ranger, set the gpfs.ranger.enabled to false.
a. On the Ambari GUI, click Spectrum Scale > Configs > Advanced > Custom gpfs-site and add the
gpfs.ranger.enabled property set to false.
b. Save the configuration.
c. Restart IBM Spectrum Scale.
d. Restart HDFS.
18. Why did the IBM Spectrum Scale service not stop or restart properly?
This can be a result of a failure to unmount the IBM Spectrum Scale file system which may be busy.
See the IBM Spectrum Scale operation task output in Ambari to verify the actual error messages.
Solution:
Stop all services. Ensure the IBM Spectrum Scale file system is not being accessed either via HDFS
or POSIX by running the lsof or fuser command. Stop or restart the IBM Spectrum Scale service
again.
For FPO cluster, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section
on how to properly stop IBM Spectrum Scale.
19. IBM Spectrum Scale service cannot be deployed in a non-root environment.
Solution:
If the deployment of IBM Spectrum Scale service in a non-root environment fails with the Error
message: Error occurred during stack advisor command invocation: Cannot
create /var/run/ambari-server/stack-recommendations, go to I cant add new services
into ambari.
20. User permission denied when Ranger is disabled.

If Kerberos is enabled and Ranger is disabled, the user gets the permission denied errors when
accessing the file system for HDFS Transparency 3.0.0 and earlier.
Solution:
Check the Kerberos principal mapping hadoop.security.auth_to_local field in the /var/
mmfs/hadoop/etc/hadoop/core-site.xml or in Ambari under HDFS Config to ensure that the
NameNode and DataNode are mapped to root instead of HDFS. For example, change

FROM:
RULE:[2:$1@$0](<namenode-principal>@<REALM>)s/.*/hdfs/
RULE:[2:$1@$0](<datanode-principal>@<REALM>)s/.*/hdfs/

TO:
RULE:[2:$1@$0](<namenode-principal>@<REALM>)s/.*/root/
RULE:[2:$1@$0](<datanode-principal>@<REALM>)s/.*/root/

Restart the HDFS service in Ambari or HDFS Transparency by using the following command:

/usr/lpp/mmfs/bin/mmhadoopctl connector stop; /usr/lpp/mmfs/bin/mmhadoopctl connector start

21. Updating ulimit settings for HDFS Transparency.


After updating the ulimit values on your nodes, perform the following procedure for HDFS
Transparency to pick up the ulimit values properly.
Solution:
a. Restart each node’s Ambari agent by issuing the following command:

ambari-agent restart

b. Restart HDFS service from Ambari.


22. In a Kerberized environment, an Ambari error occurs because a user fails to authenticate.
If Kerberos is enabled and the UID of a user is changed, the Kerberos ticket cache becomes invalid for that user.
Solution:
If the user fails to authenticate, run the klist command to find the path to the ticket cache
and remove the krb5* files.
For example:
As the user, run klist.
Check the Ticket cache value (for example, Ticket cache: FILE:/tmp/krb5cc_0).
Remove the /tmp/krb5cc_0 file from all nodes.
Note: Kerberos regenerates the file on the node.
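A minimal sketch of this procedure, assuming the ticket cache file is /tmp/krb5cc_0 and that mmdsh can reach all nodes; use the file name that klist reports in your environment:

As the affected user, locate the ticket cache:

$ klist
Ticket cache: FILE:/tmp/krb5cc_0

As root, remove the stale cache file from all nodes:

# /usr/lpp/mmfs/bin/mmdsh -N all rm -f /tmp/krb5cc_0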
23. Quicklinks NameNode GUI are not accessible from HDFS service in multihomed network
environment.
In multihomed networks, the cluster nodes are connected to more than one network interface.
The Quicklinks from HDFS service are not accessible with the following errors:

This site can't be reached.


<Host> refused to connect.
ERR_CONNECTION_REFUSED

Solution:
For fixing the NameNode binding so that HDFS service NameNode UI can be accessed properly, see
the following Hortonworks documentation:
• Fixing Hadoop issues In Multihomed Environments
• Ensuring HDFS Daemons Bind All Interfaces.

Ensure that you restart the HDFS service after changing the values in the HDFS configuration in Ambari.
24. Enable Kerberos action fails.
Solution:
If the IBM Spectrum Scale service is integrated, Enable Kerberos action might fail due to an issue
with GPFS Service Check underneath. In such cases, retry the operation.
25. Enable the autostart of services when IBM Spectrum Scale is integrated.
Solution:
a. In Ambari GUI, go to Admin > Service Auto Start Configuration and enable autostart.
b. Enable autoload and automount on the IBM Spectrum Scale cluster (on the HDP cluster side).
c. If ESS is being used, enable autoload on the ESS cluster.
For more information, see the IBM Spectrum Scale mmchfs <fsname> -A yes (automount) and
mmchconfig autoload=yes commands.
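A minimal sketch of the IBM Spectrum Scale side of steps b and c, assuming a file system named bigpfs; run the same mmchconfig command on the ESS cluster if one is used:

Enable automatic mount of the file system when GPFS starts:

# /usr/lpp/mmfs/bin/mmchfs bigpfs -A yes

Enable automatic startup of the GPFS daemon on reboot:

# /usr/lpp/mmfs/bin/mmchconfig autoload=yes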
26. GPFS Master fails with the error message: The UID and GID of the user "anonymous" is
not uniform across all the IBM Spectrum Scale hosts.
Solution:
a. Ensure that the userid/groupid for the user anonymous are uniform across all the GPFS hosts in
the cluster. Correct the inconsistent values on any GPFS host.
b. If there is no anonymous userid/groupid existing on a GPFS host, ensure that you create the same
anonymous userid/groupid value as all the other GPFS hosts' anonymous userid/groupid value in
the same IBM Spectrum Scale cluster.
Example on how to create the anonymous user as a regular OS user across all the GPFS
hosts. If you are using LDAP or other network authentication service, refer to their respective
documentation.
Create the GID first by running the following command:

mmdsh -N all groupadd -g <common group ID> anonymous

where, <common group ID> can be set to a value like 11888.


Create the UID by running the following command:

mmdsh -N all useradd -u <common user ID> anonymous -g anonymous

where, <common user ID> can be set to a value like 11889.


27. IBM Spectrum Scale installation fails during deployment in Ambari due to scripts not found error.
stdout: /var/lib/ambari-agent/data/output-402.txt Caught an exception while
executing custom service command:

<class 'ambari_agent.AgentException.AgentException'>:
'Script /var/lib/ambari-agent/cache/extensions/SpectrumScaleExtension
/2.7.0.1/services/GPFS/package/scripts/slave.py does not exist';

Solution:
See Ambari Release Notes SPEC-57 for resolution.
28. IBM Spectrum Scale service installation in Ambari fails in the stack-advisor because the default login
shell does not propagate the error return code of the failed shell command properly (it returns zero instead).
The log file: /var/run/ambari-server/stack-recommendations/<number>/
stackadvisor.out shows errors:
mmlsfs: No file systems were found.
mmlsfs: Command failed. Examine previous error messages to determine cause.

Error occured in the stack advisor.
Error details: local variable 'mount' referenced before assignment.
Solution:
Check whether the default login shell returns a return code of zero '0' for a failed command. If the
shell propagates return codes correctly, the failed command returns a value greater than 0.
Run the following command on the Ambari server:

ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no <USER>@<AMBARI-HOST-NAME> "sudo cat /notpresentfile"

The <USER> is either the root or non-root Ambari user depending on how Ambari was configured. If
the command returns a zero '0' return code, then you need to update the default login shell to use the
'bash' shell for the Ambari user.
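A minimal sketch of checking the return code and switching the login shell, assuming am_agent is the Ambari user; adapt the user and host name to your environment:

Check the return code of the failed command:

# ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no am_agent@<AMBARI-HOST-NAME> "sudo cat /notpresentfile"; echo $?

If the result is 0, change the default login shell of the Ambari user to bash on every node:

# usermod -s /bin/bash am_agent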
29. Unable to stop IBM Spectrum Scale service in Mpack 2.7.0.4.
In Mpack 2.7.0.4, if gpfs.storage.type is set to shared, stopping the IBM Spectrum Scale service
from Ambari reports a failure in the UI even if the operation had succeeded internally.
Solution:
To work around this issue:
a. Before you stop IBM Spectrum Scale or do a STOP All, set the IBM Spectrum Scale service to
maintenance mode.
b. On the command line, stop IBM Spectrum Scale using the mmshutdown command.

# /usr/lpp/mmfs/bin/mmshutdown -a

c. Put the IBM Spectrum Scale service out of maintenance mode.


d. Start the IBM Spectrum Scale service or do a Start All using Ambari.
30. If SSL is enabled in Ambari, running the SpectrumScaleMPackUninstaller.py script to uninstall
the IBM Spectrum Scale Mpack with an IP address might fail with a certificate error during the
validation of the Ambari server's credentials.
Solution:
Depending on the SSL certificate that the Ambari server is registered with (hostname
or IP address), using the IP address of the Ambari server while running the
SpectrumScaleMPackUninstaller.py script can give a certificate error because the certificate
is registered with the hostname. Therefore, provide the Ambari server's hostname instead of the IP
address when the Mpack uninstaller script prompts for the Ambari server IP address.
31. Ambari 2.7.X adding additional directories during deployment.
Solution:
For HDP 3.X using Ambari 2.7.X, Ambari adds directories in addition to the default /hadoop/hdfs
directory path. Review the HDFS NameNode and DataNode directories, the Yarn local directories, and
the other directories listed in “Customize services” on page 362 to ensure that only the required
directories are listed.
For example, when integrating/unintegrating Scale service:

DFS NameNode : /hadoop/hdfs/namenode,/.snapshots/hadoop/hdfs/namenode,
/opt/hadoop/hdfs/namenode,/srv/hadoop/hdfs/namenode,/usr/local/hadoop/hdfs/namenode,
/var/cache/hadoop/hdfs/namenode,/var/crash/hadoop/hdfs/namenode,
/var/lib/libvirt/images/hadoop/hdfs/namenode,/var/lib/machines/hadoop/hdfs/namenode,
/var/lib/mailman/hadoop/hdfs/namenode,/var/lib/mariadb/hadoop/hdfs/namenode,
/var/lib/mysql/hadoop/hdfs/namenode,/var/lib/named/hadoop/hdfs/namenode,
/var/lib/pgsql/hadoop/hdfs/namenode,/var/log/hadoop/hdfs/namenode,
/var/opt/hadoop/hdfs/namenode,/var/spool/hadoop/hdfs/namenode,/var/tmp/hadoop/hdfs/namenode

DFS DataNode /hadoop/hdfs/data,/.snapshots/hadoop/hdfs/data,/opt/hadoop/hdfs/data,
/srv/hadoop/hdfs/data,/usr/local/hadoop/hdfs/data,/var/cache/hadoop/hdfs/data,
/var/crash/hadoop/hdfs/data,/var/lib/libvirt/images/hadoop/hdfs/data,
/var/lib/machines/hadoop/hdfs/data,/var/lib/mailman/hadoop/hdfs/data,
/var/lib/mariadb/hadoop/hdfs/data,/var/lib/mysql/hadoop/hdfs/data,
/var/lib/named/hadoop/hdfs/data,/var/lib/pgsql/hadoop/hdfs/data,/var/log/hadoop/hdfs/data,
/var/opt/hadoop/hdfs/data,/var/spool/hadoop/hdfs/data,/var/tmp/hadoop/hdfs/data

Even though HDFS Transparency does not use the NameNode and DataNode directories listed above,
native HDFS needs them.
The default directory paths are /hadoop/hdfs/namenode and /hadoop/hdfs/data. All other
directories are not needed.
32. Ambari 2.7.x - Cannot find a valid baseurl for repo.
For Ambari 2.7.x, Ambari writes empty baseurl values to the repo files when using a local repository
causing stack installation failures.
Solution:
See AMBARI-25069/SPEC-58/BUG-116328 workaround:
For Ambari 2.7.0.0: Ambari 2.7.0 Known Issues.
For Ambari 2.7.1.0: Ambari 2.7.1 Known Issues.
For Ambari 2.7.3.0: Ambari 2.7.3 Known Issues.
33. The IBM Spectrum Scale Mpack installer fails with No JSON object could be decoded error.
If the Ambari certificate is expired or self-signed or is invalid, the Mpack installation fails while
executing the REST API calls.
Error seen:

INFO: ***Starting the Spectrum Scale Mpack Installer v2.7.0.7***


Enter the Ambari server host name or IP address. If SSL is configured, enter host name,
to verify the SSL certificate. Default=192.0.2.22 : c902f09x05.gpfs.net
Enter Ambari server port number. If it is not entered, the installer will take default
port 8080 : 9443
Enter the Ambari server username, default=admin : admin
Enter the Ambari server password :
INFO: Verifying Ambari server address, username and password.
Traceback (most recent call last):
File "./SpectrumScaleMPackInstaller.py", line 312, in <module>
InstallMpack(**darg)
File "./SpectrumScaleMPackInstaller.py", line 162, in InstallMpack
cluster_details = verify(ambari_hostname.strip(), ambari_username.strip(),
ambari_password, ambari_port)
File "/root/mpack2707/mpack_utils.py", line 417, in verify
clusters_json=json.loads(result)
File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
SpectrumScaleMPackInstaller failed.

Solution:
The following are the two possible solutions:
a. Enable urllib2 to work with the self-signed certificate by setting verify to disable
in the /etc/python/cert-verification.cfg file. For more information, see Certificate
verification in Python standard library HTTP clients.
b. Configure Ambari with the correct SSL certificate.
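A minimal sketch of the change described in option a, assuming the standard RHEL 7 layout of the /etc/python/cert-verification.cfg file:

# cat /etc/python/cert-verification.cfg
[https]
verify=disable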
34. Mpack Installation / Uninstallation fails while restarting Ambari due to Server not yet
listening on http port timeout error.

Error seen:

ERROR: Failed to run Ambari server restart command, with error: Using python /usr/bin/
python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server
start.......................................................................................
.............
DB configs consistency check found warnings. See /var/log/ambari-server/ambari-server-
check-database.log for more details.
ERROR: Exiting with exit code 1.
REASON: Server not yet listening on http port 8080 after 90 seconds. Exiting..

Solution:
a. Increase the timeout by adding or updating the server.startup.web.timeout property on
the Ambari server to 180 seconds in the /etc/ambari-server/conf/ambari.properties
file. For more information, see change the port for ambari server.
b. Retry the Mpack install / uninstall procedure.
1 For an FPO cluster, do not run STOP ALL from the Ambari GUI. Refer to the Limitations > General section on
how to properly stop IBM Spectrum Scale.

Troubleshooting Ambari
This section contains troubleshooting information for Ambari issues.
1. Problem determination information
Solution:
Capture the following data for problem determination:
• Failed services and HDFS service logs from the Ambari log outputs from the Ambari UI. The log
outputs are seen in the operations logs in Ambari UI from the output*.txt and error*.txt outputs
• Ambari server and agent log /var/log/ambari-server/ambari-server.log and /var/log/
ambari-agent/ambari-agent.log
• Transparency NameNode and DataNode logs (see HDFS Transparency debugging)
• ZKFC log from NameNode host - /var/log/hadoop/root/hadoop-root-zkfc*.log
• Software versions
– Management pack installed: On the Ambari server node, go to directory /var/lib/ambari-
server/resources/mpacks and get the directory package installed.
– HDFS Transparency version: rpm -qa | grep gpfs.hdfs-protocol.
– IBM Spectrum Scale version.
• SpectrumScaleMPackInstaller.py/SpectrumScaleMPackUninstaller.py/
SpectrumScale_UpgradeIntegrationPackage scripts failure: Capture
the SpectrumScale* log from the directory where
the script is located. Any produced *.json files will also reside in this directory.
Find Ambari Mpack version
From Mpack 2.7.0.0, get the Mpack version through “Verify IBM Spectrum Scale Mpack version”
on page 434. Otherwise, get the Mpack version through the Ambari directory under /var/lib/
ambari-server.

For example: /var/lib/ambari-server/resources/extensions/
SpectrumScaleExtension/2.4.2.0/services/GPFS
This example is using Mpack 2.4.2.0.
2. Disable Ranger in Ambari
Solution:
In HDFS Transparency version 2.7.2-1, a new property field can be used to disable Ranger. To disable
Ranger through Ambari, go to Ambari GUI > IBM Spectrum Scale > Configs > Advanced > Custom
gpfs-site > Add property and set the key field to gpfs.ranger.enabled and the value field to
false. Save the config, restart IBM Spectrum Scale and then restart HDFS to sync this value to all the
nodes.
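For reference, after the configuration is saved and synced, the entry should appear in the gpfs-site configuration on the nodes. A minimal sketch of the resulting property, assuming it is rendered as a standard Hadoop-style entry in gpfs-site.xml, is:

<property>
  <name>gpfs.ranger.enabled</name>
  <value>false</value>
</property>
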
3. IBM Spectrum Scale service missing from Ambari when installed after HDP and BigSQL version 5.0
Solution:
In GPFS Ambari integration module version 2.4.2.0, if the IBM Spectrum Scale service is missing
from Ambari when installed after HDP and BigSQL version 5.0, then the following manual steps are
required to enable it.
Note: This example uses http. If your cluster uses https, replace the http with https.
Where,

<Ambari_server_ip> = Ambari server ip address

<Ambari_server_port> = Ambari server port

<Ambari admin id> = The Ambari admin id

<Ambari admin passwd> = The Ambari admin password value

a. Check the IBM Spectrum Scale service in Ambari database.


Note: The GPFS Ambari extension link is missing even though
GPFS Ambari /var/lib/ambari-server/resources/mpacks/SpectrumScaleExtension-
MPack-2.4.2.0/extensions/SpectrumScaleExtension directory exists.
Run the command:

# curl -u <Ambari admin id>:<Ambari admin passwd> -H 'X-Requested-By:ambari' -X GET 'http://<Ambari_server_ip>:<Ambari_server_port>/api/v1/links'

Note: The GPFS Ambari extension is missing. Only the IBM-Big_SQL extension is seen.
b. Create the GPFS Ambari extension link.
Run the command:

# curl -u <Ambari admin id>:<Ambari admin passwd> -H 'X-Requested-By: ambari' -X POST \
  -d '{"ExtensionLink": {"stack_name":"HDP", "stack_version": "2.6", "extension_name": "SpectrumScaleExtension", "extension_version": "2.4.2.0"}}' \
  http://<Ambari_server_ip>/api/v1/links/

c. Check to ensure that the extension link for GPFS Ambari is created.
Run the command:

# curl -u <Ambari admin id>:<Ambari admin passwd> -H 'X-Requested-By:ambari' -X GET 'http://<Ambari_server_ip>:<Ambari_server_port>/api/v1/links'

For example, you should see similar information:

"href" : "http://9.30.95.166:8081/api/v1/links/51",

"ExtensionLink" : {

Chapter 5. Cloudera HDP 3.X 473


"extension_name" : "SpectrumScaleExtension",

"extension_version" : "2.4.2.0",

"link_id" : 51,

"stack_name" : "HDP",

"stack_version" : "2.6"

d. Restart Ambari server.


Run the command:

# ambari-server restart

e. Log into Ambari GUI, ensure that the IBM Spectrum Scale service is visible in the GUI.
4. IBM Spectrum Scale service failed to stop or restart.
This can be a result of a failure to unmount the IBM Spectrum Scale file system which may be busy.
See the IBM Spectrum Scale operation task output in Ambari to verify the actual error messages.
Solution:
Stop all the services. Ensure that the IBM Spectrum Scale file system is not being accessed either by
using HDFS or POSIX by running the lsof or fuser command. Stop or restart the IBM Spectrum Scale
service again.
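For example, assuming the file system is mounted at /gpfs/fs1 (replace this with your own mount point), you can check which processes are keeping the mount point busy before retrying the stop or restart:

# lsof +D /gpfs/fs1
# fuser -vm /gpfs/fs1
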
5. Missing IBM Spectrum Scale Quick link in IBM Spectrum Scale Ambari Mpack 2.4.2.1.
In IBM Spectrum Scale Ambari Mpack version 2.4.2.1, the IBM Spectrum Scale Quick link is missing.
Solution:
a. On the Ambari server, edit the /var/lib/ambari-
server/resources/mpacks/SpectrumScaleExtension-MPack-2.4.2.1/extensions/
SpectrumScaleExtension/2.4.2.1/services/GPFS/quicklinks/quicklinks.json
file to add in the field "component_name": "GPFS_MASTER" under the links > url with
the "IBM Spectrum Scale Management UI" label section.

"links": [
{
"requires_user_name": "false",
"name": "spectrum_scale_management_ui",
"url": "https://c902f09x05.gpfs.net",
"label": "Spectrum Scale Management UI",
"port": {
"regex": "\\w*:(\\d+)"
},
"component_name": "GPFS_MASTER"
}
]

b. Restart Ambari server using the following command:

/usr/sbin/ambari-server restart

6. In Ambari, when Ranger is enabled NameNode failed to start.


In Mpack version 2.4.2.3 and below, if Ranger is enabled through Ambari, the HDFS Transparency
NameNode might fail to start and no logging is seen.
Solution 1:
Restart the HDFS service again though the HDFS Ambari UI. If the NameNodes are still not able to
come up, then follow Solution 2.



Solution 2:
Depending on the system installed packages, the CLASSPATH set from /usr/share/java/*.jar
might contain jars that are not valid for HDFS Transparency.
HDFS Transparency only requires the /usr/share/java/mysql-connector-java.jar to be set
in the CLASSPATH and not all the jars from /usr/share/java.
There are 2 connector.py files that need to be patched to ensure that the CLASSPATH is set properly
through Ambari.
a. From Scale path: /var/lib/ambari-server/resources/mpacks/
SpectrumScaleExtension-MPack-<version>/extensions/SpectrumScaleExtension/
<version>/services/GPFS/package/scripts/connector.py
b. From HDFS path: /var/lib/ambari-server/resources/common-services/HDFS/
<version>/package/scripts/connector.py
Steps:
a. Save copy of the connector.py from both the Scale path and HDFS path.
b. Edit the Scale path connector.py to change the following:
FROM

line4="for f in /usr/share/java/*.jar; do"

line5=" export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f"


line6="done"

f.write("\n")
f.write("\n")
f.write(line1)
f.write("\n")
f.write(line2)
f.write("\n")
f.write(line3)
f.write("\n")
f.write(line4)
f.write("\n")
f.write(line5)
f.write("\n")
f.write(line6)
f.write("\n")

TO

# Change line4 to explicitly set only the mysql-connector-java.jar
line4=" export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/share/java/mysql-connector-java.jar"

# Remove line5 and line6

f.write("\n")
f.write("\n")
f.write(line1)
f.write("\n")
f.write(line2)
f.write("\n")
f.write(line3)
f.write("\n")
f.write(line4)
f.write("\n")
# Remove line5 and line6 and extra newlines

c. Copy the Scale path connector.py to the HDFS path.


d. Restart Ambari



# /usr/sbin/ambari-server restart

7. Mapreduce service check/jobs failed due to permission failure.
The Mapreduce service check times out or the job fails because of a permission failure for the yarn user.
Solution:
For Yarn, ensure that the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs directories are writable by the user yarn. If they are not, add write permission to the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs directories for the user yarn.
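For example, assuming yarn.nodemanager.local-dirs is /hadoop/yarn/local and yarn.nodemanager.log-dirs is /hadoop/yarn/log (illustrative values; check the actual paths in your YARN configuration), the directories can be made writable for the yarn user with:

# chown -R yarn:hadoop /hadoop/yarn/local /hadoop/yarn/log
# chmod -R 755 /hadoop/yarn/local /hadoop/yarn/log
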
8. Getting Spectrum Scale is stopped issue in Assign Slaves and Clients panel
For IBM Spectrum Scale Mpack 2.4.2.2 and below:
When deploying the IBM Spectrum Scale service onto a pre-existing IBM Spectrum Scale cluster, the
Assign Slaves and Clients panel will show there is an issue:

The "Click for details" will show a pop up panel that states that the IBM Spectrum Scale daemons are
stopped.
For pre-existing cluster, the IBM Spectrum Scale daemons are required to be active and the mount
points are available.

However, when you run the /usr/lpp/mmfs/bin/mmgetstate -a command, the IBM Spectrum Scale cluster is shown as already being active. The /usr/lpp/mmfs/bin/mmlsmount {Device | all} command also shows that all the mount points are available on all the nodes.



Solution:
For pre-existing IBM Spectrum Scale cluster, if the mmgetstate -a shows that the nodes in
the cluster are not active, then ensure that you start IBM Spectrum Scale up by executing
the /usr/lpp/mmfs/bin/mmstartup -a command. Ensure that IBM Spectrum Scale is active
and mount points are available on all the nodes before deploying the IBM Spectrum Scale service.
For pre-existing IBM Spectrum Scale cluster, if the mmgetstate -a shows that the nodes in the
cluster are active and all the mount points are available on all the nodes, then ignore this issue pop
up panel and continue with the deployment of the IBM Spectrum Scale service by clicking OK. This is
a known bug and is being tracked internally.
9. Yarn Timeline Service 2.0 fails to start
HDP 3.0:
The Timeline Service 2.0 in Yarn fails to start.
Solution:
There is a new implementation of the Timeline service in HDP 3.0 named as Timeline Service 2.0. It
can run in 2 modes (Embedded mode or System service mode) depending on the cluster capacity. To
check the mode that is being set, filter the search for is_hbase_system_service_launch under
the YARN configuration. If this value is checked, it is running in the system service mode. If it is
running in the system service mode, follow the set of best practices from the Hortonworks website.
The following important steps should be performed after Integrating/UnIntegrating the IBM
Spectrum Scale service and Enabling/Disabling the Kerberos: Managing Data Operating System
If the ERROR client.ApiServiceClient: Failed to destroy service ats-hbase,
because it is still running error is seen in the step 1 mentioned above, perform the
following steps:
a. Check the status of the ats-hbase service by executing the following command:

yarn app -status ats-hbase

b. If the state is STOPPED, perform the following:


Get the application_ID from the ResourceManager UI from the Ambari GUI and then run:

yarn -kill -appId <application_ID>


yarn app -destroy ats-hbase

You might need to remove the following directory:


/<gpfs.mnt.dir value>/<gpfs.data.dir value>/user/yarn-ats/{stack-version}
For example,

rm -rf /gpfs/datadir_1/user/yarn-ats/{stack-version}

c. Run all the service checks to ensure that all the services are successful.
10. IBM Spectrum Scale NSDs cannot be recovered in FPO clusters (Stop/Start of the Scale service via the Ambari GUI).
Solution:
When you perform STOP ALL/START ALL in an FPO environment in Ambari, the IBM Spectrum Scale NSDs cannot be recovered. To prevent the IBM Spectrum Scale file system from being stopped by the Ambari STOP ALL, place the IBM Spectrum Scale service in Maintenance mode before you execute the Ambari STOP ALL. For more information, see “Stop all without stopping IBM Spectrum Scale service” on page 429.
For how to properly stop/start IBM Spectrum Scale for FPO, see IBM Spectrum Scale NSD are not
able to be recovered in FPO.
Then you can take the IBM Spectrum Scale service out of Maintenance mode.



11. Ambari 2.7.X adds additional directories during deployment.
Solution:
For HDP 3.X using Ambari 2.7.X, Ambari adds additional directories besides the default /hadoop/hdfs directory path. Review the HDFS NameNode and DataNode directories, the Yarn local directories, and the other directories listed in “Customize services” on page 362 to ensure that only the required directories are listed. For example, when you are integrating/unintegrating the Scale service:
DFS NameNode:
/hadoop/hdfs/namenode,/.snapshots/hadoop/hdfs/namenode,/opt/hadoop/hdfs/namenode,/srv/hadoop/hdfs/namenode,/usr/local/hadoop/hdfs/namenode,/var/cache/hadoop/hdfs/namenode,/var/crash/hadoop/hdfs/namenode,/var/lib/libvirt/images/hadoop/hdfs/namenode,/var/lib/machines/hadoop/hdfs/namenode,/var/lib/mailman/hadoop/hdfs/namenode,/var/lib/mariadb/hadoop/hdfs/namenode,/var/lib/mysql/hadoop/hdfs/namenode,/var/lib/named/hadoop/hdfs/namenode,/var/lib/pgsql/hadoop/hdfs/namenode,/var/log/hadoop/hdfs/namenode,/var/opt/hadoop/hdfs/namenode,/var/spool/hadoop/hdfs/namenode,/var/tmp/hadoop/hdfs/namenode
DFS DataNode:
/hadoop/hdfs/data,/.snapshots/hadoop/hdfs/data,/opt/hadoop/hdfs/data,/srv/hadoop/hdfs/data,/usr/local/hadoop/hdfs/data,/var/cache/hadoop/hdfs/data,/var/crash/hadoop/hdfs/data,/var/lib/libvirt/images/hadoop/hdfs/data,/var/lib/machines/hadoop/hdfs/data,/var/lib/mailman/hadoop/hdfs/data,/var/lib/mariadb/hadoop/hdfs/data,/var/lib/mysql/hadoop/hdfs/data,/var/lib/named/hadoop/hdfs/data,/var/lib/pgsql/hadoop/hdfs/data,/var/log/hadoop/hdfs/data,/var/opt/hadoop/hdfs/data,/var/spool/hadoop/hdfs/data,/var/tmp/hadoop/hdfs/data
Even though HDFS Transparency does not use the NameNode and DataNode directories listed above, native HDFS uses them.
The default directory path is only /hadoop/hdfs/namenode and /hadoop/hdfs/data. All other
directories are not needed.
12. Ambari 2.7.x - Cannot find a valid baseurl for repo.
For Ambari 2.7.x, Ambari writes empty baseurl values to the repo files when using a local repository, causing stack installation failures.
Solution:
See AMBARI-25069/SPEC-58/BUG-116328 workaround:
For Ambari 2.7.0.0: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.0.0/bk_ambari-release-notes/content/ambari_relnotes-2.7.0.0-known-issues.html
For Ambari 2.7.1.0: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.1.0/bk_ambari-release-notes/content/ambari_relnotes-2.7.1.0-known-issues.html
For Ambari 2.7.3.0: https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/ambari-release-notes/content/known_issues.html
13. Unable to stop Scale service in Mpack 2.7.0.4.
In Mpack 2.7.0.4, if gpfs.storage.type is set to shared, stopping the Scale service from Ambari
will report a failure in the UI even if the operation had succeeded internally.
Solution:
To workaround this issue:
a. Before stopping IBM Spectrum Scale or performing a Stop All, set the Scale service to maintenance mode.



b. On the command line, stop IBM Spectrum Scale using the mmshutdown command.

# /usr/lpp/mmfs/bin/mmshutdown -a

c. Put the Scale service out of maintenance mode.


d. Start the IBM Spectrum Scale service or do Start All via Ambari.
14. During the upgrade process, while trying to unintegrate the IBM Spectrum Scale service, Ambari failed with an HTTP 500 (Internal Server Error) error.
Solution:
a. Check the Ambari log file, /var/log/ambari-server/ambari-server.log, to confirm that
the failure was because the write permission for the /var/run/ambari-server/stack-
recommendations/ directory was missing.
b. Change the owner/permission for the directory to ensure that the Ambari admin user can access
the directory.
i) Set permission to 755 as follows:

chmod 755 /var/run/ambari-server/stack-recommendations/

ii) If Ambari is running with root privileges, then set the owner and group as follows:

chown root:root /var/run/ambari-server/stack-recommendations/

iii) If Ambari is running with non root privileges, then set the owner and group as follows:

chown AMBARI_USER:AMBARI_USER_GROUP /var/run/ambari-server/stack-recommendations/

15. Upgrade failures from Mpack 2.7.0.3 or earlier to Mpack 2.7.0.4 - 2.7.0.6.
Trying to upgrade from Mpack 2.7.0.3 or earlier to Mpack 2.7.0.4 or later results in the following error:

2020-04-28 11:21:48,488 WARN [ambari-client-thread-6537] HttpChannel:585 - /api/v1/clusters/DSL1/credentials/kdc.admin.credential
com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected a string but was BEGIN_OBJECT at line 1 column 2

The issue comes from Ambari 2.7.4.0, which is used for HDP 3.1.4. The REST payload used in older Ambari releases cannot be wrapped with a single quote (') character. For more information, see https://issues.apache.org/jira/browse/AMBARI-9016. Therefore, the Mpack 2.7.0.4 SpectrumScale_UpgradeIntegrationPackage --preEU command cannot run from the upgrade_Mpack directory.
Solution
a. Execute the SpectrumScale_UpgradeIntegrationPackage --preEU command from the
currently_installed_Mpack directory.
b. Copy the generated files created by the SpectrumScale_UpgradeIntegrationPackage
--preEU command in the currently_installed_Mpack directory to the upgrade_Mpack
directory. These files are later used by the SpectrumScale_UpgradeIntegrationPackage
--postEU command.
Generated files to be copied:

gpfs-master-node.txt
gpfs-nodes.txt
gpfs-site.json
gpfs-filesystem.json
gpfs-advance.json
gpfs-ambari-server-env.json
gpfs-env.json

16. When you are uninstalling the Mpack, the Mpack uninstaller script might throw a Kerberos credential
error even when the correct credentials were provided.



$ cd /root/GPFS_Ambari/currently_installed_Mpack
$ ./SpectrumScaleMPackUninstaller.py
***Starting the Spectrum Scale Mpack Uninstaller v2.7.0.3***
...
Enter kdc principal: root/[email protected]
Enter kdc password:
INFO: Kerberos is Enabled. Proceeding with Configuration
WARN: {
"status" : 400,
"message" : "Invalid Request: Malformed Request Body.
An exception occurred parsing the request body: Unexpected character
(''' (code 39)): expected a valid value (number, String, array, object,
'true', 'false' or 'null')\n at [Source: java.io.StringReader@4f83866d;
line: 1, column: 3]"
}
MISSING_CREDENTIALS
WARN: Kerberos authentication not successful. Enter correct credentials of Kerberos KDC
Server in Spectrum Scale configuration.
INFO: Spectrum Scale service is not added to Ambari.
INFO: Spectrum Scale MPack exists. Removing the MPack.
...
INFO: Ambari server restart completed successfully.
INFO: Spectrum Scale MPack removal completed successfully.

Solution:
This error can occur if the Mpack is not compatible with the currently installed version of Ambari as
per the Mpack support matrix.
The error messages can be ignored. The Uninstaller script can continue execution and successfully
uninstall the existing Mpack.

Service fails to start


1. HDFS does not start after adding IBM Spectrum Scale service or after running an
Integrate_Transparency or an Unintegrate_Transparency UI actions in HA mode.
Solution:
After Integrate_Transparency or Unintegrate_Transparency in HA mode, if the HDFS service or its
components (for example, NameNodes) do not come up during start, then do the following:
• Check if the zkfc process is up by running the following command on each NameNode host:

# ps -eaf | grep zkfc

If the zkfc process is up, kill the zkfc process from the NameNode host by running the kill -9
command on the pid.
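For example, assuming the ps output shows a zkfc process with PID 12345 (an illustrative value only), terminate it with:

# kill -9 12345
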
• Once the zkfc process is not running in any NameNode host, go into the HDFS service dashboard
and do a Start the HDFS service.

2. In a non-root Ambari environment, IBM Spectrum Scale fails to start because the NFS mount point is not accessible by root.
Solution: For example, /usrhome/am_agent is an NFS mount point with permission set to 700.
The following error is seen:

2017-04-04 15:42:49,901 - ========== Check for changes to the configuration. ===========


2017-04-04 15:42:49,901 - Updating remote.copy needs service reboot.
2017-04-04 15:42:49,901 - Values don't match for gpfs.remote.copy.
running_config[gpfs.remote.copy]:
sudo wrapper in use; gpfs_config[gpfs.remote.copy]: /usr/bin/scp
2017-04-04 15:42:49,902 - Updating remote.shell needs service reboot.
2017-04-04 15:42:49,902 - Values don't match for gpfs.remote.shell.
running_config[gpfs.remote.shell]:
/usr/lpp/mmfs/bin/sshwrap; gpfs_config[gpfs.remote.shell]: /usr/bin/ssh
2017-04-04 15:42:49,902 - Shutdown all gpfs clients.
2017-04-04 15:42:49,902 - Run command: sudo /usr/lpp/mmfs/bin/mmshutdown -N
k-001,k-002,k-003,k-004
2017-04-04 15:44:03,608 - Status: 0, Output:



Tue 4 Apr 15:42:50 CEST 2017: mmshutdown: Starting force unmount of GPFS file systems
k-003.gpfs.net: mmremote: Invalid current working directory detected: /usrhome/am_agent

To resolve this issue, change the permissions of the home directory of the GPFS non-root user to at least 711.
Example: For the /usrhome/am_agent directory, set at least a 711 permission (rwx--x--x).
Here, 7 = rwx for the user, 1 = x for the group, and 1 = x for others; the x bit allows users to cd into the home directory.
This is required because the IBM Spectrum Scale commands do a cd into the home directory of the user. Therefore, the permission should be set to at least 711.
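For example, for the /usrhome/am_agent home directory used above:

# chmod 711 /usrhome/am_agent
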
3. Accumulo Tserver failed to start.
Solution: Accumulo Tserver might go down. Ensure that the block size is set to the IBM Spectrum
Scale file system value.
• In Accumulo > Configs > Custom accumulo-site, set the tserver.wal.blocksize to <GPFS File
system block size of the data pool>.
For example, tserver.wal.blocksize = 2097152.

[root@c902f05x04 ~]# mmlsfs /dev/bigpfs -B


flag value description
------------------- ------------------------ ------------------------------------
B 262144 Block size (system pool)
2097152 Block size (other pools)
[root@c902f05x04 ~]#

• Restart Accumulo service.


• Run Accumulo service check.
From Ambari GUI > Accumulo > Service Actions > Run Service Check.
For additional failures, see What to do when the Accumulo service start or service check fails?.
4. Hive fails to start, with the default HDP and Ambari recommendations
Solution:
There is a bug in HDP 3.0 (https://hortonworks.jira.com/projects/SPEC/issues/SPEC-18) that causes Accumulo to use the same port number as HiveServer2, leading to a port binding conflict. As a workaround, use one of the following configurations:
a. Put Accumulo and HiveServer2 on different hosts, or
b. Use a non-default port for either of them.
5. Kafka service fails to start if Kafka is added after IBM Spectrum Scale.
Solution:
Check the Kafka configuration log directory to see if the Kafka log directory contains the IBM
Spectrum Scale shared mount point (/<gpfs mount point>/kafka-logs). Remove the share
mount point from the directory list and restart the Kafka service. For more information, see “Adding
Services” on page 360.
6. HBase service fails to start.
Solution:
If IBM Spectrum Scale is integrated, Hbase Master fails to start or goes down. This could be because
of stale znodes in Zookeeper created for Hbase.
To clean znodes of Hbase, perform the following:
a. Log in to zookeeper by executing the following command:



/usr/hdp/current/zookeeper-server/bin/zkCli.sh -server <any zookeeper hostname>

b. rmr /hbase-unsecure or rmr /hbase-secure (depending on kerberos enabled/disabled).


7. Start All services fails because zkfc fails to start, putting the NameNodes into standby mode.
This error occurs with the following sequence of steps: enable NameNode HA in native HDFS, then integrate the IBM Spectrum Scale service to use HDFS Transparency, and later enable Kerberos. While Kerberos is being enabled, zkfc fails to start. This leaves both NameNodes in standby mode, so HDFS cannot be used and the Start All services operation fails.
Solution:
Restart HDFS or restart all services from Ambari.
8. NameNode and DataNodes failed to start with mapreduce.tar.gz error.
For Mpack version 2.7.0.0, the NameNode and DataNode might fail to start with the following error
message when the data directory (gpfs.data.dir) is specified through the Ambari IBM Spectrum
Scale UI: Failed to replace mapreduce.tar.gz with Transparency jars.
Solution:
Follow these steps to set the mapreduce.tar.gz into a proper directory for the NameNode/
DataNode to start:
Check if the /<gpfs.mnt.dir>/<gpfs.data.dir>/hdp/apps/<hdp-version>/mapreduce/
directory exists. If not, create the directory by running the following command:

mkdir -p /<gpfs.mnt.dir>/<gpfs.data.dir>/hdp/apps/<hdp-version>/mapreduce/

Copy the mapreduce.tar.gz from HDP to the Scale directory by running the following command:

cp /usr/hdp/<hdp-version>/hadoop/mapreduce.tar.gz
/<gpfs.mnt.dir>/<gpfs.data.dir>/hdp/apps/<hdp-version>/mapreduce/mapreduce.tar.gz

where, <gpfs.mnt.dir> is the IBM Spectrum Scale mount point <gpfs.data.dir> is the IBM
Spectrum Scale data directory <hdp-version> is the HDP version and can be obtained by running
hdp-select versions. For example,

mkdir -p /bigpfs/datadir1/hdp/apps/2.6.5.0-292/mapreduce/
cp /usr/hdp/2.6.5.0-292/hadoop/mapreduce.tar.gz
/bigpfs/datadir1/hdp/apps/2.6.5.0-292/mapreduce/mapreduce.tar.gz

Restart the failed HDFS components.


9. The zookeeper failover controller (ZKFC) fails during the Start All operation after integrating IBM
Spectrum Scale service with NameNode High Availability for the first time.
There is a timing issue during the formatting of the zookeeper directory, which is shared by both ZKFCs in HA mode, regarding which ZKFC should be started first.
Solution:
Rerun the Start All operation to get the services back up.
10. The zkfc fails to start when Kerberos is enabled.
The zkfc might fail to start with Can't set priority for process error if IBM Spectrum Scale
is first added to an HA enabled HDP cluster before adding Kerberos. The hdfs_jaas.conf file might
not be generated during the kerberos enablement action.
Solution:
a. Create the hdfs_jaas.conf file in the /etc/hadoop/conf/secure directory, on both the
NameNodes.
For example,



# cat /etc/hadoop/conf/secure/hdfs_jaas.conf

Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="/etc/security/keytabs/nn.service.keytab"
principal="nn/[email protected]";
};

Note: Ensure that you change the keyTab and principal values based on your environment.
b. If /etc/hosts is used for hostname resolution instead of DNS in your environment, use the
FQDN hostname in /etc/hosts.
Ensure that the output from the command hostname matches the following:
i) Hostname specified in the Ambari wizard.
ii) IP/hostname used for DNS.
Check the same for all the hosts in the cluster and restart HDFS.
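A minimal illustrative /etc/hosts entry (the IP address and names are examples only) looks like the following; the output of the hostname command on that node can then be compared with the FQDN used in the Ambari wizard and in /etc/hosts:

192.0.2.10   nn1.example.com   nn1
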
11. When the Scale service is unintegrated, the Active NameNode starts whereas the standby NameNode
fails to start with Failed to start namenode.java.io.FileNotFoundException: No
valid image files found error message in the /var/log/hadoop/hdfs/hadoop-hdfs-
namenode-<standby_namenode>.log file:

ERROR namenode.NameNode (NameNode.java:main(1774)) -


Failed to start namenode.java.io.FileNotFoundException:
No valid image files found at
org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.
getLatestImages(FSImageTransactionalStorageInspector.java:165)

This is because the dfs.namenode.name.dir (default path: /hadoop/hdfs/namenode) directory is empty.
Solution:
Because the Active NameNode is up and running, run the following steps to start the Standby
NameNode:
a. Run the following commands only on the Standby NameNode:

# su - hdfs
# hdfs namenode -bootstrapStandby

Note: Do not run this command on the Active NameNode.


This command tries to recover all the metadata on the Standby NameNode.
b. Restart both the ZKFailover Controllers from Ambari.
c. Restart the Standby NameNode from Ambari.
12. In a SLES environment, the NameNode might fail to start due to an Out of Memory error with the following error message: Exiting with status 1: java.lang.OutOfMemoryError: unable to create new native thread.
Solution:
Increase the NameNode Heap Size to at least 2GB in Ambari HDFS configuration and restart the
NameNodes.
13. In a SLES environment, the Zeppelin Notebook service stop action can be stuck for a long period of time.
Solution:
Stop and start the Zeppelin Notebook service to get out of the hang situation.



14. ZKFC fails to start because the hdfs_jaas.conf file is missing when Kerberos is enabled while IBM Spectrum Scale is integrated.
Error message:

2019-05-08 13:34:44,595 WARN zookeeper.ClientCnxn (ClientCnxn.java:startConnect(1014)) -


SASL configuration failed: javax.security.auth.login.LoginException: Zookeeper client
cannot
authenticate using the Client section of the supplied JAAS configuration:
'/usr/hdp/3.1.0.0-78/hadoop/conf/secure/hdfs_jaas.conf' because of a RuntimeException:
java.lang.SecurityException: java.io.IOException: /usr/hdp/3.1.0.0-78/hadoop/conf/secure/
hdfs_jaas.conf
(No such file or directory) Will continue connection to Zookeeper server without SASL
authentication,
if Zookeeper server allows it.

Solution:
a. Copy the /etc/hadoop/conf/secure/hdfs_jaas.conf into /usr/hdp/3.1.0.0-78/
hadoop/conf/secure/hdfs_jaas.conf on all the NameNodes.
b. Restart ZKFC.
15. When Kerberos is enabled on RH 7.5, the ZKFController fails with the following errors:

2019-05-06 06:10:09,974 ERROR client.ZooKeeperSaslClient


(ZooKeeperSaslClient.java:createSaslToken(388)) -
An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS
initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired
(32) - PROCESS_TGS)])
occurred when evaluating Zookeeper Quorum Member's received SASL token.
Zookeeper Client will go to AUTH_FAILED state.

2019-05-06 06:10:09,974 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL


authentication with Zookeeper
Quorum member failed: javax.security.sasl.SaslException: An error:
(java.security.PrivilegedActionException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid
credentials provided
(Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper
Quorum Member's
received SASL token. Zookeeper Client will go to AUTH_FAILED state.

2019-05-06 06:10:10,081 ERROR ha.ActiveStandbyElector


(ActiveStandbyElector.java:fatalError(719)) -
Unexpected Zookeeper watch event state: AuthFailed

2019-05-06 06:10:10,081 ERROR ha.ZKFailoverController


(ZKFailoverController.java:fatalError(379)) -
Fatal error occurred:Unexpected Zookeeper watch event state: AuthFailed

2019-05-06 06:10:10,081 FATAL tools.DFSZKFailoverController


(DFSZKFailoverController.java:main(197)) -
DFSZKFailOverController exiting due to earlier exception java.io.IOException:
Couldn't determine existence of znode '/hadoop-ha/nn'

2019-05-06 06:10:10,083 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with


status 1:
java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/nn'

2019-05-06 06:10:10,085 INFO tools.DFSZKFailoverController (LogAdapter.java:info(49)) -


SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dn01-dat.gpfs.net/30.1.1.15
************************************************************/

Solution:
The default KDC version on RHEL 7.5 has a known bug. You need to upgrade the krb5 server packages to version 1.15.1-19 or later.
Steps:
a. Check the installed version of krb on all the hosts.



# yum list installed | grep krb

b. Stop Kerberos.

# systemctl stop krb5kdc


# systemctl stop kadmin

c. Upgrade the krb5 server, libs, and workstation packages to 1.15.1-19 on the Ambari server and all the Ambari agent nodes.
For example:

# rpm -Uvh krb5-workstation-1.15.1-19.el7.ppc64le.rpm krb5-libs-1.15.1-19.el7.ppc64le.rpm libkadm5-1.15.1-19.el7.ppc64le.rpm

d. Start Kerberos.

# systemctl start krb5kdc


# systemctl start kadmin

e. Restart Ambari server.

# ambari-server restart

f. Restart ZKFController in Ambari.


For additional information, see 2nd generation HDFS Protocol troubleshooting DataNode reports
exceptions after Kerberos is enabled on RHEL7.5.
16. Yarn Timeline Service 2.0 fails to start.
In HDP 3.0: The Timeline Service 2.0 in Yarn fails to start.
Solution:
There is a new implementation of Timeline service in HDP 3.0 named Timeline Service 2.0. It can run
in 2 modes (Embedded mode or System service mode) depending on the cluster capacity. To check
which mode is being set, filter the search for is_hbase_system_service_launch under YARN
configuration. If this value is checked, it is running in the system service mode. If it is running in the
system service mode, follow the set of best practices from the Enable System Service Mode.
The following important steps should be performed after Integrating/UnIntegrating the IBM
Spectrum Scale service and Enabling/Disabling Kerberos: Remove ats-hbase before switching
between clusters.
If you get the ERROR client.ApiServiceClient: Failed to destroy service ats-
hbase, because it is still running error above, perform the following steps:
a. Check the status of the ats-hbase service by executing the following command:

yarn app -status ats-hbase

b. If the state is STOPPED, then perform the following steps:


Get the application_ID from the ResourceManager UI in the Ambari GUI and run:

yarn -kill -appId <application_ID>


yarn app -destroy ats-hbase

Might need to remove the /<gpfs.mnt.dir value>/<gpfs.data.dir value>/user/


yarn-ats/{stack-version} directory.
For example,

rm -rf /gpfs/datadir_1/user/yarn-ats/{stack-version}

c. Run all the service checks to ensure that all the services are successful.



Service check failures
1. MapReduce service check fails
Solution:
a. If the MapReduce service check failed with /user/ambari-qa not found:
Look for the ambari-qa folder in the DFS user directory. If it does not exist, create it. If this step is
skipped, MapReduce service check will fail with the /user/ambari-qa path not found error.
As root:
• mkdir <gpfs mount>/user/ambari-qa
• chown ambari-qa:hadoop <gpfs mount>/user/ambari-qa
b. If the MapReduce service check times out or the job fails due to a permission failure for yarn:
For Yarn, ensure that the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs directories are writable by the user yarn. If this is not the case, add write permission to the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs directories for the user yarn.
2. What to do when the Accumulo service start or service check fails?
Solution:
Note:
• Ensure that the HDFS and Zookeeper services are running before you proceed.
• In a non-root environment, run the commands in the workaround steps only after logging in as the non-root user.
• If GPFS is unintegrated, remove the tserver.wal.blocksize entry from Accumulo. From Ambari,
go to Accumulo > Configs > Custom accumulo-site and remove the tserver.wal.blocksize
value and save the configuration.
• If GPFS is integrated, follow the workaround for tserver.wal.blocksize as mentioned in the
FAQ Accumulo Tserver failed to start.
If the problem still exists, contact IBM service.
3. Atlas service check fails.
Solution:
a. Restart the Ambari Infra service.
b. Restart the Hbase service.
c. Re-run the Atlas service check.
4. Falcon service check fails.
Solution:
This is a known issue with HDP. For information on resolving this issue, see Falcon Web UI is inaccessible (HTTP 503 error) and Ambari Service Check for Falcon fails: "ERROR: Unable to initialize Falcon Client object".
5. What to do when the Hive service check fails with the following error:

Templeton Smoke Test (ddl cmd): Failed. : {"error":"Unable to access program: /usr/hdp/${hdp.version}/hive/bin/hcat"}http_code <401>

Solution:
HDP is not able to properly parse the ${hdp.version} value. To set the HDP version, execute the
following steps:
a. Get the HDP version for your environment by running the /usr/bin/hdp-select versions
command on any Ambari node.



b. In the Ambari GUI, click HIVE > Configs. Find the templeton.hcat field under the Advanced
webhcat-site.
c. Replace the ${hdp.version} in the templeton.hcat field with the hardcoded hdp.version
value found in step a.
For example, if the value of hdp.version is 2.6.5.0-292, set the templeton.hcat value
from /usr/hdp/${hdp.version}/hive/bin/hcat to /usr/hdp/2.6.5.0-292/hive/bin/
hcat.
d. Restart the HIVE service components.
e. Re-run the HIVE service check.

Chapter 6. Apache Hadoop

Apache Hadoop 3.0.x Support


Apache Hadoop 3.0.x is supported with HDFS Transparency 3.0.0. When you use Apache Hadoop,
the configuration files of HDFS Transparency are located under /var/mmfs/hadoop/etc/hadoop. By
default, the logs of HDFS Transparency are located under /var/log/transparency/.
If you want to run Apache Hadoop with HDFS Transparency 3.0, execute the following steps:
1. Set ulimit nofile to 64K on all the nodes.
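For example, one common way to do this (an illustrative sketch; adjust the values and method to your own policy) is to add nofile limits to /etc/security/limits.conf and verify them in a new shell:

# cat >> /etc/security/limits.conf <<EOF
*    soft    nofile    65536
*    hard    nofile    65536
EOF
# ulimit -n
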
2. Set up the ntp server to synchronize the time on all nodes.
3. Root password-less access from NameNodes to all other DataNodes.
For more details, see “Password-less ssh access” on page 53.
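A minimal sketch of one way to set this up from a NameNode, assuming the root user and illustrative DataNode hostnames (the referenced section describes the supported procedure in detail):

# ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# ssh-copy-id root@dn1.example.com
# ssh-copy-id root@dn2.example.com
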
4. Install HDFS Transparency 3.0.0-x (gpfs.hdfs-protocol-3.0.0-x.<arch>.rpm) on all HDFS
Transparency nodes.
5. ssh to TransparencyNode1.
6. Update the /var/mmfs/hadoop/etc/hadoop/core-site.xml with your NameNode hostname.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://c8f2n04:8020</value>
</property>
</configuration>

7. Update the /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml according to your configuration.

Configuration                  Default   Recommendation                         Comment
dfs.replication                1         1, 2, or 3                             Check your file system with mmlsfs <fs-name> -r and update this configuration according to the value reported by mmlsfs.
dfs.blocksize                  N/A       134217728, 268435456, or 536870912     Usually, 128 MB (134217728) is used; 512 MB (536870912) might be used for an IBM ESS file system.
dfs.client.read.shortcircuit   false     true                                   See “Short-circuit read (SSR)” on page 427.

For other configurations, take the default value.


8. Update the /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml for the gpfs.mnt.dir,
gpfs.data.dir and gpfs.storage.type configurations.
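For example, a minimal gpfs-site.xml sketch with illustrative values (replace the mount point, data directory, and storage type with the values for your cluster):

<configuration>
  <property>
    <name>gpfs.mnt.dir</name>
    <value>/gpfs/fs1</value>
  </property>
  <property>
    <name>gpfs.data.dir</name>
    <value>datadir_1</value>
  </property>
  <property>
    <name>gpfs.storage.type</name>
    <value>shared</value>
  </property>
</configuration>
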
9. Update the /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh and change export
JAVA_HOME= into export JAVA_HOME=<your-real-JDK8-Home-Dir>.
10. Update the /var/mmfs/hadoop/etc/hadoop/workers to add DataNodes. One DataNode
hostname per line.
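For example, a workers file for a cluster with three DataNodes (hostnames are illustrative) contains one hostname per line:

dn1.example.com
dn2.example.com
dn3.example.com
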
11. Synchronize all these changes into other DataNodes by executing the following command:



/usr/lpp/mmfs/bin/mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop

12. Start HDFS Transparency by executing mmhadoopctl:

/usr/lpp/mmfs/bin/mmhadoopctl connector start

13. Check the service status of HDFS Transparency by executing mmhadoopctl:

/usr/lpp/mmfs/bin/mmhadoopctl connector getstate

Note: If HDFS Transparency is not up on some nodes, login to those nodes and check the logs
located under /var/log/transparency. If you do not get any errors, HDFS Transparency should
be up by now.
If you want to configure the Yarn, execute the following steps:
a. Download Apache Hadoop 3.0.x from Apache Hadoop website.
b. Unzip the packages to /opt/Hadoop-3.0.x on HadoopNode1.
c. Log in to HadoopNode1.
d. Copy the hadoop-env.sh, hdfs-site.xml, and workers from /var/mmfs/hadoop/etc/
hadoop on HDFS Transparency node to HadoopNode1:/opt/hadoop-3.0.x/etc/hadoop/.
e. Copy /usr/lpp/mmfs/hadoop/template/mapred-site.xml.template and /usr/lpp/
mmfs/hadoop/template/yarn-site.xml.template from HDFS Transparency node
into HadoopNode1:/opt/hadoop-3.0.x/etc/hadoop as mapred-site.xml and yarn-
site.xml.
f. Update /opt/hadoop-3.0.x/etc/hadoop/mapred-site.xml with the correct path location
for yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env
configurations.
For example, change the value from HADOOP_MAPRED_HOME=/opt/hadoop-3.0.2 to
HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x
Note: /opt/hadoop-3.0.x is the real location for Hadoop.
g. Update /opt/hadoop-3.0.x/etc/hadoop/yarn-site.xml. Especially configuring the
correct hostname for yarn.resourcemanager.hostname.
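For example, a minimal sketch of the relevant property (the hostname is illustrative):

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm1.example.com</value>
</property>
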
h. Synchronize /opt/hadoop-3.0.x from HadoopNode1 to all other Hadoop nodes and keep the
same location for all hosts.
i. On the Resource Manager node, run the following command to start the Yarn service:

#cd /opt/hadoop-3.0.x/sbin/
#export YARN_NODEMANAGER_USER=root
#export YARN_RESOURCEMANAGER_USER=root
#./start-yarn.sh

Note: By default, the logs for Yarn service will be under /opt/hadoop-3.0.x/logs. If you
plan to start Yarn services with other user name, you could change the user root in the above
command to your target user name.
j. Run the following command to submit word count jobs:

#/opt/hadoop-3.0.x/bin/hadoop dfs -put /etc/passwd /passwd
#/opt/hadoop-3.0.x/bin/hadoop jar /opt/hadoop-3.0.x/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.2.jar wordcount /passwd /results

The Yarn service works well if the word count job executed successfully.
For more information, see Chapter 3, “IBM Storage Scale Hadoop performance tuning guide,” on
page 253.



Enabling Kerberos with Apache Hadoop and CES HDFS
This section lists the steps to enable Kerberos with Apache Hadoop and CES HDFS Transparency.

Setting up the Kerberos server


This topic lists the steps to set up the Kerberos server.
Before following these steps, see the “Prerequisites” on page 81 topic.
1. Install and configure the Kerberos server.

yum install krb5-server krb5-libs krb5-workstation

2. Create /etc/krb5.conf with the following contents:

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

[libdefaults]
default_realm = IBM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
default_realm = IBM.COM

[realms]
IBM.COM = {
kdc = {KDC_HOST_NAME}
admin_server = {KDC_HOST_NAME}
}

[domain_realm]
.ibm.com = IBM.COM
ibm.com = IBM.COM

Note: The KDC_HOST_NAME and IBM.COM values should reflect the correct host and REALM based on your environment.
3. Set up the server.

kdb5_util create -s

systemctl start krb5kdc


systemctl start kadmin
chkconfig krb5kdc on
chkconfig kadmin on

4. Add the admin principal, and set the password.

kadmin.local -q "addprinc root/admin"

Check the kadm5.acl to ensure that the entry is correct.

cat /var/kerberos/krb5kdc/kadm5.acl
*/[email protected]

systemctl restart krb5kdc.service

systemctl restart kadmin.service

5. Ensure that the password is working by running the following command:

kadmin -p root/[email protected]



Setting up Kerberos for HDFS Transparency nodes
This topic lists the steps to set up the Kerberos clients on the HDFS Transparency nodes. These
instructions work for both Cloudera Private Cloud Base and Apache Hadoop distributions.
Before following these steps, see the “Prerequisites” on page 81 topic.
1. Install the Kerberos clients package on all the HDFS Transparency nodes.

yum install -y krb5-libs krb5-workstation

2. Copy the /etc/krb5.conf file to the Kerberos client hosts on the HDFS Transparency nodes.
3. Create a directory for the keytab directory and set the appropriate permissions on each of the HDFS
Transparency node.

mkdir -p /etc/security/keytabs/
chown root:root /etc/security/keytabs
chmod 755 /etc/security/keytabs

4. Create KDC principals for the components, corresponding to the hosts where they are running, and
export the keytab files as follows:

Service    User:Group    Daemons          Principal                                   Keytab File Name
HDFS       root:root     NameNode         nn/<NN_Host_FQDN>@<REALM-NAME>              nn.service.keytab
                         NameNode HTTP    HTTP/<NN_Host_FQDN>@<REALM-NAME>            spnego.service.keytab
                         NameNode HTTP    HTTP/<CES_HDFS_Host_FQDN>@<REALM-NAME>      spnego.service.keytab
                         DataNode         dn/<DN_Host_FQDN>@<REALM-NAME>              dn.service.keytab

Replace the <NN_Host_FQDN> with the HDFS Transparency NameNode hostname and the <DN_Host_FQDN> with the HDFS Transparency DataNode hostname. Replace the <CES_HDFS_Host_FQDN> with the CES hostname configured for your CES HDFS cluster.
You need to create one principal for each HDFS Transparency DataNode and two principals for each
HDFS Transparency NameNode.
Note: If you are using CDP Private Cloud Base, Cloudera Manager creates the principals and keytabs
for all the services except the IBM Storage Scale service. Therefore, you can skip the create service
principals section below and go directly to step a.
If you are using Apache Hadoop, you need to create service principals for YARN and Mapreduce
services as shown in the following table:



Service     User:Group      Daemons                        Principal                                     Keytab File Name
YARN        yarn:hadoop     ResourceManager                rm/<Resource_Manager_FQDN>@<REALM-NAME>       rm.service.keytab
                            NodeManager                    nm/<Node_Manager_FQDN>@<REALM-NAME>           nm.service.keytab
Mapreduce   mapred:hadoop   MapReduce Job History Server   jhs/<Job_History_Server_FQDN>@<REALM-NAME>    jhs.service.keytab

Replace the <Resource_Manager_FQDN> with the Resource Manager hostname, the


<Node_Manager_FQDN> with the Node Manager hostname and the <Job_History_Server_FQDN>
with the Job History Server hostname.
a. Create service principals for each service. Refer to the sample table above.

kadmin.local -q "addprinc -randkey -maxrenewlife 7d +allow_renewable {Principal}"

For example:

kadmin.local -q "addprinc -randkey -maxrenewlife 7d +allow_renewable nn/


[email protected]"

b. Create host principals for each Transparency host.

kadmin.local -q "addprinc -randkey host/{HOST_NAME}@<Realm Name>"

For example:

kadmin.local -q "addprinc -randkey host/[email protected]"

c. If you are using RHEL 9.1+ for Power LE, update the principals to include the
+requires_preauth attribute.
For all the host and service principals created under the previous steps 4.a and 4.b, update the
principals to include the +requires_preauth flag, as shown in the following example:

# kadmin.local: modify_principal +requires_preauth nn/[email protected]
Principal nn/[email protected] modified

d. For each service on each Transparency host, create a keytab file by exporting its service principal
into a keytab file:

kadmin.local ktadd -k /etc/security/keytabs/{SERVICE_NAME}.service.keytab {Principal}

For example:
DataNode:

kadmin.local ktadd -k /etc/security/keytabs/dn.service.keytab dn/[email protected]

NameNode:

kadmin.local ktadd -k /etc/security/keytabs/nn.service.keytab nn/[email protected]

NameNode HTTP:



The keytab for this service needs an additional step as it contains entries for two principals – one
corresponding to the actual NameNode hostname and another for the CES IP hostname.
• First create the keytab file for HTTP service corresponding to the NameNode host.

kadmin.local ktadd -k /etc/security/keytabs/spnego.service.keytab HTTP/[email protected]

• Create a temporary keytab file for HTTP service corresponding to the CES HDFS IP hostname.

kadmin.local ktadd -norandkey -k /etc/security/keytabs/myceshdfs.service.keytab HTTP/[email protected]

• Merge the above two keytabs with the ktutil utility to create an updated spnego.service.keytab:

#ktutil
ktutil: rkt /etc/security/keytabs/myceshdfs.service.keytab
ktutil: wkt /etc/security/keytabs/spnego.service.keytab
exit

Note: myceshdfs.gpfs.net is an example of the CES IP hostname configured for your CES
HDFS service.
• Repeat the “4.a” on page 493, “4.b” on page 493, and “4.d” on page 493 steps for every
required keytab file.
Note:
• The filename for a service is common (for example, dn.service.keytab) across hosts but
the contents would be different because every keytab would have a different host principal
component.
• After a keytab is generated, move the keytab to the appropriate host immediately, or move it to a different location, to avoid the keytab being overwritten.
5. For CES HDFS NameNode HA, an HDFS admin user and its Kerberos user principal and keytab are
required to be created and setup for the CES NameNodes. These credentials are used by the CES
framework to elect an active NameNode.
This principal should map to an existing OS user on the NameNode hosts.
In this example, the OS user is hdfs. You will configure this principal/keytab into hadoop-env.sh in
step 8.
a. First create a Hadoop supergroup.
Set the dfs.permissions.superusergroup parameter to supergroup by running the following
command:

/usr/lpp/mmfs/hadoop/sbin/mmhdfs config set hdfs-site.xml -k dfs.permissions.superusergroup=supergroup

b. Create the hdfs user on all the HDFS Transparency nodes that belongs to the supergroup Hadoop
super group by using the supplied gpfs_create_hadoop_users_dirs.py command.
The command ensures that the custom user/group is created with consistent UID/GID across all
the nodes.

/usr/lpp/mmfs/hadoop/scripts/gpfs_create_hadoop_users_dirs.py --create-custom-hadoop-
user-group hdfs:supergroup

Note: If you are going to use CDP, you can skip this step. You will create this user as part of the
CDP specific configuration workflow.
c. Create the user principal.

# kadmin.local "addprinc -randkey -maxrenewlife 7d +allow_renewable ces-


<clustername>@IBM.COM"

494 IBM Storage Scale: Big Data and Analytics Guide


# kadmin.local "ktadd -k /etc/security/keytabs/ces-<clustername>.headless.keytab ces-
<clustername>@IBM.COM"

where, <clustername> is the name of your CES HDFS cluster. In case there are multiple CES
HDFS clusters sharing a common KDC server, having the cluster name as part of the principal
helps to create a user principal unique to each CES HDFS cluster.
d. Copy the /etc/security/keytabs/ces-<clustername>.headless.keytab file to all the
NameNodes and change the owner permission of the file to root:

# chown root:root /etc/security/keytabs/ces-<clustername>.headless.keytab


# chmod 400 /etc/security/keytabs/ces-<clustername>.headless.keytab

6. Copy the appropriate keytab file to each host. If a host runs more than one component (for example,
both NameNode and DataNode), copy the keytabs for both these components.
7. Set the appropriate permissions for the keytab files.
On the HDFS Transparency NameNode hosts:

chown root:root /etc/security/keytabs/nn.service.keytab


chmod 400 /etc/security/keytabs/nn.service.keytab
chown root:root /etc/security/keytabs/spnego.service.keytab
chmod 440 /etc/security/keytabs/spnego.service.keytab

On the HDFS Transparency DataNode hosts:

chown root:root /etc/security/keytabs/dn.service.keytab


chmod 400 /etc/security/keytabs/dn.service.keytab

On the Yarn resource manager hosts:

chown yarn:hadoop /etc/security/keytabs/rm.service.keytab


chmod 400 /etc/security/keytabs/rm.service.keytab

On the Yarn node manager hosts:

chown yarn:hadoop /etc/security/keytabs/nm.service.keytab


chmod 400 /etc/security/keytabs/nm.service.keytab

On Mapreduce job history server hosts:

chown mapred:hadoop /etc/security/keytabs/jhs.service.keytab


chmod 400 /etc/security/keytabs/jhs.service.keytab

8. Update the HDFS Transparency configuration files and upload the changes.
• Get the config files

mkdir /tmp/hdfsconf
mmhdfs config export /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

• Configurations in core-site.xml and hdfs-site.xml are different for HDFS Transparency 3.1.x
and HDFS Transparency 3.2.2-x/3.3.x. The configurations are as follows:
– For HDFS Transparency 3.1.x use the following configurations in core-site.xml and hdfs-
site.xml:
File: core-site.xml

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>



If you are using Cloudera Private Cloud Base cluster, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](.*@IBM.COM)s/@.*//
DEFAULT
</value>
</property>

Otherwise, if you are using Apache Hadoop, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](nm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](rm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](jhs/.*@.*IBM.COM)s/.*/mapred/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
DEFAULT
</value>
</property>

In the above example, replace IBM.COM with your Realm name and <clustername> parameter
with your actual CES HDFS cluster name.
File: hdfs-site.xml

<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>

<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>

<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>

<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>

<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/[email protected]</value>
</property>

<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.encrypt.data.transfer</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/[email protected]</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/[email protected]</value>
</property>



<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>*</value>
</property>

– For HDFS Transparency 3.2.2-x and 3.3.x use the following configurations in core-site.xml
and hdfs-site.xml:
File: core-site.xml

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>

<property>
<name>hadoop.http.authentication.type</name>
<value>kerberos</value>
</property>

<property>
<name>hadoop.http.authentication.kerberos.principal</name>
<value>*</value>
</property>

<property>
<name>hadoop.http.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

If you are using Cloudera Private Cloud Base cluster, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
RULE:[1:$1@$0](.*@IBM.COM)s/@.*//
DEFAULT
</value>
</property>

Otherwise, if you are using Apache Hadoop, create the following rules:

<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1/$2@$0](nn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](dn/.*@.*IBM.COM)s/.*/hdfs/
RULE:[2:$1/$2@$0](nm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](rm/.*@.*IBM.COM)s/.*/yarn/
RULE:[2:$1/$2@$0](jhs/.*@.*IBM.COM)s/.*/mapred/
RULE:[1:$1@$0](ces-<clustername>@IBM.COM)s/.*/hdfs/
DEFAULT
</value>
</property>

In the above example, replace IBM.COM with your Realm name and <clustername> parameter
with your actual CES HDFS cluster name.



File: hdfs-site.xml

<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>

<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>

<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>

<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>

<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/[email protected]</value>
</property>

<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.encrypt.data.transfer</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/[email protected]</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/[email protected]</value>
</property>

<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>

<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>

• File: hadoop-env.sh

KINIT_KEYTAB=/etc/security/keytabs/ces-<clustername>.headless.keytab
KINIT_PRINCIPAL=ces-<clustername>@IBM.COM

where, <clustername> is the name of your CES HDFS cluster.


9. Stop the HDFS Transparency services for the cluster.
a. Stop the DataNodes.
On any HDFS Transparency node, run the following command:

mmhdfs hdfs-dn stop

b. Stop the NameNodes.


On any CES HDFS NameNode, run the following command:

mmces service stop HDFS -N <NN1>,<NN2>



10. Import the files.

mmhdfs config import /tmp/hdfsconf core-site.xml,hdfs-site.xml,hadoop-env.sh

11. Upload the changes.

mmhdfs config upload

12. Start the HDFS Transparency services for the cluster.


a. Start the DataNodes.
On any HDFS Transparency node, run the following command:

mmhdfs hdfs-dn start

b. Start the NameNodes.


On any CES HDFS NameNode, run the following command:

mmces service start HDFS -N <NN1>,<NN2>

c. Verify that the services have started.


On any CES HDFS NameNode, run the following command:

mmhdfs hdfs status

Configuring YARN and MapReduce


This topic lists the steps to configure YARN and MapReduce with Kerberos.
For Apache Hadoop, Yarn and MapReduce need to be installed on the clients. For more information, see “HDFS clients configuration” on page 500 and “MapReduce/YARN clients configuration” on page 501.
1. Update yarn-site.xml.

<property>
<name>yarn.resourcemanager.principal</name>
<value>rm/[email protected]</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/etc/security/keytab/rm.service.keytab</value>
</property>

<property>
<name>yarn.nodemanager.principal</name>
<value>nm/[email protected]</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/etc/security/keytab/nm.service.keytab</value>
</property>

2. Update mapred-site.xml.

<property>
<name>mapreduce.jobhistory.keytab</name>
<value>/etc/security/keytab/jhs.service.keytab</value>
</property>
<property>
<name>mapreduce.jobhistory.principal</name>
<value>jhs/[email protected]</value>
</property>

3. Synchronize /opt/hadoop-3.x.x to all the other Hadoop nodes, keeping the same location on all the
hosts.
Use scp to copy the configuration files from HADOOP_HOME to the other Hadoop nodes where the
services are installed.

4. On the Resource Manager node, run the following command to start the YARN service:

cd /opt/hadoop-3.0.x/sbin/
export YARN_NODEMANAGER_USER=root
export YARN_RESOURCEMANAGER_USER=root
./start-yarn.sh

5. Run the following command to submit the word count jobs:

/opt/hadoop-3.0.x/bin/hadoop dfs -put /etc/passwd /passwd


/opt/hadoop-3.0.x/bin/hadoop jar
/opt/hadoop-3.0.x/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.2.jar
wordcount /passwd /results

Note: The successful execution of the word count job indicates that the YARN and MapReduce services
are working properly.
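
To look more closely at the result, you can list the NodeManagers that registered with the Resource Manager and inspect the job output. This is a quick optional check that reuses the paths from the commands above; the part-r-* file names assume the default MapReduce output naming.

# List the NodeManagers that registered with the Resource Manager
/opt/hadoop-3.0.x/bin/yarn node -list

# Inspect the word count output written in the previous step
/opt/hadoop-3.0.x/bin/hadoop fs -cat /results/part-r-*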

HDFS clients configuration


HDFS clients must be configured in the following way to work with the CES IP failover mechanism.
The cluster name is the CES group name without the hdfs prefix.
The values of fs.defaultFS and dfs.nameservices should be set to the cluster name (in this
example, cluster). The cluster name on the HDFS client must be the same as the one used by the
NameNodes and DataNodes.
For CES HDFS, there is only one NameNode in the HDFS client configuration. The hostname of
the CES IP configured for the CES group should be used as the NameNode value (in this example,
cesip.example.com). This is the same for both HA and non-HA configurations.
In this example, Apache Hadoop is installed in /usr/hadoop-3.1.3, so the Hadoop configuration files
are all located in /usr/hadoop-3.1.3/etc/hadoop.
For core-site.xml:
The values should be the same as in the HDFS Transparency configuration file on the NameNode.

<property>
<name>fs.defaultFS</name>
<value>hdfs://cluster</value>
</property>

For hdfs-site.xml:
Replace the HDFS Transparency NameNode with the hostname of the corresponding CES IP.

<property>
<name>dfs.nameservices</name>
<value>cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.cluster</name>
<value>nn1</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.nn1</name>
<value>cesip.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.nn1</name>
<value>cesip.example.com:50070</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>

Note: The NameNode configuration contains properties for both NameNodes, while the HDFS
clients define only one NameNode property that contains the CES IP hostname. The HDFS clients
in the CES HDFS environment know about only one NameNode and communicate with CES
HDFS Transparency through this IP. High availability is achieved by failing over the IP to another
NameNode. This is handled by CES and is transparent to HDFS clients because they always talk to the
same IP.
For hadoop-env.sh:
For Apache Hadoop, configure the properties with values based on your host environment.
Then set the JAVA_HOME value in the hadoop-env.sh file.

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
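
To confirm that the client picks up these settings, you can query the resolved NameNode address and list the file system through the nameservice. This is a minimal check that assumes the example values used in this topic (cluster as the nameservice and cesip.example.com as the CES IP hostname).

# Show the NameNode RPC address that the client resolves for the nameservice
/usr/hadoop-3.1.3/bin/hdfs getconf -confKey dfs.namenode.rpc-address.cluster.nn1

# List the root of the file system through the CES IP
/usr/hadoop-3.1.3/bin/hadoop fs -ls hdfs://cluster/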

MapReduce/YARN clients configuration


MapReduce and YARN clients must be configured to launch MapReduce workloads on YARN so that they
can read and write data from and to the IBM Storage Scale cluster.
The MapReduce/YARN client configuration files are located in the same directory as the HDFS client
configuration files.
For mapred-site.xml:
Add the following properties and set the corresponding values according to your host environment:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/hadoop-3.1.3</value>
<description>Change this to your hadoop location.</description>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/hadoop-3.1.3</value>
<description>Change this to your hadoop location.</description>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/hadoop-3.1.3</value>
<description>Change this to your hadoop location.</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
<description>Change this according to your cluster configuration.</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
<description>Change this according to your cluster configuration.</description>
</property>
</configuration>

Note: If the MapReduce job fails with return code 1, see Mapreduce container job exit with return code 1.
For yarn-site.xml:
Add the following properties and set the corresponding values according to your host environment.
In this example, c16f1n11.gpfs.net is the Resource Manager.

<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>c16f1n11.gpfs.net</value>
<description>Configure resourcemanager hostname.</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>24576</value>
</property>
</configuration>

In the workers file, add the hostnames of the nodes that will act as Node Managers.
In this example, c16f1n10.gpfs.net and c16f1n12.gpfs.net are the Node Managers.

cat workers
c16f1n10.gpfs.net
c16f1n12.gpfs.net

Start the Resource Manager and Node Managers, and then launch a MapReduce workload.

cd /usr/hadoop-3.1.3/sbin/
export YARN_NODEMANAGER_USER=root
export YARN_RESOURCEMANAGER_USER=root
./start-yarn.sh
/usr/hadoop-3.1.3/bin/yarn jar /usr/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar teragen 1000 /gen
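
If the teragen job completes successfully, you can optionally run the remaining TeraSort stages against the generated data to exercise the full MapReduce pipeline. The /gen directory is the output of the command above; /sorted and /validate are arbitrary example paths.

# Sort the generated data, then validate that the output is correctly ordered
/usr/hadoop-3.1.3/bin/yarn jar /usr/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar terasort /gen /sorted
/usr/hadoop-3.1.3/bin/yarn jar /usr/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar teravalidate /sorted /validate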

Add HDFS client to CES HDFS nodes


HDFS Transparency does not require the Hadoop distribution to be installed on the IBM
Storage Scale HDFS Transparency nodes. However, if the HDFS client is not installed on the CES HDFS
NameNodes and DataNodes, functions like distcp will not work because HDFS Transparency does
not include the bin/hadoop command.
To run the hadoop command on the HDFS Transparency nodes, the HDFS client must be installed
and configured on those nodes.
The setup and configuration are similar to “HDFS clients configuration” on page 500, but all the
configuration stays in the Hadoop distribution path; the HDFS Transparency configuration under
the /var/mmfs/hadoop/etc/hadoop path is not changed.
To install and configure the HDFS client on the HDFS Transparency nodes:
1. Download Apache Hadoop and extract the package onto each node.
2. On one of the CES HDFS nodes, modify the HADOOP_HOME/etc/hadoop configuration files in the
downloaded Hadoop distribution path based on the settings described in “HDFS clients configuration”
on page 500.
3. Manually sync (scp) the HADOOP_HOME/etc/hadoop configuration files to all the other CES HDFS
nodes.
4. Run the hadoop command from the bin directory of the Hadoop distribution.
For example:
<HADOOP_HOME>/hadoop-3.1.3/bin/hadoop dfs -ls /
or
<HADOOP_HOME>/hadoop-3.1.3/bin/hadoop distcp hdfs://nn1:8020/fileA hdfs://nn2:8020/fileB
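
For convenience, you can add the Hadoop client to the command path on each CES HDFS node so that the hadoop and hdfs commands can be run without the full path, and then confirm that the client resolves the CES HDFS nameservice. The export lines below are a sketch that assumes the distribution was extracted to /usr/hadoop-3.1.3; adjust them to your installation path.

# Make the Hadoop client commands available on the PATH
export HADOOP_HOME=/usr/hadoop-3.1.3
export PATH=$HADOOP_HOME/bin:$PATH

# Confirm that the client resolves the NameNode for the CES HDFS nameservice
hdfs getconf -namenodes

# Run a simple listing through the configured default file system
hadoop dfs -ls /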

Accessibility features for IBM Storage Scale
Accessibility features help users who have a disability, such as restricted mobility or limited vision, to use
information technology products successfully.

Accessibility features
The following list includes the major accessibility features in IBM Storage Scale:
• Keyboard-only operation
• Interfaces that are commonly used by screen readers
• Keys that are discernible by touch but do not activate just by touching them
• Industry-standard devices for ports and connectors
• The attachment of alternative input and output devices
IBM Documentation, and its related publications, are accessibility-enabled.

Keyboard navigation
This product uses standard Microsoft Windows navigation keys.

IBM and accessibility


See the IBM Human Ability and Accessibility Center (www.ibm.com/able) for more information about the
commitment that IBM has to accessibility.

Notices
This information was developed for products and services offered in the US. This material might be
available from IBM in other languages. However, you may be required to own a copy of the product or
product version in that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program, or
service that does not infringe any IBM intellectual property right may be used instead. However, it is the
user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can
send license inquiries, in writing, to:

IBM Director of Licensing IBM Corporation North Castle Drive, MD-NC119 Armonk, NY 10504-1785 US

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual
Property Department in your country or send inquiries, in writing, to:
Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 19-21, Nihonbashi-
Hakozakicho, Chuo-ku Tokyo 103-8510, Japan
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Director of Licensing IBM Corporation North Castle Drive, MD-NC119 Armonk, NY 10504-1785 US

Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided by
IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any
equivalent agreement between us.
The performance data discussed herein is presented as derived under specific operating conditions.
Actual results may vary.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.
Statements regarding IBM's future direction or intent are subject to change or withdrawal without notice,
and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject to change without
notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to change before the
products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to actual people or business enterprises is
entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform
for which the sample programs are written. These examples have not been thoroughly tested under
all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work must include
a copyright notice as follows:

© (your company name) (year).


Portions of this code are derived from IBM Corp.
Sample Programs. © Copyright IBM Corp. _enter the year or years_.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at
Copyright and trademark information at www.ibm.com/legal/copytrade.shtml.
Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or
its affiliates.
The registered trademark Linux is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or
both.
Red Hat, OpenShift®, and Ansible are trademarks or registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.
UNIX is a registered trademark of the Open Group in the United States and other countries.

Terms and conditions for product documentation


Permissions for the use of these publications are granted subject to the following terms and conditions.

IBM Privacy Policy
At IBM we recognize the importance of protecting your personal information and are committed to
processing it responsibly and in compliance with applicable data protection laws in all countries in which
IBM operates.
Visit the IBM Privacy Policy for additional information on this topic at https://www.ibm.com/privacy/
details/us/en/.

Applicability
These terms and conditions are in addition to any terms of use for the IBM website.

Personal use
You can reproduce these publications for your personal, noncommercial use provided that all proprietary
notices are preserved. You cannot distribute, display, or make derivative work of these publications, or
any portion thereof, without the express consent of IBM.

Commercial use
You can reproduce, distribute, and display these publications solely within your enterprise provided
that all proprietary notices are preserved. You cannot make derivative works of these publications, or
reproduce, distribute, or display these publications or any portion thereof outside your enterprise, without
the express consent of IBM.

Rights
Except as expressly granted in this permission, no other permissions, licenses, or rights are granted,
either express or implied, to the Publications or any information, data, software or other intellectual
property contained therein.
IBM reserves the right to withdraw the permissions that are granted herein whenever, in its discretion, the
use of the publications is detrimental to its interest or as determined by IBM, the above instructions are
not being properly followed.
You cannot download, export, or reexport this information except in full compliance with all applicable
laws and regulations, including all United States export laws and regulations.
IBM MAKES NO GUARANTEE ABOUT THE CONTENT OF THESE PUBLICATIONS. THE PUBLICATIONS
ARE PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT,
AND FITNESS FOR A PARTICULAR PURPOSE.

Glossary
This glossary provides terms and definitions for IBM Storage Scale.
The following cross-references are used in this glossary:
• See refers you from a nonpreferred term to the preferred term or from an abbreviation to the spelled-
out form.
• See also refers you to a related or contrasting term.
For other terms and definitions, see the IBM Terminology website (www.ibm.com/software/globalization/
terminology) (opens in new window).

B
block utilization
The measurement of the percentage of used subblocks per allocated blocks.

C
cluster
A loosely coupled collection of independent systems (nodes) organized into a network for the purpose
of sharing resources and communicating with each other. See also GPFS cluster.
cluster configuration data
The configuration data that is stored on the cluster configuration servers.
Cluster Export Services (CES) nodes
A subset of nodes configured within a cluster to provide a solution for exporting GPFS file systems by
using the Network File System (NFS), Server Message Block (SMB), and Object protocols.
cluster manager
The node that monitors node status using disk leases, detects failures, drives recovery, and selects
file system managers. The cluster manager must be a quorum node. The selection of the cluster
manager node favors the quorum-manager node with the lowest node number among the nodes that
are operating at that particular time.
Note: The cluster manager role is not moved to another node when a node with a lower node number
becomes active.
clustered watch folder
Provides a scalable and fault-tolerant method for file system activity within an IBM Storage Scale
file system. A clustered watch folder can watch file system activity on a fileset, inode space, or an
entire file system. Events are streamed to an external Kafka sink cluster in an easy-to-parse JSON
format. For more information, see the mmwatch command in the IBM Storage Scale: Command and
Programming Reference Guide.
control data structures
Data structures needed to manage file data and metadata cached in memory. Control data structures
include hash tables and link pointers for finding cached data; lock states and tokens to implement
distributed locking; and various flags and sequence numbers to keep track of updates to the cached
data.

D
Data Management Application Program Interface (DMAPI)
The interface defined by the Open Group's XDSM standard as described in the publication
System Management: Data Storage Management (XDSM) API Common Application Environment (CAE)
Specification C429, The Open Group ISBN 1-85912-190-X.

deadman switch timer
A kernel timer that works on a node that has lost its disk lease and has outstanding I/O requests. This
timer ensures that the node cannot complete the outstanding I/O requests (which would risk causing
file system corruption), by causing a panic in the kernel.
dependent fileset
A fileset that shares the inode space of an existing independent fileset.
disk descriptor
A definition of the type of data that the disk contains and the failure group to which this disk belongs.
See also failure group.
disk leasing
A method for controlling access to storage devices from multiple host systems. Any host that wants to
access a storage device configured to use disk leasing registers for a lease; in the event of a perceived
failure, a host system can deny access, preventing I/O operations with the storage device until the
preempted system has reregistered.
disposition
The session to which a data management event is delivered. An individual disposition is set for each
type of event from each file system.
domain
A logical grouping of resources in a network for the purpose of common management and
administration.

E
ECKD
See extended count key data (ECKD).
ECKD device
See extended count key data device (ECKD device).
encryption key
A mathematical value that allows components to verify that they are in communication with the
expected server. Encryption keys are based on a public or private key pair that is created during the
installation process. See also file encryption key, master encryption key.
extended count key data (ECKD)
An extension of the count-key-data (CKD) architecture. It includes additional commands that can be
used to improve performance.
extended count key data device (ECKD device)
A disk storage device that has a data transfer rate faster than some processors can utilize and that is
connected to the processor through use of a speed matching buffer. A specialized channel program is
needed to communicate with such a device. See also fixed-block architecture disk device.

F
failback
Cluster recovery from failover following repair. See also failover.
failover
(1) The assumption of file system duties by another node when a node fails. (2) The process of
transferring all control of the ESS to a single cluster in the ESS when the other clusters in the ESS fails.
See also cluster. (3) The routing of all transactions to a second controller when the first controller fails.
See also cluster.
failure group
A collection of disks that share common access paths or adapter connections, and could all become
unavailable through a single hardware failure.
FEK
See file encryption key.

fileset
A hierarchical grouping of files managed as a unit for balancing workload across a cluster. See also
dependent fileset, independent fileset.
fileset snapshot
A snapshot of an independent fileset plus all dependent filesets.
file audit logging
Provides the ability to monitor user activity of IBM Storage Scale file systems and store events
related to the user activity in a security-enhanced fileset. Events are stored in an easy-to-parse JSON
format. For more information, see the mmaudit command in the IBM Storage Scale: Command and
Programming Reference Guide.
file clone
A writable snapshot of an individual file.
file encryption key (FEK)
A key used to encrypt sectors of an individual file. See also encryption key.
file-management policy
A set of rules defined in a policy file that GPFS uses to manage file migration and file deletion. See
also policy.
file-placement policy
A set of rules defined in a policy file that GPFS uses to manage the initial placement of a newly created
file. See also policy.
file system descriptor
A data structure containing key information about a file system. This information includes the disks
assigned to the file system (stripe group), the current state of the file system, and pointers to key files
such as quota files and log files.
file system descriptor quorum
The number of disks needed in order to write the file system descriptor correctly.
file system manager
The provider of services for all the nodes using a single file system. A file system manager processes
changes to the state or description of the file system, controls the regions of disks that are allocated
to each node, and controls token management and quota management.
fixed-block architecture disk device (FBA disk device)
A disk device that stores data in blocks of fixed size. These blocks are addressed by block number
relative to the beginning of the file. See also extended count key data device.
fragment
The space allocated for an amount of data too small to require a full block. A fragment consists of one
or more subblocks.

G
GPUDirect Storage
IBM Storage Scale's support for NVIDIA's GPUDirect Storage (GDS) enables a direct path between
GPU memory and storage. File system storage is directly connected to the GPU buffers to reduce
latency and load on CPU. Data is read directly from an NSD server's pagepool and it is sent to the GPU
buffer of the IBM Storage Scale clients by using RDMA.
global snapshot
A snapshot of an entire GPFS file system.
GPFS cluster
A cluster of nodes defined as being available for use by GPFS file systems.
GPFS portability layer
The interface module that each installation must build for its specific hardware platform and Linux
distribution.

GPFS recovery log
A file that contains a record of metadata activity and exists for each node of a cluster. In the event of
a node failure, the recovery log for the failed node is replayed, restoring the file system to a consistent
state and allowing other nodes to continue working.

I
ill-placed file
A file assigned to one storage pool but having some or all of its data in a different storage pool.
ill-replicated file
A file with contents that are not correctly replicated according to the desired setting for that file. This
situation occurs in the interval between a change in the file's replication settings or suspending one of
its disks, and the restripe of the file.
independent fileset
A fileset that has its own inode space.
indirect block
A block containing pointers to other blocks.
inode
The internal structure that describes the individual files in the file system. There is one inode for each
file.
inode space
A collection of inode number ranges reserved for an independent fileset, which enables more efficient
per-fileset functions.
ISKLM
IBM Security Key Lifecycle Manager. For GPFS encryption, the ISKLM is used as an RKM server to
store MEKs.

J
journaled file system (JFS)
A technology designed for high-throughput server environments, which are important for running
intranet and other high-performance e-business file servers.
junction
A special directory entry that connects a name in a directory of one fileset to the root directory of
another fileset.

K
kernel
The part of an operating system that contains programs for such tasks as input/output, management
and control of hardware, and the scheduling of user tasks.

M
master encryption key (MEK)
A key used to encrypt other keys. See also encryption key.
MEK
See master encryption key.
metadata
Data structures that contain information that is needed to access file data. Metadata includes inodes,
indirect blocks, and directories. Metadata is not accessible to user applications.
metanode
The one node per open file that is responsible for maintaining file metadata integrity. In most cases,
the node that has had the file open for the longest period of continuous time is the metanode.

mirroring
The process of writing the same data to multiple disks at the same time. The mirroring of data
protects it against data loss within the database or within the recovery log.
Microsoft Management Console (MMC)
A Windows tool that can be used to do basic configuration tasks on an SMB server. These tasks
include administrative tasks such as listing or closing the connected users and open files, and creating
and manipulating SMB shares.
multi-tailed
A disk connected to multiple nodes.

N
namespace
Space reserved by a file system to contain the names of its objects.
Network File System (NFS)
A protocol, developed by Sun Microsystems, Incorporated, that allows any host in a network to gain
access to another host or netgroup and their file directories.
Network Shared Disk (NSD)
A component for cluster-wide disk naming and access.
NSD volume ID
A unique 16-digit hex number that is used to identify and access all NSDs.
node
An individual operating-system image within a cluster. Depending on the way in which the computer
system is partitioned, it may contain one or more nodes.
node descriptor
A definition that indicates how GPFS uses a node. Possible functions include: manager node, client
node, quorum node, and nonquorum node.
node number
A number that is generated and maintained by GPFS as the cluster is created, and as nodes are added
to or deleted from the cluster.
node quorum
The minimum number of nodes that must be running in order for the daemon to start.
node quorum with tiebreaker disks
A form of quorum that allows GPFS to run with as little as one quorum node available, as long as there
is access to a majority of the quorum disks.
non-quorum node
A node in a cluster that is not counted for the purposes of quorum determination.
Non-Volatile Memory Express (NVMe)
An interface specification that allows host software to communicate with non-volatile memory
storage media.

P
policy
A list of file-placement, service-class, and encryption rules that define characteristics and placement
of files. Several policies can be defined within the configuration, but only one policy set is active at one
time.
policy rule
A programming statement within a policy that defines a specific action to be performed.
pool
A group of resources with similar characteristics and attributes.

portability
The ability of a programming language to compile successfully on different operating systems without
requiring changes to the source code.
primary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data.
private IP address
An IP address used to communicate on a private network.
public IP address
An IP address used to communicate on a public network.

Q
quorum node
A node in the cluster that is counted to determine whether a quorum exists.
quota
The amount of disk space and number of inodes assigned as upper limits for a specified user, group of
users, or fileset.
quota management
The allocation of disk blocks to the other nodes writing to the file system, and comparison of the
allocated space to quota limits at regular intervals.

R
Redundant Array of Independent Disks (RAID)
A collection of two or more disk physical drives that present to the host an image of one or more
logical disk drives. In the event of a single physical device failure, the data can be read or regenerated
from the other disk drives in the array due to data redundancy.
recovery
The process of restoring access to file system data when a failure has occurred. Recovery can involve
reconstructing data or providing alternative routing through a different server.
remote key management server (RKM server)
A server that is used to store master encryption keys.
replication
The process of maintaining a defined set of data in more than one location. Replication consists of
copying designated changes for one location (a source) to another (a target) and synchronizing the
data in both locations.
RKM server
See remote key management server.
rule
A list of conditions and actions that are triggered when certain conditions are met. Conditions include
attributes about an object (file name, type or extension, dates, owner, and groups), the requesting
client, and the container name associated with the object.

S
SAN-attached
Disks that are physically attached to all nodes in the cluster using Serial Storage Architecture (SSA)
connections or using Fibre Channel switches.
Scale Out Backup and Restore (SOBAR)
A specialized mechanism for data protection against disaster only for GPFS file systems that are
managed by IBM Storage Protect for Space Management.
secondary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data in the event that
the primary GPFS cluster configuration server fails or becomes unavailable.

Secure Hash Algorithm digest (SHA digest)
A character string used to identify a GPFS security key.
session failure
The loss of all resources of a data management session due to the failure of the daemon on the
session node.
session node
The node on which a data management session was created.
Small Computer System Interface (SCSI)
An ANSI-standard electronic interface that allows personal computers to communicate with
peripheral hardware, such as disk drives, tape drives, CD-ROM drives, printers, and scanners faster
and more flexibly than previous interfaces.
snapshot
An exact copy of changed data in the active files and directories of a file system or fileset at a single
point in time. See also fileset snapshot, global snapshot.
source node
The node on which a data management event is generated.
stand-alone client
The node in a one-node cluster.
storage area network (SAN)
A dedicated storage network tailored to a specific environment, combining servers, storage products,
networking products, software, and services.
storage pool
A grouping of storage space consisting of volumes, logical unit numbers (LUNs), or addresses that
share a common set of administrative characteristics.
stripe group
The set of disks comprising the storage assigned to a file system.
striping
A storage process in which information is split into blocks (a fixed amount of data) and the blocks are
written to (or read from) a series of disks in parallel.
subblock
The smallest unit of data accessible in an I/O operation, equal to one thirty-second of a data block.
system storage pool
A storage pool containing file system control structures, reserved files, directories, symbolic links,
special devices, as well as the metadata associated with regular files, including indirect blocks and
extended attributes. The system storage pool can also contain user data.

T
token management
A system for controlling file access in which each application performing a read or write operation
is granted some form of access to a specific block of file data. Token management provides data
consistency and controls conflicts. Token management has two components: the token management
server, and the token management function.
token management function
A component of token management that requests tokens from the token management server. The
token management function is located on each cluster node.
token management server
A component of token management that controls tokens relating to the operation of the file system.
The token management server is located at the file system manager node.
transparent cloud tiering (TCT)
A separately installable add-on feature of IBM Storage Scale that provides a native cloud storage tier.
It allows data center administrators to free up on-premise storage capacity, by moving out cooler data
to the cloud storage, thereby reducing capital and operational expenditures.

twin-tailed
A disk connected to two nodes.

U
user storage pool
A storage pool containing the blocks of data that make up user files.

V
VFS
See virtual file system.
virtual file system (VFS)
A remote file system that has been mounted so that it is accessible to the local user.
virtual node (vnode)
The structure that contains information about a file system object in a virtual file system (VFS).

W
watch folder API
Provides a programming interface where a custom C program can be written that incorporates the
ability to monitor inode spaces, filesets, or directories for specific user activity-related events within
IBM Storage Scale file systems. For more information, a sample program is provided in the following
directory on IBM Storage Scale nodes: /usr/lpp/mmfs/samples/util called tswf that can be
modified according to the user's needs.


