Contents
List of Figures
Trademarks
Glossary
Notes, Cautions, and Warnings
Chapter 1: Overview
Summary
Deployment Workflow
Scatter-Gather
Display Offload Features
Interrupt Moderation and Coalescing
Process Limits
Memory Management Settings
Transparent Huge Page (THP) Compaction
Swap Settings
Secure Linux Settings
Services
Firewall Settings
Ports Listing
Disable Network Manager
Secure Shell Keys
User Accounts and Groups
Appendix H: References
About Cloudera
About Syncsort
To Learn More
List of Figures
Figure 1: Dell EMC PowerEdge FX2 Chassis Identification - Front View
Figure 2: Dell EMC Ready Bundle for Cloudera Hadoop Cluster Networking
List of Tables
Table 1: Deployment Workflow
Table 5: Dell EMC PowerEdge R730xd Tested BIOS and Firmware Versions
Table 6: Dell EMC PowerEdge FX2/FC630 Tested BIOS and Firmware Versions
Table 25: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Infrastructure Node Settings
Table 26: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Worker Node Settings
Table 37: Dell EMC Ready Bundle for Cloudera Hadoop Support Matrix
Trademarks
Copyright © 2011-2017 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks
are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective
owners.
This document is for informational purposes only, and may contain typographical errors and technical
inaccuracies. The content is provided as is, without express or implied warranties of any kind.
Glossary
ASCII
American Standard Code for Information Interchange, a binary code for alphanumeric characters
developed by ANSI®.
BMC
BMP
CDH
Clos
A multi-stage, non-blocking network switch architecture. It reduces the number of required ports within a
network switch fabric.
CMC
DBMS
DTK
EBCDIC
Extended Binary Coded Decimal Interchange Code, a binary code for alphanumeric characters developed
by IBM®.
ECMP
EDW
EoR
End-of-Row Switch/Router
ETL
Extract, Transform, Load is a process for extracting data from various data sources; transforming the data
into proper structure for storage; and then loading the data into a data store.
HBA
HDFS
HVE
IPMI
JBOD
LACP
LAG
LOM
NIC
NTP
OS
Operating System
PAM
RPM
RSTP
RTO
SIEM
SLA
THP
ToR
Top-of-Rack Switch/Router
VLT
VRRP
YARN
Caution: A Caution indicates potential damage to hardware or loss of data if instructions are not
followed.
Warning: A Warning indicates a potential for property damage, personal injury, or death.
Chapter 1: Overview
This guide describes the prerequisites to install the Dell EMC Ready Bundle for Cloudera Hadoop on a
predefined hardware and network configuration, as specified in the current Dell EMC Ready Bundle for
Cloudera Hadoop Architecture Guide. It also covers requirements for preparing the hardware platform and
provisioning the operating system for Cloudera Enterprise 5.10 deployment.
Topics:
• Summary
• Deployment Workflow
Summary
This guide describes deploying the Dell EMC Ready Bundle for Cloudera Hadoop using either of two
server architectures:
• Dell EMC PowerEdge R730xd - A 2U rack server platform
• Dell EMC PowerEdge FX2 - A high density 2U converged infrastructure platform
Both architectures use similar server configurations and cluster layout. In the converged infrastructure
architecture, each Dell EMC PowerEdge FX2 chassis is the equivalent of two Dell EMC PowerEdge
R730xd servers in the design.
The networking architecture is the same for both, and consists of:
• A leaf-and-spine topology for the cluster production network
• A flat daisy chain of switches for a dedicated iDRAC network
Deployment Workflow
Table 1: Deployment Workflow on page 17 describes the basic Dell EMC Ready Bundle for Cloudera
Hadoop deployment sequence:
Chapter 2: Installation Prerequisites
In order to install the components that comprise the Dell EMC Ready Bundle for Cloudera Hadoop, several
prerequisites must be satisfied.
This guide assumes that you are familiar with:
• Cloudera Enterprise 5.10
• RAID and BIOS configuration of Dell EMC PowerEdge R730xd or Dell EMC PowerEdge FX2 servers
• Red Hat Enterprise Linux® (RHEL) 7.3
• Network installation
Topics:
• Software Requirements
• Equipment Requirements
• Site Planning
Software Requirements
VMware Hypervisor
The Kickstart Server is a virtual machine that you run on your laptop via any of the following VMware
hypervisor products:
• VMware ESXi™ 5.5 or above
• VMware Fusion® 6.0 or above
• VMware Workstation Pro™ 10 or above
Please contact your Dell EMC sales representative to obtain a copy of the Dell EMC Ready Bundle for
Cloudera Hadoop Architecture Guide.
Installation Packages
Release-specific packages include:
• DTK .iso file and MD5 checksum for Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FX2
servers
• Kickstart VM
• Configuration files for Dell Networking S3048-ON, S4048-ON, and S6000-ON switches
• Cut sheets for Dell Networking S3048-ON, S4048-ON, and S6000-ON switches
Non-release-specific packages include:
• Network connectivity tool
Download Procedure
To download the installation packages and prepare them for use:
1. Using a web browser, sign into your Dell Digital Locker account. Dell EMC recommends that you use a
current version of Firefox®, Chrome™, or Internet Explorer®.
2. Click on the Digital Products heading in the left-hand pane to display a list of products to which you
have access.
3. Click on the product you wish to download to display a Product Management page.
4. Click on the Download link to display an End User License Agreement (EULA).
a. Scroll to read the entire EULA in order to activate its agree/disagree buttons.
5. Click on the Yes, I Agree button to display a download method dialog window.
a. Or, click on the No, I Do Not Agree button to return to the Product Management page.
6. Select one of the following download methods:
• Download manager: A Windows program that enables multiple downloads, pause/resume
downloads, etc.
• If the download manager is not present on your system, you are offered a choice to either
download and run it, or download your product using your web browser.
• Web browser: Uses your web browser to download your product, and your system's file manager
to save or run it.
7. Click on the Download Now button to begin the download process.
a. Or, click on the Cancel button to abort the operation and return to the Product Management page.
8. Repeat Steps 2-7 for any additional downloads.
9. When finished, click on the Sign Out link atop the page.
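Because the release-specific packages ship with an MD5 checksum for the DTK .iso file, a downloaded image can be checked before it is used. The following is a minimal sketch of such a check in Python, not part of the Dell EMC tooling; the function names are illustrative, and the published checksum value is assumed to come from the download package.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte .iso is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_hex):
    """Compare against a published checksum, ignoring case and whitespace."""
    return md5_of_file(path) == published_hex.strip().lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be downloaded again before continuing.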
Equipment Requirements
Site Planning
Several site planning tasks should be completed before installation begins.
These tasks fall outside the scope of the architecture itself, so this section provides checklists that
should be reviewed and answered before beginning installation. Some of the questions are intended to
prompt additional questions.
Chapter 3: Hardware Setup
These procedures ensure that your hardware is installed correctly prior to installing the Dell EMC Ready
Bundle for Cloudera Hadoop.
Topics:
• Unpacking and Installing the Equipment
• Powering Up the Equipment
• Verifying the Equipment
• Tested BIOS and Firmware
• Dell EMC PowerEdge FX2 Setup
Before you proceed, perform the following procedures, observing all standard industry safety
practices:
1. Unpack and install the racks.
2. Unpack and install the server hardware.
3. Unpack and install the switch hardware.
4. Unpack and install the network cabling. See:
a. Server Node Connections on page 43
b. Cabling the Network Switches on page 41
5. Connect each individual machine to both power bus installations.
6. Apply power to the racks.
Note: This is usually performed by the Dell EMC EDT Team.
The cluster hardware should be verified before physical installation begins. After installation, the final
functional tests should be run.
Recommended validation steps:
1. All power-on tests should complete successfully.
2. All drives should be powered on, and the hardware diagnostic LEDs and system console should not
report any errors.
3. All nodes should be checked for correct memory size.
4. All network ports and cables should be checked for connections.
Table 5: Dell EMC PowerEdge R730xd Tested BIOS and Firmware Versions on page 27 and Table 6:
Dell EMC PowerEdge FX2/FC630 Tested BIOS and Firmware Versions on page 27 list the server BIOS
and firmware versions that were tested for the Dell EMC Ready Bundle for Cloudera Hadoop.
Table 7: Dell Networking S3048-ON Tested Firmware Versions on page 27, Table 8: Dell Networking
S4048-ON Tested Firmware Versions on page 27, and Table 9: Dell Networking S6000-ON Tested
Firmware Versions on page 27 list the switch firmware versions that were tested for the Dell EMC
Ready Bundle for Cloudera Hadoop.
Caution: You must ensure that the firmware on all servers and switches is up to date. Otherwise,
unexpected results may occur.
Table 5: Dell EMC PowerEdge R730xd Tested BIOS and Firmware Versions
Product                  Version
BIOS                     2.3.4
RAID                     25.5.0.0018_A08
NIC                      17.5.10_A00
Backplane Expander       3.31_A00-01
Non-storage Backplane    2.23_A00-00
iDRAC                    2.41.40.40_A00
Table 6: Dell EMC PowerEdge FX2/FC630 Tested BIOS and Firmware Versions
Product                  Version
CMC                      1.32.200.201601210012_A00
BIOS                     2.3.5
RAID                     25.5.0.0018_A08
NIC                      17.5.12_A00
Backplane Expander       3.31_A00-00
Non-storage Backplane    2.23_A00-00
iDRAC                    2.41.40.40_A00
Table 7: Dell Networking S3048-ON Tested Firmware Versions
Product                  Version
Firmware                 SG-9.10.0.1p13
Boot Selector            3.21.0.4 or higher
Table 8: Dell Networking S4048-ON Tested Firmware Versions
Product                  Version
Firmware                 SK-9.10.0.1p13
Boot Selector            3.21.0.4 or higher
Table 9: Dell Networking S6000-ON Tested Firmware Versions
Product                  Version
Firmware                 SI-9.10.0.1p13
Boot Selector            3.21.0.4 or higher
The Dell EMC PowerEdge FX2 requires some additional hardware setup and verification.
Chassis Identification
There are two chassis configurations for the Dell EMC PowerEdge FX2 - Infrastructure and Worker. These
chassis configurations appear physically identical, and the infrastructure nodes may have to be identified
from the actual orders, or by checking the drive quantity in the storage module.
The cabling details in Server Node Connections on page 43 are based on the sled configuration shown
in Figure 1: Dell EMC PowerEdge FX2 Chassis Identification - Front View on page 28. It may be
necessary to re-arrange the sleds to match this configuration.
a. If this is the first time the system has been powered on, the system will boot into Lifecycle
Controller for configuration.
b. If it does not, press [F2] to go into the system setup screens.
3. From the Lifecycle Controller, click on the Hardware Configuration link on the left-hand side.
4. Select the Configuration Wizards, and then select iDRAC Settings.
5. Scroll to the bottom of the iDRAC Settings page, and click on CMC Network.
6. Under the IPv4 Settings, make sure Enable IPv4 is set to Enabled.
7. Apply a Static IP Address, Subnet Mask and Gateway to the CMC.
8. Press Back, and then Finish.
9. Exit the Lifecycle Controller and reboot the server.
Flex Addressing
The FlexAddress feature in the Dell EMC PowerEdge FX2 allows the replacement of the factory-assigned
iDRAC MAC with a chassis-assigned MAC for individual slots. Whether to use FlexAddress is the customer's
choice. If it is enabled, however, remember that iDRAC MAC addresses stay with the chassis slot and will
not follow sleds when they are moved.
Chapter 4: Dell EMC Ready Bundle for Cloudera Hadoop Nodes
Several node types, each with specific functions, are included in the Dell EMC Ready Bundle for Cloudera
Hadoop. This topic provides detailed definitions of those nodes.
Topics:
• Node Definitions
Node Definitions
HA Node ZooKeeper
Quorum Journal Node
Operational Databases (PostgreSQL)
Chapter 5: Network Configuration
This section describes how to configure the network for the Dell EMC Ready Bundle for Cloudera Hadoop.
Topics:
• High-level Network Architecture
• IP Addressing
• Cluster Networks and VLANs
• Node Interface Bonds
• Domain Name System
• Network Time Protocol
• Gathering Network Information
All servers in the cluster are tied together using TCP/IP networks. These networks form a data interconnect
across which individual servers pass data back and forth, return query results, and load/unload data.
These networks are also used for management and interfaces to an existing corporate network.
A combination of network switches and Layer 2 VLANs is used to segregate traffic in the cluster. Network
interface bonding is used to provide higher performance for selected networks. A high-level overview of
the network organization is provided in Figure 2: Dell EMC Ready Bundle for Cloudera Hadoop Cluster
Networking on page 34.
The Standby Name Node will usually provide the following network services:
• NTP server (Network Time Protocol server) — makes sure all nodes are keeping the same time
• DHCP server — can be used to assign and manage IP addresses for the compute and storage nodes.
This guide uses static addressing for the cluster nodes.
Note: If the Standby Name Node does not exist in your environment, then these services must be
placed on another node.
Figure 2: Dell EMC Ready Bundle for Cloudera Hadoop Cluster Networking
IP Addressing
The IP addressing uses large subnets to support many machines on the cluster network. The cluster and
BMC/IPMI networks are Class B (/16) networks, with 65,536 IP addresses each.
In these example networks, the first 10 IP addresses are reserved for switches, routers, and firewalls.
The Edge network is a Class C (/24) network, with 256 IP addresses. The first 10 IP addresses are reserved
for switches, routers, and firewalls.
Note: Each network's ".1" address is reserved for the network gateway.
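The subnet sizes and reserved ranges described above can be checked with a short sketch using Python's standard ipaddress module. The network values used here (172.16.0.0/16 and 192.168.1.0/24) are placeholders chosen for illustration, not the addresses from this guide, and the sketch assumes that "the first 10 IP addresses" means host addresses .1 through .10.

```python
import ipaddress


def plan_network(cidr, reserved=10):
    """Summarize a subnet: size, gateway, reserved range, first node address."""
    net = ipaddress.ip_network(cidr)
    base = net.network_address
    return {
        "total_addresses": net.num_addresses,
        # The ".1" address is reserved for the network gateway.
        "gateway": str(base + 1),
        # Assumption: infrastructure devices occupy host addresses 1..reserved.
        "reserved_last": str(base + reserved),
        "first_node": str(base + reserved + 1),
    }


if __name__ == "__main__":
    for cidr in ("172.16.0.0/16", "192.168.1.0/24"):  # placeholder networks
        print(cidr, plan_network(cidr))
```

Under these assumptions, node addressing on each network would start at the eleventh host address.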
The Dell EMC Ready Bundle for Cloudera Hadoop implements three distinct VLANs for cluster functions.
The networks are described in Table 13: Cluster Networks on page 36.
Layer 2 Interface bonding is used on the core cluster network to increase performance, bandwidth, and
reliability. The recommended configuration is 802.3ad (LACP) bonding. Bonding can also be used on the
Edge network for the same reasons, depending on the interfaces required to existing networks. See:
• Active/Standby Name Nodes & HA Nodes on page 37
• Edge Node on page 37
• Worker Node on page 38
Note: The Active/Standby Name Nodes & HA Nodes hardware configurations include additional
10GbE ports, but these ports are not used.
Edge Node
Table 15: Edge Node Network Connections
Worker Node
Table 16: Worker Nodes Network Connections
The installation programs and methodologies provided in this document will result in static IP assignments,
listed in /etc/hosts, on all machines. Any updates should be applied to /etc/hosts on one machine, and
then copied to all other nodes. You must update /etc/resolv.conf to point to your DNS server of choice. Dell
EMC has defaulted to using a public DNS server (8.8.8.8) for your initial use.
Note: DNScache is installed on all hosts.
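Since /etc/hosts is maintained on one machine and then copied to the others, it can help to generate the cluster entries from a single inventory instead of editing them by hand. The sketch below assumes a simple mapping of short host names to static IP addresses; the domain cluster.example and the host names are hypothetical, and the output covers only the cluster entries, not the localhost lines.

```python
import ipaddress


def render_hosts(entries, domain="cluster.example"):
    """Build /etc/hosts lines from {short_name: ip}, sorted by IP address."""
    ordered = sorted(entries.items(), key=lambda item: ipaddress.ip_address(item[1]))
    # Each line lists the address, then the fully qualified name, then the alias.
    lines = [f"{ip}\t{name}.{domain} {name}" for name, ip in ordered]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    inventory = {"worker2": "172.16.0.12", "worker1": "172.16.0.11"}  # placeholders
    print(render_hosts(inventory), end="")
```

Sorting by address rather than by name keeps the file stable when nodes are added, which makes differences between copies easier to spot.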
Dell EMC recommends that the optional administration node attached to the data network be configured
with an authoritative DNS server. This server must have authoritative forward and reverse DNS records for
each and every host that is a member of the cluster.
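The requirement for matching forward and reverse records can be checked before the records are loaded into a DNS server. The following sketch compares two plain dictionaries and does not query DNS itself; the data layout (host name to address, and address to host name) is an assumption made for illustration.

```python
def _canonical(name):
    """Normalize a host name for comparison (case and trailing dot)."""
    return name.rstrip(".").lower()


def dns_record_gaps(forward, reverse):
    """Return a list of problems between forward {host: ip} and reverse {ip: host}."""
    problems = []
    known_hosts = {_canonical(host) for host in forward}
    for host, ip in sorted(forward.items()):
        ptr = reverse.get(ip)
        if ptr is None:
            problems.append(f"{host}: no reverse record for {ip}")
        elif _canonical(ptr) != _canonical(host):
            problems.append(f"{host}: reverse record for {ip} points to {ptr}")
    for ip, host in sorted(reverse.items()):
        if _canonical(host) not in known_hosts:
            problems.append(f"{ip}: reverse record names {host}, which has no forward record")
    return problems
```

An empty result means every host in the forward map has a reverse record that points back to it.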
Note: If you are using Cloudera BDR or DISTCP, then external access and DNS resolution are
required for all nodes in both clusters.
Information on how to configure DNS can be obtained at:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/ch-DNS_Servers.html
All nodes in an Apache Hadoop cluster require closely synchronized time. If the clocks on the machines
drift apart, unpredictable errors can occur. Cloudera Manager will also flag nodes that have
unsynchronized time. To maintain clock synchronization, the OS configuration steps set up the Network
Time Protocol (NTP) on the nodes in the cluster, with an NTP server on the Standby Name Node.
This configuration synchronizes all nodes with the Standby Name Node. To synchronize the Standby
Name Node with an external clock source, the NTP server configuration should be updated.
Note: See http://www.ntp.org/ for more information.
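As a rough way to confirm that a node agrees with the cluster's NTP server, the server can be queried with a single SNTP request and the answer compared with the local clock. This is a minimal sketch rather than a replacement for the NTP daemon: it ignores network delay, and the server name passed to query() is whatever host runs NTP in your environment.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX = 2208988800


def build_request():
    """48-byte SNTP client request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\0"


def transmit_time(packet):
    """Extract the server transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_TO_UNIX + fraction / 2**32


def query(server, timeout=2.0):
    """Send one request to UDP port 123 and return the server's time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
    return transmit_time(packet)
```

Subtracting time.time() from the value returned by query() gives an approximate offset in seconds; a persistently large value suggests that the node is not synchronizing.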
You must gather several pieces of customer network environment information, including:
• IP addresses for:
• Kickstart Server
• bond0 interface on each node
• bond1 interface on edge nodes
• iDRAC interfaces
• Each node's service tag (case-insensitive)
• Each node's name
• Whether or not updates are to be installed
• If so, you must gather their source (directly from RHN, or from an RHN Satellite Server)
• Rack location, if racked in a non-standard manner
The IP address recommendations in IP Addressing on page 34 can be used as a starting point.
The Hadoop cluster network can be implemented such that only the edge network has access to the
Internet, while the cluster data network is private. In this configuration, only bond1 interfaces need to have
IP addresses that are routed externally. Cloudera Manager will access the Cloudera packages via bond1
and then distribute them over bond0, which is on the cluster-only network.
Optionally, all nodes can have the ability to connect with the Internet. In all cases, you will need to know
the gateway address for bond1 as well as the network mask. For example:
Service tags for each node are available in multiple places. Dell EMC PowerEdge R730xd servers have a
slide-out tag that contains this information. The information can be written down or scanned from the tag
via a smartphone app. Service tags usually have the format of the following example:
D120R22
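Because service tags are case-insensitive and are often transcribed by hand, it can be useful to normalize and sanity-check them while gathering the node list. The sketch below assumes the seven-character alphanumeric shape of the example above; that pattern is inferred from the example rather than stated by this guide, so relax it if your tags differ.

```python
import re
from collections import Counter

# Assumption: seven letters or digits, as in the example tag D120R22.
TAG_PATTERN = re.compile(r"[A-Z0-9]{7}")


def normalize_service_tag(raw):
    """Upper-case and trim a tag, rejecting values that do not fit the pattern."""
    tag = raw.strip().upper()
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"unexpected service tag format: {raw!r}")
    return tag


def duplicate_tags(raw_tags):
    """Return tags that appear more than once after normalization."""
    counts = Counter(normalize_service_tag(tag) for tag in raw_tags)
    return sorted(tag for tag, count in counts.items() if count > 1)
```

Checking for duplicates catches the common case where the same tag is recorded against two different node names.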
Once all required information is gathered you can proceed to Server Configuration and OS Installation on
page 48.
Chapter 6: Network Switches Configuration
The Dell EMC Ready Bundle for Cloudera Hadoop is based on the network switches documented in the Dell
EMC Ready Bundle for Cloudera Hadoop Architecture Guide. This guide assumes the use of those switches.
Configuring the Network Switches on page 45 provides the necessary switch configurations as starting
points.
Topics:
• Switch Configuration Overview
• Cabling the Network Switches
• Server Node Connections
• Configuring the Network Switches
This section describes the connection and setup of the switches used in the Dell EMC Ready Bundle for
Cloudera Hadoop.
The network must be cabled and the switches configured before software installation can begin. The
network configuration is divided into three phases:
• Setting up the S3048-ON, required for each rack in the cluster.
• Setting up the S4048-ON, required for each pod in the cluster.
• Setting up the S6000-ON, required for clusters larger than a single pod.
For each phase, we provide 'cut sheets' for the cabling details, and switch configuration files for the switch
programming. Refer to Table 17: Switch Configuration Files on page 41 to identify the correct cut sheet
and configuration file for each switch.
The Dell EMC PowerEdge FX architecture uses a converged iDRAC or CMC connection on the back
of the chassis. All of the units in a Dell EMC PowerEdge FX2 chassis use the same physical connector
on the back of the unit for the physical network connection, and have separate IP addresses for each
sub-unit. Figure 4: Dell Networking S6000-ON Multi-pod Networking Equipment on page 43 shows
the connection for a single chassis iDRAC connection. The port next to it can be used to daisy chain the
CMCs.
The management network for all of the nodes in the cluster is a very simple setup, whether the cluster
uses Dell EMC PowerEdge R730xd servers or Dell EMC PowerEdge FX2 chassis. The S3048-ON cut
sheet, Cutsheets.xlsx, shows that each Dell EMC PowerEdge R730xd host has a single connection from
the dedicated iDRAC port to one of the 1 GbE ports on the S3048-ON listed in the cut sheet for host
management access. The Dell EMC PowerEdge FX architecture is similar: each Dell EMC PowerEdge
FX2 CMC port is connected to the host ports in the cut sheet, in host order.
The listed interconnect ports, s3048-left and s3048-right, are for connecting multiple top-of-rack
S3048-ON switches together. The switches are connected as a simple bus. The cut sheet also shows a port
marked admin node, which carries both the Production and iDRAC networks in tagged form. This port allows
1 GbE access for kickstarting the machines using our Kickstart VM running in either VMware ESXi or
VMware Workstation. After the initial installation, this port can be used for a customer administration
node if desired.
Follow the cut sheets, and the following diagrams, to cable each switch:
Server connections to the network switches for the data network are bonded, and use an Active-Active
link aggregation group (LAG) in a load-balance configuration using the IEEE 802.3ad Link Aggregation
Control Protocol (LACP). (Under Linux®, this is referred to as 802.3ad or mode 4 bonding.)
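On a running node, the Linux bonding driver reports the active mode in /proc/net/bonding/<bond name>, so the 802.3ad setting can be confirmed after installation. The sketch below only parses that status text; the bond name and the sample text shown in the usage example are illustrative.

```python
def bonding_mode(status_text):
    """Return the value of the 'Bonding Mode' line from bonding status text, or None."""
    for line in status_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "Bonding Mode":
            return value.strip()
    return None


def is_lacp(status_text):
    """True when the reported bonding mode is IEEE 802.3ad (mode 4)."""
    mode = bonding_mode(status_text)
    return mode is not None and "802.3ad" in mode


if __name__ == "__main__":
    # Illustrative path; substitute the bond you want to inspect.
    with open("/proc/net/bonding/bond0") as handle:
        print("LACP active:", is_lacp(handle.read()))
```

A bond that reports any other mode will still pass traffic in many cases, but it will not negotiate a LAG with the switch pair as described above.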
The connections are made to a pair of Pod switches, to provide redundancy in the case of port, cable,
or switch failure. The switch ports are configured as a LAG, and the switches are configured as a high
availability pair using VLT.
Connections to the BMC network use a single connection from the iDRAC port to a S3048-ON
management switch in each rack.
Edge Nodes have an additional pair of 10GbE connections available. These connections facilitate high-
performance cluster access between applications running on those nodes, and the optional edge network.
The mapping of bonds to individual interfaces is shown in Table 18: Bond / Interface Cross Reference on
page 45.
The Dell EMC PowerEdge FX2 architecture uses a converged iDRAC or CMC connection on the back of
the chassis. All of the units in a Dell EMC PowerEdge FX2 chassis use the same physical connector on the
back of the unit for the physical network connection, and have separate IP addresses for each sub-unit.
Figure 7: Dell EMC PowerEdge FX2 Worker Chassis Network Ports on page 45 displays the connection
for a single chassis iDRAC connection. The port next to it can be used to daisy chain the CMCs.
Note: The Dell EMC PowerEdge FX2 has two iDRAC ports per chassis - an uplink port and a
stacking port (STK). The uplink port is the main iDRAC port. The stacking port is only used when
chassis are daisy-chained.
Perform the following steps to change the switch's boot mode only if necessary; otherwise, skip to Switch
Configuration on page 46.
To run the first time setup on each switch:
1. Connect to the switch using a serial cable and laptop. The required serial port settings are:
a. 115200 baud rate
b. No parity
c. 8 data bits
d. 1 stop bit
e. No flow control
2. Bring up a HyperTerminal window to connect to the switch.
3. Power on the switch, and wait for the following menu to appear:
To continue with the standard manual interactive mode, it is necessary to
abort BMP.
Press A to abort BMP now.
Press C to continue with BMP.
Press L to toggle BMP syslog and console messages.
Press S to display the BMP status.
4. Choose A to abort Bare Metal Provisioning.
5. Wait for the switch to finish its current activities. You may need to press the [Enter] key to see the
prompt.
6. Type enable, and then press the [Enter] key, to enter privileged mode.
7. Type configure, and then press the [Enter] key, to enter configuration mode.
8. Type reload-type, and then press the [Enter] key, to change the boot mode for the machine.
9. Type boot-type normal-reload, and then press the [Enter] key.
10. Type exit, and then press the [Enter] key, to exit the boot-type submenu.
11. Type do wr, and then press the [Enter] key, to write the new configuration to the switch.
12. Type exit, and then press the [Enter] key, to exit the configure mode.
13. Type reload, and then press the [Enter] key, to cause the switch to reboot into the newly chosen mode.
14. When you are asked to confirm saving the configuration, and to confirm reloading the system, type yes,
and then press the [Enter] key.
Switch Configuration
The configuration procedure is nearly identical for each switch. The only difference is the configuration file
that is copied and pasted into the switch console window. Switch configurations are plain text files.
For each switch, you will need to update the template to specify the actual IP address for the management
interface on the switch. You will also need to update the configuration templates to reflect the correct VLAN
IDs.
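One way to avoid hand-editing each configuration file is to keep the site-specific values in one place and substitute them into a copy of the template. The sketch below uses Python's string.Template and assumes you have first marked the values to change with placeholders such as ${MGMT_IP} and ${VLAN_CLUSTER}; these placeholder names are hypothetical and do not appear in the configuration files as delivered.

```python
from string import Template


def render_switch_config(template_text, mgmt_ip, vlan_ids):
    """Fill ${MGMT_IP} and ${VLAN_<NAME>} placeholders in a configuration template."""
    values = {"MGMT_IP": mgmt_ip}
    for name, vlan_id in vlan_ids.items():
        # Valid 802.1Q VLAN IDs range from 1 to 4094.
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"VLAN ID out of range for {name}: {vlan_id}")
        values[f"VLAN_{name.upper()}"] = str(vlan_id)
    # substitute() raises KeyError if the template uses a placeholder
    # that was not supplied, so nothing is silently left unfilled.
    return Template(template_text).substitute(values)
```

The rendered text can then be pasted into the switch console in the same way as the original file.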
To configure each switch:
1. Connect to the switch using a serial cable and laptop. The required serial port settings are:
a. 115200 baud rate
b. No parity
c. 8 data bits
d. 1 stop bit
e. No flow control
2. Bring up a HyperTerminal window to connect to the switch.
3. Press the [Enter] key to display a console prompt.
4. Type enable, and then press the [Enter] key, to enter privileged mode.
5. Type configure, and then press the [Enter] key, to enter configuration mode.
6. Copy the configuration from the appropriate text file, and then paste it into the console window. The files
are named according to the conventions in the cut sheets provided in the download packages.
7. After the configuration finishes copying, press the [Enter] key.
8. Press [Ctrl-z].
9. Type exit, and then press the [Enter] key, to leave configuration mode.
10. Type copy running-config startup-config, and then press the [Enter] key.
11. Type reload, and then press the [Enter] key.
Chapter 7: Server Configuration and OS Installation
Dell EMC PowerEdge servers can be configured with the Dell EMC OpenManage Deployment Toolkit (DTK). We
have developed a simplified tool to enable the DTK to configure Dell EMC servers specifically for Dell
EMC Ready Bundle for Cloudera Hadoop workloads: the DTK Configurator.
Topics:
• Installing and Configuring the Kickstart Server
• DTK Configurator
The Dell EMC Ready Bundle for Cloudera Hadoop Kickstart Server is
used to automate the operating system installation on all the nodes in
a Hadoop stamp. It consists of a VMware virtual machine image
that can be run at the customer site on either of:
• Your laptop
• A customer-supplied system in the data center
The kickstart image must be configured with a correct IP address
within the customer's networking environment.
$ sudo ifconfig
Note: The following steps configure a network interface (eth2 in our examples) over which the
Kickstart Server can PXE boot the cluster nodes. Our examples assume that both eth1 and
eth2 appear after ifconfig is run; however, your environment may be configured differently.
Substitute your interface names as desired.
$ cd /etc/sysconfig/network-scripts
17. Move the existing ifcfg-eno16777736 file to the proper device found in the step above:
IPADDR=
e. Add an entry for the network mask:
NETMASK=
f. Add an entry for the gateway:
GATEWAY=
g. Add an entry for the Domain Name Service:
DNS1=
h. Save the file.
20. Restart the interface:
$ cd /var/www/html/master
3. Execute the following command, passing the IP address specified by the customer, or the DHCP
address found earlier:
You can now proceed to Editing the node-config.json File on page 51.
DTK Configurator
The DTK Configurator is a bootable USB key image. It enables you to boot any of our
architecture-compliant machines. Once booted, you can select the type of Hadoop machine you wish to
build from a menu. The DTK Configurator will automatically set up all of the following settings, as necessary:
• BIOS
• Firmware
• RAID Controller
• Disks/Volumes
• iDRAC
# md5sum bootimage.iso
4. List all attached block devices, including USB mass storage devices, by executing the following
command:
# blkid
5. Insert the USB key.
6. Rerun the blkid command. The newly listed device will be the USB key you just inserted. For example:
[root@data2 ~]# blkid > before
[root@data2 ~]# echo insert key now
insert key now
[root@data2 ~]# blkid > after
[root@data2 ~]# diff before after
23a24
> /dev/sdr1: LABEL="BOOTIMG" UUID="20B4-D909" TYPE="vfat"
7. Create the bootable USB key by executing the following command:
USB Boot
1. Ensure that the target machine is in BIOS boot mode. If it is in UEFI mode:
a. Press [F2] to enter the machine into System Setup mode.
b. Navigate to System BIOS > Boot Settings > Boot Mode > BIOS.
c. Save, and then exit the BIOS.
2. Insert the USB key into one of the USB ports on the target machine.
3. When the machine reboots, and the BIOS boot menu appears, press [F11] to enter BIOS Boot
Manager.
4. Select the One-shot BIOS Boot menu.
5. Select the USB port into which the key is inserted.
6. Select Finish, and exit BIOS Boot Manager to boot the machine.
At this point the machine will boot from the USB key, and display the standard CentOS boot messages.
7. The DTK then checks the machine's hardware model and boot sequence.
Dell EMC PowerEdge R730xd example:
In this case, the DTK guides you through one of two scenarios that you can select:
• Keeping the existing configuration (select n at the prompt)
• Removing the existing configuration (select y at the prompt)
• Selecting n will cause the DTK to abort the operation, and display a reboot message.
• Selecting y will cause the DTK to respond with a confirmation prompt before continuing.
Caution: Removing configurations is a destructive operation. Please be sure of your
selection before confirming.
9. The DTK then checks the machine's network interface boot protocols.
a. If the network interfaces are configured correctly, a message similar to the following is displayed:
The DTK then prompts you to select a system profile. See step 10 below.
b. If the network interfaces are configured incorrectly, a message similar to the following is displayed:
Note: In this case, you must allow the machine to reboot in order to continue to Step 10
below.
The DTK then prompts you to select a system profile. See step 10 below.
10. Follow the prompts to select the system profile that you wish to install:
1. Hadoop Infrastructure
2. Hadoop Worker
3. OpenStack Infrastructure
4. OpenStack Compute
5. OpenStack Storage
6. OpenStack SAH
a. When you are prompted for the IPv4 address and network mask, enter the machine's iDRAC IP
address and mask.
11. When the process is complete, follow the prompt to remove the USB key and reboot the machine.
Note: Certain update packages during this procedure may require that the machine being
updated be rebooted immediately, prior to finishing all updates.
12. If the machine reboots on its own without user intervention, or you do not see the DTK finish message
asking you to press [Enter] to reboot the machine:
a. Rerun the DTK updater on the same machine to finish all available updates.
13. While rebooting, the machine contacts the Kickstart Server, and then performs the operating system
installation based upon the service tag, and the node-config.json file.
14. Perform the cluster test in Before Hadoop Cluster Deployment on page 79.
Note: Once the operating system is installed, the root password for each machine will be the
password that you entered in Configuring the Kickstart Server on page 50.
Troubleshooting Service Tag Errors
If a node's service tag cannot be found in the node-config.json file, you can either:
• Select the appropriate node type from the menu option that is displayed, or
• Add the correct service tag to the node-config.json file
Note: Dell EMC recommends that you add the correct service tag to the node-config.json file, in
order to save time and effort.
If you choose to select the node type from the menu:
1. Select the node type. Available types include:
• Name
• Standby Name
• High Availability
• Edge
• Data
2. The operating system will be installed without customizations typically performed by the kickstart
automation.
3. Manually configure the:
• /etc/hosts file with hostnames and IP addresses of all Hadoop nodes
• bond0 interface
• Domain name
• NTP server configuration
• Optional bond1 interface on Infrastructure nodes
• Operating system tuning parameters
• Local RHEL 7.3 repositories, based upon the installation ISO
• Additional mount points
If you choose to add the service tag to the node-config.json file:
1. Rerun the read-json.py script as in Editing the node-config.json File on page 51. The
customizations will be performed automatically.
2. Reboot the problematic node.
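Before rebooting, it can help to confirm that the service tag really is present in the file. The
following is a minimal sketch, not part of the Dell EMC tooling. It assumes only that
node-config.json is valid JSON, and it searches every key and string value for the tag because the
file's actual layout is not shown in this section. The function name and the sample data are
hypothetical.

```python
import json


def contains_service_tag(config, service_tag):
    """Return True if service_tag matches any key or string value in config."""
    wanted = service_tag.strip().upper()
    if isinstance(config, str):
        # Compare case-insensitively, ignoring surrounding whitespace.
        return config.strip().upper() == wanted
    if isinstance(config, dict):
        return any(
            contains_service_tag(key, service_tag)
            or contains_service_tag(value, service_tag)
            for key, value in config.items()
        )
    if isinstance(config, list):
        return any(contains_service_tag(item, service_tag) for item in config)
    # Numbers, booleans and null cannot hold a service tag.
    return False


# Hypothetical sample; the real node-config.json layout may differ.
sample = json.loads('{"nodes": [{"service_tag": "ABC1234", "role": "data"}]}')
print(contains_service_tag(sample, "abc1234"))
```

To check a real file, load it with json.load() and pass the result in place of the sample.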
Chapter 8: Additional Packages
Topics:
• Checking and Installing Packages
The kickstart process installs all necessary OS packages. If you need additional packages, they
should be installed manually.
Chapter 9: Operating System Software Updates
Topics:
• Software Update Recommendations
Dell EMC recommends that you perform software updates on a regular basis, for all installed
packages.
Chapter 10: Installing Cloudera Manager
Topics:
• Configuring the Metadata Database
• Installing Cloudera Manager Software
After the base operating system has been imaged on all cluster nodes, the next step is to install
Cloudera Manager to complete the deployment. Management of HDFS and other Hadoop services is
performed by Cloudera Manager. The Cloudera Manager software should be installed on the Edge Node.
Note: Before continuing to Configuring the Metadata Database on page 61, best practice is to
perform the cluster test in Before Hadoop Cluster Deployment on page 79.
Refer to the following documents for instructions to configure the PostgreSQL metadata database:
• Cloudera — http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_extrnl_pstgrs.html
• PostgreSQL — https://www.postgresql.org/docs/9.4/static/index.html
Note: The PostgreSQL database should be configured on the HA Node.
Since the Dell EMC Ready Bundle for Cloudera Hadoop installs the PostgreSQL database software on
the appropriate host, you can skip the Installing the External PostgreSQL Server section and refer to these
sections instead:
• Configuring and Starting the PostgreSQL Server
• Creating Databases for Activity Monitor, Reports Manager, Hive Metastore Server, Sentry Server,
Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server
• Configuring PostgreSQL for Oozie
To configure the metadata database:
1. Log onto the HA Node as root.
2. Set the correct software localization variables by executing the following commands:
# export LANGUAGE=en_US.UTF-8
# export LANG=en_US.UTF-8
# export LC_ALL=en_US.UTF-8
3. Initialize the database service, which will copy default configuration files into the appropriate locations:
# mkdir /var/lib/pgsql/9.4
# /usr/pgsql-9.4/bin/postgresql94-setup initdb
# systemctl start postgresql-9.4.service
# systemctl stop postgresql-9.4.service
4. To enable client machines in the local subnet to access the database:
a. Open the /var/lib/pgsql/9.4/data/pg_hba.conf file in a text editor.
b. Add the following lines before all other local and host lines, substituting your local environment's
subnet:
listen_addresses = '*'
c. In this same file, change the settings as listed in step 3 of the Cloudera link given above. These
settings relate to the size of the cluster being installed.
d. Save and close the file.
6. To start the database, and enable it to be restarted after each reboot, execute the following commands:
These instructions summarize the overall installation process and call out specific recommendations for the
Dell EMC Ready Bundle for Cloudera Hadoop. For additional details, refer to the Cloudera documentation
at: http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_b.html
You will download the “seed” portion of Cloudera Manager software from Cloudera, and then install the
Cloudera Hadoop environment using their Internet-accessible repositories.
Cloudera Manager is installed upon the Edge Node. To install Cloudera Manager:
1. Log into the Edge Node:
a. Username: root
b. Password: the password that you entered in Configuring the Kickstart Server on page 50
2. Update the package repository information:
# /usr/share/cmf/schema/scm_prepare_database.sh -h <hostname_of_HA_node> postgresql scm scm
a. You are prompted for the SCM password. Enter the password to continue.
6. Start the Cloudera server processes:
Cloudera Manager is now installed. Its HTTP management interface should be reachable on port 7180,
using the admin/admin username and password credentials.
You can now follow the install wizard steps for a custom deployment, or proceed to Cloudera Configuration
on page 64.
Note: If allowed in your jurisdiction, you should install the Java Cryptography Extension (JCE)
Unlimited Strength Jurisdiction Policy File on all cluster and Hadoop user machines. For JCE Policy
File installation instructions, see the README.txt file included in the jce_policy-x.zip file. You
will be given an option to do this when using Cloudera Manager to deploy the Hadoop Environment.
Chapter 11: Cloudera Configuration
Topics:
• Cloudera and Network Interfaces
• Using Spark 1 and Spark 2
• Service Assignments
• Hadoop Rack Awareness
• Cloudera Update Recommendations
This section describes Cloudera-specific configuration settings that Dell EMC recommends you set.
These changes are not automatically applied by the DTK/Kickstart process, and must be applied
manually.
Note: Once you have finished configuring Cloudera, best practice is to perform the cluster test in
After Hadoop Cluster Deployment on page 79.
The Cloudera services are not multi-homed, and only function on a single network interface.
The network interface used for the Cloudera services is the interface that corresponds to the fully qualified
node name. For clusters built according to the Dell EMC Ready Bundle for Cloudera Hadoop Architecture
Guide and the Dell EMC Ready Bundle for Cloudera Hadoop Deployment Guide, this is the 'bond0'
interface, and the Cloudera services are available on the cluster data network.
If the network interface names are changed, or an alternative deployment method is used, the Cloudera
services must be explicitly configured to run on the desired network interface.
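To see which interface the services will use on a given node, compare the address that the fully
qualified node name resolves to with the addresses assigned to each interface. The following is a
minimal sketch of that comparison. It assumes you have already collected the interface addresses
(for example, from the output of ip addr); the function name and the sample addresses are
hypothetical.

```python
def interface_for_address(resolved_ip, interface_addresses):
    """Return the name of the interface that owns resolved_ip, or None."""
    for name, addresses in interface_addresses.items():
        if resolved_ip in addresses:
            return name
    return None


# Hypothetical addresses for one node: bond0 on the cluster data network,
# em3 on some other network.
addresses = {"bond0": ["172.16.1.21"], "em3": ["10.10.5.21"]}
print(interface_for_address("172.16.1.21", addresses))
```

On a live node, resolved_ip could be obtained with socket.gethostbyname(socket.getfqdn()). If the
result is not the interface on the cluster data network, the services need explicit configuration
as described above.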
Cloudera Enterprise 5.10 supports the simultaneous installation and use of Spark 1.x and Spark 2.x.
Spark 2 contains significant API changes and functional improvements over Spark 1. However, it is not
backwards compatible with Spark 1. Cloudera Enterprise supports both versions by treating Spark 2 as an
additional service in Cloudera Manager.
Spark 2 is a separate download, not included in the base installation. Complete instructions are available
at: http://www.cloudera.com/downloads/spark2/2-0.html.
To install and configure Spark 2:
1. Follow the instructions on the Cloudera Spark 2 page to download and install the Spark 2 parcel. The
most direct way is to configure the Spark 2 parcel repository in Cloudera Manager.
2. Follow the guidelines in Service Assignments on page 65 to add the Spark 2 service to the cluster.
The Service Assignments on page 65 include recommendations for both services. You can configure
either Spark 1 or Spark 2, or configure both depending on your requirements.
Service Assignments
These are the recommended service role to node assignments for the cluster configuration.
As part of Cloudera installation, the mapping of service roles to nodes must be specified. We recommend
the service role assignments in Table 19: Service Role Assignments on page 65 below as a starting
point.
Hadoop rack awareness takes a node's network location into account when scheduling tasks and
allocating storage. Cloudera Manager allows the specification of the rack/switch location for each node in
the cluster. You must configure rack awareness to achieve optimal performance and high availability.
HDFS, MapReduce, and YARN will automatically use the location information (topology) that you specify to
optimize reliability and performance. The default installation of Cloudera places all nodes in the same rack.
If your cluster contains more than one rack, you should specify the topology for each node based on the
rack and pod location for each host. We recommend specifying the topology for all clusters, even if they
are a single rack.
The location of a node is specified using a hierarchical path, such as:
• /pod1/rack1
• /pod1/rack2
• /pod2/rack4
Note: It is important to specify both the pod and rack level information, and the rack component
should be unique within the cluster.
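The two rules in the note can be checked before the locations are entered in Cloudera Manager. The
following is a minimal sketch, assuming the planned locations are held in a dictionary keyed by
hostname. The function name and the sample hosts are hypothetical, and this is not the
set_rackId.py utility.

```python
def validate_topology(host_locations):
    """Check that each location is /<pod>/<rack> and each rack name has one pod.

    Returns a list of problem descriptions; an empty list means none were found.
    """
    problems = []
    rack_to_pod = {}
    for host, location in sorted(host_locations.items()):
        parts = location.strip("/").split("/")
        if not location.startswith("/") or len(parts) != 2 or not all(parts):
            problems.append(f"{host}: expected /<pod>/<rack>, got {location!r}")
            continue
        pod, rack = parts
        # Remember the first pod seen for this rack name and flag any other pod.
        known_pod = rack_to_pod.setdefault(rack, pod)
        if known_pod != pod:
            problems.append(
                f"{host}: rack {rack!r} appears in both {known_pod} and {pod}"
            )
    return problems


# Hypothetical plan in which rack1 is reused under a second pod.
plan = {
    "worker1": "/pod1/rack1",
    "worker2": "/pod1/rack2",
    "worker3": "/pod2/rack1",
}
for problem in validate_topology(plan):
    print(problem)
```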
The rack location for hosts is specified in Cloudera Manager, under the hosts tab. For more information,
please see:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_specify_rack.html
You must restart the affected services after making these changes.
We provide the set_rackId.py utility to assist in configuring the correct rack awareness values for a
cluster. set_rackId.py can set rack identifiers based on the hostname, chassis serial number, or a
supplied list of hosts and identifiers. Refer to the included README file for details on how to run this utility.
Note: The Dell EMC PowerEdge FX2 chassis serial number is a good unique identifier.
2. Change the Replica Placement Policy in Cloudera Manager by adding the following to the hdfs core-
site.xml safety valve:
<property>
<name>net.topology.impl</name>
<value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
<name>dfs.block.replicator.classname</name>
<value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>
You must restart the affected services after making these changes.
Dell EMC recommends installing the latest Cloudera maintenance updates during initial installation and as
part of normal administration processes.
For parcel deployment, updates are managed in the Settings section of Cloudera Manager, under Parcels.
The Cloudera Manager repositories are normally accessed via HTTP. Some environments may require the
use of an HTTP proxy server, which can be specified under Settings/Network.
Chapter 12: Installing Syncsort DMX-h
Topics:
• Syncsort DMX-h Prerequisites
• Syncsort DMX-h Software Packages and Versions
• Installation Procedure
Syncsort® DMX-h® is an Extract, Transform, Load (ETL) product for Hadoop, and is an optional
installation. For more information about Syncsort, see the Syncsort website. Access to the Syncsort
support portal requires a valid site login account.
This topic briefly describes installing DMX-h on a Dell EMC Ready Bundle for Cloudera Hadoop
architecture-compliant cluster, and configuring it to extract data from a PostgreSQL database using
that database's ODBC driver. The detailed directions for installing DMX-h are in the Syncsort DMX-h
Installation Guide.
Note: For information about configuring other data sources, such as Oracle, DB2, and Sybase, see the
Syncsort website.
These instructions assume that you will be installing and using the following versions of software:
• Red Hat Enterprise Linux Server 7.3 - installed and configured on all required Cloudera nodes
• Cloudera Distribution for Apache Hadoop 5.10 - installed with all nodes running their proper roles
• Syncsort DMX-h 9.2
• Syncsort DMX-h license key file
Installation Procedure
To install and configure Syncsort DMX-h on a Dell EMC Ready Bundle for Cloudera Hadoop architecture-
compliant cluster:
1. Acquire Syncsort Files on page 70
2. Install the DMX-h IDE on page 71
3. Configure the Syncsort Parcel for Cloudera on page 71
4. Install DMX-h on the Edge Node on page 71
You can now proceed to Install the DMX-h IDE on page 71.
# ./dmexpress-9.2-el7.parcel_en.bin
4. Specify the extraction directory as /opt/cloudera/parcel-repo/.
5. Log into the Cloudera Management Console as the administrator user.
6. Navigate to Hosts > Parcels to display the parcels administration page.
7. If the new Syncsort parcel is not displayed, perform a scan for newly-available parcels.
8. Select Automatically Distribute Available Parcels to distribute the Syncsort parcel to all nodes.
9. Click on the Save Changes button.
10. Once the operation is complete, activate the parcel on all nodes.
You can now proceed to Install DMX-h on the Edge Node on page 71.
# ./dmexpress-9.2-1.x86_64_en.bin
# cd <new_directory>
# rpm -i dmexpress-9.2-1.x86_64.rpm
a. To install to a different location, use the --prefix option as described in the rpm man page.
6. Install and configure the dmxd service by issuing the following commands as the root user:
# cd /usr/dmexpress
# ./install
7. Select the option to install and run the dmxd daemon on the Edge Node.
a. Select the following when prompted:
• Select [2] to configure the DMExpress Service.
• Select [y] or [n] to choose whether or not to use PAM for authentication.
• Select [y] or [n] to choose whether or not to start the DMExpress Service automatically.
• Select [y] or [n] to choose whether or not to start the DMExpress Service now.
Syncsort and the ODBC connectors are now installed, and configured to allow ETL between the
PostgreSQL database and the Dell EMC Ready Bundle for Cloudera Hadoop architecture-compliant
cluster.
Syncsort DMX-h is now installed and configured.
Chapter 13: YARN Performance Optimization
Topics:
• YARN Applications
• Determining the Reserved Memory
• Hadoop Configuration Settings
This topic describes how to configure YARN and MapReduce memory allocation settings for the Dell EMC
Ready Bundle for Cloudera Hadoop, based upon the node hardware specifications. These guidelines were
developed using several documents publicly available from Cloudera:
• http://blog.cloudera.com/blog/2014/02/getting-mapreduce-2-up-to-speed/
• http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html
Note: These guidelines have been tested on Dell EMC Ready Bundle for Cloudera Hadoop cluster
configurations.
YARN Applications
The performance of YARN applications should be tunable based upon the hardware resources of the
cluster, especially the physical cores and memory. YARN takes into account all of the available compute
resources on each machine in the cluster. Based on the available resources, YARN:
1. Negotiates resource requests from applications (such as MapReduce) running in the cluster
2. Provides processing capacity to each application by allocating Containers
Note: A Container is the basic unit of processing capacity in YARN, and is an encapsulation of
resource elements (memory, CPU, etc.).
In a Hadoop cluster, it is vital to balance the usage of memory (RAM), processors (CPU cores), and disks
so that processing is not constrained by any one of these cluster resources. As a general recommendation,
allowing for two Containers per disk and per core provides the best balance for cluster utilization.
When determining the appropriate YARN and MapReduce memory configurations for a cluster node, start
with the available hardware resources. Specifically, note the following values on each node:
• RAM - Amount of memory
• Cores - Number of CPU cores
• Disks - Number of disks
The total available RAM for YARN and MapReduce should take into account the Reserved Memory.
Reserved Memory is the RAM needed by system processes and other Hadoop processes (such as
HBase).
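The balance described above can be sketched as a small calculation. This is an illustration under
stated assumptions, not a formula taken from this guide: it applies the "two Containers per disk and
per core" guideline, and additionally caps the result by the RAM left after Reserved Memory. The
function name and the min_container_mb default are hypothetical.

```python
def containers_per_node(cores, disks, ram_mb, reserved_mb, min_container_mb=2048):
    """Estimate containers per node from cores, disks and available RAM."""
    available_mb = ram_mb - reserved_mb
    if available_mb < min_container_mb:
        # Not enough memory for even one container of the assumed minimum size.
        return 0
    # Two containers per core and per disk, limited by what fits in memory.
    return min(2 * cores, 2 * disks, available_mb // min_container_mb)


# Hypothetical node: 24 cores, 12 data disks, 256 GB RAM, 16 GB reserved.
print(containers_per_node(cores=24, disks=12, ram_mb=262144, reserved_mb=16384))
```

In this hypothetical case the disk count is the limiting resource, which is the kind of imbalance
the guideline is meant to expose.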
To determine Reserved Memory per node:
1. Use the Search facility in Cloudera Manager to find the values for the following Role Instance Memory
parameters:
a. Memory Overcommit Threshold — Navigate to Cloudera Manager (CM) > Hosts > [select a
DataNode Host] > Configuration
b. Java Heap Size of Worker Node — Navigate to CM > Hosts > [select a DataNode Host] > Roles >
DataNode > Configuration
c. Java Heap Size of NFS Gateway — Navigate to CM > Hosts > [select a DataNode Host] > Roles >
NFS Gateway > Configuration
d. Java Heap Size of NodeManager — Navigate to CM > Hosts > [select a DataNode Host] > Roles >
NodeManager > Configuration
2. Sum those values to determine the Role Instance Memory.
3. Then, use the following formula:
Table 21: Reserved Memory Recommendations on page 75 provides Dell EMC's recommended
Reserved Memory values.
The YARN and MapReduce configurations should be set as per Table 22: YARN and MapReduce RAM
Settings on page 75.
Parameter | Description | Value
Map Task Maximum Heap Size | The maximum Java heap size, in bytes, of the map processes. This number will be formatted and concatenated with 'Map Task Java Opts Base' to pass to Hadoop. | 858993459
Reduce Task Maximum Heap Size | The maximum Java heap size, in bytes, of the reduce processes. This number will be formatted and concatenated with 'Reduce Task Java Opts Base' to pass to Hadoop. | 1717986918
mapreduce.task.io.sort.mb | The total amount of memory buffer, in MB, to use while sorting files. | Default=256
mapreduce.map.sort.spill.percent | The soft limit in either the buffer or record collection buffers. When this limit is reached, a thread will begin to spill the contents to disk in the background. | Default=0.8, Recommended (> 0.5)
mapreduce.job.reduce.slowstart.completedmaps | Fraction of the number of map tasks in the job which should be completed before reduce tasks are scheduled for the job. | Default=0.8, Depending on workload and Configuration (valid range: 0 – 1)
mapreduce.job.maps | The default number of map tasks per job. | num Worker Node cores x num Worker Nodes
mapreduce.job.reduces | The default number of reduce tasks per job. | (Valid range: 1/3 – 1) x mapreduce.job.maps
dfs.blocksize | The default block size in bytes for new HDFS files. | Valid range: 256MB-1GB
dfs.replication | The number of replications to make when the file is created. | 3
dfs.namenode.handler.count | The number of server threads for the Name Node. | 30
dfs.datanode.handler.count | The number of server threads for the Worker Node. | 10
Note: After installation, both yarn-site.xml and mapred-site.xml are located in the /etc/hadoop/conf
folder. If using Cloudera Manager, these settings should be entered via the YARN configuration
tool.
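The derived values in the settings above can be reproduced with a short calculation. This is a
sketch for checking your own numbers, with hypothetical function names. Note one assumption: the two
heap sizes listed equal 80% of 1 GiB and 2 GiB, rounded down to whole bytes. That ratio is an
observation about the listed values, not a rule stated in this guide.

```python
GIB = 1024 ** 3


def heap_bytes(container_bytes):
    """Return 80% of the container size in whole bytes (integer arithmetic)."""
    return container_bytes * 8 // 10


def job_task_defaults(cores_per_worker, worker_count):
    """Return (maps, min_reduces, max_reduces) for mapreduce.job.maps/reduces."""
    # mapreduce.job.maps = Worker Node cores x number of Worker Nodes.
    maps = cores_per_worker * worker_count
    # mapreduce.job.reduces ranges from 1/3 x maps to 1 x maps.
    min_reduces = max(1, maps // 3)
    return maps, min_reduces, maps


print(heap_bytes(1 * GIB), heap_bytes(2 * GIB))
print(job_task_defaults(cores_per_worker=24, worker_count=10))
```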
Chapter 14: Cluster Testing
Topics:
• Before Hadoop Cluster Deployment
• After Hadoop Cluster Deployment
You should test your Hadoop cluster both before and after Cloudera Manager has deployed the cluster.
The tests you perform will vary depending upon the deployment status.
# curl -I archive.cloudera.com
# dig @<dns_server> archive.cloudera.com
# yum repolist
# more /etc/yum.repos.d/*
Chapter 15: QuickStart Configuration Differences
Topics:
• QuickStart Node Configuration Differences
• QuickStart Network Configuration Differences
• QuickStart Service Assignments
There are differences between the full cluster and QuickStart configurations.
The QuickStart configuration is intended for proof of concept installations, and is not a full cluster
configuration. The QuickStart uses the same node configurations as a full cluster, but includes only 5
nodes and does not include a high availability network.
The recommended QuickStart node usage is shown in Table 23: QuickStart Node Roles on page 81.
The QuickStart configuration uses the same switches and switch configurations as a full cluster. However,
the dual switches that provide high availability are not included.
To configure networking for the QuickStart configuration:
1. Configure switches and cabling just like a full cluster deployment, using only switch S4048-1.
2. Each node will have a single connection to the cluster data network instead of dual connections.
3. Configure hosts and IP addressing using the same method as a full cluster deployment.
Table 24: QuickStart Service Role Assignments on page 82 shows the recommended service role to
node assignments for the QuickStart configuration.
Role | Nodes
HDFS
NameNode | Active Name Node
Secondary NameNode | Standby Name Node
Balancer | Standby Name Node
HttpFS | Active Name Node
NFSGateway | Active Name Node
DataNode | Worker Node 1, Worker Node 2,... Worker Node N
Hive
Gateway | all nodes
Hive Metastore Server | Standby Name Node
WebHCat Server | Standby Name Node
HiveServer2 | Standby Name Node
Hue
Hue Server | Standby Name Node
Impala
Impala Catalog Server | Active Name Node
Impala StateStore | Active Name Node
Impala Daemon | same servers as DataNode role
Cloudera Management Service
Service Monitor | Standby Name Node
Activity Monitor | Standby Name Node
Host Monitor | Standby Name Node
Reports Manager | Standby Name Node
Event Server | Standby Name Node
Alert Publisher | Standby Name Node
Navigator Audit | Standby Name Node
Navigator Metadata Server | Standby Name Node
Oozie
Oozie Server | Standby Name Node
Spark
Gateway | all nodes
History Server | Active Name Node
Spark 2
Spark 2 Gateway | all nodes
Spark 2 History Server | Standby Name Node
YARN (MR2 Included)
Resource Manager | Active Name Node
Job History Server | Active Name Node
Node Manager | same servers as DataNode role
Gateway | all nodes
ZooKeeper
ZooKeeper Server | Active Name Node, Standby Name Node, Worker Node 1
Appendix A: BIOS Configuration
Topics:
• IPMI Configuration
• Primary BIOS Settings
• Infrastructure Node Settings
• Worker Node Settings
This appendix describes BIOS configurations on Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630
server hardware for the Dell EMC Ready Bundle for Cloudera Hadoop with Red Hat Enterprise Linux
Server 7.3.
Note: The Dell EMC-provided DTK tool updates all of the necessary IPMI/BIOS/iDRAC settings for you.
Table 25: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Infrastructure Node Settings on page
85 and Table 26: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Worker Node Settings on page
86 contain all of the settings performed by the DTK, and are provided here for your reference.
IPMI Configuration
You must configure the iDRAC on supported systems. Dell EMC recommends that you configure these
settings from the iDRAC web interface, or directly on the node console:
• User Information
• Network Configuration
• IPMI Validation
The primary BIOS configurations for the Dell EMC Ready Bundle for Cloudera Hadoop are for
Infrastructure Nodes and Worker Nodes.
• For more information about Dell EMC PowerEdge R730xd BIOS settings, please see the Dell EMC
PowerEdge R730xd Owner's Manual.
Note: Dell EMC recommends that you perform BIOS updates on a regular basis. It is particularly
important that your operating system firmware be up to date prior to installing Cloudera Manager.
This section describes required settings for Dell EMC PowerEdge R730xd and Dell EMC PowerEdge
FC630 Infrastructure nodes (Cloudera Manager node, optional Administration Node, HDFS Active and
Standby Name Nodes, etc.).
Table 25: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Infrastructure Node
Settings
This section describes required settings for Dell EMC PowerEdge R730xd and Dell EMC PowerEdge
FC630 Worker Nodes.
Table 26: Dell EMC PowerEdge R730xd and Dell EMC PowerEdge FC630 Worker Node Settings
Appendix B: RAID Configuration
Topics:
• PERC-H730-Specific Infrastructure Nodes RAID Settings
• PERC-H730-Specific Worker Node RAID Settings
This appendix describes Infrastructure Nodes and Worker Nodes RAID settings for the PERC-H730 RAID
Controller.
Note: The Dell EMC-provided DTK tool automatically configures the RAID controller, and creates all
necessary RAID sets on each machine. Table 27: PERC-H730 BIOS Settings for Infrastructure Nodes on
page 89 and Table 28: PERC-H730 BIOS Settings for Worker Nodes on page 89 contain all of the RAID
settings performed by the DTK, and are provided here for your reference.
For more information on configuring your controller please see the Dell EMC PowerEdge RAID Controller
(PERC) 9 User’s Guide.
Note that:
• Rear flex-bay drives are a single RAID 1 set.
Note: We do not use more than six front drives directly. Any remaining front drives are available for
customer use.
Note: Worker Nodes are set as a single RAID 1 set for the two Flex Bay Drives, and HBA pass-
through (JBOD) for the data drives.
Appendix C: File System Layout
This appendix describes filesystem layout deployment parameters.
Infrastructure Nodes
The Infrastructure nodes (Active Name Node, Standby Name Node, HA Node, and Edge Node) are
configured as multiple partitions and filesystems using all available drives. Each partition is optimized for
both performance and reliability.
Dell EMC recommends the following disk and partition layout for this set of machines.
Note: Dell EMC does not recommend that a large swap space be configured. Swapping in a
Hadoop cluster should be avoided, due to the large and random performance degradation that can
result. See Swap Settings on page 101.
Note: The settings for dfs.name.dir, dfs.namenode.name.dir, ZooKeeper DataDir, ZooKeeper
DataLogDir, and dfs.namenode.edits.dir must be updated in Cloudera Manager to reflect the
locations in this partition layout.
Worker Nodes
The Worker Nodes in the cluster are the processing and data storage nodes. When using Dell EMC
PowerEdge R730xd servers we recommend that the two Flex Bay drives in the back of the chassis be
configured as a mirrored pair, and used for the operating system. All of the other disks attached to the
system should be configured as HBA or JBOD.
Dell EMC recommends the following disk and partition layout for this set of machines.
Note: Dell EMC does not recommend that a large swap space be configured. Swapping in a
Hadoop cluster should be avoided, due to the large and random performance degradation that can
result. See Swap Settings on page 101.
Note: The partition layout in Table 34: Dell EMC PowerEdge R730xd Worker Node Partitions
on page 93 and Table 36: Dell EMC PowerEdge FC630 Worker Node Partitions on page
94 applies to all the data drives in all the Worker Nodes. Depending on the Worker Node drive
configuration, the Dell EMC PowerEdge R730xd will have either 12 or 24 data drives. The Dell EMC
PowerEdge FC630 will have 16 data drives.
Note: Operating system partitions are configured with the Logical Volume Manager enabled.
Appendix D: Operating System Settings
Topics:
• CPU Settings
• Network Settings
• Advanced NIC Features
• Process Limits
• Memory Management Settings
• Secure Linux Settings
• Services
• Firewall Settings
• Ports Listing
• Disable Network Manager
• Secure Shell Keys
• User Accounts and Groups
This appendix describes how to configure the operating system for the Dell EMC Ready Bundle for
Cloudera Hadoop.
Note: The Dell EMC-provided DTK tool automatically configures the operating system settings on each
machine. The information in this appendix is provided here for your reference.
CPU Settings
You can configure the following Linux® operating system settings to increase Dell EMC Ready Bundle for
Cloudera Hadoop performance:
• IRQ Balancer on page 97
• CPU Frequency Governor on page 97
IRQ Balancer
To prevent the IRQ balancer from interfering with the interrupt affinity scheme, the IRQ balancer service
needs to be disabled.
1. Disable the IRQ balancer service by executing the following commands:
# modprobe cpufreq_performance
3. Enable the governor by executing the following command:
# cd /lib/modules/2.6.32-573.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq
# ls
acpi-cpufreq.ko mperf.ko p4-clockmod.ko pcc-cpufreq.ko
powernow-k8.ko speedstep-lib.ko
5. If the necessary cpufreq drivers are not available, you can get them from the /lib/modules/<kernel
version>/kernel/drivers/cpufreq directory. For example:
# cd /lib/modules/2.6.32-573.el6.x86_64/kernel/drivers/cpufreq
# ls
cpufreq_conservative.ko cpufreq_ondemand.ko cpufreq_powersave.ko
cpufreq_stats.ko freq_table.ko
Note: The uname -r command will give you the kernel version.
The cpupower utility is provided by the cpupowerutils package. If you do not have it installed,
you can set the tunables in /sys/devices/system/cpu/<cpu id>/cpufreq/.
Network Settings
Dell EMC recommends that you tune certain network settings to increase Dell EMC Ready Bundle for
Cloudera Hadoop performance.
To tune the network settings:
1. Add the following parameters to the /etc/sysctl.conf file:
a. Temporarily change the MTU size of an interface by executing the following command:
MTU=9000
c. Activate the new MTU by taking the interface down, and then bringing it back up:
# ifdown eth0
# ifup eth0
Scatter-Gather
NICs with scatter-gather enabled are able to read from, and write to, many memory buffers for Direct
Memory Access (DMA). Depending upon the NIC, scatter-gather can be turned on with ethtool.
To enable scatter-gather:
1. Execute the following command:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off
Process Limits
The Linux® operating system needs to be configured with several process and file limit settings. The
lines below should be added to the /etc/security/limits.conf file.
GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap vconsole.font=latarcyrheb-sun16 vconsole.keymap=us transparent_hugepage=never"
2. Run the grub2-mkconfig command to regenerate the grub.cfg file. For example:
# grub2-mkconfig -o /boot/grub2/grub.cfg
3. Reboot the system and ensure that the parameter is set correctly. This can be confirmed by running this
command:
# cat /proc/cmdline
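When many nodes need to be verified, the same check can be scripted. The following is a minimal
sketch that tests whether a kernel command line contains the transparent_hugepage=never parameter.
The function names and the sample string are hypothetical.

```python
def thp_disabled_at_boot(cmdline):
    """Return True if the kernel command line sets transparent_hugepage=never."""
    # Split on whitespace so that only a complete parameter matches.
    return "transparent_hugepage=never" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line, shaped like the GRUB_CMDLINE_LINUX value in step 1.
sample = "ro rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap transparent_hugepage=never"
print(thp_disabled_at_boot(sample))
```

On a node, call thp_disabled_at_boot(read_cmdline()) after the reboot.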
Swap Settings
The vm.swappiness Linux® kernel parameter controls how aggressively memory pages are swapped to
disk. It can be set to a value between 0 and 100. The higher the value, the more aggressively the kernel seeks
out inactive memory pages and swaps them to disk.
On most systems this parameter is set to 60 by default. This is not always suitable for Hadoop cluster
nodes because it can cause processes to swap out, even when there is free memory available. This can
affect stability and performance, and may cause problems such as lengthy garbage collection pauses
for important system daemons. Cloudera recommends that vm.swappiness be set based on the Linux
kernel version. Red Hat Enterprise Linux Server 7.3 uses Linux kernel version 3.10.x.
• To check the kernel version and the current vm.swappiness value, run:
# uname -a
# sysctl vm.swappiness
• To set the vm.swappiness parameter for kernel versions earlier than 2.6.32-303:
# sysctl -w vm.swappiness=0
• To set the vm.swappiness parameter for later kernel versions:
# sysctl -w vm.swappiness=1
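The version rule in the bullets above can be expressed as a small helper. This is a sketch, not a Cloudera-supplied tool: the function name recommended_swappiness is hypothetical, and the comparison relies on sort -V (version sort), which is assumed to be available as it is in GNU coreutils.

```shell
# Print the vm.swappiness value suggested by the rule above:
# 0 for kernels earlier than 2.6.32-303, 1 otherwise.
# $1: kernel release string, for example the output of `uname -r`
recommended_swappiness() {
    threshold="2.6.32-303"
    # Version-sort the two strings; the first line is the older release.
    oldest=$(printf '%s\n%s\n' "$1" "$threshold" | sort -V | head -n 1)
    if [ "$oldest" = "$1" ] && [ "$1" != "$threshold" ]; then
        echo 0
    else
        echo 1
    fi
}

# Typical use (run as root):
#   sysctl -w vm.swappiness="$(recommended_swappiness "$(uname -r)")"
```

A value set with sysctl -w lasts only until the next reboot; adding the same setting to /etc/sysctl.conf makes it persistent.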
Secure Linux Settings
Security-Enhanced Linux (SELinux) is a kernel module and toolset that allows greater security control. The
feature is not compatible with Cloudera Manager 5 and should not be installed, or should be disabled.
1. To determine whether the feature is active, execute the following command:
#From this:
SELINUX=enforcing
#To this:
SELINUX=disabled
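The edit shown above can be applied non-interactively. The following is a minimal sketch, assuming the setting lives in /etc/selinux/config (the standard location on Red Hat Enterprise Linux); the helper name disable_selinux_in_config and its optional path argument are illustrative.

```shell
# Rewrite the SELINUX= line of an SELinux configuration file to "disabled",
# keeping a backup copy next to it with a .bak suffix.
# $1: optional path to the file (defaults to /etc/selinux/config)
disable_selinux_in_config() {
    conf="${1:-/etc/selinux/config}"
    cp "$conf" "$conf.bak"
    # The pattern ends at "=", so the SELINUXTYPE= line is left untouched.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$conf.bak" > "$conf"
}

# Typical use (run as root):
#   getenforce                    # prints Enforcing, Permissive, or Disabled
#   disable_selinux_in_config
```

A change to disabled in this file takes effect only after the system is rebooted.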
Services
All unnecessary daemons and services, such as the CUPS printing service, should be disabled on all
cluster nodes. This reduces maintenance requirements and resource usage.
In addition, all hosts in the cluster should have the same time, date and zone settings. Dell EMC highly
recommends that you run the ntpd service.
To disable or stop any unnecessary daemons:
1. Use the chkconfig command to disable any unwanted services. For example:
Firewall Settings
Cloudera suggests that all firewall software on and between nodes in the cluster be disabled.
1. Check the firewall status by running the following commands:
Caution: You must ensure that you provide suitable network security for the cluster, including
but not limited to external firewalls. Please consult with your local site security administrator to
determine the proper solution.
When iptables is disabled, the Linux kernel still implements a limited amount of IP connection tracking
using a fixed-size table. If there are indications of packet loss (for example, errors of the form
nf_conntrack: table full, dropping packet), increase the size of the connection tracking table by using
sysctl to change the net.netfilter.nf_conntrack_max parameter. Refer to
https://access.redhat.com/solutions/8721 for additional details.
Note: Registration is required to view this solution content.
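To see whether a node is affected, the kernel log can be searched for the message quoted above. This is a minimal sketch; the helper name count_conntrack_drops is illustrative, and NEW_VALUE in the comments is a placeholder, not a recommended setting.

```shell
# Count "nf_conntrack: table full" messages in kernel log text read from stdin.
count_conntrack_drops() {
    # grep -c prints 0 but exits non-zero when nothing matches, so the
    # exit status is normalized with "|| true".
    grep -c 'nf_conntrack: table full, dropping packet' || true
}

# Typical use (run as root):
#   dmesg | count_conntrack_drops
#   sysctl net.netfilter.nf_conntrack_max                    # current limit
#   sysctl -w net.netfilter.nf_conntrack_max=NEW_VALUE       # placeholder
```

A count above zero suggests the table has overflowed at least once since the log buffer was last cleared.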
Ports Listing
See the following link for information about all ports that are used within a Cloudera Hadoop cluster:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_ports.html
This information can be used to program a firewall to protect the entire cluster.
The Red Hat Network Manager should be disabled, or not installed. Interfaces should be configured to use
the normal Red Hat network service.
Disable the Network Manager by following the instructions at:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-NetworkManager_and_the_Network_Scripts.html
We normally configure password-less SSH access (using keys) for the root user, from the node running
Cloudera Manager, to simplify access to all nodes in the cluster. This configuration is not required. If
password-less SSH is not configured, the root password is required by the Cloudera Manager installation
process.
To allow this access:
1. Create the public and private keys by running the following command on all nodes as the root user:
# ssh-keygen
The public key for each machine resides on that machine in the ~/.ssh/ directory, and is named
according to the key type that is chosen (for example, id_rsa.pub).
2. Copy the public key from the High Availability node to all nodes in the cluster.
3. Append the key to the ~/.ssh/authorized_keys file on each of the nodes.
4. Secure the authorized_keys file to ensure that the system is secure. For more information, please see:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/
System_Administrators_Guide/ch-OpenSSH.html
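Steps 3 and 4 can be sketched as a small helper that is run on each receiving node. It assumes the public key file has already been copied to that node; the function name authorize_key and its optional directory argument are illustrative and not part of the original procedure.

```shell
# Append a public key to an authorized_keys file unless it is already there,
# then restrict permissions on the directory and the file.
# $1: path to the public key file (for example a copied id_rsa.pub)
# $2: optional .ssh directory (defaults to ~/.ssh)
authorize_key() {
    ssh_dir="${2:-$HOME/.ssh}"
    mkdir -p "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # -F fixed string, -x whole line, -f read the pattern from the key file
    grep -qxFf "$1" "$ssh_dir/authorized_keys" \
        || cat "$1" >> "$ssh_dir/authorized_keys"
    chmod 700 "$ssh_dir"
    chmod 600 "$ssh_dir/authorized_keys"
}
```

Because the key is appended only when it is missing, the helper can be run repeatedly without creating duplicate entries. Where it is installed, the ssh-copy-id utility from OpenSSH performs the copy and append in one step.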
Cloudera Manager and Cloudera Enterprise use several user accounts and groups to complete their tasks.
These accounts and groups are set up automatically by Cloudera Manager during the cluster install process.
The set of user accounts and groups varies according to which components you choose to install.
Caution: Do not delete these accounts or groups, and do not modify their permissions and rights.
Appendix E
Example node-config.json File
This appendix provides an example node-config.json file.
Topics:
• node-config.json Example
node-config.json Example
{
"ClusterName" : "Silver Stamp",
"DomainName" : "ignition.dell.com",
"GatewayBond0" : "172.16.30.1",
"NetMaskBond0" : "255.255.255.0",
"GatewayBond1" : "10.152.248.1",
"NetMaskBond1" : "255.255.255.0",
"EthsBond0" : "em1,em2",
"EthsBond1" : "p4p1,p4p2",
"TimeZone" : "UTC",
"NTPSubnet" : "172.16.30.0",
"Nodes" : [
{
"ServiceTag": "D120R22",
"NodeType" : "namenode",
"NodeName" : "r1s10-namenode1",
"bond0IP" : "172.16.30.93",
"bond1IP" : "10.152.247.93"
},
{
"ServiceTag": "D100R32",
"NodeType" : "edge",
"NodeName" : "r1s12-edge",
"bond0IP" : "172.16.30.94",
"bond1IP" : "10.152.247.94"
},
{
"ServiceTag": "D115D56",
"NodeType" : "workernode",
"NodeName" : "r1s14-workernode1",
"bond0IP" : "172.16.30.95"
},
.
.
.
]
}
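For a quick look at which nodes a node-config.json file defines, the node names can be pulled out with sed. This is a rough sketch that assumes one key per line, as in the example above; a JSON-aware tool would be more robust. The helper name list_node_names is illustrative.

```shell
# Print the value of every "NodeName" entry in a node-config.json style file.
# Assumes each key/value pair sits on its own line, as in the example above.
# $1: path to the file
list_node_names() {
    sed -n 's/^[[:space:]]*"NodeName"[[:space:]]*:[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}
```

For the example above, the helper would print r1s10-namenode1, r1s12-edge, and r1s14-workernode1, one per line, followed by any names in the elided entries.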
Appendix F
Support
Topics:
• Software Support
• Java Compatibility
Note: Cloudera and Red Hat technical support are paid services, and require support contract
agreements with each respective vendor. Please contact your Dell EMC sales representative for
more details.
Software Support
Table 37: Dell EMC Ready Bundle for Cloudera Hadoop Support Matrix on page 107 describes where
you can obtain technical support for the various components of the Dell EMC Ready Bundle for Cloudera
Hadoop.
Table 37: Dell EMC Ready Bundle for Cloudera Hadoop Support Matrix
Java Compatibility
# java -version
# javac -version
# update-java-alternatives --list
# alternatives --display java
Appendix G
Related Documentation
This topic provides links to the latest related documentation.
For the latest Cloudera Manager and Cloudera Enterprise documentation, please see:
http://www.cloudera.com/documentation/enterprise/latest.html
Note: In particular, see the Cloudera Manager Installation Guide.
For Red Hat Enterprise Linux Server installation and deployment information, please see:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/
System_Administrators_Guide/index.html
Appendix H
References
Topics:
• About Cloudera
• About Syncsort
• To Learn More
Additional information can be obtained at http://www.dell.com/en-us/work/learn/software-platforms-hadoop.
If you need additional services or implementation help, please contact your Dell EMC sales
representative.
About Cloudera
Cloudera is a key contributor to the Apache Hadoop project. The Cloudera Distribution for Apache Hadoop
(CDH) is a highly-scalable open source platform for high-volume data management and analytics. CDH
integrates with existing enterprise IT infrastructure, enabling data engineers and data scientists to quickly
and easily develop and deploy Hadoop applications in a cost-efficient manner.
The Dell EMC servers in this Architecture Guide are Cloudera Certified.
About Syncsort
Syncsort creates software that allows enterprises to collect, integrate, sort, and distribute large amounts of
data quickly, with reduced resource usage, in a cost-effective manner.
Dell EMC is a Syncsort-certified Technology Alliance Partner.
To Learn More
For more information on the Dell EMC Ready Bundle for Cloudera Hadoop, visit http://www.dell.com/en-us/
work/learn/software-platforms-hadoop.
Copyright © 2011-2017 Dell Inc. or its subsidiaries. All rights reserved. Trademarks and trade names may
be used in this document to refer to either the entities claiming the marks and names or their products.
Specifications are correct at date of publication but are subject to availability or change without notice
at any time. Dell Inc. and its affiliates cannot be responsible for errors or omissions in typography or
photography. Dell Inc.’s Terms and Conditions of Sales and Service apply and are available on request.
Dell Inc. service offerings do not affect consumer’s statutory rights.
Dell EMC, the DELL EMC logo, the DELL EMC badge, and PowerEdge are trademarks of Dell Inc.