Contents

1 Introduction
2 Getting the Use Case Accelerators
3 Summary of the Use Cases
    3.1 DMExpress Hadoop ETL Jobs
        3.1.1 Change Data Capture (CDC)
        3.1.2 Joins
        3.1.3 Lookup
        3.1.4 Aggregation
    3.2 DMExpress HDFS Load/Extract Jobs
        3.2.1 Mainframe Translation & Connectivity
        3.2.2 Connectivity
4 Directory Structure of the Use Cases
5 Running the Use Case Accelerators
    5.1 Setting the Environment Variables
    5.2 Running Within Hadoop
        5.2.1 Running the Prep Script to Prepare the HDFS Environment
        5.2.2 Running the Examples within Hadoop via the Job Editor
        5.2.3 Running the Examples within Hadoop via the Command Line
    5.3 Running Outside of Hadoop
        5.3.1 Running the Examples outside Hadoop via the Job Editor
        5.3.2 Running the Examples outside Hadoop via the Command Line
1 Introduction
Syncsort has provided a set of use case accelerators, or examples, to help users
understand how to implement DMExpress Hadoop solutions in a Hadoop
MapReduce framework. These examples have been chosen based on common use
cases seen in the industry and should provide users a head start in developing
solutions for their own use cases.
This guide presents an overview of:
1. Getting the use case accelerators
2. A summary of the use cases
3. The directory structure of the use cases
4. Running the use case accelerators
2 Getting the Use Case Accelerators

Syncsort provides the following pre-configured environments, where the DMX-h ETL (Ironcluster) software and the use case accelerators (UCAs) are pre-installed:
- DMX-h ETL Test Drive VMs (register to get the desired VM download at www.syncsort.com/try):
  1. Cloudera: see DMX-h Quick Start for Cloudera VM for more details
  2. MapR: see DMX-h Quick Start for MapR VM for more details
- Ironcluster in the AWS Marketplace (https://aws.amazon.com/marketplace/pp/B00G9BYV5E)
If you are using one of these environments, the UCA examples are already set up
for you on the edge node. If you are using the Hortonworks 2.x Sandbox VM, which
is not pre-configured with DMX-h ETL, follow the instructions in DMX-h Quick Start
for Hortonworks Sandbox, and within this guide, follow the instructions for running
in your own cluster.
If you want to run the examples in your own cluster, you will need to do the
following:
1. Download the use case accelerators.
2. Unzip and extract the downloaded files to the directory that you will use as $DMXHADOOP_EXAMPLES_DIR.
3 Summary of the Use Cases

The use case accelerators fall into the following categories:
- DMExpress Hadoop ETL Jobs: jobs that can be run in the Hadoop MapReduce framework.
- DMExpress HDFS Load/Extract Jobs: jobs that load data to or extract data from HDFS, run as standard DMExpress jobs.
The following sections further subcategorize the use case accelerators by function,
describing each one and giving the name of the job sub-folder in which the files for
executing that job will be found within the use case directory structure, described
in section 4.
3.1 DMExpress Hadoop ETL Jobs
The following use case accelerators demonstrate common ETL jobs performed
using DMExpress in the Hadoop MapReduce framework. Click on the linked name
to go to the associated KB article.
3.1.1 Change Data Capture (CDC)

CDC Single Output
Description: This use case accelerator demonstrates change data capture (CDC) with a single output.
JobName Folder: FileCDC

Description: This use case accelerator demonstrates change data capture with multiple output targets.
JobName Folder: FileCDC_MultiTargets

Description: This use case accelerator is a combination of the Direct Mainframe Extract & Load and CDC Single Output accelerators.
JobName Folder: MainframeCDCDemo
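The DMExpress job designs themselves live in the folders above. Purely as a conceptual, hypothetical shell analogy of what change data capture does (file names are assumed, and comm requires sorted input), a previous and a current snapshot of a file are compared and the differences classified:

    sort -o previous.txt previous.txt
    sort -o current.txt current.txt
    comm -13 previous.txt current.txt > new_records.txt       # lines only in the current snapshot
    comm -23 previous.txt current.txt > removed_records.txt   # lines only in the previous snapshot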
3.1.2 Joins
3.1.3 Lookup

File Lookup

Lookup + Aggregation
3.1.4 Aggregation

Word Count
Description: This use case accelerator demonstrates the standard Hadoop word count example. It uses pivoting to tokenize the sentences into words, then aggregates on the words to obtain a total count for each.
JobName Folder: WordCount
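The pivot-then-aggregate flow described above is conceptually the classic word count pipeline. As a minimal shell sketch (input.txt is an assumed file name, not part of the accelerator):

    tr -cs 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn

The tr step is the pivot (one word per record); uniq -c is the aggregation that yields a total count for each word.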
3.2 DMExpress HDFS Load/Extract Jobs
3.2.1 Mainframe Translation & Connectivity
Description: This use case accelerator demonstrates how to load a file residing on a remote mainframe system to HDFS.
JobName Folder: MainframeExtractToHDFS
Description: This use case accelerator demonstrates how to load a locally stored mainframe format file to HDFS.
JobName Folder: MainframeExtractLocalToHDFS
Description: This use case accelerator demonstrates how to load a file residing on a remote mainframe system to HDFS, interpreting REDEFINES clauses and converting the data from EBCDIC fixed length, mixed text and binary to ASCII displayable text in the process.
JobName Folder: MainframeRedefineExtractToHDFS
Description: This use case accelerator demonstrates how to load a locally stored mainframe format file to HDFS, interpreting REDEFINES clauses and converting the data from EBCDIC fixed length, mixed text and binary to ASCII displayable text in the process.
JobName Folder: MainframeRedefineExtractLocalToHDFS

3.2.2 Connectivity

HDFS Extract
Description: This use case accelerator demonstrates how to extract TPC-H supplier data from the HDFS file system using HDFS connectivity in a DMExpress copy task.
JobName Folder: HDFSExtract
HDFS Load
Description: This use case accelerator demonstrates how to load TPC-H supplier data to the HDFS file system, using HDFS connectivity in a DMExpress copy task.
JobName Folder: HDFSLoad
HDFS Load Parallel
Description: This use case accelerator is the same as HDFSLoad, except that the data is split into 5 partitions using the DMExpress random partition scheme and then loaded into HDFS in parallel.
JobName Folder: HDFSLoadParallel
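For reference, the load and extract operations these accelerators perform through DMExpress HDFS connectivity can also be expressed with the standard hadoop fs commands. A minimal sketch, assuming the environment variables from section 5.1 and a hypothetical TPC-H supplier file name:

    hadoop fs -put $LOCAL_SOURCE_DIR/supplier.tbl $HDFS_SOURCE_DIR/   # load into HDFS
    hadoop fs -get $HDFS_SOURCE_DIR/supplier.tbl $LOCAL_TARGET_DIR/   # extract back out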
4 Directory Structure of the Use Cases

The solutions and data are organized in the following directory structure, which
should not be modified:
$DMXHADOOP_EXAMPLES_DIR
    bin [contains prep_dmx_example.sh]
    Data [contains source files and a target folder for running outside of HDFS]
        Source [contains sample source data used by the use case accelerators]
        Target [target folder for running outside of HDFS]
    DMXHadoopJobs* [contains all jobs and tasks needed to run the DMX-h MapReduce job, and the prep_file_list required by prep_dmx_example.sh]
*These folders are only present where applicable for the given example.
5 Running the Use Case Accelerators

The use case accelerators can be run both within and outside of Hadoop, via
the Job Editor or the command line, in either a pre-configured DMX-h ETL
environment or your own cluster.
In all cases, some environment variables will need to be set as defined in the
following section, with more detailed instructions for each of the run cases in
the subsequent sections.
5.1 Setting the Environment Variables
The environment variables shown in the table below need to be defined as described.
When running in your own cluster, these variables need to be set manually. When
running in a pre-configured DMX-h ETL environment, they are already set in the
Hadoop-designated user's bash profile and the DMExpress Server dialog.
Variable                  Description/Value
DMXHADOOP_EXAMPLES_DIR    Directory under which the solutions and data were extracted
HADOOP_HOME               Directory where Hadoop is installed, for example: /usr/lib/hadoop-0.20-mapreduce
LOCAL_SOURCE_DIR*         $DMXHADOOP_EXAMPLES_DIR/Data/Source
LOCAL_TARGET_DIR*         $DMXHADOOP_EXAMPLES_DIR/Data/Target
HDFS_SERVER               Name of the HDFS server
HDFS_SOURCE_DIR*          HDFS directory into which the sample source data is loaded
HDFS_TARGET_DIR*          HDFS directory in which the job targets are created
MAPRED_TEMP_DATA_DIR      Directory for intermediate MapReduce data
DMX_HADOOP_MRV1*          Set when running on an MRv1 (rather than YARN/MRv2) cluster
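When running in your own cluster, these variables can be exported in the shell (or in the user's bash profile) before running the examples. A minimal sketch with assumed values; adjust every path and host name for your installation:

    export DMXHADOOP_EXAMPLES_DIR=/home/dmxdemo/examples    # assumed extraction directory
    export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce       # example value from the table
    export LOCAL_SOURCE_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Source
    export LOCAL_TARGET_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Target
    export HDFS_SERVER=namenode.example.com                 # assumed server name
    export HDFS_SOURCE_DIR=/user/dmxdemo/Source             # assumed HDFS path
    export HDFS_TARGET_DIR=/user/dmxdemo/Target             # assumed HDFS path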
5.2 Running Within Hadoop
5.2.1 Running the Prep Script to Prepare the HDFS Environment
If the sample data needed to run the job is not already loaded into HDFS, or if you'll
be running a job that has been previously run, you'll need to run the
$DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh script. If running in your
own cluster, set the following environment variables in the command line
environment as indicated in the table above:
- DMXHADOOP_EXAMPLES_DIR
- HDFS_SOURCE_DIR
- HDFS_TARGET_DIR
- LOCAL_SOURCE_DIR
Then run the script in one of the following ways to load the data for all the examples
or for specific ones:
prep_dmx_example.sh ALL
prep_dmx_example.sh <space-separated list of example folder names>
This script uses the prep_file_list in the example's DMXHadoopJobs folder to load
the specified list of sample data files to HDFS and to delete any targets already
existing from a previous run, since HDFS cannot overwrite files.
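For example, to prepare the HDFS environment for just the CDC and word count accelerators (folder names as listed in section 3):

    $DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh FileCDC WordCount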
5.2.2 Running the Examples within Hadoop via the Job Editor
The following steps describe how to run the use case accelerators within Hadoop
from the Job Editor:
1. For your own cluster, select an FTP or Secure FTP connection and specify
   the corresponding login credentials as required.
2. For a pre-configured DMX-h ETL environment, connect with the following settings:
   Server: DMXhTstDrv<distribution>
   Connection type: Secure FTP
   Authentication: Password
   User name: dmxdemo
   Password: dmxdemo
3. Browse to the desired job and open it.
4. Click on the Run button. If not already pointing to the correct server, select
   the server on which you want to run the job, enter the required credentials
   (as specified in step 2 above), select Run on Hadoop cluster, and click OK.
5. This will bring up the DMExpress Server dialog, which will show the
   progress of the running job. Upon completion, select the job and click on
   Detail to see Hadoop messages and statistics. (The SRVCDFL warning
   message about the data directory can be safely ignored.) To view the Hadoop
   logs, see Where to Find DMExpress Hadoop Logs on MRv1 or Where to Find
   DMExpress Hadoop Logs on YARN (MRv2).
5.2.3 Running the Examples within Hadoop via the Command Line
The following steps describe how to run the use case accelerators within Hadoop
from the command line:
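A hedged sketch of such an invocation, assuming the dmxjob command-line runner and a hypothetical job file name (the exact option for submitting the job to the cluster may differ in your DMExpress version; consult the DMExpress help):

    cd $DMXHADOOP_EXAMPLES_DIR/FileCDC/DMXHadoopJobs   # assumed per-example job folder
    dmxjob /RUN FileCDC.dxj                            # hypothetical job file name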
5.3 Running Outside of Hadoop
The use case accelerators can be run outside of Hadoop for testing from either
the Job Editor or the command line. Additionally, the HDFS load and extract
examples are run outside of Hadoop as regular DMExpress jobs.
5.3.1 Running the Examples outside Hadoop via the Job Editor
Run the use case accelerators as described in section 5.2.2, with the following
adjustment: when running the job, leave Run on Hadoop cluster unselected so
that it runs as a regular DMExpress job.
5.3.2 Running the Examples outside Hadoop via the Command Line
The following steps describe how to run the use case accelerators outside of
Hadoop from the command line:
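A hedged sketch, assuming the dmxjob runner and hypothetical file names; the job runs locally as a regular DMExpress job against the local source and target directories:

    export LOCAL_SOURCE_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Source
    export LOCAL_TARGET_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Target
    dmxjob /RUN WordCount.dxj   # hypothetical job file name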