IBM Training


Instructor Exercises Guide

BigInsights Analytics for Business Analysts


Course code DW643 ERC 3.0

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in many
jurisdictions worldwide:
BigInsights DB2 IBM Watson
InfoSphere Many Eyes Notes
Power
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Other product and service names might be trademarks of IBM or other companies.

May 2014 edition


The information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without
any warranty, either express or implied. The use of this information or the implementation of any of these techniques is a customer
responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While
each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will
be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Copyright International Business Machines Corporation 2012, 2014.


This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Trademarks
Instructor exercises overview
Exercises configuration
Exercises description

Exercise 1. Importing Data into a Workbook
Section 1: Start BigInsights components
Section 2: Set up the IP Address
Section 3: Start the BigInsights Components
Section 4: Create your first workbook
Section 5: Run an application to generate data for a workbook.
Section 6: Create a Workbook directly from the web crawler application.

Exercise 2. Adding Sheets to a Workbook
Section 1: Background
Section 2: Create a Function sheet
Section 3: Further refine the collection
Section 4: The saga continues

Exercise 3. Working with Functions
Section 1: Background
Section 2: Develop an appropriate regular expression
Section 3: Create a new workbook
Section 4: Count the occurrences

Exercise 4. Big SQL Integration
Section 1: Verify that the test data exists in the HDFS
Section 2: Create a BigSheets Workbook
Section 3: Creating a Big SQL table
Section 4: Working with the Big SQL table
Section 5: Creating a worksheet from a table.

Exercise 5. Analyzing Social Media and Structured Data
Section 1: Load the Test Data into HDFS
Section 2: Create a BigSheets Workbook from the DBMS Data
Section 3: Create a BigSheets Workbook from the Boardreader Data
Section 4: Tailoring a Workbook
Section 5: Union the Two Workbooks
Section 6: Exploring the Workbook
Section 7: Visualize your Data
Section 8: Filter Results and Extracting URL Data
Section 9: Combining Social Media with RDBMS Data
Section 10: Performing a Group or "Pivot" Operation
Section 11: Joining Social with Structured Data
Section 12: Dashboard

Instructor exercises overview
Each exercise depends on successful completion of the first exercise.

Exercises configuration
Each student has a separate system, and students work independently.

Exercises description
This course includes the following exercises:
- Importing Data into a Workbook
- Adding Sheets to a Workbook
- Working with Functions
- Big SQL Integration
- Analyzing Social Media and Structured Data
In the exercise instructions, you can check off the line before each step as
you complete it to track your progress.
Most exercises include required sections, which should always be completed.
It might be necessary to complete these sections before you can start later
exercises. Some exercises might also include optional sections that you
might want to complete if you have sufficient time and want an extra
challenge.

Exercise 1. Importing Data into a Workbook

Estimated time
0:30

What this exercise is about


The student imports data into workbooks from various sources using various
techniques. One set of data was exported from a DB2 table. A second set of
data is the output of a MapReduce program, and the third comes from
running a web crawler.

What you should be able to do


At the end of this exercise, you should be able to:
- Import data into a workbook
   - From the Files tab
   - From the BigSheets tab
   - Directly from an application

Requirements
Requires the DW643 lab images.

Exercise instructions
Preface
The password for root is dalvm3. The password for biadmin is ibm2blue.

Section 1: Start BigInsights components


Your lab environment runs Linux in a VMware virtual machine. This means that you are running
one operating system (the guest) on top of another operating system (the host).

Classroom
__ 1. If you are not already logged on to your Windows machine, do so now. You must get the
userid and password from your instructor.
__ 2. Execute the VMPlayer program and choose to boot your lab image. This, too, might require
some instruction from your instructor.
__ 3. You might get prompted to create a new unique identifier. Choose to create a new one.
__ 4. When prompted for a userid and password, enter biadmin and a password of ibm2blue.
You now have two desktops. To remove possible confusion during the exercises, I suggest that
you maximize the VMPlayer desktop.

eLabs
To log into your eLab system, you point your browser to a Citrix server that is running somewhere in
IBM. You have your own Citrix userid and password.
Once you have logged into Citrix, click the icon that represents your lab image. Your VMware
image should automatically boot.
__ 1. After your Linux image boots, enter biadmin and a password of ibm2blue.
You now have a desktop displayed in your browser window. To remove possible confusion
during the exercises, I suggest that you maximize your browser desktop.

Both classroom and eLabs

Section 2: Set up the IP Address


Because you are working with a copy of the original lab image, you need to make sure that your
system has a valid IP address; otherwise, you cannot start Hadoop.
__ 1. In the taskbar click Computer.
__ 2. In the System list, select Control Center.
__ 3. Click Network Settings. The password is dalvm3.
__ 4. If you have two controllers listed, select the one that is DHCP and delete it.
__ 5. Edit the remaining controller. On the Address tab, click Next. Then, click OK.
__ 6. Close the Control Center.
__ 7. Right-click the desktop and select Open in Terminal.
__ 8. Switch to root. The password is dalvm3.
su -
__ 9. Check your IP address.
ifconfig
__ 10. Edit the hosts file.
gedit /etc/hosts
__ 11. Change the IP address for ibmclass to the value that ifconfig displayed.
__ 12. Save your work and close gedit.
__ 13. Exit out of root and close the terminal window.
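
The edit made in steps 10 through 12 can be sketched in Python to show what changes in the hosts file. This is only an illustration, not part of the lab: the function name set_host_ip and both IP addresses are hypothetical, and the sketch assumes each entry has the usual layout of an IP address followed by one or more host names. It rewrites matching lines with tab separators and drops any trailing comment on them.

```python
def set_host_ip(hosts_text, hostname, new_ip):
    """Return hosts_text with the entry for hostname pointing at new_ip."""
    out = []
    for line in hosts_text.splitlines():
        # Ignore anything after a '#' when looking for the host name.
        fields = line.split("#", 1)[0].split()
        # A hosts entry is an IP address followed by one or more names.
        if len(fields) >= 2 and hostname in fields[1:]:
            line = "\t".join([new_ip] + fields[1:])
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file contents and addresses, for illustration only.
sample = "127.0.0.1\tlocalhost\n192.168.1.10\tibmclass\n"
updated = set_host_ip(sample, "ibmclass", "192.168.70.128")
print(updated, end="")
```

In the lab itself, editing /etc/hosts by hand in gedit is all that is required.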

Section 3: Start the BigInsights Components


Some of the components of BigInsights must be running in order to do the lab exercises. Although
you could start just those components that are needed, it is easier to start all of them.
__ 1. Right-click the desktop and select Open in Terminal in order to open a command line.
__ 2. Execute the following to start all BigInsights components. If prompted for a password to
unlock the keyring, type in ibm2blue. Then, click OK.
$BIGINSIGHTS_HOME/bin/start.sh all

Important

If you used this image for some DW612 exercises, then you might have problems starting Hadoop
(the TaskTracker cannot start). If this is the case, stop the monitoring processes and Hadoop.
$BIGINSIGHTS_HOME/bin/stop.sh monitoring
$BIGINSIGHTS_HOME/bin/stop.sh hadoop
Next, execute the following commands:
cd $BIGINSIGHTS_HOME/bin/hdm/IHC/bin
./start-dfs.sh
./start-mapred.sh
Then verify that Map/Reduce and HDFS are started.

Section 4: Create your first workbook


__ 1. Start a web browser using Firefox and invoke the BigInsights Console.
http://ibmclass:8080
__ 2. Log into the web console. Type in a userid of biadmin and a password of ibm2blue.
__ 3. In the web console, click the Files tab.
__ 4. Expand hdfs://ibmclass:9000/ and then user.
__ 5. Select biadmin. Then click the Create Directory icon that is above the HDFS tree view area.
Make the directory name csv_data.
__ 6. Select the newly created csv_data directory. (You might have to do a refresh to have it
displayed. This will be true throughout these exercises. If you create a directory or file and it
is not displayed, then do a refresh.) Click the Upload icon.
__ 7. Click the Browse pushbutton.

Note

Due to the fact that you are accessing your desktop remotely, you can have a problem when drilling
down on directories using the double-clicking technique. Due to network delay, the double-clicks
might not arrive within the correct time interval. If this happens, select the directory and press the
Enter key.

__ 8. Drill down on File System->home->biadmin->labfiles->DW64. Select employee_state.csv
and click Open.
__ 9. employee_state.csv has been added to the upload list.
__ 10. Click the Browse pushbutton again. Select product.csv. Click Open.
__ 11. Click the Browse pushbutton again. Select sales.csv. Click Open.
__ 12. Click the OK pushbutton.
__ 13. Expand the csv_data directory and then select employee_state.csv. The contents of
employee_state.csv are displayed in the right frame in a text format.
__ 14. Click the Sheet radiobutton, located above the displayed text.
__ 15. The data has been read by the BigSheets line reader. Unfortunately, the data is in a
comma-separated format, so you need to specify a different BigSheets reader. Click the
pencil icon.
__ 16. From the Select a reader drop-down box, select Comma Separated Value (CSV) Data.
Since the data includes header information, make sure the Headers included? checkbox is
checked. Click the green check mark.
__ 17. Click the Save As Master Workbook pushbutton. Give it a name of Employees and click
Save.
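
The difference between the default line reader and the CSV reader chosen in steps 15 and 16 can be sketched in Python. This is a rough analogy, not the BigSheets implementation: the sample columns below are hypothetical, because the real contents of employee_state.csv are not shown in this guide.

```python
import csv
import io

# Hypothetical sample; the real employee_state.csv columns are not shown here.
raw = "emp_id,name,state\n1,Ann,TX\n2,Raj,CA\n"

# Line reader analogy: every line is one opaque string in a single column.
line_rows = raw.splitlines()

# CSV reader analogy with "Headers included?" checked:
# the first row supplies the column names for the remaining rows.
csv_rows = list(csv.DictReader(io.StringIO(raw)))

print(line_rows[1])          # one undivided string
print(csv_rows[0]["state"])  # a single named field
```

With the line reader, the header row is treated as data and nothing can be addressed by column, which is why the exercise switches readers before saving the master workbook.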

Section 5: Run an application to generate data for a workbook.


__ 1. In the BigInsights Console, click the Files tab.
__ 2. Drill down to the biadmin directory.
__ 3. Select biadmin and click the Create Directory icon.
__ 4. Name the directory GutenbergDocs.

__ 5. Select the GutenbergDocs directory and click the Upload icon.
__ 6. Click the Browse pushbutton and drill down to
File System->home->biadmin->labfiles->DW64. Select last_of_the_mohicans.txt. Click
Open.
__ 7. Then, click OK.

Deploy the Word Count application


__ 8. Click the Applications tab.
__ 9. At the top of the frame on the left side, click the Manage hyperlink.
__ 10. On the left side, expand the Test category item.
__ 11. Select Word Count.
__ 12. At the top of the frame on the right, click the Deploy pushbutton.
__ 13. Select the biadmin and biusers checkboxes under Security.
__ 14. Click the Deploy pushbutton. Now the application can run.
__ 15. At the top of the frame on the left side, click the Run hyperlink.
__ 16. Select the Word Count icon.
__ 17. Enter the parameters necessary to run the Word Count application.
__ a. Execution Name should be Wordcount.
__ b. Click the Browse pushbutton for Input path. Drill down and select the GutenbergDocs
directory. Click OK.
__ c. For Output path, type in /user/biadmin/wordcount.
__ 18. Scroll up and click the Run pushbutton. You can view the progress in the Application History
area. When you see a green box with a white checkmark, the run has finished.
__ 19. Click the BigSheets tab.
__ 20. Click the New Workbook pushbutton.
__ 21. In the Name field, type Wordcount.
__ 22. Drill down to and expand the wordcount directory.
__ 23. Select part-r-00000. The default Line Reader suffices for this data.
__ 24. Click the green checkmark.
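
To make the shape of the part-r-00000 file concrete, here is a minimal Python sketch of a word count that emits one line per unique character string, with the string and its count separated by a tab. It assumes the text is split on whitespace, which may not match exactly how the Word Count application tokenizes its input, and the function name word_count_lines is hypothetical.

```python
from collections import Counter


def word_count_lines(text):
    """Return 'string<TAB>count' lines, one per unique whitespace-separated string."""
    counts = Counter(text.split())
    return ["{}\t{}".format(word, n) for word, n in sorted(counts.items())]


for line in word_count_lines("the fox and the dog"):
    print(line)
```

Because each output record is a single line of text, the default line reader is enough to load it into a workbook; the count is separated from the string later, in Exercise 3.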

Section 6: Create a Workbook directly from the web crawler application.


Run the supplied web crawler application to read a demo website that has a list of people who have
patents and information about those patents.

Deploy the Web Crawler application


__ 1. In the BigInsights Console, click the Applications tab.
__ 2. Click the Manage hyperlink.
__ 3. Expand the Web category and select Web Crawler.
__ 4. Click the Deploy pushbutton.
__ 5. Select the biadmin and biusers checkboxes under Security. Then, click the Deploy
pushbutton.
__ 6. Click the Run hyperlink.
__ 7. Click the Web Crawler icon.
__ 8. Enter the parameters necessary to run the Web Crawler application.
__ a. For the Execution Name, enter the name PatentsWebcrawl.
__ b. For URLs enter
http://www.ibm.com/software/ebusiness/jstart/bigsheets/demo/Patents.html
__ c. In the Filters box, type in a period ( . ). (This indicates to include everything.)
__ d. In Output Directory, type /user/biadmin/webcrawler.
__ e. For Maximum crawl depth, type a value of 10.
__ f. For Maximum pages per crawl depth, type a value of 10.
__ 9. Expand the Schedule and Advanced Settings.
__ 10. Select the Update Workbook checkbox.
__ 11. In the Select Workbook drop-down box, select Create New Workbook.
__ 12. Give the workbook a name of PatentCrawler.
__ 13. Click the pencil icon. Expand the reader drop-down box and select Basic Crawler Data.
Click OK. Then, click OK again.
__ 14. In the Select Output... drop-down box, select Output Directory.
__ 15. Click the Add row pushbutton.
__ 16. Click the Run pushbutton.
__ 17. After the job completes (it can run for up to 30 minutes), click the BigSheets tab.
__ 18. Click the PatentCrawler hyperlink and view the crawler data in the workbook.

End of exercise

Exercise 2. Adding Sheets to a Workbook

Estimated time
0:30

What this exercise is about


The student creates a new workbook based on the web crawler workbook
and uses Function sheets to extract data from web pages.

What you should be able to do


At the end of this exercise, you should be able to:
- Add sheets to a workbook
- Run a workbook

Requirements
Requires the DW643 lab images.
Requires exercise 1 to be completed

Exercise instructions
Section 1: Background
In the previous exercise you created a workbook based on the results of a web crawler. The crawler
was directed to extract information from a website that dealt with patents. Essentially the web
crawler looked at a site that had a list of names. Each name is a hyperlink to a page that lists the
patents for that person.
__ 1. In your web browser, open a new tab by clicking the plus-signed tab.
__ 2. Go to the following website:
http://www.ibm.com/software/ebusiness/jstart/bigsheets/demo/Patents.html
__ 3. There you see a list of names. Click on any name and it takes you to a page that lists all of
the patents registered to that individual. This is to give you a frame of reference when doing
this exercise.
__ 4. You can close this newly opened tab. (If you did not open a new tab, then click BigInsights
Console in the bookmark toolbar to return to the web console.)

Section 2: Create a Function sheet


__ 1. Log into the BigInsights Console if you previously logged out. biadmin / ibm2blue.
__ 2. Click the BigSheets tab.
__ 3. Click the PatentCrawler workbook.
__ 4. PatentCrawler is a master workbook. No modifications can be made to it. However, you can
create a new workbook that is based on PatentCrawler and this new workbook can be
modified. Click the Build new workbook pushbutton.
__ 5. Click the Add sheets pushbutton.
__ 6. You are to apply a function to some of the data. Select Function from the list of sheet types.
__ 7. Since you are not aware of all of the functions at your disposal, it is best to get a list of them.
Click the Categories hyperlink.
__ 8. The ultimate goal is to get patent information for each person on the patents website. You
want to apply some function that works with HTML tags. Click the html hyperlink.
__ 9. On each individual's patent page, the person's name is enclosed in an HTML H1 tag.
The first thing that needs to be done is to get the content for each H1 tag. Select the
HTMLEXTRACTTAG hyperlink from the list of functions.
__ 10. You have the option of giving each new sheet a descriptive name. But this is a lab exercise
and students normally want to do the least possible amount of work. So leave the name as
Sheet1.
__ 11. Click the content drop-down box. You are presented with a list of all columns in the sheet
from which to make a selection. Select Content.
__ 12. You want to get the names of the individuals. They are displayed using an HTML H1 tag. So
in the tag field, type in H1.
__ 13. For occurrence, enter a value of 1.
__ 14. At the bottom of the entry dialog, click the Carry over (0) tab.
__ 15. You might not realize it, but you not only want the extracted information in this new sheet,
but, for your purposes, you also want the Content column information as well. From the Add
columns to carry over drop-down box, select Content. Then, click the green plus sign. This
adds the Content column to the carry over list.
__ 16. Click the green checkmark at the bottom.
__ 17. You now have a new sheet. But the heading name for column A needs some work. Move
the mouse cursor to the HTMLEXTRACTTAG column heading and click the drop-down
indicator. Select Rename.
__ 18. Highlight the name and change it to NameWithH1Tag. (The column name cannot have any
spaces.) Click the green checkmark.

Section 3: Further refine the collection


__ 1. You did not get the results that you really desired. There are HTML tags encapsulating the
desired data. So keep progressing in baby steps toward your goal. You must create
another sheet. You want it based on Sheet1. So make sure that the tab for Sheet1 is
selected. Then click Add sheets.
__ 2. You want to create another Function sheet.
__ 3. Once again click Categories and then html.
__ 4. This time, you want to get the encapsulated values so use the HTMLTAGVALUE function.
__ 5. Keep the default sheet name and click the elements drop down box. Select
NameWithH1Tag.
__ 6. Click the Carry over tab and add Content to the carry over list by clicking the green plus
sign.
__ 7. Click the green checkmark.
__ 8. Change the name of column A to PatentOwner.
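
The two-step extraction in Sections 2 and 3 can be sketched in Python. The functions extract_tag and tag_value below are hypothetical stand-ins written for this illustration; they are not the BigSheets HTMLEXTRACTTAG and HTMLTAGVALUE implementations, and the sample page is invented. The sketch assumes the first function returns the selected element with its tags and the second returns only the text inside, which is the behavior the steps above describe.

```python
import re


def extract_tag(content, tag, occurrence=1):
    """Return the nth <tag>...</tag> element, tags included, or '' if absent."""
    pattern = re.compile(
        r"<{0}\b[^>]*>.*?</{0}>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    matches = pattern.findall(content)
    if 0 < occurrence <= len(matches):
        return matches[occurrence - 1]
    return ""


def tag_value(element):
    """Remove the markup and return the remaining text."""
    return re.sub(r"<[^>]+>", "", element).strip()


# Hypothetical page content, for illustration only.
page = "<html><body><h1>Jane Doe</h1><h2>Patent A</h2></body></html>"
name_with_h1_tag = extract_tag(page, "H1", 1)
patent_owner = tag_value(name_with_h1_tag)
print(name_with_h1_tag)
print(patent_owner)
```

Splitting the work across two sheets mirrors the exercise: the first sheet isolates the element, and the second strips the markup from it.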

Section 4: The saga continues


__ 1. Now get all of the patents for each individual. Make sure Sheet2 is selected and click Add
sheets.
__ 2. Select a Function sheet. Click Categories then html.
__ 3. This time, select HTMLEXTRACTTAGS (with an S). HTMLEXTRACTTAG lets you specify
the occurrence. HTMLEXTRACTTAGS selects all occurrences.
__ 4. Keep the default sheet name. From the content drop-down, select Content.
__ 5. For tag type in H2. (This is the tag associated with the name of each patent for an
individual.)
__ 6. Click the Carry over tab and add PatentOwner to the list.
__ 7. Click the green checkmark.
Now you have a row for each patent that is associated with each individual. Notice that you
also removed the blank name and the one called Found.
Ultimately, your goal might be to count the number of patents for each individual. However, you
are going to stop here. The presentation material has not covered the additional topics that are
required to complete that goal. Later, once you have this additional information, you could come
back here and add the additional required sheets.
__ 8. Rename your workbook. At the top of the spreadsheet, click the pencil icon. Change the
name to Patent Extract and click the green check mark.
__ 9. Then, click the Save pushbutton. Then select Save & Exit. (At this point you also have the
option to rename the collection.) Click Save.
__ 10. Notice that this new workbook has not been run. You have only been working with a subset
of data. In order for your work to be applied to all of the data, you must run the workbook.
Click either one of the Run pushbuttons.
There is a yellow triangle on the left side above the data that indicates that the workbook
has not been run. (There is also a progress indicator on the right side.) When the processing
completes, the triangle changes to a green checkmark. Also, the percentage complete on
the right side goes to 100%.
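
The effect of the Section 4 sheet, one output row per H2 tag with PatentOwner carried over, can be sketched in Python. The function extract_tags is a hypothetical stand-in, not the BigSheets HTMLEXTRACTTAGS implementation, and the sample rows are invented. For brevity the sketch returns only the text inside each tag; the actual function's output format may differ.

```python
import re


def extract_tags(content, tag):
    """Return the text inside every <tag>...</tag> element found in content."""
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [value.strip() for value in pattern.findall(content)]


# Hypothetical Sheet2 rows as (PatentOwner, Content) pairs.
sheet2 = [
    ("Jane Doe", "<h1>Jane Doe</h1><h2>Patent A</h2><h2>Patent B</h2>"),
    ("", "<h1></h1><p>No patents listed</p>"),
]

# One output row per extracted tag; PatentOwner is the carried-over column.
sheet3 = [
    (patent, owner)
    for owner, content in sheet2
    for patent in extract_tags(content, "H2")
]
for row in sheet3:
    print(row)
```

In this sketch, a page with no H2 tags contributes no rows at all, which is one plausible reason the rows without a usable name no longer appear after this step.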

End of exercise

Exercise 3. Working with Functions

Estimated time
0:30

What this exercise is about


The student creates a new workbook based on the results of the Word Count
program and adds new columns with specific functions to the sheet.

What you should be able to do


At the end of this exercise, you should be able to:
- Add new columns to a sheet
- Code some functions directly for a column

Requirements
Requires the DW643 lab images.
Requires exercise 1 to be completed

Exercise instructions
Section 1: Background
In the first exercise, you ran the Word Count program. This generated a record for each unique
character string found in the specified document. It then totaled the number of occurrences of each
unique character string. Your goal is to create a sheet that has the number of occurrences for all
character strings that occur the same number of times. (If that is confusing, just bear with me.)
__ 1. Log into the BigInsights Console. biadmin / ibm2blue
__ 2. Select the BigSheets tab.
__ 3. Select the Wordcount workbook. Displayed for each row is a character string, followed by a
tab character, followed by a number. The number represents the occurrences of that
character string in the last_of_the_mohicans.txt document.
Your mission is to count the number of character strings that occur the same number of
times.
In the previous exercise, you used Function sheets to apply functions to your data. This time,
you are going to add new columns to a sheet and code the needed functions directly.
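
The goal stated above can be sketched in Python before building it in BigSheets. The sample lines below are hypothetical records in the string, tab, number layout described in step 3; the real Wordcount data is much larger.

```python
from collections import Counter

# Hypothetical records in the "string<TAB>number" layout of the Wordcount workbook.
lines = ["the\t3", "fox\t1", "dog\t1", "and\t3", "mohican\t2"]

# Take the number after the last tab in each record.
occurrences = [int(line.rsplit("\t", 1)[1]) for line in lines]

# For each occurrence value, count how many character strings have that value.
strings_per_occurrence = Counter(occurrences)

for n, how_many in sorted(strings_per_occurrence.items()):
    print("{} string(s) occur {} time(s)".format(how_many, n))
```

Here two strings occur once, one occurs twice, and two occur three times. The rest of this exercise produces the same kind of result with sheet columns and functions instead of code.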

Section 2: Develop an appropriate regular expression


__ 1. The question now is: how do you extract the occurrence number from each record? Well, it
looks like regular expressions might be helpful. But what if you are not strong in coding
regular expressions? You can use the BigInsights AQL development environment to help
you.
On your desktop is an icon to start Eclipse. Double-click it. When prompted for a workspace,
click OK.
__ 2. If prompted for a password, enter ibm2blue.
__ 3. Open the BigInsights perspective by selecting
Window->Open Perspective->Other. Select BigInsights and click OK.
__ 4. In the toolbar, click the Open Regular Expression Builder Wizard icon. It should be the
second enabled icon from the left.
__ 5. In the Type the text that you want to use to test the rule area at the lower left of the panel,
key in the following. Press the Tab key after the letter c, then type 33.
abc 33
This is used to validate your regular expression.
__ 6. Next, you are going to type a regular expression in the Specify a regular expression rule
area that allows you to extract the occurrence value.
In order to extract a portion of the text using regular expressions, you must use groups. A
group is defined by opening and closing parentheses. Type the following expression. Here is what
it means: the parentheses specify a group, the period means any character, the plus sign means
one or more occurrences, and \t means the tab character. So this group extracts one or more
characters that are followed by the tab character.
(.+\t)
__ 7. Once you have typed in the above, you can see the portion of the test data that matches.
Next, append the following to your regular expression rule. Translation: [0-9] means any
digit zero through nine. The plus sign means one or more occurrences.
([0-9]+)
The entire expression should look like: (There is no space between the two groups.)
(.+\t)([0-9]+)
__ 8. The entire test string(s) should now be highlighted. If you look in the Matched area (lower
right), you should see the following under sub-pattern.
1:[abc ]2:[33]
This indicates that the second group extracted the value 33.

Note

For those of you who work with regular expressions, you might have wondered why I did not have
you code (.+\t)(\d+). The reason is that when that expression is entered directly into a BigSheets
regular expression function for a column, it mistakenly is detected as an error.

__ 9. You can close the regular expression wizard and then close Eclipse.
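
If you want to check the expression outside Eclipse, the following minimal Python sketch applies the same pattern to the test string. It is only an illustration: Python's re module stands in for the Regular Expression Builder, and the helper name split_record is made up for this example. For this ASCII test string, the (.+\t)(\d+) form mentioned in the note extracts the same two groups.

```python
import re

# The expression built in this section, plus the \d variant from the note.
PATTERN = r"(.+\t)([0-9]+)"
PATTERN_WITH_D = r"(.+\t)(\d+)"


def split_record(record, pattern=PATTERN):
    """Return (group 1, group 2) for a 'string<TAB>number' record, or None.

    split_record is a hypothetical helper used only in this illustration.
    """
    match = re.search(pattern, record)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Test string from step 5: abc, a tab character, then 33.
sample = "abc\t33"
groups = split_record(sample)
print(groups)  # group 1 keeps the trailing tab, group 2 is the number
```

Note that group 1 includes the tab character, which is why only the second group is of interest for the occurrence count.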

Section 3: Create a new workbook


__ 1. Return to the BigInsights console (user ID biadmin, password ibm2blue).
__ 2. Select the BigSheets tab.
__ 3. Select the Wordcount workbook.
__ 4. Click the Build new workbook pushbutton.
__ 5. Click Add Sheets and select Formula.
__ 6. Keep the default sheet name.
You are to create a new sheet and specify a formula that references the Header column from
the Wordcount workbook. The result is to be a new column called Occurrences. Then, use the
GETGROUPMATCH() function to populate this new column. Here is the format of that function
as shown in the BigInsights Information Center.
GETGROUPMATCH (text,regex,group,number)
You know which regular expression to use. You know that you want the second group. But what
about the text parameter? You want this function to be applied to each row of the Header
column. To indicate that, you specify #Header, as in:
GETGROUPMATCH(#Header,'(.+\t)([0-9]+)',2)

You are to reference the Wordcount sheet and, as stated, this new column is to have the name
of Occurrences. The final parameter, 2, says to extract the second group from the regular
expression.
__ 7. Code the following in the fx field.
Wordcount!A1 : [Occurrences = GETGROUPMATCH(#Header,'(.+\t)([0-9]+)',2) ]
__ 8. Click the green checkmark.
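
To see what the formula does to each row, here is a rough Python equivalent. The function get_group_match is a hypothetical stand-in written for this illustration, not the BigSheets implementation. Returning an empty string when nothing matches is an assumption, and the sample rows and their counts are made up.

```python
import re


def get_group_match(text, regex, group):
    """Hypothetical stand-in for GETGROUPMATCH.

    Returns the text captured by the requested group. Returning an empty
    string on no match is an assumption made for this sketch.
    """
    match = re.search(regex, text)
    return match.group(group) if match else ""


# Made-up rows in the 'string<TAB>number' layout of the Header column.
header_column = ["the\t2745", "Mohicans\t12", "Uncas\t10"]

# Equivalent of applying the formula to #Header for every row.
occurrences = [
    get_group_match(row, r"(.+\t)([0-9]+)", 2) for row in header_column
]
print(occurrences)
```

The values come back as text, which is sufficient here because the next section only compares them for equality.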

Section 4: Count the occurrences


__ 1. Add another column to your sheet. Click the drop-down icon for Occurrences and select
Insert Right->New Column.
__ 2. Give it a name of Num.
__ 3. You want to take the value for each row in the Occurrences column and select all rows that
have that same value. Then count the number of those rows returned. In the fx field for the
Num column, type the following formula:
COUNT(SELECT(Occurrences,#Occurrences).Occurrences)
__ 4. Click the green checkmark.
You can see that there were 1266 unique character strings that only occurred once in the
document. But you do not want to see that 1266 times. You need to remove duplicate rows.
__ 5. Make sure Sheet1 is selected and add a new sheet. Select Distinct.
__ 6. Keep the default name for the new sheet and click the green checkmark.
Now you have your results. There are 1266 unique character strings that occurred once,
fourteen unique character strings that occurred 10 times, and so on.
__ 7. Save your collection. Give it a name of Wordcount Totals. If you did not do a Save and
Exit, then exit the workbook. Do not worry about running the collection. (Heck this is only a
lab exercise.)
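
The combination of the Num column and the Distinct sheet can be sketched in Python as follows. This is only a model of the logic under the assumption that each row holds one Occurrences value. The sample values are made up and are not the lab's data.

```python
from collections import Counter

# Made-up Occurrences values, one per unique character string.
occurrences = [1, 1, 1, 10, 10, 3]

# Num column: for each row, count the rows that share its Occurrences value.
rows_per_value = Counter(occurrences)
sheet1 = [(value, rows_per_value[value]) for value in occurrences]

# Distinct sheet: drop duplicate (Occurrences, Num) rows, keeping first-seen order.
distinct = list(dict.fromkeys(sheet1))
print(distinct)
```

Before the Distinct step, the pair (1, 3) appears three times in sheet1, which is the duplication that step 5 removes.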

Note

Now you have some additional knowledge that can help you with the second exercise to create a
sheet that returns the number of patents that are associated with each owner. However, you need
one or two more pieces of information that you will get from the last exercise.

End of exercise

Exercise 4. Big SQL Integration

Estimated time
0:20

What this exercise is about


The student will use Big SQL tables with BigSheets.

What you should be able to do


At the end of this exercise, you should be able to:
- Create a Big SQL table from a worksheet
- Create a worksheet from a Big SQL table

Requirements
Requires the DW643 lab images.


Exercise instructions
Section 1: Verify that the test data exists in the HDFS
Do the following steps to load the news-data.txt and products.csv files into the HDFS.
__ 1. Log in to the BigInsights Console (user ID biadmin, password ibm2blue).
__ 2. Select the Files tab.
__ 3. Drill down to user/biadmin and select that directory.
__ 4. Click the Create Directory icon and create a directory called Watson.
__ 5. Select the Watson directory and click the Upload icon.
__ 6. Browse to home->biadmin->labfiles->DW64. Select news-data.txt and click Open.
__ 7. Then, click OK.

Section 2: Create a BigSheets Workbook


__ 8. Click the BigSheets tab. This takes you to the list of all workbooks.
__ 9. Click the New Workbook pushbutton.
__ 10. Type a Name of NewsData.
__ 11. Drill down to user/biadmin and expand the Watson directory.
__ 12. Select the Watson directory. This step is important: if you select the news-data.txt file, you
will NOT be able to create a table later. You MUST select the directory, not the file.
__ 13. Click the pencil icon and select the JSON Array reader. Click the green checkmark.
__ 14. Now, with the data formatted properly, scroll down (if you have to) and click the green
checkmark.

Section 3: Creating a Big SQL table


__ 15. Click the NewsData workbook.
__ 16. Click the Create Table pushbutton and specify the Target Schema: sheets and the Target
Table: NewsData.
Your table has been created. You will notice that the pushbutton changed from Create Table to
Delete Table. This means that you can only create one table from a particular worksheet.
__ 17. In this next step, you will check out the table. Click on the Files tab and select the Catalog
Table tab. Expand the sheets schema to see the table.

Section 4: Working with the Big SQL table


The table that you just created uses the exact same data as the worksheet. You will now see how
you can execute SQL queries against that table.

__ 18. You can do this through the BigInsights Console. Click the Welcome tab and select
Run BIG SQL Queries under Quick Links. This opens a new browser tab.
__ 19. In the text box, enter your Big SQL query. Start with a simple one:
select * from sheets.NewsData;
__ 20. Click the Run pushbutton. You should see the spinning icon in the middle while the query is running.
__ 21. You should now see the results below, in the Results tab.
__ 22. Enter another query. The string value is enclosed in single quotes:
select * from sheets.NewsData where country = 'US';
__ 23. Teaching Big SQL queries is beyond the scope of this lab, so we will stop here. Understand that
you have now successfully created a table against which you can run Big SQL queries.
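
In standard SQL, a string value in a WHERE clause is written in single quotes, so a filter on the country value US is written country = 'US'. The following Python sketch illustrates the difference. It uses an in-memory SQLite database purely as a runnable stand-in (an assumption; it is not Big SQL), and the table name news and its rows are made up. Big SQL might report the unquoted case with a different message.

```python
import sqlite3

# SQLite in memory is an assumption used only to make the example runnable;
# the table and rows below are made up and are not the lab's news data.
conn = sqlite3.connect(":memory:")
conn.execute("create table news (country text, language text)")
conn.executemany(
    "insert into news values (?, ?)",
    [("US", "English"), ("GB", "English"), ("US", "Spanish")],
)

# Quoted: 'US' is a string literal, so the filter compares column values.
us_rows = conn.execute(
    "select country, language from news where country = 'US' order by language"
).fetchall()
print(us_rows)

# Unquoted: US is parsed as an identifier (a column name) that does not exist.
try:
    conn.execute("select * from news where country = US")
    unquoted_error = None
except sqlite3.OperationalError as exc:
    unquoted_error = str(exc)
print(unquoted_error)
```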

Section 5: Creating a worksheet from a table.


In this section, you will first create a Big SQL table. Then you will import data from a file into that
table. Finally, you will see how you can easily create a workbook of that table's data.
__ 24. Open up a Big SQL shell from the terminal. Navigate to this directory:
/opt/ibm/biginsights/bigsql/jsqsh/bin
__ 25. Execute this command to start up the shell: ./jsqsh -U biadmin -P ibm2blue
__ 26. If it asks to go through a setup, you can just quit the setup and continue.
__ 27. The first command creates a schema. Execute this command in the shell: create schema
bigsql;
__ 28. Now create the table. Execute this command:
a. create table bigsql.products (PROD_NAME varchar(90), description varchar(90),
category varchar(90), qty_on_hand int, prod_num int) row format delimited fields
terminated by ',';
__ 29. The next step is to load the data from the products.csv file, which should already have been
loaded into the HDFS. Execute this command:
a. load hive data inpath '/user/biadmin/csv_data/product.csv' overwrite into table
bigsql.products;
__ 30. Execute this command to verify that the load was successful: select * from
bigsql.products;
__ 31. Now that you have a Big SQL table, to create a worksheet from this table, navigate to the
Files tab in the Web Console.
__ 32. Click the Catalog Table tab and expand the bigsql schema. You may need to do a refresh
to see the new schema.
__ 33. Select the products table.
__ 34. On the right side, the products table is displayed. Click Save as Master Workbook
and specify the name of the workbook as bigsql_products. Click Save.


You are brought into the BigSheets tab, where the new workbook that you just created is displayed.
From here you can do anything that you normally would with a workbook, except create a
table. That pushbutton is grayed out because this workbook came from a table.
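
The table definition in this section declares five columns and says that fields are terminated by a comma. As a rough illustration of how one line of such a file maps onto those columns, here is a small Python sketch. The function name parse_delimited and the sample lines are made up for this example. The sketch splits on every comma, which is the simple reading of the row format, and it does not model how Big SQL actually loads the file.

```python
# Column names and integer columns taken from the create table statement.
COLUMNS = ["prod_name", "description", "category", "qty_on_hand", "prod_num"]
INT_COLUMNS = ("qty_on_hand", "prod_num")


def parse_delimited(text, delimiter=","):
    """Map each non-empty line to a dict keyed by the table's column names."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(delimiter)
        row = dict(zip(COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])  # int columns in the table definition
        rows.append(row)
    return rows


# Made-up sample lines; the real products file is not reproduced here.
sample = "Widget,Small widget,Hardware,25,1001\nGadget,Blue gadget,Toys,3,1002\n"
products = parse_delimited(sample)
print(products[0])
```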

End of exercise

Exercise 5. Analyzing Social Media and
Structured Data

Estimated time
2:00

What this exercise is about


The student uses BigSheets to analyze social media data and uses various
forms of BigSheets visualization to present the results.
In the second portion of this exercise, the student joins the social
media data with DBMS data to do further analysis.

What you should be able to do


At the end of this exercise, you should be able to:
- Create a Filter sheet
- Load data from a workbook into a second workbook
- Join data from two sheets
- Use the Pivot function on BigSheets data
- Utilize the visualization capabilities of BigSheets

Requirements
Requires the DW643 lab images.
Requires exercise 1 to be completed


Exercise instructions
Section 1: Load the Test Data into HDFS
The social media data used in this exercise has been loaded on your local system. This data was
created using the Boardreader app that comes with BigInsights. (Although to use this app, you must
have a license.)
Also, data was exported from a database system into CSV format.
__ 1. Log in to the BigInsights Console (user ID biadmin, password ibm2blue).
__ 2. Select the Files tab.
__ 3. Drill down to biadmin and select that directory.
__ 4. Click the Create Directory icon and create a directory called DBMS.
__ 5. Select the Watson directory and click the Upload icon.
__ 6. Browse to File System->home->biadmin->labfiles->DW64. Select blogs-data.txt and click
Open.
__ 7. Select the DBMS directory and click the Upload icon.
__ 8. Browse to File System->home->biadmin->labfiles->DW64. Select RDBMS-data.csv and
click Open. Then, click OK.

Section 2: Create a BigSheets Workbook from the DBMS Data


__ 9. Select the RDBMS-data.csv file. Then, click the Sheet radiobutton.
__ 10. Change the BigSheets reader by selecting the pencil icon.
__ 11. Select the Comma Separated Value (CSV) Data reader and uncheck the Headers
included? checkbox. Click the green checkmark.
__ 12. Click the Save As Master Workbook pushbutton. Type a Name of Media Contacts. Click
the Save pushbutton.

Section 3: Create a BigSheets Workbook from the Boardreader Data


__ 13. Click the BigSheets tab. This has the effect of taking you to the list of all workbooks.
__ 14. Click the New Workbook pushbutton.
__ 15. Type a Name of WatsonBlogs.
__ 16. Drill down to and expand the Watson directory.
__ 17. Select blogs-data.txt.
__ 18. Click the pencil icon and select the JSON Array reader. Click the green checkmark.
__ 19. Now, with the data formatted properly, scroll down (if you have to) and click the green
checkmark.
Tag your sheet. This allows you to quickly search and manage your workbooks.

__ 20. Scroll to the bottom of the workbook to view the workbook details. If you do not see the
detail information, you should click the Toggle from Normal to Fullscreen icon that is
above the workbook data in the upper right of the BigSheets page.
__ 21. To add a tag, click the green plus sign. Type in a tag value and then click the green
checkmark. The tags to add are Watson, IBM, and Blogs.
__ 22. Next, using the same basic steps, create a second workbook for the news-data.txt file. Call
the workbook WatsonNews. Add Watson, IBM, and News as tags to this new workbook.
Start off by clicking the BigSheets tab.
__ 23. Click the BigSheets tab to get a list of all workbooks.
__ 24. Click the Tags pushbutton. A cloud list of tags gets displayed. Click News and only those
workbooks with that tag are displayed.
__ 25. In the filter field, (to the left of the Tags pushbutton) type in tag:Watson. Then, press Enter.
This is another way of filtering on a tag.

Section 4: Tailoring a Workbook


__ 26. Click the WatsonNews workbook.
__ 27. Create a child workbook based on this master workbook. Click the Build new workbook
pushbutton.
__ 28. Change the workbook name by clicking the pencil icon. Change the name to
WatsonNewsRevised.
__ 29. To view more of the columns, click the Fit column(s) pushbutton.
__ 30. You are not going to need the IsAudit column. Click the triangle for the IsAudit column. Then
select Remove. (The data was not actually deleted. The mapping to the column was
removed.)
__ 31. Actually, you need to remove a number of columns. Click the triangle for any column and
select Organize Columns.
__ 32. Clicking a red X removes that column. When all desired columns have been removed, click
the green checkmark. Here is the list of columns to keep.
Country
FeedInfo
Language
Published
SubjectHtml
Tags
Type
Url
__ 33. Click the Save pushbutton and select Save & Exit. Then, click Save.
__ 34. Run your workbook.


__ 35. For the WatsonBlogs workbook, follow the same steps as above to remove unwanted
columns. Call this new workbook WatsonBlogsRevised. Make sure to run your
WatsonBlogsRevised workbook.

Section 5: Union the Two Workbooks


Because both workbooks have the same structure, you can union them. This then becomes the
basis for exploring the coverage of IBM Watson across the sources that the Boardreader provided.
__ 36. Click the BigSheets tab and select the WatsonNewsRevised workbook.
__ 37. Next, click the Build new workbook pushbutton.
__ 38. Click the Add sheets pushbutton.
__ 39. Click the Load icon and then click WatsonBlogsRevised.
__ 40. Change the sheet name to WatsonBlogsRevised and click the green checkmark.
__ 41. Now the data from both Revised workbooks is accessible in order to add the data into a
single sheet. Click the Add sheets pushbutton and select the Union icon.
__ 42. Change the Sheet Name to NewsAndBlogs.
__ 43. In the Select Sheet drop down, select WatsonNewsRevised. Click the green plus sign.
__ 44. Select the WatsonBlogsRevised. Then, the green checkmark. (If you forgot to change the
name of the sheet, you can click the triangle on the sheets tab and choose to rename it.)
__ 45. Save and exit the workbook. Change the name of the workbook to Watson News and
Blogs. Then, run the workbook.
__ 46. If you move your mouse pointer over the icon that is to the right of the pencil icon, you
should see that the icon is for the Workbook Diagram. Click it. The Watson News and Blogs
workbook was created by loading two workbooks and then doing a union. Close this
window.
__ 47. Now click the icon to the right of the Build new workbook pushbutton. This is the Workflow
Diagram. This shows the workbooks that were used to create the current workbook. Close
this window.
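
Conceptually, the union in this section appends the rows of one sheet to the rows of another sheet that has the same columns. The following Python sketch shows that idea. The function name union_sheets and the sample rows are made up, and keeping every row (rather than removing duplicates) is an assumption about the behavior.

```python
def union_sheets(first, second):
    """Append the rows of two sheets that share the same column names.

    All rows are kept; whether BigSheets removes duplicates is not
    modeled here (assumption).
    """
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("sheets must have the same columns to be unioned")
    return first + second


# Made-up rows using three of the columns kept in the revised workbooks.
news = [
    {"Country": "US", "Language": "English", "Type": "News"},
    {"Country": "GB", "Language": "English", "Type": "News"},
]
blogs = [{"Country": "FR", "Language": "French", "Type": "Blog"}]

news_and_blogs = union_sheets(news, blogs)
print(len(news_and_blogs))
```

This is why the earlier steps trim both workbooks to the same list of columns before the union.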

Section 6: Exploring the Workbook


__ 48. You should still be in the Watson News and Blogs workbook. You do not want to modify this
workbook, so click the Build new workbook pushbutton.
__ 49. Click the pencil icon and change the workbook name to WatsonSorted.
Let's take a closer look at the languages and types of posts in the data.
__ 50. Click the triangle for any column.
__ 51. Select Sort->Advanced.
__ 52. Click the Add Columns to Sort drop down box and select Language. Click the green plus
sign.
__ 53. Choose to sort the values in the Language column in Descending sequence.

__ 54. Click the Add Columns to Sort drop down box and select Type. Click the green plus sign.
Keep the default of Ascending.
__ 55. Click the green checkmark.
__ 56. Click the Fit column(s) pushbutton so that you can see both the Language and the Type
columns.

Note

The sort that was performed is only running on a subset of the data, in this case the first 2000
records. When you save and run the workbook, the sort gets applied to all of the data so you might
see some differences. For example, the subset of data has only one record where the Language is
Vietnamese. This changes when all of the data is used.
You can change the sampling size in $BIGINSIGHTS_HOME/sheets/conf/m2config.ini. The
property name is "sampleSize": 2000. You would have to restart the console after changing this
property.

__ 57. Save, exit and run your workbook.
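
The two-column sort in this section (Language descending, then Type ascending within each language) can be sketched in Python with two stable sorts. The function name sort_sheet and the sample rows are made up for this illustration. It models only the ordering, not the sampling behavior described in the note.

```python
def sort_sheet(rows):
    """Sort by Language descending, then by Type ascending within a language."""
    # Python's sort is stable, so sort by the secondary key first...
    by_type = sorted(rows, key=lambda row: row["Type"])
    # ...then by the primary key; rows with equal languages keep the Type order.
    return sorted(by_type, key=lambda row: row["Language"], reverse=True)


# Made-up rows for illustration.
rows = [
    {"Language": "English", "Type": "News"},
    {"Language": "Vietnamese", "Type": "Blog"},
    {"Language": "English", "Type": "Blog"},
    {"Language": "Spanish", "Type": "News"},
]
sorted_rows = sort_sheet(rows)
for row in sorted_rows:
    print(row["Language"], row["Type"])
```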

Section 7: Visualize your Data


__ 58. Assume that you are interested in seeing the number of posts associated with each
language. Click Add chart.
__ 59. Select the chart hyperlink and choose Pie.
__ 60. Specify the following pie chart settings:
__ a. Chart Name - Language Coverage
__ b. Title - IBM Watson Coverage by Language
__ c. Value - Language
__ d. Count - Count occurrences of X axis values
__ e. Limit - 12
__ 61. Click the green checkmark.
__ 62. Click Run and wait for all of the data to be processed.
__ 63. As you move the mouse pointer over the various segments, you see that Chinese - Simple
makes up 2.6% of the posts. But right next to it is a segment for Chinese (Spelling). This
makes you think that maybe it makes more sense to have all of the Chinese posts in a single
segment.
__ 64. Click the Edit pushbutton.
__ 65. To do the combination trick, you need to add a new column. Move the mouse cursor over
the Language column. Then, click the triangle that is displayed.
__ 66. Select Insert Right->New Column.


__ 67. Give this new column a name of Language_Revised. (Remember, there cannot be any
spaces in column names.)
After you save the column name, the cursor moves to the fx field. The idea is that you are
going to provide a function that is used to populate this new column.
__ 68. Here is the gist of the function. You want to look at the Language value for each row. If that
value begins with Chin, then you want the value in the Language_Revised column for that
row to be Chinese. Otherwise, you want the value to be what is in the Language column.
Type the following in the fx field.
IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language)
__ 69. Click the green checkmark.
__ 70. Save, exit and run the workbook.
__ 71. Click the triangle for the Language Coverage tab at the bottom to modify the chart settings.
__ 72. Select Chart Settings.
__ 73. Change the Value field to Language_Revised. Click the green checkmark.
__ 74. Click the Language Coverage tab to bring up the modified chart.
__ 75. Click the Run pushbutton.
__ 76. Now you can see that the Chinese segment is the second largest.
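
The effect of the Language_Revised column on the chart can be sketched in Python as follows. The function revise_language follows the description in step 68 (values that begin with Chin become Chinese) and is not the BigSheets SEARCH function itself. The sample list of languages is made up, so the percentages are illustrative only.

```python
from collections import Counter


def revise_language(language):
    """Collapse every value that begins with 'Chin' into 'Chinese'."""
    return "Chinese" if language.startswith("Chin") else language


# Made-up Language values; the two Chinese labels are the ones named in step 63.
languages = [
    "English",
    "Chinese - Simple",
    "Chinese (Spelling)",
    "English",
    "Russian",
]
revised = [revise_language(language) for language in languages]

# Count occurrences per value and keep at most 12 segments, as in the chart limit.
counts = Counter(revised).most_common(12)
shares = {language: 100 * count / len(revised) for language, count in counts}
print(shares)
```

With the revised column, the two Chinese labels are counted as one segment instead of two.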

Section 8: Filter Results and Extracting URL Data


In this section, you derive a new workbook, based on the WatsonSorted workbook, that contains
only English-language records with URLs that end in .uk or a Country value of "GB" (for Great
Britain). To do this, you apply a filter and a function against the entire set of data in the workbook.
__ 77. You should still be in the WatsonSorted workbook. Create a new workbook based on the
WatsonSorted workbook. Click the Build new workbook pushbutton.
__ 78. Change the workbook name to Watson Sorted English UK.
__ 79. Add a new sheet.
__ 80. Select the Filter icon.
__ 81. Change the Sheet Name to English Only.
__ 82. In the Select column drop down box, select Language.
__ 83. Select contains.
__ 84. If you click the drop down box for the value, you see only four languages and English is not
one of them. It just means that English is not in a record that is in the cache. Type in the
word English.
__ 85. You can add multiple filtering criteria by clicking the plus sign. (That is not the case here, so
click the green checkmark.)
__ 86. To build upon this new subset of data, you want to now perform an extraction to pull the Host
portion of a URL into a new column. So add a new sheet.

__ 87. Click the Function icon.


__ 88. By default no functions are displayed. Click the Categories hyperlink.
__ 89. Click the url hyperlink.
__ 90. Click the URLHOST function.
__ 91. Change the Sheet Name to URLHOST.
__ 92. Click the url* drop down box and select the Url column name.
__ 93. In this case, you want to have some other existing columns added to this new sheet. To do
so, click the Carry over (0) tab.
__ 94. Click the Add all hyperlink and then click the green checkmark.
__ 95. One additional filter is needed to look at only URLs that end in .uk along with posts from the
Country of Great Britain (code GB). Click the Add sheets pushbutton.
__ 96. Select the Filter icon.
__ 97. Set the Sheet Name to UK Sites.
__ 98. Click the Select column drop down and choose the URLHOST column.
__ 99. Select ends with.
__ 100. Type uk in the value field. Because you want multiple filters, click the green plus sign.
__ 101. In the second filter, choose the Country column, is GB.
__ 102. Set the Match radiobutton to any. This implies an or condition. Then, click the green
checkmark.
__ 103. Save, exit and run the workbook.
__ 104. To quickly visualize the results, click the Add chart tab.
__ 105. Click Cloud and then Text Cloud.
__ 106. Specify the following settings:
__ a. Chart Name - Top Sites
__ b. Title - Top 10 Sites with IBM Watson Coverage
__ c. Tags - URLHOST
__ d. Count - Count occurrences of value field
__ e. Occurrence Order - Descending
__ f. Limit - 10
__ 107. Click the green checkmark.
__ 108. Then, run the visualization. After the processing completes, you can move the mouse
pointer over the different URLs and see the totals for each. The size of the fonts is a visual
indication as to the totals for each URL.
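
The chain of sheets in this section (filter on English, extract the host, then keep rows whose host ends with uk or whose Country is GB) can be sketched in Python. Python's urllib.parse is used here as a stand-in for the URLHOST function, which is an assumption about equivalent behavior. The function names and sample rows are made up.

```python
from urllib.parse import urlparse


def url_host(url):
    """Stand-in for URLHOST: return the host portion of a URL (assumption)."""
    return urlparse(url).hostname or ""


def uk_english(rows):
    """Apply the three sheets of this section in order."""
    # English Only sheet: Language contains "English".
    english = [row for row in rows if "English" in row["Language"]]
    # URLHOST sheet: add the host column and carry over all other columns.
    with_host = [dict(row, URLHOST=url_host(row["Url"])) for row in english]
    # UK Sites sheet: match any (host ends with uk OR Country is GB).
    return [
        row
        for row in with_host
        if row["URLHOST"].endswith("uk") or row["Country"] == "GB"
    ]


# Made-up rows that use reserved example domains.
rows = [
    {"Language": "English", "Country": "", "Url": "http://www.example.co.uk/watson"},
    {"Language": "English", "Country": "GB", "Url": "http://blog.example.com/post"},
    {"Language": "English", "Country": "US", "Url": "http://www.example.org/news"},
    {"Language": "French", "Country": "FR", "Url": "http://www.example.fr/article"},
]
uk_rows = uk_english(rows)
print([row["URLHOST"] for row in uk_rows])
```

The second row is kept even though its host does not end with uk, because the Match setting of any makes the two conditions an or.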


Section 9: Combining Social Media with RDBMS Data


From the Watson News and Blogs workbook, create a new workbook that uses the URLHOST
function.
__ 109. Click the BigSheets tab to get a list of workbooks.
__ 110. Click the Watson News and Blogs workbook.
__ 111. Click the Build new workbook pushbutton. Rename the workbook to Watson URL Details.
__ 112. Add a sheet. Click the Function icon. Click the Categories hyperlink.
__ 113. Select URL. Then, select URLHOST.
__ 114. Call the sheet URL Hosts. For the parameter, select the Url column.
__ 115. Click the Carry over tab and click Add all. Click the green checkmark.

Section 10: Performing a Group or "Pivot" Operation


You might be curious about the number of posts for each URL host site and other details in addition
to just the total. This next example uses an easy approach to obtaining the total as well as individual
URL details that an analyst might need to obtain through these analytics.
__ 116. Add a new sheet and select the Group function. Call the sheet Pivot URLHOST.
__ 117. For the Group by columns select URLHOST. Click the green plus sign.
__ 118. Then, click the Calculate tab in order to add another column to your new sheet based on a
particular calculation. Type in the name of the column as Count_URLHOST. Click the
green plus sign.
__ 119. Set the column function to COUNT and set the Fill in parameters: Column to URLHOST.
__ 120. Then, add one more calculation before leaving this tab. In this case, create a new column to
contain a merge of all of the rows in your dataset that match the URLHOST on which you
are performing the Pivot (or "group by") action. Type in a new column name of Merge_URL.
__ 121. Click the green plus-sign button to add it.
__ 122. Set the column function to Merge.
__ 123. Set the Fill in parameters: Column to Url and set the Separator to a comma (,).
__ 124. Click the green checkmark.
__ 125. To make it easier to see the largest number of posts at the top, sort the Count_URLHOST
descending.
__ 126. Save, exit, and run the workbook.
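
The Group sheet in this section produces one row per URLHOST with a count and a comma-separated merge of the matching Url values. A minimal Python sketch of that logic follows. The function name pivot_by_host and the sample rows are made up, and the sketch is a model of the result rather than of how BigSheets computes it.

```python
from collections import defaultdict


def pivot_by_host(rows):
    """Group rows by URLHOST, then add Count_URLHOST and Merge_URL columns."""
    urls_by_host = defaultdict(list)
    for row in rows:
        urls_by_host[row["URLHOST"]].append(row["Url"])
    pivot = [
        {
            "URLHOST": host,
            "Count_URLHOST": len(urls),
            "Merge_URL": ",".join(urls),  # separator set to a comma in step 123
        }
        for host, urls in urls_by_host.items()
    ]
    # Step 125: largest counts first.
    pivot.sort(key=lambda row: row["Count_URLHOST"], reverse=True)
    return pivot


# Made-up rows that use reserved example domains.
rows = [
    {"URLHOST": "b.example.org", "Url": "http://b.example.org/1"},
    {"URLHOST": "a.example.com", "Url": "http://a.example.com/1"},
    {"URLHOST": "a.example.com", "Url": "http://a.example.com/2"},
]
pivot = pivot_by_host(rows)
print(pivot[0])
```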

Section 11: Joining Social with Structured Data


Last but not least, begin to work with the RDBMS data, pulled into a BigSheets workbook at the
beginning of this exercise. As you might remember, you pulled data into a workbook and named
it Media Contacts. Now, join this structured data with the Social Media data. By joining these two
workbooks, you can explore how corporate media outreach efforts correlate to coverage by
third-party websites.

__ 127. In order to start with a workbook that has all of the items in it that you need, start with the
Watson News and Blogs workbook. Open this workbook.
__ 128. Build a new workbook and call it Watson Media Analytics.
__ 129. Again, you need the URLHOST column added to your new workbook. So, add a sheet that
runs the URLHOST function and carries over all of the columns. Call the sheet URL Hosts.
__ 130. Add a sheet that loads the Media Contacts workbook into your new Watson Media
Analytics workbook. Call this sheet Media Contacts.
__ 131. To make the last column of the Media Contacts sheet clearer, rename it to Last_Contact.
Move the cursor over the header4 column and click the triangle. Choose to rename the
column.
__ 132. Change the name of the header3 column to URL.
__ 133. Join the data. Add a new sheet and select the Join icon.
__ 134. Call the sheet Join URLHOSTS and Contacts.
__ 135. Select to do an inner join.
__ 136. In the Add sheets drop down, select URL Hosts. Click the green plus sign. Then, add the
Media Contacts sheet and click the green plus sign.
__ 137. For the URL Hosts sheet, select the URLHOST column. For the Media Contacts sheet,
select the URL column.
__ 138. Click the green checkmark.
__ 139. As an additional way to make your results look more intuitive, you can reorganize the order
of the columns by using the Organize Columns option or by dragging and dropping the
column. Do that by a left-click-mouse-grab on the letter above the column name. Also, do
not forget about the Fit Columns pushbutton.
__ 140. Save, exit, and run the workbook.
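
An inner join keeps only the rows whose key value appears in both sheets. The following Python sketch models the join of the URL Hosts sheet (key URLHOST) with the Media Contacts sheet (key URL). The function name inner_join, the sample rows, and the dates are made up for illustration.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, left_key, right_key):
    """Return merged rows for every pair whose key values are equal."""
    right_index = defaultdict(list)
    for right in right_rows:
        right_index[right[right_key]].append(right)
    joined = []
    for left in left_rows:
        for right in right_index.get(left[left_key], []):
            joined.append({**left, **right})
    return joined


# Made-up rows that use reserved example domains.
url_hosts = [
    {"URLHOST": "www.example.com", "Url": "http://www.example.com/watson"},
    {"URLHOST": "blog.example.org", "Url": "http://blog.example.org/post"},
]
media_contacts = [
    {"URL": "www.example.com", "Last_Contact": "2014-01-15"},
    {"URL": "news.example.net", "Last_Contact": "2013-11-02"},
]

joined = inner_join(url_hosts, media_contacts, "URLHOST", "URL")
print(joined)
```

Only the www.example.com post survives the join, because it is the only host that also appears in the contacts data.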

Section 12: Dashboard
This section introduces you to quickly and easily creating and managing custom dashboards. A
custom dashboard allows you to gain total visibility over a set of data, a system, or analysis on a set
of data depending on the types of widgets being managed by the dashboard. You create a simple
dashboard with 3 widgets showcasing charts and data from other parts of this lab.
__ 141. Click the Dashboard tab on the BigInsights Console.
__ 142. Add a new dashboard by clicking the New Dashboard icon.
__ 143. Give it a name of MyWorkBooks.


__ 144. Click the Add Widget pushbutton.
__ 145. Add the Watson Sorted English UK and Top 10 Sites with IBM Watson Coverage widgets by
clicking Add Widget under each of those icons.
__ 146. Close the add widgets dialog.
__ 147. Click the Arrange pushbutton.
__ 148. Change the name of the tab. Click New Tab and it becomes an entry field. Change the
name to IBM Watson.
__ 149. Save your dashboard by clicking the Save icon.
__ 150. From the Watson Sorted English UK workbook widget, click the Open on BigSheets icon in
the upper right corner of the widget. Note that a new tab opens in your browser to display
the target workbook in BigSheets.
__ 151. You might try deleting a widget from the dashboard. Or you might delete the entire
dashboard.
__ 152. Log out of the BigInsights Console.
__ 153. Close Firefox.
__ 154. You can stop the BigInsights components. From a command line:
$BIGINSIGHTS_HOME/bin/stop.sh all

End of exercise
