
SEG-N-011 (2010)

Profiling Tutorial: A simple program


DJ Worth, LS Chin, C Greenough June 2010

Abstract
This is a short tutorial document for the profiling tools GNU gprof and Intel's VTune. It uses a simple array-based program to illustrate the basics of performance profiling with these two tools.

Keywords: Tutorial example, profiling, performance analysis

{david.worth, shawn.chin, christopher.greenough}@stfc.ac.uk Reports can be obtained from www.softeng.cse.clrc.ac.uk

Software Engineering Group, Computational Science & Engineering Department, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX

Science and Technology Facilities Council

Enquiries about the copyright, reproduction and requests for additional copies of this report should be addressed to: Library and Information Services, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX. Tel: +44 (0)1235 445384 Fax: +44 (0)1235 446403 Email: library@rl.ac.uk

STFC e-reports are available online at: http://epubs.cclrc.ac.uk

Neither the Council nor the Laboratory accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations

Contents
1 The Program
2 Compilation and Building
3 Using gprof
4 Using Intel VTune
4.1 Sampling Activity
4.2 Callgraph Activity
4.3 General Remarks
A Simple Array Code

The Program
The program is written in standard Fortran 95 and performs three simple array manipulations:

1. Given an array size: allocate the memory for the array, initialise the array in a loop, negate the first 10% of the elements.
2. Given an array size: allocate memory for an array one tenth that size, initialise the array in a loop.
3. Given an array size: allocate the memory for the array, initialise the array with array assignment, negate every fifth element of the array.

It was developed on a Linux system running Red Hat EL5 using the gfortran (GNU gcc v 4.1.2) compiler.

Compilation and Building

This program comprises two source files and can be compiled with:

    make

You should look at the contents of the Makefile and see the -pg flags that must be used with gprof to instrument the executable so that it produces profiling information during the run. No such flags are necessary for Intel VTune.

Using gprof
1. Compile the test application
2. Run the application
       ./test
   to produce the gmon.out file.
3. Analyse the results with gprof
       gprof test

The results are split into two parts. The flat profile shows the functions in which the time was spent, ordered by % of the time, and should look something like the following

    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     48.11      0.12     0.12        2    60.14    60.14  sub1_
     48.11      0.24     0.12        1   120.28   120.28  sub3_
      4.01      0.25     0.01        1    10.02    10.02  sub2_
      0.00      0.25     0.00        1     0.00   250.58  MAIN__
    ... <explanation cut for brevity> ...

The call graph profile shows a call graph type view for each function, allowing us to see how much time was spent in the function itself and how much in its children.

    Call graph (explanation follows)

    granularity: each sample hit covers 2 byte(s) for 3.99% of 0.25 seconds

    index % time    self  children    called     name
                    0.00    0.25       1/1           main [2]
    [1]    100.0    0.00    0.25       1         MAIN__ [1]
                    0.12    0.00       2/2           sub1_ [3]
                    0.12    0.00       1/1           sub3_ [4]
                    0.01    0.00       1/1           sub2_ [5]
    -----------------------------------------------
                                                     <spontaneous>
    [2]    100.0    0.00    0.25                 main [2]
                    0.00    0.25       1/1           MAIN__ [1]
    -----------------------------------------------
                    0.12    0.00       2/2           MAIN__ [1]
    [3]     48.0    0.12    0.00       2         sub1_ [3]
    -----------------------------------------------
                    0.12    0.00       1/1           MAIN__ [1]
    [4]     48.0    0.12    0.00       1         sub3_ [4]
    -----------------------------------------------
                    0.01    0.00       1/1           MAIN__ [1]
    [5]      4.0    0.01    0.00       1         sub2_ [5]
    -----------------------------------------------
    ... <explanation cut for brevity> ...

Detailed timings for each line are available by running

    gprof --line test

The flat view is the most useful here as it shows the time spent on each executable line ordered by % time. The results should look something like

    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
     38.55      0.10     0.10                             sub3_ (sub.f90:40 @ 8048a00)
     23.13      0.16     0.06                             sub1_ (sub.f90:11 @ 8048794)
     19.28      0.21     0.05                             sub1_ (sub.f90:10 @ 80487ae)
      7.71      0.23     0.02                             sub3_ (sub.f90:43 @ 8048a6d)
      3.86      0.24     0.01                             sub1_ (sub.f90:14 @ 8048823)
      3.86      0.25     0.01                             sub2_ (sub.f90:27 @ 8048951)
      3.86      0.26     0.01                             sub3_ (sub.f90:42 @ 8048a3e)
      0.00      0.26     0.00        2     0.00     0.00  sub1_ (sub.f90:3 @ 8048700)
      0.00      0.26     0.00        1     0.00     0.00  MAIN__ (test.f90:4 @ 8048684)
      0.00      0.26     0.00        1     0.00     0.00  sub2_ (sub.f90:20 @ 804885a)
      0.00      0.26     0.00        1     0.00     0.00  sub3_ (sub.f90:33 @ 8048983)
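The function-level flat profile can also be post-processed by a script, which is handy when comparing several runs. The following is a minimal sketch in Python, not part of the original tutorial: the helper name parse_flat_profile is hypothetical, and it assumes the column layout of the function-level flat profile shown earlier (seven whitespace-separated fields per function row). It recomputes the self time per call from the rounded seconds column, so the result differs slightly from the ms/call figure that gprof prints.

```python
# Flat profile text copied from the gprof output shown in this tutorial.
FLAT_PROFILE = """\
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 48.11      0.12     0.12        2    60.14    60.14  sub1_
 48.11      0.24     0.12        1   120.28   120.28  sub3_
  4.01      0.25     0.01        1    10.02    10.02  sub2_
  0.00      0.25     0.00        1     0.00   250.58  MAIN__
"""


def parse_flat_profile(text):
    """Return one dict per function row of a gprof flat profile.

    Hypothetical helper: assumes the seven-column layout shown above.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # not a complete function row
        try:
            row = {
                "percent": float(fields[0]),
                "cumulative_s": float(fields[1]),
                "self_s": float(fields[2]),
                "calls": int(fields[3]),
                "self_ms_per_call": float(fields[4]),
                "total_ms_per_call": float(fields[5]),
                "name": fields[6],
            }
        except ValueError:
            continue  # column header line, not data
        rows.append(row)
    return rows


rows = parse_flat_profile(FLAT_PROFILE)
for row in sorted(rows, key=lambda r: r["self_s"], reverse=True):
    # Self time per call, recomputed from the rounded "self seconds" column.
    per_call_ms = 1000.0 * row["self_s"] / row["calls"]
    print(f'{row["name"]:8s} {row["self_s"]:.2f} s self, '
          f'{row["calls"]} call(s), about {per_call_ms:.0f} ms per call')
```

Line-level output from gprof --line has empty calls columns for most rows, so this sketch deliberately skips those rows rather than guessing at their layout.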

Using Intel VTune

This tutorial contains three sections:

1. Create and run a sampling activity and see results that show where the code is spending most of its time. Outputs are:
   - VTune sampling activity that runs the code
   - Data on clocktick samples for one run of the code
   - Annotated source with clocktick samples shown for each executable line of code
2. Create and run a callgraph activity and see results that show the call sequence of the code, the call information and the most time consuming call sequence (the critical path). Outputs are:
   - VTune callgraph activity that runs the code
   - Data on calls to and from functions and the critical path
   - GUI to examine data
3. General remarks
   - Specifying applications and modules of interest
   - Applications with arguments
   - GUI for sampling results

4.1 Sampling Activity

Compile your application so that it includes debug information, usually with the -g option (use the Makefile supplied). This will allow display of source code with execution time against each line.

Create and run a sampling activity

    vtl activity -c sampling -app /home/wksh1/profiling/vtune/test \
        -moi /home/wksh1/profiling/vtune/test run

Output looks like

    VTune(TM) Performance Analyzer 9.0 for Linux*
    Copyright (C) 2000-2007 Intel Corporation. All rights reserved.
    The Activity has been successfully created.
    The Activity is running.
    Mon Jun 28 14:01:24 2010 softeng (Run 0) Setting Sampling CPU mask to 0-1
    Mon Jun 28 14:01:24 2010 softeng (Run 0) The processor PMU configuration file: psc.xml
    Mon Jun 28 14:01:24 2010 softeng (Run 0) Collection for the following event(s) ...
    Mon Jun 28 14:01:24 2010 softeng (Run 0) Clockticks, Instructions Retired.
    Mon Jun 28 14:01:26 2010 softeng (Run 0) Sampling data was successfully collected.
    The Activity has finished running.

The important arguments to this command are -app <name of application to run> and -moi <name of module in that application we are interested in>. Usually the module is the same as the application, but for scripts that run many applications we can use -moi to focus on one in particular.

Take a look at the activities we have created

    [wksh1@softeng vtune]$ vtl show
    One or more components did not load correctly. You may need to reinstall the product.
    VTune(TM) Performance Analyzer 9.0 for Linux*
    Copyright (C) 2000-2007 Intel Corporation. All rights reserved.
    a1__Activity1
        r1___Mon Jun 28 14:00:09 2010 - Sampling Results [softeng]

Show the clocktick results to see where most time was spent (similar to the gprof flat view).

    [wksh1@softeng vtune]$ vtl view -ar a1 -hf -mn test -en Clockticks
    One or more components did not load correctly. You may need to reinstall the product.
    VTune(TM) Performance Analyzer 9.0 for Linux*
    Copyright (C) 2000-2007 Intel Corporation. All rights reserved.

    Name   ModuleName  Clockticks samples  Segment     Offset
    sub3_  test        120                 0xffffffff  0x3c5
    sub1_  test        102                 0xffffffff  0x14c
    sub2_  test        10                  0xffffffff  0x2a1

    RVA    Size   Class  DisplayName
    0x845  0x140         sub3_
    0x5cc  0x155         sub1_
    0x721  0x124         sub2_

    File Name
    /home/wksh1/profiling/vtune/sub.f90
    /home/wksh1/profiling/vtune/sub.f90
    /home/wksh1/profiling/vtune/sub.f90

    Resolved Module Path
    /home/wksh1/profiling/vtune/test
    /home/wksh1/profiling/vtune/test
    /home/wksh1/profiling/vtune/test

The arguments here are:

    -ar - the activity we want results for.
    -hf - we want to see the sampling hotspots grouped by function.
    -mn - the module we are interested in.
    -en - the data we want displayed. In this case Clockticks is a measure of the time taken in each routine.

So let's look in detail at the functions

    vtl view -ar a1 -code -mn test -fn sub3 > annotated_src.f90

Part of the annotated source file is shown below. The Clockticks column (Ev1) is what we are interested in, and it shows the initialisation do loop at line 10 taking up a lot of time. Notice too that the array statement initialisation (line 40), which may look innocuous in the source code, shows up high in the profiling results.

    VTune(TM) Performance Analyzer 9.0 for Linux*
    Copyright (C) 2000-2007 Intel Corporation. All rights reserved.
    Mon Jun 28 15:05:03 2010
    WARNING: [Source View] Warning - Can't find ...

    Legend
    Ev1 = Clockticks samples
    Ev2 = Instructions Retired samples

    Address  Line Number  Ev1  Ev2  Source
             1            0    0    ! Subroutines called by test program
             2            0    0
    0x5cc    3            0    0    subroutine sub1(n)
             4            0    0
             5            0    0    integer :: n, i
             6            0    0    real, allocatable :: a(:)
             7            0    0
    0x5db    8            0    0    allocate(a(n))
             9            0    0
    0x644    10           62   60   do i = 1,n
    0x65b    11           31   7      a(i) = real(i)
             12           0    0    end do
             13           0    0
    0x68b    14           2    0    do i = 1,n/10
    0x6ba    15           7    10     a(i) = -a(i)
             16           0    0    end do
             17           0    0
             18           0    0    end ! subroutine sub1
    ....
    0x845    33           0    0    subroutine sub3(n)
             34           0    0
             35           0    0    integer :: n
             36           0    0    real, allocatable :: a(:)
             37           0    0
    0x854    38           0    0    allocate(a(n))
             39           0    0
    0x8bd    40           89   44   a = 1.0
             41           0    0
    0x8fb    42           0    5    do i = 1,n,5
    0x92a    43           31   11     a(i) = -a(i)
             44           0    0    end do
             45           0    0
             46           0    0    end ! subroutine sub3
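As a cross-check, the per-line Clockticks samples in the annotated source should add up to the per-function totals reported by the hotspot view. The short Python sketch below is an illustration added here, not part of the VTune tooling; the sample counts are copied from the output above and the helper names are hypothetical. It also converts the function totals into percentage shares, which come out at roughly 52% for sub3_, 44% for sub1_ and 4% for sub2_. These are close to, but not the same as, the 48%, 48% and 4% reported by gprof, which is to be expected since the two tools sample separate runs in different ways.

```python
# Clockticks samples (Ev1) per source line, copied from the annotated
# listing of sub.f90; lines with zero samples are omitted.
LINE_SAMPLES = {
    "sub1_": {10: 62, 11: 31, 14: 2, 15: 7},
    "sub3_": {40: 89, 43: 31},
}

# Per-function Clockticks samples reported by the -hf hotspot view.
FUNCTION_SAMPLES = {"sub3_": 120, "sub1_": 102, "sub2_": 10}


def line_totals(line_samples):
    """Sum the per-line samples for each function."""
    return {name: sum(lines.values()) for name, lines in line_samples.items()}


def shares(samples):
    """Convert sample counts into percentages of the overall total."""
    total = sum(samples.values())
    return {name: 100.0 * count / total for name, count in samples.items()}


totals = line_totals(LINE_SAMPLES)
for name, total in totals.items():
    status = "matches" if total == FUNCTION_SAMPLES[name] else "differs from"
    print(f"{name}: {total} samples by line, {status} the hotspot view")

for name, percent in shares(FUNCTION_SAMPLES).items():
    print(f"{name}: {percent:.1f}% of Clockticks samples")
```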

Collecting Other Events

The default events that VTune records are Clockticks and Instructions Retired. To see the other events available on the processor run

    vtl query -c sampling > events.lst

Ignoring the help text at the top, the events are listed lower in the file. The first few events are:

    The supported CPU Events for this platform are as below:
    128-bit MMX(TM) Instructions Retired
    1st Level Cache Load Misses Retired
    2nd Level Cache Load Misses Retired
    2nd Level Cache Read Misses
    2nd-Level Cache Read References
    2nd-Level Cache Reads Hit Exclusive
    2nd-Level Cache Reads Hit Modified
    2nd-Level Cache Reads Hit Shared
    3rd-Level Cache Read Misses
    3rd-Level Cache Read References

These event names can be used when setting up the sampling activity with the -o "-ec ..." options as follows.

    vtl activity -d 20 -c sampling -o "-ec en=2nd Level Cache Read Misses:sa=1, \
        en=Mispredicted conditionals:sa=1" -app /home/djw/tools_test/vtune_vtl/test/test \
        -moi /home/djw/tools_test/vtune_vtl/test/test run

Here the event names are given in full and :sa= sets the sample-after value for each event; in this case it is 1, so we sample each event. Note the addition of -d 20. This is because the chosen events require a non-zero duration to be set.

For more details on collecting multiple events with VTune see the following (infeasibly long URL) http://software.intel.com/en-us/articles/performance-tools-for-softwaredevelopers-collecting-multiple-events-using-the-vtune-analyzer-command-line/.
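When several events are collected, the -ec string becomes long and easy to mistype. As an illustration only, the Python sketch below assembles it from a list of (event name, sample-after) pairs, following exactly the en=...:sa=... syntax of the command above. The helper name event_config is hypothetical, and the sketch only builds and prints the string; it does not run vtl.

```python
def event_config(events):
    """Build the '-ec ...' option string from (event name, sample-after) pairs.

    Hypothetical helper that mirrors the syntax used in this tutorial:
    entries of the form en=<event name>:sa=<sample-after>, separated by commas.
    """
    parts = [f"en={name}:sa={sample_after}" for name, sample_after in events]
    return "-ec " + ", ".join(parts)


# The two events used in the example command above.
option = event_config([
    ("2nd Level Cache Read Misses", 1),
    ("Mispredicted conditionals", 1),
])
print(option)
```

The resulting string is what goes inside the double quotes after -o.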

4.2 Callgraph Activity

Create and run a callgraph activity

    vtl activity -c callgraph -app /home/wksh1/profiling/vtune/test \
        -moi /home/wksh1/profiling/vtune/test run

Use vtl show again to see the activities, including the new callgraph results. To see the call information (parent functions with children, threads and timing info) run

    vtl view -ar a2 -calls > calls.csv

There is a great deal of detail here and the data we need is obscured; the gprof call graph data is much better. For those running a local X server (with the X connection set up along with ssh if running VTune remotely) there is a GUI option which makes the data much clearer.

    vtl view -ar a2 -gui

4.3 General Remarks

The application should be given with its full path after the -app flag. If the application runs other executables (e.g. it's a shell script) you can analyse one in particular by using -moi <exe of interest>. For a single executable give its name to the -moi flag (it is safer to make this the full path too).

If your application requires arguments then put them in double quotes, separated from the application name with a comma. For example, to run an activity on ls -a -l the application is specified as

    -app ls,"-a -l"

You can also get a GUI for the sampling activity by running

    vtl view -ar a1 -gui

and using double clicks to drill down from process to module (program) to function and ultimately to source code.
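The quoting rule for applications with arguments can be captured in a few lines when activities are generated from a script. The Python sketch below is illustrative only: app_spec is a hypothetical helper name, and it simply reproduces the form described above (application name, a comma, then the arguments in double quotes).

```python
def app_spec(app, args=()):
    """Format the value given to -app for an application and its arguments.

    Hypothetical helper: with no arguments the application path is returned
    unchanged; otherwise the arguments are joined with spaces, wrapped in
    double quotes and appended after a comma.
    """
    if not args:
        return app
    joined = " ".join(args)
    return f'{app},"{joined}"'


print("-app", app_spec("ls", ["-a", "-l"]))
print("-app", app_spec("/home/wksh1/profiling/vtune/test"))
```

When the command is typed at a shell prompt, the double quotes may need protecting from the shell so that they reach vtl intact.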

Simple Array Code

! test.f90

    ! Test program for VTune. It calls 3 procedures (defined in sub.f90)
    ! to show how hotspots can be found.

    program test

    integer :: n

    ! Loop control parameter. Must be multiple of 10
    n = 10000000

    ! Call the first procedure that executes a long loop
    call sub1(n)

    ! Call the second procedure that executes a shorter loop
    call sub2(n)

    ! Call the first procedure again with smaller argument
    call sub1(n/10)

    ! Call the third procedure that does some array assignment
    call sub3(n)

    end ! program test

! sub.f90

    ! Subroutines called by test program

    subroutine sub1(n)

    integer :: n, i
    real, allocatable :: a(:)

    allocate(a(n))

    do i = 1,n
      a(i) = real(i)
    end do

    do i = 1,n/10
      a(i) = -a(i)
    end do

    end ! subroutine sub1

    subroutine sub2(n)

    integer :: n, i
    real, allocatable :: a(:)

    allocate(a(n/10))

    do i = 1,n/10
      a(i) = real(i)
    end do

    end ! subroutine sub2

    subroutine sub3(n)

    integer :: n
    real, allocatable :: a(:)

    allocate(a(n))

    a = 1.0

    do i = 1,n,5
      a(i) = -a(i)
    end do

    end ! subroutine sub3
