
Application characterization

To characterize an application, you first need to be able to execute it. If you do not have a recipe yet, you could write a sample one just to run the tests. The best practice, however, is to run the application in Unmanaged Mode.

If the previous tutorial went well, you should have built the application without errors. Try executing it in Unmanaged Mode.

[BOSPShell ~] \> export BBQUE_RTLIB_OPTS="U"
[BOSPShell ~] \> bbqtutorials
*****                   - INFO   BbqTutorials    : .:: BbqTutorials (ver. HEAD-HASH-NOTFOUND) ::.
*****                   - INFO   BbqTutorials    : Built: Jun 24 2014 11:06:50
*****                   - INFO   BbqTutorials    : STEP 0. Initializing RTLib, application [bbqtutorials]...
11:45:51,334 - WARN   rpc             : Enabling UNMANAGED mode, selected AWM [0]
11:45:51,334 - WARN   rpc             : Running in UNMANAGED MODE
*****                   - INFO   BbqTutorials    : STEP 1. Registering EXC using [BbqTutorials] recipe...
*****                   - INFO   BbqTutorials    : STEP 2. Starting EXC control thread...
*****                   - INFO   BbqTutorials    : STEP 3. Waiting for EXC completion...
11:45:51,371 - NOTICE exc             : [onMonitor]: Low QoS (10) on cycle 1
11:45:51,397 - NOTICE exc             : [onMonitor]: Low QoS (10) on cycle 2
 
...
 
11:45:53,913 - NOTICE exc             : [onMonitor]: Low QoS (10) on cycle 99
11:45:53,938 - NOTICE exc             : [onMonitor]: Low QoS (10) on cycle 100
*****                   - INFO   BbqTutorials    : STEP 4. Disabling EXC...
11:45:53,939 - NOTICE rpc             : Execution statistics:
 
 
Cumulative execution stats for 'BbqTutorials':
  TotCycles    :      99
  StartLatency :       0 [ms]
  AwmWait      :       0 [ms]
  Configure    :       0 [ms]
  Process      :    2526 [ms]
 
# EXC    AWM   Uses Cycles   Total |      Min      Max |      Avg      Var
#==================================+===================+==================
BbqTutorials 000      1     99    2526 |   24.562   44.535 |   26.032    7.784
#-------------------------+        +-------------------+------------------
BbqTutorials 000         onRun    2526 |   24.533   44.376 |   25.958    7.782
BbqTutorials 000     onMonitor       0 |    0.029    0.160 |    0.074    0.001
#-------------------------+--------+-------------------+------------------
BbqTutorials 000   onConfigure       0 |    0.401    0.401 |    0.401    0.000
*****                   - INFO   BbqTutorials    : ===== BbqTutorials DONE! =====
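The cumulative statistics printed at the end of the run are easy to post-process. The following is a minimal sketch, assuming the log keeps the "Name : value [unit]" layout shown above; stat_value is a hypothetical helper name, not part of BOSP.

```shell
# Sketch: read one field of the "Cumulative execution stats" block from stdin.
# stat_value is a hypothetical helper, not a BOSP command.
stat_value() {
    # $1: field name as printed in the log, e.g. Process or TotCycles
    awk -v key="$1" '$1 == key && $2 == ":" { print $3; exit }'
}

# Sample lines copied from the run above
stats='  TotCycles    :      99
  StartLatency :       0 [ms]
  Process      :    2526 [ms]'

printf '%s\n' "$stats" | stat_value Process     # total processing time, in ms
printf '%s\n' "$stats" | stat_value TotCycles   # number of cycles
```

On a saved log the same helper could be fed with a redirection, for example `stat_value Process < test_tmp`.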

The profiling parameters are:

  • Number of threads
  • QoS parameter A (ranges from 1 to 5)
  • QoS parameter B (ranges from 1 to 5)

The objectives will be, in order:

  1. Minimization of execution time (60% of the point value)
  2. Maximization of the quality of service, whose value must also be greater than 3 (20% of the point value)
  3. Minimization of resource usage (20% of the point value)

Now, check your BPL file (did you read that tutorial?). This example exploits a six-core MDEV, so the number of threads will range from 1 to, let's say, 8.

--------+---------------+---------------+---------------------------------------
	| CPUs IDs	| Memory Nodes	| Description
--------+---------------+---------------+---------------------------------------
HOST	| 0,4		| 0		| BBQUE Generic 1Core Hyperthreaded Host
MDEV	| 1-3,5-7	| 0		| BBQUE Generic 3Core Hyperthreaded MDEV
--------+---------------+---------------+---------------------------------------
#
# Resources clusterization for MANAGED resources (single node)
--------+---------------+---------------+---------------+-----------------------
	| CPUs IDs	| Time Quota	| Memory Nodes	| Memory (MB)
--------+---------------+---------------+---------------+-----------------------
NODE	| 1-3,5-7	| 100		| 0		| 6000
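The "six-core" figure can be read directly off the CPU ID list in the layout above. As a rough sketch, assuming the list only contains single IDs and inclusive ranges separated by commas, a hypothetical helper (count_cpus is not a BOSP command) can do the counting:

```shell
# Sketch: count the CPUs named by a BPL-style ID list such as "1-3,5-7".
# count_cpus is a hypothetical helper, not a BOSP command.
count_cpus() {
    # $1: comma-separated list of CPU IDs and inclusive ranges
    printf '%s\n' "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { total += 1 }               # single ID, e.g. "4"
        NF == 2 { total += $2 - $1 + 1 }     # inclusive range, e.g. "1-3"
        END     { print total + 0 }
    '
}

count_cpus "1-3,5-7"   # MDEV and NODE rows: prints 6
count_cpus "0,4"       # HOST row: prints 2
```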

Let's write a simple script to automate the tests. It should look something like this:

############### File: bbqtutorial.sh ###############
 
#!/bin/bash
 
# Edit this line
PATH_TO_BOSP="/home/slibutti/opt/BOSP"
 
# Sourcing the shell
. $PATH_TO_BOSP/out/etc/bbque/bosp_init.env
 
# Just for some logging
NUMBER_OF_TESTS=0
 
# Results are stored here
printf "%-10s %7s  %7s  %7s  %3s  %8s  %6s\n" APPNAME \
THREADS PARAM_A PARAM_B QOS EXECTIME USAGE > test_results
 
# Running on Unmanaged Mode, with performance counters sampling,
# on the MDEV cpus with 100% CPU quota and 50MB max memory usage
export BBQUE_RTLIB_OPTS="U:p:C 1-3,5-7 100000 -1 0 52428800"
 
for THREADS in 1 2 3 4 5 6 7 8; do
    for PARAM_A in 1 2 3 4 5; do
        for PARAM_B in 1 2 3 4 5; do
 
            # Exploiting the constraint to reduce tests number
            QOS_CONSTRAINT=$(echo $PARAM_A"+"$PARAM_B"<4" | bc)
            # If the constraint is violated, skip the test
            [ "$QOS_CONSTRAINT" == "1" ] && continue
 
            # Logging
            echo "Performing test $THREADS $PARAM_A $PARAM_B"
            let NUMBER_OF_TESTS++
 
            # Running the test
            bbqtutorials -j 200 -t $THREADS -a $PARAM_A -b $PARAM_B &> test_tmp
 
            # Extracting information
            USAGE=$( grep -o "[0-9\ \.]\{1,\} CPUs utilized" test_tmp | awk -F' ' '{print $1}' )
            QOS=$( echo $PARAM_A"+"$PARAM_B | bc )
            EXECTIME=$( grep -o "Process[0-9\ \.\:]\{1,\}" test_tmp | grep -o "[0-9\.]\{1,\}" )
 
            # Storing the results
            printf "%-10s %7s  %7s  %7s  %3s  %8s  %6s\n" "Bbqtut" \
                $THREADS $PARAM_A $PARAM_B $QOS $EXECTIME $USAGE >> test_results
 
            rm test_tmp
        done
    done
done

Before running the script, note that:

  1. You need to start the bbque daemon, so that all non-integrated applications are moved into the HOST device.
  2. Update the PATH_TO_BOSP variable to match your installation path.
  3. The script uses the BBQUE_RTLIB_OPTS flag C. The specified cores are those of your single NODE: adapt the script to match the right IDs.
  4. Because the script uses the BBQUE_RTLIB_OPTS flag C, you need to be root.

$ sudo su
$ chmod a+x bbqtutorial.sh
$ . /PATH/TO/BOSP/out/etc/bbque/bosp_init.env
[BOSPShell ~] \> bbque-startd
[BOSPShell ~] \> ./bbqtutorial.sh
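Since each test takes a few seconds, it can be useful to know in advance how many design points the nested loops generate. The sketch below repeats the loop structure and the A + B < 4 filter of bbqtutorial.sh without launching anything; count_design_points is a hypothetical name used only for illustration.

```shell
# Sketch: count the design points the test script would run, as a dry run.
# count_design_points is a hypothetical helper; it launches no application.
count_design_points() (
    # $1: maximum number of threads, $2: maximum value of PARAM_A and PARAM_B
    n=0
    t=1
    while [ "$t" -le "$1" ]; do
        a=1
        while [ "$a" -le "$2" ]; do
            b=1
            while [ "$b" -le "$2" ]; do
                # Same filter as the script: points with A + B < 4 are skipped
                [ $((a + b)) -ge 4 ] && n=$((n + 1))
                b=$((b + 1))
            done
            a=$((a + 1))
        done
        t=$((t + 1))
    done
    echo "$n"
)

count_design_points 8 5   # 8 thread counts x 22 valid (A, B) pairs: prints 176
```

Only three of the 25 (A, B) pairs are filtered out, namely (1,1), (1,2) and (2,1), so the constraint alone does not shrink the design space much.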

The results are now saved in the file test_results, ready to be analysed. 176 tests were performed (you already guessed that Design Space Exploration is not a very simple problem…). Note that, on this machine, the application is not able to exploit more than 5 CPUs.

APPNAME    THREADS  PARAM_A  PARAM_B  QOS  EXECTIME   USAGE
Bbqtut           1        1        3    4      2591   0.991
Bbqtut           1        1        4    5      3424   0.994
Bbqtut           1        1        5    6      4202   0.994
Bbqtut           1        2        2    4      2127   0.985
Bbqtut           1        2        3    5      2799   0.991
Bbqtut           1        2        4    6      3591   0.990
Bbqtut           1        2        5    7      4402   0.992
Bbqtut           1        3        1    4      1502   0.978
Bbqtut           1        3        2    5      2177   0.986
Bbqtut           1        3        3    6      2956   0.992
Bbqtut           1        3        4    7      3756   0.995
Bbqtut           1        3        5    8      4585   0.995
Bbqtut           1        4        1    5      1620   0.980
Bbqtut           1        4        2    6      2305   0.985
Bbqtut           1        4        3    7      3089   0.992
Bbqtut           1        4        4    8      3884   0.994
Bbqtut           1        4        5    9      4694   0.996
Bbqtut           1        5        1    6      1722   0.984
Bbqtut           1        5        2    7      2416   0.990
Bbqtut           1        5        3    8      3215   0.993
Bbqtut           1        5        4    9      4024   0.995
Bbqtut           1        5        5   10      4857   0.996
Bbqtut           2        1        3    4      1370   1.950
...
...
Bbqtut           8        3        5    8      1544   4.817
Bbqtut           8        4        1    5       537   4.238
Bbqtut           8        4        2    6       839   4.322
Bbqtut           8        4        3    7      1111   4.478
Bbqtut           8        4        4    8      1354   4.701
Bbqtut           8        4        5    9      1641   4.768
Bbqtut           8        5        1    6       628   4.117
Bbqtut           8        5        2    7       889   4.366
Bbqtut           8        5        3    8      1157   4.594
Bbqtut           8        5        4    9      1395   4.708
Bbqtut           8        5        5   10      1631   4.860
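The remark that the application never exploits more than 5 CPUs can be checked by looking for the highest USAGE value in the file. This is a small sketch, assuming the column layout of test_results shown above (USAGE in column 7); max_usage_row is a hypothetical helper name, and the demo only uses a few rows copied from the table.

```shell
# Sketch: print the row of a test_results-style file with the highest USAGE.
# max_usage_row is a hypothetical helper name.
max_usage_row() {
    # $1: path to a file laid out like test_results (USAGE is column 7)
    awk '$1 != "APPNAME" && ($7 + 0 > max) { max = $7 + 0; row = $0 }
         END { print row }' "$1"
}

# Demo on a few rows copied from the table above
sample=$(mktemp)
cat > "$sample" <<'EOF'
APPNAME    THREADS  PARAM_A  PARAM_B  QOS  EXECTIME   USAGE
Bbqtut           1        5        5   10      4857   0.996
Bbqtut           8        3        5    8      1544   4.817
Bbqtut           8        5        5   10      1631   4.860
EOF
max_usage_row "$sample"
rm -f "$sample"
```

On the complete results the call would simply be `max_usage_row test_results`.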

The analysis can be done with another simple script, which normalizes the values, computes a weighted value for each point, and sorts the points using that value as the key.

#!/bin/bash
 
# Reference values (line 1 of test_results is the header, so start from line 2)
MIN_EXECTIME=$(awk 'NR == 2 {min = $6} NR > 2 && $6 < min {min = $6} END{print min}' test_results)
MIN_USAGE=$(awk 'NR == 2 {min = $7} NR > 2 && $7 < min {min = $7} END{print min}' test_results)
MAX_QOS=10
 
echo "#RESOURCES THREADS PARAM_A PARAM_B VALUE" > ordered_list
 
# for each point
while read POINT; do
 
	echo "Working on next point.."
 
	# Skipping first point. It's the header
	[[ $POINT == *APPNAME* ]] && continue
 
	# Extract parameters and results
	EXECTIME=$(echo $POINT | awk '{print $6}')
	USAGE=$(echo $POINT | awk '{print $7}')
	QOS=$(echo $POINT | awk '{print $5}')
	PARAM_B=$(echo $POINT | awk '{print $4}')
	PARAM_A=$(echo $POINT | awk '{print $3}')
	THREADS=$(echo $POINT | awk '{print $2}')
 
	# Normalize
	NORMALIZED_EXECTIME=$(echo "scale=2;"$MIN_EXECTIME"/"$EXECTIME | bc)
	NORMALIZED_USAGE=$(echo "scale=2;"$MIN_USAGE"/"$USAGE | bc)
	NORMALIZED_QOS=$(echo "scale=2;"$QOS"/"$MAX_QOS | bc)
 
	# Compute value, for example giving more importance to exectime
	DESIGN_POINT_VALUE=$(echo "scale=2;0.6*"$NORMALIZED_EXECTIME"+0.2*"$NORMALIZED_QOS"+0.2*"$NORMALIZED_USAGE | bc)
 
	# Storing results
	echo "$USAGE $THREADS $PARAM_A $PARAM_B $DESIGN_POINT_VALUE" >> unordered_list
 
done < test_results
 
# Ordering results
sort -k 5 unordered_list >> ordered_list
rm unordered_list
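As an alternative sketch, the same weighted value can be computed in a single awk call that reads the file twice, instead of spawning bc for every point. This is not the script used to produce the results in this tutorial: it works in floating point rather than with bc's scale=2 truncation, so values may differ slightly in the last digits. score_points is a hypothetical name, and the demo rows are copied from the results table, so the reference minima are those of the demo subset only.

```shell
# Sketch: score every design point of a test_results-style file with awk.
# score_points is a hypothetical helper; weights follow the objectives above.
score_points() {
    # $1: path to a file laid out like test_results
    awk '
        $1 == "APPNAME" { next }                 # skip the header in both passes
        NR == FNR {                              # first pass: reference minima
            if (!seen || $6 < min_time)  min_time  = $6 + 0
            if (!seen || $7 < min_usage) min_usage = $7 + 0
            seen = 1
            next
        }
        {                                        # second pass: weighted value
            value = 0.6 * (min_time / $6) + 0.2 * ($5 / 10) + 0.2 * (min_usage / $7)
            printf "%s %s %s %s %.3f\n", $7, $2, $3, $4, value
        }
    ' "$1" "$1" | LC_ALL=C sort -k5,5n           # worst points first
}

# Demo on four rows copied from the results table
sample=$(mktemp)
cat > "$sample" <<'EOF'
APPNAME    THREADS  PARAM_A  PARAM_B  QOS  EXECTIME   USAGE
Bbqtut           1        3        1    4      1502   0.978
Bbqtut           1        5        1    6      1722   0.984
Bbqtut           8        4        1    5       537   4.238
Bbqtut           8        5        5   10      1631   4.860
EOF
score_points "$sample"
rm -f "$sample"
```

The output uses the same column order as ordered_list (usage, threads, A, B, value), so the two files can be compared side by side.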

Here are the results, from the worst points to the best. A couple of considerations:

  1. Parameters A and B cannot take low values, because the quality of service has to be maximized. However, parameter B affects the execution time more than parameter A. We know this because the example was coded to behave that way (int iterations = 100000*quality_a + 500000*quality_b). As a result, the best points have low parameter B values and high parameter A values, confirming our prediction.
  2. Brute force is often enjoyable, but we can guess that points with more than three threads will be less efficient than a three-thread configuration. The reason is that this MDEV does not feature six physical CPUs: it features three hyper-threaded CPUs.

# File: ordered_list
# First points are the worst
#RESOURCES THREADS PARAM_A PARAM_B VALUE
...
0.980 1 4 1 .48
3.466 5 3 2 .48
0.984 1 5 1 .49
3.873 5 4 2 .49
2.332 4 3 1 .50
2.633 4 5 1 .50
3.792 5 2 2 .50
2.510 4 4 1 .51
3.905 6 3 2 .51
4.116 5 5 2 .51
4.258 7 4 2 .51
4.283 7 5 2 .51
3.959 7 2 2 .52
4.136 7 3 2 .52
4.366 8 5 2 .52
2.861 3 5 2 .53
4.322 8 4 2 .53
4.430 6 4 2 .53
2.891 3 4 2 .54
4.221 8 2 2 .54
4.263 8 3 2 .54
4.272 6 2 2 .54
4.731 6 5 2 .54
2.898 3 3 2 .55
1.941 2 4 1 .56
2.929 3 2 2 .56
3.063 5 4 1 .56
1.926 2 3 1 .57
1.936 2 5 1 .57
2.643 3 5 1 .61
4.121 5 5 1 .62
3.754 7 4 1 .63
3.810 6 4 1 .63
3.390 5 3 1 .64
4.073 7 5 1 .64
4.094 6 5 1 .64
3.693 7 3 1 .65
4.117 8 5 1 .65
3.916 6 3 1 .66
2.871 3 4 1 .71
3.928 8 3 1 .72
4.238 8 4 1 .72
2.718 3 3 1 .73
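The workload formula quoted in the first consideration explains why low B values dominate the top of the list. The sketch below only evaluates that formula for two points with the same QoS; iterations is a helper written for illustration, while the formula itself is the one quoted from the application code.

```shell
# Sketch: workload of a design point, from the formula quoted above:
#   iterations = 100000*quality_a + 500000*quality_b
iterations() {
    # $1: PARAM_A, $2: PARAM_B
    echo $(( 100000 * $1 + 500000 * $2 ))
}

iterations 3 1   # A=3, B=1 (QoS 4): prints 800000
iterations 1 3   # A=1, B=3 (QoS 4): prints 1600000
```

Both points have the same QoS, but the second does twice the work. This is consistent with the single-thread measurements in test_results: 1502 ms for A=3, B=1 against 2591 ms for A=1, B=3.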

The characterization is done. Obviously, using a Design Exploration Tool would yield results in an easier and more efficient way. In this case, we will choose the points manually. In this example, we will choose eight AWMs:

  1. one configuration exploiting a single CPU (1 thread, from the results)
  2. one configuration exploiting two CPUs (2 threads)
  3. a couple of configurations exploiting three CPUs (3 or 4 threads)
  4. a couple of configurations exploiting four CPUs (5 threads)
  5. a couple of configurations exploiting five CPUs (5 or 6 or 7 or 8 threads)

The reason is quite simple: from the ordered list of design points, we can easily see that there are very few optimal points exploiting one or two threads. Conversely, there are many points exploiting three, four, or five CPUs.

Reading the ordered_list file from the best to the worst point and skipping points too similar to already chosen ones, in this case we select:

  1. One CPU: <t=1, A=5, B=1, CPUQuota=~100%>
  2. Two CPUs: <t=2, A=5, B=1, CPUQuota=~194%>
  3. Three CPUs: <t=3, A=3, B=1, CPUQuota=~272%>, <t=3, A=4, B=1, CPUQuota=~288%>
  4. Four CPUs: <t=8, A=3, B=1, CPUQuota=~393%>, <t=7, A=3, B=1, CPUQuota=~370%>
  5. Five CPUs: <t=8, A=4, B=1, CPUQuota=~424%>, <t=8, A=5, B=1, CPUQuota=~412%>
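The CPUQuota figures above are approximations of the USAGE column. One way to derive them, sketched here under the assumption that the quota is the measured usage expressed as a percentage and rounded up (the rounding rule is a guess, not something stated in this tutorial), is the following hypothetical helper:

```shell
# Sketch: turn a USAGE value (in CPUs) into a CPU quota percentage.
# cpu_quota_pct is a hypothetical helper; rounding up is an assumption.
cpu_quota_pct() (
    # $1: USAGE value as stored in test_results, e.g. 2.718
    # Convert to an integer number of milli-CPUs first, to avoid
    # floating-point surprises when rounding up.
    milli=$(awk -v u="$1" 'BEGIN { printf "%.0f", u * 1000 }')
    echo $(( (milli + 9) / 10 ))
)

cpu_quota_pct 2.718   # t=3, A=3, B=1: prints 272
cpu_quota_pct 3.693   # t=7, A=3, B=1: prints 370
cpu_quota_pct 4.238   # t=8, A=4, B=1: prints 424
```

Rounding up keeps the quota from being smaller than the usage actually measured during profiling.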

At this point you can think about inserting these characterization data into the recipe. How to do that is explained in Recipe: add characterization information.

docs/rtlib/profiling.txt · Last modified: 2017/05/17 12:19 by jumanix
