
Efficient Provisioning of Bursty Scientific

Workloads on the Cloud Using Adaptive


Elasticity Control

Ahmed Ali-Eldin, Johan Tordsson,


and Erik Elmroth
Department of Computing Science
Umeå University, Sweden
www.cloudresearch.se

Maria Kihl
Lund Center for Control of Complex
Engineering Systems
Lund University, Sweden
http://www.lccc.lth.se/

Umeå University

Context

ve research programme in e-science between Uppsala University, Lund University


rch environment that enables a strong interplay between e-science research, e-infr

Motivation & Problem definition

The cloud elasticity problem


How much capacity to (de)allocate to a
cloud service (and when)?
Bursty and unknown workload
Increase ability to meet SLAs
Reduce resource usage
One of the limitations identified by Truong
et al. [1] to the wide adoption of the

Problem Description
Prediction of load/signal/future is not a new problem
Studied extensively within many disciplines
Time series analysis
Econometrics
Control theory
Stock markets
Biology, etc.
Multiple solutions proposed to prediction problem
Neural networks
Fuzzy logic
Adaptive control
Regression
Kriging models
<your favorite machine learning technique>
However, solution must be suitable for our problem

Requirements
Vary capacity allocated to a service

According to current and future load


Fulfill QoS requirements to meet SLAs
Without costly over-provisioning
Robustness
Avoid oscillations or behavioral changes
Scalability
Tens of thousands of servers + even more VMs
Adaptive to changing workloads
PID-controllers reliable for certain load patterns,
but unstable once the load or system dynamics
change
Fast
Limited look-ahead control accurate but too slow
Can take 30 min to control 15 servers and 60 VMs

Simplicity
Key to adoption

Our approach:
Adaptive Hybrid control
Closed loop control
Adaptive control:

P-controller
Adjust error signal by gain parameter
Error signal is the difference between current and
desired output
Change signal adjustments with load dynamics

Hybrid control, a controller that combines

Reactive control (step controller)


Proactive control (proportional, P-controller)

Initial model and


assumptions
Service with homogeneous requests
Short requests that take one time unit (or
less) to serve
VM startup time is negligible
Delayed requests are dropped
VM capacity constant
Infrastructure modeled as G/G/N queue
N (#VMs) varies over time
Perfect load balancing assumed
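
The assumptions above can be read as a simple per-time-unit capacity check. The following is a minimal sketch of that reading, not the authors' simulator; the names `StepResult` and `serve_one_time_unit` are hypothetical, and the queue is reduced to a single time step in which requests beyond the total capacity are dropped.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    served: int         # requests handled in this time unit
    dropped: int        # requests that could not be served in time
    idle_capacity: int  # unused request slots (over-provisioned capacity)


def serve_one_time_unit(arrivals: int, n_vms: int, vm_capacity: int) -> StepResult:
    """Serve one time unit under the stated assumptions.

    Assumes homogeneous requests that each take at most one time unit,
    constant per-VM capacity, perfect load balancing, and that delayed
    requests are dropped rather than queued.
    """
    total_capacity = n_vms * vm_capacity
    served = min(arrivals, total_capacity)
    dropped = arrivals - served
    return StepResult(served, dropped, total_capacity - served)
```

For example, 1200 arrivals on 2 VMs of capacity 500 would serve 1000 requests and drop 200.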

A. Ali-Eldin, J. Tordsson, and E. Elmroth. An


adaptive hybrid elasticity controller for cloud
infrastructures. In NOMS 2012, IEEE/IFIP Network
Operations and Management Symposium. IEEE,
2012.


Our approach (cont.)

Adaptive control (cont.)


How to estimate change in workload?
F = C * P
F: estimated load change
C: average capacity in last time window
P: gain parameter


Window size changes dynamically
Smaller upon prediction errors

A tolerance level decides how often the window is resized
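
One possible reading of the window rule is sketched below. The slides only state that the window shrinks upon prediction errors and that a tolerance level governs resizing; the halving step, the slow growth when predictions are good, and the bounds `min_window` and `max_window` are all assumptions made for illustration, and `resize_window` is a hypothetical name.

```python
def resize_window(window: int, predicted: float, actual: float,
                  tolerance: float, min_window: int = 1,
                  max_window: int = 600) -> int:
    """Return the next window size (in time units).

    Assumption: a prediction error above `tolerance` halves the window so
    the controller reacts to recent load only; otherwise the window grows
    by one time unit. The slides do not specify these step sizes.
    """
    error = abs(predicted - actual)
    if error > tolerance:
        return max(min_window, window // 2)
    return min(max_window, window + 1)
```

A larger tolerance makes resizing rarer, which matches the slide's remark that the tolerance level decides how often the window is resized.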

Two gain parameter alternatives studied

1. Periodical rate of change:
   P = Load change / avg. rate in last time window
   Denoted P_1 henceforth
2. Ratio of load change over average system rate:
   P = Load change / avg. rate over all time
   Denoted P_2 henceforth
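
The two gain alternatives and the estimate F = C * P can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, `rates` is taken to be a per-time-unit history of the system rate, and `capacities` a per-time-unit history of allocated capacity; the slides do not define these series precisely.

```python
from statistics import mean
from typing import Sequence


def gain_p1(load_change: float, rates: Sequence[float], window: int) -> float:
    """P_1: load change over the average rate in the last time window."""
    return load_change / mean(rates[-window:])


def gain_p2(load_change: float, rates: Sequence[float]) -> float:
    """P_2: load change over the average rate over all time."""
    return load_change / mean(rates)


def estimated_load_change(capacities: Sequence[float], window: int,
                          gain: float) -> float:
    """F = C * P, with C the average capacity in the last time window."""
    return mean(capacities[-window:]) * gain
```

With `rates = [100, 200, 300, 400]`, a load change of 70 and a window of 2, P_1 uses the average of the last two samples (350) while P_2 uses the average of the whole history (250), so P_2 yields the larger gain in this case.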

Hybrid control (cont.)


All in all, 9 approaches for

scale up (U) and scale down (D)


Reactively (R) and/or Proactively (P)

UR combined with:
DR, DP, DRP

UP combined with:
DR, DP, DRP

URP combined with:


DR, DP, DRP

Notation in the following:

URP-DP
Scale up: reactive + proactive
Scale down: proactive
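
The nine combinations and their labels can be enumerated mechanically. The sketch below only illustrates the naming scheme; `controller_combinations` is a hypothetical helper, not part of the presented work.

```python
from itertools import product

# Scale-up (U) and scale-down (D) variants: reactive (R), proactive (P), or both.
SCALE_UP = ["UR", "UP", "URP"]
SCALE_DOWN = ["DR", "DP", "DRP"]


def controller_combinations() -> list[str]:
    """Return all scale-up/scale-down pairings as labels such as 'URP-DP'."""
    return [f"{up}-{down}" for up, down in product(SCALE_UP, SCALE_DOWN)]
```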

Performance Evaluation
Simulation-based evaluations
3 aspects studied

1. Best combination of reactive and proactive controllers
2. Controller stability w.r.t. workload size
3. Comparison with state-of-the-art controller:
   Regression control [Iqbal et al., FGCS 2011]

Performance metrics

Over-provisioning:
VMs allocated but not needed
Under-provisioning:
VMs needed, but failed to allocate (SLA violation)
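
One way to compute the two metrics from a simulation trace is sketched below. The slides do not state how the percentages are normalised; this sketch assumes both are expressed relative to the total number of VMs needed over the trace, and `provisioning_metrics` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def provisioning_metrics(loads: Sequence[int], allocated: Sequence[int],
                         vm_capacity: int) -> Tuple[float, float]:
    """Return (over_provisioning_pct, under_provisioning_pct).

    Assumption: percentages are relative to the total VMs needed, where
    the VMs needed in a time unit is ceil(load / vm_capacity).
    """
    over = under = needed_total = 0
    for load, n_vms in zip(loads, allocated):
        needed = math.ceil(load / vm_capacity)
        needed_total += needed
        over += max(n_vms - needed, 0)   # allocated but not needed
        under += max(needed - n_vms, 0)  # needed but not allocated
    if needed_total == 0:
        return 0.0, 0.0
    return 100.0 * over / needed_total, 100.0 * under / needed_total
```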

Studied workload
FIFA98 traces

~3 month Web server traces (bursty)


Grouped requests per second of arrival
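
Grouping requests per second of arrival amounts to bucketing timestamps. The sketch below assumes the trace has already been parsed into a list of arrival times in seconds; the actual trace file format is not covered here, and `requests_per_second` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List


def requests_per_second(timestamps: Iterable[float]) -> List[int]:
    """Count arrivals in each one-second bucket, including empty seconds."""
    counts = Counter(int(t) for t in timestamps)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(second, 0) for second in range(first, last + 1)]
```

Seconds with no arrivals are kept as zeros so that the resulting series has one entry per time unit.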

Best controller combination


Scaled FIFA traces x 50

Reasonable given Internet growth from 1998 to today

Assume that 1 VM handles 500 requests


Reasonable for DB-backend Web servers

Studied (for the sake of completeness) all 9 combinations of reactive + proactive controllers

Some make no sense & indeed performed poorly:


Reactive scale down causes oscillations and a lot of
under-provisioning (SLA violations)
Pure proactive scale up tends to skew and cause
under-provisioning
Other approaches more promising:
Reactive scale up
Fast reaction to load increases, no skew
Proactive scale-down
Keep VMs for a while (just in case) once they are allocated
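
The promising pairing above (reactive scale up, proactive scale down) can be sketched as a single control step. This is an illustrative reading, not the authors' controller: `ur_dp_step` is a hypothetical name, the predicted load change is assumed to be expressed in requests per time unit, and releasing one VM per full VM's worth of predicted drop is an assumption.

```python
import math


def ur_dp_step(current_vms: int, load: float, predicted_load_change: float,
               vm_capacity: float) -> int:
    """Return the number of VMs to run in the next time unit."""
    needed_now = math.ceil(load / vm_capacity)

    # Reactive scale up: follow the observed load immediately.
    if needed_now > current_vms:
        return needed_now

    # Proactive scale down: release VMs only when a drop is predicted,
    # and never go below what the current load requires.
    if predicted_load_change < 0:
        releasable = math.floor(-predicted_load_change / vm_capacity)
        return max(needed_now, current_vms - releasable)

    # No predicted drop: keep the allocated VMs for now.
    return current_vms
```

Without a predicted drop, surplus VMs are kept, which trades some over-provisioning for fewer SLA violations when the load bounces back.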

Best combination (cont.)
Baseline: UR-DR

1.63% under-provisioning
1.4% over-provisioning

Best combination (cont.)
UR-DP_1

0.41% under-provisioning (1.63% for UR-DR)


9.44% over-provisioning (1.4% for UR-DR)

Best combination (cont.)
UR-DP_2

0.18% under-provisioning (1.63% for UR-DR)


14.33% over-provisioning (1.4% for UR-DR)

Stability w.r.t. workload size

Multiplied FIFA traces by X = 10, 20, ..., 60


Assume that 1 VM handles 10*X requests/s
Studied UR-DR, UR-DP_1, UR-DP_2
Under-provisioning and over-provisioning (plots not reproduced)

Conclusions:

Reactive stable (no surprise)
Proactive controller prediction quality varies with workload
Error in over-provisioning grows slower than workload size

Comparison with regression


Regression-based control:

Scale up: reactively, Scale down: regression


2nd order regression based on full workload history

Evaluation on selected (nasty) part of FIFA trace

UR-DR: 2.99% under-provisioning, 19.57% over-provisioning
UR-D_Regression: 2.24% under-provisioning, 47% over-provisioning
UR-DP_1: 1.51% under-provisioning, 32.24% over-provisioning
UR-DP_2: 1.07% under-provisioning, 39.75% over-provisioning

Controller performance (execution time)

Regression: 0.98s on average, up to 6.5s observed


Our approach: 0.6 ms on average

Conclusions
P-control promising approach to cloud elasticity

Accurate predictions
Rapid
Controller execution time in ms
Robust
Copes with changes in workload dynamics

No one-size-fits-all controller

Tradeoff between over- and under-provisioning


Costs for SLA violation (under-provisioning) and
resource wastage (over-provisioning) decide
strategy to use
