
Performance is Overrated

Mark Callaghan NEDB 2012

(Peak) Performance is overrated

Focus on reducing variance rather than increasing peaks

Capacity planning uses p95 or p99 response time
Servers must be underutilized to tolerate variance

Manageability needs more attention

Cost of extra hardware can be predicted; cost of downtime cannot
Downtime comes in many forms (server down and server too busy)

What is manageability?

The rate of interrupts/server for the operations team
Server count grows quickly and the operations team grows slowly
Quality of service must improve over time

Does work get done? Does work get done on time?

This has good average performance

Why MySQL?

It was there when we arrived
We made it scale 10X
My peers in db eng/ops are very good
Room for new people, ideas and products

We like MySQL for OLTP

250,000 QPS on (silly) benchmarks
InnoDB is wonderful

OLTP for the social graph

Secondary indexes
Index-only queries
Small joins, but most queries use one table
Multi-row transactions
Majority of workload does not need SQL/optimizer
Physical and logical backup
Async replication on a WAN

Does this require SQL?

Most of it does not

Why is the grass greener on the other side?

Automated replacement of failed nodes
Less downtime on schema changes, or fewer schema changes
Multi-master
Better compression
Write-optimized

A busy OLTP deployment circa 2010

Query response time: 4 ms reads, 5 ms writes
Rows read per second: 450M peak
Network bytes per second: 38GB peak
Rows changed per second: 3.5M peak
Queries per second: 13M peak
InnoDB page IO per second: 5.2M peak

Why are there so many servers?

Big data X high QPS
Per Domas, we have lots of medium data (sharded MySQL)
Add servers to add IOPS
Flash is very interesting
Write-optimized databases are very interesting

Database teams at Facebook

Operations
Move fast and fix things
Deploy our changes, or not
Tell me what to fix

Engineering
Fix bugs that stall and crash MySQL
Make better bugs
Market

The git log for our MySQL branch has 452 changes.

Tips on scaling: more data, more QPS

1. Fix stalls to make use of capacity
2. Improve efficiency to use less
3. Repeat

Fix stalls
Don't make MySQL faster, make it less slow

Stalls from file systems
Stalls from caches in MySQL
Stalls from mutexes in MySQL
Everything else

File system stalls

Switch the IO scheduler from cfq to deadline
Deadline is less likely to stall writes

Switch from ext-3 to XFS
XFS does not lock a per-inode mutex on writes
XFS has less variance on write-append

Stalls from caches

Some expensive operations are deferred
InnoDB purge removes delete-marked rows
InnoDB insert buffer defers IO for secondary index maintenance
Fuzzy checkpoint constraint enforcement

Repeat until done

Problem
Arrival rate exceeds completion rate
Throughput collapses when cache is full

Solution
Increase completion rate

Performance drops when ibuf is full
Otherwise, the insert buffer is awesome

Sysbench QPS at 20 second intervals with checkpoint stalls
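The arrival-rate versus completion-rate collapse can be sketched with a toy model (my construction, not InnoDB internals): deferred work such as insert buffer merges accumulates in a bounded cache, and once the cache is full the foreground path must also drain it, so per-operation cost jumps.

```python
# Toy model of "arrival rate exceeds completion rate": deferred work
# piles up in a bounded cache; when it is full, every foreground op
# must also perform the deferred merge, so the fast path disappears.

def run(ops, arrival_cost, merge_cost, cache_cap, drain_per_op=0):
    backlog, total_cost = 0, 0
    for _ in range(ops):
        if backlog < cache_cap:
            backlog += 1                              # defer: fast path
            total_cost += arrival_cost
        else:
            total_cost += arrival_cost + merge_cost   # forced merge: slow
        backlog -= min(backlog, drain_per_op)         # background drain
    return total_cost

# With no background draining, the cache fills and cost explodes.
slow = run(1000, arrival_cost=1, merge_cost=10, cache_cap=100)
# Increasing the completion (drain) rate keeps the fast path available.
fast = run(1000, arrival_cost=1, merge_cost=10, cache_cap=100, drain_per_op=1)
assert fast < slow
```

The fix in the slide ("increase completion rate") corresponds to raising `drain_per_op` so the backlog never reaches the cap.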

Stalls from mutexes

Extending InnoDB files
Opening InnoDB tables
LOCK_open and kernel_mutex
Excessive calls to fcntl
Deadlock detection overhead
Purge/undo lock conflicts
TRUNCATE table and LOCK_open
DROP table and LOCK_open
innodb_thread_concurrency
Group commit
Admission control

Repeat until done

Problem
Global mutex held while expensive operation is done
Requests stall

Solution
Defer expensive operation until global mutex is unlocked

Stalls from excessive calls to fcntl

Problem
Some Linux kernels get the big kernel lock on fcntl calls
MySQL called it too often

Doubled peak QPS by changing MySQL to call it less
fcntl is now fixed in official MySQL
200,000 QPS on benchmarks

Sysbench read-only with fcntl fix

Stalls from deadlock detection overhead

InnoDB deadlock detection was inefficient
O(N*N) for N threads waiting on the same row lock

Fix is simple
Disable it and rely on lock wait timeout
Detection is now more efficient in official MySQL

The cost of deadlock detection

QPS for 1 to 1024 connections updating the same row
(chart: QPS for 1 to 1024 connections, deadlock detection disabled vs. enabled)
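The O(N*N) cost is easy to see with a counting sketch (an illustration, not InnoDB's actual code): if each arriving waiter scans every thread already waiting on the lock, total work grows quadratically as threads pile up on one row.

```python
# Illustration of why naive deadlock detection is O(N*N) for N threads
# waiting on the same row lock: each new waiter scans all prior waiters.

def naive_detection_cost(n_waiters):
    """Total edge checks when each arriving waiter scans every thread
    that is already waiting."""
    checks = 0
    for arriving in range(n_waiters):
        checks += arriving          # scan every thread already waiting
    return checks

assert naive_detection_cost(2) == 1
assert naive_detection_cost(1024) == 1024 * 1023 // 2   # ~524k checks
```

At 1024 connections that is roughly half a million checks per new waiter batch, which is why disabling detection and relying on lock wait timeout helped.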

Stalls from innodb_thread_concurrency

Limits the maximum number of running threads
Threads are scheduled in LIFO order

With 1000+ sleeping threads it can take too long to wake one

Allow some threads to run in FIFO order
When a new thread arrives, it runs if other threads are slow to wake

FIFO + LIFO = FLIFO

Sysbench TPS with FLIFO
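The FLIFO idea can be sketched as follows (my simplification, not Facebook's patch): sleeping threads are normally woken LIFO, but when the wakeup path is lagging, a newly arriving thread is admitted immediately, FIFO-style, instead of joining the stack.

```python
# Sketch of FLIFO scheduling: LIFO wakeups by default, with a FIFO
# escape hatch when sleeping threads are slow to wake.

class Flifo:
    def __init__(self, wake_is_slow):
        self.stack = []                   # sleeping threads, LIFO order
        self.wake_is_slow = wake_is_slow  # predicate: are wakeups lagging?

    def arrive(self, thread):
        """Returns the thread if it may run now, else None (it sleeps)."""
        if self.stack and self.wake_is_slow():
            return thread                 # run the newcomer directly (FIFO)
        self.stack.append(thread)
        return None

    def release_slot(self):
        """A running thread finished; wake the most recent sleeper."""
        return self.stack.pop() if self.stack else None

q = Flifo(wake_is_slow=lambda: False)
assert q.arrive("t1") is None
assert q.arrive("t2") is None
assert q.release_slot() == "t2"           # normal case: LIFO wakeup

q2 = Flifo(wake_is_slow=lambda: True)
q2.arrive("t1")
assert q2.arrive("t2") == "t2"            # slow wakeups: newcomer runs now
```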

Commit stalls for MySQL

This is XA when the replication log (binlog) is enabled
InnoDB and the replication log are resource managers

Commit requires 3 fsyncs, 2 can be shared

HW RAID card does ~5000 fsyncs/second
Supports ~2500 commits/second

Group commit

Modified MySQL to allow all fsyncs to be shared

Fix was fun
Uses a group commit timeout
Threads only wait when other threads are about to commit (magic)

Useful side effect
Servers are better able to survive RAID battery failure
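The arithmetic behind the fsync ceiling can be made explicit (the numbers come from the slide; the model is mine): with ~5000 fsyncs/second and 2 unshared fsyncs per commit, commits cap at ~2500/second, while sharing each fsync round among a group of waiting transactions scales throughput with the group size.

```python
# Back-of-envelope model of group commit: fsync cost is amortized
# across the transactions that share each fsync round.

def commits_per_second(fsync_rate, fsyncs_per_commit, group_size):
    """Throughput when group_size transactions share each fsync round."""
    return fsync_rate * group_size / fsyncs_per_commit

# Without sharing: 5000 fsyncs/s, 2 fsyncs per commit -> ~2500 commits/s.
assert commits_per_second(5000, 2, 1) == 2500.0
# With groups of 10 commits sharing each fsync round:
assert commits_per_second(5000, 2, 10) == 25000.0
```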

Stalls from mutex thrashing

Preserve throughput while overloaded
Good: preserve the rows-read rate, limit threads running
Better: preserve the query completion rate, limit queries running

Admission control
Simple TP monitor in MySQL
Limits max concurrent queries per database account
Does the right thing when a query blocks on IO and lock waits
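A per-account cap like the one described can be sketched with a semaphore (a toy illustration, not the MySQL patch; the real TP monitor also handles IO and lock waits specially).

```python
# Toy admission control: each database account gets a cap on
# concurrently running queries; callers beyond the cap block instead
# of piling onto server mutexes.

import threading
from collections import defaultdict

class AdmissionControl:
    def __init__(self, max_per_account):
        # One bounded semaphore per account, created on first use.
        self.sems = defaultdict(
            lambda: threading.BoundedSemaphore(max_per_account))

    def run_query(self, account, query_fn):
        sem = self.sems[account]
        with sem:                     # blocks when the account is at cap
            return query_fn()

ac = AdmissionControl(max_per_account=2)
assert ac.run_query("tier1", lambda: "ok") == "ok"
```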

Stalls from the speed of light

mysql_query("START TRANSACTION");
mysql_query("INSERT IGNORE INTO graph ...");
if (mysql_affected_rows() == 1)
  mysql_query("INSERT INTO counts ... ON DUPLICATE KEY UPDATE c = c+1");
mysql_query("INSERT INTO other_table ...");
mysql_query("COMMIT");

The solution: not stored procedures

mysql_query(
  "START TRANSACTION;"
  " INSERT IGNORE INTO graph ...;"
  " SELECT row_count() INTO @r;"
  " INSERT INTO counts ... ON DUPLICATE KEY UPDATE c = IF(@r = 1, c+1, c);"
  " INSERT INTO other_table ...;"
  " COMMIT");

Transactions Per Second vs. Concurrency

(chart: TPS from 0 to ~2400 across 0 to 200 concurrent connections; series: Original, Trigger, Procedure, Multi-Query)
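The gain is mostly round-trip arithmetic (my numbers, for illustration): the original sequence makes six client-server round trips per transaction while the multi-statement version makes one, so at a 1 ms network round trip the per-transaction network time drops from 6 ms to 1 ms.

```python
# Network time per transaction is round trips times round-trip latency.

def network_ms(round_trips, rtt_ms=1.0):
    return round_trips * rtt_ms

assert network_ms(6) == 6.0   # original: one round trip per statement
assert network_ms(1) == 1.0   # multi-statement: one round trip total
```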

How did we find these problems?

We know MySQL
When does experience trump perfect software?

We use PMP
Poor Man's Profiler
State of the art tool for debugging stalls
Continue to invest in making it better

This is PMP

echo "set pagination 0" > /tmp/pmpgdb
echo "thread apply all bt" >> /tmp/pmpgdb
mpid=$( pidof mysqld )
t=$( date +'%y%m%d_%H%M%S' )
gdb --command /tmp/pmpgdb --batch -p $mpid | grep -v 'New Thread' > f.$t
cat f.$t | awk '
  BEGIN { s = ""; }
  /Thread/ { print s; s = ""; }
  /^\#/ { x = index($2, "0x");
          if (x == 1) { n = $4 } else { n = $2 };
          if (s != "") { s = s "," n } else { s = n } }
  END { print s }' - | sort | uniq -c | sort -r -n -k 1,1 > h.$t

The database is slow!

Paging via LIMIT x,y is O(N*N)
Don't allow it, or use an index to determine paging order

Non index-only queries depend on a warm buffer cache
Make them index-only

Queries that examine 1M rows to return 100 rows are slow
Define a better index

Queries that might do 10,000 disk reads are slow
Don't do them

We repeatedly confront these problems
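The "use an index to determine paging order" fix is keyset pagination. A runnable sketch using sqlite3 (the production systems were MySQL; the table and column names are made up): OFFSET-based paging scans and discards every skipped row, so walking N pages touches O(N*N) rows, while keyset paging seeks directly through the index from the last row seen.

```python
# Offset paging vs. keyset paging, demonstrated on sqlite3.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO msgs VALUES (?, ?)",
               [(i, f"m{i}") for i in range(1, 101)])

# Offset paging: page 3 (10 rows/page) still scans past the 20 skipped rows.
offset_page = db.execute(
    "SELECT id FROM msgs ORDER BY id LIMIT 10 OFFSET 20").fetchall()

# Keyset paging: remember the last id returned and seek past it via the index.
last_seen = 20
keyset_page = db.execute(
    "SELECT id FROM msgs WHERE id > ? ORDER BY id LIMIT 10",
    (last_seen,)).fetchall()

# Same rows, but the keyset form does a constant amount of work per page.
assert offset_page == keyset_page == [(i,) for i in range(21, 31)]
```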

Manageability: solutions

Online schema change tool

Dogpiled collects data during a query pileup
Get performance counters and the list of running queries
Generate an HTML page with interesting results

Pylander sheds load during a query pileup
Kill duplicate queries
Limit the number of queries from specific accounts

Schema Change

Must do frequent schema changes
Add a column, add an index, change an index

ALTER TABLE can take hours on a large table
ALTER TABLE can block reads and writes to the table

Our solution: Online Schema Change (OSC)

1. Setup triggers to track changes (briefly locks the table)
2. Copy data to new table with desired schema
3. Replay changes on new table
4. Rename new table as the target table (briefly locks the table)
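The four OSC steps can be miniaturized with sqlite3 (a sketch only; the real tool targets MySQL and also handles updates/deletes, locking, races and retries): a trigger captures changes, rows are copied into the new-schema table, captured changes are replayed, and the tables are swapped.

```python
# Miniature of Online Schema Change: trigger -> copy -> replay -> rename.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT)")
db.execute("INSERT INTO t VALUES (1, 'x')")

# 1. Create the new-schema table and a change log fed by a trigger.
db.execute("CREATE TABLE t_new (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
db.execute("CREATE TABLE t_log (id INTEGER, a TEXT)")
db.execute("""CREATE TRIGGER t_ins AFTER INSERT ON t
              BEGIN INSERT INTO t_log VALUES (NEW.id, NEW.a); END""")

# 2. Copy existing rows; writes keep landing on t meanwhile.
db.execute("INSERT INTO t_new (id, a) SELECT id, a FROM t")
db.execute("INSERT INTO t VALUES (2, 'y')")     # write during the copy

# 3. Replay the logged changes onto the new table.
db.execute("INSERT OR REPLACE INTO t_new (id, a) SELECT id, a FROM t_log")

# 4. Swap: retire the old table and rename the new one into place.
db.execute("DROP TRIGGER t_ins")
db.execute("ALTER TABLE t RENAME TO t_old")
db.execute("ALTER TABLE t_new RENAME TO t")

rows = db.execute("SELECT id, a, b FROM t ORDER BY id").fetchall()
assert rows == [(1, 'x', None), (2, 'y', None)]  # no writes lost
```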

Manageability: work in progress

Make InnoDB compression work for OLTP
Faker: tool for prefetching for replication slaves
Auto replacement: replace failed and unhealthy MySQL servers
Auto resharding: sharding is easy, resharding is hard

Faker

Replication replay is read page, modify page, write page
Bottleneck might be disk reads
Work is done by a single thread
Transactions on the master are concurrent

Faker
Multiple threads replay transactions in fake-changes mode on slaves
Captures 70% of disk reads; work in progress to improve the rate

Manageability: open issues

Why is one host slow?
Why is the database tier doing a lot more work today?
Where do I spend the next N dollars (memory, disk, flash)?
How do I run a workload across old (slow) and new (fast) servers?
How do I integrate cache and database tiers?
What monitoring signals generate useful interrupts?

The world has a surplus of clever ideas

Getting things into production is the hard part
Run a server in production before writing a new one
Invest more in monitoring, debugging and tuning

Read more at facebook.com/MySQLatFacebook

(c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
