October 2010
altin.papa / vassil.avramov @riskfocusinc.com
Who we are
Established financial services technology consulting company
Founded in 2004 by experts in risk management technology
Exclusive focus on Capital Markets
Engaged at top-tier international banks and hedge funds
Offices in NY, London, Bangalore
www.riskfocusinc.com
Our Approach
We aim for better, generalized solutions to problem patterns
Presentation Agenda
The Enterprise IT Problem
Challenges of Enterprise Systems
Splunk Solutions for the whole Software Development Lifecycle
Cross-cutting concerns
Design
Release Cycle
Operations
Recommendations
The algorithm
Strategic Success
Clear Message
Common Language
Reactive
The architecture
Robust System
Message Driven
Common Format
Reactive
Maintenance:
Preventive is better than Corrective
Corrective Maintenance: quick and replicable
Operational Patterns in Large Systems
How do we apply behavior across functional components?
Cross-cutting concerns
Apply to all parts, regardless of function
At application level, often handled via Aspect-Oriented Programming:
Logging
Performance Profiling
Security
Transactionality
But what about at higher levels?
This is how the operations team experiences the system
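To make the application-level case concrete, here is a minimal sketch of a cross-cutting concern applied with a Python decorator, which is only an approximation of what an Aspect-Oriented Programming framework does. The logger name `crosscut` and the functions `handle_novation` and `save_trade` are hypothetical stand-ins for the components named in this deck.

```python
import functools
import logging
import time

# Hypothetical logger name; any shared logger would do.
logger = logging.getLogger("crosscut")


def logged_and_timed(func):
    """Log every call to func with its duration, without editing func itself."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # The failure is logged here once, then re-raised unchanged.
            logger.exception("%s failed", func.__name__)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s ok in %.1f ms", func.__name__, elapsed_ms)
        return result
    return wrapper


@logged_and_timed
def handle_novation(trade_id):  # hypothetical business function
    return f"novated {trade_id}"


@logged_and_timed
def save_trade(trade_id):  # hypothetical business function
    return f"saved {trade_id}"
```

The same wrapper is applied to unrelated functions, so logging and profiling stay out of the business code. This works inside one application only; it does not help once the concern has to span several systems.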
Novation Handler
Trade DAO
Message Listener
Logging
Cross-cutting at the SYSTEM Level
Client
Trade Processing
External Gateway
Log Aggregation
Cross-cutting at the ORGANIZATION Level
Trading System
Market Data System
Valuation System
Operational Intelligence
Borders are problematic
They make a system look fragmented to the operational teams.
Example
An issue occurs within one of the components
This leads to an incident across the border
The symptoms are observed in a different place at a different time
Solution
Aggregate all logs and cross-index them
Create an integrated dashboard
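The idea of aggregating and cross-indexing can be sketched in a few lines of Python. This is an illustration only, not how Splunk works internally. The log line format (`<ISO timestamp> trade=<id> <message>`), the component names, and the sample lines are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-component logs; each line is "<ISO timestamp> trade=<id> <message>".
COMPONENT_LOGS = {
    "client": ["2010-10-01T13:00:05 trade=T42 submit failed: timeout"],
    "trade_processing": [
        "2010-10-01T12:58:10 trade=T42 validation error: missing counterparty",
        "2010-10-01T12:59:00 trade=T43 booked",
    ],
    "gateway": ["2010-10-01T12:57:30 trade=T42 received FpML message"],
}


def parse_line(component, line):
    """Turn one raw line into an event tagged with the component it came from."""
    timestamp, trade_field, message = line.split(" ", 2)
    return {
        "time": datetime.fromisoformat(timestamp),
        "component": component,
        "trade_id": trade_field.split("=", 1)[1],
        "message": message,
    }


def aggregate(logs_by_component):
    """Merge all component logs into a single list ordered by time."""
    events = [
        parse_line(component, line)
        for component, lines in logs_by_component.items()
        for line in lines
    ]
    return sorted(events, key=lambda event: event["time"])


def cross_index(events):
    """Index the merged events by trade id so one trade can be followed across borders."""
    index = defaultdict(list)
    for event in events:
        index[event["trade_id"]].append(event)
    return index


timeline = cross_index(aggregate(COMPONENT_LOGS))
for event in timeline["T42"]:
    print(event["time"], event["component"], event["message"])
```

With the sample data, the timeline for trade `T42` runs from the gateway through trade processing to the client, which shows the symptom at the client appearing after the cause in another component.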
Dashboard
Conversation
Release Cycle
Problem: The Problem Only Occurs in Production (good acronym)
Tests passed
For some reason we only see the problem once the system is live
Example
Exception occurred in QA/UA, but tests passed and no one saw it
Same problem blew up in Production later
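A minimal sketch of how such an exception could be surfaced before release: scan the pre-production logs for exception names regardless of test results, then compare them with what production reports. The log lines, the regular expression, and the function names below are assumptions for illustration, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Assumed convention: exception class names end in "Exception" or "Error".
EXCEPTION_PATTERN = re.compile(r"\b([A-Za-z_][\w.]*(?:Exception|Error))\b")

# Hypothetical log excerpts.
QA_LOG = [
    "INFO test suite passed",
    "ERROR java.lang.NullPointerException at TradeDao.save",
    "INFO build 123 deployed",
]
PROD_LOG = [
    "ERROR java.lang.NullPointerException at TradeDao.save",
    "ERROR java.net.SocketTimeoutException: Read timed out",
]


def exception_counts(log_lines):
    """Count the first exception name found on each line, whatever the test outcome."""
    counts = Counter()
    for line in log_lines:
        match = EXCEPTION_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def seen_before_production(qa_lines, prod_lines):
    """Return exception names that already appeared in QA and later in production."""
    qa = exception_counts(qa_lines)
    prod = exception_counts(prod_lines)
    return sorted(set(qa) & set(prod))


print(seen_before_production(QA_LOG, PROD_LOG))
```

In the sample data the test suite passed, yet the QA log already contained the exception that later appears in production.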
Link to everything:
Knowledge Base (e.g. Support Wiki)
Source Control viewer (FishEye)
Build Server (TeamCity/Hudson)
Bug Database (e.g. Jira)
Root Cause
Show the problem FpML message via REST
Drill through to Support Wiki for solution
Example
We have a problem. Can you look at it?
Collaborative effort preceding call is lost
Inability to correlate events across components and over time
Inability to look historically.
When did the problem appear first?
Did we just introduce it in this release?
Solution
Just email a Splunk link
Single entry point for ALL INTELLIGENCE on this problem
It can be passed around with no loss
Performance
Support Email: Sync was slow starting 1pm. Any ideas?
See trends over time, across releases
Confirm, drill down, resolve
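As a rough sketch of the "confirm, drill down" step, the following Python groups timing samples by release and by hour. The sample values, release numbers, and the threshold are invented for illustration; in practice the timings would come from the indexed logs.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples: (release, hour of day, sync duration in seconds).
SYNC_SAMPLES = [
    ("1.4", 11, 2.0), ("1.4", 12, 2.2), ("1.4", 13, 2.1),
    ("1.5", 11, 2.1), ("1.5", 12, 2.3), ("1.5", 13, 9.5), ("1.5", 14, 10.1),
]


def median_by(samples, key_index):
    """Median duration grouped by one field: 0 for release, 1 for hour of day."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_index]].append(sample[2])
    return {key: median(values) for key, values in groups.items()}


def slow_hours(samples, release, threshold_seconds):
    """Drill down: hours within one release whose median duration exceeds the threshold."""
    by_hour = median_by([s for s in samples if s[0] == release], 1)
    return sorted(hour for hour, value in by_hour.items() if value > threshold_seconds)


print(median_by(SYNC_SAMPLES, 0))         # trend across releases
print(slow_hours(SYNC_SAMPLES, "1.5", 5.0))  # confirm when the slowdown started
```

Comparing releases first and then hours within the suspect release mirrors the support email above: the slowdown can be confirmed as starting at 1pm and tied to a specific release.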
Recommendations
Good Design takes into account the whole lifecycle of a System