
ACI Troubleshooting Book

Documentation
Release 1.0.1

Andres Vega Bryan Deaver Jerry Ye


Kannan Ponnuswamy Loy Evans Mike Timm Paul Lesiak
Paul Raytick

July 09, 2015

Contents

1 Preface 3
1.1 Authors and Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Distinguished Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Dedications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Book Writing Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Expected Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Organization of this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Section 1: Introduction to ACI Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Section 2: Sample Reference Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Section 3: Troubleshooting Application Centric Infrastructure . . . . . . . . . . . . . . . . . . . 7

2 Application Centric Infrastructure 8

3 ACI Policy Model 9


3.1 Abstraction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Everything is an Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Relevant Objects and Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Hierarchical ACI Object Model and the Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Infrastructure as Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Build object, use object to build policy, reuse policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.7 REST API just exposes the object model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8 Logical model, Resolved model, concrete model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.9 Formed and Unformed Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.10 Declarative End State and Promise Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Troubleshooting Tools 23
4.1 APIC Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
CLI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Programmatic Configuration (Python) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Fabric Node Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Exporting information from the Fabric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5 External Data Collection – Syslog, SNMP, Call-Home . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Health Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.7 Atomic Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 Troubleshooting Methodology 27
5.1 Overall Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Sample Reference Topology 29


6.1 Physical Fabric Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2 Logical Application Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

7 Troubleshooting 31
7.1 Naming Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Suggested Naming Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

8 Initial Hardware Bringup 34


8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Verification/Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Verification/Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

9 Fabric Initialization 38
9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
9.2 Fabric Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
9.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Symptom 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Symptom 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

10 APIC High Availability and Clustering 48


10.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.2 Cluster Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
10.3 Majority and Minority - Handling Clustering Split Brains . . . . . . . . . . . . . . . . . . . . . . . . 49
10.4 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
10.5 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
10.6 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
10.7 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
10.8 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.9 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.10 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

11 Firmware and Image Management 53


11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
APIC Controller and Switch Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Firmware Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Compatibility Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Firmware Upgrade Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Verifying the Firmware Version and the Upgrade Status by use of the REST API . . . . . . . . . . . 56
11.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.4 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

12 Faults / Health Scores 58


12.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
12.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Symptom 1: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Resolution 1: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
12.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Symptom 1: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Resolution 1: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Resolution 2: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

13 REST Interface 65
13.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
ACI Object Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
APIC REST API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Payload Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Read Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Write Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
API Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
ACI Software Development Kit (SDK) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Establishing a Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Working with Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
APIC REST to Python Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
13.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Verification 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Symptom 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Symptom 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Symptom 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Symptom 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

14 Management Tenant 84
14.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Fabric Management Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Out-Of-Band Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Inband Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Layer 2 Inband Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Layer 2 Configuration Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Layer 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Layer 3 Inband Configuration Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
APIC Management Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Fabric Node (Switch) Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Management Failover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Management EPG Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Fabric Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
14.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
14.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
15 Common Network Services 99
15.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Fabric Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
APIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Fabric nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
15.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Symptom 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Symptom 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
15.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

16 Unicast Data Plane Forwarding and Reachability 120


16.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Verification - Endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Verification - VLANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Verification - Forwarding Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
16.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
16.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

17 Policies and Contracts 136


17.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Verification of Zoning Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
17.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Symptom 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Symptom 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
18 Bridged Connectivity to External Networks 153
18.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
18.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
18.3 Problem Description: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Symptom 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Verification/Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Symptom 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Verification/Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
18.4 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

19 Routed Connectivity to External Networks 170


19.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
External Route Distribution Inside Fabric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
19.2 Fabric Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Output from Spine 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Output from Spine 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
19.3 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Verification/Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
19.4 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Verification 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Verification 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Resolution 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Resolution 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
19.5 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

20 Virtual Machine Manager and UCS 184


20.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
20.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
20.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
20.4 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
20.5 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Symptom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

21 L4-L7 Service Insertion 197


21.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Device Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Service Graph Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Concrete Device and Logical Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Device Cluster Selector Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Rendering the Service Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
21.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Symptom 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Symptom 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Symptom 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Symptom 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Symptom 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Symptom 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

22 ACI Fabric Node and Process Crash Troubleshooting 205


22.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
DME Processes: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
CLI: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Identify When a Process Crashes: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Collecting the Core Files: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
22.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

23 APIC Process Crash Troubleshooting 211


23.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
DME Processes: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
How to Identify When a Process Crashes: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Collecting the Core Files: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
23.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Symptom 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Symptom 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

24 Appendix 218
24.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
J . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

25 Indices and tables 223


Table of Contents:

1 Preface

• Authors and Contributors


• Authors
• Distinguished Contributors
• Dedications
• Acknowledgments
• Book Writing Methodology
• Who Should Read This Book?
– Expected Audience
• Organization of this Book
– Section 1: Introduction to ACI Troubleshooting
– Section 2: Sample Reference Topology
– Section 3: Troubleshooting Application Centric Infrastructure

1.1 Authors and Contributors

This book represents an intense collaborative effort between Cisco's Engineering, Technical Support, Advanced Services and Sales employees, produced over a single week in the same room at Cisco Headquarters Building 20 in San Jose, CA.

1.2 Authors

Andres Vega - Cisco Technical Services


Bryan Deaver - Cisco Technical Services
Jerry Ye - Cisco Advanced Services
Kannan Ponnuswamy - Cisco Advanced Services
Loy Evans - Systems Engineering
Mike Timm - Cisco Technical Services
Paul Lesiak - Cisco Advanced Services
Paul Raytick - Cisco Technical Services

1.3 Distinguished Contributors

Giuseppe Andreello
Pavan Bassetty
Sachin Jain
Sri Goli

Contributors

Lucien Avramov
Piyush Agarwal
Pooja Aniker
Ryan Bos
Mike Brown
Robert Burns
Mai Cutler
Tomas de Leon
Luis Flores
Maurizio Portolani
Michael Frase
Mioljub Jovanovic
Ozden Karakok
Jose Martinez
Rafael Mueller
Chandra Nagarajan
Mike Petrinovic
Daniel Pita
Mike Ripley
Zach Seils
Ramses Smeyers
Steve Winters

1.4 Dedications

“For Olga and Victoria, my love and happiness, hoping for a world that continues to strive in providing more effective
solutions to all problems intrinsic to human nature.” - Andres Vega
“An appreciative thank you to my wife Melanie and our children Sierra and Jackson for their support. And also to
those that I have had the opportunity to work with over the years on this journey with Cisco.” - Bryan Deaver
“For my parents, uncles, aunts, cousins, and wonderful nephews and nieces in the US, Australia, Hong Kong and
China.” - Jerry Ye
“To my wife, Vanitha for her unwavering love, support and encouragement, my kids Kripa, Krish and Kriti for the
sweet moments, my sister who provided me the education needed for this book, my brother for the great times, and
my parents for their unconditional love.” - Kannan Ponnuswamy
“Would like to thank my amazing family, Molly, Ethan and Abby, without whom, none of the things I do would
matter.” - Loy Evans
“Big thanks to my wife Morena, my daughters Elena and Mayra. Thank you to my in-laws Guadalupe and Armando
who helped watch my beautiful growing girls while I spent the time away from home working on this project.” -
Michael Timm
“Dedicated to the patience, love and continual support of Amanda; my sprite and best friend” - Paul Lesiak
“For Susan, Matthew, Hanna, Brian, and all my extended family, thanks for your support throughout the years. Thanks
as well to Cisco for the opportunity, it continues to be a fun ride.” - Paul Raytick
1.5 Acknowledgments

While this book was produced and written in a single week, the knowledge and experience leading to it are the result of the hard work and dedication of many individuals inside and outside Cisco.
Special thanks to Cisco's INSBU Executive, Technical Marketing and Engineering teams who supported the realization of this book. We would like to thank you for your continuous innovation and the value you provide to the industry.
We also want to thank Cisco's Advanced Services and Technical Services leadership teams for the trust they placed in this initiative and the support provided since the inception of the idea.
In particular we want to express gratitude to the following individuals for their influence and support both prior and
during the book sprint:

Shrey Ajmera
Subrata Banerjee
Dave Broenen
John Bunney
Luca Cafiero
Ravi Chamarthy
Mike Cohen
Kevin Corbin
Ronak Desai
Krishna Doddapaneni
Mike Dvorkin
Tom Edsall
Ken Fee
Vikki Fee
Siva Gaggara
Shilpa Grandhi
Ram Gunuganti
Ruben Hakopian
Robert Hurst
Donna Hutchinson
Fabio Ingrao
Saurabh Jain
Praveen Jain
Prem Jain
Soni Jiandani
Sarat Kamisetty
Yousuf Khan
Praveen Kumar
Tighe Kuykendall
Adrienne Liu
Anand Louis
Gianluca Mardente
Wayne McAllister
Rohit Mediratta
Munish Mehta
Sameer Merchant
Joe Onisick
Ignacio Orozco
Venkatesh Pallipadi
Ayas Pani
Amit Patel
Maurizio Portolani
Pirabhu Raman
Alice Saiki
Christy Sanders
Enrico Schiattarella
Priyanka Shah
Pankaj Shukla
Michael Smith
Edward Swenson
Srinivas Tatikonda
Santhosh Thodupunoori
Sergey Timokhin
Muni Tripathi
Bobby Vandalore
Sunil Verma
Alok Wadhwa
Jay Weinstein
Yi Xue

We would also like to thank the Office of the CTO and Chief Architect for their hospitality while working in their
office space.
We are truly grateful to our book sprint facilitators Laia Ros and Adam Hyde for carrying us through this collaborative knowledge production process, and to our illustrator Henrik van Leeuwen, who took abstract ideas and turned them into clear visuals. Our first concern was how to bring together so many people from different sides of the business to complete a project that traditionally takes months. The book sprint team showed that this is possible, and it presents a new model for how we collaborate, extract knowledge and experience, and present it in a single source.

1.6 Book Writing Methodology

The Book Sprint (www.booksprints.net) methodology was used for writing this book. The Book Sprint methodology is an innovative style of cooperative and collaborative authorship. Book Sprints are strongly facilitated and leverage team-oriented inspiration and motivation to rapidly deliver large amounts of well-authored and reviewed content and incorporate it into a complete narrative in a short amount of time. By leveraging the input of many experts, the complete book was written in only five days, yet it involved hundreds of authoring hours and drew on thousands of hours of engineering experience, allowing for extremely high quality in a very short production time.
1.7 Who Should Read This Book?

Expected Audience

The intended audience for this book is anyone with a general need to understand how to operate and/or troubleshoot an ACI fabric. While operations engineers may benefit most from this content, the material included herein may be of use to a much wider audience, especially given modern industry trends toward continuous integration and development, along with the ever-growing need for agile, DevOps-oriented methodologies.
There are many elements in this book that explore topics outside the typical job responsibilities of network administrators. For example, the programmatic manipulation of policy models can be viewed as a development-oriented task, yet it has specific relevance to network configuration and function and takes a very different approach from traditional CLI-based interface configuration.

1.8 Organization of this Book

Section 1: Introduction to ACI Troubleshooting

The introduction covers basic concepts, terms and models while introducing the tools that will be used in troubleshooting. Also covered are the troubleshooting, verification and resolution methodologies used in later sections that cover the actual problems being documented.

Section 2: Sample Reference Topology

This section sets the baseline sample topology used throughout all of the troubleshooting exercises that are documented
later in the book. Logical diagrams are provided for the abstract policy elements (the endpoint group objects, the
application profile objects, etc) as well as the physical topology diagrams and any supporting documentation that
is needed to understand the focal point of the exercises. In each problem description in Section 3, references will
be made to the reference topology as necessary. Where further examination is required, the specific aspects of the
topology being examined may be re-illustrated in the text of the troubleshooting scenario.

Section 3: Troubleshooting Application Centric Infrastructure

The Troubleshooting ACI section goes through specific problem descriptions as it relates to the fabric. For each
iterative problem, there will be a problem description, a listing of the process, some verification steps, and possible
resolutions.
Chapter format: the chapters that follow in the Troubleshooting section document the various problems; verification of causes and possible resolutions are arranged in the following format.
Overview: Provides an introduction to the problem in focus by highlighting the following information:
• Theory and concepts to be covered
• Information of what should be happening
• Verification steps of a working state
Problem Description: The problem description is a high-level observation of the starting point for the troubleshooting actions to be covered. Example: a fabric node shows as "inactive" on the APIC when using the APIC CLI command acidiag fnvread.
Symptoms: Depending on the problem, various symptoms and their impacts may be observed. In this example, some
of the symptoms and indications of issues around an inactive fabric node could be:
• loss of connectivity to the fabric
• low health score
• system faults
• inability to make changes through the APIC
In some chapters, multiple symptoms could be observed for the same problem description that require different verification or resolution.
Verification: The logical set of steps to identify what is being observed will be indicated along with the appropriate
tools and output. Additionally, some information on likely causes and resolution will be included.

2 Application Centric Infrastructure

In the same way that humans build relationships to communicate and share their knowledge, computer networks are
built to allow for nodes to exchange data at ever increasing speeds and rates. The drivers for these rapidly growing
networks are the applications, the building blocks that consume and provide the data which are close to the heart of
the business lifecycle. The organizations tasked with nurturing and maintaining these expanding networks, nodes and
vast amounts of data, are critical to those that consume the resources they provide.
IT organizations have traditionally managed the conduits of this data as individual network devices, each configured and managed separately. In the effort to support an application, one or more teams of infrastructure specialists build and configure static infrastructure, including the following:
• Physical infrastructure (switches, ports, cables, etc.)
• Logical topology (VLANs, L2 interfaces and protocols, L3 interfaces and protocols, etc.)
• Access control configuration (permit/deny ACLs) for application integration and common services
• Quality of Service configuration
• Services integration (firewall, load balancing, etc.)
• Connecting application workload engines (VMs, physical servers, logical application instances)
Cisco seeks to innovate the way this infrastructure is governed by introducing new paradigms. Going from a network
of individually managed devices to an automated policy-based model that allows an organization to define the policy,
and the infrastructure to automate the implementation of the policy in the hardware elements, will change the way the
world communicates.
To this end, Cisco has introduced Application Centric Infrastructure, or ACI, as a holistic, systems-based approach to infrastructure management.
The design intent of ACI is to provide the following:
• Application-driven policy modeling
• Centralized policy management and visibility of infrastructure and application health
• Automated infrastructure configuration management
• Integrated physical and virtual infrastructure management
• Open interface to enable flexible software and ecosystem partner integration
• Seamless communications from any endpoint to any endpoint
There are multiple possible implementation options for an ACI fabric:
• Leveraging a network centric approach to policy deployment - in this case a full understanding of application
interdependencies is not critical, and instead the current model of a network-oriented design is maintained. This
can take one of two forms:
– L2 Fabric – Uses the ACI policy controller to automate provisioning of network infrastructure based on
L2 connectivity between connected network devices and hosts.
– L3 Fabric – Uses the ACI policy model to automate provisioning of network infrastructure based on L3 connectivity between network devices and hosts.
• Application-centric fabric – takes full advantage of all of the ACI objects to build out a flexible and completely
automated infrastructure including L2 and L3 reachability, physical machine and VM connectivity integration,
service node integration and full object manipulation and management.
• Implementations of ACI that take full advantage of the intended design from an application-centric perspective allow for end-to-end network automation spanning physical and virtual networks and network services integration.
All of the manual configuration and integration work detailed above is thus automated based on policy, making the infrastructure team's efforts more efficient.
Instead of manually configuring VLANs, ports and access lists for every device connected to the network, the policy
is created and the infrastructure itself resolves and provisions the relevant configuration to be provisioned on demand,
where needed, when needed. Conversely, when devices, applications or workloads detach from the fabric, the relevant
configuration can be de-provisioned, allowing for optimal network hygiene.
Cisco ACI follows a model-driven approach to configuration management. This model-based configuration is disseminated through the managed nodes using the concept of Promise Theory.
Promise Theory is a management model in which a central intelligence system declares a desired configuration “end-
state”, and the underlying objects act as autonomous intelligent agents that can understand the declarative end-state
and either implement the required change, or send back information on why it could not be implemented.
In ACI, the intelligent agents are purpose-built elements of the infrastructure that take an active part in its management
by the keeping of “promises”. Within promise theory, a promise is an agent’s declaration of intent to follow an intended
instruction defining operational behavior. This allows management teams to create an abstract “end-state” model and
the system to automate the configuration in compliance. With declarative end-state modeling, it is easier to build and
manage networks of all scale sizes with less effort.
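As a minimal illustration of this declarative approach (a sketch only; the APIC address, credentials and tenant name below are hypothetical placeholders), an administrator authenticates to the APIC REST API and posts a single desired-state object, leaving the fabric to resolve and provision whatever configuration that object implies:

    import requests

    APIC = "https://apic.example.com"   # hypothetical APIC address
    AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

    session = requests.Session()
    # Authenticate once; the session cookie carries the APIC token forward.
    session.post(APIC + "/api/aaaLogin.json", json=AUTH, verify=False)

    # Declare the desired end-state: a tenant object. The fabric, not the
    # administrator, resolves and renders the configuration this implies.
    desired_state = {"fvTenant": {"attributes": {"name": "ExampleTenant"}}}
    resp = session.post(APIC + "/api/mo/uni.json", json=desired_state, verify=False)
    print(resp.status_code, resp.text)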
Many new ideas, concepts and terms come with this coupling of ACI and Promise Theory. This book is not intended
to be a complete tutorial on ACI or Promise Theory, nor is it intended to be a complete operations manual for ACI, or
a complete dictionary of terms and concepts. Where possible, however, a base level of definitions will be provided,
accompanied by explanations. The goal is to provide common concepts, terms, models and fundamental features of
the fabric, then use that base knowledge to dive into troubleshooting methodology and exercises.
To read more information on Cisco’s Application Centric Infrastructure, the reader may refer to the Cisco website at
https://www.cisco.com/go/aci.

3 ACI Policy Model


• Abstraction Model
• Everything is an Object
• Relevant Objects and Relationships
• Hierarchical ACI Object Model and the Infrastructure
• Infrastructure as Objects
• Build object, use object to build policy, reuse policy
• REST API just exposes the object model
• Logical model, Resolved model, concrete model
• Formed and Unformed Relationships
• Declarative End State and Promise Theory

While the comprehensive policy model that ACI utilizes is broad, the goal of this chapter is to introduce the reader
to a basic level of understanding about the model, what it contains and how to work with it. The complete object
model contains a vast amount of information that represents a complete hierarchy of data center interactions, so it is
recommended that the reader take the time to review the many white papers available on cisco.com, or for the most
extensive information resource available, review the APIC Management Information Model Reference packaged with
the APIC itself.

3.1 Abstraction Model

ACI provides the ability to create a stateless definition of application requirements. Application architects think in terms of application components and the interactions between those components, not necessarily about networks, firewalls and other services. By abstracting away the infrastructure, application architects can build stateless policies and define not only the application, but also Layer 4 through 7 services and interactions within applications. Abstraction also means that the policy defining application requirements is no longer tied to traditional network constructs, and thus removes dependencies on the infrastructure and increases the portability of applications.
The application policy model defines application requirements, and based on the specified requirements, each device will instantiate a set of required changes. IP addresses become fully portable within the fabric, while security and forwarding are decoupled from any physical or virtual network attributes. Devices autonomously and consistently update the state of the network based on the configured policy requirements set within the application profile definitions.

3.2 Everything is an Object

The abstracted model utilized in ACI is object-oriented, and everything in the model is represented as an object, each
with properties relevant to that object. As is typical for an object-oriented system, these objects can be grouped,
classed, read, and manipulated, and objects can be created referencing other objects. These objects can reference
relevant application components as well as relationships between these components. The rest of this section will
describe the elements of the model, the objects inside, and their relationships at a high level.
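To make this concrete, the sketch below (assuming a reachable APIC; the address and credentials are hypothetical placeholders) reads back every object of the tenant class through the REST API. Each result is returned keyed by its class name and carries its own attributes, such as its name and distinguished name:

    import requests

    APIC = "https://apic.example.com"   # hypothetical APIC address
    session = requests.Session()
    session.post(APIC + "/api/aaaLogin.json",
                 json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}},
                 verify=False)

    # Query all objects of class fvTenant; every entry in "imdata" is an
    # object keyed by its class name, with its properties under "attributes".
    resp = session.get(APIC + "/api/class/fvTenant.json", verify=False)
    for mo in resp.json().get("imdata", []):
        for cls, body in mo.items():
            attrs = body["attributes"]
            print(cls, attrs["dn"], attrs["name"])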

3.3 Relevant Objects and Relationships

Within the ACI application model, the primary object that encompasses all of the objects and their relationships to
each other is called an Application Profile, or AP. Some readers are certain to think, “a 3-tier app is a unicorn,” but in
this case, the idea of a literal 3-tier application works well for illustrative purposes. Below is a diagram of an AP shown
as a logical structure for a 3-tier application that will serve well for describing the relevant objects and relationships.
From left to right, in this 3-tier application there is a group of clients that can be categorized and grouped together.
Next there is a group of web servers, followed by a group of application servers, and finally a group of database servers.
There exist relationships between each of these independent groups. For example, from the clients to the application
servers, there are relationships that can be described in the policy which can include things such as QoS, ACLs,
Firewall and Server Load Balancing service insertion. Each of these things is defined by managed objects, and the
relationships between them are used to build out the logical model, then resolve them into the hardware automatically.
Endpoints are objects that represent individual workload engines (e.g., virtual or physical machines). The following
diagram emphasizes which elements in the policy model are endpoints, which include web, application and database
virtual machines.

These endpoints are logically grouped together into another object called an Endpoint Group, or EPG. The following
diagram highlights the EPG boundaries in the diagram, and there are four EPGs - Clients, Web servers, Application
servers, and Database servers.
There are also Service Nodes that are referenceable objects, either physical or virtual, such as Firewalls, and Server
Load Balancers (or Application Delivery Controllers/ADC), with a firewall and load balancer combination chained
between the client and web EPGs, a load balancer between the web and application EPGs, and finally a firewall
securing traffic between the application and database EPGs.

A group of Service Node objects can be logically chained into a sequence of services represented by another object
called a Service Graph. A Service Graph object provides compound service chains along the data path. The diagram
below shows where the Service Graph objects are inserted into a policy definition, emphasizing the grouped service
nodes in the previous diagram.

With objects defined to express the essential elements of the application, it is possible to build relationships between
the EPG objects, using another object called a Contract. A Contract defines what provides a service, what consumes a
service and what policy objects are related to that consumption relationship. In the case of the relationship between the
clients and the web servers, the policy defines the communication path and all related elements of that. As shown in the
details of the example below, the Web EPG provides a service that the Clients EPG consumes, and that consumption
would be subject to a Filter (ACL) and a Service Graph that includes Firewall inspection services and Server Load
Balancing.
A concept to note is that ACI fabrics are built on the premise of a whitelist security approach, which allows the ACI
fabric to function as a semi-stateful firewall fabric. This means communication is implicitly denied, and that one must
build a policy to allow communication between objects or they will be unable to communicate. In the example above,
with the contract in place as highlighted, the Clients EPG can communicate with the Web EPG, but the Clients cannot
communicate with the App EPG or DB EPGs. This is not explicit in the contract, but native to the fabric’s function.
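As a hedged illustration (not taken from the reference topology), the sketch below shows how such a provider/consumer relationship might be expressed as a REST payload pushed to the APIC with Python. The tenant, application profile, EPG and contract names are placeholders, and the class and attribute names (fvTenant, fvAp, fvAEPg, vzBrCP, fvRsProv, fvRsCons, tnVzBrCPName) should be confirmed against the APIC Management Information Model Reference for the release in use.
# Hedged sketch: a tenant whose Web EPG provides a contract that the Clients
# EPG consumes. All names, the APIC address and the token are placeholders.
import json
import requests

apic = "https://<APIC IP>"                      # placeholder APIC address
cookies = {"APIC-cookie": "<token>"}            # session token from a prior aaaLogin

payload = {
    "fvTenant": {
        "attributes": {"name": "ExampleTenant"},
        "children": [
            {"vzBrCP": {"attributes": {"name": "ClientsToWeb-Con"}}},
            {"fvAp": {
                "attributes": {"name": "3TierApp"},
                "children": [
                    {"fvAEPg": {
                        "attributes": {"name": "Web"},
                        "children": [
                            # Web EPG provides the contract
                            {"fvRsProv": {"attributes": {"tnVzBrCPName": "ClientsToWeb-Con"}}}
                        ]
                    }},
                    {"fvAEPg": {
                        "attributes": {"name": "Clients"},
                        "children": [
                            # Clients EPG consumes the contract
                            {"fvRsCons": {"attributes": {"tnVzBrCPName": "ClientsToWeb-Con"}}}
                        ]
                    }}
                ]
            }}
        ]
    }
}

resp = requests.post(apic + "/api/mo/uni.json", data=json.dumps(payload),
                     cookies=cookies, verify=False)   # lab use only
print(resp.status_code)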

3.4 Hierarchical ACI Object Model and the Infrastructure

The APIC manages a distributed managed information tree (dMIT). The dMIT discovers, manages, and maintains the
whole hierarchical tree of objects in the ACI fabric, including their configuration, operational status, and accompanying
statistics and associated faults.
The Cisco ACI object model structure is organized around a hierarchical tree model, called a distributed Management
Infrastructure Tree (dMIT). The dMIT is the single source of truth in the object model, and is used in discovery,
management and maintenance of the hierarchical model, including configuration, operational status and accompanying
statistics and faults.
As mentioned before, within the dMIT, the Application Profile is the modeled representation of an application, network
characteristics and connections, services, and all the relationships between all of these lower-level objects. These
objects are instantiated as Managed Objects (MO) and are stored in the dMIT in a hierarchical tree, as shown below:
All of the configurable elements shown in this diagram are represented as classes, and the classes define the items that
get instantiated as MOs, which are used to fully describe the entity including its configuration, state, runtime data,
description, referenced objects and lifecycle position.
Each node in the dMIT represents a managed object or group of objects. These objects are organized in a hierarchical
structure, similar to a structured file system with logical object containers like folders. Every object has a parent, with
the exception of the top object, called “root”, which is the top of the tree. Relationships exist between objects in the
tree.

Objects include a class, which describes the type of object such as a port, module or network path, VLAN, Bridge Do-
main, or EPG. Packages identify the functional areas to which the objects belong. Classes are organized hierarchically
so that, for example, an access port is a subclass of the class Port, or a leaf node is a subclass of the class Fabric Node.
Managed Objects can be referenced through relative names (Rn) that consist of a prefix matched up with a name
property of the object. As an example, a prefix for a Tenant would be “tn” and if the name would be “Cisco”, that
would result in a Rn of “tn-Cisco” for a MO.
Managed Objects can also be referenced via Distinguished Names (Dn), which is the combination of the scope of the
MO and the Rn of the MO, as mentioned above. As an example, if there is a tenant named “Cisco” that is a policy
object in the top level of the Policy Universe (polUni), that would combine to give us a Dn of “uni/tn-Cisco”. In
general, the DN can be related to a fully qualified domain name.
Because of the hierarchical nature of the tree, and the attribute system used to identify object classes, the tree can be
queried in several ways for MO information. Queries can be performed on an object itself through its DN, on a class
of objects such as switch chassis, or on a tree-level, discovering all members of an object.
The structure of the dMIT provides easy classification of all aspects of the relevant configuration, as the application
objects are organized into related classes, as well as hardware objects and fabric objects into related classes that
allow for easy reference, reading and manipulation from individual object properties or multiple objects at a time
by reference to a class. This allows configuration and management of multiple similar components as efficiently as
possible with a minimum of iterative static configuration.

3.5 Infrastructure as Objects

ACI uses a combination of Cisco Nexus 9000 Series Switch hardware and Application Policy Infrastructure Controllers
for policy-based fabric configuration and management. These infrastructure components can be integrated with Cisco
and third-party service products to automatically provision end-to-end network solutions.
As shown in the following diagram, the logical policy model is built through manipulation of the dMIT, either through
direct GUI, programmatic API, or through traditional CLI methods. Once the policy is built, the intention of the policy
gets resolved into an abstract model, then is conferred to the infrastructure elements. The infrastructure elements
contain specific Cisco ASIC hardware that make them equipped, purpose-built agents of change that can understand
the abstraction that the policy controller presents to it, and automate the relevant concrete configuration based on the
abstract model. This configuration gets applied when an endpoint connects to the fabric and first transmits traffic.
The purpose-built hardware providing the intelligent resolution of policy configuration is built on a spine-leaf archi-
tecture providing consistent network forwarding and deterministic latency. The hardware is also able to normalize the
encapsulation coming in from multiple different endpoints regardless of the type of connectivity.
Whether an endpoint connects to the fabric with an overlay encapsulation (such as VXLAN), uses physical port connectivity,
or uses VLAN 802.1Q tagging, the fabric can accept that traffic, de-encapsulate it, then re-encapsulate it to VXLAN for
fabric forwarding, then de-encapsulate and re-encapsulate it to whatever the destination expects to see. This gateway
function of encapsulation normalization happens at optimized hardware speeds in the fabric and creates no additional
latency or software gateway penalty to perform the operation outside of the fabric.
In this manner, if one VM is running on VMware ESX utilizing VXLAN, another VM is running on Hyper-V using VLAN
802.1Q encapsulation, and a physical server is running a bare metal database workload on top of Linux, it is possible to
configure policy to allow each of these to communicate directly to each other without having to bounce to any separate
gateway function.
This automated provisioning of end-to-end application policy provides consistent implementation of relevant connec-
tivity, quality measures, and security requirements. This model is extensible, and has the potential capability to be
extended into compute and storage for complete application policy-based provisioning.
The automation of the configuration takes the Logical model, and translates it into other models, such as the Resolved
model and the Concrete model (Covered later in this chapter). The automation process resolves configuration infor-
mation into the object and class-based configuration elements that then get applied based on the object and class. As
an example, if the system is applying a configuration to a port or a group of ports, the system would likely utilize a
class-based identifier to apply configuration broadly without manual iteration. As an example, a class is used to iden-
tify objects like cards, ports, paths, etc; port Ethernet 1/1 is a member of class port and a type of port configuration,
such as an access or trunk port is a subclass of a port. A leaf node or a spine node is a subclass of a fabric node, and
so forth.
The types of objects and relationships of the different networking elements within the policy model can be seen in the
diagram below. Each of these elements can be managed via the object model being manipulated through the APIC,
and each element could be directly manipulated via REST API.

3.6 Build object, use object to build policy, reuse policy

The inherent model of ACI is built on the premise of object structure, reference and reuse. In order to build an AP,
one must first create the building blocks with all the relevant information for those objects. Once those are created,
it is possible to build other objects referencing the originally created objects as well as reuse other objects. As an
example, it is possible to build EPG objects, use those to build an AP object, and reuse the AP object to deploy to
different tenant implementations, such as a Development Environment AP, a Test Environment AP, and a Production
Environment AP. In the same fashion, an EPG used to construct a Test AP may later be placed into a Production AP,
accelerating the time to migrate from a test environment into production.

3.7 REST API just exposes the object model

REST stands for Representational State Transfer, and is an architectural style for direct object manipulation via
HTTP-based operations.
The uniform ACI object model places clean boundaries between the different components that can be read or manip-
ulated in the system. When an object exists in the tree, whether it is an object that was derived from discovery (such
as a port or module) or from configuration (such as an EPG or policy graph), the object is exposed through
the REST API via a Uniform Resource Identifier (URI). The structure of the REST API calls is shown below with a
couple of examples.

The general structure of the REST API commands is seen at the top. Below the general structure are two specific examples
of what can be done with this structured URI.
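Since the referenced figure is not reproduced here, the following is a hedged reconstruction of the general URI shape and two example queries, based on the commonly documented APIC REST API conventions; the APIC address and the tenant name are placeholders.
# Hedged reconstruction of the general REST URI structure; the APIC address
# and the tenant name "Cisco" are placeholders.
#
#   http(s)://<APIC IP>/api/{mo|class}/{dn|className}.{xml|json}?[options]
#
dn_query    = "https://<APIC IP>/api/mo/uni/tn-Cisco.json"    # one MO, addressed by its Dn
class_query = "https://<APIC IP>/api/class/fvTenant.json"     # all MOs of class fvTenant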

3.8 Logical model, Resolved model, concrete model

Within the ACI object model, there are essentially three stages of implementation of the model: the Logical Model,
the Resolved Model, and the Concrete Model.
The Logical Model is the logical representation of the objects and their relationships. The AP that was discussed
previously is an expression of the logical model. This is the declaration of the “end-state” expression that is desired
when the elements of the application are connected and the fabric is provisioned by the APIC, stated in high-level
terms.
The Resolved Model is the abstract model expression that the APIC resolves from the logical model. This is essentially
the elemental configuration components that would be delivered to the physical infrastructure when the policy must
be executed (such as when an endpoint connects to a leaf).
The Concrete Model is the actual in-state configuration delivered to each individual fabric member based on the
resolved model and the Endpoints attached to the fabric.
In general, the logical model should be the high-level expression of what exists in the resolved model, which should be
present on the concrete devices as the concrete model expression. If there is any gap in these, there will be inconsistent
configurations.

3.9 Formed and Unformed Relationships

In creating objects and forming their relationships within the ACI fabric, a relationship is expressed when an object is
a provider of a service, and another object is a consumer of that provided service. If a relationship is formed and one
side of the service is not connected, the relationship would be considered to be unformed. If a consumer exists with
no provider, or a provider exists with no consumer, this would be an unformed relationship. If both a consumer and
provider exist and are connected for a specific service, that relationship is fully formed.

3.10 Declarative End State and Promise Theory

For many years, infrastructure management has been built on a static and inflexible configuration paradigm. In terms
of theory, traditional configuration via traditional methods (CLI configuration of each box individually), where config-
uration must be done on every device for every possible eventuality before anything connects, is termed an
Imperative Model of configuration. In this model, because configuration is built in advance for every eventuality, the
trend is to overbuild the infrastructure configuration to a fairly significant degree. When this is done, fragility and
complexity increase with every eventuality included.
Similar to what is illustrated above, if configuration must be made on a single port for an ESXi host, it must be
configured to trunk all information for all of the possible VLANs that might get used by a vSwitch or DVS on the host,
whether or not a VM actually exists on that host. On top of that, additional ACLs may need to be configured for all
possible entries on that port, VLAN or switch to allow/restrict traffic to/from the VMs that might end up migrating to
that host/segment/switch. That is a fairly heavyweight set of tasks for just some portions of the infrastructure, and that
continues to build as peripheral aspects of this same problem are evaluated. As these configurations are built, hardware
resource tables are filled up even if they are not needed for actual forwarding. Also reflected are configurations on
the service nodes for eventualities that can build and grow, many times being added but rarely ever removed. This
eventually can grow into a fairly fragile state that might be considered a form of firewall house of cards. As these
building blocks are built up over time and a broader perspective is taken, it becomes difficult to understand which
ones can be removed without the whole stack tumbling down. This is one of the possible things that can happen when
things are built on an imperative model.
On the other hand, a declarative model allows a system to describe the “end-state” expectations system-wide as depicted
above, and allows the system to utilize its knowledge of the integrated hardware and automation tools to execute the
required work to deliver the end state. Imagine an infrastructure system where statements of desire can be made, such
as “these things should connect to those things and let them talk in this way”, and the infrastructure converges on that
desired end state. When that configuration is no longer needed, the system knows this and removes that configuration.

Promise Theory is built on the principles that allow for systems to be designed based on the declarative model. It’s
built on voluntary execution by autonomous agents which provide and consume services from one another based on
promises.
As the IT industry continues to build and scale more and more, information and systems are rapidly reaching break-
ing points where scaled-out infrastructure cannot stretch to the hardware resources without violating the economic
equilibrium, nor scale-in the management without integrated agent-based automation. This is why Cisco ACI, as a
system built on promise theory, is a purpose-built system for addressing scale problems that are delivery challenges
with traditional models.

4 Troubleshooting Tools
• APIC Access Methods
– GUI
– API
– CLI
* CLI MODES:
* Navigating the CLI:
• Programmatic Configuration (Python)
• Fabric Node Access Methods
• Exporting information from the Fabric
• External Data Collection – Syslog, SNMP, Call-Home
• Health Scores
• Atomic Counters

This section is intended to provide an overview of the tools that could be used during troubleshooting efforts on an
ACI fabric. This is not intended to be a complete reference list of all possible tools, but rather a high level list of the
most common tools used.

4.1 APIC Access Methods

There are multiple ways to connect to and manage the ACI fabric and object model. An administrator can use the
built-in Graphical User Interface (GUI), programmatic methods using an Application Programming Interface (API),
or standard Command Line Interface (CLI). While there are multiple ways to access the APIC, the APIC is still the
single point of truth. All of these access methods - the GUI, CLI and REST API - are just interfaces resolving to the
API, which is the abstraction of the object model managed by the DME.

GUI

One of the primary ways to configure, verify and monitor the ACI fabric is through the APIC GUI. The APIC GUI is
a browser-based HTML5 application that provides a representation of the object model and would be the most likely
default interface that people would start with. The GUI is accessible through a browser at the URL https://<APIC
IP>. The GUI does not expose the underlying policy object model. One of the available tools for browsing the MIT in
addition to leveraging the CLI is called “visore” and is available on the APIC and nodes. Visore supports querying by
class and object, as well as easily navigating the hierarchy of the tree. Visore is accessible through a browser at the
URL https://<APIC IP>/visore.html
API

The APIC supports REST API connections via HTTP/HTTPS for processing of XML/JSON documents for rapid
configuration. The API can also be used to verify the configured policy on the system. This is covered in detail in the
REST API chapter.
A common tool used to query the system is “Postman”, an application that runs in the Google Chrome (tm) web
browser.

CLI

The CLI can be used in configuring the APIC. It can be used extensively in troubleshooting the system, as it allows
real-time visibility of the configuration, faults, and statistics of the system, and it can alternatively be used as an object model browser.
Typically the CLI is accessed via SSH with the appropriate administrative level credentials. The APIC CLI can be
accessed as well through the CIMC KVM (Cisco Integrated Management Console Keyboard Video Mouse interface).
CLI access is also available for troubleshooting the fabric nodes either through SSH or the console.
The APIC and fabric nodes are based on a Linux kernel but there are some ACI specific commands and modes of
access that will be used in this book.

CLI MODES:

APIC: The APIC has fundamentally only one CLI access mode. The commands used in this book assume
administrative-level access to the APIC by use of the admin user account.

Fabric Node: The switch running ACI software has several different modes that can be used to access different
levels of information on the system:
• CLI - The CLI will be used to run NX-OS and Bash shell commands to check the concrete models on the switch.
For example show vlan, show endpoint, etc. In some documentation this may have been referred to as Bash,
iBash, or iShell.
• vsh_lc - This is the line card shell and it will be used to check line card processes and forwarding tables specific
to the Application Leaf Engine (ALE) ASIC.
• Broadcom Shell - This shell is used to view information on the Broadcom ASIC. The shell will not be covered
as it falls outside the scope of this book, as it is assumed troubleshooting at a Broadcom Shell level should be
performed with the assistance of the Cisco Technical Assistance Center (TAC).
• Virtual Shell (VSH): Provides deprecated NX-OS CLI shell access to the switch. This mode can provide output on a switch in
ACI mode that could be inaccurate. This mode is not recommended, not supported, and commands that provide
useful output should be available from the normal CLI access mode.

Navigating the CLI:

There are some common commands as well as some unique differences from what might be seen in NX-OS on a fabric node.
On the APIC, the command structure has common commands as well as some unique differences compared to Linux
Bash. This section will present a highlight of a few of these commands but is not meant to replace existing external
documentation on Linux, Bash, ACI, and NX-OS.
Common Bash commands: When using the CLI, some basic understanding of Linux and Bash is necessary. These
commands include:
• man – prints the online manual pages. For example, man cd will display what the command cd does
• ls – list directory contents
• cd – change directory
• cat – print the contents of a file
• less – simple navigation tool for displaying the contents of a file
• grep – print out a matching line from a file
• ps – show current running processes – typically used with the options ps -ef
• netstat – display network connection status. netstat -a will display active connections and the ports on which the system
is listening
• ip route show – displays the kernel route table. This is useful on the APIC but not on the fabric node
• pwd – print the current directory

Common CLI commands: Beyond the normal NX-OS commands on a fabric node, there are several more commands that are
specific to ACI. Some CLI commands referenced in this guide are listed below:
• acidiag - Specifically acidiag avread and acidiag fnvread are two common commands to check the status of
the controllers and the fabric nodes
• techsupport – CLI command to collect the techsupport files from the device
• attach – From the APIC, opens up an SSH session to the named node. For example, attach rtp_leaf1
• iping/itraceroute – Fabric node command used in place of ping/traceroute which provides similar functionality
against a fabric device address and VRF. Note that the Bash ping and traceroute commands do work but are
effective only for the switch OOB access.

Help: When navigating around the APIC CLI, there are some differences when compared to NX-OS.
• <ESC><ESC> - similar to NX-OS “?” to get a list of command options and keywords.
• <TAB> - autocomplete of the command. For example, show int<TAB> will complete to show interface.
• man <command> - displays the manual and usage output for the command.

4.2 Programmatic Configuration (Python)

A popular modern programming language is Python, which provides simple object-oriented semantics in interpreted
easy-to-write code. The APIC can be configured through the use of Python through an available APIC Software
Development Kit (SDK) or via the REST API.
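As a minimal, hedged sketch of the REST approach from Python (without the SDK), the example below uses the widely available requests library to authenticate against the APIC login endpoint and then read every tenant object; the APIC address and credentials are placeholders, and certificate verification is disabled only for lab convenience.
# Minimal sketch: log in to the APIC REST API and list tenants.
# The APIC address, username and password are placeholders.
import requests

apic = "https://<APIC IP>"
session = requests.Session()
session.verify = False                     # lab use only

# aaaLogin returns a token; requests.Session keeps the APIC-cookie automatically
login = {"aaaUser": {"attributes": {"name": "admin", "pwd": "<password>"}}}
session.post(apic + "/api/aaaLogin.json", json=login).raise_for_status()

# Class-level query: every fvTenant object in the fabric
for mo in session.get(apic + "/api/class/fvTenant.json").json()["imdata"]:
    print(mo["fvTenant"]["attributes"]["dn"])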

4.3 Fabric Node Access Methods

CLI In general, most work within ACI will be done through the APIC using the access methods listed above. There
are, however, times in which one must directly access the individual fabric nodes (switches). Fabric nodes can be
accessed via SSH using the fabric administrative level credentials. The CLI is not used for configuration but is used
extensively for troubleshooting purposes. The fabric nodes have a Linux shell along with a CLI interpreter to run show
level commands. The CLI can be accessed through the console port as well.
Faults The APIC automatically detects issues on the system and records these as faults. Faults are displayed in the GUI
until the underlying issue is cleared. After faults are cleared, they are retained until they are acknowledged or until the
retaining timer has expired. The fault is composed of system parameters, which are used to indicate the reason for the
failure, and where the fault is located. In some cases, fault messages link to help that explains possible actions.

4.4 Exporting information from the Fabric

Techsupport The Techsupport files in ACI capture application logs, system and services logs, version information,
faults, event and audit logs, debug counters and other command output, then bundle all of that into one file on the
system. This is presented in a single compressed file (tarball) that can be exported to an external location for off-
system processing. Techsupport is similar to functionality available on other Cisco products that allow for a simple
collection of copious amounts of relevant data from the system. This collection can be initiated through the GUI or
through the CLI using the command techsupport.
Core Files A process crash on the ACI fabric will generate a core file, which can be used to determine the reason
for why the process crashed. This information can be exported from the APIC for decoding by Cisco support and
engineering teams.

4.5 External Data Collection – Syslog, SNMP, Call-Home

There are a number of external collectors that can be configured to gather a variety of system data. The call-home
feature can be configured to relay information via emails through an SMTP server, for a network engineer or to Cisco
Smart Call Home to generate a case with the TAC.

4.6 Health Scores

The APIC manages and automates the underlying forwarding components and Layer 4 to Layer 7 service devices.
Using visibility into both the virtual and physical infrastructure, as well as the knowledge of the application end-to-
end based on the application profile, the APIC can calculate an application health score. This health score represents
the network health of the application across virtual and physical resources, including Layer 4 to Layer 7 devices. The
score includes failures, packet drops, and other indicators of system health.
The health score provides enhanced visibility on both application and tenant levels. The health score can drive further
value by being used to trigger automated events at specific thresholds. This ability allows the network to respond
automatically to application health by making changes before users are impacted.
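As a hedged example of reading a health score programmatically, the sketch below asks the APIC to include health objects in the response for a single tenant; the tenant name is a placeholder, the query option follows the commonly documented rsp-subtree-include=health form, and on some releases rsp-subtree=full may also be required.
# Hedged sketch: read the health rolled up for one tenant. Log in first via
# aaaLogin as shown in the Python section earlier; the tenant name is a placeholder.
import requests

apic = "https://<APIC IP>"
session = requests.Session()               # authenticated session assumed
session.verify = False                     # lab use only

url = apic + "/api/mo/uni/tn-Cisco.json?rsp-subtree-include=health"
for mo in session.get(url).json()["imdata"]:
    print(mo)                              # health objects appear in the returned subtree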

4.7 Atomic Counters

Atomic counters can be configured to monitor endpoint/EPG to endpoint/EPG traffic within a tenant for identifying
and isolating traffic loss. Once configured, the packet counters on a configured policy are updated every 30 seconds.
Atomic counters are valid when endpoints reside on different leaf nodes.

5 Troubleshooting Methodology

• Overall Methodology
5.1 Overall Methodology

Troubleshooting is the systematic process used to identify the cause of a problem. The problem to be addressed is
determined by the difference between how some entity (function, process, feature, etc.) should be working versus how
it is working. Once the cause is identified, the appropriate actions can be taken to either correct the issue or mitigate
the effects: the latter is sometimes referred to as a workaround.
Initial efforts in the process focus around understanding more completely the issue that is occurring. Effective trou-
bleshooting should be based on an evidence-driven method, rather than a symptomatic level exploration. This can be
done by asking the question:

“What evidence do we have...?”

The intent of this question is to move towards an observed factual evidence-driven method where the evidence is
generally taken from the system where the problem is observed.
Troubleshooting is an iterative process attempting to isolate an issue to the point that some action can be taken to have
a positive effect. Often this is a multi-step process which moves toward isolating the issue. For example, in deploying
an application on a server attached to an ACI fabric, a possible problem observed could be that the application does
not seem to respond from a client on the network. The isolation steps may look something like this:

Troubleshooting is usually not a simple linear path, and in this example it is possible that a troubleshooter may have
observed the system fault earlier in the process and started at that stage.
In this example, information related to the problem came from several data points in the system. These data points can
be part of a linear causal process or can be used to better understand the scope and various points and conditions that
better define the issue. How these data points are collected is defined by three characteristics:
• WHAT: What information is being collected
• WHERE: Where on the system is the information being collected
• HOW: The method used in collecting the information
For example, the state of a fabric Ethernet interface can be gathered through the CLI on the leaf in a couple
of different ways. The same information can also be gathered from the APIC, either through the GUI or through a REST API call.
When troubleshooting, it is important to understand where else relevant information is likely to come from to build a
better picture of what is the issue.
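As one hedged illustration of gathering the same data from a different place, the sketch below reads physical interface objects of a single leaf through the APIC REST API instead of the leaf CLI; the pod/node in the DN and the class name l1PhysIf are assumptions based on the commonly documented object model and should be verified against the Management Information Model Reference.
# Hedged sketch: read interface objects for leaf node-101 through the APIC
# rather than the leaf CLI. Log in first via aaaLogin as shown earlier.
import requests

apic = "https://<APIC IP>"
session = requests.Session()               # authenticated session assumed
session.verify = False                     # lab use only

url = (apic + "/api/mo/topology/pod-1/node-101/sys.json"
       "?query-target=subtree&target-subtree-class=l1PhysIf")
for mo in session.get(url).json()["imdata"]:
    attrs = mo["l1PhysIf"]["attributes"]
    print(attrs["id"], attrs["adminSt"])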
6 Sample Reference Topology

• Physical Fabric Topology


• Logical Application Topology

6.1 Physical Fabric Topology

For a consistent frame of reference, a sample reference topology has been deployed, provisioned and used throughout
the book. This ensures a consistent reference for the different scenarios and troubleshooting exercises.
This section explores the different aspects of the reference topology, from the logical application view to the physical
fabric and any supporting details that will be used throughout the troubleshooting exercises. Each individual section
will call out the specific components that have been focused on so the reader does not have to refer back to this section
in every exercise.
The topology includes a Cisco ACI Fabric composed of three clustered Cisco APIC Controllers, two Nexus 9500 spine
switches, and three Nexus 9300 leaf switches. The APICs and Nexus 9000 switches are running the release that was
current on www.cisco.com at the time of the initial version of this book: APIC version 1.0(1k) and Nexus ACI-mode version
11.0(1d).
The fabric is connected to both external Layer 2 and Layer 3 networks. For the external Layer 2 network, the connec-
tion used a pair of interfaces aggregated into a port channel connecting to a pair of Cisco Nexus 7000 switches that are
configured as a Virtual Port Channel. The connection to the external Layer 3 network is individual links on each leaf
to each Nexus 7000.
Each leaf also contains a set of connections to host devices. Host devices include a directly connected Cisco UCS
C-Series rack server and a UCS B-Series Blade Chassis connected via a pair of fabric interconnects. All servers are
running virtualization hypervisors. The blade servers are virtualized using VMware ESX, with some of the hosts
as part of a VMware Distributed Virtual Switch, and some others as virtual leafs that are part of the Cisco ACI Application
Virtual Switch. The guest virtual machines on the hosts are a combination of traditional operating systems and virtual
network appliances such as virtual Firewalls and Load Balancers.
6.2 Logical Application Topology

To maintain a consistent reference throughout the book, an Application Profile (AP) was built for a common 3-tier
application that was used for every troubleshooting sample. This AP represents the logical model of the configuration
that was deployed to the fabric infrastructure. The AP includes all the information required for application connectivity
and policy (QoS, security, SLAs, Layer 4-7 services, logging, etc.).
This particular AP is a logical model built on a common application found in data centers, which includes a front-end
web tier, a middleware application tier, and a back-end database tier. As the diagram illustrates, the flow of traffic
would be left to right, with client connections coming in to the web tier, which communicates with the app tier, which
then communicates to the database tier, and returns right to left in reverse fashion.
7 Troubleshooting

• Naming Conventions
– Overview
– Suggested Naming Templates

7.1 Naming Conventions

Overview

Logical thinking and clear communication is the champion of the troubleshooting process. This chapter presents
some recommended practices for naming of managed objects in the ACI policy model to provide clean and logical
organization and clarity in the object’s reference. As the management information tree is navigated during policy
creation or inspection (during troubleshooting), consistency and meaningful context become extremely helpful.
Effective troubleshooting with the ACI environment does require some knowledge of the ACI policy model. Many
of the objects within this policy are unbounded fields that are left open to the administrator to name. Having a
predefined and consistent methodology for deriving these object names can provide some clarity, which can greatly
aid in configuration and, more importantly, troubleshooting. By using names which are descriptive of the purpose and
scope in the tree relevant to the MO, easier identification is achieved for the various MOs, their place in the tree, and
their use/relationships at a glance. This process of descriptive naming is very helpful in delivering context at a glance.
A good example for this type of structured naming is configuring a VLAN pool to be used for a Cisco Application
Virtual Switch (AVS) deployment and naming it “VLANsForAVS”. The name identifies the object, and what it is
used for. Exploring this situation within the context of a real life situation is helpful. Take for example entering a
troubleshooting situation after the environment has been configured. Ideally, it would be possible to see an object
when viewing the policy model through Visore or the CLI and know by the name “VLANsForAVS” that it is a VLAN
pool for an AVS deployment. The benefits of this naming methodology are clear.

Suggested Naming Templates

Below are some suggested templates for naming the various MOs in the fabric with an explanation of how each was
constructed.
When naming Attachable Entity Profiles, it is good to define in the name the resource type, and concatenate a suffix
of -AEP behind it ([Resource-Type]-AEP). Examples:
• DVS-AEP describes an AEP for use with a VMware DVS
• UCS-AEP describes an AEP for use when connecting a Cisco Unified Computing System (UCS)
• L2Outxxx-AEP describes an AEP that gets used when connecting to an external device via an L2Out connection
• L3Outxxx-AEP describes an AEP that gets used when connecting to an external device via an L3Out connection
• vSwitch-AEP describes an AEP used when connecting to a standard vSwitch.
Contracts are used to describe communications that are allowed between EndPoint Groups (EPGs). When creating
contracts, it’s a good idea to reference which objects are talking and the scope of relevance in the fabric. This could be
defined in the format of [SourceEPG]To[DestinationEPG]-[Scope]Con, which includes the from and to EPGs as well
as the scope of application for the contract. Examples:
• WebToApp-GblCon describes a globally scoped contract object and is used to describe communications between
the Web EPG and the App EPG.
• AppToDB-TnCon describes a contract scoped to a specific tenant that describes communications between the
App EPG and the DB EPG within that specific tenant.
Contracts that deal with some explicit communications protocol function (such as allowing ICMP or denying some
specific protocol), can be given a name based on the explicit reference and scope. This could be defined in the format
of [ExplicitFunction]-[Scope]Con which indicates the protocol or service to be allowed as well as the scope for the
placement of the contract. Some examples of explicit contracts might include the following:
• ICMPAllow-CtxCon is a contract scoped to a specific context that allows ICMP
• HTTPDeny-ApCon is a contract scoped to an application that denies HTTP protocol
Contracts are compound objects that reference other lower-level objects. A Contract references a subject, which
references a filter that has multiple filter entries. In order to maintain good naming consistency, similar naming
structure should be followed. A Subject should keep to a naming convention like [RuleGroup]-[Direction]Sbj, as seen
in these examples:
• AppTraffic-BiSbj names a subject that defines bidirectional flows of a specific applications traffic
• WebTraffic-UniSbj names a subject that defines web traffic in a single direction
A Filter should have a naming convention structure of [ResourceName]-flt, such as:
• SQL-flt names a filter that contains entries to allow communications for an SQL server
• Exchange-flt names a filter that contains entries that allows communications for an Exchange server
Filter Entries should follow a structure like [ResourceName]-[Service] such as:
• Exchange-HTTP might be the name of a filter entry that allows HTTP service connections to an exchange server
(such as for OWA connections)
• VC-Mgmt might name a filter entry allowing management connections to a VMware vCenter server
• SQL-CIMC is a name for a filter entry that allows connections to the CIMC interface on a SQL server running
on a standalone Cisco UCS Rack mount server.
If all of these are put together the results might look like this:
AppToDB-TnCon references DBTraffic-BiSbj with filter SQL-flt, which has an entry such as SQL-data. This combined
naming chain can almost be read as common text, such as “this contract to allow database traffic in both directions
that is filtered to only allow SQL data connections”.
Interface policy naming should follow some characteristics, such as:
• Link Level structure - [Speed][Negotiation], Ex: 1GAuto, 10GAuto - directly describes the speed and negotia-
tion mode.
• CDP Interface configuration policy - explicit naming: EnableCDP / DisableCDP
• LLDP Interface configuration policy - explicit naming: EnableLLDP / DisableLLDP
When grouping interface policies, it’s good to structure naming based on the interface type and its use like
[InterfaceType]For[Resource-Type], such as:
• PCForDVS names a policy that describes a portchannel used for the uplinks from a DVS
• VPCForUCS names a virtual portchannel for connecting to a set of UCS Fabric Interconnects
• UplinkForvSwitch names a single port link connecting to a standard vSwitch.
• PCForL3Out names a portchannel connecting to an external L3 network
Interface profiles naming should be relative to the profile’s use, such as IntsFor[Resource-Type]. Examples:
• IntsForL3Outxxx names an interface profile for connecting to an external L3 network
• IntsForUCS names an interface profile for connecting to a UCS system
• IntsForDVS names an interface profile for connecting to a DVS running on a VMware host
Switch Selectors should be named using a structure like LeafsFor[Resource-Type]. Some examples:
• LeafsForUCS switch selector policy to group leafs that are used for connecting UCS
• LeafsForDVS switch selector policy that might be used to group leafs used for DVS connections
• LeafsForL3Out policy that might be used to group leafs for external L3 connections
And when creating VLAN pools, structure of VLANsFor[Resources-Type] could produce:
• VLANsForDVS names a VLAN pool for use with DVS-based endpoint connections
• VLANsForvSwitches names a VLAN pool for use with vSwitch-based endpoint connections
• VLANsForAVS names a VLAN pool for use with AVS-based endpoint connections
• VLANsForL2Outxxx names a VLAN pool used with L2 external connections. VLANsForL3Outxxx names a
VLAN pool used with L3 external connections.
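To show how mechanical these templates can be, below is a small hedged Python sketch of hypothetical helper functions (not part of any ACI tool) that generate names following the conventions above.
# Hypothetical helpers that emit names following the templates above.
def contract_name(src_epg, dst_epg, scope):
    """[SourceEPG]To[DestinationEPG]-[Scope]Con"""
    return "{0}To{1}-{2}Con".format(src_epg, dst_epg, scope)

def aep_name(resource_type):
    """[Resource-Type]-AEP"""
    return resource_type + "-AEP"

def vlan_pool_name(resource_type):
    """VLANsFor[Resource-Type]"""
    return "VLANsFor" + resource_type

print(contract_name("Web", "App", "Gbl"))   # WebToApp-GblCon
print(aep_name("UCS"))                      # UCS-AEP
print(vlan_pool_name("AVS"))                # VLANsForAVS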
8 Initial Hardware Bringup

• Overview
• Problem Description
– Symptom 1
– Verification/Resolution
– Symptom 2
– Verification/Resolution
• Problem Description
– Symptom
– Verification
– Resolution

8.1 Overview

This section will cover common issues seen when bringing up the initial hardware. The APIC Fabric can be ordered
in several different configurations. There is an option to purchase optical leaves (leaves with Small Form-factor Pluggable
(SFP) interfaces), and when that is the case an optical Virtual Interface Card (VIC1225) must be used in the APIC.
When a copper leaf is used, the copper VIC1225T must be used.
Initial cabling of the ACI fabric is very important and the following requirements must be adhered to:
• Leafs can only be connected to spines. There should be no cabling between the leafs, even when the leafs are
being configured as virtual port-channel (vPC) peer devices.
• Spines can only be connected to leafs. Spines cannot be inter-connected.
• An APIC must be attached to a leaf. APICs should be dual-homed (connected to two different leafs) for redun-
dancy.
• All end points, L2, L3, L4-L7 devices must connect to leafs. Nothing should be connected to spines other than
leafs as previously mentioned
There are a few common issues that can be observed when initially bringing up a fabric.

8.2 Problem Description

APIC not discovering Leafs after initially cabling

Symptom 1

When first connecting the link between the APIC and leaf, the link interface on the APIC side is down (no lights) but
the leaf side interface has lights on.

Verification/Resolution

• The leaf showed the APIC as a LLDP neighbor (show lldp neighbors)
• The APIC did not show the leaf in the output of “acidiag fnvread“
• A physical examination of the setup shows:
• In the picture above a GLC-T transceiver was plugged into the APIC which has a VIC1225 installed. This is an
optical SFP+ Virtual Interface Card (VIC). The other end of the connection is the 93128TX (copper) leaf.
• The desired behavior was to convert optical to copper (media conversion).
• There are no transceivers qualified to do this sort of conversion. Optical-based VICs need to be plugged into
optical-based leafs, and copper based VICs need to be plugged into copper-based leafs.
Once the proper transceiver was used, and a leaf with copper ports was connected at the other end, the link came up
properly and both the APIC and the leaf were able to share LLDP as expected. Fabric discovery was able to continue
as expected.

Symptom 2

The APIC does not see the leaf switch in the output of “acidiag fnvread” but the leaf does see the APIC in the output
of “show lldp neighbors”.
Verification/Resolution

The VIC1225 and VIC1225T need to have the proper firmware at a minimum to ensure that these VICs do not
consume LLDP and prevent LLDP from going to the APIC. The minimum version of VIC firmware that should be used
is 2.2(1dS1). The firmware can be downloaded from Cisco.com along with the corresponding
upgrade instructions. Once the correct version of VIC firmware is used, the APIC can see the LLDP frames from the
leaf and fabric discovery will complete.

8.3 Problem Description

Amber lights or no lights on the Leaf switch interfaces.


As mentioned in the overview section of this chapter, cabling configurations are strictly enforced. If a leaf is connected
to another leaf, or a spine is connected to another spine, a wiring mismatch will occur.

Symptom

Using the CLI interface on the leaf, execute the show interface command. The output of the command will show the
interface as “out-of-service”.
rtp_leaf1# show interface ethernet 1/16
Ethernet1/16 is up (out-of-service)
admin state is up, Dedicated Interface
Hardware: 100/1000/10000/auto Ethernet, address: 88f0.31db.e800 (bia 88f0.31db.e800)
[snip]

Verification

The “show lldp neighbors” output will identify this leaf port is connected to another leaf port.
rtp_leaf1# show lldp neighbors
Capability codes:
(R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
(W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other
Device ID Local Intf Hold-time Capability Port ID
RTP_Apic1 Eth1/1 120
RTP_Apic2 Eth1/2 120
rtp_leaf3.cisco.com Eth1/16 120 BR Eth1/16
rtp_spine1.cisco.com Eth1/49 120 BR Eth3/1
rtp_spine2.cisco.com Eth1/50 120 BR Eth4/1

The following fault will be raised in the GUI under Fabric -> Inventory -> Pod_1 -> <leaf node>
This same fault can also be viewed in the CLI.
admin@RTP_Apic1:if-[eth1--16]> faults
Severity Code Cause Ack Last Transition Dn
-------- ----- ------------------- --- ------------------- -----------------------
major F0454 wiring-check-failed no 2014-10-17 12:50:16 topology/pod-1/
node-101/sys/lldp/inst/
if-[eth1/16]/
fault-F0454

Total : 1

The fault can also be viewed in the APIC CLI. The full path is shown below.
admin@RTP_Apic1:if-[eth1--16]> pwd
/home/admin/mit/topology/pod-1/node-101/sys/lldp/inst/if-[eth1--16]
Resolution

The resolution for this problem is to correct the cabling misconfiguration. Note: The same problem will be seen for
spine cabling misconfiguration where a spine is cabled to another spine.

9 Fabric Initialization

• Overview
• Fabric Verification
• Problem Description
– Symptom 1
– Verification
– Symptom 2
– Verification
– Resolution
– Symptom 3
– Resolution
– Symptom 4
– Verification
– Resolution

9.1 Overview

This chapter covers the discovery process for an ACI fabric, beginning with an overview of the actions that happen
and the verification steps used to confirm that a functioning fabric exists. The displays have been captured from our
reference topology working fabric and can be used as an aid in troubleshooting issues where fabric nodes fail to join
the fabric.
In this discovery process, a fabric node is considered active when the APIC and node can exchange heartbeats through
the Intra-Fabric Messaging (IFM) process. The IFM process is also used by the APIC to push policy to the fabric leaf
nodes.
Fabric discovery happens in three stages. The leaf node directly connected to the APIC is discovered in the first stage.
The second stage of discovery brings in the spines connected to that initial seed leaf. Then the third stage processes
the discovery of the other leaf nodes and APICs in the cluster.
The diagram below illustrates the discovery process for switches that are directly connected to the APIC. Coverage of
specific verification for other parts of the process will be presented later in the chapter.
The steps are:
• Link Layer Discovery Protocol (LLDP) Neighbor Discovery
• Tunnel End Point (TEP) IP address assignment to the node
• Node software upgraded if necessary
• Policy Element IFM Setup
Node status may fluctuate between several states during the fabric registration process. The states are shown in the
Fabric Node Vector table. The APIC CLI command to show the Fabric Node Vector table is acidiag fnvread, and sample
output will be shown further down in this section. Below is a description of each state.
States and descriptions:
• Unknown – Node discovered but no Node ID policy configured
• Undiscovered – Node ID configured but not yet discovered
• Discovering – Node discovered but IP not yet assigned
• Unsupported – Node is not a supported model
• Disabled – Node has been decommissioned
• Inactive – No IP connectivity
• Active – Node is active
During fabric registration and initialization a port might transition to an “out-of-service” state. Once a port has tran-
sitioned to an out-of-service status, only DHCP and CDP/LLDP protocols are allowed to be transmitted. Below is a
description of each out-of-service issue that may be encountered:
• fabric-domain-mismatch – Adjacent node belongs to a different fabric
• ctrlr-uuid-mismatch – APIC UUID mismatch (duplicate APIC ID)
• wiring-mismatch – Invalid connection (Leaf to Leaf, Spine to non-leaf, Leaf fabric port to non-spine etc.)
• adjacency-not-detected – No LLDP adjacency on fabric port
Ports can go out-of-service due to wiring issues. Wiring issues get reported through the lldpIf object; information on
this object can be browsed at the following object location in the MIT: /mit/sys/lldp/inst/if-[eth1/1]/summary.
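In addition to the switch-local file noted above, the same lldpIf object can, as a hedged example, be read through the APIC REST API once the node has joined the fabric; the pod, node and interface in the DN are placeholders.
# Hedged sketch: read the lldpIf object for one port through the APIC and
# print any reported wiring issues. Pod, node and port are placeholders;
# log in first via aaaLogin as shown in the Troubleshooting Tools chapter.
import requests

apic = "https://<APIC IP>"
session = requests.Session()               # authenticated session assumed
session.verify = False                     # lab use only

url = apic + "/api/mo/topology/pod-1/node-101/sys/lldp/inst/if-[eth1/1].json"
for mo in session.get(url).json()["imdata"]:
    print(mo["lldpIf"]["attributes"].get("wiringIssues", ""))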

9.2 Fabric Verification

This section illustrates some displays from the reference topology configured and in full working order.
The first step is to verify LLDP neighborship information has been exchanged. To verify LLDP information exchange,
the command show lldp neighbors can be used. This command can be run on the APIC and executed on the nodes,
or it can be run directly on the fabric nodes. The APIC runs Linux using a Bash-based shell that is not sensitive to
the question mark in the way IOS or NX-OS shells are. In order to see all the command options, the APIC requires
the entry of a special control sequence sent by pressing the escape key twice. This double escape sequence is the
equivalent of the NXOS/IOS contextual help function triggered when the question mark ”?” is typed in the CLI. For
example, the output below shows the result of typing show lldp neighbors <esc> <esc>:
admin@RTP_Apic1:~> show lldp neighbors
node Fabric node
rtp_leaf1 Specify Fabric Node Name
rtp_leaf2 Specify Fabric Node Name
rtp_leaf3 Specify Fabric Node Name
rtp_spine1 Specify Fabric Node Name
rtp_spine2 Specify Fabric Node Name

Based on the option provided in the contextual help output above, now extending the command to show lldp neighbors
node produces the following output:
admin@RTP_Apic1:~> show lldp neighbors node
101 Specify Fabric Node id
102 Specify Fabric Node id
103 Specify Fabric Node id
201 Specify Fabric Node id
202 Specify Fabric Node id

Executing the command show lldp neighbors rtp_leaf1 in the APIC CLI displays all the LLDP Neighbors adjacent
to “rtp_leaf1”. The output shows that this leaf is connected to two different APICs and two spines.
admin@RTP_Apic1:~> show lldp neighbors rtp_leaf1
# Executing command: 'cat /aci/fabric/inventory/pod-1/rtp_leaf1/protocols/lldp/neighbors/summary'
neighbors:
device-id local-interface hold-time capability port-id
-------------- --------------- --------- ------------- -----------------
RTP_Apic1 eth1/1 120 90:e2:ba:4b:fc:78
RTP_Apic2 eth1/2 120 90:e2:ba:5a:9f:30
rtp_spine1 eth1/49 120 bridge,router Eth3/1
rtp_spine2 eth1/50 120 bridge,router Eth4/1

This command may also be run directly on the leaf as shown below:
rtp_leaf1# show lldp neighbors
Capability codes:
(R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
(W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other
Device ID Local Intf Hold-time Capability Port ID
RTP_Apic1 Eth1/1 120 90:e2:ba:4b:fc:78
RTP_Apic2 Eth1/2 120 90:e2:ba:5a:9f:30
rtp_spine1 Eth1/49 120 BR Eth3/1
rtp_spine2 Eth1/50 120 BR Eth4/1

When the command acidiag fnvread is run in the APIC CLI, it can be used to verify the Fabric Node Vector (FNV)
that is exchanged using LLDP. This is the quickest way to determine if each node is active, and a TEP address has
been assigned.
admin@RTP_Apic1:~> acidiag fnvread

ID Name Serial Number IP Address Role State LastUpdMsgId


-------------------------------------------------------------------------------------------------
101 rtp_leaf1 SAL1819SAN6 172.16.136.95/32 leaf active 0
102 rtp_leaf2 SAL172682S0 172.16.136.91/32 leaf active 0
103 rtp_leaf3 SAL1802KLJF 172.16.136.92/32 leaf active 0
201 rtp_spine1 FGE173400H2 172.16.136.93/32 spine active 0
202 rtp_spine2 FGE173400H7 172.16.136.94/32 spine active 0

Total 5 nodes

When the command acidiag avread is run in the APIC CLI, it can be used to verify the Appliance Vector (AV) that is
exchanged using LLDP. This is the best way to determine the APICs are all part of one clustered fabric. This command
also helps to verify that the TEP address is assigned, the appliance is commissioned, registered, and active, and the
health is equal to 255 which signifies the appliance is “fully fit”.
admin@RTP_Apic1:~> acidiag avread
Local appliance ID=1 ADDRESS=172.16.0.1 TEP ADDRESS=172.16.0.0/16 CHASSIS_ID=a5945f3c-53c8-11e4-bde2-
Cluster of 3 lm(t):1(2014-10-14T20:04:46.691+00:00) appliances (out of targeted 3 lm(t):3(2014-10-14T
appliance id=1 last mutated at 2014-10-14T17:36:51.734+00:00 address=172.16.0.1 tep address=172.1
appliance id=2 last mutated at 2014-10-14T19:55:24.356+00:00 address=172.16.0.2 tep address=172.1
appliance id=3 last mutated at 2014-10-14T20:04:46.922+00:00 address=172.16.0.3 tep address=172.1
clusterTime=<diff=0 common=2014-10-14T20:24:47.810+00:00 local=2014-10-14T20:24:47.810+00:00 pF=<disp

This same information can also be verified using the ACI GUI. The capture below shows the APIC cluster health
screen.

The capture below displays the overall fabric topology. When fully discovered, each node should be visible under the
Pod1 folder.
9.3 Problem Description

During fabric discovery, issues may be encountered when a leaf or spine does not join the ACI fabric due to issues that
were mentioned in the overview section of this chapter.

Symptom 1

The leaf or spine does not show up in fabric membership GUI.

Verification

1. Check the power status of switches and ensure they are powered on. Use the locator LED to identify if each
switch is in a healthy state.
2. Check the cabling between switches. Example: Leaf should only be connected to Spine and APIC. Spine should
only be connected to Leaves.
3. Use console cables to access the device, verify if the device is in loader> prompt or (none) prompt.
(a) When using the console connection, if the device displays the loader> prompt, the switch is in a state
where it did not load the ACI switch software image. Please refer to the ‘ACI Fabric Node and Process
Crash Troubleshooting’ chapter of this document that explains how to recover from the loader prompt.
(b) When using the console connection, if the device displays the (none) login: prompt, enter “admin” then
hit the Enter key to access the CLI. The following message should appear on the screen:
User Access Verification

(none) login: admin


********************************************************************************
Fabric discovery in progress, show commands are not fully functional
Logout and Login after discovery to continue to use show commands.
********************************************************************************
(none)#

Use the command show lldp neighbor to verify if the Leaf is connected to the spine or APIC. If this is a spine, it should
be connected to the leaves.
(none)# show lldp neighbor
Capability codes:
(R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
(W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other
Device ID Local Intf Hold-time Capability Port ID
RTP_Apic1 Eth1/1 120 90:e2:ba:4b:fc:78
...
switch Eth1/49 120 BR Eth3/1
switch Eth1/50 120 BR Eth4/1
Total entries displayed: 14

If presented with the (none)# prompt, use the command show interface brief to verify what status the interfaces are
in.
(none)# show interface brief
--------------------------------------------------------------------------------
Port VRF Status IP Address Speed MTU
--------------------------------------------------------------------------------
mgmt0 -- up 1000 9000
--------------------------------------------------------------------------------
Ethernet VLAN Type Mode Status Reason Speed Port
Interface Ch #
--------------------------------------------------------------------------------
Eth1/1 0 eth trunk up out-of-service 10G(D) --
...
Eth1/47 0 eth trunk down sfp-missing 10G(D) --
Eth1/48 0 eth trunk up out-of-service 10G(D) --
Eth1/49 -- eth routed up none 40G(D) --
Eth1/49.1 2 eth routed up none 40G(D) --
Eth1/50 -- eth routed up none 40G(D) --
Eth1/50.2 2 eth routed up none 40G(D) --
...

Alternatively, this information can be found from the (none)# prompt with the command cat /mit/sys/lldp/inst/if-\[eth1--<PORT NUMBER>\]/summary, which also reports any wiring issues:
(none)# cat /mit/sys/lldp/inst/if-\[eth1--60\]/summary

# LLDP Interface
id : eth1/60
adminRxSt : enabled
adminSt : enabled
adminTxSt : enabled
childAction :
descr :
dn : sys/lldp/inst/if-[eth1/60]
lcOwn : local
mac : 7C:69:F6:0F:EA:EF
modTs : 2014-10-13T20:44:37.182+00:00
monPolDn : uni/fabric/monfab-default
name :
operRxSt : enabled
operTxSt : enabled
portDesc : topology/pod-1/paths-0/pathep-[eth1/60]
portVlan : unspecified
rn : if-[eth1/60]
status :
sysDesc :
wiringIssues :

Symptom 2

In Fabric Membership, no TEP IP address is assigned to the leaf or spine switch, and the node is listed with a state of "unsupported" and a role of "unknown".

Verification

If the switch shows "unsupported" for its state, the device model (part number) is not supported by the current APIC version. The command "acidiag fnvread" in the APIC CLI lists all nodes in the fabric and their states:
admin@RTP_Apic1:~> acidiag fnvread
ID Name Serial Number IP Address Role State LastUpdMsgId
-------------------------------------------------------------------------------------------------
0 SAL12341234 0.0.0.0 unknown unsupported 0
(none)# cat /mit/uni/fabric/compcat-default/swhw-*/summary | grep model
model : N9K-C9336PQ
model : N9K-C9508
model : N9K-C9396PX
model : N9K-C93128TX
(none)#

Resolution

The device model or part number must match the catalog's supported hardware. As shown above, the command cat /mit/uni/fabric/compcat-default/swhw-*/summary | grep model on the switch can be used to list the hardware supported by the catalog.

Symptom 3

The switch state shows “unknown”. The state can be corroborated by use of the acidiag fnvread command in the APIC
CLI.
admin@RTP_Apic1:~> acidiag fnvread
ID Name Serial Number IP Address Role State LastUpdMsgId
-------------------------------------------------------------------------------------------------
0 SAL1819SAN6 0.0.0.0 unknown unknown 0

A few conditions could cause this switch state:
• Node ID policy has not been posted to the APIC or the switch has not been provisioned with the APIC GUI with
the device’s specific serial number.
• If the REST API was used to post the Node ID policy to the APIC, the serial number that was posted to the
APIC doesn’t match the actual serial number of the device. The following switch CLI command can verify the
serial number of the device:
(none)# cat /mit/sys/summary | grep serial
serial : SAL1819SAN6

Resolution

Assign a Node ID to the device if one has not been configured, and make sure the provisioned serial number matches the actual device serial number.

Symptom 4

The leaf or spine is not discovered in the “Pod” folder in the GUI.

Verification

Use the cat /mit/sys/summary CLI command to verify the state of the leaf or spine. If the state shows out-of-service, go back through the Symptom 1 verification steps. If the state shows invalid-ver, verify the "Firmware Default Policy" via the APIC GUI. Example output:
leaf101# cat /mit/sys/summary
# System
address : 0.0.0.0
childAction :
currentTime : 2014-10-14T18:14:26.861+00:00
dn : sys
fabricId : 1
fabricMAC : 00:22:BD:F8:19:FF
id : 0
inbMgmtAddr : 0.0.0.0
lcOwn : local
modTs : 2014-10-13T20:43:50.056+00:00
mode : unspecified
monPolDn : uni/fabric/monfab-default
name :
oobMgmtAddr : 0.0.0.0
podId : 1
rn : sys
role : leaf
serial : SAL1819SAN6
state : out-of-service
status :
systemUpTime : 00:21:31:39.000

If the state from the cat /mit/sys/summary CLI command shows in-service, then the TEP IP address listed under the
“address” field of the CLI output should be pingable. If the switch’s TEP address is not reachable from the APIC, a
possible cause could be a switch certificate issue. Verify that the switch is able to communicate with APIC via TCP
port 12183.
leaf101# netstat -a |grep 12183
tcp 0 0 leaf101:12183 *:* LISTEN
tcp 0 0 leaf101:12183 apic2:43371 ESTABLISHED
tcp 0 0 leaf101:12183 apic1:49862 ESTABLISHED
tcp 0 0 leaf101:12183 apic3:42332 ESTABLISHED

If the switch is listening on TCP port 12183 but there are no established sessions, and IP connectivity between the switch and the APIC has been confirmed with a ping test, verify SSL communication with the command cat /tmp/logs/svc_ifc_policyelem.log | grep SSL.
leaf101# cat /tmp/logs/svc_ifc_policyelem.log | grep SSL
3952||14-08-02 21:06:53.875-08:00||ifm||DBG4||co=ifm||incoming connection established from 10.0.0.1:5
3952||14-08-02 21:06:53.931-08:00||ifm||DBG4||co=ifm||openssl: error:14094415:SSL routines:SSL3_READ_

Resolution

If this scenario is encountered, contact the Cisco Technical Assistance Center support.

10 APIC High Availability and Clustering

• Overview
• Cluster Formation
• Majority and Minority - Handling Clustering Split Brains
• Problem Description
– Symptom
– Verification
• Problem Description
– Symptom
– Verification
• Problem Description
– Symptom
– Resolution
• Problem Description
– Symptom
– Resolution
• Problem Description
– Symptom
– Resolution
• Problem Description
– Symptom
– Resolution
• Problem Description
– Symptom
– Resolution

10.1 Overview

This chapter covers APIC clustering operations. Clustering provides high availability, data protection, and scaling by distributing data storage and processing across the APIC controllers. While every unit of data in the object model is handled by a single controller, all units are replicated 3 times across the cluster to other controller nodes, regardless of the size of the cluster. Clustering makes the system highly resilient to process crashes and corrupted databases by eliminating single points of failure.
The APIC process that handles clustering is the Appliance Director process. Appliance Director runs on every controller and is specifically in charge of synchronizing information across all nodes in the cluster. While Appliance Director performs periodic heartbeats to track the availability of the other controllers, the actual replication of data is done by each respective service independently. For example, Policy Manager on one controller is responsible for replicating its data to the Policy Manager instances on the other controllers; Appliance Director only tells each service which controllers hold its replicas.
Each controller node in the cluster is uniquely identified by an ID. This ID is configured by the administrator at the
time of initial configuration.

10.2 Cluster Formation

The following list of necessary conditions has to be met for successful cluster formation:
• Candidate APICs must have been configured with matching admin user credentials to be part of cluster.
• When adding controller nodes to the cluster, the administratively configured cluster size must not be exceeded.
• When a new node is added, its specified cluster size must match the configured cluster size on all other nodes in
the cluster.
• Every node must have connectivity to all other nodes in the cluster.
• There must be a data exchange between reachable controller pairs.
In our sample reference topology, 3 controllers are being used, namely APIC1, APIC2, and APIC3. The process flow
for forming the cluster is as follows:
APIC1 enters a state where the GUI shows an operational cluster size of 1 and a controller health of "fully fit". During the setup script, the administrative cluster size was configured as 3. Once fabric discovery has converged to the point where APIC1 has formed relationships with the fabric node switches, providing connectivity
to APIC2, APIC1 and APIC2 will establish a data exchange and APIC2 will start sending heartbeats. Once APIC1
receives heartbeats from APIC2, APIC1 increments the operational cluster size to 2 and allows APIC2 to join the
cluster. The discovery process continues and detects a third controller, and the cluster operational size is incremented
again by a value of 1. This process of fabric and controller discovery continues until the operational cluster size
reaches the configured administrative cluster size. In our reference topology, the administrative cluster size is 3 and
when APIC3 joins the cluster, the operational cluster size is 3, and the cluster formation is complete.

10.3 Majority and Minority - Handling Clustering Split Brains

Due to the fundamental functionality of data replication, any ACI fabric has a minimum supported APIC cluster size
of 3. With clustering operations, APICs leverage the concept of majority and minority. Majority and minority are
used to resolve potential split brain scenarios. In a case where split brain has occurred, 2 APICs, such as APIC1 and
APIC2, can communicate with each other but not with APIC3, and APIC3 is not able to communicate with either
APIC1 or APIC2. Since there was an odd number of controllers to start with, APIC1 and APIC2 are considered the majority and APIC3 the minority. If there had been an even number of APIC controllers to start with, it would be more difficult to resolve which side is the majority and which is the minority.
When an APIC controller loses network connectivity to the other controllers, it transitions into a minority state, while the controllers that remain reachable to each other represent the majority. In a
minority state, the APIC enters a read only mode where no configuration changes are allowed. No incoming updates
from any of the fabric switch nodes are handled by the minority controller(s) and if any VMM integration exists,
incoming updates from hypervisors are ignored.
While an APIC remains in minority state, read requests will be allowed but will return data with an indication that the
data could be stale.
In the scenario of a loss of network connectivity resulting in the partitioning of the fabric and an APIC in minority state, if an endpoint attaches to a leaf managed by an APIC in minority state, the leaf will download and instantiate a potentially stale policy from the minority controller. Once all controllers regain connectivity to each other and the split brain condition has been resolved, if a more recent or updated copy of the policy exists on the majority side of the cluster, the leaf will download and update the policy accordingly.
10.4 Problem Description

When adding or replacing an APIC within an existing cluster, an issue can be encountered where the APIC is not able to join the existing cluster.

Symptom

During fabric bring up or expansion, APIC1 is the only controller online, and APIC3 is being inserted before APIC2,
therefore APIC3 will not join the fabric.

Verification

Under System->Controller-Faults, verify the existence of the following fault:

The fault message indicates that APIC3 cannot join the fabric before APIC2 joins the fabric. The problem will be
resolved once APIC2 is brought up, then APIC3 will be able to join the cluster.

10.5 Problem Description

Policy changes are not allowed on APIC1 even though APIC1 is healthy and fully fit.

Symptom

APIC2 and APIC3 are not functional (shutdown or disconnected) while APIC1 is fully functional.

Verification

Under System->Controllers->Cluster, APIC2 and APIC3 have an operational status of "Unavailable".


When trying to create a new policy, the following status message is seen:

These symptoms indicate that APIC1 is in the minority state: it still considers APIC2 and APIC3 part of the cluster, but it has lost connectivity to both of them over the infrastructure VLAN.
One of the missing APICs, APIC2 or APIC3, needs to be powered up to resolve this error. For example, once APIC3 becomes part of the cluster again, APIC1 and APIC3 will be in the majority, while APIC2 (still offline) will be in the minority.

Types of Cluster Faults

Cluster-related faults are designed to provide diagnostic information sufficient to correct detected faulty conditions. There are 2 major groups of faults: faults related to messages which are discarded by the ApplianceDirector process on the receiving APIC, and faults related to cluster geometry changes.
Faults related to messages discarded by the ApplianceDirector process running on the receiving APIC are examined from the following two perspectives:
1. Is this message from a cluster peer?
2. If not, is it from an APIC which might be considered as a candidate for cluster expansion?
Consequently, there will be an attempt to raise two faults (F1370 and F1410) if the received message fails to qualify under either check and is discarded by the recipient.
There has to be a continuous stream of similar messages arriving over a period of time for a fault to be raised. Once
the fault is raised, it contains information about the APIC experiencing the failure, including the serial number, cluster
ID, and time when the stream of similar messages started to arrive.

10.6 Problem Description

A cluster with an OperationalClusterSize of 3 will not accept an addition or replacement APIC that claims an OperationalClusterSize of 5.

Symptom

A fault code of 1370 with a reason of "operational-cluster-size-distance-cannot-be-bridged" will be raised if the APIC trying to join has an OperationalClusterSize that deviates from the cluster's OperationalClusterSize by more than 1.

Resolution

In the initial setup script of the new APIC, set the cluster size to match what is configured on the current fabric, or to be at most 1 greater.

10.7 Problem Description

A controller's configuration is erased, the server is restarted, and the controller is brought back into the cluster.

Symptom

A fault code of 1370 with a reason of "source-has-mismatched-target-chassis-id" will be raised when an APIC trying to join the cluster has a different Chassis ID from the one previously known to the other controllers.

Resolution

The corrective action is to decommission the server whose configuration was erased from any other controller, and then commission it back. The cluster will then be able to merge with the controller that has been brought back online.

10.8 Problem Description

Adding controllers beyond the configured cluster size of 3 results in a system fault, and the new controller does not join the cluster.

Symptom

A fault code of 1370 with a reason of "source-id-is-outside-operational-cluster-size" is raised when the transmitting APIC has a cluster ID which does not fit into the cluster with the current OperationalClusterSize.

Resolution

Change the cluster ID to be within the range of the cluster size defined in the setup script. If a cluster ID greater than the currently defined size is required, it may be necessary to grow the cluster first.

10.9 Problem Description

Adding a currently decommissioned server back into the cluster results in a fault.

Symptom

A fault code of 1370 with a reason of source-is-not-commissioned is raised when the transmitting APIC has a cluster
ID which is currently decommissioned in the cluster.

Resolution

Commission the APIC.

10.10 Problem Description

A controller added from another fabric fails to join the cluster.

Symptom

A fault code of 1370 with a reason of fabric-domain-mismatch is raised when the transmitting APIC has a FabricID
which is different from FabricID in the formed cluster.

Resolution

Run the APIC CLI command acidiag eraseconfig setup and set the correct FabricID on the APIC from the setup
script.

11 Firmware and Image Management


• Overview
– APIC Controller and Switch Software
– Firmware Management
– Compatibility Check
– Firmware Upgrade Verification
– Verifying the Firmware Version and the Upgrade Status by use of the REST API
• Problem Description
– Symptom
– Verification
– Resolution
• Problem Description
– Symptom
– Verification
– Resolution
• Problem Description
– Symptom
– Verification
– Resolution

11.1 Overview

This chapter covers firmware and image management for the ACI fabric hardware components. It will cover the
overview of objects and policies that make up firmware and image management in the context of software upgrades,
followed by the verification steps used to confirm a successful upgrade process.

APIC Controller and Switch Software

There are three types of software images in the fabric that can be upgraded:
1. The APIC software image.
2. The switch software image — software running on leafs and spines of the ACI fabric.
3. The Catalog image — the catalog contains information about the capabilities of different models of hardware
supported in the fabric, compatibility across different versions of software, and hardware and diagnostic utilities.
The Catalog image is implicitly upgraded with the controller image. Occasionally, it may be required to upgrade
the Catalog image only to include newly qualified hardware components into the fabric or add new diagnostic
utilities.
You must upgrade the switch software image for all the spine and leaf switches in the fabric first. After that upgrade
is successfully completed, upgrade the APIC controller software image.

Firmware Management

There are five components within the context of firmware management:


1. Firmware Repository is used to store images that have been downloaded to the APIC. Images are transferred
into the firmware repository from external source locations over HTTP or SCP protocols. The source locations
are configurable via the firmware source policy. Once an image has been copied from its source location, it is
replicated across all controllers within the cluster. The switch nodes will retrieve images from the controller as
required during the beginning of the upgrade process.
2. Firmware Policy is the policy which specifies the desired firmware image version.
3. Firmware Group is the configured group of nodes that share the same firmware policy.
4. Maintenance Policy is the maintenance policy which specifies a schedule for upgrade.
5. Maintenance Group is the group of nodes that share the same maintenance policy.
By default, all controllers are part of a predefined firmware group and a predefined maintenance group. Membership
within the firmware and maintenance groups is not modifiable. However, both the controller firmware policy and the
controller maintenance policy are modifiable to select a desired version to upgrade to.
Before the administrator can upgrade switches, a firmware group must be created for all the switches, and one or more maintenance groups should be created to contain all the switches within the ACI fabric.

Compatibility Check

The ACI fabric can have up to 3 different compatible versions of switch software simultaneously active in the fabric. There are three different levels of "compatibility" checks:
1. Image level compatibility - Controllers use the Catalog image to check for compatibility across software images
that can interoperate in the fabric. The controller will ensure image compatibility is satisfied before allowing for
upgrade and downgrade.
2. Card level compatibility - Within a spine modular chassis, the supervisor software must be compatible with
line card, fabric card and system controller software. Similarly all the connected FEXes within the leaf switch
must be compatible with software running in the leaf. If a card or a FEX connected to the system contains
incompatible software with the supervisor module of a spine or a leaf, the supervisor module of the spine or the
leaf will ensure compatibility by pushing down a compatible version of card or FEX software.
3. Feature level compatibility - Given a set of disparate image versions running in the fabric, these images may
have image level compatibility and can be simultaneously available within the fabric. However, they may not
support the same set of software features. As a result, feature and hardware level compatibility is encoded
in the object model so that the controller can identify feature incompatibility at the point of configuration. The administrator will be prompted, or the configuration will fail, when enabling such features in a mixed hardware and software version environment.

Firmware Upgrade Verification

Once a controller image is upgraded, the controller disconnects itself from the cluster and reboots with the newer version while the other APIC controllers in the cluster remain operational. Once the controller has rebooted, it joins the cluster again. The cluster then converges, and the next controller starts the upgrade process. If the cluster does not immediately converge and is not fully fit, the upgrade will wait until the cluster converges and is "Fully Fit". During this period, a Waiting for Cluster Convergence message is displayed.
For the switches, the administrator can also verify that the switches in the fabric have been upgraded from the APIC
GUI navigation pane, by clicking Fabric Node Firmware. In the Work pane, view all the switches listed. In the Current
Firmware column view the upgrade image details listed against each switch.
Verifying the Firmware Version and the Upgrade Status by use of the REST API

To check the upgrade status of controllers and switches, an administrator can query the following URL:
https://<ip address>/api/node/class/maintUpgJob.xml
An administrator can query the current running firmware version on controllers:
https://<ip address>/api/node/class/firmwareCtrlrRunning.xml
An administrator can also query the currently operating firmware version on switches:
https://<ip address>/api/node/class/firmwareRunning.xml
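As a minimal sketch, these queries can also be scripted; the example below uses the Python requests library, the APIC address and credentials are placeholders, and the upgradeStatus attribute name is an assumption that should be confirmed against the APIC Model Reference:
import json
import requests

apic = 'https://<apic-ip-address>'  # placeholder APIC address
credentials = {'aaaUser': {'attributes': {'name': 'admin', 'pwd': 'password'}}}

session = requests.Session()
# authenticate; the APIC-cookie token returned by aaaLogin is kept in the session
session.post(apic + '/api/aaaLogin.json', data=json.dumps(credentials), verify=False)

# class-level query for the upgrade status of every controller and switch
response = session.get(apic + '/api/node/class/maintUpgJob.json', verify=False)
for entry in response.json()['imdata']:
    attributes = entry['maintUpgJob']['attributes']
    # upgradeStatus is assumed here; print the full attribute dictionary if unsure
    print(attributes['dn'], attributes.get('upgradeStatus'))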

11.2 Problem Description

Failing to copy firmware files to APIC through a download task

Symptom

After configuring an APIC download task policy, the download keeps failing and will not download the firmware from
the home directory of the user.

Verification

The following screen is observed:


Resolution

Since the APIC uses a standard Linux distribution, the SCP source must follow the standard Linux SCP format. For example, if the server IP address is 171.70.42.180 and the absolute path to the image is /full_path_from_root/release/image_name, the download source is specified as 171.70.42.180:/full_path_from_root/release/image_name.
The following illustrations show the successful download of the APIC software via SCP.

11.3 Problem Description

The APIC cluster fails to upgrade.

Symptom

Policy upgrade status showing “Waiting for Cluster Convergence”.

Verification

When upgrading controllers, the controller upgrade firmware policy will not proceed unless the APIC cluster has a status of "Fully Fit". The upgrade status may show "Waiting for Cluster Convergence" and will not proceed with the upgrade.
This "Waiting for Cluster Convergence" status can be caused by a policy or process that has crashed. If the cluster is not in a "Fully Fit" state, check the list of running processes on each APIC; for example, evidence of such a problem would be the presence of core dump files on a controller.

Resolution

While the administrator can recover the APIC from the “Waiting for Cluster Convergence” state by restarting the
affected APIC to allow all processes to start up normally, in the presence of core dump files the Cisco Technical
Assistance Center should be contacted immediately to analyze and troubleshoot further.
11.4 Problem Description

Policy upgrade is paused.

Symptom

Upgrade is paused for either an APIC or a switch.

Verification

The administrator can check whether fault code F1432 - Maintenance scheduler is paused for group policyName - has been raised. This fault is generated when one or more members of the group failed to upgrade or when the user manually paused the scheduler.

Resolution

The administrator should look for other faults indicating why the upgrade failed. Once all the faults are resolved, the
administrator can delete failed/paused policy and re-initiate a new policy upgrade.

12 Faults / Health Scores

• Overview
• Problem Description
– Symptom 1:
– Resolution 1:
• Problem Description
– Symptom 1:
– Resolution 1:
– Resolution 2:

12.1 Overview

This chapter is intended to provide a basic understanding of faults and health scores in the ACI object model. This
chapter will cover what these items are and how the information in these elements can be used in troubleshooting.
Any object in the fabric that can experience errors or problems can have faults raised against it. Each fault that is raised has a related weight and severity. Faults transition between stages throughout a life cycle in which the fault is raised, soaking and cleared. The APIC maintains a real time list of administrative and operational components and their related faults, which in turn is used to derive the health score of an object. The health score itself is calculated based on the active faults on the object and the health scores of its child objects. This yields a health score of 100 if no faults are present on the object and the child MOs are all healthy, and the health score trends towards 0 as health decreases. The health score of the system is calculated using the health scores of the switches and the number of endpoints learned on the leafs. Similarly, the health score of a tenant is calculated based on the health score of the resources used by the tenant on the leafs and the number of endpoints learned on those leafs.
To describe the stages of the fault lifecycle in more detail, a soaking fault is the beginning state for a fault when it is
first detected. During this state, depending on the type of fault, it may expire if it is a non-persistent fault or it will
continue to persist in the system. When a fault enters the soaking-clearing state, that fault condition has been resolved
at the end of a soaking interval.
If a fault has not been cleared by the time the soaking interval has been reached, it will enter the raised state, and
potentially have its severity increased. The new severity is defined by the policy for the particular fault class, and will
remain in the raised state until the fault condition is cleared.
Once a fault condition has been cleared, it will enter the raised-clearing state. At this point a clearing interval begins,
after which if the fault has not returned it will enter the retaining state, which leaves the fault visible so that it can be
inspected after an issue has been resolved.
At any point during which a fault is created, changes state or is cleared, a fault event log is generated to keep a record
of the state change.
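For example, fault instances can be read directly through the REST API with a class-level query; the faultInst class and its severity attribute are standard model names, used here purely as an illustration:
https://<ip address>/api/node/class/faultInst.json?query-target-filter=eq(faultInst.severity,"critical")
Each returned fault record carries its severity, lifecycle state and affected object, tying the lifecycle described above to concrete data that can be inspected or scripted against.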
12.2 Problem Description

Health Score is not 100%

Symptom 1:

A fault has been raised on some object within the MIT

Resolution 1:

The process for diagnosing low health scores is similar for physical, logical and configuration issues, however it can
be approached from different sections of the GUI. For this example, a low overall system health score due to a physical
issue will be diagnosed.
1. Navigate to the System Health Dashboard, and identify a switch that has a diminished health score.

Look primarily for health scores less than 99. Double clicking on that leaf will allow navigation into the faults raised
on that particular device. In this case, double click on rtp_leaf1.
Once in the Fabric Inventory section of the GUI, the dashboard for the leaf itself will be displayed, and from there
navigate into the health tab, by either double clicking the health score or clicking on the “Health” tab.
Now the nodes in the health tree can be expanded to find those with low health scores. To the left of each node in the tree there will be an indicator showing the impact of that particular subtree on the parent's health score. This can be one of Low, Medium, Max or None. If the indicator states None, that particular object has no impact on the health of the parent object. Information describing the different severities of faults present on the managed object, along with their counts, is also displayed.
Navigating into a sub object by clicking the plus sign will show the sub objects that make up the total health score of the parent.
Navigating down through the tree, it may be noticed that no faults are raised directly on a given object, which means that some child object contains the faults. Continue to navigate down through the tree until an object that directly carries faults is reached.
Once such an object with no children has been reached, the cause of the fault has been found. Right clicking anywhere on the object brings up an action menu that includes a "Show Object" option.
Click on "Show Object" to bring up the object that has the fault along with a number of details regarding that object's current state. This includes a tab named "Faults" which shows what faults are raised. Double clicking a fault provides the fault properties, and with this information it is possible to limit the area for troubleshooting to just the object that has the fault.
In the above example, an interface has a fault because it is used by an EPG but is missing an SFP transceiver. Environmental problems and hardware failures are presented in the same way.

12.3 Problem Description

Health score degraded, identification has been made of faults indicating “LLDP neighbor is bridge and its port vlan 1
mismatches with the local port vlan unspecified” at the Fabric Level

Symptom 1:

The front panel ports of the ACI leaf switches do not have a native VLAN configured by default. If a Layer 2 switch is connected to a leaf port, certain models, including the Nexus 5000 and Nexus 7000, will by default advertise a native VLAN of 1 in the LLDP port VLAN TLV, and the leaf will raise this fault.
Normally the absence of a native VLAN is not an issue when servers are connected to the leaf ports. When a Layer 2 switch is connected to a leaf port, however, it is important that a native VLAN is configured on that leaf port; if it is not, the leaf may not forward STP BPDUs. This is why a native VLAN mismatch is treated as a critical fault.

Resolution 1:

This fault can be cleared by configuring a statically attached interface to the path interface on which the fault is raised.
The EPG static path attach should have the encap VLAN set to the same as the native VLAN, and have the mode set
as an untagged interface. This can be configured via XML, using the following POST request URI and payload.
https://10.122.254.211/api/mo/uni/tn-Prod/ap-Native/epg-Native.xml
<fvAEPg name="native">
<fvRsPathAtt tDn="topology/pod-1/paths-103/pathep-[eth1/5]"/>
<fvRsPathAtt tDn="topology/pod-1/paths-101/pathep-[eth1/5]"/>
<fvRsDomAtt tDn="uni/phys-phys"/>
</fvAEPg>

Resolution 2:

To clear this fault from the system, configure the downstream switch to not advertise a vlan. This can be configured
using the “no vlan dot1q tag native” command in global config mode, after which bouncing the interfaces connected
to the fabric using “shutdown” and “no shutdown” should clear the issue.

13 REST Interface
• Overview
– ACI Object Model
– Queries
– APIC REST API
– Payload Encapsulation
– Read Operations
– Write Operations
– Authentication
– Filters
– Browser
– API Inspector
– ACI Software Development Kit (SDK)
– Establishing a Session
– Working with Objects
– APIC REST to Python Adapter
– Conclusion
• Problem Description
– Symptom 1
– Verification 1
– Resolution
– Symptom 2
– Verification
– Resolution
– Symptom 3
– Verification
– Resolution
– Symptom 4
– Verification
– Resolution
– Symptom 5
– Verification
– Resolution
– Symptom 6
– Verification
– Resolution

13.1 Overview

This chapter will explain the basic concepts necessary to begin effectively utilizing ACI programmatic features for troubleshooting. It begins with an overview of the ACI Object Model, which describes how the system interprets configuration and represents state to internal and external entities. The REST API provides the means necessary to manipulate the object store, which contains the configured state of the APIC, using the object model as the metadata definition. The APIC SDK leverages the REST API to read and write the configuration of the APIC, using the object model to describe the current and desired state.
ACI provides a new approach to data center connectivity, innovative and different from the standard approach taken
today, but astonishingly simple in its elegance and capacity to describe complete application topologies and holistically
manage varying components of the data center. With the fabric behaving as a single logical switch, problems like
managing scale, enabling application mobility, collecting uniform telemetry points and configuration automation are
all solved in a straightforward approach. With the controller acting as a single point of management, but not a single
point of failure, clustering offers the advantages of managing large data centers but none of the associated fragmented
management challenges.
The controller is responsible for all aspects of configuration. This includes configuration for a number of key areas:
• Policy: defines how applications communicate, security zoning rules, quality of service attributes, service inser-
tion and routing/switching
• Operation: protocols on the fabric for management and monitoring, integration with L4-7 services and virtual
networking
• Hardware: maintaining fabric switch inventory and both physical and virtual interfaces
• Software: firmware revisions on switches and controllers
With these pieces natively reflected in the object model, it is possible to change these through the REST API, further
simplifying the process by utilizing the SDK.

ACI Object Model

Data modeling is a methodology used to define and analyze data requirements needed to support a process in relation
to information systems. The ACI Object Model contains a modeled representation of applications, network constructs,
services, virtualization, management and the relationships between all of the building blocks. Essentially, the object
model is an abstracted version of the configuration and operational state that is applied individually to independent
network entities. As an example, a switch may have interfaces and those interfaces can have characteristics, such as
the mode of operation (L2/L3), speed, connector type, etc. Some of these characteristics are configurable, while others
are read-only, however all of them are still properties of an interface.
The object model takes this analytical breakdown of what defines a thing in the data center, and carefully determines
how it can exist and how to represent that. Furthermore, since all of these things do not merely exist, but rather interact
with one another, there can be relationships within the model, which includes containment hierarchy and references.
An interface belongs to a switch, therefore is contained by the switch, however a virtual port channel can reference it.
A virtual port channel does not necessarily belong to a single switch.
The objects in the model can also utilize a concept called inheritance, where an interface can be a more generic concept
and specific definitions can inherit characteristics from a base class. For example, a physical interface can be a data
port or a management port, however both of these still have the same basic properties, so they can inherit from a single
interface base class. Rather than redefine the same properties many times, inheritance can be used to define them in
one base class, and then specialize them for a specific child class.
All of these configurable entities and their structure are represented as classes. The classes define the entities that
are instantiated as Managed Objects (MO) and stored within the Management Information Tree (MIT). The general
concept is similar to the tree based hierarchy of a file system or the SNMP MIB tree. All classes have a single parent,
and may contain multiple children. This is with exception to the root of the tree, which is a special class called topRoot.
Within the model there are different packages that act as logical groupings of classes, so that similar entities are placed
into the same package for easier navigation of the model. Each class has a name, which is made from the package
and a class name, for example “top” is the package and “Root” is the class: “topRoot”; “fv” is the package (fabric
virtualization) and “Tenant” is the class: “fvTenant”. A more generic form of this would be:
Package:classname == packageClassName
Managed objects make up the management information tree, and everything that can be configured in ACI is an object.
MOs have relative names (Rn), which are built according to well-defined rules in the model. For the most part, the
Rn is a prefix prepended to some naming properties, so for example the prefix for an fvTenant is “tn-“ and the naming
property for a fvTenant would be the name, “Cisco”. Combining these gives an Rn of tn-Cisco for a particular MO.
Relative names are unique within their namespace, meaning that within the local scope of an MO there can only ever be one child using that name. Using this rule, paired with the tree-based hierarchy of the MIT, the relative names of objects can be concatenated to derive their Distinguished Name (Dn), providing a unique address in the MIT for a specific object. For example, an fvTenant is contained by polUni (Policy Universe), and polUni is contained by topRoot. Concatenating the Rns for each of these from the top down yields a Dn of "uni/tn-Cisco". Note that topRoot is always implied and does not appear in the Dn.
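As a small illustrative sketch in plain Python string handling (not an SDK call), the Rn and Dn construction just described can be reproduced as follows:
# relative name: prefix + naming property
rn = 'tn-' + 'Cisco'            # -> 'tn-Cisco'
# distinguished name: concatenate Rns from the top of the MIT down (topRoot is implied)
dn = '/'.join(['uni', rn])      # -> 'uni/tn-Cisco'
print(dn)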

Queries

With all of this information neatly organized, it’s possible to perform a number of tree based operations, including
searching, traversal, insertion and deletion. One of the most common operations is a search to query information from
the MIT.
The following types of queries are supported:
• Class-level query: Search the MIT for objects of a specific class
• Object-level query: Search the MIT for a specific Dn
Each of these query types supports a plethora of filtering and subtree options, but the primary difference is how each
type is utilized.
A class-based query is useful for searching for a specific type of information, without knowing the details, or not all
of the details. Since a class-based query can return 0 or many results, it can be a helpful way to query the fabric for
information where the full details are not known. A class-based query combined with filtering can be a powerful tool
to extract data from the MIT. As a simple example, a class-based query can be used to find all fabric nodes that are
functioning as leafs, and extract their serial numbers, for a quick way to get a fabric inventory.
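For instance, that fabric inventory could be gathered with a single filtered class query; the role and serial attributes of fabricNode are assumptions here and can be confirmed in the APIC Model Reference:
https://<ip address>/api/node/class/fabricNode.json?query-target-filter=eq(fabricNode.role,"leaf")
Each fabricNode object returned carries the node's serial number among its attributes.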
An object based (Dn based) query returns zero or one match, and the full Dn for an object must be provided for a
match to be found. Combined with an initial class query, a Dn query can be helpful for finding more details on an
object referenced from another, or as a method to update a local copy of information.
Both query types support tree-level queries with scopes and filtering. This means that the MIT can be queried for all
objects of a specific class or Dn, and then retrieve the children or complete subtree for the returned objects. Further-
more, the data sets can be filtered to only return specific information that is interesting to the purpose at hand.
The next section on the REST API covers more details about how to build and execute these queries.
APIC REST API

This section provides a brief overview of the REST API, however a more exhaustive description can be found in the
Cisco APIC REST API User Guide document on Cisco.com
The APIC REST API is a programmatic interface to the Application Policy Infrastructure Controller (APIC) that uses
a Representational State Transfer (REST) architecture. The API accepts and returns HTTP or HTTPS messages that
contain JavaScript Object Notation (JSON) or Extensible Markup Language (XML) documents. Any programming
language can be used to generate the messages and the JSON or XML documents that contain the API methods or
managed object (MO) descriptions.
The REST API is the interface into the MIT and allows for manipulation of the object model state. The same REST
interface is utilized by the APIC CLI, GUI and SDK, so that whenever information is displayed it is read via the
REST API and when configuration changes are made, they are written via the REST API. In addition to configuration
changes, the REST API also provides an interface by which other information can be retrieved, including statistics,
faults, audit events and even provide a means of subscribing to push based event notification, so that when a change
occurs in the MIT, an event can be sent via a Web Socket.
Standard REST methods are supported on the API, which includes POSTs, GETs and DELETE operations through
the HTTP protocol. The following table shows the actions of each of these and the behavior in case of multiple
invocations.

Figure 3: REST HTTP(S) based CRUD methods


The POST and DELETE methods are idempotent meaning that they have no additional effect if called more than once
with the same input parameters. The GET method is nullipotent, meaning that it can be called 0 or more times without
making any changes (or that it is a read-only operation).

Payload Encapsulation

Payloads to and from the REST interface can be encapsulated via either XML or JSON encodings. In the case of
XML, the encoding operation is simple: the element tag is the name of the package and class, and any properties
of that object are specified as attributes on that element. Containment is defined by creating child elements. The
following example shows a simple XML body defining a tenant, application profile, EPG and static port attachment.
XML Managed Object Definition:
<polUni>
<fvTenant name="NewTenant">
<fvAp name="NewApplication">
<fvAEPg name="WebTier">
<fvRsPathAtt encap="vlan-1" mode="regular" tDn="topology/pod-1/paths-101/pathep-[eth1/1]"/>
</fvAEPg>
</fvAp>
</fvTenant>
</polUni>
For JSON, the encoding requires the definition of certain entities to reflect the tree based hierarchy; the pattern is repeated at all levels of the tree, so it is fairly simple once initially understood.
1. All objects are described as JSON dictionaries, where the key is the name of the package and class, and the value is another nested dictionary with two keys: attributes and children.
2. The attributes key contains a further nested dictionary of key/value pairs defining the attributes on the object. The children key contains a list that defines all of the child objects.
3. The children in this list are dictionaries containing any nested objects, which are defined as described in (1).
The following example shows the XML defined above, in JSON format.
JSON Managed Object Definition:
{
"polUni": {
"attributes": {},
"children": [
{
"fvTenant": {
"attributes": {
"name": "NewTenant"
},
"children": [
{
"fvAp": {
"attributes": {
"name": "NewApplication"
},
"children": [
{
"fvAEPg": {
"attributes": {
"name": "WebTier"
},
"children": [
{
"fvRsPathAtt": {
"attributes": {
"mode": "regular",
"encap": "vlan-1",
"tDn": "topology/pod-1/paths-101/pathep-[eth1/1]"
}
}
}
]
}
}
]
}
}
]
}
}
]
}
}

Both the XML and JSON have been pretty printed to simplify visual understanding. Practically, it would make sense
to compact both of them before exchanging with the REST interface, however it will make no functional impact. In
the cases of the object examples shown here, the compacted XML results in 213 bytes of data, and the compacted
JSON results in 340 bytes of data.

Read Operations

Once the object payloads are properly encoded as XML or JSON, they can be used in Create, Read, Update or Delete (CRUD) operations on the REST API.

Since the REST API is HTTP based, defining the URI to access a certain resource type is important. The first two
sections of the request URI simply define the protocol and access details of the APIC. Next in the request URI is the
literal string “/api” indicating that the API will be invoked. Generally read operations will be for an object or class, as
discussed earlier, so the next part of the URI defines if it will be for a “mo” or “class”. The next component defines
either the fully qualified Dn being queried for object based queries, or the package and class name for class-based
queries. The final mandatory part of the request URI is the encoding format, either .XML or .JSON. This is the only
method by which the payload format is defined (Content-Type and other headers are ignored by APIC).
The next optional part of the request URI is the query options, which can specify various types of filtering, which are
explained extensively in the REST API User Guide.
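For instance, the two request forms described below might look as follows; the tenant and application profile names in the first URI are illustrative placeholders:
https://<ip address>/api/mo/uni/tn-Prod/ap-Commerce/epg-Download.json
https://<ip address>/api/class/l1PhysIf.xml?query-target-filter=eq(l1PhysIf.speed,"10G")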
In the examples shown above, the first is an object-level query, where an EPG named Download is queried. The second shows how all objects of class l1PhysIf can be queried, with the results filtered to only show those where the speed attribute is equal to 10G. For a complete reference to the different objects, their properties and possible values, please refer to the Cisco APIC API Model Documentation.

Write Operations

Create and update operations to the REST API are actually both implemented using the POST method, so that if an
object does not already exist it will be created, and if it does already exist, it will be updated to reflect any changes
between its existing state and desired state.
Both create and update operations can contain complex object hierarchies, so that a complete tree can be defined within
a single command, so long as all objects are within the same context root and they are under the 1MB limit for data
payloads to the REST API. This limit is in place to guarantee performance and protect the system under high load.
The context root helps define a method by which the APIC distributes information to multiple controllers and ensures consistency. For the most part it should be transparent to the user, though very large configurations may need to be broken up into smaller pieces if they result in a distributed transaction.

Create/Update operations follow the same syntax as read operations, except that they will always be targeted at an
object level because changes cannot be made to every object of a specific class. The create/update operation should
target a specific managed object, so the literal string “/mo” indicates that the Dn of the managed object will be provided,
followed next by the actual Dn. Filter strings can be applied to POST operations, to retrieve the results of a POST in
the response, for example, pass the rsp-subtree=modified query string to indicate that the response should include any
objects that have been modified by the POST.
The payload of the POST operation will contain the XML or JSON encoded data representing the managed object
defining the API command body.

Authentication

Authentication to the REST API for username/password-based authentication uses a special subset of request URIs,
including aaaLogin, aaaLogout and aaaRefresh as the Dn target of a POST operation. The payload is a simple XML or JSON body containing the MO representation of an aaaUser object with attributes name and pwd defining
the username and password, for example: <aaaUser name=’admin’ pwd=’insieme’/>. The response to the POSTs
will contain an authentication token as both a Set-Cookie header as well as an attribute to the aaaLogin object in the
response named token, for which the XPath is /imdata/aaaLogin/@token if encoded as XML. Subsequent operations
on the REST API can use this token value as a Cookie named “APIC-cookie” to have future requests authenticated.

Filters

The REST API supports a wide range of flexible filters, useful for narrowing the scope of a search so that information can be located more quickly. The filters themselves are appended as query URI options, started with a question mark (?) and concatenated with an ampersand (&). Multiple conditions can be joined together to form complex filters.
The Cisco APIC RESTful API User Guide covers the specifics of how to use filters and their syntax in great detail, and provides examples. Some of the tools covered below can be used to learn how to build a query string, as well as to uncover the query strings being used by the native APIC interface and build on top of those to create advanced filters.
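As a hedged example, assuming a tenant named Cisco exists, the following request joins two query options with an ampersand to scope the response to the EPGs contained in that tenant's subtree:
https://<ip address>/api/mo/uni/tn-Cisco.json?query-target=subtree&target-subtree-class=fvAEPg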

Browser

The MIT contains multitudes of valuable data points. Being able to browse that data can expose new ways to use the
data, aid in troubleshooting, and inspect the current state of the object store. One of the available tools for browsing
the MIT is called “visore” and is available on the APIC. Visore supports querying by class and object, as well as easily
navigating the hierarchy of the tree.
In order to access Visore, open https://<apic>/visore.html in a web browser and authenticate with credentials for the APIC. Once logged in, an initial set of data will be visible, and filtered search fields are available at the top of the screen. Within the "Class or DN" text input field, enter the name of a class or a Dn, e.g. "fabricNode" or "topology/pod-1/node-1"; press the "Run Query" button and press OK when prompted to continue without a filter. Depending on the input string, the results will be either a list of the nodes in the fabric or information for the first APIC.
In the list of attributes for the objects, the Dn will have a set of icons next to it.

The green arrows can be used for navigating up and down the tree, where pressing the left arrow will navigate to the
parent of the object and the right arrow will navigate to a list of all children of the current object. The black staggered
bars will display any statistics that are available for the object. If none are available, the resulting page will not contain
any data. The red octagon with exclamation point will show any faults that are present on the current object and finally
the blue circle with the letter H will show the health score for the object, if one is available.
These tools provide access to all types of information in the MIT, and Visore can additionally be used to structure query strings. For example, enter "fabricNode" as the class, "id" for the property and "1" in the field labeled Val1, leave the Op value at "==", and execute the query to filter the class results to just those with an id equal to 1. Note that Visore does not expose the complete list of filters supported by the REST API, but it can be a useful starting point.
Visore also provides the URI of the last query and the response body, and the data can be seen not only in a tabular format but also as the natively encoded payload. This allows for quick access to determine the request URI for a class or Dn based query, and to see what the XML body of the response looks like.

API Inspector

All operations that are made through the GUI will invoke REST calls to fetch and commit the information being
accessed. The API Inspector further simplifies the process of examining what is taking place on the REST interface
as the GUI is navigated by displaying in real time the URIs and payloads. When new configuration is committed, API
inspector will display the resulting POST requests, and when information is displayed on the GUI, the GET request
will be displayed.
To get started with API inspector, access it from the account menu, visible in the top right of the APIC GUI. Click on
“welcome, <username>” and then select the “Show API Inspector” option, as shown in the figure below.
Once the API Inspector is brought up, timestamps will be seen along with the REST method, URIs, and payloads.
Occasional updates may also be seen in the list as the GUI refreshes subscriptions to data being shown on the screen.

From the example output shown above, it can be seen that the last logged item has a POST with the JSON payload
containing a tenant named Cisco, and some attributes defined on that object.
POST
url: http://172.23.3.215/api/node/mo/uni/tn-Cisco.json
{
"fvTenant": {
"attributes": {
"name": "Cisco",
"status": "created"
},
"children": []
}
}

ACI Software Development Kit (SDK)

The ACI Python SDK is named Cobra, and is a Python implementation of the API that provides native bindings for all of the REST functions. Cobra also has a complete copy of the object model so that data integrity can be ensured, and it provides methods for performing lookups and queries as well as object creation, modification and deletion, matching the REST methods leveraged by the GUI and those that can be found using the API Inspector. As a result, policy created in the GUI can be used as a programming template for rapid development.
The installation process for Cobra is straightforward, using standard Python distribution utilities. It is currently dis-
tributed as an egg and can be installed using easy_install. Please reference the APIC Python API Documentation for
full details on installing Cobra on a variety of operating systems.

Establishing a Session

The first step in any code that will use Cobra is to establish a login session. Cobra currently supports username and
password based authentication, as well as certificate-based authentication. For this example, we’ll use username and
password based authentication:
import cobra.mit.access
import cobra.mit.session

apicUri = 'https://10.0.0.2'
apicUser = 'username'
apicPassword = 'password'

ls = cobra.mit.session.LoginSession(apicUri, apicUser, apicPassword)


md = cobra.mit.access.MoDirectory(ls)
md.login()

This will provide an MoDirectory object named md that is logged in and authenticated to an APIC. If for some reason the script is unable to authenticate, it will get a cobra.mit.request.CommitError exception from Cobra. Once a session is allocated, the script can move forward.

Working with Objects

Utilizing the Cobra SDK to manipulate the MIT generally follows the workflow:
1. identify object to be manipulated
2. build a request to change attributes, add or remove children
3. commit changes made to that object
For example, to create a new Tenant, the location where the tenant will be placed in the MIT must first be identified. In this case it will be a child of the Policy Universe object:
import cobra.model.pol
polUniMo = cobra.model.pol.Uni('')

With the policy universe Mo object defined, it is possible to create a tenant object as a child of polUniMo:
import cobra.model.fv
tenantMo = cobra.model.fv.Tenant(polUniMo, 'cisco')

Since all of these operations have only resulted in Python objects being created, the configuration must be committed in order to apply it. This is done using an object called a ConfigRequest. A ConfigRequest acts as a container for Managed Object based classes that fall into a single context, which can all be committed in a single atomic POST.
import cobra.mit.request
config = cobra.mit.request.ConfigRequest()
config.addMo(tenantMo)
md.commit(config)

The ConfigRequest is created, then the tenantMo is added to the request, and finally the request is committed through the MoDirectory.
For the above example, in the first step a local copy of the polUni object is built. Since it does not have any naming properties (reflected above by the empty single quotes), there is no need to look it up in the MIT to figure out what the full Dn for the object is, since it is always known as "uni". If something deeper in the MIT needs to be posted, where the object has naming properties, a lookup needs to be performed for that object. As an example, to post a configuration to an existing tenant, it is possible to query for that tenant, and create objects beneath it.
tenantMo = md.lookupByClass('fvTenant', propFilter='eq(fvTenant.name, "cisco")')
tenantMo = tenantMo[0] if tenantMo else None

The resulting tenantMo object will be of class cobra.model.fv.Tenant, and contain properties such as .dn, .status, .name,
etc, all describing the object itself. lookupByClass() returns an array, since it can return more than one object. In this
case, the propFilter is specifying a fvTenant with a particular name. For a tenant, the name attribute is a special type
of attribute called a naming attribute. The naming attribute is used to build the relative name, which must be unique
within its local namespace. As a result of this, it is guaranteed that lookupByClass on an fvTenant with a filter on the name will always either return an array of length 1 or None, meaning nothing was found. The specific naming attributes and other attributes can be looked up in the APIC Model Reference document.
Another method that avoids a lookup entirely is to build a Dn object and make the new object a child of that Dn. This
only works in cases where the parent object already exists.
import cobra.mit.naming

topDn = cobra.mit.naming.Dn.fromString('uni/tn-cisco')
fvAp = cobra.model.fv.Ap(topDn, name='AppProfile')
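
The pieces above can be combined into one short script. The following is a minimal end-to-end sketch; the tenant,
application profile and EPG names ('cisco', 'AppProfile', 'WebTier') are illustrative placeholders:
import cobra.mit.access
import cobra.mit.request
import cobra.mit.session
import cobra.model.fv
import cobra.model.pol

ls = cobra.mit.session.LoginSession('https://10.0.0.2', 'username', 'password')
md = cobra.mit.access.MoDirectory(ls)
md.login()

# build the tenant subtree locally under the policy universe
polUniMo = cobra.model.pol.Uni('')
tenantMo = cobra.model.fv.Tenant(polUniMo, 'cisco')
apMo = cobra.model.fv.Ap(tenantMo, name='AppProfile')
epgMo = cobra.model.fv.AEPg(apMo, name='WebTier')

# adding the tenant to the ConfigRequest posts it together with its children
config = cobra.mit.request.ConfigRequest()
config.addMo(tenantMo)
md.commit(config)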

These fundamentals of interacting with Cobra will provide the building blocks necessary to create more complex
workflows that will aid in the process of automating network configuration, troubleshooting and management.

APIC REST to Python Adapter

The process of building a request by hand can be time consuming: the desired object changes must be represented as
a Python payload. Given that the Cobra SDK is directly modeled on the ACI object model, it should be possible to
generate that code directly from what resides in the object model. This is possible using a tool developed by Cisco
Advanced Services named Arya, short for APIC REST to Python Adapter.
The diagram above shows how input from API Inspector, Visore, or even the output of a REST query can be quickly
converted into Cobra SDK code, which can then be tokenized and re-used in more advanced ways. Installing Arya is
relatively simple and it has minimal external dependencies; it requires Python 2.7.5 and git. The following quick
installation steps will install Arya and place it in the system Python.
git clone https://github.com/datacenter/ACI.git
cd ACI/arya
sudo python setup.py install

After installation of Arya has completed, it is possible to take XML or JSON representing ACI modeled objects and
convert them to Python code quickly. For example:
arya.py -f /home/palesiak/simpletenant.xml

Will yield the following Python code:


#!/usr/bin/env python
'''
Autogenerated code using /private/tmp/ACI/arya/lib/python2.7/site-packages/arya-1.0.0-py2.7.egg/EGG-I
Original Object Document Input:
<fvTenant name='bob'/>
'''
raise RuntimeError('Please review the auto generated code before ' +
                   'executing the output. Some placeholders will ' +
                   'need to be changed')

# list of packages that should be imported for this code to work


import cobra.mit.access
import cobra.mit.session
import cobra.mit.request
import cobra.model.fv
import cobra.model.pol
from cobra.internal.codec.xmlcodec import toXMLStr

# log into an APIC and create a directory object


ls = cobra.mit.session.LoginSession('https://1.1.1.1', 'admin', 'password')
md = cobra.mit.access.MoDirectory(ls)
md.login()

# the top level object on which operations will be made


topMo = cobra.model.pol.Uni('')

# build the request using cobra syntax


fvTenant = cobra.model.fv.Tenant(topMo, name='bob')

# commit the generated code to APIC


print toXMLStr(topMo)
c = cobra.mit.request.ConfigRequest()
c.addMo(topMo)
md.commit(c)

The placeholder raising a RuntimeError must be removed before this code can be executed; it is purposely put in
place to ensure that any tokenized values that must be updated are corrected first. For example, the APIC IP defaulting
to 1.1.1.1 should be updated to reflect the actual APIC IP address. The same applies to the credentials and other
possible placeholders.
Note that if the input is XML or JSON that does not have a fully qualified hierarchy, it may be difficult or impossible
for Arya to determine it through heuristics. In this case, a placeholder of "REPLACEME" will be populated in the
text. This placeholder will need to be replaced with the correct distinguished names (Dns), which can be found by
querying for the object in Visore or by inspecting the request URI for the object shown in API Inspector.

Conclusion

With an understanding of how ACI network and application information is represented, how to interact with that data,
and a grasp of the SDK, it becomes straightforward to create powerful programs that simplify professional tasks and
introduce higher levels of automation. Mastering the MIT and the Cobra SDK, and leveraging Arya to streamline
operational workflows, is just the beginning of leveraging ACI in ways that increase value to the business and its
stakeholders.

13.2 Problem Description

Errors from the REST API do not generally generate faults on the system; they are returned directly to the source
of the request. The APIC that received the request keeps logs that can be examined to see what was queried and, if
an error occurred, what may have caused it. In 1.0(1e), /var/log/dme/log/nginx.bin.log on the APIC tracks requests
coming to the APIC and shows specific types of errors. In later versions the nginx error.log and access.log are
available at /var/log/dme/nginx/.

Symptom 1

Message “Connection refused” is presented when trying to connect to APIC over HTTP using the REST API.

Verification 1

By default, the HTTP port is disabled on the APICs and HTTPS is enabled.

Resolution

• HTTPS can be used, or the communication policy can be changed to enable HTTP. However, please be aware
that the APIC ships in the most secure mode possible.

Symptom 2

REST API returns an error similar to, “Invalid DN [Dn] wrong rn prefix [Rn] at position [position]” or “Request failed,
unresolved class for [string]”

Verification

The REST API uses the Uniform Resource Identifier (URI) to determine what to configure (for a POST) or what to
return to the requester (for a GET).
For a GET of a class, if the APIC is unable to resolve that URI back to a valid class the error that the APIC returns
is the “unable to resolve the class” error. Please refer back to the APIC Management Information Model Reference
documentation to verify the name of the class.
For a GET or POST for a managed object, if the APIC is unable to resolve the distinguished name, the APIC will return
the error about an invalid DN and will specify which Rn is the problem. This information can be used to determine
which part of the distinguished name caused the failure. Please refer back to the APIC Management Information
Model Reference documentation to verify the structure of the distinguished name if needed. It is also possible to use
Visore to traverse the object store on the APIC and see which distinguished names exist.

Resolution

To use the REST API, a fully qualified distinguished name must be used for GET requests or POSTs to URIs starting
with "/api/mo/", and the class name must be used when making GET requests to URIs starting with "/api/class/".

Symptom 3

The REST API returns the error “Token was invalid (Error: Token timeout)”

Verification

The REST API requires that a login is refreshed periodically. When logging in using the aaaLogin request, the response
includes a refreshTimeoutSeconds attribute that defines how long the login cookie will remain valid. The cookie must
be refreshed using a GET to api/aaaRefresh.xml or api/aaaRefresh.json prior to that timeout period. By default the
timeout period is 300 seconds. If the token is not refreshed, it will expire and the REST API will return the token
invalid error.

Resolution

Refresh the token by using the aaaRefresh API before the token expires or get a new token by simply logging in again.
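A minimal sketch of keeping a token alive with the Python requests library follows; the APIC address and credentials
are placeholders, and the refresh interval is read from the aaaLogin response as described above:
import requests

apic = 'https://10.0.0.2'
s = requests.Session()
s.verify = False

resp = s.post(apic + '/api/aaaLogin.json',
              json={'aaaUser': {'attributes': {'name': 'username', 'pwd': 'password'}}})
attrs = resp.json()['imdata'][0]['aaaLogin']['attributes']
timeout = int(attrs['refreshTimeoutSeconds'])  # 300 seconds by default
print('refresh the token at least every %s seconds' % timeout)

# refresh the session cookie before the timeout expires, e.g. from a periodic timer
s.get(apic + '/api/aaaRefresh.json')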

Symptom 4

The REST API returns the error, “Failed to update multiple items in a single operation - request requires distributed
transaction. Please modify request to process each item individually”

Verification

When a POST transaction is sent to the REST API, ensure that the POST does not contain managed objects from
different parts of the management information tree that may be managed by different APICs (the tree is distributed
across the APIC cluster). This can be difficult to ensure if a large transaction is created by simply generating a huge
configuration and committing it all at once. However, if the POST is limited to objects within the same package it
is generally possible to avoid this issue. For example, the infraInfra object should not be included when doing a
POST to api/uni.xml or api/uni.json for an fvTenant object.

Resolution

Break up the REST API POST such that the request does not cover classes outside of packages at equal or higher levels
of the management information tree. Please see the APIC Management Information Model Reference documentation
for more information about the class hierarchy.
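For scripted configuration, the same principle applies when using the Cobra SDK: commit unrelated subtrees in
separate ConfigRequests. The following is a hedged sketch, assuming an authenticated MoDirectory named md as in
the SDK examples earlier in this chapter:
import cobra.mit.request
import cobra.model.fv
import cobra.model.infra
import cobra.model.pol

uni = cobra.model.pol.Uni('')

# first POST: only the tenant subtree
tenantReq = cobra.mit.request.ConfigRequest()
tenantReq.addMo(cobra.model.fv.Tenant(uni, 'cisco'))
md.commit(tenantReq)

# second POST: only the infra subtree (infraInfra), kept separate from the tenant
infraReq = cobra.mit.request.ConfigRequest()
infraReq.addMo(cobra.model.infra.Infra(uni))
md.commit(infraReq)
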
Symptom 5

The REST API reports either “incomplete node at line [line number]” or “Invalid request. Cannot contain child [child]
under parent [parent]” for POST requests.

Verification

The REST API requires that the objects sent in a POST are well formed. XML objects simply require that the proper
containment rules be followed, the proper attributes be included, and the XML be well formed. Managed objects that
are contained by other managed objects in the management information tree need to be contained in the same way
in the XML POST. For example, the following POST would fail because fvAp cannot be contained by fvAEPg:
<?xml version="1.0"?>
<polUni>
<fvTenant name="NewTenant">
<fvAEPg name="WebTier">
<!-- This is wrong -->
<fvAp name="NewApplication" />
<fvRsPathAtt encap="vlan-1" mode="regular" tDn="topology/pod-1/paths-101/pathep-[eth1/1]" />
</fvAEPg>
</fvTenant>
</polUni>

For JSON POSTs it becomes a little more involved: the JSON mirrors the XML structure, so XML attributes become
an "attributes" field and child objects become a "children" array in the JSON. The children and attributes have to be
explicitly specified. For example:
{
    "polUni": {
        "attributes": {},
        "children": [
            {
                "fvTenant": {
                    "attributes": {
                        "name": "NewTenant"
                    }
                }
            }
        ]
    }
}

However, if the attributes and children specifications are not included - the most common situation is forgetting them
when an object has no attributes - the REST API will return the error about an incomplete node. This is an example
of a poorly formed JSON query:
{
    "polUni": {
        "fvTenant": {
            "attributes": {
                "name": "NewTenant"
            }
        }
    }
}
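One way to avoid hand-building the wrappers is to construct the payload programmatically, for example with Python's
json module; this small sketch reuses the same illustrative tenant name as above:
import json

tenant = {'fvTenant': {'attributes': {'name': 'NewTenant'}, 'children': []}}
payload = {'polUni': {'attributes': {}, 'children': [tenant]}}
print(json.dumps(payload, indent=2))
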
Resolution

Ensure that the REST API query is properly formed.

Symptom 6

The web server returns a 200 response, but the body contains an error stating 400 Bad Request.

Verification

This will happen if the header or request sent to the webserver is malformed in such a way that the request cannot be
parsed by the web server. For example:
>>> import httplib
>>> conn = httplib.HTTPConnection("10.122.254.211")
>>> conn.request("get", "api/aaaListDomains.json")
>>> r1 = conn.getresponse()
>>> data1 = r1.read()
>>> print r1.status, r1.reason
200
>>> print data1
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx/1.4.0</center>
</body>
</html>

>>>

In this case the method used is "get", all lower case. The web server that the APIC uses requires methods to be all
upper case: "GET".

Resolution

Ensure that the header and request are not malformed in any way and conform to common web standards and practices.
Enable debugging on the client side to inspect the headers being sent and received.
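For comparison, a corrected sketch of the same request uses the method in upper case and a leading slash on the path;
note that if HTTP is disabled (the default, as described in Symptom 1), httplib.HTTPSConnection on port 443 would
be needed instead:
import httplib

conn = httplib.HTTPConnection("10.122.254.211")
conn.request("GET", "/api/aaaListDomains.json")  # upper-case method, leading slash
r1 = conn.getresponse()
print r1.status, r1.reason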

14 Management Tenant
• Overview
– Fabric Management Routing
– Out-Of-Band Management
– Inband Management
– Layer 2 Inband Management
– Layer 2 Configuration Notes
– Layer 3
– Layer 3 Inband Configuration Notes
– APIC Management Routing
– Fabric Node (Switch) Routing
– Management Failover
– Management EPG Configuration
– Fabric Verification
• Problem Description
– Symptom
– Verification
– Resolution
• Problem Description
– Symptom
– Verification
– Resolution

14.1 Overview

The management tenant is a pre-defined tenant in the ACI policy model that addresses policies related to inband and
out of band management connectivity of the ACI Fabric.
This chapter presents an overview of how the management tenant functions, the verification steps used to confirm a
working out-of-band management configuration for the example reference topology, and potential issues that relate
to the management tenant. The displays taken on a working fabric can then be used as a reference resource to aid in
troubleshooting issues with the management tenant functions.
The example reference topology that is used only has out-of-band management configured, so in order to show any
inband management functions, all inband management information shown in this book will be captured from another
fabric.

Fabric Management Routing

The ACI fabric provides both in-band and out-of-band management access options. The following paragraphs will
describe the internal system behavior, including routing and failover for the APIC and fabric nodes (switches).

Out-Of-Band Management

Out-of-band (OOB) management provides management communications through configuration of dedicated physical
interfaces on the APICs and fabric nodes (switches). The initial APIC setup script configures the OOB management
IP address through a series of prompts:
Out-of-band management configuration ...
Enter the IP address for out-of-band management: 10.122.254.141/24
Enter the IP address of the default gateway [None]: 10.122.254.1
Enter the interface speed/duplex mode [auto]:
Once the fabric is initialized and discovered, the OOB addresses can be configured for the fabric nodes (switches)
through any of the object model interfaces (GUI, API, CLI).
On the APIC, the OOB configuration creates an interface called oobmgmt. Keep in mind throughout this book that
the APIC is built on a Linux host operating system, so when viewing configuration information on an APIC some of
the output is more Linux-oriented than the traditional Cisco NX-OS/IOS command or output structure. As an example,
to view the oobmgmt interface configuration, connect to the APIC CLI and enter the command ip add show dev
oobmgmt. The dev keyword is the Linux moniker for "device" and the order of the command differs from a traditional
Cisco show command. Below is the output produced by the ip add show dev oobmgmt command:
admin@RTP_Apic2:~> ip add show dev oobmgmt
8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 24:e9:b3:15:dd:60 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.212/24 brd 10.122.254.255 scope global oobmgmt
inet6 fe80::26e9:b3ff:fe15:dd60/64 scope link
valid_lft forever preferred_lft forever
admin@RTP_Apic2:~>

On the fabric nodes (switches), the OOB configuration is applied to the management interface eth0 (aka mgmt0). To
view the eth0 interface configuration, connect to the node CLI, enter the following command and observe the produced
output:
rtp_leaf1# ip add show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 88:f0:31:db:e7:f0 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.241/24 brd 10.122.254.255 scope global eth0
inet6 fe80::8af0:31ff:fedb:e7f0/64 scope link
valid_lft forever preferred_lft forever
rtp_leaf1#

Inband Management

Inband management provides management communications through configuration of one or more front-panel (data
plane) ports on the fabric leaf nodes (switches). Inband management requires a dedicated pool of IP addresses that do
not directly extend outside the fabric. Inband management can be configured in two modes: Layer 2 and Layer 3.

Layer 2 Inband Management

With Layer 2 inband management, the inband management addresses assigned to the APICs and fabric nodes
(switches) are only accessible from networks directly connected to the leaf nodes.
In this model, the fabric inband addresses are not accessible from networks not directly connected to the fabric.

Layer 2 Configuration Notes

• A minimum of 2 VLANs are required
  – 1 for the Inband management EPG
  – 1 for the application EPG mapped to the leaf port providing connectivity outside the fabric
• Configuring a second Bridge Domain (BD) for the application EPG is optional and it is also valid to map the
application EPG to the default BD named 'inb'
• The subnet gateway(s) configured for the BDs are used as next-hop addresses and should be unique host addresses
Layer 3

With Layer 3 inband management, the inband management addresses assigned to the APICs and fabric nodes are
accessible by remote networks by virtue of configuring a L3 Routed Outside network object.

Layer 3 Inband Configuration Notes

• A minimum of 2 VLANs are required
  – 1 for the Inband EPG
  – 1 for the application EPG mapped to the leaf port providing access outside the fabric
• Configuring a second BD for the application EPG is optional - it is also valid to map the application EPG to the
default 'inb' BD
• The subnet gateway(s) configured for the BDs are used as next-hop addresses and should be unique (i.e. unused)
host addresses
Regardless of whether L2 or L3 is used, the encapsulation VLAN used for the Inband EPG is used to create a sub-
interface on the APIC with the name format bond0.<vlan>, where <vlan> is the VLAN configured as the encapsulation
for the Inband EPG. As an example, the following is the output of the APIC CLI command ip add show bond0.10:
admin@fab2_apic1:~> ip add show bond0.10
116: bond0.10@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1496 qdisc noqueue state UP
link/ether 64:12:25:a7:df:3f brd ff:ff:ff:ff:ff:ff
inet 5.5.5.141/24 brd 5.5.5.255 scope global bond0.10
inet6 fe80::6612:25ff:fea7:df3f/64 scope link
valid_lft forever preferred_lft forever
On the fabric nodes, inband interfaces are created as part of the mgmt:inb VRF:
fab2_leaf4# show ip int vrf mgmt:inb
IP Interface Status for VRF "mgmt:inb"
vlan27, Interface status: protocol-up/link-up/admin-up, iod: 128,
IP address: 5.5.5.1, IP subnet: 5.5.5.0/24 <<<<<<<<<<<<<<< BD gateway address
IP address: 5.5.5.137, IP subnet: 5.5.5.137/32 secondary
IP broadcast address: 255.255.255.255
IP primary address route-preference: 1, tag: 0

In the output above, the gateway address(es) for the BD are also configured on the same interface. This is true for all
leaf nodes (switches) that are configured for inband.

APIC Management Routing

The APIC internal networking configuration utilizes the Linux iproute2 utilities, which provide a combination of a
routing policy database and multiple routing tables to implement routing on the controllers. When both inband
and out-of-band management are configured, the APIC uses the following forwarding logic:
1. Packets that come in an interface, go out that same interface
2. Packets sourced from the APIC, destined to a directly connected network, go out the directly connected interface
3. Packets sourced from the APIC, destined to a remote network, prefer inband, followed by out-of-band
An APIC controller always prefers the in-band management interface to the out-of-band management interface as long
as in-band is available. This behavior cannot be changed with configuration. APIC controllers should have two ways
to reach a single management network with inband being the primary path and out-of-band being the backup path.
To view the configured routing tables on the APIC, execute the following command cat /etc/iproute2/rt_tables:
admin@fab2_apic1:~> cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
1 overlay
2 oobmgmt
admin@fab2_apic1:~>

The local and main routing tables are Linux defaults. The local routing table is populated with information from all of
the interfaces configured with IP addresses on the APIC. The overlay, oobmgmt, and ibmgmt routing tables are APIC-
specific and are populated with the relevant routes for each network. The entries from the three APIC-specific routing
tables are used to populate the main routing table. The contents of each routing table can be viewed by using the
command ip route show table <table>. For example:
admin@fab2_apic1:~> ip route show table oobmgmt
default via 10.122.254.1 dev oobmgmt src 10.122.254.141
10.122.254.1 dev oobmgmt scope link src 10.122.254.141
169.254.254.0/24 dev lxcbr0 scope link
admin@fab2_apic1:~>
The decision of which routing table is used for the lookup is based on an ordered list of rules in the routing policy
database. Use the ip rule show command to view the routing policy database.

The main routing table, used for packets originating from the APIC, shows two default routes:
admin@fab2_apic1:~> ip route show
default via 10.122.254.1 dev oobmgmt metric 16
10.0.0.0/16 via 10.0.0.30 dev bond0.4093 src 10.0.0.1
10.0.0.30 dev bond0.4093 scope link src 10.0.0.1
10.122.254.0/24 dev oobmgmt proto kernel scope link src 10.122.254.141
10.122.254.1 dev oobmgmt scope link src 10.122.254.141
169.254.1.0/24 dev teplo-1 proto kernel scope link src 169.254.1.1
169.254.254.0/24 dev lxcbr0 proto kernel scope link src 169.254.254.254
admin@fab2_apic1:~>

The metric 16 on the default route out the oobmgmt interface is what makes the default route via inband (bond0.10)
preferable.

Fabric Node (Switch) Routing

Routing on the fabric nodes (switches) is split between Linux and NX-OS. Unlike the APIC configuration, the routing
table segregation on the fabric nodes is implemented using multiple VRF instances. The configured VRFs on a fabric
node can be viewed by using the show vrf command:
fab2_leaf1# show vrf
VRF-Name VRF-ID State Reason
black-hole 3 Up --
management 2 Up --
mgmt:inb 11 Up --
overlay-1 9 Up --

Although the management VRF exists in the above output, the associated routing table is empty. This is because the
management VRF, which maps to the out-of-band network configuration, is handled by Linux. This means that on the
fabric nodes, the Linux configuration does not use multiple routing tables and the content of the main routing table is
only populated by the out-of-band network configuration.
To view the contents of each VRF routing table in NX-OS, use the show ip route vrf <vrf> command. For example:
fab2_leaf1# show ip route vrf mgmt:inb
IP Route Table for VRF "mgmt:inb"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

3.3.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive
    *via 10.0.224.65%overlay-1, [1/0], 02:31:46, static
3.3.3.1/32, ubest/mbest: 1/0, attached
    *via 3.3.3.1, Vlan47, [1/0], 02:31:46, local
5.5.5.0/24, ubest/mbest: 1/0, attached, direct, pervasive
    *via 10.0.224.65%overlay-1, [1/0], 07:48:24, static
5.5.5.1/32, ubest/mbest: 1/0, attached
    *via 5.5.5.1, Vlan40, [1/0], 07:48:24, local
5.5.5.134/32, ubest/mbest: 1/0, attached

Management Failover

In theory the out-of-band network functions as a backup when inband management connectivity is unavailable on the
APIC. However, the APIC does not run any routing protocol and so will not be able to intelligently fall back to the
OOB interface in case of upstream connectivity issues over inband. The APIC fails over from the inband management
network only in the following scenarios:
• The bond0 interface on the APIC goes down
• The encapsulation configuration on the Inband EPG is removed

Note: The above failover behavior is specific to the APIC; the same failover behavior is unavailable on the fabric
switches because the switches' inband and OOB interfaces belong to two different VRFs.

Management EPG Configuration

Some of the fabric services, such as NTP, DNS, etc., provide the option to configure a Management EPG attribute.
This specifies whether inband or out-of-band is used for communication by these services. This setting only affects
the behavior of the fabric nodes, not the APICs. With the exception of the VM Provider configuration, the APIC
follows the forwarding logic described in the APIC Management Routing section earlier in this chapter. The VM
Provider configuration has an optional Management EPG setting, but it is only able to select an EPG tied to the
In-Band Management EPG.

Fabric Verification

In the following section, output is collected from the reference topology to show a working fabric configuration.
This verification is only for out-of-band management.
Out of Band Verification

The first step is to verify the configuration of the oobmgmt interface on the APIC using the command ip addr show
dev oobmgmt on all three APICs. The interfaces need to be in the up state and the expected IP addresses need to be
assigned with the proper masks. There are several options for connecting to the fabric nodes, but once logged into at
least one of the APICs, the output of show fabric membership shows which node names and VTEP IP addresses can
be used to connect via SSH in order to verify operations.
Verification of the oobmgmt interface address assignment and interface status on RTP_Apic1:
admin@RTP_Apic1:~> ip addr show dev oobmgmt
8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 24:e9:b3:15:a0:ee brd ff:ff:ff:ff:ff:ff
inet 10.122.254.211/24 brd 10.122.254.255 scope global oobmgmt
inet6 fe80::26e9:b3ff:fe15:a0ee/64 scope link
valid_lft forever preferred_lft forever

Verification of the oobmgmt interface address assignment and interface status on RTP_Apic2:
admin@RTP_Apic2:~> ip addr show dev oobmgmt
8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 24:e9:b3:15:dd:60 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.212/24 brd 10.122.254.255 scope global oobmgmt
inet6 fe80::26e9:b3ff:fe15:dd60/64 scope link
valid_lft forever preferred_lft forever
admin@RTP_Apic2:~>

Verification of the oobmgmt interface address assignment and interface status on RTP_Apic3:
admin@RTP_Apic3:~> ip addr show dev oobmgmt
8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 18:e7:28:2e:17:de brd ff:ff:ff:ff:ff:ff
inet 10.122.254.213/24 brd 10.122.254.255 scope global oobmgmt
inet6 fe80::1ae7:28ff:fe2e:17de/64 scope link
valid_lft forever preferred_lft forever
admin@RTP_Apic3:~>

Verification of the fabric node membership and their respective TEP address assignments as seen from RTP_Apic2
(the output would be the same on all controllers in a normal state):
admin@RTP_Apic2:~> show fabric membership
# Executing command: cat /aci/fabric/inventory/fabric-membership/clients/summary

clients:
serial-number node-id node-name model role ip decomissioned supported-model
------------- ------- ---------- ------------ ----- ---------------- ------------- ---------------
SAL1819SAN6 101 rtp_leaf1 N9K-C9396PX leaf 172.16.136.95/32 no yes
SAL172682S0 102 rtp_leaf2 N9K-C93128TX leaf 172.16.136.91/32 no yes
SAL1802KLJF 103 rtp_leaf3 N9K-C9396PX leaf 172.16.136.92/32 no yes
FGE173400H2 201 rtp_spine1 N9K-C9508 spine 172.16.136.93/32 no yes
FGE173400H7 202 rtp_spine2 N9K-C9508 spine 172.16.136.94/32 no yes
admin@RTP_Apic2:~>

To verify OOB management on the fabric switches, use the attach command on the APIC to connect to a fabric switch
via its VTEP address, then execute the ip addr show dev eth0 command on each switch, and again ensure that the
interface state is UP, the IP address and netmask are correct, etc.:
Verification of the OOB management interface (eth0) on fabric node rtp_leaf1:
rtp_leaf1# ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 88:f0:31:db:e7:f0 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.241/24 brd 10.122.254.255 scope global eth0
inet6 fe80::8af0:31ff:fedb:e7f0/64 scope link
valid_lft forever preferred_lft forever
rtp_leaf1#

Verification of the OOB management interface (eth0) on fabric node rtp_leaf2:


rtp_leaf2# ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:22:bd:f8:34:c0 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.242/24 brd 10.122.254.255 scope global eth0
inet6 fe80::222:bdff:fef8:34c0/64 scope link
valid_lft forever preferred_lft forever
rtp_leaf2#

Verification of the OOB management interface (eth0) on fabric node rtp_leaf3:


rtp_leaf3# ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 7c:69:f6:10:6d:18 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.243/24 brd 10.122.254.255 scope global eth0
inet6 fe80::7e69:f6ff:fe10:6d18/64 scope link
valid_lft forever preferred_lft forever
rtp_leaf3#

When looking at the spines, the command used is show interface mgmt0; ensure the proper IP address and netmask
are assigned. Verification of the OOB management interface on rtp_spine1:
rtp_spine1# show int mgmt0
mgmt0 is up
admin state is up,
Hardware: GigabitEthernet, address: 0022.bdfb.f256 (bia 0022.bdfb.f256)
Internet Address is 10.122.254.244/24
MTU 9000 bytes, BW 1000000 Kbit, DLY 10 usec
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, medium is broadcast
Port mode is routed
full-duplex, 1000 Mb/s
Beacon is turned off
Auto-Negotiation is turned on
Input flow-control is off, output flow-control is off
Auto-mdix is turned off
EtherType is 0x0000
1 minute input rate 0 bits/sec, 0 packets/sec
1 minute output rate 0 bits/sec, 0 packets/sec
Rx
256791 input packets 521 unicast packets 5228 multicast packets
251042 broadcast packets 26081550 bytes
Tx
679 output packets 456 unicast packets 217 multicast packets
6 broadcast packets 71294 bytes
rtp_spine1#

To verify that the spine has the proper default gateway configuration, use the command ip route show as seen here for
rtp_spine1:
rtp_spine1# ip route show
default via 10.122.254.1 dev eth6
10.122.254.0/24 dev eth6 proto kernel scope link src 10.122.254.244
127.1.0.0/16 dev psdev0 proto kernel scope link src 127.1.1.27
rtp_spine1#

The same validation for rtp_spine2 looks similar to spine1 as shown:


rtp_spine2# show int mgmt0
mgmt0 is up
admin state is up,
Hardware: GigabitEthernet, address: 0022.bdfb.fa00 (bia 0022.bdfb.fa00)
Internet Address is 10.122.254.245/24
MTU 9000 bytes, BW 1000000 Kbit, DLY 10 usec
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, medium is broadcast
Port mode is routed
full-duplex, 1000 Mb/s
Beacon is turned off
Auto-Negotiation is turned on
Input flow-control is off, output flow-control is off
Auto-mdix is turned off
EtherType is 0x0000
1 minute input rate 0 bits/sec, 0 packets/sec
1 minute output rate 0 bits/sec, 0 packets/sec
Rx
256216 input packets 345 unicast packets 5218 multicast packets
250653 broadcast packets 26007756 bytes
Tx
542 output packets 312 unicast packets 225 multicast packets
5 broadcast packets 59946 bytes
rtp_spine2#

And to see the routing table:


rtp_spine2# ip route show
default via 10.122.254.1 dev eth6
10.122.254.0/24 dev eth6 proto kernel scope link src 10.122.254.245
127.1.0.0/16 dev psdev0 proto kernel scope link src 127.1.1.27
rtp_spine2#

To verify APIC routing use the command cat /etc/iproute2/rt_tables:


admin@RTP_Apic1:~> cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
2 oobmgmt
1 overlay
admin@RTP_Apic1:~>

The output of cat /etc/iproute2/rt_tables tells us that there are two APIC-specific routing tables on our sample
reference topology, one for out-of-band management and one for the overlay.
The next output to verify is the out-of-band management routing table entries using the command ip route show table
oobmgmt . There should be a default route pointed at the default gateway IP address and out-of-band management
interface (dev oobmgmt) with a source IP address that matches the IP address of the out-of-band management interface.
admin@RTP_Apic1:~> ip route show table oobmgmt
default via 10.122.254.1 dev oobmgmt src 10.122.254.211
10.122.254.0/24 dev oobmgmt scope link
169.254.254.0/24 dev lxcbr0 scope link
admin@RTP_Apic1:~>

The next output to verify is the output of ip rule show which shows how the APIC chooses which routing table is used
for the lookup:
admin@RTP_Apic1:~> ip rule show
0: from all lookup local
32762: from 10.122.254.211 lookup oobmgmt
32763: from 172.16.0.1 lookup overlay
32764: from 172.16.0.1 lookup overlay
32765: from 10.122.254.211 lookup oobmgmt
32766: from all lookup main
32767: from all lookup default
admin@RTP_Apic1:~>

Finally, the ip route show command shows how the global routing table is configured on an APIC for out-of-band
management. The oobmgmt metric is 16, which has no impact in this situation, but if an inband management
configuration were applied, the inband route would not have a metric and would be preferred over the out-of-band
management route.
admin@RTP_Apic3:~> ip route show
default via 10.122.254.1 dev oobmgmt metric 16
10.122.254.0/24 dev oobmgmt proto kernel scope link src 10.122.254.213
169.254.1.0/24 dev teplo-1 proto kernel scope link src 169.254.1.1
169.254.254.0/24 dev lxcbr0 proto kernel scope link src 169.254.254.254
172.16.0.0/16 via 172.16.0.30 dev bond0.3500 src 172.16.0.3
172.16.0.30 dev bond0.3500 scope link src 172.16.0.3
admin@RTP_Apic3:~>

To ensure that the VRFs are configured on the fabric nodes, verify with the output of show vrf:
rtp_leaf1# show vrf
VRF-Name VRF-ID State Reason
black-hole 3 Up --
management 2 Up --
overlay-1 4 Up --

rtp_leaf1#

14.2 Problem Description

Can SSH to APIC but cannot reach a fabric node via SSH.

Symptom

All three APIC’s are accessible via the out-of-band management network via SSH, HTTPS, ping, etc. The fabric nodes
are only accessible via ping, but should be accessible via SSH as well.

Verification

The switch opens up ports using the Linux iptables tool. The current state of the tables cannot be viewed without
root access; however, it is still possible to verify which ports are open by running an nmap scan against a fabric node:
Computer:tmp user1$ nmap -A -T5 -PN 10.122.254.241
Starting Nmap 6.46 ( http://nmap.org ) at 2014-10-15 09:29 PDT
Nmap scan report for rtp-leaf1.cisco.com (10.122.254.241)
Host is up (0.082s latency).
Not shown: 998 filtered ports
PORT STATE SERVICE VERSION
179/tcp closed bgp
443/tcp open http nginx 1.4.0
|_http-methods: No Allow or Public header in OPTIONS response (status code 400)
|_http-title: 400 The plain HTTP request was sent to HTTPS port
| ssl-cert: Subject: commonName=APIC/organizationName=Default Company Ltd/stateOrProvinceName=CA/coun
| Not valid before: 2013-11-13T18:43:13+00:00
|_Not valid after: Can't parse; string is "20580922184313Z"
|_ssl-date: 2020-02-18T08:57:12+00:00; +5y125d16h27m34s from local time.
| tls-nextprotoneg:
|_ http/1.1
Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 16.61 seconds

• The output shows that only the bgp and https ports are open, but not ssh, on this fabric node. This indicates that
the policy has not been fully pushed down to the fabric nodes.
• Reviewing the policy on the APIC reveals that the subnet is missing from the External Network Instance Profile:
Resolution

Once a subnet is added in the GUI, the ports are added to the iptables configuration on the fabric nodes, and the node
is then accessible via SSH:
$ ssh admin@10.122.254.241
Password:
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2002-2014, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php
rtp_leaf1#

14.3 Problem Description

Previously reachable APIC or fabric node not reachable via out-of-band management interface.

Symptom

When committing a node management policy change, or when clearing the configuration of a fabric node, decommis-
sioning that fabric node, and re-accepting the node back into the fabric through fabric membership, the out-of-band IP
connectivity to an APIC and/or fabric switch can be lost. In this case rtp_leaf2 and RTP_Apic2 lost IP connectivity via
the out-of-band management interfaces.

Verification

Upon verifying the fabric, scenarios such as overlapping IP addresses between the APIC and a switch can lead to loss
of connectivity:
admin@RTP_Apic2:~> ip add show dev oobmgmt
23: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 24:e9:b3:15:dd:60 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.212/24 brd 10.122.254.255 scope global oobmgmt
inet6 fe80::26e9:b3ff:fe15:dd60/64 scope link
valid_lft forever preferred_lft forever
admin@RTP_Apic2:~>

and the leaf:


rtp_leaf2# ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:22:bd:f8:34:c0 brd ff:ff:ff:ff:ff:ff
inet 10.122.254.212/24 brd 10.122.254.255 scope global eth0
inet6 fe80::222:bdff:fef8:34c0/64 scope link
valid_lft forever preferred_lft forever
rtp_leaf2#
• When checking the fabric node policies, the following is seen on the default Node Management Address policy:
there are no connectivity groups applied to this default policy.

Resolution

When a fabric node joins the fabric, it is randomly assigned an IP address from the pool, and there are a few
activities that can cause the IP address on a node to change. Generally speaking, any activity that causes the APIC or
fabric node to come up from scratch, be removed from the network, or be assigned to a new policy can cause it to
be readdressed.
In situations where a fabric member is simply renumbered there may only be a need to investigate what new IP address
was assigned. In some other rare circumstances where the IP address overlaps with another device, Cisco TAC should
be contacted to investigate further.

15 Common Network Services


• Overview
– Fabric Verification
* DNS
· APIC
· Fabric Nodes
* NTP
· APIC
· Fabric nodes
* DHCP Relay
– APIC
– Fabric nodes
• Problem Description
– Symptom 1
– Verification
– Resolution
– Symptom 2
– Verification
* APIC
* Fabric Nodes
– Resolution
– Symptom 3
– Verification
* APICs
* Fabric Nodes
– Symptom 4
– Verification
* APICs
* Fabric Nodes
– Resolution
• Problem Description
– Symptom 1
– Verification
– Resolution
– Symptom 2
– Verification
– Resolution

15.1 Overview

This chapter covers common network services such as DNS, NTP, and DHCP. A common network service is any
service that can be shared between the fabric nodes or tenants.
These services are handled differently in the way they are configured within the fabric.
• DNS: DNS profiles are configured globally as fabric policies and can then be applied as needed via a DNS label
at the private network (context) level.
• NTP: This is configured as a pod level policy.
• DHCP: DHCP relay is configured at a tenant level.
Fabric Verification

For most common network services configurations, if the management tenant or EPG is not configured or not
working, the shared services policies will not be pushed down to the fabric nodes. The management tenant and EPG
configuration should be verified along with the shared services configuration.

DNS

APIC The DNS policy is verified first by looking at the management information tree, and then by seeing how that
policy is applied to the actual APIC configuration.
Verify that a DNS profile has been created by traversing to the following directory and using the cat command on the
summary file:
/aci/fabric/fabric-policies/global-policies/dns-profiles/default
admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/global-policies/dns-profiles/default
admin@RTP_Apic1:default> cat summary
# dns-profile
name : default
description :
ownerkey :
ownertag :
management-epg : tenants/mgmt/node-management-epgs/default/out-of-band/default

dns-providers:
address preferred
-------------- ---------
171.70.168.183 yes
173.36.131.10 no

dns-domains:
name default description
--------- ------- -----------
cisco.com yes

It should be ensured that:


• The management-epg is pointing at a management EPG distinguished name
• There is at least one dns-provider configured
• There is a dns-domain configured
Next the DNS label verification can be done by changing to the following directory and looking at the following
summary file:
/aci/tenants/mgmt/networking/private-networks/oob/dns-profile-labels/default
admin@RTP_Apic1:default> cd /aci/tenants/mgmt/networking/private-networks/oob/dns-profile-labels/defa
admin@RTP_Apic1:default> cat summary
# dns-lbl
name : default
description :
ownerkey :
ownertag :
tag : yellow-green

When the policies are applied they push the DNS configuration down to Linux on the APIC. That configuration can
be verified by looking at the /etc/resolv.conf file:
admin@RTP_Apic1:default> cat /etc/resolv.conf
# Generated by IFC
search cisco.com

nameserver 171.70.168.183

nameserver 173.36.131.10

The last verification step for the APIC would be to actually resolve a host using the host command and then ping that
host.
admin@RTP_Apic1:default> host www.cisco.com
www.cisco.com is an alias for www.cisco.com.akadns.net.
www.cisco.com.akadns.net is an alias for origin-www.cisco.com.
origin-www.cisco.com has address 72.163.4.161
origin-www.cisco.com has IPv6 address 2001:420:1101:1::a

admin@RTP_Apic1:default> ping www.cisco.com


PING origin-www.cisco.com (72.163.4.161) 56(84) bytes of data.
64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=1 ttl=238 time=29.3 ms
64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=2 ttl=238 time=29.0 ms
^C
--- origin-www.cisco.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1743ms
rtt min/avg/max/mdev = 29.005/29.166/29.328/0.235 ms

Fabric Nodes The policy that is applied needs to be looked at by inspecting the raw management information tree.
Once that is verified, the next step is to look at the DNS configuration that is applied to the fabric node as a result of
that policy.
Verify that a DNS policy is applied by changing to the following directory and listing out the contents:
/mit/uni/fabric/dnsp-default
rtp_leaf1# cd /mit/uni/fabric/dnsp-default
rtp_leaf1# ls -1
dom-cisco.com
mo
prov-[171.70.168.183]
prov-[173.36.131.10]
rsProfileToEpg
rsProfileToEpg.link
rsProfileToEpp
rsProfileToEpp.link
rtdnsProfile-[uni--ctx-[uni--tn-mgmt--ctx-oob]--dnslbl-default]
summary

The following should be seen:


• The DNS providers listed as prov-[ipaddress]
• The DNS domains listed as dom-[domainname]
• The summary file in the rtdnsProfile-... directory has a tDn that points to a valid dnslabel
• The rsProfileToEpg.link should exist and resolve to a valid place in the management information tree
• The rsProfileToEpp.link should exist and resolve to a valid place in the management information tree
Verifying the dnslabel on the fabric node can be done by looking at the summary file in the rtdnsProfile-... directory,
taking the tDn reference, prefacing it with /mit, and cat-ing the summary file in the resulting directory.
rtp_leaf1# cat rtdnsProfile-[uni--ctx-[uni--tn-mgmt--ctx-oob]--dnslbl-default]/summary
# DNS Profile Label
tDn : uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default
childAction :
dn : uni/fabric/dnsp-default/rtdnsProfile-[uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default]
lcOwn : local
modTs : 2014-10-15T14:16:14.850-04:00
rn : rtdnsProfile-[uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default]
status :
tCl : dnsLblDef

rtp_leaf1# cat /mit/uni/ctx-\[uni--tn-mgmt--ctx-oob\]/dnslbl-default/summary


# DNS Profile Label
name : default
childAction :
descr :
dn : uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default
lcOwn : policy
modTs : 2014-10-15T14:16:14.850-04:00
monPolDn :
ownerKey :
ownerTag :
rn : dnslbl-default
status :
tag : yellow-green

The policy that is pushed to the fabric node results in the DNS configuration being applied to Linux. The DNS
configuration can be verified by first looking at /etc/dcos_resolv.conf to verify DNS is enabled and /etc/resolv.conf to
verify how DNS is configured.
rtp_leaf1# cat /etc/dcos_resolv.conf
# DNS enabled
rtp_leaf1# cat /etc/resolv.conf
search cisco.com
nameserver 171.70.168.183
nameserver 173.36.131.10

On the fabric nodes, the host command is not available so ping is the best way to try and resolve a host.
rtp_leaf1# ping www.cisco.com
PING origin-www.cisco.com (72.163.4.161): 56 data bytes
64 bytes from 72.163.4.161: icmp_seq=0 ttl=238 time=29.153 ms
64 bytes from 72.163.4.161: icmp_seq=1 ttl=238 time=29.585 ms
^C--- origin-www.cisco.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 29.153/29.369/29.585/0.216 ms

NTP

Note: NTP can be configured with either an IP address or a hostname, but when configured with a hostname DNS
must be configured in order to resolve the hostname.
APIC NTP policies are applied globally by first applying a global pod-selector policy which points to a policy-group.
This can be verified by changing to /aci/fabric/fabric-policies/pod-policies/pod-selector-default-all and viewing the
summary file. In this case the policy-group is set to RTPFabric1:
admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/pod-selector-default-all
admin@RTP_Apic1:pod-selector-default-all> cat summary
# pod-selector
name : default
type : all
description :
ownerkey :
ownertag :
fabric-policy-group : fabric/fabric-policies/pod-policies/policy-groups/RTPFabric1

Make note of the policy-group name, RTPFabric1.


The Pod policy-group can be verified by changing to the directory /aci/fabric/fabric-policies/pod-policies/policy-
groups/ and viewing the summary file:
admin@RTP_Apic1:pod-policies> cd /aci/fabric/fabric-policies/pod-policies/policy-groups/
admin@RTP_Apic1:policy-groups> cat summary
policy-groups:
name date-time-policy isis-policy coop-group-policy bgp-route-reflector- communication-pol
policy
---------- ---------------- ----------- ----------------- -------------------- -----------------
RTPFabric1 ntp.esl.cisco.com default default default default

Ensure that the date-time-policy is pointed at the proper date-time-policy name


Verify that an NTP policy has been created by traversing to the following directory and using the cat command on the
summary file for the specific date-time policy configured:
/aci/fabric/fabric-policies/pod-policies/policies/date-and-time/
admin@RTP_Apic1:> cd /aci/fabric/fabric-policies/pod-policies/policies/date-and-time/
admin@RTP_Apic1:> cat date-and-time-policy-ntp.esl.cisco.com/summary
# date-and-time-policy
name : default
description :
administrative-state : enabled
authentication-state : disabled
ownerkey :
ownertag :
ntp-servers:
host-name-ip-address preferred minimum-polling- maximum-polling- management-epg
interval interval
-------------------- --------- ---------------- ---------------- ---------------------
ntp.esl.cisco.com yes 4 6 tenants/mgmt/
node-management-epgs/
default/out-of-band/
default

• Ensure the administrative state is enabled
• Ensure the ntpserver is shown
• Ensure the management-epg is shown and resolves to a valid management EPG.
• When the NTP policy is applied on the APIC, it is pushed down to Linux as an NTP configuration. This can be
verified using the ntpstat command:
admin@RTP_Apic1:date-and-time> ntpstat
synchronised to NTP server (171.68.38.66) at stratum 2
time correct to within 952 ms
polling server every 64 s

• The NTP server should be synchronized.
• The proper NTP server should be seen listed.
• Netstat can also be checked on the APIC to ensure that the APIC is listening on UDP port 123:
admin@RTP_Apic1:date-and-time> netstat -anu | grep :123
udp 0 0 172.16.0.1:123 0.0.0.0:*
udp 0 0 10.122.254.211:123 0.0.0.0:*
udp 0 0 169.254.1.1:123 0.0.0.0:*
udp 0 0 169.254.254.254:123 0.0.0.0:*
udp 0 0 127.0.0.1:123 0.0.0.0:*
udp 0 0 0.0.0.0:123 0.0.0.0:*
udp 0 0 ::1:123 :::*
udp 0 0 fe80::92e2:baff:fe4b:fc7:123 :::*
udp 0 0 fe80::38a5:a2ff:fe9a:4eb:123 :::*
udp 0 0 fe80::f88d:a5ff:fe4c:419:123 :::*
udp 0 0 fe80::ce7:b9ff:fe50:4481:123 :::*
udp 0 0 fe80::3c79:62ff:fef0:214:123 :::*
udp 0 0 fe80::26e9:b3ff:fe15:a0e:123 :::*
udp 0 0 fe80::e89f:1dff:fedf:1f6:123 :::*
udp 0 0 fe80::f491:1ff:fe9f:f1de:123 :::*
udp 0 0 fe80::dc2d:dfff:fe88:20d:123 :::*
udp 0 0 fe80::e4cb:caff:feec:5bd:123 :::*
udp 0 0 fe80::a83d:1ff:fe54:597:123 :::*
udp 0 0 fe80::8c71:63ff:feb2:f4a:123 :::*
udp 0 0 :::123 :::*

Fabric nodes Verify that an NTP policy has been created by traversing to the following directory, using the cat
command on the summary file, and listing out the directory:
/mit/uni/fabric/time-default
rtp_leaf1# cd /mit/uni/fabric/time-default
rtp_leaf1# cat summary
# Date and Time Policy
name : default
adminSt : enabled
authSt : disabled
childAction :
descr :
dn : uni/fabric/time-default
lcOwn : resolveOnBehalf
modTs : 2014-10-15T13:11:19.747-04:00
monPolDn : uni/fabric/monfab-default
ownerKey :
ownerTag :
rn : time-default
status :
uid : 0
rtp_leaf1#
rtp_leaf1# ls -1
issues
mo
ntpprov-10.81.254.202
rtfabricTimePol-[uni--fabric--funcprof--podpgrp-RTPFabric1]
summary

• Ensure the adminSt is enabled
• Ensure the ntpprov-* directory is for the proper NTP provider.
• When the NTP policy is pushed to the fabric node, it resolves to an NTP configuration in Linux that gets applied.
It can be verified using both the show ntp peers and show ntp peer-status commands:
rtp_leaf1# show ntp peers
--------------------------------------------------
Peer IP Address Serv/Peer
--------------------------------------------------
10.81.254.202 Server (configured)
rtp_leaf1# show ntp peer-status
Total peers : 1
* - selected for sync, + - peer mode(active),
- - peer mode(passive), = - polled in client mode
remote local st poll reach delay vrf
-------------------------------------------------------------------------------
*10.81.254.202 0.0.0.0 1 64 377 0.00041 management

• Ensure that the Peer IP Address is correct
• Ensure that the peer is a server
• Ensure that the vrf is shown as management

DHCP Relay

There are two main components in the DHCP Relay configuration. The first is the policy, which is configured under a
tenant. The policy contains the DHCP server address as well as the EPG through which the DHCP server is reached.
The second component is a DHCP Relay label under the tenant BD that links to the DHCP Relay policy.
APIC

The DHCP Relay policy can be verified through shell access by changing directories (cd) to /mit/uni/tn-<tenant
name>/relayp-<DHCP Relay Profile Name>.
admin@RTP_APIC1:relayp-DHCP_Relay_Profile> ls

mo
provdhcp-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]
rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]
rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG].link
rtlblDefToRelayP-[uni--bd-[uni--tn-Prod--BD-MiddleWare]-isSvc-no--dhcplbldef-DHCP_Relay_Profile]
summary

admin@RTP_APIC1:relayp-DHCP_Relay_Profile> cat summary


# DHCP Relay Policy
name : DHCP_Relay_Profile
childAction :
descr :
dn : uni/tn-Prod/relayp-DHCP_Relay_Profile
lcOwn : local
modTs : 2014-10-16T15:43:03.139-07:00
mode : visible
monPolDn : uni/tn-common/monepg-default
owner : infra
ownerKey :
ownerTag :
rn : relayp-DHCP_Relay_Profile
status :
uid : 15374

In this last example, the DHCP relay policy name is DHCP_Relay_Profile. The provider is the EPG where the DHCP
server is located; in this example the server is reached through a Layer 3 external routed domain named L3out.
The dhcpRsProv object contains the IP address of the DHCP server. From the DHCP relay policy directory, cd to the
rsprov-* directory, which in this example is rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]:
admin@RTP_APIC1:relayp-DHCP_Relay_Profile> cd rsprov-\[uni--tn-Prod--out-L3out--instP-ExtL3EPG\]

admin@RTP_APIC1:rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]> ls
mo summary

admin@RTP_APIC1:rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]> cat summary


# DHCP Provider
tDn : uni/tn-Prod/out-L3out/instP-ExtL3EPG
addr : 10.30.250.1
childAction :
dn : uni/tn-Prod/relayp-DHCP_Relay_Profile/rsprov-[uni/tn-Prod/out-L3out/instP-ExtL3EPG]
forceResolve : no
lcOwn : local
modTs : 2014-10-16T15:43:03.139-07:00
monPolDn : uni/tn-common/monepg-default
rType : mo
rn : rsprov-[uni/tn-Prod/out-L3out/instP-ExtL3EPG]
state : formed
stateQual : none
status :
tCl : l3extInstP
tType : mo
uid : 15374
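
For reference, the same two components can also be created programmatically. The following is a hedged Cobra
sketch, assuming the dhcpRelayP, dhcpRsProv and dhcpLbl classes that correspond to the objects shown above, an
existing bridge domain at uni/tn-Prod/BD-MiddleWare, and an authenticated MoDirectory named md:
import cobra.mit.request
import cobra.model.dhcp

# relay policy under the tenant, providing the server through the L3 external EPG
tenantMo = md.lookupByDn('uni/tn-Prod')
relayMo = cobra.model.dhcp.RelayP(tenantMo, name='DHCP_Relay_Profile')
cobra.model.dhcp.RsProv(relayMo, tDn='uni/tn-Prod/out-L3out/instP-ExtL3EPG',
                        addr='10.30.250.1')

# relay label under the bridge domain, linking the BD to the relay policy
bdMo = md.lookupByDn('uni/tn-Prod/BD-MiddleWare')
lblMo = cobra.model.dhcp.Lbl(bdMo, name='DHCP_Relay_Profile')

config = cobra.mit.request.ConfigRequest()
config.addMo(relayMo)
config.addMo(lblMo)
md.commit(config)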

Fabric nodes

From the fabric nodes, confirm that the relay is configured properly with the CLI command show dhcp internal
info relay address. The command show ip dhcp relay presents similar information.
rtp_leaf1# show dhcp internal info relay address
DHCP Relay Address Information:
DHCP relay intf Vlan9 has 3 relay addresses:
DHCP relay addr: 10.0.0.1, vrf: overlay-1, visible, gateway IP: 10.0.0.30
DHCP relay addr: 10.0.0.2, vrf: overlay-1, invisible, gateway IP:
DHCP relay addr: 10.0.0.3, vrf: overlay-1, invisible, gateway IP:
DHCP relay intf Vlan17 has 1 relay addresses:
DHCP relay addr: 10.30.250.1, vrf: Prod:Prod, visible, gateway IP: 10.0.0.101 10.30.250.2
DHCP relay intf loopback0 has 3 relay addresses:
DHCP relay addr: 10.0.0.1, vrf: overlay-1, invisible, gateway IP:
DHCP relay addr: 10.0.0.2, vrf: overlay-1, invisible, gateway IP:
DHCP relay addr: 10.0.0.3, vrf: overlay-1, invisible, gateway IP:

The DHCP relay statistics on the leaf can be viewed with show ip dhcp relay statistics:
Leaf-1# show ip dhcp relay statistics
----------------------------------------------------------------------
Message Type Rx Tx Drops
----------------------------------------------------------------------
Discover 5 5 0
Offer 1 1 0
Request(*) 4 4 0
Ack 7 7 0
Release(*) 0 0 0
Decline 0 0 0
Nack 0 0 0
Inform 3 3 0
----------------------------------------------------------------------
Total 28 28 0
----------------------------------------------------------------------

15.2 Problem Description

After configuring specific shared services (DNS, NTP, SNMP, etc) there are issues with connectivity to those services.

Symptom 1

The APICs can resolve hostnames via DNS but the fabric nodes are not able to.

Verification

A fabric node is unable to resolve a hostname.


rtp_leaf1# ping www.cisco.com
ping: unknown host
rtp_leaf1#

An APIC is able to resolve a hostname.


admin@RTP_Apic1:~> ping www.cisco.com
PING origin-www.cisco.com (72.163.4.161) 56(84) bytes of data.
64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=1 ttl=238 time=29.4 ms
64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=2 ttl=238 time=29.1 ms
^C
--- origin-www.cisco.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1351ms
rtt min/avg/max/mdev = 29.173/29.334/29.495/0.161 ms

Since the problem seems isolated to the fabric nodes, let’s start there. Verify the policy is correct on the fabric node.
rtp_leaf1# cd /mit/uni/fabric/dnsp-default
rtp_leaf1# ls -al
total 1
drw-rw---- 1 admin admin 512 Oct 15 17:46 .
drw-rw---- 1 admin admin 512 Oct 15 17:46 ..
-rw-rw---- 1 admin admin 0 Oct 15 17:46 mo
-r--r----- 1 admin admin 0 Oct 15 17:46 summary

The fabric node has no policy; the mo and summary files are empty. Further inspection should take place on the
policy configuration on the APIC. All policy for the fabric nodes comes from the APIC, so that is where the problem
is most likely to be found.
From the APIC the policy is applied:
admin@RTP_Apic1:default> cat summary
# dns-profile
name : default
description :
ownerkey :
ownertag :
management-epg : tenants/mgmt/node-management-epgs/default/out-of-band/default

dns-providers:
address preferred
-------------- ---------
171.70.168.183 yes
173.36.131.10 no

dns-domains:
name default description
--------- ------- -----------
cisco.com yes

The DNS label is missing however:


admin@RTP_Apic1:default> cd /aci/tenants/mgmt/node-management-epgs/default/out-of-band/default
admin@RTP_Apic1:default> cat summary
# out-of-band-management-epg
name : default
configuration-issues :
configuration-state : applied
qos-priority : unspecified
description :

provided-out-of-band-contracts:
qos-priority oob-contract state
------------ ------------ ------
unspecified oob_contract formed

tags:
name
----
admin@RTP_Apic1:default>

From the GUI, the missing DNS label on the out-of-band management configuration can be seen:
Resolution

Once the DNS label “default” is added to the private network, the fabric node is able to resolve hostnames.
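For reference, the label can also be added programmatically. The following is a hedged Cobra sketch, assuming the
dnsLbl class that corresponds to the dns-profile-label shown in the verification steps and an authenticated MoDirectory
named md:
import cobra.mit.request
import cobra.model.dns

# the oob private network (context) in the mgmt tenant, as used in this chapter
ctxMo = md.lookupByDn('uni/tn-mgmt/ctx-oob')
dnsLblMo = cobra.model.dns.Lbl(ctxMo, name='default')

config = cobra.mit.request.ConfigRequest()
config.addMo(dnsLblMo)
md.commit(config)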

Symptom 2

NTP is not functional on any of the fabric nodes but the APICs have NTP synchronized.

Verification

APIC

• There are faults on the date-time policy for all of the fabric nodes that state that the config failed with: Datetime
Policy Configuration Failed with issues: access-epg-not-specified
The APIC does not have a management-epg assigned.
admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/policies/date-and-time/date-and-time-p
admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> cat summary
# date-and-time-policy
name : ntp.esl.cisco.com
description :
administrative-state : enabled
authentication-state : disabled
ownerkey :
ownertag :
ntp-servers:
host-name-ip-address preferred minimum-polling- maximum-polling- management-epg
interval interval
-------------------- --------- ---------------- ---------------- --------------
ntp.esl.cisco.com yes 4 6

This can be seen in the GUI as well:

This is likely why NTP is not synchronized on the fabric nodes: they are not being told which VRF to use to reach the
NTP server.
On the APICs, port 123 is being listened on, and because this fabric only has out-of-band management configured, the
APICs are able to reach the NTP server over that interface.
admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> ntpstat
synchronized to NTP server (171.68.38.65) at stratum 2
time correct to within 976 ms
polling server every 64 s
admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> netstat -anu | grep :123
udp 0 0 172.16.0.1:123 0.0.0.0:*
udp 0 0 10.122.254.211:123 0.0.0.0:*
udp 0 0 169.254.1.1:123 0.0.0.0:*
udp 0 0 169.254.254.254:123 0.0.0.0:*
udp 0 0 127.0.0.1:123 0.0.0.0:*
udp 0 0 0.0.0.0:123 0.0.0.0:*
udp 0 0 ::1:123 :::*
udp 0 0 fe80::92e2:baff:fe4b:fc7:123 :::*
udp 0 0 fe80::38a5:a2ff:fe9a:4eb:123 :::*
udp 0 0 fe80::f88d:a5ff:fe4c:419:123 :::*
udp 0 0 fe80::ce7:b9ff:fe50:4481:123 :::*
udp 0 0 fe80::3c79:62ff:fef0:214:123 :::*
udp 0 0 fe80::26e9:b3ff:fe15:a0e:123 :::*
udp 0 0 fe80::e89f:1dff:fedf:1f6:123 :::*
udp 0 0 fe80::f491:1ff:fe9f:f1de:123 :::*
udp 0 0 fe80::dc2d:dfff:fe88:20d:123 :::*
udp 0 0 fe80::e4cb:caff:feec:5bd:123 :::*
udp 0 0 fe80::a83d:1ff:fe54:597:123 :::*
udp 0 0 fe80::8c71:63ff:feb2:f4a:123 :::*
udp 0 0 :::123 :::*

Fabric Nodes

The leafs do not have any NTP policy:


rtp_leaf1# cd /mit/uni/fabric/time-default
rtp_leaf1# cat summary
cat: summary: No such file or directory
rtp_leaf1# cat mo
cat: mo: No such file or directory

Because the leafs do not have any policy, they also do not have any NTP configuration or peers.
rtp_leaf1# show ntp peer-status
Total peers : 1
* - selected for sync, + - peer mode(active),
- - peer mode(passive), = - polled in client mode
remote local st poll reach delay vrf
-------------------------------------------------------------------------------
=0.0.0.0 0.0.0.0 0 1 0 0.00000
rtp_leaf1# show ntp peers
--------------------------------------------------
Peer IP Address Serv/Peer
--------------------------------------------------
0.0.0.0 Server (configured)

Resolution

By adding a management EPG of default (Out-of-band) to the date-time policy, NTP is able to synchronize on the
fabric nodes.
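The date-and-time policy can also be checked through the REST API. The sketch below (Python with requests,
placeholder credentials) queries the datetimePol class, the class named in the fault above, and prints its full subtree
so a provider with no management EPG relation is easy to spot.

import requests

APIC = "https://10.122.254.211"    # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

# datetimePol is the class referenced by the fault; rsp-subtree=full also returns
# the NTP providers and their relations under each policy.
resp = s.get(APIC + "/api/node/class/datetimePol.json",
             params={"rsp-subtree": "full"})

def walk(mo, depth=0):
    cls, body = next(iter(mo.items()))
    attrs = body["attributes"]
    print("  " * depth + cls, attrs.get("rn", attrs.get("dn", "")))
    for child in body.get("children", []):
        walk(child, depth + 1)

for mo in resp.json()["imdata"]:
    walk(mo)

In this fabric (out-of-band management only), each NTP provider should carry a relation to the out-of-band
management EPG, matching the management-epg column in the CLI summary above; a provider with no such relation
reproduces the access-epg-not-specified fault.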

Symptom 3

The APICs do not synchronize with NTP but the fabric nodes do

Verification

APICs

The ntp daemon is not running.


admin@RTP_Apic1:pod-selector-default-all> ntpstat
Unable to talk to NTP daemon. Is it running?
admin@RTP_Apic1:pod-selector-default-all>

The APICs have the date-time policy configured properly.


admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> cat summary
# date-and-time-policy
name : ntp.esl.cisco.com
description :
administrative-state : enabled
authentication-state : disabled
ownerkey :
ownertag :

ntp-servers:
host-name-ip-address preferred minimum-polling- maximum-polling- management-epg
interval interval
-------------------- --------- ---------------- ---------------- ---------------------
ntp.esl.cisco.com yes 4 6 tenants/mgmt/
node-management-epgs/
default/out-of-band/
default

The APICs do have the proper fabric-policy-group as well.


admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/pod-selector-default-all
admin@RTP_Apic1:pod-selector-default-all> cat summary
# pod-selector
name : default
type : all
description :
ownerkey :
ownertag :
fabric-policy-group : fabric/fabric-policies/pod-policies/policy-groups/RTPFabric1

The APICs do not have the proper date-time-policy specified in the policy-group.
admin@RTP_Apic1:pod-policies> cd /aci/fabric/fabric-policies/pod-policies/policy-groups/
admin@RTP_Apic1:policy-groups> cat summary
policy-groups:
name date-time-policy isis-policy coop-group-policy bgp-route-reflector- communication-pol
policy
---------- ---------------- ----------- ----------------- -------------------- -----------------
RTPFabric1 default default default default default

The date-time-policy should be ntp.esl.cisco.com but is incorrectly set to default. There should be a fault for this.
The fault is on the Pod policy and states: Failed to form relation to MO time-default of class datetimePol in context
Fabric Nodes

The fabric nodes are synchronized with the NTP server.


rtp_leaf1# show ntp peers
--------------------------------------------------
Peer IP Address Serv/Peer
--------------------------------------------------
171.68.38.65 Server (configured)
rtp_leaf1# show ntp peer-status
Total peers : 1
* - selected for sync, + - peer mode(active),
- - peer mode(passive), = - polled in client mode
remote local st poll reach delay vrf
-------------------------------------------------------------------------------
=171.68.38.65 0.0.0.0 1 64 377 0.07144 management
Symptom 4

The APICs and the fabric nodes do not synchronize with NTP

Verification

APICs

The ntp daemon is not running on the APICs:


admin@RTP_Apic1:pod-selector-default-all> ntpstat
Unable to talk to NTP daemon. Is it running?
admin@RTP_Apic1:pod-selector-default-all>
The pod selector policy is missing the fabric-policy-group:

admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/pod-selector-default-all
admin@RTP_Apic1:pod-selector-default-all> cat summary
# pod-selector
name : default
type : all
description :
ownerkey :
ownertag :
fabric-policy-group :

Without a fabric-policy-group applied to the pod-selector, the date-time policy will not be applied through the pod-policy-
group and the NTP daemon will not start. This is a problem that needs to be corrected. However, verification should
continue through the other parts of the configuration to ensure that nothing else is broken.
The policy-group config does look proper and points to the date-time-policy:
admin@RTP_Apic1:date-and-time> cd /aci/fabric/fabric-policies/pod-policies/policy-groups/
admin@RTP_Apic1:policy-groups> cat summary
policy-groups:
name date-time-policy isis-policy coop-group-policy bgp-route-reflector- communication-po
policy
---------- ----------------- ----------- ----------------- -------------------- ----------------
RTPFabric1 ntp.esl.cisco.com default default default default

The date-and-time policy is configured correctly:


admin@RTP_Apic1:date-and-time> cat date-and-time-policy-ntp.esl.cisco.com/summary
# date-and-time-policy
name : ntp.esl.cisco.com
description :
administrative-state : enabled
authentication-state : disabled
ownerkey :
ownertag :

ntp-servers:
host-name-ip-address preferred minimum-polling- maximum-polling- management-epg
interval interval
-------------------- --------- ---------------- ---------------- ---------------------
ntp.esl.cisco.com yes 4 6 tenants/mgmt/
node-management-epgs/
default/out-of-band/
default
The only correction needed is to apply the fabric policy group to the pod selector:

Fabric Nodes

The fabric nodes do not have any NTP configuration:


rtp_leaf1# show ntp peers
dn "sys/time" could not be found
Error executing command, check logs for details
rtp_leaf1# show ntp peer-status
dn "sys/time" could not be found
Error executing command, check logs for details

There are no time-date policies on the fabric nodes:


rtp_leaf1# cd /mit/uni/fabric/time-default
bash: cd: /mit/uni/fabric/time-default: No such file or directory

Resolution

Once the Fabric Policy Group is set, the NTP daemon is started and NTP is synchronized on the APICs. In this case,
no fault is shown anywhere.
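For the NTP symptoms above that do raise faults (the access-epg-not-specified and failed-relation faults), the faults
can also be pulled programmatically instead of browsing the GUI. A minimal sketch, assuming admin credentials and
filtering on the word "Datetime" seen in the fault text:

import requests

APIC = "https://10.122.254.211"    # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

# faultInst holds active fault instances; wcard() does a wildcard match on the description.
resp = s.get(APIC + "/api/node/class/faultInst.json",
             params={"query-target-filter": 'wcard(faultInst.descr,"Datetime")'})
for mo in resp.json()["imdata"]:
    attrs = mo["faultInst"]["attributes"]
    print(attrs["code"], attrs["severity"], attrs["dn"])
    print("   ", attrs["descr"])

An empty result is consistent with Symptom 4, where the missing fabric-policy-group raises no fault at all.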

15.3 Problem Description

Devices connected to the fabric are not able to get the expected IP address via DHCP.

Symptom 1

DHCP client is not getting an IP address from DHCP server


Verification

Several issues could cause this. The following verification steps are listed in a logical order, moving from the policy
through the leaf to the DHCP server:
• The DHCP relay policy is properly applied as indicated in the overview section
• The endpoint is part of the EPG that is in the BD that contains the correct DHCP relay policy. This can be
verified with the switch CLI command show endpoint interface <interface ID> detail.
rtp_leaf1# show endpoint interface ethernet 1/13 detail
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------
20 vlan-1301 0024.81b5.d22b L eth1/13 Prod:commerceworkspa

• If the endpoint is not present, confirm the fabric interface status with the switch CLI command show interface
ethernet 1/13.
• If the interface status is not Up, check the physical connection
• If the interface status is Up but is “out-of-service”, this is typically an indication that there is a misconfiguration.
Confirm that the EPG points to the proper domain and the domain is configured with the proper fabric vlan pool
and AEP.
• Check for faults and refer to the section Faults and Health Scores.
Verify the DHCP relay policy on the leaf where the client is attached is properly configured, using the leaf CLI
command show dhcp internal info relay address as shown in the overview section.
Verify the DHCP server can be reached from the leaf, for example with the leaf CLI command iping originated
from the leaf using the tenant context.
Check the DHCP relay statistics on the leaf with the leaf CLI command show ip dhcp relay statistics:
• If the Discover is not incrementing, check the fabric interface status where the client is connected
• If the Discover stats are incrementing but the Offer is not, confirm that the server can reach the BD SVI address
• If the Discover stats are incrementing but the Offer is not, confirm that a proper contract is in place to not drop
the DHCP Offer
Confirm from the DHCP server side that the DHCP Discover is received.
From the DHCP server, confirm the GIADDR (relay agent) address is the expected address and that the proper DHCP
scope for that subnet has been defined.
From the DHCP server, confirm that the DHCP Offer is sent, and note the destination IP address (the relay) to which it is sent.
Confirm from the DHCP server that the DHCP relay agent address can be reached (ping).

Resolution

The above verification steps should isolate whether the issue is with the policy, the physical layer, the network or the
DHCP server.
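The endpoint learning check from the list above can also be done from the APIC side. The sketch below (placeholder
credentials, hypothetical client MAC) queries the fvCEp class, which represents learned client endpoints, and prints
where the client was learned and in which EPG:

import requests

APIC = "https://10.122.254.211"    # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
CLIENT_MAC = "00:24:81:B5:D2:2B"   # hypothetical DHCP client MAC; adjust the format if needed

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

resp = s.get(APIC + "/api/node/class/fvCEp.json",
             params={"query-target-filter": f'eq(fvCEp.mac,"{CLIENT_MAC}")'})
results = resp.json()["imdata"]
if not results:
    print("Endpoint not learned anywhere in the fabric - recheck interface state and encap.")
for mo in results:
    attrs = mo["fvCEp"]["attributes"]
    # The dn encodes the tenant/application profile/EPG the endpoint was classified into.
    print(attrs["dn"], attrs.get("encap", ""), attrs.get("ip", ""))

If the endpoint shows up under an unexpected EPG, the static binding or VLAN pool configuration is the likely culprit;
if it does not show up at all, start with the physical and interface checks above.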

Symptom 2

DHCP client is getting an address but not for the expected subnet
Verification

Several issues could be the cause of this. One possibility is that if there are multiple subnets on a BD, the relay agent
address (GIADDR) used will be the primary BD SVI address, which is typically the first subnet configured on the BD.
Since the DHCP server selects the scope based on the GIADDR, the address it allocates may not come from the expected subnet.
Other steps to verify are:
• The DHCP relay policy is properly applied as indicated in the overview section
• The endpoint is part of the EPG that is in the BD that contains the correct DHCP relay policy. This can be
verified with the leaf CLI command show endpoint interface <interface ID> detail.
rtp_leaf1# show endpoint interface ethernet 1/13 detail
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------

20 vlan-1301 0024.81b5.d22b L eth1/13 Prod:commerceworkspa

• If the endpoint is not present, confirm the fabric interface status with the leaf CLI command show interface
ethernet 1/13.
• If the interface status is not Up, check the physical connection
• If the interface status is Up but is “out-of-service”, this is typically an indication that there is a misconfiguration.
Confirm that the EPG points to the proper domain and the domain is configured with the proper fabric vlan pool
and AEP.
• Check for faults and refer to the section Faults and Health Scores.
From the DHCP server, confirm the GIADDR (relay agent) address is the expected address and the proper DHCP
scope for that subnet has been defined.

Resolution

The above verification steps should isolate whether the issue is with the policy, the physical layer, the network, or the
DHCP server.

16 Unicast Data Plane Forwarding and Reachability

• Overview
– Verification - Endpoints
– Verification - VLANs
– Verification - Forwarding Tables
• Problem Description
– Symptom
– Verification
• Problem Description
– Symptom 1
– Verification
16.1 Overview

This chapter will cover unicast forwarding and reachability problems. These can include, but are not limited to, end points
not showing up in the forwarding tables, end points not able to communicate with each other (non-zoning-rule policy
(contract) related problems), and VLANs not being programmed, as well as incorrect configurations that can cause these
problems and the subsequent faults that are raised.
To troubleshoot packet forwarding issues multiple shells will need to be used on a given leaf or spine.
• CLI: The CLI will be used to run the well-known VSH commands and check the concrete models on the switch.
For example show vlan, show endpoint, etc...
• vsh_lc: This is the line card shell and it will be used to check line card processes and forwarding tables specific
to the Application Leaf Engine (ALE) ASIC.
In the course of troubleshooting, several different VLAN types may be encountered. A thorough deep dive on the
different types of VLANs and their respective mappings is beyond the scope of this chapter and book, but summarized
in the picture below are some of the more important concepts that are fundamental to understanding the use of VLANs
in the ACI fabric. The output from the below command will be discussed in more detail later in this chapter; note that
the command was run from the vsh_lc shell discussed above.

The show vlan extended command shows how the VLANs described above map to the Bridge Domains and End Point
Groups that have been configured on the fabric.
rtp_leaf1# show vlan extended
VLAN Name Status Ports
---- --------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35 <--- Infra VLAN
17 Prod:Web active Eth1/27, Eth1/28, Po2, Po3 <--- BD_VLAN
21 Prod:MiddleWare active Eth1/27, Eth1/28, Po2, Po3 <--- BD_VLAN
22 Prod:commerceworkspace:Middleware active Eth1/27, Eth1/28, Po2, Po3 <--- EPG FD_VLAN
23 Prod:commerceworkspace:Web active Eth1/27, Eth1/28, Po2, Po3 <--- EPG FD_VLAN

VLAN Type Vlan-mode Encap
---- ----- ---------- -------------------------------
13 enet CE vxlan-16777209, vlan-3500<--- Encap VLAN for infra
17 enet CE vxlan-16089026
21 enet CE vxlan-16351138
22 enet CE vlan-634<--- Encap VLAN
23 enet CE vlan-600<---Encap VLAN

Additionally, an understanding of the forwarding tables in the ALE ASIC is useful when troubleshooting packet
forwarding problems. The picture below depicts two tables, and the subsequent lookup that is performed on ingress
and egress. There is a Local Station Table (LST) and a Global Station Table (GST). These two tables are unified for
Layer 2 and Layer 3, and these entries can be displayed separately.

Verification - Endpoints

To verify end point reachability, the following commands can be used from the CLI. Notice the legend that details how
each end point was learned. The below sample output displays end points learned locally (directly attached to the leaf)
as well as vPC-attached.
rtp_leaf1# show endpoint
Legend:
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+
VLAN/ Encap MAC Address MAC Info/ Interface
Domain VLAN IP Address IP Info
+---------------+---------------+-----------------+--------------+-------------+
Prod:Prod 10.0.0.101 L
27/Prod:Prod vlan-700 0026.f064.0000 LpV po1
27/Prod:Prod vlan-700 001b.54c2.2644 LpV po1
27/Prod:Prod vlan-700 0026.980a.df44 LV po1
27/Prod:Prod vlan-700 0000.0c9f.f2bc LV po1
35/Prod:Prod vlan-670 0050.56bb.0164 LV po2
29/Prod:Prod vlan-600 0050.56bb.cccf LV po2
34/Prod:Prod vlan-636 0000.0c9f.f2bc LV po2
34/Prod:Prod vlan-636 0026.980a.df44 LV po2
34/Prod:Prod vlan-636 0050.56bb.ba9a LV po2
Test:Test 10.1.0.101 L
overlay-1 172.16.136.95 L
overlay-1 172.16.136.96 L
13/overlay-1 vxlan-16777209 90e2.ba5a.9f30 L eth1/2
13/overlay-1 vxlan-16777209 90e2.ba4b.fc78 L eth1/1

The show mac address-table command can also be used to confirm MAC address learning and aging as shown
below.
rtp_leaf1# show mac address-table
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link,
(T) - True, (F) - False
VLAN MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 27 0026.f064.0000 dynamic - F F po1
* 27 001b.54c2.2644 dynamic - F F po1
* 27 0000.0c9f.f2bc dynamic - F F po1
* 27 0026.980a.df44 dynamic - F F po1
* 16 0050.56bb.0164 dynamic - F F po2
* 16 0050.56bb.2577 dynamic - F F po2
* 16 0050.56bb.cccf dynamic - F F po2
* 33 0050.56bb.cccf dynamic - F F po2
* 17 0026.980a.df44 dynamic - F F po2
* 17 0050.56bb.ba9a dynamic - F F po2
* 40 0050.56bb.f532 dynamic - F F po2
* 13 90e2.ba5a.9f30 dynamic - F F eth1/2
* 13 90e2.ba4b.fc78 dynamic - F F eth1/1

To verify end point reachability from the vsh_lc shell, the following command can be used. Notice all the options that
are available. To reduce the amount of output displayed, specific end points can be specified, as the output can be quite
extensive in a very large fabric. Alternatively, the output can be filtered using the Linux grep utility.
rtp_leaf1# vsh_lc
module-1# show system internal epmc endpoint ?
all Show information about all endpoints
interface Display interface information
ip IP address of the endpoint
key Key of the endpoint
mac MAC address of the endpoint
vlan VLAN info
vrf VRF of the endpoint

module-1# show system internal epmc endpoint all


VRF : overlay-1 ::: Context id : 4 ::: Vnid : 16777199
MAC : 90e2.ba4b.fc78 ::: Num IPs : 0
Vlan id : 13 ::: Vlan vnid : 16777209 ::: BD vnid : 16777209
VRF vnid : 16777199 ::: phy if : 0x1a000000 ::: tunnel if : 0
Interface : Ethernet1/1
VTEP tunnel if : N/A ::: Flags : 0x80004804
Ref count : 4 ::: sclass : 0
Timestamp : 01/02/1970 21:21:58.113891
EP Flags : local,MAC,class-set,timer,
Aging:Timer-type : Host-tracker timeout ::: Timeout-left : 399 ::: Hit-bit : Yes ::: Timer-reset coun

PD handles:
Bcm l2 hit-bit : Yes
[L2]: Asic : NS ::: BCM : Yes

::::

[snip]

------------------------------------------------
EPMC Endpoint Summary
----------------------------------------------
Total number of local endpoints : 7
Total number of remote endpoints : 0
Total number of peer endpoints : 0
Total number of cached endpoints : 0
Total number of config endpoints : 5
Total number of MACs : 5
Total number of IPs : 2

Verification - VLANs

The following commands can be used in the CLI to verify the VLANs programmed on the leaf. The show vlan
extended command is very useful as it provides the BD_VLAN, FD_VLAN (EPG), and the encap used for the EPG
FD_VLAN.
rtp_leaf1# show vlan
<CR> Carriage return
all-ports Show all ports on VLAN
brief All VLAN status in brief
extended VLAN extended info like encaps
id VLAN status by VLAN id
internal Show internal information of vlan-mgr
reserved Internal reserved VLANs
summary VLAN summary information

rtp_leaf1# show vlan extended

VLAN Name Status Ports
---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
16 Prod:Web-FWctxProd active Eth1/27, Eth1/28, Po2, Po3
17 Prod:Web-FWctxProd active Eth1/27, Eth1/28, Po2, Po3
18 Prod:Web-LBctxProd active Eth1/27, Eth1/28, Po2, Po3
21 Prod:MiddleWare active Eth1/27, Eth1/28, Po2, Po3
22 Prod:commerceworkspace:Middleware active Eth1/27, Eth1/28, Po2, Po3
26 Prod:FWOutside active Eth1/27, Eth1/28, Eth1/42,
Eth1/44, Po1, Po2, Po3
27 -- active Eth1/42, Eth1/44, Po1
32 Prod:FWInside active Eth1/27, Eth1/28, Po2, Po3
33 Prod:commerceworkspace:FWInside active Eth1/27, Eth1/28, Po2, Po3
37 Prod:Web active Eth1/27, Eth1/28, Po2, Po3
40 Prod:commerceworkspace:Web active Eth1/27, Eth1/28, Po2, Po3

VLAN Type Vlan-mode Encap
---- ----- ---------- -------------------------------
13 enet CE vxlan-16777209, vlan-3500
16 enet CE vlan-637
17 enet CE vlan-671
18 enet CE vlan-603
21 enet CE vxlan-16351138
22 enet CE vlan-634
26 enet CE vxlan-15662984
27 enet CE vlan-700
32 enet CE vxlan-15597456
33 enet CE vlan-602
37 enet CE vxlan-16089026
40 enet CE vlan-600

The following commands can be used in the vsh_lc shell to verify the VLANs are programmed on the ALE ASIC. The
brief version of the VLAN command is also useful as it provides all the VLAN information in a single table.
rtp_leaf1# vsh_lc
module-1# show system internal eltmc info vlan ?
<0-4095> Vlan value
access_encap_vlan Vlan on the wire
access_encap_vnid Vnid on the wire
all Show information for all instances of object
brief Show brief information for all objects
fab_encap_vnid Vnid in the fabric
hw_vlan Vlan value in HW
summary Show summary for object

module-1# show system internal eltmc info vlan brief

VLAN-Info
VlanId HW_VlanId Type Access_enc Access_enc Fabric_enc Fabric_enc BDVlan
Type Type
==================================================================================
13 15 BD_CTRL_VLAN 802.1q 3500 VXLAN 16777209 0
16 28 FD_VLAN 802.1q 637 VXLAN 8429 26
17 25 FD_VLAN 802.1q 671 VXLAN 8463 32
18 29 FD_VLAN 802.1q 603 VXLAN 8395 32
21 23 BD_VLAN Unknown 0 VXLAN 16351138 21
22 24 FD_VLAN 802.1q 634 VXLAN 8426 21
26 26 BD_VLAN Unknown 0 VXLAN 15662984 26
27 27 FD_VLAN 802.1q 700 VXLAN 8192 26
32 30 BD_VLAN Unknown 0 VXLAN 15597456 32
33 31 FD_VLAN 802.1q 602 VXLAN 8394 32
37 21 BD_VLAN Unknown 0 VXLAN 16089026 37
40 32 FD_VLAN 802.1q 600 VXLAN 8392 37

The following command can be used in the vsh_lc shell to display a specific VLAN programmed on the ALE ASIC. The
output shows the EPG FD_VLAN 22, which has an encap value of 634. This FD_VLAN has a parent BD_VLAN of 21.
module-1# show system internal eltmc info vlan 22
vlan_id: 22 ::: hw_vlan_id: 24
vlan_type: FD_VLAN ::: bd_vlan: 21
access_encap_type: 802.1q ::: access_encap: 634
fabric_encap_type: VXLAN ::: fabric_encap: 8426
sclass: 16389 ::: scope: 4
bd_vnid: 8426 ::: untagged: 0
acess_encap_hex: 0x27a ::: fabric_enc_hex: 0x20ea
pd_vlan_ft_mask: 0x4f
bcm_class_id: 16 ::: bcm_qos_pap_id: 1024
qq_met_ptr: 2 ::: seg_label: 0
ns_qos_map_idx: 0 ::: ns_qos_map_pri: 1
ns_qos_map_dscp: 0 ::: ns_qos_map_tc: 0
vlan_ft_mask: 0x30

NorthStar Info:
qq_tbl_id: 1808 ::: qq_ocam: 0
seg_stat_tbl_id: 0 ::: seg_ocam: 0

The same command can be used to display BD_VLAN 21, which as shown above is the parent VLAN of FD_VLAN
22. Displaying this VLAN provides important information regarding the forwarding behavior for this Bridge
Domain. It can be seen that this BD has been left at the default L3 forwarding behavior. Remember that the BD_VLAN
and FD_VLAN are only locally significant to a single leaf.
module-1# show system internal eltmc info vlan 21
vlan_id: 21 ::: hw_vlan_id: 23
vlan_type: BD_VLAN ::: bd_vlan: 21
access_encap_type: Unknown ::: access_encap: 0
fabric_encap_type: VXLAN ::: fabric_encap: 16351138
sclass: 32773 ::: scope: 4
bd_vnid: 16351138 ::: untagged: 0
acess_encap_hex: 0 ::: fabric_enc_hex: 0xf97fa2
vrf_fd_list: 22,
pd_vlan_ft_mask: 0
bcm_class_id: 0 ::: bcm_qos_pap_id: 0
qq_met_ptr: 0 ::: seg_label: 0
ns_qos_map_idx: 0 ::: ns_qos_map_pri: 0
ns_qos_map_dscp: 0 ::: ns_qos_map_tc: 0
vlan_ft_mask: 0x7f
fwd_mode: bridge, route
arp_mode: unicast
unk_mc_flood: 1
unk_uc_mode: proxy

NorthStar Info:
qq_tbl_id: 2040 ::: qq_ocam: 0
seg_stat_tbl_id: 149 ::: seg_ocam: 0
flood_encap: 29 ::: igmp_mld_encap: 33

Using the same command sequence and displaying FD_VLAN 33 shows that the parent BD_VLAN is 32. Displaying
the BD_VLAN shows that this BD's forwarding behavior has been changed to L2 forwarding and flooding.
module-1# show system internal eltmc info vlan 33

vlan_id: 33 ::: hw_vlan_id: 31


vlan_type: FD_VLAN ::: bd_vlan: 32
access_encap_type: 802.1q ::: access_encap: 602
fabric_encap_type: VXLAN ::: fabric_encap: 8394
sclass: 16394 ::: scope: 4
bd_vnid: 8394 ::: untagged: 0
acess_encap_hex: 0x25a ::: fabric_enc_hex: 0x20ca
pd_vlan_ft_mask: 0x4f
bcm_class_id: 16 ::: bcm_qos_pap_id: 1024
qq_met_ptr: 7 ::: seg_label: 0
ns_qos_map_idx: 0 ::: ns_qos_map_pri: 1
ns_qos_map_dscp: 0 ::: ns_qos_map_tc: 0
vlan_ft_mask: 0x30

NorthStar Info:
qq_tbl_id: 1417 ::: qq_ocam: 0
seg_stat_tbl_id: 0 ::: seg_ocam: 0

This is the display of the BD_VLAN showing the forwarding behavior has been changed to Layer 2 (bridged):
module-1# show system internal eltmc info vlan 32

vlan_id: 32 ::: hw_vlan_id: 30


vlan_type: BD_VLAN ::: bd_vlan: 32
access_encap_type: Unknown ::: access_encap: 0
fabric_encap_type: VXLAN ::: fabric_encap: 15597456
sclass: 16393 ::: scope: 4
bd_vnid: 15597456 ::: untagged: 0
acess_encap_hex: 0 ::: fabric_enc_hex: 0xedff90
vrf_fd_list: 18,17,33,
pd_vlan_ft_mask: 0
bcm_class_id: 0 ::: bcm_qos_pap_id: 0
qq_met_ptr: 0 ::: seg_label: 0
ns_qos_map_idx: 0 ::: ns_qos_map_pri: 0
ns_qos_map_dscp: 0 ::: ns_qos_map_tc: 0
vlan_ft_mask: 0x7f
fwd_mode: bridge
arp_mode: flood
unk_mc_flood: 1
unk_uc_mode: flood

NorthStar Info:
qq_tbl_id: 1672 ::: qq_ocam: 0
seg_stat_tbl_id: 452 ::: seg_ocam: 0
flood_encap: 36 ::: igmp_mld_encap: 37

Verification - Forwarding Tables

From the CLI, the show endpoint command can be used to determine how the endpoints are being learned. In the below
example all the endpoints are locally learned (L).
rtp_leaf1# show endpoint
Legend:
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+
VLAN/ Encap MAC Address MAC Info/ Interface
Domain VLAN IP Address IP Info
+---------------+---------------+-----------------+--------------+-------------+
Prod:Prod 10.0.0.101 L
27/Prod:Prod vlan-700 0026.f064.0000 LpV po1
27/Prod:Prod vlan-700 001b.54c2.2644 LpV po1
27/Prod:Prod vlan-700 0026.980a.df44 LV po1
27/Prod:Prod vlan-700 0000.0c9f.f2bc LV po1
35/Prod:Prod vlan-670 0050.56bb.0164 LV po2
29/Prod:Prod vlan-600 0050.56bb.cccf LV po2
34/Prod:Prod vlan-636 0000.0c9f.f2bc LV po2
34/Prod:Prod vlan-636 0026.980a.df44 LV po2
34/Prod:Prod vlan-636 0050.56bb.ba9a LV po2
Test:Test 10.1.0.101 L
overlay-1 172.16.136.95 L
overlay-1 172.16.136.96 L
13/overlay-1 vxlan-16777209 90e2.ba5a.9f30 L eth1/2
13/overlay-1 vxlan-16777209 90e2.ba4b.fc78 L eth1/1

From the vsh_lc shell the following commands can be used to examine the forwarding tables. Remember from the
overview that there are two forwarding tables of interest on the ALE ASIC, the GST and LST. These are unified tables for
L2 and L3, and the ingress and egress pipelines can be displayed separately as shown below.
For the output below, when the ingress direction is specified the entries are for packets originating from the fabric, and
when the egress direction is specified, the entries are for packets originating from the front panel ports.
module-1# show platform internal ns forwarding lst-l2 ingress

================================================================================
TABLE INSTANCE : 0
================================================================================
Legend:
POS: Entry Position O: Overlay Instance
V: Valid Bit MD/PT: Mod/Port
PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)
PTR: ECMP/Adj/DstEncap/MET pointer
ML: MET Last
ST: Static PTH: Num Paths
BN: Bounce CP: Copy To CPU
PA: Policy Applied PI: Policy Incomplete
DL: Dst Local SP: Spine Proxy
--------------------------------------------------------------------------------
MO SRC P M S B C P P D S
POS O VNID Address V DE MD/PT CLSS T PTR L T PTH N P A I L P
--------------------------------------------------------------------------------
253 0 eeff88 00:26:98:0a:df:44 1 0 00/15 4007 A 0 0 0 1 0 0 0 0 0 0
479 0 edff90 00:00:0c:9f:f2:bc 1 0 00/13 400f A 0 0 0 1 0 0 0 0 0 0
693 0 eeff88 00:00:0c:9f:f2:bc 1 0 00/15 4007 A 0 0 0 1 0 0 0 0 0 0
848 0 edff90 00:50:56:bb:cc:cf 1 0 00/13 400a A 0 0 0 1 0 0 0 0 0 0
919 0 edff90 00:26:98:0a:df:44 1 0 00/13 400f A 0 0 0 1 0 0 0 0 0 0
1271 0 f97fa2 00:22:bd:f8:19:ff 1 0 00/00 1 A 0 0 1 1 0 0 0 1 0 0
1306 0 eeff88 00:1b:54:c2:26:44 1 0 00/15 4007 A 0 0 0 1 0 0 0 0 0 0
1327 0 edff90 00:50:56:bb:ba:9a 1 0 00/13 400f A 0 0 0 1 0 0 0 0 0 0
2421 0 f57fc2 00:50:56:bb:f5:32 1 0 00/13 8004 A 0 0 0 1 0 0 0 0 0 0
3942 0 eeff88 00:26:f0:64:00:00 1 0 00/15 4007 A 0 0 0 1 0 0 0 0 0 0
4083 0 eeff88 00:50:56:bb:01:64 1 0 00/13 4010 A 0 0 0 1 0 0 0 0 0 0
module-1# show platform internal ns forwarding lst-l2 egress

================================================================================
TABLE INSTANCE : 1
================================================================================
Legend:
POS: Entry Position O: Overlay Instance
V: Valid Bit MD/PT: Mod/Port
PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)
PTR: ECMP/Adj/DstEncap/MET pointer
ML: MET Last
ST: Static PTH: Num Paths
BN: Bounce CP: Copy To CPU
PA: Policy Applied PI: Policy Incomplete
DL: Dst Local SP: Spine Proxy
--------------------------------------------------------------------------------
MO SRC P M S B C P P D S
POS O VNID Address V DE MD/PT CLSS T PTR L T PTH N P A I L P
--------------------------------------------------------------------------------
253 0 eeff88 00:26:98:0a:df:44 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
479 0 edff90 00:00:0c:9f:f2:bc 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
693 0 eeff88 00:00:0c:9f:f2:bc 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
848 0 edff90 00:50:56:bb:cc:cf 1 0 00/00 400a A 14 0 0 1 0 0 0 0 1 0
919 0 edff90 00:26:98:0a:df:44 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
1306 0 eeff88 00:1b:54:c2:26:44 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
1327 0 edff90 00:50:56:bb:ba:9a 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
2421 0 f57fc2 00:50:56:bb:f5:32 1 0 00/00 8004 A 12 0 0 1 0 0 0 0 1 0
3942 0 eeff88 00:26:f0:64:00:00 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
4083 0 eeff88 00:50:56:bb:01:64 1 0 00/00 4010 A 11 0 0 1 0 0 0 0 1 0

module-1# show platform internal ns forwarding lst-l3 ingress

===========================================================================
TABLE INSTANCE : 0
===========================================================================
Legend:
POS: Entry Position O: Overlay Instance
V: Valid Bit MD/PT: Mod/Port
PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)
PTR: ECMP/Adj/DstEncap/MET pointer
ML: MET Last
ST: Static PTH: Num Paths
BN: Bounce CP: Copy To CPU
PA: Policy Applied PI: Policy Incomplete
DL: Dst Local SP: Spine Proxy
--------------------------------------------------------------------------------
MO SRC P M S B C P P D S
POS O VNID Address V DE MD/PT CLSS T PTR L T PTH N P A I L P
--------------------------------------------------------------------------------
3142 0 268000 10.1.2.1 1 0 00/00 1 A 0 0 1 1 0 0 0 1 0 0

module-1# show platform internal ns forwarding lst-l3 egress


<no output>

module-1# show platform internal ns forwarding gst-l2 ingress

===========================================================================
TABLE INSTANCE : 0
===========================================================================

Legend:
POS: Entry Position O: Overlay Instance
V: Valid Bit MD/PT: Mod/Port
PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)
PTR: ECMP/Adj/DstEncap/MET pointer
ML: MET Last
ST: Static PTH: Num Paths
BN: Bounce CP: Copy To CPU
PA: Policy Applied PI: Policy Incomplete
DL: Dst Local SP: Spine Proxy
--------------------------------------------------------------------------------
MO SRC P M S B C P P D S
POS O VNID Address V DE MD/PT CLSS T PTR L T PTH N P A I L P
--------------------------------------------------------------------------------
4095 0 eeff88 00:50:56:bb:01:64 1 0 00/00 4010 A 11 0 0 1 0 0 0 0 1 0
4261 0 edff90 00:26:98:0a:df:44 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
4354 0 edff90 00:50:56:bb:ba:9a 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
4476 0 eeff88 00:26:98:0a:df:44 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
4672 0 f57fc2 00:50:56:bb:f5:32 1 0 00/00 8004 A 12 0 0 1 0 0 0 0 1 0
7190 0 eeff88 00:00:0c:9f:f2:bc 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
7319 0 eeff88 00:26:f0:64:00:00 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
7631 0 edff90 00:00:0c:9f:f2:bc 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
7742 0 edff90 00:50:56:bb:25:77 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
7910 0 edff90 00:1b:54:c2:26:44 1 0 00/00 400f A 13 0 0 1 0 0 0 0 1 0
7999 0 eeff88 00:1b:54:c2:26:44 1 0 00/00 4007 A 10 0 0 1 0 0 0 0 1 0
8167 0 eeff88 00:50:56:bb:25:77 1 0 00/00 4010 A 11 0 0 1 0 0 0 0 1 0

module-1# show platform internal ns forwarding gst-l2 egress


<no output>

module-1# show platform internal ns forwarding gst-l3 ingress

==========================================================================
TABLE INSTANCE : 0
==========================================================================

Legend:
POS: Entry Position O: Overlay Instance
V: Valid Bit MD/PT: Mod/Port
PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)
PTR: ECMP/Adj/DstEncap/MET pointer
ML: MET Last
ST: Static PTH: Num Paths
BN: Bounce CP: Copy To CPU
PA: Policy Applied PI: Policy Incomplete
DL: Dst Local SP: Spine Proxy
--------------------------------------------------------------------------------
MO SRC P M S B C P P D S
POS O VNID Address V DE MD/PT CLSS T PTR L T PTH N P A I L P
--------------------------------------------------------------------------------
562 0 268000 10.0.0.9 1 0 00/00 1 A c 0 0 1 0 0 0 0 1 0
563 0 268000 10.0.0.1 1 0 00/00 1 A e 0 0 1 0 0 0 0 1 0
2312 0 2c0000 10.0.1.11 1 0 00/00 1 A 2 0 0 1 0 0 0 0 1 0
2313 0 2c0000 10.0.1.3 1 0 00/00 1 A 2 0 0 1 0 0 0 0 1 0
4580 0 2c0000 10.0.1.9 1 0 00/00 1 A d 0 0 1 0 0 0 0 1 0
4581 0 2c0000 10.0.1.1 1 0 00/00 1 A f 0 0 1 0 0 0 0 1 0
6878 0 268000 10.0.0.11 1 0 00/00 1 A 2 0 0 1 0 0 0 0 1 0
6879 0 268000 10.0.0.3 1 0 00/00 1 A 2 0 0 1 0 0 0 0 1 0

module-1# show platform internal ns forwarding gst-l3 egress


<no output>

16.2 Problem Description

Issues when using Atomic counters as an aid in troubleshooting

Symptom

Atomic Counters are configured, but do not seem to be working

Verification

There are several areas to verify.


• NTP must be configured and working correctly within the fabric for Atomic Counters to work.
• Endpoints must be learned by the leaf switch. Attach to the leaf command line, issue show endpoint, and
confirm that there is an L next to the endpoints for which the Atomic Counter policy will be configured (a
scripted cross-check from the APIC is sketched below).
• The endpoints must be sending traffic in one or both directions before Atomic Counters can display the packet
counts.
• The endpoints must reside on different leafs. Counted packets must traverse the ACI Spine switches. Locally
switched packets are not counted by Atomic Counters. The packet must traverse the ALE ASICs.
• Atomic Counters are not supported when the endpoints are in different VRFs (also known as different Contexts).
This implies that Atomic Counters are not supported between endpoints that reside in different tenants.
In the diagram below, On-demand Atomic Counters are available for troubleshooting between EPG-A and EPG-B,
and between any of the EPs.
In the diagram below, On-demand Atomic Counters would only update the transmit counter. Drops or excess packets
could not be counted.
Packet counts are updated in 30 second intervals, so wait at least 30 seconds before expecting to see any counters
incrementing.
Complete Atomic Counter restrictions are documented in the Cisco APIC Troubleshooting Guide.
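As a quick pre-check of the learning and placement restrictions above, the sketch below looks up both endpoints
through the APIC API and prints the object each was learned under, together with any path relations. The endpoint
IPs are examples from this fabric, and the child-relation handling is kept generic since the exact relation classes are
not shown in this chapter.

import requests

APIC = "https://10.122.254.211"    # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
ENDPOINT_IPS = ["10.0.0.101", "10.1.0.101"]   # example endpoints from the reference topology

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

for ip in ENDPOINT_IPS:
    resp = s.get(APIC + "/api/node/class/fvCEp.json",
                 params={"query-target-filter": f'eq(fvCEp.ip,"{ip}")',
                         "rsp-subtree": "children"})
    for mo in resp.json()["imdata"]:
        body = mo["fvCEp"]
        print(ip, "->", body["attributes"]["dn"])
        for child in body.get("children", []):
            cls, cbody = next(iter(child.items()))
            # Relation children that carry a tDn point at the leaf port or vPC the endpoint sits behind.
            print("   ", cls, cbody["attributes"].get("tDn", ""))

Comparing the paths shows whether the endpoints sit behind different leafs, and comparing the dn shows whether they
belong to different tenants, which would rule out Atomic Counters.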
On Demand Atomic Counters are available in Fabric->Fabric Policies-> Troubleshooting Policies->Traffic Map as
shown below.
Click on the Leaf to Leaf traffic as shown below.
16.3 Problem Description

Layer 2 or layer 3 forwarding issues are occurring

Symptom 1

Layer 2 forwarding issues are occurring

Verification

• Use the steps in the verification section, and verify


– End point reachability
– VLAN Programming
– BD_VLAN forwarding behavior
– GST and LST L2 Forwarding Tables.
• For additional Troubleshooting Tips see the “Bridged Connectivity to External Networks” chapter. This chapter
demonstrates the use of other commands such as show vpc.

Symptom 2

Layer 3 forwarding issues are occurring

Verification

• Use the steps in the verification section, and verify
– End point reachability
– VLAN Programming
– BD_VLAN forwarding behavior
– GST and LST L3 Forwarding Tables.
• For additional Troubleshooting Tips see the “Routed Connectivity to External Networks” chapter. This chapter
demonstrates the use of other commands such as show ip route vrf.
17 Policies and Contracts

• Overview
– Verification of Zoning Policies
• Problem Description
– Symptom 1
– Verification
– Resolution
– Symptom 2
– Verification
– Resolution
– Symptom 3
– Verification
– Resolution
– Symptom 4
– Verification
– Resolution

17.1 Overview

Within the ACI abstraction model, Contracts are objects built to represent the communications allowed or denied
between objects, such as EPGs. In order to resolve and configure the infrastructure, the contract objects get resolved
into Zoning Rules on the fabric nodes. Zoning Rules are the ACI equivalent of Access Control Lists in traditional
infrastructure terms. This chapter provides an overview and some troubleshooting topics related to Zoning-Rule
Policy Control in the ACI fabric. The ACI zoning-rule policy architecture consists of four main components: Policy
Manager, ACLQOS, Filters and Filter Entries, and Scopes and classIDs, as described below.
Policy Manager
• Policy Manager is a supervisor component that processes objectStore notifications when Data Management
Engine (DME)/Policy Element (PE) pushes zoning configuration to a leaf.
• Policy Manager uses the PPF (Policy Propagation Facility) library to push configuration to the linecards.
• Policy Manager follows an atomic “verify/commit” model where lack of hardware resources will cause a failure
in the verify ‘stage’ of the process.
• Sets operational state as “Enabled” or “Disabled”
ACLQOS
• ACLQOS is a linecard component that receives the PPF configuration from the supervisor.
• This component is responsible for programming the hardware resources (Ternary Content Addressable Memory
- TCAM) on the linecards on the leafs.
Filters and Filter Entries
• Filters act as containers for the filter entries
• Filter entries specify the Layer 4 (L4) information
Scopes and classIDs
• Each Context (VRF) uses a specific scope identified by “scopeID”
• “actrlRules” and “mgmtRules” are children that exist under a given scope
• EPGs are identified by the classID or PcTag.
• Rules are specified in terms of scope, source class ID, dest class ID and the filter
• The actrlRules exist only on leafs, while actrl.MgmtRules exist on both leaf and spine switches.
At a very high level, the interaction of the components on the APIC and leaf for policy can be summarized as follows
• Policy Manager on APIC communicates with Policy Element Manager on the leaf
• Policy Element Manager on the leaf programs the Object Store on the leaf
• Policy Manager on the leaf communicates with ACLQOS client on the leaf
• ACLQOS client programs the hardware

Verification of Zoning Policies

Zoning rules on the leaf can either be displayed directly on the leaf using CLI, or through the GUI on the APIC. The
zoning policies that were examined as part of this chapter were configured as part of the below reference topology.

The following CLI command can be used to display the zoning rules configured on the switch. This provides several
key pieces of information when troubleshooting zoning rule policies. The Rule ID, SrcEPG/DstEPG, FilterID, and
Scope can be used in future displays.
rtp_leaf1# show zoning-rule

Rule ID SrcEPG DstEPG FilterID operSt Scope Actio
======= ====== ====== ======== ====== ===== =====
4096 0 0 implicit enabled 16777200 deny,
4106 0 0 implicit enabled 2523136 deny,
4107 0 16386 implicit enabled 2523136 deny,
4147 0 32773 implicit enabled 2523136 permi
4148 0 16388 implicit enabled 2523136 permi
4149 0 32774 implicit enabled 2523136 permi
4150 0 16393 implicit enabled 2523136 permi
4151 0 32770 implicit enabled 2523136 permi
4152 16400 16391 17 enabled 2523136 permi
4153 16391 16400 17 enabled 2523136 permi
4154 16400 16391 18 enabled 2523136 permi
4155 16391 16400 18 enabled 2523136 permi
4097 16398 16394 default enabled 2523136 permi
4112 16394 16398 default enabled 2523136 permi
4120 16398 16399 default enabled 2523136 permi
4121 16399 16398 default enabled 2523136 permi
4126 16389 16387 default enabled 2523136 permi
4127 16387 16389 default enabled 2523136 permi
4128 16389 16401 default enabled 2523136 permi
4129 16401 16389 default enabled 2523136 permi
4130 16387 16401 default enabled 2523136 permi
4131 16401 16387 default enabled 2523136 permi
4117 0 0 implicit enabled 2457600 deny,
4118 0 0 implicit enabled 2883584 deny,
4119 0 32770 implicit enabled 2883584 deny,

As is evidenced by the above output, even in a small test fabric, a number of rules are installed. In order to identify
which Rule IDs apply to which configured contexts and EPGs, it is necessary to first identify the scope for the config-
ured context. This can be done by using Visore on the APIC to search for the class fvCtx, which returns the configured
contexts. Once all the contexts are displayed, search for the specific context that was configured, and identify the scope
for that context.

The scope information is circled below. This is important as it will be used in future displays to verify the con-
tract/policy has been pushed to the leaf.
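The scope can also be retrieved programmatically rather than through Visore. A minimal sketch, assuming the context
is named Prod and that fvCtx exposes the scope as an attribute (as seen in the Visore output):

import requests

APIC = "https://10.122.254.211"    # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

resp = s.get(APIC + "/api/node/class/fvCtx.json",
             params={"query-target-filter": 'eq(fvCtx.name,"Prod")'})
for mo in resp.json()["imdata"]:
    attrs = mo["fvCtx"]["attributes"]
    # The scope printed here should match the Scope column of "show zoning-rule" on the leaf.
    print(attrs["dn"], "scope:", attrs.get("scope"))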
Notice that the scope identified in the above capture (2523136) matches the scope that appears in the show zoning-rule
output displayed and highlighted below.
rtp_leaf1# show zoning-rule

Rule ID SrcEPG DstEPG FilterID operSt Scope Actio
======= ====== ====== ======== ====== ===== =====
4096 0 0 implicit enabled 16777200 deny,
4106 0 0 implicit enabled 2523136 deny,
4107 0 16386 implicit enabled 2523136 deny,
4147 0 32773 implicit enabled 2523136 permi
4148 0 16388 implicit enabled 2523136 permi
4149 0 32774 implicit enabled 2523136 permi
4150 0 16393 implicit enabled 2523136 permi
4151 0 32770 implicit enabled 2523136 permi
4152 16400 16391 17 enabled 2523136 permi
4153 16391 16400 17 enabled 2523136 permi
4154 16400 16391 18 enabled 2523136 permi
4155 16391 16400 18 enabled 2523136 permi
4097 16398 16394 default enabled 2523136 permi
4112 16394 16398 default enabled 2523136 permi
4120 16398 16399 default enabled 2523136 permi
4121 16399 16398 default enabled 2523136 permi
4122 32772 16387 default enabled 2523136 permi
4123 16387 32772 default enabled 2523136 permi
4124 32772 16389 default enabled 2523136 permi
4125 16389 32772 default enabled 2523136 permi
4126 16389 16387 default enabled 2523136 permi
4127 16387 16389 default enabled 2523136 permi
4117 0 0 implicit enabled 2457600 deny,
4118 0 0 implicit enabled 2883584 deny,
4119 0 32770 implicit enabled 2883584 deny,

Once the Scope ID information has been identified, as well as the rule and filter IDs, the following command can be
used to verify what Rule IDs and filters are being used for the scope previously identified. From the below display it
can be seen that rule 4149 with source any (s-any) and destination (d-32774) is being used.
rtp_leaf1# show system internal policy-mgr stats | grep 2523136
Rule (4097) DN (sys/actrl/scope-2523136/rule-2523136-s-16398-d-16394-f-default) Ingress: 206, Egress:
Rule (4106) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-any-f-implicit) Ingress: 35, Egress: 0
[snip]
Rule (4148) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16388-f-implicit) Ingress: 9, Egress: 0
Rule (4149) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-32774-f-implicit) Ingress: 8925, Egress:
Rule (4150) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16393-f-implicit) Ingress: 30, Egress: 4
[snip]

rtp_leaf1# show system internal policy-mgr stats | grep 2523136


Rule (4097) DN (sys/actrl/scope-2523136/rule-2523136-s-16398-d-16394-f-default) Ingress: 206, Egress:
Rule (4106) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-any-f-implicit) Ingress: 35, Egress: 0
[snip]
Rule (4148) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16388-f-implicit) Ingress: 9, Egress: 0
Rule (4149) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-32774-f-implicit) Ingress: 8935, Egress:
Rule (4150) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16393-f-implicit) Ingress: 30, Egress: 4
[snip]

Is this the rule ID that is expected to be incrementing? If so, another very useful command is show system internal
aclqos zoning-rules. Use of this command should be done under the direction of Cisco TAC, but it provides
confirmation that the hardware on the leaf has been programmed correctly.
The Source EPG and Destination EPG combination of interest is (0 and 32774). The next step is to identify the hardware
entries for these source and destination classes that match the rule ID in question (4149). The rules are numbered
sequentially. The rule ID of interest is highlighted below. It can be observed that hardware indexes (hw_index) 150
and 151 exist, which indicates that there is a hardware entry for this rule.
module-1# show system internal aclqos zoning-rules

===========================================
Rule ID: 1 Scope 4 Src EPG: 0 Dst EPG: 16386 Filter 65534
Curr TCAM resource:
=============================
unit_id: 0
=== Region priority: 2307 (rule prio: 9 entry: 3)===
sw_index = 23 | hw_index = 132
=== Region priority: 2307 (rule prio: 9 entry: 3)===
sw_index = 24 | hw_index = 133

[snip]

Dumping the hardware entry and examining the accuracy of the content is beyond the scope of this book, but at this
point there is sufficient information to contact the Cisco Technical Assistance Center (TAC) if there is a zoning rule
but no corresponding hardware entry.
===========================================
Rule ID: 4149 Scope 4 Src EPG: 0 Dst EPG: 32774 Filter 65534
Curr TCAM resource:
=============================
unit_id: 0
=== Region priority: 2311 (rule prio: 9 entry: 7)===
sw_index = 38 | hw_index = 150
=== Region priority: 2311 (rule prio: 9 entry: 7)===
sw_index = 39 | hw_index = 151

The GUI can also be used to verify contracts/zoning-rules. All the rules on the leaf can be examined as shown below
by going to Fabric->Inventory->Rules, then double clicking on a particular rule of interest.

The existing policy state can be verified using the GUI. Remember from the overview section that it is the responsibility of
Policy Manager to set the operational state of the rule. In the below display it can be verified whether the operational state is
disabled or enabled, whether the action is permit or deny, and the direction of the rule.
The statistics for each rule can also be examined to make a determination that the rule is being used. This was
demonstrated earlier using the CLI. Click on stats, and then the check mark as shown below to view stats in the GUI.

Select the packet counters of interest and the sampling interval to be monitored.
If the health score for that specific rule is not 100, its health status can be drilled down upon further. This provides
insight as to what problems may be occurring. Running out of hardware resources is just one factor that can cause the
health score to decrease. The use of Health Scores as an aid in troubleshooting is covered in more detail in the Faults
and Health Scores chapter.
Faults that have been generated as a direct result of this rule being applied to the leaf can also be analyzed. This is one
of the most important items to check when troubleshooting zoning-rules, or any other ACI policy.

17.2 Problem Description

Some of the Policy Zoning Rules problems that can be encountered during the ACI deployment process include, but
are not limited to, the items below. The commands already shown in the verification section above can be used to help
identify the problems below.

Symptom 1

End Point Groups can communicate when there is no contract configured.


Verification

• Check the GUI to see if any faults were generated if the rule/contract was recently removed.
• Verify in the GUI that the rule does not exist after identifying the scope for the context.
• Verify with the show zoning-rule CLI command that the rule does not exist.
• Verify in Visore that the rule does not exist on the APIC.
For example, using the following highlighted Rule ID:
rtp_leaf1# show zoning-rule

Rule ID SrcEPG DstEPG FilterID operSt Scope Actio
======= ====== ====== ======== ====== ===== =====
4096 0 0 implicit enabled 16777200 deny,

[snip]

4130 16388 32775 25 enabled 2523136 permi
4131 32775 16388 21 enabled 2523136 permi

Visore can be used to verify whether or not the rule exists by searching on “actrlRule” and filtering on the rule ID as shown below.
Visore can also be used to search on a specific filter (actrlFlt). It can be seen that the rule exists in this case, and is
applied to node 101 (leaf1). A scripted equivalent using the REST API is sketched after this list.

It can be verified that the filter entries are correct by drilling down on the arrows as shown below.
Statistics can also be checked directly from Visore as shown below.
• Verify using the commands in the verification section that the leaf does not have a hardware entry for that rule.
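The scripted equivalent mentioned above: a minimal sketch (node and scope values are examples from this chapter) that
lists the concrete actrlRule objects programmed on the leaf through the APIC API and matches them against the
context's scope, using the sys/actrl/scope-.../rule-... dn format shown in the policy-mgr output earlier.

import requests

APIC = "https://10.122.254.211"      # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
NODE = "topology/pod-1/node-101"     # rtp_leaf1 (node 101) in the reference topology
SCOPE = "2523136"                    # scope of the context being investigated

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

# Node-scoped class query: return the actrlRule objects present on this specific leaf.
resp = s.get(APIC + f"/api/node/class/{NODE}/actrlRule.json")
rules = [mo["actrlRule"]["attributes"] for mo in resp.json()["imdata"]]
matches = [r for r in rules if f"scope-{SCOPE}" in r["dn"]]

print(f"{len(matches)} rules programmed for scope {SCOPE} on {NODE}")
for r in matches:
    print(r["dn"], r.get("action", ""), r.get("operSt", ""))

For this symptom (communication without a contract), the point of interest is any non-implicit permit rule between
the two EPG class IDs; if none exists on the leaf yet traffic still passes, engage Cisco TAC as described in the
Resolution below.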

Resolution

If a rule (contract) is found to be configured, remove it to block communication between the EPGs. If there is no rule
configured, or the rule is unable to be verified, contact the Cisco Technical Assistance Center for help in diagnosing
the problem.

Symptom 2

End Point Groups cannot communicate when there is a contract configured.

Verification

As shown in the previous verification steps:


• Check the GUI for faults associated to the rule in question
• Verify in the GUI that the operational state is enabled
• Verify with the show zoning-rule CLI command that the rule exists after identifying the scope for the context
• Verify in the GUI or CLI that the rule entry counters are incrementing
• Verify the health score for that rule
• Verify in Visore that the rule exists (See above)
• Verify there is a corresponding hardware entry for that rule id in the CLI

Resolution

If a rule (contract) is found to be configured, contains the proper filter content (ports), and is found to have a cor-
responding hardware entry, then check for forwarding problems. If further assistance is required, contact the Cisco
Technical Assistance Center for help in diagnosing the problem.

Symptom 3

Hardware resource exhaustion when rules are being pushed and programmed on the leaf.

Verification

Check the GUI for faults associated with the rule in question. This will be Fault F1203 - Rule failed due to hardware
programming error.
The following fault, generated on the leaf, will also be observed: Fault F105504 - TCA: policy CAM entries usage current
value(eqptcapacityPolEntry5min:normalizedLast) value 93 raised above threshold 90
• Verify with the show zoning-rule CLI command that the rule exists after identifying the scope for the context
• Verify the health score for that rule

Resolution

Reduce the number of zoning rules (contracts) required. Explore the option of using “vzAny”, as well as other
contract optimization techniques. Contact the Cisco TAC if further assistance is required.
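The counter referenced by fault F105504 can also be polled directly while rules are being reduced. A minimal sketch
(the node value is an example) that reads the eqptcapacityPolEntry5min statistics from a leaf:

import requests

APIC = "https://10.122.254.211"      # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
NODE = "topology/pod-1/node-101"     # example leaf

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

resp = s.get(APIC + f"/api/node/class/{NODE}/eqptcapacityPolEntry5min.json")
for mo in resp.json()["imdata"]:
    attrs = mo["eqptcapacityPolEntry5min"]["attributes"]
    # normalizedLast is the utilization percentage compared against the 90% threshold in the fault.
    print(attrs["dn"], "normalizedLast:", attrs.get("normalizedLast"))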

Symptom 4

Configured rules are not being deployed on the leaf.

Verification

As shown in the previous verification steps:


• Check the GUI for faults associated to the rule in question
• Verify in the GUI that the operational state is enabled
• Verify in Visore that the rule exists (See above)
• Verify with show zoning-rules CLI command that the rule exists after identifying the scope for the context
• Verify using the commands in the verification section that the leaf has a hardware entry for the rule

Resolution

If a rule (contract) is present in the GUI and Visore, does not exist on the leaf, and there are no corresponding faults for
the policy that is being deployed, contact the Cisco Technical Assistance Center for help in diagnosing the problem.
Otherwise correct the configuration that is causing the fault and redeploy the policy.

18 Bridged Connectivity to External Networks

• Overview
• Problem Description
– Symptom 1
– Verification
– Resolution
– Symptom 2
– Verification
– Resolution
• Problem Description:
– Symptom 3
– Verification/Resolution
– Symptom 4
– Verification/Resolution
• Problem Description
– Symptom 1
– Verification
– Resolution
– Symptom 2
– Verification
– Resolution

18.1 Overview

This chapter covers potential issues that could occur with Bridged Connectivity to External Networks, starting with an
overview of how bridged connectivity to external networks should function and the verification steps used to confirm
a working layer 2 bridged network for the example reference topology fabric. The displays taken on a working fabric
can then be used as an aid in troubleshooting issues with external Layer 2 connectivity.
There are different ways to extend Layer 2 domain beyond the ACI fabric:
• Extend the EPG out of the ACI fabric - A user can extend an EPG out of the ACI fabric by statically assigning
a port (along with a VLAN ID) to an EPG. The leaf will learn the endpoint information and assign the traffic (by
matching the port and VLAN ID) to the proper EPG, and then enforce the policy. The endpoint learning, data
forwarding, and policy enforcement remain the same whether the endpoint is directly attached to the leaf port
or if it is behind a Layer 2 network (provided the proper VLAN is enabled in the Layer 2 network). A scripted
check of these static path bindings is sketched after this list.
• Extend the bridge domain out of the ACI fabric - This option is designed to extend the entire bridge domain
(not an individual EPG under bridge domain) to the outside network.
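For the first option, the ports and VLANs through which an EPG has been extended can be listed from the APIC. A
minimal sketch, where the EPG dn is a hypothetical example based on the reference topology and fvRsPathAtt is the
class holding static path bindings:

import requests

APIC = "https://10.122.254.211"      # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
EPG_DN = "uni/tn-Prod/ap-commerceworkspace/epg-Web"   # hypothetical EPG dn

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json", json=AUTH).raise_for_status()

resp = s.get(APIC + f"/api/node/mo/{EPG_DN}.json",
             params={"query-target": "children",
                     "target-subtree-class": "fvRsPathAtt"})
for mo in resp.json()["imdata"]:
    attrs = mo["fvRsPathAtt"]["attributes"]
    # tDn identifies the leaf port, port-channel, or vPC; encap is the VLAN expected on the wire.
    print(attrs["tDn"], attrs["encap"])

A mismatch between the encap listed here and the VLAN trunked by the external switch is a common cause of the
Layer 2 connectivity problems described in this chapter.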

18.2 Problem Description

There are many ways to connect ACI Leafs to external devices. The interface properties could be access, trunk,
port-channel, virtual port-channel, routed, routed sub-interfaces, or SVI. When establishing Layer 2 connectivity with
external devices, it’s important to match the properties such as VLANs tagged, LACP modes, etc. This is equally
applicable to networks hosted by ACI as well as external networks.
When a configuration mismatch occurs, depending on the parameters, the interfaces are either down, or not forwarding
traffic as expected.

Symptom 1

Interfaces connecting to the external devices are in down state.

Verification

In this example, rtp_leaf1 and rtp_leaf3 need to form a vPC pair and connect using a dedicated port-channel per UCS
Fabric Interconnect. A virtual port-channel has been configured, and the APIC has assigned port-channel4 as seen
below on rtp_leaf1 and rtp_leaf3.
Please note that although in this output both the vPC ID (2) and Port (Po4) match on both leafs, only the vPC ID needs
to match for this configuration to work.
rtp_leaf1# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link

vPC domain id : 10
Peer status : peer adjacency formed ok
vPC keep-alive status : Disabled
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 inconsistency reason : Consistency Check Not Performed
vPC role : primary
Number of vPCs configured : 2
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
Operational Layer3 Peer : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 up -

vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
2 Po4 down* success success -
343 Po1 up success success 700,751

rtp_leaf3# show vpc


Legend:
(*) - local vPC is down, forwarding via vPC peer-link

vPC domain id : 10
Peer status : peer adjacency formed ok
vPC keep-alive status : Disabled
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 inconsistency reason : Consistency Check Not Performed
vPC role : secondary
Number of vPCs configured : 2
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
Operational Layer3 Peer : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 up -

vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
2 Po4 down* success success -
343 Po1 up success success 700,751

Since the interfaces are in the down (D) state, a check of the interface status reveals whether there is a potential
Layer 1 issue, as seen below.
rtp_leaf1# show interface ethernet 1/27
Ethernet1/27 is down (sfp-speed-mismatch)
admin state is up, Dedicated Interface

On the GUI, Fabric -> Inventory -> Pod1 -> leafname -> Interfaces -> vPC Interfaces -> <vPC Domain ID> -> <vPC
ID> -> Faults shows the corresponding fault.
Resolution

Configure the interface policies to match the peer device. In this scenario, the speed mismatch was addressed by
changing the interface policy from 1G to 10G.
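In the object model, the interface speed comes from a Link Level interface policy (class fabricHIfPol) referenced by
the interface policy group applied to these ports. The following is a minimal sketch of creating a 10G policy over the
REST API; the policy name, APIC address and credentials are assumptions, and the policy still has to be referenced from
the relevant interface policy group.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Link Level interface policy set to 10G
payload = """
<infraInfra>
  <fabricHIfPol name="10G-Auto" speed="10G" autoNeg="on"/>
</infraInfra>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)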

Symptom 2

Certain interfaces are in the ‘suspended’ state when configuring a port-channel or virtual port-channel.

Verification

Check the status of the vPC and port-channel using the CLI or GUI.
rtp_leaf1# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 10
Peer status : peer adjacency formed ok
vPC keep-alive status : Disabled
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 inconsistency reason : Consistency Check Not Performed
vPC role : primary
Number of vPCs configured : 2
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
Operational Layer3 Peer : Disabled

vPC Peer-link status


---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 up -

vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
2 Po4 up success success 600-601,634
,639,667-66
8
343 Po1 up success success 700,751

rtp_leaf1# show port-channel summary


Flags: D - Down P - Up in port-channel (members)
I - Individual H - Hot-standby (LACP only)
s - Suspended r - Module-removed
S - Switched R - Routed
U - Up (port-channel)
M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port- Type Protocol Member Ports
Channel
-------------------------------------------------------------------------------
1 Po1(SU) Eth LACP Eth1/42(P) Eth1/44(P)
4 Po4(SU) Eth LACP Eth1/27(P) Eth1/28(s)

On the GUI, Fabric -> Inventory -> Pod1 -> leafname -> Interfaces -> vPC Interfaces -> <vPC Domain ID> -> <vPC
ID> -> Faults shows that although the vPC is up, links to only one neighbor are members of this port-channel.
Since some interfaces are in the suspended (s) state, a check of the LACP interface status reveals whether there are any
problems with LACP communication with the peer. In this example, the peer LACP system identifiers are different,
indicating two different peer devices.
rtp_leaf1# show lacp interface ethernet 1/27 | grep -A 2 Neighbor
Neighbor: 0x113
MAC Address= 00-0d-ec-b1-a0-3c
System Identifier=0x8000,00-0d-ec-b1-a0-3c
rtp_leaf1# show lacp interface ethernet 1/28 | grep -A 2 Neighbor
Neighbor: 0x113
MAC Address= 00-0d-ec-b1-a9-fc
System Identifier=0x8000,00-0d-ec-b1-a9-fc
Resolution

In this example, rtp_leaf1 and rtp_leaf3 need to form a vPC pair, and they each need to connect using a dedicated
port-channel for each UCS Fabric Interconnect. When using the vPC wizard or directly configuring virtual port-
channels, unique interface policy groups and interface selectors are needed to create dedicated port-channels for each
peer device, such as UCS Fabric Interconnect A and Fabric Interconnect B.
Once the needed configuration is done, two independent port-channels are created.

GUI:

Fabric -> Access Policies -> Interface Policies -> Policy Groups
Fabric -> Access Policies -> Interface Policies -> Profiles
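The same two policy groups can be sketched over the REST API. In the object model a vPC interface policy group is an
infraAccBndlGrp with lagT="node"; the policy group names and the referenced LACP policy below are assumptions, and the
interface selectors in the profiles then point at one policy group each.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# One vPC interface policy group per UCS Fabric Interconnect
payload = """
<infraInfra>
  <infraFuncP>
    <infraAccBndlGrp name="vpc-ucs-fi-a" lagT="node">
      <infraRsLacpPol tnLacpLagPolName="lacp-active"/>
    </infraAccBndlGrp>
    <infraAccBndlGrp name="vpc-ucs-fi-b" lagT="node">
      <infraRsLacpPol tnLacpLagPolName="lacp-active"/>
    </infraAccBndlGrp>
  </infraFuncP>
</infraInfra>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)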

rtp_leaf1# show port-channel summary


Flags: D - Down P - Up in port-channel (members)
I - Individual H - Hot-standby (LACP only)
s - Suspended r - Module-removed
S - Switched R - Routed
U - Up (port-channel)
M - Not in use. Min-links not met
-------------------------------------------------------------------------------
Group Port- Type Protocol Member Ports
Channel
-------------------------------------------------------------------------------
1 Po1(SU) Eth LACP Eth1/42(P) Eth1/44(P)
5 Po5(SU) Eth LACP Eth1/27(P)
6 Po6(SD) Eth LACP Eth1/28(P)
18.3 Problem Description

There are various use cases, such as migration, where Layer 2 extension outside the ACI fabric is needed, either through
direct EPG extension or through a dedicated L2 Out connection. During migration scenarios, there is also often a need
for the default gateway to remain external to the ACI fabric.
Most of the time, the problem presents itself as a reachability problem between fabric-hosted endpoints and external
networks. In this example, the Web tier within the tenant Test is used, and the WebServer address
10.2.1.11 needs to be reachable from the Nexus 7K, which has been configured as follows:
N7K-1-65-vdc_4# show hsrp brie
P indicates configured to preempt.
|
Interface Grp Prio P State Active addr Standby addr Group addr
Vlan700 700 110 P Active local 10.1.0.2 10.1.0.1 (conf)
Vlan750 750 110 P Active local 10.2.0.2 10.2.0.1 (conf)
Vlan751 751 110 P Active local 10.2.1.2 10.2.1.1 (conf)

The WebServer IP (10.2.1.11) and the BD IP (10.2.1.254) are unreachable from the Nexus 7Ks.
N7K-1-65-vdc_4# ping 10.2.1.254
PING 10.2.1.254 (10.2.1.254): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out

--- 10.2.1.254 ping statistics ---


5 packets transmitted, 0 packets received, 100.00% packet loss
N7K-1-65-vdc_4# ping 10.2.1.11
PING 10.2.1.11 (10.2.1.11): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 0 packets received, 100.00% packet loss

• Extending the EPG directly out of the ACI fabric

Symptom 3

The Leaf is not getting programmed with the correct VLANs for BridgeDomain and EPGs.

Verification/Resolution

The reachability problem can have many causes, but the most common one is that the right leafs are not
programmed with the correct VLANs used for bridge domain and EPG identification.
BD relationship with the context:
The context-BD relationship is key to programming the leaf. Without it, the VLANs do not get programmed on the
leaf, as shown below.
rtp_leaf1# show vlan brief

VLAN Name Status Ports


---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
27 Test:Database active Eth1/27, Eth1/28, Po2, Po3
28 Test:CommerceWorkspaceTest:Datab active Eth1/27, Eth1/28, Po2, Po3
ase

Once the BD is assigned the right context, the BD and EPG VLANs get programmed appropriately, as the show vlan
output after the sketch below confirms.
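A minimal sketch of that BD-to-context relationship over the REST API is shown here; the APIC address and credentials
are assumptions, while the BD and context names follow this chapter's Test tenant examples.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Associate bridge domain Web with context (VRF) Test in tenant Test
payload = """
<fvTenant name="Test">
  <fvBD name="Web">
    <fvRsCtx tnFvCtxName="Test"/>
  </fvBD>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)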
rtp_leaf1# show vlan brief

VLAN Name Status Ports


---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
27 Test:Database active Eth1/27, Eth1/28, Po2, Po3
28 Test:CommerceWorkspaceTest:Datab active Eth1/27, Eth1/28, Po2, Po3
ase
37 Test:Web active Eth1/27, Eth1/28, Po2, Po3
38 Test:CommerceWorkspaceTest:Web active Eth1/27, Eth1/28, Po2, Po3

While the VLANs are programmed properly, the set of ports on which the VLANs are carried is incomplete. The
Po2 and Po3 links to the UCS hosting the VMs are shown here, but the links to the N7Ks are not, which leads to
the next possible issue.
Ports not being programmed with the EPG encap VLANs:
In this example, VLAN 751 is used to connect to the Nexus 7Ks, and the EPG has been dynamically assigned VLAN
639 within the scope of the VMM domain. The following output confirms that while VLAN 639 has been programmed,
VLAN 751 is not present on the leaf.
rtp_leaf1# show vlan extended

VLAN Name Status Ports


---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
27 Test:Database active Eth1/27, Eth1/28, Po2, Po3
28 Test:CommerceWorkspaceTest:Datab active Eth1/27, Eth1/28, Po2, Po3
ase
37 Test:Web active Eth1/27, Eth1/28, Po2, Po3
38 Test:CommerceWorkspaceTest:Web active Eth1/27, Eth1/28, Po2, Po3

VLAN Type Vlan-mode Encap


---- ----- ---------- -------------------------------
13 enet CE vxlan-16777209, vlan-3500
27 enet CE vxlan-14680064
28 enet CE vlan-600
37 enet CE vxlan-15794150
38 enet CE vlan-639

For this issue to be resolved, the EPG needs to be bound to a port/leaf, and the L2Out domain needs to be attached
to the EPG. The L2Out domain in turn needs to be associated with a VLAN pool containing VLAN 751, as sketched below.
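A minimal sketch of the VLAN pool and domain portion is shown below; the pool and domain names are assumptions, and
the domain (together with a static path binding like the one sketched earlier in this chapter) still has to be attached
to the EPG.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Static VLAN pool containing VLAN 751, referenced by an external bridged domain
payload = """
<polUni>
  <infraInfra>
    <fvnsVlanInstP name="L2Out_vlans" allocMode="static">
      <fvnsEncapBlk from="vlan-751" to="vlan-751"/>
    </fvnsVlanInstP>
  </infraInfra>
  <l2extDomP name="L2Out">
    <infraRsVlanNs tDn="uni/infra/vlanns-[L2Out_vlans]-static"/>
  </l2extDomP>
</polUni>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)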
Once the configuration is applied, the leafs are programmed to carry all the relevant L2 constructs: the BD, the encap
VLAN for the VMM domain, and the encap VLAN for the L2Out. Also, the right interfaces are mapped to the encap VLANs:
VLAN 751 on Po1 toward the N7Ks and VLAN 639 on Po2 and Po3 toward both Fabric Interconnects of the UCS system.
rtp_leaf1# show port-channel summary
Flags: D - Down P - Up in port-channel (members)
I - Individual H - Hot-standby (LACP only)
s - Suspended r - Module-removed
S - Switched R - Routed
U - Up (port-channel)
M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port- Type Protocol Member Ports
Channel
--------------------------------------------------------------------------------
1 Po1(SU) Eth LACP Eth1/42(P) Eth1/44(P)
2 Po2(SU) Eth LACP Eth1/27(P)
3 Po3(SU) Eth LACP Eth1/28(P)

rtp_leaf1# show vlan extended

VLAN Name Status Ports


---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
14 Test:CommerceWorkspaceTest:Web active Eth1/42, Eth1/44, Po1
27 Test:Database active Eth1/27, Eth1/28, Po2, Po3
28 Test:CommerceWorkspaceTest:Datab active Eth1/27, Eth1/28, Po2, Po3
ase
37 Test:Web active Eth1/27, Eth1/28, Eth1/42,
Eth1/44, Po1, Po2, Po3
38 Test:CommerceWorkspaceTest:Web active Eth1/27, Eth1/28, Po2, Po3

VLAN Type Vlan-mode Encap


---- ----- ---------- -------------------------------
13 enet CE vxlan-16777209, vlan-3500
14 enet CE vlan-751
27 enet CE vxlan-14680064
28 enet CE vlan-600
37 enet CE vxlan-15794150
38 enet CE vlan-639

rtp_leaf1# show vpc


Legend:
(*) - local vPC is down, forwarding via vPC peer-link

vPC domain id : 10
Peer status : peer adjacency formed ok
vPC keep-alive status : Disabled
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 inconsistency reason : Consistency Check Not Performed
vPC role : primary
Number of vPCs configured : 3
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
Operational Layer3 Peer : Disabled

vPC Peer-link status


---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 up -

vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
1 Po2 up success success 600-601,634
,639,666-66
9
343 Po1 up success success 700,751

684 Po3 up success success 600-601,634


,639,667-66
8
rtp_leaf1#

Testing now reveals that while the N7K can ping the BD address of 10.2.1.254, it cannot ping the WebServer VM
(10.2.1.11).
N7K-2-50-N7K2# ping 10.2.1.254
PING 10.2.1.254 (10.2.1.254): 56 data bytes
Request 0 timed out
64 bytes from 10.2.1.254: icmp_seq=1 ttl=56 time=1.656 ms
64 bytes from 10.2.1.254: icmp_seq=2 ttl=56 time=0.568 ms
64 bytes from 10.2.1.254: icmp_seq=3 ttl=56 time=0.826 ms
64 bytes from 10.2.1.254: icmp_seq=4 ttl=56 time=0.428 ms

--- 10.2.1.254 ping statistics ---


5 packets transmitted, 4 packets received, 20.00% packet loss
round-trip min/avg/max = 0.428/0.869/1.656 ms

N7K-2-50-N7K2# ping 10.2.1.11


PING 10.2.1.11 (10.2.1.11): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 0 packets received, 100.00% packet loss
N7K-2-50-N7K2#

The scenario described here is intra-EPG connectivity, where contracts are not applied, so this is not related to
any filter. This brings us to the next symptom.

Symptom 4

ACI Fabric is not learning the endpoint IPs on the leafs.

Verification/Resolution

Once the leaf is programmed, endpoints are learned as traffic is received. Endpoint learning is key when the
BD is in Hardware Proxy mode, so that the fabric can forward packets efficiently.
Since the N7Ks can ping the BD pervasive gateway address but not the WebServer IP of 10.2.1.11, the next step is to
check the endpoint table.
As seen below, while the N7K addresses (10.2.1.2, 10.2.1.3) are present, the WebServer IP (10.2.1.11) is missing from
the endpoint table.
rtp_leaf1# show endpoint vrf Test:Test detail
Legend:
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------
Test:Test 10.1.0.101 L
38/Test:Test vlan-639 0050.56bb.d508 LV po2 Test:CommerceWorkspa
14/Test:Test vlan-751 0026.f064.0000 LpV po1 Test:CommerceWorkspa
14/Test:Test vlan-751 0000.0c9f.f2ef LpV po1 Test:CommerceWorkspa
14 vlan-751 0026.980a.df44 LpV po1 Test:CommerceWorkspa
Test:Test vlan-751 10.2.1.2 LV
14 vlan-751 001b.54c2.2644 LV po1 Test:CommerceWorkspa
Test:Test vlan-751 10.2.1.3 LV
+------------------------------------------------------------------------------+
Endpoint Summary
+------------------------------------------------------------------------------+
Total number of Local Endpoints : 6
Total number of Remote Endpoints : 0
Total number of Peer Endpoints : 0
Total number of vPC Endpoints : 5
Total number of non-vPC Endpoints : 1
Total number of MACs : 5
Total number of VTEPs : 0
Total number of Local IPs : 3
Total number of Remote IPs : 0
Total number All EPs : 6
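The same endpoint check can also be made from the APIC by querying the fvCEp (client endpoint) class over the REST
API; a minimal sketch, with the APIC address and credentials as assumptions, is:

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Look for a learned endpoint with IP 10.2.1.11 anywhere in the fabric
r = s.get(APIC + "/api/node/class/fvCEp.json",
          params={"query-target-filter": 'eq(fvCEp.ip,"10.2.1.11")'})
print(r.json()["totalCount"])  # 0 while the endpoint has not been learned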

Just as MAC addresses are learned in traditional switching, endpoints are learned by the leaf when the first packet is
received. A ping from the VM triggers this learning, and the following output confirms it:
rtp_leaf1# show endpoint vrf Test:Test detail
Legend:
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------
Test:Test 10.1.0.101 L
38 vlan-639 0050.56bb.d508 LpV po2 Test:CommerceWorkspa
Test:Test vlan-639 10.2.1.11 LV
14/Test:Test vlan-751 0026.f064.0000 LpV po1 Test:CommerceWorkspa
14/Test:Test vlan-751 0000.0c9f.f2ef LpV po1 Test:CommerceWorkspa
14 vlan-751 0026.980a.df44 LpV po1 Test:CommerceWorkspa
Test:Test vlan-751 10.2.1.2 LV
14 vlan-751 001b.54c2.2644 LV po1 Test:CommerceWorkspa
Test:Test vlan-751 10.2.1.3 LV

+------------------------------------------------------------------------------+
Endpoint Summary
+------------------------------------------------------------------------------+
Total number of Local Endpoints : 6
Total number of Remote Endpoints : 0
Total number of Peer Endpoints : 0
Total number of vPC Endpoints : 5
Total number of non-vPC Endpoints : 1
Total number of MACs : 5
Total number of VTEPs : 0
Total number of Local IPs : 4
Total number of Remote IPs : 0
Total number All EPs : 6

Once the endpoint is learned, the ping from the N7K to the WebServer IP of 10.2.1.11 is successful.
N7K-1-65-vdc_4# ping 10.2.1.11
PING 10.2.1.11 (10.2.1.11): 56 data bytes
64 bytes from 10.2.1.11: icmp_seq=0 ttl=127 time=1.379 ms
64 bytes from 10.2.1.11: icmp_seq=1 ttl=127 time=1.08 ms
64 bytes from 10.2.1.11: icmp_seq=2 ttl=127 time=0.498 ms
64 bytes from 10.2.1.11: icmp_seq=3 ttl=127 time=0.479 ms
64 bytes from 10.2.1.11: icmp_seq=4 ttl=127 time=0.577 ms

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.479/0.802/1.379 ms
N7K-1-65-vdc_4#

This issue is also seen with the N7K HSRP address, since the N7K does not normally source packets from the
HSRP address; in the tables above, the HSRP IP (10.2.1.1) is missing from the endpoint table.
Forcing the N7K to source packets from the HSRP address populates the endpoint table.
N7K-1-65-vdc_4# ping 10.2.1.254 source 10.2.1.1
PING 10.2.1.254 (10.2.1.254) from 10.2.1.1: 56 data bytes
Request 0 timed out
64 bytes from 10.2.1.254: icmp_seq=1 ttl=57 time=1.472 ms
64 bytes from 10.2.1.254: icmp_seq=2 ttl=57 time=1.062 ms
64 bytes from 10.2.1.254: icmp_seq=3 ttl=57 time=1.097 ms
64 bytes from 10.2.1.254: icmp_seq=4 ttl=57 time=1.232 ms

--- 10.2.1.254 ping statistics ---


5 packets transmitted, 4 packets received, 20.00% packet loss
round-trip min/avg/max = 1.062/1.215/1.472 ms
N7K-1-65-vdc_4#

This is one of the reasons why, when the default gateway is outside the fabric, the BD should be configured for
flooding and NOT Hardware Proxy.
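A minimal sketch of that BD setting over the REST API is shown here; the BD and tenant names follow this chapter,
unkMacUcastAct and arpFlood are the relevant attributes of the fvBD class, and the APIC address and credentials are
assumptions.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Flood unknown unicast and ARP in the BD instead of using Hardware Proxy
payload = """
<fvTenant name="Test">
  <fvBD name="Web" unkMacUcastAct="flood" arpFlood="yes"/>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)

After the sourced ping, the endpoint table that follows also shows the HSRP IP (10.2.1.1) being learned.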
rtp_leaf1# show endpoint vrf Test:Test detail
Legend:
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------
Test:Test 10.1.0.101 L
38 vlan-639 0050.56bb.d508 LV po2 Test:CommerceWorkspa
Test:Test vlan-639 10.2.1.11 LV
14/Test:Test vlan-751 0026.f064.0000 LpV po1 Test:CommerceWorkspa
14/Test:Test vlan-751 0000.0c9f.f2ef LpV po1 Test:CommerceWorkspa
14 vlan-751 0026.980a.df44 LpV po1 Test:CommerceWorkspa
Test:Test vlan-751 10.2.1.2 LV
14 vlan-751 001b.54c2.2644 LpV po1 Test:CommerceWorkspa
Test:Test vlan-751 10.2.1.3 LV
Test:Test vlan-751 10.2.1.1 LV

+------------------------------------------------------------------------------+
Endpoint Summary
+------------------------------------------------------------------------------+
Total number of Local Endpoints : 6
Total number of Remote Endpoints : 0
Total number of Peer Endpoints : 0
Total number of vPC Endpoints : 5
Total number of non-vPC Endpoints : 1
Total number of MACs : 5
Total number of VTEPs : 0
Total number of Local IPs : 5
Total number of Remote IPs : 0
Total number All EPs : 6

N7K-1-65-vdc_4# ping 10.2.1.11 source 10.2.1.1


PING 10.2.1.11 (10.2.1.11) from 10.2.1.1: 56 data bytes
64 bytes from 10.2.1.11: icmp_seq=0 ttl=127 time=1.276 ms
64 bytes from 10.2.1.11: icmp_seq=1 ttl=127 time=0.751 ms
64 bytes from 10.2.1.11: icmp_seq=2 ttl=127 time=0.752 ms
64 bytes from 10.2.1.11: icmp_seq=3 ttl=127 time=0.807 ms
64 bytes from 10.2.1.11: icmp_seq=4 ttl=127 time=0.741 ms

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.741/0.865/1.276 ms
N7K-1-65-vdc_4#

18.4 Problem Description

Extending the bridge domain using an external bridged network:
In this scenario, Layer 2 extension is achieved using a dedicated external EPG, which addresses spanning tree
interoperability when integrating/extending external Layer 2 networks with the ACI fabric.
In this setup, the Web EPG is not directly associated with the L2Out interfaces toward the N7K. Instead, the Web EPG is
associated with BD Web, which is then extended using an external bridged connectivity named L2Out, with the external
networks that are allowed to communicate with the Web tier identified under it.
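In the object model this corresponds to an l2extOut under the tenant. The following is a minimal sketch only; the
node/path value and the external EPG name are assumptions, while the BD name and encap VLAN follow this example.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# External bridged network (L2Out) extending BD Web on VLAN 751
payload = """
<fvTenant name="Test">
  <l2extOut name="L2Out">
    <l2extRsEBd tnFvBDName="Web" encap="vlan-751"/>
    <l2extLNodeP name="nodes">
      <l2extLIfP name="interfaces">
        <l2extRsPathL2OutAtt tDn="topology/pod-1/paths-101/pathep-[eth1/42]"/>
      </l2extLIfP>
    </l2extLNodeP>
    <l2extInstP name="L2Out_EPG"/>
  </l2extOut>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)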

Symptom 1

The Leaf is not getting programmed with the correct VLANs for BridgeDomain and EPGs.

Verification

As with the previous scenario of direct EPG extension outside the fabric, the leafs need to be programmed
before the endpoints can be learned. A configuration mismatch is the most common problem seen when
defining the external bridged network.
rtp_leaf1# show vlan extended

VLAN Name Status Ports


---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
27 Test:Database active Eth1/27, Eth1/28, Po2, Po3
28 Test:CommerceWorkspaceTest:Datab active Eth1/27, Eth1/28, Po2, Po3
ase
37 Test:Web active Eth1/27, Eth1/28, Eth1/42,
Eth1/44, Po1, Po2, Po3
38 Test:CommerceWorkspaceTest:Web active Eth1/27, Eth1/28, Po2, Po3
68 -- active Eth1/42, Eth1/44, Po1

VLAN Type Vlan-mode Encap


---- ----- ---------- -------------------------------
13 enet CE vxlan-16777209, vlan-3500
27 enet CE vxlan-14680064
28 enet CE vlan-600
37 enet CE vxlan-15794150
38 enet CE vlan-639
68 enet CE vlan-750

Since the N7Ks are expecting VLAN 751 but the L2Out is configured with VLAN 750, the Layer 2 domains are not
extended correctly.

Resolution

Changing it to VLAN 751 allows the N7K to ping the BD address of 10.2.1.254, but not the WebServer at 10.2.1.11. This is
because the external network is identified as an EPG (L2Out), and contracts are needed to allow communication
between any two EPGs.
rtp_leaf1# show vlan extended
VLAN Name Status Ports
---- -------------------------------- --------- -------------------------------
13 infra:default active Eth1/1, Eth1/2, Eth1/5, Eth1/35
27 Test:Database active Eth1/27, Eth1/28, Po2, Po3
28 Test:CommerceWorkspaceTest:Datab active Eth1/27, Eth1/28, Po2, Po3
ase
37 Test:Web active Eth1/27, Eth1/28, Eth1/42,
Eth1/44, Po1, Po2, Po3
38 Test:CommerceWorkspaceTest:Web active Eth1/27, Eth1/28, Po2, Po3
69 -- active Eth1/42, Eth1/44, Po1

VLAN Type Vlan-mode Encap


---- ----- ---------- -------------------------------
13 enet CE vxlan-16777209, vlan-3500
27 enet CE vxlan-14680064
28 enet CE vlan-600
37 enet CE vxlan-15794150
38 enet CE vlan-639
69 enet CE vlan-751

N7K-1-65-vdc_4# ping 10.2.1.254


PING 10.2.1.254 (10.2.1.254): 56 data bytes
64 bytes from 10.2.1.254: icmp_seq=0 ttl=56 time=1.068 ms
64 bytes from 10.2.1.254: icmp_seq=1 ttl=56 time=0.753 ms
64 bytes from 10.2.1.254: icmp_seq=2 ttl=56 time=0.708 ms
64 bytes from 10.2.1.254: icmp_seq=3 ttl=56 time=0.731 ms
64 bytes from 10.2.1.254: icmp_seq=4 ttl=56 time=0.699 ms
--- 10.2.1.254 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.699/0.791/1.068 ms

N7K-1-65-vdc_4# ping 10.2.1.11


PING 10.2.1.11 (10.2.1.11): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 0 packets received, 100.00% packet loss

Symptom 2

The Leaf is programmed with correct VLANs and Interfaces as expected, but the servers are unreachable from the
outside L2 network.

Verification

The presence of contracts between the Web EPG and the L2Out EPG needs to be checked to confirm reachability of the
WebServer from the N7K. The pcTag of the Web EPG is found from Visore to be 49153.
rtp_leaf1# show zoning-rule | grep 49153

rtp_leaf1#
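The pcTag can also be read over the REST API instead of Visore; a minimal sketch (the APIC address and credentials are
assumptions, the EPG DN follows this chapter's naming) is:

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Read the Web EPG object and print its pcTag attribute
r = s.get(APIC + "/api/node/mo/uni/tn-Test/ap-CommerceWorkspaceTest/epg-Web.json")
print(r.json()["imdata"][0]["fvAEPg"]["attributes"]["pcTag"])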

Resolution

After configuring contracts between the Web EPG and the L2Out EPG, the command output is as below:
rtp_leaf1# show zoning-rule | grep 49153
5352 49153 16386 default enabled 2883584 permi
5353 16386 49153 default enabled 2883584 permi

Once the contracts are defined, the pings from the N7K are successful. Endpoints are still learned only as they send
traffic, so the endpoint-learning issues highlighted in the previous section (extending the EPG directly out of the
ACI fabric, endpoint not in the endpoint table) apply in this scenario as well.
N7K-1-65-vdc_4# ping 10.2.1.11
PING 10.2.1.11 (10.2.1.11): 56 data bytes
64 bytes from 10.2.1.11: icmp_seq=0 ttl=127 time=1.676 ms
64 bytes from 10.2.1.11: icmp_seq=1 ttl=127 time=0.689 ms
64 bytes from 10.2.1.11: icmp_seq=2 ttl=127 time=0.626 ms
64 bytes from 10.2.1.11: icmp_seq=3 ttl=127 time=0.75 ms
64 bytes from 10.2.1.11: icmp_seq=4 ttl=127 time=0.797 ms

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.626/0.907/1.676 ms
19 Routed Connectivity to External Networks

• Overview
– External Route Distribution Inside Fabric
• Fabric Verification
– Output from Spine 1
– Output from Spine 2
• Problem description
– Symptom
– Verification/Resolution
• Problem description
– Symptom 1
– Verification 1
– Resolution
– Verification 2
– Resolution 1
– Resolution 2
– Symptom 2
– Verification
– Resolution
• Problem Description
– Symptom
– Verification
– Resolution

19.1 Overview

External network connectivity is an essential component of a useful fabric deployment. To accommodate connections
to external network entities, the ACI fabric provides the ability to automate provisioning of external network
connections through the policy model, and this chapter provides an overview of and troubleshooting for external
network connection methods.
Routed external network connectivity is provided by the association of an external routed domain with a special EPG in
a tenant. This EPG expresses the routed network reachability into an ACI fabric as an object that can be managed and
manipulated like any other object. Within the Layer 3 external network, the configurable options are BGP, OSPF,
or static routes. Configuration of this object involves switch-specific configuration and interface-specific configuration.
The Layer 3 External Instance Profile EPG exposes the external EPG to tenant EPGs through a contract.
As of ACI software version 1.0(1k), there is one operational caveat that dictates that only one outside network can
be configured per leaf switch. However, the outside network configuration can easily be reused for multiple nodes by
associating multiple nodes with the L3 External Node Profile.

External Route Distribution Inside Fabric

Multiprotocol Border Gateway Protocol (MP-BGP) is the routing protocol running internal to the fabric. A border
leaf (an ACI leaf that provides host, fabric, and external network connections) can peer with external networks and
redistribute external routes into the internal MP-BGP. The fabric leverages MP-BGP to distribute external routes to
other leaf switches. External routes are propagated to leaf switches where there are end points attached for a given
tenant.
Route distribution does not occur by default; MP-BGP has to be enabled by configuration. To configure route
distribution, MP-BGP has to be turned on by assigning a BGP AS number and configuring spine nodes as BGP
route reflectors. As a result, the APIC configures all leaf nodes as MP-BGP route reflector clients. The APIC also
automates the provisioning of the BGP components that provide this functionality: BGP session setup, Route
Distinguishers, import and export targets, the VPNv4 address family, and route-maps for redistribution. Sessions are
established between the TEP IPs of the leafs and the route reflector functions running on the spines. The MP-BGP
process is contained within the overlay-1 VRF, which is part of the infra tenant. It is important to highlight that
MP-BGP does not carry endpoint tables (endpoint MAC and IP entries). While BGP leverages the TEP IPs for session
establishment, IS-IS is leveraged for reachability of the node TEP IPs.
Border leafs advertise tenant public subnets to external routers. Transit routing is currently not supported: while the
border leafs inject external routes into MP-BGP, external routes learned by a border leaf are not advertised back outside
of the fabric. External routes distributed to non-border leafs are installed with the next hop set to the overlay VRF TEP
address of the border leaf where the route was learned.

19.2 Fabric Verification

The BGP Process is started once the BGP object has a valid ASN.

Output from Spine 1

rtp_spine1# cat /mit/sys/bgp/inst/summary


# BGP Instance
activateTs : 2014-10-15T13:11:25.669-04:00
adminSt : enabled
asPathDbSz : 0
asn : 10
attribDbSz : 736
childAction :
createTs : 2014-10-15T13:11:24.415-04:00
ctrl :
dn : sys/bgp/inst
lcOwn : local
memAlert : normal
modTs : 2014-10-15T13:11:19.746-04:00
monPolDn : uni/fabric/monfab-default
name : default
numAsPath : 0
numRtAttrib : 8
operErr :
rn : inst
snmpTrapSt : disable
status :
syslogLvl : err
ver : v4
waitDoneTs : 2014-10-15T13:11:36.640-04:00
rtp_spine1# show vrf
VRF-Name VRF-ID State Reason
black-hole 3 Up --
management 2 Up --
overlay-1 4 Up --

rtp_spine1# show bgp sessions vrf overlay-1


Total peers 3, established peers 3
ASN 10
VRF overlay-1, local ASN 10
peers 3, established peers 3, local router-id 172.16.136.93
State: I-Idle, A-Active, O-Open, E-Established, C-Closing, S-Shutdown

Neighbor ASN Flaps LastUpDn|LastRead|LastWrit St Port(L/R) Notif(S/R)


172.16.136.92 10 0 00:00:20|never |never E 179/36783 0/0
172.16.136.95 10 0 00:00:20|never |never E 179/49138 0/0
172.16.136.91 10 0 00:00:19|never |never E 179/56262 0/0

Output from Spine 2

rtp_spine2# cat /mit/sys/bgp/inst/summary


# BGP Instance
activateTs : 2014-10-15T13:11:26.594-04:00
adminSt : enabled
asPathDbSz : 0
asn : 10
attribDbSz : 736
childAction :
createTs : 2014-10-15T13:11:25.363-04:00
ctrl :
dn : sys/bgp/inst
lcOwn : local
memAlert : normal
modTs : 2014-10-15T13:11:19.746-04:00
monPolDn : uni/fabric/monfab-default
name : default
numAsPath : 0
numRtAttrib : 8
operErr :
rn : inst
snmpTrapSt : disable
status :
syslogLvl : err
ver : v4
waitDoneTs : 2014-10-15T13:11:32.901-04:00

rtp_spine2# show bgp sessions vrf overlay-1


Total peers 3, established peers 3
ASN 10
VRF overlay-1, local ASN 10
peers 3, established peers 3, local router-id 172.16.136.94
State: I-Idle, A-Active, O-Open, E-Established, C-Closing, S-Shutdown

Neighbor ASN Flaps LastUpDn|LastRead|LastWrit St Port(L/R) Notif(S/R)


172.16.136.91 10 0 00:05:15|never |never E 179/49429 0/0
172.16.136.95 10 0 00:05:14|never |never E 179/47068 0/0
172.16.136.92 10 0 00:05:14|never |never E 179/32889 0/0

19.3 Problem description

External routes are not reachable from the fabric.


Symptom

When checking routing table entries for a given VRF on a leaf, no BGP routes are shown or directly connected routes
are not distributed to other leafs.

Verification/Resolution

Whether BGP is running and has established sessions can be confirmed on the spine by running the command show bgp sessions vrf all:
rtp_spine1# show bgp session vrf all

Note: BGP process currently not running

Route reflector configuration includes modifying the default Fabric Pod policy to include a Policy Group with a
relationship to the default BGP Route Reflector policy. The BGP Route Reflector needs to have a defined BGP AS
number with two spines selected as the route reflectors.
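A minimal sketch of the BGP policy itself is shown here; the ASN of 10 matches the earlier output, the spine node IDs
are assumptions, and the pod policy group association described above still has to be in place.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Default BGP policy: fabric AS number and spine route reflectors (node IDs assumed)
payload = """
<fabricInst>
  <bgpInstPol name="default">
    <bgpAsP asn="10"/>
    <bgpRRP>
      <bgpRRNodePEp id="201"/>
      <bgpRRNodePEp id="202"/>
    </bgpRRP>
  </bgpInstPol>
</fabricInst>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)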
Other troubleshooting commands:
show bgp sessions vrf <name | all>
show bgp ipv4 unicast vrf <name | all>
show bgp vpnv4 unicast vrf <name | all>
show ip bgp neighbors vrf <name | all>
show ip bgp neighbors <a.b.c.d> vrf <name | all>
show ip bgp nexthop-database vrf <name | all>

19.4 Problem description

Devices that should be reachable via OSPF through the ACI fabric are unreachable.
For this example, the reference topology is used. Endpoint IPs within the ACI fabric are in most cases expected to be
routable and reachable from the external/outside network. In the reference topology, leaf1 and leaf3 act as border
leafs peering with the external Nexus 7000 devices using OSPF. For this use case, the DB endpoint IP of 10.1.3.31 is
pinged from the Nexus 7Ks.
N7K-2-50-N7K2# ping 10.1.3.31
PING 10.1.3.31 (10.1.3.31): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out

--- 10.1.3.31 ping statistics ---


5 packets transmitted, 0 packets received, 100.00% packet loss
N7K-2-50-N7K2#

Symptom 1

OSPF routes are missing, neighbor relationships not established.


The following are some common problems that can be seen when getting Open Shortest Path First (OSPF) neighbors
to become fully adjacent between ACI and external devices. In a successful formation of OSPF adjacency, OSPF
neighbors will attain the FULL neighbor state.
Verification 1

Mismatched OSPF Area Type


At the time of this writing, border leaf switches only support OSPF Not So Stubby Areas (NSSA). This implies that the
ACI border leaf switches will not be in area 0 and will not provide Area Border Router (ABR) functionality. Although
the APIC GUI and object model for OSPF don't provide area-type configuration, users need to set the area type on
the external routers to be an NSSA in order to bring up OSPF adjacency.
In this example, N7K2 has not been configured for NSSA, and the neighbors are missing from the leaf:
rtp_leaf1# show ip ospf neighbors vrf all
OSPF Process ID default VRF Prod:Prod
Total number of neighbors: 1
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 FULL/BDR 05:45:58 10.0.0.1 Eth1/41.14
OSPF Process ID default VRF Test:Test
Total number of neighbors: 1
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 FULL/DR 00:18:30 10.0.1.1 Eth1/41.24

On the ACI leafs, checking the properties of the area reveals not only the area type but also other settings, such as
the reference bandwidth, which should be verified so that the overall OSPF design is in line with best practices.
rtp_leaf1# show ip ospf vrf Prod:Prod
Routing Process default with ID 10.0.0.101 VRF Prod:Prod
Stateful High Availability enabled
Supports only single TOS(TOS0) routes
Supports opaque LSA
Redistributing External Routes from
static
Administrative distance 110
Reference Bandwidth is 40000 Mbps
SPF throttling delay time of 200.000 msecs,
SPF throttling hold time of 1000.000 msecs,
SPF throttling maximum wait time of 5000.000 msecs
LSA throttling start time of 0.000 msecs,
LSA throttling hold interval of 5000.000 msecs,
LSA throttling maximum wait time of 5000.000 msecs
Minimum LSA arrival 1000.000 msec
LSA group pacing timer 10 secs
Maximum paths to destination 8
Number of external LSAs 0, checksum sum 0x0
Number of opaque AS LSAs 0, checksum sum 0x0
Number of areas is 1, 0 normal, 0 stub, 1 nssa
Number of active areas is 1, 0 normal, 0 stub, 1 nssa
Area (0.0.0.100)
Area has existed for 19:46:14
Interfaces in this area: 3 Active interfaces: 3
Passive interfaces: 1 Loopback interfaces: 1
This area is a NSSA area
Perform type-7/type-5 LSA translation
Summarization is disabled
No authentication available
SPF calculation has run 40 times
Last SPF ran for 0.000529s
Area ranges are
Number of LSAs: 10, checksum sum 0x0
Resolution

Once the following configuration is done on the N7K2,


router ospf 100
area 0.0.0.100 nssa no-summary default-information-originate
area 0.0.0.110 nssa no-summary default-information-originate

The neighbors are back up and operational:


rtp_leaf1# show ip ospf neighbors vrf all
OSPF Process ID default VRF Prod:Prod
Total number of neighbors: 2
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 FULL/BDR 05:40:42 10.0.0.1 Eth1/41.14
4.4.4.2 1 FULL/BDR 00:14:05 10.0.0.9 Eth1/43.15
OSPF Process ID default VRF Test:Test
Total number of neighbors: 2
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 FULL/DR 00:13:14 10.0.1.1 Eth1/41.24
4.4.4.2 1 FULL/DR 00:12:47 10.0.1.9 Eth1/43.25

Verification 2

Mismatched MTU
At FCS, ACI uses a default MTU of 9000 bytes. Since the default on the N7K and other devices may very well deviate
from this, a mismatch is a common reason to see neighbors stuck in the EXSTART/EXCHANGE state.
In this example, the N7Ks have not been configured for MTU 9000, and the neighbors are stuck in EXSTART/EXCHANGE
instead of FULL.
The stuck neighbor state can be seen both in the GUI and in the CLI:
rtp_leaf1# show ip ospf nei vrf all

OSPF Process ID default VRF Prod:Prod


Total number of neighbors: 2
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 EXSTART/BDR 00:00:10 10.0.0.1 Eth1/41.14
4.4.4.2 1 EXSTART/BDR 00:07:50 10.0.0.9 Eth1/43.15
OSPF Process ID default VRF Test:Test
Total number of neighbors: 2
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 EXSTART/BDR 00:00:09 10.0.1.1 Eth1/41.24
4.4.4.2 1 EXSTART/BDR 00:07:50 10.0.1.9 Eth1/43.25

Resolution 1

There are two possible ways to resolve this issue. One is to set the ACI leaf nodes to use a smaller MTU, for example
setting the L3 Out interface MTU to 1500 bytes by changing the MTU setting from 'inherit' to '1500', as sketched below.
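In the object model the routed interface MTU lives on the L3 Out logical interface profile's path attachment (class
l3extRsPathL3OutAtt). The following is a minimal sketch only; the names, node ID and addressing are assumptions loosely
based on the outputs above.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Routed sub-interface toward the N7K with the MTU lowered to 1500 bytes
payload = """
<fvTenant name="Prod">
  <l3extOut name="L3Out">
    <l3extLNodeP name="border-leafs">
      <l3extLIfP name="interfaces">
        <l3extRsPathL3OutAtt tDn="topology/pod-1/paths-101/pathep-[eth1/41]"
                             ifInstT="sub-interface" encap="vlan-801"
                             addr="10.0.0.2/30" mtu="1500"/>
      </l3extLIfP>
    </l3extLNodeP>
  </l3extOut>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)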

Resolution 2

Another possible way to resolve this is to set N7K Interface MTU to 9000 bytes as shown below:
!
interface Ethernet8/1
mtu 9000
ip router ospf 100 area 0.0.0.100
no shutdown
!
interface Ethernet8/1.801
mtu 9000
encapsulation dot1q 801
ip address 10.0.0.1/30
ip router ospf 100 area 0.0.0.100
no shutdown
!

With MTU set, the OSPF neighbors should be up and operational.


rtp_leaf1# show ip ospf neighbors vrf all
OSPF Process ID default VRF Prod:Prod
Total number of neighbors: 2
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 FULL/BDR 05:40:42 10.0.0.1 Eth1/41.14
4.4.4.2 1 FULL/BDR 00:14:05 10.0.0.9 Eth1/43.15
OSPF Process ID default VRF Test:Test
Total number of neighbors: 2
Neighbor ID Pri State Up Time Address Interface
4.4.4.1 1 FULL/DR 00:13:14 10.0.1.1 Eth1/41.24
4.4.4.2 1 FULL/DR 00:12:47 10.0.1.9 Eth1/43.25

Symptom 2

OSPF route learning problems, Neighbor adjacency formed


In our reference topology, both N7Ks are advertising default routes to ACI border leafs. There are situations where
either the leafs or the external device (N7Ks) form neighbor relationships fine, but don’t learn routes from each other.
rtp_leaf1# show ip route 0.0.0.0 vrf all
IP Route Table for VRF "Prod:Prod"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0


*via 10.0.0.1, eth1/41.14, [110/5], 01:40:59, ospf-default, inter
*via 10.0.0.9, eth1/43.15, [110/5], 01:40:48, ospf-default, inter

IP Route Table for VRF "Test:Test"


'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0


*via 10.0.1.1, eth1/41.24, [110/5], 01:41:02, ospf-default, inter
*via 10.0.1.9, eth1/43.25, [110/5], 01:40:44, ospf-default, inter

rtp_leaf1#
Verification

The external OSPF peers are not learning routes from ACI. For this example, ACI is expected to advertise the DB subnet
(10.1.3.0) to the N7K. This subnet exists on Leaf2, while Leaf1 and Leaf3 are the border leafs. As seen below, the N7K
is not receiving the route:
N7K-2-50-N7K2# show ip route 10.1.3.0
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

Route not found


N7K-2-50-N7K2#

ACI manages routing advertisements based on route availability, reachability and, more importantly, policy. The
following steps are key to understanding route exchange between ACI and external peers.

Resolution

There are three steps involved in resolving this problem.
The first thing to look at is the bridge domain: the bridge domain subnet needs to be marked as Public. This lets the
ACI leaf know to advertise the route to external peers. Even with this setting, the routes from Leaf2 are not learned
by Leaf1 and Leaf3, because only one of the three main conditions for external route advertisement has been met.
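A minimal sketch of marking the subnet as public over the REST API is shown here; the BD name is an assumption, and the
gateway address is taken from the ping test later in this section.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Mark the BD subnet as public so it can be advertised out of the fabric
payload = """
<fvTenant name="Prod">
  <fvBD name="Database">
    <fvSubnet ip="10.1.3.1/24" scope="public"/>
  </fvBD>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)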
rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod
IP Route Table for VRF "Prod:Prod"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0


*via 10.0.0.9, eth1/43.15, [110/5], 00:46:55, ospf-default, inter
*via 10.0.0.1, eth1/41.14, [110/5], 00:46:37, ospf-default, inter
rtp_leaf1#

The bridge domain also needs to be associated with the L3 Out; a configuration sketch follows.
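A minimal sketch of that association (class fvRsBDToOut; the BD and L3 Out names are assumptions):

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Associate the bridge domain with the L3 Out used for external advertisement
payload = """
<fvTenant name="Prod">
  <fvBD name="Database">
    <fvRsBDToOut tnL3extOutName="L3Out"/>
  </fvBD>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)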


Even with this setting, the routes are not learned by Leaf1 and Leaf3 as there are no contracts in place specifying the
communication.
rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"


'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0


*via 10.0.0.9, eth1/43.15, [110/5], 00:58:18, ospf-default, inter
*via 10.0.0.1, eth1/41.14, [110/5], 00:58:00, ospf-default, inter
rtp_leaf1#

However, if the routes are local to Leaf1 and Leaf3, they are advertised because of the L3 Out association. Just for
troubleshooting, this can be forced by creating an EPG association, either by path or by a local binding, on Leaf1 or Leaf3.
Now the N7Ks see the routes from Leaf1 but not from Leaf3, as the EPG is associated only with Leaf1 and Leaf2.
rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod
IP Route Table for VRF "Prod:Prod"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.1.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive


*via 172.16.104.65%overlay-1, [1/0], 00:00:15, static
rtp_leaf1#

N7K-2-50-N7K2# show ip route 10.1.3.0


IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.1.3.0/24, ubest/mbest: 1/0


*via 10.0.0.10, Eth8/1.800, [110/20], 00:01:46, ospf-100, nssa type-2
N7K-2-50-N7K2# ping 10.1.3.1
PING 10.1.3.1 (10.1.3.1): 56 data bytes
64 bytes from 10.1.3.1: icmp_seq=0 ttl=57 time=1.24 ms
64 bytes from 10.1.3.1: icmp_seq=1 ttl=57 time=0.8 ms
64 bytes from 10.1.3.1: icmp_seq=2 ttl=57 time=0.812 ms
64 bytes from 10.1.3.1: icmp_seq=3 ttl=57 time=0.809 ms
64 bytes from 10.1.3.1: icmp_seq=4 ttl=57 time=0.538 ms

--- 10.1.3.1 ping statistics ---


5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.538/0.839/1.24 ms
N7K-2-50-N7K2#

Now, without a contract, why is the ping successful? This is because the pervasive gateway address is not an
endpoint within that BD/EPG. Contracts are needed to ping an endpoint when the context is in 'enforced' mode.
N7K-2-50-N7K2# ping 10.1.3.31
PING 10.1.3.31 (10.1.3.31): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
Request 3 timed out
Request 4 timed out

--- 10.1.3.31 ping statistics ---


5 packets transmitted, 0 packets received, 100.00% packet loss

When the EPG binding on Leaf1 is removed, the route stops being advertised to the Nexus 7Ks.
The third part of the resolution is that, in addition to the subnet being marked Public and the bridge domain being
associated with the L3 Out, a contract needs to be defined between the Database EPG and the L3Out.
The contract needs to be defined and associated both on the L3Out networks and on the Database EPG. Once the contract
is associated, the routes are learned on the N7K:
N7K-2-50-N7K2# show ip route 10.1.3.0
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.1.3.0/24, ubest/mbest: 2/0


*via 10.0.0.10, Eth8/1.800, [110/20], 00:08:06, ospf-100, nssa type-2
*via 10.0.0.14, Eth8/3.800, [110/20], 00:08:06, ospf-100, nssa type-2
N7K-2-50-N7K2#

Now, with the L3Out defined with its associated external networks, OSPF neighbors peering, routes being advertised, and
an appropriate contract permitting the traffic, the ping is successful.
N7K-2-50-N7K2# ping 10.1.3.31
PING 10.1.3.31 (10.1.3.31): 56 data bytes
64 bytes from 10.1.3.31: icmp_seq=0 ttl=126 time=1.961 ms
64 bytes from 10.1.3.31: icmp_seq=1 ttl=126 time=0.533 ms
64 bytes from 10.1.3.31: icmp_seq=2 ttl=126 time=0.577 ms
64 bytes from 10.1.3.31: icmp_seq=3 ttl=126 time=0.531 ms
64 bytes from 10.1.3.31: icmp_seq=4 ttl=126 time=0.576 ms

--- 10.1.3.31 ping statistics ---


5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.531/0.835/1.961 ms
N7K-2-50-N7K2#

19.5 Problem Description

Inter-tenant Communications
This problem is a scenario where there is an endpoint in one tenant’s context that cannot connect to an endpoint in
another tenant’s context. For this scenario, the Database servers in Tenant “Test” must communicate with the “Prod”
Tenant’s Database tier.
The Test-Database servers are in subnet 10.2.3.0/24, while the Prod-Database Servers are in 10.1.3.0/24.
Symptom

Communications between tenants do not work.

Verification

In this case, routes are not being learned between the tenant contexts. Since the tenants have their own respective
contexts/VRFs, routes are not leaked between the contexts by default. Below is a snippet of the state, showing
Prod:Prod not learning 10.2.3.0 from tenant Test:Test:
rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod
IP Route Table for VRF "Prod:Prod"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
10.1.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive
*via 172.16.104.65%overlay-1, [1/0], 00:57:55, static
rtp_leaf1# show ip route 10.2.3.0 vrf Prod:Prod
IP Route Table for VRF "Prod:Prod"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
0.0.0.0/0, ubest/mbest: 2/0
*via 10.0.0.9, eth1/43.17, [110/5], 13:18:12, ospf-default, inter
*via 10.0.0.1, eth1/41.16, [110/5], 13:18:09, ospf-default, inter
rtp_leaf1#show ip route 10.2.3.0 vrf Test:Test

IP Route Table for VRF "Test:Test"


'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.2.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive


*via 172.16.104.65%overlay-1, [1/0], 00:58:41, static
rtp_leaf1#

Resolution

The subnet address to be leaked between contexts (tenants), in addition to being defined under the bridge domain,
needs to be marked as a shared subnet under the EPG, as sketched below. This is the first step in the resolution of
this issue.
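A minimal sketch of a shared subnet defined under the EPG (the EPG DN comes from the table later in this section; the
gateway address is an assumption):

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Subnet under the EPG marked as shared so it can be leaked to other contexts
payload = """
<fvTenant name="Test">
  <fvAp name="CommerceWorkspaceTest">
    <fvAEPg name="Database">
      <fvSubnet ip="10.2.3.1/24" scope="shared"/>
    </fvAEPg>
  </fvAp>
</fvTenant>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)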
With the subnet defined, the route is now visible under Prod:Prod.
rtp_leaf1# show ip route 10.2.3.0 vrf Prod:Prod
IP Route Table for VRF "Prod:Prod"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.2.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive


*via 172.16.104.65%overlay-1, [1/0], 00:00:09, static
rtp_leaf1#

However, while the routes are learned, the Prod and Test DB endpoints are still unable to communicate. Contracts and
policies allowing the communication need to be defined. To verify the contract programming, the pcTag of each EPG
needs to be known; it can be found using Visore.
DN PcTag
uni/tn-Test/ap-CommerceWorkspaceTest/epg-Database 5474
uni/tn-Prod/ap-commerceworkspace/epg-Database 32774

Verify using the GUI (Fabric -> Inventory -> Pod -> Pools -> Rules) or the CLI:
rtp_leaf1# show zoning-rule | grep 32774
rtp_leaf1# show zoning-rule | grep 5474
rtp_leaf1#

The second step in the resolution is to create a special contract for inter-tenant communication with a scope of
'Global' rather than 'Context'. The contract also needs to be 'Exported' from one tenant to the other tenant, so that
the other EPG can consume it as a 'Consumed Contract Interface'; a sketch follows.
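A minimal sketch of the contract portion is shown below. The contract and contract-interface names are assumptions;
the export step itself (which creates the consumed contract interface in the other tenant) can be performed from the
GUI, and the consuming EPG then references it with fvRsConsIf.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Global-scope contract provided by the Prod Database EPG and consumed by the
# Test Database EPG through an exported contract interface
payload = """
<polUni>
  <fvTenant name="Prod">
    <vzBrCP name="ProdDB" scope="global">
      <vzSubj name="any">
        <vzRsSubjFiltAtt tnVzFilterName="default"/>
      </vzSubj>
    </vzBrCP>
    <fvAp name="commerceworkspace">
      <fvAEPg name="Database">
        <fvRsProv tnVzBrCPName="ProdDB"/>
      </fvAEPg>
    </fvAp>
  </fvTenant>
  <fvTenant name="Test">
    <fvAp name="CommerceWorkspaceTest">
      <fvAEPg name="Database">
        <fvRsConsIf tnVzCPIfName="ProdDB"/>
      </fvAEPg>
    </fvAp>
  </fvTenant>
</polUni>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)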
Once the appropriate contract configuration is done, the contract rules show up on the leafs so the data plane can
allow the inter-tenant communication.
rtp_leaf1# show zoning-rule | grep 5474
4146 32774 5474 default enabled 2523136 permit
4147 5474 32774 default enabled 2523136 permit
rtp_leaf1# show zoning-rule | grep 32774
4146 32774 5474 default enabled 2523136 permit
4147 5474 32774 default enabled 2523136 permit
rtp_leaf1#

20 Virtual Machine Manager and UCS

• Overview
• Problem Description
– Symptom 1
– Verification
– Symptom 2
– Verification
• Problem Description
– Symptom
– Verification
– Symptom 2
– Verification
• Problem Description
– Symptom
– Verification
• Problem Description
– Symptom
– Verification

20.1 Overview

Virtual Machine Manager integration allows the fabric to extend network policy and policy group definitions into
virtual switches residing on a hypervisor. This integration automates critical network plumbing steps that typically
delay the deployment of virtual and compute resources, by automatically configuring the required fabric-side and
hypervisor virtual switch encapsulation.
The general hierarchy of VMM configuration runs from the VMM provider (for example VMware) to the VMM domain, and
then to the vCenter controller(s) and credentials defined under that domain.
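A VMM domain and its vCenter controller can also be defined over the REST API. The following is a minimal sketch; the
credentials and the rootContName (which must match the datacenter name in vCenter, as discussed in a later symptom)
are assumptions, while the domain and controller names follow the faults shown later in this chapter.

import requests

APIC = "https://apic"  # assumed APIC address; credentials are placeholders
s = requests.Session(); s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# VMware VMM domain with vCenter credentials and a controller definition
payload = """
<vmmProvP vendor="VMware">
  <vmmDomP name="RTPACILab">
    <vmmUsrAccP name="vcenter-creds" usr="administrator" pwd="password"/>
    <vmmCtrlrP name="TestVcenter" hostOrIp="10.122.254.152"
               rootContName="RTP-Datacenter">
      <vmmRsAcc tDn="uni/vmmp-VMware/dom-RTPACILab/usracc-vcenter-creds"/>
    </vmmCtrlrP>
  </vmmDomP>
</vmmProvP>
"""
r = s.post(APIC + "/api/mo/uni.xml", data=payload)
print(r.status_code, r.text)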
20.2 Problem Description

When attached to a Cisco UCS Fabric Interconnect, two hosts are unable to reach one another or unable to resolve one
another’s MAC addresses.

Symptom 1

Packet captures running on the sending host show that the traffic is in fact being transmitted; however, it never
reaches the destination host. Creating static ARP entries on both devices for one another shows that traffic allowed
by policy can be exchanged (ICMP, etc.).
The UCS Fabric Interconnect is configured with two uplinks that are configured as disjoint Layer 2 domains.

Verification

The problem in this situation is that the UCS FI has two uplinks with disjoint Layer 2 domains, and the VLAN being
used for the ARP is using the non-ACI uplink as its designated forwarder. This means that the fabric will not receive
the ARP frames and as a result cannot unicast them to the host. The easiest way to determine whether this problem is
impacting the environment is to use the "show platform software enm internal info vlandb id <vlan>" command on the
UCS FI in NX-OS mode, substituting the VLAN ID attached to the EPG, as shown below:
FI-A(nxos)# show platform software enm internal info vlandb id 248

vlan_id 248
-------------
Designated receiver: Po103
Membership:
Po103
FI-A(nxos)#

If the designated receiver is not the port-channel facing the ACI fabric, the uplink pinning settings in LAN manager
will need to be adjusted. Use the LAN Uplink Manager in UCS to set the VLANs dedicated to ACI use to be pinned
to the ACI fabric facing uplink.
For more information, please reference the Network Configuration section on Configuring LAN Pin Groups in the
Cisco UCS Manager GUI Configuration Guide.

Symptom 2

• ARP requests are egressing ESX host on UCS blade, but not making it to the destination
• VM hosted on C-series appliance directly attached to fabric is able to reach the BD anycast gateway
• VM hosted on B-series chassis blade attached to fabric is unable to reach the BD anycast gateway

Verification

In this situation there are two VMs, one on a UCS blade chassis and another on a C200, and they are unable to ping
one another while in the same EPG. The C200 VM is able to ping the BD anycast gateway; however, the UCS blade-hosted
VM is not able to ping the gateway.
The vSphere 5.5 pktcap-uw tool can be used to determine whether outbound ARP requests are in fact leaving the VM and
hitting the virtual switch.
~ # pktcap-uw --uplink vmnic3
The name of the uplink is vmnic3
No server port specifed, select 38100 as the port
Output the packet info to console.
Local CID 2
Listen on port 38100
Accept...Vsock connection from port 1027 cid 2
01:04:51.765509[1] Captured at EtherswitchDispath point, TSO not enabled, Checksum not offloaded and
Segment[0] ---- 60 bytes:
0x0000: ffff ffff ffff 0050 56bb cccf 0806 0001
0x0010: 0800 0604 0001 0050 56bb cccf 0a01 000b
0x0020: 0000 0000 0000 0a01 0001 0000 0000 0000
0x0030: 0000 0000 0000 0000 0000 0000

By monitoring the packet count on the Veth### interface on the UCS in NX-OS mode, it is possible to confirm the
packets were being received.
tsi-aci-ucsb-A(nxos)# show int Veth730
Vethernet730 is up
Bound Interface is Ethernet1/1/3
Port description is server 1/3, VNIC eth3
Hardware is Virtual, address is 000d.ecb1.a000
Port mode is trunk
Speed is auto-speed
Duplex mode is auto
300 seconds input rate 0 bits/sec, 0 packets/sec
300 seconds output rate 0 bits/sec, 0 packets/sec
Rx
36 unicast packets 3694 multicast packets 3487 broadcast packets
7217 input packets 667170 bytes
0 input packet drops
Tx
433 unicast packets 12625 multicast packets 44749 broadcast packets
57807 output packets 4453489 bytes
0 flood packets
0 output packet drops

So the problem is somewhere between the UCS fabric uplinks and the leaf interfaces. Checking the counters on the leaf
indicates that the expected broadcast packets are not ingressing.
Ethernet1/27 is up
admin state is up, Dedicated Interface
Belongs to po2
Hardware: 100/1000/10000/auto Ethernet, address: 7c69.f610.6d33 (bia 7c69.f610.6d33)
MTU 9000 bytes, BW 10000000 Kbit, DLY 1 usec
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, medium is broadcast
Port mode is trunk
full-duplex, 10 Gb/s, media type is 10G
Beacon is turned off
Auto-Negotiation is turned on
Input flow-control is off, output flow-control is off
Auto-mdix is turned off
Rate mode is dedicated
Switchport monitor is off
EtherType is 0x8100
EEE (efficient-ethernet) : n/a
Last link flapped 09:09:44
Last clearing of "show interface" counters never
4 interface resets
30 seconds input rate 75 bits/sec, 0 packets/sec
30 seconds output rate 712 bits/sec, 0 packets/sec
Load-Interval #2: 5 minute (300 seconds)
input rate 808 bps, 1 pps; output rate 616 bps, 0 pps
RX
193 unicast packets 5567 multicast packets 17365 broadcast packets
23125 input packets 2185064 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
129 unicast packets 5625 multicast packets 17900 broadcast packets
23654 output packets 1952861 bytes
0 jumbo packets
0 output error 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause

This indicates that traffic is egressing the ESX host but not making it through to the leaf. One possible cause is
that the frames are being tagged upon leaving the ESX host but are then being stripped and placed on the native VLAN.
The UCS configuration, specifically the VLAN manager, can be checked to verify whether VLAN 602 is incorrectly set as
the native VLAN.
If it is, frames egressing the UCS FI toward the fabric are untagged and therefore are not categorized into the
appropriate EPG. By unmarking the VLAN as native, the frames are properly tagged, are then categorized as members of
the EPG, and ICMP immediately begins to function.

20.3 Problem Description

Virtual Machine Manager function is unable to register vCenter with APIC

Symptom

When attempting to register a vCenter with APIC, one or more of the following faults is raised:
F606262 [FSM:FAILED]: VMM Add-Controller FSM: comp/prov-VMware/ctrlr-[RTPACILab]-TestVcenter Failed t
F606351 [FSM:FAILED]: Task for updating comp:PolCont(TASK:ifc:vmmmgr:CompPolContUpdateCtrlrPol)
F16438 [FSM:STAGE:FAILED]: Establish connection Stage: comp/prov-VMware/ctrlr-[RTPACILab]-TestVcenter
Verification

These faults typically indicate that there is an issue reaching the vCenter from the APIC. Typical causes include:
• The VMM is configured to use the out-of-band management (OOBM) network to access the vCenter, but the APIC is
on a separate subnet and has no route to reach that vCenter
• The IP address entered for the vCenter is incorrect
Log into the APIC and attempt a simple ping test to the remote vCenter:
admin@RTP_Apic1:~> ping 10.122.253.152
PING 10.122.253.152 (10.122.253.152) 56(84) bytes of data.
From 64.102.253.234 icmp_seq=1 Destination Host Unreachable
From 64.102.253.234 icmp_seq=2 Destination Host Unreachable
From 64.102.253.234 icmp_seq=3 Destination Host Unreachable
From 64.102.253.234 icmp_seq=4 Destination Host Unreachable
^C

In this case vCenter is not reachable from the APIC. By default the APIC will use the OOB interface for reaching
remotely managed devices, so this would indicate that there is either a misconfiguration on the APIC or that the
vCenter is unreachable by that address.
The first step is to verify whether a proper default route is configured. This can be checked by navigating to the Tenants
section, entering the mgmt tenant, and then inspecting the Node Management Addresses. If out-of-band node
management addresses have been configured, verify that the proper default gateway has been entered in that
location.

The default gateway is configured as 10.122.254.254/24


admin@RTP_Apic1:~> ping 10.122.254.254
PING 10.122.254.254 (10.122.254.254) 56(84) bytes of data.
From 10.122.254.211 icmp_seq=1 Destination Host Unreachable
From 10.122.254.211 icmp_seq=2 Destination Host Unreachable
From 10.122.254.211 icmp_seq=3 Destination Host Unreachable
From 10.122.254.211 icmp_seq=4 Destination Host Unreachable
^C

The Unreachable state indicates that the gateway is improperly configured, and this misconfiguration can be corrected
by setting it to the appropriate 10.122.254.1.
After modifying the configured Out-of-Band gateway address:
admin@RTP_Apic1:~> ping 10.122.254.152
PING 10.122.254.152 (10.122.254.152) 56(84) bytes of data.
64 bytes from 10.122.254.152: icmp_seq=1 ttl=64 time=0.245 ms
64 bytes from 10.122.254.152: icmp_seq=2 ttl=64 time=0.258 ms
64 bytes from 10.122.254.152: icmp_seq=3 ttl=64 time=0.362 ms
64 bytes from 10.122.254.152: icmp_seq=4 ttl=64 time=0.344 ms
^C

The complete management configuration is as follows:


<fvTenant name="mgmt">
<fvBD name="inb"/>
<aaaDomainRef name="mgmt"/>
<mgmtMgmtP name="default">
<mgmtInB name="default"/>
<mgmtOoB name="default">
<mgmtRsOoBProv tnVzOOBBrCPName="oob_contract"/>
</mgmtOoB>
</mgmtMgmtP>
<fvCtx name="inb"/>
<fvCtx name="oob">
<dnsLbl name="default"/>
</fvCtx>
<vzOOBBrCP name="oob_contract">
<vzSubj name="oob_subject">
<vzRsSubjFiltAtt tnVzFilterName="default"/>
<vzRsSubjFiltAtt tnVzFilterName="ssh"/>
</vzSubj>
</vzOOBBrCP>
<vzFilter name="ssh">
<vzEntry name="ssh"/>
</vzFilter>
<fvnsAddrInst name="rtp_leaf3ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.243" to="10.122.254.243"/>
</fvnsAddrInst>
<fvnsAddrInst name="RTP_Apic3ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.213" to="10.122.254.213"/>
</fvnsAddrInst>
<fvnsAddrInst name="RTP_Apic1ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.211" to="10.122.254.211"/>
</fvnsAddrInst>
<fvnsAddrInst name="RTP_Apic2ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.212" to="10.122.254.212"/>
</fvnsAddrInst>
<fvnsAddrInst name="rtp_spine1ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.244" to="10.122.254.244"/>
</fvnsAddrInst>
<fvnsAddrInst name="rtp_leaf1ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.241" to="10.122.254.241"/>
</fvnsAddrInst>
<fvnsAddrInst name="rtp_leaf2ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.242" to="10.122.254.242"/>
</fvnsAddrInst>
<fvnsAddrInst name="rtp_spine2ooboobaddr">
<fvnsUcastAddrBlk from="10.122.254.245" to="10.122.254.245"/>
</fvnsAddrInst>
<mgmtExtMgmtEntity name="default">
<mgmtInstP name="oob_emei">
<mgmtRsOoBCons tnVzOOBBrCPName="oob_contract"/>
<mgmtSubnet ip="0.0.0.0/0"/>
</mgmtInstP>
</mgmtExtMgmtEntity>
</fvTenant>
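
The same configuration can also be pulled back from the APIC REST API to confirm what is actually in place. The following is a minimal sketch using Python and the requests library; the APIC address is the RTP_Apic1 out-of-band address from this example, while the credentials are placeholders:

import requests

APIC = "https://10.122.254.211"   # RTP_Apic1 OOB address; adjust as needed
USER, PWD = "admin", "password"   # placeholder credentials

s = requests.Session()
s.verify = False                  # lab APIC with a self-signed certificate

# Log in; the returned APIC-cookie token is stored on the session automatically
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# Pull the entire mgmt tenant subtree and print it for inspection
resp = s.get(APIC + "/api/node/mo/uni/tn-mgmt.xml",
             params={"query-target": "subtree"})
print(resp.text)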

Now it is possible to verify that the vCenter VMM is reachable:

Symptom 2

The following faults are raised in the VMM manager:


F16438 [FSM:STAGE:FAILED]: Establish connection Stage: comp/prov-VMware/ctrlr-[RTPACILab]-172.31.222.
F606262 [FSM:FAILED]: VMM Add-Controller FSM: comp/prov-VMware/ctrlr-[RTPACILab]-172.31.222.24 Failed

Verification

Ensure that the datacenter name in vCenter matches the “Datacenter” property configured in the VMM Controller
policy configuration.
In the above screenshot, the Datacenter name is purposely misconfigured as BldgE instead of BldgF.
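
The Datacenter value configured on the APIC can also be checked programmatically by listing the VMM controller policies; the sketch below assumes the standard vmmCtrlrP class with its hostOrIp and rootContName (Datacenter) properties, and uses placeholder credentials:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# List every configured VMM controller with its address and Datacenter name
for mo in s.get(APIC + "/api/node/class/vmmCtrlrP.json").json()["imdata"]:
    a = mo["vmmCtrlrP"]["attributes"]
    print(a["dn"], a["hostOrIp"], a["rootContName"])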

20.4 Problem Description

Virtual Machine Manager (VMM) unassociation fails to delete Distributed Virtual Switch (DVS) in vCenter

Symptom

After removing a Virtual Machine Manager (VMM) configuration or removing a Virtual Machine Manager (VMM)
domain from an End Point Group (EPG), the associated virtual port groups or DVS are not removed from the vCenter
configuration.

Verification

Check that the port groups are not currently in use by a virtual machine network adapter.
This can be verified from the vCenter GUI by accessing the settings for a virtual machine and individually inspecting
the network backing for its vNIC adapters.
Another way to verify this is by inspecting the DVS settings and viewing the Virtual Machines that are associated
with the DVS.
The list of virtual machines that are currently using a distributed virtual port group can also be found in the APIC
GUI by navigating to the VM Networking section, then into the Provider, the Domain, and the DVS, expanding the
port groups, and looking at each individual port group.

To resolve this particular issue, the network backing on the Virtual Machine vNICs must be removed. This can be
accomplished either by removing the Virtual Adapter entirely, or by changing the Virtual Adapter network backing to
one that is not present on the DVS, such as a local standard virtual switch or some other distributed virtual port group.

20.5 Problem Description

Virtual Machine Manager hosted VMs are unable to reach the fabric, get learned by the fabric or reach their default
gateway through a UCS Fabric Interconnect.

Symptom

Checking the endpoint table on the fabric does not show any new endpoints being learned, even though the Distributed
Virtual Port groups are being created on the vSwitch and attached to the VMs.
The VMs are unable to ping their gateway or other VMs.

Verification

For these symptoms, the first step is to check whether the endpoint table on the leaf to which the UCS is attached is
learning any endpoints in the EPG. The MAC address for the VM in question is 00:50:56:BB:D5:08, and it is unable
to reach its default gateway.

Upon inspecting the “show endpoint detail” output on the leaf, the MAC for the VM is missing from the output.
rtp_leaf1# show endpoint detail
Legend:

O - peer-attached H - vtep a - locally-aged S - static


V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------

Additionally, in the output of “show vlan” grepped for the Test EPG, the interface that is expected to be configured
with the EPG is not visible in the list of interfaces on which the policy is programmed.
rtp_leaf1# show vlan | grep Test
39 Test:CommerceWorkspaceTest:Web active Eth1/42, Eth1/44, Po1
Inspecting the configuration of the Attachable Entity Profile (AEP) for the interface group used on the UCS shows that
no vSwitch override policy is configured for LLDP, CDP or LACP. Without these policies, the defaults are inherited
from the AEP itself, meaning the vSwitch will run LLDP and use whatever link aggregation protocol is used on the
upstream links. The VDS will inherit these properties and, as a result, run incorrectly.

By right clicking on the Attachable Entity Profile and clicking the “Config vSwitch Policies” it is possible to associate
override policies for the vSwitch. When using a UCS between the leaf and ESX hosts, these should be configured to
disable LLDP, enable CDP and use Mac Pinning as the LACP policy, as shown below:
With the override in place, inspecting the endpoint table on the switch shows that the MAC address for the VM
has been learned, and the VLAN table shows that the interfaces on which the endpoint can be learned are correctly
placed in the CommerceWorkspaceTest:Web EPG.
rtp_leaf1# show vlan | grep Test
14 Test:CommerceWorkspaceTest:Web active Eth1/27, Eth1/28, Po2, Po3

rtp_leaf1# show endpoint detail


Legend:
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
+---------------+---------------+-----------------+--------------+-------------+---------------------
VLAN/ Encap MAC Address MAC Info/ Interface Endpoint Group
Domain VLAN IP Address IP Info Info
+---------------+---------------+-----------------+--------------+-------------+---------------------
14 vlan-639 0050.56bb.d508 LV po2 Test:CommerceWorkspa

Further verification from the host itself shows that ping to the gateway is successful.
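
Endpoint learning can also be confirmed fabric-wide by querying the fvCEp class on the APIC for the VM MAC address; an empty result means the endpoint is still not being learned. A minimal sketch, with placeholder credentials:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
MAC = "00:50:56:BB:D5:08"         # the VM MAC from this example

s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# Look the endpoint up by MAC across the whole fabric
r = s.get(APIC + "/api/node/class/fvCEp.json",
          params={"query-target-filter": 'eq(fvCEp.mac,"%s")' % MAC})
for mo in r.json()["imdata"]:
    print(mo["fvCEp"]["attributes"]["dn"])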

21 L4-L7 Service Insertion

21.1 Overview

This chapter covers common problems encountered during L4-L7 service insertion with the ACI fabric. An overview
of what should happen and the verification steps used to confirm a working L4-L7 service insertion are covered first.
The displays taken on a working fabric can then be used as an aid in troubleshooting issues when service graph and
device cluster deployment fails.
The Cisco ACI and the APIC controller are designed with the ability to provide automated service insertion while
acting as a central point of policy control within the ACI fabric. ACI policies manage both the network fabric and
services appliances such as firewalls, load balancers, etc. The policy controller has the ability to configure the network
automatically to allow traffic to flow through the service devices. In addition, the policy controller can also automatically
configure the service devices according to the application service requirements. This approach allows organizations to
automate infrastructure configuration coordinated with service insertion and eliminate the challenge of managing all
the complex traffic-steering techniques that are used by traditional service insertion configuration methods.
When a service graph is defined through the APIC GUI, the concept of “functions” is used to specify how traffic
should flow between the consumer EPG and the provider EPG. These functions can be exposed as firewall, load
balancer, SSL offload, etc., and APIC will translate these function definitions into selectable elements of a service
graph through a technique called rendering. Rendering involves the allocation of fabric resources, such as bridge
domain, service device IP addresses, etc., to ensure the consumer and provider EPGs will have all necessary resources
and configuration to be functional.

Device Package

The APIC needs to communicate with the service devices to define and configure the user-specific functions according
to the “communications method” the service device understands. This translation happens between the APIC and
service devices by utilizing a plug-in or device package installed by the administrator. The device package also includes
a description of the functions supported by the device package and the mode that the service device is utilizing. In ACI
terminology, a service appliance can operate in two modes:
1. Go-To Mode - aka Routed mode. Examples include an L3 routed firewall or load balancer, or a one-arm load
balancer.
2. Go-Through Mode - Transparent mode. An example would be a transparent L2 (bridged) firewall.
The illustration below shows some examples of device package functions.

Service Graph Definition

When the service graph definitions are being configured, the abstract graph needs to stitch together the consumer and
provider contract. The connectors between the Function Nodes have two connector types:
1. L2 - Layer 2 connector. An example is an ACI fabric that has L2 adjacency between the EPG and the transparent
firewall’s inside interface.
2. L3 - Layer 3 connector with unicast routing. An example is an ACI fabric acting as the default gateway for the
outside interface of the ASA transparent firewall.
Node name - this will be used later on, before the service graph is rendered.
The BD selection for each connector depends on its adjacency and the function type:

Adjacency   Function Type   BD Selection
L2          Go-To           Disable routing on the BD if routing is disabled for the connection.
L3          Go-To           Routing must also be enabled within the BD.
L2          Go-Through      Disable routing on the BD if routing is disabled for the connection.
L3          Go-Through      Routing settings on the “shadow” BD are set as per the routing on the connection.
Once the abstract graph is instantiated, the function of the service devices can be configured via GUI, REST or CLI.
These functions include firewall or load balancer configurations such as IP addresses of the interfaces, access-list, load
balancer monitoring policy, virtual IP, etc.
The illustration below shows the L4-L7 Function mode and empty Service Parameters.
Concrete Device and Logical Device

The service graph also contains the abstract node information. The APIC will translate the definition and functions
from the abstract graph into the concrete devices that are connected onto the ACI fabric. This may raise the question
of why there is a logical device and a concrete device. The way this works is the concrete devices are the standalone
appliance nodes, but the devices are typically deployed as a cluster, or pair, which is represented as a logical clustered
device.
The following parameters are mandatory to create the Concrete Device:
1. Device identity such as IP address and login credential of the concrete device.
2. Logical interface to actual interface mapping, including guest VM virtual network adapter name.
The following parameters are mandatory to create the Logical Device Cluster:
1. Select the device type - physical or virtual.
2. Device identity such as IP address and login credential of the logical device.
3. Logical interface name and function.
The illustration below shows the Logical Device Cluster configuration screen.
Device Cluster Selector Policies

The last step before the service graph can be rendered is to associate the service graph with the appropriate contract
and logical device. For example, the Create Logical Device Context screen is where the association of contract, graph,
node and cluster is built between the “PermitWeb” contract, “Web” graph, “Web-FW” node, “Prod/Web-FW” device
cluster.
The illustration below shows the Logical Device Context configuration.

Rendering the Service Graph

In order to render the service graph, the appropriate contract and subject need to be associated with the correct L4-L7
Service Graph.
If the service graph is able to deploy, the service graph instance and virtual device will be shown as deployed under
“Deployed Service Graphs” and “Deployed Device Clusters”. The illustration shows the working and rendered service
graph.
The illustration below shows where to attach the service graph to the contract.
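
For reference, the same association can be made over the REST API by posting a graph relation to the contract subject. This is a hedged sketch only: the tenant, contract and graph names are taken from the example above, the subject name is hypothetical, and the vzRsSubjGraphAtt class with its tnVnsAbsGraphName property is assumed to follow the same naming pattern as the other subject relations shown earlier in this book:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# Attach the "Web" abstract graph to a subject named "Web" (hypothetical) of the PermitWeb
# contract in tenant Prod; the class and attribute names are assumptions, verify before use
payload = '<vzRsSubjGraphAtt tnVnsAbsGraphName="Web"/>'
r = s.post(APIC + "/api/node/mo/uni/tn-Prod/brc-PermitWeb/subj-Web.xml", data=payload)
print(r.status_code, r.text)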
21.2 Problem Description

The service graph is not rendering and will not deploy after the service graph is attached to a contract.

Symptom 1

When clicking the logical device cluster, the Device State is in “init” state.

Verification

The “init” state indicates there is a communication issue - the APIC controller cannot communicate with the service
device. Faults under the Logical Device context should be seen. The following fault code from an ASA logical device
context shows a communication problem between the APIC and the service device:
F0324 Major script error : Connection error : HTTPSConnectionPool(host='10.122.254.39', port=443): Ma
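
Whether this (or any other) fault is currently active can also be checked by querying the faultInst class and filtering on the fault code; a minimal sketch, with placeholder credentials:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# List every active F0324 fault together with its description
r = s.get(APIC + "/api/node/class/faultInst.json",
          params={"query-target-filter": 'eq(faultInst.code,"F0324")'})
for mo in r.json()["imdata"]:
    a = mo["faultInst"]["attributes"]
    print(a["dn"], "-", a["descr"])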

This fault can be resolved by verifying the connectivity between the APIC and the service device with the following:
1. Ping the service device from the APIC CLI to verify reachability
2. Verify login credentials to the service device with the username and password supplied in the device configura-
tion
3. Verify the device’s virtual IP and port is open
4. Verify username and password is correct in the APIC configuration

Symptom 2

After correcting connectivity issues between the APIC and the service device, a fault F0765 "CDev configuration is
invalid due to cdev-missing-virtual-info" may be raised.

Verification

After verifying the network connectivity between the APIC and the service appliance (in this case the service
appliance is a VM), it is necessary to ensure that the service VM name matches the name shown in the vCenter console,
and that the vCenter name matches the Data Center name.

Symptom 3

A fault defined as F0772 "LIf configuration is invalid due to LIf-invalid-CIf" is seen in the Logical Device context.

Verification

First, it is necessary to define the items indicated, the LIf and the CIf: the LIf is the logical interface and the CIf is a
concrete interface. With this particular fault, the logical interface is the element that is not rendering properly.
This is where the Function Node maps the logical interface to the actual, or concrete, interface to form a relationship.
F0772 means one of the following:
1. The Logical interface is not created
2. The Logical interface is not mapped to the correct concrete interface.

Symptom 4

After fixing the previous fault, F0772, there may be an additional fault, F0765 Cdev configuration is invalid due to
cdev-missing-cif.

Verification

This fault indicates that the CIf, the concrete interface, is missing from the concrete device. This can be checked in
the concrete device configuration under L4-L7 Services->Device Clusters->Logical Device->Device->Policy to verify
that the necessary concrete interfaces have been configured.
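
One way to review which logical and concrete interfaces the APIC currently holds is to dump the vnsLIf and vnsCIf classes and compare them against the intended mapping; a minimal sketch, with placeholder credentials:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# Print every logical (vnsLIf) and concrete (vnsCIf) interface known to the APIC,
# so a missing or mis-mapped interface stands out
for cls in ("vnsLIf", "vnsCIf"):
    for mo in s.get(APIC + "/api/node/class/%s.json" % cls).json()["imdata"]:
        print(cls, mo[cls]["attributes"]["dn"])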

Symptom 5

When deploying the service graph, it is possible to see a fault defined as F0758 Service graph could not be rendered
due to following: id-allocation-failure.
Verification

When service device VMs are deployed in a hypervisor, they are handled like normal virtual machine creation in that
they are placed into their own EPG, which is mapped to the BD where the VM resides. When the service graph is
rendered by the APIC, it allocates VLANs from the VMM pool assigned during logical device cluster creation.
If the dynamic VLAN pool that is associated with the VMM does not have enough VLANs allocated, the rendering
will fail and raise fault F0758.
This error can be corrected by allocating additional VLANs into the dynamic VLAN pool that is used by the VMM.
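
The size of the dynamic VLAN pool can be reviewed over the REST API by listing each VLAN pool together with its encap blocks; a minimal sketch, with placeholder credentials:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# Print every VLAN pool with its allocation mode and encap blocks
r = s.get(APIC + "/api/node/class/fvnsVlanInstP.json",
          params={"rsp-subtree": "children", "rsp-subtree-class": "fvnsEncapBlk"})
for pool in r.json()["imdata"]:
    attrs = pool["fvnsVlanInstP"]["attributes"]
    print(attrs["name"], attrs["allocMode"])
    for child in pool["fvnsVlanInstP"].get("children", []):
        blk = child["fvnsEncapBlk"]["attributes"]
        print("   ", blk["from"], "-", blk["to"])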

Symptom 6

All faults seem to be cleared but the service graph will still not render, and no faults are raised. In addition, verification
of the contract shows it has been associated with the appropriate service graph. The filter is also defined and associated
to the correct contract.

Verification

Check the consumer EPG or External Bridged Network and the provider EPG. The correct EPG or External Bridged
Network needs to be configured as the consumer and as the provider. If the same EPG is configured as both consumer
and provider, the L4-L7 graph will not be rendered.

Symptom 7

The service graph is trying to render, but it fails and raises the fault F0758 Service graph could not be rendered due to
following: missing-mandatory-param.

Verification

This fault is associated with the Function Node configuration. It is caused by one or more missing mandatory
parameters, or one or more missing mandatory device configuration parameters:
• Check the Function Node configuration and verify whether any parameter with Mandatory set to “true” is missing.
• Check the actual service device configuration and identify whether any mandatory parameter is missing. One
example is the ASA firewall, where the “order” parameter of the access control entry is a required field even
though it is not marked as required.

Symptom 8

In the example the Cisco ASAv is being used, and traffic is not passing through the service device. After inspecting
the Deployed Device Cluster, there is a fault, F0324 Major script error: Configuration error:.

Verification

This fault is related to the Function Node configuration, and it indicates that a configuration parameter passed during
rendering was not accepted by the service device. Examples include configuring ASAv transparent mode in the policy
while the firewall is running in routed mode, or configuring the ASAv security level to 200 when the only acceptable
values are from 0 to 100.
22 ACI Fabric Node and Process Crash Troubleshooting

• Overview
– DME Processes:
– CLI:
– Identify When a Process Crashes:
– Collecting the Core Files:
• Problem Description
– Symptom 1
– Verification
* Check the appropriate process log:
* Check what activity occurred at the time of the process crash:
* Collect Techsupport and Core File and Contact the TAC:
– Symptom 2
– Verification
* Break the HAP reset loop:
* Check the appropriate process log:
* Check what activity occurred at the time of the process crash:
* Collect Core File and Contact the Cisco TAC:

22.1 Overview

The ACI switch node has numerous processes which control various functional aspects on the system. If the system
has a software failure in a particular process, a core file will be generated and the process will be reloaded.
If the process is a Data Management Engine (DME) process, the DME process will restart automatically. If the process
is a non-DME process, it will not restart automatically and the switch will reboot to recover.
This section presents an overview of the various processes, how to detect that a process has cored, and what actions
should be taken when this occurs.

DME Processes:

The essential DME processes running on a fabric node can be found through the CLI. Unlike on the APIC, the process
view in the GUI at FABRIC->INVENTORY->Pod 1->(node) will show all processes running on the leaf.

CLI:

With the command ps -ef | grep svc_ifc:


rtp_leaf1# ps -ef |grep svc_ifc
root 3990 3087 1 Oct13 ? 00:43:36 /isan/bin/svc_ifc_policyelem --x
root 4039 3087 1 Oct13 ? 00:42:00 /isan/bin/svc_ifc_eventmgr --x
root 4261 3087 1 Oct13 ? 00:40:05 /isan/bin/svc_ifc_opflexelem --x -v dptcp:8000
root 4271 3087 1 Oct13 ? 00:44:21 /isan/bin/svc_ifc_observerelem --x
root 4277 3087 1 Oct13 ? 00:40:42 /isan/bin/svc_ifc_dbgrelem --x
root 4279 3087 1 Oct13 ? 00:41:02 /isan/bin/svc_ifc_confelem --x
rtp_leaf1#

Each of the processes running on the switch writes activity to a log file on the system. These log files are bundled as
part of the techsupport file but can be found via CLI access in /tmp/logs/ directory. For example, the Policy Element
process log output is written into /tmp/logs/svc_ifc_policyelem.log.
The following is a brief description of the DME processes running on the system. This can help in understanding
which log files to reference when troubleshooting a particular process or understand the impact to the system if a
process crashed:
Process        Function
policyelem     Policy Element: Process logical MO from APIC and push concrete model to the switch
eventmgr       Event Manager: Processes local faults, events, health score
opflexelem     Opflex Element: Opflex server on switch
observerelem   Observer Element: Process local stats sent to APIC
dbgrelem       Debugger Element: Core handler
nginx          Web server handling traffic between the switch and APIC

Identify When a Process Crashes:

When a process crashes and a core file is generated, a fault as well as an event is generated. The fault for the particular
process is shown as a “process-crash” as shown in this syslog output from the APIC:
Oct 16 03:54:35 apic3 %LOG_LOCAL7-3-SYSTEM_MSG [E4208395][process-crash][major][subj-[dbgs/cores/node

When the process on the switch crashes, the core file is compressed and copied to the APIC. The syslog message
notification comes from the APIC.
The fault that is generated when the process crashes is cleared when the process is restarted. The fault can be viewed
via the GUI in the fabric history tab at FABRIC->INVENTORY->Pod 1. In this example, node102 Policy Element
crashed:

Collecting the Core Files:

The APIC GUI provides a central location to collect the core files for the fabric nodes.
An export policy can be created from ADMIN -> IMPORT/EXPORT in Export Policies -> Core. However, there is a
default core policy where files can be downloaded directly. As shown in this example:

The core files can be accessed via SSH/SCP through the APIC at /data/techsupport on the APIC where the core file
is located. Note that the core file will be available at /data/techsupport on one APIC in the cluster; the exact APIC
on which the core file resides can be found from the Export Location path shown in the GUI. For example, if the Export
Location begins with “files/3/”, the file is located on node 3 (APIC3).

22.2 Problem Description

Process on fabric node has crashed and either restarts automatically or leads to the switch restarting.

Symptom 1

Process on switch fabric crashes. Either the process restarts automatically or the switch reloads to recover.

Verification

As indicated in the overview section, if a DME process crashes, it should restart automatically without the switch
restarting. If a non-DME process crashes, the process will not automatically restart and the switch will reboot to
recover.
Depending on which process crashes, the impact of the process core will vary.
When a non-DME process crashes, this will typically lead to a HAP reset as seen on the console:
[ 1130.593388] nvram_klm wrote rr=16 rr_str=ntp hap reset to nvram
[ 1130.599990] obfl_klm writing reset reason 16, ntp hap reset
[ 1130.612558] Collected 8 ext4 filesystems
Check the appropriate process log:

The process which crashes should have some level of log output prior to the crash. The output of the logs on the
switch is written into the /tmp/logs directory. The process name will be part of the file name. For example, for the
Policy Element process, the file is svc_ifc_policyelem.log:
rtp_leaf2# ls -l |grep policyelem
-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log
-rw-r--r-- 1 root root 1413246 Oct 14 22:10 svc_ifc_policyelem.log.1.gz
-rw-r--r-- 1 root root 1276434 Oct 14 22:15 svc_ifc_policyelem.log.2.gz
-rw-r--r-- 1 root root 1588816 Oct 14 23:12 svc_ifc_policyelem.log.3.gz
-rw-r--r-- 1 root root 2124876 Oct 15 14:34 svc_ifc_policyelem.log.4.gz
-rw-r--r-- 1 root root 1354160 Oct 15 22:30 svc_ifc_policyelem.log.5.gz
-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log.6
-rw-rw-rw- 1 root root 2 Oct 14 22:06 svc_ifc_policyelem.log.PRESERVED
-rw-rw-rw- 1 root root 209 Oct 14 22:06 svc_ifc_policyelem.log.stderr
rtp_leaf2#

There will be several files for each process located at /tmp/logs. As a log file increases in size, it will be compressed
and older log files will be rotated off. Check the core file creation time (as shown in the GUI and in the core file name)
to understand where to look in the file. Also, when the process first attempts to come up, there will be an entry in the
log file that indicates “Process is restarting after a crash” that can be used to search backwards for what might have
happened prior to the crash.

Check what activity occurred at the time of the process crash:

A process which has been running has experienced some change which then caused it to crash. In many cases the
change may have been some configuration activity on the system. The activity that occurred on the system can be
found in the audit log history.
For example, if the ntp process crashes, review the audit log around the time of the crash; in this example there was
a change in which an NTP provider was deleted:
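
The audit log can also be pulled over the REST API by querying the aaaModLR class and filtering on a time window around the crash; in the sketch below the credentials and timestamps are placeholders to be replaced with the actual crash time:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# Pull audit-log records created around the time of the crash (placeholder window)
flt = ('and(gt(aaaModLR.created,"2014-10-16T03:30:00"),'
       'lt(aaaModLR.created,"2014-10-16T04:00:00"))')
r = s.get(APIC + "/api/node/class/aaaModLR.json",
          params={"query-target-filter": flt, "order-by": "aaaModLR.created|asc"})
for mo in r.json()["imdata"]:
    a = mo["aaaModLR"]["attributes"]
    print(a["created"], a["user"], a["descr"])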
Collect Techsupport and Core File and Contact the TAC:

A process crashing should not normally occur. In order to better understand why, beyond the above steps it will be
necessary to decode the core file. At this point, the file will need to be collected and provided to the TAC for further
processing.
Collect the core file (as described above) and open a case with the TAC.

Symptom 2

Fabric switch continuously reloads or is stuck at the BIOS loader prompt.

Verification

As indicated in the overview section, if a DME process crashes, it should restart automatically without the switch
restarting. If a non-DME process crashes, the process will not automatically restart and the switch will reboot to
recover. However in either case if the process continuously crashes, the switch may get into a continuous reload loop
or end up in the BIOS loader prompt.
[ 1130.593388] nvram_klm wrote rr=16 rr_str=policyelem hap reset to nvram
[ 1130.599990] obfl_klm writing reset reason 16, policyelem hap reset
[ 1130.612558] Collected 8 ext4 filesystems

Break the HAP reset loop:

The first step is to attempt to get the switch back into a state where further information can be collected.
If the switch is continuously rebooting, break into the BIOS loader prompt through the console by typing CTRL+C
early in the boot cycle.
Once the switch is at the loader prompt, enter in the following commands:
cmdline no_hap_reset
boot <file>

The cmdline command will prevent the switch from reloading when a HAP reset is called. The second command will
boot the system. Note that the boot command is needed instead of a reload at the loader prompt, as a reload will remove
the cmdline option entered.
Though the system should now remain up to allow better access to collect data, whatever process is crashing will
impact the functionality of the switch.

Check the appropriate process log:

The process which crashes should have some level of log output prior to the crash. The output of the logs on the
switch is written into the /tmp/logs directory. The process name will be part of the file name. For example, for the
Policy Element process, the file is svc_ifc_policyelem.log:
rtp_leaf2# ls -l |grep policyelem
-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log
-rw-r--r-- 1 root root 1413246 Oct 14 22:10 svc_ifc_policyelem.log.1.gz
-rw-r--r-- 1 root root 1276434 Oct 14 22:15 svc_ifc_policyelem.log.2.gz
-rw-r--r-- 1 root root 1588816 Oct 14 23:12 svc_ifc_policyelem.log.3.gz
-rw-r--r-- 1 root root 2124876 Oct 15 14:34 svc_ifc_policyelem.log.4.gz
-rw-r--r-- 1 root root 1354160 Oct 15 22:30 svc_ifc_policyelem.log.5.gz
-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log.6
-rw-rw-rw- 1 root root 2 Oct 14 22:06 svc_ifc_policyelem.log.PRESERVED
-rw-rw-rw- 1 root root 209 Oct 14 22:06 svc_ifc_policyelem.log.stderr
rtp_leaf2#

There will be several files for each process located at /tmp/logs. As a log file increases in size, it will be compressed
and older log files will be rotated off. Check the core file creation time (as shown in the GUI and in the core file name)
to understand where to look in the file. Also, when the process first attempts to come up, there will be an entry in the
log file that indicates “Process is restarting after a crash” that can be used to search backwards for what might have
happened prior to the crash.

Check what activity occurred at the time of the process crash:

A process which has been running has experienced some change which then caused it to crash. In many cases the
change may have been some configuration activity on the system. The activity that occurred on the system can be
found in the audit log history.
For example, if the ntp process crashes, review the audit log around the time of the crash; in this example there was
a change in which an NTP provider was deleted:
Collect Core File and Contact the Cisco TAC:

A process crashing should not normally occur. In order to understand better why, beyond the above steps, it will be
necessary to decode the core file. At this point, the file will need to be collected and provided to the Cisco TAC for
further processing.
Collect the core file (as indicated above how to do this) and open up a support case with the Cisco TAC.

23 APIC Process Crash Troubleshooting

• Overview
– DME Processes:
– How to Identify When a Process Crashes:
– Collecting the Core Files:
• Problem Description
– Symptom 1
– Verification
* Check the appropriate process log:
* Check what activity occurred at the time of the process crash
* Restarting a process:
* Collect Techsupport and Core File and Contact the Cisco TAC:
– Symptom 2
– Verification
* Check the appropriate process log:
* Check what activity occurred at the time of the process crash:
* Collect Techsupport and Core File and Contact the Cisco TAC:

23.1 Overview

The APIC has a series of Data Management Engine (DME) processes which control various functional aspects on the
system. When the system has a software failure in a particular process, a core file will be generated and the process
will be reloaded.
This chapter covers potential issues involving system processes crashes or software failures, beginning with an
overview of the various system processes, how to detect that a process has cored, and what actions should be taken
when this occurs. The displays taken on a working healthy system can then be used to identify processes that may
have terminated abruptly.

DME Processes:

The essential processes running on an APIC can be found either through the GUI or the CLI. Using the GUI, the
processes and their process IDs are found in System->Controllers->Processes as shown here:
Using the CLI, the processes and the process ID are found in the summary file at /aci/system/controllers/1/processes
(for APIC1):
admin@RTP_Apic1:processes> cat summary
processes:
process-id process-name max-memory-allocated state
---------- ----------------- -------------------- -------------------
0 KERNEL 0 interruptible-sleep
331 dhcpd 108920832 interruptible-sleep
336 vmmmgr 334442496 interruptible-sleep
554 neo 398274560 interruptible-sleep
1034 ae 153690112 interruptible-sleep
1214 eventmgr 514793472 interruptible-sleep
2541 bootmgr 292020224 interruptible-sleep
4390 snoopy 28499968 interruptible-sleep
5832 scripthandler 254308352 interruptible-sleep
19204 dbgr 648941568 interruptible-sleep
21863 nginx 4312199168 interruptible-sleep
32192 appliancedirector 136732672 interruptible-sleep
32197 sshd 1228800 interruptible-sleep
32202 perfwatch 19345408 interruptible-sleep
32203 observer 724484096 interruptible-sleep
32205 lldpad 1200128 interruptible-sleep
32209 topomgr 280576000 interruptible-sleep
32210 xinetd 99258368 interruptible-sleep
32213 policymgr 673251328 interruptible-sleep
32215 reader 258940928 interruptible-sleep
32216 logwatch 266596352 interruptible-sleep
32218 idmgr 246824960 interruptible-sleep
32416 keyhole 15233024 interruptible-sleep
admin@apic1:processes>

Each of the processes running on the APIC writes to a log file on the system. These log files can be bundled as part
of the APIC techsupport file but can also be observed through SSH shell access in /var/log/dme/log. For example, the
Policy Manager process log output is written into /var/log/dme/log/svc_ifc_policymgr.bin.log.
The following is a brief description of the processes running on the system. This can help in understanding which
log files to reference when troubleshooting a particular process or understand the impact to the system if a process
crashed:
Process            Function
KERNEL             Linux kernel
dhcpd              DHCP process running for APIC to assign infra addresses
vmmmgr             Handles processes between APIC and Hypervisors
neo                Shell CLI Interpreter
ae                 Handles the state and inventory of the local APIC appliance
eventmgr           Handles all events and faults on the system
bootmgr            Controls boot and firmware updates on fabric nodes
snoopy             Shell CLI help, tab command completion
scripthandler      Handles the L4-L7 device scripts and communication
dbgr               Generates core files when a process crashes
nginx              Web service handling GUI and REST API access
appliancedirector  Handles formation and control of the APIC cluster
sshd               Enables SSH access into the APIC
perfwatch          Monitors Linux cgroup resource usage
observer           Monitors the fabric system and data handling of state, stats, health
lldpad             LLDP Agent
topomgr            Maintains fabric topology and inventory

How to Identify When a Process Crashes:

When a process crashes and a core file is generated, the ACI system raises a fault notification and generates an entry in
the event logs. The fault for the particular process is shown as a “process-crash” as shown in this syslog output from
the APIC:
Oct 15 17:13:35 apic1 %LOG_LOCAL7-3-SYSTEM_MSG [E4208395][process-crash][major][subj-[dbgs/cores/ctrl

The fault that is generated when the process crashes is cleared when the process is restarted. The fault can be viewed
via the GUI in the fabric HISTORY-> EVENTS tab at FABRIC->INVENTORY->Pod 1:
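
Because the active fault clears once the process restarts, the fault history can be queried instead. The sketch below pulls the most recent faultRecord entries and keeps the ones that mention a process crash; matching on the record text is an assumption, so the filter string may need to be adjusted, and the credentials are placeholders:

import requests

APIC = "https://10.122.254.211"   # placeholder APIC address and credentials
s = requests.Session()
s.verify = False
s.post(APIC + "/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}).raise_for_status()

# Pull the 100 most recent fault history records and keep the process-crash ones
r = s.get(APIC + "/api/node/class/faultRecord.json",
          params={"order-by": "faultRecord.created|desc",
                  "page": "0", "page-size": "100"})
for mo in r.json()["imdata"]:
    a = mo["faultRecord"]["attributes"]
    if "process-crash" in a.get("descr", "") or "/cores/" in a["dn"]:
        print(a["created"], a["dn"], a["descr"])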
Collecting the Core Files:

The APIC GUI provides a central location to collect the core files for APICS and nodes in the fabric.
An export policy can be created from ADMIN -> IMPORT/EXPORT in Export Policies -> Core. However, there is a
default core policy where files can be downloaded directly. As shown in this example in the OPERATIONAL tab:
The core files can be accessed via SSH/SCP on the APIC at /data/techsupport.
Note that the core file will be available at /data/techsupport on the APIC that had the process crash. The APIC on
which the core file resides can be found from the Export Location path shown in the GUI. For example, if the Export
Location begins with “files/2/”, the file is located on node 2 (APIC2).

23.2 Problem Description

APIC process crashes and either restarts automatically or is not running.

Symptom 1

APIC process is not running

Verification

A process that crashes generally should restart. However, if the same process crashes several times in a short amount
of time, the process may not recover.
Verify the process status through either of the following:
• APIC CLI: verify the contents of the summary file on the APIC located in /aci/system/controllers/<APIC node
ID>/processes. For example, /aci/system/controllers/1/processes/summary for APIC1. An example output was shown
in the overview section above.
• GUI: navigate to SYSTEM->CONTROLLERS->Controllers, select the APIC, and check that the running processes
have a PID associated. All but KERNEL should. An example output was shown in the overview section above.

Check the appropriate process log:

The process which is not running should have some level of log output prior to the crash. The output of the logs for
the APIC on which the process is not running is found in /var/log/dme/log via SSH access. The process name will be
part of the file name. For example, for vmmmgr the file is svc_ifc_vmmmgr.bin.log.
admin@RTP_Apic1:log> ls -l |grep vmmmgr
-rw-r--r-- 2 ifc root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log
-rw-r--r-- 1 ifc root 1318921 Oct 14 19:25 svc_ifc_vmmmgr.bin.log.1.gz
-rw-r--r-- 1 ifc root 967890 Oct 14 19:42 svc_ifc_vmmmgr.bin.log.2.gz
-rw-r--r-- 1 ifc root 1555562 Oct 14 22:11 svc_ifc_vmmmgr.bin.log.3.gz
-rw-r--r-- 1 ifc root 1673143 Oct 15 12:19 svc_ifc_vmmmgr.bin.log.4.gz
-rw-r--r-- 1 ifc root 1119380 Oct 15 12:30 svc_ifc_vmmmgr.bin.log.5.gz
-rw-r--r-- 2 ifc root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log.6
-rw-r--r-- 1 ifc root 2 Oct 14 13:36 svc_ifc_vmmmgr.bin.log.PRESERVED
-rw-r--r-- 1 ifc root 7924 Oct 14 22:44 svc_ifc_vmmmgr.bin.log.stderr
admin@RTP_Apic1:log>

There will be several files for each process located at /var/log/dme/log. As the log file increases in size, it will be
compressed and older log files will be rotated off. Check the core file creation time (as shown in the GUI and the core
file name) to understand where to look in the file. Also, when the process first attempts to come up, there exists an
entry in the log file that indicates “Process is restarting after a crash” that can be used to search backwards as to what
might have happened prior to the crash.
Check what activity occurred at the time of the process crash

Typically, a process which has been running successfully would have to experience some change which caused it to
crash. In many cases the changes may have been some configuration activity on the system. What activity occurred
on the system can be found in the audit log history of the system.
For example, if the policymgr process crashes several times that led to the process not being up, going into the logs
and inspecting entries around the time of the first crash is a good way to investigate what might have caused the issue.
As shown in the example below, there was a change where a new service graph was added, thus giving the indication
that the service graph configuration may have caused the failure:

Restarting a process:

When a process fails to restart automatically on an APIC, the recommended method is to restart the APIC to allow all
the processes to come up organically.
The processes can also be started through the APIC shell command acidiag restart mgmt. This will restart the
essential APIC processes, but it will cause all processes to restart, not just bring up the process which is not running.
If the process has crashed several times already, it may crash again when it comes up. This could be due to some
persistent configuration condition that is leading to the crash. Knowing what changed, as indicated above, may help
determine what corrective actions to take to correct the root issue.

Collect Techsupport and Core File and Contact the Cisco TAC:

Process crashes should not occur under normal operational conditions. In order to better understand why the process
crashed, beyond the above steps it will be necessary to decode the core files. At this point, the files will need to be
collected and provided to the Cisco Technical Assistance Center for further processing.
Collect the core files, as indicated above in the overview section, and open up a support case with the Cisco Technical
Assistance Center.

Symptom 2

APIC process has crashed and restarted automatically

Verification

A process that crashes generally should restart. When the process crashes, a core file will be generated as indicated in
the overview section.

Check the appropriate process log:

The process which crashes should have some level of log output prior to the crash. The output of the logs for the
APIC on which the process crashed is found in /var/log/dme/log when logged in via SSH. The process name
will be part of the file name. For example, for vmmmgr the file is svc_ifc_vmmmgr.bin.log.
admin@RTP_Apic1:log> ls -l |grep vmmmgr
-rw-r--r-- 2 ifc root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log
-rw-r--r-- 1 ifc root 1318921 Oct 14 19:25 svc_ifc_vmmmgr.bin.log.1.gz
-rw-r--r-- 1 ifc root 967890 Oct 14 19:42 svc_ifc_vmmmgr.bin.log.2.gz
-rw-r--r-- 1 ifc root 1555562 Oct 14 22:11 svc_ifc_vmmmgr.bin.log.3.gz
-rw-r--r-- 1 ifc root 1673143 Oct 15 12:19 svc_ifc_vmmmgr.bin.log.4.gz
-rw-r--r-- 1 ifc root 1119380 Oct 15 12:30 svc_ifc_vmmmgr.bin.log.5.gz
-rw-r--r-- 2 ifc root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log.6
-rw-r--r-- 1 ifc root 2 Oct 14 13:36 svc_ifc_vmmmgr.bin.log.PRESERVED
-rw-r--r-- 1 ifc root 7924 Oct 14 22:44 svc_ifc_vmmmgr.bin.log.stderr
admin@RTP_Apic1:log>

There will be several files for each process located at /var/log/dme/log. As a log file increases in size, it will be
compressed and older log files will be rotated off. Check the core file creation time (as shown in the GUI and in the core
file name) to understand where to look in the file. Also, when the process first attempts to come up, there will be an entry
in the log file that indicates “Process is restarting after a crash” that can be used to search backwards for what might
have happened prior to the crash.

Check what activity occurred at the time of the process crash:

Typically, a process which has been running successfully would have to experience some change which caused it to
crash. In many cases the changes may have been some configuration activity on the system. What activity occurred
on the system can be found in FABRIC->Pod 1 in the HISTORY tab and then the AUDIT LOG subtab.
In this example, the policymgr process crashed several times leading to the process not being up. On further investi-
gation, during the time of the first crash event, a new service graph was added.
Collect Techsupport and Core File and Contact the Cisco TAC:

Process crashes should not occur under normal operational conditions. In order to better understand why the process
crashed, beyond the above steps it will be necessary to decode the core files. At this point, the files will need to be
collected and provided to the Cisco Technical Assistance Center for further processing.
Collect the core files, as indicated above in the overview section, and open up a support case with the Cisco Technical
Assistance Center.

24 Appendix
• Overview
– A
– B
– C
– D
– E
– F
– G
– H
– I
– J
– L
– M
– O
– P
– R
– S
– T
– V
– X

24.1 Overview

This section is designed to provide a high-level description of terms and concepts that are brought up in this book.
While ACI does not change how packets are transmitted on a wire, there are some new terms and concepts employed,
and understanding them will help those working on ACI communicate with one another about the constructs used in
ACI to transmit those bits. Associated new acronyms are also provided.
This is not meant to be an exhaustive list nor a completely detailed dictionary of all of the terms and concepts, only the
key ones that may not be a part of the common vernacular or which would be relevant to the troubleshooting exercises
that were covered in the troubleshooting scenarios discussed.

AAA: acronym for Authentication, Authorization, and Accounting.


ACI External Connectivity: Any connectivity to and from the fabric that uses an external routed or switched inter-
mediary system, where endpoints fall outside of the managed scope of the fabric.
ACID transactions: ACID is an acronym for Atomicity, Consistency, Isolation, Durability - properties of transactions
that ensure consistency in database transactions. Transactions to APIC devices in an ACI cluster are considered
ACID, to ensure that database consistency is maintained. This means that if one part of a transaction fails, the entire
transaction fails.
AEP: Attach Entity Profile – this is a configuration profile of the interface that gets applied when an entity attaches to
the fabric. An AEP represents a group of external entities with similar infrastructure policy requirements.
ALE: Application Leaf Engine, an ASIC on a leaf switch.
APIC: Application Policy Infrastructure Controller, a centralized policy management controller cluster. The APIC
configures the intended state of the policy into the fabric.
API: Application Programming Interface used for programmable extensibility.
Application Profile: Term used to reference an application profile managed object, which models the logical
components of an application and how those components communicate. The AP is the key object used to represent an
application and is also the anchor point for the automated infrastructure management in an ACI fabric.
ASE: Application Spine Engine, an ASIC on a Spine switch.

BGP: Border Gateway Protocol, on the ACI fabric BGP is used to distribute reachability information within the fabric.
Bridge Domain: A unique layer 2 forwarding domain that contains one or more subnets.

Clos fabric: A multi-tier nonblocking leaf-spine architecture network.


Cluster: Set of devices that work together as a single system to provide an identical or similar set of functions.
Contracts: A logical container for the subjects which relate to the filters that govern the rules for communication
between endpoint groups. ACI works on a white list policy model. Without a contract, the default forwarding policy
is to not allow any communication between EPGs but communication within an EPG is allowed.
Context: A layer 3 forwarding domain, equivalent to a VRF. Every bridge domain needs to be associated with a
context.

DLB: Dynamic Load Balancing – a network traffic load balancing mechanism in the ACI fabric based on flowlet
switching.
DME: Data Management Engine, a service that runs on the APIC that manages data for the data model.
dMIT: distributed Management Information Tree, a representation of the ACI object model with the root of the tree
at the top and the leaves of the tree at the bottom. The tree contains all aspects of the object model that represent an
ACI fabric.
Dn: Distinguished name - a fully qualified name that represents a specific object within the ACI management
information tree as well as the specific location information in the tree. It is made up of a concatenation of all of the
relative names from itself back to the root of the tree. As an example, if a policy object of type Application Profile
is created named commerceworkspace within a Tenant named Prod, the Dn would be expressed as
uni/tn-Prod/ap-commerceworkspace.

EP: Endpoint - Any logical or physical device connected directly or indirectly to a port on a leaf switch that is not
a fabric facing port. Endpoints have specific properties like an address, location, or potentially some other attribute,
which is used to identify the endpoint. Examples include virtual-machines, servers, storage devices, etc.
EPG: A collection of endpoints that can be grouped based on common requirements for a common policy. Endpoint
groups can be dynamic or static.

Fault: When a failure occurs or an alarm is raised, the system creates a fault managed object for the fault. A fault
contains the conditions, information about the operational state of the affected object, and potential resolutions for the
problem.
Fabric: Topology of network nodes.
Filters: Filters define the rules outlining the layer 2 to layer 4 fields that will be matched by a contract.
Flowlet switching: An optimized multipath load balancing methodology based on research from MIT in 2004. Flowlet
Switching is a way to use TCP’s own bursty nature to more efficiently forward TCP flows by dynamically splitting
flows into flowlets and splitting traffic across multiple parallel paths without requiring packet reordering.

GUI: Graphical User Interface.

HTML: HyperText Markup Language, a markup language that focuses on the formatting of web pages.
Hypervisor: Software that abstracts the hardware on a host machine and allows the host machine to run multiple
virtual machines.
Hypervisor integration: Extension of ACI Fabric connectivity to a virtualization manager to provide the APIC con-
troller with a mechanism for virtual machine visibility and policy enforcement.

IFM: Intra-Fabric Messages, used for communication between different devices on the ACI fabric.
Inband Management (INB): Connectivity using an inband management configuration. This uses a front panel (data
plane) port of a leaf switch for external management connectivity for the fabric and APICs.
IS-IS: Link local routing protocol leveraged by the fabric for infrastructure topology. Loopback and VTEP addresses
are internally advertised over IS-IS. IS-IS announces the creation of tunnels from leaf nodes to all other nodes in fabric.

JSON: JavaScript Object Notation, a data encapsulation format that uses human readable text to encapsulate data
objects in attribute and value pairs.

Layer 2 Out (l2out): Layer 2 connectivity to an external network that exists outside of the ACI fabric.
Layer 3 Out (l3out): Layer 3 connectivity to an external network that exists outside of the ACI fabric.
L4-L7 Service Insertion: The insertion of a service like a firewall or a load balancer into the flow of traffic. Service
nodes operate between Layers 4 and 7 of the OSI model, whereas networking elements (i.e. the fabric) operate at
Layers 1-3.
Labels: Used for classifying which objects can and cannot communicate with each other.
Leaf: Network node in fabric providing host and border connectivity. Leafs connect only to hosts and spines. Leafs
never connect to each other.

MO: Managed Object – every configurable component of the ACI policy model managed in the MIT is called a MO.
Model: A model is a concept which represents entities and the relationships that exist between them.
Multi-tier Application: Client–server architecture in which presentation, application logic, and database management
functions are physically separated and require networking functions to communicate with the other tiers for application
functionality.

Object Model: A collection of objects and classes are used to examine and manipulate the configuration and running
state of the system that is exposing that object model. In ACI the object model is represented as a tree known as the
distributed management information tree (dMIT).
Out of Band management (OOBM): External connectivity using a specific out-of-band management interface on
every switch and APIC.

Port Channel: Port link aggregation technology that binds multiple physical interfaces into a single logical interface
and provides more aggregate bandwidth and link failure redundancy.

RBAC: Role Based Access Control, which is a method of managing secure access to infrastructure by assigning roles
to users, then using those roles in the process of granting or denying access to devices, objects and privilege levels.
REpresentational State Transfer (REST): a stateless protocol usually run over HTTP that allows a client to access a
service. The location that the client accesses usually defines the data the client is trying to access from the service. Data
is usually accessed and returned in either XML or JSON format.
RESTful: An API that uses REST, or Representational State Transfer.
Rn: Relative name, a name of a specific object within the ACI management information tree that is not fully qualified.
A Rn is significant to the individual object, but without context, it’s not very useful in navigation. A Rn would need
to be concatenated with all the relative names from itself back up to the root to make a distinguished name, which
then becomes useful for navigation. As an example, if an Application Profile object is created named
“commerceworkspace”, the Rn would be “ap-commerceworkspace” because Application Profile relative names are all
prefaced with the letters “ap-”. See also the Dn definition.

Service graph: Cisco ACI treats services as an integral part of an application. Any services that are required are
treated as a service graph that is instantiated on the ACI fabric from the APIC. Service graphs identify the set of
network or service functions that are needed by the application, and represent each function as a node. A service graph
is represented as two or more tiers of an application with the appropriate service function inserted between.
Spine: Network node in fabric carrying aggregate host traffic from leafs, connected only to leafs in the fabric and no
other device types.
Spine Leaf topology: A clos-based fabric topology in which spine nodes connect to leaf nodes, leaf nodes connect to
hosts and external networks.
Subnets: Contained by a bridge domain, a subnet defines the IP address range that can be used within the bridge
domain.
Subjects: Contained by contracts and create the relationship between filters and contracts.
Supervisor: Switch module/line card that provides the processing engine.

Tenants: The logical container used to group all application policies. This allows isolation from a policy
perspective. For a service provider this would be a customer. In an enterprise or organization, this allows the
organization to define policy separation in a way that suits its needs. There are three pre-defined tenants on every
ACI fabric:
• common: policies in this tenant are shared by all tenants. Usually these are used for shared services or L4-L7
services.
• infra: policies in this tenant are used to influence the operation of the fabric overlay
• mgmt: policies in this tenant are used to define access to the inband and out-of-band management and virtual
machine controllers.

Virtualization: application of technology used to abstract hardware resources into virtual representations and allowing
software configurability.
vPC: virtual Port Channel, in which a port channel is created for link aggregation, but is spread across multiple
physical switches.
VRF: Virtual Routing and Forwarding - A L3 namespace isolation methodology to allow for multiple L3 contexts to
be deployed on a single device or infrastructure.
VXLAN: VXLAN is a Layer 2 overlay scheme transported across a Layer 3 network. A 24-bit VXLAN segment ID
(SID) or VXLAN network identifier (VNID) is included in the encapsulation to provide up to 16 million VXLAN
segments for traffic isolation or segmentation. Each segment represents a unique Layer 2 broadcast domain. An ACI
VXLAN header is used to identify the policy attributes of the application endpoint within the fabric, and every packet
carries these policy attributes.

XML: eXtensible Markup Language, a markup language that focuses on encoding data for documents rather than the
formatting of the data for those documents.

