Description of the book Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations (Casey Sisterson's Library)
Book title: Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations (Casey Sisterson's Library)
Title translated into Persian: ایجاد پایه های SRE: راهنمای گام به گام برای معرفی مهندسی قابلیت اطمینان سایت در سازمان های ارائه دهنده نرم افزار (کتابخانه کیسی سیسترسون)
Series:
Authors: Vladyslav Ukis
Publisher: Pearson
Publication year: 2023
Number of pages: 557
ISBN: 9780137424757, 0137424604
Language: English
Format: PDF
File size: 15 MB
Table of contents:
Cover
Half Title
Title Page
Copyright Page
Table of Contents
Foreword
Preface
Acknowledgments
About the Author
Part I: Foundations
Chapter 1 Introduction to SRE
1.1 Why SRE?
1.1.1 ITIL
1.1.2 COBIT
1.1.3 Modeling
1.1.4 DevOps
1.1.5 SRE
1.1.6 Comparison
1.2 Alignment Using SRE
1.3 Why Does SRE Work?
1.4 Summary
Chapter 2 The Challenge
2.1 Misalignment
2.2 Collective Ownership
2.3 Ownership Using SRE
2.3.1 Product Development
2.3.2 Product Operations
2.3.3 Product Management
2.3.4 Benefits and Costs
2.4 The Challenge Statement
2.5 Coaching
2.6 Summary
Chapter 3 SRE Basic Concepts
3.1 Service Level Indicators
3.2 Service Level Objectives
3.3 Error Budgets
3.3.1 Availability Error Budget Example
3.3.2 Error Budget of Zero
3.3.3 Latency Error Budget Example
3.4 Error Budget Policies
3.5 SRE Concept Pyramid
3.6 Alignment Using the SRE Concept Pyramid
3.7 Summary
Chapter 4 Assessing the Status Quo
4.1 Where Is the Organization?
4.1.1 Organizational Structure
4.1.2 Organizational Alignment
4.1.3 Formal and Informal Leadership
4.2 Where Are the People?
4.3 Where Is the Tech?
4.4 Where Is the Culture?
4.4.1 Is There High Cooperation?
4.4.2 Are Messengers Trained?
4.4.3 Are Risks Shared?
4.4.4 Is Bridging Encouraged?
4.4.5 Does Failure Lead to Inquiry?
4.4.6 Is Novelty Implemented?
4.5 Where Is the Process?
4.6 SRE Maturity Model
4.7 Posing Hypotheses
4.8 Summary
Part II: Running the Transformation
Chapter 5 Achieving Organizational Buy-In
5.1 Getting People Behind SRE
5.2 SRE Marketing Funnel
5.2.1 Awareness
5.2.2 Interest
5.2.3 Understanding
5.2.4 Agreement
5.2.5 Engagement
5.3 SRE Coaches
5.3.1 Qualities
5.3.2 Responsibilities
5.4 Top-Down Buy-In
5.4.1 Stakeholder Chart
5.4.2 Engaging the Head of Development
5.4.3 Engaging the Head of Operations
5.4.4 Engaging the Head of Product Management
5.4.5 Achieving Joint Buy-In
5.4.6 Getting SRE into the Portfolio
5.5 Bottom-Up Buy-In
5.5.1 Engaging the Operations Teams
5.5.2 Engaging the Development Teams
5.6 Lateral Buy-In
5.7 Buy-In Staggering
5.8 Team Coaching
5.9 Traversing the Organization
5.9.1 Grouping the Organization
5.9.2 Traversing the Organization Versus SRE Infrastructure Demand
5.9.3 Team Engagements Over Time
5.10 Organizational Coaching
5.11 Summary
Chapter 6 Laying Down the Foundations
6.1 Introductory Talks by Team
6.2 Conveying the Basics
6.2.1 SLO as a Contract
6.2.2 SLO as a Proxy Measure of Customer Happiness
6.2.3 User Personas
6.2.4 User Story Mapping
6.2.5 Motivation to Fix SLO Breaches
6.2.6 SLOs Are Not About Technicalities
6.2.7 Causes of SLO Breaches
6.2.8 On Call for SLO Breaches
6.3 SLI Standardization
6.3.1 Application Performance Management Facility
6.3.2 Availability
6.3.3 Latency
6.3.4 Prioritization
6.4 Enabling Logging
6.5 Teaching the Log Query Language
6.6 Defining Initial SLOs
6.6.1 What Makes a Good SLO?
6.6.2 Iterating on an SLO
6.6.3 Revising SLOs
6.7 Default SLOs
6.8 Providing Basic Infrastructure
6.8.1 Dashboards
6.8.2 Alert Content
6.9 Engaging Champions
6.10 Dealing with Detractors
6.10.1 Issues with the Cause
6.10.2 Issues with Alerting
6.10.3 Issues with Tooling
6.10.4 Issues with Product Owner Involvement
6.10.5 Issues with Team Motivation
6.11 Creating Documentation
6.12 Broadcast Success
6.13 Summary
Chapter 7 Reacting to Alerts on SLO Breaches
7.1 Environment Selection
7.2 Responsibilities
7.2.1 Dev Versus Ops Responsibilities
7.2.2 Operational Responsibilities
7.2.3 Splitting Operational Responsibilities
7.3 Ways of Working
7.3.1 Interruption-Based Working Mode
7.3.2 Focus-Based Working Mode
7.4 Setting Up On-Call Rotations
7.4.1 Initial Rotation Period
7.4.2 One Person On Call
7.4.3 Two People On Call
7.4.4 Three People On Call
7.5 On-Call Management Tools
7.5.1 Posting SLO Breaches
7.5.2 Scheduling
7.5.3 Professional On-Call Management Tools
7.6 Out-of-Hours On-Call
7.6.1 Using Availability Targets and Product Demand
7.6.2 Trade-offs
7.7 Systematic Knowledge Sharing
7.7.1 Knowledge-Sharing Needs
7.7.2 Knowledge-Sharing Pyramid
7.7.3 On-Call Training
7.7.4 Runbooks
7.7.5 Internal Stack Overflow
7.7.6 SRE Community of Practice
7.8 Broadcast Success
7.9 Summary
Chapter 8 Implementing Alert Dispatching
8.1 Alert Escalation
8.2 Defining an Alert Escalation Policy
8.3 Defining Stakeholder Groups
8.4 Triggering Stakeholder Notifications
8.5 Defining Stakeholder Rings
8.6 Defining Effective Stakeholder Notifications
8.7 Getting the Stakeholders Subscribed
8.7.1 Subscribing Using the On-Call Management Tool
8.7.2 Subscribing Using Other Means
8.8 Broadcast Success
8.9 Summary
Chapter 9 Implementing Incident Response
9.1 Incident Response Foundations
9.2 Incident Priorities
9.2.1 SLO Breaches Versus Incidents
9.2.2 Changing Incident Priority During an Incident
9.2.3 Defining Generic Incident Priorities
9.2.4 Mapping SLOs to Incident Priorities
9.2.5 Mapping Error Budgets to Incident Priorities
9.2.6 Mapping Resource-Based Alerts to Incident Priorities
9.2.7 Uncovering New Use Cases for Incident Priorities
9.2.8 Adjusting Incident Priorities Based on Stakeholder Feedback
9.2.9 Extending the SLO Definition Process
9.2.10 Infrastructure
9.2.11 Deduplication
9.3 Complex Incident Coordination
9.3.1 What Is a Complex Incident?
9.3.2 Existing Incident Coordination Systems
9.3.3 Incident Classification
9.3.4 Defining Generic Incident Severities
9.3.5 Social Dimension of Incident Classification
9.3.6 Incident Priority Versus Incident Severity
9.3.7 Defining Roles
9.3.8 Roles Required by Incident Severity
9.3.9 Roles On Call
9.3.10 Incident Response Process Evaluation
9.3.11 Incident Response Process Dynamics
9.3.12 Incident Response Team Well-Being
9.4 Incident Postmortems
9.5 Effective Postmortem Criteria
9.5.1 Initiating a Postmortem
9.5.2 Postmortem Lifecycle
9.5.3 Before the Postmortem
9.5.4 During the Postmortem
9.5.5 After the Postmortem
9.5.6 Analyzing the Postmortem Process
9.5.7 Postmortem Template
9.5.8 Facilitating Learning from Postmortems
9.5.9 Successful Postmortem Practice
9.5.10 Example Postmortems
9.6 Mashing Up the Tools
9.6.1 Connecting to the On-Call Management Tool
9.6.2 Connections Among Other Tools
9.6.3 Mobile Integrations
9.6.4 Example Tool Landscapes
9.7 Service Status Broadcast
9.8 Documenting the Incident Response Process
9.9 Broadcast Success
9.10 Summary
Chapter 10 Setting Up an Error Budget Policy
10.1 Motivation
10.2 Terminology
10.3 Error Budget Policy Structure
10.4 Error Budget Policy Conditions
10.5 Error Budget Policy Consequences
10.6 Error Budget Policy Governance
10.7 Extending the Error Budget Policy
10.8 Agreeing to the Error Budget Policy
10.9 Storing the Error Budget Policy
10.10 Enacting the Error Budget Policy
10.11 Reviewing the Error Budget Policy
10.12 Related Concepts
10.13 Summary
Chapter 11 Enabling Error Budget–Based Decision-Making
11.1 Reliability Decision-Making Taxonomy
11.2 Implementing SRE Indicators
11.2.1 Dimensions of SRE Indicators
11.2.2 “SLOs by Service” Indicator
11.2.3 SLO Adherence Indicator
11.2.4 SLO Error Budget Depletion Indicator
11.2.5 Premature SLO Error Budget Exhaustion Indicator
11.2.6 “SLAs by Service” Indicator
11.2.7 SLA Error Budget Depletion Indicator
11.2.8 SLA Adherence Indicator
11.2.9 Customer Support Ticket Trend Indicator
11.2.10 “On-Call Rotations by Team” Indicator
11.2.11 Incident Time to Recovery Trend Indicator
11.2.12 Least Available Service Endpoints Indicator
11.2.13 Slowest Service Endpoints Indicator
11.3 Process Indicators, Not People KPIs
11.4 Decisions Versus Indicators
11.5 Decision-Making Workflows
11.5.1 API Consumption Decision Workflow
11.5.2 Tightening a Dependency’s SLO Decision Workflow
11.5.3 Features Versus Reliability Prioritization Workflow
11.5.4 Setting an SLO Decision Workflow
11.5.5 Setting an SLA Decision Workflow
11.5.6 Allocating SRE Capacity to a Team Decision Workflow
11.5.7 Chaos Engineering Hypotheses Selection Workflow
11.6 Summary
Chapter 12 Implementing Organizational Structure
12.1 SRE Principles Versus Organizational Structure
12.2 Who Builds It, Who Runs It?
12.2.1 “Who Builds It, Who Runs It?” Spectrum
12.2.2 Hybrid Models
12.2.3 Reliability Incentives
12.2.4 Model Comparison Criteria
12.2.5 Model Comparison
12.3 You Build It, You Run It
12.4 You Build It, You and SRE Run It
12.4.1 SRE Team Within the Development Organization
12.4.2 SRE Team Within the Operations Organization
12.4.3 SRE Team in a Dedicated SRE Organization
12.4.4 Comparison
12.4.5 SRE Team Incentives, Identity, and Pride
12.4.6 SRE Team Head Count and Budget
12.4.7 SRE Team Cost Accounting
12.4.8 SRE Team KPIs
12.5 You Build It, SRE Run It
12.5.1 SRE Team Within a Development Organization
12.5.2 SRE Team Within an Operations Organization
12.5.3 SRE Team in a Dedicated SRE Organization
12.6 Cost Optimization
12.7 Team Topologies
12.7.1 Reporting Lines
12.7.2 SRE Identity Triangle
12.7.3 Holacracy: No Reporting Lines
12.8 Choosing a Model
12.8.1 Model Transformation Options
12.8.2 Decision Dimensions
12.8.3 Reporting Options
12.8.4 Positioning the SRE Organization
12.8.5 Conveying the Value to Executives
12.9 A New Role: SRE
12.9.1 Why Is a New Role Needed?
12.9.2 Role Definition
12.9.3 Role Naming
12.9.4 Role Assignment
12.9.5 Role Fulfillment
12.10 SRE Career Path
12.10.1 SRE Role Progressions
12.10.2 SRE Role Transitions
12.10.3 Cultural Importance
12.11 Communicating the Chosen Model
12.12 Introducing the Chosen Model
12.12.1 Organization Changes
12.12.2 Reporting Structure Changes
12.12.3 Role Changes
12.13 Summary
Part III: Measuring and Sustaining the Transformation
Chapter 13 Measuring the SRE Transformation
13.1 Testing Transformation Hypotheses
13.2 Outages Not Detected Internally
13.3 Services Exhausting Error Budgets Prematurely
13.4 Executives’ Perceptions
13.5 Reliability Perception by Users and Partners
13.6 Summary
Chapter 14 Sustaining the SRE Movement
14.1 Maturing the SRE CoP
14.2 SRE Minutes
14.3 Availability Newsletter
14.4 SRE Column in the Engineering Blog
14.5 Promote Long-Form SRE Wiki Articles
14.6 SRE Broadcasting
14.7 Combining SRE and CD Indicators
14.7.1 CD Versus SRE Indicators
14.7.2 Bottleneck Analysis
14.8 SRE Feedback Loops
14.9 New Hypotheses
14.10 Providing Learning Opportunities
14.11 Supporting SRE Coaches
14.12 Summary
Chapter 15 The Road Ahead
15.1 Service Catalog
15.2 SLAs
15.3 Regulatory Compliance
15.4 SRE Infrastructure
15.5 Game Days
Appendix Topics for Quick Reference
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z