About the book Delta Lake: The Definitive Guide: Modern Data Lakehouse Architectures with Data Lakes
Book title: Delta Lake: The Definitive Guide: Modern Data Lakehouse Architectures with Data Lakes
Title translated into Persian: دریاچه دلتا: راهنمای قطعی: معماری های مدرن Data Lakehouse با دریاچه های داده
Authors: Prashanth Babu, Tristen Wentling, Scott Haines, Denny Lee
Publisher: O'Reilly Media
Publication year: 2024
Number of pages: 383
ISBN: 1098151941, 9781098151942
Language: English
Format: PDF
File size: 7 MB
Table of contents:
Copyright
Table of Contents
Foreword by Michael Armbrust
Foreword by Dominique Brezinski
Preface
Who This Book Is For
How This Book Is Organized
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Denny
Tristen
Scott
Prashanth
Chapter 1. Introduction to the Delta Lake Lakehouse Format
The Genesis of Delta Lake
Data Warehousing, Data Lakes, and Data Lakehouses
Project Tahoe to Delta Lake: The Early Months
What Is Delta Lake?
Common Use Cases
Key Features
Anatomy of a Delta Lake Table
Delta Transaction Protocol
Understanding the Delta Lake Transaction Log at the File Level
The Single Source of Truth
The Relationship Between Metadata and Data
Multiversion Concurrency Control (MVCC) File and Data Observations
Observing the Interaction Between the Metadata and Data
Table Features
Delta Kernel
Delta UniForm
Conclusion
Chapter 2. Installing Delta Lake
Delta Lake Docker Image
Delta Lake for Python
PySpark Shell
JupyterLab Notebook
Scala Shell
Delta Rust API
ROAPI
Native Delta Lake Libraries
Multiple Bindings Available
Installing the Delta Lake Python Package
Apache Spark with Delta Lake
Setting Up Delta Lake with Apache Spark
Prerequisite: Set Up Java
Setting Up an Interactive Shell
PySpark Declarative API
Databricks Community Edition
Create a Cluster with Databricks Runtime
Importing Notebooks
Attaching Notebooks
Conclusion
Chapter 3. Essential Delta Lake Operations
Create
Creating a Delta Lake Table
Loading Data into a Delta Lake Table
The Transaction Log
Read
Querying Data from a Delta Lake Table
Reading with Time Travel
Update
Delete
Deleting Data from a Delta Lake Table
Overwriting Data in a Delta Lake Table
Merge
Other Useful Actions
Parquet Conversions
Delta Lake Metadata and History
Conclusion
Chapter 4. Diving into the Delta Lake Ecosystem
Connectors
Apache Flink
Flink DataStream Connector
Installing the Connector
DeltaSource API
DeltaSink API
End-to-End Example
Kafka Delta Ingest
Install Rust
Build the Project
Run the Ingestion Flow
Trino
Getting Started
Configuring and Using the Trino Connector
Using Show Catalogs
Creating a Schema
Show Schemas
Working with Tables
Table Operations
Conclusion
Chapter 5. Maintaining Your Delta Lake
Using Delta Lake Table Properties
Delta Lake Table Properties Reference
Create an Empty Table with Properties
Populate the Table
Evolve the Table Schema
Add or Modify Table Properties
Remove Table Properties
Delta Lake Table Optimization
The Problem with Big Tables and Small Files
Using OPTIMIZE to Fix the Small File Problem
Table Tuning and Management
Partitioning Your Tables
Defining Partitions on Table Creation
Migrating from a Nonpartitioned to a Partitioned Table
Repairing, Restoring, and Replacing Table Data
Recovering and Replacing Tables
Deleting Data and Removing Partitions
The Life Cycle of a Delta Lake Table
Restoring Your Table
Cleaning Up
Conclusion
Chapter 6. Building Native Applications with Delta Lake
Getting Started
Python
Rust
Building a Lambda
What’s Next
Chapter 7. Streaming In and Out of Your Delta Lake
Streaming and Delta Lake
Streaming Versus Batch Processing
Delta as Source
Delta as Sink
Delta Streaming Options
Limit the Input Rate
Ignore Updates or Deletes
Initial Processing Position
Initial Snapshot with withEventTimeOrder
Advanced Usage with Apache Spark
Idempotent Stream Writes
Delta Lake Performance Metrics
Auto Loader and Delta Live Tables
Auto Loader
Delta Live Tables
Change Data Feed
Using Change Data Feed
Schema
Conclusion
Chapter 8. Advanced Features
Generated Columns, Keys, and IDs
Comments and Constraints
Comments
Delta Table Constraints
Deletion Vectors
Merge-on-Read
Stepping Through Deletion Vectors
Conclusion
Chapter 9. Architecting Your Lakehouse
The Lakehouse Architecture
What Is a Lakehouse?
Learning from Data Warehouses
Learning from Data Lakes
The Dual-Tier Data Architecture
Lakehouse Architecture
Foundations with Delta Lake
Open Source on Open Standards in an Open Ecosystem
Transaction Support
Schema Enforcement and Governance
The Medallion Architecture
Exploring the Bronze Layer
Exploring the Silver Layer
Exploring the Gold Layer
Streaming Medallion Architecture
Conclusion
Chapter 10. Performance Tuning: Optimizing Your Data Pipelines with Delta Lake
Performance Objectives
Maximizing Read Performance
Maximizing Write Performance
Performance Considerations
Partitioning
Table Utilities
Table Statistics
Cluster By
Bloom Filter Index
Conclusion
Chapter 11. Successful Design Patterns
Slashing Compute Costs
High-Speed Solutions
Smart Device Integration
Efficient Streaming Ingestion
Streaming Ingestion
The Inception of Delta Rust
The Evolution of Ingestion
Coordinating Complex Systems
Combining Operational Data Stores at DoorDash
Change Data Capture
Delta and Flink in Harmony
Conclusion
Chapter 12. Foundations of Lakehouse Governance and Security
Lakehouse Governance
The Emergence of Data Governance
Data Products and Their Relationship to Data Assets
Data Products in the Lakehouse
Maintaining High Trust
Data Assets and Access
The Data Asset Model
Unifying Governance Between Data Warehouses and Lakes
Permissions Management
Filesystem Permissions
Cloud Object Store Access Controls
Identity and Access Management
Data Security
Fine-Grained Access Controls for the Lakehouse
Conclusion
Chapter 13. Metadata Management, Data Flow, and Lineage
Metadata Management
What Is Metadata Management?
Data Catalogs
Data Reliability, Stewards, and Permissions Management
Why the Metastore Matters
Unity Catalog
Data Flow and Lineage
Data Lineage
Data Sharing
Automating Data Life Cycles
Audit Logging
Monitoring and Alerting
What Is Data Discovery?
Conclusion
Chapter 14. Data Sharing with the Delta Sharing Protocol
The Basics of Delta Sharing
Data Providers
Data Recipients
Delta Sharing Server
Using the REST APIs
Anatomy of the REST URI
List Shares
Get Share
List Schemas in Share
List All Tables in Share
Delta Sharing Clients
Delta Sharing with Apache Spark
Stream Processing with Delta Shares
Delta Sharing Community Connectors
Conclusion
Index
About the Authors
Colophon