GCP-SDPD: Serverless Data Processing with Dataflow

Unified stream and batch data processing that’s serverless, fast, and cost-effective.

According to IDC, 75% of enterprises will shift from piloting to operationalizing artificial intelligence by the end of 2024, yet the growing complexity of data types, heterogeneous data stacks, and programming languages makes this a challenge for all data engineers. In the current economic climate, doing more at lower cost and with higher efficiency has also become a key consideration for many organizations.

With the world’s only truly unified batch and streaming data processing model provided by Apache Beam, the wide support for ML frameworks, and the unique cross-language capabilities of the Beam model, Dataflow is becoming ever easier, faster, and more accessible for all data processing needs.

This course is HRDC claimable, and Malaysian Bumiputeras are eligible for the Yayasan Peneraju Financing Scheme. T&Cs apply.

Overview

The next generation of Dataflow: Dataflow Prime, Dataflow Go, and Dataflow ML.

This training is intended for big data practitioners who want to further their understanding of Dataflow in order to advance their data processing applications.

Beginning with foundations, this training explains how Apache Beam and Dataflow work together to meet your data processing needs without the risk of vendor lock-in. The section on developing pipelines covers how you convert your business logic into data processing applications that can run on Dataflow.

This training culminates with a focus on operations, which reviews the most important lessons for operating a data application on Dataflow, including monitoring, troubleshooting, testing, and reliability.

Skills Covered

Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.
Summarize the benefits of the Beam Portability Framework and enable it for your Dataflow pipelines.
Enable Shuffle and Streaming Engine, for batch and streaming pipelines respectively, for maximum performance.
Enable Flexible Resource Scheduling for more cost-efficient performance.
Select the right combination of IAM permissions for your Dataflow job.
Implement best practices for a secure data processing environment.
Select and tune the I/O of your choice for your Dataflow pipeline.
Use schemas to simplify your Beam code and improve the performance of your pipeline.
Develop a Beam pipeline using SQL and DataFrames.
Perform monitoring, troubleshooting, testing and CI/CD on Dataflow pipelines.

Prerequisites

Completed “Building Batch Data Pipelines”
Completed “Building Resilient Streaming Analytics Systems”

Target Audience

Data Engineer
Data Analysts and Data Scientists aspiring to develop Data Engineering skills

Course Curriculum

Module 1: Introduction

Course Introduction
Beam and Dataflow Refresher
Introduce the course objectives.
Demonstrate how Apache Beam and Dataflow work together to fulfill your organization’s data processing needs.

Module 2: Beam Portability

Beam Portability
Runner v2
Container Environments
Cross-Language Transforms
Summarize the benefits of the Beam Portability Framework.
Customize the data processing environment of your pipeline using custom containers.
Review use cases for cross-language transformations.
Enable the Portability framework for your Dataflow pipelines.

Module 3: Separating Compute and Storage with Dataflow

Dataflow
Dataflow Shuffle Service
Dataflow Streaming Engine
Flexible Resource Scheduling
Enable Shuffle and Streaming Engine, for batch and streaming pipelines respectively, for maximum performance.
Enable Flexible Resource Scheduling for more cost-efficient performance.

Module 4: IAM, Quotas, and Permissions

IAM
Quota
Select the right combination of IAM permissions for your Dataflow job.
Determine your capacity needs by inspecting the relevant quotas for your Dataflow jobs.

Module 5: Security

Data Locality
Shared VPC
Private IPs
CMEK
Select your zonal data processing strategy using Dataflow, depending on your data locality needs.
Implement best practices for a secure data processing environment.

Module 6: Beam Concepts Review

Beam Basics
Utility Transforms
DoFn Lifecycle
Review the main Apache Beam concepts (Pipeline, PCollections, PTransforms, Runner, reading/writing, utility PTransforms, side inputs), bundles, and the DoFn lifecycle.

Module 7: Windows, Watermarks, Triggers

Windows
Watermarks
Triggers
Implement logic to handle your late data.
Review different types of triggers.
Review core streaming concepts (unbounded PCollections, windows).
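
As a rough illustration of the ideas this module names, the sketch below models fixed windows and late data in plain Python. It is a simplified mental model, not the Apache Beam API: the names `Window`, `assign_fixed_window`, and `classify` are hypothetical, timestamps are plain integers in seconds, and the watermark is passed in as a value rather than estimated by a runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int  # inclusive, in seconds
    end: int    # exclusive, in seconds


def assign_fixed_window(event_time: int, size: int) -> Window:
    """Map an event timestamp to the fixed window that contains it."""
    start = event_time - (event_time % size)
    return Window(start, start + size)


def classify(event_time: int, watermark: int, size: int, allowed_lateness: int) -> str:
    """Classify an element relative to the current watermark (simplified).

    "on_time": the watermark has not yet reached the end of the window.
    "late":    the window has closed, but the element is within allowed lateness.
    "dropped": the element arrived after window end + allowed lateness.
    """
    window = assign_fixed_window(event_time, size)
    if watermark < window.end:
        return "on_time"
    if watermark < window.end + allowed_lateness:
        return "late"
    return "dropped"
```

For example, with 60-second windows and 30 seconds of allowed lateness, an event stamped at second 125 belongs to the window [120, 180). It counts as on time while the watermark is below 180, late while the watermark is in [180, 210), and is dropped afterwards.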

Module 8: Sources and Sinks

Sources and Sinks
Text IO and File IO
BigQuery IO
PubSub IO
Kafka IO
Bigtable IO
Avro IO
Splittable DoFn
Write the I/O of your choice for your Dataflow pipeline.
Tune your source/sink transformation for maximum performance.
Create custom sources and sinks using SDF.

Module 9: Schemas

Beam Schemas
Code Examples
Introduce schemas, which give developers a way to express structured data in their Beam pipelines.
Use schemas to simplify your Beam code and improve the performance of your pipeline.

Module 10: State and Timers

State API
Timer API
Summary
Identify use cases for state and timer API implementations.
Select the right type of state and timers for your pipeline.

Module 11: Best Practices

Schemas
Handling Unprocessable Data
Error Handling
AutoValue Code Generator
JSON Data Handling
Utilize DoFn Lifecycle
Pipeline Optimizations
Implement best practices for Dataflow pipelines.
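
One common way to handle unprocessable data is a dead-letter pattern: records that fail parsing are set aside with their error instead of failing the whole job. The sketch below shows the idea in plain Python, assuming JSON input where each record should be an object. The function name `parse_with_dead_letter` and the dead-letter record layout are hypothetical. In a Beam pipeline the same idea is typically expressed with an additional tagged output rather than a second list.

```python
import json
from typing import Iterable


def parse_with_dead_letter(raw_records: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Split raw JSON strings into parsed records and dead-letter entries."""
    parsed: list[dict] = []
    dead_letters: list[dict] = []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            parsed.append(record)
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the original payload and the reason so it can be inspected later.
            dead_letters.append({"raw": raw, "error": str(err)})
    return parsed, dead_letters
```

Calling `parse_with_dead_letter(['{"id": 1}', 'not json', '[1, 2]'])` returns one parsed record and two dead-letter entries, each carrying the raw input and the error message.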

Module 12: Dataflow SQL and DataFrames

Dataflow and Beam SQL
Windowing in SQL
Beam DataFrames
Develop a Beam pipeline using SQL and DataFrames.

Module 13: Beam Notebooks

Beam Notebooks
Prototype your pipeline in Python using Beam notebooks.
Launch a job to Dataflow from a notebook.

Module 14: Monitoring

Job List
Job Info
Job Graph
Job Metrics
Metrics Explorer
Navigate the Dataflow Job Details UI.
Interpret Job Metrics charts to diagnose pipeline regressions.
Set alerts on Dataflow jobs using Cloud Monitoring.

Module 15: Logging and Error Reporting

Logging
Error Reporting
Use the Dataflow logs and diagnostics widgets to troubleshoot pipeline issues.

Module 16: Troubleshooting and Debug

Troubleshooting Workflow
Types of Troubles
Use a structured approach to debug your Dataflow pipelines.
Examine common causes for pipeline failures.

Module 17: Performance

Pipeline Design
Data Shape
Source, Sinks, and External Systems
Shuffle and Streaming Engine
Understand performance considerations for pipelines.
Consider how the shape of your data can affect pipeline performance.

Module 18: Testing and CI/CD

Testing and CI/CD Overview
Unit Testing
Integration Testing
Artifact Building
Deployment
Review testing approaches for your Dataflow pipeline.
Review frameworks and features available to streamline your CI/CD workflow for Dataflow pipelines.

Module 19: Reliability

Introduction to Reliability
Monitoring
Geolocation
Disaster Recovery
High Availability
Implement reliability best practices for your Dataflow pipelines.

Module 20: Flex Templates

Classic Templates
Flex Templates
Using Flex Templates
Google-provided Templates
Use Flex Templates to standardize and reuse Dataflow pipeline code.

Dates & Locations

Let’s make it work for you

Can’t find a date that fits? Need to train your whole team? Looking for a discount?
Speak to one of our learning experts today.

Exam & Certification

Google Cloud Professional Data Engineer

Google Professional Data Engineers enable data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability.

A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

Training & Certification Guide

Frequently Asked Questions

What is the learning path for Professional Data Engineer certification?

GCPBD: Google Cloud Platform Big Data and Machine Learning Fundamentals

The GCPBD: Google Cloud Platform Big Data and Machine Learning Fundamentals course introduces participants to the big data capabilities of Google Cloud. Through a combination of presentations, demos, and hands-on labs, participants get an overview of Google Cloud and a detailed view of its data processing and machine learning capabilities.

GCPDE: Data Engineering on Google Cloud Platform

Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning.

GCPPPDE-E: Preparing for the Google Cloud Professional Data Engineer Examination

Start preparing for your Google Professional Data Engineer certification exam and further your ability to collect, transform, and publish data to help organizations make data-driven decisions as a Data Engineer with this course.

Speak to a Training Consultant

All courses are HRD Claimable.
Get in touch with our team via the form or WhatsApp us on +6011-5119 6631

Trainocate: A Global Leader in Technology, Business, and People Development