Professional Data Engineer
The GCP Professional Data Engineer certification exam validates expertise in designing, building, and operating production‑grade data pipelines on Google Cloud, covering ingestion, storage, analysis, and automation.
Who Should Take This
It is intended for data engineers, analytics architects, and platform developers with at least three years of hands‑on data engineering experience, including one year designing and managing solutions on Google Cloud. These professionals seek to demonstrate mastery of end‑to‑end data workflows and to advance their careers by earning a recognized industry credential.
What's Covered
1. Designing data pipelines for batch and streaming workloads; selecting appropriate storage and processing technologies; planning for scalability, fault tolerance, and cost optimization.
2. Implementing data ingestion using Pub/Sub, Dataflow, and Data Fusion; transforming data with Apache Beam; processing streaming and batch workloads at scale.
3. Selecting and configuring storage solutions across BigQuery, Cloud Storage, Cloud SQL, Spanner, Bigtable, and Firestore based on access patterns and requirements.
4. Preparing data for analytics and ML using BigQuery, Dataprep, and Vertex AI; implementing data quality checks and cataloging with Dataplex and Data Catalog.
5. Orchestrating data pipelines with Cloud Composer; implementing monitoring, alerting, and disaster recovery; automating data governance and lifecycle management.
Exam Structure
Question Types
- Multiple Choice
- Multiple Select
Scoring Method
Pass/fail. Google does not publish a scaled score or passing percentage.
Delivery Method
Kryterion testing center or online proctored
Prerequisites
None required. Associate Cloud Engineer recommended.
Recertification
3 years
What's Included in AccelaStudy® AI
Course Outline
72 learning goals
Domain 1: Ingesting and Processing Data (3 topics)
Plan data pipelines
- Implement batch data ingestion pipelines using Cloud Storage as a staging layer with event-driven triggers, Storage Transfer Service for on-premises and multi-cloud sources, and Cloud Data Fusion for visual ETL pipeline construction.
- Implement streaming data ingestion pipelines using Pub/Sub with push and pull subscriptions, dead-letter topics, ordering keys, and exactly-once delivery for real-time event-driven architectures.
- Analyze change data capture requirements and determine when to use Datastream for real-time replication from Cloud SQL, AlloyDB, and Oracle sources versus periodic batch exports based on source impact, latency SLAs, and data consistency needs.
- Analyze ingestion strategy tradeoffs across batch, streaming, and CDC approaches to determine optimal pipeline design based on latency requirements, ordering guarantees, source system constraints, and cost profiles.
- Design end-to-end data ingestion architectures that select among Pub/Sub, Datastream, Data Fusion, and Cloud Storage transfer mechanisms based on source diversity, freshness SLAs, and organizational data strategy.
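The ingestion-strategy tradeoffs above (batch vs. streaming vs. CDC) can be sketched as a small decision helper. This is an illustrative model only, not Google guidance; the `SourceProfile` fields and thresholds are assumptions chosen to mirror the bullets (latency SLA, relational source suitable for Datastream-style CDC, event-push capability for Pub/Sub):

```python
from dataclasses import dataclass

@dataclass
class SourceProfile:
    latency_sla_seconds: int      # how fresh downstream data must be
    is_relational_source: bool    # e.g. Cloud SQL, AlloyDB, Oracle
    supports_event_push: bool     # source emits events as they occur

def choose_ingestion_approach(p: SourceProfile) -> str:
    """Map requirements to an ingestion style, per the tradeoffs above."""
    if p.latency_sla_seconds <= 60 and p.is_relational_source:
        return "cdc"        # e.g. Datastream-style replication
    if p.latency_sla_seconds <= 60 and p.supports_event_push:
        return "streaming"  # e.g. Pub/Sub + Dataflow
    return "batch"          # e.g. Cloud Storage staging + scheduled loads
```

In practice the decision also weighs source impact, ordering guarantees, and cost, as the bullets note; a real design matrix would have more dimensions than this sketch.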
Build data pipelines
- Implement Apache Beam pipelines on Dataflow using PCollections, ParDo transforms, composite transforms, side inputs, and pipeline options for portable batch and streaming execution.
- Implement windowing and watermark strategies in Dataflow using fixed, sliding, session, and global windows with triggers, accumulation modes, allowed lateness, and watermark management for late-arriving and out-of-order data.
- Implement Dataproc clusters with autoscaling policies, initialization actions, optional components, and job submission for Apache Spark, Hadoop, Hive, and Presto workloads on managed infrastructure.
- Analyze pipeline execution patterns across Dataflow and Dataproc to determine optimal processing framework selection based on data shape, transformation complexity, team expertise, and operational overhead.
- Analyze Dataflow streaming pipeline performance using watermark progression, system lag, data freshness metrics, and element counts to identify bottlenecks, tune parallelism, and optimize throughput.
- Design pipeline framework governance standards that define when to use Dataflow versus Dataproc versus serverless Spark, establish coding conventions, and standardize testing practices across data engineering teams.
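The windowing and lateness concepts above can be illustrated with a framework-free sketch. This is a simplified model of fixed windows with Beam-style allowed lateness, not the Beam API itself; a real pipeline would use `beam.WindowInto(FixedWindows(...))` and watermark-driven triggers:

```python
def assign_fixed_windows(events, window_size, watermark, allowed_lateness):
    """Assign (event_time, value) pairs to fixed windows.

    An element is dropped when its window's end plus the allowed
    lateness has already passed the watermark -- the same condition
    under which Beam discards late data.
    """
    windows = {}
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness < watermark:
            continue  # too late: a real pipeline might dead-letter this
        windows.setdefault(window_start, []).append(value)
    return windows
```

With 60-second windows, a watermark at 200, and 30 seconds of allowed lateness, only elements whose window ends at 170 or later survive; everything earlier is treated as droppably late.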
Deploy and operationalize data pipelines
- Implement pipeline orchestration using Cloud Composer with Airflow DAGs, task dependencies, sensors, branching operators, XCom communication, and connection management for complex workflow coordination.
- Implement Dataflow pipeline deployment and monitoring using Flex Templates, streaming update strategies with drain and cancel modes, Cloud Monitoring metrics, and Cloud Logging structured logs for production pipeline management.
- Analyze deployment and scaling tradeoffs across Flex Templates, classic templates, autoscaling configurations, and machine type selection to determine optimal resource allocation and deployment strategy for each workload profile.
- Analyze pipeline error handling strategies including dead-letter queues, retry policies, poison message isolation, and partial failure recovery to determine the appropriate resilience pattern for each pipeline stage.
- Design production pipeline operational frameworks that integrate monitoring, alerting, CI/CD, scaling, and incident response into a cohesive data operations practice aligned with organizational SLOs.
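The error-handling patterns above (retry policies, poison-message isolation, dead-letter queues) can be sketched in a few lines. This is an illustrative in-memory version, not the Pub/Sub dead-letter-topic mechanism itself:

```python
def process_with_dead_letter(messages, handler, max_attempts=3):
    """Retry each message up to max_attempts, then route persistent
    failures to a dead-letter list so one poison message cannot
    block the rest of the pipeline."""
    delivered, dead_letter = [], []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                delivered.append(handler(msg))
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.append(msg)  # isolate for later inspection
    return delivered, dead_letter
```

Pub/Sub implements the same idea declaratively: a subscription's `max_delivery_attempts` plus a dead-letter topic replaces the retry loop here.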
Domain 2: Storing and Accessing Data (3 topics)
Select storage systems
- Implement Cloud Storage bucket configurations with storage classes, lifecycle rules, retention policies, object versioning, and access controls for scalable unstructured and semi-structured data storage.
- Implement relational database configurations using Cloud SQL with high availability, read replicas, and automated backups, and Cloud Spanner with multi-regional topology and interleaved tables for transactional workloads.
- Implement Bigtable instances with cluster topology, column family design, row key optimization, garbage collection policies, and replication configurations for high-throughput time-series and analytical workloads.
- Analyze Firestore and Memorystore usage patterns to determine when document-oriented data models, composite indexes, real-time listeners, and caching layers are appropriate versus relational or wide-column alternatives.
- Analyze storage system characteristics across Cloud Storage, Cloud SQL, Spanner, Bigtable, BigQuery, Firestore, and Memorystore to determine the optimal technology for each workload based on access patterns, consistency, scalability, and cost.
- Design multi-tier storage architectures that combine operational databases, analytical warehouses, object storage, and caching layers to optimize for performance, cost, and data access patterns across the data lifecycle.
Model data for storage and access
- Implement BigQuery table schemas with nested and repeated fields using STRUCT and ARRAY types, denormalization patterns, and schema evolution strategies for analytical workloads.
- Implement BigQuery partitioning using ingestion-time, column-based, and integer-range partitioning with clustering on high-cardinality columns to optimize query performance and reduce scan costs.
- Implement NoSQL and distributed relational schema designs, including Bigtable row key patterns with salting and field promotion, and Spanner interleaved tables with primary key hierarchies for high-throughput workloads.
- Analyze data modeling tradeoffs between normalized and denormalized schemas, nested versus flat structures, and wide versus tall table designs to determine optimal modeling strategy for query patterns and storage efficiency.
- Design enterprise data modeling standards that govern schema conventions, partitioning policies, naming practices, and evolution procedures across multiple teams and data products within an organization.
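The Bigtable row-key techniques above (salting plus field promotion) can be sketched as follows. The key layout and salt count here are illustrative assumptions, not a prescribed schema:

```python
import hashlib

def salted_row_key(device_id: str, timestamp: int, num_salts: int = 8) -> str:
    """Build a salted, field-promoted row key.

    A deterministic salt prefix spreads devices across num_salts key
    ranges so monotonically increasing timestamps do not all land on
    one hot tablet; promoting device_id ahead of the timestamp keeps
    each device's rows contiguous within its salt bucket for scans.
    """
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % num_salts
    return f"{salt:02d}#{device_id}#{timestamp:012d}"
```

Reading all rows for one device then means a single prefix scan on `{salt}#{device_id}#`; reading a time range across all devices requires `num_salts` parallel scans, which is the usual tradeoff of salting.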
Manage data lifecycle
- Implement Cloud Storage lifecycle management rules with automatic storage class transitions from Standard to Nearline, Coldline, and Archive based on object age, access frequency, and cost optimization targets.
- Analyze BigQuery storage cost management options including dataset and table expiration policies, partition expiration, time travel window sizing, and long-term storage pricing to determine optimal retention configurations.
- Analyze data lifecycle patterns across hot, warm, and cold storage tiers and evaluate retention locks, dataset-level retention, and organization policy constraints to determine optimal archival and deletion strategies for compliance and cost.
- Design organization-wide data lifecycle governance frameworks that standardize retention policies, archival procedures, and deletion workflows across all storage systems in alignment with data classification tiers.
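The storage-class tiering described above can be sketched as an age-based policy. The thresholds below are illustrative, chosen to align with the minimum storage durations of each class; real rules are declared in bucket lifecycle configuration, not application code:

```python
def storage_class_for_age(age_days: int) -> str:
    """Pick a Cloud Storage class by object age (illustrative tiers).

    Thresholds mirror the classes' minimum storage durations:
    Nearline 30 days, Coldline 90 days, Archive 365 days.
    """
    if age_days >= 365:
        return "ARCHIVE"
    if age_days >= 90:
        return "COLDLINE"
    if age_days >= 30:
        return "NEARLINE"
    return "STANDARD"
```

A production policy would also weigh access frequency and retrieval costs, since colder classes charge more per read, as the cost-optimization bullet above implies.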
Domain 3: Analyzing and Presenting Data (3 topics)
Analyze data with BigQuery
- Implement advanced BigQuery SQL queries using window functions, Common Table Expressions, PIVOT/UNPIVOT operations, approximate aggregation functions, and scripting for complex analytical transformations.
- Implement BigQuery user-defined functions using SQL and JavaScript UDFs, table-valued functions, and remote functions connected to Cloud Functions for extending query capabilities with custom logic.
- Implement BigQuery ML models using CREATE MODEL statements with linear regression, logistic regression, k-means clustering, time-series forecasting, and imported TensorFlow models for in-warehouse predictive analytics.
- Implement BigQuery external data access and Vertex AI integration using federated queries, BigLake tables, external connections, and ML.PREDICT for cross-system analytics and in-warehouse model serving.
- Analyze BigQuery query execution plans using INFORMATION_SCHEMA views, slot utilization metrics, and stage-level statistics to identify performance bottlenecks and optimize query structures.
- Analyze predictive analytics approach tradeoffs between BigQuery ML, Vertex AI, and federated query patterns to determine the optimal strategy based on model complexity, data residency, and operational deployment needs.
- Design enterprise analytical platform strategies that integrate BigQuery, BigQuery ML, Vertex AI, and federated query capabilities into a unified analytics architecture supporting self-service and governed consumption patterns.
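As a concrete reference for the window-function material above, the following sketch emulates `SUM(amount) OVER (PARTITION BY key ORDER BY ts)` in plain Python. It is a teaching aid for the semantics, not how BigQuery executes the query:

```python
def running_totals(rows):
    """Emulate SUM(amount) OVER (PARTITION BY key ORDER BY ts).

    rows: iterable of (key, ts, amount) tuples.
    Returns the rows ordered by (key, ts) with amount replaced by the
    running sum within each partition.
    """
    rows = sorted(rows, key=lambda r: (r[0], r[1]))  # partition, then order
    totals, out = {}, []
    for key, ts, amount in rows:
        totals[key] = totals.get(key, 0) + amount    # per-partition state
        out.append((key, ts, totals[key]))
    return out
```

The equivalent BigQuery expression carries the same meaning: the frame is reset for each partition and accumulates in order, which is why window functions preserve row counts while `GROUP BY` collapses them.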
Visualize and catalog data
- Implement Looker Studio dashboards with BigQuery data sources, calculated fields, blended data, parameterized reports, and embedded analytics for self-service business intelligence delivery.
- Design semantic layer governance using Looker with LookML projects, models, explores, views, and derived tables to establish governed data access layers that enforce consistent business logic and controlled analytical consumption patterns.
- Implement Data Catalog metadata management with automated discovery, tag templates, custom entries, and policy tags for creating a searchable inventory of organizational data assets across GCP services.
- Analyze Dataplex lake architecture patterns including zone organization, asset assignment, discovery job configuration, and quality rule placement to determine optimal structures for unified data management across Cloud Storage and BigQuery.
- Analyze visualization and cataloging platform tradeoffs across Looker, Looker Studio, Data Catalog, and Dataplex to determine the optimal combination of tools for governance, discoverability, and self-service analytics requirements.
- Design enterprise data cataloging strategies that integrate Data Catalog, Dataplex, and data lineage tracking to create a comprehensive data governance fabric enabling discoverability and accountability.
Automate analytical workloads
- Implement BigQuery scheduled queries with DDL and DML statements, parameterized execution, destination table configuration, and failure notification for automated data transformation pipelines.
- Implement analytical workflow orchestration using Cloud Composer DAGs that coordinate BigQuery jobs, Dataflow pipelines, and data quality checks with dependency management and SLA monitoring.
- Implement lightweight workflow automation using Cloud Workflows with YAML-based step definitions, parallel execution, error handling, and connectors to BigQuery, Dataflow, and Cloud Functions for serverless orchestration.
- Analyze orchestration tool selection between Cloud Composer, Cloud Workflows, and BigQuery scheduled queries based on workflow complexity, team Airflow expertise, cost, and operational management overhead.
- Design organization-wide analytical automation standards that govern scheduling conventions, dependency management, failure escalation, and SLA enforcement across multiple analytical teams and data products.
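The dependency management that Composer and Workflows provide reduces, at its core, to topological ordering of a task graph. The DAG and task names below are hypothetical examples; the sketch uses the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

def run_order(deps):
    """Return a valid execution order for a task DAG.

    deps maps each task to the set of upstream tasks it depends on,
    the same shape as Airflow's set_upstream relationships.
    """
    return list(TopologicalSorter(deps).static_order())

# Hypothetical analytical DAG: extract -> load -> {transform, quality_check} -> publish
example_dag = {
    "load": {"extract"},
    "transform": {"load"},
    "quality_check": {"load"},
    "publish": {"transform", "quality_check"},
}
```

Airflow adds scheduling, retries, sensors, and SLA monitoring on top, but every run it performs respects exactly this ordering constraint; `TopologicalSorter` also raises `CycleError` on circular dependencies, the same condition under which a DAG fails to parse.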
Domain 4: Maintaining and Automating Data Workloads (4 topics)
Optimize resources
- Implement BigQuery slot reservations with commitment plans, reservation assignments, and idle slot sharing across projects to manage analytical compute capacity and control costs for enterprise workloads.
- Analyze BigQuery query optimization effectiveness across partition pruning, clustering alignment, materialized views, BI Engine acceleration, and query result caching to determine the highest-impact techniques for each workload profile.
- Analyze BigQuery cost management and pricing tradeoffs comparing on-demand versus capacity pricing, custom quotas, per-user byte limits, and slot utilization to recommend optimal billing and governance configurations.
- Analyze Dataproc cluster optimization strategies comparing ephemeral clusters, preemptible workers, Dataproc Serverless, and persistent clusters to determine the cost-optimal compute model for each workload category.
- Design organization-wide data platform cost optimization strategies that integrate slot management, storage tiering, query governance, and chargeback models aligned with business unit consumption patterns.
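The on-demand versus capacity comparison above is ultimately a breakeven calculation. The default prices below are placeholder assumptions for illustration only; always check the current BigQuery pricing pages before making this decision:

```python
def cheaper_billing_model(tib_scanned_per_month: float,
                          on_demand_per_tib: float = 6.25,
                          slots_monthly_cost: float = 2000.0) -> str:
    """Compare monthly on-demand scan cost against a flat slot commitment.

    Both price defaults are hypothetical placeholders, not quoted rates.
    """
    on_demand_cost = tib_scanned_per_month * on_demand_per_tib
    return "on-demand" if on_demand_cost < slots_monthly_cost else "capacity"
```

At these assumed rates the breakeven sits at 320 TiB scanned per month; real analyses also factor in autoscaling reservations, idle slot sharing, and the query-governance controls named above.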
Design for reliability
- Implement disaster recovery strategies for BigQuery using cross-region dataset copies, Cloud Storage backups, and BigQuery Data Transfer Service for maintaining analytical continuity during regional outages.
- Implement idempotent pipeline designs using deterministic processing, deduplication strategies, checkpoint-based recovery, and transactional writes to ensure exactly-once semantics in batch and streaming workloads.
- Analyze data validation framework options across Dataflow assertions, BigQuery data quality tasks in Dataplex, and custom validation logic to determine effective detection strategies for schema drift, null anomalies, and distribution shifts.
- Analyze failure modes, reprocessing strategies, and data validation coverage across pipeline components to determine RPO, RTO, backfill procedures, and quality check placement that balance thoroughness against latency and cost.
- Design enterprise data reliability frameworks that define SLOs for data freshness, completeness, and accuracy with automated monitoring, escalation procedures, and continuous improvement processes.
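The idempotency bullet above hinges on deduplication by a stable event id, so that replaying a batch after a failure cannot double-write. A minimal sketch, assuming events carry unique `id` and `payload` fields:

```python
def apply_once(events, seen_ids, sink):
    """Replay-safe apply: skip events whose id has been processed.

    seen_ids stands in for durable dedup state (e.g. a checkpoint
    store or a MERGE key in the warehouse); sink is the output.
    """
    for event in events:
        if event["id"] in seen_ids:
            continue  # already applied in a previous (failed) attempt
        sink.append(event["payload"])
        seen_ids.add(event["id"])
    return sink
```

Running the same batch twice produces identical output, which is the property that makes checkpoint-based recovery and backfills safe; in BigQuery the same effect is often achieved with a `MERGE` on the event id.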
Ensure data quality and governance
- Design metadata classification frameworks using Data Catalog tag templates and policy tags to establish systematic sensitivity labeling, domain tagging, and quality tier assignment that enable metadata-driven governance across the organization.
- Analyze data lineage tracking capabilities using Data Catalog lineage API and Dataplex lineage features to determine effective tracing strategies from source through transformation stages to consumption endpoints.
- Implement Cloud DLP inspection and de-identification pipelines with info type detectors, inspection templates, de-identification templates, and job triggers for automated sensitive data discovery and protection.
- Analyze BigQuery fine-grained access control approaches using column-level policy tags, row access policies, and authorized views to determine the optimal security model for multi-tenant analytical environments with diverse consumer roles.
- Analyze data governance maturity across cataloging, lineage, quality, and access control dimensions to identify gaps and prioritize governance improvements for regulatory compliance and operational excellence.
- Design comprehensive data governance programs that integrate Data Catalog, Dataplex, DLP, lineage tracking, and organizational policies into a unified framework supporting data mesh and data product ownership models.
Secure data
- Implement IAM policies for data access control using predefined roles, custom roles, service accounts, and organization policy constraints to enforce least-privilege access across BigQuery, Cloud Storage, and database services.
- Implement encryption configurations using Google-managed keys, customer-managed encryption keys with Cloud KMS, customer-supplied encryption keys, key rotation policies, and TLS for data at rest and in transit.
- Design VPC Service Controls perimeter architectures with service perimeters, access levels, ingress and egress rules, and bridge perimeters to prevent data exfiltration while enabling legitimate cross-project data access patterns.
- Implement data masking and tokenization using Cloud DLP de-identification transforms including redaction, character masking, format-preserving encryption, and cryptographic hashing for protecting sensitive fields in analytical datasets.
- Analyze encryption and key management tradeoffs between Google-managed keys, CMEK, and CSEK based on compliance requirements, key lifecycle overhead, performance impact, and organizational security policies.
- Analyze VPC Service Controls perimeter design and audit logging strategies to determine optimal boundary definitions and monitoring configurations that balance exfiltration prevention with cross-project data sharing needs.
- Design defense-in-depth security architectures for data platforms that layer IAM, VPC Service Controls, encryption, DLP, audit logging, and network isolation into a comprehensive data protection strategy aligned with regulatory frameworks.
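The character-masking transform mentioned above can be illustrated without calling the DLP API. This sketch mimics the behavior of a DLP character-mask configuration (mask everything except a trailing suffix); the parameter names are this example's own, not the API's:

```python
def mask_characters(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Hide all but the trailing keep_last characters of a value,
    in the style of DLP character masking for sensitive fields."""
    if len(value) <= keep_last:
        return value  # nothing to mask beyond the kept suffix
    return mask_char * (len(value) - keep_last) + value[-keep_last:]
```

Unlike format-preserving encryption or cryptographic hashing, masking is irreversible and destroys joinability, which is why the bullets above treat these transforms as distinct tools rather than interchangeable ones.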
Hands-On Labs
Practice in a simulated cloud console or Python code sandbox — no account needed. Each lab runs entirely in your browser.
Certification Benefits
Industry Recognition
Google Cloud certifications are highly valued in data-driven and AI-focused organizations. BigQuery is one of the most widely adopted serverless data warehouses, and this certification validates expertise in Google Cloud's data analytics and pipeline orchestration ecosystem.
Scope
Included Topics
- All domains and task statements in the Google Cloud Professional Data Engineer certification exam guide: Domain 1 Ingesting and Processing Data (~25%), Domain 2 Storing and Accessing Data (~20%), Domain 3 Analyzing and Presenting Data (~25%), and Domain 4 Maintaining and Automating Data Workloads (~30%).
- Professional-level data engineering architecture and operations decisions for pipeline design, storage selection, analytical workload orchestration, reliability engineering, governance, and security on Google Cloud Platform.
- Complex scenario-based tradeoff analysis involving pipeline scalability, storage cost optimization, data quality enforcement, regulatory compliance, and cross-service integration strategies.
- Key GCP services for data engineers: Pub/Sub, Dataflow, Cloud Data Fusion, Dataproc, Cloud Composer, Apache Beam SDK, Cloud Storage, Cloud SQL, Cloud Spanner, Bigtable, BigQuery, Firestore, Memorystore, BigQuery ML, Vertex AI, Looker, Looker Studio, Data Catalog, Dataplex, Cloud DLP, IAM, VPC Service Controls, Cloud KMS, CMEK, Cloud Workflows, Datastream, Cloud Logging, Cloud Monitoring.
Not Covered
- Deep machine learning model research and neural architecture design not connected to BigQuery ML or Vertex AI integration patterns tested in the exam.
- Provider-agnostic open-source tooling detail that does not map to GCP managed services and integration patterns used in the exam objectives.
- Application development topics unrelated to data pipeline construction, data storage design, or analytical workload management.
- Exact short-lived pricing terms and transient promotional details not suitable for durable technical domain specifications.
- GCP networking and compute infrastructure topics beyond what is directly relevant to data engineering workloads and VPC Service Controls.
Official Exam Page
Learn more at Google Cloud
Ready to master PDE?
Adaptive learning that maps your knowledge and closes your gaps.