This course is in active development. Preview the scope below and create a free account to be notified the moment it goes live.
CBDP
The ICCP Certified Big Data Professional (CBDP) exam validates expertise in designing, implementing, and optimizing enterprise‑scale distributed data architectures, storage, processing pipelines, and scalable analytics/ML solutions.
Who Should Take This
The certification targets data engineers, solutions architects, and analytics leads with two to ten years of hands‑on experience designing and managing large‑scale data platforms. These professionals seek formal recognition of their ability to integrate storage, processing, and machine‑learning components to drive business‑critical insights.
What's Covered
Domain 1: Big Data Architecture and Distributed Systems
Domain 2: Distributed Storage Technologies
Domain 3: Big Data Processing Frameworks
Domain 4: Data Engineering and Pipeline Management
Domain 5: Big Data Analytics and Machine Learning at Scale
Domain 6: Big Data Security and Governance
What's Included in AccelaStudy® AI
Course Outline
53 learning goals
Domain 1: Big Data Architecture and Distributed Systems
3 topics
Distributed systems fundamentals
- Apply distributed computing principles including CAP theorem, consistency models, partitioning strategies, replication, and consensus protocols to big data architecture decisions.
- Implement fault tolerance mechanisms including data replication, checkpointing, heartbeat monitoring, and automatic failover for distributed data processing systems.
- Analyze distributed system performance by evaluating throughput, latency, partition tolerance, and resource utilization to identify bottlenecks and scaling requirements.
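To ground the consistency material, here is a minimal Python sketch of the quorum arithmetic (R + W > N) behind tunable consistency in replicated stores; the function and parameter names are illustrative, not tied to any particular database.

```python
# Illustrative sketch: quorum arithmetic behind tunable consistency in
# replicated stores (e.g. Cassandra-style N/R/W settings). Names are
# hypothetical; real systems add hinted handoff, repair, and more.

def is_strongly_consistent(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """A read is guaranteed to see the latest write when R + W > N,
    because every read quorum then overlaps every write quorum."""
    return read_quorum + write_quorum > n_replicas

# N=3 with QUORUM reads and writes (R=W=2): quorums overlap.
print(is_strongly_consistent(3, 2, 2))  # True
# N=3 with ONE reads and ONE writes (R=W=1): a read can miss the latest write.
print(is_strongly_consistent(3, 1, 1))  # False
```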
Big data reference architectures
- Implement lambda architecture combining batch and speed layers with a serving layer to provide both historical and real-time data access for analytical workloads.
- Implement kappa architecture using unified stream processing to simplify big data systems, eliminating the batch layer and processing all data as streams.
- Analyze big data architecture trade-offs between lambda and kappa approaches considering operational complexity, data consistency, latency requirements, and cost efficiency.
- Design a big data platform architecture that selects appropriate processing frameworks, storage tiers, and serving technologies based on workload requirements and organizational constraints.
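As a taste of the kappa pattern, the following PySpark Structured Streaming sketch treats all data as a single replayable stream; it uses the built-in rate source so it runs without external infrastructure, where a production system would read from Kafka.

```python
# Minimal kappa-style sketch: one streaming job maintains the current view,
# and the same job rebuilds history by replaying the log from the start,
# instead of maintaining separate batch and speed layers.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

# Built-in test source emitting (timestamp, value) rows.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# One declarative, stateful aggregation covers both real-time and
# historical data when the input log is replayable.
counts = events.withColumn("key", F.col("value") % 10).groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination(30)  # run briefly for the sketch
query.stop()
```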
Cloud-native big data services
- Implement managed big data services including cloud-native Spark, serverless SQL analytics, and managed streaming services for cost-efficient big data processing.
- Analyze cloud big data service trade-offs between managed services and self-managed clusters considering cost, control, performance, and operational complexity.
- Design a multi-cloud big data strategy that avoids vendor lock-in through portable data formats, abstraction layers, and cross-cloud data replication.
Domain 2: Distributed Storage Technologies
3 topics
Distributed file systems and object storage
- Implement HDFS cluster configuration including NameNode high availability, DataNode management, block replication, rack awareness, and storage policies for mixed workloads.
- Apply cloud object storage services for data lake implementations including bucket policies, lifecycle rules, storage tiering, and cross-region replication strategies.
- Analyze storage format trade-offs between Parquet, ORC, Avro, and Delta Lake for different access patterns including columnar analytics, row-level updates, and schema evolution.
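A brief PySpark sketch of why columnar formats matter: partition pruning and column pruning mean a query touches only the bytes it needs. Paths and column names are illustrative.

```python
# Sketch: Parquet + directory partitioning. The engine skips whole
# partitions via the filter and reads only the selected column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "click", 0.5), (2, "view", 1.2)],
    ["user_id", "event_type", "duration_s"],
)

# Partitioning by a low-cardinality column enables partition pruning.
df.write.mode("overwrite").partitionBy("event_type").parquet("/tmp/events")

# The event_type filter prunes directories before any file I/O, and
# column pruning reads only user_id from the surviving files.
clicks = (spark.read.parquet("/tmp/events")
          .where("event_type = 'click'")
          .select("user_id"))
clicks.show()
```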
NoSQL and distributed databases
- Implement NoSQL database solutions including HBase, Cassandra, MongoDB, and DynamoDB with appropriate data modeling, partitioning, and consistency configurations.
- Apply data partitioning and sharding strategies for distributed databases including hash-based, range-based, and composite partitioning with hotspot mitigation.
- Analyze NoSQL database performance by evaluating read/write throughput, latency percentiles, compaction overhead, and consistency trade-offs under different workload patterns.
- Design a polyglot data storage strategy that selects appropriate database technologies for transactional, analytical, search, and graph workloads within a big data ecosystem.
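The routing logic behind sharding fits in a few lines of plain Python; this illustrative sketch contrasts hash and range partitioning (real stores use their own hash functions and metadata services).

```python
# Sketch of hash- vs range-based partition routing, the core idea behind
# sharding in HBase, Cassandra, and DynamoDB. Pure-Python illustration.

import hashlib

NUM_PARTITIONS = 8

def hash_partition(key: str) -> int:
    """Hash partitioning spreads keys evenly but destroys key order."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def range_partition(key: str, boundaries: list[str]) -> int:
    """Range partitioning keeps keys ordered (good for scans) but can
    hotspot when writes concentrate in one key range."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)

print(hash_partition("user#1042"))                             # even spread
print(range_partition("user#1042", ["user#2000", "user#5000"]))  # ordered
```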
Search and graph data stores
- Implement search engine deployments including Elasticsearch or Solr for full-text search, log analytics, and faceted navigation over large document collections.
- Apply graph database technologies including Neo4j or JanusGraph for relationship-intensive data including social networks, knowledge graphs, and fraud detection networks.
- Analyze specialized data store performance by benchmarking query latency, indexing throughput, and storage efficiency against workload-specific access patterns.
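A minimal full-text search sketch, assuming the official Elasticsearch Python client (8.x) and a local cluster; the index and field names are invented.

```python
# Sketch: index a document and run a full-text match query against a
# local Elasticsearch node. Requires the `elasticsearch` package.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one log event (index and fields are illustrative).
es.index(index="app-logs",
         document={"level": "ERROR", "message": "disk full on node 7"})
es.indices.refresh(index="app-logs")  # make it visible to search

# Full-text relevance query over the message field.
resp = es.search(index="app-logs", query={"match": {"message": "disk"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```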
Domain 3: Big Data Processing Frameworks
3 topics
Batch processing with Hadoop and Spark
- Implement MapReduce and Spark batch processing jobs including data partitioning, shuffle optimization, memory management, and execution plan tuning for large-scale data transformations.
- Apply Spark SQL and DataFrame operations for structured data processing including query optimization, catalyst optimizer hints, and adaptive query execution strategies.
- Analyze batch processing job performance by interpreting execution plans, identifying data skew, evaluating shuffle efficiency, and optimizing resource allocation.
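Two everyday batch-tuning moves, sketched in PySpark: enabling adaptive query execution and forcing a broadcast join so the large side is never shuffled. Table contents are illustrative.

```python
# Sketch: inspect a Spark execution plan and force a broadcast join.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("batch-tuning-sketch").getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")  # adaptive query execution

facts = spark.range(1_000_000).withColumnRenamed("id", "user_id")  # large side
dims = spark.createDataFrame([(0, "free"), (1, "pro")], ["user_id", "tier"])

# Broadcasting the small dimension table ships it to every executor,
# avoiding a shuffle of the million-row fact table.
joined = facts.join(broadcast(dims), "user_id")
joined.explain()  # look for BroadcastHashJoin instead of SortMergeJoin
```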
Stream processing and real-time analytics
- Implement stream processing pipelines using Kafka Streams, Spark Structured Streaming, or Flink for real-time event processing with exactly-once semantics.
- Apply windowing strategies including tumbling, sliding, session, and global windows with watermark-based late data handling for stream aggregation computations.
- Analyze stream processing system reliability by evaluating backpressure handling, checkpoint recovery time, event time correctness, and end-to-end latency guarantees.
- Design a real-time data processing architecture that balances latency requirements, throughput demands, fault tolerance needs, and operational complexity constraints.
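A PySpark sketch of tumbling versus sliding windows with a watermark for late data; the built-in rate source again keeps it self-contained.

```python
# Sketch: windowed stream aggregation with watermark-based late-data
# handling in Structured Streaming.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowing-sketch").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tolerate events arriving up to 1 minute late; older state is dropped.
with_watermark = stream.withWatermark("timestamp", "1 minute")

# Tumbling: fixed, non-overlapping 30-second buckets.
tumbling = with_watermark.groupBy(F.window("timestamp", "30 seconds")).count()

# Sliding: 30-second windows re-evaluated every 10 seconds, so each
# event contributes to three overlapping windows.
sliding = with_watermark.groupBy(
    F.window("timestamp", "30 seconds", "10 seconds")).count()

q = sliding.writeStream.outputMode("update").format("console").start()
q.awaitTermination(20)
q.stop()
```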
Data processing optimization
- Implement Spark performance optimization including partition management, broadcast joins, cache strategies, serialization tuning, and cluster resource configuration.
- Apply cost optimization techniques for big data processing including spot instances, auto-scaling policies, storage tiering, and workload scheduling for off-peak execution.
- Analyze big data processing costs by attributing expenses to workloads, teams, and projects using resource tagging, chargeback models, and cost allocation frameworks.
- Design a big data cost optimization strategy that balances performance requirements with budget constraints through reserved capacity, spot markets, and workload prioritization.
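The chargeback idea reduces to simple arithmetic over tagged usage records, as in this toy Python sketch; the rates and records are invented for illustration.

```python
# Sketch: attribute compute cost to teams via resource tags.

from collections import defaultdict

HOURLY_RATE = {"on_demand": 0.40, "spot": 0.12}  # $/vCPU-hour, hypothetical

usage = [  # (team tag, vCPU-hours, purchase model)
    ("analytics", 1200, "spot"),
    ("analytics", 300, "on_demand"),
    ("ml-platform", 800, "on_demand"),
]

bill = defaultdict(float)
for team, vcpu_hours, model in usage:
    bill[team] += vcpu_hours * HOURLY_RATE[model]

for team, cost in sorted(bill.items()):
    print(f"{team}: ${cost:,.2f}")
# Spot capacity cuts the analytics bill sharply at the price of possible
# interruption, which is why it suits fault-tolerant batch workloads.
```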
Domain 4: Data Engineering and Pipeline Management
3 topics
Data pipeline design and orchestration
- Implement data pipeline orchestration using workflow engines with dependency management, retry logic, alerting, and SLA monitoring for complex multi-stage data workflows.
- Apply data quality validation within big data pipelines including schema validation, statistical anomaly detection, freshness checks, and completeness verification at scale.
- Analyze data pipeline reliability by evaluating failure modes, recovery mechanisms, idempotency guarantees, and data consistency across pipeline stages.
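A minimal orchestration sketch, assuming Apache Airflow 2.x: a two-task DAG with retry logic and an explicit dependency. The task bodies are placeholders.

```python
# Sketch: a daily DAG with retries and ordered stages.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():  # placeholder for a real ingestion step
    print("pulling raw events")

def validate():  # placeholder schema / freshness check
    print("validating schema and freshness")

with DAG(
    dag_id="events_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_validate = PythonOperator(task_id="validate", python_callable=validate)
    t_extract >> t_validate  # validate runs only after extract succeeds
```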
Data lake management and lakehouse patterns
- Implement data lake organization using medallion architecture with bronze, silver, and gold tiers for progressive data refinement and quality improvement.
- Apply lakehouse technologies including Delta Lake, Apache Iceberg, or Apache Hudi for ACID transactions, time travel, schema evolution, and efficient upserts on data lake storage.
- Design a data lake governance strategy that addresses data cataloging, access control, cost optimization, and data lifecycle management for petabyte-scale environments.
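A Delta Lake sketch of a transactional upsert plus time travel, assuming the delta-spark package is installed; the path and schema are illustrative.

```python
# Sketch: ACID MERGE and time travel directly on data-lake storage.

from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("lakehouse-sketch")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.createDataFrame([(1, "alice")], ["id", "name"]) \
     .write.format("delta").mode("overwrite").save("/tmp/users")

updates = spark.createDataFrame([(1, "alicia"), (2, "bob")], ["id", "name"])

# MERGE gives a transactional upsert: update matches, insert the rest.
(DeltaTable.forPath(spark, "/tmp/users").alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the table as it was before the merge.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users").show()
```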
Data engineering best practices
- Implement data engineering CI/CD practices including automated testing for data pipelines, schema validation, data contract enforcement, and deployment automation.
- Apply infrastructure as code for big data environments including cluster provisioning, configuration management, and environment reproducibility using declarative templates.
- Analyze data engineering team productivity by measuring pipeline deployment frequency, failure rates, recovery times, and data freshness SLA compliance.
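Pipeline logic becomes testable in CI once transformations are written as plain functions; here is a pytest sketch with an invented dedupe transform.

```python
# Sketch: unit-testing a pipeline transformation on a local SparkSession.

import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def dedupe_latest(df):
    """Keep one row per user_id, preferring the newest event."""
    w = Window.partitionBy("user_id").orderBy(F.col("ts").desc())
    return (df.withColumn("rn", F.row_number().over(w))
              .where("rn = 1").drop("rn"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").getOrCreate()

def test_dedupe_keeps_newest(spark):
    df = spark.createDataFrame(
        [(1, "2024-01-01"), (1, "2024-01-02"), (2, "2024-01-01")],
        ["user_id", "ts"])
    out = {r.user_id: r.ts for r in dedupe_latest(df).collect()}
    assert out == {1: "2024-01-02", 2: "2024-01-01"}
```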
Domain 5: Big Data Analytics and Machine Learning at Scale
2 topics
Distributed analytics and ML infrastructure
- Implement distributed machine learning training using Spark MLlib, distributed TensorFlow, or Horovod for model training on datasets that exceed single-machine capacity.
- Apply big data analytics techniques including distributed SQL analytics, graph analytics on large networks, and geospatial analytics over massive location datasets.
- Analyze the scalability and cost efficiency of big data analytics workloads by evaluating compute-storage separation, auto-scaling policies, and serverless processing alternatives.
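A compact Spark MLlib sketch: the same pipeline code runs as distributed jobs whether the input is this toy frame or a cluster-scale dataset.

```python
# Sketch: distributed model training with Spark MLlib.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Tiny inline data standing in for a dataset that exceeds one machine.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.1, 2.2), (0.0, 0.8, 0.3), (1.0, 2.9, 1.8)],
    ["label", "f1", "f2"])

# Assemble raw columns into the single vector column MLlib expects.
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = features.transform(df)

# The optimizer executes as distributed Spark jobs across the cluster.
model = LogisticRegression(maxIter=20).fit(train)
model.transform(train).select("label", "prediction").show()
```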
Data science platform integration
- Implement feature engineering pipelines on distributed platforms including Spark feature transformations, feature stores, and automated feature computation for ML workloads.
- Apply model serving infrastructure including batch prediction pipelines, real-time inference endpoints, and A/B testing frameworks for big data ML deployments.
- Design a big data ML platform strategy that integrates distributed training, feature engineering, model serving, and experiment tracking into a cohesive data science infrastructure.
Domain 6: Big Data Security and Governance
2 topics
Big data security and compliance
- Implement security controls for big data platforms including Kerberos authentication, Ranger/Sentry authorization, encryption at rest and in transit, and audit logging.
- Apply data governance practices to big data environments including data lineage tracking, metadata management, data cataloging, and compliance monitoring for distributed data assets.
- Analyze security and compliance risks in big data environments by evaluating data exposure surface, access control effectiveness, and regulatory compliance gaps.
- Design a big data security and governance strategy that balances data democratization with protection requirements, establishing guardrails for self-service data access.
Data quality at scale
- Implement data quality frameworks for big data including Great Expectations, dbt tests, and custom validation rules executed within distributed processing pipelines.
- Apply data observability practices including freshness monitoring, volume anomaly detection, schema drift alerts, and distribution shift detection for big data pipelines.
- Analyze data quality patterns across big data pipelines to identify systematic quality degradation sources and recommend preventive engineering controls.
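A sketch of custom validation rules enforced inside a distributed pipeline (Great Expectations and dbt tests cover the same ground with richer tooling); the rules and data are invented, and the gate deliberately fires on the bad sample rows.

```python
# Sketch: a quality gate computed in one distributed aggregation pass.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, "a@x.io", 25), (2, None, 31), (3, "c@x.io", -4)],
    ["id", "email", "age"])

checks = df.agg(
    F.sum(F.col("email").isNull().cast("int")).alias("null_emails"),
    F.sum((F.col("age") < 0).cast("int")).alias("bad_ages"),
    F.count("*").alias("rows"),
).first()

# Fail the pipeline stage loudly instead of propagating bad data downstream.
assert checks.rows > 0, "empty input: upstream freshness problem?"
if checks.null_emails or checks.bad_ages:
    raise ValueError(f"quality gate failed: {checks.asDict()}")
```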
Scope
Included Topics
- Big data architecture and distributed systems engineering including Hadoop ecosystem, Spark, Kafka, and cloud-native big data services as tested on the ICCP Certified Big Data Professional exam.
- Distributed data storage technologies including HDFS, object storage, NoSQL databases, data lakes, and lakehouses with appropriate data organization and access patterns.
- Big data processing paradigms including batch processing, stream processing, micro-batch, and lambda/kappa architecture patterns for real-time and near-real-time analytics.
- Data engineering at scale including ETL/ELT pipeline design, data pipeline orchestration, data quality in distributed environments, and schema management for evolving big data platforms.
- Big data analytics and business intelligence including data warehousing at scale, OLAP over big data, machine learning on distributed platforms, and self-service analytics enablement.
- Big data governance, security, and compliance including data lineage in distributed systems, access control for data lakes, encryption, and regulatory compliance for large-scale data processing.
Not Covered
- Foundation-level data management concepts covered by the CDP Foundation exam that serve as prerequisites for this certification.
- Advanced machine learning algorithm development and model training techniques covered by the CDS certification.
- Application-level software development including web frameworks, mobile development, and user interface design not directly related to big data engineering.
- Network infrastructure design and management beyond what is necessary for distributed systems configuration.
Official Exam Page
Learn more at ICCP
CBDP is coming soon
Adaptive learning that maps your knowledge and closes your gaps.
Create Free Account to Be Notified