Class 014 · DATA ENGINEERING & AI AGENTS · PYSPARK + DATABRICKS + FABRIC

Data Engineering + AI Agents

Master end-to-end Data Engineering with Agentic AI. Build medallion pipelines on PySpark + Delta Lake with Databricks (DLT, Unity Catalog, MLflow), ship a complete Microsoft Fabric implementation across OneLake and Real-Time Intelligence, and deploy a Data Engineering Coding Agent.

Duration: 3 months
Modules: 70+
Class rating: 4.7/5
Enrolled: 100k+
Where our Data Engineering alumni work
Microsoft, Amazon, Salesforce, ServiceNow, Deloitte, Infosys, Accenture, TCS, Wipro, Capgemini, Cognizant, HCL
What you leave with

Four things every Data Engineering grad walks away with.

01
Agent-Ready DE skills
Triple-platform DE depth — PySpark, Databricks (Delta Lake, DLT, Unity Catalog), and Microsoft Fabric (OneLake, DP-600/DP-700) — plus an LLM PySpark layer with LangGraph, Claude Agent SDK, MCP.
02
A shipped project
A production-deployed Data Engineering Coding Agent that drafts PySpark, debugs Spark jobs, and proposes data-quality fixes via MCP servers, running with human approval gates and a public verification URL.
03
Verifiable credential
2026 Agent-Ready rubric mapped to Databricks DE Associate + Professional, DP-700, DP-600, and AWS Data Engineer, graded 1–5, with a public verification URL recruiters can check in 30 seconds.
04
Direct placement pipeline
GitHub + LinkedIn portfolio rewrite, DE-tuned resume rebuild, and warm intros into our 1,000+ hiring partners actively staffing Data Engineer, Databricks Engineer, and Fabric Engineer roles.
3 MONTHS · FOUR PHASES · ONE DE AGENT

From “loads a CSV” to “ships AI-native data platforms.”

Weeks 1–2 · Foundations

IT & AI Foundations + Python for DE

  • Application lifecycle, Agile/Scrum, and cloud computing models
  • Introduction to AI, ML, Generative AI, and Agentic AI
  • Python fundamentals and data structures for engineers
  • Advanced Python — OOP, decorators, generators, and packaging
YOU SHIP: A production-quality Python codebase with OOP, decorators, and proper packaging, the DE toolkit ready for Spark and Fabric.
Weeks 3–6 · Analytics + Data Layer

Power BI + PostgreSQL for Data Engineers

  • Power BI Desktop, Power Query, and data prep transformations
  • Data modelling — star/snowflake schemas, DAX, time intelligence
  • PostgreSQL DDL, DML, JOINs, window functions, and CTEs
  • PL/pgSQL stored procedures, triggers, and query optimisation
YOU SHIP: A Power BI dashboard suite plus a PostgreSQL analytical database, connected and indexed to feed the Spark and Fabric pipelines.
Weeks 7–12 · Data Engineering Core (PySpark + Databricks + Fabric)

PySpark + Databricks + Microsoft Fabric

  • PySpark — RDDs, DataFrames, Spark SQL, and structured streaming
  • Production engineering with medallion pipelines, Airflow, and Docker
  • Databricks — Delta Lake, Unity Catalog, Delta Live Tables, MLflow
  • Microsoft Fabric — OneLake, Lakehouse, Data Factory, Real-Time Intelligence
YOU SHIP: A PySpark + Databricks medallion pipeline, a Microsoft Fabric end-to-end implementation, and an Airflow + Docker CI/CD toolkit.
Weeks 12–14 · GenAI + Agentic AI

Master the 2026 GenAI + Agentic AI stack — and ship a Data Engineering Coding Agent that drafts PySpark pipelines, debugs Spark jobs, and proposes data quality fixes autonomously.

Engineer with LLM APIs from OpenAI, Anthropic, Google GenAI, and DeepSeek. Master prompt engineering (zero-shot, few-shot, CoT, ReAct) and context engineering — the 2026 frontier discipline. Build production RAG pipelines with ChromaDB and pgvector over your data dictionary, pipeline documentation, and historical Spark job logs. Master the 2026 production agent stack — LangGraph 1.0 (#1 production default), Claude Agent SDK (#2 MCP-native), CrewAI (#3 multi-agent crews). Wire it all through the Model Context Protocol (MCP) — 200+ server implementations, 97M+ monthly SDK downloads. Final project — a deployed Data Engineering Coding Agent with MCP servers exposing your Databricks workspaces, Fabric workloads, Spark clusters, and PostgreSQL data layer. The data engineer’s force multiplier.

Partner orgs (2026): 62
DE projects deployed: 280+
Placement offers: 91%
Course curriculum

Eight sections. 70+ modules. The AI-native data engineering stack.

01

Fundamentals of IT & AI

Foundational track building the conceptual bedrock every data engineer needs — application lifecycle, Agile/Scrum, computing infrastructure, AI/ML/Generative/Agentic AI fundamentals, and real-world digital systems. Sets the context for everything that follows in the DE + AI engineering stack.
5 MODULES
SECTION 1
Application fundamentals — what applications are, their types, web architecture
Web Technologies — Frontend (HTML, CSS, JavaScript, React) and Backend (Python, Java, Node.js)
Database Systems — SQL (PostgreSQL, MySQL) and NoSQL (MongoDB)
The seven SDLC phases — Planning, Analysis, Design, Implementation, Testing, Deployment, Maintenance
The data engineer sits between application data and analytical workloads — knowing how applications generate data makes you a better pipeline architect
Methodology Evolution — Waterfall vs Agile, the Agile mindset
Popular frameworks — Scrum, Kanban, Extreme Programming (XP)
Scrum Roles — Product Owner, Scrum Master, Development Team (including data engineers)
Scrum Events — Sprint Planning, Daily Scrum, Sprint Review, Sprint Retrospective
Scrum Artifacts — Product Backlog, Sprint Backlog, Increment deliverables
User Stories — Epics, Themes, Acceptance Criteria
Estimating user stories with story points
Backlog management with Google Sheets and Azure Boards
CPU vs GPU — when each matters for analytics workloads
Memory, storage, and network basics
Why these matter for Spark cluster sizing and Fabric capacity planning
IaaS — Infrastructure as a Service (e.g., Azure VMs, AWS EC2)
PaaS — Platform as a Service (e.g., Azure SQL Database)
SaaS — Software as a Service (e.g., Power BI Service, Databricks SaaS)
Cloud data warehouses (Snowflake, BigQuery, Redshift, Synapse, Fabric Warehouse) — comparative analysis
AI is reshaping data engineering in 2026 — from AI-generated PySpark to autonomous pipeline drafting to RAG-powered data catalogs
Machine Learning — algorithms that improve through experience
Deep Learning — neural networks for complex pattern recognition
Generative AI — systems that generate code, pipelines, narratives
Large Language Models — LLMs that draft PySpark, debug Spark jobs, propose data quality fixes
Agentic AI — autonomous systems that plan, reason, act, and learn — the future of DE
CRM systems — Salesforce, Dynamics — typical data sources for analytics pipelines
HRMS — Workday, SAP SuccessFactors — sensitive data with strict governance
Retail & E-Commerce — high-volume transactional + clickstream pipelines
Healthcare Applications — HIPAA/DPDP compliance for sensitive health analytics
Domain depth multiplies your DE salary — BFSI, healthcare, retail DEs command premium rates
02

Python for Data

The dominant language for data engineering. Master Python syntax, data structures, and advanced programming concepts essential for AI and data work. 10 modules from environment setup through advanced OOP — the language fluency that powers PySpark, Databricks notebooks, and Fabric Python User Functions.
10 MODULES
SECTION 2
Python interpreter installation for Windows and Mac
Visual Studio Code + Jupyter for data engineering workflows
Variables, identifiers, naming conventions
Data types, operators, type conversion
Control flow — if/elif/else, while, for, match-case
break, continue, pass statements
String fundamentals — indexing, slicing, concatenation
f-strings and .format() for log messages and pipeline metadata
String methods — case conversion, search, trimming, replacement
.split() and .join() for text data preprocessing
Critical for parsing CSV, JSON, and log data in pipelines
Lists — creation, indexing, slicing, modification
List comprehensions for elegant data transformation
Sorting, reversing, copying patterns
Tuples — packing and unpacking
Performance advantages over lists
Schema definition patterns with StructType (preview of PySpark)
Dictionaries — creation, access, operations
Dictionary comprehensions
Nested dictionaries for structured data
Essential for representing JSON pipeline configs
Sets — UUU properties (Unique, Unordered, Unindexed)
Mathematical operations — union, intersection, difference
Use cases in deduplication
Collections module — namedtuple, Counter, defaultdict, deque
Iterators & Generators — memory-efficient streaming for large data
Generator expressions and pipelines
Functional programming — lambda, map, filter, reduce
Generators are the data engineer's tool for larger-than-memory data processing
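As a taste of why generators suit larger-than-memory processing, here is a minimal sketch of a lazy pipeline. The function names and the `id,amount` line format are hypothetical, chosen only for illustration; each stage pulls one record at a time, so memory use stays flat however long the input is.

```python
from typing import Dict, Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[Dict[str, float]]:
    """Lazily parse 'id,amount' lines into dicts, one record at a time."""
    for line in lines:
        record_id, amount = line.strip().split(",")
        yield {"id": int(record_id), "amount": float(amount)}


def only_positive(records: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Pass through records with a positive amount and skip the rest."""
    for record in records:
        if record["amount"] > 0:
            yield record


def sum_amounts(records: Iterable[Dict[str, float]]) -> float:
    """Consume the stream while holding only one record in memory."""
    total = 0.0
    for record in records:
        total += record["amount"]
    return total


# Hypothetical input. In practice this could be an open file handle,
# which is itself a lazy iterator over lines.
sample_lines = ["1,10.5", "2,-3.0", "3,4.5"]
total = sum_amounts(only_positive(read_records(sample_lines)))
print(total)  # 15.0
```

Nothing is parsed until `sum_amounts` starts iterating, which is what lets the same three functions work on a three-line list or a multi-gigabyte file.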
Function definition, parameters, return values
Default arguments, *args, **kwargs
Variable scope (LEGB rule)
First-class functions and higher-order patterns
Recursion
Type hints (Python 3.5+) — essential for self-documenting pipeline code
Documenting functions with docstrings
Built-in modules, user-defined modules, packages
pip for package management
requirements.txt for reproducible builds
Virtual environments for isolated DE projects
Reproducibility is non-negotiable in production DE — invest here
CRUD operations with open()
File modes and pathlib
Directory management with os and shutil
Python's csv module — reader, writer, DictReader, DictWriter
JSON operations — dump(), dumps(), load(), loads()
The two most common data formats for DE ingestion
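To make the CSV and JSON modules concrete, here is a small sketch that reads CSV rows with `csv.DictReader` and emits one JSON document per row. The column names and the in-memory payload are assumptions for illustration; `io.StringIO` stands in for a real file opened with `open()`.

```python
import csv
import io
import json
from typing import List, TextIO

# Hypothetical CSV payload standing in for an ingested file.
csv_text = "order_id,amount\n1,10.5\n2,4.0\n"


def csv_to_json_lines(source: TextIO) -> List[str]:
    """Convert CSV rows to JSON strings, casting columns to real types."""
    json_lines = []
    for row in csv.DictReader(source):
        # DictReader yields every value as a string, so cast explicitly.
        typed_row = {
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
        }
        json_lines.append(json.dumps(typed_row))
    return json_lines


json_lines = csv_to_json_lines(io.StringIO(csv_text))
records = [json.loads(line) for line in json_lines]
print(records)
```

The explicit casts matter: CSV carries no type information, so a pipeline that skips them ends up with `"10.5"` strings in a column that downstream code expects to be numeric.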
Exception Handling — robust error handling for unreliable data sources, retry patterns
Decorators — for logging pipeline runs, timing functions, caching expensive operations
Generators deep dive — memory efficiency for streaming large datasets
Context Managers — proper resource management for database connections, file handles, Spark sessions
Four patterns that separate scripting Python from production data engineering
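Two of these patterns can be sketched together: a retry decorator for an unreliable source and a context manager that times a block. This is a minimal illustration under stated assumptions, not a production retry policy; `retry`, `timed`, and `flaky_fetch` are hypothetical names, and only `ConnectionError` is treated as retryable here.

```python
import functools
import time
from contextlib import contextmanager


def retry(max_attempts: int = 3, delay_seconds: float = 0.0):
    """Retry the wrapped function on ConnectionError up to max_attempts times."""
    def decorator(func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the error
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


@contextmanager
def timed(label: str, sink: dict):
    """Record the elapsed seconds of the with-block under sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start


calls = {"count": 0}


@retry(max_attempts=3)
def flaky_fetch() -> str:
    """Hypothetical source that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return "payload"


timings = {}
with timed("fetch", timings):
    result = flaky_fetch()
print(result, calls["count"])  # payload 3
```

The `finally` clause in `timed` is the point of using a context manager: the timing is recorded even if the block raises, which is the same guarantee you want for closing connections and file handles.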
Classes, objects, methods, special methods
Instance variables vs class variables
Encapsulation — access modifiers control data visibility
Inheritance — single, multi-level, and multiple inheritance
Abstraction — abstract classes and methods
Polymorphism — method overriding and duck typing
Custom Spark UDFs and PySpark schema classes use OOP extensively — master it before Section 5
03

SQL for AI & Data

The data backbone of every analytics pipeline. Five modules covering PostgreSQL from foundations through programming with PL/pgSQL, triggers, and query optimization — the data layer that feeds your Spark and Fabric pipelines.
5 MODULES
SECTION 3
Databases, DBMS, RDBMS — concepts and terminology
ACID properties — Atomicity, Consistency, Isolation, Durability
PostgreSQL setup, psql, pgAdmin 4, DBeaver
Data types — numeric, character, date/time, boolean, JSON, arrays
Constraints — PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK, DEFAULT
SELECT statements and column projection
WHERE clauses with operators and conditions
Built-in functions — string, numeric, date, conditional
Aggregates — COUNT, SUM, AVG, MIN, MAX
GROUP BY and HAVING
Window functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
JOIN operations — INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF
Subqueries — scalar, row, table subqueries
CTEs (Common Table Expressions) — readable analytical queries
Recursive CTEs for hierarchical data
Set operators — UNION, UNION ALL, INTERSECT, EXCEPT
DML — INSERT, UPDATE, DELETE patterns
Transactions — atomicity for data integrity
ALTER TABLE for schema evolution
Indexes — B-tree, Hash, GiST, GIN
Views — virtual tables, materialized views
Stored functions with CREATE FUNCTION
PL/pgSQL — variables, control structures, exception handling
Triggers — BEFORE, AFTER, INSTEAD OF
ER modelling, normalization (1NF, 2NF, 3NF)
OLTP vs analytics workload patterns
Star schema for analytics
Query plan analysis with EXPLAIN and EXPLAIN ANALYZE
Index strategies — selectivity, covering indexes, multi-column indexes
VACUUM, ANALYZE, partitioning
DEs read query plans daily — invest in EXPLAIN ANALYZE fluency
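As a quick illustration of the CTE and window-function topics in this section, the sketch below runs `ROW_NUMBER` and `LAG` over a tiny table. It uses Python's built-in `sqlite3` module as a stand-in so it runs without a server (this assumes the bundled SQLite is 3.25 or newer, which added window functions); the table and column names are hypothetical, and the same query shape applies in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 2, 15.0), ("b", 1, 7.0), ("a", 3, 5.0)],
)

# The CTE numbers each customer's orders and looks up the previous amount.
query = """
WITH ranked AS (
    SELECT customer,
           order_day,
           amount,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_day) AS rn,
           LAG(amount)  OVER (PARTITION BY customer ORDER BY order_day) AS prev_amount
    FROM orders
)
SELECT customer, order_day, amount, rn, prev_amount
FROM ranked
ORDER BY customer, order_day
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

The first order of each customer has no predecessor, so `LAG` returns NULL there (`None` in Python).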
04

Power BI for Data Analysis

The BI tool every data engineer must understand. Power BI sits at the end of your Lakehouse/Warehouse pipelines — knowing how it consumes data makes you a better pipeline architect. 10 modules grouped into four progressive units.
10 MODULES
SECTION 4
BI fundamentals and modern analytics approaches
Power BI components — Desktop, Service, Mobile, Gateway
Interface navigation, workspace setup, first report creation
Desktop versus Service capabilities
File sources, database connections, cloud services, web sources
Connection Modes — Import, DirectQuery, Live Connection
Direct Lake mode (Fabric) preview — query Delta tables at speed without import
Power Query interface and applied steps
Data profiling and quality assessment
Essential transformations — filtering, splitting, merging
Reshaping — pivot, unpivot, grouping
Combining queries — append and merge operations
Star schema versus snowflake schema
Creating and managing table relationships
Primary and foreign keys
Hierarchies and date dimension tables
Data model optimisation strategies
Data visualisation principles and chart selection
Core visualisations, slicers, bookmarks, drill-through
Mobile optimisation and data storytelling
DAX syntax and structure
Calculated columns vs measures
Aggregation, logical, text, date/time functions
CALCULATE and FILTER functions
Creating KPIs and business metrics
Year-over-year, quarter-over-quarter comparisons
Custom calendar handling
Iterator functions — SUMX, AVERAGEX, COUNTX
ALL, ALLEXCEPT for filter manipulation
AI visuals — Q&A, Key Influencers, Decomposition Tree
Workspaces, apps, sharing
Subscriptions and alerts
Row-Level Security (RLS) — dynamic security with USERPRINCIPALNAME()
Sensitivity labels
DAX performance tuning, composite models
Foundation for Microsoft Certified: Power BI Data Analyst Associate (PL-300)
05

PySpark Foundations

The heart of distributed data engineering. Apache Spark is the engine powering modern data engineering — and PySpark is its dominant interface. 9 modules taking you from Big Data fundamentals through RDDs, DataFrames, Spark SQL, structured streaming, performance optimisation, and finally GenAI pipelines on Spark. The foundation Databricks and Fabric build on.
9 MODULES
SECTION 5
Volume, Velocity, Variety — the three pillars of Big Data
Why traditional tools fall short at scale
Modern data architecture evolution
Driver, Executors, and Cluster Manager working in harmony
Spark vs Hadoop — why in-memory processing gives Spark a decisive edge
Execution Model — DAG, Stages, Tasks, and Shuffle explained clearly
Deployment Modes — Local, Standalone, YARN, Kubernetes
Setting up PySpark — Local (pip install pyspark), Docker, Databricks, Microsoft Fabric
SparkSession and SparkContext — the gateway to every Spark operation
Spark Web UI — monitor Jobs, Stages, Storage, and Environment in real time
Creating DataFrames from CSV/JSON/Parquet files, RDDs, and Python lists
Schema definition with StructType & StructField
Core Transformations — select, rename, drop columns
Filter with filter() and where()
Add and transform with withColumn()
orderBy() and sort() for ordering
dropna(), fillna(), replace() for handling nulls and gaps
show(), collect(), take(), first() — choosing the right method for the right context
Registering DataFrames as temporary views and global temporary views
Running SQL queries with spark.sql()
Choosing between DataFrame API and Spark SQL — when each shines
groupBy() and aggregation functions — count, sum, avg, min, max
Multiple aggregations with agg()
pivot and unpivot operations
Window specification — partitionBy, orderBy, frame definitions
Ranking functions — row_number, rank, dense_rank, percent_rank
Aggregate windows — running totals, moving averages
Essential for time-series and cohort analysis at scale
Join types — inner, left, right, full outer, cross, semi, anti
Broadcast joins — when and how to use broadcast() for small dimension tables
Sort-merge joins vs hash joins — the Spark optimiser's choices
Join pitfalls — data skew, shuffle costs, cartesian products
Self-joins for hierarchical data
Performance debugging with EXPLAIN and the Spark UI
CSV, JSON, Parquet, ORC, Avro — when to use each
Schema inference vs explicit schemas
Compression options (snappy, gzip, lz4, zstd)
Connecting to PostgreSQL, MySQL, SQL Server via JDBC
Predicate pushdown and partitioning JDBC reads
Coalesce vs Repartition — when each is right
Write strategies and partition pruning
Delta Lake — read, write, merge operations
ACID transactions on data lakes
Schema enforcement — rejecting incompatible writes
Schema evolution — automatically updating schemas
Auto Loader — incrementally ingest new files from cloud storage as they arrive
File notifications vs directory listing — ideal for streaming ingestion patterns
Read Stream → Transform → Write Stream → Checkpoint
Micro-batch vs continuous processing
Idempotent writes and exactly-once semantics
Watermarks for handling late-arriving data
Stateful aggregations
Stream-static joins
Native Apache Kafka source and sink
Subscribe and assign patterns
Trigger types — once (single batch and stop), continuous (true streaming), processingTime (micro-batch at fixed cadence)
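The watermark idea from this module can be simulated in plain Python. The sketch below is a conceptual stand-in and not the Structured Streaming API: it treats the watermark as the maximum event time seen so far minus an allowed lateness, and it updates per event, whereas Spark advances the watermark between micro-batches. The function name and integer timestamps are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time, key)


def windowed_counts(
    events: Iterable[Event], window_size: int, allowed_lateness: int
) -> Tuple[Dict[Tuple[int, str], int], List[Event]]:
    """Count events per (tumbling window, key), dropping events behind the watermark."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    dropped: List[Event] = []
    max_event_time = None

    for event_time, key in events:  # events arrive in processing order
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, key))  # too late: state is finalised
            continue
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1

    return dict(counts), dropped


# Event at time 3 arrives late but within the lateness bound; 11 arrives too late.
events = [(1, "a"), (12, "a"), (3, "a"), (25, "a"), (11, "a")]
counts, dropped = windowed_counts(events, window_size=10, allowed_lateness=10)
print(counts)   # {(0, 'a'): 2, (10, 'a'): 1, (20, 'a'): 1}
print(dropped)  # [(11, 'a')]
```

The trade-off is visible in the two arguments: a larger `allowed_lateness` accepts more late data but forces the engine to keep window state around for longer.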
Catalyst Optimiser — how Spark plans your query
Logical vs physical plans
Cost-based optimisation
Tungsten Engine — whole-stage code generation
Off-heap memory management
Adaptive Query Execution (AQE) — dynamic shuffle partition coalescing
Skew join handling
Dynamic switching of join strategies
cache() vs persist() storage levels
When caching helps vs hurts
Data skew mitigation — salting techniques
Skew join hints
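Salting is easier to see on a toy dataset. The sketch below is a pure-Python stand-in, not PySpark code: it appends a random salt to the skewed fact-side key and replicates each dimension row once per salt value, so a single hot key is spread over several join keys. All names and the choice of four salts are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List

NUM_SALTS = 4  # assumption: number of buckets to spread each key over


def salt_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Fact side: append a random salt so one hot key becomes several keys."""
    return f"{key}_{rng.randrange(num_salts)}"


def explode_dimension(dim: Dict[str, str], num_salts: int) -> Dict[str, str]:
    """Dimension side: replicate every row once per salt so all salted keys match."""
    return {
        f"{key}_{salt}": value
        for key, value in dim.items()
        for salt in range(num_salts)
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
fact_keys: List[str] = ["hot"] * 1000 + ["cold"] * 10  # heavily skewed
dimension = {"hot": "Hot Customer", "cold": "Cold Customer"}

salted_facts = [salt_key(key, NUM_SALTS, rng) for key in fact_keys]
salted_dimension = explode_dimension(dimension, NUM_SALTS)

# The join still finds the right dimension row for every fact row.
joined = [salted_dimension[key] for key in salted_facts]
load_per_key = Counter(salted_facts)
print(load_per_key.most_common(4))
```

In a cluster, each salted key would hash to a different partition, so the 1,000 "hot" rows are split across up to four tasks instead of landing on one. The cost is that the dimension side grows by a factor of `NUM_SALTS`.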
Medallion Architecture — Bronze, Silver, Gold layers — the canonical DE pattern
Quality expectations at each layer
spark-submit configuration
Cluster sizing — driver vs executor memory and cores
Dynamic allocation
Apache Airflow DAGs for Spark jobs
Dependencies, retries, SLAs
Sensors and operators for Spark
Docker for reproducible Spark environments
Docker Compose for local cluster simulation
pytest for PySpark
Mock SparkSession for fast tests
Testing transformations vs actions
MLlib pipelines — feature engineering, model training, evaluation
Saving and loading models
Calling LLM APIs from Spark UDFs
Distributed inference patterns
Cost optimisation for LLM-on-Spark
Embedding generation in Spark pipelines
Bulk vector loading to Pinecone, Qdrant, ChromaDB
Building RAG ingestion pipelines on Spark
Section Project — end-to-end PySpark medallion pipeline (Bronze → Silver → Gold) with structured streaming ingestion, Delta Lake writes, Airflow orchestration, unit tests, and Docker deployment
06

Databricks Mastery

Databricks is the world's leading Data + AI platform, built on Apache Spark with a Lakehouse architecture. It unifies data engineering, data science, ML, and BI workflows in a single collaborative environment. 11 modules covering everything from platform fundamentals to certification prep for Databricks Certified Data Engineer Associate + Professional.
11 MODULES
SECTION 6
Data Lake + Data Warehouse unified — the Lakehouse paradigm vs legacy architectures
Databricks on AWS, Azure & GCP — cross-cloud parity and differences
Workspace navigation — Catalog, Compute, Workflows, SQL, ML workloads
Notebook collaboration patterns
Compute options — All-Purpose vs Job Clusters
Serverless Compute overview
Databricks Runtime (DBR) versions
Notebooks — Python, SQL, Scala, R support
Collaborative notebooks and version control
Databricks Assistant — AI coding companion
Delta file format on top of Parquet
Transaction log (_delta_log)
ACID transactions on data lakes — the breakthrough
Optimistic concurrency control
Managed vs external Delta tables
Schema enforcement and evolution
Querying historical versions — Time Travel
VERSION AS OF and TIMESTAMP AS OF
VACUUM — cleaning up old files
OPTIMIZE — compacting small files
Z-Ordering for data skipping
Change Data Feed (CDF) — capture row-level changes
Shallow and deep clones — testing and disaster recovery
Unity Catalog provides a unified governance layer across all Databricks workspaces
Three-Level Namespace — Catalog → Schema → Table hierarchy
Cross-workspace data sharing
Fine-grained GRANT, REVOKE, and DENY statements
Row-level and column-level security
Dynamic views for data masking
Automatic lineage tracking — table-to-table and column-level lineage
Configuring external locations
Storage credentials management
Built-in audit logs
Meeting SOC 2, HIPAA, GDPR compliance
DLT — declarative pipelines vs imperative Spark jobs
Materialised views and streaming tables
Quality expectations: expect, expect_or_drop, and expect_or_fail (in SQL, EXPECT with ON VIOLATION DROP ROW or ON VIOLATION FAIL UPDATE)
Data quality at pipeline runtime, not after the fact
Triggered vs continuous pipelines
Development vs production modes
DLT + Auto Loader for streaming ingestion
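The three expectation behaviours differ only in what happens to rows that violate the rule. The sketch below models that difference in plain Python; it is not the Databricks API (in a real pipeline these are the `dlt.expect`, `dlt.expect_or_drop`, and `dlt.expect_or_fail` decorators), and `apply_expectation` with its `on_violation` argument is a hypothetical helper.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def apply_expectation(
    rows: List[Row],
    name: str,
    predicate: Callable[[Row], bool],
    on_violation: str = "warn",
) -> Tuple[List[Row], Dict[str, object]]:
    """Model the three behaviours: 'warn' keeps rows, 'drop' removes them, 'fail' raises."""
    passed = [row for row in rows if predicate(row)]
    failed = len(rows) - len(passed)
    metrics = {"expectation": name, "passed": len(passed), "failed": failed}

    if on_violation == "fail" and failed:
        # Like expect_or_fail: stop the update instead of writing bad data.
        raise ValueError(f"Expectation '{name}' failed for {failed} row(s)")
    if on_violation == "drop":
        # Like expect_or_drop: only valid rows continue downstream.
        return passed, metrics
    # Like expect: keep everything, but record the violation count.
    return list(rows), metrics


bronze = [{"id": 1}, {"id": None}, {"id": 3}]


def has_id(row: Row) -> bool:
    return row["id"] is not None


kept_all, warn_metrics = apply_expectation(bronze, "valid_id", has_id, "warn")
kept_valid, drop_metrics = apply_expectation(bronze, "valid_id", has_id, "drop")
print(len(kept_all), len(kept_valid), drop_metrics["failed"])  # 3 2 1
```

In every mode the violation count is captured, which is what makes quality visible at pipeline runtime rather than in a later audit.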
Auto Loader — incremental file ingestion at scale
File notification mode vs directory listing
Schema inference and evolution in streaming
Structured Streaming on Databricks — production patterns
Checkpointing for fault tolerance
Multi-task jobs with task dependencies
Conditional execution and retry policies
Cron and triggered scheduling
Databricks CLI and REST API for automation
Apache Airflow integration
Monitoring Hub and cost optimisation
Classic vs Serverless SQL Warehouses
Query history and performance profiling
Power BI, Tableau & Excel connectivity
dbt integration
Photon engine for blazing performance
AI/BI Dashboards
Genie — conversational BI
MLflow — experiment tracking and model registry
Model versioning and stage transitions
Feature Store for reusable features
Online and offline feature stores
AutoML — rapid model prototyping
Automated feature engineering
scikit-learn, XGBoost, PyTorch, Hugging Face
Hyperparameter tuning with Hyperopt
Batch scoring and model serving
Real-time and batch inference endpoints
Model drift monitoring
Mosaic AI integration — Foundation Models on Databricks
Foundation Models — DBRX, Llama, Mixtral, MPT
Vector Search — Databricks native vector database
RAG patterns on Databricks
AI Agents with Databricks
Fine-tuning open-source LLMs
Cost optimisation for GenAI workloads
DABs — declarative bundle definitions for Databricks resources
YAML configuration for jobs, pipelines, models, workflows
GitHub Actions + DABs workflows
Development → Staging → Production promotion
Environment-specific configuration
Unit testing notebooks
Integration testing pipelines
Code review patterns
Databricks Certified Data Engineer Associate — exam blueprint walkthrough
Practice questions across all domains
Hands-on labs aligned to exam objectives
Databricks Certified Data Engineer Professional — advanced topics, performance tuning, security, governance
Production engineering patterns
Practice exam with timed assessment
Section Project — complete Databricks Lakehouse implementation with DLT pipelines, Unity Catalog governance, MLflow experiment tracking, Databricks SQL dashboards, GenAI integration, and DABs-based CI/CD deployment
07

Microsoft Fabric

The 2026 Microsoft data platform play. Microsoft Fabric unifies Data Factory, Data Engineering, Warehouse, Data Science, Real-Time Intelligence, and Power BI in a single SaaS platform — with OneLake as the unified data lake underneath. 15 modules covering every Fabric workload, with direct certification prep for DP-600 (Analytics Engineer Associate) and DP-700 (Data Engineer Associate).
15 MODULES
SECTION 7
Fabric workloads and architecture overview
Licensing and capacity units — F2 through F2048
Workspaces and tenant structure
Platform comparisons — Databricks, Snowflake, Azure Synapse
Migration paths from Azure Synapse
OneLake architecture — one lake per tenant
Delta Lake and Parquet file formats
ACID transactions and versioning
Shortcuts to external data sources (Azure Data Lake Storage, Amazon S3, Google Cloud Storage) without copying
OneLake Catalog and data discovery
OneLake shortcuts enable a true multi-cloud data mesh
Lakehouse fundamentals and components
Creating and managing Lakehouses
Medallion architecture — Bronze, Silver, Gold
SQL Analytics Endpoint for T-SQL access
Delta table operations and optimisation
Time travel queries on Delta tables
Build robust data integration pipelines with comprehensive connectivity and orchestration
Data Factory capabilities and connectors
Data pipelines creation and configuration
Dataflows Gen2 and Power Query
Database mirroring — Azure SQL, Cosmos DB, PostgreSQL
Pipeline orchestration and CI/CD
Apache Spark in Fabric workloads
Fabric Notebooks with Copilot
PySpark DataFrames and transformations
Spark SQL queries and optimisation
Spark job definitions and scheduling
AI functions — Summarisation, Classification, PII obfuscation built into Fabric Spark
Fabric Data Warehouse overview
Full T-SQL support and user-defined functions
Schema design and table management
Star schema and dimensional modelling
Slowly Changing Dimensions (SCD) Type 1, 2, 3
SQL Database in Fabric
Performance optimisation techniques
Real-time analytics in Fabric — purpose-built for time-series and log analytics with sub-second query performance on billions of rows
Eventstreams and streaming source ingestion
Kusto Query Language (KQL) fundamentals
Eventhouse and KQL databases
Graph in Fabric for relationship modelling
Maps for geospatial analytics
Real-time dashboards and alerting
KQL offers sub-second query performance on billions of rows — ideal for IoT, telemetry, and operational intelligence workloads
Direct Lake mode — query Delta tables at speed without import
The third performance tier after Import and DirectQuery
Semantic models and relationships
DAX fundamentals and syntax in Fabric
Row-level security (RLS) and object-level security (OLS)
Incremental refresh and aggregations
Report development with Copilot assistance
Data Science experience and tooling in Fabric
Exploratory data analysis in Fabric notebooks
ML model training and versioning
MLflow experiment tracking
Semantic Link (SemPy) — connect ML to Power BI semantic models
Batch scoring and predictions at scale
Copilot across workloads — Notebooks, SQL, KQL, Pipelines, Reports
Data Agents for conversational AI on your data
Operations Agents for monitoring
Fabric IQ and ontology models
AI Functions and Azure AI Foundry integration
User Data Functions overview — Python-based serverless functions
Create custom serverless functions for extended capabilities
VS Code extension for development
Integration with Notebooks, Pipelines, and SQL
Testing and deployment workflows
Implement comprehensive security and governance frameworks
Fabric security model
Authentication and authorisation — Entra ID integration
Row-level and column-level security
Dynamic data masking
Microsoft Purview integration
Data lineage, catalog, compliance, and auditing
Fabric admin portal and tenant settings
Capacity management and SKU selection
Monitoring Hub and performance dashboards
Query and pipeline monitoring
Git integration and deployment pipelines
CI/CD patterns for Fabric workloads
Query and performance optimisation
Partition strategies and caching
Enterprise architecture patterns
Data mesh implementation in Fabric
Migration strategies from legacy platforms (Synapse, on-prem warehouses)
Developer tools — Fabric CLI and REST APIs
Integration with Azure and third-party services
Microsoft Certified: Fabric Analytics Engineer Associate (DP-600) — exam blueprint walkthrough
Practice questions across all domains
Hands-on labs aligned to exam objectives
Microsoft Certified: Fabric Data Engineer Associate (DP-700) — data engineering-specific topics
Production engineering patterns
Practice exam with timed assessment
Section Project — complete Microsoft Fabric implementation with OneLake, Lakehouse (medallion), Data Factory pipelines with database mirroring, Spark notebooks, Data Warehouse, Real-Time Intelligence with KQL, Power BI Direct Lake reports, Purview governance, and CI/CD via Git integration
08

Generative AI & Agentic AI

The 2026 frontier — and the culmination of the data engineering programme. 10 modules covering the complete GenAI engineering stack tuned for data engineering work: frontier models, prompt engineering, RAG over your pipeline metadata, agent frameworks, and the Model Context Protocol. The named Data Engineering Coding Agent project lives here.
10 MODULES
SECTION 8
Narrow AI — pre-2022
Generative AI — post-2022, unleashed by ChatGPT
Agentic AI — post-2024 era of autonomous systems
2022 inflection point — ChatGPT launch
2024 inflection point — Agentic emergence
For Data Engineers — AI that drafts PySpark from natural language
AI that debugs Spark job failures
Autonomous pipeline maintenance (with human approval)
GPT-5.5 — Terminal-Bench 2.0 leader at 82.7%. Best for autonomous agents
Claude Opus 4.7 — SWE-bench Pro leader at 64.3%. Lowest hallucination rate. Best for accurate code generation
Gemini 3.1 Pro — 2M+ token context window. Best for ingesting massive pipeline codebases
Open-source frontier — Llama 4, DeepSeek, Mistral, Qwen — for VPC deployments
Copilot in Fabric Notebooks — PySpark code generation
Databricks Assistant — Spark-aware AI assistant
GitHub Copilot in VS Code for pipeline development
Fundamentals — Context + Task + Examples + Format + Constraints
Core Techniques — Zero-shot, few-shot, Chain-of-Thought (CoT), ReAct
System Prompts — persistent persona design, guardrails
Multimodal — reading architecture diagrams, lineage graphs
Hallucination & Context — grounding for accurate code generation
Domain prompts for PySpark, SQL, dbt, Airflow
Context Engineering — the 2026 frontier discipline; managing what enters the LLM's context window
Project — ship a 30+ prompt library for DE work (PySpark drafting, SQL generation, pipeline debugging, etc.)
ChatGPT, Claude, Gemini for daily DE work
AI for PySpark writing, SQL generation, pipeline documentation
Research with Perplexity for technology benchmarking
Microsoft Copilot integration across Office and developer tools
Reading architecture diagrams with vision models
Analysing dashboards and pipeline lineage graphs
OCR for legacy data dictionaries
Audio transcription with Whisper for meetings
Hallucination — when an LLM invents a Spark function that doesn't exist
Prompt injection through documents
Privacy — keeping sensitive data out of public LLMs
Regulatory landscape — EU AI Act, India DPDP Act
Streamlit — rapid prototyping for internal DE tools
FastAPI — production-grade Python API
Building chatbots for pipeline status Q&A
Build and deploy a Streamlit + FastAPI internal tool
LLM APIs in production — OpenAI, Anthropic, Google GenAI, DeepSeek Python SDKs
Function calling and structured outputs
Embeddings & Vector Databases — ChromaDB, Pinecone, Qdrant, pgvector
HNSW, IVF indexing strategies
Databricks Vector Search integration
Fabric AI Foundry integration
RAG pipeline for DEs — the canonical flow over pipeline metadata, data dictionaries, runbooks
Hybrid search (BM25 + embeddings)
Re-ranking with cross-encoders
Agentic RAG — self-improving retrieval over your data platform documentation
Project — Internal DE Docs RAG App: RAG over your data dictionary, pipeline runbooks, and architecture decision records
LangGraph 1.0 — the production default for agentic data engineering
Claude Agent SDK — deepest MCP integration
CrewAI — role-based multi-agent crews
Semantic Kernel / Microsoft Agent Framework — enterprise .NET stacks
Pydantic AI — type-safe Python, validation-first agent design
ReAct — investigate a pipeline failure, then propose a fix
Plan-and-Execute — generate a multi-step pipeline migration plan
Reflection loops — agent reviews its own PySpark code before deploying
Multi-agent collaboration — Schema agent, Pipeline agent, Quality agent, Reviewer agent
Human-in-the-loop checkpoints — humans approve every production-touching action
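A human-in-the-loop checkpoint can be sketched without any agent framework. In this minimal illustration, every proposed action that touches production must pass an approver callback before it runs. `ProposedAction`, `ApprovalGate`, and the sample approver are hypothetical names, and a real deployment would route the approval to a person rather than a function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ProposedAction:
    description: str
    touches_production: bool
    run: Callable[[], str]  # the side effect, deferred until approved


@dataclass
class ApprovalGate:
    approver: Callable[[ProposedAction], bool]
    audit_log: List[str] = field(default_factory=list)

    def execute(self, action: ProposedAction) -> Optional[str]:
        """Run the action only if it is safe or explicitly approved."""
        if action.touches_production and not self.approver(action):
            self.audit_log.append(f"REJECTED: {action.description}")
            return None
        self.audit_log.append(f"EXECUTED: {action.description}")
        return action.run()


def cautious_approver(action: ProposedAction) -> bool:
    """Stand-in for a human reviewer: refuses anything that drops data."""
    return "DROP" not in action.description.upper()


gate = ApprovalGate(approver=cautious_approver)

dry_run = ProposedAction("explain query plan", False, lambda: "plan ok")
backfill = ProposedAction("backfill gold table", True, lambda: "backfill done")
destructive = ProposedAction("drop silver table", True, lambda: "table dropped")

results = [gate.execute(a) for a in (dry_run, backfill, destructive)]
print(results)  # ['plan ok', 'backfill done', None]
```

Keeping the side effect inside a deferred `run` callable is the design choice that matters: the agent can propose freely, but nothing executes until the gate has recorded a decision in the audit log.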
MCP — open standard for connecting agents to tools, data, systems
Introduced by Anthropic in late 2024, now stewarded by the Linux Foundation
200+ servers, 97M+ monthly SDK downloads
Build an MCP server exposing Databricks workspaces — clusters, jobs, notebooks
Build an MCP server exposing Microsoft Fabric — workspaces, pipelines, Lakehouses
Build an MCP server exposing Spark clusters — for safe job submission and monitoring
Build an MCP server exposing PostgreSQL — for query execution
Connect LangGraph agents to multiple MCP servers
Use the Claude Agent SDK's native MCP integration
A2A Protocol — Google-led agent-to-agent communication standard
DATA ENGINEERING CODING AGENT CAPSTONE — multi-agent Data Engineering Coding Agent using LangGraph + Claude Agent SDK with MCP servers exposing Databricks workspaces, Fabric workloads, Spark clusters, and PostgreSQL
Agent generates PySpark code from natural language, drafts DLT pipelines with proper quality expectations, debugs Spark job failures, proposes data quality remediations, and automates the boilerplate that DEs spend half their week on
Frontend with Streamlit or React, backend with FastAPI, observability via LangSmith, and human approval gates for every production-touching action. This is the named project for the entire Data Engineering & AI Agents programme.
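The human approval gate described for the capstone can be illustrated with a minimal sketch. This is not the programme's implementation: the names `ProposedAction`, `ApprovalGate`, and `deny_all` are hypothetical, and the approver here is a stand-in function where a real tool would prompt a reviewer in the Streamlit or React frontend.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedAction:
    """One step an agent wants to take (hypothetical shape for illustration)."""
    description: str
    touches_production: bool
    run: Callable[[], str]


@dataclass
class ApprovalGate:
    """Runs safe actions directly; asks a human before production-touching ones."""
    approver: Callable[[ProposedAction], bool]
    audit_log: List[str] = field(default_factory=list)

    def execute(self, action: ProposedAction) -> str:
        # Production-touching actions run only if the approver says yes.
        if action.touches_production and not self.approver(action):
            self.audit_log.append(f"REJECTED: {action.description}")
            return "skipped"
        self.audit_log.append(f"RAN: {action.description}")
        return action.run()


def deny_all(action: ProposedAction) -> bool:
    """Stand-in approver (assumption): rejects everything it is asked about."""
    return False


gate = ApprovalGate(approver=deny_all)
draft = ProposedAction("draft PySpark job", False, lambda: "drafted")
deploy = ProposedAction("deploy job to prod cluster", True, lambda: "deployed")

draft_result = gate.execute(draft)    # no approval needed, runs immediately
deploy_result = gate.execute(deploy)  # blocked because the approver said no
```

Keeping the approver as an injected callable means the same gate can be exercised in tests with a stub and wired to a real review step in deployment.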
Tools you'll master

32+ data engineering & AI tools, one production project.

Spark
Kafka
Flink
Airflow
Prefect
Dagster
dbt
Snowflake
Databricks
BigQuery
Redshift
Trino
Iceberg
Hudi
Delta Lake
S3
PostgreSQL
MongoDB
Kinesis
Pub/Sub
Data Contracts
Great Expectations
Monte Carlo
Lakehouse
OpenAI
LangChain
LangGraph
Docker
Kubernetes
Terraform
AWS
Cursor AI
Real-time projects

You don't watch videos. You ship software.

Three full-production projects, each threaded through the entire curriculum. By the final project, you've built the whole stack around them.

Hero project · weeks 3–12

Production lakehouse + streaming pipeline + AI agent

Ship a full lakehouse on Iceberg/Delta, wire a Kafka + Flink streaming layer into it, orchestrate the whole stack on Airflow with data contracts, and bolt on a LangGraph augmentation agent.

01 Lakehouse on Iceberg/Delta: bronze/silver/gold layers, dbt models with tests + docs, partitioned + sorted for query performance.
02 Streaming layer: Kafka topics, Flink stateful processing, exactly-once writes to the lakehouse, late-data handling.
03 Orchestration on Airflow/Dagster with data contracts, Great Expectations data-quality gates, lineage in the catalog.
04 AI augmentation agent: a LangGraph agent that profiles tables, drafts test cases, explains lineage, and answers analyst questions over the warehouse.
Outcome: 4× faster pipeline build
Data SLA: 99.9%
Reviewer: Data Platform panel
Spark · Kafka · Iceberg · dbt · LangGraph
Enterprise · weeks 6–11

Streaming CDC pipeline

Build a Postgres → Debezium → Kafka → Flink → Iceberg CDC pipeline with exactly-once semantics, schema-registry contracts, and Monte Carlo monitoring.

Debezium · Kafka · Flink · Iceberg
Real-time · weeks 8–12

Self-tuning lakehouse agent

Stand up a LangGraph agent that watches table-level metrics — latency, freshness, cost — auto-files dbt issues, drafts fixes, and benchmarks query plans on Trino.

LangGraph · Trino · dbt · Monte Carlo
Project · weeks 11–12

Your AI data platform in a real partner org.

Pick a real partner data problem. Deploy a production lakehouse + streaming pipeline + AI agent — Iceberg storage, Flink processing, dbt models, LangGraph augmentation — into a partner team that's running it for real users.

Download the real-world project
Full scope, sample partner orgs, weekly milestones, and grading rubric — PDF, 14 pages.
2026: 220+ deployed · 76% → placement offers
Your instructor

Taught by engineers who shipped agentic AI to production.

MK
Manikanta Kona
Founder, Digital Lync · Principal Data Engineering Architect
Spark · Kafka · Airflow · dbt · Iceberg · Snowflake · LangGraph
"A 2026 data engineer doesn't stop at moving rows. They ship a lakehouse you can stake an SLA on, a streaming layer that survives a bad partition key, orchestration that an on-call engineer doesn't fight with, and an LLM agent that answers analyst questions over the warehouse. That's the bar I teach to, every cohort."
15 yrs
DATA & AI
2,400+
LEARNERS
4.9/5
RATING

Manikanta is the founder of Digital Lync and brings 15 years of applied data engineering from AT&T, Salesforce, Cox Communications, and Broadcom — where he led lakehouse, streaming, and orchestration platforms for Fortune-500 banks, telcos, and insurers. Most recently he architected production data platforms that pair Iceberg/Delta lakehouses, Flink streaming, and dbt models with a LangGraph augmentation layer that explains lineage and drafts test cases for analyst teams.

His classes get you two things other programs don't give you: a founding architect who still ships production data platforms, and a curriculum rewritten every quarter to match what hiring managers actually ask about — credentials like AWS Data Engineer Associate, Databricks Data Engineer Professional, Snowflake SnowPro, Confluent Kafka Developer, and dbt Analytics Engineer included. M.S. in Engineering, Purdue University.

RK
Ravi Krishna
Chief Technologist, Digital Lync · Data Platform & Streaming Lead
Spark · Kafka · Flink · dbt · Iceberg · Airflow · LangGraph
"A data platform earns its keep when the lakehouse is auditable, the streaming layer keeps its exactly-once promise on a bad day, the data SLAs hold under real load, and an LLM agent answers analyst questions before the meeting starts. I teach the unglamorous parts that make all of that real in production."
10 yrs
DATA PLATFORMS
1,800+
LEARNERS
4.8/5
RATING

Ravi is Chief Technologist at Digital Lync, where he leads the data platform and streaming practice. After ten years building and running production lakehouses and streaming pipelines across enterprise telecom, banking, and SaaS, he stepped into the Chief Technologist seat to wire Spark, Kafka, Flink, Iceberg, and dbt into the way data teams actually work: data contracts that hold under schema drift, freshness SLAs that on-call engineers trust, and a LangGraph augmentation layer that explains lineage to the analysts who own the numbers.

His data platform modules are built from real production post-mortems, not slide decks. Expect to leave with working Iceberg lakehouses, Flink streaming jobs with exactly-once semantics, dbt models with tests and docs, Airflow orchestration with data contracts, and a LangGraph augmentation agent wired into the warehouse. Ten years across enterprise data platforms — Hyderabad-based, hands-on, and known for the unglamorous parts of data engineering that everyone else skips.

HIRING PARTNERS · INDUSTRY VOICES

What data engineering employers say about Digital Lync grads.

Real feedback from data and platform leaders at AI-first companies and the firms hiring our Data Engineering + AI graduates.

Digital Lync grads ramp 40% faster on data platform deploys than typical data engineering hires. Best Data Engineering + AI pipeline in India.

Aakash Mehta, Engineering Director, Microsoft

We've onboarded 80+ Digital Lync alumni in 18 months. Lowest ramp time we've seen for lakehouses, streaming pipelines, and AI augmentation practices.

Anita Sharma, Senior Manager, Deloitte

The Data Engineering + AI programme is comprehensive — Spark, Kafka, dbt, LangGraph augmentation. Grads come pre-trained for production data platforms with AI.

Rahul Bhatt, Solutions Lead, Mphasis

Their lakehouse + streaming track produces engineers who ship production-grade pipelines on day one. Rare combination of engineering rigor and platform craft.

Deepak Pillai, Senior Architect, TCS

What sets Digital Lync apart is the AI augmentation layer baked into the data engineering track. Our enterprise clients ask for exactly this profile.

Suresh Menon, Practice Lead, Accenture

Their Databricks Data Engineer Pro + dbt Analytics Engineer prep is rigorous, and the shipped project — lakehouse, streaming pipeline, AI augmentation agent — is what closes interviews for us.

Vikram Iyer, Director, Infosys

Digital Lync's data engineers ship reliable pipelines twice as fast in the first 90 days. Our internal platform metrics back this up clearly.

Lakshmi Nair, VP Engineering, Wipro

Best Data Engineering + AI pipeline we've sourced from in India. Their projects are real shipped pipelines, not slide demos.

Karthik Subramanian, Engineering Director, Cognizant

Strong Spark and lakehouse engineering foundation. Their Data Engineering grads need almost zero ramp time on enterprise data platform engagements with us.

Arun Joshi, Practice Director, Capgemini

We've placed 40+ Digital Lync alumni across our data and watsonx engineering teams. Strong fundamentals, sharp on data SLAs and lineage.

Sanjay Verma, Talent Director, IBM

Lakehouses + AI augmentation is exactly the talent gap we've been struggling to close. Digital Lync is filling it for us reliably.

Anjali Desai, Practice Head, LTIMindtree

Their Data Engineering track delivers engineers who navigate Spark, Kafka, and dbt on customer engagements unsupervised.

Ramesh Iyer, Senior Manager, Tech Mahindra

Hired 25+ Digital Lync graduates for our data engineering practice. Strong on Spark, sharp on Kafka/Flink, fluent in dbt.

Geetha Pillai, Talent Acquisition Lead, Cyient

Digital Lync grads who blend lakehouses with Azure OpenAI augmentation land production-ready on day one. Rare combination, well-trained.

Priya Reddy, Talent Lead, Microsoft

03 Program certifications

An Agent‑Ready credential, not a participation trophy.

Digital Lync · Institute Certificate
Agent‑Ready Data Engineer
Presented to
Spandana Bala
For the successful design, build, and production deployment of a data platform — Iceberg lakehouse, Kafka/Flink streaming, dbt models, and an AI augmentation agent — evaluated against the Databricks Data Engineer Pro, dbt Analytics Engineer, and AWS Data Engineer Associate credential rubrics.
Manikanta Kona
CEO · Digital Lync
AGENT
READY
2026
01
Industry‑recognized
Co‑branded with the data engineering community and mapped to Databricks Data Engineer Pro and dbt Analytics Engineer credentials — names that hiring managers already scan for on resumes.
02
Project artifact included
Every certificate carries your shipped project — Iceberg lakehouse, Flink streaming pipeline, dbt models, AI augmentation agent — with a link to the live partner-org deployment. Proof, not a promise.
03
Enhanced skill validation
Graded against the 2026 Agent‑Ready rubric: lakehouse design, streaming pipelines, dbt models, data contracts, quality gates & lineage. No pass/fail — a level 1‑5 band.
04
Verifiable on a public URL
Each credential has a public verification page recruiters can check in 30 seconds, with no PDF back‑and‑forth.
04 Job placement support

Your first Data Engineer offer isn't a lottery ticket. It's a built process.

GitHub, LinkedIn, resume — and most importantly, warm intros into data-heavy SaaS and platform teams. Our placement team works your search like an account, not a helpdesk.
01 / GITHUB & PORTFOLIO

A portfolio, not a graveyard.

Guidance on building a portfolio that showcases your lakehouse design, streaming pipeline, dbt models, AI augmentation agent, and a public verification URL — reviewed 1:1, not via template.

02 / RESUME PREP

Rewrite, don't proofread.

A one-page resume rebuilt around the data platforms you shipped (lakehouses, streaming pipelines, dbt models), the partner-org project, and the business outcome. Reviewed by engineers who've read 10,000+ resumes.

03 / LINKEDIN + INTROS

Where most opportunities actually live.

Profile tuning plus direct warm introductions into data-heavy SaaS and platform teams — Microsoft, Databricks, Snowflake, dbt Labs, Confluent, Fivetran, AWS, Anthropic, Hugging Face, Scale AI, Stripe, Razorpay, plus services that staff data platform teams (Deloitte, Accenture, Cognizant, TCS). You leave with recruiter contacts, not a generic "good luck."

Data Engineering alumni

Hundreds of data engineering careers launched — here are eight.

SB
Spandana Bala
Data Engineer
Hyderabad · India
Now at · Microsoft
NV
Naveen Vedala
Senior Data Engineer
Hyderabad · India
Now at · Atlassian
TA
Tejashwini Addla
Staff Data Engineer (Streaming)
Hyderabad · India
Now at · Salesforce
TD
Tharunesh Dillikar
Principal Data Engineer
Seattle · United States
Now at · Confluent
MM
Mujahed Mohammed
Lakehouse Architect
Hyderabad · India
Now at · Databricks
BK
Bhargav Kumar Murala
Streaming Platform Lead
Hyderabad · India
Now at · Adobe
SL
Sai Manasa Leburi
Analytics Engineering Lead
New York · United States
Now at · Hugging Face
RD
Rahul Dhamma
Director of Data Platform
Hyderabad · India
Now at · dbt Labs
Our locations

Come chat with us — over coffee, or over Zoom.

One flagship campus in Hyderabad, plus online Data Engineering cohorts running on IST and PST.

Flagship campus
Hyderabad
2nd Floor, Hitech City Road · Above Domino's · Opp. Cyber Towers, Jai Hind Enclave · Hyderabad, Telangana
Call
+91 90003 29956
US desk
+1 858 666 6719
Hours
Mon–Sat · 9am–9pm
Online class
Global
Weekend and evening Data Engineering cohorts running on IST and PST. Every online cohort ships the same project as the on‑campus track: Iceberg lakehouse, Flink streaming, dbt models, and AI augmentation.
Timezones
IST & PST
Format
Live + 1:1 mentorship
Next class
25 May 2026
FAQ

Questions we actually get — answered honestly.

Straight answers on prerequisites, the data platform stack, certifications, and placement. If something's missing, book a 20-minute advisor call — no slides, no pitch.

Do I need a CS background or prior SQL/Spark experience?
No on both counts. Roughly 40% of every class comes from non-CS streams — mechanical, electrical, BCom, BBA, and self-taught coders. Weeks 1–2 cover the SQL fundamentals, distributed compute, and pipeline design from scratch. What you do need: consistency and 12–15 hours a week.
Will I actually ship production pipelines, or only do tutorials?
You actually ship. Every learner builds a lakehouse on Iceberg/Delta with bronze/silver/gold layers, a Kafka → Flink streaming layer with exactly-once semantics, dbt models with Great Expectations gates, and a LangGraph agent that augments the platform. The project runs in a partner org — not a notebook.
Which tools, frameworks, and AI models will I use?
Compute: Spark, Flink, Trino, Databricks. Streaming: Kafka, Kinesis, Pub/Sub. Storage: Iceberg, Hudi, Delta Lake, S3, Snowflake, BigQuery, Redshift. Orchestration: Airflow, Prefect, Dagster, dbt. Quality: Great Expectations, Monte Carlo, Data Contracts. AI: OpenAI, LangChain, LangGraph.
Will I prep for Databricks, Microsoft, and AWS data engineering certs?
Yes. The curriculum is mapped to Databricks Data Engineer Associate and Professional, DP-700, DP-600, and AWS Data Engineer. We run two full mock exams and reimburse the voucher fee on first-attempt pass.
What's the time commitment per week?
Plan for 12–15 hours: 2 live classes × 2 hours, 1 lab × 3 hours building pipelines on your training cluster, and ~5 hours of project work (Spark, Kafka, dbt). Saturday office hours with the TA team are optional, but most learners use them.
Is placement support really 1:1, and which companies hire data engineers?
Yes. You get a dedicated placement advisor from week 8, not a helpdesk. Hiring partners include Microsoft, Adobe, Salesforce, Atlassian, Notion, Linear, Anthropic, Hugging Face, Databricks, Snowflake, Stripe, Razorpay, Freshworks, Zoho, and Postman. Resume, LinkedIn, mock interviews, and warm intros are all handled one-to-one.
Online, weekend, or on-campus?
All three. On-campus at the Hyderabad flagship, live online (IST and PST cohorts), and a weekend track for working professionals. Every format ships the same project (Iceberg lakehouse, Flink streaming, dbt models, AI augmentation); only the schedule changes.
What if I fall behind, or can't continue mid-class?
Freeze your seat for up to 90 days and rejoin the next class — no extra fee. TAs run catch-up sessions every Saturday for anyone more than a week behind, and recordings of every live session are available for the lifetime of your account.

Still have a question? Talk to an advisor — no slides, no pitch.

Class DEA-025 starts 1 Jun 2026.
40 seats. 12 already claimed.

Book a 20-minute advisor call. We'll walk through the curriculum, match it to your current role, and show you two real projects from class 022.

CLASS DEA-025 · 3 MONTHS · STARTS 01 JUN · ONLY 13 SEATS LEFT · 17 / 30 CLAIMED
