Class 014 · DATA ENGINEERING & AI AGENTS · PYSPARK + DATABRICKS + FABRIC

Data Engineering + AI Agents

Master end-to-end Data Engineering with Agentic AI. Build medallion pipelines on PySpark + Delta Lake with Databricks (DLT, Unity Catalog, MLflow), ship a complete Microsoft Fabric implementation across OneLake and Real-Time Intelligence, and deploy a Data Engineering Coding Agent.

Duration: 3 months

Modules: 70+

Class rating: 4.7/5

Enrolled: 100k+

What you'll ship by week 14

DE-014 · 13 JUL

A production PySpark + Databricks medallion pipeline

PySpark DataFrames, Delta Lake Bronze/Silver/Gold layers, DLT with quality expectations, Unity Catalog, MLflow, and CI/CD with Databricks Asset Bundles.

A Microsoft Fabric end-to-end implementation

OneLake, Lakehouse with medallion architecture, Data Factory mirroring, Real-Time Intelligence, and Direct Lake in Power BI — DP-600 / DP-700 aligned.

A production data engineering toolkit

Airflow orchestration, Docker, pytest, GitHub Actions and Databricks Asset Bundles, plus Fabric CLI and REST APIs for production deployment.

A deployed Data Engineering Coding Agent

A production agent on LangGraph + Claude Agent SDK + MCP that generates PySpark code, drafts DLT pipelines, and proposes data quality remediations.

Where our Data Engineering alumni work

Microsoft

Amazon

Salesforce

ServiceNow

Deloitte

Infosys

Accenture

TCS

Wipro

Capgemini

Cognizant

HCL

What you leave with

Four things every Data Engineering grad walks away with.

Agent-Ready DE skills

Triple-platform DE depth — PySpark, Databricks (Delta Lake, DLT, Unity Catalog), and Microsoft Fabric (OneLake, DP-600/DP-700) — plus an LLM agent layer built with LangGraph, Claude Agent SDK, and MCP.

A shipped project

A production-deployed Data Engineering Coding Agent that drafts PySpark, debugs Spark jobs, and proposes data-quality fixes via MCP servers, running with human approval gates and a public verification URL.

Verifiable credential

2026 Agent-Ready rubric mapped to Databricks DE Associate + Professional, DP-700, DP-600, and AWS Data Engineer, graded 1–5, with a public verification URL recruiters can check in 30 seconds.

Direct placement pipeline

GitHub + LinkedIn portfolio rewrite, DE-tuned resume rebuild, and warm intros into our 1,000+ hiring partners actively staffing Data Engineer, Databricks Engineer, and Fabric Engineer roles.

3 MONTHS · FOUR PHASES · ONE DE AGENT

From “loads a CSV” to “ships AI-native data platforms.”

Weeks 1–2 · Foundations

IT & AI Foundations + Python for DE

Application lifecycle, Agile/Scrum, and cloud computing models
Introduction to AI, ML, Generative AI, and Agentic AI
Python fundamentals and data structures for engineers
Advanced Python — OOP, decorators, generators, and packaging

YOU SHIP: A production-quality Python codebase with OOP, decorators, and proper packaging — the DE toolkit ready for Spark and Fabric.

Weeks 3–6 · Analytics + Data Layer

Power BI + PostgreSQL for Data Engineers

Power BI Desktop, Power Query, and data prep transformations
Data modelling — star/snowflake schemas, DAX, time intelligence
PostgreSQL DDL, DML, JOINs, window functions, and CTEs
PL/pgSQL stored procedures, triggers, and query optimisation

YOU SHIP: A Power BI dashboard suite plus a PostgreSQL analytical database — connected and indexed to feed the Spark and Fabric pipelines.

Weeks 7–12 · Data Engineering Core (PySpark + Databricks + Fabric)

PySpark + Databricks + Microsoft Fabric

PySpark — RDDs, DataFrames, Spark SQL, and structured streaming
Production engineering with medallion pipelines, Airflow, and Docker
Databricks — Delta Lake, Unity Catalog, Delta Live Tables, MLflow
Microsoft Fabric — OneLake, Lakehouse, Data Factory, Real-Time Intelligence

YOU SHIP: A PySpark + Databricks medallion pipeline, a Microsoft Fabric end-to-end implementation, and an Airflow + Docker CI/CD toolkit.

Weeks 12–14 · GenAI + Agentic AI

Master the 2026 GenAI + Agentic AI stack — and ship a Data Engineering Coding Agent that drafts PySpark pipelines, debugs Spark jobs, and proposes data quality fixes autonomously.

Engineer with LLM APIs from OpenAI, Anthropic, Google GenAI, and DeepSeek. Master prompt engineering (zero-shot, few-shot, CoT, ReAct) and context engineering — the 2026 frontier discipline. Build production RAG pipelines with ChromaDB and pgvector over your data dictionary, pipeline documentation, and historical Spark job logs. Master the 2026 production agent stack — LangGraph 1.0 (#1 production default), Claude Agent SDK (#2 MCP-native), CrewAI (#3 multi-agent crews). Wire it all through the Model Context Protocol (MCP) — 200+ server implementations, 97M+ monthly SDK downloads. Final project — a deployed Data Engineering Coding Agent with MCP servers exposing your Databricks workspaces, Fabric workloads, Spark clusters, and PostgreSQL data layer. The data engineer’s force multiplier.

Partner orgs (2026): 62

DE projects deployed: 280+

Placement offers: 91%

Course curriculum

Eight sections. 70+ modules. The AI-native data engineering stack.

Fundamentals of IT & AI

Foundational track building the conceptual bedrock every data engineer needs — application lifecycle, Agile/Scrum, computing infrastructure, AI/ML/Generative/Agentic AI fundamentals, and real-world digital systems. Sets the context for everything that follows in the DE + AI engineering stack.

5 MODULES
SECTION 1

Application fundamentals — what applications are, their types, web architecture

Web Technologies — Frontend (HTML, CSS, JavaScript, React) and Backend (Python, Java, Node.js)

Database Systems — SQL (PostgreSQL, MySQL) and NoSQL (MongoDB)

The seven SDLC phases — Planning, Analysis, Design, Implementation, Testing, Deployment, Maintenance

The data engineer sits between application data and analytical workloads — knowing how applications generate data makes you a better pipeline architect

Methodology Evolution — Waterfall vs Agile, the Agile mindset

Popular frameworks — Scrum, Kanban, Extreme Programming (XP)

Scrum Roles — Product Owner, Scrum Master, Development Team (including data engineers)

Scrum Events — Sprint Planning, Daily Scrum, Sprint Review, Sprint Retrospective

Scrum Artifacts — Product Backlog, Sprint Backlog, Increment deliverables

User Stories — Epics, Themes, Acceptance Criteria

Estimating user stories with story points

Backlog management with Google Sheets and Azure Boards

CPU vs GPU — when each matters for analytics workloads

Memory, storage, and network basics

Why these matter for Spark cluster sizing and Fabric capacity planning

IaaS — Infrastructure as a Service (e.g., Azure VMs, AWS EC2)

PaaS — Platform as a Service (e.g., Azure SQL Database)

SaaS — Software as a Service (e.g., Power BI Service, Databricks SaaS)

Cloud data warehouses (Snowflake, BigQuery, Redshift, Synapse, Fabric Warehouse) — comparative analysis

AI is reshaping data engineering in 2026 — from AI-generated PySpark to autonomous pipeline drafting to RAG-powered data catalogs

Machine Learning — algorithms that improve through experience

Deep Learning — neural networks for complex pattern recognition

Generative AI — systems that generate code, pipelines, narratives

Large Language Models — LLMs that draft PySpark, debug Spark jobs, propose data quality fixes

Agentic AI — autonomous systems that plan, reason, act, and learn — the future of DE

CRM systems — Salesforce, Dynamics — typical data sources for analytics pipelines

HRMS — Workday, SAP SuccessFactors — sensitive data with strict governance

Retail & E-Commerce — high-volume transactional + clickstream pipelines

Healthcare Applications — HIPAA/DPDP compliance for sensitive health analytics

Domain depth multiplies your DE salary — BFSI, healthcare, retail DEs command premium rates

Python for Data

The dominant language for data engineering. Master Python syntax, data structures, and advanced programming concepts essential for AI and data work. 10 modules from environment setup through advanced OOP — the language fluency that powers PySpark, Databricks notebooks, and Fabric User Data Functions.

10 MODULES
SECTION 2

Python interpreter installation for Windows and Mac

Visual Studio Code + Jupyter for data engineering workflows

Variables, identifiers, naming conventions

Data types, operators, type conversion

Control flow — if/elif/else, while, for, match-case

break, continue, pass statements

String fundamentals — indexing, slicing, concatenation

f-strings and .format() for log messages and pipeline metadata

String methods — case conversion, search, trimming, replacement

.split() and .join() for text data preprocessing

Critical for parsing CSV, JSON, and log data in pipelines

Lists — creation, indexing, slicing, modification

List comprehensions for elegant data transformation

Sorting, reversing, copying patterns

Tuples — packing and unpacking

Performance advantages over lists

Schema definition patterns with StructType (preview of PySpark)

Dictionaries — creation, access, operations

Dictionary comprehensions

Nested dictionaries for structured data

Essential for representing JSON pipeline configs

Sets — UUU properties (Unique, Unordered, Unindexed)

Mathematical operations — union, intersection, difference

Use cases in deduplication

Collections module — namedtuple, Counter, defaultdict, deque

Iterators & Generators — memory-efficient streaming for large data

Generator expressions and pipelines

Functional programming — lambda, map, filter, reduce

Generators are the data engineer's tool for larger-than-memory data processing
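To make the generator idea above concrete, here is a minimal sketch of a lazy pipeline that totals a CSV column while holding only one row in memory at a time. The function names and the sample data are illustrative assumptions, not course material.

```python
import csv
import io
from typing import Iterable, Iterator


def read_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed CSV row at a time instead of loading the whole file."""
    yield from csv.DictReader(lines)


def valid_amounts(rows: Iterable[dict]) -> Iterator[float]:
    """Skip rows whose amount is missing or not numeric."""
    for row in rows:
        try:
            yield float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue


def total_amount(lines: Iterable[str]) -> float:
    """Chain the generators; rows flow through one by one."""
    return sum(valid_amounts(read_rows(lines)))


# Hypothetical sample data; in practice `lines` would be an open file handle.
sample = io.StringIO("order_id,amount\n1,10.5\n2,not_a_number\n3,4.5\n")
print(total_amount(sample))
```

Because each stage is a generator, swapping the in-memory sample for a multi-gigabyte file handle would not change the memory profile of the pipeline.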

Function definition, parameters, return values

Default arguments, *args, **kwargs

Variable scope (LEGB rule)

First-class functions and higher-order patterns

Recursion

Type hints (Python 3.5+) — essential for self-documenting pipeline code

Documenting functions with docstrings

Built-in modules, user-defined modules, packages

pip for package management

requirements.txt for reproducible builds

Virtual environments for isolated DE projects

Reproducibility is non-negotiable in production DE — invest here

CRUD operations with open()

File modes and pathlib

Directory management with os and shutil

Python's csv module — reader, writer, DictReader, DictWriter

JSON operations — dump(), dumps(), load(), loads()

The two most common data formats for DE ingestion

Exception Handling — robust error handling for unreliable data sources, retry patterns

Decorators — for logging pipeline runs, timing functions, caching expensive operations

Generators deep dive — memory efficiency for streaming large datasets

Context Managers — proper resource management for database connections, file handles, Spark sessions

Four patterns that separate scripting Python from production data engineering
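Three of these patterns combine naturally in a single pipeline step. The sketch below, using only the standard library, wraps a flaky extract in a retry decorator and times it with a context manager. The names `retry`, `timed`, and `flaky_extract` are hypothetical and chosen for illustration.

```python
import functools
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def retry(attempts: int = 3, delay_seconds: float = 0.0):
    """Decorator: re-run a flaky step up to `attempts` times, then re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    log.warning("%s failed (attempt %d/%d): %s",
                                func.__name__, attempt, attempts, exc)
                    if attempt == attempts:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


@contextmanager
def timed(step_name: str):
    """Context manager: log how long a step took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("%s took %.3fs", step_name, time.perf_counter() - start)


calls = {"count": 0}


@retry(attempts=3)
def flaky_extract() -> list:
    """Hypothetical source that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return [1, 2, 3]


with timed("extract"):
    rows = flaky_extract()
```

A production version would likely catch only specific exception types and use a growing delay between attempts; both are left out here to keep the sketch short.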

Classes, objects, methods, special methods

Instance variables vs class variables

Encapsulation — access modifiers control data visibility

Inheritance — single, multi-level, and multiple inheritance

Abstraction — abstract classes and methods

Polymorphism — method overriding and duck typing

Custom Spark UDFs and PySpark schema classes use OOP extensively — master it before Section 5

SQL for AI & Data

The data backbone of every analytics pipeline. Five modules covering PostgreSQL from foundations through programming with PL/pgSQL, triggers, and query optimization — the data layer that feeds your Spark and Fabric pipelines.

5 MODULES
SECTION 3

Databases, DBMS, RDBMS — concepts and terminology

ACID properties — Atomicity, Consistency, Isolation, Durability

PostgreSQL setup, psql, pgAdmin 4, DBeaver

Data types — numeric, character, date/time, boolean, JSON, arrays

Constraints — PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK, DEFAULT

SELECT statements and column projection

WHERE clauses with operators and conditions

Built-in functions — string, numeric, date, conditional

Aggregates — COUNT, SUM, AVG, MIN, MAX

GROUP BY and HAVING

Window functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

JOIN operations — INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF

Subqueries — scalar, row, table subqueries

CTEs (Common Table Expressions) — readable analytical queries

Recursive CTEs for hierarchical data

Set operators — UNION, UNION ALL, INTERSECT, EXCEPT
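As an illustration of the recursive CTE topic above, the following sketch walks a small reporting hierarchy. It assumes SQLite (via Python's built-in `sqlite3`) as a stand-in so the example runs without a server; the `WITH RECURSIVE` query itself uses syntax that PostgreSQL also accepts. The table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Asha', NULL),
        (2, 'Ben', 1),
        (3, 'Chen', 2),
        (4, 'Dina', 1);
""")

query = """
    WITH RECURSIVE org AS (
        -- Anchor member: the root of the hierarchy
        SELECT id, name, 0 AS depth
        FROM employees
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: everyone reporting to a row already in org
        SELECT e.id, e.name, org.depth + 1
        FROM employees AS e
        JOIN org ON e.manager_id = org.id
    )
    SELECT name, depth FROM org ORDER BY depth, name;
"""
rows = conn.execute(query).fetchall()
for name, depth in rows:
    print("  " * depth + name)
```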

DML — INSERT, UPDATE, DELETE patterns

Transactions — atomicity for data integrity

ALTER TABLE for schema evolution

Indexes — B-tree, Hash, GiST, GIN

Views — virtual tables, materialized views

Stored functions with CREATE FUNCTION

PL/pgSQL — variables, control structures, exception handling

Triggers — BEFORE, AFTER, INSTEAD OF

ER modelling, normalization (1NF, 2NF, 3NF)

OLTP vs analytics workload patterns

Star schema for analytics

Query plan analysis with EXPLAIN and EXPLAIN ANALYZE

Index strategies — selectivity, covering indexes, multi-column indexes

VACUUM, ANALYZE, partitioning

DEs read query plans daily — invest in EXPLAIN ANALYZE fluency

Power BI for Data Analysis

The BI tool every data engineer must understand. Power BI sits at the end of your Lakehouse/Warehouse pipelines — knowing how it consumes data makes you a better pipeline architect. 10 modules grouped into four progressive units.

10 MODULES
SECTION 4

BI fundamentals and modern analytics approaches

Power BI components — Desktop, Service, Mobile, Gateway

Interface navigation, workspace setup, first report creation

Desktop versus Service capabilities

File sources, database connections, cloud services, web sources

Connection Modes — Import, DirectQuery, Live Connection

Direct Lake mode (Fabric) preview — query Delta tables at speed without import

Power Query interface and applied steps

Data profiling and quality assessment

Essential transformations — filtering, splitting, merging

Reshaping — pivot, unpivot, grouping

Combining queries — append and merge operations

Star schema versus snowflake schema

Creating and managing table relationships

Primary and foreign keys

Hierarchies and date dimension tables

Data model optimisation strategies

Data visualisation principles and chart selection

Core visualisations, slicers, bookmarks, drill-through

Mobile optimisation and data storytelling

DAX syntax and structure

Calculated columns vs measures

Aggregation, logical, text, date/time functions

CALCULATE and FILTER functions

Creating KPIs and business metrics

Year-over-year, quarter-over-quarter comparisons

Custom calendar handling

Iterator functions — SUMX, AVERAGEX, COUNTX

ALL, ALLEXCEPT for filter manipulation

AI visuals — Q&A, Key Influencers, Decomposition Tree

Workspaces, apps, sharing

Subscriptions and alerts

Row-Level Security (RLS) — dynamic security with USERPRINCIPALNAME()

Sensitivity labels

DAX performance tuning, composite models

Foundation for Microsoft Certified: Power BI Data Analyst Associate (PL-300)

PySpark Foundations

The heart of distributed data engineering. Apache Spark is the engine powering modern data engineering — and PySpark is its dominant interface. 9 modules taking you from Big Data fundamentals through RDDs, DataFrames, Spark SQL, structured streaming, performance optimisation, and finally GenAI pipelines on Spark. The foundation Databricks and Fabric build on.

9 MODULES
SECTION 5

Volume, Velocity, Variety — the three pillars of Big Data

Why traditional tools fall short at scale

Modern data architecture evolution

Driver, Executors, and Cluster Manager working in harmony

Spark vs Hadoop — why in-memory processing gives Spark a decisive edge

Execution Model — DAG, Stages, Tasks, and Shuffle explained clearly

Deployment Modes — Local, Standalone, YARN, Kubernetes

Setting up PySpark — Local (pip install pyspark), Docker, Databricks, Microsoft Fabric

SparkSession and SparkContext — the gateway to every Spark operation

Spark Web UI — monitor Jobs, Stages, Storage, and Environment in real time

Creating DataFrames from CSV/JSON/Parquet files, RDDs, and Python lists

Schema definition with StructType & StructField

Core Transformations — select, rename, drop columns

Filter with filter() and where()

Add and transform with withColumn()

orderBy() and sort() for ordering

dropna(), fillna(), replace() for handling nulls and gaps

show(), collect(), take(), first() — choosing the right method for the right context

Registering DataFrames as temporary views and global temporary views

Running SQL queries with spark.sql()

Choosing between DataFrame API and Spark SQL — when each shines

groupBy() and aggregation functions — count, sum, avg, min, max

Multiple aggregations with agg()

pivot and unpivot operations

Window specification — partitionBy, orderBy, frame definitions

Ranking functions — row_number, rank, dense_rank, percent_rank

Aggregate windows — running totals, moving averages

Essential for time-series and cohort analysis at scale

Join types — inner, left, right, full outer, cross, semi, anti

Broadcast joins — when and how to use broadcast() for small dimension tables

Sort-merge joins vs hash joins — the Spark optimiser's choices

Join pitfalls — data skew, shuffle costs, cartesian products

Self-joins for hierarchical data

Performance debugging with EXPLAIN and the Spark UI

CSV, JSON, Parquet, ORC, Avro — when to use each

Schema inference vs explicit schemas

Compression options (snappy, gzip, lz4, zstd)

Connecting to PostgreSQL, MySQL, SQL Server via JDBC

Predicate pushdown and partitioning JDBC reads

Coalesce vs Repartition — when each is right

Write strategies and partition pruning

Delta Lake — read, write, merge operations

ACID transactions on data lakes

Schema enforcement — rejecting incompatible writes

Schema evolution — automatically updating schemas

Auto Loader — incrementally ingest new files from cloud storage as they arrive

File notifications vs directory listing — ideal for streaming ingestion patterns

Read Stream → Transform → Write Stream → Checkpoint

Micro-batch vs continuous processing

Idempotent writes and exactly-once semantics

Watermarks for handling late-arriving data

Stateful aggregations

Stream-static joins

Native Apache Kafka source and sink

Subscribe and assign patterns

Trigger types — once (single batch and stop), continuous (true streaming), processingTime (micro-batch at fixed cadence)

Catalyst Optimiser — how Spark plans your query

Logical vs physical plans

Cost-based optimisation

Tungsten Engine — whole-stage code generation

Off-heap memory management

Adaptive Query Execution (AQE) — dynamic shuffle partition coalescing

Skew join handling

Dynamic switching of join strategies

cache() vs persist() storage levels

When caching helps vs hurts

Data skew mitigation — salting techniques

Skew join hints
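The salting technique listed above can be sketched without a cluster. This plain-Python example (not Spark code) shows the two-stage aggregation idea: add a salt so a hot key is split across several shuffle keys, aggregate per salted key, then combine. The data, `NUM_SALTS`, and the use of a rotating rather than random salt are assumptions made to keep the result deterministic.

```python
from collections import Counter, defaultdict

NUM_SALTS = 4

# Hypothetical skewed input: (key, value) pairs where "hot" dominates.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Without salting, every "hot" row shares one shuffle key,
# so a single task would have to process all 1000 rows.
rows_per_key = Counter(key for key, _ in records)

# Stage 1: attach a salt and aggregate per (key, salt).
# Real jobs usually pick the salt at random; a rotating index is used here.
partial = defaultdict(int)
for i, (key, value) in enumerate(records):
    partial[(key, i % NUM_SALTS)] += value

# Stage 2: drop the salt and combine the partial results.
final = defaultdict(int)
for (key, _salt), subtotal in partial.items():
    final[key] += subtotal

print("largest group before salting:", max(rows_per_key.values()))
print("largest group after salting: ", max(partial.values()))
print("final totals:", dict(final))
```

The totals are unchanged, but the largest unit of work shrinks by roughly a factor of `NUM_SALTS`, which is the point of the technique.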

Medallion Architecture — Bronze, Silver, Gold layers — the canonical DE pattern

Quality expectations at each layer

spark-submit configuration

Cluster sizing — driver vs executor memory and cores

Dynamic allocation

Apache Airflow DAGs for Spark jobs

Dependencies, retries, SLAs

Sensors and operators for Spark

Docker for reproducible Spark environments

Docker Compose for local cluster simulation

pytest for PySpark

Mock SparkSession for fast tests

Testing transformations vs actions

MLlib pipelines — feature engineering, model training, evaluation

Saving and loading models

Calling LLM APIs from Spark UDFs

Distributed inference patterns

Cost optimisation for LLM-on-Spark

Embedding generation in Spark pipelines

Bulk vector loading to Pinecone, Qdrant, ChromaDB

Building RAG ingestion pipelines on Spark

Section Project — end-to-end PySpark medallion pipeline (Bronze → Silver → Gold) with structured streaming ingestion, Delta Lake writes, Airflow orchestration, unit tests, and Docker deployment

Databricks Mastery

Databricks is the world's leading Data + AI platform, built on Apache Spark with a Lakehouse architecture. It unifies data engineering, data science, ML, and BI workflows in a single collaborative environment. 11 modules covering everything from platform fundamentals to certification prep for Databricks Certified Data Engineer Associate + Professional.

11 MODULES
SECTION 6

Data Lake + Data Warehouse unified — the Lakehouse paradigm vs legacy architectures

Databricks on AWS, Azure & GCP — cross-cloud parity and differences

Workspace navigation — Catalog, Compute, Workflows, SQL, ML workloads

Notebook collaboration patterns

Compute options — All-Purpose vs Job Clusters

Serverless Compute overview

Databricks Runtime (DBR) versions

Notebooks — Python, SQL, Scala, R support

Collaborative notebooks and version control

Databricks Assistant — AI coding companion

Delta file format on top of Parquet

Transaction log (_delta_log)

ACID transactions on data lakes — the breakthrough

Optimistic concurrency control

Managed vs external Delta tables

Schema enforcement and evolution

Querying historical versions — Time Travel

VERSION AS OF and TIMESTAMP AS OF

VACUUM — cleaning up old files

OPTIMIZE — compacting small files

Z-Ordering for data skipping

Change Data Feed (CDF) — capture row-level changes

Shallow and deep clones — testing and disaster recovery

Unity Catalog provides a unified governance layer across all Databricks workspaces

Three-Level Namespace — Catalog → Schema → Table hierarchy

Cross-workspace data sharing

Fine-grained GRANT and REVOKE statements

Row-level and column-level security

Dynamic views for data masking

Automatic lineage tracking — table-to-table and column-level lineage

Configuring external locations

Storage credentials management

Built-in audit logs

Meeting SOC 2, HIPAA, GDPR compliance

DLT — declarative pipelines vs imperative Spark jobs

Materialised views and streaming tables

Quality expectations — EXPECT, EXPECT OR DROP, EXPECT OR FAIL

Data quality at pipeline runtime, not after the fact

Triggered vs continuous pipelines

Development vs production modes

DLT + Auto Loader for streaming ingestion
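The three expectation behaviours named above differ only in what happens to a failing row. The sketch below imitates them in plain Python so the semantics are easy to compare; `apply_expectation` is a hypothetical helper written for this illustration and is not the DLT API.

```python
from typing import Callable, Iterable

Row = dict


def apply_expectation(rows: Iterable[Row], name: str,
                      predicate: Callable[[Row], bool],
                      on_violation: str = "warn") -> list:
    """Hypothetical helper imitating the three expectation behaviours.

    "warn": keep every row and report the violation count (EXPECT)
    "drop": keep only rows that pass (EXPECT OR DROP)
    "fail": raise on the first failing row (EXPECT OR FAIL)
    """
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if on_violation == "fail":
            raise ValueError(f"expectation {name!r} failed for row {row}")
        if on_violation == "warn":
            kept.append(row)
    print(f"{name}: {violations} violation(s), action={on_violation}")
    return kept


def is_positive(row: Row) -> bool:
    return row["amount"] > 0


# Made-up Bronze rows, one of which breaks the rule.
bronze = [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}]

warned = apply_expectation(bronze, "positive_amount", is_positive, "warn")
dropped = apply_expectation(bronze, "positive_amount", is_positive, "drop")
```

Choosing between the three is a policy decision: warn while a rule is still being tuned, drop when bad rows are expected and harmless to lose, fail when downstream tables must never see a violation.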

Auto Loader — incremental file ingestion at scale

File notification mode vs directory listing

Schema inference and evolution in streaming

Structured Streaming on Databricks — production patterns

Checkpointing for fault tolerance

Multi-task jobs with task dependencies

Conditional execution and retry policies

Cron and triggered scheduling

Databricks CLI and REST API for automation

Apache Airflow integration

Monitoring Hub and cost optimisation

Classic vs Serverless SQL Warehouses

Query history and performance profiling

Power BI, Tableau & Excel connectivity

dbt integration

Photon engine for blazing performance

AI/BI Dashboards

Genie — conversational BI

MLflow — experiment tracking and model registry

Model versioning and stage transitions

Feature Store for reusable features

Online and offline feature stores

AutoML — rapid model prototyping

Automated feature engineering

scikit-learn, XGBoost, PyTorch, Hugging Face

Hyperparameter tuning with Hyperopt

Batch scoring and model serving

Real-time and batch inference endpoints

Model drift monitoring

Mosaic AI integration — Foundation Models on Databricks

Foundation Models — DBRX, Llama, Mixtral, MPT

Vector Search — Databricks native vector database

RAG patterns on Databricks

AI Agents with Databricks

Fine-tuning open-source LLMs

Cost optimisation for GenAI workloads

DABs — declarative bundle definitions for Databricks resources

YAML configuration for jobs, pipelines, models, workflows

GitHub Actions + DABs workflows

Development → Staging → Production promotion

Environment-specific configuration

Unit testing notebooks

Integration testing pipelines

Code review patterns

Databricks Certified Data Engineer Associate — exam blueprint walkthrough

Practice questions across all domains

Hands-on labs aligned to exam objectives

Databricks Certified Data Engineer Professional — advanced topics, performance tuning, security, governance

Production engineering patterns

Practice exam with timed assessment

Section Project — complete Databricks Lakehouse implementation with DLT pipelines, Unity Catalog governance, MLflow experiment tracking, Databricks SQL dashboards, GenAI integration, and DABs-based CI/CD deployment

Microsoft Fabric

The 2026 Microsoft data platform play. Microsoft Fabric unifies Data Factory, Data Engineering, Warehouse, Data Science, Real-Time Intelligence, and Power BI in a single SaaS platform — with OneLake as the unified data lake underneath. 15 modules covering every Fabric workload, with direct certification prep for DP-600 (Analytics Engineer Associate) and DP-700 (Data Engineer Associate).

15 MODULES
SECTION 7

Fabric workloads and architecture overview

Licensing and capacity units — F2 through F2048

Workspaces and tenant structure

Platform comparisons — Databricks, Snowflake, Azure Synapse

Migration paths from Azure Synapse

OneLake architecture — one lake per tenant

Delta Lake and Parquet file formats

ACID transactions and versioning

Shortcuts to external data sources (Azure Data Lake Storage, Amazon S3, Google Cloud Storage) without copying

OneLake Catalog and data discovery

OneLake shortcuts enable a true multi-cloud data mesh

Lakehouse fundamentals and components

Creating and managing Lakehouses

Medallion architecture — Bronze, Silver, Gold

SQL Analytics Endpoint for T-SQL access

Delta table operations and optimisation

Time travel queries on Delta tables

Build robust data integration pipelines with comprehensive connectivity and orchestration

Data Factory capabilities and connectors

Data pipelines creation and configuration

Dataflows Gen2 and Power Query

Database mirroring — Azure SQL, Cosmos DB, PostgreSQL

Pipeline orchestration and CI/CD

Apache Spark in Fabric workloads

Fabric Notebooks with Copilot

PySpark DataFrames and transformations

Spark SQL queries and optimisation

Spark job definitions and scheduling

AI functions — Summarisation, Classification, PII obfuscation built into Fabric Spark

Fabric Data Warehouse overview

Full T-SQL support and user-defined functions

Schema design and table management

Star schema and dimensional modelling

Slowly Changing Dimensions (SCD) Type 1, 2, 3

SQL Database in Fabric

Performance optimisation techniques
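Of the Slowly Changing Dimension types listed above, Type 2 is the one that keeps history by versioning rows. A minimal in-memory sketch follows; the column names (`valid_from`, `valid_to`, `is_current`) and the `apply_scd2` helper are assumptions for illustration, and a warehouse implementation would typically express the same logic as a MERGE.

```python
from datetime import date


def apply_scd2(dimension: list, incoming: dict, effective: date) -> None:
    """Apply one incoming record to an in-memory dimension, SCD Type 2 style."""
    current = next(
        (row for row in dimension
         if row["customer_id"] == incoming["customer_id"] and row["is_current"]),
        None,
    )
    if current is not None:
        if current["city"] == incoming["city"]:
            return  # Attribute unchanged: nothing to version.
        # Close the old version instead of overwriting it.
        current["valid_to"] = effective
        current["is_current"] = False
    # Open a new current version.
    dimension.append({
        "customer_id": incoming["customer_id"],
        "city": incoming["city"],
        "valid_from": effective,
        "valid_to": None,
        "is_current": True,
    })


# Made-up customer history: created, re-sent unchanged, then moved city.
dim = []
apply_scd2(dim, {"customer_id": 1, "city": "Pune"}, date(2024, 1, 1))
apply_scd2(dim, {"customer_id": 1, "city": "Pune"}, date(2024, 6, 1))
apply_scd2(dim, {"customer_id": 1, "city": "Delhi"}, date(2025, 1, 1))

for row in dim:
    print(row)
```

By contrast, Type 1 would simply overwrite `city` in place, and Type 3 would keep the previous value in an extra column on the same row.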

Real-time analytics in Fabric — purpose-built for time-series and log analytics with sub-second query performance on billions of rows

Eventstreams and streaming source ingestion

Kusto Query Language (KQL) fundamentals

Eventhouse and KQL databases

Graph in Fabric for relationship modelling

Maps for geospatial analytics

Real-time dashboards and alerting

KQL offers sub-second query performance on billions of rows — ideal for IoT, telemetry, and operational intelligence workloads

Direct Lake mode — query Delta tables at speed without import

The third performance tier after Import and DirectQuery

Semantic models and relationships

DAX fundamentals and syntax in Fabric

Row-level security (RLS) and object-level security (OLS)

Incremental refresh and aggregations

Report development with Copilot assistance

Data Science experience and tooling in Fabric

Exploratory data analysis in Fabric notebooks

ML model training and versioning

MLflow experiment tracking

Semantic Link (SemPy) — connect ML to Power BI semantic models

Batch scoring and predictions at scale

Copilot across workloads — Notebooks, SQL, KQL, Pipelines, Reports

Data Agents for conversational AI on your data

Operations Agents for monitoring

Fabric IQ and ontology models

AI Functions and Azure AI Foundry integration

User Data Functions overview — Python-based serverless functions

Create custom serverless functions for extended capabilities

VS Code extension for development

Integration with Notebooks, Pipelines, and SQL

Testing and deployment workflows

Implement comprehensive security and governance frameworks

Fabric security model

Authentication and authorisation — Entra ID integration

Row-level and column-level security

Dynamic data masking

Microsoft Purview integration

Data lineage, catalog, compliance, and auditing

Fabric admin portal and tenant settings

Capacity management and SKU selection

Monitoring Hub and performance dashboards

Query and pipeline monitoring

Git integration and deployment pipelines

CI/CD patterns for Fabric workloads

Query and performance optimisation

Partition strategies and caching

Enterprise architecture patterns

Data mesh implementation in Fabric

Migration strategies from legacy platforms (Synapse, on-prem warehouses)

Developer tools — Fabric CLI and REST APIs

Integration with Azure and third-party services

Microsoft Certified: Fabric Analytics Engineer Associate (DP-600) — exam blueprint walkthrough

Practice questions across all domains

Hands-on labs aligned to exam objectives

Microsoft Certified: Fabric Data Engineer Associate (DP-700) — data engineering-specific topics

Production engineering patterns

Practice exam with timed assessment

Section Project — complete Microsoft Fabric implementation with OneLake, Lakehouse (medallion), Data Factory pipelines with database mirroring, Spark notebooks, Data Warehouse, Real-Time Intelligence with KQL, Power BI Direct Lake reports, Purview governance, and CI/CD via Git integration

Generative AI & Agentic AI

The 2026 frontier — and the culmination of the data engineering programme. 10 modules covering the complete GenAI engineering stack tuned for data engineering work: frontier models, prompt engineering, RAG over your pipeline metadata, agent frameworks, and the Model Context Protocol. The named Data Engineering Coding Agent project lives here.

10 MODULES
SECTION 8

Narrow AI — pre-2022

Generative AI — post-2022, unleashed by ChatGPT

Agentic AI — post-2024 era of autonomous systems

2022 inflection point — ChatGPT launch

2024 inflection point — Agentic emergence

For Data Engineers — AI that drafts PySpark from natural language

AI that debugs Spark job failures

Autonomous pipeline maintenance (with human approval)

GPT-5.5 — Terminal-Bench 2.0 leader at 82.7%. Best for autonomous agents

Claude Opus 4.7 — SWE-bench Pro leader at 64.3%. Lowest hallucination rate. Best for accurate code generation

Gemini 3.1 Pro — 2M+ token context window. Best for ingesting massive pipeline codebases

Open-source frontier — Llama 4, DeepSeek, Mistral, Qwen — for VPC deployments

Copilot in Fabric Notebooks — PySpark code generation

Databricks Assistant — Spark-aware AI assistant

GitHub Copilot in VS Code for pipeline development

Fundamentals — Context + Task + Examples + Format + Constraints

Core Techniques — Zero-shot, few-shot, Chain-of-Thought (CoT), ReAct

System Prompts — persistent persona design, guardrails

Multimodal — reading architecture diagrams, lineage graphs

Hallucination & Context — grounding for accurate code generation

Domain prompts for PySpark, SQL, dbt, Airflow

Context Engineering — the 2026 frontier discipline; managing what enters the LLM's context window

Project — ship a 30+ prompt library for DE work (PySpark drafting, SQL generation, pipeline debugging, etc.)

ChatGPT, Claude, Gemini for daily DE work

AI for PySpark writing, SQL generation, pipeline documentation

Research with Perplexity for technology benchmarking

Microsoft Copilot integration across Office and developer tools

Reading architecture diagrams with vision models

Analysing dashboards and pipeline lineage graphs

OCR for legacy data dictionaries

Audio transcription with Whisper for meetings

Hallucination — when an LLM invents a Spark function that doesn't exist

Prompt injection through documents

Privacy — keeping sensitive data out of public LLMs

Regulatory landscape — EU AI Act, India DPDP Act

Streamlit — rapid prototyping for internal DE tools

FastAPI — production-grade Python API

Building chatbots for pipeline status Q&A

Build and deploy a Streamlit + FastAPI internal tool

LLM APIs in production — OpenAI, Anthropic, Google GenAI, DeepSeek Python SDKs

Function calling and structured outputs

Embeddings & Vector Databases — ChromaDB, Pinecone, Qdrant, pgvector

HNSW, IVF indexing strategies

Databricks Vector Search integration

Fabric AI Foundry integration

RAG pipeline for DEs — the canonical flow over pipeline metadata, data dictionaries, runbooks

Hybrid search (BM25 + embeddings)

Re-ranking with cross-encoders

Agentic RAG — self-improving retrieval over your data platform documentation

Project — Internal DE Docs RAG App: RAG over your data dictionary, pipeline runbooks, and architecture decision records
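Hybrid search needs a way to merge the keyword (BM25) ranking with the embedding ranking. The outline doesn't say which fusion method the course teaches; the sketch below assumes reciprocal rank fusion, one common choice, and uses made-up document IDs in place of real retriever output.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query over runbooks, data dictionaries, and ADRs.
bm25_ranking = ["runbook_orders", "dict_customers", "adr_017"]
embedding_ranking = ["adr_017", "runbook_orders", "dict_payments"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, and the fused list can then be passed to a cross-encoder for re-ranking.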

LangGraph 1.0 — the production default for agentic data engineering

Claude Agent SDK — deepest MCP integration

CrewAI — role-based multi-agent crews

Semantic Kernel / Microsoft Agent Framework — enterprise .NET stacks

Pydantic AI — type-safe Python, validation-first agent design

ReAct — investigate a pipeline failure, then propose a fix

Plan-and-Execute — generate a multi-step pipeline migration plan

Reflection loops — agent reviews its own PySpark code before deploying

Multi-agent collaboration — Schema agent, Pipeline agent, Quality agent, Reviewer agent

Human-in-the-loop checkpoints — humans approve every production-touching action
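The human-in-the-loop checkpoint can be illustrated without any agent framework: an action that touches production runs only after an approver says yes. This is a framework-free sketch under stated assumptions. `ProposedAction`, `run_with_approval`, and the callbacks are hypothetical names, and a LangGraph or Claude Agent SDK implementation would use that framework's own mechanism for pausing and resuming.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    touches_production: bool


def run_with_approval(
    action: ProposedAction,
    execute: Callable[[ProposedAction], None],
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Execute the action, but ask a human first if it touches production."""
    if action.touches_production and not approve(action):
        return "rejected"
    execute(action)
    return "executed"


executed_log: list[str] = []


def record(action: ProposedAction) -> None:
    # Stand-in for the real side effect (job submission, table write, etc.).
    executed_log.append(action.description)


def deny_all(action: ProposedAction) -> bool:
    # Stand-in approver for the example; a real one would prompt a reviewer.
    return False


dry_run = ProposedAction("Profile table in dev workspace", touches_production=False)
deploy = ProposedAction("Overwrite gold.sales in production", touches_production=True)

print(run_with_approval(dry_run, record, deny_all))  # executed
print(run_with_approval(deploy, record, deny_all))   # rejected
```

Placing the gate in one function means every tool the agent calls goes through the same check, rather than each tool implementing its own.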

MCP — open standard for connecting agents to tools, data, systems

Introduced by Anthropic in late 2024, stewarded by the Linux Foundation

200+ servers, 97M+ monthly SDK downloads

Build an MCP server exposing Databricks workspaces — clusters, jobs, notebooks

Build an MCP server exposing Microsoft Fabric — workspaces, pipelines, Lakehouses

Build an MCP server exposing Spark clusters — for safe job submission and monitoring

Build an MCP server exposing PostgreSQL — for query execution

Connect LangGraph agents to multiple MCP servers

Use the Claude Agent SDK's native MCP integration

A2A Protocol — Google-led agent-to-agent communication standard

DATA ENGINEERING CODING AGENT CAPSTONE — multi-agent Data Engineering Coding Agent using LangGraph + Claude Agent SDK with MCP servers exposing Databricks workspaces, Fabric workloads, Spark clusters, and PostgreSQL

Agent generates PySpark code from natural language, drafts DLT pipelines with proper quality expectations, debugs Spark job failures, proposes data quality remediations, and automates the boilerplate that DEs spend half their week on

Frontend with Streamlit or React, backend with FastAPI, observability via LangSmith — human approval gates for every production-touching action — the named project for the entire Data Engineering & AI Agents programme

Tools you'll master

32+ data engineering & AI tools, one production project.

Real-time projects

You don't watch videos. You ship software.

Three full-production projects, each threaded through the entire curriculum. By the final project, you've built the whole stack around them.

Hero project · weeks 3–12

Production lakehouse + streaming pipeline + AI agent

Ship a full lakehouse on Iceberg/Delta, wire a Kafka + Flink streaming layer into it, orchestrate the whole stack on Airflow with data contracts, and bolt on a LangGraph augmentation agent.

01 Lakehouse on Iceberg/Delta: bronze/silver/gold layers, dbt models with tests + docs, partitioned + sorted for query performance.

02 Streaming layer: Kafka topics, Flink stateful processing, exactly-once writes to the lakehouse, late-data handling.

03 Orchestration on Airflow/Dagster with data contracts, Great Expectations data-quality gates, lineage in the catalog.

04 AI augmentation agent: a LangGraph agent that profiles tables, drafts test cases, explains lineage, and answers analyst questions over the warehouse.

Outcome: 4× faster pipeline build

Data SLA: 99.9%

Reviewer: Data Platform panel

Spark · Kafka · Iceberg · dbt · LangGraph

Enterprise · weeks 6–11

Streaming CDC pipeline

Build a Postgres → Debezium → Kafka → Flink → Iceberg CDC pipeline with exactly-once semantics, schema-registry contracts, and Monte Carlo monitoring.

Debezium · Kafka · Flink · Iceberg

Real-time · weeks 8–12

Self-tuning lakehouse agent

Stand up a LangGraph agent that watches table-level metrics — latency, freshness, cost — auto-files dbt issues, drafts fixes, and benchmarks query plans on Trino.

LangGraph · Trino · dbt · Monte Carlo

Project · weeks 11–12

Your AI data platform in a real partner org.

Pick a real partner data problem. Deploy a production lakehouse + streaming pipeline + AI agent — Iceberg storage, Flink processing, dbt models, LangGraph augmentation — into a partner team that's running it for real users.

Download the real-world project

Full scope, sample partner orgs, weekly milestones, and grading rubric — PDF, 14 pages.

2026: 220+ deployed · 76% → placement offers

Your instructor

Taught by engineers who shipped agentic AI to production.

Manikanta Kona

Founder, Digital Lync · Principal Data Engineering Architect

Spark · Kafka · Airflow · dbt · Iceberg · Snowflake · LangGraph

"A 2026 data engineer doesn't stop at moving rows. They ship a lakehouse you can stake an SLA on, a streaming layer that survives a bad partition key, orchestration that an on-call engineer doesn't fight with, and an LLM agent that answers analyst questions over the warehouse. That's the bar I teach to, every cohort."

15 yrs

DATA & AI

2,400+

LEARNERS

4.9 /5

RATING

Manikanta is the founder of Digital Lync and brings 15 years of applied data engineering from AT&T, Salesforce, Cox Communications, and Broadcom — where he led lakehouse, streaming, and orchestration platforms for Fortune-500 banks, telcos, and insurers. Most recently he architected production data platforms that pair Iceberg/Delta lakehouses, Flink streaming, and dbt models with a LangGraph augmentation layer that explains lineage and drafts test cases for analyst teams.

His classes get you two things other programs don't give you: a founding architect who still ships production data platforms, and a curriculum rewritten every quarter to match what hiring managers actually ask about — credentials like AWS Data Engineer Associate, Databricks Data Engineer Professional, Snowflake SnowPro, Confluent Kafka Developer, and dbt Analytics Engineer included. M.S. in Engineering, Purdue University.

Ravi Krishna

Chief Technologist, Digital Lync · Data Platform & Streaming Lead

Spark · Kafka · Flink · dbt · Iceberg · Airflow · LangGraph

"A data platform earns its keep when the lakehouse is auditable, the streaming layer keeps its exactly-once promise on a bad day, the data SLAs hold under real load, and an LLM agent answers analyst questions before the meeting starts. I teach the unglamorous parts that make all of that real in production."

10 yrs

DATA PLATFORMS

1,800+

LEARNERS

4.8 /5

RATING

Ravi is Chief Technologist at Digital Lync, where he leads the data platform and streaming practice. After ten years building and running production lakehouses and streaming pipelines across enterprise — telecom, banking, and SaaS — he stepped into the Chief Technologist seat to wire Spark, Kafka, Flink, Iceberg, and dbt into the way data teams actually work — data contracts that hold under schema drift, freshness SLAs that on-call engineers trust, and a LangGraph augmentation layer that explains lineage to the analysts who own the numbers.

His data platform modules are built from real production post-mortems, not slide decks. Expect to leave with working Iceberg lakehouses, Flink streaming jobs with exactly-once semantics, dbt models with tests and docs, Airflow orchestration with data contracts, and a LangGraph augmentation agent wired into the warehouse. Ten years across enterprise data platforms — Hyderabad-based, hands-on, and known for the unglamorous parts of data engineering that everyone else skips.

HIRING PARTNERS · INDUSTRY VOICES

What data engineering employers say about Digital Lync grads.

Real feedback from data and platform leaders at AI-first companies and the firms hiring our Data Engineering + AI graduates.

Digital Lync grads ramp 40% faster on data platform deploys than typical data engineering hires. Best Data Engineering + AI pipeline in India.

Aakash Mehta, Engineering Director, Microsoft

We've onboarded 80+ Digital Lync alumni in 18 months. Lowest ramp time we've seen for lakehouses, streaming pipelines, and AI augmentation practices.

Anita Sharma, Senior Manager, Deloitte

The Data Engineering + AI programme is comprehensive — Spark, Kafka, dbt, LangGraph augmentation. Grads come pre-trained for production data platforms with AI.

Rahul Bhatt, Solutions Lead, Mphasis

Their lakehouse + streaming track produces engineers who ship production-grade pipelines on day one. Rare combination of engineering rigor and platform craft.

Deepak Pillai, Senior Architect, TCS

What sets Digital Lync apart is the AI augmentation layer baked into the data engineering track. Our enterprise clients ask for exactly this profile.

Suresh Menon, Practice Lead, Accenture

Their Databricks Data Engineer Pro + dbt Analytics Engineer prep is rigorous, and the shipped project — lakehouse, streaming pipeline, AI augmentation agent — is what closes interviews for us.

Vikram Iyer, Director, Infosys

Digital Lync's data engineers ship reliable pipelines twice as fast in the first 90 days. Our internal platform metrics back this up clearly.

Lakshmi Nair, VP Engineering, Wipro

Best Data Engineering + AI pipeline we've sourced from in India. Their projects are real shipped pipelines, not slide demos.

Karthik Subramanian, Engineering Director, Cognizant

Strong Spark and lakehouse engineering foundation. Their Data Engineering grads need almost zero ramp time on enterprise data platform engagements with us.

Arun Joshi, Practice Director, Capgemini

We've placed 40+ Digital Lync alumni across our data and watsonx engineering teams. Strong fundamentals, sharp on data SLAs and lineage.

Sanjay Verma, Talent Director, IBM

Lakehouses + AI augmentation is exactly the talent gap we've been struggling to close. Digital Lync is filling it for us reliably.

Anjali Desai, Practice Head, LTIMindtree

Their Data Engineering track delivers engineers who navigate Spark, Kafka, and dbt on customer engagements unsupervised.

Ramesh Iyer, Senior Manager, Tech Mahindra

Hired 25+ Digital Lync graduates for our data engineering practice. Strong on Spark, sharp on Kafka/Flink, fluent in dbt.

Geetha Pillai, Talent Acquisition Lead, Cyient

Digital Lync grads who blend lakehouses with Azure OpenAI augmentation land production-ready on day one. Rare combination, well-trained.

Priya Reddy, Talent Lead, Microsoft

03 Program certifications

An Agent‑Ready credential, not a participation trophy.

Digital Lync · Institute Certificate

Agent‑Ready Data Engineer

Presented to

Spandana Bala

For the successful design, build, and production deployment of a data platform — Iceberg lakehouse, Kafka/Flink streaming, dbt models, and an AI augmentation agent — evaluated against the Databricks Data Engineer Pro, dbt Analytics Engineer, and AWS Data Engineer Associate credential rubrics.

Manikanta Kona

CEO · Digital Lync

AGENT
READY
2026

Industry‑recognized

Co‑branded with the data engineering community and mapped to Databricks Data Engineer Pro and dbt Analytics Engineer credentials — names that hiring managers already scan for on resumes.

Project artifact included

Every certificate carries your shipped project — Iceberg lakehouse, Flink streaming pipeline, dbt models, AI augmentation agent — with a link to the live partner-org deployment. Proof, not a promise.

Enhanced skill validation

Graded against the 2026 Agent‑Ready rubric: lakehouse design, streaming pipelines, dbt models, data contracts, quality gates & lineage. No pass/fail — a level 1‑5 band.

Verifiable on a public URL

Each credential has a public verification page recruiters can check in 10 seconds — no PDF back‑and‑forth.

04 Job placement support

Your first Data Engineer offer isn't a lottery ticket. It's a built process.

GitHub, LinkedIn, resume — and most importantly, warm intros into data-heavy SaaS and platform teams. Our placement team works your search like an account, not a helpdesk.

01 / GITHUB & PORTFOLIO

A portfolio, not a graveyard.

Guidance on building a portfolio that showcases your lakehouse design, streaming pipeline, dbt models, AI augmentation agent, and a public verification URL — reviewed 1:1, not via template.

02 / RESUME PREP

Rewrite, don't proofread.

A one-page resume rebuilt around the data platforms you shipped (lakehouses, streaming pipelines, dbt models), the partner-org project, and the business outcome. Reviewed by engineers who've read 10,000+ resumes.

03 / LINKEDIN + INTROS

Where most opportunities actually live.

Profile tuning plus direct warm introductions into data-heavy SaaS and platform teams — Microsoft, Databricks, Snowflake, dbt Labs, Confluent, Fivetran, AWS, Anthropic, Hugging Face, Scale AI, Stripe, Razorpay, plus services that staff data platform teams (Deloitte, Accenture, Cognizant, TCS). You leave with recruiter contacts, not a generic "good luck."

Data Engineering alumni

Hundreds of data engineering careers launched — here are eight.

Spandana Bala

Data Engineer

Hyderabad · India

Now at · Microsoft

Naveen Vedala

Senior Data Engineer

Hyderabad · India

Now at · Atlassian

Tejashwini Addla

Staff Data Engineer (Streaming)

Hyderabad · India

Now at · Salesforce

Tharunesh Dillikar

Principal Data Engineer

Seattle · United States

Now at · Confluent

Mujahed Mohammed

Lakehouse Architect

Hyderabad · India

Now at · Databricks

Bhargav Kumar Murala

Streaming Platform Lead

Hyderabad · India

Now at · Adobe

Sai Manasa Leburi

Analytics Engineering Lead

New York · United States

Now at · Hugging Face

Rahul Dhamma

Director of Data Platform

Hyderabad · India

Now at · dbt Labs

Our locations

Come chat with us — over coffee, or over Zoom.

One flagship campus in Hyderabad, plus online Principal Data Engineer cohorts running on Indian and US timezones.

Flagship campus

Hyderabad

2nd Floor, Hitech City Road · Above Domino's · Opp. Cyber Towers, Jai Hind Enclave · Hyderabad, Telangana

Call

+91 81858 87766

US desk

+1 346 588 7766

Hours

Mon–Sat · 9am–9pm

Online class

Global

Weekend and evening Data Engineering cohorts running on IST and PST. Every online cohort ships the same project as the on‑campus track: Iceberg lakehouse, Flink streaming, dbt models, and AI augmentation.

Timezones

IST & PST

Format

Live + 1:1 mentorship

Next class

15 JUL 2026

FAQ

Questions we actually get — answered honestly.

Straight answers on prerequisites, the data platform stack, certifications, and placement. If something's missing, book a 20-minute advisor call — no slides, no pitch.

Do I need a CS background or prior SQL/Spark experience?

No on both counts. Roughly 40% of every class comes from non-CS streams — mechanical, electrical, BCom, BBA, and self-taught coders. Weeks 1–2 cover the SQL fundamentals, distributed compute, and pipeline design from scratch. What you do need: consistency and 12–15 hours a week.

Will I actually ship production pipelines, or only do tutorials?

You actually ship. Every learner builds a lakehouse on Iceberg/Delta with bronze/silver/gold layers, a Kafka → Flink streaming layer with exactly-once semantics, dbt models with Great Expectations gates, and a LangGraph agent that augments the platform. The project runs in a partner org, not a notebook.

Which tools, frameworks, and AI models will I use?

Compute: Spark, Flink, Trino, Databricks. Streaming: Kafka, Kinesis, Pub/Sub. Storage: Iceberg, Hudi, Delta Lake, S3, Snowflake, BigQuery, Redshift. Orchestration: Airflow, Prefect, Dagster, dbt. Quality: Great Expectations, Monte Carlo, Data Contracts. AI: OpenAI, LangChain, LangGraph.

Will I prep for the Databricks Data Engineer Professional and dbt Analytics Engineer certs?

Yes. The curriculum is mapped to the Databricks Data Engineer Professional and dbt Analytics Engineer credentials. We run two full mock exams and reimburse the voucher fee on first-attempt pass.

What's the time commitment per week?

Plan for 12–15 hours: 2 live classes × 2 hours, 1 lab × 3 hours building pipelines on your training cluster, and ~5 hours of project work (Spark, Kafka, dbt). Saturday office hours with the TA team are optional, but most learners use them.

Is placement support really 1:1, and which companies hire data engineers?

Yes — a dedicated placement advisor from week 8, not a helpdesk. AI product hiring partners include Microsoft, Adobe, Salesforce, Atlassian, Notion, Linear, Anthropic, Hugging Face, Databricks, Snowflake, Stripe, Razorpay, Freshworks, Zoho, and Postman. Resume, LinkedIn, mock interviews, and warm intros are individual.

Online, weekend, or on-campus?

All three. On-campus at the Hyderabad flagship, live online (IST and PST cohorts), and a weekend track for working professionals. Every format ships the same project (Iceberg lakehouse, Flink streaming, dbt models, AI augmentation); only the schedule changes.

What if I fall behind, or can't continue mid-class?

Freeze your seat for up to 90 days and rejoin the next class — no extra fee. TAs run catch-up sessions every Saturday for anyone more than a week behind, and recordings of every live session are available for the lifetime of your account.

Still have a question? Talk to an advisor — no slides, no pitch.

Class DEA-025 starts 13 JUL 2026.
40 seats. 12 already claimed.

Book a 20-minute advisor call. We'll walk through the curriculum, match it to your current role, and show you two real projects from class 022.
