July 30, 2025

Quick Insights to Start Your Week



Welcome to this week’s Data Engineering and Analytics huddle – your go-to source for the latest trends, insights, and tools shaping the industry. Let’s dive in! 🔥


Duplicates in Data and SQL

Duplicates in data pipelines are a classic problem, and SQL remains the iconic battleground. Despite decades of tools like Pandas, Spark, and Postgres, duplicates persist across industries and architectures—from legacy Data Warehouses to modern Delta Lake and Iceberg setups. It’s a problem that’s been around since the days of SQL Server and SSIS, and it’s still here today.

Primary Keys: The Solution

The key to deduplication lies in defining a primary key for your dataset. Primary keys identify unique records and are essential for downstream processing. There are two types:

  • Composite keys: A combination of columns that uniquely identify a record (e.g., customer_id + order_date).
  • Logical keys: user-defined keys in lakehouse environments like Delta Lake, where enforcement isn’t mandatory.

The act of defining a primary key—even one that is never enforced—is what prevents duplicates from wreaking havoc. Every dataset has a primary key, whether or not it is declared, and you should know what it is.

Deduplication in Action

Once you’ve defined your primary key, deduplication becomes straightforward:

  • PySpark: dropDuplicates()
  • Polars: unique()
  • Pandas: drop_duplicates()
  • DuckDB: SELECT DISTINCT ON or ROW_NUMBER()

These methods are simple, reliable, and work across tools. The effort to define a primary key upfront—just a few minutes—saves hours downstream by preventing duplicates from breaking dashboards or MERGE statements.
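
As a hedged sketch of the idea, here is key-based deduplication in Pandas and DuckDB; the orders table, the composite key of customer_id + order_date, and the ORDER BY tie-breaker are illustrative assumptions, not a prescribed schema.

```python
# Minimal deduplication sketch. The orders table and its composite primary
# key (customer_id + order_date) are illustrative assumptions.
import duckdb
import pandas as pd

orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2],
        "order_date": ["2025-07-01", "2025-07-01", "2025-07-02", "2025-07-03"],
        "amount": [10.0, 10.0, 25.0, 40.0],
    }
)

# Pandas: keep the first row per primary key.
deduped_pd = orders.drop_duplicates(subset=["customer_id", "order_date"], keep="first")

# DuckDB: ROW_NUMBER() over the primary key, keeping one row per key.
# DuckDB can query the local DataFrame `orders` directly.
deduped_db = duckdb.sql(
    """
    SELECT customer_id, order_date, amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, order_date
                   ORDER BY amount   -- tie-breaker; pick what makes sense for your data
               ) AS rn
        FROM orders
    ) AS t
    WHERE rn = 1
    """
).df()

print(deduped_pd)
print(deduped_db)
```

The same pattern carries over to PySpark’s dropDuplicates(["customer_id", "order_date"]) and Polars’ unique(subset=["customer_id", "order_date"], keep="first").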

Final Thoughts

Duplicates are inevitable, but they don’t have to be catastrophic. The solution lies in proactive design: define your primary key, use it for deduplication, and avoid relying on tools to solve the problem for you. It’s a small step that pays massive dividends.

Read more


Databricks Workflows vs Apache Airflow

In the ever-evolving world of data orchestration, Apache Airflow and Databricks Workflows are two of the most prominent tools. This post dives into their strengths, weaknesses, and how they stack up against each other.

Key Features to Consider

When evaluating orchestration tools, several core features stand out:

  • Data Flow Visualization: Both tools offer robust visualization, though Airflow’s UI is more mature and user-friendly.
  • Complex Task Dependency Support: Airflow excels here, with intuitive DAGs for managing dependencies. Databricks Workflows, while functional, require more effort for nested logic.
  • Retry and Error Handling: Airflow provides granular control over retries, exponential backoff, and custom policies. Databricks Workflows lack these advanced options, relying on simpler, per-task configurations (see the sketch after this list).
  • Integration Capabilities: Airflow’s ecosystem supports a broader range of tools and services, making it more versatile for hybrid workflows. Databricks Workflows shine in environments heavily reliant on Databricks.
  • Development Experience: Airflow’s Python-based DAGs are widely adopted, though some find them less intuitive than Databricks’ declarative YAML format.
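
To make the dependency and retry points concrete, here is a minimal, hedged Airflow sketch using the TaskFlow API (assuming Airflow 2.4+); the DAG id, schedule, retry values, and task bodies are illustrative assumptions rather than a recommended pipeline.

```python
# Minimal Airflow DAG sketch (TaskFlow API, Airflow 2.4+). The DAG id,
# schedule, retry values, and task bodies are illustrative assumptions.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    dag_id="retry_demo",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,                              # per-task retry count
        "retry_delay": timedelta(minutes=2),       # base delay between attempts
        "retry_exponential_backoff": True,         # roughly doubles the delay each retry
        "max_retry_delay": timedelta(minutes=30),  # cap on the backoff
    },
)
def retry_demo():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    # Passing extract()'s output to load() declares the task dependency in the DAG.
    load(extract())


retry_demo()
```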

Hands-On Comparison

For a practical look, consider this example:

  • Apache Airflow: The Astro CLI simplifies local development, and its UI provides detailed run insights.
  • Databricks Workflows: While powerful for Databricks-centric tasks, they can feel clunky for complex, multi-tool pipelines.

Final Verdict

Airflow’s maturity, flexibility, and community support make it a clear favorite for most use cases. Databricks Workflows, however, are ideal for teams deeply integrated with Databricks. Together, they form a complementary pair, enabling seamless hybrid workflows.

Read more


Scaling Netflix’s Threat Detection Pipelines Without Streaming

Back in 2018, I was part of Netflix’s real-time threat detection team. I owned the orchestration and delivery layer of a detection pipeline that flagged fraudulent behavior, security breaches, and abuse patterns across our global platform. This article delves into the challenges and insights from a unique architecture called the “Psycho Pattern,” which was a hybrid of batch and real-time processing, using Spark, Kafka, and Airflow.

What Is the Psycho Pattern?

The Psycho Pattern was designed to balance batch and real-time processing. It ran continuously, one job at a time, using a high watermark table to track the latest processed data. Spark read from Kafka or SQS, applied fraud detection logic, and wrote results to a database. The watermark was updated after each job, ensuring no data was missed even if it arrived late.
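
To make the loop concrete, here is a minimal, self-contained Python sketch of the high-watermark cycle described above; the in-memory watermark table and event list stand in for the real database and Kafka/SQS reads, and names like detect_fraud are assumptions, not Netflix’s actual code.

```python
# Self-contained sketch of a high-watermark batch loop. The in-memory
# "watermark table" and event source stand in for the real database and
# Kafka/SQS; helper names are illustrative assumptions.
from datetime import datetime, timezone

WATERMARKS = {"fraud_detection": datetime(2025, 1, 1, tzinfo=timezone.utc)}
EVENTS = [  # (event_time, payload) stand-in for a Kafka/SQS stream
    (datetime(2025, 1, 1, 0, 5, tzinfo=timezone.utc), {"account": "a", "score": 0.2}),
    (datetime(2025, 1, 1, 0, 9, tzinfo=timezone.utc), {"account": "b", "score": 0.9}),
]


def read_since(low: datetime, high: datetime) -> list[dict]:
    """Stand-in for the Spark read from Kafka/SQS between two watermarks."""
    return [payload for ts, payload in EVENTS if low < ts <= high]


def detect_fraud(events: list[dict]) -> list[dict]:
    """Stand-in for the detection logic; flag anything above a threshold."""
    return [e for e in events if e["score"] > 0.8]


def run_once(name: str = "fraud_detection") -> None:
    low = WATERMARKS[name]
    high = datetime.now(timezone.utc)

    signals = detect_fraud(read_since(low, high))
    print(f"flagged {len(signals)} events between {low} and {high}")

    # Advance the watermark only after results are safely written, so a
    # failed run is simply replayed from the old watermark on the next pass.
    WATERMARKS[name] = high


if __name__ == "__main__":
    run_once()  # in production this runs continuously, one job at a time
```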

Why the Psycho Pattern Failed

Despite its solid design, the Psycho Pattern struggled with scale and bugs. It introduced high latency (~5-7 minutes) and unpredictable memory usage. When the DAG stalled, unprocessed data piled up behind the watermark, so the next run ingested a massive batch that spiked memory and overwhelmed Spark. Migrating to Flink only reduced latency slightly, from 6 to 4 minutes, but didn’t improve signal quality or reduce engineering costs.

Key Takeaways

  • Micro-batch isn’t broken. If you can consistently hit 3–5 minute latency, you’re already at Netflix-grade.
  • High watermark logic is your lifeline. Monitor and audit it like the heartbeat of your system.
  • Memory management in Spark isn’t about throwing more RAM. Use spill monitoring, GC logs, and dynamic allocation strategies (a minimal configuration sketch follows this list).
  • Ask better questions. When someone wants faster pipelines, they may mean better data, not just speed.
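
As a hedged illustration of the memory-management point, here is a minimal PySpark session sketch with dynamic allocation plus the logging needed to observe spill and GC behavior; the executor counts and app name are illustrative assumptions, and the values should be tuned per cluster.

```python
# Minimal sketch of dynamic allocation and observability settings in Spark.
# The numbers and app name are illustrative assumptions; tune per cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("threat-detection-batch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Needed for dynamic allocation without an external shuffle service (Spark 3.0+).
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Event logs surface shuffle spill metrics in the Spark UI / History Server.
    .config("spark.eventLog.enabled", "true")
    # Verbose GC output in executor logs for memory-pressure debugging.
    .config("spark.executor.extraJavaOptions", "-verbose:gc")
    .getOrCreate()
)
```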

Read more


🛠️ Tool of the Week

RapidMiner is a data science platform that analyses the collective impact of an organization’s data. It was acquired by Altair Engineering in September 2022.


🤯 Fun Fact of the Week

Data Engineering as a Service (DaaS) is experiencing substantial growth: its market size is projected to more than double, from $5.4 billion in 2023 to $13.2 billion by 2026. This growth reflects the increasing reliance on cloud-based data engineering services, which provide the scalability and flexibility businesses need to manage their data effectively.



⚡ Quick Bites: Headlines You Can’t Miss!




Subscribe to this huddle for more weekly updates on Data Engineering and Analytics! 🚀