May 18, 2025
Quick Insights to Start Your Week
Welcome to this week’s Data Engineering and Analytics huddle – your go-to source for the latest trends, insights, and tools shaping the industry. Let’s dive in! 🔥
Lakehouse Data Processing with AWS Lambda, DuckDB, and Iceberg
This exploration demonstrates the feasibility of a lightweight lakehouse data processing pipeline built with AWS Lambda, DuckDB, and Apache Iceberg on Cloudflare R2. Here’s a step-by-step guide:
- Data Preparation: Prepare your raw trip data in CSV format and store it in an S3 bucket (s3://confessions-of-a-data-guy/iciceBaby/*.csv). Additionally, set up an Iceberg table schema for daily metrics in Cloudflare’s R2 Data Catalog.
- Environment Setup:
  - Create a Dockerfile to hold the AWS Lambda code.
  - Build and push the Docker image to Amazon Elastic Container Registry (ECR) in AWS.
  - Set up an AWS Lambda function with a trigger for any .csv file in the specified S3 bucket.
- Lambda Code: Develop Python code using DuckDB and PyIceberg that performs the following steps (a sketch of such a handler appears after this list):
  - Read the CSV file with DuckDB from S3.
  - Run the GROUP BY analytics with DuckDB.
  - Convert the result to Arrow format.
  - Connect to the R2 Data Catalog using PyIceberg.
  - Write data to the Iceberg table via PyIceberg and Arrow.
- Docker Configuration: To resolve potential issues with DuckDB and Lambda runtimes, set HOME=/tmp in the environment variables within the Dockerfile or Lambda configuration.
- Testing and Validation: Execute the Lambda function to process CSV files from S3 and verify that data is correctly written to the R2 Iceberg table.
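To make the flow concrete, here is a minimal sketch of what such a handler could look like. This is not the article’s actual code: the metric columns (trip_date, fare_amount), the default.daily_metrics table name, and the R2_* environment variables are assumptions made for illustration, and the catalog properties follow PyIceberg’s REST catalog configuration.

```python
# Minimal sketch of the Lambda handler described above. Column names, the
# table identifier, and the R2_* environment variables are assumptions.
import os

# DuckDB extensions are installed under $HOME, which must point at the only
# writable path in Lambda (see the Docker Configuration step above).
os.environ.setdefault("HOME", "/tmp")

import duckdb
from pyiceberg.catalog import load_catalog


def handler(event, context):
    # The S3 trigger tells us which .csv file landed.
    record = event["Records"][0]["s3"]
    path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # 1. Read the file with DuckDB from S3, reusing the function's own credentials.
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute(f"SET s3_region='{os.environ.get('AWS_REGION', 'us-east-1')}';")
    con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}';")
    con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}';")
    con.execute(f"SET s3_session_token='{os.environ['AWS_SESSION_TOKEN']}';")

    # 2. Run the GROUP BY analytics, and 3. convert the result to Arrow.
    arrow_table = con.sql(
        f"""
        SELECT trip_date, COUNT(*) AS trips, SUM(fare_amount) AS total_fare
        FROM read_csv_auto('{path}')
        GROUP BY trip_date
        """
    ).arrow()

    # 4. Connect to the R2 Data Catalog using PyIceberg (REST catalog).
    catalog = load_catalog(
        "r2",
        **{
            "type": "rest",
            "uri": os.environ["R2_CATALOG_URI"],      # assumed to be configured
            "warehouse": os.environ["R2_WAREHOUSE"],  # on the Lambda function
            "token": os.environ["R2_TOKEN"],
        },
    )

    # 5. Write the data to the Iceberg table via PyIceberg and Arrow.
    table = catalog.load_table("default.daily_metrics")
    table.append(arrow_table)

    return {"rows_written": arrow_table.num_rows}
```

Because Lambda injects the execution role’s temporary AWS credentials into the environment, DuckDB’s httpfs extension can reuse them directly through the SET s3_* options.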
Key Takeaways:
- AWS Lambda offers underutilized potential for lightweight data processing in a Lake House environment.
- DuckDB proves itself as fast, versatile, and user-friendly, albeit requiring middleware like PyIceberg for Iceberg support.
- Cloudflare’s R2 Data Catalog simplifies the setup and usage of Apache Iceberg, making it an underappreciated hero in the data ecosystem.
- First-class Iceberg READ and WRITE support should be implemented across major data tooling, without requiring catalogs or additional middleware, much as is already the case for Delta tables.
- The author acknowledges the limitations of PyIceberg and the friction of Iceberg’s catalog requirement, emphasizing the desire for more seamless integration.
How To Write Better SQL – Simplifying Complex Queries 🧵
Are you tired of opening .sql files or Airflow DAGs only to face a 5,000+ line query? You’re not alone! In this article, we’ll explore why these monstrous queries exist, their drawbacks, and how to avoid them. Let’s dive in! 👻📊
Why do massive SQL queries happen?
- Multiple distinct tables: Often, queries combine multiple CTEs or subqueries within a single statement. Breaking these down into separate queries can improve readability and maintainability.
- Redundant logic: Repeated CASE statements or similar logic across different parts of the query indicate an opportunity for centralization in a dedicated table or reusable query.
- Lack of testing: Without tests, developers might avoid breaking down complex queries into smaller, more manageable pieces. Writing tests for your SQL can help keep it modular and maintainable.
- Clunky development environments: Traditional setups like master stored procedures calling multiple stored procedures can lead to query bloat if not managed properly. Upgrading your tools may prevent this issue.
Why are massive queries a problem?
- Hard to test
- Difficult to read
- Challenging to update
- Overall, bad practice 🚫
How can you avoid these issues?
- Break down complex queries: Divide large, multi-table queries into smaller, more manageable pieces using CTEs or subqueries.
- Centralize redundant logic: Identify repetitive CASE statements or similar logic and move them to a dedicated table or reusable query.
- Write tests for your SQL: Testing smaller queries encourages modularity, making it easier to maintain and update your codebase over time (see the sketch after this list).
- Improve development tools: Invest in better SQL development environments that support modular workflows and prevent query bloat.
- Leverage code reviews: Regularly review code and designs with teammates to catch and address complex queries early on. A simple comment can make a big difference! 🤝
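As a rough sketch of the first three points (breaking a query into CTEs, centralizing CASE logic, and testing the result), here is a small example using DuckDB from Python; the trips table, the fare_bucket rule, and the expected result are all hypothetical.

```python
# Hypothetical example: shared CASE logic lives in one view, the big query
# becomes short CTEs, and a test asserts the result.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE trips (city VARCHAR, fare DOUBLE)")
con.execute("INSERT INTO trips VALUES ('NYC', 8.0), ('NYC', 25.0), ('BOS', 40.0)")

# Centralize redundant logic: the CASE statement is defined once, in a view,
# instead of being copy-pasted into every downstream query.
con.execute(
    """
    CREATE VIEW trips_bucketed AS
    SELECT city,
           fare,
           CASE WHEN fare < 10 THEN 'low'
                WHEN fare < 30 THEN 'medium'
                ELSE 'high' END AS fare_bucket
    FROM trips
    """
)

# Break the big query down: each CTE is a small, nameable step instead of one
# 5,000-line statement.
daily_summary = con.sql(
    """
    WITH by_bucket AS (
        SELECT city, fare_bucket, COUNT(*) AS trips
        FROM trips_bucketed
        GROUP BY city, fare_bucket
    )
    SELECT city, SUM(trips) AS total_trips
    FROM by_bucket
    GROUP BY city
    ORDER BY city
    """
).fetchall()

# Write tests for your SQL: small queries are easy to assert against.
assert daily_summary == [("BOS", 1), ("NYC", 2)]
```

The same idea scales up: each view or CTE becomes a unit you can name, review, and test on its own.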
While there may be exceptions where longer queries are necessary, it’s always wise to ask: “Does this query really need to be this long, or should I break it up?” 🧐
Remember, writing clean, maintainable SQL is an essential skill for any data professional! 😎💻
What Is Columnar Storage?
Columnar storage is a data organization method that stores data by columns rather than rows, optimizing for analytical queries. This approach allows for more efficient compression and faster processing of large datasets. Two formats that come up constantly in this context are Apache Parquet, a columnar format, and Apache Avro, a row-oriented format that is often weighed against it.
Apache Parquet is a columnar storage file format available to any project in the Hadoop ecosystem. It provides efficient data compression and encoding schemes, enabling rapid query processing. Parquet stores metadata alongside data, allowing for faster access and improved performance. It supports various compression codecs like Snappy, Gzip, and LZO.
Apache Avro, by contrast, is a row-oriented data serialization format focused on compact storage and data exchange. It declares schemas in JSON but encodes the data itself in a compact binary format, and it provides rich data structures with strong schemas. Avro files can be processed from many programming languages, making the format versatile across big data processing frameworks like Apache Spark and Apache Hive.
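To make the layout difference tangible, here is a small, hypothetical sketch that writes the same three records to Parquet with pyarrow and to Avro with the fastavro library; the field names and values are invented for illustration.

```python
# Writing the same records to Parquet (columnar) and Avro (row-oriented).
# Requires pyarrow and fastavro; field names and values are made up.
import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import parse_schema, writer

records = [
    {"city": "NYC", "fare": 8.0},
    {"city": "NYC", "fare": 25.0},
    {"city": "BOS", "fare": 40.0},
]

# Parquet groups values column by column, so all "city" strings (and all
# "fare" doubles) sit next to each other and compress well.
pq.write_table(pa.Table.from_pylist(records), "trips.parquet", compression="snappy")

# Avro stores one complete record after another; the schema is declared in
# JSON, but the data itself is encoded in a compact binary format.
schema = parse_schema({
    "name": "Trip",
    "type": "record",
    "fields": [
        {"name": "city", "type": "string"},
        {"name": "fare", "type": "double"},
    ],
})
with open("trips.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")
```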
Both Parquet and Avro offer advantages in handling large datasets:
- Efficient Compression: Both formats compress well; Parquet’s columnar layout in particular achieves high compression rates because values of the same column and type are stored together. This reduces storage costs and accelerates data transfer between systems.
- Faster Query Processing: By organizing data by columns, Parquet lets analytical queries read only the columns they need instead of whole rows, leading to faster query execution times.
- Schema Evolution: Both formats support schema evolution, allowing changes to the data structure over time without losing compatibility with existing data.
- Interoperability: Parquet and Avro are widely supported across big data processing frameworks, enabling seamless integration into diverse tech stacks.
- Performance Optimization: Parquet additionally offers optimizations like predicate pushdown and vectorized reads, further enhancing query performance in analytical workloads (see the sketch after this list).
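As a brief illustration of column pruning and predicate pushdown, here is a sketch that reads back the hypothetical trips.parquet file from the earlier example with pyarrow:

```python
# Reading only the columns and row groups a query needs, from the
# hypothetical trips.parquet file written in the earlier sketch.
import pyarrow.parquet as pq

# Column pruning: only the requested columns are decoded, not the whole file.
# Predicate pushdown: the filter is checked against column statistics, so
# row groups that cannot contain fare > 10.0 are skipped before decoding.
expensive = pq.read_table(
    "trips.parquet",
    columns=["city", "fare"],
    filters=[("fare", ">", 10.0)],
)
print(expensive.to_pylist())  # [{'city': 'NYC', 'fare': 25.0}, {'city': 'BOS', 'fare': 40.0}]
```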
However, the choice between Parquet and Avro depends on specific use cases:
- Parquet is generally preferred for analytical workloads and enjoys wide adoption across big data ecosystems, particularly in Hive, Spark, and Presto environments, thanks to its focus on compression and query performance.
- Avro, while also offering good compression and schema evolution, shines in row-at-a-time workloads such as streaming and data interchange, and in scenarios requiring rich data structures and interoperability across programming languages beyond the Hadoop ecosystem.
Ultimately, the decision should be guided by factors like existing technology stack, query patterns, scalability needs, and the trade-offs between storage efficiency and processing speed in your specific use case.
🛠️ Tool of the Week
Apache Iceberg is a high-performance, open table format designed for massive analytical tables. It combines the reliability and simplicity of SQL tables with big data capabilities, enabling engines such as Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables at the same time.
🤯 Fun Fact of the Week
Enterprise cloud adoption keeps surging: according to Flexera, 94% of enterprises now use the cloud. This widespread adoption across industries reflects businesses’ focus on scalability and efficiency, and data engineers are leading the transition, ensuring seamless cloud integration and optimizing data infrastructures to support organizational goals.
⚡ Quick Bites: Headlines You Can’t Miss!
- GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it.
- Integration Isn’t a Task — It’s an Architectural Discipline.
- Why You’re Stuck at Senior Data Engineer (And How to Break Out).
- Infrastructure as Code (IaC) Beyond the Basics
Subscribe to this huddle for more weekly updates on Data Engineering and Analytics! 🚀
