Cloud Infrastructure

April 15, 2025

Building a Scalable Data Lake on AWS: Lessons from the Trenches

Tags: AWS, S3, Data Lake, Architecture

After migrating three enterprise data warehouses to AWS S3-based data lakes, I've collected a set of principles that separate architectures that scale gracefully from the ones that quietly become technical debt. Most of what I'm about to say sounds obvious in retrospect — but I've watched capable engineers skip each of these steps under time pressure.

The Foundation: Get Your Storage Layer Right

The most expensive mistakes I've seen happen at the storage layer. Teams treat S3 like a shared drive (flat folders, inconsistent naming, no partitioning strategy) and pay for it in compute costs and query latency months later. Before you write a single ETL job, design your prefix structure. Partition by date, source, and entity type. Use a columnar format such as Parquet or ORC, not CSV, so query engines can read only the columns they need. Set lifecycle rules on raw data from day one.
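As a rough illustration of what "design your prefix structure" can look like, here is a minimal Python sketch. The layer name `raw`, the `source=`/`entity=`/`dt=` key names, the partition order, and the 30-day and 365-day thresholds are all assumptions made for the example, not a prescribed layout.

```python
from datetime import date


def build_object_key(layer: str, source: str, entity: str, day: date, filename: str) -> str:
    """Return a Hive-style partitioned key, e.g.
    raw/source=crm/entity=orders/dt=2025-04-15/part-0000.parquet
    """
    for name, value in (("layer", layer), ("source", source), ("entity", entity)):
        # Reject values that would silently create extra prefix levels or partitions.
        if not value or "/" in value or "=" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"{layer}/source={source}/entity={entity}/dt={day.isoformat()}/{filename}"


def raw_lifecycle_rule(prefix: str = "raw/", ia_after_days: int = 30,
                       expire_after_days: int = 365) -> dict:
    """Build one lifecycle rule as a plain dict (nothing is sent to AWS here).

    The thresholds are illustrative assumptions; pick values that match
    your own retention requirements.
    """
    if expire_after_days <= ia_after_days:
        raise ValueError("expiration must come after the storage-class transition")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": ia_after_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": expire_after_days},
    }


if __name__ == "__main__":
    print(build_object_key("raw", "crm", "orders", date(2025, 4, 15), "part-0000.parquet"))
    print(raw_lifecycle_rule())
```

Funnelling every writer through one key-building function is what keeps naming consistent across jobs. The rule dict follows the shape of an S3 lifecycle rule, so it could be wrapped as `{"Rules": [rule]}` and passed to boto3's `put_bucket_lifecycle_configuration`, but the sketch deliberately stops short of calling AWS.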

Partitioning Strategy Matters More Than You Think

A poorly partitioned dataset can cost you 10× in both compute and query time, because every query ends up scanning data it doesn't need. The key insight most engineers miss: partition your data for the way it gets read, not the way it gets written. If 95% of your queries filter by date, partition by date. If they filter by customer region, partition by region. Athena and Spark will thank you: both can skip partitions that a query's filter rules out.
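To make the read-versus-write point concrete, here is a toy Python model of partition pruning. It is not how Athena or Spark are implemented; it only counts which partitions an equality filter could skip. The 30-day window, the three regions, and the assumption that all partitions are the same size are invented for the example.

```python
# Toy model: a partition is described by its partition-key values.
DATES = [f"2025-04-{d:02d}" for d in range(1, 31)]  # 30 daily partitions (assumed)
REGIONS = ["us", "eu", "apac"]                      # 3 regional partitions (assumed)

by_date = [{"dt": d} for d in DATES]
by_region = [{"region": r} for r in REGIONS]


def partitions_to_scan(partitions: list, filters: dict) -> list:
    """Return the partitions an equality filter cannot skip.

    A partition is pruned only when it carries the filtered column as a
    partition key and its value differs. If the column is not a partition
    key, the engine has to open the partition to find out.
    """
    kept = []
    for partition in partitions:
        if all(partition.get(col, value) == value for col, value in filters.items()):
            kept.append(partition)
    return kept


def scanned_fraction(partitions: list, filters: dict) -> float:
    """Fraction of partitions read, assuming equally sized partitions."""
    return len(partitions_to_scan(partitions, filters)) / len(partitions)


if __name__ == "__main__":
    query = {"dt": "2025-04-15"}  # the "95% of queries filter by date" case
    print("partitioned by date:  ", scanned_fraction(by_date, query))
    print("partitioned by region:", scanned_fraction(by_region, query))
```

With a date filter, the date-partitioned layout reads 1 of 30 partitions while the region-partitioned layout reads all 3 of its partitions, which is the whole dataset. The same data laid out for the wrong access pattern gets no pruning at all.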

Governance From Day One

Data lakes that started as a “quick win” and became data swamps usually have one thing in common: no governance. Nobody knew what was in them, who owned the data, or whether it was still accurate. Implement a data catalog (AWS Glue Data Catalog or Apache Atlas) from the first table. Tag everything with owner, data classification, and SLA. It takes an hour upfront and saves weeks of detective work later.
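One way to make "tag everything" enforceable is to refuse to register a table that lacks the three tags. The sketch below uses an in-memory dict as a stand-in for a real catalog; the `TableTags` class, the classification levels, and the `sla_hours` unit are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical classification levels; substitute your organization's own.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "restricted"}


@dataclass(frozen=True)
class TableTags:
    owner: str
    classification: str
    sla_hours: int  # assumed meaning: maximum acceptable data staleness, in hours

    def validate(self) -> list:
        """Return a list of human-readable problems; empty means the tags are usable."""
        problems = []
        if not self.owner.strip():
            problems.append("owner is empty")
        if self.classification not in ALLOWED_CLASSIFICATIONS:
            problems.append(f"unknown classification: {self.classification!r}")
        if self.sla_hours <= 0:
            problems.append("sla_hours must be positive")
        return problems

    def as_parameters(self) -> dict:
        """Flatten to a string-to-string map, a shape catalogs commonly accept."""
        return {
            "owner": self.owner,
            "data_classification": self.classification,
            "sla_hours": str(self.sla_hours),
        }


def register_table(catalog: dict, name: str, tags: TableTags) -> None:
    """Add a table to the stand-in catalog, rejecting incomplete tags."""
    problems = tags.validate()
    if problems:
        raise ValueError(f"refusing to register {name}: " + "; ".join(problems))
    catalog[name] = tags.as_parameters()


if __name__ == "__main__":
    catalog = {}
    register_table(catalog, "orders", TableTags("data-platform@example.com", "internal", 24))
    print(catalog)
```

The flattened map has the same string-to-string shape as the `Parameters` field on a Glue table definition, so a real implementation could plausibly write it there. Treat that mapping as an assumption to verify against your catalog of choice.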

Schema Evolution Is Inevitable: Plan for It

Your upstream sources will change their schemas. Fields get renamed, types change, columns disappear. Storing raw JSON protects your ingestion from these changes, but not your downstream consumers. Build schema validation into your ingestion layer. Use schema registries for streaming workloads. Test schema changes in a staging environment before they reach production.
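A minimal sketch of validation at the ingestion layer, in plain Python with no schema library. The `EXPECTED_SCHEMA` fields and the quarantine approach are assumptions for the example; a production setup would more likely lean on a schema registry or a dedicated validation library.

```python
# Hypothetical expected schema for an "orders" feed: field name -> Python type.
EXPECTED_SCHEMA = {"order_id": str, "amount": float, "created_at": str}


def validate_record(record: dict, schema: dict) -> list:
    """Return a list of schema problems for one record; empty means it conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            # Note: this check is strict, so an int where a float is expected fails too.
            problems.append(
                f"type mismatch for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def ingest(records: list, schema: dict) -> tuple:
    """Split records into accepted ones and quarantined (record, problems) pairs."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate_record(record, schema)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


if __name__ == "__main__":
    good = {"order_id": "A1", "amount": 9.5, "created_at": "2025-04-15"}
    # Simulates an upstream rename (order_id -> id) plus a type change on amount.
    changed = {"id": "A2", "amount": "9.5", "created_at": "2025-04-15"}
    accepted, quarantined = ingest([good, changed], EXPECTED_SCHEMA)
    print(len(accepted), "accepted")
    for record, problems in quarantined:
        print("quarantined:", record, problems)
```

Quarantining instead of dropping keeps the offending records available for inspection, so an upstream rename shows up as a visible pile of rejects rather than as silently missing rows downstream.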

Helana Nosratbakhsh

Senior Data Engineer & Advisor