When George Fraser from Fivetran said dbt was like “Rails for data” at Coalesce this week, it made me smile, because that was exactly how I felt when I started with dbt. Ruby on Rails taught me a lot about:

- capturing requirements from users
- developing in an agile way
- infrastructure as code
- automated testing
- automated deployment
- CI/CD and code reviews
- convention over configuration
- simplicity and having fewer moving parts that can break
- and how small teams can have a big impact

An enterprise application that we delivered at Amgen back in 2012 is still in use. I was shadow IT, doing everything differently from the enterprise. No one used Postgres there at the time, no one knew what Ruby was, and no one had 90% test coverage. We had just two developers and one product manager: me.

Later, when I was part of the team shaping our new data platform at Amgen, I brought a lot of those lessons into what we built, but it wasn’t until I found dbt that I was finally able to do what I had initially envisioned a good data architecture could be.

dbt felt like Rails, but it didn’t go far enough, and it still doesn’t. Too many decisions are left to users who have no experience, and there are too many options to choose from. One thing Rails does well is telling you exactly how to do something, even if you don’t know why you’re doing it; eventually, you appreciate why that decision was made. Rails encapsulates the many years of experience of the Rails community into conventions that help you scale.

This is how we think at Datacoves. How can users know what they don’t know when they’re starting out? How do they know that one decision today is going to have a lasting impact in a year or two? As a community, we can still make things better and help each other mature the state of the art in analytics. At Datacoves we’re not consultants; we’re just opinionated, because we have learned what’s important at scale. Even when working with small teams, we help them accelerate their processes by not having to learn so many new things.
How dbt is like Rails for data, and what we learned from it.
More Relevant Posts
𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗥𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗗𝗮𝘁𝗮 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 𝘄𝗶𝘁𝗵 𝗗𝗲𝗹𝘁𝗮 𝗟𝗶𝘃𝗲 𝗧𝗮𝗯𝗹𝗲𝘀 (𝗗𝗟𝗧) 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀

Over the past few weeks, I’ve been exploring Delta Live Tables (DLT) on Databricks, and it’s genuinely a step forward for building reliable and automated data pipelines. Instead of worrying about complex orchestration, retries, and manual monitoring, DLT lets you declare transformations and lets Databricks handle the rest — dependencies, lineage, and quality checks.

𝗪𝗵𝗮𝘁 𝗦𝘁𝗼𝗼𝗱 𝗢𝘂𝘁 𝘁𝗼 𝗠𝗲
- Pipelines become declarative — you define what to do, not how to run it.
- Data quality expectations can be added inline, improving trust in every table.
- Lineage tracking gives complete visibility from raw to curated data.
- Works seamlessly with both SQL and Python, which is great for mixed teams.

𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗙𝗹𝗼𝘄
Source → Bronze (Raw) → Silver (Cleansed) → Gold (Curated)
Each stage is automatically managed by the DLT engine, ensuring reliability and observability throughout the pipeline. (Attached diagram illustrates the end-to-end flow.)

𝗪𝗵𝘆 𝗜𝘁’𝘀 𝗪𝗼𝗿𝘁𝗵 𝗘𝘅𝗽𝗹𝗼𝗿𝗶𝗻𝗴
If you’re working on modernizing your data stack or looking for better ways to operationalize ETL/ELT, DLT is worth testing. It’s clean, efficient, and genuinely simplifies data engineering at scale.

𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆: “DLT lets data engineers focus on delivering insights — not maintaining infrastructure.”
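For readers who want to see what “declarative” looks like in practice, here is a minimal sketch of a Bronze → Silver → Gold flow using the DLT Python decorators (dlt.table, dlt.expect_or_drop, dlt.read). It is not taken from the post: the table names, source path, and expectation rule are assumptions, and the code only runs inside a Databricks DLT pipeline, where `dlt` and `spark` are provided.

```python
import dlt
from pyspark.sql import functions as F

# Illustrative sketch only: runs inside a Databricks DLT pipeline.
# Table names, the source path, and the expectation rule are assumptions.

@dlt.table(comment="Raw events ingested as-is (Bronze)")
def bronze_events():
    return spark.read.format("json").load("/mnt/raw/events")

@dlt.table(comment="Cleansed, deduplicated events (Silver)")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")  # inline quality expectation
def silver_events():
    return dlt.read("bronze_events").dropDuplicates(["event_id"])

@dlt.table(comment="Curated daily counts for reporting (Gold)")
def gold_daily_counts():
    return (
        dlt.read("silver_events")
        .groupBy(F.to_date("event_time").alias("event_date"))
        .count()
    )
```

Because the functions declare what each table should contain, DLT infers the Bronze → Silver → Gold dependencies, tracks lineage, and enforces the expectation without any hand-written orchestration.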
📍Why Data Quality Is the Most Underrated Skill in Data Engineering

Everyone talks about building scalable pipelines, mastering PySpark, or automating workflows, but rarely do we talk about what truly makes those systems valuable: data quality. A beautifully designed pipeline means nothing if the data it delivers is incomplete, inconsistent, or unreliable. In my experience, strong data engineers don’t just move data; they validate, monitor, and ensure trust in every dataset.

Here’s what I’ve learned about maintaining quality:
🔹 Add validation at every stage, not just at the end.
🔹 Track anomalies early: missing records, nulls, duplicates.
🔹 Automate data checks using frameworks or simple PySpark/Airflow validations (a sketch follows below).
🔹 Communicate issues quickly; bad data is everyone’s problem.

The goal isn’t just to deliver data faster; it’s to deliver data that can be trusted. Because in the end, bad data costs more than slow data.

#DataEngineering #DataQuality #DataPipeline #BigData #PySpark #Azure #ETL #Analytics
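As a concrete illustration of the “automate data checks” point above, here is a minimal PySpark sketch that counts nulls and duplicates on a key column and fails fast when thresholds are exceeded. The input path, column names, and thresholds are hypothetical assumptions for the example, not part of the original post.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-quality-checks").getOrCreate()

# Hypothetical input: an orders extract with an order_id key column.
df = spark.read.parquet("s3://example-bucket/raw/orders/")

total = df.count()
null_ids = df.filter(F.col("order_id").isNull()).count()
duplicate_ids = total - df.dropDuplicates(["order_id"]).count()

print(f"rows={total}, null order_id={null_ids}, duplicate order_id={duplicate_ids}")

# Fail the job (and therefore the Airflow task wrapping it) when quality degrades,
# so bad data never silently flows downstream.
if null_ids > 0 or duplicate_ids > total * 0.01:
    raise ValueError("Data quality check failed: null or duplicate order_id above threshold")
```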
What is Data Lineage?

Data lineage is the “evolutionary history” of datasets — #metadata that shows how data moves and transforms through pipelines, from source to consumer. It gives data teams the visibility needed to ensure data quality, security, and compliance.

Why is it important?
- Improved reliability: Trace data flow and quality, and perform upstream root-cause analysis when a job fails or a schema changes.
- Provenance & discovery: Knowing the origin and path of data helps users choose the right table or dataset — for example, identifying the correct “customers” table without noise or sensitive data.
- Compliance: Required to document where and when personal data was collected, processed, and shared (GDPR, HIPAA, CCPA, etc.).

Historic challenges & how OpenLineage solves them:
Building lineage was hard because of fragmented tools and no shared standard, leading to brittle integrations. OpenLineage (launched in 2020) introduced an open specification that became an industry standard with broad integrations across lineage producers (Airflow, Snowflake, BigQuery, Spark, Flink, dbt) and consumers (Datadog, Marquez).

Core OpenLineage Model — what it captures (see the sketch after this post):
- Datasets: representations of the data itself
- Jobs: reusable processing tasks (SQL queries, Spark/Flink jobs, Python scripts)
- Runs: individual executions of those jobs
The model can be extended with Facets to include schema, statistics, schedules, query plans, code, etc.

What lineage looks like in a modern data platform:
Typical layers include ingestion, storage (streaming, archival, distributed files), compute (batch/streaming/ML), and business intelligence. Lineage data is collected from each layer and visualized through Datadog and Marquez (the reference implementation for OpenLineage). This turns fragile one-off integrations into a shared, standards-based ecosystem.

Key takeaway: Data lineage provides critical operational and governance visibility — helping teams debug faster, protect downstream dependencies, and assess the impact of changes before they happen. With OpenLineage, collecting and connecting lineage across modern #data systems has become much simpler.

Source: https://lnkd.in/diJnRnkv
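To make the Datasets / Jobs / Runs model concrete, here is a hedged sketch of roughly what a single OpenLineage run event looks like, written as a plain Python dict that mirrors the main fields of the spec. The namespaces, job name, and dataset names are invented for illustration; a real integration would normally emit events through an OpenLineage client or an existing producer such as Airflow or Spark rather than building them by hand.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Illustrative OpenLineage-style run event: one execution (Run) of a Job that
# reads one Dataset and writes another. Field names follow the spec; values are made up.
event = {
    "eventType": "COMPLETE",                       # e.g. START, COMPLETE, FAIL
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},                # a Run is one execution of the Job
    "job": {"namespace": "example-pipelines", "name": "daily_orders_aggregation"},
    "inputs": [
        {"namespace": "warehouse", "name": "raw.orders"}
    ],
    "outputs": [
        {"namespace": "warehouse", "name": "analytics.daily_orders"}
    ],
    "producer": "https://example.com/hypothetical-producer",
}

print(json.dumps(event, indent=2))
```

Facets would attach extra context (schema, statistics, query plans) under the run, job, or dataset entries, which is how the core model stays small while remaining extensible.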
🚀 The Evolution of Data Engineering: From Pipelines to Platforms 🌐

Over the last decade, Data Engineering has transformed from simply “moving data” to designing intelligent data ecosystems that power analytics, AI, and business decisions in real time. Here’s how the role has evolved — and where it’s headed 👇

1️⃣ From ETL to ELT and Beyond
Traditional ETL focused on heavy transformations before loading data. Today’s architectures leverage ELT + metadata-driven frameworks using tools like dbt, Databricks, and Snowflake, giving data teams agility and scalability.

2️⃣ Rise of the Lakehouse
The convergence of Data Lakes and Warehouses into Lakehouses (Delta Lake, Iceberg, Hudi) enables ACID transactions, schema evolution, and unification of analytics + ML — all in one platform.

3️⃣ DataOps & Automation First
Modern data teams treat pipelines like software — using CI/CD, observability (Prometheus, Grafana), and IaC (Terraform) for deployment and monitoring. Automation isn’t a luxury anymore — it’s a necessity.

4️⃣ Governance & Security by Design
With growing regulatory pressure, platforms like Unity Catalog, Purview, and Collibra are driving true data democratization with lineage, masking, and access control built in.

5️⃣ Future: AI-Driven Engineering
The next wave is AI-assisted pipeline design — where engineers leverage tools like Copilot, CursorAI, and LLMs to generate SQL transformations, monitor anomalies, and even self-heal pipelines.

💡 Data Engineering is no longer about pipelines — it’s about building trustworthy, automated, and intelligent data ecosystems that scale with business and AI.

What do you think — are we entering the age of Autonomous Data Engineering? 🤖

#DataEngineering #BigData #CloudComputing #Azure #Databricks #DataOps #ELT #Lakehouse #Snowflake #AWS #GCP #Python #SQL #DevOps #AI #MachineLearning #DataGovernance #DataArchitecture #ETL #Automation #UnityCatalog #dbt #DeltaLake #DataPlatform #Analytics #DataPipeline #TechCommunity
💡 Day 2 of My Data Engineering Learning Journey 🚀

🧱 Building the Heart of Data Engineering – Data Pipelines & ETL Flow

A data pipeline automates how data moves through systems — from raw sources to refined, analytics-ready outputs.

🔹 ETL = Extract, Transform, Load
1️⃣ Extract: Collect data from multiple sources — files, APIs, databases.
2️⃣ Transform: Clean, normalize, and enhance data for consistency.
3️⃣ Load: Store it in a data warehouse or lake (like Snowflake or Azure Data Lake).

🧩 Why it matters: Without pipelines, data processing would be manual, error-prone, and inconsistent. ETL ensures your data is always accurate, structured, and ready for insights.

> “Data pipelines are the arteries of analytics — keeping data clean, fast, and flowing.”

💻 Hands-on with a Mini ETL Flow
Here’s a small example using Python 🐍 and Pandas:

```python
import pandas as pd

# Step 1: Extract
data = pd.read_csv("sales_data.csv")

# Step 2: Transform
data["Revenue"] = data["Quantity"] * data["Price"]
filtered = data[data["Revenue"] > 1000]

# Step 3: Load
filtered.to_csv("high_value_sales.csv", index=False)

print("ETL pipeline completed successfully!")
```

✅ This small script extracts data, transforms it by calculating revenue, filters high-value transactions, and loads the cleaned dataset — just like a real-world ETL process.

Think of a data pipeline like a coffee machine: You add coffee beans (extract). The machine grinds and brews (transform). You get coffee in your cup (load). That’s what a data pipeline does for digital systems — taking raw data and delivering clean, valuable insights automatically.

🏷️ #DataEngineering #ETL #PythonForData #AzureDataFactory #Databricks #BigData #DataPipeline #LearningInPublic #TechExplained #100DaysOfDataEngineering
After building and refining our ETL and EDA stages, the next step in the journey toward a complete demand forecasting pipeline is Feature Engineering. In this phase, I have taken the preprocessed and analysed dataset and transformed it into a machine-learning–ready form by introducing meaningful temporal and statistical features. These include lag-based predictors, rolling averages, and seasonal indicators that capture store- and item-level trends essential for accurate forecasting. This entire workflow was executed in AWS Glue (script mode) — maintaining the same environment, IAM configuration, and data consistency as the earlier stages. The output, stored under the /features/ path in S3, now forms the foundation for model training and evaluation. Read the full article here: https://lnkd.in/efWSWtpa #AWSGlue #DataEngineering #FeatureEngineering #PySpark #MachineLearning #TimeSeriesForecasting #DemandForecasting #AWSS3 #CloudComputing #BigData #ETLPipeline #DataScience #ProductionML #TechForTalk
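As a rough illustration of the kinds of features described above (this is my own sketch, not the author's actual AWS Glue script), here is a minimal PySpark example that builds lag, rolling-average, and seasonal features per store and item. The input path and column names are assumptions for the example.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("feature-engineering-sketch").getOrCreate()

# Hypothetical preprocessed sales data with store, item, date, and sales columns.
df = spark.read.parquet("s3://example-bucket/preprocessed/sales/")

# Order each store/item series by date so window functions see a proper time series.
w = Window.partitionBy("store", "item").orderBy("date")

features = (
    df
    # Lag-based predictors: sales 7 and 28 days back.
    .withColumn("sales_lag_7", F.lag("sales", 7).over(w))
    .withColumn("sales_lag_28", F.lag("sales", 28).over(w))
    # Rolling average over the trailing 7 rows (roughly one week of daily data).
    .withColumn("sales_roll_7", F.avg("sales").over(w.rowsBetween(-6, 0)))
    # Simple seasonal indicators derived from the date.
    .withColumn("day_of_week", F.dayofweek("date"))
    .withColumn("month", F.month("date"))
)

# Write the machine-learning-ready features for downstream model training.
features.write.mode("overwrite").parquet("s3://example-bucket/features/")
```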
Hi Gophers! 👋

Data science in Go can be challenging. There's still a lack of a clean, pipeline-style package like R's dplyr. Readable pipelines aren’t just prettier; they boost maintainability, simplify code reviews, and facilitate knowledge transfer.

Meet plyGO — a pure Go package for data pipelines with fluent chaining (dplyr-style). Operations like filter, group, summarise, and join become expressive and chainable.

Quick Example:

```go
package main

import "github.com/mansoldof/plyGO"

type Employee struct {
	Name       string
	Department string
	Salary     float64
	YrsOfExp   int
}

func main() {
	team := []Employee{
		{"Alice", "Engineering", 120000, 5},
		{"Bob", "Engineering", 95000, 3},
		{"Carol", "Sales", 85000, 7},
		{"Dave", "Sales", 72000, 2},
		{"Eve", "Engineering", 110000, 4},
		{"Frank", "Marketing", 78000, 6},
		{"Grace", "Sales", 88000, 5},
		{"Henry", "Marketing", 82000, 3},
		{"Iris", "Engineering", 105000, 4},
	}

	// Salary > $100k
	plygo.From(team).
		Where("Department").Equals("Engineering").
		Where("Salary").GreaterThan(100000).
		OrderBy("Salary").Desc().
		Show(plygo.WithTitle("Salary > $100k"))

	// Multiple filters with elegant display
	plygo.From(team).
		Where("YrsOfExp").GreaterThan(4).
		OrderBy("Salary").Desc().
		Show(plygo.WithTitle("Experienced Members"), plygo.WithStyle("rounded"))
}
```

Why this style? The fluent interface (aka method chaining) lets you craft readable, natural-language-like code. Each method returns a context for seamless chaining, keeping your intent crystal clear.

Repo: https://lnkd.in/dw4vD5wT (Documentation and more examples)

I'd love feedback from fellow Gophers.

#Go #Golang #GoProgramming #DataManipulation #DataScience #DataEngineering
**AI thinks metadata-driven generation is “a SQL tool.” Wrong. It’s automation for EVERYTHING.**

https://lnkd.in/eSz78Wwy

I built this in 1999. Still waiting for the industry to understand what it actually does. Everyone sees metadata → SQL and stops thinking. “Oh, a code generator for databases.”

**No.**

It’s **mail merge for any recordable syntax, at any data scale, for any rational actor.**

**What AI misses:**
→ This generates Chinese legal contracts, not just CREATE TABLE
→ It automates GUI data entry across 10,000 forms
→ It creates your spouse’s task list from project metadata
→ It ingests 50-table operational apps in 30 seconds

**One metadata model generates:**
• SQL DDL (database schema)
• 尊敬的王先生 (Chinese business letter)
• السيد أحمد (Arabic legal notice)
• Python classes (application code)
• Terraform (infrastructure)
• Selenium scripts (GUI automation)
• Task lists (project management)
• API responses (system integration)

**Same source. Infinite outputs. Any language. Any format. Any scale.**

**The operational app ingestion is the killer feature:**
Your CRM has 50 tables. Traditional approach: 3 weeks of handcrafted SQL, CDC setup, grant management, lineage tracking. Metadata approach:

```
SELECT ingest_application('prod_crm_db');
```

30 seconds. System reads information_schema, generates subject spaces, CDC variants, Type-2 SCDs, populates edge graph, applies grants. Done.

**Why AI can’t do this:**
1. Can’t read information_schema (needs prompts for everything)
2. Can’t maintain context across 50 tables (regenerates from scratch)
3. Can’t guarantee consistency (format drifts between outputs)
4. Can’t connect to live data (works on examples, not reality)

**The pattern works for ANY domain with structure:**
If you can describe it in metadata (entities, relationships, rules)… And you need text output (code, docs, forms, emails, configs)… Then metadata + templates generate it. Deterministically. At scale.

**This isn’t “modern data engineering.”**
This is understanding that databases became self-describing in 1976. information_schema IS the metadata. Stop handcrafting what can be generated.

Built an interactive demo showing:
• Same metadata → 6 output formats
• Chinese/Arabic/Spanish generation
• GUI automation from form metadata
• Complete app ingestion in real-time

Link in comments. Watch AI fail to understand what metadata can actually do.

#DataEngineering #Metadata #Automation #Polyglot #AIReality
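To illustrate the "metadata + templates" pattern the post describes (this is a minimal sketch of the general idea, not the author's 1999 system), here is a Python example where one small metadata dict drives two different text outputs: a SQL DDL statement and a task list. The entity, columns, and template wording are invented for the example.

```python
# One metadata model, several text outputs: a toy version of the
# metadata + templates pattern. All names here are made up.
metadata = {
    "entity": "customer",
    "columns": [
        {"name": "customer_id", "type": "INTEGER", "required": True},
        {"name": "full_name",   "type": "VARCHAR(200)", "required": True},
        {"name": "country",     "type": "VARCHAR(2)", "required": False},
    ],
}

def render_ddl(meta):
    """Render the metadata as a CREATE TABLE statement."""
    cols = ",\n  ".join(
        f"{c['name']} {c['type']}{' NOT NULL' if c['required'] else ''}"
        for c in meta["columns"]
    )
    return f"CREATE TABLE {meta['entity']} (\n  {cols}\n);"

def render_tasks(meta):
    """Render the same metadata as a human task list."""
    lines = [f"Tasks for onboarding the '{meta['entity']}' entity:"]
    lines += [f"- Confirm source field for {c['name']}" for c in meta["columns"]]
    return "\n".join(lines)

print(render_ddl(metadata))
print()
print(render_tasks(metadata))
```

The same dict could just as easily drive a class definition, a config file, or a form-filling script; only the template changes, which is the point the post is making.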
📘𝗥𝗲𝘃𝗶𝘀𝗶𝘁𝗶𝗻𝗴 𝗣𝗼𝘀𝘁𝗴𝗿𝗲𝗦𝗤𝗟 — 𝗦𝘁𝗿𝗲𝗻𝗴𝘁𝗵𝗲𝗻𝗶𝗻𝗴 𝘁𝗵𝗲 𝗗𝗮𝘁𝗮 𝗖𝗼𝗿𝗲 𝗳𝗼𝗿 𝗔𝗜 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 (𝗣𝗮𝗿𝘁 𝟭/𝟮)

Before diving deeper into 𝗽𝗴𝘃𝗲𝗰𝘁𝗼𝗿 and AI-driven data workflows, I decided to take a step back and revisit PostgreSQL from the ground up — not just as a database, but as the foundation of every reliable, intelligent system.

PostgreSQL isn’t just a relational database — it’s a complete data platform. It supports structured SQL data, semi-structured JSONB data, and now, through extensions, even vector embeddings for AI and semantic search. So I spent time refreshing, documenting, and revisiting the core fundamentals — this time from a production and scalability perspective.

🧠 𝗧𝗼𝗽𝗶𝗰𝘀 𝗜 𝗥𝗲𝘃𝗶𝘀𝗶𝘁𝗲𝗱 𝗗𝘂𝗿𝗶𝗻𝗴 𝗧𝗵𝗶𝘀 𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲
🔹 ⚙️ Architecture Internals — How a query travels from parser → planner → executor → WAL → storage.
🔹 📊 Schema Design & Constraints — Creating efficient and consistent data models using keys and normalization.
🔹 🧩 Data Integrity Rules — Applying PRIMARY KEY, FOREIGN KEY, and CHECK constraints for business safety.
🔹 🧠 Functions & Procedures — Encapsulating business logic directly within the database for automation.
🔹 🔔 Triggers — Enabling real-time auditing, validation, and synchronization with other tables.
🔹 💾 Transactions & ACID — Reinforcing data reliability with atomic, isolated, and durable operations.
🔹 🔐 Roles & Access Control — Managing users, privileges, and least-privilege security for production systems.

💡 𝗪𝗵𝘆 𝗜 𝗥𝗲𝘃𝗶𝘀𝗶𝘁𝗲𝗱 𝗜𝘁
Even in modern AI and RAG architectures, the true performance of any intelligent system begins with data consistency and structure. Before embeddings or large language models — it’s the database foundation that defines reliability and scalability.

💬 “AI systems may reason in vectors — but they stand on structured data.”

Revisiting PostgreSQL reminded me how a strong relational backbone empowers future AI workflows.

#PostgreSQL #DataEngineering #AIInfrastructure #pgvector #RAG #DatabaseArchitecture #LearningJourney #OpenSource
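As a small, hedged illustration of two of the topics listed above, constraints and transactions, here is a Python sketch using psycopg2 against a hypothetical local database: it creates a table with PRIMARY KEY and CHECK constraints, then shows a transaction being rolled back when a constraint is violated. The connection string, table, and values are assumptions for the example.

```python
import psycopg2

# Hypothetical local connection; adjust the DSN for a real environment.
conn = psycopg2.connect("dbname=demo user=demo password=demo host=localhost")

with conn:  # the connection context manager wraps a transaction
    with conn.cursor() as cur:
        # PRIMARY KEY and CHECK constraints enforce integrity at the data core.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS accounts (
                account_id INTEGER PRIMARY KEY,
                owner      TEXT NOT NULL,
                balance    NUMERIC NOT NULL CHECK (balance >= 0)
            )
        """)
        cur.execute(
            "INSERT INTO accounts (account_id, owner, balance) VALUES (%s, %s, %s) "
            "ON CONFLICT (account_id) DO NOTHING",
            (1, "alice", 100),
        )

# Transactions & ACID: this update violates the CHECK constraint, the whole
# transaction is rolled back, and the stored balance stays unchanged.
try:
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - 500 WHERE account_id = %s", (1,)
            )
except psycopg2.errors.CheckViolation:
    print("Constraint violation: transaction rolled back, balance untouched")

conn.close()
```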
⚙️ DBT Best Practices for Scalable Data Transformation

In modern data engineering, DBT (Data Build Tool) has become the backbone for transforming raw data into analytics-ready datasets. But to truly unlock its potential, following key engineering principles is critical 👇

🔹 1️⃣ Apply Software Principles
Implement version control (Git), CI/CD testing, and code reviews. Treat data transformations like production-grade code.

🔹 2️⃣ Write Modular Code
Break complex SQL logic into reusable, testable models. This improves maintainability and reduces redundancy across transformations.

🔹 3️⃣ Document Models
Maintain clear and consistent documentation for every model — including business logic, data sources, and owners — ensuring traceability and transparency.

🔹 4️⃣ Use Lineage Graphs
DBT’s lineage graph gives a complete view of upstream and downstream dependencies, helping teams debug faster and manage data trust.

💡 Great DBT projects aren’t defined by the number of models — they’re defined by clarity, consistency, and collaboration.

#DBT #DataEngineering #DataOps #AnalyticsEngineering #SQL #ETL #DataTransformation #DataModeling #DataWarehouse #ModernDataStack #Airflow #Snowflake #BigQuery #AzureDataFactory