
Data Engineering With Dagster – Part One: A Fresh Take on Orchestration
Intro: From Payloads to Pipelines
Alright — this is a little different from XSS payloads and shell listeners. Welcome to my write-up on Dagster, the data orchestrator that’s part of my ongoing journey into data engineering. This series documents my learning process through Dagster University — and by the end of it, I’ll publish a hands-on example project that you (and I) can reuse for real-world setups.
Think of this as a living tutorial I’m building for myself as I get into real data workflows at work.
🎯 Why I’m Doing This
I’m not just from a security or pentesting background.
In my day job at BASF, I design and automate entire cyber risk workflows — building systems, developing machine learning tooling, and engineering secure, scalable platforms across international teams.
Whether it’s:
- Designing cost models for cyber incidents
- Managing multi-stage ML pipelines
- or creating intelligent threat model platforms…
…I’ve learned that engineering and security are two sides of the same coin.
And in the context of data, orchestration is where those two worlds meet.
Dagster is my way of leveling up in:
- Modular data workflows
- Reproducible, testable pipelines
- Orchestrating real-world systems with ease
So yeah — I’m here to break stuff and to build it back better.
Let’s get right into the vibe shift.
🧬 What Even Is Data Engineering?
“Data engineering is the practice of designing and building software for collecting, storing, and managing data.”
Okay, but what does that really mean?
It’s about making data clean, available, consistent, and ready. Whether that’s for:
- BI dashboards and KPIs
- Training machine learning models
- Providing real-time data to users
- Reacting to events (think: “send an email if sales drop”)
…you can bet a data engineer had to fight the chaos first.
Big mess? Big data engineer energy.
🎼 Enter the Orchestrator
Before you get scared by the word “orchestrator”, here’s the problem it originally solved:
“I need to run a bunch of scripts, in a specific order, on a schedule.”
Boom — first-gen orchestrators like Airflow or Luigi step in. They’re task-centric: “Step A, then Step B, then Step C.”
But modern orchestrators (like Dagster) go way beyond that:
- Visualizing complex pipelines
- Catching and retrying failures
- Sending notifications
- Preventing data collisions
- Understanding what changed, and why
All that, without manually gluing together cron jobs and spreadsheets.
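To make the contrast concrete, here's a caricature of what first-gen orchestration often replaced: a hand-rolled Python runner (script names are hypothetical) that a cron entry kicks off once a day.

```python
# run_pipeline.py: the "poor man's orchestrator" that a cron entry would invoke
import subprocess

# Task-centric and brittle: fixed order, no lineage, no retries, no visibility
for step in ["extract.py", "transform.py", "load.py"]:
    subprocess.run(["python", step], check=True)  # check=True aborts on the first failure
```

Everything in the list above (retries, notifications, lineage) has to be bolted onto this by hand. That's the gap modern orchestrators fill.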
⭐ Why Dagster?
Dagster’s special sauce is its asset-centric mindset.
It asks: What are you creating?
Not: What step are you running?
And that, my friend, changes everything.
🍪 Asset-Centric vs Task-Centric
Let’s bake some cookies:
- Task-centric:
“Mix flour and sugar, then bake it.”
(Good luck changing the recipe.)
- Asset-centric:
“I want to produce a `cookie`. It depends on `dough`, which depends on `flour` and `sugar`.”
(Change one ingredient? Dagster knows what needs to be recomputed.)
The focus shifts from steps to outputs — like building a dependency graph of meaningful results.
| Style | Focus | Flexibility | Reusability |
|---|---|---|---|
| Task-centric | Steps | Low | Poor |
| Asset-centric | Results | High | Excellent |
Dagster treats data artifacts as first-class citizens. That’s its killer feature.
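To make the cookie analogy concrete in code, here's a minimal Dagster sketch (my own toy example, using strings as stand-in “ingredients”). Dagster infers the dependency graph purely from the parameter names:

```python
import dagster as dg

@dg.asset
def flour() -> str:
    return "flour"

@dg.asset
def sugar() -> str:
    return "sugar"

@dg.asset
def dough(flour: str, sugar: str) -> str:
    # Parameter names match upstream asset names, so Dagster wires them together
    return f"dough({flour} + {sugar})"

@dg.asset
def cookie(dough: str) -> str:
    return f"baked {dough}"
```

Change `sugar` and Dagster knows `dough` and `cookie` are stale; that's exactly the recompute-what-changed behavior from the analogy.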
⚙️ Quick Setup
Not gonna lie — their setup tutorial is pretty smooth. Nothing exotic, no dark magic required.
📚 👉 Check their official Getting Started guide here
I won’t duplicate their steps here — follow that to get your Python environment, Dagster CLI, and example repo up and running.
🧱 Core Concept: Assets
Assets are the bread and butter of Dagster. They’re persistent things in storage — not ephemeral steps in a script.
Think:
- Tables in a data warehouse
- JSON files in an S3 bucket
- Trained ML models
- Data connectors or external APIs
Each asset has:
- A unique key (like `users/raw` or `ml_models/decision_tree`)
- Dependencies (what data it relies on)
- A computation function (how it’s created)
🧪 Example: Bread Asset
```python
import dagster as dg

@dg.asset
def simple_bread(flour, water, salt):
    # flour, water, and salt are upstream assets, resolved by parameter name
    return flour + water + salt
```
This asset has three upstream dependencies: `flour`, `water`, and `salt`. When any of them change, Dagster knows `simple_bread` needs to be recomputed.
💡 Best practice: Use nouns for asset keys (`daily_sales_report`), not verbs (`compute_report`). You’re describing what exists, not what happens.
The Decorator Tells the Story
```python
@dg.asset
def user_table():
    ...
```
Behind that little `@dg.asset` decorator, Dagster does some heavy lifting:
- Registers this function as part of your data graph
- Tracks lineage and dependencies
- Links the result to a logical data node Dagster can reason about
It’s not just syntactic sugar — it’s how Dagster understands your system.
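Concretely, “registers” means the asset ends up in a `Definitions` object that Dagster’s tools load. A minimal sketch (the Dagster University scaffold generates this wiring for you, so you may never write it by hand):

```python
import dagster as dg

@dg.asset
def user_table():
    ...

# The entry point the Dagster UI and CLI load to discover your asset graph
defs = dg.Definitions(assets=[user_table])
```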
Dagster University’s little knowledge checks could be a bit more hands-on, though: a small window where I define a recipe or write some actual code would be more exciting. If it doesn’t challenge me, it’s not as helpful as it could be.
🧠 Mini Excursus: What Even Is an API?
An API (Application Programming Interface) is like a waiter in a restaurant. You (the client) don’t go into the kitchen — you just ask the waiter (API) for what you want, and they bring it to you from the backend.
In this case, the NYC OpenData API is the waiter. We ask it for a very specific file, a `.parquet` file with yellow cab trip data from March 2023, and it hands it over. No need to scrape pages or build forms. Just clean HTTP.
APIs return structured data (like JSON, CSV, or in this case, Parquet), which makes them perfect for automation and orchestration.
🔧 Defining the First Real Asset: NYC Taxi Data
We’re now working inside the `assets/trips.py` file of the Dagster University project. Here’s the starter code:
```python
import requests  # For accessing the NYC OpenData API
from . import constants  # Contains paths and config values, including the save path
import dagster as dg  # Dagster framework


@dg.asset
def taxi_trips_file() -> None:
    """
    The raw parquet file for the taxi trips dataset. Sourced from the NYC Open Data portal.
    """
    month_to_fetch = "2023-03"
    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    with open(
        constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch), "wb"
    ) as output_file:
        output_file.write(raw_trips.content)
```
🚀 What’s Going On Here?
- `@dg.asset`: Classic Dagster decorator. This registers the function as a data asset Dagster can track and visualize.
- `def taxi_trips_file() -> None`: We’re defining a function that doesn’t return anything to Python; it writes data to disk. The `-> None` is a type hint, telling readers (and tools like linters or IDEs) “don’t expect a return value.”
- The triple quotes `""" ... """` form a docstring, a special comment Python attaches to functions. Dagster uses this as the description in the UI!
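For context, `TAXI_TRIPS_TEMPLATE_FILE_PATH` lives in the project’s `constants.py`. A sketch of the relevant line (the real Dagster University file defines more paths than this):

```python
# assets/constants.py (sketch, showing only the path used above)
import os

# "{}" gets filled with the month, e.g. data/raw/taxi_trips_2023-03.parquet
TAXI_TRIPS_TEMPLATE_FILE_PATH = os.path.join("data", "raw", "taxi_trips_{}.parquet")
```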
🧠 Mini (mini) Excursus: Type Annotations & Docstrings
- Type hints like `-> None` or `def foo(bar: str) -> int` make your code more readable and tool-friendly.
- Docstrings aren’t just for humans anymore: Dagster shows them right in the UI to explain what your asset is doing.
🧠 Mini Excursus: What’s With the Parquet Format?
Parquet is a columnar storage format, optimized for big data. Instead of storing data row-by-row like CSVs, it stores entire columns together. Why?
- It compresses better
- It’s faster to query specific fields
- It’s perfect for data warehouses and analytics engines like Spark, BigQuery, etc.
So yeah — it’s not just “some weird file ending in .parquet”. It’s what makes modern data workflows scale.
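To see the “faster to query specific fields” point in action, here’s a minimal pandas sketch (assuming `pandas` plus `pyarrow` are installed; the column names are my assumption about the yellow-taxi schema). Only the requested columns are read from disk:

```python
import pandas as pd

# Columnar format: reading two columns skips all the others entirely
df = pd.read_parquet(
    "data/raw/taxi_trips_2023-03.parquet",
    columns=["tpep_pickup_datetime", "fare_amount"],  # assumed column names
)
print(df.head())
```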
👀 Time to View & Materialize Your Asset
Now that we’ve defined our first asset, it’s time to see it in the Dagster UI and run it.
Step 1: Launch the UI
Start the dev server:
```bash
dagster dev
```
Then go to http://localhost:3000
Step 2: Check Out Your Asset
Click Assets in the top nav bar. If it’s empty, hit the Reload Definitions button; this re-indexes your project’s asset graph.
You should now see `taxi_trips_file`, along with:
- Its group and code location
- Its status (probably: “Never materialized”)
- The description (yep, that docstring shows up!)
Now click View Global Asset Lineage. It’s a bit empty now, but this is where your full DAG (directed acyclic graph) of assets will show up.
Step 3: Materialize It
Click the big Materialize button. This runs the function we defined earlier, fetching the file and writing it to disk.
If it worked, you’ll see a purple banner saying the run started.
📁 Check your project directory:
The file should now exist at `data/raw/taxi_trips_2023-03.parquet`
If it’s not there immediately — relax. It’s downloading a decent-sized file. Refresh after a minute or two.
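If you’d rather verify from code than keep refreshing your file browser, here’s a quick sanity check (run from the project root, where the `data/` folder lives):

```python
from pathlib import Path

path = Path("data/raw/taxi_trips_2023-03.parquet")
if path.exists():
    print(f"Downloaded: {path} ({path.stat().st_size / 1_000_000:.1f} MB)")
else:
    print("Not there yet; give the download another minute.")
```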
Step 4: View the Run Details
Click on the Materialized date next to the asset to open the Run Details page. There you’ll find:
- Run stats: ID, date, duration, and affected assets
- Run timeline: Which step ran when, and whether it succeeded
- Logs: Full output for debugging and nerding out
You’ll also find buttons for re-running or launching more assets once you build out the graph.
✅ That’s it — your first real asset, materialized and traceable.
🧠 Mini Excursus: Dagster vs Airflow
| Feature | Airflow | Dagster |
|---|---|---|
| Paradigm | Task-based | Asset-based |
| Data awareness | Low | High |
| UI & Visualization | Functional | Polished & interactive |
| Debugging | Logs per task | Full lineage & context |
| Suitable for | Traditional ETL/ELT | ML, analytics, modular data |
Not saying one is objectively better, just that Dagster seems way more flexible and better suited to all the stuff I have in mind.
🧯 Troubleshooting Failed Runs in Dagster
So far, materialization has been smooth sailing. But what if something breaks? How do you debug a failed run?
Let’s intentionally break something and walk through Dagster’s debugging workflow like pros.
🧪 Step 1: Cause a Failure on Purpose
In `assets/trips.py`, comment out the import for `constants`:
```python
import requests
# from . import constants  # ← intentionally broken
import dagster as dg
```
This will still let your function execute up until the moment it tries to use `constants.TAXI_TRIPS_TEMPLATE_FILE_PATH`, which now doesn’t exist.
Hit Materialize in the UI again.
💥 Expected result: The asset turns red in the graph — it failed.
🔍 Step 2: Investigate the Failed Run
Click the red-colored date next to the asset to open the Run Details page.
Here’s what changes when a run fails:
- Run status: Clearly marked as ❌ Failure
- Timeline view: Problem steps highlighted in red
- Errored tab: Lists the step(s) that failed
- Logs panel: Full trace of what went wrong
🧠 Mini Excursus: Logs > Vibes
Instead of guessing what broke, Dagster gives you structured logs.
Click on the `STEP_FAILURE` event in the log timeline. Then hit View full message; this pops up the full Python stacktrace.
Here you’ll see the `NameError: name 'constants' is not defined`, which makes the issue very clear.
This isn’t just “run failed” — it’s precise introspection, right down to the line.
🔁 Step 3: Fix and Re-execute
Now go back and uncomment the broken import:
```python
from . import constants  # fixed
```
Save the file, then return to the Run Details page in the UI. In the top-right corner, click Re-execute all.
Dagster re-runs the failed job from scratch. If all goes well…
✅ The asset is now materialized successfully again.
🧠 Mini Excursus: Re-execution is Powerful
Re-executing from the Run Details page lets you:
- Retry only failed steps
- Avoid rerunning expensive successful ones (if you configure that)
- Quickly iterate during development
This is the kind of developer experience that makes data engineering fun again.
📚 Resources So Far
- Dagster Docs
- Dagster University (it’s free!)
- YouTube: Dagster Crash Course
Stay tuned for the upcoming parts! Going forward, it’ll also be worth checking the GitLab repo regularly for the example projects.