
Data Engineering With Dagster – Part One: A Fresh Take on Orchestration
Intro: From Payloads to Pipelines
Alright — this is a little different from XSS payloads and shell listeners. Welcome to my write-up on Dagster, the data orchestrator that’s part of my ongoing journey into data engineering. This series documents my learning process through Dagster University — and by the end of it, I’ll publish a hands-on example project that you (and I) can reuse for real-world setups.
Think of this as a living tutorial I’m building for myself as I get into real data workflows at work.
🎯 Why I’m Doing This
I’m not just from a security or pentesting background.
In my day job at BASF, I design and automate entire cyber risk workflows — building systems, developing machine learning tooling, and engineering secure, scalable platforms across international teams.
Whether it’s:
- Designing cost models for cyber incidents
- Managing multi-stage ML pipelines
- or creating intelligent threat model platforms…
…I’ve learned that engineering and security are two sides of the same coin.
And in the context of data, orchestration is where those two worlds meet.
Dagster is my way of leveling up in:
- Modular data workflows
- Reproducible, testable pipelines
- Orchestrating real-world systems with ease
So yeah — I’m here to break stuff and to build it back better.
Let’s get right into the vibe shift.
🧬 What Even Is Data Engineering?
“Data engineering is the practice of designing and building software for collecting, storing, and managing data.”
Okay, but what does that really mean?
It’s about making data clean, available, consistent, and ready. Whether that’s for:
- BI dashboards and KPIs
- Training machine learning models
- Providing real-time data to users
- Reacting to events (think: “send an email if sales drop”)
…you can bet a data engineer had to fight the chaos first.
Big mess? Big data engineer energy.
🎼 Enter the Orchestrator
Before you get scared by the word “orchestrator”, here’s the problem it originally solved:
“I need to run a bunch of scripts, in a specific order, on a schedule.”
Boom — first-gen orchestrators like Airflow or Luigi step in. They’re task-centric: “Step A, then Step B, then Step C.”
But modern orchestrators (like Dagster) go way beyond that:
- Visualizing complex pipelines
- Catching and retrying failures
- Sending notifications
- Preventing data collisions
- Understanding what changed, and why
All that, without manually gluing together cron jobs and spreadsheets.
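To make the contrast concrete, here's a caricature of what first-gen orchestration often replaced: a hand-rolled Python runner (script names are hypothetical) that a cron entry kicks off once a day.

```python
# run_pipeline.py: the "poor man's orchestrator" that a cron entry would invoke
import subprocess

# Task-centric and brittle: fixed order, no lineage, no retries, no visibility
for step in ["extract.py", "transform.py", "load.py"]:
    subprocess.run(["python", step], check=True)  # check=True aborts on the first failure
```

Everything in the list above (retries, notifications, lineage) has to be bolted onto this by hand. That's the gap modern orchestrators fill.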
⭐ Why Dagster?
Dagster’s special sauce is its asset-centric mindset.
It asks: What are you creating?
Not: What step are you running?
And that, my friend, changes everything.
🍪 Asset-Centric vs Task-Centric
Let’s bake some cookies:
- Task-centric:
“Mix flour and sugar, then bake it.”
(Good luck changing the recipe.)
- Asset-centric:
“I want to produce a `cookie`. It depends on `dough`, which depends on `flour` and `sugar`.”
(Change one ingredient? Dagster knows what needs to be recomputed.)
The focus shifts from steps to outputs — like building a dependency graph of meaningful results.
| Style | Focus | Flexibility | Reusability |
|---|---|---|---|
| Task-centric | Steps | Low | Poor |
| Asset-centric | Results | High | Excellent |
Dagster treats data artifacts as first-class citizens. That’s its killer feature.
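To make the cookie analogy concrete in code, here's a minimal Dagster sketch (my own toy example, using strings as stand-in “ingredients”). Dagster infers the dependency graph purely from the parameter names:

```python
import dagster as dg

@dg.asset
def flour() -> str:
    return "flour"

@dg.asset
def sugar() -> str:
    return "sugar"

@dg.asset
def dough(flour: str, sugar: str) -> str:
    # Parameter names match upstream asset names, so Dagster wires them together
    return f"dough({flour} + {sugar})"

@dg.asset
def cookie(dough: str) -> str:
    return f"baked {dough}"
```

Change `sugar` and Dagster knows `dough` and `cookie` are stale; that's exactly the recompute-what-changed behavior from the analogy.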
⚙️ Quick Setup
Not gonna lie — their setup tutorial is pretty smooth. Nothing exotic, no dark magic required.
📚 👉 Check their official Getting Started guide here
I won’t duplicate their steps here — follow that to get your Python environment, Dagster CLI, and example repo up and running.
🧱 Core Concept: Assets
Assets are the bread and butter of Dagster. They’re persistent things in storage — not ephemeral steps in a script.
Think:
- Tables in a data warehouse
- JSON files in an S3 bucket
- Trained ML models
- Data connectors or external APIs
Each asset has:
- A unique key (like `users/raw` or `ml_models/decision_tree`)
- Dependencies (what data it relies on)
- A computation function (how it’s created)
🧪 Example: Bread Asset
```python
import dagster as dg

@dg.asset
def simple_bread(flour, water, salt):
    # flour, water, and salt are upstream assets, resolved by parameter name
    return flour + water + salt
```
This asset has three upstream dependencies: `flour`, `water`, and `salt`. When any of them change, Dagster knows `simple_bread` needs to be recomputed.
💡 Best practice: Use nouns for asset keys (`daily_sales_report`), not verbs (`compute_report`). You’re describing what exists, not what happens.
The Decorator Tells the Story
```python
@dg.asset
def user_table():
    ...
```
Behind that little `@dg.asset` decorator, Dagster does some heavy lifting:
- Registers this function as part of your data graph
- Tracks lineage and dependencies
- Links the result to a logical data node Dagster can reason about
It’s not just syntactic sugar — it’s how Dagster understands your system.
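Concretely, “registers” means the asset ends up in a `Definitions` object that Dagster’s tools load. A minimal sketch (the Dagster University scaffold generates this wiring for you, so you may never write it by hand):

```python
import dagster as dg

@dg.asset
def user_table():
    ...

# The entry point the Dagster UI and CLI load to discover your asset graph
defs = dg.Definitions(assets=[user_table])
```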
Dagster University’s little knowledge checks could be a bit more hands-on, though: a small window where I define a recipe or write some actual code would be more exciting. If it doesn’t challenge me, it’s not as helpful as it could be.
🧠 Mini Excursus: What Even Is an API?
An API (Application Programming Interface) is like a waiter in a restaurant. You (the client) don’t go into the kitchen — you just ask the waiter (API) for what you want, and they bring it to you from the backend.
In this case, the NYC OpenData API is the waiter. We ask it for a very specific file, a `.parquet` file with yellow cab trip data from March 2023, and it hands it over. No need to scrape pages or build forms. Just clean HTTP.
APIs return structured data (like JSON, CSV, or in this case, Parquet), which makes them perfect for automation and orchestration.
🔧 Defining the First Real Asset: NYC Taxi Data
We’re now working inside the `assets/trips.py` file of the Dagster University project. Here’s the starter code:
```python
import requests  # For accessing the NYC OpenData API
from . import constants  # Contains paths and config values, including the save path
import dagster as dg  # Dagster framework


@dg.asset
def taxi_trips_file() -> None:
    """
    The raw parquet file for the taxi trips dataset. Sourced from the NYC Open Data portal.
    """
    month_to_fetch = "2023-03"
    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    with open(
        constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch), "wb"
    ) as output_file:
        output_file.write(raw_trips.content)
```
🚀 What’s Going On Here?
- `@dg.asset`: Classic Dagster decorator. This registers the function as a data asset Dagster can track and visualize.
- `def taxi_trips_file() -> None`: We’re defining a function that doesn’t return anything to Python; it writes data to disk. The `-> None` is a type hint, telling readers (and tools like linters or IDEs) “don’t expect a return value.”
- The triple quotes `""" ... """` form a docstring, a special comment Python attaches to functions. Dagster uses this as the description in the UI!
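For context, `TAXI_TRIPS_TEMPLATE_FILE_PATH` lives in the project’s `constants.py`. A sketch of the relevant line (the real Dagster University file defines more paths than this):

```python
# assets/constants.py (sketch, showing only the path used above)
import os

# "{}" gets filled with the month, e.g. data/raw/taxi_trips_2023-03.parquet
TAXI_TRIPS_TEMPLATE_FILE_PATH = os.path.join("data", "raw", "taxi_trips_{}.parquet")
```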
🧠 Mini (mini) Excursus: Type Annotations & Docstrings
- Type hints like `-> None` or `def foo(bar: str) -> int` make your code more readable and tool-friendly.
- Docstrings aren’t just for humans anymore: Dagster shows them right in the UI to explain what your asset is doing.
🧠 Mini Excursus: What’s With the Parquet Format?
Parquet is a columnar storage format, optimized for big data. Instead of storing data row-by-row like CSVs, it stores entire columns together. Why?
- It compresses better
- It’s faster to query specific fields
- It’s perfect for data warehouses and analytics engines like Spark, BigQuery, etc.
So yeah — it’s not just “some weird file ending in .parquet”. It’s what makes modern data workflows scale.
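To see the “faster to query specific fields” point in action, here’s a minimal pandas sketch (assuming `pandas` plus `pyarrow` are installed; the column names are my assumption about the yellow-taxi schema). Only the requested columns are read from disk:

```python
import pandas as pd

# Columnar format: reading two columns skips all the others entirely
df = pd.read_parquet(
    "data/raw/taxi_trips_2023-03.parquet",
    columns=["tpep_pickup_datetime", "fare_amount"],  # assumed column names
)
print(df.head())
```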
👀 Time to View & Materialize Your Asset
Now that we’ve defined our first asset, it’s time to see it in the Dagster UI and run it.
Step 1: Launch the UI
Start the dev server:
```bash
dagster dev
```
Then go to http://localhost:3000
Step 2: Check Out Your Asset
Click Assets in the top nav bar. If it’s empty, hit the Reload Definitions button; this re-indexes your project’s asset graph.
You should now see `taxi_trips_file`, along with:
- Its group and code location
- Its status (probably: “Never materialized”)
- The description (yep, that docstring shows up!)
Now click View Global Asset Lineage. It’s a bit empty now, but this is where your full DAG (directed acyclic graph) of assets will show up.
Step 3: Materialize It
Click the big Materialize button. This runs the function we defined earlier, fetching the file and writing it to disk.
If it worked, you’ll see a purple banner saying the run started.
📁 Check your project directory:
The file should now exist at `data/raw/taxi_trips_2023-03.parquet`
If it’s not there immediately — relax. It’s downloading a decent-sized file. Refresh after a minute or two.
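If you’d rather verify from code than keep refreshing your file browser, here’s a quick sanity check (run from the project root, where the `data/` folder lives):

```python
from pathlib import Path

path = Path("data/raw/taxi_trips_2023-03.parquet")
if path.exists():
    print(f"Downloaded: {path} ({path.stat().st_size / 1_000_000:.1f} MB)")
else:
    print("Not there yet; give the download another minute.")
```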
Step 4: View the Run Details
Click on the Materialized date next to the asset to open the Run Details page. There you’ll find:
- Run stats: ID, date, duration, and affected assets
- Run timeline: Which step ran when, and whether it succeeded
- Logs: Full output for debugging and nerding out
You’ll also find buttons for re-running or launching more assets once you build out the graph.
✅ That’s it — your first real asset, materialized and traceable.
🧠 Mini Excursus: Dagster vs Airflow
| Feature | Airflow | Dagster |
|---|---|---|
| Paradigm | Task-based | Asset-based |
| Data awareness | Low | High |
| UI & Visualization | Functional | Polished & interactive |
| Debugging | Logs per task | Full lineage & context |
| Suitable for | Traditional ETL/ELT | ML, analytics, modular data |
Not saying one is objectively better, just that Dagster seems way more flexible and better suited to all the stuff I have in mind.
🧯 Troubleshooting Failed Runs in Dagster
So far, materialization has been smooth sailing. But what if something breaks? How do you debug a failed run?
Let’s intentionally break something and walk through Dagster’s debugging workflow like pros.
🧪 Step 1: Cause a Failure on Purpose
In `assets/trips.py`, comment out the import for `constants`:
```python
import requests
# from . import constants  # ← intentionally broken
import dagster as dg
```
This will still let your function execute up until the moment it tries to use `constants.TAXI_TRIPS_TEMPLATE_FILE_PATH`, which now doesn’t exist.
Hit Materialize in the UI again.
💥 Expected result: The asset turns red in the graph — it failed.
🔍 Step 2: Investigate the Failed Run
Click the red-colored date next to the asset to open the Run Details page.
Here’s what changes when a run fails:
- Run status: Clearly marked as ❌ Failure
- Timeline view: Problem steps highlighted in red
- Errored tab: Lists the step(s) that failed
- Logs panel: Full trace of what went wrong
🧠 Mini Excursus: Logs > Vibes
Instead of guessing what broke, Dagster gives you structured logs.
Click on the `STEP_FAILURE` event in the log timeline. Then hit View full message; this pops up the full Python stacktrace.
Here you’ll see the `NameError: name 'constants' is not defined`, which makes the issue very clear.
This isn’t just “run failed” — it’s precise introspection, right down to the line.
🔁 Step 3: Fix and Re-execute
Now go back and uncomment the broken import:
```python
from . import constants  # fixed
```
Save the file, then return to the Run Details page in the UI. In the top-right corner, click Re-execute all.
Dagster re-runs the failed job from scratch. If all goes well…
✅ The asset is now materialized successfully again.
🧠 Mini Excursus: Re-execution is Powerful
Re-executing from the Run Details page lets you:
- Retry only failed steps
- Avoid rerunning expensive successful ones (if you configure that)
- Quickly iterate during development
This is the kind of developer experience that makes data engineering fun again.
📚 Resources So Far
- Dagster Docs
- Dagster University (it’s free!)
- YouTube: Dagster Crash Course
Stay tuned for the upcoming parts! Going forward, it’ll also be worth checking the GitLab repo regularly for the example projects.