Niklas Heringer - Cybersecurity Blog

Data Engineering With Dagster – Part One: A Fresh Take on Orchestration

9 min read

Dagster Data-Engineering Python Orchestration Pipelines

Intro: From Payloads to Pipelines

Alright — this is a little different from XSS payloads and shell listeners. Welcome to my write-up on Dagster, the data orchestrator that’s part of my ongoing journey into data engineering. This series documents my learning process through Dagster University — and by the end of it, I’ll publish a hands-on example project that you (and I) can reuse for real-world setups.

Think of this as a living tutorial I’m building for myself as I get into real data workflows at work.

🎯 Why I’m Doing This

I’m not just from a security or pentesting background.

In my day job at BASF, I design and automate entire cyber risk workflows — building systems, developing machine learning tooling, and engineering secure, scalable platforms across international teams.

Whether it’s:

…I’ve learned that engineering and security are two sides of the same coin.

And in the context of data, orchestration is where those two worlds meet.

Dagster is my way of leveling up in:

So yeah — I’m here to break stuff and to build it back better.

Let’s get right into the vibe shift.


🧬 What Even Is Data Engineering?

“Data engineering is the practice of designing and building software for collecting, storing, and managing data.”

Okay, but what does that really mean?

It’s about making data clean, available, consistent, and ready. Whether that’s for:

…you can bet a data engineer had to fight the chaos first.

Big mess? Big data engineer energy.


🎼 Enter the Orchestrator

Before you get scared by the word “orchestrator”, here’s the problem it originally solved:

“I need to run a bunch of scripts, in a specific order, on a schedule.”

Boom — first-gen orchestrators like Airflow or Luigi step in. They’re task-centric: “Step A, then Step B, then Step C.”

But modern orchestrators (like Dagster) go way beyond that:

All that, without manually gluing together cron jobs and spreadsheets.
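The task-centric model boils down to something like this, reduced to a minimal Python sketch (not real Airflow or Luigi code):

```python
# Task-centric orchestration, stripped to its essence:
# hard-coded steps, run in a fixed order.
def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    # In a real pipeline this would write somewhere; here we just count rows.
    return len(rows)

# "Step A, then Step B, then Step C"
loaded = load(transform(extract()))
print(loaded)
```

Nothing here knows what data each step produced, which runs are stale, or what needs re-running after a failure; that gap is exactly what modern orchestrators fill.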


⭐ Why Dagster?

Dagster’s special sauce is its asset-centric mindset.

It asks: What are you creating?
Not: What step are you running?

And that, my friend, changes everything.

🍪 Asset-Centric vs Task-Centric

Let’s bake some cookies:

The focus shifts from steps to outputs — like building a dependency graph of meaningful results.

| Style | Focus | Flexibility | Reusability |
|---|---|---|---|
| Task-centric | Steps | Low | Poor |
| Asset-centric | Results | High | Excellent |

Dagster treats data artifacts as first-class citizens. That’s its killer feature.
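To make that shift concrete, here is a toy resolver in plain Python (this is not Dagster's API, just an illustration of the idea): each function names the asset it produces, and its parameters name the upstream assets it depends on.

```python
import inspect

def materialize_all(asset_fns, sources):
    """Build every asset as soon as all of its named dependencies exist."""
    built = dict(sources)
    pending = dict(asset_fns)
    while pending:
        for name, fn in list(pending.items()):
            deps = inspect.signature(fn).parameters
            if all(d in built for d in deps):
                built[name] = fn(**{d: built[d] for d in deps})
                del pending[name]
    return built

# Each function is an *asset*: its name is the output, its params are inputs.
def dough(flour, water):
    return f"dough({flour}+{water})"

def cookies(dough):
    return f"cookies<-{dough}"

assets = materialize_all({"dough": dough, "cookies": cookies},
                         sources={"flour": "flour", "water": "water"})
print(assets["cookies"])  # cookies<-dough(flour+water)
```

Notice that no step ordering is written anywhere; the execution order falls out of the dependency graph.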


⚙️ Quick Setup

Not gonna lie — their setup tutorial is pretty smooth. Nothing exotic, no dark magic required.

📚 👉 Check their official Getting Started guide here

I won’t duplicate their steps here — follow that to get your Python environment, Dagster CLI, and example repo up and running.


🧱 Core Concept: Assets

Assets are the bread and butter of Dagster. They’re persistent things in storage — not ephemeral steps in a script.

Think:

Each asset has:

🧪 Example: Bread Asset

import dagster as dg


@dg.asset
def simple_bread(flour, water, salt):
    # flour, water, and salt are upstream assets, resolved by name
    return flour + water + salt

This asset has three upstream dependencies — flour, water, and salt. When any of them change, Dagster knows simple_bread needs to be recomputed.

💡 Best practice: Use nouns for asset keys (daily_sales_report), not verbs (compute_report). You’re describing what exists, not what happens.


The Decorator Tells the Story

@dg.asset
def user_table():
    ...

Behind that little @dg.asset decorator, Dagster does some heavy lifting:

It’s not just syntactic sugar — it’s how Dagster understands your system.
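As an illustration only (this is not Dagster's internals, just the general decorator pattern), a decorator can register metadata about a function the moment it is defined:

```python
# A toy asset registry, hypothetical and far simpler than Dagster's real one.
registry = {}

def asset(fn):
    # Record the function's name, its parameter names (the "dependencies"),
    # and its docstring, then hand the function back unchanged.
    params = fn.__code__.co_varnames[: fn.__code__.co_argcount]
    registry[fn.__name__] = {"deps": list(params), "doc": fn.__doc__}
    return fn

@asset
def user_table():
    """All registered users."""
    ...

print(registry["user_table"])
```

By the time your module finishes importing, the registry already knows every asset, its dependencies, and its documentation, without a single one having run.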


Dagster University's little knowledge checks could be a bit more hands-on, like a small window where I define a recipe or write some actual code. If it doesn't challenge me, it isn't as helpful as it could be.


🧠 Mini Excursus: What Even Is an API?

An API (Application Programming Interface) is like a waiter in a restaurant. You (the client) don’t go into the kitchen — you just ask the waiter (API) for what you want, and they bring it to you from the backend.

In this case, the NYC OpenData API is the waiter. We ask it for a very specific file — a .parquet file with yellow cab trip data from March 2023 — and it hands it over. No need to scrape pages or build forms. Just clean HTTP.

APIs return structured data (like JSON, CSV, or in this case, Parquet), which makes them perfect for automation and orchestration.


🔧 Defining the First Real Asset: NYC Taxi Data

We’re now working inside the assets/trips.py file of the Dagster University project. Here’s the starter code:

import requests  # For accessing the NYC OpenData API
from . import constants  # Contains paths and config values, including the save path
import dagster as dg  # Dagster framework


@dg.asset
def taxi_trips_file() -> None:
    """
    The raw parquet file for the taxi trips dataset. Sourced from the NYC Open Data portal.
    """
    month_to_fetch = "2023-03"
    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    with open(
        constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch), "wb"
    ) as output_file:
        output_file.write(raw_trips.content)

🚀 What’s Going On Here?
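One detail worth unpacking is the templated save path. I haven't shown constants.py here, but based on where the file ends up on disk, the template presumably looks something like this (the value below is my assumption, not the project's actual code):

```python
# Hypothetical stand-in for constants.TAXI_TRIPS_TEMPLATE_FILE_PATH
TAXI_TRIPS_TEMPLATE_FILE_PATH = "data/raw/taxi_trips_{}.parquet"

# .format() fills the {} placeholder with the month we fetched
path = TAXI_TRIPS_TEMPLATE_FILE_PATH.format("2023-03")
print(path)  # data/raw/taxi_trips_2023-03.parquet
```

Keeping the template in a constants module means every asset that touches this file builds the path the same way.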

🧠 Mini (mini) Excursus: Type Annotations & Docstrings

  • Type hints like -> None or def foo(bar: str) -> int make your code more readable and tool-friendly.
  • Docstrings aren’t just for humans anymore — Dagster shows them right in the UI to explain what your asset is doing.
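A tiny, self-contained illustration (the function name is made up; the URL pattern is the one from the asset above):

```python
def month_url(month: str) -> str:
    """Build the download URL for one month of yellow-cab trip data.

    Dagster renders docstrings like this one directly in its UI.
    """
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"yellow_tripdata_{month}.parquet"
    )

print(month_url("2023-03"))
```

The annotations cost nothing at runtime, but your editor, type checker, and Dagster's UI all get smarter for free.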

🧠 Mini Excursus: What’s With the Parquet Format?

Parquet is a columnar storage format, optimized for big data. Instead of storing data row-by-row like CSVs, it stores entire columns together. Why?

So yeah — it’s not just “some weird file ending in .parquet”. It’s what makes modern data workflows scale.
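You can see the core idea with plain Python and no Parquet library at all; this sketches the two storage layouts, not the actual file format:

```python
# Row-oriented (CSV-style): each record is stored together.
rows = [
    {"trip_id": 1, "fare": 12.5, "passengers": 2},
    {"trip_id": 2, "fare": 7.0,  "passengers": 1},
    {"trip_id": 3, "fare": 21.3, "passengers": 4},
]

# Column-oriented (Parquet-style): each column is stored contiguously.
columns = {
    "trip_id":    [1, 2, 3],
    "fare":       [12.5, 7.0, 21.3],
    "passengers": [2, 1, 4],
}

# Reading just the fares: row storage must touch every record...
fares_from_rows = [r["fare"] for r in rows]
# ...while column storage grabs one contiguous block.
fares_from_columns = columns["fare"]

assert fares_from_rows == fares_from_columns == [12.5, 7.0, 21.3]
```

Contiguous columns also compress far better, since values of the same type sit next to each other. That is why analytical queries over Parquet can skip whole columns they never ask for.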


👀 Time to View & Materialize Your Asset

Now that we’ve defined our first asset, it’s time to see it in the Dagster UI and run it.

Step 1: Launch the UI

Start the dev server:

dagster dev

Then go to http://localhost:3000

Step 2: Check Out Your Asset

Click Assets in the top nav bar. If it’s empty, hit the Reload Definitions button — this re-indexes your project’s asset graph.

You should now see taxi_trips_file, along with:

Now click View Global Asset Lineage — it’s a bit empty now, but this is where your full DAG (directed acyclic graph) of assets will show up.


Step 3: Materialize It

Click the big Materialize button. This runs the function we defined earlier, fetching the file and writing it to disk.

If it worked, you’ll see a purple banner saying the run started.

📁 Check your project directory:
The file should now exist at data/raw/taxi_trips_2023-03.parquet

If it’s not there immediately — relax. It’s downloading a decent-sized file. Refresh after a minute or two.


Step 4: View the Run Details

Click the Materialized label on the asset. This opens the Run Details page, where you can see:

You’ll also find buttons for re-running or launching more assets once you build out the graph.

✅ That’s it — your first real asset, materialized and traceable.


🧠 Mini Excursus: Dagster vs Airflow

| Feature | Airflow | Dagster |
|---|---|---|
| Paradigm | Task-based | Asset-based |
| Data awareness | Low | High |
| UI & visualization | Functional | Polished & interactive |
| Debugging | Logs per task | Full lineage & context |
| Suitable for | Traditional ETL/ELT | ML, analytics, modular data |

Not saying one is better — just that Dagster seems far more flexible and better suited to everything I have in mind.


🧯 Troubleshooting Failed Runs in Dagster

So far, materialization has been smooth sailing. But what if something breaks? How do you debug a failed run?

Let’s intentionally break something and walk through Dagster’s debugging workflow like pros.


🧪 Step 1: Cause a Failure on Purpose

In assets/trips.py, comment out the import for constants:

import requests
# from . import constants  # ← intentionally broken
import dagster as dg

This will still let your function execute up until the moment it tries to use constants.TAXI_TRIPS_TEMPLATE_FILE_PATH — which now doesn’t exist.

Hit Materialize in the UI again.

💥 Expected result: The asset turns red in the graph — it failed.


🔍 Step 2: Investigate the Failed Run

Click the red-colored date next to the asset to open the Run Details page.

Here’s what changes when a run fails:


🧠 Mini Excursus: Logs > Vibes

Instead of guessing what broke, Dagster gives you structured logs.

Click on the STEP_FAILURE event in the log timeline. Then hit View full message — this pops up the full Python stacktrace.

Here you’ll see the NameError: name 'constants' is not defined, which makes the issue very clear.

This isn’t just “run failed” — it’s precise introspection, right down to the line.
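The underlying failure is plain Python, and you can reproduce the same NameError outside Dagster (a sketch only, using a stand-in for the real function):

```python
# 'constants' is never imported here, mimicking the commented-out import above.
def build_path(month):
    # NameError is raised only when the function runs, not when it's defined.
    return constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month)

try:
    build_path("2023-03")
except NameError as err:
    print(err)  # name 'constants' is not defined
```

This is also why the asset imports cleanly but dies mid-materialization: Python only resolves the name at call time.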


🔁 Step 3: Fix and Re-execute

Now go back and uncomment the broken import:

from . import constants  # fixed

Save the file, then return to the Run Details page in the UI. In the top-right corner, click Re-execute all.

Dagster re-runs the failed job from scratch. If all goes well…

✅ The asset is now materialized successfully again.


🧠 Mini Excursus: Re-execution is Powerful

Re-executing from the Run Details page lets you:

This is the kind of developer experience that makes data engineering fun again.


📚 Resources So Far

Stay tuned for the upcoming parts! Going forward, it'll also be worth checking the GitLab repo regularly for the example projects.