Data Engineering With Dagster Part Six – Partitioning & Backfills
Partitioning: Baking Data One Batch at a Time
Let’s return to our running bakery analogy. Imagine you're getting cookie orders:
- Monday: 3 boxes
- Tuesday: 87 boxes
- Wednesday: zero
Now, do you mix one massive bowl of dough every week, or prep daily batches based on demand?
Partitioning in Dagster is just that: breaking work into slices based on time or other meaningful dimensions.
A partition is a unit of data (or computation) scoped to a range - like a day, week, or month - that you can reason about, run, retry, and debug independently.
Why Partition?
Let’s say your asset processes NYC taxi data.
- Without partitions? Every job run re-ingests and re-processes all months.
- With partitions? You can fetch and materialize just March 2023, or even backfill all of 2022 with a single click.
Benefits
| Concept | Real Benefit |
|---|---|
| Scalability | Break work into chunks for distributed execution |
| Efficiency | Skip what’s already done |
| Resilience | Retry only failed pieces |
| Debuggability | Zero in on the broken slice |
| Storage optimization | Avoid redundant reads/writes |
Defining a Partition in Dagster
In Dagster, you declare your partitions using a PartitionsDefinition.
You’ll typically store these in partitions.py. Here's how we define a monthly partition:
```python
# partitions.py
import dagster as dg

from dagster_essentials.assets import constants

monthly_partition = dg.MonthlyPartitionsDefinition(
    start_date=constants.START_DATE,
    end_date=constants.END_DATE,
)
```
This sets up a list of partition keys - each one representing a month like "2023-01-01", "2023-02-01", and so on.
Think of it like telling Dagster: “My data comes in monthly crates. Here’s where to start, and here’s where to stop.”
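If you want to sanity-check what Dagster generated, you can list the keys directly. A quick sketch (`get_partition_keys()` is part of the `PartitionsDefinition` API; the printed keys assume a start date of 2023-01-01):

```python
from dagster_essentials.partitions import monthly_partition

# Each key is the first day of its month.
keys = monthly_partition.get_partition_keys()
print(keys[:3])  # e.g. ['2023-01-01', '2023-02-01', '2023-03-01']
```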
Partitioning an Existing Asset
Let’s take the existing taxi_trips_file asset - which fetches and saves monthly NYC trip data - and make it partition-aware.
Original asset (simplified)
```python
@dg.asset
def taxi_trips_file() -> None:
    month_to_fetch = '2023-03'
    ...
```
Hardcoded month = bad. Let’s fix it.
Step-by-step upgrade
- Use the monthly partition definition: add it to the asset via the `partitions_def` argument.
- Introduce the `context` object: it gives you the partition key for the current materialization run.
- Use the partition key to derive the right month: `context.partition_key` gives a string like `"2023-03-01"`, which we trim to `"2023-03"`.
Here’s the result:
```python
import requests

from dagster import asset, AssetExecutionContext
from dagster_essentials.assets import constants
from dagster_essentials.partitions import monthly_partition


@asset(partitions_def=monthly_partition)
def taxi_trips_file(context: AssetExecutionContext) -> None:
    """
    Fetches a month of NYC taxi data as a parquet file.
    """
    # Get the partition's date (e.g. "2023-03-01") and slice to "2023-03"
    month_to_fetch = context.partition_key[:-3]

    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    with open(constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch), "wb") as f:
        f.write(raw_trips.content)
```
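To sanity-check a single slice, you can materialize one partition in-process with Dagster's `materialize` helper (a sketch for local testing; the import path for the asset is an assumption and may differ in your project):

```python
import dagster as dg

# Hypothetical import path; adjust to wherever your asset lives.
from dagster_essentials.assets.trips import taxi_trips_file

# Materialize just the March 2023 partition in-process (handy for tests).
result = dg.materialize([taxi_trips_file], partition_key="2023-03-01")
assert result.success
```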
Mini Excursus: What Is context.partition_key?
Every partitioned run carries a partition key - the answer to "which slice of data are we working on right now?"
Examples:
| Partition | partition_key |
|---|---|
| March 2023 | "2023-03-01" |
| June 2024 | "2024-06-01" |
You can use this key to:
- Filter your SQL queries
- Name output files
- Control branching logic
- Target specific months for backfill
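For instance, here's a minimal sketch of the first two uses, with a made-up table name and output path:

```python
# In a real asset this comes from context.partition_key.
partition_key = "2023-03-01"

# 1. Filter a SQL query down to the current slice (hypothetical table).
query = f"SELECT * FROM trips WHERE partition_date = '{partition_key}'"

# 2. Name an output file after the slice (hypothetical path).
output_path = f"data/outputs/trips_{partition_key}.parquet"
```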
And What’s a Backfill?
A backfill is when you go back in time and re-run your asset for a range of partition keys.
When you'd backfill:
- You launched the asset late and need to compute all 2023
- You fixed a bug in logic and want to update past outputs
- Some partitions failed and need to be retried
It’s like running git rebase for your data: reprocess old history with current logic.
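In the Dagster UI, you launch a backfill from an asset's partitions view (or from a partitioned job) by selecting a range of keys and materializing them. If you want a job to target, here's a minimal sketch; the job name and asset selection are assumptions for this project:

```python
import dagster as dg

from dagster_essentials.partitions import monthly_partition

# A partitioned job grouping the taxi assets; a backfill launched against
# it runs the selected assets once per chosen partition key.
trip_update_job = dg.define_asset_job(
    name="trip_update_job",
    selection=["taxi_trips_file", "taxi_trips"],
    partitions_def=monthly_partition,
)
```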
Practice Exercise
Task: Partition the `taxi_trips` asset to load a month's data from the parquet file into a DuckDB table, while tracking which partition each row came from.

Your goal:
- Use the same `monthly_partition`
- Add a `partition_date` column to your insert
- Ensure re-runs don't duplicate rows
Example Solution
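One way to solve it, as a minimal sketch: reuse `monthly_partition`, tag each row with its partition key, and delete-then-insert so a re-run replaces its own rows instead of duplicating them. The database path constant `constants.DUCKDB_DATABASE` is an assumption; adjust to your project.

```python
import dagster as dg
import duckdb

from dagster_essentials.assets import constants
from dagster_essentials.partitions import monthly_partition


@dg.asset(deps=["taxi_trips_file"], partitions_def=monthly_partition)
def taxi_trips(context: dg.AssetExecutionContext) -> None:
    """Loads one month of taxi trips into DuckDB, tagged with its partition."""
    partition_date = context.partition_key          # e.g. "2023-03-01"
    month_to_fetch = partition_date[:-3]            # e.g. "2023-03"
    parquet_path = constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch)

    # Assumed constant pointing at the project's .duckdb file.
    conn = duckdb.connect(database=constants.DUCKDB_DATABASE)

    # Create the table from the parquet schema on first run (LIMIT 0 = no rows).
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS trips AS
        SELECT *, '{partition_date}' AS partition_date
        FROM read_parquet('{parquet_path}') LIMIT 0
    """)

    # Delete-then-insert keeps re-runs idempotent: a retried partition
    # replaces its own rows instead of appending duplicates.
    conn.execute(f"DELETE FROM trips WHERE partition_date = '{partition_date}'")
    conn.execute(f"""
        INSERT INTO trips
        SELECT *, '{partition_date}' AS partition_date
        FROM read_parquet('{parquet_path}')
    """)
```

Delete-then-insert is the simplest idempotency strategy here; an upsert or a partition-aware IO manager would work just as well.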