members-only post

Data Engineering With Dagster Part Six – Partitioning & Backfills

Learn how to make your pipelines smarter by slicing them into manageable, date-based partitions and handling backfills like a pro.
Data Engineering With Dagster Part Six – Partitioning & Backfills

Partitioning: Baking Data One Batch at a Time

Let’s return to our running bakery analogy. Imagine you're getting cookie orders:

  • Monday: 3 boxes
  • Tuesday: 87 boxes
  • Wednesday: zero

Now, do you mix one massive bowl of dough every week, or prep daily batches based on demand?

Partitioning in Dagster is just that: breaking work into slices based on time or other meaningful dimensions.

A partition is a unit of data (or computation) scoped to a range - like a day, week, or month - that you can reason about, run, retry, and debug independently.

Why Partition?

Let’s say your asset processes NYC taxi data.

  • Without partitions? Every job run re-ingests and re-processes all months.
  • With partitions? You can fetch and materialize just March 2023, or even backfill all of 2022 with a single click.

Benefits

ConceptReal Benefit
ScalabilityBreak work into chunks for distributed execution
EfficiencySkip what’s already done
ResilienceRetry only failed pieces
DebuggabilityZero in on the broken slice
Storage optimizationAvoid redundant reads/writes

Defining a Partition in Dagster

In Dagster, you declare your partitions using a PartitionsDefinition.

You’ll typically store these in partitions.py. Here's how we define a monthly partition:

# partitions.py
import dagster as dg
from dagster_essentials.assets import constants

monthly_partition = dg.MonthlyPartitionsDefinition(
    start_date=constants.START_DATE,
    end_date=constants.END_DATE
)

This sets up a list of partition keys - each one representing a month like "2023-01-01""2023-02-01", and so on.

Think of it like telling Dagster: “My data comes in monthly crates. Here’s where to start, and here’s where to stop.”

Partitioning an Existing Asset

Let’s take the existing taxi_trips_file asset - which fetches and saves monthly NYC trip data - and make it partition-aware.

Original asset (simplified)

@dg.asset
def taxi_trips_file() -> None:
    month_to_fetch = '2023-03'
    ...

Hardcoded month = bad. Let’s fix it.

Step-by-step upgrade

  1. Use the monthly partition definition
    Add it to the asset via the partitions_def argument.
  2. Introduce the context object
    It gives you the partition key for the current materialization run.
  3. Use the partition key to derive the right month
    context.partition_key gives a string like "2023-03-01", which we trim to "2023-03".

Here’s the result:

from dagster import asset, AssetExecutionContext
from dagster_essentials.partitions import monthly_partition

@asset(partitions_def=monthly_partition)
def taxi_trips_file(context: AssetExecutionContext) -> None:
    """
    Fetches a month of NYC taxi data as a parquet file.
    """
    # Get the partition’s date (e.g. "2023-03-01") and slice to "2023-03"
    month_to_fetch = context.partition_key[:-3]

    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    with open(constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch), "wb") as f:
        f.write(raw_trips.content)

Mini Excursus: What Is context.partition_key?

Every partitioned run comes with a partition key - that is, what slice of data are we working on right now?

Examples:

Partitionpartition_key
March 2023"2023-03-01"
June 2024"2024-06-01"

You can use this key to:

  • Filter your SQL queries
  • Name output files
  • Control branching logic
  • Target specific months for backfill

And What’s a Backfill?

backfill is when you go back in time and re-run your asset for a range of partition keys.

When you'd backfill:

  • You launched the asset late and need to compute all 2023
  • You fixed a bug in logic and want to update past outputs
  • Some partitions failed and need to be retried

It’s like running git rebase for your data: reprocess old history with current logic.


Practice Exercise

Task: Partition the taxi_trips asset to load a month’s data from the parquet file into a DuckDB table, while tracking which partition each row came from.

Your goal:

  • Use the same monthly_partition
  • Add a partition_date column to your insert
  • Ensure re-runs don’t duplicate rows

Example Solution

This post is for subscribers only

Subscribe to continue reading