Data Engineering With Dagster Part Six – Partitioning & Backfills
Partitioning: Baking Data One Batch at a Time
Let’s return to our running bakery analogy. Imagine you're getting cookie orders:
- Monday: 3 boxes
- Tuesday: 87 boxes
- Wednesday: zero
Now, do you mix one massive bowl of dough every week, or prep daily batches based on demand?
Partitioning in Dagster is just that: breaking work into slices based on time or other meaningful dimensions.
A partition is a unit of data (or computation) scoped to a range - like a day, week, or month - that you can reason about, run, retry, and debug independently.
Why Partition?
Let’s say your asset processes NYC taxi data.
- Without partitions? Every job run re-ingests and re-processes all months.
- With partitions? You can fetch and materialize just March 2023, or even backfill all of 2022 with a single click.
Benefits
| Concept | Real Benefit |
|---|---|
| Scalability | Break work into chunks for distributed execution |
| Efficiency | Skip what’s already done |
| Resilience | Retry only failed pieces |
| Debuggability | Zero in on the broken slice |
| Storage optimization | Avoid redundant reads/writes |
Defining a Partition in Dagster
In Dagster, you declare your partitions using a PartitionsDefinition.
You’ll typically store these in partitions.py. Here's how we define a monthly partition:
```python
# partitions.py
import dagster as dg

from dagster_essentials.assets import constants

monthly_partition = dg.MonthlyPartitionsDefinition(
    start_date=constants.START_DATE,
    end_date=constants.END_DATE,
)
```
This sets up a list of partition keys - each one representing a month like "2023-01-01", "2023-02-01", and so on.
Think of it like telling Dagster: “My data comes in monthly crates. Here’s where to start, and here’s where to stop.”
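If you want to sanity-check what Dagster generated, you can list the keys directly. A quick sketch (`get_partition_keys()` is part of the `PartitionsDefinition` API; the printed keys assume a start date of 2023-01-01):

```python
from dagster_essentials.partitions import monthly_partition

# Each key is the first day of its month.
keys = monthly_partition.get_partition_keys()
print(keys[:3])  # e.g. ['2023-01-01', '2023-02-01', '2023-03-01']
```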
Partitioning an Existing Asset
Let’s take the existing taxi_trips_file asset - which fetches and saves monthly NYC trip data - and make it partition-aware.
Original asset (simplified)
```python
@dg.asset
def taxi_trips_file() -> None:
    month_to_fetch = '2023-03'
    ...
```
Hardcoded month = bad. Let’s fix it.
Step-by-step upgrade
- Use the monthly partition definition: add it to the asset via the `partitions_def` argument.
- Introduce the `context` object: it gives you the partition key for the current materialization run.
- Use the partition key to derive the right month: `context.partition_key` gives a string like `"2023-03-01"`, which we trim to `"2023-03"`.
Here’s the result:
```python
import requests

from dagster import asset, AssetExecutionContext
from dagster_essentials.assets import constants
from dagster_essentials.partitions import monthly_partition


@asset(partitions_def=monthly_partition)
def taxi_trips_file(context: AssetExecutionContext) -> None:
    """
    Fetches a month of NYC taxi data as a parquet file.
    """
    # Get the partition's date (e.g. "2023-03-01") and slice to "2023-03"
    month_to_fetch = context.partition_key[:-3]

    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    with open(constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch), "wb") as f:
        f.write(raw_trips.content)
```
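To sanity-check a single slice, you can materialize one partition in-process with Dagster's `materialize` helper (a sketch for local testing; the import path for the asset is an assumption and may differ in your project):

```python
import dagster as dg

# Hypothetical import path; adjust to wherever your asset lives.
from dagster_essentials.assets.trips import taxi_trips_file

# Materialize just the March 2023 partition in-process (handy for tests).
result = dg.materialize([taxi_trips_file], partition_key="2023-03-01")
assert result.success
```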
Mini Excursus: What Is context.partition_key?
Every partitioned run carries a partition key - the answer to "which slice of data are we working on right now?"
Examples:
| Partition | partition_key |
|---|---|
| March 2023 | "2023-03-01" |
| June 2024 | "2024-06-01" |
You can use this key to:
- Filter your SQL queries
- Name output files
- Control branching logic
- Target specific months for backfill
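For instance, here's a minimal sketch of the first two uses, with a made-up table name and output path:

```python
# In a real asset this comes from context.partition_key.
partition_key = "2023-03-01"

# 1. Filter a SQL query down to the current slice (hypothetical table).
query = f"SELECT * FROM trips WHERE partition_date = '{partition_key}'"

# 2. Name an output file after the slice (hypothetical path).
output_path = f"data/outputs/trips_{partition_key}.parquet"
```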
And What’s a Backfill?
A backfill is when you go back in time and re-run your asset for a range of partition keys.
When you'd backfill:
- You launched the asset late and need to compute all 2023
- You fixed a bug in logic and want to update past outputs
- Some partitions failed and need to be retried
It’s like running git rebase for your data: reprocess old history with current logic.
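In the Dagster UI, you launch a backfill from an asset's partitions view (or from a partitioned job) by selecting a range of keys and materializing them. If you want a job to target, here's a minimal sketch; the job name and asset selection are assumptions for this project:

```python
import dagster as dg

from dagster_essentials.partitions import monthly_partition

# A partitioned job grouping the taxi assets; a backfill launched against
# it runs the selected assets once per chosen partition key.
trip_update_job = dg.define_asset_job(
    name="trip_update_job",
    selection=["taxi_trips_file", "taxi_trips"],
    partitions_def=monthly_partition,
)
```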
Practice Exercise
Task: Partition the `taxi_trips` asset to load a month's data from the parquet file into a DuckDB table, while tracking which partition each row came from.

Your goal:
- Use the same `monthly_partition`
- Add a `partition_date` column to your insert
- Ensure re-runs don't duplicate rows
Example Solution
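One way to solve it, as a minimal sketch: reuse `monthly_partition`, tag each row with its partition key, and delete-then-insert so a re-run replaces its own rows instead of duplicating them. The database path constant `constants.DUCKDB_DATABASE` is an assumption; adjust to your project.

```python
import dagster as dg
import duckdb

from dagster_essentials.assets import constants
from dagster_essentials.partitions import monthly_partition


@dg.asset(deps=["taxi_trips_file"], partitions_def=monthly_partition)
def taxi_trips(context: dg.AssetExecutionContext) -> None:
    """Loads one month of taxi trips into DuckDB, tagged with its partition."""
    partition_date = context.partition_key          # e.g. "2023-03-01"
    month_to_fetch = partition_date[:-3]            # e.g. "2023-03"
    parquet_path = constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch)

    # Assumed constant pointing at the project's .duckdb file.
    conn = duckdb.connect(database=constants.DUCKDB_DATABASE)

    # Create the table from the parquet schema on first run (LIMIT 0 = no rows).
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS trips AS
        SELECT *, '{partition_date}' AS partition_date
        FROM read_parquet('{parquet_path}') LIMIT 0
    """)

    # Delete-then-insert keeps re-runs idempotent: a retried partition
    # replaces its own rows instead of appending duplicates.
    conn.execute(f"DELETE FROM trips WHERE partition_date = '{partition_date}'")
    conn.execute(f"""
        INSERT INTO trips
        SELECT *, '{partition_date}' AS partition_date
        FROM read_parquet('{parquet_path}')
    """)
```

Delete-then-insert is the simplest idempotency strategy here; an upsert or a partition-aware IO manager would work just as well.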