Data Engineering With Dagster Part Eight: Metadata
🧠 Metadata in Dagster: The Two Worlds
Metadata is data about data - and while that sounds meta (and it is), it’s also the secret to making your pipelines understandable, traceable, and collaborative.
Imagine you’re running a bakery:
- The cookies = your data
- The label saying “baked today at 7:32am, by Sam” = your metadata
In Dagster, metadata powers both visibility and observability. There are two types you’ll work with:
- Definition metadata → fixed context like descriptions and groupings
- Materialization metadata → dynamic info like row counts, timestamps, or even rendered charts from each run
We’ll explore both, starting with how to better describe your data for your team - and future you.
✏️ Definition Metadata: Describe Your Assets
Dagster gives you two clean ways to document your assets:
Option 1: Python Docstrings
Add a triple-quoted string at the top of your asset function - Dagster will pick it up:
import dagster as dg

@dg.asset
def taxi_zones_file() -> None:
    """
    The raw CSV file for the taxi zones dataset. Sourced from the NYC Open Data portal.
    """
These show up in the Dagster UI - perfect for quick inline docs.
Option 2: The description= Parameter
You can also add a description directly in the decorator:
@dg.asset(
    description="The raw CSV file for the taxi zones dataset. Sourced from the NYC Open Data portal."
)
def taxi_zones_file() -> None:
    """This docstring won’t be shown in the UI."""
If both exist, Dagster shows the description= value in the UI and ignores the docstring. Handy if you want separate internal (code-only) and external (UI-facing) descriptions.
Where You See It
- In the Assets tab under the asset name
- In the Global Asset Lineage graph, when you hover over a node
Documentation becomes part of your code - and your UI.
🗂 Grouping Assets: Don’t Let It Get Messy
As your project grows, your assets multiply. It’s time to group them - not just visually, but functionally.
You can group assets:
- Individually, using the group_name= parameter
- By module, using load_assets_from_modules(..., group_name=...)
Both result in clearly labeled boxes in the Dagster UI - and more maintainable job selection.
Grouping Individual Assets
@dg.asset(group_name="raw_files")
def taxi_zones_file() -> None:
    ...
@dg.asset(group_name="ingested")
def taxi_trips() -> None:
    ...
Grouping Entire Modules
In definitions.py:
metric_assets = dg.load_assets_from_modules(
    modules=[metrics],
    group_name="metrics"
)
This is clean, scalable, and ideal for large projects.
💡 Mini Excursus: Practice Grouping
Try it like this:
# assets/trips.py
@dg.asset(group_name="raw_files")
def taxi_trips_file() -> None:
    ...
@dg.asset(group_name="ingested")
def taxi_trips() -> None:
    ...
# definitions.py
request_assets = dg.load_assets_from_modules(
    modules=[requests],
    group_name="requests"
)
Now, in the UI, assets are visually separated into raw_files, ingested, and requests.
No more asset soup. Just order.
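To see how the pieces fit together, here's a minimal sketch of a complete definitions.py wiring up the grouped modules. The module names and import path are assumptions based on the examples above, not a prescribed layout:
import dagster as dg
from .assets import trips, metrics, requests  # assumed project layout

# Assets in trips.py already carry their own group_name
trip_assets = dg.load_assets_from_modules(modules=[trips])
# These modules get their group assigned at load time
metric_assets = dg.load_assets_from_modules(modules=[metrics], group_name="metrics")
request_assets = dg.load_assets_from_modules(modules=[requests], group_name="requests")

defs = dg.Definitions(
    assets=[*trip_assets, *metric_assets, *request_assets],
)
Per-asset group_name wins for fine-grained control; module-level grouping keeps the decorators clean when a whole file belongs together.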
📦 Materialization Metadata: Add Context to Each Run
While definition metadata is about what the asset is, materialization metadata is about what happened during a run.
This might include:
- Number of records processed
- Size of output files
- Execution timestamps
- Even preview charts or markdown
Let’s implement a common one: row count.
Adding Row Count to taxi_trips_file
In assets/trips.py:
import requests
import pandas as pd
import dagster as dg

# monthly_partition and constants are project-local, defined in earlier parts of this series
@dg.asset(
    partitions_def=monthly_partition,
    group_name="raw_files",
)
def taxi_trips_file(context: dg.AssetExecutionContext) -> dg.MaterializeResult:
    """
    The raw parquet files for the taxi trips dataset. Sourced from the NYC Open Data portal.
    """
    # Partition keys look like "2023-03-01"; trim the day to get "2023-03"
    partition_date_str = context.partition_key
    month_to_fetch = partition_date_str[:-3]
    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )
    file_path = constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch)
    with open(file_path, "wb") as output_file:
        output_file.write(raw_trips.content)
    num_rows = len(pd.read_parquet(file_path))
    return dg.MaterializeResult(
        metadata={
            "Number of records": dg.MetadataValue.int(num_rows)
        }
    )
🧠 Mini Excursus: Do It for taxi_zones_file
@dg.asset(group_name="raw_files")
def taxi_zones_file() -> dg.MaterializeResult:
    raw_taxi_zones = requests.get(
        "https://community-engineering-artifacts.s3.us-west-2.amazonaws.com/dagster-university/data/taxi_zones.csv"
    )
    with open(constants.TAXI_ZONES_FILE_PATH, "wb") as output_file:
        output_file.write(raw_taxi_zones.content)
    num_rows = len(pd.read_csv(constants.TAXI_ZONES_FILE_PATH))
    return dg.MaterializeResult(
        metadata={
            "Number of records": dg.MetadataValue.int(num_rows)
        }
    )
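By the way, you don't need the UI to confirm the asset works. A quick in-process run with dg.materialize does the trick - a minimal sketch, assuming the asset is importable from assets/trips.py:
import dagster as dg
from assets.trips import taxi_zones_file  # assumed module path

# Materialize in-process; the run log will include the metadata entry
result = dg.materialize([taxi_zones_file])
assert result.success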
Viewing in the UI
Once materialized:
- Open Global Asset Lineage
- Select taxi_trips_file
- View row count per partition as a graph
Real-time dashboards. Zero setup.
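Row count is just one option - the same pattern covers the other examples from the list above, like file sizes and timestamps. Here's a hedged sketch; the asset name, file contents, and metadata keys are illustrative, not from the original project:
import os
from datetime import datetime, timezone

import dagster as dg

@dg.asset(group_name="raw_files")
def example_file() -> dg.MaterializeResult:
    file_path = "example.csv"  # hypothetical output path
    with open(file_path, "w") as output_file:
        output_file.write("zone,borough\nAstoria,Queens\n")

    return dg.MaterializeResult(
        metadata={
            "Number of records": dg.MetadataValue.int(1),
            "File size (bytes)": dg.MetadataValue.int(os.path.getsize(file_path)),
            "Written at": dg.MetadataValue.text(datetime.now(timezone.utc).isoformat()),
            "Source": dg.MetadataValue.url("https://opendata.cityofnewyork.us/"),
        }
    )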
🖼 Markdown Metadata: Inline Chart Previews
Some metadata is visual.
Let’s embed a generated chart (e.g. from adhoc_request) as a rendered Markdown image in the UI.
Here’s how:
import base64

# Inside the asset function, after the chart image has been saved to file_path:
with open(file_path, "rb") as file:
    image_data = file.read()

# Base64-encode the image and embed it in Markdown as a data URI
base64_data = base64.b64encode(image_data).decode('utf-8')
md_content = f"![Image](data:image/jpeg;base64,{base64_data})"

return dg.MaterializeResult(
    metadata={
        "preview": dg.MetadataValue.md(md_content)
    }
)
Dagster renders it directly in the UI:
It’s like adding screenshots to your pipeline logs - but built in.
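Put together inside an asset, a minimal sketch could look like this. The asset name, chart contents, and output path are assumptions for illustration - the real version would plot the adhoc_request results:
import base64

import dagster as dg
import matplotlib.pyplot as plt

@dg.asset(group_name="requests")
def adhoc_request_chart() -> dg.MaterializeResult:
    file_path = "adhoc_request.png"  # hypothetical output path

    # Stand-in chart; the real asset would plot the ad-hoc request data
    plt.bar(["Mon", "Tue", "Wed"], [120, 180, 150])
    plt.savefig(file_path)
    plt.close()

    with open(file_path, "rb") as file:
        image_data = file.read()

    # Base64-encode the image and embed it as a Markdown data URI
    base64_data = base64.b64encode(image_data).decode('utf-8')
    md_content = f"![Chart](data:image/png;base64,{base64_data})"

    return dg.MaterializeResult(
        metadata={"preview": dg.MetadataValue.md(md_content)}
    )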
🔍 Knowledge Check
Q: What’s the best way to organize assets in a code location?
A: Use asset groups - via decorator or module - to stay sane as your DAG grows.
Q: Is the number of rows in a CSV definition or materialization metadata?
A: Materialization. It changes per run, so it belongs with runtime results - not static docs.
✅ Final Thoughts
Dagster’s metadata isn’t a nice-to-have - it’s what turns a black-box DAG into an explainable, observable data system.
With just a few lines of code, you now have:
- Documented assets
- Grouped modules
- Annotated runs
- And inline visualizations
This kind of transparency pays off at scale - especially when you're debugging at 2 AM or onboarding a new teammate.