How Metric Requests are Converted to SQL
How Metric Requests are Converted to SQL
This document explains how DataJunction converts metric requests into SQL queries. Understanding this flow is essential for:
- Debugging generated SQL
- Optimizing metric definitions
- Contributing to the SQL generation code
For background on metric decomposition (the theory behind breaking metrics into pre-aggregatable components), see Metric Decomposition.
Overview Architecture
User Request (metrics + dimensions + filters)
|
v
+-------------------------------------------------------------+
| DECOMPOSITION PHASE |
| - Load metric nodes and their dependencies |
| - Extract components from base metrics |
| - Build parent_map for derived metric relationships |
| - Identify window metrics (LAG/LEAD/etc.) |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| MEASURES PHASE (measures.py) |
| - Group metrics by parent fact table -> MetricGroups |
| - Analyze aggregability (FULL, LIMITED, NONE) |
| - Build GrainGroupSQLs (pre-aggregated CTEs) |
| - Build window metric grain groups (coarser grain CTEs) |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| METRICS PHASE (metrics.py) |
| - Combine grain groups via FULL OUTER JOIN -> base_metrics |
| - Apply combiner expressions (e.g., SUM(a)/SUM(b)) |
| - Build window CTEs (reaggregate + apply LAG/LEAD) |
| - Build final SELECT with all metrics |
+-------------------------------------------------------------+
|
v
Final SQL Query
Phase 1: Decomposition
Input
A typical metric request includes:
- Metrics:
["v3.total_revenue", "v3.avg_order_value", "v3.wow_revenue_change"] - Dimensions:
["v3.date.month", "v3.product.category"] - Filters:
["v3.date.year = 2024"]
Process
1. Load nodes and build dependency graph:
v3.total_revenue (base) -> parent: v3.order_details (fact)
v3.order_count (base) -> parent: v3.order_details (fact)
v3.avg_order_value (derived) -> parents: [v3.total_revenue, v3.order_count]
v3.wow_revenue_change (window) -> parents: [v3.total_revenue]
2. Extract components from base metrics:
For v3.total_revenue = SELECT SUM(amount) FROM v3.order_details:
DecomposedMetricInfo(
metric_node: v3.total_revenue,
components: [
MetricComponent(
name="hash_abc123",
expression="amount",
aggregation="SUM",
rule=AggregationRule(type=FULL, merge="SUM")
)
],
combiner_ast: SUM(amount), # How to combine components
aggregability: FULL
)
3. Build parent_map:
parent_map = {
"v3.total_revenue": ["v3.order_details"], # fact parent
"v3.order_count": ["v3.order_details"], # fact parent
"v3.avg_order_value": ["v3.total_revenue", "v3.order_count"], # metric parents
"v3.wow_revenue_change": ["v3.total_revenue"], # metric parent (window)
}
Phase 2: Measures (Pre-Aggregation)
Step 2.1: Group by Parent Fact -> MetricGroups
Metrics are grouped by their underlying fact table:
MetricGroup(
parent_node: v3.order_details,
decomposed_metrics: [
DecomposedMetricInfo(v3.total_revenue, ...),
DecomposedMetricInfo(v3.order_count, ...),
]
)
If metrics come from different facts (e.g., v3.order_details and v3.page_views), they become separate MetricGroups.
Step 2.2: Analyze Grain Groups by Aggregability
Within each MetricGroup, components are split by aggregability:
| Aggregability | Description | Example |
|---|---|---|
| FULL | Can be summed at any grain | SUM(amount) -> merge with SUM |
| LIMITED | Requires specific grain | COUNT(DISTINCT customer_id) -> needs customer_id in grain |
| NONE | Cannot be pre-aggregated | RANK() OVER (...) -> pass through raw rows |
Step 2.3: Build GrainGroupSQL (Pre-Aggregated CTEs)
For each GrainGroup, generate a CTE that aggregates components to the requested grain:
-- GrainGroupSQL for v3.order_details (FULL aggregability)
-- Aggregates at user's requested grain: (month, category)
WITH order_details_0 AS (
SELECT
d.month,
p.category,
SUM(od.amount) AS hash_abc123, -- total_revenue component
COUNT(*) AS hash_def456 -- order_count component
FROM v3.order_details od
LEFT JOIN v3.date d ON od.date_id = d.date_id
LEFT JOIN v3.product p ON od.product_id = p.product_id
WHERE d.year = 2024
GROUP BY d.month, p.category
)
GrainGroupSQL structure:
GrainGroupSQL(
query: ast.Query, # The CTE SQL
columns: [ # Column metadata
ColumnMetadata(name="month", semantic_type="dimension"),
ColumnMetadata(name="category", semantic_type="dimension"),
ColumnMetadata(name="hash_abc123", semantic_type="metric_component"),
ColumnMetadata(name="hash_def456", semantic_type="metric_component"),
],
grain: ["month", "category"],
aggregability: FULL,
metrics: ["v3.total_revenue", "v3.order_count"],
parent_name: "v3.order_details",
component_aliases: {
"hash_abc123": "hash_abc123", # component_name -> SQL alias
"hash_def456": "hash_def456",
}
)
Step 2.4: Merge Compatible Grain Groups
If multiple grain groups from the same parent have compatible aggregabilities, they can be merged to reduce JOINs:
Before merge:
- GrainGroup(FULL, components=[SUM(amount)])
- GrainGroup(LIMITED, components=[COUNT(DISTINCT customer_id)])
After merge:
- GrainGroup(merged=True,
grain_columns=[customer_id], # Finest grain needed
components=[SUM(amount), customer_id]) # Raw values at finest grain
The merged CTE outputs at finest grain; reaggregation happens in the metrics phase.
Step 2.5: Build Window Metric Grain Groups
Window metrics (LAG/LEAD for WoW, MoM, etc.) need data at their ORDER BY grain, not necessarily the user’s requested grain.
Example: User requests daily data, but wow_revenue_change needs weekly data for LAG comparison.
# Extract window metric grains
window_metric_grains = {
"v3.wow_revenue_change": {"v3.date.week"}, # ORDER BY grain
"v3.mom_revenue_change": {"v3.date.month"}, # ORDER BY grain
}
Grouping by (grain, parent_fact):
Window metrics are grouped by both their ORDER BY grain AND their parent fact:
# Key: (frozenset of grain cols, parent_name)
grain_parent_to_metrics = {
({"v3.date.week"}, "v3.order_details"): ["v3.wow_revenue_change"],
({"v3.date.week"}, "v3.page_views"): ["v3.wow_page_views_change"],
}
This ensures window metrics from different facts don’t get incorrectly combined.
Cross-fact detection:
If a window metric references metrics from multiple facts:
v3.wow_conversion_rate_change = LAG(conversion_rate, 1) OVER (ORDER BY week)
# where conversion_rate = orders / page_views (cross-fact)
is_cross_fact_window = True # Uses base_metrics CTE as source
Phase 3: Metrics (Combination & Final SQL)
Step 3.1: Build base_metrics CTE
Combine all grain groups via FULL OUTER JOIN:
WITH
-- Grain group CTEs from measures phase
order_details_0 AS (...),
page_views_0 AS (...),
-- Combine grain groups
base_metrics AS (
SELECT
COALESCE(t0.month, t1.month) AS month,
COALESCE(t0.category, t1.category) AS category,
-- Apply combiner expressions to get final metrics
t0.hash_abc123 AS total_revenue, -- SUM(amount)
t0.hash_def456 AS order_count, -- COUNT(*)
t0.hash_abc123 / NULLIF(t0.hash_def456, 0) AS avg_order_value, -- derived
t1.hash_xyz789 AS page_views -- from different fact
FROM order_details_0 t0
FULL OUTER JOIN page_views_0 t1
ON t0.month = t1.month AND t0.category = t1.category
)
Combiner expressions:
For derived metrics, the combiner AST is applied:
# v3.avg_order_value combiner_ast:
# v3.total_revenue / NULLIF(v3.order_count, 0)
#
# Becomes:
# t0.hash_abc123 / NULLIF(t0.hash_def456, 0)
Step 3.2: Build Window CTEs (Reaggregation + Window Functions)
For each window grain group, we may need two CTEs:
CTE 1: Aggregation CTE (reaggregate to coarser grain)
Only needed when the user’s grain is finer than the window’s ORDER BY grain.
-- Reaggregate from base grain group to window grain (weekly)
order_details_week_agg AS (
SELECT
week,
category,
SUM(hash_abc123) AS total_revenue -- Reaggregate component
FROM order_details_0 -- Source: base grain group CTE
GROUP BY week, category
)
For cross-fact window metrics, the source is base_metrics:
-- Cross-fact: reaggregate from base_metrics (has combined metrics)
cross_fact_week_agg AS (
SELECT
week,
category,
SUM(total_revenue) AS total_revenue,
SUM(page_views) AS page_views
FROM base_metrics
GROUP BY week, category
)
CTE 2: Window CTE (apply LAG/LEAD)
-- Apply window function
order_details_week AS (
SELECT
week,
category,
total_revenue,
total_revenue - LAG(total_revenue, 1) OVER (
PARTITION BY category
ORDER BY week
) AS wow_revenue_change
FROM order_details_week_agg
)
Optimization: Skip intermediate CTEs when grain matches
When the user’s requested grain already matches the window’s ORDER BY grain, the window function can be applied directly to base_metrics:
-- No intermediate CTEs needed - apply LAG directly
SELECT
base_metrics.category,
base_metrics.week,
base_metrics.total_revenue - LAG(base_metrics.total_revenue, 1) OVER (
PARTITION BY base_metrics.category
ORDER BY base_metrics.week
) AS wow_revenue_change
FROM base_metrics
Step 3.3: Build Final SELECT
Join all CTEs together for the final result:
SELECT
COALESCE(bm.month, w0.week) AS month,
COALESCE(bm.category, w0.category) AS category,
bm.total_revenue,
bm.order_count,
bm.avg_order_value,
w0.wow_revenue_change
FROM base_metrics bm
FULL OUTER JOIN order_details_week w0
ON bm.category = w0.category
Complete Example: Multi-Metric Query
Request:
metrics = [
"v3.total_revenue", # Base metric (FULL)
"v3.unique_customers", # Base metric (LIMITED - COUNT DISTINCT)
"v3.avg_order_value", # Derived metric
"v3.wow_revenue_change", # Window metric (WoW)
]
dimensions = ["v3.date.day", "v3.product.category"]
Generated SQL:
WITH
-- ==============================================================
-- GRAIN GROUP CTEs (from measures phase)
-- ==============================================================
-- Merged grain group: FULL + LIMITED at finest grain (includes customer_id)
order_details_0 AS (
SELECT
d.day,
p.category,
od.customer_id, -- Grain column for COUNT DISTINCT
SUM(od.amount) AS hash_revenue, -- total_revenue component
COUNT(*) AS hash_orders -- order_count component
FROM v3.order_details od
LEFT JOIN v3.date d ON od.date_id = d.date_id
LEFT JOIN v3.product p ON od.product_id = p.product_id
GROUP BY d.day, p.category, od.customer_id
),
-- ==============================================================
-- BASE_METRICS CTE (combine + apply combiners)
-- ==============================================================
base_metrics AS (
SELECT
day,
category,
SUM(hash_revenue) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers, -- LIMITED reaggregation
SUM(hash_revenue) / NULLIF(SUM(hash_orders), 0) AS avg_order_value
FROM order_details_0
GROUP BY day, category
),
-- ==============================================================
-- WINDOW METRIC CTEs (reaggregate to coarser grain + apply window)
-- ==============================================================
-- Step 1: Reaggregate to weekly grain (from base grain group, not base_metrics)
order_details_week_agg AS (
SELECT
DATE_TRUNC('week', day) AS week, -- Coarser grain
category,
SUM(hash_revenue) AS total_revenue -- Reaggregate from components
FROM order_details_0
GROUP BY DATE_TRUNC('week', day), category
),
-- Step 2: Apply LAG window function
order_details_week AS (
SELECT
week,
category,
total_revenue,
total_revenue - LAG(total_revenue, 1) OVER (
PARTITION BY category
ORDER BY week
) AS wow_revenue_change
FROM order_details_week_agg
)
-- ==============================================================
-- FINAL SELECT
-- ==============================================================
SELECT
bm.day,
bm.category,
bm.total_revenue,
bm.unique_customers,
bm.avg_order_value,
w.wow_revenue_change
FROM base_metrics bm
LEFT JOIN order_details_week w
ON bm.category = w.category
AND DATE_TRUNC('week', bm.day) = w.week
Key Design Decisions
1. Why Grain Groups?
Grain groups enable:
- Pre-aggregation matching: If a materialized pre-agg exists at the right grain, use it instead of scanning source tables
- Efficient JOINs: Metrics from the same fact are computed together in one CTE
- Proper aggregability handling: LIMITED/NONE metrics get special treatment
2. Why Reaggregate from Grain Groups (not base_metrics)?
For single-fact window metrics, we reaggregate from the grain group CTE:
Grain Group CTE (has components)
-> Reaggregate components
-> Apply window function
This is more efficient than reaggregating from base_metrics because:
- Components (SUM, COUNT) can be correctly summed
- Derived metrics (ratios) CANNOT be summed
Example: avg_order_value = revenue / orders cannot be summed! You can’t add averages together.
3. Why Group Window Metrics by (grain, parent_fact)?
Without parent_fact grouping, window metrics with the same ORDER BY grain but different parent facts would be incorrectly combined:
WoW metrics from thumb_rating AND download_session
-> Same weekly grain
-> Incorrectly combined!
-> SUM(thumbs_up_rate) is WRONG (can't sum a rate)
With parent_fact grouping:
("week", "thumb_rating"): [wow_thumbs_up_rate] -> Separate CTE
("week", "download_session"): [wow_download_time] -> Separate CTE
4. When to Use base_metrics as Source?
Cross-fact window metrics must use base_metrics:
wow_conversion_rate = LAG(conversion_rate, 1) OVER (...)
# where conversion_rate = orders / page_views
# conversion_rate spans two facts, so:
# - Cannot reaggregate from single grain group
# - Must read from base_metrics which has the combined metric
The is_cross_fact_window flag on GrainGroupSQL controls this routing.
Metric Types Summary
| Metric Type | Example | SQL Generation |
|---|---|---|
| Base | SUM(amount) | Component in grain group CTE |
| Derived | revenue / orders | Combiner applied in base_metrics |
| Window | LAG(revenue, 1) | Separate CTE with window function |
| Window (cross-fact) | LAG(orders/page_views, 1) | Uses base_metrics as source |
Code References
The SQL generation logic lives in datajunction_server/construction/build_v3/:
| File | Responsibility |
|---|---|
builder.py | Entry point, orchestrates the phases |
decomposition.py | Extracts components from metrics |
measures.py | Builds grain group CTEs |
metrics.py | Combines grain groups, builds final SQL |
types.py | Data structures (GrainGroupSQL, DecomposedMetricInfo, etc.) |
cte.py | CTE building utilities |