Power BI
Composite Models and Aggregations, How to Make a Billion Row Dataset Feel Instant
How composite models and aggregations make very large Power BI datasets feel interactive.
Power BI starts to creak when datasets push past a few hundred million rows. Import mode runs out of memory. DirectQuery struggles with complex DAX. Live connection ties you to whoever owns the source. The compromise that most large estates eventually settle on is the composite model with aggregations, which gives you most of the speed of Import mode and most of the freshness of DirectQuery without forcing a choice between them.
This article explains how composite models work, when they are the right answer, and how to set up aggregations so that nine out of ten queries hit the in-memory cache while the tenth falls through cleanly to the warehouse.
The Three Storage Modes
Every table in a Power BI model has a storage mode. There are three.
Import means the table is loaded into VertiPaq at refresh time. It is fast, compressed, and refreshes on a schedule. The trade-off is freshness and size. The data is stale between refreshes and the model has to fit in memory.
DirectQuery means the table stays in the source. Power BI generates SQL and sends it to the source on every query. Freshness is excellent. Performance depends entirely on the source. Some DAX patterns either run very slowly or refuse to translate at all.
Dual is a hybrid. The table is imported, but the engine can also treat it as DirectQuery when it joins to a DirectQuery fact table, so the join is pushed down to the source instead of crossing storage engines. This sounds confusing and it is, but it is essential for composite models.
A composite model uses different storage modes for different tables in the same dataset. A typical pattern is dimensions in Dual mode, the granular fact in DirectQuery, and pre-aggregated summaries in Import.
Why Composite Models Win at Scale
The honest reason composite models exist is that hardware will not save you on truly large datasets. A 5 billion row fact table imported into VertiPaq might compress to 50 GB but it still requires a Premium capacity that costs real money to run. DirectQuery against a well tuned warehouse handles the same data without the memory cost, but interactive performance suffers.
The composite model gets the best of both. The vast majority of queries hit small, pre aggregated tables in Import mode and return in milliseconds. The rare query that needs row level granularity falls through to DirectQuery against the warehouse, where it might take a few seconds, which is acceptable for the kind of investigation that produces it.
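To make the split concrete, the following is roughly the work the warehouse ends up doing when a granular query falls through. The SQL that Power BI actually generates under DirectQuery is considerably more verbose, but it is equivalent in effect to something like this; the table and column names anticipate the tutorial schema later in the article, and the customer ID is illustrative.

-- Conceptual shape of a fallthrough query: row-level detail that no
-- pre-aggregated table can answer, so it runs against the warehouse.
SELECT
    f.SaleDate,
    f.ProductID,
    f.Quantity,
    f.TotalAmount
FROM FactSales f
WHERE f.CustomerID = 987654
  AND f.SaleDate >= DATE '2024-01-01'
ORDER BY f.SaleDate;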
Architecture
flowchart LR
User[User opens report]
Engine[Power BI Engine]
AggMonth[Agg by Month, Import, 50K rows]
AggDay[Agg by Day, Import, 1.5M rows]
Fact[FactSales, DirectQuery, 5B rows]
Warehouse[(Azure Synapse / Fabric Warehouse)]
User --> Engine
Engine -->|simple query| AggMonth
Engine -->|drilled query| AggDay
Engine -->|granular query| Fact
Fact --> Warehouse

The engine decides automatically which table to hit. You configure the aggregations once, then forget them. Users do not see the difference, only the performance.
Tutorial, Building a Composite Model with Aggregations
Suppose you have a Fabric Lakehouse with a five billion row sales table. The grain is one row per line item, with date, customer, product, store, and amount columns.
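For reference, here is a minimal sketch of the schema the steps below assume. Column types are illustrative, and DimCustomer and DimDate follow the same pattern as the other two dimensions.

-- Assumed star schema for the tutorial (types illustrative)
CREATE TABLE FactSales (
    SaleDate    DATE,
    CustomerID  BIGINT,
    ProductID   BIGINT,
    StoreID     BIGINT,
    Quantity    INT,
    TotalAmount DECIMAL(18, 2)
);
CREATE TABLE DimProduct (ProductID BIGINT, Category STRING);
CREATE TABLE DimStore (StoreID BIGINT, Region STRING);
-- DimCustomer and DimDate are defined the same way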
Step 1, Set Up the Base Connection
In Power BI Desktop, choose Get Data, Microsoft Fabric, and connect to the Lakehouse using DirectQuery. Pull in the FactSales table and the four dimensions. Leave FactSales in DirectQuery and change the storage mode of the four dimensions to Dual, so they can serve both the imported aggregations and the DirectQuery fact.
Step 2, Build the Aggregation Tables
Two aggregations cover most of the workload. A monthly aggregation by product category and store region, and a daily aggregation by product and store. Build these in the Lakehouse using a notebook or a SQL script.
-- Monthly grain: one row per month, product category, and store region
CREATE TABLE AggMonthlyCategoryRegion AS
SELECT
    DATE_TRUNC('month', f.SaleDate) AS MonthStart,
    p.Category,
    s.Region,
    SUM(f.Quantity) AS Quantity,
    SUM(f.TotalAmount) AS TotalAmount,
    COUNT(*) AS LineCount
FROM FactSales f
JOIN DimProduct p ON f.ProductID = p.ProductID
JOIN DimStore s ON f.StoreID = s.StoreID
GROUP BY 1, 2, 3;
-- Daily grain: one row per day, product, and store
CREATE TABLE AggDailyProductStore AS
SELECT
    SaleDate,
    ProductID,
    StoreID,
    SUM(Quantity) AS Quantity,
    SUM(TotalAmount) AS TotalAmount
FROM FactSales
GROUP BY 1, 2, 3;
The first table reduces five billion rows to perhaps fifty thousand. The second reduces it to a few million. Both are tiny compared to the source.
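A quick sanity check before wiring the tables into the model is to compare row counts directly; if either aggregation is not dramatically smaller than the fact, the chosen grain is wrong.

-- Compare the size of each aggregation against the detail fact
SELECT 'FactSales' AS TableName, COUNT(*) AS RowCount FROM FactSales
UNION ALL
SELECT 'AggMonthlyCategoryRegion', COUNT(*) FROM AggMonthlyCategoryRegion
UNION ALL
SELECT 'AggDailyProductStore', COUNT(*) FROM AggDailyProductStore;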
Step 3, Import the Aggregations
In Power BI Desktop, add the two aggregation tables and set their storage mode to Import. Refresh.
Step 4, Configure the Aggregation Mappings
This is the step that does the magic. Right-click each aggregation table and choose Manage Aggregations. The dialogue lets you map columns in the aggregation to summarisations of columns in the detail fact.
For AggMonthlyCategoryRegion, the mappings look like this.
AggMonthlyCategoryRegion.MonthStart -> GroupBy(FactSales.SaleDate truncated to month via DimDate.MonthStart)
AggMonthlyCategoryRegion.Category -> GroupBy(DimProduct.Category)
AggMonthlyCategoryRegion.Region -> GroupBy(DimStore.Region)
AggMonthlyCategoryRegion.Quantity -> Sum(FactSales.Quantity)
AggMonthlyCategoryRegion.TotalAmount -> Sum(FactSales.TotalAmount)
AggMonthlyCategoryRegion.LineCount -> Count rows of FactSales
The engine now knows that any query asking for total quantity by category by region by month can be answered from this small Import table instead of hitting DirectQuery. The mappings for AggDailyProductStore follow the same pattern at finer grain.
Step 5, Set the Aggregation Precedence
When more than one aggregation could answer a query, the Precedence property controls the order in which the engine considers them, with higher values tried first. Give the smallest aggregation the highest precedence so that the cheapest table that can answer a query always wins.
A simple convention is to assign higher precedence to smaller aggregations. The monthly aggregation gets precedence 100 and the daily aggregation precedence 50. The detail FactSales serves as the implicit fallback when no aggregation matches.
Step 6, Verify the Hit Rate
Open Performance Analyzer and run a slate of typical queries, then copy the generated DAX into DAX Studio, where the Server Timings trace shows whether each query was answered from an aggregation or fell through to DirectQuery. Summarised, the results look like this.
Query 1: Total Sales by Year by Region
Aggregation match: AggMonthlyCategoryRegion (Import)
Duration: 87 ms
Query 2: Total Sales by Day by Product
Aggregation match: AggDailyProductStore (Import)
Duration: 220 ms
Query 3: Total Sales by Customer
No aggregation match
Storage: DirectQuery FactSales
Duration: 4.8 s
The first two queries are answered from memory. The third falls through to the warehouse, which takes longer but is still acceptable for an investigation. If query 3 turns out to be a frequent pattern, that is a signal to add a third aggregation grouped by customer.
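If it does, the new aggregation follows the same recipe as the first two. A minimal sketch in Lakehouse SQL, reusing the tutorial schema; the AggDailyCustomer name and its day by customer grain are illustrative and should be shaped by the actual slow queries.

-- A third aggregation at day and customer grain for frequent customer-level queries
CREATE TABLE AggDailyCustomer AS
SELECT
    SaleDate,
    CustomerID,
    SUM(Quantity) AS Quantity,
    SUM(TotalAmount) AS TotalAmount,
    COUNT(*) AS LineCount
FROM FactSales
GROUP BY 1, 2;

Once imported and mapped in Manage Aggregations, it needs its own precedence value relative to the existing two, based on its expected size.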
Choosing Aggregation Grains
The art of aggregations is choosing the right grains. Three principles guide the choice.
The first principle is to match the grain of the most frequent visuals. Look at usage metrics or query logs and identify the slicers and the time grains people actually use. If 70 percent of queries are at month and category level, the monthly aggregation pays for itself immediately.
The second principle is to keep the aggregation small enough to stay in memory comfortably. A useful rule of thumb is that the aggregation should be at least 100 times smaller than the detail fact; if the reduction is much less than that, the grain is too fine to be worth importing. Shrinking it far beyond that can fail in the other direction, because an over-aggregated table is too coarse to match the queries users actually run.
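Before building a new aggregation, the size it will land at can be checked cheaply by counting the distinct combinations of the candidate grain. A sketch against the tutorial's FactSales, with a day and customer grain as the illustrative candidate:

-- Estimate how many rows an aggregation at the candidate grain would hold
SELECT COUNT(*) AS EstimatedRows
FROM (
    SELECT DISTINCT SaleDate, CustomerID
    FROM FactSales
) candidate_grain;

If the result is within a couple of orders of magnitude of the fact's row count, the grain is too fine to be worth importing.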
The third principle is to layer aggregations rather than stacking them. A coarse aggregation at month level and a finer aggregation at day level can coexist, with the engine choosing the right one for each query. Trying to build one perfect aggregation that covers every case usually produces a table that is either too coarse for some queries or too large to compress well.
Operational Considerations
Aggregations need refresh management. They are derived data, so they must be rebuilt whenever the underlying fact changes. The cleanest approach is to run the aggregation queries in the warehouse on a schedule, then refresh the aggregation tables in Power BI immediately after.
In a Fabric environment, this can be modelled as a Data Pipeline that runs the source ETL, then the aggregation refresh, then the semantic model refresh. In a Synapse environment, the same orchestration lives in Azure Data Factory.
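The aggregation refresh step in that pipeline can be a single statement per table. A sketch in Lakehouse SQL, assuming the aggregation tables are Delta so that CREATE OR REPLACE swaps the contents atomically; the semantic model refresh is triggered only after these statements complete.

-- Rebuild each aggregation from the refreshed fact. With Delta tables,
-- CREATE OR REPLACE is atomic, so readers never see a half-built aggregation.
CREATE OR REPLACE TABLE AggDailyProductStore AS
SELECT
    SaleDate,
    ProductID,
    StoreID,
    SUM(Quantity) AS Quantity,
    SUM(TotalAmount) AS TotalAmount
FROM FactSales
GROUP BY 1, 2, 3;
-- Repeat for AggMonthlyCategoryRegion, then refresh the semantic model.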
Monitoring matters too. Track the aggregation hit rate over time. A drop in hit rate often indicates that report authors are building visuals at unusual grains, which signals either a need for a new aggregation or a conversation about whether those visuals belong in this dataset.
When Aggregations Are Not the Answer
Two scenarios make aggregations less useful.
The first is when the fact data is genuinely small enough to import. Below a few hundred million rows, plain Import mode usually wins. The complexity of composite modelling is not worth the savings.
The second is when the queries are dominated by row level filtering. If users frequently ask for "show me all transactions where the discount was greater than 30 percent for this specific customer last week", aggregations cannot help because the question requires row level data. In that case, focus on tuning the warehouse for fast DirectQuery rather than building aggregations that will not be hit.
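To see why, it helps to write that question down as a query. Every predicate operates on individual rows, so no pre-summed table can answer it; the DiscountPercent column and the literal customer ID are illustrative rather than part of the tutorial schema.

-- A row-level investigation that no aggregation can serve
SELECT f.SaleDate, f.ProductID, f.Quantity, f.TotalAmount, f.DiscountPercent
FROM FactSales f
WHERE f.CustomerID = 42                               -- one specific customer
  AND f.DiscountPercent > 0.30                        -- row-level filter
  AND f.SaleDate >= DATE_SUB(CURRENT_DATE(), 7)       -- last week only
ORDER BY f.SaleDate;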
A Word on Automatic Aggregations and Direct Lake
The pattern described above is what Microsoft documents as user-defined aggregations: you design the aggregation tables and map them yourself. Power BI Premium also supports automatic aggregations, where the service analyses query patterns and builds and maintains in-memory aggregations over a DirectQuery model without manual mapping. Fabric's Direct Lake mode takes a different route to the same goal.
Direct Lake is particularly interesting because it removes the storage mode question entirely. Tables stored in OneLake as Delta format Parquet files can be queried directly by VertiPaq without a scheduled import and without generating SQL against a source. The engine loads the Parquet data into memory on demand and caches it. For large datasets, this often performs as well as a hand-tuned composite model with much less work.
The trade off is that Direct Lake requires a Fabric capacity and the data must live in OneLake or a supported lakehouse. For organisations not yet on Fabric, hand built aggregations remain the right pattern.
The Main Lesson
Composite models with aggregations are one of the few advanced Power BI techniques that genuinely scale to enterprise-sized data. They take some setup, but the payoff is dramatic. A dataset that would have required a 400 GB capacity to import can run on a 100 GB capacity, with most queries answered in under a second and the small minority that need row level data taking a few seconds longer.
If your dataset has crossed the 500 million row mark and Import mode is starting to hurt, this pattern is the next step. Build it once, set up the monitoring, and let the engine make the choices.
References and Further Reading
| # | Source | Type | Link |
|---|---|---|---|
| 1 | Microsoft Learn, Composite models in Power BI | Free official documentation | https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-composite-models |
| 2 | Microsoft Learn, User defined aggregations | Free official documentation | https://learn.microsoft.com/en-us/power-bi/transform-model/aggregations-advanced |
| 3 | Microsoft Learn, DirectQuery in Power BI | Free official documentation | https://learn.microsoft.com/en-us/power-bi/connect-data/desktop-directquery-about |
| 4 | SQLBI, Aggregations in DAX | Free practitioner article | https://www.sqlbi.com/articles/aggregations-in-dax/ |
| 5 | Microsoft Learn, Storage modes in Power BI | Free official documentation | https://learn.microsoft.com/en-us/power-bi/transform-model/desktop-storage-mode |
| 6 | DAX Studio | Open source diagnostic tool | https://daxstudio.org/ |
| 7 | Microsoft Learn, Direct Lake mode in Microsoft Fabric | Free official documentation | https://learn.microsoft.com/en-us/fabric/get-started/direct-lake-overview |
| 8 | VertiPaq Analyzer | Open source GitHub project | https://github.com/sql-bi/VertiPaq-Analyzer |