Behind the forecasts (Part 2): How we built an engine to predict the future of electricity grids

March 12, 2025 · 10 min read

This post is a follow-up to “Behind the forecasts (Part 1): Why we built an engine to predict the future of electricity grids”. We think it’s worth sharing some of the engineering principles the team followed during the development of our new forecasting engine. We hope this contribution to the domain of building large-scale, dynamic ML systems can help other engineers in their own efforts to deliver valuable ML ecosystems. Read the preceding post to understand the context under which we operated.

Building the engine

At Electricity Maps, we foresee a world where billions of grid-connected systems optimize when and where they consume electricity to reduce costs and carbon emissions. Providing accurate grid forecasts is one of our key contributions to enable this global shift. Under the hood of our forecasting engine, combinations of thousands of machine learning models are constantly interweaving learned parameters with extensive features to predict all the components of electricity grids worldwide.

Orchestrating the execution of thousands of interconnected models is no easy feat. We’d like to highlight here some of the key elements that, compounded, allowed our grid forecasts team to successfully deliver a complete global offering. 

“The human race built most nobly when limitations were the greatest”, wrote Frank Lloyd Wright [1]. In our case, constraints were numerous: we cannot assume that our features and targets are ever complete, nor that they are free of outliers; we must automate all operations within our models’ lifecycle to avoid relying on manual interventions; our models are all interconnected through flow tracing, creating a potential for intractable combinations of individual forecasts; lastly, we operate under ambitious time constraints, because with anything climate-related the opportunity costs are too great to bear. We therefore had to focus on equipping our engine with only the most essential and valuable components, and on releasing them to the world as soon as possible for immediate feedback. Under these conditions, the following principles guided us to deliver on our ambition.

Defining the right metrics and tracking them at scale

It is obvious that you cannot evaluate progress without a clear goalpost. Defining the right target, however, is not obvious, especially for a system with thousands of outputs. Luckily for us, we realized the value of clear performance metrics early on in our journey.

One of our customers supported our development of power production forecasts for renewables in the US with clear performance targets. They were set on obtaining solar and wind forecasts with at most a certain relative error.

Having this clear target gave us focus, and when focus is given to talented engineers, great things happen. In a few months, we delivered models with the expected performance and sought to define target metrics to guide us the rest of the way.

Electricity Maps' quality guarantees for Carbon Intensity and Renewable Share

Knowing our targets only solved one part of the puzzle, as we also needed to surface the required metrics from our systems. Our initial approach, building analytics queries on top of our operational database, was short-lived. It was plagued with performance issues and risked degrading the level of service we were offering (leaving open the possibility that a metric computation takes down the production database is really not a great idea). It also violated two key principles: separation of concerns and scalability. We quickly realized that we needed a more robust solution.


A scalable analytics setup for our platform

Thankfully, the BigQuery suite offers a solution that ticks most of the boxes.

Datastream provides an easy mechanism to sync all generated forecasts, and the targets we want to evaluate them against, into BigQuery as they get written to our operational database.

Dataform (now called Pipelines) provides a framework to define, document, and deploy the collection of complex data transformations required to compute our overall system metrics. It enabled us to bring software-engineering best practices, namely separation of concerns, version control, testing, and monitoring, to our analytics engine, which has been paramount in generating trust in the metrics we observe (a minimal sketch of such a transformation follows below).

Finally, metrics are exposed to the different teams at Electricity Maps through Looker Studio dashboards. Creating an intuitive interface for interacting with analytics is essential to enable other teams to build confidence in the quality of our forecasts, and to make sense of the offering we support. Win-win! Other teams at Electricity Maps have all the information at hand for commercial discussions or to reason about our product, and the grid forecasts team does not get distracted by questions and requests about the forecasting offering.
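To make this concrete, here is a minimal sketch (not our actual Dataform code) of the kind of aggregation the transformations described above perform. The analytics.forecasts and analytics.observations tables, their columns, and the 30-day window are hypothetical.

```python
# Minimal sketch: compute a normalized mean absolute error (nMAE) per zone and
# signal in BigQuery. Table and column names are hypothetical.
from google.cloud import bigquery

NMAE_PER_ZONE_SQL = """
SELECT
  zone,
  signal,
  AVG(ABS(f.predicted_value - o.observed_value))
    / NULLIF(AVG(ABS(o.observed_value)), 0) AS nmae
FROM `analytics.forecasts` AS f
JOIN `analytics.observations` AS o
  USING (zone, signal, target_datetime)
WHERE target_datetime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY zone, signal
"""

def compute_nmae(client: bigquery.Client) -> list[dict]:
    """Run the aggregation and return one row per (zone, signal) pair."""
    return [dict(row) for row in client.query(NMAE_PER_ZONE_SQL).result()]
```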

Automating ourselves out of the engine

Due to flow tracing, the outputs of individual models are intertwined to determine the future origin of electricity. As a direct consequence, changing the data preprocessor for the model predicting geothermal power production on the West Coast of the US could have an impact on the forecasted coal power consumption in the state of New York.
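As a toy illustration of this interdependence (not our actual flow-tracing implementation), the consumption mix of an importing zone blends its local production with the mix of the power it imports, so a change in one zone’s production forecast ripples into its neighbors’ consumption forecasts:

```python
# Toy illustration only: the consumption mix of an importing zone blends local
# production with the mix of the power it imports, so forecasts are intertwined.
def consumption_mix(
    local_production: dict[str, float],      # MW per production mode
    imported: dict[str, dict[str, float]],   # exporter -> MW per mode of traded power
) -> dict[str, float]:
    total = sum(local_production.values()) + sum(
        sum(flow.values()) for flow in imported.values()
    )
    mix: dict[str, float] = {}
    for breakdown in (local_production, *imported.values()):
        for mode, mw in breakdown.items():
            mix[mode] = mix.get(mode, 0.0) + mw / total
    return mix

# If the wind forecast behind the imported power changes, the importer's mix
# (and therefore its carbon intensity forecast) changes too.
print(consumption_mix({"gas": 500.0}, {"ZONE_B": {"wind": 300.0, "coal": 200.0}}))
# -> {'gas': 0.5, 'wind': 0.3, 'coal': 0.2}
```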

We furthermore want to guarantee our users access to a traceable release version of our forecasting engine. This means that we need to gather all the information describing which features were used, which preprocessor transformed them, which trainer was applied, and which model class generated the forecasts. This information is frozen under a release, and the only changes allowed within an environment are to the model parameters, which are updated through scheduled retraining.

Organizing all this information in a consistent and scalable system is also an enabler for removing manual operations. Through releases, we know that we can implement automated training, testing, and deployment of our models without fear of introducing infrastructure drift or of affecting our ability to explain to customers where our forecasts come from.
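As an illustration, here is a minimal sketch, with hypothetical names, of how a release can freeze everything that defines a model while leaving only the learned parameters to scheduled retraining:

```python
# Minimal sketch with hypothetical names: a release freezes the definition of a
# model; scheduled retraining only ever swaps the learned parameters.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRelease:
    """Immutable description of how a forecast model is built."""
    release_id: str
    feature_set: tuple[str, ...]   # which features are ingested
    preprocessor: str              # which preprocessor transforms them
    trainer: str                   # which trainer is applied
    model_class: str               # which model class generates the forecasts

@dataclass
class DeployedModel:
    """A frozen release plus the only thing retraining is allowed to change."""
    release: ModelRelease
    parameters_uri: str            # pointer to the latest trained parameters

    def retrain(self, new_parameters_uri: str) -> "DeployedModel":
        # Scheduled retraining updates parameters, never the release definition.
        return DeployedModel(release=self.release, parameters_uri=new_parameters_uri)
```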

Our nightly environment is generally not exposed to end users and allows us to experiment with new versions of our engine quickly. If, for example, we want to try out a new model class, we can release it to nightly and evaluate its performance in a production-like setting. We do not enforce completeness or quality guarantees there, so that we can test risky bets.

The nightly environment is continuously updated with the latest trained models that pass our deployment tests. We check whether an inference run generates complete forecasts, compare the model’s performance against the previous version, measure how much the features have changed, and so on. A newly trained model is pushed to our nightly environment only if it passes all tests.
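A minimal sketch of that gating logic, assuming hypothetical model objects, helper methods (run_inference, is_complete, error, feature_drift), and an arbitrary drift threshold:

```python
# Minimal sketch of a deployment gate; the model objects and their methods are
# hypothetical, as is the drift threshold.
MAX_FEATURE_DRIFT = 0.2

def passes_deployment_tests(candidate, previous, evaluation_data) -> bool:
    forecasts = candidate.run_inference(evaluation_data.features)
    if not forecasts.is_complete():
        # Every expected zone / signal / horizon must be present.
        return False
    if candidate.error(evaluation_data) > previous.error(evaluation_data):
        # The new model must not perform worse than the currently deployed one.
        return False
    if candidate.feature_drift(previous) > MAX_FEATURE_DRIFT:
        # The features should not have changed too much between the two models.
        return False
    return True
```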

Our latest environment serves the most up-to-date production-ready version of our forecasts. It is expected to be the most accurate, as it always includes the latest verified configuration of model components, including the newest developments we’ve brought to the engine.

Our support environment holds a backup release of our forecasts, which matches the previous version of the latest environment. It allows users to access a stable version in case changes deployed to latest do not fulfill their expectations.


Electricity Maps' forecasting releases

When we want to trigger a major model release, a dedicated service will promote all model configurations from nightly to latest and from latest to support within a version-controlled system.

On a schedule, that service also retrains the model configurations within all environments to ensure the freshness of model parameters. We call that a minor release.
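A minimal sketch of both release types, assuming a hypothetical version-controlled configuration registry with copy, retrain_all, and commit operations:

```python
# Minimal sketch; the registry object and its methods are hypothetical.
def major_release(registry) -> None:
    # Keep the previous latest configuration around before overwriting it,
    # then promote nightly into latest.
    registry.copy(src="latest", dst="support")
    registry.copy(src="nightly", dst="latest")
    registry.commit("major release: promote nightly -> latest, latest -> support")

def minor_release(registry) -> None:
    # Scheduled retraining: refresh model parameters in every environment
    # without touching the frozen release configurations.
    for environment in ("nightly", "latest", "support"):
        registry.retrain_all(environment)
    registry.commit("minor release: scheduled retraining")
```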

Monitoring thousands of models at once

At Electricity Maps, we’ve historically been using Grafana as our main tool for observability. While it’s a great tool to monitor the health of APIs or low-level systems (among others), our initial attempts to use it to monitor our new forecasting engine directly were not successful.

With our naive approach, the cardinality of our metrics exploded and soon exceeded the maximum threshold allowed by Grafana. We further realized that the way Grafana samples metrics makes it unsuitable for long-term monitoring. We needed a reliable system to expose the quality of thousands of forecast models, across three different environments, averaged over different rolling windows, targeting different horizon groups, over multiple months, and Grafana just wasn’t it.

Further investigation taught us that by focusing on the metrics strictly useful for alerting, i.e. those related to completeness, and by labeling those metrics properly, we could stay within the limits allowed by Grafana. Sampling was not a problem for alerting purposes either, as we did not need our dashboards to manipulate completeness time series at length. This is how we came to use Grafana as the core component of our monitoring system.


A robust monitoring setup for our platform

The rest of the system relies on an internal configuration, backed by BigQuery, to inform Grafana about the possible labels to display in our dashboards. Completeness metrics are extracted from our operational database and pushed to a Prometheus instance, which Grafana then queries.
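For illustration, here is a minimal sketch of how such a completeness metric could be pushed, assuming the prometheus_client library and a Pushgateway in front of Prometheus; the gateway address, metric name, and label values are hypothetical.

```python
# Minimal sketch: push a completeness gauge with a deliberately small label set,
# so metric cardinality stays within the limits of the alerting stack.
# The Pushgateway address, metric name, and label values are hypothetical.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
completeness = Gauge(
    "forecast_completeness_ratio",
    "Share of expected forecast points that were actually generated",
    labelnames=["environment", "signal"],
    registry=registry,
)

def publish_completeness(environment: str, signal: str, generated: int, expected: int) -> None:
    ratio = generated / expected if expected else 0.0
    completeness.labels(environment=environment, signal=signal).set(ratio)
    push_to_gateway("pushgateway:9091", job="forecast_completeness", registry=registry)

publish_completeness("latest", "solar", generated=1980, expected=2000)
```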

Such a system ensures that all desired metrics are collected in a timely manner and that appropriate alerts are raised in due time when our models are not performing as expected. The migration to the new forecasting engine did not happen without a few mishaps: our team had to quickly address a broken inference pipeline, problems with our flow tracing for forecasts, and forecast generation issues.

The early implementation of this setup allowed us to accelerate our delivery speed, as it freed us from being concerned about unknowingly altering our offering. It also gave us the breathing room to turn every incident into an opportunity to improve our system.


Alerting dashboard for our platform

Seeking a single, general-purpose model

The dimensionality of the predictions we make is relatively large: we have to predict about 20 signals, across more than 200 zones, for horizons of more than 72 hours. This forces us to avoid hand-crafting models for a particular zone, signal, or horizon, as doing so would hinder our scalability.

Instead, we prefer to iterate on a single general-purpose model that can cope with the varying degrees of availability and robustness of the features it ingests while being robust to many error sources, including those we don’t yet know about.

Simplified representation of how our general purpose model can be utilized in various prediction tasks

Depending on the type of forecasts we want to generate, different sets of features will be most relevant. For example, features describing weather patterns are essential to forecast solar power production, while features engineered to provide useful information about the expected future make-up of the power grid are relevant to forecast net flows between regions.

These features can further be pre-processed in a multitude of ways. Choosing whether to standardize them or to impute missing values can have a significant impact on the behavior of the predictions.

Choices related to how the model is trained can also significantly impact its ability to learn. For a tree-based model, for example, the depth of the trees will affect its ability to generalize.

All these choices have the potential to make our model configurations intractable. Imagine we were to start using a different feature set, preprocessor, and trainer for each signal in each zone: it would become impossible to quickly roll out model improvements or to attribute overall performance gains to specific changes we made to the engine.

By putting clean interfaces between all these components, we can further ensure that the engine is future-proof. Whenever an ML engineer implements a better-performing preprocessor, they can safely and smoothly deploy it while maintaining compatibility with the rest of the system, bringing improvements to the thousands of models running at once.
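To illustrate, here is a minimal sketch of such interfacing, with hypothetical component names; any preprocessor that satisfies the protocol can be swapped in without touching the rest of the pipeline.

```python
# Minimal sketch with hypothetical component names: the model only ever sees its
# features through the Preprocessor interface.
from typing import Protocol
import numpy as np

class Preprocessor(Protocol):
    def transform(self, features: np.ndarray) -> np.ndarray: ...

class MedianImputer:
    """Replace missing feature values with the per-column median."""
    def transform(self, features: np.ndarray) -> np.ndarray:
        medians = np.nanmedian(features, axis=0)
        return np.where(np.isnan(features), medians, features)

class Standardizer:
    """Center and scale each feature column."""
    def transform(self, features: np.ndarray) -> np.ndarray:
        mean = np.nanmean(features, axis=0)
        std = np.nanstd(features, axis=0)
        return (features - mean) / np.where(std == 0, 1.0, std)

def run_inference(features: np.ndarray, preprocessor: Preprocessor, model) -> np.ndarray:
    # Swapping in a better-performing preprocessor rolls the improvement out to
    # every model that uses this pipeline, with no other changes required.
    return model.predict(preprocessor.transform(features))
```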

Cracking the case - Lithuania

The Lithuanian grid has been a testing ground for our new forecasting engine’s ability to respect the desired quality guarantees.

Variability in Carbon Intensity and Power Mix in Lithuania across different days

Lithuania’s power grid exhibits complex dynamics. It has interconnections with neighbors whose grid mixes are vastly different. Power imported from the South of Sweden will typically be low-carbon (40 gCO2eq/kWh on average in 2024 [2]), while power imported from Poland will be much more carbon-intensive (704 gCO2eq/kWh on average in 2024 [2]). In addition, the carbon intensity of electricity imported from Latvia often varies by a factor of two within a single day.

Furthermore, Lithuania has installed a significant amount of wind and solar power (21% and 9% of total production respectively in 2024 [2]), meaning that grid dynamics heavily depend on local weather conditions. When renewable power production is insufficient, gas power plants are activated, together with imports from neighbors, to meet demand.

All these factors make the share of renewable power, as well as the carbon intensity of the Lithuanian mix, fluctuate significantly. Predicting them accurately therefore requires a forecasting platform that can correctly account for the interplay of all these signals. A platform unable to forecast carbon-intensive imports from Poland, or the arrival of a windless stretch of days, would be wildly inaccurate.

Evolution of the Carbon Intensity nMAE metrics in Lithuania in January and February 2025

Ultimately, compound improvements to our platform have continuously translated into more accurate forecasts for the carbon intensity of the Lithuanian grid.

Without specifically trying to improve Lithuanian forecasts, we saw that the roll-out of a better-performing general-purpose model led to a first drop in the forecasting error. The complete automation of our release process and the reduction of our models’ staleness later reduced the error further. Finally, introducing better preprocessing of our features allowed us to better leverage informative features with low availability, which turned out to be particularly useful in predicting interconnections. That was eventually decisive in bringing the error below our quality threshold.

Stay tuned for the next updates to Electricity Maps’ forecasting engine! If you want to help us decarbonize electricity grids by empowering smarter consumption patterns, consider joining our team.

References

[1] Wright, F.L. and Pfeiffer, B.B. (2000) Collected Writings of Frank Lloyd Wright: 1931–39, vol. 3. New York, NY: Rizzoli International Publications.

[2] Electricity Grid Carbon Emissions (2024) Electricity Maps. Available at: https://app.electricitymaps.com/ (Accessed: 14 February 2025)

Annex

  • The mean absolute error (MAE) is defined as $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i\rvert$, where $y_i$ are the observed values and $\hat{y}_i$ the forecasted values.
  • The normalized mean absolute error (nMAE) is defined as $\mathrm{nMAE} = \mathrm{MAE} \, / \, \frac{1}{N}\sum_{i=1}^{N}\lvert y_i\rvert$, i.e. the MAE normalized by the mean magnitude of the observed values.

Article written by
Pierre Segonne
Tech Lead @ Electricity Maps
