Green-Software-Foundation/carbon-aware-sdk

[Feature Contribution]: RAPL Node Measurement of Power Consumption

Closed this issue · 2 comments

What happened?

The Carbon Aware SDK currently consists mainly of an API that integrates with commercial services (WattTime and ElectricityMaps). These services help optimize compute by shifting it in time ("with the sun") and/or to data centers in locations that mainly source power from renewable sources (to the extent I've understood such services).

However, modern operating systems can expose power consumption metrics through RAPL (Running Average Power Limit) counters on supported hardware. While having an agent on each machine might cause additional computation effort (which can be reduced through less frequent readings), it would allow the Carbon Aware SDK to measure the energy consumed by the actual computation on such machines (identify/reduce/optimize resource-hungry computations) based on CPU/core, DRAM and GPU power RAPL metrics.
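As a rough illustration of what such an agent would read, here is a minimal sketch that assumes a Linux machine exposing RAPL through the powercap sysfs interface. The path `/sys/class/powercap/intel-rapl:0` is an assumption (it may be missing, numbered differently, or readable only with elevated privileges), and the helper names are hypothetical. Nothing here is part of the Carbon Aware SDK.

```python
from pathlib import Path

# Assumed powercap location for CPU package 0; adjust for the target machine.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_counter_uj(rapl_dir: Path = RAPL_DIR) -> int:
    """Return the cumulative energy counter in microjoules."""
    return int((rapl_dir / "energy_uj").read_text().strip())


def read_max_range_uj(rapl_dir: Path = RAPL_DIR) -> int:
    """Return the range after which the energy counter wraps around."""
    return int((rapl_dir / "max_energy_range_uj").read_text().strip())


def energy_delta_uj(before: int, after: int, max_range_uj: int) -> int:
    """Energy used between two readings, allowing for a single wraparound."""
    if after >= before:
        return after - before
    # The counter wrapped once between the two readings.
    return (max_range_uj - before) + after


def average_power_watts(delta_uj: int, seconds: float) -> float:
    """Convert an energy delta in microjoules to average power in watts."""
    return delta_uj / 1_000_000 / seconds
```

An agent would call `read_counter_uj()` at a fixed interval and pass consecutive readings to `energy_delta_uj()`. A longer interval lowers the overhead, but the interval has to stay shorter than the time the counter needs to wrap, otherwise wraparounds go unnoticed.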

To my knowledge it's not possible to accurately pin down power consumption per process. However, the above metrics would make it possible, for example, to baseline the energy consumption of resource-intensive tasks (like machine learning model training) and to measure increased or decreased power consumption resulting from software changes/optimizations.
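The baselining idea could look roughly like the sketch below. It assumes a caller-supplied `read_joules` function that returns a cumulative, machine-wide energy reading in joules. That function and both helper names are hypothetical. Because the reading covers the whole machine, anything else running during the task is included, so the result is an indication rather than an exact per-process figure.

```python
from typing import Callable


def measure_joules(task: Callable[[], object],
                   read_joules: Callable[[], float]) -> float:
    """Run task once and return the machine-wide energy used meanwhile."""
    start = read_joules()
    task()
    return read_joules() - start


def relative_change(baseline_j: float, candidate_j: float) -> float:
    """Fractional change versus the baseline (negative means less energy)."""
    if baseline_j <= 0:
        raise ValueError("baseline must be a positive amount of energy")
    return (candidate_j - baseline_j) / baseline_j
```

For example, if a training run used 50 J before a change and 40 J after it, `relative_change(50.0, 40.0)` reports a 20% reduction. Repeating each measurement several times on an otherwise idle machine would make the comparison more trustworthy.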

Such metrics can be easily collected/aggregated in a Prometheus server (here is an example: https://github.com/hubblo-org/windows-rapl-driver), and this can be done on-premises for clients, servers and similar. However, there are restrictions: some power metrics are only available for Intel CPUs or for GPUs based on the Volta architecture. But since there are multiple partners in the GSF, this could potentially be influenced/extended to a broader ecosystem.
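To make the Prometheus part more concrete, here is a minimal sketch that renders per-domain readings in the Prometheus text exposition format, which a scrape endpoint could then serve. The metric name `rapl_energy_joules_total`, the label names and the function name are assumptions made for illustration, and the sketch assumes label values contain no quotes, backslashes or newlines (it does no escaping).

```python
from typing import Mapping


def render_prometheus(readings_joules: Mapping[str, float], host: str) -> str:
    """Render cumulative energy readings as Prometheus exposition text."""
    name = "rapl_energy_joules_total"  # hypothetical metric name
    lines = [
        f"# HELP {name} Cumulative energy per RAPL domain in joules.",
        f"# TYPE {name} counter",
    ]
    # One sample line per RAPL domain, sorted for stable output.
    for domain, joules in sorted(readings_joules.items()):
        lines.append(f'{name}{{host="{host}",domain="{domain}"}} {joules}')
    return "\n".join(lines) + "\n"
```

Exposing the raw cumulative counter and letting Prometheus derive power with `rate()` keeps the agent simple and leaves the choice of averaging window to the person querying.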

Having an actual "Software Development Kit/SDK" for multiple languages could make it easier to have agentless, auto-instrumented data collection. Through application monitoring integrations like Kibana's NEST library and/or Application Insights, RAPL data could potentially be correlated with the code being executed (which database query costs the most energy; how did a change in a function relate to its power consumption; etc.). While such metrics might not be exact, they can provide developers and organizations with indications of the carbon footprint of their software/devices - based on hard facts. The Intel security advisory highlights that it may even be possible to reverse engineer secret data from RAPL power metrics - which highlights the power of this data, but also the need to restrict readings to longer collection intervals (https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html).
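One way such auto-instrumentation might look is a decorator that attributes energy readings to named functions. This is only a sketch: `energy_tracker` and the `read_joules` callable are hypothetical, the reading is machine-wide, and very short functions would mostly measure noise. It also does not address the advisory's concern, which would call for coarser aggregation than per-call readings.

```python
import functools
from collections import defaultdict
from typing import Callable


def energy_tracker(read_joules: Callable[[], float]):
    """Return a decorator plus the per-function totals it accumulates."""
    totals = defaultdict(float)

    def track(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = read_joules()
            try:
                return func(*args, **kwargs)
            finally:
                # Charge the machine-wide delta to this function's name,
                # even when the call raises.
                totals[func.__name__] += read_joules() - start
        return wrapper

    return track, totals
```

Usage would be `track, totals = energy_tracker(reader)` followed by `@track` on the functions of interest. The `totals` mapping could then be shipped to whichever monitoring backend is in use.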

Since there is no "Idea" issue type, this is submitted as a Feature Contribution. While I'm happy to contribute to it, I cannot fully commit to contributing the complete implementation by myself. However, I think it might be an important addition to the Carbon Aware SDK to make it more relevant for organizations.

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature Commitment

  • I commit to contributing this feature as a PR and working with the GSF to merge this feature into the Carbon Aware SDK.

This issue has not had any activity in 120 days. Please review this issue and ensure it is still relevant. If no more activity is detected on this issue for the next 20 days, it will be closed automatically.

This issue has not had any activity for too long. If you believe this issue has been closed in error, please contact an administrator to re-open it, or, if absolutely relevant and necessary, create a new issue referencing this one.