Prometheus: unexport unavailable metrics #125492

agoode · 2024-09-08T05:27:23Z

This introduces a new option to the prometheus integration to automatically unexport metrics for unavailable entities.

Proposed change

Currently, when an entity becomes unavailable, this component continues to report the entity's last value until Home Assistant itself restarts or the entity returns. These stale metrics can be hard to notice, especially when the particular metric rarely changes (or changes slowly).

The entity_available metric is provided to let queries filter out unavailable metrics, but this is slow with current versions of Prometheus (see prometheus/prometheus#9577). Regardless of performance, including entity_available makes PromQL expressions more complex, and it is easy to forget.

Now this component will automatically withdraw metrics when the entity becomes unavailable, which matches the behavior on restart and makes it easier to see missing metrics without using an unless.
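
For context, the query-side filtering that this description wants to make unnecessary can be pictured as a label join. The sketch below is a toy Python model of what an `unless` against entity_available does, not PromQL itself and not code from this PR. The metric name sensor_temperature_celsius, the `entity` label, and the entity IDs are assumptions chosen for illustration.

```python
# Toy model of PromQL's `unless`: keep left-hand series whose labels do not
# match any series on the right-hand side. Names and values are illustrative.


def unless_on_entity(left, right):
    """Return series from `left` whose `entity` label is absent from `right`."""
    excluded = {labels["entity"] for labels, _ in right}
    return [(labels, value) for labels, value in left
            if labels["entity"] not in excluded]


# Scraped samples as (labels, value) pairs.
temperature = [
    ({"entity": "sensor.kitchen"}, 21.5),
    ({"entity": "sensor.garage"}, 3.0),  # stale: this sensor went offline
]
entity_available = [
    ({"entity": "sensor.kitchen"}, 1.0),
    ({"entity": "sensor.garage"}, 0.0),
]

# Roughly: sensor_temperature_celsius unless on(entity) (entity_available == 0)
unavailable = [(labels, value) for labels, value in entity_available
               if value == 0.0]
fresh = unless_on_entity(temperature, unavailable)
print(fresh)  # [({'entity': 'sensor.kitchen'}, 21.5)]
```

Every query that should ignore stale values has to repeat this join. Withdrawing the series at the exporter removes that need.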

Type of change

Dependency upgrade
Bugfix (non-breaking change which fixes an issue)
New integration (thank you!)
New feature (which adds functionality to an existing integration)
Deprecation (breaking change to happen in the future)
Breaking change (fix/feature causing existing functionality to break)
Code quality improvements to existing code or addition of tests

Additional information

This PR fixes or closes issue: fixes #
This PR is related to issue:
Link to documentation pull request: prometheus: Document new behavior of unavailable or unknown entities home-assistant.io#34632

Checklist

The code change is tested and works locally.
Local tests pass. Your PR cannot be merged unless tests pass
There is no commented out code in this PR.
I have followed the development checklist
I have followed the perfect PR recommendations
The code has been formatted using Ruff (ruff format homeassistant tests)
Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

Documentation added/updated for www.home-assistant.io

If the code communicates with devices, web services, or third-party tools:

The manifest file has all fields filled out correctly.
Updated and included derived files by running: python3 -m script.hassfest.
New or updated dependencies have been added to requirements_all.txt.
Updated by running python3 -m script.gen_requirements_all.
For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.

To help with the load of incoming pull requests:

I have reviewed two other open pull requests in this repository.

home-assistant · 2024-09-08T05:27:29Z

Hey there @knyar, mind taking a look at this pull request as it has been labeled with an integration (prometheus) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of prometheus can trigger bot actions by commenting:

@home-assistant close Closes the pull request.
@home-assistant rename Awesome new title Renames the pull request.
@home-assistant reopen Reopen the pull request.
@home-assistant unassign prometheus Removes the current integration label and assignees on the pull request, add the integration domain after the command.
@home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
@home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

knyar

This looks great, thank you.

Any thoughts on whether this should eventually be the default? Might be something best to decide & announce now, and change the default value from true to false a few releases down the line.

tests/components/prometheus/test_init.py

agoode · 2024-09-10T03:20:34Z

Yes, I think it would be good for this to become the new default. How would we announce this change?

knyar

> Yes, I think it would be good for this to become the new default. How would we announce this change?

I've looked at the developer site but have not been able to find a recommended process for changing the default values of configuration variables.

My suggestion would be to leave a comment mentioning future change of default in the code and in the docs (as part of home-assistant/home-assistant.io#34632). We'll remove the comment in the same PR that will flip the default value, and will mark it as a breaking change.

homeassistant/components/prometheus/__init__.py

agoode · 2024-09-11T00:26:34Z

Thanks, I've updated both PRs.

knyar

Thank you! Hopefully one of the Home Assistant maintainers will be able to review & merge this soon.

rcloran

Nice!

I think there's a good argument to be made that an interim config option isn't necessary. Anyone making queries against entities which could become unavailable has to handle both cases right now, in case the entity is ever unavailable at start, so I think that not emitting metrics for unavailable entities is the correct behaviour -- put another way, I can't think of a case where you'd want to emit a NaN.

That said, giving an opportunity for people to test this in "production" for a while with an easy path to revert is a good thing. I'm just not sure it outweighs the drawbacks :)

agoode · 2024-09-27T23:58:19Z

Thanks! Yes, let me take a look at removing the option entirely. That would definitely simplify things.

agoode · 2024-09-28T02:26:47Z

I've updated both PRs.

knyar · 2024-09-28T07:31:11Z

I don't have any objections to just doing this with no transition period, but maybe it reaches the level of a "Breaking change" now?

I can imagine use cases for which the current behavior is useful - for example, if you have a sensor or another device that is only intermittently available, having its last-known state reported to Prometheus might be more helpful than having metrics regularly disappear and reappear. In the new world, one would need to apply one of the *_over_time functions at query time to achieve something similar.
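
To illustrate the query-time workaround mentioned above, here is a rough Python model of what a function like `last_over_time` does: it looks back over a window and returns the newest sample, so a withdrawn series still yields its last-known value for a while. This is a sketch under assumed sample data, and it does not reproduce Prometheus's exact window-boundary semantics.

```python
def last_over_time(samples, now, window):
    """Return the value of the newest sample with now - window < t <= now.

    `samples` is a list of (timestamp_seconds, value) pairs. Returns None
    when no sample falls inside the lookback window.
    """
    in_window = [(t, v) for t, v in samples if now - window < t <= now]
    if not in_window:
        return None
    return max(in_window, key=lambda sample: sample[0])[1]


# Samples for an intermittently available sensor. Nothing is recorded after
# t=120 because the metric was withdrawn when the sensor went offline.
samples = [(0, 20.0), (60, 20.5), (120, 21.0)]

print(last_over_time(samples, now=300, window=600))  # 21.0
print(last_over_time(samples, now=900, window=600))  # None
```

The window length becomes an explicit choice at query time: a longer window bridges longer outages but also keeps reporting an old value for longer.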

agoode · 2024-09-28T12:32:59Z

I'm going to move this to draft because I want to think a little more about a couple things.

agoode · 2024-09-28T15:49:17Z

The problem with the current PR is that going unavailable will unexport ALL the metrics, but I think we want to keep most around, especially entity_available. It shouldn't be too bad to fix.

I remembered I have a dashboard that shows which entities are unavailable, using entity_available. I forgot we need it still!

rcloran · 2024-09-28T16:54:38Z

Sorry, I did not intend to create churn with my comment 🙈. I'm happy to see this proceed in either direction.

@knyar my point was that someone writing a "proper" query against an intermittent metric would have to handle missing metrics anyways, as those might occur during restart. I absolutely might be missing something here, though.

agoode · 2024-09-28T17:45:25Z

No problem! I just realized that the original PR had some unintended effects, based on a few misunderstandings I had.

agoode · 2024-10-06T03:11:55Z

Ok, I think it's good now. The state_change, entity_available, and last_updated_time_seconds metrics now stay around, which completely matches how it appears at startup and lets us continue to observe them.
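
The behaviour described here can be sketched with a small stand-in registry. This is a hypothetical model, not the integration's code: `ToyExporter`, `handle_state`, and the metric name temperature_celsius are invented for illustration. Treating "unknown" like "unavailable" and updating the three bookkeeping metrics on every state change are assumptions as well.

```python
# Metrics that stay exported for unavailable entities, per the comment above.
ALWAYS_EXPORTED = {"state_change", "entity_available", "last_updated_time_seconds"}
# Assumption: "unknown" is handled the same way as "unavailable".
UNAVAILABLE_STATES = {"unavailable", "unknown"}


class ToyExporter:
    """Hypothetical stand-in for a metrics registry."""

    def __init__(self):
        self.series = {}  # (metric_name, entity_id) -> value

    def handle_state(self, entity_id, state, value_metric, timestamp):
        # Bookkeeping metrics are updated on every state change.
        key = ("state_change", entity_id)
        self.series[key] = self.series.get(key, 0) + 1
        self.series[("last_updated_time_seconds", entity_id)] = timestamp

        available = state not in UNAVAILABLE_STATES
        self.series[("entity_available", entity_id)] = 1.0 if available else 0.0

        if available:
            self.series[(value_metric, entity_id)] = float(state)
        else:
            # Withdraw every series for this entity except the kept metrics.
            stale = [k for k in self.series
                     if k[1] == entity_id and k[0] not in ALWAYS_EXPORTED]
            for k in stale:
                del self.series[k]


exporter = ToyExporter()
exporter.handle_state("sensor.garage", "3.0", "temperature_celsius", 100.0)
exporter.handle_state("sensor.garage", "unavailable", "temperature_celsius", 160.0)
print(sorted(name for name, _ in exporter.series))
# ['entity_available', 'last_updated_time_seconds', 'state_change']
```

After the second call the value series is gone, while entity_available reports 0 and can still drive a dashboard of unavailable entities.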

When a sensor goes offline, this component previously continued to report the sensor's last value until Home Assistant itself restarted or the sensor returned. The `entity_available` metric can be used to filter out unavailable metrics, but this is slow with current versions of Prometheus (see prometheus/prometheus#9577). Now the component automatically withdraws metrics when the entity becomes unavailable, which matches the behavior on restart and makes it easier to see missing metrics without using an `unless`.

home-assistant bot added the cla-signed, has-tests, integration: prometheus, new-feature, small-pr, and Quality Scale: No score labels on Sep 8, 2024

agoode mentioned this pull request Sep 8, 2024

prometheus: Document new behavior of unavailable or unknown entities home-assistant/home-assistant.io#34632

agoode force-pushed the nan branch from be744a1 to 32fd06a on September 8, 2024 05:31

MartinHjelmare changed the title ~~Add option to unexport unavailable metrics~~ Add prometheus option to unexport unavailable metrics Sep 8, 2024

agoode force-pushed the nan branch from 32fd06a to a5ef645 on September 8, 2024 14:39

knyar approved these changes Sep 9, 2024

tests/components/prometheus/test_init.py (outdated)

agoode force-pushed the nan branch from a5ef645 to 15cc29e on September 10, 2024 03:15

knyar approved these changes Sep 10, 2024

homeassistant/components/prometheus/__init__.py (outdated)

agoode force-pushed the nan branch from 7aa9fd1 to 5c6a4ec on September 10, 2024 23:58

knyar approved these changes Sep 11, 2024

agoode force-pushed the nan branch 2 times, most recently from 95b5180 to 75dc1c4 on September 15, 2024 14:17

agoode force-pushed the nan branch from 75dc1c4 to cc821b2 on September 23, 2024 00:42

jzucker2 mentioned this pull request Sep 26, 2024

Refactor prometheus integration tests #113849

rcloran approved these changes Sep 27, 2024

agoode force-pushed the nan branch from cc821b2 to 6422db5 on September 28, 2024 01:59

agoode changed the title ~~Add prometheus option to unexport unavailable metrics~~ Prometheus: unexport unavailable metrics Sep 28, 2024

agoode marked this pull request as draft September 28, 2024 12:33

agoode force-pushed the nan branch from 6422db5 to 686efed on October 6, 2024 02:40

agoode marked this pull request as ready for review October 6, 2024 02:41

agoode force-pushed the nan branch from 686efed to e3ffc0b on October 6, 2024 03:23
