Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus: unexport unavailable metrics #125492

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

agoode
Copy link
Contributor

@agoode agoode commented Sep 8, 2024

This introduces a new option to the prometheus integration to automatically unexport metrics for unavailable entities.

Proposed change

When an entity becomes unavailable, this component will continue to report the entity's last value, until Home Assistant itself restarts, or the entity returns. These stale metrics can be hard to notice, especially when the particular metric rarely changes (or changes slowly).

The entity_available metric is provided to let queries filter out unavailable metrics, but this is slow with current versions of prometheus (see prometheus/prometheus#9577). And regardless of performance issues, including entity_available increases the complexity of promql expressions and is easy to forget.

Now this component will automatically withdraw metrics when the entity becomes unavailable, which matches the behavior on restart and makes it easier to see missing metrics without using an unless.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Ruff (ruff format homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.

To help with the load of incoming pull requests:

@home-assistant
Copy link

home-assistant bot commented Sep 8, 2024

Hey there @knyar, mind taking a look at this pull request as it has been labeled with an integration (prometheus) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of prometheus can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign prometheus Removes the current integration label and assignees on the pull request, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

@MartinHjelmare MartinHjelmare changed the title Add option to unexport unavailable metrics Add prometheus option to unexport unavailable metrics Sep 8, 2024
Copy link
Contributor

@knyar knyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you.

Any thoughts on whether this should eventually be the default? Might be something best to decide & announce now, and change the default value from true to false a few releases down the line.

tests/components/prometheus/test_init.py Outdated Show resolved Hide resolved
@agoode
Copy link
Contributor Author

agoode commented Sep 10, 2024

Yes, I think it would be good for this to become the new default. How would we announce this change?

Copy link
Contributor

@knyar knyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be good for this to become the new default. How would we announce this change?

I've looked at the developer site but have not been able to find a recommended process for changing the default values of configuration variables.

My suggestion would be to leave a comment mentioning future change of default in the code and in the docs (as part of home-assistant/home-assistant.io#34632). We'll remove the comment in the same PR that will flip the default value, and will mark it as a breaking change.

homeassistant/components/prometheus/__init__.py Outdated Show resolved Hide resolved
@agoode
Copy link
Contributor Author

agoode commented Sep 11, 2024

Thanks, I've updated both PRs.

Copy link
Contributor

@knyar knyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Hopefully one of the Home Assistant maintainers will be able to review & merge this soon.

Copy link
Contributor

@rcloran rcloran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I think there's a good argument to be made that an interim config option isn't necessary. Anyone making queries against entities which could become unavailable has to handle both cases right now, in case the entity is ever unavailable at start, so I think that not emitting metrics for unavailable entities is the correct behaviour -- put another way, I can't think of a case where you'd want to emit a NaN.

That said, giving an opportunity for people to test this in "production" for a while with an easy path to revert is a good thing. I'm just not sure it outweighs the drawbacks :)

@agoode
Copy link
Contributor Author

agoode commented Sep 27, 2024

Thanks! Yes, let me take a look at removing the option entirely. That would definitely simplify things.

@agoode agoode changed the title Add prometheus option to unexport unavailable metrics Prometheus: unexport unavailable metrics Sep 28, 2024
@agoode
Copy link
Contributor Author

agoode commented Sep 28, 2024

I've updated both PRs.

@knyar
Copy link
Contributor

knyar commented Sep 28, 2024

I don't have any objections to just doing this with no transition period, but maybe it reaches the level of a "Breaking change" now?

I can imagine use cases for which the current behavior is useful - for example, if you have a sensor or another device that is only intermittently available, having its last-known state reported to Prometheus might be more helpful than having metrics regularly disappear and reappear. In the new world, one would need to apply one of the *_over_time functions at query time to achieve something similar.

@agoode
Copy link
Contributor Author

agoode commented Sep 28, 2024

I'm going to move this to draft because I want to think a little more about a couple things.

@agoode agoode marked this pull request as draft September 28, 2024 12:33
@agoode
Copy link
Contributor Author

agoode commented Sep 28, 2024

The problem with the current PR is that going unavailable will unexport ALL the metrics, but I think we want to keep most around, especially entity_available. It shouldn't be too bad to fix.

I remembered I have a dashboard that shows which entities are unavailable, using entity_available. I forgot we need it still!

@rcloran
Copy link
Contributor

rcloran commented Sep 28, 2024

Sorry, I did not intend to create churn with my comment 🙈. I'm happy to see this proceed in either direction.

@knyar my point was that someone writing a "proper" query against an intermittent metric would have to handle missing metrics anyways, as those might occur during restart. I absolutely might be missing something here, though.

@agoode
Copy link
Contributor Author

agoode commented Sep 28, 2024

No problem! I just realized that the original PR had some unintended effects, based on a few misunderstandings I had.

@agoode agoode marked this pull request as ready for review October 6, 2024 02:41
@agoode
Copy link
Contributor Author

agoode commented Oct 6, 2024

Ok, I think it's good now. The state_change, entity_available, and last_updated_time_seconds metrics now stay around, which completely matches how it appears at startup and lets us continue to observe them.

When sensors go offline, this component would continue to report its
last value, until Home Assistant itself restarts, or the sensor
returns.

The `entity_available` metric can be used to filter out unavailable
metrics, but this is slow with current versions of prometheus
(see prometheus/prometheus#9577).

Now, the component will automatically withdraw metrics when the entity
becomes unavailable, which matches the behavior on restart and
makes it easier to see missing metrics without using an `unless`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants