Skip to content

Comments

[digiplex] Fix runtime exceptions#17864

Closed
jlaur wants to merge 2 commits intoopenhab:mainfrom
jlaur:digiplex-fix-runtime-exceptions
Closed

[digiplex] Fix runtime exceptions#17864
jlaur wants to merge 2 commits intoopenhab:mainfrom
jlaur:digiplex-fix-runtime-exceptions

Conversation

@jlaur
Copy link
Contributor

@jlaur jlaur commented Dec 8, 2024

This fixes possible NumberFormatException and IndexOutOfBoundsException for unexpected input.

Also adds test coverage and debug logging of messages that could not be parsed. The coverage is based on the current logic rather than real messages, since I don't know what they look like.

Related to #17829
Resolves #17828

jlaur added 2 commits December 8, 2024 01:00
Add test coverage

Signed-off-by: Jacob Laursen <jacob-github@vindvejr.dk>
Signed-off-by: Jacob Laursen <jacob-github@vindvejr.dk>
@jlaur jlaur requested a review from rmichalak as a code owner December 8, 2024 00:02
@jlaur jlaur added the bug An unexpected problem or unintended behavior of an add-on label Dec 8, 2024
Copy link
Contributor

@rmichalak rmichalak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good on paper, however be aware I currently have no ways of testing it live with my alarm system. I'm still on openhab v2 and slowly upgrading my setup to v4 (which triggered me to raise the original issue after a few years of absence in the project).
I need to upgrade my dev env and it can take me a while to do so and be fully up to speed with everything that changed since v2.

@rmichalak
Copy link
Contributor

rmichalak commented Dec 8, 2024

Actually, there may be more to it... before your fix, these malformed responses where throwing an exception and this was triggering restart immediately. Now, they just return UnknownResponse which is considered a valid (but unknown) response and counted as such here.
There's also a check for 'stalled' messages (these that did not receive a response) here and it triggers restart as well.
With your changes, digiplex may return rubbish all the time and we will never know about it (except for log entries)

I agree that the changes you made make sense, but perhaps we should also distinguish between 'unknown' response (as in - not handled by this binding, like here) and erroneous response, which we should count and restart the binding if we cross the threshold.

This may require more changes and I would need to go down to the protocol details once more, plus I will need to setup my dev environment from scratch 😓

@rmichalak
Copy link
Contributor

rmichalak commented Dec 8, 2024

Also mind that there's an alternative binding for the same alarm system: https://www.openhab.org/addons/bindings/paradoxalarm/
which uses an IP150 module (an add-on module sold separately) with more functionality. Mine uses RS232 communication included in the base board (which is cheaper, less functional, and apparently not so stable, at least for me, hence all the hacks and restarts).

@jlaur jlaur added the additional testing preferred The change works for the pull request author. A test from someone else is preferred though. label Dec 8, 2024
@jlaur
Copy link
Contributor Author

jlaur commented Dec 8, 2024

Actually, there may be more to it... before your fix, these malformed responses where throwing an exception and this was triggering restart immediately. Now, they just return UnknownResponse which is considered a valid (but unknown) response and counted as such here.

I agree, see also #17829 (comment). So in other words, it seems we already had this issue with this existing sanity check:

if (message.length() < 4) { // sanity check: try to filter out malformed responses
return new UnknownResponse(message);
}

which I stayed consistent with, but expanded to cover all cases, so we avoid throwing RuntimeExceptions. But if in fact reconnection is needed, something must be changed. How do you distinguish between an unknown message and a malformed message that should cause reconnection?

And how is it currently handled when e.g. a NumberFormatException is thrown, I guess the binding will stop working?

This may require more changes and I would need to go down to the protocol details once more, plus I will need to setup my dev environment from scratch 😓

I guess there is no rush, or have you had this reported by others as well? If it won't make it for 4.3, it can still be backported into a 4.3.x patch later.

@rmichalak
Copy link
Contributor

rmichalak commented Dec 9, 2024

How do you distinguish between an unknown message and a malformed message that should cause reconnection?

I don't know yet, I guess I need to check the protocol again and make sure we handle all the proper responses, even if only with UnknownResponse case for responses/events we don't want to handle explicitly. The rest will be erroneous...

And how is it currently handled when e.g. a NumberFormatException is thrown, I guess the binding will stop working?

Correct, and that's perhaps why it makes sense to still merge #17829 (limited to receiver thread) as it applies a bandaid on this wound at least (even if not fitting perfectly).

I guess there is no rush, or have you had this reported by others as well?

No, I didn't have any reports. Honestly, I'm not sure if anyone is using this binding, apart from me... I guess the one with IP150 module is more popular (and the module itself is cheaper now, than it used to be a few years ago). I may even consider buying it, but since I'm already in this rabbit hole, I would like to leave my extension stable...

@jlaur
Copy link
Contributor Author

jlaur commented Dec 10, 2024

And how is it currently handled when e.g. a NumberFormatException is thrown, I guess the binding will stop working?

Correct, and that's perhaps why it makes sense to still merge #17829 (limited to receiver thread) as it applies a bandaid on this wound at least (even if not fitting perfectly).

But doesn't that PR suffer from this same problem:

be aware I currently have no ways of testing it live with my alarm system.

?

So maybe it does way too many (potentially unneeded) reconnections because of running into the parser bugs? Maybe it doesn't restart enough, if malformed messages are shorter than 4 characters?

And then there is the already mentioned issue that it may hide bugs because it catches all exceptions - which is why I decided to give this PR a shot.

Preferable you could test this when you have your development system up running again, although I realize that might be too late for 4.3. Have in mind that we can also include such bugfixes in patch releases. I think as a temporary solution (any maybe no one, even including you, is currently using the 4.2/4.3 version of the binding) this PR could be merged, because it must be better to gracefully handle malformed message (without triggering reconnection) than to let the binding crash, which will not trigger reconnection either, or..?

@rmichalak
Copy link
Contributor

But doesn't that PR suffer from this same problem:

The original PR I proposed (#17829) is tested as it's been working in my environment for over three years (on openhab2). I just reapplied it to openhab v4 codebase and as it's very simple in nature, it should work there as well. Agree it's rather rudimentary and we can do better, but it works at least....

So maybe it does way too many (potentially unneeded) reconnections because of running into the parser bugs? Maybe it doesn't restart enough, if malformed messages are shorter than 4 characters?

What I observed with my alarm system (which may be very specific to my env), is that it starts with a few malformed responses and then the alarm system stops responding at all. So when we receive a malformed message, there's really no going back from this stage. we need to restart the connection. I tried with a few different baud rates and other connection settings but I couldn't achieve long-term stability. I tried exchanging all the components in the middle (cable, rs232 -> usb converter) without any luck.

Perhaps some restarts are not needed because of shortcoming of the current implementation and you are right that some of the messages shorter than 4 characters may not trigger restart, but for this there's another safeguard to count the number of responses. If it does not match the number of messages sent (by a threshold), we trigger the restart as well.

I may be able to run more tests in January and I think there's really no rush with it... Seems like I may be the only one using this binding (or it's indeed only about my faulty alarm system going south), as otherwise we would have had some reports about binding not working.

@jlaur
Copy link
Contributor Author

jlaur commented Dec 16, 2024

But doesn't that PR suffer from this same problem:

The original PR I proposed (#17829) is tested as it's been working in my environment for over three years (on openhab2). I just reapplied it to openhab v4 codebase and as it's very simple in nature, it should work there as well. Agree it's rather rudimentary and we can do better, but it works at least....

Sorry, I missed this point, that actually it is an old fix and already tested.

I may be able to run more tests in January and I think there's really no rush with it...

👍 Since there doesn't seem to be any rush, do we agree to leave both PR's open for the time being, and then try to "get it right" once you find the time?

@rmichalak
Copy link
Contributor

Since there doesn't seem to be any rush, do we agree to leave both PR's open for the time being, and then try to "get it right" once you find the time?

Sure thing. let's do that!

@jlaur
Copy link
Contributor Author

jlaur commented Jan 8, 2025

Superceded by #18035.

@jlaur jlaur closed this Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

additional testing preferred The change works for the pull request author. A test from someone else is preferred though. bug An unexpected problem or unintended behavior of an add-on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[digiplex] Received thread stops when an exception happens inside

2 participants