JPERF-811: Fix flaky shouldTolerateEarlyFinish test by waiting until … by pczuj · Pull Request #23 · atlassian/ssh

pczuj · 2022-06-22T16:08:24Z

…OS actually starts (and finishes) the process before interrupting it

pczuj · 2022-06-22T16:08:47Z

Related discussion: #13 (comment)

pczuj · 2022-06-22T16:10:35Z

src/test/kotlin/com/atlassian/performance/tools/ssh/api/SshTest.kt

+        processCommand: String,
+        timeout: Duration
+    ) = this.newConnection().use {
+        it.execute("wait `pgrep '$processCommand'`", timeout)


Probably a bit of an overkill, but I don't have a better idea how we can wait for the process to finish.

It's also a shotgun, because we are waiting not only for the one process we started, but also for all other processes that might have the same name. I don't think we expect any other processes like that anyway.

I think the chance of other processes with the same name is minimal, and this solution should be sufficient.
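One thing to keep in mind with the approach above is that the shell's wait builtin is specified to operate on child processes of the invoking shell, so a polling loop may behave more predictably when the check runs on a separate connection. Below is a minimal sketch of such a loop. The name waitUntil and its parameters are hypothetical and not part of the ssh library, and the condition is left abstract (in the test it could be a pgrep call that reports no matching process).

```kotlin
import java.time.Duration
import java.time.Instant

/**
 * Hypothetical helper: polls [condition] until it returns true or [timeout] elapses.
 * Returns true if the condition was met in time, false on timeout.
 */
fun waitUntil(
    timeout: Duration,
    pollInterval: Duration = Duration.ofMillis(10),
    condition: () -> Boolean
): Boolean {
    val deadline = Instant.now().plus(timeout)
    while (true) {
        // Check first, so a condition that already holds returns without sleeping.
        if (condition()) return true
        if (Instant.now().isAfter(deadline)) return false
        Thread.sleep(pollInterval.toMillis())
    }
}
```

The condition is always evaluated at least once, even with a zero timeout, which keeps very short time budgets from failing spuriously.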

pczuj · 2022-06-22T16:14:40Z

src/test/kotlin/com/atlassian/performance/tools/ssh/api/SshTest.kt

-            val failResult = fail.stop(Duration.ofMillis(20))
+            repeat(1000) {
+                val fail = sshHost.runInBackground("nonexistant-command")
+                sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))


Timeout with Duration.ofMillis(50) wasn't enough. I had a run where in 1000 repeats it timed out.

I believe that these kinds of timeouts are inevitable when we are working with the OS. There could always be some huge lag spike where this and any other process with a timeout would just fail.

Anyway, it's better to time out on an ssh command, where an exception explicitly says that it was a timeout, than to fail because we didn't get a response. In the former case it's obvious that we can just increase the timeout to reduce the likelihood of flakiness.

I believe that these kinds of timeouts are inevitable when we are working with the OS.

Yeah, process scheduling can introduce random delays, but the delay distribution is not open-ended. It will never take 1 hour.

Anyway, it's better to time out on an ssh command, where an exception explicitly says that it was a timeout, than to fail because we didn't get a response. In the former case it's obvious that we can just increase the timeout to reduce the likelihood of flakiness.

Exactly ❤️
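The point agreed on above, that a timeout should announce itself rather than surface as a missing response, can be sketched roughly like this. The names CommandTimedOut and awaitOrExplain are hypothetical and only illustrate the idea; they are not the ssh library's API.

```kotlin
import java.time.Duration
import java.util.concurrent.CompletableFuture
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException

/** Hypothetical exception type: names the command and the time budget that was exceeded. */
class CommandTimedOut(val command: String, val timeout: Duration) :
    Exception("'$command' failed to finish within $timeout")

/** Runs [work] on another thread and fails with an explicit timeout message if it is too slow. */
fun <T> awaitOrExplain(command: String, timeout: Duration, work: () -> T): T {
    val future = CompletableFuture.supplyAsync { work() }
    try {
        return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS)
    } catch (e: TimeoutException) {
        // Marks the future as cancelled; it does not interrupt the thread running [work].
        future.cancel(true)
        throw CommandTimedOut(command, timeout)
    }
}
```

With a message that carries the duration, the remedy (raise the timeout) is visible directly in the failure.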

pczuj · 2022-06-22T16:17:07Z

src/test/kotlin/com/atlassian/performance/tools/ssh/api/SshTest.kt


-            val fail = sshHost.runInBackground("nonexistent-command")
-            val failResult = fail.stop(Duration.ofMillis(20))
+            repeat(1000) {


I intend to remove this repeat block or at least reduce the number to e.g. 10. I pushed it only as proof that the wait helps (maybe I should have pushed without the fix first 🤔).

I pushed another PR with only the repeat(1000) block: #24

It should fail, showing that the fix in #23 actually helps.

I will close #24 and remove its branch after the tests are run. I hope the results will be kept.

https://github.com/atlassian/ssh/runs/7008333187?check_suite_focus=true

You can have a PR with repeat open (and red) and then open PRs targeting that red branch. They'd show it's green. And it would allow alternatives to be explored.

pczuj · 2022-06-28T09:20:50Z

I removed the repeat(1000) block from the commit. Initially I added it only as a proof for the test stability.
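For reference, the repeat-based stability proof can also be written so that it reports which iterations failed instead of stopping at the first one. This is a small sketch with a hypothetical helper name (failingIterations); it is not code from this PR.

```kotlin
/**
 * Hypothetical helper: runs [body] [repeats] times and returns the indices
 * of the iterations that threw (exceptions and assertion errors alike).
 */
fun failingIterations(repeats: Int, body: (Int) -> Unit): List<Int> =
    (0 until repeats).filter { iteration ->
        runCatching { body(iteration) }.isFailure
    }
```

An empty result over many repeats supports the stability claim, while a non-empty one gives a rough failure rate to compare before and after a fix.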

dagguh · 2022-06-29T12:16:13Z

src/test/kotlin/com/atlassian/performance/tools/ssh/api/SshTest.kt


-            val fail = sshHost.runInBackground("nonexistent-command")
+            val fail = sshHost.runInBackground("nonexistant-command")
+            sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))


Now the test shows that the runInBackground API cannot be used easily. We should fix the problem in the API rather than in tests.

We could wait for the process to start before returning from runInBackground similarly to how my solution proposal does it in this test, however this approach is very system specific (usage of wait and pgrep). I don't know how we could do that in more portable way.

Another approach would be to return some specific SshConnection.SshResult from BackgroundProcess.stop when we don't actually get the exitStatus. However, I don't know what that would be, and we need to return something if we want to maintain the current API of BackgroundProcess.
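To make the second option more concrete, here is one possible shape for a result that does not require an exit status. Everything in it (StopOutcome, classify, describe) is hypothetical and does not exist in the ssh library; it only illustrates modelling "no exit status" explicitly instead of relying on a non-null field.

```kotlin
/** Hypothetical outcome of stopping a background process. */
sealed class StopOutcome {
    data class Exited(val exitStatus: Int) : StopOutcome()
    data class Signalled(val signal: String) : StopOutcome()
    object Unknown : StopOutcome()
}

/** Picks an outcome from whichever piece of information the remote side reported. */
fun classify(exitStatus: Int?, exitSignal: String?): StopOutcome = when {
    exitStatus != null -> StopOutcome.Exited(exitStatus)
    exitSignal != null -> StopOutcome.Signalled(exitSignal)
    else -> StopOutcome.Unknown
}

fun describe(outcome: StopOutcome): String = when (outcome) {
    is StopOutcome.Exited -> "exited with status ${outcome.exitStatus}"
    is StopOutcome.Signalled -> "terminated by signal ${outcome.signal}"
    StopOutcome.Unknown -> "finished without reporting a status"
}
```

A type like this would change the return type of stop, so it falls under the API-breaking option discussed in this thread.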

@dagguh I'd like to fix the flakiness of the test. This is my main goal.

As I understand it, implementing your suggestion means either making the API less portable or breaking it (see my 2 previous comments). I don't like either of those options and I'm a bit stuck with this PR. Do you have any other ideas for how this could be fixed? If not, do you have an opinion about which of the 2 is better?

@pczuj why are we afraid to break the API and properly apply semantic versioning?

break the API and properly apply semantic versioning

Yes, we can just break the API, it's a tool in our belt.
It can require some extra effort. Let's say we release ssh:3.0.0. The infrastructure lib is an ssh consumer:

api("com.atlassian.performance.tools:ssh:[2.3.0,3.0.0)")

It will have a choice:

1. ssh:[2.3.0,3.0.0) - stay on old ssh, this is the default state. It takes no effort, but what was the point of ssh:3.0.0, if no consumer needs it?

2. ssh:[2.3.0,4.0.0) - use the new, without dropping the old. Only possible if the removed API was not used.

3. ssh:[3.0.0,4.0.0) - use the new, drop the old. This means that all consumers of infrastructure which are still on ssh:[2.0.0,3.0.0) will no longer be compatible. Therefore that bump is a breakage of infrastructure (its POM contract), which means another major release. This bump would cascade to multiple libs: aws-infrastructure and jira-performance-tests, even if they don't use ssh directly.

We used scenario 3 many times in the past. It's a bit annoying, but not that scary.
We successfully used scenario 2 as well.
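The three scenarios come down to how half-open version ranges intersect. The sketch below models that with hypothetical Version and VersionRange types (they are not Gradle or Maven classes) and assumes plain major.minor.patch versions.

```kotlin
/** Minimal semantic version: major.minor.patch, compared field by field. */
data class Version(val major: Int, val minor: Int, val patch: Int) : Comparable<Version> {
    override fun compareTo(other: Version): Int =
        compareValuesBy(this, other, Version::major, Version::minor, Version::patch)
}

fun version(text: String): Version {
    val (major, minor, patch) = text.split(".").map { it.toInt() }
    return Version(major, minor, patch)
}

/** Half-open range [from, until), mirroring the notation "[2.3.0,3.0.0)". */
data class VersionRange(val from: Version, val until: Version) {
    operator fun contains(v: Version) = v >= from && v < until

    /** Two half-open ranges share at least one version when each starts before the other ends. */
    fun overlaps(other: VersionRange) = from < other.until && other.from < until
}

fun range(from: String, until: String) = VersionRange(version(from), version(until))
```

In these terms, scenario 2 keeps an overlap with consumers pinned to [2.0.0,3.0.0), while scenario 3 leaves none, which is why it cascades into a major release of the consuming library.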

We should fix the problem in the API rather than in tests.

I didn't mean we have to break API. I meant: the tests found a real flakiness pain, let's fix it for the consumers as well.

this approach is very system specific (usage of wait and pgrep)

I would avoid it too. Not only does it lose generality, but it also adds more moving parts, making it brittle and complex.

dagguh · 2022-09-09T10:53:31Z

src/test/kotlin/com/atlassian/performance/tools/ssh/api/SshTest.kt

-            val fail = sshHost.runInBackground("nonexistent-command")
+            val fail = sshHost.runInBackground("nonexistant-command")
+            sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))
            val failResult = fail.stop(Duration.ofMillis(20))


So this fails with "command.exitStatus must not be null".
The exitStatus is supposed to be filled by if (exit-status), but it misses it and falls into if (exit-signal):

We can see that exitSignal = INT, so that's our tryToInterrupt. But why is req coming in with different values? Where exactly is the race condition? 🤔

Lol, this happens when the join fails:

So this is an error during error handling.
PS. In order to notice this, I had to add an SSHClient.close call (Ssh.runInBackground is leaking those). Otherwise there's a ton of reader and heartbeater threads.

The close in readOutput fails if the command was interrupted.
Softening it unveils the real flakiness source:

SSH command failed to finish in extended time (PT0.025S): SshjExecutedCommand(stdout=^Cbash: nonexistent-command-290: command not found
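"Softening" a close here means not letting a failure in close() replace the error that actually matters. A minimal sketch of that idea follows, using hypothetical helper names (closeSoftly, runThenClose) rather than the code from the actual fix.

```kotlin
/** Hypothetical helper: closes [resource] and hands any failure to [onFailure] instead of throwing. */
fun closeSoftly(resource: AutoCloseable, onFailure: (Exception) -> Unit) {
    try {
        resource.close()
    } catch (e: Exception) {
        onFailure(e)
    }
}

/**
 * Runs [block], then always closes [resource]. A close failure is only logged,
 * so an exception thrown by [block] (for example a timeout) reaches the caller intact.
 */
fun <T> runThenClose(resource: AutoCloseable, block: () -> T): T =
    try {
        block()
    } finally {
        closeSoftly(resource) { System.err.println("Ignored failure while closing: $it") }
    }
```

The trade-off is that close failures become easy to miss, so they should at least be logged, as in the sketch.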

BTW, it seems arbitrary to have both SshResult and SshjExecutedCommand, and two ways of reading stdout.

I traced it down to ancient times behind the wall: https://bulldog.internal.atlassian.com/browse/JPT-292

Hmm, I already had the fixes, but I was checking the effect of each commit via rebase (to populate the CHANGELOG correctly). And it turns out that "failed to finish in extended time (PT0.025S)" happens even without the fixes. Sometimes it's a timeout, sometimes it's a null. I must have mixed up the observations around exitStatus=null with the wrong actual symptom. Gotta really hammer down my understanding of the problem.

pczuj requested a review from dagguh June 22, 2022 16:08

pczuj requested a review from a team as a code owner June 22, 2022 16:08

pczuj mentioned this pull request Jun 22, 2022

JPERF-716 background process results #13

Merged

pczuj commented Jun 22, 2022

pczuj mentioned this pull request Jun 22, 2022

JPERF-811: Show how flaky shouldTolerateEarlyFinish is, so that it's … #24

Closed

JPERF-811: Fix flaky shouldTolerateEarlyFinish test by waiting until …

e79bb92

…OS actually starts (and finishes) the process before interrupting it

pczuj force-pushed the JPERF-811-fix-flaky-SshTest-shouldTolerateEarlyFinish branch from a1ba4f2 to e79bb92 Compare June 28, 2022 09:19

szymonra approved these changes Jun 28, 2022

dagguh suggested changes Jun 29, 2022

pczuj requested a review from dagguh July 26, 2022 10:19

dagguh reviewed Sep 9, 2022


4 participants
