Failsafe: Killing self fork JVM. PING timeout elapsed.

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
Hi Jason,

We spoke about this issue on our chat in ASF Slack:
"I think his tests have been paused for a long GC periods and timed out 3x
PING period = 30 seconds. After this period forked JVM supposed the Maven
process was killed by JenkinsCI and therefore all surefire processes are
killed as well and all the file handlers and memory consumptions are freed."

"But I have to say that `-Xmx3g` may cause long GC cycles, see
https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
"

You are using java-1.8-openjdk. I guess you should use Shenandoah GC which
is an experimental algorithm in  JVM 1.8. This would significantly short
the GC cycles.

We should of cource provide a new configuration parameter to give you a
chance to prolong the PING.

Cheers
Tibor
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

AlonsoIR
unsubscribe

El sáb., 23 feb. 2019 a las 13:36, Tibor Digana (<[hidden email]>)
escribió:

> Hi Jason,
>
> We spoke about this issue on our chat in ASF Slack:
> "I think his tests have been paused for a long GC periods and timed out 3x
> PING period = 30 seconds. After this period forked JVM supposed the Maven
> process was killed by JenkinsCI and therefore all surefire processes are
> killed as well and all the file handlers and memory consumptions are
> freed."
>
> "But I have to say that `-Xmx3g` may cause long GC cycles, see
>
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> "
>
> You are using java-1.8-openjdk. I guess you should use Shenandoah GC which
> is an experimental algorithm in  JVM 1.8. This would significantly short
> the GC cycles.
>
> We should of cource provide a new configuration parameter to give you a
> chance to prolong the PING.
>
> Cheers
> Tibor
>


--
Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana-3
 Alonso Isidoro Roman, why you are telling me " unsubscribe", hm?

On Sat, Feb 23, 2019 at 2:07 PM Alonso Isidoro Roman <[hidden email]>
wrote:

> unsubscribe
>
> El sáb., 23 feb. 2019 a las 13:36, Tibor Digana (<[hidden email]>)
> escribió:
>
> > Hi Jason,
> >
> > We spoke about this issue on our chat in ASF Slack:
> > "I think his tests have been paused for a long GC periods and timed out
> 3x
> > PING period = 30 seconds. After this period forked JVM supposed the Maven
> > process was killed by JenkinsCI and therefore all surefire processes are
> > killed as well and all the file handlers and memory consumptions are
> > freed."
> >
> > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > "
> >
> > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
> which
> > is an experimental algorithm in  JVM 1.8. This would significantly short
> > the GC cycles.
> >
> > We should of cource provide a new configuration parameter to give you a
> > chance to prolong the PING.
> >
> > Cheers
> > Tibor
> >
>
>
> --
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
> <
> https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links
> >
>


--
Cheers
Tibor
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
In reply to this post by Tibor Digana
Thanks for the information. It's good to see someone understands a little
about this.

Incidentally, we have been looking at other GCs and VMs for the application
in production environments, so I'll look into how these affect tests as
well. I'll try to enable some logging about GC pauses to see what's up.

How would `-Xmx3g` cause long GC cycles? Do you think the value is simply
too low?

FWIW we're running the Maven build in an Alpine-based Docker container.

On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]> wrote:

> Hi Jason,
>
> We spoke about this issue on our chat in ASF Slack:
> "I think his tests have been paused for a long GC periods and timed out 3x
> PING period = 30 seconds. After this period forked JVM supposed the Maven
> process was killed by JenkinsCI and therefore all surefire processes are
> killed as well and all the file handlers and memory consumptions are
> freed."
>
> "But I have to say that `-Xmx3g` may cause long GC cycles, see
>
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> "
>
> You are using java-1.8-openjdk. I guess you should use Shenandoah GC which
> is an experimental algorithm in  JVM 1.8. This would significantly short
> the GC cycles.
>
> We should of cource provide a new configuration parameter to give you a
> chance to prolong the PING.
>
> Cheers
> Tibor
>


--

Jason Young
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
>> I'll try to enable some logging about GC pauses to see what's up

Pls do not keep such setting after tuning the GC because this may sometime
break the interprocess communication between Maven process and surefire
process.
It's worth to list GC information in a file and not in the console logs.
This can be configured, I guess.

>> Do you think the value is simply too low?

GCing many objects may take some time and I remember we had a user who had
this problem a year or two ago.
We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
NOP. So 30 seconds looked satisfactory.
I think you use old version 2.20 or something like that. The fixes for
docker have been done so far, so please use the latest version 3.0.0-M3.
See this page
https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?

Cheers
Tibor

On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[hidden email]>
wrote:

> Thanks for the information. It's good to see someone understands a little
> about this.
>
> Incidentally, we have been looking at other GCs and VMs for the application
> in production environments, so I'll look into how these affect tests as
> well. I'll try to enable some logging about GC pauses to see what's up.
>
> How would `-Xmx3g` cause long GC cycles? Do you think the value is simply
> too low?
>
> FWIW we're running the Maven build in an Alpine-based Docker container.
>
> On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]>
> wrote:
>
> > Hi Jason,
> >
> > We spoke about this issue on our chat in ASF Slack:
> > "I think his tests have been paused for a long GC periods and timed out
> 3x
> > PING period = 30 seconds. After this period forked JVM supposed the Maven
> > process was killed by JenkinsCI and therefore all surefire processes are
> > killed as well and all the file handlers and memory consumptions are
> > freed."
> >
> > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > "
> >
> > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
> which
> > is an experimental algorithm in  JVM 1.8. This would significantly short
> > the GC cycles.
> >
> > We should of cource provide a new configuration parameter to give you a
> > chance to prolong the PING.
> >
> > Cheers
> > Tibor
> >
>
>
> --
>
> Jason Young
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
Thanks again for the information.

We had increased the RAM to 3g some time ago to prevent OOMEs. More
recently, I increased the RAM again to 5g for extra headroom since we had
more headroom available; the problem hasn't happened since, but it hasn't
been very long.

We use a more customized image based on Alpine 3.8.2. The JDK and Maven are
obtained via apk.

I will try upgrading failsafe (and surefire while I'm at it) sooner, and
probably do some experimentation with JVMs another time (not pressing for
me ATM).

On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[hidden email]>
wrote:

> >> I'll try to enable some logging about GC pauses to see what's up
>
> Pls do not keep such setting after tuning the GC because this may sometime
> break the interprocess communication between Maven process and surefire
> process.
> It's worth to list GC information in a file and not in the console logs.
> This can be configured, I guess.
>
> >> Do you think the value is simply too low?
>
> GCing many objects may take some time and I remember we had a user who had
> this problem a year or two ago.
> We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
> NOP. So 30 seconds looked satisfactory.
> I think you use old version 2.20 or something like that. The fixes for
> docker have been done so far, so please use the latest version 3.0.0-M3.
> See this page
> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?
>
> Cheers
> Tibor
>
> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[hidden email]>
> wrote:
>
> > Thanks for the information. It's good to see someone understands a little
> > about this.
> >
> > Incidentally, we have been looking at other GCs and VMs for the
> application
> > in production environments, so I'll look into how these affect tests as
> > well. I'll try to enable some logging about GC pauses to see what's up.
> >
> > How would `-Xmx3g` cause long GC cycles? Do you think the value is simply
> > too low?
> >
> > FWIW we're running the Maven build in an Alpine-based Docker container.
> >
> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > We spoke about this issue on our chat in ASF Slack:
> > > "I think his tests have been paused for a long GC periods and timed out
> > 3x
> > > PING period = 30 seconds. After this period forked JVM supposed the
> Maven
> > > process was killed by JenkinsCI and therefore all surefire processes
> are
> > > killed as well and all the file handlers and memory consumptions are
> > > freed."
> > >
> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> > >
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > "
> > >
> > > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
> > which
> > > is an experimental algorithm in  JVM 1.8. This would significantly
> short
> > > the GC cycles.
> > >
> > > We should of cource provide a new configuration parameter to give you a
> > > chance to prolong the PING.
> > >
> > > Cheers
> > > Tibor
> > >
> >
> >
> > --
> >
> > Jason Young
> >
>


--
Jason Young
Software Engineer | PROCENTIVE
[image: Phone] 715 245 8000 x7609
[image: Mobile] 706 870 3540
[image: Web] procentive.com
Confidentiality Notice: This message is intended for the sole use of the
individual and entity to which it is addressed, and may contain information
that is privileged, confidential and exempt from disclosure under
applicable law. Any unauthorized review, use, disclosure or distribution of
this email message, including any attachment, is prohibited. If you are not
the intended recipient, please advise the sender by reply email and destroy
all copies of the original message.
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered the
same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in case
that helps.)

BTW I also asked on StackOverflow previously, for anyone interested:
https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed

On Tue, Feb 26, 2019 at 6:40 PM Jason Young <[hidden email]>
wrote:

> Thanks again for the information.
>
> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> recently, I increased the RAM again to 5g for extra headroom since we had
> more headroom available; the problem hasn't happened since, but it hasn't
> been very long.
>
> We use a more customized image based on Alpine 3.8.2. The JDK and Maven
> are obtained via apk.
>
> I will try upgrading failsafe (and surefire while I'm at it) sooner, and
> probably do some experimentation with JVMs another time (not pressing for
> me ATM).
>
> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[hidden email]>
> wrote:
>
>> >> I'll try to enable some logging about GC pauses to see what's up
>>
>> Pls do not keep such setting after tuning the GC because this may sometime
>> break the interprocess communication between Maven process and surefire
>> process.
>> It's worth to list GC information in a file and not in the console logs.
>> This can be configured, I guess.
>>
>> >> Do you think the value is simply too low?
>>
>> GCing many objects may take some time and I remember we had a user who had
>> this problem a year or two ago.
>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
>> NOP. So 30 seconds looked satisfactory.
>> I think you use old version 2.20 or something like that. The fixes for
>> docker have been done so far, so please use the latest version 3.0.0-M3.
>> See this page
>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?
>>
>> Cheers
>> Tibor
>>
>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[hidden email]>
>> wrote:
>>
>> > Thanks for the information. It's good to see someone understands a
>> little
>> > about this.
>> >
>> > Incidentally, we have been looking at other GCs and VMs for the
>> application
>> > in production environments, so I'll look into how these affect tests as
>> > well. I'll try to enable some logging about GC pauses to see what's up.
>> >
>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is
>> simply
>> > too low?
>> >
>> > FWIW we're running the Maven build in an Alpine-based Docker container.
>> >
>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]>
>> > wrote:
>> >
>> > > Hi Jason,
>> > >
>> > > We spoke about this issue on our chat in ASF Slack:
>> > > "I think his tests have been paused for a long GC periods and timed
>> out
>> > 3x
>> > > PING period = 30 seconds. After this period forked JVM supposed the
>> Maven
>> > > process was killed by JenkinsCI and therefore all surefire processes
>> are
>> > > killed as well and all the file handlers and memory consumptions are
>> > > freed."
>> > >
>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
>> > >
>> > >
>> >
>> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
>> > > "
>> > >
>> > > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
>> > which
>> > > is an experimental algorithm in  JVM 1.8. This would significantly
>> short
>> > > the GC cycles.
>> > >
>> > > We should of cource provide a new configuration parameter to give you
>> a
>> > > chance to prolong the PING.
>> > >
>> > > Cheers
>> > > Tibor
>> > >
>> >
>> >
>> > --
>> >
>> > Jason Young
>> >
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
Getting back to my original questions, I know that "ping" means to see if a
process is there, and "NOOP" implies it's not a command to do anything. But
what do the terms "ping" and "NOOP" mean in this context, i.e. how do the
processes communicate? I assume they don't sonar. Do other processes also
ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?

I'm confused about what I should do regarding GC pauses. Previously I had
code that would write the amount of remaining heap space (or something like
that) to stdout after every test to troubleshoot OOMEs. Can writing to
stdout cause the communication failure somehow?

On Wed, Mar 13, 2019 at 5:57 PM Jason Young <[hidden email]>
wrote:

> I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered
> the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in case
> that helps.)
>
> BTW I also asked on StackOverflow previously, for anyone interested:
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
>
> On Tue, Feb 26, 2019 at 6:40 PM Jason Young <[hidden email]>
> wrote:
>
>> Thanks again for the information.
>>
>> We had increased the RAM to 3g some time ago to prevent OOMEs. More
>> recently, I increased the RAM again to 5g for extra headroom since we had
>> more headroom available; the problem hasn't happened since, but it hasn't
>> been very long.
>>
>> We use a more customized image based on Alpine 3.8.2. The JDK and Maven
>> are obtained via apk.
>>
>> I will try upgrading failsafe (and surefire while I'm at it) sooner, and
>> probably do some experimentation with JVMs another time (not pressing for
>> me ATM).
>>
>> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[hidden email]>
>> wrote:
>>
>>> >> I'll try to enable some logging about GC pauses to see what's up
>>>
>>> Pls do not keep such setting after tuning the GC because this may
>>> sometime
>>> break the interprocess communication between Maven process and surefire
>>> process.
>>> It's worth to list GC information in a file and not in the console logs.
>>> This can be configured, I guess.
>>>
>>> >> Do you think the value is simply too low?
>>>
>>> GCing many objects may take some time and I remember we had a user who
>>> had
>>> this problem a year or two ago.
>>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
>>> NOP. So 30 seconds looked satisfactory.
>>> I think you use old version 2.20 or something like that. The fixes for
>>> docker have been done so far, so please use the latest version 3.0.0-M3.
>>> See this page
>>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
>>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?
>>>
>>> Cheers
>>> Tibor
>>>
>>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[hidden email]>
>>> wrote:
>>>
>>> > Thanks for the information. It's good to see someone understands a
>>> little
>>> > about this.
>>> >
>>> > Incidentally, we have been looking at other GCs and VMs for the
>>> application
>>> > in production environments, so I'll look into how these affect tests as
>>> > well. I'll try to enable some logging about GC pauses to see what's up.
>>> >
>>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is
>>> simply
>>> > too low?
>>> >
>>> > FWIW we're running the Maven build in an Alpine-based Docker container.
>>> >
>>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]>
>>> > wrote:
>>> >
>>> > > Hi Jason,
>>> > >
>>> > > We spoke about this issue on our chat in ASF Slack:
>>> > > "I think his tests have been paused for a long GC periods and timed
>>> out
>>> > 3x
>>> > > PING period = 30 seconds. After this period forked JVM supposed the
>>> Maven
>>> > > process was killed by JenkinsCI and therefore all surefire processes
>>> are
>>> > > killed as well and all the file handlers and memory consumptions are
>>> > > freed."
>>> > >
>>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
>>> > >
>>> > >
>>> >
>>> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
>>> > > "
>>> > >
>>> > > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
>>> > which
>>> > > is an experimental algorithm in  JVM 1.8. This would significantly
>>> short
>>> > > the GC cycles.
>>> > >
>>> > > We should of cource provide a new configuration parameter to give
>>> you a
>>> > > chance to prolong the PING.
>>> > >
>>> > > Cheers
>>> > > Tibor
>>> > >
>>> >
>>> >
>>> > --
>>> >
>>> > Jason Young
>>> >
>>>
>>
>>

--
Jason Young
Software Engineer | PROCENTIVE
[image: Phone] 715 245 8000 x7609
[image: Mobile] 706 870 3540
[image: Web] procentive.com
Confidentiality Notice: This message is intended for the sole use of the
individual and entity to which it is addressed, and may contain information
that is privileged, confidential and exempt from disclosure under
applicable law. Any unauthorized review, use, disclosure or distribution of
this email message, including any attachment, is prohibited. If you are not
the intended recipient, please advise the sender by reply email and destroy
all copies of the original message.
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Mikael Åsberg
These issues regarding communication with forked JVMs, won't they be
resolved once surefire moves to interprocess communication using
tcp/ip sockets? This happens to be the target feature to be included
in the next surefire 3.0.0 milestone:
https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668

There are soooo many issues relating to surefire reading stdout of
forked processes (which is my understanding that it is currently
doing). Many of us are really looking forward to the next milestone.

On Tue, Mar 19, 2019 at 8:59 PM Jason Young <[hidden email]> wrote:

>
> Getting back to my original questions, I know that "ping" means to see if a
> process is there, and "NOOP" implies it's not a command to do anything. But
> what do the terms "ping" and "NOOP" mean in this context, i.e. how do the
> processes communicate? I assume they don't sonar. Do other processes also
> ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?
>
> I'm confused about what I should do regarding GC pauses. Previously I had
> code that would write the amount of remaining heap space (or something like
> that) to stdout after every test to troubleshoot OOMEs. Can writing to
> stdout cause the communication failure somehow?
>
> On Wed, Mar 13, 2019 at 5:57 PM Jason Young <[hidden email]>
> wrote:
>
> > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered
> > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in case
> > that helps.)
> >
> > BTW I also asked on StackOverflow previously, for anyone interested:
> > https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> >
> > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <[hidden email]>
> > wrote:
> >
> >> Thanks again for the information.
> >>
> >> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> >> recently, I increased the RAM again to 5g for extra headroom since we had
> >> more headroom available; the problem hasn't happened since, but it hasn't
> >> been very long.
> >>
> >> We use a more customized image based on Alpine 3.8.2. The JDK and Maven
> >> are obtained via apk.
> >>
> >> I will try upgrading failsafe (and surefire while I'm at it) sooner, and
> >> probably do some experimentation with JVMs another time (not pressing for
> >> me ATM).
> >>
> >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[hidden email]>
> >> wrote:
> >>
> >>> >> I'll try to enable some logging about GC pauses to see what's up
> >>>
> >>> Pls do not keep such setting after tuning the GC because this may
> >>> sometime
> >>> break the interprocess communication between Maven process and surefire
> >>> process.
> >>> It's worth to list GC information in a file and not in the console logs.
> >>> This can be configured, I guess.
> >>>
> >>> >> Do you think the value is simply too low?
> >>>
> >>> GCing many objects may take some time and I remember we had a user who
> >>> had
> >>> this problem a year or two ago.
> >>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
> >>> NOP. So 30 seconds looked satisfactory.
> >>> I think you use old version 2.20 or something like that. The fixes for
> >>> docker have been done so far, so please use the latest version 3.0.0-M3.
> >>> See this page
> >>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
> >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?
> >>>
> >>> Cheers
> >>> Tibor
> >>>
> >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[hidden email]>
> >>> wrote:
> >>>
> >>> > Thanks for the information. It's good to see someone understands a
> >>> little
> >>> > about this.
> >>> >
> >>> > Incidentally, we have been looking at other GCs and VMs for the
> >>> application
> >>> > in production environments, so I'll look into how these affect tests as
> >>> > well. I'll try to enable some logging about GC pauses to see what's up.
> >>> >
> >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is
> >>> simply
> >>> > too low?
> >>> >
> >>> > FWIW we're running the Maven build in an Alpine-based Docker container.
> >>> >
> >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]>
> >>> > wrote:
> >>> >
> >>> > > Hi Jason,
> >>> > >
> >>> > > We spoke about this issue on our chat in ASF Slack:
> >>> > > "I think his tests have been paused for a long GC periods and timed
> >>> out
> >>> > 3x
> >>> > > PING period = 30 seconds. After this period forked JVM supposed the
> >>> Maven
> >>> > > process was killed by JenkinsCI and therefore all surefire processes
> >>> are
> >>> > > killed as well and all the file handlers and memory consumptions are
> >>> > > freed."
> >>> > >
> >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> >>> > >
> >>> > >
> >>> >
> >>> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> >>> > > "
> >>> > >
> >>> > > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
> >>> > which
> >>> > > is an experimental algorithm in  JVM 1.8. This would significantly
> >>> short
> >>> > > the GC cycles.
> >>> > >
> >>> > > We should of cource provide a new configuration parameter to give
> >>> you a
> >>> > > chance to prolong the PING.
> >>> > >
> >>> > > Cheers
> >>> > > Tibor
> >>> > >
> >>> >
> >>> >
> >>> > --
> >>> >
> >>> > Jason Young
> >>> >
> >>>
> >>
> >>
>
> --
> Jason Young
> Software Engineer | PROCENTIVE
> [image: Phone] 715 245 8000 x7609
> [image: Mobile] 706 870 3540
> [image: Web] procentive.com
> Confidentiality Notice: This message is intended for the sole use of the
> individual and entity to which it is addressed, and may contain information
> that is privileged, confidential and exempt from disclosure under
> applicable law. Any unauthorized review, use, disclosure or distribution of
> this email message, including any attachment, is prohibited. If you are not
> the intended recipient, please advise the sender by reply email and destroy
> all copies of the original message.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Bernd Eckenfels
I guess a timeout caused by FullGC can happen with TCP as well. Increasing the timeout might not be nice but does look like it would help in both cases. (Problems with stdout are more related to unexpected JVM messages I guess)

Gruss
Bernd
--
http://bernd.eckenfels.net

________________________________
Von: Mikael Åsberg <[hidden email]>
Gesendet: Mittwoch, März 20, 2019 9:40 AM
An: Maven Users List
Betreff: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

These issues regarding communication with forked JVMs, won't they be
resolved once surefire moves to interprocess communication using
tcp/ip sockets? This happens to be the target feature to be included
in the next surefire 3.0.0 milestone:
https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668

There are soooo many issues relating to surefire reading stdout of
forked processes (which is my understanding that it is currently
doing). Many of us are really looking forward to the next milestone.

On Tue, Mar 19, 2019 at 8:59 PM Jason Young <[hidden email]> wrote:

>
> Getting back to my original questions, I know that "ping" means to see if a
> process is there, and "NOOP" implies it's not a command to do anything. But
> what do the terms "ping" and "NOOP" mean in this context, i.e. how do the
> processes communicate? I assume they don't sonar. Do other processes also
> ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?
>
> I'm confused about what I should do regarding GC pauses. Previously I had
> code that would write the amount of remaining heap space (or something like
> that) to stdout after every test to troubleshoot OOMEs. Can writing to
> stdout cause the communication failure somehow?
>
> On Wed, Mar 13, 2019 at 5:57 PM Jason Young <[hidden email]>
> wrote:
>
> > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered
> > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in case
> > that helps.)
> >
> > BTW I also asked on StackOverflow previously, for anyone interested:
> > https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> >
> > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <[hidden email]>
> > wrote:
> >
> >> Thanks again for the information.
> >>
> >> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> >> recently, I increased the RAM again to 5g for extra headroom since we had
> >> more headroom available; the problem hasn't happened since, but it hasn't
> >> been very long.
> >>
> >> We use a more customized image based on Alpine 3.8.2. The JDK and Maven
> >> are obtained via apk.
> >>
> >> I will try upgrading failsafe (and surefire while I'm at it) sooner, and
> >> probably do some experimentation with JVMs another time (not pressing for
> >> me ATM).
> >>
> >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[hidden email]>
> >> wrote:
> >>
> >>> >> I'll try to enable some logging about GC pauses to see what's up
> >>>
> >>> Pls do not keep such setting after tuning the GC because this may
> >>> sometime
> >>> break the interprocess communication between Maven process and surefire
> >>> process.
> >>> It's worth to list GC information in a file and not in the console logs.
> >>> This can be configured, I guess.
> >>>
> >>> >> Do you think the value is simply too low?
> >>>
> >>> GCing many objects may take some time and I remember we had a user who
> >>> had
> >>> this problem a year or two ago.
> >>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
> >>> NOP. So 30 seconds looked satisfactory.
> >>> I think you use old version 2.20 or something like that. The fixes for
> >>> docker have been done so far, so please use the latest version 3.0.0-M3.
> >>> See this page
> >>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
> >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?
> >>>
> >>> Cheers
> >>> Tibor
> >>>
> >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[hidden email]>
> >>> wrote:
> >>>
> >>> > Thanks for the information. It's good to see someone understands a
> >>> little
> >>> > about this.
> >>> >
> >>> > Incidentally, we have been looking at other GCs and VMs for the
> >>> application
> >>> > in production environments, so I'll look into how these affect tests as
> >>> > well. I'll try to enable some logging about GC pauses to see what's up.
> >>> >
> >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is
> >>> simply
> >>> > too low?
> >>> >
> >>> > FWIW we're running the Maven build in an Alpine-based Docker container.
> >>> >
> >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[hidden email]>
> >>> > wrote:
> >>> >
> >>> > > Hi Jason,
> >>> > >
> >>> > > We spoke about this issue on our chat in ASF Slack:
> >>> > > "I think his tests have been paused for a long GC periods and timed
> >>> out
> >>> > 3x
> >>> > > PING period = 30 seconds. After this period forked JVM supposed the
> >>> Maven
> >>> > > process was killed by JenkinsCI and therefore all surefire processes
> >>> are
> >>> > > killed as well and all the file handlers and memory consumptions are
> >>> > > freed."
> >>> > >
> >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> >>> > >
> >>> > >
> >>> >
> >>> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> >>> > > "
> >>> > >
> >>> > > You are using java-1.8-openjdk. I guess you should use Shenandoah GC
> >>> > which
> >>> > > is an experimental algorithm in JVM 1.8. This would significantly
> >>> short
> >>> > > the GC cycles.
> >>> > >
> >>> > > We should of cource provide a new configuration parameter to give
> >>> you a
> >>> > > chance to prolong the PING.
> >>> > >
> >>> > > Cheers
> >>> > > Tibor
> >>> > >
> >>> >
> >>> >
> >>> > --
> >>> >
> >>> > Jason Young
> >>> >
> >>>
> >>
> >>
>
> --
> Jason Young
> Software Engineer | PROCENTIVE
> [image: Phone] 715 245 8000 x7609
> [image: Mobile] 706 870 3540
> [image: Web] procentive.com
> Confidentiality Notice: This message is intended for the sole use of the
> individual and entity to which it is addressed, and may contain information
> that is privileged, confidential and exempt from disclosure under
> applicable law. Any unauthorized review, use, disclosure or distribution of
> this email message, including any attachment, is prohibited. If you are not
> the intended recipient, please advise the sender by reply email and destroy
> all copies of the original message.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
Mikael, sorry I do not appear to have permission to view the link.

I did some digging in the last couple of days. I see that the parent
process reads from stdin. I could not find anywhere that we are using
stdin. FWIW the failures nearly always happen at least 15m into a ~20m test
run, so perf is a likely culprit.

I see also that ForkedBooter reads commands from stdin in one thread, and
uses an executor service to check for a past ping in
ForkedBooter.listenToShutdownCommands(..). When it checks, it also sets
pingDone to false. The executor is configured to run up to 2 threads
concurrently to handle the workload, and is set to run at a fixed rate (not
a fixed delay). If the test suite is busy with testing and GC and has lots
of threads running, it's entirely possible that a thread won't have a
chance to run for a long time (e.g. 5s). Maybe instead of a 30s delay, the
VM gets around to checking for a ping every 35s over a long span of time.
Because we're running at a "fixed rate" and not a "fixed delay", then after
a couple of minutes we might be a full 30s behind schedule. It's possible
the executor will create another thread to run the scheduled task because
it's running behind schedule. This new thread checks for a ping, finds it,
and sets pingDone to false. But then the original thread also runs, say, 2
seconds afterwards, checks pingDone, and finds it is false.

So to mitigate the problem, can we a) make the executor run only 1 thread
and b) schedule the task at a fixed rate? For that matter, is there another
scheduled executor we can reuse? I understand why checking for ping
requires a separate executor. Should I ask in github?

Regarding a previous question, I found out that Alpine's Maven package
comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS -Xmx512m"`
which cannot be undone by setting `MAVEN_OPTS` at the command line; you end
up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven (parent)
process, not the surefire/failsafe (child) process.)

On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <[hidden email]>
wrote:

> I guess a timeout caused by FullGC can happen with TCP as well. Increasing
> the timeout might not be nice but does look like it would help in both
> cases. (Problems with stdout are more related to unexpected JVM messages I
> guess)
>
> Gruss
> Bernd
> --
> http://bernd.eckenfels.net
>
> ________________________________
> Von: Mikael Åsberg <[hidden email]>
> Gesendet: Mittwoch, März 20, 2019 9:40 AM
> An: Maven Users List
> Betreff: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
>
> These issues regarding communication with forked JVMs, won't they be
> resolved once surefire moves to interprocess communication using
> tcp/ip sockets? This happens to be the target feature to be included
> in the next surefire 3.0.0 milestone:
> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
>
> There are soooo many issues relating to surefire reading stdout of
> forked processes (which is my understanding that it is currently
> doing). Many of us are really looking forward to the next milestone.
>
> On Tue, Mar 19, 2019 at 8:59 PM Jason Young <[hidden email]>
> wrote:
> >
> > Getting back to my original questions, I know that "ping" means to see
> if a
> > process is there, and "NOOP" implies it's not a command to do anything.
> But
> > what do the terms "ping" and "NOOP" mean in this context, i.e. how do the
> > processes communicate? I assume they don't sonar. Do other processes also
> > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?
> >
> > I'm confused about what I should do regarding GC pauses. Previously I had
> > code that would write the amount of remaining heap space (or something
> like
> > that) to stdout after every test to troubleshoot OOMEs. Can writing to
> > stdout cause the communication failure somehow?
> >
> > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <[hidden email]>
> > wrote:
> >
> > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered
> > > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in
> case
> > > that helps.)
> > >
> > > BTW I also asked on StackOverflow previously, for anyone interested:
> > >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > >
> > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> [hidden email]>
> > > wrote:
> > >
> > >> Thanks again for the information.
> > >>
> > >> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> > >> recently, I increased the RAM again to 5g for extra headroom since we
> had
> > >> more headroom available; the problem hasn't happened since, but it
> hasn't
> > >> been very long.
> > >>
> > >> We use a more customized image based on Alpine 3.8.2. The JDK and
> Maven
> > >> are obtained via apk.
> > >>
> > >> I will try upgrading failsafe (and surefire while I'm at it) sooner,
> and
> > >> probably do some experimentation with JVMs another time (not pressing
> for
> > >> me ATM).
> > >>
> > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[hidden email]
> >
> > >> wrote:
> > >>
> > >>> >> I'll try to enable some logging about GC pauses to see what's up
> > >>>
> > >>> Pls do not keep such setting after tuning the GC because this may
> > >>> sometime
> > >>> break the interprocess communication between Maven process and
> surefire
> > >>> process.
> > >>> It's worth to list GC information in a file and not in the console
> logs.
> > >>> This can be configured, I guess.
> > >>>
> > >>> >> Do you think the value is simply too low?
> > >>>
> > >>> GCing many objects may take some time and I remember we had a user
> who
> > >>> had
> > >>> this problem a year or two ago.
> > >>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of
> every
> > >>> NOP. So 30 seconds looked satisfactory.
> > >>> I think you use old version 2.20 or something like that. The fixes
> for
> > >>> docker have been done so far, so please use the latest version
> 3.0.0-M3.
> > >>> See this page
> > >>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html,
> we
> > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you
> use?
> > >>>
> > >>> Cheers
> > >>> Tibor
> > >>>
> > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> [hidden email]>
> > >>> wrote:
> > >>>
> > >>> > Thanks for the information. It's good to see someone understands a
> > >>> little
> > >>> > about this.
> > >>> >
> > >>> > Incidentally, we have been looking at other GCs and VMs for the
> > >>> application
> > >>> > in production environments, so I'll look into how these affect
> tests as
> > >>> > well. I'll try to enable some logging about GC pauses to see
> what's up.
> > >>> >
> > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is
> > >>> simply
> > >>> > too low?
> > >>> >
> > >>> > FWIW we're running the Maven build in an Alpine-based Docker
> container.
> > >>> >
> > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> [hidden email]>
> > >>> > wrote:
> > >>> >
> > >>> > > Hi Jason,
> > >>> > >
> > >>> > > We spoke about this issue on our chat in ASF Slack:
> > >>> > > "I think his tests have been paused for a long GC periods and
> timed
> > >>> out
> > >>> > 3x
> > >>> > > PING period = 30 seconds. After this period forked JVM supposed
> the
> > >>> Maven
> > >>> > > process was killed by JenkinsCI and therefore all surefire
> processes
> > >>> are
> > >>> > > killed as well and all the file handlers and memory consumptions
> are
> > >>> > > freed."
> > >>> > >
> > >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> > >>> > >
> > >>> > >
> > >>> >
> > >>>
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > >>> > > "
> > >>> > >
> > >>> > > You are using java-1.8-openjdk. I guess you should use
> Shenandoah GC
> > >>> > which
> > >>> > > is an experimental algorithm in JVM 1.8. This would significantly
> > >>> short
> > >>> > > the GC cycles.
> > >>> > >
> > >>> > > We should of cource provide a new configuration parameter to give
> > >>> you a
> > >>> > > chance to prolong the PING.
> > >>> > >
> > >>> > > Cheers
> > >>> > > Tibor
> > >>> > >
> > >>> >
> > >>> >
> > >>> > --
> > >>> >
> > >>> > Jason Young
> > >>> >
> > >>>
> > >>
> > >>
> >
> > --
> > Jason Young
> > Software Engineer | PROCENTIVE
> > [image: Phone] 715 245 8000 x7609
> > [image: Mobile] 706 870 3540
> > [image: Web] procentive.com
> > Confidentiality Notice: This message is intended for the sole use of the
> > individual and entity to which it is addressed, and may contain
> information
> > that is privileged, confidential and exempt from disclosure under
> > applicable law. Any unauthorized review, use, disclosure or distribution
> of
> > this email message, including any attachment, is prohibited. If you are
> not
> > the intended recipient, please advise the sender by reply email and
> destroy
> > all copies of the original message.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
The base images are developed in https://github.com/carlossg/docker-maven,
right?
Who creates "/etc/mavenrc"?

On Thu, Mar 21, 2019 at 12:05 AM Jason Young <[hidden email]>
wrote:

> Mikael, sorry I do not appear to have permission to view the link.
>
> I did some digging in the last couple of days. I see that the parent
> process reads from stdin. I could not find anywhere that we are using
> stdin. FWIW the failures nearly always happen at least 15m into a ~20m test
> run, so perf is a likely culprit.
>
> I see also that ForkedBooter reads commands from stdin in one thread, and
> uses an executor service to check for a past ping in
> ForkedBooter.listenToShutdownCommands(..). When it checks, it also sets
> pingDone to false. The executor is configured to run up to 2 threads
> concurrently to handle the workload, and is set to run at a fixed rate (not
> a fixed delay). If the test suite is busy with testing and GC and has lots
> of threads running, it's entirely possible that a thread won't have a
> chance to run for a long time (e.g. 5s). Maybe instead of a 30s delay, the
> VM gets around to checking for a ping every 35s over a long span of time.
> Because we're running at a "fixed rate" and not a "fixed delay", then after
> a couple of minutes we might be a full 30s behind schedule. It's possible
> the executor will create another thread to run the scheduled task because
> it's running behind schedule. This new thread checks for a ping, finds it,
> and sets pingDone to false. But then the original thread also runs, say, 2
> seconds afterwards, checks pingDone, and finds it is false.
>
> So to mitigate the problem, can we a) make the executor run only 1 thread
> and b) schedule the task at a fixed rate? For that matter, is there another
> scheduled executor we can reuse? I understand why checking for ping
> requires a separate executor. Should I ask in github?
>
> Regarding a previous question, I found out that Alpine's Maven package
> comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS -Xmx512m"`
> which cannot be undone by setting `MAVEN_OPTS` at the command line; you end
> up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven (parent)
> process, not the surefire/failsafe (child) process.)
>
> On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <[hidden email]>
> wrote:
>
> > I guess a timeout caused by FullGC can happen with TCP as well.
> Increasing
> > the timeout might not be nice but does look like it would help in both
> > cases. (Problems with stdout are more related to unexpected JVM messages
> I
> > guess)
> >
> > Gruss
> > Bernd
> > --
> > http://bernd.eckenfels.net
> >
> > ________________________________
> > Von: Mikael Åsberg <[hidden email]>
> > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > An: Maven Users List
> > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
> >
> > These issues regarding communication with forked JVMs, won't they be
> > resolved once surefire moves to interprocess communication using
> > tcp/ip sockets? This happens to be the target feature to be included
> > in the next surefire 3.0.0 milestone:
> > https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> >
> > There are soooo many issues relating to surefire reading stdout of
> > forked processes (which is my understanding that it is currently
> > doing). Many of us are really looking forward to the next milestone.
> >
> > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <[hidden email]>
> > wrote:
> > >
> > > Getting back to my original questions, I know that "ping" means to see
> > if a
> > > process is there, and "NOOP" implies it's not a command to do anything.
> > But
> > > what do the terms "ping" and "NOOP" mean in this context, i.e. how do
> the
> > > processes communicate? I assume they don't sonar. Do other processes
> also
> > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?
> > >
> > > I'm confused about what I should do regarding GC pauses. Previously I
> had
> > > code that would write the amount of remaining heap space (or something
> > like
> > > that) to stdout after every test to troubleshoot OOMEs. Can writing to
> > > stdout cause the communication failure somehow?
> > >
> > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> [hidden email]>
> > > wrote:
> > >
> > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> encountered
> > > > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon
> in
> > case
> > > > that helps.)
> > > >
> > > > BTW I also asked on StackOverflow previously, for anyone interested:
> > > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > >
> > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > [hidden email]>
> > > > wrote:
> > > >
> > > >> Thanks again for the information.
> > > >>
> > > >> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> > > >> recently, I increased the RAM again to 5g for extra headroom since
> we
> > had
> > > >> more headroom available; the problem hasn't happened since, but it
> > hasn't
> > > >> been very long.
> > > >>
> > > >> We use a more customized image based on Alpine 3.8.2. The JDK and
> > Maven
> > > >> are obtained via apk.
> > > >>
> > > >> I will try upgrading failsafe (and surefire while I'm at it) sooner,
> > and
> > > >> probably do some experimentation with JVMs another time (not
> pressing
> > for
> > > >> me ATM).
> > > >>
> > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> [hidden email]
> > >
> > > >> wrote:
> > > >>
> > > >>> >> I'll try to enable some logging about GC pauses to see what's up
> > > >>>
> > > >>> Pls do not keep such setting after tuning the GC because this may
> > > >>> sometime
> > > >>> break the interprocess communication between Maven process and
> > surefire
> > > >>> process.
> > > >>> It's worth to list GC information in a file and not in the console
> > logs.
> > > >>> This can be configured, I guess.
> > > >>>
> > > >>> >> Do you think the value is simply too low?
> > > >>>
> > > >>> GCing many objects may take some time and I remember we had a user
> > who
> > > >>> had
> > > >>> this problem a year or two ago.
> > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of
> > every
> > > >>> NOP. So 30 seconds looked satisfactory.
> > > >>> I think you use old version 2.20 or something like that. The fixes
> > for
> > > >>> docker have been done so far, so please use the latest version
> > 3.0.0-M3.
> > > >>> See this page
> > > >>>
> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html,
> > we
> > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did
> you
> > use?
> > > >>>
> > > >>> Cheers
> > > >>> Tibor
> > > >>>
> > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > [hidden email]>
> > > >>> wrote:
> > > >>>
> > > >>> > Thanks for the information. It's good to see someone understands
> a
> > > >>> little
> > > >>> > about this.
> > > >>> >
> > > >>> > Incidentally, we have been looking at other GCs and VMs for the
> > > >>> application
> > > >>> > in production environments, so I'll look into how these affect
> > tests as
> > > >>> > well. I'll try to enable some logging about GC pauses to see
> > what's up.
> > > >>> >
> > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value
> is
> > > >>> simply
> > > >>> > too low?
> > > >>> >
> > > >>> > FWIW we're running the Maven build in an Alpine-based Docker
> > container.
> > > >>> >
> > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > [hidden email]>
> > > >>> > wrote:
> > > >>> >
> > > >>> > > Hi Jason,
> > > >>> > >
> > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > > >>> > > "I think his tests have been paused for a long GC periods and
> > timed
> > > >>> out
> > > >>> > 3x
> > > >>> > > PING period = 30 seconds. After this period forked JVM supposed
> > the
> > > >>> Maven
> > > >>> > > process was killed by JenkinsCI and therefore all surefire
> > processes
> > > >>> are
> > > >>> > > killed as well and all the file handlers and memory
> consumptions
> > are
> > > >>> > > freed."
> > > >>> > >
> > > >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> > > >>> > >
> > > >>> > >
> > > >>> >
> > > >>>
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > >>> > > "
> > > >>> > >
> > > >>> > > You are using java-1.8-openjdk. I guess you should use
> > Shenandoah GC
> > > >>> > which
> > > >>> > > is an experimental algorithm in JVM 1.8. This would
> significantly
> > > >>> short
> > > >>> > > the GC cycles.
> > > >>> > >
> > > >>> > > We should of cource provide a new configuration parameter to
> give
> > > >>> you a
> > > >>> > > chance to prolong the PING.
> > > >>> > >
> > > >>> > > Cheers
> > > >>> > > Tibor
> > > >>> > >
> > > >>> >
> > > >>> >
> > > >>> > --
> > > >>> >
> > > >>> > Jason Young
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > >
> > > --
> > > Jason Young
> > > Software Engineer | PROCENTIVE
> > > [image: Phone] 715 245 8000 x7609
> > > [image: Mobile] 706 870 3540
> > > [image: Web] procentive.com
> > > Confidentiality Notice: This message is intended for the sole use of
> the
> > > individual and entity to which it is addressed, and may contain
> > information
> > > that is privileged, confidential and exempt from disclosure under
> > > applicable law. Any unauthorized review, use, disclosure or
> distribution
> > of
> > > this email message, including any attachment, is prohibited. If you are
> > not
> > > the intended recipient, please advise the sender by reply email and
> > destroy
> > > all copies of the original message.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
To clarify, this is the image ours is based on:
https://hub.docker.com/_/alpine; we are not using a vanilla Maven image as
we have to add a few other items to this image as well. I don't know who
maintains that or adds /etc/mavenrc; I'll bring that up in another topic in
another forum. I was just answering a previous question, and also throwing
out info for anyone struggling with e.g. OOME in their Maven process (a lot
of people have that problem when running the Sonar scanner, for example). I
don't think we need to worry about why that file is added for this topic.

But while we're on the subject, is there an -Xmx value you would recommend
to ensure the parent process can send the NOOP to the surefire process?

I'm making a custom build of surefire-booter to workaround the initial
problem by commenting out the code to exit. I am also adding logging via
stdout so I can see if ForkedBooter is indeed running the ping job in small
increments of time. Is there a logging API which I can use instead?

On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <[hidden email]> wrote:

> The base images are developed in https://github.com/carlossg/docker-maven,
> right?
> Who creates "/etc/mavenrc"?
>
> On Thu, Mar 21, 2019 at 12:05 AM Jason Young <[hidden email]>
> wrote:
>
> > Mikael, sorry I do not appear to have permission to view the link.
> >
> > I did some digging in the last couple of days. I see that the parent
> > process reads from stdin. I could not find anywhere that we are using
> > stdin. FWIW the failures nearly always happen at least 15m into a ~20m
> test
> > run, so perf is a likely culprit.
> >
> > I see also that ForkedBooter reads commands from stdin in one thread, and
> > uses an executor service to check for a past ping in
> > ForkedBooter.listenToShutdownCommands(..). When it checks, it also sets
> > pingDone to false. The executor is configured to run up to 2 threads
> > concurrently to handle the workload, and is set to run at a fixed rate
> (not
> > a fixed delay). If the test suite is busy with testing and GC and has
> lots
> > of threads running, it's entirely possible that a thread won't have a
> > chance to run for a long time (e.g. 5s). Maybe instead of a 30s delay,
> the
> > VM gets around to checking for a ping every 35s over a long span of time.
> > Because we're running at a "fixed rate" and not a "fixed delay", then
> after
> > a couple of minutes we might be a full 30s behind schedule. It's possible
> > the executor will create another thread to run the scheduled task because
> > it's running behind schedule. This new thread checks for a ping, finds
> it,
> > and sets pingDone to false. But then the original thread also runs, say,
> 2
> > seconds afterwards, checks pingDone, and finds it is false.
> >
> > So to mitigate the problem, can we a) make the executor run only 1 thread
> > and b) schedule the task at a fixed rate? For that matter, is there
> another
> > scheduled executor we can reuse? I understand why checking for ping
> > requires a separate executor. Should I ask in github?
> >
> > Regarding a previous question, I found out that Alpine's Maven package
> > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS -Xmx512m"`
> > which cannot be undone by setting `MAVEN_OPTS` at the command line; you
> end
> > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven (parent)
> > process, not the surefire/failsafe (child) process.)
> >
> > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <[hidden email]>
> > wrote:
> >
> > > I guess a timeout caused by FullGC can happen with TCP as well.
> > Increasing
> > > the timeout might not be nice but does look like it would help in both
> > > cases. (Problems with stdout are more related to unexpected JVM
> messages
> > I
> > > guess)
> > >
> > > Gruss
> > > Bernd
> > > --
> > > http://bernd.eckenfels.net
> > >
> > > ________________________________
> > > Von: Mikael Åsberg <[hidden email]>
> > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > An: Maven Users List
> > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
> > >
> > > These issues regarding communication with forked JVMs, won't they be
> > > resolved once surefire moves to interprocess communication using
> > > tcp/ip sockets? This happens to be the target feature to be included
> > > in the next surefire 3.0.0 milestone:
> > > https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > >
> > > There are soooo many issues relating to surefire reading stdout of
> > > forked processes (which is my understanding that it is currently
> > > doing). Many of us are really looking forward to the next milestone.
> > >
> > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> [hidden email]>
> > > wrote:
> > > >
> > > > Getting back to my original questions, I know that "ping" means to
> see
> > > if a
> > > > process is there, and "NOOP" implies it's not a command to do
> anything.
> > > But
> > > > what do the terms "ping" and "NOOP" mean in this context, i.e. how do
> > the
> > > > processes communicate? I assume they don't sonar. Do other processes
> > also
> > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?
> > > >
> > > > I'm confused about what I should do regarding GC pauses. Previously I
> > had
> > > > code that would write the amount of remaining heap space (or
> something
> > > like
> > > > that) to stdout after every test to troubleshoot OOMEs. Can writing
> to
> > > > stdout cause the communication failure somehow?
> > > >
> > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > [hidden email]>
> > > > wrote:
> > > >
> > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> > encountered
> > > > > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon
> > in
> > > case
> > > > > that helps.)
> > > > >
> > > > > BTW I also asked on StackOverflow previously, for anyone
> interested:
> > > > >
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > > >
> > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > >> Thanks again for the information.
> > > > >>
> > > > >> We had increased the RAM to 3g some time ago to prevent OOMEs.
> More
> > > > >> recently, I increased the RAM again to 5g for extra headroom since
> > we
> > > had
> > > > >> more headroom available; the problem hasn't happened since, but it
> > > hasn't
> > > > >> been very long.
> > > > >>
> > > > >> We use a more customized image based on Alpine 3.8.2. The JDK and
> > > Maven
> > > > >> are obtained via apk.
> > > > >>
> > > > >> I will try upgrading failsafe (and surefire while I'm at it)
> sooner,
> > > and
> > > > >> probably do some experimentation with JVMs another time (not
> > pressing
> > > for
> > > > >> me ATM).
> > > > >>
> > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > [hidden email]
> > > >
> > > > >> wrote:
> > > > >>
> > > > >>> >> I'll try to enable some logging about GC pauses to see what's
> up
> > > > >>>
> > > > >>> Pls do not keep such setting after tuning the GC because this may
> > > > >>> sometime
> > > > >>> break the interprocess communication between Maven process and
> > > surefire
> > > > >>> process.
> > > > >>> It's worth to list GC information in a file and not in the
> console
> > > logs.
> > > > >>> This can be configured, I guess.
> > > > >>>
> > > > >>> >> Do you think the value is simply too low?
> > > > >>>
> > > > >>> GCing many objects may take some time and I remember we had a
> user
> > > who
> > > > >>> had
> > > > >>> this problem a year or two ago.
> > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix instead
> of
> > > every
> > > > >>> NOP. So 30 seconds looked satisfactory.
> > > > >>> I think you use old version 2.20 or something like that. The
> fixes
> > > for
> > > > >>> docker have been done so far, so please use the latest version
> > > 3.0.0-M3.
> > > > >>> See this page
> > > > >>>
> > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html,
> > > we
> > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did
> > you
> > > use?
> > > > >>>
> > > > >>> Cheers
> > > > >>> Tibor
> > > > >>>
> > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > [hidden email]>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > Thanks for the information. It's good to see someone
> understands
> > a
> > > > >>> little
> > > > >>> > about this.
> > > > >>> >
> > > > >>> > Incidentally, we have been looking at other GCs and VMs for the
> > > > >>> application
> > > > >>> > in production environments, so I'll look into how these affect
> > > tests as
> > > > >>> > well. I'll try to enable some logging about GC pauses to see
> > > what's up.
> > > > >>> >
> > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value
> > is
> > > > >>> simply
> > > > >>> > too low?
> > > > >>> >
> > > > >>> > FWIW we're running the Maven build in an Alpine-based Docker
> > > container.
> > > > >>> >
> > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > [hidden email]>
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > Hi Jason,
> > > > >>> > >
> > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > > > >>> > > "I think his tests have been paused for a long GC periods and
> > > timed
> > > > >>> out
> > > > >>> > 3x
> > > > >>> > > PING period = 30 seconds. After this period forked JVM
> supposed
> > > the
> > > > >>> Maven
> > > > >>> > > process was killed by JenkinsCI and therefore all surefire
> > > processes
> > > > >>> are
> > > > >>> > > killed as well and all the file handlers and memory
> > consumptions
> > > are
> > > > >>> > > freed."
> > > > >>> > >
> > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles,
> see
> > > > >>> > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > > >>> > > "
> > > > >>> > >
> > > > >>> > > You are using java-1.8-openjdk. I guess you should use
> > > Shenandoah GC
> > > > >>> > which
> > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> > significantly
> > > > >>> short
> > > > >>> > > the GC cycles.
> > > > >>> > >
> > > > >>> > > We should of cource provide a new configuration parameter to
> > give
> > > > >>> you a
> > > > >>> > > chance to prolong the PING.
> > > > >>> > >
> > > > >>> > > Cheers
> > > > >>> > > Tibor
> > > > >>> > >
> > > > >>> >
> > > > >>> >
> > > > >>> > --
> > > > >>> >
> > > > >>> > Jason Young
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > >
> > > > --
> > > > Jason Young
> > > > Software Engineer | PROCENTIVE
> > > > [image: Phone] 715 245 8000 x7609
> > > > [image: Mobile] 706 870 3540
> > > > [image: Web] procentive.com
> > > > Confidentiality Notice: This message is intended for the sole use of
> > the
> > > > individual and entity to which it is addressed, and may contain
> > > information
> > > > that is privileged, confidential and exempt from disclosure under
> > > > applicable law. Any unauthorized review, use, disclosure or
> > distribution
> > > of
> > > > this email message, including any attachment, is prohibited. If you
> are
> > > not
> > > > the intended recipient, please advise the sender by reply email and
> > > destroy
> > > > all copies of the original message.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
>> is there an -Xmx value you would recommend to ensure the parent process
can send the NOOP to the surefire process?

no such value is generic value to recommend. Each application or test must
find our the value to prevent from OOM.


>> I'm making a custom build of surefire-booter to workaround the initial
problem by commenting out the code to exit.

Very good step forward. Then post your findings and we can continue on
making a pull request on GitHub.


>> Is there a logging API which I can use instead?

Yes, it is very low level of logging (sending events to parent process) but
this will be turned to high level abstraction in several days (branch
surefire-1222_2).
Yet, pls use this function:

encodeAndWriteToOutput()


>>





On Fri, Mar 22, 2019 at 4:44 PM Jason Young <[hidden email]>
wrote:

> To clarify, this is the image ours is based on:
> https://hub.docker.com/_/alpine; we are not using a vanilla Maven image as
> we have to add a few other items to this image as well. I don't know who
> maintains that or adds /etc/mavenrc; I'll bring that up in another topic in
> another forum. I was just answering a previous question, and also throwing
> out info for anyone struggling with e.g. OOME in their Maven process (a lot
> of people have that problem when running the Sonar scanner, for example). I
> don't think we need to worry about why that file is added for this topic.
>
> But while we're on the subject, is there an -Xmx value you would recommend
> to ensure the parent process can send the NOOP to the surefire process?
>
> I'm making a custom build of surefire-booter to workaround the initial
> problem by commenting out the code to exit. I am also adding logging via
> stdout so I can see if ForkedBooter is indeed running the ping job in small
> increments of time. Is there a logging API which I can use instead?
>
> On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <[hidden email]>
> wrote:
>
> > The base images are developed in
> https://github.com/carlossg/docker-maven,
> > right?
> > Who creates "/etc/mavenrc"?
> >
> > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <[hidden email]
> >
> > wrote:
> >
> > > Mikael, sorry I do not appear to have permission to view the link.
> > >
> > > I did some digging in the last couple of days. I see that the parent
> > > process reads from stdin. I could not find anywhere that we are using
> > > stdin. FWIW the failures nearly always happen at least 15m into a ~20m
> > test
> > > run, so perf is a likely culprit.
> > >
> > > I see also that ForkedBooter reads commands from stdin in one thread,
> and
> > > uses an executor service to check for a past ping in
> > > ForkedBooter.listenToShutdownCommands(..). When it checks, it also sets
> > > pingDone to false. The executor is configured to run up to 2 threads
> > > concurrently to handle the workload, and is set to run at a fixed rate
> > (not
> > > a fixed delay). If the test suite is busy with testing and GC and has
> > lots
> > > of threads running, it's entirely possible that a thread won't have a
> > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s delay,
> > the
> > > VM gets around to checking for a ping every 35s over a long span of
> time.
> > > Because we're running at a "fixed rate" and not a "fixed delay", then
> > after
> > > a couple of minutes we might be a full 30s behind schedule. It's
> possible
> > > the executor will create another thread to run the scheduled task
> because
> > > it's running behind schedule. This new thread checks for a ping, finds
> > it,
> > > and sets pingDone to false. But then the original thread also runs,
> say,
> > 2
> > > seconds afterwards, checks pingDone, and finds it is false.
> > >
> > > So to mitigate the problem, can we a) make the executor run only 1
> thread
> > > and b) schedule the task at a fixed rate? For that matter, is there
> > another
> > > scheduled executor we can reuse? I understand why checking for ping
> > > requires a separate executor. Should I ask in github?
> > >
> > > Regarding a previous question, I found out that Alpine's Maven package
> > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> -Xmx512m"`
> > > which cannot be undone by setting `MAVEN_OPTS` at the command line; you
> > end
> > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> (parent)
> > > process, not the surefire/failsafe (child) process.)
> > >
> > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> [hidden email]>
> > > wrote:
> > >
> > > > I guess a timeout caused by FullGC can happen with TCP as well.
> > > Increasing
> > > > the timeout might not be nice but does look like it would help in
> both
> > > > cases. (Problems with stdout are more related to unexpected JVM
> > messages
> > > I
> > > > guess)
> > > >
> > > > Gruss
> > > > Bernd
> > > > --
> > > > http://bernd.eckenfels.net
> > > >
> > > > ________________________________
> > > > Von: Mikael Åsberg <[hidden email]>
> > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > > An: Maven Users List
> > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
> > > >
> > > > These issues regarding communication with forked JVMs, won't they be
> > > > resolved once surefire moves to interprocess communication using
> > > > tcp/ip sockets? This happens to be the target feature to be included
> > > > in the next surefire 3.0.0 milestone:
> > > > https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > > >
> > > > There are soooo many issues relating to surefire reading stdout of
> > > > forked processes (which is my understanding that it is currently
> > > > doing). Many of us are really looking forward to the next milestone.
> > > >
> > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > [hidden email]>
> > > > wrote:
> > > > >
> > > > > Getting back to my original questions, I know that "ping" means to
> > see
> > > > if a
> > > > > process is there, and "NOOP" implies it's not a command to do
> > anything.
> > > > But
> > > > > what do the terms "ping" and "NOOP" mean in this context, i.e. how
> do
> > > the
> > > > > processes communicate? I assume they don't sonar. Do other
> processes
> > > also
> > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with
> TCP?
> > > > >
> > > > > I'm confused about what I should do regarding GC pauses.
> Previously I
> > > had
> > > > > code that would write the amount of remaining heap space (or
> > something
> > > > like
> > > > > that) to stdout after every test to troubleshoot OOMEs. Can writing
> > to
> > > > > stdout cause the communication failure somehow?
> > > > >
> > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> > > encountered
> > > > > > the same exception. (Still using -Xmx5g, will switch to OpenJ9
> soon
> > > in
> > > > case
> > > > > > that helps.)
> > > > > >
> > > > > > BTW I also asked on StackOverflow previously, for anyone
> > interested:
> > > > > >
> > > >
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > > > >
> > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > >> Thanks again for the information.
> > > > > >>
> > > > > >> We had increased the RAM to 3g some time ago to prevent OOMEs.
> > More
> > > > > >> recently, I increased the RAM again to 5g for extra headroom
> since
> > > we
> > > > had
> > > > > >> more headroom available; the problem hasn't happened since, but
> it
> > > > hasn't
> > > > > >> been very long.
> > > > > >>
> > > > > >> We use a more customized image based on Alpine 3.8.2. The JDK
> and
> > > > Maven
> > > > > >> are obtained via apk.
> > > > > >>
> > > > > >> I will try upgrading failsafe (and surefire while I'm at it)
> > sooner,
> > > > and
> > > > > >> probably do some experimentation with JVMs another time (not
> > > pressing
> > > > for
> > > > > >> me ATM).
> > > > > >>
> > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > > [hidden email]
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>> >> I'll try to enable some logging about GC pauses to see
> what's
> > up
> > > > > >>>
> > > > > >>> Pls do not keep such setting after tuning the GC because this
> may
> > > > > >>> sometime
> > > > > >>> break the interprocess communication between Maven process and
> > > > surefire
> > > > > >>> process.
> > > > > >>> It's worth to list GC information in a file and not in the
> > console
> > > > logs.
> > > > > >>> This can be configured, I guess.
> > > > > >>>
> > > > > >>> >> Do you think the value is simply too low?
> > > > > >>>
> > > > > >>> GCing many objects may take some time and I remember we had a
> > user
> > > > who
> > > > > >>> had
> > > > > >>> this problem a year or two ago.
> > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> instead
> > of
> > > > every
> > > > > >>> NOP. So 30 seconds looked satisfactory.
> > > > > >>> I think you use old version 2.20 or something like that. The
> > fixes
> > > > for
> > > > > >>> docker have been done so far, so please use the latest version
> > > > 3.0.0-M3.
> > > > > >>> See this page
> > > > > >>>
> > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html,
> > > > we
> > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image
> did
> > > you
> > > > use?
> > > > > >>>
> > > > > >>> Cheers
> > > > > >>> Tibor
> > > > > >>>
> > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > > [hidden email]>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>> > Thanks for the information. It's good to see someone
> > understands
> > > a
> > > > > >>> little
> > > > > >>> > about this.
> > > > > >>> >
> > > > > >>> > Incidentally, we have been looking at other GCs and VMs for
> the
> > > > > >>> application
> > > > > >>> > in production environments, so I'll look into how these
> affect
> > > > tests as
> > > > > >>> > well. I'll try to enable some logging about GC pauses to see
> > > > what's up.
> > > > > >>> >
> > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the
> value
> > > is
> > > > > >>> simply
> > > > > >>> > too low?
> > > > > >>> >
> > > > > >>> > FWIW we're running the Maven build in an Alpine-based Docker
> > > > container.
> > > > > >>> >
> > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > > [hidden email]>
> > > > > >>> > wrote:
> > > > > >>> >
> > > > > >>> > > Hi Jason,
> > > > > >>> > >
> > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > > > > >>> > > "I think his tests have been paused for a long GC periods
> and
> > > > timed
> > > > > >>> out
> > > > > >>> > 3x
> > > > > >>> > > PING period = 30 seconds. After this period forked JVM
> > supposed
> > > > the
> > > > > >>> Maven
> > > > > >>> > > process was killed by JenkinsCI and therefore all surefire
> > > > processes
> > > > > >>> are
> > > > > >>> > > killed as well and all the file handlers and memory
> > > consumptions
> > > > are
> > > > > >>> > > freed."
> > > > > >>> > >
> > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles,
> > see
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > >
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > > > >>> > > "
> > > > > >>> > >
> > > > > >>> > > You are using java-1.8-openjdk. I guess you should use
> > > > Shenandoah GC
> > > > > >>> > which
> > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> > > significantly
> > > > > >>> short
> > > > > >>> > > the GC cycles.
> > > > > >>> > >
> > > > > >>> > > We should of cource provide a new configuration parameter
> to
> > > give
> > > > > >>> you a
> > > > > >>> > > chance to prolong the PING.
> > > > > >>> > >
> > > > > >>> > > Cheers
> > > > > >>> > > Tibor
> > > > > >>> > >
> > > > > >>> >
> > > > > >>> >
> > > > > >>> > --
> > > > > >>> >
> > > > > >>> > Jason Young
> > > > > >>> >
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > >
> > > > > --
> > > > > Jason Young
> > > > > Software Engineer | PROCENTIVE
> > > > > [image: Phone] 715 245 8000 x7609
> > > > > [image: Mobile] 706 870 3540
> > > > > [image: Web] procentive.com
> > > > > Confidentiality Notice: This message is intended for the sole use
> of
> > > the
> > > > > individual and entity to which it is addressed, and may contain
> > > > information
> > > > > that is privileged, confidential and exempt from disclosure under
> > > > > applicable law. Any unauthorized review, use, disclosure or
> > > distribution
> > > > of
> > > > > this email message, including any attachment, is prohibited. If you
> > are
> > > > not
> > > > > the intended recipient, please advise the sender by reply email and
> > > > destroy
> > > > > all copies of the original message.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [hidden email]
> > > > For additional commands, e-mail: [hidden email]
> > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I hacked
together some temporary changes to ForkedBooter so that it "logs" (just
`System.out.println`) the current time every time:

   - the command reader thread reads a NOOP via stdin,
   - it starts the ping job,
   - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`), and
   - it would have exited because there is no NOOP (`!pingDone`).

I also commented out the code to exit in that case so it keeps running and
logging.

The resulting output starts out showing that it's receiving ping every 10s
and checking for NOOP every 30s, as expected. But then one large test
begins, and:

DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
Maven process.
WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
Maven process.
WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
Maven process.
WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
Maven process.
WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
Maven process.
WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
checking if we read a ping earlier).
DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
Maven process.
WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
Maven process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
process.
...

With these changes, a project which normally takes 15 to 20 minutes to run
actually took 80m. Would printing directly to stdout from within a test do
that somehow? If not, we have a problem in our code.

My comments on the above report to follow...

On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <[hidden email]>
wrote:

> >> is there an -Xmx value you would recommend to ensure the parent process
> can send the NOOP to the surefire process?
>
> no such value is generic value to recommend. Each application or test must
> find our the value to prevent from OOM.
>
>
> >> I'm making a custom build of surefire-booter to workaround the initial
> problem by commenting out the code to exit.
>
> Very good step forward. Then post your findings and we can continue on
> making a pull request on GitHub.
>
>
> >> Is there a logging API which I can use instead?
>
> Yes, it is very low level of logging (sending events to parent process) but
> this will be turned to high level abstraction in several days (branch
> surefire-1222_2).
> Yet, pls use this function:
>
> encodeAndWriteToOutput()
>
>
> >>
>
>
>
>
>
> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <[hidden email]>
> wrote:
>
> > To clarify, this is the image ours is based on:
> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven image
> as
> > we have to add a few other items to this image as well. I don't know who
> > maintains that or adds /etc/mavenrc; I'll bring that up in another topic
> in
> > another forum. I was just answering a previous question, and also
> throwing
> > out info for anyone struggling with e.g. OOME in their Maven process (a
> lot
> > of people have that problem when running the Sonar scanner, for
> example). I
> > don't think we need to worry about why that file is added for this topic.
> >
> > But while we're on the subject, is there an -Xmx value you would
> recommend
> > to ensure the parent process can send the NOOP to the surefire process?
> >
> > I'm making a custom build of surefire-booter to workaround the initial
> > problem by commenting out the code to exit. I am also adding logging via
> > stdout so I can see if ForkedBooter is indeed running the ping job in
> small
> > increments of time. Is there a logging API which I can use instead?
> >
> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <[hidden email]>
> > wrote:
> >
> > > The base images are developed in
> > https://github.com/carlossg/docker-maven,
> > > right?
> > > Who creates "/etc/mavenrc"?
> > >
> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> [hidden email]
> > >
> > > wrote:
> > >
> > > > Mikael, sorry I do not appear to have permission to view the link.
> > > >
> > > > I did some digging in the last couple of days. I see that the parent
> > > > process reads from stdin. I could not find anywhere that we are using
> > > > stdin. FWIW the failures nearly always happen at least 15m into a
> ~20m
> > > test
> > > > run, so perf is a likely culprit.
> > > >
> > > > I see also that ForkedBooter reads commands from stdin in one thread,
> > and
> > > > uses an executor service to check for a past ping in
> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it also
> sets
> > > > pingDone to false. The executor is configured to run up to 2 threads
> > > > concurrently to handle the workload, and is set to run at a fixed
> rate
> > > (not
> > > > a fixed delay). If the test suite is busy with testing and GC and has
> > > lots
> > > > of threads running, it's entirely possible that a thread won't have a
> > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
> delay,
> > > the
> > > > VM gets around to checking for a ping every 35s over a long span of
> > time.
> > > > Because we're running at a "fixed rate" and not a "fixed delay", then
> > > after
> > > > a couple of minutes we might be a full 30s behind schedule. It's
> > possible
> > > > the executor will create another thread to run the scheduled task
> > because
> > > > it's running behind schedule. This new thread checks for a ping,
> finds
> > > it,
> > > > and sets pingDone to false. But then the original thread also runs,
> > say,
> > > 2
> > > > seconds afterwards, checks pingDone, and finds it is false.
> > > >
> > > > So to mitigate the problem, can we a) make the executor run only 1
> > thread
> > > > and b) schedule the task at a fixed rate? For that matter, is there
> > > another
> > > > scheduled executor we can reuse? I understand why checking for ping
> > > > requires a separate executor. Should I ask in github?
> > > >
> > > > Regarding a previous question, I found out that Alpine's Maven
> package
> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > -Xmx512m"`
> > > > which cannot be undone by setting `MAVEN_OPTS` at the command line;
> you
> > > end
> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> > (parent)
> > > > process, not the surefire/failsafe (child) process.)
> > > >
> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > [hidden email]>
> > > > wrote:
> > > >
> > > > > I guess a timeout caused by FullGC can happen with TCP as well.
> > > > Increasing
> > > > > the timeout might not be nice but does look like it would help in
> > both
> > > > > cases. (Problems with stdout are more related to unexpected JVM
> > > messages
> > > > I
> > > > > guess)
> > > > >
> > > > > Gruss
> > > > > Bernd
> > > > > --
> > > > > http://bernd.eckenfels.net
> > > > >
> > > > > ________________________________
> > > > > Von: Mikael Åsberg <[hidden email]>
> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > > > An: Maven Users List
> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
> > > > >
> > > > > These issues regarding communication with forked JVMs, won't they
> be
> > > > > resolved once surefire moves to interprocess communication using
> > > > > tcp/ip sockets? This happens to be the target feature to be
> included
> > > > > in the next surefire 3.0.0 milestone:
> > > > > https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > > > >
> > > > > There are soooo many issues relating to surefire reading stdout of
> > > > > forked processes (which is my understanding that it is currently
> > > > > doing). Many of us are really looking forward to the next
> milestone.
> > > > >
> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > > [hidden email]>
> > > > > wrote:
> > > > > >
> > > > > > Getting back to my original questions, I know that "ping" means
> to
> > > see
> > > > > if a
> > > > > > process is there, and "NOOP" implies it's not a command to do
> > > anything.
> > > > > But
> > > > > > what do the terms "ping" and "NOOP" mean in this context, i.e.
> how
> > do
> > > > the
> > > > > > processes communicate? I assume they don't sonar. Do other
> > processes
> > > > also
> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with
> > TCP?
> > > > > >
> > > > > > I'm confused about what I should do regarding GC pauses.
> > Previously I
> > > > had
> > > > > > code that would write the amount of remaining heap space (or
> > > something
> > > > > like
> > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
> writing
> > > to
> > > > > > stdout cause the communication failure somehow?
> > > > > >
> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> > > > encountered
> > > > > > > the same exception. (Still using -Xmx5g, will switch to OpenJ9
> > soon
> > > > in
> > > > > case
> > > > > > > that helps.)
> > > > > > >
> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > > interested:
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > > > > >
> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks again for the information.
> > > > > > >>
> > > > > > >> We had increased the RAM to 3g some time ago to prevent OOMEs.
> > > More
> > > > > > >> recently, I increased the RAM again to 5g for extra headroom
> > since
> > > > we
> > > > > had
> > > > > > >> more headroom available; the problem hasn't happened since,
> but
> > it
> > > > > hasn't
> > > > > > >> been very long.
> > > > > > >>
> > > > > > >> We use a more customized image based on Alpine 3.8.2. The JDK
> > and
> > > > > Maven
> > > > > > >> are obtained via apk.
> > > > > > >>
> > > > > > >> I will try upgrading failsafe (and surefire while I'm at it)
> > > sooner,
> > > > > and
> > > > > > >> probably do some experimentation with JVMs another time (not
> > > > pressing
> > > > > for
> > > > > > >> me ATM).
> > > > > > >>
> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > > > [hidden email]
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> >> I'll try to enable some logging about GC pauses to see
> > what's
> > > up
> > > > > > >>>
> > > > > > >>> Pls do not keep such setting after tuning the GC because this
> > may
> > > > > > >>> sometime
> > > > > > >>> break the interprocess communication between Maven process
> and
> > > > > surefire
> > > > > > >>> process.
> > > > > > >>> It's worth to list GC information in a file and not in the
> > > console
> > > > > logs.
> > > > > > >>> This can be configured, I guess.
> > > > > > >>>
> > > > > > >>> >> Do you think the value is simply too low?
> > > > > > >>>
> > > > > > >>> GCing many objects may take some time and I remember we had a
> > > user
> > > > > who
> > > > > > >>> had
> > > > > > >>> this problem a year or two ago.
> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> > instead
> > > of
> > > > > every
> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > > > > > >>> I think you use old version 2.20 or something like that. The
> > > fixes
> > > > > for
> > > > > > >>> docker have been done so far, so please use the latest
> version
> > > > > 3.0.0-M3.
> > > > > > >>> See this page
> > > > > > >>>
> > > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html,
> > > > > we
> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image
> > did
> > > > you
> > > > > use?
> > > > > > >>>
> > > > > > >>> Cheers
> > > > > > >>> Tibor
> > > > > > >>>
> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > > > [hidden email]>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> > Thanks for the information. It's good to see someone
> > > understands
> > > > a
> > > > > > >>> little
> > > > > > >>> > about this.
> > > > > > >>> >
> > > > > > >>> > Incidentally, we have been looking at other GCs and VMs for
> > the
> > > > > > >>> application
> > > > > > >>> > in production environments, so I'll look into how these
> > affect
> > > > > tests as
> > > > > > >>> > well. I'll try to enable some logging about GC pauses to
> see
> > > > > what's up.
> > > > > > >>> >
> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the
> > value
> > > > is
> > > > > > >>> simply
> > > > > > >>> > too low?
> > > > > > >>> >
> > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
> Docker
> > > > > container.
> > > > > > >>> >
> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > > > [hidden email]>
> > > > > > >>> > wrote:
> > > > > > >>> >
> > > > > > >>> > > Hi Jason,
> > > > > > >>> > >
> > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > > > > > >>> > > "I think his tests have been paused for a long GC periods
> > and
> > > > > timed
> > > > > > >>> out
> > > > > > >>> > 3x
> > > > > > >>> > > PING period = 30 seconds. After this period forked JVM
> > > supposed
> > > > > the
> > > > > > >>> Maven
> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> surefire
> > > > > processes
> > > > > > >>> are
> > > > > > >>> > > killed as well and all the file handlers and memory
> > > > consumptions
> > > > > are
> > > > > > >>> > > freed."
> > > > > > >>> > >
> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> cycles,
> > > see
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>>
> > > > >
> > > >
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > > > > >>> > > "
> > > > > > >>> > >
> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should use
> > > > > Shenandoah GC
> > > > > > >>> > which
> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> > > > significantly
> > > > > > >>> short
> > > > > > >>> > > the GC cycles.
> > > > > > >>> > >
> > > > > > >>> > > We should of cource provide a new configuration parameter
> > to
> > > > give
> > > > > > >>> you a
> > > > > > >>> > > chance to prolong the PING.
> > > > > > >>> > >
> > > > > > >>> > > Cheers
> > > > > > >>> > > Tibor
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > --
> > > > > > >>> >
> > > > > > >>> > Jason Young
> > > > > > >>> >
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > > --
> > > > > > Jason Young
> > > > > > Software Engineer | PROCENTIVE
> > > > > > [image: Phone] 715 245 8000 x7609
> > > > > > [image: Mobile] 706 870 3540
> > > > > > [image: Web] procentive.com
> > > > > > Confidentiality Notice: This message is intended for the sole use
> > of
> > > > the
> > > > > > individual and entity to which it is addressed, and may contain
> > > > > information
> > > > > > that is privileged, confidential and exempt from disclosure under
> > > > > > applicable law. Any unauthorized review, use, disclosure or
> > > > distribution
> > > > > of
> > > > > > this email message, including any attachment, is prohibited. If
> you
> > > are
> > > > > not
> > > > > > the intended recipient, please advise the sender by reply email
> and
> > > > > destroy
> > > > > > all copies of the original message.
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [hidden email]
> > > > > For additional commands, e-mail: [hidden email]
> > > > >
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
Notice:
* an over 3-minute gap between the first two entries above--neither the
command reader thread nor the pingJob had a chance to run.
* the command reader thread and the ping job executor evidently get a
chance to run from 19:18:02 to 19:19:19. They ran with increased frequency
during that time to "catch up": The command reader thread is programmed to
read from stdin until there's nothing left to read, and the pingScheduler
runs the pingJob at a "fixed" rate.
* Another 3-minute gap and another attempt to catch up at 19:22:28. This
time though, the command reader thread evidently did not get a turn in time.

I did not bother to log what threads the ping job was running under; I can
do that later. I also did not add any logging on the Maven process side; I
can do that as well.

This is on a VM with unthrottled access to 4 processor cores and 8GB RAM. I
think the code under test is CPU- and RAM-intensive. There probably were GC
pauses, but I don't have proper logging for that in place.

Note that this is all very normal behavior: Both the command reader and the
ping job's thread (run by the pingScheduler) are instructed to run until
their work is done, not to sleep or yield. For high throughput, an OS will
favor continuing to run a thread for a while once it is started over
switching tasks, so it's normal both that each thread would run for a while
after a pause, and (by extension) that each would have long pauses as other
threads had to run, too. Because of the long gaps between runs of the
command reader thread, my previous advice (schedule the pingJob on a "fixed
delay" instead of a "fixed rate") would not have saved the build in this
case (though "fixed delay" is more correct and would save the build in
other possible cases). Our code base has some unit tests that rely on
"normal case" timing, but they now have the @Ignore attribute because they
would rarely, but too often, fail. For most teams and projects, unit tests
have to be more reliable than that.

So I think we need a solution that does not rely on preemptive scheduling
from the OS or on thread scheduling, but which ensures that commnads have
been read before running code depending on commands. There are multiple
ways to do that; maybe I can find time to write a solution later.


On Fri, Mar 22, 2019 at 3:46 PM Jason Young <[hidden email]>
wrote:

> Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I hacked
> together some temporary changes to ForkedBooter so that it "logs" (just
> `System.out.println`) the current time every time:
>
>    - the command reader thread reads a NOOP via stdin,
>    - it starts the ping job,
>    - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`),
>    and
>    - it would have exited because there is no NOOP (`!pingDone`).
>
> I also commented out the code to exit in that case so it keeps running and
> logging.
>
> The resulting output starts out showing that it's receiving ping every 10s
> and checking for NOOP every 30s, as expected. But then one large test
> begins, and:
>
> DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> ...
>
> With these changes, a project which normally takes 15 to 20 minutes to run
> actually took 80m. Would printing directly to stdout from within a test do
> that somehow? If not, we have a problem in our code.
>
> My comments on the above report to follow...
>
> On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <[hidden email]>
> wrote:
>
>> >> is there an -Xmx value you would recommend to ensure the parent process
>> can send the NOOP to the surefire process?
>>
>> no such value is generic value to recommend. Each application or test must
>> find our the value to prevent from OOM.
>>
>>
>> >> I'm making a custom build of surefire-booter to workaround the initial
>> problem by commenting out the code to exit.
>>
>> Very good step forward. Then post your findings and we can continue on
>> making a pull request on GitHub.
>>
>>
>> >> Is there a logging API which I can use instead?
>>
>> Yes, it is very low level of logging (sending events to parent process)
>> but
>> this will be turned to high level abstraction in several days (branch
>> surefire-1222_2).
>> Yet, pls use this function:
>>
>> encodeAndWriteToOutput()
>>
>>
>> >>
>>
>>
>>
>>
>>
>> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <[hidden email]>
>> wrote:
>>
>> > To clarify, this is the image ours is based on:
>> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
>> image as
>> > we have to add a few other items to this image as well. I don't know who
>> > maintains that or adds /etc/mavenrc; I'll bring that up in another
>> topic in
>> > another forum. I was just answering a previous question, and also
>> throwing
>> > out info for anyone struggling with e.g. OOME in their Maven process (a
>> lot
>> > of people have that problem when running the Sonar scanner, for
>> example). I
>> > don't think we need to worry about why that file is added for this
>> topic.
>> >
>> > But while we're on the subject, is there an -Xmx value you would
>> recommend
>> > to ensure the parent process can send the NOOP to the surefire process?
>> >
>> > I'm making a custom build of surefire-booter to workaround the initial
>> > problem by commenting out the code to exit. I am also adding logging via
>> > stdout so I can see if ForkedBooter is indeed running the ping job in
>> small
>> > increments of time. Is there a logging API which I can use instead?
>> >
>> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <[hidden email]>
>> > wrote:
>> >
>> > > The base images are developed in
>> > https://github.com/carlossg/docker-maven,
>> > > right?
>> > > Who creates "/etc/mavenrc"?
>> > >
>> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
>> [hidden email]
>> > >
>> > > wrote:
>> > >
>> > > > Mikael, sorry I do not appear to have permission to view the link.
>> > > >
>> > > > I did some digging in the last couple of days. I see that the parent
>> > > > process reads from stdin. I could not find anywhere that we are
>> using
>> > > > stdin. FWIW the failures nearly always happen at least 15m into a
>> ~20m
>> > > test
>> > > > run, so perf is a likely culprit.
>> > > >
>> > > > I see also that ForkedBooter reads commands from stdin in one
>> thread,
>> > and
>> > > > uses an executor service to check for a past ping in
>> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it also
>> sets
>> > > > pingDone to false. The executor is configured to run up to 2 threads
>> > > > concurrently to handle the workload, and is set to run at a fixed
>> rate
>> > > (not
>> > > > a fixed delay). If the test suite is busy with testing and GC and
>> has
>> > > lots
>> > > > of threads running, it's entirely possible that a thread won't have
>> a
>> > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
>> delay,
>> > > the
>> > > > VM gets around to checking for a ping every 35s over a long span of
>> > time.
>> > > > Because we're running at a "fixed rate" and not a "fixed delay",
>> then
>> > > after
>> > > > a couple of minutes we might be a full 30s behind schedule. It's
>> > possible
>> > > > the executor will create another thread to run the scheduled task
>> > because
>> > > > it's running behind schedule. This new thread checks for a ping,
>> finds
>> > > it,
>> > > > and sets pingDone to false. But then the original thread also runs,
>> > say,
>> > > 2
>> > > > seconds afterwards, checks pingDone, and finds it is false.
>> > > >
>> > > > So to mitigate the problem, can we a) make the executor run only 1
>> > thread
>> > > > and b) schedule the task at a fixed rate? For that matter, is there
>> > > another
>> > > > scheduled executor we can reuse? I understand why checking for ping
>> > > > requires a separate executor. Should I ask in github?
>> > > >
>> > > > Regarding a previous question, I found out that Alpine's Maven
>> package
>> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
>> > -Xmx512m"`
>> > > > which cannot be undone by setting `MAVEN_OPTS` at the command line;
>> you
>> > > end
>> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
>> > (parent)
>> > > > process, not the surefire/failsafe (child) process.)
>> > > >
>> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
>> > [hidden email]>
>> > > > wrote:
>> > > >
>> > > > > I guess a timeout caused by FullGC can happen with TCP as well.
>> > > > Increasing
>> > > > > the timeout might not be nice but does look like it would help in
>> > both
>> > > > > cases. (Problems with stdout are more related to unexpected JVM
>> > > messages
>> > > > I
>> > > > > guess)
>> > > > >
>> > > > > Gruss
>> > > > > Bernd
>> > > > > --
>> > > > > http://bernd.eckenfels.net
>> > > > >
>> > > > > ________________________________
>> > > > > Von: Mikael Åsberg <[hidden email]>
>> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
>> > > > > An: Maven Users List
>> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
>> elapsed.
>> > > > >
>> > > > > These issues regarding communication with forked JVMs, won't they
>> be
>> > > > > resolved once surefire moves to interprocess communication using
>> > > > > tcp/ip sockets? This happens to be the target feature to be
>> included
>> > > > > in the next surefire 3.0.0 milestone:
>> > > > >
>> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
>> > > > >
>> > > > > There are soooo many issues relating to surefire reading stdout of
>> > > > > forked processes (which is my understanding that it is currently
>> > > > > doing). Many of us are really looking forward to the next
>> milestone.
>> > > > >
>> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
>> > > [hidden email]>
>> > > > > wrote:
>> > > > > >
>> > > > > > Getting back to my original questions, I know that "ping" means
>> to
>> > > see
>> > > > > if a
>> > > > > > process is there, and "NOOP" implies it's not a command to do
>> > > anything.
>> > > > > But
>> > > > > > what do the terms "ping" and "NOOP" mean in this context, i.e.
>> how
>> > do
>> > > > the
>> > > > > > processes communicate? I assume they don't sonar. Do other
>> > processes
>> > > > also
>> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with
>> > TCP?
>> > > > > >
>> > > > > > I'm confused about what I should do regarding GC pauses.
>> > Previously I
>> > > > had
>> > > > > > code that would write the amount of remaining heap space (or
>> > > something
>> > > > > like
>> > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
>> writing
>> > > to
>> > > > > > stdout cause the communication failure somehow?
>> > > > > >
>> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
>> > > > [hidden email]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
>> > > > encountered
>> > > > > > > the same exception. (Still using -Xmx5g, will switch to OpenJ9
>> > soon
>> > > > in
>> > > > > case
>> > > > > > > that helps.)
>> > > > > > >
>> > > > > > > BTW I also asked on StackOverflow previously, for anyone
>> > > interested:
>> > > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
>> > > > > > >
>> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
>> > > > > [hidden email]>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > >> Thanks again for the information.
>> > > > > > >>
>> > > > > > >> We had increased the RAM to 3g some time ago to prevent
>> OOMEs.
>> > > More
>> > > > > > >> recently, I increased the RAM again to 5g for extra headroom
>> > since
>> > > > we
>> > > > > had
>> > > > > > >> more headroom available; the problem hasn't happened since,
>> but
>> > it
>> > > > > hasn't
>> > > > > > >> been very long.
>> > > > > > >>
>> > > > > > >> We use a more customized image based on Alpine 3.8.2. The JDK
>> > and
>> > > > > Maven
>> > > > > > >> are obtained via apk.
>> > > > > > >>
>> > > > > > >> I will try upgrading failsafe (and surefire while I'm at it)
>> > > sooner,
>> > > > > and
>> > > > > > >> probably do some experimentation with JVMs another time (not
>> > > > pressing
>> > > > > for
>> > > > > > >> me ATM).
>> > > > > > >>
>> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
>> > > > [hidden email]
>> > > > > >
>> > > > > > >> wrote:
>> > > > > > >>
>> > > > > > >>> >> I'll try to enable some logging about GC pauses to see
>> > what's
>> > > up
>> > > > > > >>>
>> > > > > > >>> Pls do not keep such setting after tuning the GC because
>> this
>> > may
>> > > > > > >>> sometime
>> > > > > > >>> break the interprocess communication between Maven process
>> and
>> > > > > surefire
>> > > > > > >>> process.
>> > > > > > >>> It's worth to list GC information in a file and not in the
>> > > console
>> > > > > logs.
>> > > > > > >>> This can be configured, I guess.
>> > > > > > >>>
>> > > > > > >>> >> Do you think the value is simply too low?
>> > > > > > >>>
>> > > > > > >>> GCing many objects may take some time and I remember we had
>> a
>> > > user
>> > > > > who
>> > > > > > >>> had
>> > > > > > >>> this problem a year or two ago.
>> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
>> > instead
>> > > of
>> > > > > every
>> > > > > > >>> NOP. So 30 seconds looked satisfactory.
>> > > > > > >>> I think you use old version 2.20 or something like that. The
>> > > fixes
>> > > > > for
>> > > > > > >>> docker have been done so far, so please use the latest
>> version
>> > > > > 3.0.0-M3.
>> > > > > > >>> See this page
>> > > > > > >>>
>> > > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
>> ,
>> > > > > we
>> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image
>> > did
>> > > > you
>> > > > > use?
>> > > > > > >>>
>> > > > > > >>> Cheers
>> > > > > > >>> Tibor
>> > > > > > >>>
>> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
>> > > > > [hidden email]>
>> > > > > > >>> wrote:
>> > > > > > >>>
>> > > > > > >>> > Thanks for the information. It's good to see someone
>> > > understands
>> > > > a
>> > > > > > >>> little
>> > > > > > >>> > about this.
>> > > > > > >>> >
>> > > > > > >>> > Incidentally, we have been looking at other GCs and VMs
>> for
>> > the
>> > > > > > >>> application
>> > > > > > >>> > in production environments, so I'll look into how these
>> > affect
>> > > > > tests as
>> > > > > > >>> > well. I'll try to enable some logging about GC pauses to
>> see
>> > > > > what's up.
>> > > > > > >>> >
>> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the
>> > value
>> > > > is
>> > > > > > >>> simply
>> > > > > > >>> > too low?
>> > > > > > >>> >
>> > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
>> Docker
>> > > > > container.
>> > > > > > >>> >
>> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
>> > > > > [hidden email]>
>> > > > > > >>> > wrote:
>> > > > > > >>> >
>> > > > > > >>> > > Hi Jason,
>> > > > > > >>> > >
>> > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
>> > > > > > >>> > > "I think his tests have been paused for a long GC
>> periods
>> > and
>> > > > > timed
>> > > > > > >>> out
>> > > > > > >>> > 3x
>> > > > > > >>> > > PING period = 30 seconds. After this period forked JVM
>> > > supposed
>> > > > > the
>> > > > > > >>> Maven
>> > > > > > >>> > > process was killed by JenkinsCI and therefore all
>> surefire
>> > > > > processes
>> > > > > > >>> are
>> > > > > > >>> > > killed as well and all the file handlers and memory
>> > > > consumptions
>> > > > > are
>> > > > > > >>> > > freed."
>> > > > > > >>> > >
>> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
>> cycles,
>> > > see
>> > > > > > >>> > >
>> > > > > > >>> > >
>> > > > > > >>> >
>> > > > > > >>>
>> > > > >
>> > > >
>> > >
>> >
>> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
>> > > > > > >>> > > "
>> > > > > > >>> > >
>> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should use
>> > > > > Shenandoah GC
>> > > > > > >>> > which
>> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
>> > > > significantly
>> > > > > > >>> short
>> > > > > > >>> > > the GC cycles.
>> > > > > > >>> > >
>> > > > > > >>> > > We should of cource provide a new configuration
>> parameter
>> > to
>> > > > give
>> > > > > > >>> you a
>> > > > > > >>> > > chance to prolong the PING.
>> > > > > > >>> > >
>> > > > > > >>> > > Cheers
>> > > > > > >>> > > Tibor
>> > > > > > >>> > >
>> > > > > > >>> >
>> > > > > > >>> >
>> > > > > > >>> > --
>> > > > > > >>> >
>> > > > > > >>> > Jason Young
>> > > > > > >>> >
>> > > > > > >>>
>> > > > > > >>
>> > > > > > >>
>> > > > > >
>> > > > > > --
>> > > > > > Jason Young
>> > > > > > Software Engineer | PROCENTIVE
>> > > > > > [image: Phone] 715 245 8000 x7609
>> > > > > > [image: Mobile] 706 870 3540
>> > > > > > [image: Web] procentive.com
>> > > > > > Confidentiality Notice: This message is intended for the sole
>> use
>> > of
>> > > > the
>> > > > > > individual and entity to which it is addressed, and may contain
>> > > > > information
>> > > > > > that is privileged, confidential and exempt from disclosure
>> under
>> > > > > > applicable law. Any unauthorized review, use, disclosure or
>> > > > distribution
>> > > > > of
>> > > > > > this email message, including any attachment, is prohibited. If
>> you
>> > > are
>> > > > > not
>> > > > > > the intended recipient, please advise the sender by reply email
>> and
>> > > > > destroy
>> > > > > > all copies of the original message.
>> > > > >
>> > > > >
>> ---------------------------------------------------------------------
>> > > > > To unsubscribe, e-mail: [hidden email]
>> > > > > For additional commands, e-mail: [hidden email]
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
In reply to this post by Jason Young
Hi Jason,

I will get back to your last two emails, but currently your results remind
me to mention the following:

There are actually two mechanisms checking Maven process (parent process).
The new one is faster but with native support. And the older is pure Java
using ping events.

The ping is not used if  PpidChecker#canUse() returns true, and the PPID is
checked every second whether Maven process is alive.
The reason why it returned false is that "/bin/ps -o etime -p <ppid>"
failed? (see the dump file in directory  /target/surefire-reports, what
logged in there?)
Maybe caught an exception in [1] due to "ps" is not installed. If this is
happens then the ForkedBooter falls back to the old mechanism listening to
the PING events.

The method PpidChecker#canUse() is called in ForkedBooter in the method:

processCheckerJob( final PingScheduler pingMechanism )


If the faster mechanism in PpidChecker objectively fails due to the OS or
system, we have to concentrate on the older mechanism using the PING as you
have described in the previous email.
If there's no such OS nor system issue and "ps" works in native console in
your Docker container, then we have to concentrate on PpidChecker as well.

[1]:
https://github.com/apache/maven-surefire/blob/master/surefire-booter/src/main/java/org/apache/maven/surefire/booter/PpidChecker.java#L363





On Fri, Mar 22, 2019 at 9:47 PM Jason Young <[hidden email]>
wrote:

> Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I hacked
> together some temporary changes to ForkedBooter so that it "logs" (just
> `System.out.println`) the current time every time:
>
>    - the command reader thread reads a NOOP via stdin,
>    - it starts the ping job,
>    - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`), and
>    - it would have exited because there is no NOOP (`!pingDone`).
>
> I also commented out the code to exit in that case so it keeps running and
> logging.
>
> The resulting output starts out showing that it's receiving ping every 10s
> and checking for NOOP every 30s, as expected. But then one large test
> begins, and:
>
> DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> checking if we read a ping earlier).
> DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> Maven process.
> WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> Maven process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> process.
> ...
>
> With these changes, a project which normally takes 15 to 20 minutes to run
> actually took 80m. Would printing directly to stdout from within a test do
> that somehow? If not, we have a problem in our code.
>
> My comments on the above report to follow...
>
> On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <[hidden email]>
> wrote:
>
> > >> is there an -Xmx value you would recommend to ensure the parent
> process
> > can send the NOOP to the surefire process?
> >
> > no such value is generic value to recommend. Each application or test
> must
> > find our the value to prevent from OOM.
> >
> >
> > >> I'm making a custom build of surefire-booter to workaround the initial
> > problem by commenting out the code to exit.
> >
> > Very good step forward. Then post your findings and we can continue on
> > making a pull request on GitHub.
> >
> >
> > >> Is there a logging API which I can use instead?
> >
> > Yes, it is very low level of logging (sending events to parent process)
> but
> > this will be turned to high level abstraction in several days (branch
> > surefire-1222_2).
> > Yet, pls use this function:
> >
> > encodeAndWriteToOutput()
> >
> >
> > >>
> >
> >
> >
> >
> >
> > On Fri, Mar 22, 2019 at 4:44 PM Jason Young <[hidden email]>
> > wrote:
> >
> > > To clarify, this is the image ours is based on:
> > > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> image
> > as
> > > we have to add a few other items to this image as well. I don't know
> who
> > > maintains that or adds /etc/mavenrc; I'll bring that up in another
> topic
> > in
> > > another forum. I was just answering a previous question, and also
> > throwing
> > > out info for anyone struggling with e.g. OOME in their Maven process (a
> > lot
> > > of people have that problem when running the Sonar scanner, for
> > example). I
> > > don't think we need to worry about why that file is added for this
> topic.
> > >
> > > But while we're on the subject, is there an -Xmx value you would
> > recommend
> > > to ensure the parent process can send the NOOP to the surefire process?
> > >
> > > I'm making a custom build of surefire-booter to workaround the initial
> > > problem by commenting out the code to exit. I am also adding logging
> via
> > > stdout so I can see if ForkedBooter is indeed running the ping job in
> > small
> > > increments of time. Is there a logging API which I can use instead?
> > >
> > > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <[hidden email]>
> > > wrote:
> > >
> > > > The base images are developed in
> > > https://github.com/carlossg/docker-maven,
> > > > right?
> > > > Who creates "/etc/mavenrc"?
> > > >
> > > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> > [hidden email]
> > > >
> > > > wrote:
> > > >
> > > > > Mikael, sorry I do not appear to have permission to view the link.
> > > > >
> > > > > I did some digging in the last couple of days. I see that the
> parent
> > > > > process reads from stdin. I could not find anywhere that we are
> using
> > > > > stdin. FWIW the failures nearly always happen at least 15m into a
> > ~20m
> > > > test
> > > > > run, so perf is a likely culprit.
> > > > >
> > > > > I see also that ForkedBooter reads commands from stdin in one
> thread,
> > > and
> > > > > uses an executor service to check for a past ping in
> > > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it also
> > sets
> > > > > pingDone to false. The executor is configured to run up to 2
> threads
> > > > > concurrently to handle the workload, and is set to run at a fixed
> > rate
> > > > (not
> > > > > a fixed delay). If the test suite is busy with testing and GC and
> has
> > > > lots
> > > > > of threads running, it's entirely possible that a thread won't
> have a
> > > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
> > delay,
> > > > the
> > > > > VM gets around to checking for a ping every 35s over a long span of
> > > time.
> > > > > Because we're running at a "fixed rate" and not a "fixed delay",
> then
> > > > after
> > > > > a couple of minutes we might be a full 30s behind schedule. It's
> > > possible
> > > > > the executor will create another thread to run the scheduled task
> > > because
> > > > > it's running behind schedule. This new thread checks for a ping,
> > finds
> > > > it,
> > > > > and sets pingDone to false. But then the original thread also runs,
> > > say,
> > > > 2
> > > > > seconds afterwards, checks pingDone, and finds it is false.
> > > > >
> > > > > So to mitigate the problem, can we a) make the executor run only 1
> > > thread
> > > > > and b) schedule the task at a fixed rate? For that matter, is there
> > > > another
> > > > > scheduled executor we can reuse? I understand why checking for ping
> > > > > requires a separate executor. Should I ask in github?
> > > > >
> > > > > Regarding a previous question, I found out that Alpine's Maven
> > package
> > > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > > -Xmx512m"`
> > > > > which cannot be undone by setting `MAVEN_OPTS` at the command line;
> > you
> > > > end
> > > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> > > (parent)
> > > > > process, not the surefire/failsafe (child) process.)
> > > > >
> > > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > I guess a timeout caused by FullGC can happen with TCP as well.
> > > > > Increasing
> > > > > > the timeout might not be nice but does look like it would help in
> > > both
> > > > > > cases. (Problems with stdout are more related to unexpected JVM
> > > > messages
> > > > > I
> > > > > > guess)
> > > > > >
> > > > > > Gruss
> > > > > > Bernd
> > > > > > --
> > > > > > http://bernd.eckenfels.net
> > > > > >
> > > > > > ________________________________
> > > > > > Von: Mikael Åsberg <[hidden email]>
> > > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > > > > An: Maven Users List
> > > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> elapsed.
> > > > > >
> > > > > > These issues regarding communication with forked JVMs, won't they
> > be
> > > > > > resolved once surefire moves to interprocess communication using
> > > > > > tcp/ip sockets? This happens to be the target feature to be
> > included
> > > > > > in the next surefire 3.0.0 milestone:
> > > > > >
> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > > > > >
> > > > > > There are soooo many issues relating to surefire reading stdout
> of
> > > > > > forked processes (which is my understanding that it is currently
> > > > > > doing). Many of us are really looking forward to the next
> > milestone.
> > > > > >
> > > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > > > [hidden email]>
> > > > > > wrote:
> > > > > > >
> > > > > > > Getting back to my original questions, I know that "ping" means
> > to
> > > > see
> > > > > > if a
> > > > > > > process is there, and "NOOP" implies it's not a command to do
> > > > anything.
> > > > > > But
> > > > > > > what do the terms "ping" and "NOOP" mean in this context, i.e.
> > how
> > > do
> > > > > the
> > > > > > > processes communicate? I assume they don't sonar. Do other
> > > processes
> > > > > also
> > > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with
> > > TCP?
> > > > > > >
> > > > > > > I'm confused about what I should do regarding GC pauses.
> > > Previously I
> > > > > had
> > > > > > > code that would write the amount of remaining heap space (or
> > > > something
> > > > > > like
> > > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
> > writing
> > > > to
> > > > > > > stdout cause the communication failure somehow?
> > > > > > >
> > > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > > > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> > > > > encountered
> > > > > > > > the same exception. (Still using -Xmx5g, will switch to
> OpenJ9
> > > soon
> > > > > in
> > > > > > case
> > > > > > > > that helps.)
> > > > > > > >
> > > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > > > interested:
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > > > > > >
> > > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks again for the information.
> > > > > > > >>
> > > > > > > >> We had increased the RAM to 3g some time ago to prevent
> OOMEs.
> > > > More
> > > > > > > >> recently, I increased the RAM again to 5g for extra headroom
> > > since
> > > > > we
> > > > > > had
> > > > > > > >> more headroom available; the problem hasn't happened since,
> > but
> > > it
> > > > > > hasn't
> > > > > > > >> been very long.
> > > > > > > >>
> > > > > > > >> We use a more customized image based on Alpine 3.8.2. The
> JDK
> > > and
> > > > > > Maven
> > > > > > > >> are obtained via apk.
> > > > > > > >>
> > > > > > > >> I will try upgrading failsafe (and surefire while I'm at it)
> > > > sooner,
> > > > > > and
> > > > > > > >> probably do some experimentation with JVMs another time (not
> > > > > pressing
> > > > > > for
> > > > > > > >> me ATM).
> > > > > > > >>
> > > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > > > > [hidden email]
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >>> >> I'll try to enable some logging about GC pauses to see
> > > what's
> > > > up
> > > > > > > >>>
> > > > > > > >>> Pls do not keep such setting after tuning the GC because
> this
> > > may
> > > > > > > >>> sometime
> > > > > > > >>> break the interprocess communication between Maven process
> > and
> > > > > > surefire
> > > > > > > >>> process.
> > > > > > > >>> It's worth to list GC information in a file and not in the
> > > > console
> > > > > > logs.
> > > > > > > >>> This can be configured, I guess.
> > > > > > > >>>
> > > > > > > >>> >> Do you think the value is simply too low?
> > > > > > > >>>
> > > > > > > >>> GCing many objects may take some time and I remember we
> had a
> > > > user
> > > > > > who
> > > > > > > >>> had
> > > > > > > >>> this problem a year or two ago.
> > > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> > > instead
> > > > of
> > > > > > every
> > > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > > > > > > >>> I think you use old version 2.20 or something like that.
> The
> > > > fixes
> > > > > > for
> > > > > > > >>> docker have been done so far, so please use the latest
> > version
> > > > > > 3.0.0-M3.
> > > > > > > >>> See this page
> > > > > > > >>>
> > > > >
> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html,
> > > > > > we
> > > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> image
> > > did
> > > > > you
> > > > > > use?
> > > > > > > >>>
> > > > > > > >>> Cheers
> > > > > > > >>> Tibor
> > > > > > > >>>
> > > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > > > > [hidden email]>
> > > > > > > >>> wrote:
> > > > > > > >>>
> > > > > > > >>> > Thanks for the information. It's good to see someone
> > > > understands
> > > > > a
> > > > > > > >>> little
> > > > > > > >>> > about this.
> > > > > > > >>> >
> > > > > > > >>> > Incidentally, we have been looking at other GCs and VMs
> for
> > > the
> > > > > > > >>> application
> > > > > > > >>> > in production environments, so I'll look into how these
> > > affect
> > > > > > tests as
> > > > > > > >>> > well. I'll try to enable some logging about GC pauses to
> > see
> > > > > > what's up.
> > > > > > > >>> >
> > > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the
> > > value
> > > > > is
> > > > > > > >>> simply
> > > > > > > >>> > too low?
> > > > > > > >>> >
> > > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
> > Docker
> > > > > > container.
> > > > > > > >>> >
> > > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > > > > [hidden email]>
> > > > > > > >>> > wrote:
> > > > > > > >>> >
> > > > > > > >>> > > Hi Jason,
> > > > > > > >>> > >
> > > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > > > > > > >>> > > "I think his tests have been paused for a long GC
> periods
> > > and
> > > > > > timed
> > > > > > > >>> out
> > > > > > > >>> > 3x
> > > > > > > >>> > > PING period = 30 seconds. After this period forked JVM
> > > > supposed
> > > > > > the
> > > > > > > >>> Maven
> > > > > > > >>> > > process was killed by JenkinsCI and therefore all
> > surefire
> > > > > > processes
> > > > > > > >>> are
> > > > > > > >>> > > killed as well and all the file handlers and memory
> > > > > consumptions
> > > > > > are
> > > > > > > >>> > > freed."
> > > > > > > >>> > >
> > > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> > cycles,
> > > > see
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > > > > > >>> > > "
> > > > > > > >>> > >
> > > > > > > >>> > > You are using java-1.8-openjdk. I guess you should use
> > > > > > Shenandoah GC
> > > > > > > >>> > which
> > > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> > > > > significantly
> > > > > > > >>> short
> > > > > > > >>> > > the GC cycles.
> > > > > > > >>> > >
> > > > > > > >>> > > We should of cource provide a new configuration
> parameter
> > > to
> > > > > give
> > > > > > > >>> you a
> > > > > > > >>> > > chance to prolong the PING.
> > > > > > > >>> > >
> > > > > > > >>> > > Cheers
> > > > > > > >>> > > Tibor
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>> >
> > > > > > > >>> > --
> > > > > > > >>> >
> > > > > > > >>> > Jason Young
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > Jason Young
> > > > > > > Software Engineer | PROCENTIVE
> > > > > > > [image: Phone] 715 245 8000 x7609
> > > > > > > [image: Mobile] 706 870 3540
> > > > > > > [image: Web] procentive.com
> > > > > > > Confidentiality Notice: This message is intended for the sole
> use
> > > of
> > > > > the
> > > > > > > individual and entity to which it is addressed, and may contain
> > > > > > information
> > > > > > > that is privileged, confidential and exempt from disclosure
> under
> > > > > > > applicable law. Any unauthorized review, use, disclosure or
> > > > > distribution
> > > > > > of
> > > > > > > this email message, including any attachment, is prohibited. If
> > you
> > > > are
> > > > > > not
> > > > > > > the intended recipient, please advise the sender by reply email
> > and
> > > > > > destroy
> > > > > > > all copies of the original message.
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [hidden email]
> > > > > > For additional commands, e-mail: [hidden email]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
In reply to this post by Jason Young
Hi Jason,

I am sorry I haven't reply.
Excellent analysis!

Can you make an experiment and use pingJob on a "fixed delay" instead of a
"fixed rate"?
If you want to make more, feel free.

We should mabe think about how we would add more intelligence in these
timers in order to detect the "blackout".

Cheers
Tibor




On Fri, Mar 22, 2019 at 9:53 PM Jason Young <[hidden email]>
wrote:

> Notice:
> * an over 3-minute gap between the first two entries above--neither the
> command reader thread nor the pingJob had a chance to run.
> * the command reader thread and the ping job executor evidently get a
> chance to run from 19:18:02 to 19:19:19. They ran with increased frequency
> during that time to "catch up": The command reader thread is programmed to
> read from stdin until there's nothing left to read, and the pingScheduler
> runs the pingJob at a "fixed" rate.
> * Another 3-minute gap and another attempt to catch up at 19:22:28. This
> time though, the command reader thread evidently did not get a turn in
> time.
>
> I did not bother to log what threads the ping job was running under; I can
> do that later. I also did not add any logging on the Maven process side; I
> can do that as well.
>
> This is on a VM with unthrottled access to 4 processor cores and 8GB RAM. I
> think the code under test is CPU- and RAM-intensive. There probably were GC
> pauses, but I don't have proper logging for that in place.
>
> Note that this is all very normal behavior: Both the command reader and the
> ping job's thread (run by the pingScheduler) are instructed to run until
> their work is done, not to sleep or yield. For high throughput, an OS will
> favor continuing to run a thread for a while once it is started over
> switching tasks, so it's normal both that each thread would run for a while
> after a pause, and (by extension) that each would have long pauses as other
> threads had to run, too. Because of the long gaps between runs of the
> command reader thread, my previous advice (schedule the pingJob on a "fixed
> delay" instead of a "fixed rate") would not have saved the build in this
> case (though "fixed delay" is more correct and would save the build in
> other possible cases). Our code base has some unit tests that rely on
> "normal case" timing, but they now have the @Ignore attribute because they
> would rarely, but too often, fail. For most teams and projects, unit tests
> have to be more reliable than that.
>
> So I think we need a solution that does not rely on preemptive scheduling
> from the OS or on thread scheduling, but which ensures that commnads have
> been read before running code depending on commands. There are multiple
> ways to do that; maybe I can find time to write a solution later.
>
>
> On Fri, Mar 22, 2019 at 3:46 PM Jason Young <[hidden email]>
> wrote:
>
> > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I hacked
> > together some temporary changes to ForkedBooter so that it "logs" (just
> > `System.out.println`) the current time every time:
> >
> >    - the command reader thread reads a NOOP via stdin,
> >    - it starts the ping job,
> >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`),
> >    and
> >    - it would have exited because there is no NOOP (`!pingDone`).
> >
> > I also commented out the code to exit in that case so it keeps running
> and
> > logging.
> >
> > The resulting output starts out showing that it's receiving ping every
> 10s
> > and checking for NOOP every 30s, as expected. But then one large test
> > begins, and:
> >
> > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > ...
> >
> > With these changes, a project which normally takes 15 to 20 minutes to
> run
> > actually took 80m. Would printing directly to stdout from within a test
> do
> > that somehow? If not, we have a problem in our code.
> >
> > My comments on the above report to follow...
> >
> > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <[hidden email]>
> > wrote:
> >
> >> >> is there an -Xmx value you would recommend to ensure the parent
> process
> >> can send the NOOP to the surefire process?
> >>
> >> no such value is generic value to recommend. Each application or test
> must
> >> find our the value to prevent from OOM.
> >>
> >>
> >> >> I'm making a custom build of surefire-booter to workaround the
> initial
> >> problem by commenting out the code to exit.
> >>
> >> Very good step forward. Then post your findings and we can continue on
> >> making a pull request on GitHub.
> >>
> >>
> >> >> Is there a logging API which I can use instead?
> >>
> >> Yes, it is very low level of logging (sending events to parent process)
> >> but
> >> this will be turned to high level abstraction in several days (branch
> >> surefire-1222_2).
> >> Yet, pls use this function:
> >>
> >> encodeAndWriteToOutput()
> >>
> >>
> >> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <[hidden email]
> >
> >> wrote:
> >>
> >> > To clarify, this is the image ours is based on:
> >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> >> image as
> >> > we have to add a few other items to this image as well. I don't know
> who
> >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> >> topic in
> >> > another forum. I was just answering a previous question, and also
> >> throwing
> >> > out info for anyone struggling with e.g. OOME in their Maven process
> (a
> >> lot
> >> > of people have that problem when running the Sonar scanner, for
> >> example). I
> >> > don't think we need to worry about why that file is added for this
> >> topic.
> >> >
> >> > But while we're on the subject, is there an -Xmx value you would
> >> recommend
> >> > to ensure the parent process can send the NOOP to the surefire
> process?
> >> >
> >> > I'm making a custom build of surefire-booter to workaround the initial
> >> > problem by commenting out the code to exit. I am also adding logging
> via
> >> > stdout so I can see if ForkedBooter is indeed running the ping job in
> >> small
> >> > increments of time. Is there a logging API which I can use instead?
> >> >
> >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <[hidden email]>
> >> > wrote:
> >> >
> >> > > The base images are developed in
> >> > https://github.com/carlossg/docker-maven,
> >> > > right?
> >> > > Who creates "/etc/mavenrc"?
> >> > >
> >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> >> [hidden email]
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Mikael, sorry I do not appear to have permission to view the link.
> >> > > >
> >> > > > I did some digging in the last couple of days. I see that the
> parent
> >> > > > process reads from stdin. I could not find anywhere that we are
> >> using
> >> > > > stdin. FWIW the failures nearly always happen at least 15m into a
> >> ~20m
> >> > > test
> >> > > > run, so perf is a likely culprit.
> >> > > >
> >> > > > I see also that ForkedBooter reads commands from stdin in one
> >> thread,
> >> > and
> >> > > > uses an executor service to check for a past ping in
> >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it also
> >> sets
> >> > > > pingDone to false. The executor is configured to run up to 2
> threads
> >> > > > concurrently to handle the workload, and is set to run at a fixed
> >> rate
> >> > > (not
> >> > > > a fixed delay). If the test suite is busy with testing and GC and
> >> has
> >> > > lots
> >> > > > of threads running, it's entirely possible that a thread won't
> have
> >> a
> >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
> >> delay,
> >> > > the
> >> > > > VM gets around to checking for a ping every 35s over a long span
> of
> >> > time.
> >> > > > Because we're running at a "fixed rate" and not a "fixed delay",
> >> then
> >> > > after
> >> > > > a couple of minutes we might be a full 30s behind schedule. It's
> >> > possible
> >> > > > the executor will create another thread to run the scheduled task
> >> > because
> >> > > > it's running behind schedule. This new thread checks for a ping,
> >> finds
> >> > > it,
> >> > > > and sets pingDone to false. But then the original thread also
> runs,
> >> > say,
> >> > > 2
> >> > > > seconds afterwards, checks pingDone, and finds it is false.
> >> > > >
> >> > > > So to mitigate the problem, can we a) make the executor run only 1
> >> > thread
> >> > > > and b) schedule the task at a fixed rate? For that matter, is
> there
> >> > > another
> >> > > > scheduled executor we can reuse? I understand why checking for
> ping
> >> > > > requires a separate executor. Should I ask in github?
> >> > > >
> >> > > > Regarding a previous question, I found out that Alpine's Maven
> >> package
> >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> >> > -Xmx512m"`
> >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> line;
> >> you
> >> > > end
> >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> >> > (parent)
> >> > > > process, not the surefire/failsafe (child) process.)
> >> > > >
> >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> >> > [hidden email]>
> >> > > > wrote:
> >> > > >
> >> > > > > I guess a timeout caused by FullGC can happen with TCP as well.
> >> > > > Increasing
> >> > > > > the timeout might not be nice but does look like it would help
> in
> >> > both
> >> > > > > cases. (Problems with stdout are more related to unexpected JVM
> >> > > messages
> >> > > > I
> >> > > > > guess)
> >> > > > >
> >> > > > > Gruss
> >> > > > > Bernd
> >> > > > > --
> >> > > > > http://bernd.eckenfels.net
> >> > > > >
> >> > > > > ________________________________
> >> > > > > Von: Mikael Åsberg <[hidden email]>
> >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> >> > > > > An: Maven Users List
> >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> >> elapsed.
> >> > > > >
> >> > > > > These issues regarding communication with forked JVMs, won't
> they
> >> be
> >> > > > > resolved once surefire moves to interprocess communication using
> >> > > > > tcp/ip sockets? This happens to be the target feature to be
> >> included
> >> > > > > in the next surefire 3.0.0 milestone:
> >> > > > >
> >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> >> > > > >
> >> > > > > There are soooo many issues relating to surefire reading stdout
> of
> >> > > > > forked processes (which is my understanding that it is currently
> >> > > > > doing). Many of us are really looking forward to the next
> >> milestone.
> >> > > > >
> >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> >> > > [hidden email]>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > Getting back to my original questions, I know that "ping"
> means
> >> to
> >> > > see
> >> > > > > if a
> >> > > > > > process is there, and "NOOP" implies it's not a command to do
> >> > > anything.
> >> > > > > But
> >> > > > > > what do the terms "ping" and "NOOP" mean in this context, i.e.
> >> how
> >> > do
> >> > > > the
> >> > > > > > processes communicate? I assume they don't sonar. Do other
> >> > processes
> >> > > > also
> >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> with
> >> > TCP?
> >> > > > > >
> >> > > > > > I'm confused about what I should do regarding GC pauses.
> >> > Previously I
> >> > > > had
> >> > > > > > code that would write the amount of remaining heap space (or
> >> > > something
> >> > > > > like
> >> > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
> >> writing
> >> > > to
> >> > > > > > stdout cause the communication failure somehow?
> >> > > > > >
> >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> >> > > > [hidden email]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> >> > > > encountered
> >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> OpenJ9
> >> > soon
> >> > > > in
> >> > > > > case
> >> > > > > > > that helps.)
> >> > > > > > >
> >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> >> > > interested:
> >> > > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> >> > > > > > >
> >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> >> > > > > [hidden email]>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > >> Thanks again for the information.
> >> > > > > > >>
> >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> >> OOMEs.
> >> > > More
> >> > > > > > >> recently, I increased the RAM again to 5g for extra
> headroom
> >> > since
> >> > > > we
> >> > > > > had
> >> > > > > > >> more headroom available; the problem hasn't happened since,
> >> but
> >> > it
> >> > > > > hasn't
> >> > > > > > >> been very long.
> >> > > > > > >>
> >> > > > > > >> We use a more customized image based on Alpine 3.8.2. The
> JDK
> >> > and
> >> > > > > Maven
> >> > > > > > >> are obtained via apk.
> >> > > > > > >>
> >> > > > > > >> I will try upgrading failsafe (and surefire while I'm at
> it)
> >> > > sooner,
> >> > > > > and
> >> > > > > > >> probably do some experimentation with JVMs another time
> (not
> >> > > > pressing
> >> > > > > for
> >> > > > > > >> me ATM).
> >> > > > > > >>
> >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> >> > > > [hidden email]
> >> > > > > >
> >> > > > > > >> wrote:
> >> > > > > > >>
> >> > > > > > >>> >> I'll try to enable some logging about GC pauses to see
> >> > what's
> >> > > up
> >> > > > > > >>>
> >> > > > > > >>> Pls do not keep such setting after tuning the GC because
> >> this
> >> > may
> >> > > > > > >>> sometime
> >> > > > > > >>> break the interprocess communication between Maven process
> >> and
> >> > > > > surefire
> >> > > > > > >>> process.
> >> > > > > > >>> It's worth to list GC information in a file and not in the
> >> > > console
> >> > > > > logs.
> >> > > > > > >>> This can be configured, I guess.
> >> > > > > > >>>
> >> > > > > > >>> >> Do you think the value is simply too low?
> >> > > > > > >>>
> >> > > > > > >>> GCing many objects may take some time and I remember we
> had
> >> a
> >> > > user
> >> > > > > who
> >> > > > > > >>> had
> >> > > > > > >>> this problem a year or two ago.
> >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> >> > instead
> >> > > of
> >> > > > > every
> >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> >> > > > > > >>> I think you use old version 2.20 or something like that.
> The
> >> > > fixes
> >> > > > > for
> >> > > > > > >>> docker have been done so far, so please use the latest
> >> version
> >> > > > > 3.0.0-M3.
> >> > > > > > >>> See this page
> >> > > > > > >>>
> >> > > >
> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> >> ,
> >> > > > > we
> >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> image
> >> > did
> >> > > > you
> >> > > > > use?
> >> > > > > > >>>
> >> > > > > > >>> Cheers
> >> > > > > > >>> Tibor
> >> > > > > > >>>
> >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> >> > > > > [hidden email]>
> >> > > > > > >>> wrote:
> >> > > > > > >>>
> >> > > > > > >>> > Thanks for the information. It's good to see someone
> >> > > understands
> >> > > > a
> >> > > > > > >>> little
> >> > > > > > >>> > about this.
> >> > > > > > >>> >
> >> > > > > > >>> > Incidentally, we have been looking at other GCs and VMs
> >> for
> >> > the
> >> > > > > > >>> application
> >> > > > > > >>> > in production environments, so I'll look into how these
> >> > affect
> >> > > > > tests as
> >> > > > > > >>> > well. I'll try to enable some logging about GC pauses to
> >> see
> >> > > > > what's up.
> >> > > > > > >>> >
> >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think
> the
> >> > value
> >> > > > is
> >> > > > > > >>> simply
> >> > > > > > >>> > too low?
> >> > > > > > >>> >
> >> > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
> >> Docker
> >> > > > > container.
> >> > > > > > >>> >
> >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> >> > > > > [hidden email]>
> >> > > > > > >>> > wrote:
> >> > > > > > >>> >
> >> > > > > > >>> > > Hi Jason,
> >> > > > > > >>> > >
> >> > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> >> > > > > > >>> > > "I think his tests have been paused for a long GC
> >> periods
> >> > and
> >> > > > > timed
> >> > > > > > >>> out
> >> > > > > > >>> > 3x
> >> > > > > > >>> > > PING period = 30 seconds. After this period forked JVM
> >> > > supposed
> >> > > > > the
> >> > > > > > >>> Maven
> >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> >> surefire
> >> > > > > processes
> >> > > > > > >>> are
> >> > > > > > >>> > > killed as well and all the file handlers and memory
> >> > > > consumptions
> >> > > > > are
> >> > > > > > >>> > > freed."
> >> > > > > > >>> > >
> >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> >> cycles,
> >> > > see
> >> > > > > > >>> > >
> >> > > > > > >>> > >
> >> > > > > > >>> >
> >> > > > > > >>>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> >> > > > > > >>> > > "
> >> > > > > > >>> > >
> >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should use
> >> > > > > Shenandoah GC
> >> > > > > > >>> > which
> >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> >> > > > significantly
> >> > > > > > >>> short
> >> > > > > > >>> > > the GC cycles.
> >> > > > > > >>> > >
> >> > > > > > >>> > > We should of cource provide a new configuration
> >> parameter
> >> > to
> >> > > > give
> >> > > > > > >>> you a
> >> > > > > > >>> > > chance to prolong the PING.
> >> > > > > > >>> > >
> >> > > > > > >>> > > Cheers
> >> > > > > > >>> > > Tibor
> >> > > > > > >>> > >
> >> > > > > > >>> >
> >> > > > > > >>> >
> >> > > > > > >>> > --
> >> > > > > > >>> >
> >> > > > > > >>> > Jason Young
> >> > > > > > >>> >
> >> > > > > > >>>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > >
> >> > > > > > --
> >> > > > > > Jason Young
> >> > > > > > Software Engineer | PROCENTIVE
> >> > > > > > [image: Phone] 715 245 8000 x7609
> >> > > > > > [image: Mobile] 706 870 3540
> >> > > > > > [image: Web] procentive.com
> >> > > > > > Confidentiality Notice: This message is intended for the sole
> >> use
> >> > of
> >> > > > the
> >> > > > > > individual and entity to which it is addressed, and may
> contain
> >> > > > > information
> >> > > > > > that is privileged, confidential and exempt from disclosure
> >> under
> >> > > > > > applicable law. Any unauthorized review, use, disclosure or
> >> > > > distribution
> >> > > > > of
> >> > > > > > this email message, including any attachment, is prohibited.
> If
> >> you
> >> > > are
> >> > > > > not
> >> > > > > > the intended recipient, please advise the sender by reply
> email
> >> and
> >> > > > > destroy
> >> > > > > > all copies of the original message.
> >> > > > >
> >> > > > >
> >> ---------------------------------------------------------------------
> >> > > > > To unsubscribe, e-mail: [hidden email]
> >> > > > > For additional commands, e-mail: [hidden email]
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young
Thanks.

I finally took a little time to look at why the `ps` command would not
work. For our project, we cannot use a default Maven image, so I made a
custom one based on alpine. One piece of software the Maven alpine image
has is `pcprops`; adding that fixes the problem:

$ docker run -it alpine sh

> / # apk update
> ...
> OK: 8445 distinct packages available
> / # apk upgrade
> ...
> OK: 4 MiB in 13 packages
> / # ps -o etime= -p 1
> ps: unrecognized option: p
> BusyBox v1.26.2 (2018-05-30 13:51:20 GMT) multi-call binary.
> Usage: ps [-o COL1,COL2=HEADER]
> Show list of processes
> -o COL1,COL2=HEADER Select columns for display
> / # apk add procps
> ...
> OK: 12 MiB in 19 packages
> / # ps -o etime= -p 1
>       05:21


Hope that helps anyone else using a custom-built alpine-based Maven image.

I'm not sure if I'll have time to experiment much with Surefire changes in
the near future; my focus at the moment is figuring out all the reasons our
test suite sometimes takes 5 or more times as long as normal (often also
causes Surefire to kill the child process, but even if it didn't, we have a
problem!).

On Sun, Apr 7, 2019 at 2:04 AM Tibor Digana <[hidden email]> wrote:

> Hi Jason,
>
> I am sorry I haven't reply.
> Excellent analysis!
>
> Can you make an experiment and use pingJob on a "fixed delay" instead of a
> "fixed rate"?
> If you want to make more, feel free.
>
> We should mabe think about how we would add more intelligence in these
> timers in order to detect the "blackout".
>
> Cheers
> Tibor
>
>
>
>
> On Fri, Mar 22, 2019 at 9:53 PM Jason Young <[hidden email]>
> wrote:
>
> > Notice:
> > * an over 3-minute gap between the first two entries above--neither the
> > command reader thread nor the pingJob had a chance to run.
> > * the command reader thread and the ping job executor evidently get a
> > chance to run from 19:18:02 to 19:19:19. They ran with increased
> frequency
> > during that time to "catch up": The command reader thread is programmed
> to
> > read from stdin until there's nothing left to read, and the pingScheduler
> > runs the pingJob at a "fixed" rate.
> > * Another 3-minute gap and another attempt to catch up at 19:22:28. This
> > time though, the command reader thread evidently did not get a turn in
> > time.
> >
> > I did not bother to log what threads the ping job was running under; I
> can
> > do that later. I also did not add any logging on the Maven process side;
> I
> > can do that as well.
> >
> > This is on a VM with unthrottled access to 4 processor cores and 8GB
> RAM. I
> > think the code under test is CPU- and RAM-intensive. There probably were
> GC
> > pauses, but I don't have proper logging for that in place.
> >
> > Note that this is all very normal behavior: Both the command reader and
> the
> > ping job's thread (run by the pingScheduler) are instructed to run until
> > their work is done, not to sleep or yield. For high throughput, an OS
> will
> > favor continuing to run a thread for a while once it is started over
> > switching tasks, so it's normal both that each thread would run for a
> while
> > after a pause, and (by extension) that each would have long pauses as
> other
> > threads had to run, too. Because of the long gaps between runs of the
> > command reader thread, my previous advice (schedule the pingJob on a
> "fixed
> > delay" instead of a "fixed rate") would not have saved the build in this
> > case (though "fixed delay" is more correct and would save the build in
> > other possible cases). Our code base has some unit tests that rely on
> > "normal case" timing, but they now have the @Ignore attribute because
> they
> > would rarely, but too often, fail. For most teams and projects, unit
> tests
> > have to be more reliable than that.
> >
> > So I think we need a solution that does not rely on preemptive scheduling
> > from the OS or on thread scheduling, but which ensures that commnads have
> > been read before running code depending on commands. There are multiple
> > ways to do that; maybe I can find time to write a solution later.
> >
> >
> > On Fri, Mar 22, 2019 at 3:46 PM Jason Young <[hidden email]>
> > wrote:
> >
> > > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I
> hacked
> > > together some temporary changes to ForkedBooter so that it "logs" (just
> > > `System.out.println`) the current time every time:
> > >
> > >    - the command reader thread reads a NOOP via stdin,
> > >    - it starts the ping job,
> > >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`),
> > >    and
> > >    - it would have exited because there is no NOOP (`!pingDone`).
> > >
> > > I also commented out the code to exit in that case so it keeps running
> > and
> > > logging.
> > >
> > > The resulting output starts out showing that it's receiving ping every
> > 10s
> > > and checking for NOOP every 30s, as expected. But then one large test
> > > begins, and:
> > >
> > > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > ...
> > >
> > > With these changes, a project which normally takes 15 to 20 minutes to
> > run
> > > actually took 80m. Would printing directly to stdout from within a test
> > do
> > > that somehow? If not, we have a problem in our code.
> > >
> > > My comments on the above report to follow...
> > >
> > > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <[hidden email]>
> > > wrote:
> > >
> > >> >> is there an -Xmx value you would recommend to ensure the parent
> > process
> > >> can send the NOOP to the surefire process?
> > >>
> > >> no such value is generic value to recommend. Each application or test
> > must
> > >> find our the value to prevent from OOM.
> > >>
> > >>
> > >> >> I'm making a custom build of surefire-booter to workaround the
> > initial
> > >> problem by commenting out the code to exit.
> > >>
> > >> Very good step forward. Then post your findings and we can continue on
> > >> making a pull request on GitHub.
> > >>
> > >>
> > >> >> Is there a logging API which I can use instead?
> > >>
> > >> Yes, it is very low level of logging (sending events to parent
> process)
> > >> but
> > >> this will be turned to high level abstraction in several days (branch
> > >> surefire-1222_2).
> > >> Yet, pls use this function:
> > >>
> > >> encodeAndWriteToOutput()
> > >>
> > >>
> > >> >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <
> [hidden email]
> > >
> > >> wrote:
> > >>
> > >> > To clarify, this is the image ours is based on:
> > >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> > >> image as
> > >> > we have to add a few other items to this image as well. I don't know
> > who
> > >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> > >> topic in
> > >> > another forum. I was just answering a previous question, and also
> > >> throwing
> > >> > out info for anyone struggling with e.g. OOME in their Maven process
> > (a
> > >> lot
> > >> > of people have that problem when running the Sonar scanner, for
> > >> example). I
> > >> > don't think we need to worry about why that file is added for this
> > >> topic.
> > >> >
> > >> > But while we're on the subject, is there an -Xmx value you would
> > >> recommend
> > >> > to ensure the parent process can send the NOOP to the surefire
> > process?
> > >> >
> > >> > I'm making a custom build of surefire-booter to workaround the
> initial
> > >> > problem by commenting out the code to exit. I am also adding logging
> > via
> > >> > stdout so I can see if ForkedBooter is indeed running the ping job
> in
> > >> small
> > >> > increments of time. Is there a logging API which I can use instead?
> > >> >
> > >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <
> [hidden email]>
> > >> > wrote:
> > >> >
> > >> > > The base images are developed in
> > >> > https://github.com/carlossg/docker-maven,
> > >> > > right?
> > >> > > Who creates "/etc/mavenrc"?
> > >> > >
> > >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> > >> [hidden email]
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Mikael, sorry I do not appear to have permission to view the
> link.
> > >> > > >
> > >> > > > I did some digging in the last couple of days. I see that the
> > parent
> > >> > > > process reads from stdin. I could not find anywhere that we are
> > >> using
> > >> > > > stdin. FWIW the failures nearly always happen at least 15m into
> a
> > >> ~20m
> > >> > > test
> > >> > > > run, so perf is a likely culprit.
> > >> > > >
> > >> > > > I see also that ForkedBooter reads commands from stdin in one
> > >> thread,
> > >> > and
> > >> > > > uses an executor service to check for a past ping in
> > >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it
> also
> > >> sets
> > >> > > > pingDone to false. The executor is configured to run up to 2
> > threads
> > >> > > > concurrently to handle the workload, and is set to run at a
> fixed
> > >> rate
> > >> > > (not
> > >> > > > a fixed delay). If the test suite is busy with testing and GC
> and
> > >> has
> > >> > > lots
> > >> > > > of threads running, it's entirely possible that a thread won't
> > have
> > >> a
> > >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
> > >> delay,
> > >> > > the
> > >> > > > VM gets around to checking for a ping every 35s over a long span
> > of
> > >> > time.
> > >> > > > Because we're running at a "fixed rate" and not a "fixed delay",
> > >> then
> > >> > > after
> > >> > > > a couple of minutes we might be a full 30s behind schedule. It's
> > >> > possible
> > >> > > > the executor will create another thread to run the scheduled
> task
> > >> > because
> > >> > > > it's running behind schedule. This new thread checks for a ping,
> > >> finds
> > >> > > it,
> > >> > > > and sets pingDone to false. But then the original thread also
> > runs,
> > >> > say,
> > >> > > 2
> > >> > > > seconds afterwards, checks pingDone, and finds it is false.
> > >> > > >
> > >> > > > So to mitigate the problem, can we a) make the executor run
> only 1
> > >> > thread
> > >> > > > and b) schedule the task at a fixed rate? For that matter, is
> > there
> > >> > > another
> > >> > > > scheduled executor we can reuse? I understand why checking for
> > ping
> > >> > > > requires a separate executor. Should I ask in github?
> > >> > > >
> > >> > > > Regarding a previous question, I found out that Alpine's Maven
> > >> package
> > >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > >> > -Xmx512m"`
> > >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> > line;
> > >> you
> > >> > > end
> > >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> > >> > (parent)
> > >> > > > process, not the surefire/failsafe (child) process.)
> > >> > > >
> > >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > >> > [hidden email]>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > I guess a timeout caused by FullGC can happen with TCP as
> well.
> > >> > > > Increasing
> > >> > > > > the timeout might not be nice but does look like it would help
> > in
> > >> > both
> > >> > > > > cases. (Problems with stdout are more related to unexpected
> JVM
> > >> > > messages
> > >> > > > I
> > >> > > > > guess)
> > >> > > > >
> > >> > > > > Gruss
> > >> > > > > Bernd
> > >> > > > > --
> > >> > > > > http://bernd.eckenfels.net
> > >> > > > >
> > >> > > > > ________________________________
> > >> > > > > Von: Mikael Åsberg <[hidden email]>
> > >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > >> > > > > An: Maven Users List
> > >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> > >> elapsed.
> > >> > > > >
> > >> > > > > These issues regarding communication with forked JVMs, won't
> > they
> > >> be
> > >> > > > > resolved once surefire moves to interprocess communication
> using
> > >> > > > > tcp/ip sockets? This happens to be the target feature to be
> > >> included
> > >> > > > > in the next surefire 3.0.0 milestone:
> > >> > > > >
> > >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > >> > > > >
> > >> > > > > There are soooo many issues relating to surefire reading
> stdout
> > of
> > >> > > > > forked processes (which is my understanding that it is
> currently
> > >> > > > > doing). Many of us are really looking forward to the next
> > >> milestone.
> > >> > > > >
> > >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > >> > > [hidden email]>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > Getting back to my original questions, I know that "ping"
> > means
> > >> to
> > >> > > see
> > >> > > > > if a
> > >> > > > > > process is there, and "NOOP" implies it's not a command to
> do
> > >> > > anything.
> > >> > > > > But
> > >> > > > > > what do the terms "ping" and "NOOP" mean in this context,
> i.e.
> > >> how
> > >> > do
> > >> > > > the
> > >> > > > > > processes communicate? I assume they don't sonar. Do other
> > >> > processes
> > >> > > > also
> > >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> > with
> > >> > TCP?
> > >> > > > > >
> > >> > > > > > I'm confused about what I should do regarding GC pauses.
> > >> > Previously I
> > >> > > > had
> > >> > > > > > code that would write the amount of remaining heap space (or
> > >> > > something
> > >> > > > > like
> > >> > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
> > >> writing
> > >> > > to
> > >> > > > > > stdout cause the communication failure somehow?
> > >> > > > > >
> > >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > >> > > > [hidden email]>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised;
> we
> > >> > > > encountered
> > >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> > OpenJ9
> > >> > soon
> > >> > > > in
> > >> > > > > case
> > >> > > > > > > that helps.)
> > >> > > > > > >
> > >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > >> > > interested:
> > >> > > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > >> > > > > > >
> > >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > >> > > > > [hidden email]>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > >> Thanks again for the information.
> > >> > > > > > >>
> > >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> > >> OOMEs.
> > >> > > More
> > >> > > > > > >> recently, I increased the RAM again to 5g for extra
> > headroom
> > >> > since
> > >> > > > we
> > >> > > > > had
> > >> > > > > > >> more headroom available; the problem hasn't happened
> since,
> > >> but
> > >> > it
> > >> > > > > hasn't
> > >> > > > > > >> been very long.
> > >> > > > > > >>
> > >> > > > > > >> We use a more customized image based on Alpine 3.8.2. The
> > JDK
> > >> > and
> > >> > > > > Maven
> > >> > > > > > >> are obtained via apk.
> > >> > > > > > >>
> > >> > > > > > >> I will try upgrading failsafe (and surefire while I'm at
> > it)
> > >> > > sooner,
> > >> > > > > and
> > >> > > > > > >> probably do some experimentation with JVMs another time
> > (not
> > >> > > > pressing
> > >> > > > > for
> > >> > > > > > >> me ATM).
> > >> > > > > > >>
> > >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > >> > > > [hidden email]
> > >> > > > > >
> > >> > > > > > >> wrote:
> > >> > > > > > >>
> > >> > > > > > >>> >> I'll try to enable some logging about GC pauses to
> see
> > >> > what's
> > >> > > up
> > >> > > > > > >>>
> > >> > > > > > >>> Pls do not keep such setting after tuning the GC because
> > >> this
> > >> > may
> > >> > > > > > >>> sometime
> > >> > > > > > >>> break the interprocess communication between Maven
> process
> > >> and
> > >> > > > > surefire
> > >> > > > > > >>> process.
> > >> > > > > > >>> It's worth to list GC information in a file and not in
> the
> > >> > > console
> > >> > > > > logs.
> > >> > > > > > >>> This can be configured, I guess.
> > >> > > > > > >>>
> > >> > > > > > >>> >> Do you think the value is simply too low?
> > >> > > > > > >>>
> > >> > > > > > >>> GCing many objects may take some time and I remember we
> > had
> > >> a
> > >> > > user
> > >> > > > > who
> > >> > > > > > >>> had
> > >> > > > > > >>> this problem a year or two ago.
> > >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> > >> > instead
> > >> > > of
> > >> > > > > every
> > >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > >> > > > > > >>> I think you use old version 2.20 or something like that.
> > The
> > >> > > fixes
> > >> > > > > for
> > >> > > > > > >>> docker have been done so far, so please use the latest
> > >> version
> > >> > > > > 3.0.0-M3.
> > >> > > > > > >>> See this page
> > >> > > > > > >>>
> > >> > > >
> > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> > >> ,
> > >> > > > > we
> > >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> > image
> > >> > did
> > >> > > > you
> > >> > > > > use?
> > >> > > > > > >>>
> > >> > > > > > >>> Cheers
> > >> > > > > > >>> Tibor
> > >> > > > > > >>>
> > >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > >> > > > > [hidden email]>
> > >> > > > > > >>> wrote:
> > >> > > > > > >>>
> > >> > > > > > >>> > Thanks for the information. It's good to see someone
> > >> > > understands
> > >> > > > a
> > >> > > > > > >>> little
> > >> > > > > > >>> > about this.
> > >> > > > > > >>> >
> > >> > > > > > >>> > Incidentally, we have been looking at other GCs and
> VMs
> > >> for
> > >> > the
> > >> > > > > > >>> application
> > >> > > > > > >>> > in production environments, so I'll look into how
> these
> > >> > affect
> > >> > > > > tests as
> > >> > > > > > >>> > well. I'll try to enable some logging about GC pauses
> to
> > >> see
> > >> > > > > what's up.
> > >> > > > > > >>> >
> > >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think
> > the
> > >> > value
> > >> > > > is
> > >> > > > > > >>> simply
> > >> > > > > > >>> > too low?
> > >> > > > > > >>> >
> > >> > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
> > >> Docker
> > >> > > > > container.
> > >> > > > > > >>> >
> > >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > >> > > > > [hidden email]>
> > >> > > > > > >>> > wrote:
> > >> > > > > > >>> >
> > >> > > > > > >>> > > Hi Jason,
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > >> > > > > > >>> > > "I think his tests have been paused for a long GC
> > >> periods
> > >> > and
> > >> > > > > timed
> > >> > > > > > >>> out
> > >> > > > > > >>> > 3x
> > >> > > > > > >>> > > PING period = 30 seconds. After this period forked
> JVM
> > >> > > supposed
> > >> > > > > the
> > >> > > > > > >>> Maven
> > >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> > >> surefire
> > >> > > > > processes
> > >> > > > > > >>> are
> > >> > > > > > >>> > > killed as well and all the file handlers and memory
> > >> > > > consumptions
> > >> > > > > are
> > >> > > > > > >>> > > freed."
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> > >> cycles,
> > >> > > see
> > >> > > > > > >>> > >
> > >> > > > > > >>> > >
> > >> > > > > > >>> >
> > >> > > > > > >>>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > >> > > > > > >>> > > "
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should
> use
> > >> > > > > Shenandoah GC
> > >> > > > > > >>> > which
> > >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> > >> > > > significantly
> > >> > > > > > >>> short
> > >> > > > > > >>> > > the GC cycles.
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > We should of cource provide a new configuration
> > >> parameter
> > >> > to
> > >> > > > give
> > >> > > > > > >>> you a
> > >> > > > > > >>> > > chance to prolong the PING.
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > Cheers
> > >> > > > > > >>> > > Tibor
> > >> > > > > > >>> > >
> > >> > > > > > >>> >
> > >> > > > > > >>> >
> > >> > > > > > >>> > --
> > >> > > > > > >>> >
> > >> > > > > > >>> > Jason Young
> > >> > > > > > >>> >
> > >> > > > > > >>>
> > >> > > > > > >>
> > >> > > > > > >>
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Jason Young
> > >> > > > > > Software Engineer | PROCENTIVE
> > >> > > > > > [image: Phone] 715 245 8000 x7609
> > >> > > > > > [image: Mobile] 706 870 3540
> > >> > > > > > [image: Web] procentive.com
> > >> > > > > > Confidentiality Notice: This message is intended for the
> sole
> > >> use
> > >> > of
> > >> > > > the
> > >> > > > > > individual and entity to which it is addressed, and may
> > contain
> > >> > > > > information
> > >> > > > > > that is privileged, confidential and exempt from disclosure
> > >> under
> > >> > > > > > applicable law. Any unauthorized review, use, disclosure or
> > >> > > > distribution
> > >> > > > > of
> > >> > > > > > this email message, including any attachment, is prohibited.
> > If
> > >> you
> > >> > > are
> > >> > > > > not
> > >> > > > > > the intended recipient, please advise the sender by reply
> > email
> > >> and
> > >> > > > > destroy
> > >> > > > > > all copies of the original message.
> > >> > > > >
> > >> > > > >
> > >> ---------------------------------------------------------------------
> > >> > > > > To unsubscribe, e-mail: [hidden email]
> > >> > > > > For additional commands, e-mail: [hidden email]
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Tibor Digana
Hi Jason,

Thanks for investigating and pointing to *pcprop*. I will continue on it.

If I made an improvement in a branch, would you be able to install it in
your repository and run your project with Surefire 3.0.0-SNAPSHOT?

Cheers
Tibor

On Wed, Apr 10, 2019 at 3:18 AM Jason Young <[hidden email]>
wrote:

> Thanks.
>
> I finally took a little time to look at why the `ps` command would not
> work. For our project, we cannot use a default Maven image, so I made a
> custom one based on alpine. One piece of software the Maven alpine image
> has is `pcprops`; adding that fixes the problem:
>
> $ docker run -it alpine sh
> > / # apk update
> > ...
> > OK: 8445 distinct packages available
> > / # apk upgrade
> > ...
> > OK: 4 MiB in 13 packages
> > / # ps -o etime= -p 1
> > ps: unrecognized option: p
> > BusyBox v1.26.2 (2018-05-30 13:51:20 GMT) multi-call binary.
> > Usage: ps [-o COL1,COL2=HEADER]
> > Show list of processes
> > -o COL1,COL2=HEADER Select columns for display
> > / # apk add procps
> > ...
> > OK: 12 MiB in 19 packages
> > / # ps -o etime= -p 1
> >       05:21
>
>
> Hope that helps anyone else using a custom-built alpine-based Maven image.
>
> I'm not sure if I'll have time to experiment much with Surefire changes in
> the near future; my focus at the moment is figuring out all the reasons our
> test suite sometimes takes 5 or more times as long as normal (often also
> causes Surefire to kill the child process, but even if it didn't, we have a
> problem!).
>
> On Sun, Apr 7, 2019 at 2:04 AM Tibor Digana <[hidden email]>
> wrote:
>
> > Hi Jason,
> >
> > I am sorry I haven't reply.
> > Excellent analysis!
> >
> > Can you make an experiment and use pingJob on a "fixed delay" instead of
> a
> > "fixed rate"?
> > If you want to make more, feel free.
> >
> > We should mabe think about how we would add more intelligence in these
> > timers in order to detect the "blackout".
> >
> > Cheers
> > Tibor
> >
> >
> >
> >
> > On Fri, Mar 22, 2019 at 9:53 PM Jason Young <[hidden email]>
> > wrote:
> >
> > > Notice:
> > > * an over 3-minute gap between the first two entries above--neither the
> > > command reader thread nor the pingJob had a chance to run.
> > > * the command reader thread and the ping job executor evidently get a
> > > chance to run from 19:18:02 to 19:19:19. They ran with increased
> > frequency
> > > during that time to "catch up": The command reader thread is programmed
> > to
> > > read from stdin until there's nothing left to read, and the
> pingScheduler
> > > runs the pingJob at a "fixed" rate.
> > > * Another 3-minute gap and another attempt to catch up at 19:22:28.
> This
> > > time though, the command reader thread evidently did not get a turn in
> > > time.
> > >
> > > I did not bother to log what threads the ping job was running under; I
> > can
> > > do that later. I also did not add any logging on the Maven process
> side;
> > I
> > > can do that as well.
> > >
> > > This is on a VM with unthrottled access to 4 processor cores and 8GB
> > RAM. I
> > > think the code under test is CPU- and RAM-intensive. There probably
> were
> > GC
> > > pauses, but I don't have proper logging for that in place.
> > >
> > > Note that this is all very normal behavior: Both the command reader and
> > the
> > > ping job's thread (run by the pingScheduler) are instructed to run
> until
> > > their work is done, not to sleep or yield. For high throughput, an OS
> > will
> > > favor continuing to run a thread for a while once it is started over
> > > switching tasks, so it's normal both that each thread would run for a
> > while
> > > after a pause, and (by extension) that each would have long pauses as
> > other
> > > threads had to run, too. Because of the long gaps between runs of the
> > > command reader thread, my previous advice (schedule the pingJob on a
> > "fixed
> > > delay" instead of a "fixed rate") would not have saved the build in
> this
> > > case (though "fixed delay" is more correct and would save the build in
> > > other possible cases). Our code base has some unit tests that rely on
> > > "normal case" timing, but they now have the @Ignore attribute because
> > they
> > > would rarely, but too often, fail. For most teams and projects, unit
> > tests
> > > have to be more reliable than that.
> > >
> > > So I think we need a solution that does not rely on preemptive
> scheduling
> > > from the OS or on thread scheduling, but which ensures that commnads
> have
> > > been read before running code depending on commands. There are multiple
> > > ways to do that; maybe I can find time to write a solution later.
> > >
> > >
> > > On Fri, Mar 22, 2019 at 3:46 PM Jason Young <
> [hidden email]>
> > > wrote:
> > >
> > > > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I
> > hacked
> > > > together some temporary changes to ForkedBooter so that it "logs"
> (just
> > > > `System.out.println`) the current time every time:
> > > >
> > > >    - the command reader thread reads a NOOP via stdin,
> > > >    - it starts the ping job,
> > > >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns
> `false`),
> > > >    and
> > > >    - it would have exited because there is no NOOP (`!pingDone`).
> > > >
> > > > I also commented out the code to exit in that case so it keeps
> running
> > > and
> > > > logging.
> > > >
> > > > The resulting output starts out showing that it's receiving ping
> every
> > > 10s
> > > > and checking for NOOP every 30s, as expected. But then one large test
> > > > begins, and:
> > > >
> > > > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > ...
> > > >
> > > > With these changes, a project which normally takes 15 to 20 minutes
> to
> > > run
> > > > actually took 80m. Would printing directly to stdout from within a
> test
> > > do
> > > > that somehow? If not, we have a problem in our code.
> > > >
> > > > My comments on the above report to follow...
> > > >
> > > > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <
> [hidden email]>
> > > > wrote:
> > > >
> > > >> >> is there an -Xmx value you would recommend to ensure the parent
> > > process
> > > >> can send the NOOP to the surefire process?
> > > >>
> > > >> no such value is generic value to recommend. Each application or
> test
> > > must
> > > >> find our the value to prevent from OOM.
> > > >>
> > > >>
> > > >> >> I'm making a custom build of surefire-booter to workaround the
> > > initial
> > > >> problem by commenting out the code to exit.
> > > >>
> > > >> Very good step forward. Then post your findings and we can continue
> on
> > > >> making a pull request on GitHub.
> > > >>
> > > >>
> > > >> >> Is there a logging API which I can use instead?
> > > >>
> > > >> Yes, it is very low level of logging (sending events to parent
> > process)
> > > >> but
> > > >> this will be turned to high level abstraction in several days
> (branch
> > > >> surefire-1222_2).
> > > >> Yet, pls use this function:
> > > >>
> > > >> encodeAndWriteToOutput()
> > > >>
> > > >>
> > > >> >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <
> > [hidden email]
> > > >
> > > >> wrote:
> > > >>
> > > >> > To clarify, this is the image ours is based on:
> > > >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> > > >> image as
> > > >> > we have to add a few other items to this image as well. I don't
> know
> > > who
> > > >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> > > >> topic in
> > > >> > another forum. I was just answering a previous question, and also
> > > >> throwing
> > > >> > out info for anyone struggling with e.g. OOME in their Maven
> process
> > > (a
> > > >> lot
> > > >> > of people have that problem when running the Sonar scanner, for
> > > >> example). I
> > > >> > don't think we need to worry about why that file is added for this
> > > >> topic.
> > > >> >
> > > >> > But while we're on the subject, is there an -Xmx value you would
> > > >> recommend
> > > >> > to ensure the parent process can send the NOOP to the surefire
> > > process?
> > > >> >
> > > >> > I'm making a custom build of surefire-booter to workaround the
> > initial
> > > >> > problem by commenting out the code to exit. I am also adding
> logging
> > > via
> > > >> > stdout so I can see if ForkedBooter is indeed running the ping job
> > in
> > > >> small
> > > >> > increments of time. Is there a logging API which I can use
> instead?
> > > >> >
> > > >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <
> > [hidden email]>
> > > >> > wrote:
> > > >> >
> > > >> > > The base images are developed in
> > > >> > https://github.com/carlossg/docker-maven,
> > > >> > > right?
> > > >> > > Who creates "/etc/mavenrc"?
> > > >> > >
> > > >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> > > >> [hidden email]
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Mikael, sorry I do not appear to have permission to view the
> > link.
> > > >> > > >
> > > >> > > > I did some digging in the last couple of days. I see that the
> > > parent
> > > >> > > > process reads from stdin. I could not find anywhere that we
> are
> > > >> using
> > > >> > > > stdin. FWIW the failures nearly always happen at least 15m
> into
> > a
> > > >> ~20m
> > > >> > > test
> > > >> > > > run, so perf is a likely culprit.
> > > >> > > >
> > > >> > > > I see also that ForkedBooter reads commands from stdin in one
> > > >> thread,
> > > >> > and
> > > >> > > > uses an executor service to check for a past ping in
> > > >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it
> > also
> > > >> sets
> > > >> > > > pingDone to false. The executor is configured to run up to 2
> > > threads
> > > >> > > > concurrently to handle the workload, and is set to run at a
> > fixed
> > > >> rate
> > > >> > > (not
> > > >> > > > a fixed delay). If the test suite is busy with testing and GC
> > and
> > > >> has
> > > >> > > lots
> > > >> > > > of threads running, it's entirely possible that a thread won't
> > > have
> > > >> a
> > > >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a
> 30s
> > > >> delay,
> > > >> > > the
> > > >> > > > VM gets around to checking for a ping every 35s over a long
> span
> > > of
> > > >> > time.
> > > >> > > > Because we're running at a "fixed rate" and not a "fixed
> delay",
> > > >> then
> > > >> > > after
> > > >> > > > a couple of minutes we might be a full 30s behind schedule.
> It's
> > > >> > possible
> > > >> > > > the executor will create another thread to run the scheduled
> > task
> > > >> > because
> > > >> > > > it's running behind schedule. This new thread checks for a
> ping,
> > > >> finds
> > > >> > > it,
> > > >> > > > and sets pingDone to false. But then the original thread also
> > > runs,
> > > >> > say,
> > > >> > > 2
> > > >> > > > seconds afterwards, checks pingDone, and finds it is false.
> > > >> > > >
> > > >> > > > So to mitigate the problem, can we a) make the executor run
> > only 1
> > > >> > thread
> > > >> > > > and b) schedule the task at a fixed rate? For that matter, is
> > > there
> > > >> > > another
> > > >> > > > scheduled executor we can reuse? I understand why checking for
> > > ping
> > > >> > > > requires a separate executor. Should I ask in github?
> > > >> > > >
> > > >> > > > Regarding a previous question, I found out that Alpine's Maven
> > > >> package
> > > >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > > >> > -Xmx512m"`
> > > >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> > > line;
> > > >> you
> > > >> > > end
> > > >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the
> Maven
> > > >> > (parent)
> > > >> > > > process, not the surefire/failsafe (child) process.)
> > > >> > > >
> > > >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > > >> > [hidden email]>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > I guess a timeout caused by FullGC can happen with TCP as
> > well.
> > > >> > > > Increasing
> > > >> > > > > the timeout might not be nice but does look like it would
> help
> > > in
> > > >> > both
> > > >> > > > > cases. (Problems with stdout are more related to unexpected
> > JVM
> > > >> > > messages
> > > >> > > > I
> > > >> > > > > guess)
> > > >> > > > >
> > > >> > > > > Gruss
> > > >> > > > > Bernd
> > > >> > > > > --
> > > >> > > > > http://bernd.eckenfels.net
> > > >> > > > >
> > > >> > > > > ________________________________
> > > >> > > > > Von: Mikael Åsberg <[hidden email]>
> > > >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > >> > > > > An: Maven Users List
> > > >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> > > >> elapsed.
> > > >> > > > >
> > > >> > > > > These issues regarding communication with forked JVMs, won't
> > > they
> > > >> be
> > > >> > > > > resolved once surefire moves to interprocess communication
> > using
> > > >> > > > > tcp/ip sockets? This happens to be the target feature to be
> > > >> included
> > > >> > > > > in the next surefire 3.0.0 milestone:
> > > >> > > > >
> > > >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > > >> > > > >
> > > >> > > > > There are soooo many issues relating to surefire reading
> > stdout
> > > of
> > > >> > > > > forked processes (which is my understanding that it is
> > currently
> > > >> > > > > doing). Many of us are really looking forward to the next
> > > >> milestone.
> > > >> > > > >
> > > >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > > >> > > [hidden email]>
> > > >> > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > Getting back to my original questions, I know that "ping"
> > > means
> > > >> to
> > > >> > > see
> > > >> > > > > if a
> > > >> > > > > > process is there, and "NOOP" implies it's not a command to
> > do
> > > >> > > anything.
> > > >> > > > > But
> > > >> > > > > > what do the terms "ping" and "NOOP" mean in this context,
> > i.e.
> > > >> how
> > > >> > do
> > > >> > > > the
> > > >> > > > > > processes communicate? I assume they don't sonar. Do other
> > > >> > processes
> > > >> > > > also
> > > >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> > > with
> > > >> > TCP?
> > > >> > > > > >
> > > >> > > > > > I'm confused about what I should do regarding GC pauses.
> > > >> > Previously I
> > > >> > > > had
> > > >> > > > > > code that would write the amount of remaining heap space
> (or
> > > >> > > something
> > > >> > > > > like
> > > >> > > > > > that) to stdout after every test to troubleshoot OOMEs.
> Can
> > > >> writing
> > > >> > > to
> > > >> > > > > > stdout cause the communication failure somehow?
> > > >> > > > > >
> > > >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > > >> > > > [hidden email]>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised;
> > we
> > > >> > > > encountered
> > > >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> > > OpenJ9
> > > >> > soon
> > > >> > > > in
> > > >> > > > > case
> > > >> > > > > > > that helps.)
> > > >> > > > > > >
> > > >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > > >> > > interested:
> > > >> > > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > >> > > > > > >
> > > >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > >> > > > > [hidden email]>
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > >> Thanks again for the information.
> > > >> > > > > > >>
> > > >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> > > >> OOMEs.
> > > >> > > More
> > > >> > > > > > >> recently, I increased the RAM again to 5g for extra
> > > headroom
> > > >> > since
> > > >> > > > we
> > > >> > > > > had
> > > >> > > > > > >> more headroom available; the problem hasn't happened
> > since,
> > > >> but
> > > >> > it
> > > >> > > > > hasn't
> > > >> > > > > > >> been very long.
> > > >> > > > > > >>
> > > >> > > > > > >> We use a more customized image based on Alpine 3.8.2.
> The
> > > JDK
> > > >> > and
> > > >> > > > > Maven
> > > >> > > > > > >> are obtained via apk.
> > > >> > > > > > >>
> > > >> > > > > > >> I will try upgrading failsafe (and surefire while I'm
> at
> > > it)
> > > >> > > sooner,
> > > >> > > > > and
> > > >> > > > > > >> probably do some experimentation with JVMs another time
> > > (not
> > > >> > > > pressing
> > > >> > > > > for
> > > >> > > > > > >> me ATM).
> > > >> > > > > > >>
> > > >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > > >> > > > [hidden email]
> > > >> > > > > >
> > > >> > > > > > >> wrote:
> > > >> > > > > > >>
> > > >> > > > > > >>> >> I'll try to enable some logging about GC pauses to
> > see
> > > >> > what's
> > > >> > > up
> > > >> > > > > > >>>
> > > >> > > > > > >>> Pls do not keep such setting after tuning the GC
> because
> > > >> this
> > > >> > may
> > > >> > > > > > >>> sometime
> > > >> > > > > > >>> break the interprocess communication between Maven
> > process
> > > >> and
> > > >> > > > > surefire
> > > >> > > > > > >>> process.
> > > >> > > > > > >>> It's worth to list GC information in a file and not in
> > the
> > > >> > > console
> > > >> > > > > logs.
> > > >> > > > > > >>> This can be configured, I guess.
> > > >> > > > > > >>>
> > > >> > > > > > >>> >> Do you think the value is simply too low?
> > > >> > > > > > >>>
> > > >> > > > > > >>> GCing many objects may take some time and I remember
> we
> > > had
> > > >> a
> > > >> > > user
> > > >> > > > > who
> > > >> > > > > > >>> had
> > > >> > > > > > >>> this problem a year or two ago.
> > > >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a
> fix
> > > >> > instead
> > > >> > > of
> > > >> > > > > every
> > > >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > > >> > > > > > >>> I think you use old version 2.20 or something like
> that.
> > > The
> > > >> > > fixes
> > > >> > > > > for
> > > >> > > > > > >>> docker have been done so far, so please use the latest
> > > >> version
> > > >> > > > > 3.0.0-M3.
> > > >> > > > > > >>> See this page
> > > >> > > > > > >>>
> > > >> > > >
> > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> > > >> ,
> > > >> > > > > we
> > > >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> > > image
> > > >> > did
> > > >> > > > you
> > > >> > > > > use?
> > > >> > > > > > >>>
> > > >> > > > > > >>> Cheers
> > > >> > > > > > >>> Tibor
> > > >> > > > > > >>>
> > > >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > >> > > > > [hidden email]>
> > > >> > > > > > >>> wrote:
> > > >> > > > > > >>>
> > > >> > > > > > >>> > Thanks for the information. It's good to see someone
> > > >> > > understands
> > > >> > > > a
> > > >> > > > > > >>> little
> > > >> > > > > > >>> > about this.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > Incidentally, we have been looking at other GCs and
> > VMs
> > > >> for
> > > >> > the
> > > >> > > > > > >>> application
> > > >> > > > > > >>> > in production environments, so I'll look into how
> > these
> > > >> > affect
> > > >> > > > > tests as
> > > >> > > > > > >>> > well. I'll try to enable some logging about GC
> pauses
> > to
> > > >> see
> > > >> > > > > what's up.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you
> think
> > > the
> > > >> > value
> > > >> > > > is
> > > >> > > > > > >>> simply
> > > >> > > > > > >>> > too low?
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > FWIW we're running the Maven build in an
> Alpine-based
> > > >> Docker
> > > >> > > > > container.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > >> > > > > [hidden email]>
> > > >> > > > > > >>> > wrote:
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > > Hi Jason,
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > We spoke about this issue on our chat in ASF
> Slack:
> > > >> > > > > > >>> > > "I think his tests have been paused for a long GC
> > > >> periods
> > > >> > and
> > > >> > > > > timed
> > > >> > > > > > >>> out
> > > >> > > > > > >>> > 3x
> > > >> > > > > > >>> > > PING period = 30 seconds. After this period forked
> > JVM
> > > >> > > supposed
> > > >> > > > > the
> > > >> > > > > > >>> Maven
> > > >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> > > >> surefire
> > > >> > > > > processes
> > > >> > > > > > >>> are
> > > >> > > > > > >>> > > killed as well and all the file handlers and
> memory
> > > >> > > > consumptions
> > > >> > > > > are
> > > >> > > > > > >>> > > freed."
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> > > >> cycles,
> > > >> > > see
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> >
> > > >> > > > > > >>>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > >> > > > > > >>> > > "
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should
> > use
> > > >> > > > > Shenandoah GC
> > > >> > > > > > >>> > which
> > > >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This
> would
> > > >> > > > significantly
> > > >> > > > > > >>> short
> > > >> > > > > > >>> > > the GC cycles.
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > We should of cource provide a new configuration
> > > >> parameter
> > > >> > to
> > > >> > > > give
> > > >> > > > > > >>> you a
> > > >> > > > > > >>> > > chance to prolong the PING.
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > Cheers
> > > >> > > > > > >>> > > Tibor
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> >
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > --
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > Jason Young
> > > >> > > > > > >>> >
> > > >> > > > > > >>>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Jason Young
> > > >> > > > > > Software Engineer | PROCENTIVE
> > > >> > > > > > [image: Phone] 715 245 8000 x7609
> > > >> > > > > > [image: Mobile] 706 870 3540
> > > >> > > > > > [image: Web] procentive.com
> > > >> > > > > > Confidentiality Notice: This message is intended for the
> > sole
> > > >> use
> > > >> > of
> > > >> > > > the
> > > >> > > > > > individual and entity to which it is addressed, and may
> > > contain
> > > >> > > > > information
> > > >> > > > > > that is privileged, confidential and exempt from
> disclosure
> > > >> under
> > > >> > > > > > applicable law. Any unauthorized review, use, disclosure
> or
> > > >> > > > distribution
> > > >> > > > > of
> > > >> > > > > > this email message, including any attachment, is
> prohibited.
> > > If
> > > >> you
> > > >> > > are
> > > >> > > > > not
> > > >> > > > > > the intended recipient, please advise the sender by reply
> > > email
> > > >> and
> > > >> > > > > destroy
> > > >> > > > > > all copies of the original message.
> > > >> > > > >
> > > >> > > > >
> > > >>
> ---------------------------------------------------------------------
> > > >> > > > > To unsubscribe, e-mail: [hidden email]
> > > >> > > > > For additional commands, e-mail:
> [hidden email]
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
12