Friday 15 May 2015

GridGain - GridComputeExecutionRejectedException not handled when GridJobStealingCollisionSpi is used




I've been using GridGain for more than 3 years, and besides a few bumps everything has worked smoothly. At least I've always been able to figure out what was wrong (also thanks to the solid documentation and examples). Well, until now..

For one of my projects I am trying to enable job stealing in a computational grid powered by GridGain 6.5.0. The configuration went smoothly; however, from time to time I get a GridComputeExecutionRejectedException, which bubbles all the way up to the client. The unusual thing is that GridComputeExecutionRejectedException is supposed to be detected and routed to the failover policy provided in the result method of the standard GridComputeTaskAdapter (which I extend):

    public GridComputeJobResultPolicy result(GridComputeJobResult res, List<GridComputeJobResult> rcvd) throws GridException {
        GridException e = res.getException();

        // Try to failover if result is failed.
        if (e != null) {
            // Don't failover user's code errors.
            if (e instanceof GridComputeExecutionRejectedException ||
                e instanceof GridTopologyException ||
                // Failover exception is always wrapped.
                e.hasCause(GridComputeJobFailoverException.class))
                return FAILOVER;

            throw new GridException("Remote job threw user exception (override or implement GridComputeTask.result(..) " +
                "method if you would like to have automatic failover for this exception).", e);
        }

        // Wait for all job responses.
        return WAIT;
    }

The exception thrown during collision is as follows:

014-10-26 23:57:33,190 [http-bio-8080-exec-13] ERROR errors.GrailsExceptionResolver - GridComputeExecutionRejectedException occurred when processing request: [POST] /evorun/runevolution
Job cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414392425356, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, clsLdrId=4faab505941-ea582293-39ba-4648-9022-596e6626954b, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=7f4e9505941-b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, startTime=1414392785621, endTime=9223372036854775807, taskNodeId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3], jobId=55ee9505941-8522cc8b-10fb-4afd-945f-caa0e0c561f0], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@380042f5]
For more information see:
    Troubleshooting: http://bit.ly/gridgain-troubleshooting
    Documentation Center: http://bit.ly/gridgain-documentation
. Stacktrace follows:
class org.gridgain.grid.compute.GridComputeExecutionRejectedException: Job cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414392425356, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, clsLdrId=4faab505941-ea582293-39ba-4648-9022-596e6626954b, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=7f4e9505941-b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, startTime=1414392785621, endTime=9223372036854775807, taskNodeId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3, clsLdr=sun.misc.Launcher$AppClassLoader@2e2e1b6c, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=b2e9befc-051f-4e17-ba8d-bbafbe9cd7a3], jobId=55ee9505941-8522cc8b-10fb-4afd-945f-caa0e0c561f0], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@380042f5]
For more information see:
    Troubleshooting: http://bit.ly/gridgain-troubleshooting
    Documentation Center: http://bit.ly/gridgain-documentation
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.onBeforeActivateJob(GridJobProcessor.java:1190)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$1500(GridJobProcessor.java:62)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$CollisionJobContext.activate(GridJobProcessor.java:1469)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.checkBusy(GridJobStealingCollisionSpi.java:640)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.onCollision(GridJobStealingCollisionSpi.java:589)
    at org.gridgain.grid.kernal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:124)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:669)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1089)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I've found out that the piece of code responsible for activating jobs in GridJobStealingCollisionSpi has the comment "we need to make sure that the job is not being rejected by another thread." Has the scenario described in that comment somehow occurred? (I know there is a synchronized block in the code that should prevent that.)
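To make the suspected scenario concrete: two threads compete over the same waiting job, one trying to activate it and one trying to reject it (as a stealing request would). The following is only a minimal plain-Java sketch of that kind of race and of one way to make the outcome exclusive. `WaitingJob`, `State` and the method names are hypothetical stand-ins, not GridGain classes, and this is not how GridJobStealingCollisionSpi is actually implemented.

```java
import java.util.concurrent.atomic.AtomicReference;

public class JobActivationRace {
    /** Hypothetical lifecycle of a job sitting in the waiting queue. */
    enum State { WAITING, ACTIVE, REJECTED }

    /** Hypothetical stand-in for a waiting job context; not a GridGain class. */
    static final class WaitingJob {
        private final AtomicReference<State> state = new AtomicReference<>(State.WAITING);

        /** Returns true only if this call moved the job from WAITING to ACTIVE. */
        boolean activate() {
            return state.compareAndSet(State.WAITING, State.ACTIVE);
        }

        /** Returns true only if this call moved the job from WAITING to REJECTED. */
        boolean reject() {
            return state.compareAndSet(State.WAITING, State.REJECTED);
        }

        State state() {
            return state.get();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WaitingJob job = new WaitingJob();
        boolean[] won = new boolean[2];

        // One thread plays the stealing side, the other the collision handler.
        Thread stealer = new Thread(() -> won[0] = job.reject());
        Thread activator = new Thread(() -> won[1] = job.activate());

        stealer.start();
        activator.start();
        stealer.join();
        activator.join();

        // Exactly one of the two flags is true; which one depends on scheduling.
        System.out.println("rejected=" + won[0] + ", activated=" + won[1]
            + ", final state=" + job.state());
    }
}
```

In this sketch the loser of the race simply gets `false` back. In the trace above the activation attempt appears to end in an exception instead ("Job cancelled before execution"), which then has to be handled by the failover path.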

Anyway, I'd highly appreciate any help!

My configuration file is as follows:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
        http://www.springframework.org/schema/util
        http://www.springframework.org/schema/util/spring-util-3.1.xsd">

    <bean id="grid.cfg" class="org.gridgain.grid.GridConfiguration">

        <property name="marshaller">
            <bean class="org.gridgain.grid.marshaller.optimized.GridOptimizedMarshaller">
                <property name="requireSerializable" value="false"/>
            </bean>
        </property>

        <property name="includeEventTypes">
            <util:constant static-field="org.gridgain.grid.events.GridEventType.EVTS_TASK_EXECUTION"/>
        </property>

        <property name="discoverySpi">
            <bean class="org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.gridgain.grid.spi.discovery.tcp.ipfinder.sharedfs.GridTcpDiscoverySharedFsIpFinder"/>
                </property>
            </bean>
        </property>

        <property name="loadBalancingSpi">
            <bean class="org.gridgain.grid.spi.loadbalancing.adaptive.GridAdaptiveLoadBalancingSpi">
                <property name="loadProbe">
                    <bean class="org.gridgain.grid.spi.loadbalancing.adaptive.GridAdaptiveProcessingTimeLoadProbe"/>
                </property>
            </bean>
        </property>

        <property name="collisionSpi">
            <bean class="org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi">
                <property name="activeJobsThreshold" value="28"/>
                <property name="waitJobsThreshold" value="0"/>
                <property name="messageExpireTime" value="3000"/>
                <property name="maximumStealingAttempts" value="5"/>
                <property name="stealingEnabled" value="true"/>
            </bean>
        </property>

        <property name="failoverSpi">
            <bean class="org.gridgain.grid.spi.failover.jobstealing.GridJobStealingFailoverSpi">
                <property name="maximumFailoverAttempts" value="5"/>
            </bean>
        </property>

        <property name="swapSpaceSpi">
            <bean class="org.gridgain.grid.spi.swapspace.noop.GridNoopSwapSpaceSpi"/>
        </property>
    </bean>
</beans>

Edit: As requested, here is the abstract task class:

    public abstract class GridCollectionInputTask<IN, OUT, JOB_OUT> extends GridComputeTaskSplitAdapter<Collection<IN>, OUT> {

        /** Auto-injected grid logger. */
        @GridLoggerResource
        private GridLogger log = null;

        private final ArgumentCallable<IN, JOB_OUT> callable;

        public GridCollectionInputTask(ArgumentCallable<IN, JOB_OUT> callable) {
            this.callable = callable;
        }

        @Override
        protected Collection<? extends GridComputeJob> split(int gridSize, Collection<IN> inputs) throws GridException {
            List<GridComputeJob> jobs = new ArrayList<GridComputeJob>(inputs.size());
            for (IN input : inputs) {
                jobs.add(new GridComputeJobAdapter(input) {
                    @SuppressWarnings("unchecked")
                    @Override
                    public JOB_OUT execute() {
                        return callable.call((IN) argument(0));
                    }
                });
            }
            return jobs;
        }

        @Override
        public OUT reduce(List<GridComputeJobResult> results) throws GridException {
            Collection<JOB_OUT> jobResults = new ArrayList<JOB_OUT>();
            for (GridComputeJobResult res : results)
                jobResults.add((JOB_OUT) res.getData());
            return createTaskOutput(jobResults);
        }

        protected abstract OUT createTaskOutput(Collection<JOB_OUT> jobResults);
    }

Edit: After introducing a try-catch block in the service class (the one that calls the grid), I got the full stack trace, which surprisingly shows a GridTopologyException:

2014-10-29 19:43:07,896 [http-bio-8080-exec-32] ERROR impl.EvolutionServiceImpl - Evolution run failed!
edu.banda.coel.CoelRuntimeException: 'gridfitnessevaluatorbotaskadapter' failed on grid.
    at edu.banda.coel.server.grid.ComputationalGrid.runOnGridSync(ComputationalGrid.java:231)
    ...
    at edu.banda.coel.server.service.impl.EvolutionServiceImpl.evolve(EvolutionServiceImpl.java:125)
    at com.banda.math.domain.evo.EvoRunController.runEvolution(EvoRunController.groovy:119)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: class org.gridgain.grid.GridTopologyException: Failed to failover a job to another node (failover SPI returned null) [job=edu.banda.coel.server.grid.GridCollectionInputTask$1@47ba5075, node=GridTcpDiscoveryNode [id=368ffe13-76c7-42f6-9339-a34c772c0931, addrs=[xxx.xxx.xxx.xxx, 127.0.0.1], sockAddrs=[xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:47500, /xxx.xxx.xxx.xxx:47500, /127.0.0.1:47500], discPort=47500, order=24, loc=false, ver=6.5.0#20140925-sha1:48190079]]
    at org.gridgain.grid.kernal.processors.task.GridTaskWorker.failover(GridTaskWorker.java:984)
    at org.gridgain.grid.kernal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:757)
    at org.gridgain.grid.kernal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:906)
    at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1138)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
    at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    ... 3 more
Caused by: class org.gridgain.grid.compute.GridComputeExecutionRejectedException: Job cancelled before execution [jobSes=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=edu.banda.coel.server.grid.GridCollectionTask, dep=LocalDeployment [super=GridDeployment [ts=1414636288878, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@684be8b8, clsLdrId=3bab4ee5941-368ffe13-76c7-42f6-9339-a34c772c0931, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=edu.banda.coel.server.grid.GridCollectionTask, sesId=cc04ede5941-e05a00ce-2864-46a8-bf7c-4452f2a6d46e, startTime=1414636742023, endTime=9223372036854775807, taskNodeId=e05a00ce-2864-46a8-bf7c-4452f2a6d46e, clsLdr=sun.misc.Launcher$AppClassLoader@684be8b8, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=e05a00ce-2864-46a8-bf7c-4452f2a6d46e], jobId=21b4ede5941-368ffe13-76c7-42f6-9339-a34c772c0931], job=edu.banda.coel.server.grid.GridCollectionInputTask$1@1886b071]
For more information see:
    Troubleshooting: http://bit.ly/gridgain-troubleshooting
    Documentation Center: http://bit.ly/gridgain-documentation
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.onBeforeActivateJob(GridJobProcessor.java:1190)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$1500(GridJobProcessor.java:62)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$CollisionJobContext.activate(GridJobProcessor.java:1469)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.checkBusy(GridJobStealingCollisionSpi.java:640)
    at org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi.onCollision(GridJobStealingCollisionSpi.java:589)
    at org.gridgain.grid.kernal.managers.collision.GridCollisionManager.onCollision(GridCollisionManager.java:124)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.handleCollisions(GridJobProcessor.java:669)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor.access$3000(GridJobProcessor.java:62)
    at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobEventListener.onJobFinished(GridJobProcessor.java:1636)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.finishJob(GridJobWorker.java:807)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.execute0(GridJobWorker.java:533)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:429)
    ... 4 more
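The nesting in this trace (a topology exception whose cause is the rejected-execution exception) is the kind of thing that can be checked in a catch block by walking the cause chain. A minimal plain-Java sketch of such a check follows. `TopologyException`, `ExecutionRejectedException` and the `hasCause` helper are hypothetical stand-ins written for illustration, not the GridGain classes or their API.

```java
public class CauseChainCheck {
    /** Hypothetical stand-in for the outer exception seen on the client. */
    static class TopologyException extends Exception {
        TopologyException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    /** Hypothetical stand-in for the nested "job cancelled" exception. */
    static class ExecutionRejectedException extends Exception {
        ExecutionRejectedException(String msg) {
            super(msg);
        }
    }

    /**
     * Follows getCause() links starting at t and reports whether t itself
     * or any of its causes is an instance of the given type.
     */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Rebuild the same nesting as in the trace: topology wraps rejected.
        Exception rejected = new ExecutionRejectedException("Job cancelled before execution");
        Exception topology = new TopologyException("failover SPI returned null", rejected);

        System.out.println("rejected inside topology: "
            + hasCause(topology, ExecutionRejectedException.class));
        System.out.println("topology inside rejected: "
            + hasCause(rejected, TopologyException.class));
    }
}
```

A check like this only tells the client which failure it is looking at. It does not explain why the failover SPI returned null in the first place.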

