Hi,
I’ve encountered an issue with the Resource Manager leader election protocol on my Hops 2.2 installation.
I have a 5-node cluster (2 namenodes, 3 datanodes running Hadoop and RonDB) with HA enabled for both the namenode and the Resource Manager.
The system was running well until one NDB data node died with this error message:
[ndbd] ALERT – Node 3: Forced node shutdown completed. Caused by error 6050: ‘WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node’
After the error, the NDB cluster restarted automatically and resumed working normally but, unfortunately, the Resource Manager stopped working and began printing the following error:
ERROR io.hops.leaderElection.LeaderElection: LE Status: id 11 LeaderElection thread received StorageException. sucessfulTx 539915 failedTx 504483125 time period 1000 com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
io.hops.exception.StorageException: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at io.hops.metadata.ndb.wrapper.HopsExceptionHelper.wrap(HopsExceptionHelper.java:42)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:62)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:66)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:36)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:58)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:31)
at io.hops.transaction.context.TransactionContext.find(TransactionContext.java:139)
at io.hops.transaction.EntityManager.find(EntityManager.java:98)
at io.hops.transaction.lock.Lock.acquireLock(Lock.java:132)
at io.hops.transaction.lock.VariablesLock.acquire(VariablesLock.java:59)
at io.hops.transaction.lock.LeaderElectionTransactionalLockAcquirer.acquire(LeaderElectionTransactionalLockAcquirer.java:32)
at io.hops.transaction.handler.TransactionalRequestHandler.execute(TransactionalRequestHandler.java:89)
at io.hops.transaction.handler.LeaderTransactionalRequestHandler.execute(LeaderTransactionalRequestHandler.java:39)
at io.hops.transaction.handler.RequestHandler.handle(RequestHandler.java:68)
at io.hops.leaderElection.LETransaction.doTransaction(LETransaction.java:129)
at io.hops.leaderElection.LeaderElection.run(LeaderElection.java:104)
Caused by: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at com.mysql.clusterj.tie.Utility.throwError(Utility.java:1340)
at com.mysql.clusterj.tie.DbImpl.handleError(DbImpl.java:230)
at com.mysql.clusterj.tie.DbImpl.enlist(DbImpl.java:258)
at com.mysql.clusterj.tie.PartitionKeyImpl.enlist(PartitionKeyImpl.java:255)
at com.mysql.clusterj.tie.ClusterTransactionImpl.enlist(ClusterTransactionImpl.java:175)
at com.mysql.clusterj.tie.ClusterTransactionImpl.readTuple(ClusterTransactionImpl.java:552)
at com.mysql.clusterj.tie.NdbRecordOperationImpl.load(NdbRecordOperationImpl.java:326)
at com.mysql.clusterj.tie.NdbRecordSmartValueHandlerImpl.load(NdbRecordSmartValueHandlerImpl.java:211)
at com.mysql.clusterj.core.SessionImpl.initializeFromDatabase(SessionImpl.java:222)
at com.mysql.clusterj.core.SessionImpl.find(SessionImpl.java:190)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:60)
… 14 more
WARN io.hops.transaction.handler.RequestHandler: LEADER_ELECTION TX Failed. TX Time: 0 ms, RetryCount: 0, TX Stats – Setup: 0ms, AcquireLocks: -1ms, InMemoryProcessing: -1ms, CommitTime: -1ms. Locks: . io.hops.exception.StorageException: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
io.hops.exception.StorageException: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at io.hops.metadata.ndb.wrapper.HopsExceptionHelper.wrap(HopsExceptionHelper.java:42)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:62)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:66)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:36)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:58)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:31)
at io.hops.transaction.context.TransactionContext.find(TransactionContext.java:139)
at io.hops.transaction.EntityManager.find(EntityManager.java:98)
at io.hops.transaction.lock.Lock.acquireLock(Lock.java:132)
at io.hops.transaction.lock.VariablesLock.acquire(VariablesLock.java:59)
at io.hops.transaction.lock.LeaderElectionTransactionalLockAcquirer.acquire(LeaderElectionTransactionalLockAcquirer.java:32)
at io.hops.transaction.handler.TransactionalRequestHandler.execute(TransactionalRequestHandler.java:89)
at io.hops.transaction.handler.LeaderTransactionalRequestHandler.execute(LeaderTransactionalRequestHandler.java:39)
at io.hops.transaction.handler.RequestHandler.handle(RequestHandler.java:68)
at io.hops.leaderElection.LETransaction.doTransaction(LETransaction.java:129)
at io.hops.leaderElection.LeaderElection.run(LeaderElection.java:104)
Caused by: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at com.mysql.clusterj.tie.Utility.throwError(Utility.java:1340)
at com.mysql.clusterj.tie.DbImpl.handleError(DbImpl.java:230)
at com.mysql.clusterj.tie.DbImpl.enlist(DbImpl.java:258)
at com.mysql.clusterj.tie.PartitionKeyImpl.enlist(PartitionKeyImpl.java:255)
at com.mysql.clusterj.tie.ClusterTransactionImpl.enlist(ClusterTransactionImpl.java:175)
at com.mysql.clusterj.tie.ClusterTransactionImpl.readTuple(ClusterTransactionImpl.java:552)
at com.mysql.clusterj.tie.NdbRecordOperationImpl.load(NdbRecordOperationImpl.java:326)
at com.mysql.clusterj.tie.NdbRecordSmartValueHandlerImpl.load(NdbRecordSmartValueHandlerImpl.java:211)
at com.mysql.clusterj.core.SessionImpl.initializeFromDatabase(SessionImpl.java:222)
at com.mysql.clusterj.core.SessionImpl.find(SessionImpl.java:190)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:60)
… 14 more
To clear the error state I restarted both Resource Manager services, but hit another exception, possibly during the leader election phase. This is the stack trace:
INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
INFO io.hops.util.GroupMembershipService: Started GMS on 1 port: 8034
INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8034: starting
INFO org.apache.hadoop.service.AbstractService: Service io.hops.util.GroupMembershipService failed in state STARTED
java.lang.NullPointerException
at io.hops.util.GroupMembershipService.initLEandGM(GroupMembershipService.java:354)
at io.hops.util.GroupMembershipService.startGroupMembership(GroupMembershipService.java:136)
at io.hops.util.GroupMembershipService.serviceStart(GroupMembershipService.java:127)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1352)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1579)
INFO io.hops.util.GroupMembershipService: stopping group membership service server on 0:0:0:0:0:0:0:0:8034
INFO org.apache.hadoop.ipc.Server: Stopping server on 8034
INFO io.hops.util.GroupMembershipService: stopping group membership service service
INFO io.hops.util.GroupMembershipService: stopped group membership service
INFO io.hops.util.GroupMembershipService: stopped GMS on 1
INFO org.apache.hadoop.service.AbstractService: Service ResourceManager failed in state STARTED
java.lang.NullPointerException
at io.hops.util.GroupMembershipService.initLEandGM(GroupMembershipService.java:354)
at io.hops.util.GroupMembershipService.startGroupMembership(GroupMembershipService.java:136)
at io.hops.util.GroupMembershipService.serviceStart(GroupMembershipService.java:127)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1352)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1579)
INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8034
ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.w.WebAppContext@42ea7565{/,null,UNAVAILABLE}{/cluster}
INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@7ff65005{SSL,[ssl, http/1.1]}{0.0.0.0:8090}
I can’t figure out the cause of the NullPointerException. Any ideas?
Cluster details:
• Hopsworks 2.2
• HDFS hadoop-hdfs-3.2.0.3-RC0
• YARN hadoop-yarn-server-resourcemanager-3.2.0.3-RC0