Resource Manager Leader Election Protocol Issue

arosc · September 10, 2021, 1:52pm

Hi,
I’ve encountered an issue with the Resource Manager leader election protocol on my Hops 2.2 installation.
I have a 5-node cluster (2 namenodes, 3 datanodes (hadoop + rondb)) with HA enabled for both the namenode and the Resource Manager.
The system was running well until one ndb node died with this error message:

[ndbd] ALERT – Node 3: Forced node shutdown completed. Caused by error 6050: ‘WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node’

After the error, the ndb cluster restarted automatically and went back to working normally but, unfortunately, the Resource Manager died and began printing the following error:

ERROR io.hops.leaderElection.LeaderElection: LE Status: id 11 LeaderElection thread received StorageException. sucessfulTx 539915 failedTx 504483125 time period 1000 com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
io.hops.exception.StorageException: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at io.hops.metadata.ndb.wrapper.HopsExceptionHelper.wrap(HopsExceptionHelper.java:42)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:62)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:66)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:36)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:58)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:31)
at io.hops.transaction.context.TransactionContext.find(TransactionContext.java:139)
at io.hops.transaction.EntityManager.find(EntityManager.java:98)
at io.hops.transaction.lock.Lock.acquireLock(Lock.java:132)
at io.hops.transaction.lock.VariablesLock.acquire(VariablesLock.java:59)
at io.hops.transaction.lock.LeaderElectionTransactionalLockAcquirer.acquire(LeaderElectionTransactionalLockAcquirer.java:32)
at io.hops.transaction.handler.TransactionalRequestHandler.execute(TransactionalRequestHandler.java:89)
at io.hops.transaction.handler.LeaderTransactionalRequestHandler.execute(LeaderTransactionalRequestHandler.java:39)
at io.hops.transaction.handler.RequestHandler.handle(RequestHandler.java:68)
at io.hops.leaderElection.LETransaction.doTransaction(LETransaction.java:129)
at io.hops.leaderElection.LeaderElection.run(LeaderElection.java:104)
Caused by: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at com.mysql.clusterj.tie.Utility.throwError(Utility.java:1340)
at com.mysql.clusterj.tie.DbImpl.handleError(DbImpl.java:230)
at com.mysql.clusterj.tie.DbImpl.enlist(DbImpl.java:258)
at com.mysql.clusterj.tie.PartitionKeyImpl.enlist(PartitionKeyImpl.java:255)
at com.mysql.clusterj.tie.ClusterTransactionImpl.enlist(ClusterTransactionImpl.java:175)
at com.mysql.clusterj.tie.ClusterTransactionImpl.readTuple(ClusterTransactionImpl.java:552)
at com.mysql.clusterj.tie.NdbRecordOperationImpl.load(NdbRecordOperationImpl.java:326)
at com.mysql.clusterj.tie.NdbRecordSmartValueHandlerImpl.load(NdbRecordSmartValueHandlerImpl.java:211)
at com.mysql.clusterj.core.SessionImpl.initializeFromDatabase(SessionImpl.java:222)
at com.mysql.clusterj.core.SessionImpl.find(SessionImpl.java:190)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:60)
… 14 more
WARN io.hops.transaction.handler.RequestHandler: LEADER_ELECTION TX Failed. TX Time: 0 ms, RetryCount: 0, TX Stats – Setup: 0ms, AcquireLocks: -1ms, InMemoryProcessing: -1ms, CommitTime: -1ms. Locks: . io.hops.exception.StorageException: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
io.hops.exception.StorageException: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at io.hops.metadata.ndb.wrapper.HopsExceptionHelper.wrap(HopsExceptionHelper.java:42)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:62)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:66)
at io.hops.metadata.ndb.dalimpl.hdfs.VariableClusterj.getVariable(VariableClusterj.java:36)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:58)
at io.hops.transaction.context.VariableContext.find(VariableContext.java:31)
at io.hops.transaction.context.TransactionContext.find(TransactionContext.java:139)
at io.hops.transaction.EntityManager.find(EntityManager.java:98)
at io.hops.transaction.lock.Lock.acquireLock(Lock.java:132)
at io.hops.transaction.lock.VariablesLock.acquire(VariablesLock.java:59)
at io.hops.transaction.lock.LeaderElectionTransactionalLockAcquirer.acquire(LeaderElectionTransactionalLockAcquirer.java:32)
at io.hops.transaction.handler.TransactionalRequestHandler.execute(TransactionalRequestHandler.java:89)
at io.hops.transaction.handler.LeaderTransactionalRequestHandler.execute(LeaderTransactionalRequestHandler.java:39)
at io.hops.transaction.handler.RequestHandler.handle(RequestHandler.java:68)
at io.hops.leaderElection.LETransaction.doTransaction(LETransaction.java:129)
at io.hops.leaderElection.LeaderElection.run(LeaderElection.java:104)
Caused by: com.mysql.clusterj.ClusterJDatastoreException: Error in NdbJTie: returnCode , code 4,009, mysqlCode 157, status 3, classification 11, message Cluster Failure .
at com.mysql.clusterj.tie.Utility.throwError(Utility.java:1340)
at com.mysql.clusterj.tie.DbImpl.handleError(DbImpl.java:230)
at com.mysql.clusterj.tie.DbImpl.enlist(DbImpl.java:258)
at com.mysql.clusterj.tie.PartitionKeyImpl.enlist(PartitionKeyImpl.java:255)
at com.mysql.clusterj.tie.ClusterTransactionImpl.enlist(ClusterTransactionImpl.java:175)
at com.mysql.clusterj.tie.ClusterTransactionImpl.readTuple(ClusterTransactionImpl.java:552)
at com.mysql.clusterj.tie.NdbRecordOperationImpl.load(NdbRecordOperationImpl.java:326)
at com.mysql.clusterj.tie.NdbRecordSmartValueHandlerImpl.load(NdbRecordSmartValueHandlerImpl.java:211)
at com.mysql.clusterj.core.SessionImpl.initializeFromDatabase(SessionImpl.java:222)
at com.mysql.clusterj.core.SessionImpl.find(SessionImpl.java:190)
at io.hops.metadata.ndb.wrapper.HopsSession.find(HopsSession.java:60)
… 14 more

To get out of the error state, I restarted the two Resource Manager services but hit another exception, possibly during the leader election phase. This is the stack trace:

INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
INFO io.hops.util.GroupMembershipService: Started GMS on 1 port: 8034
INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8034: starting
INFO org.apache.hadoop.service.AbstractService: Service io.hops.util.GroupMembershipService failed in state STARTED
java.lang.NullPointerException
at io.hops.util.GroupMembershipService.initLEandGM(GroupMembershipService.java:354)
at io.hops.util.GroupMembershipService.startGroupMembership(GroupMembershipService.java:136)
at io.hops.util.GroupMembershipService.serviceStart(GroupMembershipService.java:127)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1352)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1579)
INFO io.hops.util.GroupMembershipService: stopping group membership service server on 0:0:0:0:0:0:0:0:8034
INFO org.apache.hadoop.ipc.Server: Stopping server on 8034
INFO io.hops.util.GroupMembershipService: stopping group membership service service
INFO io.hops.util.GroupMembershipService: stopped group membership service
INFO io.hops.util.GroupMembershipService: stopped GMS on 1
INFO org.apache.hadoop.service.AbstractService: Service ResourceManager failed in state STARTED
java.lang.NullPointerException
at io.hops.util.GroupMembershipService.initLEandGM(GroupMembershipService.java:354)
at io.hops.util.GroupMembershipService.startGroupMembership(GroupMembershipService.java:136)
at io.hops.util.GroupMembershipService.serviceStart(GroupMembershipService.java:127)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1352)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1579)
INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8034
ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.w.WebAppContext@42ea7565{/,null,UNAVAILABLE}{/cluster}
INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@7ff65005{SSL,[ssl, http/1.1]}{0.0.0.0:8090}

I can’t figure out the cause of the NullPointerException. Any ideas?

Cluster details:
• Hopsworks 2.2
• HDFS hadoop-hdfs-3.2.0.3-RC0
• YARN hadoop-yarn-server-resourcemanager-3.2.0.3-RC0

salman · September 13, 2021, 10:44am

Would you please provide all the Hadoop logs in the /srv/hops/hadoop/logs folder? From the logs, it seems that the leader election is taking a very long time to update the database. This could be a side effect of some other issue(s) in the system; for example, some operation may be hogging the database.
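
One way to gauge how badly leader election is failing is to pull the counters out of the "LE Status" line in the Resource Manager log. The following is a minimal Python sketch, assuming the line format matches the excerpt posted above; `parse_le_status` and `failure_ratio` are hypothetical helper names, not part of Hops.

```python
import re

# Pattern for the "LE Status" line as it appears in the log excerpt in this thread.
# "sucessfulTx" is spelled exactly as the log prints it.
LE_STATUS = re.compile(
    r"LE Status: id (?P<id>\d+) .*?sucessfulTx (?P<ok>\d+) failedTx (?P<failed>\d+)"
)


def parse_le_status(line):
    """Return (id, successful_tx, failed_tx) from an LE Status line, or None."""
    m = LE_STATUS.search(line)
    if m is None:
        return None
    return int(m.group("id")), int(m.group("ok")), int(m.group("failed"))


def failure_ratio(ok, failed):
    """Fraction of leader-election transactions that failed."""
    total = ok + failed
    return failed / total if total else 0.0


# Sample line taken from the log in the first post.
sample = (
    "ERROR io.hops.leaderElection.LeaderElection: LE Status: id 11 "
    "LeaderElection thread received StorageException. "
    "sucessfulTx 539915 failedTx 504483125 time period 1000"
)

parsed = parse_le_status(sample)
if parsed is not None:
    le_id, ok, failed = parsed
    print(f"id={le_id} failure ratio={failure_ratio(ok, failed):.4f}")
```

A ratio close to 1, as in the posted line, suggests the election thread has been retrying against an unavailable database for a long time rather than failing occasionally.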

arosc · September 15, 2021, 10:37am

Hi @salman,
Thank you for the reply. While checking the YARN configuration against the official Hadoop 3.x documentation, we noticed that the HTTPS properties of the YARN web app were missing for the two Resource Managers:

<property>
  <name>yarn.resourcemanager.webapp.https.address.1</name>
  <value>1.resourcemanager.service.consul:8090</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.https.address.0</name>
  <value>0.resourcemanager.service.consul:8090</value>
</property>

After adding this configuration, the error went away. What do you think of this solution?
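
For anyone who wants to check their own yarn-site.xml for the same gap, here is a minimal Python sketch. It assumes the Resource Manager ids are listed in `yarn.resourcemanager.ha.rm-ids` (the standard Hadoop HA property) and that they are `0` and `1`, as the property suffixes above suggest; the helper names and the inline sample are illustrative only.

```python
import xml.etree.ElementTree as ET

HTTPS_PREFIX = "yarn.resourcemanager.webapp.https.address."


def load_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_https_addresses(props):
    """List the per-RM HTTPS web app properties that are not defined."""
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
    return [HTTPS_PREFIX + i for i in rm_ids if HTTPS_PREFIX + i not in props]


# Illustrative sample: rm-ids "0,1" is an assumption, and only RM 0 is configured.
sample = """<configuration>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>0,1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.0</name>
    <value>0.resourcemanager.service.consul:8090</value>
  </property>
</configuration>"""

for name in missing_https_addresses(load_properties(sample)):
    print("missing:", name)
```

To run it against a real file, read the file contents and pass them to `load_properties` in place of `sample`.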

Thank you