RonDB master data node restart issue

Hi,
We are facing a problem with Hopsworks platform 2.2 when restarting the RonDB master data node.

We have a 3-node RonDB cluster:

[ndbd(NDB)] 3 node(s)
id=1 @10.206.197.54 (RonDB-21.04.0, Nodegroup: 0, *)
id=2 @10.206.197.55 (RonDB-21.04.0, Nodegroup: 0)
id=3 @10.206.197.58 (RonDB-21.04.0, Nodegroup: 0)

Since we have an NDB cluster, we expect Glassfish and HDFS to keep working even when a node restarts. Unfortunately, restarting the master RonDB node causes the following issues:

Glassfish prints:

Caused by: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.6.4.qualifier): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: Got temporary error 1204 'Temporary failure, distribution changed' from NDBCLUSTER
Error Code: 1297
Call: SELECT TIMERID, APPLICATIONID, BLOB, CONTAINERID, CREATIONTIMERAW, INITIALEXPIRATIONRAW, INTERVALDURATION, LASTEXPIRATIONRAW, OWNERID, PKHASHCODE, SCHEDULE, STATE FROM EJB__TIMER__TBL WHERE (TIMERID = ?)
bind => [1 parameter bound]
Query: ReadObjectQuery(name="readTimerState" referenceClass=TimerState sql="SELECT TIMERID, APPLICATIONID, BLOB, CONTAINERID, CREATIONTIMERAW, INITIALEXPIRATIONRAW, INTERVALDURATION, LASTEXPIRATIONRAW, OWNERID, PKHASHCODE, SCHEDULE, STATE FROM EJB__TIMER__TBL WHERE (TIMERID = ?)")

Hadoop prints:

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:740)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1821)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2900)
Caused by: com.mysql.clusterj.ClusterJDatastoreException: Datastore exception. Return code: -1 Error code: 1,204 MySQL code: -1 Status: 1 Classification: 8 Message: unique key hdfs_users

Is there any misconfiguration or maybe a bug?

Hi @arosc

I'm afraid I haven't seen this issue before. Are you using hopsworks.ai? When you say master RonDB node, which node are you referring to? By restart, do you mean restarting the process using systemd?

From the information you have posted, I see three RonDB data nodes with a replication factor of 3, since they all belong to the same node group (node group 0). In that setup your cluster should survive two node failures.
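For reference, that layout normally comes from the NoOfReplicas setting in the cluster configuration. A minimal illustrative config.ini fragment is sketched below; the host addresses are taken from the cluster listing above, but the fragment itself is an assumption about your setup, not your actual file:

```ini
[ndbd default]
# Three replicas: all three data nodes end up in the same node group (0)
NoOfReplicas=3

[ndbd]
NodeId=1
HostName=10.206.197.54

[ndbd]
NodeId=2
HostName=10.206.197.55

[ndbd]
NodeId=3
HostName=10.206.197.58
```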

Do the services recover after a while?

Hi @antonios,
I'm not on hopsworks.ai; I'm on-prem. By master RonDB node I mean the one marked with the * character (id=1 @10.206.197.54 (RonDB-21.04.0, Nodegroup: 0, *)).
By restart I mean the systemctl restart command. The system survives if one or two nodes stop. It also survives if one or two nodes restart, but only when the restarted node is a non-master node.

Thank you

It is neither a misconfiguration nor a bug.
It is normal behaviour that a few transactions may be aborted while the node setup is reconfigured as part of node failure handling. This error is a temporary error, so the solution is simply to retry the transaction.
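The retry advice can be sketched roughly like this on the JDBC side. This is a minimal illustration, not code from Hopsworks or Glassfish: the class and method names are made up, and it relies on the fact (visible in the Glassfish stack trace above) that MySQL surfaces NDB temporary errors such as 1204 under SQL error code 1297:

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public class NdbRetry {
    /** MySQL error code that wraps NDB temporary errors such as 1204. */
    static final int ER_GET_TEMPORARY_ERRMSG = 1297;

    /** Run op, retrying on temporary NDB errors up to maxAttempts times. */
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (SQLException e) {
                // Only temporary errors are worth retrying; rethrow anything else,
                // and give up once the attempt budget is exhausted.
                if (e.getErrorCode() != ER_GET_TEMPORARY_ERRMSG || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(100L * attempt); // simple linear backoff between attempts
            }
        }
    }
}
```

In practice the backoff and attempt budget should be tuned so retries comfortably outlast the reconfiguration window during node failure handling.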

This particular error happens when the set of alive nodes is reconfigured. Since RonDB
is a distributed system, we can have race conditions where a transaction tries to write
using the old set of alive nodes. This is fairly uncommon, but still possible. When the
transaction is retried, it will use the new set of alive nodes and succeed.

The error most commonly happens during the startup of the failed node.