Ndbd failing after VM restart

Hello,
It's a VM installation of Hopsworks version 1.2.
The following services are DEAD and I can't start them successfully. Could you help with this issue?

Thank you


2020-08-26 23:25:52 INFO [agent/setupLogging] Hops-Kagent started.
2020-08-26 23:25:52 INFO [agent/setupLogging] Heartbeat URL: https://10.0.2.15:8181/hopsworks-api/api/agentresource?action=heartbeat
2020-08-26 23:25:52 INFO [agent/setupLogging] Host Id: hopsworks0.logicalclocks.com
2020-08-26 23:25:52 INFO [agent/setupLogging] Hostname: hopsworks0.logicalclocks.com
2020-08-26 23:25:52 INFO [agent/setupLogging] Public IP: 10.0.2.15
2020-08-26 23:25:52 INFO [agent/setupLogging] Private IP: 10.0.2.15
2020-08-26 23:25:52 INFO [service/alive] Service ndb_mgmd is alive
2020-08-26 23:25:52 INFO [service/alive] Service prometheus is alive
2020-08-26 23:25:52 INFO [service/alive] Service alertmanager is alive
2020-08-26 23:25:52 INFO [service/alive] Service node_exporter is alive
2020-08-26 23:25:52 ERROR [service/alive] Service ndbmtd is DEAD.
2020-08-26 23:25:52 ERROR [service/alive] Service mysqld is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service mysqld_exporter is alive
2020-08-26 23:25:52 ERROR [service/alive] Service airflow-webserver is DEAD.
2020-08-26 23:25:52 ERROR [service/alive] Service airflow-scheduler is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service glassfish-domain1 is alive
2020-08-26 23:25:52 INFO [service/alive] Service kagent is alive
2020-08-26 23:25:52 ERROR [service/alive] Service namenode is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service sqoop is alive
2020-08-26 23:25:52 INFO [service/alive] Service zookeeper is alive
2020-08-26 23:25:52 INFO [service/alive] Service influxdb is alive
2020-08-26 23:25:52 INFO [service/alive] Service grafana is alive
2020-08-26 23:25:52 INFO [service/alive] Service elasticsearch is alive
2020-08-26 23:25:52 INFO [service/alive] Service elastic_exporter is alive
2020-08-26 23:25:52 INFO [service/alive] Service datanode is alive
2020-08-26 23:25:52 ERROR [service/alive] Service kafka is DEAD.
2020-08-26 23:25:52 ERROR [service/alive] Service epipe is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service historyserver is alive
2020-08-26 23:25:52 ERROR [service/alive] Service resourcemanager is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service logstash is alive
2020-08-26 23:25:52 INFO [service/alive] Service kibana is alive
2020-08-26 23:25:52 ERROR [service/alive] Service hivemetastore is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service hiveserver2 is alive
2020-08-26 23:25:52 INFO [service/alive] Service livy is alive
2020-08-26 23:25:52 INFO [service/alive] Service flinkhistoryserver is alive
2020-08-26 23:25:52 ERROR [service/alive] Service nodemanager is DEAD.
2020-08-26 23:25:52 ERROR [service/alive] Service sparkhistoryserver is DEAD.
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-beamjobservercluster is alive
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-beamjobserverlocal is alive
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-beamsdkworker is alive
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-spark is alive
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-kagent is alive
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-tf-serving is alive
2020-08-26 23:25:52 INFO [service/alive] Service filebeat-sklearn-serving is alive

The ndbd log shows the following error:


2020-08-26 22:57:12 [ndbd] INFO – Not initial start
2020-08-26 22:57:12 [ndbd] INFO – Local sysfile: Node restorable on its own, gci: 0, version: 70603
2020-08-26 22:57:12 [ndbd] INFO – Start phase 0 completed
2020-08-26 22:57:12 [ndbd] INFO – Phase 0 has made some file system initialisations
2020-08-26 22:57:12 [ndbd] WARNING – Failed to memlock pages, error: 12 (Cannot allocate memory)
2020-08-26 22:57:12 [ndbd] INFO – Watchdog KillSwitch off.
2020-08-26 22:57:12 [ndbd] INFO – Starting QMGR phase 1
2020-08-26 22:57:12 [ndbd] INFO – Starting with m_restart_seq set to 33
2020-08-26 22:57:12 [ndbd] INFO – DIH reported normal start, now starting the Node Inclusion Protocol
2020-08-26 22:57:12 [ndbd] INFO – Include node protocol completed, phase 1 in QMGR completed
2020-08-26 22:57:12 [ndbd] INFO – Start phase 1 completed
2020-08-26 22:57:12 [ndbd] INFO – Phase 1 initialised some variables and included node in cluster, locked memory if configured to do so
2020-08-26 22:57:12 [ndbd] INFO – Starting with m_restart_seq set to 33
2020-08-26 22:57:12 [ndbd] INFO – Asking master node to accept our start (we are master, GCI = 4209500)
2020-08-26 22:57:12 [ndbd] INFO – System Restart: master node: 1, num starting: 1, gci: 4209500
2020-08-26 22:57:12 [ndbd] INFO – CNTR_START_CONF: started: 0000000000000000
2020-08-26 22:57:12 [ndbd] INFO – CNTR_START_CONF: starting: 0000000000000002
2020-08-26 22:57:12 [ndbd] INFO – NDBCNTR master accepted us into cluster, start NDB start phase 1
2020-08-26 22:57:12 [ndbd] INFO – We are performing a restart of the cluster, restoring GCI = 4209500
2020-08-26 22:57:12 [ndbd] INFO – LDM(1): Started LDM restart phase 1 (read REDO log page headers to init REDO log data)
2020-08-26 22:57:12 [ndbd] INFO – Schema file initialisation Starting
2020-08-26 22:57:12 [ndbd] INFO – Schema file initialisation Completed
2020-08-26 22:57:12 [ndbd] INFO – NDB start phase 1 completed
2020-08-26 22:57:12 [ndbd] INFO – Start phase 2 completed
2020-08-26 22:57:12 [ndbd] INFO – Phase 2 did more initialisations, master accepted our start, we started REDO log initialisations
2020-08-26 22:57:12 [ndbd] INFO – Grant nodes to start phase: 3, nodes: 0000000000000002
2020-08-26 22:57:12 [ndbd] INFO – Start NDB start phase 2
2020-08-26 22:57:12 [ndbd] INFO – NDB start phase 2 completed
2020-08-26 22:57:12 [ndbd] INFO – Start phase 3 completed
2020-08-26 22:57:12 [ndbd] INFO – Phase 3 performed local connection setups
2020-08-26 22:57:12 [ndbd] INFO – Grant nodes to start phase: 4, nodes: 0000000000000002
2020-08-26 22:57:12 [ndbd] INFO – Start NDB start phase 3
2020-08-26 22:57:12 [ndbd] INFO – NDB start phase 3 completed
2020-08-26 22:57:12 [ndbd] INFO – Restart recreating table with id = 71
2020-08-26 22:57:13 [ndbd] INFO – Restart recreating table with id = 73
2020-08-26 22:57:13 [ndbd] INFO – LDM(1):Ready to start execute REDO log phase, prepare REDO log phase completed
2020-08-26 22:57:13 [ndbd] INFO – Restart recreating table with id = 72
2020-08-26 22:57:13 [ndbd] INFO – Restart recreating table with id = 74
2020-08-26 22:57:13 [ndbd] INFO – Restart recreating table with id = 1125
error: [ code: 1509 line: 24299 node: 1 count: 1 status: 0 key: 0 name: ‘’ ]
2020-08-26 22:57:13 [ndbd] INFO – Failed to restore schema during restart, error 1509.
2020-08-26 22:57:13 [ndbd] INFO – DBDICT (Line: 4824) 0x00000002
2020-08-26 22:57:13 [ndbd] INFO – Error handler shutting down system
2020-08-26 22:57:13 [ndbd] ALERT – Node 1: Forced node shutdown completed. Occured during startphase 4. Caused by error 2355: ‘Failure to restore schema(Resource configuration error). Permanent error, external action needed’.

Hi,
Check whether you have enough disk space in the VM. If it is more than 80% used, clean up (/tmp and log files) and try restarting all the services.
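
For example, you can check how full the root filesystem is and get a rough idea of where the space is going with something like this (the du paths are only examples based on a default install, adjust them to your setup):

# look for the root filesystem being more than 80% used
df -h /
# rough breakdown of likely space hogs (paths may differ on your VM)
sudo du -sh /tmp /srv/hops/mysql-cluster/log /var/log 2>/dev/null | sort -h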

@Fernando_Marines some services store their PID files in /tmp. It's better to first shut them all down with
sudo /srv/hops/kagent/kagent/bin/shutdown-all-local-services.sh, then clean up space, and then start them again with
sudo /srv/hops/kagent/kagent/bin/start-all-local-services.sh
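
So the rough sequence would be something like the following (the cleanup commands are only examples; adjust the paths and retention to what you actually want to delete):

# stop all Hopsworks services managed by kagent
sudo /srv/hops/kagent/kagent/bin/shutdown-all-local-services.sh

# free up space while everything is stopped, e.g. stale files in /tmp
# and old NDB trace files (example paths/ages, adjust as needed)
sudo find /tmp -type f -mtime +7 -delete
sudo find /srv/hops/mysql-cluster/log -type f -name 'ndb_*_trace.log.*' -mtime +30 -delete

# start everything again
sudo /srv/hops/kagent/kagent/bin/start-all-local-services.sh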

Thank you both for the advice, but it is still not starting; the same DEAD services remain. I cleaned logs older than 30 days, stopped the services with the script, and added more space to the VM. After restarting, they are still not coming up. Any other thoughts on what the issue could be here?


[vagrant@hopsworks0 ~]$ df -h
Filesystem               Size  Used  Avail  Use%  Mounted on
/dev/mapper/centos-root   96G   34G    63G   36%  /
devtmpfs                  59G     0    59G    0%  /dev
tmpfs                     59G     0    59G    0%  /dev/shm
tmpfs                     59G  8.7M    59G    1%  /run
tmpfs                     59G     0    59G    0%  /sys/fs/cgroup
/dev/sda1               1014M  142M   873M   14%  /boot
/dev/mapper/centos-home   20G  1.3G    19G    7%  /home
tmpfs                     12G     0    12G    0%  /run/user/1000


[vagrant@hopsworks0 ~]$ sudo systemctl restart ndbmtd
[vagrant@hopsworks0 ~]$ sudo systemctl status ndbmtd
● ndbmtd.service - NDB Multi-threaded DataNode (ndbmtd)
Loaded: loaded (/usr/lib/systemd/system/ndbmtd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2020-08-28 17:21:50 UTC; 5s ago
Process: 3024 ExecStop=/srv/hops/mysql-cluster/ndb/scripts/ndbd-stop.sh (code=exited, status=1/FAILURE)
Process: 2957 ExecStart=/srv/hops/mysql-cluster/ndb/scripts/ndbd-start.sh (code=exited, status=0/SUCCESS)
Main PID: 2963 (code=exited, status=0/SUCCESS)

Aug 28 17:21:47 hopsworks0.logicalclocks.com ndbd-start.sh[2957]: Starting Data Node 1.
Aug 28 17:21:47 hopsworks0.logicalclocks.com ndbd-start.sh[2957]: ndbmtd: [Warning] option ‘connect-retries’: signed value -1 adjusted to 0
Aug 28 17:21:47 hopsworks0.logicalclocks.com ndbd-start.sh[2957]: 2020-08-28 17:21:47 [ndbd] INFO – Angel connected to ‘10.0.2.15:1186’
Aug 28 17:21:47 hopsworks0.logicalclocks.com ndbd-start.sh[2957]: 2020-08-28 17:21:47 [ndbd] INFO – Angel allocated nodeid: 1
Aug 28 17:21:47 hopsworks0.logicalclocks.com systemd[1]: Started NDB Multi-threaded DataNode (ndbmtd).
Aug 28 17:21:49 hopsworks0.logicalclocks.com ndbd-stop.sh[3024]: Incremeting the PID by 1 to skip over watchdog process
Aug 28 17:21:49 hopsworks0.logicalclocks.com ndbd-stop.sh[3024]: Killing ndbmtd with process-id 2964
Aug 28 17:21:50 hopsworks0.logicalclocks.com systemd[1]: ndbmtd.service: control process exited, code=exited status=1
Aug 28 17:21:50 hopsworks0.logicalclocks.com systemd[1]: Unit ndbmtd.service entered failed state.
Aug 28 17:21:50 hopsworks0.logicalclocks.com systemd[1]: ndbmtd.service failed.

ndb_1_error.log

Time: Friday 28 August 2020 - 16:55:16
Status: Permanent error, external action needed
Message: Failure to restore schema (Resource configuration error)
Error: 2355
Error data: Failed to restore schema during restart, error 1509.
Error object: DBDICT (Line: 4824) 0x00000002
Program: ndbmtd
Pid: 4167 thr: 0
Version: mysql-5.7.25 ndb-7.6.9
Trace file name: ndb_1_trace.log.9
Trace file path: /srv/hops/mysql-cluster/log/ndb_1_trace.log.9 [t1…t4]
EOM

Time: Friday 28 August 2020 - 17:15:13
Status: Permanent error, external action needed
Message: Failure to restore schema (Resource configuration error)
Error: 2355
Error data: Failed to restore schema during restart, error 1509.
Error object: DBDICT (Line: 4824) 0x00000002
Program: ndbmtd
Pid: 1995 thr: 0
Version: mysql-5.7.25 ndb-7.6.9
Trace file name: ndb_1_trace.log.10
Trace file path: /srv/hops/mysql-cluster/log/ndb_1_trace.log.10 [t1…t4]
EOM

Time: Friday 28 August 2020 - 17:21:49
Status: Permanent error, external action needed
Message: Failure to restore schema (Resource configuration error)
Error: 2355
Error data: Failed to restore schema during restart, error 1509.
Error object: DBDICT (Line: 4824) 0x00000002
Program: ndbmtd
Pid: 2964 thr: 0
Version: mysql-5.7.25 ndb-7.6.9
Trace file name: ndb_1_trace.log.11
Trace file path: /srv/hops/mysql-cluster/log/ndb_1_trace.log.11 [t1…t4]

I found this error in the trace:

ndb_1_trace.log.11_t1

--------------- Signal ----------------
r.bn: 244 “BACKUP”, r.proc: 1, r.sigId: 139 gsn: 761 “STOP_FOR_CRASH” prio: 0
s.bn: 0 “SYS”, s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
H’00000000
--------------- Signal ----------------
r.bn: 260 “LGMAN”, r.proc: 1, r.sigId: 138 gsn: 728 “CREATE_FILE_IMPL_REQ” prio: 1
s.bn: 250 “DBDICT”, s.proc: 1, s.sigId: 1179 length: 7 trace: 0 #sec: 0 fragInf: 0
H’01000009 H’00fa0001 H’00000010 H’00000048 H’01000009 H’00000047 H’00000001
--------------- Signal ----------------
r.bn: 260 “LGMAN”, r.proc: 1, r.sigId: 137 gsn: 260 “FSOPENREF” prio: 1
s.bn: 253 “NDBFS”, s.proc: 1, s.sigId: 1175 length: 4 trace: 0 #sec: 0 fragInf: 0
UserPointer: 30334976
ErrorCode: 2815, File not found
OS ErrorCode: 2
--------------- Signal ----------------

Hello again, I still can't find a solution or workaround for this issue. Would you know what other options I may need to look into here?
Regards

I think your VM might be in an unrecoverable state. If possible, recreate the VM; it takes less effort than trying to fix it.

Is there a way I can extract the users' Jupyter notebooks?