I wrote a test model in Jupyter. It worked well a few days ago, and I was able to see the experiment data in the Experiments menu. The relevant Python code is as follows:
from hops import experiment
experiment.launch(keras_mnist, name='keras_mnist', local_logdir=True, metric_key='accuracy')
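For context, keras_mnist is just a training wrapper and its exact body doesn't matter for this question. A minimal sketch of the kind of function launch() expects might look like the following (a simplified illustration, not my actual notebook code; the key point is that it returns the metric referenced by metric_key='accuracy'):

def keras_mnist():
    # Simplified MNIST training wrapper; returns the metric that
    # experiment.launch() reads via metric_key='accuracy'.
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=1, verbose=0)

    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return {'accuracy': accuracy}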
But when I ran the same model in Jupyter today, I couldn't see its experiment data in the Experiments menu. However, I can see the latest experiment data in the Data Sets menu (Experiments submenu), and the old experiment data can still be queried.
By reading the source code, I've learned that the experiment data are queried from Elasticsearch. Hopsworks itself seems to be working well, and I didn't find any errors in the Glassfish log.
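As a sanity check, the backing index can also be queried directly. The sketch below is only illustrative: the URL, the disabled certificate verification, and the '<project>_experiments' index name are assumptions that would have to be adjusted to the actual deployment.

import requests

ES_URL = "https://localhost:9200"        # placeholder, adjust to your cluster
INDEX = "myproject_experiments"          # assumed index naming pattern

# List a few documents from the experiments index to see whether the
# latest run was indexed at all.
resp = requests.get(
    f"{ES_URL}/{INDEX}/_search",
    json={"query": {"match_all": {}}, "size": 5},
    verify=False,                        # placeholder: use real certs/auth
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("name"))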
A key clue is an error message I found in the YARN (jobs) Admin menu; the details are as follows:
Failed redirect for container_e24_1617342301771_0012_01_000001
Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
Local Logs:
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.
But I found that the same error message also appeared in previous Spark sessions, so I'm not sure whether the problem is related to it.
BTW, I recently did the following things:
I imported an SSL certificate into GlassFish yesterday, and everything looked good. Since I'm not sure whether the problem is related to this change, I restored everything to its original state today.
I found that an HDFS DataNode was dead this morning, so I restarted it.
Grafana reported that Elasticsearch had 52 unassigned shards and its status was yellow, but that doesn't seem to be a serious problem (see the health-check sketch below).
I updated the TensorFlow version in the Python menu, and I found that Hopsworks automatically generated a new Docker image.
In the HDFS Admin menu, there is a message saying "Upgrade in progress. Not yet finalized.".
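The health check I refer to above is just the standard Elasticsearch /_cluster/health API; the URL and the disabled certificate verification below are placeholders for the actual deployment:

import requests

# Report overall cluster status and the number of unassigned shards.
resp = requests.get("https://localhost:9200/_cluster/health", verify=False)
health = resp.json()
print("status:", health["status"])                      # green / yellow / red
print("unassigned_shards:", health["unassigned_shards"])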
Has anyone run into a similar problem? Any suggestions would be much appreciated.
We have a systemd service called epipe, which is responsible for publishing events to Elasticsearch. I suspect it may have gotten stuck or that there is some problem with it. Can you try
systemctl restart epipe
And then go back to the Experiments service and see if the experiments appear?
Hi Freeman. We have this tracked as a JIRA issue. The problem is that ePipe does not reconnect to RonDB if RonDB is restarted (or fails and restarts). Restarting ePipe fixes the problem, but yes, we want ePipe to stubbornly keep trying to reconnect to RonDB.