I wrote a test model in Jupyter. It worked well a few days ago, and I was able to see the experiment data in the Experiments menu. The relevant Python code is as follows:
from hops import experiment
experiment.launch(keras_mnist, name='keras_mnist', local_logdir=True, metric_key='accuracy')
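For context, keras_mnist is just a training wrapper and its exact body doesn't matter for this question. A minimal sketch of the kind of function launch() expects might look like the following (a simplified illustration, not my actual notebook code; the key point is that it returns the metric referenced by metric_key='accuracy'):

def keras_mnist():
    # Simplified MNIST training wrapper; returns the metric that
    # experiment.launch() reads via metric_key='accuracy'.
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=1, verbose=0)

    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return {'accuracy': accuracy}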
But when I ran the same model in Jupyter today, I couldn't see its experiment data in the Experiments menu. However, I can see the latest experiment data in the Data Sets menu (Experiments submenu), and the old experiment data can still be queried.
By reading the source code, I've learned that the experiment data are queried from Elasticsearch. Hopsworks itself seems to be working well, and I didn't find any errors in the Glassfish log.
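As a sanity check, the backing index can also be queried directly. The sketch below is only illustrative: the URL, the disabled certificate verification, and the '<project>_experiments' index name are assumptions that would have to be adjusted to the actual deployment.

import requests

ES_URL = "https://localhost:9200"        # placeholder, adjust to your cluster
INDEX = "myproject_experiments"          # assumed index naming pattern

# List a few documents from the experiments index to see whether the
# latest run was indexed at all.
resp = requests.get(
    f"{ES_URL}/{INDEX}/_search",
    json={"query": {"match_all": {}}, "size": 5},
    verify=False,                        # placeholder: use real certs/auth
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("name"))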
A key clue is an error message I found in the YARN (jobs) Admin menu; the details are as follows:
Failed redirect for container_e24_1617342301771_0012_01_000001
Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
Local Logs:
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.
But I found that the same error message also appeared in previous Spark sessions, so I'm not sure whether the problem is related to it.
BTW, I recently did the following things:
I imported an SSL certificate into GlassFish yesterday, and everything looked good. Since I'm not sure whether the problem is related to this change, I restored everything to its original state today.
I found that an HDFS DataNode was dead this morning, so I restarted it.
Grafana reported that Elasticsearch had 52 unassigned shards and its status was yellow, but that doesn't seem to be a serious problem (see the health-check sketch below).
I updated the TensorFlow version in the Python menu, and I found that Hopsworks automatically generated a new Docker image.
In the HDFS Admin menu, there is a message saying "Upgrade in progress. Not yet finalized.".
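The health check I refer to above is just the standard Elasticsearch /_cluster/health API; the URL and the disabled certificate verification below are placeholders for the actual deployment:

import requests

# Report overall cluster status and the number of unassigned shards.
resp = requests.get("https://localhost:9200/_cluster/health", verify=False)
health = resp.json()
print("status:", health["status"])                      # green / yellow / red
print("unassigned_shards:", health["unassigned_shards"])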
Has anyone run into a similar problem? Any suggestions would be much appreciated.
We have a systemd service called epipe, which is responsible for publishing events to Elasticsearch. I suspect it may have gotten stuck or that there is some problem with it. Can you try
systemctl restart epipe
And then go back to the Experiments service and see if the experiments appear?
Hi Freeman. We have this tracked as a JIRA issue. The problem is that ePipe does not reconnect to RonDB if RonDB is restarted (or fails and restarts). Restarting ePipe fixes the problem, but yes, we want ePipe to stubbornly keep trying to reconnect to RonDB.