Hopsworks 2.1 Server Issue

Hello,

I am constantly getting the following error when I try to force remove a project:

<2021-03-02T21:20:28.726> *** SUCCESS *** Project found in the database *custfeatures*
<2021-03-02T21:20:28.797> *** SUCCESS *** Updated team role *custfeatures*
<2021-03-02T21:35:29.285> *** SUCCESS *** Killed Yarn jobs *custfeatures*
<2021-03-02T21:35:29.302> *** SUCCESS *** Removed Jupyter *custfeatures*
<2021-03-02T21:35:29.313> *** SUCCESS *** Logged project removal *custfeatures*
<2021-03-02T21:35:33.243> *** SUCCESS *** Changed ownership of dummy inode *custfeatures*
<2021-03-02T21:35:33.466> *** SUCCESS *** Removed Kafka topics *custfeatures*
<2021-03-02T21:35:37.708> *** SUCCESS *** Removed quotas *custfeatures*
<2021-03-02T21:35:37.715> *** SUCCESS *** Fixed shared datasets *custfeatures*
<2021-03-02T21:35:42.147> *** SUCCESS *** Removed ElasticSearch *custfeatures*
<2021-03-02T21:35:42.267> *** SUCCESS *** Removed HDFS Groups and Users *custfeatures*
<2021-03-02T21:35:42.273> *** SUCCESS *** Removed local TensorBoards *custfeatures*
<2021-03-02T21:35:42.28> *** SUCCESS *** Removed servings *custfeatures*
<2021-03-02T21:35:42.298> *** SUCCESS *** Removed Airflow DAGs and security references *custfeatures*
<2021-03-02T21:35:42.395> *** SUCCESS *** Removed all X.509 certificates related to the Project from CertificateMaterializer *custfeatures*
<2021-03-02T21:35:42.441> *** SUCCESS *** Removed conda envs *custfeatures*
<2021-03-02T21:35:42.449> *** SUCCESS *** Removed dummy Inode *custfeatures*
<2021-03-02T21:35:29.268> *** ERROR *** Error when reading YARN apps during project cleanup *custfeatures*
<2021-03-02T21:35:29.268> *** ERROR *** Retry interrupted *custfeatures*
<2021-03-02T21:35:29.302> *** ERROR *** Error when getting Yarn logs during project cleanup *custfeatures*
<2021-03-02T21:35:29.303> *** ERROR *** null *custfeatures*
<2021-03-02T21:35:33.238> *** ERROR *** Error when changing ownership of root Project dir during project cleanup *custfeatures*
<2021-03-02T21:35:33.238> *** ERROR *** Cannot set owner for /Projects/custfeatures. Name node is in safe mode. Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:893)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setOwner(FSNamesystem.java:1008)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setOwner(NameNodeRpcServer.java:526)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setOwner(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1821)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2900)
    *custfeatures*
<2021-03-02T21:35:37.706> *** ERROR *** Error when removing project-related files during project cleanup *custfeatures*
<2021-03-02T21:35:37.706> *** ERROR *** Cannot delete /user/yarn/logs/custfeatures__coreysto. Name node is in safe mode. Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:893)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3622)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:748)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1821)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2900)
    *custfeatures*
<2021-03-02T21:35:41.077> *** ERROR *** Error when removing hive db during project cleanup *custfeatures*
<2021-03-02T21:35:41.077> *** ERROR *** Cannot delete /tmp/hive/custfeatures__mikemoun. Name node is in safe mode. Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:893)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3622)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:748)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1821)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2900)
    *custfeatures*
<2021-03-02T21:35:46.094> *** ERROR *** Error when removing root Project dir during project cleanup *custfeatures*
<2021-03-02T21:35:46.095> *** ERROR *** Cannot delete /Projects/custfeatures. Name node is in safe mode. Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:893)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3622)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:748)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:725)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1821)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2900)
    *custfeatures*

I also notice that the YARN ResourceManager service is in a BAD state, and every time I try to restart it, it just shuts down again.

I have tried stopping and restarting all services already.

Any suggestions ?

Hi,

Can you provide a bit more detail on where you are running Hopsworks? Is it your own installation (single node or cluster), or are you running via our managed service hopsworks.ai?

Also, what do you mean by “force remove a project”? A regular delete through the UI?

One reason for the NameNode being in safe mode might be that disk utilization is above 90%.
Can you check the available space with df -h on the machine that runs the NameNode?

Thanks!

I am running Hopsworks Enterprise 2.1 on a single VM in Azure.

We provisioned the machine following the recommended specs in the documentation: a 100GB OS disk and a 512GB data disk.

By “Force Remove”, I mean by clicking the “Force Remove” button from the projects management page.

Here is the output from df -h

Filesystem      Size  Used Avail Use% Mounted on
udev             32G     0   32G   0% /dev
tmpfs           6.3G  936K  6.3G   1% /run
/dev/sdb1        97G   21G   77G  22% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sdb15      105M  3.7M  101M   4% /boot/efi
/dev/sda1       126G   41G   80G  34% /mnt
tmpfs           6.3G     0  6.3G   0% /run/user/998
tmpfs           6.3G     0  6.3G   0% /run/user/1000

Hi again,

Okay, so it is not the disk. Then most likely the database (NDB), which the NameNode also uses to store filesystem metadata, ran out of space.

You can increase the memory with the DataMemory property in /srv/hops/mysql-cluster/config.ini. Change it to something like 4096M, or less, depending on the memory available on your machine.
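For reference, the change in /srv/hops/mysql-cluster/config.ini might look roughly like this (a sketch; the section layout and other entries in your file may differ, so only adjust the DataMemory line):

```ini
[ndbd default]
# RAM reserved by each NDB data node for storing data and metadata.
# Increase this if the NameNode metadata has outgrown the current allocation;
# 4096M assumes the VM has plenty of free memory, use less if it does not.
DataMemory=4096M
```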

Then restart the services with systemctl restart ndb_mgmd and systemctl restart ndbmtd. The namenode and resourcemanager services might die while you do this; in that case, restart those as well. You can check the status of all services with the script /srv/hops/kagent/kagent/bin/status-all-local-services.sh.
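Put together, the sequence would be something like the following (run as root; the namenode and resourcemanager systemd unit names are an assumption based on a default Hopsworks install, so check yours first):

```
# Restart the NDB management server and the data node
systemctl restart ndb_mgmd
systemctl restart ndbmtd

# The NameNode and ResourceManager may go down during the NDB restart;
# restart them if the status script below shows them as dead
systemctl restart namenode
systemctl restart resourcemanager

# Verify that all services came back up
/srv/hops/kagent/kagent/bin/status-all-local-services.sh

# Confirm the NameNode has left safe mode
hdfs dfsadmin -safemode get
```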

Let me know if that gets the NameNode out of safe mode!

This can happen if you create lots of small files in the distributed filesystem, so that the filesystem metadata grows too large. Usually you would deploy a multi-node cluster with multiple NDB data nodes for this kind of workload.
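If you want to check whether small files are the culprit, you can count objects in the filesystem once the NameNode is healthy again, for example:

```
# Totals under the root (columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME)
hdfs dfs -count /

# Per-project breakdown to see which project holds the most files
hdfs dfs -count /Projects/*
```

A very large FILE_COUNT relative to CONTENT_SIZE is the typical signature of a small-files problem.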