It's clear what the error means. How do I recover? The stop button is greyed out, and the only option seems to be to terminate and lose all data, which I'd rather not do. Is it possible to recover this cluster?
Looking at the error above, it seems the cluster failed to create workers on start because AWS ran out of capacity for the VM types in your cluster's region (InsufficientInstanceCapacity).
I can help you reset the cluster status, but first, can you check the status of the -master and rondb_all-in-one VMs from the AWS console?
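If you prefer to check from the command line instead of the console, a minimal boto3 sketch would look something like this (the Name-tag filter and region are assumptions; adjust them to however your instances are actually tagged):

```python
# Sketch: list the state of the cluster's EC2 instances with boto3.
# The "Name" tag patterns below are assumptions; adjust them to match
# how your Hopsworks instances are tagged in your account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # your cluster's region

resp = ec2.describe_instances(
    Filters=[{"Name": "tag:Name", "Values": ["*master*", "*rondb*"]}]
)
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        name = next(
            (t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"),
            inst["InstanceId"],
        )
        print(name, inst["State"]["Name"])  # e.g. running / stopped
```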
Thanks for the help. In the EC2 console, Hopsworks-master was stopped, and I was able to start it successfully. The rondb instance was already in a running state.
But the status in the Hopsworks dashboard has not changed.
Thanks for reporting the issue. I was about to reset your cluster state, but I noticed that your cluster is already running. Looking at the logs, it seems you have already terminated the old cluster and created a new one. Let me know if you are still facing issues with it. For your specific issue, the workaround I suggest for now is to always remove the workers before stopping the cluster.
I have created an internal ticket to investigate the cause of this issue and fix it on our side.
Thanks for the help. Yes, I needed to re-create the cluster. We're just testing things right now, so nothing important was lost. To clarify the issue: I had stopped the cluster via the UI, and the error occurred on the restart via the UI.
We've been stopping clusters overnight to save on costs. If we encounter this issue again, is there a way we can recover without getting support involved? The only option in the UI was to terminate the cluster.
I'm sorry, I don't follow. The cluster was in a stopped state, so there should not have been any workers running. The error occurred when I restarted the cluster the next day.
If a cluster has workers running when you stop it, we terminate the workers because they are stateless, but we keep their configuration as part of the cluster state. When you start the cluster again, we create the same number of workers with the same configuration you were using before the stop, and that is where you hit the insufficient capacity error. It does not seem to be a consistent error, and we are investigating a fix on our side.
On the other hand, if you terminate the workers first and then stop the cluster, the next start will not attempt to create any workers, so you shouldn't hit the same issue. I hope that clarifies it for you.
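To illustrate what happens on start, here is a rough sketch (not our actual code; the AMI, instance type, and worker count are placeholders) of launching workers with boto3 and the error you saw. InsufficientInstanceCapacity is the error code AWS returns when the region temporarily has no capacity for the requested instance type:

```python
# Sketch of start-time worker creation and how it can fail.
# Not the actual Hopsworks code; the AMI ID and instance type
# below are placeholders for illustration only.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def create_workers(count, instance_type):
    try:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType=instance_type,
            MinCount=count,
            MaxCount=count,
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            # AWS has no capacity for this instance type in the region
            # right now; retrying later or picking another type or
            # availability zone usually resolves it.
            raise
        raise

# On start, the worker config saved in the cluster state is replayed,
# e.g. create_workers(count=4, instance_type="m5.2xlarge").
# If you removed the workers before stopping, no worker config is
# saved, this step is skipped, and the start succeeds.
```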