How does Model Serving support highly concurrent access?

Freeman · April 1, 2021, 7:37am

As traffic grows, Model Serving on a Hopsworks cluster will come under more and more pressure.

From reading the documentation, I know that Model Serving in Hopsworks is based on TensorFlow Serving and SkLearn model serving. But I didn’t find any documentation on how to configure Model Serving to cope with this. I noticed that Kubernetes is mentioned in the official documentation, but I didn’t find any related services running on my local servers. This is really a question of auto-scaling server resources. Is it feasible to solve it by adding worker nodes?

Does anyone have relevant experience? Any suggestions will be much appreciated.

Jim_Dowling · April 1, 2021, 10:21am

Hi Freeman. In Hopsworks 2.2 (coming out any day now), we have support for KFServing 0.5.1. It supports replicated, elastic model instances. KFServing also supports transformers that execute before the model. Here, we can access the online feature store to build feature vectors and apply real-time transformations.

Jim_Dowling · April 1, 2021, 10:22am

I should add that, for the moment, KFServing will only be available in the managed version of Hopsworks in hopsworks.ai and the Enterprise version.

javierdlrm · April 1, 2021, 1:41pm

Hi Freeman,
Just to add more details.

Yes, models are served using either TensorFlow Serving or Flask (e.g. SkLearn, XGBoost…), where you provide your own custom implementation for loading a model and making predictions. When serving models with TensorFlow Serving, you can enable request batching to handle requests in batches.

In the community version of Hopsworks, the model server is launched in a Docker container. This means there is a single instance of the model server handling requests, which can be insufficient for high-demand scenarios.

In the enterprise version, model servers are deployed using Kubernetes, which auto-scales the number of replicas based on the number of in-flight requests. It is also possible to specify the minimum number of replicas desired. Incoming requests are load-balanced across the running model server replicas.
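
As a rough illustration of what scaling on in-flight requests means, the sketch below computes a replica count from the current number of in-flight requests. It is not the algorithm Kubernetes, KFServing, or Hopsworks actually uses; the function name `desired_replicas` and the parameters `target_per_replica`, `min_replicas`, and `max_replicas` are assumptions made up for this example.

```python
import math


def desired_replicas(
    in_flight: int,
    target_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Return how many replicas are needed for the current load.

    Illustrative only: real autoscalers also smooth the load over a
    time window and rate-limit scale-up and scale-down decisions.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # Enough replicas so that each one handles at most the target load.
    needed = math.ceil(in_flight / target_per_replica)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, needed))
```

With a target of 10 in-flight requests per replica, 25 in-flight requests would call for 3 replicas, and an idle service would stay at the configured minimum.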

Freeman · April 2, 2021, 1:17am

Hi Jim,
Thank you for your reply.

I’ve looked into KFServing before; it is based on Kubernetes and Istio. I am sorry to hear that KFServing will only be available in hopsworks.ai and the Enterprise version. I still hope the Hopsworks team will consider adding KFServing to the community version, as I think it would do a lot to promote Hopsworks.

So, for the moment, I’ll have to deploy KFServing myself.

Freeman · April 2, 2021, 1:44am

Hi javierdlrm,
Thank you for your detailed reply.

I understand what you said, and I understand that this is probably a business decision for Hopsworks. In my opinion, since the community version already provides so many features, adding KFServing to it would be the icing on the cake. Hopsworks users can certainly deploy KFServing themselves, although that is a bit more complicated. This is just my own opinion.

Jim_Dowling · April 3, 2021, 8:06am

A workaround for the moment is to deploy the same model as several different endpoints (each model is a Docker container running locally on the Hopsworks server), then have your application load-balance requests over the endpoints. You may need a larger server for the Hopsworks cluster to do this if you have higher load on the models.
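
A minimal sketch of this kind of application-side load balancing, assuming the same model has already been deployed under several endpoints that accept JSON over HTTP POST. The class `RoundRobinClient`, the helper `post_json`, and the endpoint URLs in the usage note are hypothetical names for illustration, not part of any Hopsworks API; the request and response format depends on how the model is served.

```python
import itertools
import json
import threading
import urllib.request
from typing import Callable, Sequence


def post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """Send a JSON POST request and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


class RoundRobinClient:
    """Spread prediction requests over several copies of the same model."""

    def __init__(
        self,
        endpoints: Sequence[str],
        send: Callable[[str, dict], dict] = post_json,
    ) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._cycle = itertools.cycle(self._endpoints)
        self._lock = threading.Lock()  # makes endpoint selection thread-safe
        self._send = send

    def _next_endpoint(self) -> str:
        with self._lock:
            return next(self._cycle)

    def predict(self, payload: dict) -> dict:
        """Try each endpoint at most once, starting with the next in turn."""
        last_error = None
        for _ in range(len(self._endpoints)):
            url = self._next_endpoint()
            try:
                return self._send(url, payload)
            except OSError as error:  # urllib's URLError subclasses OSError
                last_error = error
        raise RuntimeError("all model endpoints failed") from last_error
```

Usage would look something like `RoundRobinClient(["https://hopsworks.example.com/model-copy-1", "https://hopsworks.example.com/model-copy-2"]).predict({"instances": [[1.0, 2.0]]})`, where both the URLs and the payload shape are placeholders. Round-robin is the simplest policy; if one endpoint is unreachable, the request falls through to the next one.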

Freeman · April 3, 2021, 11:20am

Hi Jim,
Thank you for your reply. I think your suggestion is feasible; I’m going to give it a try.