RestAPIError with both train_test_split and get_train_test_split

trent · May 2, 2023, 6:05am

Hi Hopsworks Community!

I recently encountered an issue while working on a project and was hoping to get some insights from this forum. I am getting a RestAPIError when running both

“X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)”

and

“X_train, y_train, X_test, y_test = fv.get_train_test_split(training_dataset_version=1)”.

I tried the second one after seeing training data versions accumulate in the feature view overview section on the website.

The error message is as follows:

RestAPIError: Metadata operation error: (url: https://c.app.hopsworks.ai/hopsworks-api/api/project/38123/featurestores/38017/featureview/historical_data_21/version/1/trainingdatasets/version/1/statistics). Server response: 
HTTP code: 400, HTTP reason: Bad Request, error code: 270137, error msg: Error saving statistics, user msg: Not a valid JSON

Which leads me to a JSON response that says:

{"type":"restApiJsonResponse","errorCode":200003,"errorMsg":"Authorization header not set."}

This issue is a little weird, as the same code works seamlessly with a similar, albeit smaller, dataset and feature group. Additionally, I noticed that training data versions are accumulating in the feature views overview section on the Hopsworks website with the larger dataset regardless of the error. Makes me wonder if it’s just a limitation of the free API.

For context, I am working from the loan approval training pipeline notebook (with my data) that was covered in the workshop held in Seattle a couple of weeks ago, and I am running Hopsworks 3.0.5 in Python 3.9.0.

Has anyone else faced this issue or have any suggestions on how to resolve it? I would greatly appreciate any help or guidance on this matter.

Thanks in advance!

Trent Leslie

Davit_Bzhalava · May 2, 2023, 11:01am

Hi Trent,

Can you check which version of pandas you have?

Also, please run the following code and send us the contents of final_stats:

import numpy as np


def _convert_pandas_statistics(stat):
    """Map one column of pandas describe() output to the statistics field names."""
    content_dict = {}
    percentiles = []
    if "25%" in stat:
        # Only the 25th, 50th and 75th percentiles are known; the rest stay 0.
        percentiles = [0] * 100
        percentiles[24] = stat["25%"]
        percentiles[49] = stat["50%"]
        percentiles[74] = stat["75%"]
    if "mean" in stat:
        content_dict["mean"] = stat["mean"]
    if "mean" in stat and "count" in stat:
        content_dict["sum"] = stat["mean"] * stat["count"]
    if "max" in stat:
        content_dict["maximum"] = stat["max"]
    if "std" in stat:
        content_dict["stdDev"] = stat["std"]
    if "min" in stat:
        content_dict["minimum"] = stat["min"]
    if percentiles:
        content_dict["approxPercentiles"] = percentiles
    return content_dict


# fv is the feature view object already defined in the notebook.
df = fv.get_batch_data(training_dataset_version=1)
stats = df.describe()
final_stats = []
for col in stats.columns:
    stat = _convert_pandas_statistics(stats[col].to_dict())
    # Label the column as fractional or integral based on its dtype.
    stat["dataType"] = (
        "Fractional"
        if isinstance(stats[col].dtype, type(np.dtype(np.float64)))
        else "Integral"
    )
    stat["isDataTypeInferred"] = "false"
    stat["column"] = col.split(".")[-1]
    stat["completeness"] = 1
    final_stats.append(stat)
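
Since the server rejects the statistics payload as "Not a valid JSON", one thing worth checking in final_stats is whether any value is inf or NaN, because strict JSON has no representation for either. The snippet below is only a sketch: find_non_finite and example_stats are hypothetical names made up for illustration, and the example data merely imitates the shape of final_stats.

```python
import json
import math


def find_non_finite(stats_list):
    """Return (column, key, value) for every inf or NaN float in a list of stat dicts."""
    problems = []
    for stat in stats_list:
        for key, value in stat.items():
            # approxPercentiles is a list; treat scalars as one-element lists.
            values = value if isinstance(value, list) else [value]
            for v in values:
                if isinstance(v, float) and not math.isfinite(v):
                    problems.append((stat.get("column"), key, v))
    return problems


# Hypothetical entries shaped like the items of final_stats.
example_stats = [
    {"column": "loan_amount", "mean": 1200.5, "maximum": 5000.0},
    {"column": "debt_ratio", "mean": float("inf"), "maximum": float("inf")},
]

print(find_non_finite(example_stats))

# With allow_nan=False, json.dumps refuses inf and NaN instead of emitting
# the non-standard tokens Infinity and NaN.
try:
    json.dumps(example_stats, allow_nan=False)
    strict_json_ok = True
except ValueError:
    strict_json_ok = False
print("serializes as strict JSON:", strict_json_ok)
```

In the notebook you would call find_non_finite(final_stats) instead of using the example list.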

trent · May 2, 2023, 2:38pm

Thanks for the quick response.

Pandas is 2.0.1.

Small Dataset Output (still working)

Large Dataset Output (still not working)

Note that the get_batch_data function didn’t recognize the training_dataset_version argument (at least in Hopsworks 3.0.5).

Thanks again for the help!

trent · May 2, 2023, 5:50pm

Got it. inf values were sneaking through in the larger dataset. I thought I had addressed that, but the fix was applied to a stats metadata dataframe, not to the dataframe itself.

Lesson learned, hope this helps someone in the future!
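
For anyone landing here with the same error, here is a rough sketch of how inf values can be found and removed in the dataframe itself before creating training data. The column names and values are made up for illustration, and dropping the affected rows is just one option; depending on the data, imputing or clipping may be more appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the larger dataset.
df = pd.DataFrame(
    {
        "income": [50000.0, 62000.0, 0.0],
        "debt": [10000.0, 15500.0, 3000.0],
    }
)
# A derived ratio is a typical source of inf: dividing by zero does not raise in pandas.
df["debt_to_income"] = df["debt"] / df["income"]

# Count inf values per numeric column.
numeric = df.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()
print(inf_counts[inf_counts > 0])

# Turn +/-inf into NaN, then drop the affected rows.
clean_df = df.replace([np.inf, -np.inf], np.nan).dropna()
print(len(df), "rows before,", len(clean_df), "rows after")
```

The key point is to run the check on the dataframe that is inserted into the feature group, not on a summary derived from it.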