Can't read data through Storage Connectors (JDBC, Oracle DB)

Hello!
We have an Oracle DWH that we want to connect to the Feature Store for on-demand data retrieval.
The storage connector is configured correctly.
Oracle Database version: 19c
JDBC driver: ojdbc8-19.9.0.0.jar (I tried several versions of the ojdbc driver, the result is the same)
All services are on current versions.
I am trying to execute the following code to create the feature group:

import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

# Retrieve the configured JDBC storage connector
dwh = fs.get_storage_connector('dwh2')

p_query = """(select 1 as ID FROM DUAL)"""

fea = fs.create_on_demand_feature_group(
    name='fea',
    version=None,
    query=p_query,
    description='On-Demand data from Oracle',
    storage_connector=dwh,
    statistics_config=True
)
fea.save()

After that I get the following error:
An error was encountered:
An error occurred while calling o189.load.
: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:509)
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:461)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1104)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:553)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:269)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:655)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:270)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:91)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:807)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:983)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1168)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3666)
at oracle.jdbc.driver.T4CPreparedStatement.executeInternal(T4CPreparedStatement.java:1426)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3713)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1167)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: Error : 911, Position : 43, Sql = SELECT * FROM ((select 1 as ID FROM DUAL)) __SPARK_GEN_JDBC_SUBQUERY_NAME_1 WHERE 1=0, OriginalSql = SELECT * FROM ((select 1 as ID FROM DUAL)) __SPARK_GEN_JDBC_SUBQUERY_NAME_1 WHERE 1=0, Error Msg = ORA-00911: invalid character
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:513)
… 32 more
Traceback (most recent call last):
File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/feature_group.py", line 1278, in save
self._feature_group_engine.save(self)
File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/on_demand_feature_group_engine.py", line 26, in save
feature_group, "read_ondmd"
File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/engine/spark.py", line 110, in register_on_demand_temporary_table
on_demand_fg.query, on_demand_fg.storage_connector
File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/engine/spark.py", line 83, in _jdbc
self._spark_session.read.format(self.JDBC_FORMAT).options(**options).load()
File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/srv/hops/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/srv/hops/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o189.load.
: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:509)
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:461)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1104)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:553)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:269)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:655)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:270)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:91)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:807)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:983)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1168)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3666)
at oracle.jdbc.driver.T4CPreparedStatement.executeInternal(T4CPreparedStatement.java:1426)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3713)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1167)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: Error : 911, Position : 43, Sql = SELECT * FROM ((select 1 as ID FROM DUAL)) __SPARK_GEN_JDBC_SUBQUERY_NAME_1 WHERE 1=0, OriginalSql = SELECT * FROM ((select 1 as ID FROM DUAL)) __SPARK_GEN_JDBC_SUBQUERY_NAME_1 WHERE 1=0, Error Msg = ORA-00911: invalid character
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:513)
… 32 more

Can you please help? Where should I look for a solution to this problem?

Hi Glastias,

I think the problem is not with the connector but with the SQL statement you are providing.
There seems to be a character in the generated query that Oracle rejects (see the ORA-00911: invalid character in the error message). I suspect it is either the ID column name, which might be a reserved word in Oracle, or the additional parentheses around the SELECT. Can you please try the following two variants?

p_query="""SELECT 1 AS ID_new FROM DUAL"""
p_query="""SELECT 1 AS ID FROM DUAL"""

Since you are getting a specific Oracle error, I believe the connection itself is working. With these two variants we should be able to narrow down the invalid character.

Hello, moritzmeister!

No, "ID" is not a reserved word in Oracle. I tried both variants, and the result is the same:
Caused by: Error : 911, Position : 45, Sql = SELECT * FROM (SELECT 1 AS ID_new FROM DUAL) __SPARK_GEN_JDBC_SUBQUERY_NAME_0 WHERE 1=0, OriginalSql = SELECT * FROM (SELECT 1 AS ID_new FROM DUAL) __SPARK_GEN_JDBC_SUBQUERY_NAME_0 WHERE 1=0, Error Msg = ORA-00911: invalid character
at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:513)
… 32 more

At the same time, if I read directly with Spark, everything works, but only if I pass the statement via the "dbtable" option rather than "query":

dwh = fs.get_storage_connector('dwh_connect')
dwhOptions = dwh.spark_options()
dwhOptions["dbtable"] = """(SELECT 1 AS ID_new FROM DUAL)"""

from pyspark.sql import SQLContext
sc = spark.sparkContext
sqlContext = SQLContext(sc)
sqlContext.read.format('jdbc') \
    .options(**dwhOptions) \
    .load()

This reads fine!
With the "query" option, the error is the same:


dwhOptions["query"] = """(SELECT 1 AS ID_new FROM DUAL)"""


You are right. I did some digging, and you are hitting a bug in Spark: https://issues.apache.org/jira/browse/SPARK-27596

Spark will be upgraded to 3.1 in the next Hopsworks release, so it will be fixed then. In the meantime, I suggest you read the table into a Spark dataframe with the dbtable option as you have done already, and just ingest the data into the Feature Store as a regular Feature Group.

It will look like this:

fea = fs.create_feature_group(
    name='dual',
    version=1,
    description='Dual table from Oracle',
    statistics_config=True)
fea.save(dual_df)

Where dual_df is your dataframe that you read previously.
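
Putting the two pieces together, the interim flow would look roughly like this (a sketch based on the snippets above; the connector name and query are the example values from this thread, and exact option handling may differ in your environment):

# Read from Oracle via the "dbtable" option, which avoids SPARK-27596
dwh = fs.get_storage_connector('dwh_connect')
dwhOptions = dwh.spark_options()
dwhOptions["dbtable"] = """(SELECT 1 AS ID_new FROM DUAL)"""

dual_df = spark.read.format('jdbc') \
    .options(**dwhOptions) \
    .load()

# Ingest the dataframe into the Feature Store as a regular (cached) Feature Group
fea = fs.create_feature_group(
    name='dual',
    version=1,
    description='Dual table from Oracle',
    statistics_config=True)
fea.save(dual_df)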

Thank you!
For a regular Feature Group we would have to write jobs to keep the data up to date, and quite a lot of them. We wanted to simplify the work of our data engineers and data scientists and make it possible to work with features directly through such a connection.

What if we upgrade Spark for our current version of Hopsworks? Is that possible, and what dependencies might need to be updated?

Yes, I know this might be a bit inconvenient; it was meant as a temporary suggestion for development until you can upgrade to a newer Hopsworks version and switch to On-Demand Feature Groups.

I don't think you want to go down the road of upgrading Spark manually; it is tied to quite a lot of other components and requires many binaries to be replaced, e.g. Hudi, HDFS and more.

An easier workaround for you would be to change the HSFS client library to use the dbtable option instead of query. For that you would need to fork https://github.com/logicalclocks/feature-store-api, check out the 2.2 release branch (branch-2.2), and then change the dictionary key in spark.py (at commit 66bdcfa2695c82434afe6dcb6701f86546428bca in the logicalclocks/feature-store-api repository on GitHub).

Afterwards you should be able to build a new Python wheel, upload it to Hopsworks, and install it in your project's Python environment through the UI.
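
To illustrate what that change would look like, here is a rough sketch of a patched _jdbc based on the frames visible in the traceback above (the method signature and the connector's spark_options() helper are assumptions on my side, not the exact upstream source):

# Hypothetical sketch of hsfs/engine/spark.py after the change -- not the exact upstream code
def _jdbc(self, query, connector):
    options = connector.spark_options()  # assumed to return the JDBC url, driver and credentials
    # Original behaviour, which hits SPARK-27596 on Oracle: Spark wraps the query in a
    # subquery aliased __SPARK_GEN_JDBC_SUBQUERY_NAME_..., and Oracle rejects identifiers
    # starting with an underscore (hence ORA-00911).
    # options["query"] = query
    # Workaround: pass the statement via "dbtable" instead. The query string must then
    # include its own parentheses, e.g. "(SELECT 1 AS ID FROM DUAL)", because it is
    # inlined directly into the FROM clause.
    options["dbtable"] = query
    return self._spark_session.read.format(self.JDBC_FORMAT).options(**options).load()

After rebuilding and installing the wheel, the create_on_demand_feature_group example from the beginning of this thread should then work, since its query is already parenthesised.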


Thank you so much!
I will try it and report the result here.

Yes, let me know if you get stuck.

Just keep in mind that if you are also planning to use JDBC connectors to databases other than Oracle, this change might lead to problems.
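
For example (my own illustration, not tested against your setup): Oracle accepts an inline view in the FROM clause without an alias, while MySQL requires every derived table to have one, so the exact dbtable string that works here may fail against another database:

# Illustrative only -- behaviour depends on the target database's SQL dialect
dwhOptions["dbtable"] = "(SELECT 1 AS ID_new FROM DUAL)"      # fine for Oracle
# dwhOptions["dbtable"] = "(SELECT 1 AS ID_new FROM DUAL) t"  # MySQL-style: derived table needs an alias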