spark-cassandra-connector read throughput unpredictable
0 votes, 1 answer, 286 views
A user reports that range query read throughput is far higher than the limit configured via spark.cassandra.input.readsPerSec in the spark-cassandra-connector; the throttle does not appear to take effect.
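For context, a minimal sketch of how such a throttle is typically applied on the Spark session config. The limit value, app name, and contact point below are hypothetical; the question does not state the actual values used:

import org.apache.spark.sql.SparkSession;

// Hypothetical setup: the readsPerSec option name comes from the question,
// but the values shown here are placeholders, not the user's actual config.
SparkSession sparkSession = SparkSession.builder()
        .appName("inbox-scan")                                    // hypothetical app name
        .config("spark.cassandra.connection.host", "127.0.0.1")   // hypothetical contact point
        .config("spark.cassandra.input.readsPerSec", "100")       // hypothetical limit
        .getOrCreate();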
Job dependencies (the Java driver version is pinned to 4.13.0):
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.12</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>java-driver-core-shaded</artifactId>
    ...
</dependency>
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>java-driver-core</artifactId>
    <version>4.13.0</version>
</dependency>
There are two steps in the job (both full table scans):
// Step 1: unfiltered scan of the table
Dataset<Row> dataset = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", "inbox_user_msg_dummy")
        .option("keyspace", "ssmp_inbox2")
        .load();
-and-
// Step 2: token-range-bounded scan
Dataset<Row> olderDataset = sparkSession.sql(
        "SELECT * FROM inbox_user_msg_dummy " +
        "WHERE app_uuid = 'cb663e07-7bcc-4039-ae97-8fb8e8a9ff77' " +
        "AND create_hour = token(G9e7Y4Y, 2023-08-10T04:17:27.234Z, cb663e07-7bcc-4039-ae97-8fb8e8a9ff77) " +
        "AND token(user_id, create_hour, app_uuid) <= 9121832956220923771 LIMIT 10");
FWIW, the average partition size is 649 bytes; the max is 2.7 KB.
Asked by Paul
(416 rep)
Nov 7, 2023, 07:56 PM
Last activity: Nov 8, 2023, 02:07 PM