spark-cassandra-connector read throughput unpredictable
0 votes, 1 answer, 286 views
A user reports that range query read throughput is far higher than the limit configured via spark.cassandra.input.readsPerSec in the spark-cassandra-connector; the throttle does not appear to take effect.
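For context, a minimal sketch of how such a throttle is typically applied on the Spark session config. The limit value, app name, and contact point below are hypothetical; the question does not state the actual values used:

import org.apache.spark.sql.SparkSession;

// Hypothetical setup: the readsPerSec option name comes from the question,
// but the values shown here are placeholders, not the user's actual config.
SparkSession sparkSession = SparkSession.builder()
        .appName("inbox-scan")                                    // hypothetical app name
        .config("spark.cassandra.connection.host", "127.0.0.1")   // hypothetical contact point
        .config("spark.cassandra.input.readsPerSec", "100")       // hypothetical limit
        .getOrCreate();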
Job dependencies (the Java driver version is pinned to 4.13.0):
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.12</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>java-driver-core-shaded</artifactId>
    ...
</dependency>
<dependency>
    <groupId>com.datastax.oss</groupId>
    <artifactId>java-driver-core</artifactId>
    <version>4.13.0</version>
</dependency>
There are two steps in the job (both full table scans):
// Step 1: unfiltered scan of the table
Dataset<Row> dataset = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", "inbox_user_msg_dummy")
        .option("keyspace", "ssmp_inbox2")
        .load();
-and-
// Step 2: token-range-bounded scan
Dataset<Row> olderDataset = sparkSession.sql(
        "SELECT * FROM inbox_user_msg_dummy " +
        "WHERE app_uuid = 'cb663e07-7bcc-4039-ae97-8fb8e8a9ff77' " +
        "AND create_hour = token(G9e7Y4Y, 2023-08-10T04:17:27.234Z, cb663e07-7bcc-4039-ae97-8fb8e8a9ff77) " +
        "AND token(user_id, create_hour, app_uuid) <= 9121832956220923771 LIMIT 10");
FWIW, the average partition size is 649 bytes; the max is 2.7 KB.
Asked by Paul
(416 rep)
Nov 7, 2023, 07:56 PM
Last activity: Nov 8, 2023, 02:07 PM