
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

1 vote
0 answers
16 views
hive - Can not create the managed table The associated location already exists
I'm trying to create a managed Hive table using Spark SQL with the following query:
DROP TABLE IF EXISTS db.TMP_ARR;
CREATE TABLE db.TMP_ARR AS SELECT ID, - more fields.. FROM some_source_table INT;
However, the job fails with the following error:
org.apache.spark.sql.AnalysisException: Can not create the managed table('db.tmp_arr'). The associated location ('hdfs://coreCluster/warehouse/tablespace/managed/hive/db.db/tmp_arr') already exists
**What I understand:** I'm trying to create a managed table. Spark expects that the target location in HDFS does not already exist when creating a managed table. Apparently, that folder already exists, possibly due to a previous failed run or manual intervention.
**My questions:** Why does Spark throw this error even though I used DROP TABLE IF EXISTS before CREATE TABLE? What's the correct way to ensure a managed table can be created without this conflict? Should I manually delete the path in HDFS before creating the table, or is there a safer/better approach?
**Environment:** Spark version: 3.3.2; Hive metastore: enabled; Storage: HDFS
*1. It's important that the table is managed (not external) and that we don't manually assign a LOCATION. 2. Many similar jobs run concurrently (creating/dropping managed tables in the same Hive schema).*
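A note on the likely mechanism: DROP TABLE IF EXISTS only removes the metastore entry, so a warehouse directory left behind by a crashed or concurrent job (files written, table never registered) survives the drop and later blocks the CTAS. Below is a minimal PySpark sketch of one workaround, purging the leftover directory before re-creating the table; the path is taken from the error message above, and everything else (app name, abbreviated column list) is illustrative only, not part of the question.
# Sketch: drop the table, then remove any orphaned warehouse directory
# before re-running the CTAS. Adjust the path to your warehouse layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recreate-tmp-arr").enableHiveSupport().getOrCreate()

spark.sql("DROP TABLE IF EXISTS db.TMP_ARR")

leftover = "hdfs://coreCluster/warehouse/tablespace/managed/hive/db.db/tmp_arr"  # from the error
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(leftover)
fs = path.getFileSystem(hadoop_conf)
if fs.exists(path):
    fs.delete(path, True)  # recursive delete of the orphaned directory

spark.sql("CREATE TABLE db.TMP_ARR AS SELECT ID FROM some_source_table")  # columns abbreviated
With many jobs creating and dropping tables in the same schema concurrently, per-job table names avoid two jobs racing on the same location.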
hieutmbk (11 rep)
Aug 1, 2025, 02:44 AM
0 votes
1 answer
969 views
Bulk SQL INSERT into Azure SQL Database using spark causes blocking/contention?
I am running the following code using Microsoft's SQL Spark connector to write a 1-2 billion row dataframe into Azure SQL Database.
df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", secrets.db.url) \
    .option("dbtable", 'tableName') \
    .option("user", secrets.db.user) \
    .option("password", secrets.db.password) \
    .option("batchsize", 1048576) \
    .option("schemaCheckEnabled", "false") \
    .option("BulkCopyTimeout", 3600) \
    .save()
This is a snapshot of the DB utilization graph, and these are the first few rows from the following query using sp_whoisactive (screenshots not included here):
EXEC sp_WhoIsActive @find_block_leaders = 1, @sort_order = '[blocked_session_count] DESC'
The wait_info column's value is Resource_Semaphore. Configs: my dataframe is partitioned over 2100 partitions on a cluster of 900 cores; my database is 14 vCores in the General Purpose tier on Azure. My query is incredibly slow because of this blocking. It's almost like it's running one bulk insert from my cluster at a time. Any suggestions on what to change to speed it up? Or any insights into why it's blocking?
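RESOURCE_SEMAPHORE waits mean the sessions are queuing for query memory grants, so 2100 partitions all bulk-inserting into a 14-vCore database will mostly wait on each other. A hedged sketch of one mitigation: cap the number of concurrent writers before the write. df, secrets and the connector options are taken from the question; the partition count and smaller batch size are assumptions to tune, not recommendations from the source.
# Sketch: fewer, larger writers instead of 2100 concurrent bulk-copy sessions.
write_partitions = 8            # assumption: a small multiple of the DB's vCores
(df.coalesce(write_partitions)
   .write
   .format("com.microsoft.sqlserver.jdbc.spark")
   .mode("append")
   .option("url", secrets.db.url)
   .option("dbtable", "tableName")
   .option("user", secrets.db.user)
   .option("password", secrets.db.password)
   .option("batchsize", 100000)          # smaller batches keep each memory grant modest
   .option("schemaCheckEnabled", "false")
   .option("BulkCopyTimeout", 3600)
   .save())
Scaling the database up (or to a tier with more memory) while the load runs is the complementary lever if the cluster side cannot be throttled.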
Youssef Fares
Dec 22, 2020, 03:35 PM • Last activity: Mar 11, 2025, 07:09 PM
2 votes
1 answer
677 views
Best practices for large JOINs - Warehouse or External Compute (e.g. Spark)
I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward but the amount of data I am processing is very large. I am wondering, for very large JOINs, is there a preferred type of technology? For example, is it more effective to use a Data Warehouse (like Snowflake) or some other MPP system like Spark? To make the problem more concrete I created a hypothetical problem similar to my actual problem. Assume I have a table that looks like this (screenshot of the accounts table omitted). I am working on some logic that requires account pairs that have the same name. To find pairs of accounts with the same name I can easily do something like this:
SELECT 
	account1.name, 
    account2.name
FROM accounts as account1
JOIN accounts as account2 ON account1.name = account2.name AND account1.acount_id != account2.acount_id
The problem I am facing is due to the amount of data I am processing. There are roughly ~2 trillion records I am trying to self JOIN on. Obviously, this will take some time and some pretty serious compute. I have run a similar query in Snowflake using XL and 3XL warehouses but after several hours of running, I canceled them. I am hoping there is a more cost-effective or time-efficient way. Has anyone had success with massive JOINs? Are there any other tricks I could deploy? What tool did you find the most effective?
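One trick that often helps here, shown as a hedged Spark SQL sketch using the table and column names from the query above: only names that occur more than once can form a pair, so shrink the join input to those names first, and use an ordered inequality so each pair appears only once.
# Sketch: pre-filter to duplicated names before the self-join.
dup_names = spark.sql("""
    SELECT name
    FROM accounts
    GROUP BY name
    HAVING COUNT(*) > 1
""")
dup_names.createOrReplaceTempView("dup_names")

pairs = spark.sql("""
    SELECT account1.acount_id AS id1,
           account2.acount_id AS id2,
           account1.name
    FROM accounts AS account1
    JOIN dup_names AS d       ON account1.name = d.name
    JOIN accounts  AS account2 ON account1.name = account2.name
                              AND account1.acount_id < account2.acount_id  -- '<' also removes mirrored pairs
""")
If the name column is heavily skewed (one name shared by millions of accounts), the pair count explodes regardless of engine, so checking the distribution of COUNT(*) per name first is usually worth the extra pass.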
Arthur Putnam (553 rep)
Feb 24, 2022, 09:06 PM • Last activity: Dec 18, 2024, 12:01 PM
0 votes
0 answers
739 views
Py4JJavaError while creating SparkSession
I encountered this error while trying to create a Spark session in a Python virtual environment, using VS Code as the IDE. The code I ran and the output are below; please help. **Code**
# Create SparkSession
spark = SparkSession.builder\
            .master("local")\
            .appName("spark-app-version-x")\
            .getOrCreate()
` **Output** --------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) Cell In, line 5 1 # Create SparkSession 2 spark = SparkSession.builder\ 3 .master("local")\ 4 .appName("spark-app-version-x")\ ----> 5 .getOrCreate() File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\sql\session.py:269, in SparkSession.Builder.getOrCreate(self) 267 sparkConf.set(key, value) 268 # This SparkContext may be an existing one. --> 269 sc = SparkContext.getOrCreate(sparkConf) 270 # Do not update SparkConf for existing SparkContext, as it's shared 271 # by all sessions. 272 session = SparkSession(sc, options=self._options) File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:483, in SparkContext.getOrCreate(cls, conf) 481 with SparkContext._lock: 482 if SparkContext._active_spark_context is None: --> 483 SparkContext(conf=conf or SparkConf()) 484 assert SparkContext._active_spark_context is not None 485 return SparkContext._active_spark_context File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:197, in SparkContext.__init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls, udf_profiler_cls) 195 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf) 196 try: --> 197 self._do_init( 198 master, 199 appName, 200 sparkHome, 201 pyFiles, 202 environment, 203 batchSize, 204 serializer, 205 conf, 206 jsc, 207 profiler_cls, 208 udf_profiler_cls, 209 ) 210 except BaseException: 211 # If an error occurs, clean up in order to allow future SparkContext creation: 212 self.stop() File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:282, in SparkContext._do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls, udf_profiler_cls) 279 self.environment["PYTHONHASHSEED"] = os.environ.get("PYTHONHASHSEED", "0") 281 # Create the Java SparkContext through Py4J --> 282 self._jsc = jsc or self._initialize_context(self._conf._jconf) 283 # Reset the SparkConf to the one actually used by the SparkContext in JVM. 
284 self._conf = SparkConf(_jconf=self._jsc.sc().conf()) File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:402, in SparkContext._initialize_context(self, jconf) 398 """ 399 Initialize SparkContext in function to allow subclass specific initialization 400 """ 401 assert self._jvm is not None --> 402 return self._jvm.JavaSparkContext(jconf) File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\py4j\java_gateway.py:1585, in JavaClass.__call__(self, *args) 1579 command = proto.CONSTRUCTOR_COMMAND_NAME +\ 1580 self._command_header +\ 1581 args_command +\ 1582 proto.END_COMMAND_PART 1584 answer = self._gateway_client.send_command(command) -> 1585 return_value = get_return_value( 1586 answer, self._gateway_client, None, self._fqn) 1588 for temp_arg in temp_args: 1589 temp_arg._detach() File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name) 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) 325 if answer == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n". 332 format(target_id, ".", name, value)) Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.ExceptionInInitializerError at org.apache.spark.unsafe.array.ByteArrayMethods.(ByteArrayMethods.java:56) at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264) at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254) at org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.memory.MemoryManager.(MemoryManager.scala:273) at org.apache.spark.memory.UnifiedMemoryManager.(UnifiedMemoryManager.scala:58) at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279) at org.apache.spark.SparkContext.(SparkContext.scala:464) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62) at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:238) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Thread.java:1570) Caused by: 
java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int) at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:113) ... 25 more Caused by: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int) at java.base/java.lang.Class.getConstructor0(Class.java:3784) at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2955) at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:71) ... 25 more
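The final cause (java.nio.DirectByteBuffer.<init>(long,int) not found) typically means the JVM being launched is newer than this PySpark build supports; recent JDKs changed that constructor, while Spark 3.x releases of this vintage expect Java 8/11/17. A hedged sketch of one fix: point the session at a supported JDK before creating it. The install path is an assumption; adjust it to your machine.
# Sketch: force PySpark to launch a supported JDK instead of the newest one on PATH.
import os

os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"   # assumed install location
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("spark-app-version-x")
         .getOrCreate())
Upgrading to a PySpark release that supports the installed JDK is the other way around the same mismatch.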
tomiealff (1 rep)
Jul 22, 2024, 12:49 PM
1 vote
0 answers
210 views
Writing large dataset from spark dataframe
We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are currently writing this out to a PostgreSQL DB (also in Azure). Currently we are using the JDBC connector to write rows out in batches to the existing table (batch size 10,000,000). This table does have a handful of indexes on it, so inserts take a while. It takes dozens of hours to complete this operation (assuming it finishes successfully at all). I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that in Databricks. I don't have a ton of Spark or Databricks experience, so any tips are appreciated.
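One pattern that is often used for this, sketched below with hedges (plain psycopg2, not Databricks-specific tooling): stream each partition through PostgreSQL's COPY ... FROM STDIN inside foreachPartition, which avoids per-row INSERT overhead. Host, credentials, table name and chunk size are placeholders, and the DataFrame's column order is assumed to match the target table. Dropping the indexes before the load and recreating them afterwards is usually the bigger win either way.
# Sketch: COPY each Spark partition into PostgreSQL in chunks.
import csv
import io
import psycopg2  # must be installed on the cluster nodes

def copy_partition(rows):
    conn = psycopg2.connect(host="pg-host", dbname="mydb", user="loader", password="...")
    cur = conn.cursor()
    buf = io.StringIO()
    writer = csv.writer(buf)
    n = 0
    for row in rows:
        writer.writerow(row)
        n += 1
        if n % 500000 == 0:                      # flush in chunks so the buffer stays small
            buf.seek(0)
            cur.copy_expert("COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
            buf = io.StringIO()
            writer = csv.writer(buf)
    if buf.getvalue():
        buf.seek(0)
        cur.copy_expert("COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()
    cur.close()
    conn.close()

df.repartition(512).foreachPartition(copy_partition)   # partition count is an assumption to tune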
Kyle Chamberlin (13 rep)
Feb 16, 2024, 12:57 AM
2 votes
2 answers
839 views
Check the time that PostgreSQL is taking to automatically create existing indexes when you do bulk insert using copy command from Spark
I am doing a bulk insert from Spark into a Postgres table. The amount of data that I am ingesting is huge: around 120-130 million records. I first save the records as multiple CSV files in a distributed storage location, i.e. an S3 bucket in my use case. Then I use multiple COPY commands to copy the data into the PostgreSQL table. The actual PostgreSQL table has four indexes on it. The COPY commands take around 8 hours to save the data. I created a similar table without indexes and the data got saved in around 28-30 minutes. Based on searches I have done on multiple sites, indexes can slow down performance, and that is definitely visible in the time difference above. Now the actual question is how I can identify which index is taking more time during the bulk insert. Is there any utility, query or command that shows the time taken to maintain the indexes on the table when we are doing bulk inserts? I am using the query below to see how many COPY commands are running on the PostgreSQL instance:
SELECT * FROM pg_stat_activity where usename = 'xyz' and application_name ='PostgreSQL JDBC Driver'
Is there something like this query, or any tool or command, that I can use to see the amount of time it is taking to update the indexes? Also, how much time is taken by each index? Any ideas, guidance or suggestions are welcome. I am not that familiar with PostgreSQL.
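PostgreSQL does not record per-index maintenance time during COPY, so the usual way to answer "which index costs the most" is indirect: load into an index-less table, then recreate each index separately and time it (on PostgreSQL 12+ you can also watch the pg_stat_progress_create_index view while an index builds). A hedged Python sketch follows; the connection details and index definitions are placeholders, not the asker's real ones.
# Sketch: time each CREATE INDEX on its own after the bulk load.
import time
import psycopg2

index_ddl = {   # placeholders: substitute the table's real four index definitions
    "idx_orders_customer": "CREATE INDEX idx_orders_customer ON orders (customer_id)",
    "idx_orders_date":     "CREATE INDEX idx_orders_date ON orders (order_date)",
}

conn = psycopg2.connect(host="pg-host", dbname="mydb", user="xyz", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    for name, ddl in index_ddl.items():
        start = time.monotonic()
        cur.execute(ddl)
        print(f"{name}: {time.monotonic() - start:.1f}s")
conn.close()
The drop-load-recreate pattern also tends to be faster overall than maintaining four indexes row by row during the COPY.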
Nikunj Kakadiya (123 rep)
Jan 21, 2021, 11:05 AM • Last activity: Dec 13, 2023, 07:04 PM
0 votes
1 answer
286 views
spark-cassandra-connector read throughput unpredictable
A user reports that the range query throughput is far higher than expected when setting spark.cassandra.input.readsPerSec in the spark-cassandra-connector. Job dependencies (the Java driver version is set to 4.13.0): com.datastax.spark spark-cassandra-connector_2.12 3.2.0; com.datastax.oss java-driver-core-shaded ...; com.datastax.oss java-driver-core 4.13.0. There are two steps in the job (both an FTS):
Dataset dataset = sparkSession.sqlContext().read() .format("org.apache.spark.sql.cassandra") .option("table", "inbox_user_msg_dummy") .option("keyspace", "ssmp_inbox2").load();
-and-
Dataset olderDataset = sparkSession.sql("SELECT * FROM inbox_user_msg_dummy where app_uuid = 'cb663e07-7bcc-4039-ae97-8fb8e8a9ff77' AND " + "create_hour = token(G9e7Y4Y, 2023-08-10T04:17:27.234Z, cb663e07-7bcc-4039-ae97-8fb8e8a9ff77) AND token(user_id, create_hour, app_uuid) <= 9121832956220923771 LIMIT 10
FWIW, avg partition size is 649 bytes, max is 2.7kb.
Paul (416 rep)
Nov 7, 2023, 07:56 PM • Last activity: Nov 8, 2023, 02:07 PM
1 vote
1 answer
865 views
Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting
I have a fact table and an scd-type-2 dimension table. I want to produce a sales report by region and year. I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little warning at the bottom: "Use range join optimization: This query has a join condition that can benefit from range join optimization. To improve performance, consider adding a range join hint." and points to this link. **Question:** Is there a more optimal way to query when I'm joining using a BETWEEN condition (instead of an = condition)? fact table:
create table sales
(name             string
,sale_date        date
,sold_amt         long);

insert into sales values
('John','2022-02-02',100),
('John','2022-03-03',100),
('John','2023-02-02',200),
('John','2023-03-03',200),
('Rick','2022-02-02',300),
('Rick','2023-02-02',400);
dimension table (scd-type-2)
create table employee_scd2
(name        string
,region      string
,start_date  date
,end_date    date
,is_current  boolean); -- unused, kept for completeness

insert into employee_scd2 values
('John','NAM',  '2010-01-01', '2022-12-31', false),
-- John transferred from NAM to APAC starting 2023
('John','APAC', '2023-01-01', '9999-01-01', true),
('Rick','NAM',  '2020-01-01', '9999-12-31', true);
sales report by region and year
select e.region,
       year(s.sale_date) as sale_year,
       SUM(s.sold_amt)   as sale_amt
from       sales s
left join  employee_scd2 e
  on     e.name       = s.name
  and    s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);
---
I've read the following and some more:
* https://dba.stackexchange.com/questions/184639/join-fact-table-to-scd-type-2-how-to-write-query-sql-server
* https://www.startdataengineering.com/post/how-to-join-fact-scd2-tables/
* How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark
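For reference, a hedged sketch of what the warning is pointing at: the Databricks RANGE_JOIN hint from the linked page, applied to the same report query. The bin size (90 here) is an assumption to tune against the typical validity-interval length; for DATE join keys it is interpreted in days, and whether the optimizer applies it to this particular left join depends on the runtime, so the linked docs are the authority.
# Sketch: the report query with a range-join hint on the SCD2 dimension.
report = spark.sql("""
    SELECT /*+ RANGE_JOIN(e, 90) */
           e.region,
           year(s.sale_date) AS sale_year,
           SUM(s.sold_amt)   AS sale_amt
    FROM sales s
    LEFT JOIN employee_scd2 e
      ON  e.name = s.name
      AND s.sale_date BETWEEN e.start_date AND e.end_date
    GROUP BY e.region, year(s.sale_date)
""")
report.show()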
Kashyap (145 rep)
Mar 7, 2023, 04:49 PM • Last activity: Jun 7, 2023, 05:43 PM
1 vote
0 answers
122 views
Why does MySQL select offset not work in pyspark?
select distinct day from t1 order by 1 desc limit 1,1 should return the second latest date in t1, but this error is thrown: > ParseException: mismatched input ',' expecting {<EOF>, ';'}
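Spark SQL simply does not accept MySQL's LIMIT offset, count form (newer Spark releases add a separate OFFSET clause instead). A window function gets the second-latest day portably; a minimal sketch, assuming the same table t1:
# Sketch: second-latest distinct day via ROW_NUMBER instead of LIMIT 1,1.
second_latest = spark.sql("""
    SELECT day
    FROM (
        SELECT day,
               ROW_NUMBER() OVER (ORDER BY day DESC) AS rn
        FROM (SELECT DISTINCT day FROM t1) d
    ) ranked
    WHERE rn = 2
""")
second_latest.show()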
uhmosdhsjxpbcrstis (11 rep)
Mar 9, 2023, 12:01 PM • Last activity: Mar 10, 2023, 04:13 PM
1 vote
2 answers
159 views
When will the Spark Cassandra connector with support for Spark 3.3 be released?
Master branch has a merged pull request (https://github.com/datastax/spark-cassandra-connector/pull/1351) supporting Spark 3.3; how long before this gets built and published to the Maven repository?
Rubber Duck (111 rep)
Jan 4, 2023, 05:15 AM • Last activity: Jan 13, 2023, 02:21 PM
2 votes
1 answer
110 views
Can we use Cassandra in place of Hadoop with Spark?
Considering we have a backend written in NodeJS that uses MySQL and Cassandra as its databases: if we want to add Spark to the system to do some data-analysis work like recommendations, can we do it with Cassandra (I mean using Spark + Cassandra) and reach the same result as we could with Hadoop (Spark + Hadoop)? I want to know what Hadoop can do that Cassandra cannot. Or what would make it essential to use Hadoop alongside Spark?
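Spark does not need Hadoop as its storage layer; it only needs a data source, and the spark-cassandra-connector (the same one that appears elsewhere on this page, e.g. com.datastax.spark spark-cassandra-connector_2.12) exposes Cassandra tables as DataFrames. A hedged sketch with placeholder keyspace/table names, assuming the connector package is on the classpath:
# Sketch: read a Cassandra table into Spark for analysis, no HDFS involved.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "shop")      # placeholder keyspace
      .option("table", "orders")       # placeholder table
      .load())

# e.g. a simple aggregation feeding a recommendation pipeline
df.groupBy("user_id").count().orderBy("count", ascending=False).show()
Where Hadoop (HDFS/YARN) still earns its place is as cheap bulk file storage and as a cluster manager; for analytics directly over data that already lives in Cassandra, Spark plus the connector is a common setup on its own.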
user20551429 (69 rep)
Nov 29, 2022, 04:41 AM • Last activity: Nov 29, 2022, 05:09 AM
1 vote
2 answers
1097 views
Cassandra-4-Update: Multiple Schema Versions
After upgrading our first node, it has a different schema version (according to nodetool describecluster). This caused Spark jobs to hang because of recurring "schema agreement not reached" messages from metadata.SchemaAgreementChecker. Is this different schema version intentional? Will the problem be gone after upgrading all nodes (first upgrade all nodes, then run nodetool upgradesstables)? Can Spark jobs be configured to overcome the hang-up? Many thanks in advance. Best Regards, Sven
Sven (11 rep)
Oct 12, 2022, 12:41 PM • Last activity: Oct 22, 2022, 07:09 AM
0 votes
0 answers
100 views
Is there an open source implementation of QGM (Query Graph Model)?
I am building a new system that needs to interact essentially as an SQL backend. We would like to import logical queries into it (e.g. from ApacheSPARQ or Postgres and related things) and want to develop an internal representation (IR) for them. Doing something similar to QGM seems like a good starting point. However, rather than inventing it all from scratch, I'd like to borrow from something that already exists and then extend as needed. Even if there is no processing code, just data structures, it would be a useful starting point. So, if there is something open source that I could look at, it would be appreciated.
intel_chris (141 rep)
Sep 25, 2021, 12:13 PM
1 vote
0 answers
147 views
MySQL is not showing the history queries after the ETL is complete
I am using Docker containers for MySQL and Spark. Both containers are on an AWS EC2 instance. A PySpark ETL connects to MySQL with JDBC and starts extracting the data. While the ETL is running I can see the long-running threads/queries in MySQL with the command show full processlist. I think this data comes from the metadata table information_schema.processlist. The problem I am facing is that after the ETL is complete, I am not able to find the queries of previous runs in any of the information_schema tables. Do I need to change any configuration? How can I solve this issue? Which schema/tables do I need to refer to? Appreciate your response.
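information_schema.processlist only shows statements that are currently executing, so finished ETL queries will not appear there. One option, sketched below with hedges, is to route MySQL's general query log to a table for the duration of a run and query it afterwards (it is verbose, so switch it off again). Any MySQL client works; pymysql is used purely for illustration and the connection details are placeholders.
# Sketch: capture executed statements in mysql.general_log during the ETL window.
import pymysql

conn = pymysql.connect(host="mysql-host", user="root", password="...", database="mysql")
with conn.cursor() as cur:
    cur.execute("SET GLOBAL log_output = 'TABLE'")   # log to mysql.general_log
    cur.execute("SET GLOBAL general_log = 'ON'")

    # ... run the ETL, then inspect what was executed:
    cur.execute("""
        SELECT event_time, LEFT(argument, 120) AS query_text
        FROM mysql.general_log
        WHERE command_type = 'Query'
        ORDER BY event_time DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)

    cur.execute("SET GLOBAL general_log = 'OFF'")    # don't leave it on permanently
conn.close()
If performance_schema is enabled, its statement-history tables are the lighter-weight alternative to the general log.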
nomad123 (19 rep)
Sep 15, 2021, 09:23 PM • Last activity: Sep 15, 2021, 11:14 PM
0 votes
3 answers
129 views
Data storage for analytics
I have to store some amount of data for analytical purposes.
- The data source produces 2TB of data per month.
- Data is collected on a monthly basis (not real-time).
- Data is fully structured.
- There are 100+ different columns of data.
- Availability of SQL is important.
- Engineer/developer resources are limited.
I planned to use Postgres (probably with a column-oriented extension); however, it would not be feasible for such data volumes (more than 20TB per year). I also looked into Hadoop/Spark; however, it seems like a rather heavyweight solution (considering that the data is fully structured). I am not considering cloud-based solutions or expensive ones (preferably free-licence). Would you be so kind as to suggest which data storage to use for big amounts of structured data for analytical purposes?
Leeloo (111 rep)
Dec 24, 2020, 12:23 PM • Last activity: Dec 24, 2020, 02:18 PM
1 vote
0 answers
422 views
Pyspark error inserting into mysql database table that exists
I am trying to insert into an existing MySQL table using a PySpark JDBC connection; however, I get the following error (picture attached). Can I get assistance on this error? The table exists in the MySQL database. I was successful in inserting with a DataFrame when selecting data from another table; however, I would like to insert directly into the MySQL database table. (Screenshot of the PySpark error omitted.) Below is the PySpark code I am using to insert into the table:
import os
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql import SparkSession
import logging 

sc = SparkSession.builder.appName("TestConnection").enableHiveSupport().getOrCreate()
sqlContext = SQLContext(sc)

# Provide your Spark-master node below
hostname = "localhost"
dbname = "dev_db"
username = "******"
password = "*******"
jdbc_url = "jdbc:mysql://{0}/{1}?user={2}&password={3}".format(hostname,dbname,username,password)


 sqlContext.sql("insert into test values ('3','Names',123654)").write.format('jdbc').options(url=jdbc_url,driver='com.mysql.jdbc.Driver').mode(append).save()
logging.info("***********Write to MySql Completed***********")
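For what it's worth, the snippet mixes two APIs: sqlContext.sql("insert into test values ...") runs the INSERT against the Spark/Hive catalog (not MySQL) and returns an empty DataFrame, so chaining .write onto it writes nothing useful, and mode(append) passes an undefined name instead of the string "append". A hedged sketch of a direct JDBC append using the same session (named sc in the snippet) and the same jdbc_url; the column names are assumptions.
# Sketch: build a DataFrame holding the row and append it to the MySQL table over JDBC.
row_df = sc.createDataFrame([(3, "Names", 123654)], ["id", "name", "value"])  # column names assumed

(row_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "test")
    .mode("append")
    .save())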
IIShriyaII (11 rep)
Jul 16, 2020, 04:09 PM • Last activity: Jul 16, 2020, 08:15 PM
0 votes
1 answer
1736 views
Auto increment field on update
I have a field called update_count, which I need to reflect the number of times a row has been updated. I have been looking into the SERIAL type for the column, which sounds promising, but it has one problem: it is not transaction safe, which means that it may get out of sync. This must not happen. My question is then: "How to handle an auto-incrementing field on each update while supporting multiple concurrent clients?" Environment: Spark.
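A sequence (SERIAL) hands out one global series of numbers, which is why it cannot track per-row update counts. The usual approach is to bump the counter in the same UPDATE that changes the row: the increment reads the current value under the row lock, so concurrent clients serialize on that row and the count cannot drift. A hedged sketch with psycopg2, assuming PostgreSQL (implied by SERIAL); table and column names are placeholders. A BEFORE UPDATE trigger doing the same increment is the alternative if application code should not have to remember it.
# Sketch: increment update_count atomically in the same UPDATE statement.
import psycopg2

conn = psycopg2.connect(host="pg-host", dbname="mydb", user="app", password="...")
with conn, conn.cursor() as cur:   # 'with conn' commits the transaction on success
    cur.execute(
        """
        UPDATE documents
        SET body = %s,
            update_count = update_count + 1
        WHERE id = %s
        """,
        ("new content", 42),
    )
conn.close()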
7heViking (101 rep)
Jul 13, 2020, 01:12 PM • Last activity: Jul 13, 2020, 01:15 PM
1 vote
0 answers
3054 views
HIVE + understanding the hive-metastore logs
We have an HDP cluster, version 2.6.4; we run a Spark streaming app and we use a Presto cluster to run Hive queries. When we look at the hivemetastore logs (under /var/log/hive), we can see the following warnings, which repeat many times in the log:
2020-01-23 16:39:37,182 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:39:37,182 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
2020-01-23 16:39:37,271 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:39:37,271 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
2020-01-23 16:42:37,156 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:42:37,157 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
2020-01-23 16:42:37,239 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:42:37,239 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
and
2020-01-23 16:15:05,827 WARN [pool-5-thread-110]: metastore.ObjectStore (ObjectStore.java:getDatabase(698)) - Failed to get database global_temp, returning NoSuchObjectException
2020-01-23 16:15:08,252 WARN [pool-5-thread-110]: metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:executeWithArray(1741)) - Failed to execute [select "PARTITIONS"."PART_ID" from "PARTITIONS" inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1 where ((cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) > ?))] with parameters [profiles_super_spammers, default, 20191223]
and
2020-01-23 15:44:49,420 WARN [main]: metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:runTestQuery(265)) - Self-test query [select "DB_ID" from "HIVE".""DBS"] failed; direct SQL is disabled
and from the log hiveserver2.log we get:
2020-01-23 15:45:18,180 ERROR [main]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:<init>(81)) - HMSHandler Fatal error: MetaException(message:Version information not found in metastore. ) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6933)
and
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:569)
and
2020-01-23 15:45:18,180 ERROR [main]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:<init>(81)) - HMSHandler Fatal error: MetaException(message:Version information not found in metastore. )
We are lost from here because we are not sure whether all the warnings in the log are complaining about a wrong Hive configuration. I would appreciate a hint or help about these logs and what direction to take.
King David (111 rep)
Jan 23, 2020, 08:50 PM • Last activity: Jan 24, 2020, 09:55 AM
1 vote
0 answers
35 views
Move from SSIS to Azure
I have an ETL solution which we have been running in conventional SSIS. The job of SSIS is just to create a star schema from the application database into dimension and fact tables. There is very little transformation happening. The problem we are facing is that on a local machine (desktop with 32GB of RAM) this job takes 16+ hours to complete if we are lucky. On most days the job fails when it runs out of memory with the error below: >A buffer failed while allocating 104850152 bytes. On the production server, which has 64GB of RAM, this job runs in 4 hours. We are trying to explore other possibilities to perform this ETL locally. Do you have any suggestions/strategies to leverage Azure + Python + HDInsight? Any other suggestions are also welcome.
imba22 (141 rep)
Feb 19, 2019, 03:23 PM • Last activity: Feb 20, 2019, 02:19 AM
0 votes
1 answer
3456 views
How to use case and filter with a group by statement and having an aggregate calculation?
I would like to do the following:
SELECT PO.col1,
       PO.col2,
       CASE
           WHEN PO.col3 <> 8 THEN PO.col4
           WHEN PO.col3 = 8 THEN
               CASE
                   WHEN (ROUND(CAST(PO.col4 AS double) - SUM(CAST(PO.col5 AS double)), 2)) > 0
                        AND SUM(CAST(PO.col5 AS double)) > 0
                   THEN ROUND(CAST(PO.col4 AS double) - SUM(CAST(PO.col5 AS double)), 2)
               END
       END AS Quantity
FROM my_table AS PO
GROUP BY PO.col1, PO.col2
HAVING Quantity > 0
I apologize for not being able to provide sample data/columns; I am not allowed to share my work data. The logic which I have been told to implement is as follows. For each unique pair combination of col1 and col2, get a constructed attribute called Quantity. Quantity = col4 wherever col3 = 8. If col3 != 8, then check SUM(col5) > 0. If yes, then check col4 - SUM(col5) > 0. If yes, then keep the record, else skip. Out of my own observation I checked that COUNT(DISTINCT col4) for each unique pair of col1 and col2 is 1. I hope that's helpful for answering my question. We are working with Spark SQL, so I don't know what tag to use for this in the question. SAMPLE DATA
scientific_explorer (133 rep)
Jan 24, 2019, 08:08 AM • Last activity: Jan 25, 2019, 07:19 PM
Showing page 1 of 20 total questions