
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

1 vote
0 answers
16 views
hive - Can not create the managed table The associated location already exists
I'm trying to create a managed Hive table using Spark SQL with the following query:
DROP TABLE IF EXISTS db.TMP_ARR;
CREATE TABLE db.TMP_ARR AS SELECT ID, - more fields.. FROM some_source_table INT;
However, the job fails with the following error:
org.apache.spark.sql.AnalysisException: Can not create the managed table('db.tmp_arr'). The associated location ('hdfs://coreCluster/warehouse/tablespace/managed/hive/db.db/tmp_arr') already exists
**What I understand:** I'm trying to create a managed table. Spark expects that the target location in HDFS does not already exist when creating a managed table. Apparently, that folder already exists, possibly due to a previous failed run or manual intervention.
**My questions:** Why does Spark throw this error even though I used DROP TABLE IF EXISTS before CREATE TABLE? What's the correct way to ensure a managed table can be created without this conflict? Should I manually delete the path in HDFS before creating the table, or is there a safer/better approach?
**Environment:** Spark version: 3.3.2; Hive metastore: enabled; Storage: HDFS
*1. It's important that the table is managed (not external) and that we don't manually assign a LOCATION. 2. Many similar jobs run concurrently (creating/dropping managed tables in the same Hive schema).*
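A note on the likely mechanism: DROP TABLE IF EXISTS only removes the metastore entry, so a warehouse directory left behind by a crashed or concurrent job (files written, table never registered) survives the drop and later blocks the CTAS. Below is a minimal PySpark sketch of one workaround, purging the leftover directory before re-creating the table; the path is taken from the error message above, and everything else (app name, abbreviated column list) is illustrative only, not part of the question.
# Sketch: drop the table, then remove any orphaned warehouse directory
# before re-running the CTAS. Adjust the path to your warehouse layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recreate-tmp-arr").enableHiveSupport().getOrCreate()

spark.sql("DROP TABLE IF EXISTS db.TMP_ARR")

leftover = "hdfs://coreCluster/warehouse/tablespace/managed/hive/db.db/tmp_arr"  # from the error
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(leftover)
fs = path.getFileSystem(hadoop_conf)
if fs.exists(path):
    fs.delete(path, True)  # recursive delete of the orphaned directory

spark.sql("CREATE TABLE db.TMP_ARR AS SELECT ID FROM some_source_table")  # columns abbreviated
With many jobs creating and dropping tables in the same schema concurrently, per-job table names avoid two jobs racing on the same location.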
hieutmbk (11 rep)
Aug 1, 2025, 02:44 AM
0 votes
1 answer
969 views
Bulk SQL INSERT into Azure SQL Database using spark causes blocking/contention?
I am running the following code using Microsoft's SQL Spark connector to write a 1-2 billion row dataframe into Azure SQL Database.
df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", secrets.db.url) \
    .option("dbtable", 'tableName') \
    .option("user", secrets.db.user) \
    .option("password", secrets.db.password) \
    .option("batchsize", 1048576) \
    .option("schemaCheckEnabled", "false") \
    .option("BulkCopyTimeout", 3600) \
    .save()
This is a snapshot of the DB utilization graph, and these are the first few rows from the following query using sp_whoisactive (screenshots not included here):
EXEC sp_WhoIsActive @find_block_leaders = 1, @sort_order = '[blocked_session_count] DESC'
The wait_info column's value is Resource_Semaphore. Configs: my dataframe is partitioned over 2100 partitions on a cluster of 900 cores; my database is 14 vCores in the General Purpose tier on Azure. My query is incredibly slow because of this blocking. It's almost like it's running one bulk insert from my cluster at a time. Any suggestions on what to change to speed it up? Or any insights into why it's blocking?
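RESOURCE_SEMAPHORE waits mean the sessions are queuing for query memory grants, so 2100 partitions all bulk-inserting into a 14-vCore database will mostly wait on each other. A hedged sketch of one mitigation: cap the number of concurrent writers before the write. df, secrets and the connector options are taken from the question; the partition count and smaller batch size are assumptions to tune, not recommendations from the source.
# Sketch: fewer, larger writers instead of 2100 concurrent bulk-copy sessions.
write_partitions = 8            # assumption: a small multiple of the DB's vCores
(df.coalesce(write_partitions)
   .write
   .format("com.microsoft.sqlserver.jdbc.spark")
   .mode("append")
   .option("url", secrets.db.url)
   .option("dbtable", "tableName")
   .option("user", secrets.db.user)
   .option("password", secrets.db.password)
   .option("batchsize", 100000)          # smaller batches keep each memory grant modest
   .option("schemaCheckEnabled", "false")
   .option("BulkCopyTimeout", 3600)
   .save())
Scaling the database up (or to a tier with more memory) while the load runs is the complementary lever if the cluster side cannot be throttled.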
Youssef Fares
Dec 22, 2020, 03:35 PM • Last activity: Mar 11, 2025, 07:09 PM
2 votes
1 answer
677 views
Best practices for large JOINs - Warehouse or External Compute (e.g. Spark)
I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward but the amount of data I am processing is very large. I am wondering, for very large JOINs, is there a preferred type of technology? For example, is it more effective to use a Data Warehouse (like Snowflake) or some other MPP system like Spark? To make the problem more concrete I created a hypothetical problem similar to my actual problem. Assume I have a table that looks like this (screenshot of the accounts table omitted). I am working on some logic that requires account pairs that have the same name. To find pairs of accounts with the same name I can easily do something like this:
SELECT 
	account1.name, 
    account2.name
FROM accounts as account1
JOIN accounts as account2 ON account1.name = account2.name AND account1.acount_id != account2.acount_id
The problem I am facing is due to the amount of data I am processing. There are roughly ~2 trillion records I am trying to self JOIN on. Obviously, this will take some time and some pretty serious compute. I have run a similar query in Snowflake using XL and 3XL warehouses but after several hours of running, I canceled them. I am hoping there is a more cost-effective or time-efficient way. Has anyone had success with massive JOINs? Are there any other tricks I could deploy? What tool did you find the most effective?
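One trick that often helps here, shown as a hedged Spark SQL sketch using the table and column names from the query above: only names that occur more than once can form a pair, so shrink the join input to those names first, and use an ordered inequality so each pair appears only once.
# Sketch: pre-filter to duplicated names before the self-join.
dup_names = spark.sql("""
    SELECT name
    FROM accounts
    GROUP BY name
    HAVING COUNT(*) > 1
""")
dup_names.createOrReplaceTempView("dup_names")

pairs = spark.sql("""
    SELECT account1.acount_id AS id1,
           account2.acount_id AS id2,
           account1.name
    FROM accounts AS account1
    JOIN dup_names AS d       ON account1.name = d.name
    JOIN accounts  AS account2 ON account1.name = account2.name
                              AND account1.acount_id < account2.acount_id  -- '<' also removes mirrored pairs
""")
If the name column is heavily skewed (one name shared by millions of accounts), the pair count explodes regardless of engine, so checking the distribution of COUNT(*) per name first is usually worth the extra pass.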
Arthur Putnam (553 rep)
Feb 24, 2022, 09:06 PM • Last activity: Dec 18, 2024, 12:01 PM
0 votes
0 answers
739 views
Py4JJavaError while creating SparkSession
I encountered this error while trying to create a Spark session in a Python virtual environment, using VS Code as the IDE. The code I ran and the output are below; please help. **Code**
# Create SparkSession
spark = SparkSession.builder\
            .master("local")\
            .appName("spark-app-version-x")\
            .getOrCreate()
` **Output** --------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) Cell In, line 5 1 # Create SparkSession 2 spark = SparkSession.builder\ 3 .master("local")\ 4 .appName("spark-app-version-x")\ ----> 5 .getOrCreate() File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\sql\session.py:269, in SparkSession.Builder.getOrCreate(self) 267 sparkConf.set(key, value) 268 # This SparkContext may be an existing one. --> 269 sc = SparkContext.getOrCreate(sparkConf) 270 # Do not update SparkConf for existing SparkContext, as it's shared 271 # by all sessions. 272 session = SparkSession(sc, options=self._options) File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:483, in SparkContext.getOrCreate(cls, conf) 481 with SparkContext._lock: 482 if SparkContext._active_spark_context is None: --> 483 SparkContext(conf=conf or SparkConf()) 484 assert SparkContext._active_spark_context is not None 485 return SparkContext._active_spark_context File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:197, in SparkContext.__init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls, udf_profiler_cls) 195 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf) 196 try: --> 197 self._do_init( 198 master, 199 appName, 200 sparkHome, 201 pyFiles, 202 environment, 203 batchSize, 204 serializer, 205 conf, 206 jsc, 207 profiler_cls, 208 udf_profiler_cls, 209 ) 210 except BaseException: 211 # If an error occurs, clean up in order to allow future SparkContext creation: 212 self.stop() File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:282, in SparkContext._do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls, udf_profiler_cls) 279 self.environment["PYTHONHASHSEED"] = os.environ.get("PYTHONHASHSEED", "0") 281 # Create the Java SparkContext through Py4J --> 282 self._jsc = jsc or self._initialize_context(self._conf._jconf) 283 # Reset the SparkConf to the one actually used by the SparkContext in JVM. 
284 self._conf = SparkConf(_jconf=self._jsc.sc().conf()) File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\pyspark\context.py:402, in SparkContext._initialize_context(self, jconf) 398 """ 399 Initialize SparkContext in function to allow subclass specific initialization 400 """ 401 assert self._jvm is not None --> 402 return self._jvm.JavaSparkContext(jconf) File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\py4j\java_gateway.py:1585, in JavaClass.__call__(self, *args) 1579 command = proto.CONSTRUCTOR_COMMAND_NAME +\ 1580 self._command_header +\ 1581 args_command +\ 1582 proto.END_COMMAND_PART 1584 answer = self._gateway_client.send_command(command) -> 1585 return_value = get_return_value( 1586 answer, self._gateway_client, None, self._fqn) 1588 for temp_arg in temp_args: 1589 temp_arg._detach() File c:\Users\Oluwa\Documents\DATA PROJECTS\Hands-On Real Time PySpark Project for Beginners\Lab PySpark_1\PSP_1-Venv\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name) 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) 325 if answer == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n". 332 format(target_id, ".", name, value)) Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.ExceptionInInitializerError at org.apache.spark.unsafe.array.ByteArrayMethods.(ByteArrayMethods.java:56) at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264) at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254) at org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.memory.MemoryManager.(MemoryManager.scala:273) at org.apache.spark.memory.UnifiedMemoryManager.(UnifiedMemoryManager.scala:58) at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279) at org.apache.spark.SparkContext.(SparkContext.scala:464) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62) at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:238) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Thread.java:1570) Caused by: 
java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int) at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:113) ... 25 more Caused by: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int) at java.base/java.lang.Class.getConstructor0(Class.java:3784) at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2955) at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:71) ... 25 more
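The final cause (java.nio.DirectByteBuffer.<init>(long,int) not found) typically means the JVM being launched is newer than this PySpark build supports; recent JDKs changed that constructor, while Spark 3.x releases of this vintage expect Java 8/11/17. A hedged sketch of one fix: point the session at a supported JDK before creating it. The install path is an assumption; adjust it to your machine.
# Sketch: force PySpark to launch a supported JDK instead of the newest one on PATH.
import os

os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"   # assumed install location
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("spark-app-version-x")
         .getOrCreate())
Upgrading to a PySpark release that supports the installed JDK is the other way around the same mismatch.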
tomiealff (1 rep)
Jul 22, 2024, 12:49 PM
1 vote
0 answers
210 views
Writing large dataset from spark dataframe
We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are currently writing this out to a PostgreSQL DB (also in Azure). Currently we are using the JDBC connector to write rows out in batches to the existing table (batch size 10,000,000). This table does have a handful of indexes on it, so inserts take a while. It takes dozens of hours to complete this operation (assuming it finishes successfully at all). I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that in Databricks. I don't have a ton of Spark or Databricks experience, so any tips are appreciated.
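One pattern that is often used for this, sketched below with hedges (plain psycopg2, not Databricks-specific tooling): stream each partition through PostgreSQL's COPY ... FROM STDIN inside foreachPartition, which avoids per-row INSERT overhead. Host, credentials, table name and chunk size are placeholders, and the DataFrame's column order is assumed to match the target table. Dropping the indexes before the load and recreating them afterwards is usually the bigger win either way.
# Sketch: COPY each Spark partition into PostgreSQL in chunks.
import csv
import io
import psycopg2  # must be installed on the cluster nodes

def copy_partition(rows):
    conn = psycopg2.connect(host="pg-host", dbname="mydb", user="loader", password="...")
    cur = conn.cursor()
    buf = io.StringIO()
    writer = csv.writer(buf)
    n = 0
    for row in rows:
        writer.writerow(row)
        n += 1
        if n % 500000 == 0:                      # flush in chunks so the buffer stays small
            buf.seek(0)
            cur.copy_expert("COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
            buf = io.StringIO()
            writer = csv.writer(buf)
    if buf.getvalue():
        buf.seek(0)
        cur.copy_expert("COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()
    cur.close()
    conn.close()

df.repartition(512).foreachPartition(copy_partition)   # partition count is an assumption to tune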
Kyle Chamberlin (13 rep)
Feb 16, 2024, 12:57 AM
2 votes
2 answers
839 views
Check the time that PostgreSQL is taking to automatically create existing indexes when you do bulk insert using copy command from Spark
I am doing a bulk insert from Spark into a Postgres table. The amount of data that I am ingesting is huge: around 120-130 million records. I first save the records as multiple CSV files in a distributed storage location, i.e. an S3 bucket in my use case. Then I use multiple COPY commands to copy the data into the PostgreSQL table. The actual PostgreSQL table has four indexes on it. The COPY commands take around 8 hours to save the data. I created a similar table without indexes and the data got saved in around 28-30 minutes. Based on searches I have done on multiple sites, indexes can slow down performance, and that is definitely visible in the time difference above. Now the actual question is how I can identify which index is taking more time during the bulk insert. Is there any utility, query or command that shows the time taken to maintain the indexes on the table when we are doing bulk inserts? I am using the query below to see how many COPY commands are running on the PostgreSQL instance:
SELECT * FROM pg_stat_activity where usename = 'xyz' and application_name ='PostgreSQL JDBC Driver'
Is there something like this query, or any tool or command, that I can use to see the amount of time it is taking to update the indexes? Also, how much time is taken by each index? Any ideas, guidance or suggestions are welcome. I am not that familiar with PostgreSQL.
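PostgreSQL does not record per-index maintenance time during COPY, so the usual way to answer "which index costs the most" is indirect: load into an index-less table, then recreate each index separately and time it (on PostgreSQL 12+ you can also watch the pg_stat_progress_create_index view while an index builds). A hedged Python sketch follows; the connection details and index definitions are placeholders, not the asker's real ones.
# Sketch: time each CREATE INDEX on its own after the bulk load.
import time
import psycopg2

index_ddl = {   # placeholders: substitute the table's real four index definitions
    "idx_orders_customer": "CREATE INDEX idx_orders_customer ON orders (customer_id)",
    "idx_orders_date":     "CREATE INDEX idx_orders_date ON orders (order_date)",
}

conn = psycopg2.connect(host="pg-host", dbname="mydb", user="xyz", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    for name, ddl in index_ddl.items():
        start = time.monotonic()
        cur.execute(ddl)
        print(f"{name}: {time.monotonic() - start:.1f}s")
conn.close()
The drop-load-recreate pattern also tends to be faster overall than maintaining four indexes row by row during the COPY.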
Nikunj Kakadiya (123 rep)
Jan 21, 2021, 11:05 AM • Last activity: Dec 13, 2023, 07:04 PM
0 votes
1 answer
286 views
spark-cassandra-connector read throughput unpredictable
A user reports that the range query throughput is far higher than expected when setting spark.cassandra.input.readsPerSec in the spark-cassandra-connector. Job dependencies (the Java driver version is set to 4.13.0): com.datastax.spark spark-cassandra-connector_2.12 3.2.0; com.datastax.oss java-driver-core-shaded ...; com.datastax.oss java-driver-core 4.13.0. There are two steps in the job (both an FTS):
Dataset dataset = sparkSession.sqlContext().read() .format("org.apache.spark.sql.cassandra") .option("table", "inbox_user_msg_dummy") .option("keyspace", "ssmp_inbox2").load();
-and-
Dataset olderDataset = sparkSession.sql("SELECT * FROM inbox_user_msg_dummy where app_uuid = 'cb663e07-7bcc-4039-ae97-8fb8e8a9ff77' AND " + "create_hour = token(G9e7Y4Y, 2023-08-10T04:17:27.234Z, cb663e07-7bcc-4039-ae97-8fb8e8a9ff77) AND token(user_id, create_hour, app_uuid) <= 9121832956220923771 LIMIT 10
FWIW, avg partition size is 649 bytes, max is 2.7kb.
Paul (416 rep)
Nov 7, 2023, 07:56 PM • Last activity: Nov 8, 2023, 02:07 PM
1 vote
1 answer
865 views
Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting
I have a fact table and an scd-type-2 dimension table. I want to produce a sales report by region and year. I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little warning at the bottom: "Use range join optimization: This query has a join condition that can benefit from range join optimization. To improve performance, consider adding a range join hint." and points to this link. **Question:** Is there a more optimal way to query when I'm joining using a BETWEEN condition (instead of an = condition)? fact table:
create table sales
(name             string
,sale_date        date
,sold_amt         long);

insert into sales values
('John','2022-02-02',100),
('John','2022-03-03',100),
('John','2023-02-02',200),
('John','2023-03-03',200),
('Rick','2022-02-02',300),
('Rick','2023-02-02',400);
dimension table (scd-type-2)
create table employee_scd2
(name        string
,region      string
,start_date  date
,end_date    date
,is_current  boolean); -- unused, kept for completeness

insert into employee_scd2 values
('John','NAM',  '2010-01-01', '2022-12-31', false),
-- John transferred from NAM to APAC starting 2023
('John','APAC', '2023-01-01', '9999-01-01', true),
('Rick','NAM',  '2020-01-01', '9999-12-31', true);
sales report by region and year
select e.region,
       year(s.sale_date) as sale_year,
       SUM(s.sold_amt)   as sale_amt
from       sales s
left join  employee_scd2 e
  on     e.name       = s.name
  and    s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);
---
I've read the following and some more:
* https://dba.stackexchange.com/questions/184639/join-fact-table-to-scd-type-2-how-to-write-query-sql-server
* https://www.startdataengineering.com/post/how-to-join-fact-scd2-tables/
* How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark
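For reference, a hedged sketch of what the warning is pointing at: the Databricks RANGE_JOIN hint from the linked page, applied to the same report query. The bin size (90 here) is an assumption to tune against the typical validity-interval length; for DATE join keys it is interpreted in days, and whether the optimizer applies it to this particular left join depends on the runtime, so the linked docs are the authority.
# Sketch: the report query with a range-join hint on the SCD2 dimension.
report = spark.sql("""
    SELECT /*+ RANGE_JOIN(e, 90) */
           e.region,
           year(s.sale_date) AS sale_year,
           SUM(s.sold_amt)   AS sale_amt
    FROM sales s
    LEFT JOIN employee_scd2 e
      ON  e.name = s.name
      AND s.sale_date BETWEEN e.start_date AND e.end_date
    GROUP BY e.region, year(s.sale_date)
""")
report.show()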
Kashyap (145 rep)
Mar 7, 2023, 04:49 PM • Last activity: Jun 7, 2023, 05:43 PM
1 vote
0 answers
122 views
Why does MySQL select offset not work in pyspark?
select distinct day from t1 order by 1 desc limit 1,1 should return the second latest date in t1, but this error is thrown: > ParseException: mismatched input ',' expecting {<EOF>, ';'}
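Spark SQL simply does not accept MySQL's LIMIT offset, count form (newer Spark releases add a separate OFFSET clause instead). A window function gets the second-latest day portably; a minimal sketch, assuming the same table t1:
# Sketch: second-latest distinct day via ROW_NUMBER instead of LIMIT 1,1.
second_latest = spark.sql("""
    SELECT day
    FROM (
        SELECT day,
               ROW_NUMBER() OVER (ORDER BY day DESC) AS rn
        FROM (SELECT DISTINCT day FROM t1) d
    ) ranked
    WHERE rn = 2
""")
second_latest.show()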
uhmosdhsjxpbcrstis (11 rep)
Mar 9, 2023, 12:01 PM • Last activity: Mar 10, 2023, 04:13 PM
1 vote
2 answers
159 views
When will the Spark Cassandra connector with support for Spark 3.3 be released?
Master branch has a merged pull request (https://github.com/datastax/spark-cassandra-connector/pull/1351) supporting Spark 3.3; how long before this gets built and published to the Maven repository?
Rubber Duck (111 rep)
Jan 4, 2023, 05:15 AM • Last activity: Jan 13, 2023, 02:21 PM
2 votes
1 answer
110 views
Can we use Cassandra in place of Hadoop with Spark?
Considering we have a backend written in NodeJS that uses MySQL and Cassandra as its databases: if we want to add Spark to the system to do some data-analysis work like recommendations, can we do it with Cassandra (I mean using Spark + Cassandra) and reach the same result as we could with Hadoop (Spark + Hadoop)? I want to know what Hadoop can do that Cassandra cannot. Or what would make it essential to use Hadoop alongside Spark?
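Spark does not need Hadoop as its storage layer; it only needs a data source, and the spark-cassandra-connector (the same one that appears elsewhere on this page, e.g. com.datastax.spark spark-cassandra-connector_2.12) exposes Cassandra tables as DataFrames. A hedged sketch with placeholder keyspace/table names, assuming the connector package is on the classpath:
# Sketch: read a Cassandra table into Spark for analysis, no HDFS involved.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "shop")      # placeholder keyspace
      .option("table", "orders")       # placeholder table
      .load())

# e.g. a simple aggregation feeding a recommendation pipeline
df.groupBy("user_id").count().orderBy("count", ascending=False).show()
Where Hadoop (HDFS/YARN) still earns its place is as cheap bulk file storage and as a cluster manager; for analytics directly over data that already lives in Cassandra, Spark plus the connector is a common setup on its own.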
user20551429 (69 rep)
Nov 29, 2022, 04:41 AM • Last activity: Nov 29, 2022, 05:09 AM
1 vote
2 answers
1097 views
Cassandra-4-Update: Multiple Schema Versions
After upgrading our first node, it has a different schema version (according to nodetool describecluster). This caused Spark jobs to hang because of recurring "schema agreement not reached" messages from metadata.SchemaAgreementChecker. Is this different schema version intentional? Will the problem be gone after upgrading all nodes (first upgrade all nodes, then run nodetool upgradesstables)? Can Spark jobs be configured to overcome the hang-up? Many thanks in advance. Best Regards, Sven
Sven (11 rep)
Oct 12, 2022, 12:41 PM • Last activity: Oct 22, 2022, 07:09 AM
0 votes
0 answers
100 views
Is there an open source implementation of QGM (Query Graph Model)?
I am building a new system that needs to interact essentially as an SQL backend. We would like to import logical queries into it (e.g. from ApacheSPARQ or Postgres and related things) and want to develop an internal representation (IR) for them. Doing something similar to QGM seems like a good starting point. However, rather than inventing it all from scratch, I'd like to borrow from something that already exists and then extend as needed. Even if there is no processing code, just data structures, it would be a useful starting point. So, if there is something open source that I could look at, it would be appreciated.
intel_chris (141 rep)
Sep 25, 2021, 12:13 PM
1 vote
0 answers
147 views
MySQL is not showing the history queries after the ETL is complete
I am using Docker containers for MySQL and Spark. Both containers are on an AWS EC2 instance. A PySpark ETL connects to MySQL with JDBC and starts extracting the data. While the ETL is running I can see the long-running threads/queries in MySQL with the command show full processlist. I think this data comes from the metadata table information_schema.processlist. The problem I am facing is that after the ETL is complete, I am not able to find the queries of previous runs in any of the information_schema tables. Do I need to change any configuration? How can I solve this issue? Which schema/tables do I need to refer to? Appreciate your response.
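information_schema.processlist only shows statements that are currently executing, so finished ETL queries will not appear there. One option, sketched below with hedges, is to route MySQL's general query log to a table for the duration of a run and query it afterwards (it is verbose, so switch it off again). Any MySQL client works; pymysql is used purely for illustration and the connection details are placeholders.
# Sketch: capture executed statements in mysql.general_log during the ETL window.
import pymysql

conn = pymysql.connect(host="mysql-host", user="root", password="...", database="mysql")
with conn.cursor() as cur:
    cur.execute("SET GLOBAL log_output = 'TABLE'")   # log to mysql.general_log
    cur.execute("SET GLOBAL general_log = 'ON'")

    # ... run the ETL, then inspect what was executed:
    cur.execute("""
        SELECT event_time, LEFT(argument, 120) AS query_text
        FROM mysql.general_log
        WHERE command_type = 'Query'
        ORDER BY event_time DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)

    cur.execute("SET GLOBAL general_log = 'OFF'")    # don't leave it on permanently
conn.close()
If performance_schema is enabled, its statement-history tables are the lighter-weight alternative to the general log.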
nomad123 (19 rep)
Sep 15, 2021, 09:23 PM • Last activity: Sep 15, 2021, 11:14 PM
0 votes
3 answers
129 views
Data storage for analytics
I have to store some amount of data for analytical purposes.
- The data source produces 2TB of data per month.
- Data is collected on a monthly basis (not real-time).
- Data is fully structured.
- There are 100+ different columns of data.
- Availability of SQL is important.
- Engineer/developer resources are limited.
I planned to use Postgres (probably with a column-oriented extension); however, it would not be feasible for such data volumes (more than 20TB per year). I also looked into Hadoop/Spark; however, it seems like a rather heavyweight solution (considering that the data is fully structured). I am not considering cloud-based solutions or expensive ones (preferably free-licence). Would you be so kind as to suggest which data storage to use for big amounts of structured data for analytical purposes?
Leeloo (111 rep)
Dec 24, 2020, 12:23 PM • Last activity: Dec 24, 2020, 02:18 PM
1 vote
0 answers
422 views
Pyspark error inserting into mysql database table that exists
I am trying to insert into an existing MySQL table using a PySpark JDBC connection; however, I get the following error (picture attached). Can I get assistance on this error? The table exists in the MySQL database. I was successful in inserting with a DataFrame when selecting data from another table; however, I would like to insert directly into the MySQL database table. (Screenshot of the PySpark error omitted.) Below is the PySpark code I am using to insert into the table:
import os
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql import SparkSession
import logging 

sc = SparkSession.builder.appName("TestConnection").enableHiveSupport().getOrCreate()
sqlContext = SQLContext(sc)

# Provide your Spark-master node below
hostname = "localhost"
dbname = "dev_db"
username = "******"
password = "*******"
jdbc_url = "jdbc:mysql://{0}/{1}?user={2}&password={3}".format(hostname,dbname,username,password)


 sqlContext.sql("insert into test values ('3','Names',123654)").write.format('jdbc').options(url=jdbc_url,driver='com.mysql.jdbc.Driver').mode(append).save()
logging.info("***********Write to MySql Completed***********")
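For what it's worth, the snippet mixes two APIs: sqlContext.sql("insert into test values ...") runs the INSERT against the Spark/Hive catalog (not MySQL) and returns an empty DataFrame, so chaining .write onto it writes nothing useful, and mode(append) passes an undefined name instead of the string "append". A hedged sketch of a direct JDBC append using the same session (named sc in the snippet) and the same jdbc_url; the column names are assumptions.
# Sketch: build a DataFrame holding the row and append it to the MySQL table over JDBC.
row_df = sc.createDataFrame([(3, "Names", 123654)], ["id", "name", "value"])  # column names assumed

(row_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "test")
    .mode("append")
    .save())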
IIShriyaII (11 rep)
Jul 16, 2020, 04:09 PM • Last activity: Jul 16, 2020, 08:15 PM
0 votes
1 answer
1736 views
Auto increment field on update
I have a field called update_count, which I need to reflect the number of times a row has been updated. I have been looking into the SERIAL type for the column, which sounds promising, but it has one problem: it is not transaction safe, which means that it may get out of sync. This must not happen. My question is then: "How to handle an auto-incrementing field on each update while supporting multiple concurrent clients?" Environment: Spark.
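A sequence (SERIAL) hands out one global series of numbers, which is why it cannot track per-row update counts. The usual approach is to bump the counter in the same UPDATE that changes the row: the increment reads the current value under the row lock, so concurrent clients serialize on that row and the count cannot drift. A hedged sketch with psycopg2, assuming PostgreSQL (implied by SERIAL); table and column names are placeholders. A BEFORE UPDATE trigger doing the same increment is the alternative if application code should not have to remember it.
# Sketch: increment update_count atomically in the same UPDATE statement.
import psycopg2

conn = psycopg2.connect(host="pg-host", dbname="mydb", user="app", password="...")
with conn, conn.cursor() as cur:   # 'with conn' commits the transaction on success
    cur.execute(
        """
        UPDATE documents
        SET body = %s,
            update_count = update_count + 1
        WHERE id = %s
        """,
        ("new content", 42),
    )
conn.close()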
7heViking (101 rep)
Jul 13, 2020, 01:12 PM • Last activity: Jul 13, 2020, 01:15 PM
1 vote
0 answers
3054 views
HIVE + understanding the hive-metastore logs
We have an HDP cluster, version 2.6.4; we run a Spark streaming app and we use a Presto cluster to run Hive queries. When we look at the hivemetastore logs (under /var/log/hive), we can see the following warnings, which repeat many times in the log:
2020-01-23 16:39:37,182 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:39:37,182 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
2020-01-23 16:39:37,271 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:39:37,271 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
2020-01-23 16:42:37,156 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:42:37,157 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
2020-01-23 16:42:37,239 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.dir does not exist
2020-01-23 16:42:37,239 WARN [pool-5-thread-200]: conf.HiveConf (HiveConf.java:initialize(3093)) - HiveConf of name hive.log.file does not exist
and
2020-01-23 16:15:05,827 WARN [pool-5-thread-110]: metastore.ObjectStore (ObjectStore.java:getDatabase(698)) - Failed to get database global_temp, returning NoSuchObjectException
2020-01-23 16:15:08,252 WARN [pool-5-thread-110]: metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:executeWithArray(1741)) - Failed to execute [select "PARTITIONS"."PART_ID" from "PARTITIONS" inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1 where ((cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) > ?))] with parameters [profiles_super_spammers, default, 20191223]
and
2020-01-23 15:44:49,420 WARN [main]: metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:runTestQuery(265)) - Self-test query [select "DB_ID" from "HIVE".""DBS"] failed; direct SQL is disabled
and from the log hiveserver2.log we get:
2020-01-23 15:45:18,180 ERROR [main]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:<init>(81)) - HMSHandler Fatal error: MetaException(message:Version information not found in metastore. ) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6933)
and
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:569)
and
2020-01-23 15:45:18,180 ERROR [main]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:<init>(81)) - HMSHandler Fatal error: MetaException(message:Version information not found in metastore. )
We are lost from here because we are not sure whether all the warnings in the log are complaining about a wrong Hive configuration. I would appreciate a hint or help about these logs and what direction to take.
King David (111 rep)
Jan 23, 2020, 08:50 PM • Last activity: Jan 24, 2020, 09:55 AM
1 vote
0 answers
35 views
Move from SSIS to Azure
I have an ETL solution which we have been running in conventional SSIS. The job of SSIS is just to create a star schema from the application database into dimension and fact tables. There is very little transformation happening. The problem we are facing is that on a local machine (desktop with 32GB of RAM) this job takes 16+ hours to complete if we are lucky. On most days the job fails when it runs out of memory with the error below: >A buffer failed while allocating 104850152 bytes. On the production server, which has 64GB of RAM, this job runs in 4 hours. We are trying to explore other possibilities to perform this ETL locally. Do you have any suggestions/strategies to leverage Azure + Python + HDInsight? Any other suggestions are also welcome.
imba22 (141 rep)
Feb 19, 2019, 03:23 PM • Last activity: Feb 20, 2019, 02:19 AM
0 votes
1 answer
3456 views
How to use case and filter with a group by statement and having an aggregate calculation?
I would like to do the following:
SELECT PO.col1,
       PO.col2,
       CASE
           WHEN PO.col3 <> 8 THEN PO.col4
           WHEN PO.col3 = 8 THEN
               CASE
                   WHEN (ROUND(CAST(PO.col4 AS double) - SUM(CAST(PO.col5 AS double)), 2)) > 0
                        AND SUM(CAST(PO.col5 AS double)) > 0
                   THEN ROUND(CAST(PO.col4 AS double) - SUM(CAST(PO.col5 AS double)), 2)
               END
       END AS Quantity
FROM my_table AS PO
GROUP BY PO.col1, PO.col2
HAVING Quantity > 0
I apologize for not being able to provide sample data/columns; I am not allowed to share my work data. The logic which I have been told to implement is as follows. For each unique pair combination of col1 and col2, get a constructed attribute called Quantity. Quantity = col4 wherever col3 = 8. If col3 != 8, then check SUM(col5) > 0. If yes, then check col4 - SUM(col5) > 0. If yes, then keep the record, else skip. Out of my own observation I checked that COUNT(DISTINCT col4) for each unique pair of col1 and col2 is 1. I hope that's helpful for answering my question. We are working with Spark SQL, so I don't know what tag to use for this in the question. SAMPLE DATA
scientific_explorer (133 rep)
Jan 24, 2019, 08:08 AM • Last activity: Jan 25, 2019, 07:19 PM
Showing page 1 of 20 total questions