Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
1 vote · 1 answer · 808 views
IDE Access to Hive without user password?
I'm a DB developer, and in my company we're only able to interact with Hive through a technical user who owns the write permissions for Hive on HDFS.
In practice this looks like this:
1. I connect to our remote server using SSH with my user credentials
2. I switch to the technical hive user with
sudo su - hive_user
(the DBAs won't hand out the password for this; they prefer to grant us only a specific sudo permission for switching users like this)
3. I execute a query using beeline -f QUERY_FILE
So as you can see, I'm bound to the CLI (beeline) all the time, but I'd like the convenience of using a SQL IDE from my desktop.
Is there any IDE for Hive that allows me to connect to our DB as a technical user that can only be accessed via sudo su - hive_user? A link to a manual for this would also be nice.
Mayak
(181 rep)
Oct 2, 2019, 11:10 AM
• Last activity: Aug 3, 2025, 03:06 PM
1 vote · 0 answers · 16 views
hive - Can not create the managed table The associated location already exists
I'm trying to create a managed Hive table using Spark SQL with the following query:
DROP TABLE IF EXISTS db.TMP_ARR;
CREATE TABLE db.TMP_ARR AS
SELECT ID,
-- more fields..
FROM some_source_table INT;
However, the job fails with the following error:
org.apache.spark.sql.AnalysisException: Can not create the managed table('db.tmp_arr'). The associated location ('hdfs://coreCluster/warehouse/tablespace/managed/hive/db.db/tmp_arr') already exists
**What I understand:** I'm trying to create a managed table.
Spark expects that the target location in HDFS does not already exist when creating a managed table.
Apparently, that folder already exists, possibly due to a previous failed run or manual intervention.
**My questions:**
Why does Spark throw this error even though I used DROP TABLE IF EXISTS before CREATE TABLE?
What's the correct way to ensure a managed table can be created without this conflict?
Should I manually delete the path in HDFS before creating the table, or is there a safer/better approach?
**Environment:** Spark version: 3.3.2
Hive metastore: enabled
Storage: HDFS
*1. It's important that the table is managed (not external), and that we don’t manually assign a LOCATION.
2. Many similar jobs are running concurrently (creating/dropping managed tables in the same Hive schema).*
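For illustration only, a hedged sketch of the drop-side mitigation, assuming the table is still registered in the metastore when the job starts: PURGE bypasses the HDFS trash, so the table directory is removed together with the metastore entry before the CTAS runs. If the directory is orphaned (left behind by a run that failed before registering the table), DROP TABLE cannot see it and the path would have to be cleaned up in HDFS instead; with many concurrent jobs in the same schema, using distinct temporary table names per job also avoids two jobs racing for the same location.

```sql
-- Hedged sketch: only helps while db.TMP_ARR still exists in the metastore;
-- an orphaned HDFS directory is invisible to DROP TABLE.
DROP TABLE IF EXISTS db.TMP_ARR PURGE;

CREATE TABLE db.TMP_ARR AS
SELECT ID        -- remaining columns elided, as in the original query
FROM some_source_table;
```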
hieutmbk
(11 rep)
Aug 1, 2025, 02:44 AM
0 votes · 1 answer · 32 views
How to Create a Managed Table in Apache Hive 4.0.1?
I am running Apache Hive 4.0.1 using Docker with the following command:
docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:4.0.1
After starting Hive, I created a table using the following SQL command:
CREATE TABLE FOO(foo string);
When I check the table definition using:
SHOW CREATE TABLE default.FOO;
I see that the table is defined as EXTERNAL. However, I want to create a managed table (also referred to as an "internal" table).
What steps should I follow or what specific commands should I use to ensure that the table is created as a managed table in Hive 4.0.1?
Any guidance on this would be greatly appreciated!
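For illustration, a hedged sketch based on how Hive 4 treats table types: with default settings, managed tables have to be ACID, and a plain CREATE TABLE over a non-transactional format is translated to an EXTERNAL table by the metastore. Declaring the table as transactional ORC should therefore keep it managed.

```sql
-- Hedged sketch, assuming default Hive 4 settings: an ORC table marked as
-- transactional stays managed instead of being translated to EXTERNAL.
CREATE TABLE FOO (foo string)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- The output should no longer contain the EXTERNAL keyword.
SHOW CREATE TABLE default.FOO;
```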
Aleksandr Shperling
(1 rep)
Dec 20, 2024, 05:31 PM
• Last activity: Mar 18, 2025, 07:29 AM
0 votes · 2 answers · 529 views
Failed to link the C library against JNA, Cannot open shared object file
I am having issues installing Cassandra 4.0 and 4.1 on my RHEL 8 server.
I've tried using Java 11 OpenJDK and Corretto 11 OpenJDK.
When I start Cassandra, I receive:
Native LibraryLinux.java - Failed to link the C library against JNA. \
Native methods will not be available.
java.lang.UnsatisfiedLinkError: /tmp/jna8760917299733827163.tmp: \
/tmp/jna8760917299733827163.tmp: cannot open shared object: Operation not permitted
I've tried adding these options, separately and together, to the cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Djava.io.tmpdir=$CASSANDRA_HOME/tmp"
JVM_OPTS="$JVM_OPTS -Djna.tmpdir=$CASSANDRA_HOME/tmp"
I even tried export JVM_OPTS="$JVM_OPTS -Djava.io.tmpdir=$CASSANDRA_HOME/tmp"
After adding that, the error changes to:
java.lang.UnsatisfiedLinkError: Failed to create temporary file for /com/sun/jna/win32-x86/jnidispatch.dll library: \
JNA temporary directory 'cassandra/tmp' does not exist
Once I create that directory and start Cassandra, a new error appears:
java.lang.UnsatisfiedLinkError: Failed to create temporary file for /com/sun/jna/win32-x86/jnidispatch.dll library: \
JNA temporary directory 'cassandra/tmp' is not writable
I tried changing permissions, but it goes back to the original error.
Does anyone have a solution for this?
jexport
(1 rep)
May 3, 2024, 07:19 PM
• Last activity: May 7, 2024, 06:35 AM
0 votes · 0 answers · 49 views
Is it necessary to optimize join in hdfs?
What is the optimal way to write this kind of query in Hive (a data lake based on HDFS)?
Filtering the tables before joining them:
select *
from (select code from table_1 where type = "a") a
inner join (select code from table_2 where type = "a") b
  on a.code = b.code
Or this way, with the filters in the WHERE clause?
select *
from table_1
inner join table_2 on table_1.code = table_2.code
where table_1.type = "a"
  and table_2.type = "a"
Perhaps the most obvious and quickest answer is the first way. But I think the Hive/HDFS environment is optimized in such a way that it applies the WHERE before the JOIN; in other words, that the engine performs this optimization (predicate pushdown) internally.
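For illustration, a hedged way to verify this on your own cluster rather than guessing: run EXPLAIN on both variants and compare where the type filter appears in the plan. If predicate pushdown is active, both plans should apply the filter before the join.

```sql
-- Hedged sketch: compare the plans of the two variants; the interesting part
-- is whether the filter sits below or above the join operator.
EXPLAIN
SELECT *
FROM table_1
INNER JOIN table_2 ON table_1.code = table_2.code
WHERE table_1.type = "a"
  AND table_2.type = "a";
```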
cfsl
(1 rep)
Jan 31, 2024, 09:46 PM
• Last activity: Jan 31, 2024, 10:20 PM
1 vote · 0 answers · 113 views
How to pass multiple values into hive hql for the same hivevar
Requirement:
My HQL has the script below, in which I want to pass values into the WHERE clause dynamically. How do I pass them using hivevar in this specific scenario, where multiple values are expected? In other words, how do I invoke the HQL with a hivevar defined for ('a','c')?
Create table newtbl
As select * from temptbl
where id IN ('a', 'c')
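For illustration, a hedged sketch of one common approach: pass the whole comma-separated list as a single hivevar and substitute it inside the IN list. The variable name id_list and the file name are hypothetical.

```sql
-- Invocation (hypothetical names); the list is passed as one variable:
--   beeline --hivevar id_list="'a','c'" -f create_newtbl.hql
-- Inside the script the variable is expanded textually:
create table newtbl as
select *
from temptbl
where id IN (${hivevar:id_list});
```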
RaCh
(11 rep)
Dec 21, 2023, 12:38 AM
2 votes · 1 answer · 2670 views
Connecting to cluster with cqlsh returns "Unable to connect to any servers"
I am trying to deploy The Hive 4 on a VMware Workstation 17 Player VM to test Splunk integration with The Hive.
I am following the guide at this link, but I encountered an error at one of the stages, namely when launching Cassandra using the cqlsh localhost 9042 command:
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
I tried to solve the problem based on the information from this site, but it didn't help me.
OS Version - Ubuntu 22.04.2 LTS
I'm new to this field; I can provide any information you need.
aimakovm
(21 rep)
May 16, 2023, 10:12 AM
• Last activity: May 16, 2023, 12:09 PM
1 vote · 1 answer · 287 views
Metastore(Mysql) bottleneck for Hive
We have a Hive installation that uses MariaDB as the metastore database. MariaDB holds around 250 GB of metadata with about 100 GB of indexes. It becomes terribly slow during peak load of 40-60K QPS.
I'm looking for the community to share similar experiences, if any, and what they did to scale out the metastore or fix this.
Some of the ideas I am currently looking at are:
- Application caching at the HMS level: I didn't find an out-of-the-box capability in my current v2.0.1. Is there support for it in higher versions?
- Read replicas and routing SELECTs to them: I'm facing failures when there is replication lag and I try to read back a value I just wrote.
- Horizontal sharding of MySQL: I'm finding it very complex. I saw some recommendations for TiDB but am not sure about real-world experience with it.
Shakti Garg
(111 rep)
Feb 16, 2023, 02:48 PM
• Last activity: Feb 18, 2023, 10:11 AM
1 vote · 1 answer · 308 views
For each tuple, get the name of the first column which is non-zero
I have a table in Hive which looks like:
| Name | 1990 | 1991 | 1992 | 1993 | 1994 |
| ---- | ---- | ---- | ---- | ---- | ---- |
| Rex  | 0    | 0    | 1    | 1    | 1    |
| Max  | 0    | 0    | 0    | 0    | 1    |
| Phil | 1    | 1    | 1    | 1    | 1    |
I would like to get, for each row, the name of the first column which is non-zero, so something like:
| Name | Column |
| ---- | ------ |
| Rex  | 1992   |
| Max  | 1994   |
| Phil | 1990   |
For each row, it is guaranteed that:
* There is at least one column with "1"; and
* If column X is "1", then every column Y > X will also have a "1".
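For illustration, a hedged sketch of one straightforward approach in Hive, assuming the year columns are known up front (they are backtick-quoted because the names start with digits):

```sql
-- Hedged sketch: walk the year columns left to right and return the name of
-- the first one whose value is non-zero.
SELECT Name,
       CASE
         WHEN `1990` <> 0 THEN '1990'
         WHEN `1991` <> 0 THEN '1991'
         WHEN `1992` <> 0 THEN '1992'
         WHEN `1993` <> 0 THEN '1993'
         WHEN `1994` <> 0 THEN '1994'
       END AS `Column`
FROM my_table;   -- hypothetical table name
```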
user2891462
(113 rep)
Nov 28, 2021, 04:47 PM
• Last activity: Dec 1, 2021, 05:21 PM
0 votes · 1 answer · 6286 views
Testing a Hive array for IS NULL says not null
I have a table containing an array, and I want to check whether it is empty or NULL. It appears that I cannot check for NULL directly! Can anyone shed light on why the NULL check isn't working?
create table test_array_split
( campaign string
, questions array
)
stored as orc;

insert into test_array_split (campaign) values ('1');

select campaign
, questions
, size(questions)
, case when questions is null then 'null' else 'not null' end isnull
from test_array_split;
+-----------+------------+------+-----------+
| campaign | questions | _c2 | isnull |
+-----------+------------+------+-----------+
| 1 | NULL | -1 | not null |
+-----------+------------+------+-----------+
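For illustration, a hedged workaround sketch rather than a root-cause explanation: Hive's size() returns -1 when its collection argument is NULL and 0 for an empty array, so a size-based check covers both cases regardless of how the ORC reader surfaces the missing value.

```sql
-- Hedged sketch: size(questions) is -1 for NULL and 0 for an empty array,
-- so "size(questions) < 1" treats both as "no questions".
select campaign
, case when size(questions) < 1 then 'null or empty' else 'has elements' end as q_state
from test_array_split;
```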
PhilHibbs
(539 rep)
Jul 24, 2020, 12:25 PM
• Last activity: Sep 27, 2021, 08:03 PM
0 votes · 0 answers · 100 views
Is there an open source implementation of QGM (Query Graph Model)?
I am building a new system that needs to act essentially as an SQL backend. We would like to import logical queries into it (e.g. from ApacheSPARQ or Postgres and related systems) and want to develop an internal representation (IR) for them. Doing something similar to QGM seems like a good starting point. However, rather than inventing it all from scratch, I'd like to borrow from something that already exists and then extend it as needed.
Even if there is no processing code, just data structures, it would be a useful starting point.
So, if there is something open source that I could look at, it would be appreciated.
intel_chris
(141 rep)
Sep 25, 2021, 12:13 PM
1 vote · 1 answer · 805 views
How to check total allotted space inside a HDFS 'group'
Our DBA has created a schema for our team in HDFS/Hive. I'm not sure if 'schema' is the right word; they call it a 'group'.
Anyway, we can only write to the data lake inside this schema, whether it is Parquet files or Hive tables.
Is there a way to check the maximum space allocated to our group, knowing only the schema name?
I don't want to accidentally load too much data.
Thank you.
Victor
(127 rep)
May 1, 2021, 03:18 PM
• Last activity: Sep 16, 2021, 12:25 PM
0 votes · 2 answers · 1585 views
Omit table name and dot in SELECT query
When I perform this query:
SELECT `tablename`.* FROM `something`.`tablename`
I get a table with column names that contain dots:
tablename.c1 | tablename.c2 | tablename.c3
------------------------------------------
a | 1 | 2
b | 1 | 3
I don't want this, I just want the column names c1, c2 and c3. I can solve this by writing the following query:
SELECT `tablename`.`c1` as c1,
       `tablename`.`c2` as c2,
       `tablename`.`c3` as c3
FROM `something`.`tablename`
However, I have many columns, which makes for a very long query. How can I rename the columns from the first query, or how can I get this right from the start?
(P.S. the query I'm using contains multiple table references, which is why I specify the table name in tablename.*)
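For illustration, a hedged sketch of the session setting commonly used for this in Hive (assuming HiveServer2/beeline; the property controls whether result-set column names are prefixed with the table name):

```sql
-- Hedged sketch: with the property set to false, the columns come back as
-- c1, c2, c3 instead of tablename.c1, tablename.c2, tablename.c3.
set hive.resultset.use.unique.column.names=false;

SELECT `tablename`.* FROM `something`.`tablename`;
```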
Mehdi Nellen
(101 rep)
Jan 16, 2020, 01:18 PM
• Last activity: Jul 9, 2021, 06:01 AM
2 votes · 1 answer · 1498 views
how to insert data into extra columns of target avro table when source table is having less no of columns compared to target using hive or impala?
Suppose I have a source Avro table with 10 columns and my target Avro table has 12 columns; while inserting data into the target table I need to add null values for the extra 2 columns.
But when I execute the query below, it throws the following exception:
> AnalysisException: Target table 'target_table' has more columns (8) than the SELECT / VALUES clause returns (7)
insert overwrite table target_table select * from source_table;
How can I take advantage of the Avro table's automatic schema change detection here?
**Note:** Suppose I want to insert only 5 columns into the target and the rest should default to null. How do I achieve this?
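For illustration, a hedged sketch of the usual workaround when the engine will not pad missing columns automatically: list the source columns explicitly and append typed NULLs for the columns that exist only in the target. All column names and types below are hypothetical.

```sql
-- Hedged sketch with hypothetical column names: explicit select list plus
-- typed NULLs for the two columns that only exist in the target table.
insert overwrite table target_table
select col1,
       col2,
       col3,
       cast(null as string) as extra_col1,
       cast(null as int)    as extra_col2
from source_table;
```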
user109612
(21 rep)
Nov 3, 2016, 09:53 AM
• Last activity: May 25, 2021, 08:59 AM
3 votes · 1 answer · 3862 views
Getting the row before a row with a certain value in SQL
I have a table like the one below, where user actions are stored with a timestamp. My goal is to identify the action that happened before a specific action (named reference_action) and count those actions, to see which actions happen before the specific action and how they are distributed.
I am aware of window functions like LAG(), with which I can get the row before a certain row, but I can't figure out how to include a constraint like WHERE action_name = "reference_action".
The query engine is Presto and the tables are Hive tables but I'm mostly interested in the general SQL approach, therefore that shouldn't matter much.
| session | action_name | timestamp |
| ------- | ----------- | --------- |
| 1 | "some_action" | 1970-01-01 00:01:00 |
| 1 | "some_action" | 1970-01-01 00:02:00 |
| 1 | "some_action" | 1970-01-01 00:03:00 |
| 1 | "desired_action1" | 1970-01-01 00:04:00 |
| 1 | "reference_action" | 1970-01-01 00:05:00 |
| 1 | "some_action" | 1970-01-01 00:06:00 |
| 1 | "some_action" | 1970-01-01 00:07:00 |
| 2 | "some_action" | 1970-01-01 01:23:00 |
| 2 | "some_action" | 1970-01-01 02:34:00 |
| 2 | "desired_action1" | 1970-01-01 03:45:00 |
| 2 | "reference_action" | 1970-01-01 04:56:00 |
| 2 | "some_action" | 1970-01-01 05:58:00 |
| 3 | "some_action" | 1970-01-01 01:23:00 |
| 3 | "some_action" | 1970-01-01 02:34:00 |
| 3 | "desired_action2" | 1970-01-01 03:45:00 |
| 3 | "reference_action" | 1970-01-01 04:56:00 |
| 3 | "some_action" | 1970-01-01 05:58:00 |
The result should look like:
| action | count |
| ------ | ----- |
| "desired_action1" | 2 |
| "desired_action2" | 1 |
There are two rows where "desired_action1" is directly followed by a row with "reference_action" when ordered by timestamp, hence the count being 2. The same logic applies for why the count is 1 for "desired_action2".
The goal is to know what a user did before they made a purchase (purchase = reference_action). To understand that, I want to look up the action that happened before each purchase, so I need the action_name in the row before each reference_action. The desired_actions are what has to be counted; the reference_actions are just the rows after the actions I want to count, used to determine which values should be counted.
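For illustration, a hedged sketch of how LAG() and the filter can be combined: compute the previous action per session in a subquery, then filter on the current action in the outer query. The table name user_actions is hypothetical, and identifier quoting may differ between Presto and Hive.

```sql
-- Hedged sketch: the inner query attaches the previous action to every row;
-- the outer query keeps only reference_action rows and counts what preceded them.
SELECT prev_action AS action,
       count(*) AS action_count
FROM (
  SELECT action_name,
         LAG(action_name) OVER (PARTITION BY session ORDER BY "timestamp") AS prev_action
  FROM user_actions              -- hypothetical table name
) t
WHERE action_name = 'reference_action'
GROUP BY prev_action;
```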
Daniel Müller
(133 rep)
May 18, 2021, 12:35 PM
• Last activity: May 19, 2021, 10:08 AM
0 votes · 1 answer · 722 views
User specific default database in Hive
I am running an Oracle Big Data Appliance platform with Cloudera EDH 5.9.x under the hood. My users mainly use Beeswax, the Hive query editor app within Hue.
When a user logs into Hue and opens Beeswax, the Hive database "default" is preselected. I want to change this so that the first database they see is their own sandbox database. Currently, the user has to manually select the database in Beeswax or run the use DATABASE command in the editor.
Is there a configuration item I can change in any of the CDH software modules that will help me do this? Or is there a concept of a Hive startup script where I can run the use DATABASE command?
rustycodemonkey
(1 rep)
Oct 12, 2017, 12:13 AM
• Last activity: Apr 30, 2021, 05:05 PM
0 votes · 0 answers · 25 views
Field with Top Ranking Field Name
Let's imagine a table structured like this:
| Bucket | Red | Blue | Green |
| ------ | --- |----- | ----- |
| First | 1 |3 |4 |
| Second | 6 |5 |2 |
What I'm trying to achieve: based on the values within each bucket, I'd like to generate another set of fields containing the highest-ranking, second-highest-ranking, and third-highest-ranking colors (assume there can be more than three colors as well). We are limiting it to the top 3.
Essentially, what my final output should look like is this:
| Bucket | Red | Blue | Green | Rank 1 | Rank 2 | Rank 3 |
| ------ | --- |----- | ----- | ------ | ------ | ------ |
| First | 1 |3 |4 | Green | Blue | Red |
| Second | 6 |5 |2 | Red | Blue | Green |
Hoping this isn't a redundant question.
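For illustration, a hedged sketch of one way to do this in Hive, assuming the color columns are known up front: unpivot them into (colour, value) rows with explode(map(...)), rank within each bucket, then pivot the top three colour names back out. The table name buckets is hypothetical.

```sql
-- Hedged sketch: explode the colour columns into rows, rank them per bucket,
-- then fold the top three colour names back into rank_1..rank_3 columns.
SELECT `bucket`,
       max(CASE WHEN rnk = 1 THEN colour END) AS rank_1,
       max(CASE WHEN rnk = 2 THEN colour END) AS rank_2,
       max(CASE WHEN rnk = 3 THEN colour END) AS rank_3
FROM (
  SELECT `bucket`, colour, val,
         row_number() OVER (PARTITION BY `bucket` ORDER BY val DESC) AS rnk
  FROM buckets                   -- hypothetical table name
  LATERAL VIEW explode(map('Red', red, 'Blue', blue, 'Green', green)) kv AS colour, val
) ranked
GROUP BY `bucket`;
```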
Franco Buhay
(1 rep)
Feb 10, 2021, 07:37 PM
0 votes · 1 answer · 355 views
Creating external tables from SQL Server using JDBC Drivers
I am trying to query Hive tables and Kylin tables from SQL Server 2019. I did some research and found that SQL Server provides PolyBase as an interface to Hadoop storage. However, the examples mostly cover creating external tables over file formats stored in HDFS. All I need is to create an external table from SQL Server using JDBC drivers. Is this possible? I would like to avoid using ODBC drivers.
user859385
(101 rep)
Nov 5, 2020, 01:50 PM
• Last activity: Nov 7, 2020, 03:29 PM
0 votes · 1 answer · 450 views
Query fails sometimes on casting error
I have a query that runs against the same data set, and sometimes it fails and sometimes it succeeds.
The query is generated by the Hive metastore service, and I can't modify it.
This is a simplified version of the query:
select
"TBLS"."TBL_ID",
"FILTER0"."PART_ID",
"TBLS"."TBL_NAME",
"FILTER0"."PART_KEY_VAL"
from
"PARTITIONS"
inner join "TBLS" on
"PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"
and "TBLS"."TBL_NAME" = 'test_table_int'
inner join "PARTITION_KEY_VALS" "FILTER0" on
"FILTER0"."PART_ID" = "PARTITIONS"."PART_ID"
where
cast("FILTER0"."PART_KEY_VAL" as decimal(21, 0)) = 1
When I spin up a new database and populate the relevant tables, this is how the whole data set looks (querying without any filters):
[screenshot of the tables' contents]
Running the query above returns a single row (the one with PART_KEY_VAL = 1).
The problem starts after I run some automated tests that write to those tables. I couldn't find any pattern; I just run a few complicated tests that write to those tables.
If I then populate those tables again, the data looks similar:
[second screenshot, similar contents]
but running the query above now results in:
> SQL Error [22P02]: ERROR: invalid input syntax for type numeric: "c"
For some reason, the value "c" is being cast to decimal and it fails, even though the same query on the same data was working earlier.
What could be the reason for this behavior?
---
For reference, here is where the query is generated (I simplified it a bit above): https://github.com/apache/hive/blob/rel/release-3.1.2/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1289-L1339
lev
(111 rep)
May 28, 2020, 08:06 AM
• Last activity: May 29, 2020, 07:47 AM
1 vote · 0 answers · 26 views
Defining external table on JSON with an @ sign in an element
I need to define a Hive external table over a JSON file that has @ signs in its element names, e.g.
{ "data": { "@type": "person", "name": "Phil", "job": "Programmer" } }
This works:
create external table sandbox.test_table
( data STRUCT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3a://bucket/DEV/data/raw/test/';
However, this misses out the @type element. I've tried these:
data STRUCT
data STRUCT
Neither of them works. Any suggestions on how I can do this, or do I need to preprocess the JSON to remove the @ from the element names?
PhilHibbs
(539 rep)
Mar 25, 2020, 12:17 PM