Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
1 vote · 1 answer · 808 views
IDE Access to Hive without user password?
I'm a DB developer, and in my company we're only able to interact with Hive through a technical user who owns the write permissions for Hive on HDFS.
In practice this looks like this:
1. I connect to our remote server using SSH with my user credentials
2. I switch to the technical hive user with
sudo su - hive_user
(the DBAs won't hand out the password for this; they prefer to grant us only a specific sudo permission for switching users like this)
3. I execute a query using beeline -f QUERY_FILE
So as you can see, I'm bound to the CLI (beeline) all the time, but I'd like the convenience of using a SQL IDE from my desktop.
Is there any IDE for Hive that allows me to connect to our DB as a technical user that can only be accessed via sudo su - hive_user? A link to a manual for this would also be nice.
Mayak
(181 rep)
Oct 2, 2019, 11:10 AM
• Last activity: Aug 3, 2025, 03:06 PM
1 vote · 0 answers · 16 views
hive - Can not create the managed table The associated location already exists
I'm trying to create a managed Hive table using Spark SQL with the following query:
DROP TABLE IF EXISTS db.TMP_ARR;
CREATE TABLE db.TMP_ARR AS
SELECT ID,
-- more fields..
FROM some_source_table INT;
However, the job fails with the following error:
org.apache.spark.sql.AnalysisException: Can not create the managed table('db.tmp_arr'). The associated location ('hdfs://coreCluster/warehouse/tablespace/managed/hive/db.db/tmp_arr') already exists
**What I understand:** I'm trying to create a managed table.
Spark expects that the target location in HDFS does not already exist when creating a managed table.
Apparently, that folder already exists, possibly due to a previous failed run or manual intervention.
**My questions:**
Why does Spark throw this error even though I used DROP TABLE IF EXISTS before CREATE TABLE?
What's the correct way to ensure a managed table can be created without this conflict?
Should I manually delete the path in HDFS before creating the table, or is there a safer/better approach?
**Environment:** Spark version: 3.3.2
Hive metastore: enabled
Storage: HDFS
*1. It's important that the table is managed (not external), and that we don’t manually assign a LOCATION.
2. Many similar jobs are running concurrently (creating/dropping managed tables in the same Hive schema).*
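For illustration only, a hedged sketch of the drop-side mitigation, assuming the table is still registered in the metastore when the job starts: PURGE bypasses the HDFS trash, so the table directory is removed together with the metastore entry before the CTAS runs. If the directory is orphaned (left behind by a run that failed before registering the table), DROP TABLE cannot see it and the path would have to be cleaned up in HDFS instead; with many concurrent jobs in the same schema, using distinct temporary table names per job also avoids two jobs racing for the same location.

```sql
-- Hedged sketch: only helps while db.TMP_ARR still exists in the metastore;
-- an orphaned HDFS directory is invisible to DROP TABLE.
DROP TABLE IF EXISTS db.TMP_ARR PURGE;

CREATE TABLE db.TMP_ARR AS
SELECT ID        -- remaining columns elided, as in the original query
FROM some_source_table;
```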
hieutmbk
(11 rep)
Aug 1, 2025, 02:44 AM
0 votes · 1 answer · 32 views
How to Create a Managed Table in Apache Hive 4.0.1?
I am running Apache Hive 4.0.1 using Docker with the following command:
docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:4.0.1
After starting Hive, I created a table using the following SQL command:
CREATE TABLE FOO(foo string);
When I check the table definition using:
SHOW CREATE TABLE default.FOO;
I see that the table is defined as EXTERNAL. However, I want to create a managed table (also referred to as an "internal" table).
What steps should I follow or what specific commands should I use to ensure that the table is created as a managed table in Hive 4.0.1?
Any guidance on this would be greatly appreciated!
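For illustration, a hedged sketch based on how Hive 4 treats table types: with default settings, managed tables have to be ACID, and a plain CREATE TABLE over a non-transactional format is translated to an EXTERNAL table by the metastore. Declaring the table as transactional ORC should therefore keep it managed.

```sql
-- Hedged sketch, assuming default Hive 4 settings: an ORC table marked as
-- transactional stays managed instead of being translated to EXTERNAL.
CREATE TABLE FOO (foo string)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- The output should no longer contain the EXTERNAL keyword.
SHOW CREATE TABLE default.FOO;
```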
Aleksandr Shperling
(1 rep)
Dec 20, 2024, 05:31 PM
• Last activity: Mar 18, 2025, 07:29 AM
0 votes · 2 answers · 529 views
Failed to link the C library against JNA, Cannot open shared object file
I am having issues installing Cassandra 4.0 and 4.1 on my RHEL 8 server.
I've tried using Java 11 OpenJDK and Corretto 11 OpenJDK.
When I start Cassandra, I receive:
Native LibraryLinux.java - Failed to link the C library against JNA. \
Native methods will not be available.
java.lang.UnsatisfiedLinkError: /tmp/jna8760917299733827163.tmp: \
/tmp/jna8760917299733827163.tmp: cannot open shared object: Operation not permitted
I've tried adding these options, separately and together, to the cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Djava.io.tmpdir=$CASSANDRA_HOME/tmp"
JVM_OPTS="$JVM_OPTS -Djna.tmpdir=$CASSANDRA_HOME/tmp"
I even tried export JVM_OPTS="$JVM_OPTS -Djava.io.tmpdir=$CASSANDRA_HOME/tmp"
After adding that, the error changes to:
java.lang.UnsatisfiedLinkError: Failed to create temporary file for /com/sun/jna/win32-x86/jnidispatch.dll library: \
JNA temporary directory 'cassandra/tmp' does not exist
Once I create that directory and start Cassandra, a new error appears:
java.lang.UnsatisfiedLinkError: Failed to create temporary file for /com/sun/jna/win32-x86/jnidispatch.dll library: \
JNA temporary directory 'cassandra/tmp' is not writable
I tried changing permissions, but it goes back to the original error.
Does anyone have a solution for this?
jexport
(1 rep)
May 3, 2024, 07:19 PM
• Last activity: May 7, 2024, 06:35 AM
0 votes · 0 answers · 49 views
Is it necessary to optimize join in hdfs?
What is the optimal way to write this kind of query in Hive (a data lake based on HDFS)?
Filtering the tables before joining them:
select *
from (select code from table_1 where type = "a") a
inner join (select code from table_2 where type = "a") b
  on a.code = b.code
Or this way, with the filters in the WHERE clause?
select *
from table_1
inner join table_2 on table_1.code = table_2.code
where table_1.type = "a"
  and table_2.type = "a"
Perhaps the most obvious and quickest answer is the first way. But I think the Hive/HDFS environment is optimized in such a way that it applies the WHERE before the JOIN; in other words, that the engine performs this optimization (predicate pushdown) internally.
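For illustration, a hedged way to verify this on your own cluster rather than guessing: run EXPLAIN on both variants and compare where the type filter appears in the plan. If predicate pushdown is active, both plans should apply the filter before the join.

```sql
-- Hedged sketch: compare the plans of the two variants; the interesting part
-- is whether the filter sits below or above the join operator.
EXPLAIN
SELECT *
FROM table_1
INNER JOIN table_2 ON table_1.code = table_2.code
WHERE table_1.type = "a"
  AND table_2.type = "a";
```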
cfsl
(1 rep)
Jan 31, 2024, 09:46 PM
• Last activity: Jan 31, 2024, 10:20 PM
1 vote · 0 answers · 113 views
How to pass multiple values into hive hql for the same hivevar
Requirement:
My HQL has the script below, in which I want to pass values into the WHERE clause dynamically. How do I pass them using hivevar in this specific scenario, where multiple values are expected? In other words, how do I invoke the HQL with a hivevar defined for ('a','c')?
Create table newtbl
As select * from temptbl
where id IN ('a', 'c')
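For illustration, a hedged sketch of one common approach: pass the whole comma-separated list as a single hivevar and substitute it inside the IN list. The variable name id_list and the file name are hypothetical.

```sql
-- Invocation (hypothetical names); the list is passed as one variable:
--   beeline --hivevar id_list="'a','c'" -f create_newtbl.hql
-- Inside the script the variable is expanded textually:
create table newtbl as
select *
from temptbl
where id IN (${hivevar:id_list});
```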
RaCh
(11 rep)
Dec 21, 2023, 12:38 AM
2 votes · 1 answer · 2670 views
Connecting to cluster with cqlsh returns "Unable to connect to any servers"
I am trying to deploy The Hive 4 on a VMware Workstation 17 Player VM to test Splunk integration with The Hive.
I am following the guide at this link, but I encountered an error at one of the stages, namely when launching Cassandra using the cqlsh localhost 9042 command:
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
I tried to solve the problem based on the information from this site, but it didn't help me.
OS Version - Ubuntu 22.04.2 LTS
I'm new to this field; I can provide any information you need.
aimakovm
(21 rep)
May 16, 2023, 10:12 AM
• Last activity: May 16, 2023, 12:09 PM
1 vote · 1 answer · 287 views
Metastore(Mysql) bottleneck for Hive
We have a Hive installation that uses MariaDB as the metastore database. MariaDB holds around 250 GB of metadata with about 100 GB of indexes. It becomes terribly slow during peak load of 40-60K QPS.
I'm looking for the community to share similar experiences, if any, and what they did to scale out the metastore or fix this.
Some of the ideas I am currently looking at are:
- Application caching at the HMS level: I didn't find an out-of-the-box capability in my current v2.0.1. Is there support for it in higher versions?
- Read replicas and routing SELECTs to them: I'm facing failures when there is replication lag and I try to read back a value I just wrote.
- Horizontal sharding of MySQL: I'm finding it very complex. I saw some recommendations for TiDB but am not sure about real-world experience with it.
Shakti Garg
(111 rep)
Feb 16, 2023, 02:48 PM
• Last activity: Feb 18, 2023, 10:11 AM
1 vote · 1 answer · 308 views
For each tuple, get the name of the first column which is non-zero
I have a table in Hive which looks like:
| Name | 1990 | 1991 | 1992 | 1993 | 1994 |
| ---- | ---- | ---- | ---- | ---- | ---- |
| Rex  | 0    | 0    | 1    | 1    | 1    |
| Max  | 0    | 0    | 0    | 0    | 1    |
| Phil | 1    | 1    | 1    | 1    | 1    |
I would like to get, for each row, the name of the first column which is non-zero, so something like:
| Name | Column |
| ---- | ------ |
| Rex  | 1992   |
| Max  | 1994   |
| Phil | 1990   |
For each row, it is guaranteed that:
* There is at least one column with "1"; and
* If column X is "1", then every column Y > X will also have a "1".
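For illustration, a hedged sketch of one straightforward approach in Hive, assuming the year columns are known up front (they are backtick-quoted because the names start with digits):

```sql
-- Hedged sketch: walk the year columns left to right and return the name of
-- the first one whose value is non-zero.
SELECT Name,
       CASE
         WHEN `1990` <> 0 THEN '1990'
         WHEN `1991` <> 0 THEN '1991'
         WHEN `1992` <> 0 THEN '1992'
         WHEN `1993` <> 0 THEN '1993'
         WHEN `1994` <> 0 THEN '1994'
       END AS `Column`
FROM my_table;   -- hypothetical table name
```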
user2891462
(113 rep)
Nov 28, 2021, 04:47 PM
• Last activity: Dec 1, 2021, 05:21 PM
0 votes · 1 answer · 6286 views
Testing a Hive array for IS NULL says not null
I have a table containing an array, and I want to check whether it is empty or NULL. It appears that I cannot check for NULL directly! Can anyone shed light on why the NULL check isn't working?
create table test_array_split
( campaign string
, questions array
)
stored as orc;

insert into test_array_split (campaign) values ('1');

select campaign
, questions
, size(questions)
, case when questions is null then 'null' else 'not null' end isnull
from test_array_split;
+-----------+------------+------+-----------+
| campaign | questions | _c2 | isnull |
+-----------+------------+------+-----------+
| 1 | NULL | -1 | not null |
+-----------+------------+------+-----------+
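For illustration, a hedged workaround sketch rather than a root-cause explanation: Hive's size() returns -1 when its collection argument is NULL and 0 for an empty array, so a size-based check covers both cases regardless of how the ORC reader surfaces the missing value.

```sql
-- Hedged sketch: size(questions) is -1 for NULL and 0 for an empty array,
-- so "size(questions) < 1" treats both as "no questions".
select campaign
, case when size(questions) < 1 then 'null or empty' else 'has elements' end as q_state
from test_array_split;
```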
PhilHibbs
(539 rep)
Jul 24, 2020, 12:25 PM
• Last activity: Sep 27, 2021, 08:03 PM
0 votes · 0 answers · 100 views
Is there an open source implementation of QGM (Query Graph Model)?
I am building a new system that needs to act essentially as an SQL backend. We would like to import logical queries into it (e.g. from ApacheSPARQ or Postgres and related systems) and want to develop an internal representation (IR) for them. Doing something similar to QGM seems like a good starting point. However, rather than inventing it all from scratch, I'd like to borrow from something that already exists and then extend it as needed.
Even if there is no processing code, just data structures, it would be a useful starting point.
So, if there is something open source that I could look at, it would be appreciated.
intel_chris
(141 rep)
Sep 25, 2021, 12:13 PM
1 vote · 1 answer · 805 views
How to check total allotted space inside a HDFS 'group'
Our DBA has created a schema for our team in HDFS/Hive. I'm not sure if 'schema' is the right word; they call it a 'group'.
Anyway, we can only write to the data lake inside this schema, whether it is Parquet files or Hive tables.
Is there a way to check the maximum space allocated to our group, knowing only the schema name?
I don't want to accidentally load too much data.
Thank you.
Victor
(127 rep)
May 1, 2021, 03:18 PM
• Last activity: Sep 16, 2021, 12:25 PM
0 votes · 2 answers · 1585 views
Omit table name and dot in SELECT query
When I perform this query:
SELECT `tablename`.* FROM `something`.`tablename`
I get a table with column names that contain dots:
tablename.c1 | tablename.c2 | tablename.c3
------------------------------------------
a | 1 | 2
b | 1 | 3
I don't want this, I just want the column names c1, c2 and c3. I can solve this by writing the following query:
SELECT `tablename`.`c1` as c1,
       `tablename`.`c2` as c2,
       `tablename`.`c3` as c3
FROM `something`.`tablename`
However, I have many columns, which makes for a very long query. How can I rename the columns from the first query, or how can I get this right from the start?
(P.S. the query I'm using contains multiple table references, which is why I specify the table name in tablename.*)
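For illustration, a hedged sketch of the session setting commonly used for this in Hive (assuming HiveServer2/beeline; the property controls whether result-set column names are prefixed with the table name):

```sql
-- Hedged sketch: with the property set to false, the columns come back as
-- c1, c2, c3 instead of tablename.c1, tablename.c2, tablename.c3.
set hive.resultset.use.unique.column.names=false;

SELECT `tablename`.* FROM `something`.`tablename`;
```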
Mehdi Nellen
(101 rep)
Jan 16, 2020, 01:18 PM
• Last activity: Jul 9, 2021, 06:01 AM
2 votes · 1 answer · 1498 views
how to insert data into extra columns of target avro table when source table is having less no of columns compared to target using hive or impala?
Suppose I have a source Avro table with 10 columns and my target Avro table has 12 columns; while inserting data into the target table I need to add null values for the extra 2 columns.
But when I execute the query below, it throws the following exception:
> AnalysisException: Target table 'target_table' has more columns (8) than the SELECT / VALUES clause returns (7)
insert overwrite table target_table select * from source_table;
How can I take advantage of the Avro table's automatic schema change detection here?
**Note:** Suppose I want to insert only 5 columns into the target and the rest should default to null. How do I achieve this?
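For illustration, a hedged sketch of the usual workaround when the engine will not pad missing columns automatically: list the source columns explicitly and append typed NULLs for the columns that exist only in the target. All column names and types below are hypothetical.

```sql
-- Hedged sketch with hypothetical column names: explicit select list plus
-- typed NULLs for the two columns that only exist in the target table.
insert overwrite table target_table
select col1,
       col2,
       col3,
       cast(null as string) as extra_col1,
       cast(null as int)    as extra_col2
from source_table;
```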
user109612
(21 rep)
Nov 3, 2016, 09:53 AM
• Last activity: May 25, 2021, 08:59 AM
3 votes · 1 answer · 3862 views
Getting the row before a row with a certain value in SQL
I have a table like the one below, where user actions are stored with a timestamp. My goal is to identify the action that happened before a specific action (named reference_action) and count those actions, to see which actions happen before the specific action and how they are distributed.
I am aware of window functions like LAG(), with which I can get the row before a certain row, but I can't figure out how to include a constraint like WHERE action_name = "reference_action".
The query engine is Presto and the tables are Hive tables but I'm mostly interested in the general SQL approach, therefore that shouldn't matter much.
| session | action_name | timestamp |
| ------- | ----------- | --------- |
| 1 | "some_action" | 1970-01-01 00:01:00 |
| 1 | "some_action" | 1970-01-01 00:02:00 |
| 1 | "some_action" | 1970-01-01 00:03:00 |
| 1 | "desired_action1" | 1970-01-01 00:04:00 |
| 1 | "reference_action" | 1970-01-01 00:05:00 |
| 1 | "some_action" | 1970-01-01 00:06:00 |
| 1 | "some_action" | 1970-01-01 00:07:00 |
| 2 | "some_action" | 1970-01-01 01:23:00 |
| 2 | "some_action" | 1970-01-01 02:34:00 |
| 2 | "desired_action1" | 1970-01-01 03:45:00 |
| 2 | "reference_action" | 1970-01-01 04:56:00 |
| 2 | "some_action" | 1970-01-01 05:58:00 |
| 3 | "some_action" | 1970-01-01 01:23:00 |
| 3 | "some_action" | 1970-01-01 02:34:00 |
| 3 | "desired_action2" | 1970-01-01 03:45:00 |
| 3 | "reference_action" | 1970-01-01 04:56:00 |
| 3 | "some_action" | 1970-01-01 05:58:00 |
The result should look like:
| action | count |
| ------ | ----- |
| "desired_action1" | 2 |
| "desired_action2" | 1 |
There are two rows where "desired_action1" is directly followed by a row with "reference_action" when ordered by timestamp, hence the count being 2. The same logic applies for why the count is 1 for "desired_action2".
The goal is to know what a user did before they made a purchase (purchase = reference_action). To understand that, I want to look up the action that happened before each purchase, so I need the action_name in the row before each reference_action. The desired_actions are what has to be counted; the reference_actions are just the rows after the actions I want to count, used to determine which values should be counted.
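For illustration, a hedged sketch of how LAG() and the filter can be combined: compute the previous action per session in a subquery, then filter on the current action in the outer query. The table name user_actions is hypothetical, and identifier quoting may differ between Presto and Hive.

```sql
-- Hedged sketch: the inner query attaches the previous action to every row;
-- the outer query keeps only reference_action rows and counts what preceded them.
SELECT prev_action AS action,
       count(*) AS action_count
FROM (
  SELECT action_name,
         LAG(action_name) OVER (PARTITION BY session ORDER BY "timestamp") AS prev_action
  FROM user_actions              -- hypothetical table name
) t
WHERE action_name = 'reference_action'
GROUP BY prev_action;
```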
Daniel Müller
(133 rep)
May 18, 2021, 12:35 PM
• Last activity: May 19, 2021, 10:08 AM
0 votes · 1 answer · 722 views
User specific default database in Hive
I am running an Oracle Big Data Appliance platform with Cloudera EDH 5.9.x under the hood. My users mainly use Beeswax, the Hive query editor app within Hue.
When a user logs into Hue and opens Beeswax, the Hive database "default" is preselected. I want to change this so that the first database they see is their own sandbox database. Currently, the user has to manually select the database in Beeswax or run the use DATABASE command in the editor.
Is there a configuration item I can change in any of the CDH software modules that will help me do this? Or is there a concept of a Hive startup script where I can run the use DATABASE command?
rustycodemonkey
(1 rep)
Oct 12, 2017, 12:13 AM
• Last activity: Apr 30, 2021, 05:05 PM
0 votes · 0 answers · 25 views
Field with Top Ranking Field Name
Let's imagine a table structured like this:
| Bucket | Red | Blue | Green |
| ------ | --- |----- | ----- |
| First | 1 |3 |4 |
| Second | 6 |5 |2 |
What I'm trying to achieve: based on the values within each bucket, I'd like to generate another set of fields containing the highest-ranking, second-highest-ranking, and third-highest-ranking colors (assume there can be more than three colors as well). We are limiting it to the top 3.
Essentially, what my final output should look like is this:
| Bucket | Red | Blue | Green | Rank 1 | Rank 2 | Rank 3 |
| ------ | --- |----- | ----- | ------ | ------ | ------ |
| First | 1 |3 |4 | Green | Blue | Red |
| Second | 6 |5 |2 | Red | Blue | Green |
Hoping this isn't a redundant question.
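For illustration, a hedged sketch of one way to do this in Hive, assuming the color columns are known up front: unpivot them into (colour, value) rows with explode(map(...)), rank within each bucket, then pivot the top three colour names back out. The table name buckets is hypothetical.

```sql
-- Hedged sketch: explode the colour columns into rows, rank them per bucket,
-- then fold the top three colour names back into rank_1..rank_3 columns.
SELECT `bucket`,
       max(CASE WHEN rnk = 1 THEN colour END) AS rank_1,
       max(CASE WHEN rnk = 2 THEN colour END) AS rank_2,
       max(CASE WHEN rnk = 3 THEN colour END) AS rank_3
FROM (
  SELECT `bucket`, colour, val,
         row_number() OVER (PARTITION BY `bucket` ORDER BY val DESC) AS rnk
  FROM buckets                   -- hypothetical table name
  LATERAL VIEW explode(map('Red', red, 'Blue', blue, 'Green', green)) kv AS colour, val
) ranked
GROUP BY `bucket`;
```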
Franco Buhay
(1 rep)
Feb 10, 2021, 07:37 PM
0 votes · 1 answer · 355 views
Creating external tables from SQL Server using JDBC Drivers
I am trying to query Hive tables and Kylin tables from SQL Server 2019. I did some research and found that SQL Server provides PolyBase as an interface to Hadoop storage. However, the examples mostly cover creating external tables over file formats stored in HDFS. All I need is to create an external table from SQL Server using JDBC drivers. Is this possible? I would like to avoid using ODBC drivers.
user859385
(101 rep)
Nov 5, 2020, 01:50 PM
• Last activity: Nov 7, 2020, 03:29 PM
0 votes · 1 answer · 450 views
Query fails sometimes on casting error
I have a query that runs against the same data set, and sometimes it fails and sometimes it succeeds.
The query is generated by the Hive metastore service, and I can't modify it.
This is a simplified version of the query:
select
"TBLS"."TBL_ID",
"FILTER0"."PART_ID",
"TBLS"."TBL_NAME",
"FILTER0"."PART_KEY_VAL"
from
"PARTITIONS"
inner join "TBLS" on
"PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"
and "TBLS"."TBL_NAME" = 'test_table_int'
inner join "PARTITION_KEY_VALS" "FILTER0" on
"FILTER0"."PART_ID" = "PARTITIONS"."PART_ID"
where
cast("FILTER0"."PART_KEY_VAL" as decimal(21, 0)) = 1
When I spin up a new database and populate the relevant tables, this is how the whole data set looks (querying without any filters):
[screenshot of the tables' contents]
Running the query above returns a single row (the one with PART_KEY_VAL = 1).
The problem starts after I run some automated tests that write to those tables. I couldn't find any pattern; I just run a few complicated tests that write to those tables.
If I then populate those tables again, the data looks similar:
[second screenshot, similar contents]
but running the query above now results in:
> SQL Error [22P02]: ERROR: invalid input syntax for type numeric: "c"
For some reason, the value "c" is being cast to decimal and it fails, even though the same query on the same data was working earlier.
What could be the reason for this behavior?
---
For reference, here is where the query is generated (I simplified it a bit above): https://github.com/apache/hive/blob/rel/release-3.1.2/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1289-L1339
lev
(111 rep)
May 28, 2020, 08:06 AM
• Last activity: May 29, 2020, 07:47 AM
1 vote · 0 answers · 26 views
Defining external table on JSON with an @ sign in an element
I need to define a Hive external table over a JSON file that has @ signs in its element names, e.g.
{ "data": { "@type": "person", "name": "Phil", "job": "Programmer" } }
This works:
create external table sandbox.test_table
( data STRUCT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3a://bucket/DEV/data/raw/test/';
However, this misses out the @type element. I've tried these:
data STRUCT
data STRUCT
Neither of them works. Any suggestions on how I can do this, or do I need to preprocess the JSON to remove the @ from the element names?
PhilHibbs
(539 rep)
Mar 25, 2020, 12:17 PM