Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
5
votes
1
answers
212
views
How to install Spark in standalone mode on Ubuntu
I am trying to install Spark in standalone mode, but it shows an error.
How can I solve this problem?
- Java version: 1.8.0_131
- Spark: 2.2.0
- Hadoop: 2.7.4
- bashrc file settings: see below
- Hadoop location on the local system:
/usr/lib/hadoop/hadoop-2.7.4
- Spark location: /opt/spark/spark
File:
-----------------------------
#JAVA HOME directory setup
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
#HBASE HOME setup
export HBASE_HOME=/usr/lib/hbase/hbase-1.3.1
export PATH=$PATH:$HBASE_HOME/bin
#HADOOP Setup
export HADOOP_HOME=/usr/lib/hadoop/hadoop-2.7.4
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_PID_DIR=$HADOOP_HOME/hadoop2_data/hdfs/pid
#spark setup
export SPARK_HOME=/opt/spark/spark
export PATH=$SPARK_HOME/bin:$PATH
#scala
export SCALA_HOME=/usr/local/src/scala/scala-2.11.11
export PATH=$SCALA_HOME/bin:$PATH
------------------------------------------------------------
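Since no error message is included, a common first step is to confirm that the environment above actually resolves and that the standalone daemons start. A minimal sketch, assuming the paths above are correct and the Spark 2.2.0 binary distribution is unpacked at /opt/spark/spark:
source ~/.bashrc                                               # reload the exports above
echo $JAVA_HOME $SPARK_HOME                                    # both should point at real directories
$SPARK_HOME/sbin/start-master.sh                               # standalone master (web UI on port 8080)
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077       # attach a worker to the master
$SPARK_HOME/bin/spark-shell --master spark://$(hostname):7077  # open a shell against the cluster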

Shiva Manhar
(161 rep)
Oct 14, 2017, 05:47 PM
• Last activity: Jun 12, 2025, 01:06 AM
0
votes
1
answers
287
views
Fastest way to sync (or keep importing) 3.5 TB of data from Hadoop to a sharded MongoDB cluster
There are 3.5 TB of data in our Hadoop cluster (yes, on HDFS). We have newly built a sharded MongoDB cluster (the latest 3.x) with 3 mongos, 3 config servers, and 3 shards (each shard has 1 primary and 2 secondary nodes).
We are looking for the best/fastest way to import these data from Hadoop/HDFS into our newly built sharded MongoDB cluster.
All these data will go into sharded collections in the MongoDB cluster.
We don't have much experience with this and have no clue how to do it in the fastest way in our environment.
I'd appreciate it if anyone can give a clue or point to tools we can leverage; open-source tools or commercial ones are both OK for us.
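One plausible route, sketched here under the assumption that the HDFS files are (or can be exported as) JSON or CSV/TSV: shard the target collection and pre-split its chunks first so writes spread across all three shards, then stream the files through mongoimport against a mongos. Host, database, and collection names below are placeholders:
hdfs dfs -cat /data/part-* | \
  mongoimport --host mongos-host --port 27017 \
              --db mydb --collection mycoll --type json
For very large loads, running several such pipelines in parallel over disjoint sets of HDFS files, or using Spark/MapReduce with the mongo-hadoop connector, are common alternatives.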
Joe
nntp
(13 rep)
Sep 15, 2015, 04:21 PM
• Last activity: May 19, 2025, 07:04 PM
0
votes
0
answers
49
views
Is it necessary to optimize joins in HDFS?
What is the most optimal way to query in Hive (a data lake based on HDFS)?
Applying the filters to the tables before joining them:
select *
from (select code from table_1 where type = "a") a
inner join (select code from table_2 where type = "a") b
  on a.code = b.code
Or this way, with the filters in the where condition?
select *
from table_1
inner join table_2 on table_1.code = table_2.code
where table_1.type = "a"
  and table_2.type = "a"
Perhaps the most obvious and quickest answer is the first way. But I think that with HDFS the environment is optimized in such a way that it applies the "where" first and then the "join"; I mean, the engine brings internal query optimization.
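One way to check this for a specific Hive version is to compare the execution plans: with predicate pushdown enabled (hive.optimize.ppd, on by default), the WHERE filters are normally pushed below the join, so both forms end up with essentially the same plan. A sketch, reusing the table names from the question:
SET hive.optimize.ppd=true;   -- predicate pushdown, enabled by default
EXPLAIN
select *
from table_1
inner join table_2 on table_1.code = table_2.code
where table_1.type = "a"
  and table_2.type = "a";
-- In the plan output, the filter on type should appear in the Filter/TableScan
-- operators beneath the join, i.e. rows are filtered before they reach the join.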
cfsl
(1 rep)
Jan 31, 2024, 09:46 PM
• Last activity: Jan 31, 2024, 10:20 PM
2
votes
1
answers
110
views
Can we use Cassandra in place of Hadoop with Spark?
Considering we have a backend written in NodeJS that uses MySQL and Cassandra as its databases: if we want to add Spark to the system to do some data analysis, such as recommendations, can we do it with Cassandra (I mean Spark + Cassandra) and reach the same result we could reach with Hadoop (Spark + Hadoop)?
I want to know what Hadoop can do that Cassandra cannot, or what would make it essential to use Hadoop alongside Spark.
user20551429
(69 rep)
Nov 29, 2022, 04:41 AM
• Last activity: Nov 29, 2022, 05:09 AM
1
votes
1
answers
805
views
How to check the total allotted space inside an HDFS 'group'
Our DBA has created a schema for our team in HDFS/Hive. Not sure if 'schema' is the right word; they call it a 'group'.
Anyway, we can only write to the data lake inside this schema, whether as Parquet files or Hive tables.
Is there a way to check the maximum space allocated to our group, knowing only the schema name?
I don't want to accidentally load too much data.
Thank you.
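If the 'group' maps to an HDFS directory with a space quota (which is how such limits are usually enforced), the quota and remaining space can be read with the commands below. The warehouse path is an assumption; your DBA can confirm the actual directory for the schema:
hdfs dfs -count -q -h /user/hive/warehouse/your_schema.db
# columns include QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA
hdfs dfs -du -s -h /user/hive/warehouse/your_schema.db
# current space used under the directory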
Victor
(127 rep)
May 1, 2021, 03:18 PM
• Last activity: Sep 16, 2021, 12:25 PM
0
votes
1
answers
722
views
User specific default database in Hive
I am running an Oracle Big Data Appliance platform that has Cloudera EDH 5.9.x running under the hood. My users mainly use Beeswax, the Hive query editor app within Hue.
When a user logs into Hue and then opens Beeswax, the default Hive database, "default", is preselected. I want to change this so that the first database they see is their own sandbox database. Currently, the user has to manually select the database in Beeswax or run the `use DATABASE` command in the editor.
Is there a configuration item I can change within any of the CDH software modules that will help me do this? Or is there a concept of a Hive startup script where I can run the `use DATABASE` command?
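For the Hive CLI and Beeline there is a startup-script mechanism; whether Hue/Beeswax honours it depends on how it connects to HiveServer2, so treat this as a sketch to test rather than a confirmed fix (the database name is a placeholder):
-- contents of a per-user ~/.hiverc, read by the Hive CLI at startup;
-- Beeline can run a similar script with: beeline -i init.sql
use my_sandbox_db;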
rustycodemonkey
(1 rep)
Oct 12, 2017, 12:13 AM
• Last activity: Apr 30, 2021, 05:05 PM
0
votes
1
answers
38
views
How to get percentage after grouping?
I have a table like this:
person_id  food    data
1          bread   xxx
1          bread   xxx
1          fruite  xxx
2          bread   xxx
2          fruite  xxx
food can have only a few values (bread, fruite); I want to get the percentage of bread each person has eaten, and the percentage of fruit each person has eaten.
I tried to do:
select person_id, food, count(*) from table
group by person_id, food
and that gives me:
person_id, food, number
but how can I continue from here?
I am on Hadoop.
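A sketch of one way to finish this in HiveQL, building on the grouped counts with a window function (your_table is a placeholder for the question's table name):
select person_id, food, cnt,
       cnt / sum(cnt) over (partition by person_id) as pct
from (
  select person_id, food, count(*) as cnt
  from your_table
  group by person_id, food
) t;
For person 1 in the sample data, this would give bread a pct of 2/3 and fruite 1/3.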
Marco Dinatsoli
(151 rep)
Apr 13, 2021, 01:03 PM
• Last activity: Apr 13, 2021, 03:08 PM
0
votes
0
answers
25
views
Field with Top Ranking Field Name
Let's imagine a table structured like this:
| Bucket | Red | Blue | Green |
| ------ | --- | ---- | ----- |
| First  | 1   | 3    | 4     |
| Second | 6   | 5    | 2     |
What I'm trying to achieve: based on the values within each bucket, I'd like to generate another set of fields with the highest-ranking, second-highest-ranking, and third-highest-ranking colors (assume there can be more than three colors as well). We are limiting it to the top 3.
Essentially, what my final output should look like is this:
| Bucket | Red | Blue | Green | Rank 1 | Rank 2 | Rank 3 |
| ------ | --- | ---- | ----- | ------ | ------ | ------ |
| First  | 1   | 3    | 4     | Green  | Blue   | Red    |
| Second | 6   | 5    | 2     | Red    | Blue   | Green  |
Hoping this isn't a redundant question.
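A sketch in HiveQL, under the assumption that the colors are a fixed set of columns (table and column names are placeholders): unpivot the colors with stack(), rank them per bucket, and pivot the top three back out. The original Red/Blue/Green columns can be joined back on Bucket if they are needed in the same row.
select bucket,
       max(case when rn = 1 then color end) as rank_1,
       max(case when rn = 2 then color end) as rank_2,
       max(case when rn = 3 then color end) as rank_3
from (
  select bucket, color, val,
         row_number() over (partition by bucket order by val desc) as rn
  from (
    select bucket, color, val
    from my_table
    lateral view stack(3, 'Red', red, 'Blue', blue, 'Green', green) s as color, val
  ) unpivoted
) ranked
group by bucket;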
Franco Buhay
(1 rep)
Feb 10, 2021, 07:37 PM
0
votes
3
answers
129
views
Data storage for analytics
I have to store some amount of data for analytical purposes.
- The data source produces 2TB data per month.
- Data is collected on a monthly basis (not real-time).
- Data is fully structured.
- There are 100+ different columns of data.
- Availability of SQL is important.
- Engineer/developer resources are limited.
I planned to use Postgres (probably with a column-oriented extension); however, it would not be feasible for such data volumes (more than 20 TB per year). I also researched Hadoop/Spark; however, it looks like a rather heavyweight solution (considering that the data is fully structured). I don't consider cloud-based solutions, nor expensive ones (preferably free-license).
Could you suggest which data store to use for large amounts of structured data for analytical purposes?
Leeloo
(111 rep)
Dec 24, 2020, 12:23 PM
• Last activity: Dec 24, 2020, 02:18 PM
0
votes
1
answers
355
views
Creating external tables from SQL Server using JDBC Drivers
I am trying to query Hive tables AND Kylin tables from SQL Server 2019. I was doing some research and found that MSSQL provides PolyBase as a form of interface to Hadoop storage. However, the examples mostly involve creating external tables over file formats stored in HDFS. All I need is to create an external table from MSSQL using JDBC drivers. Is this possible? I would like to avoid using ODBC drivers.
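For reference, a sketch of the HDFS file-based route that the PolyBase examples describe; it reads the Hive table's underlying files directly rather than going through a JDBC or ODBC driver, and every name, host, and path below is a placeholder:
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode-host:8020');

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.HiveTableExternal (
    col1 INT,
    col2 NVARCHAR(100)
)
WITH (LOCATION = '/apps/hive/warehouse/mytable',
      DATA_SOURCE = MyHadoop,
      FILE_FORMAT = TextFileFormat);
Whether a purely JDBC-based external table is possible in SQL Server 2019 is not something this sketch settles.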
user859385
(101 rep)
Nov 5, 2020, 01:50 PM
• Last activity: Nov 7, 2020, 03:29 PM
5
votes
2
answers
5759
views
Why is my UTF-8 document raising UTF-8 encoding errors in Azure Data Lake Analytics?
I have a document that was compressed with gzip by an unknown source system. It was downloaded and decompressed using a 7-Zip console application. The document is a CSV file that appears to be encoded in UTF-8.
It's then uploaded to Azure Data Lake Store right after decompression. Then there is a U-SQL job set up to simply copy it from one folder to another. This process fails and raises a UTF-8 encoding error for a value: ée
**Testing**
I downloaded the document from the store and removed all records except the one with the value flagged by Azure. Notepad++ shows the document as UTF-8. I save the document as UTF-8 again and upload it back to the store. I run the process again and it succeeds with that value as UTF-8.
What am I missing here? Is it possible the original document is not truly UTF-8? Is there something else causing a false positive? I'm a bit baffled.
**Possibilities**
- The document is not truly UTF-8 and needs to be recoded
- Maybe the method that's uploading the file is recoding it
- Maybe 7zip is recoding it incorrectly
**Environment/Tools**
- Windows Server
- Python 2.7
- Azure Data Lake Store
- Azure Data Lake Analytics
- 7Zip.exe
- gz
- Azure API
**USQL**
Just the base U-SQL job that defines the schema, then selects all fields to a new directory. No transformation happens beyond leaving out the headers. The file is CSV, comma delimited, with double quotes on strings. The schema is all strings regardless of data type. The extractors tried are TEXT and CSV, both set to encoding:UTF8, even though both default to UTF-8 according to the Azure documentation.
**Other Notes**
1. This same document was uploaded in the past to BLOB storage and imported in the same fashion into Azure Data Warehouse without errors via Polybase.
1. The value that causes the UTF-8 encoding error is a URL mangled among 1 million other records.
1. It looks like there are ANSI (non-UTF-8) characters coming in even though it's supposedly a UTF-8 document.
1. When I convert it to ANSI and use the ASCII extractor, the file succeeds.
1. Azure Data Lake Analytics does not allow you to ignore the error, as it's an encoding issue. I'd be happy invalidating the record altogether, like you can in Azure Data Warehouse.
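A minimal sketch (Python 2.7, as listed in the environment) for locating byte sequences in the decompressed CSV that are not valid UTF-8; the file name is a placeholder:
with open('document.csv', 'rb') as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as e:
            # report the line number and the raw bytes around the bad sequence
            print lineno, repr(raw[max(e.start - 10, 0):e.end + 10])
If this reports errors, the file on disk is not actually valid UTF-8 (for example, a lone 0xE9 byte for "é" would suggest Latin-1/Windows-1252); if it reports nothing, the recoding is more likely happening in the upload step.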
Fastidious
(496 rep)
Mar 10, 2018, 11:51 PM
• Last activity: Jun 19, 2020, 03:01 PM
0
votes
1
answers
4004
views
"ZooKeeper exists failed after 4 attempts" when launching Hbase
When launching HBase I get the following error:
mike@mike-thinks:~/hbase-1.2.6/bin$ ./hbase shell
2017-11-30 17:26:42,137 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-30 17:26:59,588 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2017-11-30 17:26:59,589 WARN [main] zookeeper.ZKUtil: hconnection-0x10823d720x0, quorum=localhost:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:648)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:450)
at org.jruby.javasupport.JavaMethod.invokeStaticDirect(JavaMethod.java:362)
at org.jruby.java.invokers.StaticMethodInvoker.call(StaticMethodInvoker.java:58)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:312)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:169)
at org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
at org.jruby.ast.InstAsgnNode.interpret(InstAsgnNode.java:95)
at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:104)
at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:169)
at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:191)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:302)
at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:144)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:148)
at org.jruby.RubyClass.newInstance(RubyClass.java:822)
at org.jruby.RubyClass$i$newInstance.call(RubyClass$i$newInstance.gen:65535)
at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroOrNBlock.call(JavaMethod.java:249)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:292)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:135)
at home.mike.hbase_minus_1_dot_2_dot_6.bin.$_dot_dot_.bin.hirb.__file__(/home/mike/hbase-1.2.6/bin/../bin/hirb.rb:131)
at home.mike.hbase_minus_1_dot_2_dot_6.bin.$_dot_dot_.bin.hirb.load(/home/mike/hbase-1.2.6/bin/../bin/hirb.rb)
at org.jruby.Ruby.runScript(Ruby.java:697)
at org.jruby.Ruby.runScript(Ruby.java:690)
at org.jruby.Ruby.runNormally(Ruby.java:597)
at org.jruby.Ruby.runFromMain(Ruby.java:446)
at org.jruby.Main.doRunFromMain(Main.java:369)
at org.jruby.Main.internalRun(Main.java:258)
at org.jruby.Main.run(Main.java:224)
at org.jruby.Main.run(Main.java:208)
at org.jruby.Main.main(Main.java:188)
2017-11-30 17:26:59,596 ERROR [main] zookeeper.ZooKeeperWatcher: hconnection-0x10823d720x0, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:648)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:450)
at org.jruby.javasupport.JavaMethod.invokeStaticDirect(JavaMethod.java:362)
at org.jruby.java.invokers.StaticMethodInvoker.call(StaticMethodInvoker.java:58)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:312)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:169)
at org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
at org.jruby.ast.InstAsgnNode.interpret(InstAsgnNode.java:95)
at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:104)
at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:169)
at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:191)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:302)
at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:144)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:148)
at org.jruby.RubyClass.newInstance(RubyClass.java:822)
at org.jruby.RubyClass$i$newInstance.call(RubyClass$i$newInstance.gen:65535)
at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroOrNBlock.call(JavaMethod.java:249)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:292)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:135)
at home.mike.hbase_minus_1_dot_2_dot_6.bin.$_dot_dot_.bin.hirb.__file__(/home/mike/hbase-1.2.6/bin/../bin/hirb.rb:131)
at home.mike.hbase_minus_1_dot_2_dot_6.bin.$_dot_dot_.bin.hirb.load(/home/mike/hbase-1.2.6/bin/../bin/hirb.rb)
at org.jruby.Ruby.runScript(Ruby.java:697)
at org.jruby.Ruby.runScript(Ruby.java:690)
at org.jruby.Ruby.runNormally(Ruby.java:597)
at org.jruby.Ruby.runFromMain(Ruby.java:446)
at org.jruby.Main.doRunFromMain(Main.java:369)
at org.jruby.Main.internalRun(Main.java:258)
at org.jruby.Main.run(Main.java:224)
at org.jruby.Main.run(Main.java:208)
at org.jruby.Main.main(Main.java:188)
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 1.2.6, rUnknown, Mon May 29 02:25:32 CDT 2017
Yet I applied the settings Apache advises in order to make the data persistent.
I tried to run a `status` command and it gave me back:
2017-11-30 17:26:15,464 ERROR [main] client.ConnectionManager$HConnectionImplementation: Can't get connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
I found a similar question on SO which states that the error indicates the ZooKeeper quorum is not running; the most probable cause is some inconsistency with the zookeeper.quorum setting in conf/hbase-site.xml. But it didn't help. My conf/hbase-site.xml file looks like:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DataHbase/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/DataHbase/zookeeper</value>
  </property>
</configuration>
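A sketch of quick checks for whether a ZooKeeper instance is actually listening on localhost:2181 (with the default standalone setup HBase starts its own ZooKeeper, so these are run on the same machine):
jps                              # should list HMaster (and HQuorumPeer when HBase manages ZooKeeper)
echo ruok | nc localhost 2181    # a healthy ZooKeeper answers "imok"
netstat -tlnp | grep 2181        # confirm something is bound to the quorum port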
Revolucion for Monica
(677 rep)
Nov 30, 2017, 04:53 PM
• Last activity: Apr 23, 2020, 12:18 AM
2
votes
1
answers
330
views
Binary storage in Cassandra, HBase
I am looking at some implementations of Cassandra and HBase for medium-sized data sets (~1M resources) to be exposed to clients as graphs (via e.g. Tinkerpop). I would also like to store binaries in the same data stores. While it seems like both systems support storing large binaries one way or another (HBase via HDFS), I wonder what the performance implications would be of using these versus flat-file storage. Are these systems designed to store binaries at scale, or are they more targeted at metadata storage? I am talking about hundreds of TB of binary data.
Thanks
.s
gattu marrudu
(21 rep)
Apr 1, 2017, 06:06 AM
• Last activity: Nov 29, 2019, 11:02 PM
0
votes
1
answers
129
views
Polybase with Cloudera 5.9
We are working on a proof of concept using PolyBase with Cloudera.
The PolyBase documentation says we can connect to Cloudera 5.5:
https://msdn.microsoft.com/en-us/library/mt143174.aspx
We are looking at Cloudera 5.9 - does PolyBase work with Cloudera versions above 5.5?
Jeremiah
(11 rep)
Nov 22, 2016, 08:47 PM
• Last activity: Nov 14, 2019, 02:01 AM
1
votes
0
answers
207
views
how to access hadoop slaves which have the same IP address but different port numbers
My question is: I'm wondering whether it's possible for Hadoop to use SSH with a different port number (not 22) for slaves when it builds a Hadoop cluster.
Details:
I tried to make a Hadoop cluster with two slaves.
The master node is on my local computer, but the other nodes are on my AWS EC2 instance as Docker containers.
local: master node
AWS EC2 (33.162.168.105): first Hadoop container as slave #1, second Hadoop container as slave #2
(By the way, this IP address is fake.)
The first slave container mapped port 1111 to 22 with the **-p 1111:22** option,
and the second slave container mapped port 1112 to 22 with the **-p 1112:22** option,
so that a remote computer can SSH into slave #1 on port 1111.
In fact, when I tried to SSH into slave #1 on port 1111 from my local computer's terminal, it worked well!
The problem is Hadoop's SSH connection to the slaves when building the cluster.
I'm using Hadoop 3.1.1, so I had to put the slaves' IPs into /etc/hadoop/workers, like:
33.162.168.105:1111
But after running start-all.sh, I got these errors:
server@9130cf720e0f:/usr/local/hadoop/etc/hadoop# start-all.sh
Starting namenodes on [master]
Starting datanodes
33.162.168.105:1111: ssh: Could not resolve hostname 33.162.168.105:1111: Name or service not known
Starting secondary namenodes [9130cf720e0f]
Starting resourcemanager
Starting nodemanagers
33.162.168.105:1111: ssh: Could not resolve hostname 33.162.168.105:1111: Name or service not known
I know I can't add ":1111" at the end of the IP address.
So I really want to know whether Hadoop can use SSH with a different port number (1111 in this case) for the slaves when it builds a cluster.
If you know how, please help me.
Thank you for reading, and thanks in advance for your help.
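A common workaround, sketched here and not tested against this exact setup: the workers file does not accept host:port, but SSH aliases can carry the port. Define one alias per container in ~/.ssh/config on the master and list the aliases in the workers file (alias names are placeholders):
# ~/.ssh/config on the master
Host slave1
    HostName 33.162.168.105
    Port 1111
Host slave2
    HostName 33.162.168.105
    Port 1112

# /etc/hadoop/workers then contains only the aliases:
slave1
slave2
Note that even with SSH working, the two containers still share one IP, so the DataNode/NodeManager ports would also need distinct mappings.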
eyeballs
(21 rep)
Jun 5, 2019, 11:14 AM
• Last activity: Jun 5, 2019, 01:00 PM
2
votes
0
answers
180
views
Options for copying data out of continuously updating tables
I work with a pretty large (O(10^1) TB) DB2 (LUW, v. 9.7) database. It's got data coming in on a continuous basis via Golden Gate replication from another DB2 database. It's used for business intelligence and analytics.
Now I'm working with a group in the parent company which is trying to build an enterprise data warehouse. They want to collect data from their own databases, as well as from their acquisitions (like my site). To this end they purchased an Oracle BDA appliance (Cloudera Hadoop), and sometime soon an Oracle Exadata box will be stood up.
Putting aside the fact that the target database is Hadoop, not a traditional RDBMS, I'm having a hard time coming up with solutions that will faithfully copy the data out of the source database, given that rows are not only being continuously inserted, but also updated. (As far as I can tell, rows are never deleted.)
## Question
I'm interested in what the landscape of possible approaches looks like, as well as which approaches will scale without being too much of a performance burden on the source database.
## Current Solution
Currently we copy data to Vertica in-house, using a home-built solution. Small tables are dumped on the target and then copied in their entirety to Vertica. Large tables have a trigger that updates a table with a single row. That datum records the oldest value of a timestamped and indexed column seen in any row that's inserted or updated. All the SELECTs are done as uncommitted reads. This appears to work, but we do transfer a large amount of data; the new project requires transmission over a much greater distance and with presumably less bandwidth. Moreover, this process is only run once per week. While I don't think a lag measured in minutes is required for this new project, the principals might not be very happy with a weekly refresh.
## Possible Solutions
Here's what I've brainstormed so far:
1. A vended replication solution like Golden Gate.
2. Some in-house solution that ships transaction logs (probably beyond our dev capabilities).
3. A trigger that exports any inserted or updated row. I assume this would be a horrendous performance hit on the source DB.
4. A trigger that records the primary key of any inserted/updated row. Also would appear to be a big performance hit.
5. Adding an indexed timestamp column for time of last modification, and an accompanying trigger to modify it on update. The DBAs I work with claim this would be a performance hit (I can't really tell if it would be superior to numbers 3 and 4 above). Moreover, it adds the complication that the upstream data source doesn't have this column, with possible implications for the current replication process between the two DB2 databases.
user1071847
(183 rep)
Jul 30, 2015, 03:53 PM
• Last activity: Jan 21, 2019, 12:04 PM
1
votes
0
answers
122
views
mongo-hadoop connector makes duplicate data by numbers of Mongo Sharding
I'm using the [mongo-hadoop connector](https://github.com/mongodb/mongo-hadoop), which lets Hadoop read data from MongoDB and write results back into MongoDB. I found that the MongoDB data is duplicated after a Hadoop MapReduce job over MongoDB data.
Environment: Hadoop version is 3.1.1, MongoDB version is 4.0.4, the mongo-hadoop connector is mongo-hadoop-core-2.0.2.jar with mongo-java-driver-3.8.2.jar, and the Docker version is 18.03.1-ce. There are two local servers named server1 and server2. They have public IPs, so I can build a Hadoop cluster or a MongoDB sharding environment on them. To run Hadoop and MongoDB, I used Docker and Docker orchestration so that containers can exchange packets with each other over an overlay network. Server 1 has the Hadoop master container, a Hadoop slave1 container, the MongoDB router container, and the MongoDB config server container; server 2 has the Hadoop slave2 and slave3 containers, and the MongoDB shard1 and shard2 containers. MongoDB has a [30 MB TSV data file](https://drive.google.com/open?id=14U1G4lDV-ExjQ8saLeDD0hXzFJAue9v3) and the chunk size is 8 MB. When I set up Hadoop clustering and MongoDB sharding, I let the containers connect to each other by their container names (e.g., the Hadoop master pings 'slave1' or 'mongorouter'... and it works well).
Problem: after setting up the Hadoop cluster and MongoDB sharding (with just one sharded collection), the result of a WordCount MR job over the MongoDB data (the 30 MB data) is duplicated; in detail, every count is multiplied by 2. For example, if the normal result is [a 1, b 2], the duplicated result is [a 2, b 4]. If I make another sharded collection (same 30 MB data, same code, same database, just a different collection name, so there are now two sharded collections, and the MR job uses only the new one), then the result is multiplied by 3 ([a 3, b 6]). If I add more sharded collections the same way, the result is multiplied proportionally. If I don't set up the Mongo sharding environment, the result is what I expect. I really don't know what is happening. I noticed the number of MongoCollectionSplitters is increasing: the more sharded collections, the more MongoCollectionSplitters appear in the MR logs (I attached the logs below).
This is the WordCount MR code.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class BigdataBench {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // input and output MongoDB URIs are passed on the command line
        MongoConfigUtil.setInputURI(conf, "mongodb://" + args[0]);
        MongoConfigUtil.setOutputURI(conf, "mongodb://" + args[1]);

        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(BigdataBench.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.waitForCompletion(true);
    }

    public static class Map extends Mapper<Object, BSONObject, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text dataOutput = new Text();

        public void map(Object key, BSONObject value, Context context) throws IOException, InterruptedException {
            // split the "data" field on spaces and tabs, emit (token, 1)
            String data = value.get("data").toString();
            for (String whiteSpaceSplit : data.split(" ")) {
                String[] tapSplit = whiteSpaceSplit.split("\t");
                for (String split : tapSplit) {
                    dataOutput.set(split);
                    context.write(dataOutput, one);
                }
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, BSONWritable> {
        private BSONWritable reduceResult = new BSONWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // sum the counts for each token and write the total as a BSON document
            BasicBSONObject output = new BasicBSONObject();
            int sum = 0;
            for (IntWritable s : values) {
                sum += s.get();
            }
            String wordCount = String.valueOf(sum);
            output.put("word", wordCount);
            reduceResult.setDoc(output);
            context.write(key, reduceResult);
        }
    }
}
I arranged my problem as a matrix (image not reproduced here).
If you have the same problem or solutions, please answer and help me. Thanks in advance.
eyeballs
(21 rep)
Jan 7, 2019, 01:33 AM
• Last activity: Jan 7, 2019, 02:15 AM
0
votes
1
answers
816
views
Does HBase support spatial functionality?
I see mention for spatial functional in HBase. For example [*"HBaseSpatial: A Scalable Spatial Data Storage Based on HBase"*](https://ieeexplore.ieee.org/abstract/document/7011307). What spatial functionality does HBase support and where is this documented?
I see mentions of spatial functionality in HBase, for example [*"HBaseSpatial: A Scalable Spatial Data Storage Based on HBase"*](https://ieeexplore.ieee.org/abstract/document/7011307).
What spatial functionality does HBase support and where is this documented?
e7lT2P
(175 rep)
Nov 12, 2018, 05:31 PM
• Last activity: Nov 16, 2018, 01:47 AM
3
votes
1
answers
372
views
Apache Phoenix: Using MINUS throws an error
I'm using Apache Phoenix to query HBase. I tried to use a simple MINUS operator, like we do in good old SQL, but it produces an error that I couldn't wrap my head around. Here's the query:
select * from NOTIFICATION
MINUS
select * from NOTIFICATION where SUBJECT = 'datanode';
(Screenshot of the error not reproduced here.)
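For this particular query the set difference can also be expressed as a plain filter, which sidesteps MINUS entirely; a sketch that treats NULL subjects explicitly (ignoring MINUS's duplicate-elimination semantics):
select * from NOTIFICATION
where SUBJECT <> 'datanode' or SUBJECT is null;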
Youcef Chihi
(131 rep)
Aug 9, 2017, 10:08 AM
• Last activity: Oct 31, 2018, 03:17 PM
2
votes
1
answers
683
views
Directly go to Hadoop or use SQLite/Postgres as a stepping stone towards Hadoop/Spark?
In our organization, some people are working on setting up Hadoop with a lot of security restrictions etc. Given the rate of progress, this seems complex, especially considering the security restrictions, the large variety of data sources present, etc. I am in another part of the organization, and in our group the amount of data currently generated is not high enough to need Hadoop or Spark, at least for a long while. I am also building a small application which needs a proper database.
Based on a back-of-the-envelope calculation, a single group in my smaller department generates about 25 GB of data (images, log files, xlsx, ppts, etc.) per year, plus ~10 MB of numerical data stored in Excel workbooks. Right now all of this is stored in flat files (Excel files with numerical data, images, log files), because a lot of the work we do is non-routine (my part of the org is mostly a research org) and changes from day to day. So a lot of the time we have to inspect images manually, as there is no way to do automated image analysis for the kind of features we are looking for. In total, across all groups in my part of the org, we might be generating ~10 TB of data per year (assuming 200 groups, and a 2x multiplier to account for growth in data volume per year, so 200 TB in 20 years), most of which resides in flat file systems.
We use an Excel template where people enter numerical data, and then multiple people can simultaneously access the data and generate reports.
Currently, the main problems I have to address are as follows:
1. The Excel workbook that we use can only be accessed by 1 user at a time, so it causes a lot of conflicts.
2. If we store Excel files larger than, say, 10 MB, because they're stored on a network share, it becomes painful to open the workbook, so I need to choose a database which is not too complex so I can demo a prototype within a reasonable time.
3. The linked data (numerical data along with blob data) that is stored in the database and/or file system needs to be able to transition over to Hadoop/Spark or distributed databases.
I was thinking of the following route:
1. Just move to a network share for the Excel workbooks, so that multiple users can start accessing workbooks independently without seeking permission from the person who has the workbook open (using legacy sharing): https://www.presentationpoint.com/blog/multiple-users-excel-2016-datasheet/ . The binary data will be stored on the file system, while numerical data is stored in Excel.
2. Next, instead of using co-authoring (OneDrive), and because we have to start using a proper database, I would create a macro in Excel which users would pretty much click to push the user-generated numerical data (along with links to the binary data) into a database. The binary data will still reside on the file system, but possibly be copied over to a second database (Database2), so that it can be transitioned to distributed databases in the future. Choose between Postgres and SQLite (leaning towards SQLite for individual groups for prototyping, as it seems to be widely used, has a large community, and probably has low bug/maintenance costs). Each group (~200 total) would maintain its own PostgreSQL/SQLite database until the distributed database becomes ready.
3. In the very, very long-term future, when we have to scale to Hadoop/Spark (assuming we hit the SQLite limit in 5 years), we can extract the data out of this database and push it to Hadoop/Spark using some converter (https://stackoverflow.com/a/28240430/4752883 , https://stackoverflow.com/a/40677622/4752883).
The reason for choosing SQLite over PostgreSQL is that SQLite itself supports around 140 TB of data storage. SQLite seems to support multiple concurrent users (https://stackoverflow.com/questions/5102027/can-sqlite-support-multiple-users) . Postgres has more capabilities, but will require a lot more resources and maintenance. I think in the long term we probably have to go to Hadoop/Spark, because the data volumes are likely to grow for sure, but Hadoop is much more complex to manage and administer, especially considering the security considerations etc.
## Questions
1. What are the drawbacks of this approach (what am I not thinking about)?
2. Some people have told me to jump directly to Hadoop, and some have told me to just use SQL-type databases until we actually start needing a lot more data. If you were trying to choose a database, while knowing for sure that in a couple of years you will probably need Hadoop, would you choose Hadoop or SQL-type databases in this scenario for step #2?
alpha_989
(137 rep)
Aug 19, 2018, 12:32 AM
• Last activity: Aug 20, 2018, 07:01 AM
Showing page 1 of 20 total questions