
Directly go to Hadoop or use SQLite/Postgres as a stepping stone towards Hadoop/Spark?

2 votes
1 answer
683 views
In our organization, some people are working on setting up Hadoop with a lot of security restrictions. Judging by the rate of progress, this seems complex, especially given the security restrictions and the large variety of data sources involved. I am in another part of the organization, and in our group the amount of data currently generated is not high enough to need Hadoop or Spark, at least for a long while. I am also building a small application which needs a proper database.

Based on a back-of-the-envelope calculation, a single group in my smaller department generates about 25 GB of data per year (images, log files, xlsx, ppts), plus ~10 MB of numerical data that is stored in Excel workbooks. Right now all of this is stored in flat files (Excel files with numerical data, images, log files), because a lot of the work we do is non-routine (my part of the org is mostly a research org) and changes from day to day. A lot of the time we have to inspect images manually, as there is no way to do automated image analysis for the kind of features we are looking for. In total, across all groups in my part of the org, we might be generating ~10 TB of data per year (assuming 200 groups and a 2x multiplier to account for yearly growth in data volume; 200 TB in 20 years), most of which resides in flat file systems. We use an Excel template where people enter numerical data, and then multiple people can access the data and generate reports.

Currently, the main problems I have to address are as follows:

1. The Excel workbook that we use can only be accessed by one user at a time, which causes a lot of conflicts.
2. If we store Excel files larger than, say, 10 MB on a network share, it becomes painful to open the workbook. I therefore need to choose a database which is not too complex, so I can demo a prototype within a reasonable time.
3. The linked data (numerical data along with blob data) that is stored in the database and/or file system needs to be able to transition over to Hadoop/Spark or distributed databases.

I was thinking of the following route:

1. Just move the Excel workbooks to a network share, so that multiple users can access the workbooks independently without seeking permission from the person who has the workbook open (using legacy sharing): https://www.presentationpoint.com/blog/multiple-users-excel-2016-datasheet/ . The binary data will be stored on the file system, while the numerical data stays in Excel.
2. Next, instead of using co-authoring (OneDrive), and because we have to start using a proper database, I would create a macro in Excel which users would simply click to push the user-generated numerical data (along with links to the binary data) into a database. The binary data would still reside on the file system, but possibly also be copied to a second database (Database2), so that it can be transitioned to distributed databases in the future. Choose between Postgres and SQLite (leaning towards SQLite for individual groups for prototyping, as it seems to be widely used, has a large community, and probably has low bug/maintenance costs). Each group (~200 total) would maintain their own PostgreSQL/SQLite database until the distributed database becomes ready.
3. In the very long term, when we have to scale to Hadoop/Spark (assuming we hit the SQLite limit in 5 years), we can extract the data out of this database and push it to Hadoop/Spark using some converter (https://stackoverflow.com/a/28240430/4752883 , https://stackoverflow.com/a/40677622/4752883).

The reason for choosing SQLite over PostgreSQL is that SQLite itself supports around 140 TB of data storage, and SQLite seems to support multiple concurrent users (https://stackoverflow.com/questions/5102027/can-sqlite-support-multiple-users).
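To make step 2 concrete, here is a minimal sketch of what the database side of the "push on click" flow could look like, using Python's built-in `sqlite3`. The `measurements` schema and column names are hypothetical stand-ins for whatever the Excel template actually captures; the key ideas are that numerical values go into SQLite while binary data stays on the file system and is referenced by path, and that WAL journal mode lets many readers query while one writer pushes data. (One caveat worth knowing: the SQLite documentation warns that WAL mode is not reliable when the database file itself lives on a network filesystem, so the database file should sit on a local disk or a single server.)

```python
# Sketch of step 2, assuming a hypothetical "measurements" schema.
# Numerical data goes into SQLite; images/logs stay on the file share
# and are referenced by path in blob_path.
import sqlite3

def init_db(path):
    conn = sqlite3.connect(path)
    # WAL: readers don't block the writer. Note: per the SQLite docs,
    # WAL does not work reliably if the .db file is on a network share.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS measurements (
            id INTEGER PRIMARY KEY,
            group_name TEXT NOT NULL,
            recorded_on TEXT NOT NULL,   -- ISO date from the workbook
            value REAL NOT NULL,         -- the numerical cell value
            blob_path TEXT               -- link to image/log on the share
        )
    """)
    conn.commit()
    return conn

def push_rows(conn, rows):
    """rows: iterable of (group_name, recorded_on, value, blob_path)."""
    with conn:  # one transaction per push, i.e. one macro click
        conn.executemany(
            "INSERT INTO measurements (group_name, recorded_on, value, blob_path) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )

conn = init_db("group_a.db")
push_rows(conn, [
    ("group_a", "2018-08-19", 3.14, r"\\share\group_a\img_0001.png"),
    ("group_a", "2018-08-19", 2.72, None),
])
count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(count)  # 2
```

The Excel macro itself would just serialize the workbook's rows and hand them to something like `push_rows` (or issue the same `INSERT` statements over ODBC); the same schema works unchanged if the backend is later swapped to Postgres.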
Postgres has more capabilities, but will require a lot more resources and maintenance. I think that in the long term we will probably have to go to Hadoop/Spark, because the data volumes are sure to grow, but Hadoop is much more complex to manage and administer, especially considering the security requirements.

## Questions

1. What are the drawbacks of this approach (what am I not thinking about)?
2. Some people have told me to jump directly to Hadoop, and others have told me to just use SQL-type databases until we actually need to handle a lot more data. If you were trying to choose a database, knowing that in a couple of years you will probably need Hadoop, would you choose Hadoop or a SQL-type database for step #2 in this scenario?
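Regarding step 3 of the plan above: one point in favor of the SQL-first route is that the eventual extraction can be very simple. Here is a hedged sketch, again using only Python's standard library, that dumps a table to CSV; a CSV with a header row can be copied into HDFS and read by Spark (e.g. via `spark.read.csv(path, header=True)`), or fed to whichever converter from the linked answers is ultimately chosen. The `measurements` table and file names are illustrative only.

```python
# Sketch of step 3: extract one group's SQLite table to CSV so that
# Hadoop/Spark tooling can ingest it later. Table/file names are
# hypothetical examples.
import csv
import sqlite3

def export_table_to_csv(db_path, table, csv_path):
    """Dump one SQLite table to a CSV file with a header row."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute(f"SELECT * FROM {table}")  # table name is trusted here
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # column names
        writer.writerows(cur)
    conn.close()

# Tiny demo database standing in for one group's prototype DB.
conn = sqlite3.connect("demo_group.db")
conn.execute("CREATE TABLE IF NOT EXISTS measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.execute("DELETE FROM measurements")
conn.executemany("INSERT INTO measurements (value) VALUES (?)", [(1.0,), (2.5,)])
conn.commit()
conn.close()

export_table_to_csv("demo_group.db", "measurements", "measurements.csv")
with open("measurements.csv", newline="") as f:
    rows = list(csv.reader(f))
print(rows[0])       # ['id', 'value']
print(len(rows) - 1) # 2 data rows
```

Because each of the ~200 groups would have an identically-shaped database, the same export script could be run per group and the resulting files unioned on the Spark side.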
Asked by alpha_989 (137 rep)
Aug 19, 2018, 12:32 AM
Last activity: Aug 20, 2018, 07:01 AM