Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
1
answers
142
views
Is storing redundant data part of the (CouchDB) self-contained data concept?
The [CouchDB concept](http://docs.couchdb.org/en/2.1.1/intro/why.html#a-better-fit-for-common-applications) speaks about self-contained documents. In other words: ALL related data is stored in a single document, instead of being split across multiple tables that reference each other using FKs.
> An invoice contains all the pertinent information about a single transaction: the seller, the buyer, the date, and a list of the items or services sold.
But this means that we end up with redundant data in our database. Let's consider the invoice example: I buy 10 products over a period of time, which results in 10 invoices. Now several pieces of information are redundant, e.g. my billing address, delivery address and so on. That's at least a waste of space.
Considering other use cases, updating data can be very complex. If we have a blogging CMS like WordPress, every post document contains metadata like the author's name/email and so on. If that information changes, we have to update a lot of documents. And we have to know the structure of all of them. E.g. the email could be used in posts, comments, log entries and so on.
Considering that CouchDB allows a different structure for every document, updating such information looks like a horrible task. I used MySQL as a relational database for years, and it seems much more meaningful to normalize such information, e.g. to put users in a user table/collection, so that we can refer to some kind of PK.
### What is the purpose behind this concept in CouchDB?
So my question is: did I understand the concept of CouchDB properly, in that it drops normalisation completely? I see the (theoretical?) benefit of better performance when no joins are required. Is the idea that we get a database with high throughput, but have to live with a (possibly massive) amount of redundant data, and have to deal with a complex update process when related metadata changes?
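To make the trade-off concrete, a self-contained invoice in the sense of the CouchDB documentation might look roughly like the sketch below (field names are invented for the example); the embedded buyer block is exactly the part that gets repeated in all 10 invoices:
# Hypothetical self-contained invoice document, CouchDB style.
# The buyer data is embedded, so it is duplicated in every invoice.
invoice = {
    "_id": "invoice:2018-0042",
    "type": "invoice",
    "date": "2018-05-08",
    "seller": {"name": "ACME Ltd."},
    "buyer": {                      # repeated verbatim in each of the 10 invoices
        "name": "Lion",
        "email": "lion@example.com",
        "billing_address": "Example Street 1, 12345 Example City",
        "delivery_address": "Example Street 1, 12345 Example City",
    },
    "items": [
        {"description": "Product A", "quantity": 1, "price": 19.99},
    ],
}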
Lion
(151 rep)
May 8, 2018, 08:17 AM
• Last activity: Jul 23, 2025, 01:04 AM
-1
votes
1
answers
66
views
NoSQL DB candidate for the project (looking for experienced advice)
## Intro ##
I am stuck with the limitations of the DBs I've tested* and need someone experienced to give me some advice.
## Project details ##
### Data sets ###
The project has 2 sets of data: the sets of `items` and their correlated `prices`.
`Items` are the list of all the supported items with their properties.
`Prices` are the list of an `item`'s price data for many countries, where each country may have multiple stores (shops) that offer the `item`. This one-many-many (item-country-store) relation makes it hard to flatten the document into a single set of data where each `item` has `prices` embedded as properties.
### Main goal ###
The user should be able to get a list of `item`s with their lowest price offer after applying the preferred filters and sort.
The user should be able to filter by `item` properties (such as `item.title LIKE 'The %'`, `item.rating > 8` or other filters) AND/OR `price` properties (such as `price.value = 50` or other filters).
The user should be able to sort by either the selected `item` property (sort by `item.title ASC` or `item.rating DESC`) or the `price` property (sort by `price.value ASC`, `price.discount DESC`).
The response needs to be fast so the user can quickly paginate through the list. I suppose DB caching, pre-sorted collections (such as ArangoDB views), pre-aggregated (filtered) collections (such as MongoDB on-demand materialized views) or some other solution is a must, but correct me if I am wrong. Relying on indexing alone proved to be slow in the DBs I've tried, since there are around 150 000 `items` and the total number of `prices` records easily goes over a million per country.
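For illustration only, the pre-aggregation idea could look like the following MongoDB sketch using `$merge` (an on-demand materialized view); the collection and field names (`prices`, `item_id`, `value`, `lowest_prices`) are assumptions, not part of the question:
# Sketch: periodically rebuild a "lowest price per item" collection with $merge,
# then let the paginated list queries read from that small, indexed collection.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

db.prices.aggregate([
    {"$group": {"_id": "$item_id", "min_price": {"$min": "$value"}}},
    {"$merge": {"into": "lowest_prices", "on": "_id",
                "whenMatched": "replace", "whenNotMatched": "insert"}},
])

# Fast, pre-aggregated pagination instead of re-aggregating a million rows per request:
page = list(db.lowest_prices.find().sort("min_price", 1).limit(20))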
### Limitations ###
I cannot afford to use enterprise (proprietary) or cloud licenses so I would have to stick to open source (community) DB projects.
## Question ##
Based on your experience, which DB would you recommend for this use case and why?
---
*I've tested MariaDB, Redis (with the RedisJSON module) and ArangoDB.
### MariaDB ###
Since it's a relational DB, it's too slow to JOIN more than 10 different tables to get a single list. I use MariaDB to store the data, but I need another solution for serving the frontend.
**Edit**
I did not give details about my MariaDB implementation because I thought it was irrelevant since I asked for a NoSQL solution. Looks like this is the part that everyone wants to know about.
I said it was too slow because the `items` and `prices` data types in MariaDB are spread out over multiple tables.
In MariaDB, an `item` has 17 related tables, mostly with one-to-many relations. Think of properties such as `item.genre`, `item.category`, `item.rating` and others. A `price` has up to 6 relations, mostly one-to-one.
When flattened to JSON documents, everything fits into two data sets, hence I said `items` and `prices` in NoSQL.
Because of the complexity of the relations, just building one whole `item` _should_ take longer in SQL (even with the simplest ID index relations) than fetching one document from NoSQL. This is especially because of the many one-to-many relations, which would take more than one query to build the arrays.
I am not an expert and I do not claim to be one, so correct me if I am wrong.
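For context, the flattening described above would look roughly like this (property names are invented); the one-to-many relations become embedded arrays, which is what a single SQL query cannot produce directly:
# Hypothetical flattened documents: one per item, one per price offer.
item = {
    "_key": "348",
    "title": "Item 348",
    "rating": 8.7,
    "genres": ["action", "indie"],          # one-to-many -> embedded array
    "categories": ["game", "multiplayer"],  # one-to-many -> embedded array
}

price = {
    "item_key": "348",
    "country": "US",
    "store": "store-1",
    "value": 49.99,
    "discount": 20,
}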
### Redis ###
Redis with its RedisJSON module looked like a great candidate because it's an in-memory DB, but there were too many limitations to make it a viable option.
Because of the complexity of `prices`, I am unable to flatten the data into a single JSON document, so I have multiple roundtrips to get the "JOINed" results. The complexity of multiple roundtrips escalated when I had to introduce filtering by both data sets: I had to filter `prices` to see which items had an offer within the wished ranges, then filter `items` to see which of those with an offer were within the wished `item` ranges, then, for example, sort by `item.title`, and finally get the best corresponding `price` details.
I lost performance since I had to use `FT.AGGREGATE` with a `SORT` on the grouped property, which made Redis work on the results without the use of an index.
The final nail in the coffin was Redis being single-threaded. When I ran multiple queries at the same time, I saw queries go over 10 seconds and time out.
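For reference, the kind of aggregation meant above might look like the redis-py sketch below (the index name and fields are assumptions); sorting on the computed `min_value` is the step that cannot be answered straight from the index:
# Sketch of a RediSearch aggregation: lowest price per item, sorted by that price.
import redis

r = redis.Redis(host="localhost", port=6379)

rows = r.execute_command(
    "FT.AGGREGATE", "idx:prices", "*",
    "GROUPBY", "1", "@item_id",
    "REDUCE", "MIN", "1", "@value", "AS", "min_value",
    "SORTBY", "2", "@min_value", "ASC",
    "LIMIT", "0", "20",
)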
### ArangoDB ###
ArangoDB was a great candidate until I saw missing data in the results.
It has neat caching, inverted indexes with pre-sorted collections in `view`s, and even full-text search on the collections.
I was able to separate `prices` into collections for each country and then create multiple `view`s to include all possible sort options.
The pre-sorted `view` on `price.value` was huge because I was able to always get the lowest price offer just by doing a `FIRST()` in the subquery (AQL syntax similar to `LIMIT 1` in SQL).
The queries were very fast once the results got cached.
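The `FIRST()` subquery pattern mentioned above might look roughly like this in AQL, run through python-arango (collection names, the join attribute and the credentials are assumptions):
# Sketch: per item, pick the cheapest offer from one country's prices collection.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "shop", username="root", password="secret"
)

aql = """
FOR i IN items
    LET best = FIRST(
        FOR p IN prices_us
            FILTER p.item_key == i._key
            SORT p.value ASC
            LIMIT 1
            RETURN p
    )
    FILTER best != null
    SORT best.value ASC
    LIMIT 0, 20
    RETURN { item: i, best: best }
"""
results = list(db.aql.execute(aql))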
The main issue I had with ArangoDB was getting `null`s in the results array while I was doing `UPDATE`s* on the collections.
Examples...
Query without update:
[{"id":"348","t":"Item 348"}, {"id":"112","t":"Item 112"}, {"id":"225","t":"Item 225"}]
Query when update starts:
[null,null,null]
Mid-update query:
[null,null,{"id":"225","t":"Item 225"}]
I thought it was an issue because I would first run a subquery to get the candidate document `_key`s and then run the main query with those `_key`s, and if there was an `UPDATE`, the document might not exist during that short time span. But because an `UPDATE` should not first `DELETE` and then `INSERT` the document (the way `REPLACE` usually works), I doubt this is the case. I couldn't get the answers from the support.
This issue made it unreliable because the user would get `null`s as the results, which would also break my frontend app.
*ArangoDB has the `ImportDocuments` function to import multiple documents at once, with an option to `UPDATE` on duplicate.
---
### Note ###
I have also considered ElasticSearch, but I would not be able to afford it if the project grows and needs any kind of scaling.
Delicious Bacon
(99 rep)
Aug 26, 2022, 10:23 AM
• Last activity: Aug 26, 2022, 02:19 PM
3
votes
1
answers
113
views
Do any document databases (noSQL) allow you to do cross database joins?
I am currently looking at MongoDB and ArangoDB as the database for our web application, but neither appears to support cross-database joins. Does anyone know of a document database that supports this?
Anupheaus
(131 rep)
Mar 6, 2017, 08:02 AM
• Last activity: Apr 5, 2021, 09:03 PM
0
votes
1
answers
167
views
Many-to-many database
I am thinking about making a music database. Something where you can search by
artist, album or song. This seems to be a Many-to-many model, as an album can
have multiple songs, and the same song can be on different albums.
Currently I am just using some JSON files, but this isn't really robust, as it
can only represent hierarchical data.
This would just be a personal project, so I don't need anything too involved. I
have looked at SQLite, but also at document-oriented databases. I like the idea of
being able to persist the data as JSON or CSV files. What is a good option for
my use case?
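For what it's worth, the many-to-many part is just a junction table; a minimal SQLite sketch (table and column names chosen for the example):
# Minimal sketch of the album/song many-to-many model in SQLite.
import sqlite3

conn = sqlite3.connect("music.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS artist (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS album  (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                                   artist_id INTEGER REFERENCES artist(id));
CREATE TABLE IF NOT EXISTS song   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- junction table: an album has many songs, and a song can be on many albums
CREATE TABLE IF NOT EXISTS album_song (
    album_id INTEGER REFERENCES album(id),
    song_id  INTEGER REFERENCES song(id),
    track_no INTEGER,
    PRIMARY KEY (album_id, song_id)
);
""")

# All albums that contain a given song:
rows = conn.execute("""
    SELECT album.title
    FROM album
    JOIN album_song ON album_song.album_id = album.id
    JOIN song ON song.id = album_song.song_id
    WHERE song.title = ?
""", ("Some Song",)).fetchall()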
---
Response to some questions:
> Do you want a free solution or need to use an enterprise level system like
> Microsoft SQL Server which costs money.
Free options only
> How far do you expect to scale your database?...in other words, in the
> foreseeable future roughly how big do you expect it to get (data-wise), how
> many users do you expect will be using it both overall and at a given instance
> of time, how often will data change inside of it?
Most likely it would be a tool that only has one user at a time. So either I
would use it myself, or someone might download my tool and use it for
themselves.
> What does your current technology stack look like?
I would prefer something with one of these if possible: C#, D, Go, Nim or Rust.
Although I have seen many options in C and C++, so I might have to end up with
one of those.
> I don't think any modern solution out there persists the underlying data of
> the database in JSON or CSV files (though I could be mistaken).
I think document-oriented databases offer this, but I could be
misunderstanding it. Also, I had seen an export-to-CSV feature for SQLite.
Zombo
(1 rep)
Dec 10, 2020, 10:17 PM
• Last activity: Dec 16, 2020, 02:36 PM
31
votes
2
answers
24007
views
Using MongoDB and PostgreSQL together
My current project is essentially a run of the mill document management system.
That said, there are some wrinkles (surprise, surprise). While some of
the wrinkles are fairly specific to the project, I believe there are
some general observations and questions that have come up which don't
have a canonical answer (that I could find, anyway) and that are
applicable to a wider problem domain. There's a lot here and I'm not
sure it's a good fit for the StackExchange Q&A format, but I think it is a) an answerable question and b) non-specific enough that it can benefit the community. Some of my considerations are specific to me, but I think the question could be of use to anyone faced with deciding on SQL vs NoSQL vs both.
The background:
---------------
The web app we are building contains data that is clearly relational in
nature as well as data that is document-oriented. We would like to
have our cake and eat it too.
TL;DR: I think #5 below passes the smell test. Do you? Does anyone
have experience with such an integration of SQL and NOSQL in a single
application? I tried to list all the possible approaches to this class
of problem below. Have I missed a promising alternative?
Complexities:
-------------
- There are many different classes of documents. The requirements
already call for dozens of different documents. This number will
only ever go up. The best possible case would be one in which we
could leverage a simple domain specific language, code generation
and a flexible schema so that domain experts could handle the
addition of new document classes without the intervention of DBAs or
programmers. (Note: already aware we are living out [Greenspun's
Tenth Rule](http://c2.com/cgi/wiki?GreenspunsTenthRuleOfProgramming))
- The integrity of previous successful writes is a central requirement
of the project. The data will be business critical. Full ACID
semantics on writes can be sacrificed provided that the things that
do get successfully written stay written.
- The documents are themselves complex. The prototype document in our
specific case will require storage of 150+ distinct pieces of data
per document instance. The pathological case could be an order of
magnitude worse, but certainly not two.
- A single class of documents is a moving target subject to updates at
a later point in time.
- We like the free stuff we get from Django when we hook it into a
relational database. We would like to keep the freebies without
having to jump back two Django versions to use the django-nonrel
fork. Dumping the ORM entirely is preferable to downgrading to 1.3.
Essentially, it's a mishmash of relational data (your typical web app
stuff like users, groups, etc., as well as document metadata that
we'll need to be able to slice and dice with complex queries in
realtime) and document data (e.g. the hundreds of fields which we have
no interest in joining on or querying by - our only use case for the
data will be for showing the single document into which it was
entered).
I wanted to do a sanity check (if you check my posting history, I'm
pretty explicit about the fact that I am not a DBA) on my preferred
method as well as enumerate all of the options I've come across for
others solving broadly similar problems involving both relational and
non-relational data.
Proposed Solutions:
-------------------
**1. One table per document class**
Each document class gets its own table, with columns for all metadata and data.
Advantages:
- The standard SQL data model is in play.
- Relational data is handled in the best possible way. We'll
denormalize later if we need to.
- Django's built-in admin interface is comfortable with introspecting
these tables and the ORM can live happily with 100% the data out of
the box.
Disadvantages:
- Maintenance nightmare. Dozens (hundreds?) of tables with (tens of?)
thousands of columns.
- Application-level logic responsible for deciding exactly which table
to write to. Making the table name a parameter for a query stinks.
- Basically all business logic changes will require schema changes.
- Pathological cases might require striping data for single forms
across multiple tables (see: What is the maximum number of columns
in a PostgreSQL table?).
- We would probably need to go find a real, honest-to-God DBA who
would no doubt end up hating life and us.
**2. EAV modeling**
There is just a fields table. Entity-Attribute-Value modeling is already
well understood. I've included it for completeness. I don't think any
new project being started in 2013 would go with an EAV approach on
purpose.
Advantages:
- Easy to model.
Disadvantages:
- More difficult to query.
- DB layer no longer has a straight-forward representation for what
constitutes one app-level object.
- We would lose DB-level constraint checking.
- Number of rows on one table will grow 100s-1000s of times
faster. Likely future pain point, performance-wise.
- Limited indexing possible.
- DB schema is nonsensical as far as ORM is concerned. Batteries
included web app stuff is preserved but custom data models are going
to require custom queries.
**3. Use PostgreSQL hstore or json fields**
Either of these field types would do the trick for storing schemaless
data within the context of a relational DB. The only reason I don't
jump to this solution immediately is it is relatively new (introduced
in version 8.4 so not *that* new), I have zero previous exposure to it
and I am suspicious. It strikes me as wrong for precisely the same
reasons I would feel uneasy throwing all my nice, easily normalized
data into Mongo - even though Mongo can handle references between
documents.
Advantages:
- We get the benefits of the Django ORM and the built-in auth and
session management.
- Everything stays in one backend that we've previously used on other
projects successfully.
Disadvantages:
- No experience with this, personally.
- It doesn't look like a very highly used feature. It looks like they
get recommended quite a bit to people looking at NOSQL solutions but
I don't see a lot of evidence that they are being chosen. This makes
me think I must be missing something.
- All values stored are strings. Lose DB-level constraint checking.
- The data in the hstore will never be displayed to the user unless
they specifically view a document, but the metadata stored in more
standard columns will be. We will be beating that metadata
up and I worry the rather large hstores we will be creating might
come with performance drawbacks.
**4. Go full bore document-oriented**
Make all the things documents (in the MongoDB sense). Create a single collection of type `Document` and call it a day. Bring all peripheral
data (including data on user accounts, groups, etc) into mongo as
well. This solution is obviously better than EAV modeling but it feels
wrong to me for the same reason #3 felt wrong - they both feel like
using your hammer as a screwdriver too.
Advantages:
- No need to model data upfront. Have one collection with documents of type `Document` and call it a day.
- Known good scaling characteristics, should the collection need to
grow to encompass millions or even billions of documents.
- JSON format (BSON) is intuitive for developers.
- As I understand it (which is only vaguely at this point), by being
paranoid with regard to write-concern level even a single instance
can provide pretty strong data safety in the event of anything and
everything up to a hard drive crash.
Disadvantages:
- The ORM is out the window for Django trunk. Freebies that go out the
window with it: the auth framework, the sessions framework, the
admin interface, surely many other things.
- Must either use mongo's referencing capabilities (which require
multiple queries) or denormalize data. Not only do we lose freebies
that we got from Django, we also lose freebies like JOINs we took
for granted in PostgreSQL.
- Data safety. When one reads about MongoDB, it seems there is always
at least one person referring to how it will up and lose your
data. They never cite a particular occurrence and it might all just
be hogwash or just related to the old default fire and forget
write-concern but it still worries me. We will of course be
utilizing a fairly paranoid backup strategy in any case (if data is
corrupted silently that could well be immaterial of course..).
**5. PostgreSQL and MongoDB**
Relational data goes in the relational database and document data goes
in the document-oriented database. The `documents` table on the relational database contains all of the data we might need to index or
slice and dice on as well as a MongoDB ObjectId which we would use
when we needed to query for the actual values of the fields on the
documents. We wouldn't be able to use the ORM or the built-in admin
for the values of the documents themselves but that's not that big of
a loss since the whole app is basically an admin interface for the
documents and we would have likely had to customize that specific part
of the ORM to an unacceptable degree to make it work just the way we
need.
Advantages:
- Each backend does only what it is good at.
- References between models are preserved without requiring multiple
queries.
- We get to keep the batteries Django gave us as far as users,
sessions, etc are concerned.
- Only need one `documents` table no matter how many different classes of documents are created.
- The less often queried document data is strongly separated from the
far more often queried metadata.
Disadvantages:
- Retrieving document data will require 2 sequential queries, first
against the SQL DB and then against the MongoDB (though this is no
worse than if the same data had been stored in Mongo and not
denormalized)
- Writing will no longer be atomic. A write against a single Mongo
document is guaranteed to be atomic and PG obviously can make
atomicity guarantees but ensuring atomicity of write across both
will require application logic, no doubt with a performance and
complexity penalty.
- Two backends = two query languages = two different programs with
dissimilar admin requirements = two databases vying for memory.
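For what option #5 would mean in practice, here is a minimal sketch of the two-step read, with the table, column and collection names (`documents`, `mongo_id`, `doc_bodies`) made up for the example:
# Sketch: metadata and references live in PostgreSQL, the bulky field data in MongoDB.
import psycopg2
from bson import ObjectId
from pymongo import MongoClient

pg = psycopg2.connect("dbname=app user=app")
mongo = MongoClient("mongodb://localhost:27017")["app"]

def load_document(doc_pk):
    # First query: the relational side (metadata, slice-and-dice columns).
    with pg.cursor() as cur:
        cur.execute(
            "SELECT title, created_at, mongo_id FROM documents WHERE id = %s",
            (doc_pk,),
        )
        title, created_at, mongo_id = cur.fetchone()
    # Second query: the document body stored in MongoDB under that ObjectId.
    body = mongo.doc_bodies.find_one({"_id": ObjectId(mongo_id)})
    return {"title": title, "created_at": created_at, "fields": body}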
chucksmash
(545 rep)
Mar 27, 2013, 08:14 PM
• Last activity: Nov 14, 2019, 09:44 PM
0
votes
1
answers
803
views
Local Cosmos DB service unavailable
I've installed Azure Cosmos DB on my local machine, but it is not loading the explorer pages, as shown below:
**Explicit error**:
Failed to get authorization headers for offers: {"readyState":4,"responseText":"{\"code\":\"ServiceUnavailable\",\"message\":\"Service is currently unavailable.\\r\\nActivityId: d728958b-c0b4-4b32-a131-bba1dc7a8ba0, \\r\\nResponseTime: 2018-06-05T17:22:33.8740685Z, StoreReadResult:

user141153
Jun 5, 2018, 05:33 PM
• Last activity: Jun 12, 2018, 02:58 AM
1
votes
0
answers
133
views
How do read queries against Postgres's JSONB type compare to Couchbase?
I'm working on a small project which would involve running queries against deeply-nested (up to 5 layers) keys in dictionaries. This would be a specifically read-heavy environment, with only one large set of writes during the app's bootstrapping (nominally once per install). The two options I'm considering for use as database backends right now are PostgreSQL (10.0) and Couchbase (5.0.1).
I'm reasonably familiar with Postgres and its JSONB facilities in terms of performance (although not implementation details), but I'm not familiar enough with Couchbase to draw a concrete comparison, and it's difficult to find this specific comparison in any online benchmarks.
Ignoring issues about the fact that Postgres isn't a document store, could Postgres consistently outperform Couchbase in read queries involving accessing a deeply-nested key/value pair?
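To make the Postgres side of the comparison concrete, a deeply-nested read is typically a path extraction plus a containment filter backed by a GIN index; a sketch, with the table, column and key names invented:
# Sketch: reading a key 5 levels deep from a JSONB column in PostgreSQL 10.
import json
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# A GIN index lets the @> containment filter use an index scan.
cur.execute(
    "CREATE INDEX IF NOT EXISTS docs_data_gin ON docs USING GIN (data jsonb_path_ops)"
)

# Extract a value 5 levels down and filter on it via containment.
needle = json.dumps({"a": {"b": {"c": {"d": {"e": "target"}}}}})
cur.execute(
    "SELECT data #>> %s FROM docs WHERE data @> %s::jsonb",
    (["a", "b", "c", "d", "e"], needle),
)
rows = cur.fetchall()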
Jules
(23 rep)
Feb 25, 2018, 02:50 AM
1
votes
0
answers
139
views
What data store should I use for storing and querying a billion small (~2kb) JSON documents?
I'm trying to figure out which database to use for storing and querying on ~1 billion JSON documents about the size of 2KB each. About 2 TB of data.
Nature of JSON documents:
* 3 types of documents (structure is loosely defined)
* Nested, multi level, sparse
Things that I'm looking for are:
* High throughput
* Secondary indexes support
* Queries over JSON
Nice to haves:
* SQL like query interface
Things I'm looking at:
* Cassandra
* Storing JSON in Postgres
* Riak
* HBase
If there are more details regarding usage or the data that I missed and that would help make the decision, please let me know.
Akshit Khurana
(111 rep)
Jun 7, 2017, 12:40 AM
1
votes
1
answers
266
views
Are most of the NoSQL databases key/value stores technically?
This video distinguishes between three (main) types of NoSQL databases:
1. Key/Value store
2. Tabular
3. Document oriented
Do these types differ in how they technically store their data, or can they all be considered key/value stores, since even the tabular ones (like Cassandra) may have a variable column count and most likely serialize the column values to a string? And the documents (MongoDB) are also only strings in the end... I suppose all types generate a unique id for their "values/data". So are all 3 types key/value at a low level?
Kozuch
(111 rep)
Oct 30, 2016, 08:40 PM
• Last activity: Feb 27, 2017, 11:25 AM
1
votes
1
answers
193
views
The Logical Type of The Document in CouchBase
How should the logical type of a document get specified in CouchBase? Using a field for the type? Or employing separators in keys like `product::app::123id`?
Currently I'm putting the document type inside the document itself, in a string field named Type, with a value like, say, Product.
But I see this pattern of using separators in the document id, like `product::app::123id`. I've done some playing around with it, but couldn't get the type part (product) from the key (of course parsing it by splitting is possible, which to me seems to have the same overhead in both N1QL and views).
So how should the document type (app logic type) get specified?
Env: CouchBase Community, inside Docker on Ubuntu 14.04, using the Go client `gocb`.
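Either convention is cheap to express; a small sketch of the two options (the bucket name `app` and the N1QL shown are assumptions for illustration):
# Option 1: a type field inside the document body.
product = {"type": "Product", "name": "Widget", "price": 9.99}

# Option 2: encode the type in the key and parse it back by splitting.
key = "product::app::123id"
doc_type = key.split("::")[0]   # -> "product"

# Roughly how each option would be filtered in N1QL (hypothetical bucket `app`):
N1QL_BY_FIELD = 'SELECT META(a).id, a.* FROM app a WHERE a.type = "Product"'
N1QL_BY_KEY = 'SELECT META(a).id, a.* FROM app a WHERE SPLIT(META(a).id, "::")[0] = "product"'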
Kaveh Shahbazian
(113 rep)
Feb 7, 2017, 01:41 PM
• Last activity: Feb 7, 2017, 02:22 PM
4
votes
1
answers
397
views
Numerical data with lots of NULLs: 6NF or document database?
I have a source of numerical data mostly comprised of sensor readings, physical quantities and flags. Each associated with a datetime.
| id  | timetag                 | sensor1   | sensor2        | flag1 | ... |
|-----|-------------------------|-----------|----------------|-------|-----|
| 943 | 2016-12-08 10:27:00.000 | 1.2323523 | -23123.5346234 | 0     | ... |
| 944 | 2016-12-08 10:27:00.038 | NULL      | 163.6151345    | 1     | ... |
| 945 | 2016-12-08 10:27:02.000 | 1.2477801 | NULL           | 3     | ... |
...
The data has the following characteristics:
- There are hundreds of fields, and new unexpected fields may pop up at any time. We need to record those too.
- For every row, only a small subset (<10%) of the fields are available.
- Sometimes a field is available but corrupted, i.e. NaN, and we need to record that.
- Data is written once, updated almost never and read relatively often.
- We need to do range queries on the data, both on the date and on the actual numbers.
- We don't want to lose any data.
- It doesn't need to be a distributed system.
---
I'm wondering what the best way to store these data would be.
## Solution 1: One big table
A single huge table would be a bad idea, because it would be full of NULLS and inconvenient. Moreover, it would be impossible to distinguish between "recorded but NaN" NULLs and "not recorded" NULLs.
## Solution 2: 6th normal form
The first solution that comes to mind is to use 6NF, creating one table for each field.
This is clean in theory, but
1. queries would require many many JOINs at the same time;
2. new tables would have to be created on the fly when a new field arrives for the first time;
3. fields that are always paired, like for example longitude and latitude, would have to be automatically detected and their tables merged for efficiency;
4. if at some point one of the columns in a "merged" table (e.g. lon in the lon-lat example above) starts to be obtained independently (without lat), to keep the table in 6NF we would have to detect that and automatically split the table into separate tables.
The above makes for a rather complex solution.
## Solution 3: Document database
I've been considering PostgreSQL with JSONB to mimic a document database.
This would allow us to only store the data we have in each row, but still have a single table.
However, as detailed in my (unanswered) question on Stack Overflow, JSONB doesn't seem appropriate for the numeric operations we need, like range queries.
On the other hand, NoSQL technologies like MongoDB, in which I have no experience, don't guarantee the consistency we are looking for.
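For completeness, the usual JSONB workaround for numeric range queries is to cast the extracted text and back it with an expression index; whether that performs well enough here is exactly the open question, but the mechanics look like this (table, column and key names invented):
# Sketch: numeric range query over a JSONB column by casting the extracted value.
import psycopg2

conn = psycopg2.connect("dbname=sensors user=app")
cur = conn.cursor()

# Expression index so the cast comparison can use an index scan.
cur.execute(
    "CREATE INDEX IF NOT EXISTS readings_sensor1_idx "
    "ON readings (((data ->> 'sensor1')::numeric))"
)

cur.execute(
    "SELECT timetag, (data ->> 'sensor1')::numeric AS sensor1 "
    "FROM readings "
    "WHERE (data ->> 'sensor1')::numeric BETWEEN %s AND %s "
    "AND timetag BETWEEN %s AND %s",
    (1.0, 2.0, "2016-12-08 10:00:00", "2016-12-08 11:00:00"),
)
rows = cur.fetchall()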
## Question
Of the solutions above, which would you favor? Are there any other options I'm missing?
IGRSR
(141 rep)
Dec 8, 2016, 02:10 AM
• Last activity: Dec 9, 2016, 11:36 AM
1
votes
2
answers
853
views
SQL Server vs SOLR (Or any document db)
I have my "customer" data in a normalized sql server database. Getting out the customer data in my app is taking too long. This is because I have to go to 10+ tables to get all the data I need. My company has an installation of SOLR that I thought about storing a Json object that contains all the da...
I have my "customer" data in a normalized sql server database.
Getting out the customer data in my app is taking too long. This is because I have to go to 10+ tables to get all the data I need.
My company has an installation of SOLR that I thought about storing a Json object that contains all the data I need for a single "customer" already put together.
I think that this would give me some significant speed improvements.
However, it got me to wondering what the difference would be to me putting this data in a single table with a varchar(max) column that has the Json in it. I could index my 10ish searchable columns on the same table as the json.
I know that document databases are very popular. **So I imagine there has to be benefits over just rolling my own using denormalized data in sql server. Can someone tell me what they are?**
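The roll-your-own variant you describe is essentially the following sketch (table, column and connection names are placeholders, and the searchable columns are just examples):
# Sketch: one table holding the denormalized customer JSON plus a handful of
# plain, indexed columns used only for searching.
import json
import pyodbc

conn = pyodbc.connect("DSN=myserver;UID=app;PWD=secret")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE CustomerDocument (
        CustomerId   INT           NOT NULL PRIMARY KEY,
        LastName     NVARCHAR(100),
        City         NVARCHAR(100),
        -- ...the other ~10 searchable columns...
        CustomerJson NVARCHAR(MAX) NOT NULL  -- the pre-assembled document
    );
    CREATE INDEX IX_CustomerDocument_LastName ON CustomerDocument (LastName);
""")

customer = {"id": 42, "lastName": "Smith", "city": "Denver", "orders": []}
cur.execute(
    "INSERT INTO CustomerDocument (CustomerId, LastName, City, CustomerJson) "
    "VALUES (?, ?, ?, ?)",
    customer["id"], customer["lastName"], customer["city"], json.dumps(customer),
)
conn.commit()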
Vaccano
(2550 rep)
Apr 7, 2015, 10:10 PM
• Last activity: Feb 8, 2016, 04:08 PM
1
votes
0
answers
66
views
Which Azure Storage / Database Technology to make use of for large scale logging
I am looking into making use of Azure hosting services in order to develop a logging service for our company.
Azure has a number of storage options available (I have listed them below). However, I am not sure which one would be best suited to my needs. I am coming from an MS SQL Server background, so I am probably not as open-minded as I should be, hence my question here is to try and explore other, more suitable options. If anyone knows of any place where I can get design help specifically for Azure, please let me know, as I have not been able to find any.
Requirements:
Background: We are looking to log messages from a variety of types of devices which the company owns. This needs to be on a mass scale: several million entries per day. As much as I have tried to keep the design a fixed structure, due to the variety of device types out there, the messages differ completely and have different elements. I would somehow like to still be able to query message-specific attributes. (How one would do this, I do not know.)
- All messages come into this logging service as A JSON object. The JSON object obviously varies based on the type of device which sent the message and the type of message the device is sending.
- There will be some common attributes such as deviceID, LogTime,customerId etc.
- Due to the above, it seems that a document / object store technology would suit the application better. However, as mentioned, I have not had much exposure to these types of technologies. Will it be able to efficiently handle the load, the queries, etc.?
- We will need to be able to store months of log values without slowing daily queries to a halt.
- Data analytics and big queries should be possible on the data after the system has been running for some time. I'm not sure if one would make use of a different mechanism for this: possibly a sort of storage layer and then a big data layer. However, my storage mechanism needs to keep these future features in mind.
Azure offers the following storage and database services:
1. SQL Database
2. DocumentDB
3. Storage: Blobs, Tables, Queues, Files and Disks
4. StorSimple hybrid cloud storage
5. SQL Data Warehouse
I am sorry, I know this question is quite general and possibly open-ended. I am not looking for anyone to just hand me the complete solution. I am looking for some guidance in the right direction and experience from people who have already gone down this road before.
Zapnologica
(779 rep)
Oct 29, 2015, 07:43 AM
-1
votes
1
answers
639
views
document and edges in orientdb
I was playing around with OrientDB lately and created some classes and edges. I was able to create edges between documents; is this something that is expected from OrientDB, given that edges are relationships between vertices?
Besides that, how do I manage data integrity when I delete a document from the DB? Apparently the edges still point to an unknown record id that has already been deleted.
Thanks
eugene
(101 rep)
Mar 26, 2015, 02:20 AM
• Last activity: Oct 24, 2015, 07:42 AM
2
votes
0
answers
544
views
OrientDB distributed architecture
Technically this question could relate to any database that is distributed in a similar manner, but given that I am specifically working with OrientDB, I thought the question should be targeted as such.
For want of better terminology, the **conventional** approach (for me, in any case) to a distributed database is very similar to the architecture supporting the logic layer: a load balancer to which all the queries are directed, with an auto-scaling group behind the LB that uses some metrics to spin up and destroy database servers on demand. As servers are dynamically added and removed, Orient uses its multicast to synchronise the instances. Something like this:
It then struck me that this model could exist in a different form where each app instance has its own DB instance. The elegance is that you still get the same redundancy, but you don't have the cost of the extra DB instances and you don't have the LB failure point. I know that the ratio of DB instances to app instances doesn't necessarily need to be 1:1, but there is no harm in it. It also means that the code for the DB can be managed inside the app codebase, as each new app instance that spins up configures its own DB and then connects up to the others to sync, using some ENV VARS passed in at build time. It is probably sensible to have a single fallback instance outside the ASG.
Clearly both models have advantages and disadvantages. I also appreciate that there would be some specifics to iron out, but are there any significant problems with this concept? Is there perhaps a better architecture that already exists?


tarka
(121 rep)
Jun 14, 2015, 10:33 AM
0
votes
2
answers
1689
views
No suitable driver found Orientdb jdbc driver
I have some problems with the OrientDB driver for JDBC.
I'm using NetBeans. I have included the JDBC+OrientDB libraries (orientdb-jdbc-2.0.8-all.jar) in the directory and in NetBeans, and I have also included the connector in the project (I have included the connector in the classpath),
but I still have the same problem:
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:orient:remote:localhost/GratefulDeadConcerts
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at Aconexion.AConexion.main(AConexion.java:15)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
my code:
package Aconexion;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;
import com.orientechnologies.orient.jdbc.OrientJdbcConnection;

public class AConexion {
    public static void main(String[] args) throws SQLException {
        Properties info = new Properties();
        info.put("user", "admin");
        info.put("password", "admin");

        Connection conn = (OrientJdbcConnection)
            DriverManager.getConnection("jdbc:orient:remote:localhost/GratefulDeadConcerts", info);

        try {
            Class.forName("com.orientechnologies.orient.jdbc.OrientJdbcDriver");
            Connection connection = (OrientJdbcConnection)
                DriverManager.getConnection("jdbc:orient:remote:localhost/GratefulDeadConcerts", info);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}

LaraCraft84
(3 rep)
May 11, 2015, 05:45 PM
• Last activity: May 31, 2015, 12:52 PM
5
votes
3
answers
1524
views
Replace relational database with "poor" design in software product with non-relational DB?
***Edit**: This question is about how to deal with many issues that arose from an overall system design, which made parts of the system deviate from common standards. For example, managing everything in the business model with its own program code, even down to relational integrity. This gives the database and persistence layer a smell of bad design, by using it as a place to "dump something in and get it out again somehow", rather than as structured storage. I asked the question because NoSQL document storages seem to me like an option to move an already schema-less (or very loose-schema) database to something without a schema by default. Also, I must note that the whole system is not a bad one at all, despite some of the flaws described here. Also, some of the problems, such as versioning, have solutions on the way or already implemented.*
Think of a software system you look at, based on classic relational databases (SQL Server, Oracle), NHibernate as object-relational mapper (ORM), a business logic model layer on top and a huge number of modules (a few hundred), mainly .NET based services and a few web services (with clients, max. ~100 per system/customer, company network, non-public). The operation style is mainly OLTP, with write/CUD access being an important part of the workload. Productive databases are usually some 10 GB, but always well below 100 GB in size (thus no "big data"). It does its work well, but to me, the database and ORM implementation smell of several anti-patterns (for relational databases). Maybe these implementations could be better with another kind of database: document oriented ("NoSQL"), or an in-memory database.
1. Many relational database and supporting ORM features omitted: tables are strongly denormalized, foreign key relations are missing or impossible, e.g. due to metadata tables referencing different main tables, with columns like `IdInTable INT, OwnerTable INT`. NHibernate has mostly no object relations mapped (and generally has problems with table structures it is not made for). Instead, these are implemented in business logic (sometimes leading to orphaned child objects or inefficient database access, see below).
2. Denormalization down below the basics: increasing use of non-1NF data: nclob/nvarchar(max) columns with XML, comma-separated lists, or composite numeric value columns (e.g. 123, 10123, 40123 for task type 123, but different module configs identified by 0, 1, 4 * 10000). The first two contain database-relevant, logical "foreign keys" and data-model-relevant values, such as `AdminUser` (to be checked with `LIKE '%...%'`). This is mainly due to many quick-to-release, short-lived and customized values which shouldn't go into the main schema or are easier to implement through XML values.
3. Non-2nd NF data, including table contents being copied by triggers, followup stored procedures or applications into other tables. E.g., a table column value copied to a "vertical" metadata table, this again copied to a "horizontal" or "pivoted" representation of the metadata (each metadata type a column), because some applications can only use the metadata or horizontal metadata. Frequent requests to use "rubbish bin structures" (dump collected data from various sources into one nclob/nvarchar(max) "rubbish bin" column and let an application search through it, instead of many different sources).
4. The "One-Object-Disease" in business logic model and applications: Iterating and immediate load/save of single objects: The business layer uses mainly Load/Save() methods for individual objects and few bulk/set based operations. A common job is to get object IDs by SQL or it's NHibernate representations, then iterate over all retrieved Ids and fetch the objects one by one in the style of foreach (oneId in Ids) { myObjects.Add( BizModel.GetMyObjectById(oneId) ); }
. This with all metadata, dependent objects collections etc., a classic SELECT N+1 situation. Furthermore, most of NHibernate's caching, persistence ignorance and combined operations have been forcilby disabled: Loading one object explicitly calls SELECT FROM MyObject WHERE Id=:id
to prevent using the cache or deferred execution, but get a fresh object from the current DB row. MyObject.Save()
is implemented to enforce immediate Insert/Update: NHibernate session.Save(...)
followed by immediate .Flush()
. The whole thing uses NHibernate micro-sessions: loaded objects are immediately taken out of session context and saved within a new session (preventing those "weird", undesired changes of unsaved objects in DB). Persistence ignorance and object relations through NHibernate seem undesirable, to keep control of the state of every single object. NHibernate is really considered a mapper (one row to one object) rather than a complex tool for relational database access. There's also debate about using a "fast" micro-ORM instead of NHibernate, which will materialize SELECT N+1 queries lightning fast into objects, but of course do nothing against N+1 itself.
5. An important requirement is to get everything working with everything, because it is too much to release all modules for every single change: a new module must work with an old database version, where certain columns and tables don't exist, and an old module must still work with a new DB version, with columns etc. added. This leads to new columns having default values if not nullable, and old tables/columns, abandoned for a long time, still being in the data model, because removal might crash old modules. Another consequence is reluctance to add new tables/columns, which you can hardly get rid of once released. Instead, XML (in text columns) and similar, denormalized stuff, or property values in the global metadata table, are preferred.
6. Many modules receive tasks just for single objects, which would be no problem within a possible set-based approach, since a set/bulk data access method could also handle a single object/row if needed. On the other hand, there are the web servers, maintenance and background services, which handle many objects at once, need the business logic and run very inefficiently in the current single-object way (the webservice using native SQL, or, newly, a Lucene based search engine to search the IDs of desired objects, but retrieving the full model objects one by one).
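As referenced in point 4, here is a minimal sketch of the per-object loading pattern next to its set-based equivalent (plain DB-API style; table and column names are invented):
# Sketch: SELECT N+1 versus one set-based query. `conn` is any DB-API connection.
def load_one_by_one(conn, ids):
    objects = []
    for one_id in ids:                       # N round trips, one per object
        cur = conn.cursor()
        cur.execute("SELECT Id, Name, State FROM MyObject WHERE Id = %s", (one_id,))
        objects.append(cur.fetchone())
    return objects

def load_set_based(conn, ids):
    cur = conn.cursor()                      # one round trip for the whole set
    placeholders = ", ".join(["%s"] * len(ids))
    cur.execute(
        "SELECT Id, Name, State FROM MyObject WHERE Id IN (%s)" % placeholders,
        list(ids),
    )
    return cur.fetchall()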
Imagine you've tried to change this. In the beginning, you didn't know NHibernate and how it works, but then you came up with ideas for how to adapt data access to its real abilities and avoid unnecessary database operations: map relations in NHibernate, keep sessions and transactions open for several object operations, do set/bulk operations, normalize the DB the way you learned years ago, add foreign keys, views, maybe materialized views to it. But you keep being rejected, with arguments like: *"nobody is going to pay for it"*, *"the database can handle it, no matter how 'bad' the application is"*, and simply *"it works"*. Disk space, memory, CPU power and network resources are cheap; refactoring data access would be much more expensive. Likely, standing by the code programmer's object-oriented approach, rather than the DB programmer's set-based approach, is preferred (including its enforcement against the ORM implementation). What does it matter if the system could be 10 or even 100 times faster, if it works sufficiently well in the current way? Don't care about SELECT N+1 anyway, today's databases can handle it! That would only be gold plating! It might become different when databases grow into terabytes, but that's not for now.
So, maybe, there is a solution in the "NoSQL" or "NewSQL" area. It might be possible to fetch objects from and store them in a database in a fast and efficient way, even with many queries in a single-object rather than set-based approach, as long as it is a local DB without long-distance latency. It looks like the current system uses the relational database just as an extended, persistent main memory, and all those "stone age relics" of IT, like creating and maintaining tables and indices manually, or mapping objects to relational tables, just add a huge overhead.
My idea is:
A "NoSQL" document database is a good thing, because:
- the document mostly contains the whole object graph with dependent items, metadata and everything belonging to it, so it doesn't require additional DB queries, thus avoiding or greatly reducing the SELECT N+1 problem.
- within a document, there's implicit "relational integrity" through dependent objects being contained by their parents (nested in XML or JSON representation).
- over multiple different documents, there are no relations in database, these are solely maintained by the business logic (as it is done now frequently, but wrong for a classic relational DB design).
- it usually has no fixed schema, so it is much easier to work with changing data structures. Objects can ignore properties added later, or fill missing values from older-version data with defaults.
- Subsequent data with external/variable/no schema can be integrated without break (as opposed to storing XML in relational text columns).
- many document DB have auto indexing or search engines integrated.
A minimum of automated data integrity, especially multi-object transactions, is still required.
An in-memory relational database, or any database focused on fast access without the need to hit slow hard drives with every (write) operation, would help with speed, but it would still basically rely on a hard relational schema, which has so far been largely omitted and seems to be undesirable to the stakeholders.
Can anybody with experience tell me if my assumptions are correct?
Erik Hart
(510 rep)
Sep 21, 2014, 09:50 AM
• Last activity: Oct 3, 2014, 11:47 PM
1
votes
0
answers
191
views
Can XML columns in Oracle/MS SQL Server replace document oriented databases to some extent?
Just a strategic question: I am currently working with frequently changing data schemas and it takes a lot to keep track of versions and specific content.
I do, however, have a classic OLTP application, which requires ACID, transactions, joins, single-point responsibility; plus, the DB load seems not to be extreme, so a document-oriented ("NoSQL") database does not seem to be a good solution.
I currently have numerous objects/tables with XML documents in clob/varchar(max) columns, the database being completely ignorant of their content. Also, many don't contain "real" XML, but simply a collection of XML elements with content. In fact, I use the XML columns much like noSQL documents.
As I have read, both Oracle and SQL Server have a kind of XML column which is not bound to a schema, effectively stored in clob/blob style, but the database can have indices on certain content, such as an XPath to a certain XML element which frequently occurs and contains important data. Much like document databases can index documents by content which not every document may even contain. They can even handle simple XML element collections, which are not true XML documents, if I understood correctly.
Looks like I can use some features of document-oriented DBs through XML, living without a fixed schema, but still making parts of the XML data structure visible to the DBMS and its clients without parsing the XML on the client side. For example, I can *SELECT* all *OrderItems* which have a 2-year warranty stored somewhere in the XML, without a badly performing and possibly wrong "*where Terms LIKE '%2%'*".
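In SQL Server terms, that kind of query would typically use the xml type's `exist()` method (optionally backed by an XML index) instead of a `LIKE` over the clob; a hedged sketch with made-up table, column and element names:
# Sketch: querying inside an untyped XML column in SQL Server from Python.
import pyodbc

conn = pyodbc.connect("DSN=myserver;UID=app;PWD=secret")
cur = conn.cursor()

# All OrderItems whose Terms XML records a 2-year warranty, no LIKE '%2%' needed.
cur.execute("""
    SELECT Id
    FROM OrderItems
    WHERE Terms.exist('/Terms/Warranty[Years = 2]') = 1
""")
rows = cur.fetchall()

# An XML index can back such path queries:
# CREATE PRIMARY XML INDEX PXML_OrderItems_Terms ON OrderItems (Terms);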
Any experiences with this?
BTW, the XML is mostly read-only, few costly updates, most variable data is still in classic relational columns.
Erik Hart
(510 rep)
Oct 30, 2013, 02:05 PM
2
votes
1
answers
110
views
How to handle "or" possibilities in MongoDB
I'm new to MongoDB and by no means an expert in DB design in general. This feels like a really basic question that's probably been answered a million times, but I'm having a surprisingly hard time finding an answer for it: is there a good way to easily handle either/or choices in the DB, or structure it in a way that makes it easy to deal with in code?
Let's say I'm building a cocktail database in MongoDB. I'm adding the entry for a Manhattan.
- 2 oz. **Bourbon or Rye Whiskey** (this is the issue)
- 1 oz. Sweet Vermouth
- Dash Aromatic Bitters
- Garnish Maraschino Cherry
So I might do a cocktails collection with an entry like:
{
"_id" : "1234",
"name" : "Manhattan",
"ingredients" : [
{
"measure" : "2 oz.",
"ingredient" : "Bourbon Whiskey"
},
{
"measure" : "1 oz.",
"ingredient" : "Sweet Vermouth"
},
{
"measure" : "Dash",
"ingredient" : "Aromatic Bitters"
},
{
"measure" : "Garnish",
"ingredient" : "Maraschino Cherry"
}
]
}
Which is fine, but how do I get the rye in there? I don't think I'd want to do "ingredient" : "Bourbon or Rye Whiskey", would I? Is it better for the purpose of searching, later, to have them separated out? What's the best practice, here?
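If the alternatives are separated out, e.g. as an array of acceptable ingredients, searching stays simple because MongoDB matches array elements transparently; a pymongo sketch following the structure above:
# Sketch: store the either/or choice as an array and query it directly.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["bar"]

db.cocktails.insert_one({
    "name": "Manhattan",
    "ingredients": [
        {"measure": "2 oz.", "ingredient": ["Bourbon Whiskey", "Rye Whiskey"]},
        {"measure": "1 oz.", "ingredient": "Sweet Vermouth"},
        {"measure": "Dash", "ingredient": "Aromatic Bitters"},
        {"measure": "Garnish", "ingredient": "Maraschino Cherry"},
    ],
})

# Matches the Manhattan whether the user searches for bourbon or for rye:
rye_drinks = list(db.cocktails.find({"ingredients.ingredient": "Rye Whiskey"}))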
Also, an even more tricky drink would be something like a Martini, where I would actually use a different garnish for vodka than I would for gin, so I'd want to give the user the option to choose the base liquor they're working with and then give the proper garnish accordingly.
Any suggestions? Thanks!
Christopher Buecheler
(23 rep)
Jul 25, 2013, 03:33 PM
• Last activity: Jul 27, 2013, 02:01 AM
0
votes
1
answers
293
views
RavenDB - Disaster Recovery on a Replicated Instance
**EDIT:** I'm encountering precisely the same issue.
- Assume that I have two RavenDB instances.
- Instance 1 contains one document: users/1.
- Instance 2 contains three documents: users/2, users/3, and users/4.
_
- Assume that I configure and enable replication for both instances.
- That is, Instance 1 will replicate to Instance 2 and Instance 2 will replicate to Instance 1.
- After replication, both instances will have all four documents.
- But both instances will agree that users/1's `Raven-Replication-Source` is Instance 1, and users/2, users/3, and users/4 will have their `Raven-Replication-Source` set to Instance 2.
_
- Assume that Instance 1 suffers a catastrophic failure and I need to reinstall RavenDB.
_
- My testing shows that in this scenario: users/2, users/3, and users/4 will again be replicated to Instance 1.
- *However*, users/1 will **not** be replicated to Instance 1 because its `Raven-Replication-Source` **is** Instance 1.
----------
My Question:
- How can I force Instance 2 to replicate users/1 back to Instance 1, even though users/1 was originally created on Instance 1 and its `Raven-Replication-Source` **is** Instance 1?
Jim G.
(123 rep)
May 7, 2013, 08:18 PM
• Last activity: May 17, 2013, 05:01 PM
Showing page 1 of 20 total questions