
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

0 votes
1 answer
1365 views
How to get current date and time in Elasticsearch
I need to review the current date and time zone from Elasticsearch. I'm checking the Elastic documentation, and it mentions that the default value is UTC. In other environments I use:
SELECT NOW();
Is there any similar function for Elasticsearch?
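For reference: the Elasticsearch query DSL has no NOW() function, but the SQL endpoint (shipped with the default distribution since 6.3) does accept it. A minimal sketch, assuming a node listening on localhost:9200:

# returns the coordinating node's current timestamp (UTC by default)
curl -X POST "localhost:9200/_sql?format=txt" -H 'Content-Type: application/json' -d '
{ "query": "SELECT NOW()" }'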
Carolina (47 rep)
May 10, 2023, 07:22 PM • Last activity: Aug 4, 2025, 09:03 PM
0 votes
1 answer
263 views
Load data from MySQL into Elasticsearch with Logstash
I'm using Logstash to load my MySQL database into Elasticsearch. My config is the following:
input {
    jdbc {
        jdbc_connection_string => "jdbc:mysql://[ip]:3306/nextline_dev"
        jdbc_user => "[user]"
        jdbc_password => "[pass]"
        #schedule => "* * * * *"
        #jdbc_validate_connection => true
        jdbc_driver_library => "/path/mysql-connector-java-6.0.5.jar"
        jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
        statement => "SELECT * FROM Account"
    }
}
output {
    elasticsearch {
        index => "account"
        document_id => "%{id}"
        hosts => ["127.0.0.1:9200"]
    }
}
But I have some questions: I want to schedule more than one query, but the index will always be account. Can I make the index in the Elasticsearch output dynamic? And how can I use more than one statement (to export more than one table)?
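One common pattern (a sketch, not a verified config; the Invoice table is hypothetical) is to give each jdbc input a type and let the output interpolate it into the index name:

input {
    jdbc {
        # ... same connection settings as above ...
        statement => "SELECT * FROM Account"
        type => "account"
    }
    jdbc {
        # ... same connection settings as above ...
        statement => "SELECT * FROM Invoice"
        type => "invoice"
    }
}
output {
    elasticsearch {
        index => "%{type}"            # one index per input
        document_id => "%{id}"
        hosts => ["127.0.0.1:9200"]
    }
}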
BlueSeph (121 rep)
Mar 15, 2019, 08:29 PM • Last activity: May 20, 2025, 03:02 PM
0 votes
0 answers
9 views
reset elasticsearch node
I know you can delete the data directory on a Percona cluster server and then let it rejoin the cluster. This results in a full transfer of data, which has been useful a couple of times. Is the same possible in an Elasticsearch cluster? I.e. shut down one Elasticsearch node; if the cluster health is still yellow, is it safe to delete the data in that one node (../elasticsearch/indices/*) and then rejoin the cluster as a 'new' node?
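Note that yellow health means some shards have no live replica, so wiping a node's data directory at that point can destroy the only copy of a shard. A hedged sketch of the usual pre-checks ("node-to-reset" is a placeholder name):

# green means every shard has its full set of copies
curl -s "localhost:9200/_cluster/health?pretty"

# optionally drain shards off the node before touching its data directory
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{ "transient": { "cluster.routing.allocation.exclude._name": "node-to-reset" } }'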
pulven (1 rep)
Mar 14, 2025, 09:23 AM • Last activity: Mar 14, 2025, 09:58 AM
0 votes
1 answer
45 views
Database / Search-Index recommendation: Match time ranges from different categories
* 300k+ videos
* 10+ million markers, pointing to time ranges in videos
{
  "markerCategory": "something",
  "markerDesc": "something-more-specific",
  "frameIn": 120001,
  "frameOut": 140002
},
{
  "markerCategory": "something-else",
  "markerDesc": "something-else-more-specific",
  "frameIn": 130001,
  "frameOut": 135002
}
Any suggestions which database / search index would perform best when searching for something along these lines:

> Videos having events of category A AND category B in overlapping time ranges, sorted by amount of covered time

Videos are currently exported from a proprietary relational database and stored in an Apache Solr instance for searching.

* Is there a specific name for this kind of query ("inverted range queries" or something like that)?
* Any suggestions which technology would perform best for these types of queries? I was thinking maybe Elasticsearch?
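In Elasticsearch terms, each marker could be indexed with a range field, which supports overlap ("intersects") semantics natively; the cross-category overlap join itself would still need application-side work. A sketch under those assumptions (index and field names invented):

curl -X PUT "localhost:9200/markers" -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "videoId":        { "type": "keyword" },
      "markerCategory": { "type": "keyword" },
      "frames":         { "type": "integer_range" }
    }
  }
}'

# find category-A markers whose frame range overlaps a given window
curl -X GET "localhost:9200/markers/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "markerCategory": "A" } },
        { "range": { "frames": { "gte": 120001, "lte": 140002, "relation": "intersects" } } }
      ]
    }
  }
}'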
gherkins (103 rep)
Mar 6, 2025, 06:45 AM • Last activity: Mar 6, 2025, 01:08 PM
0 votes
0 answers
26 views
Is there a Scalable Solution in Elasticsearch for Handling Frequent Event Updates in Large-Scale Systems?
I’m working on a large-scale event storage system where a significant percentage (50-75%) of stored events need to be updated over time. These events include fields such as startTimestamp, endTimestamp, and connectionStatus, which may change as the system processes more data.

From my research, I understand that updating documents in Elasticsearch can be inefficient, as each update triggers an insert and a delete operation in the background, leading to reindexing. To work around this, I’ve considered an alternative strategy: instead of updating documents directly, I fetch the current version of an event, merge it with the updated data in the application layer, and then insert a new version of the event with a unique document ID. This way, the event history is preserved without triggering Elasticsearch’s costly reindexing mechanism. To retrieve the latest version of an event, I plan to use the collapse query on the eventId field, sorting by the version to fetch only the most recent version.

Here’s a simplified version of the strategy I’m considering:

1. **Fetch the current version** of the event using eventId.
2. **Merge the updated fields** with the existing event data at the application level.
3. **Insert the new version** of the event with the same eventId but a new unique document ID and an updated version.
4. **Use collapse queries** to retrieve only the latest version of the event, based on version.

My questions are:

- **Will this approach handle high volumes of data efficiently?** For context, the system processes thousands of writes per second, and each event may be updated multiple times over its lifecycle.
- **Does the collapse feature perform well when querying millions of documents?** How does it scale when querying across large indices with a mix of new and old events?
- **Is there a better alternative** for handling frequent updates in Elasticsearch at this scale that avoids the inefficiencies of the update and delete mechanism?

Additionally, I'm concerned about whether updates are really as bad as they seem in Elasticsearch for my use case. Since 50-75% of events will require updates, is my approach of creating new document versions justified, or could using Elasticsearch’s standard update operations work well enough without impacting performance too much? Any advice or insights from those with experience in large-scale Elasticsearch systems would be greatly appreciated!
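For the collapse step, the query shape would look roughly like this (index name and field names are taken from the question; eventId must be a keyword or numeric field). Collapse returns the top hit of each group according to the sort, so sorting by version descending yields the newest version per eventId:

curl -X GET "localhost:9200/events/_search" -H 'Content-Type: application/json' -d '
{
  "collapse": { "field": "eventId" },
  "sort":     [ { "version": "desc" } ]
}'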
Chillax (131 rep)
Oct 16, 2024, 11:16 AM
1 vote
0 answers
198 views
How to use things like Elasticsearch, Meilisearch, etc. alongside a main database (e.g. PostgreSQL)?
I could be missing something obvious because this seems like a fairly basic question, but I haven't been able to find any explicit guidance anywhere on how to integrate Meilisearch/Elasticsearch with a "main" database (say Postgres), and what the best practices for this sort of setup are.

For context: I'm building an online video course platform (a la Udemy), and on the server side of things, I have a GraphQL API with a searchInCourses query field that takes in a user-provided search query and is supposed to return the relevant results. I've been looking into things like Meilisearch, but I'm not quite sure about the right workflow and how it would best fit into our system.

Would something like this make sense, for example? I create a courses index on my Meilisearch DB; each document in the index would contain the ID of the course (the same as the ID of that course in the main DB) plus a flat set of denormalized fields that are most relevant to searching (e.g. title, instructor_name, etc.). Every time a request comes in for the searchCourses query field, I first query the Meilisearch DB with the user-provided query, which sends back the IDs of the matching courses, and I then send another query to my main (PostgreSQL) database to retrieve the actual information about the matching courses: the SQL query would end with WHERE c.id = ANY (@ids), where @ids is the IDs of the courses returned by Meilisearch.

Is this a standard, sane way of doing things? Or am I just totally off? If so, I'd appreciate it if someone could point me in the right direction. Thank you in advance.
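The flow described is a common "search-then-hydrate" pattern. One detail worth noting: = ANY (...) does not preserve the search engine's ranking, so the hydration query usually re-orders by the id array. A sketch, assuming a courses table and the ids array passed as parameter $1:

-- fetch full rows for the matched ids, keeping Meilisearch's ranking
SELECT c.*
FROM courses AS c
WHERE c.id = ANY ($1)
ORDER BY array_position($1, c.id);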
Arad (131 rep)
Sep 24, 2023, 12:23 AM • Last activity: Sep 24, 2023, 12:47 AM
1 vote
0 answers
69 views
Disabling SSL in ElasticSearch and enrolling new nodes
I am trying to start an Elasticsearch container with SSL disabled, as I want it hosted at http://localhost:9200 (not https), while keeping the ability to enroll new nodes. But when I disable http.ssl (by passing it as an environment variable), all the default settings in elasticsearch.yml also disappear. I want to keep the other settings, including enrollment.enabled and transport.ssl (as they are used for enrolling new nodes), along with the creation of certificates (which do not get created if I disable ssl). So what I did was start the container and disable http.ssl myself (with all generated certificates at my disposal), but when I tried to connect another node using the enrollment token, it gave me the error:
configurations for this node is already configured
and then exited. Does anyone know if it is possible to do it this way?
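For reference, the relevant settings can be passed individually, so disabling HTTP TLS need not drop the rest. A sketch, assuming the official 8.x image (whether enrollment tokens then work over plain HTTP is version-dependent and not verified here):

docker run -d --name es01 -p 9200:9200 \
  -e xpack.security.enabled=true \
  -e xpack.security.enrollment.enabled=true \
  -e xpack.security.http.ssl.enabled=false \
  -e xpack.security.transport.ssl.enabled=true \
  docker.elastic.co/elasticsearch/elasticsearch:8.9.0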
Tanmay Sharma (11 rep)
Aug 11, 2023, 11:07 AM • Last activity: Aug 11, 2023, 01:49 PM
0 votes
1 answer
471 views
Developing a database to store 100-dimensional image embeddings along with their paths
Suppose we want to design a database to store images represented in vector format (each image is stored as a 100-dimensional vector). The goal is to store image embeddings along with their paths so that we can use an embedding to find an image's neighbors, which helps in image retrieval. Once we find the neighbors of an image, I would like to return and display those images, which is why I need to store image paths in the database as well. The KD-tree data structure is a good starting point for nearest-neighbor search, but it suffers from dimensionality issues given a vector of dimension 100.

**First question**: Can you please let me know what you think about this design strategy, and what would you recommend as an alternative to a KD-tree to deal with high-dimensional input?

**Second question**: Is it easy to find a sophisticated nearest-neighbor search in, for example, an Oracle database?
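If Elasticsearch is an option, its dense_vector field type covers this case with approximate (HNSW) nearest-neighbor search, which sidesteps the KD-tree dimensionality problem. A sketch (index and field names invented):

curl -X PUT "localhost:9200/images" -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "path":      { "type": "keyword" },
      "embedding": { "type": "dense_vector", "dims": 100,
                     "index": true, "similarity": "cosine" }
    }
  }
}'

# search shape (not runnable verbatim: query_vector needs 100 floats)
# GET images/_search
# { "knn": { "field": "embedding", "query_vector": [...], "k": 10, "num_candidates": 100 } }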
Avv (109 rep)
Feb 28, 2023, 11:10 PM • Last activity: May 23, 2023, 05:48 PM
1 vote
1 answer
442 views
Secondary indexes vs using Elasticsearch
When does it make sense to put data in Elasticsearch vs creating secondary indexes on the primary datastore?

Elasticsearch alongside a primary store

Pros:
1. The primary datastore can be optimised for read/write use cases.
2. Elasticsearch supports more than just key-value matching, e.g. fuzzy match.

Cons:
1. Out of sync with the primary datastore.
2. Two more components to manage (ES as well as a pipeline to insert into ES).
3. Would need some sort of change data capture capability from the primary datastore.

Secondary indexes on the primary datastore

Pros:
1. Fewer moving parts.
2. Fewer consistency issues (because secondary indexes can be eventually consistent).

Cons:
1. Not all datastores support secondary indexing.
2. Secondary index queries are more often scatter-gather; doing that at high QPS will limit read/write QPS for primary access patterns like reads and writes by PK.

Are there other considerations when deciding this?
best wishes (113 rep)
Mar 25, 2023, 08:37 AM • Last activity: Mar 26, 2023, 12:50 PM
0 votes
1 answer
2086 views
Delete unassigned shards in Elasticsearch
I had an Elasticsearch server which ran in single-node mode. When the dataset reached 1TB, I added a second node and relocated a couple of shards with the reroute API. Now the second node has 2 of 5 shards, but the first node still holds all 5 shards and the space is not reclaimed. The **_cat/shards?v** command shows:

new_messages 3 p STARTED    974698739 256.6gb 5.188.130.61 el01
new_messages 3 r UNASSIGNED

I've found some "solutions" like stopping ES and deleting the files by hand, but I don't like them.
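The safer levers here are the allocation-explain API and the replica count; a sketch (dropping replicas frees the space an unassignable copy would occupy, but also removes redundancy):

# ask the cluster why the copy is UNASSIGNED
curl -s "localhost:9200/_cluster/allocation/explain?pretty"

# if no replica is wanted, remove it instead of deleting files by hand
curl -X PUT "localhost:9200/new_messages/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'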
Oleg Gritsak (123 rep)
Sep 29, 2020, 09:03 AM • Last activity: Sep 7, 2022, 06:03 AM
0 votes
1 answer
32 views
Are the Informix and Lucene analyzers similar?
Is the analyzer discussed here the same as the Lucene analyzers? I am confused because most of them show properties similar to the Lucene analyzers, but the blog posts don't say a word about Lucene; instead they talk about something made by IBM called Informix. https://www.ibm.com/docs/en/informix-servers/12.10?topic=analyzers-snowball-analyzer I just want to know about these 5 types of analyzer: stopword, simple, standard, whitespace and snowball. Are their properties the same as Lucene's? It looks like they're the same, although their names aren't exact.
bomtirgom (13 rep)
Jul 29, 2022, 12:05 PM • Last activity: Jul 29, 2022, 08:02 PM
0 votes
1 answer
2473 views
How to insert several csv files into Elasticsearch?
I have several csv files on university courses that all seem linked by an ID, which you can find here, and I wondered how to put them into Elasticsearch. Thanks to this video and Logstash, I know how to insert a single csv file into Elasticsearch. But do you know how to insert several, such as those in the provided link? At the moment I have started with a first .config file for a first csv file, ACCREDITATION.csv, but it would be painful to write them all... The .config file is:

input {
    file {
        path => "Users/mike/Data/ACCREDITATION.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}
filter {
    csv {
        separator => ","
        columns => ['PUBUKPRN', 'UKPRN', 'KISCOURSEID', 'KISMODE', 'ACCTYPE', 'ACCDEPEND', 'ACCDEPENDURL', 'ACCDEPENDURLW']
    }
    mutate { convert => ["PUBUKPRN", "integer"] }
    mutate { convert => ["UKPRN", "integer"] }
    mutate { convert => ["KISMODE", "integer"] }
    mutate { convert => ["ACCTYPE", "integer"] }
    mutate { convert => ["ACCDEPEND", "integer"] }
}
output {
    elasticsearch {
        hosts => "localhost"
        index => "accreditation"
        document_type => "accreditation keys"
    }
    stdout {}
}

Update May, 3rd: Not knowing how to use a .config file to load csv files into Elasticsearch, I fell back on the Elastic blog and tried to write a shell script, importCSVFiles, for a first .csv file before generalizing the approach:

#!/bin/bash
while read f1
do
    curl -XPOST 'https://XXX.us-east-1.aws.found.io:9243/courses/accreditation' -H "Content-Type: application/json" -u elastic:XXX -d "{ \"accreditation\": \"$f1\" }"
done < AccreditationByHep.csv

Yet I received a mapper_parsing_exception in the terminal:

mike@mike-thinks:~/Data/on_2018_04_25_16_43_17$ ./importCSVFiles
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"i_o_exception","reason":"Illegal unquoted character ((CTRL-CHAR, code 13)): has to be escaped using backslash to be included in string value\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@e18584; line: 1, column: 88]"}},"status":400}
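CTRL-CHAR code 13 is a carriage return: the csv almost certainly has Windows line endings, so each $f1 ends in \r and breaks the JSON. A hedged revision of the same script that strips it:

#!/bin/bash
while IFS= read -r f1
do
    f1=${f1%$'\r'}    # drop the trailing carriage return
    curl -XPOST 'https://XXX.us-east-1.aws.found.io:9243/courses/accreditation' \
         -H "Content-Type: application/json" -u elastic:XXX \
         -d "{ \"accreditation\": \"$f1\" }"
done < AccreditationByHep.csv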
Revolucion for Monica (677 rep)
May 2, 2018, 02:47 PM • Last activity: Oct 15, 2021, 09:08 AM
-1 votes
1 answer
58 views
Room availability
I was wondering what would be the best way to manage availability for rooms or beds, like in a hostel (SaaS). I was thinking about using MySQL and Elasticsearch. What would be the best schema for the rooms/beds table and the availability table? Thanks a lot.
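As a starting point, one minimal MySQL sketch (table and column names are assumptions) stores bookings rather than per-day availability, so a bed is free whenever no booking overlaps the requested range:

CREATE TABLE room (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    hostel_id INT UNSIGNED NOT NULL,
    name      VARCHAR(64)  NOT NULL
);

CREATE TABLE bed (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    room_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (room_id) REFERENCES room (id)
);

CREATE TABLE booking (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    bed_id    INT UNSIGNED NOT NULL,
    check_in  DATE NOT NULL,
    check_out DATE NOT NULL,
    FOREIGN KEY (bed_id) REFERENCES bed (id),
    KEY idx_bed_dates (bed_id, check_in, check_out)
);

-- beds free for a stay (:in/:out are application-supplied dates):
-- two ranges overlap iff each one starts before the other ends
SELECT b.id
FROM bed AS b
WHERE NOT EXISTS (
    SELECT 1 FROM booking AS k
    WHERE k.bed_id = b.id AND k.check_in < :out AND k.check_out > :in
);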
Yass O. (1 rep)
Jul 4, 2021, 10:27 PM • Last activity: Jul 4, 2021, 10:55 PM
0 votes
0 answers
276 views
Store Elasticsearch shards in separate partitions on data node
I have an Elasticsearch cluster configured with one head node and three data nodes, with the number of replicas set to 2. The disk on each data node is split into three partitions, called /data1, /data2 and /data3. Elasticsearch is storing all the shards in the /data1 partition on each data node. The system has been working fine, but now I want to add a new index and the /data1 partitions do not have space to store the new shards. How can I instruct the cluster to store the new shards on a different disk partition? I looked at /etc/elasticsearch/elasticsearch.yml on the head node, and path.data is set to /var/lib/elasticsearch. How do I modify path.data to use a different partition on the data nodes?

Update: In response to the link posted in the comments, I modified elasticsearch.yml to look like this:

path.data: /data1/elasticsearch, /data2/elasticsearch_2, /data3/elasticsearch_3

And created the corresponding directories on the data nodes. However, I am still getting 'node is above the low watermark' warnings from two of the nodes, and a 'the shard cannot be allocated to the same node on which a copy of the shard already exists' warning from the third.
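For reference, the list form of the setting (on each data node, not the head node) looks like the sketch below. Note that Elasticsearch places whole shards per path and does not rebalance existing shards onto new paths, which is consistent with the warnings described; multiple data paths were also deprecated in 7.13.

# elasticsearch.yml on a data node; a sketch of the list syntax
path.data:
  - /data1/elasticsearch
  - /data2/elasticsearch_2
  - /data3/elasticsearch_3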
Matt (291 rep)
Apr 16, 2021, 07:28 PM • Last activity: Apr 20, 2021, 03:32 PM
5 votes
4 answers
697 views
Database events but not triggers
This is a question regarding general DB inner workings, not particular to an implementation or paradigm, though answers for specific technologies are welcome. I am asking if there is a way to listen to the commands the database has received, or something like reading an inner log of the database, at least for the most recent changes. I need such functionality to be able to find out whether there were changes to a table and, if so, to read the particular row that changed. It is assumed that changes to the columns do not happen. I am only a listener of the database, hence I am not able to program triggers.
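Most engines expose such a log: MySQL has the binlog, SQL Server has CDC, and PostgreSQL has logical decoding. A sketch for PostgreSQL (requires wal_level = logical and replication privileges, so listener-only access may not suffice; the slot name is invented):

-- create a change stream once
SELECT pg_create_logical_replication_slot('listener_slot', 'test_decoding');

-- poll it: each row describes an INSERT/UPDATE/DELETE committed since the last call
SELECT * FROM pg_logical_slot_get_changes('listener_slot', NULL, NULL);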
Stefan Popescu (51 rep)
Feb 9, 2021, 09:31 AM • Last activity: Feb 10, 2021, 03:44 PM
1 vote
1 answer
1952 views
Elasticsearch: Versioning a document on revisions
I currently have a document which gets revised regularly, and I want to keep track of it by keeping each old version. So if document A has a summary and an update date, I want to keep the previous version, along with its update date, after every update. The problem is that I'm not sure how to do this efficiently:

{
  Title: A,
  Summary: { update_date: content, update_date: content, ... }
}

The problem is that if I use the date as a key, the automatically generated schema will treat every date as a possible key, which is not something you want. So my question is: what's the most efficient way of tracking all revisions by date in Elasticsearch?
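One way to avoid date-valued keys is a nested array of revision objects, so the mapping stays fixed no matter how many updates accumulate; a sketch (index name invented):

curl -X PUT "localhost:9200/documents" -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "title": { "type": "keyword" },
      "revisions": {
        "type": "nested",
        "properties": {
          "update_date": { "type": "date" },
          "content":     { "type": "text" }
        }
      }
    }
  }
}'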
Lucas Kauffman (1835 rep)
Feb 17, 2014, 04:14 PM • Last activity: Jan 11, 2021, 05:05 PM
0 votes
1 answer
20 views
Help needed with several engines use case
We are developing an app, approx 50k RPM read, 1k RPM write; it asks by key and gets a JSON document. The search is always made by key. I'm inclined to use one MySQL 8 table with an id field and a JSON field, with InnoDB. It seems simple, cheap, and fast when accessing by index. Each key can have n rows (30 max), and the total size of the table is less than 100GB. Response time is important; I think 2-10 ms is achievable on MySQL. The other, more expensive options that I have are DynamoDB and Elasticsearch (I can't use another tool). I can't find a comparison for this use case to help me know if I'm on the correct path. Do you see any cons of using MySQL, or am I missing something? Thanks!!
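A sketch of the table shape described (names are assumptions); with InnoDB the clustered primary key keeps all rows for one key physically adjacent, so the read path is a single index range scan:

CREATE TABLE documents (
    id  VARCHAR(64)  NOT NULL,
    seq INT UNSIGNED NOT NULL,   -- up to ~30 rows per key
    doc JSON         NOT NULL,
    PRIMARY KEY (id, seq)
) ENGINE = InnoDB;

SELECT doc FROM documents WHERE id = 'some-key';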
Alejandro (113 rep)
Dec 14, 2020, 02:42 AM • Last activity: Dec 14, 2020, 03:49 AM
-1 votes
2 answers
397 views
Which database to use when you have 6 billion rows and need to query rows from a list of IDs?
We are currently researching our case of storing road distances between cities. Right now we have 6 billion of those distances. Our current structure in SQL Server uses a float which represents the relationship between cities. For example, if we have a city with Id 1 inside the Locations table and a city with Id 2 inside the same table, a row with the distance from 1 to 2 will look like 1.2, '1000 miles'. That column is indexed. So to get the distance from city 1000 to city 2535, we would look up 1000.2535 inside the Distances table. Besides getting single distances, we need to select groups of 1000 distances from those 6 billion rows:
SELECT
  id
 ,distance
FROM
  Distances 
WHERE 
  id IN (1000.2535, 1.2, ...)
Right now we've only tested SQL Server on a local machine, and it gives us around 300 ms for such a query of 1000 rows, but only when we set a 50 ms timeout (this is needed to support a lot of parallel requests from multiple users); if the 50 ms timeout is not used, latency just keeps growing: 300 ms for the first query, 500 ms for the second, 800 ms for the third, etc. And right now we are taking a look at Elasticsearch, specifically at mget. So my questions are:

1. Which database would you recommend for such a use case?
2. What would you recommend besides what we've thought of, for example splitting the city IDs into two separate columns?
3. What would be the best ways to optimize such a database?
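On question 2: one commonly suggested restructuring is to split the float key into two integer columns, which avoids float-precision pitfalls (1000.2535 has no exact binary-float representation) and keeps batch lookups index-only. A T-SQL sketch (names are assumptions):

CREATE TABLE Distances (
    from_id  INT NOT NULL,
    to_id    INT NOT NULL,
    distance VARCHAR(32) NOT NULL,
    PRIMARY KEY (from_id, to_id)
);

-- batch lookup of many pairs via a VALUES derived table
SELECT d.from_id, d.to_id, d.distance
FROM Distances AS d
JOIN (VALUES (1000, 2535), (1, 2)) AS req (from_id, to_id)
    ON d.from_id = req.from_id AND d.to_id = req.to_id;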
Artem Biryukov (3 rep)
Sep 24, 2020, 04:47 PM • Last activity: Sep 25, 2020, 08:59 AM
1 vote
1 answer
252 views
Query customer attributes
I need to create a system where the user creates dynamic filters based on our customers' attributes. There are more or less 30 possible filters and 30 million customers, but the number of customers increases every day and attribute values can change every day too, so we have inserts and updates on this set of data every day. Another thing is that I can create a new filter or remove one. In this case we could use a relational database like Oracle and create an index for every column, but with inserts and updates every day, could I run into performance problems? Should I use a search engine like Elasticsearch for this case? Or is there a recommended database or architecture for this use case? I need to return a count of customers matching these filters in at most 5 seconds.

**EDIT**

Some attributes:
- Downloaded the app (boolean)
- Credit card limit (number)
- Last transaction (date)
- Status (text)
- Last access (date)
- How many times used the credit card (number)
- City (text)
- Average transaction value (number)

The user can use >, =, <= to filter, or use IN, like IN ('New York', 'Seattle').
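In Elasticsearch terms, such filters map directly onto a bool filter and the _count API; a sketch with invented field names:

curl -X GET "localhost:9200/customers/_count" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "downloaded_app": true } },
        { "range": { "last_transaction": { "gte": "now-30d/d" } } },
        { "terms": { "city": ["New York", "Seattle"] } }
      ]
    }
  }
}'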
Mac Portter (11 rep)
Jul 25, 2020, 01:01 AM • Last activity: Jul 25, 2020, 09:25 PM
1 vote
0 answers
63 views
Elassandra data modeling
In Cassandra, everyone always stresses how important data modeling is, and rightfully so. However, when using Elassandra, we have Elasticsearch baked in. How should this change how we think about modeling our Cassandra tables and partitions? For example, for vanilla Cassandra we need to try to minimize the amount of searching we do across our partitions (see here under the heading "Basic Goals" > "Rule 2: Minimize The Number Of Partitions Read"). Is this still true when we have our data indexed by Elasticsearch? Is it still worth the extra complexity of duplicating our data and managing duplicate tables for the sake of not searching across partitions? Another example would be choosing our primary key. In Cassandra, we normally cannot run a CQL query using WHERE clauses that don't specify all fields from the primary key (unless we use a Cassandra secondary index or override the warnings using ALLOW FILTERING, see here). With Elassandra, however, this is easily overcome by the Elasticsearch integration, whether by using the Elasticsearch REST API or by using CQL with Elasticsearch queries. If we know we will use Elassandra, does that therefore make a difference in how we choose our primary key?
RyanQuey (153 rep)
Jun 24, 2020, 01:21 PM • Last activity: Jun 24, 2020, 01:51 PM