Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

0 votes

0 answers

37 views

How to filter extraneous Unicode values from a column?

I am cleaning up and priming our PostgreSQL 12 database for future data-related activities (e.g. data encryption). I have tried the following methods to delete every character that is not basic Latin, basic accented Latin, or punctuation from the values in one of our tables:

- `regexp_replace(field, '[^[:graph:]]', '', 'g')` and `SIMILAR TO '%[^[:graph:]]%'` (in our implementation, `[:print:]` does not work in `regexp_replace`)
- `btrim(field, '')`, which works only for Latin strings, not for strings containing Unicode characters.

However, the extraneous Unicode characters (e.g. `\ud83d`), including those embedded within otherwise acceptable values (i.e. Latin letters and punctuation), are not filtered out or deleted, so I could not prime the data.

What can I do to filter out only unacceptable Unicode values in regexp_replace and delete them, retaining only acceptable characters?
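
To make the goal concrete, here is a minimal sketch of the whitelist idea in Python rather than in PostgreSQL: a negated character class that names the acceptable code point ranges explicitly instead of relying on `[:graph:]`. The exact ranges and the name `strip_unwanted` are assumptions for illustration and are not taken from the question.

```python
import re

# Assumed whitelist (not from the question): printable Basic Latin
# (U+0020 to U+007E) plus the Latin-1 accented range (U+00C0 to U+00FF).
# Note that the second range also contains the signs U+00D7 and U+00F7.
_UNWANTED = re.compile(r"[^\x20-\x7E\u00C0-\u00FF]")


def strip_unwanted(text: str) -> str:
    """Remove every character outside the assumed whitelist."""
    return _UNWANTED.sub("", text)


# Accented letter, emoji, and zero-width space in one value.
sample = "Jos\u00e9\U0001F600, hi\u200b!"
cleaned = strip_unwanted(sample)
print(cleaned)
```

The same negated-range approach may carry over to `regexp_replace`, but how the ranges have to be written there would need to be checked against the server's encoding and regex flavor.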

Bona Rae Villarta (1 rep)

Aug 31, 2023, 02:10 AM

1 votes

1 answers

509 views

Find duplicate values by joining tables SQL server

sql-server join duplication data-cleansing

I want to find the duplicate values in the 'Item_Sales_Detail' table and show them as rows with NULL in the 'Sales' and 'Items' columns when joining the three tables.

'Sales' table (ID is primary key)

| ID | Invoice        | Date       | TotalAmount |
|----|----------------|------------|-------------|
| 10 | 00000000100001 | 02/02/2023 | 2000        |
| 20 | 00000000100002 | 02/02/2023 | 1500        |
| 30 | 00000000100003 | 02/02/2023 | 18000       |

'Items' table (Sales_ID is a foreign key)

| ID | Sales_ID | Item_Code | Amount | Quantity | Total_Amount |
|----|----------|-----------|--------|----------|--------------|
| 1  | 10       | 22        | 2000   | 1        | 2000         |
| 2  | 20       | 35        | 1500   | 1        | 1500         |
| 3  | 30       | 44        | 5000   | 2        | 10000        |
| 4  | 30       | 14        | 8000   | 1        | 8000         |

'Item_Sales_Detail' table (Sales_ID, Item_ID, and Invoice are foreign keys)

| ID | Sales_ID | Item_ID | invoice        | date       | Amount |
|----|----------|---------|----------------|------------|--------|
| 1  | 10       | 1       | 00000000100001 | 02/02/2023 | 2000   |
| 2  | 10       | 1       | 00000000100001 | 02/02/2023 | 2000   |
| 3  | 20       | 2       | 00000000100002 | 02/02/2023 | 1500   |
| 4  | 30       | 3       | 00000000100003 | 02/02/2023 | 5000   |
| 5  | 30       | 3       | 00000000100003 | 02/02/2023 | 5000   |
| 6  | 30       | 3       | 00000000100003 | 02/02/2023 | 5000   |
| 7  | 30       | 4       | 00000000100003 | 02/02/2023 | 8000   |

In the 'Item_Sales_Detail' table, invoice number 00000000100001 has one extra record as a duplicate, and invoice number 00000000100003 with Item_ID 3 has quantity 2 but one extra record was entered, so there are now 3 records instead of 2.

My query:

SELECT Sales.Invoice,
       Items.Item_Code,
       Item_Sales_Detail.invoice,
       Item_Sales_Detail.date,
       Item_Sales_Detail.Amount
FROM Sales
INNER JOIN Items ON Sales.ID=Items.Sales_ID
INNER JOIN Item_Sales_Detail ON Items.ID = Item_Sales_Detail.Item_ID

Result:

| Sales.Invoice  | Items.Item_Code | Item_Sales_Detail.invoice | Item_Sales_Detail.date | Item_Sales_Detail.Amount |
|----------------|-----------------|---------------------------|------------------------|--------------------------|
| 00000000100001 | 22              | 00000000100001            | 02/02/2023             | 2000                     |
| 00000000100001 | 22              | 00000000100001            | 02/02/2023             | 2000                     |
| 00000000100002 | 35              | 00000000100002            | 02/02/2023             | 1500                     |
| 00000000100003 | 44              | 00000000100003            | 02/02/2023             | 5000                     |
| 00000000100003 | 44              | 00000000100003            | 02/02/2023             | 5000                     |
| 00000000100003 | 44              | 00000000100003            | 02/02/2023             | 5000                     |
| 00000000100003 | 14              | 00000000100003            | 02/02/2023             | 8000                     |

Expected:

| Sales.Invoice  | Items.Item_Code | Item_Sales_Detail.invoice | Item_Sales_Detail.date | Item_Sales_Detail.Amount |
|----------------|-----------------|---------------------------|------------------------|--------------------------|
| 00000000100001 | 22              | 00000000100001            | 02/02/2023             | 2000                     |
| NULL           | NULL            | 00000000100001            | 02/02/2023             | 2000                     |
| 00000000100002 | 35              | 00000000100002            | 02/02/2023             | 1500                     |
| 00000000100003 | 44              | 00000000100003            | 02/02/2023             | 5000                     |
| 00000000100003 | 44              | 00000000100003            | 02/02/2023             | 5000                     |
| NULL           | NULL            | 00000000100003            | 02/02/2023             | 5000                     |
| 00000000100003 | 14              | 00000000100003            | 02/02/2023             | 8000                     |
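
One way to read the expected output is that a detail row counts as "extra" once the number of detail rows for an item exceeds `Items.Quantity`. The sketch below tries that reading with Python's built-in `sqlite3` instead of SQL Server, numbering the detail rows per item with `ROW_NUMBER()` and joining only the rows whose number fits within the quantity. The rule itself, the choice to keep the lowest IDs, and the omission of the date and amount columns of the parent tables are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (ID INTEGER PRIMARY KEY, Invoice TEXT);
CREATE TABLE Items (ID INTEGER PRIMARY KEY, Sales_ID INTEGER,
                    Item_Code INTEGER, Quantity INTEGER);
CREATE TABLE Item_Sales_Detail (ID INTEGER PRIMARY KEY, Sales_ID INTEGER,
                                Item_ID INTEGER, invoice TEXT, Amount INTEGER);
INSERT INTO Sales VALUES (10, '00000000100001'), (20, '00000000100002'),
                         (30, '00000000100003');
INSERT INTO Items VALUES (1, 10, 22, 1), (2, 20, 35, 1),
                         (3, 30, 44, 2), (4, 30, 14, 1);
INSERT INTO Item_Sales_Detail VALUES
    (1, 10, 1, '00000000100001', 2000), (2, 10, 1, '00000000100001', 2000),
    (3, 20, 2, '00000000100002', 1500), (4, 30, 3, '00000000100003', 5000),
    (5, 30, 3, '00000000100003', 5000), (6, 30, 3, '00000000100003', 5000),
    (7, 30, 4, '00000000100003', 8000);
""")

# Assumption: the first Quantity detail rows per item (by ID) are legitimate;
# any later row for the same item is treated as an extra and gets NULLs.
QUERY = """
WITH numbered AS (
    SELECT d.*,
           ROW_NUMBER() OVER (PARTITION BY d.Item_ID ORDER BY d.ID) AS rn
    FROM Item_Sales_Detail AS d
)
SELECT s.Invoice, i.Item_Code, n.invoice, n.Amount
FROM numbered AS n
LEFT JOIN Items AS i ON i.ID = n.Item_ID AND n.rn <= i.Quantity
LEFT JOIN Sales AS s ON s.ID = i.Sales_ID
ORDER BY n.ID
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

`ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` also exists in SQL Server, so the query shape should translate, though it has only been run against SQLite here.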

Mohammad Bastan (33 rep)

Feb 16, 2023, 07:59 PM • Last activity: Feb 17, 2023, 04:11 AM

0 votes

1 answers

59 views

Why get tables cleaned and copied in an SQL-Server DB?

sql-server data-cleansing

I'm working on an SQL-Server database.

Regularly, entries from one table get moved to another one (from entries to Log_Entries) in order not to flood the database. (The Log_Entries get cleaned afterwards too)

I would like to know how this works, but I can't find any corresponding entry under "Stored Procedures" or "Functions", and there do not seem to be any "Database Triggers". The "Rules" part of the database also seems to be empty.

Which entry in the database can be responsible for such a task?

**Edit after first comment**  
I have "Database Diagrams", "Tables", "Views", "External Resources", "Programmability", "Service Broker", "Storage" and "Security".  
Within "Programmability" there are "Stored Procedures", "Functions", "Database Triggers", "Assemblies", "Types", "Rules", "Defaults" and "Sequences".

Where is that SQL Agent?

Thanks in advance

Dominique (609 rep)

Mar 17, 2022, 09:04 AM • Last activity: Mar 19, 2022, 01:28 PM

5 votes

2 answers

15256 views

Find rows with similar string values

sql-server sql-server-2012 full-text-search string-manipulation data-cleansing

I have a Microsoft SQL Server 2012 database table with around 7 million crowd-sourced records, primarily containing a string name value with some related details. For nearly every record it seems there are a dozen similar typo records and I am trying to do some fuzzy matching to identify record groups such as "Apple", "Aple", "Apples", "Spple", etc. These names can also contain multiple words with spaces between them.

I've come up with a solution that uses an edit-distance scalar function, which returns the number of keystrokes required to transform string1 into string2, and uses that function to join the table to itself. As you can imagine, this doesn't perform well, since it has to execute the function millions of times to evaluate the join.

So I put that in a cursor so that only one string1 is evaluated at a time. This at least gets results coming out, but after running for weeks it has only made it through 150,000 records. With 7 million to evaluate, I don't have the kind of time my method is going to take.

I put full-text indexes on the string names, but couldn't find a way to use the full-text predicates without a static value to search for.

Any ideas how I could do something like the following in a way that wouldn't take months to run?

      SELECT t1.name, t2.name
      FROM names AS t1
      INNER JOIN names AS t2
           ON EditDistance(t1.name,t2.name) = 1
           AND t1.id != t2.id

I've tried soundex, but since the names can contain spaces and multiple words per value I get too many false positives to use it reliably.
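
For reference, the edit-distance self-join from the question can be sketched in Python as follows. It assumes the scalar function is a plain Levenshtein distance (insertions, deletions, and substitutions each cost 1), which the question does not confirm, and `edit_distance` is a hypothetical stand-in for the real `EditDistance` function.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Sample names taken from the question.
names = ["Apple", "Aple", "Apples", "Spple"]

# Naive equivalent of the self-join: every unordered pair is compared once.
close_pairs = [(a, b) for a, b in combinations(names, 2)
               if edit_distance(a, b) == 1]
print(close_pairs)
```

Comparing every pair means n(n-1)/2 distance computations, which for 7 million names is roughly 2.4 × 10^13 calls, so the cost comes from the number of comparisons rather than from any single call.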
                                

kscott (151 rep)

Jul 24, 2017, 04:20 AM • Last activity: Nov 7, 2019, 09:07 PM

1 votes

0 answers

30 views

Selecting Duplicates on All Fields

query ms-access duplication data-validation data-cleansing

I have an MS Access (no laughing at the back) database I've used to import a bunch of IIS logs into.

Having looked at the Excel files I pulled these in from, I'm worried going by the dates that some of the IIS files might have full duplicates (i.e. where every single field is identical).

I know how to select duplicates on an individual row, but how can I issue an SQL query that will show me rows where **every** field is identical, so I don't get skewed results?

If possible, some guidance on how to then have these rows deleted from the table without having to do a million delete queries with the individual PK IDs would be great too.

Example data:

In the above I'd be looking to identify rows 1 and 2 as they are exact duplicates, but 3 and 4 are okay as they share some but not all values. So the report would ideally come back with 1 and 2, so I could then note the IDs and delete all but 1 of the duplicates.
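
A common pattern for this is to group by every non-key column and keep only groups with more than one row. The sketch below shows it with Python's built-in `sqlite3` rather than Access; the table name `iis_log` and its columns are hypothetical, and the Access SQL dialect may need adjustments that have not been checked here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layout: an autonumber-style key plus three log fields.
CREATE TABLE iis_log (id INTEGER PRIMARY KEY, log_date TEXT,
                      client_ip TEXT, uri TEXT);
INSERT INTO iis_log (log_date, client_ip, uri) VALUES
    ('2019-07-01', '10.0.0.1', '/home'),
    ('2019-07-01', '10.0.0.1', '/home'),   -- exact duplicate of row 1
    ('2019-07-01', '10.0.0.2', '/home'),   -- differs in client_ip
    ('2019-07-02', '10.0.0.1', '/home');   -- differs in log_date
""")

# Report: one line per group of rows identical in every non-key field.
duplicates = conn.execute("""
    SELECT log_date, client_ip, uri, COUNT(*) AS copies, MIN(id) AS keep_id
    FROM iis_log
    GROUP BY log_date, client_ip, uri
    HAVING COUNT(*) > 1
""").fetchall()

# Cleanup: keep the lowest id of every group and delete the rest in one go.
conn.execute("""
    DELETE FROM iis_log
    WHERE id NOT IN (SELECT MIN(id) FROM iis_log
                     GROUP BY log_date, client_ip, uri)
""")
remaining_ids = [row[0] for row in
                 conn.execute("SELECT id FROM iis_log ORDER BY id")]
print(duplicates, remaining_ids)
```

Every field that should count towards "identical" has to appear in both `GROUP BY` lists; the key column is left out because it differs on every row.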

user788561 (11 rep)

Jul 26, 2019, 02:52 PM • Last activity: Jul 30, 2019, 09:09 AM

2 votes

1 answers

5302 views

How to compare two tables that have no primary key?

data-cleansing

I have two tables at work whose data I have to compare.
The fields are identical, but no column has unique entries:
(Employee ID, Assignment ID, Employee's Last Name, Employee's First Name,
Dependent Last Name, Dependent First Name, Dependent Date of Birth).
An employee can have multiple assignments, and each employee can have multiple dependents.

The data in the two tables was entered by different people, so it is messy, and I was asked to do some data cleanup.
Some dependents can be missing for an employee, or a dependent's name can be misspelled, blank, etc.

Is there a way to compare the two tables for unmatched records, so that I can create a table for all data?

I tried to use the Find Unmatched Fields query in Access, but it seems it can only compare one field.
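
As an illustration of comparing on all fields at once, the sketch below uses `EXCEPT` in both directions via Python's built-in `sqlite3`. The table names, the reduced column set, and the sample rows are hypothetical, and `EXCEPT` may not be available in every engine; where it is missing, a `LEFT JOIN` on every column with an `IS NULL` check is the usual substitute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, shortened column set; real tables would list all fields.
CREATE TABLE table_a (employee_id INTEGER, assignment_id INTEGER,
                      dep_last TEXT, dep_first TEXT, dep_dob TEXT);
CREATE TABLE table_b (employee_id INTEGER, assignment_id INTEGER,
                      dep_last TEXT, dep_first TEXT, dep_dob TEXT);
INSERT INTO table_a VALUES
    (1, 100, 'Smith', 'Ann', '2001-01-01'),
    (1, 100, 'Smith', 'Bob', '2003-05-05'),
    (2, 200, 'Lee',   'Kim', '2010-10-10');
INSERT INTO table_b VALUES
    (1, 100, 'Smith', 'Ann',  '2001-01-01'),
    (1, 100, 'Smith', 'Bobb', '2003-05-05');   -- misspelled first name
""")

# Rows are compared on all columns together, so no single key is needed.
only_in_a = sorted(conn.execute(
    "SELECT * FROM table_a EXCEPT SELECT * FROM table_b").fetchall())
only_in_b = sorted(conn.execute(
    "SELECT * FROM table_b EXCEPT SELECT * FROM table_a").fetchall())
print(only_in_a)
print(only_in_b)
```

`EXCEPT` works on sets, so exact duplicate rows within one table collapse to a single row in the result, and a misspelled name shows up on both sides rather than being recognized as the same dependent.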

Thanks in advance!:)

==edit==

I'm adding some sample data here:

vpxoxo (121 rep)

Jan 26, 2016, 07:05 PM • Last activity: Jan 27, 2016, 05:26 PM

2 votes

2 answers

81 views

How should I handle measurement errors in a timeseries database?

errors data-cleansing

I have a table used to record measurements sampled at regular intervals on different sensors. Each row records the time, the identifier of the quantity being measured, and the value itself.

Now and again measurement errors occur and garbage is being recorded in the value field. How should I deal with these errors:

1. Delete the offending rows entirely, losing the information that there had been an error;
1. Keep the rows as they are and ask client code to deal with the errors;
1. Replace the values with NULL, losing the original erroneous value?

Or is there another option that I have not considered?
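
To make the trade-offs between the three options visible, here is a small sketch using Python's built-in `sqlite3`. The table layout, the sample readings, and the rule that a value outside -50 to 150 is garbage are all assumptions chosen for illustration.

```python
import sqlite3

# Assumed garbage rule (hypothetical): plausible readings lie in [-50, 150].
BAD = "value NOT BETWEEN -50 AND 150"


def fresh() -> sqlite3.Connection:
    """Build a tiny series with one garbage reading in the middle."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE measurement (ts TEXT, sensor_id INTEGER, value REAL);
    INSERT INTO measurement VALUES
        ('2015-06-01T13:00', 1, 20.0),
        ('2015-06-01T13:05', 1, 9999.0),
        ('2015-06-01T13:10', 1, 21.0);
    """)
    return conn


def summary(conn: sqlite3.Connection) -> tuple:
    """Return (row count, non-NULL value count, average value)."""
    return conn.execute(
        "SELECT COUNT(*), COUNT(value), AVG(value) FROM measurement"
    ).fetchone()


deleted = fresh()                      # option 1: delete the offending rows
deleted.execute(f"DELETE FROM measurement WHERE {BAD}")

kept = fresh()                         # option 2: keep the rows as they are

nulled = fresh()                       # option 3: replace the value with NULL
nulled.execute(f"UPDATE measurement SET value = NULL WHERE {BAD}")

for label, conn in (("deleted", deleted), ("kept", kept), ("nulled", nulled)):
    print(label, summary(conn))
```

With deletion the sample slot disappears entirely, with NULL the row count still shows that a sample was taken, and with untouched rows every aggregate is skewed unless the client filters first.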
                                

lindelof (225 rep)

Jun 1, 2015, 01:38 PM • Last activity: Jun 2, 2015, 04:18 PM

Showing page 1 of 7 total questions