Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
1
answers
149
views
Producing JSON in Oracle 12c1
I need to produce a JSON document with several properties from a list of Oracle 12c1 production databases, and I was given read-only access in order to generate this data.
The problem is these databases were created over time and may store VARCHAR2 values in different character sets. My naive attempt at producing JSON is:
SELECT '{' ||
       '"content":"' || content || '", ' ||
       '"name":"' || name || '"}'
FROM my_table
WHERE ...
However, since the columns content and name may have all kinds of special characters and use non-trivial character sets, the query above may produce invalid JSON documents.
Is there any function that can convert a VARCHAR2 into Unicode from any character set? I think if the string is in Unicode it could be easy to manage. Unfortunately, I don't really know each column's character set in advance.
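For reference, a minimal sketch of one workaround on 12c Release 1 (which, as far as I recall, predates the built-in JSON generation functions added in 12.2): ASCIISTR() escapes non-ASCII characters as \XXXX in the database character set, which can then be massaged toward JSON's \uXXXX form. It assumes my_table and its columns from the query above, and it does not handle embedded quotes or backslashes in the data, so treat it as a starting point rather than a complete solution.
-- Sketch only: ASCIISTR escapes non-ASCII characters as \XXXX (UTF-16 code units);
-- rewriting the backslash gives JSON-style \uXXXX escapes. Quotes and backslashes
-- already present in the data still need their own handling.
SELECT '{' ||
       '"content":"' || REPLACE(ASCIISTR(content), '\', '\u') || '", ' ||
       '"name":"'    || REPLACE(ASCIISTR(name),    '\', '\u') || '"}'
FROM my_table
WHERE ...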
Joe DiNottra
(101 rep)
Oct 26, 2021, 10:42 PM
• Last activity: Jul 24, 2025, 07:02 AM
1
votes
1
answers
286
views
Can't set up Unicode support
I am creating a toy website locally with XAMPP and I want it to support Unicode.
index.php contains a simple form that I can use to add a new entry (a UTF-8 string) to a table "myTable" in the DB. In addition to that, I am also echoing with PHP every entry from "myTable" on the body of index.php, so that I can see if the UTF-8 strings that I am entering through the form make it all the way into the database and back without any issues.
Here are the measures that I took toward Unicode support:
1. Every file on my website (ex.: index.php) is encoded in UTF-8 without BOM;
2. <meta charset="utf-8"> everywhere applicable too;
3. mysqli_set_charset($myConnection, "utf-8"); in my "db.inc.php" file;
4. At the phpMyAdmin "control panel" (?), under "Database Server" I see "Server charset: UTF-8 Unicode (utf8)";
5. I created a database and a table for this specific purpose. Here is the result of SHOW CREATE TABLE myTable:
CREATE TABLE `myTable` (
  `row1` varchar(16) COLLATE utf8mb4_bin NOT NULL,
  `row2` varchar(32) COLLATE utf8mb4_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
The outcome was as follows: when I print the entries of "myTable", everything does get displayed correctly on the browser indeed, however only owing to a stroke of luck, because apparently my strings are being internally represented badly. When I browse my table with phpMyAdmin I see mojibake, and I looked at myTable.ibd and I see the same mojibake.
The string joão e maría, for example, is displayed as joÃ£o e marÃa; for the first ã -> Ã£, it could be UTF-8 bytes being interpreted as latin-1, though I see Ã£ even when I tell npp to interpret the file as UTF-8.
Running the query SELECT row1, HEX(row1) FROM myTable WHERE 1 I see 6A6FC383C2A36F2065206D6172C383C2AD61 as the bytes of joÃ£o e marÃa, so in fact the mojibake is "correctly" encoded as UTF-8 (C3 83 C2 A3 is UTF-8 for Ã£).
Ideas?
UPDATE
After more careful inspection I verified that mysqli_set_charset($myConnection, "utf-8"); was returning false. I replaced it with mysqli_set_charset($myConnection, "utf8"); and voilà, strings with fancy characters that I enter through the form now make it all the way into the database unscathed. However, these new strings aren't correctly displayed by the browser.
For example: the string bênção is inserted into the DB through the form. I query the table with SELECT row1, HEX(row1) (...) and find out that it is stored as 62C3AA6EC3A7C3A36F inside the table, which is the correct UTF-8 representation of the string in question. I access index.php and I see b�n��o being displayed. I download the HTML of the webpage and inspect it, and I find that the offending bytes are respectively EA, E7, E3, which, sure enough, correspond to ê, ç and ã in latin-1.
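A minimal diagnostic sketch, assuming the same MySQL/MariaDB server and the myTable table from the question: compare the connection, server and column character sets, and look at the stored bytes, since C3 83 C2 A3 (instead of C3 A3) is the signature of double-encoded UTF-8.
-- Sketch only: check which character sets are in play on the connection.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- A correctly stored 'ã' in utf8mb4 hexes as C3A3; seeing C383C2A3 means the
-- UTF-8 bytes were re-encoded as if they were latin-1 (double encoding).
SELECT row1, HEX(row1) FROM myTable;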
what a disgrace
(111 rep)
May 22, 2018, 07:16 AM
• Last activity: May 18, 2025, 08:11 PM
1
votes
1
answers
1078
views
Character Strings related : string_units db cfg changed from SYSTEM to CODEUNITS32
To handle a few special characters, we changed the string_units database configuration parameter from the default SYSTEM to CODEUNITS32.
Based on the IBM DB2 Knowledge Center, there are three methods to achieve this:
1. Set the NLS_STRING_UNITS global variable
2. Update the string_units database configuration parameter
3. Alter the column by defining a new character set (CODEUNITS16 or CODEUNITS32)
We have used the second option and restarted the instance. Still it's not working as expected. Have we missed any steps?
Is the reorg required on the table? What happens to the already existing data ?
db2level - 10.5 fp5
OS : Windows
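For context, a hedged sketch of the three options listed above (Db2 10.5 syntax as I understand it; mydb, myschema.mytable and mycol are placeholders, so verify the exact statements against the Knowledge Center before relying on them):
-- 1. Session level: set the built-in global variable.
SET NLS_STRING_UNITS = 'CODEUNITS32';

-- 2. Database level: update the configuration parameter, then reactivate the
--    database (CLP command, shown here as a comment):
--    db2 UPDATE DB CFG FOR mydb USING string_units CODEUNITS32

-- 3. Column level: redefine the string units of an existing column explicitly.
ALTER TABLE myschema.mytable
  ALTER COLUMN mycol SET DATA TYPE VARCHAR(100 CODEUNITS32);
As I understand it, changing string_units only affects columns defined afterwards; columns created before the change keep their original (OCTETS) definition, which can make the new setting look like it is "not working" on existing tables.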
shivrajk
(73 rep)
Jan 5, 2017, 06:09 AM
• Last activity: Apr 14, 2025, 06:04 AM
1
votes
1
answers
1501
views
How to display non-English character correctly in db2cmd?
On Db2 v11.5.7 on Linux/x86_64 I have a UTF-8 database.
Executing db2 get db cfg for test1 returns:

Database code page = 1208
Database code set = UTF-8

On my Windows 10 computer in Putty I have set: *Window | Translation | Remote character set* to UTF-8. I can properly see non-English characters in Putty.
I have also installed the DBeaver tool and I can also properly see non-English characters there.
The problem I have is the "db2cmd" tool.
In a Putty remote connection I created a table and inserted one character:

db2 "connect to test1 user db2inst1 using db2inst1"
db2 "create table admin.test8 (id int not null generated always as identity, a varchar(20))"
db2 "insert into admin.test8 (a) values ('č')"
db2 "select id, a, hex(a) as hex_a from admin.test8 order by id"

The output of the select statement is correct.
If I do the same in DBeaver:

insert into admin.test8 (a) values ('č');
select id, a, hex(a) as hex_a from admin.test8 order by id;

the result is also correct. Both Putty and DBeaver correctly insert the character into the database.
Now, saving the same commands into a file (a.sql) with UTF-8 encoding, from db2cmd I check the code page:

chcp

and the output is:

Active code page: 1252

Then I execute the commands:

db2 -tf a.sql

The output in column A is completely wrong. For ID=3, HEX_A is also completely wrong.
I tried several things like:
1. Changing the console code page: chcp 65001
2. db2set DB2CODEPAGE=1208

But in both cases I see the same result as the last command.
If I try to connect with the "-a" switch I get: Db2 server code page is 1208 (UTF-8) and Db2 client code page is 1252.
How do I configure db2cmd to properly handle the UTF-8 code page?
folow
(523 rep)
Oct 18, 2022, 01:08 PM
• Last activity: Mar 20, 2025, 10:03 PM
1
votes
1
answers
2041
views
How to fix unicode REPLACEMENT CHARACTER � in postgres database
Is there a way to fix the � problem in a Postgres database?
As you can see, there are names with accents and they are not displayed correctly in the database.
I would like to know if there is a way to fix this without dropping the database and restoring it... ?
When I execute
SELECT CAST (userfirstname AS bytea) FROM tab WHERE id = 42;
this is the result I get:

                bytea
--------------------------------------
 \x5374efbfbd7068616e696520284d6d6529
(1 ligne)
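The bytes ef bf bd in that dump are the UTF-8 encoding of U+FFFD itself, i.e. the replacement character was physically stored, so the original accented byte is no longer in the row. A small sketch, reusing the tab/userfirstname/id names from the question, to list the affected rows that would need re-importing or manual correction:
-- Sketch: find rows whose text already contains U+FFFD (stored bytes EF BF BD).
SELECT id, userfirstname
FROM tab
WHERE position(U&'\FFFD' IN userfirstname) > 0;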
rainman
(205 rep)
May 12, 2022, 03:10 AM
• Last activity: Jul 27, 2024, 02:55 AM
1
votes
0
answers
229
views
MySQL 8.0 fails to store 5.7-allowed symbols (Incorrect string value, Error 1366)
I'm upgrading MySQL 5.7 server to 8.0 with a bunch of user-input data that sometimes contain "malformed" unicode characters like
x'eda0bdedb98f'
. The latter seems to be 2 UTF-16 characters 0xD83D 0xDE4F
(4 bytes) of ['Folded Hands'-emoji](https://www.fileformat.info/info/unicode/char/1f64f/index.htm) encoded into 2 utf8mb3 characters (6 bytes).
[This](https://cryptii.com/pipes/hex-decoder) service correctly decodes the emoji, while the others consider it to be a wrong unicode character.
### Problem ###
This symbol can be inserted into MySQL 5.7 table, but fails to insert into MySQL 8.0 table. Depending on sql_mode
insert either fails or proceeds with a truncated string. Thus mysqldump + restore produces inconsistent data. The inplace upgrade works fine but I still cannot manipulate (insert/update) these characters. Changing the charset/collation doesn't help.
### Question ###
Is that possible to force MySQL 8.0 to store "malformed" unicode characters like MySQL 5.7 does?
### Steps to reproduce ###
Run latest MySQL 5.7 in Docker:
docker run --rm --name percona57 -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7 --character_set_server=utf8mb4 --collation-server=utf8mb4_general_ci &
docker exec -ti --user root percona57 mysql --default-character-set=utf8mb4
Run latest MySQL 8.0 in Docker:
docker run --rm --name percona80 -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:8.0 --character_set_server=utf8mb4 --collation-server=utf8mb4_general_ci &
docker exec -ti --user root percona80 mysql --default-character-set=utf8mb4
Execute SQL commands (on both 5.7 and 8.0):
SET names utf8mb4 COLLATE utf8mb4_general_ci;
SHOW VARIABLES LIKE '%char%';
SHOW VARIABLES LIKE '%collat%';
SHOW VARIABLES LIKE 'sql_mode';
CREATE DATABASE IF NOT EXISTS test DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
SHOW CREATE DATABASE test \G
CREATE TABLE IF NOT EXISTS test.t1 (id int unsigned auto_increment primary key, text varchar(255)) CHARSET utf8mb4 COLLATE utf8mb4_general_ci;
SHOW CREATE TABLE test.t1 \G
INSERT INTO test.t1 (text) VALUES (concat('aaa', x'eda0bdedb98f', 'bbb'));
SHOW WARNINGS;
SELECT *, hex(text) FROM test.t1;
Results from MySQL 5.7:
mysql> INSERT INTO test.t1 (text) VALUES (concat('aaa', x'eda0bdedb98f', 'bbb'));
Query OK, 1 row affected (0.01 sec)
mysql> SHOW WARNINGS;
Empty set (0.00 sec)
mysql> SELECT *, hex(text) FROM test.t1;
+----+--------------+--------------------------+
| id | text | hex(text) |
+----+--------------+--------------------------+
| 1 | aaa������bbb | 616161EDA0BDEDB98F626262 |
+----+--------------+--------------------------+
1 row in set (0.00 sec)
Results from MySQL 8.0:
mysql> INSERT INTO test.t1 (text) VALUES (concat('aaa', x'eda0bdedb98f', 'bbb'));
ERROR 1366 (HY000): Incorrect string value: '\xED\xA0\xBD\xED\xB9\x8F...' for column 'text' at row 1
mysql> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+----------------------------------------------------------------------------------+
| Error | 1366 | Incorrect string value: '\xED\xA0\xBD\xED\xB9\x8F...' for column 'text' at row 1 |
+-------+------+----------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT *, hex(text) FROM test.t1;
Empty set (0.00 sec)
Execution log for both MySQL 5.7 and 8.0 + SHOW VARIABLES
statement: https://pastebin.com/MK0KUaV1
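For comparison, a hedged sketch (assuming the test.t1 table created above, and mirroring the original INSERT): the same emoji expressed as well-formed 4-byte UTF-8 is accepted by 8.0, which narrows the problem down to the CESU-8-style surrogate-pair bytes rather than 4-byte characters in general.
-- U+1F64F as proper UTF-8 is F0 9F 99 8F (instead of ED A0 BD ED B9 8F).
INSERT INTO test.t1 (text) VALUES (CONCAT('aaa', x'F09F998F', 'bbb'));
SELECT *, HEX(text) FROM test.t1;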
Alessandro Cattaneo
(19 rep)
May 20, 2024, 10:56 PM
• Last activity: May 29, 2024, 01:20 PM
1
votes
1
answers
905
views
sp_send_dbmail is populating 00 bits between all characters in a result set
I'm using sp_send_dbmail in a SQL Agent job triggered nightly. The job queries our database and checks for product price updates; if there are some, it sends them in an email as an attachment to our e-commerce vendor. They have some automated processes that will update our e-commerce platform; however, their automated process cannot deal with the file provided by sp_send_dbmail.
sp_send_dbmail seems to be putting null characters between all characters in the result set.
See result of opening the csv in a hex editor:
Viewed through a text editor I see as expected:
sp_send_dbmail query here:
SET @FileName = 'Update_' + CONVERT(VARCHAR(12), GETDATE(), 105) + '.csv'
SET @Query = 'Some Query'
EXEC msdb.dbo.sp_send_dbmail
    @recipients = 'me@domain.com',
    @query = @Query,
    @attach_query_result_as_file = 1,
    @query_attachment_filename = @FileName,
    @query_result_separator = ',',
    @query_result_no_padding = 1
END
What's going on here?
****EDIT FOR ADDITIONAL CLARIFICATION****
Using Microsoft SQL Server 2008 R2
Collation is Latin1_General_CI_AS
Leaving no_padding out leaves several trailing spaces in each returned field, each space (0x20) separated by a null (0x00).
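Those interleaved 00 bytes are consistent with the attachment being written out as UTF-16LE (Unicode) rather than ANSI, since the result set is NVARCHAR. A small illustration of the byte patterns involved (plain T-SQL, nothing specific to Database Mail):
-- ASCII text stored as NVARCHAR (UTF-16LE) carries a 0x00 high byte per character.
SELECT CONVERT(VARBINARY(20), N'ABC') AS utf16_bytes,  -- 0x410042004300
       CONVERT(VARBINARY(20), 'ABC')  AS ansi_bytes;   -- 0x414243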



Andrew
(13 rep)
Jan 10, 2017, 04:01 AM
• Last activity: May 27, 2024, 06:47 PM
4
votes
2
answers
11931
views
Nvarchar to varchar to nvarchar characters conversion
I had a table filled with some Greek text rows as nvarchar(2000).
Lately, I altered the type of the column to varchar(4000), but I realized that some Greek characters were shown as question marks.
So, I tried to change back to nvarchar(4000) in order to fix it, as I suppose that the characters' Unicode still remains the same.
I am just wondering, is there any way to fix this instead of restoring the backup I created before altering the table?
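For illustration, a sketch assuming the column sat under a Latin1-family collation such as SQL_Latin1_General_CP1_CI_AS: the NVARCHAR-to-VARCHAR conversion replaces characters that have no code-page equivalent, so the original code points are typically no longer in the column and converting back to nvarchar cannot recover them.
-- Greek text forced through VARCHAR under a Latin1 collation becomes '?'.
SELECT CAST(N'Ελλάδα' COLLATE SQL_Latin1_General_CP1_CI_AS AS VARCHAR(20)) AS lossy,  -- '??????'
       CAST(CAST(N'Ελλάδα' COLLATE SQL_Latin1_General_CP1_CI_AS AS VARCHAR(20)) AS NVARCHAR(20)) AS still_lossy;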
George Chondrompilas
(143 rep)
Sep 9, 2016, 06:32 PM
• Last activity: May 27, 2024, 06:47 PM
3
votes
1
answers
4565
views
How to insert a Unicode character verbose into a varchar DB?
I need to insert this character '●' into a VARCHAR column of a MSSQL database with collation set as SQL_Latin1_General_CP1_CI_AS (or at least mock what my Python + Windows MSSQL driver might have done). I tried to insert this from my client, but it gets inserted as ?.
The hex value as checked in Python is \xe2\x97\x8f, which is decimal 226, 151, 143.
The collation spec says that 226 is defined but 143 and 151 are undefined. So my best would be to insert 226?
The reason I am doing this exercise is that our app was inserting this character into the DB earlier. At this point, I don't know how it went in, but it went through a Python app with the Windows MSSQL driver, and our DB team wrote a script to correct it; apparently it got replaced with "?".
Now I am just trying to reproduce this scenario to ask them why it is getting replaced with "?", because we replace it with an empty string in the app. Since my app already has a fix for this issue (we remove anything greater than 127), I am trying to mock this directly from the back end to prove that it is getting replaced with '?', which is actually my claim, and I think it is done because it says "COLLATE" in the update query.
UPDATE pr_response
SET nur_respon =
REPLACE (nur_respon,
SUBSTRING(nur_respon, PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, nur_respon), 1),
'')
WHERE PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, nur_respon) > 0
Is this a fine script to remove Non-ASCII-Data > 127? Can someone explain this query in plain English?
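As an aside, a one-line illustration of the behaviour being reproduced (assuming the SQL_Latin1_General_CP1_CI_AS / code page 1252 setting from the question): U+25CF has no equivalent in that code page, so any implicit or explicit conversion to VARCHAR substitutes '?'.
-- Under a CP1252-based collation the bullet has no single-byte equivalent.
SELECT CAST(N'●' COLLATE SQL_Latin1_General_CP1_CI_AS AS VARCHAR(10)) AS converted;  -- returns '?'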
Nishant
(899 rep)
Nov 20, 2018, 12:42 PM
• Last activity: May 27, 2024, 06:47 PM
0
votes
1
answers
792
views
Generating MD5 Hash for NVARCHAR and VARCHAR columns across different databases
I am building a utility that generates a single MD5 hash across all CONCATENATED column values for any table row. To make this work, I have to eliminate NULL values using COALESCE and must CAST NUMBER and DATE types to VARCHAR using consistent patterns.
The purpose of this utility is to compare all the data across two different databases as part of a data migration. We wanted to make sure all of our converted ETL code (stored procedures) produced the same results with 100% accuracy after porting it from Netezza to Azure Synapse Analytics.
For the most part, this works extremely well and we were able to identify numerous bugs using this approach but still, this process isn't perfect ... We cannot use this with BLOB columns and we sometimes get slight differences with FLOAT types. But the real headache is how NVARCHAR is hashed differently between Netezza and Azure Synapse Analytics.
Here is an example... First, I will demonstrate what Netezza gives me.
show nz_encoding
-- NOTICE: UTF8
show server_encoding
-- Current server encoding is LATIN9
CREATE TABLE ENCODING_TEST (
VALUE1 CHARACTER VARYING(255),
VALUE2 NATIONAL CHARACTER VARYING(255)
);
-- The command completed successfully
INSERT INTO ENCODING_TEST (VALUE1, VALUE2) VALUES('très bien', 'très bien');
-- 1 rows affected
SELECT RAWTOHEX(HASH('très bien', 0)) FROM ENCODING_TEST
-- E5D128AFD34139A261C507DA18B3C558
SELECT RAWTOHEX(HASH(VALUE1, 0)) FROM ENCODING_TEST
-- E5D128AFD34139A261C507DA18B3C558
SELECT RAWTOHEX(HASH(VALUE2, 0)) FROM ENCODING_TEST
-- A54489E883CE7705CDBE1FDAA3AA8DF4
SELECT RAWTOHEX(HASH(CAST(VALUE2 AS VARCHAR(255)), 0)) FROM ENCODING_TEST
-- E5D128AFD34139A261C507DA18B3C558
And here is what Azure Synapse Analytics gives me...
CREATE TABLE WZ_BI.ENCODING_TEST (
VALUE1 CHARACTER VARYING(255),
VALUE2 NATIONAL CHARACTER VARYING(255)
);
-- Commands completed successfully.
INSERT INTO WZ_BI.ENCODING_TEST (VALUE1, VALUE2) VALUES('très bien', 'très bien');
-- (1 row affected)
SELECT HASHBYTES ('MD5', 'très bien') FROM WZ_BI.ENCODING_TEST
-- 0xE5D128AFD34139A261C507DA18B3C558
SELECT HASHBYTES ('MD5', VALUE1) FROM WZ_BI.ENCODING_TEST
-- 0xE5D128AFD34139A261C507DA18B3C558
SELECT HASHBYTES ('MD5', VALUE2) FROM WZ_BI.ENCODING_TEST
-- 0xC43534A6812499038457EDF545834866
SELECT HASHBYTES ('MD5', CAST(VALUE2 AS VARCHAR(255))) FROM WZ_BI.ENCODING_TEST
-- 0xE5D128AFD34139A261C507DA18B3C558
The question I have is: how come the MD5 hash for the NVARCHAR column in Netezza is different from the MD5 hash of the same type and value in Azure Synapse Analytics, when it treats the VARCHAR types the same? I really do not want to have to explicitly CAST all NVARCHAR types to VARCHAR to get these to work, but I have not found any other way to make them equivalent.
What am I missing here?
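One plausible piece of the puzzle, offered as a hedged illustration rather than a definitive diagnosis: HASHBYTES hashes exactly the bytes it receives, and an NVARCHAR value in Synapse is UTF-16LE, while Netezza (as I understand it) stores NVARCHAR as UTF-8, so identical text can hash differently unless both sides hash the same byte encoding, which is what the explicit CAST to VARCHAR effectively achieves.
-- T-SQL side: the same text yields different bytes (and therefore hashes) per type.
SELECT CONVERT(VARBINARY(40), 'très bien')  AS varchar_bytes,   -- code-page bytes
       CONVERT(VARBINARY(40), N'très bien') AS nvarchar_bytes,  -- UTF-16LE bytes
       HASHBYTES('MD5', 'très bien')        AS md5_varchar,
       HASHBYTES('MD5', N'très bien')       AS md5_nvarchar;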
Lauren_G
(69 rep)
Jun 29, 2023, 02:38 AM
• Last activity: Jan 12, 2024, 02:05 AM
0
votes
1
answers
612
views
sp_executesql changes non standard characters to?
I am trying to use sp_executesql to run commands, but I've come across an issue where non-English characters (Chinese and Japanese) are changed to ??? when run.
SQL Server 2008 R2 & 2012
**Example Code**
DECLARE @text nvarchar(30) = N'select ''J こんにちは C 你好'''
SELECT @text
SELECT N'J こんにちは C 你好'
exec sp_executesql @text
exec sp_executesql N'select ''J こんにちは C 你好'''
*J= Japanese 'Hi' C = Chinese 'Hi' as processed by google translate for testing purposes*
**Output**
------------------------------
select 'J こんにちは C 你好'
------------
J こんにちは C 你好
------------
J ????? C ??
------------
J ????? C ??
As you can see, when run naturally or when the code is selected, the output is returned as expected; however, once it is passed through sp_executesql it returns ? characters. It's worth noting that if you attempt to send a varchar instead of an nvarchar it errors out, so it definitely uses Unicode.
I have tried a few different collations for the input but that doesn't seem to change anything.
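A hedged observation, as a sketch only: in the examples above the inner literal inside the dynamic string has no N prefix, so within the inner batch it is a VARCHAR literal and the characters may be lost before the SELECT ever runs; prefixing the inner literal keeps it Unicode end to end (@text2 is just an illustrative variable name).
DECLARE @text2 nvarchar(60) = N'SELECT N''J こんにちは C 你好''';  -- note the inner N prefix
EXEC sp_executesql @text2;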
Ste Bov
(2101 rep)
Apr 11, 2018, 09:11 AM
• Last activity: Nov 3, 2023, 09:37 AM
9
votes
0
answers
758
views
Inconsistent equality checking of Unicode text in Postgres 9.5?
Using PostgreSQL 9.5. It seems like Postgres is being inconsistent with how it compares strings similar to
لى
. A unique constraint is considering some strings equal that GROUP BY
is considering different.
I have a select query that is using GROUP BY
on a TEXT
column and an aggregate function on some other column to ensure uniqueness of the TEXT
column in the output, and I'm inserting the results into a table with a primary key (and thus unique) constraint on the TEXT
column.
Looks similar to this; I've just changed the table names for simplicity:
INSERT INTO mytable (  -- mytable has string TEXT PRIMARY KEY, score INT
SELECT
    sq1.string string, sq2.score / sq1.score
FROM
    (
        SELECT n.string string, SUM(n.score) score
        FROM
            othertable1 n  -- has string TEXT (non-unique) and score INT
        GROUP BY string
    ) sq1,
    (
        SELECT n.string string, SUM(n.score) score
        FROM
            othertable2 n  -- has string TEXT (non-unique) and score INT
        GROUP BY string
    ) sq2
WHERE sq1.string = sq2.string
ORDER BY score DESC LIMIT 100000
);
This should never fail, right? It fails with a violation of the unique constraint on key لى.
And it has worked many times before with other data sets containing millions of rows in PG 9.3; I don't know whether لى
was in the data back then. I know Arabic has decorations you can put on the letters, so I wonder if that's tripping it up.
Does anyone have an alternative explanation, or should I report this as a bug once I can reproduce it more easily?
UPDATE: Confirmed that the query runs successfully on a PostgreSQL 9.3 server with the same data. There are some moving parts here, so I'm trying to find exactly what the problematic strings are so I can make a simple list of queries anyone can run to expose a bug.
UPDATE 2: Argh, I can't get my database to give me a set of strings I can copy into a table and expose a bug. I've been trying to do it with COPY TO
. Something along the way keeps stripping the Arabic text of the differences that are making it fail, I think. But I tried a simpler query, and it's also failing. It's more obvious that this should work:
INSERT INTO mytable (  -- mytable has string TEXT PRIMARY KEY, score INT
SELECT n.string string, SUM(n.score) score
FROM
    othertable2 n  -- has string TEXT (non-unique) and score INT
GROUP BY string
);
I'm still working towards getting something others can try because, of course, a bug report is useless if I say that it only works on my data.
UPDATE 3: I ran it again with different data and ran into the same problem with Cyrillic characters У
and В
. Making a table containing them didn't reveal anything. Same problem as with the Arabic text, I think. Something is getting stripped along the way.
UPDATE 4: This is definitely a bug. I'm still trying to figure out how to report this. I found a workaround:
WITH glitch(string, score) AS (
    SELECT n.string string, SUM(n.score) score
    FROM
        othertable2 n  -- has string TEXT (non-unique) and score INT
    GROUP BY string
)
INSERT INTO mytable (  -- mytable has string TEXT PRIMARY KEY, score INT
    SELECT DISTINCT ON(string) * FROM glitch
);
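A small diagnostic sketch that may help pin down the offending values (reusing the othertable2/string names above, with the problematic key as an illustrative filter): strings that the collation groups or sorts as equal can still differ at the byte level (combining marks, different normalization forms), and the raw bytes survive COPY better than the rendered text does.
-- Inspect the raw bytes of strings that collate as "the same".
SELECT string, convert_to(string, 'UTF8') AS raw_bytes
FROM othertable2
WHERE string LIKE '%لى%'
GROUP BY string, convert_to(string, 'UTF8');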
sudo
(1061 rep)
Feb 21, 2016, 06:49 AM
• Last activity: Aug 31, 2023, 06:57 AM
5
votes
2
answers
26234
views
Why do I get incorrect characters when decoding a Base64 string to NVARCHAR in SQL Server?
I've been looking at how to decode Base64 with SQL Server and after searching around many solutions online (some from here) seem to be based upon this sort of method.
SELECT CAST(CAST('Base64StringHere' as XML ).value('.','varbinary(max)') AS VARCHAR(250))
When I have ASCII text this works perfectly. However, when I have the following French text it gets corrupted (presumably due to the restrictions of VARCHAR):
Où est le café le plus proche?
T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=
And it provides the following output:
OÃ¹ est le cafÃ© le plus proche?
I thought the relatively simple fix would be to change the CAST to NVARCHAR but this results in corruption again.
SELECT CAST(CAST('T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=' as XML ).value('.','varbinary(max)') AS NVARCHAR(250) )
썏₹獥⁴敬挠晡꧃氠汰獵瀠潲档㽥
My search engine skills may be failing me, but I cannot seem to find someone else who has my problem.
Any thoughts?
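For what it's worth, the intermediate step shows where things go wrong: the Base64 payload decodes to UTF-8 bytes, which VARCHAR then reads in the database code page (the mojibake) and NVARCHAR reads as UTF-16LE (the CJK-looking garbage above). A sketch of just the decode step:
-- The XML trick only produces raw bytes; how they are CAST determines the damage.
SELECT CAST('T8O5IGVzdCBsZSBjYWbDqSBsZSBwbHVzIHByb2NoZT8=' AS XML
           ).value('.', 'varbinary(max)') AS decoded_utf8_bytes;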
Geesh_SO
(153 rep)
Jul 13, 2018, 02:45 PM
• Last activity: Aug 13, 2023, 08:41 AM
6
votes
2
answers
2548
views
Byte ordering for multibyte characters in SQL Server versus Oracle
I am currently in the process of migrating data from Oracle to SQL Server and I'm encountering an issue trying to validate the data post-migration.
Environment Details:
* Oracle 12 - AL32UTF8 character set
* Client - NLS_LANG - WE8MSWIN1252
* VARCHAR2 field
SQL Server 2016
* Latin1_General_CI_AS collation
* NVARCHAR field
I'm using DBMS_CRYPTO.HASH on Oracle to generate a checksum of the whole row, then copying to SQL and using HASHBYTES to generate a checksum of the whole row, which I'm then comparing to validate the data matches.
The checksums match for all rows, except those with multibyte characters.
For example, rows with this character: ◦ do not match in the checksums, even though the data is transferred correctly. When I use DUMP in Oracle or convert to VARBINARY in SQL Server the data matches exactly except for the bytes for this character.
In SQL Server, the bytes are 0xE625 and in Oracle they are 0x25E6.
Why are they ordered differently, and is there a reliable way to convert one to the other to ensure the checksum at the other end matches for strings with multi-byte characters?
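A short illustration of the SQL Server side (a sketch; the Oracle DUMP output depends on its own character set and byte order): NVARCHAR is UTF-16 little-endian, so the code point U+25E6 is stored with its bytes swapped, which is exactly the 0xE625 vs 0x25E6 difference described above. Hashing an agreed-upon canonical encoding on both sides before checksumming is one way to make the comparison stable for multi-byte characters.
SELECT CONVERT(VARBINARY(4), N'◦') AS utf16le_bytes,  -- 0xE625
       UNICODE(N'◦')               AS code_point;     -- 9702 = 0x25E6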
HandyD
(10432 rep)
Feb 7, 2019, 03:12 AM
• Last activity: Jun 27, 2023, 05:38 PM
3
votes
2
answers
17998
views
SQL Server error JSON text is not properly formatted. Unexpected character '?' is found at position 21
I am getting the below error while parsing any JSON string that contains special characters or emoji. How can I parse this?
> JSON text is not properly formatted. Unexpected character '?' is found at position 21.
Sample code:
DECLARE @OriginalJSON NVARCHAR(4000), @newvalue varchar(30),@path varchar(20)
Set @OriginalJSON='{
"test2": "ROYD\ロイド",
"test1": 0
}'
Select
@OriginalJSON,
JSON_VALUE(CONVERT(VARCHAR(MAX),@OriginalJSON), '$.test1') AS 'Updated JSON';
Is there any way to parse this data without modifying it? This is just sample code. I need to query from a table and cannot modify the data.
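A hedged sketch of the two details that matter in the repro above (not a statement about the real table data): the literal is assigned without an N prefix and then pushed through CONVERT(VARCHAR(MAX), ...), both of which turn ロイド into ?, and \ロ is not a legal JSON escape in the first place. Keeping the value NVARCHAR end to end and doubling the backslash lets JSON_VALUE parse it:
DECLARE @j NVARCHAR(4000) = N'{ "test2": "ROYD\\ロイド", "test1": 0 }';  -- \\ is a JSON-escaped backslash
SELECT JSON_VALUE(@j, '$.test1') AS test1,
       JSON_VALUE(@j, '$.test2') AS test2;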
Mish
(33 rep)
Feb 26, 2023, 04:50 PM
• Last activity: Feb 27, 2023, 03:29 PM
4
votes
1
answers
3175
views
Importing Chinese characters via SSIS
I'm trying to import a file that contains Chinese characters into a table. The file encoding is Big5 (Traditional).
Here is a sample file like one that I need to import: https://www.pastiebin.com/5d7782d9b63fa
Table where file has to be imported into has such a structure:
create table dbo.Test (
AccountId numeric(18, 0) not null
, Province nvarchar(50) null
, City nvarchar(50) collate Chinese_Hong_Kong_Stroke_90_CI_AI
, Country nvarchar(50) null
, Gender nvarchar(50) null
)
----------
If I import using OPENROWSET/BULK then all the data is transferred correctly:
select AccountId, Province, City, Country, Gender
from openrowset (
    bulk 'C:\chinese_sample.dat'
    , firstrow = 1
    , formatfile = 'C:\chinese_sample.xml'
) t
Here is a format file that I use: https://www.pastiebin.com/5d7783396f9e4
----------
But if I try to import the file using SSIS, then Chinese characters are not parsed correctly.
In the Flat File Source I use DataType string [DT_STR] and CodePage 950. Then I convert it to Unicode string [DT_WSTR].
Here is how the text is imported into the table: as we can see, some characters are parsed correctly and some are not.
What am I missing?
GriGrim
(322 rep)
Sep 10, 2019, 11:50 AM
• Last activity: Jan 23, 2023, 03:11 AM
0
votes
1
answers
544
views
Unable to translate Unicode character \\uDCC9 at index 5 to specified code page
I'm using MariaDB (and Adminer)
> MySQL version: 5.5.5-10.10.2-MariaDB-1:10.10.2+maria~ubu2204 through PHP extension MySQLi
And this is the output of show create table Pages:
CREATE TABLE `Pages` (
  `Id` bigint(20) NOT NULL AUTO_INCREMENT,
  `Guid` uuid NOT NULL DEFAULT uuid(),
  `Name` varchar(400) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci NOT NULL,
  `AcceptsComment` bit(1) NOT NULL DEFAULT b'1',
  `Key` varchar(400) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci NOT NULL DEFAULT uuid(),
  `IsVital` bit(1) DEFAULT NULL,
  `Slug` varchar(400) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci NOT NULL DEFAULT uuid(),
  PRIMARY KEY (`Id`),
  UNIQUE KEY `IX_Pages_Unique_Guid` (`Guid`),
  UNIQUE KEY `IX_Pages_Unique_Key` (`Key`),
  UNIQUE KEY `IX_Pages_Unique_Slug` (`Slug`),
  CONSTRAINT `Ck_Pages_NonEmptyGuid` CHECK (`Guid` <> '00000000-0000-0000-0000-000000000000')
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
I'm using [Bogus](https://github.com/bchavez/Bogus) to insert some fake data using MySqlBulkCopy
. But when I call bulkCopy.WriteToServer(table)
, I get this error:
> Unable to translate Unicode character \\uDCC9 at index 5 to specified code page.
I realized that [\uDCC9](https://charbase.com/dcc9-unicode-invalid-character) is a � character.
However, I use UTF for all of my columns.
I can't understand why this happens. What have I done wrong?
Saeed Neamati
(1515 rep)
Dec 15, 2022, 12:25 PM
• Last activity: Dec 15, 2022, 06:19 PM
4
votes
1
answers
652
views
Fix corrupted utf8 nvarchar value that was inserted without N prefix
Is it possible to restore the UTF8 data from a corrupted nvarchar that was used without the N prefix?
For example, in the following snippet, I'd like @n1_fixed to have the correct value based on @n1:
declare
@n1 nvarchar(10) = 'Ḿấxiḿứś',
@n2 nvarchar(10) = N'Ḿấxiḿứś';
declare
@n1_fixed nvarchar(10); -- somehow make it have the correct value, based on @n1
select iif(@n1_fixed = @n2, 1, 0)
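A short illustration of why this is generally not recoverable (a sketch, assuming a database whose default collation uses a Latin code page): without the N prefix the literal is interpreted as VARCHAR first, so characters outside the code page are replaced (best-fit or '?') before they ever reach @n1, and the original code points are no longer present to restore.
-- The damage happens at the literal, before assignment to the nvarchar variable.
SELECT 'Ḿấxiḿứś'  AS varchar_literal,   -- already lossy under a Latin code page
       N'Ḿấxiḿứś' AS nvarchar_literal;  -- original characters preserved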
HeyJude
(467 rep)
Oct 20, 2022, 09:16 AM
• Last activity: Oct 20, 2022, 11:28 AM
1
votes
1
answers
15786
views
How to replace multiple unicode characters
I have identified a couple of unicode characters in my database by means of the following query:
SELECT name FROM table WHERE name NOT LIKE '%'+CAST(name AS VARCHAR(1000))+'%'
...with which I get stuff like this:
MORRISON CPA LIMITED E¦¬Ç¦®EŒÆŒƑȨˆÅ¸«Äº‹Å‹™Æ‰€ÆŒ‰E™Å…¬Å¸
Is it possible to get the collation when using a REPLACE function? Something like:
UPDATE table SET name = REPLACE(name,NOT LIKE '%'+CAST(name AS VARCHAR(1000))+'%','')
I know I can individually replace them
REPLACE(name,'Æ','')
but is there a way to replace everything that is unicode?
Edit: I don't know why, but I get fewer rows if I use
SELECT name FROM table WHERE name != CAST(name AS VARCHAR(1000))
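A hedged sketch of one way to strip everything outside the printable ASCII range, using PATINDEX under a binary collation in the same spirit as the script quoted in the question; the variable and sample value are illustrative only, and a table-wide cleanup would need a loop or function around the same expression.
DECLARE @name NVARCHAR(1000) = N'MORRISON CPA LIMITED E¦¬Ç¦®';
-- Remove each character outside the space-to-tilde (0x20-0x7E) range.
WHILE PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, @name) > 0
    SET @name = STUFF(@name, PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, @name), 1, '');
SELECT @name;  -- 'MORRISON CPA LIMITED E'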
user151179
Jul 3, 2019, 07:03 PM
• Last activity: Aug 24, 2022, 12:08 PM
35
votes
1
answers
49989
views
How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?
I want to set a Unicode string variable to particular character based on its Unicode code point.
I want to use a code point beyond 65535, but the SQL Server 2008 R2 database has a collation of
SQL_Latin1_General_CP1_CI_AS
.
According to [Microsoft's NCHAR documentation](https://msdn.microsoft.com/en-us/library/ms182673.aspx) , the NCHAR
function takes an integer as follows:
> *integer_expression*
>
> When the collation of the database does not contain the supplementary
> character (SC) flag, this is a positive whole number from 0 through
> 65535 (0 through 0xFFFF). If a value outside this range is specified,
> NULL is returned. For more information about supplementary characters,
> see Collation and Unicode Support.
>
> When the collation of the database supports the supplementary
> character (SC) flag, this is a positive whole number from 0 through
> 1114111 (0 through 0x10FFFF). If a value outside this range is
> specified, NULL is returned.
So this code:
SELECT NCHAR(128512);
Returns NULL
in this database.
I would like it to return the same as this:
SELECT N'😀';
How can I set a Unicode string variable (e.g. nvarchar) to an emoji using code (without using the actual emoji character) in a database where the collation "does not contain the supplementary character (SC) flag"?
[Full list of emoji Unicode code points](http://www.unicode.org/emoji/charts/full-emoji-list.html)
(Ultimately I want any character to work. I just chose emoji for ease of reference.)
(Although the server is SQL Server 2008 R2, I'm also curious about any solutions for later versions.)
Assuming that there is no way, could I reference an inline user-defined function in another database which had an appropriate collation?
How do I find a collation which has the "supplementary character" flag?
This returns no records on our server:
SELECT * FROM sys.fn_helpcollations()
WHERE name LIKE 'SQL%[_]SC';
It seems like SQL Server 2012 introduced Latin1_General_100_CI_AS_SC
which would work. Can you install collations on older instances?
Collation References:
- [Answer to What is the difference between char, nchar, varchar, and nvarchar in SQL Server?](https://stackoverflow.com/a/17335317/2266979)
- [Microsoft's Supplementary Characters Collation Information](https://msdn.microsoft.com/en-us/library/ms143726.aspx#Anchor_2)
- [Microsoft's SQL Server 2008 R2 Collation List](https://msdn.microsoft.com/en-us/library/ms143508(v=sql.105).aspx)
Is there an explanation for why, regardless of collation, SQL Server can understand and deal with the extended characters except from the perspective of NCHAR
?
Riley Major
(1965 rep)
May 25, 2016, 08:02 PM
• Last activity: Jun 24, 2022, 03:35 PM
Showing page 1 of 20 total questions