SUBSTRING and UTF-8

This is a discussion on SUBSTRING and UTF-8 within the MySQL Database forums, part of the Database Forums category; Hi, I ran into a problem I couldn't figure out. What I want seems easy enough: the first two ...


Go Back   Usenet Forums > Database Forums > MySQL Database

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 05-20-2008
Matthias Watermann
 
Posts: n/a
Default SUBSTRING and UTF-8

Hi,

I ran into a problem I couldn't figure out. What I want seems easy enough:
the first two characters of a string copied to another string where both
the original string and the destination string should be utf-8 encoded.
It has to happen in a "stored procedure" (which is used by several other
SPs).

I've stripped that down to the smallest possible test case:
-----------------------------------------------------------------------
DELIMITER $$

DROP FUNCTION IF EXISTS firstTwo$$
CREATE FUNCTION firstTwo($aNachname VARCHAR(63)) RETURNS VARCHAR(12)
SQL SECURITY INVOKER
BEGIN
DECLARE _kurz VARCHAR(12) DEFAULT '';

SELECT SUBSTRING($aNachname, 1, 2) INTO _kurz;

RETURN _kurz;
END$$

DELIMITER ;
-----------------------------------------------------------------------

Now, calling it by
SELECT firstTwo('Müller');
returns "MÃ" but not "Mü" (note the u-umlaut at position two).
Changing the third SUBSTRING argument to 3 gave the expected result
in this case but, of course, wrong results with a teststring w/o
UTF characters.

The server's query log shows:
-----------------------------------------------------------------------
080520 11:31:34 2974 Connect matthias@somewhere.tld on theDatabase
2974 Query SET SESSION interactive_timeout=1000000
2974 Query SELECT @@sql_mode
2974 Query SET SESSION sql_mode='STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO _ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_Z ERO,TRADITIONAL,NO_AUTO_CREATE_USER'
2974 Query SET NAMES utf8
2974 Query SELECT @@SQL_MODE
2974 Query SELECT firstTwo('Müller')
2974 Quit
-----------------------------------------------------------------------

It's the same with both "MySQLQueryBrowser" and the "mysql" console.
Searching the web only showed pointers to "SET NAMES utf8" which is
used anyway (see above), but even adding that statement directly
into the SP didn't change anything.

Any ideas what/where to look for?
Thanks in advance.


--
Matthias
/"\
\ / ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
X - AGAINST M$ ATTACHMENTS
/ \
Reply With Quote
  #2 (permalink)  
Old 05-21-2008
Willem Bogaerts
 
Posts: n/a
Default Re: SUBSTRING and UTF-8

Matthias Watermann wrote:
> Hi,
>
> I ran into a problem I couldn't figure out. What I want seems easy enough:
> the first two characters of a string copied to another string where both
> the original string and the destination string should be utf-8 encoded.
> It has to happen in a "stored procedure" (which is used by several other
> SPs).
>
> I've stripped that down to the smallest possible test case:
> -----------------------------------------------------------------------
> DELIMITER $$
>
> DROP FUNCTION IF EXISTS firstTwo$$
> CREATE FUNCTION firstTwo($aNachname VARCHAR(63)) RETURNS VARCHAR(12)
> SQL SECURITY INVOKER
> BEGIN
> DECLARE _kurz VARCHAR(12) DEFAULT '';
>
> SELECT SUBSTRING($aNachname, 1, 2) INTO _kurz;
>
> RETURN _kurz;
> END$$
>
> DELIMITER ;
> -----------------------------------------------------------------------


I think the problem is that the encoding is not passed along with the
parameter, and that the connection encoding is latin1 by default. When I
query:
SELECT SUBSTRING('Müller', 1, 2)
I get the expected result of 'Mü' (and I use an utf-8 encoded
connection: if I put a HEX function around it, I see 3 bytes). So the
function works fine for utf-8 (MySQL version 5.0.51). You could put the
encoding back in by explicitly giving it:

SELECT SUBSTRING(_utf8 $aNachname, 1, 2) INTO _kurz;

This may not be an optimal solution, as it will treat any value in
$aNachname as utf-8 encoded. But if that is the case in your database,
it should work.

>
> Now, calling it by
> SELECT firstTwo('Müller');
> returns "MÃ" but not "Mü" (note the u-umlaut at position two).


Best regards,
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/
Reply With Quote
  #3 (permalink)  
Old 05-21-2008
Matthias Watermann
 
Posts: n/a
Default Re: SUBSTRING and UTF-8

On Wed, 21 May 2008 10:29:18 +0200, Willem Bogaerts wrote:

> Matthias Watermann wrote:
>> [...]
>> I ran into a problem I couldn't figure out. What I want seems easy enough:
>> the first two characters of a string copied to another string where both
>> the original string and the destination string should be utf-8 encoded.
>> It has to happen in a "stored procedure" (which is used by several other
>> SPs).
>>
>> I've stripped that down to the smallest possible test case:
>> -----------------------------------------------------------------------
>> DELIMITER $$
>>
>> DROP FUNCTION IF EXISTS firstTwo$$
>> CREATE FUNCTION firstTwo($aNachname VARCHAR(63)) RETURNS VARCHAR(12)
>> SQL SECURITY INVOKER
>> BEGIN
>> DECLARE _kurz VARCHAR(12) DEFAULT '';
>>
>> SELECT SUBSTRING($aNachname, 1, 2) INTO _kurz;
>>
>> RETURN _kurz;
>> END$$
>>
>> DELIMITER ;
>> -----------------------------------------------------------------------

>
> I think the problem is that the encoding is not passed along with the
> parameter, and that the connection encoding is latin1 by default.


Hmm, I posted the logfile entries to show that I've been using the
correct encoding (see the "SET NAMES utf8" line).

> When I query:
> SELECT SUBSTRING('Müller', 1, 2)
> I get the expected result of 'Mü'


Me as well. However, used within a SP the wrong result is returned.

080521 10:59:47 697 Connect matthias@xxx on yyy
697 Query SET SESSION interactive_timeout=1000000
697 Query SELECT @@sql_mode
697 Query SET SESSION sql_mode='STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO _ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_Z ERO,TRADITIONAL,NO_AUTO_CREATE_USER'
697 Query SET NAMES utf8
697 Query SELECT @@SQL_MODE
697 Query SELECT SUBSTRING('Müller', 1, 2)
697 Quit

080521 11:01:04 698 Connect matthias@xxx on yyy
698 Query SET SESSION interactive_timeout=1000000
698 Query SELECT @@sql_mode
698 Query SET SESSION sql_mode='STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO _ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_Z ERO,TRADITIONAL,NO_AUTO_CREATE_USER'
698 Query SET NAMES utf8
698 Query SELECT @@SQL_MODE
698 Query SELECT firstTwo('Müller')
698 Quit

As you can see, the connection setup for both is exactly the same.
Now, exporting the resultsets produces this:

"SUBSTRING('Müller', 1, 2)"
"Mü"

"firstTwo('Müller')"
"MÃ"

As you can see, the u-umlaut is encoded correctly for both _calls_ (in
hex it's "C3 BC") but differently in the _results_ (hex "C3 83" in the
second case i.e. the SP call). Therefor I suspect that the problem is
some "magic" in the SP, but I couldn't figure out what. I've tried
different column types for the local "_kurz" variable - to no avail.

> [...]
> connection: if I put a HEX function around it, I see 3 bytes).


True, but unfortunately the wrong bytes in the SP call's result.

> So the
> function works fine for utf-8 (MySQL version 5.0.51).


Just to be sure: Did you actually test the SP above, or just the
SELECT statement?

> You could put the encoding back in by explicitly giving it:
>
> SELECT SUBSTRING(_utf8 $aNachname, 1, 2) INTO _kurz;


That gives me (MySQL 5.0.22):
ERROR 1064 (42000) at line 6: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '$aNachname, 1, 2) INTO _kurz;

> This may not be an optimal solution, as it will treat any value in
> $aNachname as utf-8 encoded. But if that is the case in your database,
> it should work.


Yeah, that's fine.

> [...]
>> Now, calling it by
>> SELECT firstTwo('Müller');
>> returns "MÃ" but not "Mü" (note the u-umlaut at position two).


Thank you, Willem, for your thoughts!

--
Matthias
/"\
\ / ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
X - AGAINST M$ ATTACHMENTS
/ \
Reply With Quote
  #4 (permalink)  
Old 05-21-2008
Willem Bogaerts
 
Posts: n/a
Default Re: SUBSTRING and UTF-8

> Just to be sure: Did you actually test the SP above, or just the
> SELECT statement?


No. just the select statement. That is why I thought that the encoding
somehow went missing with the parameter passing.
>
>> You could put the encoding back in by explicitly giving it:
>>
>> SELECT SUBSTRING(_utf8 $aNachname, 1, 2) INTO _kurz;


Stupid of me. This is the syntax for literals, like:
SELECT _utf8 0x4DC3BC6C6C6572;
(which is about the only way to get a string correctly into MySQL from a
statement that may be reencoded before being sent to the database)

For expressions, it should be CONVERT($aNachName USING utf8) . Not
tested though.

Good luck,
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/
Reply With Quote
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are Off
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On



All times are GMT +1. The time now is 06:01 AM.


Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO 3.0.0