This is a discussion on SUBSTRING and UTF-8 within the MySQL Database forums, part of the Database Forums category; Hi, I ran into a problem I couldn't figure out. What I want seems easy enough: the first two ...
|
|||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
|
|||
|
Hi,
I ran into a problem I couldn't figure out. What I want seems easy enough: the first two characters of a string copied to another string where both the original string and the destination string should be utf-8 encoded. It has to happen in a "stored procedure" (which is used by several other SPs). I've stripped that down to the smallest possible test case: ----------------------------------------------------------------------- DELIMITER $$ DROP FUNCTION IF EXISTS firstTwo$$ CREATE FUNCTION firstTwo($aNachname VARCHAR(63)) RETURNS VARCHAR(12) SQL SECURITY INVOKER BEGIN DECLARE _kurz VARCHAR(12) DEFAULT ''; SELECT SUBSTRING($aNachname, 1, 2) INTO _kurz; RETURN _kurz; END$$ DELIMITER ; ----------------------------------------------------------------------- Now, calling it by SELECT firstTwo('Müller'); returns "MÃ" but not "Mü" (note the u-umlaut at position two). Changing the third SUBSTRING argument to 3 gave the expected result in this case but, of course, wrong results with a teststring w/o UTF characters. The server's query log shows: ----------------------------------------------------------------------- 080520 11:31:34 2974 Connect matthias@somewhere.tld on theDatabase 2974 Query SET SESSION interactive_timeout=1000000 2974 Query SELECT @@sql_mode 2974 Query SET SESSION sql_mode='STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO _ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_Z ERO,TRADITIONAL,NO_AUTO_CREATE_USER' 2974 Query SET NAMES utf8 2974 Query SELECT @@SQL_MODE 2974 Query SELECT firstTwo('Müller') 2974 Quit ----------------------------------------------------------------------- It's the same with both "MySQLQueryBrowser" and the "mysql" console. Searching the web only showed pointers to "SET NAMES utf8" which is used anyway (see above), but even adding that statement directly into the SP didn't change anything. Any ideas what/where to look for? Thanks in advance. -- Matthias /"\ \ / ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL X - AGAINST M$ ATTACHMENTS / \ |
|
|||
|
Matthias Watermann wrote:
> Hi, > > I ran into a problem I couldn't figure out. What I want seems easy enough: > the first two characters of a string copied to another string where both > the original string and the destination string should be utf-8 encoded. > It has to happen in a "stored procedure" (which is used by several other > SPs). > > I've stripped that down to the smallest possible test case: > ----------------------------------------------------------------------- > DELIMITER $$ > > DROP FUNCTION IF EXISTS firstTwo$$ > CREATE FUNCTION firstTwo($aNachname VARCHAR(63)) RETURNS VARCHAR(12) > SQL SECURITY INVOKER > BEGIN > DECLARE _kurz VARCHAR(12) DEFAULT ''; > > SELECT SUBSTRING($aNachname, 1, 2) INTO _kurz; > > RETURN _kurz; > END$$ > > DELIMITER ; > ----------------------------------------------------------------------- I think the problem is that the encoding is not passed along with the parameter, and that the connection encoding is latin1 by default. When I query: SELECT SUBSTRING('Müller', 1, 2) I get the expected result of 'Mü' (and I use an utf-8 encoded connection: if I put a HEX function around it, I see 3 bytes). So the function works fine for utf-8 (MySQL version 5.0.51). You could put the encoding back in by explicitly giving it: SELECT SUBSTRING(_utf8 $aNachname, 1, 2) INTO _kurz; This may not be an optimal solution, as it will treat any value in $aNachname as utf-8 encoded. But if that is the case in your database, it should work. > > Now, calling it by > SELECT firstTwo('Müller'); > returns "MÃ" but not "Mü" (note the u-umlaut at position two). Best regards, -- Willem Bogaerts Application smith Kratz B.V. http://www.kratz.nl/ |
|
|||
|
On Wed, 21 May 2008 10:29:18 +0200, Willem Bogaerts wrote:
> Matthias Watermann wrote: >> [...] >> I ran into a problem I couldn't figure out. What I want seems easy enough: >> the first two characters of a string copied to another string where both >> the original string and the destination string should be utf-8 encoded. >> It has to happen in a "stored procedure" (which is used by several other >> SPs). >> >> I've stripped that down to the smallest possible test case: >> ----------------------------------------------------------------------- >> DELIMITER $$ >> >> DROP FUNCTION IF EXISTS firstTwo$$ >> CREATE FUNCTION firstTwo($aNachname VARCHAR(63)) RETURNS VARCHAR(12) >> SQL SECURITY INVOKER >> BEGIN >> DECLARE _kurz VARCHAR(12) DEFAULT ''; >> >> SELECT SUBSTRING($aNachname, 1, 2) INTO _kurz; >> >> RETURN _kurz; >> END$$ >> >> DELIMITER ; >> ----------------------------------------------------------------------- > > I think the problem is that the encoding is not passed along with the > parameter, and that the connection encoding is latin1 by default. Hmm, I posted the logfile entries to show that I've been using the correct encoding (see the "SET NAMES utf8" line). > When I query: > SELECT SUBSTRING('Müller', 1, 2) > I get the expected result of 'Mü' Me as well. However, used within a SP the wrong result is returned. 080521 10:59:47 697 Connect matthias@xxx on yyy 697 Query SET SESSION interactive_timeout=1000000 697 Query SELECT @@sql_mode 697 Query SET SESSION sql_mode='STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO _ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_Z ERO,TRADITIONAL,NO_AUTO_CREATE_USER' 697 Query SET NAMES utf8 697 Query SELECT @@SQL_MODE 697 Query SELECT SUBSTRING('Müller', 1, 2) 697 Quit 080521 11:01:04 698 Connect matthias@xxx on yyy 698 Query SET SESSION interactive_timeout=1000000 698 Query SELECT @@sql_mode 698 Query SET SESSION sql_mode='STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO _ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_Z ERO,TRADITIONAL,NO_AUTO_CREATE_USER' 698 Query SET NAMES utf8 698 Query SELECT @@SQL_MODE 698 Query SELECT firstTwo('Müller') 698 Quit As you can see, the connection setup for both is exactly the same. Now, exporting the resultsets produces this: "SUBSTRING('Müller', 1, 2)" "Mü" "firstTwo('Müller')" "MÃ" As you can see, the u-umlaut is encoded correctly for both _calls_ (in hex it's "C3 BC") but differently in the _results_ (hex "C3 83" in the second case i.e. the SP call). Therefor I suspect that the problem is some "magic" in the SP, but I couldn't figure out what. I've tried different column types for the local "_kurz" variable - to no avail. > [...] > connection: if I put a HEX function around it, I see 3 bytes). True, but unfortunately the wrong bytes in the SP call's result. > So the > function works fine for utf-8 (MySQL version 5.0.51). Just to be sure: Did you actually test the SP above, or just the SELECT statement? > You could put the encoding back in by explicitly giving it: > > SELECT SUBSTRING(_utf8 $aNachname, 1, 2) INTO _kurz; That gives me (MySQL 5.0.22): ERROR 1064 (42000) at line 6: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '$aNachname, 1, 2) INTO _kurz; > This may not be an optimal solution, as it will treat any value in > $aNachname as utf-8 encoded. But if that is the case in your database, > it should work. Yeah, that's fine. > [...] >> Now, calling it by >> SELECT firstTwo('Müller'); >> returns "MÃ" but not "Mü" (note the u-umlaut at position two). Thank you, Willem, for your thoughts! -- Matthias /"\ \ / ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL X - AGAINST M$ ATTACHMENTS / \ |
|
|||
|
> Just to be sure: Did you actually test the SP above, or just the
> SELECT statement? No. just the select statement. That is why I thought that the encoding somehow went missing with the parameter passing. > >> You could put the encoding back in by explicitly giving it: >> >> SELECT SUBSTRING(_utf8 $aNachname, 1, 2) INTO _kurz; Stupid of me. This is the syntax for literals, like: SELECT _utf8 0x4DC3BC6C6C6572; (which is about the only way to get a string correctly into MySQL from a statement that may be reencoded before being sent to the database) For expressions, it should be CONVERT($aNachName USING utf8) . Not tested though. Good luck, -- Willem Bogaerts Application smith Kratz B.V. http://www.kratz.nl/ |