[mythtv-users] Charactersets and whats gone wrong with my database ?

Nick Morrott knowledgejunkie at gmail.com
Sun Oct 4 16:17:50 UTC 2009


2009/10/4 Panachoi <panachoi at gmail.com>:
> I'm stumped on this one. After a recent mythtv upgrade (from trunk to a
> newer version of trunk), my recorded information seems to be "corrupted"
> somehow. Basically, the title and description no longer display the encoded
> foreign characters correctly. The data appears to be correct in the database
> itself, using mysql query from the command line:
>          chanid: 17008
>       starttime: 2009-07-01 20:39:00
>         endtime: 2009-07-01 23:36:00
>           title: Les classiques du cinéma
>        subtitle:
>     description: Patton Pendant la seconde guerre mondiale, la vie d'un
> général rebelle dans l'armée américaine, mais reconnu comme un génie de la
> stratégie. En 1943, le général Patton est envoyé à Tunis afin de reprendre
> une situation difficile, l'armée US ayant du mal à prendre le pas sur
> l'Africa Korps nazi. Il enchaîne les succès mais refuse de laisser les
> lauriers de sa victoire au maréchal Montgomery, des forces armées
> britanniques. Il est alors relevé de son commandement et va attendre
> plusieurs mois avant de se retrouver à la tête d'une division.  (Etats-Unis
> d'Amérique-1970)
>
> I first noticed this in Mythweb, so I thought this was restricted to an
> Apache/php problem, but I now see the corrupted characters (i.e. any kind of
> foreign accent character) in the frontend when browsing the recordings.
>
> In the frontend/mythweb, this appears as:
>
> Patton Pendant la seconde guerre mondiale, la vie d'un général rebelle
> dans l'armée américaine, mais reconnu comme un génie de la stratégie. En
> 1943, le général Patton est envoyé à Tunis afin de reprendre une
> situation difficile, l'armée US ayant du mal à prendre le pas sur l'Africa
> Korps nazi. Il enchaîne les succès mais refuse de laisser les lauriers de
> sa victoire au maréchal Montgomery, des forces armées britanniques. Il est
> alors relevé de son commandement et va attendre plusieurs mois avant de se
> retrouver à la tête d'une division. (Etats-Unis d'Amérique-1970)
>
> Strangely enough, the newer entries (i.e. recordings made after the upgrade)
> seem to be correct. I'm not sure what's going on, and where it all went
> wrong. Any ideas ? Thanks.

* I recently did a lot of detective work for the uk_rt XMLTV grabber
to correct different types of encoding errors in the source data.

I (and most probably many others) can tell you that those odd
characters - e.g. 'Ã©' instead of 'é' - are the 2 bytes that the UTF-8
encoding of the accented character 'é' uses. Your frontend/mythweb
might be showing you the raw octets 0xC3 0xA9 as ISO-8859-1 instead
of interpreting them as UTF-8 data (where multibyte sequences are
used for every character whose codepoint is greater than 0x007F,
which includes such accented characters). This might well be the
issue, as the data in the database "appears" to be good (although to
be sure you need to see the raw bytes).
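
As a rough illustration of that misinterpretation, here is a small
Python sketch. It is not taken from MythTV or mythweb; it only shows
the same two octets being decoded once with each charset.

```python
# 'é' is code point U+00E9; UTF-8 encodes it as the two octets C3 A9.
raw = "\u00e9".encode("utf-8")
print(raw.hex())                  # c3a9

# Read as ISO-8859-1, every octet becomes its own character,
# so the single 'é' turns into the pair 'Ã' (0xC3) and '©' (0xA9).
wrong = raw.decode("iso-8859-1")
print(wrong)                      # Ã©

# Read as UTF-8, the two octets form one character again.
right = raw.decode("utf-8")
print(right)                      # é
```

If the octets stored in the database look like the first line, the
data itself is fine and only the display side is using the wrong
charset.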

Another problem you might also find if you look at the raw data from
the database (at the byte level, without any character encoding
applied) is that there are double-encoded UTF-8 characters, which will
display incorrectly in every case. As an example of this, a character
such as 'é' is UTF-8 encoded using the 2 bytes 0xC3 0xA9 (equivalent
to 'Ã©' in ISO-8859-1). You may instead find that each of those 2
bytes has been re-encoded into UTF-8 as if it were a character in
another charset, such as ISO-8859-1, using 4 bytes in total for the
intended character (0xC3 0x83 0xC2 0xA9). When the resulting data is
interpreted as UTF-8, each pair of bytes is decoded into one character
per UTF-8 decoding rules, which gives 'Ã©' rather than 'é'. To be
shown correctly, those characters have to be mapped back to single
bytes and put through another round of UTF-8 decoding to be left with
the desired character.
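
A minimal Python sketch of the double-encoding case, assuming the
intermediate charset really was ISO-8859-1 as in the example above.
The function names are my own, purely for illustration, and the
repair is a heuristic: text that legitimately contains a sequence
like 'Ã©' would be "repaired" as well.

```python
def double_encode(text: str) -> bytes:
    """Reproduce the fault: UTF-8 octets misread as ISO-8859-1,
    then encoded to UTF-8 a second time."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def repair_double_encoding(data: bytes) -> str:
    """Undo one surplus round of UTF-8 encoding.

    If the extra round trip fails, the data was not double-encoded
    (under the ISO-8859-1 assumption), so the plain UTF-8 decoding
    is returned unchanged.
    """
    once = data.decode("utf-8")
    try:
        # Map each character back to a single octet, then decode again.
        return once.encode("iso-8859-1").decode("utf-8")
    except UnicodeError:
        return once


broken = double_encode("\u00e9")
print(broken.hex())                      # c383c2a9
print(repair_double_encoding(broken))    # é
```

Correctly encoded data passes through untouched, because a lone
0xE9 octet followed by an ASCII letter is not valid UTF-8 and the
second decode raises an error.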

Cheers,
Nick

-- 
Nick Morrott

MythTV Official wiki:
http://mythtv.org/wiki/
MythTV users list archive:
http://www.gossamer-threads.com/lists/mythtv/users

"An investment in knowledge always pays the best interest." - Benjamin Franklin

