[Mythtv-translators] Themestrings has been updated for 0.25 - It's time to start translating! :)

Fri Mar 23 06:55:23 UTC 2012

Hi!

On 3/22/2012 2:24 PM, Kenni Lund wrote:
> 2012/3/22 Nicolas Riendeau<knight at teksavvy.com>
>> On 3/22/2012 12:57 PM, Nick Morrott wrote:
>>> I noticed some UTF-8 weirdness today after updating for the en_gb
>>> translations. The XML element generated by lupdate containing the
>>> description text for the Steppes theme (it contains "Français") was
>>> not generated with valid UTF-8, but rather each of the two bytes
>>> representing the "ç" character (should be C3 A7) was further
>>> re-encoded into UTF-8 so that 4 bytes in total (C3 83 C2 A7) were
>>> output for the character in the file.
>
> Heh...that bug just won't die :) First the encoding issue appeared in
> the theme downloader generation script, then in the themestrings tool
> and now in lupdate...

(Kenni I know you most likely know a good deal of this but since I'm 
posting this to the mailing list I might as well document the problem we 
had with this...)

I can only assume that all of these scripts/programs assumed that the 
strings were all in US-ASCII, or that when they were written there wasn't 
anything to test them with to make sure they produced the expected results.

In the case of the themestrings tool what was most likely happening is 
that the output, which was already forced to UTF-8, was later 
re-encoded into our local character set (which is most likely UTF-8 for 
many if not all of us) by the QTextStream.

This time the problem is slightly different... By default lupdate 
assumes that we are using ISO-8859-1** in the source files (when what we 
are trying to make it process is in UTF-8) so it takes the "ç", which is 
encoded using two bytes in UTF-8, assumes those are actually two 
characters in ISO-8859-1 and proceeds to re-encode each of them into UTF-8 
to store them in the translation file, with catastrophic results.

** lupdate's default encoding, also known as Latin1.
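
To make the byte values Nick reported concrete, here is a small Python 
sketch of the same misreading. It is only an illustration of the 
mechanism, not what lupdate does internally; the function names are made 
up for this example.

```python
def double_encode(utf8_bytes: bytes) -> bytes:
    """Misread UTF-8 bytes as ISO-8859-1, then re-encode the result as UTF-8."""
    # Every byte becomes its own ISO-8859-1 character...
    misread_text = utf8_bytes.decode("iso-8859-1")
    # ...and each of those characters is then encoded into UTF-8 again.
    return misread_text.encode("utf-8")


def undo_double_encode(broken_bytes: bytes) -> bytes:
    """Reverse the damage: decode as UTF-8, map the characters back to single bytes."""
    return broken_bytes.decode("utf-8").encode("iso-8859-1")


# "\u00e7" is the "ç" from "Français"; UTF-8 stores it as the two bytes C3 A7.
correct = "\u00e7".encode("utf-8")
broken = double_encode(correct)

print(correct)                      # b'\xc3\xa7'
print(broken)                       # b'\xc3\x83\xc2\xa7'
print(undo_double_encode(broken))   # b'\xc3\xa7'
```

The four bytes C3 83 C2 A7 are the UTF-8 encodings of "Ã" (C3 83) and "§" 
(C2 A7), which is what the bytes C3 and A7 mean when read as ISO-8859-1.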

The reason why this never caused problems before is that our strings are 
normally in US-ASCII and both ISO-8859-1 and UTF-8 are supersets of 
US-ASCII. What this means is that as long as the original text only 
contains US-ASCII characters its encoding is *identical* in both 
ISO-8859-1 and UTF-8.

While both are supersets of US-ASCII, they do not encode non-US-ASCII 
characters in the same way (even if the character values match).

So as long as everything was in US-ASCII none of these encoding problems 
popped up...
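
A quick way to see this, sketched in Python (the sample strings and the 
helper name are just picked for illustration):

```python
def same_bytes_in_both(text: str) -> bool:
    """True if the text encodes to identical bytes in ISO-8859-1 and UTF-8."""
    # Note: encode("iso-8859-1") raises UnicodeEncodeError for characters
    # outside Latin1, so this helper is only meant for Latin1-range text.
    return text.encode("iso-8859-1") == text.encode("utf-8")


ascii_only = "Steppes theme"
with_cedilla = "Fran\u00e7ais"  # "Français"

print(same_bytes_in_both(ascii_only))      # True
print(same_bytes_in_both(with_cedilla))    # False

# Same character value (U+00E7), different bytes on disk:
print(with_cedilla.encode("iso-8859-1"))   # b'Fran\xe7ais'
print(with_cedilla.encode("utf-8"))        # b'Fran\xc3\xa7ais'
```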

>> I think I have an idea how to fix it (assuming lupdate is actually able
>> to extract UTF-8 correctly)
>
> Ok, good, I haven't looked at it yet.

I'll do a few more spot checks tomorrow (it's pretty late here now) but 
unless I find a problem with the fix I found (I'm not expecting to find 
any though) I'll commit the fix in every file except for the one for the 
programs under mythtv/. We don't want to fix that one right now since it 
would actually add a new string.

(The fix for that one will be added at a later time...)

The resulting translation file is encoded correctly after applying the 
fix and it will display correctly in the main translation window but Qt 
Linguist will still be unable to display the source file correctly (a 
bug in Qt Linguist).

>
>> freeze. If we get reports from any of the translators that some strings
>> are untranslatable we *might* temporarily break the string freeze in
>> order to correct these issues and fix this at the same time..
>
> Yep, if it's the only string that needs fixing, let's just fix it
> through the translations. If we want to, we can always fix the source
> string as well as the single character in all of the translations, on
> the day before the release of 0.25.

Yep, the problem is quite harmless and doesn't justify adding a new 
string at this time...

Have a nice day!

Nicolas