[mythtv] Database schema upgrade 1215 and 1216 causes loss of data
Michael T. Dean
mtdean at thirdcontact.com
Wed Apr 23 20:21:42 UTC 2008
On 04/23/2008 03:58 PM, mythtv at miwers.dk wrote:
> I just took a look at ticket #5070 after having problems upgrading the
> database (duplicate key errors).
> http://svn.mythtv.org/trac/ticket/5070
>
> It seems that converting the columns from latin1 to utf8 causes loss of
> data, when the columns are first converted from varchar to varbinary and
> then back to varchar again.
>
> When MySQL hits a special character during conversion from varbinary to
> varchar, it gives a "Warning (1366) Incorrect string value" and
> truncates the rest of the string.
>
> Since we're talking loss of data here (program titles, recording titles
> and such), someone really should revert or fix that change, so nobody
> else looses data over this. (good thing there's a DB backup, but still...)
>
> I made som tests and posted my results in the ticket.
> Skipping the varchar to varbinary conversion, and instead convert
> directly from varchar (latin1) to varchar (utf8) worked fine for me
> without loss of data.
>
I'm not a dev, so don't get too upset by my (not important) opinion, but
considering there should /never/ be any "special" characters in any
valid MythTV database, this is not a problem.
Myth has always required a latin1 character encoding in the DB. Where
characters exist that cannot be expressed in latin1, they are encoded as
multiple latin1 characters before being written to the DB. Myth had
always done the character encoding conversions itself upon reading the
data from the DB. This approach allowed us to use 8-byte characters for
most data (throughout much of the world) and to only get a multi-byte
penalty where special (non-latin) characters actually exist in the data.
Many have warned users not to convert their databases to use other
character sets. I've been saying that doing so is creating a database
time bomb--this being yet another case where the time bomb goes off.
See the nice big warning at the top of:
http://www.mythtv.org/wiki/index.php/Utf8_Text_in_OSD
> Is there a reason why it's converted to varbinary first?
> It seems like an unnecessary step to me...
Because we need to convert the UTF-8 data stored in latin1 encoding to
real UTF-8. If we have MySQL do the conversion directly from latin1 to
UTF-8 on a /valid/ DB (one where any multibyte characters would be
encoded as multiple latin1 characters), MySQL will add extra (garbage)
characters where the extra bytes (which MySQL thinks are extra
characters) of the multibyte character exist. By converting from latin1
to varbinary (=no character set, no data conversions), we're removing
MySQL's assumptions about the data format so that it can convert from
varbinary (=raw data already in UTF-8 binary format) to (MySQL's
internally-recognized) UTF-8.
Therefore, I'd say that anyone who has corrupted his/her database by
doing character set conversions is likely to have issues, and, IMHO, the
devs shouldn't be expected to waste their time cleaning up messes that
users made of their own data by altering the database outside the
official process.
Users who had problems with character encodings should have reported the
issues as specific bugs that could have been fixed properly, but it
seems many people took the "easy" way and instead broke their DB's.
If you really want to help, figure out how to convert your (broken)
pre-upgrade UTF-8 database to latin1 with UTF-8 encoded in latin1 so
that any users who broke their databases can unbreak them before doing
an upgrade. Then, document the process on the wiki (with links from,
i.e., the broken page to which I linked, above).
Again, just my opinion.
Mike
More information about the mythtv-dev
mailing list