[mythtv-commits] Ticket #5946: Fix (unimportant) data corruption in database character set conversion

Mon Dec 1 22:59:34 UTC 2008

#5946: Fix (unimportant) data corruption in database character set conversion
----------------------------------------------+-----------------------------
 Reporter:  sphery <mtdean at thirdcontact.com>  |        Owner:  ijr    
     Type:  patch                             |       Status:  new    
 Priority:  minor                             |    Milestone:  unknown
Component:  mythtv                            |      Version:  head   
 Severity:  medium                            |   Resolution:         
  Mlocked:  0                                 |  
----------------------------------------------+-----------------------------

Comment(by sphery <mtdean at thirdcontact.com>):

 mythtv-5946-fix_database_utf8_conversion_corruption.patch fixes the
 corruption caused by the initial implementation of the UTF-8 conversion
 in DB update 1216 (that is, the implementation as it was before
 mythtv-5946-fix_database_utf8_conversion.patch).

 Because the credits and recordedcredits tables refer to person IDs in the
 people table, the fix involves:
  1. query all corrupt records in people
  2. find all records that are duplicates once null-padding is removed
   a. if there are duplicates:
    1. update references in {,recorded}credits to refer to the original
       (corrupt) person
    2. delete the duplicate (not-null-padded) record from people
    3. update the corrupt (null-padded) name on the original person record
   b. if there are no duplicates:
    1. update the corrupt (null-padded) name on the original person record

 Though we update the corrupt (null-padded) name on the original person
 record whether or not there are duplicates, we cannot just move that code
 outside the duplicate-checking conditional.  If there are duplicates, it
 is only safe to correct the name if we were able to update the references
 (and we only delete the duplicate if we were able to update the
 references).

 Also, it seems inefficient to update the {,recorded}credits table to use
 the corrupt person, delete the not-corrupt person, then correct the
 corrupt person's name, but we need to use this approach to ensure that
 data is valid even if we have multiple duplicates.  For example, if users
 have edited database data directly, they may have gotten additional
 duplicates with varying numbers of null-pad characters.
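
 The steps above can be sketched roughly as follows.  This is not the code
 from the patch: it is a minimal Python illustration that uses an in-memory
 SQLite database as a stand-in for the MySQL schema.  The simplified table
 layout and the helper names (make_demo_db, repoint_references, set_name,
 fix_null_padded_people) are assumptions made for the example, and the real
 credits tables have more columns and keys than shown here.

```python
import sqlite3

NUL = "\0"  # the padding character described in the ticket


def make_demo_db():
    """Build a tiny, hypothetical stand-in for the people/credits tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (person INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE credits (person INTEGER, role TEXT);
        CREATE TABLE recordedcredits (person INTEGER, role TEXT);
    """)
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [(1, "Bob\0\0"), (2, "Bob"), (3, "Alice\0"), (4, "Carol"), (5, "Bob\0")],
    )
    conn.executemany(
        "INSERT INTO credits VALUES (?, ?)",
        [(1, "actor"), (2, "director"), (5, "writer")],
    )
    conn.execute("INSERT INTO recordedcredits VALUES (?, ?)", (2, "actor"))
    conn.commit()
    return conn


def repoint_references(conn, old_id, new_id):
    """Point {,recorded}credits rows at new_id; return True on success."""
    try:
        for table in ("credits", "recordedcredits"):
            conn.execute(
                f"UPDATE {table} SET person = ? WHERE person = ?",
                (new_id, old_id),
            )
        return True
    except sqlite3.Error:
        return False


def set_name(conn, person_id, name):
    """Write the corrected name onto one people record."""
    conn.execute("UPDATE people SET name = ? WHERE person = ?", (name, person_id))


def fix_null_padded_people(conn):
    """Strip NUL padding from people.name, merging duplicates; return count fixed."""
    rows = conn.execute("SELECT person, name FROM people ORDER BY person").fetchall()
    # Filter in Python so no SQL string function has to handle embedded NULs.
    corrupt = [(pid, name) for pid, name in rows if name.endswith(NUL)]
    fixed = 0
    for pid, name in corrupt:
        clean = name.rstrip(NUL)
        dup = conn.execute(
            "SELECT person FROM people WHERE name = ?", (clean,)
        ).fetchone()
        if dup is not None:
            # Only delete the duplicate and fix the name if references moved.
            if not repoint_references(conn, dup[0], pid):
                continue
            conn.execute("DELETE FROM people WHERE person = ?", (dup[0],))
            set_name(conn, pid, clean)
        else:
            set_name(conn, pid, clean)
        fixed += 1
    conn.commit()
    return fixed
```

 In the demo data, "Bob" exists once unpadded and twice with different
 amounts of padding.  Each corrupt record in turn absorbs whichever record
 currently holds the clean name, so all references end up on a single
 surviving person.  Deleting the duplicate before renaming also avoids
 violating the UNIQUE constraint that the sketch places on name.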

 The update is rather long-running.  For various test cases on my Athlon X2
 5000+ dev system, it took:
  * 80229 records + 80229 dups = 58s
  * 80229 records + 39771 dups (40458 corrupt, not dup) = 43s
  * 80229 records + 0 dups (39771 corrupt records) = 11s
  * 80229 records + 0 dups (0 corrupt records) = 0.5s (because of the
    fixes for programgenres, programrating, and recordedrating)
  * 72650 records with mythtv-fix_database_utf8_conversion.patch applied
    to prevent corruption = 0.5s

-- 
Ticket URL: <http://svn.mythtv.org/trac/ticket/5946#comment:1>
MythTV <http://www.mythtv.org/>