[mythtv-users] Radio Times XMLTV failing

Neil Bird neil at fnxweb.com
Tue Oct 10 09:33:35 UTC 2006


Around about 03/10/06 02:21, Nick typed ...
> There are a couple of other substitutions performed at this stage to
> replace other characters commonly seen in the data, but I think this
> may be the first time this character sequence has been seen.

   For the record, this is my current version of tv_grab_uk_rt's character
conversion code.  I spent quite a while hunting down rogue accented characters
and the like, and this seems to do the trick for me:


     # Tidy up HTML entities and bad characters.  The site seems to use
     # a mixture of Latin-1 and UTF-8, I'm not sure exactly.  We want
     # our output to be in Latin-1 but HTML::Entities decides to use
     # Unicode so we have to fiddle a few entities manually first.
     #
     for ($page) {
         # FNX
         s/—/--/go;
         s/…/.../go;
         # checked
         s/\310\241/í/go;
         s/\310\244/’/go;
         s/\310\250/é/go;
         s/\310\253/î/go;
         s/\310\255/ü/go;
         s/\310\257/ï/go;
         s/\310\263/ô/go;
         s/\310\265/ö/go;
         s/\310\272/ç/go;
         s/\310\275/ë/go;
         s/\310\277/è/go;
         s/\310\341/à/go;
         s/\310\355/á/go;
         s/\310\361/ä/go;
         s/\310\376/ó/go;
         s/\310\1334/ñ/go;
         s/\310\20130/æ/go;      ## TBC
         # Not checked
         s/ù\371/­/go;
         s/á/ß/go;
         # Can't be latin 1
         s/ô/“/go;
         s/ö/”/go;
         # /FNX
         decode_entities $_;
         tr/\207\211\200\224/\347\311\055\055/; # bad characters
     }
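
   For anyone adapting this, here is a minimal sketch of the same idea as a
table-driven helper.  It writes both sides of each pair as octal escapes, so
the result should not depend on how the script file itself is encoded.  The
names fix_bytes and @fixes are hypothetical (they are not part of
tv_grab_uk_rt), and only three of the "checked" pairs above are shown, with
their Latin-1 replacements written out by hand.

```perl
use strict;
use warnings;

# Hypothetical table of [ bad byte sequence, Latin-1 replacement ] pairs.
# The left-hand sides are taken from the "checked" list above; the
# right-hand sides are the Latin-1 codes for the same characters.
my @fixes = (
    [ "\310\241", "\355" ],    # i-acute
    [ "\310\250", "\351" ],    # e-acute
    [ "\310\272", "\347" ],    # c-cedilla
);

# Apply every pair in order and return the cleaned-up copy.
sub fix_bytes {
    my ($text) = @_;
    for my $fix (@fixes) {
        my ( $from, $to ) = @$fix;
        # \Q...\E makes sure the bytes are matched literally.
        $text =~ s/\Q$from\E/$to/g;
    }
    return $text;
}

# Show the result as hex so the output is readable on any terminal.
my $fixed = fix_bytes("Caf\310\250");
print join( ' ', map { sprintf '%02x', ord } split //, $fixed ), "\n";
# prints: 43 61 66 e9
```

   Keeping the pairs in a list makes it easier to add a newly spotted
sequence without touching the loop, at the cost of being a little slower
than a run of literal s/// statements.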

-- 
[neil at fnx ~]# rm -f .signature
[neil at fnx ~]# ls -l .signature
ls: .signature: No such file or directory
[neil at fnx ~]# exit


