[mythtv-users] Radio Times XMLTV failing
Neil Bird
neil at fnxweb.com
Tue Oct 10 09:33:35 UTC 2006
Around about 03/10/06 02:21, Nick typed ...
> There are a couple of other substitutions performed at this stage to
> replace other characters commonly seen in the data, but I think this
> may be the first time this character sequence has been seen.
For the record, this is my current version of the character-conversion
code in tv_grab_uk_rt. I spent quite a while hunting down rogue accented
characters and the like, and this seems to do the trick for me:
# Tidy up HTML entities and bad characters. The site seems to use
# a mixture of Latin-1 and UTF-8, I'm not sure exactly. We want
# our output to be in Latin-1 but HTML::Entities decides to use
# Unicode so we have to fiddle a few entities manually first.
#
for ($page) {
    # FNX
    s/—/--/go;
    s/…/.../go;
    # checked
    s/\310\241/í/go;
    s/\310\244/’/go;
    s/\310\250/é/go;
    s/\310\253/î/go;
    s/\310\255/ü/go;
    s/\310\257/ï/go;
    s/\310\263/ô/go;
    s/\310\265/ö/go;
    s/\310\272/ç/go;
    s/\310\275/ë/go;
    s/\310\277/è/go;
    s/\310\341/à/go;
    s/\310\355/á/go;
    s/\310\361/ä/go;
    s/\310\376/ó/go;
    s/\310\1334/ñ/go;
    s/\310\20130/æ/go; ## TBC
    # Not checked
    s/ù\371//go;
    s/á/ß/go;
    # Can't be latin 1
    s/ô/“/go;
    s/ö/”/go;
    # /FNX
    decode_entities $_;
    tr/\207\211\200\224/\347\311\055\055/; # bad characters
}
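The ordering in the snippet above (manual substitutions first, decode_entities afterwards) can be shown in a small standalone sketch. This is not the tv_grab_uk_rt code itself: the helper name tidy_for_latin1, the particular entities handled and the sample string are assumptions made up for illustration. It relies only on decode_entities from HTML::Entities, the same module the grabber uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper (not part of tv_grab_uk_rt): entities that have no
# Latin-1 character are swapped for ASCII stand-ins first; everything
# else is then left to decode_entities.
sub tidy_for_latin1 {
    my ($text) = @_;
    for ($text) {
        s/&mdash;/--/g;       # em dash has no Latin-1 code point
        s/&hellip;/.../g;     # neither does the ellipsis
        s/&[lr]dquo;/"/g;     # curly double quotes -> plain quote
        s/&[lr]squo;/'/g;     # curly single quotes -> apostrophe
        $_ = decode_entities($_);   # e.g. &eacute; becomes character 0xE9
    }
    return $text;
}

my $out = tidy_for_latin1('Caf&eacute; &mdash; &ldquo;news&rdquo;&hellip;');

# Report whether any character outside the single-byte Latin-1 range is left.
print $out =~ /[^\x00-\xff]/ ? "non-Latin-1 left\n" : "Latin-1 safe\n";
```

If the manual substitutions ran after decode_entities instead, an entity such as &mdash; would already have been turned into a Unicode character above 0xFF, which is the situation the comment at the top of the snippet is trying to avoid.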
--
[neil at fnx ~]# rm -f .signature
[neil at fnx ~]# ls -l .signature
ls: .signature: No such file or directory
[neil at fnx ~]# exit