[mythtv-users] Duplicate recordings because of bad SD data?

Thu Jul 22 22:22:10 UTC 2010

    > Date: Thu, 22 Jul 2010 11:32:39 -0500
    > From: Robert Eden <rmeden at yahoo.com>

    > I got a response from Tribune:
    > > The original episode was created as a standard without the 
    > > season/episode information.  Once we obtained that, editorial did not 
    > > realize there was already an episode present and created a new one 
    > > with the season/episode info.  The bad episode is being removed from 
    > > our database today.

    > See simple... it really comes down to how complicated Tribune's job is 
    > because content providers don't always provide accurate info.

I'll say---though sometimes it's difficult for end-users to know where
to assign blame.  For example, I just discovered that the Science
Channel aired three episodes of Weird Creatures this morning, followed
by one of Weird Nature.  But SD thought this was -four- episodes of
Weird Creatures, one of which had never been aired before (even though
the actual video content and the Science Channel website disagree with
that assessment and claim it was a repeat of Weird Nature).

The seriesid/programid for the supposed fourth Weird Creatures were
EP01293176 and EP012931760009.  That's a plausible programid for WC
(the one before that had programid EP012931760008) and matches the
seriesid as well.  But it's wrong.
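The reason this programid looks plausible can be shown with a small sketch. It assumes only the layout visible in the IDs quoted above (a programid that starts with the 10-character seriesid and ends in a numeric counter); the function names are hypothetical, and nothing here is an official description of the TMS format.

```python
def split_programid(programid, series_len=10):
    """Split a programid into (series prefix, episode counter).

    Assumption: the first `series_len` characters repeat the seriesid
    and the rest is a per-series number, as in the IDs quoted above.
    """
    prefix, suffix = programid[:series_len], programid[series_len:]
    if not suffix.isdigit():
        raise ValueError(f"unexpected programid layout: {programid!r}")
    return prefix, int(suffix)


def consistent(seriesid, programid):
    """True if the programid's prefix agrees with the seriesid."""
    return split_programid(programid, len(seriesid))[0] == seriesid


print(split_programid("EP012931760009"))
print(consistent("EP01293176", "EP012931760009"))
```

A check like this passes for the bad listing, which is the point: the IDs are internally consistent, so only the actual content can show that they are attached to the wrong series.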

The on-screen title for this episode was "Weird Nature: Bizarre
Breeding", but the titles on their website, in today's SD data, and in
the SD data from 9/28/08 (when this episode was first noticed by my
Myth) all call it "Defense".  The CC data from -that- showing matches
this one very closely.

The actual Breeding/Defense aired was never a part of Weird Creatures,
because that's hosted by Nick Baker, who appears onscreen and in
narration all the time, and he's nowhere here.  To make things even
more complicated, they've recently begun reairing "Nick Baker's Weird
Creatures" (with seriesid EP00904993) as just "Weird Creatures" with
(as far as I can tell) -exactly- the same content (including episode
titles and descriptions and CC data) -but- with this new seriesid
EP01293176 and a title omitting "Nick Baker's"---hence causing my myth
to believe that every single one of them is brand-new, when in fact
they're all repeats.  (See why my CC-comparison automation is so useful?)
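For readers wondering what a CC comparison could look like, here is a minimal sketch, not the actual tool. It assumes the closed-caption text of each recording is already available as a plain string, and it uses a bag-of-words cosine similarity (a normalized dot product of word counts). The function names and the tokenizing regex are illustrative choices.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the caption text and count its words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cc_similarity(a, b):
    """Cosine similarity (normalized dot product) of two caption texts."""
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


old = "the frog defends itself by puffing up its body"
new = "The frog defends itself by puffing up its body."
print(cc_similarity(old, new) > 0.9)
```

Two recordings whose score is above some threshold (the 90% figure suggested later in this thread would be one choice) can be flagged as probable repeats no matter what the seriesid and programid claim.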

So it looks like TMS picked the wrong series entirely here,
incremented the programid for this wrong series, and ran with it.  I
have no idea whose fault this is, except to say that the SC website
appears to have the right data (though they've -always- claimed this
episode's title differs from the onscreen graphics---in their
defense, this is apparently a schizoid episode that talks in its first
half about breeding and its second about defense, with no credits or
reintro halfway through; I'm wondering if it was basically glued
together from two half-hour segments).

I see this sort of screwup all the time.  I haven't kept careful
stats, but at least once a week on -some- channel, someone has blown
it with the metadata somehow.

And now I also don't know what happens if TMS -fixes- the screwup
and ever un-increments the programid (assuming there are ever more
showings of Weird Creatures).  In these sorts of situations, when
the metadata is obviously wrong, I have tools that manually nuke the
seriesid and programid data (and put them into the description, so I
still have them preserved in a human-readable form so I can debug
later lossage) to allow a possible rerecording if they straighten
themselves out.  It's one reason why I really want to make sure that
raw DB access is preserved even in Myths with built-in SQL
servers---the "fix up the broadcaster's braindead metadata" usage
model doesn't seem to be in the development roadmap... ;)
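The "nuke the IDs but keep them readable" step can be sketched as follows. This is an illustration on a plain dictionary, not the actual tool and not MythTV's schema: the field names seriesid, programid and description come from the post, while the record shape and the function name are assumptions.

```python
def nuke_ids(rec):
    """Return a copy with seriesid/programid blanked and saved in the description."""
    saved = "[was seriesid={} programid={}]".format(
        rec.get("seriesid", ""), rec.get("programid", "")
    )
    fixed = dict(rec)  # leave the caller's record untouched
    fixed["description"] = "{} {}".format(rec.get("description", ""), saved).strip()
    fixed["seriesid"] = ""
    fixed["programid"] = ""
    return fixed


rec = {
    "title": "Weird Creatures",
    "description": "Defense",
    "seriesid": "EP01293176",
    "programid": "EP012931760009",
}
print(nuke_ids(rec)["description"])
```

With the IDs blanked, a later showing with corrected IDs is no longer considered a duplicate of this one, and the old values are still there for debugging.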

[Another common issue:  airing something whose title is "Foo: Bar"
with no subtitle and then later airing it with title "Foo" and
subtitle "Bar", usually with different seriesid's or programid's even
though it was the same thing but they couldn't decide if the subtitle
was a part of the title or not---so Myth sees both as new.  I could go
on...]
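One way to catch the "Foo: Bar" case is to fold both spellings into the same key before comparing. The sketch below is an assumption-laden illustration (hypothetical function name, and a naive rule that splits on the first ": " only when the subtitle is empty), not anything Myth does today.

```python
def normalize(title, subtitle=""):
    """Fold 'Foo: Bar' with an empty subtitle into the pair ('foo', 'bar')."""
    title, subtitle = title.strip(), subtitle.strip()
    if not subtitle and ": " in title:
        title, subtitle = title.split(": ", 1)
    return title.casefold(), subtitle.casefold()


print(normalize("Foo: Bar") == normalize("Foo", "Bar"))
```

A series whose real title contains a colon would also be split by this rule, so a match is better treated as a hint to check by hand than as proof of a repeat.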

    > There have been threads about this in the past, but when we looked into 
    > a listing solution that turned into SD, we were shocked at how 
    > needlessly complicated the whole TV listing business is.  Content 
    > providers and stations have the power to fix it, but simply aren't 
    > interested.  TMS and Rovi make good money because of it :)

Yeah.  I'd be really, really happy if they could be motivated to care.
For example (as I've said before), if certain broadcasters (*cough*
Sundance and TCM) could stop rounding-off the lengths on what they air
so they don't occasionally air something that's a few minutes longer
than the declared timeslot, that would be nice.  [As I've said before,
the runTime metadata -is- correct, but the scheduled size of the slot
via starttime & endtime are -not-; I have a tool (and a rejected patch
in Trac) that looks for discrepancies and makes a lurid warning, so the
postroll padding that's -already- always present can be increased even
more...  It's worked great for me for those channels.]
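The slot-versus-runtime check can be sketched like this. It is not the rejected patch, only an illustration; it assumes the run time has already been converted to whole minutes and that start and end times are ordinary datetimes. The example numbers are made up.

```python
from datetime import datetime, timedelta


def overrun(starttime, endtime, runtime_min):
    """How far the declared run time exceeds the scheduled slot (zero if it fits)."""
    slot = endtime - starttime
    return max(timedelta(minutes=runtime_min) - slot, timedelta(0))


# Hypothetical listing: a 94-minute film scheduled into a 90-minute slot.
start = datetime(2010, 7, 22, 20, 0)
end = datetime(2010, 7, 22, 21, 30)
extra = overrun(start, end, 94)
if extra:
    print(f"WARNING: runs {extra} past its slot; increase postroll")
```

The returned timedelta can be added directly to whatever postroll padding is already configured.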

Or if TCM could be bothered to list the same shorts they list on their
website in the stuff they give to TMS, these hard-to-find and very-
rarely-repeated historical artifacts would be easier to discover.

Or all the broadcasters who just show a zillion generic descriptions
and let the viewers sort it out; that was the original use case for my
CC dot-product dingus ("record everything and throw away CC dups").
As you say below, this was Comedy Central's MO for the Daily Show.
Other news-show providers seem to have the same MO---as far as I
can tell, every MSNBC news show that gets repeated a few times during
primetime is indistinguishable from every other, so again, recording
just one per day and not all three pretty much requires a timeslot
rule instead of the obvious.  (And since the repeats span midnight,
"record one per day" can miss days and get two of one day instead;
what's really needed is "record one IN THIS INTERVAL" with an interval
that is user-settable, so you could set the interval to start at 6pm
and end at 3pm the next day or something---if a power search can even
do this, it's sufficiently annoying to hand-write for each one that it
might as well be impossible...  So, instead, hard-coded timeslot rules
that make it more-difficult for the scheduler to resolve conflicts
because it can't push around things it would otherwise easily do.)
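The "record one IN THIS INTERVAL" idea can be sketched outside the scheduler. This is a hypothetical illustration, not a Myth feature: it assumes showings are naive local datetimes and uses the 6pm-to-3pm window from the example above as the default.

```python
from datetime import datetime, time, timedelta


def interval_key(start, opens=time(18, 0), length=timedelta(hours=21)):
    """Return the date on which the showing's interval opened, or None if outside."""
    opened = datetime.combine(start.date(), opens)
    if start < opened:
        opened -= timedelta(days=1)  # belongs to the interval opened yesterday
    return opened.date() if start - opened < length else None


def pick_one_per_interval(showings):
    """Keep the earliest showing in each interval."""
    chosen = {}
    for s in sorted(showings):
        key = interval_key(s)
        if key is not None and key not in chosen:
            chosen[key] = s
    return list(chosen.values())


showings = [
    datetime(2010, 7, 22, 21, 0),
    datetime(2010, 7, 23, 0, 0),   # repeat after midnight, same interval
    datetime(2010, 7, 23, 3, 0),   # another repeat, same interval
    datetime(2010, 7, 23, 21, 0),  # next evening's show
]
print(pick_one_per_interval(showings))
```

Because the key is the date the window opened rather than the calendar date of the showing, the post-midnight repeats fall into the same bucket as the 9pm airing. A real scheduler would pick whichever showing in the bucket conflicts least, not simply the earliest.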

While I'm complaining about MSNBC, btw, maybe if they could start to
identify episodes the way the Daily Show does, the "record unique eps"
typical rule would be able to pick up cases where the second showing
of the evening is in fact very much not a repeat of the first and is
new content; this is typically the case during elections but happens
somewhat randomly at other times.

    > Feel free to contact SD to report issues like this.  Please try and do 
    > some vetting to prevent false positives.  If you want to automate things 
    > (like scanning CC listings, and alerting for a 90%+ hit) contact me and 
    > I can facilitate automatic reporting.  TMS has shown support for 
    > anything that improves the quality of their data.

I'd be happy to figure out how I can report what I can automatically.
One reason I haven't reported much is (a) it's really, really painful
to try to type up a report in a web browser to send to a forum when
all the data starts out sitting in Emacs buffers and basically has to
be hand-copied (or at least mouse-copied) into the report, and (b) it
wasn't clear to me that the parties far enough upstream in the food
chain to actually fix things really cared about improving their
accuracy, so I didn't want to waste my time pissing into the
wind... :)

    > Speaking of data quality, if you guys haven't noticed, there have been 
    > significant improvements to "The Daily Show" listings over the past 
    > year.  It is much, much better and that's due to hard work, automated 
    > tools,  and cooperation between TMS and SD. (if you don't mind me 
    > tooting my own horn, and if you do.. @$%@!$#  :) )

I -have- noticed, but had no idea it had to do with you or SD!
Bravo!  (If you can talk about what was actually involved there,
it sounds like it'd be fascinating.  And probably head-bangingly
painful at the same time...)

There are still the occasional generics and also the occasional
just-wrong's (probably caused by the last-minute unavailability of a
guest and a consequent swap---that's asking too much of TMS to somehow
catch, I'll bet), but it's improved.  Currently they're often slow to
realize that the show's gone on vacation (it usually takes them a
couple days of generic descriptions before it finally vanishes from
their data---and this is not my data being obsolete; it's true even in
a reload of "today's" data), but for this particular break, I think
they got it right and claimed zero generics.