[mythtv-users] Duplicate detection

Tue Sep 20 17:05:25 UTC 2016

On 09/20/2016 12:36 PM, Jan Ceuleers wrote:
> On 20/09/16 17:14, Michael T. Dean wrote:
>>> IIUC this partitions duplicate matching, such that duplicates would be
>>> found for repeats on channels whose metadata comes from the same source,
>>> but still not for repeats that span listings data sources. In order to
>>> achieve that I believe (but do correct me if I'm wrong) that I need to
>>> continue erasing the programids.
>> No, it means the program ID is used for dup matching when both programs
>> contain program IDs from the same authority and the rule-specified
>> duplicate-matching method is used otherwise.
> Yes, exactly. We're on the same page. I said what I said because the
> rule-specified method doesn't work since it disregards empty subtitles,
> rather than accepting an empty subtitle as something that should be
> matched with another empty subtitle.

Well, since you've already determined that dup matching won't work for 
this specific situation--regardless of whether you have scrubbed the 
program IDs--removing program IDs isn't helping.  If you stop removing 
program IDs, you'll get valid dup matching whenever a showing comes from 
the same program ID provider as the program you previously recorded.  
Otherwise, your rule-specified method will be used and (assuming you 
choose the "subtitle" method) the showing will be treated as a generic 
episode (meaning it will be recorded).  You're no worse off than you are 
now, and you're better off when the repeat is on the same program ID 
source as the original recording.

However, for all other "proper" programs--where there is something that 
can be used for dup matching--it will just work.  The program ID will be 
used when both programs have one from the same authority; otherwise, the 
method your rule specifies will be used.

Currently, by scrubbing out all program IDs, you can only ever use your 
rule-specified method--i.e. the fallback that would have been used after 
the program IDs were found to come from different authorities.  So, 
really, scrubbing them isn't helping; it's only making it always use the 
fallback.
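
To make the decision order concrete, here is a rough sketch in Python.  
It is only an illustration of the logic described above, not the actual 
scheduler code: the Showing class, its field names, and both functions 
are hypothetical, and the title comparison inside the "subtitle" method 
is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Showing:
    # Hypothetical model of a listing or a previously recorded program.
    title: str
    subtitle: str = ""
    authority: Optional[str] = None   # who issued the program ID (assumed field)
    program_id: Optional[str] = None  # None once the ID has been scrubbed


def subtitle_method(new: Showing, old: Showing) -> bool:
    # An empty subtitle cannot identify an episode, so the showing is
    # treated as generic: never a duplicate, hence it gets recorded.
    if not new.subtitle or not old.subtitle:
        return False
    return new.title == old.title and new.subtitle == old.subtitle


def is_duplicate(
    new: Showing,
    old: Showing,
    rule_method: Callable[[Showing, Showing], bool] = subtitle_method,
) -> bool:
    both_have_ids = bool(new.program_id) and bool(old.program_id)
    same_authority = new.authority is not None and new.authority == old.authority
    if both_have_ids and same_authority:
        # Program IDs from the same authority are directly comparable.
        return new.program_id == old.program_id
    # Different authorities or a missing ID: use the rule's own method.
    return rule_method(new, old)


recorded = Showing("Some Movie", "", "source-a", "MV0001")
same_source = Showing("Some Movie", "", "source-a", "MV0001")
other_source = Showing("Some Movie", "", "source-b", "12345")
scrubbed = Showing("Some Movie", "")

print(is_duplicate(same_source, recorded))   # repeat on the same source
print(is_duplicate(other_source, recorded))  # falls back, generic
print(is_duplicate(scrubbed, recorded))      # always falls back, generic
```

With the IDs left in place, only the first case is caught as a repeat.  
With them scrubbed, every case takes the fallback path and gets recorded.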

> I had another thought: a duplicate-matching method based on the inetref
> field. This wouldn't find duplicates until the metadata has been retrieved,
> of course,

Right--and it does require a lot of hits against a metadata source (there 
are a lot of episodes in people's program listings and they're replaced 
a lot--daily for about 2 weeks, usually--causing re-retrievals).  That 
might be so many hits that we wouldn't want to encourage it.

> and it relies on there being a history of inetrefs employing
> the current format (i.e. not just the number but also the tmdb3.py_ or
> ttvdb.py_ prefix). Furthermore, it breaks if a new metadata source is
> introduced in the future.

Well, the program ID authorities would fix all of that.

> The latter weakness could be addressed by updating the inetref in
> oldrecorded after the fact.
>
> Just a thought - this would require a code change. Not sure I'm up to
> that but once I upgrade to 0.28 I could give it a go.
>
> I could test-drive the concept by a one-time:
>
> update oldrecorded set subtitle=inetref where length(subtitle)=0;
>
> and a daily
>
> update program set subtitle=inetref where length(subtitle)=0;
>
> I can then continue to use the subtitle duplicate-matching method; it'll
> just be ugly in the user interface.
>
> Another possibility is to regard the special treatment of empty
> subtitles as a bug,

Well, it's actually a designed-in feature.  If you say the subtitle 
distinguishes episodes and there is no subtitle, there is no way to 
tell which episode it is, so we have to assume it could be one you 
haven't seen and record it.  You can ALWAYS delete something after it 
is recorded, but you can't (at least I haven't found a way to) go back 
and record something after it airs because you later find out you 
hadn't seen that episode.

>   and to remove that special treatment. This might
> cause a regression for people who rely on that (probably long-standing)
> behaviour though.
>

The easiest generally-good approach for this specific issue--the movie 
rule--is the title-only dup matching method.  Again, this might be 
considered if someone went to the trouble of coding it, but no one has 
yet felt sufficient need to actually do the work.  It won't distinguish 
between Ben-Hur's 1959 and 2016 releases***, but if you record one and 
decide you want the other, you could always create a specific Ben-Hur 
rule to catch it.
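
As a rough illustration of what a title-only method would do (nobody has 
coded it, so this is purely a hypothetical sketch; the Movie class, the 
function name, and the exact-string comparison are all assumptions):

```python
from dataclasses import dataclass


@dataclass
class Movie:
    # Hypothetical record; year is carried only to show that it is ignored.
    title: str
    year: int


def title_only_duplicate(new: Movie, old: Movie) -> bool:
    # Title-only matching: nothing but the title is compared, so two
    # releases that share a title cannot be told apart.
    return new.title == old.title


recorded = Movie("Ben-Hur", 1959)

# Same title, different release: treated as a duplicate and skipped.
print(title_only_duplicate(Movie("Ben-Hur", 2016), recorded))

# Differently titled releases are not duplicates, so they would record.
print(title_only_duplicate(Movie("Ben Hur", 1907), recorded))
print(title_only_duplicate(Movie("Ben-Hur: A Tale of the Christ", 1925), recorded))
```

The first comparison is the case where you'd need a specific rule to get 
the other release.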

Mike

*** I'm pretty sure the 1907 film was titled "Ben Hur" (no hyphen), and 
the 1925 film was actually titled "Ben-Hur: A Tale of the Christ", so 
the differing titles would actually allow them to be recorded without 
special treatment.  Note, too, that this is why 2001's Ocean's Eleven 
wasn't Ocean's 11 (the title of the 1960 movie)--the producers wanted it 
to be easily distinguished from the original.