[mythtv-users] Mooting architecture for a DataDirect replacement

Tue Jun 26 19:16:08 UTC 2007

On 6/26/07, Jay R. Ashworth <jra at baylink.com> wrote:
> On Tue, Jun 26, 2007 at 11:21:28AM -0700, Shawn Rutledge wrote:
> > I thought XMLTV already worked that way a few years ago.  You used to
> > be able to configure it to use one of several sites for scraping.
>
> tv_grab_na is, I think, the component you're talking about; it's
> described as being in disrepair since DD went online.

That's what I thought.  Well, now there is a reason to fix it, I guess.
Scraping algorithms tend to require maintenance whenever the site being
scraped changes, but it's still much better than hand-entering the
data, right?  And I never understood whether it was abandoned because
of the maintenance burden or because of legal threats.  The listings
sites are subsidized by advertising, so naturally they would prefer
that a person visit and look at the ads.

But if we can legally get away with it, then it would have the least
impact on those sites if the scraping were done more centrally, rather
than by every MythTV installation simultaneously.  I like this
idea of using a news server, or maybe BitTorrent.  Ideally there
should be several listings servers hosted at different locations.
Each of them should be capable of doing the scraping, and they could
take turns (and then distribute the results to the other servers).  If
the scraping agent masquerades as a regular browser, and the scraping
is done in a distributed fashion, it will be very difficult for the
listing sites to stop it.  Even better, each MythTV installation could
take turns scraping and uploading, in random order, so nearly every
time, a different IP address is seen in the site's logs; and since
they can't tell whether it's a browser or not, there is no pattern
they can use to block access.
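
To make the "take turns, in random order" part concrete, here is a rough
sketch of how each node could work out whose turn it is without any
coordinator.  It is only an illustration under my own assumptions: the
function name scraper_for, the node names, and the shared seed string
are all hypothetical, and one scrape per day is just a guess at a
sensible interval.

```python
import datetime
import random


def scraper_for(day, nodes, seed="listings"):
    """Return the node whose turn it is to scrape on `day`.

    Every node runs this same function on the same node list, so they
    all agree on the schedule without talking to each other.
    (Hypothetical helper; the names and the seed are assumptions.)
    """
    ordered = sorted(nodes)              # same starting order everywhere
    # Split time into cycles of len(nodes) days; `slot` is the day's
    # position inside the current cycle.
    cycle, slot = divmod(day.toordinal(), len(ordered))
    # Shuffle with a seed derived from the cycle number, so the order is
    # different in each cycle but identical on every node.
    rng = random.Random(f"{seed}:{cycle}")
    rng.shuffle(ordered)
    return ordered[slot]


if __name__ == "__main__":
    nodes = ["server-a", "server-b", "server-c"]   # made-up names
    today = datetime.date.today()
    for offset in range(6):
        day = today + datetime.timedelta(days=offset)
        print(day, scraper_for(day, nodes))
```

Within each cycle every node scrapes exactly once, and the order is
reshuffled from one cycle to the next.  Distributing the results to the
other servers afterwards is a separate problem that this sketch does not
touch.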