[mythtv-users] Mooting architecture for a DataDirect replacement
Jay R. Ashworth
jra at baylink.com
Tue Jun 26 20:24:02 UTC 2007
On Tue, Jun 26, 2007 at 12:16:08PM -0700, Shawn Rutledge wrote:
> On 6/26/07, Jay R. Ashworth <jra at baylink.com> wrote:
> > On Tue, Jun 26, 2007 at 11:21:28AM -0700, Shawn Rutledge wrote:
> > > I thought XMLTV already worked that way a few years ago. You used to
> > > be able to configure it to use one of several sites for scraping.
> >
> > tv_grab_na is, I think, the component you're talking about; it's
> > described as being in disrepair since DD went online.
>
> That's what I thought. Well now there is a reason to fix it, I guess.
Yup.
> Scraping algorithms tend to require maintenance when the site being
> scraped changes, but it's still much better than hand-entering the
> data, right? And I didn't understand if it was abandoned because of
> the maintenance or because of legal threats. The listings sites are
> subsidized by advertising so they would prefer if a person visits and
> looks at the ads, naturally.
My guess would be that it was abandoned because DataDirect made its
existence...um, moot. :-)
> But if we can legally get away with it, then it would be the least
> impact for those sites if the scraping is done more centrally, rather
> than every MythTV installation simultaneously doing it.
By 4 or 5 orders of magnitude, yes.
> I like this
> idea of using a news server, or maybe BitTorrent. Ideally there
> should be several listings servers hosted at different locations.
One advantage of leveraging the commercial NNTP infrastructure.
> Each of them should be capable of doing the scraping, and they could
> take turns (and then distribute the results to the other servers). If
> the scraping agent masquerades as a regular browser, and the scraping
> is done in a distributed fashion, it will be very difficult for the
> listing sites to stop it. Even better, each MythTV installation could
> take turns scraping and uploading, in random order, so nearly every
> time, a different IP address is seen in the site's logs; and since
> they can't tell whether it's a browser or not, there is no pattern
> they can use to block access.
Now, modulo concerns of accuracy, that's not a bad idea at all.
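Just to make the take-turns idea concrete, here's a rough sketch of how each box could pick an unpredictable scrape slot and browser disguise. Everything here (the function names, the agent list, the 24-hour window) is invented for illustration; none of it is existing MythTV or XMLTV code.

```python
# Hypothetical sketch: MythTV boxes taking turns scraping so the
# listings site sees a different client, at a different time, each day.
# pick_scrape_offset() and USER_AGENTS are made-up names, not real APIs.
import hashlib
import random

# A couple of ordinary browser strings to blend in with normal traffic.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux i686) Gecko/20070515 Firefox/2.0.0.4",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
]

def pick_scrape_offset(box_id: str, cycle: int, period_minutes: int = 1440) -> int:
    """Spread boxes across the daily scrape window.

    Hashing the box id together with the cycle (day) number gives each
    installation a different, hard-to-predict slot every cycle, so no
    fixed pattern shows up in the site's logs.
    """
    digest = hashlib.sha1(f"{box_id}:{cycle}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % period_minutes

def pick_user_agent(rng: random.Random) -> str:
    """Masquerade as a regular browser by rotating User-Agent strings."""
    return rng.choice(USER_AGENTS)

if __name__ == "__main__":
    for box in ("livingroom", "basement"):
        print(box, "scrapes at minute", pick_scrape_offset(box, cycle=0))
```

The deterministic hash means a box doesn't need to coordinate with anyone to know its slot, which matters if the distribution channel (news server, BitTorrent, whatever) is one-way.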
In the interim, until we can get a direct data source, we can still
generate the data to ship around by scraping.
We're gonna need to extend the XMLTV format, I think, so we can put
some information about data quality and source into it.
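For instance, such an extension might look something like this. The `source` and `confidence` attributes are invented here to illustrate the idea; they are not part of the actual XMLTV DTD.

```xml
<!-- Hypothetical extension: 'source' and 'confidence' are invented
     attribute names, not part of the current XMLTV DTD. -->
<programme start="20070626200000 UTC" channel="wedu.tampa"
           source="scrape:example-listings-site" confidence="0.8">
  <title>NewsHour</title>
</programme>
```

A consumer could then prefer higher-confidence entries when the same programme shows up from two different scrapers.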
Time to put my head down, I guess. I got a couple weeks vacation
coming up...
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra at baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com '87 e24
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274