[mythtv-users] Mooting architecture for a DataDirect replacement
Jay R. Ashworth
jra at baylink.com
Tue Jun 26 20:24:02 UTC 2007
On Tue, Jun 26, 2007 at 12:16:08PM -0700, Shawn Rutledge wrote:
> On 6/26/07, Jay R. Ashworth <jra at baylink.com> wrote:
> > On Tue, Jun 26, 2007 at 11:21:28AM -0700, Shawn Rutledge wrote:
> > > I thought XMLTV already worked that way a few years ago. You used to
> > > be able to configure it to use one of several sites for scraping.
> >
> > tv_grab_na is, I think, the component you're talking about; it's
> > described as being in disrepair since DD went online.
>
> That's what I thought. Well now there is a reason to fix it, I guess.
Yup.
> Scraping algorithms tend to require maintenance when the site being
> scraped changes, but it's still much better than hand-entering the
> data, right? And I didn't understand if it was abandoned because of
> the maintenance or because of legal threats. The listings sites are
> subsidized by advertising so they would prefer if a person visits and
> looks at the ads, naturally.
My guess would be that it was abandoned because DataDirect made its
existence...um, moot. :-)
> But if we can legally get away with it, then it would be the least
> impact for those sites if the scraping is done more centrally, rather
> than every MythTV installation simultaneously doing it.
By 4 or 5 orders of magnitude, yes.
> I like this
> idea of using a news server, or maybe BitTorrent. Ideally there
> should be several listings servers hosted at different locations.
One advantage of leveraging the commercial NNTP infrastructure.
> Each of them should be capable of doing the scraping, and they could
> take turns (and then distribute the results to the other servers). If
> the scraping agent masquerades as a regular browser, and the scraping
> is done in a distributed fashion, it will be very difficult for the
> listing sites to stop it. Even better, each MythTV installation could
> take turns scraping and uploading, in random order, so nearly every
> time, a different IP address is seen in the site's logs; and since
> they can't tell whether it's a browser or not, there is no pattern
> they can use to block access.
Now, modulo concerns of accuracy, that's not a bad idea at all.
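Just to make the take-turns idea concrete, here's a rough sketch of how each box could pick an unpredictable scrape slot and browser disguise. Everything here (the function names, the agent list, the 24-hour window) is invented for illustration; none of it is existing MythTV or XMLTV code.

```python
# Hypothetical sketch: MythTV boxes taking turns scraping so the
# listings site sees a different client, at a different time, each day.
# pick_scrape_offset() and USER_AGENTS are made-up names, not real APIs.
import hashlib
import random

# A couple of ordinary browser strings to blend in with normal traffic.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux i686) Gecko/20070515 Firefox/2.0.0.4",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
]

def pick_scrape_offset(box_id: str, cycle: int, period_minutes: int = 1440) -> int:
    """Spread boxes across the daily scrape window.

    Hashing the box id together with the cycle (day) number gives each
    installation a different, hard-to-predict slot every cycle, so no
    fixed pattern shows up in the site's logs.
    """
    digest = hashlib.sha1(f"{box_id}:{cycle}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % period_minutes

def pick_user_agent(rng: random.Random) -> str:
    """Masquerade as a regular browser by rotating User-Agent strings."""
    return rng.choice(USER_AGENTS)

if __name__ == "__main__":
    for box in ("livingroom", "basement"):
        print(box, "scrapes at minute", pick_scrape_offset(box, cycle=0))
```

The deterministic hash means a box doesn't need to coordinate with anyone to know its slot, which matters if the distribution channel (news server, BitTorrent, whatever) is one-way.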
In the interim, until we can get a direct data source, we can still
generate the data to ship around by scraping.
We're gonna need to extend the XMLTV format, I think, so we can put
some information about data quality and source into it.
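For instance, such an extension might look something like this. The `source` and `confidence` attributes are invented here to illustrate the idea; they are not part of the actual XMLTV DTD.

```xml
<!-- Hypothetical extension: 'source' and 'confidence' are invented
     attribute names, not part of the current XMLTV DTD. -->
<programme start="20070626200000 UTC" channel="wedu.tampa"
           source="scrape:example-listings-site" confidence="0.8">
  <title>NewsHour</title>
</programme>
```

A consumer could then prefer higher-confidence entries when the same programme shows up from two different scrapers.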
Time to put my head down, I guess. I got a couple weeks vacation
coming up...
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra at baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com '87 e24
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274