[mythtv-users] Possible fix for tv_grab_au 2.11

David Whyte david.whyte at gmail.com
Sat Jul 15 03:20:48 UTC 2006


On 7/15/06, Max Barry <mythtv at maxbarry.com> wrote:
>
> Seems to be URLs. Simply refreshing the daily guide page immediately
> gives you another 21 shots at the detail pages.

I got the following email overnight.  Not sure why the list wasn't
CC'd but anyways:

I think I have a solution for the new problem with the msn site. I
suspected they may have been up to something when they changed the
pids to a hash with a time component. I think what's happening is that
they are allowing about 20 odd "closeups" with each refresh of the day
page, and when you refresh the day page, it will recalculate a new pid
hash for a new batch of 20. You can test this manually by going to the
site with a web browser and clicking on 20 or so program details.
Eventually it will say "Please try again later". When you refresh the
page, it should allow details again. HOWEVER, there are exceptions:
occasionally I've had to wait and refresh yet again before it
worked.

My solution, when one "closeup" fails, is to wait a few seconds,
refresh the same day page and start grabbing details again. Programs
for which we have already retrieved details are cached, so it will
"resume" where it left off. On a fresh new page where I have to grab
details for everything, it can take up to 13 retries to get it all.
But it does get there.
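
A minimal sketch of the retry-and-resume idea described above, not the
actual tv_grab_au code. fetch_day_page(), fetch_details() and grab_day()
are hypothetical stand-ins, and the quota of 3 details per refresh is an
assumption chosen only to keep the demonstration short; the simulated
"site" hands out a new batch each time the day page is refreshed.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: these are NOT functions from tv_grab_au. They
# simulate a site that allows only a few detail requests per refresh.
my $quota_per_refresh = 3;   # assumed small quota, for demonstration only
my $quota_left        = 0;

sub fetch_day_page {
    $quota_left = $quota_per_refresh;    # each refresh grants a new batch
    return [ map { "prog$_" } 1 .. 8 ];  # eight programs on the day page
}

sub fetch_details {
    my ($prog) = @_;
    return if $quota_left <= 0;          # stands in for "Please try again later"
    $quota_left--;
    return "details for $prog";
}

# Retry one day until every program has details or the retry cap is hit.
sub grab_day {
    my ($max_tries, $wait_seconds) = @_;
    my %cache;                           # details that survive across retries
    my $tries = 0;

    DAYPROC:
    while ($tries < $max_tries) {
        $tries++;
        my $programs = fetch_day_page();
        for my $prog (@$programs) {
            next if exists $cache{$prog};    # resume where we left off
            my $details = fetch_details($prog);
            if (!defined $details) {
                sleep $wait_seconds;         # give the site a breather
                next DAYPROC;                # refresh the page and retry
            }
            $cache{$prog} = $details;
        }
        return (\%cache, $tries);            # whole day completed
    }
    return (\%cache, $tries);                # gave up after $max_tries
}

my ($cache, $tries) = grab_day(20, 0);
printf "%d programs cached after %d passes\n", scalar(keys %$cache), $tries;
```

With the assumed quota of 3 and eight programs, this completes on the
third pass. The cap of 20 mirrors $maxDayProc in the diff below; the
wait is set to 0 here only so the sketch runs instantly.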

The script is working for me, but I've had to make quite extensive
changes to your v2.12 to get this to work, and I'm a complete perl
noobie, so I may have mucked things up. Also, things like the
statistics reporting are all screwed up now and need to be fixed.
However, I thought I'd give you what I've done in case it will help.
(Indentation has been left unchanged, to minimize the diff output so you
know where the changes are.) I'm not confident enough in my perl
skills to post this on the list :)

Boy, the programming consultants at ninemsn must be doing a good job
convincing the brass that this scraper is a "real problem", to pad
their own hours.

Hope this helps. Oh, and as always, use at your own risk!


$ diff tv_grab_au.new tv_grab_au.immir.2.12
163,164d162
< my $maxDayProc = 20; # Maximum times to repeat a single day processing
< my $waitBetweenRetries = 10; # Time to wait between repeats
353d350
<   LOOPDAYS:
363,364d359
<     my $completedDay = 0;
<     my $dayProcCount = 0;
366,370c361
<     DAYPROC:
<     while (!$completedDay and $dayProcCount < $maxDayProc)
<     {
<
<     my $guidedata = get_page($url) or next LOOPDAYS;
---
>     my $guidedata = get_page($url) or next;
372,373d362
<     ++$dayProcCount;
<     print "DAYPROC ITERATION $dayProcCount\n" if $debug;
417,420c406
<           my $url;
<           # If webwarper used, the link already contains full URL
<           $url = $NMSN unless $opt_warper;
<           $url .= $link[0]->[0];
---
>           my $url = $NMSN . $link[0]->[0];
428c414
<           my ($show, $cachedShow, $needDetails, $gotDetails);
---
>           my ($show, $cache_show);
434c420
<             $cachedShow = 1;
---
>             $cache_show = 1;
459,465c445,446
<             $needDetails = want_details($show);
<             $gotDetails = get_closeup_details($date6am,$show,$pid,$row,$url)
<               if $needDetails;
<             # update current cache in case current day needs to be repeated
<             $cached->{$cache_id} = $show
<               unless $needDetails and !($gotDetails);
<           }
---
>             $cache_show = get_closeup_details($date6am,$show,$pid,$row,$url)
>               if want_details($show);
467,470d447
<           if (!($cachedShow) and ($needDetails and !($gotDetails))) {
<             # Give the website a breather for better success
<             sleep $waitBetweenRetries;
<             next DAYPROC;
475,477c452,453
<           # recreate newcache based on current shows to flush out obsolete
<           # entries in old cache
<           $newcache->{$cache_id} = $show;
---
>           push @{ $showlists{$chanid} }, $show;
>           $newcache->{$cache_id} = $show if $cache_show;
479c455
<           abbr_dump($show, $cachedShow) if $debug==1;
---
>           abbr_dump($show, $cached->{$cache_id}) if $debug==1;
486d461
<     $completedDay = 1;
488,495c463
<
<     } # Processing one day
<   } # For days
< } # For services
<
< # add all shows in cache to showlists
< while (my ($cache_id, $show) = each (%$newcache)) {
<   push @{ $showlists{$show->{channel}} }, $show;
---
>   }
497a466
>
519,522c488
<   # Make sure shows are in order so that dupe check will work
<   my @shows =
<     sort {$a->{start} cmp $b->{start}} @{ $showlists{$channel} };
<
---
>   my @shows = @{ $showlists{$channel} };
1021,1022c987
<   # Don't prepend warper if it already has it
<   $url =~ s/^http:\/\//$WW/ if $opt_warper and !($url =~ /^$WW/);
---
>   $url =~ s/^http:\/\//$WW/ if $opt_warper;
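
The last hunk above avoids prepending the warper prefix twice. One thing
that may be worth considering: the check interpolates $WW into a regex
unquoted, so characters such as "." in the prefix act as metacharacters.
A small sketch of the same guard with the prefix matched literally
follows. The $WW value and the warp_url() helper are hypothetical (the
real prefix is not shown in this diff), as are the example URLs.

```perl
use strict;
use warnings;

# Hypothetical prefix; the real $WW value in tv_grab_au is not shown here.
my $WW = 'http://warper.example/ww/';

# Prepend the warper prefix unless the URL already starts with it.
sub warp_url {
    my ($url) = @_;
    return $url if $url =~ /^\Q$WW\E/;   # \Q..\E matches the prefix literally
    $url =~ s{^http://}{$WW};            # same substitution as in the script
    return $url;
}

my $once  = warp_url('http://tvguide.example/closeup?pid=1');
my $twice = warp_url($once);             # second call leaves the URL alone
print "$once\n$twice\n";
```

Both printed lines are identical, which is the property the extra check
in the diff is after.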

