NAME
    WWW::Leech::Walker - small web content grabbing framework

SYNOPSIS
      use LWP::UserAgent;
      use WWW::Leech::Walker;

      my $walker = WWW::Leech::Walker->new({
            ua => LWP::UserAgent->new(),
            url => 'http://example.tld',

            parser => $www_leech_parser_params,

            state => {},

            logger => sub {print shift()},

            filter => sub{
                    my $urls = shift;
                    my $walker_obj = shift;

                    # ... filter urls

                    return $urls;
            },

            processor => sub {
                    my $data = shift;
                    my $walker_obj = shift;

                    # ... process grabbed data

            }
      });

      $walker->leech();

DESCRIPTION
    WWW::Leech::Walker walks through a given website, parsing its content
    and generating structured data. Its declarative interface makes Walker a
    small framework of sorts.

    This module is designed to extract data from sites with a particular
    structure: an index page (or any other page provided as the root)
    contains links to individual pages representing the items to be grabbed.
    The index page may also contain 'paging' links (e.g.
    http://example.tld/?page=2) that lead to pages with a similar
    structure. A typical example is a product category page with links to
    individual products and links to 'sub-pages'.

    All required parameters are passed as constructor arguments. The other
    methods start or stop the grabbing process and invoke the logger (see
    below).

DETAILS
    new($params)
        $params must be a hashref providing all required data.

        ua  LWP-compatible user-agent object.

        url Starting url.

        parser
            Parameters for WWW::Leech::Parser.

        state
            Optional user-filled value. Walker does not use it directly; it
            is made available to user callbacks instead. Defaults to an
            empty hashref.

        logger
            Optional logging callback. Whenever something happens, walker
            runs this subroutine and passes it a message.

        filter
            Optional url filtering callback. When walker gets a list of
            item page urls, it passes that list to the filter subroutine
            and expects it to return the filtered list. An empty list is
            okay.

        processor
            This callback is called after an individual item is parsed and
            converted to a hashref. The hashref is passed to the processor
            to be saved or processed in some other way.

        next_page_link_post_process
            This optional callback allows the user to alter the next page
            url. Usually these urls look like
            'http://example.tld/list?page=2' and need no changes. But
            sometimes such links are javascript calls like
            'javascript:gotoPageNumber(2)'. The source url is passed as is,
            before walker absolutizes it. Walker passes the current page url
            as a third argument, which may be useful for links like
            'javascript:gotoNextPage()'.

            Walker expects this callback to return a corrected url.
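
            A minimal sketch of such a callback, assuming the argument order
            described above (raw link, walker, current page url) and
            assuming the site pages through a 'page' query parameter. The
            names $fix_next_page_link and page_url are hypothetical and are
            not part of the module:

```perl
use strict;
use warnings;

# Hypothetical callback for 'next_page_link_post_process'.
# Arguments: raw link, walker object, current page url.
my $fix_next_page_link = sub {
    my ($link, $walker_obj, $current_url) = @_;

    # Ordinary links need no changes.
    return $link unless $link =~ /^javascript:/;

    # 'javascript:gotoPageNumber(N)': take the page number from the call.
    if ($link =~ /^javascript:gotoPageNumber\((\d+)\)/) {
        return page_url($current_url, $1);
    }

    # 'javascript:gotoNextPage()': derive the number from the current url
    # (assumption: a url without 'page' is page 1).
    if ($link =~ /^javascript:gotoNextPage\(\)/) {
        my ($current) = $current_url =~ /[?&]page=(\d+)/;
        return page_url($current_url, ($current // 1) + 1);
    }

    # Unknown javascript link: return it unchanged.
    return $link;
};

# Hypothetical helper: sets the 'page' query parameter of $url to $n.
sub page_url {
    my ($url, $n) = @_;
    return $url if $url =~ s/([?&]page=)\d+/$1$n/;
    return $url . ($url =~ /\?/ ? '&' : '?') . "page=$n";
}
```

            It would be passed to the constructor as
            next_page_link_post_process => $fix_next_page_link.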

    leech()
        Starts the process.

    stop()
        Stops the process completely. By default walker keeps working as
        long as there are links. Some sites may contain zillions of pages
        while only the first million is required. This method allows you to
        stop at some point. See the "CALLBACKS" section below.

        If walker is restarted with the leech() method, it runs as if it
        were newly created (the 'state' is preserved, however).

    log($message)
        Runs the 'logger' callback with $message argument.
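
        A 'logger' callback can be as simple as the one in the SYNOPSIS. The
        sketch below is a slightly richer variant that timestamps each
        message; it assumes only that the message arrives as the first
        argument. The @log_lines array is a hypothetical addition that keeps
        a copy of every line for later inspection:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

my @log_lines;    # hypothetical: keeps a copy of every logged line

# Logger callback: prefixes the message with a local timestamp
# and writes it to STDERR.
my $logger = sub {
    my ($message) = @_;
    my $line = strftime('%Y-%m-%d %H:%M:%S', localtime) . " $message\n";
    push @log_lines, $line;
    print STDERR $line;
};

# Called directly here for illustration; walker would call it itself.
$logger->('started');
```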

CALLBACKS
    Walker passes callback-specific data as the first argument, itself as
    the second, and some additional data as the third, if any.

    When grabbing large sites, the grabbing process should be stopped at
    some point (unless you need all the data, of course). This example shows
    how to do it using the state property and the stop() method:

      #....
      state => {total_links_amount => 0},
      filter => sub{
        my $links = shift;
        my $walker = shift;

        if($walker->{'state'}->{'total_links_amount'} > 1_000_000 ){
            $walker->log("A million items grabbed. Enough.");
            $walker->stop();

            return [];
        }

        $walker->{'state'}->{'total_links_amount'} += scalar(@$links);

        return $links;
      }
      #....
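
    The same pattern can be applied in a 'processor' callback. The sketch
    below counts saved items in the state and calls stop() once a limit is
    reached. My::FakeWalker is a hypothetical stand-in used only so the
    callback can run without the module; $limit and @items are likewise
    illustrative. How soon walker actually halts after stop() is called from
    a processor is not covered here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a walker object; it mimics only the
# 'state' property and the stop() and log() methods described above.
package My::FakeWalker {
    sub new {
        my $class = shift;
        return bless { state => { saved => 0 }, stopped => 0, log => [] }, $class;
    }
    sub stop { $_[0]{stopped} = 1 }
    sub log  { push @{ $_[0]{log} }, $_[1] }
}

my @items;        # grabbed items are collected here instead of a database
my $limit = 3;    # hypothetical limit

# Processor callback: item hashref first, walker second.
my $processor = sub {
    my ($data, $walker) = @_;

    push @items, $data;

    if (++$walker->{state}{saved} >= $limit) {
        $walker->log("Saved $limit items. Enough.");
        $walker->stop();
    }
};

# Simulate walker handing three parsed items to the processor.
my $walker = My::FakeWalker->new;
$processor->({ title => "Item $_" }, $walker) for 1 .. 3;
```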

AUTHOR
        Dmitry Selverstov
        CPAN ID: JAREDSPB
        jaredspb@cpan.org

COPYRIGHT
    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    with this module.

