
                            WEBCOPY V0.97B 95/05/31
                                       
   
   
   Copyright 1994, 1995 by Victor Parada (vparada@inf.utfsm.cl).
   
   Copy files (recursively) via HTTP protocol.
   
Description:

   
   
   WebCopy is a perl program that retrieves the URL specified in a
   unix-like command line. It can also retrieve recursively any file that
   an HTML file references, i.e. inlined images and/or anchors, if
   specified with an option.
   
   It can be used as a "mirror" program to retrieve a tree of documents
   from a remote site, and put them on-line immediately through the local
   server.
   
   By default, only the document pointed to by the URL in the command line
   is retrieved. Many switches can be specified in the command line, and
   each option enables one type of reference to follow.
   
   To avoid endless recursion, only files at one site can be retrieved
   with one command. WebCopy never follows links to files that are not on
   the same host, port number and protocol (only HTTP is supported) as the
   first document retrieved. A list of discarded URLs is logged for future
   reference.
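
   As a rough illustration only (WebCopy is a perl program, and this is
   not its code), the same-site rule above could be sketched in Python.
   The names site_key and same_site are hypothetical, and port 80 is
   assumed when a URL does not give one:

```python
from urllib.parse import urlsplit

# Assumption: a URL without an explicit port uses the standard HTTP port.
DEFAULT_HTTP_PORT = 80


def site_key(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), (parts.hostname or "").lower(),
            parts.port or DEFAULT_HTTP_PORT)


def same_site(start_url, candidate_url):
    """True only if candidate is HTTP and shares host and port with start."""
    start = site_key(start_url)
    candidate = site_key(candidate_url)
    return candidate[0] == "http" and candidate == start
```

   With this sketch, http://www.host:8080/a.gif and ftp://www.host/a.gif
   would both be discarded when the first document is
   http://www.host/page.html.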
   
   This program does not comply with the Robot Exclusion Standard, since
   it retrieves almost exactly what the user specifies in the command
   line.
   
   The user must know what kind of server and documents he wants to
   access. WebCopy does all it can to stop at CGI-generated files
   (virtual documents).
   
What's New:

     * Slightly improved code :-)
     * More restrictive, to avoid endless recursion.
     * Added PROXY support. An HTTP proxy must be specified in the
       command line or through the http_proxy environment variable.
     * It can POST data to CGI scripts (but cannot recurse on output).
     * Added delay time between connections, to avoid overload on the
       server. Duration can be specified by an option in the command
       line. Defaults to 30 seconds.
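
   The proxy rules stated in this document (a proxy on the command line
   overrides http_proxy, as in example 6, and the -n option ignores both)
   might be expressed as follows. This is a Python sketch under those
   stated rules, not WebCopy's own code; choose_proxy and its parameter
   names are hypothetical:

```python
import os


def choose_proxy(cli_proxy=None, ignore_proxy=False, environ=None):
    """Pick the proxy URL to use, or None for a direct connection.

    cli_proxy    -- proxy given as the last command-line argument, if any
    ignore_proxy -- True when the -n option was given
    environ      -- mapping to read http_proxy from (defaults to os.environ)
    """
    if environ is None:
        environ = os.environ
    if ignore_proxy:
        # -n ignores both the environment variable and the command line.
        return None
    if cli_proxy:
        # A proxy on the command line overrides http_proxy.
        return cli_proxy
    return environ.get("http_proxy") or None
```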
       
Syntax:


webcopy [options] http://host:port/path/file [http://proxy:port]

Options (can be combined):

   -o
          output through stdout.
          You can redirect the output to another filename or pipe it to a
          program. Use it with -s option. You cannot recurse HTML files
          or use -v or -q options in this mode.
   -v
          operates in verbose mode.
          Displays every URL to fetch. -vv is "very verbose" and outputs
          every header line the server sends with the file.
   -q
          query each URL to transfer.
          Use it to select the files to transfer. Enter 'n' to skip file,
          'y' to transfer, 'a' to transfer all the remaining files, and
          'q' to quit immediately. If you don't say 'y' to the first file
          (the one specified in the command line), no recursion is made.
   -s
          do not log in 'W.log'.
          This file is stored in the root of the working directory. It
          can be parsed to get a list of every file (NOT) transferred.
   -tdelay
          set 'delay' seconds between transfers.
          This option is used to change the default 30 seconds of delay
          before every connection. The delay is meant to avoid server
          overload.
   -wpath
          set working directory to 'path'.
          WebCopy stores files in the current working directory. Use this
          option to force WebCopy to use another directory.
   -xfile
          set default index to 'file'.
          When a directory index is required, this is the filename that
          is used to store the output. Defaults to index.html.
   -zfile
          post 'file' or query string if omitted.
          You can send some URL-encoded form data using the POST method
          to a CGI script. If the filename is omitted, the data is taken
          from the query string specified in the URL after a "?". The
          data in the file must be in URL-encoded format, and spaces are
          suppressed.
   -r
          recurse HTML documents.
          Same as -il.
   -i
          include inlined images.
          Retrieves the files referenced in <IMG> and <FIG> tags.
   -l
          follow hypertext links.
          Recurse through hypertext references in .html documents.
          Warning: Never leave WebCopy unattended if you don't know what
          you are recursively retrieving.
   -m
          allow imagemaps.
          NOT available yet.
   -c
          allow links to CGI scripts.
          By default, WebCopy discards references that seem to be CGI
          scripts (e.g. /cgi-bin/ in the path). Use this option if you
          want to retrieve the output of a CGI script. If the base path
          is other than the current one, you'll also require options
          -paf.
   -a
          allow absolute references to the same host.
          References like /path/to/file.html, where the path is the
          current one (the one that was specified in the command line)
          are not rejected when this option is specified. If other paths
          are required, also use option -p.
   -f
          allow full URL references to the same host.
          Complete http: URLs are accepted only if this option is
          specified in the command line and the host and port remain the
          same as the current ones, but they are still rejected unless
          option -p is also specified.
   -p
          allow paths other than current.
          References like /images/some.gif, where the path is not the
          current, are accepted. Use this option to allow references to
          CGI scripts. To keep the same document structure of the server
          and to avoid document name collision, option -d is recommended.
          
          Warning: This option can cause WebCopy to retrieve the whole
          data from a server if it finds a reference to the server root
          in some document while using recursion. Never leave WebCopy
          unattended if you don't know what you are recursively
          retrieving.
   -d
          keep directory path in URL for local file.
          The default behaviour of WebCopy is to make the working
          directory the equivalent of the document directory specified in
          the command line's URL. With this option, WebCopy makes the
          working directory the equivalent of the root directory of the
          server, so directories in the path are also created in the
          working directory. If you want to specify this option after
          some documents have already been transferred, you'll have to
          create the subdirectories yourself and move the retrieved files
          from the working directory into them, or you will get
          duplicated files.
   -u
          use local copy of file if exists.
          Before sending a request to a server, WebCopy checks for the
          file in the working directory and sends file information to the
          server. The new version is retrieved only if the file has
          changed since the last access. This option forces WebCopy to
          use the local copy of the file if it exists, without checking
          whether the file has changed on the server.
   -n
          don't use defined PROXY.
          If the http_proxy environment variable is defined, this option
          makes WebCopy ignore it. It also ignores a PROXY specified in
          the command line.
   -h
          help.
          WebCopy displays a brief help, ignores other options specified
          and exits.
          
   
   
   Note: Some options conflict with others. For example, you cannot use
   -v and -o at the same time because both require STDOUT.
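
   The conflicts named in the option list (no recursion, -v or -q
   together with -o) could be checked before any transfer starts. The
   following Python sketch is one reading of those rules, not the check
   WebCopy actually performs. It assumes that -r, -i and -l all count as
   recursion, and find_conflicts is a hypothetical name:

```python
# Options that cannot be combined with -o, as stated in the option list:
# -v and -q are ruled out, and -r, -i, -l are assumed to mean recursion.
CONFLICTS_WITH_O = {"v": "verbose output", "q": "query mode",
                    "r": "recursion", "i": "recursion", "l": "recursion"}


def find_conflicts(flags):
    """Return messages for flag letters that clash with -o.

    flags holds the option letters only, without option arguments
    (for example "so" or "vvrpafd").
    """
    if "o" not in flags:
        return []
    unique = dict.fromkeys(flags)  # drop repeats such as the second v in -vv
    return ["-o cannot be combined with -%s (%s)" % (f, CONFLICTS_WITH_O[f])
            for f in unique if f in CONFLICTS_WITH_O]
```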
   
Examples:

    1. To retrieve a single file and store it with some name in current
       directory:
webcopy -so http://www.host/images/icon.gif > logo.gif
    2. To retrieve a page and some of the inlined images without delay:
webcopy -vsiqt0 http://www.host/page.html
   and press RETURN on each file NOT to transfer.
    3. To mirror a group of files in some other directory:
webcopy -rwpub/mirror/name http://www.host/intro.html
    4. To retrieve the output of a form:
         1. Get the form:
webcopy -so http://www.host/form.html > form.html
         2. Using an editor, change:
<FORM METHOD=POST ACTION="http://www.host/cgi-bin/proc">
        tag into:
<FORM METHOD=POST ACTION="mailto:yourself@yourdomain">
         3. Using a WWW browser, read the modified file, fill the form
            and press "OK" button.
         4. Wait for your own mail to arrive. It should contain the
            posted URL-encoded data in the body.
         5. Save the mail in a file (post.dat) without the mail headers.
         6. Post the data:
webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html
   If you are smart enough, you can write your own files of data and just
       do step 6, or use the following:
webcopy -so -z http://www.host/cgi-bin/proc?postdata > result.html
    5. To verbosely retrieve html documents and icons that are not in the
       same directory on the server:
webcopy -vvrpafd http://www.host/path/page.html
    6. To retrieve a file using a PROXY, overriding the default
       http_proxy environment variable:
webcopy http://www.host/path/page.html http://otherproxy
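
   For example 4, a data file for the -z option can also be written by
   hand or by a small script instead of going through the mail steps.
   The Python sketch below shows one possible way to produce URL-encoded
   data. The function name write_post_data is hypothetical, the field
   names a caller passes are up to the form being posted, and whether a
   given CGI script accepts the result depends on that script:

```python
from urllib.parse import urlencode


def write_post_data(fields, filename="post.dat"):
    """Write form fields to filename as one URL-encoded line; return it."""
    # urlencode joins pairs with "&" and encodes spaces as "+",
    # so the resulting text contains no literal spaces.
    body = urlencode(fields)
    with open(filename, "w", encoding="ascii") as handle:
        handle.write(body)
    return body
```

   The resulting file can then be posted as in step 6 of example 4:
webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html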

License Agreement and Lack of Warranty:

     * The author of this program is Victor Parada
       <vparada@inf.utfsm.cl>.
     * This program is "Freeware", not "Public Domain".
     * This program must be distributed for free, and cannot be included
       in commercial packages without prior written permission from the
       author.
     * This program cannot be distributed if modified in any way.
     * This program can be used by anyone if the copyright and this
       notice remain intact in every file.
     * If you modify this program, please e-mail patches to the
       author.
     * This is a Beta version of the program. You have been warned!
     * This program is provided ``AS IS'', without any warranty.
     * This program can cause huge file transfers and all the related
       effects.
     * This program can fill data disks without notice.
     * Neither the author nor UTFSM are responsible for the use of this
       program.
     * Bug reports, comments, questions and suggestions are welcome! But
       please check first that you have the latest version!
       
   
   
   If you (want to) use this program, please send e-mail to the author.
   He will try to notify you of any updates made to it.
   
System Requirements:

     * perl interpreter (either 4.036 or 5.000 or later) with perl
       library (sys/socket.ph and timelocal.pl).
     * hostname program or script to get current host's name.
     * TCP/IP connection and Sockets.
     * Space on disk.
     * A machine with all the above available.
       
Down-loading and Setting-Up:

    1. Make sure you meet the System Requirements above.
    2. Get the latest version of WebCopy from its home FTP server:
       ftp://ftp.inf.utfsm.cl/pub/utfsm/perl/webcopy.tgz
       This is a gzip'ed tar archive.
    3. Untar the file with the command:
tar -xzvf webcopy.tgz
   (GNU version of tar).
    4. Make sure you got the following files in a subdir called
       webcopy-0.97:
          + webcopy
          + webcopy.html
          + webcopy.txt
    5. Read the License Agreement and Lack of Warranty in webcopy.txt, or
       in webcopy.html using an HTML browser.
    6. Edit the first line of webcopy if your perl interpreter is not
       located at /usr/local/bin/perl.
    7. Move webcopy to a suitable directory.
    8. Use it at your own risk!
    9. Register yourself (it's free) and send feed-back!
       
   
   
   If you cannot run gunzip or tar, please send e-mail to the author. He
   will try to send you a shar'ed copy of it :-)
     _________________________________________________________________
   
    Document last modified on 1995/06/14 by Victor Parada
