WARCs management

To view WARC files, you have to load them into the daemon. Responses for the corresponding URIs will then be transparently served from those WARCs.

WARCs are not strictly validated or checked for correctness at all! However, the built-in WARC support seems to be good enough for various sources. Uncompressed, gzip-compressed (both single-stream and multistream) and zstd-compressed files are supported.

Searching in compressed files is slow: every request decompresses the file from the very beginning, so keeping uncompressed WARCs on a compressed ZFS dataset is strongly preferable. tofuproxy does not take advantage of multistream gzip files.
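Since every lookup in a compressed file restarts decompression, one option is to unpack the archives once before loading them. A minimal sketch, assuming the standard gzip and zstd command line tools are installed; the function name unpack_warcs is hypothetical and not part of tofuproxy:

```shell
# Hypothetical helper, not part of tofuproxy: unpack every compressed
# WARC found in a directory, keeping the compressed originals.
unpack_warcs() {
    dir="${1:-.}"
    for src in "$dir"/*.warc.gz "$dir"/*.warc.zst; do
        [ -e "$src" ] || continue   # pattern matched no files
        case "$src" in
            *.gz)  dst="${src%.gz}";  tool=gzip ;;
            *.zst) dst="${src%.zst}"; tool=zstd ;;
        esac
        [ -e "$dst" ] && continue   # already unpacked
        # -d decompresses, -c writes to stdout; gzip reads all streams
        # of a multistream file in sequence.
        "$tool" -dc "$src" > "$dst" || { rm -f "$dst"; return 1; }
    done
    return 0
}
```

Calling unpack_warcs /path/to/warcs would leave a smth.warc-00000.warc next to each smth.warc-00000.warc.gz; the uncompressed copies are the ones to load.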

Loading a WARC involves reading it in full and remembering where the response for each URI is located. You can echo SAVE > fifos/add-warcs to save the in-memory index to disk as a ....warc.idx.gob file. During the next load, if that file exists, it is used as the index immediately, without the expensive WARC reading.

redo warc-extract.cmd builds the warc-extract.cmd utility, which uses exactly the same code for parsing WARCs. It can be used to check whether WARCs can be loaded successfully, to list all their URIs afterwards, to extract a specified URI and to pre-generate .idx.gob indexes.

$ warc-extract.cmd -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ warc-extract.cmd -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
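
When many WARCs have to be checked, the utility can also be run on them one at a time, so that a broken file is easy to identify. A minimal sketch, assuming that warc-extract.cmd exits with a non-zero status when a file cannot be parsed (this is not stated above); the function name check_warcs and the WARC_EXTRACT variable are hypothetical:

```shell
# Hypothetical wrapper: run "warc-extract.cmd -idx" on each file
# separately and report the ones it fails on.
# Assumption: the utility returns a non-zero exit status on failure.
check_warcs() {
    failed=0
    for warc in "$@"; do
        if ! "${WARC_EXTRACT:-warc-extract.cmd}" -idx "$warc"; then
            echo "cannot load: $warc" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every file was processed without error.
    [ "$failed" -eq 0 ]
}
```

For example, check_warcs smth.warc-*.warc.gz would pre-generate the index for every file it can parse and print the names of the rest to stderr.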

GNU Wget can be easily used to create WARCs:

$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...
