WARCs management (tofuproxy)

WARCs management

To view WARC files, you have to load them into the daemon. Responses for the corresponding URIs will then be transparently served from those WARCs.

There is no strict validation or checking of WARC correctness at all! However, the built-in WARC support seems to be good enough for various sources. The following formats are supported:

.warc

Ordinary uncompressed WARC. Useful when stored on a transparently compressed ZFS dataset.

.warc.gz

GZIP compressed WARC. Multi-stream (multi-segment) formats are also supported and properly indexed.

.warc.zst

Zstandard compressed WARC, as in the specification. The multi-frame format is properly indexed. A dictionary at the beginning is also supported.

It is processed with the unzstd (cmd/zstd/unzstd) utility. It reads the compressed stream from stdin, writes the decompressed data to stdout, and prints each frame's size together with the size of its decompressed data to the 3rd file descriptor (if it is open).

Load WARCs:

$ tee fifos/add-warcs <warcs.txt
smth.warc-00000.warc.gz
smth.warc-00001.warc.gz
smth.warc-00002.warc.gz
another.warc

Visit a URI that you know exists in those WARCs, or go to http://warc/ to view the full list of known URIs loaded from those WARCs.
Note that the order in which WARCs are loaded is important! A WARC can be segmented, and a single response can be split across multiple WARC files. Each subsequently loaded WARC file overwrites any URIs that already exist.
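The overwrite behaviour can be pictured as a plain map from URI to location, where a later load replaces earlier entries. This is only a sketch of the idea: the warcIndex and location types, the add method and the file names old.warc and new.warc are made up for the example and are not taken from tofuproxy.

```go
package main

import "fmt"

// location says where a response lives. Hypothetical type and fields.
type location struct {
	File   string
	Offset int64
}

// warcIndex maps a URI to the location of its response.
type warcIndex map[string]location

// add registers all URIs of one WARC file, replacing entries
// that are already known from previously loaded files.
func (idx warcIndex) add(file string, offsets map[string]int64) {
	for uri, off := range offsets {
		idx[uri] = location{File: file, Offset: off}
	}
}

func main() {
	idx := warcIndex{}
	idx.add("old.warc", map[string]int64{
		"http://some/uri":   10,
		"http://some/other": 99,
	})
	idx.add("new.warc", map[string]int64{
		"http://some/uri": 42,
	})
	fmt.Println(idx["http://some/uri"].File)   // prints "new.warc"
	fmt.Println(idx["http://some/other"].File) // prints "old.warc"
}
```

Loading the same two files in the opposite order would leave http://some/uri pointing at old.warc, which is why the order matters.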

To list and delete loaded known WARCs:

$ cat fifos/list-warcs
smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123
$ echo another.warc >fifos/del-warcs

One possible reason why smth.warc-00002.warc.gz has no URIs is that it contains continuation records of segmented responses.
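If the listing needs to be processed by a program, it can be parsed line by line. The sketch below assumes, based only on the example output above, that each line is a WARC path followed by a space and the number of URIs; the warcEntry type and parseList function are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// warcEntry is one line of the listing. Hypothetical type.
type warcEntry struct {
	Path string
	URIs int
}

// parseList splits each line at its last space, so that paths
// containing spaces are still handled.
func parseList(listing string) ([]warcEntry, error) {
	var entries []warcEntry
	sc := bufio.NewScanner(strings.NewReader(listing))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		n, err := strconv.Atoi(line[i+1:])
		if err != nil {
			return nil, fmt.Errorf("bad URI count in %q: %w", line, err)
		}
		entries = append(entries, warcEntry{Path: line[:i], URIs: n})
	}
	return entries, sc.Err()
}

func main() {
	listing := "smth.warc-00000.warc.gz 154\n" +
		"smth.warc-00001.warc.gz 13\n" +
		"smth.warc-00002.warc.gz 0\n" +
		"another.warc 123\n"
	entries, err := parseList(listing)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if e.URIs == 0 {
			fmt.Println(e.Path, "has no URIs")
		}
	}
}
```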

Loading a WARC involves reading it in full and remembering where each URI's response is located. You can echo SAVE >fifos/add-warcs to save the in-memory index to disk as ....idx.gob files. During the next load, if those files exist, they are used as the index immediately, without expensive WARC parsing.

The cmd/warc-extract/warc-extract utility uses exactly the same code for parsing WARCs. It can be used to check whether WARCs can be loaded successfully, to list all their URIs afterwards, to extract a specified URI, and to pre-generate .idx.gob indices.

$ cmd/warc-extract/warc-extract -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ cmd/warc-extract/warc-extract -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz

The following example can be used to create a multi-frame .warc.zst from any kind of existing WARC. The result has a better compression ratio and much higher decompression speed than .warc.gz.

$ cmd/warc-extract/warc-extract -for-enzstd /path/to.warc.gz |
    cmd/zstd/enzstd >/path/to.warc.zst

GNU Wget can be easily used to create WARCs:

$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...

Or use the even simpler crawl utility, which is also written in Go.