To view WARC files, you have to load them into the daemon. Responses for the corresponding URIs will then be transparently replaced with the ones stored in those WARCs.
There is no strict validation or checking of WARC correctness at all!
However, the built-in WARC support seems to be good enough for various sources.
Uncompressed, gzip (both multiple-stream and single-stream files) and
zstd compressed WARCs are supported.
Searching in compressed files is slow: every request leads to
decompression of the file from the very beginning, so keeping
uncompressed WARCs on a compressed ZFS dataset is preferable.
tofuproxy does not take advantage of multistream gzip files.
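The cost difference described above can be illustrated with a small sketch. It does not show how tofuproxy reads WARCs internally; the sample file, its contents and the offsets are made up for the demonstration. In the uncompressed file the reader can jump straight to a byte offset, while in the gzip file everything before that offset has to be inflated and thrown away first.

```shell
# Hypothetical sample data: three short "records", not a real WARC.
tmp=$(mktemp -d)
printf 'record-one\nrecord-two\nrecord-three\n' > "$tmp/sample.txt"
gzip -c "$tmp/sample.txt" > "$tmp/sample.txt.gz"

# Uncompressed: skip directly to byte offset 11 and read 10 bytes.
plain=$(dd if="$tmp/sample.txt" bs=1 skip=11 count=10 2>/dev/null)

# Compressed: the stream is inflated from the very beginning, and the
# first 11 bytes are discarded before the wanted 10 bytes appear.
packed=$(gzip -dc "$tmp/sample.txt.gz" | tail -c +12 | head -c 10)

echo "plain:  $plain"
echo "packed: $packed"
```

Both variables end up holding the same second record, but for a large archive the compressed path has to process all preceding data on every lookup.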
    $ tee fifos/add-warcs < warcs.txt
    smth.warc-00000.warc.gz
    smth.warc-00001.warc.gz
    smth.warc-00002.warc.gz
    another.warc

    $ cat fifos/list-warcs
    smth.warc-00000.warc.gz 154
    smth.warc-00001.warc.gz 13
    smth.warc-00002.warc.gz 0
    another.warc 123
    $ echo another.warc > fifos/del-warcs
One possible reason why smth.warc-00002.warc.gz has no URIs is that it contains continuation records of segmented records.
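With many loaded WARCs it may be handy to pick out the ones that contributed no URIs. The following is a minimal sketch, assuming the listing always has the form "name count" separated by whitespace and that file names contain no spaces. The sample listing is copied from the fifos/list-warcs output shown earlier.

```shell
# Sample listing; in live use this would come from: cat fifos/list-warcs
listing='smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123'

# Print the names whose URI count (second column) is zero.
empty=$(printf '%s\n' "$listing" | awk '$2 == 0 { print $1 }')
echo "$empty"
```

For the sample listing this prints smth.warc-00002.warc.gz.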
Loading a WARC involves reading it in full and remembering where the
response for each URI is located. You can
echo SAVE > fifos/add-warcs to
save the in-memory index to disk as a ....warc.idx.gob file. During
the next load, if that file exists, it is used as the index immediately,
without the expensive WARC reading.
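The idea of such an index can be sketched with ordinary shell tools. This is only an illustration: the real index is a binary .idx.gob file whose layout is not described here, and the demo file below is a hand-made stand-in with LF line endings rather than a valid WARC. The file names, the text index format and the lookup helper are all hypothetical. The sketch relies on grep -b, which is available in GNU and BSD grep.

```shell
tmp=$(mktemp -d)
# Hand-made stand-in for a WARC: just URI header lines and dummy bodies.
printf 'WARC-Target-URI: http://some/a\nbody-a\nWARC-Target-URI: http://some/b\nbody-b\n' \
    > "$tmp/demo.warc"

# Single full pass: remember the byte offset of every URI header.
# grep -b prints "offset:line"; sed rewrites that as "uri offset".
grep -b '^WARC-Target-URI: ' "$tmp/demo.warc" |
    sed 's/^\([0-9]*\):WARC-Target-URI: \(.*\)$/\2 \1/' > "$tmp/demo.warc.idx.txt"

# Later lookups consult the saved index instead of rescanning the file.
lookup() {
    # $1 is the URI; prints the header line found at the recorded offset.
    off=$(awk -v uri="$1" '$1 == uri { print $2 }' "$tmp/demo.warc.idx.txt")
    tail -c "+$((off + 1))" "$tmp/demo.warc" | head -n 1
}

lookup http://some/b
```

Once the index file exists, the expensive pass over the archive can be skipped entirely, which is the same trade-off the saved .idx.gob file offers.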
redo warc-extract.cmd builds the warc-extract.cmd utility,
which uses exactly the same code for parsing WARCs. It can be used to
check whether WARCs can be loaded successfully, to list all their URIs, to
extract a specified URI and to pre-generate .idx.gob indexes.
    $ warc-extract.cmd -idx \
        smth.warc-00000.warc.gz \
        smth.warc-00001.warc.gz \
        smth.warc-00002.warc.gz
    $ warc-extract.cmd -uri http://some/uri \
        smth.warc-00000.warc.gz \
        smth.warc-00001.warc.gz \
        smth.warc-00002.warc.gz
GNU Wget can be easily used to create WARCs:
    $ wget ... [--page-requisites] [--recursive] \
        --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
        --warc-file smth.warc ...