warctika: warctika 1.0: First production release

Name: warctika: warctika 1.0: First production release
Creator: Nicholls, Tom
Keywords: Apache Tika, OpenWayback, WARC files, Text extraction, Heritrix, Python

Research software > Software. 10 Oct 2014. Publisher: Zenodo

Authors: Nicholls, Tom

doi: 10.5281/zenodo.592694, 10.5281/zenodo.12183

Abstract

This library processes web crawl data fetched with the Heritrix web crawler (or other tools that produce WARC files), extracts the plain text from structured formats, and resaves the data as WARC "conversion" records. Its primary use is extracting text from web crawl data sets for machine learning and supervised classification work. WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/ The hanzo library on which this code depends can be installed with 'pip install warctools'. Beware that several old versions are listed in the index under different names. The software should be considered feature-complete at this stage, though minor additions may be made in the future.
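
The extract-and-resave step described above can be illustrated with a minimal sketch. It is not warctika's code and does not use the hanzo warctools API: the function names `extract_text` and `make_conversion_record` are hypothetical, the standard-library HTML parser stands in for Apache Tika, and the record layout follows the general WARC 1.0 structure (version line, named header fields, blank line, content block, two trailing CRLFs).

```python
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Hypothetical stand-in for Tika: return the visible text of an HTML string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)


def make_conversion_record(target_uri, refers_to_id, text):
    """Serialise extracted text as one WARC 'conversion' record (bytes).

    refers_to_id is the WARC-Record-ID of the record the text was derived from.
    """
    payload = text.encode("utf-8")
    headers = [
        ("WARC-Type", "conversion"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Refers-To", refers_to_id),
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

In this sketch the conversion record points back to its source through WARC-Refers-To, so the extracted text can be stored alongside the original crawl data rather than replacing it.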

Related Organizations

University of Oxford
United Kingdom

Related research

warctika: Release v0.5 (2014, HasVersion)

Impact by BIP!

	citations: 0
	popularity: Average
	influence: Average
	impulse: Average