Views provided by UsageCounts
This library handles web crawl data fetched with the Heritrix web crawler (or other tools that produce WARC files): it extracts the plain text from structured formats and saves the result back as WARC "conversion" records. Its primary use is extracting text from web crawl datasets for machine learning and supervised classification work. WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/ The code depends on the hanzo library, which can be installed with 'pip install warctools'. Beware that several old versions of it are listed in the package index under different names. At this stage the software should be considered feature-complete, though minor additions may follow.
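The extract-and-resave step described above can be illustrated with a minimal sketch. It uses only the Python standard library and does not call the hanzo warctools API or the library's own extractor; the helper names `html_to_text` and `make_conversion_record` are hypothetical, and the header set shown is an assumption about what a reasonable "conversion" record carries under WARC 1.0, not a copy of this library's output.

```python
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical stand-in for the real extractor: HTML in, plain text out."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def make_conversion_record(target_uri, refers_to_id, text):
    """Serialise extracted text as a WARC/1.0 'conversion' record (bytes).

    refers_to_id is the WARC-Record-ID of the original response record.
    """
    payload = text.encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % now,
        "WARC-Target-URI: %s" % target_uri,
        "WARC-Refers-To: %s" % refers_to_id,
        "Content-Type: text/plain; charset=utf-8",
        "Content-Length: %d" % len(payload),
    ]
    # Headers, a blank line, the payload, then the two CRLFs that end a record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"
```

In a real pipeline the HTML and the record ID would come from reading a response record out of an existing WARC file, and the returned bytes would be appended to the output archive.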
Apache Tika, OpenWayback, WARC files, Text extraction, Heritrix, Python
| citations | 0 |
| popularity | Average |
| influence | Average |
| impulse | Average |
| views | 6 |
