
A Dataset For Temporal Analysis Of Files Related To The Jfk Case

Name: A Dataset For Temporal Analysis Of Files Related To The Jfk Case
Creator: Luczak-Roesch, Markus
Keywords: OCR, JFK, content analysis, temporal data mining

Research data > Dataset, 05 Nov 2017, English, Publisher: Zenodo

Authors: Luczak-Roesch, Markus

DOI: 10.5281/zenodo.1042153, 10.5281/zenodo.1042154, 10.5281/zenodo.1098568


Abstract

This dataset contains the text content of the subset of files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). The content was extracted from the source PDF files using the R libraries tesseract (OCR) and pdftools. The code used to derive the dataset is as follows:

### BEGIN R DATA PROCESSING SCRIPT
library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

# the meta file containing all metadata for the PDF files (e.g. publication date)
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',')

meta$Doc.Date <- as.character(meta$Doc.Date)

# drop records whose date is empty or has the year "0000"
meta.clean <- meta[-which(meta$Doc.Date=="" | grepl("/0000",meta$Doc.Date)),]

for(i in 1:nrow(meta.clean)){
  # replace every "00" (e.g. an unknown day or month) with "01"
  meta.clean$Doc.Date[i] <- gsub("00","01",meta.clean$Doc.Date[i])
  # dates shorter than 10 characters are parsed as "%d/%m/%y" and rewritten as "%m/%d/%Y"
  if(nchar(meta.clean$Doc.Date[i])<10){
    meta.clean$Doc.Date[i]<-format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y")
  }
}

# parse the dates and sort the records chronologically
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date),]

docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F)

for(i in 1:nrow(meta.clean)){
#for(i in 1:3){
  pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])))

  # one temporary image file name per page
  tmp_files <- c()
  for(k in 1:pdf_prop$pages){
    tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k))
  }

  # render every page of the PDF to a TIFF image at 700 dpi
  img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])), format = 'tiff', pages = NULL, dpi = 700,filenames = tmp_files)

  # OCR each page image and append the text
  txt <- ""
  for(j in 1:length(img_file)){
    extract <- ocr(img_file[j], engine = tesseract("eng"))
    #unlink(img_file)
    txt <- paste(txt,extract,collapse = " ")
  }

  # replace punctuation and newlines with spaces, collapse whitespace, lowercase, and store with the publication date
  docs <- rbind(docs,data.frame(content=iconv(tolower(gsub("\\s+"," ",gsub("[[:punct:]]|[\n]"," ",txt))),to="UTF-8"),dpub=format(meta.clean$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F)
}

write.table(docs,"[path to your output directory]/documents.csv", row.names = F)
### END R DATA PROCESSING SCRIPT

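The script above writes documents.csv with write.table() and its default settings, so the file is most likely space-separated with quoted strings rather than comma-separated, despite its extension. The following is a minimal sketch of reading such a file back and counting documents per publication year. The sample rows and the temporary path are hypothetical stand-ins, not real dataset content.

```r
# Hypothetical sample rows standing in for documents.csv (not real dataset content).
sample_docs <- data.frame(
  content = c("memo text one", "memo text two", "report text three"),
  dpub = c("1963/11/22", "1963/12/02", "1964/01/15"),
  stringsAsFactors = FALSE
)

# Write with the same call shape as the processing script: write.table() defaults
# to a space separator and quoted strings.
path <- file.path(tempdir(), "documents.csv")
write.table(sample_docs, path, row.names = FALSE)

# Read it back with read.table(), which splits on whitespace and honours quotes.
docs <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

# Parse the publication date (written as %Y/%m/%d) and count documents per year.
docs$dpub <- as.Date(docs$dpub, format = "%Y/%m/%d")
docs_per_year <- table(format(docs$dpub, "%Y"))
print(docs_per_year)
```

To use this on the real file, replace the sample-writing step with the path to the downloaded documents.csv. If the columns do not parse as expected, check the separator and quoting of that file first.
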
Related Organizations

Victoria University of Wellington
New Zealand

Impact by BIP!

- citations: 0. An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Usage by UsageCounts

- views: 845
- downloads: 73
