A Dataset For Temporal Analysis Of Files Related To The JFK Case

Name: A Dataset For Temporal Analysis Of Files Related To The JFK Case
Creator: Luczak-Roesch, Markus
Keywords: OCR, JFK, content analysis, temporal data mining

ZENODO

Dataset . 2017

License: CC BY

Data sources: Datacite

Research data: Dataset . 05 Nov 2017 . English . Publisher: Zenodo

Authors: Luczak-Roesch, Markus

doi: 10.5281/zenodo.1042153, 10.5281/zenodo.1042154, 10.5281/zenodo.1098568

Abstract

This dataset contains the text content of those files from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release) that have a valid publication date in the release metadata. The content was extracted from the source PDF files using the R packages tesseract (OCR) and pdftools (PDF rendering). The code used to derive the dataset is as follows:

### BEGIN R DATA PROCESSING SCRIPT
library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

# the meta file containing all metadata for the PDF files (e.g. publication date)
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = T, sep = ',')

# drop rows with an empty date or a year of 0000
meta$Doc.Date <- as.character(meta$Doc.Date)
meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)),]

for(i in 1:nrow(meta.clean)){
  # replace every "00" with "01" so that the date can be parsed
  meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])
  # dates shorter than 10 characters are re-parsed and rewritten with a four-digit year
  if(nchar(meta.clean$Doc.Date[i]) < 10){
    meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
  }
}

# sort the metadata chronologically
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date),]

docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = F)

for(i in 1:nrow(meta.clean)){
#for(i in 1:3){
  pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])))

  # one temporary image file name per page
  tmp_files <- c()
  for(k in 1:pdf_prop$pages){
    tmp_files <- c(tmp_files, paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/", k))
  }

  # render every page to a TIFF image at 700 dpi
  img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])), format = 'tiff', pages = NULL, dpi = 700, filenames = tmp_files)

  # OCR each page and concatenate the text
  txt <- ""
  for(j in 1:length(img_file)){
    extract <- ocr(img_file[j], engine = tesseract("eng"))
    #unlink(img_file)
    txt <- paste(txt, extract, collapse = " ")
  }

  # strip punctuation and newlines, collapse whitespace, lowercase, and store with the date
  docs <- rbind(docs, data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[\n]", " ", txt))), to = "UTF-8"), dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"), stringsAsFactors = F), stringsAsFactors = F)
}

write.table(docs, "[path to your output directory]/documents.csv", row.names = F)
### END R DATA PROCESSING SCRIPT

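The script writes documents.csv with write.table() defaults, so the file should be space-separated with quoted fields and a header row naming the columns content and dpub. As a minimal sketch of how such a file might be read back for temporal analysis, the following R code parses the dates and counts documents per month. The three sample rows and the temporary file are hypothetical stand-ins for the real documents.csv, and the monthly grouping is only one possible choice, not necessarily the one the dataset author used.

```r
# Hypothetical sample in the layout produced by write.table(docs, ..., row.names = F):
# space-separated, quoted character fields, header row.
sample_lines <- c(
  '"content" "dpub"',
  '"memo regarding the investigation" "1963/11/22"',
  '"report on travel records" "1963/11/25"',
  '"follow up letter" "1964/01/03"'
)

# A temporary file stands in for the real documents.csv.
path <- tempfile(fileext = ".csv")
writeLines(sample_lines, path)

# read.table() defaults (whitespace separator, double-quote quoting) match the
# defaults of write.table(), so only the header flag is needed.
docs <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

# dpub was written as "%Y/%m/%d"; convert it to Date for temporal grouping.
docs$dpub <- as.Date(docs$dpub, format = "%Y/%m/%d")

# Count documents per calendar month.
per_month <- table(format(docs$dpub, "%Y-%m"))
print(per_month)
```

To use this on the real data, replace path with the location of documents.csv. Any other grouping, for example by year with "%Y", can be obtained by changing the format string.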
Related Organizations

Victoria University of Wellington
New Zealand

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.