Reverse geo-tagging included; duplicates removed

All of the tweets for this project have been processed and consolidated into a single file that can be downloaded with this link:

https://s3-us-west-2.amazonaws.com/healthcare-twitter-analysis/HTA_noduplicates.gz

1.85 GB zipped / 15.80 GB unzipped

Each of the 4 million rows in this file is a tweet in json format containing the following information:

- All the Twitter data, in exactly the json format of the original
- Unix time stamp
- All the Topsy data
    - originating file name
    - score
    - author screen name
    - URLs
- Geographic information (present in 60% of the records)
    - Latitude & Longitude
    - Country name & ISO2 country code
    - City
    - For country code "US"
        - Zipcode
        - Telephone area code
        - Square miles inside the zipcode
        - 2010 Census population of the zipcode
        - County & FIPS code
        - State name & USPS abbreviation

The basic technique for using this file in Python is the following:

    import json

    with open("HTA_noduplicates.json", "r") as f:
        # parse each row in turn from json into a Python dict and process it
        for row in f:
            tweet = json.loads(row)
            text = tweet["text"]  # text of original tweet
            ...  # etc.

Python provides very powerful analytical and plotting features, but R is also very handy. R does not work well with datasets of this size, so Python can be used to create a targeted subset file that R can read (or Excel, or anything else for that matter).

For long-running jobs, I used Amazon Web Services EC2 running Ubuntu 14.04, accessed via PuTTY and WebSCP; for local processing I used a Windows 7 laptop with the data on a terabyte external hard drive.

The Status Report in the main repo contains:

- a comprehensive explanation of the dataset
- examples of analyses done with this dataset
- a list of references to other healthcare-related Twitter analyses
- instructions for using Amazon Web Services
- sample programs using this file with Python, R and MongoDB
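The idea of using Python to cut a targeted subset for R (or Excel) can be sketched as follows. This is only a minimal illustration, not the project's own program: the function name `write_subset`, the keyword filter, and the single-column CSV layout are all assumptions made for the example. The only field it relies on is `text`, which appears in the snippet above; it also assumes the unzipped file holds one json object per line.

```python
import csv
import json


def write_subset(src_path, dst_path, keyword):
    """Stream line-delimited json tweets and copy matching ones to a CSV file.

    Hypothetical helper: keeps tweets whose text contains `keyword`
    (case-insensitive) and returns the number of rows written.
    """
    keyword = keyword.lower()
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["text"])  # header row so R/Excel can name the column
        for row in src:
            row = row.strip()
            if not row:
                continue  # skip blank lines
            tweet = json.loads(row)
            text = tweet.get("text") or ""
            if keyword in text.lower():
                writer.writerow([text])
                written += 1
    return written
```

Because the file is read one row at a time, memory use stays small even for the full 15.80 GB file. A call such as `write_subset("HTA_noduplicates.json", "flu_subset.csv", "flu")` (the output name and keyword are placeholders) would produce a CSV file that R can load with `read.csv`.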
