
<script type="text/javascript">
<!--
document.write('<div id="oa_widget"></div>');
document.write('<script type="text/javascript" src="https://www.openaire.eu/index.php?option=com_openaire&view=widget&format=raw&projectId=undefined&type=result"></script>');
-->
</script>
PUDL Data Release 1.1.0 This is the second data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3672068 For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on Github, and the associated documentation on Read The Docs. This data release was generated using v0.3.2 of the catalystcoop.pudl python package. Included Data Packages This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database. pudl-eia860-eia923 Data originally collected and published by the US Energy Information Administration (US EIA) in their Form 860 and Form 923, covering the years 2009-2018. A large majority of the data published in the original data sources has been included, but some parts, like fuel stocks on hand, and EIA 923 schedules 6, 7, & 8 have not yet been integrated. pudl-eia860-eia923-epacems This data package contains all of the same data as the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records. pudl-ferc1 Seven data tables from FERC Form 1 are included, primarily relating to individual power plants, and covering the years 1994-2018 (the entire span of time for which FERC provides this data). These tables are the only ones which have been subjected to any cleaning or organization for programmatic use within PUDL. The complete, raw FERC Form 1 database contains 116 different tables with many thousands of columns of mostly financial data. We will archive a complete copy of the multi-year FERC Form 1 Database as a file-based SQLite database at Zenodo, independent of this data release. It can also be re-generated using the catalystcoop.pudl Python package and the original source data files archived as part of this data release. Contact Us If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. You can also: Subscribe to our announcements list for email updates. Use the Github issue tracker to file bugs, suggest improvements, or ask for help. Email the project team at pudl@catalyst.coop for private communications. Follow @CatalystCoop on Twitter. Using the Data The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like. Make sure conda is installed None of these commands will work without the conda Python package manager installed, either via Anaconda or miniconda: Install Anaconda Install miniconda Download the data First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them. If you don't want to recreate the data release from scratch by re-running the entire ETL process yourself, and you don't want to create a full clone of the original FERC Form 1 database, including all of the data that has not yet been integrated into PUDL, then you don't need to download pudl-input-data.tgz. If you don't need the EPA CEMS Hourly Emissions data, you do not need to download pudl-eia860-eia923-epacems.tgz. Load All of PUDL in a Single Line Use cd to get into your new directory at the terminal (in Linux or Mac OS), or open up an Anaconda terminal in that directory if you're on Windows. If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh: bash pudl-load.sh This will do the following: Load the FERC Form 1, EIA Form 860, and EIA Form 923 data packages into an SQLite database which can be found at sqlite/pudl.sqlite. Convert the EPA CEMS data package into an Apache Parquet dataset which can be found at parquet/epacems. Clone all of the FERC Form 1 annual databases into a single SQLite database which can be found at sqlite/ferc1.sqlite. Selectively Load PUDL Data If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately. Create the PUDL conda Environment This installs the PUDL software locally, and a couple of other useful packages: conda create --yes --name pudl --channel conda-forge \ --strict-channel-priority \ python=3.7 catalystcoop.pudl=0.3.2 dask jupyter jupyterlab seaborn pip conda activate pudl Create a PUDL data management workspace Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some other directories show up, like parquet, sqlite, data etc. pudl_setup ./ Extract and load the FERC Form 1 and EIA 860/923 data If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup: tar -xzf pudl-ferc1.tgz tar -xzf pudl-eia860-eia923.tgz To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you: datapkg_to_sqlite \ datapkg/pudl-data-release/pudl-ferc1/datapackage.json \ datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \ -o datapkg/pudl-data-release/pudl-merged/ Now you should be able to connect to the database (~300 MB) which is stored in sqlite/pudl.sqlite. Extract EPA CEMS and convert to Apache Parquet If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. Then you can read those files into dataframes directly. In Python you can use the pandas.DataFrame.read_parquet() method. If you need to work with more data than can fit in memory at one time, we recommend using Dask dataframes. Converting the entire dataset from datapackages into Apache Parquet may take an hour or more: tar -xzf pudl-eia860-eia923-epacems.tgz epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying. Clone the raw FERC Form 1 Databases If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL) you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release: tar -xzf pudl-input-data.tgz ferc1_to_sqlite data-release-settings.yml You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite. Data Quality Control We have performed basic sanity checks on much but not all of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing prior to release. These checks have also identified some issues in the originally reported data. If you have suggestions for additional types of data quality control and validation tests we would love to hear them, or see them in a pull request! Data Validation Test Cases We've compiled a collection of data validation test cases which were run against the data in this release prior to publication. For the complete details see the pudl.validate module and the PyTest routines organized under test/validate in the PUDL repository on Github. Generally these tests include: Ensuring that there are no entirely NULL columns. This often happens due to a bad merge between dataframes when there's a misnamed column. Make sure that tables have the expected number of records, to +/- a few percent. Ensure that tables do not contain duplicate records within specified subsets of columns that should serve as unique keys. For reported values that have a physically constrained valid range of values do the vast majority of reported records fall within that valid range? This includes quantities like heat content per unit of fuel delivered/consumed, the sulfur, ash, moisture, chlorine, mercury content of coal, plant capacity. Do ownership shares of individual generators reported in EIA 860 sum to 100%? Are derived IDs that are used to group units of infrastructure together internally self consistent? For example, are there ever cases where a reported EIA generation unit appears in more than one inferred PUDL generation unit? For quantities that may not have a physically constrained range of valid values, do annual slices of the data at least statistically consistent with the historical values reported for that quantity? For example, fuel prices per unit delivered and per unit heat content. Do the fractions of different types of fuel consumed by FERC plants add up to 100%? Is there a strong correlation between total fuel cost and total heat content of reported fuel consumed for large steam plants in FERC 1? Are capacity factors generally between 0 and 1? Are plant construction years all after 1850? Is the fuel consumed for electricity generation always less than the total fuel consumed? Do any inferred generation units contain generators with differing primary fuels? Are plants reporting more than 8784 hours connected per year? Are coal and gas generator capacity factors within expected ranges? Are coal and gas generation unit heat rates within expected ranges? Known Issues This is probably not an exhaustive list. If you find something wonky, please bring it up in the Github issue tracker so we can keep track of it, fix it, or add it to the documentation at least. Consistency of Harvested Entity Attributes EIA 860 reports the same information about utilities, plants, and generators over the years. Many of the reported attributes (like a plant's latitude and longitude...) should be constant over time. We associate these attributes with the entity ID and store them in one table. However, in some cases the reported values are not perfectly consistent across all the available years of data. When that happens, PUDL chooses the most consistently reported value, so long as at least 70% of the reported values are identical (or very nearly identical in the case of numeric values). If the reported values are too inconsistent, the field is assigned N/A for that entity. Incomplete Boiler Generator Associations for Gas Plants Prior to 2015, EIA did not collect sufficient information to be able to infer complete boiler generator associations (and thus heat rates) for natural gas fired generators. In effect the fuel consumption of combustion turbines and the combustion turbine portions of combined cycle plants were excluded since they aren't really "boilers." In the case of combined cycle plants, this can result in impossibly low heat rates, since only additional fuel injected after the combustion turbine counts as fuel input associated with the power generated by the steam turbine. Unrealistically High Coal Mercury Content In 2012 a significant portion of the coal deliveries reported in the EIA 923 Fuel Receipts and Costs table had mercury content orders of magnitude higher than was possible, and higher than in any other report year. See Github issue 390 for details. Imperfect FERC Form 1 Plant ID Assignments Because FERC does not assign unique identifiers to the individual plants whose data are reported, and FERC Form 1 respondents are free to identify those facilities however they like from year to year, there is no entirely reliable way to link records pertaining to a given plant in one year to records pertaining to the same plant in another year. We use a record linkage algorithm that considers the reported plant names, capacities, years of construction, primary fuels, and other attributes to attempt to associate plant records with each other across years, but the process is imperfect. The plant_id_ferc1 values found in the plants_steam_ferc1 and fuel_ferc1 tables should be considered experimental and used with caution. See Github issue 144 for more on this endless saga. Non-unique Mappings Between plants_steam_ferc1 & fuel_ferc1 While the plant names and utility IDs found in the plants_steam_ferc1 and fuel_ferc1 in any given year for a particular plant are guaranteed to be the same, they are not guaranteed to be unique which means that in a few cases it is not possible to identify exactly how much or what kind of fuel is associated with a particular plant record. This is an issue with the original FERC Form 1 database design which we can't fix. Imperfect Data Entry Error Correction In some cases obvious errors have been made in data entry or units of measure. We have attempted to fix some of them (e.g. converting heat content reported in BTU per lb of coal into mmbtu per ton) and we are confident that overall these corrections have improved the quality of the dataset, but there are likely a few cases in which they have been applied incorrectly. If you find something off by a factor of 1000, please let us know! Imperfect Coding FERC does not restrict the vocabulary respondents may use to describe plant and fuel types, resulting in thousands of different strings being used. We have done our best to identify and categorize them all in the steam plants table, but this process is imperfect. Many other tables still have not been similarly coded, the plants_small_ferc1 and purchased_power_ferc1 tables remain especially messy. Reproducibility It's our intention that a user should be able to completely reproduce the data processing pipeline that we've used to generate this data release, and get the same outputs byte-for-byte, using only resources that are available in curated, long-term archives. The main requirements are a copy of the same original source data (archived as part of this data release), and a specification of the software environment, which can be reconstructed with packages from conda-forge. Original Source Data The original source data as downloaded from the public sources and used by the PUDL software to generate this data release are archived here alongside the outputs in the interest of reproducibility. The publishing agencies do not use version control or provide access to historically published versions, meaning that the same data may not remain available from them going forward. All of the original input data can be found in the pudl-input-data.tgz compressed archive distributed with this data release. The data it contains were downloaded from FERC, EIA, and EPA between January 31st and February 17th, 2020. A small amount of additional data that we have compiled by hand is distributed as part of the Python package. Software Environment This data release was generated using v0.3.2 of the catalystcoop.pudl Python package, which is available on the official Python Package Index as well as via conda using the community maintained conda-forge channel. It's also archived in the PUDL Github repository. and on Zenodo The archived-environment.yml file distributed in this archive describes the conda software environment in which this data release was generated. OS / Hardware The data release was generated on a 64 bit, Intel based Ubuntu Linux 19.10 system. The only specialized external library that was required outside of the conda framework was libsnappy-dev version 1.1.7-1. Note that this library should not be required if you use conda. The data processing pipeline used to generate this data rel
EIA, Energy Information Administration, EIA 860, Environmental Protection Agency, utility, EPA CEMS, EPA, electricity, EIA 923, FERC, FERC Form 1, energy, Federal Energy Regulatory Commission
Twitter Data
EIA, Energy Information Administration, EIA 860, Environmental Protection Agency, utility, EPA CEMS, EPA, electricity, EIA 923, FERC, FERC Form 1, energy, Federal Energy Regulatory Commission
Twitter Data
citations This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
views | 110 | |
downloads | 12 |