
This dataset provides inventor-, organization-, and patent-level information on U.S. utility patents (1976–2021). It has been curated to enable research on gender disparities in patenting, inventor team composition, organizational characteristics, and innovation outcomes. The dataset is based on disambiguated inventor, assignee, and patent information, enriched with bibliometric, geographic, and citation indicators.

The dataset consists of three CSV files:

1. 01_distinct_inventor_information.csv

Unit of analysis: Unique inventors.
Description: Contains demographic, geographic, and innovation-related characteristics for distinct inventors in the dataset.
Key variables include:
- inventor_id: Unique inventor identifier.
- inventor_first_name, inventor_last_name: Disambiguated inventor names.
- patent_count: Number of patents linked to the inventor.
- gender_code, gender_evaluation_method: Assigned gender and method of inference.
- first_filing_year, last_filing_year: Patent activity period.
- first_author_patent_count: Number of patents with the inventor listed first.
- Technological scope: Counts of CPC subclasses and subgroups per inventor.
- Backward and forward citations: Sums and means across patents.
- Bibliometric indicators: Originality, generality, combinatorial novelty (cd_5, cd_2017y).
- Science linkage: Number of cited scientific papers.
- Geographic information: City, state, country, county, latitude/longitude, and FIPS codes.

2. 02_distinct_organizational_assignee_information.csv

Unit of analysis: Unique organizational assignees.
Description: Summarizes the characteristics of distinct organizational assignees, including patenting activity, gender composition of inventor teams, and bibliometric indicators.
Key variables include:
- assignee_id, disambig_assignee_organization: Unique ID and disambiguated organization name.
- patent_count: Number of assigned patents.
- assignee_type, assignee_type_name, assignee_type_name_adj: Organization type (e.g., firm, university, government).
- first_filing_year, last_filing_year: Patent activity period.
- Inventor gender composition: Male, female, undefined counts; all-male, all-female, and gender-collaboration team measures.
- Technological scope: Mean counts of CPC section, subclass, and group.
- Citation measures: Backward and forward citations, scientific publication citations.
- Bibliometric indicators: Originality, generality, combinatorial novelty.
- Gender ratios: Fraction of patents with women inventors, team gender ratios.

3. 03_utility_patent_information.csv

Unit of analysis: Individual utility patents.
Description: Provides patent-level information, including bibliometric measures, team composition, organizational assignment, and government funding reliance.
Key variables include:
- patent_id: Patent identifier.
- num_claims, filing_year, grant year/date: Patent characteristics.
- team_size: Inventor team size.
- Technological scope: CPC section, subclass, and group counts.
- Citations: Backward citations, forward citations (5/7/10 years), originality, generality.
- Disruption and novelty indicators: cd_5, cd_10, novelty upon granting.
- Assignee information: IDs, names, type, and counts.
- Inventor gender composition: Counts of male, female, undefined inventors; women participation indicators.
- Government reliance: Categorization of patents by reliance on government funding (two-type and three-type).
- WIPO categories: Sector and field identifiers and titles.
- Impact metrics: Percentile rankings, top 10% indicators for citation and disruption.
- Science linkage: Number of cited scientific papers and per-inventor measures.

Data Sources and Construction

The dataset integrates information from multiple sources:

- PatentsView open data platform: Core source of patent, inventor (including original gender code), assignee, and location data.
- Merged external datasets:
  - Funk, R. J., Park, M., & Leahey, E. (2022). Papers and patents are becoming less disruptive over time (1.0). Zenodo. https://doi.org/10.5281/zenodo.7258379
  - Fleming, L., Green, H., Li, G.-C., Marx, M., & Yao, D. (2019). Replication Data for: Government-funded research increasingly fuels innovation. Harvard Dataverse. https://doi.org/10.7910/DVN/DKESRC
  - Marx, M., & Fuegi, A. (2020). Reliance on science: Worldwide front-page patent citations to scientific articles. Strategic Management Journal, 41(9), 1572–1594. https://doi.org/10.1002/smj.3145
  - Marx, M., & Fuegi, A. (2022). Reliance on science by inventors: Hybrid extraction of in-text patent-to-article citations. Journal of Economics & Management Strategy, 31(2), 369–392. https://doi.org/10.1111/jems.12455
- Enhancements and derived variables:
  - Final gender code: Created using an LLM-assisted approach, as described in our associated research paper.
  - Patent indicators computed: Originality, generality, combinatorial novelty, etc.

Notes

- Only utility patents are included; design and plant patents are excluded.
- Gender inference is probabilistic, based on name-based algorithms plus LLM-assisted refinement. Results should be interpreted with care.
- Some location and gender data may remain incomplete or missing.
- Bibliometric indicators follow standard measures in the patent analytics literature.

This dataset description was created with the assistance of ChatGPT (GPT-5).
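
As an illustration of how the patent-level file might be used, the following minimal sketch computes, per filing year, the fraction of patents with at least one woman inventor. It relies only on the Python standard library. The column filing_year is named in the description above; the column name female_inventor_count is an assumption made for this example, since the description mentions counts of female inventors but does not give the exact header. The sample rows are made up and stand in for the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column name: the description mentions counts of female
# inventors per patent but does not give the exact header.
FEMALE_COUNT_COL = "female_inventor_count"


def share_with_women_by_year(rows, female_col=FEMALE_COUNT_COL):
    """Return {filing_year: fraction of patents with >= 1 woman inventor}."""
    totals = defaultdict(int)
    with_women = defaultdict(int)
    for row in rows:
        year_text = (row.get("filing_year") or "").strip()
        count_text = (row.get(female_col) or "").strip()
        if not year_text or not count_text:
            continue  # skip rows with a missing year or gender count
        year = int(float(year_text))
        totals[year] += 1
        if float(count_text) > 0:
            with_women[year] += 1
    return {year: with_women[year] / totals[year] for year in sorted(totals)}


# Tiny made-up sample standing in for 03_utility_patent_information.csv.
SAMPLE_CSV = """patent_id,filing_year,team_size,female_inventor_count
A1,2000,2,0
A2,2000,3,1
A3,2001,1,1
A4,2001,2,
"""

sample_shares = share_with_women_by_year(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(sample_shares)
```

To run this on the real data, one would pass a csv.DictReader opened on 03_utility_patent_information.csv and set FEMALE_COUNT_COL to the actual header found in the file. Rows with a missing year or gender count are skipped here, in line with the note that some gender data may be incomplete; other treatments of missing values may suit a given analysis better.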
