
# onedz_datasets_csv

This directory contains the **split CSV datasets** of the ZirconRegular_LLM project. All files are partitioned into manageable parts (~100,000–130,000 rows each) for batch processing, LLM ingestion, or memory-constrained workflows.

## Directory Structure

```
onedz_datasets_csv/
│
├── Total_UPb_split_parts/              # Main U-Pb geochronology database
│   ├── zircon_upb_part_01.csv
│   ├── zircon_upb_part_02.csv
│   └── ... (22 parts total)
│
├── Total_LuHf_split_parts/             # Lu-Hf isotope database (all files expert-checked)
│   ├── zircon_luhf_part_01.csv
│   ├── zircon_luhf_part_02.csv
│   └── zircon_luhf_part_03.csv
│
└── Experts_checked_UPb_split_parts/    # Expert-reviewed U-Pb subsets
    ├── expert_upb_part_01.csv
    ├── expert_upb_part_02.csv
    └── ... (14 parts total)
```

## Dataset Summary

| Dataset | Parts | Est. Total Rows | Columns | Content |
|---------|-------|-----------------|---------|---------|
| `Total_UPb_split_parts` | 22 | ~2,550,000 | 64 | Full detrital zircon U-Pb age database |
| `Total_LuHf_split_parts` | 3 | ~297,000 | 33 | Lu-Hf isotope data linked to U-Pb records (expert-checked) |
| `Experts_checked_UPb_split_parts` | 14 | ~1,497,000 | 64 | Peer-reviewed regional compilations (quality-controlled) |

---

## File Format

All CSV files follow the project standard:

| Property | Specification |
|----------|---------------|
| **Encoding** | UTF-8 with BOM (`utf-8-sig`) |
| **Delimiter** | Comma (`,`) |
| **Line endings** | LF (`\n`) |
| **Header** | Single header row with standardized column names |
| **Quoting** | Fields containing commas or newlines are double-quoted |

### U-Pb Standard Columns (64 total)

- **Bibliographic**: `Lead_Author`, `Year`, `Journal`, `Vol`, `Pages`, `Title`, `Web_Link`
- **Sample**: `Published_Sample_ID`, `Country_State`, `Region`, `Continent`, `Major_Geographic_Geologic_Unit`, `Minor_Geologic_Geographic_Unit`, `Group`, `Formation`, `Member`, `Locality`, `Profile`, `Latitude`, `Longitude`
- **Depositional Age**: `Depos_Age_Period`, `Depos_Age_Epoch`, `Depos_Age_Stage`, `Max_Depos_Age_Ma`, `Est_Depos_Age_Ma`, `Min_Depos_Age_Ma`
- **Analytical**: `Spectrometer`, `Spectrometer_Location`, `Institution`, `Spectrometer_Mode`, `Rock_Type_one`, `Rock_Type_two`, `Rock_Type_three`, `Grain`, `Spot_Location`, `Spot_diam`
- **Isotope Ratios**: `Pb206U238_iso`, `Pb207U235_iso`, `Pb207Pb206_iso`, `Pb208Th232_iso` (with one-sigma uncertainties)
- **Calculated Ages**: `Pb206U238_age`, `Pb207U235_age`, `Pb207Pb206_age`, `Best_age` (with one- and two-sigma uncertainties), `Discord`
- **Elemental**: `U_ppm`, `Th_ppm`, `Pb_ppm`, `Pb206Pb204`, `Pb204Pb206`, `UTh_ratio`, `ThU_ratio`

### Lu-Hf Columns (33 total)

Includes all bibliographic and sample metadata columns above, plus:

- `Upb_Age`, `Upb_Age_two_sigma`
- `176Hf177Hf_iso`, `176Lu177Hf_iso`, `176Yb177Hf_iso` (with 2-sigma uncertainties)
- `epsilon_Hf_0`, `epsilon_Hf_t` (with 1-sigma and 2-sigma uncertainties)
- `TDM1_Ma`, `TDM2_Ma` (with 2-sigma uncertainties)

---

## Usage Notes

1. **Load order**: When reassembling a full dataset, load its parts in numerical order (for example, `01` → `22` for `Total_UPb_split_parts`).
2. **Row overlap**: Parts are split sequentially; no duplicate rows exist across parts of the same dataset.
3. **Cross-dataset linkage**: Use `Lead_Author` + `Year` + `Published_Sample_ID` + `Grain` to link U-Pb records with Lu-Hf records.
4. **Expert vs. Total**: `Experts_checked_UPb_split_parts` is a **subset** of the total database, curated from peer-reviewed regional compilations. It does not contain all rows from `Total_UPb_split_parts`.
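
The load-order and linkage notes above can be sketched with the Python standard library. This is a minimal illustration, not the project's own tooling: the helper names (`part_number`, `iter_rows`, `link_key`, `index_by_link_key`, `link_upb_to_luhf`) are hypothetical, and the sketch assumes that part files end in `_part_<number>.csv` and that all four linkage columns are present in both the U-Pb and Lu-Hf files.

```python
import csv
import re
from pathlib import Path

# Columns named in the usage notes for linking U-Pb and Lu-Hf records.
LINK_KEYS = ("Lead_Author", "Year", "Published_Sample_ID", "Grain")


def part_number(path):
    """Return the numeric suffix of a file such as zircon_upb_part_07.csv."""
    match = re.search(r"_part_(\d+)\.csv$", path.name)
    if match is None:
        raise ValueError(f"unexpected part file name: {path.name}")
    return int(match.group(1))


def iter_rows(parts_dir):
    """Yield rows as dicts from every part in a directory, in numerical order."""
    paths = sorted(Path(parts_dir).glob("*_part_*.csv"), key=part_number)
    for path in paths:
        # utf-8-sig strips the BOM so the first header name is read cleanly;
        # newline="" lets the csv module handle newlines inside quoted fields.
        with open(path, encoding="utf-8-sig", newline="") as handle:
            yield from csv.DictReader(handle)


def link_key(row):
    """Build the four-column linkage key; missing values become ''."""
    return tuple((row.get(name) or "").strip() for name in LINK_KEYS)


def index_by_link_key(rows):
    """Group rows by linkage key (a key may map to several rows)."""
    index = {}
    for row in rows:
        index.setdefault(link_key(row), []).append(row)
    return index


def link_upb_to_luhf(upb_dir, luhf_dir):
    """Yield (upb_row, matching_luhf_rows) pairs, streaming the U-Pb parts."""
    # The Lu-Hf dataset is the smaller one, so it is the side held in memory.
    luhf_index = index_by_link_key(iter_rows(luhf_dir))
    for upb_row in iter_rows(upb_dir):
        yield upb_row, luhf_index.get(link_key(upb_row), [])
```

Calling `link_upb_to_luhf("Total_UPb_split_parts", "Total_LuHf_split_parts")` from inside `onedz_datasets_csv/` would stream the U-Pb rows one at a time, which suits the memory-constrained workflows the split is meant for. U-Pb rows without a Lu-Hf match are paired with an empty list rather than dropped.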
