
The dataset contains data collected over 1 million time steps across different RL environments in MuJoCo Playground, using trained agents with actions sampled from the learned Gaussian distribution. As an example, only the data for the following environments are released here: CartpoleBalance, CheetahRun, and FingerSpin.

Each environment-specific folder contains a .npz file with the keys described in the table below (here num_timesteps = 1e6).

| Key | Definition |
| --- | --- |
| states | True state values, including initial and terminal states, with shape (num_timesteps x state_dimension) |
| actions | Action values with shape (num_timesteps x action_dimension) |
| state_jacobians | Jacobian matrices representing the derivative of the next true state with respect to the current state, with shape (num_timesteps x state_dimension x state_dimension) |
| action_jacobians | Jacobian matrices representing the derivative of the next true state with respect to the action, with shape (num_timesteps x state_dimension x action_dimension) |
| obs | Observation values with shape (num_timesteps x obs_dim) |
| rewards | Reward values with shape (num_timesteps x 1) |
| dones | Done indicator: a binary vector of shape (num_timesteps x 1) indicating episode terminations. A value of 1 marks a timestep where an episode ends. The index immediately after it in the states data holds the terminal state. At this terminal index, all keys except for "observation" are set to -1. |
| total_steps_collected | The total number of timesteps collected |
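The done-indicator convention above can be illustrated with a short Python sketch that reads one .npz file and splits the flat arrays into per-episode index ranges. This is only a sketch under stated assumptions: the helper names `load_dataset` and `episode_slices` are hypothetical and not part of the dataset, the file path is a placeholder, and the code assumes that a row at index t with dones equal to 1 is the last regular step of an episode and that the row at index t + 1 is the terminal-state row, which is skipped.

```python
import numpy as np


def load_dataset(path):
    """Read every array stored in an .npz file into a plain dict.

    `path` is a placeholder; the actual file names are not given in the
    description above.
    """
    with np.load(path) as data:
        return {key: data[key] for key in data.files}


def episode_slices(dones):
    """Return (start, stop) index pairs, one per episode, stop exclusive.

    Assumption: dones[t] == 1 marks the last regular step of an episode and
    index t + 1 is the terminal-state row (filled with -1), which is skipped.
    """
    flat = np.asarray(dones).reshape(-1)
    slices = []
    start = 0
    for t in np.flatnonzero(flat == 1):
        slices.append((start, int(t) + 1))
        start = int(t) + 2  # jump over the terminal-state row at t + 1
    if start < flat.shape[0]:
        # Trailing steps of an episode that was still running at the end.
        slices.append((start, int(flat.shape[0])))
    return slices


# Synthetic stand-in for the "dones" key, shape (num_timesteps x 1).
toy_dones = np.array([[0], [0], [1], [-1], [0], [1], [-1], [0]])
print(episode_slices(toy_dones))  # [(0, 3), (4, 6), (7, 8)]
```

With the real data, the same index ranges could be applied to any of the per-timestep keys, for example `data["actions"][start:stop]`, provided the assumed layout matches the released files.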
