Quickstart
==========

Installation
------------

.. code-block:: bash

   pip install dendros

   # With optional pandas support:
   pip install 'dendros[pandas]'

   # With matplotlib for plotting analyses:
   pip install 'dendros[plot]'

   # Development version from GitHub:
   pip install git+https://github.com/galacticusorg/dendros.git

Opening files
-------------

.. code-block:: python

   from dendros import open_outputs

   # Single file
   c = open_outputs("galacticus.hdf5")

   # Auto-detect MPI-split outputs
   c = open_outputs("galacticus_MPI:0000.hdf5")

   # Explicit list or glob
   c = open_outputs(["rank0.hdf5", "rank1.hdf5"])
   c = open_outputs("run001/galacticus*.hdf5")

   # Lightcone mode
   c = open_outputs("lightcone.hdf5", output_root="Lightcone")

Use :class:`~dendros.Collection` as a context manager to ensure file handles
are closed automatically::

   with open_outputs("galacticus.hdf5") as c:
       tbl = c.list_outputs()

Checking completion status
--------------------------

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       c.validate_completion()           # raises RuntimeError if incomplete
       c.validate_completion(mode="warn")    # emit UserWarning instead
       c.validate_completion(mode="ignore")  # silent

Listing outputs
---------------

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       tbl = c.list_outputs()          # astropy Table
       df  = c.list_outputs(format="pandas")

The table contains columns: ``index``, ``name``, ``time``,
``scale_factor``, ``redshift``, and ``output_type``.  The ``output_type``
column reports the kind of output each group holds (``tree``, ``node``,
``snapshot``, or ``lightcone``); it is ``None`` (a missing value) for older
files that predate the ``outputType`` attribute.

Listing properties
------------------

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       tbl = c.list_properties("Output1")  # by name
       tbl = c.list_properties(1)          # by 1-based integer index

Columns: ``name``, ``dtype``, ``shape``, ``description``, ``units``.  The
``units`` column shows a human-readable units description (e.g.
``"Solar masses"``), taken from the dataset's ``units`` attribute; it is
blank for dimensionless datasets.

Reading datasets
----------------

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       data = c.read("Output1", ["nodeData/basicMass"])
       # data["nodeData/basicMass"] is an astropy Quantity (in solar masses)

       # Custom labels via dict
       data = c.read(
           "Output1",
           {"Mhalo": "nodeData/basicMass", "Mstar": "nodeData/diskMassStellar"},
       )

Units and ``Quantity`` objects
------------------------------

By default, datasets that carry a units ``quantity`` string are returned as
:class:`astropy.units.Quantity` objects, so you can convert between units and
carry units through calculations:

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       mass = c.read("Output1", ["nodeData/basicMass"])["nodeData/basicMass"]

   mass.unit                 # Unit("solMass")
   mass.to("kg")             # convert to kilograms
   mass.value                # the underlying numpy array

Dimensionless datasets (those with an empty ``quantity``) are always returned
as plain :class:`numpy.ndarray` objects.  To disable the ``Quantity`` wrapping
entirely and get plain arrays for every dataset, pass ``as_quantity=False``:

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       data = c.read("Output1", ["nodeData/basicMass"], as_quantity=False)
       # data["nodeData/basicMass"] is a plain numpy array

Filtering
---------

Pass a boolean mask or integer index array as ``where``:

.. code-block:: python

   with open_outputs("galacticus.hdf5") as c:
       masses = c.read("Output1", ["nodeData/basicMass"])["nodeData/basicMass"]
       mask = masses.value > 1e12

       data = c.read(
           "Output1",
           {"Mhalo": "nodeData/basicMass", "Mstar": "nodeData/diskMassStellar"},
           where=mask,
       )

Tracing galaxy histories
------------------------

Tracing only makes sense for outputs in which a galaxy persists across cosmic
time — those of ``outputType`` ``tree`` or ``snapshot``.  In a ``lightcone``
output each galaxy is seen only once, and ``node`` outputs carry no persistent
cross-output identity, so :meth:`~dendros.Collection.trace_history` raises a
:class:`ValueError` if any chosen output is of type ``node`` or ``lightcone``.
Restrict to traceable outputs with the ``outputs=`` argument, or read those
outputs directly with :meth:`~dendros.Collection.read`.  Outputs from older
files that lack the ``outputType`` attribute are assumed traceable.

Given one or more ``nodeUniqueIDBranchTip`` values, dendros can assemble each
galaxy's full history across all outputs:

.. code-block:: python

   from dendros import open_outputs

   with open_outputs("galacticus.hdf5") as c:
       ids = [101, 104]   # branch-tip IDs of galaxies of interest
       hist = c.trace_history(
           ids,
           {"Mstar": "nodeData/diskMassStellar"},
       )

   # hist["Mstar"]             shape (2, n_outputs); NaN where absent
   # hist["time"]              shape (2, n_outputs); NaN where absent
   # hist["expansion_factor"]  shape (2, n_outputs)
   # hist["present"]           bool mask, shape (2, n_outputs)
   # hist["output_names"]      object array of output group names
   # hist["ids"]               int64 array, the normalized input

A 2-D per-galaxy property (e.g. a spectrum of shape ``(N_gals, n_bins)``) is
returned as a 3-D array of shape ``(n_galaxies, n_bins, n_outputs)`` — one
extra trailing axis for time. Each galaxy need not be present at every output
(it may have formed later or merged earlier), so history arrays are *ragged*
in time. Absent slots are filled with:

* ``NaN`` for floating-point properties (and for the ``time`` and
  ``expansion_factor`` arrays);
* the value of ``int_sentinel`` (default ``-1``) for integer properties;
* ``False`` for boolean properties.

The ``present`` mask is the canonical indicator of presence and should be
preferred to sentinel checks::

   import numpy as np
   mask = hist["present"][0]         # galaxy 0 presence across outputs
   times  = hist["time"][0][mask]
   masses = hist["Mstar"][0][mask]

Restrict to a subset of outputs with the ``outputs=`` argument (accepts a
``range``, a list of 1-based integers, or output group names)::

   hist = c.trace_history(ids, ["nodeData/basicMass"], outputs=range(1, 6))

Multi-file collections search each file independently. For arbitrary
user-provided file lists or globs, ``nodeUniqueIDBranchTip`` collisions are
possible across files, so by default if the same ID is found in more than
one file at the same output the call raises :class:`ValueError`; pass
``on_duplicate_file_match="warn"`` or ``"first"`` to keep the first
match instead. True Galacticus MPI-split outputs are a separate case:
there ``nodeUniqueIDBranchTip`` is expected to be unique across ranks/files
for a given output.

If ``nodeUniqueIDBranchTip`` was not included in the Galacticus run, the
function raises a :class:`KeyError` that points you at the missing output
property.

MPI outputs
-----------

Galacticus MPI runs produce files suffixed ``_MPI:NNNN``.  Dendros detects and
groups them automatically when you pass any single rank's filename or a glob::

   c = open_outputs("galacticus_MPI:0000.hdf5")  # auto-detects all peers

:meth:`~dendros.Collection.read` concatenates arrays from all ranks along
axis 0.

Star formation histories
------------------------

Star formation histories are stored per galaxy, tabulated over time and
metallicity.  Dendros provides functions to collapse (sum) over metallicity
and to recover the tabulation times:

.. code-block:: python

   from dendros import sfh_collapse_metallicities, sfh_times

   with open_outputs("galacticus.hdf5") as c:
       sfh = c["Outputs/Output1/nodeData/diskStarFormationHistoryMass"]
       times = sfh_times(sfh)
       collapsed = sfh_collapse_metallicities(sfh)

When every galaxy is tabulated at the same times (a shared ``time`` attribute),
:func:`~dendros.sfh_collapse_metallicities` returns a fixed-length 2D array of
shape ``(n_galaxies, n_times)`` and :func:`~dendros.sfh_times` returns the
common 1D time array.

Lightcone runs commonly use the ``fixedAges`` method, where each galaxy is
tabulated at a fixed set of ages relative to its lightcone-crossing time.  Ages
that precede the Big Bang are dropped, so galaxies that cross earlier retain
fewer bins and the per-galaxy arrays have different lengths.  Dendros detects
this method (from the ``Parameters`` group) and *right-aligns* the histories
into non-ragged 2D arrays of shape ``(n_galaxies, n_ages)``: the
crossing-time bin is the last column, dropped bins are padded at the front
(with ``0`` for masses and ``NaN`` for times), and the companion
``...Times`` dataset supplies the per-galaxy times.  Column ``j`` of the mass
and time arrays refer to the same tabulation bin, though — because each galaxy
crosses at a different cosmic time — a given column holds a different absolute
time for each galaxy (but the same lookback age relative to crossing).

.. code-block:: python

   with open_outputs("lightcone.hdf5", output_root="Lightcone") as c:
       sfh = c["Lightcone/Output1/nodeData/diskStarFormationHistoryMass"]
       collapsed = sfh_collapse_metallicities(sfh)   # (n_galaxies, n_ages)
       times = sfh_times(sfh)                         # (n_galaxies, n_ages), NaN-padded

For other variable-length tabulations (no fixed-age structure), the collapsed
histories are returned as a list of 1D arrays and :func:`~dendros.sfh_times`
returns ``None``.

Plotting analyses
-----------------

If a Galacticus run was configured to write reduced analysis results, the
HDF5 file will contain a top-level ``/analyses`` group with one subgroup per
analysis.  Dendros can list and plot every ``function1D`` analysis,
overlaying the model curve with the observational/target data when
present.  Requires the ``plot`` extra (``pip install 'dendros[plot]'``).

For MPI runs, the ``/analyses`` data is reduced over all ranks and is
identical in every rank's file, so dendros reads only the primary file.

.. code-block:: python

   from dendros import open_outputs

   with open_outputs("galacticus.hdf5") as c:
       # Tabulate available analyses (no matplotlib needed for this).
       print(c.list_analyses())

       # Plot every analysis; returns dict[name, matplotlib.figure.Figure].
       figs = c.plot_analyses()

       # Plot one analysis and also save to disk.
       figs = c.plot_analyses(
           name="stellarMassFunction",
           output_directory="figs",
           file_format="pdf",
       )

       # Hide the target overlay (model only).
       figs = c.plot_analyses(show_target=False)

To compare several models on the same figure, open them as a
:class:`~dendros.ModelCollection` with :func:`~dendros.open_models` and
pass the result to the module-level :func:`~dendros.plot_analyses`.  The
target overlay is shared across models so it is drawn only once; each
model contributes its own curve, labelled by the dict key (or by its
primary file stem when no dict is supplied).  MPI-split files are still
treated as a single model — only inter-model differences produce
additional curves.

.. code-block:: python

   from dendros import open_models, plot_analyses

   with open_models({"Fiducial": "fid.hdf5", "Variant": "var.hdf5"}) as m:
       figs = plot_analyses(m)

       # Or with default labels derived from filenames:
       # figs = plot_analyses(open_models(["fid.hdf5", "var.hdf5"]))

       # Or pass an explicit list with custom labels:
       # figs = plot_analyses(list(m.values()), labels=["A", "B"])