A deeper walk through pynbody’s data access facilities#
The following example shows how to load a file, determine various attributes, access some data, make use of unit information, and understand how pynbody transforms data.
A shorter tutorial that deals with the absolute basics and shows how to make some plots is offered by the quick-start guide.
Note
This tutorial assumes basic familiarity with python and is
written as a series of ipython
cells, like a Jupyter
notebook.
Before you start make sure pynbody
is properly
installed. See Pynbody Installation
for more information. You will also need the standard pynbody
test
files if you want to follow the tutorial. See Obtaining test data
for more information.
After you have extracted the testdata folder (e.g. with tar -xzf
testdata.tar.gz
), launch ipython
or Jupyter
. At the prompt or in
a cell, type import pynbody
. If all is installed correctly, this should silently
succeed, and you are ready to use pynbody
.
First steps#
We will start by loading one of the test files.
We’ll also load the numpy
module as it provides some
functions we’ll make use of later.
In [1]: import pynbody
In [2]: import numpy as np
In [3]: f = pynbody.load("testdata/gadget2/test_g2_snap")
Here we’ve loaded a sample gadget file. Not much seems to have
happened when you called pynbody.load()
, but the variable f
is now your calling point for accessing data.
In fact f
is an object known as a SimSnap
.
Behind the scenes, the function inspects the provided path and decides
what file format it is. At the time of writing, supported file formats
include tipsy, nchilada, gadget-1, gadget-2, gadget-HDF and
ramses. For most purposes you should never need to know what type of
file you are dealing with – that’s the whole point of the pynbody
framework.
Note
If you look inside the testdata
folder, you’ll notice that
our test snapshot is actually an example of a spanned gadget
snapshot. There is not actually a file called gadget2/test_g2_snap
, but
rather two files from two CPUs, gadget2/test_g2_snap.0
and
gadget2/test_g2_snap.1
. pynbody
knows to load both of these if you ask
for gadget2/test_g2_snap
; if you ask for gadget2/test_g2_snap.1
(for instance),
it’ll load only that particular file.
The SimSnap
object that’s returned is currently a fairly empty
holder for data. The data will be loaded from disk as and when you
request it.
Finding out something about the file#
Let’s start to inspect the file we’ve opened. The standard python operator len
can be used to query the number
of particles in the file:
In [4]: len(f)
Out[4]: 8192
We can also find out about particles of a particular type or family
as it is known within pynbody. To find out which families are present
in the file, use families()
:
In [5]: f.families()
Out[5]: [<Family gas>, <Family dm>, <Family star>]
You can pick out just the particles belonging to a family by using the
syntax f.family
. So, for example, we can see how many particles of
each type are present:
In [6]: len(f.dm)
Out[6]: 4096
In [7]: len(f.gas)
Out[7]: 4039
In [8]: len(f.star)
Out[8]: 57
Useful information about the file is stored in a python dictionary
called properties
:
In [9]: f.properties
Out[9]:
{'omegaM0': 0.2669,
'omegaL0': 0.7331,
'boxsize': Unit("3.00e+03 kpc a h**-1"),
'a': 0.2777777798158637,
'h': 0.71,
'time': Unit("3.55e+00 s kpc a**1/2 h**-1 km**-1")}
Like any python dictionary, specific properties can be accessed by name:
In [10]: f.properties['a']
Out[10]: 0.2777777798158637
These names are standardized across different file formats. Here for example z
means redshift, a
means the cosmological scalefactor, h
indicates
the Hubble constant in standard units (100 km/s/Mpc).
Note
Actually f.properties
has some behaviour which is
very slightly different from a normal python dictionary. For further
information see SimDict
.
Retrieving data#
Like f.properties
, f
itself also behaves like a python
dictionary. The standard python method
f.
keys()
returns a list of arrays
that are currently in memory.
In [11]: f.keys()
Out[11]: []
Right now it’s empty! That’s actually correct because data is only
retrieved when you first access it. To find out what could be loaded,
use the pynbody
-specific method loadable_keys()
:
In [12]: f.loadable_keys()
Out[12]: ['vel', 'pos', 'mass', 'iord']
This looks a bit more promising.
To access data, simply use the normal dictionary syntax. For example
f['pos']
returns an array containing the 3D-coordinates of all the
particles.
In [13]: f['pos']
Out[13]:
SimArray([[ 53.318974, 177.84364 , 128.22311 ],
[ 306.75046 , 140.44455 , 215.37149 ],
[ 310.99908 , 64.1345 , 210.53595 ],
...,
[2870.9016 , 2940.1711 , 1978.7949 ],
[2872.4114 , 2939.2197 , 1983.916 ],
[2863.6511 , 2938.0544 , 1980.0615 ]], dtype=float32, 'kpc a h**-1')
Note
Array names are standardized across all file
formats. For instance, even if you load a Gadget-HDF file – which
internally refers to the position array as coordinates – you
still access that array from pynbody by the name pos
. The
intention is that code never needs to be adapted simply because you
have switched file format. However the name mapping is fully
configurable should you wish to adopt
different conventions.
Some arrays are stored only for certain families. For example,
densities are stored only for gas particles and are accessed as
f.gas['rho']
. To find out what arrays are available for the gas
family, use
f.gas.
loadable_keys()
:
In [14]: f.gas.loadable_keys()
Out[14]:
['u',
'nheq',
'nhep',
'vel',
'nh',
'sfr',
'pos',
'rho',
'mass',
'nhe',
'nhp',
'smooth',
'iord']
So, we can get the density of the gas particles like this:
In [15]: f.gas['rho']
Out[15]:
SimArray([1.3888609e-09, 3.3617684e-09, 4.5273674e-09, ..., 8.5340952e-09,
7.4101774e-09, 1.4051752e-09], dtype=float32, '1.00e+10 h**2 Msol kpc**-3 a**-3')
Note
The SimArray
objects are actually
numpy
arrays with some added functionality (such as unit tracking,
discussed below). Numerical operations are very nearly as fast as
their numpy equivalents. However, if you want to squeeze the
performance of your code, you can always get a vanilla numpy array by
using the numpy
view mechanism,
e.g. f.gas['rho'].view(type=numpy.ndarray)
Creating your own arrays#
You can create arrays using the obvious assignment syntax:
In [16]: f['twicethemass'] = f['mass']*2
You can also define new arrays for one family of particles:
In [17]: f.gas['myarray'] = f.gas['rho']**2
An array created in this way exists only for the gas particles; trying to access it for other particles raises an exception.
Alternatively, you can define derived arrays which are calculated (and re-calculated) on demand. For example,
In [18]: @pynbody.derived_array
....: def thricethemass(sim) :
....: return sim['mass']*3
....:
At this point, nothing has been calculated. However, when you ask for the array, the values are calculated and stored
In [19]: f['thricethemass']
Out[19]:
SimArray([0.02464408, 0.02464408, 0.02464408, ..., 0.02464408, 0.02464408,
0.02464408], dtype=float32, '1.00e+10 Msol h**-1')
This has the advantage that your new thricethemass
array is
automatically updated when you change the mass
array:
In [20]: f['mass'][0] = 1
In [21]: f['thricethemass']
Out[21]:
SimArray([3. , 0.02464408, 0.02464408, ..., 0.02464408, 0.02464408,
0.02464408], dtype=float32, '1.00e+10 Msol h**-1')
Note, however, that the array is not re-calculated every time you
access it, only if the mass
array has changed. Therefore you don’t
waste any time by using derived arrays. For more information see
the reference documentation for derived arrays.
Keeping on top of units#
You might have noticed in the output from the above experiments that
pynbody
keeps track of unit information whenever it can.
Warning
It’s worth understanding exactly where pynbody gets this
information from, in case anything goes wrong. Many simulation data
formats now store units (e.g. gadget, arepo or swift’s HDF5 output, or
ramses output folders). In such
cases, pynbody will use the specified units.
For nchilada
and tipsy
, a ChaNGa or gasoline
.param
file is sought in the directory from which you are loading
the snapshot and its immediate parent. You can also create a text file
with the same name as your snapshot but the extension .units
to override
the units at load time. For example, such a file can contain
pos: kpc a
vel: km s^-1
mass: Msol
to specify distance units are comoving kiloparsecs, velocity units are kilometers per second, and mass is in solar masses.
You can print out the units of any given array by accessing the
units
property:
In [22]: f['mass'].units
Out[22]: Unit("1.00e+10 Msol h**-1")
However, it’s usually more helpful to simply convert your arrays into
something more managable than the internal units. Pynbody
arrays can
be converted using the in_units()
function; just pass in a string representing the units you want.
In [23]: f['pos'].in_units('Mpc')
Out[23]:
SimArray([[0.02086032, 0.06957889, 0.05016554],
[0.12001192, 0.05494701, 0.08426115],
[0.12167414, 0.02509174, 0.08236931],
...,
[1.1232009 , 1.1503017 , 0.7741764 ],
[1.1237916 , 1.1499294 , 0.77617997],
[1.1203643 , 1.1494735 , 0.774672 ]], dtype=float32, 'Mpc')
Note
The function in_units()
returns a copy of
your array in new units. Next time you access f['pos']
it will be
back in its original units. If you want to permanently convert the array in-place
use convert_units()
or see below.
Another option is to request that pynbody
converts all your arrays
into something sensible, using
physical_units()
,
In [24]: f.physical_units()
Take a look at what’s happened to the density:
In [25]: f.gas['rho']
Out[25]:
SimArray([ 326.6502 , 790.66406, 1064.8047 , ..., 2007.1586 , 1742.821 ,
330.4872 ], dtype=float32, 'Msol kpc**-3')
Note that the conversion will also be made when loading any arrays in future; for example:
In [26]: f['vel']
Out[26]:
SimArray([[ 27.938293 , 4.983705 , -10.008866 ],
[ 15.361564 , 5.7859726, 4.3631563],
[ -8.357319 , -2.8885257, 22.809904 ],
...,
[ 27.749176 , 85.60175 , 15.532437 ],
[ 40.755856 , 59.442867 , 44.244846 ],
[ 38.383965 , 68.63973 , 46.01429 ]], dtype=float32, 'km s**-1')
A new array generated from a unary or binary operation will inherit the correct units. For example
In [27]: 5*f['vel']
Out[27]:
SimArray([[139.69147 , 24.918526, -50.04433 ],
[ 76.807816, 28.929863, 21.81578 ],
[-41.786594, -14.442629, 114.04952 ],
...,
[138.74588 , 428.00876 , 77.662186],
[203.77928 , 297.21432 , 221.22423 ],
[191.91983 , 343.19867 , 230.07144 ]], dtype=float32, 'km s**-1')
In [28]: (f['vel']**2).units
Out[28]: Unit("km**2 s**-2")
In [29]: np.sqrt(((f['vel']**2).sum(axis=1)*f['mass'])).units
Out[29]: Unit("km Msol**1/2 s**-1")
You can even associate arrays with the loaded
SimSnap
unit system even when you create
them outside the SimSnap
. This is useful
for keeping things tidy with your unit conversions if you are
calculating quantities that don’t apply to all of the particles. For
instance:
In [30]: array = pynbody.array.SimArray(np.random.rand(10)) # make the newly-formed numpy array a pynbody array
In [31]: array.sim = f # this links the array to the simulation
In [32]: array.units = 'Mpc a' # we set units that require cosmology information
In [33]: array
Out[33]:
SimArray([0.61987701, 0.4786126 , 0.87755043, 0.16324242, 0.4292023 ,
0.40660903, 0.83750666, 0.84330569, 0.72184476, 0.17679248], 'Mpc a')
In [34]: array.in_units('kpc')
Out[34]:
SimArray([172.18805926, 132.94794458, 243.7640089 , 45.34511672,
119.22286089, 112.94695425, 232.64074101, 234.25158353,
200.51243569, 49.1090218 ], 'kpc')
Note that the units were correctly converted into physical units in the last step.
See also
For more information see the reference documentation for pynbody.units
.
Subsnaps#
An important concept within pynbody
is that of a subsnap. These are
objects that look just like a SimSnap
but actually only point
at a subset of the particles within a parent
. Subsnaps are always
instances of the SubSnap
class.
You’ve already seen some examples of subsnaps, actually. When you
accessed f.gas
or f.dm
, you’re given back a subsnap pointing
at only those particles. However, subsnaps can be used in a much more
general way. For example, you can use python’s normal array slicing
operations. Here we take every tenth particle:
In [35]: every_tenth = f[::10]
In [36]: len(every_tenth)
Out[36]: 820
In common with python’s normal mode of working, this does not copy any data, it merely creates another pointer into the existing data. As an example, let’s modify the position of one of our particles in the new view:
In [37]: every_tenth['pos'][1]
Out[37]: SimArray([140.2888 , 122.21798, 75.80529], dtype=float32, 'kpc')
In [38]: every_tenth['pos'][1] = [1,2,3]
In [39]: every_tenth['pos'][1]
Out[39]: SimArray([1., 2., 3.], dtype=float32, 'kpc')
This change is reflected in the main snapshot.
In [40]: f['pos'][10]
Out[40]: SimArray([1., 2., 3.], dtype=float32, 'kpc')
Note
If you’re used to numpy’s flexible indexing abilities, you
might like to note that, typically, f[array_name][index] ==
f[index][array_name]
. The difference is that applying the index to
the whole snapshot is more flexible and can lead to simpler code. In
particular, numpy_array[index]
may involve copying data whereas
f[index]
never does; it always returns a new object pointing back at
the old one.
You can pass in an array of boolean values representing
whether each successive particle should be included (True
) or not
(False
). This allows the use of numpy
’s comparison
operators. For example:
In [41]: f_slab = f[(f['x']>1000)&(f['x']<2000)]
In [42]: f_slab['x'].min()
Out[42]: SimArray(1000.11975, dtype=float32, 'kpc')
In [43]: f_slab['x'].max()
Out[43]: SimArray(1173.6926, dtype=float32, 'kpc')
In [44]: f['x'].min()
Out[44]: SimArray(0.04504352, dtype=float32, 'kpc')
In [45]: f['x'].max()
Out[45]: SimArray(1173.6926, dtype=float32, 'kpc')
Here f_slab
is pointing at only those particles which have
x-coordinates between 1000 and 2000.
Note that subsnaps really do behave exactly like snapshots. So, for instance, you can pick out sub-subsnaps or sub-sub-subsnaps.
In [46]: len(f_slab.dm)
Out[46]: 1236
In [47]: len(f_slab.dm[::10])
Out[47]: 124
In [48]: f_slab[[100,105,252]].gas['pos']
Out[48]:
SimArray([[1029.6443 , 421.70737, 169.25818],
[1097.2487 , 377.8681 , 149.81082],
[1059.9261 , 374.7446 , 192.37201]], dtype=float32, 'kpc')
Note
Under most circumstances there is very little performance
penalty to using a SubSnap
. However in performance-critical code it
is worth understanding a little more about what’s going on under the
hood. See Performance optimisation in pynbody.
Filters#
Another way you can select a subset of particles is to use a
filter
. This can lead to more readable code than the equivalent
explicitly written condition. For example, to pick out a sphere
centered on the origin, you can use:
In [49]: from pynbody.filt import *
In [50]: f_sphere = f[Sphere('10 kpc')]
See also
For more information see filters_tutorial, and for a list of filters, see pynbody.filt
.
Centering#
Several built-in functions (e.g. those that plot images and make
profiles) in pynbody
like your data to be centered on a point of
interest. The most straight-forward way to center your snapshot on a
halo is as follows:
In [51]: f = pynbody.load("testdata/gasoline_ahf/g15784.lr.01024.gz");
In [52]: h = f.halos();
In [53]: pynbody.analysis.center(h[0])
Out[53]: <Transformation translate, offset_velocity>
We passed h[0]
to the function
center()
to center the entire snapshot
on the largest halo. The default centring uses the shrinking sphere method,
which normally gives a really stable and precise centre for galaxies, halos
or any other astronomical object
(see the documentation for center()
for
more details).
Suppose we now want to center only the contents of halo 5, leaving the rest of the simulation untouched. This is no problem. Let’s check where a particle in halo 5 is, then shift it and try again. You’ll notice halo 1 doesn’t move at all.
In [54]: h[1]['pos'][0]
Out[54]: SimArray([-0.02857659, 0.03555186, -0.10901458], '6.85e+04 kpc a')
In [55]: h[5]['pos'][0]
Out[55]: SimArray([-0.01305085, 0.00186122, -0.04388914], '6.85e+04 kpc a')
In [56]: h5 = h[5]
In [57]: my_h5_transform = pynbody.analysis.center(h5, move_all=False)
In [58]: h[1]['pos'][0] # should be unchanged
Out[58]: SimArray([-0.02857659, 0.03555186, -0.10901458], '6.85e+04 kpc a')
In [59]: h5['pos'][0] # should be changed
Out[59]: SimArray([ 0.00124333, -0.00044998, -0.00101245], '6.85e+04 kpc a')
Note however that the data inside h5
(or any halo) just points
to a subset of the data in the full simulation. So you now have an
inconsistent state where part of the simulation has been translated
and the rest of it is where it started out. For that reason, functions
that transform data return a Tranformation
object that conveniently
allows you to undo the operation:
In [60]: my_h5_transform.revert()
In [61]: print(h5['pos'][0]) # back to where it started
[-0.01305085 0.00186122 -0.04388914]
In [62]: print(h[1]['pos'][0]) # still hasn't changed, of course
[-0.02857659 0.03555186 -0.10901458]
In fact, there’s a more pythonic and compact way to do this. Suppose
you want to process h[5]
in some way, but be sure that the
centering is unaffected after you are done. This is the thing to do:
In [63]: with pynbody.analysis.center(h[5]):
....: print("Position when inside with block: ", h[5]['pos'][0])
....: print("Position when outside with block: ", h[5]['pos'][0])
....:
Position when inside with block: [ 0.00124333 -0.00044998 -0.00101245]
Position when outside with block: [-0.01305085 0.00186122 -0.04388914]
Inside the with
code block, h[5]
is centered. The moment the block
exits, the transformation is undone – even if the block exits with an
exception.
See also
For more information about centering, see the documentation for
center()
.
For more information about transformations in general, see the documentation for the
transformation
module.