A deeper walk through pynbody’s data access facilities#

The following example shows how to load a file, determine various attributes, access some data, make use of unit information, and understand how pynbody transforms data.

A shorter tutorial that deals with the absolute basics and shows how to make some plots is offered by the quick-start guide.

Note

This tutorial assumes basic familiarity with python and is written as a series of ipython cells, like a Jupyter notebook. Before you start make sure pynbody is properly installed. See Pynbody Installation for more information. You will also need the standard pynbody test files if you want to follow the tutorial. See Obtaining test data for more information.

First steps#

We will start by loading one of the test files. We’ll also load the numpy module as it provides some functions we’ll make use of later.

In [1]: import pynbody

In [2]: import numpy as np

In [3]: f = pynbody.load("testdata/gadget2/test_g2_snap")

Here we’ve loaded a sample gadget file. Not much seems to have happened when you called pynbody.load(), but the variable f is now your calling point for accessing data.

In fact f is an object known as a SimSnap.

Behind the scenes, the function inspects the provided path and decides what file format it is. At the time of writing, supported file formats include tipsy, nchilada, gadget-1, gadget-2, gadget-HDF and ramses. For most purposes you should never need to know what type of file you are dealing with – that’s the whole point of the pynbody framework.

Note

If you look inside the testdata folder, you’ll notice that our test snapshot is actually an example of a spanned gadget snapshot. There is not actually a file called gadget2/test_g2_snap, but rather two files from two CPUs, gadget2/test_g2_snap.0 and gadget2/test_g2_snap.1. pynbody knows to load both of these if you ask for gadget2/test_g2_snap; if you ask for gadget2/test_g2_snap.1 (for instance), it’ll load only that particular file.

The SimSnap object that’s returned is currently a fairly empty holder for data. The data will be loaded from disk as and when you request it.

Finding out something about the file#

Let’s start to inspect the file we’ve opened. The standard python operator len can be used to query the number of particles in the file:

In [4]: len(f)
Out[4]: 8192

We can also find out about particles of a particular type or family as it is known within pynbody. To find out which families are present in the file, use families():

In [5]: f.families()
Out[5]: [<Family gas>, <Family dm>, <Family star>]

You can pick out just the particles belonging to a family by using the syntax f.family. So, for example, we can see how many particles of each type are present:

In [6]: len(f.dm)
Out[6]: 4096

In [7]: len(f.gas)
Out[7]: 4039

In [8]: len(f.star)
Out[8]: 57

Useful information about the file is stored in a python dictionary called properties:

In [9]: f.properties
Out[9]: 
{'omegaM0': 0.2669,
 'omegaL0': 0.7331,
 'boxsize': Unit("3.00e+03 kpc a h**-1"),
 'a': 0.2777777798158637,
 'h': 0.71,
 'time': Unit("3.55e+00 s kpc a**1/2 h**-1 km**-1")}

Like any python dictionary, specific properties can be accessed by name:

In [10]: f.properties['a']
Out[10]: 0.2777777798158637

These names are standardized across different file formats. Here for example z means redshift, a means the cosmological scalefactor, h indicates the Hubble constant in standard units (100 km/s/Mpc).

Note

Actually f.properties has some behaviour which is very slightly different from a normal python dictionary. For further information see SimDict.

Retrieving data#

Like f.properties, f itself also behaves like a python dictionary. The standard python method f.keys() returns a list of arrays that are currently in memory.

In [11]: f.keys()
Out[11]: []

Right now it’s empty! That’s actually correct because data is only retrieved when you first access it. To find out what could be loaded, use the pynbody-specific method loadable_keys():

In [12]: f.loadable_keys()
Out[12]: ['vel', 'pos', 'mass', 'iord']

This looks a bit more promising. To access data, simply use the normal dictionary syntax. For example f['pos'] returns an array containing the 3D-coordinates of all the particles.

In [13]: f['pos']
Out[13]: 
SimArray([[  53.318974,  177.84364 ,  128.22311 ],
          [ 306.75046 ,  140.44455 ,  215.37149 ],
          [ 310.99908 ,   64.1345  ,  210.53595 ],
          ...,
          [2870.9016  , 2940.1711  , 1978.7949  ],
          [2872.4114  , 2939.2197  , 1983.916   ],
          [2863.6511  , 2938.0544  , 1980.0615  ]],
         shape=(8192, 3), dtype=float32, 'kpc a h**-1')

Note

Array names are standardized across all file formats. For instance, even if you load a Gadget-HDF file – which internally refers to the position array as coordinates – you still access that array from pynbody by the name pos. The intention is that code never needs to be adapted simply because you have switched file format. However the name mapping is fully configurable should you wish to adopt different conventions.

Some arrays are stored only for certain families. For example, densities are stored only for gas particles and are accessed as f.gas['rho']. To find out what arrays are available for the gas family, use f.gas.loadable_keys():

In [14]: f.gas.loadable_keys()
Out[14]: 
['rho',
 'nh',
 'vel',
 'sfr',
 'nhep',
 'nheq',
 'nhp',
 'pos',
 'smooth',
 'mass',
 'iord',
 'u',
 'nhe']

So, we can get the density of the gas particles like this:

In [15]: f.gas['rho']
Out[15]: 
SimArray([1.3888609e-09, 3.3617684e-09, 4.5273674e-09, ..., 8.5340952e-09,
          7.4101774e-09, 1.4051752e-09], shape=(4039,), dtype=float32, '1.00e+10 h**2 Msol a**-3 kpc**-3')

Note

The SimArray objects are actually numpy arrays with some added functionality (such as unit tracking, discussed below). Numerical operations are very nearly as fast as their numpy equivalents. However, if you want to squeeze the performance of your code, you can always get a vanilla numpy array by using the numpy view mechanism, e.g. f.gas['rho'].view(type=numpy.ndarray)

Creating your own arrays#

You can create arrays using the obvious assignment syntax:

In [16]: f['twicethemass'] = f['mass']*2

You can also define new arrays for one family of particles:

In [17]: f.gas['myarray'] = f.gas['rho']**2

An array created in this way exists only for the gas particles; trying to access it for other particles raises an exception.

Alternatively, you can define derived arrays which are calculated (and re-calculated) on demand. For example,

In [18]: @pynbody.derived_array
   ....: def thricethemass(sim) :
   ....:     return sim['mass']*3
   ....: 

At this point, nothing has been calculated. However, when you ask for the array, the values are calculated and stored

In [19]: f['thricethemass']
Out[19]: 
SimArray([0.02464408, 0.02464408, 0.02464408, ..., 0.02464408, 0.02464408,
          0.02464408], shape=(8192,), dtype=float32, '1.00e+10 Msol h**-1')

This has the advantage that your new thricethemass array is automatically updated when you change the mass array:

In [20]: f['mass'][0] = 1

In [21]: f['thricethemass']
Out[21]: 
SimArray([3.        , 0.02464408, 0.02464408, ..., 0.02464408, 0.02464408,
          0.02464408], shape=(8192,), dtype=float32, '1.00e+10 Msol h**-1')

Note, however, that the array is not re-calculated every time you access it, only if the mass array has changed. Therefore you don’t waste any time by using derived arrays. For more information see the reference documentation for derived arrays.

Keeping on top of units#

You might have noticed in the output from the above experiments that pynbody keeps track of unit information whenever it can.

Warning

It’s worth understanding exactly where pynbody gets this information from, in case anything goes wrong. Many simulation data formats now store units (e.g. gadget, arepo or swift’s HDF5 output, or ramses output folders). In such cases, pynbody will use the specified units. For nchilada and tipsy, a ChaNGa or gasoline .param file is sought in the directory from which you are loading the snapshot and its immediate parent. You can also create a text file with the same name as your snapshot but the extension .units to override the units at load time. For example, such a file can contain

pos: kpc a
vel: km s^-1
mass: Msol

to specify distance units are comoving kiloparsecs, velocity units are kilometers per second, and mass is in solar masses.

You can print out the units of any given array by accessing the units property:

In [22]: f['mass'].units
Out[22]: Unit("1.00e+10 Msol h**-1")

However, it’s usually more helpful to simply convert your arrays into something more managable than the internal units. Pynbody arrays can be converted using the in_units() function; just pass in a string representing the units you want.

In [23]: f['pos'].in_units('Mpc')
Out[23]: 
SimArray([[0.02086032, 0.06957889, 0.05016554],
          [0.12001192, 0.05494701, 0.08426115],
          [0.12167414, 0.02509174, 0.08236931],
          ...,
          [1.1232009 , 1.1503017 , 0.7741764 ],
          [1.1237916 , 1.1499294 , 0.77617997],
          [1.1203643 , 1.1494735 , 0.774672  ]],
         shape=(8192, 3), dtype=float32, 'Mpc')

Note

The function in_units() returns a copy of your array in new units. Next time you access f['pos'] it will be back in its original units. If you want to permanently convert the array in-place use convert_units() or see below.

Another option is to request that pynbody converts all your arrays into something sensible, using physical_units(),

In [24]: f.physical_units()

Take a look at what’s happened to the density:

In [25]: f.gas['rho']
Out[25]: 
SimArray([ 326.6502 ,  790.66406, 1064.8047 , ..., 2007.1586 , 1742.821  ,
           330.4872 ], shape=(4039,), dtype=float32, 'Msol kpc**-3')

Note that the conversion will also be made when loading any arrays in future; for example:

In [26]: f['vel']
Out[26]: 
SimArray([[ 27.938293 ,   4.983705 , -10.008866 ],
          [ 15.361564 ,   5.7859726,   4.3631563],
          [ -8.357319 ,  -2.8885257,  22.809904 ],
          ...,
          [ 27.749176 ,  85.60175  ,  15.532437 ],
          [ 40.755856 ,  59.442867 ,  44.244846 ],
          [ 38.383965 ,  68.63973  ,  46.01429  ]],
         shape=(8192, 3), dtype=float32, 'km s**-1')

A new array generated from a unary or binary operation will inherit the correct units. For example

In [27]: 5*f['vel']
Out[27]: 
SimArray([[139.69147 ,  24.918526, -50.04433 ],
          [ 76.807816,  28.929863,  21.81578 ],
          [-41.786594, -14.442629, 114.04952 ],
          ...,
          [138.74588 , 428.00876 ,  77.662186],
          [203.77928 , 297.21432 , 221.22423 ],
          [191.91983 , 343.19867 , 230.07144 ]],
         shape=(8192, 3), dtype=float32, 'km s**-1')

In [28]: (f['vel']**2).units
Out[28]: Unit("km**2 s**-2")

In [29]: np.sqrt(((f['vel']**2).sum(axis=1)*f['mass'])).units
Out[29]: Unit("km Msol**1/2 s**-1")

You can even associate arrays with the loaded SimSnap unit system even when you create them outside the SimSnap. This is useful for keeping things tidy with your unit conversions if you are calculating quantities that don’t apply to all of the particles. For instance:

In [30]: array = pynbody.array.SimArray(np.random.rand(10)) # make the newly-formed numpy array a pynbody array

In [31]: array.sim = f # this links the array to the simulation

In [32]: array.units = 'Mpc a' # we set units that require cosmology information

In [33]: array
Out[33]: 
SimArray([0.05428876, 0.22109127, 0.75276403, 0.75768469, 0.41987334,
          0.20815045, 0.34277263, 0.53425088, 0.39975123, 0.46980493], 'Mpc a')

In [34]: array.in_units('kpc')
Out[34]: 
SimArray([ 15.08021128,  61.41424325, 209.10112201, 210.46796966,
          116.6314834 ,  57.81957113,  95.21461912, 148.40302361,
          111.04200833, 130.50136926], 'kpc')

Note that the units were correctly converted into physical units in the last step.

Subsnaps#

An important concept within pynbody is that of a subsnap. These are objects that look just like a SimSnap but actually only point at a subset of the particles within a parent. Subsnaps are always instances of the SubSnap class.

You’ve already seen some examples of subsnaps, actually. When you accessed f.gas or f.dm, you’re given back a subsnap pointing at only those particles. However, subsnaps can be used in a much more general way. For example, you can use python’s normal array slicing operations. Here we take every tenth particle:

In [35]: every_tenth = f[::10]

In [36]: len(every_tenth)
Out[36]: 820

In common with python’s normal mode of working, this does not copy any data, it merely creates another pointer into the existing data. As an example, let’s modify the position of one of our particles in the new view:

In [37]: every_tenth['pos'][1]
Out[37]: SimArray([140.2888 , 122.21798,  75.80529], dtype=float32, 'kpc')

In [38]: every_tenth['pos'][1] = [1,2,3]

In [39]: every_tenth['pos'][1]
Out[39]: SimArray([1., 2., 3.], dtype=float32, 'kpc')

This change is reflected in the main snapshot.

In [40]: f['pos'][10]
Out[40]: SimArray([1., 2., 3.], dtype=float32, 'kpc')

Note

If you’re used to numpy’s flexible indexing abilities, you might like to note that, typically, f[array_name][index] == f[index][array_name]. The difference is that applying the index to the whole snapshot is more flexible and can lead to simpler code. In particular, numpy_array[index] may involve copying data whereas f[index] never does; it always returns a new object pointing back at the old one.

You can pass in an array of boolean values representing whether each successive particle should be included (True) or not (False). This allows the use of numpy’s comparison operators. For example:

In [41]: f_slab = f[(f['x']>1000)&(f['x']<2000)]

In [42]: f_slab['x'].min()
Out[42]: SimArray(1000.11975, dtype=float32, 'kpc')

In [43]: f_slab['x'].max()
Out[43]: SimArray(1173.6926, dtype=float32, 'kpc')

In [44]: f['x'].min()
Out[44]: SimArray(0.04504352, dtype=float32, 'kpc')

In [45]: f['x'].max()
Out[45]: SimArray(1173.6926, dtype=float32, 'kpc')

Here f_slab is pointing at only those particles which have x-coordinates between 1000 and 2000.

Note that subsnaps really do behave exactly like snapshots. So, for instance, you can pick out sub-subsnaps or sub-sub-subsnaps.

In [46]: len(f_slab.dm)
Out[46]: 1236

In [47]: len(f_slab.dm[::10])
Out[47]: 124

In [48]: f_slab[[100,105,252]].gas['pos']
Out[48]: 
SimArray([[1029.6443 ,  421.70737,  169.25818],
          [1097.2487 ,  377.8681 ,  149.81082],
          [1059.9261 ,  374.7446 ,  192.37201]], dtype=float32, 'kpc')

Note

Under most circumstances there is very little performance penalty to using a SubSnap. However in performance-critical code it is worth understanding a little more about what’s going on under the hood. See Performance optimisation in pynbody.

Filters#

Another way you can select a subset of particles is to use a filter. This can lead to more readable code than the equivalent explicitly written condition. For example, to pick out a sphere centered on the origin, you can use:

In [49]: from pynbody.filt import *

In [50]: f_sphere = f[Sphere('10 kpc')]

Centering#

Several built-in functions (e.g. those that plot images and make profiles) in pynbody like your data to be centered on a point of interest. The most straight-forward way to center your snapshot on a halo is as follows:

In [51]: f = pynbody.load("testdata/gasoline_ahf/g15784.lr.01024.gz");

In [52]: h = f.halos();

In [53]: pynbody.analysis.center(h[0])
Out[53]: <Transformation null, translate, offset_velocity>

We passed h[0] to the function center() to center the entire snapshot on the largest halo. The default centring uses the shrinking sphere method, which normally gives a really stable and precise centre for galaxies, halos or any other astronomical object (see the documentation for center() for more details).

Suppose we now want to center only the contents of halo 5, leaving the rest of the simulation untouched. This is no problem. Let’s check where a particle in halo 5 is, then shift it and try again. You’ll notice halo 1 doesn’t move at all.

In [54]: h[1]['pos'][0]
Out[54]: SimArray([-0.02857659,  0.03555186, -0.10901458], '6.85e+04 kpc a')

In [55]: h[5]['pos'][0]
Out[55]: SimArray([-0.01305085,  0.00186122, -0.04388914], '6.85e+04 kpc a')

In [56]: h5 = h[5]

In [57]: my_h5_transform = pynbody.analysis.center(h5, move_all=False)

In [58]: h[1]['pos'][0] # should be unchanged
Out[58]: SimArray([-0.02857659,  0.03555186, -0.10901458], '6.85e+04 kpc a')

In [59]: h5['pos'][0] # should be changed
Out[59]: SimArray([ 0.00124333, -0.00044998, -0.00101245], '6.85e+04 kpc a')

Note however that the data inside h5 (or any halo) just points to a subset of the data in the full simulation. So you now have an inconsistent state where part of the simulation has been translated and the rest of it is where it started out. For that reason, functions that transform data return a Tranformation object that conveniently allows you to undo the operation:

In [60]: my_h5_transform.revert()

In [61]: print(h5['pos'][0]) # back to where it started
[-0.01305085  0.00186122 -0.04388914]

In [62]: print(h[1]['pos'][0]) # still hasn't changed, of course
[-0.02857659  0.03555186 -0.10901458]

In fact, there’s a more pythonic and compact way to do this. Suppose you want to process h[5] in some way, but be sure that the centering is unaffected after you are done. This is the thing to do:

In [63]: with pynbody.analysis.center(h[5]):
   ....:     print("Position when inside with block: ", h[5]['pos'][0])
   ....: print("Position when outside with block: ", h[5]['pos'][0])
   ....: 
Position when inside with block:  [ 0.00124333 -0.00044998 -0.00101245]
Position when outside with block:  [-0.01305085  0.00186122 -0.04388914]

Inside the with code block, h[5] is centered. The moment the block exits, the transformation is undone – even if the block exits with an exception.

A deeper walk through pynbody’s data access facilities

Contents

A deeper walk through pynbody’s data access facilities#

First steps#

Finding out something about the file#

Retrieving data#

Creating your own arrays#

Keeping on top of units#

Subsnaps#

Filters#

Centering#