This example demonstrates the basic usage of Trieste, with an emphasis on the added metadata that can be used for forensic analysis, or simply as a reminder to the author of the file.
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import trieste as tr
We'll begin by loading the Trieste file, demo.npz.
demo = tr.load('demo.npz')
Note the warning: The file was created with newer software than we are using, so it's possible that we could encounter a problem at some point. If we do encounter a problem, we can simply update our software and try again.
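Trieste's internals aren't shown here, but a forward-compatibility warning like this one is typically produced by comparing version tuples at load time. A minimal sketch (all names hypothetical, not Trieste's actual code):

```python
import warnings

# Hypothetical: the version of the library doing the reading.
LIB_VERSION = (1, 2, 0)

def check_version(file_version):
    """Warn if the file was written by a newer library version than ours."""
    if tuple(file_version) > LIB_VERSION:
        warnings.warn(
            "File created with newer software (v%s > v%s); "
            "consider upgrading." % (file_version, LIB_VERSION)
        )

check_version((1, 3, 0))   # newer file: triggers the warning
check_version((1, 1, 9))   # older file: silent
```

Tuple comparison is lexicographic, so this handles multi-part versions correctly without any parsing.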
Let's see what is in the file by printing its table of contents (TOC) using print_toc().
demo.print_toc()
Let's see if the person who created this file included documentation, in the form of a README:
print(demo.readme)
The author did not include a huge amount of info in the README doc, but it is still somewhat helpful. Let's see what metadata keys are available...
demo.metadata.keys()
It looks like we can find out who created this file:
demo.metadata['author']
So, the user named 'Nemo' is the author. What machine were they using? ...
demo.metadata['hostname']
Of course. And what was the working directory? ...
demo.metadata['author_working_dir']
Nemo was working in his home directory on nautilus. What kind of machine is nautilus? ...
demo.metadata['platform']
Linux 4.14 on an x86-64 CPU. When was the file created?
demo.metadata['creation_date']
If Nemo was using IPython or a Jupyter notebook when he created this file, we can find out exactly how he created the file...
demo.print_history()
This gives us much more info than the README document! Nemo loaded a JSON file named cities.json and then parsed it to create the table included in this Trieste file. The cities.json file was apparently located in his home directory.
In order to create the arrays in the file, Nemo loaded a Python module called sim and executed sim.flow_field((36, 48.5, 68.5)). So, this nautilus machine contains a Python module called sim, which we can examine in further detail if we can gain access to the machine.
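How could a library record this kind of session log? When running under IPython or Jupyter, the interpreter exposes the input history, which a save routine can capture. A sketch of how such capture might work (hypothetical; not Trieste's actual implementation):

```python
def capture_session_history():
    """Return the list of input cells from the current IPython session,
    or an empty list when not running under IPython/Jupyter."""
    try:
        ip = get_ipython()  # injected into builtins by IPython
    except NameError:
        return []           # plain Python: no session history available
    # history_manager.get_range() yields (session, line_number, source) tuples
    return [source for _, _, source in ip.history_manager.get_range()]
```

Embedding this list in the file at save time is what makes a later print_history() call possible.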
Okay, let's look at the contents of the file again...
demo.print_toc()
In the README, Nemo suggested plotting the flow field, stored in the arrays. Let's try that. We can access the content of each array by indexing them with an integer, as in:
xvals = demo[0]
xvals
To get the actual data stored in the array, we use the data attribute:
xvals.data
So, let's load the data for all of the arrays. Note that we can also access the data in the file by specifying the name of the array:
xvals = xvals.data
yvals = demo['yvals'].data
unit_vec = demo[2].data
rate = demo['rate'].data
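Notice that demo accepts both integer positions and array names. A minimal sketch of how such dual indexing can be implemented (a toy container, not Trieste's code):

```python
class ArrayStore:
    """Toy container allowing lookup by position or by name, like demo above."""
    def __init__(self, items):
        self._names = [name for name, _ in items]
        self._values = [value for _, value in items]

    def __getitem__(self, key):
        # Dispatch on the key type: strings look up by name,
        # everything else falls through to positional indexing.
        if isinstance(key, str):
            return self._values[self._names.index(key)]
        return self._values[key]

store = ArrayStore([('xvals', [0.0, 0.5, 1.0]), ('yvals', [0.0, 1.0])])
store[0]        # positional lookup
store['yvals']  # lookup by name
```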
Now, let's plot the flow field, as Nemo suggested:
plt.contourf(xvals, yvals, rate[50,:,:], cmap='hot')
plt.streamplot(xvals, yvals, unit_vec[0,50,:,:], unit_vec[1,50,:,:], color=rate[50,:,:], cmap='BuPu')
plt.axis('square')
Nemo seems to be showing us the flow field from an explosion or something!
plt.close()
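We don't know what sim.flow_field actually computes, but judging from the plot, a radially expanding outflow is a plausible analogue. A purely illustrative stand-in with the same array layout as the file (unit_vec indexed as [component, z, y, x], rate as [z, y, x]; the center is taken from Nemo's call, the grid is shrunk to stay lightweight):

```python
import numpy as np

def toy_flow_field(center, n=32):
    """Illustrative stand-in for sim.flow_field: a radial outflow in 3-D.

    Returns (unit_vec, rate), where unit_vec has shape (2, n, n, n) holding
    the (x, y) components of in-plane unit vectors and rate has shape (n, n, n).
    """
    cx, cy, cz = center
    coord = np.arange(n, dtype=float)
    # Index grids with shape (n, n, n), ordered (z, y, x) as in the file.
    zz, yy, xx = np.meshgrid(coord, coord, coord, indexing='ij')
    dx, dy, dz = xx - cx, yy - cy, zz - cz
    r = np.sqrt(dx**2 + dy**2 + dz**2)
    r[r == 0] = 1.0                     # avoid division by zero at the center
    rp = np.sqrt(dx**2 + dy**2)         # in-plane distance from the center
    rp[rp == 0] = 1.0
    unit_vec = np.stack([dx / rp, dy / rp])   # in-plane unit vectors
    rate = np.exp(-r / (0.25 * n))            # decays away from the center
    return unit_vec, rate

unit_vec, rate = toy_flow_field((36.0, 48.5, 68.5))
```

With arrays shaped like this, the same contourf/streamplot calls as above would work slice by slice along the first (z) axis.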
Now let's look at the city data table:
city_data = demo[4]
print(city_data.readme)
city_data
What are the columns of the table? ...
city_data.column_names
What are the data types of each column? ...
city_data.column_types
Let's look at the first 10 entries in the table, to get a feel for what's inside...
city_data[:10]
Oh! It looks like this is a list of US cities, sorted from highest population to lowest.
This table contains city coordinates, so we can create a scatter plot and scale the points so that cities with larger populations are larger...
lon = city_data['longitude']
lat = city_data['latitude']
pop = city_data['population']
sqrtpop = np.sqrt(pop)
normed_sqrtpop = sqrtpop/sqrtpop.max()
plt.scatter(lon, lat, s=30*normed_sqrtpop, linewidths=0, alpha=0.5)
plt.xlim((-130,-65))
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.close()
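Why the square root? City populations span roughly four orders of magnitude, so scaling markers by population directly would make small cities invisible; the square root compresses that range to about two orders. A minimal numeric check with made-up values:

```python
import numpy as np

# Made-up populations spanning ~3 orders of magnitude.
pop = np.array([8_500_000, 1_300_000, 140_000, 8_600])
sqrtpop = np.sqrt(pop)
normed = sqrtpop / sqrtpop.max()   # scales fall in (0, 1]
```

The smallest normalized value is much larger than the raw population ratio would be, which keeps small cities visible next to the largest ones.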
We can also look at the distribution of populations...
plt.hist(pop/1e6, bins=50, histtype='step', log=True)
plt.xlabel("Population (millions)")
plt.ylabel(r"$N_{\rm cities}$")
plt.close()
New York City (the one on the far right) is quite an outlier!
We could have also obtained a NumPy recarray, like this:
cities = city_data.as_recarray()
Then, we can do things like search for all cities named Pasadena:
cities[cities.city == 'Pasadena']
and obtain a record for the city at index 718 (the 719th most populous, given zero-based indexing):
cities[718]
We can also bind that record to a name and access its fields:
hburg = cities[718]
hburg.city
hburg.longitude, hburg.latitude
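For readers without demo.npz, the same recarray queries can be tried on a made-up miniature table (rows invented for illustration):

```python
import numpy as np

# Made-up miniature version of the cities table as a NumPy recarray.
cities = np.rec.fromrecords(
    [('Los Angeles', -118.24, 34.05, 3_900_000),
     ('Pasadena',    -118.14, 34.15,   140_000),
     ('Harrisburg',   -76.88, 40.27,    49_000)],
    names=['city', 'longitude', 'latitude', 'population'],
)

cities[cities.city == 'Pasadena']   # boolean-mask query on a field
rec = cities[2]                     # positional lookup returns one record
rec.city, rec.longitude             # fields are attributes on the record
```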
To do more complicated things with tables, it's helpful to use Pandas:
import pandas as pd
cities = pd.DataFrame(city_data.data)
cities[:15]
cities.loc[cities.city == 'Pasadena']
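As a standalone version of the same pandas query, again with made-up rows:

```python
import pandas as pd

# Made-up miniature cities table as a DataFrame.
cities = pd.DataFrame({
    'city':       ['Los Angeles', 'Pasadena', 'Harrisburg'],
    'longitude':  [-118.24, -118.14, -76.88],
    'latitude':   [34.05, 34.15, 40.27],
    'population': [3_900_000, 140_000, 49_000],
})

# .loc with a boolean mask selects matching rows.
pasadena = cities.loc[cities.city == 'Pasadena']
```

From here, the full pandas toolbox (groupby, merge, sorting) is available on the table.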