Chapter 2. The Data Store

Table of Contents

DataSources and DataSets
Multiple datasources
Hierarchical Datasources
Netcdf Directory Sources
Metadata
DataSource Configuration
Virtual DataSets
The DataSetSelection Class
Validating DataSetSelection Objects
Related Changes
Ideas for Further Changes
Data-Driven Models
DataSet Models
Data Sources as Models for Asynchronous Opens
Task List
The DerivedDataSet Class

DataSources and DataSets

At one point in the design, variables were being passed around the application by name. However, variables are so fundamental a concept in the design that they deserve to be objects. Variables have attributes like description and units, and they are associated with particular data sources. A variable also has data. Rather than make a distinction in the design between a variable and its data, why not use the DataSet class for both? Instead of returning a list of variable names, a DataSource returns DataSets for all its variables. Initially, the DataSets are empty, like data proxies.

Plots get created by the user, and the user selects the DataSets to draw on each Plot based on the DataSet's variable name or description. To actually plot the variables, the application collects the set of DataSets being plotted and passes them to the updateDataSets() method of the top-level DataSource. The DataSource passes the DataSets to each of its child DataSources, which then update and/or fill their DataSets as necessary in a backend-specific way. As DataSets migrate between being drawn or not, they can still maintain some cached data in case those data get chosen again.

Also, the complete set of available variables is always the collection of DataSets from all of the DataSources, even though some of those DataSets will be empty until assigned to a Plot. A DataSet can also be shared among multiple Plots. The DataSource keeps the current list of available variables as a vector of DataSets, and the rest of the application can deal only with DataSets. The GUI can generate the list of available variables by querying the DataStore for the available DataSets, and each DataSet provides the variable name and units and other attributes without requiring that the DataSet be filled with any data.
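As a rough illustration of the proxy idea, the sketch below builds a variable list for the GUI from metadata alone. DataSetSketch and variableLabels() are hypothetical names used only here; the DataSet in this design would delegate to its backend rather than hold the values itself.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a DataSet: metadata plus optional data.
struct DataSetSketch {
    std::string name;
    std::string units;
    std::vector<double> values;  // stays empty until a backend fills it

    bool isFilled() const { return !values.empty(); }
};

// Build the list of available variables from metadata only; no data is read.
std::vector<std::string> variableLabels(const std::vector<DataSetSketch>& sets) {
    std::vector<std::string> labels;
    for (const DataSetSketch& ds : sets) {
        labels.push_back(ds.name + " (" + ds.units + ")");
    }
    return labels;
}
```

The point is only that the label list can be produced while every DataSet is still an empty proxy.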

Behind the scenes, a DataStore can periodically flush LRU caches of data to keep memory under some configurable limit. Also, the DataSource backend has the option of pre-filling DataSets as an optimization, which would be especially useful for caching data being received anyway in realtime broadcasts. The non-selected DataSets would already exist for caching the realtime data until such time as they are selected.
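One way the flush might work is sketched below, under stated assumptions: the class name LruDataCache, the byte budget, and keying the cache by variable name are all hypothetical and not part of the design above.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical LRU cache of variable data, bounded by a byte budget.
class LruDataCache {
public:
    explicit LruDataCache(std::size_t maxBytes) : maxBytes_(maxBytes) {}

    // Insert or replace a variable's data, mark it most recently used,
    // then flush older entries if the budget is exceeded.
    void put(const std::string& name, std::vector<double> data) {
        erase(name);
        bytes_ += data.size() * sizeof(double);
        order_.push_front(name);
        entries_[name] = Entry{std::move(data), order_.begin()};
        flush();
    }

    // Return cached data, or nullptr if never cached or already flushed.
    const std::vector<double>* get(const std::string& name) {
        auto it = entries_.find(name);
        if (it == entries_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second.pos);  // mark as recent
        return &it->second.data;
    }

    std::size_t bytes() const { return bytes_; }

private:
    struct Entry {
        std::vector<double> data;
        std::list<std::string>::iterator pos;  // position in the recency list
    };

    void erase(const std::string& name) {
        auto it = entries_.find(name);
        if (it == entries_.end()) return;
        bytes_ -= it->second.data.size() * sizeof(double);
        order_.erase(it->second.pos);
        entries_.erase(it);
    }

    // Drop least recently used entries until the cache fits the budget.
    void flush() {
        while (bytes_ > maxBytes_ && !order_.empty()) {
            const std::string victim = order_.back();  // copy before erasing
            erase(victim);
        }
    }

    std::size_t maxBytes_;
    std::size_t bytes_ = 0;
    std::list<std::string> order_;  // front = most recently used
    std::unordered_map<std::string, Entry> entries_;
};
```

A periodic flush, as described above, would call something like flush() from a timer instead of from put(); the eviction order would be the same.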

DataSet has a public constructor which takes another DataSet as its parameter, not the variable metadata. The original DataSets come from the DataSource, and all other DataSets propagate from them. The only public way for the rest of the application to create a DataSet should be from another DataSet. The DataSet constructor gets a reference ID from the DataSetBack (so we need to add such a method, int DataSetBack::createReferenceID()), and initially the DataSet has no dependency on any data in the DataSetBack. Rather than write another constructor which takes a DataSet pointer, just implement the copy constructor:

	DataSet::DataSet (const DataSet& rhs)

The implementation copies the DataSetBack pointer from rhs and gets a new refID with createReferenceID().
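A minimal sketch of that copy constructor follows. Only createReferenceID() and the copy constructor signature come from the text; the member names, the accessors, the backend constructor, and the deleted assignment operator are assumptions made to keep the example self-contained.

```cpp
// Hypothetical minimal backend: its only job here is to hand out unique ids.
class DataSetBack {
public:
    int createReferenceID() { return nextId_++; }

private:
    int nextId_ = 0;
};

class DataSet {
public:
    // Assumed constructor for use by DataSource, which owns the backend.
    explicit DataSet(DataSetBack* back)
        : back_(back), refId_(back->createReferenceID()) {}

    // Copy constructor: share the backend pointer, take a fresh reference id.
    DataSet(const DataSet& rhs)
        : back_(rhs.back_), refId_(rhs.back_->createReferenceID()) {}

    // Assignment would duplicate a reference id, so this sketch disallows it.
    DataSet& operator=(const DataSet&) = delete;

    int refId() const { return refId_; }
    bool sharesBackendWith(const DataSet& other) const { return back_ == other.back_; }

private:
    DataSetBack* back_;
    int refId_;
};
```

Whether assignment should be forbidden or should release the old reference and acquire a new one is a design choice the text leaves open.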

Since DataSource needs to be able to create DataSets to return in its public method getDataSets(), DataSet needs a constructor which takes a DataSetBack. A DataSource uses the DataSetBack constructor which takes the variable name and units and such. Then from its vector of DataSetBack objects a DataSource can return a vector of DataSets to the rest of the application. There are (at least) two options for this:

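	// Option 1: append this source's DataSets to a caller-supplied list.
	void
	DataSource::
	collectDataSets (DataSetList& list);

	// Option 2: return the list by value.
	DataSetList
	DataSource::
	getDataSets();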

Maybe it would be enough to return DataSet pointers from getDataSets(), and then only plots and other entities with actual data dependencies need to use the value constructor DataSet::DataSet (DataSet*). However, it seems more consistent to use DataSets as value objects where the reference will persist, and that favors the first method.
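To compare the two signatures side by side, here is a small sketch. It assumes DataSetList is a vector of DataSet values and reduces DataSet to a name; neither assumption is fixed by the text.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, stripped-down DataSet holding only a variable name.
struct DataSet {
    std::string name;
};

typedef std::vector<DataSet> DataSetList;  // assumed definition

class DataSource {
public:
    explicit DataSource(DataSetList sets) : sets_(std::move(sets)) {}

    // Append this source's DataSets to a list the caller already holds.
    void collectDataSets(DataSetList& list) const {
        list.insert(list.end(), sets_.begin(), sets_.end());
    }

    // Return a fresh list containing copies of this source's DataSets.
    DataSetList getDataSets() const { return sets_; }

private:
    DataSetList sets_;
};
```

With the appending form, a parent source could pass one list through each of its children in turn, while the returning form would need the parent to merge the lists itself.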

Should the DataSet public interface include ways to change the data, such as to support editing? For now, no. The DataSet interface represents the set of public operations the rest of the application can perform with data, and for now that does not include editing. Thus the DataSource alone keeps pointers to DataSetBack objects to access their writable interface to cache realtime data and fill data requests. In fact, only DataSource should see the definition of DataSetBack; the rest of the code accesses data through the DataSet interface.

DataSetBack needs an 'int nextId' field to increment for generating unique reference IDs for DataSets.

DataSetBack keeps a list of DataCache objects in a vector<DataCache*>.

DataCache no longer keeps a list of the references to itself. Instead, DataSetBack itself maintains some kind of map from the references to the DataDomain each reference depends upon. The DataSetBack needs to know the specific DataDomain on which each reference depends in order to know how best to arrange the DataCache objects to fulfill all the dependencies. For example, one DataCache may hold the union of two domains. If one of the references is released, DataSetBack needs to know which half of the cache needs to be kept to satisfy the remaining reference.
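The bookkeeping described above might look roughly like the following. This is a sketch under strong simplifying assumptions: DataDomain is reduced to a single closed time interval, and setDomain(), release(), and requiredDomain() are hypothetical method names.

```cpp
#include <algorithm>
#include <map>

// Hypothetical one-dimensional domain: a closed time interval.
struct DataDomain {
    double begin;
    double end;
};

class DataSetBack {
public:
    int createReferenceID() { return nextId_++; }

    // Record (or replace) the domain a reference depends on.
    void setDomain(int refid, const DataDomain& d) { domains_[refid] = d; }

    // Forget a reference once its DataSet goes away.
    void release(int refid) { domains_.erase(refid); }

    // Compute the smallest interval covering every remaining reference.
    // Returns false when no reference depends on any data.
    bool requiredDomain(DataDomain& out) const {
        if (domains_.empty()) return false;
        auto it = domains_.begin();
        out = it->second;
        for (++it; it != domains_.end(); ++it) {
            out.begin = std::min(out.begin, it->second.begin);
            out.end = std::max(out.end, it->second.end);
        }
        return true;
    }

private:
    int nextId_ = 0;
    std::map<int, DataDomain> domains_;  // reference id -> domain it depends on
};
```

After one of two references is released, requiredDomain() shrinks to the half still in use, which tells the backend what part of a shared cache must be kept.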

This means the DataDomain does not need to be passed into data requests on a DataSet. Instead, the DataSet client calls DataSet::GetDouble() (for example), and DataSet propagates that call to the DataSetBack in a call to DataSetBack::GetDouble(refid). The DataSetBack looks up the domain for the given refid, finds the DataCache which contains that domain, and returns DataCache::GetDouble(domain). The domain needs to be passed into DataCache since the DataCache may actually hold a superset of the domain, so the DataCache needs to calculate the correct offset into its data arrays to return the data pointer.
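The lookup and offset calculation can be sketched as below. The sketch assumes a domain is a half-open range of sample indices and stores DataCache objects by value rather than by pointer; addCache(), setDomain(), and contains() are hypothetical helpers.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical domain: a half-open range of sample indices [first, last).
struct DataDomain {
    std::size_t first;
    std::size_t last;
};

// Holds a contiguous block of samples, possibly a superset of any one domain.
class DataCache {
public:
    DataCache(std::size_t first, std::vector<double> values)
        : first_(first), values_(std::move(values)) {}

    bool contains(const DataDomain& d) const {
        return d.first >= first_ && d.last <= first_ + values_.size();
    }

    // Pointer to the first sample of the requested domain inside the cache.
    // The caller must have checked contains(d) first.
    const double* GetDouble(const DataDomain& d) const {
        return values_.data() + (d.first - first_);
    }

private:
    std::size_t first_;  // index of the first cached sample
    std::vector<double> values_;
};

class DataSetBack {
public:
    void addCache(const DataCache& cache) { caches_.push_back(cache); }
    void setDomain(int refid, const DataDomain& d) { domains_[refid] = d; }

    // Look up the domain for refid, find a cache covering it, and return
    // the data pointer; nullptr if the refid or the data is unknown.
    const double* GetDouble(int refid) const {
        auto dom = domains_.find(refid);
        if (dom == domains_.end()) return nullptr;
        for (const DataCache& cache : caches_) {
            if (cache.contains(dom->second)) return cache.GetDouble(dom->second);
        }
        return nullptr;
    }

private:
    std::map<int, DataDomain> domains_;
    std::vector<DataCache> caches_;
};
```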

The DataDomain on which a DataSet reference depends will be updated only by a call to DataSource::updateDataSets(). We decided not to add DataSet methods which take an explicit DataDomain and cause an immediate data request. Instead, the same effect can be achieved by calling updateDataSets() with only the single DataSet, something like this:

	DataSet ds (some_other_ds);                  // new reference to the same variable
	DataSource* source = ds.getDataSource();
	DataDomain newdomain;
	newdomain.set ("time", begin, end);          // begin, end: the time range of interest
	DataSetList dslist;
	dslist.push_back (&ds);
	source->updateDataSets (newdomain, dslist);  // the only call that changes ds's domain
	double* times = ds.GetTimes();
	double* values = ds.GetDouble();
	// and away we go...

Which reminds me, methods like GetTimes() and GetDouble() need to return the length of the returned arrays.
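One possible shape for that, as a sketch only: return a small struct pairing the pointer with its length. DoubleArray and sumValues() are hypothetical, and the DataSet here owns its values directly just to keep the example self-contained.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical view type: a pointer plus the number of valid elements.
struct DoubleArray {
    const double* data;
    std::size_t length;
};

// Simplified DataSet that owns its values, for illustration only.
class DataSet {
public:
    explicit DataSet(std::vector<double> values) : values_(std::move(values)) {}

    // Return the data pointer together with its length.
    DoubleArray GetDouble() const {
        return DoubleArray{values_.data(), values_.size()};
    }

private:
    std::vector<double> values_;
};

// Example client: the loop bound comes from the returned length.
double sumValues(const DoubleArray& a) {
    double total = 0.0;
    for (std::size_t i = 0; i < a.length; ++i) total += a.data[i];
    return total;
}
```

An output parameter for the length would work as well; the struct just keeps the pointer and its length from drifting apart.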