Project Data Management Criteria

There is a need for coordinated data management of observational data sets from the Continental-Scale Experiments (CSEs). The broader modeling community will use coordinated ground, atmospheric and satellite measurements of the type taken during these experiments to test such formulations as prognostic cloud schemes and the representativeness of related interactions being implemented in their global models. This process can be made much more efficient if these data sets are gathered into a uniform database easily accessible by the various modeling centers represented across the CSEs. A Data Management Working Group has been established to investigate the research plans of the CSEs to determine the strategies being developed for gathering the different observational data sets, including those generated during their intensive observation periods, into coordinated databases. The intent of this initiative is to leverage software and related resources by encouraging standardization (compatibility) of the CSE data system hardware and software schemes, to highlight opportunities for collaborative efforts in assembling data sets (e.g., global), and to foster co-operation and scientific outreach between CSEs by facilitating the exchange of data.

To achieve the objectives of GHP, careful attention must be paid to data management. The "groundwork" considerations outlined below are intended to help coordinate data management among the various GEWEX Continental-Scale Experiments.

1) Develop Data Management Plan - A data management plan should be drafted for each project that defines the data policy, so that project researchers, the scientific community, and involved data centers are aware of the procedures and technical aspects of the database. The database may contain value-added data sets pertinent to the project, and it is important to define these in advance, since many might be generated in real-time or require specialized products. The plan should also contain the specific information described below. A data manager should be identified early in this process to coordinate all of these data management issues.

TIMELINE - Publish the Data Management Plan 3 months prior to Experiment start.

2) Data Types - All data types, whether operational, research (or experimental), or interdisciplinary, should be inventoried and defined. This can be done in the form of a survey sent out to the data sources, data centers, and Principal Investigators (PIs). All data sources should also be identified, and any arrangements required to obtain the data (e.g. International or Interagency agreements) should be made. Whether episodic or longer term data are required will affect these arrangements, particularly if cost is a factor. Historical or climatological data may be needed to define scientific objectives early in the planning process. These data should also be included in the database.

TIMELINE - Begin working 6 to 8 months prior to Experiment start.

3) Data Formats and Volumes - This information should be obtained at the time of the survey or inventory of data types. Issues to consider are the investigator's needs (e.g. resolution, frequency) and the requirements for data storage (volume of each data set). Some data sets may need to be converted from their native format, depending on the intended use or the analysis tools to be applied. This conversion may be most easily done in real-time as the data are received. In any event, conversion software must be obtained or written, and data formats in the final archive must be compatible and easy for the researcher to use.

TIMELINE - Begin working 6 to 8 months prior to Experiment start.
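
The inventory described in items 2 and 3 could be kept as a simple structured list. The following is a minimal sketch only: the field names, data set names, sources, format labels, and volumes are all hypothetical placeholders, not values from any CSE survey.

```python
from dataclasses import dataclass


@dataclass
class InventoryEntry:
    """One surveyed data set (all field names are illustrative assumptions)."""
    name: str
    source: str
    category: str          # "operational", "research", or "interdisciplinary"
    native_format: str     # format in which the source delivers the data
    archive_format: str    # format chosen for the final archive
    volume_mb: float       # estimated storage volume in megabytes

    def needs_conversion(self) -> bool:
        # Conversion software is needed when the two formats differ.
        return self.native_format != self.archive_format


def total_volume_mb(entries):
    """Sum the estimated storage volume over all inventoried data sets."""
    return sum(e.volume_mb for e in entries)


def conversion_list(entries):
    """Names of data sets that must be converted from their native format."""
    return [e.name for e in entries if e.needs_conversion()]


# Placeholder entries, invented purely for illustration.
inventory = [
    InventoryEntry("surface_met", "network_a", "operational",
                   "FORMAT_A", "FORMAT_A", 120.0),
    InventoryEntry("radar_composite", "network_b", "operational",
                   "FORMAT_B", "FORMAT_A", 54000.0),
    InventoryEntry("flux_tower", "pi_group_c", "research",
                   "FORMAT_C", "FORMAT_A", 35.5),
]

print(total_volume_mb(inventory))
print(conversion_list(inventory))
```

Keeping volume and format in the same record as the data type means the storage estimate and the list of required conversion software both fall out of the one survey.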

4) Data Collection - Data collection should commence at least 2 weeks prior to experiment start to allow enough time to verify data ingest and archive procedures. If the data are required for real-time decision making, any visualization software should also be checked at this time. Metadata (information about the data) MUST accompany the data. Metadata should describe the instrumentation, calibration, site location, exposure, etc. Experience shows that this information is extremely difficult to obtain after the fact, and it is critical for determining the validity of the data during any in-field intercomparisons or quality assurance process.

TIMELINE - At least 2 weeks prior to Experiment start.
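
The requirement in item 4 that metadata accompany the data could be enforced at ingest time. The sketch below is one possible approach under stated assumptions: the required field names are taken from the items listed above, while the record layout and function names are hypothetical.

```python
# Fields named in item 4; the exact key spellings are an assumption.
REQUIRED_METADATA = ("instrumentation", "calibration", "site_location", "exposure")


def missing_metadata(record):
    """Return the required metadata fields that are absent or empty."""
    missing = []
    for key in REQUIRED_METADATA:
        value = record.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


def accept_for_ingest(record):
    """Accept a data set only when every required metadata field is present."""
    return not missing_metadata(record)


# Hypothetical records, invented for illustration.
complete = {
    "instrumentation": "tipping-bucket rain gauge",
    "calibration": "calibrated before deployment",
    "site_location": "site 12",
    "exposure": "open field",
}
incomplete = {
    "instrumentation": "tipping-bucket rain gauge",
    "calibration": "",
    "site_location": "site 12",
}

print(accept_for_ingest(complete))
print(missing_metadata(incomplete))
```

Rejecting or holding a submission while the field team is still on site is far cheaper than trying to recover calibration or exposure details later.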

5) Real-time Data Requirements - This issue is usually addressed in the Experiment Design or Operations Plan documents. Real-time data needs are usually determined by operational requirements or by the need to perform calibrations and intercomparisons in the field. Calibrations and intercomparisons are strongly recommended, because they verify instrument performance and identify problems early, before the entire data set is compromised. All this information can be archived in an on-line real-time catalog or in standard report-form documentation.

TIMELINE - During Field Operations (preparation for these data discussed in previous items).

6) Data Quality Control - This issue requires the most attention, because it is what assures a credible database in the final archive. Generally it should be the PI's responsibility to perform the Quality Control (QC) on their own data, since they are most knowledgeable regarding the data, instrumentation, and calibrations. In the case of operational data, preparing "composite" data sets that combine data from various networks and/or instruments can reveal biases through spatial and temporal analysis. The development and use of analysis tools (including software exchange) will increase the efficiency of quality control. In any case, all quality control changes to the data set (e.g. flagging or estimation) MUST be thoroughly documented. Changing actual data is not recommended unless fully documented.

TIMELINE - variable, but generally QC should be completed 6 to 12 months following the field deployment.
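
The principle in item 6 (flag rather than alter, and document every change) could be sketched as follows. This is an illustration only: the flag names, the simple range check, and the log layout are assumptions, not a prescribed QC procedure.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    time: str
    value: float
    flag: str = "unchecked"   # hypothetical flag vocabulary


@dataclass
class QCLog:
    """Documents every QC decision so changes can be traced later."""
    entries: list = field(default_factory=list)

    def record(self, obs, new_flag, reason):
        self.entries.append({
            "time": obs.time,
            "value": obs.value,
            "old_flag": obs.flag,
            "new_flag": new_flag,
            "reason": reason,
        })
        # Only the flag changes; the measured value is left untouched.
        obs.flag = new_flag


def range_check(observations, low, high, log):
    """Flag each observation as 'good' or 'suspect' against assumed limits."""
    for obs in observations:
        if low <= obs.value <= high:
            log.record(obs, "good", f"within [{low}, {high}]")
        else:
            log.record(obs, "suspect", f"outside [{low}, {high}]")


# Invented sample values and limits, for illustration only.
series = [
    Observation("t0", 12.5),
    Observation("t1", 999.0),
    Observation("t2", 13.1),
]
log = QCLog()
range_check(series, low=-50.0, high=60.0, log=log)

print([o.flag for o in series])
print(len(log.entries))
```

Because the original values are never overwritten, a later user who disagrees with a QC decision can still recover the data as measured.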

7) Data Archival - The biggest archival decision is whether to use a centralized or a de-centralized database. The advantage of a centralized database is that all data are in one location, and access is usually better coordinated and quicker. The disadvantage is that large storage is usually required and, in many cases, data sets will be duplicated from another data center. With the increase in Internet access, electronic links are making de-centralized databases more practical. Consideration should be given to coordinating with other data centers that might archive complementary data sets (particularly when pertinent co-located measurements are made through another program). Another issue is what data to keep on-line vs. off-line. Generally the smaller data sets (e.g. in-situ) are better suited to on-line retrieval, while larger data sets such as satellite and radar are stored on tapes. Again, the amount of data kept on-line depends on the data volume, the amount of available storage, project requirements/priorities, and the staff available to process more data sets. Finally, the concept of data integration should be considered when scientific objectives require the merger and "overlaying" of various data sets or the creation of special data "composites". This is a far more complex issue, because additional planning must ensure that the spatial and temporal characteristics of the observations are compatible. Standardized formats must also be implemented. Special data products are usually the benefit of an integrated database.

TIMELINE - 6 to 12 months depending when data sets are available.

8) Data Distribution - The large barrel theory (and black hole corollary) states that "It is a lot easier to archive data than to disseminate it"! Many issues must be considered. First, the policy of restricted vs. open data access depends on logistics such as funding, staff support, agreements between PIs, the quality of the data (preliminary versus final), and, in the case of a de-centralized database, the varying policies of numerous data centers. In some cases this will be determined on a data set by data set basis. In any event, ease of data access must be strongly encouraged. The data should be disseminated with all available metadata and inventories. In some cases the distribution of "browse" products, such as radar reflectivity composites, makes data selection and case study identification easier, particularly for voluminous data sets. The production and distribution of customized data requests must be weighed carefully because of their typically large impact on staff time and computer resources.

TIMELINE - 6 to 12 months depending when data sets are available.

9) Coordination with Other Programs - This issue is usually associated with a de-centralized database or with collaboration among projects that have compatible scientific objectives, such as those within GEWEX. The exchange of data and analysis tools requires good coordination and interoperability between data centers. In the case of collaborative research, the issue of standardized data formats becomes more critical. The benefit of such coordination is efficient sharing of the cost and resources of database development in this age of budget shortfalls.

TIMELINE - continuous as practicality dictates.
