Researchers collect and generate a lot of data that is often shared with other people (your colleagues, students, advisors, publishers, funders, and the public). Research data can be stored on multiple computers, which makes it hard to find what you are looking for, and the data may be difficult to understand in the future.
A Savvy Solution
The DataOne Best Practices Primer (PDF) recommends the following tips (with links to U of M Data Management resources).
- Plan for data management as your research proposal (for funding agency, dissertation committee, etc.) is being developed. Revisit your data management plan frequently during the project and make changes as necessary.
- Collection strategies. It is important to collect data in such a way as to ensure its usability later. Careful consideration of methods and documentation before collection occurs is important.
- Consider using a template during data collection. This will ensure that any relevant contextual data are collected, especially if there are multiple data collectors.
- Describe the contents of your data files (Readme File Template): Define each parameter, including its format, the units used, and codes for missing values (null = X, 0 = Y, 9999= Z, etc.). Provide examples of formats for common parameters. Data descriptions should accompany the data files as a “readme.txt” file, a metadata file using an accepted metadata standard, or both.
- Use consistent data organization: We recommend that you organize the data within a file in one of the two ways described below. Whichever style you use, be sure to place each observation on a separate line (row).
- Each row in a file represents a complete record and the columns represent all the parameters that make up the record (a spreadsheet format).
- One column is used to define the parameter and another column is used for the value of the parameter (a database format). Other columns may be used for site, date, treatment, units of measure, etc.
- Use the same format throughout the file; for instance, do not rearrange columns or rows within the file. At the top of the file, include one or more header rows that identify the parameter and the units for each column. “Atomize” data: make sure there is only one piece of data in each entry.
- Use plain text ASCII characters for variable names, file names, and data: this will ensure that your data file is readable by the maximum number of software programs.
- Use stable, non-proprietary software and hardware: File formats should ideally be non-proprietary (e.g. .txt or .csv files rather than .xls), so that they are stable and can be read well into the future. Consider the longevity of hardware when backing up data.
- Assign descriptive file names: File names ideally describe the project, file contents, location, and date, and should be unique enough to stand alone as file descriptions. File names do not replace complete metadata records.
- Keep your raw data raw: Preserve the raw data, with all of its imperfections. Use a scripted program to “clean” the data so that all steps are documented.
- Create a parameter table or data dictionary: Describe the codes and abbreviations used for a parameter, the units, maximum and minimum values, the type of data (e.g. text, numerical), and a description.
- Create a site table: Describe the sites where data were collected, including latitude, longitude, dates visited, and any contextual details (e.g. ecosystem type, land cover or use, weather conditions, etc.) that might affect the data collected.
- Assure: Perform basic quality assurance and quality control on your data, during data collection, entry, and analysis. Describe any conditions during collection that might affect the quality of the data.
- Identify values that are estimated, double-check data that are entered by hand (preferably entered by more than one person), and use quality level flags to indicate potential problems.
- Check the format of the data to be sure it is consistent across the data set. Perform statistical and graphical summaries (e.g. max/min, average, range) to check for questionable or impossible values and to identify outliers.
- Communicate data quality using either coding within the data set that indicates quality, or in the metadata or data documentation.
- Identify missing values. Check data using similar data sets to identify potential problems. Additional problems with the data may also be identified during analysis and interpretation of the data prior to manuscript preparation.
- Describe using data documentation (Readme File Template): Comprehensive data documentation (i.e. metadata) is the key to future understanding of data. Without a thorough description of the context of the data file, the context in which the data were collected, the measurements that were made, and the quality of the data, it is unlikely that the data can be easily discovered, understood, or effectively used.
- Preserve: Identify data with long-term value: It is not necessary to archive all of the data products generated from your research. Consider the size of your files, which data will be most useful for future data users (typically raw data), and which data versions would be most difficult to reproduce.
- Decide on a repository: Select a data repository (e.g. a national data center or the local Data Repository for the U of M) that is most appropriate for the data you will generate and for the community that will make use of the data. Talk with colleagues and research sponsors about the best repository for your discipline and your type of data. Check with the repository about requirements for submission, including required data documentation, metadata standards, and any possible restrictions on use (e.g. intellectual property rights).
- Discover, Integrate, and Analyze: When data sets and data elements are used as a source for new data sets, it is important to identify and document those data within the documentation of the new derived data set (i.e. data set provenance). This will enable 1) tracing the use of data sets and data elements, 2) attribution to the creators of the original data sets, and 3) identifying effects of errors in the original data sets or elements of those sets on derived data sets.
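As an illustration of the tips on keeping raw data raw, coding missing values, and using quality flags, here is a minimal Python sketch of a scripted cleaning step. It is only one possible approach. The column names (site, date, temp_c), the missing-value code 9999, the plausible range, and the flag labels are all assumptions made for this example; they do not come from the primer. The raw text is never altered: the script produces a separate cleaned copy with a quality flag on every row, then computes a basic min/max/mean summary from the rows that passed the checks.

```python
import csv
import io

# Hypothetical raw data. In practice this would be read from a file on disk,
# and that raw file would be kept unchanged.
RAW_CSV = """site,date,temp_c
A,2020-06-01,21.5
A,2020-06-02,9999
B,2020-06-01,-80.0
B,2020-06-02,19.0
"""

MISSING_CODE = "9999"        # assumed missing-value code, as a readme would document it
VALID_RANGE = (-50.0, 60.0)  # assumed plausible range for temp_c, in degrees Celsius


def clean(raw_text):
    """Return a cleaned copy of the rows, each with a qc_flag column added."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        value = row["temp_c"].strip()
        if value == MISSING_CODE:
            row["temp_c"] = ""          # blank out the code, flag the row
            row["qc_flag"] = "missing"
        elif not (VALID_RANGE[0] <= float(value) <= VALID_RANGE[1]):
            row["qc_flag"] = "suspect"  # keep the value, flag it for review
        else:
            row["qc_flag"] = "ok"
        rows.append(row)
    return rows


def summarize(rows):
    """Basic statistical summary over rows that passed the checks."""
    values = [float(r["temp_c"]) for r in rows if r["qc_flag"] == "ok"]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


cleaned = clean(RAW_CSV)
summary = summarize(cleaned)

for r in cleaned:
    print(r["site"], r["date"], r["temp_c"], r["qc_flag"])
print(summary)
```

Because every cleaning decision lives in the script, the steps are documented and can be rerun on the untouched raw file at any time. The flag values and the range used here would also belong in the readme or data dictionary.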
More Examples
The libraries can help you create a data management plan. We are interested in working with individuals to consult on the best ways to share, disseminate, and make accessible their research data. Here are some next steps you can take toward creating your plan:
- Take one of our data management workshops or watch a recorded session.
- Learn about the various funding agency requirements and recommendations (e.g. NSF Data Sharing Policy).
- View a list of subject-specific data repositories to determine the best place to share your data.
- Get access to data management tools and services on campus and include them in your data management plan.
- Consult with a data librarian to archive your data with the library.