Data is more than just data
In the Reference Model for an Open Archival Information System (OAIS) (Wikipedia), data is defined as "[a] reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen."
Types of data include:
- observational data
- laboratory experimental data
- computer simulation
- textual analysis
- physical artifacts or relics
For the social sciences, data generally consists of numeric files originating from social research methodologies or administrative records, from which statistics are produced. It also includes, however, other formats such as audio, video, geospatial, and other digital content germane to social science research.
Digital text is becoming increasingly important in the humanities and arts. Researchers in these areas may think of data in the form of textual information, semantic elements, and text objects. Digital Arts, Sciences, and Humanities (DASH), on campus, is an example of research emerging in this area.
University of Minnesota examples
At the University of Minnesota, research data produced by the various disciplines have varying characteristics. Here are some examples of digitally available data:
- National Center for Earth-surface Dynamics data repository maintained at the UMN
- Geospatial data for wildlife refuge management and planning compiled by the Forestry and Wildlife department
- Cedar Creek Ecosystem Science Reserve Data from the College of Biological Sciences
- Election data archive, a historical database of Upper Midwestern state election data dating to the mid-1800s
- Bell Museum Scientific Collections offer a range of artifact collections from amphibian specimens to geologic core samples
- Movie Recommendation Simulations by the Computer Science department
Data is not always sharable
The federal government[1], in its default terms and conditions for recipients of federal funding, specifically outlines what research data is NOT for sharing and archival purposes. The exclusions include:
- preliminary analyses
- drafts of scientific papers
- plans for future research
- peer reviews, or communications with colleagues
- physical objects (e.g., laboratory samples)
- trade secrets
- commercial information
- materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law
The federal government also defers data sharing compliance for data including "personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study."
[1] Circular No. A-110 - Uniform Administrative Requirements for Grants and Agreements With Institutions of Higher Education, Hospitals, and Other Non-Profit Organizations. (1999, September 30).
Glossary of data related terms
- Archive
- A place, physical or virtual infrastructure, that houses firsthand facts, data, and evidence. The data can be in various formats such as letters, reports, notes, memos, photographs, digital files, and other primary sources. (What's an Archive? http://www.archives.gov/)
- Controlled vocabulary
- A type of metadata that uses a non-redundant collection of standardized terms, or subject headings, to describe objects for later reference. Examples include Library of Congress Subject Headings (LCSH) and Medical Subject Headings (MeSH).
- Cyberinfrastructure
- The coordinated aggregate of software, hardware, and other technologies, as well as human expertise, required to support current and future discoveries in science and engineering. (Fran Berman, Director of the San Diego Supercomputer Center)
- Data access management
- A practice that focuses on ensuring that only approved roles are able to create, read, update, or delete data - and only using appropriate and controlled methods.
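The create/read/update/delete model described above can be sketched in a few lines of Python. This is only an illustration: the role names and permission sets below are hypothetical examples, not part of any standard.

```python
# Minimal sketch of role-based data access management.
# Roles and their granted actions are hypothetical examples.
PERMISSIONS = {
    "curator": {"create", "read", "update", "delete"},
    "researcher": {"create", "read", "update"},
    "public": {"read"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role has been granted the requested action."""
    return action in PERMISSIONS.get(role, set())
```

In practice the same check would sit in front of every data operation, so that an unapproved role simply cannot reach the underlying records.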
- Data Anonymization
- The practice of anonymizing personally identifiable information (PII), i.e., any piece of information that can potentially be used to uniquely identify, contact, or locate a single person, such as a Social Security number, email address, credit card number, or fixed IP address.
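One common technique in this area can be sketched as follows: replacing direct identifiers with salted one-way hashes. Note that hashing is strictly speaking pseudonymization rather than full anonymization, since re-identification may still be possible through other fields; the record and salt below are made-up examples.

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Replace a direct identifier with a salted one-way hash (a pseudonym).

    Caveat: this is pseudonymization, not full anonymization --
    other fields in the record may still allow re-identification.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

# A hypothetical record: direct identifiers are hashed, non-identifying
# fields are kept as-is.
record = {"name": "Jane Doe", "email": "jane@example.com", "age_group": "30-39"}
anonymized = {
    "name": pseudonymize(record["name"], salt="s3cret"),
    "email": pseudonymize(record["email"], salt="s3cret"),
    "age_group": record["age_group"],
}
```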
- Data audit
- An independent examination of an effort "to identify and assess the current value and condition of data assets to make recommendations for their long term management and preservation." (Jones, Ross, Ruusalepp, 2009. Data Audit Framework Methodology)
- Data curation
- Data curation refers to the value-added activities and features that stewards of digital content engage in to make digital content meaningful or useful. The data portion of this term sometimes refers specifically to research data (the outcomes of conducting research) and sometimes to digital content of any kind.
- Data life cycle
- "Life cycle" is different from "life span," that is, the time from birth to death. A "cycle" implies an environment in which resources (data) are managed and preserved for discovery and repurposing. Resources are created, curated, made accessible, and preserved for subsequent research, learning, and policy activity. The challenge is to infuse the data life cycle with the metadata and services that will enable access, evaluation and re-use over time. (IASSIST's Conceptualizing the Digital Life Cycle )
- Data privacy
- The assurance that a person's or organization's personal and private information is not inappropriately disclosed. Ensuring data privacy requires access management, network security, and other data protection efforts such as anonymization.
- Data publishing
- The process through which data are fixed and made citable and retrievable over the long term; it may imply there has been a quality-control process.
- See examples or more about this topic in our repository section.
- Data repository
- A digital data center that supports the preservation, discovery, use, reuse, and manipulation of scientific data objects supporting published research. It often provides added value to data through quality assurance and metadata enhancement, and has an operational model based on data harmonization into a common schema.
- See examples or more about this topic in our repository section.
- Data standard
- Also known as a metadata standard, these protocols facilitate compatible communications and interoperability between separate science laboratory instruments and computer systems. They are a subset of the familiar engineering standards compilations from ANSI, etc. Hypertext Transfer Protocol (HTTP), Domain Name Service (DNS), and the Transmission Control Protocol and Internet Protocol (TCP/IP) are all familiar examples of internet data standards.
- Data stewardship
- The American Medical Informatics Association's definition: Data stewardship encompasses the responsibilities and accountabilities associated with managing, collecting, viewing, storing, sharing, disclosing, or otherwise making use of personal health information. For our purpose, its coverage is broadened to include data from other disciplines as well.
- See examples or more about this topic in our archiving section.
- Digital curation
- As initially defined by the Digital Curation Centre of the UK when it was founded, this term encompasses the full life cycle of digital content management: selection, preservation, maintenance, collection, and archiving of digital assets. Digital curation generally refers to the process of establishing and developing long-term repositories of digital assets.
- Digital preservation
- Digital preservation is the set of processes and activities that ensure continued access to information and all kinds of records, scientific and cultural heritage existing in digital formats, and is an ongoing process for the entire time span the information is wanted. There is an important distinction between "bit preservation" (preserving the original files intact and unchanged over time) and "functional preservation" (preserving the information in the files by means of reformatting, added documentation, or other processes that will enable users to interpret the information in the future).
- E-research
- A term commonly used in the UK and Australia that is synonymous with the US-favored "e-science." It has been defined as encapsulating research activities that use a spectrum of advanced information and communications technology capabilities and embraces new research methodologies emerging from increasing access to:
- Broadband communications networks, research instruments and facilities, sensor networks and data repositories;
- Software and infrastructure services that enable secure connectivity and interoperability;
- Application tools that encompass discipline-specific tools and interaction tools.
- E-research capabilities serve to advance and augment, rather than replace traditional research methodologies, but there is a growing dependence on e-Research capabilities.
- E-Science (or eScience)
- Used to describe computationally intensive science that is carried out in highly distributed network environments. It is a new research methodology (Hey and Hey, 2006) rather than an emerging "science."
- E-scholarship
- This newer term derived from e-science includes a broader focus on data issues in all disciplines, such as the Digital Humanities. The following E-scholarship goal was drafted in 2010 to provide a framework for continued development in this area: "The Libraries will provide life-cycle management solutions for digital content through engagement in strategic partnerships, leveraging of Libraries' (and campus) assets, developing and sharing our expertise, and collaborating to develop essential infrastructure."
- Grid computing (or the use of computational grids)
- The application of several computers to a single problem at the same time, usually a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.
- High performance computing
- High Performance Computing (HPC) involves parallel-processing computers and programs used for scientific research or computational science. In recent years HPC systems have shifted from supercomputing architectures to computing clusters and grids.
- Linked open data
- The goal of Linked Open Data is to extend the Web as a vast data network by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different data sources to link them semantically.
- See also Semantic Web
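The triple model behind linked data can be illustrated with plain Python tuples, without any RDF library. The URIs and prefix names below (ex:, dcterms:, foaf:) are illustrative shorthand, not a working namespace setup.

```python
# RDF models data as subject-predicate-object triples; links between
# resources arise when one triple's object is another triple's subject.
# Identifiers below are illustrative, not resolvable URIs.
triples = [
    ("ex:dataset42", "dcterms:creator", "ex:researcherA"),
    ("ex:dataset42", "dcterms:title", "Election returns, 1850-1900"),
    ("ex:researcherA", "foaf:name", "A. Researcher"),
]

def objects_of(subject, predicate):
    """Return all objects linked from a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Following the creator link from the dataset to the researcher, and then the name link from the researcher, is exactly the kind of traversal that linked open data enables across independently published datasets.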
- Metadata
- Metadata is often simply defined as "data about data" or "information about information". NISO (2004) defines it as "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource."
- See examples or more about this topic in our metadata section.
- Ontology
- A comprehensive collection of all the paradigms, objects, and semantic relationships in a specific field of study. More than a dictionary of subject-specific words or a controlled vocabulary, it is an interconnecting concept map for a specific subject domain. The relationships can usually be defined mathematically/computationally, which facilitates automated reasoning and inferences.
- Petabyte
- A petabyte is a unit of information or computer storage equal to one quadrillion bytes (short scale), or 1000 terabytes, or 1,000,000 gigabytes. It is abbreviated PB. 1 PB = 1,000,000,000,000,000 B = 1000^5 B = 10^15 bytes.
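The equivalences above can be checked with a few lines of Python, using the decimal (SI) convention in which each unit is a factor of 1000 larger than the previous one:

```python
# Decimal (SI) storage units: each step up is a factor of 1000.
KB, MB, GB, TB, PB = (1000 ** n for n in range(1, 6))

assert PB == 10 ** 15          # one quadrillion bytes
assert PB == 1000 * TB         # 1000 terabytes
assert PB == 1_000_000 * GB    # one million gigabytes
```

(Binary-prefixed units such as the pebibyte, 1024^5 bytes, are slightly larger; this glossary entry uses the decimal definition.)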
- Preservation
- Preservation is a branch of library and information science concerned with maintaining or restoring access to artifacts, documents, and records through the study, diagnosis, treatment, and prevention of decay and damage. Digital preservation can therefore be seen as the set of processes and activities that ensure continued access to information and all kinds of records, scientific and cultural heritage existing in digital formats. (Wikipedia)
- See examples or more about this topic in our archiving section.
- Provenance data
- A term most often used to describe the semantic meaning of a dataset, or what the library community usually calls metadata, as well as where the data came from and by what methods (closer to the archives concept of provenance).
- Publishing data
- See Data publishing
- Research life-cycle
- The life cycle approach observes each stage of a process to understand the overall process better. The research life cycle begins with the conception of a research project (hypothesis), continues through its methodology design and data collection, analysis, and finally publication and archiving of research outputs (e.g., articles, datasets, software, models, etc.). Understanding the research life cycle helps libraries identify who is involved and what information is produced or transformed during each phase of the project. For a more detailed explanation of this key concept, see e-Science and the Life Cycle of Research by Charles Humphrey.
- Repository
- See Data repository
- Resource Description Framework (RDF)
- A W3C knowledge representation specification and the foundation for ontology languages such as OWL
- Restricted-use data
- Data that contain sensitive information (usually about human subjects) that could permit the identification of individuals.
- Semantic web
- The Semantic Web is a "web of data" intended to enable computers to understand the semantics, or meaning, of information on the World Wide Web. The “Semantic Web" is implemented with a set of formats and technologies including RDF, which are intended to provide a formal description of concepts, terms, and relationships within a given knowledge domain. More recently, the term “Linked Data” or “Linked Open Data” has become the preferred term for this concept of a data Web.
- Thesaurus
- An extension of the controlled vocabulary providing additional relationships between terms (broader-term, narrower-term, etc.).
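The broader-term/narrower-term relation above can be modeled as a simple mapping, with narrower terms derived by inverting it. The terms below are made-up examples, not drawn from any published thesaurus.

```python
# A toy thesaurus: each term maps to its broader terms (BT).
# Narrower terms (NT) are derived by inverting the relation.
broader = {
    "oral history": ["history"],
    "economic history": ["history", "economics"],
    "history": ["humanities"],
}

def narrower(term):
    """Invert the broader-term relation to list narrower terms."""
    return sorted(t for t, bts in broader.items() if term in bts)
```

Real thesauri (e.g., those built on the SKOS model) add further relations such as related-term and preferred/non-preferred labels, but the BT/NT inversion shown here is the structural core.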
- Web Ontology Language (OWL)
- A well-defined, XML-based formal ontology language, and the de facto W3C ontology standard as of 2007. OWL facilitates exchanging ontologies between subject domains and software systems. It has three "flavors" (OWL Lite, OWL DL, OWL Full), and computer-automated reasoning is possible with the first two. OWL DL is particularly well suited for representing bio-ontologies.