Working With Data

Effective data management involves planning how data will be collected, labeled, stored, analyzed, and shared. Without proper data management, researchers risk losing data, struggling to find specific files, and hindering the reproducibility of their work. Understanding the principles of data management can help researchers promote these practices among their colleagues and supervisors, fostering a culture of good data stewardship.

  1. Creating Data Management Plans (DMPs)
  2. Including Metadata
  3. Best Practices for Data Archiving and Preservation
  4. Working With Data Among Teams
  5. Encouraging Best Practices
  6. Data Organization
  7. How to Cite Datasets, Code, and Software
  8. Understanding and Implementing FAIR Principles

Creating Data Management Plans (DMPs)

Data Management Plans (DMPs) serve as foundational documents for effective research data management and are increasingly a fundamental requirement for scientific research, with funding agencies like the NSF mandating them before a project begins. These plans are not merely administrative hurdles: they save researchers time, help meet funder requirements, and provide a framework for data management practices that ensure the long-term value, usability, and reusability of scientific data.

  1. Purpose and Value:
    • A well-thought-out DMP serves as a roadmap for how data will be managed throughout the research lifecycle, from planning and collection to storage, analysis, preservation, and sharing. By proactively addressing these aspects, researchers can prevent significant problems down the line, such as data loss, disorganization, or difficulties in sharing and reuse.
  2. Key Components and Considerations:
    • While specific requirements may vary by funder or institution, effective DMPs generally address several key elements (condensed into a checklist sketch after this list):
      • Data Collection: The DMP should outline what data will be collected, the methods used, and the expected volume.
      • Data Formats and Standards: Consideration should be given to the formats in which data will be stored to ensure long-term accessibility. Choosing non-proprietary and widely accepted formats is recommended.
      • Metadata: The plan should detail how metadata will be created and managed. High-quality metadata is essential for the findability, understanding, and reuse of data. Researchers should consider adhering to community metadata standards.
      • Storage and Backup: The DMP needs to specify where and how data will be stored securely and backed up throughout the project.
      • Data Sharing and Access: The plan should outline how and when data will be made accessible to the broader research community, considering funder and publisher requirements. This includes deciding whether data will be made open or shareable and identifying appropriate repositories for data deposit. The DMP should also address any potential restrictions on data sharing, such as embargo periods.
      • Data Preservation: The DMP should consider long-term preservation strategies to ensure data remains available and usable beyond the project’s lifespan.
      • Roles and Responsibilities: The plan may outline who is responsible for different data management tasks.
  3. Benefits Beyond Compliance:
    • While often required by funders, creating a thoughtful DMP offers numerous benefits for researchers:
      • Saves Time: Proactive planning can prevent data-related issues that could consume significant time later in the research process.
      • Ensures Future Usability: Well-managed and documented data is more likely to be usable by the researcher and others in the future.
      • Increases Impact and Attribution: By outlining how data will be shared, DMPs increase the potential for other scientists to use and cite the data, leading to greater impact and recognition for the researcher’s work.
      • Facilitates Collaboration: A clear DMP can facilitate data sharing and collaboration among research team members.
  4. Resources and Tools:
    • Several resources are available to assist researchers in creating DMPs:
      • DMPTool and DMPonline: These are common tools that provide templates and guidance for writing data management plans.
      • Data Help Desks: Experts at Data Help Desks can provide guidance and answer questions about creating DMPs. They can offer advice on what to include, point to relevant examples, and help researchers understand funder-specific requirements.
      • Institutional Libraries and Data Management Services: Many universities and research organizations offer local support for data management planning. For example, EOL’s Data Management and Services (DMS) works with PIs to develop DMPs.
      • Example Templates: Researchers can find example DMP templates to get started. Many agencies like NSF, NIH, DOE, and ERC provide example DMPs or templates specific to their requirements. Platforms like DMPTool (U.S.) and DMPonline (UK) have public templates and a list of funder requirements.
  5. Timing:
    • Ideally, planning for data management should begin early in the research process, even at the proposal writing stage. This allows researchers to think through their data needs and management strategies proactively.
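As a quick reference, the key components above can be condensed into a checklist. The Python sketch below is a minimal illustration only; the section names paraphrase this list and are not taken from any funder's template, so consult DMPTool or DMPonline for authoritative structures.

```python
# Minimal sketch of a DMP outline as a checklist. Section names are
# illustrative paraphrases of the components above, not a funder template.

DMP_SECTIONS = {
    "data_collection": "What data will be collected, methods, expected volume",
    "formats_and_standards": "Non-proprietary, widely accepted file formats",
    "metadata": "How metadata will be created; community standards to follow",
    "storage_and_backup": "Where data live during the project; backup schedule",
    "sharing_and_access": "Repository choice, access conditions, embargoes",
    "preservation": "Long-term archiving strategy beyond the project",
    "roles_and_responsibilities": "Who handles which data management task",
}

def check_dmp(draft: dict) -> list:
    """Return the DMP sections still missing or left empty in a draft."""
    return [s for s in DMP_SECTIONS if not draft.get(s)]

if __name__ == "__main__":
    draft = {"data_collection": "CTD casts, ~2 GB per cruise", "metadata": ""}
    for missing in check_dmp(draft):
        print(f"TODO: {missing} -- {DMP_SECTIONS[missing]}")
```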

Including Metadata

Effective metadata allows for a comprehensive understanding, utilization, and sharing of data both in the present and for future use. It also aids in ensuring that other researchers can discover, access, apply, repurpose, and properly cite the data over time. Additionally, it supports the long-term preservation and archiving of data. The key categories of metadata to capture include:

  1. Sample Metadata:
    • Details about the samples themselves.
  2. Technical Metadata:
    • Information generated automatically by research instruments and the associated software.
  3. Experimental Metadata:
    • Data concerning the experimental setup, including time points, protocols, and equipment used during data generation.
  4. Analytical Metadata:
    • Information related to the data analysis process, such as the software name and version, quality control parameters, and specifications of output file types.
  5. Dataset-Level Metadata:
    • Descriptive data about the research project’s objectives, contributing investigators, related publications, and funding sources.

Barriers scientists face when adding metadata

  • Lack of Standardization
    • There are numerous metadata standards, and choosing the right one can be confusing.
  • Complexity of Requirements
    • Metadata requirements can be extensive and complex, leading to frustration in understanding what is necessary.
  • Insufficient Training
    • Many scientists lack formal training on metadata practices, resulting in uncertainty about how to create or implement it effectively.
  • Time Constraints
    • The demands of research and publication often leave little time for the additional task of creating comprehensive metadata.
  • Limited Resources
    • Smaller labs or institutions may not have dedicated personnel or resources to assist with metadata creation.
  • Poor Documentation
    • Existing documentation may be overly technical, poorly organized, or just a wall of text without clear examples.
  • Perceived Low Value
    • Some scientists may not see the immediate benefits of adding metadata, viewing it as an optional extra rather than a necessity.
  • Software Limitations
    • Tools and platforms for data management may not support easy metadata integration, leading to additional manual work.
  • Data Privacy Concerns
    • Worries about sharing sensitive or proprietary information can deter scientists from fully documenting their datasets.
  • Resistance to Change
    • Established practices and habits can be hard to change, and some scientists may be reluctant to adopt new methodologies for data management.

What can help scientists add metadata?

  • Standardized Templates
    • Providing pre-made templates tailored to specific fields helps streamline the process and ensures consistency.
  • Clear Guidelines and Examples
    • Offering straightforward, example-driven documentation can demystify metadata requirements and show scientists exactly what is needed.
  • Training Workshops
    • Conducting hands-on training sessions or webinars can equip scientists with the skills and confidence to create metadata effectively.
  • Metadata Management Tools
    • Utilizing user-friendly software or platforms that facilitate metadata entry and management simplifies the process and reduces errors.
  • Integration with Data Management Plans
    • Encouraging scientists to incorporate metadata considerations into their data management plans at the outset fosters a more holistic approach.
  • Collaboration with Librarians/Archivists
    • Partnering with information specialists can provide expertise and support for metadata creation and best practices.
  • Emphasizing Benefits
    • Clearly communicating the long-term benefits of metadata—such as improved data discoverability, reuse, and compliance with funding requirements—motivates scientists to engage.
  • Community Support
    • Establishing forums or communities where scientists can share experiences and tips can enhance peer learning and provide moral support.
  • Feedback Mechanisms
    • Creating systems for feedback on metadata quality can help scientists refine their practices and improve future datasets.
  • Automated Metadata Extraction
    • Leveraging tools that can automatically generate metadata from existing data helps reduce the burden on scientists while ensuring accuracy.
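To illustrate that last point, below is a minimal sketch of automated technical-metadata extraction, assuming the dataset is a tabular CSV file readable by pandas. The output field names are illustrative and do not follow any formal metadata standard.

```python
# Sketch of automated technical-metadata extraction from a tabular file.
# Assumes a CSV readable by pandas; output keys are illustrative only.
import pandas as pd

def extract_technical_metadata(path: str) -> dict:
    df = pd.read_csv(path)
    return {
        "file": path,
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "columns": {
            col: {
                "dtype": str(df[col].dtype),          # inferred data type
                "missing": int(df[col].isna().sum()),  # count of empty cells
            }
            for col in df.columns
        },
    }
```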

Example of metadata to include for a dataset

For an example dataset of plankton density in the ocean, here are key metadata elements to include:

  1. Title
    • A clear and descriptive title of your dataset that conveys the focus and significance of the research.
  2. Keywords
    • Relevant keywords that encapsulate the main topics of your dataset.
  3. Principal Investigator(s)
    • Names and affiliations of the lead researchers.
  4. Contact Information
    • Email addresses and institutional affiliations of everyone who worked on the dataset.
  5. Data Collection Methods
    • Detailed description of how the data were collected.
  6. Spatial Coverage
    • Geographic locations of where the data were collected.
  7. Temporal Coverage
    • Dates and times of data collection.
  8. Data Format
    • Information on the file formats of the dataset (e.g., CSV, NetCDF).
  9. Variables/Parameters
    • Clear definitions of parameters and units of measurement.
  10. Data Quality Information
    • Notes on the quality and accuracy of the data.
  11. Funding Sources
    • Information about any grants or funding that supported the dataset.
  12. Related Publications
    • References to any scientific papers or reports.
  13. License Information
    • Specify the licensing terms for data reuse.
  14. Provenance
    • Details on how the dataset was created, including any relevant software or tools that were used in the data processing.
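Pulling these elements together, here is one way such a record might look in machine-readable form. This is an illustrative sketch only: every value is a placeholder for the hypothetical plankton dataset, and the key names do not follow any particular metadata standard.

```python
# The fourteen elements above, sketched as a single metadata record for the
# hypothetical plankton-density dataset (all values are placeholders).
import json

metadata = {
    "title": "Plankton density in the North Atlantic, 2019-2021",
    "keywords": ["plankton", "density", "ocean", "time series"],
    "principal_investigators": [{"name": "...", "affiliation": "..."}],
    "contact": [{"name": "...", "email": "...", "affiliation": "..."}],
    "collection_methods": "Net tows at fixed stations; see protocol document",
    "spatial_coverage": {"lat": [40.0, 55.0], "lon": [-40.0, -10.0]},
    "temporal_coverage": {"start": "2019-03-01", "end": "2021-10-31"},
    "data_format": "CSV",
    "variables": {"density": "cells per liter", "depth": "meters"},
    "quality_information": "Duplicate tows averaged; outliers flagged",
    "funding": ["Grant number ..."],
    "related_publications": ["DOI ..."],
    "license": "CC-BY-4.0",
    "provenance": "Processed with in-house scripts, version ...",
}

print(json.dumps(metadata, indent=2))
```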

Metadata Resources

To learn more about metadata, see Metadata Best Practices and Data Publishing.

See the Documentation & MetaData section in Resources for more information: Metadata Resource Links

Best Practices for Data Archiving and Preservation

Data Help Desks play a vital role in guiding researchers through the essential steps to ensure their valuable data remains accessible and usable for the long term. This includes careful consideration of data formats, metadata creation, and documentation.

  1. Choosing appropriate data formats
    • This is a foundational aspect of data archiving and preservation. Selecting non-proprietary and widely accepted file formats is critical for long-term accessibility. These formats minimize the risk of data becoming inaccessible due to the obsolescence of specific software or systems. Data Help Desks can advise researchers on suitable formats for their particular data types, ensuring that the data can be opened and interpreted by future users without reliance on potentially outdated proprietary software. The goal is to choose formats that are likely to be supported by various tools and platforms over time, thereby maximizing the longevity and usability of the data (a small format-conversion sketch follows this list).
  2. The creation of comprehensive metadata
    • This is another cornerstone of best practices. Metadata, often defined as “data about data”, provides the crucial context necessary for understanding and reusing data. Without adequate metadata, even well-formatted data can become meaningless over time. Data Help Desks emphasize the importance of documenting key aspects of the data, such as its origin, collection methods, processing steps, variables, units, and any quality control measures taken. Adhering to community-recognized metadata standards is highly recommended as it promotes interoperability and facilitates data discovery across different datasets and disciplines. Resources like SESAR help researchers create metadata templates for geoscience samples. By guiding researchers in creating rich and standardized metadata, Data Help Desks significantly enhance the findability, evaluation, and understanding of data, both for the original researchers and for future generations of scientists.
  3. Thorough documentation of the data
    • Collection methods, processing steps, and relevant contextual information, is indispensable for effective long-term data usability. Documentation complements metadata by providing a more narrative and detailed account of the data’s lifecycle. This includes explaining any methodologies used, describing the experimental setup, and outlining any transformations or analyses applied to the raw data. Data Help Desks can advise researchers on what information to include in their documentation, ensuring that future users have a clear understanding of how the data was generated and how it can be appropriately interpreted and reused. Well-documented data, coupled with appropriate formats and comprehensive metadata, significantly increases the likelihood that the data will remain a valuable resource for future research and discovery.
  4. Guide researchers in adopting FAIR
    • FAIR principles are making data Findable, Accessible, Interoperable, and Reusable. Proper data archiving and preservation, through careful attention to formats, metadata, and documentation, are fundamental to ensuring that research data maintains its value and contributes to the advancement of scientific knowledge over time. Researchers are also often directed to appropriate data repositories for the long-term storage and preservation of their data.
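As a concrete illustration of the format advice in item 1, the sketch below migrates tabular data from a proprietary spreadsheet to CSV using pandas. It assumes the data fit in memory and live on a single sheet, that pandas and openpyxl are installed, and that the file names shown are hypothetical.

```python
# Sketch: migrating tabular data from a proprietary spreadsheet to a
# non-proprietary format (CSV) for archiving. Assumes a single sheet that
# fits in memory; requires pandas with openpyxl for .xlsx input.
import pandas as pd

def archive_as_csv(xlsx_path: str, csv_path: str) -> None:
    df = pd.read_excel(xlsx_path)     # proprietary input format
    df.to_csv(csv_path, index=False)  # open, widely supported output

archive_as_csv("field_data.xlsx", "field_data.csv")  # hypothetical paths
```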

Working With Data Among Teams

Effective collaboration on data is fundamental for successful research projects. When multiple researchers or groups contribute to a project, establishing clear strategies and fostering a culture of best practices are crucial for ensuring consistency, efficiency, and the overall quality of the research outcomes.

  1. Establishing Clear Protocols from the Outset
    • Defining how data will be collected, labeled, and shared at the beginning of a project is paramount. This proactive approach helps to prevent misunderstandings, reduce errors, and streamline workflows as different team members contribute data. For example, agreeing on standardized naming conventions for files and variables and a logical directory structure can significantly improve data organization and make it easier for everyone to locate and use the data.
  2. Encouraging Best Practices for Data Management
    • Promoting and adopting sound data management practices within the team ensures data integrity and facilitates the integration and analysis of data from various sources or team members. This can include:
      • Data Quality Control: Implementing shared strategies for checking data accuracy and completeness as data are collected and entered (a minimal check is sketched after this list).
      • Metadata Creation: Establishing guidelines for creating consistent and comprehensive metadata to document the data.
      • Data Backup: Ensuring that all team members follow agreed-upon procedures for regularly backing up data to prevent loss.
      • Version Control: For data and especially for code used in data analysis, utilizing version control systems helps track changes and allows for collaboration without overwriting each other’s work.
  3. Sharing Knowledge and Expertise
    • Effective teamwork involves open communication and the sharing of knowledge related to data. When team members understand the data collection methods, potential biases, and analysis techniques used by others, it leads to a more comprehensive and nuanced understanding of the overall dataset. Data Help Desks also emphasize the importance of experts sharing their science expertise with researchers, which is relevant within a research team as well. This can involve explaining the rationale behind specific data management practices and helping team members understand why they are important.
  4. Agreeing on Standards
    • Establishing common standards is essential for creating robust and comparable datasets. This includes:
      • Defining Sampling Systems Before Fieldwork: Agreeing in advance on how data will be sampled or collected ensures that the resulting data can be effectively combined and compared.
      • Data Formats: Deciding on preferred data formats for sharing and archiving within the team promotes interoperability and avoids compatibility issues.
      • Common Vocabularies and Ontologies: Where applicable, agreeing on the use of shared vocabularies or ontologies can improve data integration and understanding across different parts of the project.
  5. Need for data cleaning, reshaping, and merging
    • This often arises when working with data from multiple team members or sources. Establishing clear protocols for how these processes will be handled ensures consistency and reproducibility. In a virtual setting, effective communication strategies, potentially utilizing tools like Slack channels, become even more critical for seamless data collaboration within teams. Training team members on agreed-upon protocols and tools can also contribute significantly to the success of collaborative data work. By prioritizing these aspects, research teams can maximize the value of their data and produce more reliable and impactful results.
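As one illustration of the shared quality-control strategies in item 2, a team might agree on a small script that every member runs on incoming files. The sketch below is a minimal example: the required columns and valid ranges are invented for illustration and would be replaced with the team's agreed standards.

```python
# Sketch of a shared quality-control check a team might run on incoming
# data files. Column names and valid ranges are invented for illustration.
import pandas as pd

REQUIRED_COLUMNS = ["station_id", "date", "density"]
VALID_RANGE = {"density": (0.0, 1e7)}  # cells per liter; assumed bounds

def qc_report(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in a data frame."""
    problems = []
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    for col, (lo, hi) in VALID_RANGE.items():
        if col in df.columns:
            bad = df[(df[col] < lo) | (df[col] > hi)]
            if len(bad):
                problems.append(f"{len(bad)} out-of-range values in {col}")
    present = [c for c in REQUIRED_COLUMNS if c in df.columns]
    missing = df[present].isna().sum()
    problems += [f"{n} missing values in {c}" for c, n in missing.items() if n]
    return problems
```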

Encouraging Best Practices

Best practices foster high-quality, reproducible, and impactful scientific research. These practices span the entire data lifecycle, from the initial planning stages to long-term sharing and preservation. By actively promoting and supporting the adoption of established guidelines, the scientific community can ensure that research data is reliable, understandable, and readily available for verification, reuse, and further discovery. This, in turn, strongly supports the core tenets of open science and FAIR data principles (Findable, Accessible, Interoperable, Reusable), leading to more transparent and collaborative research environments.

  1. Education and Training:
    • A fundamental aspect of encouraging best practices is providing researchers with adequate training and educational resources in research data management. Many researchers have little to no formal training in this area, highlighting a significant need for accessible learning opportunities. Data Help Desks themselves serve as a valuable avenue for disseminating knowledge and skills. Creating and sharing recorded resources, tutorials, and one-pagers on key data management topics can benefit a wider audience over a longer timeframe, especially given potential limitations in live virtual meeting attendance. The ESIP Data Management Training Clearinghouse (DMTC) is a specific resource designed for finding and sharing online learning materials for research data management. Furthermore, integrating data management practices into all ecological research activities is recommended.
  2. Highlighting the “Why”:
    • It’s not enough to simply tell researchers how to manage data; explaining why it matters is crucial for promoting genuine adoption of best practices. Understanding the benefits of good data management practices early in the research workflow, beyond just meeting publication or funding requirements, can be a powerful motivator. These benefits include increased reproducibility, transparency, potential for reuse, and proper attribution through data citation. Emphasizing how good data management can further their research and make their data and software more open and FAIR is a core goal of Data Help Desks.
  3. Providing Practical Guidance and Tools:
    • Researchers need practical tools, workflows, and guidance to implement best practices effectively. This includes offering support on topics such as:
      • Data formatting: Choosing appropriate, non-proprietary formats for long-term preservation.
      • Metadata creation: Developing comprehensive and standardized metadata to ensure data findability and understanding. Resources like SESAR can aid in this process for specific data types.
      • Data organization: Implementing logical file naming conventions and directory structures.
      • Data quality control: Establishing strategies for ensuring data accuracy and completeness.
      • Data archiving and repository selection: Guiding researchers on identifying and using appropriate FAIR-aligned data repositories for their data. This includes understanding funder and publisher requirements.
      • Data citation: Educating researchers on how to properly cite data and software to ensure proper credit and enhance discoverability.
  4. Addressing Challenges and Concerns:
    • Acknowledging and addressing the challenges researchers face in adopting best practices is important. For example, researchers may have questions about managing large datasets, navigating varied and confusing publisher/funder requirements, or concerns about sharing data before publication. Providing clear and helpful guidance on these specific issues can build trust and encourage better data management.
  5. Leveraging Community Expertise:
    • Engaging informatics experts familiar with specific scientific domains is a key element of Data Help Desks. These experts can provide tailored advice and demonstrate relevant tools and platforms. Facilitating interaction between researchers and these experts helps to build a supportive community around data management best practices.

Data Organization

Data organization is a foundational element of effective data management that significantly impacts the findability, understandability, and usability of research data. Whether for the original researchers, collaborators, or the broader scientific community, well-organized data is essential for saving time, enhancing reproducibility, and maximizing the impact of research.

  1. Consistent File Naming Conventions:
    • Developing and adhering to a consistent system for naming files is crucial. This makes it easier to locate specific data files and understand their content without having to open them. A good naming convention might include elements like the project name, date, data type, version, or processing stage. It can be tempting to keep appending information to filenames; instead, settle on a convention and stick to it (one possible convention is sketched after this list).
  2. Logical Directory Structures:
    • Creating a clear and logical hierarchy of folders (directory structure) to store data is equally important. A well-thought-out folder structure helps to categorize data and prevents researchers from spending excessive time just looking for their data before analysis can even begin. This structure should reflect the different stages of the research project, data types, or other relevant groupings.
  3. Structuring Data for Storage and Analysis:
    • Organizing the data within files in a way that is suitable for analysis is another key aspect. This might involve using consistent formats (e.g., CSV, tabular data), clear headers, and avoiding unnecessary complexity within the data files themselves. “Tidy data” principles, where each variable forms a column, each observation forms a row, and each type of observational unit forms a table, can greatly enhance usability.
  4. Importance for Findability and Usability:
    • The primary goal of data organization is to make data easily findable and usable in the future. Well-organized data ensures that researchers can quickly locate the specific information they need, whether they are returning to the data months or years later, or if others are trying to use the data for verification or new research.
  5. Support for Reproducibility:
    • Clear data organization directly supports the reproducibility of research. When the structure of the data is transparent, others can more easily understand how the data was collected, processed, and analyzed, making it possible to attempt to replicate the findings.
  6. A Recognized Need:
    • Data organization has been identified as a significant research data management need by scientists at conferences like AGU. It was also highlighted as a topic for future training that researchers expressed interest in. This underscores the ongoing challenges and the importance researchers place on improving their data organization skills.
  7. Connection to Broader Data Management Practices:
    • Proper data organization is an integral part of encouraging best practices in data management. It aligns with the principles of making data Findable and Reusable from the FAIR data principles.
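To make point 1 concrete, the sketch below shows one possible filename convention (project_date_datatype_version) with helpers to build and parse it. The convention itself is an illustrative assumption, not a recommendation of any particular standard, and it assumes individual fields contain no underscores.

```python
# Sketch of one possible filename convention, project_date_datatype_version,
# with helpers to build and parse it. Fields must not contain underscores.
from datetime import date

def build_filename(project: str, day: date, datatype: str, version: int,
                   ext: str = "csv") -> str:
    return f"{project}_{day:%Y%m%d}_{datatype}_v{version:02d}.{ext}"

def parse_filename(name: str) -> dict:
    stem, ext = name.rsplit(".", 1)
    project, ymd, datatype, version = stem.split("_")
    return {
        "project": project,
        "date": date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:])),
        "datatype": datatype,
        "version": int(version.lstrip("v")),
        "ext": ext,
    }

# Example: produces "plankton_20210614_density_v03.csv"
name = build_filename("plankton", date(2021, 6, 14), "density", 3)
assert parse_filename(name)["version"] == 3
```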

How to Cite Datasets, Code, and Software

Citing datasets, code, and software addresses a significant need for researchers: gaining proper credit for their work and enhancing the transparency and reproducibility of scientific findings.

  1. Attributing Credit:
    • Properly citing datasets, code, and software ensures that creators receive due recognition for their contributions to the research ecosystem. This is vital for career advancement and for the scholarly record.
  2. Reproducibility and Transparency:
    • Citations provide a clear link to the underlying data and software used in a study, allowing others to access and potentially reproduce the research. This enhances the transparency and rigor of scientific work.
  3. Meeting Requirements:
    • Funders and publishers increasingly have expectations, and sometimes mandates, for archiving and citing the data and software used or derived in reported work. The Data Help Desk can help researchers understand and comply with these policies.
  4. Discoverability and Impact:
    • When data and software are properly cited (often with persistent identifiers like DOIs), they become more discoverable, potentially leading to increased usage and impact.
  5. Importance of Persistent Identifiers (PIDs):
    • The role and benefits of persistent identifiers like Digital Object Identifiers (DOIs) for datasets and software can be explained. Researchers can learn how to obtain DOIs for their data in repositories and for their software through GitHub and Zenodo (a citation-retrieval sketch follows this list).
  6. Guidance on Repositories:
    • The Help Desk can guide researchers on where they can deposit their data and software in FAIR-aligned repositories, which often facilitate the assignment of DOIs. This includes discussing the different types of repositories available (domain-specific, institutional, generalist) and how to choose an appropriate one.
  7. Data Availability Statements:
    • Researchers can receive guidance on how to write effective Data Availability Statements for their publications, which describe where the underlying data and software can be accessed and reused.
  8. Software Citation Specifics:
    • Given that software citation can be particularly “thorny”, the Data Help Desk can provide specific advice on how to cite code, including mentioning the software name, version, and providing a link (ideally with a persistent identifier). Tools like GitHub and Zenodo, which allow for DOIs and licensing of software, can be highlighted.
  9. Citing Computational Notebooks:
    • Researchers may have questions about whether they can give their computational notebooks a DOI. The Data Help Desk can provide information on platforms and methods for publishing and citing notebooks.
  10. Funder and Publisher Policies:
    • Experts can help researchers navigate the specific data and software sharing and citation requirements of different funding agencies and journal publishers, pointing them to relevant websites and documentation.
  11. Demonstrations and Resources:
    • The Data Help Desk can offer demos of tools and resources related to citation management and repository search. They can also provide handouts and one-pagers with clear instructions and examples of data and software citation. Links to relevant guidelines and best practices from organizations like AGU can be shared.
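As a small demonstration of what persistent identifiers enable, the sketch below retrieves a formatted citation for a DOI via DOI content negotiation, a service supported by registration agencies such as Crossref and DataCite. The DOI shown is a placeholder, and the requests package is assumed to be installed.

```python
# Sketch: retrieving a formatted citation for a dataset DOI via DOI content
# negotiation (supported by Crossref and DataCite for registered DOIs).
import requests

def cite(doi: str, style: str = "apa") -> str:
    """Ask doi.org for a formatted bibliography entry in the given style."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": f"text/x-bibliography; style={style}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text.strip()

print(cite("10.5281/zenodo.0000000"))  # placeholder DOI
```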

Understanding and Implementing FAIR Principles

FAIR principles are a core focus of a data help desk, given the increasing importance of making research data Findable, Accessible, Interoperable, and Reusable (FAIR). Data help desks play a crucial role in explaining these principles and providing practical guidance to researchers on how to make their data more FAIR.

  1. Explaining the Purpose and Meaning of FAIR:
    • Many researchers, especially those with little to no formal training in research data management, may be unfamiliar with the FAIR principles. A data help desk can explain what FAIR and open really mean and why managing data according to these principles is important. This includes conveying that FAIR data supports open science and leads to more transparent and collaborative research.
  2. Addressing Common Questions about FAIR:
    • Researchers frequently ask fundamental questions such as “What is FAIR?” and “What makes data and software FAIR?”. The data help desk provides a space for them to ask these questions and receive clear explanations from informatics experts.
  3. Providing Practical Steps for Implementing FAIR:
    • Beyond explaining the concepts, a data help desk offers practical guidance on how researchers can make their data FAIRer (a minimal self-check is sketched after this list). This can include:
      • Directing them to resources on how to implement FAIR in their research.
      • Advising on how to make their Earth science research more open and FAIR.
      • Providing guidance on selecting FAIR-aligned repositories for archiving their data.
      • Explaining the importance of assigning persistent identifiers such as DOIs to data and software.
      • Offering advice on creating comprehensive metadata records to enhance data discovery, which is a key aspect of making data Findable.
      • Sharing information on data standards that promote interoperability.
  4. Highlighting the Benefits of FAIR Data Practices:
    • The data help desk can emphasize the advantages of adhering to FAIR principles, such as meeting publisher or funder expectations for open and FAIR data and software. They can also explain how FAIR data increases the potential for data reuse, enhances the impact of research through increased citations, and promotes collaboration.
  5. Providing Resources and Tools:
    • Data help desks often curate and share resources like one-pagers, tutorials, and links to relevant organizations that provide further information and tools for implementing FAIR principles. For example, they might point researchers to the ESIP Data Management Training Clearinghouse for learning materials or to specific resources explaining FAIR.
  6. Addressing Challenges in Understanding and Implementation:
    • The data help desk acknowledges that researchers may find it challenging to understand and implement FAIR principles. By providing direct engagement with experts, they can address specific concerns and provide tailored advice to overcome these hurdles. The high number of “uncategorized” questions received at a data help desk suggests that attendees often need a broader, preliminary consultation before diving into the technical aspects of FAIR implementation.
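To ground item 3, here is a minimal self-check covering two machine-checkable aspects of a published dataset: whether its DOI resolves (relevant to Findable and Accessible) and whether machine-readable metadata can be retrieved (relevant to Interoperable). This is far from a full FAIR assessment; the DOI is a placeholder, and the requests package is assumed to be installed.

```python
# Minimal self-check sketch for two machine-checkable FAIR aspects of a
# published dataset. Not a full FAIR assessment; the DOI is a placeholder.
import requests

def fair_spot_check(doi: str) -> dict:
    url = f"https://doi.org/{doi}"
    # Findable/Accessible: the persistent identifier resolves.
    resolves = requests.head(url, allow_redirects=True, timeout=30).ok
    # Interoperable: machine-readable metadata via DOI content negotiation.
    meta = requests.get(
        url,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    return {"doi_resolves": resolves, "machine_readable_metadata": meta.ok}

print(fair_spot_check("10.5281/zenodo.0000000"))  # placeholder DOI
```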