/Data

A workshop on Data

"...tidy datasets are all alike, but every messy dataset is messy in its own way" H. Wickham

Data is at the core of scientific activity. Ecologists manage larger and more complex datasets every day. There is an increasing trend to share data with collaborators, to combine different datasets to synthesize existing knowledge, and to deposit data after publication to ensure reproducibility. However, there is relatively little guidance on how to manage and share datasets efficiently. Understanding how a dataset is structured and obtaining the input format required by your statistical software often takes more time than the analysis itself.

This workshop will go through the data life cycle: planning and collecting data, entering and storing data in electronic formats, cleaning and manipulating data, and exploring it before analysis.

The workshop consists of three sections. First, we will discuss data in a question-and-answer forum: How do you clean data? Do you use metadata? Who owns your data? We will also talk a bit about how to use tidy data. The tidy data concept is based on storing variables in columns and observations in rows, with a single type of experimental unit per dataset. While this may seem trivial, in my experience it is not. Second, we will work through a practical example of how to manipulate data in R (reshape, the dplyr package, and regular expressions). Third, we will see how to explore your data before analyzing it (also in R, following Zuur et al. 2009 MEE). If, and only if, there is interest, I can explain Git as a way to manage your entire workflow.
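As a rough illustration of the tidy data idea and the kind of manipulation covered in the practical section, here is a minimal sketch. The dataset, its column names, and its values are invented for this example and are not the workshop's dirty_data.csv. The sketch uses only base R (`gsub`, `sub`, `reshape`) so that it runs without extra packages; the reshape and dplyr packages mentioned above offer other ways to do the same steps.

```r
# Hypothetical messy field data: one row per site and one column per
# species-by-year count. All names and values are invented.
messy <- data.frame(
  site        = c(" Site_A", "site_b ", "SITE_C"),
  bombus_2012 = c(3, 0, 5),
  bombus_2013 = c(4, 1, NA),
  apis_2012   = c(10, 7, 2),
  apis_2013   = c(12, 6, 3),
  stringsAsFactors = FALSE
)

# 1. Clean inconsistent labels with a regular expression:
#    strip leading/trailing whitespace, then lower-case.
messy$site <- tolower(gsub("^[[:space:]]+|[[:space:]]+$", "", messy$site))

# 2. Reshape from wide to long so that each row is one observation.
count_cols <- grep("_[0-9]{4}$", names(messy), value = TRUE)
tidy <- reshape(
  messy,
  direction = "long",
  varying   = count_cols,   # columns to stack
  v.names   = "count",      # name of the stacked value column
  timevar   = "variable",   # column holding the old column names
  times     = count_cols,
  idvar     = "site"
)
rownames(tidy) <- NULL

# 3. Split the old column names into two variables, one per column.
tidy$species <- sub("_[0-9]{4}$", "", tidy$variable)
tidy$year    <- as.integer(sub("^.*_([0-9]{4})$", "\\1", tidy$variable))
tidy <- tidy[, c("site", "species", "year", "count")]

# 4. Quick checks before any analysis: value ranges and missing data.
summary(tidy$count)
colSums(is.na(tidy))
```

In the result, each variable (site, species, year, count) has its own column and each row is a single count, which is the layout most R modelling and plotting functions expect.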

  • Find a messy dataset (and how to clean it) in the example folder

    • NEW: dirty_data.csv has been uploaded.
    • NEW: a script to clean it has been added.
  • Find the slides used in the workshop in the data.md file

    • Riikka's and Vesna's presentations added as PDFs
  • Find code that follows Zuur et al. (and more)

    • Data_exploration.R (and associated data inside the example folder)
  • This workshop is based on previous experiences and the following key references:

    • Tidy data paper
    • About data management: 10 basic rules and a few more tips for data management
    • About Git
    • Other resources: DataOne; Prometheusresearch; Data is being lost; Practicaldatamanagement blog.