Data Formats

Data Structure

The universal data structure we’re going to use is a list with the following elements:
- a data.frame or tibble, named abundance (required)
- a data.frame or tibble, named covariates (optional)
- a list, named metadata (required)

If both abundance and covariates are present in the list, then the two data.frames must have the same number of rows.
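As a minimal sketch of this structure, the following builds such a list by hand in base R. The species names, values, and citation text are made up for illustration and are not taken from MATSS.

```r
# Hypothetical example data; all names and values are illustrative only.
abundance <- data.frame(
  species_a = c(2, 6, 0, 5),
  species_b = c(6, 0, 4, 1)
)

covariates <- data.frame(
  year = c(2014, 2015, 2016, 2017),
  temperature = c(11.2, 12.5, 10.9, 13.1)
)

metadata <- list(
  is_community = TRUE,
  citation = "Placeholder citation for an illustrative dataset",
  timename = "year"
)

# Assemble the three named elements into one list.
my_data <- list(
  abundance = abundance,
  covariates = covariates,
  metadata = metadata
)

# When both tables are present, they must have the same number of rows.
stopifnot(nrow(my_data$abundance) == nrow(my_data$covariates))
```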

abundance

In the abundance data.frame:

- each row is an observation (e.g. in time or space)
- each column is a variable

Typically, each column is a species or taxon, and each row is an observed sample. In other words, each column is a time series, with the rows sorted so that time advances down the table (higher row indices correspond to later times).

covariates

In the covariates data.frame:

- each row is an observation (e.g. in time or space)
- each column is a variable

The number of rows must match that of abundance, and each row of covariates should line up with the same row of abundance (i.e. the two were sampled at the same time or over the same period). Common covariates are date and time, temperature, treatments, etc.
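One way to keep the two tables aligned while sorting by time is to compute a single ordering and apply it to both. This is a sketch in base R; the column names and values are hypothetical.

```r
# Hypothetical samples recorded out of chronological order.
covariates <- data.frame(
  date = as.Date(c("2016-06-28", "2014-06-28", "2015-06-28")),
  temperature = c(14, 7, 6)
)
abundance <- data.frame(
  species_a = c(0, 2, 6),
  species_b = c(4, 6, 0)
)

# Compute one ordering from the time column and apply it to both tables,
# so that row i of abundance still matches row i of covariates.
time_order <- order(covariates$date)
covariates <- covariates[time_order, , drop = FALSE]
abundance <- abundance[time_order, , drop = FALSE]

# Reset row names so they reflect the new order.
rownames(covariates) <- NULL
rownames(abundance) <- NULL
```

Sorting each table separately would break the row-by-row correspondence, which is why the same index vector is reused.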

metadata

In the metadata list:

- in general, any entries are allowed, in any data structure and format
- it must have an is_community entry, which indicates whether the time series in abundance can be treated as components of a community with interactions and/or shared drivers
- it must have a citation entry that is a vector of text values giving the reference(s) for the dataset. There can be multiple values (e.g. in the case of a specific dataset pulled from a larger database).
- if there is a location entry, it must contain at least a latitude and a longitude value (in decimal form). location itself can be a data.frame or a named vector.
- if there is a timename entry, it refers to a column in the covariates data.frame that gives a time index for the data
  - this column must be some form of numeric, integer, date, or date/time corresponding to the timing of the samples
  - applying tidyr::full_seq to this column, with the period entry (or 1 if period is missing), must produce the appropriate evenly spaced sequence of times
- if there is a period entry, it must be compatible with tidyr::full_seq and the timename variable described above
- if there is a species_table entry, it must have an id column that includes all the column names in abundance. This is intended to provide more information about the different variables in abundance.
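The rules above can be sketched as a handful of checks on a hand-built metadata list. This is an illustration only, not how MATSS itself validates data: the tables, coordinates, and citation strings are hypothetical, and the tidyr call is guarded in case the package is not installed.

```r
# Hypothetical tables; names and values are illustrative only.
abundance <- data.frame(species_a = c(2, 6, 5, 4), species_b = c(6, 0, 1, 9))
covariates <- data.frame(year = c(2014, 2015, 2017, 2018))  # 2016 not sampled

metadata <- list(
  is_community = TRUE,
  citation = c("Placeholder citation for the dataset",
               "Placeholder citation for the source database"),
  location = c(latitude = 31.94, longitude = -109.08),  # made-up coordinates
  timename = "year",
  period = 1,
  species_table = data.frame(
    id = c("species_a", "species_b"),
    common_name = c("First species", "Second species")
  )
)

# Required entries must be present.
stopifnot(all(c("is_community", "citation") %in% names(metadata)))

# Optional entries, checked here only because they are present.
stopifnot(all(c("latitude", "longitude") %in% names(metadata$location)))
stopifnot(metadata$timename %in% names(covariates))
stopifnot(all(names(abundance) %in% metadata$species_table$id))

# Expand the observed times into the full regular sequence, using a
# period of 1 when no period entry is given.
period <- if (is.null(metadata$period)) 1 else metadata$period
if (requireNamespace("tidyr", quietly = TRUE)) {
  full_time <- tidyr::full_seq(covariates[[metadata$timename]], period = period)
  print(full_time)
}
```

Here the observed years skip 2016, so the expanded sequence contains one more value than the data, which marks where a sample is missing.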

Example Data

Here is an example of a correctly formatted dataset with covariates and metadata:

library(MATSS)
data(dragons)

str(dragons)
#> List of 3
#>  $ abundance :Classes 'tbl_df', 'tbl' and 'data.frame':  6 obs. of  3 variables:
#>   ..$ Red Spotted Dragon    : num [1:6] 2 6 0 5 4 4
#>   ..$ Green Striped Dragon  : num [1:6] 6 0 4 1 9 7
#>   ..$ Blue Eyes White Dragon: num [1:6] 0 0 0 1 0 0
#>  $ covariates:'data.frame':  6 obs. of  3 variables:
#>   ..$ date         : Date[1:6], format: "2014-06-28" "2015-06-28" ...
#>   ..$ precipitation: int [1:6] 7 6 14 18 9 5
#>   ..$ effort       : num [1:6] 3 3 2 4 1 9
#>  $ metadata  :List of 7
#>   ..$ timename     : chr "date"
#>   ..$ effort       : chr "effort"
#>   ..$ period       : num 365
#>   ..$ authors      :List of 2
#>   .. ..$ :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : chr "Ellen"
#>   .. .. .. ..$ family : chr "Bledsoe"
#>   .. .. .. ..$ role   : chr "aut"
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: Named chr "0000-0002-3629-7235"
#>   .. .. .. .. ..- attr(*, "names")= chr "ORCID"
#>   .. ..$ :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : chr "Hao"
#>   .. .. .. ..$ family : chr "Ye"
#>   .. .. .. ..$ role   : chr "aut"
#>   .. .. .. ..$ email  : chr "hao.ye@weecology.org"
#>   .. .. .. ..$ comment: Named chr "0000-0002-8630-1458"
#>   .. .. .. .. ..- attr(*, "names")= chr "ORCID"
#>   .. ..- attr(*, "class")= chr "person"
#>   ..$ species_table:'data.frame':    4 obs. of  2 variables:
#>   .. ..$ id  : Factor w/ 4 levels "Blue Eyes White Dragon",..: 4 3 1 2
#>   .. ..$ game: Factor w/ 2 levels "pokemon","yugioh": NA NA 2 1
#>   ..$ citation     : chr "Hao Ye, Ellen K. Bledsoe, Renata Diaz, S. K. Morgan Ernest, Juniper L. Simonis, Ethan P. White, & Glenda M. Yen"| __truncated__
#>   ..$ is_community : logi TRUE
#>  - attr(*, "class")= chr "matssdata"

We can view the abundance and covariates tables side by side:

knitr::kable(dragons[c("abundance", "covariates")])

Red Spotted Dragon	Green Striped Dragon	Blue Eyes White Dragon
2	6	0
6	0	0
0	4	0
5	1	1
4	9	0
4	7	0

date	precipitation	effort
2014-06-28	7	3
2015-06-28	6	3
2016-06-28	14	2
2017-06-28	18	4
2018-06-28	9	1
2019-06-28	5	9

Checking Data

We also provide a function for checking whether the data is formatted correctly:

check_data_format(dragons)
#> [1] TRUE

Ellen K. Bledsoe