Overview

MATSS is a package for conducting Macroecological Analyses of Time Series Structure. We designed it to help researchers quickly get started in analyses of ecological time series, and to reinforce and spread good practices in computational analyses.

We provide functionality to:

  • obtain time series data from ecological communities, processed into a common data format
  • perform basic processing and summaries of those datasets; see data processing
  • build an analysis pipeline for macroecological analyses, using the workflow framework of the drake package
  • package the above analytical work in a reproducible way in a research compendium

Installation

You can install MATSS from GitHub with:

# install.packages("remotes")
remotes::install_github("weecology/MATSS", build_opts = c("--no-resave-data", "--no-manual"))

And load the package in the typical fashion:

library(MATSS)

Example Research Compendium

One of the best ways to get started is to create a research compendium. An auto-updating example is available at https://github.com/weecology/MATSSdemo.

To get started, identify the location and name for your compendium. For example, ~/MATSSdemo will put the compendium inside your home directory (the ~ location), with the package name "MATSSdemo". (Note that package names can only contain ASCII letters, numbers, and “.” and have to start with a letter.)
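Creating the compendium is then a single function call (a minimal sketch; see ?create_MATSS_compendium for the full set of options):

MATSS::create_MATSS_compendium("~/MATSSdemo")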

Compendium Creation Steps

Running this code will perform the following operations:

  • create a new R package for the analysis
  • add required dependencies for the new R package to its DESCRIPTION file
  • create an analysis folder to hold the analysis files
    • create a script to define and run the analysis
    • create an Rmarkdown report to plot the results of the analysis
    • create a .bib file to hold the MATSS reference linked to in the Rmarkdown report
  • create an R folder to hold function definitions
    • create functions for computing population-level and community-level properties
  • add the MIT License file (checking with the user first, if running in an interactive session)
  • add a project README
  • open the project in a new RStudio window (if available)

Running the Code

After creating the new project, the README will contain further instructions for running the code. We summarize them briefly here:

  1. The compendium exists as an R package and needs to be installed first.
  2. R needs to be restarted.
  3. The analysis script in analysis/pipeline.R can be run to perform the analysis and generate the report.
  4. The compiled report at analysis/report.md can be viewed.
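In code, the steps look roughly like this (a sketch assuming devtools is available; the compendium's README has the authoritative commands):

# 1. install the compendium as a package, from the project root
devtools::install()

# 2. restart R, then
# 3. run the analysis script to build the results and report
source("analysis/pipeline.R")

# 4. open analysis/report.md to view the compiled report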

For further details about how the code within the template project works, see the sections below on interacting with the datasets, the drake workflow package, and our tools for building reproducible analyses.

Data

Packaged datasets

Several datasets are included with this package. These can be loaded individually using their dedicated functions and require no additional setup.
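For example (assuming get_maizuru_data() is among the packaged datasets; check the package index for the full list):

maizuru <- get_maizuru_data() # Maizuru Bay fish community time series (illustrative)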

Configuring download locations:

Other datasets require downloading. To facilitate this, we include functions to configure a location on disk where downloaded data are stored. To check your current setting:
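get_default_data_path() # prints the currently configured location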

and to configure this setting (and then follow the instructions therein):
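use_default_data_path("<path>") # replace <path> with your preferred folder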

Downloading datasets:

To download individual datasets, call install_retriever_data() with the name of the dataset:

install_retriever_data("veg-plots-sdl")

To download all the datasets that are currently supported (i.e. with associated code for importing and formatting):
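download_datasets() # downloads all supported datasets to the configured location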

Preprocessing datasets:

We tap into several collections of datasets in MATSS, so it is useful to do some preprocessing to split the raw database files into separate datasets. These databases are:

  • BBS (the North American Breeding Bird Survey)
  • BioTIME (ecological assemblages from the BioTIME Consortium)

Processing these databases is necessary before loading in individual datasets.

prepare_datasets() # wrapper function to prepare all datasets

# or prepare each database individually:
# prepare_biotime_data()
# prepare_bbs_ts_data()

Working with Drake

We designed MATSS to build on the drake workflow package for computational analyses, so it is helpful to have a general understanding of how to use drake.

Basic Workflow

The basic approach to using drake is:

  • run R code to create a drake plan
  • call drake::make() to perform the work described in the plan

Provided Helper Functions

We provide several functions to help construct plans:

  • build_datasets_plan() constructs a plan for the datasets, with options to include downloaded datasets
  • build_analyses_plan() constructs a plan for a set of analyses, applying each method to each dataset. It takes as arguments a plan for the datasets and a plan for the methods.
  • collect_analyses() combines the output objects from a single analysis applied to multiple datasets. This helps to achieve a consistent structure for the results, regardless of what individual analysis functions actually return.
  • analysis_wrapper() wraps a method that applies to a single time series (such as computing the slope of a linear trend) into a function that applies to a whole dataset, returning the method's output for each individual time series in that dataset.

Usage of these functions is demonstrated in the template R script generated from create_MATSS_compendium().
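As a rough sketch of how these pieces fit together (the argument defaults and the ts_summary method here are illustrative; consult the help pages and the generated template for authoritative usage):

library(MATSS)
library(drake)

# plan for the datasets
datasets <- build_datasets_plan()

# plan for the methods; each method operates on a single dataset
methods <- drake_plan(
    summaries = analysis_wrapper(ts_summary)
)

# cross the methods with the datasets, then run everything
analyses <- build_analyses_plan(methods, datasets)
make(bind_plans(datasets, methods, analyses))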

Example

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# define the plan
plan <- drake_plan(data_1 = mtcars,
                   data_2 = iris,
                   my_model = lm(mpg ~ disp, data = data_1),
                   my_summary = data_2 %>%
                       group_by(Species) %>%
                       summarize_all(mean))

# run the plan
make(plan)
#> ▶ target data_1
#> ▶ target data_2
#> ▶ target my_model
#> ▶ target my_summary

# check resulting objects
readd(my_model)
#> 
#> Call:
#> lm(formula = mpg ~ disp, data = data_1)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>    29.59985     -0.04122
readd(my_summary)
#> # A tibble: 3 x 5
#>   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
#> 1 setosa             5.01        3.43         1.46       0.246
#> 2 versicolor         5.94        2.77         4.26       1.33 
#> 3 virginica          6.59        2.97         5.55       2.03

Running Drake Plans

A drake plan is run by calling make(). This does several things: first, it checks the cache to see which targets need to be (re)built; then, it builds those targets in an order that respects the dependencies between them (e.g. an analysis target is built only after the dataset target it depends on).

The drake user manual (https://books.ropensci.org/drake/) has more information about how drake stores its cache and how it decides when to rebuild targets.

Note that if there are file inputs, it is important to declare them explicitly using e.g. file_in(), knitr_in(), and file_out(). This enables drake to check whether those files have changed and to rebuild the targets that depend on them if needed. Otherwise, drake treats the file names as fixed strings.

plan <- drake_plan(data = read.csv("some_data.csv"))
make(plan)

# make some changes to `some_data.csv`
make(plan) # will NOT rebuild the `data` target

# declare the file dependency explicitly with file_in()
plan <- drake_plan(data = read.csv(file_in("some_data.csv")))
make(plan)

# make some changes to `some_data.csv`
make(plan) # will rebuild the `data` target
