Introduction

The purpose of this guide is to provide a reference for the formatting and style of data which is to be uploaded into the azmpdata package. This is meant as a reference for developers and data managers, data upload should not be attempted by individual package users.

Data format

Biological and chemical data is uploaded using a standard procedure. Raw data may vary significantly in format and style, it is then massaged into azmpdata style and format through R scripts before being finalized in exported data tables. This guide will focus on the final form and style of exported data products within the package.

The format of all package dataframes should be the same, they should be exported through the usethis::use_data() function as dataframe objects. Each column should correspond to a single data or metadata variable (wide format data). Metadata should be included as unique columns even if that requires repetitive entries for all rows of data.

Data style

This section will mainly focus on naming conventions. For all data variable names, please ensure that there are matching entries in the look-up table ‘variable_look_up.csv’ (located in inst/extdata/lookup).

Data variables have been named based on CF principles (https://cfconventions.org/standard-names.html). These principles dictate that variable names are verbose and descriptive, with words separated by underscores and no capitalization. This guidelines docucment (https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html) is quite helpful in determining phrasing, for example using ..._at_sea_floor to append a variable which was collected at the bottom of the water column, instead of bottom....

These conventions help to make variable names very clear for new and experienced users. The same tenants should be followed when naming new variables.

Metadata names have been kept fairly simple compared to data names. The same style principles are used when relevant, such as underscore separation and no capitalization. Metadata names are not appended by _name, for example we would use station rather than station_name, this keeps names brief.

Metadata requirements

All of the azmpdata datasets include quite extensive metadata in order to make the dataframes stand-alone products. This section gives some general requirements of the type of metadata which should be included if new data is added.

Occupations

Occupation data is single point data which is collected on a specific day or point in time. It can be biological or physical and has requirements as follows

  • year
  • month
  • day
  • latitude
  • longitude
  • station OR section OR area
  • sample_id
  • event_id

In some cases additional metadata will be required depending on the variable sampled

  • depth
  • nominal_depth
  • cruise_id
  • odf_filename

Weekly

Weekly data is often from remote sensing and may be an average of parameters over a given week. It has requirements as follows

  • year
  • month
  • week
  • station OR section OR area

Monthly

Monthly data is often derived and presents an average of data over a given month. It can be physical or biological and has requirements as follows

  • year
  • month
  • station OR section OR area

Seasonal

Seasonal data is often derived and presents an average of data over a given season. It can be physical or biological and has requirements as follows

  • year
  • season
  • station OR section OR area

Annual

Annual data is often derived and presents an average of data over a given year. It can be physical or biological and has requirements as follows

  • year
  • station OR section OR area

Metadata format

This section focuses on how metadata is presented.

Date and Time

year is given as a four digit numeric value

month is given as a numeric value (where 1 is January and 12 is December)

week is given as numeric, the number of weeks elapsed in the given month

day is given as numeric, the number of days elapsed in the given month

season is given as a character value, describing the Northern Hemisphere season (fall, winter, spring or summer)

Location

station, section, or area is given as character value, typically representing the short name for a given location. Care should be taken to ensure that any names used in the datasets are also represented in the spatial data included with the multivaR package so that users are able to find definitions for the areas included. AZMP transects are represented by acronyms, for example Halifax line is written as HL, Brown’s Bank line is BBL. Stations along each of these lines are numbered and written without any dashes, underscores or leading zeros (e.g. HL2, BBL7, P5).

Identifiers

sample_id is currently given as a combination of original data sample_id, appended (with an underscore) by the cruise_id. This ensures that there is no duplication of sample_id’s between cruises, which is a problem in other databases. The format is cruiseid_sampleid.

event_id is given as numeric

cruise_id is given as the numeric cruise identifier from BioChem and is typically (although not always) in the format YYYY###### (Note the nineties are an exception where cruise_id’s are formatted YY######). Where the last six digits are a unique numeric sequence to identify the cruise.