The purpose of this guide is to provide a reference for the
formatting and style of data to be uploaded into the
azmpdata package. It is intended as a reference for
developers and data managers; individual package users should not
attempt to upload data.
Biological and chemical data is uploaded using a standard procedure.
Raw data may vary significantly in format and style; it is converted
into azmpdata style and format through R scripts before
being finalized in exported data tables. This guide focuses on the
final form and style of exported data products within the package.
The format of all package dataframes should be the same: they should
be exported through the usethis::use_data() function as
dataframe objects. Each column should correspond to a single data or
metadata variable (wide format data). Metadata should be included as
unique columns even if that requires repetitive entries for all rows of
data.
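As a minimal sketch of the layout described above, the following builds a small wide-format dataframe in which the metadata is repeated on every row. The object name example_occupations, the column names, and all values are hypothetical and chosen only for illustration. The usethis::use_data() call is left commented out because it must be run from inside the package project.

```r
# Hypothetical example: one column per data or metadata variable (wide format).
# All names and values below are illustrative, not real package data.
example_occupations <- data.frame(
  year = c(2018, 2018, 2018),
  month = c(4, 4, 4),
  day = c(12, 12, 13),
  station = c("HL2", "HL2", "HL2"),                 # metadata repeated on every row
  sea_temperature_at_sea_floor = c(2.1, 2.3, 2.2),  # hypothetical variable name
  stringsAsFactors = FALSE
)

# Run from within the package project to export the object:
# usethis::use_data(example_occupations, overwrite = TRUE)
```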
This section will mainly focus on naming conventions. For all data
variable names, please ensure that there are matching entries in the
look-up table ‘variable_look_up.csv’ (located in
inst/extdata/lookup).
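One way to check that every data variable has a look-up entry is sketched below. It assumes the look-up table has a column called variable holding the variable names; that column name, the list of metadata names, and the example data are assumptions made for illustration and are not taken from the real file.

```r
# Hypothetical stand-in for variable_look_up.csv. The column name `variable`
# is an assumption, not taken from the real file.
lookup <- data.frame(
  variable = c("sea_temperature_at_sea_floor", "nitrate"),
  stringsAsFactors = FALSE
)
# In the package the table could instead be read from disk, for example:
# lookup <- read.csv(file.path("inst", "extdata", "lookup", "variable_look_up.csv"))

# Columns treated as metadata (illustrative list) are excluded from the check.
metadata_names <- c("year", "month", "day", "station")

# Hypothetical new data; chlorophyll_a is deliberately absent from the lookup.
new_data <- data.frame(
  year = 2018,
  station = "HL2",
  sea_temperature_at_sea_floor = 2.1,
  chlorophyll_a = 0.8
)

data_names <- setdiff(names(new_data), metadata_names)
missing_from_lookup <- setdiff(data_names, lookup$variable)
missing_from_lookup
```

Any name returned in missing_from_lookup would need an entry added to the look-up table before the data is exported.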
Data variables have been named based on CF principles (https://cfconventions.org/standard-names.html). These
principles dictate that variable names are verbose and descriptive, with
words separated by underscores and no capitalization. The CF guidelines
document (https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html)
is quite helpful in determining phrasing; for example, the suffix
..._at_sea_floor (rather than the prefix bottom...) is used
for a variable which was collected at the bottom of the water
column.
These conventions help to make variable names very clear for new and experienced users. The same tenets should be followed when naming new variables.
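The formal part of these conventions (lowercase words separated by underscores) can be checked mechanically. The helper below is a hypothetical sketch, not a function from azmpdata, and the candidate names are made up for illustration. It checks only the form of a name, not whether the name is descriptive enough.

```r
# Hypothetical helper: TRUE for names made only of lowercase letters and
# digits, with single underscores between words.
is_cf_style <- function(x) {
  grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", x)
}

# Illustrative candidate names (not real package variables).
candidate_names <- c(
  "temperature_at_sea_floor",  # lowercase, underscore separated
  "bottomTemperature",         # contains a capital letter
  "sea surface height",        # contains spaces
  "nitrate_"                   # trailing underscore
)

is_cf_style(candidate_names)
# TRUE FALSE FALSE FALSE
```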
Metadata names have been kept fairly simple compared to data names.
The same style principles are used when relevant, such as underscore
separation and no capitalization. Metadata names are not suffixed with
_name; for example, we would use station rather
than station_name. This keeps names brief.
All of the azmpdata datasets include quite extensive
metadata in order to make the dataframes stand-alone products. This
section gives some general requirements of the type of metadata which
should be included if new data is added.
Occupation data is single point data which is collected on a specific day or point in time. It can be biological or physical and has requirements as follows:
In some cases additional metadata will be required depending on the variable sampled.
Weekly data is often from remote sensing and may be an average of parameters over a given week. It has requirements as follows:
Monthly data is often derived and presents an average of data over a given month. It can be physical or biological and has requirements as follows:
This section focuses on how metadata is presented.
year is given as a four digit numeric value
month is given as a numeric value (where 1 is January and 12 is December)
week is given as numeric, the number of weeks elapsed in the given month
day is given as numeric, the number of days elapsed in the given month
season is given as a character value, describing the Northern Hemisphere season (fall, winter, spring or summer)
station, section, or area is given as a
character value, typically representing the short name for a given
location. Care should be taken to ensure that any names used in the
datasets are also represented in the spatial data included with the
multivaR package so that users are able to find definitions
for the areas included. AZMP transects are represented by acronyms, for
example Halifax line is written as HL, Brown’s Bank line is BBL.
Stations along each of these lines are numbered and written without any
dashes, underscores or leading zeros (e.g. HL2, BBL7, P5).
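When raw data uses other station spellings, a small helper can bring them into the form described above. The function below is a hypothetical sketch rather than part of azmpdata, and it assumes station names consist of a letter prefix followed by a number.

```r
# Hypothetical helper: remove dashes and underscores, then drop leading zeros
# from the station number (assumes a letter prefix followed by digits).
normalize_station <- function(x) {
  x <- gsub("[-_]", "", trimws(x))          # drop separators
  sub("^([A-Z]+)0+([0-9])", "\\1\\2", x)    # drop zeros directly after the prefix
}

normalize_station(c("HL_02", "BBL-7", "P5", "HL10"))
# "HL2" "BBL7" "P5" "HL10"
```

Names that are already in the expected form, such as P5, pass through unchanged.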
sample_id is currently given as a combination of the
cruise_id and the original data sample_id, joined with an
underscore. This ensures that sample_id values are not
duplicated between cruises, which is a problem in other
databases. The format is cruiseid_sampleid.
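The cruiseid_sampleid format can be sketched with paste(). The identifiers below are hypothetical and are stored as character strings so that they are not reformatted when combined.

```r
# Hypothetical identifiers: the same original sample_id appears on two cruises.
cruise_id <- c("2018000001", "2018000001", "2019000002")
original_sample_id <- c("450001", "450002", "450001")

# Combine as cruiseid_sampleid so that ids are unique across cruises.
sample_id <- paste(cruise_id, original_sample_id, sep = "_")
sample_id
# "2018000001_450001" "2018000001_450002" "2019000002_450001"

# anyDuplicated() returns 0 when there are no duplicated values.
anyDuplicated(sample_id) == 0
```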
event_id is given as numeric
cruise_id is given as the numeric cruise identifier from
BioChem and is typically (although not always) in the format
YYYY######, where the last six digits are a unique numeric
sequence to identify the cruise. Note that the nineties are an
exception, where cruise_id values are formatted
YY######.
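A simple check for the two typical cruise_id patterns is sketched below. The helper name and the example identifiers are hypothetical. Because the format is typical rather than guaranteed, a FALSE result is best treated as a prompt for manual review, not as an error.

```r
# Hypothetical helper: TRUE when the id has 10 digits (YYYY######)
# or 8 digits (YY######, used for cruises from the nineties).
has_typical_cruise_id_format <- function(x) {
  # sprintf() avoids scientific notation when numeric ids are converted to text.
  x <- if (is.numeric(x)) sprintf("%.0f", x) else as.character(x)
  grepl("^([0-9]{10}|[0-9]{8})$", x)
}

# Illustrative ids: 10 digits, 8 digits, and 9 digits (not a typical format).
has_typical_cruise_id_format(c(2018000001, 98000123, 201800001))
# TRUE TRUE FALSE
```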