METHODOLOGICAL PRINCIPLES TO CREATE A METADATA EXTENSION TO THE DARWIN CORE STANDARD FOR AGROBIODIVERSITY DATA

This paper proposes principles for creating a metadata extension to the Darwin Core standard that addresses agrobiodiversity data, with a focus on ecological interactions. These principles were compiled from the scientific literature, with special attention to the recommendations of the DCMI Abstract Model, which outlines the principles for creating metadata. The DCMI Abstract Model governs the creation of the Dublin Core metadata standard, upon which Darwin Core is based. The requirements of the ISO/IEC 11179-4 (2004) standard for the definition of metadata were also taken into consideration. A prototype of a metadata record for the field of ecological interactions, which is the scope of this research within agrobiodiversity, was created to demonstrate the format that the metadata will have when the extension is finished. This research is an effort to propose more effective tools for agrobiodiversity data management. However, the discussions around the conceptual aspects of ecological interactions in agrobiodiversity, and around the relationship of the new metadata extension with the Darwin Core vocabulary, still need to mature and deepen, and a robust methodology for creating DwC extensions has yet to be developed.


Introduction
Metadata creation and curation are community-driven tasks. Many metadata standards have been developed by scientific communities for distinct knowledge fields. Metadata for specific subjects are called disciplinary metadata. The Digital Curation Centre (DCC) lists dozens of disciplinary metadata standards currently in use across disciplines on its website: https://www.dcc.ac.uk/guidance/standards/metadata. Several metadata standards have been developed for biodiversity science, such as Access to Biological Collection Data (ABCD), Darwin Core (DwC), and Ecological Metadata Language (EML). Among them, DwC is the most widely used metadata standard for sharing biodiversity data in the Global Biodiversity Information Facility (GBIF) portal, one of the largest biodiversity data repositories in the world (Body et al. 2020). Its worldwide use suggests that DwC (Wieczorek et al. 2012) may be used to describe agrobiodiversity data. However, a pragmatic analysis of DwC and the DwC metadata extensions showed that important concepts and relations of Agricultural Biodiversity are not represented in DwC elements.
The Convention on Biological Diversity (CBD 2000) defines Agricultural Biodiversity as the set of elements of biodiversity that are relevant somehow to agriculture and food production.
In other words, "the variability among living organisms from all sources including terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part: this includes diversity within species, between species and of ecosystems" (CBD 2000 p. 85).
Field and research work in Agricultural Biodiversity produces data that DwC cannot yet fully describe. To address this gap, a research project (see note 2) is under way to develop a metadata extension able to represent the agricultural biodiversity data produced by Embrapa. Before creating a metadata extension, however, it is necessary to set rules to standardize the process. Thus, this paper presents some methodological principles required to create a new metadata extension to DwC, within the scope of agrobiodiversity data.

Information representation and metadata
A representation is a piece of information that describes a digital object in such a way that it can be retrieved on the web or in a database (Chu 2005). "Information representation includes the extraction of some elements (e.g., keywords or phrases) from a document or the assignment of terms (e.g., descriptors or subject headings) to a document so that its essence can be characterized and presented" (Chu 2005 p. 14).
Metadata emerged from the need to organize the growing amount of digital information on the web in order to improve information retrieval (Alves 2005). Information representation is a field of study in Library Science, but also in Informatics and Computational Linguistics (Lourenço 2016). The term metadata emerged in the 1960s and was applied to bibliographic description in libraries, but it became popular only in 1995 with the emergence of the Dublin Core metadata standard, created for describing digital objects on the web (Alves 2016; Lourenço 2016).
The main aim of information representation is to enable information retrieval (Lourenço 2016). To reach this purpose, many models and definitions have been developed to support metadata creation and use.
Note 2: The project began at the Federal University of Minas Gerais (UFMG), in collaboration with Embrapa, as master's degree research and continues at the Polytechnic School of the University of São Paulo (USP) as Ph.D. research.

Metadata
Metadata may be understood as labels created to describe data content (National Information Standards Organization 2007; Pomerantz 2015; Riley 2017; Zeng 2015; Zeng and Qin 2008). It is often defined as "data about data" in the literature, e.g. in the ISO/IEC 11179-4 (2004) standard for metadata registries, but this trivial definition may not be enough (Pomerantz 2015). There are definitions in the literature that better express the whole function of metadata. Zeng (2015) notes how the definition of the metadata concept varies across communities of practice, but offers one that better fits the research approach of this paper: metadata are "information about specific things" (Zeng 2015). This definition seems more adequate than that of ISO/IEC 11179-4 (2004).
However, the definition that better expresses the meaning of the concept is given by the National [...]
[...] the example from before, <phone> is a predicate, 5531986424933 is a value, and it is a phone number that belongs to a person, the subject in the triple. This model is very similar to the Resource Description Framework (RDF), which is "a globally-accepted framework for data and knowledge representation that is intended to be read and interpreted by machines" (Wilkinson et al. 2016 p. 2).
RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a "triple"). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. (RDF Working Group 2014).
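As a minimal sketch of the triple model described above, the Python snippet below stores statements as (subject, predicate, value) tuples and merges two independently produced sets with a plain set union. The `example.org` URIs, the `name` predicate, and the name "Jane Doe" are hypothetical and introduced only for illustration; the phone value is the one used in the text. This is not an RDF implementation, only an illustration of why uniformly shaped statements are easy to merge.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property, named by a URI
    value: str      # the literal or resource the property points to


# Hypothetical identifiers (example.org is reserved for illustration).
PERSON = "http://example.org/person/1"
PHONE = "http://example.org/terms/phone"
NAME = "http://example.org/terms/name"

# Two sources describe the same person, partly with different properties.
source_a = {Triple(PERSON, PHONE, "5531986424933")}
source_b = {
    Triple(PERSON, NAME, "Jane Doe"),  # hypothetical value
    Triple(PERSON, PHONE, "5531986424933"),
}

# Merging is a set union: duplicate statements collapse, new ones are added.
merged = source_a | source_b


def values_of(triples, subject, predicate):
    """Return all values stated for a subject/predicate pair, sorted."""
    return sorted(
        t.value for t in triples
        if t.subject == subject and t.predicate == predicate
    )


print(values_of(merged, PERSON, PHONE))
```

Because every statement has the same three-part shape, neither source needs to know the other's schema before the union is taken.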
All this effort has a much bigger aim: making scientific data FAIR: Findable, Accessible, Interoperable, Reusable (Wilkinson et al. 2016).
[...] ISO/IEC 11179-6 (2015) shows that each element in the metadata schema must have a unique identifier.
Beyond that, to create a metadata schema one must define its semantic and syntactic rules.
The semantics give the meaning of each element of the metadata schema, setting its function in the registry, while the syntax determines how to format the metadata in an interoperable way.
Observe the DwC element <eventTime> in Table 1.
The identifier attribute gives a computer program the pathway to understand the element (it makes the element interoperable); the definition in Table 1 sets the meaning of the metadata element, i.e., its semantics. Meanwhile, the comments set part of the syntax of the metadata element by indicating an encoding scheme for properly formatting the data. The examples show what the data look like when properly formatted. Each metadata standard sets its own rules, e.g. a set of elements that must always be present in the metadata record, or how to organize the metadata classes of the record. The DwC rules for formatting and using metadata can be found in the documentation published by the Darwin Core Task Group (2009a; 2009b; 2009c; 2009d) and the Darwin Core and RDF/OWL Task Groups (2015).
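The division of labour between identifier, definition, comments, and examples can be sketched in Python as follows. This is an illustration under stated assumptions, not the DwC specification: the attribute values are recalled from the public DwC term documentation rather than copied from Table 1, and the regular expression covers only a simplified subset of ISO 8601 time-of-day notation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermDefinition:
    identifier: str  # URI that makes the element unique
    definition: str  # semantics: what the element means
    comments: str    # syntax guidance, e.g. the encoding scheme
    examples: tuple  # values that follow the guidance


# Assumed values, recalled from the DwC documentation (not from Table 1).
EVENT_TIME = TermDefinition(
    identifier="http://rs.tdwg.org/dwc/terms/eventTime",
    definition="The time or interval during which an Event occurred.",
    comments="Recommended best practice is to use an encoding scheme such as ISO 8601.",
    examples=("14:07-0600", "08:40:21Z", "13:00:00Z/15:30:00Z"),
)

# Simplified ISO 8601 time of day: hh:mm[:ss] plus optional Z or +/-hh[:]mm.
_TIME = re.compile(
    r"([01]\d|2[0-3]):[0-5]\d(:[0-5]\d)?(Z|[+-]([01]\d|2[0-3]):?[0-5]\d)?"
)


def is_valid_event_time(value: str) -> bool:
    """Accept a single time or an interval of two times separated by '/'."""
    parts = value.split("/")
    return 1 <= len(parts) <= 2 and all(_TIME.fullmatch(p) for p in parts)


for example in EVENT_TIME.examples:
    print(example, is_valid_event_time(example))
```

The definition carries the meaning, while the check derived from the comments enforces the format; keeping the two separate mirrors the semantics/syntax split described above.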

Agrobiodiversity data
A report published by the GBIF task group on data fitness for use in agrobiodiversity (Arnaud et al. 2016) shows the need to develop strategies and tools to manage agrobiodiversity data. Arnaud et al. (2016) point out that agrobiodiversity data might focus on taxon, vernacular names, occurrences, geospatial distribution, genotype, phenotype, environmental factors, agronomic traits, functional traits, species interactions, socioeconomic factors, and local knowledge. The DwC standard has metadata that describe taxon, vernacular names, occurrences, and geospatial distribution, and the DwC metadata extensions include metadata for genetic data; the other concepts are not covered by DwC metadata.
Thus, this research focuses on species interactions, a subject field unexplored in the scope of DwC.
Species interact all the time in crops and farms, so knowing those interactions is particularly important for food production. For example, pollination, a kind of mutualistic interaction between an animal (e.g. a bee, a bird, or a bat) and a plant, is crucial for producing fruits and seeds. "[T]hree out of four crops across the globe" depend on pollinators to reach yields. Beyond pollination, many other interactions are observed in agriculture. These interactions are presented as a conceptual model in Figure 2 in section 5.1.

Methods
The methodological principles we believe are necessary to create a DwC metadata extension were assembled in three phases: a) selection and analysis of terminological and data inputs; b) terminological definition and metadata modeling; c) evaluation by the community of practice.

Selection and analysis of terminological and data inputs
This stage of the methodology aimed at immersion in the thematic field of agrobiodiversity. It was organized into five sub-steps: a) definition of the scope of agrobiodiversity data representation, using the Final Report of the Task Group on GBIF Data Fitness for Use in Agrobiodiversity (Arnaud et al. 2016) as a guide, to set the sample to be worked on (ecological interactions); [...]
[...] Identifier (URI), which makes the metadata element unique and avoids misunderstanding for both humans and machines.
This second step consisted of applying principles that allow us to define, clearly and objectively, the function of metadata elements. Those are ISO/IEC 11179-4 (2004) recommendations for metadata construction and the syntactic and semantic scheme of DwC, based on RDF schema.
This task of defining metadata terms has been named 'functional terminological definition', since this activity consists in establishing the function of metadata, pointing out its rules of application.

Results and analysis
In this section, we present the principles we followed to start the task of creating the metadata extension.

Analysis of methodological and terminological inputs
The preliminary analysis of species interaction concepts in the literature ( [...]
Figure 2 shows a concept schema for the interaction grid, a model that represents species interactions. An ecological interaction is an action that involves two organisms of the same species (intraspecific interaction) or of two different species (interspecific interaction). Thus, to represent this relationship, one must define the role of each organism or species in the interaction. For example, in predation, one organism is a predator, which means it eats another living being. The other organism, which serves as food for the predator, is called prey. Each organism, if represented by metadata, is a resource. The terms "predator" and "prey" name the role of each organism in the interaction and may be useful for understanding the relationship between the resources.
Analyzing the literature alone is not enough to define metadata to describe ecological interactions in agrobiodiversity data. For an information scientist or a computer scientist, this literature analysis makes it possible to understand the basic concepts of ecological interactions in agrobiodiversity, but no more than that. It is essential to involve agrobiodiversity specialists, who can say which data need to be represented.
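The idea that each participant in an interaction is a resource with an explicit role can be sketched as a small Python data model. The class and field names are hypothetical, chosen for illustration only, and the species names are placeholders rather than real taxa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    species: str  # name of the organism's species (placeholder values below)
    role: str     # role played in the interaction, e.g. "predator"


@dataclass(frozen=True)
class Interaction:
    kind: str            # e.g. "predation"
    first: Participant
    second: Participant

    def scope(self) -> str:
        """Classify the interaction by comparing the participants' species."""
        same_species = self.first.species == self.second.species
        return "intraspecific" if same_species else "interspecific"


predation = Interaction(
    kind="predation",
    first=Participant(species="Species A", role="predator"),
    second=Participant(species="Species B", role="prey"),
)

print(predation.kind, predation.scope())
```

Recording the role next to each species keeps the direction of the relationship explicit, which matters once the same species appears in several interactions.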

Terminological definition and metadata modeling
The definition of the metadata terms must establish the function of the metadata. This definition is different from a mere glossary or dictionary definition of a term: it must define the semantics and the syntax of the metadata element, which determine the role that the metadata element will play in the data description. For example, observe the following two definitions of the word 'date', the first one as a metadata element of Dublin Core and the second one as a dictionary definition: [...] b) date (Oxford Learner's Dictionaries): "noun; particular day/year; 1) a particular day of the month, sometimes in a particular year, given in numbers and words; 2) a particular day or year when a particular event happened or will happen; 3) a time in the past or future that is not a particular day; 4) an arrangement to meet somebody at a particular time; 5) a meeting that you have arranged with a boyfriend or girlfriend or with somebody who might become a boyfriend or girlfriend; 6) a boyfriend or girlfriend with whom you have arranged a date; 7) a sweet sticky brown fruit that grows on a tree called a date palm, common in North Africa and West Asia" (Oxford University Press 2020).
As we see, the first definition given by Dublin Core [...] Table 2).

EXPLANATION
[...] to form a precise definition that includes the essential characteristics of the concept. Simply stating one or more synonym(s) is insufficient. Simply restating the words of the name in a different order is insufficient. If more than a descriptive phrase is needed, use complete, grammatically correct sentences.

EXAMPLES
Agent Name: a) good definition: name of party authorized to act on behalf of another party; b) poor definition: representative.

REASON
"Representative" is a near-synonym of the data element name, which is not adequate for a definition.
contain only commonly understood abbreviations

EXPLANATION
Understanding the meaning of an abbreviation, including acronyms and initialisms, is usually confined to a certain environment. In other environments, the same abbreviation can cause misinterpretation or confusion. Therefore, to avoid ambiguity, full words, not abbreviations, shall be used in the definition.
Exceptions to this requirement may be made if an abbreviation is commonly understood such as "i.e." and "e.g." or if an abbreviation is more readily understood than the full form of a complex term and has been adopted as a term in its own right such as "radar" standing for "radio detecting and ranging." All acronyms must be expanded on the first occurrence.

EXAMPLES
Tide Height: a) good definition: the vertical distance from mean sea level (MSL) to a specific tide level; b) poor definition: the vertical distance from MSL to a specific tide level.

REASON
The poor definition is unclear because the acronym, MSL, is not commonly understood and some users may need to refer to other sources to determine what it represents. Without the full word, finding the term in a glossary may be difficult or impossible.
These ISO/IEC principles are useful if one does not know where to start when creating metadata. However, some of them are not fully in line with the current practice of metadata creation for specific scientific data fields. For example, one ISO/IEC rule states that the definition of a metadata term should not embed definitions of other metadata elements. This helps to avoid unnecessary repetition, but sometimes a cross-reference is needed to complement the meaning of the metadata element, especially when the elements are arranged into classes. For example, in DwC (Darwin Core Task Group 2009d) the property <kingdom> has the definition "The full scientific name of the kingdom in which the taxon is classified". Scientific name is another metadata property of DwC, but its name is embedded in the definition of the <kingdom> property to show what kind of data this element can represent.
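The abbreviation requirement quoted from Table 2 lends itself to a simple automated check. The Python sketch below is a heuristic written for illustration, not part of ISO/IEC 11179-4: it treats any run of two or more capital letters as an acronym and considers it expanded if it appears in parentheses somewhere in the definition. The function name is hypothetical.

```python
import re


def unexpanded_acronyms(definition: str) -> list:
    """Return acronyms that never appear in the parenthesised form '(ABC)'."""
    flagged = []
    for match in re.finditer(r"\b[A-Z]{2,}\b", definition):
        acronym = match.group()
        # Heuristic: '(MSL)' anywhere in the text counts as an expansion.
        is_expanded = f"({acronym})" in definition
        if not is_expanded and acronym not in flagged:
            flagged.append(acronym)
    return flagged


good = "the vertical distance from mean sea level (MSL) to a specific tide level"
poor = "the vertical distance from MSL to a specific tide level"

print(unexpanded_acronyms(good))
print(unexpanded_acronyms(poor))
```

A check of this kind can only flag candidates; whether an abbreviation is "commonly understood" in a given community still requires human judgment.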

Shaping the metadata extension
To illustrate the application of phases two and three of the methodology, we present the following situation: we wish to describe the ecological interaction between two species used in biological control. "Biological control can be defined as using living natural enemies to control pests (Kenis et al. 2019 p. 1 [...]
The predicate parasite of, which describes the relationship between the wasp Telenomus podisi (subject) and the stink bug Euschistus heros (object) in Figure 3, can be represented as a DwC metadata property, as shown in Table 3. The identifier in Table 3 is a reference to the GitHub repository where the metadata element is published. The definition is based on Sorci and Garnier (2008). The label parasite was imported from AGROVOC, which presents the term in the plural form, parasites. However, the term was adopted in the singular form to meet the first requirement of ISO/IEC 11179-4 (2004) in Table 2.
The definition for the property in Table 3 [...]
If the triple in Figure 3 is inverted so that its object, the stink bug species, becomes the subject, another predicate can be created, as shown in Table 4.
The identifier in Table 4 is a reference to the GitHub repository where the metadata element is published. The definition is based on Sorci and Garnier (2008). The label host was imported from AGROVOC, which presents the term in the plural form, hosts. Nevertheless, the term was adopted in the singular form to meet the first requirement of ISO/IEC 11179-4 (2004) in Table 2.
The element in Table 4 follows the same specification rules as the element in Table 3. However, the nature of this element is more complex: an organism can be a host in four kinds of ecological interactions shown in Figure 2: parasitism, commensalism, inquilinism, and mutualism. In each of these interactions, the function of the host varies: in parasitism, the host is harmed by the parasite, so it is called a negative interaction, but the host is not harmed in commensalism, mutualism, or inquilinism, which are called positive interactions. It is important to make the role of the organism in the interaction explicit within the metadata record, as it determines whether or not a given organism can be used as a resource of agrobiodiversity in crops and farms.
The same species may be involved in one or more interactions, so it may assume the role of host more than once in different interactions. This implies repeating the hostOf element in the metadata record, which the Simple DwC usage rules recommend against (Darwin Core Task Group 2009b). A possible solution to this problem is to organize the metadata into classes using the RDF model, since it places no limitation on repeating properties (Darwin Core and RDF/OWL Task Groups 2015). A hostOf element subordinate to the Parasitism class in a metadata record ceases to be ambiguous: it becomes apparent that it is a host that houses a parasite, even if there is more than one hostOf field in the record. Classes fulfill the function of contextualizing the organism's role in the ecological interaction. The hostTo property, which in Table 4 allows parasitic organisms to be named, could be used, for example, to describe the mutualistic relationship (in which both participant organisms benefit) between a cow and the beneficial bacteria living in the animal's gut, provided it is subordinate to a Mutualism metadata class. The hostTo predicate would be assigned to the cow metadata record and would take as values the names of these beneficial bacteria.
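How classes can remove the ambiguity of a repeated host property may be sketched in Python with a record that groups properties under an interaction class. The class `InteractionRecord`, its methods, and the use of `hostOf` as the property name are assumptions made for this illustration; the symbiont value is a placeholder, not a real taxon. The sketch does not use RDF itself, it only mimics the grouping.

```python
from collections import defaultdict


class InteractionRecord:
    """Metadata record for one organism, with properties grouped by class."""

    def __init__(self, scientific_name: str):
        self.scientific_name = scientific_name
        # class name -> property name -> list of values (repetition allowed)
        self._statements = defaultdict(lambda: defaultdict(list))

    def add(self, interaction_class: str, prop: str, value: str) -> None:
        """State a property value in the context of an interaction class."""
        self._statements[interaction_class][prop].append(value)

    def values(self, interaction_class: str, prop: str) -> list:
        """Return the values of a property within one interaction class."""
        return list(self._statements.get(interaction_class, {}).get(prop, []))


stink_bug = InteractionRecord("Euschistus heros")
stink_bug.add("Parasitism", "hostOf", "Telenomus podisi")
stink_bug.add("Mutualism", "hostOf", "placeholder symbiont")  # hypothetical value

print(stink_bug.values("Parasitism", "hostOf"))
print(stink_bug.values("Mutualism", "hostOf"))
```

The property name repeats, but each occurrence is read in the context of its class, so a host in a parasitic interaction is never confused with a host in a mutualistic one.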
To exemplify how the given example of parasitism could work when the metadata proposed in Tables 2 & 4 are combined with DwC metadata, we created Schema 1. This example only illustrates the work we are doing. The next step in this research is to involve the international community of practice on data management concerned with the use of biodiversity data in agriculture and food production in the metadata creation process. We believe the scientific community will arrive at the best solutions for semantic problems in agrobiodiversity metadata such as those shown in this paper. Thereafter, we expect the resulting metadata extension to be used by researchers all around the world.
[...]
1 To study and analyze the data subject that is intended to be represented with metadata, in order to define the scope and approach of the data representation.

1.1 To create a branch of classes to arrange data properties into categories.
2 To analyze the metadata of the main core of DwC to check the existence of elements for the subject data that one wants to represent with metadata.
3 To analyze the core of terms of the DwC Registered Extensions to verify the existence of metadata for the scope of the subject data that is intended to be represented.
4 To search for ontologies, metadata schemes, or any other pre-existing conceptual model for the subject field that is intended to be represented by metadata.
4.1 To perform a correlated analysis between the terms of the conceptual model and the DwC, if any conceptual model is found for the subject field. Then, one should classify the terms of the conceptual model into two categories: DwC equivalent terms and non-equivalent terms.
4.1.1 To use non-equivalent terms as potential new metadata.
4.1.2 To use equivalent terms as DwC metadata, from Darwin Core Task Group (2009d).
5 To adopt requirements and recommendations based on standards and data models from the scientific literature to propose the syntactic and semantic structure of the metadata extension elements, so that others can use the metadata. It is suggested:
5.1 the standard ISO/IEC 11179-4 (2004), which provides recommendations with clear and exemplified definitions of best practices in creating a definition for metadata vocabulary;
5.2 the Dublin Core Abstract Model (Powell et al. 2007) as the semantic model for the "design of metadata records in terms of structural components, such as Descriptions, Statements, Properties, and (literal or non-literal) Values, in order to enable structural validation of RDF-based metadata";
5.3 the RDF Schema for modeling the metadata vocabulary (Brickley et al. 2014);
5.4 the Extensible Markup Language (XML) as the syntax for the RDF metadata application (Bray et al. 2008), although Turtle, JSON, and other syntaxes can also be applied.
6 To embrace the three forms of extending metadata schemas, if applicable: a) to create new metadata elements, which should have names, labels, definitions, and functions different from the pre-existing metadata in the DwC, according to items from 6.1 to 6.2.9; b) qualifiers: a qualifier term must be bound to a pre-existing term in the metadata vocabulary to identify a specific value, for example, 'dc:date' and 'dc:dateRegistered', according to items from 6.3 to 6.3.2; c) encoding schemes, which provide guidelines for formatting the metadata terms and data values, as per item 6.4.
6.1 To create a terminology sheet for each term of the extension's metadata, defining its attributes such as term name, namespace, definition, and additional information that might help users to apply the metadata (see Brickley et al. 2014 [...]
[...] To search for a term definition in the scientific literature when the controlled vocabulary does not provide an underlying conceptual definition for the term used as a metadata element.
6.2.8 To write the metadata terms in the lower camelCase format. For example: occurrenceRemarks, lifeStage, reproductiveCondition.
6.2.9 To create a namespace to identify the extension's metadata for computer systems.
6.3 To create qualifiers for pre-existing metadata in the DwC element set. Qualifiers are words that allow a more specific value to be represented, e.g., 'dateAccepted' is a qualified version of the element 'date'.
[...]
7 To involve the community of practice in the construction of the metadata extension, so the data representation can meet the needs of those who will use it in practice.
8 The metadata extension archive has to be identified by a single metadata registry, including the following attributes (from 8.1 to 8. [...]
9 Each metadata element of the extension must be defined by attributes (from 9.1 to 9.9):
9.1 termName: name of the element that can be used as a metadata field in a record;
9.2 definition: the term meaning;
9.3 see also: a source of information about discussion groups, vocabulary, or history of the term;
9.4 qualified name: the term's URI;
[...]
9.8 datatype: the type of value (data) that can be entered in the metadata field;
9.9 required: indicates if the use of the element is mandatory in all DwC metadata records or not.
The principle 4 of [...]
We expect these principles to be useful for other researchers to create their own metadata extensions to DwC.
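Principles 4.1, 6.2.8, 6.2.9, and 9 can be combined into a short Python sketch: labels from a conceptual model are converted to lower camelCase, split into DwC-equivalent and non-equivalent names, and a draft terminology sheet is created for each non-equivalent name. Everything here is an assumption made for illustration: the set of DwC terms is a small subset, the namespace is a placeholder on `example.org`, the function and class names are hypothetical, and the definitions are deliberately left pending.

```python
import re
from dataclasses import dataclass

# Illustrative subset of existing DwC term names (not the full vocabulary).
DWC_TERMS = {"scientificName", "vernacularName", "lifeStage", "eventTime"}

# Hypothetical namespace for the extension (principle 6.2.9).
NAMESPACE = "http://example.org/agrobiodiversity/terms/"


def to_lower_camel(label: str) -> str:
    """Convert a label such as 'parasite of' to lower camelCase (6.2.8)."""
    words = [w for w in re.split(r"[^A-Za-z0-9]+", label) if w]
    if not words:
        raise ValueError("label contains no words")
    head, tail = words[0].lower(), words[1:]
    return head + "".join(w[0].upper() + w[1:].lower() for w in tail)


def classify(labels):
    """Split labels into DwC-equivalent and non-equivalent names (4.1)."""
    names = [to_lower_camel(label) for label in labels]
    equivalent = [n for n in names if n in DWC_TERMS]
    non_equivalent = [n for n in names if n not in DWC_TERMS]
    return equivalent, non_equivalent


@dataclass(frozen=True)
class TermSheet:
    term_name: str       # 9.1
    definition: str      # 9.2
    qualified_name: str  # 9.4, the term's URI
    datatype: str        # 9.8
    required: bool       # 9.9


def draft_sheet(term_name: str) -> TermSheet:
    """Create a draft sheet; the definition is left for the community to agree on."""
    return TermSheet(term_name, "[definition pending]", NAMESPACE + term_name, "string", False)


equivalent, candidates = classify(["scientific name", "parasite of", "host of"])
sheets = [draft_sheet(name) for name in candidates]
print(equivalent, [s.qualified_name for s in sheets])
```

Leaving the definition as a pending placeholder reflects principle 7: the wording of each definition is meant to be settled with the community of practice rather than fixed by the tool.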

Conclusion
It is still necessary to discuss what would be the best way to relate the metadata extension to the core of DwC terms. We know the principles presented here are broader, and as so, require