Open Metadata Handbook/Metadata Elements

METADATA ELEMENTS

Generally, the first step in metadata creation is to define the community model: what are the things that will be represented in the metadata? These can be things like: (a) resources (books, articles), (b) agents (authors, publishers), (c) carriers (journals, CDs, online), (d) classifications.

The goal is not to identify a finite set of metadata elements, but instead a selection of core elements that can be expanded so that each community can interact with related communities of interest. It is also essential to clearly specify the expansion mechanism. Otherwise, people will be tempted to misuse elements because they are stuck with a set that does not quite fit their needs. In particular, we advise the use of a few key relationships between metadata elements (e.g. broader, narrower), the creation of web-unique identities for metadata terms, and the creation of independent vocabulary lists that can be substituted as needed.

In this section, we will identify the key metadata elements needed for discovery, identification, location, and deduplication of works:

  • Define a small set of metadata options (data elements and serializations) that can be used/adopted by data providers.
  • Allow for differences in granularity of the data, but provide best practices that many data providers should be able to achieve.

Element sets should be adapted to fit requirements for particular materials, business processes and system capabilities. This should be done for every type of work (let's start with literary works):

Legend

  • O - optional
  • MA - mandatory if applicable, but may be legitimately missing
  • M - mandatory
  • R - repeatable
  • NR - not repeatable
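The element lists in this section combine an obligation code (O, MA, M) with a repeatability flag (R, NR), for example `MA/R`. The following is a minimal sketch of how such a code could be read by software. The names `Obligation`, `ElementRule` and `parse_code` are illustrative assumptions and not part of any standard.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Obligation(Enum):
    OPTIONAL = "O"
    MANDATORY_IF_APPLICABLE = "MA"
    MANDATORY = "M"


class ElementRule(NamedTuple):
    obligation: Obligation
    repeatable: Optional[bool]  # None when the list gives no R/NR flag


def parse_code(code: str) -> ElementRule:
    """Parse a legend code such as 'MA/R', 'M/NR' or a bare 'O'."""
    obligation_part, _, repeat_part = code.strip().partition("/")
    obligation = Obligation(obligation_part.strip())  # ValueError if unknown
    repeat_part = repeat_part.strip()
    if repeat_part == "R":
        repeatable = True
    elif repeat_part == "NR":
        repeatable = False
    elif repeat_part == "":
        repeatable = None
    else:
        raise ValueError(f"unknown repeatability flag: {repeat_part!r}")
    return ElementRule(obligation, repeatable)
```

Keeping the repeatability flag as `None` when it is absent preserves the fact that some entries in the lists (e.g. a bare `O`) do not state it.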

Comment by Jim Pitman: I think we should go very light on the M, and just encourage data providers to provide all the information they have, parsed out as best they can. We need to provide some structure for this, but not so much as to be onerous. Rough unparsed bibliographic references are better than none. They can be cleaned up and matched to enhance the metadata by agents other than those who first publish the data. In particular, various forms of entity extraction (people, places, subjects, ...) fall under this.

LITERARY WORKS

Books

  1. creator(s) (one minimum) MA/R [taking into account that some books may be anonymous]
  2. title M/NR [Some books may have multiple titles (cover vs. inside, for instance; or titles in multiple languages). But it's reasonable to distinguish a single main title. The usual library convention is to use the title on the main title page.] - This can be accommodated e.g. in BibJSON by making the title an object, with the main title as its "text" value and other titles given as values of other keys. The main issue is standardizing conventions for the keys of secondary titles.
  3. date MA/NR [Multiple dates are possible -- copyright date, publication date, reprinting date. Again, it's reasonable to pick one "key" date; this is usually the date of publication. However, note that some books have *no* date on them, hence MA.] - Strictly, each date should come with a relation to the book.
  4. editor(s) MA/R -- CG: In library cataloguing standards (ISBD etc.), authors and editors and collaborators and translators etc. (either persons or organizations) all are taken together as the "statement of responsibility": so I would not separate editors from authors in the first instance
  5. publisher O/R
  6. place of publication O/R
  7. no. of pages O/NR [Not actually required to identify most editions, but helpful. Also, many "books" are multi-volume or lack page numbers, making page counts unclear or imprecise. Finally, if the books are digital, there may be no definite concept of pages at all.]
  8. type [We’ll need a list. e.g. bibliography, encyclopedia, ... ] O/R
  9. Identifiers MA/NR - e.g. ISBN [This assumes we're indexing a specific edition. There may be many ISBNs associated with different editions, versions. Ideally, nature of relation should be indicated. In practice, just having the ISBNs is useful for finding and deduping], DOI, etc
  10. Links - e.g. URL if online MA/NR [Many catalogs include an annotation with the URL explaining what it is (free? full text or excerpts? etc.) Not required, but good to have room for such annotations in the schema.] - Strongly recommended to provide either a text anchor hinting at the relation, or a relation value from a controlled vocabulary
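As a rough illustration of the notes on titles, dates, identifiers and links in the list above, a book record might be laid out as follows. This is only a sketch: apart from the "text" key inside the title object, every key name and relation value here is an assumed convention, and all values are placeholders rather than real bibliographic data.

```python
import json

# Hypothetical record layout. Only the "text" key of the title object comes
# from the discussion above; all other key names are assumptions.
book = {
    "type": "book",
    "creator": [{"name": "Example, Author", "role": "author"}],
    "title": {
        "text": "Main Title From the Title Page",
        "cover": "Title As Printed on the Cover",  # assumed key for a secondary title
    },
    # Each date carries its relation to the book.
    "date": [{"value": "1999", "relation": "publication"}],
    # Each identifier says which edition or version it refers to.
    "identifier": [
        {"type": "isbn", "id": "PLACEHOLDER-ISBN", "relation": "this edition"}
    ],
    # Each link has a text anchor and/or a relation value.
    "link": [
        {"url": "https://example.org/book", "anchor": "full text", "relation": "fulltext"}
    ],
}


def main_title(record: dict) -> str:
    """Return the main title whether 'title' is a plain string or an object."""
    title = record["title"]
    return title["text"] if isinstance(title, dict) else title


print(json.dumps(book, indent=2))
```

Accepting both a plain string and an object in `main_title` lets providers with unparsed data contribute without restructuring their titles first.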

Comment by Mathias Schindler: A minimal dataset should be Creator *and* Title *and* at least one of the following: year *OR* ISBN *OR* URL. Any other field is desirable but not a "minimum".
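The minimum proposed in this comment can be expressed as a simple check. The sketch below assumes a flat record with the hypothetical keys `creator`, `title`, `year`, `isbn` and `url`. Note that it implements the comment's proposal, which is stricter than the element list above, where the creator is only MA to allow for anonymous works.

```python
def meets_minimum(record: dict) -> bool:
    """Creator AND title AND at least one of year, ISBN, URL."""

    def present(key: str) -> bool:
        value = record.get(key)
        if isinstance(value, str):
            return bool(value.strip())  # whitespace-only strings count as missing
        return bool(value)

    return (
        present("creator")
        and present("title")
        and any(present(key) for key in ("year", "isbn", "url"))
    )
```

For example, `meets_minimum({"creator": "A. Author", "title": "A Title", "year": 1999})` is true, while a record with only a creator and a title is rejected.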

Book chapters

  1. creator(s) (one minimum) MA/R [taking into account that some books may be anonymous]
  2. title of the chapter M/NR
  3. title of the book M/NR [Some books may have multiple titles (cover vs. inside, for instance; or titles in multiple languages). But it's reasonable to distinguish a single main title. The usual library convention is to use the title on the main title page.] - This can be accommodated e.g. in BibJSON by making the title an object, with the main title as its "text" value and other titles given as values of other keys. The main issue is standardizing conventions for the keys of secondary titles.
  4. date of the book M/NR [Multiple dates are possible -- copyright date, publication date, reprinting date. Again, it's reasonable to pick one "key" date; this is usually the date of publication. However, note that some books have *no* date on them, hence MA.] - Strictly, each date should come with a relation to the book.
  5. editor(s) MA/R
  6. publisher O/NR
  7. place of publication O/NR
  8. no. of pages in book O/NR [This does not seem to be relevant when dealing with chapters. Not actually required to identify most editions, but helpful. Also, many "books" are multi-volume or lack page numbers, making page counts unclear or imprecise. Finally, if the books are digital, there may be no definite concept of pages at all.]
  9. start/end pages of the chapter M/NR
  10. type [ needed? very difficult to provide a list] O/R
  11. Identifiers MA/NR - e.g. ISBN [This assumes we're indexing a specific edition. There may be many ISBNs associated with different editions, versions. Ideally, nature of relation should be indicated. In practice, just having the ISBNs is useful for finding and deduping], DOI, etc
  12. Links - e.g. URL if online MA/NR [Many catalogs include an annotation with the URL explaining what it is (free? full text or excerpts? etc.) Not required, but good to have room for such annotations in the schema.] - Strongly recommended to provide either a text anchor hinting at the relation, or a relation value from a controlled vocabulary

Journal articles

  1. creator(s) (one minimum) MA/R
  2. title M/NR
  3. ISSN or full journal name M (one or the other)/NR
  4. year M
  5. enumeration M/NR [e.g. volume, number, start page / end page (as appropriate) - substitute date if no other issue enumeration is available. Minimum requirements should not require these to be parsed, even though parsing should be strongly recommended. References to journal articles come from many sources, e.g. reference lists, where they are not parsed. Even such a reference may be useful.]
  6. type [May need a list. e.g. research, expository, survey, review, abstract, note, … Such classification is sometimes provided by publishers and/or bibliographic databases] O/R
  7. Identifiers MA/R - e.g. DOI if available
  8. Links M/NR - e.g. URL if online?
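The note on enumeration above suggests accepting unparsed strings while encouraging parsed values. One way to support both is sketched here: the raw string is always kept, and parsed fields are added only when the string matches one common pattern such as `12(3), 45-67`. The pattern and the function name `parse_enumeration` are assumptions for illustration; real reference data would need many more patterns.

```python
import re

# Matches e.g. "12(3), 45-67", "12, 45" or "7:101" (volume, optional number,
# start page, optional end page). This is one assumed pattern among many.
_ENUMERATION = re.compile(
    r"^\s*(?P<volume>\d+)\s*(?:\((?P<number>[^)]+)\))?\s*[,:]\s*"
    r"(?P<start>\d+)\s*(?:[-\u2013]\s*(?P<end>\d+))?\s*$"
)


def parse_enumeration(raw: str) -> dict:
    """Return the raw string plus any fields that could be parsed from it."""
    result = {"raw": raw}
    match = _ENUMERATION.match(raw)
    if match:
        # Keep only the groups that actually matched.
        result.update(
            {key: value for key, value in match.groupdict().items() if value is not None}
        )
    return result
```

A string that does not match, such as `Vol. 5, no. 2, pp. 1-10`, is returned with only its `raw` value, so nothing supplied by the data provider is lost.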


Online texts

(e.g. Wikipedia articles, arXiv eprints, technical reports, working papers)

  1. creator(s) (one minimum) MA/R
  2. title M/R?
  3. URL M/NR?
  4. date accessed O
  5. date created O
  6. date last updated O
  7. format: html/pdf/etc. [we’ll need a short list to choose from] O
  8. type [We’ll need a list. e.g. eprint, techreport, encyclopedia_entry, obituary, news_article, review, abstract, … ] O
  9. links [Data providers should be encouraged to match their data against catalogs provided by others, e.g. WorldCat, Open Library, ... and, if they find a match, to provide a link to it. This greatly assists deduplication of docs/works. This can of course be done by others, but it is helpful if the data provider does it. This is commonly done by publishers of reference lists in academic journals. Other links might be to reviews, commentary, ...]

Enhanced access

(this will be things like keywords, abstracts, tables of contents, links to related resources, etc.) - To be decided...

ISSUES TO DISCUSS

1. Enumeration: Will we need separate elements for volume, issue, etc.? - These are part of the standard language of reference to academic publications. What is the use case for these data elements? (Note, any articles or journals older than about 10 years will not have DOIs or SICIs.) Use cases (especially in case of missing, broken or substandard identifiers of other kinds) will be (a) identification, (b) deduplication, (c) hierarchical indexing and display, (d) ease of indicating the entire scope of a complete collection (e.g. complete list of vols).

2. Resource identifiers: Some identifiers, like DOI, are self-contained (e.g. in URI format). Many are not. We probably do not want to have dozens of identifier fields, so we need a format for the data that can go into a single identifier field, e.g. PMID for PubMed items: PMID:12345. We need a list of recommended identifiers and how to input them - Such a list could and should be easily supported by OKF and/or BKN, along with recommendations for canonical forms (to assist deduping) and services which leverage these identifiers. ISBN is the best option for books. It is widely used after 1970, although some local publications do not have it. Where not available, national bibliography IDs are often used in catalogues, e.g. Library of Congress Catalog ID.
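A single identifier field with a scheme prefix, as in `PMID:12345`, could be handled as sketched below. The shortlist in `KNOWN_SCHEMES` and the function names are assumptions, since the list of recommended identifiers is still to be agreed. As one example of a canonical form to assist deduping, ISBN-10 values are rewritten as 13-digit ISBNs using the standard `978` prefix and recomputed check digit. The sketch does not verify the check digit of the incoming value.

```python
from typing import Tuple

KNOWN_SCHEMES = {"isbn", "issn", "doi", "pmid", "lccn"}  # assumed shortlist


def isbn13(isbn: str) -> str:
    """Return a hyphen-free 13-digit ISBN for an ISBN-10 or ISBN-13 input."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 13 and digits.isdigit():
        return digits
    if len(digits) == 10 and digits[:9].isdigit():
        core = "978" + digits[:9]  # drop the old check digit
        # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    raise ValueError(f"not an ISBN-10 or ISBN-13: {isbn!r}")


def parse_identifier(raw: str) -> Tuple[str, str]:
    """Split 'SCHEME:value' into a lowercased scheme and a canonical value."""
    scheme, separator, value = raw.strip().partition(":")
    scheme = scheme.strip().lower()
    if not separator or scheme not in KNOWN_SCHEMES:
        raise ValueError(f"missing or unknown identifier scheme: {raw!r}")
    value = value.strip()
    if scheme == "isbn":
        value = isbn13(value)
    return scheme, value
```

With this sketch, `parse_identifier("ISBN:0-306-40615-2")` and `parse_identifier("isbn:9780306406157")` yield the same pair, so two records citing the same edition in different forms can be matched.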

3. Entity identifiers: We want to accommodate identifiers for persons, places, and other entities. Placeholders for such ids should be provided. They are not required. Names of persons identified should be referenced with a VIAF or a regional authority file record with a higher priority than spelling suggestions. Priority should be given to authority files which are publicly available under free terms such as CC0.

4. Creator types: Again, a short list (author, editor, reviewer, …). If unknown, the default can be “creator”. Book chapters 'should' require that the editors of the book be named.

5. Update of bibliographic data: we need some idea of how updates will happen before we can discuss identification and versioning of the metadata itself.

6. Links: For each link, we need at a minimum the URL, and preferably a text anchor and an indication of the nature of the relation of the link. These links can refer to either the full text or related works - that is the job of the "relation" field to specify.

7. Indexing and display: The metadata will affect what kind of indexing is possible, and the desire for specific indexing, sorting and display should inform the metadata. For example, Jim's desire to have a view that shows journal/volume/number (probably a sort) requires that volume and number be sortable (e.g. numeric, no "v." etc.). So we need to have a discussion about what we need and what we can reasonably expect. Another issue is field indexing - e.g. the ability to search on specific fields, not just a general keyword index. What fields do we want to be able to search on?
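To illustrate the sorting requirement in item 7, the sketch below derives a numeric sort key from volume and number labels that may still carry prefixes such as "v.". The record keys `journal`, `volume` and `number` are assumed names, and taking the first run of digits is a simplification that will not suit every numbering scheme.

```python
import re
from typing import Tuple


def leading_int(label: str) -> int:
    """First run of digits in a label like 'v. 12' or 'no. 3'; -1 if none."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else -1


def sort_key(article: dict) -> Tuple[str, int, int]:
    """Sort by journal name, then numerically by volume and number."""
    return (
        article.get("journal", "").casefold(),
        leading_int(article.get("volume", "")),
        leading_int(article.get("number", "")),
    )


articles = [
    {"journal": "Example Journal", "volume": "v. 10", "number": "no. 1"},
    {"journal": "Example Journal", "volume": "v. 2", "number": "no. 3"},
    {"journal": "Example Journal", "volume": "v. 2", "number": "no. 1"},
]
ordered = sorted(articles, key=sort_key)
```

A plain string sort would place "v. 10" before "v. 2". Storing volume and number as numbers in the first place, as item 7 suggests, would make this extraction step unnecessary.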