Library revue

Research use of web archived data

Jaroslav Kvasnica, Barbora Rudišinová, Rudolf Kreibich — 2016-12-21T13:35:00Z

Keywords: web archiving, researchers, big data, metadata, data analysis, WARC

Mgr. Jaroslav Kvasnica, Mgr. Barbora Rudišinová, Bc. Rudolf Kreibich / National Library of the Czech Republic, Klementinum 190, 110 00 Praha, Czech Republic

Introduction

Since the creation of the World Wide Web in the 1990s, the internet has steadily become a more and more important source for the study of both recent history and current political, social, and cultural phenomena. The internet is quite different from other types of media and publication platforms: it changes constantly, its content is interconnected by links, and it allows for a plethora of formats (images, videos, applications etc.); last, but not least, its content is created by a diverse set of both individuals and companies. Today, the internet is important not only with regard to both mass and personal communication, but also as a platform for the development of new services. The days when it was accessible only via desktop computers are past; people can be “online” at any hour of the day via mobile devices and freely accessible data services. Whether we talk about political or social campaigns, discussions on online forums, opinions published on personal blogs, or partially public information shared via social networks, our public lives are lived more and more on the internet.

Today, researchers are used to working with materials published online, such as electronic articles, databases, or various types of applications. These materials, however, are not stable. In 2014, researchers from the Harvard Law School carried out a study that focused on hyperlinks in professional law journals. They discovered that over 70 % of these links no longer led to the content that was originally cited (ZITTRAIN 2014).

Since communication and other social activities have moved, to a substantial degree, onto the internet, researchers and scientists are beginning to regard it with a new-found interest, even though it used to be a relatively neglected source in comparison with scientific information. The internet offers an enormous quantity of individual information and sources, but it can also be seen as a group of large sets of data that can be analyzed and used for various research purposes. However, its ever-changing nature poses a problem for those researchers who want to work with older sources, as individual webpages can be moved to another location or updated in various ways, or they can disappear altogether. These researchers then have to turn to web archives which try to archive the unique and dynamic content of the internet and keep it for posterity.

When it comes to archived internet sources, there are certain unique features that distinguish them from online sources. Internet archiving always takes place in real time. If we want to archive the content that is on a given webpage today, we should do it as soon as possible, because it can change or disappear at any moment. At that point, such a webpage becomes irretrievable. Archived internet content is not a mere copy of something that was online sometime in the past; it is a unique version, which, however, contains “blanks” in content.

Internet sources acquisition

In the course of internet archiving, one has to make a number of decisions that determine the form of the content that will be stored in the archive.

From the technological standpoint, we have to choose archiving software. Currently, every available archiving tool has certain technological limitations that determine which types or parts of content will be missing; in other words, no software is able to copy every type of content. The choice of hardware, particularly with regard to storage capacity and computing power, also plays a significant role; it can have a major impact on the final version of the data stored in the archive. Obviously, the storage capacity of the archive determines how many copies can be made and stored, while computing power determines both the speed and the efficiency of the archiving process.

These issues are intertwined with the choices regarding the range of archiving of individual internet sources. The archiving process can be limited either horizontally, or vertically. The horizontal limitation deals with the number of archived linked pages that form a context for the original source – for instance, if we archive an article, we also have to archive the sources that it links to, so that it will be represented in the whole context. The vertical limitation is concerned with the number of digital objects that can be archived per each domain – this is also called “harvest depth”. The platform and the device that we use to view the content are other technical factors that can have an impact on the form of the archived internet content. The final version of the content will vary according to which device, browser, or application we use.

Web archives are trying to keep up with technological innovations that concern the internet – dynamic webpage features, JavaScript, links to social networks, integrated videos, and various applications, to name just a few. However, the technological development is fast and relentless, and so web archives usually need some time to adapt and catch up, which once again translates into content imperfections whenever we try to turn the original online version into an archived website (BRÜGGER and FINNEMANN 2013).

From the content standpoint, we have to discuss the definition of territoriality. With regards to the European internet archives, which are usually parts of national libraries, we typically see the so-called “broad archiving” that distinguishes between webpages on the basis of their national domains (“.cz”, for instance) and that defines territoriality through technical parameters. However, this method will miss all national sources on international domains or domains of other nations (such as “.eu” or “.org”), and so it will not yield the complete national internet content (in the Czech Republic, for instance, this would concern the content regarding the Czech Studies).
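The limits of a purely technical definition of territoriality can be illustrated with a minimal sketch. It assumes that scope is decided only by the top-level domain of the host; the function name and the two non-“.cz” seed URLs are hypothetical and serve only as an illustration.

```python
from urllib.parse import urlsplit

def in_national_scope(url: str, national_tld: str = "cz") -> bool:
    """Return True when the URL's host ends in the national top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.lower().rsplit(".", 1)[-1] == national_tld

seeds = [
    "http://webarchiv.cz/",
    "https://czech-studies.example.org/",  # hypothetical Czech-related source
    "https://example.eu/cs/",              # hypothetical Czech-language source
]

# Only the ".cz" host passes the filter; the other two are missed by a broad harvest.
in_scope = [url for url in seeds if in_national_scope(url)]
print(in_scope)
```

Sources of the second and third kind can only be added to the archive by another route, for example by manual selection.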

In general, the type of the harvest determines the types of sources that will be archived. We have already mentioned a “broad harvest”¹ that represents the most general and broadest type of archiving. It does not include manual selection. A “selective harvest” is carried out by curators who select and evaluate sources based on agreed-upon criteria. In most cases, selective harvests are smaller, but the archived internet sources are harvested in more depth.

The archived web content is not an identical copy of the original online web source, but rather newly created content that mirrors the original online one.

A web archive is multidimensional with regard to both time and space. Unlike an online webpage, which at any given time features just one version of the content, a web archive contains many versions of a given internet source harvested at different points in time. Harvesting certain large internet sources can take a long time, and so the content may change during the archiving process (this is typical of online newspaper sites with a lot of content, because they frequently add new articles or update old ones); this might result in an archived version that is basically a mosaic of various parts of the content (and varied in terms of range, too). All of this complicates things for researchers, of course; therefore, creating metadata descriptions of the data is a key part of the process. Metadata descriptions allow researchers to find and define the set of data that is most relevant to their research.

The WARC container format

In order to store archived copies of webpages, web archives use specialized container formats that allow them to combine multiple fragments of harvested webpage content into an aggregate archive file. This solution was chosen mainly because most webpages include a veritable legion of small files that make any digital manipulation much more difficult. Typically, a webpage includes hundreds or thousands of these small files – scripts, images, videos etc.

Container formats combine several digital sources into a single aggregate archive file, which also contains relevant information (BAILEY and LACALLE, 2015). These formats are designed in such a way that all archived files can be stored and moved as one aggregate file; however, they also keep all metadata, and so all archived files are still intertwined in the same way in which they were linked before. This method then allows us to use the fragments to reconstruct the webpage and present it to the user in its original form. For researchers who want to use the data from the archive for their studies, the metadata represent a chance to process the archive without having to work with the enormous amount of data stored in it.

The container format storage principle for both the data and the metadata is simple: each data object is preceded by a header with metadata. In this context, data objects might consist of individual files harvested from webpages or of metadata records. There are eight types of headers; the list of types and their uses is defined in the ISO standard “Information and documentation – WARC file format” (ISO 28500:2009).

“warcinfo” – The “warcinfo” header is used to describe the whole content that follows this header and that is delimited by another occurrence of the “warcinfo” header. Usually, it is used to describe the whole container.
“request” – This header introduces the data object that holds a full HTTP request sent to the server. Its payload contains the original request for internet content sent by the crawler² to the target server.
“resource” – This header is followed by the content harvested from the server, for instance, an HTML file or an image.
“response” – This header is used to describe a server response to a harvest request. This response is included and stored here. If the request was successful (the server responded, the URL contains data), this header is followed by the web content itself.
“metadata” – This header is used for a metadata description that cannot be associated with any other header.
“revisit” – This header is used for content that was archived earlier; today, a record marked as “revisit” is mainly used to refer the user to duplicates.
“conversion” – This header is used for alternate versions of the content, created during a format change, for instance.
“continuation” – This header is used for segmented data objects that would otherwise exceed the size limitations of the container.
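The eight header types listed above can be used as a simple controlled vocabulary when processing a container. The following minimal sketch tallies the types found in a list of WARC-Type values; the function name and the sample values (including the invalid “screenshot”) are made up for illustration.

```python
from collections import Counter

# The eight header types described in the list above.
WARC_RECORD_TYPES = frozenset({
    "warcinfo", "request", "resource", "response",
    "metadata", "revisit", "conversion", "continuation",
})

def tally_record_types(type_values):
    """Count records per header type; anything outside the list counts as 'unknown'."""
    counts = Counter()
    for value in type_values:
        key = value if value in WARC_RECORD_TYPES else "unknown"
        counts[key] += 1
    return counts

# Hypothetical sequence of WARC-Type values read from one container.
sample = ["warcinfo", "request", "response", "request", "response", "revisit", "screenshot"]
counts = tally_record_types(sample)
print(dict(counts))
```

A tally like this gives a quick overview of a container, for example how many records are duplicates marked as “revisit”.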

Header example

The following example includes a header of the “request” type; it was a part of a test harvest carried out for the purposes of this article. Every header type shares a certain number of obligatory elements. These elements contain information about the type of the content (WARC-Type), the date and time of its creation (WARC-Date), and its size (Content-Length), as well as a unique identifier (WARC-Record-ID) that can be used to refer to the content.

Apart from these obligatory elements, the header also includes the URL that was the target of the request (WARC-Target-URI) and a control hash for the payload (WARC-Block-Digest), which can then be used to verify the authenticity of the content or ascertain any damage to it.

The payload itself begins on the line “GET / HTTP/1.1”. This line tells us that the crawler sent a GET request via the HTTP 1.1 protocol, which is a standard request for obtaining information in the internet environment.

An example of a “request” header and its payload:

WARC/1.0
WARC-Type: request
WARC-Record-ID:
WARC-Date: 2016-08-03T12:18:35Z
WARC-Target-URI: http://webarchiv.cz/
WARC-Concurrent-To:
WARC-Block-Digest: sha1:a514938cef4b9c3b3a88403c4ccdedd3863a74db
Content-Type: application/http;msgtype=request
Content-Length: 451

GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
(…)
Host: webarchiv.cz
Accept-language: cs-CZ,cs;q=0.8,sk;q=0.6

All the content stored in the WARC containers is structured in the way shown in the example. Whatever the type of the internet content, it is always preceded by a record of communication with the server, including any potential faulty requests and negative server responses. Originally, these metadata were created specifically for web archives. They allow us to go back in time and discover potential errors, as well as the time of their occurrences and their causes, which can then tell us why the webpage was not stored correctly. Because the stored information includes the date of creation and the size of the metadata, and even the communication between the client and the server, these metadata can be used by researchers who want to work with web archives.
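The header-then-payload structure, and the use of WARC-Block-Digest to detect damage, can be sketched in a few lines. This is a minimal illustration, not a complete WARC reader: it assumes CRLF line endings, a single uncompressed record, and a hexadecimal digest as in the example (digests in real containers are often Base32-encoded). The record identifier and the shortened payload are hypothetical.

```python
import hashlib

def parse_warc_record(raw: bytes):
    """Split one WARC record into (version line, header dict, block bytes)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return lines[0], headers, rest[:length]

def block_digest_matches(headers, block: bytes) -> bool:
    """Recompute the digest named in WARC-Block-Digest and compare (hex assumed)."""
    algorithm, _, expected = headers["WARC-Block-Digest"].partition(":")
    return hashlib.new(algorithm, block).hexdigest() == expected

# A shortened, hypothetical "request" record modelled on the example above.
block = b"GET / HTTP/1.1\r\nHost: webarchiv.cz\r\n\r\n"
header = (
    "WARC/1.0\r\n"
    "WARC-Type: request\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"  # placeholder
    "WARC-Date: 2016-08-03T12:18:35Z\r\n"
    "WARC-Target-URI: http://webarchiv.cz/\r\n"
    f"WARC-Block-Digest: sha1:{hashlib.sha1(block).hexdigest()}\r\n"
    "Content-Type: application/http;msgtype=request\r\n"
    f"Content-Length: {len(block)}\r\n"
    "\r\n"
).encode("utf-8")
raw = header + block + b"\r\n\r\n"

version, headers, payload = parse_warc_record(raw)
print(version, headers["WARC-Type"], headers["WARC-Date"], block_digest_matches(headers, payload))
```

Researchers who only need dates, sizes, and target addresses can stop after reading the header dictionary and never touch the payload.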

Datasets for researchers

The data in web archives have one significant disadvantage – they are too “big”. Researchers typically do not have access to devices with enough computing power to process such volumes of data; quite often, they cannot even extract the data from the archive. To solve this issue, web archives generate special derived datasets that are designed for specific groups of researchers and research purposes – a linguistic analysis of the text does not include audiovisual materials, a hypertext analysis requires only metadata etc.

Datasets can be generated both from the data itself and the metadata. These sets represent the part that is necessary to fully saturate specific information needs of a given user. Users can download these sets and work with them on ordinary devices (using the tools that are available for free).
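The idea of a derived dataset can be illustrated with a minimal sketch that keeps only textual records for a linguistic analysis. The record layout (the keys “url”, “mime”, and “size”), the two non-root URLs, and the sizes are assumptions made for this illustration and do not describe any particular archive.

```python
TEXT_TYPES = ("text/html", "text/plain")

def select_text_records(records):
    """Keep only records whose declared media type is textual."""
    return [
        record for record in records
        if record["mime"].split(";")[0].strip().lower() in TEXT_TYPES
    ]

# Hypothetical metadata records; sizes are in bytes.
records = [
    {"url": "http://webarchiv.cz/", "mime": "text/html; charset=utf-8", "size": 20480},
    {"url": "http://webarchiv.cz/video.mp4", "mime": "video/mp4", "size": 52428800},
    {"url": "http://webarchiv.cz/robots.txt", "mime": "text/plain", "size": 120},
]

subset = select_text_records(records)
share = sum(r["size"] for r in subset) / sum(r["size"] for r in records)
print(len(subset), f"{share:.4%} of the original volume")
```

Even in this toy example, dropping the audiovisual record removes almost all of the volume, which is the reason derived datasets can be handled on ordinary devices.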

The Internet Archive defines three basic datasets that provide a basic insight into web archives, and which also show how to define and approach them. As these datasets’ specifications are freely available and as they can be generated, modified, and used via freely-available tools, we can consider them to be provisional standards for any web archive research.

A basic dataset for a research concerning any web archive data

“WAT” is an acronym that stands for Web Archive Transformation. The WAT dataset includes basic information about stored digital objects, such as the date of their harvest, their size, or the format that the HTTP server communicated to the harvester. Apart from this, the WAT dataset also contains information gathered by harvesting the digital objects themselves – the name of the author, the name of the document provided in the metadata part of the source, or all links (URLs) found in the digital object.

There is a freely-available tool called archive-metadata-extractor, which is used to generate the WAT datasets. This tool saves the metadata and the digital object itself, and it also extracts available information, for instance from the HTML metadata header or the body of the document. With a certain amount of simplification, we could argue that the WAT set is a WARC archive container without the archived digital content. The result is a text file that contains metadata organized into structures, which makes it quite small – a WAT file is between 5 % and 20 % the size of the original WARC file (BAILEY 2016a).

Due to their simplicity, relatively small size, and clearly-defined structure, the WAT datasets are very useful for researchers, as they can work with them with ease and via readily-available tools.

An example of a part of a WAT dataset:

{
  "Envelope": {
    "Format": "WARC",
    "Payload-Metadata": {},
    "WARC-Header-Length": "298",
    "WARC-Header-Metadata": {}
  }
}
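Because a WAT record is plain JSON, it can be read with standard tools. The following minimal sketch extracts the target address, the harvest date, the document title, and the outgoing links from one record. The keys nested inside “WARC-Header-Metadata” and “Payload-Metadata” are assumptions modelled on commonly published WAT files, and the sample values are hypothetical.

```python
import json

# Hypothetical WAT record; only the "Envelope" skeleton comes from the example above.
wat_record = json.loads("""
{
  "Envelope": {
    "Format": "WARC",
    "WARC-Header-Length": "298",
    "WARC-Header-Metadata": {
      "WARC-Type": "response",
      "WARC-Target-URI": "http://webarchiv.cz/",
      "WARC-Date": "2016-08-03T12:18:35Z"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Head": {"Title": "Webarchiv"},
          "Links": [
            {"path": "A@/href", "url": "http://webarchiv.cz/en/"},
            {"path": "IMG@/src", "url": "/img/logo.png"}
          ]
        }
      }
    }
  }
}
""")

def summarize_wat(record):
    """Pull a few commonly needed fields out of one WAT record (assumed layout)."""
    envelope = record["Envelope"]
    meta = envelope.get("WARC-Header-Metadata", {})
    html = (envelope.get("Payload-Metadata", {})
                    .get("HTTP-Response-Metadata", {})
                    .get("HTML-Metadata", {}))
    return {
        "uri": meta.get("WARC-Target-URI"),
        "date": meta.get("WARC-Date"),
        "title": html.get("Head", {}).get("Title"),
        "links": [link["url"] for link in html.get("Links", []) if "url" in link],
    }

summary = summarize_wat(wat_record)
print(summary)
```

The chained `.get()` calls return empty defaults, so records without HTML metadata (images, for instance) yield an empty link list instead of an error.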

A dataset for linking activity analysis of archived data from its origin through present

The second dataset specified by the Internet Archive is called Longitudinal Graph Analysis (LGA). It includes all the data required to create a longitudinal graph of all links. This set allows for a longitudinal study of linking activity between domains, both in terms of the archive as a whole and with regards to specific domains. The LGA dataset consists of two files.

Each URL is first written into an ID-Map file, where it is assigned a unique identifier and a so-called SURT³ form of the URL (BRAGG and ODELL 2010). Then, a second file is created, the ID-Graph file, which contains a timestamp, the unique ID of the source URL, and the IDs of all URLs that were linked from that URL.
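The SURT form can be approximated with a short function. This is a simplified sketch under stated assumptions: it lowercases the whole URL, drops the scheme and a leading “www.”, reverses the host labels, and sorts the query parameters. Real SURT implementations handle more cases (ports, user information, percent-encoding), so the function name and its rules should be read as an illustration only.

```python
from urllib.parse import urlsplit

def simple_surt(url: str) -> str:
    """Return a simplified SURT-like key for a URL (illustrative rules only)."""
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host labels so that keys sort by domain hierarchy.
    reversed_host = ",".join(reversed(host.split(".")))
    surt = f"{reversed_host}){parts.path or '/'}"
    if parts.query:
        surt += "?" + "&".join(sorted(parts.query.split("&")))
    return surt

print(simple_surt("https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en"))
print(simple_surt("http://webarchiv.cz/"))
```

Because the most general label comes first, all addresses of one domain end up next to each other when the keys are sorted.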

An example of an LGA dataset record:

ID-Map:
{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}

ID-Graph:
{"timestamp":"20150206180648","id":294870,"outlink_ids":[62596,110007,129599,145417,148627,215031,277534,277535,
277668,277678,277679,737423,737436,737459,737476,737483,
737503,737514,803348,803349,803373,803374,803490,803565,
803586,803590]}
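The two LGA files are meant to be joined on the identifier. The following minimal sketch loads an ID-Map into a dictionary and turns ID-Graph records into dated source-to-target edges. The field names follow the example record; the sample URLs and identifiers are hypothetical.

```python
import json

# Hypothetical ID-Map and ID-Graph lines using the field names from the example.
id_map_lines = [
    '{"url":"http://a.example/","surt_url":"example,a)/","id":1}',
    '{"url":"http://b.example/","surt_url":"example,b)/","id":2}',
    '{"url":"http://c.example/page","surt_url":"example,c)/page","id":3}',
]
id_graph_lines = [
    '{"timestamp":"20150206180648","id":1,"outlink_ids":[2,3,99]}',
]

def load_id_map(lines):
    """Map each numeric identifier to its URL."""
    return {record["id"]: record["url"] for record in map(json.loads, lines)}

def resolve_edges(graph_lines, id_to_url):
    """Yield (timestamp, source URL, target URL); unknown IDs resolve to None."""
    for line in graph_lines:
        record = json.loads(line)
        source = id_to_url.get(record["id"])
        for target_id in record["outlink_ids"]:
            yield record["timestamp"], source, id_to_url.get(target_id)

id_to_url = load_id_map(id_map_lines)
edges = list(resolve_edges(id_graph_lines, id_to_url))
for edge in edges:
    print(edge)
```

Grouping the resulting edges by timestamp gives the longitudinal view of linking activity that the dataset is designed for.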

A dataset with named entities

“WANE” is an acronym that stands for Web Archive Named Entities. This dataset includes a list of individuals, places, and organizations mentioned in the digital object. These entities are extracted via the Stanford Named Entity Recognizer tool (Stanford NER for short). The resulting file contains data structured in a similar way to a WAT dataset; the WANE files are usually about 1 % of the size of the analyzed data.

Creating the WAT and LGA datasets is relatively easy – you just have to extract the information from the structured part of the source data or correctly recognize URLs in the document. The WANE dataset, however, is a little more complicated – you need a classifier that can recognize individual entities in non-structured texts, and then put them into correct categories. The Stanford NER tool typically includes several classifiers that can recognize entities in English and German texts. Some classifiers can even search for entities regarding time, currencies etc.

Unfortunately, when we attempt to classify a Czech text in this way, the quality of the results is unacceptably low. It is therefore necessary to use a suitable classifier to categorize documents into language groups, and then use appropriate special classifiers designed for given languages.

For Czech, the optimal tool seems to be the NameTag tool, which uses the Czech Named Entity Corpus 2.0 (CNEC 2.0). It achieved results that equal those of English-based classifiers (STRAKOVÁ et al. 2013). CNEC 2.0 includes a rich classification of Czech expressions; it can also distinguish between various versions of identifications of time, between first and last names, and also between ordinary and mythological names. Furthermore, it can recognize entities such as movies, books, standards, products, measurement units, and currencies. It can distinguish between public, government, and private institutions. It is able to identify various radio and television stations, e-mail addresses, and links. Geographical information is automatically divided into categories – continents, streets, rivers, etc. This classifier can also recognize addresses and phone numbers. Unlike a standard WANE record, however, the resulting CNEC 2.0 dataset grows both in bit size and in complexity.

At the moment, the WANE format defines just places, people, and institutions, and so the best course of action seems to be a modification of the WANE record structure itself, which would create a new level of structuring and mirror the taxonomy of CNEC 2.0. One question remains, however: should we call the resulting dataset “WANE”, or should the ontology of the entities be mentioned in the name of the new file (“cnec2-wane”, for instance)? This question should be discussed within the IIPC consortium, and it would be a good idea to establish a unified way of naming the sets with named entities.

An example of a geographical record in WANE:

"locations":["Miami","Virginia","Fort Lauderdale","Wash.","Va.","Blacksburg","St. Louis","Clayton","Fla.","Chapel Hill","Michigan State","Austin","North Carolina","Michigan"],

An expected record in WANE from the CNEC 2.0 classifier:

"Geographical_names":{"streets, squares":["Václavské náměstí","Kaprova"],"castles/chateaux":["Hradčany","Karlštejn"]}
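The relation between the richer, nested record and the current flat WANE structure can be sketched as a simple mapping. In this minimal illustration, only “Geographical_names” and “locations” come from the examples in this section; the other category names, the mapping table, and the function name are assumptions.

```python
# Hypothetical mapping from top-level CNEC-style categories to flat WANE keys.
CNEC_TO_WANE = {
    "Geographical_names": "locations",
    "Personal_names": "persons",        # assumed category and key
    "Institutions": "organizations",    # assumed category and key
}

def collapse_to_wane(nested):
    """Flatten nested category -> subcategory -> names into flat WANE-style lists."""
    flat = {}
    for top_category, subcategories in nested.items():
        wane_key = CNEC_TO_WANE.get(top_category)
        if wane_key is None:
            continue  # categories with no WANE counterpart are dropped
        names = flat.setdefault(wane_key, [])
        for entities in subcategories.values():
            names.extend(entities)
    return flat

nested_record = {
    "Geographical_names": {
        "streets, squares": ["Václavské náměstí", "Kaprova"],
        "castles/chateaux": ["Hradčany", "Karlštejn"],
    },
    "Products": {"books": ["Babička"]},  # hypothetical; no WANE counterpart
}

flat_record = collapse_to_wane(nested_record)
print(flat_record)
```

The conversion is lossy in one direction only: the nested record can always be reduced to the flat form, but the subcategories and the additional entity types cannot be recovered from it, which is the argument for extending the record structure.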

These three datasets are by no means sufficient; they cannot possibly satisfy all the information needs that potential researchers might have. They do provide a basic insight into web archives, however, and they also present potential uses of web archive data and serve as examples of what such datasets should look like.

Researchers and their needs

As researchers place ever more demanding requirements upon web archives and want full access to the archived data itself rather than standard user access, web archives have had to try to satisfy these particular needs. These attempts and discussions, however, were hindered by various issues: problems with the form of the data sent, with hardware and software requirements, and with computer literacy requirements.

As it turns out, the main problem is the fact that researchers are often unable to define which part of a given web archive they are interested in. They usually want all the data, even in cases where it is clear that they will not need it (BAILEY 2016b). This leads to another problem: most researchers lack the necessary equipment and do not have the capacity to receive data that can easily amount to dozens or hundreds of terabytes. However, not all the blame can be placed on them, as web archives usually cannot offer their data in a way that would allow for easy use and manipulation. Simply put, the data is not designed to be worked with in this manner, and on top of that, web archives do not have tools that would allow users to do these kinds of things.

Today, web archives focus on improving the communication with researchers and on mutual collaboration, on the development of new technologies and on building sufficient local computer capacities, and also on workshops focused on the development of skills necessary for any work with datasets (TRUMAN 2016). However, without the knowledge of potential users and their needs, web archives cannot develop suitable tools. Fortunately, some are trying to improve the situation with studies on this very topic. Jefferson Bailey from the Internet Archive, for instance, tried to define a classification of users’ interests. Creators of web archives can use it to have at least some idea about what researchers are interested in, and to begin working on tools and ways to make the sought-after data available.

Classification of the users’ interests (BAILEY 2016b):

Documentary – researchers interested in the classification or registration of webpages. This category also includes studies focused on the legality of the content on webpages etc.;
Social and political scientists – sociological and political studies. Work with government or open data;
Web science – studies on internet technologies and protocols;
Digital humanities – this category includes modern historians and other researchers from humanities disciplines that work with digital data;
Computer science – information retrieval, data processing and indexing, infrastructure and tools;
Data analysts – various activities – data mining, language processing, and trend analysis.

This classification can give us a basic insight into possible avenues of research concerning web archive data. Researchers from different fields have different requirements: they need different datasets, but they can also have in mind specific ways in which they want to access them, or they want them in specific formats. When it comes to sending the data to researchers, we have to consider factors such as the kinds of tools usually employed in a given field and the general computer literacy of its experts. For instance, sociologists use different tools than IT experts and we can assume that their computer literacy will also differ. While IT experts can write their own scripts that will help them deal with the data, researchers from humanities disciplines might need specific tools that will not require any such advanced knowledge.

Studies of researchers’ needs

The topics of researchers’ uses of web archives and their information needs (with regards to web archives) are not wholly unexplored; several such studies have already been carried out. For the purposes of this article, we have selected two recent studies that were initiated by members of the IIPC consortium (the Czech web archive is also a member). Cooperating web archives used these articles to formulate further steps; we will summarize the relevant conclusions.

Peter Stirling, Philippe Chevallier and Gildas Illien (2012) wrote an article summarizing a French study carried out by the National Library of France. It was a qualitative study carried out in 2010 and 2011 and it focused on three groups of users, including researchers. The researchers’ group consisted of five experts from different fields (history, philosophy, sociology, and information technologies).
Apart from the usual practice of using the information from articles published online and the issues concerning the citations (articles published on the internet are often moved, URL addresses cease to function etc.), researchers also stated the need to use internet sources to evidence or illustrate certain sociological or historical phenomena, which is complicated by the fact that you have to explain your reasons for selecting and using these webpages and document them in your scientific work. In the scientific community, data obtained from internet sources are still considered unreliable and short-lived. Researchers touched on this particular topic by noting that they had to harvest the sources they were interested in quite often.

With regard to web archiving, the researchers raised several important issues: the importance of archiving transitory and ever-changing data, the importance of the type of the content (they mentioned the need to archive blogs, for instance), the legitimacy of the use of data obtained from archived electronic sources and related ethical concerns (such as the use of personal information), and the way we see the web archiving process as a whole, particularly with regard to its current dynamic character.
Importantly, the researchers concurred with the current approach of the National Library of France and agreed with its strategy consisting of a combination of automated broad harvests (e.g. national domains) and small selective harvests. However, they also expressed a hope that in the future, the archiving process will react to new internet trends (popular webpages, social network content etc.) more swiftly. They requested that the criteria for the selection of the sources for the web archive be made public and documented. Last, but not least, the researchers emphasized the necessity of general collaboration and of working with the scientific community on issues concerning the creation of methodology and the selection of archived sources.

One of the most recent studies that focused on the general awareness of web archives and their use by academia was the study carried out in 2015 by the National Library of New Zealand (RILEY and CROOKSTON 2015). This project tried to provide a better understanding of potential scientific uses of web archives and to determine how the web archive in question should be developed further. The study showed that only 39 % of the researchers (113 out of 290) were aware of the existence of international web archiving initiatives, and only 23 % knew that the National Library of New Zealand archives national internet sources.
The researchers from the study would prefer full-text searches, while searching for the requested source by its URL in a library catalogue was their least favorite way of looking through the archive. Regarding the datasets on offer and their potential research use, the researchers were unsure whether the data would be useful for them (44 % of them would use the data in their research, while 38 % could not decide). A part of the study focused on the use of web archives and archived sources in the classroom, which was surprisingly frequent (34 % of those who work with foreign web archives also use them when teaching). The most frequent ways to use archived sources are direct quotation (via a link), illustrating something with a screenshot, and use as an information source for the students.
When asked about the value of the content, 66 % of the researchers said that government webpages are the most significant research source. Many researchers also stated that archiving the content of social networks would be beneficial to their work – most of them agreed that the most important social media are video-sharing networks (YouTube, Vimeo, etc.) and discussion forums. The majority (77 %) also agreed that it is important to archive national internet sources. Half of them (51 %) said that the data from the web archive will be useful to them in the next five years, which is in contrast to the 2012 study of the National Library of France that found out that the researchers did not have ready uses for web archives data and they generally assumed that the data would be used by their students (STIRLING et al. 2012).

A summary of the conclusions of both studies:

The scientific community still sees data acquired from the internet as unreliable and transitory.
Many researchers are not aware of web archives; they do not know how to use the data for their research.
However, most researchers agree that with regards to research, web archiving is important and useful.
Researchers prefer more traditional forms of accessing the archived data (looking through individual sources).
They strongly request full-text search tools.
In most cases, researchers are interested in datasets, even though they do not know how to use them afterwards and they are not sure about the size and content limitations of the sample (they usually feel that they should gather as much data as possible).
They expressed an interest in the archiving of social networks.
There are certain legislative and ethical issues concerning the use of electronic materials from web archives that need to be solved.

Conclusion

Until recently, researchers were not interested in web archives, even though these platforms currently represent the most significant, the largest, and the fastest growing part of our cultural heritage. However, when the public interest in the archived internet content grew, the archives found out that they do not know how to collaborate and communicate with researchers in a way that would be beneficial to both sides.

Conclusions from the studies of researchers’ needs showed some of their requirements for web archives. If we want to have better cooperation between the scientific community and web archives, the archives should focus on the following:

promoting themselves and establishing cooperation (researchers are not aware of the archives);
explaining the reasons behind the archiving process, as well as its aims, and emphasizing the significance and relevance of the data from web archives (researchers still prefer “classic” types of sources);
documenting archive collections, creating standards for descriptions, and creating high-quality descriptions and metadata that would explain what is in a given collection, and why (it is not a complete copy of the internet or the original live webpage, some parts are missing);
creating short annotations for individual collections – any scientific work is easier if a researcher knows what data he or she actually works with;
attempting to allow for full-text archive searches;
creating and providing datasets – even though researchers feel that they need as much data as they can possibly get, it is important to provide them with suitable tools, prepare smaller thematic collections for them, give them data delimited by time etc. (the documentation is necessary here – why and how were the sources selected, defined metadata and formats etc.);
attempting to archive less common types of sources (various multimedia types, for instance) and social networks;
solving and clearly defining authorizations and requirements for the use of the data from web archives (with an emphasis on the protection of personal data);
working with other web archives to achieve a productive collaboration (the internet is international, so any research based on this data will have international range and ramifications);
asking researchers for feedback on the collections creation process.

If we, as a society, want to keep our ability to write and study our own history, we will, without a doubt, need web archives. And web archives have to do everything in their power to rise to the challenge. Researchers from humanities disciplines must learn to work with digital materials; web archives must learn to work with other fields of study.

¹ Harvests are processes of automated downloading and acquisition of data from selected internet sources (the creation of copies).
² Crawler is a tool used for automated downloading and acquisition of data. Crawlers are used for harvests.
³ SURT is an acronym for Sort-friendly URI Reordering Transform, which is a transformation applied to a URI that allows the representation of this URI to better reflect the natural hierarchy of domain names.

This article was made possible by the support given by the Ministry of Culture of the Czech Republic to the further development of the National Library of the Czech Republic in its capacity as a research institution.

Bibliography:

BAILEY, Jefferson, 2016a. WAT Overview and Technical Details [online]. 2016 [2016-08-10]. Retrieved from: https://webarchive.jira.com/wiki/display/ARS/WAT+Overview+and+Technical+Details

BAILEY, Jefferson, 2016b. Program Models for Research Services [online]. 2016 [2016-08-10]. Retrieved from: http://netpreserve.org/sites/default/files/WAC02_HEKLA_Jefferson_Bailey.pdf

BAILEY, Jefferson, and Marie LACALLE. Don't WARC Away: Preservation Metadata & Web Archives [online]. 2015 [2016-08-10]. Retrieved from: http://connect.ala.org/files/2015-06-27_ALCTS_PARS_PMIG_web_archives.pdf

BRAGG, Molly, and Kate ODELL. SURT Rules [online]. 2010 [2016-08-10]. Retrieved from: https://webarchive.jira.com/wiki/display/ARIH/SURT+Rules

BRÜGGER, Niels, and Niels Ole FINNEMANN. The Web and digital humanities: Theoretical and methodological concerns. Journal of Broadcasting & Electronic Media [online]. 2013, pp. 66-80 [2016-04-28]. ISSN 1550-6878. Retrieved from: http://dx.doi.org/10.1080/08838151.2012.761699

ISO 28500:2009: Information and documentation -- WARC file format. London, 2009.

RILEY, Harriet, and Mark CROOKSTON. Awareness and use of the New Zealand web archive: a survey of New Zealand academics [online]. National Library of New Zealand, 2015 [2016-04-28]. Retrieved from: http://apo.org.au/node/58430

STRAKOVÁ, Jana, Milan STRAKA, and Jan HAJIČ. A New State-of-The-Art Czech Named Entity Recognizer [online]. P. 68 [2016-08-10]. DOI: 10.1007/978-3-642-40585-3_10. Retrieved from: http://link.springer.com/10.1007/978-3-642-40585-3_10

STIRLING, Peter, Philippe CHEVALLIER, and Gildas ILLIEN. Web Archives for Researchers: Representations, Expectations and Potential Uses. D-Lib Magazine [online]. 2012, 18(3/4) [2016-04-28]. DOI: 10.1045/march2012-stirling. ISSN 1082-9873. Retrieved from: http://www.dlib.org/dlib/march12/stirling/03stirling.html

TRUMAN, Gail. Web Archiving Environmental Scan: Harvard Library Report. Digital Access to Scholarship at Harvard [online]. USA: Harvard Library, 2016 [2016-08-10]. Retrieved from: http://nrs.harvard.edu/urn-3:HUL.InstRepos:25658314

ZITTRAIN, Jonathan, Kendra ALBERT, and Lawrence LESSIG. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management [online]. 2014, 14(02), 88-99 [2016-07-25]. DOI: 10.1017/S1472669614000255. ISSN 1472-6696. Retrieved from: http://www.journals.cambridge.org/abstract_S1472669614000255

Watermark research in the music sources registered in the Union Music Catalogue of the National Library of the Czech Republic

Eliška Šedivá — 2016-12-21T13:35:00Z

Keywords: watermarks, paper, paper mills, papermakers, databases, catalogues, musical sources, Union Catalogue of Music, RISM, Répertoire International des Sources Musicales, František Zuman

Mgr. Eliška Šedivá / Music Department, National Library of the Czech Republic, Klementinum 190, Praha 1, Czech Republic

Introduction

The employees of the Music Department at the National Library of the Czech Republic commenced their watermark research^[1] in the music sources recorded in the Union Music Catalogue of the National Library of the Czech Republic in association with the RISM (Répertoire International des Sources Musicales) international register of music sources in 2014.^[2] Even though the recording of watermarks (mainly in music manuscript sources) had already been undertaken to a certain extent within the framework of the creation of the Union Music Catalogue, the watermarks had not been systematically classified at that time, which would have enabled the identification of individual characteristics, the more reliable localisation of the used paper or the designation of the age of any undated documents. The idea of creating a special watermark catalogue within the framework of the Union Music Catalogue proved to be unrealistic at the beginning of our efforts to systematically compile information on this phenomenon. For this reason, we began to consider connecting this research to the RISM catalogue which provided us with the working environment for the description of the watermarks and enabled a direct link to be established between the research and the contents of the international database of music sources.

The recording of watermarks is undertaken simultaneously with the cataloguing of music sources for the RISM database and the results are published there regularly (ZUMAN 1923). However, this is only the first of a number of outputs from this project. The second output is the Dictionary of Czech Paper Makers, an extract from seven different papers about historical Czech paper mills written by František Zuman, which is gradually being supplemented with discoveries of watermarks from music sources transferred to the RISM international database. The dictionary mainly serves cataloguers as a tool when identifying paper mill owners and paper mills, but in the future we expect it to be expanded and used within the framework of the methodology being prepared for working with watermarks in music collections. The third output is a watermark database, in which we are concentrating all the source findings and the currently known information about Czech watermarks.

František Zuman and his contribution to the history of Czech paper mills

The current research into watermarks would be difficult to undertake without the valuable contribution which the historian of Czech paper making, František Zuman (1870–1955), has left us in a wide range of work focussed on the production of paper and the history of paper mills and watermarks. He is the author of papers covering the extensive period from the 17^th to the mid-19^th century, including a smaller probe into the 16^th century (ZUMAN, 1921a and ZUMAN, 1927a). He wrote general essays (for example, České filigrány XVII., XVIII. a první poloviny XIX. století), as well as essays focussing on a specific territory (for example, the paper mills in the catchment area of the Sázava or Otava Rivers or in the Lower Giant Mountains) or on specific paper mills (Trutnov, Prague). In them, he presented a deep insight into the activities of the paper mills and the lives of their operators. His more detailed work, such as the articles in Památky archeologické or in Zlatá Praha, filled in previously vacant areas in the history of Czech paper mills and provided further starting points for the identification of watermarks.

The Dictionary of Czech Paper Mills

On the basis of Zuman’s extensive research, we compiled the first dictionary of Czech paper mills, which contains most of the individuals identified by Zuman and the places where they were active, albeit that his research was focussed on the study of various forms of documents (documents, registry records, correspondence and other archive materials) and not only sheet music. The entry created for each paper mill owner contains all the forms of his name (as they appear in Zuman’s work and in other sources, for example in the form of the inscription which often formed part of the watermark), the places where he was active in chronological order (a number of paper mill owners were active in two or more places) and the description of the watermarks pertaining to him. In some of his treatises, Zuman also provided negative photographs of watermarks which can be compared with findings in the sources. The entries conclude with citations from the literature, most of which is Zuman’s work.^[3] Even though Zuman did not work with musical documents, the watermarks are consistent to a significant extent in both the archive material and in the sheet music manuscripts. The work with musical sources has expanded our awareness of the production and distribution of (not only) Czech paper and its use in musical institutions and in places such as monasteries, church choirs, music associations, schools or, for example, aristocratic or burghers’ salons. A certain number of music documents contain a date written on the material by the music’s author, the copyist or another person and in exceptional cases it is possible to compile a detailed summary of the paper used in a given locality or in a specific place where music was performed.

This dictionary is currently accessible in electronic form from the website of the National Library of the Czech Republic^[4] and from the homepage of the watermark database^[5], according to which it is regularly supplemented and with which it is retrospectively connected by means of hypertext links.

Fig. 1.

A description of the watermarks

When describing the watermarks, we have used the established English terminology from the RISM international catalogue of musical sources and we have applied it both in the records in this database and in all other outputs (the Dictionary of Czech Paper Mills etc.). In the first phase of recording the watermarks, we used the environment of the RISM multimedia database in the Kallisto program which enables the insertion of a picture file and the inclusion of a short description with it. This internal database is especially used by cataloguers for entering the photographic documentation of sources. However, we used it as a working environment for the collection, description and classification of watermarks. The watermark records can be connected with the appropriate records of music sources as soon as they are created. Instead of a photocopy (which is most commonly used in these cases), we chose a black and white tracing on transparent tracing paper as the “sample” and the illustration of the given watermark. This may differ to a certain extent from the illustrations in other sources, but it essentially involves the same symbol which underwent some deformation over time as a result of the repeated use of the mould. In the interests of completeness, we have also included a photograph of the watermark from a specific source in the RISM record so that it is possible to compare it with the “sample” in the form of the tracing.

As well as the actual watermark, the tracing also includes the chain lines^[6] and the scale, amongst other things. It may also include a view showing the placement of the symbol on a sheet of paper. The watermark tracings depict the maximum amount of detail which can be seen with the human eye, but this always depends on individual observation and the experience of the individual preparing the tracing. Emphasis is placed on the identification of the watermarks and the classification method rather than on the visualisation in the phase of the collection and initial classification of the material. A watermark’s name is created by means of a description of the individual symbols and parts thereof from left to right and from the top down as they are located on the page. The described figures are entered in square brackets, while any initials or inscriptions are listed without brackets. The left and right half-sheets are separated in the name by means of a slash, while the positions of the symbols on the sheet are also noted. Symbols which are located one under the other are divided with a vertical pipe.
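The naming rules described above can be expressed as a small routine. The following Python sketch is only an illustration and not the procedure used in RISM or Kallisto; the function names and the exact spacing around the separators are assumptions made for this example.

```python
from typing import List


def figure(description: str) -> str:
    """Wrap a described figure in square brackets (initials stay unbracketed)."""
    return f"[{description}]"


def half_sheet_name(elements: List[str]) -> str:
    """Join elements located one under the other with a vertical pipe."""
    return " | ".join(elements)


def watermark_name(left: List[str], right: List[str]) -> str:
    """Separate the left and right half-sheets with a slash.

    The spacing around "/" and "|" is an assumption of this sketch.
    """
    return f"{half_sheet_name(left)} / {half_sheet_name(right)}"


# A shortened version of the watermark shown in Fig. 2:
name = watermark_name(
    left=[figure("coat of arms")],
    right=[figure("3 peacock feathers"), "IPM", "G"],
)
print(name)  # [coat of arms] / [3 peacock feathers] | IPM | G
```

Figures are passed through `figure()`, whereas initials and inscriptions are given as plain strings, which mirrors the distinction between bracketed and unbracketed elements in the name.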

Fig. 2. [coat of arms (Austria-Hungary, big, crowned) in cartouche with palm branches around]/ [3 peacock feathers (crowned) (= coat of arms, Harrach)] | IPM | G. An example of description and tracing of watermark by Jan Pavel Margott from Trutnov paper mill. This watermark on music manuscript 59 R 4361 from National Library of the Czech Republic (provenance of Koleč church) is dated 1807. As Zuman states, J. P. Margott worked in Trutnov paper mill between 1786 and 1819. Author of the tracing: Lucie Havránková.

Music collections

The collection, identification and recording of watermarks in music sources take place within the framework of the processing of music collections for the RISM database.^[7] The collections which have been included in this research (and in which the watermarks have also been strictly described during cataloguing) come from various locations and different types of musical institutions. The team of cataloguers does not only work with collections stored at the National Library of the Czech Republic (CZ-Pu), but also, for example, with collections from the National Museum – the Czech Museum of Music (CZ-Pnm).^[8] The collections most commonly contain music manuscripts and printed sheet music from the second half of the 18^th century and the first half of the 19^th century. In summary, as far as the watermarks in music collections are concerned, we were interested in the period approximately between 1750 and 1850, exceptionally with some overlap into the early 18^th century.

The music collections catalogued (including the records of the watermarks) in the Union Music Catalogue and the RISM database:

a) collections with comprehensive watermark records: 1) the music collection from the church in Koleč (in: CZ-Pu): 177 shelfmarks, 91 new watermarks; 2) the music collection from the church in Řetová (in: CZ-Pu): 218 shelfmarks, 78 new watermarks; 3) the music collection from the Želiv monastery (in: CZ-Pnm): 528 shelfmarks, 188 new watermarks; 4) the music collection of Count Christian Philip Clam-Gallas from the château in Frýdlant (in: CZ-Pnm): 1572 shelfmarks, 426 new watermarks;

b) collections with incomplete records of watermarks: 1) the music collection of the Hübner family from Dlouhý Most (in: CZ-Pu); 2) the music collection of the Strachota family from Panenský Týnec (in: CZ-Pu); 3) the music collection of Jan Nepomuk Kaňka from the château in Jetřichovice (in: CZ-Pu); 4) the music collection of the Hospital of the Brothers Hospitaller in Kuks (in: CZ-Pnm).

The records of the watermarks are continuously uploaded to the RISM database via the multimedia database, where they receive an identification number and are subsequently connected to the source records. They are also included in the watermark database under the same name, where information about the paper mills and their owners, source dates or publishers who used the given paper to print music are also gathered. Finally, the information about the watermarks is added to the dictionary of paper mill owners (provided this involves a watermark of Czech origin).

The database of watermarks in the music sources recorded in the Union Music Catalogue of the National Library of the Czech Republic^[9]

The database of the watermarks found in the music sources recorded in the Union Music Catalogue of the National Library of the Czech Republic (known as the Bohemian Watermark Database = BWD) was created in 2016 and it is the main output of this watermark research. The foundation of the database consists of a list of paper mill owners and paper mills in the territory of Bohemia and Moravia which has been compiled on the basis of the legacy of František Zuman (and partially supplemented with information from G. Eineder). As of today, we have records of 726 Czech paper mill owners and 235 paper mills dating from the 16^th century to the end of the 19^th century.

It is possible to search in the database using the name of the watermark or its individual parts or according to the name of the paper mill owner, the paper mill, the country or the publisher (in the case of printed sheet music). There are indexes and a summary of the countermarks available which assist users to become acquainted with the contents of the database. It is possible to search for the individual entries in this selection separately or in combination with others. It is also possible to search for all the watermarks in a specific collection (provenance) and to subsequently arrange the results as needed (for example, alphabetically by paper mill owner, paper mill or country of origin).

The record of a specific watermark contains the name in the header, the identification number, information about the paper mill and the mill owner (if they are known), the source of the tracing (the library siglum and the shelfmark of the document from which the watermark has been traced) and the provenance. Any further information is used to record the archived original tracings.^[10] The dates in the sources and the bibliography conclude this section of the description. There is also a space reserved for notes. It is possible to scroll through the records and to use the active entries to reach other results.
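The record structure and the search options described above can be pictured with a short data model. The following Python sketch is purely illustrative: the field names, the `search` function and the identifiers are hypothetical and do not reproduce the actual implementation of the database, and the watermark names are shortened. The sample values are taken from the captions of Fig. 2 and Fig. 3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WatermarkRecord:
    name: str                       # header of the record
    identifier: str                 # hypothetical placeholder, not a real ID
    mill_owner: Optional[str] = None
    paper_mill: Optional[str] = None
    tracing_source: str = ""        # library siglum and shelfmark
    provenance: str = ""
    source_dates: List[int] = field(default_factory=list)
    notes: str = ""


def search(records, name_part=None, mill_owner=None, paper_mill=None,
           provenance=None, sort_by="name"):
    """Filter by any combination of criteria, then sort the results."""
    hits = []
    for rec in records:
        if name_part and name_part.lower() not in rec.name.lower():
            continue
        if mill_owner and rec.mill_owner != mill_owner:
            continue
        if paper_mill and rec.paper_mill != paper_mill:
            continue
        if provenance and rec.provenance != provenance:
            continue
        hits.append(rec)
    # Unknown (None) values sort first as empty strings.
    return sorted(hits, key=lambda rec: getattr(rec, sort_by) or "")


records = [
    WatermarkRecord("[coat of arms] / [3 peacock feathers] | IPM | G",
                    "example-1", "MARGOTT, Jan Pavel", "Trutnov",
                    "CZ-Pu 59 R 4361", "Koleč, church", [1807]),
    WatermarkRecord("[fleur-de-lis] / WGB", "example-2",
                    "BAYER, Jiří", "Stříbro",
                    "CZ-Pnm 59 R 4495", "Koleč, church"),
]

for rec in search(records, provenance="Koleč, church", sort_by="paper_mill"):
    print(rec.paper_mill, "-", rec.name)
```

Searching by provenance and sorting by paper mill corresponds to listing all the watermarks of one collection and arranging the results as needed.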

It has not been possible to trace a number of watermarks due to their poor legibility or other losses of integrity. If a record does not include a tracing of the watermark, a photograph is included instead, and this will remain the case until we find material from which it is possible to acquire a tracing and to supplement the record.

As the name of the database clearly shows, this involves watermarks in the music sources held in the collections of Czech institutions, but for all that we have not limited ourselves to exclusively Czech watermarks. The composition of the collections has necessarily also led us to work with paper from abroad. We have endeavoured to identify it on the basis of the available literature and digital sources and we have then included the information about this paper in the database.

Table 1. Czech and foreign paper in music collections (1. 7. 2016). The music collections marked * have complete watermark registration. Abbreviations in footnotes. ^[11]

Music collections	A	CH	CZ	D	GB	I	PL	N	F	[n]	total
Koleč, church (CZ-Pu)*	1		46							44	91
Řetová, church (CZ-Pu)*			45							33	78
Želiv, monastery (CZ-Pnm)*	1		96	3		4	1			83	188
Clam-Gallas, Frýdlant (CZ-Pnm)*	8	6	60	27		190		7	11	117	426
Hübner family, Dlouhý Most (CZ-Pu)	1		28							16	45
Strachota family, Panenský Týnec (CZ-Pu)			21							25	46
J. N. Kaňka, Chateau Jetřichovice (CZ-Pu)			12							10	22
The Kinský Library, Prague (CZ-Pn)	1		8	1		2				8	20
National Library of the CR (CZ-Pu)			53	3	1		2	3	1	27	90
Bohemian Watermark Database (1. 7. 2016) = 1006
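The totals in Table 1 can be cross-checked with a few lines of arithmetic. The following Python sketch is illustrative only: the counts are transcribed from the table, while the dictionary layout and the function names are assumptions made for this example.

```python
# Counts transcribed from Table 1; empty cells are simply omitted.
# "[n]" is the last column before the total (watermarks without an
# identified country of origin).
TABLE_1 = {
    "Koleč, church (CZ-Pu)": {"A": 1, "CZ": 46, "[n]": 44},
    "Řetová, church (CZ-Pu)": {"CZ": 45, "[n]": 33},
    "Želiv, monastery (CZ-Pnm)": {"A": 1, "CZ": 96, "D": 3, "I": 4,
                                   "PL": 1, "[n]": 83},
    "Clam-Gallas, Frýdlant (CZ-Pnm)": {"A": 8, "CH": 6, "CZ": 60, "D": 27,
                                        "I": 190, "N": 7, "F": 11, "[n]": 117},
    "Hübner family, Dlouhý Most (CZ-Pu)": {"A": 1, "CZ": 28, "[n]": 16},
    "Strachota family, Panenský Týnec (CZ-Pu)": {"CZ": 21, "[n]": 25},
    "J. N. Kaňka, Chateau Jetřichovice (CZ-Pu)": {"CZ": 12, "[n]": 10},
    "The Kinský Library, Prague (CZ-Pn)": {"A": 1, "CZ": 8, "D": 1,
                                            "I": 2, "[n]": 8},
    "National Library of the CR (CZ-Pu)": {"CZ": 53, "D": 3, "GB": 1, "PL": 2,
                                            "N": 3, "F": 1, "[n]": 27},
}


def row_total(row):
    """Number of watermarks recorded for one collection."""
    return sum(row.values())


def column_totals(table):
    """Number of watermarks per country column across all collections."""
    totals = {}
    for row in table.values():
        for column, count in row.items():
            totals[column] = totals.get(column, 0) + count
    return totals


grand_total = sum(row_total(row) for row in TABLE_1.values())
by_column = column_totals(TABLE_1)
print(grand_total, by_column["CZ"], by_column["[n]"])
```

Summing the rows gives the 1006 records stated for the database on 1. 7. 2016; 369 of them are in the CZ column and 363 in the [n] column.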

Table 1 shows the apparent dominance of Czech paper in the aforementioned Czech collections, while the as yet unidentified watermarks may also be of Czech origin (the same symbols are repeated here, but without any other nominal or place information). The countermarks which most frequently appear in Czech watermarks (see table 2) are: a shield (83 occurrences);^[12] the fleur-de-lis (57) either standing alone or in combination with a shield (various sizes and positions on the sheet have been found); the crowned Hungarian coat-of-arms (41) regularly supplemented with initials or an inscription on the opposite half-sheet; a cartouche (31) as a decorative oval frame for other countermarks; a cross (27), often the patriarchal cross; a star (26) and a crescent (19) most frequently together in a paired watermark; a posthorn (23) designating writing paper which has different properties than the paper which was regularly used to copy out works of music (it is usually finer and of different dimensions); a two-headed eagle (21) usually with a large crown and a shield or coat-of-arms, which may be holding a sword, a sceptre or an orb in different combinations; an anchor (19) either standing alone or placed in a shield - this was one of the most popular symbols among Czech paper mill owners; a mitre (17) which may also stand alone or crown a shield with ecclesiastical insignia (see below); the Bohemian lion (17); a heart (13) or keys (12) most frequently crossed, supplemented with initials or an inscription, etc. We have set out the names of the paper mill owners and the paper mills where we have recorded the most watermarks to date in table 3.

Table 2. Watermarks in the Bohemian Watermark Database (a selection of main countermarks). Some of the unidentified watermarks [un]* may be of Czech origin.

countermarks (BWD)	A	CH	CZ	D	GB	I	PL	N	F	[un]	total
shield	6	5	83	9		26		7	2	83	221
lily (french)		4	57	6		11		6		46	130
coat of arms (Hungarian)			41							19*	60
cartouche	1		31	1		2				18	53
cross			27	4					4	15	50
double-eagle			21							28*	49
heart	1		13			1			3	28	46
posthorn			23					1		21	45
star	3		26			6				9	44
crescent	3		19			11				10	43
anchor	1		19	1		1				11	33
mitre			17							15*	32
keys			12							18	30
Bohemian lion			17							9	26
arrow and bow						18					18
stag			1							8	9
crown			2	1		3				2	8
St. Johannes Nepomuk	2									6	8
unicorn			1							7	8
wall coping			3							3	6
snake			3							2	5

Table 3. Czech paper makers in Bohemian Watermark Database with most watermarks.

Paper makers (CR)	worked	Paper mill	Watermarks
KIESLING family	1800c–1850c	Vrchlabí	35
HELLER, Jan Antonín	1808–1841	Ledeč nad Sázavou	20
KLUSÁČEK, Jakub	1818–1848c	Červená Řečice	12
APPELTAUER, Josef Antonín	1807–1851	Velhartice	10
FÜRTSCH, Jan Jiří snr.	1773–1812	Postřekov	10
BAYER, Ondřej I.	1762–1796c	Stříbro	9
HELLER, Josef Benedikt	1779–1829c	Staré Hory / Ronov	9
HALÍK, Tomáš	1828–1851	Havlíčkův Brod	8
PŘÍHODA, Kristián	1740–1788	Červená Řečice	7
FÜRTSCH, Jan Jiří jnr.	1812–1844	Postřekov	6
SCHÜTZ, Josef I.	1796–1826c	Rádlo	6
WINTERNITZ, Abraham & Emanuel	1825–1865c	Lochovice	5
ZÖH, Petr	1816–1846c	Trutnov	5
MARGOTT, Jan Pavel	1786–1819	Trutnov	4
PLENINGER, Antonín	1819–1851	Kolinec	4
REK, Ludvík	1749–1773c	Ronov	4
RICHTER, Josef III.	1796–1822	Dolní Litvínov	4
RITSCHEL, František I.	1792–1827a	Benešov nad Ploučnicí	4
RITSCHEL, František II.	1816–1840	Svídnice	4
WERNER, Mathias	1765–1795a	Velké Losiny	4

The identification of unknown watermarks

As has already been stated, the recording and detailed identification of the watermarks in the music sources recently commenced at the Music Department of the National Library of the Czech Republic. Four music collections have now been fully processed in the aforementioned manner: the collections from the churches in Koleč and Řetová (in: CZ-Pu), from the Želiv Monastery and from the château in Frýdlant (in: CZ-Pnm). The significance of this for the source research of watermarks is that a) a large amount of comparative material, i.e. watermarks which are already known from the literature (Zuman), is now available in relation to musical sources, i.e. in a new context and with new information (the date or provenance stated by the copyist, the year of publication in the case of printed sheet music, notes made by previous owners, dedications, specific historical events, during which this material was used and even a piece of music written on a certain type of paper are all important pieces of information in this case) and b) there have been extensive additions of as yet unknown countermarks or new combinations of known symbols and supplementary inscriptions identifying the manufacturers of the paper.

In the case of the watermarks which have already been recorded by Zuman (or any other researchers), we have appended the date in the music source and compared it with the period of the paper mill owner’s activities in the given location. Sometimes, we have expanded the timeframe for these activities stated by Zuman and, of course, we have reckoned with a certain overlap in the use of the paper after the completion of the producer’s activities. We have not only come across the natural deformation of watermarks (the emblems, inscriptions and chain lines change their shape insignificantly after each use and the life cycle of the moulds was relatively short), but also the deliberate modification of the countermarks ranging from diverse variations of the same symbols through to the removal and replacement of certain sections. We can give the example of the watermark of Andreas Bayer from Stříbro, which is frequently represented in the collections and which Zuman found in a document dating from 1796 and described thus: “… there is a dimerous watermark from his workshop: the first half-sheet includes a lily, while the countermark on the second is similar to a four with WAB (= ? Andreas Bayer) beneath it.”^[13] This watermark can be found in several variants in the collection of the Strachota family from Panenský Týnec (1801), as well as in the archive of the château chapel in Frýdlant, in the Kinský Library (in: CZ-Pn) or in the church collection from Koleč. This last case includes an apparently later variant of this watermark which was changed by the paper maker’s son and successor, Georg Bayer: he replaced the middle letter in the WAB trigram with “G” (= Georg). The dimerous watermark with the initials WGB is apparently evidence of the way in which these signs were adapted in paper making families. 
This watermark has yet to appear in the literature and the identification of the initials would probably not have been possible without the systematic classification and research into whole collections. We can use this example to show the time overlap in the use of the paper: according to Zuman, Georg Bayer took over the paper mill at the end of the 18^th century and he was supposedly followed by Margaret Bayer in 1810. As the sources from the church in Koleč show, however, Andreas Bayer’s paper with the initials WAB was still in use (and apparently also in production) in 1812.
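The comparison of a date in a music source with the period of a paper mill owner’s activity can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the class and function names are hypothetical, and the years of activity are those given in Table 3 and in the caption of Fig. 2 (approximate years, marked “c” in the table, are treated as exact here).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Activity:
    owner: str
    paper_mill: str
    start: int   # first year of activity at the mill
    end: int     # last year of activity at the mill


def compare_date(activity: Activity, source_year: int) -> Tuple[str, int]:
    """Relate a source date to the owner's period of activity.

    Returns a label and the distance in years from the period.
    """
    if source_year < activity.start:
        return "before", activity.start - source_year
    if source_year <= activity.end:
        return "within", 0
    return "after", source_year - activity.end


margott = Activity("MARGOTT, Jan Pavel", "Trutnov", 1786, 1819)
bayer = Activity("BAYER, Ondřej I.", "Stříbro", 1762, 1796)

print(compare_date(margott, 1807))  # ('within', 0)
print(compare_date(bayer, 1812))    # ('after', 16)
```

A result such as `('after', 16)` does not in itself contradict the identification; it indicates the kind of overlap in the use of the paper discussed above and may prompt a revision of the timeframe.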

Fig. 3. Watermark of Ondřej Bayer with initials WAB (a) and a later version of this watermark changed to WGB by Jiří Bayer (b). Watermark a) see CZ-Pnm/ XLII B 241 (Clam-Gallas, Frýdlant), left half-sheet: fleur-de-lis (7.5 × 4.5 cm), right half-sheet: initials WAB with emblem (4.5 × 5 cm). Watermark b) see CZ-Pnm/ 59 R 4495 (Koleč, church), left half-sheet: fleur-de-lis (8.5 × 5 cm), right half-sheet: initials WGB with emblem (4 × 6 cm).

We have always endeavoured to identify any discovered watermarks using the literature. In approximately half the cases, however, this has involved new findings and a large number of them do not contain any nominal information. We have differentiated between monomial, dimerous and multiple element watermarks. In the case of watermarks which were created in Bohemia in the period between 1750 and 1850, we have most frequently come across dimerous watermarks, i.e. with a countermark of medium dimensions on the left half-sheet and the paper mill owner’s initials or full name on the right half-sheet. The initials or extensive inscriptions can sometimes also state the place of production (of course, in the form used in the given period). Some paper mills also marked their paper with a number signifying the type (quality) of paper. This number (for example, N 1, N 2, N 3, N 4...) can usually be found in the right or left-hand lower corner of the sheet or in exceptional cases in the middle.^[14] However, when collecting watermarks in music sources, we have not always found sheets of paper where all of the elements of the watermark have been preserved. Only rarely have we come across a watermark’s uncut dimensions, whereby the original size of the mould, in which the sheet of paper was produced, constitutes an important part of its identification. We therefore often only have the countermark (or part thereof) or the information on the adjacent half-sheet. A number of Bohemian watermarks are monomial. They may be located in the centre of the sheet or only on one of the half-sheets.

However, even in these cases, it is possible to follow the indications which lead us to a specific paper mill owner (including in the case of any as yet unregistered and often incomplete watermarks), thanks to the monitoring of the contexts within the specific collections as contained units. The four aforementioned collections, whose research has been completed, have not only provided a large volume of new data in the area of watermarks, but they have also enabled us to complete watermarks, which we would never have placed within the given context, if we had assessed them independently of the collection from which they come. We have proceeded from the fact that places with long-term musical operations (such as church choirs or monasteries) took delivery of paper from their environs and all the collections completed to date have confirmed this fact.

It is known that each paper mill owner had several moulds for paper production and that a different watermark was usually made for each of them. As such, we can find several different symbols (pertaining to a single paper mill owner), which were created by the same person (the author of the drawing and the mould) at the same time. It is not only the countermarks which are repeated in this case, but also the specific style and aesthetic formation of the watermark.

For example, the completion of the new watermarks inside the collection from the church in Řetová (the District of Ústí nad Orlicí) helped to clarify the origin of the watermarks pertaining to the paper mill in nearby Žamberk. A dimerous watermark depicting the crowned Hungarian coat-of-arms with the adjacent inscription SEMFTEMPERG, which unequivocally refers to this place, can be found in the collection. During processing, the same Hungarian coat-of-arms was discovered, but the paper mill owner had replaced the designation of the place of production with his initials MS. According to the dating in the Řetová sources (1823, 1830, 1831) and the information contained in G. Eineder’s work, this probably involved the paper mill owner Martin Ševčík, who was active in Žamberk around 1818.^[15]

Another example of the completion of the watermarks within the entire music collection (this time only with indirect support from the literature) involves the collection of watermarks by Johann Georg Zays, who was active in Janovice nad Úhlavou (Veselí, the District of Klatovy) after 1763, contained in the collection of the Strachota family from Panenský Týnec (in: CZ-Pu). Even though this collection has so far been only partially researched from the point of view of the watermarks, the information and the method used to produce the countermarks have enabled the identification of some paper producers and the designation of a number of their symbols. The new information includes, for example, a multiple element watermark which displays the Bohemian lion in the middle of the sheet with the initials I. G. Z. (= Iohann Georg Zays) on the left and the majuscule I (= Ianowitz) in the middle of a six-pointed star on the right (see Fig. 4a). Zuman mentioned this paper mill owner, but it would seem that he did not manage to find a watermark from this location. Thanks to the peculiar design of the Bohemian lion in this watermark and its repetition in the material which was used by the same person (in this case, the copyist and owner of the collection, Josef Cyril Strachota (1746‒1809)), we have also been able to identify other “related” watermarks which bear the year of the paper’s production “1777” (see Fig. 4b, c). This corresponds to the period of the copy, while the quality (the colour, delicacy and dimensions) also corresponds to the aforementioned watermark with the initials I. G. Z.

Fig. 4. Watermark of Jan Jiří Zays from paper mill in Janovice nad Úhlavou (1777), a) CZ-Pu/ 59 R 3505, b) CZ-Pu/ 59 R 3519, c) CZ-Pu/ 59 R 3518. Author of the tracing: Lucie Slivoňová.

We can find a relatively rich repertoire of Bohemian watermarks in the sources from the music collection from the church in Koleč. The dominant group of watermarks, which has been preserved within one unit and which at the same time represents one of the most frequent groups in the researched music sources, consists of the paper mill marks of Johann Georg Fürtsch (active between 1773 and 1812) (ZUMAN 1932, p. 21; ZUMAN 1934a, p. 20–21) and his son Johann Georg Fürtsch junior, who operated the paper mill in Postřekov after his father until 1844 (ZUMAN 1934a, p. 21). The older watermarks dating from the end of the 18^th century and belonging to J. G. Fürtsch senior consist of the initials IGF, the inscription KOTENSCHLOS (Chodenschloss = Trhanov) (1794, 1799, 1809) and two crossed keys with the same inscription (1794, 1809). The son of this paper mill owner used a popular symbol, the fleur-de-lis, also with the inscription KOTENSCHLOS (on the same or the adjacent half-sheet or in the lower corners) (1816, 1827, 1830). We have so far found the products of this paper mill in all the researched collections and it would appear that its paper was amongst the most frequently used in the music sources of the first half of the 19^th century. The aforementioned symbols are usually repeated (the lily, the keys or the two-headed eagle), but the best-known dimerous watermark from the Postřekov paper mill is clearly the fleur-de-lis with a shield with a diagonal stripe beneath it on the left-hand half-sheet and with the inscription KOTENSCHLOS on the right-hand half-sheet. This watermark clearly already dates from the time of J. G. Fürtsch senior, because it has been found, for example, in printed music from Prague dating from 1797 and 1800.^[16] A number denoting the type of paper was later added in the left or right-hand lower corner, for example N 4 (Kinsky: 1825, 1826).^[17]

The catalogues of the watermarks in the music collections – the Želiv Monastery

The first attempt to create a printed catalogue of watermarks was undertaken in May 2016, when the thematic catalogue of the music collection from the Premonstratensian monastery in Želiv was published as the 9^th volume in the Catalogus artis musicae in Bohemia et Moravia cultae series of the National Library of the Czech Republic. Alongside the introductory study dedicated to the history and musical life of the monastery and the actual body of the catalogue of preserved sheet music (now stored in the collections at the National Museum – the Czech Museum of Music), it contains the most detailed catalogue of watermarks from a music collection ever published in this publishing series (SEMERÁDOVÁ, ŠEDIVÁ 2016). This collection provides ideal conditions for research into watermarks, because a large number of the pieces of sheet music there are dated and often include extensive notes made by the copyists. This mainly involves music parts which were copied by alumni (the pupils at the Latin grammar school). Their work, which apparently took place in closed groups (a given pupil usually completed the parts for the same instrument or voice), was supervised by the regenschori. He later confirmed his supervision with a note on the cover for the parts and he usually dated his signature. It is precisely this dating in the sources which is essential for systematic work with watermarks. Thanks to the specific formation of the Želiv collection, we have many supporting points which we have drawn on in the watermark catalogue and during comparisons with other collections.^[18]

The watermark catalogue has been created in two parts: a) a tabular list of the watermarks with names corresponding to the form used in the RISM database and the watermark database (BWD), along with a list of the individuals from the environment of the Želiv monastery who used the paper with the given watermark to copy musical works and the dates in the sources, while the last entry is a reference to the thematic catalogue for the specific piece of sheet music; b) the second part consists of the pictorial annex where almost all the watermarks from the given collection have been reproduced.^[19]
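The structure of the tabular list described above can be pictured as a simple record with one row per watermark. What follows is only a minimal sketch under stated assumptions and not the format actually used in the printed catalogue or in the database: the class and field names, the catalogue numbers, the copyist labels, the thematic catalogue numbers and the wording of the watermark names are hypothetical placeholders. Only the inscriptions W PRZIHODA and A HELLER, the date of around 1805 and the two paper mill locations come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WatermarkEntry:
    """One row of a tabular watermark list (illustrative field names)."""
    catalogue_no: str                                   # hypothetical number
    name: str                                           # watermark name (RISM/BWD form)
    paper_mill: str                                     # place of production, if known
    users: List[str] = field(default_factory=list)      # copyists who used the paper
    dates: List[int] = field(default_factory=list)      # years given in the sources
    catalogue_refs: List[int] = field(default_factory=list)  # thematic catalogue nos.

    def earliest_date(self) -> Optional[int]:
        """Earliest year attested in the sources, or None if undated."""
        return min(self.dates) if self.dates else None


def chronological(entries, paper_mill):
    """Return the dated entries of one paper mill, ordered by earliest date."""
    selected = [e for e in entries if e.paper_mill == paper_mill and e.dates]
    return sorted(selected, key=lambda e: e.earliest_date())


# Placeholder data: apart from the inscriptions, the mill locations and the
# year 1805, every value below is invented for illustration.
entries = [
    WatermarkEntry("W 2", "[anchor in shield (crowned)] | J. K.",
                   "Červená Řečice", ["copyist B"], [1830, 1824], [17]),
    WatermarkEntry("W 1", "[Hungarian coat of arms (crowned)] | W PRZIHODA",
                   "Červená Řečice", ["copyist A"], [1805], [42]),
    WatermarkEntry("W 3", "A HELLER",
                   "Ledeč nad Sázavou", ["copyist A"], [1815], [8]),
]

for entry in chronological(entries, "Červená Řečice"):
    print(entry.catalogue_no, entry.earliest_date(), entry.name)
```

Keeping the dates as a list for each watermark, rather than a single year, reflects the fact that the same paper may appear in several dated sources. Sorting on the earliest date is one plausible reading of a chronological arrangement within a paper mill.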

Thanks to the fact that the Želiv monastery purchased paper from its vicinity (especially from the paper mills in the area along the Sázava River), it has been possible to reconstruct certain stages in the operations of the most frequently appearing paper mills and to substantiate the watermarks used in this period. This has been managed to the greatest extent in the case of the paper mill in Červená Řečice which is a mere five kilometres from the Želiv monastery. We have collected a total of 18 watermarks from the paper mill’s production in the period between 1740 and 1848. There were only three owners of the mill during this period: Christian Příhoda (active ^[20] from ca 1740 to 1788), his son Václav Příhoda (1788‒1818) and Jakub Klusáček (1818‒1848c) (ZUMAN 1934a, p. 24; ZUMAN 1936, p. 54–67). The watermarks designating the paper produced by Christian Příhoda especially display shields and ecclesiastical insignia. There are several shield variants which are crossed by a crosier and an archbishop’s cross, while there is a mitre instead of a crown and the initials CP appear below the shield. One watermark from the period when Christian’s son Václav Příhoda took over the paper mill has been preserved in the Želiv collection: a crowned Hungarian coat-of-arms on the left-hand half-sheet and the inscription W PRZIHODA on the right-hand half-sheet. This paper was used around the year 1805 for copying music parts, but it was also suitable as an envelope for older materials. We are aware of a total of 11 different watermarks from the workshop of Jakub Klusáček dating from the period between 1824 and 1841. Like his predecessor, Klusáček also used the crowned Hungarian symbol along with an anchor placed either independently or within a shield or a post horn in a crowned shield.^[21] We do not regard these shields as coats-of-arms in the true sense of the word, because they did not belong to noble individuals. They were a merchant’s “trademark”, a symbol which was only used for the purpose of protecting the manufacturer. The paper mill owner Klusáček appended his initials or his entire name written either in majuscule or in italics to these symbols. He also used the initials J. K. or R. R. (Roth Retschitz).

A relatively large number of watermarks from this collection come from another paper mill in the Sázava River area, from Ledeč nad Sázavou. This mainly involves the production of Johann Anton Heller (active between 1808 and 1832) (ZUMAN 1934a, p. 17; ZUMAN 1936, p. 45–47), which is most commonly labelled simply with the inscription A HELLER along the lower edge of the sheet. In other cases, the name was supplemented, for example, with the crowned Hungarian coat-of-arms, an anchor or the aforementioned merchant’s trademark which once again consisted of an anchor in a crowned shield. We have also come across the location information LEDETSCH which either stands independently or in combination with the name of the paper mill owner.

The watermarks of Thomas Hallik from what is now Havlíčkův Brod (active in 1828‒1851c) (ZUMAN 1934, p. 9; ZUMAN, 1936, p. 27) have been found in the Želiv collection in sources dating from the mid-19^th century, i.e. towards the end of the manual production of paper. His paper differs from the others mainly due to its larger dimensions and finer structure. Judging by the central symbol, a post horn, which was used in combination with information about the location D. BROD (= Deutschbrod, also D: Brod in italics) and his name T. Hallik, this was probably writing paper. The right or left-hand section of the watermark is always depicted as a mirror image and so the paper was made so that the sheet could be folded over before use (as was the custom when writing correspondence). The post horn was occasionally replaced with the symbol of the Bohemian lion, while the remainder of the watermark remained unchanged. One specific point of interest is connected with T. Hallik’s products in the Želiv collection. This paper was used to write out the bombardone instrumental part, which was added to the original orchestration of older works in the mid-19^th century.

The paper mill owner Adalbert Hallik (active in 1820‒1834) (ZUMAN 1934, p. 29–30; ZUMAN 1936, p. 68) from Zahrádka on the Želivka River (the District of Havlíčkův Brod), who was probably not related to the aforementioned mill owner, supplied paper to the Želiv monastery especially in the second quarter of the 19^th century. His paper bears the name A HALLIK and the place of production ZAHRADKA in several variants. Once again, the motif of the anchor was repeated: in one variant, the paper mill owner placed it in the centre of a crowned shield and added his name written in italics on the adjacent half-sheet. In another, he placed the anchor in the middle of the sheet and placed the inscription A. / HALLIK around it.

The other paper mills do not belong to the Sázava River area and in some cases they were somewhat distant from the Želiv monastery, but their paper designed for copying musical works nevertheless made its way to the monastery. For example, there are samples of the production from the paper mill in Svídnice (the District of Chrudim) from three different stages of the activities of the Ritschel family: Josef Ritschel (active from 1784c to 1809) (ZUMAN 1932, p. 28), whose workshop produced a watermark with a large crowned W, the inscription SWIDNIZ and the initials I. R., followed by Anton Ritschel (active from 1809 to 1816) (ZUMAN 1934a, p. 7), who appended a figure of a lion and his initials to the information about the place of production, and finally Franz Ritschel (active from 1816 to 1840) (ZUMAN 1934a, p. 27), who used the crowned Hungarian coat-of-arms with the inscription SWIDNITZ and his whole name F RITSCHEL. We should also mention the paper mill in Postřekov (the District of Domažlice), which has already been mentioned in association with the collection from the church in Koleč. The Želiv collection also contains preserved copies of the aforementioned paper with the inscription KOTENSCHLOS produced by Johann Georg Fürtsch the younger. A number of watermarks from the Velhartice paper mill (the District of Klatovy) come from the paper mill owner Josef Anton Appeltauer (active from 1807 to 1851) (ZUMAN 1934a, p. 28–29; ZUMAN 1934b, p. 31–33). As in the case of the aforementioned paper mill owners, here too we find initials or information about the place of production. The fleur-de-lis is a frequent figure, either self-standing or in combination with a shield as in the case of the Postřekov paper mill.

The Vrchlabí paper mill of the Kiesling family is also significantly represented among the manuscripts dating from the 1830s. This involves products from the workshop of Anton Johann Kiesling (born: 1766 ‒ died: 1838) and his brother Heinrich (died: 1862) (ZUMAN 1940, p. 31) marked with the initials K & S and usually supplemented with a fleur-de-lis in a crowned shield completed underneath with a symbol in the shape of a number four. A. J. Kiesling’s brother Karl August Kiesling (died: 1818) (ZUMAN 1940, p. 97–103) also used the fleur-de-lis in combination with a shield with a diagonal stripe and the initials GK or the inscription “Gebrüder Kiesling”. Later, we can find the inscription CAKE (= Carl August Kieslings Erben) in the lower right-hand corner of the sheet in combination with the fleur-de-lis in a crowned shield.

The conformity of the used countermarks, both in the paper from those mills located in a single region and in those more distant, over a relatively long period of time is apparent in this summary of the most important watermark discoveries. The named paper mills are amongst those which are most frequently represented in the collection and six fundamental (apparently the most popular at that time) countermarks are repeated there: the shield, the fleur-de-lis, the Hungarian coat-of-arms, the post horn, the anchor and the Bohemian lion. These are also the countermarks which have been most frequently preserved in the musical material which has been researched to date, as has been shown above (see table 2).

The newly completed watermark research in the Clam-Gallas music collection from Frýdlant

The re-cataloguing of the music collection of Count Christian Philip Clam-Gallas (1748‒1805) for the RISM database was completed in 2016. This extensive collection containing over 1500 shelfmarks was originally the active archive for the château orchestra in Frýdlant, but now the collection is stored at the National Museum – the Czech Museum of Music. It mainly comprises an orchestral (and chamber) repertoire, most frequently in the form of instrumental parts, but there are also some operatic scores with the performed parts and various works of music designated for domestic production. Some of these pieces of music (nowadays, there are ca 80 shelfmarks which are exclusively operatic arias) were originally part of the private collection of Countess Marie Caroline Josepha Clam-Gallas (née Špork) (1752‒1799) and they were apparently added to the collection before it was stored at the National Museum.

Here too, records of all the watermarks were compiled and a catalogue similar to that for the collection from the Želiv Monastery was put together during the cataloguing.^[22] Given the fact that this involves an aristocratic collection, it is completely different from the monastery collection in its repertoire composition, type of material, scope and the characteristics of the specific internal units. The catalogue has therefore been divided into groups according to the country of origin of the watermarks and subsequently according to the individual paper mills, provided they are known.

As stated in Table 1, a total of 190 watermarks of Italian provenance have been found there and this is especially a result of the fact that the orchestra repertoire^[23] was mainly secured by means of the purchase of instrumental parts from the Viennese professional copyist workshops. They used light, delicate Italian paper which was mainly imported from the areas of Lombardy, the area around Venice or Friuli-Venezia Giulia (Toscolano, Bergamo, Vaprio, Venice, Cordenons (= Pordenone) etc.). There are a total of 60 newly identified Czech watermarks, whereby a further 117 have not yet been designated, but a certain part of this group will most probably be identified in time. In addition, German (27), French (11), Austrian (8), Dutch (7) and Swiss (6) watermarks have also been found.

The watermarks from the printed music, which constitutes approximately one third of all the material, have also been recorded here for the first time. As well as sheet music published by German (Berlin, Dresden, Leipzig and Offenbach am Main), Dutch (Amsterdam), French (Paris), Italian (Florence, Venice) or Czech publishing houses (Prague), an extensive collection of sheet music published in Vienna (Artaria & Comp., Giovanni Cappi, Christoph Torricella, Johann Traeg, Joseph Eder, K. K. Hoftheater-Musikverlag, Ludwig Maisch, Thadé Weigl, Tranquillo Mollo, Chemische Druckerei etc.) has also been preserved.

Watermarks in printed sheet music

Given the fact that the goal of the Prague RISM workgroup (and also the team involved in the watermark research) is to fully process the music collections, including any printed sheet music, the watermarks from this type of document have also been included in the watermark database. The presence of watermarks in sheet music has been somewhat neglected to date, but it has been shown that interesting findings can also be made here and that in certain cases a watermark can be used, for example, to identify fragments or to elaborate on the origins of those pieces of sheet music, which are lacking any publishing information.^[24] The watermark database can now be used to search according to the publishers who used a specific type of paper to print music. It has been supplemented with dates which are most frequently derived from the number of the printing plate. In the case of the Viennese printed sheet music, which constitutes a significant set of the sources not only in the aristocratic music collections, we have used the publisher catalogues compiled on the basis of source research or according to the period sales catalogues of the individual companies. These lists, which have been compiled by Alexander Weinmann and published in the Beiträge zur Geschichte des Alt-Wiener Musikverlages series, constitute an important tool for dating these printed documents. In the case of Czech publishers, we have been able to make use of Karel Chyba’s inventorial work Slovník knihtiskařů v Československu od nejstarších dob do roku 1860.
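The two-step search described here, from watermark to publishers and then from plate number to a specific work, can be sketched as a pair of lookups. This is only an illustration of the reasoning, not the interface of the watermark database, and all function and variable names are assumptions. The only data taken from this article are the watermark GEBR KIESLING, the Chemische Druckerei as one of its users and plate number 1515 for Poessinger's clarinet concerto (see footnote 24). The second publisher is a made-up placeholder for the other, unnamed users of the paper.

```python
# Watermark -> publishers known to have printed on paper with that watermark.
# "Publisher X (placeholder)" is invented to show how candidates are narrowed.
PUBLISHERS_BY_WATERMARK = {
    "GEBR KIESLING": ["Chemische Druckerei", "Publisher X (placeholder)"],
}

# Publisher -> {plate number: work}, as might be transcribed from a
# published publisher catalogue.
PLATE_CATALOGUES = {
    "Chemische Druckerei": {
        1515: "F. A. Poessinger, clarinet concerto (WEINMANN 1979, p. 88)",
    },
    "Publisher X (placeholder)": {},
}


def identify(watermark, plate_no):
    """Return (publisher, work) pairs consistent with watermark and plate number."""
    # Step 1: the watermark reduces the list of candidate publishers.
    candidates = PUBLISHERS_BY_WATERMARK.get(watermark, [])
    matches = []
    for publisher in candidates:
        # Step 2: the plate number is checked in each candidate's catalogue.
        work = PLATE_CATALOGUES.get(publisher, {}).get(plate_no)
        if work is not None:
            matches.append((publisher, work))
    return matches


print(identify("GEBR KIESLING", 1515))
```

The watermark alone rarely identifies a print, since the same paper was used by several publishers. Its value lies in shortening the list of catalogues that then have to be checked against the plate number.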

Let us once again return to the music collection of Count Clam-Gallas, specifically to the printed part thereof. This is the first collection in which the printed sheet music has been researched from the point of view of the watermarks. The temporal focal point of the aforementioned Viennese printed sheet music, for example, is around the year 1800. The following table sets out the paper mills which supplied material to ten Viennese publishing houses between 1800 and 1812. As is apparent here, not only Italian paper was used, but also German and Czech paper (Kiesling, Vrchlabí).

Tab. 4a. List of the kinds of paper used in 10 Viennese publishing houses between 1800–1812 based on the watermark registration of music prints in the collection of Count Clam-Gallas from Frýdlant (CZ-Pnm). *Marked music print is not from this collection, but from music collection of the National Library of the CR (CZ-Pu).

Tab. 4b. List of watermarks and paper mills.

Code	Watermark (RISM, BWD)	Paper maker, paper mill
A	GA | F [in shield] / [3 crescents]	Fratelli Andreoli, Toscolano (Lombardia, Italy)
G 1	[crescent] | [star (6 points) in shield (crowned)] / VG	Valentino Galvani, Cordenons (Friuli-Venezia Giulia, Italy)
G 2	[3 stars (6 points, 4 points) in shield (crowned), crescent above] / VG	Valentino Galvani, Cordenons (Friuli-Venezia Giulia, Italy)
G 3	VG	Valentino Galvani, Cordenons (Friuli-Venezia Giulia, Italy)
H	Luzern / Hartmann | 1802 [in shield]	Hartmann, Luzern (Switzerland)
K 1	GEBR: KIESLING	= Gebrüder Kiesling, Vrchlabí (Czech Republic)
K 2	G. KIESLING	Gebrüder Kiesling, Vrchlabí (Czech Republic)
K 3	13 / KIES[LING]	Gebrüder Kiesling, Vrchlabí (Czech Republic)
K 4	GEBR KIESLING	Gebrüder Kiesling, Vrchlabí (Czech Republic)
M 1a	[crescent] | [star (6 points) in shield (crowned)] / P. A. M. (no. 1)	P. A. Mathes, Unter-Waltersdorf (Austria)
M 1b	[crescent] | [star (6 points) in shield (crowned)] / P. A. M. (no. 2)	P. A. Mathes, Unter-Waltersdorf (Austria)
M 2	P. A. M. / [crescent] | [star (6 points) in shield (crowned)]	P. A. Mathes, Unter-Waltersdorf (Austria)
M 3	P. A. M.	P. A. Mathes, Unter-Waltersdorf (Austria)
U	[coat of arms with wolf (standing in profile)] | IAV | WOLFEG	Joseph Anton Unold, Wolfegg (Baden-Württemberg, Germany)

The conclusion

The systematic study of watermarks requires continual vigilance during research into the sources, accuracy, an aesthetic sense when drawing the watermarks and a critical view when evaluating the countermarks. The watermarks in music sources are bearers of information and they can also be valuable aids in such serious matters as authorship or the year of the creation of the piece of music. The purpose of this research lies in finding links between the watermarks (the commercial brand of the paper from the specific time and place) and the actual music, individuals, places or events. The established and generally accepted terminology and the strict rules for the description of countermarks enable cataloguing and subsequent searching for both individual components and complete watermarks using the database. The English terminology and the description principles have been adopted from the cataloguing practice of the RISM database and therefore the watermark database is especially user-friendly to researchers who work with the international catalogue of music sources. It has therefore grown out of the already introduced system as an additional tool for the description of watermarks and it makes it possible to understand the context of the individual music collections and the relationships which they may have in common, including on an international scale.

In conclusion, I would like to thank all my colleagues who have contributed to the watermark research at the National Library of the Czech Republic: Mgr. Zuzana Petrášková, Bc. Lucie Havránková, Mgr. Lucie Slivoňová, Bc. Štefánia Demská, Mgr. Jakub Michl and Dr Michaela Freemanová, Ph.D. Special thanks go to Mgr. Radovan Zahořík for the technical realisation and operation of the watermark database and for his technical support.

The list of used literature

BASTLOVÁ, E., 2013. Collectio operum musicalium quae in Bibliotheca Kinsky adservantur. 1st ed. Pragae: Národní knihovna ČR. 381 p. Catalogus artis musicae in Bohemia et Moravia cultae. Artis musicae antiquioris catalogorum series; vol. VIII. ISBN 978-80-7050-626-4.

EINEDER, G., 1960. The Ancient Paper-mills of the Former Austro-Hungarian Empire and their watermarks. Monumenta chartae papyraceae historiam illustrantia 8. Hilversum, Holland: Paper Publications Society.

CHYBA, K., 1966. Slovník knihtiskařů v Československu od nejstarších dob do roku 1860. [Prague]: the Museum of Czech Literature.

SEMERÁDOVÁ, P. and E. ŠEDIVÁ, 2016. Catalogus collectionis operum artis musicae de Monasterii Siloensis. 1st ed. Pragae: Národní knihovna ČR. 2 volumes (660 pages). Catalogus artis musicae in Bohemia et Moravia cultae. Artis musicae antiquioris catalogorum series; vol. IX/1-2. ISBN 978-80-7050-664-6.

WEINMANN, A., 1979. Vollständiges Verlagsverzeichnis Senefelder, Steiner, Haslinger: (Wien 1803-1826). Bd. 1, A. Senefelder, Chemische Druckerey, S. A. Steiner & Comp. München: E. Katzbichler. Musikwissenschaftliche Schriften; Bd. 14. Beiträge zur Geschichte des Alt-Wiener Musikverlages; Reihe 2. F. 19. ISBN 3-87397-113-5.

ZUMAN, F., 1921a. Přehled papíren v Čechách v 17. století. Český časopis historický XXVII. [Prague], p. 162–170.

ZUMAN, F., 1921b. Privilegium papírny jáchymovské. Památky archeologické XXXII. [Prague], p. 259.

ZUMAN, F., 1921c. Papírna v Prášilech. Zlatá Praha XXXVIII. [Prague: J. Otto], p. 183–185.

ZUMAN, F., 1922a. Filigrán mimoňský a hamerský. Památky archeologické XXXIII. [Prague], p. 158–160.

ZUMAN, F., 1922b. Inventář bělské papírny z r. 1723. Památky archeologické XXXIII. [Prague], p. 344.

ZUMAN, F., 1923. České filigrány XVI. století. Památky archeologické XXXIII. [Prague], p. 277–286.

ZUMAN, F., [1927a]. České filigrány XVII. století. Památky archeologické XXXV. [Prague: nákl. vl.], p. 442–463.

ZUMAN, F., 1927b. Papírny Starého města pražského. Technicko-průmyslový archiv 1. Prague: The Czechoslovak Technical Museum in Prague IV.

ZUMAN, F., 1931a. Přehled papíren v Čechách v 18. století. Český časopis historický 38. [In Prague: F. Zuman]. p. 80–90, 294–316.

ZUMAN, F., 1931b. Papírna trutnovská. In Prague: Printed by the Royal Bohemian Society of Sciences.

ZUMAN, F., 1932. České filigrány XVIII. století. Rozpravy České akademie věd a umění, tř. I, č. 78. In Prague: Printed by the Czech Academy of Sciences and Art.

ZUMAN, F., 1934a. České filigrány z první polovice XIX. století. Rozpravy České akademie věd a umění, tř. 1, čís. 81. In Prague: Printed by the Czech Academy of Sciences and Art.

ZUMAN, F., 1934b. Pootavské papírny. In Prague: the Royal Bohemian Society of Sciences.

ZUMAN, F., 1936. Posázavské papírny. In Prague: the Archive for the History of Industry, Trade and technical Work.

ZUMAN, F., 1940. Podkrkonošské papírny. In Prague: Printed by the Czech Academy of Sciences and Art.

^[1] A watermark is an imprint of a visual symbol or inscription made using copper or brass wire attached to the screen of the wooden frame so that its elevation above the screen causes a weakening of the layer of the pulp and leads to the creation of the paper mill’s logo. The watermark not only designates the actual product (it protects both the paper mill and the purchaser), but also its quality, type and format. The use of the visual symbol by the paper mill was not in any way random. It spoke of the place of origin by means of heraldry and respected the family traditions or the owners of the paper mills or the wishes of the suzerain who had issued the producer with authorisation (the privilege) to operate the paper mill. Watermarks substantiate the existence of the used paper in a certain period and clarify its origins. They therefore constitute a substantial component of the sources and are bearers of information.

^[2] Available from: http://www.rism.info/.

^[3] When entering the locations from the area of Moravia and Silesia, we have so far used Georg Eineder’s extensive compendium (EINEDER, G., 1960) and other smaller works focusing on specific paper mills.

^[4] Available from: www.nkp.cz.

^[5] Available from: http://aleph.nkp.cz/web/watermarks/eng/bohemian_papermakers_vocabulary.pdf.

^[6] The mould in which the paper was produced consisted of a frame, a screen and a top frame. The screen consisted of finer wires placed close to one another, which reinforced the frame internally. The mark of the screen can be seen on the sheet as faint, dense horizontal lines. The warp has a similar function and it was wired vertically within the form, this time using thicker wires. The watermark was soldered to the warp. These vertical lines can be marked in the tracings with sequence numbers, but the counting from left to right is only noted in those cases when the entire uncut sheet of paper is available. We have, however, come across cases where the warp was wired both vertically and horizontally.

^[7] It is often possible to identify the copyist or the copyist workshop or to clarify the date thanks to the watermarks, especially in the case of extensive and less clear collections which require a systematic approach both before processing and during the work.

^[8] Library sigla (RISM).

^[9] Available from: http://aleph.nkp.cz/web/watermarks/cze/intro.htm.

^[10] This involves the catalogue number, under which the watermark tracings are physically stored at the Music Department of the National Library of the Czech Republic.

^[11] A (Austria), CH (Switzerland), CZ (the Czech Republic), D (Germany), GB (Great Britain), I (Italy), PL (Poland), N (the Netherlands), F (France), [n] (undesignated, but this may involve Czech paper in a number of cases based on the quality and the types of countermarks).

^[12] We consider shields (like cartouches) to be one of the regular parts of a watermark, while other symbols (this may involve individual components) are often located inside the shield or the cartouche.

^[13] „ …the Archive of the Ministry of Internal Affairs, Commerciale 1796‒1805, 6/2: The report of the Burgomaster in Stříbro dating from 9^th September 1796 …“ (ZUMAN, F., 1932, p. 28).

^[14] According to our findings to date, this label was used, for example, by J. A. Appeltauer’s paper mill in Velhartice, J. A. Heller’s paper mill in Ledeč nad Sázavou, T. Hallik in what is now Havlíčkův Brod or the Kiesling family in Vrchlabí.

^[15] EINEDER, G., 1960, p. 129. Zuman states that M. Ševčík was active as early as in 1799 (ZUMAN, F., 1931a, p. 308).

^[16] This involves the publication of two Nationalgesänge by Bedřich Dionys Weber commissioned by Johann Ferdinand Schönfeld and František Jeřábek in Prague which have been preserved in the collection of Count Christian Clam-Gallas of Frýdlant, which is now stored at the National Museum – the Czech Museum of Music (CZ-Pnm/ sig. XLII C 276). RISM A/II: 551006815. Dated copies from 1807 have also been preserved (CZ-Pu, without provenance), 1815 (Želiv, in: CZ-Pnm), 1817 (Kinsky, in: CZ-Pn).

^[17] The music collection in the Kinský Library (the Library of the National Museum, CZ-Pn) was entered into the RISM database in 2010. The watermarks from this collection were later included in the watermark database in the form of photographs. The thematic catalogue for the collection with a brief description of the watermarks was published in 2013 (BASTLOVÁ, E., 2013).

^[18] The following text is a modified extract from the introduction to the watermark catalogue (SEMERÁDOVÁ, P. – ŠEDIVÁ, E., 2016, p. 95–97).

^[19] The watermarks have been arranged in the catalogue as follows: a) according to the paper mill (those watermarks which it has been possible to identify have been included in this group according to the paper mill they belonged to and arranged chronologically according to the dates in the Želiv sources, W 1-75), b) according to the individual symbols (if the origins of the watermark have not been confirmed, W 76-163), c) initials and inscriptions arranged alphabetically (W 164-198) and d) poorly legible watermarks (W 199-226), which are named in the table without a pictorial section. The comprehensive photographic documentation of the watermarks was undertaken when processing the collection for the RISM database. As such, there are also photographs of those watermarks which we have not been able to include in the catalogue due to their lack of physical integrity, incompleteness or illegibility.

^[20] Hereafter, the dates stated next to a paper mill owner indicate the period when he was active.

^[21] The symbol of the post horn is usually found on writing paper, which is noticeably finer than other paper, but, as the sources from the Želiv collection (and others) have substantiated, this paper was also used for writing music.

^[22] The thematic catalogue of the Clam-Gallas music collection is under preparation and is planned as a further publication in the Catalogus artis musicae in Bohemia et Moravia cultae edition of the National Library of the Czech Republic.

^[23] Symphonies, divertimentos and chamber music most frequently by the following authors: J. K. Vaňhal, J. Haydn, J. Schmittbauer, F. X. Dušek, J. Chr. Bach, V. Pichl, C. Stamitz, A. Fils, F. L. Gassmann, K. F. Abel and T. S. Müller etc.

^[24] Let us state the example of the identification of fragments of printed sheet music using the watermark database in the case of sheet music from the collection of the Monastery Church of the Nativity of Our Lady in Želiv (CZ-ZEL/ sig. Hu 20, in: SEMERÁDOVÁ, P. and E. ŠEDIVÁ., 2016, no. 583, RISM A/II: 551001585). 14 printed instrumental parts were found for an A major clarinet concerto, but, along with the title page, page 1 in the solo clarinet part which contained the first incipit (which was crucial for the identification of the work) was missing. The incipits of the other movements were not identified in either the RISM database or the UMC. The author of the piece and the publisher were therefore unknown. The number of the printer’s plate in all the parts was “1515” and the watermark with the inscription GEBR KIESLING was the only guide to identifying the sheet music. The watermark, newly listed in the watermark database, was used, amongst others, by the Viennese Chemische Druckerei. The publisher’s catalogue from this printing works listed Franz Alexander Poessinger’s (1767‒1827) clarinet concerto under no. 1515 (WEINMANN, A., 1979, p. 88). In cases such as this, it is possible to use the watermark as a guide towards the publishers and printers and at the very least to reduce the list of possible candidates.

Table 2. Watermarks in Bohemian Watermark Database (a selection of main countermarks). A part of unidentified watermarks [un]* can be of Czech origin.

Rights to the data administered by public libraries in the light of amendments of the Freedom of Information Act

Radim Polčák — 2016-11-02T12:15:00Z

Summary: In the eyes of the law, public libraries are cultural institutions that care for assets of considerable value in the public interest. Among other things, their activities are regulated by the part of Czech law that imposes information duties on publicly funded entities (duties related to the freedom of information) and that also governs the re-use of public sector information. This article deals with recent substantive amendments to the Czech Freedom of Information Act and discusses some specific legal problems concerning the processing of obligatorily deposited electronic copies and the digitisation of library collections.

Exchangeable formats of bibliographical data: their present transformation

Klára Rösslerová — 2016-11-02T12:15:00Z

Keywords: exchangeable formats, bibliographical data, linked data BIBFRAME, Schema.org, cataloguing

PhDr. Klára Rösslerová / Filozofická fakulta. Univerzita Karlova v Praze (Faculty of Arts, Charles University in Prague), náměstí Jana Palacha, 2, 116 38 Praha 1, Česká republika

1 Introduction

The most broadly used exchangeable format of bibliographical data today is MARC 21¹, and to a lesser extent also UNIMARC (Universal MARC format). Opinions have been voiced for quite some time that formats of this type are obsolete, but at the same time one cannot simply adopt a different format, transfer huge amounts of data and adapt all the library systems working with the existing structure. Bibliographic data in these formats have been created, stored and distributed for decades. However, this easy period in global librarianship obviously seems to belong to the past.

In fact, criticism of the MARC type format, or rather an analysis of what specifically is outdated or superfluous in it, is of no special importance. It is much more necessary to focus on the development of library science (or rather of the web in general) and to suggest what structure of data libraries should produce, keeping in mind the primary goal of the library: offering services to its users as quickly and easily as possible, in the way users are accustomed to on the web. According to opinion polls, well over 80% of users begin their search on the free web, and only afterwards do they look for the desired item in library catalogues. If libraries wish to follow the trends and adapt to them (which they readily do in the field of publicity and general communication on social networks), they should arrange access to their valuable data so that users can find them with the help of current search engines.

The path leads to maximum integration of data from diversified sources, and namely not only the non-commercial, but also the commercial ones. Accordingly, the development is thus seen to lead out of the purely librarian, i.e. non-profit region, getting to the margin of business: leaving aside its exclusive status and becoming also an object of interest for companies that can add references of their commercial products to the library data.

If a link between the library sphere and the commercial one is established, it can be beneficial for all participating parties, since library data are very valuable and their creation requires sophisticated and expensive work, not to mention the specialized education and practice of librarians.

The present article aims at forecasting possible future development in the field of exchangeable formats of bibliographical data. Is MARC 21 going to be replaced with another format? Is it going to be replaced with one format only? Or can we envisage future utilization of more formats according to the actual purpose: one for exchanging and presentation of data, another for the storage in various systems or for the exchange between the libraries?

1.1 Starting points

Exactly the same as in life, also in our sphere it is more important to look forward instead of weeping over anything in the past. However, the proposal of a reasonable future of exchangeable formats requires an analysis of the present condition, finding its errors or insufficient solutions, and making use of the analysis for defining the requirements that ought to bring us to the desirable future state of things. Roy Tennant (TENNANT, 2002 and 2002b) has handled this subject in his articles in a comprehensive, and yet concise way. The foundation platform according to him can be described as follows:

Keep what is good.
Achieve a high level of granularity.
Interlink by way of references.
Make use of hierarchic relations.
Get free from physical materialization.
Achieve expandability, flexibility.
Achieve interlinking.
Obviously the starting point will be the semantic web.

1.2 Semantic web

Before explaining the concept of the semantic web, it seems appropriate to explain the adjective semantic, or the word semantics. Semantics, as a linguistic concept, deals with the meaning of words. In this sense a semantic web is one that is understandable to machines. This does not mean, however, that machines should become artificial intelligences able to read data and grasp their meaning the way humans do. It means that information in the semantic web is structured so that machines can recognize it, thanks to the markup used to enable this process of recognition.

The idea of the semantic web was first formulated by Tim Berners-Lee, the inventor of the web, in 2001, when he postulated that the tangle of files interconnected by hypertext references should be replaced with a structured database, i.e. that a web of data should be used instead of a web of documents. This should be achieved by means of hidden markup supplying information about the meaning of the data contained in the documents. He thus wanted to solve two problems at once: on the one hand, the existence of data that are not expressed in HTML (such as databases) and are therefore not retrievable by classical search engines, i.e. the problem of the deep web; on the other hand, the problem of relying only on keywords during a search, irrespective of the actual meaning of the contents (KONSTANTINOU, 2015).

Reaching this goal, the creation of a web of linked data (see below), is the focus of the World Wide Web Consortium (W3C), an international consortium developing web standards, whose founder and director is Berners-Lee. The semantic web is based on the RDF (Resource Description Framework²) technology (World Wide Web Consortium, 2013).

1.3 Linked data

The concept of linked data was likewise first formulated by Berners-Lee, in 2006. Linked data is a publishing model for issuing structured data on the web that is based on web standards, such as HTTP (Hypertext Transfer Protocol³) and URI (Uniform Resource Identifier), as well as on semantic web technologies, such as RDF (Resource Description Framework) (MYNARZ, 2010). OCLC (Online Computer Library Center) has offered on its web site a very illustrative free video explaining the character of linked data and the advantages and options this concept brings to libraries.

Applying linked data means creating links between data from various resources. In this way the most diverse data, created by organizations in different parts of the world with no other connection among themselves, can be linked together; alternatively, the data can be part of the heterogeneous systems of a single organization (BIZER, 2009). Bizer, Heath and Berners-Lee expand the idea of the web of data to “the web of things described worldwide according to data on the web” (2009).

A subset of linked data is linked open data, which means publishing structured data, but in an open format, i.e. data available for everyone (KONSTANTINOU, 2015). Such data can be re-distributed, re-used and can become the basis for further utilization, including commercial purposes.

1.4 Advantages of using the linked data method

The work of Heath and Bizer (HEATH, 2011) summarizes the potential of linked data:

The RDF principle can be used by anybody worldwide to link anything with anything.

Any URI allows the users to retrieve complementary information.

Information from diversified sources can be simply interlinked by combining two triples (subject-predicate-object) in one single graph.

RDF enables information expressed in diversified schemes to be represented in one single graph.
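The points above can be illustrated with a minimal sketch in plain Python. It is not taken from Heath and Bizer; it only assumes that an RDF graph can be modelled as a set of (subject, predicate, object) triples, so that combining data from two sources is a set union. All URIs and the helper names `merge_graphs` and `describe` are hypothetical placeholders invented for this illustration.

```python
# A graph is modelled as a set of (subject, predicate, object) triples.
# Every URI below is a hypothetical placeholder, not a real identifier.
AUTHOR = "http://example.org/person/1"
BOOK = "http://example.org/book/1"

# Source 1: a library catalogue describing a book and its author.
library_graph = {
    (BOOK, "http://example.org/vocab/title", "An invented title"),
    (BOOK, "http://example.org/vocab/author", AUTHOR),
    (AUTHOR, "http://example.org/vocab/name", "Jane Doe"),
}

# Source 2: an unrelated database that uses the same URI for the author.
biography_graph = {
    (AUTHOR, "http://example.org/vocab/name", "Jane Doe"),
    (AUTHOR, "http://example.org/vocab/birthPlace", "Prague"),
}


def merge_graphs(*graphs):
    """Combine several graphs into one; identical triples collapse."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, uri):
    """Return every triple in which the URI is the subject or the object."""
    return sorted(t for t in graph if t[0] == uri or t[2] == uri)


merged = merge_graphs(library_graph, biography_graph)
for triple in describe(merged, AUTHOR):
    print(triple)
```

Because both sources use the same URI for the author, the merged graph links the book to the author's birthplace although neither source stated that connection on its own.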

Let us offer an illustrative example: a search for the expression “Jan Hřebejk”, a Czech filmmaker, in the Google search engine (see Fig. 1). Note the basic information about Jan Hřebejk, with a group of his portraits and further pictures, on the right-hand side, combined with the text on the left-hand side.

Fig. 1 Screenshot showing the results of a search for the expression “Jan Hřebejk”

1.5 The way to semantic web in the field of library science

During the first decade of the new millennium, library experts also began to explore the field of the semantic web with increasing frequency. In 2008 Martha M. Yee, a librarian at the UCLA Film & Television Archive, introduced her own cataloguing rules and an RDF model in a document in which she maintains that the bibliographical universe is (too) complex, and that the role of a catalogue should therefore consist in alleviating this complexity in order to simplify searching for the user. Librarians ought to mark up individual bibliographical data so that systems can handle them and present them to the user in an easily understandable form, displaying, upon a single click, all further works of the given author, all further editions, all other units etc. Moreover, all of that could be done by commercial search engines of the Google type instead of library catalogues (YEE, 2008).

In 2006 the Library of Congress in Washington appointed Karen Calhoun to prepare a report of changes in the cataloguing process and the catalogues, library rules and formats and in general of the trends in the field of information services and librarianship. Karen Calhoun summed up the essential requirements and changes in a number of points, of which the following ones appear to be relevant for the present work: more stress upon shared catalogues, sharing of approach, catalogues as means supporting digitization projects, and further individual preparation for linkages.

However, the idea behind the proposal is particularly interesting, namely that the exchangeable format should be developed in the direction of the MARC-XML⁴ format, yet leaving the actual structure of the MARC format unchanged (CALHOUN 2006). There are voices opposing Karen Calhoun’s recommendation, such as that of Brighid Gonzales, who concludes in her article (GONZALES 2014) that MARC-XML is usable simply because there is nothing better at the moment.
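What it means to move to MARC-XML while leaving the MARC structure unchanged can be shown with a small sketch. The record below is invented for illustration and is not taken from any catalogue; only the namespace and the element names (record, leader, controlfield, datafield, subfield) follow the published MARCXML schema. The helper name `subfields` is hypothetical. The sketch uses Python's standard library to read the familiar MARC field tags and subfield codes back out of the XML.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML schema.
MARCXML_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# An invented sample record: the tags (100, 245), indicators and
# subfield codes are the same as in a binary MARC 21 record.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">example-0001</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An invented title :</subfield>
    <subfield code="b">a sample record</subfield>
  </datafield>
</record>"""


def subfields(record, tag):
    """Return (code, value) pairs for every data field with the given tag."""
    values = []
    for field in record.findall(f"marc:datafield[@tag='{tag}']", MARCXML_NS):
        for sub in field.findall("marc:subfield", MARCXML_NS):
            values.append((sub.get("code"), sub.text))
    return values


record = ET.fromstring(SAMPLE_RECORD)
title = subfields(record, "245")
author = subfields(record, "100")
print(title)
print(author)
```

The XML wrapper changes only the serialization: the meaning still resides in numeric tags and one-letter subfield codes, which is the limitation the critics of this approach point to.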

In June 2006 the ALA (American Library Association) conference took place, where Deanna Marcum, vice president of CLIR (Council on Library and Information Resources) and later deputy director for library services of the Library of Congress in Washington, formulated the requirement that cataloguing should be given serious thought in the light of the progress in information technologies. She suggested the creation of a working group that would deal with the problems from a sufficiently detached viewpoint. Such a group was founded in the same year: the Library of Congress Working Group on the Future of Bibliographic Control. It set itself the goal of proposing a different method of bibliographical control, one that does not start from the present state of things: “We should not try to correct the existing systems, but rather pretend that we have only just returned to this planet from Mars…” (Library of Congress, 2006).

The working group, headed by Dr. José-Marie Griffiths, Dean of SILS (School of Information and Library Science) of the University of North Carolina at Chapel Hill, has fifteen members drawn from the professional public (cataloguing experts) and from among librarians, and a representative of Microsoft is also present. The group has declared its willingness to collaborate with the National Federation of Advanced Information Services, http://nfais, the former National Federation of Science Abstracting and Indexing Services, whose members are prominent libraries, but also fully commercial entities, such as EBSCO or Elsevier. This makes it obvious that the working group pays due attention not only to the field of library data, but also to commercial catalogues (databases), as it later clearly declared (its intention to collaborate with the commercial sphere and with actual users).

1.6. Report On the Record

In 2008 the working group presented its report On the Record (Library of Congress, 2008b). The report invites the professional public (“calls it to action”) to participate in solving five highlighted topics. The premises are as follows:

Bibliographical control is not restricted to the cataloguing process, but pertains to all types of materials that are made available by the library to variegated groups of users from all sorts of locations.

The bibliographical universe does not cover only libraries, producers, databases and publishers, but also sellers, distributors and all other possible groups of users, irrespective of frontiers.

The Library of Congress, considering the development of information technologies, should not play the role of a single possible producer of bibliographical records in the United States of America, and should not be perceived as such.

The five fields, summarized by the Report as topics for discussion, are the following:

Improving the effectiveness of bibliographical production by cooperation and sharing.

Opening up the access to types of documents that are unavailable at present.

Accepting the fact that users of bibliographical information are not only persons, but also machines (applications).

Adapting to the present day trends and enabling insertions in the records, such as evaluations of the users.

Getting consistently educated.

The Report brings some essential recommendations. The great majority of them concern the distribution of work and responsibility, and their main objective is to do away with duplication of work, which is costly and entirely superfluous in the light of present technological progress. One of the prerequisites is to stop insisting on further meticulous observance of the American library standards. The existing cataloguing standards should be analyzed and possibly revised so as to be applicable outside the domain of libraries as well. In addition, it is necessary to create conversion programmes that enable data to be shared across different data producers and distributors, so as to meet the needs of all interested parties: not only libraries, but also various information services, such as Amazon and IMDb (a portal providing access to databases of films, TV series and related content).

As regards the future of the formats of bibliographic data, however, there is one prerequisite, one basic change that is indispensable for all of the above: as long as the library world uses the forty-year-old, and necessarily quite unsuitable, MARC 21 format, it cannot cooperate effectively with the other groups of producers and distributors of data and cannot effectively hand over its data outside library systems, and is thus unable to meet the vision of maximum cooperation and distribution. That is why it is necessary to create a future record carrier that enables unhindered communication between library systems and is suitable not only for libraries, but also for various other user communities.

In addition to the report On the Record, the working group also created its web site (http://www.loc.gov/bibliographic-future), and three large working meetings with the professional public took place: a meeting with users of bibliographic data was organized under the leadership of Google, the American Library Association convened a meeting focusing on the structure of data, and the Library of Congress arranged a meeting whose main theme was the economic side of bibliographical systems (Library of Congress, 2008b).

The employees of the Library of Congress, headed by Deanna Marcum, then formulated an official Answer (Library of Congress, 2008b), giving their unambiguous support to the Report. They also sided with a policy supporting open access, bearing in mind in particular small and insufficiently financed institutions. They further call for completing the cataloguing of collections not yet processed and making them available in the online catalogue, and they expect the working group to propose a way of dealing with these collections retrospectively. The text of the Answer goes step by step through the fields singled out in On the Record, analyzing them, adding information on what stage the Library of Congress has reached at the given moment (what it has or has not begun to solve), and possibly suggesting existing solutions. The individual proposals reveal a clear process tending towards the interlinking of the commercial and non-commercial spheres, namely by sharing bibliographical data among libraries and commercial entities such as Amazon, or, for example, by rationalizing work directly in the Library of Congress through analyzing individual operations and verifying whether there is any duplication (be it in the creation of CIP, the assignment of ISSN etc.). Concise technical solutions are then proposed for data sharing, data collection etc.

The report On the Record, however, also provoked negative reactions in the library community, not because of its actual contents, but rather because it left aside the trend of open access, in this case omitting the principle of linked open data. In response to On the Record, Jonathan Gray from the Open Knowledge Foundation formulated a manifesto in which he calls on libraries to open themselves to the world. His demand is based on the premise that bibliographic records, being part of the cultural heritage, should be accessible to the broad public for further use without any restrictions, be it commercial or non-commercial. As examples he mentions the use of library data for creating web sites intended for enthusiastic readers, for preparing all sorts of statistics for scientists, for journalists etc. (Open Knowledge Foundation, 2011). The document bears the signatures of 157 librarians, information experts and private persons from all over the world; employees of Italian universities prevail among the librarians, whereas the rest come from various regions. The common denominator appears to be open access: the signatories are employees of institutions that operate on the principle of open access or declare it publicly. However, no institution as a whole presents itself as an advocate of the initiative.

1.7. Single projects

Since 2011 library science has shown practical progress in the field of linked data. An alternative to the MARC type formats has been under development at the Library of Congress in Washington, and OCLC has also joined in by experimenting with the Schema.org model.

1.7.1. Research of OCLC

In summer 2014 OCLC initiated a survey of how information in the form of linked data is provided by leading libraries, archives, metadata services and digital libraries (OCLC Research). The enquiry consisted of six simple questions:

Who provides data in form of linked data

Examples

Who makes use of such data

Why are data offered in this form

Technical details

Advice from providers

The organizers of this survey obtained a total of 96 relevant responses from 15 countries. Interestingly, some institutions only produce linked data, while others only consume them; about one half are both producers and consumers. The institutions come from the United States of America, Australia, Canada, France, Germany, Ireland, Italy, the Netherlands, Norway, Singapore, South Korea, Spain, Switzerland and Great Britain. Although some of the projects are intended for non-public records, there are some giants among them: the OCLC WorldCat catalogue, containing over 2 billion records (all in the form of linked data), which is at the same time the most frequently used one (with 16 million inquiries per day), the catalogue of the Library of Congress in Washington, and the British National Bibliography. Beyond library catalogues, the Dewey Decimal Classification (with conversion carried out by OCLC) and VIAF (Virtual International Authority File) are also provided in the form of linked data.

The BIBFRAME model of the Library of Congress and the company Zepheira, and Schema.org as used by OCLC (GODBY), rank among the most prominent activities in the field of linked data in the library sphere.

1.7.2 BIBFRAME of the Library of Congress in Washington

In May 2011 the Library of Congress in Washington officially announced the foundation of the Bibliographic Framework Initiative (http://www.loc.gov/bibframe and www.bibframe.org), whose purpose consists in analyzing:

The bibliographical description as such

The actual creation of data

The exchange of data including the protocols of exchange

Subsequently, the aim of this analysis is to replace the MARC type formats with BIBFRAME, the Bibliographic Framework (Library of Congress, 2012d).

In its report A Bibliographic Framework for the Digital Age (Library of Congress, 2011) the Initiative follows in the footsteps of the above-mentioned report On the Record, but at the same time it already refers to the RDA cataloguing rules, which are progressive in many respects, but whose potential cannot be fully utilized in combination with the MARC type format.

Report A Bibliographic Framework for the Digital Age

The report defines the following requirements for the environment of the bibliographic framework:

It should be independent of the cataloguing rules.

Data regarding the entity, authority data, rights data, the material description etc. should be encoded.

Text data linked with URI identifiers should be used (instead of text only).

Cataloguing experts will not work directly with the format (bibliographic framework) the way they were accustomed to doing with the MARC format.

The bibliographic framework will be intended for libraries (institutions) of all categories and specializations.

The MARC 21 format will continue to be maintained for the time being, but it will be supported only with regard to the implementation of the RDA rules.

The framework will be compatible with records saved in the MARC format.

The transfer of records from the MARC 21 format to the new bibliographic environment and vice versa will be enabled.

The report highlights the fact that getting adapted to the web environment and the adoption of principles and mechanisms of linked data as well as RDF as default model will offer the users a simplified access to information, while unlocking the doors of libraries to more effective storage and utilization of data not only now, but especially in the future, as the libraries will utilize the knowledge and skills of experts who are knowledgeable about the recent handling of data and software development. Thus the libraries will get themselves adapted to the present market, while saving their costs.

According to this Report the Library of Congress has specifically allotted its funds to the creation of grants for establishing national and international working groups with the aim of proposing scenarios of collaboration, revising the ontology in use and creating a new one for the description of resources.

BIBFRAME model

BIBFRAME is a conceptual model defining four entities: work, instance, authority and annotation (Library of Congress, 2012d).

Work – a work is defined as a resource reflecting the conceptual essence of the resource being catalogued.

A total of eleven types (subclasses) of work have been described, namely: audio, cartography, data set, mixed material (several types of material, yet not requiring software), moving image, multimedia, notated movement (recorded graphically, e.g. dance), notated music (recorded graphically, not as sound), still image, text and three-dimensional object.

Instance – an individual, material embodiment of a work. Ten types (subclasses) of instance have been described: archival object, collection, electronic document, integrating resource, manuscript, monograph, multipart monograph, print, serial and tactile document.

Authority – an authority is a resource reflecting key authority concepts that have a defined relationship to the work and the instance. Four types (subclasses) of authority have been described: agent (in the sense of a person, institution etc.), place, time and topic.

Annotation – an annotation complements our information about other resources. Five types (subclasses) of annotation have been described: cover art (a reference to the cover), holdings information, reviews, summary (abstract etc.) and contents (in the sense of a table of contents).

Properties of entities

Each of the above entities has the following properties:

a) authorised access point, which is a controlled string of characters serving for identification, such as a unique title or name

b) identifier, which is a controlled string of characters serving to identify the entity unambiguously, such as a URI or ISBN

c) label, which is a text string expressing the value of the property

d) related to, which is any relation between resources (URL, Uniform Resource Locator)

In addition to these common properties, properties specific to the individual entities are defined.
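The four entities and their common properties can be sketched as a small data structure. This is only an illustration of the conceptual model as described above, not the BIBFRAME vocabulary itself: the class layout and the field names `kind`, `creators`, `instance_of`, `annotates` and `body` are assumptions made for this example, and all sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """The four properties the text lists as common to every entity."""
    label: str
    authorized_access_point: Optional[str] = None
    identifiers: List[str] = field(default_factory=list)
    related_to: List[str] = field(default_factory=list)


@dataclass
class Authority(Resource):
    kind: str = "agent"  # agent, place, time or topic


@dataclass
class Work(Resource):
    kind: str = "text"  # one of the eleven work types
    creators: List[Authority] = field(default_factory=list)  # hypothetical field


@dataclass
class Instance(Resource):
    kind: str = "monograph"  # one of the ten instance types
    instance_of: Optional[Work] = None  # hypothetical field


@dataclass
class Annotation(Resource):
    kind: str = "summary"  # one of the five annotation types
    annotates: Optional[Resource] = None  # hypothetical field
    body: str = ""


# Invented sample data with placeholder identifiers.
author = Authority(label="Doe, Jane", kind="agent",
                   identifiers=["http://example.org/authority/1"])
work = Work(label="An invented title", creators=[author])
book = Instance(label="An invented title (print edition)", kind="print",
                instance_of=work,
                identifiers=["http://example.org/instance/1"])
note = Annotation(label="Summary", annotates=work,
                  body="A placeholder abstract.")

print(book.instance_of.creators[0].label)
```

Separating the work from its instances means that a second edition would be a new Instance pointing to the same Work, so the author and subject data are stated only once.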

BIBFRAME format in practice

The testing of the BIBFRAME format, predominantly at American libraries, has already begun. The Bibliographic Framework Initiative has published the following list of participating test libraries: British Library, German National Library, George Washington University Library, National Library of Medicine (USA), OCLC, Princeton University Library and Library of Congress. The result was the creation of the BIBFRAME Vocabulary, which is being continuously improved, and also the conversion of a few million records to the BIBFRAME format.

All necessary materials and applications for the conversion of data are accessible for free at the web site of the Library of Congress, Washington; libraries may use them for their own purposes. In 2015 the Library of Congress also announced the BIBFRAME Editor (available at the same place), an editor intended for direct cataloguing into the BIBFRAME structure. The editor contains prepared templates for processing in accordance with RDA rules for monographs, musical materials, series, cartographic documents, Blu-ray/DVD and audio CDs. These categories always offer the choice between instance and work.

At present some American libraries have had their complete catalogues converted into linked data on a commercial basis by the company Zepheira, which participated in the development of the BIBFRAME format. Over three million bibliographic records of libraries such as those of Boston University or the University of Manitoba are in the process of conversion. After the conversion from the MARC XML format the program is accessible on the web, and free.

One of the pioneers in this field in Europe is the German National Library offering bibliographic records corresponding with the RDF standard since 2010. Although this library uses its own publishing model based upon the expansion of the Schema.org model, it recognizes also BIBFRAME.

1.7.3 Schema.org OCLC

OCLC began considering the possibility of presenting its data in the form of linked data at the same time as the Library of Congress in Washington, i.e. in 2011. Unlike the Library of Congress, however, OCLC did not begin developing its own format, but adopted the Schema.org vocabulary. Schema.org is a joint initiative backed by Bing, Google, Yahoo! and Yandex.

This vocabulary was then gradually expanded by an extension that is suitable for libraries. The linked data model developed by OCLC defines entities similar to those of BIBFRAME: work, instance, organization and person. A comparison shows that BIBFRAME has been prepared specifically for the purposes of libraries, being based upon the existing formats and focusing upon achieving compatibility with search engines, so as to make the data/records saved in library databases easily searchable and accessible to users in the way they are accustomed to when looking for information today. Schema.org has no such narrow focus. Its objective is simple searching of data irrespective of their origin; accordingly, it is not bound by any rules for the description of library resources and, compared with BIBFRAME, it is “more flat”. The extension of the publishing model for the library environment is the subject of intensive work by the W3C Schema Bib Extend Community Group.

Representatives of OCLC and of the Library of Congress in Washington have been dealing intensively with the differences between the two schemes and with their compatibility. Although the schemes overlap in their cores (the overlap concerns the extension of Schema.org for libraries, BibExtensions), certain parts of the schemes differ because they were developed for different groups (GODBY).

Schema.org in practice

As mentioned, OCLC provides its records based upon the Schema.org vocabulary through the WorldCat union catalogue. The publishing model of the German National Library also came into being as a modification of Schema.org.
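What a record described with the Schema.org vocabulary might look like can be sketched as follows. The example is an illustration only and does not reproduce actual WorldCat output: the choice of JSON-LD as the serialization, the function name `book_as_jsonld` and all sample values are assumptions. The type and property names used (Book, Person, name, author, isbn, inLanguage) are terms of the Schema.org vocabulary.

```python
import json


def book_as_jsonld(title, author_name, isbn, language):
    """Build a flat Schema.org description of a book as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author_name},
        "isbn": isbn,
        "inLanguage": language,
    }


# Invented sample values; the ISBN is a placeholder, not a real number.
record = book_as_jsonld("An invented title", "Jane Doe", "0000000000000", "en")
markup = json.dumps(record, indent=2, ensure_ascii=False)
print(markup)
```

The description is deliberately shallow: a single Book object carries the title, author and ISBN together, whereas BIBFRAME would distribute them between a work and an instance. This is the sense in which the text calls Schema.org “more flat”.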

1.8 Activities of libraries

In Sweden the problems of linked data are dealt with by the Swedish National Library, which is in charge of LIBRIS, the Swedish union catalogue. This catalogue opens access to records from 165 Swedish libraries. Thanks to the independent activity of the Swedish libraries, which began as early as 2008, over 6 million bibliographic records are available to current search engines as linked data (SÖDERBACK). When mapping the options of the MARC format onto RDF, the librarians of the Swedish National Library were guided by a very simple idea: “…it is better to bring something immediately, rather than sticking to the detail and waiting for perfection” (MALMSTEN).

Finland is another Scandinavian country that wishes to open up its catalogues to the world. The Finnish National Library has begun thorough mapping of its records. Its activity is supported by the decision of the Finnish government of 2011, ordering its institutions to enable access to the public information resources. For this purpose the Open Data Programme was declared in 2013. At the present day the Finnish National Library is still at the beginning of this project (2015-2017) that should result both in data in the form of open data, and in complete documentation for the libraries.

Although the local librarians see the future in BIBFRAME, a decision was made in favour of their own structure (HYVÖNEN).

Thanks to pressure from the government, British librarians have also commenced their activities. They started tackling the open data publishing model in 2009. The British National Bibliography, counting well over 3 million records, was chosen for this work and made available in June 2011 (DELIOT).

Among the American university libraries there is a separate project called Linked Data for Libraries (LD4L). This joint project of the Cornell University Library, the Harvard Library Innovation Lab and the Stanford University Libraries has received a two-year grant from the Mellon Foundation amounting to one million USD. The project focuses on creating a model for publishing structured data that fully reflects the special needs of university libraries while being based upon the general BIBFRAME (LD4L).

2 The future

This article aims to give an overview of current trends in exchange formats for bibliographic data and to predict their possible development. The preceding text makes it clear that the life cycle of bibliographic data is changing, or has already changed. Although librarians still catalogue predominantly in MARC 21 and libraries distribute data among themselves in this format, conversion to a linked data structure has been added to the end of this chain. Thanks to this structure, the valuable information created by librarians can finally leave the library catalogues and databases, often called silos today, and reach the open web, where end users (readers) can find the desired item at the first attempt when searching with a search engine such as Google.

The author of this contribution has chosen one of the possible quantitative methods for estimating future development, namely an enquiry by electronic questionnaire. An advantage of this method is that foreign experts can be addressed irrespective of their location.

This opinion poll, in the form of open questions, is the basis of the Delphi method, which makes it possible to ascertain the opinions of a group of experts independently of each other.⁵ The survey was carried out on two research samples, each consisting of a group of professionals: the first was the IFLA Cataloguing Section Standing Committee, the second the members of the BIBFRAME e-mail discussion list. The questionnaire survey was carried out in January and February 2016.

2.1 Questionnaire

A freely accessible Google form was used for my questionnaire, and the link to it was distributed by e-mail. The questionnaire was based on the following hypothesis: the library community tends to consider the MARC 21 format outdated, and frequent calls are heard for its replacement. At the same time, there is a current trend to publish data on the web, including library bibliographic information, using linked data. However, the text above shows that the libraries undertaking this journey have taken different paths. That is why my questionnaire contained three open questions:

How soon will MARC 21 be replaced with a different type of bibliographic data format (in your country)?

Will the linked data format be used for the exchange of bibliographic data?

Will there be one (leading) structure of linked data or many versions developed by libraries?

2.2 Survey sample

As indicated, the questionnaire was sent to two groups of professionals. The first group (sample A) consisted of the fifteen members of the Standing Committee of the IFLA Cataloguing Section, together with the corresponding members, the chair, the secretary and the information coordinator. In total, eighteen persons were addressed.

These members are experts from all over the world (at present the specialists represent Denmark, the Vatican, Egypt, Argentina, France, the Czech Republic and other countries). The IFLA Cataloguing Section deals with cataloguing in the broadest sense, proposing and developing cataloguing rules, guidelines and standards. In pursuing its aims, it collaborates closely with the International Organization for Standardization (ISO). The Cataloguing Section can thus directly influence the codification of standards, and the attitudes and opinions of its members are accordingly important for forecasting the future.

The second group (sample B) consisted of the registered subscribers of the public BIBFRAME e-mail Listserv, which is administered by the Library of Congress in Washington. The list has 1744 members.

2.3 Results

Answers were received from 12 persons of the survey sample A and from 30 persons from sample B.

Question No 1 – How soon will MARC 21 be replaced with a different type of bibliographic data format (in your country)?

The answers show that most respondents expect a change (35 persons, i.e. 90%).

Tab. 1 Reply to the question whether format MARC 21 will be replaced with a different type of bibliographic data format

	sample A	sample B	total	%
yes	8	27	35	90
no	4	0	4	10
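
The percentages in Tab. 1 can be rechecked with a few lines of Python. This is only a sketch: the helper name `shares` is hypothetical, and it assumes that the percentages are taken over the 39 answers actually tabulated in Tab. 1.

```python
# Counts copied from Tab. 1; `shares` is a hypothetical helper name.
counts = {"yes": {"A": 8, "B": 27}, "no": {"A": 4, "B": 0}}

def shares(table):
    """Return each answer's total and its rounded share of all tabulated answers."""
    totals = {answer: sum(by_sample.values()) for answer, by_sample in table.items()}
    grand_total = sum(totals.values())
    return {answer: (n, round(100 * n / grand_total)) for answer, n in totals.items()}

result = shares(counts)
print(result)  # {'yes': (35, 90), 'no': (4, 10)}
```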

Sample A

38% of the respondents from sample A believe that the replacement of MARC 21 with another type of format is already under way. A further 37% expect its replacement within 10 years.

Tab. 2 Reply of respondents from sample A to the question how long it may take to replace the MARC format with another type of bibliographic data format

	sample A	%
in process already	3	38
5 years	2	25
10 years	1	12
later	0	0
different	2	25
total	8	100

Sample B

15% of the respondents from sample B think that the replacement of MARC 21 with another type of format is already under way. A further 67% expect its replacement within 10 years.

Tab. 3 Reply of respondents from sample B to the question how long it may take to replace the MARC format with another type of bibliographic data format.

	sample B	%
in process already	4	15
5 years	10	37
10 years	8	30
later	4	15
different	1	3
total	27	100

One respondent added that in her home library (the Library of Congress in Washington) the linked data structure is also being used, in a test regime, for the primary creation of records (cataloguing). A total of four respondents make the transition conditional on an impulse from one of the leading libraries (the Library of Congress in Washington or the British Library).

Question No 2 – Will the linked data structure be used for the exchange of bibliographic data?

Only 5 respondents gave a strictly negative answer to this question. A further two respondents (from sample B) replied no, arguing that linked data are intended for publishing information, not for exchanging it in this form. The remaining respondents (35) gave a positive response, and two of them added that there would always be a choice among different formats. One respondent said that he did not know. The responses from sample A suggest that IFLA may act in support of applying the linked data structure to the exchange of bibliographic data as well.

Tab. 4 Response to the question whether the linked data structure will be used for the exchange of bibliographic data

	sample A	sample B	%
yes	11	24	83
no	0	5	12
don’t know	0	1	2.5
different	1	0	2.5
total	12	30	100

Question No 3 – Will there be one (leading) structure of linked data, or different variants developed by libraries?

The respondents mostly agreed on this question. Only eight of them (19%) replied unambiguously that there would be a single model of linked data (YES in the table), and the answers received suggest that BIBFRAME prevails. One respondent did not know and one did not answer. The remaining participants (32 respondents, i.e. 76%) shared the opinion that there would be a number of different versions. Most respondents supplemented their response with opinions that are very interesting for the purpose of this paper; many of them believe that BIBFRAME may win, but that it may exist in diversified local versions or modifications according to its actual purpose. Among the responses of sample A, i.e. the persons responsible for the codification of standards, only two participants replied that one single standard can be expected. It therefore seems at present that IFLA is unlikely to press for unanimity.

Tab. 5 Response to the question whether there will be one (leading) linked data structure

	sample A	sample B	%
yes	2	6	19
no	9	23	76
don’t know	0	1	2.5
different	1	0	2.5
total	12	30	100

2.4 Conclusion of the enquiry

The responses to the questionnaire allow us to answer the question whether MARC 21 is really being replaced, or will be replaced, by another bibliographic data format. Most respondents (90%) agree that it either is being replaced already or will be replaced within 10 years at the latest. 83% of the respondents think that MARC 21 will be replaced with the linked data structure. Only two participants point out that the linked data structure is intended for publishing bibliographic information rather than exchanging it. Most respondents (76%) believe that diversified structural variants can be expected.

It is interesting to look at the responses of sample A, i.e. the members of the Standing Committee of the IFLA Cataloguing Section, who can directly influence the creation of cataloguing standards. Since the majority of them see the linked data structure as the candidate to replace the MARC 21 format for exchanging bibliographic data, activities in this direction can be anticipated. On the other hand, the respondents do not agree on a possible unification of the structure, so pressure for structural unification is unlikely, at least for the time being.

3 Summary

Once again, we should remind ourselves what purpose exchange formats of bibliographic data serve in librarianship: recording data and transferring (exchanging) them between bibliographic agencies and various institutions. (KTD) It already seems obvious that these two functions may become separated, or at least that a further function, the publishing of bibliographic data on the web, is splitting off. Although the library community has long discussed the obsolescence of the MARC format, and in spite of the results of the above enquiry, this structure may be retained, at least for the routine storage (cataloguing) of data and their exchange between library systems. This applies even though the Library of Congress in Washington is already testing the linked data structure for cataloguing purposes. The reasons most frequently given are the unpreparedness of existing library systems⁶ and the lack of funds for carrying out the change.

Most libraries following developments in this field, whether as consumers or as developers, expect to use linked data solely for presenting data on the web (i.e. not for exchanging bibliographic records between libraries) and for opening their collections to users directly through the search interfaces of global search engines. Some freely accessible conversion programmes are already in use for this purpose; alternatively, commercial conversion systems are offered by companies that took part in the development. In a survey organized in 2014, the French National Library found that once it had made its catalogues accessible to search engines, a full 80% of all queries arrived this way rather than via the OPAC. The same survey also found that these users mostly had no idea of the website of the library catalogue in question. (ADAMICH, 2015b)

The leading linked data models are BIBFRAME (the Bibliographic Framework), developed under the auspices of the Library of Congress in Washington, and Schema.org, which OCLC has implemented for its WorldCat catalogue. An extension of Schema.org for the library domain is under way. However, criticism of the slow progress, among other reasons, is leading other American and European libraries to experiment on their own, which gradually gives rise to various local versions. Since the institutions involved can be expected to collaborate, we may anticipate the desirable compatibility.

Reinhold Heuvelmann of the German National Library predicts that the last library systems supporting the MARC format will become extinct in 2060. Before that, however, an article will be published in 2047. Its title? BIBFRAME must die.⁷ (HEUVELMANN)

The editors did not intervene in the method of citation in this article.

Footnotes

¹MARC – Machine Readable Cataloguing, format created in its first version at the Library of Congress in Washington with the purpose of providing bibliographic data in a machine readable shape, to be distributed on magnetic tape to American libraries, for enabling them to make their own print-outs to cards. The name MARC 21 means MARC for the 21st century.

²Technological basis for exchanging data in the web environment as an application of XML

³Internet protocol designed for interchanging hypertext documents in the web environment

⁴XML – Extensible Markup Language, a type of language that is suitable for exchanging data among applications and for publishing documents. MARC/XML is an application of a XML scheme that was created by the Library of Congress in Washington for the purpose of exchanging bibliographic data between systems or their publishing. The conversion between the format MARC 21 and MARC/XML is free from any loss and can be implemented with the free accessible MARC Tool Kit that is available on the Library of Congress web (http://www.loc.gov/standard/marcxml/).

⁵This article makes use of the responses received in the first round of asking questions. For the purpose of the dissertation thesis the ascertained results will be submitted to the respondents of the survey sample A (sample B having been anonymous) with the request to clarify or correct them, as case be. The received answers ought to be gradually correlated. An agreement will be considered as a prediction of future development.

⁶The analysis of the preparedness of the library systems is part of the envisaged dissertation thesis. In the opinion of the author the implemented survey shows the library systems as unprepared for the change at the moment.

⁷The title is a reference to the famous article by R. Tennant, "MARC must die".

Map of study profiles: comparing the curricula of Library and Information Science in Opava

Michal Lorenz — 2016-11-02T12:15:00Z

Summary: The paper presents a comparative analysis of two independent programmes of Library and Information Science offered at the Silesian University in Opava. The aim of the comparative study is to reveal the extent to which the two closely related fields compete with one another. The study programmes were analyzed by the method of curriculum mapping using shaded Venn diagrams. The visualized study profiles are compared from the viewpoint of professional identity, diversity of the fields of study, interdisciplinary features, and the scientific character of the curricula. The comparison identifies the domains whose development can make the programmes more competitive.

Printing works Kryl & Scotti / Karel Kryl in the mirror of bibliophile media and competitions

Michal Mocek — 2016-11-02T12:15:00Z

The study attempts to evaluate, in a broader context, the contribution of the printing works Kryl & Scotti (later Karel Kryl), located in Nový Jičín and later in Kroměříž, to the creation of beautiful books in the Czech lands. It is based on a statistical assessment of in-depth probes into book reviews published in bibliophile periodicals between 1925 and 1946, on the results of competitions for the most beautiful Czech books in the period 1929‒1942 and, finally, on the extent of interest shown by various publishing companies in the work of this printing house. The analysis of the data indicates that hardly any other Czech or Moravian printing company, with the sole exception of Průmyslová tiskárna (Industrial Printing Works) in Prague, produced books that aroused keener interest among demanding book reviewers. In the prestigious competitions for the most beautiful Czech books, however, Kryl & Scotti / Karel Kryl lagged behind the leading Prague printing works, namely the State Printing Works, the Industrial Printing Works and the printing works of Orbis.

Relationships of information resources: an attempt to interdisciplinary synthesis

Helena Kučerová — 2016-05-23T10:50:00Z

Keywords: relationships, information resources, structures, interaction, equivalence relationships, hierarchical relationships, associative relationships

PhDr. Helena Kučerová / Ústav informačních studií a knihovnictví FF UK v Praze (Institute of Information Studies and Librarianship, Faculty of Arts, Charles University in Prague), U Kříže 8, 158 00 Praha 5 - Jinonice

"Classical science in its diverse disciplines [...] tried to isolate the elements of the observed universe [...] expecting that, by putting them together again,[...] the whole would result and be intelligible. Now we have learned that for understanding not only the elements but their interrelations as well are required."

Ludwig von Bertalanffy: General system theory (1968)

Introduction

Any attempt to address relationships among information resources runs into two major obstacles at the outset: both the information resource and the relationship are difficult to define. The difficulty in defining an information resource stems not only from the broad scope of the concept but also from the differences in how disciplines understand it, ranging from the narrow documentary viewpoint of the memory institutions to the almost unlimited approach of the Semantic Web. For instance, D. Allemang and J. Hendler declare that "in the Semantic Web we refer to the things in the world as resources; a resource can be anything that someone might want to talk about."¹ The substance of this statement is in tune with the definition of an information resource in TDKIV as an "information object, containing available information complying with the information needs of the user."² The continuing part of that definition seeks precision by enumeration, stipulating that "an information resource can be printed, audiovisual or electronic (including resources available online)", and thus in effect restricts the spectrum of relevant information to documents. For the purpose of this study, however, we accept the approach represented by the present-day concept of the Semantic Web. Information resources are understood in the broadest sense as documents and data, but also persons, things, concepts, terms, processes, events or services supplying information.

The difficulty in defining a relationship lies in the fact that it is one of the most general categories, with no superordinate category above it; the classic Aristotelian definition, which places the defined item in the nearest superordinate category and then states its specific differences, therefore cannot be applied. An example of understanding the relationship as the most general category is the conception of A. Wierzbicka, a representative of present-day cognitive linguistics. In her list of so-called elementary semantic units, i.e. simple indefinable notions universally present in all languages of the world, she gives three essential representatives of relationships: type (taxonomy), part (partonomy) and likeness (similarity).³ A relationship is thus a philosophical category, but in its applied form it also belongs to the toolkit of practically every scientific discipline. Psychology explores relationships between persons, their interactions and roles, whereas sociology concentrates on relationships in social networks and relationships of collaboration. Linguistics and terminology focus on relations between words and verbal expressions; semantics studies the relationships between concepts. The rules of correct reasoning and derivation based on relationships within formalized statements are the domain of logic. Graph theory offers a theoretical foundation for the "materialization", visualization and exploration of relationships. Informatics deals with the technological implementation of relationships in digital environments. Relationships are, of course, also of interest to various applied and engineering fields; one example is business informatics, which specializes in managing customer data, in particular CRM (customer relationship management).

A concept used this broadly is accompanied by a lack of terminological unity. What this study calls a relationship is given other names by some authors, such as predicate or property, whereas relationships on the web are usually called references or links. Especially in the context of information technologies, attention should be paid to the multiple meanings of the term relation. In addition to its everyday sense it can denote the mathematical relation as a subset of a Cartesian product, which is the theoretical foundation of relational databases. Ordinary natural language also reflects the broad extension of the concept of relation, which is why it can be defined only in very general terms, for example by a nominal definition (restating it in other words). For instance, the Dictionary of Standard Czech offers the following explanation: a relation is "an interconnection, a continuity or coherence between phenomena, a ratio", while the Dictionary of the Czech Language defines a relation as "the circumstance that somebody or something has some connection or link with somebody or something else".⁴

This study aims to summarize the theoretical principles of relationships formulated by various disciplines, and then to synthesize them and apply them to the field of information resources. The text is divided into three parts. The first part provides a general delimitation of relationships, characterizes the ways in which they are expressed and reviews their properties. The second part offers a selection of notable contributions to the research of relationships from diverse scientific disciplines. The third part contains a review of relational taxonomies within information science and a draft of our own framework taxonomy, along with suggestions for further exploration.

1 General characteristic of relationships

Of the many field-specific concepts of relationships, the systems approach appears best suited to the purposes of this study. Indeed, general system theory, or systems science, has been the driving force behind the emphasis on the interrelationships of the elements of the entities under scrutiny; in contrast to the Newtonian mechanistic approach, which tries to understand a complex whole by splitting it into smaller parts, it has highlighted the study of the relationships between these parts. The image of the world based on the systems approach consists of entities (elements), functions (processes), their features and their mutual relationships. Unlike concrete Newtonian analysis, which physically separates the individual elements to be investigated (e.g. by filtering liquids), the systems approach is abstract; it focuses on logical rather than physical elements, and analysis and synthesis are carried out by mental processes. Just as systems do not exist independently and are products of the human mind, relationships within a system are also mental artefacts and constructions, deliberately created in the process of cognition, representation or the design of a new reality.⁵ We should also bear in mind that the role of relationships within a system is not only static, in the sense of enabling a certain structuring; relationships also have dynamic and interactive functions.

1.1 Ways of expressing relationships

Relationships encountered in practice can be expressed with various degrees of formalization. 1. A tacit, unexpressed relationship exists only on a material level. A few examples of entities with unexpressed relationships: "Peter", "George", "heart", "man", "book", "lending". 2. An implicit, verbally expressed relationship uses the means of natural language. In such formulations the relationships are so closely integrated with the other elements of communication that they form compact wholes which need to be analyzed. Examples of implicitly expressed relationships: "George's son Peter has been lent a book." "The man George has a heart." "A lost book cannot be lent." 3. An explicit, logically and unambiguously expressed relationship is clearly separated from the other elements of communication. It can be expressed verbally, by statements in a formal language, or graphically. Examples of explicitly expressed relationships can be seen in Fig. 1.

Fig. 1

As the examples in Fig. 1 show, both static and dynamic relationships can be formalized. The static relationships are expressed graphically with a class diagram of the UML modelling language, described more closely in part 2.5; the dynamic relationships are expressed by a statechart diagram and an activity diagram in the same language. A simple formalized language is used for the textual expression of both types of relationships, with the relationships designated by terms enclosed in angle brackets. The formalization of explicit textual notation can continue down to the level of agreed symbols, e.g. the symbols of formal logic or of programming languages. The advantage of such notation is that it enables automatic inference, i.e. deriving new statements from existing ones thanks to the unambiguous and explicit expression of relationships, which in turn creates new knowledge. An example of inference based on the short sentences above: "Since every human has a heart and Peter is a human, Peter also has a heart". Whereas a human with natural intelligence can derive such a finding even from an implicit expression in free text (and a human who knows the unexpressed context and is endowed with intuition could manage it even with the material in the first example, where relationships are not expressed at all), computer applications based on artificial intelligence require relationships to be expressed explicitly, fully and unambiguously. Yet another method of formalization is the mathematical approach, which abstracts relationships between elements to a level that allows their quantitative expression. A quantitative expression of a relationship is also useful, as it enables processing by tools to which the intensional semantic characteristics of the relationship are unintelligible. In addition, quantitative values allow, under certain conditions, qualitative conclusions to be drawn from quantitative data.
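
The inference example from the paragraph above can be sketched in Python. This is a minimal illustration, not the notation used in Fig. 1: the triple layout and the relationship names `is_a` and `has` are assumptions chosen for this sketch.

```python
# Explicit statements as (subject, relationship, object) triples.
# The relationship names "is_a" and "has" are illustrative assumptions.
facts = {
    ("Peter", "is_a", "human"),
    ("human", "has", "heart"),
}

def infer(triples):
    """Derive new 'has' statements: if X is_a C and C has P, then X has P."""
    derived = set()
    for subject, rel, cls in triples:
        if rel != "is_a":
            continue
        for owner, rel2, prop in triples:
            if rel2 == "has" and owner == cls:
                derived.add((subject, "has", prop))
    # Keep only statements that were not stated explicitly before.
    return derived - triples

new_facts = infer(facts)
print(new_facts)  # {('Peter', 'has', 'heart')}
```

The rule only fires because both statements are explicit and unambiguous; nothing could be derived from the bare list "Peter", "heart", "man".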

An analysis of explicitly expressed relationships shows that fully formalized relationships are composed of three key components: 1. the relationship itself, i.e. the connection (e.g. "is", "has", "paternity", "lending"), 2. the thing (participant) in the relationship (e.g. "man", "Peter", "George", "heart", "book") and 3. the role of a thing in the relationship (e.g. "son", "father", "follows after", "precedes").⁶
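
The three components can be made tangible with a small data structure. The class and field names below (`Relationship`, `Participation`, `connection`, `thing`, `role`) are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Participation:
    thing: str  # the participant, e.g. "George"
    role: str   # the role it plays in the relationship, e.g. "father"

@dataclass(frozen=True)
class Relationship:
    connection: str                          # the relationship itself, e.g. "paternity"
    participants: Tuple[Participation, ...]  # things together with their roles

paternity = Relationship(
    connection="paternity",
    participants=(
        Participation(thing="George", role="father"),
        Participation(thing="Peter", role="son"),
    ),
)

# Look up which thing plays which role.
roles = {p.role: p.thing for p in paternity.participants}
print(roles)  # {'father': 'George', 'son': 'Peter'}
```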

1.2 Definition of a relationship through the mediation of its features

As with any entity to which the systems approach is applied, we will define relationships by means of their properties. The importance of relationships for investigating reality, indicated in the motto of this paper, does not mean, however, that they can be handled in a fully abstract way. For a relationship to make sense, we need to know which "things", i.e. elements or processes, it joins. At the most general level, all entities can be divided into two groups for the purpose of defining their relationships: abstract concepts (categories, classes, sets of objects) with abstract relationships on the one hand, and concrete individuals or instances (single objects) with concrete relationships on the other. The entities of both groups are intelligible, definable and describable thanks to their properties (i.e. intensions). Some properties can be shared by several entities at a time, and it is exactly these that create the semantics of what we call a relationship.

For the purpose of our review of selected features of relationships we will consider two groups of properties: formal (extensional, syntactic) and content-oriented (intensional, semantic). The formal properties comprise symmetry, direction, degree, multiplicity, obligatory membership in the relation, transitivity and dynamics. The content properties, namely the permanence of relationships (paradigmatic/syntagmatic relationships) and the semantics of relationships, are given separate parts of this study (2.2 and 2.3) in view of their importance. Besides being generally important for understanding a relationship, both the formal and the content properties directly affect the way relationships are represented (i.e. instantiated) in computer systems.

Symmetry of relationship: When determining this property, we examine the roles of the entities participating in the relationship. If their roles are equal, we declare the relationship as symmetric, whereas if they play different roles, the relationship is asymmetric. For instance the relationship of paternity is asymmetric, whereas the relationship of siblings is symmetric.
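
One possible way to operationalize this property, assuming a relationship is stored as a set of ordered pairs, is the usual set-theoretic test: the relation is symmetric if every pair also occurs reversed. The names `Paul`, `sibling_of` and `father_of` are illustrative assumptions.

```python
def is_symmetric(pairs):
    """True if for every (a, b) in the relation the pair (b, a) is present too."""
    return all((b, a) in pairs for a, b in pairs)

# Siblings play equal roles, so the relation holds in both orders.
sibling_of = {("Peter", "Paul"), ("Paul", "Peter")}
# Father and son play different roles, so only one order is recorded.
father_of = {("George", "Peter")}

print(is_symmetric(sibling_of))  # True
print(is_symmetric(father_of))   # False
```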

Direction of relationship: We differentiate between one-way and two-way relationships. For instance, the relationship lending – losing is unidirectional, whereas the relationship lending – returning is bidirectional.

Degree of relationship: The degree of a relationship, i.e. the number of entities entering into it, is designated by the terms arity, dimension or degree of relationship, and sometimes also valence. In general, relationships are designated as n-ary (n-dimensional), n being the number of participating entities. A relationship with one entity is called a unary relationship. It can be understood in two ways: 1. as a relationship between instances or individuals of the same class, for instance the relationship between the first and second edition of the same title; it can also be called iteration or recursion, and in graph theory the edge representing such a relationship is called a loop; and 2. as a relation between a class and its property (also called a unary predicate); a unary relationship in this sense is represented, for instance, by the statement "the book has a dimension (e.g. 20 cm)". A relationship between two entities is called binary (with two members, double, two-dimensional); a relationship between three entities is, by analogy, called ternary (with three members, triple, three-dimensional), and so on.
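
As a small sketch, the degree can be read off as the number of entity types taking part in a relationship. The relationship names and the tuple representation below are assumptions made for illustration.

```python
ARITY_NAMES = {1: "unary", 2: "binary", 3: "ternary"}

def arity_name(participants):
    """Name the degree of a relationship from the number of participating entities."""
    n = len(participants)
    return ARITY_NAMES.get(n, f"{n}-ary")

# Hypothetical relationships, each listed with its participating entity types.
examples = {
    "has_dimension": ("book",),               # class and its property
    "reads": ("reader", "book"),
    "lends": ("library", "reader", "book"),
}

names = {rel: arity_name(parts) for rel, parts in examples.items()}
print(names)  # {'has_dimension': 'unary', 'reads': 'binary', 'lends': 'ternary'}
```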

Multiplicity of relationship: The number of elements, in the sense of instances or individuals of a concrete class, participating in an abstract relationship is called multiplicity or cardinality. A number larger than 1 is as a rule not expressed numerically but by a generalizing symbol, such as N or M. Relationships are accordingly differentiated as 1 : 1 (one–to–one), 1 : N (one–to–many), N : 1 (many–to–one) and N : M (many–to–many). Taking the asymmetric, bidirectional and binary relationship "a reader reads a book", cardinality 1 : 1 would mean that one reader reads just one book, cardinality 1 : N would allow one reader to read several books, cardinality N : 1 would express a situation in which several readers read one book, and cardinality N : M would apply when one reader reads several books and, at the same time, one book is read by a number of readers.
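
The four cases can be illustrated on recorded (reader, book) pairs. Note the assumption: the sketch classifies the instances actually observed, whereas cardinality in a data model is a constraint declared in advance; the function name `cardinality` and the sample data are hypothetical.

```python
from collections import Counter

def cardinality(pairs):
    """Classify a binary relation given as (left, right) pairs as 1:1, 1:N, N:1 or N:M."""
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    # One left entity (reader) linked to several right entities (books)?
    left_many = any(c > 1 for c in left_counts.values())
    # One right entity (book) linked to several left entities (readers)?
    right_many = any(c > 1 for c in right_counts.values())
    if left_many and right_many:
        return "N:M"
    if left_many:
        return "1:N"
    if right_many:
        return "N:1"
    return "1:1"

print(cardinality({("Ann", "Book 1")}))                     # 1:1
print(cardinality({("Ann", "Book 1"), ("Ann", "Book 2")}))  # 1:N
print(cardinality({("Ann", "Book 1"), ("Bob", "Book 1")}))  # N:1
```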

Obligatory membership in a relation: Another quantitative parameter of a relationship is the so-called obligatory participation. It is verified whether a partner entity may be absent (does the occurrence of one entity require the occurrence of the other entity; for instance, must every book have its reader?). According to the result, the relationship is then designated as obligatory (total, full) or optional (partial).

Transitivity of relationship: The transitivity or transferability of a relationship can be expressed by the formula: if A→B→C, it is valid that A→C. Sometimes also a differentiation is possible between the transitivity of a relationship and the transitivity of properties of the participating entities. For instance in the FRBR model wherein the work D has been realized by expression V, and this expression V has been embodied by manifestation P, it can be derived that P is not only a manifestation of expression V, but also a manifestation of work D. The transfer of semantics of an abstract relationship of classes to the concrete relationship of their instances can serve as another example of transitivity.
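
The formula "if A→B→C, then A→C" can be sketched in Python as the computation of a transitive closure. The sketch assumes, as a simplification, that the FRBR links "is embodied by" and "is realized by" are treated as one generic derivation relation; the function name `transitive_closure` and the identifiers P, V and D are used for illustration only.

```python
def transitive_closure(pairs):
    """Return the smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    while True:
        # Derive (a, d) whenever (a, b) and (b, d) are both present.
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived

# Stated links: manifestation P -> expression V -> work D.
stated = {("P", "V"), ("V", "D")}
closed = transitive_closure(stated)

print(("P", "D") in closed)
```

The pair (P, D) is not stated explicitly; it is inferred from the two stated links, which is the sense in which P can be regarded as a manifestation of work D.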

Dynamics of a relationship: As mentioned above, relationships can be divided into static and dynamic ones, in line with the systems approach. Static relationships depict the relations of elements in a system. They are also called structural relationships, as they allow us to describe the structure of a given system (i.e. a relatively stable arrangement of elements), and thus to understand the sense of the things it comprises. This is typically achieved by constructing a conceptual system representing concrete things within their context. Dynamic relationships depict the relations between the processes in a system. They can show development and changes in the course of time, for example by means of process models based upon a network diagram. This diagram depicts the dynamic relationship as an edge or a path; in combination with nodes standing for the participants of the relationship, diverse sequences, splitting and joining of processes can be represented. Dynamic relationships are also called interactive, since they enable communication and interaction with the resource (for instance, a hypertext link allows the browser to "read" the resource, i.e. ensures access to the resource).

All of the above formal features can be combined⁷ with one another and, accordingly, any relationship can be specified by stating the relevant value of each of these properties. Thus, by way of example, the relation "Man has a heart" in Fig. 1 can be described as an asymmetric, bidirectional, binary, static relationship 1 : 1 with optional participation of the entity man and obligatory participation of the entity heart (based upon the fact that every human has a heart, but a heart need not always be a component part of a man). The relationship "has a heart" is transitive in the direction of the relationship "Peter is a man", but it is not transitive in the direction of the relationship "Man borrows a book" (the circumstance that the book is borrowed by a man who has a heart does not mean that the book also has a heart).

2 Review of approaches to the examination of relationships in diverse disciplines and fields

2.1 Semantics of relationships in the semiotic triangle

A well proven instrument for modelling the function of a sign in understanding the world and in communication, i.e. reflecting reality in the thought and its lingual expression, is the semiotic triangle. The opinions of protagonists of different theories in linguistics and semiotics as to the delimitation of the key components of the semiotic triangle are not unified. For the purpose of this text we have selected such choice of relationships and their interpretations that can be applicable in the follow-up for considering the relationships among information resources. For the benefit of compliance with the approach used in the terminological systems as well as the knowledge organization systems summarized in part 2.3 the apexes of the triangle are occupied by entities "Thing", "Concept" and "Designation". The relationships are marked with arrows and numbered. As obvious in Fig. 2 there are both relationships between diverse types of entities, and between the entities themselves. The first group are binary asymmetric bi-directional relationships between things and concepts, concepts and designations, and designations and things. The second group comprises recursive relationships of concepts, designations and things that can be both symmetrical and asymmetrical.

Figure 2

Comments to the relationships shown in the semiotic triangle in Fig. 2:⁸

1. The relationship thing – concept is called conceptualization, i.e. a conceptual expression of reality. The thing in this relationship is introduced as an empirical model of the concept. The concept, accordingly, functions as a model of the thing in the sense of its representation. The sum of the characteristics of the thing comprised in the concept is called intension. The relationship thing – concept can have any cardinality, from 1 : 1 (1 thing – 1 individual concept), through N : 1 (a number of things – 1 general concept), to the relationship of semantic heterogeneity 1 : N (1 thing – a plurality of concepts).

2. The relationship concept – thing is designated as instantiation or as illustration. The thing in such a relationship figures either as a physical instantiation (embodiment) of the meaning of the concept or as its exemplification (introduction of an example). The concept has the function of a model in the sense of its template (plan). The set of things represented by the concept is called extension. The relationship can have cardinality 1 : 1 (1 individual concept – 1 thing) or 1 : N (1 general concept – a plurality of things).

3. The relationship concept – designation is called expression, or possibly also name. E. Svenonius uses the term relational semantics⁹ for these cases. The relationship with cardinality 1 : 1 is called mononymy (1 concept – 1 designation), the relationship 1 : N (1 concept – a number of designations) is called synonymy. Synonymy has an adverse impact upon the recall rate in the retrieval systems.

4. The relationship designation – concept is usually called meaning or sense. According to E. Svenonius, the relationships designation – concept create the so-called reference semantics. The relationship with cardinality 1 : 1 is called monosemy (1 designation – 1 concept), whereas the relationship 1 : N (1 designation – a plurality of concepts) is called polysemy or homonymy. Polysemy/homonymy has an adverse effect upon precision in retrieval systems.

5. The relationship designation – thing is called denotation, if such designation is related to a class of objects, or reference, if it is directed to an individual object.

6. The relationship thing – designation is called representation.

7. The relationship concept – concept has abstract character and is called conceptual relationship. Relationships describing hierarchical relationships between concepts with semantically broader and narrower scope, as well as associative concepts, are interesting from the viewpoint of handling semantic problems.

8. The relationship designation – designation, considering the one with highest frequency among the designation systems, is usually called lexical. The mutual hierarchical designation relationships are called hyperonymy – hyponymy and holonymy – meronymy, and troponymy is the usual designation for the hierarchical relationships of verbs.

9. The relationship thing – thing is concrete; we mention it for the sake of completeness. Much the same as concepts reflect the properties of things, the mutual relationships between concepts should be based upon the ascertained relationships of things.

2.2 Paradigmatic and syntagmatic relationships in the language

The concepts of paradigm and syntagm, together with the concepts of synchrony and diachrony, are seen as the pillars of modern linguistics. The classical linguistic theory of Ferdinand de Saussure recognizes the paradigmatic relationships (designated by Saussure as associative) as the basic structure of what he calls "langue" (language, system), whereas the syntagmatic relationships structure the so-called "parole" (speech, text).¹⁰

Paradigmatic is the designation given to relationships whose meaning is relatively independent of the context. They serve to construe generalized semantic systems for multiple utilization (such as thesauri, intended to express paradigmatic relationships according to the standard ISO 25964-1) by using lingual means to express units of contents, i.e. concepts.

Syntagmatic (syntactic, contextual) is the designation given to relationships among a plurality of lingual elements in a concrete expression (such as in a sentence, in a query, in a heading or entry). In contrast to paradigmatic relationships, they serve to construct unique systems or ad hoc units whose meaning varies in accordance with the context in which they have been applied. In other words, they enable the repeated use of semantic elements of the language in diverse connections. Naturally, the possibility of joining a certain lingual element with other elements so as to create a meaningful whole is not unlimited. For instance, the verb "to open" can be combined with the words "book, (computer) file, publication, journal, newspaper, letter, folio", but it is less suitable for the words "picture, film, Internet, interview, billboard, author, Police, TV channel ČT1, paper, radio, document", although all these words concern information resources.

Sometimes we can come across the designation of paradigmatic relationships as semantic ones, but also syntagmatic relationships have a certain semantic dimension. Due to frequent usage some combinations that were formed ad hoc originally can "grow together" to the extent of not being perceived as the result of some combination, but as a separate semantic paradigm (such as "offside" etc.). The interconnection of both types of relationships was established by Saussure¹¹ already and Roman Jakobson followed in his footsteps. In his study "Two aspects of language and two types of aphasic disturbances" Jakobson maintains that there are two references to interpret the sign – the one is selection, i.e. choice of mutually replaceable semantic equivalents within the same language (Jakobson uses the term "code"), and the other is a combination to form a topical grouping within a certain context.¹² Namely, we can conclude that each element of a lingual expression lies at the intersection of the paradigmatic axis (the so-called axis of equivalence, an option from the given possibilities of expression) and the syntagmatic axis (the so-called axis of combination, of interlinking with the other components of the communication), the actual meaning being derived from both these dimensions.

2.3 Lexically semantic relationships in terminological systems and in the knowledge organization systems

This part offers a summary of the general conception of lexically semantic relationships, as standardized by the international standards for terminological work ISO 704¹³ and ISO 1087-1¹⁴ and by ISO 25964¹⁵ for thesauri and other controlled vocabularies used for organizing knowledge and searching for information. ISO 704, in its 2009 version, is already the third revised edition of this standard devoted to the principles and methods of terminological work; its first edition was dated 1987 and the second 2000. The terminological standard ISO 1087-1 comprises a vocabulary explicating concepts from the field of terminology. Its verbal part is accompanied by an appendix with concept diagrams, illustrating the conceptual relationships of the contained terms in graphic form. It was published in 2000 and its Czech translation, together with national comments, was issued in 2002¹⁶. ISO 25964 is a standard in two parts whose history goes back to the beginning of the 1970s. The 2011 and 2013 editions represent the current culmination of the efforts of many outstanding institutions (such as UNESCO, IFLA, NISO¹⁷, BSI¹⁸) and of experts associated in the subcommittee of the technical committee ISO/TC 46 Information and documentation to set the rules and methods for the design, use and administration of thesauri. It reflects the changes brought about by the introduction of information technologies in this field: software applications for the creation and use of thesauri, full-text searching technology etc. The scope of the standard has been widened from the original strict focus upon the design of thesauri towards a more general conception, applicable to a broader spectrum of types of controlled vocabularies and other knowledge organization systems. The second part of the standard, focusing upon interoperability, has separate chapters dealing with classification schemes, taxonomies, subject heading schemes, ontologies, terminologies, name authority lists and synonym rings.

All the above standards are examples of practical application of the semantic principles indicated in parts 2.1 and 2.2 of the present study, namely the work with concepts and terms. Their character makes them suited for practical activities: ISO 704 and ISO 1087-1 for the creation of terminological conceptual systems, and ISO 25964 for the development and administration of knowledge organization systems. There is a conspicuous similarity between these groups of standards, both as to the types of relationships they cover and as to the description of their semantics, obviously due to their empirical character based upon the direct practice of working with concepts. They also agree in the pragmatic expression of the purpose of explicitly describing relationships in terminological systems and knowledge organization systems, which is a certain disambiguation, i.e. the conversion of the relationships between the designation¹⁹ and the concept to the 1 : 1 ratio. The prerequisite is not only a solution to the problems of 1 : N relationships, i.e. of synonymy and homonymy or polysemy, but also to the problem of the indefinite meaning of quite a number of concepts; integrating a vague concept into some concrete context can help here (for instance, the meaning of the word "good" will obviously differ in the context of ethics, in the system of university classification or in the field of culinary descriptions). Both terminological standards and ISO 25964, devoted to thesauri, focus upon the three most important lexically-semantic relationships: equivalence, hierarchy and association.

2.3.1 Relationship of equivalence

The relationships of lexically-semantic equivalence in terminological systems fall into the category of designation. ISO 1087-1 defines equivalence as a relationship between designations in different languages representing the same concept. ISO 25964 differentiates, in knowledge organization systems, between the equivalence of terms in a single-language or multi-language context in the sense of ISO 1087-1, and conceptual equivalence. According to ISO 25964, a relationship of symmetric term equivalence can comprise synonyms and quasi-synonyms, specific terms subordinated to terms expressing a broader concept, and specific terms expressing compound concepts represented by combinations of two or more terms (the so-called compound equivalence). An asymmetric term equivalence is a relationship between a preferred and a non-preferred term, or possibly between a variant or alternative name and a name preferred in a set of authorities. A relationship of conceptual equivalence can be established in the course of the so-called equivalent mapping of concepts between different knowledge organization systems (e.g. between two different thesauri). It is obvious that semantic equivalence in this context does not mean identity; the meanings of the entities participating in such a relationship can differ. The ISO 25964 standard directly specifies degrees of inter-lingual and conceptual equivalence for differentiating the extent of similarity of the participating elements: exact, inexact, partial, broader/narrower, non-equivalence.

2.3.2 Relationship of hierarchy

The standards ISO 1087-1 and ISO 704 as well as ISO 25964 agree in considering hierarchy as a relationship of concepts. In general they delimit it as the relationship of inclusion²⁰ in the sense of comprising the scope of the subordinate concept within the scope of the superior one. ISO 25964 determines that a relation of hierarchy should be based upon degrees or levels of superordination and subordination, wherein the superior concept represents a class or a whole and the subordinate concepts represent members of the class or parts of the whole. It is advisable to define this relationship only among concepts of the same category.²¹ All standards differentiate between generic and partitive hierarchy, and in addition they also define the so-called instance relationship between a class and its instance; they all agree in considering it a specific case of a generic relationship.

The designation generic relationship is derived from Latin (genus – species); further there is a differentiation between general – special (subsumption), generalisation – specialisation /specification, supra-type – infra-type, superclass – subclass, hyperonymy – hyponymy, inclusion (set – subset). Various alternatives can be encountered, such as "is a", "is-a", or "ISA" ("is", possibly "is a type"). ISO 25964 suggests using the aid "all – some" for the test of validity of generic hierarchy. Its application on the pair of concepts vehicles and trains would look like the following: all trains are vehicles, some vehicles are trains. A seemingly analogous pair of concepts, however, trucks and trains, does not pass this test: only some trains are transport vehicles for loads, some load transporting vehicles are trains. Under ISO 704 the validity test of a generic relationship is the existence of inheritance – all subordinate elements should possess attributes of the superordinate elements; in addition to the inherited characteristics, "the offspring" should be endowed by at least one specific delimiting feature.
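
The "all – some" test can be sketched in Python by treating the extension of each concept as a set of things. The function name `passes_all_some` and the small sample extensions are assumptions made for illustration; the standard itself formulates the test verbally, not as a set operation.

```python
def passes_all_some(narrower, broader):
    """Check "all narrower are broader, some broader are narrower".

    Both arguments are extensions of concepts, given as sets of things.
    """
    all_narrower_are_broader = narrower <= broader
    some_broader_are_narrower = bool(narrower & broader)
    return all_narrower_are_broader and some_broader_are_narrower

# Hypothetical extensions of three concepts.
vehicles = {"car", "lorry", "freight train", "passenger train"}
trains = {"freight train", "passenger train"}
trucks = {"lorry", "freight train"}  # vehicles transporting loads

print(passes_all_some(trains, vehicles))  # trains / vehicles
print(passes_all_some(trains, trucks))    # trains / trucks
```

The pair trains and vehicles passes, because every train is a vehicle. The pair trains and trucks fails, because the passenger train is a train without being a load-transporting vehicle, so only some trains fall under the second concept.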

Partitive relation is the relationship between a whole and its part, and is sometimes also designated as holonymy – meronymy. In contrast to generic relationships, partitive ones do not allow inheritance and transitivity, or inference derived from them, to be applied; a part can possess specific features differing from the properties of the whole. This prevents the parts belonging to one whole from being associated by way of common features; the only link is their belonging to the given whole. ISO 25964 recommends specifying partitive relationships as mono-hierarchical, i.e. a part should always belong exclusively to one whole. Only a relatively small circle of entities can fulfil such a requirement, and the above standard enumerates them in part 10.2.3.1: systems and organs of the body, geographical locations, disciplines or fields of discourse, hierarchical social structures.

Relationship class – instance (instance relationship) determines whether an individual object appertains to the given class. This is a difference against the generic and partitive relationships concerning abstract relations between classes in the sense of concepts representing sets of objects. Yet another difference between a class and an instance can be seen in that whereas classes have designations, instances and individuals have proper names.

2.3.3 Relationship of association

All standards define this relationship only very generally and, in principle, by the process of elimination: according to the standards, any semantic relationship that is neither equivalent nor hierarchic can be called associative. ISO 1087-1 suggests the division of associative relationships into sequential ones (follow-up relationships), with the subtype of temporal relationship, on the one hand, and causal relationships on the other. ISO 25964 does not offer any normative typology of associative relationships, but in part 10.3 it indicates some typical examples: associations of terms and concepts with overlapping meaning, a scientific discipline and the object of its studies, a phenomenon, operation or process and their agents or instruments, an action and its product, an action and its addressee and target, objects and materials and their decisive properties, an artefact and its parts, concepts joined by causal connection, an object or process and means against the same, a concept and a unit of measure, a composed term and a noun forming its nucleus, an organism cultivated out of some other organism or a substance derived from another substance.

2.4 Relationships in data structures

Whereas our above deliberations have been carried on mostly upon a theoretical level, the data structures will lead us to the physical instantiation of relations in digital environment. These physical forms of data relationships have direct impact upon the effectiveness of both basic functions ensuring access to the organized data: collocation and navigation. Collocation consists in static aggregation of semantically linked data, whereas the substance of navigation is a dynamic movement upon an existing path leading to semantically relevant data. In the following review we will direct our attention to these basic general types of data structures: linear, tree-like, networked and relational.

Linear or sequential structure is the historically oldest method of data organization. It is closely connected with the restrictions of the "pre-computer" era, namely of oral and written communication, as well as with the restrictions of the early stage of development of information technologies, when data were recorded one after another on magnetic tape and the whole tape had to be wound through sequentially for searching. There is no mutual relationship between such elements except for the order in which they were saved (relation 1 : 1 – each element can have at most 1 following and 1 preceding element, possibly 1 superordinate and 1 subordinate element). The only advantage of a linear structure seems to be the simplicity of its design. Everything else is a drawback. Relationships 1 : N and N : M between the elements cannot be expressed except at the price of data redundancy, which in turn causes trouble during updating; e.g. if we wished to register several books written by the same author in a linearly organized file of bibliographic entries, the author's name would have to be repeated for each further title. Nor is there any way of retrieving a given element directly; the only access is a sequential search through the whole file. A traditional application field of linear structures in information practice is the MARC bibliographic formats, which are founded upon the ISO 2709 standard. Another domain of use is data backup by sequential saving on magnetic tape. Linear structures complemented with sophisticated index files, however, are the basis of today's full-text technologies.

The tree structure enables any data element to be linked by a unidirectional relation, directly or indirectly, with a plurality of elements on a lower level, but only with a single element on a higher hierarchic level. The relationships can be both generic and partitive, and relationships of the type class – instance²² are also possible. On the one hand, data redundancy can thus be reduced (each superior element is entered only once, in spite of being linked with several subordinate elements), and on the other hand data retrieval can be accelerated, because the whole file need not be searched; searching the relevant branches of the tree is sufficient. However, direct access to data on more distant levels is not possible; it is necessary to traverse the complete path leading through the levels in between. Substantial parts of the tree structure are linear chains enabling navigation. ISO 704 calls the linear chains in hierarchic terminological structures the sequence of concepts. Whereas the vertical series of concepts reflects their hierarchic relationships, the horizontal line of concepts having an identical directly superior element comprises a set of coordinate concepts on the same level, which is designated as array in ISO 25964. The tree structure is very well adapted for data that are naturally organized in a hierarchical way, i.e. that demonstrate relationships of subordination and superordination, yet it does not allow N : M relationships between the elements to be expressed easily and without duplication. The speed of data access in the tree structure is utilized in auxiliary index files. Documents of the HTML and XML type (whose structure is based upon the principles of the SGML²³ language) are doubtless the most important application field of the tree structure at the present day.
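
A minimal Python sketch of a tree structure follows. It stores, for each element, a reference to its single superior element, which is one of several possible implementations and is chosen here only for brevity. The concept names and the functions `path_to_root` and `siblings` are assumptions made for illustration.

```python
# Each element refers to exactly one superior element (child -> parent).
parent = {
    "freight train": "train",
    "passenger train": "train",
    "train": "vehicle",
    "truck": "vehicle",
}

def path_to_root(node):
    """Vertical series: follow the references through every level in between."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def siblings(node):
    """Horizontal series: concepts sharing the same directly superior element."""
    superior = parent.get(node)
    return sorted(c for c, p in parent.items() if p == superior and c != node)

print(path_to_root("freight train"))
print(siblings("train"))
```

Reaching "vehicle" from "freight train" requires passing through "train", which illustrates why direct access to more distant levels is not available in this structure.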

The network structure is the only one enabling the expression of bi-directional relationships 1 : N and N : M between the data without redundancy – each element can be linked arbitrarily with all other elements. The search is very quick; not the whole file being under scrutiny, the direct path to the given element can be followed, as determined by the defined link. The speedy access by direct jumps, anyway, is enabled only upon paths prepared in advance; ad hoc queries require more steps to be done. Network data structures are the basis of graph (also NoSQL) databases, and are especially connected with the World Wide Web; from the very beginning of its existence it enables the linking of documents by way of hypertext references. These simple relationships on document level without any semantics (the hypertext reference only links the documents without hinting to the importance of their relations), are in the process of change at the present day thanks to the technologies of the Semantic Web, being now complemented by the option of expressing the relationships of single recorded pieces of knowledge, including the respective semantics. The standard is the RDF language (Resource Description Framework) with a simple triple syntax of three members: subject – predicate (i.e. relationship) – object. The nodal points of the RDF graph are subjects and objects identified by means of URI, and also the edges of the diagram represent predicates identified by URI.
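
The triple syntax can be sketched in plain Python without any RDF library. Each statement is a tuple subject – predicate – object, and all three members are identified by URIs. The namespace `http://example.org/`, the resource and property names and the helper `match` are hypothetical and serve only as an illustration.

```python
EX = "http://example.org/"  # hypothetical namespace, not a real vocabulary

# Each statement is one edge of the graph: (subject, predicate, object).
triples = {
    (EX + "book1", EX + "hasAuthor", EX + "author1"),
    (EX + "book2", EX + "hasAuthor", EX + "author1"),
    (EX + "book1", EX + "hasTopic", EX + "webArchiving"),
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples agreeing with the pattern; None acts as a wildcard."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }

# All statements saying that something was written by author1.
for s, p, o in sorted(match(predicate=EX + "hasAuthor", obj=EX + "author1")):
    print(s)
```

In contrast to a bare hypertext link, the predicate URI states what kind of relationship holds between the two nodes, and the author is recorded once although two books refer to it.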

The relational structure is also capable of expressing N : M relationships between data, yet it achieves this by other means than the network structure. Data on the entities participating in the relationship are organized in two-dimensional tables, in which collocation instead of navigation takes place: data of the same type are located in columns, whereas the rows form the so-called ordered n-tuples (this general expression denotes pairs, triples, quadruples and further analogies) comprising the values of the properties of some object. From the user's viewpoint, an advantage of the relational structure is the relative ease of retrieval; on the other hand, this comes at the cost of high demands on the memory and performance of the computer and low search efficiency. Searching is not carried out by navigation; instead, the program gradually selects (and loads into its memory) whole sets of data, and only then chooses the relevant data from them by set operations. A typical field of application is relational databases making use of the standardized SQL language.

Linear, tree-type and network structures have a common foundation in graph theory: they consist of pairs of construction elements node – edge. The node represents the entity participating in the relation and the edge the actual relationship. The edges can be directed or undirected, which enables the differentiation between symmetry and asymmetry of the relationship. Moreover, they can be assigned values, which also adds semantics to the relationship they represent. Such pairs node – edge can create sequential chains or paths of arbitrary length, but they can also be used for constructing more complex tree or network structures. The physical implementation of a node is a data object, whereas the implementation of an edge, called a reference, is a special data item in the source node containing the identifier of the target node. The graph structures are governed by the principle of pre-coordination: the relationships are defined explicitly and permanently, prior to the query, on the level of entries or documents (the reference contains the identifier of the connected data object) in all anticipated directions. Accordingly, the mutual relationships between the elements can become very complicated, especially in the case of a network structure, and are difficult to analyze. Subsequent changes of the structure and the expression of other relationships between the elements require a physical "re-arrangement" of the whole file.

In contrast to the above structures, the relational one is based upon a thoroughly different principle, drawing on set theory and the mathematical theory of relations. A relation in this sense is a subset of the Cartesian product over sets of data, the so-called domains. The different theoretical foundation of relational data structures is also reflected in the physical implementation of relationships: the interlinking is itself defined by way of relations between the items (concretely, between the pair primary key – foreign key; the reference contains the value of one of the attributes of the connected record). The contribution of this solution resides in the flexibility of expressing the relationships, which are defined only at the moment they are required for answering a query, not in advance. In this respect the relational data structures conform to the post-coordination principle. An undoubted asset for the user is both the simplicity of designing relational tables and the ease of changing the structure, which consists just in adding or deleting a column in one table, without any impact upon the other tables.
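
The primary key – foreign key mechanism and the post-coordination principle can be sketched with the `sqlite3` module from the Python standard library. The table names `reader`, `book` and `loan` and the sample rows are hypothetical; the third table is the usual way of expressing an N : M relationship, and the link between readers and books is assembled only at the moment of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reader (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book   (id INTEGER PRIMARY KEY, title TEXT);
-- The linking table holds only foreign keys, i.e. attribute values
-- of the connected records, not physical references to them.
CREATE TABLE loan (
    reader_id INTEGER REFERENCES reader(id),
    book_id   INTEGER REFERENCES book(id)
);
INSERT INTO reader VALUES (1, 'Anna'), (2, 'Petr');
INSERT INTO book   VALUES (10, 'Book A'), (20, 'Book B');
INSERT INTO loan   VALUES (1, 10), (1, 20), (2, 10);
""")

# The relationship is formed here, by joining on matching key values.
rows = conn.execute("""
SELECT reader.name, book.title
FROM loan
JOIN reader ON reader.id = loan.reader_id
JOIN book   ON book.id   = loan.book_id
ORDER BY reader.name, book.title
""").fetchall()

for name, title in rows:
    print(name, "reads", title)
```

Adding a column to `book` would not require any change to `reader` or `loan`, which corresponds to the ease of structural changes mentioned above.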

2.5 UML notation for modelling the relationships

UML (Unified Modelling Language)²⁴ is a standardized language conceived for enabling and simplifying communication for the development of diagrams in object oriented models describing real problems and expressing the results of their analysis, as well as drafts for solving the same by way of information and communication technologies. The systems approach applied in modelling allows the user to model objects (entities), classes, attributes, operations (functions) and relationships among them. The basic lexical units of the UML language are icons (forms, graphic symbols), conjunctions and chains of signs. They are usually represented by diagrams based upon the principles of graph theory. The icons form the nodes of the diagram and their connections are edges representing the relationships of the entities depicted by the icons. UML does not specify one universal diagram for all types of models, but offers a set of fourteen specialized diagrams for various tasks and phases of system modelling. The diagrams are divided in two groups – diagrams of structure and diagrams of behavior and interaction. This classification corresponds with the dichotomy of static and dynamic relationships, as introduced under part 2.2.

UML is traditionally used for designing information systems and in software engineering; however, its field of application is constantly widening, as evidenced also by the recently adopted international standard ISO 24156-1, specifying the use of UML in terminology work²⁵. The content of this standard is a user-defined profile, adapting the original semantics of the UML class diagram for application in the development of concept diagrams in accordance with the terminological standards ISO 704 and ISO 1087-1, as introduced in part 2.3. Yet another example of the use of UML in the field of work with information is the data model of a thesaurus²⁶, elaborated in ISO 25964-1.

The class diagram in UML is of crucial importance for the illustration of the static relationships, and the most essential diagrams enabling the depiction of dynamic relationships are the diagram of activities and the statechart diagram. The graphic representations of various types of relationships in these diagrams with short comments concerning their importance are shown in Fig. 3.

Figure 3

A class diagram depicts the static structure of a system by means of classes composed of attributes (data) and operations (processes) and by means of relationships between those classes. It has tools for expressing all three key types of relationships – equivalence, hierarchy and association. The relationship of equivalence of objects having the same properties is expressed by their being classified in one class. As can be seen in Fig. 3, where the UML terminology is followed, after a slash, by the terminology used in the user profile ISO 24156-1, the class in UML is an analogy of the concept as understood by the terminological standards ISO 704 and ISO 1087-1 and by ISO 25964. The relationship of generic hierarchy is expressed in UML by the symbol of generalization, whereas the relationship of partitive hierarchy has two symbols at its disposal – aggregation and composition, which differ in the extent to which the parts depend upon the whole. UML enables the modelling of both mono-hierarchical and poly-hierarchical relationships. As a rule, the relationship of instance hierarchy is not depicted in the class diagram, but separate instances can of course be modelled in the diagram of objects.²⁷ The relationship of association is expressed in UML by a symbol of the same name. As has been mentioned already, the instances of each class in UML are data objects. A generic relationship is instantiated by the transfer (i.e. the transitivity) of features of a generic class onto the objects of the specific class, thus enabling inheritance. The instances of both partitive and associative relationships are links between the objects instantiated from the classes taking part in the relationship.
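The relationship types of the class diagram can be approximated in code. The following Python sketch is an illustration only: the classes `Document`, `Book`, `Chapter`, `Author` and `Collection` are hypothetical, and Python itself does not distinguish aggregation from composition, so the difference is expressed here purely by convention (who creates and owns the parts).

```python
from dataclasses import dataclass, field


@dataclass
class Document:                      # generic (superordinate) class
    title: str

    def describe(self):
        return f"Document: {self.title}"


@dataclass
class Chapter:                       # a part created and owned by its book
    heading: str


@dataclass
class Author:                        # independent class, linked by association
    name: str


@dataclass
class Book(Document):                # generalization: Book inherits from Document
    author: Author = None            # association: a link to another object
    chapters: list = field(default_factory=list)

    def add_chapter(self, heading):
        # composition (by convention): the whole creates and owns its parts
        self.chapters.append(Chapter(heading))


@dataclass
class Collection:                    # aggregation: members exist independently
    items: list = field(default_factory=list)


book = Book("Work A", author=Author("Author X"))
book.add_chapter("Introduction")
shelf = Collection([book])
print(book.describe())               # method inherited from the generic class
```

The object `book` is an instance of both `Book` and `Document`, which mirrors the transfer of features from the generic class to the specific one; the references held in `author`, `chapters` and `items` are the links that instantiate the associative and partitive relationships.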

A statechart diagram depicts the dynamics of change of the object, whereas a diagram of activities serves for modelling the mutual relationships of the processes within the system, i.e. the relationships of follow-up and of parallelism in time. In addition to the orientated edges both diagrams dispose also of symbols enabling the depiction of alternative possibilities of the course of the modelled processes and their synchronization.

3 General taxonomy of relationships applicable to the relationships of information resources

There is no doubt that knowledge of the relationships of information resources has rich uses in both the theory and the practice of information science. Relational analysis investigates the extent to which a given resource is connected with other resources (such as documents, people, disciplines, organisations…) and visualizes these connections by means of communication or social networks or maps. Evaluating analysis, carried out by the methods of scientometrics, bibliometrics or citation analysis, serves as an indicator of productivity, quality, importance and influence (such as influential documents, persons…). Comparative analysis enables the evaluation of relevance by examining the relationship between an information need and the contents of a resource.

When deliberating upon the relationships of entities falling into the broadly delimited category of information resources, a number of things (participants) in mutual relationship can be considered: a vast group of documents or bibliographic entities – works and their expressions (identified via DOI²⁸, VIAF²⁹, ISTC³⁰, ISAN and V-ISAN³¹, ISWC³²), manifestations and items of works (identified via ISBN, ISSN, GTIN³³), parts, volumes or components of works (with identifiers DOI, SICI³⁴), collections and services (identifier ISCI³⁵), persons, such as authors (with identifiers IČO³⁶, VIAF, ISNI³⁷) or users, organisations (identified, e.g., via IČ³⁸, DIČ³⁹, GLN⁴⁰, VIAF, ISNI, ISIL⁴¹, library signs), themes, objects, contents or genres (identified by their titles or names), but also formats and data structures, signs and elements of language (words, collocations, statements, sentences, enquiries, references, citations, paragraphs, texts).

It is obvious that the scale of such entities is immensely broad, reaching from physical objects (analogue or digital ones) to abstract concepts. The range of their relationships is no less diverse. Both abstract relationships of entities and concrete relationships of their instances can be considered; they can be mutual (such as the unary relationships author – author, theme – theme) or combinational (such as author – work, work – theme). In addition, the entities entering a relationship tend not to be black boxes, but have an internal structure of their own. The internal relationships of the participating entities also affect the nature of their reciprocal relationship. Some relationships of information resources with entities outside their domain may be of interest as well (such as author – place of birth).

Standard ISO 25964, mentioned in part 2.3, is not the only manifestation of information science experts' concern with the problem of relationships. The relational typology of information resources has been studied by numerous authors, whose approaches often differ depending on their philosophical platforms and on the purpose for which they define the relationships. By way of example, contrast the extensive reviews of relationships defined within the framework of the Agrovoc thesaurus⁴² or the taxonomy of subject relationships prepared by D. Michel for a section of the ALA, the Association for Library Collections and Technical Services, in 1996⁴³, with the very economical list of only three types of relationships in the Gene Ontology⁴⁴. Important achievements in the theoretical domain are the taxonomy of bibliographic relationships by Barbara Tillett and the review of relationships worked out within the FRBR model.

B. Tillett, in her dissertation of 1987⁴⁵, analysed the relationships contained in cataloguing rules; she first published the results in 1991⁴⁶, and a revised version, complemented with concepts of the FRBR model, followed in 2001⁴⁷. The taxonomy consists of seven types of bibliographic relationships: 1. equivalence relationships – copies of the same manifestation (reproduction, facsimile, reprint, micrographic copy etc.), retaining the same contents and author's responsibility, 2. derivative relationships, called horizontal in the UNIMARC format – modifications of a bibliographic unit based upon the given work (versions, translations, summaries, adaptations, changes of genre, such as dramatization, paraphrase etc.), 3. descriptive (reference) relationships – description, critique, evaluation or review of the contents of a bibliographic unit, 4. partitive relationships, called vertical in the UNIMARC format – relationships of the type whole – parts between bibliographic units or their parts, 5. accompanying relationships (relationships of extension, such as a work and an annex or appendix), 6. sequential relationships (follow-up relationships, designated as chronological in UNIMARC), 7. relationships of shared (common) characteristics (such as the same author, name or theme of bibliographic units). Jonathan Furner has suggested extending the above taxonomy in the group of shared characteristics, where he also recommends including the important relationships of citation and of relevance and the relationships of content-related characteristics.⁴⁸

Another outstanding contribution is the review of relationships in the domain of knowledge organization worked out by Rebecca Green.⁴⁹ She divides the relationships into two basic groups, mutual relationships of documents and relationships of concepts, while naturally also reflecting the important relationship document – concept (the contents of the document). Among the relationships of documents she also integrates the bibliographic relationships from the taxonomy by B. Tillett, mapping them in the light of the entities of the FRBR model, i.e. relationships between the entities of the first group, relationships of responsibility between the entities of the first and second groups, and subject relationships (a work has a subject). She complements the group of partitive relationships by adding relationships within the inherent structure of texts. In tune with J. Furner she highlights the importance of citation relationships of documents. According to Green, the relationships of concepts comprise the semantic relationships of equivalence, hierarchy and association and, again in line with J. Furner, relationships of relevance, i.e. relationships between an information need or request and a document.

The relationships upon which most authors agree, irrespective of some partial differences, represent a rather narrow but generally acceptable common platform. The relationships of equivalence, hierarchy (generic, partitive, instantional) and association are not only accepted by theoreticians, but – as shown in part 2.3 of the present study – they have also been included in international professional standards and, most essentially, they are implemented in most computer programmes that give present-day users access to information resources. On the theoretical level it is usually no problem to differentiate between the concepts of equivalence, hierarchy and association. In the case of equivalence we suppose that the semantics of the linked elements is the same; in the case of association we suppose different meanings of the linked elements with some common feature; and we designate as hierarchical a relationship in which the contents of the elements are similar, most often in the form of inclusion (the meaning of the superordinate element is also included in the meaning of the subordinate element). However, in the practice of everyday work with information resources we come across problems. Most frequently the problem lies in drawing the demarcation line on the continuous scale from equivalence to association, but problems between equivalence and hierarchy can also occur (e.g. synonyms are generally considered cases of equivalence of meaning, but it is not exceptional to find hierarchical pairs designated as synonyms⁵⁰). As shown by the UML example, relationships of partitive hierarchy are implemented as association relationships in present-day computer programmes. Differences in granularity (the "size" of the elements within the relationship) cause the defined relationships to vary in their level of detail. Differences in the precision with which a relationship is determined make mapping difficult (how should we map, e.g., the index relationships "see" and "see also", or the OWL predicate owl:sameAs, to the triple of relationships expressed in the SKOS scheme as skos:broadMatch, skos:closeMatch and skos:exactMatch?).

In spite of the above qualifications we have determined this triple of semantic relationships as the primary facet in our review of the most important relationships of information resources. It is complemented by two secondary facets containing also essential categories enabling the broadening and deepening of the analysis of the relationships under scrutiny by adding a technological and a language dimensions.

Primary facet

semantics: equivalence; hierarchy (generic, partitive, instantional); association

Secondary facets

data structure: linear; tree-type; network; relational

permanence, dependence upon the context: paradigmatic relationship; syntagmatic relationship

Table 1 Review of key relationships of information resources

The submitted draft of taxonomy in Table 1 is the result of an analysis of theoretical principles of various scientific disciplines. We are aware of the fact that its applicability to the fields of information resources may require very meticulous investigations and testing on an empirical basis. It will be necessary to document the suggested categories with concrete examples, in particular from the following key domains: 1. conceptual (linguistic-semantic) level of relationships within the whole scope of the communication chain – expression and interpretation of conceptual relationships by the author, producer, processor, reader, addressee, while the actor in all these cases can be both a person and a machine; 2. bibliographical relationships (especially in the FRBR model); 3. relationships in the knowledge organization systems; 4. relationships in the Semantic Web.

Conclusion

The problem of relationships of information resources is a typical interdisciplinary issue. Experts from various disciplines look for relationships or structures that best reflect the way they themselves perceive the world. Changes in the definition of relationships of information resources react to changes undergone by the information resources themselves. The sequential paradigm of written documents was replaced by the tree structures of the first databases in the mid-20th century, and at the turn of the millennium by the hierarchical principle of document structuring in the SGML, HTML and XML formats. In parallel, relational database structures established themselves in a not negligible segment of document processing; their handicap of total dissimilarity with reality was compensated by the firm foundation of the mathematical principles of set theory and relational algebra. The present day witnesses the advent of network relationships and structures in the form of the Semantic Web and linked data, implemented as interlinked, directed and typed graphs. The paradigm of relations is changing as well: whereas two-valued Boolean logic used to be valid in the closed database environment, fuzzy logic is seen to govern the open environment of linked data. This, too, is certainly just an imperfect image of the infinitely variable and dynamic reality. However, of all the models suggested to date, network relationships appear to offer the best options for expressing the complexity of reality. Their implementation in the RDF triple format, in connection with web technologies, represents an optimum result, combining the advantages of relational and network structures. They make it possible to achieve very fine granularity and to capture relationships down to the level of single facts. It remains a question whether such a structure can also provide the necessary degree of abstraction that is essential not only for representing reality but, in particular, for understanding it.
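What a directed, typed graph of triples looks like can be shown with a minimal Python sketch. It deliberately uses plain tuples rather than an RDF library, and every identifier under the `ex:` prefix, as well as the helper `match`, is a made-up placeholder, not a term from any published vocabulary.

```python
# Each statement is one (subject, predicate, object) triple, i.e. one
# directed edge whose type is given by the predicate.
triples = {
    ("ex:work1", "ex:hasAuthor", "ex:author1"),
    ("ex:work1", "ex:hasSubject", "ex:theme1"),
    ("ex:work2", "ex:hasAuthor", "ex:author1"),
    ("ex:theme1", "ex:broader", "ex:theme0"),
}


def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All works linked to ex:author1 by an edge of type ex:hasAuthor
works = [s for s, _, _ in match(p="ex:hasAuthor", o="ex:author1")]
print(works)
```

Each triple records a single fact, which is the fine granularity mentioned above; any more abstract view has to be assembled from such facts by queries like `match`.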

This study is a partial solution output of the project NAKI DF13P01OVV013 Knowledge base for the field of information and knowledge organization, as implemented at UISK FF UK in Prague.

Bibliography

BEAN, Carol A. and Rebecca GREEN, ed. Relationships in the organization of knowledge. Dordrecht: Kluwer Academic Publishers, 2011 (reprint of 2001). ix, 232 p. Information science and knowledge management, vol. 2. doi:10.1007/978-94-015-9696-1. ISBN 978-90-481-5652-8 (pbk.). ISBN 978-94-015-9696-1 (Online).

ČSN ISO 1087-1 (01 0501). Terminologická práce – Slovník – Část 1: Teorie a aplikace. Praha: Český normalizační institut, 2002. 38 p.

FURNER, Jonathan. Bibliographic relationships, citation relationships, relevance relationships, and bibliographic classification: an integrative view. In: Clare Beghtol, Jonathan Furner, Barbara Kwasnik, ed. Proceedings of the 13th Workshop of the American Society for Information Science and Technology Special Interest Group in Classification Research, November 17, 2002, Philadelphia, PA. Medford (N.J.): Published by Information Today for the American Society for Information Science and Technology, © 2004, p. 29–37. ISBN 978-1-57387-199-0. Advances in classification research, vol. 13. ISSN 2324-9773. doi:10.7152/acro.v13i1.13833.

GREEN, Rebecca. Relationships in knowledge organization. In: Knowledge organization. 2008, 35(2–3), 150–159. ISSN 0943-7444.

HJØRLAND, Birger. Are relations in thesauri "context-free, definitional, and true in all possible worlds"? In: Journal of the American Society for Information Science and Technology. July 2015, 66(7), 1367–1373. doi:10.1002/asi.23253. ISSN 1532-2882 (Print). ISSN 1532-2890 (Online).

ISO 704:2009. Terminology work – Principles and methods. 3rd ed. Geneva: International Organization for Standardization, 2009. 65 p.

ISO 24156-1:2014. Graphic notations for concept modelling in terminology work and its relationship with UML – Part 1: Guidelines for using UML notation in terminology work. 1st ed. Geneva: International Organization for Standardization, 2014. 24 p.

ISO 25964-1:2011. Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. 1st ed. Geneva: International Organization for Standardization, 2011-08-08. 152 p.

ISO 25964-2:2013. Information and documentation – Thesauri and interoperability with other vocabularies – Part 2: Interoperability with other vocabularies. 1st ed. Geneva: International Organization for Standardization, 2013-03-04. 99 p.

KHOO, Christopher S. G. and Jin-Cheon NA. Semantic relations in information science. In: Annual review of information science and technology. Vol. 40. Blaise Cronin, ed. Medford (N.J.): Information Today on behalf of American Society for Information Science and Technology, 2006, chapter 5, p. 157–228. Annual review of information science and technology, vol. 40. ISSN 0066-4200 (Print), ISSN 1550-8382 (Online). doi:10.1002/aris.1440400112. ISBN 978-1-57387-242-3.

SVENONIUS, Elaine. Subject languages: referential and relational semantics. In: The intellectual foundation of information organization. Cambridge (Mass): MIT Press, 2000, chapter 9, p. 147–171. ISBN 978-0-262-19433-4.

TILLETT, Barbara B. Bibliographic relationships. In: Carol A. Bean and Rebecca Green, ed. Relationships in the organization of knowledge. Dordrecht: Kluwer Academic Publishers, 2011 (reprint of 2001), p. 19–35.

Comments

¹Allemang, Dean and James Hendler. Semantic web for the working ontologist: modelling in RDF, RDFS and OWL. Morgan Kaufmann, 2008, p. 31. ISBN 978-0-12-373556-0.

² CELBOVÁ, Ludmila. Informační zdroj. In: KTD: Česká terminologická databáze knihovnictví a informační vědy (TDKIV) [online]. Praha: Národní knihovna ČR, 2003– [cit. 2015-10-03]. Available from: http://aleph.nkp.cz/F/?func=direct&doc_number=000000887&local_base=KTD.

³ WIERZBICKA, Anna. Sémantika: elementární a univerzální sémantické jednotky. Praha: Karolinum, 2014, p. 160–161. ISBN 978-80-246-2289-7.

⁵ Birger Hjørland pointed to the arbitrariness of relationships in the recently published paper devoted to the relationships in thesauri: HJØRLAND, Birger. Are relations in thesauri "context-free, definitional, and true in all possible worlds"? In: Journal of the American Society for Information Science and Technology. July 2015, 66(7), 1367–1373. doi:10.1002/asi.23253. ISSN 1532-2882 (Print). ISSN 1532-2890 (Online).

⁶ All examples in brackets concern concrete statements in Fig. 1.

⁷ Note: There are certain limitations for the combinations of values, such as: the relation 1 : N can be only asymmetrical, not symmetrical.

⁸ For more detail see, e.g. ČERMÁK, František. Jazyk a jazykověda: přehled a slovníky. 2. reprint 3. enlarged edition. Praha: Karolinum, 2001, 2004, 2007. Chapter 1.52, Structure of the sign and its relationships, p. 24–28. ISBN 987-80-246-0154-0.

⁹ SVENONIUS, Elaine. Subject languages: referential and relational semantics. In: The intellectual foundation of information organization. Cambridge (Mass): MIT Press, 2000, chapter 9, p. 147–171. ISBN 978-0-262-19433-4.

¹⁰ For more detail see, e.g. ČERMÁK, František. Jazyk a jazykověda: přehled a slovníky. 2. reprint 3. enlarged edition Praha: Karolinum, 2001, 2004, 2007. Chapter 4.0, System and text (langue et parole), p. 80–91. ISBN 987-80-246-0154-0.

¹¹ "However, it should be born in mind that there is no sharp demarcation line in the field of syntagma between the fact of language, as a sign of collective usage, and the fact of speech, depending upon individual liberty. It is difficult to classify certain combinations of units in quite a number of cases, since their creation has been due to factors of both type, in proportions that can be hardly differentiated." SAUSSURE, Ferdinand de. Kurs obecné lingvistiky. Comments by Tullio de Mauro; from the French original translated by František Čermák. 1. ed. Praha: Odeon, 1989, p. 154. ISBN 978-80-207-0070-4.

¹² "There are two referential relationships serving for the interpretation of a sign, the relation to the code and the relation to the context […]." JAKOBSON, Roman. Poetic function. [From Czech and foreign originals selected and arranged by Miroslav Červenka] Edition of this collection 1. Jinočany: H & H, 1995, p. 57. ISBN 80-85787-83-0.

¹³ ISO 704:2009. Terminology work – Principles and methods. 3rd ed. Geneva: International Organization for Standardization, 2009. 65 p. – Only the 2nd edition is available in Czech translation as yet: ČSN ISO 704 (01 0505). Terminologická práce – Principy a metody. Praha: Český normalizační institut, 2004. 43 p.

¹⁴ ISO 1087-1:2000 Terminology work – Vocabulary – Part 1: theory and application. 1st ed. Geneva: International Organization for Standardization, 2000. 41 p.

¹⁵ ISO 25964-1:2011. Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. 1st ed. Geneva: International Organization for Standardization, 2011-08-08. 152 p. – ISO 25964-2:2013. Information and documentation – Thesauri and interoperability with other vocabularies – Part 2: Interoperability with other vocabularies. 1st ed. Geneva: International Organization for Standardization, 2013-03-04. 99 p.

¹⁶ ČSN ISO 1087-1 (01 0501). Terminologická práce – Slovník – Část 1: Teorie a aplikace. Praha: Český normalizační institut, 2002. 38 p.

¹⁷ National Information Standards Organization

¹⁸ British Standards Institution

¹⁹ Considering the fact that the standards focus upon the so-called special language used in a certain delimited field, I wish to draw attention to two designation types: terms and proper names.

²⁰ In this case the term inclusion is used in its general meaning, not in the strict sense of the theory of sets or formal logics.

²¹ In parts 5.1.2 and 5.1.3 the standard ISO 25964 offers, for orientation, a selection from some categories: objects, things and their physical parts, materials, activities or processes, events or occurrences, properties of persons, things, materials or activities, fields or scientific disciplines, units of measurement, types of persons and organizations, individual entities designated with proper names – places, specific objects, topographic phenomena, individuals, organizations, companies.

²² Note: This structure is sometimes also called hierarchical, but for the purpose of differentiation from the relationship of the same name we will prefer the term tree structure derived from the graph theory.

²³ Standard Generalized Markup Language

²⁴ ISO/IEC 19505-1:2012. Information technology – Object Management Group Unified Modelling Language (OMG UML) – Part 1: Infrastructure. 1st ed. Geneva: International Organization for Standardization, 2012. 220 p. – ISO/IEC 19505-2:2012. Information technology – Object Management Group Unified Modelling Language (OMG UML) – Part 2: Superstructure. 1st ed. Geneva: International Organization for Standardization, 2012. 740 p.

²⁵ ISO 24156-1:2014. Graphic notations for concept modelling in terminology work and its relationship with UML – Part 1: Guidelines for using UML notation in terminology work. 1st ed. Geneva: International Organization for Standardization, 2014. 24 p.

²⁶ Data model available on: http://www.niso.org/schemas/iso25964/Model_2011-06-02.jpg [cit. 2015-10-03].

²⁷ Whereas the class diagram focuses upon modelling concepts and their abstract relationships (i.e. concept models), the diagram of objects enables the depiction of concrete individuals or instances (called objects in UML) and their concrete relationships.

²⁸ Digital Object Identifier

²⁹ Virtual International Authority File

³⁰ International Standard Text Code

³¹ International Standard Audiovisual Number and ISAN Version

³² International Standard Musical Work Code

³³ Global Trade Item Number

³⁴ Serial Item and Contribution Identifier

³⁵ International Standard Collection Identifier

³⁶ Identification Number of person in the Register of Persons of the ČR

³⁷ International Standard Name Identifier

³⁸ Identification Number of Organization in the Register of Companies of the ČR

³⁹ Tax Identification Number

⁴⁰ Global Location Number

⁴¹ International Standard Identifier for Libraries and Related Organisations

⁴² Available from: http://aims.fao.org/aos/agrontology [cit. 2015-10-03].

⁴³ MICHEL, Dee. Appendix B: Taxonomy of subject relationships. In: Subject data in the metadata record: recommendations and rationale: a report from the ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis. July 1999. Available from: http://www.ala.org/alcts/mgrps/camms/cmtes/ats-ccssac/srrsreport-B2 [cit. 2015-10-03].

⁴⁴ Available from: http://geneontology.org/page/ontology-relations [cit. 2015-10-03].

⁴⁵ TILLETT, Barbara Ann Barnett. Bibliographic relationships: toward a conceptual structure of bibliographic information used in cataloguing. Los Angeles, 1987. xxi, 306 p. Thesis (Ph.D.). University of California, Graduate School of Library and Information Science.

⁴⁶ TILLETT, Barbara B. A taxonomy of bibliographic relationships. In: Library resources & technical services. April 1991, 35(2), 150–158. ISSN 0024-2527 (Print). ISSN 2159-9610 (Online).

⁴⁷ TILLETT, Barbara B. Bibliographic relationships. In: Carol A. Bean, Rebecca Green, ed. Relationships in the organization of knowledge. Dordrecht: Kluwer Academic Publishers, 2011 (reprint of 2001), p. 19–35.

⁴⁸ FURNER, Jonathan. Bibliographic relationships, citation relationships, relevance relationships, and bibliographic classification: an integrative view. In: Clare Beghtol, Jonathan Furner, Barbara Kwasnik, ed. Proceedings of the 13th Workshop of the American Society for Information Science and Technology Special Interest Group in Classification Research, November 17, 2002, Philadelphia, PA. Medford (N.J.): Published by Information Today for the American Society for Information Science and Technology, © 2004, p. 29–37.

⁴⁹ GREEN, Rebecca. Relationships in knowledge organization. In: Knowledge organization. 2008, 35(2–3), 150–159. ISSN 0943-7444.

⁵⁰ In fact, this is permitted also by standard ISO 25964, enumerating in part 8.1 four types of equivalence, of which two are hierarchical (broader – narrower meaning, complex equivalence).

Archivematica – Open Source System for Digital Archiving

Miroslav Bartošek — 2016-02-29T11:45:00Z

Keywords: digital preservation, digital archiving, Archivematica, OAIS, low cost solution

RNDr. Miroslav Bartošek, CSc., Masaryk University, Institute of Computer Science, Botanická 68a, 602 00 Brno, Czech Republic

The article was written within the CESNET Development Fund research project “Pilot project for low-barrier approach to digital preservation (LTP-Pilot)”, project No. 516R1/2014^[1]^[2]

1. Introduction

Until recently, long-term digital preservation was the exclusive domain of large institutions such as national libraries and national archives, which usually had the necessary mandates, finances and expert resources. These institutions often focused on building large, monolithic systems based on rather costly commercial solutions (e.g. Rosetta from Ex Libris). However, advances in theory and practice, along with the growing need to address digital archiving in smaller institutions, have led to the realization that even institutions with limited resources can begin creating their own solutions, without waiting for what the large institutions may offer. One of the new long-term preservation systems that have emerged in recent years and support this trend is Archivematica.

1.1 About Archivematica

Archivematica^[3] is a freely available open source system supporting long-term digital preservation. The system has been developed since 2008 by the Canadian company Artefactual Systems Inc. in collaboration with academic and memory institutions. The impulses for developing Archivematica were (a) the demand for a low-cost long-term preservation solution^[4] and (b) the availability of a large variety of open source tools that supported specific digital preservation tasks but were not interconnected in a comprehensive system easily usable by the wider community of digital curators. The declared objective of Archivematica is to provide archivists and librarians who have limited technical and financial capacities with the tools, methodologies and confidence to start digital archiving on their own.

A prototype of Archivematica was designed to verify the idea^[5] that an open source long-term preservation system can be created by mapping available tools to OAIS processes^[6]. The system was initially developed for the City of Vancouver Archives and for the International Monetary Fund. Later, other institutions and the wider community of users became engaged. The beta version was released in early 2009, and the first production version appeared two years later. The latest version as of the date of submission of this paper is version 1.4, released in May 2015.

Archivematica integrates a set of freely available tools and uses them in the complex processing of digital objects from submission and ingest into the archive to providing access to end users according to the OAIS model and other standards and recommendations. To implement digital preservation functionalities Archivematica uses the micro-services approach: each micro-service represents a partial step in the preservation process and is usually implemented by some of the available tools. Micro-services are chained to workflows representing functions of the OAIS model. The entire system can be user-controlled and monitored via a web interface. The workflows and tools used for individual steps can be changed and replaced. This makes the system flexible, responding to different needs and technological changes in digital file creation, management and preservation.
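The micro-services idea can be sketched in a few lines of Python. The sketch does not reproduce Archivematica's actual code or interfaces; the functions `assign_checksum`, `identify_format`, `record_event` and `run_workflow`, and the dictionary standing in for a package, are assumptions made purely for illustration.

```python
import hashlib


def assign_checksum(package):
    """Micro-service: record a SHA-256 digest of the content."""
    package["sha256"] = hashlib.sha256(package["content"]).hexdigest()
    return package


def identify_format(package):
    """Micro-service: toy format identification by file signature."""
    is_pdf = package["content"].startswith(b"%PDF")
    package["format"] = "PDF" if is_pdf else "unknown"
    return package


def record_event(package):
    """Micro-service: note that the ingest step took place."""
    package.setdefault("events", []).append("ingest")
    return package


def run_workflow(package, steps):
    """Chain the micro-services: each step receives the previous result."""
    for step in steps:
        package = step(package)
    return package


# A workflow is simply an ordered, replaceable list of steps.
ingest = [assign_checksum, identify_format, record_event]
result = run_workflow({"name": "report.pdf", "content": b"%PDF-1.4 example"}, ingest)
print(result["format"], result["events"])
```

Because the workflow is only a list, a step can be swapped for another tool or removed without touching the remaining steps, which is the flexibility described above.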

1.2 Threats to digital information, digital preservation

Digital information is subject to a number of risks that critically threaten its long-term availability and usability. Unlike paper documents, which can survive and pass on authentic information for a very long time without any special treatment, digital objects cannot survive without continuous intervention. The main risks include the limited lifetime of storage media, the complexity of digital objects that cannot be rendered without intricate and innovative intervention, and especially technological progress, a byproduct of which is the obsolescence of technology and information and thus the inaccessibility of information. It is not possible to maintain the long-term availability, usability, integrity and authenticity of digital documents without active, systematic action.

Digital preservation includes basically two different and complementary levels: physical preservation of the original digital files (or set of bits, hence also the bit-level preservation); and logical preservation which means preservation of the ability to read and understand the information contained in a digital object.

Physical preservation refers to protection against the loss of the digital object or a part thereof (spontaneous media degradation, intentional or unintentional deletion or content alteration, loss or destruction of media due to a crash or natural disaster). Physical preservation is ensured in particular by creating multiple copies of files, storing them in multiple geographically distant locations, and regularly monitoring their integrity.
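
The integrity monitoring mentioned above is commonly done by comparing checksums. The following is a minimal sketch in Python, not part of Archivematica; the function names `sha256_of` and `audit_copies` and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(expected: str, copies: list[Path]) -> dict[Path, bool]:
    """Map each replica to True if it exists and still matches the recorded digest."""
    return {
        copy: copy.is_file() and sha256_of(copy) == expected
        for copy in copies
    }
```

A replica that is missing or whose digest differs from the one recorded at ingest is reported as `False` and would have to be restored from one of the intact copies.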

Logical preservation concerns protection against the inability to access the preserved object (due to technological changes, a device needed to read the media is no longer available; there is no software for decoding the file format; there is no operating system or hardware platform which would run the software) or to understand its informational content (loss of context etc.). Logical preservation can be supported by storing the digital object along with well-maintained and accurate supporting information, i.e. metadata. Moreover, it is also necessary to carry out preservation actions to ensure the legibility of the original information (migration to a new file format, emulation of the original computing environment, etc.). Logical preservation may result in changes to the original bits, but this is accepted as the price of preserving the readability and intelligibility of the content.

Archivematica focuses on supporting the processes of logical preservation, i.e. preservation of the information content, its readability and understandability.

1.3 OAIS functional and informational model

OAIS – Open Archival Information System (ISO 14721) is the reference model and the crucial standard used today for implementing long-term archiving systems. According to the OAIS model^[7], the digital content to be preserved is submitted by the creator to the archive as a Submission Information Package (SIP). The submission package is then transformed into an Archival Information Package (AIP), which is stored in secure physical storage and then managed by the digital archive on the basis of established preservation practices and strategies. Digital content is made available to the end user through a Dissemination Information Package (DIP) – see Figure 1.

Figure 1: The OAIS Reference model scheme

The OAIS Functional Model encompasses six functional entities: Ingest (receipt of information from the creator and its preparation for insertion into the archive); Archival Storage (long-term storage and protection of information); Data Management (management of descriptive metadata of archived objects together with administrative metadata for both the operation of the archive and search); Preservation Planning (preservation of the archived objects with respect to ongoing changes in the external environment and technology); Administration (management, coordination and operation of the archive); and Access (providing end users with access to archived objects).

2. OAIS model implementation in Archivematica

The OAIS functional model has been translated by the Archivematica creators into a set of user scenarios. Using those scenarios the specific workflows were devised and implemented in the system. Digital information passes through a series of transformations during processing and the original digital content might be modified and enhanced (the unchanged digital original is always kept as well).

The main function and goal of Archivematica is to process submitted digital data (called a Transfer in Archivematica terminology) into SIPs, which are then ready for ingest into the archive and transformation into archival packages (AIPs) intended for long-term storage. In parallel with the creation of AIPs, one can also set up the creation of access packages (DIPs). Archivematica focuses primarily on creating the best possible AIP^[8]. What happens to the packages afterwards is largely outside its scope; it relies on other, external systems for that. For example, to access a DIP users can either use AtoM, an independent component supplied with the system, or integrate Archivematica with the external data management and access systems used by the organization (e.g. an institutional repository).

Figure 2: OAIS information packages as seen in Archivematica workflow^[9].

2.1 Transfer

The term "transfer"^[10] is used in two different meanings. On the one hand, it refers to a set of submitted data and metadata (files and directories) to be archived in Archivematica. On the other hand, it also refers to the process prior to Ingest itself (i.e. pre-ingest), in which a SIP is generated from the submitted data.

The preparation and conduct of the Transfer process depends on the type of digital content and the procedures established in the particular institution. Typically it may include putting files into an appropriate folder structure, creating descriptive metadata for those files and adding other metadata such as copyright agreements, access restrictions, etc. There are several predefined modes for submitting the data for Transfer in Archivematica, but it is also possible to create and implement customized structures for Transfer.

A SIP package is created through a sequence of steps (micro-services), such as extraction of compressed files, normalisation of file names, virus scanning, checksum generation and validation, assignment of unique identifiers, format identification, metadata extraction and others. One or many SIPs can be created from one data set submitted to the digital archive. Conversely, one SIP package can contain one or many sets of submitted data. The system also supports the "backlog" functionality, i.e. delayed processing of incomplete Transfers (a frequent procedure in the archival community).

2.2 Ingest

SIP packages go through further processing during the ingest operation. For example, new metadata can be added and validated (such as descriptive metadata in Dublin Core, preservation metadata in PREMIS, technical metadata etc.), optical character recognition can be performed, and so on. More importantly, normalisation can be conducted (if configured). This means conversion of the digital content to a more suitable archival format, based on the input format. At the same time, Archivematica can also generate representations in other file formats for access purposes. The original versions of digital objects are always stored along with the normalised versions. Normalisation is then followed by other processes involving the creation of detailed input documentation, integration of newly generated metadata into a METS document (see 3.2), content and metadata indexing etc. Archivematica offers pre-defined ingest procedures depending on the type or form of the data and the completeness of the description of the ingested digital content. The administrator may, however, modify these or define new ones.

The ingest process is completed by creating an AIP archival package and storing it in archival storage. When required, a DIP access package can also be generated during ingest and stored in an access system. Data, metadata and any accompanying information forming an AIP are encapsulated in a single package created according to the BagIt standard (see 3.2).

2.3 Archival Storage

Archivematica stores all data and information packages (transfer, SIP, AIP, DIP) as files in a file system^[11]. To ensure independence from any specific physical data storage, it uses a separate component called the Storage Service, which provides an interface to any archival storage. The administrator can configure the Storage Service so that the data are stored according to the needs of the organization. Storage may be a local or remote file system (e.g. NFS), networked storage such as LOCKSS, cloud storage etc. Multiple repositories can be configured for different data types simultaneously within a single system. Archivematica does not address bit preservation (backups, multiple copies, integrity checks, recovery after catastrophic events, etc.); it leaves this to the repository itself.

All AIP packages in the archival storage are indexed (using the ElasticSearch server) so they can be searched and retrieved in a limited way, at both the package level and the level of individual objects. It is also possible to search at the AIC level (an Archival Information Collection is an information unit which brings together a set of logically interrelated AIPs). In justified cases it is possible to remove AIPs from the archival storage by using a controlled removal procedure (but it is not possible to delete individual files from an AIP).
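
The idea of a storage interface that hides the concrete repository can be illustrated with a small sketch. This is not the actual Storage Service API; the class and method names (`StorageBackend`, `store`, `retrieve`) are hypothetical and chosen only to show how the calling code stays the same when the storage changes.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical common interface for any archival storage."""

    @abstractmethod
    def store(self, package_id: str, content: bytes) -> str:
        """Store a package and return a location string."""

    @abstractmethod
    def retrieve(self, package_id: str) -> bytes:
        """Return the stored bytes of a package."""


class LocalFileSystemBackend(StorageBackend):
    """Keeps each package as one file under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def store(self, package_id: str, content: bytes) -> str:
        self.root.mkdir(parents=True, exist_ok=True)
        target = self.root / f"{package_id}.pkg"
        target.write_bytes(content)
        return str(target)

    def retrieve(self, package_id: str) -> bytes:
        return (self.root / f"{package_id}.pkg").read_bytes()


class InMemoryBackend(StorageBackend):
    """Stand-in for a remote or cloud store, used here only for illustration."""

    def __init__(self) -> None:
        self._packages: dict[str, bytes] = {}

    def store(self, package_id: str, content: bytes) -> str:
        self._packages[package_id] = content
        return f"memory://{package_id}"

    def retrieve(self, package_id: str) -> bytes:
        return self._packages[package_id]


def archive(backend: StorageBackend, package_id: str, content: bytes) -> str:
    """The caller depends only on the interface, not on the storage type."""
    return backend.store(package_id, content)
```

Swapping `LocalFileSystemBackend` for `InMemoryBackend` (or any other implementation of the interface) requires no change in `archive`, which is the kind of independence from physical storage described above.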

2.4 Preservation Planning

Archivematica uses a two-pronged preservation strategy: normalisation conducted during ingest, and keeping and preserving the original files to support future preservation actions such as format migration or emulation. Normalisation is based on the identification of file formats and their significant properties and also on format policies which specify the target file format, the type of action, the tools used and the procedures followed for creating AIP and DIP packages. Target formats for normalisation are selected using criteria such as current community recommendations, an open format specification, the availability of open-source tools for creating and rendering the format, format licensing, patent restrictions and others. Administrators of Archivematica can configure their preferred file formats and normalisation processes at any time as needed.

A crucial part of Archivematica is the FPR – Format Policy Registry, centrally managed by the Archivematica producer. The FPR specifies and continuously updates format-oriented procedures recommended on the basis of the current state of knowledge and best practice in digital preservation (the system administrator always has the option to modify and enhance these centrally managed procedures in the local copy of the registry). The FPR is available through an API and is shared not only by all organizations using Archivematica, but also by other institutions and projects. It is connected with the PRONOM^[12] registry. The use of other format registries such as UDFR (Unified Digital Format Registry) or the Planets Core Registry is also planned.
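
A format policy can be thought of as a lookup from an identified source format to a target format and tool, separately for preservation (AIP) and access (DIP) purposes. The sketch below is illustrative only: the rule table, the tool names and the function `plan_normalisation` are invented for this example and do not reproduce real Archivematica policies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormatRule:
    purpose: str        # "preservation" (for the AIP) or "access" (for the DIP)
    target_format: str  # format the file should be converted to
    tool: str           # placeholder name of the conversion tool


# Hypothetical policy table keyed by the identified source format.
POLICY: dict[str, list[FormatRule]] = {
    "image/bmp": [
        FormatRule("preservation", "image/tiff", "example-image-converter"),
        FormatRule("access", "image/jpeg", "example-image-converter"),
    ],
    "audio/x-wav": [
        FormatRule("access", "audio/mpeg", "example-audio-converter"),
    ],
}


def plan_normalisation(identified_format: str, purpose: str) -> Optional[FormatRule]:
    """Return the rule to apply, or None if the original is kept as it is."""
    for rule in POLICY.get(identified_format, []):
        if rule.purpose == purpose:
            return rule
    return None
```

When no rule matches, nothing is converted and only the original file is preserved, which corresponds to the second prong of the strategy described above.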

Institutions may use the FPR registry as a tool for supporting and updating local processes as part of their broader concepts and strategies for digital preservation. The user has the freedom to determine their own procedures based on institutional LTP policies or tools available for preservation planning, such as PLATO^[13]. However, Archivematica in its current version does not address the creation of the generic preservation plans and their implementation.

2.5 Access

Archivematica was designed to support integration with external systems already used by institutions for data storage, data management and access wherever possible. Therefore it is intended more as a back-end supplement to manage preservation tasks than a data management and access system of its own. Archivematica customers have the option to continue to use their existing systems and integrate Archivematica with them to "only" support long-term archiving processes.

Access versions of digital objects, packed together with other information in DIP packages, can be generated during ingest of data into Archivematica. DIPs are then imported into an external access system and made available to users through it. Archivematica provides tools for basic metadata synchronization between archival storage and external access systems. Currently, there are two approaches to providing access to archived information. Firstly, the AtoM system was developed by the creators of Archivematica to address the needs of the community of archivists. Secondly, users can connect their own access systems to Archivematica. There are various pilot projects in which Archivematica was connected to systems like Archivists' Toolkit, CONTENTdm, DSpace or Fedora (Islandora)^[14].

2.6 Administration – Dashboard

The user interface of Archivematica, used for managing processes and configuring the system, is called the Dashboard. It is a web application which provides the following functions to a user/administrator:

- configure the system,

- prepare and ingest new content into the digital archive,

- monitor and manage ingest processes, usually by configuring and choosing from available options (dropdown menus),

- edit and enhance metadata,

- handle user requests for AIP packages,

- report on preservation planning,

- report different statistics and operations running within the system (in a very rudimentary form).

In accordance with the OAIS model, Archivematica's functions are organized into the modules described in the sections above; the individual tabs of the Dashboard correspond to these modules, i.e. Transfer, Ingest, Archival Storage, Preservation Planning^[15], Access and Administration (see Figure 3).

Figure 3: Dashboard – user interface of Archivematica

While Archivematica performs its operations, the Dashboard displays a list of the micro-services currently in use and generates alerts if manual intervention by the administrator is necessary, for example to choose the variant of the next ingest step or to decide how to resolve an error. However, it is possible to configure the individual processes to proceed automatically, so that manual intervention is mostly not needed^[16].

2.7 Management of archival packages (AIPs)

The current version of Archivematica creates high-quality and robust AIPs, but it provides only a minimum of the tools needed for their long-term management. For example, over a longer period of time it may be necessary to modify the content of AIPs in connection with the migration of obsolete file formats to new ones, or to update the metadata stored in the AIP. New versions of Archivematica should deliver at least partial improvements in this direction. Awaited functions include versioning of information packages and the possibility of AIP re-ingest^[17]. This should make it possible to perform both minor AIP updates (such as adding a file that was missing in the original SIP package) and extensive large-scale changes (such as periodic migration of normalised file formats).

Other features not yet supported are the replication of AIP packages to multiple geographically distributed repositories and periodic integrity checks of these packages^[18]. Users of the current version of Archivematica must use external systems and tools for these tasks.

3. Other features of Archivematica

3.1 Archivematica as a software and micro-services

Archivematica is developed in the Python programming language. Archivematica's code, development environment and documentation are freely available under the AGPL 3.0 (GNU Affero General Public License) and Creative Commons licenses. The system can be installed in an Ubuntu environment. Alternatively, it can be distributed as a virtual appliance bundled with the Xubuntu Linux distribution and a set of open source software tools. By using an appropriate virtualization application (e.g. Oracle VirtualBox, VMware Player), a virtual machine running Archivematica can run on any hardware platform and operating system, including conventional desktop computers. The disk image used for the virtual machine can also be used to create a bootable USB drive or DVD, or for direct installation of Archivematica on physical hardware such as servers and workstations.

As mentioned earlier, Archivematica uses the concept of micro-services. This means that the information packages submitted to the system are processed step by step by individual micro-services, pipelined in such a way that the output of one is the input of the next. Each micro-service usually consists of several steps (jobs) and is implemented as a combination of Archivematica scripts and one or more freely available software tools. Each of the pre-installed tools can be replaced (at least theoretically) by another one without compromising the functioning of the system as a whole^[19].
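
The chaining principle, in which the output of one micro-service is the input of the next, can be sketched in a few lines of Python. This is a simplified illustration, not Archivematica code: a package is modelled as a plain dictionary, and the functions `clean_filenames`, `assign_identifier` and `run_pipeline` are hypothetical stand-ins for real micro-services.

```python
import re
import uuid
from typing import Callable

Package = dict  # a package is modelled as a plain dictionary in this sketch
MicroService = Callable[[Package], Package]


def clean_filenames(package: Package) -> Package:
    """Replace characters outside a conservative safe set with underscores."""
    cleaned = [re.sub(r"[^A-Za-z0-9._-]", "_", name) for name in package["files"]]
    return {**package, "files": cleaned}


def assign_identifier(package: Package) -> Package:
    """Attach a newly generated UUID to the package."""
    return {**package, "uuid": str(uuid.uuid4())}


def run_pipeline(package: Package, services: list[MicroService]) -> Package:
    """Feed the output of each micro-service into the next one, in order."""
    for service in services:
        package = service(package)
        # Record which steps have been completed so far.
        history = package.get("history", []) + [service.__name__]
        package = {**package, "history": history}
    return package
```

Because every step has the same signature, one of them can be replaced by a different implementation without touching the others, which mirrors the replaceability of tools described above.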

In the initial analysis and the development of user scenarios based on OAIS functional model the Archivematica developers identified 24 original micro-services which they grouped into 9 process categories^[20]:

Process category	Micro-services
1. receiveSIP	verifyChecksum
2. reviewSIP	extractPackage, assignIdentifier, parseManifest, cleanFilename
3. quarantineSIP	lockAccess, virusCheck
4. appraiseSIP	identifyFormat, validateFormat, extractMetadata, decidePreservationAction
5. prepareAIP	gatherMetadata, normalizeFiles, createPackage
6. reviewAIP	decideStorageAction
7. storeAIP	writePackage, replicatePackage, auditFixity, readPackage, updatePackage
8. provideDIP	uploadPackage, updateMetadata
9. monitorPreservation	updatePolicy, migrateFormat

The scope and specifications of micro-services are constantly enhanced and refined during Archivematica development, so the current list of micro-services is much wider and more sophisticated.

Archivematica architecture utilizing micro-services implemented by freely available tools is shown in Figure 4.

Figure 4: Archivematica system architecture^[21].

3.2 Standards

Archivematica uses a number of open de facto standards for metadata, identifiers and integration of the information. The most important are:

BagIt^[22] – specifies a method of packaging folders and files into single packages for the purposes of long-term preservation or data exchange. Checksum information is generated and stored for each file in the package, which simplifies integrity assurance. The BagIt standard is used for AIP packaging, and BagIt packages created by other systems can be submitted to Archivematica as a Transfer.
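
To make the packaging idea concrete, the following sketch writes and verifies a very small bag by hand: a `data/` payload directory, a `bagit.txt` declaration and a `manifest-sha256.txt` with one checksum line per payload file. It is a simplified illustration under stated assumptions (version string 0.97, SHA-256 as the only algorithm, no optional tag files); the function names are hypothetical and real bags produced by Archivematica contain more than this.

```python
import hashlib
from pathlib import Path


def write_minimal_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/ plus the declaration and manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for relative_name, content in sorted(payload.items()):
        target = data_dir / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{relative_name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute every checksum listed in the manifest and compare."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        digest, relative_path = line.split(maxsplit=1)
        file_path = bag_dir / relative_path
        if not file_path.is_file():
            return False
        if hashlib.sha256(file_path.read_bytes()).hexdigest() != digest:
            return False
    return True
```

Any later change to a payload file makes `verify_bag` return `False`, which is how the stored checksums simplify integrity assurance.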

METS (Metadata Encoding and Transmission Standard)^[23] – standard encapsulating metadata (descriptive, administrative, structural) and source files of a structured digital object. Archivematica uses the METS standard to group metadata of the archival objects in a single XML file. The METS file with all metadata records and the content files constitute the AIP package.
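
The way a METS document groups descriptive metadata, a file inventory and a structural map in one XML file can be sketched with the Python standard library. The example below is a bare skeleton for illustration, not the METS profile that Archivematica actually produces; the function `build_mets`, the identifier values and the choice of a single Dublin Core title are assumptions.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("dc", DC_NS)


def q(namespace: str, tag: str) -> str:
    """Build a namespace-qualified tag name in ElementTree notation."""
    return f"{{{namespace}}}{tag}"


def build_mets(title: str, files: dict[str, str]) -> ET.Element:
    """Build a skeleton METS tree; `files` maps a file ID to its path."""
    root = ET.Element(q(METS_NS, "mets"))

    # Descriptive metadata section wrapping one Dublin Core element.
    dmd = ET.SubElement(root, q(METS_NS, "dmdSec"), ID="dmdSec_1")
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), MDTYPE="DC")
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(DC_NS, "title")).text = title

    # File inventory and structural map referring to the same file IDs.
    file_sec = ET.SubElement(root, q(METS_NS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), USE="original")
    struct_map = ET.SubElement(root, q(METS_NS, "structMap"), TYPE="physical")
    top_div = ET.SubElement(struct_map, q(METS_NS, "div"), DMDID="dmdSec_1")

    for file_id, path in files.items():
        file_el = ET.SubElement(group, q(METS_NS, "file"), ID=file_id)
        ET.SubElement(
            file_el,
            q(METS_NS, "FLocat"),
            {"LOCTYPE": "OTHER", "OTHERLOCTYPE": "SYSTEM", q(XLINK_NS, "href"): path},
        )
        ET.SubElement(top_div, q(METS_NS, "fptr"), FILEID=file_id)
    return root
```

Calling `ET.tostring(build_mets("Annual report", {"file_1": "objects/report.pdf"}), encoding="unicode")` yields the XML text; the result has not been validated against the METS schema here.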

PREMIS (PREservation Metadata: Implementation Strategies)^[24] – a preservation metadata standard which provides a data dictionary for recording information about changes to archived objects in the course of preservation. It also records information about events related to the archived object (for example ingest into the archival system, virus checks performed, format conversions, fixity checks etc.), the agents associated with these events (people, software, institutions) and the technical characteristics of the archived objects (including information about the file format, size, resolution). Archivematica generates PREMIS metadata for preserved objects and adds it to the METS files which describe these archived objects.

UUID (Universally Unique Identifier)^[25] – a standard for the unique identification of information objects in distributed systems without any central coordination. A UUID is a 128-bit value (represented as a 36-character string of hexadecimal digits and hyphens) generated so that the identifiers are globally unique. Archivematica uses UUIDs to identify all objects, including files, processes, and storage locations.
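
The relationship between objects, events and agents described above can be modelled very simply. The following sketch uses plain Python data classes rather than the PREMIS XML schema; the class and field names (`Agent`, `Event`, `record`, `history_of`) are invented for illustration and are not PREMIS element names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Agent:
    identifier: str
    agent_type: str  # e.g. "person", "software" or "organization"


@dataclass(frozen=True)
class Event:
    event_type: str      # e.g. "ingestion", "virus check", "fixity check"
    object_id: str       # identifier of the archived object concerned
    agent: Agent         # who or what carried out the event
    outcome: str
    timestamp: datetime


def record(event_type: str, object_id: str, agent: Agent, outcome: str,
           when: Optional[datetime] = None) -> Event:
    """Create an event, stamped with the current UTC time by default."""
    return Event(event_type, object_id, agent, outcome,
                 when or datetime.now(timezone.utc))


def history_of(object_id: str, events: list[Event]) -> list[Event]:
    """Return the events of one object in chronological order."""
    related = [event for event in events if event.object_id == object_id]
    return sorted(related, key=lambda event: event.timestamp)
```

Listing the events of a single object in chronological order gives the preservation history that such metadata is meant to document.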
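
The size and textual form of a UUID can be checked with the Python standard library. The sketch assumes randomly generated (version 4) identifiers, which is only one of the variants defined by the standard; the helper `describe` is a hypothetical name.

```python
import uuid


def describe(identifier: uuid.UUID) -> dict:
    """Summarise the size and textual form of a UUID."""
    text = str(identifier)  # canonical 8-4-4-4-12 form
    return {
        "bits": len(identifier.bytes) * 8,
        "characters": len(text),
        "hex_digits": len(identifier.hex),
        "hyphens": text.count("-"),
    }


identifier = uuid.uuid4()  # random UUID, generated without central coordination
summary = describe(identifier)
```

The 36 characters of the textual form are 32 hexadecimal digits plus 4 hyphens, encoding the 128-bit value.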

3.3 System scalability

Archivematica has a client/server architecture that can be configured in different ways to support scalable data processing. To achieve better performance when processing data on a large scale, the services can be distributed over several nodes (processors). Similarly, in some scenarios users can run several installations of Archivematica in parallel: each system can perform a different task (for example when a particular type of task is very resource-intensive, such as converting many large images), or the systems can work in parallel on the same task (parallel processing of large amounts of data).

3.4 Sustainability and further development

Archivematica is open source software developed and freely distributed with the support of Artefactual Systems Inc. In the beginning, development of the system was also co-financed by UNESCO^[2]. Currently, it is supported by the producer company and from other resources, such as customers who sponsor the development of specific functions. The sponsored functionality is then available to all users. The community also uses code contributions provided by independent developers. Institutions which need technical support with the installation and configuration of the system can order it as an optional paid service from Artefactual Systems Inc.

4. Conclusion – what Archivematica is and what it is not

The main Archivematica features can be summarized as follows:

Archivematica is a free, open source system developed by Artefactual Systems Inc. with the support of a growing community of users and customers.
The system is actively developed – a new version with new functions and bug fixes is released several times a year.
So far the system cannot be considered a finished product. Some important functions are still missing and configuring the system for smooth ingest of large data can be difficult.
Users can influence the development of the system by sponsoring new functions (which are then freely available to all users and are incorporated into further versions of Archivematica) or by adding suggestions to a wish-list. As the source code is open, anybody can create and share their own components or adjust the existing code.
The system is flexible. It is based on the concept of micro-services using proven open source tools and open standards for implementation of most of the services needed in the archiving workflows and data management.
Configurability is a strong point of Archivematica, especially when it comes to configuration of the tools connected in the micro-services. To a large degree the user can configure the system according to his/her specific needs. On the other hand, wide configuration options can pose a barrier for newcomers to digital preservation.
A large part of the ingest processing can be automated and manual processing may be minimized.
Currently the basic preservation strategy is normalisation (based on format policies) and generating high quality archival packages which can be stored in any repository.
Via the FPR – the Format Policy Registry – the system provides updated recommendations based on the current experience and shared knowledge of the preservation community, while permitting local configuration reflecting the specific needs of each institution.
Different projects use Archivematica in different maturity phases, in various contexts and environments. Experience, tools and deployment architectures^[26] can be shared. Fewer practical experiences and use cases have been published about larger installations and long-term production deployments.
At the time of writing this article, the system still did not provide all the functions that could be derived from the OAIS. It focuses on ingest processing and the preparation of AIPs. Integration with external systems is needed to cover the other OAIS functional entities (physical storage, preservation planning, active preservation, access for end users etc.).
The system provides a low-cost solution for long-term preservation of digital information. Institutions can start with this activity now, even with limited resources and finances. However, nothing is for free. The implementation of the system in a real-life digital preservation project requires significant effort and experience with management and configuration of the system, customization to specific local needs and integration with the wider institutional infrastructure.
For those who would like to use Archivematica for long-term preservation of their digital data, but lack the necessary technical personnel for its implementation, management and maintenance, there are paid hosted services such as Arkivum^[27], ArchivesDirect^[28] and others.

5. Conclusion

Archivematica is an open source system supporting long-term digital preservation, which is currently considered by many to be the most advanced freely available solution. Unlike other solutions which try to cover all functions related to management, preservation and access in one integrated system (for example RODA^[29]), Archivematica is intended as a complement to existing infrastructures. Archivematica focuses on the processes and services of long-term preservation, and expects to be integrated with available external systems for data management (collection management, physical storage, access).

Archivematica is a relatively young system, developed by Artefactual Systems Inc. since 2008. The development is not finished yet and some functions inherent in commercial solutions are still missing. But dynamic development, system flexibility, and a growing user community give hope to those who look for open and promising solutions for projects with limited budgets. Besides quite a number of evaluation projects in the international community and the first production installations, the system is currently being tested and used by pilot projects in the Czech context. The National Digital Archive solution currently being developed step by step by the Czech National Archives uses Archivematica as one of its components. The system was also intensively tested in the LTP-Pilot project supported by the CESNET Development Fund. Use of Archivematica is also expected in the planned ArcLib project, which should develop a complex solution for long-term preservation of library digital collections (the project was submitted by a group of Czech libraries to the funding call of the NAKI II programme of the Ministry of Culture of the Czech Republic for the years 2016-2020^[30]).

Literature

^[1] Archivematica [online]. Artefactual Systems Inc., 2015 [cit. 2015-09-28]. Available online at: http://www.archivematica.org/

^[2] VAN GARDEREN, Peter and Courtney C. MUMMA. Realizing the Archivematica vision: delivering a comprehensive and free OAIS implementation. In: iPRES2013: proceedings of the 10th International Conference on Preservation of Digital Objects, 3-5 September 2013, Lisbon, Portugal [online]. Lisbon: Biblioteca Nacional de Portugal, 2013 [cit. 2015-09-28]. Available online at: http://purl.pt/24107/1/iPres2013_PDF/Realizing%20the%20Archivematica%20vision%20delivering%20a%20comprehensive%20and%20free%20OAIS%20implementation.pdf

^[3] VAN GARDEREN, Peter. Archivematica: Using micro-services and open-source software to deliver a comprehensive digital curation service. In: iPRES2010: 7th International Conference on Preservation of Digital Objects, September 19 – 24, 2010, Vienna, Austria [online]. Vienna: iPRES2010, 2010 [cit. 2015-09-28]. Available online at: http://www.ifs.tuwien.ac.at/dp/ipres2010/papers/vanGarderen28.pdf

^[4] JORDAN, Mark. Introduction to Archivematica: Material for a workshop on Archivematica. In: GitHub [online]. Apr 30 2014 [cit. 2015-09-28]. Available online at: https://github.com/mjordan/archivematicaworkshop

^[5] SCHUMACHER, Jaime et al. From Theory to Action: “Good Enough” Digital Preservation Solutions for Under-Resourced Cultural Heritage Institutions: A Digital POWRR White Paper for the Institute of Museum and Library Service [online]. August 2014 [cit. 2015-09-28]. Available online at: http://commons.lib.niu.edu/handle/10843/13610

^[6] LAVOIE, Brian. The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition): DPC Technology Watch Report 14-02 October 2014 [online]. Digital Preservation Coalition, 2014 [cit. 2015-09-28]. Available online at: http://dx.doi.org/10.7207/TWR14-02

^[7] ČSN ISO 14721. Systémy pro přenos dat a informací z kosmického prostoru – Otevřený archivační informační systém – Referenční model [Space data and information transfer systems – Open archival information system – Reference model]. Praha: Úřad pro technickou normalizaci, metrologii a státní zkušebnictví, 2014. 98 p. Classification code 31 9620.

^[8] MITCHAM, Jenny et al. Filling the Digital Preservation Gap: A Jisc Research Data Spring project: Phase One report – July 2015 [online]. University of York, University of Hull, 2015. Available online at: http://dx.doi.org/10.6084/m9.figshare.1481170

^[1] The aims of the project were to test the functionality, requirements and constraints of the Archivematica system; to verify its usefulness for the logical long-term preservation of selected documents and collections; and to create basic documentation for system administrators and digital data curators.

^[2] Jan Hutař and Andrea Byrne from Archives New Zealand kindly read the English translation of the article and suggested a number of improvements. The author wishes to thank them for their help.

^[3] http://www.archivematica.org

^[4] As an example of the approach see the project POWRR – Preserving Digital Objects With Restricted Resources, http://commons.lib.niu.edu/handle/10843/13610.

^[5] VAN GARDEREN, Peter and Courtney C. MUMMA. Realizing the Archivematica vision: delivering a comprehensive and free OAIS implementation. In: iPRES2013: proceedings of the 10th International Conference on Preservation of Digital Objects, 3-5 September 2013, Lisbon, Portugal

^[6] The Open Archival Information System (OAIS) is a reference model for long-term digital archives, created as a recommendation of the international forum Consultative Committee for Space Data Systems in 1999 and standardized in 2002 as the International Standard ISO 14721:2003. In 2012 an updated version was published as ISO 14721:2012 (a Czech translation of this standard was published in 2014). A highly qualified and readable overview and assessment of the OAIS by Brian Lavoie can be found in ^[6].

^[7] The OAIS standard encompasses three related models: the OAIS Environment (external entities and the archive's interaction with them), the OAIS Functional Model (core functions of the archive) and the OAIS Information Model (high-level description of the information objects managed by the archive). All entities, relationships and processes are described in detail in the standard.

^[8] AIP packages are the key information objects for long-term preservation. Preservation and usability of the original content depend heavily on the quality and completeness of the information contained in AIPs. Each AIP package includes not only the information that is subject to archiving (i.e. the Content Data Object), but also a range of supporting information: information necessary for future understanding and presentation of the object (at both the structural and semantic levels), and metadata to support and document preservation processes – identification, preservation context and the history of changes to the object, proof of integrity and authenticity, access data and more. The structure and content of an AIP are specified at the general level in the OAIS information model. Archivematica's specific AIP implementation is described in the system and user documentation.

^[9] The figure is taken from https://github.com/mjordan/archivematicaworkshop

^[10] The concept of "Transfer" is not defined by the OAIS reference model. Archivematica introduced it as a supplemental entity based on practical experience and needs of users.

^[11] The Archivematica developers justify the use of a file system by its robustness and proven long-term durability in comparison with other types of information management technology. At the same time it is part of their broader preservation strategy: no layer or component of an LTP system is immune to the risk of technological obsolescence, any more than the digital data itself. The fewer complex technology layers, the better.

^[12] PRONOM is an on-line information system about data file formats and their supporting software products. It was developed and is operated by the National Archives of Great Britain.

^[13] PLATO – The Preservation Planning Tool, http://www.ifs.tuwien.ac.at/dp/plato/intro/

^[14] One example is the use of Archivematica as a "dark archive" for the DSpace system. DSpace repository serves users as a storage and access system with Archivematica connected as a preservation back-end. For more information see https://www.archivematica.org/wiki/DSpace_integration, https://www.archivematica.org/wiki/DSpace_exports

^[15] The Preservation Planning tab provides access to the FPR registry.

^[16] Examples of automating Archivematica can be found in https://github.com/mjordan/archivematicaworkshop

^[18] Periodic integrity checking is planned for future Archivematica versions.

^[19] For example, the current version of the „Scan for viruses“ micro-service uses the ClamAV tool to check the files in a transfer for viruses. To replace ClamAV with other antivirus software, it is only necessary to modify the scripts that prepare the data for the new antivirus tool.

^[20] VAN GARDEREN, Peter. Archivematica: Using micro‐services and open‐source software to deliver a comprehensive digital curation service. In: iPRES2010: 7th International Conference on Preservation of Digital Objects, September 19 – 24, 2010, Vienna, Austria

^[21] Taken from http://www.ifs.tuwien.ac.at/dp/ipres2010/papers/vanGarderen28.pdf

^[22] https://tools.ietf.org/html/draft‐kunze‐bagit‐10

^[23] http://www.loc.gov/standards/mets/

^[24] http://www.loc.gov/standards/premis/ or PREMIS introduction by B. Lavoie and R. Gartner. Preservation Metadata (2nd edition). DPC Technology Watch Report 13‐03, May 2013 available from http://dx.doi.org/10.7207/TWR13‐03.

^[25] https://tools.ietf.org/html/rfc4122

^[26] The Czech National Archive is developing its own long-term preservation solution based on Archivematica.

^[27] http://arkivum.com

^[28] http://www.archivesdirect.org

^[29] http://www.roda‐community.org/

^[30] The results of the NAKI II call should be known at the end of 2015.

Towards issues of descriptive metadata for knowledge organization systems

Eva Bratková, Helena Kučerová — 2015-09-18T12:20:00Z

Keywords: Knowledge organization systems, metadata description, metadata elements, NKOS Application Profile, FRBR conceptual model, knowledge base

PhDr. Eva Bratková, Ph.D., PhDr. Helena Kučerová, Ústav informačních studií a knihovnictví FF UK v Praze (Institute of Information Studies and Librarianship, Faculty of Arts, Charles University in Prague)

Introduction

The present study addresses the metadata description of knowledge organization systems. It follows the preceding study published in 2014^[1], which dealt with their typology. Determining the type of a knowledge organization system (KOS) is an important part of its metadata description. The new study is based on a typical set of KOS records processed in the prototype of a knowledge base that is being developed within the project „Knowledge base for the subject area of knowledge organization“, part of the NAKI Programme (DF13P01OVV013). One objective of the research has been to develop the concept and content of the KOS description so that it complies with the aim and function of the knowledge base. Collecting metadata on the given KOS has shown that they need to be described as entities of a strongly variable character, from their origin through continuous changes to their present-day presentation in freely accessible structures of linked open data. The study first presents the results of an analysis of existing KOS metadata descriptions in selected databases and then offers a draft of their description in the knowledge base. The draft is founded on the general principles of the FRBR and FRSAD conceptual models and on the proposal of metadata elements for KOS description in the NKOS Application Profile (NKOS AP), prepared by the DCMI Networked Knowledge Organization Systems Task Group. The proposed metadata schema also takes account of other existing metadata schemas used for similar purposes (BARTOC, DCAT, DC-terms, FRBR/FRSAD, OMV, Schema.org, and TaxoBank).

The text of the study is divided into three parts. The first covers the results of the analysis and evaluation of KOS descriptions in the existing practice of bibliographic databases and of the known online KOS registries. The second part presents the draft standard for KOS description prepared by the DCMI NKOS Task Group. The conclusions of these two parts form the starting point of the third part, which proposes a description of these systems in the project knowledge base for the field of information and knowledge organization. The draft description is complemented by selected examples of KOS types from the knowledge base (classification scheme, thesaurus, subject headings), with appended diagrams.

1 Existing practice of knowledge organization systems description

This part of the study presents the results of an analysis of KOS metadata descriptions as applied in various online databases. The evaluation covers the description of these systems in selected bibliographic and catalogue databases, where they are treated at the same level as any other publication, and above all in the specialized registries or databanks that describe KOS exclusively. The number and character of the applied elements (data) are evaluated.

1.1 Description of knowledge organization systems in bibliographic or catalogue databases

Knowledge organization systems (classification schemes, taxonomies, thesauri, ontologies etc.), especially in their traditional forms, are currently described in the bibliographic databases of national registration agencies and in the databases of local and union catalogues. For various reasons these databases do not always register KOS completely (for example KOS produced by commercial entities as grey literature): sometimes only a fraction is recorded, sometimes none at all. KOS in online form are registered as well, some of them accessible free of charge, some under a licence. In present-day catalogues there is little or no representation of KOS in the structures of linked open data. Mapping the KOS records in the above mentioned types of databases confirmed the expectation that catalogue records do not offer a suitable representation of these systems for the knowledge base project. For similar reasons they are apparently not used in systems incorporated into the environment of the semantic web. They are also unsuitable because of the small number of descriptive data they contain and because their structures do not support present-day communication and use.

However, the bibliographic and catalogue databases have served a good purpose in the research project, namely as resources for establishing the existence of KOS and for taking over their identification data. The acquired data nevertheless had to be checked, unified and complemented by further KOS-specific data in view of the research objectives. To date, more than 230 records have been stored in the knowledge base, representing almost 100 important KOS^[2]. The survey covered both KOS published at various intervals on paper and KOS published on electronic media, from CD-ROM through online access to the recent free access in the form of richly structured linked open data.

Establishing the existence of KOS on traditional paper carriers or on physical electronic carriers proved to be a particularly demanding task. Here the research team carried out the excerpting work with the greatest possible help of the union catalogue WorldCat, combined with national bibliographic databases and, where possible, national union catalogues^[3]. Where KOS records were absent from the chosen databases, or where the description was minimal or contradictory, further resources had to be consulted. In particular, literature about the specific KOS was checked, where any existed; KOS in traditional forms were also consulted, preferably in digitized versions (permanently kept, e.g., in HathiTrust Digital Library, the Internet Archive and similar systems). KOS records available in the Google Books system since 2012 were also checked for verification, although they have been taken over from the WorldCat database, where they are freely available in the new „Schema.org“ structure.

As mentioned above, catalogue records, which have as a rule been stored in a MARC-type format, clearly lack the volume of data (see Fig. 1) needed for the objectives of the research project. Some specific data are missing (such as information about the KOS type, its internal structure, the number of lexical units etc.). The records carry no explicit indication of the type of KOS, which prevents automatic searching by type. With a few exceptions, the records have no abstracts or annotations. KOS with a long history of development usually lack indications of the relationships between versions; it is difficult to trace the editions, especially where they were accompanied by changes of title or of features. Most records lack unifying data, namely the Uniform title or Author-Title, although such data greatly facilitate online search. With minor exceptions, unambiguous and permanent identifiers of works or of their expressions^[4] are also missing. On the other hand, the quality of KOS alternative titles, of data concerning edition or version, and of the language and country of publication tends to be relatively good. The catalogue records of KOS and their individual elements in the traditional MARC format (including MARCXML or MODS) are static in character, which makes them unsuitable for dynamic linking in the environment of the semantic web.

>040 GEBAY $e rakwb $b ger $c GEBAY $d OCLCQ

>015 05,N36,0101 $2 dnb

>016 7 976026287 $2 DE-101

>020 3598116519

>020 9783598116513

>080

>084 AN 93550 $2 rvk

>049 CQKA

>245 00 Dewey-Dezimalklassifikation und Register : $b DDC 22 / $c begr. von Melvil Dewey. Hrsg. von Joan S. Mitchell. Hrsg. von Der Deutschen Bibliothek.

>246 3 DDC 22

>250 Dt. Ausg.

>260 München : $b Saur.

>515 Erschienen: 1 - 4.

>630 07 Dewey-Dezimalklassifikation. $2 swd

>650 07 Inhaltserschließung. $2 swd

>650 07 Bibliothek. $2 swd

>650 07 Klassifikation. $2 swd

>650 07 Übersetzung. $2 swd

>651 7 Deutsch. $2 swd

>655 7 Richtlinie. $2 swd

>700 1 Dewey, Melvil.

>700 1 Mitchell, Joan S.

>710 2 Deutsche Bibliothek (Frankfurt, Main; Leipzig)

>730 0 Dewey decimal classification and relative index. $l Deutsch.

>856 41 $3 Inhaltsverzeichnis $u http://bvbr.bib-bvb.de:8991/F?...

>029 0 GEBAY $b 7716793

Fig. 1 Record (OCLC 643133344) of the German translation of the 22^nd edition of Dewey Decimal Classification in MARC Text Area format (without encoded fields 00X) from WorldCat; the record contains the uniform title of expression level (OCLC, 2015) in field 730
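To make the limits of such a record concrete, the following minimal Python sketch splits one line of the display format in Fig. 1 into a field tag and its subfields. It is an illustration only, not a MARC parser: the function name is hypothetical, and it assumes that subfields are introduced by „$“ plus a one-character code, as in the figure. Because this display format does not separate indicators from the first unmarked value, the sketch keeps that leading text together as-is.

```python
import re


def parse_marc_text_line(line):
    """Split a display line such as '>084 AN 93550 $2 rvk' into its parts.

    Assumption: subfields are marked by '$' followed by a one-character code.
    Text before the first marker may contain indicators and/or the first
    subfield value; the display format does not let us tell them apart
    reliably, so it is returned unchanged as 'lead'.
    """
    line = line.lstrip(">").strip()
    tag, _, rest = line.partition(" ")
    # Splitting with a capturing group keeps the subfield codes in the result.
    parts = re.split(r"\s*\$(\w)\s+", rest)
    lead = parts[0].strip()
    subfields = [(code, value.strip())
                 for code, value in zip(parts[1::2], parts[2::2])]
    return {"tag": tag, "lead": lead, "subfields": subfields}


classification = parse_marc_text_line(">084 AN 93550 $2 rvk")
cataloguing_source = parse_marc_text_line(">040 GEBAY $e rakwb $b ger $c GEBAY $d OCLCQ")
empty_field = parse_marc_text_line(">080")
```

Even after parsing, nothing in these fields states that the described resource is a classification scheme; that information has to be inferred from titles and subject headings, which is the gap discussed above.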

Catalogue records stored and communicated in MARC-type formats, even when converted to one of the linked open data formats, at present contain only a basic set of descriptive data on the KOS used within the research project. These are: title data of a descriptive or formalized character; creator data in authority forms; edition or version data (very useful in the case of KOS); publisher data; permanent identifiers such as URI, ISBN and ISSN; the number of pages for paper carriers; subject description in the form of notations, usually from one of the universal classifications (DDC, UDC, LCC etc.), or terms from controlled vocabularies; series data; sometimes even rather rich annotations; and possibly the location of items in libraries or other information institutions.

1.2 Descriptive metadata of KOS within Schema.org

The findings of part 1.1 lead naturally to the analysis and evaluation of KOS description in one of the newer metadata schemas, namely Schema.org (with the namespace http://schema.org/, see the xmlns:schema declaration in the RDF format in Fig. 2), drawing on the experience of the OCLC network with the records of its union catalogue WorldCat. Today this schema is used globally for freely accessible Linked Open Data (LOD) that can describe anything (the schema's terminology speaks of describing „things“), and therefore also information resources including KOS. A short introduction is given here on the basis of freely accessible linked data about KOS, as provided by the WorldCat database and kept as microdata (Microdata Section) in RDFa structure, as well as in the HTML code of the displayed catalogue record pages. OCLC, having been among the first to make its catalogue records available in this way, chose Schema.org^[5] for this purpose. The effect has been considerable: catalogue information can reach end users very quickly, in particular those using mobile devices and the Google search engine^[6], since the data have also been transferred to the Google Books system.

Before the applied elements of the metadata schema are presented, it should be stressed that the concept is not a „classical“ (static) description of resources, in which values of properties are simply assigned to a resource according to given description elements. In the LOD concept the assigned values are linked to values in freely accessible controlled vocabularies, or linked directly to the values of these vocabularies by means of URI identifiers. Schema.org enables the dynamic generation of a description of a resource by combining different elements grouped within the defined basic object categories (work, event, action, organization, person, place, product, intangible thing etc.).

The knowledge organization systems, as presented under LOD in the WorldCat database, comprise the unambiguous identification of the record (URI) within the introductory designation of type of the described resource in the element <schema:about> – see Fig. 2 (the first occurrence of the element „rdf:Description“). The unique number of the record in the WorldCat database has been made use of. The database as such is identified by the aid of element <void:inDataset> from another name space „VoID“ (http://rdfs.org/ns/void#).

The basic data of the described KOS can be given as a description of one of the types of CreativeWork. For instance, the type Book (http://schema.org/Book) is available; this type also applies to KOS in print form. With the help of the element <schema:bookFormat>, however, a more specific form can be indicated, such as Electronic book (http://schema.org/EBook), under which online web resources are also classed at present (see Fig. 2). No finer categorization that would meet the requirements of KOS description is available yet. The main title of the KOS can be given in the element <schema:name> and other titles in the element <schema:alternateName>. It is advisable to specify both types of titles further by an attribute giving the language code of the title. If a uniform title of the work exists, it is given directly as part of the description of the Work (http://schema.org/CreativeWork) in the element <schema:name>. In this respect Schema.org reflects the concept of the FRBR model.

Edition data relating to a KOS can be given in the element <schema:bookEdition>, the publisher's name in the element <schema:publisher> and the date of publication in the element <schema:datePublished>. If the KOS is a continuing resource (a web site etc.), the element <schema:startDate> is also applicable, within the description of the publication event object (http://schema.org/PublicationEvent), together with the place of the event in the element <schema:location> and the organizer, i.e. the publisher, in the element <schema:organizer>. The publisher (as well as the corporate body that created the KOS) can in addition be described by name in the element <schema:name> as part of the description of the Organization object (http://schema.org/Organization). Similarly, the place of publication can be described by name within the description of the Place object (http://schema.org/Place). The place of publication can also be given as a code taken from an existing vocabulary of places (http://id.loc.gov/vocabulary/countries), in the element <dcterms:identifier> from the „dcterms“ namespace (http://purl.org/dc/terms/). The language of the resource can be given as a code in the element <schema:inLanguage>, with a link to the chosen code list (such as http://id.loc.gov/vocabulary/iso639-1). Permanent ISBN identifiers can be given in the element <schema:isbn> within the description of the object category Product Model (http://schema.org/ProductModel).
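The way these elements combine can be illustrated with a minimal Python sketch that assembles a Schema.org description of a printed KOS. Two points are assumptions of the sketch and not part of the WorldCat data: the description is serialized as JSON-LD (the records discussed here are in RDF/XML), and the helper name and the language code „de“ are chosen for illustration only. The title, alternative title, edition and publisher values are taken from the record in Fig. 1.

```python
import json


def describe_kos_as_book(name, alternate_names=(), edition=None,
                         publisher=None, language=None):
    """Build a Schema.org 'Book' description as a JSON-LD style dictionary.

    Hypothetical helper: only properties named in the text above are used,
    and optional properties are omitted when no value is known.
    """
    doc = {"@context": "http://schema.org/", "@type": "Book", "name": name}
    if alternate_names:
        doc["alternateName"] = list(alternate_names)
    if edition:
        doc["bookEdition"] = edition
    if publisher:
        # The publisher is described as a separate Organization object.
        doc["publisher"] = {"@type": "Organization", "name": publisher}
    if language:
        doc["inLanguage"] = language
    return doc


record = describe_kos_as_book(
    name="Dewey-Dezimalklassifikation und Register",
    alternate_names=["DDC 22"],
    edition="Dt. Ausg.",
    publisher="Saur",
    language="de",  # assumed ISO 639-1 code for German
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

The resulting structure still has no property for the KOS type or for the internal structure of the system, which matches the limitation of Schema.org noted in this part.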

If the creators of a KOS are persons, their names can be given within the description of the Person object (http://schema.org/Person), using not only the mentioned element <schema:name> but also the more detailed elements <schema:familyName>, <schema:givenName>, <schema:birthDate> or <schema:deathDate>. Names of both persons and corporate bodies are already present in large numbers in the VIAF database, together with their unique VIAF ID identifiers and possibly also the international ISNI identifiers.

The subject description of KOS (and of other materials) can be implemented chiefly by linking to the values of a large number of knowledge organization systems (such as controlled vocabularies), within the object category Intangible (http://schema.org/Intangible). Examples can be seen in Fig. 2. Records from the WorldCat database can be linked, e.g., directly to class notations of the Dewey Decimal Classification on the server http://dewey.info (e.g. http://dewey.info/class/025.431/), or to the controlled subject terms of OCLC's new FAST system; for instance, the subject term „Classification, Dewey decimal“ is now freely available on the OCLC linked data server (http://experimental.worldcat.org/fast/863693/)^[7]. The schema provides the element <schema:abstract> for the text of an abstract or annotation (see the example in Fig. 2).
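The linking principle can be sketched in a few lines of Python. The two URI patterns below are inferred solely from the two example addresses quoted above; they are assumptions of this sketch, the function names are hypothetical, and whether the addresses still resolve is not checked here.

```python
def dewey_class_uri(notation):
    """Return a dewey.info style URI for a DDC notation (pattern assumed)."""
    return f"http://dewey.info/class/{notation}/"


def fast_uri(fast_id):
    """Return an OCLC FAST linked data URI for a numeric ID (pattern assumed)."""
    return f"http://experimental.worldcat.org/fast/{fast_id}/"


# Subject links for a record about the Dewey Decimal Classification,
# using the notation and the FAST identifier quoted in the text.
subject_links = [dewey_class_uri("025.431"), fast_uri(863693)]
```

The description itself then stores only these URIs; labels and relationships are retrieved from the linked vocabularies when needed.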

Fig. 2 Online record of database WebDewey (permalink http://www.worldcat.org/oclc/49510336) from the WorldCat database in form of linked open data (rdf.xml) (OCLC, 2015, the elements are marked by yellow colour, whereas their values are blue)

At present the Schema.org metadata schema does not yet enable the description of the specific features of KOS (the type of KOS, the internal relationships within a KOS, its structure etc.). The same holds for other specific types of resources. The schema is, however, being developed further, and attention is paid to details raised by a number of communities interested in the description of „things“. Extensive activity in the field of the description of information resources is carried out by the working group SchemaBibEx (https://www.w3.org/community/schemabibex/), led by the OCLC expert Richard Wallis. Suggestions are welcome by email or chat.

1.3 Descriptive metadata of KOS in the BARTOC registry

This part presents the results of our analysis of the KOS description in the BARTOC^[8] registration system. Its typology of KOS was discussed in the above cited study^[9] already. The registry has been operated since 2013 and in August 2014 its database contained 667 records. In April 2015 the number of registered systems almost doubled (1 137 records).

Each KOS in the BARTOC registry is described by only one record. The unit of description is the system as a whole, irrespective of carrier or format, with emphasis on its present-day form, especially the electronic one, and on online access. The historical development, in the sense of continuous registration of printed products and those on CD-ROM, is not covered; at present only some records include short abstracts briefly outlining the development. The KOS description is not based on the FRBR model. This approach follows from the main objective of the registry, which is to provide users quickly with clearly arranged information about the existence of the respective KOS. That is why the overall number of metadata elements is not large, with most values permanently stored in a structured vocabulary, the term taxonomy^[10]. An advantage of the registry is that the records, in addition to the online system, are also freely available in several formats as linked open data. Up to 20 languages can be used for online communication with the BARTOC registry (language equivalents exist even for a number of descriptive data values).

The free documentation of the BARTOC registry has not yet opened access to the metadata specification used for the description of included systems. That is why the following commentary can be based only upon the presentation form of the metadata records displayed within the frame of the online search system, taking into account metadata localized in the accessible records in the rdf.xml structure. An example of the description of the popular AGROVOC thesaurus can be found in Fig. 3.

Each metadata record in the BARTOC registry always has one unique and permanent record identifier (URI), which plays an important role in any linking. When the record is viewed in an Internet browser, the URI appears in the address bar (e.g. the AGROVOC thesaurus has the URI http://www.bartoc.org/node/305). If the record is represented in linked data structures, an appropriate extension is added to the basic URI (http://www.bartoc.org/node/305.rdf, http://www.bartoc.org/node/305.xml etc.). The record also includes the time of its creation (Date created) and of its last update (Date modified). Although these technical metadata are not displayed during online searching, they are included in the linked data records, expressed through metadata defined by the Dublin Core specifications with a data type attribute, such as:

2013-09-10T15:34:42+02:00

2014-08-12T14:27:30+02:00
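As a brief illustration, the following Python sketch derives the linked data addresses of a record from its basic URI and reads the two timestamps shown above. It assumes that the first value is the creation time and the second the modification time (the order in which the text names them), and it covers only the two extensions quoted in the text; the function name is hypothetical.

```python
from datetime import datetime


def linked_data_variants(record_uri, extensions=("rdf", "xml")):
    """Append each format extension to the basic record URI."""
    return {ext: f"{record_uri}.{ext}" for ext in extensions}


variants = linked_data_variants("http://www.bartoc.org/node/305")

# ISO 8601 timestamps with a UTC offset, as they appear in the record.
# Assumption: first value = Date created, second value = Date modified.
created = datetime.fromisoformat("2013-09-10T15:34:42+02:00")
modified = datetime.fromisoformat("2014-08-12T14:27:30+02:00")
was_updated = modified > created
```

Because the values carry an explicit data type and time zone offset, they can be compared directly without any knowledge of where the registry server is located.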

Each record gives a unique Title of the KOS as a whole, which clearly has the character of a „uniform“ title. Its form in the language of the described system (such as „Dewey Decimal Classification“, „Gemeinsame Normdatei Online“, „Kielitieteen ontologia“) is determined by the producer of the registry at its own discretion. Some titles also include a geographic qualifier in round brackets; exceptionally the title may also contain an acronym not separated by any punctuation mark (e.g. „JITA Classification System of Library and Information Science“). Alternative titles under which users may search for the system can also be given. These can be abbreviations or acronyms (such as „DDC“, „EuroVoc“ etc.); in other cases the producer has chosen the full official title as the alternative (such as „The Life Science Thesaurus“ for the well-known commercial thesaurus „Emtree“, the latter here in the role of the uniform title).

Fig. 3 Record of the AGROVOC thesaurus in presentation format in the registry BARTOC.org (in the selected English version of language communication, 2015)

The element Author mostly contains the names of companies and organizations responsible for the KOS. For example, in the DDC record (http://www.bartoc.org/node/241) the „Online Computer Library Center (OCLC)“ stands in the place of the author, whereas the actual author, M. Dewey, is only briefly mentioned in a short abstract. Where an author is given, the name is usually linked in the element VIAF to a record in the VIAF database through a URI that contains the unique VIAF identifier (VIAF ID); e.g. the hyperlink http://viaf.org/viaf/156508705 identifies the above mentioned OCLC.

The element Link is intended for linking the record with the web site describing the respective KOS, or from which KOS is accessible online. The link to the web site of the AIMS portal, from which the AGROVOC thesaurus is accessible, can be seen in Fig. 3. The record of the EuroVoc thesaurus contains a link leading to its main home page (http://eurovoc.europa.eu/).

Subject data are among the most valuable information in the short record of the registry. An Abstract in English is also included. It is informative in character and contains varied information about the field of application of the KOS, its development, and sometimes also the current number of its units. The values of the well-known multilingual thesaurus EuroVoc are well suited for use in the element Topic. The URI identifiers of the EuroVoc descriptors are shown together with the basic record. A major advantage of the registry is that the records are organized with the help of the DDC scheme in the DDC element (used down to the third hierarchical level). The class notations are, again, part of the registry's own term taxonomy (for example, the value „630“ for agriculture is given in the URI http://www.bartoc.org/taxonomy/term/10622). However, linking of the records to the freely available online DDC database (http://dewey.info) has not been implemented yet.

Among the formal data of the record is the element Access to KOS, understood in the legal sense. Simple values are used that are, again, part of the registry's own term taxonomy, such as „Free“ (http://www.bartoc.org/taxonomy/term/1377) or „Registered“ (http://bartoc.org/taxonomy/term/17). The element Format of KOS can contain up to three kinds of values (also part of the taxonomy): the form of access (printed, online, CD-ROM etc.), the format designation (RDF, XML etc.) and the structure (such as OWL, SKOS etc.). The last, and essential, element is KOS Type, which uses the values of BARTOC's own typology. The number of types has increased from five (August 2014) to six at present (the type „Glossary“ was added recently). The types are also part of the taxonomy (the type „Classification“ has the URI http://www.bartoc.org/taxonomy/term/3). The taxonomy also comprises values for the element KOS Language (e.g. the URI http://www.bartoc.org/taxonomy/term/10747 belongs to French). Where one exists, the BARTOC record is additionally linked by hyperlink to the corresponding article in the English version of Wikipedia (in principle, however, this is a bibliographic reference concerning the system).

1.4 KOS descriptive metadata in the TAXOBANK registration databank

Yet another concept of KOS descriptive metadata, with a focus on scope and content, is represented by the registration databank TaxoBank^[11], which has been in operation since 2009. It is administered by an expert team of „taxonomists“ of the private American company „Access Innovations“, which specializes in services in the field of software applications with a particular focus on the linguistic aspects of data stored in databases. The records of this registration databank contain very detailed and extremely well structured data about „controlled vocabularies“, i.e. knowledge organization systems^[12]. The databank is supplemented irregularly; suggestions for KOS registration can also be sent by ordinary Internet users, but each record is then processed by a specialist. About 250 knowledge organization systems are registered as of today.

The TaxoBank has not opened the metadata specification in its public documentation, either. That is why also the following commentary is based upon records displayed in the presentation format in the frame of the online search system. An example of a rather short description of the known AGROVOC thesaurus can be found in Fig. 4. A longer description is represented, e.g., by a record of an important American agricultural thesaurus NAL^[13].

http://www.taxobank.org/content/agrovoc-thesaurus

AGROVOC Thesaurus

Thu, 2009-11-19 08:42 — aii_admin

AGROVOC is a multilingual, structured and controlled vocabulary designed to cover the terminology of all subject fields in agriculture, forestry, fisheries, food and related domains (e.g. environment). The AGROVOC Thesaurus was developed by FAO and the Commission of the European Communities in the early 1980s. Since then it has been updated continuously by FAO and local institutions in member countries.

▼ General information

Vocabulary type: thes

Vocabulary Sample URL: http://aims.fao.org/agrovoc/page?c=12332

Was vocabulary created as a course project: 0

▼ Vocabulary characteristics

Type of display: alph, html, diag, perm, other

Relationship types: eq_pri_eq, hier_bn, rel_t, ont

Characteristics Comments:

"AGROVOC Linked Open Data (LOD) is a project to turn the AGROVOC thesaurus into a multilingual, terminological backbone for agricultural digital goods. Hosted by research partner MIMOS it provides web-accessible, structured data records on agricultural concepts and even more importantly, links those concepts to other online thesauri."

▼ Terms and Conditions

Availability Comments: available also via webservice

Import/download instructions :

See http://aims.fao.org/website/Download/sub

▼ Provider

Vocabulary provider name: FAO

Provider URL: http://www.fao.org/

Provider contact details:

"National organizations and institutes are welcome to help enrich and maintain AGROVOC's many languages by joining our growing community. Please contact us at http://aims.fao.org/contact"

Provider section comments:

"The Food and Agriculture Organization of the United Nations leads international efforts to defeat hunger. Serving both developed and developing countries, FAO acts as a neutral forum where all nations meet as equals to negotiate agreements and debate policy. FAO is also a source of knowledge and information. We help developing countries and countries in transition modernize and improve agriculture, forestry and fisheries practices and ensure good nutrition for all. Since our founding in 1945, we have focused special attention on developing rural areas, home to 70 percent of the world's poor and hungry people."

Fig. 4 Record of AGROVOC thesaurus in presentation format in databank TAXOBANK (2015)

A metadata record in TaxoBank is identified by a record identifier (URL). When the record is viewed in an Internet browser, the value is displayed in the address bar. An example of a record URL, for the record describing the AGROVOC thesaurus, is shown in Fig. 4 (first line).

The main block of descriptive information comprises the unique Title of the KOS; in this case, too, it is determined by the experts of the system according to their internal rules. For some KOS the titles in TaxoBank and in the BARTOC registry are identical (e.g. „Dewey Decimal Classification“), while for others the uniform titles differ considerably (cf. for instance „NAL Agricultural Thesaurus“ and „United States Department of Agriculture Thesaurus“). The displayed records also contain a few system data, namely the Date and time of processing the record (e.g. Wed, 2009-11-11 11:08) and the Editor of the record. The latter can be the full name of a person or his/her code (e.g. Barbara Gilles; aii_admin etc.). These data are complemented by a brief Abstract (see Fig. 4), sometimes in the form of a quotation from the documentation of the given KOS. The abstract can include various information about the origin of the KOS and its founder, the present-day owner or operator, the character of the KOS and its intended user community, the forms of access, etc.

The bloc of General information can include a larger amount of information. Also in the case of this registry all known Vocabulary alternative name or acronym can form part of it. In the record of the above mentioned Agricultural thesaurus the editors have indicated various alternative values: „NAL Thesaurus“, „Agricultural Thesaurus“ and „NALT“. The element Vocabulary type falls under the substantial mandatory data. The values are given by the internal code list and indicated in form of an abbreviated term, such as: clsssys (classification scheme), thes (thesaurus), subjh (subject headings), taxon (taxonomy), concmp (concept map), contrvoc (controlled vocabulary), gaz (gazetteer), glos (glossary) and ont (ontology). Further, as far as known, it is possible to indicate the name of Author or Editor of KOS, including his/her exact role, such as: „Lori Finch, Thesaurus Coordinator“. The databank records also further very valuable data that are typical for the given KOS, if available. These are Current version/edition, e.g.: 2009 Edition, Current version date, e.g.: Wed, 2008-12-31, Update frequency, e.g.: Annually, Available formats, e.g.: XML, SKOS, PDF, MARC, TXT etc. (also in the case of this registration both the concrete metadata formats and the specifications of their structures can be encountered side by side). Two indicative values „0“ or „1“ are registered in a specific element using the form of a question (Was vocabulary created as a course project). As far as available, the element Vocabulary URL of the KOS with free access will contain the address on which the system is available, such as: http://agclass.nal.usda.gov/. Especially in the case of commercially available KOS a very interesting element is Vocabulary Sample URL, for instance a sample of data from the CAB Thesaurus (http://www.cabi.org/cabthesaurus/mtwdk.exe?yi=sample).

Some of the records are provided with specific and very useful information, and namely in the bloc Scope and Usage. These are Languages that are used for the KOS terms (the values are indicated in English, such as „English“, „Spanish“ etc.), further subject elements Major subjects covered and Minor subjects covered. The actual values are also in English and it can be supposed that they have been taken over from the own controlled vocabulary of the producer, such as: Agricultural subjects, biological sciences // Rural and agricultural sociology; physical sciences. The purpose of the described KOS can be registered in form of a shorter or longer text, and namely within the element Purpose, and possibly also the intended user communities of the KOS can be indicated in the element Used By. Another valuable information is the link to Related vocabularies; for instance in the record of the already mentioned NAL Agricultural thesaurus this element contains the information of the existing related NAL Glossary, including details of its contents and the number of collected terms. Some very specific information require experts for qualified coverage; these are, e.g., data registered in the elements Overlap with related vocabularies and Mappings to other vocabularies. The respective data have verbal form.

Quite a lot of valuable information, especially numerical and encoded, is registered in the elements of the bloc Vocabulary characteristics. If the information is known, the element Description of overall structure can give, in URL form, the document describing the overall structure of the KOS. The element Type of terms can contain detailed information about the type of terms handled in the KOS, such as: „Agricultural and scientific concepts“, „Concepts in the domain of medicine and related domains, expressed in professional medical and scientific terminology“ etc. The element Type of display, appreciated by professionals, can contain detailed facts about how the terms are displayed. The TaxoBank databank uses fixed values from its own list in the form of abbreviated expressions, such as: hier = hierarchical, alph = alphabetical, perm = permuted, html, other. Similarly, the element Relationship types can register detailed information about the relations among the terms (again in the form of abbreviated values), such as: eq_pri_eq = equivalence, eq_lang = language equivalence, hier_bn = hierarchy of genus, hier_inst = class of instance, rel_t = association, othr = others. If available, the element Number of classes indicates the number of classes of the given KOS (e.g.: 16 top terms in hierarchical display, 109 hierarchies of descriptors etc.); further elements give the total Number of terms (e.g.: English: 73,194, Spanish: 69,118), the Number of preferred terms (e.g.: English: 44,857; Spanish: 44,857), the Number of Non-Preferred terms (e.g.: English: 28,337; Spanish: 24,261), and also the Depth of Hierarchy, with a numerical indication of the classification levels within the hierarchy (e.g.: 11 levels). The record can also contain the element Characteristics Comments with verbal notes, for instance about a project to convert the terms into the structure of linked open data.
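The term counts quoted in this paragraph can be cross-checked: for each language, the number of preferred terms plus the number of non-preferred terms should equal the total number of terms. The following is a minimal Python sketch of that check. The helper name `parse_counts` and the assumption that the values are stored as `Language: number` pairs are illustrative only; they are not part of TaxoBank.

```python
import re


def parse_counts(value: str) -> dict[str, int]:
    """Turn 'English: 73,194, Spanish: 69,118' into {'English': 73194, 'Spanish': 69118}.

    Hypothetical helper: assumes 'Language: number' pairs with commas as
    thousands separators, as in the examples quoted in the text.
    """
    pairs = re.findall(r"([A-Za-z]+):\s*([\d,]*\d)", value)
    return {language: int(number.replace(",", "")) for language, number in pairs}


# Example values quoted in the text for the TaxoBank record.
total = parse_counts("English: 73,194, Spanish: 69,118")
preferred = parse_counts("English: 44,857; Spanish: 44,857")
non_preferred = parse_counts("English: 28,337; Spanish: 24,261")

# Preferred plus non-preferred terms should add up to the total per language.
consistent = {
    language: preferred[language] + non_preferred[language] == total[language]
    for language in total
}
```

With the figures quoted above the check holds for both languages (44,857 + 28,337 = 73,194 and 44,857 + 24,261 = 69,118).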

Varied data concerning the options for using knowledge organization systems, especially commercial ones, can be found in the elements of the bloc Terms and Conditions. Under the element Availability, simple English expressions from TaxoBank's own list register the degree to which the KOS is accessible, for instance „free“ etc. More detailed notes can be given as text in the element Availability Comments. Under the element Licensing Options, details can be given, again as English text, about the conditions for acquiring a user licence for the respective KOS. In addition, the element Import/download instructions can hold URL addresses from which the data can be downloaded.

The following last bloc Provider registers, under the element Vocabulary provider name, the name of the company administering and providing the described KOS at the present day. At the time being the names under which companies appear are not unified; sometimes a full name including acronym can be found, sometimes the acronym only, such as: „Online Computer Library Center (OCLC)“, „FAO“ etc. Under the element Provider URL the URL address of the given company can be added, and the element Provider contact details is another place for further useful information in text form. The last element Provider section comments can also comment upon the activities of the provider in verbal form.

2 Descriptive metadata of the NKOS Application Profile

The analysis of the present practice of KOS metadata descriptions in the preceding chapter has shown the description in the TaxoBank databank to be the richest: it has the greatest number of elements, is very comprehensive and includes, in some instances, even longer verbal descriptions. The introduction of certain specific elements is inspiring (such as the types of relationships within a KOS, the manner of displaying the terms, or the mapping of a KOS to other knowledge organization systems). This part of the study presents another metadata description that also involves a number of specific descriptive elements but, in addition, has been developed upon the FRBR conceptual model, i.e. it offers a hierarchically stratified KOS description on several levels: one work, its optional multiple expressions, and their manifestations (publications). The description has been included in this chapter as an important metadata schema under preparation, shared by an international team of experts; at present it exists only at the project stage.

The KOS metadata description is being prepared within the framework of the comprehensive Dublin Core Application Profile, DC-AP NKOS^[14]^[15], for knowledge organization systems, as well as for the NKOS Vocabularies^[16], which have been under development since 2010 by DCMI/NKOS (Dublin Core Metadata Initiative / Networked Knowledge Organization Systems Task Group)^[17]. The application profile is founded, to a very large extent, upon DCMI Metadata Terms (http://purl.org/dc/terms/), including the basic DCMES set (http://purl.org/dc/elements/1.1/). The use of the name spaces adms (http://www.w3.org/ns/adms#), dcat (http://www.w3.org/ns/dcat#), frbrer (http://iflastandards.info/ns/fr/frbr/frbrer/) and wdrs (http://www.w3.org/2007/05/powder-s#) is also envisaged.
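To illustrate how these name spaces work in practice, the following minimal Python sketch expands a prefixed element name into a full URI. The name space URIs are those given in the text; the prefixes `dct` and `dc` are conventional abbreviations assumed here, and the helper `expand` is hypothetical rather than part of the application profile.

```python
# Name space URIs as listed in the text; the prefixes "dct" and "dc" are
# conventional abbreviations assumed for this sketch.
NAMESPACES = {
    "dct": "http://purl.org/dc/terms/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "adms": "http://www.w3.org/ns/adms#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "frbrer": "http://iflastandards.info/ns/fr/frbr/frbrer/",
    "wdrs": "http://www.w3.org/2007/05/powder-s#",
    "nkos": "http://purl.org/nkos/",
}


def expand(qname: str) -> str:
    """Expand a prefixed name such as 'adms:sample' into a full URI."""
    prefix, _, local = qname.partition(":")
    if prefix not in NAMESPACES or not local:
        raise ValueError(f"unknown prefix or missing local name: {qname!r}")
    return NAMESPACES[prefix] + local


sample_uri = expand("adms:sample")  # the Sample element of the profile
```

Keeping the prefix table in one place means that an element name is written the same way on every level of the description and resolves to the same URI.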

The newly designed standard has specified elements that are indicated in the following, and namely on three descriptive levels. For describing the knowledge organization system on the „Work“ entity level (in Czech „Dílo“), its name Title is suggested as the substantial element. A closer specification has not yet been determined, but the title should be unique (for instance a uniform title) and it ought to be in the original language. The VIAF database already contains a number of uniform titles (within the frame of the future clusters) with the unique identifier VIAF ID, such as „UMLS“ (http://viaf.org/viaf/176304810), „Gemeinsame Normdatei“ (http://viaf.org/viaf/215300162), Library of Congress Classification (http://viaf.org/viaf/203733980), Library of Congress Subject Headings (http://viaf.org/viaf/211744653) etc.

The proposed element Identifier should identify the given work unambiguously. Using the identifiers of the VIAF system for the work (VIAF ID) appears to be the optimum solution, but on the level of scientific works only few identifiers of this type are available at present (for a VIAF ID see the record in Fig. 5). Another option would be identification with the help of the International Standard Text Code (ISTC)^[18]; however, as the ISTC database shows, identifiers have not yet been assigned to textual works of the KOS type. The element Description has been prepared for various types of text (abstract, contents, graphic representation or practically any text describing the KOS as a work). An important formal element suggested by the DCMI NKOS Working Group is the element Type of KOS. The values for this element have already been determined within the application profile in a specific structured vocabulary^[19]. The element Creator is intended for the name of the person or corporation primarily responsible for the creation of the resource (work). The formal element Rights is prepared for text information specifying the various authorship rights relevant to the resource. A subject description of the work can be formulated within the element Subject. Both freely created terms and terms from existing controlled vocabularies (such as classification schemes or thesauri), as prepared for linking within the semantic web, are envisaged. For the description of the work, the element Date is also taken into account; it should be used primarily for the actual date of creation. If such historical information is not available, it can be replaced with the date of expression or manifestation (publishing) of the work. It is recommended to enter the date in the standard form according to ISO 8601, or its W3CDTF profile. 
There is yet another formal element, namely the specific element Audience (determining the community that uses a specific KOS), and also an entirely new element Used by, to be defined in the profile's own name space (http://purl.org/nkos/). It is intended for the name of the agent (programme, application) using the KOS being described. The last group of the designed elements covers relationships between individual KOS (on the level of the work). The following elements have been designed: Relation, intended for linking with a related KOS by means of unambiguous data (in the optimum case an identifier); Is part of, intended for linking with the KOS of which the described resource is a component part (again by means of unambiguous data); and Is based on, intended for linking with a resource that is in some way related with the described one. A specific relationship is represented by the element Supporting documentation, whose value can comprise citations of resources that also describe the resource in question, also in the respective sense.
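The recommended date form can be checked mechanically. The sketch below is a minimal Python illustration that accepts only the date-level granularities of the W3CDTF profile of ISO 8601 (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`); the time-of-day variants of the profile are deliberately left out, and the function name `is_w3cdtf_date` is an assumption made for this example.

```python
import re
from datetime import date

# Date-only granularities of W3CDTF: YYYY, YYYY-MM, YYYY-MM-DD.
_W3CDTF_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_w3cdtf_date(value: str) -> bool:
    """Return True if value is a valid year, year-month or full calendar date."""
    match = _W3CDTF_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day default to 1 so that the calendar check still runs.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True
```

Under this check a bare year such as `1902` and a full date such as `2015-01-05` are accepted, whereas a placeholder form with zero month and day (for instance `1876-00-00`) is rejected; if such placeholders are needed, they would have to be reduced to the year alone.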

WORK

Universal Decimal Classification

http://viaf.org/viaf/184301709

The „Universal Decimal Classification“ is one of the largest classification schemes of a universal character. It is a hierarchical and also faceted scheme based on the decimal principle. The main schedule comprises ten basic classes denoted by Arabic numerals (0 – 9, class 4 is vacant), which cover all branches and fields of human knowledge. The auxiliary tables comprise facets of concepts (places, languages, dates and forms of documents). The classification is used mainly in librarianship. It is published at three basic levels: full, medium or standard, and abridged. Specialized, thematically focused editions are also published.

http://purl.org/nkos/nkostype/classificationScheme

UDC Consortium

mezinárodní desetinné třídění (international decimal classification)

1902

Dewey Decimal Classification

http://viaf.org/viaf/198122328

McILWAINE, Ia C. Universal decimal classification (UDC). In: Ed. M. J. BATES and M. N. MAACK. Encyclopedia of library and information sciences. 3rd ed. Boca Raton (Florida): CRC Press, © 2010, pp. 5432-5439. ISBN 978-0-8493-9712-7 (set, print). ISBN 978-0-8493-9711-0 (online). Also commercially available online from the Taylor & Francis digital library (DOI): http://www.tandfonline.com/doi/full/10.1081/E-ELIS3-120043532.

EXPRESSION

Universal decimal classification. Online. Česky

unspecified

http://cz.udc-hub.com/cs/contacts.php

Translation from the English standard version into Czech, prepared by the National Library of the Czech Republic. More than 70,000 class marks have a Czech translation; the rest of the text is given in English for the time being. The translation will be added to the system gradually.

UDC Consortium

National Library of the Czech Republic

http://id.loc.gov/vocabulary/iso639-2/cze

http://id.loc.gov/vocabulary/iso639-2/eng

70,626 valid and 11,000 cancelled notations

2015-01-05

2015-02-05

freq:annual

Intended for librarians, researchers and students in the fields of librarianship and informatics; freely accessible in accordance with the General Terms and Conditions.

http://www.udc-hub.com/en/login.php

http://nl.udc-hub.com/nl/login.php

http://viaf.org/viaf/184301709

MANIFESTATION

České MDT Online

http://cz.udc-hub.com/cs/login.php

http://cz.udc-hub.com/cs/contacts.php

New online release of the Czech version of „UDC Online“. This is the standard edition of the UDC. It contains the full version of the UDC, comprising 70,626 valid and 11,000 cancelled UDC notations. The online tool on the server of the UDC Consortium has several functions for searching, browsing, analysing, validating and building UDC notations.

UDC Consortium

National Library of the Czech Republic

UDC Consortium

online

text/html

2015-02-05

http://www.udc-hub.com/en/login.php

http://nl.udc-hub.com/nl/login.php

Online searching, and building and validation of notations

Universal decimal classification. Online. Česky

http://www.udc-hub.com/

http://cz.udc-hub.com/cs/demo.php

Fig. 5 Three-level (hypothetical) record of the most recent Czech online edition of UDC (Universal Decimal Classification) in the suggested metadata schema AP-NKOS

The element Title has also been suggested on the level of the „Expression“ entity (in Czech „Vyjádření“). Its closer specification has not yet been determined either, but the option of adding complementary data within the title has been mentioned, in particular data concerning the edition or the language of the KOS. It can be expected that the title of the expression should also be unique. Here, too, a number of such formalized titles have already appeared in the VIAF database, for example thanks to the initiative of the Library of Congress and the German National Library, e.g. „Universal decimal classification. Selections. Czech“ (http://viaf.org/viaf/186352368) or „EUROVOC“ (http://viaf.org/viaf/215952972). The example record in Fig. 5 bears the uniform title for the online Czech version of UDC, hypothetically prepared after the examples of titles from the VIAF database. For the time being, the same example bears the value „unspecified“ in the element Identifier. It is anticipated that this value will appear in the VIAF system in the course of time. The projected formal element Contact is taken from the DCAT domain. Its value can be a reference to the place where further information about the expression of the work (about the translation, the form of publishing etc.) can be obtained. The element Description is also available on the level of expression. It is intended for text (abstract, contents etc.) describing the process of expressing the work (translation, edition etc.). The element Creator is present on the level of expression as well and ought to comprise the names of persons or corporations responsible for the expression of the work (translator, editor of the online version etc.). A substantial formal element on the level of expression in the case of text works and, accordingly, also of KOS, is the element Language. 
This concerns the language of the text, and it is recommended to use language code lists. The present practice of a number of systems is to use the freely available language vocabularies on the server of the Library of Congress. In this element, the record in Fig. 5 bears the URI of the language (with its three-letter code) from the cited vocabulary. Genuinely specific and valuable data in KOS records concern the size (number of terms). That is why the suggested AP NKOS specification involves the element Size note. Naturally, the description of the expression takes into account the element Date; it will bear, as far as known, the date of completion of the expression (completion of the translation etc.). In addition, the new element Date (modification) has been incorporated. The standard entry of the date according to ISO 8601, or its W3CDTF profile, is recommended. On this level of description, too, a legal element Rights has been included for text information stipulating the various types of property rights connected with the expression of the work (such as translation rights etc.). The level of expression can also bear the formal element Audience (specifying the community authorized to use, e.g., the KOS translation; see Fig. 5) and the element Used by (see above). A novelty on this level is the definition (again the profile's own name space, nkos:, had to be used) of a specific and valuable element KOS Frequency of update. It is recommended to take the values from the DCMI Frequency vocabulary (http://dublincore.org/groups/collections/frequency/). The group of suggested elements for the relationships of one KOS to another (on the level of expression) contains mostly the same elements as the level of work: the element Relation is intended for linking with the corresponding expression of a KOS through a unique entry (Fig. 5 shows links to other language online versions of the given classification); a new element from the frbr name space, Is realization of, is intended for unambiguous reference to the realized work (optimally the URI of the work); and the element Is part of is intended for linking with the KOS expression of which the described resource is a component part. A specific relation is again represented by the element Supporting documentation, whose value can comprise citations of resources that describe the expression in question, also in the subject sense of the word. A new formal element (from the adms: domain) is the element Sample (adms:sample), which ought to contain a reference to a website with demonstration versions of the KOS or with examples of its records, especially when the resource is commercially available or becomes available only after registration (see Fig. 5).

The last level of KOS description, „Manifestation“ (in Czech „Provedení“), i.e. concrete products, is covered by the last group of elements. The element Title should contain the descriptive title of the published KOS. The example record in Fig. 5 bears the title of the online product in Czech. Where several titles exist (alternative titles), the suggested specification does not use the Dublin Core element „dct:alternative“, and it has not yet been clearly specified where alternative titles ought to be entered. The element Identifier can bear the known identifiers of both printed and electronic products (ISBN, ISSN, DOI, URI, URN and others). The formal element Contact is defined in the same way as on the level of expression. The element Description is also available, for text giving further details about the character of the given product. The element Creator is present on this level of description as well, with the same meaning. An important new element is Publisher, intended for the name of the corporation or person publishing the KOS or its concrete version/edition. KOS-specific information relates to format and physical carrier. For this purpose the element Format has been designed; for online products it is advisable to use the respective values from the MIME vocabulary (Internet Media Types, IMT, http://www.iana.org/assignments/media-types/media-types.xhtml). For publications a new element, Date (issuance), has been defined. On this level of description, too, an element relating to the rights to the published KOS has been included, namely Rights. A new element from the profile's own name space is Services offered, indicating the types of services open to users (downloading, annotation, querying etc.). 
The group of elements serving for relations between individual KOS (on the level of manifestation) also comprises mostly the same elements as the levels of work and expression: the element Relation is intended for linking with related KOS publications, and the element Is part of is intended for linking with the KOS publication of which the described resource is a component part. There is a new element, again from the frbr name space, Is embodiment of, which is intended as an unambiguous reference to the expression of the work (optimally the URI of the expression). Specific relations are represented by the elements Supporting documentation and Sample (adms:sample).
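The three descriptive levels and the two linking elements Is realization of and Is embodiment of can be sketched as a small data structure. The following Python example is only an illustration under stated assumptions: the class and field names are hypothetical, each level carries just a few of the proposed elements, and the assignment of the URL `http://cz.udc-hub.com/cs/login.php` to the manifestation identifier is an assumed reading of the record in Fig. 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    title: str
    identifier: str


@dataclass
class Expression:
    language: str                      # URI from a language code list
    is_realization_of: Work            # link to the realized work
    identifier: Optional[str] = None   # "unspecified" in Fig. 5


@dataclass
class Manifestation:
    title: str
    identifier: str                    # assumed reading of Fig. 5
    format: str                        # MIME type
    is_embodiment_of: Expression       # link to the embodied expression


udc = Work("Universal Decimal Classification", "http://viaf.org/viaf/184301709")
czech = Expression(
    language="http://id.loc.gov/vocabulary/iso639-2/cze",
    is_realization_of=udc,
)
online = Manifestation(
    title="České MDT Online",
    identifier="http://cz.udc-hub.com/cs/login.php",
    format="text/html",
    is_embodiment_of=czech,
)


def work_of(manifestation: Manifestation) -> Work:
    """Follow the links from a concrete product back to the abstract work."""
    return manifestation.is_embodiment_of.is_realization_of
```

Because each level points only to the level above it, data recorded once for the work (for example its VIAF identifier) do not need to be repeated in every expression or manifestation.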

3 Proposal of KOS metadata description in the knowledge base of knowledge organization

3.1 Description schema in the knowledge base

The goal of the NAKI project is to assemble and to systemize the current body of knowledge in the field of knowledge organization in form of a knowledge base that enables saving, browsing and searching the existing entries and deriving new knowledge. The knowledge base will be made accessible in the format of linked open data and will serve for further research and as an education tool in the field of information science and librarianship. The ontological structure of the knowledge base consists of two types of knowledge units: 1) statements („pure knowledge“ – sentences formalized as logical predicates and instantiated in form of text data) and 2) descriptive metadata covering relevant document and non-document resources (persons, institutions, events, tools, activities and processes). The introductory phase of knowledge acquisition has supplied material for a prototype of knowledge base with about 2 300 units of declarative knowledge: 150 statements in the RDF format, a glossary with the scope of 200 terms, 1 000 records of scientific literature, 230 records of knowledge organization systems. The KOS description creates an important set of knowledge entities in the second module of the knowledge base represented by descriptive metadata.

The framework methodology for the design of the KOS descriptive metadata was chosen according to the recommendations of the Guidelines for Dublin Core Application Profiles^[20]. Accordingly, the first step was to delimit the functional requirements for metadata in the knowledge base; their point of departure was the general requirements for the knowledge base as a whole: 1) enabling the assembling and systemizing of domain knowledge concerning KOS, taking account of the local cultural and language-specific aspects of the Czech environment, 2) enabling access to the existing knowledge and the deriving of new knowledge in the format of linked open data, 3) providing resource material for the preparation of an original Czech monograph, 4) supplying resources for updating the Czech terminology, 5) enabling use in education in the field of information science and librarianship. These requirements have led to the conclusion that the metadata in the knowledge base should contain the most comprehensive description possible and should also depict the history of, and the mutual relationships between, different KOS.

The second step was the creation of a domain model delimiting the entities (objects, things) to be described by metadata. Its diagram can be found in Fig. 6. The key entity is, of course, the Knowledge Organization System; another important one is Agent, which in our approach represents a person or an institution connected with a KOS, be it as creator, operator or user. The design of the knowledge base is founded upon the principle of maximum reuse of proven concepts and models. Accordingly, already on the level of the domain model it contains the customized entities Work, Expression, Manifestation, Person, Corporation and Thema^[21] of the FRBR and FRSAD models, together with their relationships. The three entities Work, Expression and Manifestation represent a mutually interconnected and logically structured set of data about the KOS. The last essential component of the model is the entity Theme, which has been taken over directly from the entity Thema of the FRSAD model and covers data about the subject contents of the KOS.

Fig. 6 Domain model, based upon the FRBR and FRSAD models

The structure of the metadata description in Figs. 6 and 7 is illustrated by a class diagram in UML (Unified Modelling Language), used in accordance with ISO 24156, which governs the use of UML notation in terminology work^[22]. The data elements are arranged in groups that are linked by semantic relations. Association stands for any semantic relationship and is drawn as a line, ending with an arrow in the case of an asymmetrical association. Aggregation stands for a partitive hierarchical relationship („is a part“), the respective line ending with a diamond. Generalization represents a generic hierarchy with inheritance („is a type“), its arrow ending with a triangle pointing from the subordinate class („subtype“) to the superior class („supertype“).

After the construction of the domain model the following stage defined properties of the respective objects and specified their relationships; the resulting structural model of metadata description is diagrammatically represented in Fig. 7. Again we applied all efforts to make use of all elements of description that had been defined as well as standardized vocabularies. The main resource was the NKOS Application Profile, as described in the preceding part. 11 out of its 21 elements have been directly applied in the knowledge base, 6 elements have been mapped with various degrees of equivalence and 4 elements have not yet been applied (update frequency, rights, audience, services offered). In addition to the descriptive elements of NKOS AP also two of its value vocabularies (NKOS Vocabularies^[23]) have been utilized: the KOS Relation-Type Vocabulary has been taken over to the full, and the KOS Types Vocabulary^[24] has been reflected in the typology of knowledge organization systems^[25]. Our own typology has been complemented with the classification of KOS types that has been taken over from the Classification System for Knowledge Organization Literature^[26], specifically from part 0 – Form Divisions, classes 01, 03 and 04. Further value vocabularies have been appropriated for the element language (at the present day a three-character code and names according to ISO 639-2 are in use), format (format codes under the MIME standard are used), role (here the vocabulary of the Library of Congress is used^[27]). Identifiers have been appropriated from the following name spaces: VIAF ID for persons and corporations and, rather exceptionally, for works and expressions, ISNI for persons and corporations, ISBN, ISSN, DOI, URL, URN and URI for manifestation.

There are 3 types of classes in the structural model of metadata description in Fig. 7: «entity», «agent» and «code list». Classes of the «entity» type correspond with entities from the 1^st group of the FRBR model – Work, Expression and Manifestation. Also the semantics of their relationships has been taken over from the FRBR model – the work is realized with the help of expression, whereas expression is embodied in the manifestation. The classes of the «agent» type correspond with entities from the 2^nd group of the FRBR model, responsible for the contents, the production, the distribution or administration of entities of the 1^st group. The classes of the «code list» type are authority lists of permissible values that are related to the properties of classes of the types «entity» and «agent». An important role among the code lists is played by the class Theme; in the sense of the FRSAD model this is a generalization of entities of the 3^rd group from the FRBR model, pertaining to the subject contents of the work. The relationships between single themes can be also defined (typically either hierarchy or association); in the model this is depicted by the recursive association relationships of themes. The association classes Relationship and Role of the agent reflect the situation when the properties relationship type and role are not properties of single classes, but properties of the relationships between them.

Fig. 7 Structure of KOS metadata description in the knowledge base

In addition to the classes representing data elements, there are also two so-called abstract classes in the model, Entity and Agent. Both are parent classes in the generic hierarchy: they possess no instances of their own, but all their properties and relationships are inherited by their hierarchically subordinate child classes Work, Expression, Manifestation, Person and Corporation. The way generalization is construed in UML makes it possible to define in one place (in the parent class) the properties and relationships that occur to the same or a similar extent in several objects, which reduces redundancy within the model. For instance the property identifier, or the declaration „it has an identifier“ in the class Entity, on the implementation level concerns the work, the expression and the manifestation alike, while the semantics in each of these classes is domain-specific. Some inherited features can be modified in the child classes. The property description is specified in the child classes both by its name and by its contents: specific requirements for the KOS description are determined on the level of the work (description of work), on the level of expression (description of expression) and on the level of manifestation (description of manifestation). The same is valid for the abstract class Agent. All its properties and relationships are inherited by the classes Person and Corporation, where the property name is specified more closely as name of corporation, or name and surname of person.

The work, the expression and the manifestation can all be linked by mutual relationships of a certain type (e.g. Work A is based on Work B, Expression A has been mapped to Expression B, Work A describes Work B). This can be captured by the recursive association relationship of entities. As already mentioned, inheritance concerns not only the attributes of parent classes but also their relationships. Specifically, the relationship agent has a relationship to the entity is applicable on the implementation level as a relationship between a person and the work, between a corporation and the manifestation etc., where the concrete role of a given agent in a given case is determined by the value of the element role from the code list of roles (such as author, publisher, editor, provider).
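The inheritance and association classes described above can be sketched in code. The following Python example is a simplified illustration, not the implementation of the knowledge base: the class names follow the model, but the field selection, the guard that keeps the parent classes free of instances, and the class name `AgentRole` are assumptions made for this sketch. The pairing of UDC with the Dewey Decimal Classification and with the UDC Consortium is used only as sample data.

```python
from dataclasses import dataclass

# Role values named in the text; the real code list of roles is larger.
ROLES = {"author", "publisher", "editor", "provider"}


@dataclass
class Entity:
    """Abstract parent: properties shared by Work, Expression and Manifestation."""
    identifier: str
    title: str

    def __post_init__(self) -> None:
        if type(self) is Entity:
            raise TypeError("Entity is abstract and has no instances of its own")


class Work(Entity):
    pass


class Expression(Entity):
    pass


class Manifestation(Entity):
    pass


@dataclass
class Agent:
    """Abstract parent: properties shared by Person and Corporation."""
    name: str

    def __post_init__(self) -> None:
        if type(self) is Agent:
            raise TypeError("Agent is abstract and has no instances of its own")


class Person(Agent):
    pass


class Corporation(Agent):
    pass


@dataclass
class Relationship:
    """Association class: the relationship type belongs to the link, not to an entity."""
    source: Entity
    target: Entity
    relationship_type: str


@dataclass
class AgentRole:
    """Association class: the role belongs to the agent-entity link."""
    agent: Agent
    entity: Entity
    role: str

    def __post_init__(self) -> None:
        if self.role not in ROLES:
            raise ValueError(f"role not in code list: {self.role!r}")


udc = Work("http://viaf.org/viaf/184301709", "Universal Decimal Classification")
ddc = Work("http://viaf.org/viaf/198122328", "Dewey Decimal Classification")
based_on = Relationship(udc, ddc, "is based on")
provider = AgentRole(Corporation("UDC Consortium"), udc, "provider")
```

Declaring `identifier` and `title` once in `Entity` mirrors the reduction of redundancy that generalization brings to the model, while `Relationship` and `AgentRole` hold the properties that belong to a link rather than to either of its ends.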

3.2 Definition and mapping of data elements for the description of KOS in the knowledge base

The following review offers elements of description, as derived from class attributes defined in the structural model (see Fig. 7). The structure of metadata description in Fig. 7 has been slightly simplified for better clarity. Some data elements from the structural model have been additionally structured internally for the purpose of entering data into the knowledge base: language (code, name), availability (description, type), type (own code list – Classification System for Knowledge Organization Literature), type of relationship (relationships of entities, relationships of concepts/terms). Each element of description is complemented with the definition of its meaning and examples of possible values.

ENTITY

Abstract class, supertype for classes Work, Expression, Manifestation. Generalizes properties that are common for all children classes.

identifier

Supertype for the identifier of work/expression/manifestation. Each entity has an identifier. Examples: URI; URN; ARK; VIAF ID; ISBN; ISSN; DOI; URL

title

Supertype for the title of the work/expression/manifestation. Each entity has 1 title.

Examples: Dewey Decimal Classification; EUROVOC

alternative title

Supertype for an alternative title of the work/expression/manifestation. Each entity can have several alternative titles (especially abbreviations/acronyms, possibly titles in other, equally valid languages, or a Czech translation).

Examples: DDC; Dewey Decimal Classification

description

Supertype for the description of the work/expression/manifestation. Each entity can have a description.

Examples: abstract; annotation; structured contents

date

Supertype for the date of creation of the work/expression or publishing of the manifestation.

Examples: 2013-01-12; 1876-00-00; 2015
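
The three example values differ in granularity, so any software reading the element date has to accept more than full calendar dates. The following Python sketch shows one possible way to normalize them. It assumes that “00” marks an unknown month or day (as in 1876-00-00); the names KosDate and parse_kos_date are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class KosDate(NamedTuple):
    year: int
    month: Optional[int]  # None when unknown or not given
    day: Optional[int]    # None when unknown or not given


# Accepts YYYY, YYYY-MM and YYYY-MM-DD.
_DATE_PATTERN = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_kos_date(value: str) -> KosDate:
    match = _DATE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported date value: {value!r}")
    year = int(match.group(1))
    # Missing parts and "00" placeholders (an assumption) both become None.
    month = int(match.group(2) or 0) or None
    day = int(match.group(3) or 0) or None
    if month is None:
        day = None  # a day without a known month carries no information
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"month out of range in {value!r}")
    if day is not None and not 1 <= day <= 31:
        raise ValueError(f"day out of range in {value!r}")
    return KosDate(year, month, day)


for example in ("2013-01-12", "1876-00-00", "2015"):
    print(example, "->", parse_kos_date(example))
```

With this reading, 1876-00-00 and 2015 both reduce to a year with unknown month and day, while 2013-01-12 keeps all three parts.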

RELATIONSHIP

The class serves for specifying the mutual relationships of the entities (relation Entity – Entity, Work - Work, Expression – Expression, Manifestation – Manifestation), semantic relationships of concepts (relation Theme – Theme) and types of relationships expressed in KOS (relation Expression – Relationship).

Relationship type

Value from the own code list of relationships. The code list comprises the types of entities as well as semantic and syntactic relationships between the units that can be represented by the given KOS (mutual relationships between KOSs as well as between their structural parts).

Examples: concepts/terms relationships: generic hierarchy; association; relationships of entities: summarisation; adaptation; complement; based on; whole-part

THEME

Keyword

Keywords from the own controlled vocabulary delimiting the domain / field / theme covered by KOS, and possibly the domain (contents, field, subject matters) of the organized units.

Examples: knowledge organisation (scientific discipline); agriculture; medicine

Definition

Verbal expression of the contents and the scope of a concept.

Example: The discipline of knowledge organization investigates the process of organizing knowledge and its context, i.e. resources that are transformed in the course of organization, methods used as well as tools and products created by this process, including the participating actors – persons, institutions, technologies

Facet

Facet used for the organization of the glossary.

Example: process

WORK

Description of work

Subtype of description of the entity. Verbal characteristics of the contents of the work in a natural language. It comprises the description of the conceptual foundation of the given KOS, the history, information about updating, information about the creators, purpose, determination of users.

Type

Value from the own code list of types of knowledge organization systems.

Example: ontology; thesaurus

Type (KOL)

Notation of the Classification System for Knowledge Organization Literature (KOL). KOS is described by classmarks from the part 0 – Form Divisions, classes 01, 03 and 04.

Example: 042 Universal Decimal Classification

EXPRESSION

Description of expression

Subtype of entity description. Verbal characteristics of the contents of expression in a natural language. It comprises information about the edition/version, about the structure of the whole system, of the structure of headings/records (descriptors, codes), of the linguistic form of terms, explanation of the notation system, display, browsing and search options.

Number of units

Number of units (lexemes) within KOS.

Examples: 15 000 classmarks; 20 953 descriptors

Language

Indication of language under ISO 639-2.

Name of language

Name of language under ISO 639-2 (Czech version).

Examples: German; Arabic; Dutch

Code of language

Code of language under ISO 639-2.

Examples: ger; eng

MANIFESTATION

Bibliographic citation

Bibliographic citation of manifestation under ISO 690.

Example: DEWEY, Melvil, devised. Abridged Dewey decimal classification and relative index. Ed. 15. Ed. by Joan S. MITCHELL, Editor in Chief, Julianne BEALL, Rebecca GREEN, Giles MARTIN, Michael PANZER, Assistant Editors. Dublin (Ohio): OCLC, 2012. lxvii, 1228 p. ISBN 978-0-910608-81-7. ISBN 0-910608-81-4.

Description of manifestation

Subtype of entity description. Verbal characteristics of the contents of manifestation in the natural language. Comprises information about the format of the given manifestation, the way of display and possibly further physical characteristics of the resource (e.g. access to the resource).

Resource reference

Reference to the place wherefrom the described KOS is available (e.g. URI or URL).

Example: http://id.loc.gov/authorities/classification

Availability

Characteristics of the access to the resource.

Availability-description

Verbal characteristics of how the resource can be accessed (free text).

Examples: In: NK ČR (ABA001) – for reference only; NK ČR Knih. Inst. (ABA003) -- sign. Od 20.817, from 20.818, Od 549/B1, Od 549/B2; Library Jinonice (ABD(107).

Availability-type

Value from the own code list of availability types.

Example: open access

Format

Value from the own code list of formats (subset of MIME (IMT) vocabulary).

Examples: application/rdf+xml; text/html

Medium

Value from the own code list of media.

Examples: CD ROM; online

Structure

Value from the own code list of structures.

Example: SKOS

AGENT

Abstract class, supertype for classes Person and Corporation. A class generalizing entities from the 2^nd group of the FRBR model (Person, Corporation), bearing responsibility for the intellectual or artistic contents, physical production and distribution or administration of entities in the first group.

Name

Supertype for the name and surname of person, name of corporation.

Identifier VIAF

Permanent identifier (ID VIAF) of an agent in the VIAF database (Virtual International Authority File). Source: http://viaf.org.

Identifier ISNI

Permanent identifier (ID ISNI) of an agent in the ISNI database (International Standard Name Identifier). Source: http://isni.org.

PERSON

Name of person

Subtype of name of agent.

Surname of person

Subtype of name of agent.

CORPORATION

Name of corporation

Subtype of name of agent.

ROLE OF THE AGENT

Role

The role of agent with respect to his relationship to the entity. Value from the own code list of roles. Mapped into the vocabulary MARC Code List for Relators (http://id.loc.gov/vocabulary/relators)

Examples: author (of work); translator (of expression); publisher (of manifestation)

The most recently implemented stage in the design of elements for the KOS metadata description is their semantic mapping into the relevant name spaces. The functional requirement, mentioned above, that the contents of the knowledge base be accessible as linked open data made it necessary to design the elements of KOS description so that they meet, as far as possible, the parameters of the so-called five-star schema^[28] of data openness. This schema requires that the data be linked with other related data by means of references. Eight name spaces were chosen for this purpose: the schemas of the registries BARTOC and TaxoBank, NKOS AP and Schema.org, which were described in the preceding part of the study, and in addition the generally applicable DCMI Metadata Terms (DC-terms), the FRBR and FRSAD models, and the specialized schemas DCAT and OMV.

The Data Catalog Vocabulary (DCAT)^[29] is a W3C recommendation approved in January 2014. It contains the schema and the vocabulary for metadata description in dataset catalogues; an example is the EU Open Data Portal (http://open-data.europa.eu/en/data/). The DCAT schema is relevant for linking data from the knowledge base because present-day KOSs published as linked data have the character of a dataset. The vocabulary is markedly eclectic; it brings only 7 descriptive elements of its own (the web page from which the dataset is accessible, contact, URL for access, URL for downloading, size in bytes, type of medium, keyword). Although its authors do not designate DCAT as a Dublin Core application profile, they take over the DC elements to a considerable extent and usually complement the general definition with specific comments for the purpose of describing datasets. The DCAT schema can be considered a certain analogy of NKOS AP: whereas the primary purpose of NKOS AP is the description of KOS in their online registries, DCAT is intended for describing datasets in their online catalogues.
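
To make the combination of DCAT's own elements with borrowed DC elements more concrete, the following Python sketch serializes a KOS as a dcat:Dataset with one dcat:Distribution in Turtle. It is only an illustration: the function name kos_as_dcat and all example.org URIs are hypothetical, the access URL and the media type are taken from the element examples above, and the knowledge base itself may expose its data differently.

```python
from typing import List


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def kos_as_dcat(dataset_uri: str, title: str, keywords: List[str],
                landing_page: str, distribution_uri: str,
                access_url: str, media_type: str) -> str:
    """Describe a KOS as a dcat:Dataset with a single dcat:Distribution."""
    if not keywords:
        raise ValueError("at least one keyword is expected in this sketch")
    keyword_list = ", ".join(turtle_literal(k) for k in keywords)
    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset ;",
        f"    dct:title {turtle_literal(title)} ;",       # element taken over from DC
        f"    dcat:keyword {keyword_list} ;",             # DCAT's own element
        f"    dcat:landingPage <{landing_page}> ;",       # DCAT's own element
        f"    dcat:distribution <{distribution_uri}> .",
        "",
        f"<{distribution_uri}> a dcat:Distribution ;",
        f"    dcat:accessURL <{access_url}> ;",
        f"    dcat:mediaType {turtle_literal(media_type)} .",
    ]
    return "\n".join(lines)


turtle = kos_as_dcat(
    dataset_uri="http://example.org/kos/lcc",            # hypothetical URI
    title="Library of Congress Classification",
    keywords=["classification"],
    landing_page="http://example.org/kos/lcc/about",     # hypothetical URI
    distribution_uri="http://example.org/kos/lcc/lod",   # hypothetical URI
    access_url="http://id.loc.gov/authorities/classification",
    media_type="application/rdf+xml",
)
print(turtle)
```

The title comes from DC-terms, whereas the keyword, landing page, access URL and media type are among the elements that DCAT defines itself.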

OMV (Ontology Metadata Vocabulary)^[30] is the result of an international project implemented in 2004–2009 by the OMV Consortium, consisting of the University of Bremen and its Center for Computer Technologies, the Polytechnic University of Madrid, the University of Karlsruhe and the Stanford Center for Biomedical Informatics Research. Its goal is to create a formal metaontological framework for the description of applied ontologies from the viewpoint of knowledge representation. The output of the project is an open-access metadata vocabulary for a uniform description of ontologies (OMV Core Metadata Entities and OMV Extensions). It contains a list of description attributes (such as Work, description, date of creation) arranged in 13 classes. The key class is ontology; further classes cover the persons and organizations developing it, and there are also classes for a closer characterization of the given ontology: license model, knowledge representation paradigm, type of ontology, formality level, ontology task being solved, domain, applied methodology, tools applied, syntax, ontology language. For some attributes (such as type of ontology) lists of values are available. As its name suggests, OMV is a metadata schema specialized for ontologies, which are at present considered to be the most progressive type of KOS. Its contribution consists in defining elements of description that are ontology specific and cannot be found in other, more generally conceived schemas (such as formality level, the knowledge representation paradigm used, the methodology of ontological engineering applied, the ontology language in use). An interesting aspect of OMV is that it focuses primarily on the conceptual base and does not concentrate on describing the ontology in the form of a document filled with instances.

The results of mapping the descriptive elements of KOS are arranged in the table in Fig. 8. The authors succeeded in mapping each of the descriptive elements into at least one relevant name space. The 7 elements forming the core of the general identifying descriptive data have been mapped in all chosen systems. Besides justifying the semantic linkability of the suggested elements of description, this mapping also brought impulses to add further potentially applicable elements from the relevant name spaces which do not form part of our schema for the time being and whose integration can be considered in the future, namely: purpose, utilization, user determination, rights, offered services, updating frequency, way of display, formality level, ontology language, applied knowledge representation paradigm.

Fig. 8 Mapping elements of description of KOS in the relevant name spaces

Metadata schema application examples for KOS description in the knowledge base

The basic structural elements of metadata description have again been depicted by way of a class diagram in the UML language. The key structural components of each description are the three entities taken from the FRBR model: Work, Expression and Manifestation. The relationship Work – Expression is designated by the association realizes, whereas the relationship Expression – Manifestation is designated by the association embodies.

Fig. 9 Scheme of links between thesaurus and subject heading scheme in the knowledge base

The example in Fig. 9 is a graphical representation of the description structure of two KOS: Library of Congress Subject Headings (LCSH, VIAF ID of the work: http://viaf.org/viaf/211744653/) and the thesaurus AGROVOC (a VIAF ID has not yet been assigned to the work). Both systems are described on the level of work (D1 and D2), expression (V1, V2 and V3) and manifestation (P1, P2, P3, P4 and P5). In the case of LCSH, one work, one expression in the form of linked open data and one manifestation have been depicted. The relationship P4 is part of P1 has made it possible to describe the manifestation on a finer level of granularity as well. The manifestation P1 is a whole, i.e. the complete heading list, whereas the manifestation P4 LCSH-Heading, describing the heading “Feudalism”, is its part. The FRBR model would obviously also allow an alternative solution of this relationship: single headings could be considered as separate works that are component parts of another work, i.e. the subject heading scheme. However, in the case of very comprehensive KOS with tens or hundreds of thousands of units such a solution would be rather wasteful, and a structured description of headings on the level of work, expression and manifestation would result in considerable redundancy. An analogous line work – expression – manifestation, again down to the level of single entries (D2 – V3 – P3 – P5), has been illustrated in the case of the AGROVOC thesaurus in the form of linked open data. The manifestation P3 AGROVOC-LOD is available in three data formats (HTML, Turtle and RDF), as shown by means of generalization, an economical technique that makes it possible to relate the information about available formats not only to the whole thesaurus but also to its single descriptors.
The administrator of the AGROVOC thesaurus in the LOD form carried out a selective mapping into LCSH (specifically covering 1 075 descriptors from the group Generalities), as depicted by the association semantic mapping between the manifestations (records) P5 and P4. In this case the description of the AGROVOC thesaurus comprises, in addition to the expression in the LOD form, also the description of the historically first version of the thesaurus (V2) in printed form, which was published in seven volumes in 1982 (P2). The mutual relation of these two expressions is illustrated by the association V3 is based on V2. The third work (D3) in this example is a conference paper whose topic is the AGROVOC thesaurus (the diagram in Fig. 9 shows this work in simplified form as one object, aggregated with the expression and the manifestation). The association D3 is about D2 is an example of implementation of a concept from the FRBR model stating that the theme of a work can be anything, i.e. also some other work. This example shows two ways of expressing the mutual relationships of KOS. The first option is a direct and explicit expression of a work – work relationship (D3 is about D2), an expression – expression relationship (V3 is based on V2) or a manifestation – manifestation relationship (P5 is mapped into P4). The second option is indirect and follows from inference: the works D1 LCSH and D2 AGROVOC are related because their single manifestations, or even parts thereof, are mutually mapped.
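
The indirect, inferred kind of relationship can be sketched in a few lines of Python. The triples below are read off the textual description of Fig. 9, not off the figure itself; the direction of realizes and embodies (from the lower to the higher level) and the function names are assumptions made for this illustration.

```python
# (subject, relationship, object) triples following the description of Fig. 9.
TRIPLES = [
    ("V1", "realizes", "D1"),        # LCSH as linked open data
    ("P1", "embodies", "V1"),        # the complete LCSH heading list
    ("P4", "is part of", "P1"),      # the heading "Feudalism"
    ("V2", "realizes", "D2"),        # AGROVOC, printed version of 1982
    ("P2", "embodies", "V2"),
    ("V3", "realizes", "D2"),        # AGROVOC as linked open data
    ("P3", "embodies", "V3"),
    ("P5", "is part of", "P3"),      # a single AGROVOC descriptor
    ("V3", "is based on", "V2"),     # direct expression - expression link
    ("P5", "is mapped into", "P4"),  # direct manifestation - manifestation link
    ("D3", "is about", "D2"),        # direct work - work link
]

# Relationships that lead from a part or a lower level up towards the work.
UPWARD = {"is part of", "embodies", "realizes"}


def work_of(node, triples):
    """Follow the upward relationships until no further step is possible."""
    parent = {s: o for s, p, o in triples if p in UPWARD}
    while node in parent:
        node = parent[node]
    return node


def inferred_work_links(triples):
    """Works that are related only because their manifestations are mapped."""
    links = set()
    for subject, predicate, obj in triples:
        if predicate == "is mapped into":
            source, target = work_of(subject, triples), work_of(obj, triples)
            if source != target:
                links.add((source, target))
    return links


print(inferred_work_links(TRIPLES))
```

Here the mapping between the records P5 and P4 is lifted along P5 – P3 – V3 – D2 and P4 – P1 – V1 – D1, so the only inferred link is the one between the works D2 (AGROVOC) and D1 (LCSH).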

Fig. 10 Scheme of links between some subject heading schemes in a knowledge base

The example in Fig. 10 demonstrates that the construction elements of the FRBR model can capture often very intricate relationships between knowledge organization systems. The diagram shows four independent systems of subject headings: Library of Congress Subject Headings (D1 LCSH), Canadian Subject Headings (D2 CSH), the subject headings of the Library of Laval University in Quebec (D3 RVM(L)) and the French subject heading scheme RAMEAU (D4) (http://viaf.org/viaf/202974143). The relationships between the contents of these systems are as follows: RVM(L) was established in 1946 by translating a part of LCSH from English into French; this translation was subsequently complemented with further headings reflecting Canadian life and institutions. RVM(L) is the foundation of CSH, which has been available in an English version since the 1960s and was later extended to its bilingual (English and French) form. The CSH subject heading scheme is used in Canadian libraries in parallel with LCSH, which it complements with headings reflecting Canadian life and institutions. The French RAMEAU was founded by adapting the Canadian RVM(L) subject heading scheme. These facts are represented by four association relationships: D3 is an adaptation of D1, D2 is based on D3, D2 complements D1 and D4 is an adaptation of D3. In this example the relationships between the four subject heading schemes are represented on the highest level of generality, i.e. they are defined on the level of work. Another possible variant of description would be to enter the same relations (adaptation, based on, complement) on a finer level of granularity, i.e. between concrete expressions or even between manifestations, possibly between a manifestation and a work or between a manifestation and an expression.
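
The four work-level associations can be written down as typed relationships whose type is checked against a code list. The Python sketch below uses the relationship types given as examples for the element Relationship type; the function names are hypothetical, and tracing sources transitively is an added illustration of what such data would allow, not a feature described in the study.

```python
# Relationship types of entities, as listed in the examples of the code list.
RELATIONSHIP_TYPES = {"summarisation", "adaptation", "complement",
                      "based on", "whole-part"}

WORKS = {"D1": "LCSH", "D2": "CSH", "D3": "RVM(L)", "D4": "RAMEAU"}


def make_relationship(subject, relationship_type, obj):
    """Return a relationship triple after checking its type against the code list."""
    if relationship_type not in RELATIONSHIP_TYPES:
        raise ValueError(f"unknown relationship type: {relationship_type!r}")
    return (subject, relationship_type, obj)


RELATIONSHIPS = [
    make_relationship("D3", "adaptation", "D1"),  # RVM(L) is an adaptation of LCSH
    make_relationship("D2", "based on", "D3"),    # CSH is based on RVM(L)
    make_relationship("D2", "complement", "D1"),  # CSH complements LCSH
    make_relationship("D4", "adaptation", "D3"),  # RAMEAU is an adaptation of RVM(L)
]


def sources_of(work, relationships):
    """Works from which the given work is derived, directly or indirectly."""
    derivation = {"adaptation", "based on"}
    found, queue = [], [work]
    while queue:
        current = queue.pop(0)
        for subject, relationship_type, obj in relationships:
            if subject == current and relationship_type in derivation and obj not in found:
                found.append(obj)
                queue.append(obj)
    return found


for work_id in ("D2", "D4"):
    names = [WORKS[w] for w in sources_of(work_id, RELATIONSHIPS)]
    print(WORKS[work_id], "derives from", names)
```

Both CSH and RAMEAU trace back to RVM(L) and, through it, to LCSH, although neither of them has a direct derivation link to LCSH in the data.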

Fig. 11 Scheme of classification links and of a subject heading scheme in a knowledge base

The example in Fig. 11 shows the structure of description of two products of the Library of Congress: Library of Congress Subject Headings (D1 LCSH) and Library of Congress Classification (D2 LCC, http://viaf.org/viaf/203733980). This time the LCSH subject heading list is described in its commercial version, made accessible within the Classification Web portal, down to the level of single entries (the relationship P6 is part of P1 expresses that the subject heading “Classification, Universal decimal” with the identifier “sh85026810” is part of LCSH Headings). The LCC classification is also a component part of the portal. The fact that both KOS are integrated in the Classification Web portal is depicted by the relationships P1 is part of P5 and P2 is part of P5. In the case of the LCC classification two further forms of expression have been captured: V3 is the static PDF and MS Word format with free access, V4 describes LCC in the format of linked open data. The format enabling free access to the linked data is registered in the facets data format and data structure, which can be combined with each other.

Conclusion

In the course of collecting data about knowledge organization systems, the hypothesis was confirmed that classical cataloguing practice, using MARC-type formats and proceeding in accordance with the existing rules (such as AACR), is not satisfactory for the purpose of KOS description in the context of the projected knowledge base. The existing proprietary solutions of KOS description (for instance the TaxoBank system) are certainly interesting and inspiring, but the relevant metadata specifications and documentation have not been made public. That is why the applicability of the theoretically conceived FRBR and FRSAD models and of the NKOS AP application profile has been tested. In both cases, however, the models proved to be highly abstract; prior to actual implementation in a concrete software environment a thorough analysis is indispensable, and the models must be complemented by our own specifications. Filling the metadata structure created in this way with concrete data has already brought practical experience and made it possible to identify some problematic aspects. The FRBR model is intellectually demanding when it comes to determining how the various entities on the highest level should be filled. Problems appear as early as the moment of defining identifiers and forms of titles for the work and the expression, since generally accepted names and identifiers, for instance in the VIAF system, are very rarely available. Another problem consists in deciding how to split the data reasonably between the elements description and date in the entities work and expression so as to avoid useless redundancy. This issue is connected with the general problem of how to separate the contents (the work) from the form (the expression), and it has to be solved separately in each case of description. Additional problems arise when deciding whether the described unit should be considered a new work or a new expression.
The application of the general principles formulated in the FRBR study often suggests several solutions and tends to be affected by many factors in practice. By way of example: is it more appropriate to consider all editions of the AGROVOC thesaurus as one work, or are the differences between the 1982 edition and the present version, re-worked into an ontological structure^[31] and the linked open data format, so conspicuous that they should be considered two different, although interconnected works? How should we proceed in the case of the various types of editions of the Universal Decimal Classification? Should “UDC Summary” in 52 languages be handled as 52 expressions? What should be taken as the decisive criterion determining that the case in question is a specific expression of UDC: the language, the edition or the type (full, medium, abridged etc.)?

A specific problem is related to the use of the entity “Thema” of the FRSAD model for expressing the contents of KOS. Knowledge organization systems often fall into the category of multidisciplinary works whose theme is in practice usually expressed according to formal viewpoints (e.g. UDC is a classification, LCSH is a subject heading scheme, and even cases of the type “DDC is Dewey decimal classification” are not rare). Naturally, such an approach would lead to duplicities in the knowledge base; the theme described in this way would overlap with the determination of the KOS type. This is also confirmed by the FRSAD study, which uses the term “isness” for such data and considers it not a determination of theme but one of the attributes of the resource. Even in domain-specific KOS, where it is possible to determine what subject field the terms of their vocabulary pertain to (e.g. AGROVOC covers issues of agriculture), this is not the same as determining the contents (“aboutness”) in standard professional texts (AGROVOC certainly is not about agriculture, DDC doubtlessly is not about Dewey decimal classification).

The problem of the NKOS AP application profile in its existing version is that some specific values are difficult to determine; it seems to be anticipated that the data will be entered by the producers of KOS and not by those who process them, e.g. librarians. Filling in the elements of KOS description according to NKOS AP by a third party often requires basic investigation on the level of the research description of museum or archive facilities. This is not a drawback in the case of the knowledge base being designed, where some research activities are envisaged, but it would be problematic in the case of routine description. Even a seemingly unambiguous element like date can cause problems: in the case of electronic sources the moment of origin is often difficult to determine on the level of work and manifestation. In addition, due to the dynamic nature of KOS, not only the date of origin but also information about changes is of interest, often even on the micro-level of single entries, which increases the amount of data. In the case of date, the following definition from the DC-terms vocabulary (dc:date) seems to be appropriate: “a certain point or period of time associated with an event in the lifecycle of the resource […] at any level of granularity”. However, exploiting the potential of this definition will require the analysis and description of the complicated life cycle of knowledge organization systems. The KOS-specific elements appear to need dedicated vocabularies (code lists), or at least very precise specifications of the contents of each given element; otherwise the consistency of the entered data is threatened.

In spite of these partial problems we feel that the FRBR model, together with the NKOS AP application profile based upon it, has proved successful in the task of KOS description. We see the potential to express relationships as the strong side of the FRBR model. So far we have managed to enter into the knowledge base all relationships established by the analysis of various KOS, using the standardized types of relationships comprised in the specification of the model.

Concerning the future development of the descriptive module of the knowledge base, we consider the degree of explicitness of description to be the key issue. Theoretically, it would be possible to sum up all properties that are specific to knowledge organization systems in the free text of one descriptive element. At present our base contains data implicitly integrated in the element bibliographic citation, namely data about the version (edition), the copyright and the availability. The element description contains, in the form of unstructured text, data about the history, frequency of updating, user determination, services offered, internal structure and external relations of the system. The advantage of unstructured free text lies in the possibility of expressing all specific aspects, some of them unique to a certain KOS. Its disadvantage is that implicit data are difficult to trace and that quantitative analyses, filtering according to a given criterion etc. are practically impossible. The pros and cons of structured data are the opposite. Easy searching of structured data comes with the necessity of a very precise and uniform specification, which is usually possible only at the price of a certain generalisation and, accordingly, the loss of information about some unique and non-typical features of certain KOS. Even the seemingly unambiguous data about the scope/number of units of a KOS requires an exact definition of what is considered to be a unit. All terms or only the preferred ones? The main attributes or also the auxiliary ones? How do we proceed in the case of a faceted structure?

The study has reviewed the phases completed so far in the knowledge base project in the field of knowledge organization, focusing on the design of metadata description for KOS. It is positive that most “five-star” parameters of the descriptive elements, necessary for exposing them in the format of linked open data, have been achieved. The final, more or less administrative step will consist in assigning URIs to the descriptive elements so that they are fully identifiable and technically linkable with semantically mapped data from other relevant name spaces.

The study is a partial output of the NAKI DF13P01OVV013 project Knowledge base for the field of information and knowledge organization, implemented by ÚISK FF UK in Prague.

References

BRATKOVÁ, Eva, KUČEROVÁ, Helena. Systémy organizace znalostí a jejich typologie. In: Knihovna. 2014, 25(2), 5-29. ISSN 1801-3252 (Print). ISSN 1802-8772 (Online). Also available from: http://knihovna.nkp.cz/pdf/1402/142005.pdf. English version available from: http://knihovna.nkp.cz/pdf/1402sup/142001.pdf.

COYLE, Karen, BAKER, Thomas. Guidelines for Dublin Core Application Profiles [online]. Dublin (Ohio): DCMI, 2009-05-18 [cit. 2015-02-26]. Available from: http://dublincore.org/documents/profile-guidelines/.

Dublin Core Metadata Initiative. NKOS Task Group. NKOS AP Elements. In: DCMI NKOS Task Group [online]. 2013, updated 2014-04-03 final, polished 2014-08-03 [cit. 2015-02-26]. Available from: http://wiki.dublincore.org/index.php/NKOS_AP_Elements.

Dublin Core Metadata Initiative. NKOS Task Group. NKOS Vocabularies. In: DCMI NKOS Task Group [online]. 2013, updated 2013-12-16 [cit. 2015-02-26]. Available from: http://wiki.dublincore.org/index.php/NKOS_Vocabularies.

Dublin Core Metadata Initiative. NKOS Task Group. NKOS Vocabularies. 2., KOS Types Vocabulary. In: DCMI NKOS Task Group [online]. 2013, updated 2013-12-16 [cit. 2015-02-26]. Available from: http://wiki.dublincore.org/index.php/NKOS_Vocabularies#KOS_Types_Vocabulary.

FONS, Ted, PENKA, Jeff, WALLIS, Richard. OCLC’s Linked Data Initiative: Using Schema.org to Make Library Data Relevant on the Web. In: Information Standards Quarterly. Spring/Summer 2012, 24(2/3), 29-33. ISSN 1041-0031. Also available from: http://www.niso.org/apps/group_public/download.php/9408/IP_Fons-etal_OCLC_isqv24no2-3.pdf.

ISO 24156-1:2014. Graphic notations for concept modelling in terminology work and its relationship with UML -- Part 1: Guidelines for using UML notation in terminology work. 1st ed. Geneva: International Organization for Standardization, 2014. 24 p.

PALMA, Raúl, HARTMANN, Jens, HAASE, Peter. OMV: Ontology Metadata Vocabulary for the Semantic web [online]. Version 2.4.1. S.l.: OMV Consortium, March 2009 [cit. 2015-02-28]. 76 p. OMV Report. Available from: http://sourceforge.net/projects/omv2/files/OMV%20Documentation/OMV-Reportv2.4.1.pdf.

World Wide Web Consortium. Data catalog vocabulary (DCAT) [online]. W3C Recommendation 16 January 2014. Ed. Fadi MAALI, John ERICKSON. 2014-01-16 [cit. 2015-02-28]. Available from: http://www.w3.org/TR/vocab-dcat/.

^[1] BRATKOVÁ, Eva, KUČEROVÁ, Helena. Systémy organizace znalostí a jejich typologie. In: Knihovna. 2014, 25(2), 5-29. ISSN 1801-3252 (Print). ISSN 1802-8772 (Online). Also available from: http://knihovna.nkp.cz/pdf/1402/142005.pdf. English version available from: http://knihovna.nkp.cz/pdf/1402sup/142001.pdf.

^[2] The entering of further KOS records continues in 2015.

^[3] For example, the identification of individual printed editions and versions of the Danish Decimal Classification (Decimalklassedeling) was resolved relatively successfully through the Danish national union catalogue of public libraries “bibliotek.dk” in combination with the Royal Library catalogue.

^[4] The first permanent identifiers (VIAF ID) for works of some KOS have started to appear in the VIAF database. Available are, for example, an identifier (URI) for UDC (http://viaf.org/viaf/184301709/) or for LCC (http://viaf.org/viaf/203733980/), as well as identifiers for expressions, for example an identifier for the Russian translation of UDC (http://viaf.org/viaf/185511525/) or for the Czech translation of selected UDC notations (http://viaf.org/viaf/186352368/).

^[5] FONS, Ted, PENKA, Jeff, WALLIS, Richard. OCLC’s Linked Data Initiative: Using Schema.org to Make Library Data Relevant on the Web. In: Information Standards Quarterly. Spring/Summer 2012, 24(2/3), p. 29. ISSN 1041-0031. Also available from: http://www.niso.org/apps/group_public/download.php/9408/IP_Fons-etal_OCLC_isqv24no2-3.pdf.

^[6] The Schema.org metadata schema was prepared in cooperation with the companies Google, Bing, Yahoo, and Yandex.

^[7] In the record of the Dewey Decimal Classification in the form of the WebDewey database, presented in Fig. 2, subject description is handled quite pragmatically: the record is assigned (and linked) to the notation "025.431" (Dewey Decimal Classification) and also to the FAST heading "Classification, Dewey decimal". Clearly, the described classification scheme does not deal with "itself"; in this case the heading is not an expression of subject content, but only the formal name of the KOS.

^[8] Universitätsbibliothek Basel. BARTOC.org: BAsel Register of Thesauri, Ontologies & Classifications [online]. Projektleiter Andreas LEDL. Basel: Universitätsbibliothek Basel, 2013- [cit. 2015-02-25]. Freely available from Basel University server: http://www.bartoc.org/.

^[9] BRATKOVÁ, Ref. no. 1, p. 19-20.

^[10] The vocabulary is located in directory: http://www.bartoc.org/cs/taxonomy/term/.

^[11] Access Innovations. TaxoBank Terminology Registry: TaxoBank [online]. Albuquerque (New Mexico, USA): ACCESS INNOVATIONS, 2009- [cit. 2015-02-25]. Available from: http://www.taxobank.org/.

^[12] For the typology of KOS and the terminology used, see: BRATKOVÁ, Ref. no. 1, p. 18-19.

^[13] A detailed record of the NAL Thesaurus is at: http://www.taxobank.org/content/nal-agricultural-thesaurus.

^[14] The namespace for NKOS AP will be prepared at: http://purl.org/nkos/.

^[15] Dublin Core Metadata Initiative. NKOS Task Group. NKOS AP Elements. In: DCMI NKOS Task Group [online]. 2013, updates 2014-04-03 final, polished 2014-08-03 [cit. 2015-02-26]. Available from: http://wiki.dublincore.org/index.php/NKOS_AP_Elements.

^[16] Dublin Core Metadata Initiative. NKOS Task Group. NKOS Vocabularies. In: DCMI NKOS Task Group [online]. 2013, updated 2013-12-16 [cit. 2015-02-26]. Available from: http://wiki.dublincore.org/index.php/NKOS_Vocabularies.

^[17] Presentations of the group and the outcomes of its activities are available on its website: http://nkos.slis.kent.edu/.

^[18] The database of ISTC identifiers is managed by the ISTC International Agency: http://www.istc-international.org/.

^[19] Dublin Core Metadata Initiative. NKOS Task Group. NKOS Vocabularies. 2., KOS Types Vocabulary. In: DCMI NKOS Task Group [online]. 2013, updated 2013-12-16 [cit. 2015-02-26]. Available from: http://wiki.dublincore.org/index.php/NKOS_Vocabularies#KOS_Types_Vocabulary.

^[20] COYLE, Karen, BAKER, Thomas. Guidelines for Dublin Core Application Profiles [online]. Dublin (Ohio): DCMI, 2009-05-18 [cit. 2015-02-26]. Available from: http://dublincore.org/documents/profile-guidelines/.

^[21] The entity Thema is taken over verbatim from the FRSAD model, which has not yet been translated into Czech.

^[22] ISO 24156-1:2014. Graphic notations for concept modelling in terminology work and its relationship with UML -- Part 1: Guidelines for using UML notation in terminology work. 1st ed. Geneva: International Organization for Standardization, 2014. 24 p.

^[23] Dublin Core Metadata Initiative. NKOS Task Group. NKOS Vocabularies, Ref. no. 15, http://wiki.dublincore.org/index.php/NKOS_Vocabularies.

^[24] Dublin Core Metadata Initiative. NKOS Task Group. NKOS Vocabularies. 2., KOS Types Vocabulary, Ref. no. 18, http://wiki.dublincore.org/index.php/NKOS_Vocabularies#KOS_Types_Vocabulary.

^[25] BRATKOVÁ, Ref. no. 1, p. 22. – In contrast to the NKOS AP typology, which is a linear list, our typology applies a one-level hierarchy.

^[26] Classification System for Knowledge Organization Literature [online]. ISKO 2011-2012, last update 2012-04-13. Available from: http://www.isko.org/scheme.php.

^[27] Library of Congress. MARC Code List for Relators [online]. Washington (D.C.): The Library of Congress, modified 2011-04-26 [cit. 2015-02-26]. Library of Congress Linked Data Service. Authorities and Vocabularies. Available from (URI): http://id.loc.gov/vocabulary/relators.

^[28] BERNERS-LEE, Tim. Linked data – design issues [online]. 2006-07-27, last change 2009-06-18 [cit. 2015-04-10]. Available from: http://www.w3.org/DesignIssues/LinkedData.html.

^[29] World Wide Web Consortium. Data catalog vocabulary (DCAT) [online]. W3C Recommendation 16 January 2014. Ed. Fadi MAALI, John ERICKSON. 2014-01-16 [cit. 2015-02-28]. Available from: http://www.w3.org/TR/vocab-dcat/.

^[30] PALMA, Raúl, HARTMANN, Jens, HAASE, Peter. OMV: Ontology Metadata Vocabulary for the Semantic web [online]. Version 2.4.1. S.l.: OMV Consortium, March 2009 [cit. 2015-02-28]. 76 s. OMV Report. Available from: http://sourceforge.net/projects/omv2/files/OMV%20Documentation/OMV-Reportv2.4.1.pdf.

^[31] See the rich structure of defined relationships in the Agrontology vocabulary of relationships at: http://aims.fao.org/aos/agrontology.

Care of 19th century prints in the National Library of the Czech Republic

2015-09-18T12:17:28Z

Keywords: preservation, restoration, 19th century prints, binding, lignin-containing paper

Ing. Petra Vávrová, Ph.D., Tereza Kašťáková, Mgr. Jitka Neoralová, Ing. Kristýna Boumová, Tereza Sazamová¹ / The National Library of the Czech Republic, Centrální depozitář v Hostivaři, Sodomkova 2, Praha 1

Introduction

Among its activities of the last five years, the Collection Preservation Division of the National Library of the Czech Republic (hereinafter NL CR) has devoted special attention to the care of the so-called modern library collections, which contain books and documents created over more than two hundred years, from 1801 to the present day. These collections record the development of Czech culture and national identity; they have invaluable historical, artistic, and social significance and considerable informative value. Unfortunately, the material from which prints were produced after 1845, lignin-containing paper, together with the changed technologies and materials of the present day, results in their poor durability. This durability is further affected by a number of degradation factors, above all external ones: ambient temperature, relative air humidity, impurities in the environment (e.g. dust particles and airborne pollutants such as oxides of sulphur and nitrogen, ozone, etc.), light, biological pests (mildew, bacteria, insects), etc.
The poor quality of the materials of modern book collections is compounded by the large quantity of documents in this segment: in the National Library CR, they make up as much as 96 % of the book collections, and their volume is still increasing.

Within the care of modern collections, we focused on the 19th century collections because of their rarity and uniqueness. The 19th century prints also contain a wide variety of materials.

After a brief description of the condition of the prints, this paper focuses on methods of cleaning, preserving, and restoring modern prints, and on the specific aspects of the textile covers of 19th century books.

19th century documents

The electronic "Bibliography of the 19th century" (available on the NL CR website ²) includes 97,161 scanned cards. The union catalogue of the national retrospective bibliography of the 19th century covers prints in the Czech language from 1801–1900, without territorial restriction, held in libraries in Prague and elsewhere, and indicates where each of them is held. It also registers minor prints (almanacs, calendars, annual reports, concert and theatre programmes); however, it does not contain newspapers and magazines from this period. Entries are arranged alphabetically; anonymous works are filed under the governing noun. The records are not accessible through the OPAC (ALEPH system).³

In the categorization of book collections, there is a disproportion in the level of protection between newer documents, from 1901 to the present, and documents of the 19th century. In the first case, the National Library CR obtained two obligatory copies (OC), one of which had archival status, meaning that its circulation was very limited (these segments of the NL CR collections are marked OC I). In the case of 19th century documents, the National Library CR held only one obligatory copy, or another copy obtained by purchase or as a gift; however, none of them had archival status, despite the fact that these documents should be protected even better. Around 1998, a management meeting approved the intention to set one copy apart and subject it to the same level of protection as OC I. Shortly afterwards, the setting apart and marking of these copies began. It was limited to monographs. For periodicals, an additional analysis would be needed to evaluate the specific nature of this type of document and the impact of setting copies apart on user services and other areas of the library.

The intention stated that the copies would not be transferred to the collection of the National Preservative Fund (NPF), but only placed under the management of the then Department of the National Preservation Collections, so that the integrity of the Universal Book Fund (UBF) remained preserved. The objective was to give these documents the same level of protection as OC I. This was already put into practice in the project financed from "The Norwegian Funds", which focused on the microfilming and digitization of 19th century monographs.⁴

Investigation of physical condition of 19th century prints

The extent of modern book collections described above, in the National Library CR alone, shows that the investigation of the physical condition of 19th century prints must be designed differently from that of older historical manuscripts and books. On the basis of experimental work, a well-arranged and comprehensible application, "The Central knowledge base RD" ⁵, ⁶, was created, in which the data found on the physical condition of individual copies are recorded. The procedure for surveying the physical condition of modern book collections was written up as a methodology and certified.⁷

Up to now, nobody has systematically dealt with the targeted preservation or restoration of 19th century prints, either in NL CR or in other libraries. In quantitative terms, the care of these collections requires above all a different approach from the restoration of older historical book collections: large quantities of books and documents of varied material composition have to be repaired or preserved and stabilized in a short time. Ethical and aesthetic standards for conservators' interventions in modern collections have not yet been sufficiently worked out. At present, the care of modern library collections in the Czech Republic consists primarily of so-called preventive conservation, that is, setting climatic conditions (parameters given by standards for the particular materials of library objects) and storage conditions (e.g. cleaning, and wrapping and storing books in suitable covers, boxes, or envelopes), or of conservation interventions, bookbinding work, disinfection, and in some cases deacidification in order to slow down the degradation reactions of materials. Within the project NAKI DF13P01OVV004 "Survey, conservation, and care for modern library collections – materials and technologies", procedures are being developed for the preservation and restoration of the bindings of modern book collections in particular, and new, more stable materials are being developed for bookbinding and its repair.

Textile covers of 19th century collections in the National Library CR

Textile covers in bookbinding appeared for the first time in connection with the industrial revolution at the beginning of the 19th century. With the expanding possibilities of printing and industrialization, the trades of printer and publisher separated. Bookbinding became the business of publishers, who laid stress on availability and quantity. Serial production of publications emerged. This resulted in rising pressure to cut the cost of both services and materials. The quality of paper and printing, as well as the artistic level of illustrations and graphic design, declined. From 1825, bookbinder's cloth, a cheaper alternative to leather, began to spread from England.

Within the NAKI project, we carried out a detailed survey in 2014 of textile covers in more than 37 000 bindings from the collections of the National Library CR. Over 4000 textile covers were documented using a camera and a USB microscope. The frequency of textile bindings and of individual designs and surface finishes of bookbinder's cloths was then assessed. From these data, source material was drafted for the Technical University in Liberec, which develops new textile materials. In cooperation with NL CR, it is developing procedures for the restoration of canvas covers, including the possibility of completing missing coat ⁸ and surface structures.

A number of different textile binding covers can be found in the collections of NL CR. The technical terms half-cloth binding and whole-cloth binding may thus conceal a considerable variety of cover materials and surface finishes. For the purposes of this survey, we divided them into several groups and subgroups according to the binding fabric and, at the same time, according to the surface finish. Division according to surface finish is commonly used in the nomenclature and classification of book-binding cloths. In terms of fabric, the covers include fabrics with canvas weave (the most often used), fabrics with twill weave, and finally satin fabrics, which appear very rarely (mostly in special bindings such as albums, annual reports, visitors' books, and the like).⁹ There are also special sorts of fabric, for example velvet. Some fabrics are manufactured from a combination of materials, with different threads used for the warp and for the weft (woof), e.g. cotton or flax in combination with silk (Fig. 1). This achieves special effects of light and lustre (Fig. 2).

Fig. 1 Photograph from microscope; combination of threads of various materials (Photo NL CR)

Fig. 2 Combination of materials (Photo NL CR)

It should be pointed out that in bookbinders' nomenclature the term canvas is used for a textile cover regardless of whether the fabric has a plain weave or a twill weave.¹⁰ A different term is used only for velvet, satin, and fabrics derived from it, such as brocade, damask, or sateen.

Fabrics may be divided into three basic categories according to the way the threads are woven. Each category gives rise to a number of variants, which have their own commercial names.

Canvas

Plain weave is the basic fabric structure, in which warp and weft threads cross regularly, always one over one (Fig. 3).

Fig. 3 Photograph from microscope; plain weave (Photo NL CR)

Fig. 4 Photograph from microscope; twill weave (Photo NL CR)

Twill

In twill and derived fabrics, one weft thread passes over two warp threads at a time, and in each following row the pattern is displaced by one position. This produces a characteristic pattern of diagonal lines or, when the direction is reversed, a herringbone pattern (Fig. 4).

Fig. 5 Photograph from microscope; two-colour twill (Photo NL CR)

Fig. 6 Photograph from microscope; Sateen weave (Photo NL CR)

A considerable number of varieties derives from this basic structure.
Twill used for bookbinding is in most cases coated on both sides and very often multi-coloured (Fig. 5).

Sateen weave (sateen)

Sateen weave is the third of the basic fabric weaves.
Owing to its specific weave (so-called five-end sateen), it is characterized by high gloss and, at the same time, by low resistance to abrasion (Fig. 6, 7). It is therefore used rarely.
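The three basic weaves described above can be sketched as small binary grids. The following Python fragment is only an illustration under stated assumptions, not a description of any particular cloth in the survey: the twill is assumed to be a 2/2 twill (over two, under two), the sateen is assumed to be a weft-faced five-end sateen with a shift of two threads per row, and the names weave_matrix and render are hypothetical helpers introduced here.

```python
def weave_matrix(kind, size=8):
    """Return a size x size grid of 0/1 values for one of the three basic weaves.

    Rows are weft (woof) threads, columns are warp threads.
    A cell is 1 where the weft thread lies on top of the warp thread.
    """
    rules = {
        # plain weave: over one, under one, alternating in every row
        "plain": lambda warp, weft: (warp + weft) % 2 == 0,
        # ASSUMPTION: 2/2 twill, over two and under two, shifted by one per row
        "twill": lambda warp, weft: (warp - weft) % 4 < 2,
        # ASSUMPTION: weft-faced five-end sateen, shifted by two per row;
        # the weft floats over four warp threads and is bound under one
        "sateen": lambda warp, weft: (warp - 2 * weft) % 5 != 0,
    }
    rule = rules[kind]
    return [[int(rule(warp, weft)) for warp in range(size)]
            for weft in range(size)]


def render(matrix):
    """Draw the grid as text: '#' = weft on top, '.' = warp on top."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in matrix)


if __name__ == "__main__":
    for kind in ("plain", "twill", "sateen"):
        print(kind)
        print(render(weave_matrix(kind, 10)))
        print()
```

In the printed grids, the twill shows the diagonal lines mentioned in the text, while in the sateen each thread is bound only once in every five crossings. Such long floats are commonly given as the reason for both the gloss and the low abrasion resistance of this weave.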

Fig. 7 Sateen cover ornamented with gilding (Photo NL CR)

Fig. 8 Simple recurrent design (Photo NL CR)

Fig. 9 Ornaments (Photo NL CR)

Fig. 10 Imitation of leather (Photo NL CR)

Coated (dressed) canvases

In book-binding canvases, and textile covers of bindings in general, a coat made of starch, acrylate, or latex is often used as a finish. It is applied to protect the surface against abrasion and water, to improve handling, and to intensify the colour. The coat must fill the gaps between the threads of the fabric and thereby make it impervious to the glue that is later used when the canvas is processed for bookbinding.¹¹ According to how the coat is applied, book-binding canvases are classified as one-side coated or double-side coated. The coat also makes it possible to decorate the canvas surface with various relief patterns. The most frequent ornaments (Fig. 9) are floral motifs, simple recurrent patterns (Fig. 8), and structures imitating a leather surface (morocco (Fig. 10), snake, ray, or crocodile leather, etc.). After application, the surface of the coat is sometimes further finished by polishing. This process is called calendering. Relief embossing on canvas is carried out with special embossing calenders.

Double-side coated canvases

Double-side coated canvases are most often canvases with a smooth surface. The reverse side was usually smoothed less, but the canvas can be used for binding on either side. Canvases with an embossed surface are called shagreens.¹² A thin type of shagreen is shirting. This group also includes thick twills, originally marketed under the trade name cotton duck, which were used mainly for large and heavy trade books and for the bindings of large atlases and pattern books.

One-side coated canvases

Canvases coated only on the right side are called English canvases. They have a layer of coloured coat on the surface, while the reverse keeps the natural white colour of the fabric. With this treatment, even minor damage, caused e.g. by abrasion, is easily recognizable.

Canvases coated on the reverse, so-called ecru canvases, retain the excellent natural surface of the fabric. Natural flaxen or semi-flaxen canvases and balloon canvases (very finely woven canvases with a fine warp ¹³) are often treated in this way. A special type of book-binding canvas also deserves mention: the so-called doublets (Krypton). They belong among the one-side coated canvases, but their coat is partly replaced by a lining of thin backing paper (applied with an acrylate dispersion or a thermally activated glue). The advantage of these canvases is the considerable range of colours that the textile material itself can offer. The colours are striking and the surface is natural. When wetted with glue, the canvases keep their dimensional stability. Their disadvantage, on the other hand, is that embossing is difficult on the untreated canvas surface.

Most frequent types of damage to 19th century prints

Damage to books depends on a number of factors. The first to be mentioned is the material composition and the method by which the materials were produced. The material composition of 19th century books is highly diverse, mainly with regard to the paper used. The industrial revolution continued throughout the 19th century, and manual production processes were converted to mechanical ones. These changes also affected the production of book materials and book manufacturing itself. At the same time, there was continuing pressure to reduce production prices, so that cheaper, lower-quality, and less durable materials were chosen for books.
Because of the lack of the original raw material (rags) for paper production, wood began to be used as a substitute. At first, wood was processed mechanically, yielding wood pulp, which is regarded as a problematic material. Papers containing wood pulp are easily recognized because they are yellowish and very brittle. Wood pulp was most important in the production of newsprint, which contained 75–80 % of wood pulp, and abroad sometimes as much as 88 %. The problematic nature of this paper was discovered very early, which is why new and better processes were developed. Later, chemicals began to be used in wood processing. Depending on the specific chemicals used, the resulting product used for paper production was called soda pulp, sulphite pulp, or sulphate pulp.¹⁵

Another factor affecting the type of damage is unsuitable storage. If a book is stored in an environment with high relative humidity and high temperature, moulds may grow, and these are capable of turning paper into powder. Low humidity is not suitable either, because the organic material dries out and becomes brittle.¹⁶

A very important factor in damage to books is the human factor. Serious damage is caused by faulty handling and unsuitable repairs. It is therefore very important to handle books carefully and to avoid unsuitable materials for repairs.

If we take all these factors into account when storing documents, we can ensure better durability of the highly varied book collections of the 19th century. Below, the most frequent types of damage that may endanger the integrity of books or lead to further damage are described and documented with illustrative photographs. Over years of use, impurities are deposited on the book cover, e.g. dust, greasy stains, or stains from various liquids, and often also food residues.

A frequent type of damage to the spine of a book is the loss of headcaps, which results from carelessly pulling books off the shelf by the headcap.

Fig. 11 Damage of headcap (Fund NL CR)

When the hinge cracks, the spine lining may pull away from the board. The most serious damage to the spine of a book is the complete loss of the spine cover and back lining. When the spine of a book is exposed, the sewing and the folders (gatherings) may be damaged.

Fig. 12 Cracked back hinge and released board (Fund NL CR)

Abrasion and loss of the cover material are caused by books rubbing against one another on the shelf. Careless handling when books are repeatedly shelved and removed may damage the edges and corners of the boards. The absence of a board, or of part of it, is very dangerous for the book block, because the paper of the first or last folder is then prone to damage.

Fig. 13 Abraded cover and its partial loss (Fund NL CR)

Fig. 14 Damage to edges and corners – loss of cover material and delamination (Fund NL CR)

The most frequent types of damage to the book block are tears and the loss of sheets of paper. When turning the pages, we sometimes come across a sheet or folder that has been ripped out of the book block. The lower corner of the paper is also soiled and weakened by frequent page turning.

Fig. 15 Loss of sheet of paper (Fund NL CR)

Fig. 16 Ripped out folder from book block (Fund NL CR)

End papers made from wood-pulp and acidic paper often do not withstand the strain, and they then break partially or completely in the back joint. If the end papers are made from paper of higher quality than the paper of the book block, the break may occur in the first or the last folder. If the backing sheets are completely cracked in the back hinge at the first or the last folder, the whole weight of the book block is transferred to the cords. They cannot bear the load and break. When the cords break, the block separates from the front or the back board, or completely from the cover (Fig. 18).

Fig. 17 End paper cracked in the back hinge (Fund NL CR)

Fig. 18 Rupture of folder and cords in the book block (Fund NL CR)

Unprofessional repairs with adhesive tapes are very problematic, because they may cause further damage to the binding or the book block. Their removal is described in the next part of the text.

Procedures for preservation and cleaning 19th century prints

Technology of cleaning 19th century prints

Contamination represents a high risk of damage: it accelerates degradation processes and undesirable reactions of the individual components of paper, textile, leather, and other materials found in the binding structure of books. A basic prerequisite for stabilizing permanently preserved books is the removal of contamination from the surfaces of the materials used, as well as of the degradation products arising from the material itself. In terms of cost and time, cleaning is generally one of the best means of extending the durability of permanently preserved objects. The choice of cleaning strategy for books from modern collections depends not only on the nature of the contamination and the physical condition of the object, but also on the degree of risk and on time efficiency. Special emphasis is placed on the protection of recording media, such as printing dyes, inks, Chinese ink, stamping dyes, and others. Dust is the most widespread contamination in book collections. It is a mixture of many substances that negatively affect the condition of the materials forming the book block. In addition to salts, soot, grease, and compounds of metals (iron, lead, cadmium), dust also contains germs of microorganisms. The ability of dust to absorb moisture from the air and hold it on the paper surface contributes to the activation of transition metals and acids.¹⁷ Contamination caused by the use of books is another problem. This mainly involves soiling of page corners by grease from fingers, spilled beverages, and food stains. Contamination also occurs very frequently through careless handling, as well as through natural disasters, biological factors, water, wax, resins, and materials that got into the book block during use (plants, insects, etc.). Degradation or corrosion products of bookbinding materials also count as contamination.

Cleaning methods can be divided into mechanical methods (dry cleaning), cleaning with water systems, and cleaning based on chemical reactions, such as cleaning with organic solvents.¹⁸

Mechanical cleaning

Mechanical cleaning removes particles of impurities deposited on the paper surface, such as dust. The efficiency of cleaning agents is affected by how strongly the impurities adhere to the paper surface. Each cleaning agent has its specific properties, and the choice depends on the type of impurity, the structure of the contaminated material, and the condition of the material to be cleaned. Basic dust removal is carried out with brushes and with vacuum cleaners fitted with brush attachments. This procedure is preferred above all for collective interventions (moving collections, refurbishment of depositories or research laboratories, etc.). The method is reliable for all types of books, as long as it is carried out with proper care and with respect to the condition of the book. Cleaning with brushes and vacuum cleaners removes only part of the dust deposits and other impurities on the surface.¹⁹ Impurities that are difficult to remove are therefore removed by more efficient means. These can be divided into compact cleaning agents (rubbers, sponges, blocks of cleaning material), plastic cleaning cements, powdered cleaning agents, and electric cleaners.

Fig. 19 Means for mechanical cleaning (Fund NL CR)

Slabs or rollers of modified PVC (the material PURUS) are of universal use. Their cleaning capability lies in their sticky surface, on which particles of impurities are trapped. Another aid in use is the Wallmaster sponge, made of pure latex rubber, whose porous structure greatly increases the sticky surface of the sponge. Wishab rubbers, based on vulcanized latex, are sticky and crumble off easily. They therefore put less strain on the surface being cleaned, but the surface must subsequently be vacuumed to remove the crumbled particles.²⁰

Fig. 20 Cleaning of paper using sponge Wallmaster (Fund NL CR)

Thanks to their strong adhesion, cleaning cements are able to bind impurities to their surface. Their plastic structure allows impurities to be worked into the mass of the cement, which extends the period for which it can be used. However, the strong adhesion also causes undesirable lifting and peeling of minute particles from materials with a less cohesive surface, for example fibres in insufficiently sized papers.

Agents in powdered form are represented primarily by crushed rubbers of various grain sizes. They are gentle on the surface of the treated material and adsorb surface impurities well. Their disadvantage is that they are difficult to remove; rubber powders are therefore applied only to those parts of the binding that are easily accessible. Within the NAKI project, the use of the material Perlóza for cleaning paper and textile elements is being tested in the workplace of the Division of Preventive Conservation. It is pure regenerated cellulose in the form of ball-shaped, highly porous micro-elements, which can be saturated with water alone or with the addition of a suitable surfactant (tenside). The combination of the porous surface of the polymeric elements and their sorption capability makes it possible to remove not only solid impurities (e.g. dust and various surface deposits), but also organic substances. The cleaning is mechanical, but since the principle requires the presence of water, the method lies on the boundary between mechanical cleaning and cleaning with water systems.

Fig. 21 Application of pearl cellulose on book cover (Fund NL CR)

The electric eraser is a relatively aggressive cleaning tool. Its use requires a paper surface in good, compact condition, and it is not used on other materials. The efficiency of this hand-held machine depends on the type of rubber insert (white soft rubber, abrasive rubber, so-called plastic rubber, etc.).

Cleaning with water systems

By the help of water and water solutions, not only impurities on the surface are removed, but also soluble matters and degradation products of the material. Cleaning with water systems is preceded by mechanical cleaning for removal of soluble deposits on the surface. Using water system in bookbinding is questionable. Only loose sheets are treated in this way, which do not contain water soluble dyes. Solubility of dyes is tested before initiation of cleaning. After cleaning paper in water bath, the paper is glued with glue based on cellulose ether, for example Tylose MH300 (0.5% aqueous solution). Cleaning of unbound books with water is possible on restorers´ workplace equipped with special vacuum chock, on which paper is washed with water, which may be in form of vapour or aerosol. Moistened materials are used for local application of water solutions, as cotton-wool swabs and extremely absorbing sponges (e.g. Blitzfix, Conservation Sponge, etc.), by which it is possible to clean not only paper, but also textile, exceptionally also leather. Demineralized or distilled water is used for cleaning. Drinking water contains ions of metals, minerals, and other substances, which may reduce cleaning efficiency, and bring undesirable elements into materials to be purified. Water itself is not sometimes capable to remove some sorts of impurities, therefore, various types of surfactants – tensides - are added to baths. Tensides used should always be washed out by pure water from any material.

An alternative to cleaning unattached papers in bath is cleaning on capillary textile. Advantage of this method is that paper is not mechanically stressed, and loosening of threads or cracks does not happen. The principle of cleaning is based on capillary rise of water in textile with synthetic threads laid in one direction, on which contaminated unattached sheet of paper is placed. By capillary action water migrates through capillaries and takes away soluble impurities from paper to the textile, and subsequently to the collecting vessel. ²¹

Application of hydrogels also belongs among cleaning with water systems. Water is partly fixed in gel, and quantity of water penetrating into paper is more or less controlled (depending on gel concentration). Impurities and degradation products are absorbed by gel. By addition of deacidifying solution into gel, it is simultaneously possible to deacidify paper within the frame of cleaning. Compared to water bath, this type of cleaning is thoughtful, deformations, slackening of fragments and tearing of paper do not happen. This procedure is used on loose sheets of paper at present, but also local removal of impurities not only from paper is under testing.

On covering materials, aqueous cleaning is possible only locally, so that the adhesive and the materials in the layers beneath the covering are not activated. If the measured shrinkage temperature ²² of the leather permits aqueous cleaning and a non-aqueous agent cannot be used for some reason, the leather surface may be treated with the foam of an aqueous solution of a non-ionic surfactant (Alvol) applied with a soft natural sponge. The foam must be wiped off immediately with a clean sponge or a cotton swab saturated with water, and the leather then dried with a suitable absorbent material.

During cleaning, leather must not be soaked with excess water. Once the water content of the leather reaches 17–20 %, the leather must be dressed with a selected grease mixture. If the binding has metal elements, mechanical cleaning or cleaning with organic solvents is preferred.

Chemical cleaning

Chemical cleaning with organic solvents requires a workplace equipped with air conditioning, and protective equipment must be used. Organic solvents are used primarily for local removal of stains, varnishes, resins, and greasy spots. As with aqueous cleaning, the solubility of the dyes present is tested first. Processes such as bleaching of paper are not carried out on 19th-century collections. Medical-grade benzine may be used on textile materials. Isopropyl alcohol is a safe agent for cleaning leather. Commercial anhydrous cleaning agents for leather are also available, formulated for leather with a particular type of tanning or for white leather and parchment.

Tab. 1 Use of selected solvents for impurities on paper

Type of contamination	Organic solvent
Greasy stains	Benzine
Varnishes	Acetone
Wax and oil stains	Hexane and toluene
Synthetic self-sealing tape	Benzine, acetone

Restoration of 19th century prints

When restoring 19th-century book collections, it is important to take into account the variety of materials found in them. Another specific feature of modern collections is the immense number of books, so restoration must be approached differently from the procedures customary for historical books.

The first step is photographic documentation and a survey of the book's damage; the results determine how the restoration will proceed. Every restoration procedure should be discussed with the collection administrator, who knows the significance and value of the book. For books whose binding is non-functional, a new historicizing binding may be made. Partial restoration is preferred to complete restoration, because the latter is time-consuming. At the same time, great emphasis is placed on the functionality of the book, so the restoration must allow the book to be used afterwards.

The next step is cleaning, both mechanical and aqueous, which is carried out mostly locally if the book is heavily contaminated. Cleaning is described in more detail above. Old repairs are not removed if they are functional and do not damage the book further.

Measuring the pH of the paper must not be omitted for 19th-century books, because at the beginning of that century resin saponified with alum began to be used for sizing paper, which causes the paper's acidity today. A suitably chosen and properly performed deacidification method may extend the lifetime of paper by hundreds of years. For large numbers of books, mass deacidification methods are advisable.²³
The subsequent step is repair of the book block. Tears in the paper are mended with coloured Japanese paper whose basis weight is selected according to the thickness of the original sheet.

Fig. 22 Japanese papers (Author's archive)

Missing areas are filled by overlaying Japanese paper or a comparable hand-made paper. Starch or cellulose ethers (Tylose MH 6000) have proved to be suitable adhesives. The patching is accelerated with a restorer's heated spatula with a thermoregulator.

Fig. 23 Filling losses with hand-made paper (Author's archive)

Once the book block is repaired and its sewing is functional, repair of the cover can begin. Small losses at the corners of the covers may be repaired with a thick paper-pulp suspension. Loose layers at the corners and edges of the covers are glued together with starch and then held between pasteboard with clamps until dry.

Fig. 24 Repair of a missing corner of the cover with a thick suspension of paper pulp (Author's archive)

Missing parts of the paper layer of the cover are filled with Japanese paper of higher basis weight, coloured so that the repair is apparent but does not disturb the integrity of the book.

We always try to insert the fill under the original covering. Starch or cellulose ethers may again be used as the adhesive.

Fig. 25 Filled part of missing paper cover (Author's archive)

When replacing missing parts of the spine covering on full-cloth or half-cloth bindings, a canvas as similar as possible to the original is used, that is, one with the same colour, surface finish, and warp and weft texture. Identical canvas is very difficult to obtain today, so suitably coloured Japanese paper backed with thin cotton canvas (batiste, thin shirting, crepeline, ²⁴ etc.) can be used instead. In cooperation with the Technical University in Liberec, a suitable material is currently being developed that would fill the missing areas and at the same time resemble the original canvas. A nonwoven textile was chosen as the base material; it is sufficiently firm yet thin, and in some properties it surpasses Japanese paper. The final appearance of the applied textile can be adapted with a suitably coloured coat into which the pattern of the original cover's surface finish may be impressed. This method is ideal for coated canvases with a pattern, but it also substitutes very well for canvases without one.
If the spine lining is missing, it is replaced with a new one made of acid-free stiff paper.

Fig. 26 Japanese paper backed with glued canvas (Author's archive)

In leather and half-leather bindings with losses of the spine covering, the missing parts can be replaced with pared leather of a similar colour tone. Alternatively, Japanese paper backed with canvas can be used.

Fig. 27 Half-leather book with cracked back hinge – before restoration (Fund NL CR)

Fig. 28 The same book after restoration (Fund NL CR)

The last step is making a suitable protective enclosure for storage, which protects the book during handling and partly prevents dust from settling directly on it. Archival-grade alkaline millboard is always chosen as the enclosure material.

The most frequent restoration operations are described above; however, the procedure cannot be regarded as a fixed rule, because every book is unique and must be approached as such.

The most frequent unprofessional intervention: adhesive tapes used for book repairs and the problems of their removal

Unprofessional attempts to repair damaged books with readily available office and repair adhesive tapes are frequent in library collections, and not only modern ones (Figs. 29, 30). These interventions are usually made with the intention of extending the life of the books and restoring their original functionality. In effect, however, they not only harm the document visually but also cause negative changes in its mechanical and chemical properties. As the tapes age, they release harmful substances that irreversibly damage paper and other bookbinding materials.
For these reasons, it is always desirable to remove adhesive tapes that have been applied. Removal should be as gentle as possible while removing as much of the adhesive layer as possible.

Fig. 29 Example of unprofessional repair (Fund NL CR)

Fig. 30 Example of unprofessional repair (Fund NL CR)

Adhesive tapes consist of a flexible support (plastic, paper, textile, etc.) and an adhesive. The adhesive layer may be made of various natural or synthetic materials that are activated in different ways.

Tapes can be divided by how the adhesive is activated. Tapes with water-activated adhesive mostly have a paper support, though textile may also be used; the adhesive is generally of animal origin (animal glue, gelatine), but there are also tapes with dextrin adhesive. Heat-activated tapes as a rule have a Japanese paper support; an example is the repair tape Filmoplast R. The most widespread and most problematic are self-adhesive tapes activated by pressure. In the Czech Republic, this type is commonly but incorrectly called "izolepa".

Techniques of removal

a) mechanically
b) with water and water solutions of cellulose ether
c) with organic solvents and their solutions
d) with steam or aerosol of various temperatures
e) by hot air

Proven procedures for tape removal

Different types of tape require different approaches. For paper tapes with water-activated adhesive, Tylose MH6000 was used most often, followed by aqueous aerosol at various temperatures and distilled water. Removing pressure-activated self-adhesive tapes is more complicated because of the variety of adhesives and supports. In most cases mechanical removal must be combined with the use of solvents.
Aqueous aerosol at different temperatures was tested on paper tapes (Fig. 31). The advantage of this procedure is that the aerosol flow can be directed as needed and its intensity increased or decreased. It is important that the moisture reaches the adhesive layer, so the surface of the tape should be moistened with aerosol and left to soak sufficiently. The tape can then be removed simply by pulling. The removal time is comparable to that of the other procedures using water and aqueous solutions. Aerosol temperatures of 40 and 50 °C were tested, with the same results in both cases.

If the tape is removed with distilled water, applied with a brush or a cotton swab, the principle is the same as in the previous procedure: the tape should be moistened and left to soak sufficiently.

The most convenient technique for removing paper tapes with animal glue or gelatine is the use of Tylose MH6000, a water-soluble cellulose ether used for gluing or sizing paper. A 5 % aqueous solution of Tylose proved effective when applied in a thick layer on the tape and left for ca. 10 min covered with plastic foil (Fig. 32). The tape was thereby released and could be removed by pulling.
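
The quantities needed for such a solution follow from simple proportion. The Python sketch below is only an illustration: it assumes that the percentage is meant by weight (w/w), which the text does not specify, and the function name `solution_recipe` and the 200 g batch size are hypothetical.

```python
def solution_recipe(total_mass_g: float, percent_w_w: float) -> tuple[float, float]:
    """Return (solute_g, water_g) for an aqueous solution of the given strength.

    Assumption: the percentage is by weight (w/w); the article does not
    say whether its percentages are w/w or w/v.
    """
    if not 0 < percent_w_w < 100:
        raise ValueError("percentage must be between 0 and 100")
    solute_g = total_mass_g * percent_w_w / 100
    water_g = total_mass_g - solute_g
    return solute_g, water_g


# Hypothetical batch: 200 g of a 5 % Tylose MH6000 solution
tylose_g, water_g = solution_recipe(200, 5)
print(f"{tylose_g:.1f} g Tylose MH6000 in {water_g:.1f} g distilled water")
```

The same function would cover the 0.5 % Tylose MH300 solution mentioned for resizing, again under the w/w assumption.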

With aqueous aerosol and distilled water, fibres of the paper tape remain stuck to the sheet together with the adhesive. With Tylose MH6000, the tape came away with more of the adhesive, and after cleaning the spot was visibly lighter (the cleanest result). In all cases, after the tape has been removed, the spot must be cleaned further with cotton swabs or wood-pulp wadding and distilled water.

Fig. 31 Removing glue tape with aqueous aerosol (Fund NL CR)

Fig. 32 Removing glue tape with 5% Tylose MH6000

For removing tapes with a pressure-activated adhesive layer, a hot-air gun proved best, with the best results at 60–80 °C (Fig. 33). The support usually comes away together with a large quantity of adhesive, and the spot is then cleaned with cotton swabs moistened with a solvent. The choice of solvent depends primarily on the solubility of the print. Before any application, solubility must be tested on a small, inconspicuous spot. The best results in removing adhesive layers were achieved with medical-grade benzine, in which the print dissolves less often than in, for example, acetone or ethyl alcohol. Analysis of the adhesive layers of tapes taken from books discarded from the NL CR collections showed that acrylate-based adhesives are the most common. These tapes were most often removed with hot air alone, and the spot was finished with medical-grade benzine. Other readily available solvents were also tested: ethyl alcohol, acetone, xylene, isopropanol, and toluene (Fig. 34). If the tape cannot be removed with hot air and solvents are needed, it is advisable to work in small sections with moistened cotton swabs. If the solvent is applied with a brush, larger drops may run, which can cause tide lines ("maps") or dissolve the adhesive so that it soaks into the paper.

Fig. 33 Removing plastic adhesive tape by hot air (Fund NL CR)

Fig. 34 Removing cloth tape by toluene (Fund NL CR)
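
The findings above can be condensed into a small lookup table. The following Python sketch only restates the procedures reported in this section; the keys, field names, and the function `removal_plan` are hypothetical labels introduced for illustration, and the table is no substitute for testing solubility on the actual object.

```python
# Summary of the tape-removal procedures reported in this section.
# Keys and field names are illustrative labels, not an established scheme.
REMOVAL_PROCEDURES = {
    "water_activated": {
        "primary": "5 % aqueous Tylose MH6000, thick layer, ca. 10 min under plastic foil",
        "fallback": "aqueous aerosol (40 or 50 °C) or distilled water, left to soak",
        "finish": "cotton swabs and distilled water",
    },
    "pressure_activated": {
        "primary": "hot-air gun at 60-80 °C",
        "fallback": "organic solvent applied in small sections with cotton swabs",
        "finish": "medical-grade benzine, after testing the solubility of the print",
    },
}


def removal_plan(tape_type: str) -> str:
    """Return a one-line summary of the procedure reported for a tape type."""
    try:
        entry = REMOVAL_PROCEDURES[tape_type]
    except KeyError:
        raise ValueError(f"no procedure recorded for tape type {tape_type!r}") from None
    return (
        f"Start with: {entry['primary']}. "
        f"If that fails: {entry['fallback']}. "
        f"Finish with: {entry['finish']}."
    )


print(removal_plan("pressure_activated"))
```

Keeping the data in a plain dictionary means that further tape types, such as heat-activated tapes, could be added without changing the function.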

Instruments used

Preservation pencil
An aqueous aerosol generated by an ultrasonic humidifier flows through a heating head, where it is heated to the required temperature, which is set with the connected thermoregulator. Interchangeable adapters for the head nozzle are included.

Hot-air gun Hot-Jet S
It is specially designed for restoration work such as removing adhesives and adhesive tapes, drying, etc. The air temperature is adjustable from 20 to 600 °C and the air flow from 20 to 80 l/min. Various adapters for the gun can be purchased.

Repair tapes

Tapes with a guaranteed service life, now used for quick repairs of paper, came onto the market only a few decades ago. Among the most widely used is Filmoplast P, which, together with Filmoplast R, is a product of the German company Neschen AG.

Filmoplast P is a highly transparent self-adhesive film with an acid-free adhesive layer on one side. The support is a thin transparent paper (20 g/m²) containing no wood pulp. The self-adhesive layer is a permanently elastic acrylate adhesive (acrylate copolymer) whose pH is adjusted with calcium carbonate to approximately 8.5.²⁵

Removing Filmoplast P is very difficult, and success depends on the time elapsed since the tape was applied, the type of paper, and the sensitivity of the colour layers. The tape cannot be removed completely in one step. First the paper support is removed, and then the adhesive, with tools and an organic solvent. Acetone, ethyl alcohol, and toluene cause tide lines ("maps") on the paper and dissolve printing inks (Fig. 35). Isopropyl alcohol and xylene are less hazardous. With medical-grade benzine, the adhesive can be removed only from papers with a surface finish, such as lamination. In this case the adhesive swells rather than dissolves and forms lumps, which can be removed only from smooth surfaces.

In view of these findings, Filmoplast P cannot be regarded as reversible, and it cannot be recommended for repairing books and other paper documents in collections intended for long-term or permanent preservation. Its use should also be weighed carefully for copies that may be assigned to such collections in the future.

Fig. 35 Removing adhesive layer by toluene (Fund NL CR)

Filmoplast R is a transparent laminating film with a heat-activated layer on one side. It is colourless, resistant to ageing, does not yellow, and does not damage colours. The film is intended for heat lamination of newspapers and other paper materials made after 1840. It is not recommended for reinforcing older documents or historically valuable archival materials.
The support is a very thin transparent technical Japanese paper (8.5 g/m²) that contains no acidic substances, lignin, or hemicelluloses and has a high alpha-cellulose content. The thermoplastic adhesive layer consists of an acrylate copolymer. The layer contains no plasticizers, and its pH is adjusted with magnesium carbonate.²⁶

Applying Filmoplast R requires a restorer's heated spatula (e.g. Hot Iron or RTC-2 with thermoregulator) or a laminating machine for full-area lamination. The adhesive layer is activated by heat and bonds to the paper. For perfect adhesion, the thermoregulator must be set to ca. 120 °C; otherwise the tape will not adhere fully, and besides increased opacity there is a risk that it will detach from the substrate. In general, the higher the application temperature, the better the tape adheres and the less visible it is. At lower temperatures, greater pressure must be applied.

Removing Filmoplast R is considerably easier than removing Filmoplast P. The tape can be removed with a hot-air gun at the same temperature at which it was applied, i.e. ca. 120 °C (Fig. 36). Adhesive residue can be removed from the paper mechanically. Another option is to use organic solvents. In tests, benzine proved best, as it does not dissolve the print or create tide lines ("maps"). Acetone was also tested; it removes the tape easily, but on weakly sized or wood-pulp paper there is a risk of dissolving the text and creating tide lines.

Fig. 36 Removing Filmoplast R by hot air (Fund NL CR)

Conclusion

This paper presents partial results of the NAKI grant project "Survey, conservation, and care for modern library collections – materials and technologies", obtained during 2013–2014. Testing of conservation and restoration methods for modern book collections, or 19th-century prints, will continue over the following three years. The results of research and testing of interventions should make the care of these rare books more efficient, extend their lifetime, and preserve them in good physical condition for readers. The main objective of all activities in this project is the functionality of the bookbinding and its preservation.

References:

BENEŠOVÁ, Marie. Testování účinnosti běžných způsobů mechanického čištění papíru. In: Fórum pro konzervátory a restaurátory. Brno: Metodické centrum konzervace, Technické muzeum v Brně (Effectivity testing of common methods of mechanical cleaning of paper. In: Forum for conservators and restorers. Brno: Methodical centre of conservation, Technical museum in Brno), 2014. pp. 50–55. ISSN 1805-00050, ISBN 978-80-87896-08-2.

ČSN 80 4565 (Czech State Standard). Bavlněné tkaniny (Cottons). Vydavatelství úřadu pro normalisaci, Praha (Publisher of the Institution for standardization, Prague), 1955.

DONÁT, Adolf. Materiály na výrobky z papírů a lepenek (Materials for products made of paper and millboard). First edition. Brno: State publishers of technical literature, 1963.

ĎUROVIČ, Michal. Restaurování a konzervování archiválií a knih (Restoration and conservation of archival documents and books). 1st edition. Prague: Paseka, 2002, 517 p. ISBN 80-718-5383-6.

Filmoplast® P. [online] [ref. 2014-05-25]. Accessible from:
http://www.neschen.de/assets/ProductDB-Import/Download-207/dats_sheet_-_filmoplast_p.pdf.

Filmoplast® R. [online] [ref. 2014-05-25]. Accessible from:
http://www.neschen.de/assets/ProductDB-Import/Download-261/data_sheet_-_filmoplast_r.pdf.

KRÁL, Jindřich. Moderní knihařství: Souborné zpracování poznatků oboru (Modern bookbinding: Cumulative processing of branch pieces of knowledge). First edition. Brno: SURSUM, 1999. ISBN 80-85799-49-9.

LEDERLEITNER, Milan, Josef HAMAR and Vladimír THOMKA. Polygrafické materiály pro I.–III. ročník SOU (Polygraphic materials for 1 – 3 grades of Secondary Technical Training Centre). 1st edition. Prague: State Pedagogical Publishers Prague, 1990. ISBN 80-04-24132-8.

SCHALKX, Hilde, Piet LEDEMA, Birgit REISSLAND and Bas van VELZEN. Aqueous treatment of water-sensitive paper objects. In Journal of Paper Conservation. Vol. 12, No. 1. Stuttgart, Germany: IADA, 2011. pp. 11–20. ISSN 1868-0860.

SOUČEK, Milan. Exkurze do papírny (Excursion into papermill). Prague: State Publishers of Technical Literature, 1963, 179 p.

TEPNA, Cotton plant, National enterprise. Vzorkovnice kniharských pláten (Sample book of bookbinders' canvases). Česká Skalice, 1969.

VÁVROVÁ, Petra, Jiří POLIŠENSKÝ, Pavel KOCOUREK and Hana SEDLISKÁ. Metodika průzkumu fyzického stavu novodobých knihovních fondů (Methodology of survey of physical conditions of modern book collections). Certified methodology. Available from:
http://text.nkp.cz/o-knihovne/projekty-a-programy/vyzkum-a-vyvoj-naki/virtualni-depozitni-knihovna/jednotlive-cinnosti-v-projektu-vdk/certifikovane-metodiky/metodika-pruzkumu-fyzickeho-stavu-novodobych-knihovnich-fondu/metodika-pruzkumu-fyzickeho-stavu-novodobych-knihovnich-fondu/view.

VÁVROVÁ, Petra, Jiří POLIŠENSKÝ, Pavel KOCOUREK, Hana SEDLISKÁ, Magda SOUČKOVÁ, Lucie PALÁNKOVÁ and Věra POSPÍŠILÍKOVÁ. Nový nástroj pro monitorování fyzického stavu knihovních fondů (A new tool for monitoring physical conditions of book collections). Knihovna (Library) [online]. 2012, vol. 23, No. 2, pp. 66–76 [ref. 2015-03-16]. Available from: http://knihovna.nkp.cz/knihovna122/neuvirt.htm. ISSN 1801-3252.

Wishab Dry Cleaning Sponges. [online]. [ref. 2014-10-30]. Available from:
http://www.conservationresources.com/Main/uk_section_012/012_018.htm.

¹ Workplace of the authors: The Collection Preservation Division, the National Library CR

²http://katif.nkp.cz/Katalogy.aspx?katkey=050BIBL19

³ http://katif.nkp.cz/Katalogy.aspx?katkey=050BIBL19

⁴ PhDr. Jiří Polišenský, Director of Division of Book Collection Management of NL CR in 1994–2012.

⁵ Register of digitization: the current infrastructure of the Register of digitization is used, specifically an Oracle RDBMS as the storage site and the MS FastSearch search system as the presentation interface.

⁶ VÁVROVÁ, Petra, Jiří POLIŠENSKÝ, Pavel KOCOUREK, Hana SEDLISKÁ, Magda SOUČKOVÁ, Lucie PALÁNKOVÁ and Věra POSPÍŠILÍKOVÁ. Nový nástroj pro monitorování fyzického stavu knihovních fondů (A new tool for monitoring physical conditions of book collections). Knihovna (Library) [online]. 2012, vol. 23, No. 2, pp. 66–76 [ref. 2015-03-16]. Available from: http://knihovna.nkp.cz/knihovna122/neuvirt.htm. ISSN 1801-3252.

⁷ VÁVROVÁ, Petra, Jiří POLIŠENSKÝ, Pavel KOCOUREK and Hana SEDLISKÁ. Metodika průzkumu fyzického stavu novodobých knihovních fondů (Methodology of survey of physical conditions of modern book collections). Certified methodology. Available from: http://text.nkp.cz/o-knihovne/projekty-a-programy/vyzkum-a-vyvoj-naki/virtualni-depozitni-knihovna/jednotlive-cinnosti-v-projektu-vdk/certifikovane-metodiky/metodika-pruzkumu-fyzickeho-stavu-novodobych-knihovnich-fondu/metodika-pruzkumu-fyzickeho-stavu-novodobych-knihovnich-fondu/view.

⁸ A coat is an elastic, firm film of one or more layers applied to the fabric as a coating. It considerably increases the resistance of the textile to damp. Such canvas is called coated (dressed) canvas, or canvas with a coat.

⁹DONÁT, Adolf. Materiály na výrobky z papírů a lepenek (Materials for products made of paper and millboard). First edition. Brno: State publishers of technical literature, 1963.

¹⁰ ČSN 80 4565 (Czech State Standard). Bavlněné tkaniny (Cottons). Prague: Publisher of the Institution for standardization, 1955.

¹¹ KRÁL, Jindřich. Moderní knihařství: Souborné zpracování poznatků oboru (Modern bookbinding: Cumulative processing of branch pieces of knowledge). First edition. Brno: SURSUM, 1999. ISBN 80-85799-49-9.

¹² TEPNA, Cotton plant, National enterprise. Vzorkovnice kniharských pláten (Sample book of bookbinders' canvases). Česká Skalice, 1969.

¹³ Warp texture is the number of threads per centimetre or inch of a given fabric; it therefore expresses the weaving "density".

¹⁴ LEDERLEITNER, Milan, Josef HAMAR and Vladimír THOMKA. Polygrafické materiály pro I.–III. ročník SOU (Polygraphic materials for 1 – 3 grades of Secondary Technical Training Centre). 1st edition. Prague: State Pedagogical Publishers Prague, 1990. ISBN 80-04-24132-8.

¹⁵ ĎUROVIČ, Michal. Restaurování a konzervování archiválií a knih (Restoration and conservation of archival documents and books). 1st edition. Prague: Paseka, 2002, 517 p. ISBN 80-718-5383-6; SOUČEK, Milan. Exkurze do papírny (Excursion into papermill). Prague: State Publishers of Technical Literature, 1963, 179 p.

¹⁶ ĎUROVIČ, Michal. Restaurování a konzervování archiválií a knih (Restoration and conservation of archival documents and books). 1st edition. Prague: Paseka, 2002, 517 p. ISBN 80-718-5383-6.

¹⁷ Ibid.

¹⁸ Ibid.

¹⁹ BENEŠOVÁ, Marie. Testování účinnosti běžných způsobů mechanického čištění papíru. In: Fórum pro konzervátory a restaurátory. Brno: Metodické centrum konzervace, Technické muzeum v Brně (Effectivity testing of common methods of mechanical cleaning of paper. In: Forum for conservators and restorers. Brno: Methodical centre of conservation, Technical museum in Brno), 2014. pp. 50–55. ISSN 1805-00050, ISBN 978-80-87896-08-2.

²⁰ Wishab Dry Cleaning Sponges. [online]. [ref. 2014-10-30]. Available from: http://www.conservationresources.com/Main/uk_section_012/012_018.htm.

²¹ SCHALKX, Hilde, Piet LEDEMA, Birgit REISSLAND and Bas van VELZEN. Aqueous treatment of water-sensitive paper objects. In Journal of Paper Conservation. Vol. 12, No. 1. Stuttgart, Germany: IADA, 2011. pp. 11–20. ISSN 1868-0860.

²² The shrinkage temperature is the temperature at which, on heating in water, bonds inside the collagen molecules loosen and the collagen fibres shrink. Its value gives a basic indication of the degree of damage to the leather. If the shrinkage temperature is low (below 40 °C), the leather may be damaged by contact with water. The National Library CR is the only cultural institution in the Czech Republic where this quantity is measured.

²³ ĎUROVIČ, Michal. Restaurování a konzervování archiválií a knih (Restoration and conservation of archival documents and books). 1st edition. Prague: Paseka, 2002, 517 p. ISBN 80-718-5383-6.

²⁴ Crepeline is a type of textile for restoration.

²⁵ Filmoplast® P. [online] [ref. 2014-05-25]. Accessible from: http://www.neschen.de/assets/ProductDB-Import/Download-207/dats_sheet_-_filmoplast_p.pdf.

²⁶ Filmoplast® R. [online] [ref. 2014-05-25]. Accessible from: http://www.neschen.de/assets/ProductDB-Import/Download-261/data_sheet_-_filmoplast_r.pdf.

Acknowledgement

The paper was created thanks to the financial support of the Ministry of Culture CR within the NAKI grant programme of applied research and development of the National and Cultural Identity. The project's title is "Survey, conservation and care for modern library collections – materials and technologies" (DF13P01OVV004, 2013–2017).

The i-School Phenomenon: History and Present Situation

2015-04-10T13:15:00Z

Summary: This paper introduces the movement of i-Schools and the grounds of its origin, focusing on the fundamental elements of i-Schools and the i-Model. The role of i-Schools will be illustrated by the diversity of the fields of studies they encompass, and by the new opportunities that have thus opened up. Upon introducing several projects that have innovated the Library and Information Studies curricula, the paper explains the role of the steering and coordinating i-Caucus committee, and the importance of the i-Conferences. Furthermore, the paper presents an analysis of issues relevant to the interdisciplinary character and the identity of i-Schools. The study concludes with an overview of the research activities of i-Schools, and the prospects of this movement in the Czech Republic. Keywords: i-School movement, i-Field, interdisciplinarity, i-Identity, i-Schools research

Information Literacy of Non-medical Students of First Faculty of Medicine of Charles University in Prague

2015-04-10T13:10:00Z

Summary: This survey of information literacy of undergraduate non-medical students at the First Faculty of Medicine of Charles University in Prague was conducted in October and November, 2013. The study examines the ways the students look for information, the types of resources they prefer, how they evaluate their actual abilities and knowledge when looking for scientific information, how they perceive information literacy classes at the First Faculty of Medicine, and the kinds of superstructure services, offered by the Faculty’s Institute of Scientific Information as well as by the General University Hospital in Prague, they choose to use. Our research compares the individual responses of newly admitted students, who have not taken their first year course, Introduction to Scientific Research, with the second and third year students who have completed the course. Keywords: non-medical fields, information literacy, scientific information, library services, library users, information retrieval, information specialist, social sites, quantitative research

Knowledge Organization Systems and Their Typology

2015-01-29T19:55:00Z

Summary: The term, “Knowledge Organization System” (abbr. KOS) has yet to be included in the Czech specialized terminology (which uses “information retrieval language” instead). However, the term has been used for more than 15 years in international literature and in practice, encapsulating vocabularies, authority lists, subject headings, classifications, thesauri, ontologies and other knowledge organization tools of digital network communication. Today, the tools are represented by Linked Open Data technology. The study presents the provisional results of our research concerning the present state of knowledge organization systems, conducted within the DF13P01OVV013 “Knowledge Base for the Subject Area of Knowledge Organization” project, as a part of the NAKI Program. The methodology of the research is based on empirical analysis of knowledge organization systems, which are registered in the prototype of the designed knowledge base. It also draws on the analysis of proposed or implemented typologies presented in literature or in the operating KOS registries. The study further presents a typology of knowledge organization systems that would help identify the systems in the knowledge base. In addition, we have produced a working definition of the term “knowledge organization system”, presenting it to the Czech professional community for further consideration. Keywords: knowledge organization, knowledge organization system, KOS typology, registries

Library revue

Research use of web archived data

Introduction

Internet sources acquisition

The WARC container format

Header example

An example of a “request” header and its payload:

Datasets for researchers

A basic dataset for a research concerning any web archive data

An example of a part of a WAT dataset:

A dataset for linking activity analysis of archived data from its origin through present

An example of an LGA dataset record:

A dataset with named entities

An example of a geographical record in WANE:

An expected record in WANE from the CNEC 2.0 classifier

Researchers and their needs

Studies of researchers’ needs

A summary of the conclusions of both studies:

Conclusion

Bibliography:

Watermark research in the music sources registered in the Union Music Catalogue of the National Library of the Czech Republic

Introduction

František Zuman and his contribution to the history of Czech paper mills

The Dictionary of Czech Paper Mills

Music collections

The database of watermarks in the music sources recorded in the Union Music Catalogue of the National Library of the Czech Republic[9]

The identification of unknown watermarks

The catalogues of the watermarks in the music collections – the Želiv Monastery

The newly completed watermark research in the Clam-Gallas music collection from Frýdlant

Watermarks in printed sheet music

The conclusion

The list of used literature

Rights to the data administered by public libraries in the light of amendments of the Freedom of Information Act

Exchangeable formats of bibliographical data: their present transformation

1 Introduction

1.1 Starting points

1.2 Semantic web

1.3 Linked data

1.4 Advantages of using the linked data method

1.5 The way to semantic web in the field of library science

1.6 Report of a record

1.7 Single projects

1.8 Activities of libraries

2 The future

2.1 Questionnaire

2.2 Survey sample

2.3 Results

2.4 Conclusion of the enquiry

3 Summary

Footnotes

Map of study profiles: comparing the curricula of Library and Information Science in Opava

Printing works Kryl & Scotti / Karel Kryl in the mirror of bibliophile media and competitions

Relationships of information resources: an attempt at interdisciplinary synthesis

Archivematica – Open Source System for Digital Archiving

1. Introduction

1.1 About Archivematica

1.2 Threats to digital information, digital preservation

1.3 OAIS functional and informational model

2. OAIS model implementation in Archivematica

2.1 Transfer

2.2 Ingest

2.3 Archival Storage

2.4 Preservation Planning

2.5 Access

2.6 Administration – Dashboard

2.7 Management of archival packages (AIPs)

3. Other features of Archivematica

3.1 Archivematica as a software and micro-services

3.2 Standards

3.3 System scalability

3.4 Sustainability and further development

4. Conclusion – what Archivematica is and what it is not

5. Conclusion

Literature

Towards issues of descriptive metadata for knowledge organization systems

Introduction

1 Existing practice of knowledge organization systems description

1.1 Description of knowledge organization systems in bibliographic or catalogue databases

1.2 Descriptive metadata of KOS within Schema.org

1.3 Descriptive metadata of KOS in the BARTOC registry
