Research use of web archived data
Keywords: web archiving, researchers, big data, metadata, data analysis, WARC
Mgr. Jaroslav Kvasnica, Mgr. Barbora Rudišinová, Bc. Rudolf Kreibich / National Library of the Czech Republic, Klementinum 190, 110 00 Praha, Czech Republic
Introduction
Since the creation of the World Wide Web in the 1990s, the internet has slowly been turning into an ever more important source for the study of both recent history and current political, social, and cultural phenomena. The internet is quite different from other types of media and publication platforms: it changes constantly, its content is interconnected by links, and it accommodates a plethora of formats (images, videos, applications etc.); last, but not least, its content is created by a diverse set of both individuals and companies. Today, the internet is important not only for mass and personal communication, but also as a platform for the development of new services. The days when it was accessible only via desktop computers are past; people can be “online” at any hour of the day via mobile devices and freely accessible data services. Whether we talk about political or social campaigns, discussions on online forums, opinions published on personal blogs, or partially public information shared via social networks, our public lives are lived more and more on the internet.
Today, researchers are used to working with materials published online, such as electronic articles, databases, or various types of applications. In 2014, researchers from Harvard Law School carried out a study of hyperlinks in professional law journals. They discovered that over 70 % of these citations no longer linked to the content they described (ZITTRAIN 2014).
Since communication and other social activities have moved, to a substantial degree, onto the internet, researchers and scientists are beginning to regard it with new-found interest, even though it used to be a relatively neglected source compared with traditional scientific information. The internet offers an enormous quantity of individual pieces of information and sources, but it can also be seen as a collection of large data sets that can be analyzed and used for various research purposes. However, its ever-changing nature poses a problem for researchers who want to work with older sources, as individual webpages can be moved to another location, updated in various ways, or disappear altogether. These researchers then have to turn to web archives, which try to capture the unique and dynamic content of the internet and keep it for posterity.
When it comes to archived internet sources, certain unique features distinguish them from online sources. Internet archiving always takes place in real time. If we want to archive the content that is on a given webpage today, we should do it as soon as possible, because it can change or disappear at any moment. At that point, such a webpage becomes irretrievable. Archived internet content is not a mere copy of something that was online sometime in the past; it is a unique version which, however, contains “blanks” in its content.
Acquisition of internet sources
In the course of internet archiving, one has to make a number of decisions that determine the form of the content that will be stored in the archive.
From the technological standpoint, we have to choose archiving software. Currently, every available archiving software has certain technological limitations that determine which types or parts of content will be missing; in other words, no software is able to copy every type of content. The choice of hardware, particularly with regards to storage capacity and computing power, also plays a significant role; it can have a major impact on the final version of the data stored in the archive. Obviously, the storage capacity of the archive determines how many copies can be made and stored, while computing power determines both the speed and the efficiency of the archiving process.
These issues are intertwined with the choices regarding the scope of archiving of individual internet sources. The archiving process can be limited either horizontally or vertically. The horizontal limitation concerns the number of archived linked pages that form a context for the original source – for instance, if we archive an article, we also have to archive the sources it links to, so that it will be represented in its whole context. The vertical limitation concerns the number of digital objects that can be archived per domain – this is also called “harvest depth”. The platform and the device that we use to view the content are other technical factors that can have an impact on the form of the archived internet content. The final version of the content will vary according to which device, browser, or application we use.
Web archives are trying to keep up with technological innovations that concern the internet – dynamic webpage features, JavaScript, links to social networks, integrated videos, and various applications, to name just a few. However, the technological development is fast and relentless, and so web archives usually need some time to adapt and catch up, which once again translates into content imperfections whenever we try to turn the original online version into an archived website (BRÜGGER and FINNEMANN 2013).
From the content standpoint, we have to discuss the definition of territoriality. With regards to European internet archives, which are usually parts of national libraries, we typically see so-called “broad archiving”, which distinguishes between webpages on the basis of their national domains (“.cz”, for instance) and which defines territoriality through technical parameters. However, this method misses all national sources on international domains or domains of other nations (such as “.eu” or “.org”), and so it will not yield the complete national internet content (in the Czech Republic, for instance, this would concern content related to Czech Studies).
In general, the type of the harvest determines the types of sources that will be archived. We have already mentioned a “broad harvest”1 that represents the most general and broadest type of archiving. It does not include manual selection. A “selective harvest” is carried out by curators who select and evaluate sources based on agreed-upon criteria. In most cases, selective harvests are smaller, but the archived internet sources are harvested in more depth.
The archived web content is not an identical copy of the original online web source, but rather newly created content that mirrors the original online source.
A web archive is multidimensional with regards to both time and space. Unlike an online webpage, which at any given time features just one version of the content, a web archive contains many versions of a given internet source harvested at different points in time. When it comes to harvesting, certain large internet sources can take longer, and so the content may change during the archiving process (this is typical of internet news sites with a lot of content, because they frequently add new articles or update old ones) – this can result in an archived version that is basically a mosaic of various parts of the content (and varied in terms of scope, too). All of this complicates things for researchers, of course; therefore, creating metadata descriptions of the data itself is a key part of the process. Metadata descriptions allow researchers to find and define the set of data that is most relevant to their research.
The WARC container format
In order to store archived copies of webpages, web archives use specialized container formats that allow them to combine multiple fragments of harvested webpage content into an aggregate archive file. This solution was chosen mainly because most webpages include a veritable legion of small files that make any digital manipulation much more difficult. Typically, a webpage includes hundreds or thousands of these small files – scripts, images, videos etc.
Container formats combine several digital sources into a single aggregate archive file, which also contains relevant information (BAILEY and LACALLE, 2015). These formats are designed in such a way that all archived files can be stored and moved as one aggregate file; however, they also keep all metadata, and so all archived files are still intertwined in the same way in which they were linked before. This method then allows us to use the fragments to reconstruct the webpage and present it to the user in its original form. For researchers who want to use the data from the archive for their studies, the metadata represent a chance to process the archive without having to work with the enormous amount of data stored in it.
The container format storage principle for both the data and the metadata is simple – each data object is preceded by a header with metadata. In this context, data objects might consist of individual files harvested from webpages or of metadata records. There are eight types of headers (record types); the list of all the types, as well as their uses, is standardized in the international standard for information and documentation that defines the WARC format (ISO 28500:2009 2009).
- “warcinfo” – The “warcinfo” header is used to describe the whole content that follows it and that is delimited by another occurrence of the “warcinfo” header. Usually, it is used to describe the whole container.
- “request” – This header introduces a data object that holds a full HTTP request for a file sent to the server. Its payload contains the original request for internet content sent by the crawler2 to the target server.
- “resource” – This header is followed by content harvested from the server, for instance an HTML file or an image.
- “response” – This header is used to describe a server response to a harvest request. This response is included and stored here. If the request was successful (the server responded, the URL contains data), this header is followed by the web content itself.
- “metadata” – This header is used for a metadata description that cannot be associated with any other header.
- “revisit” – This header is used for content that was archived earlier; today, a record marked as “revisit” is mainly used to refer the user to duplicates.
- “conversion” – This header is used for alternate versions of the content that were created during a format change, for instance.
- “continuation” – This header is used for segmented data objects that would otherwise exceed the size limitations of the container.
Header example
The following example includes a header of the “request” type; it was part of a test harvest carried out for the purposes of this article. Each header type has a certain number of mandatory elements. In this example, those mandatory elements contain information about the type of the content (WARC-Type), the date and time of its creation (WARC-Date), and its size (Content-Length), as well as a unique identifier (WARC-Record-ID) that can be used to refer to the content.
Apart from these mandatory elements, the header also includes the URL that was the target of the request (WARC-Target-URI) and a control hash of the record block (WARC-Block-Digest), which can be used to verify the integrity of the content or to detect any damage to it.
The payload itself begins on the line "GET / HTTP/1.1". This line tells us that the crawler sent a GET request via the HTTP 1.1 protocol, which is the standard request for obtaining information in the internet environment.
An example of a “request” header and its payload:
WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:27087c01-98ae-4aca-b52c-55f33dbcb8b0>
WARC-Date: 2016-08-03T12:18:35Z
WARC-Target-URI: http://webarchiv.cz/
WARC-Concurrent-To: <urn:uuid:26300322-d4d0-4a81-8851-086f811b0a24>
WARC-Block-Digest: sha1:a514938cef4b9c3b3a88403c4ccdedd3863a74db
Content-Type: application/http;msgtype=request
Content-Length: 451
GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
(…)
Host: webarchiv.cz
Accept-language: cs-CZ,cs;q=0.8,sk;q=0.6
All the content stored in WARC containers is structured in the way shown in the example. Whatever the type of the internet content, it is always preceded by a record of the communication with the server, including any potential faulty requests and negative server responses. These metadata were originally created specifically for the needs of web archives: they allow us to go back in time and discover potential errors, the time of their occurrence, and their causes, which can then tell us why a webpage was not stored correctly. Because the stored information includes the date of creation and size of each record, and even the communication between the client and the server, these metadata can also be used by researchers who want to work with web archives.
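The WARC format is also well supported by open-source tooling. The following is a minimal sketch of iterating over the records of a container in Python, assuming the freely available warcio library; the file name is chosen purely for illustration:

from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Every record carries the header fields described above
        rec_type = record.rec_type                               # e.g. 'request', 'response'
        uri = record.rec_headers.get_header('WARC-Target-URI')
        date = record.rec_headers.get_header('WARC-Date')
        if rec_type == 'response':
            payload = record.content_stream().read()             # the archived content itself
            print(date, uri, len(payload), 'bytes')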
Datasets for researchers
The data in web archives have one significant disadvantage – they are too “big”. Researchers typically do not have access to devices with enough computing power to process such volumes of data; quite often, they cannot even extract the data from the archive. To solve this issue, web archives generate special derived datasets that are designed for specific groups of researchers and research purposes – a linguistic analysis of the text does not include audiovisual materials, a hypertext analysis requires only metadata etc.
Datasets can be generated both from the data itself and from the metadata. These sets contain only the part of the archive that is necessary to satisfy the specific information needs of a given user. Users can download these sets and work with them on ordinary devices (using tools that are available for free).
The Internet Archive defines three basic datasets that provide a basic insight into web archives, and which also show how to define and approach them. As these datasets’ specifications are freely available and as they can be generated, modified, and used via freely-available tools, we can consider them to be provisional standards for any web archive research.
A basic dataset for research on any web archive data
“WAT” is an acronym that stands for Web Archive Transformation. The WAT dataset includes basic information about stored digital objects, such as the date of their harvest, their size, or the format that the HTTP server communicated to the harvester. Apart from this, the WAT dataset also contains information gathered by harvesting the digital objects themselves – the name of the author, the name of the document provided in the metadata part of the source, or all links (URLs) found in the digital object.
There is a freely available tool called archive-metadata-extractor, which is used to generate WAT datasets. This tool saves metadata about the archived records and the digital objects themselves, and it also extracts available information, for instance from the HTML metadata header or the body of the document. With a certain amount of simplification, we could say that a WAT set is a WARC archive container without the archived digital content. The result is a text file that contains metadata organized into structures, which makes it quite small – a WAT file is between 5 % and 20 % of the size of the original WARC file (BAILEY 2016a).
Due to their simplicity, relatively small size, and clearly-defined structure, the WAT datasets are very useful for researchers, as they can work with them with ease and via readily-available tools.
An example of a part of a WAT dataset:
{
  "Envelope": {
    "Format": "WARC",
    "Payload-Metadata": {},
    "WARC-Header-Length": "298",
    "WARC-Header-Metadata": {}
  }
}
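WAT files produced by the Internet Archive's tooling are typically packaged as WARC files whose metadata records carry these JSON structures as their payload, so they can be processed with the same libraries. A minimal sketch in Python, assuming the warcio library; the file name and the exact key paths below are illustrative and should be checked against the WAT specification:

import json
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'metadata':
            continue
        # Each payload is one JSON "Envelope" describing an archived record
        envelope = json.loads(record.content_stream().read()).get('Envelope', {})
        header_meta = envelope.get('WARC-Header-Metadata', {})
        print(header_meta.get('WARC-Date'), header_meta.get('WARC-Target-URI'))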
A dataset for analyzing the linking activity of archived data from its origin to the present
The second dataset specified by the Internet Archive is called Longitudinal Graph Analysis (LGA). It includes all the data required to create a longitudinal graph of all links. This set allows for a longitudinal study of linking activity between domains, both in terms of the archive as a whole and with regards to specific domains. The LGA dataset consists of two files.
Each URL is first written into an ID-Map file, where it is assigned a unique identifier and a so-called SURT3 form of the URL (BRAGG and ODELL 2010). Then, a second file is created – an ID-Graph file that contains a timestamp, the unique ID of the source URL, and the IDs of all URLs that were linked from that URL.
An example of an LGA dataset record:
ID-Map:
{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=–fdzshdfjw","id":294869}
ID-Graph:
{"timestamp":"20150206180648","id":294870,"outlink_ids":[62596,110007,129599,145417,148627,215031,277534,277535,277668,277678,277679,737423,737436,737459,737476,737483,737503,737514,803348,803349,803373,803374,803490,803565,803586,803590]}
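Because both files are plain JSON records, a longitudinal link graph can be assembled with very modest tooling. The following Python sketch, with hypothetical file names, maps IDs back to URLs and counts how many links point to each target across the whole period covered by the dataset:

import json
from collections import defaultdict

# Map numeric identifiers back to (SURT) URLs
id_to_url = {}
with open('id-map.jsonl', encoding='utf-8') as f:
    for line in f:
        rec = json.loads(line)
        id_to_url[rec['id']] = rec['surt_url']

# Count incoming links for every target across all timestamps
inlink_counts = defaultdict(int)
with open('id-graph.jsonl', encoding='utf-8') as f:
    for line in f:
        rec = json.loads(line)
        for target_id in rec['outlink_ids']:
            inlink_counts[target_id] += 1

# Print the ten most linked-to URLs found in the dataset
top = sorted(inlink_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
for target_id, count in top:
    print(count, id_to_url.get(target_id, target_id))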
A dataset with named entities
WANE is an acronym for Web Archive Named Entities. This set includes a list of individuals, places, and organizations mentioned in the digital object. These entities are extracted via the Stanford Named Entity Recognizer tool (Stanford NER for short). The resulting file contains data structured in a way similar to a WAT dataset; WANE files are usually about 1 % of the size of the analyzed data.
Creating the WAT and LGA datasets is relatively easy – you just have to extract the information from the structured part of the source data or correctly recognize URLs in the document. The WANE dataset, however, is a little more complicated – you need a classifier that can recognize individual entities in non-structured texts, and then put them into correct categories. The Stanford NER tool typically includes several classifiers that can recognize entities in English and German texts. Some classifiers can even search for entities regarding time, currencies etc.
Unfortunately, when we attempt to classify a Czech text in this way, the quality of the results is unacceptably low. It is therefore necessary to use a suitable classifier to categorize documents into language groups, and then use appropriate special classifiers designed for given languages.
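A simple way to implement this routing step is sketched below in Python, assuming the open-source langdetect package for language identification; the two NER functions are hypothetical placeholders standing in for whatever recognizer is wired up locally:

from langdetect import detect  # open-source language identification package

def run_czech_ner(text):
    # Placeholder: in practice this would call a Czech-specific recognizer (see below)
    return []

def run_english_ner(text):
    # Placeholder: in practice this would call the Stanford NER classifier mentioned above
    return []

def extract_entities(text):
    # Route each document to a language-appropriate named entity recognizer
    lang = detect(text)  # returns codes such as 'en', 'cs', 'de'
    if lang == 'cs':
        return run_czech_ner(text)
    return run_english_ner(text)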
For Czech, the optimal tool seems to be the NameTag tool, which uses the Czech Named Entity Corpus 2.0 (CNEC 2.0). It has achieved results equal to those of English-based classifiers (STRAKOVÁ et al. 2013). CNEC 2.0 includes a rich classification of Czech expressions; it can distinguish between various kinds of time identification, between first and last names, and also between ordinary and mythological names. Furthermore, it can recognize entities such as movies, books, standards, products, measurement units, and currencies. It can distinguish between public, government, and private institutions. It is able to identify various radio and television stations, e-mail addresses, and links. Geographical information is automatically divided into categories – continents, streets, rivers, etc. This classifier can also recognize addresses and phone numbers. Compared to a standard WANE record, however, the resulting CNEC 2.0-based dataset grows in both size and complexity.
The current WANE format defines just places, people, and institutions, and so the best course of action seems to be a modification of the WANE record structure itself, which would introduce a new level of structuring and mirror the taxonomy of CNEC 2.0. One question remains, however – should we still call the resulting dataset “WANE”, or should the ontology of the entities be mentioned in the name of the new file (“cnec2-wane”, for instance)? This question should be discussed at the next meeting of the IIPC consortium, and it would be a good idea to establish a unified way of naming sets with named entities.
An example of a geographical record in WANE:
"locations":["Miami","Virginia","Fort Lauderdale","Wash.","Va.","Blacksburg","St. Louis","Clayton","Fla.","Chapel Hill","Michigan State","Austin","North Carolina","Michigan"],
An expected record in WANE from the CNEC 2.0 classifier:
"Geographical_names":{"streets, squares":["Václavské náměstí","Kaprova"],"castles/chateaus":["Hradčany","Karlštejn"]}
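A nested record of this kind can be produced with a small grouping step on top of any classifier output. A minimal Python sketch, assuming the classifier has already returned entity/category pairs (the sample values below simply mirror the example above):

import json
from collections import defaultdict

# Hypothetical classifier output: (entity text, (group, subtype)) pairs
entities = [
    ("Václavské náměstí", ("Geographical_names", "streets, squares")),
    ("Kaprova", ("Geographical_names", "streets, squares")),
    ("Hradčany", ("Geographical_names", "castles/chateaus")),
    ("Karlštejn", ("Geographical_names", "castles/chateaus")),
]

# Group entities into the two-level structure proposed above
nested = defaultdict(lambda: defaultdict(list))
for text, (group, subtype) in entities:
    nested[group][subtype].append(text)

# Produces a record shaped like the example above
print(json.dumps(nested, ensure_ascii=False))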
These three datasets are by no means sufficient – they cannot possibly satisfy all the information needs that potential researchers might have. They do provide a basic insight into archived data, however; they also demonstrate potential uses of web archive data and serve as examples of what such datasets should look like.
Researchers and their needs
As researchers place ever more demanding requirements upon web archives and want full access to the archived data itself, rather than standard user access, web archives have had to try to satisfy these particular needs. These attempts and discussions, however, have been hindered by various issues – problems with the form of the delivered data, with hardware and software requirements, and with computer literacy requirements.
As it turns out, the main problem is the fact that researchers are often unable to define which part of a given web archive they are interested in. They usually want all the data, even in cases where it is clear that they will not need it (BAILEY 2016b). This leads to another problem – most researchers lack the necessary equipment and do not have the capacity to receive data that can easily amount to dozens or hundreds of terabytes. However, not all the blame can be placed on them – web archives usually cannot offer their data in a way that would allow for easy use and manipulation. Simply put, the data is not designed to be worked with in this manner, and on top of that, web archives do not have tools that would allow users to do these kinds of things.
Today, web archives focus on improving the communication with researchers and on mutual collaboration, on the development of new technologies and on building sufficient local computer capacities, and also on workshops focused on the development of skills necessary for any work with datasets (TRUMAN 2016). However, without the knowledge of potential users and their needs, web archives cannot develop suitable tools. Fortunately, some are trying to improve the situation with studies on this very topic. Jefferson Bailey from the Internet Archive, for instance, tried to define a classification of users’ interests. Creators of web archives can use it to have at least some idea about what researchers are interested in, and to begin working on tools and ways to make the sought-after data available.
Classification of the users’ interests (BAILEY 2016b):
- Documentary – researchers interested in the classification or registration of webpages. This category also includes studies focused on the legality of the content on webpages etc.;
- Social and political scientists – sociological and political studies. Work with government or open data;
- Web science – studies on internet technologies and protocols;
- Digital humanities – this category includes modern historians and other researchers from humanities disciplines that work with digital data;
- Computer science – information retrieval, data processing and indexing, infrastructure and tools;
- Data analysts – various activities – data mining, language processing, and trend analysis.
This classification can give us a basic insight into possible avenues of research concerning web archive data. Researchers from different fields have different requirements – they need different datasets, but they may also have in mind specific ways in which they want to access them, or they may want them in specific formats. When it comes to delivering the data to researchers, we have to consider factors such as the kinds of tools usually employed in a given field and the general computer literacy of its experts. For instance, sociologists use different tools than IT experts, and we can assume that their computer literacy will also differ. While IT experts can write their own scripts to help them deal with the data, researchers from humanities disciplines might need specific tools that do not require any such advanced knowledge.
Studies of researchers’ needs
The topics of researchers’ use of web archives and of their related information needs are not wholly unexplored; several such studies have already been carried out. For the purposes of this article, we have selected two recent studies initiated by members of the IIPC consortium (of which the Czech web archive is also a member). Cooperating web archives have used these studies to formulate further steps; we will summarize the relevant conclusions.
Peter Stirling, Philippe Chevallier and Gildas Illien (2012) wrote an article summarizing a French study carried out by the National Library of France. It was a qualitative study carried out in 2010 and 2011 and it focused on three groups of users, including researchers. The researchers’ group consisted of five experts from different fields (history, philosophy, sociology, and information technologies).
Apart from the usual practice of using information from articles published online and the issues concerning citations (articles published on the internet are often moved, URL addresses cease to function etc.), the researchers also stated the need to use internet sources to document or illustrate certain sociological or historical phenomena, which is complicated by the fact that one has to explain the reasons for selecting and using these webpages and document them in one’s scientific work. In the scientific community, data obtained from internet sources are still considered unreliable and short-lived. The researchers touched on this particular topic by noting that they quite often had to harvest the sources they were interested in.
With regard to web archiving, these researchers raised several important issues – the importance of archiving transitory and ever-changing data, the importance of the type of content (they mentioned the need to archive blogs, for instance), the legitimacy of the use of data obtained from archived electronic sources and related ethical concerns (such as the use of personal information), and the way we see the web archiving process as a whole, particularly with regards to its current dynamic character.
Importantly, the researchers concurred with the current approach of the National Library of France and agreed with its strategy of combining automated broad harvests (e.g. of the national domain) with small selective harvests. However, they also expressed a hope that in the future, the archiving process will react more swiftly to new internet trends (popular webpages, social network content etc.). They requested that the criteria for the selection of sources for the web archive be made public and documented. Last, but not least, the researchers emphasized the necessity of general collaboration and of working with the scientific community on issues concerning the creation of methodologies and the selection of archived sources.
One of the most recent studies that focused on the general awareness of web archives and their use by academia was the study carried out in 2015 by the National Library of New Zealand (RILEY and CROOKSTON 2015). This project tried to provide a better understanding of potential scientific uses of web archives and to determine how the web archive in question should be developed further. The study showed that only 39 % of the researchers (113 out of 290) were aware of the existence of international web archiving initiatives, and only 23 % knew that the National Library of New Zealand archives national internet sources.
The researchers in the study would prefer full-text search, while URL-based searches for the requested source in a library catalogue were their least favorite way of looking through the archive. Regarding the datasets on offer and their potential use in research, the researchers were unsure whether the data would be useful for them (44 % of them would use the data in their research, while 38 % could not decide). A part of the study focused on the use of web archives and archived sources in the classroom, which was surprisingly frequent (34 % of those who work with foreign web archives also use them when teaching). The most frequent ways of using archived sources are direct quotation (via a link), illustration via a screenshot, and use as an information source for students.
When asked about the value of the content, 66 % of the researchers said that government webpages are the most significant research source. Many researchers also stated that archiving the content of social networks would be beneficial to their work – most of them agreed that the most important social media are video-sharing networks (YouTube, Vimeo, etc.) and discussion forums. The majority (77 %) also agreed that it is important to archive national internet sources. Half of them (51 %) said that the data from the web archive would be useful to them in the next five years, which is in contrast to the 2012 study of the National Library of France, which found that the researchers did not have ready uses for web archive data and generally assumed that the data would be used by their students (STIRLING et al. 2012).
A summary of the conclusions of both studies:
- The scientific community still sees data acquired from the internet as unreliable and transitory.
- Many researchers are not aware of web archives; they do not know how to use the data for their research.
- However, most researchers agree that with regards to research, web archiving is important and useful.
- Researchers prefer more traditional forms of accessing the archived data (looking through individual sources).
- They strongly request full-text search tools.
- In most cases, researchers are interested in datasets, even though they do not know how to use them afterwards and they are not sure about the size and content limitations of the sample (they usually feel that they should gather as much data as possible).
- They expressed an interest in the archiving of social networks.
- There are certain legislative and ethical issues concerning the use of electronic materials from web archives that need to be solved.
Conclusion
Until recently, researchers were not interested in web archives, even though these platforms currently represent the most significant, the largest, and the fastest growing part of our cultural heritage. However, when public interest in the archived internet content grew, the archives found that they did not know how to collaborate and communicate with researchers in a way that would be beneficial to both sides.
Conclusions from the studies of researchers’ needs showed some of their requirements for web archives. If we want to have better cooperation between the scientific community and web archives, the archives should focus on the following:
- promoting themselves and establishing cooperation (researchers are not aware of the archives);
- explaining the reasons behind the archiving process, as well as its aims, and emphasizing the significance and relevance of the data from web archives (researchers still prefer “classic” types of sources);
- documenting archive collections, creating standards for descriptions, and creating high-quality descriptions and metadata that would explain what is in a given collection, and why (it is not a complete copy of the internet or the original live webpage, some parts are missing);
- creating short annotations for individual collections – any scientific work is easier if a researcher knows what data he or she actually works with;
- attempting to allow for full-text archive searches;
- creating and providing datasets – even though researchers feel that they need as much data as they can possibly get, it is important to provide them with suitable tools, prepare smaller thematic collections for them, give them data delimited by time etc. (the documentation is necessary here – why and how were the sources selected, defined metadata and formats etc.);
- attempting to archive less common types of sources (various multimedia types, for instance) and social networks;
- solving and clearly defining authorizations and requirements for the use of the data from web archives (with an emphasis on the protection of personal data);
- working with other web archives to achieve a productive collaboration (the internet is international, so any research based on this data will have international range and ramifications);
- asking researchers for feedback on the collections creation process.
If we, as a society, want to keep our ability to write and study our own history, we will, without a doubt, need web archives. And web archives have to do everything in their power to rise to the challenge. Researchers from humanities disciplines must learn to work with digital materials; web archives must learn to work with other fields of study.
1 Harvests are processes of automated downloading and acquisition of data from selected internet sources (the creation of copies).
2 A crawler is a tool used for automated downloading and acquisition of data. Crawlers are used for harvests.
3 SURT is an acronym for Sort-friendly URI Reordering Transform, which is a transformation applied to a URI so that its representation better reflects the natural hierarchy of domain names.
This article was made possible by the support given by the Ministry of Culture of the Czech Republic to the further development of the National Library of the Czech Republic in its capacity as a research institution.
Bibliography:
BAILEY, Jefferson, 2016a. WAT Overview and Technical Details [online]. 2016 [2016-08-10]. Retrieved from: https://webarchive.jira.com/wiki/display/ARS/WAT+Overview+and+Technical+Details
BAILEY, Jefferson, 2016b. Program Models for Research Services [online]. 2016 [2016-08-10]. Retrieved from: http://netpreserve.org/sites/default/files/WAC02_HEKLA_Jefferson_Bailey.pdf
BAILEY, Jefferson, and Marie LACALLE. Don't WARC Away: Preservation Metadata & Web Archives [online]. 2015 [2016-08-10]. Retrieved from: http://connect.ala.org/files/2015-06-27_ALCTS_PARS_PMIG_web_archives.pdf
BRAGG, Molly, and Kate ODELL. SURT Rules [online]. 2010 [2016-08-10]. Retrieved from: https://webarchive.jira.com/wiki/display/ARIH/SURT+Rules
BRÜGGER, Niels, and Niels Ole FINNEMANN. The Web and digital humanities: Theoretical and methodological concerns. Journal of Broadcasting & Electronic Media [online]. 2013, pp. 66-80 [2016-04-28]. ISSN 1550-6878. Retrieved from: http://dx.doi.org/10.1080/08838151.2012.761699
ISO 28500:2009: Information and documentation -- WARC file format. London, 2009.
RILEY, Harriet, and Mark CROOKSTON. Awareness and use of the New Zealand web archive: a survey of New Zealand academics [online]. National Library of New Zealand, 2015 [2016-04-28]. Retrieved from: http://apo.org.au/node/58430
STRAKOVÁ, Jana, Milan STRAKA, and Jan HAJIČ. A New State-of-The-Art Czech Named Entity Recognizer [online]. P. 68 [2016-08-10]. DOI: 10.1007/978-3-642-40585-3_10. Retrieved from: http://link.springer.com/10.1007/978-3-642-40585-3_10
STIRLING, Peter, Philippe CHEVALLIER, and Gildas ILLIEN. Web Archives for Researchers: Representations, Expectations and Potential Uses. D-Lib Magazine [online]. 2012, 18(3/4) [2016-04-28]. DOI: 10.1045/march2012-stirling. ISSN 1082-9873. Retrieved from: http://www.dlib.org/dlib/march12/stirling/03stirling.html
TRUMAN, Gail. Web Archiving Environmental Scan: Harvard Library Report. Digital Access to Scholarship at Harvard [online]. USA: Harvard Library, 2016 [2016-08-10]. Retrieved from: http://nrs.harvard.edu/urn-3:HUL.InstRepos:25658314
ZITTRAIN, Jonathan, Kendra ALBERT, and Lawrence LESSIG. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management [online]. 2014, 14(2), 88-99 [2016-07-25]. DOI: 10.1017/S1472669614000255. ISSN 1472-6696. Retrieved from: http://www.journals.cambridge.org/abstract_S1472669614000255