Legal framework of textual data processing for Machine Translation and Language Technology research and development activities/Copyright Analysis of MT & MP scenarios based on crawling


The objective of this section is to address a series of issues concerning the crawling, extraction and re-use of data that may be found on the Internet under different licensing schemes or without any licensing information.

Exploring this issue invariably touches upon multiple types of rights (copyright, sui generis right, public sector information) as well as aspects of rights (copying, making derivative works, communicating to the public).

The section presents the different types and aspects of rights involved in a rather structured fashion, so that it is possible to draw generic conclusions with regard to the level of risks involved when re-using material used for the construction and use of Language Resources (LRs) or Language Technologies (LTs).

More specifically, this section presents the key methodological aspects of our approach for analysing different cases of data re-use in the context of the QTLP project.

Methodological Approach

In order to address each of the above points, we adopt the following approach:

  • First, we present a series of steps that describe in a generic fashion the acts that are to be legally assessed.
  • Second, we examine the legal status of these acts.
  • Third, we assess the risks and opportunities that these acts entail.
  • Fourth, we make suggestions as to alternative or additional actions that could reduce risk or increase the production of value with regard to these acts.
  • Fifth, we make overall suggestions as to how legislation should change to address the issue of content re-use, so that different types of LR processing and the development of LTs become possible.

This set of steps draws from cases presented in Section 5, where different use-case scenarios with regard to the re-use of data are explored. These cases are then grouped to provide some core suggestions as to how to process and re-use data in the context of language resources (LR) processing.

Copyright law is not fully harmonised at the international level and, hence, it is extremely difficult to provide a generic answer covering all situations that involve more than one jurisdiction where possible acts of infringement take place. However, there are some common rules described in international treaties, mainly the Berne Convention, the TRIPS Agreement and the WIPO treaties, that provide us with an understanding of copyright rules at the international level, whereas a series of directives at the EU level provides an even more harmonised legal regime for the Member States of the European Union.

The section refers primarily to the legal system at the level of the international treaties, placing additional emphasis on the regime established by the relevant EU Directives and referring to copyright legislation in some of the key jurisdictions outside the EU where the greater volume of data processing takes place, mainly the US, Canada and Australia. The main focus of the section is copyright law, but there are also references to Public Sector Information and Data Protection/Privacy regulations where applicable.

For reasons of simplicity we will refer to the entirety of these regulations as “copyright law” with additional references to specific legal instruments where this is deemed necessary.

LR processing and LT development

LR processing and development depend to a great extent upon the use of third-party material that may be found on the Internet, in specific collections or in final products (mostly books or other types of publications). The processing of content of different types is essential for the production of some key LRs, such as annotations, lexicons and, above all, corpora. As a result, it is necessary to have a good idea of how material could be used and re-used.

The key questions in this context may be summarised as follows:

  • What are the key types of use relevant for the development of LRs, and how should they be treated in order for the (re-)use to be lawful?
  • What are the key types of permissions necessary to perform actions on LRs?
  • What are the best licences for making LRs available?
  • How can we use exemptions, fair use/dealing, limitations and exceptions in order to process LRs?
  • What are the legislative amendments necessary to advance LR related research?

The following section deals with the core issue of data mining and web crawling, which is the most problematic aspect of content re-use for LR/LT purposes. We return to these questions at the end of this section.

The issue of Data Crawling

Data Crawling may be defined as the act of automatically collecting different forms of information from the public Internet (i.e. through bots); the information is then stored and processed in different ways.

It is necessary that this description be broken down into distinct steps that will be subsequently assessed in terms of the degree to which they constitute violations of copyright law in different jurisdictions.

The material may be found either on the Internet or in specific repositories. In the former case, the likelihood of having a specific licence attached to the material is much lower than when finding the material in a repository. However, it is often the case that even material that is found in a repository is the result of crawling and data-mining, hence the issue of data-mining is a core issue for our understanding of how LRs are to be legally used.

More specifically:

1. All the material found on the web is material that potentially constitutes protected subject matter. This is mostly due to the fact that copyright protection does not require any formalities for the protection to be granted and, hence, there is no record of whether the material is protected or who the owner is or when the protection is to expire. Protected subject matter may fall under the following broad categories:

  • Textual information (literary works)
  • Pictorial (artistic works)
  • Audiovisual works
  • Sound Recordings
  • Musical Works
  • Databases and compilations

This section mainly focuses on literary works and databases, though it is also applicable in cases where the other types of works are being crawled.

2. A portion of these works may be outside copyright protection, either because the term of protection has expired or because they fall under categories of works that are by definition not protected in certain jurisdictions.

  • In the first category (expired copyright), we find works produced by creators who died over 70 years ago (e.g. in the case of literary works) or works that were produced over 50 or 70 years ago (e.g. in the case of sound recordings). The term of protection is calculated on the basis of a variety of factors, mostly:
    • the type of work (e.g. literary work vs. sound recording)
    • the type of rights subsisting over the work (e.g. copyright vs. related rights) and
    • the jurisdiction where the rights holder seeks protection (e.g. Australia vs. EU vs. US).
  • In the second category (exempted material), we find subject matter that by virtue of its nature is classified as not protected. This will mainly involve works made by the public administration or the legislature, which for reasons of public interest remain outside the realm of protection of copyright law. Some of these works are universally outside the protection of copyright law (e.g. statutes in the EU) and others are outside protection only within a specific jurisdiction (e.g. statutes in the US or Public Sector Information (PSI) in certain EU Member States). In addition, in some jurisdictions these types of works are presented as works outside the realm of copyright protection, and in others as falling under the limitations and exceptions to copyright law.
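The expired-copyright test for a literary work can be sketched in a few lines. This is a minimal illustration, assuming a life-plus-70 rule in which the term is counted from 1 January of the year following the author's death; the function name and its parameters are our own, and the actual term depends on the type of work, the type of right and the jurisdiction, as listed above.

```python
from datetime import date


def is_term_expired(death_year: int, today: date, term_years: int = 70) -> bool:
    """Return True if a life-plus-term copyright has run out by `today`.

    Assumption: the term is counted from 1 January of the year after the
    author's death, so protection lasts through 31 December of
    death_year + term_years.
    """
    first_free_day = date(death_year + term_years + 1, 1, 1)
    return today >= first_free_day
```

Passing a different `term_years` (for instance 50) gives a rough view of how the same work may be treated under a shorter term; it does not replace a check of the rules of the jurisdiction concerned.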

3. Depending on the type of work, there may be different types of rights conferred on its creator, producer or performer. Hence, in the case of a literary work, copyright subsists as the main legal right; in the case of what is perceived as a single final work (e.g. an audiovisual work), multiple layers of works and rights may subsist (e.g. musical work, literary work, sound recording, performance) with different durations and exceptions; in the case of a compilation of information, there may be different types of rights according to the type of creative input that led to the final work (e.g. copyright for the original compilation, sui generis database right for a database). The definition of the kinds of rights subsisting in a specific informational product depends on the jurisdiction where protection is sought, e.g. original databases are always treated as copyrighted works in the US, whereas in the EU there are two types of rights, i.e. copyright for original databases and the sui generis right where only an investment of time and labour has taken place. Finally, the level of originality required to grant protection may differ. For instance, in Australia the level of originality required to grant protection to a database is close to the definition of the non-original database in the EU, whereas in the US a greater level of originality is required. This means that the same work may have different levels of protection in different jurisdictions and, hence, what constitutes infringement in one jurisdiction may not be treated the same way in another. The most risk-averse approach, hence, would be to take as a baseline the highest level of protection (i.e. the existence of a sui generis database right in all compilations of facts irrespective of their originality) and to act on the basis of very limited exceptions or a very narrowly construed fair dealing.

4. It is necessary to specify the acts that are going to be performed upon the data and hence assess two factors: (a) the degree to which such acts fall within the acts restricted by copyright laws and (b) the extent to which such acts are visible enough to expose an organisation to the risk of legal action.

(a) In the case of web crawling, the acts would certainly include copying and processing of the relevant information and potentially the creation of derivative works and the communication to the public either of parts of the original work or a derivative work. Each of these acts needs special treatment:

  • Copying: the act of crawling certainly involves the reproduction of content and hence activates the reproduction right. According to the Copyright Directive, any form of reproduction (direct or indirect, temporary or permanent) falls under the relevant economic rights of copyright and related rights holders and hence is regulated by copyright. In the case of crawling, the reproduction of the material could involve various quantities of material and could be temporary or permanent. In most cases of crawling for Language Technology purposes, the amount of material copied would be substantial in both qualitative and quantitative terms. It will be quantitatively substantial because otherwise there is not enough data for the LTs to perform operations that provide a meaningful result. It will also be qualitatively substantial, because the parts of the material collected are by definition significant for the entity making the collection and performing the processing. The temporality of the copying is also a significant factor, but it seems that in the case of crawling for language resource processing very little of the temporary copying falls under Article 5 of the Copyright Directive. This is because such temporary copying is allowed only where it is used to facilitate either a transmission in a network between third parties by an intermediary or a lawful use, and where it has no independent economic significance. It is almost impossible that even such a temporary reproduction in the LT context would fall under this exception, since it does by definition have economic significance. In any case, there are recent developments in national copyright legislation, particularly in Germany and France, where draft legislation has been proposed introducing a “snippeting right” with a duration of one year. Under this new right, news publishers would be able to license out snippeting rights for a royalty and start proceedings against those found to infringe their newfound neighbouring right. They would also be able to grant the relevant intermediaries permission to reproduce for free. This trend follows the two Infopaq cases, decided by the European Court of Justice in 2009 and 2012 respectively, which have been the subject of heavy criticism by copyright academics and practitioners. In the Infopaq I case, the Court decided that snippets of 11 words may, depending on national law, be entitled to copyright protection under the European directives if they can be found to constitute an expression of the intellectual creation of their author. Accordingly, originality and not substantiality is the test that determines the copyright status of extracted parts of a work. In Infopaq II, the Court further noted that the transient copying exception enshrined in Article 5(1) of the Copyright Directive only applies if the act of temporary reproduction does not enable the generation of an additional profit beyond that derived from the lawful use of the protected work and does not lead to a modification of the work; under this interpretation, the reproduction of news snippets by an automated process would not qualify as a protected use. Similarly, in 2011, the English Court of Appeal in Meltwater found that Meltwater News, an electronic media monitoring service, could be implicating its subscribers in copyright infringement by distributing sections that included the headline, opening text and extracts from the articles of the claimant, the Newspaper Licensing Agency (NLA). Businesses that access press-monitoring services without a special web end-user licence may thus be in breach of publishers’ copyright, notwithstanding any licence held by the press-monitoring agency. It becomes clear, hence, that in most cases of web crawling for LT purposes the exceptions of Article 5 of the Copyright Directive would not be applicable, nor would most national laws in the EU accept such crawling as falling within the realm of copyright limitations and exceptions.
  • Derivative works: in most cases, the act of crawling will either mean the collection of only parts of the websites or will also include additional processing once the material is collected. As a result, derivative works will be created and hence additional permissions by the rights holders may be required. As demonstrated in the previous section, the act of creating derivative works cannot be construed as falling under the limitations and exceptions provisions and hence it will also require separate permission by the rights holders.
  • Communicating to the public: the final part of a series of acts starting with web crawling and continuing with the processing of the collected data could be the communication of the results to the public. If what is communicated to the public is the actual data either in their original or their derivative form, then this constitutes yet another act restricted by copyright law. If, however, the end user is only the recipient of a web service that is the result of web crawling and processing of the relevant data without any direct communication of the actual web data to the audience in an identifiable form, then copyright law is not activated at all. It needs to be made clear that this is the case when no copying is involved. If this is not the case, then copyright applies.

(b) A separate question is the degree to which the act of crawling and any subsequent acts of data processing and dissemination are visible enough to expose the relevant entities to the risk of lawsuits. Unless the owner of a web site explicitly wishes the contents of her site not to be indexed or copied, the act of web crawling is part of the daily operation of a web site and hence could be covered by an implied licence. Indeed, web sites need to be copied at least temporarily in order to be viewed, and hence the simple act of web crawling may not be something that is noticed or objected to by the web site owner. In addition, the processing of information or the selective copying from the web site may occur on the premises of the entity producing the LTs and hence not really be perceptible to the rest of the world. If the LTs are offered as a service, the probability that a third person establishes a link between the infringement of a single web site and the final service offered to the end user becomes extremely low. Accordingly, unless the information crawled from a specific web site is substantial for the operation of the end user service, the legal risk drops dramatically.

5. The previous analysis indicates that, while the acts of web crawling and the subsequent processing and communication of the relevant material constitute copyright infringement and are unlikely to fall under the limitations and exceptions to copyright law, the actual risk of legal action is fairly low and may be further mitigated through the following actions:

  • It is necessary to identify big content providers whose content is crawled and is significant for the entirety of the collection of the entity that performs the crawling. This would be the case, for instance, of a big publisher or a newspaper licensing agency.
  • If a commercial service is offered by the entity that performs the crawling, then it is good practice to contact the collecting societies of the jurisdiction in which it has its main place of operation and inquire whether there is a licence that actually covers the act of web crawling in its jurisdiction. However, if the LT provider is not using material from a specific jurisdiction or is mainly involved in non-commercial activities, it may be more prudent to rely on an implied licence rather than seek a commercial licence from the collecting societies.
  • In some jurisdictions (especially in the US, where fair use is applicable), there is the concept of the implied licence with regard to web crawling. This legal construct relies on the fact that web browsing is not possible without the reproduction of the contents of a web site, that most of the owners derive value when their website is crawled or indexed and that there are technical measures to stop crawling, which are easy to apply and hence if they do not exist, imply that the web site owner wishes it to be copied. Objections to this line of argumentation include that, at least in the civil law jurisdictions, licences are very narrowly construed only to cover the explicit acts the rights owner would like to authorise. In that sense, web browsing or indexing or caching for a search engine is different from crawling for Language Technology purposes and the latter may not have been the intention of the web site owner. In addition, if the LT provider profits out of this activity, this may prejudice the economic interests of the web site owner.
  • In order to further reduce risk it is suggested that the LT provider:
    • only crawls sites where bots are allowed
    • has a notice publicly stating that its content only derives from web sites that do not prohibit crawling
    • provides a brief explanation as to how someone could stop her site from being crawled
    • produces a notice and take-down procedure indicating under which circumstances the material will be taken down and for how long, what the decision-making procedure is, and an email address to which relevant complaints can be addressed
    • does not use the material for commercial purposes
    • ensures that the material provided through the LTs is in such a state or form that the original content cannot be reconstructed and that its use cannot be substituted by the content provided by the LT.
  • Finally, it is strongly suggested that the LT provider:

(a) does not engage in acts of advertising the collection of web material unless necessary for the purposes of her work and only under the conditions stated in the previous bullet point

(b) performs the processing of any collected content internally

(c) does not offer any content or derivative content as such, but only services that result from processing the collected material without replicating it.
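The suggestion to crawl only sites where bots are allowed can be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `may_crawl`, the user agent string and the example robots.txt content are illustrative assumptions. A positive result only reflects the technical signal given by the site and does not settle the legal questions discussed above.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that closes one directory to all bots
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

In a real crawler the robots.txt text would first be downloaded from the site in question; here it is passed in as a string so that the check can be tried without network access.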

Grouping and understanding Cases of LR re-use

In order to better understand how LRs that include third-party material may be made available in the easiest and least risky way, it is particularly helpful to draw up a typology of such material and of how it could be redistributed. For this purpose, it is necessary to explore the questions raised in section C as parts of a three-step process:

  1. Step A: understanding the type of material used
  2. Step B: understanding the limitations to re-use on the basis of Step A
  3. Step C: choosing the appropriate licence and way of releasing the LRs on the basis of steps A and B.

Step A: Material Used

The material used in LRs will almost invariably originate from different sources and the LR provider will not have the rights to release it without having a legal basis for its re-use. The material used may be classified according to the types of rights or restrictions subsisting on it as follows:

1. Generic copyrighted material: this would be any type of material that is potentially under copyright law, irrespective of its source. Whether it is still copyrighted and whether its use within an LR constitutes a permissible use is something we have seen in detail under section 3.1. The rule of thumb here is that, when the material is potentially copyrightable, we ask three questions:

  • Is it Public Domain material? (if yes, we use it, if not we proceed to the following question)
  • Does its re-use fall under fair use/dealing or an exception? This is a rather rare case, as we have seen in section 3.1 (if yes, we use it; if not, we proceed to the following question)
  • Is it licensed under a Creative Commons or other standard or custom open licence? (if yes, we use it; otherwise we only link to the material and do not include it in a repository as such)
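The three questions above form a simple decision cascade, which can be sketched as follows. The names `Action` and `decide_reuse` are our own, and the three boolean inputs stand for legal assessments that have to be made beforehand; the code only records the order in which the answers are applied.

```python
from enum import Enum


class Action(Enum):
    INCLUDE = "include the material in the LR"
    LINK_ONLY = "link to the material only"


def decide_reuse(is_public_domain: bool,
                 falls_under_exception: bool,
                 has_open_licence: bool) -> Action:
    """Apply the three questions in order and stop at the first 'yes'."""
    if is_public_domain:
        return Action.INCLUDE
    if falls_under_exception:  # fair use/dealing or an exception; a rare case
        return Action.INCLUDE
    if has_open_licence:       # Creative Commons or other open licence
        return Action.INCLUDE
    return Action.LINK_ONLY
```

For example, `decide_reuse(False, False, True)` returns `Action.INCLUDE`, while material that fails all three questions is only linked to.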

2. Public Sector Information (PSI): this is material that has been produced by a Public Sector Body (PSB) and falls under the relevant PSI legislation. PSI is particularly important as it comprises large volumes of material that can potentially be re-used for the development of LRs. PSI legislation in the US exempts such material from being copyrighted in the US and licenses it under permissive and open licences (e.g. Creative Commons Zero or Attribution) outside the US. In other jurisdictions, such as Australia, Canada or New Zealand, a variety of Creative Commons licences is used in order to make such material available. In the EU, the PSI regulations and legislation have over 10 years of history and have recently been updated through the 2013 PSI Directive. According to the new PSI Directive, all PSI made available must be legally re-usable for both commercial and non-commercial purposes. The 2013 PSI Directive gives the Member States the option either to release PSI under an Open Government Licence or to exempt PSI from copyright by law and only use a disclaimer to further disseminate the material. In practice, this means that we are going to see an increase in licensing and notices in PSI material over the course of the next two years (the implementation deadline for the new PSI Directive is the end of 2015) and hence more re-usable material for LRs. The classic limitations have to do with:

  1. Attribution (of the web site, the information provider or the individual creator)
  2. Non-endorsement
  3. Differentiation between the original and the derivative work
  4. Warranties and other disclaimers
  5. Retaining copyright notices and disclaimers

Most of these conditions are easy to follow, though attribution and the documentation of legal terms and conditions require special treatment. It is important to highlight that the mere fact that some material qualifies as PSI does not mean that it is necessarily re-usable without conditions or only with attribution and notice conditions.
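One way to give attribution and the documentation of legal terms the special treatment they need is to keep a small provenance record for every source. The sketch below is only an illustration: the class name `SourceRecord`, its fields and the wording of the attribution line are our own assumptions and are not prescribed by any PSI licence.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRecord:
    """Provenance and licensing details kept for one piece of re-used material."""
    source_url: str
    provider: str     # web site, information provider or individual creator
    licence: str      # licence name as stated by the source
    notices: list = field(default_factory=list)  # copyright notices and disclaimers, kept verbatim
    modified: bool = False  # True when the LR holds a derivative rather than the original

    def attribution_line(self) -> str:
        """Build a one-line attribution that also flags derivatives."""
        line = f"Source: {self.provider} ({self.source_url}), licence: {self.licence}."
        if self.modified:
            line += " Modified from the original; no endorsement by the source is implied."
        return line
```

Keeping the notices verbatim alongside each record makes it easier to retain them when the LR is redistributed.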

Step B: understanding re-use limitations

Re-use limitations stem from the types of rights subsisting on the material included in the LRs. Broadly speaking we may identify the following classes of cases:

1. Re-use based on material without any copyright notices from the web: such material may be re-used after taking a series of measures to reduce risk. These include both actions before the re-use of the material and the structure of the web site/service through which the LR is to be offered:

  • Check if there are any meta-data prohibiting crawling of the web-site
  • Check if there are any legal conditions prohibiting re-use of the material
  • Ensure the original material cannot be reconstructed from the LR
  • Ensure the LR does not substitute in terms of use the original material
  • Ensure there is a notice and take down and an opt-out clause prominently featured in the web-site/service through which the LR is made available to third parties
  • Try not to use the material for commercial purposes or allow third parties to use it for commercial purposes
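The first check in the list above, looking for meta-data that prohibits crawling, can be partly automated at page level. The sketch below uses Python's standard `html.parser` to look for a robots meta tag; the class and function names are our own, and the set of directives treated as an opt-out is an assumption that should be adapted to the crawler's policy.

```python
from html.parser import HTMLParser

# Assumption: directives we treat as a request not to collect or store the page
OPT_OUT_DIRECTIVES = {"noindex", "none", "noarchive"}


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def page_opts_out(html: str) -> bool:
    """Return True if the page carries a robots meta directive from OPT_OUT_DIRECTIVES."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & OPT_OUT_DIRECTIVES)
```

A page that opts out in this way would simply be skipped; legal conditions stated in terms of use still have to be checked separately.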

2. Re-use based on material that is found on the web but which has a licence:

a. If the licence or terms of use do not allow derivatives or further distribution, we can only process the material and make it available through a service that does not give access to the original material. We have no limitations regarding annotations if the original work is not reproduced or identifiable.

b. If the licence allows derivative works, make sure to respect the conditions.

c. If the licence is an open licence:

  • Follow carefully the attribution clause
  • Retain copyright notices and disclaimers
  • Do not mix content with different SA licences (e.g. CC SA and OKF SA licences, or CC BY-SA and CC BY-NC-SA licences). You could, nevertheless, distribute material under different licences if it remains separate and the licences are identifiable.
  • Differentiate between the original and derivative
  • Follow the non-endorsement conditions
  • Try to indicate what NC means, if the original licensor uses such conditions
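The rule against mixing content under different SA licences can be turned into a simple pre-release check. This is a minimal sketch: the licence identifiers in `SHARE_ALIKE` are illustrative assumptions, and the check deliberately ignores licence versions and any compatibility clauses a licence may contain.

```python
# Assumption: illustrative identifiers for share-alike licences
SHARE_ALIKE = {"CC-BY-SA", "CC-BY-NC-SA", "ODbL"}


def conflicting_sa_licences(licences):
    """Return the distinct share-alike licences if more than one is present, else []."""
    found = sorted({name for name in licences if name in SHARE_ALIKE})
    return found if len(found) > 1 else []
```

A non-empty result signals that the sources should be kept as separate, individually licensed parts rather than remixed into one work.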

3. Re-use of material that is in print format or in digital format but constitutes a complete work (especially lexicons). Here the key question is the type of re-use made, especially if there is paraphrasing or copying of the structure of the lexicon. The following broad suggestions are made:

a. Avoid verbatim copying

b. Avoid replicating the structure of the lexicon unless there is no originality in it, e.g. it is alphabetical.

c. When paraphrasing try to avoid replicating original elements in a specific definition, e.g. the order of explanations, use of characteristic words etc.

4. Creating Derivative Works:

a. Translations, even of PD works, if they themselves are not in the PD, require special permission. See elements (a) and (b).

b. N-grams normally constitute derivative works. The more numerous they are and the closer they come to the original work, the higher the risk of copyright violation. The test followed under (a) will help in minimising the risk from using n-grams.

c. Annotations: unless they reproduce part of the original work they do not constitute a problem. If they reproduce part of the original work, see how n-grams are treated.
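To make the point about n-grams concrete, the sketch below extracts word n-grams and measures what share of an original text's n-grams appears in a released set. The function names and the coverage ratio are our own illustrative proxy for "how close the n-grams get to the original"; they are not a legal test.

```python
def word_ngrams(text: str, n: int) -> list:
    """Return the word n-grams of `text` as tuples, in order of appearance."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(original: str, released: set, n: int) -> float:
    """Fraction of the original's n-grams that are present in `released`."""
    grams = word_ngrams(original, n)
    if not grams:
        return 0.0
    return sum(1 for gram in grams if gram in released) / len(grams)
```

A coverage close to 1.0 with long n-grams means the original can largely be reconstructed from the released data, which is the situation the suggestions above try to avoid.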

5. Anonymisation: where the original resource contains personal data, and in accordance with the specific personal data protection rules of the jurisdictions where the content is to be made available or where the data come from, there is a need either to obtain permission from the data subjects or to anonymise the relevant data. The former is a rather expensive procedure, hence the latter is strongly suggested. Anonymisation, however, needs to be specified with regard to the entity that lawfully defines it, the conditions for anonymisation and the permitted uses after anonymisation has been completed. This is treated in greater detail in a separate section, as this section only deals with copyright-related issues.
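As a very rough illustration of a first masking pass, the sketch below replaces e-mail addresses with a placeholder. The pattern and the placeholder are our own assumptions; masking one kind of identifier is far from full anonymisation, which must follow the conditions set by the applicable data protection rules.

```python
import re

# Assumption: a deliberately simple pattern that covers common address forms only
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every e-mail address matched by EMAIL with `placeholder`."""
    return EMAIL.sub(placeholder, text)
```

Names, addresses and other identifiers would need their own treatment, typically with dedicated tools and manual review.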

Step C: understanding licensing of LRs containing third party works

Overall, the licensing of LRs containing third-party material is only possible when Steps A and B have been successfully completed, i.e. there is an understanding of the legal basis under which the material is to be used and of the limitations such legal basis entails. In broad terms, we have the following scenarios for the licensing of such LRs:

1. Third-party material is without a licence: provided that we follow the conditions under Step B(a), we can release it under any licence. The use of an NC licence is suggested in order to substantiate that no commercial harm is being done to the rights owner of the original content.

2. There is a licence attached to the material:

a. If the licence is a custom-made licence that does not allow redistribution or transformative uses, the only possible way to use the LR is through some form of service that does not give access to the LR itself, or if the original content cannot be identified within the LR. Otherwise, it can only be linked from the LR to the original content or not disseminated at all.

b. If the licence is an open licence, standard or custom:

  • You probably cannot relicense the original content, and you can license derivatives only under the conditions of the licence
  • Even when additional conditions regarding the derivative work are not provided, try to include the attribution, reference to the original licence and the relevant disclaimers
  • In the case of the derivative ensure you differentiate between the original and the derivative. Attribution to the original would in most cases suffice
  • Ensure you comply with the SA/copyleft conditions of the original content, i.e. do not remix the derivative work with any content under a licence other than the original.

3. Where the LR contains personal data, it is strongly suggested that there be a relevant indication (data protection notice) as well as a notice stating whether the data set has been anonymised (anonymisation notice).


Overall, there seems to be a great need for a horizontal intervention at the legislative level to clarify some of the key issues examined in this section, particularly with regard to the re-use of material mined from the Web. IFLA’s suggestion and the UK copyright data mining exception point in the right direction, namely introducing a clear exception for web crawling/data mining with a view to LR processing, with a specific clause on how to treat commercial uses (possibly through extensive collective licensing). With regard to PSI, there is also a need for legislative solutions at the Member State level, making PSI open by default and only resorting to licensing in cases where there is no legislative tradition of exempting PSI from copyright (e.g. the UK).