Legal framework of textual data processing for Machine Translation and Language Technology research and development activities/Open Data and Web crawling Case Studies

Case #4: Uploading-copying "Open"/"Public domain"/web crawled data to a repository

Case description
Actor Repository manager
Intended use Upload the SETIMES (http://www.setimes.com/) dataset to my repository
Conditions I want to copy it from (http://opus.lingfil.uu.se/SETIMES2.php), where it states "A parallel corpus of news articles in the Balkan languages, originally extracted from http://www.setimes.com. The corpus is compiled by Nikola Ljubešić and is taken from http://www.nljubesic.net/resources/corpora/setimes provided under the CC-BY-SA license". On the original setimes.com site, in the disclaimer section it states "Copyright Information. Unless a copyright is indicated, information on the site is in the public domain and may be copied and distributed without permission. Citation of the original source of the information is appreciated. If a copyright is indicated on a photo, graphic or other material, permission to copy these materials must be obtained from the original source."
Question Do I upload the dataset and link to the original site? If so, which is the original site in this case: http://www.setimes.com/, http://opus.lingfil.uu.se/SETIMES2.php, or http://www.nljubesic.net/resources/corpora/setimes (the URL stated in the OPUS note)?

Or Do I just describe it with metadata, add attribution info, and link to the original site for downloading?

Suggested legal solution
Legal position This is a case where the material itself carries a "general rule with exceptions" clause. The general rule is that the material is in the public domain; the exception is defined by the individually licensed material. This is a rather common construct that offers simplicity and flexibility at the same time. For this approach to be operational, the individual material has to be licensed appropriately. The statement amounts to a waiver in the US and to a full licence of the economic rights in the EU. Note that the attribution requirement is not a legal condition but rather a soft norm. Any reference to CC BY-SA or another copyright licence is limited to specific items in the database. Attribution is to be made through the use of the URL. If specific attribution requirements are attached to specific material, attribution has to follow the corresponding licensing conditions (e.g. those of the CC BY-SA licence). In the absence of more specific conditions, individual items are attributed by reference to the URL of the source.
Suggested course of action Include the material in the collection under a Public Domain Mark (PDM) or CC0 mark. If possible include the original licence as well.
Type of Terms and Conditions PD, Soft Attribution Requirements, General Rule
Legal basis Copyright Law

Case #5 Distributing web crawled data I

Case description
Actor Researcher-Resource Compiler & Provider
Intended use Distribute a dataset I have compiled using automatic crawling techniques. Being aware of the legal constraints, I have also crawled legal metadata where available.
Conditions Many of the pages crawled are available under different CC licences
Question Can I make the whole set available under one CC licence? If yes, which one?

Or Do I partition the dataset according to licences and distribute the dataset as a bunch of subsets?

Suggested legal solution
Legal position Material crawled from the Web is treated differently in different jurisdictions. Overall, it is advisable to clear rights before you publish. You cannot apply a single CC licence to the whole set if the material is under several different CC licences, under non-CC licences, or under no licence at all.
Suggested course of action Include the material in the collection, but only after clearing the content. If clearance is not possible, offer services based on the content rather than the corpus itself. When clearing, make sure you do not change the CC licences of the source material, and do not apply a CC licence yourself unless you hold the appropriate licences or rights.
Type of Terms and Conditions Multiple
Legal basis Copyright Law

Case #6 Distributing web crawled data II

Case description
Actor Researcher-Resource Compiler & Provider
Intended use Distribute a dataset I have compiled using automatic crawling techniques. Being aware of the legal constraints, I have also crawled legal metadata where available.
Conditions Although many of the pages crawled are available under (some) CC licence(s), some others do not mention anything about terms of use
Question Can I make the whole set available under one licence? If yes, which one?

Or Do I partition the dataset according to licences and distribute the dataset as a bunch of subsets, leaving out those pages that do not contain info as to terms of use?

Suggested legal solution
Legal position Material crawled from the Web is treated differently in different jurisdictions. Overall, it is advisable to clear rights before you publish. You cannot apply a single CC licence to the whole set if the material is under several different CC licences, under non-CC licences, or under no licence at all.
Suggested course of action Clear rights and provide only the content for which you have the appropriate licences or rights, in accordance with those licences. Note that in most cases you are only re-distributing the material, not sub-licensing or re-licensing it. Hence, do not change the licence unless you understand which types of sub-licence and what range of re-licensing the original licence allows.
Type of Terms and Conditions Multiple
Legal basis Copyright Law
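
The partitioning option raised in the questions of Cases #5 and #6 could be sketched roughly as follows. This is only an illustration, not a clearance procedure: the record layout (a dict with "url", "licence" and "text" keys) and the function name partition_by_licence are assumptions made for the example, and the licence strings are whatever the crawled legal metadata reported.

```python
from collections import defaultdict


def partition_by_licence(records):
    """Group crawled records by their declared licence.

    Records without any licence information are returned separately so
    that they can be cleared individually or left out of the release.
    The licence label of each record is never changed.
    """
    subsets = defaultdict(list)
    unlicensed = []
    for record in records:
        licence = record.get("licence")  # hypothetical metadata field
        if licence:
            subsets[licence].append(record)
        else:
            unlicensed.append(record)
    return dict(subsets), unlicensed


# Hypothetical crawl output, for illustration only.
records = [
    {"url": "http://example.org/a", "licence": "CC BY 4.0", "text": "first page"},
    {"url": "http://example.org/b", "licence": "CC BY-SA 4.0", "text": "second page"},
    {"url": "http://example.org/c", "licence": None, "text": "third page"},
]

subsets, unlicensed = partition_by_licence(records)
for licence, items in subsets.items():
    print(licence, len(items))
print("no licence information:", len(unlicensed))
```

Each subset keeps the licence it was found under; nothing in the sketch re-licenses material, which matches the advice not to change the source licences.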

Case #7 Distributing web crawled data III

Case description
Actor Researcher-Resource Compiler & Provider
Intended use Distribute a derivative of a dataset I have compiled using automatic crawling techniques. Being aware of the legal constraints, I have also crawled legal metadata where available.
Conditions Since many pages did not contain legal info, and for a number of other reasons, I decided to build a language model out of it (2- to 5-grams)
Question Can I distribute it under the licence of my choice?

Or Do I obey the terms of use for those data for which there is legal info (=restrictions)?

Suggested legal solution
Legal position Creating a language model out of the material would most probably fall under the fair use doctrine, unless the original work can be reconstructed from the model or the model may be deemed an unauthorised derivative work. There are issues with regard to the definition of a derivative work, but in all probability creating n-grams would constitute a derivative work precisely because of their dependence on the original work.
Suggested course of action Follow the terms of the licences, where these exist. Try to group the material according to the licences it is under. More specifically:

(a) if the material is CC BY, you may use the derivatives with every other licence
(b) if the material is under an SA licence, make sure it is grouped only with material under the same type of SA licence (e.g. CC BY-NC-SA only, CC BY-SA only)
(c) you have no restrictions on the use of derivative works under a CC BY-NC licence, so you may follow the CC BY rule
(d) you have no restrictions with regard to CC0 and PDM works
(e) there are multiple compatibility issues with regard to OKF Open Data Commons Attribution licences, as they refer to the database and not to its content, something which is difficult to apply in practice
(f) the Open Knowledge Foundation Open Database Licence (ODbL) cannot be mixed with any other SA licence
(g) all (Attribution) Open Government Licences are treated in the same way as (a).

If you cannot clear the rights, provide a service based on the n-grams rather than the n-grams themselves.

Type of Terms and Conditions Multiple
Legal basis Copyright Law
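
To make the Case #7 scenario concrete, the following is a minimal sketch of collecting 2- to 5-gram counts while keeping the counts separated by source licence, so that the grouping suggested above remains possible afterwards. Whitespace tokenisation, the function names and the (licence, text) input format are assumptions made for the example; a real language model pipeline would differ.

```python
from collections import Counter


def ngram_counts(tokens, n_min=2, n_max=5):
    """Count all n-grams of length n_min..n_max in a token list."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def counts_by_licence(documents):
    """Accumulate n-gram counts separately for each source licence.

    `documents` is an iterable of (licence, text) pairs; this format is
    hypothetical and stands in for crawled pages plus legal metadata.
    """
    grouped = {}
    for licence, text in documents:
        tokens = text.split()  # naive whitespace tokenisation
        grouped.setdefault(licence, Counter()).update(ngram_counts(tokens))
    return grouped


documents = [
    ("CC BY-SA", "the cat sat"),
    ("CC BY-SA", "the cat ran"),
    ("CC0", "a b"),
]
grouped = counts_by_licence(documents)
print(grouped["CC BY-SA"][("the", "cat")])
```

Keeping one count table per licence means that, for example, SA-derived n-grams are never merged with counts from material under a different licence unless that decision is taken deliberately.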

Case #8 Distributing web crawled data IV

Case description
Actor Researcher-Resource Compiler & Provider
Intended use Distribute a subset of a dataset I have compiled using automatic crawling techniques. Being aware of the legal constraints, I have also crawled legal metadata where available.
Conditions Since I only want to distribute a few hundred sentences, I tend to be agnostic to the legal conditions. Therefore I have picked these hundred non-consecutive sentences and packed them in a dataset that I want to distribute for technology evaluation.
Question Can I distribute it under the licence of my choice?

Or Do I make sure that I have the right to distribute even sentences extracted from a dataset?

Suggested legal solution
Legal position The question relates to the degree to which (a) the work is a derivative and (b) the use of the work may fall under limitations and exceptions. The very act of crawling may constitute a copyright violation in a number of jurisdictions. In Europe especially, it is difficult to fit it under any specific limitation or exception, though there is active work on introducing a data-mining exception (see e.g. the UK copyright amendments).
Suggested course of action Distribute the material under any CC licence if the original material cannot be reconstructed. An NC licence could signal that the material should not be used for commercial purposes, thus adding to a due process strategy.
Type of Terms and Conditions Multiple. NC element suggested.
Legal basis Copyright Law. Emphasis on limitations and exceptions.

Case #9 Distributing web crawled data V

Case description
Actor Researcher-Resource Compiler & Provider
Intended use Distribute a subset of a dataset I have compiled using automatic crawling techniques. Being aware of the legal constraints, I have also crawled legal metadata where available.
Conditions Since I only want to distribute a few hundred paragraphs, I tend to be agnostic to the legal conditions. Therefore I have picked these hundred paragraphs and packed them in a dataset that I want to distribute for technology evaluation or development (e.g. parameter tuning).
Question Can I distribute it under the licence of my choice?

Or Do I make sure that I have the right to distribute the paragraphs extracted from the dataset? If yes, then does reshuffling the paragraphs help circumvent the problem?

Suggested legal solution
Legal position See Case #8. Reshuffling paragraphs reduces legal risk since it reduces the possibility of having the work reconstructed or substituted.
Suggested course of action See Case #8.
Type of Terms and Conditions Multiple. NC element suggested.
Legal basis Copyright Law. Emphasis on limitations and exceptions.
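
The selection strategies described in Cases #8 and #9 (picking non-consecutive sentences, reshuffling the extracted units) can be sketched as below. The function name, the fixed seed and the toy sentence list are assumptions made for the example. The sketch only illustrates the mechanics; whether the result is sufficiently hard to reconstruct remains a legal judgement.

```python
import random


def sample_non_consecutive(sentences, k, seed=0):
    """Pick k sentences so that no two were adjacent in the source,
    then shuffle them so the original order is not preserved."""
    rng = random.Random(seed)
    # Draw k distinct positions from a shortened range and spread them
    # out: adding the rank j to the j-th smallest position guarantees a
    # gap of at least one unselected sentence between any two picks.
    shortened = len(sentences) - (k - 1)
    if k > shortened:
        raise ValueError("not enough sentences for a non-consecutive sample")
    base = sorted(rng.sample(range(shortened), k))
    indices = [position + j for j, position in enumerate(base)]
    picked = [sentences[i] for i in indices]
    rng.shuffle(picked)
    return picked


# Toy data standing in for sentences of a crawled document.
sentences = [f"s{i}" for i in range(20)]
picked = sample_non_consecutive(sentences, 5)
print(picked)
```

The same shuffle step applies to paragraphs; in that case only the order of paragraphs changes, while each paragraph stays intact.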

Case #10 Distributing web crawled data VI

Case description
Actor Researcher-Resource Compiler & Provider
Intended use Distribute a subset of a dataset I have compiled using automatic crawling techniques on AUTOMOTIVE websites.
Conditions Most probably the websites and relevant pages are copyrighted
Question Can I distribute it under the licence of my choice (possibly adding a phrase a la Wacky "If you want your webpage to be removed from our corpora, please contact us.")?

Or Do I make sure that I reshuffle the sentences in the paragraphs, before distributing? Or Just refrain from distributing it?

Suggested legal solution
Legal position See Case #9. Pay particular attention to the Terms of Use of the relevant websites. While the fair use doctrine may be on your side in many jurisdictions, it may be overridden by contractual terms expressed in the Terms of Use of the relevant website. If no specific conditions exist, see the analysis and suggestion under Case #9. A notice and take-down procedure is always helpful and is suggested for all web-crawled corpora.
Suggested course of action See Case #9. Include a notice and take down procedure.
Type of Terms and Conditions Multiple. NC element suggested.
Legal basis Copyright Law. Emphasis on limitations and exceptions.
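
On the technical side, honouring a take-down request can be as simple as filtering the corpus before each release. The sketch below assumes that every record keeps its source URL and that requests are tracked per host name; both the record layout and the function name apply_takedowns are hypothetical.

```python
from urllib.parse import urlparse


def apply_takedowns(records, takedown_hosts):
    """Drop every record whose source host appears in the take-down list."""
    blocked = {host.lower() for host in takedown_hosts}
    kept = []
    for record in records:
        host = (urlparse(record["url"]).hostname or "").lower()
        if host not in blocked:
            kept.append(record)
    return kept


# Hypothetical corpus records and a hypothetical take-down request.
records = [
    {"url": "http://cars.example.com/review/1", "text": "first text"},
    {"url": "http://motors.example.org/news/2", "text": "second text"},
]
released = apply_takedowns(records, ["cars.example.com"])
print([r["url"] for r in released])
```

Keeping the source URL with every record is what makes such a procedure workable; without it, removing a site's pages after the fact is hardly possible.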

Case #11 Distributing web crawled data VII

Case description
Actor Researcher-Resource Compiler & Provider
Intended use I want to build a corpus a la Wacky (http://wacky.sslmit.unibo.it/doku.php)
Conditions I certainly cannot guarantee that the crawled data are free of copyright. Does a sentence such as the one at the end of http://wacky.sslmit.unibo.it/doku.php?id=corpora save me?
Question Can I use such a phrase in tandem with a Notice and Take Down policy such as the META-SHARE MoU?

Or Do I refrain from doing anything with it?

Suggested legal solution
Legal position See the sections regarding web crawling. The last sentence, "If you want your webpage to be removed from our corpora, please contact us.", is good practice, though not perfect under the current copyright system. It amounts to a notice and take-down procedure, though one that is not very accurately described.
Suggested course of action
(a) check for metadata on the relevant websites that tell crawlers to stay out
(b) try to crawl CC-licensed or PD pages
(c) have an idea of how large the non-licensed part of your corpus is
(d) list the websites you crawled from
(e) use CC NC licences
(f) have an opt-out or notice and take-down procedure clearly stated on your website
(g) make sure that the original content cannot be fully reconstructed or, even if it can, that the result cannot substitute for the original content.
Type of Terms and Conditions Multiple. NC element suggested.
Legal basis Copyright Law. Emphasis on limitations and exceptions.
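
Step (a) of the suggested course of action can be partly automated. The sketch below uses Python's standard urllib.robotparser module to test URLs against the text of a site's robots.txt file. The robots.txt content, the URLs and the user agent name "example-corpus-bot" are made up for the example, and robots.txt is only one of the signals a site may use to keep crawlers out (page-level meta tags and the Terms of Use are not covered here).

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="example-corpus-bot"):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content; a crawler would fetch the real file.
robots_txt = """User-agent: *
Disallow: /private/
"""

urls = [
    "http://example.org/news/1.html",
    "http://example.org/private/2.html",
]
print(allowed_urls(robots_txt, urls))
```

Recording the outcome of this check together with the list of crawled websites (step d) also documents that crawler restrictions were respected at crawl time.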