Legal framework of textual data processing for Machine Translation and Language Technology research and development activities/Risk Mitigation Strategies and Measures

Risk Mitigation Strategies and Measures

The preceding analysis illustrates the uncertainties, costs and, hence, risks related to the collection and re-use of LRs for MP/MT purposes. At the same time, a closer look at how LRs are actually used indicates that legal risk can be substantially reduced by adopting strategies that allow the intended services to be provided without exposing the LR processor to that risk.

Such strategies and measures are based on three premises:

First, the essence of LR processing is that a resource, whether textual or audiovisual, is treated not as something to be consumed by a human being but as data to be processed by a machine. As a result, the original resource is unlikely to be made available to the audience or end-user as such; instead it will be modified, often into something that is either unrecognisable or not even copyrightable, and will feed into a service that leaves no trace of the original resource when offered to the end-user.
Second, while copyright infringement covers both introvert acts (e.g. copying or modifying) and extrovert acts (e.g. publishing, making available, distributing), the latter are the greater source of legal risk.
Third, in cases of personal data processing, the link between a specific person and the relevant data is not what is valuable for LR purposes. It is not the information per se but the language elements that make a resource useful for MP/MT. As a result, it is possible to provide services without providing the personal data as such.

The three main risk mitigation strategies may be summarized as follows:

1. Provide the service rather than the data: as mentioned above, the key objective of any LR-based MP/MT activity is the provision of a service rather than of the data as such. Even in terms of content processing, what matters are structural elements, sequences of n words (n-grams), relationships, and syntactic elements that provide semantic information. Such a strategy is mostly useful with regard to copyrighted content, though it may also be useful with regard to content containing personal data. Overall, a service provision strategy may lead to one of two results:

Either reduce the likelihood of infringement by confining the acts performed to copying and/or modification rather than distribution, since what is produced is so far removed from the original work that it does not qualify as a derivative work at all; or
Reduce the risk of legal action by providing access only to the service and not to the content, whose use thus remains invisible to the wider public.
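
As a rough illustration of the service-rather-than-data idea, the sketch below reduces a set of sentences to word n-gram counts, which a service could use internally without handing the source text to end-users. The function name `ngram_counts`, the whitespace tokenisation, and the sample sentences are assumptions made for this example only. Whether such derived data fall outside copyright is a legal question that the code does not answer.

```python
from collections import Counter


def ngram_counts(sentences, n=2):
    """Count word n-grams across sentences (hypothetical helper).

    Only the counts are returned; the sentences themselves are not stored.
    Tokenisation is a naive lowercase whitespace split, for illustration.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Invented sample data standing in for a language resource.
corpus = ["the cat sat on the mat", "the cat ate"]
bigrams = ngram_counts(corpus, n=2)
```

Note that with a large enough n and a small enough corpus, n-gram tables can still allow substantial parts of a text to be reconstructed, so the choice of n matters for how far the output is removed from the original.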

2. Anonymise or pseudonymise personal data that are then to be released as MP/MT data: While the field of anonymisation and pseudonymisation is vast and often contested, it also has great potential for obtaining access to personal data for research purposes or for releasing a service without providing access to such personal data. It is not the objective of this section to explore the problems of anonymisation, especially the questions of whether such mechanisms remove all personal data from a data set or piece of information and to what extent they truly protect the data subject. The objective is rather to highlight the measures that reduce the legal risks. These are:

Imposing anonymisation obligations as an access condition
Imposing anonymisation obligations as a release condition

In both cases the objective is to extract the maximum value from the processing of the data, while maintaining its value.
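
A minimal sketch of what a pseudonymisation step applied before release might look like is given below. It replaces e-mail addresses and a caller-supplied list of names with stable placeholders, so that the language structure survives while the direct identifiers do not. The function `pseudonymise`, the simplified e-mail pattern, and the sample text are assumptions for illustration; real pipelines typically need named-entity recognition and manual review, and whether the result counts as anonymised in a legal sense is exactly the contested question noted above.

```python
import re

# Simplified e-mail pattern, adequate for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise(text, known_names):
    """Replace e-mails and the given names with placeholders (hypothetical helper).

    Returns the rewritten text and the mapping from original values to
    placeholders. The mapping is what allows re-identification, so it would
    have to be stored separately from the released text or discarded.
    """
    mapping = {}
    counters = {}

    def placeholder(value, kind):
        # Reuse the same placeholder for repeated occurrences of a value.
        if value not in mapping:
            counters[kind] = counters.get(kind, 0) + 1
            mapping[value] = f"[{kind}_{counters[kind]}]"
        return mapping[value]

    text = EMAIL_RE.sub(lambda m: placeholder(m.group(0), "EMAIL"), text)
    for name in known_names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        text = pattern.sub(lambda m: placeholder(m.group(0), "PERSON"), text)
    return text, mapping


# Invented sample sentence and name.
sample = "Maria Rossi wrote to maria.rossi@example.org; Maria Rossi signed."
released, key = pseudonymise(sample, ["Maria Rossi"])
```

The same function could serve either condition listed above: run by the data holder before granting access, or run by the processor before releasing the resulting resource.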

3. Shuffling or scrambling data: The objective of this strategy is to reduce access to the original information as such, i.e. as information addressed to a human reader. This is particularly relevant where the value of MP/MT does not depend on the complete sequence of words and sentences, or where the value of the LR does not depend on its human-readability. Such a strategy helps to avoid cannibalising the market of, e.g., a work of literature when it is distributed as part of a corpus under a copyright exception for language processing (e.g. in the UK) or under the fair use doctrine (e.g. in the US). It is important to note that such an approach would only have an effect in jurisdictions where the lack of competition with the original work is deemed a reason for allowing the use of the work without requesting permission from the rights holder. In other cases, it is less likely to pass the test of the normal exploitation of the work, since even the use of, e.g., the syntactic elements of a text could be considered a source of potential income for the rights holder.
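
One simple form of scrambling is to shuffle the order of sentences, so that each sentence remains usable for language processing while the work can no longer be read from start to finish. The sketch below assumes the text has already been split into sentences; the function name `scramble_sentences` and the sample data are hypothetical. It illustrates the technique only and makes no claim about its legal sufficiency in any jurisdiction.

```python
import random


def scramble_sentences(sentences, seed=None):
    """Return the sentences in a random order (hypothetical helper).

    The input list is left unchanged. Passing a seed makes the shuffle
    reproducible, which also means the seed should not be published if the
    original order is meant to stay unrecoverable.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


# Invented sample data standing in for a literary work.
original = [
    "It was a cold morning.",
    "She opened the door.",
    "Nobody was there.",
    "The letter lay on the step.",
]
scrambled = scramble_sentences(original, seed=42)
```

Sentence-level shuffling keeps word order within each sentence, which suits uses that depend on local context; uses that need only vocabulary statistics could scramble at a finer granularity.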

An interesting application of the aforementioned premises and strategies is that of the MetaShare NoRedistribution (NoRed) licences. This mechanism differs in purpose from the risk mitigation strategies, because it expresses the will of the content provider rather than a way for the re-user to use the content without asking for the permission of the rights holder; it is nevertheless indicative of a broader approach to the use of content for MP/MT purposes. More specifically, it indicates the type of re-use that content providers accept even when they are free to restrict (or open) the content as much as they desire. Under such licences the recipient may use the material and, in some cases, even make derivative works. However, the recipient may not further disseminate the work in its current form. These are cases where the content provider is not really interested in restricting the MP/MT market but rather in preventing the emergence of products competing with the content itself. Recognising that there are two different markets, or two different classes of use value, allows us to revisit the strategies suggested above and formulate a final generic rule on how the resources could be used: the content should be used in such a way that it does not affect the original market of the work. The means of achieving this is to ensure that the content is not available as such to the end user, whether through the provision of a service rather than the content, through its anonymisation, or through its distortion (scrambling).