Legal framework of textual data processing for Machine Translation and Language Technology research and development activities/Methodological approach

Methodological principles

LR-based MT & MP require the collection of large amounts of information, the creation of compilations and databases, simple or annotated, in the form of corpora, their processing and the provision of services on the basis of such information. In addition, the nature of MT services requires that the texts used for MP be representative of a particular domain or activity. For instance, legislation, court decisions and other regulatory texts are necessary primary material for the provision of MT services in the area of law, engineering manuals or other technical descriptions may be useful for MT services in an engineering sector, and literature or news may be useful for literary or generic MT services. As a result, for any MT service, or indeed any LR-based service, to be provided, it is necessary to have access to the largest possible quantity of material that can be machine processed. In practice, this means that such raw material will almost invariably constitute some form of work covered either by a property right, as is the case with copyrighted documents, or by other types of rights, as is the case with transcriptions of phone conversations, which are very likely to contain personal data.

It is also to be expected that a single set of data/information/works may constitute the subject matter of different types of rights. For instance, in the previous example, the same set of transcriptions of phone conversations constitutes a literary work protected by copyright, contains personal data that have to be processed under the data protection rules and, finally, may contain confidential information that should not be divulged beyond the persons having the conversations. Similarly, almost all Public Sector Information will be copyrighted subject matter, and it is very likely to contain personal data or other forms of confidential information.

It is, hence, necessary to trace the flows of rights in a paradigmatic case of MT services and to set out the key questions that have to be asked irrespective of the specific legal regime involved. This task is carried out in this section. The next step is to explore these issues further with regard to the specific types of legislation identified above, namely: IPR, Data Protection, Personal Data Protection, Confidentiality and Public Sector Information Regulation.

Another important point to make at this stage is that different types of content are treated differently in legal terms and, consequently, the terminology used to describe such content varies. For instance, a document may constitute a work in terms of copyright law, but may also contain personal data, be confidential information or be part of a broader range of public sector information. We will use the term “content” here, as it is more generic and appears neutral with respect to the legal regime.

LR Life-cycle

Figure II: Three Stages

We may identify the following three stages (see Figure II) in the life cycle of the content that can be machine processed in a case of Machine Translation service provision.

Stage I: Content Collection

At this stage, the content is collected from various sources. This is an important moment, as different sources come with different sets of rights. In all cases, the aggregator/collector of the content has to establish the following (Figure III):

- Whether the content is provided by the content provider under a licence, a set of terms and conditions or any other prescription as to how the content is to be used.
- Whether there are any special rules that allow the content aggregator/collector to get access to that content without obtaining permission. In these cases, it is necessary to be very clear about the conditions set by the law that actually allow the content to be obtained without the consent of the content provider. The content aggregator/collector can assess the extent to which such acts may be performed on the basis of what will take place during stages II and III, i.e. on the basis of how the content is to be processed and then further disseminated.

If there is an open licence in place, the LR-based MP/MT service provider only needs to check whether the processing of the LR is in accordance with the terms of the licence.

Where there are no conditions, or where the conditions are not clear, the content aggregator/collector will have to request permission in a way that is well documented as regards both the communication and its result. The licences, consent or other permissions obtained by the content aggregator/collector have to be assessed with regard to the acts of processing and dissemination that will take place at subsequent stages. They may also be amended at a subsequent stage, if the acts of processing or dissemination exceed those originally envisaged and communicated when the licence was obtained.

Figure III: Rights Clearance

A general note on the stage of content collection is that, though it precedes the stages of content processing and dissemination, it cannot be properly assessed if the range of processing and dissemination acts has not been defined or at least projected. This is because both the case of performing acts without asking permission and the case of obtaining licences/consent have to be assessed on the basis of the intended processing and dissemination.
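The Stage I questions can be read as a small decision routine. The following Python sketch is illustrative only and is not a legal test: the names `Source`, `Clearance` and `clear_rights`, and the act labels such as "copy" or "redistribute", are hypothetical, and real licences or statutory rules rarely reduce to a flat set of permitted acts. It only shows that the outcome depends on the intended processing and dissemination acts, which therefore have to be known first.

```python
from dataclasses import dataclass, field
from enum import Enum


class Clearance(Enum):
    COVERED_BY_TERMS = "intended acts fall within the licence or terms"
    COVERED_BY_LAW = "intended acts fall within a rule allowing use without permission"
    REQUEST_PERMISSION = "request (or amend) a documented permission"


@dataclass
class Source:
    name: str
    # Acts the provider's licence or terms allow (hypothetical labels).
    licensed_acts: set = field(default_factory=set)
    # Acts a special legal rule allows without consent (hypothetical labels).
    exempted_acts: set = field(default_factory=set)


def clear_rights(source, intended_acts):
    """Return the clearance route for the acts planned in stages II and III."""
    if not intended_acts:
        # Collection cannot be assessed before the later acts are projected.
        raise ValueError("define or project the processing and dissemination acts first")
    if intended_acts <= source.licensed_acts:
        return Clearance.COVERED_BY_TERMS
    if intended_acts <= source.exempted_acts:
        return Clearance.COVERED_BY_LAW
    return Clearance.REQUEST_PERMISSION


corpus = Source("example-corpus", licensed_acts={"copy", "derive"})
print(clear_rights(corpus, {"copy", "derive"}).name)
print(clear_rights(corpus, {"copy", "redistribute"}).name)
```

In this simplified model, adding a new act at a later stage (here "redistribute") moves the same source from "covered" to "request permission", which mirrors the need to amend permissions when processing or dissemination goes beyond what was originally communicated.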

Stage II: Content Processing

At the stage of processing, all necessary permissions have been obtained or the fact that there is no need for permissions has been established. However, as mentioned above, it is essential to establish the range of processing acts before the permissions are sought or a judgement is made as to whether the processing may take place without any permission. The most frequently attested acts of processing are as follows:

- Copying: it is impossible to make use of any LRs without copying them and hence, almost invariably, MT & MP will activate the provisions of copyright law.
- Making derivative works: in order to provide MT services, it is necessary to process the collected content so as to create different forms of derivative works, e.g. n-grams or different forms of corpora. These transformative uses of content mostly activate provisions of copyright law, but could also have implications under data protection or other regulatory regimes.
- Linking with other works: linking of content with multiple types of works. Such linking does not necessarily activate copyright law, but could have data protection implications.
- Other types of processing: almost any type of processing will activate at least one type of regulatory regime. Data protection regulation covers the broadest range of acts, as almost every act will amount to processing. However, whether a specific regulatory regime is activated will depend on other factors as well, such as whether the data are personal or confidential, the type of subject matter and the permissions that have been obtained in one way or another.
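To make the n-gram example of a derived form of content concrete, here is a minimal Python sketch of n-gram counting. The function name `ngram_counts`, the sample sentence and the whitespace tokenisation are assumptions made for illustration; production MT pipelines use more elaborate tokenisers. The point is only that the derived table is produced by copying the text and still carries short verbatim fragments of it.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams in a text using naive whitespace tokenisation."""
    tokens = text.lower().split()
    # Slide a window of n tokens over the text; each window is one n-gram.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


counts = ngram_counts("the court held that the court erred", n=2)
for ngram, freq in counts.most_common(3):
    print(" ".join(ngram), freq)
```

Whether such a table amounts to a derivative work, or contains personal data, remains a legal question to be answered case by case; the code only shows what the processing act does to the content.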

Stage III: Content Dissemination and Re-use

This is the last stage in the life-cycle of the content collected for LR processing and at the same time a stage that may re-initiate the life-cycle through the re-use and enrichment of the relevant content. More specifically:

As with stage II, the types of dissemination largely define the type of permissions that should be sought in stage I. For instance, if the content is not to be shared as such, the legal risks are substantially lower than in cases where the content is further made available as such. If the content is to be made available for re-use, the range of permissions to be sought will be far greater than when the content is simply made available for end use (i.e. not for derivative uses) or is not made available at all.
Overall, we may identify the following types of content dissemination:
- The content is used for Machine Training and is never disseminated as such. However, the end result of the Machine Translation service may be based on the original content or may even comprise original parts of such content.
- The content is offered as part of a service, but in a transformed form, and hence it will not necessarily constitute a derivative work or otherwise activate copyright law.
- The content is offered as a data-set, possibly with some form of amendments and annotations, but remains clearly identifiable as the original material.

Depending on the type of content delivery, the legal regimes activated and the type of permissions required differ. The course of action has to be decided on a case-by-case basis.

Finally, the MT service provider will have to decide on two additional issues:
- the licence under which the content (that is, the automated translation output) will be offered;
- the terms under which the service is going to be offered.