Sunday, 14 July 2024

Ensuring Text and Data Mining Has a Future in Europe

Text and Data Mining: How the FutureTDM workshop highlighted that the draft exception must be improved for TDM to have a future in Europe

For the legal geeks among us, it is now old news that the European Commission, after promising to modernise copyright, issued a rather unhinged and disappointing copyright review proposal aimed at creating what it claims to be a ‘well-functioning marketplace’. Neighbouring rights (aka ancillary copyright) for media snippets, robocopyright-type content filtering of user-uploaded content, mandatory exceptions that can be overridden by Member States or in case of licensing deals (huh?), … you name it: the review has it.

There is, however, one small light at the end of that very skewed and scary-looking tunnel: the copyright review does include a mandatory exception for text and data mining (aka TDM) in its Article 3 (with additional explanations in Recitals 8 to 13), a crucial element to enable the use of modern techniques on copyrighted material. To show how important TDM is and what’s at stake, we put together a short video, which we encourage you to share. (Want to skip directly to our ‘magic recipe’ for a workable TDM exception? It is at the end of this post.)


Why is everyone in the research and innovation fields not throwing a party then? Well, because the proposal as drafted by the European Commission comprises considerable flaws, many of which were highlighted at the FutureTDM workshop.

Where the proposed TDM exception gets it right

Text and data mining is defined under Article 2 sub (2) as ‘any automated analytical technique aiming to analyse text and data in digital form in order to generate information such as patterns, trends and correlations’ and the proposed TDM exception basically reads:

Article 3

Text and data mining

  1. Member States shall provide for an exception to the rights provided for in Article 2 of Directive 2001/29/EC, Articles 5(a) and 7(1) of Directive 96/9/EC and Article 11(1) of this Directive for reproductions and extractions made by research organisations in order to carry out text and data mining of works or other subject matter to which they have lawful access for the purposes of scientific research.
  2. Any contractual provision contrary to the exception provided for in paragraph 1 shall be unenforceable.
  3. Rightholders shall be allowed to apply measures to ensure the security and integrity of the networks and databases where the works or other subject-matter are hosted. Such measures shall not go beyond what is necessary to achieve that objective.
  4. Member States shall encourage rightholders and research organisations to define commonly-agreed best practices concerning the application of the measures referred to in paragraph 3.

The proposal comprises four positive elements:

  1. There is an exception: this may seem ridiculous but seeing the lack of ambition of the proposed copyright review, one tends to count one’s blessings these days.
  2. The exception is mandatory, as opposed to the approach based on voluntary exceptions of the current copyright framework (as set out in the InfoSoc Directive), which results in a patchwork of implementations and total legal uncertainty in an online or cross-border environment.
  3. The exception explicitly states that contractual bypasses will not be allowed (art 3 par 2). Frankly, such a principle should be applied to all the existing exceptions, as one can hardly understand why policy makers spend months crafting exceptions, arguing over every comma, negotiating their scope, scale and detail, only to have all of that legislative work brushed aside by one obscure contractual clause that the parties at the table who do not hold the copyright often cannot negotiate. But let us rejoice that at least one exception will get the common sense treatment of ‘the law is worth more than a contract’.
  4. The exception is not limited to non-commercial activities. This is important as research activities, even within institutions such as universities, are often conducted through public-private partnerships or with some form of private funding, which makes any restriction to non-commercial use unworkable in practice.

Where the proposed TDM exception fails to deliver a positive outcome for Europe

The main legal shortcomings were highlighted in the presentation given by Lucie Guibault, Associate Professor at the Institute for Information Law of the University of Amsterdam, whilst the ‘security & integrity’ addition creates a major practical loophole in the entire legal provision:

Presentation by Prof. Lucie Guibault at the FutureTDM Workshop
  1. The beneficiaries of the TDM exception are too limited in scope (Article 3 par 1 & Recital 11): the beneficiaries should not be limited to ‘research organisations’ as this is detrimental at two levels: on the one hand, it excludes businesses from benefiting from this exception, at a time when a vibrant start-up community is looking into the potential of these new techniques, and on the other, it prevents individual researchers who are not affiliated with a research organisation from working independently if they need to use TDM with legal certainty. The latter group also includes investigative journalists, and their exclusion runs counter to the European Commission’s claim that it wants to promote ‘citizen science’.
  2. The purpose of use is too narrowly defined and could give rise to disputes (Article 3 par 1 & Recital 12): the proposed draft only covers ‘scientific research’, an extremely limited scope that could, even within the scientific community, lead to disputes between the proponents of soft sciences (social sciences) and those who only see the merit of hard sciences (natural sciences). For no obvious reason, it certainly excludes many innovative uses of TDM that benefit our society (or have the potential to do so).
  3. The types of material that are covered by the exception could be interpreted in an unduly restrictive manner (Article 3 par 1): can TDM be applied in an unrestricted manner to any type of minable content or does the exception only cover materials ‘associated with scientific publication’?
  4. The possibility for rightholders to neutralise the exception in practice through so-called security & integrity measures creates a gaping loophole for abuses (Article 3 par 3 & Recital 12): by allowing publishers to introduce arbitrary measures to protect the ‘security and integrity’ of their network, the effective use of TDM could simply be rendered impossible, or the use of the publishers’ own platforms could become the only viable alternative for researchers. There are already known cases of Captcha measures being imposed on researchers who want to download articles in bulk (which means algorithms cannot work, as human intervention is constantly needed), or measures whereby only one article can be downloaded every 20 seconds (which, as pointed out by Professor Ananiadou from the University of Manchester at the FutureTDM workshop, sounds like a lot but actually means you need 12 years to download 20 million documents). This loophole could allow rightholders to arbitrarily block access for researchers trying to conduct text and data mining. Safeguards in line with those put in place for ‘traffic management’ by telecom operators (see Article 3 par 3 of the Telecoms Single Market Regulation [EU 2015/2120]), with requirements of proportionality, efficiency, non-discrimination (for example between the security measures applied to researchers’ algorithms and those applied to the publishers’ own platform), etc., could be a good starting point to frame this measure.
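
The rate-limit arithmetic quoted in point 4 is easy to check. The short sketch below simply multiplies the number of documents by the enforced delay and converts the result to years; the function name and the 365.25-day year are our own assumptions for illustration, not anything taken from the workshop presentation.

```python
# Back-of-the-envelope check of the rate-limit figure quoted in point 4.
# Assumption: downloads run non-stop, one document per enforced interval.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def years_to_download(documents: int, seconds_per_document: float) -> float:
    """Return the years needed to fetch `documents` at a fixed delay per document."""
    return documents * seconds_per_document / SECONDS_PER_YEAR


# 20 million documents at one document every 20 seconds
years = years_to_download(20_000_000, 20)
print(f"{years:.1f} years")
```

Under these assumptions the result lands between 12 and 13 years, in line with the figure cited at the workshop, and that is before any downtime, retries or Captcha interruptions are taken into account.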

So what is needed?

The good news is that the Members of the European Parliament (MEPs) present at the FutureTDM Workshop certainly seemed aware that there was room for improvement and willing to tackle the issue. But let’s also be realistic: those were three very well-informed MEPs, out of 751 MEPs in total, so there is a lot of work to be done to inform their colleagues of the need for a proper TDM exception.

Whilst the UK opened the door in Europe for a TDM exception, the one they drafted is also far from perfect, if only because they felt that the existing InfoSoc Directive made it impossible for them to adopt a TDM exception that would cover commercial uses, hence making it skewed from the start.

Singapore, after introducing fair use a couple of years ago, is now also looking into introducing a TDM exception and, in doing so, is making some valid points in its consultation proposal (see pp. 34-35):

  • 3.64 We propose to create a new exception in the CA, which allows the copying of copyrighted works for the purposes of data analysis. The user of the work must have had legitimate access to the work in the first place (e.g. a subscription to an academic journal, or collating online articles which are not locked behind a pay-wall), and the exception would not differentiate between commercial or non-commercial activities, which means the final analysis can be commercialised. However, the exception is not intended to cover situations where commercial benefit came from the actual copies of the works instead of the analysis. An example is where someone copies the works to collate into a large database for sale as a service without doing any analysis on it.


Muthu works at a media monitoring company, which has taken on a project by a fast food chain to help determine customer sentiment towards their latest menu item. Muthu starts by collating any social media and food blog posts which mentioned the menu item’s name, as well as comments left on review websites and replies on the fast food chain’s websites and social media outlets. As part of the collation, he ends up making a copy of all of the posts, comments and reviews. He then uses his company’s proprietary tool to analyse the data and determine whether general customer sentiment was good or bad towards the new menu item. This sentiment analysis was then passed on to the fast food chain. Under the current CA, any of the people who had made the posts, replies or comments could potentially claim that Muthu did not ask their permission to make copies of their creative works. With the proposed exception, the copying of such creative works can be done without permission as long as the purpose is for data analysis. However, if Muthu’s company simply forwarded the copies of all of the posts, comments and reviews without analysing them, to the fast food chain, the exception would not apply.

In other words, here are the ingredients for the magic recipe:

  • Keep what’s good in the proposed TDM exception: it should be mandatory, should not distinguish between commercial and non-commercial uses, and should not be open to bypassing by contractual provisions.
  • Expand the scope and scale of the beneficiaries: the beneficiaries should be both natural persons (=human beings) and legal persons (=organisations), and should not be limited to research organisations.
  • Do not limit the purpose to scientific research, nor the scope of the minable materials.
  • Ensure that any security or integrity measures implemented by rightholders are open to rigorous scrutiny and abide by a set of parameters that prevent abuse.

Caroline is coordinator of the Copyright 4 Creativity (C4C) coalition. She is also the founder and Managing Director of N-square Consulting (N²), a Brussels-based public affairs firm. She is the author of ‘Survival Guide to EU Lobbying, Including the Use of Social Media’. [All content from this author is made available under a CC BY 4.0 license]