Data Governance in the light of AI

While enterprise adoption of AI is becoming pervasive, few companies can claim to have established structured, repeatable data governance practices across the organization.

As a result, data, the raw material of machine learning, is too often insufficiently characterized. Information such as definitions, classifications, reference values, usage rules, quality levels, lineage and associated responsibilities is still too often incomplete or even non-existent.

Yet a reliable and sustainable approach requires addressing the following questions.

What? Why? How much?

What are the characteristics of the data used for machine learning (definitions, classifications, associated rules, reference values, etc.)? Why use this data? Is the data consistent with the intended use?

What biases can this data induce, and what impact could they have on predictive models?

Is the data complete enough for learning? How many cases are represented, and how diverse are they? Is the data consistent over time? What is its level of quality?

Who?

Who holds responsibilities (owner, steward, quality manager, etc.) for the data used for learning? Which stakeholders are in charge of controlling the produced results?

Where and how?

Where does this data come from? What is its lineage/provenance? How was it produced?

When?

When were the last updates made? How fresh is the data used? When was the data last used?
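The questions above lend themselves to simple automated checks. The following is a minimal sketch, not a governance standard: the record fields, thresholds and dataset are hypothetical illustrations of how completeness, diversity and freshness could be measured.

```python
from datetime import date

# Hypothetical training records; field names are illustrative only.
records = [
    {"gender": "F", "label": "hired", "updated": date(2020, 6, 1)},
    {"gender": "M", "label": "hired", "updated": date(2020, 7, 15)},
    {"gender": "M", "label": "rejected", "updated": None},
]

def profile(records, field):
    """Completeness and diversity of one field: the share of
    non-missing values, and the distribution of observed categories."""
    values = [r[field] for r in records]
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values)
    diversity = {v: present.count(v) for v in set(present)}
    return completeness, diversity

def freshness(records, field, as_of):
    """Age in days of the most recent update, or None if no record
    carries an update date."""
    dates = [r[field] for r in records if r[field] is not None]
    return (as_of - max(dates)).days if dates else None

completeness, diversity = profile(records, "gender")
print(completeness, diversity)
print(freshness(records, "updated", date(2020, 9, 1)))  # 48
```

Such a routine answers none of the "Why?" or "Who?" questions, which remain organizational, but it makes the "How much?" and "When?" questions measurable and repeatable.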

The prospects opened up by AI promise major benefits in all areas of industry and society. For companies, the identified gains operate at several levels: operational efficiency, innovation and risk management. In the public sector, the first applications in health, education and the environment point to major impacts. The White Paper published in February 2020 by the EU states: "Like the steam engine or electricity in the past, AI is transforming our world".

At the same time, there are examples of negative effects of AI use: discrimination against individuals and groups, or violations of rights (such as privacy). Emblematic examples include Amazon's AI-based recruitment solution that favored men, or the facial recognition solutions used by tech giants whose results vary according to gender and skin color. The study by Joy Buolamwini of the MIT Media Lab is revealing in this regard.

In practice, the learning quality of AI algorithms depends on the quality of the data. Several publications, including Thomas C. Redman's article "If Your Data Is Bad, Your Machine Learning Tools Are Useless", are clear on this subject. Indeed, as in the examples cited above, data may be incomplete (in the case of facial recognition) or reflect a historical reality that contains a bias which should not be reproduced in the future (for example, most people recruited for a given position in the past were men). Beyond quality, the explainability, efficiency and stability of the algorithms also depend on the data used.
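The historical-bias problem described above can be made visible before any model is trained, by comparing outcome rates across groups. This is a hedged sketch with invented data: the field names and values are hypothetical, and a real audit would use the organization's own records and fairness criteria.

```python
# Hypothetical historical recruitment records illustrating the kind of
# imbalance discussed above; all names and values are invented.
history = [
    {"gender": "M", "hired": True},
    {"gender": "M", "hired": True},
    {"gender": "M", "hired": True},
    {"gender": "F", "hired": False},
    {"gender": "F", "hired": True},
]

def positive_rate_by_group(records, group_field, outcome_field):
    """Outcome rate per group: a large gap between groups signals a
    historical bias that a model trained on this data would learn
    and reproduce."""
    counts = {}
    for r in records:
        g = r[group_field]
        n, pos = counts.get(g, (0, 0))
        counts[g] = (n + 1, pos + int(r[outcome_field]))
    return {g: pos / n for g, (n, pos) in counts.items()}

print(positive_rate_by_group(history, "gender", "hired"))
# prints {'M': 1.0, 'F': 0.5}
```

A gap like this does not by itself prove discrimination, but it flags exactly the pattern (past hiring favoring men) that, left uncorrected, an algorithm would perpetuate.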

A recent study commented on by Tom Davenport shows the rapid progression of AI across all industry sectors, and other studies confirm the trend. The benefits of AI, as well as its potential societal impacts, justify the effort companies must make to address data governance at the enterprise level. It is well worth the effort!

Shelemat DANIEL, September 2020