Provenance - How important is data quality to Generative AI?

In my last post I introduced the concept of Advanced Information Modelling (AIM) - a field of study that builds on decades of pioneering research to improve the quality of the data and information we manage.  In the posts that follow, I will increasingly explore how the tools and techniques from this field can be applied not only to improve how we manage our information, but also how we do analysis and how we use advanced tools like AI to support it.

For now, I want to highlight some of the current discussions around AI technologies, specifically the Large Language Models (LLMs) used so successfully in ChatGPT.  The purpose is to unpick how the current technology works, to get a better understanding of how and where AIM could be applied to improve certain aspects - particularly ‘provenance’: the ability to reach back to your source information so you can reference and attribute your thinking (a critically important aspect of analysis and decision making).

So let’s start with current LLMs and how they work.

A (very basic) overview of how an LLM works.
(Please note - I’m offering this as a basic summary of some highly technical concepts, drawing on this source: https://openai.com/index/extracting-concepts-from-gpt-4/. It is very much a work in progress and will likely need a few iterations to make it more accurate…)

  1. Data gathering and pre-population - In the beginning, some kind of coded activity takes a source query (the question or text you provide to ChatGPT, for example) and gathers source data relevant to that question/task from a broad range of data sources. For ChatGPT these sources are the internet, the data stores used to train it, and other internal sources. (At this point it is worth noting that these conventional data sources are all, most likely, based on conventional ‘3D-based’ information and data sources.)

  2. Neuron generation - The source data deemed relevant in part 1 is converted into specific neurons in an LLM neural network. At this point the connection back to the source data gathered in part 1 is effectively lost, as a new neuron is generated and/or linked to an existing neuron that correlates to the source (the toy sketch after this list illustrates the point). I suspect this is similar to a human brain and the neurons it contains: our brains form neurons based on the information we see and learn, not on the source information itself; what a neuron stores is a representation of something else. (This is an important point to reflect upon later as we come to understand the value of AIM, which at its heart proposes a different way of tracking the ‘things’ we think about, by defining their existence at a point in time using a ‘4D’ ontology.)

  3. Output generation - Based on the neural network populated in parts 1 and 2, the LLM (trained using the neurons from part 2) generates an output that is, in effect, the best reasoning it can creatively engineer from the neurons it contains and the algorithms that combine them. In other words, the neurons generated at point 2 are combined with other neurons in the ChatGPT LLM, and the algorithms applied in this process essentially mimic what a brain does: they put together the most likely building blocks of data and information to best answer the question/query provided in point 1.
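To make the provenance point in step 2 concrete, here is a deliberately toy sketch in Python - a simple bigram word model, nothing like a real transformer, with all names illustrative - showing how source documents get folded into shared statistics (a crude stand-in for neurons/weights) with no record kept of which document contributed what:

```python
# Toy illustration only - a bigram word model, not a real LLM.
# The point: once "training" folds documents into shared statistics
# (a crude stand-in for neurons/weights), the link back to the
# source document is gone.
from collections import defaultdict

corpus = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog sat on the rug",
}

# "Training": fold every document into shared counts.
counts = defaultdict(lambda: defaultdict(int))
for doc_id, text in corpus.items():
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1  # note: doc_id is deliberately NOT stored

# "Generation": repeatedly pick the most likely next word.
def generate(start, length=5):
    word, output = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break
        word = max(followers, key=followers.get)
        output.append(word)
    return " ".join(output)

print(generate("the"))
# The output blends both documents - and nothing in `counts`
# can tell you which source any given word came from.
```

A real LLM does this with billions of continuous weights rather than integer counts, which makes the loss of provenance even more thorough.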

Neural network icon by Becris (https://www.freepik.com/icons/neural-network)

With these basic principles of how an LLM works, we can start to see why they have been so effective for particular processes.

Combining neurons in this way emulates creativity and generates new outputs by essentially mimicking how biological brains mash together pieces of data. This can be used to write stories and poems, and also to help with proofreading and editing, all of which follow particular frameworks and processes that the neural networks have been trained on and can apply back to source pieces of text. It also shows just how good these tools are at editing tasks that are essentially quite basic but laborious for humans to conduct - particularly summarisation. Isn’t it great that a machine can now generate a 100-word summary of a 10,000-page document without you having to do it manually? (For more on the benefits of LLMs, please see here.)
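As an aside, here is roughly what that summarisation task looks like in code - a minimal sketch assuming the official OpenAI Python client, with the model name and prompt purely illustrative:

```python
# Minimal summarisation sketch using the OpenAI Python client.
# Assumes OPENAI_API_KEY is set in the environment; the model
# name below is an illustrative choice, not a recommendation.
from openai import OpenAI

client = OpenAI()

def summarise(document: str, max_words: int = 100) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": f"Summarise the user's text in at most {max_words} words."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

# Usage (a genuinely huge document would first need chunking to
# fit the model's context window):
# print(summarise(open("long_report.txt").read()))
```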

The issue of ‘provenance’ - how do I attribute sources?

The issue with LLMs (for analysis use cases at least) is: how do you provide accountability? In the fields of foresight and intelligence analysis, the ‘reach back’ to evidence to justify your assessments and recommendations is key. We describe this as provenance - the ability to associate a piece of data with its source.

So, if the summary I’ve given above of how LLMs work is correct, how does the process address the issue of provenance? How do we track back to the source information used to produce the neurons in the LLM? As in the example of the human brain, where is the log of the source data used to make the LLM’s output? In human terms, we are good at doing this manually: we keep records and we use footnotes - that is, we have editorial processes for associating additional data and information with the written output we produce. In intelligence analysis there HAS to be reach back to evidence, otherwise it is just fiction... or, to put it more kindly, the equivalent of a hunch or instinct.

But the crux of the point is this: how can we justify a decision if we do not know what recommendations or evidence it is based upon?
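For context, one common engineering workaround today (not AIM, and not a full answer) is retrieval-augmented generation: instead of relying solely on what is baked into the weights, relevant passages are retrieved at query time and their source identifiers are carried through to the answer. A toy sketch, with all names and the scoring method purely illustrative:

```python
# Toy retrieval-augmented generation (RAG) sketch. The provenance
# "fix" is simply that each retrieved passage keeps its source id,
# so the final answer can cite it. Scoring here is naive word
# overlap; real systems use vector embeddings.

documents = [
    {"id": "report-2023-07", "text": "Supply delays were caused by port congestion."},
    {"id": "memo-2024-01",   "text": "Port congestion eased after new berths opened."},
]

def retrieve(query: str, top_k: int = 2):
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d["text"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer_with_citations(query: str) -> str:
    passages = retrieve(query)
    # In a real system the passages would be handed to the LLM as
    # context; here we just show that source ids survive the trip.
    cited = "; ".join(f"{p['text']} [{p['id']}]" for p in passages)
    return f"Q: {query}\nEvidence: {cited}"

print(answer_with_citations("what caused the port delays?"))
```

The catch, of course, is that this only covers what is retrieved at query time; whatever the model ‘knows’ from its training remains unattributable, which is exactly the gap a deeper, ontological approach would need to address.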

And I think this raises the interesting question of where next. For example, where would it be useful to apply an ontology in this process - would it be an underlying training ontology for the neuron generation, i.e. a new phase inserted before the neurons are generated?

A ‘4D architecture’ for that could be extremely interesting... and this is where we are looking to leverage more of the ontological approach (outlined here by my associate Ross Marwood) to how we manage and process source data.
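To give a flavour of what I mean (and this is a loose, hypothetical sketch of my own, not Ross’s model, with all field names illustrative rather than taken from any published 4D ontology), a 4D approach would treat each ‘thing’ as having temporally bounded states, and attach provenance to assertions about those states:

```python
# A loose, hypothetical sketch of a 4D-flavoured record: each
# "state" is an individual with a temporal extent, and every
# assertion about it carries its source. Field names are
# illustrative only.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Source:
    doc_id: str        # e.g. a report reference
    retrieved: date    # when the evidence was captured

@dataclass
class State:
    of_thing: str              # the individual this is a state of
    begins: date               # temporal extent: start
    ends: date | None          # temporal extent: end (None = ongoing)
    assertions: list[tuple[str, Source]] = field(default_factory=list)

# A hypothetical example record:
port_2023 = State(
    of_thing="Port of Felixstowe",
    begins=date(2023, 1, 1),
    ends=date(2023, 12, 31),
    assertions=[("congested", Source("report-2023-07", date(2023, 7, 14)))],
)

for claim, src in port_2023.assertions:
    print(f"{port_2023.of_thing} was '{claim}' "
          f"({port_2023.begins} to {port_2023.ends}), per {src.doc_id}")
```

The key design point is that facts attach to time-bounded states rather than to the thing itself, so the evidence trail survives even as the thing changes.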

And in the next post I will try to explore the ‘temporal aspect’ that is unique to 4D ontologies and show how it potentially offers a new approach to how we model and understand our sources in analysis.
