New Skills for AI-Enabled Insights: Data Curation
- Seth Hardy
- Nov 25, 2024
- 4 min read
Updated: Jan 21
You are missing an important skill that will be critical to success in the era of AI-enabled Insights.
That’s the bad news.
The good news is that nobody else in the industry currently has the skill, so you are not behind; there is time.
The skill I’m talking about is Data Curation. If you’ve spent any time in the industry you will be familiar with terms like Data Collection, Data Processing and Data Analysis, but have you heard of Data Curation?
It’s not a new term. In fact, it is a core part of the academic discipline of Library Science and was defined nearly 20 years ago, in a presentation delivered at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign, as the “active and on-going management of data through its lifecycle of interest and usefulness…; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time."
This is a good general description, but what does it mean in the context of Insights work in the AI era?
I would suggest it means three things:
First, “Garbage In, Garbage Out” is the fundamental principle of using AI to analyze data. Think of the AI tool(s) you are using like a car that has been shifted into drive: it will move forward on its own, and it is up to the user to steer it in the right direction and apply the brakes when necessary.
As researchers, we can provide the “direction" by deciding what information to put into the AI tool and ensuring that its structure, format and labelling mesh with the way the tool works, so that it produces high-quality, accurate outputs.
For example, let’s assume you are using an AI tool with a “Knowledge Hub” feature that allows you to upload various sets of data and analyze them in aggregate.
In any large data set there are bound to be outliers, off-topic responses or otherwise poor ones. Instances of “bad” data in your upload can only hamper the analysis and muddy potential outputs.
This data should be removed prior to putting it into the tool so that it is not part of the analysis, much in the same way that a human analyst might ignore or remove certain responses to ensure they don’t negatively impact or otherwise distort the findings. I’m not thinking here of “bad” open ends that indicate fraud or which are the result of data quality processes, but of ones that simply don’t provide analytical value for whatever reason.
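As a concrete illustration of this pre-upload cleaning step, here is a minimal Python sketch that filters out low-value open ends before they reach an AI tool. The stop-word list, word-count threshold and sample responses are all illustrative assumptions, not part of any specific tool:

```python
# Sketch: pre-upload curation of open-ended survey responses.
# The LOW_VALUE set and min_words threshold are illustrative assumptions;
# a real project would tune these to its own data.

LOW_VALUE = {"n/a", "na", "none", "nothing", "idk", "don't know", "no comment"}

def is_analytically_useful(response: str, min_words: int = 3) -> bool:
    """Keep responses long enough to carry meaning and not a stock non-answer."""
    text = response.strip().lower()
    if text in LOW_VALUE:
        return False
    return len(text.split()) >= min_words

responses = [
    "The onboarding process was confusing and took too long.",
    "n/a",
    "idk",
    "Great product, but customer support response times are slow.",
]

# Only the two substantive verbatims survive curation.
curated = [r for r in responses if is_analytically_useful(r)]
```

The point is not the specific rules but the habit: decide explicitly what does not belong in the upload, and remove it before the tool ever sees it.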
The second key element to Data Curation is labelling, by which I mean ensuring that you are clearly defining what the information you are uploading relates to in order to aid eventual analysis. If, for example, you are uploading open ended responses from a survey this means ensuring that the particular question or topic to which the text responses refer is included in the information that is uploaded.
While there is value in running analyses at true “aggregate” level, the way data of this sort gets analyzed is generally at the level of question or topic and you will want the data asset you are creating to be able to respond to specific questions about specific issues in a focused and accurate way.
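To make the labelling idea concrete, here is a small Python sketch that attaches question metadata to each verbatim before upload, so the resulting data asset can answer question-level queries. The field names and example question are illustrative assumptions, not any particular tool's schema:

```python
# Sketch: labelling open ends with their originating question before upload.
# The dict keys ("question_id", "question_text", "response") are an assumed
# schema for illustration only.

def label_responses(question_id: str, question_text: str,
                    responses: list[str]) -> list[dict]:
    """Attach question metadata to each verbatim so topic context travels with it."""
    return [
        {
            "question_id": question_id,
            "question_text": question_text,
            "response": r,
        }
        for r in responses
    ]

labelled = label_responses(
    "Q12",
    "What, if anything, would improve your experience with our service?",
    ["Faster delivery times.", "A cheaper subscription tier."],
)
```

With the question travelling alongside every response, an analysis of "what do customers want improved?" can be scoped to exactly the right verbatims rather than the whole undifferentiated pile.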
A good rule for this is what I call the “analyst test.” Think of the types of questions that would likely get asked of this data and the level of specificity at which they are likely to be asked. Then ask yourself the following question: “Would a human analyst be able to answer those questions based on the format and structure of the uploaded data?”
If the answer is “yes” then you have done your job; if the answer is “no” then this points to a shortcoming in the tool, the way you have organized the data, or both.
The third and final element of Data Curation will be providing feedback. One of the key promises of LLM-based AI tools is that, like a human, they can “learn” based on feedback.
For example, think of the way that ChatGPT allows users to provide a “thumbs up” or “thumbs down” on responses. This informs the tool of what you, the user, believe to be “good” or “correct” and impacts future responses and outputs.
When using AI tools it should be understood that, like most things in life, you don’t achieve perfection on the first try. This is why we assess, revise, proofread and double check our work.
As the “human in the loop” who is driving the Data Curation process, we have the opportunity, ability and responsibility to “train” the tool(s) we are using in order to get the best possible results in the same way that we might train a new colleague.
To use a slightly different example, let’s assume we are using an AI-enabled tool to track media mentions of a particular educational institution and its core competitive set in order to build a dynamic data asset that can be used to understand the institution’s brand and positioning relative to competitors.
In this instance, it may make sense to “train” the tool not to pull in local media articles and only pull in data from national media and trade publications. Why? Because in many instances, the vast majority of the students of the school come from outside the specific area in which the school is located, the school’s main competitors are not located in the same area and local media coverage tends to focus on “town vs. gown” issues.
While the information being reported on in the local media might be perfectly valid, it doesn’t serve the objective of understanding what a typical prospective student and their family might be exposed to when considering the school.
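A curation rule like this can be expressed very simply. The sketch below keeps national and trade coverage and drops local outlets; the outlet names and tier sets are illustrative assumptions standing in for whatever source list a real tracking project would maintain:

```python
# Sketch: a "train the tool" curation rule that retains only national and
# trade coverage. The outlet tiers and sample mentions are illustrative
# assumptions, not real tracking data.

NATIONAL = {"The New York Times", "The Wall Street Journal"}
TRADE = {"Inside Higher Ed", "The Chronicle of Higher Education"}

def keep_mention(mention: dict) -> bool:
    """Retain only mentions from national or trade outlets."""
    return mention["outlet"] in NATIONAL or mention["outlet"] in TRADE

mentions = [
    {"outlet": "Inside Higher Ed", "headline": "Enrollment trends shift"},
    {"outlet": "Springfield Gazette", "headline": "Campus parking dispute"},
    {"outlet": "The New York Times", "headline": "Rankings under scrutiny"},
]

# The local-paper story is dropped; the national and trade stories remain.
curated = [m for m in mentions if keep_mention(m)]
```

Whether the rule lives in code like this or in a tool's feedback settings, the design choice is the same: the curation logic should reflect the research objective, not just the raw availability of data.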
These points all serve to reinforce the central point that I make frequently about using AI in Insights: the technology is just a tool; it is up to us to use it well.
P.S. I realize that this brings up the specter of bias and error, but this is a topic I will address in a future post.