Observations
Historically, data analytics teams and enterprise knowledge management teams have been considered quite separate entities. Likewise, the disciplines for preparing and managing structured data, compared to those for ensuring unstructured data is stored in a way that meets compliance requirements while also being searchable, were considered far apart. To complicate matters, each of these disciplines is being sold hype-filled panaceas that promise to solve all their problems:
- data mesh for the analytics teams.
- management in place for documents and records teams.
To make matters worse, there is growing pressure to bring structured and unstructured data under one governance and management umbrella. This is driven by the impact of deep collaboration and Cloud data analytics. All of the factors outlined in IBRS’s post-paper metaphor model are coming true and proving to be extremely disruptive to both the teams managing data analytics and those managing documents and records.
Two Camps
Traditionally, data management and information and knowledge management were split into two camps:
Data
Historically, data management focused on structured data, such as that held in core enterprise solutions. Most of this data could be considered transactional in nature, meaning it is related to a specific activity or event.
The disciplines created to manage the data broadly sat within the field of business intelligence and, more recently (with the rise of self-service analytical solutions), analytics.
The disciplines for managing data traditionally focused on:
- Quality: making sure the data available was accurate and complete.
- Ingestion: obtaining data from different sources and bringing it into a common platform.
- Transformation: ensuring that data was formatted so it could be reported upon and analysed in a consistent manner.
- Storage: storing data in a structured manner to make reporting and analysis easier.
- Reporting and Visualisation: presenting the data.
As a result, for the last three decades, data management has been focused on reporting on and analysing structured data, which required investment in specialised technologies such as ETL (Extract, Transform, Load) and data warehouses. Performance was the key challenge as data sets grew in size and complexity. Significant investments in these technologies were made to ensure performance, and specialised skills evolved to support the data management infrastructure.
The advent of core solutions built on the hyper-scale Cloud has lessened the need for on-premises data warehouses and specialised reporting solutions. In addition, Cloud-native data analytics platforms have democratised data by making it readily available to almost any staff member as needed.
The result of hyper-scale data platforms is that the challenge for data management is no longer performance but governance – ensuring that only the right data is accessible to the right people for the right reasons.
Modern data governance now needs to balance privacy and security against the benefits of sharing and combining data so new insights may be gained (especially in relation to public sector, cross-agency data sets and marketing/personalisation).
Data governance now requires fine-grained knowledge of not just broad issues such as data quality and formats but also the sensitivity of specific items of data within a larger data set. For example, customer names may be considered private, but email addresses are not only private but also critically sensitive from a security perspective because they can be used to match different data sets more accurately. For this reason, data governance is quickly moving to embrace concepts such as data sensitivity labels and field-level permission management, ensuring that staff using modern analytics platforms (such as Power BI) cannot accidentally access or share sensitive data.
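The idea of field-level permissions driven by sensitivity labels can be illustrated with a minimal sketch. The label scheme, field names, and the `redact_row` helper below are all hypothetical and are not taken from any specific analytics platform; the sketch only shows the principle that each field carries its own label and is masked when the reader's clearance is too low.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical label scheme; higher values are more sensitive."""
    PUBLIC = 0
    PRIVATE = 1
    CRITICAL = 2


# Hypothetical field-level labels for a customer data set.
FIELD_LABELS = {
    "customer_id": Sensitivity.PUBLIC,
    "name": Sensitivity.PRIVATE,
    "email": Sensitivity.CRITICAL,
    "order_total": Sensitivity.PUBLIC,
}


def redact_row(row, clearance, labels=FIELD_LABELS):
    """Return a copy of row with fields above the caller's clearance masked.

    Fields with no label are treated as CRITICAL, so unlabelled data is
    hidden by default rather than exposed by accident.
    """
    visible = {}
    for field, value in row.items():
        label = labels.get(field, Sensitivity.CRITICAL)
        visible[field] = value if label <= clearance else "***"
    return visible


row = {
    "customer_id": 17,
    "name": "A. Citizen",
    "email": "a@example.org",
    "order_total": 42.5,
}
# An analyst cleared for PRIVATE data sees the name but not the email address.
analyst_view = redact_row(row, Sensitivity.PRIVATE)
```

Treating unlabelled fields as the most sensitive class is a design choice in this sketch: it means a newly added column stays hidden until someone has classified it.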
Structured and unstructured information assets can be placed along three dimensions:
- Structure: structured data and unstructured data sit on a continuum along one axis (they are not absolutes). For example, contract documents contain information such as values, dates, and terms that can be extracted and treated in a more structured manner. Metadata on documents is an attempt to link the benefits of structured data to unstructured documents.
- Temporality: the speed at which data changes, from fixed in time to highly collaborative and co-authored content. Traditional data management solutions mostly focus on transaction information that, while new information is constantly being created, is actually fixed to a point in time and thus static in nature. As collaboration tools have emerged, documents and spreadsheets have become shared, living, and continually changing sources of truth.
- Risk: from low risk to highly sensitive. Traditional approaches to securing high-risk static data have generally focused on access controls and cyber defence. In contrast, attempts to deal with the risk of unstructured information have generally been through information sensitivity labelling schemes, metadata management and governance programs. As the boundaries between structured and unstructured data blur, we also see sensitivity labelling and governance approaches applied across all information. This is why Microsoft combined various security and governance tools for structured and unstructured information assets within the Purview solution.
A record is a static point in time on the above 3-dimensional matrix.
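One way to picture this is to treat each information asset as a position on three axes and a record as a frozen snapshot of that position. The sketch below is illustrative only: the 0.0 to 1.0 scales, the class names, and the `capture_record` helper are assumptions made for the example, not an established model.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class AssetPosition:
    """Position of an information asset on three hypothetical 0.0-1.0 axes."""
    structure: float    # 0.0 = free-form document, 1.0 = fully structured data
    temporality: float  # 0.0 = fixed in time, 1.0 = continually co-authored
    risk: float         # 0.0 = low risk, 1.0 = highly sensitive

    def __post_init__(self):
        for name in ("structure", "temporality", "risk"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")


@dataclass(frozen=True)
class Record:
    """An immutable snapshot of an asset's position at one moment."""
    asset_name: str
    position: AssetPosition
    captured_at: datetime


def capture_record(asset_name, position, when=None):
    """Freeze the asset's current position as a record."""
    when = when or datetime.now(timezone.utc)
    return Record(asset_name, position, when)


live = AssetPosition(structure=0.3, temporality=0.9, risk=0.6)
record = capture_record(
    "supplier-contract-draft", live, datetime(2024, 1, 1, tzinfo=timezone.utc)
)
# The live asset keeps moving; the record captured earlier does not change.
live = replace(live, risk=0.8)
```

The point of the sketch is the asymmetry: the live asset is replaced with a new position as it changes, while the record keeps the position it was captured with.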
Management in place is treating dynamic data as static. It does not work.
A single source of truth is just impractical.
People worry too much about all of this data. What is needed is a map of the ever-expanding data and information ecosystem, so that IM managers can focus on what is a priority and will make the biggest difference, rather than trying to get everything right. Between 60 and 80 per cent of information and data is ROT (redundant, obsolete, or trivial), and too risky to keep beyond a specific time.
Manage in place vs single source of truth are absolutisms – and you’ll have both.
Discovery is to manage risk and manage compliance.
Information, Documents, Knowledge, and Records
Over time, these terms have become confused to the point of being interchangeable, mostly describing unstructured and semi-structured assets: documents, PDFs, contracts, and so on. For the purposes of this paper, I’ll refer to these as information assets. Despite a few nuances (described below), all have been managed and governed in similar ways, and generally, the enterprise information team (or a similarly named team) was responsible for all four:
- The term information is sometimes used as a bridge between data (structured information assets) and document/knowledge (unstructured information assets).
- The term document describes an unstructured digital information asset in any media type, most commonly textual.
- The term knowledge is used to describe all unstructured information assets and sometimes repositories of semi-structured data, such as spreadsheets. It is generally used to describe the governance practices supporting documents and records. Importantly, it includes soft assets, such as staff experience, skills, and insights. Knowledge is best viewed as the interaction of information assets and people. As AI evolves, these soft assets become more important (which will be examined in a future paper).
- The term records is specialised: it refers to storing both unstructured (documents) and semi-structured transaction information (say, a purchase order form generated from data in a core system) in a knowledge management solution. There was a period during the ’90s to the early 2000s when vendors of electronic documents and records management solutions (EDRMS) pushed the idea that all information assets – including the output of core solutions – should be stored in their solutions as a single source of truth. This has been reborn as manage in place. However, this has proven to be only partly successful, at best.
The disciplines to support these information assets have traditionally focused on compliance and discovery.
For compliance, information needed to be defined as having retention and disposal periods (a lifecycle) and access controls. Policy and legislation heavily influence compliance. A key technology for meeting compliance is EDRMS, a central repository for information assets.
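A retention and disposal lifecycle can be reduced to a simple date calculation. The following is a minimal sketch, assuming a rule of the form "keep for N years after a trigger date"; the rule name and the seven-year period are illustrative only and do not reflect any particular policy or legislation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical retention class: keep for N years after the trigger date."""
    name: str
    retain_years: int


def disposal_date(trigger, rule):
    """Return the first date on which the asset may be disposed of."""
    try:
        return trigger.replace(year=trigger.year + rule.retain_years)
    except ValueError:
        # 29 February has no equivalent in a non-leap year; use 1 March.
        return date(trigger.year + rule.retain_years, 3, 1)


def is_due_for_disposal(trigger, rule, today):
    """True once the retention period has fully elapsed."""
    return today >= disposal_date(trigger, rule)


# Illustrative period only; real periods come from policy and legislation.
FINANCIAL = RetentionRule("financial-record", retain_years=7)
```

In practice the trigger date is often an event (contract end, account closure) rather than the creation date, which is why it is passed in explicitly here.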
For discovery, the focus has been to place information into a logical hierarchy, with a taxonomy defining where (which digital folder or location) an information asset should be located. Metadata is often applied to information assets to describe their taxonomy.
What’s Changed, and Why Structured and Unstructured Data Are Coming Together?
The legacy approaches to data management were based on computing and storage constraints, which no longer apply. New approaches focus on making data and analytics capabilities broadly available. The concept of data mesh is being touted as a future model for managing data. However, IBRS has noted that data mesh is not appropriate in its entirety for many organisations.
The rise of deep collaboration, real-time co-authoring, and the post-paper metaphor era means that traditional approaches to managing documents are breaking down. Concepts such as manage in place, where information assets are managed directly within whatever digital platform they were created in rather than being placed into a single repository, are proving difficult to implement at this time.
Looking forward at least a decade, IBRS sees that these two disciplines (and the teams) that constitute data management and information/records/document/knowledge management will slowly merge around several key technologies and practices:
Guiding Principle for Future Knowledge Management
Zero trust information assets: rather than only protecting them by placing them in protected environments (protected network, secured application), information assets will be encrypted. Even when information assets leak, they will be unreadable. Organisations will use authenticated access to determine if a specific person has the right to decrypt and access the information asset. Examples of this type of approach are Microsoft’s Azure Information Protection, Conditional Access and Data Loss Prevention services.
To live in a modern knowledge management zero trust world you need…
- Understanding the Scope in Real-Time: Automated Discovery. This capability will sweep all repositories of structured and unstructured data and locate information assets. This will not only create a master catalogue of information assets but will also perform automated metadata labelling. It must be continual, not once-off, and become part of normal practice. It is not to be confused with e-discovery.
- Understand Risk in Real-Time: Automated Sensitivity Labelling. Similar to metadata labelling, but focused on identifying any information assets that should be treated as sensitive or private. This also requires the organisation to create clear definitions of data risk, which are intimately connected to defining access rights (see zero trust information assets above). IBRS believes that over time, data sensitivity labels will be uniformly applied to both structured and unstructured information assets. An example of this is the intended roadmap for Microsoft Purview.
- Understanding What You Have and How to Find It: Automated Metadata Labelling. Information assets will be interrogated (likely using AI algorithms) to determine the metadata (taxonomy), so the information may be easily categorised and searched. In addition, this process will also apply data sensitivity labels (see above). While this has traditionally been related to search (because of poor search performance), it will now be focused on finding value, e.g. contractual terms, rebates, supply contract optimisation.
- Understand Where You Need to Focus: Information Asset Mapping. The explosive growth in information sources (especially collaborative solutions), the increased volumes of information being created, and the complexity of the information being created (with different media forms) mean that data analytics and document management teams need to focus only on the key risk areas for the organisation rather than attempting to do it all. An emerging practice will be to create maps of information risk. Such maps will show information asset teams which solutions, divisions, and processes are creating data that needs to be closely managed and/or protected and where the information assets are being created. This allows the information asset teams to focus on the critical information assets rather than the (estimated) 60+ per cent of information assets that have little risk or long-term value to the organisation.
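The labelling and mapping capabilities above can be sketched together in a few lines. This is a deliberately simplified illustration: the two regular expressions, the source names, and the helper functions are assumptions for the example, and a production classifier (AI-based or otherwise) would be far richer. The sample list stands in for the output of a continual discovery sweep.

```python
import re
from collections import Counter, defaultdict

# Hypothetical detection rules; real classifiers would be far richer.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "payment_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def label_asset(text):
    """Return ('sensitive', matched rule names) or ('general', [])."""
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    return ("sensitive", hits) if hits else ("general", hits)


def build_risk_map(assets):
    """Count labels per source system.

    assets is an iterable of (source, text) pairs.
    """
    risk_map = defaultdict(Counter)
    for source, text in assets:
        label, _ = label_asset(text)
        risk_map[source][label] += 1
    return risk_map


def priority_sources(risk_map):
    """Order sources by their share of sensitive assets, highest first."""
    def sensitive_share(source):
        counts = risk_map[source]
        return counts["sensitive"] / sum(counts.values())
    return sorted(risk_map, key=sensitive_share, reverse=True)


sample = [
    ("team-chat", "Lunch at noon?"),
    ("team-chat", "Send it to jo@example.org"),
    ("crm", "Card 4111 1111 1111 1111 on file"),
    ("crm", "Contact: sam@example.org"),
    ("intranet", "Office closed Monday"),
]
risk_map = build_risk_map(sample)
focus_order = priority_sources(risk_map)
```

Ranking by share of sensitive assets, rather than by raw volume, is one possible choice; it points the team at the sources where most of what is created needs protection, which is the prioritisation the mapping practice is meant to support.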
Next Steps
- Refine and be clear on definitions for data, information, documents, records, and knowledge:
  - Data: the atomic structure of enterprise information assets. It is structured and often transactional in nature.
  - Information: a generic term that often refers to both structured and unstructured information assets. For clarity, IBRS now uses information assets to refer to any digital information, structured or unstructured.
  - Documents: unstructured enterprise information assets.
  - Records: evidence of transactions that need to be managed as assets. They are not just paper assets in electronic form. They may need to be stored for legal, regulatory, research, or analytics purposes, and they have a shelf life: a period of time after which they may need to be disposed of.
  - Knowledge: the concept of managing information assets for business outcomes.
- Begin preparing for the future:
  - The days of converting data into paper-equivalent documents (e.g. invoices, purchase orders, etc.) and storing the results in document management platforms are slowly coming to a close. Digital workflows, e-forms, and e-signatures will slowly negate the need for such documents.
  - Data analytics is being rapidly democratised due to Cloud-based analytics services. The focus for data analytics teams needs to move to more nuanced data governance.
  - Deep collaboration and a move away from paper-based processes (and thinking) will break current information asset management practices. In their place will be a focus on automated discovery, labelling and data sensitivity.