Life Science Data Landscapes and Data Strategy: Overview and Opinions

Kevin Garwood
17 min read · Mar 3, 2024

Disclaimer: I have written this article to share my own views and to foster discussion that interests me professionally and personally. These views are not meant to reflect those of my employers, past or present.

‘Data strategy’ is a concept of increasing interest in the context of drug development use cases. Such use cases are expensive, data-intensive, long-term, and high-risk endeavours that may benefit from data strategies. There is no single definition of ‘data strategy’, but I define it as a plan for using data to provide future economic benefit, in the context of an organisation’s goals. By economic benefit, I mean that data use will either make money or save money.

This may sound detached in the context of the patient experience, and indeed health and biomedical data can provide great social, historical, scientific, and cultural value. However, the high cost of provisioning and using data to support drug development use cases warrants a focus on economic value. Economic value is also a useful driver in data management, where there is an emphasis on treating data as an asset, preventing data from becoming a liability, and rationalizing the resources spent on it.

Strategies are high-level corporate assets themselves, and ideally they are maintained through living documents that reflect an organisation’s responses to changing business priorities. In theory, a high-level corporate strategy informs a business strategy, which in turn drives a supporting data strategy. That data strategy would co-evolve alongside sibling strategies such as an intellectual property strategy, a marketing strategy and a technology strategy.

In practice, some of these strategies may not be explicitly documented or maintained as active corporate artefacts that are adequately socialized throughout an organisation. They can take significant effort to produce. They can also reflect an even greater effort to build an appropriate network of stakeholders that commits to and engages with strategy development. Small, early-stage organisations can experience so much pressure to deliver a lot early that the drive to produce can eclipse the interest in documenting the trajectory of that drive. Large, established organisations may have difficulty in marshalling a widening range of stakeholder interests through a corporate hierarchy.

Organisations that do manage to maintain accurate strategies and sub-strategies through living documents can achieve great value by doing so. However, the absence of these explicitly defined artefacts should not deter data management staff from evolving a data strategy anyway. They may need to make their best guess at synthesising what parts of other strategies would look like and get on with planning. Ultimately the success of any strategy will depend on buy-in from C-suite staff, but there is a lot of groundwork that can be covered.

Data Governance: Stick, Carrot and Stealth Improvements for Supporting Data Stewardship

The road to providing and using data for drug development projects is often a long one. Once the ‘why’, ‘who’, ‘what’, ‘when’, ‘where’ and ‘how’ questions for provisioning data have been answered, the most important business area to grow will be data governance. In my opinion, the main currency in life science projects is not the trade of promising drug targets but the trade in something much less considered: trust. Most of the scientific projects I have been involved with, whether in software development or data management contexts, have been about making systems that model trust.

Most life science data sources have inherent sensitivities in that they can describe highly personal data. Furthermore, creating, interpreting and managing that personal data takes intellectual effort from many skilled people across different domains. If an organisation cannot appreciate these concerns of prospective data providers, it will not get access to good data.

In the context of data strategy concerns, not getting the data at all can incur the cost of project failure. Getting data late can incur the cost of idle staff and technical infrastructure. Getting the data in a timely fashion can help ensure project success. Actively attracting collaborators to share their data in projects can become a competitive advantage.

Data management staff are faced with the data governance dilemma of how to convey the importance of cultivating trust for holding sensitive data to the wider organisation: do they emphasise the penalties of poor data governance as a liability or do they emphasise the benefits of good data governance as an asset?

Most interest in data strategy is driven initially out of a need to satisfy compliance and much later out of a need to harness trust as a competitive advantage. Data governance activities should be regarded as a means of supporting data strategy goals, rather than as an end in themselves. However, for those activities to be a viable means, they need a minimum investment in business processes and infrastructure. In conversation spaces dominated by the excitement of science and the imperative of product development, emphasising the penalties for non-compliance can sometimes be the most practical way to initially garner attention and priority for data management.

However, leveraging the fear of punitive consequences for non-compliance is counter-productive in the long run. It encourages a reactive hierarchy of future blame rather than a proactive network of data stewards and data owners. Staff will focus on reading the lines of text rather than reading between them. A fear of embarrassment or punishment tends to make them fall silent when scenarios not described in the rules arise. In drug discovery work, scientists require a healthy level of autonomy to exercise their creativity in finding prospective drug targets. That autonomy makes it impractical to micro-monitor all of their activities. It is most effective to empower them through principles, and to develop a rapport that makes them feel comfortable coming to data management staff with questions about usage.

As organisations evolve, they will realise that trust becomes a competitive advantage that is nurtured by the reputations of their scientists, the diligence of their legal, security and data management staff, and the messaging of their communications staff. Part of good data management is taking reasonable actions to steward data and another part of it is documenting evidence of those actions.

Demonstrating the ethics of a well-nurtured data culture to data providers is not just a minimal requirement for access; it can also end up being a competitive differentiator. As the field of life science moves from the era of Big Data to the era of Good Data, it becomes clear that the promise of benefits from ever-evolving algorithms is ultimately limited by the quality of the patient data sources on which they depend. It is much easier to change an algorithm than it is to change the underlying historical data on which it may rely.

For example, the need to track patients’ disease progression requires longitudinal data sets that can take great effort to collect and are difficult, if not impossible, to retrospectively enhance. My prediction is that the great expense of generating high quality data sets for life science may produce a marketplace of data-consuming organisations that will need to vie for data custodian trust to grow their own downstream intellectual property assets.

Managing Data as an Asset in Early Drug Discovery versus Clinical Trial Activities

At the heart of data management is an interest in treating data as an asset. Within life sciences, I have observed two very different environments, each with its own factors that can support and shape this goal. In my view, there are multiple reasons why it is easier to promote data-as-an-asset in the realm of clinical trials than in early drug discovery.

Greater regulatory compliance in trials justifies greater effort in data management practices. Both drug discovery and clinical trial activities will need to comply with data protection regimes such as GDPR and be mindful of potential duty of care considerations. Drug discovery activities may have to heed data licensing conditions that merit data governance effort. However, trials will likely have to do more such work to satisfy regulations required by drug approval bodies.

Differences in perceived data access risks can influence how available data are for staff. Drug discovery organisations can opt to rely entirely on de-identified data sets from data providers and assume that lower re-identification risks make it easier to share more widely with staff. Trials will necessarily gather more personally identifiable information about participants because they are monitoring an intervention and will need to contact them. They also need to restrict and monitor data access to support the validity of conducting blinded trials.

Discovery environments will tend to favour broader scopes of purpose and wider access. For trials, there is an emphasis on generating data specifically to support prescribed questions and to support analysis activities that are documented in an explicit protocol. Analyses will tend to be conservative in the way they’re designed. For drug discovery, there can be an emphasis on acquiring data generally to support a broad range of less defined questions, and to support different analyses, possibly novel, that are selected and used in response to discovered qualities. Many data sets used in drug discovery need a broad allowance for use because scientists will need to discover their nature and potential relevance to emerging questions.

The value of drug targets in early discovery is more speculative than the value of drug targets in trials. The effort needed to manage an asset should be checked against its perceived value. In early drug discovery, a given drug target candidate is not likely to survive all the target validation phases and progress into trials. There does not seem to be a market for tiering and selling off drug targets that pass only some validation phases. When they do pass all the steps, much of the value will reside in preclinical laboratory test data that suggests promise in early-stage trials. Considering all the compounds that fail validation, it seems sensible that effort spent on documenting data processing outcomes would be rationalized to favour the few targets that show the most promise.

In trials, more data processing value is placed on rigour, integrity and reproducibility than on novelty. Trial results need to be traceable to the versions of data entry forms that captured the original data, and when values change, the changes need to be heavily audited. In early drug discovery activities, life scientists can rapidly modify algorithms in their analysis until they find something promising. Novel algorithms can help provide new insights, and the algorithms themselves may become a patentable asset. However, it seems to me that the ‘non-obvious’ factor that would make such algorithms qualify for patenting might mystify regulators, who may prefer to have clear evidence of clinical benefits from trial sponsors. It seems clear that the value of data sets stewarded in drug discovery is supported through novel scientific insights, whereas in trials much value is placed on predictable data management processes.

These differences are important for shaping how strategy will grow in a life science organisation, especially as drug discovery organisations may become more invested in early-stage clinical trials. In those cases, organisations will need to accommodate two very different data cultures.

Appreciating the Landscape for Early Drug Discovery by Answering Basic Data Provisioning Questions

There are many different ways of surfacing thoughts about data strategy, but I have found one of the most useful is to answer ‘who’, ‘why’, ‘what’, ‘when’, ‘how’ and ‘where’ questions about data provisioning, generally in that order.

Who is the data for? The product preferences of customers shape the selection and nature of data sources needed to support relevant drug discovery activities. The most immediate customers of drug discovery outputs will be pharmaceutical companies that want to progress promising drug targets through clinical trials in order to produce medical products that can provide a return on investment. Their preferences for drug targets will in turn be shaped by business factors that influence which disease patient populations they prioritise. Within a disease area, the known state of its suspected underlying biological mechanisms can inform an opportunity map of treatment possibilities.

The contours of this map can be further shaped by considerations such as which part of the mechanism is most relevant for treatment, which areas of investigation could lead to novel treatments, what route of delivery would appeal most to a target patient population and what disease areas might incur lower costs for future clinical trials. Factors such as these can influence what kinds of drug targets drug development might focus on and what kinds of safety data may need to be gathered.

Why provide data? Much of drug discovery is about efficiently matching a promising drug target, in the context of a particular treatment intervention, with specific populations of disease patients. It can also be viewed as a sifting process whereby millions of potential compounds are filtered to obtain those which show enough promise to positively influence a biological mechanism responsible for the effects of a disease.

The process yields drug target candidates, most of which will fail to pass the stages of clinical trials and become a new drug on the market. The process is high-risk, data-intensive and takes a long time to finish. One way to help mitigate the risk of failure in any one drug target search is to run several drug discovery activities, each perhaps starting with an overlapping collection of drug targets that are useful in one treatment context but not the next.

The data provisioned for drug discovery can be used to help produce patentable algorithms or patentable drug compounds. The availability of certain combinations of data can help support the development of novel analysis approaches that can be useful for identifying promising drug targets or articulating patient subpopulations. The algorithms may evolve to support multiple drug discovery activities.

Data can also be used to help validate drug targets, and provide enough promise to test them out in pre-clinical laboratory work. When these compounds show evidence of succeeding in laboratory tests, organisations may be confident enough to file for patents. Algorithm and drug patents can provide common intangible assets which they can use to attract investors, collaborators and clients.

There are other reasons why data are provisioned in early drug discovery organisations. For example, data from competitive intelligence databases could help inform decisions about which drug areas organisations should pursue. Data may also be provisioned to satisfy regulatory compliance needs. Medical literature could be provided to help identify insights about prospective treatments.

What kinds of data should be provided? The use cases formed from considering ‘why’ and ‘who’ will ultimately define what counts as good data. The drive to economically provision multiple disease programmes will motivate cultivating a basket of data sources that balances these qualities:

· General vs disease-specific or biological mechanism-specific. A longitudinal birth cohort data set could support many disease areas whereas transcriptomic data from a few tumour samples would best suit a cancer investigation.

· Utility vs scarcity. A biobank data set could provide a lot of useful information about many people, but the data may also be available to many competing researchers. Access to newly generated data can be limited to produce value in scarcity, but the cost of creating new data might limit how useful data could be in multiple contexts.

· Short-term vs long-term. Data sets can be easily available from publicly accessible data portals, but it could take years to generate data with a collaborator.

· Acquired vs generated. Acquired old data could be obtained from a laboratory but might reflect dated technology and batch effects. Generated new data presents opportunities to reduce these problems and shape outputs based on future intended use cases.

· Heterogeneous vs homogeneous patient representation. Patient data that represents a wide variety of demographic groups can provide better insights about drug targets but be more difficult to source.

When asking ‘what is a good data source?’ it is tempting to rely on domain experts naming good data sources rather than having them name the characteristics of good data sources. Judgement that is the product of long experience can be difficult for domain experts to articulate. In my experience, asking that question invariably yields an answer that starts with ‘It depends’.

However, the risk of relying entirely on domain experts naming desirable data sources is that the context for their tacit preferences may not become established in institutional memory. Provision of some of the data sources can take so long that staff who initially requested data may leave before it becomes available. Capturing criteria of good data for use cases makes it easier to assess how well data supports changing business needs and it makes the task of looking for other similar data sources easier to delegate across an organisation.

Each organisation should invest in developing a framework of data desirability characteristics that support priority use cases. The desirability criteria will vary across organisations because they are ultimately informed by expert opinions garnered from specific domain experts who are working on specific projects.

Data sets should be screened against this data desirability framework, and the outcomes should be periodically examined to evaluate and improve data provisioning efforts. Recording the evaluation of each data source against desirability criteria can take effort, but there are two major benefits. First, it can reduce the risk of having multiple staff members unintentionally evaluating the same prospective data asset at different stages in the same project. Second, as the search effort accumulates results, the outcomes can help inform decisions about whether data should be acquired or generated. Data generation projects often involve significant investment, and records captured from evaluating potential existing assets can help provide the evidence needed to embark on such expensive decisions.
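To make the idea of a desirability framework and its screening records more concrete, here is a minimal sketch in Python. The criteria, weights, scores and names are hypothetical illustrations based on the qualities listed above, not a prescribed implementation; in practice the weights and scoring scales would come from the domain experts consulted to build the framework.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical criteria echoing the qualities listed above; weights are
# illustrative and would be agreed with the consulted domain experts.
CRITERIA_WEIGHTS = {
    "disease_relevance": 0.30,       # general vs disease/mechanism-specific
    "scarcity": 0.20,                # utility vs scarcity relative to competitors
    "availability_lead_time": 0.20,  # short-term vs long-term provisioning
    "generation_quality": 0.15,      # acquired (dated, batch effects) vs generated
    "population_diversity": 0.15,    # heterogeneous vs homogeneous representation
}

@dataclass
class DataSourceEvaluation:
    """One record of screening a candidate data source against the framework."""
    source_name: str
    use_case: str
    scores: dict[str, int]   # criterion -> score from 0 (poor) to 5 (excellent)
    evaluated_by: str        # credited expert initials, e.g. "(KG)"
    evaluated_on: date
    notes: str = ""

    def weighted_score(self) -> float:
        """Combine per-criterion scores using the framework's weights."""
        return sum(CRITERIA_WEIGHTS[c] * s for c, s in self.scores.items())

# Example: screening a hypothetical cohort data set for a rare disease programme.
evaluation = DataSourceEvaluation(
    source_name="Example longitudinal birth cohort",
    use_case="Rare disease target validation",
    scores={
        "disease_relevance": 2,
        "scarcity": 1,
        "availability_lead_time": 4,
        "generation_quality": 3,
        "population_diversity": 4,
    },
    evaluated_by="(KG)",
    evaluated_on=date(2024, 3, 1),
    notes="Broad utility, but widely available to competing researchers.",
)
print(f"{evaluation.source_name}: {evaluation.weighted_score():.2f} out of 5")
```

Keeping lightweight records like this makes duplicate evaluations easier to spot, preserves the context of expert preferences in institutional memory, and gives later acquire-versus-generate decisions an evidence trail to point to.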

The data desirability framework will be different for each organisation, and likely be different over time in the same organisation. The choice and ranking of factors depend entirely on the collection of domain experts consulted to build the framework and the use cases that framework is meant to support. Like many things related to strategy development, the framework should be maintained as a living document if it is to have any long-term relevance to an organisation’s activities.

Careful consideration is needed to determine who should be consulted and how they should be consulted. Life science workers cover a wide variety of backgrounds, including biologists, bioinformaticians, data scientists, software developers, data licensing experts, and people involved in clinical trial work.

How they are consulted also matters. My experience interviewing various people over the years is that, if there is bandwidth, individual interviews are preferable to roundtable discussions. Group discussions usually become dominated by the most vocal people, and the quiet ones often have a lot of really important insights to capture.

Structured interviews are a more efficient means of gathering preference information than informal chats. Prefer open questions in early rounds of interviewing and graduate to more closed questions if there is bandwidth and a need for future interview sessions. Write out questions, send them to interviewees beforehand and stick to the question list. This removes surprises and provides comfort to people who may feel nervous about having to answer a question on the spot. Distill their wisdom and credit each of them: recognition of expertise is an important part of building stakeholdership in future plans. In many of the documents I have produced to understand data needs, I credit contributors with their parenthesised initials, much like a literature reference in a paper. Where possible I dispel any illusion that the preferences are actually my own, partly because I am rarely the main user of the data; the scientists are.

When does the data need to be provided? Part of data provisioning is about knowing what kind of data you want, another part is finding a prospective source, and much of the rest is about navigating the administrative pathway for obtaining it. These administrative overheads are often underestimated.

If scientists need the data within a few weeks, it almost compels them to seek data from public data portals. Apart from investing some due diligence to determine whether the data sets obtained via the portal show evidence of good governance, they can be obtained reasonably quickly. Next, there are data sets that can be made available through a straightforward application process. In the US, the most obvious example of this is dbGaP: you need to set up an account and then submit data access applications before you can obtain the data. This can take weeks to months. Finally, there are data assets which require face-to-face meetings with data providers. These engagements can last many months or more than a year while organisations build a relationship and formally apply for access.

‘When’ can also be influenced by how many data sources are needed to support an early drug discovery activity. For example, for a rare disease programme, there may be so few patients that patient data sets need to be acquired from multiple data providers. Unless the providers pool their data via a consortium, data access can require engaging multiple, potentially lengthy administrative pathways.

How should the data be provisioned? ‘How’ is most often about two factors: acquired vs generated data, and in-house vs provider-hosted secure data environments. Decisions about whether to generate or acquire data are typically influenced by project timelines and by how strategically important a disease-specific asset may be.

With data acquisition you often shop for data sets from data providers, whereas with data generation you often shop for good relationships with prospective collaborators. For data acquisition, a data desirability framework will be important, whereas for prospective collaborators a collaborator desirability framework may be preferable. In many cases, the desirability of a collaborator may be based on the desirability of data they have produced in the past. Generated data is typically associated with longer time scales because of the need to develop and retain a rapport with a partner organisation.

The other important part of ‘how’ is whether the analysis code is moved to the data or the data is moved to the analysis code. From the perspective of a drug discovery organisation, it is better to acquire data that can then be hosted on its own corporate infrastructure. Organisations can leverage computing resources in proprietary ways that optimize performance and are not exposed to external parties. However, data custodians will often be very hesitant about allowing large amounts of patient data to be taken off their own premises and managed in an external environment. They may expect organisations to show evidence of data governance procedures and compliance or conformance with ISMS standards.

The administrative pathway for using patient-data can change significantly if the organisation moves its code to a secure data environment. Data providers may insist that analyses should be conducted within their own environment and that results may need to be inspected for traces of personally identifiable information before they can be copied off their systems. Bringing code to the data can sometimes help shortcut the administrative pathway for access because providers feel the impact of data processing is more controlled.

However, this approach often comes with at least two drawbacks: limited controls for supporting IP protection from other internal platform users and limited tools and computational resources for supporting large scale development efforts. In my view, trusted research environments (TREs) tend to support data protection more than IP concerns. They may prevent unauthorized external access to the system, but within the system of legitimate users, shared computing resources may make it easier for them to see each other’s works. A TRE may have anticipated a post-graduate running a one-off script to obtain data, but may not have expected a medium-sized team who are developing complex analysis code with the help of various libraries and code versioning tools.

Where should the data come from? Where data sets are sourced will likely be determined by the locations of the data provider or collaborator. However, where multiple providers may be available, it can also be important to consider ‘where’ in terms of the jurisdiction that is relevant to the production, storage and transfer of data. The choice of country, and sometimes state, where data are processed can determine which aspects of data protection, data sovereignty and data residency are relevant. It may be preferable to engage data providers in countries that have data protection regimes similar to that of the organisation needing the data. The greater the difference between regimes, the more likely it is that domain experts with country-specific legal knowledge will need to be engaged.

Concluding Thoughts

There are many aspects of data strategy that are not unique to life sciences, yet I feel there is more concrete discussion needed within the domain. Even within the realm of life sciences there are major areas such as early drug discovery and clinical trials which may cultivate very different data cultures.

My current sense is that data strategy discussions too often graft a few use cases borrowed from large social media platforms to showcase a data provisioning engine that creates more value as more users provide data for it. In my own experience, the data landscapes that support life science use cases seem more complicated.

In life sciences, a wide variety of data sets exist to help scientists become informed about poorly known areas of drug discovery. Access to these data sets sits in an international regulatory context where data sources, honeycombed by different data governance protocols, can present a challenge for efficiently marshalling data to support use cases. Efficiencies will ultimately come when ‘data trust’ is given higher visibility as a corporate asset.

Notions of data quality are complicated, multi-faceted, draw on multiple expertise domains and depend on specialised use cases. Even when notions of data quality are clearly understood, they may not be adequately recorded to withstand staff churn during the long time frames that may be needed to provision data.

While efforts such as FAIR are helping to promote discovery of existing life science data sets in the community, I feel that the work needed to repurpose data to support new use cases reflects an inefficient data marketplace between data producers and data consumers. I hope that sharing my views on data landscapes may ultimately help producers and consumers discover common ground earlier for prospective data projects, and lead to more valuable data assets in general.
