Q&A with Dr. Emma Griffiths
Contextual data – such as where, when and how a SARS-CoV-2 sample was collected – are vital to genomic surveillance to track COVID-19. Also known as “metadata,” contextual data are needed to understand how the virus is spreading and to inform public health interventions. However, not all contextual data sets are created equal, making data standardization a key pillar of national and international COVID-19 surveillance efforts.
Over the last year, CanCOGeN’s VirusSeq initiative has made significant progress on data standardization, positioning Canada as a global leader in this area.
Contextual data comes from a variety of sources, such as case report forms filled out by healthcare professionals when COVID-19 tests are conducted and during follow-up, as well as software that tracks the methods used to collect samples and perform sequencing and bioinformatics analyses. Across Canada, federal and provincial public health authorities use at least eight different versions of the case report form, each containing different fields and asking different kinds of questions. This makes comparing and integrating data more difficult.
We asked Dr. Emma Griffiths why standardized data is so important in the fight against COVID-19. Dr. Griffiths is Chair of the Data Structures Working Group at the Public Health Alliance for Genomic Epidemiology (PHA4GE), and Team Lead of the CanCOGeN-VirusSeq Metadata Working Group, which is tasked with standardizing contextual data.
“If you imagine how complicated it is integrating data across multiple jurisdictions in Canada to do analyses, you can imagine it gets more difficult putting data together from around the world. Our standard is being implemented in the U.S., Australia, and in Latin American and African countries. We are tackling many of the same issues in the international community that we face in CanCOGeN.” – Dr. Emma Griffiths
What exactly is contextual data (a.k.a. “metadata”), and why is it important?
Our team (which includes myself, as well as Principal Investigator Dr. William Hsiao, Rhiannon Cameron, Sarah Savic Kallesoe, Nithu Sara John, Emilie Diver, and Damion Dooley) oversees metadata harmonization for VirusSeq. We prefer to call it contextual data because it’s the data you need to provide context for interpreting the sequence data. It can include health status and outcome information, COVID-19 signs and symptoms, pre-existing conditions and risk factors, complications and clinical evaluations, vaccination status, as well as exposure information, which covers things like a person’s location and travel history. Without good contextual data, it’s very hard to do much with the data generated by genomic sequencing of SARS-CoV-2 samples. Together, genomic sequencing information and contextual data are critical for surveillance and public health responses – specifically understanding how COVID-19 got into Canada, how it spread and how it’s affecting people in their communities.
What are the challenges if sequencing data aren’t standardized?
In Canada, we have a decentralized health system under provincial jurisdiction. This means provinces design their health programs, decide which tests get performed and how best to act in public health emergencies. This makes sense since they are the boots on the ground and know their communities best. But it also means everyone uses different systems and databases to collect and encode information. People are gathering information at different levels of granularity and in different formats. They may also be asking different questions, thus collecting different data elements. When it’s time to combine that information in one place, things can get messy pretty quickly. And without a system for harmonizing these thousands and thousands of data records, it becomes incredibly challenging to fit all this information together to tell an accurate story.
The CanCOGeN VirusSeq Metadata Working Group developed a DataHarmonizer software tool. What does it do?
It’s a spreadsheet-style application that allows different groups to enter their contextual data in a standard format. The DataHarmonizer contains all the fields we developed as part of a Contextual Data Standard, which is in use across Canada and has even been adopted internationally. Once a user has entered all their information, they can validate it by hitting a “Validate” button. Any errors get highlighted in red, and the user can step from one error to the next with the next error button to fix issues in a systematic way. The tool allows users to save the file or export it in different formats so that it’s ready to be uploaded to different databases and repositories. It’s a one-stop shop for organizing your data for different uses.
(DataHarmonizer is available in English only)
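To make the validation step described above concrete, here is a minimal Python sketch of checking spreadsheet-style records against a small set of standardized fields and terms. It is not the DataHarmonizer's implementation and does not reproduce the Contextual Data Standard: the field names, picklist values, `STANDARD` table, and the `validate_row` and `validate_table` helpers are all hypothetical, chosen only to illustrate the idea.

```python
from datetime import date

# Hypothetical mini "standard": field names, required flags, and picklists
# are illustrative assumptions, not the actual Contextual Data Standard.
STANDARD = {
    "sample_id": {"required": True},
    "collection_date": {"required": True, "type": "date"},
    "province": {
        "required": True,
        "allowed": {"Alberta", "British Columbia", "Ontario", "Quebec"},
    },
    "host_health_state": {
        "required": False,
        "allowed": {"Asymptomatic", "Symptomatic"},
    },
}


def validate_row(row):
    """Return a list of (field, message) problems found in one record."""
    errors = []
    for field, rule in STANDARD.items():
        value = (row.get(field) or "").strip()
        if not value:
            if rule["required"]:
                errors.append((field, "required value is missing"))
            continue
        if rule.get("type") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append((field, "expected a date as YYYY-MM-DD"))
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append((field, f"'{value}' is not a standard term"))
    return errors


def validate_table(rows):
    """Map row index -> problems, keeping only rows that have problems."""
    report = {}
    for index, row in enumerate(rows):
        problems = validate_row(row)
        if problems:
            report[index] = problems
    return report


# Made-up records, as two different groups might enter them.
rows = [
    {"sample_id": "S1", "collection_date": "2021-03-04", "province": "Ontario"},
    {"sample_id": "S2", "collection_date": "04/03/2021", "province": "ON"},
    {"sample_id": "", "collection_date": "2021-03-05", "province": "Quebec",
     "host_health_state": "sick"},
]

report = validate_table(rows)
for index, problems in report.items():
    for field, message in problems:
        print(f"row {index}, {field}: {message}")
```

In this sketch the first record passes, while the second and third are flagged for a non-standard date format, a non-standard term, and a missing identifier. Walking through the report one entry at a time is loosely analogous to stepping from error to error in a spreadsheet.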
CanCOGeN VirusSeq’s work on data harmonization, including the Contextual Data Standard, has been recognized internationally. Can you explain how this data standard works and the importance of international collaboration in this area?
The Contextual Data Standard is a collection of standardized fields and terms for a range of data types. These data types include things like sample collection and processing information, exposures, symptoms, pre-existing conditions, vaccination, reinfection, sequencing, bioinformatics and diagnostic testing information, as well as variant information.
Importantly, this data standard captures information about who is doing the work, so everyone’s contributions can be properly acknowledged and data providers can be contacted for collaborations.
If you imagine how complicated it is integrating data across multiple jurisdictions in Canada to do analyses, you can imagine it gets more difficult putting data together from around the world. Our standard is being implemented in the U.S., Australia, and in Latin American and African countries. The international community is tackling many of the same issues CanCOGeN is tackling in Canada. I’m also part of a new organization called the Public Health Alliance for Genomic Epidemiology (PHA4GE). It has members from many different research institutes and public health agencies around the world. The main goals of PHA4GE are to improve the interoperability, reproducibility, and portability of public health bioinformatics tools and infrastructure, and to build capacity everywhere.
What’s next for CanCOGeN’s work on contextual data standardization?
As the pandemic evolves, we’re constantly updating the Contextual Data Standard and adding features in the DataHarmonizer to improve usability and uptake. We’re increasingly being contacted by different groups looking to implement the standard and asking for advice. And of course, we’re trying to get all this written up in manuscripts as soon as possible. Those are the activities I’m involved with, and there is a lot of other work going on to improve data sharing in Canada. Our team is also working on a mutation visualization tool. There’s lots going on.
The Canadian COVID-19 Genomics Network (CanCOGeN) is on a mission to respond to COVID-19 by generating accessible and usable data from viral and host genomes to inform public health and policy decisions, and guide treatment and vaccine development. This pan-Canadian consortium is led by Genome Canada, in partnership with six regional Genome Centres, the National Microbiology Lab and provincial public health labs, genome sequencing centres (through CGEn), hospitals, academia and industry across the country.