Contextual data—such as where, when and how a SARS-CoV-2 sample was collected—is vital for interpreting trends in SARS-CoV-2 genomic sequence data that help us better understand COVID-19 and inform the public health response. Data from a variety of sources must be standardized quickly and accurately. The challenge is that individuals and teams providing data each use different information management systems, which use unique fields, terms and formatting for encoding data.
- Standardizes data formatting and data entry.
- Provides validation and automates data transformations.
- Represents a vital tool for data curation.
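As a rough illustration of what standardizing fields and terms can involve, the Python sketch below renames lab-specific column names to shared ones and maps local spellings to a single agreed term. The field names, term spellings and the `harmonize` function are hypothetical and are not taken from any CanCOGeN tool or template.

```python
# Hypothetical mapping from lab-specific column names to shared field names.
FIELD_MAP = {
    "Specimen ID": "sample_id",
    "sampleID": "sample_id",
    "Date Collected": "collection_date",
    "coll_date": "collection_date",
    "Prov": "province",
}

# Hypothetical mapping from local spellings to one agreed term.
TERM_MAP = {
    "bc": "British Columbia",
    "b.c.": "British Columbia",
    "british columbia": "British Columbia",
}


def harmonize(row):
    """Return a copy of one record with shared field names and terms."""
    out = {}
    for key, value in row.items():
        # Unknown field names are kept unchanged rather than dropped.
        name = FIELD_MAP.get(key, key)
        if isinstance(value, str):
            value = value.strip()
            if name == "province":
                # Unknown terms are kept so a curator can review them.
                value = TERM_MAP.get(value.lower(), value)
        out[name] = value
    return out
```

Keeping unknown fields and terms unchanged, instead of discarding them, is a deliberate choice in this sketch: it leaves anything unexpected visible for manual review.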
A crucial next step in data preparation is checking the contextual data from different provinces and labs before it is shared with the National Microbiology Laboratory’s access-controlled database and the public-access Canadian VirusSeq Data Portal, launched in April 2021.
What does data curation look like for CanCOGeN?
The curation process involves:
- Checking the data for consistency and completeness, and verifying that the data makes sense.
- Troubleshooting, developing and updating standards to align with public health needs.
- Converting data to ensure it meets the database requirements of different organizations.
- Post-submission corrections and updates.
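The first of these steps can be sketched in Python as a per-record check. This is a minimal sketch under stated assumptions, not the curators' actual procedure: the required fields, the ISO 8601 date rule and the `check_record` function are all hypothetical.

```python
from datetime import date

# Hypothetical required fields; a real template defines its own list.
REQUIRED_FIELDS = ("sample_id", "collection_date", "province")


def check_record(record, today=None):
    """Return a list of problems found in one contextual-data record."""
    today = today or date.today()
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")

    # Consistency: dates must follow one format (ISO 8601, by assumption).
    raw_date = str(record.get("collection_date") or "").strip()
    if raw_date:
        try:
            collected = date.fromisoformat(raw_date)
        except ValueError:
            problems.append("collection_date is not a valid ISO 8601 date")
        else:
            # Plausibility: a sample cannot have been collected in the future.
            if collected > today:
                problems.append("collection_date is in the future")

    return problems
```

An empty list means the record passed all three checks. Returning messages instead of raising an error lets a curator see every problem in a record at once.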
Privacy and legal concerns
- Data sharing permissions vary across public health jurisdictions, so data curators must be aware of the many ethical, legal and privacy issues associated with different datasets and resources, such as those from access-controlled databases versus those from public access databases.
- A CanCOGeN data curator coordinates with the National Microbiology Laboratory and provincial partners to ensure these considerations are addressed.
Next steps for curated data and remaining challenges
There are still challenges to be solved, for instance ensuring that all genomes submitted to the GISAID initiative are also available in the VirusSeq Data Portal. Discrepancies between the two may arise because the Portal's requirements for submitted genomes differ from GISAID's: GISAID only accepts genomes with 90% coverage, whereas VirusSeq also accepts genomes with lower coverage for research purposes.
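One way to reason about such discrepancies is to separate those the coverage rule explains from those it does not. The Python sketch below assumes the 90% figure is a minimum, that each Portal genome has a known coverage percentage, and that both repositories share genome identifiers. The function and variable names are hypothetical.

```python
# Assumed minimum, based on the 90% coverage figure quoted in the text.
GISAID_MIN_COVERAGE = 90.0


def explain_discrepancies(portal_genomes, gisaid_ids):
    """Compare the two repositories.

    portal_genomes: dict mapping genome ID -> coverage percent (Portal).
    gisaid_ids: iterable of genome IDs present in GISAID.
    """
    gisaid_ids = set(gisaid_ids)
    report = {
        "expected_portal_only": [],    # low coverage, so not accepted by GISAID
        "unexplained_portal_only": [], # meets the threshold but absent from GISAID
        "missing_from_portal": [],     # in GISAID but not in the Portal
    }
    for genome_id, coverage in sorted(portal_genomes.items()):
        if genome_id in gisaid_ids:
            continue
        if coverage < GISAID_MIN_COVERAGE:
            report["expected_portal_only"].append(genome_id)
        else:
            report["unexplained_portal_only"].append(genome_id)
    report["missing_from_portal"] = sorted(gisaid_ids - set(portal_genomes))
    return report
```

In this sketch, only the `missing_from_portal` entries correspond to the challenge described above. Low-coverage genomes that appear only in the Portal are an expected consequence of the differing requirements.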
“The curated data goes beyond simply reporting by providing a framework for communicating data about how viral infections are being transmitted through a diverse population.”
- Nithu John, Research Assistant at Simon Fraser University and Curator for the Canadian VirusSeq Data Portal (CanCOGeN).
The Canadian COVID-19 Genomics Network (CanCOGeN) is on a mission to respond to COVID-19 by generating accessible and usable data from viral and host genomes to inform public health and policy decisions, and guide treatment and vaccine development. This pan-Canadian consortium is led by Genome Canada, in partnership with six regional Genome Centres, the National Microbiology Lab and provincial public health labs, genome sequencing centres (through CGEn), hospitals, academia and industry across the country.