Meet a CanCOGeN data curator

Friday, October 29, 2021

Explore the world of data curation with Nithu John, Research Assistant at Simon Fraser University and Curator for the Canadian VirusSeq Data Portal (CanCOGeN).

Contextual data—such as where, when and how a SARS-CoV-2 sample was collected—is vital for interpreting trends in SARS-CoV-2 genomic sequence data that help us better understand COVID-19 and inform the public health response.  Data from a variety of sources must be standardized quickly and accurately. The challenge is that individuals and teams providing data each use different information management systems, which use unique fields, terms and formatting for encoding data. 

This is where data harmonization and curation come to the rescue, enabling data from across the country, and even the world, to be used together to inform our understanding and response to COVID-19 and other major public health challenges.

The DataHarmonizer, developed by the CanCOGeN Metadata Working Group, is a tool that resolves many of the standardization issues noted above. The DataHarmonizer:

  • Standardizes data formatting and data entry.
  • Provides validation and automates data transformations.
  • Represents a vital tool for data curation.

A crucial next step in data preparation is checking the contextual data from different provinces and labs before it is shared with the National Microbiology Laboratory’s access-controlled database and with the public-access Canadian VirusSeq Data Portal, which launched in April 2021.

What does data curation look like for CanCOGeN?

  • The curation process involves:
    • Checks for consistency and completeness of the data, as well as verification that the data makes sense.
    • Troubleshooting, developing and updating standards to align with public health needs.
    • Converting data to ensure it meets the database requirements of different organizations.
    • Post-submission corrections and updates.
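
As a rough illustration only, the consistency and completeness checks listed above could be sketched in Python as follows. The field names (`sample_id`, `collection_date`, `province`) and the rules are hypothetical stand-ins chosen for this example; they are not the actual CanCOGeN template or the DataHarmonizer's validation logic.

```python
from datetime import date

# Hypothetical required fields; the real contextual data template defines its own.
REQUIRED_FIELDS = ("sample_id", "collection_date", "province")


def check_record(record, today=None):
    """Return a list of human-readable problems found in one sample record."""
    today = today or date.today()
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    # Consistency: the collection date must be written as YYYY-MM-DD.
    raw = str(record.get("collection_date") or "").strip()
    if raw:
        try:
            collected = date.fromisoformat(raw)
        except ValueError:
            problems.append("collection_date is not YYYY-MM-DD")
        else:
            # "Makes sense" check: a sample cannot be collected in the future.
            if collected > today:
                problems.append("collection_date is in the future")

    return problems
```

A curator-style workflow would run a check like this over every submitted record and send the flagged ones back to the submitting lab for correction.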

See a more detailed breakdown of the curation process.

Privacy and legal concerns

  • Data sharing permissions vary across public health jurisdictions, so data curators must be aware of the many ethical, legal and privacy issues associated with different datasets and resources, such as those from access-controlled databases versus those from public access databases.
  • A CanCOGeN data curator coordinates with the National Microbiology Laboratory and provincial partners to ensure these considerations are addressed.

Next steps for curated data and remaining challenges

Data points from more than 78,000 samples have been curated and harmonized, and are now available to researchers in the VirusSeq Data Portal. Each sample has 43 fields of data, for a total of over 3,354,000 data points available for download!

Some challenges remain to be solved. One is ensuring that all genomes submitted to the GISAID initiative are also available in the VirusSeq Data Portal. Discrepancies between the two may arise because the Portal's requirements for submitted genomes differ from GISAID's: GISAID only accepts genomes with 90% coverage, whereas VirusSeq also accepts genomes with lower coverage for research purposes.
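
To make the discrepancy check concrete, here is a minimal Python sketch. It assumes each genome has an identifier shared by both repositories and a coverage value expressed as a fraction, and it treats the 90% figure as a minimum. The function names and data layout are hypothetical; neither GISAID nor the VirusSeq Data Portal is queried here.

```python
# Threshold quoted in the article, treated here as a minimum (an assumption).
GISAID_MIN_COVERAGE = 0.90


def meets_gisaid_threshold(coverage):
    """True if a genome's coverage fraction reaches the GISAID threshold."""
    return coverage >= GISAID_MIN_COVERAGE


def missing_from_portal(gisaid_ids, portal_ids):
    """Identifiers present in GISAID but absent from the Portal."""
    return sorted(set(gisaid_ids) - set(portal_ids))


def expected_portal_only(portal_coverage):
    """Portal genomes that fall below the GISAID threshold.

    `portal_coverage` maps identifier -> coverage fraction. These genomes
    are expected to be absent from GISAID, so they are not true gaps.
    """
    return sorted(
        genome_id
        for genome_id, coverage in portal_coverage.items()
        if not meets_gisaid_threshold(coverage)
    )
```

Separating the two cases matters: identifiers returned by `missing_from_portal` are gaps to follow up on, while those returned by `expected_portal_only` simply reflect the Portal's broader acceptance criteria.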

“The curated data goes beyond simply reporting by providing a framework for communicating data about how viral infections are being transmitted through a diverse population.”

Nithu John, Research Assistant at Simon Fraser University and Curator for the Canadian VirusSeq Data Portal (CanCOGeN).


The Canadian COVID-19 Genomics Network (CanCOGeN) is on a mission to respond to COVID-19 by generating accessible and usable data from viral and host genomes to inform public health and policy decisions, and guide treatment and vaccine development. This pan-Canadian consortium is led by Genome Canada, in partnership with six regional Genome Centres, the National Microbiology Lab and provincial public health labs, genome sequencing centres (through CGEn), hospitals, academia and industry across the country.