Leveraging cancer research tools to build the Canadian VirusSeq Data Portal

Tuesday, November 30, 2021

Dr. Lincoln Stein

We asked Dr. Lincoln Stein, Head of Adaptive Oncology at the Ontario Institute for Cancer Research (OICR), about OICR’s role in the Canadian VirusSeq Data Portal. Genome Canada launched the Canadian VirusSeq Data Portal on April 27, 2021, to track the evolving COVID-19 pandemic across Canada. The portal is a pillar in the national data infrastructure that is bolstering Canada’s ability to manage the current pandemic—and any future ones—by sharing and resourcing viral genome sequences. This made-in-Canada data solution is one of the key deliverables of the $53 million Variants of Concern Strategy the Government of Canada announced on February 12, 2021, to detect and address COVID-19 variants of concern in Canada.


“We got the whole portal up and running in about 28 days from the start of funding to the launch.” - Dr. Lincoln Stein


What is OICR’s role in the Canadian VirusSeq Data Portal?

We have three roles. First, we’re a data producer, although that has largely wound down.

Second, we’re a CanCOGeN participant and did a lot of the first tranche of viral genome sequencing before the public health laboratories had their genome sequencing up and running. We did more than 4,000 of the initial viral sequences for Ontario.

Third, we’re the primary software developers for the Canadian VirusSeq Data Portal, which is the central repository for all CanCOGeN-funded viral sequencing data sets. It’s a public resource open to anybody with internet access, who can download full viral sequences and the associated open-access data for the patient donors who submitted their tests. The OICR software development team also works with Canada’s Digital Supercluster on the COVID Cloud Project, which provides facilities for downstream integrative analysis across the viral sequence and controlled-access data sets, allowing scientists to identify changes in the viral genome that alter its pathogenicity (the capacity of an organism to cause disease) or infection rate.

Visit OICR to learn more.

What is your role specifically?

I am the Interim Director of the Genome Informatics group at OICR. My full-time role is Head of Adaptive Oncology at OICR, which includes our genomics group, genome informatics group, diagnostic development pathology group, imaging group and computational biology group. Recognizing the impact that COVID is having on both cancer patients and cancer research, I gave our investigators the green light to start doing non-cancer, COVID-related research during the pandemic.

The data portal was developed with incredible speed. How did OICR help ensure it became operational so quickly? 

We got the whole portal up and running in about 28 days from the start of funding to the launch. A lot of the software was already written for our cancer-related projects. The portal is powered by Overture, our open-source software suite for managing and sharing data at scale.

One challenge we did face was that the portal has different operational characteristics that we weren't really tuned for. We’re used to dealing with a small number of patients, in the 10,000 to 20,000 range at the most, with very large genomes. Instead, with CanCOGeN’s VirusSeq initiative, we have a large number of small genomes. The first issue we ran into was that when we got to about 50,000 viral genomes, the whole system stopped working, because that was more genomes than the system could accommodate. So, we rapidly increased capacity to 100,000 and then, when we got close to that level, we increased it to 200,000. We continue to have to tweak the software to get past these limits, which are based on assumptions made years ago for cancer.

With the portal up and running, what does the data portal group have planned for the future? 

The immediate goal for CanCOGeN was to create a comprehensive repository where all the genome sequences we were generating could reside. What our Genome Informatics team at OICR built is a very bare-bones repository that has the viral sequence and about 15 clinical fields for contextual data, but basically nothing else. While we have recently added a visualization of the family tree of viral lineages that shows the emergence of variants of concern, we don't (yet) provide interactive maps of where the cases were collected in Canada or timelines showing the growth or shrinkage of cases.

That functionality is provided by another partner, DNAstack, with their COVID Cloud platform. While the data portal is open-access, COVID Cloud is a controlled-access environment. You have to apply for access to it and have an experimental plan to use it. Because it’s not generally open to the public, we would like to move some of that functionality into the portal to give people a first-order view of how the virus has been evolving, spreading and shrinking throughout Canada, and to give people a taste of what is available through the full COVID Cloud interface. This would enable us to support a casual researcher (a high school principal, for example) who wants the data to help make decisions at a local level.

Learn more about the team that developed this made-in-Canada data solution.

Canadian provinces have shared their viral sequencing data with the Global Initiative on Sharing All Influenza Data (GISAID) since the beginning of the COVID-19 pandemic. Why was it important to develop a Canadian data-sharing platform?

First, the Canadian public health labs’ sharing with international databases was not consistent, with only a fraction of the data being shared. Second, there was no standard for what data the different provinces were sharing. From province to province, they were sharing some clinical fields and not others, which made it look like there were large differences between provinces when really, it was just their different data policies. CanCOGeN wanted to have a centralized curator with a set of standards for checking the completeness and consistency of the data.

Learn more about quality control in SARS-CoV-2 sequencing and meet a CanCOGeN data curator.


The Canadian COVID-19 Genomics Network (CanCOGeN) is on a mission to respond to COVID-19 by generating accessible and usable data from viral and host genomes to inform public health and policy decisions, and guide treatment and vaccine development. This pan-Canadian consortium is led by Genome Canada, in partnership with six regional Genome Centres, the National Microbiology Lab and provincial public health labs, genome sequencing centres (through CGEn), hospitals, academia and industry across the country.