How to share data

You can submit data in two ways: as results summary statistics (calculated and formatted according to the analysis plan) or as individual-level data.

We prefer you submit individual-level data because they can be used beyond the few analyses that are described in the analysis plan.

Results summary statistics

Information on how to upload results summary statistics is given in the analysis plan, in the section “Results upload instructions”.

Individual-level data

If you are not from the US:

  • You can submit individual-level data (i.e. genetic and clinical phenotype data) via the European Genome-phenome Archive (EGA). The EGA offers services for archiving, processing and distributing all types of potentially identifiable genetic and phenotypic human data at the European Bioinformatics Institute (EBI). To start your submission, please fill in this form or contact the EGA helpdesk via helpdesk@ega-archive.org, mark the email F.A.O. Giselle Kerry, and state that your submission is part of the COVID-19 Host Genetics Initiative.

If you are from the US:

  • You can submit individual-level data via NHGRI AnVIL. The AnVIL can ingest datasets, process them via standardized pipelines and perform quality control on them, and make them accessible to other researchers in a cloud-based environment. To start your submission, please contact help@lists.anvilproject.org and mark the email Attn: COVID-19 Host Genetics Initiative.

Access to individual-level data works differently for two groups of researchers: researchers within the initiative (i.e. researchers who are registered with the initiative and have also deposited data) and researchers outside the initiative.

Researchers outside the initiative

Access to individual-level datasets by external researchers is controlled by a Data Access Committee (DAC), which must be registered as part of the submission process. A DAC may consist of one or several committee members who are responsible for making data access decisions in response to applications from individuals wishing to access data. A DAC may be responsible for approving access to one or multiple datasets. Only those who have successfully applied for access via the DAC will receive access to the dataset(s) archived at the EGA and AnVIL.

Researchers within the initiative

Researchers within the initiative who have deposited data or results summary statistics, or who are part of established analysis groups, will have fast-track access to the initiative's data deposited on EGA and AnVIL. The DAC, which is composed of the PIs of the studies that have deposited the data, will facilitate access to the full data pool. We are currently discussing which procedures to implement to give these groups of researchers fast access. All researchers are required to follow the code of conduct outlined in https://www.covid19hg.org/about/.

Results summary statistics will be meta-analyzed across studies and immediately made available to the scientific community via the website's results browser, the GWAS Catalog, the Open Targets Platform and other portals.

The EGA is working with the ELIXIR network to establish the EGA Federation network, which will enable data to be deposited within national jurisdictions. We expect to launch the first nodes in mid-to-late 2020. In the meantime, we suggest you contact your country's ELIXIR head of node to find out about the current status for your country.

The EGA is managed by EMBL-EBI and the Centre for Genomic Regulation, Barcelona (CRG). At EMBL, data protection is governed by Internal Policy 68 on general data protection (IP 68). IP 68 resembles the GDPR, but is adapted to the intergovernmental nature of EMBL and to the need to enable free scientific research across national borders. CRG is subject to the GDPR and implements it fully. The EGA GDPR notices can be found here.

Both EGA and AnVIL recommend using open standards and formats that are maintained by the Global Alliance for Genomics and Health (GA4GH) and published in the GA4GH Genomic Data Toolkit. For genome sequencing data these include FASTQ, BAM, CRAM, and VCF. Data from all array-based technologies are accepted; submissions may include raw data, intensity files and analysis files, and there are no restrictions on the data formats accepted.
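Before preparing a submission, it can help to check which sequencing files already use one of the formats named above. The following is a minimal sketch, not an official EGA or AnVIL tool: the function name, the list of file extensions and the handling of compressed files are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Map of file extensions to the sequencing formats named in the text.
# The extension list is an assumption for illustration, not an official
# EGA or AnVIL specification.
SEQUENCING_FORMATS = {
    ".fastq": "FASTQ",
    ".fq": "FASTQ",
    ".bam": "BAM",
    ".cram": "CRAM",
    ".vcf": "VCF",
}

# Compression suffixes that may follow the format extension (assumed).
COMPRESSION_SUFFIXES = {".gz", ".bgz"}


def sequencing_format(filename: str) -> Optional[str]:
    """Return the format name for a file, or None if it is not recognised."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Drop a trailing compression suffix such as .gz before matching.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes:
        return None
    return SEQUENCING_FORMATS.get(suffixes[-1])
```

A file that returns None is not necessarily rejected: array-based data are accepted in any format, so the check only flags files that do not look like FASTQ, BAM, CRAM or VCF.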

Clinical data should be included as part of the study submission. We suggest formatting the data following the initiative’s data dictionary (tab FREEZE_1). Not all the variables listed in the data dictionary are required. If you want to submit variables that are not listed in the data dictionary, please contact stefano.ceri@polimi.it.

Yes, this is entirely possible. We suggest creating a new dataset for submission for every 500 samples.
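Assuming the suggestion above means grouping samples into datasets of at most 500, the split could be sketched as follows. The function name and the sample identifiers are hypothetical; only the batch size of 500 comes from the text.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 500  # dataset size suggested in the text


def batches(sample_ids: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` sample IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(sample_ids), size):
        yield list(sample_ids[start:start + size])


# Hypothetical sample identifiers, used only to show the split.
sample_ids = [f"SAMPLE_{i:05d}" for i in range(1234)]
for number, batch in enumerate(batches(sample_ids), start=1):
    print(f"dataset {number}: {len(batch)} samples")
```

With 1234 samples this gives three datasets: two of 500 samples and a final one with the remaining 234.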