The Azure Open Datasets team is excited to announce that the COVID-19 Open Research Dataset is now available to users via the Azure Open Datasets platform. The dataset contains all COVID-19 and coronavirus-related research (e.g., SARS and MERS) from the following sources:
About the dataset
In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses, for use by the global research community. This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.
Dataset structure
The dataset is about 8 GB, spread across roughly 60,000 JSON files, each representing one research paper extracted and formatted as JSON. A metadata file in the root folder lists all the papers along with metadata such as authors, title, tags, and original paper URL. The dataset is available on a public blob store in both compressed (for download) and uncompressed (for mount and read) formats. The data access tab has sample code to get started using blob store access. We will be publishing another notebook that uses AzureML resources in the next few days.
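To make the layout above concrete, here is a minimal sketch of reading a local (downloaded or mounted) copy of the uncompressed files. It is not the sample code from the data access tab. The metadata file name `metadata.csv`, the `body_text` field inside each paper's JSON, and the helper names are all assumptions made for illustration; check them against the actual files before relying on them.

```python
import csv
import json
from pathlib import Path


def load_metadata(root):
    """Read the metadata listing into a list of dicts.

    Assumption: the metadata file is a CSV named 'metadata.csv' in the root folder.
    """
    with open(Path(root) / "metadata.csv", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def iter_papers(root):
    """Yield (path, parsed JSON) for every .json file found under root."""
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield path, json.load(f)


def paper_text(paper):
    """Join a paper's body paragraphs into one string.

    Assumption: each paper has a 'body_text' list of objects with a 'text' field.
    """
    return "\n".join(p.get("text", "") for p in paper.get("body_text", []))
```

With a local copy at a hypothetical path such as `./cord19`, `load_metadata("./cord19")` returns one row per paper, and `iter_papers("./cord19")` streams the papers one at a time, so the full 8 GB never has to sit in memory.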
Dataset applications
The original copies of the dataset are hosted on AWS and Kaggle. We host it on Azure so that Azure users can access it directly from their compute, without spending time downloading, uploading, and unzipping it from external sources. We believe the dataset will be useful to Azure enterprise users in the pharma and life sciences industry, to AI for Health grantees such as academic and research institutes, who can receive Azure credits for COVID-19 related research, and to data scientists looking to assist with COVID-19 related exploration. The list of “tasks” published on Kaggle by the Allen Institute offers good examples of how machine learning techniques such as NLP and entity extraction/correlation can be applied to the dataset. You can find a sample notebook at https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/ under the “Data Access” tab, along with more walkthroughs of the dataset structure.