
About the Project
The Allen Institute for AI, together with leading research groups and publishers, provides the COVID-19 Open Research Dataset (CORD-19). This research designs and develops an automatic question answering service based on a knowledge graph built from CORD-19.
- Using a knowledge graph requires no training time, only the time needed to construct the graph, so the service can go online more quickly.
- The service identifies the keywords in a question entered by a user, summarizes the content associated with the question, and returns the summary to the user as the answer.
- The research goal is for the automatic answering service to correctly identify the keywords in a question and to summarize the associated content so that the answer is relevant and satisfies the user.
About Us

Our Mission
Our mission is to help researchers apply the latest advanced technologies to the dataset in the fight against COVID-19. This research designs and develops an automatic question answering service based on a knowledge graph built from CORD-19.

Our Supervisor
Dr. Maiga Chang is a Full Professor in the School of Computing and Information Systems at Athabasca University, Canada.

Research Goal
The research goal is for the automatic answering service to correctly identify the keywords in a question and to summarize the associated content so that the answer is relevant and satisfies the user.
Our Team

Sayantan PAL
2021 (current)
Sayantan Pal is an undergraduate student pursuing Computer Science and Engineering at the Heritage Institute of Technology, India. His research interests lie in Machine Learning and Natural Language Processing.

Team Member
Team member Details.

Team Member
Team member Details.
Videos
Presentation Video
Live demonstrations of the 12-week work outcome (May 2021–July 2021) of the preliminary functions of the Coronavirus Question Answering research, built with Python, PHP, JavaScript (AJAX and JSON), and basic Natural Language Processing. It includes three stages:
- Stage 1: File Extraction and Verification
- Stage 2: Data Processing
- Stage 3: Summary Generation
Stage 1: File Extraction and Verification
Stage 1's major features include (but are not limited to):
- File extraction and verification of the uploaded CORD-19 dataset in compressed tar and/or gz format.
- Cron jobs for the backend services.
- A dashboard that shows the backend services' progress.
Stage 2: Data Processing
Stage 2's major features include (but are not limited to):
- Processing the CORD-19 dataset's full text in JSON format.
- Analyzing and summarizing useful Part-of-Speech (PoS) tags in the CORD-19 dataset.
- Storing useful PoS tags and relevant n-grams (n from 1 to 4).
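The n-gram counting in Stage 2 can be sketched in Python. This is a simplified illustration, not the project's actual code: the regex tokenizer is an assumption, and the real pipeline additionally filters tokens by useful PoS tags before storing the n-grams.

```python
import re
from collections import Counter

def extract_ngrams(text, max_n=4):
    """Count word n-grams (n = 1..max_n) in a piece of text.

    Simplified sketch: the real Stage 2 pipeline also keeps only
    tokens with useful Part-of-Speech tags before counting.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = extract_ngrams("The virus spreads mainly through respiratory droplets.")
```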
Stage 3: Summary Generation
Stage 3's major features include (but are not limited to):
- A web-based user interface for users to ask questions related to coronavirus.
- Cosine similarity calculation based on the useful PoS tags and their corresponding n-grams extracted from the asked question.
- Summary generation and storage for the asked questions.
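The cosine similarity step can be sketched as follows. This is an illustrative sketch over plain term-frequency vectors, not the service's production implementation:

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy example: n-gram frequencies from a question and from a document.
question = Counter({"hand": 1, "sanitizer": 1, "covid": 1})
doc = Counter({"hand": 2, "sanitizer": 1, "soap": 3})
score = cosine_similarity(question, doc)
```

A higher score means the document's vocabulary overlaps more with the question's keywords; the best-scoring document's sentences are then used to build the summary.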
Frequently Asked Questions
What kind of questions can I ask?
You can ask anything related to COVID-19. The service aims to generate a summary based on the question asked. You can take a look here to get some idea about the questions. Some sample questions are: Should I use soap and water or hand sanitizer to protect against COVID-19? Can mosquitoes or ticks spread the virus that causes COVID-19?
What data do you collect from users?
We are strictly against collecting user-sensitive data. We store an anonymous, system-generated UUID in a cookie to identify the user. We keep the question and the generated summary in our database and store them in cookies for a better experience.
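As a minimal illustration of the idea (not the service's actual backend code, which is written in PHP), an anonymous identifier can be issued like this; the `user_id` cookie name and the function are hypothetical examples:

```python
import uuid

def get_or_create_user_id(cookies: dict) -> str:
    """Return the anonymous user UUID from the cookie jar,
    generating and storing a fresh one on the first visit.
    The ID is random, so no personal information is involved."""
    if "user_id" not in cookies:
        cookies["user_id"] = str(uuid.uuid4())
    return cookies["user_id"]

cookies = {}
first = get_or_create_user_id(cookies)   # new random UUID on first visit
second = get_or_create_user_id(cookies)  # same UUID on later requests
```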
How accurate can I expect the results to be?
We are using the CORD-19 dataset, a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. We periodically run backend services that process this large amount of plain-text data (236,336 full-text academic articles as of July 19, 2021) with basic Natural Language Processing techniques, including tokenization, n-gram extraction, and part-of-speech tagging. The summary is then generated using data mining techniques. It does not guarantee a specific answer to the question, but it gives you a general idea about the question asked. The service improves over time and will provide better results in the future.
How is the summary generated?
We are using the CORD-19 dataset, a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. We process the documents by extracting their sentences and then the n-grams in each sentence. When the user asks a question, we identify the keywords (n-grams) in the question and create a vector representation, where the vector holds the frequencies of the keywords. We compare it to the document vectors by cosine similarity to identify the best document. Then, similarly, we match the question vector to that document's sentences by cosine similarity to find the best-matching sentences, and we concatenate the top two sentences to generate the summary.
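The steps above can be sketched end to end in Python. This is a simplified illustration under assumptions (regex word tokenization and plain word frequencies instead of the full PoS-filtered n-gram vectors), not the service's production code:

```python
import math
import re
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Term-frequency vector over lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(question: str, documents: list) -> str:
    """Pick the best-matching document by cosine similarity,
    then concatenate its top two best-matching sentences."""
    q = tf_vector(question)
    best_doc = max(documents, key=lambda d: cosine(q, tf_vector(d)))
    sentences = re.split(r"(?<=[.!?])\s+", best_doc)
    ranked = sorted(sentences, key=lambda s: cosine(q, tf_vector(s)), reverse=True)
    return " ".join(ranked[:2])

docs = [
    "The virus spreads mainly through respiratory droplets. Hand washing helps. Masks reduce transmission.",
    "Ticks can transmit Lyme disease to humans.",
]
summary = summarize("How does the virus spread?", docs)
```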
Can I use this website to gain knowledge related to COVID-19?
Yes, the service aims to generate a summary based on your question. We have extracted data from tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. It does not guarantee a specific answer to the question, but you can get a general idea about the question asked.
Will my question be publicly available and shared with others?
No, we are against storing users' personal information. No questions will be shared or made publicly available. We keep the question and the summary in our database for a better user experience. We identify you uniquely by a system-generated UUID stored in a cookie.
What is the goal of the research?
We periodically run backend services that process a large amount of plain-text data (236,336 full-text academic articles as of July 19, 2021) with basic Natural Language Processing techniques, including tokenization, n-gram extraction, and part-of-speech tagging. We process the documents by extracting their sentences and then the n-grams in each sentence. The research goal is to have an automatic answering service that correctly identifies the keywords in a question and summarizes the associated relevant content to the user's satisfaction.