Automatic Answering Service for Coronavirus Question


This research project is supported by Mitacs Globalink: "Automatic Answering Service for Coronavirus Question"
under the supervision of Dr. Maiga Chang, Professor at the School of Computing and Information Systems, Athabasca University.

About the Project

The Allen Institute for AI, working with leading research groups and publishers, provides the COVID-19 Open Research Dataset (CORD-19). This research designs and develops an automatic answering service based on a knowledge graph built from CORD-19.

  • Using a knowledge graph requires no training time, only the time needed for graph construction, so the service can go online more quickly.
  • The service identifies the keywords in a question entered by a user, summarizes the content associated with those keywords, and returns the summary to the user as the answer.
  • The research goal is an automatic answering service that correctly identifies the keywords in a question and produces a summary of the relevant content that satisfies the user.

About Us

...
Our Mission

The goal is to encourage researchers to apply the latest advanced technologies to the dataset in the fight against COVID-19. This research designs and develops an automatic answering service based on a knowledge graph built from CORD-19.

...
Our Supervisor

Dr. Maiga Chang is a Full Professor in the School of Computing and Information Systems at Athabasca University, Canada.

...
Research Goal

The research goal is an automatic answering service that correctly identifies the keywords in a question and produces a summary of the relevant content that satisfies the user.

Our Team

...
Sayantan PAL
2021 (current)

Sayantan Pal is an undergraduate student pursuing Computer Science and Engineering at the Heritage Institute of Technology, India. His research interests lie in Machine Learning and Natural Language Processing.

...
Team Member

Team member Details.

...
Team Member

Team member Details.

Videos


Presentation Video

Live demonstrations of the 12-week work outcome (May 2021 to July 2021) of the preliminary functions of the Coronavirus Question Answering research, built with Python, PHP, JavaScript (AJAX and JSON), and basic Natural Language Processing. The work includes three stages:

  1. Stage 1: File Extraction and Verification
  2. Stage 2: Data Processing
  3. Stage 3: Summary Generation


Stage 1: File Extraction and Verification

Stage 1's major features include (but are not limited to):

  1. File extraction and verification for the uploaded CORD-19 dataset in compressed tar and/or gz format.
  2. Cron jobs for the backend services.
  3. A dashboard that shows the backend services' progress.
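
As a rough sketch, the extraction-and-verification step could look like the following. The paths, the function name, and its signature are illustrative, not the service's actual code; the real backend runs this kind of job on a cron schedule.

```python
import os
import tarfile

def extract_and_verify(archive_path: str, dest_dir: str) -> list:
    """Extract a CORD-19 .tar.gz archive and verify each file landed on disk.

    Returns the names of the members that extracted successfully, i.e. the
    file exists at the destination and its size matches the archive record.
    """
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        tar.extractall(dest_dir)
        for m in members:
            path = os.path.join(dest_dir, m.name)
            # Verification step: existence plus exact size match.
            if os.path.isfile(path) and os.path.getsize(path) == m.size:
                extracted.append(m.name)
    return extracted
```

A dashboard job could then report the ratio of verified members to archive members as the service's working progress.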


Stage 2: Data Processing

Stage 2's major features include (but are not limited to):

  1. Processing the CORD-19 dataset's full text in JSON format.
  2. Analyzing and summarizing useful Part-of-Speech (PoS) tags in the CORD-19 dataset.
  3. Storing useful PoS tags and relevant n-grams (n from 1 to 4).
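
The n-gram counting part of this stage (n from 1 to 4) can be sketched with the standard library alone, as below. This is a minimal illustration: the actual service also performs PoS tagging, which would require an NLP toolkit such as NLTK or spaCy, and the regex tokenizer here stands in for whatever tokenizer the pipeline really uses.

```python
import re
from collections import Counter

def extract_ngrams(text: str, max_n: int = 4) -> Counter:
    """Count word n-grams (n = 1..max_n) over lowercased regex tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # Join the n consecutive tokens into one n-gram key.
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

The resulting counts can be stored per document so question vectors can later be matched against them.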


Stage 3: Summary Generation

Stage 3's major features include (but are not limited to):

  1. A web-based user interface for users to ask their questions related to coronavirus.
  2. Cosine similarity calculation based on the useful PoS tags and their corresponding n-grams extracted from the asked question.
  3. Summary generation and storage for the asked questions.
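
The cosine similarity calculation in step 2 can be written directly over sparse frequency vectors. The sketch below, which assumes vectors are stored as `Counter` objects keyed by n-gram, is one straightforward way to do it; the service's internal representation may differ.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram frequency vectors."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Values range from 0.0 (no shared n-grams) to 1.0 (proportional frequency vectors), which makes the measure insensitive to document length.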

Frequently Asked Questions

  • You can ask anything related to COVID-19. This service aims to generate a summary based on the question asked. You can take a look here to get some idea about the questions. Some sample questions are: Should I use soap and water or hand sanitizer to protect against COVID-19? Can mosquitoes or ticks spread the virus that causes COVID-19?

  • We are strictly against collecting user-sensitive data. We store an anonymous, system-generated UUID as a cookie that identifies the user. We keep the question and the generated summary in our database, and in cookies, for a better experience.
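
As a sketch, minting the anonymous identifier could look like the following. The cookie name `user_uuid` and the plain-`dict` stand-in for the web framework's cookie jar are assumptions for illustration only.

```python
import uuid

def get_or_create_user_id(cookies: dict) -> str:
    """Return the anonymous user UUID from the cookie jar, minting one if absent.

    The UUID carries no personal information; it only lets the service
    associate a returning browser with its previous questions and summaries.
    """
    if "user_uuid" not in cookies:
        cookies["user_uuid"] = str(uuid.uuid4())
    return cookies["user_uuid"]
```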

  • We are using the CORD-19 dataset. CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. We periodically run backend services to process this large amount of raw text (236,336 full-text academic articles as of July 19, 2021) with basic Natural Language Processing techniques, including tokenization, n-gram extraction, and part-of-speech tagging. The summary is then generated using data mining techniques. It does not guarantee a specific answer to the question, but it gives you a general idea about the topic asked. The service improves over time and will provide better results in the future.

  • We are using the CORD-19 dataset. CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. We process the documents by extracting their sentences and then extracting the n-grams in each sentence. When the user asks a question, we identify the keywords (n-grams) in it and create a vector representation, where each component is a keyword's frequency. We compare this vector to the document vectors by cosine similarity to identify the best document. Then, in the same way, we match the question vector against that document's sentences by cosine similarity to find the best-matching sentences, and we concatenate the top two sentences to generate the summary.
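
  The pipeline described above can be sketched end to end as follows. This is a simplified stand-in for the service, not its actual code: the regex tokenizer, the naive sentence splitter, and the function names are all illustrative, and the real system works from PoS-filtered n-grams stored in a database rather than raw strings.

```python
import math
import re
from collections import Counter

def ngram_vector(text: str, max_n: int = 4) -> Counter:
    """Frequency vector of word n-grams (n = 1..max_n)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(" ".join(tokens[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(tokens) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer(question: str, documents: list) -> str:
    """Pick the most similar document, then concatenate its top two sentences."""
    q = ngram_vector(question)
    best_doc = max(documents, key=lambda d: cosine(q, ngram_vector(d)))
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", best_doc)
    ranked = sorted(sentences, key=lambda s: cosine(q, ngram_vector(s)),
                    reverse=True)
    return " ".join(ranked[:2])
```

  Ranking sentences within only the best document keeps the summary on a single source's topic instead of mixing sentences from unrelated articles.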

  • Yes, the service aims to generate a summary based on your question. We have extracted data from tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. It does not guarantee a specific answer to the question, but you can get a general idea about the question asked.

  • No, we are against storing users' personal information. No questions will be shared or made publicly available. We keep the question and the summary in our database for a better user experience. We identify you uniquely by a system-generated UUID stored in a cookie.

  • We periodically run backend services to process a large amount of raw text (236,336 full-text academic articles as of July 19, 2021) with basic Natural Language Processing techniques, including tokenization, n-gram extraction, and part-of-speech tagging. We process the documents by extracting their sentences and then extracting the n-grams in each sentence. The research goal is an automatic answering service that correctly identifies the keywords in a question and summarizes the associated relevant content in a way that satisfies the user.