Automatic Answering Service for Coronavirus Question


This research project is supported by Mitacs Globalink: "Automatic Answering Service for Coronavirus Question"
under the supervision of Dr. Maiga Chang, Professor at the School of Computing and Information Systems, Athabasca University.

About the Project

The Allen Institute for AI, together with leading research groups and publishers, provides the COVID-19 Open Research Dataset (CORD-19). This research designed and developed an automatic answering service based on a knowledge graph pre-trained with CORD-19.

  • The service will identify the keyphrases from a question entered by a user with Natural Language Processing techniques.
  • The research goal is for the automatic answering service to correctly identify the keyphrases in a question and summarize the associated content that is relevant to the question, so that the user is satisfied.

About Us

...
Our Mission

The goal is to use Natural Language Processing concepts and techniques to make a computer capable of reading text-based content and extracting important keyphrases from every sentence. The computer applies the same technique to user questions so that it can identify similar content and generate a corresponding summary for the user.

...
Our Supervisor

Dr. Maiga Chang is a Full Professor in the School of Computing and Information Systems at Athabasca University, Canada.

...
Research Goal

The research goal is for the automatic answering service to correctly identify the keyphrases in a question and summarize the associated content that is relevant to the question, so that the user is satisfied.

Our Team

...
Maria IRIARTE
2022~2023

Maria is a Chartered Civil Engineer specializing in data science and spatial analysis for large public and environmental infrastructures, holding a Master's Degree in Big Data & Visual Analytics and in Occupational Risk Prevention. Maria is currently writing her doctoral thesis in Computer Science at the International University of La Rioja and pursuing a Master of Science in Information Systems at Athabasca University.

...
Supun DE SILVA
2022

Supun De Silva is an undergraduate student at Athabasca University pursuing a Bachelor of Science majoring in Computing Information Systems with a minor in Game Programming. Supun's current research interest is artificial intelligence in education, specifically identifying student learning weaknesses and providing adaptive feedback. Other areas of interest include cloud computing, system administration, web-based system development, network programming, database administration, game programming, and system analysis and design. Outside of schoolwork, Supun likes reading books and weightlifting.

...
Mohammed SALEH
2022

Mohammed Saleh is an undergraduate student at the University of Alberta, pursuing a Bachelor's degree in Computer Software Engineering. His main research focus was the development of the Ask4Summary plugin, but he is also interested in software development and video game production. Outside of computing, Mohammed enjoys the outdoors and sports.

...
Mikhail VINOGRADOV
2021

Mikhail Vinogradov is a Software Developer and Cloud Practitioner with a passion for innovation and solving problems through technology. His extracurricular interests include Machine Learning, Artificial Intelligence, Systems Architecture, and Biotechnology. He recently completed his Master's of Information Systems at Athabasca University and is looking for his next project to pursue his PhD.

...
Sayantan PAL
2021

Sayantan Pal was an undergraduate student in Computer Science and Engineering at Heritage Institute of Technology, India. His research interests lie in Machine Learning and Natural Language Processing. In 2022 he was accepted into the PhD program in Computer Science & Engineering at the University at Buffalo, The State University of New York, with a full-time Teaching Assistant appointment and a Chair's Fellowship.

Videos


Presentation Video

Live demonstrations of the outcome of 12 weeks of work (May 2021~July 2021) on the preliminary functions of the Coronavirus Question Answering research, built with Python, PHP, JavaScript (AJAX and JSON), and basic Natural Language Processing. The work has three stages:

  1. Stage 1: File Extraction and Verification
  2. Stage 2: Data Processing
  3. Stage 3: Summary Generation


Stage 1: File Extraction and Verification

Stage 1's major features include (but are not limited to):

  1. File extraction and verification of the uploaded CORD-19 dataset in compressed tar and/or gz format.
  2. Cron jobs for the backend services.
  3. Dashboard that shows backend services' working progress.
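
The extraction and verification step can be illustrated with a minimal Python sketch. It is not the project's actual code: the function name `extract_and_verify` and the checks it performs (safe paths, matching file sizes) are assumptions chosen for illustration, using only the standard `tarfile` module.

```python
import tarfile
from pathlib import Path


def extract_and_verify(archive_path, dest_dir):
    """Extract a .tar or .tar.gz archive and return the extracted file names.

    "Verification" in this sketch means (an assumption, not the project's rule):
    the archive opens, every file stays inside dest_dir, and every extracted
    file has the size recorded in the archive.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Mode "r:*" lets tarfile detect gzip compression automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and devices
            target = (dest / member.name).resolve()
            # Reject members whose path would escape the destination directory.
            if dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            archive.extract(member, path=dest)
            if target.stat().st_size != member.size:
                raise IOError(f"size mismatch after extraction: {member.name}")
            extracted.append(member.name)
    return extracted
```

A scheduled job (for example, a cron job) could call such a function on each newly uploaded archive and record the returned file list for the dashboard.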


Stage 2: Data Processing

Stage 2's major features include (but are not limited to):

  1. Processing the CORD-19 dataset's full text in JSON format.
  2. Analyzing and summarizing useful Part-of-Speech (PoS) tags in the CORD-19 dataset.
  3. Storing useful PoS tags and relevant n-grams (n is from 1 to 4).
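
As a rough illustration of the n-gram part of this stage, the sketch below counts every n-gram with n from 1 to 4 in a sentence. The tokenizer and the function names are assumptions made for this example; PoS tagging is left out because it requires a tagger, and the source does not say which one the project uses.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (a simplification)."""
    # Keeps hyphenated terms such as "covid-19" together as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def extract_ngrams(sentence, max_n=4):
    """Count every n-gram of length 1..max_n in the sentence."""
    tokens = tokenize(sentence)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

For a four-word sentence this yields four 1-grams, three 2-grams, two 3-grams, and one 4-gram.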


Stage 3: Summary Generation

Stage 3's major features include (but are not limited to):

  1. Web-based user interface for users to ask their questions related to coronavirus.
  2. Cosine similarity calculation based on the useful PoS tags and their corresponding n-grams extracted from the asked question.
  3. Summary generation and storing for the asked questions.
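
For reference, the standard definition of cosine similarity between two frequency vectors is given below. The symbols are notation introduced here, not taken from the project: q is the question vector, d is a document or sentence vector, and k is the number of distinct n-grams.

```latex
\[
\cos(\mathbf{q}, \mathbf{d})
  = \frac{\sum_{i=1}^{k} q_i d_i}
         {\sqrt{\sum_{i=1}^{k} q_i^2}\,\sqrt{\sum_{i=1}^{k} d_i^2}}
\]
```

For example, with q = (1, 1, 0) and d = (1, 0, 1), the dot product is 1 and each vector has length √2, so the similarity is 1 / (√2 · √2) = 0.5.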

Frequently Asked Questions

  • You can ask anything related to COVID-19. This service aims to generate a summary based on the question asked. You can take a look here to get some idea of the questions. Sample questions include: Should I use soap and water or hand sanitizer to protect against COVID-19? Can mosquitoes or ticks spread the virus that causes COVID-19?

  • We are strictly against the idea of collecting user-sensitive data. We store an anonymous system-generated user UUID as a cookie that identifies the user. We keep the question and the summary generated in our database and store them in cookies for a better experience.

  • We use the CORD-19 dataset (https://allenai.org/data/cord-19) as the knowledge base for the machine to read. CORD-19 is a free resource of scholarly articles (717,012 full-text academic articles as of June 2, 2022) about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. A backend service reads the plain text data with basic Natural Language Processing techniques, including tokenization, n-gram extraction, and part-of-speech tagging. The summary is then generated using the cosine similarity method, which is also used by search engines. The result is not a specific answer to the question but a summary that may be related to it. The service improves with time and will provide better results in the future.

  • We process the documents by extracting their sentences, then extracting the n-grams in each sentence and their part-of-speech tags (e.g., verb, noun). When the user asks a question, we identify the keywords (n-grams) in the question and build a vector whose entries are the frequencies of those keywords. We compare this vector with the vectors that represent all documents using cosine similarity, which identifies the best-matching documents (the top N documents; N is 2 by default). We then compare the question vector with the vectors that represent every sentence of the selected documents, again by cosine similarity, to find the best-matching sentences (the top M sentences; M is 5 by default). Finally, we concatenate the top M sentences to generate the summary.

  • Yes, the service aims to generate a summary based on your question. We have extracted data from tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. It is not really a specific answer to the question but a summary that may be related to the question asked. The service improves with time. It will provide better results in the future.

  • No, we are against storing users' personal information. No questions will be shared or made publicly available. We keep the question and the summary in our database for a better user experience. We uniquely identify you by a system-generated UUID and store it in a cookie.

  • The research goal is to have an automatic answering service that correctly identifies the keyphrases in a question and summarizes the associated content that is relevant to the question, so that the user is satisfied.

  • Ask4Summary uses two of the VIP Research Group's services: the N-Gram POS service and the AskCOVIDQ summary algorithm. Ask4Summary directly uses the N-Gram POS service to get the N-Grams and Part-of-Speech tags from the user's question and the course material. Then, it uses a variant of the AskCOVIDQ summary algorithm to produce a response based on the frequency of N-Grams in the question and in the database.

  • Check the Videos and the Publications above for a comprehensive guide on Ask4Summary.

  • Yes, Ask4Summary may be updated in the future to improve the course material scanning, the N-Gram POS service, and the summary algorithm, but there is no immediate plan to do so.
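
The two-stage matching described in the answers above (rank documents first, then rank the sentences of the selected documents) can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the service's implementation: the function names are hypothetical, n-grams are limited to n = 1..2 for brevity, and no PoS filtering is applied. The defaults of 2 documents and 5 sentences follow the values given in the text.

```python
import math
import re
from collections import Counter


def vectorize(text, max_n=2):
    """Frequency vector of word n-grams (n = 1..max_n) in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(question, documents, top_docs=2, top_sentences=5):
    """Build a summary; documents is a list of lists of sentence strings."""
    q = vectorize(question)
    # First pass: rank whole documents against the question.
    ranked_docs = sorted(
        documents,
        key=lambda doc: cosine(q, vectorize(" ".join(doc))),
        reverse=True,
    )[:top_docs]
    # Second pass: rank the sentences of the selected documents.
    candidates = [sentence for doc in ranked_docs for sentence in doc]
    best = sorted(
        candidates, key=lambda s: cosine(q, vectorize(s)), reverse=True
    )[:top_sentences]
    # The summary is the concatenation of the best-matching sentences.
    return " ".join(best)
```

Ranking documents before sentences keeps the second pass small: only the sentences of the top documents are compared with the question, rather than every sentence in the collection.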