About the Project
The Allen Institute for AI, working with leading research groups and publishers, provides the COVID-19 Open Research Dataset (CORD-19). This research designed and developed an automatic answering service based on a knowledge graph pre-trained with CORD-19.
- The service identifies the keyphrases in a question entered by a user with Natural Language Processing techniques.
- The research goal is for the automatic answering service to correctly identify the key terms in a question and to summarize the content relevant to that question in a way that satisfies the user.
The approach uses Natural Language Processing concepts and techniques to make a computer capable of reading text-based content and extracting the important keyphrases from every sentence. The computer applies the same technique to a user's question so that it can identify similar content and generate a corresponding summary for the user.
Dr. Maiga Chang is a Full Professor in the School of Computing and Information Systems at Athabasca University, Canada.
Maria is a Chartered Civil Engineer specialized in data science and spatial analysis for large public and environmental infrastructures, holding Master's Degrees in Big Data & Visual Analytics and in Occupational Risk Prevention. Maria is currently writing her doctoral thesis in Computer Science at the International University of La Rioja and pursuing a Master of Science in Information Systems at Athabasca University.
Supun De Silva is an undergraduate student at Athabasca University pursuing a Bachelor of Science majoring in Computing Information Systems with a minor in Game Programming. His current research interest is in artificial intelligence in education, specifically identifying student learning weaknesses and providing adaptive feedback. Other areas of interest include cloud computing, system administration, web-based system development, network programming, database administration, game programming, and system analysis and design. Outside of schoolwork, he likes reading books and weightlifting.
Mohammed Saleh is an undergraduate student at the University of Alberta, pursuing a Bachelor's degree in Computer Software Engineering. His main research focus was the development of the Ask4Summary plugin, but he is also interested in software development and video game production. Outside of computing, Mohammed enjoys the outdoors and sports.
Mikhail Vinogradov is a Software Developer and Cloud Practitioner with a passion for innovation and solving problems through technology. His extracurricular interests include Machine Learning, Artificial Intelligence, Systems Architecture, and Biotechnology. He recently completed his Master's of Information Systems at Athabasca University and is looking for his next project to pursue his PhD.
Sayantan Pal was an undergraduate student pursuing Computer Science and Engineering at Heritage Institute of Technology, India. His research interests lie in the domains of Machine Learning and Natural Language Processing. In 2022 he was accepted into the Doctor of Philosophy in Computer Science & Engineering program at the University at Buffalo, The State University of New York, with a full-time Teaching Assistant appointment and a Chair's Fellowship.
- Stage 1: File Extraction and Verification
- Stage 2: Data Processing
- Stage 3: Summary Generation
Stage 1's major features include (but are not limited to):
- File extraction and verification of the uploaded CORD-19 dataset in compressed tar and/or gz file format.
- Cron jobs for the backend services.
- Dashboard that shows backend services' working progress.
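The extraction and verification step in Stage 1 might look roughly like the sketch below. It is an illustration only, not the project's backend code: the function name `extract_and_verify`, the choice to treat "verification" as "each JSON member parses", and the path-safety check are all assumptions.

```python
import json
import tarfile
from pathlib import Path


def extract_and_verify(archive_path, dest_dir):
    """Extract JSON members of a .tar or .tar.gz archive that parse cleanly.

    Hypothetical helper: returns (valid, invalid), two sorted lists of
    member names. Only valid members are written to dest_dir.
    """
    dest = Path(dest_dir).resolve()
    valid, invalid = [], []
    # Mode "r:*" lets tarfile detect plain or compressed archives itself.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            target = (dest / member.name).resolve()
            # Reject members that would be written outside dest_dir.
            if not target.is_relative_to(dest):
                invalid.append(member.name)
                continue
            data = archive.extractfile(member).read()
            try:
                json.loads(data.decode("utf-8"))
            except ValueError:  # covers bad UTF-8 and malformed JSON
                invalid.append(member.name)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
            valid.append(member.name)
    return sorted(valid), sorted(invalid)
```

A scheduled job could call this helper on each newly uploaded archive and report the two lists to a progress dashboard.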
Stage 2's major features include (but are not limited to):
- Processing the CORD-19 dataset's full text in JSON format.
- Analyzing and summarizing useful Part-of-Speech (PoS) tags in the CORD-19 dataset.
- Storing useful PoS tags and relevant n-grams (n is from 1 to 4).
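As a rough illustration of the n-gram part of Stage 2, the sketch below lists every n-gram of a sentence for n from 1 to 4. The function name `extract_ngrams` and the simple regular-expression tokenizer are assumptions made for this example. Part-of-speech tagging is deliberately left out, since it would need an NLP library and the one used by the service is not stated here.

```python
import re


def extract_ngrams(sentence, max_n=4):
    """Return all 1- to max_n-grams of a sentence as space-joined strings."""
    # Assumed tokenizer: lowercase alphanumeric runs, keeping internal
    # hyphens so that a term such as "SARS-CoV-2" stays one token.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams
```

For a four-word sentence this yields 4 unigrams, 3 bigrams, 2 trigrams, and 1 four-gram.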
Stage 3's major features include (but are not limited to):
- Web-based user interface for users to ask their questions related to coronavirus.
- Cosine similarity calculation based on the useful PoS tags extracted from the asked question and their corresponding n-grams.
- Summary generation and storing for the asked questions.
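For reference, the cosine similarity named in Stage 3 is conventionally defined as below. The symbols are chosen for this illustration: $q_i$ and $d_i$ are assumed to be the frequencies of the $i$-th n-gram in the question and in a document (or sentence), respectively.

```latex
\[
  \operatorname{sim}(q, d)
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_{i} q_i d_i}{\sqrt{\sum_{i} q_i^{2}} \, \sqrt{\sum_{i} d_i^{2}}}
\]
% Worked example: q = (1, 1, 0), d = (1, 0, 1)
% q . d = 1, ||q|| = ||d|| = sqrt(2), so sim(q, d) = 1 / 2.
```

Because frequencies are non-negative, the value lies between 0 (no shared n-grams) and 1 (identical direction).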
- Rita Kuo, Maria F. Iriarte, Di Zou and Maiga Chang. (2023). Preliminary Performance Assessment on Ask4Summary’s Reading Methods for Summary Generation. In: 19th International Conference on Intelligent Tutoring Systems, (ITS 2023), Hybrid, Corfu, Greece, June 2-June 5, 2023. (Springer)(accepted)
- Mohammed Saleh, Maiga Chang, and Maria F. Iriarte. (2022). Ask4Summary Automatically Responds Student's Question with a Summary Assembled from Course Content. In: Proceedings II of the 30th International Conference on Computers in Education, (ICCE 2022), Kuala Lumpur, Malaysia (Hybrid), November 28-December 2, 2022, 408-413. https://icce2022.apsce.net/uploads/P2_W06_055.pdf
- Mohammed Saleh, Maria F. Iriarte, and Maiga Chang. (2022). Ask4Summary: A Summary Generation Moodle Plugin Using Natural Language Processing Techniques. In: Proceedings of the 30th International Conference on Computers in Education, (ICCE 2022), Kuala Lumpur, Malaysia (Hybrid), November 28-December 2, 2022, 549-554. https://icce2022.apsce.net/uploads/P1_C6_85.pdf
- Sayantan Pal, Maiga Chang, and Maria Fernandez Iriarte. (2021). Summary Generation using Natural Language Processing Techniques and Cosine Similarity. In: Proceedings of the 21st International Conference on Intelligent Systems Design and Applications, (ISDA 2021), Virtual, December 13-15, 2021, 508-517. https://doi.org/10.1007/978-3-030-96308-8_47 (Springer)
Frequently Asked Questions
What kind of questions can I ask?
You can ask anything related to COVID-19. This service aims to generate a summary based on the question asked. You can take a look here to get an idea of the kinds of questions. Sample questions include: Should I use soap and water or hand sanitizer to protect against COVID-19? Can mosquitoes or ticks spread the virus that causes COVID-19?
What data do you collect from users?
We are strictly against collecting sensitive user data. We store an anonymous, system-generated UUID as a cookie that identifies the user. We keep the question and the generated summary in our database and store them in cookies for a better experience.
How accurate are the results I can expect?
We are using the CORD-19 dataset (https://allenai.org/data/cord-19) as the knowledge base for the machine to read. CORD-19 is a free resource of tens of thousands of scholarly articles (i.e., 717,012 full-text academic articles as of June 2, 2022) about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. A backend service reads the plain text data with basic Natural Language Processing techniques that include tokenization, n-gram extraction, and part-of-speech tagging. The summary is then generated using the cosine similarity method, which is also used by search engines. The result is not a specific answer to the question but a summary that may be related to the question asked. The service improves with time and will provide better results in the future.
How is the summary generated?
We process the documents by extracting their sentences and then extracting the n-grams and their part-of-speech tags (e.g., verb, noun, etc.) from each sentence. When the user asks a question, we identify the keywords (n-grams) in the question and create a vector representation, where the vector consists of the frequencies of the keywords. This question vector is compared with the vectors that represent all documents using the cosine similarity method, which identifies the best-matching documents (i.e., the top N documents; by default, N is 2). We then compare the question vector with the vectors that represent all sentences of the selected documents, again by cosine similarity, to find the best-matching sentences (i.e., the top M sentences; by default, M is 5). Finally, we concatenate the top M sentences to generate the summary.
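The two-step ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual implementation: the names `keyword_vector`, `cosine`, and `summarize` are made up for this example, the tokenizer is a plain regular expression, every n-gram (n from 1 to 4) counts as a keyword, and no part-of-speech filtering is applied.

```python
import math
import re
from collections import Counter


def keyword_vector(text, max_n=4):
    """Count the 1- to max_n-grams in text (no PoS filtering in this sketch)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def cosine(a, b):
    """Cosine similarity between two frequency vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(question, documents, top_n=2, top_m=5):
    """documents: a list of documents, each given as a list of sentences."""
    q = keyword_vector(question)

    def doc_vector(doc):
        # A document vector is the sum of its sentence vectors.
        return sum((keyword_vector(s) for s in doc), Counter())

    # Step 1: rank whole documents against the question, keep the top N.
    best_docs = sorted(
        documents, key=lambda doc: cosine(q, doc_vector(doc)), reverse=True
    )[:top_n]
    # Step 2: rank the sentences of those documents, keep the top M.
    sentences = [s for doc in best_docs for s in doc]
    best = sorted(
        sentences, key=lambda s: cosine(q, keyword_vector(s)), reverse=True
    )[:top_m]
    # The summary is the concatenation of the selected sentences.
    return " ".join(best)
```

The defaults `top_n=2` and `top_m=5` mirror the N and M values given in the answer. A production system would precompute and store the document and sentence vectors instead of rebuilding them for every question.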
Can I use this website to gain knowledge related to COVID-19?
Yes, the service aims to generate a summary based on your question. We have extracted data from tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. It is not really a specific answer to the question but a summary that may be related to the question asked. The service improves with time. It will provide better results in the future.
Will my question be publicly available and shared with others?
No, we are against storing users' personal information. No questions will be shared or made publicly available. We keep the question and the summary in our database for a better user experience. We uniquely identify you by a system-generated UUID and store it in a cookie.
What is the goal of the research?
The research goal is to have an automatic answering service that correctly identifies the key terms in a question and summarizes the content relevant to that question in a way that satisfies the user.
What services does Ask4Summary use?
Ask4Summary uses two of the VIP Research Group's services - the N-Gram POS service, and the AskCOVIDQ summary algorithm. Ask4Summary directly uses the N-Gram POS service to get the N-Grams and Part of Speech from the user's question and the course material. Then, it uses a derivation of the AskCOVIDQ summary algorithm to produce a response based on frequency of N-Grams in the question and in the database.
How can I use Ask4Summary?
Check the Videos and the Publications above for a comprehensive guide on Ask4Summary.
Will Ask4Summary be updated in the future?
Yes, Ask4Summary may be updated in the future to improve the course material scanning, the N-Gram POS service, and the summary algorithm, but there is no immediate plan to do so.