A Practical Application of Retrieval-Augmented Generation for Website-Based Chatbots: Combining Web Scraping, Vectorization, and Semantic Search

Pokhrel, Sangita (ORCID: https://orcid.org/0009-0008-2092-7029), K C, Bina and Shah, Prashant Bikram (2025) A Practical Application of Retrieval-Augmented Generation for Website-Based Chatbots: Combining Web Scraping, Vectorization, and Semantic Search. Journal of Trends in Computer Science and Smart Technology, 6 (4). pp. 424-442.

Text
07.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial.


Abstract

Retrieval-Augmented Generation (RAG) significantly enhances the capabilities of large language models (LLMs) by integrating information retrieval with text generation, which is particularly relevant for applications requiring context-aware responses based on dynamic data sources. This study presents a practical implementation of a RAG model tailored to a chatbot that answers user inquiries using content from specified websites. The methodology encompasses several key steps: web scraping with BeautifulSoup to extract relevant content, text processing to segment this content into manageable chunks, and vectorization to create embeddings for efficient semantic search. Using semantic search, the system retrieves the document segments most relevant to each user query. The OpenAI API is then used to generate contextually appropriate responses from the retrieved information. Key results highlight the system's effectiveness in providing accurate and relevant answers, with evaluation centered on response quality, retrieval efficiency, and user satisfaction. The study contributes an integration of scraping, vectorization, and semantic search technologies into a cohesive chatbot application, offering insights into the practical implementation of RAG models.
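
The pipeline described in the abstract (extract page text, chunk it, vectorize the chunks, retrieve by similarity, then prompt a model) can be illustrated with a minimal Python sketch. This is not the authors' implementation. To stay dependency-free it substitutes the standard library's `html.parser` for BeautifulSoup and a toy term-frequency vector for learned embeddings, and it stops at building the prompt rather than calling the OpenAI API. All function names, chunk sizes, and the prompt wording are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>/<style> (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Step 1 (scraping): reduce an HTML page to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text, size=50, overlap=10):
    """Step 2 (chunking): overlapping word windows; sizes are assumed values."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def embed(text):
    """Step 3 (vectorization): toy term-frequency vector, NOT a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Step 4 (semantic search): return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Step 5 (generation input): the text that would be sent to an LLM API."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

A caller would chain these as `build_prompt(q, retrieve(q, chunk_words(html_to_text(page))))` and pass the result to a text-generation API. In a real system the `embed` function would be replaced by an embedding model, which is what lets retrieval match on meaning rather than shared words.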

Item Type: Article
Status: Published
DOI: 10.36548/jtcsst.2024.4.007
School/Department: London Campus
URI: https://ray.yorksj.ac.uk/id/eprint/11412
