Web Data Scraping Technology using TF-IDF to Enhance the Big Data Quality on Sentiment Analysis

Pokhrel, Sangita ORCID: https://orcid.org/0009-0008-2092-7029, Somasiri, Nalinda ORCID: https://orcid.org/0000-0001-6311-2251, Jeyavadhana, C. Rebecca and Ganesan, Swathi ORCID: https://orcid.org/0000-0002-6278-2090 (2022) Web Data Scraping Technology using TF-IDF to Enhance the Big Data Quality on Sentiment Analysis. In: CDSBDA 2022: XVI. International Conference on Data Science and Big Data Analytics, 7-11 November 2022, Yogyakarta, Indonesia.

Mypaper_YorkUniv_UK.pdf - Published Version

Tourism is a booming industry with huge future potential for global wealth and employment. Vast amounts of data are generated on social media sites every day, creating numerous opportunities to bring more insight to decision-makers. Integrating Big Data technology into the tourism industry will allow companies to determine where their customers have been and what they like. This information can then be used by businesses, such as those managing visitor centres or hotels, and tourists can get a clear idea of places before visiting. From a technical perspective, natural language is processed by analysing the sentiment features of online reviews from tourists, and we then supply an enhanced long short-term memory (LSTM) framework for sentiment feature extraction from travel reviews. To evaluate the effectiveness of our methodology, we constructed a web review database for experimental validation using a crawler and a web scraping technique. The sentence text is first classified with the VADER and RoBERTa models to obtain the polarity of the reviews. In this paper, we study feature extraction methods, namely Count Vectorization and TF-IDF Vectorization, and implement a Convolutional Neural Network (CNN) classifier for sentiment analysis to decide whether a tourist's attitude towards a destination is positive, negative, or simply neutral, based on the review text they posted online. The results show that, after pre-processing and cleaning the dataset, the CNN algorithm achieved an accuracy of 96.12% for positive and negative sentiment analysis.
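
The two feature extraction methods named in the abstract, Count Vectorization and TF-IDF Vectorization, can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the sample reviews, and the function names `tokenize`, `count_vectorize`, and `tfidf_vectorize` are hypothetical, and the smoothed weighting idf = ln((1 + N) / (1 + df)) + 1 with L2 row normalisation is an assumed convention (it matches the default of scikit-learn's `TfidfVectorizer`), which the record does not confirm the paper used.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple tokenizer (an assumption): lowercase the text and
    # treat every non-alphabetic character as a separator.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def count_vectorize(docs):
    # Count Vectorization: one row per document, one column per vocabulary
    # term, each cell holding the raw term count.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tfidf_vectorize(docs):
    # TF-IDF Vectorization: scale each count by an inverse document
    # frequency weight, then L2-normalise every row.
    vocab, counts = count_vectorize(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed idf (assumed convention, see lead-in).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    matrix = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        matrix.append([v / norm if norm else 0.0 for v in weighted])
    return vocab, matrix


# Hypothetical travel reviews, invented purely for illustration.
reviews = [
    "The beach was beautiful and the hotel staff were friendly",
    "The hotel room was dirty and the staff were rude",
    "Beautiful temple, friendly guides, great food",
]

vocab, tfidf = tfidf_vectorize(reviews)
for term, weight in zip(vocab, tfidf[1]):
    if weight > 0:
        print(f"{term:10s} {weight:.3f}")
```

In the second review, a term that occurs in only one review (such as "dirty") receives a higher weight than a term with the same count that is shared across reviews (such as "hotel"), which is the property that makes TF-IDF features more discriminative than raw counts as classifier input.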

Item Type: Conference or Workshop Item (Paper)
Status: Published
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
School/Department: London Campus
URI: https://ray.yorksj.ac.uk/id/eprint/7202
