This project focuses on analyzing Ryanair customer reviews to extract key topics and classify customer satisfaction using advanced text mining and machine learning techniques. The dataset includes 2,250 reviews from 2012 to the present, providing a rich source of customer feedback for detailed analysis.
The project is divided into three Jupyter notebooks, each addressing different aspects of the analysis:
Notebook 1 objective: Extract and analyze the main topics from Ryanair reviews.
- Text Mining Preprocessing: Clean and preprocess the dataset, focusing on extracting and preparing nouns and adjectives for further analysis.
- Latent Dirichlet Allocation (LDA): Identify and extract key topics from the reviews.
- Bayesian Analysis: Estimate sentiment scores for adjectives, associating them with the extracted topics.
- Sentiment Score Calculation: Normalize sentiment scores (positive or negative) for each topic per review.
- Logistic Regression: Evaluate the importance of each topic as a feature and assess the effectiveness of the classifier with these enhanced features.
Notebook 2 objective: Apply the extracted topics and sentiment scores to identify patterns and relationships.
- Pattern Frequency Analysis: Explore correlations between flight characteristics (e.g., punctuality, comfort, customer origin, destination, flight reason) and the sentiment scores of the extracted topics.
- Visualization and Interpretation: Provide visual representations and interpretations of how different topics and sentiments relate to various aspects of the flight experience.
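One plausible reading of the pattern analysis above is a group-by and correlation pass over per-review topic sentiment scores. The sketch below assumes that reading; the column names (`traveller_type`, `punctuality_sentiment`, `comfort_sentiment`, `overall_rating`) and all values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical per-review table: one flight characteristic plus topic sentiment scores.
df = pd.DataFrame({
    "traveller_type": ["Solo Leisure", "Business", "Family Leisure",
                       "Business", "Solo Leisure", "Family Leisure"],
    "punctuality_sentiment": [0.6, -0.4, 0.2, -0.7, 0.8, -0.1],
    "comfort_sentiment": [0.1, -0.5, -0.3, -0.2, 0.4, -0.6],
    "overall_rating": [8, 3, 5, 2, 9, 4],
})

sentiment_cols = ["punctuality_sentiment", "comfort_sentiment"]

# Average topic sentiment for each traveller type.
mean_by_type = df.groupby("traveller_type")[sentiment_cols].mean()
print(mean_by_type)

# Pearson correlation between topic sentiment and the overall rating.
corr = df[sentiment_cols + ["overall_rating"]].corr()
print(corr)
```

A table like `corr` can be passed directly to `seaborn.heatmap` for the visualization step.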
Notebook 3 objective: Provide a user-friendly interface to test and interact with the machine learning model.
- Interactive Interface: A simple web-based interface to input new reviews, apply the trained model, and receive predictions on customer satisfaction and topic relevance.
- Model Testing: Test and validate the model's performance with new data.
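As a rough idea of how such an interface might be wired up, the sketch below assumes a tkinter window, since tkinter is the library the requirements list for the interface. The tiny training set, the `predict_satisfaction` helper, and the `launch_app` function are all hypothetical stand-ins for the project's trained model and app.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in model trained on invented reviews (1 = satisfied, 0 = not satisfied).
train_reviews = [
    "friendly crew and punctual flight",
    "great price and smooth boarding",
    "rude staff and long delay",
    "hidden fees and lost baggage",
]
train_labels = [1, 1, 0, 0]
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_reviews, train_labels)


def predict_satisfaction(review_text):
    """Return a label and the predicted probability of a satisfied customer."""
    proba = float(model.predict_proba([review_text])[0][1])
    label = "satisfied" if proba >= 0.5 else "not satisfied"
    return label, proba


def launch_app():
    """Open a small window with a text box, a Predict button, and a result line."""
    import tkinter as tk  # imported here so the model code works without a display

    root = tk.Tk()
    root.title("Review satisfaction demo")

    text_box = tk.Text(root, height=8, width=60)
    text_box.pack(padx=10, pady=10)

    result = tk.StringVar(value="Enter a review and click Predict.")

    def on_predict():
        review = text_box.get("1.0", "end").strip()
        if review:
            label, proba = predict_satisfaction(review)
            result.set(f"{label} (p = {proba:.2f})")

    tk.Button(root, text="Predict", command=on_predict).pack()
    tk.Label(root, textvariable=result).pack(pady=10)
    root.mainloop()
```

Calling `launch_app()` opens the window; `predict_satisfaction` can also be called on its own to check the model against new reviews without the GUI.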
The analysis utilizes the dataset available on Kaggle:
- Dataset: Ryanair Reviews & Ratings
This dataset contains reviews and ratings from Ryanair customers, providing a foundation for extracting insights and developing the model.
To run the notebooks and interact with the model, you will need the following Python libraries:
- pandas
- numpy
- scikit-learn (imported as sklearn)
- nltk
- gensim
- matplotlib
- seaborn
- tkinter (for the interface)
- Start with the first notebook to preprocess the data, extract topics using LDA, perform sentiment analysis with Bayesian methods, and evaluate the classifier with logistic regression.
- Use the second notebook to analyze patterns and relationships between topics and flight characteristics.
- Test the final model using the third notebook's web interface.
- Launch the app interface from the third notebook to input new reviews and get predictions on customer satisfaction and topic relevance.
Feel free to fork the repository and contribute by improving the analysis, enhancing the model, or suggesting new features.
This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.
- Read my presentation and documentation.
- For any questions or issues, please open an issue on the repository or contact the project maintainer.
Happy analyzing! 😊