The Data Side of Nahel
Human-Centred Data Science & Information Systems Design Candidate
Hi! This portfolio highlights my work as a Master of Information student at the University of Toronto, with a focus on Human-Centred Data Science (HCDS) and Information Systems Design (ISD). It’s a space where I share what I’ve been working on, from my background and resume to coursework and projects that reflect my experience in data, systems, and real-world problem-solving.
PROJECTS
This section highlights some of the projects I’ve worked on throughout my coursework at the University of Toronto. It’s a mix of data analytics, data science, and information systems work where I’ve used real-world data to explore problems, build models, and find insights. Each project shows how I approach data, from cleaning and analysis to turning results into something practical and meaningful.
Predicting Hotel Booking Cancellations Using Data Analytics
2026
INF2190H: Introduction to Data Analytics
This project applies data analytics and data mining techniques to analyze hotel booking data and predict reservation cancellations, a critical challenge in the hospitality industry that directly impacts revenue, inventory management, and customer experience. Using a real-world dataset of over 119,000 bookings, we developed a structured analytics pipeline that combines classification, clustering, and association rule mining to uncover patterns in customer behavior and cancellation risk. Our analysis demonstrated that cancellations are not random but driven by measurable behavioral and temporal factors, enabling more proactive and data-driven decision-making strategies such as dynamic overbooking, targeted deposit policies, and improved demand forecasting.
My primary contribution focused on building the analytical foundation and extracting interpretable insights from the data. I led the data cleaning process, including handling missing values, removing data leakage variables, and preparing features for modeling to ensure a reliable and unbiased dataset. I also implemented association rule mining to quantify how specific booking characteristics, such as prior cancellations and long lead times, influence cancellation probability through support, confidence, and lift metrics. In addition, I developed the K-Means clustering pipeline, applying scaling, the Elbow Method, and silhouette analysis to segment customers into meaningful behavioral groups. These segments, such as high-risk “habitual cancellers” and highly reliable “committed guests,” translated directly into actionable business strategies, demonstrating how data can be leveraged to optimize operational decisions and reduce revenue loss.
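As a rough illustration of the three rule metrics mentioned above (not the project's actual code), the sketch below computes support, confidence, and lift for a single rule in plain Python. The function name `rule_metrics`, the item labels, and the toy bookings are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each transaction is a set of item labels. All names here are illustrative.
    """
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_both / n                          # P(A and C)
    confidence = n_both / n_a if n_a else 0.0     # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0 # P(C | A) / P(C)
    return support, confidence, lift


# Toy data (made up for illustration): each set is one booking's attributes.
bookings = [
    {"prior_cancellation", "long_lead_time", "canceled"},
    {"prior_cancellation", "canceled"},
    {"long_lead_time"},
    {"long_lead_time", "canceled"},
    {"prior_cancellation", "long_lead_time", "canceled"},
    set(),
    {"long_lead_time"},
    set(),
]

support, confidence, lift = rule_metrics(
    bookings, {"prior_cancellation"}, {"canceled"}
)
print(support, confidence, lift)
```

A lift above 1 means the antecedent makes cancellation more likely than the baseline rate; a lift near 1 means the rule adds little information.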
Experimental Design & RCT Analysis Using Python
2026
INF2178H: Experimental Design for Data Science
For this project, we designed and analyzed a series of randomized controlled trials (RCTs) to evaluate the effectiveness of leadership development interventions within a simulated organizational setting. Using Python, we worked with a synthetic dataset of 5,000 managers and built five experimental designs (parallel, cross-over, factorial, withdrawal, and matched pairs). We applied statistical models including two-way ANOVA, ANCOVA, and mixed ANOVA to assess how different training and coaching strategies influenced leadership outcomes over time. We also conducted exploratory data analysis and power analysis to understand data structure, trends, and sample size requirements across experimental designs.
My primary contribution focused on the technical implementation of the project. I developed the full data generation pipeline in Python, constructing a realistic dataset with balanced experimental groups, meaningful covariates, and outcome variables that reflected plausible treatment and contextual effects. I also implemented the RCT setup and statistical modeling framework, including random assignment, long-format restructuring for cross-over and repeated-measures designs, and ANOVA-based hypothesis testing. In the final stage, I designed and integrated blocking factors such as prior training and project workload to improve internal validity and isolate treatment effects. This project strengthened my ability to translate experimental design theory into end-to-end data science workflows, from simulation and data engineering to statistical inference and interpretation.
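To give a sense of two of the steps described above, here is a minimal sketch of balanced random assignment and wide-to-long restructuring using only the Python standard library. It is an illustration under assumptions, not the project's pipeline: the helper names (`assign_balanced`, `to_long`), the column names, and the sample scores are hypothetical.

```python
import random
from collections import Counter


def assign_balanced(unit_ids, arms, seed=0):
    """Shuffle units, then deal them out to arms in turn so groups stay balanced."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    return {uid: arms[i % len(arms)] for i, uid in enumerate(shuffled)}


def to_long(wide_rows, period_columns):
    """Turn one row per manager into one row per manager per period."""
    long_rows = []
    for row in wide_rows:
        for period, col in enumerate(period_columns, start=1):
            long_rows.append(
                {"manager_id": row["manager_id"], "period": period, "score": row[col]}
            )
    return long_rows


# Hypothetical example: 8 managers split across two arms.
managers = list(range(1, 9))
assignment = assign_balanced(managers, ["training", "control"], seed=42)
group_sizes = Counter(assignment.values())

# Hypothetical wide-format scores for a two-period cross-over design.
wide = [
    {"manager_id": 1, "score_p1": 3.2, "score_p2": 3.9},
    {"manager_id": 2, "score_p1": 2.8, "score_p2": 3.1},
]
long_rows = to_long(wide, ["score_p1", "score_p2"])
print(group_sizes, long_rows)
```

The long format puts each measurement on its own row, which is the shape repeated-measures and mixed ANOVA routines typically expect.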
Basketball & Finance: Analyzing the Relationship Between NBA Spending & Performance
2025
INF1344H: Introduction to Statistics for Data Science
In this project, we analyzed the relationship between financial investment and performance in the NBA using statistical methods in R. We built a complete data pipeline in RStudio, where we collected, cleaned, and merged multi-season team payroll and performance data, and engineered a key variable measuring payroll relative to the league’s salary cap. Using linear regression and data visualization, we found a statistically significant positive relationship between team spending and win percentage, showing that higher investment is associated with a greater likelihood of success but does not fully determine outcomes. The model explained a meaningful portion of performance while highlighting the role of external factors such as injuries, roster construction, and coaching.
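The project itself was written in R; the sketch below re-expresses the core idea in Python purely for illustration. It builds a payroll-to-cap ratio and fits a one-predictor least-squares line by hand. The function names, the cap value of 100, and every data point are made-up placeholders, not NBA figures or project results.

```python
def payroll_to_cap_ratio(payroll, salary_cap):
    """Express team payroll as a fraction of the salary cap."""
    return payroll / salary_cap


def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder values in arbitrary units (NOT real payrolls or win rates).
hypothetical_cap = 100.0
payrolls = [90.0, 100.0, 110.0, 120.0]
win_pct = [0.40, 0.45, 0.55, 0.60]

ratios = [payroll_to_cap_ratio(p, hypothetical_cap) for p in payrolls]
slope, intercept = fit_simple_ols(ratios, win_pct)
print(slope, intercept)
```

Scaling payroll by the cap makes seasons comparable even when the cap changes from year to year, which is the reason for engineering the ratio before fitting the model.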
We also extended the analysis to the player level using multiple linear regression, examining how individual performance metrics relate to salary. After addressing issues such as multicollinearity and heteroscedasticity through feature selection and log transformation, we found that offensive efficiency metrics such as 3-point percentage, free throw percentage, and rebounds were the strongest predictors of salary, while defensive metrics had a weaker impact. In addition to writing all statistical code in R, we led the results and discussion sections of the report, translating model outputs into real-world insights and critically evaluating limitations such as salary cap constraints, contract structures, and unobserved variables. This project demonstrates our ability to apply statistical modeling, data cleaning, and analytical reasoning to a real-world, data-driven problem.
WageScope: Predicting Income Patterns in Canada
2025
INF1340H: Programming for Data Science
This project analyzes wage variation among full-time workers in Canada using Statistics Canada’s Labour Force Survey from September 2025. Our goal was to identify which personal and job-related factors are most strongly associated with higher hourly earnings and to build models that classify workers above or below the national median wage. We developed a full end-to-end data science pipeline in Python, starting with over 112,000 observations and refining them to a clean analytical sample of about 40,000 workers through filtering, transformation, and validation. This included handling missing data, correcting scaling issues in wage and hours variables, recoding categorical features, and retaining realistic outliers to preserve meaningful labour market patterns.
We conducted descriptive and diagnostic analysis to uncover trends across demographics and employment conditions, supported by a comprehensive set of data visualizations that I led, including distribution plots, segmentation charts, and correlation heatmaps. These insights were further quantified using a linear regression model, which explained about 26 percent of wage variation and highlighted education, age, gender, sector, and union status as key predictors. We also built and evaluated three machine learning models: Logistic Regression, K-Nearest Neighbors, and Random Forest, with Logistic Regression achieving the highest accuracy at around 71 percent. This project demonstrates our ability to combine data engineering, statistical analysis, and machine learning to generate meaningful insights about income inequality and labour market dynamics.
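As a small illustration of the classification setup described above, the sketch below labels workers as above or below a median wage and scores a set of predictions by accuracy. It assumes the cutoff is the median of the sample itself, which may differ from how the project defined the national median. The function names, wages, and predictions are all hypothetical.

```python
from statistics import median


def label_above_median(hourly_wages):
    """Return the median cutoff and a 0/1 label per worker (1 = above the cutoff)."""
    cutoff = median(hourly_wages)
    return cutoff, [1 if w > cutoff else 0 for w in hourly_wages]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up hourly wages for six workers (not survey data).
wages = [18.0, 22.5, 30.0, 41.0, 27.0, 35.5]
cutoff, labels = label_above_median(wages)

# Made-up predictions standing in for a classifier's output.
predictions = [0, 1, 1, 1, 0, 0]
score = accuracy(labels, predictions)
print(cutoff, labels, score)
```

Because a median split produces roughly balanced classes, an accuracy near 50 percent is the baseline a model needs to beat.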
