Exploration of Data Mining Techniques Supporting
Recommendation Systems
Joseph Davis
Department of Data Science
Florida Polytechnic University
Lakeland, United States of America
JosephDavisJoey@gmail.com
Abstract—Recommendation systems are among the most significant
auxiliary features of many online services. This project aims to
explore techniques for data cleaning and feature extraction in data
mining. Using these techniques we aim to identify, extract, and
create features which are useful in recommendation systems.
Several approaches are discussed, such as Principal Component
Analysis, Singular Value Decomposition, and Graph Clustering.
Keywords—recommendation systems, data mining, content-based
filtering, collaborative filtering, graph clustering, interest
mapping, matrix factorization
I. INTRODUCTION
The aim of this paper is to outline our approach to extracting
meaningful relationships from user rankings and content
metadata for use in recommendation systems.
A. Recommendation Systems
Recommendation systems can be found at the core of most
online services which serve content, sell products, or engage in
advertising. These systems serve to exploit data describing user
behaviors, interests, and feedback, as well as the content being
served. The use of these systems can increase a platform’s
revenue, user engagement, user retention, and navigability [1]. It
is apparent that these systems are the key to the success of many
online services, and their effectiveness may serve as a measure
of quality between one service and another.
Recommendation systems take many forms; two important
categories are collaborative filtering and content-based
filtering [1].
In collaborative filtering, “[users] will be recommended
items that people with similar tastes and preferences liked in the
past” [2]. While collaborative filtering is not constrained to this
definition, it is an apt description. Simply put, collaborative
filtering takes user feedback, user behavior, and content
metadata into account to filter subsets of content and/or users.
In content-based filtering, “[users] will be recommended
items similar to the ones [they] preferred in the past” [2].
Content-based filtering focuses on a single user and their habits.
Unlike collaborative filtering, it does not take into account the
behavior of other users.
An excellent recommendation system excels at
recommending content or products which the user will want to
view or purchase, and will go on to enjoy. If the provided
recommendations do not align with the user’s interests, the user
may look for alternative services or end their search altogether.
If the recommendations do not lead to a positive response, the
user’s engagement will decrease as they lose interest in the
service.
B. Required Information
To make accurate recommendations, sufficient information
about the user’s interests and behaviors must be collected. Not
only must information about the user be collected, but the
content must have metadata describing it.
User behavior information, also known as implicit feedback
[1] or passive feedback, refers to what a user does on the
platform. What do they search for? What do they interact with?
For how long do they interact? Have they made recurring
purchases? The answers to these questions provide a baseline for
extracting a user’s interest.
When a user rates content or writes reviews, it is known as
explicit feedback [1]. This information provides a strong image
of not only what a user is interested in, but also what is
uninteresting to them.
C. The Dataset
In the MyAnimeList dataset [3], there are three main sources
of information, which provide implicit and explicit feedback,
some user metadata, and a wealth of content metadata:
• TV List: Content by ID, title, genres, score, computed
rank, popularity, and meta information such as the cast.
• User List: Scores by user, watched episodes, and
timestamps.
• User: Demographic information for a subset of the users.
In the TV List, we find information about all the titles listed on
MyAnimeList (MAL) at the time the website was scraped. This
information is
critical in content-based filtering, which measures similarity
among content rather than users.
In the User List, we are provided with information about a
user’s likes and dislikes, as well as unscored shows which can be
used as implicit feedback. This information includes whether a
user has finished a show, plans to watch a show, and when the
record was last modified by them.
The User data covers demographic information, if the user
publicly shares it, such as age, sex, and location. The author of
the dataset states there are 302,573 users with some demographic
data, but only 116,133 have their age, sex, and location provided
[3].
II. MOTIVATION
The motivation for this project is to explore both data
cleaning and mining techniques, and to explore how information
can be leveraged in recommendation systems. Whether we
recognize it or not, these systems play a role in our daily lives
whenever we shop online, use search engines, view the news, or
see advertising online.
I have a long-held interest in recommendation systems.
Through discussions with peers, I have found that many of them
take note of which services have good (accurate)
recommendations and which have bad ones. For example,
YouTube seemed to be the preferred platform for finding new
music, while platforms like Spotify and Pandora were perceived
as having inferior recommendations. Anecdotes such as these
have led me to wonder what makes each system so
different.
III. RELATED WORK
A. Data Cleaning
The author of the dataset [3] has proposed filtering and
cleaning criteria, and has also created subsets of the data which
match these criteria, for example by filtering the User List
information to only contain users who have provided
demographic information.
B. Principal Component Analysis
Principal Component Analysis (PCA) is one method of
extracting new features from existing features which exhibit some
level of covariance [4].
C. Clustering - Binary Vector Set Similarity
Similarity between binary vectors and a query vector is
measured in Hamming space. The pigeonhole principle
is used to find candidates in the dataset, which are then verified
[5]. In research done by Qin et al., it was found that by using
non-equal partition widths and varying the threshold, skewed
data could be accounted for. This leads to improved accuracy in
the set similarity search.
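The filter-and-verify idea can be sketched as follows. This is a minimal illustration of the basic pigeonhole principle with equal-width partitions and a single per-partition threshold, not the generalized method of Qin et al. [5]. The function names, the toy data, and the linear scan (a practical system would index each partition) are all assumptions made for illustration.

```python
import numpy as np


def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return int(np.count_nonzero(a != b))


def pigeonhole_search(data, query, tau, m):
    """Return indices of rows within Hamming distance tau of the query.

    Basic pigeonhole filter: if the total distance is at most tau, then
    at least one of the m partitions differs in at most floor(tau / m)
    positions. Rows failing this test in every partition are skipped.
    """
    partitions = np.array_split(np.arange(data.shape[1]), m)
    part_threshold = tau // m
    results = []
    for idx, row in enumerate(data):
        # Filter step: cheap per-partition check to find candidates.
        is_candidate = any(
            hamming(row[cols], query[cols]) <= part_threshold
            for cols in partitions
        )
        # Verify step: full distance computed only for candidates.
        if is_candidate and hamming(row, query) <= tau:
            results.append(idx)
    return results


# Hypothetical binary vectors (for example, genre flags per show).
data = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
])
query = np.array([1, 0, 1, 1, 0, 0, 1, 0])
matches = pigeonhole_search(data, query, tau=2, m=2)
```

Because the filter never rejects a true match, the result is identical to a brute-force scan; the benefit is that most non-matching rows are discarded before the full distance is computed.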
D. Graph Clustering and Interest Mapping
Graphs are commonly used for mapping relationships
among entities. In a graph, edges represent direct relationships
via some feature of the dataset between two entities.
In MMBee [6], graph-guided interest expansion was used in lieu
of time-series data due to data sparsity. A graph with nodes
representing live-streamers and users was created where each
edge was weighted by the donations given to the live-streamer.
Metapaths were then traversed to mine similar live-streamers,
users, and interests among users.
The clustering of weighted, undirected graphs can be
calculated through the K-Algorithm or M-Algorithm [7]. These
algorithms map a cost function to the traversal of edges in a
graph, and are based upon the k-means algorithm. These algorithms
effectively break a graph into sub-graphs, but require a custom
cost function.
E. Matrix Factorization (Singular Value Decomposition)
Singular Value Decomposition (SVD) is a method to break
down a matrix into three separate matrices which represent the
importance of each latent feature, similarity of content, and
general preferences of users [8]. Matrix factorization is the use
of SVD to mine the latent features from the decomposed matrix.
Unfortunately, there is a data sparsity problem which occurs
because not all users have reviewed all content [9]. Deep matrix
factorization utilizes deep neural networks to improve the quality
of recommendations by predicting user ratings through the use of a
loss function [10].
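As a rough illustration of the decomposition, the sketch below applies NumPy's SVD to a small, fully observed rating matrix and keeps the two strongest latent features. The matrix values are hypothetical, and the example assumes no missing ratings; the sparsity problem described above would have to be handled before plain SVD could be applied to real data.

```python
import numpy as np

# Hypothetical user-by-show rating matrix (rows: users, columns: shows).
# Every entry is observed here, which is not true of real rating data.
ratings = np.array([
    [9.0, 8.0, 2.0, 1.0],
    [8.0, 9.0, 1.0, 2.0],
    [2.0, 1.0, 9.0, 8.0],
    [1.0, 2.0, 8.0, 9.0],
])

# U relates users to latent features, s holds the importance of each
# latent feature, and Vt relates latent features to shows.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the k strongest latent features.
k = 2
user_factors = U[:, :k] * s[:k]   # one k-dimensional vector per user
item_factors = Vt[:k, :].T        # one k-dimensional vector per show
approx = user_factors @ item_factors.T

# The Frobenius error of the rank-k approximation is determined by the
# singular values that were discarded.
error = np.linalg.norm(ratings - approx)
```

The rows of `user_factors` and `item_factors` can then serve as compact latent features when comparing users or shows.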
F. Collaborative and Content-based Filtering
Patterns and similarities among users, among content, and
between users and content are mined through various methods.
The goal is to find the similarity of each user to each other user,
each user to the content they interact with, and each piece of
content to each other piece of content. A variety of techniques are
used to mine these relationships, such as SVD or matrix
factorization [8,10,11], k-nearest neighbors [11], the K- and
M-Algorithms [7], and graph-guided interest expansion with
multi-modal fusion [6].
IV. PROPOSED APPROACHES
With the goal of exploring data mining techniques and
extracting features to be utilized in recommendation systems,
I propose to first set criteria for filtering and cleaning the data,
then find candidate features for mining, and finally extract new
features through the use of data mining techniques.
A. Data Cleaning
To clean data, we must rectify incomplete records and decide
how to deal with outliers. Incomplete records are those which
lack a value for one or more features. To rectify them, we must
either drop the record or fill in a value which aligns with some
statistical description of the feature as a whole, such as the mean
or median.
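Both options can be sketched with pandas. The column names and values below are hypothetical and do not reflect the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values (NaN) in two features.
records = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "score": [8.0, np.nan, 6.0, 10.0, np.nan],
    "watched_episodes": [12, 24, np.nan, 13, 1],
})

# Option 1: drop every record which is missing any value.
dropped = records.dropna()

# Option 2: fill each missing value with the median of its feature.
filled = records.fillna(records.median(numeric_only=True))
```

Dropping keeps only fully observed records at the cost of losing data, while median filling keeps every record but reduces the variance of the filled feature.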
Not only may records be incomplete, but the entity they are
describing may lack enough supporting data to justify its
inclusion during training or validation. For example, if the
average user rates 10-12 entities, while a handful of users have
only rated one, it may be worth dropping the records which do
not tell a complete story about the user.
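A minimal sketch of such a filter is shown below. The threshold `MIN_RATINGS`, the column names, and the data are assumptions chosen for illustration; a suitable threshold would have to be derived from the actual distribution of ratings per user.

```python
import pandas as pd

# Hypothetical user ratings: user 2 has rated only a single show.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3, 3, 3],
    "show_id": [10, 11, 12, 10, 10, 11, 12, 13],
    "score": [8, 7, 9, 10, 6, 5, 7, 8],
})

MIN_RATINGS = 3  # hypothetical minimum number of ratings per user

# Count the ratings of each user and attach the count to every row.
ratings_per_user = ratings.groupby("user_id")["show_id"].transform("count")

# Keep only the records of users with enough supporting data.
kept = ratings[ratings_per_user >= MIN_RATINGS]
```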
Some related records may need to be dropped if the story
they tell is implausible. For example, a user who reports having
watched ten times more than any other user is likely not
providing a truthful account; such users can be considered on a
case-by-case basis.
Some features may be incomplete in ways that cannot be
rectified through statistical descriptions. Such features cannot
be considered during feature extraction, and must be discarded
during data cleaning.
B. Feature Extraction
During feature extraction we will examine the existing
features, determine which ones are suitable for mining, and
discard those which hold no significance.
A key technique in feature extraction is feature
transformation. To transform a feature is to change its
representation, such as converting strings of text into vectors,
encoding ordinal categorical data as integers, changing the
range of ordinal data, or applying methods such as
standardization and normalization.
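Three of these transformations are sketched below with NumPy. The ordinal labels, their ordering, and the score values are hypothetical.

```python
import numpy as np

# Ordinal categorical data encoded as integers which preserve the order.
LABEL_ORDER = {"low": 0, "medium": 1, "high": 2}  # hypothetical labels
labels = ["high", "low", "medium", "high"]
encoded = np.array([LABEL_ORDER[label] for label in labels])

# Hypothetical numeric feature, such as user scores.
scores = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization: shift to zero mean and scale to unit variance.
standardized = (scores - scores.mean()) / scores.std()

# Normalization (min-max): rescale the range to [0, 1].
normalized = (scores - scores.min()) / (scores.max() - scores.min())
```

Standardization is often preferred before variance-based methods, since it stops features with large ranges from dominating, while min-max normalization preserves the relative spacing of values within a fixed range.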
After the features have been transformed, new features may be
extracted through methods such as PCA [4], or the significance
of features may be identified through SVD and covariance
measurements. New features may also be created through
clustering techniques.
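As a sketch of how PCA yields new features, the function below centers the data and uses SVD to find the principal components. The function name `pca` and the synthetic data are assumptions for illustration; a library implementation could be used instead.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    # Center each feature so the components describe variance.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance.
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance explained by each kept component.
    variance = (s ** 2) / (X.shape[0] - 1)
    ratio = (variance / variance.sum())[:n_components]
    return X_centered @ components.T, components, ratio


# Synthetic data: the second feature is almost a multiple of the first,
# so a single component should capture most of their shared variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])
projected, components, ratio = pca(X, n_components=2)
```

The columns of `projected` are the new features; they are uncorrelated with one another, and `ratio` indicates how much of the original variance each one retains.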
V. PLANNED EXPERIMENTS
With the goal of mining useful features which can be
used in predicting user interests, our experiments will deal with
identifying relationships among content, among users, and
between users and content.
A. Data Cleaning
We plan to analyze the dataset and determine how much data
is either missing or erroneous, then determine what should be
done with such features or records. For example, in the case of
missing user scores, we could predict the user’s rating through
Deep Matrix Factorization [10] and substitute the prediction,
rather than using user averages or modes.
B. Feature Transformation
When transforming features we will seek the best
representation of categorical and textual data, and see how it
can be leveraged when measuring similarity. One example is
converting genre information into binary vectors and measuring
the set similarities among content [5,12].
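A minimal sketch of this conversion is given below. The show titles and genre sets are hypothetical, and Hamming distance and Jaccard similarity are shown only as two candidate measures; the measure actually used would be chosen during the experiments.

```python
import numpy as np

# Hypothetical genre sets for three shows.
shows = {
    "Show A": {"Action", "Comedy"},
    "Show B": {"Action", "Drama"},
    "Show C": {"Romance", "Drama"},
}

# A fixed genre order gives every show a vector of the same layout.
genres = sorted(set().union(*shows.values()))
vectors = {
    title: np.array([genre in show_genres for genre in genres], dtype=int)
    for title, show_genres in shows.items()
}


def hamming_distance(a, b):
    """Number of genres on which two shows disagree."""
    return int(np.sum(a != b))


def jaccard_similarity(a, b):
    """Shared genres divided by all genres present in either show."""
    union = int(np.sum(a | b))
    return float(np.sum(a & b)) / union if union else 0.0


dist_ab = hamming_distance(vectors["Show A"], vectors["Show B"])
sim_ab = jaccard_similarity(vectors["Show A"], vectors["Show B"])
sim_ac = jaccard_similarity(vectors["Show A"], vectors["Show C"])
```

Jaccard similarity ignores genres that neither show has, which may matter when the genre vocabulary is large and most entries are zero.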
C. Feature Creation
There may be better representations of numeric data which
can be expressed through arithmetic or statistical means. Some
possibilities are the ratio of users who loved a show to those
who hated it, the total number of members with strong opinions,
and the variance in scoring. Other features may be extracted from
the data through methods such as PCA [4], or by assigning
features based on observed clusters in the data.
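These per-show features could be computed along the following lines. The score thresholds `LOVED` and `HATED`, the column names, and the data are assumptions for illustration; what counts as a strong opinion would need to be decided from the data.

```python
import pandas as pd

# Hypothetical scores given to two shows.
ratings = pd.DataFrame({
    "show_id": [10, 10, 10, 10, 11, 11, 11],
    "score": [10, 9, 2, 6, 5, 6, 5],
})

LOVED = 9  # hypothetical: scores at or above this count as "loved"
HATED = 3  # hypothetical: scores at or below this count as "hated"


def show_features(scores):
    """Derive summary features from the scores of a single show."""
    loved = int((scores >= LOVED).sum())
    hated = int((scores <= HATED).sum())
    return pd.Series({
        "loved": loved,
        "hated": hated,
        "strong_opinions": loved + hated,
        # The ratio is undefined when nobody hated the show.
        "love_hate_ratio": loved / hated if hated else float("nan"),
        "score_variance": scores.var(),
    })


# One row of derived features per show.
rows = {
    show_id: show_features(scores)
    for show_id, scores in ratings.groupby("show_id")["score"]
}
features = pd.DataFrame(rows).T
```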
D. Graph Clustering
By creating a weighted graph between entities, and an
associated cost function, clusters may be formed among
neighbors on the graph [7]. Furthermore, as shown in MMBee
[6], metapaths may provide a user-centric view towards
grouping content entities.
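The sketch below covers only the first step, building a weighted graph and a cost function; it does not implement the K-Algorithm or M-Algorithm of [7]. The choice of edge weight (the number of users who watched both shows) and the inverse-weight cost are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical watch lists: user -> set of shows.
user_shows = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"C", "D"},
}


def build_cowatch_graph(user_shows):
    """Undirected graph whose edge weight counts shared viewers."""
    graph = defaultdict(dict)
    for shows in user_shows.values():
        for a, b in combinations(sorted(shows), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return dict(graph)


def edge_cost(graph, a, b):
    """Hypothetical cost: inverse of the weight, infinite if no edge."""
    weight = graph.get(a, {}).get(b, 0)
    return 1.0 / weight if weight else float("inf")


graph = build_cowatch_graph(user_shows)
cost_ab = edge_cost(graph, "A", "B")
```

With this cost, shows that share many viewers are cheap to connect, so a clustering algorithm which minimizes cost would tend to group them together.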
REFERENCES
[1] L. El Harrouchi, H. Moussaoui, M. Karmoudi, and N. El Akkad, “A
review of recommendation systems,” in 2025 5th International
Conference on Innovative Research in Applied Science, Engineering and
Technology (IRASET), May 2025, pp. 1–9. doi:
10.1109/iraset64571.2025.11008191.
[2] G. Adomavicius and A. Tuzhilin, “Toward the next generation of
recommender systems: a survey of the state-of-the-art and possible
extensions,” IEEE Transactions on Knowledge and Data Engineering,
vol. 17, no. 6, pp. 734–749, Jun. 2005, doi: 10.1109/tkde.2005.99.
[3] M. Račinský, “MyAnimeList Dataset,” Kaggle, 2018, doi:
10.34740/KAGGLE/DSV/45582.
[4] K. Zhao, “Feature Extraction using Principal Component Analysis A
Simplified Visual Demo.” Accessed: Sep. 5, 2025. [Online]. Available:
https://medium.com/data-science/feature-extraction-using-principal-component-analysis-a-simplified-visual-demo-e5592ced100a
[5] J. Qin et al., “Generalizing the Pigeonhole Principle for Similarity Search
in Hamming Space,” IEEE Transactions on Knowledge and Data
Engineering, pp. 1–1, 2019, doi: 10.1109/tkde.2019.2899597.
[6] J. Deng et al., “MMBee: Live Streaming Gift-Sending Recommendations
via Multi-Modal Fusion and Behaviour Expansion,” in Proceedings of the
30th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining, Aug. 2024, pp. 4896–4905. doi: 10.1145/3637528.3671511.
[7] S. Sieranoja and P. Fränti, “Adapting k-means for graph clustering,”
Knowledge and Information Systems, vol. 64, no. 1, pp. 115–142, Dec.
2021, doi: 10.1007/s10115-021-01623-y.
[8] GeeksforGeeks, “Singular Value Decomposition (SVD).” Accessed: Sep.
5, 2025. [Online]. Available:
https://www.geeksforgeeks.org/machine-learning/singular-value-decomposition-svd/
[9] S. Bin and G. Sun, “Matrix Factorization Recommendation Algorithm
Based on Multiple Social Relationships,” Mathematical Problems in
Engineering, vol. 2021, pp. 1–8, Feb. 2021, doi: 10.1155/2021/6610645.
[10] R. Lara-Cabrera, Á. González-Prieto, and F. Ortega, “Deep Matrix
Factorization Approach for Collaborative Filtering Recommender
Systems,” Applied Sciences, vol. 10, no. 14, p. 4926, Jul. 2020, doi:
10.3390/app10144926.
[11] B. Hssina, A. Grota, and M. Erritali, “Recommendation system using the
k-nearest neighbors and singular value decomposition algorithms,”
International Journal of Electrical and Computer Engineering (IJECE),
vol. 11, no. 6, p. 5541, Dec. 2021, doi: 10.11591/ijece.v11i6.pp5541-
5548.
[12] C. Scheuch, “Clustering Binary Data.” Accessed: Sep. 5, 2025. [Online].
Available: https://blog.tidy-intelligence.com/posts/clustering-binary-data/