Exploration of Data Mining Techniques Supporting
Recommendation Systems
Joseph Davis
Department of Data Science
Florida Polytechnic University
Lakeland, United States of America
JosephDavisJoey@gmail.com
Abstract—Recommendation systems are the most significant
auxiliary feature of many online services. This project aims to
explore techniques for data cleaning and feature extraction in data
mining. Using these techniques we aim to identify, extract, and
create features which are useful in recommendation systems.
Several approaches are discussed, such as Principal Component
Analysis, Singular Value Decomposition, and graph clustering.
Keywords—recommendation systems, data mining, content-
based filtering, collaborative-filtering, graph clustering, interest
mapping, matrix factorization
I. INTRODUCTION
The aim of this paper is to outline our approach to extracting
meaningful relationships from user rankings and content
metadata for use in recommendation systems.
A. Recommendation Systems
Recommendation systems can be found at the core of most
online services which serve content, sell products, or engage in
advertising. These systems serve to exploit data describing user
behaviors, interests, and feedback, as well as the content being
served. The use of these systems can increase a platform’s
revenue, user engagement, user retention, and navigability [1]. It
is apparent that these systems are the key to the success of many
online services, and their effectiveness may serve as a measure
of quality between one service and another.
There are many recommendation systems in the world;
however, two important categories of recommendation systems
are Collaborative Filtering and Content-based Filtering [1].
In collaborative filtering, “[users] will be recommended
items that people with similar tastes and preferences liked in the
past” [2]. While collaborative filtering is not constrained to this
definition, it is an apt description. Simply put, collaborative
filtering takes user feedback, user behavior, and content
metadata into account to filter subsets of content and/or users.
In content-based filtering, “[users] will be recommended
items similar to the ones [they] preferred in the past” [2].
Content-based filtering focuses on a single user and their habits.
Unlike collaborative filtering, it does not take into account the
behavior of other users.
An excellent recommendation system recommends content or
products which the user will want to purchase or view and will
enjoy. If the provided recommendations do not align with a
user's interests, they may look for alternative services or end
their search altogether. If the recommendations do not lead to a
positive response, the user's engagement will decrease as they
lose interest in the service.
B. Required Information
To make accurate recommendations, sufficient information
about the user’s interests and behaviors must be collected. Not
only must information about the user be collected, but the
content must have metadata describing it.
User behavior information, also known as implicit feedback
[1] or passive feedback, refers to what a user does on the
platform. What do they search for? What do they interact with?
For how long do they interact? Have they made recurring
purchases? The answers to these questions provide a baseline for
extracting a user's interests.
When a user ranks content or writes reviews, it is known as
explicit feedback [1]. This information provides a strong picture
not only of what a user is interested in, but also of what they are
not interested in.
C. The Dataset
In the MyAnimeList dataset [3], there are three main sources
of information, which provide implicit and explicit feedback,
some user metadata, and a wealth of content metadata:
TV List: Content by ID, title, genres, score, computed
rank, popularity, and meta information such as the cast.
User List: Scores by user, watched episodes, and
timestamps.
User: Demographic information for a subset of the users.
In the TV List, we find information about all the titles seen on
MAL at the time the website was scraped. This information is
critical in content-based filtering, which measures similarity
among content rather than users.
In the User List, we are provided with information about a
user's likes and dislikes, as well as unscored shows which can be
used as implicit feedback. This information includes whether a
user has finished a show, plans to watch a show, and when the
record was last modified by them.
The User data covers demographic information, if the user
publicly shares it, such as age, sex, and location. The author of
the dataset states there are 302,573 users with some demographic
data, but only 116,133 have their age, sex, and location provided
[3].
II. MOTIVATION
The motivation for this project is to explore both data
cleaning and mining techniques, and to explore how information
can be leveraged in recommendation systems. Whether we
recognize it or not, these systems play a role in our daily lives so
long as we shop online, use search engines, read the news, or
encounter advertising online.
I have a long-held interest in recommendation systems.
Through discussions with peers, I have found that many of them
take note of services which have good (accurate)
recommendations and bad recommendations. For example,
YouTube seems to be the preferred platform for finding new
music, while platforms like Spotify and Pandora were perceived
as having inferior recommendations. Anecdotes such as these
have led me to wonder what makes each system so different.
III. RELATED WORK
A. Data Preprocessing
1) Data Cleaning
The author of the dataset [3] has proposed filtering and
cleaning criteria and has done some work to create subsets of the
data which match them, such as filtering the User List to only
contain users who have provided demographic information.
2) Principal Component Analysis
Principal Component Analysis is an algorithm which can be
used to reduce the dimensionality of numeric data through linear
transformations. PCA reduces dimensionality and maintains
information by leveraging the covariance of features [4].
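To make the idea concrete, the following is a minimal sketch of PCA-based feature reduction with scikit-learn; the synthetic matrix is a stand-in for the dataset's numeric features.

```python
# Minimal PCA sketch: standardize, then project onto principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # stand-in for numeric content features

X_std = StandardScaler().fit_transform(X)  # PCA is covariance-based, so scale first
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)       # share of variance kept per component
```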
B. Measuring Similarity
1) Clustering - Binary Vector Set Similarity
Similarity between binary vectors and a query vector is
measured in Hamming space. The pigeonhole principle is used
to find candidates in the dataset and verify them [5]. In research
done by Qin et al., it was found that using non-equal partition
widths and varying the threshold could account for skewed
data, leading to improved accuracy in the set similarity search.
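As an illustration of the pigeonhole idea with equal-width partitions only (Qin et al. [5] generalize this to non-equal widths), a small sketch: if two vectors differ in at most t bits and are split into t + 1 parts, at least one part must match exactly.

```python
# Pigeonhole candidate generation for Hamming-space similarity search.
import numpy as np

def parts(v, n_parts):
    # Split a binary vector into equal-width parts (hashable tuples).
    return [tuple(p) for p in np.array_split(v, n_parts)]

def search(query, data, t):
    q_parts = parts(query, t + 1)
    cand = [row for row in data
            if any(qp == rp for qp, rp in zip(q_parts, parts(row, t + 1)))]
    # Verification step: check the true Hamming distance of each candidate.
    return [r for r in cand if np.count_nonzero(r != query) <= t]

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(1000, 64))
query = data[0].copy()
print(len(search(query, data, t=4)))   # matches within Hamming distance 4
```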
2) Graph Clustering and Interest Mapping
Graphs are commonly used for mapping relationships
among entities. In a graph, an edge represents a direct
relationship between two entities via some feature of the dataset.
Graph-guided interest expansion [6], was used in lieu of
time-series data due to data sparsity. A graph with nodes
representing live-streamers and users was created where each
edge was weighted by the donations given to the live-streamer.
Metagraphs were then traversed to mine similar live-streamers,
users, and interests among users.
The clustering of weighted, undirected graphs can be
computed through the K-Algorithm or M-Algorithm [7]. These
algorithms map a cost function to the traversal of edges in a
graph and are based upon the k-means algorithm. They
effectively break a graph into sub-graphs, but require a custom
cost function.
3) Singular Value Decomposition
Singular Value Decomposition (SVD), like PCA, builds on
the eigendecomposition of a matrix. Using the eigenvalues and
eigenvectors of the original data, the importance of rows to other
rows, and of columns to other columns, is encoded. In essence,
the data is broken down into row similarity, column similarity,
and their relationship to the original data [8]. SVD has a wide
domain of applications, such as image compression, PCA, and
signal processing.
SVD can be used in recommendation systems for two tasks:
latent feature extraction and dimensionality reduction [13]. A
drawback of SVD is that it is affected by sparsity; to address
this, some approaches fill in missing values with descriptive
statistics [9], [10], [13].
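A short numpy sketch of the decomposition and a rank-k reconstruction, using a toy ratings matrix:

```python
# SVD sketch: U captures row-to-row structure, Vt column-to-column
# structure, and the singular values weight their importance.
import numpy as np

A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])         # toy user-by-item ratings

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))                  # rank-k approximation of A
```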
C. Recommendation Systems
1) Collaborative and Content-based Filtering
Patterns and similarities among users, among content, and
between users and content are mined through various methods.
The goal is to find the similarity of each user to every other user,
of each user to the content they interact with, and of each piece
of content to every other piece of content. A variety of
techniques are used to mine these relationships, such as SVD
and matrix factorization [8], [10], [13], k-nearest neighbors
[11], [13], the K- and M-Algorithms [7], and graph-guided
interest expansion with multi-modal diffusion [6].
2) Deep Matrix Factorization
Deep matrix factorization expands upon SVD with deep
learning to improve the quality of the factorized matrix. Like
SVD, DeepMF is used for collaborative filtering on datasets of
user reviews by title. DeepMF improves its recommendations
through a loss function, continuing to train on its own output
until error is reduced [10]. Through this training method,
DeepMF improves recommendations both by reducing sparsity
and by mining k latent features [10].
3) Graph Collaborative Filtering
Graph collaborative filtering models, such as MixRec, aim
to fill the gaps in low-dimensional feature spaces [14], such as
those produced by DeepMF. Matrix factorization is a common
collaborative filtering technique; however, it does not leverage
all the data which is available. Graph collaborative filtering aims
to leverage relational information, such as page views [14] or
donations [6].
Aside from MixRec, another model which similarly
leverages user relationships to content is MMBee, Multi-Modal
Fusion and Behavior Expansion [6]. MMBee leverages the
relations of user donations to streamers in a graph, rather than
through matrix factorization. Methods that leverage graphs of
user interactions will likely be the future of collaborative
filtering.
IV. PROPOSED APPROACHES
With the goal being to explore data mining techniques, this
paper intends to explore the extraction of features from the
dataset, the cleaning of data, and visualizing clusters of
information.
A. Data Cleaning
Our approach to cleaning the dataset will be to remove
records which do not represent the truth, which are statistically
insignificant, or which lack the quantity necessary to properly
mine information.
Some records do not provide truthful accounts of an
individual's reviews, such as those which have invalid data or
extreme z-scores. Others may have many records, but lack
significant features. Some content and user pairs may have so
few results or connections that no reasonable conclusions may
be drawn from them, making them insignificant for general
recommendations.
When there is enough supporting data, statistical descriptions
will be used to replace missing values, or corrections will be
applied in bulk to common errors in the dataset. Records which
cannot be corrected shall be dropped.
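A minimal pandas sketch of this plan, with hypothetical column names: invalid values are marked missing, well-supported gaps are imputed with a descriptive statistic, and uncorrectable records are dropped.

```python
# Impute well-supported gaps; drop records that cannot be corrected.
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [7.2, np.nan, 8.1, 25.0],
                   "genre": ["Action", "Drama", None, "Comedy"]})

df.loc[~df["score"].between(1, 10), "score"] = np.nan   # invalid -> missing
df["score"] = df["score"].fillna(df["score"].median())  # impute from support
df = df.dropna(subset=["genre"])                        # uncorrectable -> drop
print(df)
```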
B. Feature Extraction
During feature extraction we will examine the existing
features, determine which ones are suitable for mining, and
discard those which hold no significance.
A key technique in feature extraction is feature
transformation. To transform a feature is to change its
representation: turning strings of text into vectors, mapping
ordinal categorical data to integers, changing the range of
ordinal data, or applying methods such as standardization and
normalization.
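For illustration, a small sketch of such transformations, with hypothetical columns and an assumed category ordering:

```python
# Feature transformation sketch: ordinal encoding, standardization,
# and min-max normalization on toy columns.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"rating": ["PG", "R", "PG-13", "PG"],
                   "members": [12000, 800, 4300, 150]})

# Ordinal categorical data -> integers (the ordering is an assumption).
df["rating_ord"] = df["rating"].map({"PG": 0, "PG-13": 1, "R": 2})
# Standardization and normalization of a numeric feature.
df["members_std"] = StandardScaler().fit_transform(df[["members"]]).ravel()
df["members_01"] = MinMaxScaler().fit_transform(df[["members"]]).ravel()
print(df)
```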
After the features have been transformed, new features may be
extracted through methods such as PCA [4], or the significance
of features may be identified through SVD and covariance
measurements. New features may also be created through
clustering techniques.
After identifying similarities or dissimilarities among data,
such as by using t-SNE or UMAP, we will aim to reduce the
original features to several features that describe the distance
from each cluster.
C. Visualization and Measurements
To compare the dataset before and after cleaning, we will
utilize graphical plots to describe changes to the descriptive
statistics. After finding correlation among variables, we will
show that the general trend has not been lost due to the removal
of outliers.
Methods such as t-distributed Stochastic Neighbor
Embedding or Uniform Manifold Approximation will be used to
visually explore clusters of data for further mining.
V. PLANNED EXPERIMENTS
With the goal being to mine useful features which can be
used in predicting user interests, our experiments will deal with
identifying relationships between content, users, and among
users and content.
A. Data Cleaning
During cleaning, we will also be removing information
which is not related to our mining task, such as records
describing Music rather than TV or Movies. After removing
irrelevant data, we will be removing records which exist as
outliers, rather than noise. Outliers will be those that are
significantly unlikely, lack enough quantity to be used in
modeling, or those that seem to be a repeated error in the dataset.
To measure the difference, I will plot the descriptive
statistics of several key features which would be used in
modeling, such as user scores, and measure the records lost by
type due to cleaning and filtering. I will also measure the
'completeness' of the dataset, defined as the number of records
remaining after filtering and cleaning over the total number of
records.
B. Feature Transformation and Extraction
We will propose several basic feature transformations, such
as columns calculated through multiplication or division of a
feature by another feature. We will measure the covariance of
these features with score and report the findings. From these
findings, we will employ PCA [4] to improve the results.
Features may be extracted from textual data, such as creating
binary categorical features for our set of genres to support the
analysis of associations among genres [5,12]. Some ordinal
categorical feature are also represented as averages, we will
explore discretizing this information into a lower range to reduce
the variance in user reviews due to different scoring methods.
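A sketch of both ideas, with illustrative rows; the bin edges for discretization are an assumption:

```python
# Binary genre indicators from a comma-separated tag string, and
# discretization of a 1-10 average score into a coarser ordinal band.
import pandas as pd

df = pd.DataFrame({"genre": ["Action, Comedy", "Drama", "Comedy, Drama"],
                   "score": [7.6, 5.2, 8.9]})

genre_flags = df["genre"].str.get_dummies(sep=", ")   # one 0/1 column per tag
df["score_band"] = pd.cut(df["score"], bins=[0, 4, 7, 10],
                          labels=["low", "mid", "high"])
print(pd.concat([df, genre_flags], axis=1))
```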
C. Graph Clustering
The hope of this project is to make a graph of users to content,
users to users, and/or content to content. This information could
be explored through the K- and M-Algorithm graph clustering
algorithms [7], or by visually separating data and identifying
clusters. I hope to coin metrics for homophily and heterophily to
identify similarities among nodes when clustering using weights
between users and content, to see if there are any distinct
clusters [7].
VI. PROGRESS REPORT ONE
So far, we have performed exploratory analysis of the TV
List dataset. The records in this dataset provide information
about each content entry, such as Title, Genre, Total Members,
and Score. Features such as Genre and Score are imperative in
finding patterns among content and providing
recommendations.
A. Cleaning and Filtering
Thus far, we have explored the TV List dataset which
describes content entries. Because recommendation systems aim
to predict how compatible a user is with content, we have
identified the key features to be 'score' and 'genre'. Score is a
value from one to ten, averaged over all user reviews; a ten
indicates "Masterpiece", while a one indicates "Appalling".
Genre is a comma-separated string of tags, which
should be useful in identifying common trends among content.
During filtering, we remove all entries which have not
finished airing. In other words, we only consider those which
have been completed and judged by users. Then, we remove all
non-video media: 'Music' and 'Unknown'.
In Fig. 1 we visualize the loss across types due to our filtering
criteria. Both Music and Unknown were removed in their
entirety, while five percent, or 653, of the other types were
removed.
In Figs. 2 and 3, we visualize the distribution of score across
each type, before and after filtering. While the medians and
whiskers do not seem to be particularly affected, the mean
moves towards the median in all cases.
During cleaning, we remove all records where the score,
members, or genre was null. All records which have no members
are marked as irrelevant and removed from the dataset. If there
are no members, there cannot be any relevant data to use
downstream when predicting user ratings. If the score is less than
one or greater than ten, we discard the record as invalid.
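These rules condense into a few pandas filters; the rows below are toy stand-ins for the TV List:

```python
# Cleaning sketch: drop null key features, memberless entries, and
# out-of-range scores.
import numpy as np
import pandas as pd

tv = pd.DataFrame({"score":   [7.5, np.nan, 11.0, 6.8],
                   "members": [1200, 300, 50, 0],
                   "genre":   ["Action", "Drama", "Comedy", None]})

tv = tv.dropna(subset=["score", "members", "genre"])  # null key features
tv = tv[tv["members"] > 0]                            # no members -> irrelevant
tv = tv[tv["score"].between(1, 10)]                   # out of range -> invalid
print(tv)
```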
Through cleaning, 68 more records are removed across the
remaining types, with the majority, 44, being movies. After
cleaning and filtering, the dataset lost 1,600 content entries from
the original 14,478, resulting in a completeness score of 88.9%;
however, when considering cleaning on its own, the result is
99.5%.
B. Exploration
When measuring the correlation of score with all other
numeric features, we find that members, scored_by, and
favorites have positive correlations with score: 0.35 for
scored_by, the number of members who have left ratings; 0.38
for members, the number of users who have added the content
to their list; and 0.21 for favorites, the number of users who have
added the content to their favorites.
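The reported values come from Pearson correlation, which can be computed as in this sketch; the data here is synthetic and will not reproduce the figures above.

```python
# Pearson correlation of each numeric feature with score.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
members = rng.lognormal(8, 1, n)
tv = pd.DataFrame({"members":   members,
                   "scored_by": members * rng.uniform(0.3, 0.9, n),
                   "favorites": members * rng.uniform(0.0, 0.1, n),
                   "score":     rng.normal(6.5, 1.0, n)})

print(tv.corr()["score"])   # correlation of every column with score
```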
In Fig. 4, we use Kernel Density Estimation to visualize
the probability density of scores by members. Similar to Fig. 2,
we see that the highest density is found between the scores of six
and seven. The vast majority of our content seems to have few
members; however, there does seem to be a general trend where
content that has received higher ratings will continue to reach
wider audiences.
Fig. 1: Records which have been removed, grouped by content type.
Fig. 2: Score distribution before and after filtering, excluding results of type TV.
Fig. 3: The TV score distribution before and after filtering.
Fig. 4: Score by members plotted with Kernel Density Estimation.
C. Feature Transformation
Due to the covariance of Favorites and Members with Score,
we propose a new column based on the amount of favorites per
hundred members. This new beats both Members and Favorites
with a covariance of 0.42; however, approximately 30% of the
content entries do not have any favorites.
The Genre feature in the dataset is a comma-separated string
of tags; in total, there are 81 genres after transformation. Genres
which did not have at least 5% of their values positive were
pruned, reducing the genre features to 78. Performing
t-distributed Stochastic Neighbor Embedding on the pruned
genre features segments the data into many clusters.
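A sketch of the embedding step on a stand-in binary genre matrix, using the same perplexity as Fig. 6:

```python
# t-SNE sketch: embed binary genre vectors into 2-D for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
genre_flags = rng.integers(0, 2, size=(2000, 78)).astype(np.float32)

xy = TSNE(n_components=2, perplexity=62, random_state=0).fit_transform(
    genre_flags)
print(xy.shape)   # (2000, 2) plot coordinates
```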
In Fig. 6, we can see many clusters formed from the genre
information. Many of the tighter groupings are content with just
a single genre ascribed to them. For example, the cluster at the
range [-15, -20] and domain [100, 125] only includes the 'Kids'
tag for its genre, while in the range [60, 90] and domain
[17.5, 32.5] there are three clusters grouped closely together
which are related through their 'Adventure' tag, but also include
'Sci-Fi' and 'Comedy', as well as many supporting tags.
t-SNE does a remarkable job at clustering these binary
vectors and should prove useful in finding common categories or
key genres for describing the data for downstream use.
VII. CHECKPOINT II
Since checkpoint one, I have begun preparing the user list
rankings for use in rating prediction. To prepare user rankings,
we will perform an intersection with the previous TV list, then
clean, filter, and transform the dataset. With this dataset, we can
generate a user-by-content matrix or a content-by-content matrix
on which to perform techniques such as SVD, or to analyze further.
A. TV List UMAP
Per the last checkpoint, we have performed UMAP on the
dataset previously used for the t-SNE visualizations; see Fig. 7
for the result. Having run both t-SNE and UMAP again, both
showed very similar structures of the binary vectors: long
string-like connections. This is unusual compared to the last
run, where both had formed clusters based on common values.
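The UMAP run can be sketched as follows, assuming the umap-learn package and the same binary genre matrix; parameters are left at their defaults as an assumption:

```python
# UMAP sketch on a stand-in binary genre matrix.
import numpy as np
import umap

rng = np.random.default_rng(0)
genre_flags = rng.integers(0, 2, size=(2000, 78)).astype(np.float32)

xy = umap.UMAP(n_components=2, random_state=0).fit_transform(genre_flags)
print(xy.shape)   # (2000, 2) plot coordinates
```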
B. User Rankings
To prepare the data for mining and modeling, we must first
perform three tasks: Prefiltering, Cleaning, and Filtering. During
prefiltering we remove irrelevant content IDs. During cleaning
we remove impossible values. During filtering we remove
values that are irrelevant for our prediction task.
1) Prefilter
To prefilter the data, we take the output from checkpoint one,
the cleaned TV list, and perform an inner join with the user
rankings. This provides us with all the user rankings for which
we have relevant content information.
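In pandas this prefilter is an inner join on the content ID; the column name anime_id is an assumption about the schema:

```python
# Prefilter sketch: keep only rankings whose content survived cleaning.
import pandas as pd

tv = pd.DataFrame({"anime_id": [10, 11, 12]})            # cleaned TV list IDs
rankings = pd.DataFrame({"user_id":  [1, 1, 2, 3],
                         "anime_id": [10, 99, 11, 98],
                         "score":    [8, 7, 9, 6]})

prefiltered = rankings.merge(tv[["anime_id"]], on="anime_id", how="inner")
print(prefiltered)   # rankings for IDs 99 and 98 are dropped
```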
2) Clean
Fig. 5: Score by favorites per hundred members plotted with Kernel Density Estimation.
Fig. 6: t-SNE, perplexity at 62, on the reduced genre set.
Fig. 7: UMAP of genres.
To clean the data, we remove any records whose scores fall
outside of the valid range. In the future, we aim to backfill
values, such as watched episodes, from the TV list dataset to fix
them, as many users have watched-episode counts that are either
zero or much higher than the number of episodes the show aired.
Once the watched episodes are cleaned, we hope to also train the
model on users who rated shows without completing them,
provided they have watched enough of the show.
The distribution of scores by status can be seen in Fig. 8. This
figure excludes all entries where the score is zero. Zero is the
default for score when the user has not rated an entry, but it is in
their list. When the boxplot includes zero entries, the median
value for each status, except Completed, becomes zero.
3) Filter
To filter the data, we remove any records whose status is not
equal to Completed. Per the last section, we hope that with
enough cleaning we can extend the accepted statuses from
Completed to Dropped or Currently Watching when the user has
seen enough of the show to make a rating.
During filtering, we found that approximately half of all
records (Fig. 9) have a status of Completed, meaning 42 million
entries are content scored by users.
C. Predictive Modeling Attempt
Our goal was to predict user ratings from the dataset of
(user, content, score) triples using SVD.
1) Data Transformation
To prepare for SVD, we reduced the dataset to the three key
columns, then transformed the dataset such that each row was an
individual user and each column a content entry. This results in
a matrix of approximately two hundred thousand rows by twelve
thousand columns.
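For reference, a sparse formulation avoids materializing the dense matrix; this sketch (not the exact code we ran, with assumed column names) uses scipy and scikit-learn's TruncatedSVD:

```python
# Sparse user-by-content matrix and truncated SVD on toy triples.
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({"user_id":  [1, 1, 2, 2, 3, 3, 4],
                   "anime_id": [10, 11, 10, 12, 11, 12, 10],
                   "score":    [8, 7, 9, 6, 7, 8, 5]})

rows = df["user_id"].astype("category").cat.codes
cols = df["anime_id"].astype("category").cat.codes
R = csr_matrix((df["score"], (rows, cols)))    # only known ratings stored

svd = TruncatedSVD(n_components=2, random_state=0)  # k latent features
user_factors = svd.fit_transform(R)
print(user_factors.shape)                      # (n_users, k)
```

Because only the known ratings are stored, a representation like this may sidestep the allocation problem described next.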
2) Complication
Due to the sheer size of the dataset, every approach to SVD
with a k-value greater than three led to allocating far more
memory than was available on the system (500 GB to around
89 TB). A very small k-value was decided on and the dataset was
reduced to one hundred thousand records, but after watching the
logs it became clear that the run would not finish at its current
pace; either my approach was far too inefficient, or the model is
not meant to be used at this scale.
D. Dataset Analysis
From Fig. 10, we see that the majority of all records are
marked as Completed, with Plan to Watch coming in second.
From Fig. 8, we see that the score for both of these statuses has
an average of 8, which is higher than the average score of
approximately 6.5 for the TV List seen in Figs. 2 and 3.
When excluding all scores less than five (Fig. 11), Plan to
Watch becomes the least represented class, while Completed
still dwarfs the rest of the statuses within the dataset. When
excluding all scores greater than five, as well as scores equal to
zero (Fig. 12), the Dropped class becomes the second most
significant class. This is supported by Fig. 8, which shows that
the majority of dropped rankings lie below six, and it supports
the idea that some users will drop content they do not like.
However, we can see from Fig. 13 that the majority of all ratings
in our dataset are zero, meaning that users are primarily tracking
their status and not their opinions on the content.
Fig. 8: The distribution of score and status on the cleaned dataset.
Fig. 9: Ranking records removed by the filter (status == Completed).
Fig. 10: Count of each status seen in the cleaned dataset.
E. Graph Clustering
Using the Louvain method, a 'method which attempts to
extract non-overlapping communities from large networks' [15],
we have attempted to find a best-fit partitioning. From the data,
we were given three partitions: the first consists of about ten
thousand records, while the next two contain fifteen hundred and
twelve hundred. I believe this is a sign that the data cannot be
well separated by Louvain; however, there is not enough time to
analyze these results, as this was my second attempt at
clustering and each run took about twelve hours. Louvain also
has a dendrogram method, so I hope to see its results by the final
checkpoint, if anything comes of it.
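A minimal sketch of the Louvain step with networkx (3.x exposes louvain_communities); the toy edges stand in for our user-content weights:

```python
# Louvain community detection on a small weighted, undirected graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("u1", "c1", 9.0), ("u2", "c1", 7.0),
                           ("u2", "c2", 8.0), ("u3", "c3", 6.0)])

communities = nx.community.louvain_communities(G, weight="weight", seed=0)
print(communities)   # list of node sets, one per detected community

# networkx also offers louvain_partitions, a generator of the partition at
# each level of the hierarchy, analogous to the dendrogram method above.
```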
REFERENCES
[1] L. El Harrouchi, H. Moussaoui, M. Karmoudi, and N. El Akkad, “A
review of recommendation systems,” in 2025 5th International
Conference on Innovative Research in Applied Science, Engineering and
Technology (IRASET), May 2025, pp. 1–9. doi:
10.1109/iraset64571.2025.11008191.
[2] G. Adomavicius and A. Tuzhilin, “Toward the next generation of
recommender systems: a survey of the state-of-the-art and possible
extensions,” IEEE Transactions on Knowledge and Data Engineering,
vol. 17, no. 6, pp. 734–749, Jun. 2005, doi: 10.1109/tkde.2005.99.
[3] M. Račinský, "MyAnimeList Dataset," Kaggle, 2018, doi:
10.34740/KAGGLE/DSV/45582.
[4] K. Zhao, "Feature Extraction using Principal Component Analysis: A
Simplified Visual Demo." Accessed: Sep. 5, 2025. [Online]. Available:
https://medium.com/data-science/feature-extraction-using-principal-
component-analysis-a-simplified-visual-demo-e5592ced100a
[5] J. Qin et al., “Generalizing the Pigeonhole Principle for Similarity Search
in Hamming Space,” IEEE Transactions on Knowledge and Data
Engineering, pp. 1–1, 2019, doi: 10.1109/tkde.2019.2899597.
[6] J. Deng et al., “MMBee: Live Streaming Gift-Sending Recommendations
via Multi-Modal Fusion and Behaviour Expansion,” in Proceedings of the
30th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining, Aug. 2024, pp. 4896–4905. doi: 10.1145/3637528.3671511.
[7] S. Sieranoja and P. Fränti, “Adapting k-means for graph clustering,”
Knowledge and Information Systems, vol. 64, no. 1, pp. 115–142, Dec.
2021, doi: 10.1007/s10115-021-01623-y.
[8] GeeksforGeeks, "Singular Value Decomposition (SVD)." Accessed: Sep.
5, 2025. [Online]. Available: https://www.geeksforgeeks.org/machine-
learning/singular-value-decomposition-svd/
[9] S. Bin and G. Sun, “Matrix Factorization Recommendation Algorithm
Based on Multiple Social Relationships,” Mathematical Problems in
Engineering, vol. 2021, pp. 1–8, Feb. 2021, doi: 10.1155/2021/6610645.
[10] R. Lara-Cabrera, Á. González-Prieto, and F. Ortega, “Deep Matrix
Factorization Approach for Collaborative Filtering Recommender
Systems,” Applied Sciences, vol. 10, no. 14, p. 4926, Jul. 2020, doi:
10.3390/app10144926.
[11] B. Hssina, A. Grota, and M. Erritali, “Recommendation system using the
k-nearest neighbors and singular value decomposition algorithms,”
International Journal of Electrical and Computer Engineering (IJECE),
vol. 11, no. 6, p. 5541, Dec. 2021, doi: 10.11591/ijece.v11i6.pp5541-
5548.
[12] C. Scheuch, "Clustering Binary Data." Accessed: Sep. 5, 2025. [Online].
Available: https://blog.tidy-intelligence.com/posts/clustering-binary-
data/
Fig. 11: Status when score is greater than or equal to five.
Fig. 12: Status when score is less than or equal to five, but not zero.
Fig. 13: Bar chart of scores.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Application of
Dimensionality Reduction in Recommender System - A Case Study,”
Minnesota University Department of Computer Science, Jul. 2000, doi:
https://doi.org/10.21236/ADA439541.
[14] L. Xia, M. Xie, Y. Xu, and C. Huang, “MixRec: Heterogeneous Graph
Collaborative Filtering,” Web Search and Data Mining, pp. 136–145, Feb.
2025, doi: https://doi.org/10.1145/3701551.3703591.
[15] "Louvain method," Wikipedia, Jun. 27, 2022. [Online]. Available:
https://en.wikipedia.org/wiki/Louvain_method