Exploration of Data Mining Techniques Supporting
Recommendation Systems
Joseph Davis
Department of Data Science
Florida Polytechnic University
Lakeland, United States of America
JosephDavisJoey@gmail.com
Abstract—Recommendation systems are the most significant
auxiliary feature of many online services. This project aims to
explore techniques for data cleaning and feature extraction in data
mining. Using these techniques we aim to identify, extract, and
create features which are useful in recommendation systems.
Several approaches are discussed, such as Principal Component
Analysis, Singular Value Decomposition, and graph clustering.
Keywords—recommendation systems, data mining, content-
based filtering, collaborative-filtering, graph clustering, interest
mapping, matrix factorization
I. INTRODUCTION
The aim of this paper is to outline our approach to extracting
meaningful relationships from user rankings and content
metadata for use in recommendation systems.
A. Recommendation Systems
Recommendation systems can be found at the core of most
online services which serve content, sell products, or engage in
advertising. These systems serve to exploit data describing user
behaviors, interests, and feedback, as well as the content being
served. The use of these systems can increase a platform’s
revenue, user engagement, user retention, and navigability [1]. It
is apparent that these systems are the key to the success of many
online services, and their effectiveness may serve as a measure
of quality between one service and another.
There are many recommendation systems in the world;
however, two important categories of recommendation systems
are Collaborative Filtering and Content-based Filtering [1].
In collaborative filtering, “[users] will be recommended
items that people with similar tastes and preferences liked in the
past” [2]. While collaborative filtering is not constrained to this
definition, it is an apt description. Simply put, collaborative
filtering takes user feedback, user behavior, and content
metadata into account to filter subsets of content and/or users.
In content-based filtering, “[users] will be recommended
items similar to the ones [they] preferred in the past” [2].
Content-based filtering focuses on a single user and their habits.
Unlike collaborative filtering, it does not take into account the
behavior of other users.
An excellent recommendation system recommends content or
products which the user will want to purchase or view and will
enjoy. If the provided recommendations do not align with a
user's interests, they may look for alternative services or end
their search altogether. If the recommendations do not lead to a
positive response, the user's engagement will decrease as they
lose interest in the service.
B. Required Information
To make accurate recommendations, sufficient information
about the user’s interests and behaviors must be collected. Not
only must information about the user be collected, but the
content must have metadata describing it.
User behavior information, also known as implicit feedback
[1] or passive feedback, refers to what a user does on the
platform. What do they search for? What do they interact with?
For how long do they interact? Have they made recurring
purchases? The answers to these questions provide a baseline for
extracting a user's interests.
When a user ranks content or writes reviews, it is known as
explicit feedback [1]. This information provides a strong picture
not only of what a user is interested in, but also of what they are
not interested in.
C. The Dataset
In the MyAnimeList dataset [3], there are three main sources
of information, which provide implicit and explicit feedback,
some user metadata, and a wealth of content metadata:
TV List: Content by ID, title, genres, score, computed
rank, popularity, and meta information such as the cast.
User List: Scores by user, watched episodes, and
timestamps.
User: Demographic information for a subset of the users.
In the TV List, we find information about all the titles seen on
MAL at the time the website was scraped. This information is
critical in content-based filtering, which measures similarity
among content rather than users.
In the User List, we are provided with information about a
user's likes and dislikes, as well as unscored shows which can be
used as implicit feedback. This information includes whether a
user has finished a show, plans to watch a show, and when the
record was last modified by them.
The User data covers demographic information, if the user
publicly shares it, such as age, sex, and location. The author of
the dataset states there are 302,573 users with some demographic
data, but only 116,133 have their age, sex, and location provided
[3].
II. MOTIVATION
The motivation for this project is to explore both data
cleaning and mining techniques, and to explore how information
can be leveraged in recommendation systems. Whether we
recognize it or not, these systems play a role in our daily lives so
long as we shop online, use search engines, read the news, or
encounter advertising online.
I have a long-held interest in recommendation systems.
Through discussions with peers, I have found that many of them
take note of services which have good (accurate)
recommendations and bad recommendations. For example,
YouTube seems to be the preferred platform for finding new
music, while platforms like Spotify and Pandora were perceived
as having inferior recommendations. Anecdotes such as these
have led me to wonder what makes each system so different.
III. RELATED WORK
A. Data Preprocessing
1) Data Cleaning
The author of the dataset [3] has proposed filtering and
cleaning criteria and has done some work to create subsets of the
data which match them, such as filtering the User List to only
contain users who have provided demographic information.
2) Principal Component Analysis
Principal Component Analysis is an algorithm which can be
used to reduce the dimensionality of numeric data through linear
transformations. PCA reduces dimensionality and maintains
information by leveraging the covariance of features [4].
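To make the idea concrete, the following is a minimal sketch of PCA-based feature reduction with scikit-learn; the synthetic matrix is a stand-in for the dataset's numeric features.

```python
# Minimal PCA sketch: standardize, then project onto principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # stand-in for numeric content features

X_std = StandardScaler().fit_transform(X)  # PCA is covariance-based, so scale first
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)       # share of variance kept per component
```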
B. Measuring Similarity
1) Clustering - Binary Vector Set Similarity
Similarity between binary vectors and a query vector is
measured in Hamming space. The pigeonhole principle is used
to find candidates in the dataset and verify them [5]. In research
done by Qin et al., it was found that using non-equal partition
widths and varying the threshold could account for skewed
data, leading to improved accuracy in the set similarity search.
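As an illustration of the pigeonhole idea with equal-width partitions only (Qin et al. [5] generalize this to non-equal widths), a small sketch: if two vectors differ in at most t bits and are split into t + 1 parts, at least one part must match exactly.

```python
# Pigeonhole candidate generation for Hamming-space similarity search.
import numpy as np

def parts(v, n_parts):
    # Split a binary vector into equal-width parts (hashable tuples).
    return [tuple(p) for p in np.array_split(v, n_parts)]

def search(query, data, t):
    q_parts = parts(query, t + 1)
    cand = [row for row in data
            if any(qp == rp for qp, rp in zip(q_parts, parts(row, t + 1)))]
    # Verification step: check the true Hamming distance of each candidate.
    return [r for r in cand if np.count_nonzero(r != query) <= t]

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(1000, 64))
query = data[0].copy()
print(len(search(query, data, t=4)))   # matches within Hamming distance 4
```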
2) Graph Clustering and Interest Mapping
Graphs are commonly used for mapping relationships
among entities. In a graph, an edge represents a direct
relationship between two entities via some feature of the dataset.
Graph-guided interest expansion [6], was used in lieu of
time-series data due to data sparsity. A graph with nodes
representing live-streamers and users was created where each
edge was weighted by the donations given to the live-streamer.
Metagraphs were then traversed to mine similar live-streamers,
users, and interests among users.
The clustering of weighted, undirected graphs can be
computed through the K-Algorithm or M-Algorithm [7]. These
algorithms map a cost function to the traversal of edges in a
graph and are based upon the k-means algorithm. They
effectively break a graph into sub-graphs, but require a custom
cost function.
3) Singular Value Decomposition
Singular Value Decomposition (SVD), like PCA, builds on
the eigendecomposition of a matrix. Using the eigenvalues and
eigenvectors of the original data, the importance of rows to other
rows, and of columns to other columns, is encoded. In essence,
the data is broken down into row similarity, column similarity,
and their relationship to the original data [8]. SVD has a wide
domain of applications, such as image compression, PCA, and
signal processing.
SVD can be used in recommendation systems for two tasks:
latent feature extraction and dimensionality reduction [13]. A
drawback of SVD is that it is affected by sparsity; to address
this, some approaches fill in missing values with descriptive
statistics [9], [10], [13].
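A short numpy sketch of the decomposition and a rank-k reconstruction, using a toy ratings matrix:

```python
# SVD sketch: U captures row-to-row structure, Vt column-to-column
# structure, and the singular values weight their importance.
import numpy as np

A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])         # toy user-by-item ratings

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))                  # rank-k approximation of A
```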
C. Recommendation Systems
1) Collaborative and Content-based Filtering
Patterns and similarities among users, among content, and
between users and content are mined through various methods.
The goal is to find the similarity of each user to every other user,
of each user to the content they interact with, and of each piece
of content to every other piece of content. A variety of
techniques are used to mine these relationships, such as SVD
and matrix factorization [8], [10], [13], k-nearest neighbors
[11], [13], the K- and M-Algorithms [7], and graph-guided
interest expansion with multi-modal diffusion [6].
2) Deep Matrix Factorization
Deep matrix factorization expands upon SVD with deep
learning to improve the quality of the factorized matrix. Like
SVD, DeepMF is used for collaborative filtering on datasets of
user reviews by title. DeepMF improves its recommendations
through a loss function, continuing to train on its own output
until error is reduced [10]. Through this training method,
DeepMF improves recommendations both by reducing sparsity
and by mining k latent features [10].
3) Graph Collaborative Filtering
Graph collaborative filtering models, such as MixRec, aim
to fill the gaps in low-dimensional feature spaces [14], such as
those produced by DeepMF. Matrix factorization is a common
collaborative filtering technique; however, it does not leverage
all the data which is available. Graph collaborative filtering aims
to leverage relational information, such as page views [14] or
donations [6].
Aside from MixRec, another model which similarly
leverages user relationships to content is MMBee, Multi-Modal
Fusion and Behavior Expansion [6]. MMBee leverages the
relations of user donations to streamers in a graph, rather than
through matrix factorization. Methods that leverage graphs of
user interactions will likely be the future of collaborative
filtering.
IV. PROPOSED APPROACHES
With the goal being to explore data mining techniques, this
paper intends to explore the extraction of features from the
dataset, the cleaning of data, and visualizing clusters of
information.
A. Data Cleaning
Our approach to cleaning the dataset will be to remove
records which do not represent the truth, which are statistically
insignificant, or which lack the quantity necessary to properly
mine information.
Some records do not provide truthful accounts of an
individual's reviews, such as those which have invalid data or
extreme z-scores. Others may have many records, but lack
significant features. Some content and user pairs may have so
few results or connections that no reasonable conclusions may
be drawn from them, making them insignificant for general
recommendations.
When there is enough supporting data, statistical descriptions
will be used to replace missing values, or corrections will be
applied in bulk to common errors in the dataset. Records which
cannot be corrected shall be dropped.
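A minimal pandas sketch of this plan, with hypothetical column names: invalid values are marked missing, well-supported gaps are imputed with a descriptive statistic, and uncorrectable records are dropped.

```python
# Impute well-supported gaps; drop records that cannot be corrected.
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [7.2, np.nan, 8.1, 25.0],
                   "genre": ["Action", "Drama", None, "Comedy"]})

df.loc[~df["score"].between(1, 10), "score"] = np.nan   # invalid -> missing
df["score"] = df["score"].fillna(df["score"].median())  # impute from support
df = df.dropna(subset=["genre"])                        # uncorrectable -> drop
print(df)
```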
B. Feature Extraction
During feature extraction we will examine the existing
features, determine which ones are suitable for mining, and
discard those which hold no significance.
A key technique in feature extraction is feature
transformation. To transform a feature is to change its
representation: turning strings of text into vectors, mapping
ordinal categorical data to integers, changing the range of
ordinal data, or applying methods such as standardization and
normalization.
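For illustration, a small sketch of such transformations, with hypothetical columns and an assumed category ordering:

```python
# Feature transformation sketch: ordinal encoding, standardization,
# and min-max normalization on toy columns.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"rating": ["PG", "R", "PG-13", "PG"],
                   "members": [12000, 800, 4300, 150]})

# Ordinal categorical data -> integers (the ordering is an assumption).
df["rating_ord"] = df["rating"].map({"PG": 0, "PG-13": 1, "R": 2})
# Standardization and normalization of a numeric feature.
df["members_std"] = StandardScaler().fit_transform(df[["members"]]).ravel()
df["members_01"] = MinMaxScaler().fit_transform(df[["members"]]).ravel()
print(df)
```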
After the features have been transformed, new features may be
extracted through methods such as PCA [4], or the significance
of features may be identified through SVD and covariance
measurements. New features may also be created through
clustering techniques.
After identifying similarities or dissimilarities among data,
such as by using t-SNE or UMAP, we will aim to reduce the
original features to several features that describe the distance
from each cluster.
C. Visualization and Measurements
To compare the dataset before and after cleaning, we will
utilize graphical plots to describe changes to the descriptive
statistics. After finding correlation among variables, we will
show that the general trend has not been lost due to the removal
of outliers.
Methods such as t-distributed Stochastic Neighbor
Embedding or Uniform Manifold Approximation will be used to
visually explore clusters of data for further mining.
V. PLANNED EXPERIMENTS
With the goal being to mine useful features which can be
used in predicting user interests, our experiments will deal with
identifying relationships between content, users, and among
users and content.
A. Data Cleaning
During cleaning, we will also be removing information
which is not related to our mining task, such as records
describing Music rather than TV or Movies. After removing
irrelevant data, we will be removing records which exist as
outliers, rather than noise. Outliers will be those that are
significantly unlikely, lack enough quantity to be used in
modeling, or those that seem to be a repeated error in the dataset.
To measure the difference, I will plot the descriptive
statistics of several key features which would be used in
modeling, such as user scores, and measure the records lost by
type due to cleaning and filtering. I will also measure the
'completeness' of the dataset, defined as the number of records
remaining after filtering and cleaning over the total number of
records.
B. Feature Transformation and Extraction
We will propose several basic feature transformations, such
as columns calculated through multiplication or division of a
feature by another feature. We will measure the covariance of
these features with score and report the findings. From these
findings, we will employ PCA [4] to improve the results.
Features may be extracted from textual data, such as creating
binary categorical features for our set of genres to support the
analysis of associations among genres [5,12]. Some ordinal
categorical feature are also represented as averages, we will
explore discretizing this information into a lower range to reduce
the variance in user reviews due to different scoring methods.
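A sketch of both ideas, with illustrative rows; the bin edges for discretization are an assumption:

```python
# Binary genre indicators from a comma-separated tag string, and
# discretization of a 1-10 average score into a coarser ordinal band.
import pandas as pd

df = pd.DataFrame({"genre": ["Action, Comedy", "Drama", "Comedy, Drama"],
                   "score": [7.6, 5.2, 8.9]})

genre_flags = df["genre"].str.get_dummies(sep=", ")   # one 0/1 column per tag
df["score_band"] = pd.cut(df["score"], bins=[0, 4, 7, 10],
                          labels=["low", "mid", "high"])
print(pd.concat([df, genre_flags], axis=1))
```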
C. Graph Clustering
The hope of this project is to make a graph of users to content,
users to users, and/or content to content. This information could
be explored through the K- and M-Algorithm graph clustering
algorithms [7], or by visually separating data and identifying
clusters. I hope to coin metrics for homophily and heterophily to
identify similarities among nodes when clustering using weights
between users and content, to see if there are any distinct
clusters [7].
VI. PROGRESS REPORT ONE
So far, we have performed exploratory analysis of the TV
List dataset. The records in this dataset provide information
about each content entry, such as Title, Genre, Total Members,
and Score. Features such as Genre and Score are imperative in
finding patterns among content and providing
recommendations.
A. Cleaning and Filtering
Thus far, we have explored the TV List dataset which
describes content entries. Because recommendation systems aim
to predict how compatible a user is with content, we have
identified the key features to be 'score' and 'genre'. Score is a
value from one to ten, averaged over all user reviews; a ten
indicates "Masterpiece", while a one indicates "Appalling".
Genre is a comma-separated string of tags, which
should be useful in identifying common trends among content.
During filtering, we remove all entries which have not
finished airing. In other words, we only consider those which
have been completed and judged by users. Then, we remove all
non-video media: 'Music' and 'Unknown'.
In Fig. 1 we visualize the loss across types due to our filtering
criteria. Both Music and Unknown were removed in their
entirety, while five percent, or 653, of the other types were
removed.
In Figs. 2 and 3, we visualize the distribution of score across
each type, before and after filtering. While the medians and
whiskers do not seem to be particularly affected, the mean
moves towards the median in all cases.
During cleaning, we remove all records where the score,
members, or genre was null. All records which have no members
are marked as irrelevant and removed from the dataset. If there
are no members, there cannot be any relevant data to use
downstream when predicting user ratings. If the score is less than
one or greater than ten, we discard the record as invalid.
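These rules condense into a few pandas filters; the rows below are toy stand-ins for the TV List:

```python
# Cleaning sketch: drop null key features, memberless entries, and
# out-of-range scores.
import numpy as np
import pandas as pd

tv = pd.DataFrame({"score":   [7.5, np.nan, 11.0, 6.8],
                   "members": [1200, 300, 50, 0],
                   "genre":   ["Action", "Drama", "Comedy", None]})

tv = tv.dropna(subset=["score", "members", "genre"])  # null key features
tv = tv[tv["members"] > 0]                            # no members -> irrelevant
tv = tv[tv["score"].between(1, 10)]                   # out of range -> invalid
print(tv)
```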
Through cleaning, 68 more records are removed across the
remaining types, with the majority, 44, being movies. After
cleaning and filtering, the dataset lost 1,600 content entries from
the original 14,478, resulting in a completeness score of 88.9%;
however, when considering cleaning on its own, the result is
99.5%.
B. Exploration
When measuring the correlation of score with all other
numeric features, we find that members, scored_by, and
favorites have positive correlations with score: 0.35 for
scored_by, the number of members who have left ratings; 0.38
for members, the number of users who have added the content
to their list; and 0.21 for favorites, the number of users who have
added the content to their favorites.
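The reported values come from Pearson correlation, which can be computed as in this sketch; the data here is synthetic and will not reproduce the figures above.

```python
# Pearson correlation of each numeric feature with score.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
members = rng.lognormal(8, 1, n)
tv = pd.DataFrame({"members":   members,
                   "scored_by": members * rng.uniform(0.3, 0.9, n),
                   "favorites": members * rng.uniform(0.0, 0.1, n),
                   "score":     rng.normal(6.5, 1.0, n)})

print(tv.corr()["score"])   # correlation of every column with score
```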
In Fig. 4, we use Kernel Density Estimation to visualize
the probability density of scores by members. Similar to Fig. 2,
we see that the highest density is found between the scores of six
and seven. The vast majority of our content seems to have few
members; however, there does seem to be a general trend where
content that has received higher ratings will continue to reach
wider audiences.
Fig. 1: Records which have been removed, grouped by content type.
Fig. 2: Score distribution before and after filtering, excluding results of type TV.
Fig. 3: The TV score distribution before and after filtering.
Fig. 4: Score by members plotted with Kernel Density Estimation.
C. Feature Transformation
Due to the covariance of Favorites and Members with Score,
we propose a new column based on the amount of favorites per
hundred members. This new beats both Members and Favorites
with a covariance of 0.42; however, approximately 30% of the
content entries do not have any favorites.
The Genre feature in the dataset is a comma-separated string
of tags; in total, there are 81 genres after transformation. Genres
which did not have at least 5% of their values positive were
pruned, reducing the genre features to 78. Performing
t-distributed Stochastic Neighbor Embedding on the pruned
genre features segments the data into many clusters.
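A sketch of the embedding step on a stand-in binary genre matrix, using the same perplexity as Fig. 6:

```python
# t-SNE sketch: embed binary genre vectors into 2-D for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
genre_flags = rng.integers(0, 2, size=(2000, 78)).astype(np.float32)

xy = TSNE(n_components=2, perplexity=62, random_state=0).fit_transform(
    genre_flags)
print(xy.shape)   # (2000, 2) plot coordinates
```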
In Fig. 6, we can see many clusters formed from the genre
information. Many of the tighter groupings are content with just
a single genre ascribed to them. For example, the cluster at the
range [-15, -20] and domain [100, 125] only includes the 'Kids'
tag for its genre, while in the range [60, 90] and domain
[17.5, 32.5] there are three clusters grouped closely together
which are related through their 'Adventure' tag, but also include
'Sci-Fi' and 'Comedy', as well as many supporting tags.
t-SNE does a remarkable job at clustering these binary
vectors and should prove useful in finding common categories or
key genres for describing the data for downstream use.
VII. CHECKPOINT II
Since checkpoint one, I have begun preparing the user list
rankings for use in rating prediction. To prepare user rankings,
we will perform an intersection with the previous TV list, then
clean, filter, and transform the dataset. With this dataset, we can
generate a user-by-content matrix or a content-by-content matrix
on which to perform techniques such as SVD, or to analyze further.
A. TV List UMAP
Per the last checkpoint, we have performed UMAP on the
dataset previously used for the t-SNE visualizations; see Fig. 7
for the result. Having run both t-SNE and UMAP again, both
showed very similar structures of the binary vectors: long
string-like connections. This is unusual compared to the last
run, where both had formed clusters based on common values.
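The UMAP run can be sketched as follows, assuming the umap-learn package and the same binary genre matrix; parameters are left at their defaults as an assumption:

```python
# UMAP sketch on a stand-in binary genre matrix.
import numpy as np
import umap

rng = np.random.default_rng(0)
genre_flags = rng.integers(0, 2, size=(2000, 78)).astype(np.float32)

xy = umap.UMAP(n_components=2, random_state=0).fit_transform(genre_flags)
print(xy.shape)   # (2000, 2) plot coordinates
```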
B. User Rankings
To prepare the data for mining and modeling, we must first
perform three tasks: Prefiltering, Cleaning, and Filtering. During
prefiltering we remove irrelevant content IDs. During cleaning
we remove impossible values. During filtering we remove
values that are irrelevant for our prediction task.
1) Prefilter
To prefilter the data, we take the output from checkpoint one,
the cleaned TV list, and perform an inner join with the user
rankings. This provides us with all the user rankings for which
we have relevant content information.
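In pandas this prefilter is an inner join on the content ID; the column name anime_id is an assumption about the schema:

```python
# Prefilter sketch: keep only rankings whose content survived cleaning.
import pandas as pd

tv = pd.DataFrame({"anime_id": [10, 11, 12]})            # cleaned TV list IDs
rankings = pd.DataFrame({"user_id":  [1, 1, 2, 3],
                         "anime_id": [10, 99, 11, 98],
                         "score":    [8, 7, 9, 6]})

prefiltered = rankings.merge(tv[["anime_id"]], on="anime_id", how="inner")
print(prefiltered)   # rankings for IDs 99 and 98 are dropped
```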
2) Clean
Fig. 5: Score by favorites per hundred members plotted with Kernel Density Estimation.
Fig. 6: t-SNE, perplexity at 62, on the reduced genre set.
Fig. 7: UMAP of genres.
To clean the data, we remove any records whose scores fall
outside of the valid range. In the future, we aim to backfill
values, such as watched episodes, from the TV list dataset to fix
them, as many users have watched-episode counts that are either
zero or much higher than the number of episodes the show aired.
Once the watched episodes are cleaned, we hope to also train the
model on users who rated shows without completing them,
provided they have watched enough of the show.
The distribution of scores by status can be seen in Fig. 8. This
figure excludes all entries where the score is zero. Zero is the
default for score when the user has not rated an entry, but it is in
their list. When the boxplot includes zero entries, the median
value for each status, except Completed, becomes zero.
3) Filter
To filter the data, we remove any records whose status is not
equal to Completed. Per the last section, we hope that with
enough cleaning we can extend the accepted statuses from
Completed to Dropped or Currently Watching when the user has
seen enough of the show to make a rating.
During filtering, we found that approximately half of all
records (Fig. 9) have a status of Completed, meaning 42 million
entries are content scored by users.
C. Predictive Modeling Attempt
Our goal was to predict user ratings from the dataset of
(user, content, score) triples using SVD.
1) Data Transformation
To prepare for SVD, we reduced the dataset to the three key
columns, then transformed the dataset such that each row was an
individual user and each column a content entry. This results in
a matrix of approximately two hundred thousand rows by twelve
thousand columns.
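For reference, a sparse formulation avoids materializing the dense matrix; this sketch (not the exact code we ran, with assumed column names) uses scipy and scikit-learn's TruncatedSVD:

```python
# Sparse user-by-content matrix and truncated SVD on toy triples.
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({"user_id":  [1, 1, 2, 2, 3, 3, 4],
                   "anime_id": [10, 11, 10, 12, 11, 12, 10],
                   "score":    [8, 7, 9, 6, 7, 8, 5]})

rows = df["user_id"].astype("category").cat.codes
cols = df["anime_id"].astype("category").cat.codes
R = csr_matrix((df["score"], (rows, cols)))    # only known ratings stored

svd = TruncatedSVD(n_components=2, random_state=0)  # k latent features
user_factors = svd.fit_transform(R)
print(user_factors.shape)                      # (n_users, k)
```

Because only the known ratings are stored, a representation like this may sidestep the allocation problem described next.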
2) Complication
Due to the sheer size of the dataset, every approach to SVD
with a k-value greater than three led to allocating far more
memory than was available on the system (500 GB to around
89 TB). A very small k-value was decided on and the dataset was
reduced to one hundred thousand records, but after watching the
logs it became clear that the run would not finish at its current
pace; either my approach was far too inefficient, or the model is
not meant to be used at this scale.
D. Dataset Analysis
From Fig. 10, we see that the majority of all records are
marked as Completed, with Plan to Watch coming in second.
From Fig. 8, we see that the score for both of these statuses has
an average of 8, which is higher than the average score of
approximately 6.5 for the TV List seen in Figs. 2 and 3.
When excluding all scores less than five (Fig. 11), Plan to
Watch becomes the least represented class, while Completed
still dwarfs the rest of the statuses within the dataset. When
excluding all scores greater than five, as well as scores equal to
zero (Fig. 12), the Dropped class becomes the second most
significant class. This is supported by Fig. 8, which shows that
the majority of dropped rankings lie below six, and it supports
the idea that some users will drop content they do not like.
However, we can see from Fig. 13 that the majority of all ratings
in our dataset are zero, meaning that users are primarily tracking
their status and not their opinions on the content.
Fig. 8: The distribution of score and status on the cleaned dataset.
Fig. 9: Ranking records removed by the filter (status == Completed).
Fig. 10: Count of each status seen in the cleaned dataset.
E. Graph Clustering
Using the Louvain method, a 'method which attempts to
extract non-overlapping communities from large networks' [15],
we have attempted to find a best-fit partitioning. From the data,
we were given three partitions: the first consists of about ten
thousand records, while the next two contain fifteen hundred and
twelve hundred. I believe this is a sign that the data cannot be
well separated by Louvain; however, there is not enough time to
analyze these results, as this was my second attempt at
clustering and each run took about twelve hours. Louvain also
has a dendrogram method, so I hope to see its results by the final
checkpoint, if anything comes of it.
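A minimal sketch of the Louvain step with networkx (3.x exposes louvain_communities); the toy edges stand in for our user-content weights:

```python
# Louvain community detection on a small weighted, undirected graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("u1", "c1", 9.0), ("u2", "c1", 7.0),
                           ("u2", "c2", 8.0), ("u3", "c3", 6.0)])

communities = nx.community.louvain_communities(G, weight="weight", seed=0)
print(communities)   # list of node sets, one per detected community

# networkx also offers louvain_partitions, a generator of the partition at
# each level of the hierarchy, analogous to the dendrogram method above.
```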
REFERENCES
[1] L. El Harrouchi, H. Moussaoui, M. Karmoudi, and N. El Akkad, “A
review of recommendation systems,” in 2025 5th International
Conference on Innovative Research in Applied Science, Engineering and
Technology (IRASET), May 2025, pp. 1–9. doi:
10.1109/iraset64571.2025.11008191.
[2] G. Adomavicius and A. Tuzhilin, “Toward the next generation of
recommender systems: a survey of the state-of-the-art and possible
extensions,” IEEE Transactions on Knowledge and Data Engineering,
vol. 17, no. 6, pp. 734–749, Jun. 2005, doi: 10.1109/tkde.2005.99.
[3] M. Račinský, "MyAnimeList Dataset," Kaggle, 2018, doi:
10.34740/KAGGLE/DSV/45582.
[4] K. Zhao, "Feature Extraction using Principal Component Analysis: A
Simplified Visual Demo." Accessed: Sep. 5, 2025. [Online]. Available:
https://medium.com/data-science/feature-extraction-using-principal-
component-analysis-a-simplified-visual-demo-e5592ced100a
[5] J. Qin et al., “Generalizing the Pigeonhole Principle for Similarity Search
in Hamming Space,” IEEE Transactions on Knowledge and Data
Engineering, pp. 1–1, 2019, doi: 10.1109/tkde.2019.2899597.
[6] J. Deng et al., “MMBee: Live Streaming Gift-Sending Recommendations
via Multi-Modal Fusion and Behaviour Expansion,” in Proceedings of the
30th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining, Aug. 2024, pp. 4896–4905. doi: 10.1145/3637528.3671511.
[7] S. Sieranoja and P. Fränti, “Adapting k-means for graph clustering,”
Knowledge and Information Systems, vol. 64, no. 1, pp. 115–142, Dec.
2021, doi: 10.1007/s10115-021-01623-y.
[8] GeeksforGeeks, "Singular Value Decomposition (SVD)." Accessed: Sep.
5, 2025. [Online]. Available: https://www.geeksforgeeks.org/machine-
learning/singular-value-decomposition-svd/
[9] S. Bin and G. Sun, “Matrix Factorization Recommendation Algorithm
Based on Multiple Social Relationships,” Mathematical Problems in
Engineering, vol. 2021, pp. 1–8, Feb. 2021, doi: 10.1155/2021/6610645.
[10] R. Lara-Cabrera, Á. González-Prieto, and F. Ortega, “Deep Matrix
Factorization Approach for Collaborative Filtering Recommender
Systems,” Applied Sciences, vol. 10, no. 14, p. 4926, Jul. 2020, doi:
10.3390/app10144926.
[11] B. Hssina, A. Grota, and M. Erritali, “Recommendation system using the
k-nearest neighbors and singular value decomposition algorithms,”
International Journal of Electrical and Computer Engineering (IJECE),
vol. 11, no. 6, p. 5541, Dec. 2021, doi: 10.11591/ijece.v11i6.pp5541-
5548.
[12] C. Scheuch, "Clustering Binary Data." Accessed: Sep. 5, 2025. [Online].
Available: https://blog.tidy-intelligence.com/posts/clustering-binary-
data/
Fig. 11: Status when score is greater than or equal to five.
Fig. 12: Status when score is less than or equal to five, but not zero.
Fig. 13: Bar chart of scores.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Application of
Dimensionality Reduction in Recommender System - A Case Study,”
Minnesota University Department of Computer Science, Jul. 2000, doi:
https://doi.org/10.21236/ADA439541.
[14] L. Xia, M. Xie, Y. Xu, and C. Huang, “MixRec: Heterogeneous Graph
Collaborative Filtering,” Web Search and Data Mining, pp. 136–145, Feb.
2025, doi: https://doi.org/10.1145/3701551.3703591.
[15] "Louvain method," Wikipedia, Jun. 27, 2022. [Online]. Available:
https://en.wikipedia.org/wiki/Louvain_method