II. MOTIVATION
The motivation for this project is to explore both data
cleaning and data mining techniques, and to examine how the
resulting information can be leveraged in recommendation
systems. Whether we recognize it or not, these systems play a
role in our daily lives whenever we shop online, use search
engines, view the news, or see advertising.
I have a long-held interest in recommendation systems.
Through discussions with peers, I have found that many of them
take note of which services give accurate recommendations and
which give poor ones. For example, YouTube seems to be the
preferred platform for finding new music, while platforms like
Spotify and Pandora were perceived as having inferior
recommendations. Anecdotes such as these have led me to
wonder what makes each system so different.
III. RELATED WORK
A. Data Cleaning
The author of the dataset [3] has proposed filtering and
cleaning criteria and has also done some work to create subsets
of the data which match these criteria, such as filtering the User
List information to contain only users who have provided
demographic information.
B. Principal Component Analysis
Principal Component Analysis (PCA) is one method of
extracting new, uncorrelated features from original features
which exhibit some level of covariance [4].
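As a brief illustration, the sketch below applies PCA to a synthetic set of covarying rating columns. Both the data and the use of scikit-learn are assumptions on my part for illustration, not part of the project pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 users, 5 rating columns that covary strongly.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
ratings = base + 0.1 * rng.normal(size=(100, 5))

# Project onto two new, uncorrelated features (principal components).
pca = PCA(n_components=2)
components = pca.fit_transform(ratings)

# The first component captures nearly all of the shared variance.
print(pca.explained_variance_ratio_)
```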
C. Clustering - Binary Vector Set Similarity
Similarity between binary vectors and a query vector is
measured in Hamming space. The pigeonhole principle is used
to find candidates in the dataset, which are then verified [5]. In
research done by Qin et al., it was found that using non-equal
partition widths and varying the threshold per partition could
account for skewed data, leading to improved accuracy in the
set similarity search.
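As a minimal sketch of the basic idea (using equal-width partitions rather than the variable-width scheme of Qin et al., and synthetic data of my own): splitting vectors into t+1 partitions guarantees, by the pigeonhole principle, that any vector within Hamming distance t of the query matches it exactly in at least one partition, so exact partition matches form the candidate set to verify.

```python
import numpy as np

def pigeonhole_candidates(data, query, t):
    """Find rows of `data` within Hamming distance t of `query`.

    Split vectors into t + 1 equal-width partitions; any vector
    within distance t must match the query exactly in at least
    one partition (pigeonhole), so those matches are candidates.
    """
    n_parts = t + 1
    bounds = np.linspace(0, len(query), n_parts + 1).astype(int)
    candidates = set()
    for i in range(n_parts):
        lo, hi = bounds[i], bounds[i + 1]
        match = np.all(data[:, lo:hi] == query[lo:hi], axis=1)
        candidates.update(np.nonzero(match)[0])
    # Verification step: compute the true Hamming distance.
    return [j for j in candidates if np.sum(data[j] != query) <= t]

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(1000, 64))
query = data[0].copy()
query[:2] ^= 1  # flip two bits, so row 0 is within distance 2
print(pigeonhole_candidates(data, query, t=2))
```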
D. Graph Clustering and Interest Mapping
Graphs are commonly used for mapping relationships
among entities. In a graph, edges represent direct relationships
between two entities via some feature of the dataset.
Graph-guided interest expansion [6] was used in lieu of
time-series data due to data sparsity. A graph was created with
nodes representing live-streamers and users, where each edge
was weighted by the donations given to the live-streamer.
Metagraphs were then traversed to mine similar live-streamers,
users, and interests among users.
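A small sketch of the graph structure described above, built with networkx on invented donation records. This illustrates only the weighted user/streamer graph, not the metagraph traversal of [6].

```python
import networkx as nx

# Invented donation records: (user, streamer, amount).
donations = [
    ("user_a", "streamer_1", 5.0),
    ("user_a", "streamer_2", 1.0),
    ("user_b", "streamer_1", 3.0),
    ("user_c", "streamer_2", 8.0),
]

G = nx.Graph()
for user, streamer, amount in donations:
    # Accumulate repeat donations into a single weighted edge.
    if G.has_edge(user, streamer):
        G[user][streamer]["weight"] += amount
    else:
        G.add_edge(user, streamer, weight=amount)

# Users who funded streamer_1, weighted by donation size.
print(dict(G["streamer_1"]))
```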
The clustering of weighted, undirected graphs can be
calculated through the K-Algorithm or M-Algorithm [7]. These
algorithms map a cost function to the traversal of edges in a
graph and are based upon the k-means algorithm. They
effectively break a graph into sub-graphs, but require a custom
cost function.
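The precise K- and M-Algorithms of [7] are not reproduced here; the sketch below only illustrates the general k-means-style pattern they build on, with weighted shortest-path distance standing in for the custom cost function (an assumption on my part).

```python
import networkx as nx

def kmeans_style_partition(G, centers, n_iter=10):
    """Break G into sub-graphs around center nodes, k-means style.

    Assignment cost is the weighted shortest-path distance to a
    center; each center is then recomputed as the cluster member
    (medoid) minimizing the cluster's total cost. Assumes no
    cluster becomes empty.
    """
    for _ in range(n_iter):
        dist = {c: nx.single_source_dijkstra_path_length(G, c, weight="weight")
                for c in centers}
        clusters = {c: [] for c in centers}
        for node in G:
            best = min(centers, key=lambda c: dist[c].get(node, float("inf")))
            clusters[best].append(node)
        # Recompute each center as the member with minimal total cost.
        centers = [min(members,
                       key=lambda m: sum(
                           nx.single_source_dijkstra_path_length(
                               G, m, weight="weight").get(x, float("inf"))
                           for x in members))
                   for members in clusters.values()]
    return clusters

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1),
                           ("c", "d", 5), ("d", "e", 1)])
print(kmeans_style_partition(G, centers=["a", "e"]))
```

On this toy graph the heavy c-d edge acts as a natural cut, so the nodes settle into two sub-graphs on either side of it.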
E. Matrix Factorization (Singular Value Decomposition)
Singular Value Decomposition (SVD) is a method of
breaking a matrix down into three separate matrices which
represent the importance, the similarity of content, and the
general preferences of users [8]. Matrix factorization is the use
of SVD to mine latent features from the decomposed matrix.
Unfortunately, a data sparsity problem occurs because not all
users have reviewed all content [9]. Deep matrix factorization
utilizes deep neural networks to improve the quality of
recommendations by predicting user ratings through the use of a
loss function [10].
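A minimal sketch of plain SVD on a toy ratings matrix. The values are invented, and treating missing ratings as zeros is a simplification; practical matrix factorization optimizes over observed entries only.

```python
import numpy as np

# Toy ratings matrix (users x items); zeros stand for missing ratings.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Full decomposition: R = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the top-k singular values to recover the latent structure;
# the low-rank reconstruction yields a score for every user-item pair.
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))
```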
F. Collaborative and Content-based Filtering
Patterns and similarities among users, among content, and
between users and content are mined through various methods.
The goal is to find the similarity of each user to every other user,
of each user to the content they interact with, and of each piece
of content to every other piece of content. A variety of
techniques are used to mine these relationships, such as SVD
and matrix factorization [8,10,13], k-nearest neighbors [13], the
K- and M-Algorithms [7], and graph-guided interest expansion
with multi-modal diffusion [6].
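As one concrete example of mining user-to-user similarity, the sketch below computes cosine similarity between rating rows and uses it to weight a prediction; the ratings are invented for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy ratings: rows are users, columns are items.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

# User-user similarity: users 0 and 1 rate alike, user 2 does not.
print(cosine_sim(R[0], R[1]))  # high
print(cosine_sim(R[0], R[2]))  # low

# Predict user 0's rating of item 2 from other users' ratings,
# weighted by how similar each user is to user 0.
sims = np.array([cosine_sim(R[0], R[u]) for u in (1, 2)])
ratings = R[(1, 2), 2]
print(sims @ ratings / sims.sum())
```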
IV. PROPOSED APPROACHES
The goal is to explore data mining techniques and feature
extraction as they apply to recommendation systems. I propose
to first set criteria for filtering and cleaning the data, then find
candidate features for mining, and finally extract new features
through the use of data mining techniques.
A. Data Cleaning
To clean the data, we must rectify incomplete records and
decide how to deal with outliers. Incomplete records are those
lacking values for one or more features; to rectify them, we must
either drop the record or fill in a value which aligns with some
statistical description of the feature as a whole, such as the mean
or median.
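A sketch of both options using pandas; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical user table with gaps.
users = pd.DataFrame({
    "age": [23, None, 31, 27],
    "episodes_watched": [120, 80, None, 45],
})

# Option 1: drop records with any missing feature.
dropped = users.dropna()

# Option 2: fill gaps with a statistical description of the column.
filled = users.fillna(users.median(numeric_only=True))
print(filled)
```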
Not only may records be incomplete, but the entity they are
describing may lack enough supporting data to justify its
inclusion during training or validation. For example, if the
average user rates 10-12 entities, while a handful of users have
only rated one, it may be worth dropping the records which do
not tell a complete story about the user.
Some related records may need to be dropped if the story
they tell is implausible. For example, a user who appears to have
watched ten times more than any other user is likely not
providing a truthful account; such records can be considered on
a case-by-case basis.
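The sketch below illustrates both filters on an invented rating log: dropping users with too few records, then flagging implausibly heavy watchers for case-by-case review. The column names and thresholds are assumptions.

```python
import pandas as pd

# Invented rating log; real column names and thresholds will differ.
log = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4],
    "hours_watched": [2, 3, 1, 200, 150, 150, 4, 2, 3, 1, 5],
})

# Drop users with too few records to tell a complete story.
counts = log.groupby("user_id").size()
active = counts[counts >= 3].index
log = log[log["user_id"].isin(active)]

# Flag users whose totals are implausibly far beyond the norm,
# e.g. ten times the median, for case-by-case review.
totals = log.groupby("user_id")["hours_watched"].sum()
print(totals[totals > 10 * totals.median()])
```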
Some features may not be complete and cannot be rectified
through statistical descriptions. Such data cannot be considered
during feature extraction and must be discarded during data
cleaning.
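A short sketch of discarding such features, using the fraction of missing values as the criterion; the 50% threshold is an assumption.

```python
import pandas as pd

# Hypothetical table where one feature is mostly missing.
df = pd.DataFrame({
    "score": [8, 7, 9, 6],
    "favorite_genre": [None, None, "action", None],
})

# Discard features too incomplete to rectify statistically.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)
print(df.columns.tolist())  # ['score']
```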
B. Feature Extraction
During feature extraction we will examine the existing
features, determine which ones are suitable for mining, and
discard those which hold no significance.
A key technique in feature extraction is feature
transformation. To transform a feature is to change its
representation, such as converting strings of text into vectors,
mapping ordinal categorical data to integers, or changing the