II. MOTIVATION
The motivation for this project is to explore both data
cleaning and mining techniques, and to explore how information
can be leveraged in recommendation systems. Whether we
recognize it or not, these systems play a role in our daily life so
long as you shop online, use search engines, view the news, or
have seen advertising online.
I have a long-held interest in recommendation systems.
Through discussions with peers, I have found that many of them
take note of services which have good (accurate)
recommendations and bad recommendations. For example,
YouTube seems to be the preferred platform for finding new
music, while platforms like Spotify and Pandora were perceived
as having inferior recommendations. Anecdotes such as these
has lead me to wonder about what makes each system so
different.
III. RELATED WORK
A. Data Preprocessing
1) Data Cleaning
The author of the dataset [3] has proposed filtering and
cleaning criteria, they have also done some work to create
subsets of the data which match these criteria. Such as filtering
the User List information to only contain users whom have
provided demographic information.
2) Principal Component Analysis
Principal Component Analysis is an algorithm which can be
used to reduce the dimensionality of numeric data through linear
transformations. PCA reduces dimensionality and maintains
information by leveraging the covariance of features [4].
B. Measuring Similarity
1) Clustering - Binary Vector Set Similarity
Measuring similarity between binary vectors and a query
vector is measured in Hamming space. The pigeonhole principle
is used to find candidates in the dataset and verify them [5]. In
research done by Qin et al., it was found that by changing using
non-equal partition widths and varying the threshold, skewed
data could be accounted for. This leads to improved accuracy in
the set similarity search.
2) Graph Clustering and Interest Mapping
Graph’s are commonly used for mapping relationships
among entities. In a graph, edges represent direct relationships
via some feature of the dataset between two entities.
Graph-guided interest expansion [6], was used in lieu of
time-series data due to data sparsity. A graph with nodes
representing live-streamers and users was created where each
edge was weighted by the donations given to the live-streamer.
Metagraphs were then traversed to mine similar live-streamers,
users, and interests among users.
The clustering of weighted, undirected graphs can be
calculated through the K-Algorithm or M-Algorithm [7]. These
algorithms map a cost function to the traversal of edges in a
graph, and are based upon the k-means algorithm. This
algorithm effectively breaks a graph into sub-graphs, but
requires a custom cost function.
3) Single-Value Decomposition
Single-Value Decomposition, like PCA, requires the use of
the eigen-decomposition of a matrix. Using the eigen values and
vectors of the original data, the importance of rows to other rows,
and column to other columns, is encoded. In essence, the data is
broken down into row similarity, column similarity, and the
relationship to the original data [8]. SVD has a wide domain for
applications, such as image compression, PCA, and signal
processing.
SVD can be used in recommendation systems for two tasks,
latent feature extraction and dimensionality reduction [13]. A
drawback of SVD is that it is affected by sparsity. To address
sparsity some fill in missing values from descriptive statistics
[13,9,10].
C. Recommendation Systems
1) Collaborative and Content-based Filtering
Patterns and similarities among users, content, and users and
content, are mined through various methods. The goal is to find
the similarity of each user to each other user, each user to the
content they interact with, and each piece of content to each
other piece of content. There are a variety of techniques used to
mine these relationships. Such as SVD or Matrix Factorization
[8,10,13], k-nearest neighbors [13], the K- and M-Algorithms
[7], and Graph Guided Interest Expansion with Multi Modal
diffusion [6].
2) Deep Matrix Factorization
Deep matrix factorization expands upon SVD with deep
learning to improve the quality of the matrix. Like SVD,
DeepMF is used for collaborative filtering from datasets
including user reviews by title. DeepMF improves upon its’
recommendations through a loss function, and then continues
training on its output until error is reduced [10]. Through this
training method DeepMF both improves recommendations
through the reduction of sparsity, and through mining k latent
features [10].
3) Graph Collaborative Filtering
Graph Collaborative Filtering, such as MixRec, aim to fill
the gaps in low-dimensional features spaces [14]., such as
DeepMF. Matrix factorization is a common technique
collaborative filtering technique; however, it does not leverage
all the data which is available. Graph collaborative filtering aims
to leverage relational information, such as page views [14], or
donations [6].
Aside from MixRec, another model which similarly
leverages user relationships to content is MMBee, Multi Modal
Fusion and Behavior Expansion [14]. MMBee leverages the
relations of user donations to authors in a graph, rather than
through matrix factorization. Methods of leveraging graphs of
user interactions will likely be the future of collaborative
filtering.
IV. PROPOSED APPROACHES
With the goal being to explore data mining techniques, this
paper intends to explore the extraction of features from the