II. MOTIVATION
The motivation for this project is to explore both data
cleaning and data mining techniques, and to examine how the
resulting information can be leveraged in recommendation
systems. Whether we recognize it or not, these systems play a
role in our daily lives whenever we shop online, use search
engines, view the news, or see advertising.
I have a long-held interest in recommendation systems.
Through discussions with peers, I have found that many of them
take note of which services give accurate recommendations and
which give poor ones. For example, YouTube seems to be the
preferred platform for finding new music, while platforms like
Spotify and Pandora were perceived as having inferior
recommendations. Anecdotes such as these have led me to
wonder what makes each system so different.
III. RELATED WORK
A. Data Cleaning
The author of the dataset [3] has proposed filtering and
cleaning criteria and has also done some work to create subsets
of the data which match these criteria, such as filtering the User
List information to contain only users who have provided
demographic information.
B. Principal Component Analysis
Principal Component Analysis (PCA) is one method of
extracting new, uncorrelated features from original features
which exhibit some level of covariance [4].
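As a brief illustration, the sketch below applies PCA to a synthetic set of covarying rating columns. Both the data and the use of scikit-learn are assumptions on my part for illustration, not part of the project pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 users, 5 rating columns that covary strongly.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
ratings = base + 0.1 * rng.normal(size=(100, 5))

# Project onto two new, uncorrelated features (principal components).
pca = PCA(n_components=2)
components = pca.fit_transform(ratings)

# The first component captures nearly all of the shared variance.
print(pca.explained_variance_ratio_)
```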
C. Clustering - Binary Vector Set Similarity
Similarity between binary vectors and a query vector is
measured in Hamming space. The pigeonhole principle is used
to find candidates in the dataset, which are then verified [5]. In
research done by Qin et al., it was found that using non-equal
partition widths and varying the threshold per partition could
account for skewed data, leading to improved accuracy in the
set similarity search.
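As a minimal sketch of the basic idea (using equal-width partitions rather than the variable-width scheme of Qin et al., and synthetic data of my own): splitting vectors into t+1 partitions guarantees, by the pigeonhole principle, that any vector within Hamming distance t of the query matches it exactly in at least one partition, so exact partition matches form the candidate set to verify.

```python
import numpy as np

def pigeonhole_candidates(data, query, t):
    """Find rows of `data` within Hamming distance t of `query`.

    Split vectors into t + 1 equal-width partitions; any vector
    within distance t must match the query exactly in at least
    one partition (pigeonhole), so those matches are candidates.
    """
    n_parts = t + 1
    bounds = np.linspace(0, len(query), n_parts + 1).astype(int)
    candidates = set()
    for i in range(n_parts):
        lo, hi = bounds[i], bounds[i + 1]
        match = np.all(data[:, lo:hi] == query[lo:hi], axis=1)
        candidates.update(np.nonzero(match)[0])
    # Verification step: compute the true Hamming distance.
    return [j for j in candidates if np.sum(data[j] != query) <= t]

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(1000, 64))
query = data[0].copy()
query[:2] ^= 1  # flip two bits, so row 0 is within distance 2
print(pigeonhole_candidates(data, query, t=2))
```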
D. Graph Clustering and Interest Mapping
Graphs are commonly used for mapping relationships
among entities. In a graph, edges represent direct relationships
between two entities via some feature of the dataset.
Graph-guided interest expansion [6] was used in lieu of
time-series data due to data sparsity. A graph was created with
nodes representing live-streamers and users, where each edge
was weighted by the donations given to the live-streamer.
Metagraphs were then traversed to mine similar live-streamers,
users, and interests among users.
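A small sketch of the graph structure described above, built with networkx on invented donation records. This illustrates only the weighted user/streamer graph, not the metagraph traversal of [6].

```python
import networkx as nx

# Invented donation records: (user, streamer, amount).
donations = [
    ("user_a", "streamer_1", 5.0),
    ("user_a", "streamer_2", 1.0),
    ("user_b", "streamer_1", 3.0),
    ("user_c", "streamer_2", 8.0),
]

G = nx.Graph()
for user, streamer, amount in donations:
    # Accumulate repeat donations into a single weighted edge.
    if G.has_edge(user, streamer):
        G[user][streamer]["weight"] += amount
    else:
        G.add_edge(user, streamer, weight=amount)

# Users who funded streamer_1, weighted by donation size.
print(dict(G["streamer_1"]))
```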
The clustering of weighted, undirected graphs can be
calculated through the K-Algorithm or M-Algorithm [7]. These
algorithms map a cost function to the traversal of edges in a
graph and are based upon the k-means algorithm. They
effectively break a graph into sub-graphs, but require a custom
cost function.
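The precise K- and M-Algorithms of [7] are not reproduced here; the sketch below only illustrates the general k-means-style pattern they build on, with weighted shortest-path distance standing in for the custom cost function (an assumption on my part).

```python
import networkx as nx

def kmeans_style_partition(G, centers, n_iter=10):
    """Break G into sub-graphs around center nodes, k-means style.

    Assignment cost is the weighted shortest-path distance to a
    center; each center is then recomputed as the cluster member
    (medoid) minimizing the cluster's total cost. Assumes no
    cluster becomes empty.
    """
    for _ in range(n_iter):
        dist = {c: nx.single_source_dijkstra_path_length(G, c, weight="weight")
                for c in centers}
        clusters = {c: [] for c in centers}
        for node in G:
            best = min(centers, key=lambda c: dist[c].get(node, float("inf")))
            clusters[best].append(node)
        # Recompute each center as the member with minimal total cost.
        centers = [min(members,
                       key=lambda m: sum(
                           nx.single_source_dijkstra_path_length(
                               G, m, weight="weight").get(x, float("inf"))
                           for x in members))
                   for members in clusters.values()]
    return clusters

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1),
                           ("c", "d", 5), ("d", "e", 1)])
print(kmeans_style_partition(G, centers=["a", "e"]))
```

On this toy graph the heavy c-d edge acts as a natural cut, so the nodes settle into two sub-graphs on either side of it.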
E. Matrix Factorization (Singular Value Decomposition)
Singular Value Decomposition (SVD) is a method of
breaking a matrix down into three separate matrices which
represent the importance, the similarity of content, and the
general preferences of users [8]. Matrix factorization is the use
of SVD to mine latent features from the decomposed matrix.
Unfortunately, a data sparsity problem occurs because not all
users have reviewed all content [9]. Deep matrix factorization
utilizes deep neural networks to improve the quality of
recommendations by predicting user ratings through the use of a
loss function [10].
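A minimal sketch of plain SVD on a toy ratings matrix. The values are invented, and treating missing ratings as zeros is a simplification; practical matrix factorization optimizes over observed entries only.

```python
import numpy as np

# Toy ratings matrix (users x items); zeros stand for missing ratings.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Full decomposition: R = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the top-k singular values to recover the latent structure;
# the low-rank reconstruction yields a score for every user-item pair.
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))
```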
F. Collaborative and Content-based Filtering
Patterns and similarities among users, among content, and
between users and content are mined through various methods.
The goal is to find the similarity of each user to every other user,
of each user to the content they interact with, and of each piece
of content to every other piece of content. A variety of
techniques are used to mine these relationships, such as SVD
and matrix factorization [8,10,13], k-nearest neighbors [13], the
K- and M-Algorithms [7], and graph-guided interest expansion
with multi-modal diffusion [6].
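As one concrete example of mining user-to-user similarity, the sketch below computes cosine similarity between rating rows and uses it to weight a prediction; the ratings are invented for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy ratings: rows are users, columns are items.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

# User-user similarity: users 0 and 1 rate alike, user 2 does not.
print(cosine_sim(R[0], R[1]))  # high
print(cosine_sim(R[0], R[2]))  # low

# Predict user 0's rating of item 2 from other users' ratings,
# weighted by how similar each user is to user 0.
sims = np.array([cosine_sim(R[0], R[u]) for u in (1, 2)])
ratings = R[(1, 2), 2]
print(sims @ ratings / sims.sum())
```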
IV. PROPOSED APPROACHES
The goal is to explore data mining techniques and feature
extraction as they apply to recommendation systems. I propose
to first set criteria for filtering and cleaning the data, then find
candidate features for mining, and finally extract new features
through the use of data mining techniques.
A. Data Cleaning
To clean the data, we must rectify incomplete records and
decide how to deal with outliers. Incomplete records are those
lacking values for one or more features; to rectify them, we must
either drop the record or fill in a value which aligns with some
statistical description of the feature as a whole, such as the mean
or median.
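A sketch of both options using pandas; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical user table with gaps.
users = pd.DataFrame({
    "age": [23, None, 31, 27],
    "episodes_watched": [120, 80, None, 45],
})

# Option 1: drop records with any missing feature.
dropped = users.dropna()

# Option 2: fill gaps with a statistical description of the column.
filled = users.fillna(users.median(numeric_only=True))
print(filled)
```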
Not only may records be incomplete, but the entity they are
describing may lack enough supporting data to justify its
inclusion during training or validation. For example, if the
average user rates 10-12 entities, while a handful of users have
only rated one, it may be worth dropping the records which do
not tell a complete story about the user.
Some related records may need to be dropped if the story
they tell is implausible. For example, a user who appears to have
watched ten times more than any other user is likely not
providing a truthful account; such records can be considered on
a case-by-case basis.
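The sketch below illustrates both filters on an invented rating log: dropping users with too few records, then flagging implausibly heavy watchers for case-by-case review. The column names and thresholds are assumptions.

```python
import pandas as pd

# Invented rating log; real column names and thresholds will differ.
log = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4],
    "hours_watched": [2, 3, 1, 200, 150, 150, 4, 2, 3, 1, 5],
})

# Drop users with too few records to tell a complete story.
counts = log.groupby("user_id").size()
active = counts[counts >= 3].index
log = log[log["user_id"].isin(active)]

# Flag users whose totals are implausibly far beyond the norm,
# e.g. ten times the median, for case-by-case review.
totals = log.groupby("user_id")["hours_watched"].sum()
print(totals[totals > 10 * totals.median()])
```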
Some features may not be complete and cannot be rectified
through statistical descriptions. Such data cannot be considered
during feature extraction and must be discarded during data
cleaning.
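A short sketch of discarding such features, using the fraction of missing values as the criterion; the 50% threshold is an assumption.

```python
import pandas as pd

# Hypothetical table where one feature is mostly missing.
df = pd.DataFrame({
    "score": [8, 7, 9, 6],
    "favorite_genre": [None, None, "action", None],
})

# Discard features too incomplete to rectify statistically.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)
print(df.columns.tolist())  # ['score']
```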
B. Feature Extraction
During feature extraction we will examine the existing
features, determine which ones are suitable for mining, and
discard those which hold no significance.
A key technique in feature extraction is feature
transformation. To transform a feature is to change its
representation, such as converting strings of text into vectors,
mapping ordinal categorical data to integers, or changing the