Machine Learning for Outlier Detection in R

Question


nick.dale.burns

SSCrazy

Points: 2226
July 4, 2017 at 11:58 pm

#402248

Comments posted to this topic are about the item Machine Learning for Outlier Detection in R


tomaz.kastrun SSCrazy Points: 2141 · Answer 1

Hi Nick,
I would be careful about what type of data is used in PCA, as the algorithm is sensitive to the type of data (nominal, ordinal, categorical, interval). That applies especially to this part, because it is a distance-based algorithm.

distance_matrix <- as.matrix(dist(scale(mtcars)))
pca <- prcomp(distance_matrix)

What puzzled me is a simple content question: what is an outlier in your case? When looking for outliers, one should have a clear goal for how an outlier is defined (in the content sense), so that it is clearer what one is looking for and what the threshold values for such outliers are.

Best, Tomaž

Tomaž Kaštrun | twitter: @tomaz_tsql | Github: https://github.com/tomaztk | blog: https://tomaztsql.wordpress.com/

nick.dale.burns SSCrazy Points: 2226 · Answer 2

tomaz.kastrun - Wednesday, July 5, 2017 12:08 AM
Hi Nick,
I would be careful about what type of data is used in PCA, as the algorithm is sensitive to the type of data (nominal, ordinal, categorical, interval). That applies especially to this part, because it is a distance-based algorithm.

distance_matrix <- as.matrix(dist(scale(mtcars)))
pca <- prcomp(distance_matrix)
What puzzled me is a simple content question: what is an outlier in your case? When looking for outliers, one should have a clear goal for how an outlier is defined (in the content sense), so that it is clearer what one is looking for and what the threshold values for such outliers are.
Best, Tomaž

Hi Tomas,

Completely agree with you re being aware of your data and whether a Euclidean distance metric is appropriate. In this case, all the features are continuous or ordinal, so a Euclidean distance measure is both sensible and appropriate. With regard to PCA itself, this is a technique that partitions variation and exploits correlation amongst features. So it can definitely be applied to categorical features: take, for example, genomics, which routinely uses PCA to explain genetic variability based on categorical allele counts (0, 1 or 2). Should you apply PCA (or any technique) blindly, without understanding the assumptions behind it and its appropriateness to your data? Heck no.
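For anyone following along, here is a minimal sketch of the step being discussed, assuming the built-in mtcars data and base R only. The variable names var_explained and scores are mine, not from the article, and looking at the first two components is just one common way to eyeball the result.

```r
# mtcars ships with base R; every column is numeric (continuous or ordinal)
scaled <- scale(mtcars)                      # centre and scale each feature

# Euclidean distance between every pair of cars, as a 32 x 32 matrix
distance_matrix <- as.matrix(dist(scaled))

# PCA on the distance matrix, as in the snippet quoted above
pca <- prcomp(distance_matrix)

# Proportion of variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(head(var_explained, 3), 3))

# Scores on the first two components, e.g. for a scatter plot
scores <- pca$x[, 1:2]
print(head(scores))
```

If the first couple of components carry most of the variance, a plot of scores is a reasonable place to start looking for points that sit apart from the rest.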

As for your question, I guess I can't answer it. I completely agree with you, that the way you interpret results is very important. Of course, as you state, the way you interpret the results will very much rely on your own problem and your understanding of that problem. So, a pinch of salt will always go a long way.

cstater Old Hand Points: 377 · Answer 3

Nick,

Loved the article. Can you point me to information on the installation/setup of R to be able to run your examples? I have installed R, but need some guidance on installing packages/libraries.

Thanks,
CBS

nick.dale.burns SSCrazy Points: 2226 · Answer 4

cstater - Wednesday, July 5, 2017 8:29 AM
Nick,
Loved the article. Can you point me to information on the installation/setup of R to be able to run your examples? I have installed R, but need some guidance on installing packages/libraries.
Thanks,
CBS

Hi There,

Welcome to R! First of all, google RStudio and install this as your IDE. By far my favourite exploratory IDE for R. From RStudio, you can install most packages using the install.packages() function:

install.packages("dbscan")
library(dbscan)

Good luck.

tomaz.kastrun SSCrazy Points: 2141 · Answer 5

Thank you Nick,

Agree with your points. Thank you for pointing them out, and thumbs up on the article. 🙂


cstater Old Hand Points: 377 · Answer 6

nick.dale.burns - Wednesday, July 5, 2017 3:09 PM
Hi There,
Welcome to R! First of all, google RStudio and install this as your IDE. By far my favourite exploratory IDE for R. From RStudio, you can install most packages using the install.packages() function:
install.packages("dbscan")
library(dbscan)
Good luck.

Nick,
Thank you! RStudio makes it much easier...

CBS

Jonathan Mallia SSCertifiable Points: 5192 · Answer 7

Thanks for the article. So how do you define an outlier in an automated process? Those observations that fall in the first cluster?

nick.dale.burns SSCrazy Points: 2226 · Answer 8

Jonathan Mallia - Saturday, July 22, 2017 9:32 AM
Thanks for the article. So how do you define an outlier in an automated process? Those observations that fall in the first cluster?

Hi Jonathan,

How do you set your thresholds for an automated process? With a lot of care! Have a quick read of R's documentation for DBSCAN. Any points not assigned to a cluster are labelled with a zero.
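As a rough illustration of the zero label, here is a sketch that assumes the dbscan package from CRAN and the built-in mtcars data. The eps and minPts values are placeholders I picked for the example, not recommendations and not the values used in the article.

```r
library(dbscan)

x <- scale(mtcars)   # standardise so no single feature dominates the distances

# eps (neighbourhood radius) and minPts are illustrative guesses only
fit <- dbscan::dbscan(x, eps = 2, minPts = 4)

# Points that DBSCAN could not assign to any cluster carry the label 0
noise_idx <- which(fit$cluster == 0)
outlier_cars <- rownames(mtcars)[noise_idx]

print(table(fit$cluster))   # cluster sizes, with 0 being the noise group
print(outlier_cars)
```

Whether the cars in outlier_cars are outliers in any meaningful sense still depends on the problem; the label only says they did not fit a dense neighbourhood at these settings.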

The question, then, is how do you pick the hyper-parameters (the neighbourhood radius and min points in the case of DBSCAN)? I would be very nervous about leaving this to a truly automated process. Like any science, you will need to collect data, run proofs of concept, evaluate these and develop your solution on historical data. It's the only way to do anything and retain confidence in your future results.
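One simple way to start that kind of evaluation is to sweep the radius and watch how many points get flagged. The sketch below assumes the dbscan package and mtcars again; the grid of eps values and the fixed minPts are hypothetical choices for illustration, not a tuning procedure from the article.

```r
library(dbscan)

x <- scale(mtcars)

# Hypothetical grid of neighbourhood radii to try; minPts is held fixed
eps_grid <- c(1.5, 2, 2.5, 3, 3.5)
min_pts <- 4

# Count how many points end up as noise (label 0) for each radius
noise_counts <- sapply(eps_grid, function(e) {
  sum(dbscan::dbscan(x, eps = e, minPts = min_pts)$cluster == 0)
})

sweep_result <- data.frame(eps = eps_grid, noise_points = noise_counts)
print(sweep_result)
```

A table like this will not pick the parameters for you, but it shows how sensitive the flagged set is to the radius, which is worth knowing before trusting any threshold on new data.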

Cheers,
Nick