Hi All,
Currently, I am analysing cases and the time spent on each one. I know what an acceptable time is, but I am having difficulty deciding which values in the data are outliers (some are quite visible).
Have you ever used a formula, or a rule for how large a deviation should be, to define outliers?
Many thanks,
Marieta

Hi Marieta,
This is a fairly standard task in statistics. Since your data are univariate, you can follow one of two main approaches:
1) Check whether your data are Gaussian or nearly Gaussian (by visual inspection and hypothesis testing)
If so, you can use the standard deviation method to define outliers: typically, these are values lying more than three standard deviations from the mean. You can adapt the threshold to your context by checking the distribution of the values. A box plot is always a useful visualization. If you use R, the boxplot function will even mark outliers for you automatically (by default, points lying more than 1.5 times the interquartile range beyond the box).
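As a rough illustration of the standard deviation rule, here is a minimal Python sketch using only the standard library. The function name `sd_outliers`, the default threshold `k=3.0`, and the sample times are all made up for the example, so treat it as a starting point rather than a finished implementation.

```python
from statistics import mean, stdev

def sd_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical case-handling times in minutes (illustrative data only)
times = [30, 32, 28, 35, 31, 29, 33, 30, 34, 27,
         31, 32, 29, 30, 33, 28, 31, 30, 32, 240]

print(sd_outliers(times))  # [240]
```

Keep in mind that an extreme value inflates both the mean and the standard deviation, so with small samples this rule can fail to flag points that look like obvious outliers.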
2) Non-Gaussian data or small sample sizes
If you cannot apply the previous method, you can use the IQR (interquartile range), i.e., the difference between the 75th percentile (0.75 quantile) and the 25th percentile (0.25 quantile). Typically, you would use a factor of 1.5 times the IQR and define as an outlier any value smaller than (25th percentile - 1.5*IQR) or larger than (75th percentile + 1.5*IQR). The factor 1.5 is the value usually used in practice, but other values can turn out to be more appropriate, so explore the data and refine it.
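The IQR rule can be sketched in a few lines of Python as well. Again, the function name `iqr_outliers`, the `factor` parameter, and the sample times are hypothetical, and the choice of the "inclusive" quantile method is an assumption (different quantile definitions give slightly different fences on small samples).

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical case-handling times in minutes (illustrative data only)
times = [30, 32, 28, 35, 31, 29, 33, 30, 34, 27, 45, 120]

print(iqr_outliers(times))              # [45, 120]
print(iqr_outliers(times, factor=3.0))  # [120]
```

Raising the factor (here from 1.5 to 3.0) makes the rule more conservative, which is one way to tune the threshold to what counts as an acceptable time in your context.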
These two methods are widely used in practice because they are straightforward yet effective. If you need something more powerful, there are many more recent techniques, such as proximity-based models (clustering) or information-theoretic methods (where outliers are the data instances that increase the complexity of the data set), that you can look up online.
Still, in this particular case I think one of the two methods above should be enough, especially if you can derive a well-tuned threshold from your practical experience with case timings.
I hope this helps.
Best regards,
Pedro F. B. Silva