A. Koufakou, **E. G. Ortiz**, M. Georgiopoulos, G. C. Anagnostopoulos, K. M. Reynolds. "A Scalable and Efficient Outlier Detection Strategy for Categorical Data". International Conference on Tools with Artificial Intelligence (ICTAI), vol. 2 pp. 210-217, 2007. [PDF | Code]

# Abstract

Outlier detection has received significant attention in many applications, such as detecting credit card fraud or network intrusions. Most existing research focuses on numerical datasets and cannot be directly applied to categorical data, where there is little sense in calculating distances among data points. Furthermore, a number of outlier detection methods require quadratic time with respect to the dataset size and usually multiple dataset scans. These characteristics are undesirable for large datasets, potentially scattered over multiple distributed sites. In this paper, we introduce Attribute Value Frequency (AVF), a fast and scalable outlier detection strategy for categorical data. AVF scales linearly with the number of data points and attributes, and relies on a single data scan. AVF is compared with a list of representative outlier detection approaches that have not previously been contrasted against each other. Our proposed solution is experimentally shown to be significantly faster, and as effective in discovering outliers.

# Introduction and Motivation

- Outliers skew results; removing them reduces disorder in the data
- Outliers sometimes provide valuable information
- Best method for categorical data unknown
- Recent methods do not compare to each other

# Applications

- Network Intrusion
- Disease Classification
- Terrorist Activity Over the Internet
- Credit Card Fraud

# Algorithms

## Terms

*k* - User-specified target number of outliers

## Entropy (Greedy)

- Entropy measure (degree of disorder) used to find outliers
- Major idea: find the *k* outliers whose removal minimizes the entropy (uncertainty) of the remaining dataset
- Results in the removal of the **lowest**-probability points
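The Greedy entropy idea above can be sketched as follows. This is a naive brute-force illustration, not the paper's optimized implementation; it treats attributes as independent, so the dataset entropy is the sum of the per-attribute entropies, and function names are ours:

```python
from collections import Counter
from math import log2

def dataset_entropy(data):
    # Sum of per-attribute entropies, treating attributes as independent.
    n = len(data)
    total = 0.0
    for column in zip(*data):
        counts = Counter(column)
        total += -sum((c / n) * log2(c / n) for c in counts.values())
    return total

def greedy_outliers(data, k):
    # Greedily remove, k times, the point whose removal
    # yields the lowest entropy for the remaining set.
    data = list(data)
    outliers = []
    for _ in range(k):
        best_i = min(range(len(data)),
                     key=lambda i: dataset_entropy(data[:i] + data[i + 1:]))
        outliers.append(data.pop(best_i))
    return outliers
```

For example, on five identical points plus one rare point, the rare point is removed first, since its removal leaves a perfectly uniform (zero-entropy) set.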

## Attribute-Value Frequency (AVF - Our Method)

- Calculate the frequency of every possible value of each attribute
- Average frequencies for each individual data point
- Points with lower scores are more likely to be outliers since they will have infrequent values on average
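The three AVF steps above can be sketched directly (function names are ours, not from the paper; the paper emphasizes a single data scan, while this illustration uses one counting loop followed by one scoring loop):

```python
from collections import Counter

def avf_scores(data):
    # Step 1: count the frequency of every value of each attribute.
    m = len(data[0])
    counts = [Counter() for _ in range(m)]
    for row in data:
        for j, value in enumerate(row):
            counts[j][value] += 1
    # Step 2: average those frequencies for each data point.
    return [sum(counts[j][v] for j, v in enumerate(row)) / m for row in data]

def avf_outliers(data, k):
    # Step 3: the k lowest-scoring points are the outlier candidates.
    scores = avf_scores(data)
    return sorted(range(len(data)), key=lambda i: scores[i])[:k]
```

Because scoring only needs the per-attribute frequency tables, the cost is linear in the number of points and attributes, matching the complexity claim in the abstract.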

## Frequent Itemset Mining

- Scan large databases to discover frequent patterns, i.e., sets of items (itemsets) that co-occur
- Goal: find regularities/trends, e.g.:
  - Items purchased together
  - Automatic customer profiling

## Frequent Itemset-Based Algorithms

- Frequent Pattern Outlier Factor (FindFPOF)
  - Outliers are the transactions (data points) that contain fewer frequent patterns
- Fast Distributed Outlier Detection (FDOD)
  - Outliers are the points that contain the most infrequent itemsets
  - Outlier score is based on the length of each infrequent itemset contained in a point
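A rough sketch of the FindFPOF idea, under heavy simplification: itemsets are enumerated by brute force up to `max_len` items rather than mined with a real frequent-pattern algorithm, and all names and parameters here are ours, not the authors'. A point's score is the average support of the frequent patterns it contains, so lower scores mark likelier outliers:

```python
from collections import Counter
from itertools import combinations

def fpof_scores(data, minsup=0.5, max_len=2):
    # Brute-force "mining": count support of every (attribute, value)
    # itemset up to max_len items. A real implementation would use a
    # frequent-itemset miner such as Apriori or FP-growth.
    n = len(data)
    support = Counter()
    for row in data:
        items = sorted(enumerate(row))  # (attribute_index, value) pairs
        for length in range(1, max_len + 1):
            for combo in combinations(items, length):
                support[frozenset(combo)] += 1
    frequent = {s: c / n for s, c in support.items() if c / n >= minsup}
    # Score each point by the average support of the frequent
    # patterns it contains; points with few frequent patterns score low.
    scores = []
    for row in data:
        items = set(enumerate(row))
        contained = [sup for iset, sup in frequent.items() if iset <= items]
        scores.append(sum(contained) / len(frequent) if frequent else 0.0)
    return scores
```

Note how the result depends on the user-chosen `minsup` and maximum itemset length, which is exactly the parameter-tuning burden the conclusions below attribute to FindFPOF and FDOD.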

# Results for Real Datasets (UCI)

## Wisconsin Breast Cancer

- 39 malignant outliers (8% of dataset)
- 444 benign non-outliers (92% of dataset)
- 9 attributes
- All algorithms perform equivalently

## Lymphography

- Goal: Find abnormal lymph nodes
- 148 instances
- 8 attributes
- Outliers: 4%
- Non-Outliers: 96%
- Greedy converges on all of the outliers the quickest, followed by AVF, FindFPOF, and FDOD

# Results for Simulated Datasets

## Vary Dataset Size

- Data size range from 1,000 to 800,000 points
- 10 attributes
- 10 attribute values per attribute
- All algorithms increase linearly
- Greedy increases the quickest
- AVF increases the slowest

## Vary Attribute Dimensionality

- 100,000 data points
- 10 attribute values per attribute
- Number of attributes varies from 2 to 30
- All algorithms increase linearly
- Greedy increases the quickest
- AVF increases the slowest

## Vary Input *k*

- 100,000 points
- 10 attributes
- 10 attribute values per attribute
- Greedy increases linearly
- Other algorithms have near constant execution time

# AVF Advantages

- AVF runs significantly faster than Greedy
- AVF needs one pass over the data instead of *k*
- FIM-based methods:
  - Slower for larger datasets
  - Higher dimensionalities result in increasingly larger itemsets

# Conclusions and Future Work

- Greedy algorithm becomes exceedingly slow for large data sizes. This is because the entropy calculations become increasingly more expensive as the size of the dataset increases.
- FPOF and FDOD methods also become slower for larger datasets, since they need to search through all the possible subsets of each point.
- Greedy algorithm slows down considerably for large values of k as well as for larger dimensionalities. In contrast, the AVF, FindFPOF and FDOD algorithms do not depend on k, so their performance remains constant for each different k value.
- The FindFPOF and FDOD algorithms depend on the minimum support threshold, minsup, entered by the user. Their accuracy can be improved by tuning minsup.
- AVF has the advantage that it does not create itemsets and requires only a single pass over the entire dataset. Therefore its complexity depends only on the dataset size and the number of attributes. In addition, AVF eliminates the need to make difficult choices for user-given parameters such as minimum support or maximum itemset length.
- In the future, we plan to extend the developed technique, AVF, to incorporate the handling of continuous attributes and to modify and apply the ideas presented here in a distributed setting.

# Downloads