fabridigua/BenfordLawAnalyzer

BenfordLawAnalyzer

This small project shows how to spot anomalies in a dataset using Benford's Law and a few lines of Python.


System Requirements

  • Python 3.10
  • NumPy 1.23.5
  • pandas 1.5.2

Steps to use the code with your own dataset

  1. Select the dataset, the column of interest and the title you want to set in the plot

    benford.analyze('datasets\\gaia-dr2-rave-35.csv', 'r_distance', 'Distance to Earth of 250k stars')
  2. Calculate the correlation

    benford.calculateCorrelation()
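  3. Implement your own manipulation method

    # Body of a manipulation method on the analyzer class; needs `import math` and `import random`.
    # Drops NaN and non-positive values, then overwrites roughly half of the rest with tampered values.
    vect = [1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 9, 8, 9, 9]
    self.data_vector = [random.choice(vect) * 1000 if random.randint(0, 10) % 2 == 0 else x for x in
                        self.data_vector if not math.isnan(x) and x > 0]
  4. Call the manipulation method and re-analyze the vector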

    benford.manipulateV2()
    benford.reanalyze()
    print(benford.calculateCorrelation())
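
To make the steps above more concrete, here is a minimal, standard-library-only sketch of what a leading-digit analysis and a correlation against Benford's Law could look like. It is not the code behind `benford.analyze` or `benford.calculateCorrelation`: the names `leading_digit`, `digit_frequencies`, `pearson`, `benford_correlation` and `tamper` are hypothetical, and the use of a plain Pearson coefficient is an assumption.

```python
import math
import random
from collections import Counter

# Expected Benford frequencies for leading digits 1..9: P(d) = log10(1 + 1/d)
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def leading_digit(x):
    """Return the first significant digit of a positive, finite number."""
    # Scientific notation puts the first significant digit first.
    # 16 significant digits keep 0.03 reading as 3 rather than 2.99...
    return int(f"{x:.15e}"[0])


def digit_frequencies(values):
    """Observed relative frequency of each leading digit 1..9."""
    # Skip NaN, infinities, zero and negatives: they have no usable leading digit
    clean = [x for x in values if math.isfinite(x) and x > 0]
    counts = Counter(leading_digit(x) for x in clean)
    return [counts[d] / len(clean) for d in range(1, 10)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def benford_correlation(values):
    """Correlation between observed and expected leading-digit frequencies."""
    return pearson(digit_frequencies(values), BENFORD)


def tamper(values, rng):
    """Hypothetical manipulation: overwrite about half of the values."""
    vect = [1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 9, 8, 9, 9]
    return [rng.choice(vect) * 1000 if rng.random() < 0.5 else x for x in values]


# Powers of 2 are a classic sequence that follows Benford's Law closely
original = [2 ** k for k in range(1, 200)]
tampered = tamper(original, random.Random(0))  # fixed seed for repeatability
print("original:", benford_correlation(original))
print("tampered:", benford_correlation(tampered))
```

The untouched sequence should score close to 1, while the tampered copy should score noticeably lower, which is the kind of drop that steps 3 and 4 are meant to expose.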

Note that using Benford's Law to spot data anomalies is suitable only under particular conditions:

  • Sample Size: if the sample size is too small, the distribution of the leading digits may not follow the expected pattern.
  • Selection Bias: the dataset needs to be representative of the population it is drawn from, otherwise it may not follow Benford’s Law.
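  • Leading Digit Preference: Benford’s Law assumes that the values are not shaped by a human preference for particular leading digits (for example rounding, thresholds or psychological pricing). This does not mean that each digit 1 through 9 is equally likely to lead: under Benford’s Law the digit d leads with probability log10(1 + 1/d), about 30.1% for 1 and about 4.6% for 9.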
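
The "expected pattern" mentioned under Sample Size can be made concrete with the standard Benford formula, P(d) = log10(1 + 1/d). The sketch below is only an illustration: the helper names `benford_probability` and `expected_counts` are hypothetical, and the two sample sizes are arbitrary choices (250,000 echoes the star dataset from step 1).

```python
import math


def benford_probability(d):
    """Probability that d (1..9) is the leading digit under Benford's Law."""
    return math.log10(1 + 1 / d)


def expected_counts(sample_size):
    """Expected number of values per leading digit for a given sample size."""
    return {d: sample_size * benford_probability(d) for d in range(1, 10)}


# Compare a very small sample with a large one
for n in (50, 250_000):
    counts = expected_counts(n)
    print(n, {d: round(c, 1) for d, c in counts.items()})
```

With 50 values, only about 2 are expected to start with 9, so a handful of values more or less is enough to distort the observed distribution. With 250,000 values the same digit is expected over 11,000 times, and random noise matters far less.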

More details in the Medium article.


Datasets

The datasets used for the tests come from Kaggle

Bibliography
