Auditing The Data Using Python
This article is an excerpt from my book “Practical Data Analysis: Using Python & Open Source Technology” (available on Amazon worldwide, Google Play and the Apple iBooks Store).
Python and R are amongst the most popular open source programming languages for data science. R’s functionality was developed with statisticians in mind, whereas Python is a general-purpose programming language with an easy-to-understand syntax and a gentle learning curve. Historically, R has been used primarily for academic and research work, although anecdotal evidence suggests that R is starting to see some adoption in the enterprise world (especially in the Financial Services sector) due to its robust data visualisation capabilities. Python, on the other hand, is widely used in enterprises because its functionality extends well beyond data analytics, and because Python-based solutions can be integrated easily with many other technologies in a typical enterprise set-up. Needless to say, given my enterprise background, Python is my data analysis tool of choice.
In addition to Python and R, there is also a wide variety of very powerful commercial data analysis software. However, Python has several advantages over these commercial offerings, as follows -
a) Python’s open source licence (GPL-compatible, but permissive: you can distribute a modified version without making your changes open source) means that it can be used for free. Commercial packages, on the other hand, come with licensing constraints, and the associated cost often limits their availability to a handful of staff in an organisation.
b) Unlike many commercial data analysis packages, Python can be used even on a low-specification desktop computer, making it suitable for large-scale deployment without additional investment in hardware. Data analysis code written in native Python can also run on any computing platform and operating system that supports Python (e.g. Windows, Linux and macOS).
c) Most (if not all) commercial data analysis software is designed for interactive use, often making it unsuitable for implementing fully automated and reusable data analytics solutions. Python code, on the other hand, can fully automate the entire data analysis process, and can be distributed and re-used without constraints.
d) The worldwide Python community is constantly adding new packages and features to an already rich set of functions. Due to the size and scale of community support, new data analysis techniques coming out of academia and research become freely available in Python much more quickly than in commercial offerings.
e) There are a number of online discussion forums dedicated to Python for knowledge sharing. The PyData conferences also provide a valuable channel for exchanging information on new approaches and emerging open source technologies for data management, processing, analytics and visualisation. Video recordings of the PyData conference proceedings are freely available on YouTube.
f) Generally speaking, there are more people with Python programming skills than with working knowledge of any given commercial data analysis package. Python is also gaining popularity as an introductory programming language in many schools and universities worldwide. We are, therefore, very likely to see an increase in the number of people with Python programming skills in the near future.
Python has an extremely rich set of data analysis functionalities, which in my view are more than adequate to meet even the needs of an advanced data analytics practitioner. I have summarised below some of the key data analysis capabilities of Python (you will find a number of real-life examples of how to use these capabilities in my book “Practical Data Analysis – Using Open Source Tools & Techniques”). Readers who are interested in a more comprehensive list of useful Python packages should visit the Awesome Python website, where a “curated list of awesome Python frameworks, libraries, software and resources” is maintained.
a) “Pandas” is probably one of the most important Python libraries for data analysis. Conceptually, Pandas is similar to Microsoft Excel (excluding Excel’s point-and-click functionality) in that it allows you to open, explore, change, update, analyse and save data in a tabular structure. However, in my opinion, Pandas is a far more powerful data analysis tool than Microsoft Excel, for the following reasons: (i) Pandas combined with the broader Python ecosystem provides a much richer set of data analysis functionalities than Microsoft Excel; (ii) unlike Microsoft Excel, Pandas can load and process extremely large data files even on a low-specification desktop computer; (iii) coding your data analysis logic in Pandas ensures that the same set of rules is applied every time you run the analysis; (iv) once you have written your data analysis logic in Pandas, you can re-use it to analyse a completely different data set with minimal changes; and (v) it is much easier to share a piece of code with a friend or colleague than a long set of instructions describing a series of manual operations in Excel.
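For example, a typical data audit in Pandas looks something like the minimal sketch below; the file name transactions.csv and its contents are hypothetical, chosen purely for illustration.

import pandas as pd

# Load a (hypothetical) CSV file into a tabular DataFrame
df = pd.read_csv("transactions.csv")

# Explore the data: dimensions, column types and a preview of the first rows
print(df.shape)
print(df.dtypes)
print(df.head())

# Audit checks: count missing values per column and duplicate rows
print(df.isnull().sum())
print(df.duplicated().sum())

# Summary statistics for the numeric columns
print(df.describe())

# Save a de-duplicated copy of the data
df.drop_duplicates().to_csv("transactions_clean.csv", index=False)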
b) “NumPy”, “SciPy” and “StatsModels” are Python libraries for scientific and financial computation, economic research and applied econometrics. These libraries enable Python to be used in the areas of statistics, linear algebra, Fourier transformation, signal processing, image processing, genetic algorithms, econometric analysis and much more!
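For a flavour of what this looks like in code, the sketch below fits an ordinary least squares regression with StatsModels; the data is synthetic, made up here purely for illustration.

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, plus random noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Fit an ordinary least squares model (adding an intercept term)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Print the estimated coefficients, standard errors and fit statistics
print(model.summary())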
c) Machine Learning is the latest buzzword amongst technologists and data scientists, and a data analysis toolkit would be incomplete without Machine Learning capabilities. Built on top of NumPy and SciPy, Python’s “Scikit-Learn” library provides a range of out-of-the-box implementations of supervised and unsupervised Machine Learning algorithms for data mining and data analysis. A more recent addition to the Python open source Machine Learning toolkit is “TensorFlow” from Google. TensorFlow is designed for large-scale and computationally intensive Machine Learning (such as shallow and deep neural networks). It allows you to define in Python a graph of the computations to perform; TensorFlow then takes that graph and runs it efficiently using optimised C++ code. Other noteworthy Machine Learning libraries in Python include “Keras”, “Theano”, “Pylearn2” and “Pyevolve”.
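As a minimal sketch of Scikit-Learn’s supervised learning workflow, the example below trains a classifier on the small Iris data set that ships with the library and measures its accuracy on held-back data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a small, built-in sample data set
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data so the model is evaluated on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a random forest classifier and report its accuracy on the test set
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))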
d) Data visualisation is an essential step in any rigorous data analysis process, and the Python community has left no stone unturned in making all kinds of data visualisation tools and techniques available to Python users. “Matplotlib” is the grandfather of Python data visualisation packages. It is extremely powerful, but with that power comes complexity. “Seaborn” harnesses the power of Matplotlib to create beautiful charts in a few lines of code. Seaborn was perhaps created with statistical visualisations in mind and is extremely useful when you have to plot a lot of different variables. The “Bokeh” and “Pygal” libraries offer Python users the ability to create interactive and web-ready plots. “Geoplotlib” and Matplotlib’s “Basemap” toolkit are Python tools for creating maps and plotting geographical data. “Missingno” allows you to quickly gauge the completeness of a data set with a visual summary.
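As a quick illustration of how little code Seaborn needs, the sketch below uses one of Seaborn’s bundled example data sets to draw a grid of pairwise plots for every numeric variable with a single plotting call.

import seaborn as sns
import matplotlib.pyplot as plt

# Load one of Seaborn's bundled example data sets
tips = sns.load_dataset("tips")

# One call produces a grid of pairwise scatter plots and histograms,
# coloured by the categorical "day" column
sns.pairplot(tips, hue="day")
plt.show()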
e) A regular time series is a sequence of data points collected at constant time intervals (e.g. the daily closing share price of a listed company for the past six months). While you can implement your own time series analysis methods using a combination of the Pandas, NumPy, Scikit-Learn, SciPy and StatsModels libraries, there are currently two off-the-shelf Python packages for time series forecasting: “Prophet”, an open source Python package created by the Facebook data science team, and “PyFlux”, which is built on top of NumPy, SciPy and Pandas. Using these libraries, one can automate the process of analysing time series data to forecast future values of the series (e.g. tomorrow’s closing share price of a listed company, although I will caveat this statement with a word of caution: forecasting and making a prediction are two different things) and the degree of uncertainty associated with the forecast.
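To show what this looks like in practice, here is a minimal sketch of the Prophet workflow. Prophet expects a DataFrame with a date column named “ds” and a value column named “y”; the file name daily_prices.csv used here is hypothetical, and depending on the version installed the package is imported as prophet or fbprophet.

import pandas as pd
from prophet import Prophet  # in older releases: from fbprophet import Prophet

# Historical observations; Prophet requires columns named 'ds' (date) and 'y' (value)
df = pd.read_csv("daily_prices.csv")  # hypothetical input file

# Fit the model and build a frame extending 30 days into the future
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)

# 'yhat' is the forecast; 'yhat_lower' and 'yhat_upper' bound its uncertainty
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())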
f) In April 2017, Microsoft announced a preview release of SQL Server 2017 that is capable of in-database analytics and machine learning using Python. This feature will allow Python-based data analysis models to be built inside SQL Server itself, thereby eliminating the need to move data from the database to the models. Currently, Python also provides a number of off-the-shelf packages for extracting data from different types of data sources, such as traditional databases (the full list of Python database drivers is available on the Awesome MySQL website), Excel files (the “Pandas”, “openpyxl” and “pyexcel” libraries), Word documents (the “python-docx” and “textract” libraries), PDF documents (the “PDFMiner” library), CSV files (the “Pandas” library) and Internet websites (the “lassie”, “micawber” and “newspaper” libraries).
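As a brief illustration of how little code such extraction takes, the sketch below reads an Excel worksheet with Pandas and pulls the text out of a Word document with python-docx; both file names are hypothetical.

import pandas as pd
from docx import Document  # provided by the python-docx package

# Read the first worksheet of a (hypothetical) Excel workbook into a DataFrame
sales = pd.read_excel("sales_report.xlsx")
print(sales.head())

# Extract the paragraph text from a (hypothetical) Word document
doc = Document("audit_notes.docx")
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
print(text[:500])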
g) While “Jupyter Notebook” is not itself a data analysis tool, it is still worth mentioning here. Jupyter Notebook provides a rich user interface in a web browser (e.g. Internet Explorer) for writing Python code, adding explanatory narratives and displaying the outputs of the code, all in a single document. Jupyter Notebook makes it extremely easy to document your data analysis work, share it with friends and colleagues, and publish it on the Internet.
To summarise, if you choose to invest some of your time in developing practical data analytics skills, give Python a try!
PS: I have just started recording some videos on emerging data analytics tools and
techniques that you can view on YouTube via the link below. Feel free to drop by and
leave your comments.