Auditing The Data Using Python
This article is an excerpt from my book “Practical Data Analysis: Using Python & Open Source Technology” (available on Amazon worldwide, Google Play and the Apple iBooks Store).
Python and R are amongst the most popular open source programming languages for data science. R’s functionality was developed with statisticians in mind, whereas Python is a general-purpose programming language with an easy-to-understand syntax and a gentle learning curve. Historically, R has been used primarily for academic and research work, although anecdotal evidence suggests that R is starting to see some adoption in the enterprise world (especially in the Financial Services sector) due to its robust data visualisation capabilities. Python, on the other hand, is widely used in enterprises because its functionality extends well beyond data analytics, and because Python-based solutions can be integrated easily with many other technologies in a typical enterprise set-up. Needless to say, given my enterprise background, Python is my data analysis tool of choice.
In addition to Python and R, there is also a wide variety of very powerful commercial data analysis software. However, Python has several advantages over these commercial offerings, as follows -
a) Python’s open source licence (GPL-compatible, but permissive: you can distribute a modified version without making your changes open source) means that it can be used for free. Commercial packages, on the other hand, come with licensing constraints, and the associated cost often limits their availability to a handful of staff in an organisation.
b) Unlike many commercial data analysis packages, Python can be used even on a low-specification desktop computer, making it suitable for large-scale deployment without additional investment in hardware. Data analysis code written in native Python can also run on any computing platform and operating system that supports Python (e.g. Windows, Linux and macOS).
c) Most (if not all) commercial data analysis software is designed for interactive use, often making it unsuitable for implementing fully automated and reusable data analytics solutions. Python code, on the other hand, can fully automate the entire data analysis process, and can be distributed and re-used without constraints.
d) The worldwide Python community is constantly adding new packages and features to an already rich set of functions. Due to the size and scale of community support, new data analysis techniques coming out of academia and research become freely available in Python much more quickly than in commercial offerings.
e) There are a number of online discussion forums dedicated to Python for knowledge sharing. The PyData conferences also provide a valuable channel for exchanging information on new approaches and emerging open source technologies for data management, processing, analytics and visualisation. Video recordings of the PyData conference proceedings are freely available on YouTube.
f) Generally speaking, there are more people with Python programming skills than with working knowledge of any given commercial data analysis package. Python is also gaining popularity as an introductory programming language in many schools and universities worldwide. We are, therefore, very likely to see an increase in the number of people with Python programming skills in the near future.
Python has an extremely rich set of data analysis functionalities, which in my view are more than adequate to meet even the needs of an advanced data analytics practitioner. I have summarised below some of the key data analysis capabilities of Python (you will find a number of real-life examples of how to use these capabilities in my book “Practical Data Analysis – Using Open Source Tools & Techniques”). Readers who are interested in a more comprehensive list of useful Python packages should visit the Awesome Python website, where a “curated list of awesome Python frameworks, libraries, software and resources” is maintained.
a) “Pandas” is probably one of the most important Python libraries for data analysis. Conceptually, Pandas is similar to Microsoft Excel (excluding Excel’s point-and-click functionality) in that it allows you to open, explore, change, update, analyse and save data in a tabular structure. However, in my opinion, Pandas is a far more powerful data analysis tool than Microsoft Excel, for the following reasons: (i) Pandas combined with the broader Python ecosystem provides a much richer set of data analysis functionalities than Microsoft Excel; (ii) unlike Microsoft Excel, Pandas can load and process extremely large data files even on a low-specification desktop computer; (iii) coding your data analysis logic in Pandas ensures that the same set of rules is applied every time you run the analysis; (iv) once you have written your data analysis logic in Pandas, you can re-use it to analyse a completely different data set with minimal changes; and (v) it is much easier to share a piece of code with a friend or colleague than a long set of instructions describing a series of manual operations in Excel.
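For example, a typical data audit in Pandas looks something like the minimal sketch below; the file name transactions.csv and its contents are hypothetical, chosen purely for illustration.

import pandas as pd

# Load a (hypothetical) CSV file into a tabular DataFrame
df = pd.read_csv("transactions.csv")

# Explore the data: dimensions, column types and a preview of the first rows
print(df.shape)
print(df.dtypes)
print(df.head())

# Audit checks: count missing values per column and duplicate rows
print(df.isnull().sum())
print(df.duplicated().sum())

# Summary statistics for the numeric columns
print(df.describe())

# Save a de-duplicated copy of the data
df.drop_duplicates().to_csv("transactions_clean.csv", index=False)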
b) “NumPy”, “SciPy” and “StatsModels” are Python libraries for scientific and financial computation, economic research and applied econometrics. These libraries enable Python to be used in the areas of statistics, linear algebra, Fourier transformation, signal processing, image processing, genetic algorithms, econometric analysis and much more!
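For a flavour of what this looks like in code, the sketch below fits an ordinary least squares regression with StatsModels; the data is synthetic, made up here purely for illustration.

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, plus random noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Fit an ordinary least squares model (adding an intercept term)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Print the estimated coefficients, standard errors and fit statistics
print(model.summary())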
c) Machine Learning is the latest buzzword amongst technologists and data scientists, and a data analysis toolkit would be incomplete without Machine Learning capabilities. Built on top of NumPy and SciPy, Python’s “Scikit-Learn” library provides a range of out-of-the-box implementations of supervised and unsupervised Machine Learning algorithms for data mining and data analysis. A more recent addition to the Python open source Machine Learning toolkit is “TensorFlow” from Google. TensorFlow is designed for large-scale and computationally intensive Machine Learning (such as shallow and deep neural networks). It allows you to define in Python a graph of the computations to perform; TensorFlow then takes that graph and runs it efficiently using optimised C++ code. Other noteworthy Machine Learning libraries in Python include “Keras”, “Theano”, “Pylearn2” and “Pyevolve”.
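As a minimal sketch of Scikit-Learn’s supervised learning workflow, the example below trains a classifier on the small Iris data set that ships with the library and measures its accuracy on held-back data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a small, built-in sample data set
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data so the model is evaluated on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a random forest classifier and report its accuracy on the test set
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))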
d) Data visualisation is an essential step in any rigorous data analysis process, and the Python community has left no stone unturned in making all kinds of data visualisation tools and techniques available to Python users. “Matplotlib” is the grandfather of Python data visualisation packages. It is extremely powerful, but with that power comes complexity. “Seaborn” harnesses the power of Matplotlib to create beautiful charts in a few lines of code. Seaborn was perhaps created with statistical visualisations in mind and is extremely useful when you have to plot a lot of different variables. The “Bokeh” and “Pygal” libraries offer Python users the ability to create interactive and web-ready plots. “Geoplotlib” and Matplotlib’s “Basemap” toolkit are Python tools for creating maps and plotting geographical data. “Missingno” allows you to quickly gauge the completeness of a data set with a visual summary.
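As a quick illustration of how little code Seaborn needs, the sketch below uses one of Seaborn’s bundled example data sets to draw a grid of pairwise plots for every numeric variable with a single plotting call.

import seaborn as sns
import matplotlib.pyplot as plt

# Load one of Seaborn's bundled example data sets
tips = sns.load_dataset("tips")

# One call produces a grid of pairwise scatter plots and histograms,
# coloured by the categorical "day" column
sns.pairplot(tips, hue="day")
plt.show()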
e) A regular time series is a sequence of data points collected at constant time intervals (e.g. the daily closing share price of a listed company for the past six months). While you can implement your own time series analysis methods using a combination of the Pandas, NumPy, Scikit-Learn, SciPy and StatsModels libraries, there are currently two off-the-shelf Python packages for time series forecasting: “Prophet”, an open source Python package created by the Facebook data science team, and “PyFlux”, which is built on top of NumPy, SciPy and Pandas. Using these libraries, one can automate the process of analysing time series data to forecast future values of the series (e.g. tomorrow’s closing share price of a listed company, although I will caveat this statement with a word of caution: forecasting and making a prediction are two different things) and the degree of uncertainty associated with the forecast.
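To show what this looks like in practice, here is a minimal sketch of the Prophet workflow. Prophet expects a DataFrame with a date column named “ds” and a value column named “y”; the file name daily_prices.csv used here is hypothetical, and depending on the version installed the package is imported as prophet or fbprophet.

import pandas as pd
from prophet import Prophet  # in older releases: from fbprophet import Prophet

# Historical observations; Prophet requires columns named 'ds' (date) and 'y' (value)
df = pd.read_csv("daily_prices.csv")  # hypothetical input file

# Fit the model and build a frame extending 30 days into the future
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)

# 'yhat' is the forecast; 'yhat_lower' and 'yhat_upper' bound its uncertainty
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())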
f) In April 2017, Microsoft announced a preview release of SQL Server 2017 that is capable of in-database analytics and machine learning using Python. This feature will allow Python-based data analysis models to be built inside SQL Server itself, thereby eliminating the need to move data from the database to the models. Currently, Python also provides a number of off-the-shelf packages for extracting data from different types of data sources, such as traditional databases (the full list of Python database drivers is available on the Awesome MySQL website), Excel files (the “Pandas”, “openpyxl” and “pyexcel” libraries), Word documents (the “python-docx” and “textract” libraries), PDF documents (the “PDFMiner” library), CSV files (the “Pandas” library) and Internet websites (the “lassie”, “micawber” and “newspaper” libraries).
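As a brief illustration of how little code such extraction takes, the sketch below reads an Excel worksheet with Pandas and pulls the text out of a Word document with python-docx; both file names are hypothetical.

import pandas as pd
from docx import Document  # provided by the python-docx package

# Read the first worksheet of a (hypothetical) Excel workbook into a DataFrame
sales = pd.read_excel("sales_report.xlsx")
print(sales.head())

# Extract the paragraph text from a (hypothetical) Word document
doc = Document("audit_notes.docx")
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
print(text[:500])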
g) While “Jupyter Notebook” is not itself a data analysis tool, it is still worth mentioning here. Jupyter Notebook provides a rich user interface in a web browser (e.g. Internet Explorer) for writing Python code, adding explanatory narratives and displaying the outputs of the code, all in a single document. Jupyter Notebook makes it extremely easy to document your data analysis work, share it with friends and colleagues, and publish it on the Internet.
To summarise, if you choose to invest some of your time in developing practical data analytics skills, give Python a try!
PS: I have just started recording some videos on emerging data analytics tools and
techniques that you can view on YouTube via the link below. Feel free to drop by and
leave your comments.