Microsoft Business Intelligence (Data Tools)|python 3 regex


Thursday, September 16, 2021

Python — A Tool for Everything

Data is an asset for making better decisions, and it can be a major driver of business value. Python is one of the fastest-growing programming languages, and with it we can easily do the following:

Data manipulation with Pandas,
Creating fabulous visualizations with Seaborn, or
Scaling analytics, deep learning, and AI data models with TensorFlow.

So, we can rely on Python, which seems to have a tool for everything.

The volume of data that businesses are able to store and need to analyze continues to grow rapidly across structured, semi-structured, and unstructured data types.
A few years ago, cloud technology was considered an optional environment. Today it is the foundation for modernizing data management, and most organizations use cloud services or infrastructure widely in their data architecture.

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

The os module is one of Python's standard utility modules. It provides a portable way of using operating system-dependent functionality. For example, os.listdir('your_path') lists all the entries in a directory.
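As a small illustration of os.listdir, the sketch below lists the entries of a directory. The temporary directory and the file names a.csv and b.txt are made up for the demo so that it runs anywhere; they are not part of the original post.

```python
import os
import tempfile


def list_directory(path):
    """Return the entry names in `path`, sorted so the order is stable."""
    return sorted(os.listdir(path))


# Hypothetical demo data: two empty files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.txt"):
        open(os.path.join(tmp, name), "w").close()
    entries = list_directory(tmp)

print(entries)
```

os.listdir returns names in arbitrary order, which is why the helper sorts them before returning.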

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
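To make the idea of array-wide operations concrete, here is a minimal sketch. The height values are invented for illustration only.

```python
import numpy as np

# A 2 x 3 matrix of heights in centimetres (made-up values).
heights_cm = np.array([[189.0, 193.0, 200.0],
                       [155.0, 165.0, 170.0]])

# Vectorised arithmetic: every element is converted to metres in one step.
heights_m = heights_cm / 100.0

# Reduction along an axis: the mean of each row.
row_means = heights_cm.mean(axis=1)

print(heights_m.shape)
print(row_means)
```

No explicit Python loop is needed; NumPy applies the division and the mean across the whole array.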

SQLite can be used from Python through the sqlite3 module, which provides an SQL interface compliant with the DB-API 2.0 specification described by PEP 249. You do not need to install this module separately because it has shipped with Python by default since version 2.5.
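A minimal sketch of the sqlite3 workflow might look like this. The users table and its two rows are hypothetical, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# An in-memory database; it disappears when the connection is closed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, used only for this demo.
cur.execute("CREATE TABLE users (name TEXT, age INTEGER)")
cur.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Ryan", 25), ("Rosy", 26)],
)
conn.commit()

# Parameterised query: the ? placeholder is filled from the tuple.
cur.execute("SELECT name FROM users WHERE age > ? ORDER BY name", (25,))
rows = cur.fetchall()
print(rows)

conn.close()
```

Using ? placeholders instead of string formatting lets the driver handle quoting of the values.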

Seaborn is a Python data visualization library based on matplotlib. It is often used to visualize distributions and provides a high-level interface for drawing attractive and informative statistical graphics.

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python, and it works closely with the numerical mathematics library NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It is an end-to-end machine learning platform that can be used across a range of tasks, with a particular focus on training and inference of deep neural networks. TensorFlow is a symbolic math library based on dataflow and differentiable programming.

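Note (version requirements as stated when this post was written; check each library's release notes for current support):
1. Seaborn supports Python 3.7+ and no longer supports Python 2.
2. TensorFlow supports Python 3.5.x through Python 3.8.x, and it requires a 64-bit version of Python.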


To learn more, please follow us -
http://www.sql-datatools.com
To learn more, please visit our YouTube channel at -
http://www.youtube.com/c/Sql-datatools
To learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To learn more, please visit our Twitter account at -
https://twitter.com/macxima

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders.

Key Contributions
Data Modelling and Integration
• Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams.
• Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses.
• Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

Friday, October 16, 2020

Python — Retrieve matching rows from two Dataframes

Pulling the records common to two dataframes is one of the most frequent requirements if you work as a Python developer, data analyst, or data scientist in any organisation.

For example, suppose you have existing user data in dataframe-1 and new user data in dataframe-2. You need to find all the records that appear in both dataframes by using pandas, retrieve the matching rows, and report them to the business.

So, we are here to show you the logic to get these matched records from two datasets/dataframes in Python.

# pandas library for data manipulation in python
import pandas as pd
# numpy is used to create NaN values in the pandas DataFrame
import numpy as np
#creating dataframe-1
df1 = pd.DataFrame({
    'Name': ['Ryan','Rosy','Wills','Tom','Alice','Volter','Jay','John','Ronny'],
    'Age': [25,26,14,19,22,28,30,32,28],
    'Height': [189.0,193.0,200.0,155.0,165.0,170.0,172.0,156.0,165.0]})
#creating dataframe-2
df2 = pd.DataFrame({
    'Name': ['Ryan','Rosy','Wills','Tom','Alice',np.nan,'Jay','John','Ronny'],
    'Age': [25,26,14,0,22,28,30,32,28],
    'Height': [189.0,np.nan,200.0,155.0,np.nan,170.0,172.0,156.0,165.0]})
Display values from Dataframe-1 and Dataframe-2: Both dataframes are now populated, and displaying df1 and df2 shows the values listed above.

Verify the datatypes for each column in both dataframes: The corresponding columns must have the same datatypes before they are compared. Here we cast the columns of dataframe-1 to the datatypes used in dataframe-2:

# cast each column of df1 to the datatype of the same column in df2
df1 = df1.astype(df2.dtypes.to_dict())

How to pull the matched records? Now we have to find all the matched (common) rows from both dataframes by merging them on every column with an inner join, as given below:

#matched rows through an inner merge on all columns
commondf = pd.merge(df1, df2, on=['Name','Age','Height'], how='inner')
#show common records
commondf

pd.merge accepts either on or the left_index/right_index flags, not both together, so the column names alone are used as the join key. With the sample data, five rows match: Ryan, Wills, Jay, John, and Ronny. As you can see, finding the matched records from two dataframes is an easy task with an inner merge.
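One way to cross-check the matched rows is the indicator option of pd.merge, which labels every row of an outer join with its origin. The sketch below rebuilds the sample dataframes from this post; the variable names common, labelled, and matched are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Ryan", "Rosy", "Wills", "Tom", "Alice", "Volter", "Jay", "John", "Ronny"],
    "Age": [25, 26, 14, 19, 22, 28, 30, 32, 28],
    "Height": [189.0, 193.0, 200.0, 155.0, 165.0, 170.0, 172.0, 156.0, 165.0],
})
df2 = pd.DataFrame({
    "Name": ["Ryan", "Rosy", "Wills", "Tom", "Alice", np.nan, "Jay", "John", "Ronny"],
    "Age": [25, 26, 14, 0, 22, 28, 30, 32, 28],
    "Height": [189.0, np.nan, 200.0, 155.0, np.nan, 170.0, 172.0, 156.0, 165.0],
})

keys = ["Name", "Age", "Height"]

# Inner join on all three columns: only rows present in both frames survive.
common = pd.merge(df1, df2, on=keys, how="inner")

# Outer join with indicator=True adds a _merge column whose values are
# 'both', 'left_only' (only in df1) or 'right_only' (only in df2).
labelled = pd.merge(df1, df2, on=keys, how="outer", indicator=True)
matched = labelled[labelled["_merge"] == "both"]

print(common)
print(labelled["_merge"].value_counts())
```

The rows labelled 'both' should be the same rows the inner join returns, so the two results can be compared as a sanity check.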

To learn more, please visit our Medium account at -

https://medium.com/@macxima

Mukesh Singh

Friday, October 9, 2020

Python — Show unmatched rows from two dataframes

For example, suppose you have existing user data in dataframe-1 and new user data in dataframe-2. You need to find all the unmatched records in dataframe-2 by comparing it with dataframe-1, and report them to the business along with the reason they differ.

If you are working as a Python developer and have to validate existing data against new incoming datasets, this is not always an easy job.

So, we are here to show you the logic to get these unmatched records from two datasets/dataframes in Python.

# pandas library for data manipulation in python
import pandas as pd
# numpy is used to create NaN values in the pandas DataFrame
import numpy as np
#creating dataframe-1
df1 = pd.DataFrame({
    'Name': ['Ryan','Rosy','Wills','Tom','Alice','Volter','Jay','John','Ronny'],
    'Age': [25,26,14,19,22,28,30,32,28],
    'Height': [189.0,193.0,200.0,155.0,165.0,170.0,172.0,156.0,165.0]})
#creating dataframe-2
df2 = pd.DataFrame({
    'Name': ['Ryan','Rosy','Wills','Tom','Alice',np.nan,'Jay','John','Ronny'],
    'Age': [25,26,14,0,22,28,30,32,28],
    'Height': [189.0,np.nan,200.0,155.0,np.nan,170.0,172.0,156.0,165.0]})

Display values from Dataframe-1 and Dataframe-2: Both dataframes are now populated, and displaying df1 and df2 shows the values listed above.

Unmatched rows from Dataframe-2: Now we have to find all the unmatched rows in dataframe-2 by comparing it with dataframe-1. To do this, we can compare the dataframes element by element and get a boolean mask of the rows that differ, as given below. This comparison requires both dataframes to have the same index and column labels.

# compare the dataframes element by element;
# a row is flagged when any of its values differ
indexes = (df1 != df2).any(axis=1)

and then select the flagged rows from dataframe-2, as given below:

#look up the unmatched indexes in dataframe-2
#and store the unmatched rows in dataframe-3
df3 = df2.loc[indexes]
#display unmatched values from dataframe-2
df3

Unmatched rows from Dataframe-1: Now we have to find all the unmatched rows in dataframe-1 by comparing it with dataframe-2. The same boolean mask selects them:

#look up the unmatched indexes in dataframe-1
#and store the unmatched rows in dataframe-4
df4 = df1.loc[indexes]
#display unmatched values from dataframe-1
df4

Unmatched rows from Dataframe-1 & Dataframe-2: If you want to display all the unmatched rows from both dataframes, you can also merge the unmatched rows from dataframe-1 with the unmatched rows from dataframe-2, as given below:

#merge both unmatched dataframes by using an outer join
df5 = pd.merge(df3, df4, how='outer')
#display all unmatched rows
df5

As you can see, finding the unmatched records from two dataframes is an easy task when you compare the dataframes element by element.
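One caveat with an elementwise != comparison is that NaN never equals NaN, so a row holding a missing value in the same position in both dataframes is still flagged as different. The sketch below shows one possible NaN-aware variant. The two small frames, old and new, are hypothetical and much shorter than the sample data in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with identical labels; Rosy's height is missing in both.
old = pd.DataFrame({"Name": ["Ryan", "Rosy", "Tom"],
                    "Height": [189.0, np.nan, 155.0]})
new = pd.DataFrame({"Name": ["Ryan", "Rosy", "Tom"],
                    "Height": [189.0, np.nan, 160.0]})

# Naive mask: NaN != NaN is True, so Rosy's row is flagged.
naive = (old != new).any(axis=1)

# NaN-aware mask: a cell counts as equal when the values match
# or when both values are missing.
same = (old == new) | (old.isna() & new.isna())
nan_aware = ~same.all(axis=1)

print(naive.tolist())
print(nan_aware.tolist())
print(new.loc[nan_aware])
```

Whether two missing values should count as a match depends on the business rule, so treat the NaN-aware mask as an option and not as the default.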


Mukesh Singh

Saturday, August 29, 2020

Python - Transpose Dataframe Columns into Rows

Today, I was working with Python and had to transpose some columns into rows to avoid a lot of calculations. Python has a lot of libraries and very strong community support, which means you can solve most problems with your dataset.

Here, I'm using a small dataset to show how we can use the pandas library to transpose a dataframe.

In this example, I'm using a class dataset in which each student has one column per subject holding the marks obtained.

Now, we have to transpose the subject columns into rows under a 'Subject' column, with the marks displayed in a 'Marks' column next to it in dataset-2.

The pandas melt() function changes a DataFrame from wide to long format. One or more columns work as identifiers; all the remaining columns are treated as values and unpivoted to the row axis, leaving only two non-identifier columns: variable and value.

With the help of the pandas library, we can transpose our dataset into the desired result, as follows.

#import libraries
import pandas as pd
# Initialise the data as a dict of lists
# (named data so that it does not shadow the built-in list)
data = {'Name':['Ryan','Arjun','john','Rosy'],
        'Class':['IV','III','III','V'],
        'English':[90,85,90,95],
        'Math':[95,90,85,80],
        'Science':[95,90,90,90],
        'Computer':[98,95,90,85],
        'Year':[2020,2020,2020,2020]}
# Create the DataFrame from the dict
df = pd.DataFrame(data)

#show data in the dataframe
df

======================================================
Name |Class| English | Math | Science | Computer | Year
------------------------------------------------------
Ryan |IV   | 90      | 95   | 95      | 98       | 2020
Arjun|III  | 85      | 90   | 90      | 95       | 2020
john |III  | 90      | 85   | 90      | 90       | 2020
Rosy |V    | 95      | 80   | 90      | 85       | 2020
======================================================

# function to unpivot the dataframe
df3 = df.melt(['Name','Class','Year'], var_name='Subject')
#show data in the dataframe
df3

=======================================
  |Name | Class|Year |Subject | value
---------------------------------------
0 |Ryan | IV   |2020 |English | 90
1 |Arjun| III  |2020 |English | 85
2 |john | III  |2020 |English | 90
3 |Rosy | V    |2020 |English | 95
4 |Ryan | IV   |2020 |Math    | 95
5 |Arjun| III  |2020 |Math    | 90
6 |john | III  |2020 |Math    | 85
7 |Rosy | V    |2020 |Math    | 80
8 |Ryan | IV   |2020 |Science | 95
9 |Arjun| III  |2020 |Science | 90
10|john | III  |2020 |Science | 90
11|Rosy | V    |2020 |Science | 90
12|Ryan | IV   |2020 |Computer| 98
13|Arjun| III  |2020 |Computer| 95
14|john | III  |2020 |Computer| 90
15|Rosy | V    |2020 |Computer| 85
=======================================

#rename the value column to Marks
df3 = df3.rename(columns={'value': 'Marks'})
#show data in the dataframe
df3

=======================================
  |Name | Class|Year |Subject | Marks
---------------------------------------
0 |Ryan | IV   |2020 |English | 90
1 |Arjun| III  |2020 |English | 85
2 |john | III  |2020 |English | 90
3 |Rosy | V    |2020 |English | 95
4 |Ryan | IV   |2020 |Math    | 95
5 |Arjun| III  |2020 |Math    | 90
6 |john | III  |2020 |Math    | 85
7 |Rosy | V    |2020 |Math    | 80
8 |Ryan | IV   |2020 |Science | 95
9 |Arjun| III  |2020 |Science | 90
10|john | III  |2020 |Science | 90
11|Rosy | V    |2020 |Science | 90
12|Ryan | IV   |2020 |Computer| 98
13|Arjun| III  |2020 |Computer| 95
14|john | III  |2020 |Computer| 90
15|Rosy | V    |2020 |Computer| 85
=======================================
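The melt and rename steps can also be combined by passing value_name to melt, and pivot can turn the long table back into a wide one. The sketch below rebuilds the sample data; the names long_df and wide_df are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ryan", "Arjun", "john", "Rosy"],
    "Class": ["IV", "III", "III", "V"],
    "English": [90, 85, 90, 95],
    "Math": [95, 90, 85, 80],
    "Science": [95, 90, 90, 90],
    "Computer": [98, 95, 90, 85],
    "Year": [2020, 2020, 2020, 2020],
})

# Unpivot the subject columns and name the value column in the same call.
long_df = df.melt(id_vars=["Name", "Class", "Year"],
                  var_name="Subject", value_name="Marks")

# Reverse direction: one row per student, one column per subject again.
wide_df = long_df.pivot(index="Name", columns="Subject", values="Marks")

print(long_df.head())
print(wide_df)
```

pivot needs each (Name, Subject) pair to be unique, which holds for this sample because every student appears once per subject.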


Mukesh Singh

Wednesday, February 26, 2020

Python - Extracting Domain Name From URLs Using Regular Expressions

As Python developers, we often have to complete many data cleansing jobs on a file before processing other business operations.

For example, you may have a raw text file containing web scraping data and need to read specific values such as website URLs, then apply regular expression matching to pull out the domain names.

Extracting the domain name accurately can be quite tricky, mainly because the domain extension can contain two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there.

The hard part is knowing whether the name is at the second level, the third level, or deeper.

What is a Regular Expression and which module is used in Python?

A regular expression is a sequence of characters, written in a specialized pattern syntax, that is mainly used to find and replace patterns in a string or file.

The Python module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
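As a brief illustration of re.error, the sketch below compiles one valid and one deliberately broken pattern. The helper name try_compile and the sample strings are made up for the demo.

```python
import re


def try_compile(pattern):
    """Return the compiled pattern, or None when the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print(f"Invalid pattern {pattern!r}: {exc}")
        return None


valid = try_compile(r"\d+")          # one or more digits
invalid = try_compile(r"(unclosed")  # missing closing parenthesis

numbers = valid.findall("Python 3 was released in 2008")
print(numbers)
```

Compiling a pattern once and reusing the compiled object is also convenient when the same expression is applied to many strings.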

# Python program to extract domain names from a list of website URLs by regular expression.
# Import the module required for regular expressions
import re
# List of website URLs
domainlist = ['m.google.com',
              'm.docs.google.com',
              'www.someisotericdomain.innersite.mall.co.uk',
              'www.ouruniversity.department.mit.ac.us',
              'www.somestrangeurl.shops.relevantdomain.net',
              'www.example.info']
#print values in the list
print(domainlist)

Output -

['m.google.com', 'm.docs.google.com', 'www.someisotericdomain.innersite.mall.co.uk', 'www.ouruniversity.department.mit.ac.us', 'www.somestrangeurl.shops.relevantdomain.net', 'www.example.info']

Now we have the website URLs in the list and want to extract only the domain name from each one. So, we apply a regular expression as follows:

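# Loop over the list and extract the domain from each URL.
# A regex would have to be enormous to catch every kind of domain suffix;
# this one handles the two-part suffixes co.uk and ac.us plus any one-part suffix.
# It's quick and doesn't need an input file listing the suffixes.
for url in domainlist:
    # findall returns the captured domain label for each match
    res = re.findall(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))', url)
    print(url, "|", res[0])

The final output is -

m.google.com | google
m.docs.google.com | google
www.someisotericdomain.innersite.mall.co.uk | mall
www.ouruniversity.department.mit.ac.us | mit
www.somestrangeurl.shops.relevantdomain.net | relevantdomain
www.example.info | example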
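The pattern above relies on a dot appearing before the domain label, so a bare name such as google.com produces no match and res[0] would raise an IndexError. The sketch below is one possible variation, not the method from this post: it accepts either the start of the string or a dot before the label, and it uses urllib.parse from the standard library to strip any scheme and path first. The function name extract_domain and the sample inputs are hypothetical, and the suffix list still only covers co.uk and ac.us.

```python
import re
from urllib.parse import urlparse

# Capture the label just before the suffix. The suffix is either one of the
# listed two-part suffixes or any single label at the end of the host name.
DOMAIN_RE = re.compile(r'(?:^|\.)([^.]+)\.(?:co\.uk|ac\.us|[^.]+)$')


def extract_domain(url):
    """Return the domain label of `url`, or None when it cannot be found."""
    # urlparse fills in the host only when a scheme or leading // is present.
    host = urlparse(url if "://" in url else "//" + url).hostname or ""
    match = DOMAIN_RE.search(host)
    return match.group(1) if match else None


samples = ["google.com", "m.docs.google.com",
           "https://www.example.info/about", "www.mall.co.uk", "localhost"]
results = {s: extract_domain(s) for s in samples}
for sample, domain in results.items():
    print(sample, "|", domain)
```

Returning None instead of indexing into an empty list lets the caller decide how to handle inputs that contain no recognisable domain. For production use, a maintained list of public suffixes would be more reliable than a hand-written alternation.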


Mukesh Singh
