Exploring the Basics of Informatica Data Profiling
Last Updated: 24 Apr, 2025
Data Profiling is an important step in data quality assurance. By profiling your data, you can identify any data quality issues that need to be addressed. This can help to ensure that your data is accurate, complete, and consistent, which is essential for effective data analysis and decision-making.
Data profiling is the process of gathering statistics and structural information about a dataset. This information can be used to evaluate the quality of the data, identify potential issues, and determine its suitability for different purposes. Reviewing profile results helps in identifying deviations in the data and uncovering its vulnerabilities.
Why Data Profiling?
Profiling can be used for a variety of purposes, including:
- Data quality assessment: Profiling can be used to identify data quality issues, such as missing values, invalid data types, and duplicate records (a short pandas sketch of such checks follows this list).
- Data exploration: Profiling can be used to gain insight into the data, such as its distribution, patterns, and relationships.
- Data Integration: It helps in identifying data conflicts and inconsistencies across datasets. This is important for creating a unified and accurate view of the data.
- Anomaly detection: Profiling is an essential tool for detecting anomalies and outliers in data. These anomalies can indicate errors, fraud, or unusual behavior, making them valuable for risk assessment and security purposes.
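As a minimal illustration of the first two points, the pandas sketch below runs a few of these checks; the file name sales.csv and the dataframe name df are placeholders, not part of Informatica:
Python3
import pandas as pd

# Hypothetical input file; any tabular dataset works here
df = pd.read_csv("sales.csv")

# Data quality assessment: missing values and duplicate records
print(df.isnull().sum())       # null count per column
print(df.duplicated().sum())   # count of fully duplicated rows

# Data exploration: distribution of the numeric columns
print(df.describe())           # count, mean, min, max, quartiles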
Challenges in Database Metadata and Documentation
Not all database metadata and documentation precisely represent the data source, and in practice documentation is often missing altogether. Common challenges include:
- Inaccurate descriptions: The documented descriptions of data elements may differ from how those elements are actually used, and column names may not match the data they contain. Column Profiling, by contrast, determines a column's characteristics by examining the data it actually holds, as the sketch after this list illustrates.
- Outdated Documentation: As databases evolve over time, documentation often lags behind. Changes to data structures, column names, or data types may not be adequately reflected in the documentation, leading to confusion.
- Complex Data Models: Databases with complex data models, including multiple relationships, hierarchies, and inheritance structures, can be challenging to document. Ensuring that relationships are correctly documented is crucial for data understanding.
- Documentation Ownership: Determining who is responsible for creating and maintaining database documentation can be unclear. A lack of ownership can lead to gaps in documentation and inconsistencies.
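As a small illustration of the column-profiling remedy mentioned above, the sketch below infers what a column actually contains instead of trusting its name or documentation; the column name and values here are hypothetical:
Python3
import pandas as pd

# A column documented as free text whose values are actually dates
# (column name and values are hypothetical)
col = pd.Series(["2023-01-05", "2023-02-11", "2023-03-19"], name="status_note")

# Column profiling: derive the characteristics from the data itself
parsed = pd.to_datetime(col, errors="coerce")
print(parsed.notnull().mean())  # 1.0 -> every value parses as a date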
Creating a Data Profile in Informatica
You should have an appropriate IICS license to access Informatica Data Profiling; alternatively, you can use the free version of Informatica available through Oracle E-Delivery, with the license key provided on the Oracle license codes website.
- Select Data Profiling in My Services Page of IICS (Informatica Intelligent Cloud Services).
- Click on New.
- From the dialog, select Data Profiling Task.
- In the profile definition tab, enter the profile name, description, and location to save in the Asset Details Section.
- Select the Connection and Source Object information in the Source Details section.
- Tune the profile settings to run the profile on all rows or on a limited number of rows, and choose whether to enable or disable drilldown.
- Select the columns for which profiling is required.
- Click Run at the top right of the IICS page.
[Figure: Informatica Data Profile Definition]
Utilize the “Monitor” feature to check the outcomes of a profile or job run. Users can opt to view results from various runs by selecting “Choose Profile Run” from the menu.
Note: The Rules tab is used to add Data Quality assets to profiling. Additionally, you can configure scheduling, runtime settings, and advanced profiling options within the “Schedule” tab. Please go through the Informatica documentation for detailed explanations of these tabs, as they are not covered in this article.
Profile Results — Summary View
One of the profile result views is the Summary View, which offers a consolidated perspective of profiling outcomes, encompassing patterns and data types.
[Figure: Summary View of Data Profile Result]
Below are the columns that can be found in the profile results:
Columns, Value Distribution, % Null, # Null, % Distinct, # Distinct, % Non-Distinct, # Non-Distinct, # Patterns, % Top Patterns, Minimum Length, Maximum Length, Minimum Value, Maximum Value, # Blank and % Blank.
Profile Results — Detailed View
The second and final way to view profile results is the Detailed View, an in-depth breakdown of a specific column that lists the unique values in the column along with their frequencies, percentage of usage, and length. It also includes a value frequency graph, datatypes, patterns, the five most frequent values, and more.
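A rough pandas approximation of part of this view, assuming a dataframe df loaded as in the earlier sketch and a hypothetical city column, might be:
Python3
# Frequencies, percentages and lengths of the unique values in one column,
# similar in spirit to the Detailed View ("city" is a placeholder column)
counts = df["city"].value_counts()
detail = pd.DataFrame({
    "value": counts.index,
    "frequency": counts.values,
    "percent": counts.values / len(df) * 100,
    "length": [len(str(v)) for v in counts.index],
})
print(detail.head(5))  # the five most frequent values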
[Figure: Snippet of a section in Detailed View of Results]
---
Python Clone of Data Profiling for Summary Results
We will try to achieve broadly similar results with a Python program, though there are limitations. This data profiling tool is intended for small-scale use, and the supported input formats are Excel and CSV.
We will use Pandas to import the files and create a dataframe from them; it is also used to compute the minimum, maximum, and average values and to build the profile results. Regular expressions (regex) are used to identify patterns in a column; the patterns below can recognize URLs, credit card numbers, IP addresses, emails, Aadhaar card numbers, and more. Additionally, we will use the math module for rounding the average lengths.
The code is simple and can be used for small-scale analysis of metadata. First, we import the dataset and check its datatypes, so we know which operations apply to each column. Then, for each column, we find the number of null values and the null percentage; the number and percentage of unique values; the number and percentage of duplicates; the number of pattern matches; the minimum, maximum, and average value for numeric types; the minimum, maximum, and average length for string (object) types; and, in the last column, the datatype. Finally, we add all this information to a new dataframe to create a profile of the dataset.
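The function shown later relies on a module-level dataframe named data, which is not shown being created; a minimal, assumed setup (the file name is a placeholder) might look like this:
Python3
import math
import re

import pandas as pd

# Hypothetical file name; the tool accepts CSV and Excel input
data = pd.read_csv("dataset.csv")   # or pd.read_excel("dataset.xlsx")
print(data.dtypes)                  # inspect the column datatypes first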
Below are the regex assignments used to identify the patterns in a column:
Python3
# URLs beginning with http:// or https://
pattern_url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Dates in YYYY-MM-DD or DD/MM/YYYY form
pattern_date = r'\d{4}-\d{2}-\d{2}|[0-3]?[0-9]/[0-3]?[0-9]/\d{4}'
# 16-digit credit card numbers, optionally hyphen-separated in groups of four
pattern_credit_card = r'\b(?:\d{4}-?){3}\d{4}\b'
# IPv4 dotted-quad addresses or full eight-group IPv6 addresses
pattern_ip = r'\b(?:\d{1,3}\.){3}\d{1,3}\b|\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b'
# Common email addresses
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
# Indian mobile numbers, with an optional +91, 91 or 0 prefix
pattern_phone = r'(?:(?:\+|0{0,2})(91-?)?)[7-9]\d{9}'
# Aadhaar numbers: three space-separated groups of four digits
pattern_aadhar = r'\b\d{4}\s\d{4}\s\d{4}\b'
# US Social Security numbers (###-##-####)
pattern_ssn = r'\b\d{3}-\d{2}-\d{4}\b'

pattern_list = [pattern_url, pattern_date, pattern_credit_card, pattern_ip,
                pattern_email, pattern_phone, pattern_aadhar, pattern_ssn]
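As a quick sanity check, any of these patterns can be tried against a sample value, for example:
Python3
import re

sample = "Contact: john.doe@example.com"
match = re.search(pattern_email, sample)
print(match.group() if match else "no pattern found")  # john.doe@example.com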
Below is the code for the function that gathers all the column properties:
Python3
def data_profile(i, col_name, pattern_list):
    # Relies on the module-level dataframe `data` created earlier
    column = data[data.columns[i]]

    # 1 Total number of records
    Total_Records = len(column)

    # 2, 3 Number and percentage of null values in the column
    Number_Null = column.isnull().sum()
    Percent_Null = Number_Null / Total_Records * 100

    # 4, 5 Number and percentage of unique (distinct) values
    Num_Unique = column.nunique()
    Percent_Unique = Num_Unique / Total_Records * 100

    # 6, 7 Number and percentage of duplicate (non-distinct) values
    Non_Distinct = Total_Records - Num_Unique
    Percent_non_distinct = Non_Distinct / Total_Records * 100

    # 8, 9, 10 Minimum, maximum and average for integer or float columns
    if column.dtype in ["int64", "int32", "int16",
                        "float64", "float32", "float16"]:
        minimum = column.min()
        maximum = column.max()
        average = column.mean()
    else:
        minimum = "NA"
        maximum = "NA"
        average = "NA"

    # 11 Number of values matching one of the known patterns. Each value
    # is tested against the patterns in order; the first match counts and
    # the remaining patterns are skipped for that value.
    as_str = column.astype(str)
    Num_Pattern = 0
    for value in as_str:
        for pattern in pattern_list:
            if re.search(pattern, value):
                Num_Pattern += 1
                break

    # 12, 13, 14 Minimum, maximum and average length for string columns
    if column.dtype == 'O':
        min_len = len(min(as_str, key=len))
        max_len = len(max(as_str, key=len))
        avg_word = math.ceil(as_str.apply(len).mean())
    else:
        min_len = "NA"
        max_len = "NA"
        avg_word = "NA"

    # 15 Datatype of the column ('O' is pandas' object dtype)
    dtype = column.dtype
    if dtype == 'O':
        data_type = "String"
    elif dtype == 'bool':
        data_type = "boolean"
    else:
        data_type = str(dtype)

    return [col_name, Number_Null, Percent_Null, Num_Unique, Percent_Unique,
            Non_Distinct, Percent_non_distinct, Num_Pattern, minimum, maximum,
            average, min_len, max_len, avg_word, data_type]
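The final step, collecting each column's properties into a new profile dataframe, could look like the sketch below; the column headings are my own naming, chosen to mirror the Summary View fields:
Python3
# Build the profile: one row of column properties per column in the dataset
profile_rows = [data_profile(i, col, pattern_list)
                for i, col in enumerate(data.columns)]

profile = pd.DataFrame(profile_rows, columns=[
    "Column", "# Null", "% Null", "# Distinct", "% Distinct",
    "# Non-Distinct", "% Non-Distinct", "# Patterns",
    "Minimum Value", "Maximum Value", "Average",
    "Minimum Length", "Maximum Length", "Average Length", "Datatype",
])
print(profile)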
Below is a snapshot of the result:
[Figure: Profile Result for a simple dataset with around 2,000 records]
---
Informatica stands out as a top choice in the market for data profiling and data quality, making it a prominent tool for organizations, whether they are addressing day-to-day operational needs or analyzing extensive datasets.