Convert Unstructured Data to Structured Data Using Python



Unstructured data is data that does not follow any specific data model or format, and it can come in different forms such as text, images, audio, and video. Converting unstructured data to structured data is an important task in data analysis, as structured data is easier to analyse and extract insights from. Python provides various libraries and tools for converting unstructured data to structured data, making it more manageable and easier to analyse.

In this article, we will explore how to convert unstructured biometric attendance data, such as employee punch records, into a structured format using Python, allowing for more meaningful analysis and interpretation of the data.

There are several ways to convert unstructured data into structured data in Python. In this article, we will discuss the following two approaches:

  • Regular Expressions (Regex): This approach involves using regular expressions to extract structured data from unstructured text. Regex patterns can be defined to match specific patterns in the unstructured text and extract the relevant information.

  • Data Wrangling Libraries: Data wrangling libraries such as pandas can be used to clean and transform unstructured data into a structured format. These libraries provide functions to perform operations such as data cleaning, normalisation, and transformation.
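The two approaches can also be combined. The following is a minimal sketch, assuming a pandas Series of free-text log lines: the `Series.str.extract` method applies a regular expression to every row and turns each named group into a column. The log lines and the column names `employee_id` and `punch_time` are invented for illustration.

```python
import pandas as pd

# Hypothetical free-text log lines, invented for illustration
lines = pd.Series([
    "Employee 1234 punched in at 8:30 AM",
    "Employee 2345 punched in at 9:00 AM",
])

# Each named group in the pattern becomes a column in the result
pattern = r"Employee (?P<employee_id>\d+) punched in at (?P<punch_time>\d{1,2}:\d{2} [AP]M)"
df = lines.str.extract(pattern)

# Extracted values are strings, so convert the ID to an integer
df["employee_id"] = df["employee_id"].astype(int)

print(df)
```

Rows that do not match the pattern would come back as missing values, so it is worth checking for them before further analysis.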

Using Regular Expressions

Consider the code shown below.

Example

import re
import pandas as pd

# sample unstructured text data
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM
Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# define regular expression patterns to extract data
id_pattern = re.compile(r'Employee ID: (\d+)')
name_pattern = re.compile(r'Name: (.+)')
dept_pattern = re.compile(r'Department: (.+)')
time_pattern = re.compile(r'Punch Time: (.+)')

# create empty lists to store extracted data
ids = []
names = []
depts = []
times = []

# iterate through each line of the text data
for line in text_data.split('\n'):
    # check if the line matches any of the regular expression patterns
    if id_pattern.match(line):
        ids.append(id_pattern.match(line).group(1))
    elif name_pattern.match(line):
        names.append(name_pattern.match(line).group(1))
    elif dept_pattern.match(line):
        depts.append(dept_pattern.match(line).group(1))
    elif time_pattern.match(line):
        times.append(time_pattern.match(line).group(1))

# create a dataframe using the extracted data
data = {'Employee ID': ids, 'Name': names, 'Department': depts, 'Punch Time': times}
df = pd.DataFrame(data)

# print the dataframe
print(df)

Explanation

  • First, we define the unstructured text data as a multiline string.

  • Next, we define regular expression patterns to extract the relevant data from the text. We use the re module in Python for this.

  • We create empty lists to store the extracted data.

  • We iterate through each line of the text data and check if it matches any of the regular expression patterns. If it does, we extract the relevant data and append it to the corresponding list.

  • Finally, we create a Pandas dataframe using the extracted data and print it.

Output

   Employee ID        Name  Department  Punch Time
0         1234    John Doe       Sales     8:30 AM
1         2345  Jane Smith   Marketing     9:00 AM
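The line-by-line approach keeps four separate lists, which only line up if every record contains all four fields. As a variation, the sketch below matches one whole record at a time with named groups, so the fields of each employee stay together. It assumes the fields always appear in the order shown in the sample text; the group names (`employee_id`, `name`, `department`, `punch_time`) are chosen here for illustration.

```python
import re
import pandas as pd

# Same sample text as in the example above
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM
Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# One pattern for a whole record; \s* absorbs the newline between fields
record_pattern = re.compile(
    r"Employee ID: (?P<employee_id>\d+)\s*"
    r"Name: (?P<name>.+)\s*"
    r"Department: (?P<department>.+)\s*"
    r"Punch Time: (?P<punch_time>.+)"
)

# Each match yields one dictionary, which becomes one row
records = [m.groupdict() for m in record_pattern.finditer(text_data)]
df = pd.DataFrame(records)

print(df)
```

With this pattern, a record that is missing a field does not match at all and is skipped, rather than shifting later values into the wrong row.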

Using the Pandas Library

Suppose we have raw punch records, stored in a file named unstructured_data.csv, that look like this.

employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
1002,2022-01-01,17:03:45,Punch-Out
1001,2022-01-02,09:12:43,Punch-In
1001,2022-01-02,12:37:22,Punch-Out
1002,2022-01-02,08:55:10,Punch-In
1002,2022-01-02,17:00:15,Punch-Out

Example

import pandas as pd

# Load the raw punch records
# The file already has separate 'date' and 'time' columns, so no splitting is needed
unstructured_data = pd.read_csv("unstructured_data.csv")

# Pivot the table to get the 'Punch-In' and 'Punch-Out' time for each employee on each date
structured_data = unstructured_data.pivot(index=['employee_id', 'date'], columns='type', values='time').reset_index()

# Rename the pivoted columns
structured_data = structured_data.rename(columns={"Punch-In": "punch_in", "Punch-Out": "punch_out"})

# Calculate the time worked by subtracting 'punch_in' from 'punch_out'
worked = pd.to_timedelta(structured_data['punch_out']) - pd.to_timedelta(structured_data['punch_in'])

# Format the duration as HH:MM:SS for display (all durations here are under one day)
structured_data['hours_worked'] = worked.apply(lambda td: str(td).split()[-1])

# Print the structured data
print(structured_data)

Output

type  employee_id        date  punch_in  punch_out  hours_worked
0            1001  2022-01-01  09:01:22   12:35:10      03:33:48
1            1001  2022-01-02  09:12:43   12:37:22      03:24:39
2            1002  2022-01-01  08:58:30   17:03:45      08:05:15
3            1002  2022-01-02  08:55:10   17:00:15      08:05:05
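For further analysis, a numeric duration is often easier to sum or average than a formatted time. The sketch below is one possible variation, not part of the original example: it reads the first day of the sample records from an in-memory string instead of a file, builds a full timestamp for each punch, and stores the time worked as a decimal number of hours. The names `csv_text`, `records`, `shifts`, `timestamp` and `hours` are chosen here for illustration.

```python
import io
import pandas as pd

# First day of the sample records, held in memory instead of a file
csv_text = """employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
1002,2022-01-01,17:03:45,Punch-Out
"""

records = pd.read_csv(io.StringIO(csv_text))

# Combine the date and time strings into one timestamp per punch
records["timestamp"] = pd.to_datetime(records["date"] + " " + records["time"])

# One row per employee and date, one column per punch type
shifts = records.pivot(
    index=["employee_id", "date"], columns="type", values="timestamp"
).reset_index()

# Duration as a float number of hours
shifts["hours"] = (shifts["Punch-Out"] - shifts["Punch-In"]).dt.total_seconds() / 3600

print(shifts[["employee_id", "date", "hours"]])
```

Note that `pivot` raises an error if an employee has two punches of the same type on the same date, so real attendance data may need de-duplication first.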

Conclusion

In conclusion, unstructured data can be difficult to analyse and interpret. However, with the help of Python and approaches such as regular expressions and data wrangling libraries like pandas, it is possible to convert unstructured data into structured data.

Updated on: 2023-08-03T17:39:31+05:30
