Internship Report: Automating Mechanisms for
GDPR Compliance
Organization: VisionTech Systems PVT LTD
Mentor Name: Vijay Shukla
Akanksha Arpan Gevariya
202111004 202111030
[email protected] [email protected]

Abstract—This report presents the research and implementation of a data regulation policy with a specific focus on compliance with the General Data Protection Regulation (GDPR). The project began with an extensive examination of the GDPR framework and then moved to the design, implementation, and scheduling of the mechanisms needed to enforce GDPR compliance.

I. INTRODUCTION

This report provides an overview of our internship experience at VisionTech Systems PVT LTD. We conducted extensive research to understand the General Data Protection Regulation (GDPR), its applications, and its implications in real-life scenarios. This foundational work was crucial because our goal was to implement GDPR compliance for the company's data warehousing solution. The opportunity allowed us to work in a dynamic environment focused on data privacy and regulatory adherence, presenting a unique blend of technical challenges and professional growth.

Aligning the data warehousing solution with GDPR requirements was crucial for the organization to ensure compliance with international data protection laws and to safeguard personal data. Throughout our internship, we engaged in several critical tasks, including developing, implementing, and scheduling scripts to manage data retention, identifying and removing tables containing personal data elements, and ensuring adherence to GDPR guidelines.

Working on these projects not only sharpened our technical skills in SQL, Python, and AWS but also provided us with a deeper understanding of GDPR and its implications for data management. This experience was professionally enriching and offered valuable insights into best practices for data protection and compliance. This report details our experiences, the methodologies employed, and the significant impacts of the projects undertaken during our internship.

II. TASKS ASSIGNED

During our internship, we undertook several critical tasks aimed at ensuring GDPR compliance within our data warehousing solution.

• One of the primary tasks involved the development, implementation, and scheduling of a robust script designed to automatically drop tables that exceeded the 28-day retention period within our clusters. This task required meticulous attention to scheduling and automation to ensure that outdated data was efficiently managed and removed at regular intervals, thereby optimizing storage and maintaining data hygiene.
• In addition to managing data retention, we also focused on identifying and dropping tables that contained personal data elements to ensure alignment with GDPR requirements. This involved a comprehensive review of the data schemas and the implementation of systematic procedures to protect and handle personal data appropriately, thereby mitigating potential legal risks and ensuring compliance with stringent data privacy regulations.
• To further enhance GDPR compliance, we developed, implemented, and scheduled a script to verify whether the tables containing personal data elements included a time-series column, such as a snapshot day or the latest updated date. This verification was crucial for determining the row creation date, which is essential for implementing a 2.5-year data retention policy. The script's automation ensured consistent checks and updates, facilitating long-term data management aligned with regulatory requirements.
• Moreover, we established a robust mechanism to identify and drop tables that were created without adhering to the established GDPR rules within our data warehouse. This task involved implementing checks for compliance with table naming conventions, the presence of required columns, and other GDPR-related criteria. By ensuring that all tables met these compliance standards, the integrity and regulatory alignment of the data warehouse were maintained, thus supporting the organization's overall data governance framework.

III. DEVELOPMENT APPROACH

The overall approach for the first task involved a systematic process to manage data retention efficiently. Initially, we pulled
data from the designated cluster based on table creation dates
and stored this information in an S3 bucket, utilizing AWS’s
cloud storage capabilities. We developed an SQL query to ac-
curately retrieve key details such as schema name, table name,
and creation date from the database. Subsequently, we created
an S3 bucket to temporarily store this data. To automate the
data management process, we developed a job to unload data
from the database tables and transfer it to the S3 bucket,
configuring the job to run at specified intervals for regular
data updates. Additionally, we created a Python function to
determine the age of each table by comparing its creation date
with the current date. If a table’s age exceeded 28 days, the
function automatically dropped the table from the database,
ensuring compliance with our data retention policies.
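The age check described above can be sketched as a small Python function. The report does not include the actual script, so the function name, table names, and date format below are illustrative; the sketch assumes the creation dates have already been unloaded as text (for example, from the S3 extract):

```python
from datetime import date, datetime

RETENTION_DAYS = 28  # retention period described above

def tables_to_drop(rows, today=None):
    """Given (schema, table, creation_date) rows, return DROP statements
    for every table whose age exceeds the retention period."""
    today = today or date.today()
    statements = []
    for schema, table, created in rows:
        created_day = datetime.strptime(created, "%Y-%m-%d").date()
        age_days = (today - created_day).days  # table age in days
        if age_days > RETENTION_DAYS:
            statements.append(f'DROP TABLE "{schema}"."{table}";')
    return statements

# Example: a 40-day-old table is flagged, a 5-day-old one is kept.
rows = [("stage", "tmp_orders", "2024-01-01"),
        ("stage", "daily_load", "2024-02-05")]
print(tables_to_drop(rows, today=date(2024, 2, 10)))
```

In the scheduled job, the returned statements would be executed against the cluster rather than printed; passing `today` explicitly also makes the function easy to test.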
For the second task, we began by extracting data from the
USER and EMP schemas and then compared the columns
with a file containing details on impacted columns. This
comparison helped us identify tables containing personal data
elements that needed attention. To streamline this process,
we developed a configuration table with key parameters,
including schema name, table name, table owner, creation date,
personal identifiable information (PII) columns, and various
status indicators such as the presence of customer ID, OD3
status, communication sent, table age, and whether the table
had been dropped. This configuration table served as a critical
tool for tracking and managing GDPR compliance, ensuring
accurate and efficient handling of data impacted by privacy
regulations.

Fig. 1. Flowchart
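The column comparison behind the configuration table might look like the following sketch. The impacted-column names and the config fields are hypothetical placeholders standing in for the company's actual impacted-columns file and schema:

```python
# Hypothetical impacted-column names taken from the reference file
IMPACTED_COLUMNS = {"email", "phone_number", "customer_name"}

def build_config_rows(table_columns, owner="etl_user"):
    """table_columns maps (schema, table) -> list of column names.
    Emits one configuration-table row per table that contains PII columns."""
    rows = []
    for (schema, table), cols in table_columns.items():
        pii = sorted(set(cols) & IMPACTED_COLUMNS)  # columns matching the file
        if pii:
            rows.append({
                "schema_name": schema,
                "table_name": table,
                "table_owner": owner,
                "pii_columns": ",".join(pii),
                "has_customer_id": "customer_id" in cols,
                "dropped": False,  # status indicator, updated later
            })
    return rows

tables = {("user", "accounts"): ["id", "email", "customer_id"],
          ("emp", "payroll"): ["id", "salary"]}
print(build_config_rows(tables))
```

Only `user.accounts` produces a row here, since `emp.payroll` contains none of the impacted columns; in practice the rows would be inserted into the configuration table for tracking.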
For the third task, we adopted a structured approach to enhance GDPR compliance by focusing on time-series columns in impacted tables. We began by adding a time-series indicator column for these tables, which was set to "Yes" if a time-series column was present and "No" if it was absent. To ensure accurate tracking, we developed a script to verify each table in the GDPR-impacted list against svv_columns, setting the time-series indicator accordingly. Additionally, we introduced another column to record the names of suspected time-series columns for tables marked "Yes," while this column remained null for tables marked "No." Finally, to streamline the process, we automated the task by creating jobs across our databases that routinely check for the presence of time-series columns, update the status, and list suspected columns where applicable.
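A minimal sketch of this verification step, assuming the candidate column names below (the report does not list the exact naming conventions used to spot time-series columns):

```python
# Assumed time-series column names; the real list would come from the
# team's conventions (e.g. snapshot day, latest updated date).
TIME_SERIES_CANDIDATES = {"snapshot_day", "last_updated_date", "updated_at"}

def flag_time_series(table_columns):
    """table_columns maps (schema, table) -> column names, as reported by a
    catalog view such as svv_columns. Returns the two fields described
    above: a Yes/No indicator and the suspected column names (or None)."""
    result = {}
    for key, cols in table_columns.items():
        suspects = sorted(set(cols) & TIME_SERIES_CANDIDATES)
        result[key] = {
            "time_series": "Yes" if suspects else "No",
            "suspected_columns": ",".join(suspects) or None,
        }
    return result

tables = {("user", "accounts"): ["id", "snapshot_day"],
          ("emp", "payroll"): ["id", "salary"]}
print(flag_time_series(tables))
```

The scheduled jobs would run this check per database and write the two fields back to the configuration table.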
For the fourth task, we took a comprehensive approach to ensure GDPR compliance by addressing table naming conventions and the presence of creation-date columns. We began by developing an SQL query to identify tables across five schemas that either did not adhere to the prescribed naming guidelines or lacked a creation-date column. This required a join between svv_columns, which contains schema names, table names, and column names, and admin_schema.d_aim_tables, which includes the creation date but not column names. To automate the process, we created a Python script that identifies and lists tables failing to meet GDPR standards. The script was designed to correct tables with naming-guideline issues and to drop those missing a creation date, ensuring that all tables in the data warehouse complied with the established rules.
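The audit step could be sketched as follows. The report does not state the actual naming guideline, so the regex below is purely a placeholder, as are the field names:

```python
import re

# Placeholder naming rule: a one-letter prefix plus snake_case.
# The real guideline would replace this pattern.
NAME_PATTERN = re.compile(r"^[fd]_[a-z0-9_]+$")

def audit_tables(tables):
    """tables: list of dicts with 'table_name' and 'creation_date'.
    Returns (naming violations, tables missing a creation date)."""
    bad_names, missing_creation = [], []
    for t in tables:
        if not NAME_PATTERN.match(t["table_name"]):
            bad_names.append(t["table_name"])       # candidate for renaming
        if t.get("creation_date") is None:
            missing_creation.append(t["table_name"])  # candidate for dropping
    return bad_names, missing_creation

tables = [{"table_name": "f_orders", "creation_date": "2024-01-01"},
          {"table_name": "TempStuff", "creation_date": None}]
print(audit_tables(tables))
```

In the actual script, the input would come from the join of svv_columns and the admin tables described above, and the two lists would drive the rename and drop actions.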
IV. INSIGHTS AND ACQUIRED SKILLS

During the implementation process, we gained valuable knowledge and skills, including:

• Proficiency with SQL Client Tools: During our internship, we gained significant experience with SQL client tools, particularly MySQL and DBeaver. We learned to effectively connect MySQL and DBeaver to various databases, facilitating efficient data exploration and query execution. This experience enhanced our ability to navigate and manage database systems, execute complex queries, and visualize data interactively.
• AWS S3 Exploration: We successfully created and managed an Amazon S3 bucket, leveraging AWS's object storage service. We acquired a solid understanding of bucket configurations, storage management, and data retrieval within the AWS ecosystem. This hands-on experience with S3 contributed to our ability to handle large volumes of data and implement scalable storage solutions.
• Enhanced SQL Skills: We refined our SQL skills, focusing on constructing and executing complex queries. We improved our ability to retrieve specific data from large datasets, work with system catalog tables, and optimize query performance. This deepened understanding of SQL was crucial for managing and manipulating data effectively.
• Advanced Python Programming: Our proficiency in Python was significantly strengthened through scripting and automation tasks. We developed efficient scripts for data processing and task automation, gaining practical experience in writing and debugging Python code. This enhanced our ability to handle data manipulation and workflow automation tasks effectively.
• Familiarity with Pandas DataFrame: We explored and utilized the Pandas library, particularly its DataFrame functionality, for data processing tasks. This experience allowed us to perform complex data transformations, analyses, and visualizations with ease. Our familiarity with Pandas enhanced our ability to manage and analyze large datasets efficiently.
• Understanding GDPR Compliance: We gained a comprehensive understanding of the General Data Protection Regulation (GDPR), including its requirements and implications for data management and protection. This knowledge was crucial for ensuring that our data handling practices adhered to regulatory standards and protected personal data.
V. SIGNIFICANT OUTCOMES
We achieved the following impacts:
Data Privacy Enhancement: By systematically identifying
and dropping tables that contained personal data elements,
we significantly enhanced the organization’s data privacy
practices. This process ensured that sensitive information was
managed according to GDPR regulations, which mandate the
protection of personal data. The proactive removal of these
tables mitigated the risk of unauthorized access and poten-
tial data breaches, thereby safeguarding user privacy. This
approach demonstrated a strong commitment to compliance
and responsible data management, reassuring stakeholders and
users about the security of their personal information.
Operational Efficiency: The development and implementa-
tion of scripts for automating data management tasks notably
improved operational efficiency. By creating automated pro-
cesses to drop tables that exceeded the predefined retention
period, we optimized database storage utilization. This not
only helped in maintaining a cleaner and more manageable
database but also enhanced overall database performance. The
reduction in redundant and outdated data led to faster query
responses and reduced costs associated with data storage and
management. The streamlined approach contributed to more
efficient operations and better resource allocation, aligning
with the company’s goals of cost-effectiveness and perfor-
mance optimization.
Risk Mitigation: The approach of dropping tables based on their age played a crucial role in mitigating legal and financial risks associated with data retention and GDPR compliance. By ensuring that tables were removed when they were no longer needed or when they exceeded the retention period, the company minimized the risk of non-compliance with data protection regulations. This proactive measure helped avoid potential legal penalties and financial repercussions related to data management failures. Additionally, it reinforced the company's commitment to data privacy and security, enhancing its reputation and trustworthiness among users and regulatory bodies.

VI. ACKNOWLEDGMENT

We extend our heartfelt gratitude to the Indian Institute of Information Technology Vadodara-International Campus Diu for their invaluable support throughout our internship journey. We are immensely thankful to VisionTech Systems Pvt Ltd for providing us with the exceptional opportunity to intern with the company. We are deeply indebted to our mentor Vijay Shukla for their warm hospitality and mentorship during our internship. Their support and insights have significantly contributed to our learning experience. This internship has been an enriching learning experience, and we are truly grateful for all the support and guidance we received.