Python Interview Questions
List:
Mutable: Elements can be changed after creation.
Memory Usage: Consumes more memory.
Performance: Slower iteration compared to tuples but better for insertion and
deletion operations.
Methods: Offers various built-in methods for manipulation.
Tuple:
Immutable: Elements cannot be changed after creation.
Memory Usage: Consumes less memory.
Performance: Faster iteration compared to lists but lacks the flexibility of lists.
Methods: Limited built-in methods.
Mutability:
List: Mutable (modifiable).
Tuple: Immutable (non-modifiable).
Set: Mutable, but elements inside must be immutable.
Dictionary: Mutable; keys are immutable, but values can change.
Order:
List: Maintains order of elements.
Tuple: Maintains order of elements.
Set: No guaranteed order.
Dictionary: As of Python 3.7+, insertion order is preserved.
Uniqueness:
List: Allows duplicates.
Tuple: Allows duplicates.
Set: Only unique elements.
Dictionary: Unique keys, values can be duplicated.
Data Structure:
List: Ordered collection.
Tuple: Ordered collection.
Set: Unordered collection.
Dictionary: Collection of key-value pairs.
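A short illustrative snippet (not part of the original notes) that summarizes these differences:

my_list = [1, 2, 2, 3]       # ordered, mutable, allows duplicates
my_tuple = (1, 2, 2, 3)      # ordered, immutable, allows duplicates
my_set = {1, 2, 3}           # unordered, mutable, unique elements only
my_dict = {"a": 1, "b": 2}   # insertion-ordered (Python 3.7+), unique keys

my_list[0] = 10              # OK: lists are mutable
# my_tuple[0] = 10           # TypeError: tuples are immutable
my_set.add(4)                # OK: sets are mutable
my_dict["c"] = 3             # OK: dictionaries are mutable
print(my_list, my_tuple, my_set, my_dict)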
Dictionary
Unlike all other collection types, dictionaries strictly contain key-value pairs.
In Python versions < 3.7: a dictionary is an unordered collection of data.
In Python 3.1: a new type of dictionary called OrderedDict was introduced in the collections module; it is similar to a regular dictionary, the difference being that an OrderedDict remembers insertion order (as the name suggests).
Since Python 3.7: the built-in dictionary is an ordered collection of key-value pairs, and the order is guaranteed to be the insertion order, i.e. the order in which the items were inserted.
Syntax
Code:
dict1 = {"key1": "value1", "key2": "value2"}
dict2 = {}
dict3 = dict({1: "one", 2: "two", 3: "three"})
print(dict1)
print(dict2)
print(dict3)
Output:
{'key1': 'value1', 'key2': 'value2'}
{}
{1: 'one', 2: 'two', 3: 'three'}
Indexing
Code:
dict1 = {"one": 1, "two": 2, "three": 3}
print(dict1.keys())
print(dict1.values())
print(dict1['two'])
Output:
dict_keys(['one', 'two', 'three'])
dict_values([1, 2, 3])
2
Adding New Element
Code:
dict1 = {"India": "IN", "Russia": "RU", "Australia": "AU"}
dict1.update({"Canada": "CA"})
print(dict1)
dict1.pop("Australia")
print(dict1)
Output:
{'India': 'IN', 'Russia': 'RU', 'Australia': 'AU', 'Canada': 'CA'}
{'India': 'IN', 'Russia': 'RU', 'Canada': 'CA'}
Deleting Element
Code:
dict1 = {"India": "IN", "Russia": "RU", "Australia": "AU"}
dict1.pop('Russia')
print(dict1)
Output:
{'India': 'IN', 'Australia': 'AU'}
Sorting Elements
Code:
dict1 = {"India": "IN", "Russia": "RU", "Australia": "AU"}
print(sorted(dict1))
Output:
['Australia', 'India', 'Russia']
Searching Elements
Code:
dict1 = {"India": "IN", "Russia": "RU", "Australia": "AU"}
print(dict1['Australia'])
Output:
AU
1. Library: apache-airflow
The apache-airflow library is a widely used scheduler and monitoring tool for executing and managing tasks and batch jobs, and for orchestrating data pipelines. Data engineers can
use it to manage tasks and dependencies within a data workflow that can handle a
large number of tasks. It provides a simple UI and API that includes scripting for
failure handling and error recovery, all wrapped in a high-performance framework. It
allows one to define complex workflows as directed acyclic graphs (DAGs) of tasks,
where the edges between tasks represent dependencies and the nodes represent
the actual tasks that are to be executed.
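As an illustration, here is a minimal DAG sketch in the Airflow 2.x style (the DAG id, task names, and schedule are placeholders, not part of the original text):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

def load():
    print("loading data...")

with DAG(
    dag_id="example_pipeline",        # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task         # the edge encodes the dependency between tasks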
2. Library: kafka-python
Apache Kafka is a popular distributed messaging platform used for building real-time data pipelines and streaming applications; it stores data and replicates it across multiple servers, providing high availability and durability in case of server failures. The kafka-python library provides a high-level API for producing and consuming
messages from Apache Kafka, as well as lower-level APIs for more advanced use
cases such as asynchronous processing that facilitates sending and receiving
messages without blocking the main thread of execution.
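A hedged sketch of producing and consuming messages with kafka-python (the broker address and topic name are placeholders):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello from the pipeline")   # asynchronous, non-blocking send
producer.flush()

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)   # raw bytes of each consumed record
    break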
3. Library: pandas
Pandas is one of the most popular Python libraries for working with small- and
medium-sized datasets. Built on top of NumPy, Pandas (abbreviation for Python
Data Analysis Library) is ideal for data analysis and data manipulation. It’s
considered a must-have given its large collection of powerful features such as data
merging, handling missing data, data exploration, and overall efficiency. Data
engineers use it to quickly read data from various sources, perform analysis and
transformation operations on the data, and output the results in various formats.
Pandas is also frequently paired with other Python libraries for data engineering, such as scikit-learn for data analysis and machine learning tasks.
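A small sketch of a typical pandas workflow (file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("raw_events.csv")             # read from a source
df = df.dropna(subset=["user_id"])             # basic cleaning
summary = df.groupby("country")["amount"].sum().reset_index()
summary.to_parquet("daily_summary.parquet")    # write results in another format (requires pyarrow or fastparquet)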
4. Library: pyarrow
Developed in part by the creator of pandas (Wes McKinney) to address some of the scalability issues of pandas, Apache Arrow uses a now-popular columnar in-memory format for better performance and flexibility. The PyArrow library provides a
Python API for the functionality provided by the Arrow libraries, along with tools for
Arrow integration and interoperability with pandas, NumPy, and other software in the
Python ecosystem. For data engineers, pyarrow provides a scalable library to easily
integrate data from multiple sources into a single, unified, and large dataset for easy
manipulation and analysis.
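An illustrative PyArrow sketch: build an Arrow table, round-trip it through Parquet, and convert to pandas (the file name is a placeholder):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
pq.write_table(table, "values.parquet")     # columnar on-disk format

round_tripped = pq.read_table("values.parquet")
df = round_tripped.to_pandas()              # interoperability with pandas
print(df)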
CLOUD LIBRARIES
5. Library: boto3
AWS is one of the most popular cloud service providers so there’s no surprise that
boto3 is on top of the list. Boto3 is a Software Development Kit (SDK) library for
programmers to write software that makes use of a long list of Amazon services
including data engineer favorites such as Glue, EC2, RDS, S3, Kinesis, Redshift,
and Athena. In addition to performing common tasks such as uploading and
downloading data, and launching and managing EC2 instances, data engineers can
leverage Boto3 to programmatically access and manage many AWS services, which can be used to build data pipelines and automate data workflow tasks.
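A minimal boto3 sketch (bucket and object names are placeholders; credentials are assumed to come from the standard AWS configuration):

import boto3

s3 = boto3.client("s3")
s3.upload_file("daily_summary.parquet", "my-data-bucket", "curated/daily_summary.parquet")

response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="curated/")
for obj in response.get("Contents", []):
    print(obj["Key"])    # list the objects under the prefix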
6. Library: Azure-core
From another of the top 5 cloud providers, Azure Core is a Python library and API for
interacting with the Azure cloud services and is used by data engineers for accessing
resources and automating engineering tasks. Common tasks include submitting and
monitoring batch jobs, accessing databases, data containers, and data lakes, and
generally managing resources such as virtual machines and containers. A related
library for Python is azure-storage-blob, a library built to manage, retrieve, and store
large amounts of unstructured data such as images, audio, video, or text.
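A hedged azure-storage-blob sketch (the connection string, container, and blob names are placeholders):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="raw-data", blob="events/2024-01-01.json")

with open("events.json", "rb") as data:
    blob.upload_blob(data, overwrite=True)   # store unstructured data in the container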
7. Library: SQLAlchemy
SQLAlchemy is the Python SQL toolkit that provides a high-level interface for
interacting with databases. It allows data engineers to query data from a database
using SQL-like statements and perform common operations such as inserting,
updating, and deleting data from a database. SQLAlchemy also provides support for
object-relational mapping (ORM), which allows data engineers to define the structure
of their database tables as Python classes and map those classes to the actual
database tables. SQLAlchemy provides a full suite of well-known enterprise-level
persistence patterns, designed for efficient and high-performing database access
such as connection pooling and connection reuse.
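A minimal SQLAlchemy sketch in the 2.x style, using an in-memory SQLite database so it runs anywhere (table and data are illustrative):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")   # the engine manages connection pooling

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text("INSERT INTO users (name) VALUES (:name)"), {"name": "Alice"})
    rows = conn.execute(text("SELECT id, name FROM users")).fetchall()
    print(rows)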
8. Library: pyspark
Apache Spark is one of the most popular open-source data engineering platforms
thanks to its scalable design that lets it process large amounts of data quickly, which makes it ideal for tasks that require real-time processing or big data analysis
including ETL, machine learning, and stream processing. It can also easily integrate
with other platforms, such as Hadoop and other big data platforms, making it easier
for data engineers to work with a variety of data sources and technologies. The
PySpark library allows data engineers to work with a wide range of data sources and
formats, including structured data, unstructured data, and streaming data.
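An illustrative PySpark sketch (the CSV path and column names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
summary = df.groupBy("country").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()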
UTILITY LIBRARIES
9. Library: python-dateutil
The need to manipulate date and time is ubiquitous in Python, and often the built-in
datetime module doesn’t suffice. The dateutil module is a popular extension to the
standard datetime module. If you’re seeking to implement timezones, calculate time
deltas, or want more powerful generic parsing, then this library is a good choice.
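A short illustrative python-dateutil sketch covering parsing, timezones, and relative deltas:

from dateutil import parser, tz
from dateutil.relativedelta import relativedelta

dt = parser.parse("March 5, 2024 10:30 AM")            # flexible, generic parsing
paris = dt.replace(tzinfo=tz.gettz("Europe/Paris"))    # attach a timezone
utc = paris.astimezone(tz.tzutc())                     # convert to UTC
next_month = dt + relativedelta(months=+1)             # calendar-aware time delta
print(dt, utc, next_month)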
Python Decorators:
Python decorators are a powerful aspect of the language that allow you to modify the
behavior of functions or methods. They are functions themselves that wrap around
another function, enabling you to add functionality to existing code without modifying
it. Let’s dive into decorators with a simple example.
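A minimal reconstruction of the example described below (the names my_decorator, wrapper, and say_hello come from the explanation; the printed messages are assumptions):

def my_decorator(func):
    def wrapper():
        print("Before the function is called.")
        func()
        print("After the function is called.")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()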
Explanation:
my_decorator is a function that takes another function (func) as an argument.
wrapper is a nested function within my_decorator that adds extra functionality
before and after the original function (func) is called.
@my_decorator is used above the say_hello function declaration, indicating
that say_hello will be passed to my_decorator as an argument.
When say_hello is called, it actually executes the wrapper function created by
the decorator my_decorator. This allows for the additional behavior to be
added before and after the original say_hello function execution.
Conclusion:
Decorators are a versatile tool in Python that enable you to modify the behavior of
functions without changing their actual code. They are widely used in frameworks
like Flask and Django for tasks such as authentication, logging, and more.
Understanding decorators can greatly enhance your ability to write clean, reusable,
and efficient code.
Generators in Python
Introduction
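A minimal sketch of the example referred to below (the list name nums comes from the text; the values are illustrative):

nums = [1, 2, 3, 4, 5]
for num in nums:
    print(num)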
In the example above, we created a list called nums and then iterated over it using a for loop.
So now the question is: how do we know whether an object is an iterable or not?
Answer: if an object has an __iter__() method (also called a dunder or magic
method), then it is an iterable, and we can check whether an object has an __iter__()
method using the built-in dir() function.
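For instance, an illustrative check:

nums = [1, 2, 3]
print("__iter__" in dir(nums))   # True: lists are iterable
print("__iter__" in dir(42))     # False: a plain integer is not iterable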
Now it's clear that any object which has an __iter__() method is an iterable.
What is an Iterator:
An iterator is an object which stores the current state of iteration and produces the
next value when you call next(). Any object that has a __next__() method is
therefore an iterator. We can create an iterator object by applying the iter() built-in
function to an iterable.
We can use next() to fetch values from the iterator one at a time, and once the data is
consumed it raises a StopIteration exception.
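A small illustrative example of iter() and next():

nums = [1, 2]
it = iter(nums)      # build an iterator from an iterable
print(next(it))      # 1
print(next(it))      # 2
# next(it)           # a further call raises StopIteration: the data is exhausted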
How iteration works under a for loop:
We can use a for loop in Python to iterate over an iterable like a string, list, or tuple.
But how is this actually implemented? Let's have a look.
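Here is a rough, illustrative sketch of what a for loop does internally (not the exact CPython implementation):

nums = [1, 2, 3]
it = iter(nums)
while True:
    try:
        num = next(it)
    except StopIteration:
        break        # the loop ends when the iterator is exhausted
    print(num)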
From the code above, we can see that a for loop internally uses a while loop and an
iterator.
What is the iterator protocol?
The Python iterator protocol involves two functions: iter() and next(). The iter()
function is used to convert an iterable object into an iterator, and the next() function is
used to fetch the next value.
What is lazy evaluation?
Iterators allow us to create lazy iterables that don't do any work until we ask them for
the next item.
Because of this laziness, iterators can help us deal with infinitely long iterables.
In some cases we can't even store all the information in memory, so we can create an
iterator which gives us the next element whenever we ask for it.
Iterators help us save memory and CPU time, and this approach is called lazy
evaluation.
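As an illustration, a generator function (a common way to write lazy iterators) can represent an infinite sequence; values are produced only on demand:

def infinite_counter(start=0):
    n = start
    while True:
        yield n      # execution pauses here until the next value is requested
        n += 1

counter = infinite_counter()
print(next(counter))   # 0
print(next(counter))   # 1 -- nothing beyond the requested values is ever computed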
What are the advantages of an iterator?
Iterators in Python save resources: only one element is stored in memory at a time,
unlike a list or tuple where all the values are stored at once.
For smaller datasets, iterator- and list-based approaches have similar
performance; for larger datasets, iterators save both time and space.
Cleaner code.
Iterators can work with infinite sequences.
When using pandas for data engineering, several key concepts are particularly
important (a short end-to-end sketch follows the list):
1. DataFrames and Series:
DataFrames: The primary data structure in pandas, representing tabular data
with rows and columns.
Series: A one-dimensional array-like object containing a sequence of values,
which can be of any data type.
2. Data Loading and Saving:
Reading data: Using functions
like pd.read_csv(), pd.read_excel(), pd.read_sql(), etc., to load data from
various file formats and databases.
Writing data: Using functions like df.to_csv(), df.to_excel(), df.to_sql(), etc., to
save data into various file formats and databases.
3. Data Cleaning and Preparation:
Handling missing values: Methods such as df.isnull(), df.dropna(),
and df.fillna().
Data type conversion: Functions like df.astype().
String manipulation: Using the .str accessor for string operations on Series.
4. Indexing and Selecting Data:
Indexing: Using df.set_index() and df.reset_index().
Selecting data: Using .loc[], .iloc[], and boolean indexing for accessing
specific rows and columns.
5. Data Transformation:
Aggregation: Using df.groupby() for aggregating data by groups.
Pivot tables: Using df.pivot_table() for summarizing data.
Merging and joining: Using pd.merge(), df.join(), and pd.concat() for
combining multiple DataFrames.
6. Reshaping Data:
Melt and Pivot: Using pd.melt() to transform DataFrames from wide to long
format and df.pivot() to transform DataFrames from long to wide format.
Stack and Unstack: Using df.stack() and df.unstack() to reshape the data.
7. Time Series Data:
Datetime operations: Using pd.to_datetime(), df.resample(), and the .dt accessor
for time-based operations.
Rolling and expanding windows: Using df.rolling() and df.expanding() for
calculating rolling statistics.
8. Performance Optimization:
Vectorization: Avoiding loops by using pandas’ built-in functions that operate
on entire columns or DataFrames.
Memory usage: Optimizing memory usage by downcasting data types
with pd.to_numeric().
9. Visualization:
Plotting: Using df.plot() for basic visualizations and integrating with libraries
like Matplotlib and Seaborn for advanced visualizations.
10. Advanced Data Manipulation:
Apply functions: Using df.apply() and df.applymap() for applying functions to
DataFrame elements.
Lambda functions: Using lambda functions for inline operations
within apply() and other pandas methods.
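A short end-to-end sketch combining several of the concepts above (all column and file names are hypothetical):

import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])         # 2. loading
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")        # 3. type conversion
df = df.dropna(subset=["amount"])                                  # 3. handling missing values

daily = (
    df.set_index("order_date")                                     # 4. indexing
      .resample("D")["amount"].sum()                               # 7. time series
      .reset_index()
)

by_country = df.groupby("country")["amount"].agg(["sum", "mean"])  # 5. aggregation
df["amount_eur"] = df["amount"].apply(lambda x: x * 0.92)          # 10. apply with a lambda (illustrative rate)

daily.to_csv("daily_revenue.csv", index=False)                     # 2. saving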
Pandas vs PySpark
1. Definitions
1.1 What is PySpark?
PySpark is the Python library for Spark programming. It allows you to use the
powerful and efficient data processing capabilities of Apache Spark from within the
Python programming language. PySpark provides a high-level API for distributed
data processing that can be used to perform common data analysis tasks, such as
filtering, aggregation, and transformation of large datasets.
1.2 What is Pandas?
Pandas is a Python library for data manipulation and analysis. It provides powerful
data structures, such as the DataFrame and Series, that are designed to make it
easy to work with structured data in Python. With pandas, you can perform a wide
range of data analysis tasks, such as filtering, aggregation, and transformation of
data, as well as data cleaning and preparation.
Both definitions look more or less the same, but there is a difference in their
execution and processing architecture. Let’s go over some major differences
between these two.
2. Key Differences between PySpark and Pandas
1. PySpark is a library for working with large datasets in a distributed computing
environment, while pandas is a library for working with smaller, tabular
datasets on a single machine.
2. PySpark is built on top of the Apache Spark framework and uses distributed
data structures such as Resilient Distributed Datasets (RDDs) and Spark
DataFrames, while pandas uses an in-memory DataFrame data structure.
3. PySpark is designed to handle data processing tasks that are not feasible
with pandas due to memory constraints, such as iterative algorithms and
machine learning on large datasets.
4. PySpark allows for parallel processing of data, while pandas does not.
5. PySpark can read data from a variety of sources, including the Hadoop
Distributed File System (HDFS), Amazon S3, and local file systems, while
pandas primarily reads from local files (though it can also read from URLs and
cloud storage via optional dependencies).
6. PySpark can be integrated with other big data tools like Hadoop and Hive,
while pandas is not.
7. PySpark is a Python API to Spark, which is written in Scala and runs on the
Java Virtual Machine (JVM), while pandas is written in Python (with
performance-critical parts in C/Cython).
8. PySpark has a steeper learning curve than pandas, due to the additional
concepts and technologies involved (e.g. distributed computing, RDDs, Spark
SQL, Spark Streaming, etc.).
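To make the differences concrete, here is a hedged sketch of the same aggregation written in both libraries (the file path and column names are placeholders):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# pandas: the whole dataset is loaded into memory on a single machine
pdf = pd.read_csv("sales.csv")
pandas_result = pdf.groupby("region")["revenue"].sum()

# PySpark: the same logic, but the work is distributed across the cluster
spark = SparkSession.builder.appName("comparison").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
spark_result = sdf.groupBy("region").agg(F.sum("revenue").alias("revenue"))
spark_result.show()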
How to decide which library to use — PySpark vs Pandas
The decision of whether to use PySpark or pandas depends on the size and
complexity of the dataset and the specific task you want to perform.
1. Size of the dataset: PySpark is designed to handle large datasets that are
not feasible to work with on a single machine using pandas. If you have a
dataset that is too large to fit in memory, or if you need to perform iterative or
distributed computations, PySpark is the better choice.
2. Complexity of the task: PySpark is a powerful tool for big data processing
and allows you to perform a wide range of data processing tasks, such as
machine learning, graph processing, and stream processing. If you need to
perform any of these tasks, PySpark is the better choice.
3. Learning Curve: PySpark has a steeper learning curve than pandas, as it
requires knowledge of distributed computing, RDDs, and Spark SQL. If you
are new to big data processing and want to get started quickly, pandas may
be the better choice.
4. Resources available: PySpark requires a cluster or distributed system to run,
so you will need access to the appropriate infrastructure and resources. If you
do not have access to these resources, then pandas is a good choice.
In summary, use PySpark for large datasets and complex tasks that are not feasible
with pandas, and use pandas for small datasets and simple tasks that can be
handled on a single machine.
`try:` and `except:` are commonly known for exception handling in Python, so
where does `else:` come in handy? The `else:` block is triggered when no exception is
raised.
Example:
Let’s learn more about `else:` with a couple of examples.
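A reconstruction of the kind of program described below (the prompts and messages are assumptions, not the original code):

try:
    numerator = int(input("Enter numerator: "))
    denominator = int(input("Enter denominator: "))
    result = numerator / denominator
except (ValueError, ZeroDivisionError):
    print("Invalid input!")
else:
    # runs only when no exception was raised in the try block
    print(result)
    print("Division is successful.")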
1. On the first try, we entered 2 as the numerator and "d" as the denominator,
which is invalid input, so `except:` was triggered and printed "Invalid input!".
2. On the second try, we entered 2 as the numerator and 1 as the denominator
and got the result 2. No exception was raised, so the `else:` block ran, printing
the message "Division is successful."
A Comprehensive Guide to Python String Functions