MACHINE LEARNING
TOPIC-3
Data representation is a foundational concept in machine learning (ML), encompassing
how raw data is transformed into formats that algorithms can process effectively. The
quality and structure of this representation significantly influence the performance of ML
models.
🔍 Key Types of Data Representation in Machine Learning
1. Tabular (Attribute–Value) Systems
This is the most common format, where data is organized into rows and columns. Each
row represents an instance (e.g., a customer), and each column represents a feature
(e.g., age, income). This structure is prevalent in structured datasets like spreadsheets
and databases.
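A minimal sketch of the attribute–value idea in plain Python (the customer data below is hypothetical):

```python
# Each row is one instance (a customer); each key is one feature.
customers = [
    {"age": 34, "income": 52000, "color_pref": "Red"},
    {"age": 28, "income": 61000, "color_pref": "Blue"},
]

# Column access: collect one feature across all instances.
ages = [row["age"] for row in customers]
```

In practice this structure maps directly onto a spreadsheet, a database table, or a pandas DataFrame.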
2. One-Hot Encoding
For categorical data, one-hot encoding transforms each category into a binary vector.
For example, a "Color" feature with values "Red," "Green," and "Blue" would be
represented as
Red → [1, 0, 0]
Green → [0, 1, 0]
Blue → [0, 0, 1]
This method ensures that the model doesn't infer any ordinal relationship between categories.
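The encoding above can be sketched in a few lines of plain Python (libraries such as scikit-learn provide production-ready versions):

```python
def one_hot(value, categories):
    """Encode a categorical value as a binary vector with a single 1.

    The position of the 1 marks which category is present; no ordering
    is implied between the categories.
    """
    return [1 if value == c else 0 for c in categories]

colors = ["Red", "Green", "Blue"]
```

For example, `one_hot("Green", colors)` yields `[0, 1, 0]`.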
3. Embeddings
Embeddings are dense vector representations of data, particularly useful for high-
dimensional or categorical data. In natural language processing (NLP), words are
mapped to vectors in a continuous vector space, capturing semantic relationships. For
instance, "king" and "queen" might be represented as vectors that are close in this
space, reflecting their semantic similarity.
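A toy illustration of this idea: store words as dense vectors and compare them with cosine similarity. The vectors below are made up for illustration, not learned from data.

```python
import math

# Toy embedding table: each word maps to a dense vector.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine_similarity(u, v):
    """How close two vectors point, independent of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

With these vectors, "king" is far more similar to "queen" than to "apple", mirroring the semantic relationship the text describes.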
4. Numerical and Ordinal Encoding
Numerical data (e.g., age, salary) is directly used as input features. Ordinal data (e.g.,
education level: High School < Bachelor's < Master's) can be encoded using integers
that reflect their inherent order, though care must be taken to avoid implying equal
intervals between categories.
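A minimal ordinal encoding sketch, using the education-level example (the integer mapping is hypothetical):

```python
# Integers that preserve the inherent order of the categories.
# Note: the equal gaps (0, 1, 2) say nothing about the true
# "distance" between education levels.
EDUCATION_ORDER = {"High School": 0, "Bachelor's": 1, "Master's": 2}

def encode_education(level):
    return EDUCATION_ORDER[level]
```

The encoding preserves the ordering (High School < Bachelor's < Master's) while remaining a single numeric feature.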
🧠 Advanced Representation Techniques
Representation Learning
This approach involves learning the best way to represent data from the data itself,
often through deep learning models. For example, convolutional neural networks
(CNNs) can learn hierarchical representations of images, progressing from simple
edges in early layers to complex objects in deeper layers.
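The edge-detection idea can be illustrated with a single hand-written filter. In a real CNN the filter values are learned from data; here a fixed vertical-edge kernel is applied by the cross-correlation operation that deep-learning libraries call "convolution":

```python
def convolve2d(image, kernel):
    """Valid 2D cross-correlation (no padding), in plain Python."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter of the kind a CNN's first layer typically learns.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# A tiny image: dark on the left, bright on the right.
image = [[0, 0, 1, 1]] * 3
```

Applying the kernel produces strong (non-zero) responses exactly where the dark-to-bright transition occurs, which is the kind of low-level representation early CNN layers build on.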
Dimensionality Reduction
Techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic
Neighbor Embedding (t-SNE) are used to reduce the number of features while
preserving the data's structure. This is particularly useful for visualizing high-
dimensional data or improving model efficiency.
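A minimal PCA sketch with NumPy, assuming a small hypothetical dataset: center the data, take the eigenvectors of the covariance matrix with the largest eigenvalues, and project onto them.

```python
import numpy as np

def pca(X, n_components):
    """Project data onto the top principal components (minimal sketch)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; take the largest ones.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X_centered @ top

# Hypothetical 3-feature data reduced to 2 dimensions.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6]])
X_reduced = pca(X, 2)
```

By construction, the first projected dimension carries the largest share of the data's variance, which is what makes PCA useful for visualization and model efficiency.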
🧪 Data Types in Machine Learning
Structured Data: Organized in rows and columns, suitable for traditional ML algorithms.
Unstructured Data: Includes text, images, and audio, requiring specialized models like
CNNs or Recurrent Neural Networks (RNNs).
Semi-Structured Data: Contains tags or markers to separate data elements, such as XML
or JSON files.
🧩 Basic Architecture for Tabular Data Models
Input Layer
Accepts raw tabular data, which may include both numerical and categorical
features.
Preprocessing Layer
Handles data normalization (for numerical features) and encoding (for categorical
features, such as one-hot encoding).
Embedding Layer (Optional)
Transforms categorical variables into dense vector representations, capturing semantic
relationships between categories.
Hidden Layers
Consist of fully connected layers (also known as dense layers) that learn complex patterns in
the data.
Output Layer
Produces the final prediction, which could be a classification label or a continuous value,
depending on the task.
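The hidden and output layers above can be sketched as a forward pass in plain Python. The weights below are hypothetical and untrained; a real model would learn them from data:

```python
def dense(inputs, weights, biases, activation=None):
    """One fully connected (dense) layer: out_j = act(sum_i x_i * W[i][j] + b_j)."""
    out = []
    for j in range(len(biases)):
        z = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        out.append(max(0.0, z) if activation == "relu" else z)
    return out

# Hypothetical weights: 2 input features -> 3 hidden units -> 1 output.
W1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b1 = [0.0, 0.1, 0.0]
W2 = [[0.2], [-0.1], [0.5]]
b2 = [0.0]

x = [1.0, 2.0]                      # e.g. preprocessed age and income
hidden = dense(x, W1, b1, "relu")   # hidden layer with ReLU activation
prediction = dense(hidden, W2, b2)  # output layer (here: a continuous value)
```

For classification, the output layer would instead apply a softmax (or sigmoid) to produce class probabilities.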