U23AD492-DATA SCIENCE
DR. M. MOHAMMED MUSTAFA B.TECH, M.E, MBA, PH.D
ASSOCIATE PROFESSOR.
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA
SRI ESHWAR COLLEGE OF ENGINEERING,
COIMBATORE
INTRODUCTION TO DATA SCIENCE AND
DATA ACQUISITION
DEFINITION OF DATA SCIENCE:
Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and unstructured
data. It combines principles from statistics, computer science, and domain knowledge to
analyze data and make data-driven decisions.
SCOPE OF DATA SCIENCE:
1. Data collection and storage: gathering data from various sources and
storing it in databases or data warehouses.
2. Data cleaning and preprocessing: ensuring data quality by handling
missing values, outliers, and inconsistent data.
3. Data analysis: using statistical and computational techniques to explore
and understand data patterns and trends.
4. Data visualization: creating visual representations of data to make
insights easily understandable.
SCOPE OF DATA SCIENCE: CONTD..
5. Machine learning and predictive modeling: developing algorithms to
predict future trends and behaviors based on historical data.
6. Big data technologies: utilizing tools and frameworks to handle large
volumes of data that traditional data processing software cannot manage.
7. Data engineering: building and maintaining the infrastructure and
architecture for data generation, storage, and retrieval.
8. Domain-specific applications: applying data science techniques to
specific fields such as healthcare, finance, marketing, and more.
IMPORTANCE OF DATA-DRIVEN
DECISION MAKING:
• Improved accuracy: data-driven decisions are based on empirical evidence rather than intuition or guesswork, leading to more accurate
outcomes.
• Efficiency: analysing data helps identify inefficiencies and areas for improvement, optimizing processes and resource allocation.
• Risk management: predictive analytics can foresee potential risks and enable proactive measures to mitigate them.
• Competitive advantage: businesses that leverage data insights can better understand market trends and customer preferences, gaining a
competitive edge.
• Personalization: data allows for personalized customer experiences, enhancing satisfaction and loyalty.
• Innovation: data analysis can uncover new opportunities and drive innovation in products, services, and business models.
• Strategic planning: data-driven insights inform long-term strategies and help organizations adapt to changing environments.
• Performance measurement: metrics and KPIs derived from data provide objective measures of performance, enabling continuous
improvement.
DATA SCIENCE IS AN INTERDISCIPLINARY FIELD
1. Computer science: algorithms, data structures, programming languages
2. Statistics: statistical inference, regression, hypothesis testing
3. Mathematics: linear algebra, calculus, optimization techniques
4. Domain expertise: knowledge of specific domains such as healthcare, finance, marketing
5. Machine learning: supervised, unsupervised, and reinforcement learning
DATA SCIENCE IS AN INTERDISCIPLINARY FIELD: CONTD…
6. Data visualization: visualization techniques to communicate insights
7. Communication: effective communication of results to stakeholders
8. Business acumen: understanding of business goals and objectives
9. Social sciences: understanding of human behavior, social networks
10. Information science: data storage, retrieval, and management
THE DATA SCIENCE LIFE CYCLE
• Problem definition:
• Objective: understand the business problem or research question.
• Tasks: identify the key objectives, define the problem scope, and establish success
criteria.
• Data collection:
• Objective: gather relevant data from various sources.
• Tasks: collect data from databases, APIs, web scraping, surveys, or other sources.
Ensure data is relevant and comprehensive.
• Data preparation:
• Objective: clean and preprocess data to ensure quality.
• Tasks: handle missing values, remove duplicates, correct errors, normalize data, and
perform feature engineering.
• Exploratory data analysis (EDA):
• Objective: understand data characteristics and uncover initial insights.
• Tasks: use statistical techniques and data visualization to explore data distributions,
relationships, and anomalies.
THE DATA SCIENCE LIFE CYCLE
• Data modeling:
• Objective: develop models to solve the defined problem.
• Tasks: select appropriate algorithms (e.g., regression, classification, clustering), train models, tune hyperparameters, and
validate performance using techniques like cross-validation.
• Model evaluation:
• Objective: assess model performance and ensure it meets business objectives.
• Tasks: use metrics such as accuracy, precision, recall, F1-score, ROC-AUC for classification, or RMSE for regression. Compare
different models and select the best-performing one.
• Model deployment:
• Objective: implement the model in a production environment.
• Tasks: integrate the model into existing systems or create new applications. Ensure scalability, reliability, and security.
• Model monitoring and maintenance:
• Objective: ensure the deployed model continues to perform well.
• Tasks: monitor model performance over time, retrain with new data if necessary, and update the model to adapt to changes.
• Communication and visualization:
• Objective: present findings and insights to stakeholders.
• Tasks: create dashboards, reports, and visualizations to convey results in an understandable and actionable manner. Provide
clear recommendations based on data insights.
• Documentation and reporting:
• Objective: document the entire process for transparency and future reference.
• Tasks: record methodologies, decisions, and results. Prepare comprehensive reports for stakeholders and team members.
OVERVIEW OF DATA SCIENCE TOOLS
AND TECHNIQUES
Tools:
• Programming languages:
• Python: popular for its simplicity and extensive libraries (e.g., pandas, NumPy, scikit-learn, TensorFlow).
• R: widely used for statistical analysis and visualization (e.g., ggplot2, dplyr).
• SQL: essential for database querying and management.
• Data analysis and manipulation:
• pandas (Python): data manipulation and analysis.
• NumPy (Python): numerical computing.
• dplyr (R): data manipulation.
• Data visualization:
• Matplotlib and Seaborn (Python): plotting and visualization.
• ggplot2 (R): data visualization.
• Tableau and Power BI: interactive dashboards and business intelligence.
• Machine learning:
• scikit-learn (Python): general-purpose machine learning.
• TensorFlow and Keras (Python): deep learning.
• XGBoost and LightGBM (Python): gradient boosting frameworks.
OVERVIEW OF DATA SCIENCE TOOLS
AND TECHNIQUES-CONTD..
• Big data technologies:
• Hadoop: distributed storage and processing.
• Spark: fast, in-memory data processing.
• Hive and Pig: querying and processing large datasets.
• Data engineering:
• Apache Kafka: real-time data streaming.
• Airflow: workflow automation and scheduling.
• ETL tools: Talend, Informatica, Alteryx.
• Integrated development environments (IDEs):
• Jupyter Notebooks: interactive data analysis.
• RStudio: development environment for R.
• PyCharm: IDE for Python.
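To show how several of these Python tools fit together, here is a minimal sketch that loads a dataset with pandas, summarizes it with NumPy, and plots it with Matplotlib. The file name "sales.csv" and its columns ("month", "revenue") are hypothetical placeholders, not part of the course material.

# Minimal sketch of the Python tool stack: pandas + NumPy + Matplotlib.
# "sales.csv" and its column names are hypothetical placeholders.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                       # load data with pandas
print(df.describe())                                # quick descriptive summary

monthly = df.groupby("month")["revenue"].sum()      # aggregate with pandas
print("Mean monthly revenue:", np.mean(monthly.values))  # NumPy computation

monthly.plot(kind="bar", title="Revenue by month")  # Matplotlib plot via pandas
plt.tight_layout()
plt.show()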
TECHNIQUES:
• Descriptive statistics:
• Measures of central tendency (mean, median, mode).
• Measures of dispersion (variance, standard deviation).
• Inferential statistics:
• Hypothesis testing (t-tests, chi-square tests).
• Confidence intervals.
• Exploratory data analysis (EDA):
• Data visualization (scatter plots, histograms, box plots).
• Correlation analysis.
• Data preprocessing:
• Data cleaning (handling missing values, outliers).
• Feature engineering (creation of new features, scaling).
• Supervised learning:
• Regression (linear regression, logistic regression).
• Classification (decision trees, support vector machines).
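As a small illustration of the descriptive and inferential techniques above, the following sketch computes central tendency and dispersion with pandas and runs a two-sample t-test with SciPy. The data are synthetic and generated purely for demonstration.

# Sketch: descriptive statistics and a two-sample t-test on synthetic data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
group_a = pd.Series(rng.normal(loc=50, scale=5, size=100))
group_b = pd.Series(rng.normal(loc=52, scale=5, size=100))

# Descriptive statistics: central tendency and dispersion
print("Mean:", group_a.mean(), "Median:", group_a.median(), "Std:", group_a.std())

# Inferential statistics: do the two groups differ significantly?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")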
TECHNIQUES:-CONTD…
• Unsupervised learning:
• Clustering (k-means, hierarchical clustering).
• Dimensionality reduction (PCA, t-SNE).
• Deep learning:
• Neural networks (CNNs for image data, RNNs for sequential data).
• Transfer learning.
• Model evaluation:
• Cross-validation.
• Performance metrics (accuracy, precision, recall, F1-score, ROC-AUC).
• Natural language processing (NLP):
• Text preprocessing (tokenization, stemming, lemmatization).
• Sentiment analysis, topic modeling.
• Time series analysis:
• ARIMA models.
• Seasonal decomposition.
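To illustrate the model evaluation techniques listed above, here is a minimal scikit-learn sketch of k-fold cross-validation using the built-in Iris dataset; the choice of model and metric is only an example.

# Sketch: 5-fold cross-validation with scikit-learn on a built-in dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Evaluate with macro-averaged F1 across 5 folds
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
print("F1 per fold:", scores)
print("Mean F1:", scores.mean())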
APPLICATIONS OF DATA SCIENCE
• Healthcare:
• Predictive analytics for disease outbreaks and patient outcomes.
• Image analysis for radiology (e.g., detecting tumors in medical images).
• Personalized medicine and treatment recommendations.
• Finance:
• Fraud detection using anomaly detection techniques.
• Algorithmic trading and stock market prediction.
• Risk assessment and credit scoring.
• Marketing:
• Customer segmentation and targeting.
• Sentiment analysis on social media data.
• Recommendation systems for personalized product recommendations.
• Retail:
• Inventory management and demand forecasting.
• Customer behavior analysis and sales prediction.
• Price optimization.
• Manufacturing:
• Predictive maintenance for machinery and equipment.
• Quality control and defect detection.
• Supply chain optimization.
APPLICATIONS OF DATA SCIENCE
• Transportation:
• Route optimization and logistics planning.
• Autonomous vehicles and driver assistance systems.
• Traffic pattern analysis and congestion management.
• Energy:
• Smart grid management and energy consumption forecasting.
• Predictive maintenance for power plants and infrastructure.
• Renewable energy optimization.
• Education:
• Personalized learning experiences and adaptive learning systems.
• Student performance prediction and dropout prevention.
• Curriculum and content recommendation.
• Sports:
• Performance analysis and injury prediction.
• Game strategy optimization using data-driven insights.
• Fan engagement through personalized content and experiences.
• Entertainment:
• Content recommendation systems (e.g., for streaming services).
• Audience sentiment analysis.
• Box office and revenue prediction.
DATA ACQUISITION
• Data acquisition is a crucial step in the data science life cycle, involving the
collection and storage of data from various sources.
• The quality and relevance of the data collected directly impact the outcomes
of data analysis and modeling.
SOURCES OF DATA
• Internal data:
• Transactional data: data generated from business transactions (e.g., sales records, purchase histories).
• Operational data: data from internal processes (e.g., inventory levels, production data).
• Customer data: data from customer interactions (e.g., CRM systems, customer support logs).
• External data:
• Public data: data available from government and public agencies (e.g., census data, economic indicators).
• Commercial data: data purchased from third-party vendors (e.g., market research reports, consumer data).
• Social media data: data from social networking sites (e.g., tweets, Facebook posts, LinkedIn profiles).
• Sensor data:
• IoT devices: data from internet-connected devices (e.g., smart meters, wearable devices).
• Industrial sensors: data from manufacturing and industrial equipment (e.g., temperature, pressure sensors).
• Web data:
• Web scraping: data extracted from websites (e.g., product prices, reviews, news articles).
• Web logs: data from web server logs (e.g., user activity, page views).
• Survey data:
• Questionnaires: data collected through structured surveys (e.g., customer feedback forms, market surveys).
• Interviews: data from personal or telephonic interviews.
DATA COLLECTION METHODS
• Manual data collection:
• Entering data manually from physical documents or observations.
• Suitable for small-scale data collection but can be time-consuming and prone to errors.
• Automated data collection:
• Using scripts and tools to automatically gather data from various sources.
• Reduces time and errors, making it suitable for large-scale data collection.
• Web scraping:
• Using software tools to extract data from websites.
• Common tools: BeautifulSoup, Scrapy (Python libraries).
• Ethical considerations and compliance with website terms of service are crucial.
• APIs (Application Programming Interfaces):
• APIs provide programmatic access to data from various platforms and services.
• Common uses: fetching real-time data (e.g., weather data, financial market data),
integrating with third-party services (e.g., social media platforms).
• API providers often offer documentation and usage guidelines.
USING APIS FOR DATA ACQUISITION
• Understanding APIs:
• APIs enable communication between software applications, allowing data exchange.
• REST (Representational State Transfer) is a common architectural style for APIs, using standard HTTP methods (GET, POST, PUT, DELETE).
• Accessing APIs:
• Obtain API keys or tokens for authentication and authorization.
• Follow API documentation to understand endpoints, request methods, parameters, and response formats.
• Making API requests:
• Use tools like Postman for testing API requests.
• Implement API calls in code using libraries (e.g., requests in Python).
• Handling API responses:
• Parse and process the response data, typically in JSON or XML format.
• Handle errors and rate limits as specified by the API provider.
• Storing data:
• Save the retrieved data in databases, data lakes, or file storage systems for further analysis.
• Ensure data integrity and security during storage and access.
Example in Python:
import requests

# Hypothetical endpoint and API key for illustration
url = "https://api.example.com/data"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

response = requests.get(url, headers=headers)
if response.status_code == 200:
    data = response.json()  # parse the JSON response body
    print(data)
else:
    print(f"Failed to retrieve data: {response.status_code}")
WEB SCRAPING: EXTRACTING DATA
FROM WEBSITES
• Web scraping is the process of automatically extracting information from websites.
• It is a valuable technique for gathering data that is publicly available on the
internet but not readily accessible in a structured format.
• Here’s a detailed overview of web scraping, including accessing different sources of
data.
STEPS IN WEB SCRAPING
• Identifying the target website:
• Determine the website(s) from which you want to extract data.
• Identify the specific pages and the data elements (e.g., text, images, links) you need.
• Understanding the website structure:
• Analyze the HTML structure of the web pages using browser developer tools (e.g.,
Chrome DevTools).
• Identify the tags, classes, and IDs that contain the data of interest.
• Setting up the scraping environment:
• Choose a programming language (commonly Python) and install necessary libraries
(e.g., BeautifulSoup, Scrapy, Selenium).
• Set up a virtual environment to manage dependencies.
STEPS IN WEB SCRAPING
• Making HTTP requests:
• Use libraries like requests to send HTTP requests to the target web pages.
• Handle different HTTP methods (GET, POST) and manage request headers for access
control and session management.
• Parsing the HTML content:
• Use HTML parsing libraries like BeautifulSoup to navigate and extract the desired
data from the HTML response.
• Techniques include finding elements by tag, class, ID, and using CSS selectors or
XPath.
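A minimal sketch of these two steps with requests and BeautifulSoup; the URL, tag names, and CSS classes below are hypothetical placeholders.

# Sketch: fetch a page and extract elements by tag, class, and CSS selector.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical target page
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="product-title")]
prices = [span.get_text(strip=True) for span in soup.select("span.price")]  # CSS selector
print(list(zip(titles, prices)))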
STEPS IN WEB SCRAPING
• Handling pagination:
• Many websites display data across multiple pages.
Implement logic to handle pagination by identifying the
structure of next page links.
• Loop through pages to collect data iteratively.
• Storing the extracted data:
• Save the scraped data in a structured format (e.g., CSV, JSON, database).
• Ensure data integrity and handle duplicates if necessary.
• Example of saving data to a CSV file (see the sketch below):
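Below is a minimal sketch of the promised CSV example, combining simple pagination handling with storage; the page URL pattern and CSS selectors are hypothetical placeholders.

# Sketch: loop over paginated listing pages and save scraped rows to a CSV file.
import csv
import time
import requests
from bs4 import BeautifulSoup

rows = []
for page in range(1, 4):  # pages 1-3 (hypothetical pagination scheme)
    url = f"https://example.com/products?page={page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for item in soup.select("div.product"):  # hypothetical item container
        rows.append({
            "name": item.select_one("h2").get_text(strip=True),
            "price": item.select_one("span.price").get_text(strip=True),
        })
    time.sleep(1)  # polite delay between requests

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)  # duplicates could be filtered here if needed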
STEPS IN WEB SCRAPING
Respecting Ethical Considerations and Legal Issues:
1. Always review and comply with the website’s robots.txt file and terms of service.
2. Avoid excessive scraping to prevent overloading the website’s servers.
3. Use respectful time delays between requests (e.g., time.sleep() in Python).
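A small sketch of point 1, checking a site's robots.txt with Python's standard library before fetching a page; the URLs and user-agent string are placeholders.

# Sketch: respect robots.txt before fetching a page.
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

target = "https://example.com/products"
if rp.can_fetch("MyScraperBot", target):  # hypothetical user-agent string
    print("Allowed to fetch:", target)
    time.sleep(2)                         # respectful delay before the next request
else:
    print("Disallowed by robots.txt:", target)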
ACCESSING DIFFERENT SOURCES OF DATA
VIA WEB SCRAPING
• Static web pages:
• Pages with fixed content that doesn’t change dynamically.
• Use direct HTML parsing techniques to extract data.
• Dynamic web pages:
• Pages that load content dynamically using JavaScript (e.g., infinite scrolling,
AJAX calls).
• Use tools like Selenium or Puppeteer to simulate browser interactions and
capture the rendered HTML.
• APIs as a source of data:
• Some websites provide APIs for accessing their data in a structured format.
• Use API endpoints to fetch data directly, which is often more efficient and
reliable than scraping HTML.
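For the dynamic-page case, here is a hedged sketch using Selenium; it assumes Chrome with a matching ChromeDriver is installed, and the URL and CSS selector are placeholders.

# Sketch: render a JavaScript-heavy page with Selenium and extract text.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires Chrome and ChromeDriver to be available
driver.get("https://example.com/dynamic-listing")
time.sleep(3)  # crude wait for JavaScript-rendered content to load

items = driver.find_elements(By.CSS_SELECTOR, "div.listing-item")
for item in items:
    print(item.text)

driver.quit()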
ACCESSING DIFFERENT SOURCES OF DATA
VIA WEB SCRAPING
• Data aggregators and public data repositories:
• Websites that aggregate data from multiple sources or provide public datasets
(e.g., Kaggle, data.gov).
• Often provide bulk downloads or API access for easier data acquisition.
• Social media platforms:
• Extract data from social media sites using their APIs (e.g., Twitter API, Facebook
Graph API).
• Requires understanding of API usage policies and handling authentication tokens.
THANK YOU