UNIT I

The document provides an overview of Big Data, its characteristics, applications, and evolution, emphasizing its significance in various sectors such as healthcare, finance, and advertising. It discusses the challenges posed by Big Data, including its volume, velocity, and variety, and highlights the importance of analytics in deriving insights for better decision-making. Additionally, it outlines the phases of Big Data evolution, from structured content to the current mobile and sensor-based content era.
PC 702 IT BDA DCET
UNIT I: Understanding Big Data

Topics:
1. Understanding Big Data
2. Characteristics of Data
3. Introduction to Big Data and its importance
4. Evolution of Big Data
5. Challenges posed by Big Data
6. Big Data analytics and its classification
7. Big Data applications
8. Big Data and healthcare
9. Big Data in medicine
10. Big Data in advertising
11. Big Data technologies

1. Understanding Big Data

Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so large and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.

Big data analytics allows you to collect and process real-time data points and analyze them to adapt quickly and gain a competitive advantage. These insights can guide and accelerate the planning, production, and launch of new products, features, and updates. Systems that process and store big data have become a common component of data management architectures in organizations, and they are combined with tools that support big data analytics uses.

Big data is often characterized by the three V's:
- Volume: the large amount of data in many environments.
- Variety: the wide range of data types frequently stored in big data systems.
- Velocity: the high speed at which the data is generated, collected and processed.

Doug Laney first identified these three V's of big data in 2001 when he was an analyst at consulting firm Meta Group Inc.; Gartner popularized them after it acquired Meta Group in 2005. More recently, several other V's have been added to descriptions of big data, including veracity, value and variability. Although big data does not equate to any specific volume of data, big data deployments often involve terabytes, petabytes and even exabytes of data points created and collected over time.
Companies use big data in their systems to improve operational efficiency, provide better customer service, create personalized marketing campaigns and take other actions that can increase revenue and profits. Businesses that use big data effectively hold a potential competitive advantage over those that don't, because they are able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies can use to refine their marketing, advertising and promotions to increase customer engagement and conversion rates. Both historical and real-time data can be analyzed to assess the evolving preferences of consumers or corporate buyers, enabling businesses to become more responsive to customer wants and needs.

Other examples across sectors:
- Medical researchers use big data to identify disease signs and risk factors, and doctors use it to help diagnose illnesses and medical conditions in patients. In addition, a combination of data from electronic health records, social media sites, the web and other sources gives healthcare organizations and government agencies up-to-date information on infectious disease threats and outbreaks.
- Big data helps oil and gas companies identify potential drilling locations and monitor pipeline operations; likewise, utilities use it to track electrical grids.
- Financial services firms use big data systems for risk management and real-time analysis of market data.
- Manufacturers and transportation companies rely on big data to manage their supply chains and optimize delivery routes.
- Government agencies use big data for emergency response, crime prevention and smart city initiatives.

2. Characteristics of Data

Data has several definitions, and usage can be singular or plural.

"Data is information, usually in the form of facts or statistics that one can analyze or use for further calculations."
[Collins English Dictionary]

"Data is information that can be stored and used by a computer program." [Computing]

"Data is information presented in numbers, letters, or other form." [Electrical Engineering, Circuits, Computing and Control]

"Data is information from a series of observations, measurements or facts." [Science]

"Data is information from a series of behavioural observations, measurements or facts." [Social Sciences]

Data can be classified as structured, semi-structured, multi-structured and unstructured. Structured data conform to and associate with data schemas and data models; structured data are found in tables (rows and columns). Only about 15-20% of data are in structured or semi-structured form. Unstructured data do not conform to or associate with any data model. Applications produce continuously increasing volumes of both unstructured and structured data. Data sources generate data in three forms, viz. structured, semi-structured and unstructured.

Using Structured Data

Structured data enables the following:
- data insert, delete, update and append
- indexing to enable faster data retrieval
- scalability, which enables increasing or decreasing capacities and data processing operations such as storing, processing and analytics
- transaction processing, which follows the ACID rules (Atomicity, Consistency, Isolation and Durability)
- encryption and decryption for data security.

Using Semi-Structured Data

Examples of semi-structured data are XML and JSON documents. Semi-structured data contain tags or other markers, which separate semantic elements and enforce hierarchies of records and fields within the data. The semi-structured form of data does not conform to formal data model structures; such data do not associate with data models such as the relational database and table models.

Using Multi-Structured Data

Multi-structured data refers to data consisting of multiple formats, viz.
structured, semi-structured and/or unstructured data. Multi-structured data sets can have many formats and are often found in non-transactional systems: for example, streaming data on customer interactions, data from multiple sensors, data at a web or enterprise server, or data-warehouse data in multiple formats. Large-scale interconnected systems are thus required to aggregate the data and use the widely distributed resources efficiently.

Multi- or semi-structured data has some semantic meaning and is in both structured and unstructured formats. Like structured data, semi-structured data nowadays represent only a small portion of all data (5-10%), though the semi-structured type has a greater presence than structured data.

3. Introduction to Big Data and its importance

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process optimization. Industry analyst Doug Laney described the '3Vs', i.e. volume, variety and velocity, as the key "data management challenges" for enterprises. Analysts also describe the '4Vs', i.e. volume, velocity, variety and veracity. A number of other definitions are available for Big Data, some of which are given below.

"A collection of data sets so large or complex that traditional data processing applications are inadequate." [Wikipedia]

"Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges." [Oxford English Dictionary]

"Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." [The McKinsey Global Institute, 2011]
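To make the structured versus semi-structured distinction concrete, the sketch below contrasts a fixed-schema relational row with a nested JSON document, using only Python's built-in sqlite3 and json modules. The table, names and JSON fields are invented for illustration; this is a minimal sketch, not part of the original notes.

```python
import json
import sqlite3

# Structured: rows conform to a fixed schema (named columns with types),
# and an in-memory relational store supports SQL queries and ACID commits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
with conn:  # the connection context manager commits the transaction atomically
    conn.execute("INSERT INTO patients VALUES (1, 'Asha', 'Hyderabad')")
rows = conn.execute("SELECT name FROM patients WHERE city = 'Hyderabad'").fetchall()

# Semi-structured: JSON markers enforce a hierarchy of records and fields,
# but there is no fixed schema -- another document may carry different keys.
doc = json.loads('{"id": 1, "name": "Asha", '
                 '"visits": [{"date": "2024-01-05", "dept": "ENT"}]}')

print(rows)                      # [('Asha',)]
print(doc["visits"][0]["dept"])  # ENT
```

Note how the relational row can only be queried against its declared columns, while the JSON document's nested "visits" list would have no natural home in a flat table without further modelling.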
3.1 Big Data Characteristics

The characteristics of Big Data, called the 3Vs (and also the 4Vs), are:

Volume: The phrase 'Big Data' contains the term big, which relates to the size of the data. Size defines the amount or quantity of data generated from an application or applications, and determines the processing considerations needed for handling that data.

Velocity: The term velocity refers to the speed of generation of data. Velocity is a measure of how fast the data generates and processes. To meet the demands and the challenges of processing Big Data, the velocity of generation of data plays a crucial role.

Variety: Big Data comprises a variety of data. Data is generated from multiple sources in a system, which introduces variety and therefore 'complexity'. Data consists of various forms and formats, owing to the large number of heterogeneous platforms in the industry. The type to which Big Data belongs is therefore also an important characteristic that needs to be known for proper processing; it helps in effective use of data according to its format.

Veracity is also considered an important characteristic: it takes into account the quality of the captured data, which can vary greatly and affects accurate analysis.

The 4Vs (i.e. volume, velocity, variety and veracity) of data need tools for mining, discovering patterns, business intelligence, artificial intelligence (AI), machine learning (ML), text analytics, descriptive and predictive analytics, and data visualization.

3.2 Big Data Types

Following are the suggested types:

1. Social networks and web data, such as Facebook, Twitter, e-mails, blogs and YouTube.
2. Transactions data and Business Processes (BPs) data, such as credit card transactions, flight bookings, etc.,
and public agencies data, such as medical records, insurance business data, etc.
3. Customer master data, such as data for facial recognition and for the name, date of birth, marriage anniversary, gender, location and income category.
4. Machine-generated data, such as machine-to-machine or Internet of Things data, and the data from sensors, trackers, web logs and computer system logs. Computer-generated data is also considered machine-generated data; usage of programs for processing data in data repositories, such as a database or file, likewise generates machine data.
5. Human-generated data, such as biometrics data, human-machine interaction data, e-mail records with a mail server and a MySQL database of student grades. Humans also record their experiences in ways such as writing in notebooks or diaries, or taking photographs or audio and video clips. Human-sourced information is now almost entirely digitized and stored everywhere from personal computers to social networks. Such data are loosely structured and often ungoverned.

3.3 Big Data and its importance

Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. Here are the most important values of Big Data:

1. Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can identify more efficient ways of doing business.
2. Faster, better decision making: With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses are able to analyze information immediately and make decisions based on what they have learned.
3.
New products and services: With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want.

3.4 Advantages/Importance of Big Data

- One of the biggest advantages of Big Data is predictive analysis. Big Data analytics tools can predict outcomes accurately, thereby allowing businesses and organizations to make better decisions while simultaneously optimizing their operational efficiency and reducing risk.
- By harnessing data from social media platforms using Big Data analytics tools, businesses around the world are streamlining their digital marketing strategies to enhance the overall consumer experience. Big Data provides insights into customer pain points and allows companies to improve their products and services.
- Being accurate, Big Data combines relevant data from multiple sources to produce highly actionable insights. Almost 43% of companies lack the necessary tools to filter out irrelevant data, which eventually costs them millions of dollars to hash out useful data from the bulk. Big Data tools can help reduce this, saving both time and money.
- Big Data analytics can help companies generate more sales leads, which naturally means a boost in revenue. Businesses use Big Data analytics tools to understand how well their products and services are doing in the market and how customers are responding to them. Thus, they can better understand where to invest their time and money.
- With Big Data insights, you can always stay a step ahead of your competitors. You can screen the market to know what kind of promotions and offers your rivals are providing, and then you can come up with better offers for your customers.
Also, Big Data insights allow you to learn customer behaviour, understand customer trends and provide a highly personalized experience to customers.

4. Evolution of Big Data

The evolution of Big Data can roughly be subdivided into three main phases. Each phase was driven by technological advancements and has its own characteristics and capabilities. In order to understand the context of Big Data today, it is important to understand how each of these phases contributed to the modern meaning of Big Data.

Big Data Phase 1: Structured Content

Data analysis, data analytics and Big Data originate from the longstanding domain of database management, which relies heavily on the storage, extraction, and optimization techniques common for data stored in Relational Database Management Systems (RDBMS). The techniques used in these systems, such as structured query language (SQL) and the extraction, transformation and loading (ETL) of data, started to professionalize in the 1970s. Database management and data warehousing systems are still fundamental components of modern-day Big Data solutions; the ability to quickly store and retrieve data from databases, or to find information in large data sets, remains a core requirement for the analysis of Big Data. Relational database management technology and other data processing technologies developed during this phase are still strongly embedded in the Big Data solutions of leading IT vendors such as Microsoft, Google and Amazon.

Big Data Phase 2: Web-Based Unstructured Content

From the early 2000s, the internet and corresponding web applications started to generate tremendous amounts of data. In addition to the data that these web applications stored in relational databases, IP-specific search and interaction logs started to generate web-based unstructured data.
These unstructured data sources provided organizations with a new form of knowledge: insights into the needs and behaviours of internet users. With the expansion of web traffic and online stores, companies such as Yahoo, Amazon and eBay started to analyse customer behaviour through click rates, IP-specific location data and search logs, opening up a whole new world of possibilities. From a technical point of view, HTTP-based web traffic introduced a massive increase in semi-structured and unstructured data. Besides the standard structured data types, organizations now needed to find new approaches and storage solutions to deal with these new data types in order to analyse them effectively. The arrival and growth of social media data greatly increased the need for tools, technologies and analytics techniques able to extract meaningful information out of this unstructured data. New technologies, such as network analysis, web mining and spatial-temporal analysis, were developed specifically to analyse these large quantities of web-based unstructured data effectively.

Big Data Phase 3: Mobile and Sensor-Based Content

The third and current phase in the evolution of Big Data is driven by the rapid adoption of mobile technology and devices, and the data they generate. The number of mobile devices and tablets surpassed the number of laptops and PCs for the first time in 2011. In 2020, an estimated 10 billion devices were connected to the internet, and all of these devices generate data every single second of the day. Mobile devices not only make it possible to analyse behavioural data (such as clicks and search queries), but also provide the opportunity to store and analyse location-based GPS data. Through these mobile devices and tablets, it is possible to track movement, analyse physical behaviour and even health-related data (for example, the number of steps you take per day).
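The step-count example above can be sketched in a few lines: a wearable emits many small readings per day, and analysis starts by aggregating them. The readings below are hypothetical values invented for illustration.

```python
from collections import defaultdict

# Hypothetical step-count readings streamed from a wearable: (day, steps).
readings = [
    ("2020-01-01", 4200), ("2020-01-01", 3100),
    ("2020-01-02", 5000), ("2020-01-02", 2500),
]

# Aggregate the per-event readings into daily totals.
daily_steps = defaultdict(int)
for day, steps in readings:
    daily_steps[day] += steps

print(dict(daily_steps))  # {'2020-01-01': 7300, '2020-01-02': 7500}
```

Real sensor pipelines do the same thing at vastly larger scale, with the aggregation distributed across many machines and keyed by device as well as by time.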
And because these devices are connected to the internet almost every single moment, the data they generate provides a real-time and unprecedented picture of people's behaviour. Simultaneously, the rise of sensor-based internet-enabled devices is increasing the creation of data to even greater volumes. Famously coined the 'Internet of Things' (IoT), millions of new TVs, thermostats, wearables and even refrigerators are connected to the internet every single day, providing massive additional data sets. Since this development is not expected to stop anytime soon, it could be stated that the race to extract meaningful and valuable information out of these new data sources has only just begun.

[Figure: the three phases of Big Data evolution. Phase 1, Structured Content (1970-2000): RDBMS and data warehousing, Extract-Transform-Load, Online Analytical Processing, dashboards and scorecards, data mining and statistical analysis. Phase 2, Web-Based Unstructured Content (2000-2010): information retrieval and extraction, opinion mining, question answering, web analytics and web intelligence, social media analytics, social network analysis, spatio-temporal analysis. Phase 3, Mobile and Sensor-Based Content (2010-present): location-aware analysis, person-centred analysis, context-relevant analysis, mobile visualization, human-computer interaction.]

5.
Challenges posed by Big Data

Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations. There are certain basic tenets of Big Data that make it simpler to answer what Big Data is:

- It refers to a massive amount of data that keeps growing exponentially with time.
- It is so voluminous that it cannot be processed or analyzed using conventional data processing techniques.
- It includes data mining, data storage, data analysis, data sharing, and data visualization.
- The term is all-comprehensive, including data and data frameworks, along with the tools and techniques used to process and analyze the data.

These characteristics cause many of the challenges that organizations encounter in their big data initiatives. Some of the most common big data challenges include the following.

1. Dealing with data growth / Handling a Large Amount of Data

There has been a huge explosion in the data available. Look back a few years and compare with today, and you will see an exponential increase in the data that enterprises can access. They have data for everything, from what a consumer likes, to how they react to a particular scent, to the amazing restaurant that opened up in Italy last weekend. This data exceeds the amount that can be stored, computed and retrieved; the challenge is not so much the availability as the management of this data. The most obvious challenge associated with big data is simply storing and analyzing all that information. In its Digital Universe report, IDC estimates that the amount of information stored in the world's IT systems is doubling about every two years. Much of that data is unstructured, meaning that it doesn't reside in a database. Documents, photos, audio, videos and other unstructured data can be difficult to search and analyze.
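As a toy illustration of why unstructured documents are hard to query, the sketch below builds a minimal inverted index, the structure underlying full-text search engines: without something like it, finding every document containing a term means scanning every document. The documents are invented for illustration.

```python
from collections import defaultdict

# Three tiny "unstructured" documents; real collections hold millions.
docs = {
    1: "big data needs new storage",
    2: "storage costs are doubling",
    3: "analytics turns data into insight",
}

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["storage"]))  # [1, 2]
print(sorted(index["data"]))     # [1, 3]
```

Production systems add tokenization, stemming, ranking and distribution across machines, but the core trade-off is the same: extra storage and indexing work up front in exchange for fast lookups later.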
Some of the newest ways developed to manage this data are hybrids of relational databases combined with NoSQL databases. An example is MongoDB, which is an inherent part of the MEAN stack. There are also distributed computing systems, such as Hadoop, to help manage Big Data volumes.

In order to deal with data growth, organizations are turning to a number of different technologies. When it comes to storage, converged and hyperconverged infrastructure and software-defined storage can make it easier for companies to scale their hardware. And technologies like compression, deduplication and tiering can reduce the amount of space and the costs associated with big data storage. On the management and analysis side, enterprises are using tools like NoSQL databases, Hadoop, Spark, big data analytics software, business intelligence applications, artificial intelligence and machine learning to help them comb through their big data stores to find the insights their companies need.

2. Generating insights in a timely manner / Real-time can be Complex

A lot of data keeps updating every second, and organizations need to be aware of that too. For instance, if a retail company wants to analyze customer behavior, real-time data from current purchases can help. Data analysis tools are available for this; they come with ETL engines, visualization, computation engines, frameworks and other necessary inputs. It is important for businesses to keep themselves updated with this real-time data, along with the "stagnant" and always-available data. This helps build better insights and enhances decision-making capabilities. However, not all organizations are able to keep up with real-time data, as they are not up to date with the evolving nature of the tools and technologies needed.
Currently, there are a few reliable tools, though many still lack the necessary sophistication.

Of course, organizations don't just want to store their big data; they want to use it to achieve business goals. The most common goals associated with big data projects include the following:

1. Decreasing expenses through operational cost efficiencies
2. Establishing a data-driven culture
3. Creating new avenues for innovation and disruption
4. Accelerating the speed with which new capabilities and services are deployed
5. Launching new product and service offerings

"Everyone wants decision-making to be faster, especially in banking, insurance, and healthcare." To achieve that speed, some organizations are looking to a new generation of ETL and analytics tools that dramatically reduce the time it takes to generate reports. They are investing in software with real-time analytics capabilities that allows them to respond to developments in the marketplace immediately.

3. Recruiting and retaining big data talent / Shortage of Skilled People

There is a definite shortage of skilled Big Data professionals available at this time. This has been mentioned by many enterprises seeking to better utilize Big Data and build more effective data analysis systems. There is a lack of experienced people and certified data scientists or data analysts available at present, which makes the "number crunching" difficult and insight building slow.

In order to deal with talent shortages, organizations have a couple of options. First, many are increasing their budgets and their recruitment and retention efforts. Second, they are offering more training opportunities to their current staff members in an attempt to develop the talent they need from within. Third, many organizations are looking to technology: they are buying analytics solutions with self-service and/or machine learning capabilities.
Designed to be used by professionals without a data science degree, these tools may help organizations achieve their big data goals even if they do not have many big data experts on staff.

4. Integrating disparate data sources

The variety associated with big data leads to challenges in data integration. Big data comes from a lot of different places: enterprise applications, social media streams, email systems, employee-created documents, etc. Combining all that data and reconciling it so that it can be used to create reports can be incredibly difficult. Vendors offer a variety of ETL and data integration tools designed to make the process easier, but many enterprises say that they have not solved the data integration problem yet.

Closely related to the idea of data integration is the idea of data validation. Often organizations get similar pieces of data from different systems, and the data in those different systems doesn't always agree. For example, the ecommerce system may show daily sales at a certain level while the enterprise resource planning (ERP) system has a slightly different number. Or a hospital's electronic health record (EHR) system may have one address for a patient, while a partner pharmacy has a different address on record. The process of getting those records to agree, as well as making sure the records are accurate, usable and secure, is called data governance. Solving data governance challenges is very complex and usually requires a combination of policy changes and technology. Organizations often set up a group of people to oversee data governance and write a set of policies and procedures. They may also invest in data management solutions designed to simplify data governance and help ensure the accuracy of big data stores, and of the insights derived from them.

6. Securing big data / Data Security

A lot of organizations claim that they face trouble with data security.
This happens to be a bigger challenge for them than many other data-related problems. The data that comes into enterprises is made available from a wide range of sources, some of which cannot be trusted to be secure and compliant with organizational standards. Enterprises need to use a variety of data collection strategies to keep up with data needs. This in turn leads to inconsistencies in the data, and then in the outcomes of the analysis. A simple metric such as annual turnover for the retail industry can come out differently if analyzed from different sources of input; a business will need to reconcile the differences and narrow them down to an answer that is valid and interesting.

This data is made available from numerous sources, and therefore has potential security problems. You may never know which channel of data is compromised, thus compromising the security of the data available in the organization and giving hackers a chance to move in. It is necessary to introduce data security best practices for secure data collection, storage and retrieval. Security is a big concern for organizations with big data stores: after all, some big data stores can be attractive targets for hackers or advanced persistent threats (APTs).

7. Organizational resistance

It is not only the technological aspects of big data that can be challenging; people can be an issue too. In the NewVantage Partners survey, 85.5 percent of those surveyed said that their firms were committed to creating a data-driven culture, but only 37.1 percent said they had been successful with those efforts. When asked about the impediments to that culture shift, respondents pointed to three big obstacles within their organizations:

- Insufficient organizational alignment (4.6 percent)
- Lack of middle management adoption and understanding (41.0 percent)
- Business resistance or lack of understanding (41.0 percent)

6.
Big Data Analytics and its classification

6.1 Big Data Analytics

Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight. Firms commonly apply analytics to business data to describe, predict and improve business performance; areas within this include predictive analytics, enterprise decision management, etc. Since analytics can require extensive computation (because of big data), the algorithms and software used in analytics harness the most current methods in computer science. In a nutshell, analytics is the scientific process of transforming data into insight for making better decisions.

The goal of data analytics is to get actionable insights resulting in smarter decisions and better business outcomes. It can help answer the following types of questions:

- What actually happened?
- How or why did it happen?
- What's happening now?
- What is likely to happen next?

There are four types of data analytics:

1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Predictive Analytics: Predictive analytics turns data into valuable, actionable information; it uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics holds a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events. Predictive analytics is all about forecasting.
Whether it is the likelihood of an event happening in the future, forecasting a quantifiable amount or estimating a point in time at which something might happen, these are all done through predictive models. Predictive models typically utilise a variety of variable data to make the prediction. The variability of the component data will have a relationship with what is being predicted (for example, the older a person, the more susceptible they are to a heart attack; we would say that age has a linear correlation with heart-attack risk). These data are then compiled together into a score or prediction. In a world of great uncertainty, being able to predict allows one to make better decisions; predictive models are among the most important models utilised across a number of fields.

Descriptive Analytics: Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to find the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis. A descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike predictive models that focus on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products. An example of this could be a monthly profit and loss statement. Similarly, an analyst could have data on a large population of customers; understanding demographic information on those customers (e.g. 30% of our customers are self-employed) would be categorised as descriptive analytics. Utilising effective visualisation tools enhances the message of descriptive analytics.
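The demographic-breakdown example of descriptive analytics can be sketched in a few lines: count how customers fall into groups and report each group's share. The customer records are invented for illustration.

```python
from collections import Counter

# Hypothetical customer records with an employment attribute.
customers = [
    {"id": 1, "employment": "self-employed"},
    {"id": 2, "employment": "salaried"},
    {"id": 3, "employment": "self-employed"},
    {"id": 4, "employment": "salaried"},
    {"id": 5, "employment": "salaried"},
]

# Descriptive analytics: summarize what the population looks like.
counts = Counter(c["employment"] for c in customers)
total = len(customers)
for group, n in counts.items():
    print(f"{group}: {100 * n / total:.0f}%")
```

Note that nothing here predicts an individual customer's behaviour; the output simply describes the population, which is exactly the descriptive/predictive distinction drawn above.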
Prescriptive Analytics:
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules and machine learning to make predictions and then suggests decision options to take advantage of those predictions. Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implication of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, it can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implication of each option. For example, prescriptive analytics can benefit healthcare strategic planning by leveraging operational and usage data combined with data on external factors such as economic data, population demography, etc.
The next step up in terms of value and complexity is the prescriptive model. The prescriptive model utilises an understanding of what has happened, why it has happened and a variety of "what-might-happen" analyses to help the user determine the best course of action to take. Prescriptive analysis typically deals not with just one individual action but with a host of related actions. A good example of this is a traffic application helping you choose the best route home, taking into account the distance of each route, the speed at which one can travel on each road and, crucially, the current traffic constraints. Another example might be producing an exam timetable such that no students have clashing schedules.

[Figure: 4 types of data analytics: Descriptive (What happened?), Diagnostic (Why did it happen?), Predictive (What is likely to happen?), Prescriptive (What should one do?)]
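The route-choice example above can be sketched as a minimal prescriptive step: enumerate the candidate actions, score each with a simple travel-time model, and recommend the best one. Route names, distances, speeds and traffic factors are all hypothetical.

```python
# A minimal prescriptive-analytics sketch: recommend the best route home
# given distance, free-flow speed, and a current traffic delay factor.
# All routes and figures are invented for illustration.

routes = {
    "highway":   {"km": 20, "kmh": 80, "traffic_factor": 2.0},  # congested now
    "ring_road": {"km": 25, "kmh": 70, "traffic_factor": 1.2},
    "backroads": {"km": 15, "kmh": 40, "traffic_factor": 1.0},
}

def travel_minutes(r):
    """Estimated travel time: free-flow minutes scaled by current traffic."""
    return r["km"] / r["kmh"] * 60 * r["traffic_factor"]

# The prescriptive output is an action (which route to take),
# not just a prediction of travel times.
best = min(routes, key=lambda name: travel_minutes(routes[name]))
print(best)  # prints: backroads
```

This is the pattern the text describes: predictions (the travel-time estimates) combined with a decision rule that turns them into a recommended action.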
Diagnostic Analytics:
In this analysis, we generally use historical data to answer a question or solve a problem; we try to find dependencies and patterns in the historical data of the particular problem. Companies go for this analysis because it gives great insight into a problem, and they also keep detailed information at their disposal; otherwise, data collection may turn out to be individual for every problem and very time-consuming. This is the next step of complexity in data analytics after descriptive analytics. On assessment of the descriptive data, diagnostic analytical tools empower an analyst to drill down and thereby isolate the root cause of a problem. Well-designed business intelligence (BI) dashboards incorporating readings of time-series data (i.e. data over multiple successive points in time) and featuring filters and drill-down capability allow for such analysis.

6.2 Big Data Classification
Big Data can be classified on the basis of the characteristics that are used for designing data architecture for processing and analytics. Table 1.1 gives various classification methods for data and Big Data.

Table 1.1 Various classification methods and examples for data and Big Data

Basis of classification: Examples
Data sources (traditional): data storage such as records, RDBMS, distributed databases, row-oriented in-memory data tables, column-oriented in-memory data tables, data warehouse, server, machine-generated data, human-sourced data, Business Process (BP) data, Business Intelligence (BI) data
Data formats (traditional): structured and semi-structured
Big Data sources: data storage, distributed file system, Operational Data Store (ODS), data marts, data warehouse, NoSQL database (MongoDB, Cassandra), sensor data, audit trail of financial transactions, external data such as web, social media, weather data, health records
Big Data formats: unstructured, semi-structured and multi-structured data
Data store structure: web, enterprise or cloud servers, data warehouse, row-oriented data for OLTP, column-oriented for OLAP, records, graph database, hashed entries for key/value pairs
Processing data rates: batch, near-time, real-time, streaming
Processing Big Data rates: high volume, velocity, variety and veracity; batch, near real-time and streaming data processing
Analysis types: batch, scheduled, near real-time dataset analytics
Big Data processing methods: batch processing (for example, using MapReduce, Hive or Pig), real-time processing (for example, using Spark Streaming, Spark SQL, Apache Drill)
Data analysis methods: statistical analysis, predictive analysis, regression analysis, Mahout, machine learning algorithms, clustering algorithms, classifiers, text analysis, social network analysis, location-based analysis, diagnostic analysis, cognitive analysis
Data usages: human, business process, knowledge discovery, enterprise applications, data stores

7. Big Data Applications
1. Banking
The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently detect fraudulent acts in real time, such as misuse of credit/debit cards, archival of inspection tracks, faulty alteration in customer stats, etc. Large amounts of information stream into banks; managing all this data and getting proper insights is possible only with Big Data analytics. This is important to understand customers and boost their satisfaction, and also to minimize risk and fraud.
2. Government
When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime.
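The real-time fraud detection described for banking above can be sketched, in a deliberately simplified form, as a threshold rule over a customer's historical spending. The amounts and the 3-sigma cutoff are illustrative assumptions, not a production fraud model.

```python
# A minimal fraud-screening sketch: flag a card transaction that deviates
# far from a customer's historical spending pattern.
# Amounts and the 3-sigma rule are illustrative assumptions only.
from statistics import mean, stdev

history = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0, 44.0, 58.5]  # past amounts
mu, sigma = mean(history), stdev(history)

def is_suspicious(amount, k=3.0):
    """Flag an amount more than k standard deviations above the mean."""
    return amount > mu + k * sigma

print(is_suspicious(49.0), is_suspicious(950.0))  # prints: False True
```

Real systems combine many such signals (location, merchant, velocity of transactions) and score them with trained models, but the basic idea of comparing a live event against historical behaviour is the same.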
3. Health Care
Big Data has already started to create a huge difference in the healthcare sector. With the help of predictive analytics, medical professionals and HCPs are now able to provide personalized healthcare services to individual patients. Apart from that, fitness wearables, telemedicine and remote monitoring, all powered by Big Data and AI, are helping change lives for the better. Patient records, treatment plans, prescription information: when it comes to health care, everything needs to be done quickly, accurately and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care.
4. Education
Big Data is also helping enhance education today. Education is no longer limited to the physical bounds of the classroom; there are numerous online educational courses to learn from. Academic institutions are investing in digital courses powered by Big Data technologies for the all-round development of budding learners. Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and implement a better system for the evaluation and support of teachers and principals.
5. Manufacturing
According to the TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improving supply strategies and product quality. In the manufacturing sector, Big Data helps create a transparent infrastructure, thereby predicting uncertainties and incompetencies that can affect the business adversely. Armed with the insight that big data can provide, manufacturers can boost quality and output while minimizing waste, processes that are key in today's highly competitive market.
More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions.
6. Retail
Customer relationship building is critical to the retail industry, and the best way to manage that is to manage big data. Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business. Big data remains at the heart of all those things.
7. IT
One of the largest users of Big Data, IT companies around the world are using Big Data to optimize their functioning, enhance employee productivity, and minimize risks in business operations. By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions even for the most complex of problems.

8. Big Data and Healthcare
Big Data analytics in health care uses the following data sources: (i) clinical records, (ii) pharmacy records, (iii) electronic medical records, (iv) diagnosis logs and notes, and (v) additional data, such as deviations from a person's usual activities, medical leaves from a job, and social interactions.
Healthcare analytics using Big Data can facilitate the following:
1. Provisioning of value-based and customer-centric healthcare
2. Utilizing the 'Internet of Things' for health care
3. Preventing fraud, waste and abuse in the healthcare industry, and reducing healthcare costs (examples of fraud are excessive or duplicate claims for clinical and hospital treatments; an example of waste is unnecessary tests; abuse means unnecessary use of medicines, such as tonics, and of testing facilities)
4. Improving outcomes
5. Monitoring patients in real time.
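Point 3 above (preventing fraud, waste and abuse) often starts with something as simple as flagging duplicate claims. A minimal sketch, with hypothetical claim records:

```python
# A minimal duplicate-claim detection sketch for healthcare fraud/waste
# screening: group claims by (patient, procedure, date) and flag any key
# filed more than once. The claim records below are hypothetical.
from collections import Counter

claims = [
    ("P001", "X-ray",      "2024-03-01"),
    ("P002", "Blood test", "2024-03-01"),
    ("P001", "X-ray",      "2024-03-01"),  # duplicate claim
    ("P003", "MRI",        "2024-03-02"),
]

counts = Counter(claims)
duplicates = [claim for claim, n in counts.items() if n > 1]
print(duplicates)
```

At Big Data scale the same grouping is run over millions of billing records (for example as a distributed aggregation), but the anomaly being searched for is exactly this one.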
Value-based and customer-centric healthcare means cost-effective patient care: improving healthcare quality using the latest knowledge, using electronic health and medical records, and improving coordination among the healthcare-providing agencies, which reduces avoidable overuse and healthcare costs.
The healthcare Internet of Things creates unstructured data. The data enables monitoring of device readings for patient parameters, such as glucose, BP and ECGs, and of the necessity of visiting physicians.
Prevention of fraud, waste and abuse uses Big Data predictive analytics and helps resolve excessive or duplicate claims in a systematic manner. The analytics of patient records and billing help in detecting anomalies such as overutilization of services in short intervals, treatment at different hospitals in different locations simultaneously, or identical prescriptions for the same patient filed from multiple locations.
Improving outcomes is possible by accurately diagnosing patient conditions, early diagnosis, predicting problems such as congestive heart failure, anticipating and avoiding complications, matching treatments with outcomes, and predicting patients at risk for disease or readmission.
Patient real-time monitoring uses machine learning algorithms which process real-time events. They provide physicians the insights to help them make life-saving decisions and allow for effective interventions. The process automation sends alerts to care providers and informs them instantly about changes in the condition of a patient.

9. Big Data in Medicine
Big Data analytics deploys large volumes of data to identify and derive intelligence using predictive models about individuals.
Big Data driven approaches help research in medicine, which can help patients. Big Data offers the potential to transform medicine and the healthcare system by building the health profiles of individual patients and predictive models for diagnosing better and offering better treatment. Following are some findings:
1. Aggregating a large volume and variety of information from multiple sources, from DNA, proteins and metabolites to cells, tissues, organs, organisms and ecosystems, can enhance the understanding of the biology of diseases. Big Data creates patterns and models by data mining and helps in better understanding and research.
2. Deploying wearable device data: the devices record data during active as well as inactive periods, providing a better understanding of patient health and better risk profiling of the user for certain diseases.

10. Big Data in Advertising
The impact of Big Data is tremendous on the digital advertising industry. The digital advertising industry sends advertisements using SMS, e-mail, WhatsApp, LinkedIn, Facebook, Twitter and other media. Big Data technology and analytics provide insights, patterns and models which relate the media exposure of all consumers to the purchase activity of all consumers using multiple digital channels. Big Data helps in identity management and can provide an advertising mix for building better branding exercises.
Big Data captures data from multiple sources in large volume, velocity and variety; the unstructured data enriches the structured data at the enterprise data warehouse. Big Data real-time analytics provides emerging trends and patterns, and gains actionable insights for facing competition from similar products. The data helps digital advertisers to discover new relationships and less competitive regions and areas. Success from advertisements depends on collection, analyzing and mining.
The new insights enable the personalization and targeting of online, social media and mobile advertisements, called hyper-localized advertising. Advertising on digital media needs optimization; too much usage can also have a negative effect. Phone calls, SMSs and e-mail-based advertisements can be a nuisance if sent without appropriate research on the potential targets. The analytics help in this direction. The usage of Big Data after appropriate filtering and elimination, with appropriate data, data forms and data handling in the right manner, is a crucial enabler of Big Data analytics.

Review questions:
1. How do data inputs help in Big Data based customer value analytics?
2. How does Big Data help in credit risk management in financial institutions?
3. How does Big Data analytics enable prevention of fraud, waste and abuse of the healthcare system?
4. Why does Big Data offer the potential to transform the medicine and healthcare system?
5. Why are cloud services used for Big Data analytics for customer acquisition, customer lifetime value analytics and other metrics?

11. Big Data Technologies
Big data technologies can be broadly classified into a few different types that together can be used for a wide range of actions across the entire data lifecycle.

Data Storage
Data storage is an important aspect of data operations. Some big data technologies are primarily responsible for collecting, storing, and managing vast volumes of information for convenient access.

Data Mining
Data mining helps unleash valuable insights through hidden patterns and trends for better understanding. Data mining tools use different statistical methods and algorithms to uncover usable information from unprocessed data sets. Top big data technologies for data mining operations include Presto, RapidMiner, Elasticsearch, MapReduce, Flink, and Apache Storm.
Data Analytics
Big data technologies equipped with advanced analytical capabilities help provide information to fuel critical business decisions and can use artificial intelligence to generate business insights.

Data Visualization
As big data technologies deal with extensive volumes of data (structured, semi-structured, and unstructured, with different levels of complexity), it is essential to simplify this information and make it usable. Visual data formats like graphs, charts, and dashboards are more engaging and easier to comprehend.

[Figure: logos of popular Big Data technologies for storage, mining, analytics and visualization, including Cassandra, KNIME and Plotly]

Data Storage
Typically, this type of big data technology includes infrastructure that allows data to be fetched, stored, and managed, and is designed to handle massive amounts of data. Various software programs are able to access, use, and process the collected data easily and quickly. Among the most widely used big data technologies for this purpose are:

1. Apache Hadoop
Apache Hadoop is an open-source, Java-based framework for storing and processing big data, developed by the Apache Software Foundation. In essence, it provides a distributed storage platform and processes big data using the MapReduce programming model. The Hadoop framework is designed to automatically handle hardware failures since they are common occurrences. The framework consists of five modules, namely Hadoop Distributed File System (HDFS), Hadoop YARN (Yet Another Resource Negotiator), Hadoop MapReduce, Hadoop Common, and Hadoop Ozone.
Companies using Hadoop: LinkedIn, Intel, IBM, MapR, Facebook, Microsoft, Hortonworks, Cloudera, etc.
Key features:
A distributed file system, called HDFS (Hadoop Distributed File System), enables fast data transfer between nodes.
HDFS is a fundamentally resilient file system.
In Hadoop, data that is stored on one node is also replicated on other nodes of the cluster to prevent data loss in case of hardware or software failure.
Hadoop is an inexpensive, fault-tolerant, and extremely flexible framework capable of storing and processing data in any format (structured, semi-structured, or unstructured).
MapReduce is a built-in batch processing engine in Hadoop that splits large computations across multiple nodes to ensure optimum performance and load balancing.

2. MongoDB
MongoDB is an open-source, cross-platform, document-oriented database designed to store and handle large amounts of data while providing high availability, performance, and scalability. Since MongoDB does not store or retrieve data in the form of tables, it is considered a NoSQL database. A newer entrant to the data storage field, MongoDB is very popular due to its document-oriented NoSQL features, distributed key-value store, and MapReduce calculation capabilities. It was named "Database Management System of the Year" by DB-Engines, which isn't surprising since NoSQL databases are more adept at handling Big Data than traditional RDBMSs.
Companies using MongoDB: MySQL, Facebook, eBay, MetLife, Google, Shutterfly, Aadhar, etc.
Key features:
It seamlessly integrates with languages like Ruby, Python, and JavaScript; this seamless integration facilitates high coding velocity.
A MongoDB database stores data in JSON documents, which provide a rich data model that maps effortlessly to native programming languages.
MongoDB has several features that are unavailable in a traditional RDBMS, such as dynamic queries, secondary indexes, rich updates, sorting, and easy aggregation.
In document-based database systems, related data is stored in a single document, making it possible to run queries faster than with a traditional relational database, where related data is stored in multiple tables and later combined using joins.
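The MapReduce model that Hadoop's batch engine implements (and that MongoDB also exposes for calculations) can be illustrated with a single-process word-count sketch: a map phase emits (key, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group.

```python
# A minimal, single-process sketch of the MapReduce model.
# Hadoop distributes exactly these phases across cluster nodes;
# this toy version differs only in scale.
from itertools import groupby

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce each group by summing counts."""
    shuffled = sorted(pairs)
    return {key: sum(c for _, c in group)
            for key, group in groupby(shuffled, key=lambda kv: kv[0])}

docs = ["big data big insights", "data at rest and data in motion"]
counts = reduce_phase(map_phase(docs))
print(counts["data"], counts["big"])  # prints: 3 2
```

The value of the real framework is that the map tasks run in parallel where the data blocks live (data locality on HDFS), and the shuffle moves only the intermediate pairs between nodes.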
3. RainStor
RainStor is a database management system, developed by the RainStor company, that manages and analyzes big data. A de-duplication technique is used to streamline the storage of large amounts of data for reference. Due to its ability to sort and store large volumes of information for reference, it eliminates duplicate files. Additionally, it supports cloud storage and multi-tenancy. The RainStor database product is available in two editions, Big Data Retention and Big Data Analytics on Hadoop, which enable highly efficient data management and accelerate data analysis and queries.
Companies using RainStor: Barclays, Reimagine Strategy, Credit Suisse, etc.
Key features:
With RainStor, large enterprises can manage and analyze Big Data at the lowest total cost.
The enterprise database is built on Hadoop to support faster analytics.
It allows you to run faster queries and analyses using both SQL queries and MapReduce, leading to 10-100x faster results.
RainStor provides the highest compression level. Data is compressed up to 40x (97.5 percent) or more compared to raw data, with no re-inflation required when accessed.

4. Cassandra
Cassandra is an open-source, distributed NoSQL database that enables the in-depth analysis of multiple sets of real-time data. It enables high scalability and availability without compromising performance. To interact with the database, it uses CQL (Cassandra Query Language). With scalability and fault tolerance on cloud infrastructure or commodity hardware, it is an ideal platform for mission-critical data processing. As a major Big Data tool, it accommodates all types of data formats, including structured, semi-structured, and unstructured.
Companies using Cassandra: Facebook, GoDaddy, Netflix, GitHub, Rackspace, Cisco, Hulu, eBay, etc.
Key features:
Cassandra's decentralized architecture prevents single points of failure within a cluster.
Its data replication makes Cassandra suitable for enterprise applications that cannot afford data loss, even when an entire data center fails.
ACID properties (Atomicity, Consistency, Isolation, and Durability) are supported by Cassandra.
It allows Hadoop integration with MapReduce. It also supports Apache Hive and Apache Pig.
Due to its scalability, Cassandra can be scaled up to accommodate more customers and more data as required.

Data Mining
Data mining is the process of extracting useful information from raw data and analyzing it. In many cases, raw data is very large, highly variable, and constantly streaming at speeds that make extraction nearly impossible without special techniques. Among the most widely used big data technologies for data mining are:

5. Presto
Developed by Facebook, Presto is an open-source SQL query engine that enables interactive query analyses on massive amounts of data. This distributed query engine supports fast analytics queries on data sources of various sizes, from gigabytes to petabytes. With this technology, it is possible to query data right where it lives, without moving the data into separate analytics systems; it is even possible to query data from multiple sources within a single query. It supports both relational data sources (such as PostgreSQL, MySQL, Microsoft SQL Server, Amazon Redshift, Teradata, etc.) and non-relational data sources (such as HDFS (Hadoop Distributed File System), MongoDB, Cassandra, HBase, Amazon S3, etc.).
Companies using Presto: Repro, Netflix, Facebook, Airbnb, GrubHub, Nordstrom, Nasdaq, Atlassian, etc.
Key features:
With Presto, you can query data wherever it resides, whether it is in Cassandra, Hive, relational databases, or even proprietary data stores.
With Presto, multiple data sources can be queried at once; this allows you to reference data from multiple databases in one query.
It does not rely on MapReduce techniques and is capable of retrieving data very quickly, within seconds to minutes. Query responses are typically returned within a few seconds.
Presto supports standard ANSI SQL, making it easy to use. The ability to query your data without learning a dedicated language is always a big plus, whether you're a developer or a data analyst.
Additionally, it connects easily to the most common BI (Business Intelligence) tools with JDBC (Java Database Connectivity) connectors.

6. RapidMiner
RapidMiner is an advanced open-source data mining tool for predictive analytics. It is a powerful data science platform that lets data scientists and big data analysts analyze their data quickly. In addition to data mining, it enables model deployment and model operation. With this solution, you have access to all the machine learning and data preparation capabilities you need to make an impact on your business operations. By providing a unified environment for data preparation, machine learning, deep learning, text mining, and predictive analytics, it aims to enhance productivity for enterprise users of every skill level.
Companies using RapidMiner: Domino's Pizza, McKinley Marketing Partners, Windstream Communications, George Mason University, etc.
Key features:
It provides an integrated platform for processing data, building machine learning models, and deploying them.
Further, it integrates with the Hadoop framework through its inbuilt RapidMiner Radoop.
RapidMiner Studio provides access, loading, and analysis of any type of data, whether structured data or unstructured data such as text, images, and media.
Automated predictive modeling is available in RapidMiner.
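The kind of ANSI SQL aggregation described for Presto (tool 5 above) can be illustrated without a cluster. Presto would run such a query in place against sources like Hive or Cassandra; as a stand-in, the sketch below runs the same SQL against an in-memory SQLite database, with an invented orders table.

```python
# A stand-in for a Presto-style ANSI SQL analytics query, run here
# against in-memory SQLite purely for illustration. The table and
# column names are invented; Presto would query live sources in place.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 60.0), ("south", 40.0)],
)

# An analyst-style aggregation, as one might submit to a SQL query engine:
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM orders GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)
conn.close()
```

The point of an engine like Presto is that the same standard SQL works whether the rows live in a relational database, HDFS, or a NoSQL store, so the analyst's query does not change when the storage does.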
