Overview of Apache Spark
Last Updated : 10 Nov, 2020

This article introduces Apache Spark, covers its history, and explains why it is important.

According to Databricks' definition, "Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. It was originally developed at UC Berkeley in 2009." Databricks is one of the major contributors to Spark; others include Yahoo! and Intel. Apache Spark is one of the largest open-source projects for data processing. It is a fast, in-memory data processing engine.

History of Spark :
Spark started in 2009 in the UC Berkeley R&D Lab, which is now known as AMPLab. In 2010, Spark became open source under a BSD license. It was then transferred to the ASF (Apache Software Foundation) in June 2013. The researchers behind Spark had previously worked with Hadoop MapReduce, and at the UC Berkeley R&D Lab they observed that it was inefficient for iterative and interactive computing jobs. Spark was therefore designed to be fast for interactive queries and iterative algorithms, with support for in-memory storage and efficient fault recovery.

[Diagram: history of Spark]

Features of Spark :
Apache Spark can be used for batch processing.
Apache Spark can also be used for stream processing. Previously, Apache Storm / S4 were used for stream processing.
It can be used for interactive processing. Previously, Apache Impala or Apache Tez were used for interactive processing.
Spark is also useful for graph processing. Previously, Neo4j / Apache Giraph were used for graph processing.
Spark can process data in both real-time and batch modes.

So, we can say that Spark is a powerful open-source engine for data processing.
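The point made in the history section, that iterative jobs suffer when every pass goes back to storage and benefit from in-memory data, can be illustrated with a small toy model. This is plain Python and not Spark code: `CountingSource`, `run_reloading`, and `run_cached` are hypothetical names invented for this sketch, and the "storage" is just a list that counts how often it is read in full.

```python
# Toy model (not Spark code): compare how often an iterative job reads its
# input when every pass reloads it versus when it is kept in memory.


class CountingSource:
    """Hypothetical stand-in for slow storage; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        # Each call models one full scan of the stored dataset.
        self.reads += 1
        return list(self._records)


def estimate_mean(get_data, iterations, step=0.5):
    """Iteratively estimate the mean; every pass needs the whole dataset."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


def run_reloading(source, iterations):
    # Every iteration goes back to storage, like a chain of separate jobs.
    return estimate_mean(source.load, iterations)


def run_cached(source, iterations):
    # Load once, keep the data in memory, and reuse it on every iteration.
    cached = source.load()
    return estimate_mean(lambda: cached, iterations)


if __name__ == "__main__":
    for runner in (run_reloading, run_cached):
        source = CountingSource([1.0, 2.0, 3.0, 4.0])
        result = runner(source, iterations=20)
        print(runner.__name__, "estimate:", result, "full reads:", source.reads)
```

Both runs compute the same estimate, but the reloading version scans the source 20 times while the cached version scans it once. In a real cluster each scan would involve disk and network I/O, which is the cost that in-memory storage is meant to avoid.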
Author : Ashish_rana