Apache Hadoop: the start of a big data career

Hadoop, the talk of the town, is the most popular open-source framework upon which a big data architecture can be built. There are many technologies out there that support big data, but Hadoop stands out for the following reasons:

1) Hadoop is an open-source Apache project
2) It hosts all the tools needed to store, process and support big data
3) It comes with an integral NoSQL database, HBase. This avoids the driver-integration issues that arise with third-party NoSQL databases like MongoDB, Cassandra etc.
4) Vendors like Hortonworks have taken up projects to package open-source Apache Hadoop into their own distributions, such as HDP, the Hortonworks Data Platform
5) Hadoop has technologies that offer high availability and low-latency data processing
6) Hadoop is Java based, which makes it possible to run on many different operating systems like UNIX, Linux and Windows without any issues; the platform-independent nature of Java makes this possible

Hadoop is a popular and widely adopted framework to support big data. So, what is the real role of Hadoop in supporting big data projects in real time? Hadoop comprises many tools developed to store big data in chunks that can easily be accessed and processed. Let us see the tools that form the basis of the Hadoop architecture and the role of every component in supporting a big data project:

1) HDFS - the Hadoop Distributed File System. Any system, starting with our desktop PC, is expected to store data in a file system, and Hadoop offers its own file system designed to host data that is big. The default block size in HDFS is 128 MB. HDFS comes with a replication factor, which enables information to be split and distributed across more than one physical machine. Every HDFS cluster needs one NameNode, the admin node that stores the metadata about the files and blocks held on all the other nodes in the cluster. The second set of machines are called DataNodes and they store the actual data. The replication factor determines the number of copies of the data. The NameNode and a DataNode can be installed on the same machine for learning purposes; in production the NameNode should be on a machine different from the DataNodes. HDFS stores both structured and unstructured data, and real-time data can also be stored (see the HDFS client sketch after this list)
2) Flume - used to ingest unstructured data, such as streaming log data, onto HDFS
3) Sqoop - used to move structured data into and out of HDFS. In real-world projects Oracle is a popular relational database; if there is a requirement to transfer data from Oracle onto HDFS, Sqoop can be used. Sqoop is also needed when processed information has to be returned to the client database
4) YARN - we can think of YARN as the operating system of Hadoop. It is the heart of the big data architecture: it allocates resources and schedules the jobs that process the information stored in HDFS
5) HBase - the NoSQL database that comes as part of Hadoop. Tabular information is stored in HBase, while files, images and other unstructured data are stored in HDFS
6) MapReduce - the programming model Hadoop offers for processing; MapReduce jobs run on YARN against the information in HDFS (see the word-count sketch after this list)
7) Pig (Latin) - a scripting language developed for the sake of data analysts that makes processing simple; Pig helps with data analysis
8) Hive - the SQL of big data
9) Oozie - the workflow scheduler used to run tasks in an organized fashion. It can be thought of as a build tool, like Ant or Maven in a software development environment
10) Spark - Apache Spark is an in-memory processing engine that can be used with Hadoop as well as with non-Hadoop environments such as MongoDB. Spark is gaining popularity (see the Spark sketch after this list)
11) Mahout - a machine learning library used for statistical analysis
12) Client/Python/CLI - the layer at which the user interacts with the cluster
13) HUE - Hadoop User Experience, the user interface that supports the Apache Hadoop ecosystem; it makes all of the tools accessible, and the entire ecosystem can be reached through a web browser
14) Zookeeper - a tool used for maintaining configuration information, naming, providing distributed synchronization, and providing group services
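To make the HDFS description in item 1 concrete, here is a minimal sketch of writing a file through the HDFS Java client API and reading back its replication factor and block size. The NameNode address hdfs://namenode-host:8020 and the path /user/demo/hello.txt are placeholders chosen for illustration; on a real cluster fs.defaultFS would normally come from core-site.xml rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode URI; in practice this is picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");

    // Create (overwrite if present) a small file; the NameNode records the metadata,
    // the DataNodes hold the replicated blocks
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello from the hdfs java client");
    }

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());
    System.out.println("Block size (bytes): " + status.getBlockSize());
  }
}
```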
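The MapReduce model from item 6 is easiest to see in the classic word-count job: the mapper emits (word, 1) pairs, YARN schedules the map and reduce tasks, and the reducer sums the counts per word. The sketch below follows the standard Hadoop MapReduce Java API; the input and output HDFS paths are passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: split each input line into tokens and emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```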
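For comparison with MapReduce, here is the same word count written against the Spark Java API (item 10), which keeps intermediate data in memory rather than writing every stage back to disk. The local[*] master and the argument-supplied paths are illustrative assumptions; on a Hadoop cluster the job would typically be submitted to YARN through spark-submit instead.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // local[*] runs on all local cores; on a cluster the master would be YARN
    SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]); // e.g. an HDFS path

    // Split lines into words, pair each word with 1, then sum the counts per word
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile(args[1]); // write results back out, e.g. to HDFS
    sc.close();
  }
}
```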
Why is Hadoop used for big data? When we talk of big data, the first thing that comes to mind is Oracle Exadata. Oracle has built a sophisticated machine with a powerful operating system to process data that is big. Hadoop, however, is an open-source project developed to cater to the need to store and process both structured and unstructured data. Hence, Hadoop is the preferred architecture in the big data space.

Is Hadoop the only architecture available to support big data? No, but Hadoop is the preferred choice and the most widely accepted one.