The Beginner’s Guide to Big Data

What is big data and how does it work? Join us on a deep dive into big data and the technologies you need to extract actionable insights for your organization.

What Is Big Data?

Today’s businesses collect vast amounts of data from a variety of sources, and it must often be analyzed in real time. Big data refers to data that is too big, too fast, or too complex to process using traditional techniques. But the term also encompasses the technologies and strategies that big data makes possible in intelligence-generating fields such as predictive analytics, the internet of things, artificial intelligence, and more.

Research and Markets reports that the global big data market is expected to reach $156 billion by 2026—and companies have many good reasons to get on board. Here’s a look at what big data is, where it comes from, what it can be used for, and how companies can prepare their IT infrastructures for big data success.

The Three Vs of Big Data

While the concept of big data has been around for a long time, industry analyst Doug Laney was the first to coin the three Vs of big data in 2001. The three Vs are:

  • Volume: The quantity of data that must be processed (usually a lot—gigabytes, exabytes, or more)
  • Variety: The wide-ranging types of data, both structured and unstructured, streaming from many different sources
  • Velocity: The speed at which new data is streaming into your system

Some data experts extend the definition to four, five, or more Vs. The fourth and fifth Vs are:

  • Veracity: The quality of the data with respect to its accuracy, precision, and reliability
  • Value: The value the data provides—what is it worth to your business?

While the list can go all the way up to 42 Vs, these five are the most commonly used to define big data.

There are also two different flavors of big data, which differ in how they’re processed and what questions and queries they’re used to answer.

  • Batch processing is typically used with large amounts of stored historical data to inform long-term strategies or answer big questions. Think: huge amounts of data with complex, in-depth analysis.
  • Streaming data is less about answering big questions than it is about getting immediate, real-time information for on-the-fly purposes, such as maintaining the accuracy of a manufacturing process. It’s typically used with large amounts of data that are moving at a rapid pace. Think: huge amounts of high-velocity data with less complex but extremely rapid analysis.
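The contrast between the two flavors can be shown in a minimal Python sketch. The sensor readings, window size, and alert threshold below are all made up for illustration: a batch pass computes one in-depth answer over stored history, while the streaming loop updates a lightweight rolling metric per event to flag anomalies as they arrive.

```python
from collections import deque

# Hypothetical sensor readings; in practice these would come from
# long-term storage (batch) or a message queue (streaming).
readings = [20.1, 19.8, 21.3, 25.0, 20.5, 19.9]

# Batch processing: one in-depth pass over stored history,
# e.g. a long-term average that informs strategy.
batch_average = sum(readings) / len(readings)

# Stream processing: a small rolling window updated per event,
# used to flag anomalies the moment they occur.
window = deque(maxlen=3)
alerts = []
for reading in readings:            # imagine these arriving live
    window.append(reading)
    rolling = sum(window) / len(window)
    if abs(reading - rolling) > 2.0:    # arbitrary threshold
        alerts.append(reading)
```

The batch result (`batch_average`) answers a "big question" after the fact; the streaming loop catches the outlier reading (25.0) in real time, without ever holding the full data set in memory.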

Learn more about the difference between big data vs. traditional data.

Where Does Big Data Come From?

Big data broadly describes the unstructured, modern data collected today and how it’s used for in-depth intelligence and insights. Common sources include:

  • The internet of things and data from billions of devices and sensors
  • Machine-generated log data used for log analytics
  • Software, platforms, and enterprise applications
  • Human beings: social media, transactions, clicks online, health records, natural resource consumption, etc.
  • Research data from the scientific community and other organizations

Types of Big Data: Structured vs. Unstructured

Different types of data require different types of storage. This is the case with structured and unstructured data, which require different types of databases, processing, storage, and analysis.

Structured data is traditional data that fits neatly into tables. It’s easily categorized and formatted into entries with standard values like prices, dates, and times.

Unstructured data is modern data that isn’t quite as simple or easy to input into a table. Unstructured data is often synonymous with big data today and will account for an estimated 80% of data in the coming years. It includes all the data generated by social media, IoT, content creators, surveillance, and more. It can include text, images, sound, and video. It’s the driving force behind new storage categories such as FlashBlade® unified fast file and object (UFFO). To make use of unstructured data, companies need more storage, more processing power, and better consolidation of numerous data types.
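The difference is easiest to see side by side. In this minimal Python sketch (with made-up records), the structured row can be queried directly by column, while the unstructured social media record first needs parsing or feature extraction before analysis:

```python
# Structured: fits neatly into a table with fixed, typed columns.
structured_row = {"order_id": 1001, "price": 19.99, "date": "2024-05-01"}

# Unstructured: free-form content with no fixed schema; the same
# "record" might mix text, media references, and nested metadata.
unstructured_record = {
    "post": "Loving the new product! #happy",
    "attachments": ["photo.jpg"],
    "metadata": {"likes": 42, "platform": "social"},
}

# Structured data can be queried directly by column...
order_total = structured_row["price"]

# ...while unstructured data needs a processing step first,
# e.g. extracting hashtags from free text before any analysis.
hashtags = [w for w in unstructured_record["post"].split() if w.startswith("#")]
```

That extra processing step is exactly why unstructured data demands more storage, more compute, and better consolidation than its tabular counterpart.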

Learn more about structured data vs. unstructured data.

What Does the Big Data Lifecycle Look Like?

The lifecycle of big data can include but is not limited to the following:

  1. Data is extracted and gathered. Data could be coming from a variety of sources, including enterprise resource planning systems, IoT sensors, software such as marketing or point-of-sale applications, streaming data via APIs, and more. The output of this data will vary, which makes ingestion an important next step. For example, data coming from the stock market will be vastly different from log data of internal systems.
  2. Data is ingested. Extract-transform-load (ETL) pipelines transform data into the right format. Whether it’s headed to a SQL database or a data visualization tool, data needs to be transformed into a format the tool can understand. For example, names may arrive in inconsistent formats and need to be standardized. At this point, data is ready for analysis.
  3. Data is loaded into storage for processing. Next, data is stored somewhere, whether that’s in a cloud-based data warehouse or on-premises storage. This can happen in different ways, depending on if the data is loaded in batches or event-based streaming occurs around the clock. (Note: This step may happen before the transformation step, depending on business needs.)

    Learn more: What Is a Data Warehouse?

  4. Data is queried and analyzed. Modern, cloud-based compute, processing, and storage tools are having a big impact on the evolution of the big data lifecycle. (Note: Certain modern tools like Amazon Redshift can bypass ETL processes and enable you to query data much faster.) 
  5. Data is archived. Whether it’s stored for the long term in cold storage, or it’s kept “warm” in more accessible storage, time-sensitive data that has served its purpose will go into storage. If immediate access is no longer required, cold storage is an affordable, space-efficient way to store data, especially if it’s to meet compliance requirements or inform long-term strategic decision-making. This also reduces the performance impacts of keeping petabytes of cold data on a server that also holds hot data.
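The lifecycle steps above can be condensed into a toy Python sketch. The in-memory lists stand in for real sources, a warehouse, and cold storage; the record shapes and values are invented for illustration:

```python
# 1. Extract: raw records from two hypothetical sources, in different shapes.
raw = [
    {"name": "ALICE SMITH", "amount": "19.99"},   # point-of-sale export
    {"name": "bob jones",   "amount": "5.00"},    # web API payload
]

# 2. Ingest/transform (ETL): normalize names and types into one format.
cleaned = [
    {"name": r["name"].title(), "amount": float(r["amount"])}
    for r in raw
]

# 3. Load: append the cleaned rows to a stand-in "warehouse" table.
warehouse = []
warehouse.extend(cleaned)

# 4. Query/analyze: aggregate over the loaded data.
total_sales = sum(row["amount"] for row in warehouse)

# 5. Archive: move processed rows to cheap "cold" storage
#    so hot storage only holds what's actively queried.
cold_storage = warehouse.copy()
warehouse.clear()
```

Real pipelines swap these lists for message queues, ETL tools, and warehouses, but the flow is the same: extract, normalize, load, analyze, archive.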

What Can Businesses Do with Big Data?

There are many exciting, effective uses for big data. Its value lies in the business breakthroughs that big data insights can help to drive. Goals and applications for big data often include:

  • Real-time insights and intelligence on the fly from analysis of streaming data to trigger alerts and identify anomalies
  • Predictive analytics
  • Business intelligence
  • Machine learning
  • Risk analysis to help prevent fraud and data breaches and reduce security risks
  • Artificial intelligence, including image recognition, natural language processing, and neural networks
  • Improving user experience and customer interactions through recommendation engines and predictive support
  • Reducing cost and inefficiencies in processes (internal, manufacturing, etc.)
  • Data-driven marketing and communications, with analysis of millions of social media, consumer, and digital advertising data points created in real time

See more industry-specific big data use cases and applications.

How Is Big Data Stored?

Big data has unique demands, especially in terms of data storage. It’s almost constantly being written to a database (as is the case with real-time streaming data), and it often contains a huge variety of formats. As a result, big data is often best stored, at least initially, in schemaless (unstructured) environments on a distributed file system so that processing can happen in parallel across massive data sets. This makes it a great fit for an unstructured storage platform that can unify file and object data.
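The parallelism that distributed storage enables can be sketched in Python. The "partitions" below are invented stand-ins for files spread across a distributed store; each worker processes one partition independently, and the partial results are combined at the end:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log-file partitions on a distributed store.
partitions = [
    ["error", "ok", "ok"],
    ["ok", "error", "error"],
    ["ok", "ok", "ok"],
]

def count_errors(lines):
    """Per-partition work: count error lines in one chunk."""
    return sum(1 for line in lines if line == "error")

# Each partition is processed independently and in parallel,
# then the partial counts are merged into one answer.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_partition = list(pool.map(count_errors, partitions))

total_errors = sum(per_partition)
```

Frameworks like Spark or MapReduce apply this same split-process-merge pattern at cluster scale, which is why data layout across storage nodes matters so much.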

Learn more about the difference between a data hub vs. a data lake.

How Edge Computing Is Driving Demand for Big Data

The rise of the internet of things (IoT) has led to an increase in the volume of data that must be managed across fleets of distributed devices. 

Instead of waiting for IoT data to be transferred and processed remotely at a centralized location such as a data center, edge computing is a distributed computing topology where information is processed locally at the “edge”: the intersection between people and devices where new data is created. 

Edge computing doesn’t just save businesses money and bandwidth, it also allows them to develop more efficient, real-time applications that offer a superior user experience to their customers. This trend is only going to accelerate in the coming years with the rollout of new wireless technologies such as 5G.

As more and more devices are connected to the internet, the amount of data that must be processed in real time and on the edge is going to increase. So how do you provide data storage that is distributed and agile enough to meet the increasing data storage demands of edge computing? The short answer is container-native data storage. 

When we look at existing edge platforms such as AWS Snowball, Microsoft Azure Stack, and Google Anthos, we see that they are all based on Kubernetes, a popular container orchestration platform. Kubernetes enables these environments to run workloads for data ingestion, storage, processing, analytics, and machine learning at the edge. 

A multi-node Kubernetes cluster running at the edge needs an efficient, container-native storage engine that caters to the specific needs of data-centric workloads. In other words, containerized applications running on the edge require container-granular storage management. Portworx® is a data services platform that provides a stateful fabric for managing data volumes that are container-SLA-aware.

Learn more about the relationship between Big Data and IoT.

Scalable All-Flash Data Storage for All Your Big Data Needs

The benefits of hosting big data on all-flash arrays include:

  • Higher velocities (55-180 IOPS for HDDs vs. 3K-40K IOPS with SSDs)
  • Massive parallelism with over 64K queues for I/O operations
  • NVMe performance and reliability

Why Choose Pure Storage® for Your Big Data Needs?

The relative volume, variety, and velocity of big data is constantly changing. If you want your data to stay big and fast, you’ll want to make sure you’re consistently investing in the latest storage technologies. Advances in flash memory have made it possible to deliver custom all-flash storage solutions for all your data tiers. Here’s how Pure can help power your big data analytics pipeline:

  • All the benefits of all-flash arrays
  • Consolidation into a unified, performant data hub that can handle high-throughput data streaming from a variety of sources
  • Truly non-disruptive Evergreen™ program upgrades with zero downtime and no data migrations
  • A simplified data management system that combines cloud economics with on-premises control and efficiency

Fast and efficient scale-out flash storage with FlashBlade
