It is not news that our capacity to gather and store
immense amounts of data has grown by leaps and bounds.
A few years ago, it was unthinkable for a free email account to offer more
than 10 or 20 megabytes of storage. Today, a free account holds thousands of times
that amount. But that's barely scratching the surface compared to the truly
massive data collection projects now under way.
The Large Synoptic Survey Telescope is slated to come
online in 2016. When it's operational, estimates are that it will acquire
knowledge of our universe at the rate of 140 terabytes of data every five
days, or better than 10 petabytes a year – that's
10,000,000,000,000,000 bytes per year, or, roughly every two days, more data
than is contained in every book ever written. And who knows how much info the Large
Hadron Collider will be spewing out by then? In 2010 alone, the LHC gathered
13 petabytes' worth. And then there's Google, processing in the neighborhood
of 24 petabytes. Per day.
Only a few years ago, a gigabyte (one billion bytes)
was thought to be a lot of data. Now it's nothing. Even home hard drives can
store a terabyte (one trillion) these days. The commercial and governmental
sectors regularly handle petabytes (quadrillion), while researchers routinely
chat about the looming frontiers: exabytes
(quintillion), zettabytes (sextillion), and yottabytes (septillion). It has not been necessary to
name the next one after that. Yet.
But it's not just the Googles
and NASAs of the world that are dealing with that kind of data. Virtually
every Fortune 500 company has a massive data warehouse where
it's accumulated millions of documents and billions of data records from
inventory systems, ecommerce systems, and marketing-analytics software.
You bump up against this kind of massive data
collection every time you swipe your credit card at Walmart.
The retail giant processes more than a million transactions just like yours
every hour and dumps the results into a database
that currently contains more than 2.5 petabytes of data. That's equivalent to
all the information contained in all the books in the Library of Congress
about 170 times over.
These increasingly large mounds of data have begun to
befuddle even the geekiest members of those organizations.
Our ability to collect massive amounts of data
continues to grow at an exponential rate. But the more we collect, the harder
it becomes to derive anything meaningful from it. After all, what on earth do
you do with all this stuff? How do you sort it? How do you search it?
How do you analyze it so that something useful comes out the other end?
That's the problem facing developers, for whom the traditional tools of
database management are powerless against such an onslaught. Data
stores have far outgrown our ability to keep the data neat, clean, and tidy,
and hence easy to analyze. What we have now is a mess of varying types of
data – with shifting definitions, inconsistent implementations, even
outright free-form content – that needs to be analyzed at massive
scale. It's a problem of both size and complexity.
Which brings us face to face with the
hottest tech buzzword of 2012: Big Data.
Supersized Us
The idea that data can be supersized is, of course, not
new. But what is new is a convergence of technologies that deal with
it in some efficient, innovative, and highly creative ways. Though Big Data
is a market that's still in its infancy, it is consuming increasingly large
chunks of the nation's overall IT budget. How much actually is being spent
depends on how you define the term; hard numbers are impossible to come by.
Conservative estimates put the market at somewhere between $20 billion and $55
billion by 2015. Out at the high end, Pat Gelsinger,
COO of data-storage giant EMC, claims that it is already a $70-billion market
– and growing at 15-20% per year.
Take your pick. But regardless, it's
small wonder that venture capitalists are falling all over themselves to
throw money at this tech. Accel Partners launched a
$100 million Big Data fund last November, and IA Ventures initiated its
$105-million IAVS Fund II in February. Even American Express has ponied up
$100 million to create a fund to invest in the sector.
Down in Washington, DC, the White House has predictably
jumped into the fray, with an announcement on March 29 that it was committing
$200 million to develop new technologies to manipulate and manage Big Data in
the areas of science, national security, health, energy, and education.
John Holdren, director of the
White House's Office of Science and Technology Policy, paid lip service to
the private sector, saying that while it "will take the lead on big data,
we believe that the government can play an important role, funding big data
research, launching a big data workforce, and using big data approaches to
make progress on key national challenges."
At the same time, the National Institute of Standards
and Technology (NIST) will be placing a new focus on big data. According to
IT Lab Director Chuck Romine, NIST will be increasing its work on standards,
interoperability, reliability, and usability of big data technologies, and
predicts that the agency will "have a lot of impact on the big data
question."
CRM = customer relationship management
ERP = enterprise resource planning
ETL = extract, transform, and load
HDFS = Hadoop Distributed File System (for Big Data storage)
SQL = a programming language for managing relational databases
NoSQL = not just SQL
NGDW = next-generation data warehouse
No shocker, the Department of Defense is also already
hip-deep in the sector, planning to spend about $250 million annually –
including $60 million committed to new research projects – on Big Data.
And of course you know that DARPA (the Defense Advanced Research Projects
Agency) has to have its finger in the pie. It's hard at work on the XDATA
program, a $100-million effort over four years to "develop computational
techniques and software tools for sifting through large structured and
unstructured data sets."
If much of this seems a bit fuzzy, here's an easy way
of thinking about it: Suppose you own the mineral rights to a square mile of
the earth. In this particular spot, there were gold nuggets lying on the
surface and a good deal more easily accessible gold just below ground, and you've
already mined all of it. Your operation thus far is analogous to the stripping of
chunks of useful information from the available data using traditional
methods.
But suppose there is a lot more gold buried deeper
down. You can get it out and do so cost-effectively, but in order to
accomplish that you have to sink mine shafts deep into the earth and then off
at various angles to track the veins of precious-metal-bearing rock (the
deepest mine on earth is in South Africa, and it plunges two miles down).
That's a much more complex operation, and extracting gold under those
conditions is very like pulling one small but exceedingly useful bit of information
out of a mountain-sized conglomeration of otherwise-useless Big Data.
So how do you do it?
You do it with an array of new, exciting, and rapidly
evolving tools. But in order to understand the process, you'll first have to learn
the meaning of some acronyms and terms you may not yet be familiar with – they're
collected in the glossary above. Sorry about that.
With these in mind, we can now interpret this diagram,
courtesy of Wikibon, which lays out the
traditional flow of information within a commercial enterprise:
Here you can see that data generated by three different
departments – customer relationship management, enterprise resource
planning, and finance – are funneled into a processor that extracts
the relevant material, transforms it into a useful format (like a
spreadsheet), and loads it into a central storage area, the relational
data warehouse. From there, it can be made available to whichever end
user wants or needs it, whether someone in-house or an external customer.
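To make that extract-transform-load (ETL) pipeline concrete, here's a minimal sketch in Python. The file names, column names, and the SQLite file standing in for the data warehouse are all hypothetical; the point is simply the three-step pattern just described.

# Minimal ETL sketch: pull rows out of departmental exports, normalize them,
# and load them into a central store. All names here are illustrative only.
import csv
import sqlite3

def extract(path):
    """Extract: read the raw rows from one department's CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows, source):
    """Transform: normalize each record into a common (source, id, amount) shape."""
    return [(source, row["customer_id"].strip(), float(row["amount"])) for row in rows]

def load(records, db_path="warehouse.db"):
    """Load: append the cleaned records to the central data warehouse."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS facts (source TEXT, customer_id TEXT, amount REAL)")
    con.executemany("INSERT INTO facts VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Hypothetical exports from the CRM and ERP systems in the diagram.
    for source, path in [("crm", "crm_orders.csv"), ("erp", "erp_invoices.csv")]:
        load(transform(extract(path), source))

From that central table, any in-house analyst or external customer can then query the data with ordinary SQL.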
Enter the Elephant
The old system works fine within certain parameters, but in many ways it's
becoming Stone-Age stuff, because it only works when the raw amount of
input is not too large, the input is structured in a way that is easy to
process (traditionally, in rows and columns), and the desired output is not
too complex. Heretofore, as businesses were interested mainly in such
things as generating accurate financial statements and tracking customer
accounts, that was all that was needed.
However, potential input that could be of value to a
company has increased exponentially in volume and variety, as well as in the
speed at which it is created. Social media, as we all know, have exploded.
700 million Facebook denizens, a quarter of a billion Twitter users, 150
million public bloggers – all these and more are churning out content
that is being captured and stored. Meanwhile, 5 billion mobile-phone owners
are having their calls, texts, IMs, and locations logged. Online transactions
of all different kinds are conducted by the billions every day. And there are
networked devices and sensors all over the place, streaming information.
This amounts to a gargantuan haystack. And what is
more, much of this haystack consists of material that is only
semi-structured, if not completely unstructured, making it impossible for
traditional processing systems to handle. So if you're combing the hay,
looking for the golden needle – let's say, two widely separated but
marginally related data points that can be combined into a meaningful whole for
you – you won't be able to find it without a faster and more practical
method of getting to the object of your search. You must be able to maneuver
through Big Data.
Some IT pros could see this coming, and so they
invented – ta dah – a little elephant:
[Hadoop's elephant logo – image: Apache.org]
Hadoop was originally created by Doug Cutting at Yahoo!
and was inspired by MapReduce, a data-processing framework Google developed
to build its index of the Web. The basic concept was simple: Instead
of poking at the haystack with a single, big computer, Hadoop
relies on a series of nodes running massively parallel processing (MPP)
techniques. In other words, it employs clusters of the smaller,
less-expensive machines known as "commodity hardware" – whose
components are common and unspecialized – and uses them to break up Big
Data into numerous parts that can be analyzed simultaneously.
That takes care of the volume problem and eliminates
the data-ingesting choke point caused by reliance on a single, large-box
processor. Hadoop clusters can scale up to the
petabyte and even exabyte level.
But there's also that other obstacle – namely,
that Big Data comes in semi- or unstructured forms that are resistant to
traditional analytical tools. Hadoop addresses this
problem with its default storage layer, the Hadoop
Distributed File System (HDFS). HDFS is specially tailored to store data that
aren't amenable to organization into the neatly structured rows and columns
of relational databases.
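As a rough mental model only (this is a conceptual sketch, not the actual HDFS code), the short Python example below captures the core idea: a big file is chopped into fixed-size blocks, and each block is copied onto several different machines. The 64-megabyte block size and three-way replication mirror HDFS's long-standing defaults; the node names are invented.

# Toy illustration of HDFS-style storage: split a file into fixed-size blocks
# and place copies of each block on several nodes. Conceptual sketch only.
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # HDFS's traditional default block size (64 MB)
REPLICATION = 3                 # default replication factor

def place_blocks(file_size_bytes, nodes):
    """Map each block index to the cluster nodes that hold a copy of it."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    ring = itertools.cycle(nodes)                    # naive round-robin placement
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(num_blocks)}

if __name__ == "__main__":
    cluster = ["node%02d" % i for i in range(1, 11)]   # ten commodity boxes
    layout = place_blocks(1 * 1024**4, cluster)        # a hypothetical 1 TB file
    print(len(layout), "blocks; block 0 is replicated on", layout[0])

Because every block lives on three machines, losing a node costs nothing but some rebalancing, which is the redundancy mentioned in the next paragraph.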
After the node clusters have been loaded, queries can
be submitted to the system, usually as jobs written in Java. Instead of
returning the relevant data to be worked on in some central processor, Hadoop
causes the analysis to occur at each node simultaneously. There is also
redundancy, so that if one node fails, another node holds a copy of its data.
The MapReduce part of Hadoop then goes to work through its two functions:
"Map" divides the job into parts and processes them in parallel at the
node level, and "Reduce" aggregates the results and delivers them to
the inquirer.
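Here's what those two phases look like in a stripped-down, plain-Python word count rather than Hadoop's Java API. The text "blocks" are invented stand-ins for data spread across the cluster; in a real Hadoop job, each map call would run on the node that already holds its block.

# Conceptual Map/Reduce sketch using the classic word-count example.
from collections import Counter
from functools import reduce

def map_phase(block_of_text):
    """Map: turn one node's slice of the data into partial (word, count) results."""
    return Counter(block_of_text.lower().split())

def reduce_phase(partial_counts):
    """Reduce: fold the partial results from every node into one final answer."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())

if __name__ == "__main__":
    blocks = [                         # invented stand-ins for distributed data blocks
        "big data big opportunity",
        "data beats opinion",
        "big clusters crunch big data",
    ]
    partials = [map_phase(b) for b in blocks]   # in Hadoop, these run in parallel
    print(reduce_phase(partials).most_common(3))

Swap the word count for any per-record computation and the same split-then-aggregate pattern carries over.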
After processing is completed, the resulting
information can be transferred into existing relational databases, data
warehouses, or other traditional IT systems, where analysts can refine it
further. Queries can also be written in SQL – a language with which more
programmers are familiar – and converted into MapReduce jobs.
One of the beauties of Hadoop
– now a project of the Apache Software Foundation – is that it is
open source. Thus, it's always unfinished. It evolves, with hundreds of
contributors continuously working to improve the core technology.
Now trust us, the above explanation is pared down to
just the barest of bones of this transformational tech. If you're of a
seriously geeky bent (want to play in your very own Hadoop
sandbox? – you can: the
download is free) or are simply masochistic, you can pursue the subject
down a labyrinth that'll force you to learn about a bewildering array of Hadoop subtools with such
colorful names as Hive, Pig, Flume, Oozie, Avro,
Mahout, Sqoop, and Bigtop. Help yourself.
Numerous small startups have, well, started up in order
to vend their own Hadoop distributions, along with
different levels of proprietary customization. Cloudera
is the leader at the moment, as its big-name personnel lineup includes Hadoop creator Cutting and data scientist Jeff Hammerbacher from Facebook. Alternatively, there is Hortonworks, which also emerged from Yahoo! and
went commercial last November. MapR is another name
to watch. Unfortunately, the innovators remain private, and there are no
pure-investment plays as yet in this space.
It isn't simply about finding that golden needle in the
haystack, either. The rise of Hadoop has enabled
users to answer questions no one previously would have thought to ask. Author
Jeff Kelly, writing on Wikibon, offers this outstanding example (emphasis
ours):
"[S]ocial networking data [can be] mined to determine which
customers pose the most influence over others inside social networks. This
helps enterprises determine which are their 'most important' customers, who
are not always those that buy the most products or spend the most but those
that tend to influence the buying behavior of others the most."
Brilliant – and now possible.
Hadoop is, as noted, not the be-all and end-all of Big-Data
manipulation. Another technology, called the "next generation data
warehouse" (NGDW), has emerged. NGDWs are similar to MPP systems that
can work at the tera- and sometimes petabyte level.
But they also have the ability to provide near-real-time results to complex
SQL queries. That's a feature lacking in Hadoop, which
achieves its efficiencies by operating in batch-processing mode.
The two are somewhat more complementary than
competitive, and results produced by Hadoop can be
ported to NGDWs, where they can be integrated with more structured data for
further analysis. Unsurprisingly, some vendors have appeared that offer
bundled versions of the different technologies.
For their part, rest assured that the major players
aren't idling their engines on the sidelines while
all of this races past. Some examples: IBM has entered the space in a big
way, offering its own Hadoop platform; Big Blue
also recently acquired a leading NGDW, as did HP; Oracle has a Big-Data
appliance that joins Hadoop from Cloudera with its own NoSQL
database technology; EMC scooped up data-warehouse vendor Greenplum; Amazon employs Hadoop
in its Elastic MapReduce cloud; and Microsoft will
support Hadoop on its Azure cloud.
And then there's government. In addition to the
executive-branch projects mentioned earlier, there is also the rather creepy,
new, $2-billion NSA facility being built in Utah. Though its purpose is top
secret, what is known is that it's being designed with the capability of
storing and analyzing the electronic footprint – voice, email, Web
searches, financial transactions, and more – of every citizen in the
US. Big Data indeed.
The New Big World
From retail to finance to government to health care
– where an estimated $200 billion a year could be saved by the
judicious use of Big Data – this technology is game-changing. Not
necessarily for the better, as the superspy facility may portend.
And even outside the NSA, there are
any number of serious implications to deal with. Issues related to
privacy, security, intellectual property, liability, and much more will need
to be addressed in a Big-Data world.
We'd better get down to it, because this tech is coming
right at us – and it is unstoppable.
In fact, the only thing slowing it at all is a shortage
of expertise. It's happened so fast that the data scientists with the proper
skill sets are in extremely short supply – a situation that is
projected to get worse before it gets better. Management consulting firm
McKinsey & Co. predicts that by 2018, "the United States alone could
face a shortage of 140,000 to 190,000 people with deep analytical skills, as
well as [a further shortage of] 1.5 million managers and analysts with the
know-how to use the analysis of big data to make effective decisions."
If you know any bright young kids with the right turn
of mind, this is definitely one direction in which to steer them.
The opportunity exists not just for aspiring
information-miners. Just as the relational database – which started as
a set of theoretical papers by a frustrated IBM engineer fed up with the
status quo in the field – has grown from academic experiments
and open-source projects into a multibillion-dollar-per-year industry with
players like Microsoft and Oracle and IBM, so too is Big Data at the
beginning of a rapid growth curve. From today's small companies and hobby
projects will come major enterprises. Stories like that of
MySQL – an open-source project acquired by Sun Microsystems for $1
billion in 2008 – are coming in Big Data.
While there's
no pure way to invest in the innovators working to manage Big Data, there are
opportunities in technology that – so far – are under most
investors' radar screens. One involves directly investing in peer-to-peer
lending, which Alex Daley – our chief technology investment strategist
and editor of Casey Extraordinary Technology – will detail at
the Casey Research/Sprott, Inc. Navigating the Politicized Economy Summit in
Carlsbad, California from September 7-9.
Alex will be joined
by a host of wildly successful institutional investors that includes our own
Doug Casey... Eric Sprott of Sprott,
Inc... resource investing legend Rick Rule... as
well as many other financial luminaries, some of whom you've probably never
heard of (but only because they have no interest in the limelight). Together,
they'll show you how governmental meddling in markets has created a
politicized economy that works against most investors. More important,
they'll provide you with investment strategies and specific stock
recommendations that they're using right now to leverage this new economy to
protect and grow their own wealth.
Navigating
the Politicized Economy is a rare
opportunity to not only discover the action plans of some of the most
successful investors in the world, but to also ask them specific questions
about your portfolio and stocks you've been following. Just as important,
you'll also have the opportunity to network with like-minded individuals,
including the executives of some of our favorite companies and the research
team of Casey Research.
Right now,
you can save $300 on registration, but this opportunity is only good through
11:59 p.m. August 3. Get the details and reserve your seat today.