2019 Data & AI Trends & Challenges
Landscape
It has been another intense year in the world of data, full of excitement but also
complexity.
A few years ago, the discussion around “Big Data” was mostly a technical one, centered
around the emergence of a new generation of tools to collect, process and analyze
massive amounts of data. Many of those technologies are now well understood, and
deployed at scale. In addition, over the last couple of years in particular, we’ve started
adding layers of intelligence through data science, machine learning and AI into many
applications, which are now increasingly running in production in all sorts of consumer
and B2B products.
As those technologies continue to both improve and spread beyond the initial group of
early adopters (FAANG and startups) into the broader economy and world, the
discussion is shifting from the purely technical into a necessary conversation around
impact on our economies, societies and lives.
We’re just starting to truly get a sense of the nature of the disruption ahead.
In a world where data-driven automation becomes the rule (automated products,
automated cars, automated enterprises), what is the new nature of work? How do we
handle the social impact? How do we think about privacy, security, freedom?
Meanwhile, the underlying technologies continue to evolve at a rapid pace, with an ever
vibrant ecosystem of startups, products and projects, heralding perhaps even more
profound changes ahead. In that ecosystem, the year was characterized by the early
innings of a long expected consolidation, and perhaps a changing of the guard from one
era to another as early technologies are starting to give way to the next generation.
To try and make sense of it all, this is our sixth landscape and “state of the union” of
the data and AI ecosystem. For anyone interested in tracking the evolution, here are
the prior versions: 2012, 2014, 2016, 2017 and 2018.
Worth noting: as the term “Big Data” has now entered the museum of once-hot
buzzwords, this year the chart will just be the “Data & AI Landscape”.
Also, to make the reading more digestible, we’ll break down the post into two parts:
Part I (this post) will include a few introductory thoughts on the rapidly evolving
context around data privacy and regulation, which will have a profound impact on what
can/cannot be done with data technologies; it will also include the landscape itself.
Part II will include a roundup of key trends on data infrastructure, analytics and
ML/AI.
In 2018, we noted how the data world had started to reveal some darker, scarier
undertones, in the wake of the Cambridge Analytica scandal in particular.
This trend continued to develop in 2019. There were more data breaches,
more privacy scandals. More stories of the surveillance state in China
(including this report on a Muslim town in Northwest China). More freaky examples
of AI deepfakes, for which we are very unprepared.
Certainly, the debate around the dangers of AI, with all its sci-fi connotations, had
captured imaginations already, and this year has seen more initiatives around thinking
through those issues, such as the launch of Fei Fei Li’s Institute for Human-Centered
Artificial Intelligence.
But up until recently, questions around data ownership, privacy and security were met,
for almost everyone but a vocal minority, with a resounding yawn.
Perhaps more than ever, privacy issues jumped to the forefront of public
debate in 2019 and are now front, left and center. The fact that many of those issues
were related to Facebook, a service known to billions, probably played an important
role in sensitizing a much broader group of people around the world to the severity of
the issues.
The data privacy landscape is also shifting, as governments are increasingly getting
involved.
• GDPR, the European data protection and privacy regulation, came into effect in May
2018, and since then a few high profile fines have been announced including a €50
million fine issued to Google in January 2019 by the French data protection regulator
and a £500,000 fine issued to Facebook in October 2018 by the UK’s Information
Commissioner’s Office.
• The California Consumer Privacy Act (CCPA) will become effective on January 1, 2020.
• New York’s privacy bill is “even bolder” than California’s.
• San Francisco just voted to ban the use of facial recognition by city agencies.
• Illinois moved to regulate the use of AI in video hiring interviews.
Yet harsher government actions could take place. For starters, Facebook is likely to
be fined up to $5B by the FTC over privacy issues. Perhaps most importantly, there
have been increasing calls to break up the largest Internet franchises — too much
power, too much data and not enough privacy. The clearest target has been Facebook
(see this well- publicized opinion piece by one of its founders, Chris Hughes), but the
discussion has included others as well (a proposal from presidential candidate
Elizabeth Warren targets Google and Amazon).
Big Tech was already under pressure from within its own ranks. Employees at
Google, Amazon and Microsoft protested against the commercialization of their face
recognition technology. Google relented. Amazon did not – some activist shareholders
and employees tried to put a ban into effect, but were defeated.
For the FAANGs, privacy has become a new battleground, forcing their leaders to take
much more of a public stance on the issue:
• Tim Cook, CEO of Apple, warned us about the “weaponization of data” which is leading
us into a “data industrial complex.”
• Sundar Pichai, CEO of Google, took a public stand on the issue in the NY Times.
• Mark Zuckerberg, CEO of Facebook, vowed to turn Facebook into a privacy-focused
messaging and social networking platform.
To what extent such statements should be taken at face value, of course, is anyone's guess, and probably depends on the specific company and leader.
The debate around the impact of data and AI on privacy and society is obviously hugely
important, and it is fundamentally healthy that it has become much more central over
the last year or so.
While it is impossible in 2019 to ignore the broader questions of privacy, security and
regulation around data and AI, the ecosystem of data technologies and products is as
exciting (and full!) as ever.
The ecosystem is also evolving in some interesting ways, as some pioneering
technologies such as Hadoop may be on their way out, replaced by cloud computing
and Kubernetes, and entire segments, such as Business Intelligence, seem to be
rapidly consolidating.
We’ll dig into those various trends in some detail, but first, here’s our 2019 Data & AI
Landscape:
Some key resources:
• Yes, you can zoom! The image and all logos are very high-res, so you can navigate
the landscape in detail by zooming. Works very well on mobile, too!
• This year, my FirstMark colleague Lisa Xu provided immense help with the
landscape.
• We’ve detailed some of our methodology in the notes at the end of this post.
• Thoughts and suggestions welcome – please use the comment section of this post. We'll probably publish two or three revisions of the chart before it's final.
The last year (since our 2018 landscape) has been active from an exit perspective.
Several companies on the landscape went public. Crowdstrike (NASDAQ:CRWD) and
Elastic (NYSE:ESTC) reached big valuations at IPO time – $7B and $5B, respectively.
Other IPOs included PagerDuty ($1.8B), Anaplan ($1.8B), and Domo ($500M).
Some very large acquisitions occurred in the last year, including Qualtrics (acquired by
SAP for $8B), Medidata (acquired post-IPO by Dassault for $5.8B), Hortonworks
($5.2B merger with Cloudera), Imperva (acquired by Thoma Bravo for $2.1B),
AppNexus (acquired by AT&T for up to $2B), Cylance (acquired by BlackBerry for
$1.4B), Datorama (acquired by Salesforce for $800M), Treasure Data (acquired by
Arm for $600M), Attunity (acquired post-IPO by Qlik for $560M), Dynamic Yield
(acquired by McDonald’s for $300M), and Figure Eight (acquired by Appen for
$300M).
Many other companies on the 2018 landscape were acquired for smaller amounts:
Alooma (Google), Bonsai (Microsoft), Euclid Analytics (WeWork), Sailthru (Campaign
Monitor), Data Artisans (Alibaba), GRIDSMART (Cubic), Drawbridge (LinkedIn),
Citus Data (Microsoft), Quandl (NASDAQ), Connotate (import.io), Datafox (Oracle),
Market Track (Vista Equity Partners), Lattice Engines (Dun & Bradstreet), Blue Yonder
(JDA Software), SimpleReach (Nativo).
Also worth noting, the AI acqui-hire by large Internet companies, a fixture of 2016-
2017, is not completely dead: Twitter acquired Fabula AI to strengthen its machine
learning expertise, for example.
On the investment front, Big Data and AI startups continued to see big financing
rounds. Investments in China were not quite as oversized as last year, when there were
multiple companies that raised over a billion dollars. Chinese companies that raised
large rounds this year included facial recognition company Face++ ($750M Series D),
AI chip maker Horizon Robotics ($600M Series B), fleet management company G7
($320M Series F), and online tutoring platform Yuanfudao ($300M Series F).
In the US, huge investments went into autonomous vehicle companies, including
Cruise ($1.9B across 2 rounds in 2018 and 2019), Nuro ($940M Series B), and Aurora
($600M Series B). RPA companies also saw massive rounds: UiPath ($800M across 2
rounds in 2018 and 2019) and Automation Anywhere ($550M across 2 rounds in
2018).
Other major rounds of US companies on the landscape include Verily Life Sciences
($1B private equity round), Cambridge Mobile Telematics ($500M), Clover Health
($500M Series E), Veeam Software ($500M), Snowflake Computing ($450M Series
F), Compass ($400M Series F), Zymergen ($400M Series C), Dataminr ($392M Series
E), Lemonade ($400M Series D), Rubrik ($260M Series E), Databricks ($250M Series
E), and MediaMath ($225M Series D).
Part II: Major Trends in the 2019 Data
& AI Landscape
Part I of the 2019 Data & AI Landscape covered issues around the societal impact of
data and AI, and included the landscape chart itself. In this Part II, we’re going to dive
into some of the main industry trends in data and AI.
The data and AI ecosystem continues to be one of the most exciting areas of
technology. Not only does it have its own explosive momentum, but it also powers
and accelerates innovation in many other areas (consumer applications, gaming,
transportation, etc). As such, its overall impact is immense, and goes much
beyond the technical discussions below.
Of course, no meaningful trend unfolds over the course of just one year, and many of
the following have been years in the making. We'll focus the discussion on trends that
we have seen particularly accelerating in 2019, or gaining rapid prominence in industry
conversations.
We will loosely follow the order of the landscape, from left to right: infrastructure,
analytics and applications.
INFRASTRUCTURE TRENDS
The data infrastructure world continues its own rapid evolution. The main arc here,
which has been playing out for years but seems to be accelerating, is a three-phase transition from Hadoop to cloud services to a hybrid/Kubernetes environment.
Hadoop is very much the “OG” of the Big Data world, dating back to an October
2003 paper. A framework for distributed storage and processing of massive amounts
of data using a network of computers, it played an absolutely central role in the
explosion of the data ecosystem.
Over the last few years, however, it has become a bit of a sport among industry watchers
to pronounce Hadoop dead. This trend accelerated further this year, as Hadoop
vendors ran into all sorts of trouble. MapR has been on the brink of shutting down
and may have found a buyer at the time of writing. Cloudera and Hortonworks, fresh off their $5.2B merger, had a rough day in June when the stock plummeted 40% on disappointing quarterly earnings. Cloudera has announced a variety of cloud and hybrid products, but they have not launched yet.
However, it is unlikely that Hadoop is going to go away anytime soon. Its adoption may
slow down, but the sheer magnitude of its deployment across enterprises will give it
inertia and staying power for years to come.
While cloud usage deepens, customers are beginning to balk at costs. In board
rooms all around the world, executives have suddenly taken notice of a line item that
used to be small and has now snowballed very rapidly: their cloud bill. The cloud does
offer agility, but it can often come at a high price, particularly if customers take their
eye off the meter or fail to accurately forecast their computing needs. There are many
stories of AWS customers like Adobe and Capital One that saw their bill grow 60%+
over just one year between 2017 and 2018, to well over $200M.
Costs, as well as concerns over vendor lock-in, have precipitated the evolution towards
a hybrid approach, involving a combination of public cloud, private cloud and on-
prem. Faced with a myriad of options, enterprises will increasingly select the best
tool for the job to optimize performance and economics. As cloud providers
more aggressively differentiate themselves, enterprises are adapting with multi-
cloud strategies that leverage what each cloud provider is best at. And in some cases,
the best approach is to keep (or even repatriate) some workloads on-premises in order to optimize economics, especially for non-dynamic workloads.
Interestingly, cloud providers are adapting to the reality that enterprise computing will
occur in a mix of environments by providing tools such as AWS Outposts, which allows customers to run compute and storage on-premises and seamlessly integrate those on-premises workloads with the rest of their applications in the AWS cloud.
In this new multi-cloud and hybrid cloud era, the rising superstar is
undoubtedly Kubernetes. A project for managing containerized workloads and
services open sourced by Google in 2014, Kubernetes is experiencing the same fervor
as Hadoop did a few years ago, with 8,000 attendees at its KubeCon event and a never-ending stream of blog posts and podcasts. Many analysts believe that Red Hat's
prominence in the Kubernetes world largely contributed to its massive acquisition by
IBM for $34B. The promise of Kubernetes is very much to help enterprises run their
workloads across their own datacenter and private cloud, as well as one or several
public clouds.
As an orchestration framework that's particularly adept at managing complex, hybrid environments, Kubernetes is also becoming an increasingly attractive option for machine learning. Kubernetes gives data scientists the flexibility to choose whichever language, machine learning library or framework they prefer, and to train and scale models with comparatively rapid iteration and strong reproducibility, all without having to be infrastructure experts and with the same infrastructure serving multiple users (more here). Kubeflow, a machine learning toolkit for Kubernetes, has
been gaining rapid momentum.
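To make this more concrete, here is a minimal sketch of submitting a containerized training job to a Kubernetes cluster using the official Python client. The image name, command and hyperparameters are hypothetical placeholders; a real setup (whether via Kubeflow or not) would also specify resource requests, GPU scheduling and data volumes.

```python
# Minimal sketch: submit a containerized model-training job to Kubernetes.
# Assumes a local kubeconfig; the image and arguments are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # load cluster credentials from ~/.kube/config

container = client.V1Container(
    name="train",
    image="acme/model-train:latest",           # hypothetical training image
    command=["python", "train.py"],
    args=["--epochs", "20", "--lr", "0.001"],  # hypothetical hyperparameters
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="model-train-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry the pod at most twice on failure
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

The same few lines of code work whether the underlying nodes sit in a private datacenter or in one or several public clouds, which is precisely the appeal described above.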
Kubernetes is still relatively nascent, but interestingly, the above could signal an
evolution away from the cloud machine learning services, as data scientists may prefer
the overall flexibility and controllability of Kubernetes. We could be entering a third
paradigm shift for data science and ML infrastructure, from Hadoop (up until
2017?) to data cloud services (2017-2019) to a world dominated by Kubernetes and
next-generation data warehouses like Snowflake (2019-?).
Serverless is another attempt at simplifying data infrastructure, albeit from a different angle. This
execution model enables users to write and deploy code without the hassle of worrying
about the underlying infrastructure. The cloud provider handles all backend services
and the customer is charged based on what they actually use. Serverless has certainly
been a key emerging topic in the last couple of years, and this is another new category
we’ve added to this year’s Data & AI Landscape. However, the applicability of serverless
to machine learning and data science is still very much a work in progress, with
companies like Algorithmia and Iguazio/Nuclio being early entrants.
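As an illustration of the execution model, here is a hedged sketch of a serverless inference function written as an AWS Lambda-style handler. The S3 bucket, model artifact and feature names are hypothetical, and in practice cold starts and model size are a big part of why serverless ML is still a work in progress.

```python
# Sketch of a serverless model-serving function (AWS Lambda-style handler).
# The S3 bucket, model artifact and feature list below are hypothetical.
import json
import pickle

import boto3

s3 = boto3.client("s3")

# Load the model once per container, outside the handler, to amortize cold starts.
_obj = s3.get_object(Bucket="acme-models", Key="churn_model.pkl")
MODEL = pickle.loads(_obj["Body"].read())

FEATURES = ["tenure_months", "monthly_spend", "support_tickets"]  # hypothetical

def handler(event, context):
    """Score a single record passed as a JSON body and return the prediction."""
    payload = json.loads(event["body"])
    row = [[payload[f] for f in FEATURES]]
    churn_probability = MODEL.predict_proba(row)[0][1]
    return {
        "statusCode": 200,
        "body": json.dumps({"churn_probability": float(churn_probability)}),
    }
```

The cloud provider provisions, scales and bills the compute behind this function automatically; the user never touches a server.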
In a world where some data lives in a data warehouse, some in a data lake, some in
various other sources, across on-prem, private cloud and public cloud, how do you
find, curate, control and trace data? Those efforts take various related forms and
names, including data querying, data governance, data cataloging and data lineage, all
of which are gaining increasing importance and prominence.
Querying data across a hybrid environment is its own challenge, with solutions that
fall within the general trend of separating storage and compute (see
this video from Starburst Data, a company offering an enterprise version of SQL query
engine Presto, from our Data Driven NYC event).
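For a sense of what federated querying looks like in practice, here is a minimal sketch using the presto-python-client package against a Presto endpoint. The host, catalogs, schemas and tables are hypothetical placeholders; the point is that one SQL engine can join data sitting in a data lake with data sitting in an operational database.

```python
# Sketch: query across heterogeneous sources through a single Presto endpoint.
# Host, catalogs, schemas and tables below are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.internal.example.com",
    port=8080,
    user="analyst",
    catalog="hive",     # e.g., a data lake catalog
    schema="default",
)

cur = conn.cursor()
# Join events stored in the data lake (hive) with customers in an RDBMS (postgresql).
cur.execute("""
    SELECT c.segment, count(*) AS events
    FROM hive.web.events e
    JOIN postgresql.public.customers c ON e.customer_id = c.id
    GROUP BY c.segment
""")
for segment, events in cur.fetchall():
    print(segment, events)
```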
Data governance is another area that’s rapidly becoming top of mind in the
enterprise. The general idea of data governance is to manage one's data and make sure that it's of high quality throughout its lifecycle. It touches on areas such as data availability, integrity, usability, consistency and security. Notably, in early
2019, Collibra raised a $100M round at over a $1B valuation.
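On the data quality side of governance, even simple automated checks go a long way. Below is a hedged sketch of the kind of rule-based validation a governance process might run before a dataset is published internally; the column names, thresholds and input file are hypothetical.

```python
# Sketch: lightweight data-quality checks of the kind a governance process might run.
# Column names, thresholds and the input file are hypothetical.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    """Return a dict mapping each named check to a pass/fail boolean."""
    return {
        "no_duplicate_ids": df["customer_id"].is_unique,
        "email_mostly_present": df["email"].notna().mean() >= 0.95,
        "signup_dates_not_in_future": (df["signup_date"] <= pd.Timestamp.today()).all(),
        "revenue_non_negative": (df["revenue"] >= 0).all(),
    }

df = pd.read_csv("customers.csv", parse_dates=["signup_date"])
results = run_quality_checks(df)
failed = [name for name, ok in results.items() if not ok]
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
```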
Finally, data lineage is perhaps the most recent category of data management to
emerge. Data lineage is meant to capture the “journey of data” across the enterprise. It
helps companies figure out how data was gathered, and how it was modified and shared
across its lifecycle. The growth of this segment is driven by a number of factors
including the increasing importance of compliance, privacy and ethics, as well as the
need for reproducibility and transparency of machine learning pipelines and models.
Here’s a good podcast on the topic from O’Reilly.
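A hedged sketch of the core idea: every transformation step records where its inputs came from and what it produced, so the "journey of data" can be reconstructed later. Real lineage tools capture this automatically at the query or pipeline level; the structure and values below are purely illustrative.

```python
# Sketch: the minimal information a lineage record captures for one pipeline step.
# Field values are illustrative; real tools collect this automatically.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    step_name: str
    inputs: list          # upstream datasets this step read
    output: str           # dataset this step produced
    transformation: str   # human-readable description (or the SQL/code itself)
    run_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = LineageRecord(
    step_name="build_customer_features",
    inputs=["s3://lake/raw/events", "postgres://crm/customers"],
    output="s3://lake/features/customer_features_v3",
    transformation="joined raw events to CRM customers, aggregated 30-day activity",
)
print(record)
```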
The final key trend that has been accelerating this year is the continued emergence of
an AI-specific infrastructure stack.
The need to manage AI pipelines and models has given rise to the rapidly growing
MLOps (or AIOps) category. To acknowledge this new-ish trend, we have added two
new boxes to this year’s Landscape, one under Infrastructure (with various early stage
startups including Algorithmia, Spell, Weights & Biases, etc.) and one under Open
Source (with a variety of projects, typically fairly early as well, including Pachyderm,
Seldon, Snorkel, MLeap, etc.).
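As one concrete example of what this category does, here is a hedged sketch of experiment tracking with MLflow, one of the open source projects in this space. The dataset, parameters, metric and model below are placeholders; the value of the pattern is that every training run becomes reproducible and comparable.

```python
# Sketch: tracking a training run with MLflow so it can be reproduced and compared.
# The dataset, parameters and model are placeholders for a real pipeline.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    mlflow.log_params(params)                 # what was tried
    mlflow.log_metric("test_auc", auc)        # how well it did
    mlflow.sklearn.log_model(model, "model")  # the resulting artifact
```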
AI is having a profound impact on infrastructure even at the lower levels of the stack,
with the rise of GPU databases and the birth of a new generation of AI chips
(Graphcore, Cerebras, etc.). AI may be forcing us to rethink the entire nature of
compute.
ANALYTICS TRENDS
In business intelligence, the unmistakable trend of the last few months has been
the burst of consolidation activity that we mentioned earlier in this post, with the
acquisitions of Tableau, Looker, Zoomdata and Clearstory, as well as the merger
between SiSense and Periscope (Harry Glaser, CEO of Periscope Data, had spoken at Data
Driven NYC last year).
As BI consolidates, the heat continues to increase in the data science and machine
learning platform segments. The deployment of ML/AI in the enterprise is a
mega-trend that is still in its early innings, and various players are rushing to build
the platform of choice.
For most companies in the space, the clear goal is to facilitate the democratization of
ML/AI, making its benefits accessible to larger groups of users and companies, in a
context where the ongoing talent shortage in ML/AI continues to be a major
bottleneck to broad adoption. However, different players have different strategies.
One approach is AutoML. It involves automating entire parts of the machine learning
lifecycle, including some of the most tedious ones. Depending on the product, AutoML
will handle anything from feature generation and engineering to algorithm selection, model training, deployment and monitoring. DataRobot, an AutoML specialist, raised a $100M Series D since our 2018 Landscape (and reportedly more since then).
Other companies in the space, such as Dataiku, H2O and RapidMiner, offer platforms that include AutoML but go broader in scope. Dataiku, for example, raised a large $101M Series C since our 2018 Landscape, with an overall philosophy of empowering entire data teams (both data scientists and data analysts) and abstracting away much of the complexity and tediousness involved in handling the entire lifecycle of data (for a great overview, see this video of a presentation by Florian Douetteau, CEO of Dataiku) [Disclaimer: FirstMark is an investor in Dataiku].
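For a sense of what the AutoML promise looks like from the user's side, here is a hedged sketch using H2O's open source AutoML (one of the tools mentioned above). The input file and target column are hypothetical; the library takes care of algorithm selection, training and leaderboard ranking within the given time budget.

```python
# Sketch: automated model search with H2O AutoML.
# The input file and target column are hypothetical placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # starts (or connects to) a local H2O cluster

train = h2o.import_file("customer_training_data.csv")
target = "churned"
features = [c for c in train.columns if c != target]
train[target] = train[target].asfactor()  # treat the target as categorical

# Let AutoML try multiple algorithms and ensembles within a 10-minute budget.
aml = H2OAutoML(max_runtime_secs=600, seed=1)
aml.train(x=features, y=target, training_frame=train)

print(aml.leaderboard.head())  # candidate models ranked by cross-validated performance
best_model = aml.leader        # best model, ready for prediction or export
```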
The cloud providers are of course active, with Microsoft’s Learning Studio, Google’s
Cloud AutoML and AWS Sagemaker. Despite the might of the cloud providers, those
products are still reasonably narrow in their scope – generally hard to use and largely
targeting very technical, advanced users. They’re also still very much
nascent. Sagemaker, Amazon’s cloud machine learning platform, reportedly had
a slow start in 2018, with only $11M in sales to the commercial sector.
Some cloud providers are actively partnering with pure play players in the space:
Microsoft participated in the $250M Series E of Databricks, perhaps a prelude to a
future acquisition.
We had covered the world of AI research in a previous post: Frontier AI: How far are
we from artificial “general” intelligence, really?.
For more, see two great reports that just came out: State of AI Report 2019 by Nathan
Benaich and The State of AI: Divergence by MMC Ventures.
APPLICATION TRENDS
As we complete our journey through the 2019 landscape from the left to the right of the chart, here are a couple of key trends to highlight in applications:
At this stage, we are probably 3 or 4 years into a journey of trying to build ML/AI
applications for the enterprise.
There were certainly some awkward product attempts (first generation chatbots) and
some big marketing claims well ahead of reality, especially from older companies
trying to retrofit ML/AI into existing products.
But, bit by bit, we’ve entered the deployment phase of ML/AI in the enterprise,
going from curiosity and experimentation to actual use in production. The trend for
the next few years seems clear: take a given problem, see if ML/AI (more often than
not, deep learning, or a variation thereof) can make a difference, and if so, build an AI
application to address the problem more effectively.
This deployment phase will occur in a variety of ways. Some products will be built and
deployed by internal teams using the enterprise AI platforms mentioned above. Others
will be full-stack products with embedded AI, offered by various vendors, where the
AI part might be largely invisible to the customer. Yet others will be provided by
vendors offering a mix of products and services (for an example of this approach, see
this talk by Jean-Francois Gagne, CEO of Element AI).
Certainly, it is still very much early days. Internal teams often started with discrete
projects addressing one use case (e.g., churn prediction), and are starting to expand to
other problems. Many startups building ML/AI applications are still learning about the
challenges of going from R&D mode to a fully scaled out operation (I wrote a few
thoughts on the topic in this earlier blog post: Scaling AI Startups).
There is a futuristic world where enterprises become not only fully automated
organizations, but eventually also self-healing and autonomous, a topic which we had
explored in our presentation on AI and blockchain last year.
However, we’re far from that stage, and today’s reality is largely focused on RPA. This
is a red hot category, with leaders such as UiPath and Automation Anywhere
growing very fast and raising mega-rounds, as mentioned above.
RPA, short for Robotic Process Automation (although, perhaps disappointingly, it does
not leverage any actual robot), involves taking generally very simple workflows,
typically manual (performed by humans) and repetitive, and replacing them with software. A lot of RPA takes place in back office functions (e.g., invoice processing).
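To illustrate how rules-based this typically is, here is a hedged sketch of an invoice-processing automation in plain Python: it scans a folder of text invoices, extracts a couple of fields with regular expressions, and appends them to a ledger. The folder layout and invoice format are hypothetical, and commercial RPA tools achieve the equivalent by scripting existing GUIs rather than through code.

```python
# Sketch: a rules-based "back office" automation in the spirit of RPA.
# The folder layout and invoice format are hypothetical.
import csv
import re
from pathlib import Path

INBOX = Path("invoices/inbox")
PROCESSED = Path("invoices/processed")
LEDGER = Path("invoices/ledger.csv")

INVOICE_NO = re.compile(r"Invoice\s*#\s*(\d+)")
TOTAL = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")

INBOX.mkdir(parents=True, exist_ok=True)
PROCESSED.mkdir(parents=True, exist_ok=True)

with LEDGER.open("a", newline="") as ledger:
    writer = csv.writer(ledger)
    for invoice in INBOX.glob("*.txt"):
        text = invoice.read_text()
        number = INVOICE_NO.search(text)
        total = TOTAL.search(text)
        if number and total:
            # Record the extracted fields, then move the file out of the inbox.
            writer.writerow([number.group(1), total.group(1).replace(",", "")])
            invoice.rename(PROCESSED / invoice.name)
        else:
            print(f"Could not parse {invoice.name}; leaving it for a human.")
```

Nothing in this workflow learns or adapts, which is exactly the point made below about RPA being more automation than intelligence.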
There are perhaps reasons to be cynical about RPA. Some consider it to be a largely unintelligent “band aid”, or a stopgap measure of sorts – take an inefficient workflow performed by humans, and just have the machine do it. From that perspective, RPA may simply be creating the next level of technical debt, and it is unclear what happens to automated RPA functions as the environment around them changes, other than creating the need for more RPA to reconfigure the old task to its new environment.
RPA, at this stage at least, is more about automation than intelligence, more about
rules-based solutions than AI (although several RPA vendors tout their AI capabilities in marketing materials).
It will be particularly interesting to observe those spaces in the next few years, and it is
possible that RPA and intelligent automation will merge, either through M&A or
through the launch of new homegrown products, unless the latter progresses so rapidly that it limits the need for the former.
____________________
NOTES:
1) As in every year, we couldn't possibly fit all the companies we wanted on the chart. While
the general philosophy of the chart is to be as inclusive as possible, we ended up having
to be somewhat selective. Our methodology is certainly imperfect, but in a nutshell,
here are the main criteria:
• Everything being equal, we gave priority to companies that have reached some level of
market significance. This is a reasonably easy exercise for large tech companies. For
growing startups, considering the limited amounts of data available, we often used
venture capital financings as a proxy for underlying market traction (again, probably
imperfect). So everything else being equal, we tend to feature startups that have raised
larger amounts, typically Series A and beyond.
• Occasionally, we made editorial decisions to include earlier stage startups when we
thought they were particularly interesting.
• On the application front, we gave priority to companies that explicitly leverage Big
Data, machine learning and AI as a key component or differentiator of their offering.
It is a tricky exercise at a time when companies are increasingly crafting their
marketing around an AI message, but we did our best.
• This year as in previous years, we removed a number of companies. One key reason for
removal is that the company was acquired and not run by the acquirer as an independent company. In some select cases, we left the acquired company as is in the
chart when we felt that the brand would be preserved as a reasonably separate offering
from that of the acquiring company.
3) As we get a lot of requests every year: feel free to use the chart in books, conferences,
presentations, etc – two obvious asks: (i) do not alter/edit the chart and (ii) please
provide clear attribution (Matt Turck, Lisa Xu and FirstMark Capital).