Databricks, a leading super unicorn company, is revolutionizing the way businesses leverage data.
However, businesses face challenges such as complex legacy infrastructure, data silos, and high latency. Consequently, demand for data lakes has been steadily increasing. A data lake is a storage repository that ingests vast amounts of raw data in its native format, enabling enterprises to access it easily when needed.
Databricks is currently a super unicorn in the primary market. It helps businesses prepare data for analysis, supports the adoption of machine learning, and enables data-driven decision-making. It also empowers data scientists to collaborate with data engineers and other business departments to build data products. Today, it has expanded into a broader lake-warehouse ("lakehouse") platform that includes the Databricks Marketplace.
01 Apache Spark, the Beginning
The Databricks team, consisting of seven computer science Ph.D. holders, embarked on the development of the Spark engine for data processing. In 2014, the project set a world record for data sorting speed.
To make Spark accessible to a broader user base, they chose to open-source it and founded Databricks in 2013. In the same year, the company completed its Series A funding round, led by A16z. In January 2016, Databricks appointed a new CEO. A year later, the company closed its first million-dollar deal.
Overall, the Databricks team is the core developer of Apache Spark and holds significant influence and expertise in its ecosystem. As the commercial company built around Spark by the engine's own creators, Databricks has a credible claim to its position in the market.
02 Expanding Product Line for Revenue Diversification
Databricks initially focused on Spark, which was used for querying large, unstructured datasets stored in data lakes. To meet market demand, however, Databricks expanded into a lake-warehouse platform. Built on Spark, the platform includes Delta Lake, which provides ACID transactions and data versioning for data lakes; MLflow, an open-source platform for managing machine learning workflows; and Redash, a SQL-based collaboration tool for data analysis.
Overall, the Databricks lake-warehouse platform combines elements of both data lakes and data warehouses. It offers the flexibility, cost-effectiveness, and scalability of data lakes, while also providing data management and ACID transactions typically found in data warehouses. Users can enable business intelligence and machine learning on all their data.
Databricks products are available on major cloud services such as AWS, Azure, and GCP, providing a unified environment for data, analytics, and machine learning workloads. Visualization can become an integral part of these different activities.
Source: Databricks
03 Data Lake Market Growth, with Customers Spanning Large, Medium, and Small Enterprises
Databricks believes that businesses are moving away from isolated systems for data storage and opting for centralized data repositories. This approach enables enterprises to gain deeper insights into past and future trends through business intelligence and predictive analytics.
Data lake technology is based on precisely this concept, allowing all types and sources of data to be stored together. According to MarketsandMarkets, the data lake market is projected to grow from $7.9 billion in 2019 to $20.1 billion in 2024. This growth reflects the increasing adoption and recognition of data lakes as a valuable asset for organizations across industries, with users spanning large, medium, and small enterprises.
Source: marketsandmarkets
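As a rough check on the projection above, the implied compound annual growth rate (CAGR) can be worked out from the two endpoints. The sketch below assumes five full compounding years between 2019 and 2024; the variable names are illustrative only.

```python
# Implied compound annual growth rate (CAGR) of the data lake market,
# using the two figures quoted in the text.
start_value_busd = 7.9    # market size in 2019, USD billions
end_value_busd = 20.1     # projected market size in 2024, USD billions
years = 2024 - 2019       # assumption: five full compounding periods

cagr = (end_value_busd / start_value_busd) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # roughly 20.5% per year
```

In other words, the projection implies the market growing by about a fifth every year over the period.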
Furthermore, Databricks serves large, medium, and small enterprises across various industries. As of March 2023, it had more than 9,000 enterprise customers worldwide. Notable customers include AT&T, Shell, Burberry, Toyota, Adobe, Condé Nast, and Regeneron Pharmaceuticals.
If we divide Databricks' ARR (Annual Recurring Revenue) of $1 billion at the end of Q2 2022 by its customer count of 7,000+ at the same date, we can roughly estimate its ACV (Average Contract Value) at around $143,000. In comparison, the estimated ACV of Snowflake was $301,000 as of Q3 2023, indicating that Databricks still has room to increase its ACV.
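The estimate above is simple division, and can be reproduced in a few lines. This is a back-of-the-envelope sketch using only the figures quoted in the text; because the customer count is given as "7,000+", the result should be read as an upper bound on the true average.

```python
# Back-of-the-envelope ACV estimate from the figures quoted in the text.
databricks_arr_usd = 1_000_000_000   # ARR at the end of Q2 2022
databricks_customers = 7_000         # "7,000+" customers, so this is a floor

databricks_acv = databricks_arr_usd / databricks_customers
snowflake_acv = 301_000              # estimated ACV quoted for Q3 2023

acv_ratio = snowflake_acv / databricks_acv
print(f"Databricks ACV: ${databricks_acv:,.0f}")     # $142,857
print(f"Snowflake / Databricks: {acv_ratio:.1f}x")   # 2.1x
```

On these numbers, Snowflake's estimated ACV is roughly twice that of Databricks, although the two figures come from different quarters and are not strictly comparable.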
04 Triple Threat
Snowflake, founded in 2012 by former Oracle architects, has emerged as a formidable competitor to Databricks. Initially positioning itself as a cloud data platform for data warehousing and analytics workloads, Snowflake primarily targeted business analysts and data engineers. Meanwhile, Databricks won favor among data scientists and machine learning engineers.
However, the boundaries between the two have become blurred. For instance, Snowflake has introduced features like Snowpark for Data Science, transactional databases, and Python support, aiming to attract data scientists. On the other hand, Databricks has launched products such as Databricks SQL, Delta Lake capabilities, and Unity Catalog to cater to customers focused on data storage and security.
In terms of their models, Snowflake operates within a closed-source ecosystem, while Databricks is built on open-source projects. The open-source projects behind Databricks' primary product lines are available for free, and customers can choose Databricks' enterprise offerings for more advanced features and support. Snowflake provides ready-made solutions that let companies get started quickly with basic analytics, while Databricks offers more customization and configuration, giving customers fuller control over their settings.
By the end of 2022, Snowflake had an annual revenue of $2.1 billion, while Databricks projected an annual revenue of $1.4 billion. The competition between the two is expected to intensify.
The second type of competitor is the cloud providers themselves, whose proprietary products overlap with Databricks. For big data processing, AWS has Amazon EMR, Azure has Azure HDInsight, and GCP has Dataproc. For business analytics, Amazon QuickSight, Azure's Power BI Embedded, and GCP's Looker compete with Databricks.
Lastly, Databricks faces competition from specialized data management and data science vendors. For instance, Databricks' scheduler competes with Apache Airflow, and its MLflow product competes with DataRobot and Alteryx.
05 Sustained Revenue Growth Makes It a Unicorn Recognized by Capital
Databricks builds on open-source software and generates revenue by offering additional features and services for a fee. It provides a fully managed version of its open-source software to enterprises, along with auxiliary tools such as SaaS query-writing tools and connectors for data sources.
In terms of the pricing model, Databricks bills customers per second for the compute resources they consume. To do so, it has introduced a proprietary unit of measurement called the DBU (Databricks Unit). The number of DBUs a workload consumes depends on various factors, including the compute resources utilized, the volume of data processed, the region, the pricing tier, and the type of service being used.
To attract users, Databricks, like other open-source companies, offers a 14-day free trial.
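To make the billing mechanics concrete, here is a minimal sketch of how a per-second, DBU-based charge could be computed. The rates below are hypothetical placeholders, not Databricks list prices, and the `Workload` class and helper functions are invented for illustration. In this simplification, the factors listed above (compute size, region, tier, service type) are folded into two inputs: the DBU rate of the cluster and the price per DBU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    dbu_per_hour: float     # hypothetical DBU rate of the cluster
    runtime_seconds: int    # usage is metered per second
    price_per_dbu: float    # hypothetical USD price for the tier and service


def dbus_consumed(workload: Workload) -> float:
    """DBUs used: the hourly DBU rate prorated to the seconds actually run."""
    return workload.dbu_per_hour * workload.runtime_seconds / 3600


def cost_usd(workload: Workload) -> float:
    """Charge in USD: DBUs consumed multiplied by the price per DBU."""
    return dbus_consumed(workload) * workload.price_per_dbu


# Illustrative job: a cluster rated at 4 DBU/hour running for 90 minutes
# at an assumed price of $0.25 per DBU.
job = Workload(dbu_per_hour=4.0, runtime_seconds=90 * 60, price_per_dbu=0.25)
print(f"DBUs consumed: {dbus_consumed(job):.2f}")  # 6.00
print(f"Charge: ${cost_usd(job):.2f}")             # $1.50
```

Because usage is prorated by the second, a job that finishes early is charged only for the time it actually ran rather than for a full hour.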
On the financial side, Databricks has also achieved remarkable growth. At the end of Q3 2019, its ARR was $200 million; its revenue was $425 million for the full year 2020; and its ARR exceeded $800 million in 2021. As of August 2022, Databricks' ARR had exceeded $1 billion, with annual growth of over 70%.
As of August 2021, Databricks' valuation was $38 billion, and it had raised a total of $3.5 billion in funding. Its investors include A16z, Tiger Global, Amazon Web Services, Microsoft, and Coatue.
However, there have also been reports that in October 2022, Databricks reduced its internal stock price, resulting in a valuation downgrade to $31 billion, a decrease of $7 billion (approximately 18%) from the $38 billion valuation of August 2021. Nonetheless, Databricks remains a super unicorn in the primary market.
06 Trends, Opportunities, and Risks
With the decrease in cloud storage costs and improved internet speeds, more and more companies are choosing to store all their data in centralized repositories instead of separately storing different types of data. This centralization trend helps businesses gain better insights into their operations through real-time business intelligence and predictive analytics. Additionally, the exponential growth of data has made it impractical for companies to maintain multiple large-scale data stores, leading to the convergence of data lakes and data warehouses into a single platform.
ChatGPT has become a hot topic across industries, and Databricks has embraced this wave with its Unified Data Analytics Platform, which allows data teams to store and protect data, generate analytics and insights, and drive the development of machine learning tools. Moreover, Databricks integrates with popular artificial intelligence frameworks such as TensorFlow and PyTorch, making it easier to build and deploy machine learning models.
Databricks relies on cloud infrastructure providers like AWS, Azure, and GCP to deliver its services. Looking back, the partnership with Microsoft was a milestone for Databricks, as it helped the company's revenue grow from under $1 million in early 2017 to over $100 million in 2018. Any changes in relationships with major cloud providers could impact Databricks' service capabilities.
In summary, we have reason to believe that in this era of data expansion and the rise of AI, Databricks' offering of a unified data storage and analytics platform holds value for enterprises. The company has a great opportunity and capability to capitalize on this wave, although it also faces challenges along the way.