
Over time, the company’s on-premises Hadoop clusters also required operationalization with custom automation, an additional burden for Dropbox’s team of eight engineers. The team had to plan in advance for capacity: “We had to predict our needs at least 3 years into the future,” says Ashish Gandhi, technical lead for data infrastructure teams at Dropbox. “Doing that is a very tedious process, and it means time spent not building. It also locks you into certain assumptions about your workload profile for a long period of time.” Dropbox migrated all its analytics data from its on-premises Hadoop Distributed File System infrastructure to a data lake built on Amazon S3 in 2019, and this data is growing by more than 1 PB a month.
In the worst-case scenario, software upgrades risked having to be rolled back, resulting in downtime.

To successfully and efficiently innovate and improve customer experience, the company looked to Amazon Web Services (AWS). Dropbox has since moved 34 PB of analytics data to a data lake in Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance, and uses Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EC2 Spot Instances to power the compute for its Hadoop clusters. These AWS services enable Dropbox to cost-effectively scale storage and compute independently without planning for capacity and to test new technologies without fear of degrading its users’ experience, ultimately enabling Dropbox to innovate faster while saving money.

Beginning as a central hub for file storage in 2007, Dropbox has evolved to offer many business solutions in the collaboration space and is now a global company. It serves more than 500,000 business teams and has more than half a billion users. Though best known for its file-syncing product, Dropbox also offers tools for productivity, team management, data security, and more.

Before moving its analytics data to AWS, Dropbox stored this data in on-premises Hadoop Distributed File System clusters. Dropbox had to invest in custom patches for open-source Hadoop so that these systems could scale to its needs. At-scale testing of new versions of big data frameworks like Hadoop or Apache Hive (open-source frameworks for the distributed processing of large datasets) was an expensive and inherently risky process.

In 2018, Dropbox identified a need to migrate away from its on-premises Hadoop clusters. The clusters held petabytes of analytical data, including server logs, instrumentation, and metadata related to Dropbox’s more than 600 million global customers. Because it was responsible for the durability of the data it stored in Apache Hadoop, Dropbox had to be conservative about which technologies it could experiment with.
