Haystack powers allyoucanupload and will soon power all of Webshots (Webshots is a photo sharing community with 19,000,000 members who have uploaded over 375,000,000 photos). Allyoucanupload is an image hosting service that we built to run alongside Webshots.
Haystack is designed to provide a very scalable, reliable and cost effective platform for object storage and delivery to the Internet. It just went live 2 weeks ago - we are currently using it for the allyoucanupload Webshots image hosting service (gif, jpeg and png). In the very near future, it will serve all Webshots photos, and soon, video.
I'm doing a long post because I'm very proud of what our technical team has accomplished, but also to give some insight into how finance, user and technical strategy intersect in our group. Building a large and sustainable (aka profitable) business in social media requires a balancing act between the delivering on a great user promise, a revenue model and keeping your costs under control. Haystack will allow us to deliver very robust storage solutions for users at a very low marginal cost.
Disclaimer/Credit: Almost all of the technical content in this blog post was written by Paul O, who runs CNET Networks' data center services, including database architecture, network systems and operations and the actual data center. Paul, Jim, Rodolphe, Marcus, and Matthew built Haystack - they don't (yet) have blogs so I'm doing this post. Please do not attribute any technical props to me because of this post. My only contribution to Haystack was to approve its development and cheerlead along the way.
Haystacks' content (social media files) has several interesting characteristics: it grows without bound; it tends toward write-once, read-many; the most recent content tends to be the most frequently accessed. Haystack's design leverage these characteristics.
The challenge: A big financial and therefore technical issue is the relationship of storage to delivery. It's relatively easy to deliver a small number of files to a lot of people. And a large number of files to a small number of people. A large number of files to a large number of people gets more complicated and can get very expensive very quickly.
Haystack gives us the ability to finely match the raw storage capacity of the system with it's overall IO capacity. Haystack grows very naturally through incremental addition of capacity. Haystack is designed to handle failures automatically and to keep reliability constant as the system grows. Haystack uses commodity hardware and software.
The promise of Haystack is that we can handle reliability at
scale at a very low (perhaps the lowest) marginal operating cost.
Reliability means that we never have to say we're sorry - we lost your
photos. Scale is scale. Low marginal cost directly goes to our
ability to give users as much storage as we can, and run the least
intrusive ads, while running Webshots with a sustainable profit margin
- and keeping the data center team focused on talent vs hands and
well paid:) You'll see the effects of Haystack on Webshots in our
upcoming redesign and soon to be revised storage limits.
BSU Haystack consists of many Basic Storage Units (BSU), which are just servers with a lot of disks. Content is scattered more or less randomly across all BSU and spindles to maximize the IO throughput of the system. Multiple copies of the content are maintained on disparate equipment so that no single failure can loose all copies of an object.
Failure: In the event of a component failure, Haystack immediately begins a process to copy the "missing" content from one of the redundant sources to the available components. Because the content is scattered across all available components, recovery time is on the order of 1/N. Of course, the failure rate is on the order of N, so the overall availability is on the order of N * 1/N or a constant. Recovery is also has very little impact on the overall performance of the system.
As new capacity is added, existing content is migrated to the new capacity to rebalance the storage and IO across all available units. The rebalancing time is approximately constant.
Separation of Church and State
To minimize the overhead in tracking the location of any given object, Haystack puts objects into buckets and needs to track only the location of the buckets. The applications using Haystack must independently track metadata about each object including it's bucket. At some level, a bucket is really just a directory and each BSU knows what buckets it contains. Haystack maintains a proper cache of the bucket locations and each BSU checks in periodically to report the state of its buckets. The proper cache can easily be rebuilt and the overall system is very tolerant of data inconsistencies between clients, the cache and the BSUs.
Various processes monitor the overall condition of Haystack and initiate actions as needed to maintain the health. For example, when new capacity is added, these monitoring processes detect the availability of "under-utilized" capacity and begin a bucket migration process to bring the new capacity up to the same levels as the old capacity.
Some of the jobs have names...they are:
leon's job is to identify and remove extra instances
jeopardy makes more copies of data that has too few copies
scalpel removes outdated copies of data
optimist moves data to fill up less-full nodes
pessimist moves data from busy nodes to less busy nodes
Changing the drives over time
Currently we use 400 GB sata drives. As time goes by, the average age of the content will grow and the average number of access per object will decrease. This will allow us to introduce larger capacity disk drives as the system grows, helping to keep the overall hosting and depreciations costs low. The number of sata drives attached to a given BSU is determined by the overall network throughput of the BSU and the ability of the BSU to effectively use the file system cache.
Haystack uses various content caching and redirection mechanisms to both hide the complexity of the system from clients and to leverage the raw IO capacity of the spindles. The system is very loosely coupled and designed to be quite tolerant of failures. This means that failures are very localized. In addition, the types of failures that can have the largest negative impact are with very proven technologies, such as file systems and disk arrays. Thus the overall reliability is high and the operational costs are low.
Rant - lots of private social media properties are venture funded and lose tons of money storing/serving content to get big to either get bought or change the model.
Finale The challenge at public companies like CNET Networks is to build a social media business model as well as a user & technical model from the "get go". That operating plan must be good for user and cheap enough to operate so we can return a healthly return on invested capital to our shareholders. Haystack is a significant innovation that should help us to build a better product for users, scale for marketing partners, and generate operating cash flow for shareholders.
Whew - if you're still reading then you are likely one of the Webshots or CNET Networks engineers - nice work folks :)