Every day, billions of people connect with each other through Meta using services like Facebook, Instagram, WhatsApp, and Messenger. Meta’s services rely on fleets of servers in data centers across the globe, all running applications and delivering the performance the services need. However, silent data corruption, meaning data errors that go undetected by the larger system, remains a widespread challenge for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. Our teams enable and support hardware testing and large-scale experiments in Meta's data centers, including detecting and remediating silent data corruptions on a scale of hundreds of thousands of machines.
Within this novel research domain, we identify research opportunities that range from architectural solutions to data corruption, to fleetwide testing strategies and distributed computing resiliency models, to software and library resiliency, to silicon-level design, simulation, and manufacturing approaches. Solutions may be cross-layered, with proposals combining several of the domains above. This RFP is not limited to solutions specific to CPUs; it covers all the components typically used within a server infrastructure.
To foster further innovation in this area, and to deepen our collaboration with academia, Meta is pleased to invite faculty to respond to this call for research proposals pertaining to the topics above. We anticipate granting up to five awards, each in the $50,000 range. Payment will be made to the proposer's host university as an unrestricted gift.
Deadline: March 21, 2022