Meta Introduces “Tulip,” a Binary Serialization Protocol That Aids with Data Schematization
Meta has introduced “Tulip,” a binary serialization protocol that supports schema evolution. It addresses reliability issues, among other concerns, and helps schematize data. Tulip supports multiple legacy formats. Meta’s data platform uses it, and performance and efficiency have improved significantly as a result.
Meta’s data platform is made up of a number of heterogeneous services, such as warehouse data storage and various real-time systems, which communicate via APIs and exchange large amounts of data. Schematization is a key component in building a data platform at Meta’s scale. These systems are built with the understanding that each decision and tradeoff has an impact on reliability, data processing efficiency, performance, and the developer experience. Changing the serialization format across the data infrastructure is a risky bet, but one that pays off in the long run.
The Data Analytics Logging Library, which is part of the Meta web tier as well as internal services, is responsible for logging analytical and operational data. It does so via Scribe, a durable message queueing system. Various services then read and ingest the data from Scribe, including the data platform’s ingestion service and real-time systems. On the consuming side, the data analytics library deserializes the data and rehydrates it into a structured payload. Meta engineers create, update, and delete logging schemas every month, and the data flowing through Scribe is measured in petabytes.
Schematization ensures that any message logged in the past, present, or future, relative to the version of the (de)serializer in use, can be (de)serialized reliably at any time without data loss. This property is called safe schema evolution, and it is achieved through backward and forward compatibility. This article focuses on the on-wire serialization format used to encode data that is ultimately processed by the data platform. The new format is more efficient than the previous formats, Hive Text Delimited and JSON serialization: it requires 40 to 85% fewer bytes and 50 to 90% fewer CPU cycles to (de)serialize the data.
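To make the idea of backward and forward compatibility concrete, here is a minimal sketch of a tagged binary encoding in Python. It is not Tulip’s actual wire format; the schemas, field names, and the `encode`/`decode` helpers are hypothetical and exist only for illustration. Each field is written with a numeric ID and a length, so a reader can skip fields it does not know and fall back to defaults for fields that are missing.

```python
import struct

# Hypothetical schemas for illustration: field ID -> (name, default value).
# Field IDs are stable; a new schema version only adds new IDs.
SCHEMA_V1 = {1: ("user_id", 0), 2: ("event", "")}
SCHEMA_V2 = {1: ("user_id", 0), 2: ("event", ""), 3: ("latency_ms", 0)}

HEADER = "<HI"  # little-endian: 2-byte field ID, 4-byte payload length
HEADER_SIZE = struct.calcsize(HEADER)


def encode(record, schema):
    """Write every schema field as a (field ID, length, payload) triple."""
    out = bytearray()
    for field_id, (name, default) in schema.items():
        value = record.get(name, default)
        if isinstance(value, int):
            payload = b"i" + struct.pack("<q", value)  # 64-bit signed integer
        else:
            payload = b"s" + value.encode("utf-8")     # UTF-8 string
        out += struct.pack(HEADER, field_id, len(payload)) + payload
    return bytes(out)


def decode(data, schema):
    """Read fields known to `schema`; skip unknown IDs, default missing ones."""
    record = {name: default for name, default in schema.values()}
    offset = 0
    while offset < len(data):
        field_id, length = struct.unpack_from(HEADER, data, offset)
        offset += HEADER_SIZE
        payload = data[offset:offset + length]
        offset += length
        if field_id not in schema:
            continue  # field written by a newer schema version: skip it
        name, _ = schema[field_id]
        kind, body = payload[:1], payload[1:]
        if kind == b"i":
            record[name] = struct.unpack("<q", body)[0]
        else:
            record[name] = body.decode("utf-8")
    return record


# Forward compatibility: an old reader (v1) consumes data from a new writer (v2).
new_bytes = encode({"user_id": 7, "event": "click", "latency_ms": 12}, SCHEMA_V2)
old_view = decode(new_bytes, SCHEMA_V1)

# Backward compatibility: a new reader (v2) consumes data from an old writer (v1).
old_bytes = encode({"user_id": 7, "event": "click"}, SCHEMA_V1)
new_view = decode(old_bytes, SCHEMA_V2)
```

In this sketch the old reader simply ignores `latency_ms`, and the new reader fills it with the default `0`. The design choice that makes both directions work is that field IDs, not positions or names, identify the data on the wire.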