In the first part of this series I described the goal and motivation of eBay CommerceOS initiative and that CommerceOS is eBay version of Microservices. In the second part I described the five major components of CommerceOS (technology, processes and org). In this post, I will focus on one of those components: The standards and patterns.
Let me start with an example: Service/API authentication:
Most eBay applications use between 20-50 services, lots of them require application as well user's identity (for security and functional reasons) - imagine if each service (or domain of services) accepted a different type of token with different issuer, different syntax and validation semantics and different binding of that token to protocol (in the body, in the header with different names, combined with other headers etc.). An application would have learn and then write a lot of boiler plate code to obtain token, store and then submit via request to different services, then it would have to parse and learn the error semantics for all types of authentication done by each service. All these activities make the code more complex to write, test and operate and they do not add any value to the main function of the app. Now extend this across all types of horizontal concerns and you get an idea why standardization backed by run time libraries is a must for any service portfolio at scale.
CommerceOS defines a set of standards and implement them in our framework for developing services (called Raptor) - this does not mean that eBay MP does not allow or discourage services to be built using any other technology stack, but if a service is built using the standard libraries and run-time - it gets the support of all standards.
Pattern and standards include about ~30 different aspects of service design, including
- Identity & Access Management,
- Base Request & Response standard and extended headers
- Compact Header Encoding (more efficient use of headers)
- Tracking
- Internationalization
- Error Handling
- Version Management
- Service Descriptor
- Service Life Cycle, Registry and Discovery
- Addressing and End Points
- Sorting, Pagination, Filters and Views
- Instrumentation of Services
- Messaging and Events
- Security
- Migration
- Fail-over and Recovery
- Multi-tenancy
- Integration (with 2nd and 3rd parties)
- Configuration and Metadata Management
- Content and Translation
- Persistent Storage, Replication
- Failure and Recovery
- Service Modeling & Interface Development Model (IDM)
- Multi-Tenancy
- Base-API Operation (operations that all APIs must answer)
- Escape Response
- Asynchronous Service Design
One significant CommerceOS activity stream centers around design and then implementation of pattern and standards. We focus a lot on correct and accurate documentation
followed by implementation in run-time libraries and or as shared services. CommerceOS also defines a process for developing and adopting a new standard. This process is modeled after and is very similar to Internet standard development process (working groups, open discussions, editors) - with the exception of it has a solid timeline to time bound the process. This way all service providers and application developer can participate and/or comment and influence the standard.
In the rest of this series of post, I will go into a bit more details on some of the more significant or interesting standards (if you want to know more about any standard I didn't explain just contact me). In the next post (part III) I will focus on the two sets of principles/patterns that formed our thinking around portfolio design and individual service design and how we measure goals and operational metrics.
Service Descriptor and Interface Contract
CommerceOS emphasis on a formal contract. We use Google Discovery Document (GDD) as the basis of our service descriptor and we extend it to include aspect of service contract we need to manage the service life cycle during build time or run-time. In Java environment, Service interfaces are annotated using a standard annotation library, our discovery tool then generate JSON based discovery document that is used for interoperability with the rest of our tool set used by application teams, product manager in other product teams, tech writers etc.
COS service contract has four main parts
- Service meta data, this include the basic meta data as well as attributes for financial, regulatory and legal needs such as whether a service handle financial data (what types) - whether a service handles personal data, location of personal data etc.
- Service interface and types as describe in Google Discovery Documents
- Service instrumentation contract - this is the contract service has with its operational environment and defines events service generates and consumes (including events required for technical and business health monitoring)
- Service admin contract, loosely JMX based API for adminstartor to set or get certain attributes and influence service behaviors
Service teams own the contract and its maintenance, but syntax and semantics are standardized.
Service Versionning
CommerceOS services must be versioned, we use Major.Minor.Maintenance format. Service team need to declare/decide how many back versions they support. No team is allowed to support zero back version and break backward compatibility - since this forces all applications to migrate.
We allow multiple versions to be alive at the same time, API Router will ensure a given request goes to the right end point. Data and entities are designed to be backward compatible.
Services are not allowed to be perpetually backward compatible since this practice erode code quality and accumulate significant "dead code" that leads to drop in agility and complexity of test.
Service Life Cycle and Registry
One of the most significant decision for CommerceOS was to standardized and establish a widely understood set of mile stones (called life cycle) for service development. This may sound like the dreaded "G" word (Center Governance) - but in practice a large org can not plan an optimal and rapid release cycle without it.
Before the life cycle standardization, the only defacto mile stone for service team was "live to site" i.e. when the service end points were available and functional in production. Application developers (web and mobile) would then start their serious development, effectively serializing the timeline i.e. Delivery time = Max(Services Delivery) + App Delivery.
Without a wide understood and supported mile stones, service team often change their service implementation and interface till the very end of a project timeline, forcing application developer to wait till the "dust settles". CommerceOS establishes a set of mile stone, the first of which is "interface published" this mean the service descriptor is ready, and an end point is exposed that can respond based on the service descriptor - this end-point, in concept, is similar to Java Proxy API - in that it can produce a "fake" response to the request based on the contract - no real implementation required. Application development can practically starts at this point, to a large degree decoupling app development time from service development time.
Service teams can change the interface - but often the thoughts and consideration that went to interface design leads to more or less stable interfaces, the implementation can change freely at any time. This align with one of our portfolio principles of "Stable interface, agile implementation".
Base Request and Response
CommerceOS services and application talk over http, but the exchange has to happen with a common dialect i.e. certain semantics has to be expressed and binded to the underlying HTTP transport in a common way - this saves individual service provider and app team time to re-invent the wheel also prevent a lot of bugs and issues. Base request and response define a set of common headers and encoding that all COS service and apps understand, a few examples are
Syntax to express compact headers, Authorization headers, Identification of request and request chain, serialization and encoding of request and response, session identification, location, locale and cultural preferences bindings to the protocol, the proper use to HTTP header v.s body and alternative binding to HTTP body.
Migration
One of the practical and most important aspects of establishing service or micro service architecture for large companies with "legacy" code is migration. By migration, I specifically mean migrating either monolithic applications with direct data access or application the use older, legacy services to application that consume contract based micro services. We have established a pattern, called
"Bay Bridge" for service migration. It has four major steps
- Smoke Test: Turn on new service (with new data storage), only use it for a very small number of traffic for a few consumers, dual write into (and read from) both new service/storage and old/legacy storage. primary source of truth still is the legacy.
- Load/Sync: Copy/transform data from legacy storage to the new service storage as appropriate. This phase itself may include smaller phases depending on data. The more long lasting data/entity is the more critical this phase is e.g. User is a very long lasting entity while an Auction listing may last only 7 days or a session may be stored only for few hours. The main goal of this phase is to bring new and legacy storage to parity.
- Fly with Safety Net: Dual read/write continues, but the primary is the new storage/service now
- Clean up: Old storage is cleaned up and deprecated
Persistence
CommerceOS allows services to choose their own persistence storage and technology, depending on types of data a service handles (preferences vs. financial data or blog post vs. password and credential) there are pattern for whether systems should prefer CA (financial), PA (most anything)
From logical point of view, services team are required to have isolated storage i.e. no other service or application should read/write directly from primary database of other services - sometimes (especially with bulk data) it is not efficient to consume a classic service interface (serializing and deserializing is too much over head) - in these cases service must expose a "data feed" - push style, and still should not allow other services to directly read its primary storage.
Services are required to register they database and structure of logical entities stored (there is no "governance" of such entities just registration for discovery process)
Fail Over and Recovery
There two types of fail over and recovery in CommerceOS, Transparent and Degraded.
Transparent failures are the failure of stateless application server/service or database hosts. application servers are all running behind load balancers with virtual IPs (and sometimes load balancing is done using run-time discovery ZooKeeper style) - database failures are handled by partitioning data and replication (Casandra style) the typical failures of services and databases are handled without service code realizing the failure, the system continue to operate with no impact.
The other types of fail-over is "degraded" - in this case failure is not transparent to a service for example when an Pricing service (that calculate total order price) calls an Incentive calculation and receives a failure - say due to yet another system failure in Incentive subsystem that could not be handled (e.g. it uses a non-partition, non-replicated DB that failed) - the Pricing service now has to "degrade" its function in a way that it still calculate the total order price. This is a higher level and more domain specific handling of failure yet a few aspect of it can be abstracted and implemented in run-time. In particular, we define light-weight processing framework. It is pipeline based programming model, each pipeline has a series of phases, phases can be assembled dynamically at run-time. each pipeline is executed by an Executor that is the main run time for pipelines. each phase can be annotated as required, optional, alternative. Each phase has a few life cycle state, the two most important ones are up and down. If a required phase is down, and if no alternative is designated, the pipeline fails, if an optional phase fails executor executes an alternative as designated, if no alternative is designated process continues.
This simple framework provides an abstraction for degraded functionality.
Escape Response
Escape response is a small, yet important, aspect that illustrates the need for standards in a give portfolio. An example illustrates the concept, imagine that due to a security breach, you need all users to change their passwords. You can change 100s of applications to message the user for password change. What do you do?
Applications make service calls all the time, a particular response header is called "Escape Response" and it include an end point and a unique number. All application know (and it is implemented in the service invocation library as well) that if they see the escape header, they must re-direct (device and platform specific) user to the given end point. This "escape" path allows the system to take over from any compliant service using a standard syntax, semantics and protocol binding.