Part I of this series described what CommerceOS is, Part II gave an overview of its structure, and Part III focused on the standards that make a microservice portfolio work. Part IV was an overview of shared services. In this post - the last part - I write about processes and metrics.
Processes
Platforms are, at their core, abstractions of capabilities. Their value comes from encapsulating frequently used capabilities, normalizing their variations, and exposing them for reuse - freeing up applications and consumers to focus on less common (and therefore higher value-add) capabilities.
Platforms are built on standards, and maintained and enhanced via processes - without processes, platforms simply deteriorate over time. This is akin to the second law of thermodynamics: the entropy of a system always increases UNLESS you spend energy - and energy, in this context, translates to processes. It is very unlikely that a portfolio of services, especially one with higher-level business domain abstractions and their nuances, survives for long without some form of process. The key, though, is to have meaningful processes. We have three criteria for designing "meaningful" processes:
- Processes should have a clear and measurable goal.
- Processes should be transparent, with a clear decision-making process.
- Processes should have an SLA (be time-bounded).
CommerceOS defines only three processes, all at the portfolio level:
- A process for adding a new service to the portfolio
- A process for adding a new standard to the portfolio
- A process for cross-domain, portfolio-wide decision making
1- Service Life Cycle: Ensures the following:
- A given service neither duplicates (does the same thing with a different name) nor dilutes (does a different thing with the same name) existing capabilities. If Order/Cancel and Order/Refund do "the same thing", we have a duplication. This leads to confusion and, more than likely, major bugs - the two are intended to do the same thing, but over time their functionality diverges, and depending on which API is called the Order is either not properly canceled or not properly refunded. If User/Order and /Order return types that are drastically different (especially semantically), that is an example of "dilution" - in this case, of the Order type.
- Extracting potential new types for the common type space
- Ensuring an audit opportunity for security, legal and regulatory functions.
- Automated assessment of a service (once implemented) against a set of standards
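The duplication/dilution check above can be sketched as a simple linter over a service catalog. The catalog shape, endpoint names, and schema encodings below are illustrative assumptions, not the actual CommerceOS tooling:

```python
# Hypothetical sketch: detect "duplication" (same capability exposed under
# different endpoints) and "dilution" (same logical type with divergent
# schemas) in a service catalog. Catalog entries are illustrative only.

def find_conflicts(catalog):
    """catalog maps endpoint path -> (capability, (type_name, schema))."""
    by_capability = {}  # capability -> list of endpoints exposing it
    by_type_name = {}   # logical type name -> set of schemas observed
    for path, (capability, return_type) in catalog.items():
        by_capability.setdefault(capability, []).append(path)
        type_name, schema = return_type
        by_type_name.setdefault(type_name, set()).add(schema)
    # More than one endpoint per capability -> duplication candidate.
    duplications = {c: p for c, p in by_capability.items() if len(p) > 1}
    # More than one schema per type name -> dilution candidate.
    dilutions = {n: s for n, s in by_type_name.items() if len(s) > 1}
    return duplications, dilutions

catalog = {
    "Order/Cancel": ("cancel-order", ("Order", "id,status")),
    "Order/Refund": ("cancel-order", ("Order", "id,status")),   # duplication
    "User/Order":   ("get-user-order", ("Order", "id,buyer")),  # dilutes Order
}
dups, dils = find_conflicts(catalog)
print(dups)  # {'cancel-order': ['Order/Cancel', 'Order/Refund']}
```

In practice a linter like this would read the portfolio's API specs rather than a hand-built dictionary, but the core idea - index endpoints by capability and types by name, then flag collisions - is the same.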
2- Portfolio Standards and Type Space: This process controls the adoption of a proposal to standardize some aspect of the system. The standards process ensures that:
- The standard is really needed - Since all service providers and consumers must comply with standards (and standards are implemented in the service and application runtime), the bar for adopting a standard is high. The goal is to standardize only what must be standardized - no more.
- The standard defines behavior, not implementation choices - We also try to limit standards to aspects of the interaction between services and applications, or between services and operational tools/systems, and not to service implementation. This is an important distinction. We have no standard on language, technology stack, data technology, etc.
- The standard is automatically testable - Otherwise it does not scale to test every service for standard behavior manually.
- The standard is properly documented and implemented - For all standards, we build support in services, in the application runtime and libraries, or in portfolio tools.
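To make "automatically testable" concrete, here is a minimal sketch of a compliance check for a hypothetical error-envelope standard. The field names (`errorId`, `domain`, `message`) are assumptions for illustration, not the actual CommerceOS standard:

```python
# Illustrative compliance check: every error response must carry a
# machine-readable envelope. The required field names are hypothetical.

REQUIRED_ERROR_FIELDS = {"errorId", "domain", "message"}

def complies_with_error_standard(response_body):
    """True if every entry in the 'errors' list has the required fields.

    A body with no 'errors' list is vacuously compliant.
    """
    errors = response_body.get("errors", [])
    return all(REQUIRED_ERROR_FIELDS <= set(e) for e in errors)

good = {"errors": [{"errorId": 2004, "domain": "ACCESS",
                    "message": "Invalid token"}]}
bad = {"errors": [{"message": "oops"}]}

print(complies_with_error_standard(good))  # True
print(complies_with_error_standard(bad))   # False
```

A check like this can run in CI against recorded responses from every service in the portfolio, which is what makes the standard enforceable at scale.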
3- Decision Making (Huddle): Service and application teams are autonomous; they own all the decisions regarding the code and systems they own (and the accountability that comes with them). However, there are always decisions at the portfolio level, i.e. decisions that impact multiple teams. These decisions are often tactical and made in the context of large projects (strategic decisions are framed in the form of standards), and they usually relate either to the orchestration of complex business processes or to migration-related decisions that impact other services or applications. The decision-making process calls for:
- Clear documentation of the question(s) at hand
- Clear documentation of the options and a proposal
- Identification of the decision maker
Debate and transparency are encouraged, but the discussions are time-bounded and produce one of four outcomes:
- Agree, Implement: Everyone implements the decision.
- Agree, do not implement: Tech debt is captured for the team(s) that do not implement the decision.
- Disagree, implement: The dissenting view is acknowledged and recorded.
- Disagree, do not implement: A one-time escalation to a team of three senior technologists, who make the final decision.
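The elements of the huddle process - question, options, decision maker, and one of the four outcomes - suggest a simple decision-record shape. This is a hypothetical sketch of such a record, not an actual CommerceOS artifact; all names and the example content are invented:

```python
# Hypothetical decision record for the "huddle" process described above.
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    AGREE_IMPLEMENT = "agree, implement"
    AGREE_DO_NOT_IMPLEMENT = "agree, do not implement (tech debt captured)"
    DISAGREE_IMPLEMENT = "disagree, implement (dissent recorded)"
    DISAGREE_DO_NOT_IMPLEMENT = "disagree, do not implement (escalated)"

@dataclass
class DecisionRecord:
    question: str          # the question(s) at hand
    options: list          # documented options and proposal
    decision_maker: str    # identified up front
    outcome: Outcome
    dissent: str = ""      # recorded when a team disagrees

# Invented example of a "disagree, implement" outcome.
record = DecisionRecord(
    question="Which service orchestrates checkout tax calculation?",
    options=["Checkout service", "Dedicated Tax service"],
    decision_maker="Checkout domain lead",
    outcome=Outcome.DISAGREE_IMPLEMENT,
    dissent="Tax team prefers a dedicated service; recorded per process.",
)
print(record.outcome.value)
```

Keeping records in a structured form like this is what makes the process transparent and auditable after the fact.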
Measurement
In general, there are two types of metrics: strategic metrics and operational metrics. In this context, strategic metrics track the main reasons a platform is built (a car is built to transport people), while operational metrics show the platform's health (a car engine's temperature must stay within a certain range).
CommerceOS has the following primary goal:
- Improving the productivity of application developers who build apps to facilitate all forms of commerce, using any technology stack, on any device/screen or platform, globally.
Of course, achieving this goal involves plenty of operational goals; the major groups of operational metrics are:
- Service Production Metrics: Encapsulating all eBay Marketplace capabilities in the form of RESTful services
- Service Consumption and Adoption Metrics: Exposing capabilities to all marketplace participants on all devices, platforms, and geographies
- Non-functional Aspects: Improving all non-functional aspects such as scale, quality, availability, security, and cost
The primary goal is not easy to measure. One way we have experimented with measuring it is through application developer NPS, assuming a diverse and wide range of application developers is surveyed.
The operational metric categories above are measured by a large set of metrics, such as:
- Coverage: % of resources (nouns and verbs) in the marketplace dictionary that are exposed as services
- Adoption: % of traffic enabled by services
- Uniformity: % of device/mobile and web traffic using THE SAME set of services
- Security: a wide range of metrics that deserve their own post - but primarily the number of security issues reported per period
- Availability: a large set of metrics, per service
- Quality: number of P1/P2 bugs reported per service
- Compliance: number of standards each service complies with (with a core set as mandatory)
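The first two metrics above reduce to simple ratios. As a minimal sketch - with invented placeholder inputs, since the real marketplace dictionary and traffic data are internal - coverage and adoption could be computed like this:

```python
# Rough sketch of the Coverage and Adoption metrics described above.
# The dictionary terms and traffic numbers are illustrative placeholders.

def coverage(dictionary_terms, exposed_terms):
    """% of marketplace dictionary nouns/verbs exposed as services."""
    return 100.0 * len(dictionary_terms & exposed_terms) / len(dictionary_terms)

def adoption(service_traffic, total_traffic):
    """% of overall traffic that is enabled by services."""
    return 100.0 * service_traffic / total_traffic

terms = {"order", "cancel", "refund", "listing", "bid"}   # dictionary (toy)
exposed = {"order", "cancel", "refund", "listing"}        # exposed as services

print(coverage(terms, exposed))          # 80.0
print(adoption(9_000_000, 10_000_000))   # 90.0
```

The remaining metrics (uniformity, security, availability, quality, compliance) follow the same pattern: a well-defined numerator and denominator per service or per period, aggregated across the portfolio.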