Sunday, September 14, 2014

CommerceOS Part V - Processes, Metrics and Measurement



Part I of this series of posts described what CommerceOS is, Part II explained the structure and overview, and Part III focused on the standards that make a Microservice portfolio work. Part IV was an overview of Shared Services. In this post (the last part) I write about processes and metrics.

Processes

Platforms at their core are abstractions of capabilities. Their value comes from encapsulating frequently and commonly used capabilities, normalizing their variations and exposing them for use - freeing up applications/consumers to focus on less common (and therefore more value-adding) capabilities.

Platforms are built on standards, and maintained and enhanced via processes - without processes platforms simply deteriorate over time. This is akin to the second law of thermodynamics - the entropy of a system always increases UNLESS you spend energy - and energy in this context translates to processes. It is very unlikely that a portfolio of services, especially one with higher level business domain abstractions and their nuances, survives for a long time without some form of process. The key though is to have meaningful processes. We have three criteria for designing "meaningful" processes:

- Processes should have a clear and measurable goal.
- Processes should be transparent, with a clear decision making procedure.
- Processes should have an SLA (be time bounded).

CommerceOS defines only three processes, all at the portfolio level. They are:
- Process for adding a new service to the portfolio
- Process for adding a new standard to the portfolio
- Process for cross domain, portfolio-wide decision making

1- Service Life Cycle: Ensures the following
  •  A given service is neither duplicating (doing the same thing under a different name) nor diluting (doing a different thing under the same name) existing capabilities. If Order/Cancel and Order/Refund do "the same thing" - we have a duplication. It leads to confusion and more than likely to major bugs - after all they are intended to do the same thing, but over time functionality diverges and, depending on which API is called, an Order is either not properly canceled or not properly refunded. If User/Order and /Order return types that are drastically different (especially semantically), that is an example of "Dilution" - in this case of the Order type.
  •  Extracting potential new types for the common type space
  •  Ensuring an audit opportunity for security, legal and regulatory functions.
  •  Automated assessment of a service (once implemented) against a set of standards

2- Portfolio Standards and Type Space: This process controls the adoption of a proposal for standardizing some aspect of the system. The standards process ensures that

  • The standard is really needed - Since all service providers and consumers must comply with standards (and they are implemented in the service and application run time) the bar for adopting a standard is high. The goal is to standardize only what must be standardized - no more.
  • The standard defines behavior, not implementation choices - We also try to limit standards to aspects of the interaction between services and applications, or between services and operational tools/systems, and not service implementation. This is an important distinction. We have no standard on language, technology stack, data technology etc.
  • The standard is automatically testable - Otherwise it is not scalable to manually test all services for standard behavior.
  • The standard is properly documented and implemented - For all standards we build support in the service and application run time and libraries, or in portfolio tools


3- Decision Making (Huddle): Service and application teams are autonomous; they own all the decisions made regarding the code and systems they own (and the accountability that comes with it) - however, there are always decisions at the portfolio level, i.e. decisions that impact multiple teams. These decisions are often tactical and made in the context of large projects (strategic decisions are framed in the form of standards), and they often relate either to orchestration of complex business processes or to migration related decisions that impact other services or applications. The decision making process calls for
- Clear documentation of the question(s) at hand
- Clear documentation of the options and the proposal
- Identification of the decision maker

Debate and transparency are encouraged. The discussions are time-bounded and produce one of four outcomes:

  • Agree, Implement: Everyone implements the decision.
  • Agree, do not implement: Tech debt is captured for the team(s) that don't implement the decision.
  • Disagree, Implement: The dissenting view is acknowledged and recorded.
  • Disagree, do not implement: One-time escalation to a team of 3 senior technologists to make a final decision
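
The four outcomes above form a simple two-by-two table (agree or disagree, implement or not). A minimal sketch of that mapping follows; the names `Outcome` and `huddle_outcome` are made up for illustration and are not part of any CommerceOS tooling.

```python
from enum import Enum


class Outcome(Enum):
    """The four possible results of a Huddle, as listed above."""
    AGREE_IMPLEMENT = "everyone implements the decision"
    AGREE_NO_IMPLEMENT = "tech debt is captured for the non-implementing team(s)"
    DISAGREE_IMPLEMENT = "dissenting view is acknowledged and recorded"
    DISAGREE_NO_IMPLEMENT = "one-time escalation to 3 senior technologists"


def huddle_outcome(agrees: bool, implements: bool) -> Outcome:
    """Map a team's position (agrees?, implements?) to one of the four outcomes."""
    table = {
        (True, True): Outcome.AGREE_IMPLEMENT,
        (True, False): Outcome.AGREE_NO_IMPLEMENT,
        (False, True): Outcome.DISAGREE_IMPLEMENT,
        (False, False): Outcome.DISAGREE_NO_IMPLEMENT,
    }
    return table[(agrees, implements)]


if __name__ == "__main__":
    print(huddle_outcome(agrees=False, implements=True).value)
```

The point of the table is that every combination has a defined, recorded result, so no discussion ends without an outcome.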



Measurement 

In general, there are two types of metrics: strategic metrics and operational metrics. In this context, strategic metrics track the main reasons a platform is built (a car is built to transport people), while operational metrics show the platform's health (a car engine's temperature must be within a certain range).

CommerceOS has the following primary goal:

- Improving the productivity of application developers who develop apps to facilitate all forms of commerce using any technology stack, on any device/screen or platform, globally.

Of course, to achieve this goal there are plenty of operational goals. The major groups of operational metrics are:

  1. Service Production Metrics: Encapsulating all eBay Marketplace capabilities in the form of RESTful services
  2. Service Consumption and Adoption Metrics: Exposing capabilities to all marketplace participants on all devices, platforms and geographies
  3. Non-functional Aspects: Improving all non-functional aspects such as scale, quality, availability, security, cost etc.


The primary goal is not easy to measure. One way we are experimenting with measuring it is application developer NPS (Net Promoter Score), assuming that a diverse and wide range of application developers is included.

The operational metrics categories above are measured by a large set of metrics such as
- Coverage (% of resources/nouns and verbs in the marketplace dictionary that are exposed as services)
- Adoption (% of traffic enabled by services)
- Uniformity (% of device/mobile and web traffic using THE SAME set of services)
- Security (a wide range of metrics here that requires its own post - but primarily # of security issues reported per period)
- Availability (a large set of metrics here, per service)
- Quality (number of P1 and P2 bugs reported for a service)
- Compliance (number of standards each service is compliant with, with a core set as mandatory)
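
To make the percentage-style metrics concrete, here is a minimal sketch of how Coverage and Adoption could be computed. The function names, the inputs and the sample numbers are illustrative assumptions, not the actual CommerceOS measurement tooling.

```python
from typing import Set


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage; the whole must be positive."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def coverage(exposed: Set[str], dictionary: Set[str]) -> float:
    """% of dictionary resources (nouns) that are exposed as services."""
    return percentage(len(exposed & dictionary), len(dictionary))


def adoption(service_requests: int, total_requests: int) -> float:
    """% of traffic that is served through portfolio services."""
    return percentage(service_requests, total_requests)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    dictionary = {"Order", "Cart", "Listing", "User"}
    exposed = {"Order", "Cart"}
    print(coverage(exposed, dictionary))   # 2 of 4 nouns exposed
    print(adoption(750, 1000))             # 750 of 1000 requests via services
```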



Saturday, September 13, 2014

What is CommerceOS - Part IV - Foundations and Shared Services


Part I of this series of posts described what CommerceOS is, Part II explained the structure and overview, and Part III focused on the standards that make a Microservice portfolio work. In this post I will focus on shared services. Let's start with the natural question:

What is a Shared Service?


Of course most services are written to be re-usable and to be "shared" among multiple applications and services. But naturally the scope of use varies for each service. At the lower levels of the dependency graph in any service portfolio there are services with a much wider scope of use: many more services depend on them, and their function is less domain specific and more generic and "platformy".

Although there are no hard and fast mathematical rules to define a "shared service", we have defined a few criteria:
  •  Shared services cannot depend on any business service (basic sanity of the dependency graph)
  •  Shared services are deployed and accessible to all other services - they are part of the services runtime.
  •  Shared services have more stable interfaces and a slower release cycle - this is normally the case since the functionality is not business specific, so it does not change with business requirements.
  •  A shared service is, conceptually, something that is useful/applicable to any company (not just eBay marketplace), so at least in theory it can be either open sourced or made available to a different business/company.
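
The first criterion (a shared service cannot depend on a business service) is easy to check mechanically once dependencies are registered. A minimal sketch, assuming the dependency graph is available as a plain mapping from service name to its direct dependencies; the service names are examples only.

```python
from typing import Dict, List, Set, Tuple


def shared_service_violations(
    dependencies: Dict[str, Set[str]], shared: Set[str]
) -> List[Tuple[str, str]]:
    """Return (shared service, business dependency) pairs that break the rule.

    Checking direct dependencies of every shared service is sufficient:
    if no shared service depends directly on a business service, then no
    chain of shared services can reach one either.
    """
    violations = []
    for service in sorted(shared):
        for dependency in sorted(dependencies.get(service, set())):
            if dependency not in shared:
                violations.append((service, dependency))
    return violations


if __name__ == "__main__":
    # Hypothetical graph for illustration.
    graph = {
        "Checkout": {"Identity", "Messaging"},   # business -> shared: fine
        "Messaging": {"Config"},                 # shared -> shared: fine
        "Logging": {"Checkout"},                 # shared -> business: violation
    }
    shared = {"Identity", "Messaging", "Config", "Logging"}
    print(shared_service_violations(graph, shared))
```
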
The following is the CommerceOS list of shared services (at the time of this writing):

  1. Identity and Access Management: Provisioning identity for all apps and services, plus an OpenID Connect based authentication protocol and a SAML based and JWT based security token service.
  2. Billing: Multi-tenant billing and invoicing based on a billing event stream.
  3. Messaging: A few internal message buses. The primary message bus is an internally developed, light-weight and highly scalable Business Event Stream (BES) - it is a publish-subscribe only messaging system. We are transitioning to an AMQP based messaging system to support a wider set of messaging semantics - with some use of Kafka.
  4. Logging: Distributed logging that uses TIBCO message buses, stores in HBase and is integrated with Hadoop.
  5. Monitoring: A real time version of Map-Reduce (Red Lemur) - similar to Apache Storm.
  6. Tracking: Tracking of API calls and user activity events. Custom event API, types and structure, over TIBCO buses.
  7. Config and Metadata: Custom key/value pair service, transitioning to ZooKeeper based config management
  8. Content Management: A template language for defining content and its variations, a translation workflow and a run time distributed content repository backed by MongoDB and managed by ZooKeeper
  9. Discovery and Registry: A build time registry tool based on the Google Discovery Document (GDD) and backed by a Cassandra cluster.
  10. Routing and Intermediation: WSO2 based ESB, kept simple. It only provides routing and location transparency. Most composition and transformation of lower services is done by higher level application services; the intermediary layer is kept as "dumb" as possible.
  11. Crypto Services: HSM based hashing and encryption service, based on each tenant's security policy.
  12. Object Storage: HBase-based storage for images (or any BLOB)
  13. Caching: Memcached based API/protocol implemented by Couchbase

Shared services are multi-tenant, and can be consumed either as a service (the majority of cases) or as a separate private instance (a few cases). Shared services comply with the rest of the CommerceOS standards. Access to shared services is controlled by tokens provisioned by the CommerceOS STS (security token service) based on the provisioning rules for each application or service.


Thursday, September 11, 2014

What is CommerceOS Part III - Standards


In the first part of this series I described the goal and motivation of the eBay CommerceOS initiative and that CommerceOS is eBay's version of Microservices. In the second part I described the five major components of CommerceOS (technology, processes and org). In this post, I will focus on one of those components: the standards and patterns.

Let me start with an example: Service/API authentication:

Most eBay applications use between 20 and 50 services, and lots of them require the application's as well as the user's identity (for security and functional reasons). Imagine if each service (or domain of services) accepted a different type of token with a different issuer, different syntax and validation semantics, and a different binding of that token to the protocol (in the body, in headers with different names, combined with other headers etc.). An application would have to learn all of this and then write a lot of boilerplate code to obtain each token, store it and submit it with requests to different services; then it would have to parse and learn the error semantics for every type of authentication done by each service. All these activities make the code more complex to write, test and operate, and they do not add any value to the main function of the app. Now extend this across all types of horizontal concerns and you get an idea of why standardization backed by run time libraries is a must for any service portfolio at scale.

CommerceOS defines a set of standards and implements them in our framework for developing services (called Raptor) - this does not mean that eBay MP disallows or discourages services built using any other technology stack, but if a service is built using the standard libraries and run-time, it gets support for all standards.

Patterns and standards cover about 30 different aspects of service design, including

  1.  Identity & Access Management, 
  2.  Base Request & Response standard and extended headers
  3. Compact Header Encoding (more efficient use of headers)
  4.  Tracking 
  5.  Internationalization 
  6.  Error Handling 
  7.  Version Management 
  8.  Service Descriptor 
  9.  Service Life Cycle, Registry and Discovery  
  10.  Addressing and End Points
  11.  Sorting, Pagination, Filters and Views 
  12.  Instrumentation of Services
  13.  Messaging and Events
  14. Security
  15. Migration 
  16. Fail-over and Recovery  
  17. Multi-tenancy 
  18. Integration (with 2nd and 3rd parties)
  19. Configuration and Metadata Management 
  20. Content and Translation 
  21. Persistent Storage, Replication
  22. Failure and Recovery 
  23. Service Modeling & Interface Development Model (IDM)
  24. Multi-Tenancy 
  25. Base-API Operation (operations that all APIs must answer)
  26. Escape Response 
  27. Asynchronous Service Design 

One significant CommerceOS activity stream centers around the design and then implementation of patterns and standards. We focus a lot on correct and accurate documentation, followed by implementation in run-time libraries and/or as shared services. CommerceOS also defines a process for developing and adopting a new standard. This process is modeled after, and is very similar to, the Internet standards development process (working groups, open discussions, editors) - with the exception that it has a firm timeline to time bound the process. This way all service providers and application developers can participate and/or comment and influence the standard.

In the rest of this series of posts, I will go into a bit more detail on some of the more significant or interesting standards (if you want to know more about any standard I didn't explain, just contact me). In a later post I will focus on the two sets of principles/patterns that formed our thinking around portfolio design and individual service design, and on how we measure goals and operational metrics.

Service Descriptor and Interface Contract

CommerceOS emphasizes a formal contract. We use the Google Discovery Document (GDD) as the basis of our service descriptor and we extend it to include the aspects of the service contract we need to manage the service life cycle at build time or run-time. In a Java environment, service interfaces are annotated using a standard annotation library; our discovery tool then generates a JSON based discovery document that is used for interoperability with the rest of our tool set, used by application teams, product managers in other product teams, tech writers etc.
The COS service contract has four main parts:

  1. Service metadata: this includes the basic metadata as well as attributes for financial, regulatory and legal needs, such as whether a service handles financial data (and what types), whether a service handles personal data, the location of personal data etc.
  2. Service interface and types, as described in the Google Discovery Document
  3. Service instrumentation contract - this is the contract a service has with its operational environment; it defines the events the service generates and consumes (including events required for technical and business health monitoring)
  4. Service admin contract - a loosely JMX based API for administrators to set or get certain attributes and influence service behavior

Service teams own the contract and its maintenance, but syntax and semantics are standardized.
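
As a rough illustration of the four-part contract, the sketch below models a descriptor as a plain dictionary and checks that all four parts are present. Only the `resources`/`methods`/`httpMethod`/`path` nesting follows the public Google Discovery Document format; the top-level keys, the metadata attributes and the event names are hypothetical and do not reflect the actual CommerceOS extension schema.

```python
from typing import List

# Hypothetical top-level keys for the four parts of the contract.
REQUIRED_PARTS = ("metadata", "interface", "instrumentation", "admin")


def missing_contract_parts(descriptor: dict) -> List[str]:
    """Return the contract parts that the descriptor does not declare."""
    return [part for part in REQUIRED_PARTS if part not in descriptor]


# Illustrative descriptor for an imaginary Order service.
example_descriptor = {
    "metadata": {
        "name": "order",
        "handlesFinancialData": True,    # assumed attribute name
        "handlesPersonalData": True,     # assumed attribute name
    },
    "interface": {                       # Discovery-Document-style section
        "resources": {
            "order": {
                "methods": {
                    "get": {"httpMethod": "GET", "path": "order/{orderId}"}
                }
            }
        }
    },
    "instrumentation": {"produces": ["OrderCreated"], "consumes": []},
}

if __name__ == "__main__":
    # The admin contract is intentionally left out of the example.
    print(missing_contract_parts(example_descriptor))
```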

Service Versioning

CommerceOS services must be versioned; we use the Major.Minor.Maintenance format. Service teams need to declare/decide how many back versions they support. No team is allowed to support zero back versions and break backward compatibility - since this forces all applications to migrate.
We allow multiple versions to be alive at the same time; the API Router ensures a given request goes to the right end point. Data and entities are designed to be backward compatible.
Services are not allowed to be perpetually backward compatible either, since this practice erodes code quality and accumulates significant "dead code" that leads to a drop in agility and an increase in test complexity.
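
A minimal sketch of the versioning rules above: parsing Major.Minor.Maintenance and deciding whether a requested version is still inside the supported window. Counting "back versions" in major versions is an assumption made for this example (the policy itself does not say how they are counted), and the function names are illustrative.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    maintenance: int


def parse_version(text: str) -> Version:
    """Parse a 'Major.Minor.Maintenance' string such as '3.1.4'."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected Major.Minor.Maintenance, got {text!r}")
    return Version(*(int(part) for part in parts))


def is_supported(requested: Version, current: Version, back_versions: int) -> bool:
    """Check whether `requested` falls in the supported window.

    Assumption: back versions are counted in major versions. Supporting
    zero back versions is rejected, mirroring the rule above.
    """
    if back_versions < 1:
        raise ValueError("a service must support at least one back version")
    return current.major - back_versions <= requested.major <= current.major


if __name__ == "__main__":
    current = parse_version("3.0.0")
    print(is_supported(parse_version("2.4.1"), current, back_versions=1))
    print(is_supported(parse_version("1.9.0"), current, back_versions=1))
```

A router could use a check like this to reject requests for retired versions while still routing supported older versions to their own end points.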


Service Life Cycle and Registry

One of the most significant decisions for CommerceOS was to standardize and establish a widely understood set of milestones (called the life cycle) for service development. This may sound like the dreaded "G" word (Central Governance) - but in practice a large org cannot plan an optimal and rapid release cycle without it.

Before the life cycle standardization, the only de facto milestone for a service team was "live to site", i.e. when the service end points were available and functional in production. Application developers (web and mobile) would only then start their serious development, effectively serializing the timeline, i.e. Delivery time = Max(Services Delivery) + App Delivery.


Without widely understood and supported milestones, service teams often change their service implementation and interface until the very end of a project timeline, forcing application developers to wait till the "dust settles". CommerceOS establishes a set of milestones, the first of which is "interface published". This means the service descriptor is ready, and an end point is exposed that can respond based on the service descriptor - this end point is, in concept, similar to the Java Proxy API in that it can produce a "fake" response to a request based on the contract, with no real implementation required. Application development can practically start at this point, to a large degree decoupling app development time from service development time.
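
A small worked example of the timeline effect. The serialized formula is the one given above; the decoupled formula is my own simplifying assumption (app work starts when the interfaces are published, and the product ships when both the app and the slowest service are done). All durations are invented.

```python
from typing import List


def serialized_delivery(service_weeks: List[float], app_weeks: float) -> float:
    """Delivery time = Max(Services Delivery) + App Delivery."""
    return max(service_weeks) + app_weeks


def decoupled_delivery(
    service_weeks: List[float], interface_published_week: float, app_weeks: float
) -> float:
    """Assumed model: the app starts at 'interface published', and delivery
    is whichever finishes last, the app or the slowest service."""
    return max(max(service_weeks), interface_published_week + app_weeks)


if __name__ == "__main__":
    services = [8, 12, 10]   # weeks until each service is live (made up)
    print(serialized_delivery(services, app_weeks=6))
    print(decoupled_delivery(services, interface_published_week=2, app_weeks=6))
```

With these made-up numbers the serialized plan takes 18 weeks, while starting app work at week 2 brings delivery down to 12 weeks, the time of the slowest service.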

Service teams can change the interface - but often the thought and consideration that went into the interface design lead to more or less stable interfaces, while the implementation can change freely at any time. This aligns with one of our portfolio principles: "Stable interface, agile implementation".

Base Request and Response

CommerceOS services and applications talk over HTTP, but the exchange has to happen in a common dialect, i.e. certain semantics have to be expressed and bound to the underlying HTTP transport in a common way - this saves individual service providers and app teams from re-inventing the wheel and also prevents a lot of bugs and issues. Base request and response define a set of common headers and encodings that all COS services and apps understand. A few examples are:
the syntax to express compact headers, Authorization headers, identification of the request and request chain, serialization and encoding of request and response, session identification, bindings of location, locale and cultural preferences to the protocol, and the proper use of HTTP headers vs. body and alternative bindings to the HTTP body.


Migration

One of the most practical and important aspects of establishing a service or micro service architecture for large companies with "legacy" code is migration. By migration, I specifically mean migrating either monolithic applications with direct data access or applications that use older, legacy services to applications that consume contract based micro services. We have established a pattern, called "Bay Bridge", for service migration. It has four major steps:

- Smoke Test: Turn on the new service (with new data storage) and only use it for a very small amount of traffic for a few consumers. Dual write into (and read from) both the new service/storage and the old/legacy storage. The primary source of truth is still the legacy storage.
- Load/Sync: Copy/transform data from the legacy storage to the new service storage as appropriate. This phase itself may include smaller phases depending on the data. The longer lasting a data item/entity is, the more critical this phase is, e.g. User is a very long lasting entity while an Auction listing may last only 7 days and a session may be stored only for a few hours. The main goal of this phase is to bring the new and legacy storage to parity.
- Fly with Safety Net: Dual read/write continues, but the primary is now the new storage/service.
- Clean up: The old storage is cleaned up and deprecated.
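
The four steps can be sketched as a small wrapper around two stores. This is a simplified illustration, not the actual Bay Bridge implementation: both stores are plain dictionaries, all traffic is assumed to go through the wrapper, and the class and method names are invented.

```python
from enum import Enum


class MigrationPhase(Enum):
    SMOKE_TEST = 1    # dual read/write, legacy is the source of truth
    LOAD_SYNC = 2     # backfill the new store, legacy still primary
    SAFETY_NET = 3    # dual read/write, new store is the source of truth
    CLEAN_UP = 4      # legacy store retired


class BayBridgeStore:
    def __init__(self, legacy: dict, new: dict):
        self.legacy = legacy
        self.new = new
        self.phase = MigrationPhase.SMOKE_TEST
        self.mismatches = 0   # drift between the two stores seen on reads

    def write(self, key, value):
        # Dual write until the legacy store is retired.
        if self.phase is not MigrationPhase.CLEAN_UP:
            self.legacy[key] = value
        self.new[key] = value

    def read(self, key):
        if self.phase is MigrationPhase.CLEAN_UP:
            return self.new[key]
        if self.phase is MigrationPhase.SAFETY_NET:
            primary, secondary = self.new, self.legacy
        else:
            primary, secondary = self.legacy, self.new
        value = primary[key]
        # Dual read: the secondary is consulted only to detect drift.
        self.mismatches += int(secondary.get(key) != value)
        return value

    def load_sync(self):
        # Bring the new store to parity; legacy is still the truth here.
        self.new.update(self.legacy)


if __name__ == "__main__":
    store = BayBridgeStore(legacy={"u1": "alice"}, new={})
    store.write("u2", "bob")          # goes to both stores
    store.load_sync()                 # backfills u1 into the new store
    store.phase = MigrationPhase.SAFETY_NET
    print(store.read("u1"), store.mismatches)
```

Counting mismatches on dual reads is one way to decide when the stores have reached parity and it is safe to move to the next phase.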




Persistence
CommerceOS allows services to choose their own persistence storage and technology. Depending on the type of data a service handles (preferences vs. financial data, or blog posts vs. passwords and credentials), there are patterns for whether systems should prefer consistency and availability (CA, e.g. financial data) or partition tolerance and availability (PA, most anything else).

From a logical point of view, service teams are required to have isolated storage, i.e. no other service or application should read/write directly from/to the primary database of another service. Sometimes (especially with bulk data) it is not efficient to consume a classic service interface (serializing and deserializing is too much overhead) - in these cases the service must expose a push style "data feed", and still should not allow other services to directly read its primary storage.

Services are required to register their database and the structure of the logical entities stored (there is no "governance" of such entities, just registration for the discovery process).

Fail Over and Recovery 

There are two types of fail over and recovery in CommerceOS: Transparent and Degraded.

Transparent failures are failures of stateless application servers/services or database hosts. Application servers all run behind load balancers with virtual IPs (and sometimes load balancing is done using run-time discovery, ZooKeeper style), and database failures are handled by partitioning and replicating data (Cassandra style). These typical failures of services and databases are handled without the service code noticing the failure; the system continues to operate with no impact.

The other type of fail-over is "degraded" - in this case the failure is not transparent to a service. For example, a Pricing service (that calculates the total order price) calls an Incentive calculation and receives a failure - say due to yet another system failure in the Incentive subsystem that could not be handled (e.g. it uses a non-partitioned, non-replicated DB that failed) - and the Pricing service now has to "degrade" its function in a way that still calculates the total order price. This is a higher level and more domain specific handling of failure, yet a few aspects of it can be abstracted and implemented in the run-time. In particular, we define a light-weight processing framework. It is a pipeline based programming model: each pipeline has a series of phases, and phases can be assembled dynamically at run-time. Each pipeline is executed by an Executor, which is the main run time for pipelines. Each phase can be annotated as required, optional or alternative. Each phase has a few life cycle states, the two most important of which are up and down. If a required phase is down and no alternative is designated, the pipeline fails. If an optional phase fails, the executor executes the designated alternative; if no alternative is designated, processing continues.
This simple framework provides an abstraction for degraded functionality.
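
The pipeline rules above can be sketched in a few lines. This is an illustrative model, not the actual run-time: a phase being "down" is represented by its function raising an exception, the class names are invented, and a failing alternative is simply allowed to propagate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class PipelineFailed(Exception):
    """Raised when a required phase is down and has no alternative."""


@dataclass
class Phase:
    name: str
    run: Callable[[dict], None]
    required: bool = True
    alternative: Optional["Phase"] = None


class Executor:
    def execute(self, phases: List[Phase], context: dict) -> dict:
        for phase in phases:
            try:
                phase.run(context)
            except Exception as error:
                if phase.alternative is not None:
                    # Degrade: run the designated alternative instead.
                    phase.alternative.run(context)
                elif phase.required:
                    raise PipelineFailed(phase.name) from error
                # Optional phase without an alternative: skip and continue.
        return context


# Pricing example from the text, with made-up logic and numbers.
def base_price(ctx: dict) -> None:
    ctx["total"] = ctx["subtotal"]


def incentives(ctx: dict) -> None:
    raise RuntimeError("Incentive subsystem is down")


def no_incentive(ctx: dict) -> None:
    ctx["discount"] = 0


if __name__ == "__main__":
    pipeline = [
        Phase("base-price", base_price),
        Phase("incentive", incentives, required=False,
              alternative=Phase("no-incentive", no_incentive)),
    ]
    print(Executor().execute(pipeline, {"subtotal": 100}))
```

Here the Pricing pipeline still produces a total when the Incentive phase is down; it just does so without a discount.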

Escape Response


Escape response is a small, yet important, aspect that illustrates the need for standards in a given portfolio. An example illustrates the concept: imagine that, due to a security breach, you need all users to change their passwords. You cannot realistically change 100s of applications to message the user for a password change. What do you do?

Applications make service calls all the time. A particular response header is called "Escape Response" and it includes an end point and a unique number. All applications know (and it is implemented in the service invocation library as well) that if they see the escape header, they must redirect the user (in a device and platform specific way) to the given end point. This "escape" path allows the system to take over from any compliant service using a standard syntax, semantics and protocol binding.
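
On the client side, the check could look like the sketch below. The header name `X-Escape-Response` and the `endpoint;id=number` value format are placeholders invented for this example; the actual CommerceOS header name and syntax are not given here.

```python
from typing import Mapping, NamedTuple, Optional

# Hypothetical header name and value format, for illustration only.
ESCAPE_HEADER = "X-Escape-Response"


class Escape(NamedTuple):
    endpoint: str
    escape_id: str


def parse_escape(headers: Mapping[str, str]) -> Optional[Escape]:
    """Return the escape instruction if the response carries one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == ESCAPE_HEADER.lower():
            endpoint, _, escape_id = value.partition(";id=")
            return Escape(endpoint.strip(), escape_id.strip())
    return None


if __name__ == "__main__":
    response_headers = {
        "Content-Type": "application/json",
        "x-escape-response": "https://example.com/reset-password;id=42",
    }
    escape = parse_escape(response_headers)
    if escape is not None:
        # A real client would redirect in a device/platform specific way.
        print(f"redirect user to {escape.endpoint} (escape {escape.escape_id})")
```

Because the check lives in the shared invocation library, every compliant application picks up the behavior without a code change of its own.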





What is CommerceOS - Part II - Basic Structure


In the first post of this series, I described the motivations and goals of CommerceOS. I stated that CommerceOS is eBay MP's version of Micro Services. It is a portfolio of services (RESTful and deployed in the cloud) that enables developers to build applications that facilitate commerce among people and/or entities, on any device, in any language, rapidly, economically and securely.

In this post I will start describing some technical details, best practices and what we learned from building a service portfolio at scale.

The easiest way to start describing CommerceOS is with this basic picture illustrating the overall topology of the system




This diagram shows a high level logical architecture of the ecosystem and its main components

- Applications: This includes applications on mobile and all other devices, web apps, as well as 3rd party applications, merchant and partner back ends etc. Basically all consumers of the CommerceOS portfolio. The number of applications using CommerceOS services is in the range of 10s of thousands, and those include eBay applications (1st party) as well as partner and 3rd party apps.

- Services Portfolio: These are services and business capabilities that enable marketplace operations, as well as Commerce primitives used to build any commercial app. Examples are Listing, Checkout, Order Management, Listing Details, Watch List, Cart etc. These services are often (but not always) multi-tenant. The collection of these services is called "The Portfolio". The number of services in the CommerceOS portfolio is in the range of 100s, built by teams across the globe.

- Shared Services: Core platform capabilities such as Identity & AM, Security Token Services, Session, Tracking, Messaging (of all types) etc.

- API Middle Tier: Part of "shared services" is a middle tier that mainly functions as a simple router connecting API requests to the internal services implementing them.


CommerceOS is structured into five major activity streams (and similar organization structure), each essential to build and operate a Microservice ecosystem at scale. The tracks are

1- Standard and Patterns
2- Services Portfolio
3- Foundations and Shared Services
4- Evangelism and Advocacy
5- Governance, Metrics and Measurements

Standards and Patterns

The first observation about an ecosystem with 1000s of apps and 100s of services is that if key aspects of service interface and implementation - such as error handling, tracking, identity, internationalization, version policy, monitoring, billing etc. - are left to each service team to decide, developing an application that consumes on average 20-50 services will be an extremely difficult and unproductive task. This approach may also create legal liability (when dealing with financial or personal data). A decision we made early on was to formally define (and carefully document) standards that make all components work together. This may not be what you expect from a Micro service portfolio, but we learned that formally documenting standards (and versioning them) is extremely valuable, both for system developers who implement those standards in the form of run time libraries and shared services and for service and application developers who want to learn about the portfolio in general. See part III for more details on standards.

Foundations and Shared Services

A portfolio of services requires three major foundation pieces
- Run time libraries for services (in supported tech stacks)
- Shared Services that are used by the entire portfolio (and whose functionality you really don't want replicated), such as Identity and Access Management, Caching, Messaging, Crypto Services, config, registry and discovery, and content management
- Productivity Tools: Vital for operation of portfolio are tools (and dash boards) for monitoring, recovery, synchronization, life cycle management, learning and exploration tools etc.

See part IV for more information


Service Portfolio 


This is the actual portfolio of business capabilities, the services are designed and built by de-centralized (and global) services teams - they are often between 3-10 people, and they are free to use any technology stack they see fit - of course if they use the eBay standard Java/Spring based framework (called Raptor) they get a lot of functionality for free - but they don't have to. Services must publish their descriptor formally and some elements of their type spaces (such as User, Listing, Order, Cart, Claim, Product etc.) are standardized (i.e. they are part of a Common Type Space) - services also need to declare events they produce.


Evangelism and Advocacy (E&A)

This may be one of the most significant, yet most poorly understood, areas of any service portfolio management. People often view these as "soft skill" areas. However, they are vital to BOTH the adoption and the evolution of a service portfolio. Let's see what they are

- Evangelism: The main goal is to make sure the application developer community (especially the internal one) knows all the capabilities in the portfolio and how to use them. It includes documentation, sample code, sample request sets (especially for edge and error cases) as well as the basic performance characteristics of a service. Without proper evangelism, developers will not know what capabilities exist and how to use them properly; as a result, they will re-build the same capability (often with a slightly different name, semantics and implementation).

- Advocacy: Can be thought of as the reverse of evangelism - its primary goal is to make sure services cover all the right capabilities application developers need. Through advocacy (and collecting requirements) we learn whether we need to introduce (or promote) a new type to our common type space, or whether we need a new service (such as integration with a certain data provider - for example a phone IMEI or business address provider).

Both E&A are critical for re-use at scale - re-use (especially at the domain capability layer) is not developers' natural reaction. The higher you are on the stack, the tougher re-use becomes. It is unlikely that a team (or engineers) will rebuild a JVM or a messaging protocol, but business abstractions such as Identity, Cart, Order, Search Result, Listing, Invoice etc., as well as the processes that operate on them, are duplicated all the time. E&A is the key to getting some level of re-use at the business domain level.


Governance, Metrics and Measurements

The last major area of activity for CommerceOS is Governance, Metrics and Measurements. A lot of people believe that no governance should be required in developing a Micro service portfolio. This only leads to chaos - particularly for application developers, who often consume a large number of services. A more sensible suggestion is "decentralized governance". This sounds like a good idea, but the devil is in the details. CommerceOS adopted a light - but well defined - governance model and the processes that enable it. I fully realize that this may not be a popular concept in today's environment driven by buzzwords such as "autonomy" and "agility", however my personal experience is that
- The right level of governance makes everyone's job much faster and less painful
- Process has to be transparent, objective and driven by an explicit and measured SLA - what people hate is arbitrary process run by "people who think they know more than you do".

Having said that, CommerceOS defines processes to govern three aspects

  1. Adding a service into CommerceOS portfolio (or deprecating of a service)
  2. CommerceOS Common Type Space
  3. Adoption of a standard or pattern into CommerceOS
Each process is well defined and has largely objective criteria.

See part V for more details.

In part III, I will go over the standards and patterns that make all services and apps interoperate successfully. Part IV focuses on Foundations and Core Services, and part V talks a bit about processes and measurements.








Monday, September 1, 2014

What is CommerceOS - Part I - Introduction


I am often asked what CommerceOS is, and equally often I answer based on what a successful end state for CommerceOS looks like: it is a set of capabilities, exposed largely as RESTful APIs, that allows anyone to rapidly, cost effectively and securely develop applications that facilitate commerce between or among real or legal entities, on any platform, device or language, globally.

This means if you want to develop an application for a consignment store to sell its merchandise online (on or off eBay), CommerceOS makes it easy for you. If you work for a large multi-national corporation and want to list your products on eBay and on your site, then let local, regional and international buyers purchase your product, CommerceOS provides your APIs. If you want to write a pop-up app on iPhone and Android that showcases adventure riding gear and products by mashing up content and pictures with commerce and community, CommerceOS should make your job easier and cheaper. And of course, if you are an eBay mobile developer, CommerceOS APIs enable you to build your applications ... and countless other scenarios like these - you get the picture.

This answer, although accurate from a utility and use point of view, does not describe CommerceOS from a technical and architectural point of view. A casual observer may not appreciate the technical, process and organizational complexities that have to be overcome to achieve the CommerceOS vision. In this series of posts I try to explain CommerceOS from the technical side.

The rest of this post dives a bit deeper into motivations and problem definitions. In the second part, I will focus on how we have approached the problem and the architectural details of CommerceOS. Part III explains the set of technical standards and patterns used by all CommerceOS services and consumers, which are the basis of interoperability of the entire system. Part IV is a review of a few core services of CommerceOS called "Shared Services". Finally, Part V deals with the processes we use to maintain and enhance the platform.

First, a bit of context. Back in the 2009-2010 timeframe (when the CommerceOS transformation started) eBay had a global engineering workforce in the range of thousands (as it does today). The collective job of this engineering team was to work on an essentially web-based architecture (like so many other web era companies). This architecture assumed (perhaps more implicitly than explicitly) that

  • Buyers interacted with the marketplace largely via a web browser, running on a desktop/laptop.
  • Sellers were small(er) and interacted either via web UI or APIs designed to replicate Web UI - essentially APIs were designed for sellers.
  • The API code base and the business logic powering the web UI were largely two independent code bases.
  • Data and tables were shared among all applications and services. Systems were single tenant and monolithic from an identity and tenancy point of view.
  • Both data and code for all subsystems/capabilities (such as Identity, Checkout or Payment services, Search) were hard to isolate and package independently.
  • Operation technology was fixed capacity, manually provisioned, under-utilized.


The platform was stable and scalable (albeit with the scale-cost characteristics dictated by the then-current state of technology - fixed capacity, built for burst) with its own set of processes that made it possible for global teams to get their job done. However, in 2009-2010, it was very clear that we were a few years into the development of trends that would reshape (or ram through, depending on your point of view) the traditional web-based architecture and require different scale-cost and productivity trajectories. These trends were:

1- The ever present mandate for efficiency and cost control driven by the need to expand or at least preserve margins or reallocate resources. This is a natural trend for any market and segment as they mature and scale.
2- The mobile and connected device revolution which was shifting more and more of traffic to non-web, non-PC devices.
3- The advances in cloud computing that drove the operation cost down, but demanded functionality to be packaged cleanly and independently to be deployed wherever capacity was available.
4- The shift of the eBay merchant mix from smaller, more casual, lower volume sellers/merchants to larger, more architecturally and operationally formal, higher volume sellers.

(I will talk about how technologies known as Big Data changed and impacted eBay architecture in a different post)

It was very clear that the forces of the four factors above required a different architectural approach from the then-current web-oriented architecture. We needed to bring down the cost of both infrastructure and operation as well as increase the productivity of our developers to address #1 above.

To address the multi-screen revolution (#2), we needed more than just a set of APIs (which eBay has had since 2003) - eBay marketplaces needed all screens (including desktop) to consume one set of APIs, not two sets of code bases, one powering the "main" web UI and the other powering all APIs/devices. This is a much harder problem than simply developing APIs. Additionally, we needed all APIs, regardless of which one of nearly 100 teams across the globe developed them, to look like a coherent portfolio of APIs (given the laws of thermodynamics, you can guess how tough this is).

To take advantage of the promises of cloud computing (elasticity of capacity, better resource utilization, related cost savings etc.) we needed our business logic and data to be encapsulated in well-defined and isolated modules, with clear dependencies that lend themselves to packaging suitable for a cloud environment. Refactoring code to isolate systems and services is tough enough, but separating shared data and isolating storage and tables is an order of magnitude more challenging.

And finally, the larger, more sophisticated our merchants and partners became, the more we needed to use formal integration methodologies and feed-based interactions built on de-facto industry standard models of Order, Product, Inventory, Cart, Returns and less on auction-centered models such as Item, transaction etc. This represented a deep change in entity and domain models and re-architecture of code and, perhaps more importantly, migration from old models to new models.

Given where we were and where we wanted to evolve the marketplace architecture, we came up with a definition of a vision for CommerceOS as the basis for alignment and execution, knowing full well that no one paragraph accurately covers all aspects of this initiative. Having said that, here is the definition we came up with:

CommerceOS is eBay MP initiative to transform its architecture in such a way that large majority of its platform capabilities and business processes are exposed as RESTful APIs in the cloud in such a way that all marketplace participants can consume them, uniformly, securely, effectively and with high quality.

This may sound a bit abstract, but let me parse it:

The primary goal is to "transform architecture" - this means both technology stack (for services and applications) and processes not just technology (it never works that way).

Platform capabilities and business processes: That basically means everything that MP does, from identity, verification of attributes, and caching to listing, pricing, order management and search, needs to be re-designed as RESTful services and exposed to the correct consumers. The modifier "Majority" indicates that some interactions are not strictly RESTful for legacy or integration requirements. We are pragmatic about it.

Our services are RESTful. REST is a de-facto standard, and we developed a few internal specs to uniformly cover aspects such as use of non-HTTP verbs, security, tracking, internationalization, filtering, constraints etc.

The deployment target for services is "Cloud". This is not a feel good or casual term; it strictly means that a service must be produced in a way that it can be packaged fully isolated from other services, with well known dependencies and a clear version number. This allows us to deal with a unit of capability as a building block and use technologies such as Docker effectively (we took a few detours here with OSGi but course corrected).

"All participants" indicates that the consumers will be applications written not only for buyers and small sellers who interact with the marketplace via a browser, but for everyone else on all devices, including large sellers who never interact via ebay.com, CS and their internal tools, partners such as logistical service providers, as well as internal eBay staff and, of course, eBay applications on all devices.

Then there is a set of non-functional requirements, each key to how APIs are built, they include

Uniform: basically one code base supporting all consumers, with no "primary" web consumer.
Secure: self-explanatory, the platform MUST be secure with well defined identity, access management and auditing. Secure also includes availability.
Efficient: APIs must be efficient to consume, i.e. application developers must spend a minimal amount of time and effort to consume them (docs, samples and sandboxes). APIs also should be cost effective to operate (see cloud requirement).
High Quality: An umbrella term covering functional quality (no bugs) as well as performance.

CommerceOS started with this vision definition and context. In the next post, I will describe the architecture and structure of CommerceOS, both as a technology platform and from an organization/process point of view.



Saturday, November 24, 2012

Via Quora: How has eBay's review system evolved in the past few years?



The feedback system and "feedback score" are among the first things associated with eBay (maybe second only to auctions). The main purpose of any feedback system is to harness the "signals" community members send to the "feedback processor" to encourage and "enforce" desirable behavior. Anyone who has ever thought about or designed a rating or feedback system knows that the task is not as easy as it may first seem. And the first step is to understand what the different attributes of a feedback system are.

A member of Quora community asked me to answer the question "How has eBay's review system evolved in the past few years?". The question provided me with an opportunity (and a little push) to talk about six attributes of any feedback system. See here for the full answer and if you are not using Quora, I strongly recommend you consider using it. It is addictive. 

Friday, September 14, 2012

Primary Goals and Strategic Metrics vs. Operational Metrics



Measurement is key to any successful engineering effort; what matters is which metrics you choose to measure and how you interpret the measurements. In general there are two types of metrics associated with two different types of goals

- Primary (or strategic) goals and related metrics
- Operational Goals and related operational metrics

When goals are defined, it is very important to define which category a goal belongs to so that progress (or lack thereof) can be measured accurately.

An example makes the difference between the two types of metrics clear. Imagine you want to drive from San Francisco to Los Angeles. Your goal is very clear: getting to L.A. The metric associated with it is also clear: how far are you from L.A. (or how far have you traveled so far). You drive a car, so you measure your car's engine temperature and fuel level - these are your operational metrics.

You can keep your engine temperature within a reasonable range and maintain a proper fuel level, but these are not the goals of your trip. If you are happy simply because your engine does not overheat, and you are not concerned with how far from L.A. you are, you are not measuring the right metrics.

The distinction between primary goals and operational goals is not always as easy to spot. Imagine you are designing an Order Management service (or any other service) - it is easy - maybe even common - to measure the number of calls to the service per hour, day etc. and assume that the goal is being achieved. However, if your service can only be called by one type of client (because it is the only one that can supply an input parameter, for example), are you really achieving one of the most important (and common) goals of service design, which is re-use of a capability by all consumers? Here, the number of calls is an operational metric; it has to be measured, and it may be necessary but it is not sufficient. To measure the primary goal, you need to measure the diversity of consumers (across languages, devices, platforms).
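
To make the contrast concrete, here is a minimal sketch in Python. It assumes a hypothetical call log in which every call is tagged with the consumer's device and language; the log format and the function names are illustrative only, not part of any CommerceOS tooling.

```python
from collections import Counter

# Hypothetical call log: one (device, language) tuple per call to the service.
# Nine calls come from the same web/Java client and one from an iOS client.
calls = [("web", "java")] * 9 + [("ios", "swift")]


def call_volume(log):
    """Operational metric: how many calls the service received."""
    return len(log)


def consumer_diversity(log):
    """Primary-goal metric: how many distinct kinds of consumers call the
    service, and what share of the traffic the dominant kind accounts for."""
    counts = Counter(log)
    distinct = len(counts)
    dominant_share = max(counts.values()) / len(log) if log else 0.0
    return distinct, dominant_share


volume = call_volume(calls)
distinct, dominant_share = consumer_diversity(calls)
print(f"calls={volume} distinct_consumers={distinct} dominant_share={dominant_share:.0%}")
```

A healthy call volume can hide the fact that almost all traffic comes from a single kind of consumer; tracking the second pair of numbers next to the first is what shows whether re-use is actually happening.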

For another example, assume you deploy a distributed cache service - the goal of any cache service is to improve performance (and scalability). Again, it is easy to measure metrics such as the number of calls to the service or even more specific metrics such as "hit rate", but the primary goal here calls for measuring - from the consumer's point of view - the performance and scale improvement.
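
The same idea can be sketched for the cache example. The numbers below are made up for illustration, and the two helper functions are hypothetical; the point is only that a high hit rate does not by itself say how much faster the consumer's requests became.

```python
from statistics import median


def hit_rate(hits, misses):
    """Operational metric: fraction of cache lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def latency_improvement(before_ms, after_ms):
    """Primary-goal metric: relative reduction in the median latency that the
    consumer observes end to end, before vs. after the cache was introduced."""
    before = median(before_ms)
    after = median(after_ms)
    return (before - after) / before


# Made-up sample: the cache answers 95% of lookups, yet the consumer's
# end-to-end latency barely moves because the cached call was never the
# slow part of the request.
rate = hit_rate(hits=95, misses=5)
gain = latency_improvement(before_ms=[100, 120, 110], after_ms=[95, 115, 105])
print(f"hit_rate={rate:.0%} consumer_latency_gain={gain:.1%}")
```

In this sample the operational metric looks excellent while the consumer-side gain is under five percent, which is exactly the gap between an operational metric and the primary goal.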

One main reason operational metrics are often measured, and used for decision making, instead of primary goals or strategic metrics is that they are generally easier to measure. It is certainly easier to measure the "number of calls" to a service than whether a service increases its consumers' productivity. "How to Measure Anything" by Douglas Hubbard is a great resource for techniques to measure hard-to-quantify goals and metrics.