Monday, September 1, 2014

What is CommerceOS - Part I - Introduction


I am often asked what CommerceOS is; equally often, I answer based on what a successful end state for CommerceOS looks like: it is a set of capabilities, exposed largely as RESTful APIs, that allows anyone to rapidly, cost-effectively, and securely develop applications that facilitate commerce between or among real or legal entities, on any platform, device, or language, globally.

This means that if you want to develop an application for a consignment store to sell its merchandise online (on or off eBay), CommerceOS makes it easy for you. If you work for a large multi-national corporation and want to list your products on eBay and on your own site, then let local, regional, and international buyers purchase them, CommerceOS is your set of APIs. If you want to write a pop-up app on iPhone and Android that showcases adventure-riding gear and products by mashing up content and pictures with commerce and community, CommerceOS should make your job easier and cheaper. And of course, if you are an eBay mobile developer, CommerceOS APIs enable you to build your applications... and countless other scenarios like these; you get the picture.

This answer, although accurate from a utility and usage point of view, does not describe CommerceOS from a technical and architectural point of view. A casual observer may not appreciate the technical, process, and organizational complexities that must be overcome to achieve the CommerceOS vision. In this series of posts I try to explain CommerceOS from the technical perspective.

The rest of this post dives a bit deeper into motivations and problem definitions. In the second part, I will focus on how we approached the problem and on the architectural details of CommerceOS. Part III explains the set of technical standards and patterns used by all CommerceOS services and consumers, which are the basis of interoperability for the entire system. Part IV is a review of a few core services of CommerceOS called "Shared Services". Finally, Part V deals with the process we use to maintain and enhance the platform.

First, a bit of context. Back in the 2009-2010 timeframe (when the CommerceOS transformation started), eBay had a global engineering workforce in the range of thousands (as it does today). The collective job of this engineering team was to work on an essentially web-based architecture (like so many other web-era companies). This architecture assumed (perhaps more implicitly than explicitly) that:

  • Buyers interacted with the marketplace largely via a web browser, running on a desktop/laptop.
  • Sellers were small(er) and interacted either via the web UI or via APIs designed to replicate the web UI - essentially, APIs were designed for sellers.
  • The API code base and the business logic powering the web UI were largely two independent code bases.
  • Data and tables were shared among all applications and services. Systems were single-tenant and monolithic from an identity and tenancy point of view.
  • Both data and code for all subsystems/capabilities (such as Identity, Checkout, Payments, or Search) were hard to isolate and package independently.
  • Operations technology was fixed-capacity, manually provisioned, and under-utilized.


The platform was stable and scalable (albeit with the scale-cost characteristics of the then-current state of technology: fixed capacity, built for burst), with its own set of processes that made it possible for global teams to get their jobs done. However, in 2009-2010, it was very clear that we were a few years into the development of trends that would re-shape (or ram through, depending on your point of view) the traditional web-based architecture and require different scale-cost and productivity trajectories. These trends were:

1- The ever-present mandate for efficiency and cost control, driven by the need to expand or at least preserve margins, or to reallocate resources. This is a natural trend for any market and segment as they mature and scale.
2- The mobile and connected-device revolution, which was shifting more and more traffic to non-web, non-PC devices.
3- The advances in cloud computing, which drove operations costs down but demanded that functionality be packaged cleanly and independently so it could be deployed wherever capacity was available.
4- The shift of the eBay merchant mix from smaller, more casual, lower-volume sellers/merchants to larger, architecturally and operationally more formal, higher-volume sellers.

(I will talk about how the technologies known as Big Data changed and impacted eBay architecture in a different post.)

It was very clear that the four factors above required a different architectural approach from the then-current web-oriented architecture. We needed to bring down the cost of both infrastructure and operations, as well as increase the productivity of our developers, to address #1 above.

To address the multi-screen revolution (#2), we needed more than just a set of APIs (which eBay has had since 2003) - eBay Marketplaces needed all screens (including desktop) to consume one set of APIs, not two code bases, one powering the "main" web UI and the other powering all APIs/devices. This is a much harder problem than simply developing APIs. Additionally, we needed all APIs, regardless of which one of nearly 100 teams across the globe developed them, to look like a coherent portfolio of APIs (given the laws of thermodynamics, you can guess how tough this is).

To take advantage of the promises of cloud computing (elasticity of capacity, better resource utilization, related cost savings, etc.), we needed our business logic and data to be encapsulated in well-defined and isolated modules, with clear dependencies, that lend themselves to packaging suitable for a cloud environment. Refactoring code to isolate systems and services is tough enough, but separating shared data and isolating storage and tables is an order of magnitude more challenging.

And finally, the larger and more sophisticated our merchants and partners became, the more we needed to rely on formal integration methodologies and feed-based interactions built on de-facto industry-standard models of Order, Product, Inventory, Cart, and Returns, and less on auction-centered models such as Item and Transaction. This represented a deep change in entity and domain models, a re-architecture of code, and, perhaps more importantly, a migration from the old models to the new ones.

Given where we were and where we wanted to evolve the marketplace architecture, we came up with a definition of a vision for CommerceOS as the basis for alignment and execution, knowing full well that no one paragraph accurately covers all aspects of this initiative. Having said that, here is the definition we came up with:

CommerceOS is the eBay Marketplaces initiative to transform its architecture in such a way that the large majority of its platform capabilities and business processes are exposed as RESTful APIs in the cloud, so that all marketplace participants can consume them uniformly, securely, efficiently, and with high quality.

This may sound a bit abstract, but let me parse it:

The primary goal is to "transform the architecture" - this means transforming both the technology stack (for services and applications) and the processes, not just the technology (it never works that way).

Platform capabilities and business processes: That basically means everything that MP does - from identity, verification of attributes, and caching to listing, pricing, order management, and search - needs to be re-designed as RESTful services and exposed to the correct consumers. The modifier "majority" indicates that some interactions are not strictly RESTful, for legacy or integration reasons. We are pragmatic about it.

Our services are RESTful; it is the de-facto standard, and we developed a few internal specs to uniformly cover aspects such as the use of non-HTTP verbs, security, tracking, internationalization, filtering, constraints, etc.
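To make the "uniform conventions" idea concrete, here is a minimal sketch of what a call to such an API might look like. The host, resource, parameter, and header names are invented for illustration; the actual CommerceOS specs are internal and almost certainly differ:

```python
import requests

# Hypothetical endpoint, query parameters, and header names.
BASE = "https://api.example-commerce.com/v1"

response = requests.get(
    f"{BASE}/orders",
    params={
        "filter": "status:PAID",  # uniform filtering syntax
        "limit": 50,              # uniform pagination constraint
        "offset": 0,
    },
    headers={
        "Authorization": "Bearer <access-token>",
        "Accept-Language": "de-DE",     # internationalization
        "X-Tracking-Id": "3f2a9c",      # request tracking (name assumed)
    },
)
response.raise_for_status()
orders = response.json()
```

The point of such specs is that every service answers filtering, paging, tracking, and localization the same way, so a consumer who has learned one API has effectively learned them all.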

The deployment target for services is the "cloud". This is not a feel-good or casual term; it strictly means that a service must be produced in such a way that it can be packaged fully isolated from other services, with well-known dependencies and a clear version number. This allows us to treat a unit of capability as a building block and use technologies such as Docker effectively (we had a few detours here with OSGi but course-corrected).

"All participants" indicates that the consumers will be applications written not only for buyers and small sellers who interact with marketplace via a browser, but everyone else on all devices including large sellers who never interact via ebay.com, CS and their internal tools, partners such as logistical service providers as well as internal eBay staff and, of course, eBay application on all devices.

Then there is a set of non-functional requirements, each key to how the APIs are built. They include:

Uniform: basically one code base supporting all consumers, with no "primary" web consumer.
Secure: self-explanatory; the platform MUST be secure, with well-defined identity, access management, and auditing. Security also includes availability.
Efficient: APIs must be efficient to consume, i.e. application developers must spend a minimal amount of time and effort to consume them (docs, samples, and sandboxes); APIs should also be cost-effective to operate (see the cloud requirement).
High Quality: an umbrella term covering functional quality (no bugs) as well as performance.

CommerceOS started with this vision and context. In the next post, I will describe the architecture and structure of CommerceOS, both as a technology platform and from an organization/process point of view.



Saturday, November 24, 2012

Via Quora: How has eBay's review system evolved in the past few years?



The feedback system and "feedback score" are among the first things associated with eBay (maybe second only to auctions). The main purpose of any feedback system is to harness the "signals" community members send to a "feedback processor" to encourage and "enforce" desirable behavior. Anyone who has ever thought about or designed a rating or feedback system knows that the task is not as easy as it may first seem. And the first step is to understand what the different attributes of a feedback system are.

A member of the Quora community asked me to answer the question "How has eBay's review system evolved in the past few years?". The question provided me with an opportunity (and a little push) to talk about six attributes of any feedback system. See here for the full answer, and if you are not using Quora, I strongly recommend you consider it. It is addictive.

Friday, September 14, 2012

Primary Goals and Strategic Metrics vs. Operational Metrics



Measurement is key to any successful engineering effort; the trick is which metrics you choose to measure and how you interpret the measurements. In general, there are two types of metrics, associated with two different types of goals:

- Primary (or strategic) goals and related metrics
- Operational goals and related operational metrics

When goals are defined, it is very important to specify which category each goal belongs to, so that progress (or lack thereof) can be measured accurately.

An example makes the difference between the two types of metrics clear. Imagine you want to drive from San Francisco to Los Angeles. Your goal is very clear: getting to L.A., and the metric associated with it is also clear: how far you are from L.A. (or how far you have traveled so far). You drive a car, so you measure your car's engine temperature and fuel level - these are your operational metrics.

You can keep your engine temperature within a reasonable range and maintain a proper fuel level, but these are not the goals of your trip. If you are happy simply because your engine does not overheat, while you are not concerned with how far from L.A. you are, you are not measuring the right metrics.

The distinction between primary goals and operational goals is not always as easy to spot. Imagine you are designing an Order Management service (or any other service). It is easy - maybe even common - to measure the number of calls to the service per hour, day, etc. and assume that the goal is being achieved. However, if your service can only be called by one type of client (because it is the only one that can supply a certain input parameter, for example), are you really achieving one of the most important (and common) goals of service design, which is re-use of a capability by all consumers? Here, the number of calls is an operational metric; it has to be measured, and it may be necessary, but it is not sufficient. To measure the primary goal, you need to measure the diversity of consumers (across languages, devices, and platforms).
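As a minimal sketch of the difference, assuming call logs that record which consumer made each request (the log format here is invented), the operational metric and the strategic one come from the same data but answer different questions:

```python
from collections import Counter

# Hypothetical call-log records; in practice these would come from the
# service's access logs or tracking headers.
calls = [
    {"consumer": "web-ui",      "platform": "browser"},
    {"consumer": "ios-app",     "platform": "mobile"},
    {"consumer": "web-ui",      "platform": "browser"},
    {"consumer": "partner-erp", "platform": "server"},
]

total_calls = len(calls)                            # operational metric
by_consumer = Counter(c["consumer"] for c in calls)
diversity = len(by_consumer)                        # strategic metric: distinct consumers

print(f"calls: {total_calls}, distinct consumers: {diversity}")
print({name: count / total_calls for name, count in by_consumer.items()})
```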

For another example, assume you deploy a distributed cache service. The goal of any cache service is to improve performance (and scalability). Again, it is easy to measure metrics such as the number of calls to the service, or even more specific metrics such as the "hit rate", but the primary goal here is to measure - from the consumer's point of view - the performance and scale improvement.
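Here is a toy sketch of that consumer-side measurement, with a dictionary standing in for both the backing store and the cache (all numbers invented):

```python
import time

DB = {"item-123": {"title": "vintage camera"}}   # stand-in backing store
CACHE = {}                                       # stand-in cache

def fetch_direct(key):
    time.sleep(0.05)            # simulate a 50 ms database round trip
    return DB[key]

def fetch_via_cache(key):
    if key not in CACHE:        # miss: fall through to the database
        CACHE[key] = fetch_direct(key)
    return CACHE[key]

def timed(fn, key):
    start = time.perf_counter()
    fn(key)
    return time.perf_counter() - start

fetch_via_cache("item-123")     # warm the cache
t_direct = timed(fetch_direct, "item-123")
t_cached = timed(fetch_via_cache, "item-123")
print(f"consumer-observed speedup: {t_direct / t_cached:.0f}x")
```

A 99% hit rate means little if the consumer-observed latency barely moves; the speedup above is the number the primary goal cares about.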

One main reason operational metrics are often measured, and used for decision making, instead of primary-goal (strategic) metrics is that they are generally easier to measure. It is certainly easier to measure the "number of calls" to a service than whether the service increases its consumers' productivity. "How to Measure Anything" by Douglas Hubbard is a great resource for techniques to measure hard-to-quantify goals and metrics.

Wednesday, March 14, 2012

What really is this "Managed Market Place" anyway?

I work for a division within eBay Inc. called "Managed Market Places". The name is a bit curious. I have been asked, more than once and by a range of people, what a "managed marketplace" really is. Is it a new type of marketplace by eBay (no, it is not!)? Is it a vertical/niche marketplace within eBay (no, it is not!)? Someone on Quora even interpreted it to mean that eBay simply "manages" the marketplace as opposed to growing it! (If that were the case, why would eBay announce it to the whole world by labeling it as such?)

So then what exactly is MMP (as it is known internally) and why is it important?

The nature of the Internet lends itself perfectly to the basic concept of a "marketplace": a mechanism for buyers and sellers to find each other. Marketplaces were, and still are, an important and growing part of the Internet. The growing list of niche marketplaces includes Etsy, Zaarly, oDesk, Airbnb, TaskRabbit, Yardsellr, Zimride, and many, many more (not to mention marketplaces from Facebook, Google, Yahoo, and other major players).

At first glance, it looks simple enough: create a site that brings the parties to a transaction together (from the buyer and seller of an antique to two people who want to share a ride or a room), and either take a cut of the transaction or make money through advertising. This is indeed the basic concept behind a marketplace - or rather, an unmanaged marketplace. The marketplace itself is not a party to any transaction. Buyer and seller deal with each other directly and take the risk (or the bulk of the risk) of the direct transaction. eBay operated, more or less, as an unmanaged marketplace for a while too.

In a managed marketplace, on the other hand, neither party to a transaction takes a risk; in other words, the marketplace guarantees the success of the transaction - no risk (at least ideally). Of course, a managed marketplace can "manage" other aspects of the interaction, such as inventory, quantity, price, promotions, etc., but for now we focus only on risk, as that is the focus of eBay MMP as well.

The evolution of simple Internet marketplaces into managed marketplaces is an important trend, as Internet users become more sophisticated and demand more from the services they use online. The AirBnB incident back in July of 2011 is a perfect illustration of how "unmanaged" marketplaces will be forced to offer a higher level of assurance/risk mitigation and become managed marketplaces.

What does this mean from a systems and architecture point of view? Here are five main aspects that are particularly different when dealing with managed marketplaces:

1- The first significant change is one of mindset: you have to see yourself as being in the risk management business, or at least assume that risk management is a major part of your operations. What this changes, first and foremost, is that you now have to identify, assess, prioritize, mitigate (or plan to), and measure risk. In all likelihood, all of these activities (and the tools and systems you need to perform them) are new to you if you have been running a simple/unmanaged marketplace.
2- Central to any consumer risk management scheme is "identity", and I don't mean OpenID or OAuth or SSO... I mean attributes, assurances, verification, accuracy, uniqueness, and mapping a real-world entity to a digital identity (entity resolution).
3- Data is core to efficient risk management; big data, and your ability to collect and analyze it, become central to your ability to operate the marketplace at a reasonable cost (minimal losses).
4- Coherent architecture becomes even more important, simply because your systems become more complex and more integrated. A simple marketplace is just that: a marketplace site/application. A managed marketplace also includes identity provisioning and verification; risk definition, measurement, and management at the user and transaction levels; a system for filing claims and disputes; systems dealing with the ever-changing legal and business landscape that governs what you can and cannot do with the data you collect; and finally, integrating all these systems in a productive way (seamless, but without coupling them).
5- Event-Driven and Complex Event Processing: this already has a big role in distributed systems, but it plays an ever more important role in distributed risk management. Real-time assessment of risk becomes critical, and given the cost/performance of risk assessment, incremental assessment of risk based on primitive and complex events generated over an entire session (or even the lifetime of a user) will be the only practical solution (see the sketch after this list).
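To illustrate point 5, here is a deliberately simplified sketch of incremental, session-level risk scoring. The event types, weights, and threshold are all invented; a real system would use learned models over far richer signals:

```python
# Invented event weights: each primitive event nudges the running
# session risk score instead of re-computing risk from scratch.
RISK_WEIGHTS = {
    "login_new_device":      0.3,
    "shipping_addr_changed": 0.2,
    "high_value_order":      0.4,
    "password_reset":        0.3,
}
REVIEW_THRESHOLD = 0.7

def update_risk(score, event):
    """Fold one primitive event into the running session risk score."""
    return min(1.0, score + RISK_WEIGHTS.get(event, 0.0))

score = 0.0
session = ["login_new_device", "shipping_addr_changed", "high_value_order"]
for event in session:
    score = update_risk(score, event)
    if score >= REVIEW_THRESHOLD:
        print(f"after {event}: score={score:.1f} -> route to manual review")
        break
else:
    print(f"session score {score:.1f}: allow")
```

The design point is that the score is updated as each event arrives, so the expensive "assess this whole session" computation never has to run from scratch at decision time.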

Wednesday, November 23, 2011

The Uncommon Security Common Sense

I cannot claim that I have actually counted or classified all the reasons people cite for not taking security (or, for that matter, sound and well-thought-through system design) seriously from the start, but the following three lines seem to be the most common ones:

1- The "it is too contained" line: So what is the big deal? at worst it may affect a very small percentage of my users.
2- The "it is too early" line: Oh my system/site/project is too small and we only have a few users, we really don't have time/resources for this.
3- The "it is too small" line: My project is too small or too obscure for anyone to care.

By the way, I have heard these lines, or their equivalents, not only when it comes to security engineering (or re-engineering), but also in designing business policies or risk management measures to prevent fraud or, more generally, negative user experiences, as well as in general system design.

Now, to be fair, these reasons all sound like "common sense". After all, why would you take on additional cost and time for your project, or accept the expense and risk of re-engineering your code, to fix an issue that may only affect 1% or 0.01% of your users? Why should you spend two weeks to fortify a system that took you three days to design and is "just an experiment"? And finally, who really cares about a small project somewhere, with some obscure URL, that takes an email address as one of its inputs and shows a useful error message if the email is not registered? Does ANYONE really care?

Well, as it turns out, security common sense (like many other forms of common sense) is actually quite uncommon! Let's look at this frequently cited common-sense logic a bit more closely.

To demonstrate the fallacy behind the first line (it is too contained; it only affects 0.01% of users), I cannot think of any better illustration than the words of presidential candidate Herman Cain, who said that "for each woman who has accused him of harassment there are probably thousands who haven't" - and he is 100% accurate and right! But does that make any difference? In all likelihood his presidential bid is all but over. Or could the Washington D.C. police chief during the "D.C. Sniper Attacks" have possibly argued that the whole thing was not a big deal because only 0.001% of the D.C. metro population was actually killed, and that therefore there was no need for the massive mobilization of police, FBI, ATF, and even the Secret Service!?

The same math holds true for security. It does not matter if only 1,000 users out of 10MM become victims of a poorly secured or poorly designed system. What matters is how many people hear and learn about it - and you can be sure that, at least in this day and age, that number is a few orders of magnitude larger than the actual number of victims. The sense of insecurity this causes in the rest of the user community, and its economic cost, is the real math that matters - not the fact that only 1,000/10MM = 0.01% of users were affected.

The second line "it is too early" or its equivalents "we don't have enough time or resources" is the most commons line not only in security matters but also system design and architecture aspects as well. What is interesting here is that the exact premise cited for not focusing on security (or sound design for that matter), is why security should be taken seriously i.e. "I am too new to afford not to be secure", if you are releasing a new product (or brand or a site) you REALLY DO NOT HAVE A SECOND CHANCE TO MAKE A FIRST IMPRESSION. If you are not secure, or if your first few user gets taken advantage off (think of AirBnB incident) you are doomed. To further demonstrate the risk in this argument I submit the following picture of one of the more famous car design mistakes : Honda Odyssey 1998



Honda designed this in a hurry to get into the growing minivan market dominated by Dodge/Chrysler. They decided to differentiate by replacing the convenient power sliding door with a traditional door! Imagine what would have happened if this had been a new no-name company without Honda's established brand. Of course, Honda corrected the mistake in the 1999 model and beyond, and went on to have one of the most successful minivans. But if you are not Honda, you had better spend the time and money on designers and marketers to tell you, on the first try, that whoever buys a minivan *needs* a sliding door.

Now we get to the third line: "Who really cares about me?" I have to admit that I have the most sympathy for people who resort to this logic. After all, it is tough to imagine how capable and resourceful the modern fraudster/hacker community is without actually having a brush with them. I will not get into the details - if you are interested, you can briefly scan Rick Howard's excellent book "Cyber Fraud: Tactics, Techniques and Procedures" - but for the purposes of this post, I'd suggest you assume the following is true:

In the game of "who wants to break into my system", your adversary is more motivated (financially or politically) than you are, more experienced than you are, more innovative than you are, more nimble than you are, wants it more than you do, and has a smaller cost base than you do - and therefore all he needs is 0.01% (or fewer) of your users. The ONLY advantage you have is that you write the rules of the game. Do not give up that advantage easily, or you WILL lose the game.

By the way, the endpoint URL that takes an email address and very nicely checks and displays an error if the email does not belong to a valid user was actually found (although it was an obscure URL not linked to from anywhere) and used to extract valid company X user emails (worth $5,000+) from a large list of non-verified harvested emails (worth $50) - a vital part of the phishing industry's value chain.
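As a minimal sketch of the standard mitigation (a hypothetical Flask endpoint, not the actual system described above): return the same response whether or not the email exists, so the endpoint cannot be used as an account-enumeration oracle.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
REGISTERED = {"alice@example.com"}  # stand-in for a real user store

@app.route("/check-email", methods=["POST"])
def check_email():
    payload = request.get_json(silent=True) or {}
    email = payload.get("email", "")
    # The vulnerable version returns "this email is not registered" on a
    # miss, letting an attacker cheaply separate valid addresses from
    # invalid ones. Instead, do the real work out of band and answer
    # identically either way.
    if email in REGISTERED:
        pass  # e.g., queue a password-reset email here
    return jsonify({"message": "If this address is registered, "
                               "we have sent further instructions."})
```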

Sunday, November 6, 2011

OIX Attribute Exchange Summit - Washington DC

Open Identity Exchange (OIX) is holding this year's Attribute Exchange Summit in Washington, DC.

Identity attributes are at the core of the concept of digital identity. As the federated identity ecosystem matures and adoption grows among more sophisticated RPs - with more consequential use cases such as government, health, education, and commerce - so does the need for wider sets of attributes with more accurate and fresher values. This presents both tough challenges and opportunities for IDPs.

The challenges center, as one might expect, around aggregating, correlating, transforming, and maintaining fresh copies of attributes in a cost-effective manner, and in a way that does not compromise the privacy (and other rights) of the principal owner. IDPs can differentiate themselves based on the range of attributes they can provide in this way, and therein lie the opportunities.

I will be talking more about identity attributes, their life cycle, their use cases, and how they help establish and elevate trust among parties to commercial transactions (online and offline), as part of a panel with Don Thibeau, OIX/OIDF chairman, and Abbie Barbir, VP at BoA.

If you are planning to attend, I'd be happy to hear from you.

Monday, October 17, 2011

OAuth vs. OpenID Connect?

The OpenID Connect 1.0 spec is finally released (actually, it was released back in August). Its release was accompanied by two predictable categories of questions/sentiments, one not very well informed and the other a legitimate question:

- OpenID is dead
- OpenID Connect is really OAuth, so why do we need a new protocol?

Granted, these normally come from software engineers and the social application programmer community rather than from the identity community, but I feel they are significant enough to be addressed, especially at a time when more and more entities are contemplating becoming identity providers and need to decide which protocol to implement.
First, on the demise of "OpenID": it is true that the earlier versions of OpenID (version 1 and version 2) are, for all intents and purposes, deprecated and will not gain a whole lot of traction. But the general idea of "open" standards for communication between RPs and IDPs - standards that enable users to provision fewer accounts and have a portable identity while still maintaining control over their privacy and data - is alive and well, and is actually even more vital than before.
Second, on the relationship between OAuth and OpenID Connect: OAuth is a general protocol for authorizing an agent to access a resource on behalf of the resource's owner. OAuth does not assume any particular knowledge about the resource itself. What does this mean? Let's go back to the canonical OAuth use case of a user who would like to authorize a printing service to access her photos at a photo service provider. Now imagine that the photo service is slightly more sophisticated and recognizes a few properties associated with photos, e.g. resolution, size, whether they are shots with no humans, and, for shots with humans, who appears in the photos - basically, let's assume the resource served by the SP has more semantics than simple "access".
Now imagine that the user wants to grant access only to JPEG photos of herself, and not full access to all photos. How would the IDP encode these semantics in the authorization request and response? How would the SP know that it should only provide access to a subset of the images?
To be sure, this is doable using OAuth, but the implementer has to add additional parameters to the request and response, or possibly constrain the input values of some other parameters.
A protocol built this way to access a specialized resource would be a photo access protocol built on top of OAuth.
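Here is a sketch of what that bolt-on might look like, with invented extension parameters ("media_type", "subject") that are not part of any OAuth spec - which is exactly the problem, since every SP would invent its own:

```python
from urllib.parse import urlencode

# Hypothetical photo-service authorization request. The standard OAuth
# parameters carry no resource semantics, so the SP invents extensions.
params = {
    "response_type": "code",
    "client_id": "printing-service",
    "redirect_uri": "https://printer.example.com/callback",
    "scope": "photos.read",
    # proprietary extensions carrying the resource semantics:
    "media_type": "jpeg",
    "subject": "self",
}
print("https://photos.example.com/oauth/authorize?" + urlencode(params))
```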
In essence, this is exactly what OpenID Connect is: a protocol built on top of OAuth that supports features that are often desired and needed when the resource being delegated is "identity", and attributes about an identity.
To illustrate the point, here is what we, at eBay, had to do for an internal authentication protocol on top of OAuth:
- Force Authentication: adding parameters to the authorization request to force users to authenticate, no matter what their authentication state with the IDP is.
- Authorization Behavior: adding parameters to the authorization request to indicate to the IDP whether it should display the consent page, and how to display the login page (overlay, full page).
- Standard Claim Set: defining the default set of attributes returned by the IDP.
- Requested Attributes: adding a mechanism to allow RPs to ask for additional attributes, and annotating them to indicate whether explicit user consent is required.
- Authentication Context: adding a fragment to the response to communicate the authentication context (single vs. multi-factor, PIN vs. password, number of retries, etc.).
- Protection: adding parameters to indicate how access tokens should be protected (encryption, signature, and order of operations).
- Token Validation Endpoint: adding an endpoint to introspect access tokens on demand.
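For comparison, here is a sketch of how several of these show up as standard parameters in an OpenID Connect authentication request. The values are illustrative, but "prompt", "scope", "acr_values", and "claims" are the spec-defined parameter names:

```python
import json
from urllib.parse import urlencode

params = {
    "response_type": "code",
    "client_id": "my-rp",
    "redirect_uri": "https://rp.example.com/callback",
    "scope": "openid profile email",  # standard claim sets
    "prompt": "login",                # force re-authentication
    "acr_values": "urn:mace:incommon:iap:silver",  # requested authn context
    "claims": json.dumps(             # request additional attributes
        {"userinfo": {"phone_number": {"essential": True}}}
    ),
}
print("https://idp.example.com/authorize?" + urlencode(params))
```

The authentication context comes back in the ID token's "acr" claim, and the other items on the list above have standard counterparts as well, rather than being something each IDP invents.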

These are all features and facets that OpenID Connect enables in a standard and interoperable fashion. In the absence of a standard such as OpenID Connect, though, any RP integrating with our IDP had to implement what was essentially a proprietary protocol, albeit one built on top of OAuth.
The point is that if you want to operate an IDP and you want to use just OAuth, you have to add a few things to OAuth, depending on the depth of your requirements, to make it work for the "identity" resource. This is exactly what Facebook did with FB Connect - and they also did a good job of wrapping it with JavaScript plug-ins. The goal of OpenID Connect is to use OAuth as the basic access authorization protocol and add identity-specific features to it, so that it becomes a standard "identity protocol" that enables seamless interoperability.