Service mesh in Service Fabric

Introduction

Microservices is here to stay and we can witness the increasing popularity and the maturing technology stack which facilitate microservices. In this great article which explains about the maturity of microservices and the 2.0 stack, it mentions three key aspects.

  1. Service mesh
  2. Matured orchestrators
  3. RPC based service protocols.

This post focuses on the communication infrastructure in Service Fabric. Service Mesh is about the communication infrastructure in a microservices / distributed system platform.

First, let’s look at What is a service mesh ?  In the simplest explanation, service mesh is all about service to service communication. Say, service A wants to talk to service B, then Service A should have all the network and communication functionality and the corresponding implementations, in addition to its business logic. Implementation of the the network functionality makes the service development complex and unnecessarily big.

Service mesh abstracts all or the majority of the networking and communications functionality from a service by providing a communication infrastructure, allowing the services to remain clean with their own business logic.

So with that high level understanding if we do some googling and summarize the results, we will have a definition of a service mesh, with these two key attributes.

  • Service mesh is a network infrastructure layer
  • Primary (or the sole) purpose is to facilitate the service to service communication in cloud native applications.

Cloud native ?? – (wink) do not bother much on that, for the sake of this article, it is safe to assume a distributed system’s service communication.

imgpsh_fullsize

Modern service mesh implementations are proxies which run as sidecar for the services. Generally an agent runs on each node and the services run on the node talk to the proxy and proxy does the service resolution and perform communication.

When Service A wants to talk to Service B

  1. When service A calls its local proxy with the request.
  2. The local proxy perform service resolution and makes the request to Service B
  3. Service B replies to the proxy running in Container 1
  4. Service A receives the response from its local proxy
  5. Service B’s local proxy is NOT used in this communication. Only the caller needs a proxy not the respondent.
  6. Service A is NOT aware of service resolution, resiliency and other network functionalities required to make this call.

There are notable service mesh implementations in the market, Linkered and Istio are quite famous and Conduit is another one and many more in the market. This is a good article explaining those different service mesh technologies.

The mentioned service mesh implementations are known in the Kubernetes and Docker based microservices, but what about service mesh in Service Fabric. 


Service mesh is inherent in Service Fabric

Service Fabric has a proxy based communication system. Defining this as a service mesh is up to the agreed definition of service mesh. Typically there should be a control plane and data plane in a service mesh implementation. Before diving into the details of it, let’s see the available proxy based communication setup in Service Fabric.

Reverse Proxy for HTTP Communication

SF has a Reverse Proxy implementation for HTTP communications. This proxy runs an agent in each node when enabled. This reverse proxy handles the service discovery and resiliency in HTTP based service to service communication. If you want to read more practical aspect of the Reverse Proxy implementation, this article explains the service communication and SF reverse proxy implementation.

Reverse Proxy by default runs on port 19081 and can be configured in the clusterManifest.json


{

............

"reverseProxyEndpointPort": "19081"

............

}

In the local development machine this is configured in the clusterManifest.xml

<HttpApplicationGatewayEndpoint Port="19081" Protocol="http" />

When Service A wants to call the Service B’s APIs, it calls its local reverse proxy with a following URL structure.

http://localhost:{port}/{application name}/{service name}/{api action path}

There are many variations of reverse proxy URLs should be used depending what kind of a service the calls are made. This is a detailed article about Service Fabric Reverse Proxy.

RPC Communication in Service Fabric

RPC Communications in Service Fabric are facilitated by the Service Fabric Remoting SDK. The SDK has the following ServiceProxy class.

Microsoft.ServiceFabric.Services.Remoting.Client.ServiceProxy

Service Proxy class creates a lightweight local proxy for RPC communication and provided by the factory implementation in the SDK. Since we use the SDK to create the RPC proxy, in contrast to the HTTP reverse proxy this has the application defined lifespan and there’s no agent runs in each node.

Regardless of the implementation both the HTTP and RPC communication are well supported by Service Fabric by native and has the sidecar based proxy model implementation.


Data Plane and Control Plane in Service Fabric

From the web inferred definition of service mesh, it has two key components, (note, now we’re talking the details of service mesh) known as data plane and control plane. I recommend to read this article which explains the data plane and the control plane in service mesh.

The inbuilt sidecar based communication proxies in Service Fabric form the network communication infrastructure : which represents the data plane component of the service mesh. The sidecar proxies in Service Fabric form the data plane. 

Control plane is generally bit confusing to understand, but in short, it is safe to assume  control plane has the policies to manage and orchestrate the data plane of the service mesh.

In Service Fabric, control plane is not available as per the complete definition in the above article. Most of the control plane functions are application model specific and implemented by the developers and some are in built in the communication and federation subsystem of Service Fabric. The key missing piece in the control plane component of Service Fabric is, the unified UI to manage the communication infrastructure (or the data plane).

The communication infrastructure cannot be managed separate to the application infrastructure, thus a complete control plane is not available in Service Fabric.

With those observations, we can conclude:

Service Fabric’s service mesh is a sidecar proxy based network communication infrastructure, which is leaning much on the data plane attributes of a service mesh.

Service Communication and Cluster Setup in Service Fabric

Planning the service communication and the cluster setup is one of the important items we should do when developing on Service Fabric (SF). In this article I have tried my best to stick to the minimalist options in setting up the cluster whilst sufficient details by eliminating the common doubts. The motive behind this research is to find the optimum cluster with little amount of development and ops time.

blog im1

Layering WFE Services

Rule #1 : It is not recommended to open the services to the Internet. You would use a either a Load Balancer (LB) or a Gateway service. In on-prem implementations mostly this would be a LB and your cluster will reside behind a firewall.

The services mapped or connected to the LB act as the Web Front End (WFE) services. In most cases these are stateless services .

Rule #2 : LB needs to find the WFE services so WFE services should have static ports. LB (based on the selection) will have a direct or configured port mapping to these WFE services.

When you create a ASP.NET Core stateless service, Visual Studio (VS) will create the service with following aspects.

  • VS will assign a static port to the service.
  • Service Instance is set to 1 in in both Local.1Node.xml and Local.5Node.xml.
  • Service Instance is -1 in Cloud.xml
  • Kestrel Service listener
  • ServiceFabricIntegrationOption set to None

Since Kestrel does not support port sharing, in the local development environment the Kestrel based stateless services are set to have only one instance whenever a port has been specified.

In you development machine, if you specify an instance which results higher than 1 while a port is specified in the ServiceManifest.xml for service which Kestrel listener, then you will get the following famous SF error.

Error event: SourceId='System.FM', Property='State'.
Partition is below target replica or instance count

The above error is about, Failover Manager (FM) complaining that SF cannot create replicas as requested. In FM’s point of view, there’s a request to create more instances but due to port  sharing issue in Kestrel, SF cannot create more than one instance. This is the same error you would get regardless 1 node / 5 node setup because because physically we use one machine in development.

Using HttpSys listener is an option to overcome this issue. In order to use the HttpSys listener install the following NuGet package, update the listener to HttpSysCommunicationListener and the ServiceManifest.xml as below.

Install-Package Microsoft.ServiceFabric.AspNetCore.HttpSys


protected override IEnumerable<ServiceInstanceListener> CreateServiceInstanceListeners()
{
return new ServiceInstanceListener[]
{
new ServiceInstanceListener(serviceContext =>
new HttpSysCommunicationListener(serviceContext, "GatewayHttpSysServiceEnpoint", (url, listener) =>
{
ServiceEventSource.Current.ServiceMessage(serviceContext, $"Starting HttpSys on {url}");
return new WebHostBuilder()
.UseHttpSys()
.ConfigureServices(
services => services
.AddSingleton<StatelessServiceContext>(serviceContext))
.UseContentRoot(Directory.GetCurrentDirectory())
.UseStartup<Startup>()
.UseServiceFabricIntegration(listener, ServiceFabricIntegrationOptions.None)
.UseUrls(url)
.Build();
}))
};


<Endpoints>
<Endpoint Protocol="http" Name="GatewayHttpSysServiceEnpoint" Type="Input" Port="8080"/>
</Endpoints>

In fact, in the production deployments when more than one node available we can use Kestrel listener with static port mentioned in the ServiceManifest.xml with more than one instance.  SF will place the instances in different nodes. This is why the instance count is set to -1 in Cloud.xml.

Here the -1 is safe, because setting a specific number for instance count while Kestrel is used in static port mode may create issues when the requested instance count exceeds the nodes available for SF to place the service.

Common Question: Can we use HttpSys listener and enable scaling ? This is possible but most cases specially in stateless services scaling number of instances is the typical scale out scenario. So there’s no point having a scale out strategy in a single node by congesting a node with many number of services, because running multiple instances in same the same node will not yield the desired throughput we need. Also in such cases Cluster Manager will not find enough nodes with UD/FD combination in order to place the instances and provide a warning message.

Do not make the mistake that I favor Kestrel over HttpSys in this article, there are specific cases where you need HttpSys over Kestrel. In Microsoft articles Kestrel being mentioned and most of the cases are given in such a way that Kestrel can be used to reach desired output regardless of its inability of handing port sharing. From ASP.NET Core point of view Kestrel is good as long as your service is not directly facing the Internet.

Best Practice : Do NOT place WFE services in all nodes. Have dedicated nodes for the WFE services (use placement strategies). This allows stronger separation between WFE nodes and internal nodes. We can also implement a firewall between WFE service nodes and internal service nodes. In another way we trying to achieve the WFE and application server separation we used to do in N-Tier deployments. (to be honest I winked a little bit here when thinking of microservices)

Layering Internal Services

WFE services will route the requests to the internal services with the specific service resolution. Communication from WFE services to the internal services are generally based on HTTP because this provides loose coupling between the WFE services and internal services.

First let’s see what should happen when WFE wants to route a request to the internal services.

  1. WFE should resolve the service location – either via Naming Service directly or via the SF Reverse Proxy
  2. Services should have unique URLs (apart from the IP and port) because when services move from node to node, one service can pick the same port from a node which was used by the previous service and could cause issues. – In such cases a connection can be made to a wrong service (read more from this link)

Rule #3: It is recommended to use SF Reverse Proxy for internal HTTP service communications, because it provides features like endpoint resolution, connection retry, failure resolution and etc.

Reverse Proxy should be enabled in the cluster with the HttpApplicationGatewayEndpoint tag in ClusterManifest.xml. The default port for reverse proxy is 19801 and this service run in all the nodes. You can customize this via ClusterManifest.xml

WFE services should resolve the internal services (first layer services which has HTTP communication from WFE services) using SF Reverse Proxy.

http://localhost:19801/ApplicationName/InternalServiceName/RestOfTheUri

The localhost is applicable as the request is sent via the Reverse Proxy agent running on the node which is calling the internal service. The above URL will be used in a simple HTTPClient implementation to make the call. The below snippet shows a simple GET request.


string reverseProxyUrl = "http://localhost:19801/ApplicationName/InternalServiceName/RestOfTheUri";
var httpClient = new HttpClient();
var response = await httpClient.GetAsync(reverseProxyUrl);

Things to be noted in SF Reverse Proxy

The above URL is the simplest form for a reverse proxy URL which resolves a stateless services. Since This article assumes the 1st layer internal services are stateless the above URL structure will work – no need to mentioned the partition id and kind. In order to learn the full URI structure read this link

Reverse Proxy does retries when a service is failed or not found. Not found can happen when a service is moved from the requested node. Not Found can also occur when your internal service APIs return 404 for a not found business entity. Reverse Proxy requires a way to distinguish between these two cases because if it’s a business logic which returns 404 then there’s no point retrying. This scenario is explained in above article.  In order to avoid a wrong service being called internal stateless services should be have unique service URL integration.

In order to mitigate this, internal services should tell the Reverse Proxy not to retry with the header value. You can do this with an IResultFilter implementation like below and apply the attribute to your controllers. So any action method returns 404 (business service aware 404) values will have this header and Reverse Proxy will understand the situation.


public class ReverseProxyServiceResponseFilter : IResultFilter
{
public void OnResultExecuted(ResultExecutedContext context)
{
if(context.HttpContext.Response.StatusCode == 404)
{
context.HttpContext.Response.Headers.Add("X-ServiceFabric", "ResourceNotFound");
}
}
public void OnResultExecuting(ResultExecutingContext context)
{
}
}

So in this mode the internal stateless services which uses HTTP endpoints should have following aspects

  • Dynamic port assignment
  • Kestrel Service listener
  • Can scale the service instance as long as FD:UD constraints are not violated
  • No restrictions in dev enviornment
  • ServiceFabricIntegrationOption set to UseUniqueServiceUrl

Note: User revers proxy for internal HTTP communication. Clients outside the cluster SHOULD connect to the WFEs via LB or any such similar service. Mapping Reverse Proxy to the LB can cause the clients outside the cluster to reach the HTTP service endpoints which are not supposed to be discoverable outside the cluster.

Summary

Let me summarize the items in points below.

  • Use Kestrel for WFE with static port assignment with placement strategies for the nodes allocated to handle WFE workload.
  • Using HttpSys for WFE is fine, but do not use this in the intention of scaling out thus would not yield the right expected result.
  • Use Kestrel for internal HTTP stateless services with dynamic port allocation and enabling unique service URL.
  • Use SF Reverse Proxy for internal HTTP communications whenever possible
  • It is not recommended to map the SF Reverse Proxy the external LB or Gateway service.

 

In the endpoint configuration services have endpoint type which can be set to Input or Internal. I did some testing but failed as both types exposes the services as long as they have a valid port mapping to LB. Finally ended up asking from the creators and this is the answer I got. So technically endpoint type does not matter.