Should you design your own internal Platform?

Amit Cohen
9 min read · Feb 6, 2022


I have been involved in the Platform-as-a-Service (PaaS) market for several years. My first experience was designing a VNFM (a PaaS for telecoms) that provides the capability to onboard and deploy Virtual Network Functions (VNFs), part of the digital transformation that has been rolling through the telecom cloud industry for more than a decade. My second experience was designing, from scratch, a PaaS platform serving several industries under a single platform. I'm not going to get into what those platforms do and where (if you're that curious, just look at my LinkedIn profile). As an innovation exercise, I've been thinking about the benefits of an internal PaaS for a large organization that employs developers who need services they usually install themselves on the public cloud they use, or for which they request DevOps assistance (that always happens). Would an internal PaaS simplify management and orchestration of services? Would it solve regulatory issues? Would it unify resource control for better budget control?

The idea I'm playing with is not new. In my opinion, in large enterprises with separate development groups, having a set of tools, services, and environments that enable developers to deploy and operate services in their public cloud will accelerate and simplify deliveries, enhance governance, cut cloud resource costs, and allow better budget control.

Let's assume over 1,000 services, ranging from experiments built during hackathons, to internal tooling supporting company processes, to public-facing, critical components of the flagship products. Despite its crucial responsibilities, such a platform should be simple. The only inputs it takes to deploy a service are a Docker image containing the service logic, and a YAML file — the Service Descriptor — that describes the backing resources that the service needs (databases, queues, caches, etc.) as well as various configuration settings, such as the service's autoscaling characteristics. The system should take care of expanding those inputs into production-ready services, ensuring operational functionality (e.g. log aggregation, monitoring, alerting) and best practices (e.g. cross-Availability-Zone resiliency, consistent backup/restore/retention configuration, sensible security policies) are present out of the box.
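As a sketch, a Service Descriptor along these lines might look as follows (all field names here are hypothetical, invented purely for illustration):

```yaml
# Hypothetical Service Descriptor: a Docker image plus the backing
# resources and configuration the service needs.
service:
  name: payments-api
  team: payments
  service_level: critical
image: registry.internal/payments-api:1.4.2
resources:
  - type: postgres
    name: payments-db
    backup_retention_days: 30
  - type: queue
    name: payment-events
autoscaling:
  min_instances: 3
  max_instances: 12
  target_cpu_percent: 60
```

The platform's job is to expand a file like this into production-ready infrastructure, with logging, monitoring, and cross-AZ resiliency attached by default.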

I haven’t invented much here: nearly everything is achieved by using standard cloud (AWS/GCP/Azure/BlueOcean…) features. With this in mind, it is common for engineers to question the need for such a platform: couldn’t we simply open up plain, direct cloud access to all teams, so that they can use the cloud’s full and rapidly expanding functionality?

This is a great question which I’ll explore below, focusing on the following points:

  • I believe that there are fantastic economies of scale that come from a platform that strongly encourages consistent technologies, tools, and processes, and that these advantages would likely be degraded if direct access to AWS, for example, were the norm.
  • I acknowledge that this level of indirection can sometimes result in reduced flexibility, so it’s important to consider how to mitigate this trade-off.

The benefits

Let's assume a public cloud service provider's breadth of infrastructure features is extensive, to say the least, and only a subset is made available to developers. This discrepancy isn't due to a platform being unable to “keep up” with AWS. The limitation exists primarily because I believe in the value of reducing technology sprawl, especially at the bottom of our stacks. There are fantastic benefits that manifest when teams across the organization use infrastructure in a consistent manner. To name a few:

  • Integrations with standard company tooling and processes are much easier to implement. An internal platform provides control points around provisioning. Consequently, integration with logging, metrics, service metadata management, compliance controls, etc. can be baked into the platform rather than being handled by many different teams. Additionally, when we need to change those integrations, a large part of the change effort can be centralised. For example, this has proven valuable when a company needs to shift entirely from one logging or monitoring vendor to another.
  • More generally, projects that require sweeping changes to internal services are simplified and more predictable. This includes security initiatives, new compliance programs, and more. Some concrete examples include efforts to normalise TLS configurations across the company, to enforce consistent access controls relating to SOX compliance, and to enable encryption at rest on data stores.
  • The impact of each engineer’s expertise is multiplied. The relatively small number of PostgreSQL or Elasticsearch experts across the company can have an outsized impact by advising dozens of teams. Engineers who have grown intimate with, say, the quirks of AWS ALB, can share and apply those lessons across pretty much every single service and team in the company.
  • Relatedly, the impact of tooling built around the platform is multiplied. Side-projects started by engineers who seek to “scratch an itch” become valuable to the entire organisation. We’ve seen several such side-projects evolve into fully supported platform components. Examples could be tools that check for possible configuration issues and missed best practices, a bootstrap tool that generates skeleton services, and a key management service that provisions new key pairs for service-to-service authentication at deploy time.
  • Any new public cloud features made available to teams are configured and integrated to maintain the security, compliance, operational standards, and best practices that have been incrementally achieved over time. In other words, when a new resource becomes available, we don't just “add” it.
  • On the operational side, it is best to determine the relevant metrics and backup strategies for each new resource, and to make sure it fits in with existing compliance and security standards.

These economies of scale are a consequence of using a consistent, controlled interface to provision and manage services in clouds, combined with sensible bounds on the vast array of cloud features available. Both would likely be degraded if direct access to any public cloud were the norm.
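The “scratch an itch” tooling mentioned above, such as a checker for configuration issues and missed best practices, can be sketched in a few lines of Python. Everything here (descriptor fields, rules) is hypothetical, purely to illustrate the idea:

```python
# Minimal best-practices checker: scans parsed Service Descriptors
# (represented here as plain dicts) and reports rule violations.
# All field names and rules are hypothetical.

def check_descriptor(descriptor):
    """Return a list of human-readable findings for one service."""
    findings = []
    name = descriptor.get("name", "<unnamed>")

    # Rule: production services should span at least two AZs.
    if descriptor.get("environment") == "production":
        azs = descriptor.get("availability_zones", [])
        if len(azs) < 2:
            findings.append(f"{name}: production service should span >= 2 AZs")

    # Rule: every data store must declare a backup retention period.
    for store in descriptor.get("data_stores", []):
        if "backup_retention_days" not in store:
            findings.append(
                f"{name}: data store '{store.get('type', '?')}' has no backup retention"
            )

    # Rule: required metadata for incident response and cost attribution.
    for field in ("team", "service_level"):
        if field not in descriptor:
            findings.append(f"{name}: missing required metadata field '{field}'")

    return findings


if __name__ == "__main__":
    descriptor = {
        "name": "payments-api",
        "environment": "production",
        "availability_zones": ["eu-west-1a"],
        "data_stores": [{"type": "postgres"}],
        "team": "payments",
    }
    for finding in check_descriptor(descriptor):
        print(finding)
```

The value of a tool like this comes precisely from the consistency argued for above: because every service is described the same way, one small script can audit the whole fleet.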

How a platform helps

Let’s just dive into what I think could be beneficial:

  • A platform could enforce consistent use of service metadata. All services have a metadata record that includes information such as the service owner, team, and service level; this is invaluable in incident scenarios, where obtaining this kind of information quickly is essential. It also enforces consistent tagging of cloud resources, which helps with cost attribution and with setting up filters for automated security, compliance, and best-practices scanning.
  • It ensures all services integrate with standard observability tools. Every service is automatically configured to ship its logs (AWS flow logs, for example) into a standard Splunk cluster, its metrics into a single account, and its traces and alerts into the standard tooling.
  • Such a platform will handle network connectivity. It manages a set of VPCs/VNETs and subnets with appropriate peering configuration, so services can access the parts of the internal network that they need. Additionally, the platform should integrate with a centrally managed edge, so that all public-facing services benefit from the same level of DDoS protection. Components running on every node add a control point for egress, which is used to enforce export controls and for security controls and monitoring.
  • A platform helps to meet the company's security standards. It should provide a single point which controls load balancer security settings and security group setup. Furthermore, because all services essentially run on the same AMI, it can easily ensure host OS security patches are up to date. It can also provide easily audited and controlled channels for operational and diagnostics access to services.
  • On compliance, the metadata mentioned above also tracks information about services' use of data: does this service handle customer data? What about personally identifiable information or financial data? This allows the platform to enforce controls where warranted. For example, the binaries of services that handle any kind of sensitive data will fail to deploy to production unless we can confirm they come from a locked-down artifact store and were built by a restricted CI pipeline. A platform also simplifies audits, by ensuring that only a small number of teams need system-level access, and that this access means the same thing, and is controlled in the same way, across all services.
  • On data integrity, a platform ensures that all data stores are set up with the right backup frequency, retention, and locality.
  • A platform also helps with overall service resilience. It encourages or enforces some of the essential “12 factor” principles, in particular by ensuring compute nodes are immutable and stateless (for example by integrating with Chaos Monkey-like tooling which periodically shuts down nodes). It also provisions services with sensible defaults for autoscaling and alerts, as well as a mandatory spread across AZs in production. This makes it possible to run AZ failure simulations, in which you periodically evacuate all nodes in an AZ and measure the impact on reliability.
  • A platform provides a consistent set of deployment strategies, all of which result in zero-downtime deploys and post-deploy verification with automated rollback.
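As an illustration of the last point, a deploy flow with post-deploy verification and automated rollback might look like the following Python sketch; `deploy_version` and `health_check` are hypothetical stand-ins for real platform and cloud calls, not an actual API:

```python
# Sketch of a zero-downtime deploy loop with post-deploy verification
# and automated rollback. `deploy_version` and `health_check` are
# hypothetical stand-ins for real platform/cloud calls.

def deploy_with_rollback(service, new_version, current_version,
                         deploy_version, health_check, checks=3):
    """Deploy new_version; roll back to current_version if any
    post-deploy health check fails. Returns the version left running."""
    deploy_version(service, new_version)   # e.g. rolling update behind an LB

    for _ in range(checks):                # post-deploy verification window
        if not health_check(service):
            # Verification failed: automated rollback to the known-good version.
            deploy_version(service, current_version)
            return current_version
    return new_version


if __name__ == "__main__":
    log = []
    state = {"calls": 0}

    def fake_deploy(service, version):
        log.append((service, version))

    def flaky_health(service):
        state["calls"] += 1
        return state["calls"] < 2          # fails on the second check

    result = deploy_with_rollback("payments-api", "v2", "v1",
                                  fake_deploy, flaky_health)
    print(result)  # prints "v1": the failed check triggered a rollback
```

Making this flow the default for every service, rather than something each team builds, is exactly the economy of scale the platform is after.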

The trade-offs of a PaaS — and some mitigations

A strategy which favours a controlled, consistent, and hardened subset of cloud features over free-form access to accounts comes with some trade-offs. For example:

  • It may be harder to experiment with and learn new cloud features.
  • If a feature is ideal for a given use case, but is not supported by the platform, engineers have to choose between a “suboptimal” but supported cloud feature, and forgoing the economies of scale mentioned above by hosting their service off-platform.
  • Some 3rd-party tools that engineers would like to include in their dev loop or deployment flow can't be used unless they are platform-aware.

Let’s look at some mitigations to these trade-offs.

Free-for-all access in a controlled Training Account

If teams would like to experiment with cloud features to validate that they fit their use cases, or just to learn more about them, I would generally point them to a training account. This account has some notable restrictions: it is not connected to the rest of the internal network, and it is purged on a weekly basis. Nonetheless, it's an ideal “playground” to experiment, validate assumptions, and build simple proofs of concept.

Extending PaaS

The above isolated experimentation is valuable but can only go so far. Fortunately, there’s a range of ways in which a PaaS can be extended.

Many resources that a platform provides are integrated into the platform via a curated set of declarative templates. Teams can branch the repository of templates and add their own, which can immediately be referenced by Service Descriptors and therefore be provisioned by services in development environments. This allows the resource to be tried and tested, and makes it possible to examine in detail what would be required to make the resource available to all teams in production.
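A curated declarative template in such a repository might look like this sketch (the resource type, parameters, and defaults are all invented for illustration):

```yaml
# Hypothetical template for a "cache" resource type. A Service
# Descriptor references it by type; the platform applies the
# enforced defaults at provisioning time.
type: cache
engine: redis
parameters:
  size:
    allowed: [small, medium, large]
    default: small
defaults:
  multi_az: true            # enforced best practice
  encryption_at_rest: true  # enforced best practice
  backup_retention_days: 7
```

Keeping best practices in the template's defaults, rather than in each team's configuration, is what lets the platform guarantee them fleet-wide.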

Acknowledging that not everything one may wish to provision alongside a service is best represented with a declarative template, a platform should also accept extensions beyond the declarative. In other words, teams can contribute services which may themselves become part of the provisioning flow for subsequent services, by deploying and managing new types of assets defined in the Service Descriptor. Building and running such services is no small undertaking, and you should take care to ensure this extension point isn't used as a vector to pump out PaaS features that don't have a high level of ongoing operational support. In practice, use this functionality primarily to decompose the core of the PaaS and to scale the platform. For example, spin out an independent sub-team that owns all the platform's data stores and their associated tooling, or give your networking team greater autonomy in the implementation and ownership of the platform's integration with your company-wide public edge.

Finally

There is no mandate that says services must run on an abstracted platform. However, in a large development organization, flexibility comes with additional responsibilities and considerations, which echo the list above describing where a platform helps.
Here are the questions teams must consider before going off-platform:

  • Service metadata: do you have appropriate records for your services? Are your cloud resources correctly tagged?
  • Observability: how will you get your logs into a unified system? What insights will you derive from them?
  • Network connectivity: do you need access to the internal network? Do you need to call other microservices? How are you implementing DDoS protection and export controls on ingress and egress?
  • Security: does the Security team have suitable insight into your services? How will you implement changes to security requirements and policies? How are you balancing operational & diagnostics access to services with access control requirements?
  • Compliance: What are the implications of compliance certifications for your services? How are compliance controls enforced? Do you review access levels to your infrastructure? Does your team need to be involved in audits?
  • Data integrity: Is your backup frequency, retention and locality in line with requirements? Are your backups monitored? Do you do periodic restore testing?
  • Service resilience: What is the impact of a node failure on your service? Is your service resilient to AZ failures? Do you have periodic testing for this?
  • Deployments & change management: do you have a mechanism to achieve zero-downtime deploys? To be alerted if a deploy is broken? What mechanisms do you have in place for rollback? How will other teams know what has changed with your service and when?

