
Automating project creation with Google Cloud Deployment Manager

Do you need to create a lot of Google Cloud Platform (GCP) projects for your company? Maybe the sheer volume or the need to standardize project creation is making you look for a way to automate project creation. We now have a tool to simplify this process for you.

Google Cloud Deployment Manager is the native GCP tool you can use to create and manage GCP resources, including Compute Engine (i.e., virtual machines), Container Engine, Cloud SQL, BigQuery and Cloud Storage. Now, you can use Deployment Manager to create and manage projects as well.

Whether you have ten or ten thousand projects, automating the creation and configuration of your projects with Deployment Manager allows you to manage projects consistently. We have a set of templates that handle:
  • Project Creation - create the new project with the name you provide
  • Billing - set the billing account for the new project
  • Permissions - set the IAM policy on the project
  • Service Accounts - optionally create service accounts for the applications or services to run in this project
  • APIs - turn on compatible Google APIs that the services or applications in a project may need
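To make those pieces concrete, here is a hypothetical sketch of what a config.yaml driving these templates might look like. The property names (billing-account-name, iam-policy-patch, etc.) and the project.py sub-template name are illustrative assumptions, not the samples' actual schema — the README in the project_creation directory defines the real fields.

```yaml
# Hypothetical sketch only — the real schema is defined by the
# templates in the project_creation samples directory.
imports:
- path: project.py

resources:
- name: my-new-project
  type: project.py
  properties:
    name: my-new-project                          # Project Creation
    billing-account-name: billingAccounts/XXXXXX  # Billing
    iam-policy-patch:                             # Permissions
      add:
      - role: roles/viewer
        members:
        - user:alice@example.com
    service-accounts:                             # Service Accounts
    - app-service-account
    apis:                                         # APIs
    - compute.googleapis.com
```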

Getting started

Managing project creation with Deployment Manager is simple. Here are a few steps to get you started:

  1. Download the templates. The project creation samples are available in the Deployment Manager GitHub repo under the project_creation directory, or clone the whole repo:

    git clone

    Then copy the templates under the examples/v2/project_creation directory.

  2. Follow the steps in the README in the project_creation directory. The README includes detailed instructions, but one point bears emphasis: create a new project using the Cloud Console to serve as your “Project Creation” project. The service account under which Deployment Manager runs needs powerful IAM permissions to create projects and manage billing accounts, hence the recommendation to create this special project and use it only for creating other projects.

  3. Customize your deployments.
    • At a minimum, you'll need to change the config.yaml file to add the name of the project you want to create, your billing account, the IAM permissions you choose to use and the APIs to enable.
    • Advanced customization: you can do as little or as much as you want here. Let’s assume that your company typically has three types of projects: production service projects, test service projects and developer sandbox projects. These projects require vastly different IAM permissions, different types of service accounts and may also need different APIs. You could add a new top-level template with a parameter for “project-type”. That parameter takes a string as input (such as “prodservice”, “testservice” or “developer”) and uses that value to customize the project for your needs. Alternatively, you can make three copies of the .yaml file, one for each project type, with the correct settings for your three project types.

  4. Create your project.
    From the directory where you stored your templates, use the command line interface to run Deployment Manager:
    gcloud deployment-manager deployments create <newproject_deployment> --config config.yaml --project <Project Creation project>

    Where <newproject_deployment> is the name you want to give the deployment. This is not the new project name; that comes from the value in the config.yaml file. But you may want to use the same name for the deployment, or something similar, so you know how they match up once you’ve stamped out a few hundred projects.
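As a sketch of the "advanced customization" idea from step 3: Deployment Manager top-level templates can be written in Python, so a template can branch on a project-type parameter. The property names, the project.py sub-template reference, and the per-type API lists below are illustrative assumptions, not the samples' actual schema.

```python
# Hypothetical sketch of a top-level Deployment Manager template that
# branches on a "project-type" parameter. Property names and per-type
# settings are illustrative, not the samples' actual schema.

def generate_config(context):
    """Entry point Deployment Manager calls with the deployment context."""
    project_type = context.properties["project-type"]  # e.g. "prodservice"

    # Per-type API lists; a real template would also vary the IAM
    # policy and service accounts per project type.
    apis_by_type = {
        "prodservice": ["compute.googleapis.com", "logging.googleapis.com"],
        "testservice": ["compute.googleapis.com"],
        "developer": ["compute.googleapis.com", "bigquery-json.googleapis.com"],
    }

    return {
        "resources": [{
            "name": context.properties["name"],
            "type": "project.py",  # assumed sub-template from the samples
            "properties": {
                "name": context.properties["name"],
                "apis": apis_by_type[project_type],
            },
        }]
    }
```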

Now you know how to use Deployment Manager to automatically create and manage projects, not just GCP resources. Watch this space to learn more about how to use Deployment Manager, and let us know what you think of the feature.

Dockercon 2017: How Docker Changed Windows So Windows Could Change Docker

A decade ago, Microsoft’s publicly stated policy towards Linux was to compel its vendors, by any and all means, to obtain paid licenses for whatever Windows technologies they may, directly or indirectly, have stolen. Last week at DockerCon 2017 in Austin, two of Microsoft’s busiest engineers demonstrated how they are rewiring the schematics of Windows Server 2016 (due for a major update in just a few weeks’ time), enabling it to manage and run Linux-based containers on a Linux subsystem within Windows.

“We wanted to have one layer that was kind of a, ‘Here’s the entry point to all things,’” described Taylor Brown, Microsoft’s principal lead program manager.

Technically, Brown was explaining why Microsoft intentionally redesigned its Hyper-V virtualization system to include something called Host Compute Service (HCS). His team had observed how, in Linux systems, multiple permutations of container systems were simultaneously pinging the same control group (cgroup) interfaces.

“What we feared was, someday you’re going to have Docker running next to rkt next to some other thing,” he continued, “and there’s going to be no common way to be talking about these things at all.”

The Entry Point to All Things

Soon after Microsoft premiered its Docker support two years ago at its own company conference, it presented our first glimpse at a scheme for Windows and Linux interoperability. In August 2015, Taylor Brown explained why Microsoft chose to produce two implementations of containers: one just branded “Windows containers,” and the other “Hyper-V containers.” For security purposes, he said, you may have a need to run containers in perfect isolation, and Hyper-V provides that.

Of course, isolation was the original idea behind the creation of cgroups in Linux anyway. So there was a lingering question over why a Windows dev or admin wouldn’t want Hyper-V implementation in every case.

Tuesday, there came a long, thorough, and somewhat necessarily circuitous response to that question.

“When you run a Windows Server container, that is a shared kernel,” explained Brown, being careful now to include the word “Server” in the phrase for reasons that will soon become obvious. “If I run a second one of those, same kernel. They also share the same kernel with the host.”

As it turns out, enabling this kernel sharing runs contrary to the architecture of Windows 10, Microsoft’s client-side OS.

“Even though it’s the same kernel between Windows 10 and [Windows] Server,” said Brown, in response to an attendee’s question, “they operate differently. You get different scheduling parameters, you get different memory management techniques. So if you were trying to get to a state where you could say definitively, ‘This is going to run the same way,’ those will interfere with and change the way things work.”

Put another way: If the Docker-inspired methodology of sharing kernels were to be applied to both Windows 10 and Windows Server, then most every effort you would make to try to balance the performance characteristics of a container across both environments would lead to imbalance. This is a problem Microsoft has encountered before, specifically with its long-standing web server, IIS. Beginning with version 6.0 (released in 2003), it uses a kernel-mode device driver HTTP.SYS — in Windows parlance, a library intended only for use at the base layer of Windows. This was Microsoft hard-wiring the core of its Web server into its operating system.

Although Microsoft unified its kernel for client and server OSes, their support structures have diverged greatly. Their concepts of time and process scheduling are now based on separate constructs. As a result, IIS behaves dramatically differently for both OSes. Ensuring consistent behavior across all supported implementations is now a requirement for any containerization platform. So despite documentation published only months ago explaining how to run both container types on Windows 10, going forward, said Brown, the client OS will only run Hyper-V containers.

“That image would work differently in those different environments,” continued Brown, “which is kind of counter to the entire spirit and goal of what we’ve done with Docker. Which is why we’ve implemented it the way that we did, at least for now.”

The Entire Spirit and Goal

If it were Docker’s idea from the beginning to construct two platforms that could run workloads identically on two operating systems, the project might never have been completed. From an architectural standpoint, it’s not much different from constructing a skyscraper on an island and an identically functioning one on an offshore platform.

Suffice it to say, under the hood, the two Dockers don’t work alike. What’s more, Microsoft has made significant changes to Windows Server to enable any kind of containerization to happen at all.

The crux of Microsoft’s changes has to do with networking. Whereas the virtual switch, or vSwitch, has played very little of a role in the admittedly Linux-leaning New Stack so far, and no role whatsoever in Docker for Linux, it is critical to how Windows perceives a virtual network. All Docker containers’ connections take place over an IP network; and in Linux, the model for those connections is, simply put, Linux Routing. Just as simply, there is no Linux Routing in Windows. So Microsoft’s and Docker’s engineers had to gear the Windows version of the container platform for the vSwitch.

Back in the era of Windows Vista and the “Longhorn” project, Microsoft’s Windows engineers created a concept called IP routing compartments. Its original intent was to guarantee isolation for routing within a virtual private network (VPN). Docker networking on Windows does not use a VPN; however, it does leverage the compartment concept. Hyper-V leveraged it first to enable multiple virtual servers to run on a physical server. Specifically, compartmentalization ensures that each virtual server is isolated from all the others.

If you recall how Linux containers evolved, the concept began with the cgroup, which led to isolated namespaces. Giving each of these spaces network isolation with designated IP addresses, then led to containers as Linux developers understand them. With Windows, the evolution ran the opposite direction, at least as Microsoft principal engineering lead Dinesh Govindasamy explained it last Tuesday at DockerCon, in the session with Brown. First, Hyper-V gave Windows engineers the network isolation component that any container would need. From there, these engineers had to build a way to share the kernel, to share resources over an IP network, and to incorporate subsystems that would, in turn, enable Linux containers to run in Windows.

So even though software-defined networking (SDN) has been a thing for quite a while, Windows Server will only now begin officially supporting overlay network mode — a critical feature for implementing the Docker Swarm orchestration engine — when the latest update patch ships in just days, announced Govindasamy.

“The Docker networking architecture is built upon this Container Networking Model,” he went on, presenting a diagram that looks like a stack of pancakes cooked for Piet Mondrian.

For this architecture to apply to analogous Windows components, there needed to be a way for an application to securely build out the network — something Windows didn’t really have.

So Govindasamy and his colleagues devised a component called Host Networking Service (HNS, the networking counterpart to HCS). Docker for Windows relies upon HNS to create vSwitches and virtual firewalls (what Windows calls WinNAT). HNS also establishes network endpoints, binds them to vSwitch ports, and applies policies to the endpoints, he explained to attendees.

“The default network mode we have in Linux is bridge mode. And the default network mode we have in Windows is NAT mode,” he continued, illustrating with a very broad brush the fact that the two operating systems have entirely different networking priorities. In a sense, they have contrasting personalities; Linux wants to connect, while Windows wants to abstract.

“For NAT mode, we create an internal vSwitch,” explained Govindasamy. “[It’s] a private vSwitch with an addition of a gateway NIC [virtual network interface card] to the host partition. Then we create a NAT between the gateway NIC and the external NIC. So any containers you added to this NAT network should be able to talk to each other because of the vSwitch, and any traffic that’s going out of this NAT network will be NATted using WinNAT.”

Common Assemblies

Why does all this matter? There are a few reasons, which I promise are interesting.

First, while it has been the goal of container orchestration to present consistent methodologies across platforms, the adaptations made to Kubernetes to enable Windows container networking are certain to result in different performance profiles. Not necessarily worse, perhaps even better, but certainly different.

And that leads to the other key reason why this matters: While the original Linux Routing architecture for Docker could not be translated into Windows, theoretically, there’s nothing stopping Microsoft’s vSwitch-oriented methodology from, at some point, being translated into a Linux implementation, if only experimentally. Open vSwitch is no longer a VMware project; it’s now stewarded by the Linux Foundation, of which Microsoft is a member.

Furthermore, during Tuesday’s keynote session at DockerCon where he introduced the Moby Project, Docker Chief Technology Officer Solomon Hykes spoke of the need for standardizing Docker’s chassis — the way an automobile manufacturer does — in order to enable easier specialization for various platforms.

“Obviously they re-use the same individual parts — the same wheels, engines, etc.,” said Hykes, “but they also collaborate on common assemblies of these components. And that allows them to not duplicate effort… So we stole that idea, and we applied it to this engineering problem of ours.  And the result is something that worked really well. We created within Docker a place where all of our teams could collaborate… on common assemblies.”

Networking is fundamental to the way Docker operates, so it’s impossible to imagine that Docker’s engineers have somehow avoided considering the possibility of adopting Microsoft’s vSwitch/vNIC/NAT model or perhaps applying some standardized form of the model across all its editions, including Linux, at some future date. Thus, the problem Microsoft solved for enabling Docker on Windows Server may very well have pointed the way toward a future for Docker everywhere else.

After all, an ecosystem abhors imbalance.

Feature image: A Roberval balance scale, taken by Nikodem Nijaki, licensed under Creative Commons 3.0.

The post Dockercon 2017: How Docker Changed Windows So Windows Could Change Docker appeared first on The New Stack.


Guest post: Using Terraform to manage Google Cloud Platform infrastructure as code

Managing infrastructure usually involves a web interface or issuing commands in the terminal. These work great for individuals and small teams, but managing infrastructure in this way can be troublesome for larger teams with complex requirements. As more organizations migrate to the cloud, CIOs want hybrid and multi-cloud solutions. Infrastructure as code is one way to manage this complexity.

The open-source tool Terraform, in particular, can help you more safely and predictably create, change and upgrade infrastructure at scale. Created by HashiCorp, Terraform codifies APIs into declarative configuration files that can be shared amongst team members, edited, reviewed and versioned in the same way that software developers can with application code.

Here's a sample Terraform configuration for creating an instance on Google Cloud Platform (GCP):

resource "google_compute_instance" "blog" {
  name         = "default"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  disk {
    image = "debian-cloud/debian-8"
  }

  disk {
    type    = "local-ssd"
    scratch = true
  }

  network_interface {
    network = "default"
  }
}
Because this is a text file, it can be treated the same as application code and manipulated with the same techniques that developers have had for years, including linting, testing, continuous integration, continuous deployment, collaboration, code review, change requests, change tracking, automation and more. This is a big improvement over managing infrastructure with wikis and shell scripts!

Terraform separates the infrastructure planning phase from the execution phase. The terraform plan command performs a dry-run that shows you what will happen. The terraform apply command makes the changes to real infrastructure.

$ terraform plan
+ google_compute_instance.default
    can_ip_forward:                    "false"
    create_timeout:                    "4"
    disk.#:                            "2"
    disk.0.auto_delete:                "true"
    disk.0.disk_encryption_key_sha256: ""
    disk.0.image:                      "debian-cloud/debian-8"
    disk.1.auto_delete:                "true"
    disk.1.disk_encryption_key_sha256: ""
    disk.1.scratch:                    "true"
    disk.1.type:                       "local-ssd"
    machine_type:                      "n1-standard-1"
    metadata_fingerprint:              ""
    name:                              "default"
    self_link:                         ""
    tags_fingerprint:                  ""
    zone:                              "us-central1-a"

$ terraform apply
google_compute_instance.default: Creating...
  can_ip_forward:                    "" => "false"
  create_timeout:                    "" => "4"
  disk.#:                            "" => "2"
  disk.0.auto_delete:                "" => "true"
  disk.0.disk_encryption_key_sha256: "" => ""
  disk.0.image:                      "" => "debian-cloud/debian-8"
  disk.1.auto_delete:                "" => "true"
  disk.1.disk_encryption_key_sha256: "" => ""
  disk.1.scratch:                    "" => "true"
  disk.1.type:                       "" => "local-ssd"
  machine_type:                      "" => "n1-standard-1"
  metadata_fingerprint:              "" => ""
  name:                              "" => "default"
  network_interface.#:               "" => "1"
  network_interface.0.address:       "" => "<computed>"
  network_interface.0.name:          "" => "<computed>"
  network_interface.0.network:       "" => "default"
  self_link:                         "" => ""
  tags_fingerprint:                  "" => ""
  zone:                              "" => "us-central1-a"
google_compute_instance.default: Still creating... (10s elapsed)
google_compute_instance.default: Still creating... (20s elapsed)
google_compute_instance.default: Creation complete (ID: default)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

This instance is now running on Google Cloud:

Terraform can manage more than just compute instances. At Google Cloud Next, we announced support for GCP APIs to manage projects and folders as well as billing. With these new APIs, Terraform can manage entire projects and many of their resources.

By adding just a few lines of code to the sample configuration above, we create a project tied to our organization and billing account, enable a configurable number of APIs and services on that project and launch the instance inside this newly-created project.

resource "google_project" "blog" {
  name            = "blog-demo"
  project_id      = "blog-demo-491834"
  billing_account = "${var.billing_id}"
  org_id          = "${var.org_id}"
}

resource "google_project_services" "blog" {
  project = "${google_project.blog.project_id}"

  services = [
    # ...
  ]
}

resource "google_compute_instance" "blog" {
  # ...

  project = "${google_project.blog.project_id}" # <-- new option
}

Terraform also detects changes to the configuration and only applies the difference of the changes.

$ terraform apply
google_compute_instance.default: Refreshing state... (ID: default)
google_project.my_project: Creating...
  name:        "" => "blog-demo"
  number:      "" => ""
  org_id:      "" => "1012963984278"
  policy_data: "" => ""
  policy_etag: "" => ""
  project_id:  "" => "blog-demo-491834"
  skip_delete: "" => ""
google_project.my_project: Still creating... (10s elapsed)
google_project.my_project: Creation complete (ID: blog-demo-491835)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

We can verify the project is created with the proper APIs:

And the instance exists inside this project.

This project + instance can be stamped out multiple times. Terraform can also create and export IAM credentials and service accounts for these projects.
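As a sketch of that last point, the Terraform Google provider exposes service-account resources that can ride along with each stamped-out project. The wiring below is illustrative (it assumes the google_project resource named "blog" from the earlier example), and resource availability depends on your provider version.

```hcl
# Illustrative sketch: create a service account inside the stamped-out
# project and export a key for it. Assumes the google_project "blog"
# resource defined earlier.
resource "google_service_account" "app" {
  project      = "${google_project.blog.project_id}"
  account_id   = "app-runner"
  display_name = "Application runner"
}

resource "google_service_account_key" "app" {
  service_account_id = "${google_service_account.app.name}"
}
```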

By combining GCP’s new resource management and billing APIs with Terraform, you have more control over your organization's resources. With the isolation guaranteed by projects and the reproducibility provided by Terraform, it's possible to quickly stamp out entire environments. Terraform parallelizes as many operations as possible, so it's often possible to spin up a new environment in just a few minutes.

Use Cases

There are many challenges that can benefit from an infrastructure as code approach to managing resources. Here are a few that come to mind:

Ephemeral environments
Once you've codified an infrastructure in Terraform, it's easy to stamp out additional environments for development, QA, staging or testing. Many organizations pay thousands of dollars every month for a dedicated staging environment. Because Terraform parallelizes operations, you can curate a copy of production infrastructure in just one trip to the water cooler. Terraform enables developers to deploy their changes into identical copies of production, letting them catch bugs early.
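One lightweight way to get there is to parameterize the environment name so the same configuration stamps out each copy. This is a sketch reusing the variables from the earlier example; the project_id suffix is illustrative.

```hcl
# Sketch: parameterize the environment so one configuration can stamp
# out dev/QA/staging copies. Variable names follow the earlier example.
variable "env" {
  default = "staging"
}

resource "google_project" "env" {
  name            = "blog-demo-${var.env}"
  project_id      = "blog-demo-${var.env}-491834"
  billing_account = "${var.billing_id}"
  org_id          = "${var.org_id}"
}
```

Running `terraform apply -var env=qa` would then stamp out a QA copy alongside staging.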

Rapid project stamping
The new Terraform google_project APIs enable quick project stamping. Organizations can easily create identical projects for training, field demos, new hires, coding interviews or disaster recovery. In larger organizations with rollup billing, IT teams can use Terraform to stamp out pre-configured environments tied to a single billing organization.

On-demand continuous integration
You can use Terraform to create continuous integration or build environments on demand that are always in a clean state. These environments only run when needed, reducing costs and improving parity by using the same configurations each time.

Whatever your use case, the combination of Terraform and GCP’s new resource management APIs represents a powerful new way to manage cloud-based environments. For more information, please visit the Terraform website or review the code on GitHub.

One podcaster's (fruitless) quest to replace Skype

Every now and then when I complain about Skype, which most of my podcast peers and I use for our conversations, someone suggests an alternative voice-over-IP service and asks why we don’t switch.

The truth is, Skype’s terribleness may be overstated—people get cranky when they’re entirely dependent on a single product and that product isn’t reliable—and the product has gotten better recently after a few particularly rocky months.

But it’s not just about abandoning Skype. Yes, there are numerous services that will let multiple people connect over the Internet and have a voice conversation.1 Yes, we could move to Google Hangouts or some other web-based business conferencing tool or video game chat app.

But here’s the thing: Everybody I know uses Skype. If I’m going to start the painful process of moving house—of getting everyone I’m on a podcast with to, over the course of many months, upgrade their software and get used to a new way of working—I want to move to something that is vastly superior to what we’re currently using. There is no point in dealing with transition costs—inevitably including many lost minutes as everyone waits for someone to install unfamiliar software and figure out how to use it—to make a lateral move.2

Leaving aside the fact that I have no real faith that alternative option X is actually better than Skype—one person’s “I’ve never had any problems” can be another person’s “omigod it was a disaster”—I’ve decided that I’m leaving Skype only if I’m forced to or if I can find a tool that solves other problems specific to podcasters.

Right now, the biggest issues I have with Skype, beyond the occasional bout of unreliability, are related to recording audio. This isn’t Skype’s fault—it wasn’t built with recording podcasts in mind!—but it’s a necessity for podcasters. While I’m doing a podcast, I need to record my own microphone and, ideally, the rest of the conversation—and in separate files or on separate tracks. (You can read more about this in my “How I Podcast: Recording” article.)

On the Mac, this is pretty easy—I bought a bunch of copies of Ecamm Call Recorder for Skype, which is a plug-in that integrates recording right into Skype. For people who don’t have Call Recorder, QuickTime can record audio fairly easily. On Windows, it’s more complicated—the podcast guest guide that I use recommends downloading the free audio app Audacity. More complexity means there are more chances to do something wrong.

And then there’s iOS, where this is just impossible. You can’t record your microphone locally while talking on Skype. This severely limits iOS podcasting.

Plus there are some things that Skype does really well that any replacement needs to do a decent job at. Skype massages audio before it reaches you, leveling and boosting audio and removing background noise and echoes. Its servers merge audio streams together so that multi-person conversations can happen even on low-bandwidth connections. Skype may have its issues, but it’s also got a lot of strengths that I didn’t appreciate until I began investigating alternatives.

So if I’m going to move from Skype, I need to move to something that won’t be dramatically worse than Skype in terms of stability and audio quality, and it needs to make it easier to record podcast audio across all major platforms, desktop and mobile.

This is a big ask. And it turns out, there’s basically no solution today. But there is hope.

The closest we’ve come are two web services, Cast and Zencastr. Both of these services rely on WebRTC, a browser-based set of real-time communication protocols that let browsers transfer audio and video without special plug-ins. Both services automatically record the local audio of participants and upload them to a remote server, so panelists don’t need to install or run any special software to have their high-quality audio captured for later use.


Cast costs $10/month for its basic plan. I’ve used it for several months in the recording of the TV Talk Machine podcast, and have found it to be quite reliable. It can’t handle conferences with more than four participants, including the host, which disqualifies it from my large panels on The Incomparable, but most podcasts don’t have panels with five or six people in them.

Zencastr has a basic free tier, but to record with more than one guest it’s $20/month. Zencastr claims it can handle “unlimited” guests, though I haven’t tested this and suspect it will bog down quickly if you have a large panel. I’ve used it a few times and found it a little less reliable than Cast—I’ve seen files cut off a few seconds too early, and the quality of the live audio connection had more artifacts than I’ve seen with Cast.


I appreciate Zencastr’s cloud-storage integration: all source files are automatically deposited in my Dropbox after a session is over. In contrast, Cast makes me wait for several minutes before I can download my files.

If you’re recording a podcast with three or four participants, Cast’s $10/month plan is a pretty good deal. If it’s just a one-on-one chat, Zencastr’s free tier is even better. For more than four participants, though, you’re back to Zencastr and you’ll pay $20/month for the privilege. Still, there’s a lot to be said for automatically recording panelist audio without any intervention.

…but then there’s mobile. The fact is, Safari doesn’t support WebRTC right now, so you can’t use either Cast or Zencastr on an iPad or iPhone. It looks like WebKit will support WebRTC at some point in the near future, but we might not see support in iOS until 2018.

In looking for a solution that would work on my iPhone or iPad, I discovered Ringr, which records the local microphone, supports WebRTC on the desktop, and offers iOS and Android apps. Unfortunately, Ringr only supports one-on-one calls, so while it would work great for two-person podcasts, it’s a non-starter for panels. A Ringr support page suggests that conference calling will appear in the future, but the date it cites—“later in 2016”—doesn’t inspire confidence.

For the record, business-conference-call apps with desktop and mobile versions don’t support recording of local microphone tracks. Some of them will record the entire conference call on the server, which is cool, but that’s only good for reference—for the best podcast audio, you want to record the microphone at the source.

So the end result of all this? I’ve got a close eye on Zencastr, Cast, and on the progress of implementing WebRTC in WebKit. But for now, there doesn’t seem to be a single voice-over-IP product of any kind that will work on Mac, Windows, and iOS and automatically record local audio.

  1. Since many of my podcasts feature more than two people, two-person tools like FaceTime are not an option. ↩

  2. This isn’t just about Skype, but the tools people use to record their audio—if we leave Skype, often those tools have to change, too. ↩


BetterTLS - A Name Constraints test suite for HTTPS clients


Written by Ian Haken

At Netflix we run a microservices architecture that has hundreds of independent applications running throughout our ecosystem. One of our goals, in the interest of implementing security in depth, is to have end-to-end encrypted, authenticated communication between all of our services wherever possible, regardless of whether or not it travels over the public internet. Most of the time, this means using TLS, an industry standard implemented in dozens of languages. However, this means that every application in our environment needs a TLS certificate.

Bootstrapping the identity of our applications is a problem we have solved, but most of our applications are resolved using internal names or are directly referenced by their IP (which lives in a private IP space). Public Certificate Authorities (CAs) are specifically restricted from issuing certificates of this type (see the CA/B Forum Baseline Requirements), so it made sense to use an internal CA for this purpose. As we convert applications to use TLS (e.g., by using HTTPS instead of HTTP) it was reasonably straightforward to configure them to use a truststore which includes this internal CA. However, the question remained of what to do about users accessing their services using a browser. Our internal CA isn’t trusted by browsers out-of-the-box, so what should we do?

The most obvious answer is straightforward: “add the CA to browsers’ truststores.” But we were hesitant about this solution. By forcing our users to trust a private CA, they must take on faith that this CA is only used to mint certificates for internal services and is not being used to man-in-the-middle traffic to external services (such as banks, social media sites, etc). Even if our users do take on faith our good behavior, the impact of a compromise to our infrastructure becomes significant; not only could an attacker compromise our internal traffic channels, but all of our employees are suddenly at risk, even when they’re at home.
Fortunately, the often underutilized Name Constraints extension provides us a solution to both of these concerns.

The Name Constraints Extension

One powerful (but often neglected) feature of the X.509 certificate specification is the Name Constraints extension. This extension can be placed on a CA certificate to whitelist and/or blacklist the domains and IPs for which that CA, or any of its sub-CAs, is allowed to issue certificates. For example, suppose you trust the Acme Corp Root CA, which delegates to various other sub-CAs that ultimately sign certificates for websites. They may have a certificate hierarchy that looks like this:

Now suppose that Beta Corp and Acme Corp become partners and need to start trusting each other's services. Similar to Acme Corp, Beta Corp has a root CA that has signed certificates for all of its services. Therefore, services inside Acme Corp need to trust the Beta Corp root CA. Rather than update every service in Acme Corp to include the new root CA in its truststore, a simpler solution is for Acme Corp to cross-certify with Beta Corp so that the Beta Corp root CA has a certificate signed by the Acme Corp Root CA. For users inside Acme Corp, their trust hierarchy now looks like this.

However, this has the undesirable side effect of exposing users inside of Acme Corp to the risk of a security incident inside Beta Corp. If a Beta Corp CA is misused or compromised, it could issue certificates for any domain, including those of Acme Corp.

This is where the Name Constraints extension can play a role. When Acme Corp signs the Beta Corp root CA certificate, it can include an extension in the certificate which declares that it should only be trusted to issue certificates under Beta Corp's domain. This way, Acme Corp users would not trust certificates mis-issued for Acme Corp's own domains by CAs under the Beta Corp root CA.

This example demonstrates how Name Constraints can be useful in the context of CA cross-certification, but it also applies to our original problem of inserting an internal CA into browsers’ trust stores. By minting the root CA with Name Constraints, we can limit what websites could be verified using that trust root, even if the CA or any of its intermediaries were misused.
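Minting such a root CA can be illustrated with the `cryptography` package. This is a minimal sketch, not the CA tooling used in the article; the domain `betacorp.example` is a placeholder standing in for the constrained subtree:

```python
# Sketch: a self-signed root CA carrying a Name Constraints extension that
# limits it (and any sub-CA it signs) to one domain subtree.
import datetime

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Beta Corp Root CA")])
now = datetime.datetime.utcnow()

cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)  # self-signed: issuer == subject
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(days=365))
    .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
    # The Name Constraints extension: only betacorp.example and its
    # subdomains should be trusted when this root anchors a chain.
    .add_extension(
        x509.NameConstraints(
            permitted_subtrees=[x509.DNSName("betacorp.example")],
            excluded_subtrees=None,
        ),
        critical=True,
    )
    .sign(key, hashes.SHA256())
)

nc = cert.extensions.get_extension_for_class(x509.NameConstraints).value
print([g.value for g in nc.permitted_subtrees])
```

Marking the extension critical means a verifier that does not understand Name Constraints must reject the chain rather than silently ignore the restriction.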

At least, that’s how Name Constraints should work.

The Trouble with Name Constraints

The Name Constraints extension lives on the certificate of a CA but can’t actually constrain what a bad actor does with that CA’s private key (much less control what a subordinate CA issues), so even with the extension present there is nothing to stop the bad actor from signing a certificate which violates the constraint. Therefore, it is up to the TLS client to verify that all constraints are satisfied whenever the client verifies a certificate chain.

This means that for the Name Constraints extension to be useful, HTTPS clients (and browsers in particular) must enforce the constraints properly.
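The check a client must perform for DNS name constraints can be sketched in a few lines. This is a simplified illustration of the RFC 5280 matching rule (a DNS name falls inside a subtree if it equals the constraint or is a subdomain of it), not a complete chain validator:

```python
# Minimal sketch of client-side Name Constraints enforcement for DNS names:
# every DNS name on the leaf certificate must fall inside some permitted
# subtree and outside every excluded subtree.

def dns_name_in_subtree(name: str, constraint: str) -> bool:
    """A DNS name matches a constraint if it equals it or is a subdomain."""
    name, constraint = name.lower(), constraint.lower()
    return name == constraint or name.endswith("." + constraint)

def satisfies_constraints(leaf_dns_names, permitted=None, excluded=()):
    for name in leaf_dns_names:
        if permitted is not None and not any(
            dns_name_in_subtree(name, c) for c in permitted
        ):
            return False  # name escapes the whitelist
        if any(dns_name_in_subtree(name, c) for c in excluded):
            return False  # name hits the blacklist
    return True

# A CA constrained to betacorp.example must not vouch for other domains:
print(satisfies_constraints(["www.betacorp.example"], permitted=["betacorp.example"]))   # True
print(satisfies_constraints(["login.acmecorp.example"], permitted=["betacorp.example"]))  # False
```

A real verifier must also apply the constraints to every intermediate CA in the chain and handle IP-address and other name forms, which is exactly where the article found implementations falling short.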

Before relying on this solution to protect our users, we wanted to make sure browsers were really implementing Name Constraints verification and doing so correctly. The initial results were promising: every browser we tested (Chrome, Firefox, Edge, and Safari) raised a verification error when browsing to a site where a CA had signed a certificate in violation of the constraints.

However, as we extended our test suite beyond basic tests we rapidly began to lose confidence. We created a battery of test certificates which moved the subject name between the certificate's subject common name and Subject Alternative Name extension, which mixed the use of Name Constraint whitelisting and blacklisting, and which used both DNS names and IP names in the constraint. The result was that every browser (except for Firefox, which showed a 100% pass rate) and every HTTPS client (such as Java, Node.js, and Python) allowed some sort of Name Constraint bypass.

Introducing BetterTLS

In order to raise awareness of the issues we discovered, to encourage TLS implementers to correct them, and to allow them to include some of these tests in their own test suites, we are open sourcing the test suite we created and making it available online. We created BetterTLS with the hope that the tests we add to this site can help improve the resiliency of TLS implementations.

Before we made BetterTLS public, we reached out to many of the affected vendors and are happy to say that we received a number of positive responses. We'd particularly like to thank Ryan Sleevi and Adam Langley from Google, who were extremely responsive and immediately took action to remediate some of the discovered issues and incorporate some of these test certificates into their own test suite. We have also received confirmation from Oracle that they will be addressing the results of this test suite in Java in an upcoming security release.

The source for BetterTLS is available on GitHub, and we welcome suggestions, improvements, corrections, and additional tests!

The Evolution of Container Usage at Netflix

Containers are already adding value to our proven globally available cloud platform based on Amazon EC2 virtual machines.  We’ve shared pieces of Netflix’s container story in the past (video, slides), but this blog post will discuss containers at Netflix in depth.  As part of this story, we will cover Titus: Netflix’s infrastructural foundation for container based applications.  Titus provides Netflix scale cluster and resource management as well as container execution with deep Amazon EC2 integration and common Netflix infrastructure enablement.

This month marks two major milestones for containers at Netflix.  First, we have achieved a new level of scale, crossing one million containers launched per week.  Second, Titus now supports services that are part of our streaming service customer experience.  We will dive deeper into what we have done with Docker containers as well as what makes our container runtime unique.

History of Container Growth

Amazon's virtual machine based infrastructure (EC2) has been a powerful enabler of innovation at Netflix.  In addition to virtual machines, we've also chosen to invest in container-based workloads for a few unique values they provide.  The benefits, the excitement, and the explosive growth in container usage among our developers have surprised even us.

While EC2 supported advanced scheduling for services, this didn’t help our batch users.  At Netflix there is a significant set of users that run jobs on a time or event based trigger that need to analyze data, perform computations and then emit results to Netflix services, users and reports.  We run workloads such as machine learning model training, media encoding, continuous integration testing, big data notebooks and CDN deployment analysis jobs many times each day.  We wanted to provide a common resource scheduler for container based applications independent of workload type that could be controlled by higher level workflow schedulers.  Titus serves as a combination of a common deployment unit (Docker image) and a generic batch job scheduling system. The introduction of Titus has helped Netflix expand to support the growing batch use cases.

With Titus, our batch users are able to put together sophisticated infrastructure quickly because they only have to specify resource requirements.  Users no longer have to deal with choosing and maintaining AWS EC2 instance sizes that don't always perfectly fit their workloads.  Users trust Titus to pack larger instances efficiently across many workloads.  Batch users develop code locally and then immediately schedule it for scaled execution on Titus.  Using containers, Titus runs any batch application, letting the user specify exactly what application code and dependencies are needed.  For example, in machine learning training we have users running a mix of Python, R, Java and bash script applications.
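The "specify only resource requirements" model can be sketched as a job spec. This is a hypothetical illustration, not Titus's actual API; the field names, registry URL, and validation helper are invented for the example:

```python
# Hypothetical sketch of a batch job submission: the user supplies a Docker
# image plus resource requirements, and the scheduler handles everything else
# (instance selection, packing, retries). Not Titus's real API.

job_spec = {
    "name": "ml-model-training",
    "type": "batch",                                  # batch jobs run "until done"
    "image": "registry.example.com/ml/trainer:1.4",   # placeholder registry/image
    "entrypoint": ["python", "train.py"],
    "resources": {"cpus": 8, "memoryMB": 32768, "gpus": 1, "diskMB": 100000},
    "retries": 2,                                     # batch jobs retried per policy
}

def validate(spec):
    """Reject specs missing the fields a scheduler would need."""
    required = {"name", "type", "image", "resources"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True

print(validate(job_spec))  # True
```

Note that nothing in the spec names an EC2 instance type; that choice is exactly what the scheduler takes off the user's plate.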

Beyond batch, we saw an opportunity to bring the benefits of simpler resource management and a local development experience for other workloads.  In working with our Edge, UI and device engineering teams, we realized that service users were the next audience.  Today, we are in the process of rebuilding how we deploy device-specific server-side logic to our API tier leveraging single core optimized NodeJS servers.  Our UI and device engineers wanted a better development experience, including a simpler local test environment that was consistent with the production deployment.

In addition to a consistent environment, with containers developers can push new application versions faster than before by leveraging Docker layered images and pre-provisioned virtual machines ready for container deployments.  Deployments using Titus now can be done in one to two minutes versus the tens of minutes we grew accustomed to with virtual machines.  

The theme that underlies all these improvements is developer innovation velocity.  Both batch and service users can now experiment locally and test more quickly.  They can also deploy to production with greater confidence than before.  This velocity drives how fast features can be delivered to Netflix customers and therefore is a key reason why containers are so important to our business.

Titus Details

We have already covered what led us to build Titus.  Now, let's dig into the details of how Titus provides these values.  We will provide a brief overview of how Titus scheduling and container execution support the service and batch job requirements shown in the diagram below.

[Diagram: Titus scheduling and container execution]

Titus handles the scheduling of applications by matching required resources and available compute resources.  Titus supports both service jobs that run “forever” and batch jobs that run “until done”.  Service jobs restart failed instances and are autoscaled to maintain a changing level of load.  Batch jobs are retried according to policy and run to completion.  

Titus offers multiple SLAs for resource scheduling.  Titus offers on-demand capacity for ad hoc batch and non-critical internal services by autoscaling capacity in EC2 based on current needs.  Titus also offers pre-provisioned guaranteed capacity for user-facing workloads and more critical batch.  The scheduler does both bin packing for efficiency across larger virtual machines and anti-affinity for reliability spanning virtual machines and availability zones.  The foundation of this scheduling is a Netflix open source library called Fenzo.

Titus's container execution, which runs on top of EC2 VMs, integrates with both AWS and Netflix infrastructure.  We expect users to use both virtual machines and containers for a long time to come, so we decided that we wanted the cloud platform and operational experiences to be as similar as possible.  In using AWS we chose to deeply leverage existing EC2 services.  We used Virtual Private Cloud (VPC) for routable IPs rather than a separate network overlay.  We leveraged Elastic Network Interfaces (ENIs) to ensure that all containers had application-specific security groups.  Titus provides a metadata proxy that enables containers to get a container-specific view of their environment as well as IAM credentials.  Containers do not see the host's metadata (e.g., IP, hostname, instance-id).  We implemented multi-tenant isolation (CPU, memory, disk, networking and security) using a combination of Linux, Docker and our own isolation technology.
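The metadata-proxy idea can be sketched as a lookup that answers metadata requests with container-scoped values instead of the host's. The paths, values, and handler shape below are illustrative, not Titus's actual implementation:

```python
# Minimal sketch of a metadata proxy: requests that would normally hit the
# EC2 instance metadata service are answered with the container's view,
# never the host's. All names and values here are hypothetical.

HOST_METADATA = {                      # what the EC2 host itself would report
    "local-ipv4": "10.0.0.5",
    "instance-id": "i-0abc123",
    "hostname": "ip-10-0-0-5.ec2.internal",
}

def metadata_proxy(path, container):
    """Return (status, body): a container's view of instance metadata."""
    views = {
        "local-ipv4": container["ip"],          # the container's own ENI IP
        "instance-id": container["task_id"],    # task id instead of host id
        "hostname": container["hostname"],
    }
    if path not in views:
        return 404, ""                          # never fall through to host values
    return 200, views[path]

container = {"ip": "10.0.1.77", "task_id": "task-42", "hostname": "task-42.internal"}
print(metadata_proxy("local-ipv4", container))   # (200, '10.0.1.77')
print(metadata_proxy("instance-id", container))  # (200, 'task-42')
```

The key design point is the deny-by-default fallthrough: anything the proxy does not explicitly rewrite is refused rather than forwarded, so a container can never observe host identity by probing an unhandled path.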

For containers to be successful at Netflix, we needed to integrate them seamlessly into our existing developer tools and operational infrastructure.  For example, Netflix already had a solution for continuous delivery – Spinnaker.  While it might have been possible to implement rolling updates and other CI/CD concepts in our scheduler, delegating this feature set to Spinnaker allowed for our users to have a consistent deployment tool across both virtual machines and containers.  Another example is service to service communication.  We avoided reimplementing service discovery and service load balancing.  Instead we provided a full IP stack enabling containers to work with existing Netflix service discovery and DNS (Route 53) based load balancing.   In each of these examples, a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams provide.

Using existing systems comes at the cost of augmenting these systems to work with containers in addition to virtual machines.  Beyond the examples above, we had to augment our telemetry, performance autotuning, healthcheck systems, chaos automation, traffic control, regional failover support, secret management and interactive system access.  An additional cost is that tying into each of these Netflix systems has also made it difficult to leverage other open source container solutions that provide more than the container runtime platform.

Running a container platform at our level of scale (with this diversity of workloads) requires a significant focus on reliability.  It also uncovers challenges in all layers of the system.  We’ve dealt with scalability and reliability issues in the Titus specific software as well as the open source we depend on (Docker Engine, Docker Distribution, Apache Mesos, Snap and Linux).  We design for failure at all levels of our system including reconciliation to drive consistency between distributed state that exists between our resource management layer and the container runtime.  By measuring clear service level objectives (container launch start latency, percentage of containers that crash due to issues in Titus, and overall system API availability) we have learned to balance our investment between reliability and functionality.
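The reconciliation mentioned above can be sketched as a loop that diffs desired state against what the runtime reports and emits corrective actions. Task names, states, and the action vocabulary are invented for illustration:

```python
# Sketch of state reconciliation: compare the scheduler's desired state with
# what the container runtime actually reports, and emit corrective actions
# for any drift. Names and states here are illustrative.

def reconcile(desired, actual):
    """desired/actual: dicts of task_id -> state ("RUNNING", "FINISHED", ...)."""
    actions = []
    for task, state in desired.items():
        if state == "RUNNING" and actual.get(task) != "RUNNING":
            actions.append(("restart", task))   # scheduler thinks it runs; it doesn't
    for task in actual:
        if task not in desired:
            actions.append(("kill", task))      # runtime reports an orphan container
    return sorted(actions)

desired = {"t1": "RUNNING", "t2": "RUNNING"}
actual = {"t2": "RUNNING", "t3": "RUNNING"}     # t1 crashed, t3 is orphaned
print(reconcile(desired, actual))  # [('kill', 't3'), ('restart', 't1')]
```

Run periodically, a loop like this makes the system self-healing: any divergence introduced by crashes, network partitions, or lost messages is detected and corrected on the next pass.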

A key part of how containers help engineers become more productive is through developer tools.  The developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit).  Newt helps simplify container development, both for local iteration and for onboarding onto Titus.  Having a consistent container environment between Newt and Titus helps developers deploy with confidence.

Current Titus Usage

We run several Titus stacks in multiple test and production accounts across the three Amazon regions that power the Netflix service.

When we started Titus in December of 2015, we launched a few thousand containers per week across a handful of workloads.  Last week, we launched over one million containers, representing hundreds of workloads.  This roughly 1000X increase in container usage happened over the course of a year, and growth doesn't look to be slowing down.

We run a peak of 500 r3.8xl instances in support of our batch users.  That represents 16,000 cores of compute with 120 TB of memory.  We also added support for GPUs as a resource type using p2.8xl instances to power deep learning with neural nets and mini-batch.

In the early part of 2017, our stream-processing-as-a-service team decided to leverage Titus to enable simpler and faster cluster management for their Flink based system.  This usage has resulted in over 10,000 service job containers that are long running and re-deployed as stream processing jobs are changed.  These and other services use thousands of m4.4xl instances.

While the above use cases are critical to our business, issues with these containers do not impact Netflix customers immediately.  That has changed as Titus containers recently started running services that satisfy Netflix customer requests.

Supporting customer facing services is not a challenge to be taken lightly.  We’ve spent the last six months duplicating live traffic between virtual machines and containers.  We used this duplicated traffic to learn how to operate the containers and validate our production readiness checklists.  This diligence gave us the confidence to move forward making such a large change in our infrastructure.

The Titus Team

One of the key aspects of success of Titus at Netflix has been the experience and growth of the Titus development team.  Our container users trust the team to keep Titus operational and innovating with their needs.

We are not done growing the team yet.  We are looking to expand the container runtime as well as our developer experience.  If working on container focused infrastructure excites you and you’d like to be part of the future of Titus check out our jobs page.

On behalf of the entire Titus development team