Hardening Remote Access for Azure Machine Learning

Overview

In the Azure cloud, securing high-value assets is a foundational element of a robust FinOps and security practice. Azure Machine Learning (AML) compute instances, which are often equipped with powerful GPUs to process sensitive data and proprietary algorithms, represent prime targets for malicious actors. A common but critical oversight is leaving the Secure Shell (SSH) service exposed to the internet on its default TCP port 22.

This configuration creates a significant and unnecessary attack surface. Automated scanners and botnets continuously probe the internet for hosts listening on port 22 and launch brute-force and credential-stuffing attacks against them. Leaving the default in place exposes expensive compute resources to compromise, data exfiltration, and operational disruption.

Adopting a policy of using non-standard ports for SSH is a simple yet highly effective security hygiene measure. While not a complete defense on its own, it acts as a crucial first-line filter, dramatically reducing the volume of automated attacks and improving the signal-to-noise ratio for security monitoring. This article explores the risks of default SSH configurations in Azure and outlines the governance needed to mitigate them.

Why It Matters for FinOps

From a FinOps perspective, an insecure SSH configuration is a direct threat to cloud cost management and operational efficiency. When an Azure ML compute instance is compromised, the financial and business impacts can be severe. Attackers often repurpose these powerful, GPU-enabled instances for cryptojacking, leading to a sudden and massive spike in compute costs. The organization is left responsible for a bill that can run into thousands of dollars for resources that delivered no business value.

Beyond direct costs, a breach triggers significant operational drag. Responding to the incident consumes valuable engineering and security team hours that could have been spent on innovation. The compromised instance must be isolated and rebuilt, leading to downtime for data science teams and potential loss of unsaved work. This failure in governance not only creates waste but also erodes trust in the cloud environment, potentially slowing down future ML initiatives. Effective cost governance requires proactive security to prevent such avoidable and expensive incidents.

What Counts as “Idle” in This Article

In the context of this article, we define an "idle" configuration as any security setting left in its vendor-supplied default state without intentional review and hardening. Using TCP port 22 for SSH on an internet-facing Azure ML compute instance is a typical example. It represents a form of governance idleness, where the default setting is accepted passively rather than changed as part of a deliberate security strategy.

The primary signal for this risk is an Azure Network Security Group (NSG) rule that allows inbound traffic from any source (0.0.0.0/0, "Any", or the Internet service tag) to destination port 22 on the compute instance. Cloud security posture management tools flag this configuration readily, and it indicates a lack of defense-in-depth that leaves the resource open to automated, opportunistic attacks.
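The signal described above can be checked mechanically. The following is a minimal sketch, assuming each NSG rule has already been exported as a flat dictionary that uses the ARM security rule property names (direction, access, protocol, sourceAddressPrefix, destinationPortRange, and their plural forms). The helper names are hypothetical, and the sketch does not evaluate rule priority or service tags other than Internet.

```python
# Source values treated as "open to the internet" (compared case-insensitively).
OPEN_SOURCES = {"*", "any", "internet", "0.0.0.0/0"}


def port_spec_includes(port_spec: str, port: int) -> bool:
    """Return True if an NSG port spec ('*', '22', or '20-30') covers `port`."""
    spec = port_spec.strip()
    if spec == "*":
        return True
    if "-" in spec:
        low, high = spec.split("-", 1)
        return int(low) <= port <= int(high)
    return int(spec) == port


def exposes_ssh_to_internet(rule: dict) -> bool:
    """Flag an inbound allow rule that opens TCP port 22 to any source."""
    if str(rule.get("direction", "")).lower() != "inbound":
        return False
    if str(rule.get("access", "")).lower() != "allow":
        return False
    if str(rule.get("protocol", "*")).lower() not in {"tcp", "*"}:
        return False

    # A rule may carry a single prefix, a list of prefixes, or both.
    sources = [rule.get("sourceAddressPrefix")] + list(rule.get("sourceAddressPrefixes") or [])
    if not any(s and s.lower() in OPEN_SOURCES for s in sources):
        return False

    ports = [rule.get("destinationPortRange")] + list(rule.get("destinationPortRanges") or [])
    return any(p and port_spec_includes(p, 22) for p in ports)
```

Note that a range such as 20-25 also counts as exposing port 22, which a plain string comparison against "22" would miss.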

Common Scenarios

Scenario 1

Data science teams often use remote development tools like Visual Studio Code’s SSH extension to work directly on powerful Azure ML compute instances. To enable this workflow, SSH access is required. If security configurations are not actively managed, the instance is often deployed with the default port 22 exposed to the internet, creating an immediate vulnerability.

Scenario 2

Organizations migrating legacy data science workflows to Azure may have existing scripts that use protocols like SCP or SFTP to transfer data. These scripts are frequently hardcoded to use port 22. Without updating these automated processes during migration, teams inadvertently perpetuate a known security weakness in their new cloud environment.

Scenario 3

Continuous integration and deployment (CI/CD) pipelines sometimes need to connect to an ML compute instance to deploy code, run tests, or retrieve model artifacts. If the pipeline’s SSH connector is configured with default parameters, it creates a dependency on port 22, making it difficult to harden the environment without breaking automated workflows.
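One way to avoid the hardcoded-port problem in the last two scenarios is to read the port from configuration and pass it explicitly. The sketch below is illustrative only: the SSH_PORT environment variable and the function names are assumptions, not part of any Azure tooling. It also shows a detail that often breaks migrated scripts: the OpenSSH client takes the port as lowercase -p, while scp takes uppercase -P.

```python
import os


def configured_ssh_port(default: int = 22) -> int:
    """Read the SSH port from the (hypothetical) SSH_PORT environment variable."""
    port = int(os.environ.get("SSH_PORT", default))
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    return port


def ssh_command(user: str, host: str, remote_cmd: str, port: int) -> list:
    """Build an argument list for `ssh`; the port flag is lowercase -p."""
    return ["ssh", "-p", str(port), f"{user}@{host}", remote_cmd]


def scp_upload_command(user: str, host: str, local_path: str,
                       remote_path: str, port: int) -> list:
    """Build an argument list for `scp`; the port flag is uppercase -P."""
    return ["scp", "-P", str(port), local_path, f"{user}@{host}:{remote_path}"]
```

The returned lists can be handed to subprocess.run, so pipelines and transfer scripts pick up a port change from one setting instead of from edits scattered across scripts.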

Risks and Trade-offs

The primary goal of changing the default SSH port is to enhance security, but this action is not without trade-offs. The most immediate risk is operational disruption. If the Network Security Group rules are not updated correctly before the SSH configuration on the instance is changed, administrators can easily lock themselves out, requiring a more complex recovery process.
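A simple guard against lockout is to confirm that the new port accepts connections before the old path is closed. The following sketch uses only the Python standard library; the function name is hypothetical. It verifies that a TCP connection can be established, not that an SSH login would succeed.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

In a change procedure, a result of False for the new port would be a reason to stop and recheck the NSG rule and the SSH daemon before touching the existing port 22 rule.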

Furthermore, managing non-standard ports adds a minor layer of complexity for developers and automation tools, which must be explicitly configured to use the new port. This requires clear documentation and communication. While some argue this practice is merely "security through obscurity," the tangible benefit of filtering out massive volumes of automated attack traffic far outweighs the small operational cost of managing a custom port configuration, especially when used as part of a layered security strategy.

Recommended Guardrails

To enforce secure remote access policies consistently, organizations should establish clear governance and automated guardrails.

Start by creating a corporate standard that mandates the use of non-standard, high-numbered ports for any SSH access. This policy should be codified using Azure Policy to audit for or deny the creation of NSG rules that allow inbound traffic on port 22 from the internet.
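As a rough illustration of what such a policy might look like, the sketch below builds the rule portion of an Azure Policy definition as a Python dictionary and serializes it to JSON. The if/then structure and the audit and deny effects follow the Azure Policy definition format; the alias paths are written from memory and should be checked against the current alias list before use. The rule is deliberately simplified: it matches only the single-value port and source fields, so rules that use port ranges or the plural list properties would need additional conditions.

```python
import json

# Resource type for individual NSG rules; alias paths below are assumptions to verify.
RULE_TYPE = "Microsoft.Network/networkSecurityGroups/securityRules"


def build_ssh_policy_rule(effect: str = "audit") -> dict:
    """Build a simplified policy rule that flags inbound allow rules for port 22."""
    if effect not in {"audit", "deny"}:
        raise ValueError("effect must be 'audit' or 'deny'")
    return {
        "if": {
            "allOf": [
                {"field": "type", "equals": RULE_TYPE},
                {"field": f"{RULE_TYPE}/access", "equals": "Allow"},
                {"field": f"{RULE_TYPE}/direction", "equals": "Inbound"},
                {"field": f"{RULE_TYPE}/destinationPortRange", "in": ["22", "*"]},
                {"field": f"{RULE_TYPE}/sourceAddressPrefix",
                 "in": ["*", "Internet", "0.0.0.0/0"]},
            ]
        },
        "then": {"effect": effect},
    }


def policy_rule_json(effect: str = "audit") -> str:
    """Serialize the rule for pasting into a policy definition."""
    return json.dumps(build_ssh_policy_rule(effect), indent=2)
```

Starting with the audit effect shows how many existing rules would be affected before a switch to deny begins blocking deployments.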

Implement strong tagging standards to ensure every ML compute instance has a clear owner responsible for its configuration and security. All changes to network rules governing remote access should go through a documented approval flow. Finally, configure automated alerts in Microsoft Defender for Cloud to notify security and FinOps teams immediately when a non-compliant configuration is detected, enabling rapid remediation.

Provider Notes

Azure

In Azure, controlling SSH access is primarily managed through Network Security Groups (NSGs). An NSG acts as a stateful firewall, allowing you to define inbound and outbound rules based on source/destination IP, port, and protocol. The best practice is to create an inbound rule that allows SSH traffic on a custom port only from trusted IP address ranges, such as a corporate VPN.
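The rule described above can be sketched as data. The example below builds a dictionary in the shape of an ARM security rule and rejects inputs that would defeat its purpose, such as a 0.0.0.0/0 source. The function name, rule name, default priority, and the choice to require a port of 1024 or higher are assumptions for illustration; the 100 to 4096 priority range is the range NSGs accept for custom rules.

```python
import ipaddress


def build_ssh_allow_rule(port: int, trusted_cidrs: list, priority: int = 300) -> dict:
    """Build an inbound allow rule for a custom SSH port, limited to trusted CIDR ranges."""
    if not 1024 <= port <= 65535:
        raise ValueError("choose a non-standard port between 1024 and 65535")
    if not 100 <= priority <= 4096:
        raise ValueError("NSG rule priority must be between 100 and 4096")
    if not trusted_cidrs:
        raise ValueError("at least one trusted CIDR range is required")

    prefixes = []
    for cidr in trusted_cidrs:
        network = ipaddress.ip_network(cidr, strict=True)  # rejects malformed input
        if network.prefixlen == 0:
            raise ValueError(f"{cidr} matches every address and is not a trusted range")
        prefixes.append(str(network))

    return {
        "name": f"Allow-SSH-{port}-Trusted",
        "properties": {
            "priority": priority,
            "direction": "Inbound",
            "access": "Allow",
            "protocol": "Tcp",
            "sourcePortRange": "*",
            "sourceAddressPrefixes": prefixes,
            "destinationAddressPrefix": "*",
            "destinationPortRange": str(port),
        },
    }
```

For example, build_ssh_allow_rule(2222, ["203.0.113.0/24"]) yields a rule that admits only that documentation-range network, which stands in here for a corporate VPN egress range.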

For a stronger approach, consider eliminating public SSH exposure entirely. Azure Bastion is a fully managed service that provides RDP and SSH connectivity to your virtual machines directly from the Azure portal over TLS, without requiring a public IP address on the VM. Additionally, Microsoft Defender for Cloud offers Just-In-Time (JIT) VM access, which locks down inbound traffic to your VMs by default and opens ports such as SSH only on demand, for a limited time, and from approved IP addresses.

Binadox Operational Playbook

Binadox Insight: Changing the default SSH port is a simple but powerful technique to move your resources out of the line of fire. Most automated attacks are opportunistic and low-effort; by not being on port 22, you avoid the vast majority of indiscriminate, automated scanning and brute-force attempts.

Binadox Checklist:

  • Inventory all Azure ML compute instances with public IP addresses.
  • Define a standard high-numbered port for SSH access across your organization.
  • Update the associated Network Security Group to allow traffic on the new port from trusted sources.
  • Modify the SSH daemon configuration on the compute instance to listen on the new port.
  • Verify connectivity on the new port before removing the old port 22 rule.
  • Remove or explicitly deny the inbound rule for TCP port 22 in the NSG.
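The daemon configuration step in the checklist comes down to the Port directive in sshd_config. The sketch below rewrites that directive in a configuration held as text; the function name is hypothetical, and it assumes a conventional OpenSSH configuration. It replaces the first Port line (including the commented-out default), drops any further Port lines, and adds the directive at the top if none exists, which keeps it outside any Match block.

```python
import re

# Matches active or commented-out directives such as "Port 22" or "#Port 22".
PORT_LINE = re.compile(r"^\s*#?\s*Port\s+\d+\s*$", re.IGNORECASE)


def set_sshd_port(config_text: str, new_port: int) -> str:
    """Return sshd_config text with a single Port directive set to `new_port`."""
    if not 1 <= new_port <= 65535:
        raise ValueError(f"invalid TCP port: {new_port}")

    output = []
    replaced = False
    for line in config_text.splitlines():
        if PORT_LINE.match(line):
            if not replaced:
                output.append(f"Port {new_port}")
                replaced = True
            continue  # drop any additional Port lines
        output.append(line)

    if not replaced:
        output.insert(0, f"Port {new_port}")
    return "\n".join(output) + "\n"
```

The function only transforms text. Writing the result back, validating it (for example with sshd -t), and restarting the service remain separate steps that should happen only after the NSG rule for the new port is in place.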

Binadox KPIs to Track:

  • Percentage of ML compute instances compliant with the non-standard port policy.
  • Mean Time to Remediate (MTTR) for instances flagged with an open port 22.
  • Reduction in failed SSH login attempts recorded in system logs.
  • Number of security alerts related to brute-force attempts on management ports.
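The first two KPIs can be computed from simple records. The sketch below assumes a hypothetical data shape: each instance record carries a boolean port_22_open flag, and each finding carries detected_at and (once fixed) remediated_at timestamps. Real inventories and alert exports will use different field names.

```python
from datetime import datetime, timedelta
from typing import Optional


def compliance_percentage(instances: list) -> float:
    """Share of instances (0-100) that do not expose port 22."""
    if not instances:
        return 100.0  # nothing to violate the policy
    compliant = sum(1 for inst in instances if not inst["port_22_open"])
    return 100.0 * compliant / len(instances)


def mean_time_to_remediate(findings: list) -> Optional[timedelta]:
    """Average of (remediated_at - detected_at) over remediated findings only."""
    durations = [
        f["remediated_at"] - f["detected_at"]
        for f in findings
        if f.get("remediated_at") is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Open findings are excluded from the MTTR average in this sketch, so a backlog of unremediated instances should be tracked separately to keep the metric from looking better than reality.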

Binadox Common Pitfalls:

  • Changing the SSH port on the instance before updating the NSG, leading to a loss of access.
  • Failing to update CI/CD pipelines and developer configuration files, causing broken workflows.
  • Setting the NSG source to "Any" on the new port, which reduces but does not eliminate risk.
  • Treating a non-standard port as a complete security solution instead of one layer in a defense-in-depth strategy.

Conclusion

Securing remote access to Azure Machine Learning compute instances is a non-negotiable aspect of cloud governance. Moving SSH away from the default port 22 is a foundational hardening step that takes high-value compute resources out of the path of most automated scanning and brute-force traffic.

This simple change reduces security noise, minimizes the risk of costly compromises, and demonstrates a commitment to proactive security. For a truly robust posture, this practice should be combined with other Azure-native controls like restrictive NSG rules, Azure Bastion, and Just-In-Time access to build a layered defense that protects your critical ML workloads and preserves your cloud budget.