Mastering AKS Troubleshooting #1: Resolving Connectivity and DNS Failures


Introduction


Azure Kubernetes Service (AKS) is a fully managed Kubernetes container orchestration service that enables you to deploy, scale, and manage containerized applications easily. However, even with the most robust systems, issues can arise that require troubleshooting.


 


This blog post marks the beginning of a three-part series that originated from an intensive one-day bootcamp focused on advanced AKS networking triage and troubleshooting scenarios. It offers a practical approach to diagnosing and resolving common AKS networking issues, aiming to equip readers with quick troubleshooting skills for their AKS environment.


 


Each post walks through a set of scenarios that simulate typical issues. Detailed setup instructions are provided to build a functional environment. Faults are then introduced that cause the setup to malfunction. Hints are provided on how to triage and troubleshoot these issues using common tools such as kubectl, nslookup, and tcpdump. Each scenario concludes with fixes for the issues faced and an explanation of the steps taken to resolve the problem.


 


Prerequisites


Before setting up AKS, ensure that you have an Azure account and subscription with permissions that allow you to create resource groups and deploy AKS clusters. PowerShell needs to be available, as PS scripts will be used. Follow the instructions provided in this GitHub link to set up AKS and run the scenarios. It is also recommended that you read up on troubleshooting inbound and outbound networking scenarios that may arise in your AKS environment.


 


For inbound scenarios, troubleshooting covers connectivity issues for applications hosted on the AKS cluster. The linked guide describes issues related to firewall rules, network security groups, and load balancers, and provides guidance on verifying network connectivity, checking application logs, and examining network traffic to identify potential bottlenecks.


 


For outbound access, troubleshooting scenarios relate to traffic leaving the AKS cluster, such as connectivity issues to external resources like databases, APIs, or other services hosted outside of the AKS cluster.


 


The figure below shows the AKS environment, which uses a custom VNet with its own NSG attached to the custom subnet. The AKS setup uses the custom subnet and has its own NSG created and attached to the Network Interface of the node pool. Any changes to AKS networking are automatically added to the AKS NSG. However, to apply AKS NSG changes to the custom subnet NSG, they must be explicitly added.


 


[Figure: AKS environment with custom VNet, custom subnet NSG, and node pool NSG]
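To compare the two NSGs from the CLI, a lookup like the sketch below can help (the resource group and NSG names are placeholders, not values from the lab):

# List NSGs in the subscription with their resource groups
az network nsg list --query "[].{Name:name, ResourceGroup:resourceGroup}" --output table

# Inspect the rules on a specific NSG (e.g., the custom subnet NSG or the node pool NIC NSG)
az network nsg show --resource-group <rg-name> --name <nsg-name> --query "securityRules[].{Name:name, Access:access, Port:destinationPortRange}" --output table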


 


Scenario 1: Connectivity resolution between pods or services in same cluster


Objective: The goal of this exercise is to troubleshoot and resolve connectivity between pods and services within the same Kubernetes cluster.


Layout: AKS cluster layout with two Pods created by their respective deployments and exposed using ClusterIP Services.


[Figure: Two nginx deployments, each exposed by a ClusterIP Service]


 


Step 1: Set up the environment



  1. Set up AKS as outlined in this script.

  2. Create namespace student and set context to this namespace


kubectl create ns student
kubectl config set-context --current --namespace=student

# Verify current namespace
kubectl config view --minify --output 'jsonpath={..namespace}'


  3. Clone the solutions from the GitHub link and change directory to Lab1, i.e., cd Lab1.


 


Step 2: Create two deployments and respective services



  1. Create a deployment nginx-1 with a simple nginx image:


kubectl create deployment nginx-1 --image=nginx


  2. Expose the deployment as a ClusterIP service:


kubectl expose deployment nginx-1 --name nginx-1-svc --port=80 --target-port=80 --type=ClusterIP


  3. Repeat the above steps to create the nginx-2 deployment and a service:


kubectl create deployment nginx-2 --image=nginx
kubectl expose deployment nginx-2 --name nginx-2-svc --port=80 --target-port=80 --type=ClusterIP

Confirm the deployments and services are functional. Pods should be running and services listening on port 80.


kubectl get all

 


Step 3: Verify that you can access both services from within the cluster by using Cluster IP addresses


# Services returned: nginx-1-svc for pod/nginx-1, nginx-2-svc for pod/nginx-2
kubectl get svc

# Get the values of <nginx-1-pod> and <nginx-2-pod>
kubectl get pods

# below should present HTML page from nginx-2
kubectl exec -it <nginx-1-pod> -- curl nginx-2-svc:80

# below should present HTML page from nginx-1
kubectl exec -it <nginx-2-pod> -- curl nginx-1-svc:80

# check endpoints for the services
kubectl get ep
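Expected output resembles the following; the endpoint IPs and ages will differ in your cluster:

NAME          ENDPOINTS        AGE
nginx-1-svc   10.244.1.4:80    5m
nginx-2-svc   10.244.2.7:80    4m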

 


Step 4: Backup existing deployments



  1. Back up the nginx-2 deployment:


kubectl get deployment.apps/nginx-2 -o yaml > nginx-2-dep.yaml


  2. Back up the nginx-2-svc service:


kubectl get service/nginx-2-svc -o yaml > nginx-2-svc.yaml

 


Step 5: Simulate service down



  1. Delete nginx-2 deployment


kubectl delete -f nginx-2-dep.yaml


  2. Apply the broken.yaml deployment file found in the Lab1 folder


kubectl apply -f broken.yaml


  3. Confirm all pods are running


kubectl get all

 


Step 6: Troubleshoot the issue


Below is the inbound flow. Confirm every step from top down.


[Figure: Inbound traffic flow diagram]


 



  1. Check the health of the nodes in the cluster to see if there is a node issue


kubectl get nodes


  2. Verify that you can no longer access nginx-2-svc from within the cluster


kubectl exec -it <nginx-1-pod> -- curl nginx-2-svc:80
# msg Failed to connect to nginx-2-svc port 80: Connection refused


  3. Verify that you can access nginx-1-svc from within the cluster


kubectl exec -it <nginx-2-pod> -- curl nginx-1-svc:80
# displays HTML page


  4. Verify that you can access nginx-2 locally from within its pod. This confirms there is no issue with the nginx-2 application.


kubectl exec -it <nginx-2-pod> -- curl localhost:80
# displays HTML page


  5. Check the Endpoints using the below command and verify that the right Endpoints line up with their Services. There should be at least one Pod associated with each service; none exist for the nginx-2 service, even though its pods are running fine.


 kubectl get ep

[Figure: kubectl get ep output showing no endpoints for nginx-2-svc]


 



  6. Check the label selector used by the Service experiencing the issue, using the below command:


kubectl describe service <service-name>

Ensure that it matches the label selector used by its corresponding Deployment using describe command:


kubectl describe deployment <deployment-name>

Use ‘k get svc’ and ‘k get deployment’ to get service and deployment names.


Do you notice any discrepancies?


 



  7. Using the Service label selector from the previous step, check that the Pods selected by the Service match the Pods created by the Deployment, using the following command


kubectl get pods --selector=<label>

If no results are returned, then there must be a label selector mismatch.
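For example, in this lab the broken deployment labels its pods app=nginx-02 while the service still selects app=nginx-2, so:

kubectl get pods --selector=app=nginx-02   # deployment's label: returns the running pods
kubectl get pods --selector=app=nginx-2    # service's selector: returns nothing, confirming the mismatch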


In the figure below, the selector used by the deployment returns pods, but the selector used by the corresponding service does not.


[Figure: Deployment selector returning pods while the service selector returns none]


 



  8. Check service and pod logs and ensure HTTP traffic is seen. Compare the nginx-1 pod and service logs with nginx-2. The latter does not show GET requests, suggesting no incoming traffic.


k logs pod/<nginx-2-pod> # no incoming traffic
k logs pod/<nginx-1-pod> # HTTP traffic as seen below

k logs svc/nginx-2-svc
k logs svc/nginx-1-svc

[Figure: nginx-1 logs showing HTTP GET requests; nginx-2 logs empty]


 


Step 7: Restore connectivity



  1. Check the label selector the Service is associated with and get associated pods:


# Get label
kubectl describe service nginx-2-svc

# Attempting to obtain pods using the service label returns "no resources found"
kubectl describe pods -l app=nginx-2


  2. Update the deployment and apply the changes.


kubectl delete -f nginx-2-dep.yaml

In broken.yaml, update the labels ‘app: nginx-02’ to ‘app: nginx-2’, as shown below.


[Figure: broken.yaml with labels corrected from nginx-02 to nginx-2]
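The corrected section of the deployment looks roughly like this (a sketch; the layout of the actual file may differ):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-2
spec:
  selector:
    matchLabels:
      app: nginx-2        # was app: nginx-02
  template:
    metadata:
      labels:
        app: nginx-2      # was app: nginx-02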


kubectl apply -f broken.yaml # or apply the backup nginx-2-dep.yaml

k describe pod
k get ep # nginx-2-svc should now have endpoints, unlike before


  3. Verify that you can now access the newly created service from within the cluster:


# Should return HTML page from nginx-2-svc
kubectl exec -it <nginx-1-pod> -- curl nginx-2-svc:80

# Confirm above from logs
k logs pod/<nginx-2-pod>

 


Step 8: Using Custom Domain Names


Currently, Services in your namespace ‘student’ will resolve using <service-name>.<namespace>.svc.cluster.local.

The below command should return the web page.


k exec -it <nginx-1-pod> -- curl nginx-2-svc.student.svc.cluster.local

 



  1. Apply broken2.yaml from the Lab1 folder and restart CoreDNS


kubectl apply -f broken2.yaml
kubectl delete pods -l=k8s-app=kube-dns -n kube-system

# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system


  2. Validate DNS resolution; it should now fail with ‘curl: (6) Could not resolve host’


k exec -it <nginx-1-pod> -- curl nginx-2-svc.student.svc.cluster.local
k exec -it <nginx-1-pod> -- curl nginx-2-svc


  3. Check the DNS configuration in kube-system, which shows the ConfigMaps, as below.


k get cm -n kube-system | grep dns


  4. Describe each of the ones found above and look for inconsistencies


k describe cm coredns -n kube-system
k describe cm coredns-autoscaler -n kube-system
k describe cm coredns-custom -n kube-system


  5. Since the custom DNS file holds the breaking changes, either edit coredns-custom and remove the data section, OR delete the ConfigMap ‘coredns-custom’. Deleting the kube-dns pods should re-create the deleted ConfigMap ‘coredns-custom’.


kubectl delete cm coredns-custom -n kube-system
kubectl delete pods -l=k8s-app=kube-dns -n kube-system

# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system


  6. Confirm DNS resolution now works as before.


kubectl exec -it <nginx-1-pod> -- curl nginx-2-svc.student.svc.cluster.local


# Challenge lab: Resolve using FQDN aks.com #


# Run below command to get successful DNS resolution
k exec -it <nginx-1-pod> -- curl nginx-2-svc.aks.com

# Solution #
k apply -f working2.yaml
kubectl delete pods -l=k8s-app=kube-dns -n kube-system

# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system

# Confirm working using below cmd
k exec -it <nginx-1-pod> -- curl nginx-2-svc.aks.com

# Bring back to default
k delete cm coredns-custom -n kube-system
kubectl delete pods -l=k8s-app=kube-dns -n kube-system

# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system
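For reference, working2.yaml presumably contains a coredns-custom rewrite along these lines, mirroring the broken2.yaml pattern shown in Step 9 (a sketch, not the actual file):

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  aks-com.override: | # hypothetical key name
    rewrite stop {
      name regex (.*).aks.com {1}.student.svc.cluster.local.
      answer name (.*).student.svc.cluster.local {1}.aks.com.
    }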

 


Step 9: What was in the broken files


In broken.yaml, the deployment labels didn’t match up with the service selector, i.e., they should have been nginx-2; a reconstruction is sketched below.


[Figure: Label mismatch in broken.yaml]
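Reconstructed, the offending labels in broken.yaml looked roughly like this:

spec:
  selector:
    matchLabels:
      app: nginx-02   # the service selector expects app: nginx-2
  template:
    metadata:
      labels:
        app: nginx-02 # the service selector expects app: nginx-2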


 


In broken2.yaml, breaking changes rewrote queries for ‘student.svc.cluster.local’ to ‘bad.cluster.local’, which broke DNS resolution.


$kubectl_apply=@"
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  internal-custom.override: | # any name with the .override extension
    rewrite stop {
      name regex (.*).svc.cluster.local {1}.bad.cluster.local.
      answer name (.*).bad.cluster.local {1}.svc.cluster.local.
    }
"@
$kubectl_apply | kubectl apply -f -

 


Step 10: Cleanup


k delete deployment/nginx-1 deployment/nginx-2 service/nginx-1-svc service/nginx-2-svc

# or just delete the namespace
k delete ns student

 


 


Scenario 2: DNS and External access failure resolution


Objective: The goal of this exercise is to troubleshoot and resolve Pod DNS lookups and DNS resolution failures.


Layout: The cluster layout shown below has an NSG applied to the AKS subnet, with Network Policies in effect.


[Figure: AKS cluster with NSG on the subnet and Network Policies in effect]


 


Step 1: Set up the environment



  1. Set up AKS as outlined in this script.

  2. Create and switch to the newly created namespace


kubectl create ns student
kubectl config set-context --current --namespace=student

# Verify current namespace
kubectl config view --minify --output 'jsonpath={..namespace}'


  3. Clone the solutions from the GitHub link and change directory to Lab2, i.e., cd Lab2.


 


Step 2: Verify DNS Resolution works within cluster



  1. Create a pod to validate DNS from within the cluster


kubectl run dns-pod --image=nginx --port=80 --restart=Never
kubectl exec -it dns-pod -- bash

# Run these commands at the bash prompt
apt-get update -y
apt-get install dnsutils -y
exit


  2. Test and confirm that DNS resolution returns the correct IP address.


kubectl exec -it dns-pod -- nslookup kubernetes.default.svc.cluster.local
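A healthy response looks roughly like the following (10.0.0.10 is the default AKS cluster DNS service address; yours may differ):

Server:    10.0.0.10
Address:   10.0.0.10#53

Name:      kubernetes.default.svc.cluster.local
Address:   10.0.0.1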

 


Step 3: Break DNS resolution



  1. From the Lab2 folder, apply broken1.yaml


kubectl apply -f broken1.yaml


  2. Confirm that running the below command results in ‘connection timed out; no servers could be reached’


kubectl exec -it dns-pod -- nslookup kubernetes.default.svc.cluster.local

 


Step 4: Troubleshoot DNS Resolution Failures



  1. Verify DNS resolution works within the AKS cluster


kubectl exec -it dns-pod -- nslookup kubernetes.default.svc.cluster.local
# If the response is 'connection timed out; no servers could be reached', proceed with the troubleshooting below


  2. Validate the DNS service, which should show port 53 in use


kubectl get svc kube-dns -n kube-system


  3. Check logs for the pods associated with kube-dns


$coredns_pod=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o=jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $coredns_pod


  4. If a custom ConfigMap is present, verify that the configuration is correct.


kubectl describe cm coredns-custom -n kube-system


  5. Check for network policies currently in effect. If any are DNS-related, describe them and confirm they are not blockers. If a network policy is a blocker, remove it.


kubectl get networkpolicy -A
NAMESPACE     NAME              POD-SELECTOR            
kube-system   block-dns-ingress  k8s-app=kube-dns        

kubectl describe networkpolicy block-dns-ingress -n kube-system
# should show an Ingress rule that does not allow DNS traffic to UDP 53


  6. Remove the offending policy


kubectl delete networkpolicy block-dns-ingress -n kube-system


  7. Verify DNS resolution works within the AKS cluster. Below is another way to create a Pod that executes a task such as nslookup and is deleted on completion


kubectl run -it --rm --restart=Never test-dns --image=busybox --command -- nslookup kubernetes.default.svc.cluster.local
# If the DNS resolution is working correctly, you should see the correct IP address associated with the domain name


  8. Check whether the NSG has any DENY rules that might block port 80. If one exists, remove it


# The CLI steps below can also be performed as a lookup in the Azure portal under the NSG
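A sketch of that lookup with the az CLI (the resource group and NSG names are placeholders to substitute):

# List any Deny rules on the NSG that could affect port 80
az network nsg rule list --resource-group <rg-name> --nsg-name <nsg-name> `
--query "[?access=='Deny'].{Name:name, Priority:priority, Port:destinationPortRange}" --output table

# Remove an offending rule if one exists
az network nsg rule delete --resource-group <rg-name> --nsg-name <nsg-name> --name <rule-name>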

 


Step 5: Create external access via Loadbalancer



  1. Expose dns-pod with service type Load Balancer.


kubectl expose pod dns-pod --name=dns-svc --port=80 --target-port=80 --type LoadBalancer


  2. Confirm allocation of an External-IP.


kubectl get svc


  3. Confirm External-IP access works from within the cluster.


kubectl exec -it dns-pod -- curl <EXTERNAL-IP>


  4. Confirm from a browser that External-IP access fails from the internet to the cluster.


curl <EXTERNAL-IP>

 


Step 6: Troubleshoot broken external access via Loadbalancer



  1. Check if AKS NSG applied on the VM Scale Set has an Inbound HTTP Allow rule.

  2. Check if the AKS Custom NSG applied on the Subnet has an ALLOW rule; if none exists, apply one as below.


$custom_aks_nsg = "custom_aks_nsg" # <- verify
$nsg_list=az network nsg list --query "[?contains(name,'$custom_aks_nsg')].{Name:name, ResourceGroup:resourceGroup}" --output json

# Extract Custom AKS Subnet NSG name, NSG Resource Group
$nsg_name=$(echo $nsg_list | jq -r '.[].Name')

$resource_group=$(echo $nsg_list | jq -r '.[].ResourceGroup')
echo $nsg_list, $nsg_name, $resource_group

$EXTERNAL_IP="<EXTERNAL-IP>" # the service External-IP from kubectl get svc
az network nsg rule create --name AllowHTTPInbound `
--resource-group $resource_group --nsg-name $nsg_name `
--destination-port-ranges 80 --destination-address-prefixes $EXTERNAL_IP `
--source-address-prefixes Internet --protocol tcp `
--priority 100 --access allow
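Optionally, confirm the rule landed as expected, reusing the variables above:

az network nsg rule show --name AllowHTTPInbound `
--resource-group $resource_group --nsg-name $nsg_name `
--query "{Access:access, Priority:priority, Port:destinationPortRange}"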


  3. After ~60s, confirm from a browser that External-IP access succeeds from the internet to the cluster.


curl <EXTERNAL-IP>

 


Step 7: What was in the broken files


Broken1.yaml is a Network Policy that blocks UDP ingress requests on port 53; a reconstruction is sketched below.


[Figure: broken1.yaml network policy]
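A plausible reconstruction of broken1.yaml, based on the policy name and behavior observed in Step 4 (the actual file may differ):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-dns-ingress
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP     # only TCP 53 is allowed in, so standard DNS over UDP 53 is implicitly blocked
      port: 53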


 


Step 8: Cleanup


k delete pod/dns-pod

# or delete the namespace
k delete ns student

az network nsg rule delete --name AllowHTTPInbound `
--resource-group $resource_group --nsg-name $nsg_name

 


Conclusion


This post demonstrates common connectivity and DNS issues that can arise when working with AKS. The first scenario focuses on resolving connectivity problems between pods and services within the Kubernetes cluster. We encountered issues where the labels assigned by a deployment did not match the corresponding service selector, resulting in non-functional endpoints. Additionally, we identified and rectified issues with the CoreDNS configuration and custom domain names. The second scenario addresses troubleshooting DNS and external access failures. We explored how improperly configured network policies can negatively impact DNS traffic flow. In the next article, the second of this three-part series, we will delve into troubleshooting scenarios related to endpoint connectivity across virtual networks and tackle port configuration issues involving services and their corresponding pods.


 


Disclaimer


The sample scripts are not supported by any Microsoft standard support program or service. The sample scripts are provided AS IS without a warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.

Account structure activation performance enhancement 


Introduction  

Account structures in Dynamics 365 Finance use a main account and financial dimensions to create a set of rules that determine the order and allowed values when entering account numbers in transactions. Once an account structure is defined, it must be activated. Historically, the account structure activation process has been time-consuming, and it was difficult to view activation progress or errors with the new configuration. If an account structure configuration change caused an error, a user could not find the root error message on the account structure page, but rather needed to dig through batch job logs to understand the problem with the new account structure configuration.

Feature details  

In order to solve these problems, we have recently released an enhancement to the account structure activation process in application release 10.0.31. This performance enhancement lets you activate account structures more quickly by allowing multiple transaction updates to happen at the same time. An added benefit of this new feature enhancement is allowing the structure to be marked as active immediately after it is validated and before the remaining unposted transactions are updated to the new structure configuration. This allows transaction processing to continue while the existing unposted transactions are updated to the new structure.  

To view the status of the activation, select View activation status above the grid on the Account structures page. You can also view the activation status by selecting View on the Action Pane and then selecting Activation status on the drop-down menu. 

Enable the feature 

In order to use this new functionality, enable the feature “Account structure activation performance enhancement” from within feature management.  

Learn more

More information about this feature can be found at this location: Account structure activation performance enhancement – Finance | Dynamics 365 | Microsoft Learn 


How to use Azure OpenAI Playgrounds to experiment with Chatbots






1. Navigate to https://portal.azure.com/#home


2. Click “Azure OpenAI”



3. Click to open an existing Open AI Resource




4. Click “Go to Azure OpenAI Studio”.



5. Click here to open the Chat playground.




6. Click “Select a template”



7. Click “IRS tax chatbot”



8. Click “Continue”



9. Click the “User message” field.



10. Talk to the bot and ask it some questions.




Announcing the Public Preview of Code Optimizations


Code Optimizations: A New AI-Based Service for .NET Performance Optimization


We are thrilled to announce that Code Optimizations (previously known as Optimization Insights) is now available in public preview! This new AI-based service can identify performance issues and offer recommendations specifically tailored for .NET applications and cloud services.


 


What is Code Optimizations?


Code Optimizations is a service within Application Insights that continuously analyzes profiler traces from your application or cloud service and provides insights and recommendations on how to improve its performance.


 


Code Optimizations can help you identify and solve a wide range of performance issues, ranging from incorrect API usages and unnecessary allocations all the way to issues relating to exceptions and concurrency. It can also detect anomalies whenever your application or cloud service exhibits abnormal CPU or Memory behavior.


 


Code Optimizations page


 


Why should I use Code Optimizations?


Code Optimizations can help you optimize the performance of your .NET applications and cloud services by:



  • Saving you time and effort: Instead of manually sifting through gigabytes of profiler data or relying on trial-and-error methods, you can use Code Optimizations to automatically uncover complex performance bugs and get guidance on how to solve them.

  • Improving your user experience: By improving the speed and reliability of your application or cloud service, you can enhance your user satisfaction and retention rates. This can also help you gain a competitive edge over other apps or services in your market.

  • Saving you money: By fixing performance issues early and efficiently, you can reduce the need for scaling out cloud resources or paying for unnecessary compute power. This can help you avoid problems such as cloud sprawling or overspending on your Azure bill.


How does Code Optimizations work?


Code Optimizations relies on an AI model trained on thousands of traces collected from Microsoft-owned services around the globe. By learning from these traces, the model can glean patterns corresponding to various performance issues seen in .NET applications and learn from the expertise of performance engineers at Microsoft. This enables our AI model to pinpoint with accuracy a wide range of performance issues in your app and provide you with actionable recommendations on how to fix them.


 


Code Optimizations runs at no additional cost to you and is completely offline to the app. It has no impact on your app’s performance.


 


How can I use Code Optimizations?


If you are interested in trying out this new service for free during its public preview period, you can access it using the following steps:



  1. Sign up for Application Insights if you haven’t already. Application Insights is a powerful application performance monitoring (APM) tool that helps you monitor, diagnose, and troubleshoot your apps.

  2. Enable profiling for your .NET app or cloud service. Profiling collects detailed information about how your app executes at runtime.

  3. Navigate to the Application Insights Performance blade from the left navigation pane under Investigate and select Code Optimizations from the top menu.


 

Link to Code Optimizations from Application Insights: Performance


 


Click here for the documentation.


Click here for information on troubleshooting.


Fill out this quick survey if you have any additional issues or questions.


 

Automatically disrupt adversary-in-the-middle (AiTM) attacks with XDR


Microsoft has been on a journey to harness the power of artificial intelligence to help security teams scale more effectively. Microsoft 365 Defender correlates millions of signals across endpoints, identities, emails, collaboration tools, and SaaS apps to identify active attacks and compromised assets in an organization’s environment. Last year, we introduced automatic attack disruption, which uses these correlated insights and powerful AI models to stop some of the most sophisticated attack techniques while in progress to limit lateral movement and damage.  


 


Today, we are excited to announce the expansion of automatic attack disruption to include adversary-in-the-middle (AiTM) attacks, in addition to the previously announced public preview for business email compromise (BEC) and human-operated ransomware attacks.


 


AiTM attacks are widespread and can pose a major risk to organizations. We are observing a rising trend in the availability of adversary-in-the-middle (AiTM) phishing kits for purchase or rent, and our data shows that organizations have already been attacked in 2023.


 


During AiTM attacks (Figure 1), a phished user interacts with an impersonated site created by the attacker. This allows the attacker to intercept credentials and session cookies and bypass multifactor authentication (MFA), which can then be used to initiate other attacks such as BEC and credential harvesting. 


 


Automatic attack disruption does not require any pre-configuration by the SOC team. Instead, it’s built in as a capability in Microsoft’s XDR.


Figure 1. Example of an AiTM phishing campaign that led to a BEC attack


 


How Microsoft’s XDR automatically contains AiTM attacks


As with attack disruption for BEC and human-operated ransomware attacks, the goal is to contain the attack as early as possible while it is active in an organization’s environment and reduce its potential damage to the organization. AiTM attack disruption works as follows:


 



  1. High-confidence identification of an AiTM attack based on multiple, correlated Microsoft 365 Defender signals.

  2. Automatic response is triggered that disables the compromised user account in Active Directory and Azure Active Directory.

  3. The stolen session cookie will be automatically revoked, preventing the attacker from using it for additional malicious activity.


Figure 2. An example of a contained AiTM incident, with attack disruption tag


 


To ensure SOC teams have full control, they can configure automatic attack disruption and easily revert any action from the Microsoft 365 Defender portal. See our documentation for more details.


 


Get started



  1. Make sure your organization fulfills the Microsoft 365 Defender pre-requisites

  2. Connect Microsoft Defender for Cloud Apps to Microsoft 365.

  3. Deploy Defender for Endpoint. A free trial is available here.

  4. Deploy Microsoft Defender for Identity. You can start a free trial here.


Learn more