
Troubleshooting Guide for OpenShift Container Platform

Router/Registry not deploying to the correct node

Ensure that you created the registry with the --selector flag.

oadm registry --config=/etc/origin/master/admin.kubeconfig \
   --credentials=/etc/origin/master/openshift-registry.kubeconfig --replicas=2 \
   --images='registry.access.redhat.com/openshift3/ose-${component}:${version}' --selector='region=Infra'

Solution

To see whether you’ve specified a selector, run oc edit dc/docker-registry and search for nodeSelector. If this value is not defined, it is easiest to delete the entire registry and re-create it with the --selector flag shown above.
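A non-interactive way to check is to grep the deployment config; the commands below are a sketch, using the object names created by the default registry deployment above:

# No output from the grep means no node selector is set
oc get dc docker-registry -o yaml | grep -A 2 nodeSelector

# Remove the registry objects, then re-run the oadm registry command above with --selector
oc delete dc/docker-registry svc/docker-registry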

Registry not showing contents of NFS mount (persistent volume)

You may run into a situation with Persistent Volumes where the node which is hosting the registry shows that the mount is there:

192.168.200.99:/storage/vms/ose_nfs/stratus on /var/lib/origin/openshift.local.volumes/pods/66295ab3-e624-11e5-952e-0800273943e4/volumes/kubernetes.io~nfs/docker-registry-storage type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.200.60,local_lock=none,addr=192.168.200.99)

but upon investigating the pod itself, there is no content showing in /registry. Checking the PV and PVC doesn’t show any issues:

{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": {
    "name": "docker-registry-storage"
  },
  "spec": {
    "capacity": {
      "storage": "50Gi"
    },
    "accessModes": [ "ReadWriteOnce" ],
    "nfs": {
      "path": "/storage/vms/ose_nfs/stratus/",
      "server": "192.168.200.99"
    },
    "persistentVolumeReclaimPolicy": "Recycle"
  }
}
[root@master00 ~]# oc get pv
NAME                      LABELS    CAPACITY   ACCESSMODES   STATUS    CLAIM                          REASON    AGE
docker-registry-storage   <none>    50Gi       RWO           Bound     default/docker-storage-claim             39m
apiVersion: "v1"
kind: "PersistentVolumeClaim"
metadata:
  name: "docker-storage-claim"
spec:
  accessModes:
    - "ReadWriteOnce"
  resources:
    requests:
      storage: "50Gi"
[root@master00 ~]# oc get pvc
NAME                   LABELS    STATUS    VOLUME                    CAPACITY   ACCESSMODES   AGE
docker-storage-claim   <none>    Bound     docker-registry-storage   50Gi       RWO           38m

Solution

This is most likely an error made while editing the deployment config. When assigning the volume, the --name= option refers to the name of the existing volume to be overwritten. The registry's template mounts the registry-storage volume at /registry, so creating new claims and volumes under a different name will not have the desired effect. You need to overwrite the default registry-storage volume, which is an emptyDir, with the persistent volume.

WRONG COMMAND:

oc volume deploymentconfigs/docker-registry --add --name=docker-registry-storage -t pvc --claim-name=docker-storage-claim --overwrite

This overwrites (or creates) a volume named docker-registry-storage, which is not the volume mounted at /registry.

CORRECT COMMAND:

oc volume deploymentconfigs/docker-registry --add --name=registry-storage -t pvc --claim-name=docker-storage-claim --overwrite
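To confirm the change took effect, one quick check (a sketch) is to dump the deployment config and verify that the registry-storage volume now references the claim:

oc get dc docker-registry -o yaml | grep -A 3 registry-storage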

Hosts Can No Longer Resolve Each Other During Ansible Install

During the install, firewalld is disabled in favor of iptables. This transition may not open port 53 which is required for DNS lookups.

Solution

Use a DNS source external to the installer (at least during the installation phase) so that the installation does not fail due to host name lookup problems.

Failure to deploy registry (permissions issues)

Deploying a registry requires a user with cluster-admin privileges. The system:admin user should have these privileges; see the OSEv3.1 documentation on deploying a Docker registry.
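To verify your privileges before deploying, something like the following works; the --config path matches the admin kubeconfig used earlier in this guide, and <user> is a placeholder for a regular user you may want to grant cluster-admin to:

oc whoami

# Commands run with the admin kubeconfig act as system:admin
oc --config=/etc/origin/master/admin.kubeconfig whoami

# Only needed if a regular user must be able to deploy the registry
oadm --config=/etc/origin/master/admin.kubeconfig policy add-cluster-role-to-user cluster-admin <user>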

Application Pod fails to deploy

This can be caused by DNS issues. Turn on query logging on your DNS server. If using BIND:

rndc querylog

Then tail /var/log/messages. If you see entries like this:

lb00.ose.example.com.jarae.example.com
github.com.ose.example.com

The likely cause is that your DNS wildcard is redirecting these lookups, causing the build to fail.

Issues with Nodes

Nodes being reported as ready, but builds failing

When in a virtual environment, resuming the VMs doesn’t always put the atomic-openshift-node.service into a clean state. Try restarting the service on all nodes and ensure they start without any failures.
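One quick way to do this from the installer host is an Ansible ad-hoc run; this is only a sketch and assumes the inventory group is named nodes:

# Restart the node service on every node, then confirm each comes up cleanly
ansible nodes -m service -a "name=atomic-openshift-node state=restarted"
ansible nodes -m shell -a "systemctl is-active atomic-openshift-node"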

Node reporting NotReady

In the event that you run

oc get nodes

and the result looks like the following:

[root@master02 ~]# oc get nodes
NAME                              LABELS                                                   STATUS                        AGE
app00.ose.example.com      kubernetes.io/hostname=app00.ose.example.com      Ready                         22h
app01.ose.example.com      kubernetes.io/hostname=app01.ose.example.com      Ready                         22h
infra00.ose.example.com    kubernetes.io/hostname=infra00.ose.example.com    Ready                         22h
infra01.ose.example.com    kubernetes.io/hostname=infra01.ose.example.com    Ready                         22h
master00.ose.example.com   kubernetes.io/hostname=master00.ose.example.com   Ready,SchedulingDisabled      22h
master01.ose.example.com   kubernetes.io/hostname=master01.ose.example.com   Ready,SchedulingDisabled      22h
master02.ose.example.com   kubernetes.io/hostname=master02.ose.example.com   NotReady,SchedulingDisabled   22h

Start by investigating the atomic-openshift-node service:

[root@master02 ~]# systemctl status atomic-openshift-node
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: failed (Result: start-limit) since Thu 2016-02-25 07:50:00 CST; 44min ago
     Docs: https://github.com/openshift/origin
  Process: 2407 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
 Main PID: 2407 (code=exited, status=255)

Feb 25 07:49:59 master02 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 25 07:49:59 master02 systemd[1]: Failed to start Atomic OpenShift Node.
Feb 25 07:49:59 master02 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Feb 25 07:49:59 master02 systemd[1]: atomic-openshift-node.service failed.
Feb 25 07:50:00 master02 systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.

In some cases the service will come back on its own, because systemd will reschedule a restart:

[root@master02 ~]# systemctl status atomic-openshift-node -l
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Thu 2016-02-25 08:37:31 CST; 1min 16s ago
     Docs: https://github.com/openshift/origin
 Main PID: 2412 (openshift)
   CGroup: /system.slice/atomic-openshift-node.service
           └─2412 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2

Feb 25 08:37:31 master02 atomic-openshift-node[2412]: E0225 08:37:31.938263    2412 proxier.go:218] Error flushing userspace chain: error flushing chain "KUBE-NODEPORT-CONTAINER": exit status 1: iptables: No chain/target/match by that name.
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.938540    2412 node.go:256] Started Kubernetes Proxy on 0.0.0.0
Feb 25 08:37:31 master02 systemd[1]: Started Atomic OpenShift Node.
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.956248    2412 proxier.go:352] Setting endpoints for "default/kubernetes:dns-tcp" to [192.168.200.50:53 192.168.200.51:53 192.168.200.52:53]
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.956397    2412 proxier.go:352] Setting endpoints for "default/kubernetes:dns" to [192.168.200.50:53 192.168.200.51:53 192.168.200.52:53]
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.956434    2412 proxier.go:352] Setting endpoints for "default/kubernetes:https" to [192.168.200.50:8443 192.168.200.51:8443 192.168.200.52:8443]
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.956476    2412 proxier.go:429] Not syncing iptables until Services and Endpoints have been received from master
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.965155    2412 proxier.go:294] Adding new service "default/kubernetes:https" at 172.50.0.1:443/TCP
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.965358    2412 proxier.go:294] Adding new service "default/kubernetes:dns" at 172.50.0.1:53/UDP
Feb 25 08:37:31 master02 atomic-openshift-node[2412]: I0225 08:37:31.965450    2412 proxier.go:294] Adding new service "default/kubernetes:dns-tcp" at 172.50.0.1:53/TCP
[root@master02 ~]# oc get nodes
NAME                              LABELS                                                   STATUS                     AGE
app00.ose.example.com      kubernetes.io/hostname=app00.ose.example.com      Ready                      22h
app01.ose.example.com      kubernetes.io/hostname=app01.ose.example.com      Ready                      22h
infra00.ose.example.com    kubernetes.io/hostname=infra00.ose.example.com    Ready                      22h
infra01.ose.example.com    kubernetes.io/hostname=infra01.ose.example.com    Ready                      22h
master00.ose.example.com   kubernetes.io/hostname=master00.ose.example.com   Ready,SchedulingDisabled   22h
master01.ose.example.com   kubernetes.io/hostname=master01.ose.example.com   Ready,SchedulingDisabled   22h
master02.ose.example.com   kubernetes.io/hostname=master02.ose.example.com   Ready,SchedulingDisabled   22h

/var/log/messages can sometimes shed additional light if the problem is not resolved by restarting the atomic-openshift-node service.

Nodes report ready but ETCD health check fails

[root@master02 ~]#  etcdctl -C https://master00.ose.example.com:2379,https://master01.ose.example.com:2379,https://master02.ose.example.com:2379 --ca-file=/etc/origin/master/master.etcd-ca.crt     --cert-file=/etc/origin/master/master.etcd-client.crt     --key-file=/etc/origin/master/master.etcd-client.key cluster-health
member e0e2c123213680f is healthy: got healthy result from https://192.168.200.50:2379
member 64f1077d838e039c is healthy: got healthy result from https://192.168.200.51:2379
member a9e031ea9ce2a521 is unhealthy: got unhealthy result from https://192.168.200.52:2379

If the health check fails, check the status of etcd on the masters. You could see one or a combination of the following:

[root@master01 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-02-25 08:43:37 CST; 4h 32min ago
 Main PID: 1103 (etcd)
   CGroup: /system.slice/etcd.service
           └─1103 /usr/bin/etcd --name=master01.ose.example.com --data-dir=/var/lib/etcd/ --lis...

Feb 25 11:32:52 master01 etcd[1103]: got unexpected response error (etcdserver: request timed out)
Feb 25 11:32:52 master01 etcd[1103]: got unexpected response error (etcdserver: request timed out)
Feb 25 11:33:02 master01 etcd[1103]: got unexpected response error (etcdserver: request timed out)
Feb 25 11:33:02 master01 etcd[1103]: got unexpected response error (etcdserver: request timed out)
Feb 25 11:33:12 master01 etcd[1103]: got unexpected response error (etcdserver: request timed out)
Feb 25 11:33:12 master01 etcd[1103]: got unexpected response error (etcdserver: request timed out)
[root@master00 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-02-25 08:43:55 CST; 4h 32min ago
 Main PID: 1097 (etcd)
   CGroup: /system.slice/etcd.service
           └─1097 /usr/bin/etcd --name=master00.ose.example.com --data-dir=/var/lib/etcd/ --lis...

Feb 25 11:40:25 master00 etcd[1097]: the connection to peer a9e031ea9ce2a521 is unhealthy
Feb 25 11:40:55 master00 etcd[1097]: the connection to peer a9e031ea9ce2a521 is unhealthy
Feb 25 11:41:25 master00 etcd[1097]: the connection to peer a9e031ea9ce2a521 is unhealthy
Feb 25 11:41:55 master00 etcd[1097]: the connection to peer a9e031ea9ce2a521 is unhealthy
Feb 25 11:42:25 master00 etcd[1097]: the connection to peer a9e031ea9ce2a521 is unhealthy

Solution

In most cases, restarting etcd one at a time on each etcd host resolves the issue:

systemctl restart etcd

Atomic-openshift-node service fails to start

The installer fails with:

TASK: [openshift_node Start and enable node] ********************************
failed: [app00.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

Upon investigation, the node’s status shows the following message:

[root@app00 ~]# systemctl status atomic-openshift-node
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: failed (Result: start-limit) since Tue 2016-03-08 09:28:55 EST; 31s ago
     Docs: https://github.com/openshift/origin
  Process: 20182 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
 Main PID: 20182 (code=exited, status=255)

Mar 08 09:28:55 app00.ose.example.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Mar 08 09:28:55 app00.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Mar 08 09:28:55 app00.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 08 09:28:55 app00.ose.example.com systemd[1]: atomic-openshift-node.service failed.
Mar 08 09:28:55 app00.ose.example.com systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Mar 08 09:28:55 app00.ose.example.com systemd[1]: start request repeated too quickly for atomic-openshift-node.service
Mar 08 09:28:55 app00.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Mar 08 09:28:55 app00.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Mar 08 09:28:55 app00.ose.example.com systemd[1]: atomic-openshift-node.service failed.
Mar 08 09:29:22 app00.ose.example.com systemd[1]: Stopped Atomic OpenShift Node.

/var/log/messages has the following messages:

Unable to connect to the server: x509: certificate signed by unknown authority

Solution

The problem is most likely that the certificates and keys in /etc/origin/node are corrupt or missing. Copy the files from a host that did succeed.
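A possible sequence, assuming app01 is a node that installed cleanly (node-config.yaml is host-specific, so it is excluded):

# Back up the existing directory first
cp -a /etc/origin/node /etc/origin/node.bak

# Copy certificates and kubeconfig files from the healthy node
rsync -av --exclude 'node-config.yaml' app01.ose.example.com:/etc/origin/node/ /etc/origin/node/

systemctl restart atomic-openshift-node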

Registry issues

OpenShift builds fail trying to push image using a wrong IP address for the registry

While attempting to deploy an application in OpenShift, you see the following error in the event logs:

I0309 17:55:25.743584       1 sti.go:218] Pushing 172.50.115.185:5000/ex2/django-example:latest image ...
I0309 17:59:41.829972       1 sti.go:234] Failed to push 172.50.115.185:5000/ex2/django-example:latest
 
The build will ultimately fail to deploy due to these errors. Checking the services indicates that the registry IP is actually 172.50.225.185:
[root@master00 ~]# oc get service
NAME              CLUSTER_IP       EXTERNAL_IP   PORT(S)                 SELECTOR                  AGE
docker-registry   172.50.225.185   <none>        5000/TCP                docker-registry=default   18h
kubernetes        172.50.0.1       <none>        443/TCP,53/UDP,53/TCP   <none>                    23h
router            172.50.49.239    <none>        80/TCP                  router=router             20h

Solution

This can be caused during the setup of the registry. If a change triggers a re-IP of the docker-registry service (such as an undeploy/redeploy), the old registry IP may be "stuck" in the configuration. When you recreate the service associated with the internal registry, it receives a new IP address, but the OpenShift masters do not automatically detect that change. Usually restarting the atomic-openshift-master-api service fixes the problem:

systemctl restart atomic-openshift-master-api

OpenShift build error: failed to push image while using NFS persistent storage

During the deployment of an application you see

Build error: Failed to push image. Response from registry is: Received unexpected HTTP status: 500 Internal Server Error

Check the host where the registry pod is supposed to be deployed:

[root@master00 ~]# oc get pods --all-namespaces
NAMESPACE   NAME                      READY     STATUS      RESTARTS   AGE
default     docker-registry-2-n8d21   1/1       Running     0          14h
default     docker-registry-2-rlqzt   1/1       Running     1          15h
default     router-1-47xfi            1/1       Running     2          15h
default     router-1-vuw38            1/1       Running     2          15h

[root@master00 ~]# oc describe pod docker-registry-2-n8d21
Name:                docker-registry-2-n8d21
Namespace:            default
Image(s):            registry.access.redhat.com/openshift3/ose-docker-registry:v3.1.1.6
Node:                infra01.ose.example.com/192.168.200.61

Check to see if the mount point exists on the node (in this case infra01)

[root@infra01 ~]# mount | grep origin
(rw,relatime,rootcontext="system_u:object_r:svirt_sandbox_file_t:s0:c0,c1",seclabel)
192.168.200.99:/storage/vms/ose_nfs/stratus on /var/lib/origin/openshift.local.volumes/pods/06f81440-e64b-11e5-9d5e-0800270462ed/volumes/kubernetes.io~nfs/docker-registry-storage type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.200.61,local_lock=none,addr=192.168.200.99)

If the volume is mounted on the registry host, the problem is not a firewall or NFS server configuration issue.

Solution

It is likely that SELinux is blocking access to NFS from within the docker container. Check that the proper boolean is set:

[root@infra01 ~]# getsebool virt_use_nfs
virt_use_nfs --> off

Set this boolean to on across any node that will host pods which may require NFS access (such as databases, registries, etc.):

setsebool -P virt_use_nfs=true
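To apply this on every node at once, a simple loop over the hosts works; the hostnames below are the ones used in this environment and should be adjusted to match yours:

for n in infra00 infra01 app00 app01; do
    ssh ${n}.ose.example.com "setsebool -P virt_use_nfs=true && getsebool virt_use_nfs"
done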

Failure to push image to OpenShift’s Registry when backed by shared storage

When attempting to do an S2I build over 200 MB, the build fails with either:

Response from registry is: digest invalid: provided digest did not match uploaded content

or

Response from registry is: blob upload invalid

Examining the logs of the registry shows something similar to:

[root@master00 ~]# oc logs docker-registry-2-n8d21

time="2016-03-10T09:00:56.671348073-05:00" level=error msg="response completed with error" err.code="BLOB_UPLOAD_INVALID" err.detail="Invalid token" err.message="blob upload invalid" go.version=go1.4.2 http.request.host="172.50.225.185:5000" http.request.id=e4066c94-950d-4306-89de-57a1ac573f72 http.request.method=PUT http.request.remoteaddr="10.5.0.1:34874" http.request.uri="/v2/ex3/tm/blobs/uploads/11158157-1eb4-4ba6-9327-9e01a8cbc103?_state=HXJVBhFZdeHo5zeLrzyKKMGb7NPxCQq-Fawt-zNaYBB7Ik5hbWUiOiJleDMvdG0iLCJVVUlEIjoiMTExNTgxNTctMWViNC00YmE2LTkzMjctOWUwMWE4Y2JjMTAzIiwiT2Zmc2V0Ijo1MjA2MjIwODAsIlN0YXJ0ZWRBdCI6IjIwMTYtMDMtMTBUMTM6NTc6NDNaIn0%3D&digest=sha256%3Ab30d0a02a4a259346c94eca8c6150b48a2132cf6821332e3196f2cfe0316d42b" http.request.useragent="docker/1.8.2-el7 go/go1.4.2 kernel/3.10.0-327.10.1.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=180.043287ms http.response.status=404 http.response.written=88 instance.id=79ab5634-8822-4e05-95b7-f13c42fee017 vars.name="ex3/tm" vars.uuid=11158157-1eb4-4ba6-9327-9e01a8cbc103

10.5.0.1 - - [10/Mar/2016:09:00:56 -0500] "PUT /v2/ex3/tm/blobs/uploads/11158157-1eb4-4ba6-9327-9e01a8cbc103?_state=HXJVBhFZdeHo5zeLrzyKKMGb7NPxCQq-Fawt-zNaYBB7Ik5hbWUiOiJleDMvdG0iLCJVVUlEIjoiMTExNTgxNTctMWViNC00YmE2LTkzMjctOWUwMWE4Y2JjMTAzIiwiT2Zmc2V0Ijo1MjA2MjIwODAsIlN0YXJ0ZWRBdCI6IjIwMTYtMDMtMTBUMTM6NTc6NDNaIn0%3D&digest=sha256%3Ab30d0a02a4a259346c94eca8c6150b48a2132cf6821332e3196f2cfe0316d42b HTTP/1.1" 404 88 "" "docker/1.8.2-el7 go/go1.4.2 kernel/3.10.0-327.10.1.el7.x86_64 os/linux arch/amd64"

Solution

There is a Red Hat Bugzilla report describing that the solution is to add no_wdelay to the nfs export options:

(rw,sync,root_squash,no_wdelay)
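On the NFS server the export might look like the following; this is a sketch using the export path from this environment, and the client specification (*) is illustrative:

# /etc/exports
/storage/vms/ose_nfs/stratus *(rw,sync,root_squash,no_wdelay)

# Re-read the exports
exportfs -r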

OR add session affinity to the registry service:

oc get -o yaml service docker-registry | \
      sed 's/\(sessionAffinity:\s*\).*/\1ClientIP/' | \
      oc replace -f -

Restart the NFS server and restart the S2I build.

Quotas and Limitranges

Must make a non-zero request for cpu

After creating a quota for CPU usage inside of a project you receive the following error:

failed to create build pod: Pod "nodejs-example-2-build" is forbidden: must make a non-zero request for cpu since it is tracked by quota.

Solution:

There must be a corresponding CPU amount defined in your project's limits to go along with the defined quota. See the OpenShift documentation on quotas and limits for more information.
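A minimal LimitRange sketch that supplies default CPU requests and limits for containers; the name and values are illustrative and should be tuned to fit your quota:

apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "cpu-defaults"
spec:
  limits:
    - type: "Container"
      defaultRequest:
        cpu: "100m"
      default:
        cpu: "500m"

Create it in the affected project with oc create -f cpu-defaults.yaml -n <project>.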

I have enough RAM for another pod but it won’t build

Quotas can prevent a build if the build pod plus the new pod would exceed the quota. See the OpenShift documentation on quotas for more information.

Installation Fails…

Job for atomic-openshift-master-api.service failed

failed: [master00.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-master-api.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master-api.service" and "journalctl -xe" for details.


FATAL: all hosts have already failed -- aborting

Solution:

The exact cause of this is unknown at this time. You can try to log into each master and check the status and journald entries mentioned in the error, though this often proves less than fruitful. A potential workaround has been to log into the first master, start the service manually, and restart the Ansible installer from the beginning. This has been known to allow the installer to continue.
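A sketch of that workaround; the playbook path may differ depending on version and install method:

# On the first master
systemctl start atomic-openshift-master-api
systemctl status atomic-openshift-master-api

# Then re-run the installer from the beginning
ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml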

Web Console Public URL on a different Port

If you wish to move the public URL off of port 8443, either because of a port conflict or for another reason, you need to edit master-config.yaml.

Solution

Edit the master-config.yaml on each master and replace the following lines with the appropriate values:

  publicURL: https://ose.example.com:8443/console/
  assetPublicURL: https://ose.example.com:8443/console/

Then restart the atomic-openshift-master-api service on each master.

UI Redirecting to the URL of the masters instead of the LB

The main cause for this seems to be the installer failing to honor the openshift_master_cluster_public_hostname option in the OSEv3:vars section. This results in the master-config.yaml file having publicURL set to the master’s FQDN, e.g. publicURL: master00.example.com. When the LB selects a master to pass the request to, OSE uses this value and substitutes the URL in the browser. These values being set incorrectly also have implications when deploying your Docker registry.
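To avoid this on future installs, set the cluster hostnames in the Ansible inventory; the values below are illustrative and should point at the load balancer, not an individual master:

[OSEv3:vars]
openshift_master_cluster_hostname=ose-internal.example.com
openshift_master_cluster_public_hostname=ose.example.com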

Intermittent Login issues (htpasswd)

When using htpasswd authentication, it is possible that the htpasswd file has not propagated to all masters. To troubleshoot, do the following:

  1. Open a terminal session to each master and examine /var/log/messages

  2. If there are no clues there, edit /etc/sysconfig/atomic-openshift-master-api

  3. change OPTIONS=--loglevel=2 to OPTIONS=--loglevel=4

  4. restart the service

    systemctl restart atomic-openshift-master-api
  5. While watching /var/log/messages look for lines similar to

    Feb 25 12:33:48 master01 atomic-openshift-master-api: I0225 12:33:48.633642   10267 htpasswd.go:116] Loading htpasswd file /etc/origin/htpasswd...
    Feb 25 12:33:50 master01 atomic-openshift-master-api: I0225 12:33:50.061424   10267 trace.go:57] Trace "Update *api.Node" (started 2016-02-25 12:33:47.811133024 -0600 CST):
    Feb 25 12:33:50 master01 atomic-openshift-master-api: [2.250105891s] [2.250105891s] END
    Feb 25 12:35:47 master00 journal: http: TLS handshake error from 192.168.200.2:56781: EOF

    If there are no error messages in the log files, it is likely that the htpasswd file has not been modified since the default file was created during the installation. Below is the function that is called to load the htpasswd file. You can see on line 7 that the file’s current modification time is compared to the cached file information. If they are the same, the file is not reloaded and no error message is returned.

    func (a *Authenticator) loadIfNeeded() error {
        info, err := os.Stat(a.file)
        if err != nil {
            return err
        }
    
        if a.fileInfo == nil || a.fileInfo.ModTime() != info.ModTime() {
            glog.V(4).Infof("Loading htpasswd file %s...", a.file)
            loadingErr := a.load()
            if loadingErr != nil {
                return err
            }
    
            a.fileInfo = info
            return nil
        }
        return nil
    }

Solution:

Create the htpasswd file on each master, or sync the correct htpasswd file from one master to all other masters.
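A simple sync from the first master; the hostnames match this environment, and the file path is the one shown in the log output above:

for m in master01.ose.example.com master02.ose.example.com; do
    scp /etc/origin/htpasswd ${m}:/etc/origin/htpasswd
done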

Build Issues

oc new-app runs s2i instead of Docker build

An application that was created by specifying a builder image appears to ignore any Dockerfile in the GitHub repository. To ensure a Docker build occurs instead of an S2I build, create the application from the GitHub repository alone:

oc new-app https://github.com/lawnjarae/eap-openshift-rhc-license.git
--> Found Docker image 5c93a30 (5 months old) from registry.access.redhat.com for "registry.access.redhat.com/jboss-eap-6/eap-openshift"
    * An image stream will be created as "eap-openshift:latest" that will track the source image
    * A Docker build using source code from https://github.com/lawnjarae/eap-openshift-rhc-license.git will be created
      * The resulting image will be pushed to image stream "eap-openshift-rhc-license:latest"
      * Every time "eap-openshift:latest" changes a new build will be triggered
    * This image will be deployed in deployment config "eap-openshift-rhc-license"
    * Ports 8080/tcp, 8443/tcp will be load balanced by service "eap-openshift-rhc-license"
--> Creating resources with label app=eap-openshift-rhc-license ...
    ImageStream "eap-openshift" created
    ImageStream "eap-openshift-rhc-license" created
    BuildConfig "eap-openshift-rhc-license" created
    DeploymentConfig "eap-openshift-rhc-license" created
    Service "eap-openshift-rhc-licens" created
--> Success
    Build scheduled for "eap-openshift-rhc-license" - use the logs command to track its progress.
    Run 'oc status' to view your app.

Additionally, if you would like to force oc new-app to use the Docker strategy, you can pass the --strategy=docker flag to the command:

oc new-app https://github.com/lawnjarae/eap-openshift-rhc-license.git --strategy=docker

Binary Build Fails, citing "BadRequest"

Running a binary build using:

oc start-build my-app --from-dir=./build-dir

Fails with the following error message:

Uploading directory "oc-build" as binary input for the build ...
Error from server (BadRequest): cannot upload file to build sprint-rest-2 with status New

Solution:

It’s likely that there is a problem with one of your ImageStream objects. Take a look at your BuildConfig:

$ oc describe bc/spring-rest
Name:		spring-rest
Namespace:	spring-rest-dev
Created:	29 minutes ago
Labels:		application=spring-rest
		template=generic-java-jenkins-pipeline
Annotations:	kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"BuildConfig","metadata":{"annotations":{},"labels":{"application":"spring-rest","template":"generic-java-jenkins-pipeline"},"name":"spring-rest","namespace":"spring-rest-dev"},"spec":{"output":{"to":{"kind":"ImageStreamTag","name":"spring-rest:latest"}},"source":{"binary":{},"type":"Binary"},"strategy":{"sourceStrategy":{"from":{"kind":"ImageStreamTag","name":"redhat-openjdk18-openshift:1.1","namespace":"openshift"}},"type":"Source"}}}

Latest Version:	1

Strategy:	Source
From Image:	ImageStreamTag openshift/redhat-openjdk18-openshift:1.1
Output to:	ImageStreamTag spring-rest:latest
Binary:		provided on build

Build Run Policy:	Serial
Triggered by:		<none>

Build		Status		Duration	Creation Time
spring-rest-1 	complete 	50s 		2017-11-20 20:55:21 -0500 EST

Events:	<none>

Notice the two lines that say From Image: and Output to:. It’s likely that one of those image streams is either misspelled or has not yet been created. Ensure your image streams are created and correct, and try running the build again.
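A quick check, using the names from the BuildConfig above:

oc get is spring-rest -n spring-rest-dev
oc get istag redhat-openjdk18-openshift:1.1 -n openshift

If either command returns an error, create or correct the missing image stream before re-running oc start-build.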

Misc

Docker won’t start

[root@master00 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Wed 2016-03-02 20:12:43 CST; 8s ago
     Docs: http://docs.docker.com
  Process: 2577 ExecStart=/usr/bin/docker daemon $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
 Main PID: 2577 (code=exited, status=1/FAILURE)

Mar 02 20:12:43 master00 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 02 20:12:43 master00 systemd[1]: Failed to start Docker Application Container Engine.
Mar 02 20:12:43 master00 systemd[1]: Unit docker.service entered failed state.
Mar 02 20:12:43 master00 systemd[1]: docker.service failed.
Mar 02 20:12:43 master00 systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 02 20:12:44 master00 systemd[1]: start request repeated too quickly for docker.service
Mar 02 20:12:44 master00 systemd[1]: Failed to start Docker Application Container Engine.
Mar 02 20:12:44 master00 systemd[1]: Unit docker.service entered failed state.
Mar 02 20:12:44 master00 systemd[1]: docker.service failed.

Check /var/log/messages

Mar  2 20:06:43 master00 docker: time="2016-03-02T20:06:43.672735546-06:00" level=info msg="Listening for HTTP on unix (/var/run/docker.sock)"
Mar  2 20:06:43 master00 docker: time="2016-03-02T20:06:43.873012061-06:00" level=warning msg="Docker could not enable SELinux on the host system"
Mar  2 20:06:43 master00 docker: time="2016-03-02T20:06:43.879826788-06:00" level=fatal msg="Error starting daemon: Error loading key file /etc/docker/key.json: unable to decode private key JWK: decoding JWK Private Key JSON data: unexpected end of JSON input\n"

Solution:

Chances are the key file is empty. Remove the key and restart Docker; the key will be regenerated:

# rm -f /etc/docker/key.json
# systemctl restart docker