docker – How to Import Streamsets pipeline in Dockerfile without container exiting

In your Dockerfile you overwrite the default CMD and ENTRYPOINT from the StreamSets Data Collector Dockerfile. As a result, the container only executes your command during startup and then exits without errors. That is why your container is in Exited (0) status.

In general this is good and expected behavior: a container exits as soon as its main process finishes. If you want to keep your container alive, you need to run another command in the foreground, one that never ends. But unfortunately, you cannot run multiple CMDs in your Dockerfile.
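For illustration, a Dockerfile along these lines reproduces the symptom (this is a sketch, not the exact file from the question; the pipeline name and path are the ones used below):

FROM streamsets/datacollector:3.18.1

COPY myPipeline.json /pipelinejsonlocation/

# Overriding ENTRYPOINT/CMD replaces the image's defaults: whatever command
# runs here finishes quickly and the container exits - the Data Collector
# itself is never started in the foreground.
ENTRYPOINT ["/bin/sh", "-c", "${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 store import -n myPipeline -f /pipelinejsonlocation/myPipeline.json"]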

I dug a little deeper. The default entry point of the image is ENTRYPOINT ["/docker-entrypoint.sh"]. This script sets up a few things and then starts the Data Collector.
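For reference, the relevant defaults at the end of the upstream Dockerfile look like this (paraphrased from the streamsets/datacollector-docker repository; dc -exec is what the entrypoint receives as $@ and hands to the streamsets launcher):

ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["dc", "-exec"]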

The Data Collector must be running before the pipeline can be imported. So a solution could be to copy the default docker-entrypoint.sh and modify it to start the Data Collector and import the pipeline afterwards. You could do it like this:

Dockerfile:

FROM streamsets/datacollector:3.18.1

COPY myPipeline.json /pipelinejsonlocation/
# Replace docker-entrypoint.sh
COPY docker-entrypoint.sh /docker-entrypoint.sh 

EXPOSE 18630
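
Make sure your local copy of docker-entrypoint.sh is executable before building, since COPY preserves the file mode. Then build and run the image (the tag cmp/sdc and container name sdc are just the examples used in this answer):

chmod +x docker-entrypoint.sh
docker build -t cmp/sdc .
docker run -p 18630:18630 -d --name sdc cmp/sdc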

docker-entrypoint.sh (https://github.com/streamsets/datacollector-docker/blob/master/docker-entrypoint.sh):

#!/bin/bash
#
# Copyright 2017 StreamSets Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

set -e

# We translate environment variables to sdc.properties and rewrite them.
set_conf() {
  if [ $# -ne 2 ]; then
    echo "set_conf requires two arguments: <key> <value>"
    exit 1
  fi

  if [ -z "$SDC_CONF" ]; then
    echo "SDC_CONF is not set."
    exit 1
  fi

  grep -q "^$1" ${SDC_CONF}/sdc.properties && sed "s|^#\?\($1=\).*|\1$2|" -i ${SDC_CONF}/sdc.properties || echo -e "\n$1=$2" >> ${SDC_CONF}/sdc.properties
}

# support arbitrary user IDs
# ref: https://docs.openshift.com/container-platform/3.3/creating_images/guidelines.html#openshift-container-platform-specific-guidelines
if ! whoami &> /dev/null; then
  if [ -w /etc/passwd ]; then
    echo "${SDC_USER:-sdc}:x:$(id -u):0:${SDC_USER:-sdc} user:${HOME}:/sbin/nologin" >> /etc/passwd
  fi
fi

# In some environments such as Marathon $HOST and $PORT0 can be used to
# determine the correct external URL to reach SDC.
if [ ! -z "$HOST" ] && [ ! -z "$PORT0" ] && [ -z "$SDC_CONF_SDC_BASE_HTTP_URL" ]; then
  export SDC_CONF_SDC_BASE_HTTP_URL="http://${HOST}:${PORT0}"
fi

for e in $(env); do
  key=${e%=*}
  value=${e#*=}
  if [[ $key == SDC_CONF_* ]]; then
    lowercase=$(echo $key | tr '[:upper:]' '[:lower:]')
    key=$(echo ${lowercase#*sdc_conf_} | sed 's|_|.|g')
    set_conf $key $value
  fi
done
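
# Example of the translation above: SDC_CONF_SDC_BASE_HTTP_URL=http://foo:18630
# becomes the line sdc.base.http.url=http://foo:18630 in ${SDC_CONF}/sdc.properties.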

# MODIFICATIONS:
#exec ${SDC_DIST}/bin/streamsets $@

check_data_collector_status () {
   watch -n 1 ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping | grep -q "version" && echo "Data Collector has started!" && import_pipeline
}

function import_pipeline () {
    sleep 1

    echo "Start to import pipeline"
    ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n myPipeline --stack -f /pipelinejsonlocation/myPipeline.json

    echo "Finished importing pipeline"
}

# Start checking if Data Collector is up (in background) and start Data Collector
check_data_collector_status & ${SDC_DIST}/bin/streamsets "$@"

I commented out the last line exec ${SDC_DIST}/bin/streamsets $@ of the default docker-entrypoint.sh and added two functions: check_data_collector_status () pings the Data Collector service until it is available, and import_pipeline () imports your pipeline.

check_data_collector_status () runs in the background, while ${SDC_DIST}/bin/streamsets $@ is started in the foreground as before. This way the pipeline is imported right after the Data Collector service has started.
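
Once the container is running, you can verify the import from the host (the container name sdc and the default admin/admin credentials are the ones assumed in the examples above):

# List the stored pipelines from inside the container; the single quotes
# keep ${SDC_DIST} from being expanded on the host instead of in the container.
docker exec sdc sh -c '${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list'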

If the container still exits and you want to debug it, run the image with a sleep command:

docker run -p 18630:18630 -d --name sdc cmp/sdc sleep 300 

300 is the time to sleep in seconds.

Then exec into the container, run your script manually, and find out what's wrong.
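
For example (assuming the container started with the sleep command above, and that dc -exec matches the image's default CMD):

docker exec -it sdc bash
# inside the container, invoke the entrypoint by hand to watch its output:
/docker-entrypoint.sh dc -exec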
