docker – How to Import Streamsets pipeline in Dockerfile without container exiting

In your Dockerfile you overwrite the default CMD and ENTRYPOINT from the StreamSets Data Collector Dockerfile. So the container only executes your command during startup and exits without errors afterwards. This is the reason why your container is in Exited (0) status.

In general this is good and expected behavior. If you want to keep your container alive, you need to run another command in the foreground that never ends. Unfortunately, you cannot run multiple CMDs in a Dockerfile: if several are present, only the last one takes effect.
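A common way around the single-CMD limit is a small wrapper script that does its one-off work and then hands control to the long-running command. The following is only a minimal sketch of that pattern. The names `run_setup` and `entrypoint` are hypothetical and are not part of the StreamSets image.

```shell
#!/bin/sh
# Hypothetical wrapper illustrating the pattern; not the StreamSets script.

# One-off work that has to happen before the main process starts.
run_setup() {
  echo "setup done"
}

# Run the setup, then replace the shell with the command given as arguments.
# Because of exec, that command becomes the main process, so the container
# lives exactly as long as the command does.
entrypoint() {
  run_setup
  exec "$@"
}
```

In a Dockerfile, such a script would typically be set as the ENTRYPOINT and the long-running command passed as CMD. Note that this simple form only works for setup steps that do not need the main process to be running already.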

I dug a little deeper. The default entry point of the image is ENTRYPOINT [/]. This script sets up a few things and starts the Data Collector.

The Data Collector must be running before the pipeline can be imported. So one solution is to copy the default entry point script and modify it so that it starts the Data Collector and imports the pipeline afterwards. You could do it like this:


FROM streamsets/datacollector:3.18.1

COPY myPipeline.json /pipelinejsonlocation/
# Replace

EXPOSE 18630

set -e

# We translate environment variables to and rewrite them.
set_conf() {
  if [ $# -ne 2 ]; then
    echo "set_conf requires two arguments: <key> <value>"
    exit 1
  fi

  if [ -z "$SDC_CONF" ]; then
    echo "SDC_CONF is not set."
    exit 1
  fi

  grep -q "^$1" ${SDC_CONF}/ && sed 's|^#\?\('"$1"'=\).*|\1'"$2"'|' -i ${SDC_CONF}/ || echo -e "\n$1=$2" >> ${SDC_CONF}/
}

# support arbitrary user IDs
# ref:
if ! whoami &> /dev/null; then
  if [ -w /etc/passwd ]; then
    echo "${SDC_USER:-sdc}:x:$(id -u):0:${SDC_USER:-sdc} user:${HOME}:/sbin/nologin" >> /etc/passwd
  fi
fi

# In some environments such as Marathon $HOST and $PORT0 can be used to
# determine the correct external URL to reach SDC.
if [ ! -z "$HOST" ] && [ ! -z "$PORT0" ] && [ -z "$SDC_CONF_SDC_BASE_HTTP_URL" ]; then
  export SDC_CONF_SDC_BASE_HTTP_URL="http://${HOST}:${PORT0}"
fi

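for e in $(env); do
  # Split each NAME=value entry into its name and its value.
  key=${e%%=*}
  value=${e#*=}
  if [[ $key == SDC_CONF_* ]]; then
    lowercase=$(echo "$key" | tr '[:upper:]' '[:lower:]')
    key=$(echo "${lowercase#*sdc_conf_}" | sed 's|_|.|g')
    set_conf "$key" "$value"
  fi
done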

#exec ${SDC_DIST}/bin/streamsets $@

check_data_collector_status () {
   watch -n 1 ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping | grep -q version && echo "Data Collector has started!" && import_pipeline
}

function import_pipeline () {
    sleep 1

    echo "Start to import pipeline"
    ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n myPipeline --stack -f /pipelinejsonlocation/myPipeline.json

    echo "Finished importing pipeline"
}

# Start checking if Data Collector is up (in background) and start Data Collector
check_data_collector_status & ${SDC_DIST}/bin/streamsets "$@"

I commented out the last line, exec ${SDC_DIST}/bin/streamsets $@, of the default entry point script and added two functions: check_data_collector_status () pings the Data Collector service until it is available, and import_pipeline () imports your pipeline.

check_data_collector_status () runs in the background, and ${SDC_DIST}/bin/streamsets $@ is started in the foreground as before. So the pipeline is imported once the Data Collector service has started.
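If `watch` is not available in your image, or you want the check to give up after a while, the polling in check_data_collector_status () could also be written as a plain retry loop. This is only a sketch of that idea: `wait_until_ready` is a hypothetical helper, not something StreamSets provides, and the attempt limit and delay are values you would have to choose yourself.

```shell
#!/usr/bin/env bash
# wait_until_ready is a hypothetical helper, not part of StreamSets.
# Usage: wait_until_ready <max_attempts> <delay_seconds> <command> [args...]
wait_until_ready() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1

  # The probe is the condition of `until`, so a failing probe does not
  # abort the script even when `set -e` is active.
  until "$@" > /dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Gave up after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done

  echo "Ready after ${attempt} attempt(s)"
}
```

In the entry point script, the probe could be a small function that wraps the same ping as above, for example one that runs `${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping | grep -q version`, followed by a call to import_pipeline once `wait_until_ready` returns successfully.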

Run this image with a sleep command:

docker run -p 18630:18630 -d --name sdc cmp/sdc sleep 300

300 is the time to sleep in seconds.

Then exec your script manually inside the Docker container to find out what's wrong.

