How to expose a web server with REST API and HTML/JavaScript applications from an existing Python application?


I have an existing Python application that crawls the Internet continuously. It uses the requests package to make HTTP requests to various websites such as GitHub, Twitter, etc., and downloads the available data onto a filesystem. It also makes HTTP requests to the REST APIs of GitHub and Twitter and downloads a lot of metadata. It keeps doing this in an infinite loop. After every iteration, it invokes time.sleep(3600) to sleep for 1 hour before the next iteration.

Now I want to expose an HTTP server on port 80 from this application so that any client can connect to port 80 of this app to query its internal state. For example, if someone runs curl http://myapp/status, it should respond with {"status": "crawling"} or {"status": "sleeping"}. If someone visits http://myapp/status with their web browser, it should display an HTML page showing the status. Based on the user agent detected, it would serve either a REST API response or an HTML page. If, for any reason, my app goes down or crashes, HTTP requests to port 80 should of course fail.

How can I expose such an HTTP server from the application? I thought of using Django because, as the project grows, it will have to do a lot of heavy lifting such as authentication, protection against CSRF attacks, accepting user input, and querying the data it has collected. Django seems good for this purpose. But the problem with Django is that I cannot embed it in my current app; I would have to run a separate uWSGI server to serve the Django app. The same problem exists with Flask as well.

What is the right way to solve a problem like this in Python?

1 answer


The way I see it, you have two high-level ways of tackling this problem:

  1. Have separate applications (a "server" and a "crawler") that have some shared datastore (database, Redis, etc). Each application would operate independently and the crawler could just update its status in the shared datastore. This approach could probably scale better: if you spin it up in something like Docker Swarm, you could scale the crawler instances as much as you can afford.
  2. Have a single application that spawns separate threads for the crawler and server. Since they're in the same process, you can share information between them a bit quicker (though if it's just the crawler status, that shouldn't matter much). The main advantage of this option is ease of setup: you wouldn't need a shared datastore, and you wouldn't need to manage more than one service.

I would personally tend towards (1) here, because each of the pieces is simpler. What follows is a solution to (1), and a quick and dirty solution to (2).

1. Separate processes with a shared datastore

I would use Docker Compose to handle spinning up all of the services. It adds an extra layer of complexity (as you need to have Docker installed), but it greatly simplifies managing the services.

The whole Docker Compose stack

Building on the example configuration here, I would make a ./docker-compose.yaml file that looks like

version: '3'
services:
  server:
    build: ./server
    ports:
      - "80:8000"
    links:
      - redis
    environment:
      - REDIS_HOST=cache
  crawler:
    build: ./crawler
    restart: unless-stopped
    links:
      - redis
    environment:
      - REDIS_HOST=cache
  redis:
    image: "redis:alpine"
    container_name: cache
    expose:
      - 6379

I would organize the applications into separate directories, like ./server and ./crawler, but that's not the only way to do it. However you organize them, the build paths in the configuration above should match.
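For concreteness, this is the layout I'm assuming for the rest of this answer (the same seven files counted in the wrapup below):

./docker-compose.yaml
./server/app.py
./server/requirements.txt
./server/Dockerfile
./crawler/app.py
./crawler/requirements.txt
./crawler/Dockerfile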

The server

I would write a simple server in ./server/app.py that does something like

import os

from flask import Flask, jsonify
import redis

app = Flask(__name__)
r_conn = redis.Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),
    port=6379
)

@app.route('/status')
def status():
    try:
        stat = r_conn.get('crawler_status')
    except redis.RedisError:
        return jsonify({'error': 'could not reach Redis'}), 500
    if stat is None:
        return jsonify({'status': 'unknown'}), 500
    return jsonify({'status': stat.decode('utf-8')})

app.run(host='0.0.0.0', port=8000)
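Since the question also asks for an HTML page when a browser visits /status and JSON for API clients, one way to handle that in Flask is content negotiation on the Accept header rather than sniffing the user agent. This is a rough sketch of my own, not part of the original answer; the inline template and the hard-coded status are placeholders:

from flask import Flask, jsonify, request, render_template_string

app = Flask(__name__)

# placeholder template, just for illustration
STATUS_PAGE = '<html><body><h1>Crawler status: {{ status }}</h1></body></html>'

@app.route('/status')
def status():
    current = 'crawling'  # in the real app, read this from Redis as above
    best = request.accept_mimetypes.best_match(['application/json', 'text/html'])
    if best == 'text/html':
        return render_template_string(STATUS_PAGE, status=current)
    return jsonify({'status': current})

app.run(host='0.0.0.0', port=8000)

Browsers send Accept: text/html, so they get the page; curl's default Accept: */* resolves to the first entry in the list, so it gets JSON.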

Along with ./server/app.py, you need a ./server/requirements.txt file with the dependencies

Flask
redis

And finally a ./server/Dockerfile that tells Docker how to build your server

FROM alpine:latest
# install Python
RUN apk add --no-cache python3 && \
    python3 -m ensurepip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip3 install --upgrade pip setuptools && \
    rm -r /root/.cache
# copy the app and make it your current directory
RUN mkdir -p /opt/server
COPY ./ /opt/server
WORKDIR /opt/server
# install deps and run server
RUN pip3 install -qr requirements.txt
EXPOSE 8000
CMD ["python3", "app.py"]

Stop to check things are alright

At this point, if you open a terminal (or CMD prompt on Windows) in the directory containing ./docker-compose.yaml, you should be able to run docker-compose build && docker-compose up to check that everything builds and runs happily. You will need to comment out the crawler section of the YAML file (since it hasn't been written yet), but you should be able to spin up a server that talks to Redis. If you're happy with it, uncomment the crawler section of the YAML and proceed.
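As a quick sanity check from the host (assuming the stack is up and the server is published on port 80 of localhost; the crawler isn't running yet, so expect a 500 or an "unknown" status):

import requests

# hit the status endpoint through the port published by Docker Compose
resp = requests.get('http://localhost/status', timeout=5)
print(resp.status_code, resp.text)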

The crawler process

Since Docker restarts the crawler container whenever it exits (thanks to the restart policy in the compose file above), you can write this as a very simple Python script that does one iteration and exits. Something like ./crawler/app.py could look like

from time import sleep
import os
import sys

import redis

TIMEOUT = 3600  # seconds to sleep between runs
r_conn = redis.Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),
    port=6379
)

# ... update status and then do the work ...
r_conn.set('crawler_status', 'crawling')
sleep(60)  # placeholder for the actual crawl
# ... okay, it's done, update status ...
r_conn.set('crawler_status', 'sleeping')

# sleep for a while, then exit so Docker can restart the container
sleep(TIMEOUT)
sys.exit(0)
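The sleep(60) above is just a stand-in for the actual work. In your case, that work is your existing requests-based crawl; a purely illustrative sketch of what a single iteration might look like (crawl_once, the endpoint, and the output path are all hypothetical):

import requests

def crawl_once():
    # hypothetical single iteration: fetch some metadata from the GitHub API
    # and write it to the filesystem, standing in for your existing crawl code
    resp = requests.get('https://api.github.com/repos/python/cpython')
    resp.raise_for_status()
    with open('/tmp/cpython.json', 'w') as f:
        f.write(resp.text)

You would call crawl_once() between the two crawler_status updates in place of the sleep(60).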

And then like before you need a ./crawler/requirements.txt file

redis

And a (very similar to the server's) ./crawler/Dockerfile

FROM alpine:latest
# install Python
RUN apk add --no-cache python3 && \
    python3 -m ensurepip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip3 install --upgrade pip setuptools && \
    rm -r /root/.cache
# copy the app and make it your current directory
RUN mkdir -p /opt/crawler
COPY ./ /opt/crawler
WORKDIR /opt/crawler
# install deps and run the crawler
RUN pip3 install -qr requirements.txt
# NOTE that no port is exposed
CMD ["python3", "app.py"]

Wrapup

In 7 files, you have two separate applications managed by Docker as well as a Redis instance. If you want to scale it, you can look into the --scale option for docker-compose up. This is not necessarily the simplest solution, but it manages some of the unpleasant bits about process management. For reference, I also made a Git repo for it here.

To run it as a headless service, just run docker-compose up -d.

From here, you can and should add nicer logging to the crawler. You can of course use Django instead of Flask for the server (though I'm more familiar with Flask and Django may introduce new dependencies). And of course you can always make it more complicated.
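For the logging part, the standard library logging module is probably enough to start with; a minimal sketch of what the crawler could do (the format string is just a suggestion):

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s'
)
log = logging.getLogger('crawler')

log.info('starting crawl')
# ... crawl ...
log.info('crawl finished, sleeping')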

2. Single process with threading

This solution does not require any Docker, and should only require a single Python file to manage. I won't write a full solution unless OP wants it, but the basic sketch would be something like

import threading
import time

from flask import Flask, jsonify

STATUS = 'starting'

# run the server on another thread
def run_server():
    app = Flask(__name__)

    @app.route('/status')
    def index():
        return jsonify({'status': STATUS})

    app.run(host='0.0.0.0', port=80)

server_thread = threading.Thread(target=run_server, daemon=True)
server_thread.start()

# run the crawler on another thread
def crawler_loop():
    global STATUS  # rebind the module-level variable so the server sees updates
    while True:
        STATUS = 'crawling'
        # ... crawl and wait ...
        STATUS = 'sleeping'
        time.sleep(3600)

crawler_thread = threading.Thread(target=crawler_loop, daemon=True)
crawler_thread.start()

# the main thread just waits; the daemon threads die with the process
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass

This solution doesn't do anything to keep the services alive, does very little in the way of error handling, and the loop at the end won't handle signals from the OS very well. That said, it's a quick and dirty solution that should get you off the ground.
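If you do go this route, one small improvement worth considering (my own sketch, not part of the code above) is to use a threading.Event as a stop flag so the crawler loop can finish cleanly instead of being torn down with the process:

import threading

stop_event = threading.Event()

def crawler_loop():
    while not stop_event.is_set():
        # ... crawl, update STATUS ...
        # wait() doubles as an interruptible sleep: it returns early
        # if the stop flag is set during the hour-long pause
        stop_event.wait(3600)

# elsewhere, e.g. in a signal handler or on shutdown:
# stop_event.set()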