A FHIR-Native Analytics Approach Redux — Python

Steve Munini
Helios Software
Feb 26, 2021 · 6 min read


Our original article A FHIR-Native Analytics Approach was one of our most popular blog articles. As its readership grew, we decided to do several hands-on workshops using its contents, and we also had many developers independently work through the steps outlined in the article.

We have learned a lot since authoring the original article, and we wanted to refresh, simplify, and improve it overall. So here is a redux version of A FHIR-Native Analytics Approach.

redux [ ri-duhks ]

adjective — brought back; resurgent

What did we learn?

Since the original publication of A FHIR-Native Analytics Approach, we have only seen strong and increasing demand from developers who wish to ask analytic questions of their FHIR data. This, of course, has a lot to do with the growth of the FHIR standard itself. Most of the interest has come from organizations that are building new products, versus those who are mapping from legacy architectures. We do envision that as newer options for accessing large amounts of FHIR data become available, most notably the FHIR Bulk Data Access specification, we will see even more demand for analytic workloads such as the ones we outlined in the original article.

Developers really want access to a FHIR data model. A common question in the FHIR developer community is:

“Can anybody direct me to a tool (preferably open source tool) that can convert FHIR resources to a relational data model either columnar storage or pure-relational for querying purposes and possibly for analytical workloads?”

The FHIR Specification doesn’t publish a SQL data model, nor should it. The FHIR Specification is an interoperability standard, and it’s up to implementers of the FHIR Specification to design and implement the standard as they see fit. The Helios FHIR Server provides a Cassandra data model expressly for this purpose.

Developers really want to use their preferred languages and tools. We received great feedback that readers liked the R example, but many wished it were also available in Python, which happens to be much more popular than R.

There have been some significant advancements in the Spark Cassandra Connector driver. For a complete rundown of the recent changes, watch Russell Spitzer’s presentation: DataSource V2 and Cassandra — A Whole New World.

Recommended Reading

If you have landed here first, and haven’t had a chance to do a complete walk-through of A FHIR-Native Analytics Approach, we would recommend reading the following sections of that article:

  • The Introduction
  • Overview
  • Why Cassandra?
  • Spark and Cassandra: Working Together
  • Helios FHIR Server’s Data Model
  • Bringing It All Together

What hasn’t changed?

Ok! Let’s dive in, but before we do, what hasn’t changed from the original article?

  • We are still using Synthea to generate sample data. Synthea is an open-source Synthetic Patient Population Simulator developed by The MITRE Corporation. It can output synthetic, realistic (but not real) patient data and associated health records in a variety of formats, including HL7 FHIR R4.
  • We are still running Spark on Cassandra.
  • We are still using the Helios FHIR Server of course! There have been several updates and optimizations to the Helios FHIR Server since the original publication of the article, so go ahead and download the latest version.
  • The example analytic problem we are solving has not changed. The example predicts which patients have asthma. The overall program flow also has not changed — it’s simply reimplemented in Python. The example prepares data for analysis, trains predictive models, uses cross-validation to select the best model, and finally uses the Lasso algorithm to reduce input complexity.
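The modeling flow described above — prepare data, fit models, pick one by cross-validation, and use the Lasso to prune inputs — can be made concrete in plain Python. Below is a minimal sketch using only NumPy (the function names and the tiny synthetic dataset are ours for illustration, not taken from the notebook): a Lasso fit via coordinate descent, with k-fold cross-validation selecting the regularization strength.

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's contribution, then re-fit it.
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_sq[j]
    return beta

def cv_mse(X, y, lam, k=5):
    """Mean squared prediction error of a Lasso fit, averaged over k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False                      # hold this fold out
        beta = lasso_fit(X[mask], y[mask], lam)
        errs.append(np.mean((y[~mask] - X[~mask] @ beta) ** 2))
    return np.mean(errs)

# Synthetic data: 5 features, only two of which actually matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
true_beta = np.array([3.0, 0.0, 0.0, 2.0, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(200)

# Cross-validate over a small grid and keep the best lambda.
lams = [0.01, 0.1, 1.0]
best = min(lams, key=lambda lam: cv_mse(X, y, lam))
beta = lasso_fit(X, y, best)
print(best, np.round(beta, 2))
```

In the actual notebook the same idea runs on Spark; this sketch just shows the Lasso-plus-cross-validation mechanics, including how the L1 penalty drives the irrelevant coefficients to zero.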

What has changed?

  • In this article, we are using Python instead of R. Python is a more popular programming language, and we wanted to showcase that a different language can be used just as easily as R.
  • We are using the new Spark Cassandra Connector from DataStax.
  • We are now using the open-source Apache Cassandra instead of DataStax Enterprise Analytics. Many of the key enabling features required to connect Cassandra to/from Spark have been moved to the open-source Spark Cassandra Connector. Thank you, DataStax, for making this possible!
  • The original article included many steps for setting up and configuring 4 EC2 instances on AWS (3 Cassandra nodes, and a Helios FHIR Server). We greatly simplified the steps in this article, and you no longer need to use AWS. The instructions below are intended to be run on your local laptop/desktop environment.
  • Instead of R Studio, we are using a Jupyter notebook.
  • We have a fancy new data upload tool that is really speedy and enables parallel import of the Synthea data. It’s called simply fhir-importer.

Prerequisites

You will need two different versions of Java available on your computer. These instructions assume a Mac environment.

  • Java 8 — for running Cassandra
$ brew tap AdoptOpenJDK/openjdk
$ brew cask install adoptopenjdk8
  • Java 11 — for running the Helios FHIR Server
$ brew cask install adoptopenjdk11

Add the following lines to your ~/.bash_profile so you can easily switch between the two versions of Java:

export JAVA_8_HOME=$(/usr/libexec/java_home -v1.8)
export JAVA_11_HOME=$(/usr/libexec/java_home -v11)
alias java8='export JAVA_HOME=$JAVA_8_HOME'
alias java11='export JAVA_HOME=$JAVA_11_HOME'

Refresh your environment:

$ source ~/.bash_profile

Always check your version of java with:

$ java -version
openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)

Above, you can see I’m running Java 11. Now, with this command, I’m running Java 8.

$ java8
$ java -version
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

Install Cassandra

Follow these instructions to install Cassandra locally.

Install the Helios FHIR Server

Navigate to https://heliossoftware.com/download-enterprise-edition/ and download the latest version of the Helios FHIR Server Enterprise Edition.

Follow these instructions to install the Helios FHIR Server.

After you have logged in to the administrative user interface, enable the following FHIR Resources:

AllergyIntolerance
Bundle
CarePlan
CareTeam
Claim
Condition
Coverage
Device
DiagnosticReport
DocumentReference
Encounter
ExplanationOfBenefit
ImagingStudy
Immunization
Location
Medication
MedicationAdministration
MedicationRequest
Observation
Organization
Patient
Practitioner
PractitionerRole
Procedure
Provenance
ServiceRequest
SupplyDelivery

Synthea

Next, download Synthea from GitHub using git.

git clone https://github.com/synthetichealth/synthea.git

Let’s generate 1,000 simulated patients and examine the output.

cd synthea
./run_synthea -p 1000
cd output/fhir
[...Examine some of the FHIR output...]
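Each file Synthea writes to output/fhir is a FHIR Bundle: a JSON document whose entry array holds the individual resources. A quick way to examine one is to tally its resource types. A minimal sketch using only the Python standard library (the helper name and the hand-made stand-in Bundle are ours for illustration):

```python
from collections import Counter

def resource_counts(bundle):
    """Tally resourceType values across a FHIR Bundle's entries."""
    return Counter(entry["resource"]["resourceType"]
                   for entry in bundle.get("entry", []))

# A tiny, hand-made stand-in for one of Synthea's Bundle files:
bundle = {"entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Observation", "id": "o1"}},
    {"resource": {"resourceType": "Observation", "id": "o2"}},
]}
print(resource_counts(bundle))
```

Against a real file, you would json.load() one of the documents in output/fhir and pass the resulting dict to the same function.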

Import the Sample Data

Build and run the fhir-importer using Java 11.

$ java11
$ git clone git@github.com:HeliosSoftware/fhir-importer.git
$ cd fhir-importer
$ mvn clean install

The fhir-importer jar will reside in the /target folder.

Run the following command.

$ java -jar [path to fhir-importer]/fhir-importer-0.0.1-SNAPSHOT.jar -directory [path-to-output/fhir] -regex ".*.json"

Install Jupyter

Assuming you have Python installed on your computer, simply execute the following commands:

$ sudo pip3 install jupyter
$ sudo pip3 install numpy

Install Spark

Download Spark from https://spark.apache.org/downloads.html

Select the 3.0.x Spark release and the “Prebuilt for Apache Hadoop 3.2 and later” package type.

Un-tar Spark to a convenient folder.

Add the following to your ~/.bash_profile

export SPARK_HOME="/Users/[your username]/[a directory]/spark-3.0.1-bin-hadoop3.2" 
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PATH=$SPARK_HOME/bin:$PATH

Refresh your environment:

$ source ~/.bash_profile

Clone the pyspark-analytics project

This is the GitHub project that contains the Jupyter project we will be using.

$ git clone git@github.com:HeliosSoftware/pyspark-analytics.git

Run pyspark!

Run the following command in the same directory as your Jupyter notebook (i.e., where you cloned the https://github.com/HeliosSoftware/pyspark-analytics project):

$ pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0

Open the fhir-analytics-with-python.ipynb notebook file and run it! You will find detailed comments describing the logic in the notebook itself.
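Inside the notebook, the Spark Cassandra Connector loaded by the --packages flag above is what bridges pyspark and the Cassandra tables. As a sketch of how the connector is configured for a read (the keyspace and table names here are placeholders for illustration, not the Helios FHIR Server's actual schema; pyspark starts you with a ready-made spark session, so no SparkSession boilerplate is needed):

```python
# Read a Cassandra table into a Spark DataFrame via the connector.
# "fhir" / "patient" are placeholder keyspace/table names.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="fhir", table="patient")
      .load())
df.printSchema()
```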
