The Value of Open-Source

Leveraging free, code-first tools to iterate toward advanced analytics

Alex Zajichek

Research Data Scientist, Cleveland Clinic

February 27, 2025

What is AI, Anyway?

Two Sides of a Coin


AI as Tools

  • Pre-built products like ChatGPT, Gemini, etc.
  • Used (purchased) to fit our tasks
  • Perceived as productivity tools
  • Dominates the conversation

AI as Data Science

  • Data infrastructure, analytical thinking, statistical reasoning, tools for facilitation
  • How we use data to help inform decision making
  • The building blocks of AI itself

Blurred Lines



  • We tend to focus on the first and bypass the second, conflating the two views


  • Leads to ambiguity and confusion

Figure: The confusing nature of AI-related fields


The Horse Before the Cart

Current State

  • Gartner (2018): 87% of organizations have low BI and analytics maturity [1]

Figure: Gartner Analytics Maturity Model


Back to Basics

  • Are you capturing what’s important? (Data infrastructure)
    • “If I knew X at time Y, I could do Z”
  • How are you using your data? (Reporting/analytical thinking)
  • What are your limitations? (Tools, skills, time, etc.)
  • Are AI tools really the solution?
    • “AI for the sake of AI is a losing proposition” [2]. Be intentional!

The Case for Open-Source

What Is Open-Source?

Background

  • In essence, free and open software (Wikipedia)
  • Think of it as community-built
  • Like a workshop for your raw materials (with directions)

Benefits in Data Science

  • Code-first approach gives flexibility and control (art + science)
  • Use it to facilitate your analytical approach
    • Building blocks to AI
  • Low-cost iteration and experimentation (anyone can do it)


The R Programming Language

Background

  • A free and open-source functional programming language
  • Developed in the 1990s for statistical computing, but has long since expanded to much broader usage
  • Advanced through packages developed by the community (Package list)
  • Commonly used in the RStudio IDE

Example

library(plotly) # Load package
my_data <- trees # Assign object (dataset)
plot_ly(
  data = my_data,
  x = ~Height,
  y = ~Girth,
  size = ~Volume,
  color = ~Volume,
  text = ~paste0("Height: ", Height, "<br>Girth: ", Girth, "<br>Volume: ", Volume),
  height = 300,
  width = 500
)

Map example

# Load packages
library(tidyverse)
library(tidycensus)
library(mapgl)

# Extract a data set to use
dat <- 
  get_acs(
    geography = "tract",
    variables = "B19013_001", # Median household income
    state = "WI",
    year = 2022,
    geometry = TRUE
  ) |>
  
  # Make an information column
  mutate(
    Info = paste0(str_remove(NAME, ";.+$"), "<br>Median Income ($): ", round(estimate))
  )

maplibre() |>
  
  # Focus the mapping area
  fit_bounds(dat) |>
  
  # Fill with the data values
  add_fill_layer(
    id = "mc_acs",
    source = dat,
    fill_outline_color = "black",
    fill_color = 
      interpolate(
        column = "estimate",
        values = range(dat$estimate, na.rm = TRUE),
        stops = c("#f2d37c", "#08519c"),
        na_color = "gray"
      ),
    fill_opacity = 0.50,
    popup = "Info"
  ) |>
  add_legend(
    legend_title = "Median income ($)",
    values = range(dat$estimate, na.rm = TRUE),
    colors = c("#f2d37c", "#08519c")
  )

Quarto for Reproducible Documents

Background

  • Quarto is an open-source technical publishing system
  • Build custom analytical documents in a programmatic way
  • Vehicle for dissemination, promoting automation and reproducibility
  • Integrates well with R, Python, and many other tools

YAML (header/metadata)

---
title: "The Value of Open-Source"
subtitle: "Leveraging free, code-first tools to iterate toward advanced analytics"
institute: "Research Data Scientist, Cleveland Clinic"
author: "Alex Zajichek"
date: "2025-02-27"
date-format: long
format:
  revealjs: 
    theme: [serif, custom.scss]
    footer: "<em>AI Innovations at Work Conference 2025</em>"
    slide-number: true
    incremental: true
---

Markdown (body)

### Background

- [Quarto](https://quarto.org/) is an open-source technical publishing system
- Build custom analytical documents in a programmatic way
- Vehicle for dissemination, promoting automation and reproducibility

Shiny for Web Applications

Background

  • Shiny is an R package for building custom, interactive web applications
  • Allows users to interact with data, analytics, models, etc. however you see fit

Example (in R)

library(shiny) # Load package

# The user interface
ui <- 
  fluidPage(
    title = "MyApp", 
    sidebarLayout(
      sidebarPanel(
        selectInput(
          inputId = "color",
          label = "Choose Color",
          choices = c("red", "blue", "green")
        )
      ),
      mainPanel(
        plotOutput("my_plot")
      )
    )
  )

# How the inputs turn into outputs
server <- 
  function(input, output) {
    
    output$my_plot <-
      renderPlot({
        with(trees, plot(Height, Girth, col = input$color))
      })
    
  }

# Run the app
shinyApp(ui, server)

Deployment and Integration

How can we share it?

  • Start with free; iterate and pay as needed
    • Not all or nothing!
  • Repeat: use it to facilitate your analytical strategy

Options to get started

  • Posit Connect Cloud: Deploy Shiny apps, Quarto docs, and more to the web for free from GitHub
    • Now in beta with private deployment options (at low cost)
  • shinyapps.io: Deploy Shiny apps to the web; free basic plan
    • Pay for more features like security, domains, and authentication
  • Quarto Pub: Freely publish Quarto output to the web for public access
  • Shiny Server: Open-source (free) server to deploy Shiny apps and other content
    • Install on premises or the cloud
    • Intriguing option for orgs new to data science

How can I learn?

  • First, go install them and mess around


  • Make a (small) tangible goal, figure it out, then iterate
    • Document everything


  • Focus on automation, reporting, and reproducibility to start (low-hanging fruit)
    • Build the foundation for analytics maturity
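As one possible illustration of starting with automation and reporting, the sketch below wraps a routine summary in a reusable function so the same report can be regenerated whenever the data change. The helper name `summarize_numeric` is hypothetical (not from the talk), and the built-in `trees` dataset is used only as a stand-in for your own data.

```r
# Hypothetical helper: summarize every numeric column of a data frame
summarize_numeric <- function(data) {

  # Keep only the numeric columns
  numeric_cols <- data[vapply(data, is.numeric, logical(1))]

  # One row per variable with its mean and standard deviation
  data.frame(
    variable = names(numeric_cols),
    mean = vapply(numeric_cols, mean, numeric(1), na.rm = TRUE),
    sd = vapply(numeric_cols, sd, numeric(1), na.rm = TRUE),
    row.names = NULL
  )
}

# Re-run this line any time the underlying data are refreshed
report <- summarize_numeric(trees)
print(report)
```

Because the logic lives in one function, the same call could be dropped into a Quarto document or a Shiny app without being rewritten.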

ChatGPT

  • A companion for any and all questions along the way

Shiny Assistant

Tools in Action: riskcalc.org

What is it?

Background

  • riskcalc.org is a free-to-use repository of risk calculators for individualized medical decision making (~10-15K users/month)
  • Embedded with published predictive models
  • Each calculator is a Shiny application
  • Majority are regression models
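To make the idea of a regression model behind a calculator concrete, here is a minimal sketch in base R. It is not the code behind riskcalc.org: the built-in `mtcars` dataset stands in for a published clinical model, and `predict_risk` is a hypothetical helper name chosen for illustration.

```r
# Illustrative only: fit a logistic regression on a built-in dataset
# (outcome: probability that a car has a manual transmission)
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Hypothetical helper: predicted probability for one set of inputs
predict_risk <- function(model, wt) {
  new_data <- data.frame(wt = wt)
  unname(predict(model, newdata = new_data, type = "response"))
}

# In a Shiny calculator, `wt` would come from a user input (e.g., input$wt)
risk_light <- predict_risk(model, wt = 2.5)
risk_heavy <- predict_risk(model, wt = 4.0)
print(c(light = risk_light, heavy = risk_heavy))
```

In a Shiny app, a function like this would typically be called inside a render function in the server, so the displayed prediction updates whenever the user changes an input.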



The Backend

Figure: Tools supporting riskcalc.org


Key points

  • Can build working infrastructure for free (to start)


  • Incur cost for service as needed


  • AWS versus on premises

Acknowledgements

Co-developers

  • Alex Milinovich
  • Xinge (Kathy) Ji
  • Blaine Martyn-Dow

In Memory

Michael W. Kattan, PhD
Department Chair (2004-2024)
Quantitative Health Sciences
Cleveland Clinic

Creator of riskcalc.org

Questions?


Link to slides: www.zajichekstats.com/presentations/the-value-of-open-source

References

  1. https://www.gartner.com/en/newsroom/press-releases/2018-12-06-gartner-data-shows-87-percent-of-organizations-have-low-bi-and-analytics-maturity
  2. https://www.kearney.com/service/digital-analytics/ai-assessment-aia-2024-the-drive-for-greater-maturity-scale-and-impact