meddatasets Documentation

Welcome

The meddatasets package provides a curated collection of medical and healthcare datasets from around the world. Topics include cancer diagnostics, chronic disease records, epidemiology statistics, hospital management data, public health indicators, smoking and cancer risk, worldwide COVID-19 case records, clinical research data, and the impact of water pollution on health.

Philosophy

The author's vision is to create specialized dataset packages focused on specific themes and topics. Instead of searching through multiple generic data packages to find relevant datasets, users can go directly to a thematic package where all datasets are carefully curated around a particular subject.

In meddatasets, every dataset focuses on medicine, healthcare, clinical research, or public health. The package is intended for researchers, data scientists, clinicians, epidemiologists, public health analysts, healthcare administrators, and students in the medical, biomedical, and health sciences fields.

Cross-Platform Ecosystem

meddatasets has a sibling package in the R ecosystem, also called meddatasets, so users can work with the same datasets whether they prefer Python or R.

This cross-platform approach reflects the project's commitment to making specialized datasets accessible to the widest possible audience, regardless of the preferred data analysis environment.

Getting Started

Installation

From PyPI (Recommended)

The easiest way to install meddatasets is directly from PyPI:

pip install meddatasets

From GitHub (Latest Development Version)

To get the latest development version with the newest features and bug fixes:

pip install git+https://github.com/lightbluetitan/meddatasets-py
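
After installing by either route, it can be useful to confirm which version is present. The sketch below uses only the Python standard library and assumes the installed distribution name matches the `pip install` name above; `installed_version` is a hypothetical helper for illustration, not part of meddatasets.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumption: the distribution is registered under the name "meddatasets".
found = installed_version("meddatasets")
print(found if found else "meddatasets is not installed")
```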

Quick Start Tutorial

1. Import the Package

import meddatasets as md

2. List Available Datasets

See all datasets included in the package:

# Get list of all datasets
datasets = md.list_datasets()
print(datasets)
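
When the list is long, filtering it by keyword can help. This is a minimal sketch that assumes `md.list_datasets()` returns an iterable of dataset name strings, which is not confirmed here. `find_datasets` is a hypothetical helper, and the sample names are the ones mentioned in this documentation, so the snippet runs without the package installed.

```python
def find_datasets(names, keyword):
    """Return the names that contain `keyword`, ignoring case, in sorted order."""
    needle = keyword.lower()
    return sorted(name for name in names if needle in str(name).lower())


# Stand-in list; in practice pass md.list_datasets() instead (assumed to yield strings).
sample_names = [
    "hypertension_risk",
    "nfl_concussion_injuries",
    "human_glaucoma",
    "us_polio_cases",
    "smoking_cancer_risk",
]

risk_names = find_datasets(sample_names, "risk")
print(risk_names)
```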

3. Load a Dataset

Load any dataset as a pandas DataFrame:

# Load nfl_concussion_injuries
df = md.load_dataset('nfl_concussion_injuries')

# Display first rows
print(df.head())

# Check dataset dimensions
print(f"Shape: {df.shape}")
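
This documentation does not say what happens when a dataset name is misspelled, so a small guard around loading may be convenient. The sketch below is an assumption-based illustration: `load_checked` is a hypothetical helper that takes the list of available names and a loader callable, and the stand-in loader only exists so the snippet runs without the package.

```python
from difflib import get_close_matches


def load_checked(name, available, loader):
    """Call loader(name) only if name is in available; otherwise suggest close names."""
    names = list(available)
    if name not in names:
        hints = get_close_matches(name, names, n=3)
        suffix = f" Did you mean: {', '.join(hints)}?" if hints else ""
        raise KeyError(f"Unknown dataset '{name}'.{suffix}")
    return loader(name)


def fake_loader(name):
    """Stand-in for md.load_dataset so the sketch is self-contained."""
    return f"<data for {name}>"


# Names taken from this documentation; in practice use md.list_datasets().
known = ["hypertension_risk", "nfl_concussion_injuries", "human_glaucoma", "us_polio_cases"]

print(load_checked("us_polio_cases", known, fake_loader))

try:
    load_checked("us_polio_case", known, fake_loader)  # deliberate typo
except KeyError as err:
    print(err.args[0])
```

With the real package, the intended call would look like `load_checked('us_polio_cases', md.list_datasets(), md.load_dataset)`, assuming `md.list_datasets()` returns the names as strings.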

4. Describe a Dataset

Print the description of a dataset:

# Describe a dataset
print(md.describe('smoking_cancer_risk'))

Basic Concepts

Dataset Naming Convention

All dataset names in meddatasets follow a consistent naming pattern:

Lowercase with underscores: smoking_cancer_risk
Descriptive names that reflect content
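
The naming pattern above can be checked mechanically. The following is a rough sketch under stated assumptions: the regular expression is one possible reading of "lowercase with underscores" (it also permits digits after the first letter, which the text does not specify), and `is_conventional_name` is a hypothetical helper, not part of the package.

```python
import re

# Assumed reading of the convention: lowercase words joined by single underscores.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_conventional_name(name):
    """Return True if `name` is lowercase with single underscores between words."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["smoking_cancer_risk", "Smoking_Cancer_Risk", "smoking-cancer-risk"]:
    print(candidate, is_conventional_name(candidate))
```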

Some Datasets Available in meddatasets

Every dataset focuses on medicine and healthcare and is intended for data analysis, clinical research, epidemiology, and education. A few examples:

hypertension_risk: Dataset containing clinical variables for hypertension risk prediction.
nfl_concussion_injuries: Dataset containing concussion injuries in the NFL from 2012-2014.
human_glaucoma: Dataset containing patient age, ocular pressure, and eye side for glaucoma diagnosis.
us_polio_cases: Dataset containing polio cases and deaths in the United States from 1910-2019.

Data Licenses

All datasets retain their original open licenses:

Most datasets use CC0: Public Domain (free for any use)
Some use MIT License or Apache 2.0
The meddatasets package itself is licensed under MIT