usdatasets Documentation

Welcome

The usdatasets package provides a comprehensive collection of datasets focused on the United States. It includes extensive data on topics such as crime and public safety, political history, economic indicators, education, public health, natural disasters, demographics, infrastructure, sports, and cultural events.

The package contains diverse data types, including historical political records, crime statistics, wage and income data, election results, mortality rates, presidential information, educational metrics, environmental incidents, asylum records, firefighter fatalities, terrorist activity, NFL statistics, stock market data, and entertainment industry records.

Philosophy

The author's vision is to create specialized dataset packages focused on specific themes and topics. Instead of searching through multiple generic data packages to find relevant datasets, users can go directly to a thematic package where all datasets are carefully curated around a particular subject.

In the case of usdatasets, every dataset is exclusively focused on the United States, making it the go-to resource for researchers, data scientists, educators, and analysts working with American data.

Cross-Platform Ecosystem

usdatasets has a sibling package of the same name in the R ecosystem. The two packages are kept consistent, so users can work with the same datasets whether they prefer Python or R.

This cross-platform approach reflects the author's commitment to making specialized datasets accessible to the widest possible audience, regardless of their preferred data analysis environment.

Getting Started

Installation

From PyPI (Recommended)

The easiest way to install usdatasets is directly from PyPI:

pip install usdatasets

From GitHub (Latest Development Version)

To get the latest development version with the newest features and bug fixes:

pip install git+https://github.com/lightbluetitan/usdatasets-py
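
After either install route, it can help to confirm which version ended up in the current environment. The sketch below uses only the standard library's importlib.metadata. The helper name installed_version is illustrative and not part of usdatasets, and it assumes the distribution is registered under the name usdatasets, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="usdatasets"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("usdatasets is not installed in this environment")
    else:
        print(f"usdatasets {found} is installed")
```

Returning None instead of raising keeps the check usable in setup scripts that want to decide whether to install the package.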

Quick Start Tutorial

1. Import the Package

import usdatasets as usd

2. List Available Datasets

See all datasets included in the package:

# Get list of all datasets
datasets = usd.list_datasets()
print(datasets)
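
With many datasets in the package, narrowing the list by keyword is often easier than scanning it by eye. A minimal sketch, assuming usd.list_datasets() returns an iterable of name strings (the return type is not stated on this page). find_datasets is a hypothetical helper, not part of usdatasets, and the sample names are the ones mentioned in this documentation.

```python
def find_datasets(names, keyword):
    """Return the sorted names that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return sorted(name for name in names if needle in name.lower())


# Sample names taken from this page; in practice, pass usd.list_datasets()
# (assumed here to yield plain strings).
sample_names = [
    "shootings_2021",
    "crime_and_incarceration_by_state",
    "shootings_2020",
]

print(find_datasets(sample_names, "shootings"))
```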

3. Load a Dataset

Load any dataset as a pandas DataFrame:

# Load crime and incarceration data
df = usd.load_dataset('crime_and_incarceration_by_state')

# Display first rows
print(df.head())

# Check dataset dimensions
print(f"Shape: {df.shape}")
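
A mistyped dataset name is an easy mistake to make. The sketch below checks a requested name against the available names before loading and suggests close matches using the standard library's difflib. It assumes the names come from usd.list_datasets() as strings. suggest_names is a hypothetical helper, and how load_dataset itself reacts to an unknown name is not covered here.

```python
import difflib


def suggest_names(requested, available, limit=3):
    """Return [requested] if it is available, else up to `limit` close matches."""
    available = list(available)
    if requested in available:
        return [requested]
    return difflib.get_close_matches(requested, available, n=limit, cutoff=0.6)


# Names taken from this page; in practice, pass usd.list_datasets().
known = ["crime_and_incarceration_by_state", "shootings_2020", "shootings_2021"]

print(suggest_names("shootings_2020", known))  # exact match
print(suggest_names("shooting_2021", known))   # typo: best matches first
```

An empty result means nothing similar was found, which is a reasonable point to print the full list instead.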

Basic Concepts

Dataset Naming Convention

All dataset names in usdatasets follow a consistent naming pattern:

Lowercase with underscores: crime_and_incarceration_by_state
Descriptive names that reflect content
Some include time periods: shootings_2020, shootings_2021
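
The convention above can be expressed as a small check. This is a sketch of the pattern as described on this page (lowercase letters and digits joined by single underscores, optionally ending in a four-digit year). The regular expressions and both helper names are assumptions for illustration, not something usdatasets exposes or enforces.

```python
import re

# Lowercase alphanumeric words separated by single underscores.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")
# A trailing four-digit year such as the 2020 in shootings_2020.
YEAR_SUFFIX = re.compile(r"_(\d{4})$")


def follows_convention(name):
    """True if name is lowercase words joined by single underscores."""
    return NAME_PATTERN.fullmatch(name) is not None


def trailing_year(name):
    """Return the trailing four-digit year as an int, or None if there is none."""
    match = YEAR_SUFFIX.search(name)
    return int(match.group(1)) if match else None


for name in ["crime_and_incarceration_by_state", "shootings_2020", "Shootings-2021"]:
    print(name, follows_convention(name), trailing_year(name))
```

Extracting the year this way makes it straightforward to group related datasets, such as shootings_2020 and shootings_2021, by period.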

Dataset Categories

Datasets are organized into thematic categories:

Crime & Public Safety: Shootings, incarceration, terrorism, firefighter fatalities
Politics & Government: Presidents, elections, Congress, executive orders, pardons
Economy & Finance: Wages, income, stock prices
Education: Colleges, wage premiums
Public Health: Mortality rates, causes of death, asylum data
Environment: Wildfires, radiation monitoring
Sports & Entertainment: NFL statistics, American Idol data
Infrastructure: EV charging stations
Culture: Holidays, UFO sightings

Data Licenses

Each dataset retains the license of its original source:

Most datasets use CC0: Public Domain (free for any use)
Some use MIT License or Apache 2.0
The usdatasets package itself is licensed under MIT