usdatasets Documentation
Welcome
The usdatasets package provides a comprehensive collection of datasets focused on the United States. It includes extensive data on topics such as crime and public safety, political history, economic indicators, education, public health, natural disasters, demographics, infrastructure, sports, and cultural events.
The package contains diverse data types, including historical political records, crime statistics, wage and income data, election results, mortality rates, presidential information, educational metrics, environmental incidents, asylum records, firefighter fatalities, terrorist activity, NFL statistics, stock market data, and entertainment industry records.
Philosophy
The author's vision is to create specialized dataset packages focused on specific themes and topics. Instead of searching through multiple generic data packages to find relevant datasets, users can go directly to a thematic package where all datasets are carefully curated around a particular subject.
In the case of usdatasets, every dataset is exclusively focused on the United States, making it the go-to resource for researchers, data scientists, educators, and analysts working with American data.
Cross-Platform Ecosystem
usdatasets has a sibling package of the same name in the R ecosystem. Keeping the two packages consistent means users can work with the same datasets whether they prefer Python or R.
This cross-platform approach reflects the author's commitment to making specialized datasets accessible to the widest possible audience, regardless of their preferred data analysis environment.
Getting Started
Installation
From PyPI (Recommended)
The easiest way to install usdatasets is directly from PyPI:
pip install usdatasets
From GitHub (Latest Development Version)
To get the latest development version with the newest features and bug fixes:
pip install git+https://github.com/lightbluetitan/usdatasets-py
Quick Start Tutorial
1. Import the Package
import usdatasets as usd
2. List Available Datasets
See all datasets included in the package:
# Get list of all datasets
datasets = usd.list_datasets()
print(datasets)
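Once the list is long, it can help to narrow it by keyword. The sketch below is one possible way to do that in plain Python. It assumes list_datasets() returns an iterable of name strings, which this page implies but does not state. The helper find_datasets and the sample_names list are illustrative and not part of usdatasets.

```python
def find_datasets(names, keyword):
    """Return the names that contain keyword (case-insensitive), sorted."""
    needle = keyword.lower()
    return sorted(name for name in names if needle in name.lower())


# Sample names taken from this documentation. In practice, pass
# usd.list_datasets() instead (assumed to yield strings).
sample_names = [
    "crime_and_incarceration_by_state",
    "shootings_2020",
    "shootings_2021",
]

matches = find_datasets(sample_names, "shootings")
print(matches)
```

With the package installed, the same helper could be called as find_datasets(usd.list_datasets(), "shootings").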
3. Load a Dataset
Load any dataset as a pandas DataFrame:
# Load crime and incarceration data
df = usd.load_dataset('crime_and_incarceration_by_state')
# Display first rows
print(df.head())
# Check dataset dimensions
print(f"Shape: {df.shape}")
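When several datasets are needed at once, a small wrapper can load them in one pass and record any names that fail. This is a sketch only: the loader is passed in as an argument because this page does not document which exception load_dataset raises for an unknown name. Both load_many and fake_loader are hypothetical names used for illustration.

```python
def load_many(names, loader):
    """Call loader(name) for each name; return (loaded, failed) dicts."""
    loaded, failed = {}, {}
    for name in names:
        try:
            loaded[name] = loader(name)
        except Exception as exc:
            # Broad catch on purpose: the exception type raised by
            # usdatasets for a bad name is not documented here.
            failed[name] = exc
    return loaded, failed


# Stand-in loader for demonstration only; replace with usd.load_dataset.
def fake_loader(name):
    known = {"shootings_2020": [1, 2, 3], "shootings_2021": [4, 5]}
    return known[name]  # raises KeyError for unknown names


loaded, failed = load_many(["shootings_2020", "not_a_dataset"], fake_loader)
print("loaded:", sorted(loaded))
print("failed:", sorted(failed))
```

With the real package, load_many(names, usd.load_dataset) would return a dictionary of DataFrames keyed by dataset name, plus a dictionary of the errors encountered.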
Basic Concepts
Dataset Naming Convention
All dataset names in usdatasets follow a consistent naming pattern:
- Lowercase with underscores: crime_and_incarceration_by_state
- Descriptive names that reflect content
- Some include time periods: shootings_2020, shootings_2021
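The convention above can be expressed as a pair of regular expressions. The sketch below is an interpretation of the pattern, not a rule enforced by the package: it assumes names consist of lowercase letters and digits separated by single underscores, and that a time period appears as a trailing four-digit year. The helper names follows_convention and year_suffix are illustrative.

```python
import re

# Lowercase alphanumeric words joined by single underscores (assumed pattern).
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")
# Optional trailing four-digit year such as _2020 (assumed pattern).
YEAR_SUFFIX = re.compile(r"_((?:19|20)\d{2})$")


def follows_convention(name):
    """True if the whole name matches the lowercase_with_underscores pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


def year_suffix(name):
    """Return the trailing year as an int, or None if the name has none."""
    match = YEAR_SUFFIX.search(name)
    return int(match.group(1)) if match else None


for name in ["crime_and_incarceration_by_state", "shootings_2020", "shootings_2021"]:
    print(name, follows_convention(name), year_suffix(name))
```

A helper like year_suffix could be used to group yearly datasets such as shootings_2020 and shootings_2021 before combining them.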
Dataset Categories
Datasets are organized into thematic categories:
- Crime & Public Safety: Shootings, incarceration, terrorism, firefighter fatalities
- Politics & Government: Presidents, elections, Congress, executive orders, pardons
- Economy & Finance: Wages, income, stock prices
- Education: Colleges, wage premiums
- Public Health: Mortality rates, causes of death, asylum data
- Environment: Wildfires, radiation monitoring
- Sports & Entertainment: NFL statistics, American Idol data
- Infrastructure: EV charging stations
- Culture: Holidays, UFO sightings
Data Licenses
All datasets maintain their original open-source licenses:
- Most datasets use CC0: Public Domain (free for any use)
- Some use MIT License or Apache 2.0
- The usdatasets package itself is licensed under the MIT License