
  • A.  Code Description

A small investment company needs to rebalance its equity fund’s holdings, which comprise 150 U.S. companies. The program cleans the data, groups business IDs by state, runs descriptive statistics on all the financial information provided, identifies businesses with negative debt-to-equity ratios, and calculates a debt-to-income ratio. This data analysis will provide actionable insights to the fund managers, who are looking to rebalance the fund’s holdings and assess risk. The code for Task 2 was executed using Python in the IntelliJ IDE and was pulled, committed, and pushed to GitLab for version control.

    import pandas as pd

    import openpyxl as xl

    #load workbook and specify columns to exclude unnecessary data

    file_path = '/Users/meredithsmith/PycharmProjects/PythonProject/PythonProject/D598PA1/D598_Data_Set.xlsx'

    df = pd.read_excel(file_path, usecols=['Business ID', 'Business State', 'Total Long-term Debt', 'Total Equity', 'Debt to Equity', 'Total Liabilities', 'Total Revenue', 'Profit Margin'], sheet_name='1-150 V2')

    #check df rows and columns

    print(df.head())

    This script imports the necessary packages and loads the Excel sheet into a pandas data frame.

    Step by step:

    • Import the necessary packages, pandas and openpyxl
    • Load the data frame ‘df’ from the Excel spreadsheet
    • Listing only the necessary columns of the data frame (df) excludes any unneeded data

    #check for duplicate rows

    duplicates = df[df.duplicated(keep=False)]

    print(duplicates)

    #check data types

    df['Business State'] = df['Business State'].fillna('').astype('string')

    print(df.dtypes)

    #correct capitalization in 'Business State' column

    df['Business State'] = df['Business State'].str.title()

    df = df.apply(lambda col: col.str.strip() if col.dtype == 'object' else col)

    df = df.astype({col: 'string' for col in df.select_dtypes(include='object').columns})

    df.fillna(' ', inplace=True)

    #store and preview the cleaned data frame as df2

    df2 = df.copy()

    print(df2.head())

    This script cleans data.

    Step by step:

    • Check for and preview any duplicate rows
    • Convert all object columns to the ‘string’ data type (dtype)
      • Fill missing values with empty strings
      • Check data types
    • Correct capitalization errors discovered with print(df.head()) in the previous code block
    • Remove whitespace from string entries across the entire data frame
    • Replace remaining ‘nan’ values with a blank space
    • Store and preview the cleaned data frame as ‘df2’

    #group businesses by states and calculate descriptive stats

    df3 = df2.groupby(['Business State']).agg({

    'Total Long-term Debt': ['mean', 'median', 'min', 'max'],

    'Total Equity': ['mean', 'median', 'min', 'max'],

    'Debt to Equity': ['mean', 'median', 'min', 'max'],

    'Total Liabilities': ['mean', 'median', 'min', 'max'],

    'Total Revenue': ['mean', 'median', 'min', 'max'],

    'Profit Margin': ['mean', 'median', 'min', 'max']

    })

    Step by Step:

    • This script groups the cleaned data frame ‘df2’ by state
    • Calculates descriptive statistics (mean, median, min, max) for every numerical column
    #Flatten the multi-index columns
    df3.columns = ['_'.join(col).strip() for col in df3.columns.values]
    df3 = df3.reset_index()
    print(df3)

    Step by Step:

    • This code flattens the hierarchical column names. Example:
      • (‘Profit Margin’, ‘mean’) becomes the single level ‘Profit Margin_mean’
    • reset_index moves the group index back into the data frame as a column
    • Preview ‘df3’

    Output Description:

    The new data frame ‘df3’ has 24 columns and 40 rows. The businesses are grouped by [‘Business State’], with descriptive statistics for each numerical column.

    #filter and identify businesses with negative debt-to-equity (DE) ratios

    df4 = df2[df2['Debt to Equity'] < 0][['Business ID', 'Debt to Equity']].copy()

    print(df4)

    Step by step:

    • Creates new df4
    • Filter [‘Business ID’] that have negative [‘Debt to Equity’] ratios
    • Preview df4

    Output description:

    • 7 [‘Business ID’] values with negative [‘Debt to Equity’]

    #create a new column, calculate the debt-to-income (DI) ratio, and add it to a copy of df2 (the cleaned original df)

    df5 = df2.copy()

    df5['DI_ratio'] = df5['Total Long-term Debt'] / df5['Total Revenue']

    print(df5.columns)

    Step by Step:

    • Copies df2 and stores it as df5
    • Creates a new column [‘DI_ratio’] that holds the calculated debt-to-income ratio
    • Previews the df5 columns

    Output description:

    The new data frame df5 has 9 columns and 150 rows, adding [‘DI_ratio’] to the cleaned df2 columns.

    B.  4 Customized Data Visualizations

    (See GitLab link)

    C.  Creating Custom Visualizations in Jupyter Notebook

    Viz Overview

    The following four visualizations, described below, were made using Python in a Jupyter notebook. The %matplotlib inline magic was used to keep static images embedded in the notebook.

    The Jupyter notebook “is best utilized for development, collaboration, sharing, and even publication of data science results” (VanderPlas, 2022).

    • Viz 1
      • Divides and color codes positive and negative [‘Debt to Equity’] values grouped by [‘Business ID’]
    • Viz 2
      • Bar chart that plots [‘Business_Counts’] grouped by [‘Business State’]
    • Viz 3
      • Shows [‘Debt to Equity_mean’] in descending order and [‘DI_ratio_mean’] in a scatter plot with all the businesses grouped by [‘State’]
    • Viz 4
      • Plots the [‘Profit Margin_mean’] in descending order by [‘State’].

    Project Code Prep:

    #import script

    %run -i /Users/meredithsmith/IdeaProjects/D598PA1/D598PA2.py

    Step by Step:

    • This Jupyter magic command imports Python script and variables from Task 2 code to Jupyter notebook
    • It also prints index and lists columns

    from D598PA2 import df3, df5, df, df2

    %matplotlib inline

    import pandas as pd

    import numpy as np

    Step by Step:

    • Imports data frames from Task 2 program
    • Imports necessary packages for program

    Viz 1 – Negative D/E Ratio by ID with Labeled Markers

    Viz Code prep:

    #Viz1 Neg and Pos DE by ID

    import matplotlib.pyplot as plt

    # Sort data

    dfDE = df5.copy()
    dfDE = dfDE.sort_values(by='Debt to Equity', ascending=False)

    # Split into positive and negative segments

    pm_pos = np.where(dfDE['Debt to Equity'] >= 0, dfDE['Debt to Equity'], np.nan)

    pm_neg = np.where(dfDE['Debt to Equity'] < 0, dfDE['Debt to Equity'], np.nan)

    Step by Step:

    • Creates ‘dfDE’, a copy of ‘df5’ sorted by [‘Debt to Equity’] in descending order.
    • Separates positive and negative [‘Debt to Equity’] values.

    # Plot all points

    fig, ax = plt.subplots(figsize=(12, 8))

    # Plot positive DE in teal

    ax.scatter(dfDE['Business ID'], pm_pos, marker='x', color='teal', label='Positive DE')

    # Plot negative DE in red

    ax.scatter(dfDE['Business ID'], pm_neg, marker='<', color='red', label='Negative DE')

    # Round for clean labels

    dfDE['Debt to Equity'] = dfDE['Debt to Equity'].round(2)

    Step by Step:

    • Differentiates positive and negative [‘Debt to Equity’] values for each [‘Business ID’] by color coding and labeling.
    • Rounds [‘Debt to Equity’] to two decimal places

    # Filter negative values

    neg_mask = dfDE['Debt to Equity'] < 0

    neg_points = dfDE[neg_mask]

    # Annotate only negative points

    for i in neg_points.index:

        business_id = dfDE.loc[i, 'Business ID']

        equity_val = dfDE.loc[i, 'Debt to Equity']

        label = f"{business_id} ({equity_val})"

        ax.text(business_id, equity_val - 0.5, label, fontsize=8, color='red', ha='center')

    Step by Step:

    • ‘neg_mask’ creates a Boolean series to filter negative [‘Debt to Equity’] values
    • ‘neg_points’ applies the mask above to return only negative values
    • Annotate only negative points
      • Iterates through flagged negative points
      • Retrieves [‘Business ID’] and [‘Debt to Equity’] from ‘dfDE’
      • Places the point label on the plot just below the point
      • Uses red text for the negative points and labels

    # Create legend handles manually

    import matplotlib.patches as mpatches

    handles = [mpatches.Patch(color='red', label=f"{row['Business ID']}: {row['Debt to Equity']:.2f}")

               for _, row in neg_points.iterrows()]

    Step by Step:

    • Builds one legend handle per negative [‘Debt to Equity’] point, labeled with its [‘Business ID’] and negative DE value.

    # Layout

    ax.set_title("Negative D/E by ID with Labeled Markers")

    ax.set_xlabel("Business ID")

    ax.set_ylabel("Debt to Equity")

    ax.legend(handles=handles, title="IDs with neg D/E", loc='upper right', fontsize=9)

    ax.grid(True, axis='y', linestyle='--', alpha=0.5)

    plt.tight_layout()

    plt.show()

    Step by step:

    • Sets plot title and axes labels
    • Creates a legend that prints out the numerical values for [‘Business ID’] and negative [‘Debt to Equity’] for easier accessibility

    Viz Description:

    This scatter plot shows [‘Debt to Equity’] in descending order, grouped by [‘Business ID’]. The plot has the negative [‘Debt to Equity’] and [‘Business ID’] points labeled in red, as well as a legend to increase readability.

    Viz 2 – Business Counts by State Bar Chart

    #Viz2 Business Counts by State Bar Chart

    df_counts = df5.copy()

    df_counts = df_counts.groupby('Business State').size().reset_index(name='Business_Counts')

    print(df_counts)

    plt.figure(figsize=(12, 8))

    plt.style.use('seaborn-v0_8-whitegrid')

    Step by Step:

    • Creates ‘df_counts’ from a copy of ‘df5’
      • Groups by and counts [‘Business State’]
      • Creates new column [‘Business_Counts’]
    • Previews ‘df_counts’
    • Sets the canvas size
    • Sets the canvas style

    # Bar plot

    plt.bar(df_counts['Business State'], df_counts['Business_Counts'], color='steelblue', edgecolor='black')

    plt.title('Business Counts by State')

    plt.ylabel('Business Count')

    plt.xticks(rotation=65)

    plt.grid(True, linestyle='--', alpha=0.5)

    plt.tight_layout()

    Step by Step:

    • Creates a bar chart to show [‘Business_Counts’] grouped by [‘Business State’]
      • Assigns bar color and edge color
    • Sets the bar chart title and y-axis label, rotates x ticks for readability, and adds a horizontal grid

    # Annotate each bar with its count

    for idx, row in df_counts.iterrows():

        plt.text(row['Business State'], row['Business_Counts'] + 0.25,

                 str(row['Business_Counts']),

                 ha='center', fontsize=14, color='darkred')

    plt.show()

    Step by Step:

    • Annotates each bar with [‘Business_Counts’]

    Viz Description:

    This bar chart shows how many businesses are in each state, with the [‘Business_Counts’] value labeled atop each bar.

    Viz 3 – Average D/E and DI Ratios by State

    Code prep:

    #Viz3 code and plot begins here#####################################

    from matplotlib import pyplot as plt

    #Code prep

    #Sort by Business State to ensure proper plotting

    dfState = df5.copy()

    dfState = dfState.sort_values('Business State')

    # Dictionary mapping full names to abbreviations

    state_abbrev = {
        'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',
        'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE',
        'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID',
        'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
        'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
        'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS',
        'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV',
        'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY',
        'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
        'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
        'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT',
        'Vermont': 'VT', 'Virginia': 'VA', 'Washington': 'WA', 'Washington D.C.': 'DC',
        'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
    }

    #Replace full names with abbreviations in 'Business State' column

    dfState['State'] = dfState['Business State'].map(state_abbrev)

    dfState = dfState['State'].astype('string')

    dfState.fillna(' ', inplace=True)

    print(dfState)

    Step by Step:

    • ‘dfState’ is created from a copy of ‘df5’ to store the state abbreviation for each [‘Business State’].
    • Converts the values to the string dtype with .astype(‘string’)
    • Fills ‘na’ values with blank spaces
    • Previews ‘dfState’
    • ‘dfState’ has 1 column and 150 rows

    df_combo2 = pd.concat([dfState, df5], axis=1)

    df_new = df_combo2.copy()

    df_new = df_new.groupby(['State']).agg({

        'Total Long-term Debt': ['mean'],

        'Total Equity': ['mean'],

        'Debt to Equity': ['mean'],

        'Total Liabilities': ['mean'],

        'Total Revenue': ['mean'],

        'Profit Margin': ['mean'],

        'DI_ratio': ['mean']

    })

    #Flatten the MultiIndex Columns

    df_new.columns = ['_'.join(col).strip() for col in df_new.columns.values]

    df_new = df_new.reset_index()

    print(df_new)

    Step by Step:

    • Concatenates state abbreviation data frame ‘dfState’ with df5
    • ‘df5’ has 9 columns, 150 rows
    • ‘dfState’ has 1 column, 150 rows
    • ‘df_combo2’ has 10 columns, 150 rows
    • ‘df_new’ is created from a copy of ‘df_combo2’ which concatenates ‘dfState’ to ‘df5’ to store the mean of each numerical value grouped by [‘State’]
      • [‘State’] is a created column from a dictionary of state abbreviations for easier graphical representation purposes.
    • ‘df_new’ multi-index columns are flattened
    • ‘df_new’ has 8 columns and 40 rows, stores the means of all numerical data of each [‘State’]
    df_sorted1 = df_new.copy()
    df_sorted1 = df_sorted1.sort_values(by='Debt to Equity_mean', ascending=False)
    fig, ax = plt.subplots(figsize=(12,6))

    Step by Step:

    • ‘df_sorted1’ is created from a copy of ‘df_new’
    • ‘df_sorted1’ is sorted by [‘Debt to Equity_mean’], a product of the flattened multi-index columns, from largest to smallest
    • ‘fig’ is the container/canvas; ‘ax’ is the individual plot area
    • figsize=(12, 6) sets the canvas to 12 inches wide by 6 inches tall
    ax.scatter(df_sorted1['State'].astype(str), df_sorted1['Debt to Equity_mean'], label='DE', color='teal', marker='x')
    ax.scatter(df_sorted1['State'].astype(str), df_sorted1['DI_ratio_mean'], label='DI', color='blue', marker='o')

    Step by Step:

    • This script layers two scatter plots on one axis to visualize both [‘Debt to Equity_mean’] and [‘DI_ratio_mean’] by [‘State’]
    • Both axes are sorted by [‘State’], which is converted to a string for categorical plotting
    • Both axes are assigned a label, color, and marker
    #Labels and Title
    ax.set_xlabel('State')
    ax.set_ylabel('Avg DE and DI')
    plt.title('Avg DE and DI by State')
    plt.xticks(rotation=70)
    ax.legend(loc='upper right')
    ax.grid(True, axis='y', linestyle='--', alpha=0.5)
    plt.show()

    Step by step:

    • Sets x and y axis labels
    • Creates plot title
    • Rotates x-tick labels 70 degrees for readability
    • Creates the legend and positions it in the upper right of the canvas
    • Creates a horizontal grid with a dashed line style and reduced opacity
    • Shows plot

    Viz Description:

    The outcome of this code presents a scatter plot with two series sorted by descending [‘Debt to Equity_mean’] values. The [‘Debt to Equity_mean’] series has teal ‘x’ markers; the [‘DI_ratio_mean’] series has blue ‘o’ markers. The markers indicate the corresponding value for each [‘State’]. This allows the stakeholders to visualize which states have the highest and lowest average DE and DI ratios, providing insights into the riskier states whose businesses may need to be rebalanced due to low DE or DI.

    Viz 4 – Average Profit Margin by State

    # Viz 4 Average Profit Margin by State 
    dfPM = df_new.sort_values(by='Profit Margin_mean')

    # Split into positive and negative segments
    pm_pos = np.where(dfPM['Profit Margin_mean'] >= 0, dfPM['Profit Margin_mean'], np.nan)
    pm_neg = np.where(dfPM['Profit Margin_mean'] < 0, dfPM['Profit Margin_mean'], np.nan)

    Step by Step:

    • Creates ‘dfPM’ from ‘df_new’ sorted by [‘Profit Margin_mean’] in ascending order
    • ‘pm_pos’ and ‘pm_neg’ split the [‘Profit Margin_mean’] values into positive and negative segments
    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot positive profit margins in teal
    ax.plot(dfPM['State'], pm_pos, marker='o', color='teal', label='Positive PM')

    # Plot negative profit margins in red
    ax.plot(dfPM['State'], pm_neg, marker='x', color='red', label='Negative PM')
    • See Viz 1 for similar descriptions
    # Labels and layout
    ax.set_title("Average Profit Margin by State")
    ax.set_xlabel("State")
    ax.set_ylabel("Profit Margin (%)")
    ax.legend(loc='lower right')
    plt.xticks(rotation=45)
    plt.tight_layout()
    ax.grid(True, axis='y', linestyle='--', alpha=0.5)
    plt.show()
    • See Viz 1 for similar explanation

    Viz 4 Description:

    This visualization provides insights into the states that have high or low [‘Profit Margin_mean’]. The value in splitting positive and negative segments is apparent in this visualization, as it clearly shows Virginia has lost a lot of money in the last quarter.

    Conclusion

                This code turned an 8-column, 150-row Excel spreadsheet into a dozen different data frames to provide insights to the fund managers of the small investment company. Four visualizations were created to show the stakeholders which states and businesses are at high risk. This analysis helps stakeholders to make decisions on how to rebalance their funds for the next quarter.

                Each visualization measures how the individual businesses or states compare to the rest. How they did or did not stand out became apparent in each viz when plotting the businesses’ debt-to-equity and debt-to-income ratios and profit margins. For example, Virginia has a negative average profit margin of -24.30, and Texas has three businesses in the top five highest DI ratios; however, Texas contains 14 businesses, which should be factored in before drawing a conclusion. The business count bar chart plots the number of businesses per state so that the stakeholders will not make a skewed judgment based on state results alone.

    D.  References

    VanderPlas, J. (2022). Python Data Science Handbook. Sebastopol, CA: O’Reilly Media. https://research.ebsco.com/linkprocessor/plink?id=784a5973-0f83-38c2-baff-ec6631b66091. Accessed September 7, 2025.

  • The Data Analytics Life Cycle

                      The data analytics life cycle is a process data analysts use to support data-driven decision-making. This structured life cycle helps businesses understand the value of their data to bolster growth, profitability, and innovation. Business understanding, data acquisition, data cleaning, data exploration, data modeling, data mining, and reporting and visualization each play an essential role in the data analytics life cycle. In addition to a snapshot of the data analytics life cycle, this paper will discuss data cleaning with Python and some ethical problems in the data cleaning process.

    A. The Seven Phases of the Data Analytics Life Cycle

    Business Understanding

    Understanding, planning, and discovering the business’s objectives, stakeholders, and project scope is foundational to the data analytics life cycle. The business understanding phase aims to understand the objectives that inspire the need for the data analysis. Data analysts communicate with the stakeholders to establish goals and the key questions. (WGU, 2.1.1)

    Data analysts can gain expertise in business understanding by communicating with stakeholders and researching the business’s history. For example, within a retail business, data analysts will collaborate with the owner or management staff to plan a time series analysis of when customers purchase certain products throughout the year to understand demand and create a sales forecast. The data analyst and stakeholders will set the project’s goal and define the key questions.

    Data Acquisition

    Gathering, extracting, querying, and collecting data occur under this data analytics life cycle phase. Data acquisition can be executed by doing online research and/or acquiring reports provided to the organization’s data analyst. The analyst may have to extract data from the business by querying the data. (WGU, 2.1.1)

    Expertise in acquiring data requires proficient knowledge of query languages such as SQL. Other data collection avenues are surveys, web scraping, building data pipelines, and utilizing APIs to download data from external sources. An example of data acquisition in a sales forecast prediction project would require obtaining financial records of the retail business to understand its historical revenue trends.
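    As a minimal sketch of the querying step, pandas can pull rows straight from a database connection; the in-memory SQLite database, the sales table, and its values below are all invented for illustration:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the retailer's sales records:
# a small in-memory SQLite database (table/column names invented)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-05", 120.0), ("2024-01-12", 95.5), ("2024-02-03", 210.0)],
)

# Query the historical revenue into a data frame for analysis
sales_df = pd.read_sql("SELECT order_date, revenue FROM sales", conn)
print(sales_df.shape)  # (3, 2)
conn.close()
```

    In practice the connection would point at the organization's actual database rather than an in-memory stand-in.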

    Data Cleaning

    Data cleaning, also called wrangling, scrubbing, or munging data, is often overlooked. This phase involves identifying and fixing improperly formatted values, duplicates, missing data, and outliers. Overlooking the data cleaning phase will likely result in an inaccurate project outcome. (WGU, 2.1.1)

    Data cleaning expertise consists of having expert knowledge of SQL, Python, R, and/or Excel. A thorough understanding of these tools lets the data analyst know which tool is best for their project to efficiently and effectively modify, transform, and/or reduce data. For example, in a sales forecast prediction project, the analyst will upload a dataset to Excel, Python, or R to check for errors such as duplicates, improperly formatted data, and outliers.
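    The error checks named above can be sketched in pandas; the tiny data frame, its column names, and its values are hypothetical stand-ins for a real upload:

```python
import pandas as pd

# Hypothetical mini data set containing the error types named above:
# an exact duplicate row, inconsistent capitalization, and a missing value
raw = pd.DataFrame({
    "Business State": ["texas", "Texas", "texas", None],
    "Total Revenue": [100.0, 250.0, 100.0, 300.0],
})

dupes = raw[raw.duplicated(keep=False)]          # flag exact duplicate rows
raw["Business State"] = (raw["Business State"]
                         .fillna("")
                         .str.strip()
                         .str.title())           # normalize capitalization
clean = raw.drop_duplicates().reset_index(drop=True)

print(len(dupes), len(clean))
```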

    Data Exploration

    Identifying basic correlations between variables and discovering patterns helps the analyst to understand the data they are working with. Data analysts can utilize exploratory data analysis (EDA) and descriptive statistics to summarize the main characteristics of a data set. This phase creates an understanding of the data structure and identification of patterns, trends, and anomalies. (WGU 2.1.1)

    Expertise is gained in data exploration through an extensive knowledge of descriptive statistics to create numerical summaries and data visualizations. For example, the data analyst will develop sales statistics from the data for a sales forecast prediction project. The data can then be presented in a visualization with a line chart to show trends and changes in sales over time.
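    A minimal sketch of this numerical-summary step, using invented monthly revenue figures, might look like:

```python
import pandas as pd

# Hypothetical monthly sales figures for the retail example above
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [1200.0, 950.0, 1400.0, 1600.0],
})

# Numerical summary: count, mean, std, min, quartiles, max
summary = sales["revenue"].describe()
print(summary["mean"], summary["max"])

# A line chart of the same series would then show the trend over time:
# sales.plot(x="month", y="revenue", kind="line")
```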

    Predictive Modeling

    Predictive modeling, regression models, time series, and data modeling extend exploratory data analysis to mathematical models. This phase involves estimating the likelihood of a future event or future values using historical data, various statistical algorithms, and machine learning techniques. Predictive modeling is important to businesses as it aids them in decision making, risk management, resource optimization, understanding customer insights, having a competitive advantage, cost reduction, and having improved outcomes (geeksforgeeks.org, 2024).

    Predictive modeling expertise involves automating the training and use of models using tools such as Python and R to create statistical algorithms. The data analyst will learn from the previous sales data analyzed through EDA and statistical summaries and predict future sales from past trends. For example, in a sales forecast prediction project, the sales data is preprocessed to show sales trends over time to predict future sales revenue. Prediction modeling executed using Python, for example, involves grouping data by sales summaries for each order date, to create a trendline that can expand to include future dates. (geeksforgeeks.org, 2025 April)
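    One hedged sketch of such a trendline, fitting invented quarterly totals with numpy's polyfit rather than any particular forecasting library, is:

```python
import numpy as np

# Hypothetical quarterly sales totals (the grouped "sales summaries" above)
periods = np.array([1.0, 2.0, 3.0, 4.0])
sales = np.array([100.0, 110.0, 120.0, 130.0])

# Fit a straight trendline: sales ~ slope * period + intercept
slope, intercept = np.polyfit(periods, sales, deg=1)

# Extend the trendline to a future period (period 5) as a naive forecast
forecast = slope * 5 + intercept
print(round(slope, 2), round(forecast, 2))
```

    A real project would validate such a model against held-out data rather than trusting a straight line.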

    Data Mining

    Data mining occurs through looking for patterns in large amounts of data with tools like Python and R. This phase determines if groups exist within the data, and if so, the groups are then classified. Data mining is often used interchangeably with machine learning, deep learning, AI, and supervised/unsupervised models. Automated models that learn and improve over time can be created in this phase to simplify and eliminate the need to do unnecessary repeat analysis. (WGU, 2.1.1)

    Expertise in data mining involves knowing that data needs to be subset into training and testing datasets to build models. If machine learning runs on the entire dataset, it will become problematic. The goal is to create models using machine learning algorithms that learn, improve, and continually update over time. (geeksforgeeks.org, 2025 April)
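    The train/test subsetting idea can be sketched without any machine learning library; the 10-row data set and 20% holdout below are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical data set of 10 records; hold out 20% for testing so the
# model is never evaluated on rows it was trained on
rng = np.random.default_rng(seed=42)
indices = rng.permutation(10)        # shuffle row indices reproducibly

test_size = int(10 * 0.2)
test_idx = indices[:test_size]       # 2 held-out rows
train_idx = indices[test_size:]      # 8 training rows

print(len(train_idx), len(test_idx))
```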

    Reporting and Visualization

    Reporting and visualizing data is where the analyst gets to tell the data story. The reports and visualizations summarize the analysis and provide actionable insights to stakeholders through insightful graphs or interactive dashboards. Reporting and visualization expertise consists of proficient knowledge of graphs and tools such as Tableau. The finished reports provide actionable insights to the stakeholders (WGU, 2.1.1)

    For example, with a sales prediction forecast project, a visualization that compares actual and predicted values is a way the data analyst can show the accuracy of their model to stakeholders. One way of creating this type of visualization is by connecting machine learning models to Tableau. The analyst can use Tableau to interpret the results and make clear, understandable, insightful visualizations to provide the sales prediction forecast to the stakeholders. (Geeksforgeeks.org, 2025 April)

    The Importance of Understanding the Organizational Goal/Mission of the Organization

     The data analyst’s knowledge of the organizational goal will help the analyst to identify the business requirement by narrowing down the focus of their data analysis. Understanding the businesses’ goal and mission provides critical information regarding identifying their stakeholders, a timeline for meeting the goal, any potential limitations the organization may have, and the type of resources and budget allotted for the project.

    The data analyst should continually learn about their business. This effort will improve the data analyst’s ability to communicate efficiently with stakeholders from different backgrounds. Understanding the mission and goal of the organization will help the analyst to communicate more effectively by being bilingual and versed in the language of analytics and the organization. The data analyst reviews the business question when providing project updates, reports, and visualizations to the stakeholders. (WGU, 3.1.3)

    Knowing the goal and mission of the organization helps determine which type of analytics to use, which in turn determines the vital questions data analysts must answer to progress through the data analytics cycle. Determining the kind of analytics allows the data analyst to understand the value and difficulty of the analysis. There are four types of analytics; in order from least to most value and difficulty, they are descriptive, diagnostic, predictive, and prescriptive analytics. At the lower, descriptive end of the spectrum, the organization simply needs information; at the higher end, it desires optimization insights. (WGU, 1.1.2)

    B.  Data Cleaning with Python

    Discerning the Appropriate Data Cleaning Tool

    The decision-making process of selecting the appropriate tool for the data cleaning phase depends on the data cleaning goals, which will determine the techniques, methods, and type of tool that is most appropriate. Things to consider when choosing between Python and R, for example, involve understanding the problems that need to be addressed in the dataset. Is the data in a web app or an installed program? How large is the data? Python would be the appropriate data analytics program/tool to use for cleaning data when the data is in a web app, and if the database is vast. (WGU, 2.2.3)

    Addressing Technical Problems with Cleaning Data in Python

    Technical problems addressed by using libraries such as pandas or numpy within Python for data cleansing include eliminating or repairing data errors within the dataset. Errors come in the form of improperly formatted values, duplicates, missing data, and outliers. Addressing and eliminating these errors reduces the amount of data and can dramatically change the analysis outcome, so outliers in particular need to be identified and handled to minimize variability in the statistical models. (geeksforgeeks.org, 2025 April)
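    As one common way (an assumption here, not the only method) to flag outliers with pandas, the 1.5 × IQR rule can be sketched on an invented revenue column:

```python
import pandas as pd

# Hypothetical revenue column with one extreme value
revenue = pd.Series([100.0, 110.0, 105.0, 98.0, 102.0, 5000.0])

# Flag outliers with the 1.5 * IQR rule
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]

print(list(outliers))  # only the 5000.0 entry is flagged
```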

    An Organization’s Need for Cleaning Data in Python

    For an organization that has an extensive dataset and uses a web-based app, Python is the appropriate programming language for identifying and removing missing, irrelevant, and duplicate data. Raw data needs to be cleaned to ensure that it is accurate and free of errors. After the data goes through the cleaning process, its interpretability is enhanced, supporting correct predictions. Data cleaning is necessary to provide accuracy and to take the right actions based on EDA insights. Using the correct program, and the Python libraries appropriate to the data, results in improved model performance, increased accuracy, better representation of the data, improved data quality, and data security. (geeksforgeeks.org, 2025 January)

    C.  Potential Ethical Risks on Essential Business Functions

    Discuss Three Risks of Using Python for Data Cleaning

    Three risks of using Python (or any tool) to clean data are that it is time-consuming, error-prone, and can result in overfitting. When dealing with large and complex datasets, data cleaning is very time-consuming, even with the appropriate language. There is also the risk that important information is lost in the process: too much data may be removed, which can result in a model that overfits what remains. (Webb, 2020)

    Three Potential Legal or Ethical Problems of Data Cleaning

    Based on the time-consuming, error-prone, and overfitting risks of using Python for data cleaning, potential legal or ethical problems related to the big data of the organization exist. One issue with big data is the issue of privacy. It is difficult for an organization that possesses big data not only to pay to keep data safe but also to keep it safe logistically. Sensitive information such as personal, financial, or health data should be protected and held to any legal or ethical regulations to which the data is subject. The data analyst should take measures to protect customers from identity theft, fraud, or malicious attacks by encrypting, anonymizing, or masking data before cleaning it.
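    A minimal sketch of masking an identifier before cleaning, using Python's standard hashlib on invented records (a production system would use a salted or keyed hash, and stronger controls):

```python
import hashlib
import pandas as pd

# Hypothetical patient records; the identifier is pseudonymized with a
# one-way hash before the data reaches the cleaning step
records = pd.DataFrame({
    "patient_name": ["Ada Smith", "Bo Chen"],
    "diagnosis": ["flu", "flu"],
})

records["patient_id"] = records["patient_name"].apply(
    lambda name: hashlib.sha256(name.encode()).hexdigest()[:12]
)
records = records.drop(columns=["patient_name"])  # drop the raw identifier

print(records.columns.tolist())
```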

    “Insights are only as good as the quality of the data they come from” (Hillier, 2022). Failing to apply best practices to data cleaning is an ethical concern, because dirty or bad data can lead to poor insights and real risk in certain circumstances. Take healthcare records, for example: suppose the analysis project is to tell the data story of the number of patients with a rare contagious illness. If medical records at a local hospital contained duplicate records for several patients and the data analyst skipped cleaning the data, the counts would become statistically inflated, leading to an inflated public health concern. (Hillier, 2022)
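    The inflation effect in that hypothetical example can be shown in a few lines of pandas; the patient IDs below are invented:

```python
import pandas as pd

# Hypothetical hospital records in which two patients appear twice;
# counting without deduplicating inflates the case total
cases = pd.DataFrame({
    "patient_id": [101, 102, 101, 103, 102],
    "illness": ["rare_x"] * 5,
})

raw_count = len(cases)                                     # inflated count
true_count = cases.drop_duplicates("patient_id").shape[0]  # unique patients
print(raw_count, true_count)
```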

    Bias is one more ethical risk in the data cleaning process. The data analyst needs to be intentional in looking for ways to offset bias. Personal and societal biases can creep in unintentionally, and taking time to step back from the data to ensure bias is non-existent or minimized is necessary. Offsetting bias in the cleaning process is difficult, so it is best to minimize it before data collection. However, if bias is detected in the data cleaning process, it is best to acknowledge bias with transparency, diversify data from various sources, and clean data methodically. Methodical data cleaning includes handling data in a way that does not introduce bias, bases decisions on sound statistical reasoning over gut feeling or convenience, and documents data cleaning steps so others can validate the outcome. (LinkedIn.com, July 6)

    Conclusion

    The data analytics life cycle does not always occur chronologically; it can skip around and return to different phases multiple times. Each phase plays an important role, with more time spent in some phases than others. The data analyst must know the organization’s goal and intentionally offset or avoid bias. Cleaning data with Python before moving on to EDA is necessary to prevent setbacks, delays, or inaccuracies in the final product. Risks occur when cleaning big data, such as the risk of eliminating essential data. Ethical risks include bias, data leaks, and compromised data integrity. When data analysts carefully and cautiously follow the data analytics life cycle process, stakeholders will benefit from the actionable insights and data assurance of the final outcome.

    References

    Geeksforgeeks.org. (2025, January 20). ML: Overview of Data Cleaning. https://www.geeksforgeeks.org/data-analysis/data-cleansing-introduction/

    Geeksforgeeks.org. (2025, April 8). Sales Forecast Prediction – Python. https://www.geeksforgeeks.org/python/sales-forecast-prediction-python/

    Geeksforgeeks.org. (2024, March 18). What is Predictive Modeling? https://www.geeksforgeeks.org/data-science/what-is-predictive-modeling/

    Hillier, Will. (2022, December 15). Is Big Data Dangerous? https://careerfoundry.com/en/blog/data-analytics/is-big-data-dangerous/

    LinkedIn.com (a). (accessed July 2, 2025). How Can You Choose the Right Data Cleaning Tools for Any Dataset?

    LinkedIn.com (b). (accessed July 6, 2025). What Security Measures Should You Take When Cleaning Data for ML?

    LinkedIn.com (c). (accessed July 6, 2025). You’re Analyzing Data with Potential Biases. How Can You Ensure an Unbiased Analysis Process?

    Webb, Rebecca. (2020, November 25). 12 Challenges of Data Analytics and How to Fix Them. https://www.clearrisk.com/risk-management-blog/challenges-of-data-analytics-0

    Western Governors University. (Accessed July 2, 2025). The Data Analytics Journey Lesson: Understanding the Data Analytics Life Cycle.

  • This report presents an analysis of the patterns and correlations in a health insurance dataset comprising 1,338 records, investigating the impact of demographic and lifestyle factors, such as Age, BMI, and Smoking status, on the resulting Insurance Charges. The project examines the variables using univariate and bivariate visualizations, descriptive statistics, and parametric and non-parametric testing. In alignment with standard regression analysis, it seeks to test the Null Hypothesis (H0) that there is no statistically significant relationship between the selected independent variables and medical charges.

    Part I: Univariate and Bivariate Statistical Analysis and Visualization

    A. 1. The Univariate Variables and Visualizations of ‘Age’, ‘Charges’, ‘BMI_Category’, & ‘Smoker’

    Quantitative Variable 1 – Univariate Analysis: ‘Charges’

    Figure 1. Charges: Histogram with KDE curve, Boxplot, & Q-Q Plot – Interpretation: Charges have a right-skewed (positively skewed), heavy-tailed distribution, meaning the mean is pulled higher than the median.

    Quantitative Variable 2 – Univariate Analysis: ‘Age’

    Figure 2. Age: Histogram with KDE curve, Boxplot, and Q-Q Plot – Interpretation: Age is almost perfectly symmetric (skew ≈ 0) but strongly platykurtic (kurtosis ≈ –1.24), meaning it has a flat, light‑tailed distribution rather than a normal bell curve.
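The univariate views described in Figures 1 and 2 can be sketched with matplotlib and SciPy. The sample below is synthetic, standing in for the actual ‘Charges’ column, so its exact skew and kurtosis values will differ from the report’s:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic right-skewed sample standing in for the real 'Charges' column
rng = np.random.default_rng(42)
charges = rng.lognormal(mean=9.0, sigma=0.8, size=1338)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(charges, bins=40)                      # histogram (a KDE overlay can be added via seaborn)
axes[0].set_title("Histogram")
axes[1].boxplot(charges)                            # boxplot exposes the high-cost outliers
axes[1].set_title("Boxplot")
stats.probplot(charges, dist="norm", plot=axes[2])  # Q-Q plot against a normal distribution
axes[2].set_title("Q-Q Plot")

print(f"skew={stats.skew(charges):.2f}, kurtosis={stats.kurtosis(charges):.2f}")
```

A positive skew statistic together with a Q-Q plot curving above the reference line is the pattern that motivates the non-parametric testing in Part III.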

    Qualitative Variables Chosen: ‘BMI_Category’ (from BMI) and ‘Smoker’

                Qualitative Variable 1 – Univariate Analysis: ‘BMI_Category’

    Figure 3. BMI_Category: Countplot, Pie Chart, and Summary- Interpretation: Most density around Class1_Obese, and secondarily Overweight
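Deriving ‘BMI_Category’ from the raw BMI values is typically done with `pd.cut`. The report does not show its exact bin edges, so this sketch assumes the standard CDC/WHO cut points:

```python
import pandas as pd

bmi = pd.Series([17.0, 22.5, 27.3, 31.0, 36.2, 44.8])

# Assumed standard CDC/WHO cut points; the report's exact bins are not shown
labels = ["Underweight", "Normal", "Overweight",
          "Class1_Obese", "Class2_Obese", "Class3_Obese"]
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, 35, 40, float("inf")],
    labels=labels,
    right=False,  # each bin includes its left edge, e.g. 30.0 -> Class1_Obese
)
print(bmi_category.tolist())
```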

    Qualitative Variable 2 – Univariate Analysis: ‘Smoker’

    Figure 4. Smoker: Countplot and Pie Chart – Interpretation: The dataset is heavily weighted toward non-smokers, at roughly 80% non-smokers versus 20% smokers.

    2. Bivariate Visualization

                Variable Mapping Table:

    Selected Variable | Bivariate Pair 1   | Bivariate Pair 2    | Bivariate Pair 3
    Charges           | Age vs Charges     | Charges vs Smoker   | BMI_Cat vs Charges
    Age               | Age vs Charges     | Age vs BMI_Category |
    BMI_Category      | BMI_Cat            | BMI_Cat vs Charges  |
    Smoker            | Charges vs Smoker  | Age vs Smoker       | BMI_Category vs Smoker

    Figure 5. Mapping Table for Bivariate Variables – Ensures at least two visualizations for each variable chosen in A1

    Bivariate Visualization: ‘Age’ vs ‘Charges’

    Figure 6. Age vs. Medical Charges: 2D KDE and Scatter – Analysis: Three distinct risk tiers: the 2D KDE plot clearly shows three horizontal density bands, indicating that while age is a factor, the population divides into three distinct risk groups based on other variables (likely smoking and BMI). Youth concentration: the brightest density area at the bottom left indicates that the majority of the dataset consists of younger individuals (18–30) in the lowest-cost tier.

    Bivariate Visualization: ‘Smoker’ Status vs ‘Charges’

    Figure 7. Smoker Status vs. Charges: Violin and KDE – Analysis: The violin plot shows that non-smokers are tightly clustered at the bottom of the scale, while smokers have a much higher and broader distribution of charges. Bimodal smoking costs: the KDE plot for smokers (blue) shows two distinct peaks, suggesting smokers split into two groups: “healthier” smokers and a high-risk group (likely those with high BMI) whose charges frequently exceed $40,000.
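A violin comparison like Figure 7 can be sketched as follows. The two samples here are synthetic lognormal stand-ins for the real charge distributions, with group sizes matching the dataset’s smoker counts:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic lognormal stand-ins for the two groups' charge distributions
rng = np.random.default_rng(0)
non_smoker = rng.lognormal(8.9, 0.6, 1064)
smoker = rng.lognormal(10.2, 0.5, 274)

fig, ax = plt.subplots()
ax.violinplot([non_smoker, smoker], showmedians=True)  # side-by-side violins
ax.set_xticks([1, 2])
ax.set_xticklabels(["Non-smoker", "Smoker"])
ax.set_ylabel("Charges ($)")
ax.set_title("Smoker Status vs. Charges")
```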

    Bivariate Visualization: ‘Age’ vs  ‘Smoker’ Status

    Figure 8. Age vs. Smoker Status: Violin and KDE – Analysis: Identical distributions: the split violin plot and KDE plot show that the age distributions for smokers and non-smokers are almost identical. No age bias: both groups span a wide range of ages (roughly 18 to 70) with similar peaks around age 20. This is an important finding because it confirms that smoking status is independent of age in this dataset, meaning there is no “young smoker” or “old non-smoker” bias.

    Bivariate Visualization: ‘BMI_Category’ vs ‘Charges’

    Figure 9. The Impact of BMI on Medical Costs: Bar and Box Plots – Analysis: A non-linear relationship between weight and cost. Average charges increase with BMI: there is a visible step-up in average charges moving from “Under” through the ‘C1_Obese’ and ‘C2_Obese’ categories. The “Obese” outlier effect: while the median charges for the “Obese” categories are higher, the boxplot reveals a massive number of high-cost outliers (reaching $40–$60k), suggesting that high BMI by itself is not the only driver.

    Bivariate Visualization: ‘BMI_Category’ vs ‘Smoker’ Status

    Figure 10. Obesity and Smoking Intersection: Count Plot and Heatmap – Show that while there are fewer smokers than non-smokers overall, there is a significant population of Obese Smokers. This intersection is where the highest charges in the dataset occur.

    B. Complete the following using the attached “Health Insurance Dataset” and R or Python:

    1. Provide the descriptive statistics for all quantitative variables selected in the dataset.

                Quantitative Variables:

    Statistic | Age   | BMI   | Charges
    count     | 1,338 | 1,338 | 1,338
    mean      | 39    | 31    | $13,270.42
    std       | 14    | 6     | $12,110.01
    min       | 18    | 15    | $1,121.87
    25%       | 27    | 26    | $4,740.29
    50%       | 39    | 30    | $9,382.03
    75%       | 51    | 35    | $16,639.91
    max       | 64    | 53    | $63,770.43

    Figure 11. Descriptive Statistics of Selected Quantitative Variables
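Descriptive statistics of this kind come directly from pandas’ `describe()`. The sketch below uses synthetic data of the same shape (1,338 rows), so its numbers will not match Figure 11:

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same shape as the dataset (values will differ)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(18, 65, 1338),
    "bmi": rng.normal(31, 6, 1338).round(1),
    "charges": rng.lognormal(9.1, 0.9, 1338).round(2),
})

# count, mean, std, min, quartiles, and max for each quantitative column
summary = df[["age", "bmi", "charges"]].describe()
print(summary.round(2))
```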

    2. Provide the Descriptive Statistics for all qualitative variables

    Qualitative Variables:

    Variable     | Category     | Frequency | Percentage
    BMI_Category | Class1_Obese | 397       | 29.67%
    BMI_Category | Class2_Obese | 226       | 16.89%
    BMI_Category | Class3_Obese | 93        | 6.95%
    BMI_Category | Normal       | 221       | 16.52%
    BMI_Category | Overweight   | 380       | 28.40%
    BMI_Category | Underweight  | 21        | 1.57%
    Level        | A            | 112       | 8.37%
    Level        | B            | 264       | 19.73%
    Level        | C            | 427       | 31.91%
    Level        | D            | 348       | 26.01%
    Level        | E            | 187       | 13.98%
    Region       | Northeast    | 324       | 24.22%
    Region       | Northwest    | 325       | 24.29%
    Region       | Southeast    | 364       | 27.20%
    Region       | Southwest    | 325       | 24.29%
    Sex          | Female       | 662       | 49.48%
    Sex          | Male         | 676       | 50.52%
    Smoker       | no           | 1,064     | 79.52%
    Smoker       | yes          | 274       | 20.48%

    Figure 12. Descriptive Statistics of Selected Qualitative Variables
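Frequency and percentage tables like Figure 12 can be produced with `value_counts`. This sketch rebuilds only the Smoker row counts reported above:

```python
import pandas as pd

# Rebuild only the Smoker counts reported in Figure 12
smoker = pd.Series(["no"] * 1064 + ["yes"] * 274, name="Smoker")

freq = smoker.value_counts()                                  # raw counts
pct = smoker.value_counts(normalize=True).mul(100).round(2)   # percentages
table = pd.DataFrame({"Frequency": freq, "Percentage": pct})
print(table)
```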

    Part II. Parametric Statistical Testing

    C.  Real-world Organizational Issue Description

    1.  Research Question:

    To what extent do smoking status, BMI, and age significantly predict annual medical charges among policyholders, and which factor has the greatest impact on those charges?

    D.  Analyze the dataset by doing the following:

    1.  Parametric Statistical Test Chosen: Ordinary Least Squares Regression (OLS)

    2.  Dataset Variables: ‘BMI,’ ‘Age,’ ‘Smoker,’ and ‘Charges’

    3. Why Ordinary Least Squares Regression (OLS) with Robust Standard Errors (SE)?

    I selected Ordinary Least Squares (OLS) regression on the variables ‘BMI’, ‘Age’, ‘Smoker’, and ‘Charges’ because it provides a direct, interpretable measure of how smoking status, BMI, and age impact annual medical charges. This approach is uniquely suited for health insurance pricing research, as it converts complex predictors into quantifiable, financially meaningful explanations of cost drivers, while providing robust evaluation metrics such as R-squared and RMSE (Bruce & Bruce, 2017).

    The research design aligns with the core assumptions of OLS: linearity, independent errors, and approximately normal residuals (Rajaretnam, 2015). These conditions are met to a defensible degree within this dataset. Normality: while ‘Charges’ is right-skewed and ‘Age’ and ‘BMI’ are not perfectly normal, the moderate sample size ensures the residuals remain roughly normal. Stability: no extreme outliers dominate the results, and the variance is not severely unequal. Taken together, these factors make OLS a robust choice for the hypothesis testing and relationship analysis required to evaluate insurance charges (Waples, 2025).

    Standard OLS assumes homoskedasticity, or constant error variance. However, medical charges are notoriously high-variance and often exhibit heteroskedasticity—where the spread of charges increases alongside predictors like BMI or Age (Waples, 2025). Without correction, standard errors can be biased, potentially leading to “false significance” or Type I errors.

    To protect the integrity of the findings, I employed the HC3 estimator for robust standard errors. HC3 is an advanced jackknife-style estimator designed to correct for unequal variance in the residuals and to adjust for high-leverage observations, ensuring that a few influential data points do not distort the reliability of p-values. This choice aligns with modern statistical best practices, providing accurate and conservative inference for health cost modeling (Pinzon, 2022).

    4. Null and Alternative Hypotheses:

    Null Hypothesis (H0): There is no statistically significant relationship between the independent variables (BMI, age, and smoking status) and the dependent variable (medical charges); the regression coefficients for these predictors are equal to zero.

    Alternative Hypothesis (H1): There is a statistically significant relationship between at least one of the independent variables (BMI, age, or smoking status) and the dependent variable (medical charges); at least one regression coefficient is not equal to zero.

    5. Error-free Code in Python Calculations

    Figure 13. Screenshot of OLS with robust SE calculation code exported to HTML.
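The screenshotted code is not reproduced in this export, but an OLS fit with HC3 robust standard errors in statsmodels generally follows the pattern below. The data here is synthetic, seeded to resemble the report’s coefficient estimates, so it is illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data seeded to resemble the reported coefficients (illustrative only)
rng = np.random.default_rng(7)
n = 1338
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(31, 6, n),
    "smoker": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})
df["charges"] = (260 * df["age"] + 323 * df["bmi"]
                 + 23800 * df["smoker"] + rng.normal(0, 6000, n))

# Fit OLS, then request HC3 (jackknife-style) heteroskedasticity-robust errors
model = smf.ols("charges ~ age + bmi + smoker", data=df).fit(cov_type="HC3")
print(model.summary())
```

Passing `cov_type="HC3"` changes only the standard errors and p-values; the fitted coefficients are identical to a plain OLS fit, which is why the report observes unchanged coefficients across the two models.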

    OLS with Robust SE Output and Results

    Figure 14. OLS Regression Results from PyCharm

    E.  OLS with Robust SE Test Results

    The p-value of the regression model (1.48e-292) is essentially zero, confirming that the model is statistically significant overall. With an R-squared of approximately 0.75, the model explains about 75% of the variation in medical charges, indicating strong predictive performance for real-world health-cost data. Smoking status is the dominant cost driver, with smokers incurring an estimated $23,800 more in charges than non-smokers, controlling for BMI and age. BMI and age also contribute meaningfully: each additional BMI point is associated with $323 higher charges, and each additional year of age with $260 higher charges.

    Because the residual diagnostics indicated heteroscedasticity and non-normality, the model was refit using robust standard errors, ensuring more reliable confidence intervals and hypothesis tests (Bruce & Bruce, 2017). As exhibited in Figure 14 above, the coefficients remain unchanged across models, confirming that the underlying relationships are stable. However, the robust standard errors are larger, particularly for the smoking variable, reflecting the heteroscedasticity observed in the residual diagnostics. This widens the confidence intervals and produces more conservative hypothesis tests. Despite this adjustment, all predictors remain highly significant, indicating that the model’s conclusions are trustworthy.

    More from the results:

    • Durbin-Watson (2.026): A value near 2.0 suggests no autocorrelation in the residuals, which is a good sign that the observations are independent.
    • Skew (1.213) & Kurtosis (5.618): These indicate that the residuals (errors) are not perfectly normally distributed; they are heavy-tailed. This is common in insurance data, where a small number of individuals have extremely high costs. Because of HC3, the model is robust against this.
    • RMSE (6,083.21): On average, the model’s predictions are off by about $6,083. Given that some charges exceed $50,000, this is a respectable margin of error. (Bruce & Bruce, 2017).

    Null Hypothesis (H0): Rejected. — There is no statistically significant relationship between the independent variables (BMI, age, and smoking status) and the dependent variable (medical charges); the regression coefficients for these predictors are equal to zero.

    Alternative Hypothesis (H1): Accepted — There is a statistically significant relationship between at least one of the independent variables (BMI, age, or smoking status) and the dependent variable (medical charges); at least one regression coefficient is not equal to zero.

    The heteroscedasticity and heavy-tailed residuals described above confirm that HC3 Robust Standard Errors are essential. This adjustment allows the model to remain a “defensible modeling choice” despite the complex, skewed nature of medical cost data (Waples, 2025).

    2. Answer to Research Question: Given the Rejection of the Null Hypothesis

    Given the rejection of H0 and the regression results discussed above, ‘BMI’, ‘Age’, and ‘Smoker’ status together predict individual health insurance charges with a high degree of accuracy. The model explains approximately 75% of the variation in charges, indicating that these predictors collectively provide a strong and reliable estimate of costs. All predictors demonstrated statistical significance, with p-values near zero, far below the 0.05 threshold. In alignment with standard regression framing, a significant predictive relationship exists between the identified lifestyle factors and insurance costs. While the diagnostics revealed some skewness in the residuals, suggesting the data is not perfectly normal, the robust F-statistic and narrow confidence intervals confirm that the model has strong explanatory power for organizational risk assessment.

    3. How Stakeholders Benefit From OLS Testing Method

    OLS regression transforms raw cost data into clear, quantifiable insights that support financial, operational, and strategic decisions. It provides financial predictability by showing exactly how risk factors drive costs—for example, smoking adds roughly $23,800, and each additional year of age adds about $260—allowing actuaries to price premiums based on real risk and stabilize financial planning. With an R² of 0.75, OLS explains most of the variation in charges, giving leadership clarity on which factors, such as age, BMI, and smoking, have the greatest impact and enabling targeted wellness investments. Its statistically rigorous outputs, including a high F-statistic and near-zero p-values, make policy decisions, such as surcharges or discounts, defensible and reduce regulatory or reputational risk. Finally, OLS diagnostics highlight gaps or missing variables, helping stakeholders refine models, capture nonlinear effects, and strengthen long-term forecasting and data strategy.

    F.  Parametric Statistical Testing Summary

    1.  Recommended Course of Action

    Regression analysis provides a clear framework for risk segmentation and pricing by categorizing customers into tiers based on cost drivers. High-risk individuals, such as smokers who average an additional $23,800 in charges, can be distinguished from moderate-risk groups like those with high BMI or older age, while lower-risk tiers include non-smokers with healthy BMI. Premium structures and reserve allocations can then be adjusted accordingly to ensure fairness and transparency.

    Targeted wellness programs should prioritize smoking cessation initiatives, given smoking is the single largest cost driver, while also developing BMI management strategies such as nutrition counseling and fitness subsidies to reduce long-term costs. Communicating these connections clearly to employees and customers empowers them to understand how lifestyle changes directly reduce charges.

    Effective stakeholder communication is essential: executives should be shown that smoking is the most actionable cost driver, finance teams can use RMSE to set realistic error margins in forecasts, and employees and customers should receive transparent education on how age, BMI, and smoking affect charges. In summary, the findings confirm that smoking, BMI, and age are strong predictors of charges. The recommended course of action is to segment risk, invest in targeted wellness programs, document compliance, and continuously refine the model.

    2.  Limitations

    OLS regression effectively demonstrates that ‘Smoker’ status, ‘BMI,’ and ‘Age’ are independent predictors of the dependent variable, medical expenses; however, it has limitations. The errors aren’t perfectly normal, so extreme cases may make confidence intervals less reliable. It also assumes constant variance, linearity, and independence—assumptions that may not hold if, for example, smoking interacts with age. Outliers can skew the results, and the model may overlook other important factors, such as comorbidities or income. Because it’s based on a single dataset at a single point in time, it can’t capture trends or causal changes. Finally, while R² ≈ 0.75 indicates that the three independent predictors account for 75% of the variation of costs, approximately 25% of the variation remains unexplained. In short, OLS is best suited for identifying significant cost drivers and informing strategy, rather than predicting exact charges for individuals. (Bruce & Bruce, 2017)

    Part III: Nonparametric Statistical Testing

    G.  Real-world Organizational Issue Description

    1.  Research Question

    Is there a statistically significant difference in the median annual medical charges between policyholders who are smokers and those who are non-smokers?

    H. Analysis

    1. Nonparametric Statistical Test: Mann–Whitney U

    2. Dataset Variables: ‘Charges’, ‘Smoker’

    3. Why Mann-Whitney U Test?

    The Mann–Whitney U (MWU) test is an appropriate nonparametric method because it evaluates differences based on the ranks of the data rather than the raw values. By relying on ranks, the test is naturally resistant to outliers—extreme charges simply appear as the highest ranks rather than distorting the analysis. It is often described as a more flexible alternative to the t‑test, since it does not require the data to follow a normal distribution (Statsig, 2025).

    The Mann-Whitney U-test is used to analyze the degree of separation/overlap between two sets of observations drawn independently (‘Smokers’ and ‘Nonsmokers’) from the same population, where the measurement is ordinal or continuous (Rajaretnam, 2015). In this case, healthcare ‘Charges’ are continuous but often skewed. It is confirmed through exploratory data analysis that the healthcare ‘Charges’ in the given dataset are right-skewed. In addition to the visualizations from Part I, the Shapiro-Wilk test further confirms that the data is non-normal. The Mann-Whitney U test works by ranking all observations from both groups, smokers and non-smokers, and evaluating whether these ranks differ significantly in their healthcare costs (Geeks for Geeks, 2025).

    4. Null and Alternative Hypotheses

    Null Hypothesis (H0): There is no relationship between smoking status and the distribution of annual medical charges; specifically, the median medical charges for smokers and non-smokers are equal.

    Alternative Hypothesis (H1): There is a statistically significant relationship between smoking status and annual medical charges, such that the distribution of charges (median) differs significantly between smokers and non-smokers.

    5. Error-free Code in Python Calculation, Results and Output

    Figure 16. Shapiro-Wilk and Mann-Whitney U Test Calculation, Results, and Output from PyCharm
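The screenshotted calculation is not reproduced in this export, but the Shapiro-Wilk and Mann-Whitney U tests in SciPy generally look like the sketch below, run here on synthetic right-skewed stand-ins for the two groups’ charges:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed stand-ins for the two groups' charges
rng = np.random.default_rng(3)
non_smoker = rng.lognormal(8.9, 0.6, 1064)
smoker = rng.lognormal(10.2, 0.5, 274)

# Shapiro-Wilk: a small p-value indicates the sample is non-normal
sw_stat, sw_p = stats.shapiro(non_smoker)

# Mann-Whitney U: rank-based comparison of the two distributions
u_stat, u_p = stats.mannwhitneyu(smoker, non_smoker, alternative="two-sided")
print(f"Shapiro-Wilk p={sw_p:.3g}, U={u_stat:.1f}, p={u_p:.3g}")
```

Because the U test ranks every observation rather than using raw dollar values, extreme charges simply occupy the top ranks, which is the outlier resistance described above.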

    I.  Nonparametric Test Results Evaluation

    1.  Mann-Whitney U Test Results and Rejection of the Null Hypothesis

    The Mann–Whitney U test was used to assess whether smoking status is associated with meaningful differences in annual medical charges. This non-parametric approach is appropriate given the strong right skew in the cost data, allowing the analysis to focus on distributional differences rather than assumptions of normality.

    The resulting U-statistic of 284,133.0 reflects substantial separation between the two groups, indicating that charges for smokers consistently rank higher than those for non-smokers across the dataset. The p-value of 5.27e-130 is effectively zero, providing overwhelming evidence that the observed differences are not due to random variation.

    The observed difference in medians confirms that smoking status corresponds to a substantial increase in cost. In a regression-aligned interpretation, smoking status functions as a statistically significant predictor of the distribution of medical charges. Consequently, the null hypothesis, that smoking status has no explanatory relationship with healthcare costs, is decisively rejected.

    Charges vs. Smoker Status Regression Plot

    Figure 17. Regression Plot – Interpretation:  The regression line slopes upward from non‑smokers to smokers, confirming that smokers consistently incur higher medical charges.

    2.  Answer to Research Question: Given the Rejection of the Null Hypothesis

    There is a statistically significant relationship between smoking status and annual medical charges. The analysis confirms that smoking status is a significant predictor of the variance in healthcare costs, with smokers’ charges being significantly higher than those of non-smokers. The null hypothesis is rejected, showing that smokers and non-smokers do not share the same distribution of charges. In practical terms, smokers have much higher median charges, confirming that smoker status is a strong driver of cost differences.

    • H0: Rejected. The data provides overwhelming evidence against the assumption that smoking status has no effect on the distribution of annual medical charges.
    • H1: Accepted. At least one level of the independent variable (smoking status) explains a non-zero portion of the variance in the dependent variable (medical charges).

    3.  How Stakeholders Benefit from Mann-Whitney U Test

    Choosing the Mann–Whitney U test for smoker versus non-smoker charges directly supports business, compliance, and community needs. For executives, it shows clear evidence that smokers cost more, justifying premium differences or surcharges. Regulators benefit because the test is fair and unbiased, with claims data not skewed, making rating factors easier to defend.

    Healthcare providers and advocates gain proof of smoking’s financial impact, strengthening support for cessation programs. Policyholders see transparent evidence explaining higher premiums, which builds trust. Analysts and audit teams value the test for its simplicity, reproducibility, and robustness to outliers, making it easily fit into compliance workflows. Overall, the test focuses on medians, handles skewed data well, and produces results that are easy to explain: smokers cost more, and here’s the statistical proof.

    J.  Nonparametric Statistical Testing Summary

    1.  Recommend a course of action based on your findings.

    Based on these results, the best next step is to use smoking status as part of pricing and policy decisions because the data shows smokers consistently cost more. This gives insurers and employers a fair, evidence‑based reason to charge higher premiums or add smoker surcharges. The analysis is statistically solid, so regulators can see that the method is transparent and defensible. The findings also highlight the real financial impact of smoking, which supports offering or expanding smoking‑cessation programs in the community.

    Different groups should get information in a way that makes sense for them: executives need a clear business explanation for pricing changes, regulators need a clean audit trail showing how the numbers were produced, public‑health leaders can use the results to push for prevention programs, and the community should get a simple explanation of why smoker premiums differ. As a next step, the analysis can be expanded to include other factors like age, BMI, and region to build a more complete picture of what drives healthcare costs.

    2.  Discuss the limitations of your data analysis.

    The Mann–Whitney U test is useful for comparing two groups, such as smokers and non-smokers; however, it only examines differences in their distributions or relative rankings. It doesn’t account for other factors, such as age, BMI, or region, so it can show that smokers pay more, but it doesn’t explain why or how much compared to these other influences. ‘Charges’ is skewed, with very high outliers, and while the test is fairly robust, extreme cases can still distort results, making median differences appear smaller than the actual financial impact.

    The simple smoker/non-smoker split also misses nuances, like how much someone smokes or whether they used to smoke, which can matter for risk. Results depend on the dataset, so findings may not apply to all populations or regions. Importantly, the test shows association, not causation—higher charges could be linked to other health or social factors. Finally, the U‑statistic itself isn’t intuitive, so results need to be translated into medians, ranges, and effect sizes to make sense for non-technical stakeholders. Without clear communication, stakeholders may misinterpret the strength of the evidence.

    Conclusion

    Univariate and bivariate visualizations, descriptive statistics, and parametric and nonparametric testing deliver valuable insights for this major health insurance provider. Results from the OLS with Robust SE parametric test, reflected in the R² and RMSE values, clearly show that BMI, age, and smoking status are strong predictors of healthcare costs. The Mann-Whitney U nonparametric test further highlights the significant cost increase driven by smoking, with smokers incurring notably higher expenses than non‑smokers. These findings underscore the importance of accounting for the financial impact of smoking and reinforce the need to pursue recommended interventions aimed at reducing costs for stakeholders. Most significantly, the statistical testing suggests that the Null Hypothesis (H0) of no relationship can be rejected in favor of the Alternative Hypothesis (H1) for Age, BMI, and Smoker Status, with Smoker Status being the primary cost driver for Insurance Charges.

    References

    Bobbitt, Zach. (2022, June 11). How to Test for Normality in Python (4 Methods). https://www.statology.org/normality-test-python/

    Bruce, P. & Bruce, A. (2017). Practical Statistics for Data Scientists. O’Reilly Media, Inc.

    Downey, Allen B. (2025). Think Stats, 3rd Edition. O’Reilly Media, Inc. https://colab.research.google.com/github/AllenDowney/ThinkStats/blob/v3/nb/chap11.ipynb

    Geeks for Geeks. (2025, July 23). Linear Regression (Python Implementation). https://www.geeksforgeeks.org/machine-learning/linear-regression-python-implementation/

    Geeks for Geeks. (2025, July 24). Mann-Whitney U test. https://www.geeksforgeeks.org/machine-learning/mann-whitney-u-test-2/

    Geeks for Geeks. (2024). Residual Sum of Squares. https://www.geeksforgeeks.org/maths/residual-sum-of-squares/

    Lujan, Alan. Mastering Econometrics: Regressions, Inference and Numerical Optimization. https://github.com/jhu-aap-econ/merino/blob/main/notebooks/Ch8.%20Heteroskedasticity.ipynb

    Pinzon, Enrique. (2022, October 6). Heteroskedasticity Robust Stand Errors: Some Practical Considerations. https://blog.stata.com/2022/10/06/heteroskedasticity-robust-standard-errors-some-practical-considerations/

    Rajaretnam, T. (2015). Statistics for Social Sciences. SAGE Publications Inc.

    Statsig. (2025, June 23). Mann-Whitney U: Non-Parametric A/B Testing. https://www.statsig.com/perspectives/mannwhitney-nonparametric-abtesting

    Waples, Josef. (2025, January 8). OLS Regression: The Key Ideas Explained. https://www.datacamp.com/tutorial/ols-regression


