One of the most important uses of the pandas library is deriving business metrics that you would never be able to see just by looking at the dataset itself.

To demonstrate this, we took a sample dataset from a clinic and extracted some data to answer business questions, all while avoiding HIPAA violations (I think).

With this dataset, we can answer questions such as:

  • What are the most popular and least popular days for patients? (Used to help with staffing)

  • Were there any non-repeat patients? (Used to identify recurring patients and estimate churn rate)

  • What advertising source/medium (Google, doctor, etc.) provided the highest income? (Used to determine metrics like ROI)
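
To make the first two questions concrete, here's a rough sketch of how they might be answered. The rows below are invented for illustration (only the column names "Patient's Name" and "Day" come from the clinic dataset), and it assumes a patient's name is enough to identify them:

```python
import pandas as pd

# Invented sample rows for illustration only; the column names mirror the
# clinic dataset, but none of these values come from it.
visits = pd.DataFrame({
    "Patient's Name": ["Ana", "Ben", "Ana", "Cara", "Ben", "Ana"],
    "Day": ["Monday", "Monday", "Friday", "Monday", "Tuesday", "Friday"],
})

# Most and least popular days: count appointments per day.
visits_per_day = visits["Day"].value_counts()
busiest_day = visits_per_day.idxmax()
quietest_day = visits_per_day.idxmin()

# Non-repeat patients: names that appear exactly once.
visits_per_patient = visits["Patient's Name"].value_counts()
one_time_patients = visits_per_patient[visits_per_patient == 1].index.tolist()

print(busiest_day, quietest_day, one_time_patients)
```

The third question would need a column describing the advertising source, so it's left out of this sketch.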

Take a scenario where your team is working on Meta advertising. Your boss notes that the clinic sees more female patients than male patients and suggests updating the Meta advertising strategy to target females.

The question: should your team update their advertising strategy to target females?

Understanding the data

First things first: let’s inspect our data to get an understanding of it. This will almost always be step 1. You’ll need to load it and look at its columns:

import pandas as pd

df = pd.read_csv("medical_clinic_data.csv")
df.columns  # columns is an attribute, not a method, so no parentheses

Which will show us all of the columns in the dataset:

Index(["Patient's Name", 'Age', 'Phone Number', 'Gender', 'Patient Type', 'Date of Appointment', 'Time of Appointment', 'Day', 'Fees Charged', 'Duration of  Session', 'Diagnosis', "Doctor's Remarks", 'Conversion'], dtype='object')

In here, we see the column “Gender” - bingo. To get an answer quickly, you decide to plot it:

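import matplotlib.pyplot as plt

# Count the appointments per gender and draw them as a bar chart
df['Gender'].value_counts().plot(kind='bar')

plt.grid(visible=False)
plt.title("Genders across the entire dataset")
plt.xticks(rotation=45)
plt.ylabel("Count")
plt.tight_layout()
plt.show()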
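
If you'd rather read numbers than bars, the same counts can be printed directly. This is a small sketch on invented rows (only the "Gender" column name comes from the dataset); `normalize=True` turns the counts into fractions of the total:

```python
import pandas as pd

# Invented sample rows; only the "Gender" column name comes from the dataset.
sample = pd.DataFrame({"Gender": ["Female", "Female", "Male", "Female"]})

counts = sample["Gender"].value_counts()                # appointments per gender
shares = sample["Gender"].value_counts(normalize=True)  # same, as fractions of 1

print(counts)
print(shares)
```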

But wait, is this really the solution? It can’t be that easy.

You’re right - this is way too easy. In fact, this is exactly what your boss noticed. However, there are two edge cases we need to consider:

  • Duplicate values - patients return for follow-up visits, so the same patient can appear in several rows. We want to count each patient only once.

  • Timestamps - What if there are values from other years? What if someone made a typo and added an entry for next year?
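
Before fixing anything, it can help to check whether these two edge cases actually show up. Here's a minimal sketch on invented rows (the column names mirror the dataset, the values don't), again assuming a name identifies a patient:

```python
import pandas as pd

# Invented sample rows for illustration; column names mirror the dataset.
sample = pd.DataFrame({
    "Patient's Name": ["Ana", "Ana", "Ben", "Cara"],
    "Date of Appointment": ["2025-01-06", "2025-02-03", "2025-01-07", "2026-01-08"],
})

# Edge case 1: how many rows belong to a patient we've already seen?
repeat_rows = sample["Patient's Name"].duplicated().sum()

# Edge case 2: which years show up, and how often?
years = pd.to_datetime(sample["Date of Appointment"]).dt.year
rows_per_year = years.value_counts().sort_index()

print(repeat_rows)
print(rows_per_year)
```

A non-zero repeat count or more than one year in the output is the cue that the raw bar chart can't be taken at face value.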


Starting with the slightly easier of the two: make sure all of your appointment dates use the pandas datetime type, then keep only the rows dated 2025:

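df['Date of Appointment'] = pd.to_datetime(
    df["Date of Appointment"]
)
# Drop rows with no appointment date (this also covers fully empty rows);
# a missing date would break the integer conversion below
df = df.dropna(subset=['Date of Appointment'])

# Extract the year and place it into its own column
df['Year'] = df["Date of Appointment"].dt.year.astype(int).astype(str)

# Filter out the years we don't want
df = df[df['Year'] == '2025']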
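
If the date column might contain typos that can't be parsed at all, one option is `errors="coerce"`, which turns those values into NaT instead of raising an error. This is a sketch on invented rows, not necessarily how the clinic data behaves:

```python
import pandas as pd

# Invented sample rows; "not a date" stands in for a data-entry typo.
sample = pd.DataFrame({
    "Patient's Name": ["Ana", "Ben", "Cara", "Dev"],
    "Date of Appointment": ["2025-03-01", "not a date", "2026-03-01", "2025-04-15"],
})

# errors="coerce" turns anything unparseable into NaT instead of raising.
sample["Date of Appointment"] = pd.to_datetime(
    sample["Date of Appointment"], errors="coerce"
)

# A missing year never equals 2025, so unparseable rows drop out here too.
in_2025 = sample[sample["Date of Appointment"].dt.year == 2025]

print(in_2025["Patient's Name"].tolist())
```

Comparing the year as a number also skips the round trip through `astype(int).astype(str)`.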

Next, we want to de-duplicate our data, so that a patient who came back for several visits is counted only once. To do this, we’ll use a combination of groupby and agg:

df = df.groupby(['Gender']).agg(
  {"Patient's Name" : 'nunique'}
)

groupby is a function that will split rows into “buckets” by gender, so under the hood all “male” rows are together, and all “female” rows are together.

agg is a function that summarizes each bucket that groupby created. Here, we ask it to count the unique patient names in each bucket, so a patient with several visits is counted only once (this assumes each name identifies exactly one patient). Re-plotting it shows us we’re able to accomplish this:

This analysis shows that the split of unique patients is nearly 50/50 once repeat visits are excluded, whereas the first graph showed that females had significantly more appointments than males.
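
The gap between the two views is easy to reproduce on a toy example. The rows below are invented (two women with repeat visits, two men with one visit each) and assume a name identifies a patient:

```python
import pandas as pd

# Invented sample rows: two women with repeat visits, two men with one visit each.
sample = pd.DataFrame({
    "Patient's Name": ["Ana", "Ana", "Ana", "Cara", "Cara", "Ben", "Dev"],
    "Gender": ["Female"] * 5 + ["Male"] * 2,
})

# View 1: every appointment counts (what the first bar chart measured).
appointments = sample["Gender"].value_counts()

# View 2: each patient counts once (what groupby + nunique measures).
unique_patients = sample.groupby("Gender")["Patient's Name"].nunique()

print(appointments.to_dict())
print(unique_patients.to_dict())
```

Counted by appointment, women look like the clear majority. Counted by patient, the two groups are the same size.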

So, should the team target its ads at a specific gender? Based on gender alone (all else being equal), the answer is no: unique patients are split roughly 50/50, so gender isn’t a useful variable for this decision.
