5 October 2023
Python has become one of the most popular programming languages used in data science. With its versatile libraries and frameworks for manipulating, visualizing and modelling data, Python provides data scientists with the tools to glean powerful insights from real-world data.
This hands-on guide will demonstrate how Python's data science libraries allow you to efficiently access, clean, explore and build predictive models using real-world datasets.
Whether you're just getting started with JBI Training's Advanced Python Course for data science or looking to level up your skills, this guide will equip you to harness the power of Python for analysing real-world data.
Python has cemented itself as a must-have skill for aspiring data professionals. Here's a quick primer on Python and its robust ecosystem of libraries for data science tasks:
What is Python?
Python is a general-purpose, high-level programming language created by Guido van Rossum and first released in 1991. Some key qualities that make Python well-suited for data science:
Python has a vibrant open-source community that has developed powerful libraries for data manipulation, analysis and modelling, including NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn.
Python code is relatively easy to write, learn and debug. The availability of skilled Python programmers makes it a great choice for data science teams. Python can integrate with other languages like R, SQL, C and JavaScript. All these qualities make Python an ideal first programming language for aspiring data scientists.
Python 2 vs Python 3
Python 3, released in 2008, was a major upgrade that is incompatible with earlier Python 2 code. Python 3 has become the standard version for data science. This guide covers Python 3, which has better Unicode support, modern features and active community support.
Now that you have a basic understanding of Python and its use in data science, let's look at how to work with real-world datasets using Python libraries.
Many interesting real-world datasets are available that we can analyze using Python. Here are some places you can get free datasets:
Once you've identified a dataset to work with, how do you get it into a Python environment? Here are some approaches:
Use the Pandas read_csv() function to load CSV files into a dataframe
Use read_excel() to import Excel data into a dataframe
Use the json module to load JSON data
Let's look at a quick example of loading a CSV dataset:
import pandas as pd
dataset = pd.read_csv('path/to/file.csv')
This reads the CSV file into a Pandas dataframe ready for analysis. Let's now learn how to clean and prepare real-world data for exploration.
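As a minimal sketch of the loading approaches listed above, the snippet below reads CSV text and a JSON string from memory rather than from disk, so it runs without any files. The column names and values are invented purely for illustration; in practice you would pass a file path or URL.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content; normally this would be a path such as 'path/to/file.csv'.
csv_text = "store_id,date,sales\n1,2023-01-05,120.5\n2,2023-01-06,98.0\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON records, parsed with the standard json module.
json_text = '[{"store_id": 1, "city": "London"}, {"store_id": 2, "city": "Leeds"}]'
records = json.loads(json_text)
json_df = pd.DataFrame(records)

print(csv_df.shape)              # number of rows and columns read from the CSV text
print(json_df.columns.tolist())  # column names taken from the JSON keys
```

Checking the shape and column names straight after loading is a cheap way to confirm that the file was parsed the way you expected.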
Real-world data often requires preprocessing and tidying before analysis. Here are some common data cleaning steps in Python:
Handling Missing Data
Use Pandas dropna() and fillna() to deal with missing values.
# Drop rows with missing values (returns a new dataframe; the original is unchanged)
cleaned = dataset.dropna()
# Alternatively, fill missing values with a placeholder
filled = dataset.fillna(0)
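Both methods return a new dataframe rather than modifying the original, and both can be targeted at specific columns. The following is a small sketch on made-up data, assuming an 'age' and a 'city' column that do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with a gap in each column.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "city": ["London", "Leeds", None, "York"],
})

# Drop only the rows where 'age' is missing.
with_age = df.dropna(subset=["age"])

# Fill each column with a value that suits its type.
filled = df.fillna({"age": df["age"].median(), "city": "Unknown"})

print(len(with_age))               # rows that still have an age
print(filled.isna().sum().sum())   # remaining missing values after filling
```

Filling numeric gaps with the median and text gaps with a label such as "Unknown" is only one possible policy; the right choice depends on why the values are missing.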
Data Formatting
Format strings, change data types and normalize formats with Pandas to_datetime, to_numeric etc.
# Convert string to proper date format
dataset['date'] = pd.to_datetime(dataset['date'])
# Normalize mixed data formats
dataset['sales'] = pd.to_numeric(dataset['sales'])
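Real-world columns often contain a few values that cannot be parsed. As a hedged illustration on invented data, passing errors="coerce" to these converters turns unparseable entries into missing values instead of raising an exception.

```python
import pandas as pd

# Made-up raw values containing one bad entry per column.
raw = pd.DataFrame({
    "date": ["2023-01-05", "2023-02-10", "not a date"],
    "sales": ["120.5", "98", "n/a"],
})

# errors="coerce" converts unparseable values to NaT / NaN instead of raising.
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")
raw["sales"] = pd.to_numeric(raw["sales"], errors="coerce")

print(raw.dtypes)
print(raw.isna().sum())   # how many values could not be converted per column
```

Counting the coerced values afterwards shows how much data was lost in conversion, which is worth checking before moving on.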
Fixing Duplicates
Identify duplicate rows with dataset.duplicated() and remove them with dataset.drop_duplicates().
Apply functions like sklearn.preprocessing.scale() to standardize columns.
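The duplicate and scaling steps can be sketched on a small invented frame. Note that this example uses StandardScaler, the estimator counterpart of scale(), as an assumption about what is convenient: it remembers the mean and standard deviation so the same scaling can be reused on new data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data in which the first two rows are exact duplicates.
df = pd.DataFrame({
    "store_id": [1, 1, 2, 3],
    "sales": [100.0, 100.0, 150.0, 200.0],
})

# duplicated() flags repeated rows; drop_duplicates() returns a frame without them.
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

# Standardize 'sales' to zero mean and unit variance.
scaler = StandardScaler()
deduped["sales_scaled"] = scaler.fit_transform(deduped[["sales"]]).ravel()

print(n_duplicates)
print(deduped)
```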
Proper cleaning and formatting ensures your data is ready for exploration and modelling. Let's now dive into exploratory data analysis using Python.
Exploratory Data Analysis (EDA) allows us to summarize dataset characteristics, spot anomalies, identify patterns and relationships before applying predictive models. Here are useful Python EDA techniques:
Summary Statistics
Pandas dataframe.describe() provides summary stats like mean, standard deviation, min/max values etc.
dataset.describe()
Data Visualization
Use Matplotlib and Seaborn to visualize distributions as histograms, scatter plots, heatmaps etc.
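import matplotlib.pyplot as plt
# Scatter plot (x and y are two numeric columns or arrays)
plt.scatter(x, y)
plt.show()
# Histogram (data is a single numeric column or array)
plt.hist(data)
plt.show()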
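For a version that runs end to end, here is a sketch that draws both plot types side by side on randomly generated data. The variables x and y are synthetic, and the non-interactive "Agg" backend is chosen only so the script also works on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x plus some noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# One figure with two panels: a scatter plot and a histogram.
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_hist.hist(x, bins=20)
ax_hist.set_xlabel("x")
ax_hist.set_ylabel("count")
fig.tight_layout()
```

In a notebook or interactive session you would drop the backend line and call plt.show(), or save the result with fig.savefig().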
Grouping and Aggregating
Pandas groupby and agg enable aggregating metrics by specific columns.
# Average sales per store location
sales_by_store = dataset.groupby('store_id').agg({'sales': 'mean'})
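When you need several metrics at once, named aggregation keeps the output columns readable. This is a small sketch on invented sales figures; the output column names are arbitrary choices.

```python
import pandas as pd

# Made-up order data for two stores.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 2],
    "sales": [100.0, 200.0, 50.0, 70.0, 90.0],
})

# Named aggregation: output column = (input column, aggregation function).
summary = sales.groupby("store_id").agg(
    mean_sales=("sales", "mean"),
    total_sales=("sales", "sum"),
    n_orders=("sales", "count"),
)
print(summary)
```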
Correlations
Find linear correlations between numeric columns using dataset.corr()
dataset.corr(numeric_only=True)
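Correlation is only defined for numeric columns, so it helps to select those explicitly when a dataframe also holds text. A minimal sketch on made-up data, where 'sales' is constructed to be an exact linear function of 'ad_spend':

```python
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [25, 45, 65, 85, 105],   # exactly 2 * ad_spend + 5
    "returns": [5, 3, 4, 1, 2],
    "region": ["N", "S", "N", "S", "N"],
})

# Keep numeric columns only; a text column has no Pearson correlation.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

A coefficient near 1 or -1 signals a strong linear relationship, while a value near 0 only rules out a linear one, not every kind of dependence.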
Thorough EDA provides intuition about trends, relationships and anomalies that can guide the application of predictive models.
Once you've explored and understood the datasets, Scikit-learn provides a robust set of machine learning algorithms to build models that can make predictions from data.
Importing Scikit-learn
from sklearn import svm, tree, linear_model, ensemble
Splitting Training and Test Sets
Use train_test_split to create separate subsets to train and test models.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
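A few optional arguments make the split more controlled. The sketch below uses a synthetic array and labels; the 20% test size and the seed value are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a balanced binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the class balance the same in both subsets
)
print(X_train.shape, X_test.shape)
```

Stratifying matters most when one class is rare, since a purely random split could leave the test set with very few examples of it.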
Applying Models
Fit models like linear regression, random forests, SVM on the training set.
# Linear regression
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
# Random forest classifier
rf_model = ensemble.RandomForestClassifier()
rf_model.fit(X_train, y_train)
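Which model to fit depends on the target: a regressor for a continuous value, a classifier for a category. The following sketch makes that distinction explicit using synthetic datasets generated by Scikit-learn, so every variable name and parameter value here is illustrative.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Continuous target -> regression.
X_reg, y_reg = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)

# Categorical target -> classification.
X_clf, y_clf = make_classification(n_samples=200, n_features=6, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc_train, yc_train)

print(reg.score(Xr_test, yr_test))   # R^2 on held-out data
print(clf.score(Xc_test, yc_test))   # accuracy on held-out data
```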
Evaluating Models
Evaluate models on the test set using metrics like accuracy, precision, recall, F1-score.
from sklearn.metrics import accuracy_score
predictions = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
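Accuracy alone can hide how a classifier treats each class, which is why precision, recall and F1 are usually reported alongside it. Here is a small worked example on hypothetical labels and predictions with 3 true positives, 3 true negatives, 1 false positive and 1 false negative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 6 / 8
precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

In this balanced toy case all four metrics come out at 0.75; on imbalanced data they typically diverge, and that divergence is the information accuracy alone would miss.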
Scikit-learn makes it easy to apply machine learning models to make predictions and classifications from real-world data.
Now let's apply what we've learned by walking through end-to-end case studies.
Let's demonstrate a real-world application of Python's data science libraries by analyzing a healthcare dataset.
We'll use the freely available Diabetes 130-US hospitals for years 1999-2008 Data Set from the UCI Machine Learning Repository.
This dataset contains over 100,000 patient records spanning 130 US hospitals from 1999-2008. Features include patient demographics, medical history, diagnoses, tests, medications and outcomes.
Let's use Python to analyze this rich dataset and derive insights that could improve healthcare services.
Data Import and Exploration
We load the CSV dataset into a Pandas dataframe:
import pandas as pd
df = pd.read_csv('diabetes_data.csv')
Now we can start exploring the data:
import matplotlib.pyplot as plt
# Check dataframe info
df.info()
# Summary statistics
df.describe()
# Data visualization
plt.hist(df['time_in_hospital'])
plt.show()
Data Cleaning
We prepare the data for modelling by handling missing values, formatting and transforming columns:
# Fill NA values
df['test_result'] = df['test_result'].fillna('Unknown')
# Normalize gender column
df['gender'] = df['gender'].map({'Female':0, 'Male':1})
Developing Prediction Model
We'll train a random forest model to predict whether a patient will be readmitted based on their test results and medical history:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Select features and target; one-hot encode the text-valued columns so the model receives numeric input
X = pd.get_dummies(df[['test_result', 'age']])
y = df['readmitted']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Evaluate on test data
accuracy = model.score(X_test, y_test)
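Scikit-learn's tree models need numeric input, so text-valued columns have to be encoded before fitting. One possible way to do this, sketched here on a tiny synthetic stand-in for the hospital data, is to put a one-hot encoder and the classifier in a single pipeline. All values below are invented, and the column names simply mirror those used in this walkthrough.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic stand-in for the hospital data (values are invented).
df = pd.DataFrame({
    "test_result": ["Normal", "High", "Unknown", "High"] * 10,
    "age": ["[40-50)", "[70-80)", "[60-70)", "[70-80)"] * 10,
    "readmitted": ["NO", "YES", "NO", "YES"] * 10,
})

X = df[["test_result", "age"]]
y = df["readmitted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One-hot encode the categorical columns inside the pipeline so the same
# encoding is applied at training time and at prediction time.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["test_result", "age"])]
)
model = Pipeline([
    ("encode", encoder),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because the synthetic labels follow the features exactly, the score here says nothing about real readmission data; the point is only the structure of the pipeline.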
This model could help hospitals identify patients at high risk of readmission so they can improve patient care. The wide range of data also allows deriving many other insights to enhance healthcare services.
Let's look at another real-world use case - analyzing tweets to identify user sentiment and trends.
We'll use the freely available Twitter Sentiment Analysis Dataset from Kaggle, containing 1.6 million tweets labelled for sentiment (positive/negative).
Here's how to analyze this social media data with Python:
Load the Data
We load the CSV file into a Pandas dataframe:
import pandas as pd
df = pd.read_csv('tweets_data.csv')
EDA
We check distribution of sentiment labels:
# Profile of sentiment column
df['sentiment'].value_counts()
# Distribution plot
df['sentiment'].plot(kind='hist')
Text Preprocessing
We clean tweet text to prepare it for modelling:
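import re
import string
from textblob import Word
# Illustrative helper (not part of any library): strip URLs, @mentions and #hashtags
def remove_urls_users_hashtags(text):
    return re.sub(r'https?://\S+|www\.\S+|[@#]\w+', '', text)
# Remove links, users, hashtags
df['text'] = df['text'].apply(remove_urls_users_hashtags)
# Remove punctuation (the .str accessor applies translate to every row)
df['text'] = df['text'].str.translate(str.maketrans('', '', string.punctuation))
# Lemmatize text
df['text'] = df['text'].apply(lambda x: " ".join(Word(word).lemmatize() for word in x.split()))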
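To make the cleaning steps concrete, here is a self-contained sketch that uses only the standard library and Pandas. The clean_tweet function is a hypothetical helper written for this example, the two tweets are invented, and lemmatization is left out so that no extra packages are needed.

```python
import re
import string

import pandas as pd


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, #hashtags and punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # links
    text = re.sub(r"[@#]\w+", " ", text)                  # mentions and hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())                 # collapse whitespace


tweets = pd.Series([
    "Loving the new phone!!! http://example.com #happy",
    "@support my order is late... not impressed",
])
cleaned = tweets.apply(clean_tweet)
print(cleaned.tolist())
# ['loving the new phone', 'my order is late not impressed']
```

Removing hashtags entirely is a design choice; in some analyses the hashtag text itself carries sentiment and is worth keeping.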
Sentiment Classification Model
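We'll convert the cleaned text to TF-IDF features and train a logistic regression model on them to predict tweet sentiment:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.3)
# Convert text to numeric TF-IDF features (the model cannot be fitted on raw strings)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train model
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train_vec, y_train)
# Evaluate
accuracy = logit.score(X_test_vec, y_test)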
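A classifier needs numeric features, so text has to be vectorized before it reaches the model. One way to keep the two steps together, shown here as a sketch on a handful of invented tweets, is a Scikit-learn pipeline that can then score raw strings directly. The labels use 1 for positive and 0 for negative as an assumption for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled tweets (invented); 1 = positive, 0 = negative.
texts = [
    "love this great phone", "what a great day", "really love it",
    "great service love it", "hate this awful phone", "what an awful day",
    "really hate it", "awful service hate it",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline vectorizes the text and then fits the classifier on the result.
sentiment_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_model.fit(texts, labels)

# New tweets can be passed in as plain strings.
predictions = sentiment_model.predict(["love this great service", "awful hate it"])
print(predictions)
```

Keeping the vectorizer inside the pipeline also means it is fitted on the training data only, which avoids leaking vocabulary statistics from the test set.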
This model identifies tweet sentiment, helping analyze public perceptions, trends and reactions on social media platforms like Twitter.
We've explored a broad range of techniques to ingest, prepare, analyze, visualize and model real-world datasets using Python.
Here are some of the key takeaways:
Python's extensive libraries make practically any type of dataset and data problem accessible to aspiring data scientists.
Whether analyzing social media trends, customer engagement, financial indicators or healthcare outcomes, Python empowers you to unlock meaningful insights from real-world data across industries. Its versatility and simplicity will ensure Python remains fundamental to data science workflows.
To dive deeper, here are useful resources to build your Python data science skills:
Datasets: Kaggle, UCI Machine Learning Repository
Libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn
Learning: DataCamp, DataQuest, Coursera (Python for Data Science course)
Documentation: Official Python Documentation, Scikit-learn User Guide
Community: StackOverflow, Reddit (r/Python, r/LearnPython)
This guide provided you with the techniques and tools to access, prepare, analyze and model real-world data using Python's robust data science libraries like Pandas, Matplotlib and Scikit-learn.
Python's versatility, readability and vibrant ecosystem make it an essential skill for unlocking value from real-world data across industries.
So get out there, find interesting datasets to analyze and start honing your Python data science skills today!
Copyright © 2024 JBI Training. All Rights Reserved.