DA-103 Sprint Summary

Modified on Wed, 23 Jul at 2:01 PM


Topic 1: Getting Started with DA-103


- Introduction to DA-103

Let’s move beyond spreadsheets and dive into real-world datasets. DA-103 introduces you to the power of structured data analysis using Python and SQL, preparing you for full-scale data projects.

What is it?
DA-103 focuses on working with databases, dataframes, and analytical workflows to solve domain-specific problems.

Where is it used?
Used in analytics teams working with customer, operational, or product data across domains like transport, banking, and retail.

How is it used?
• Build queries with SQL to fetch and filter data
• Use Python libraries like Pandas for transformation and visualization
• Practice through real-world datasets such as Uber rides and banking attrition
• Focus on drawing actionable insights from patterns

--- Takeaways / best practices
• Understand the business domain of your dataset
• Combine SQL and Pandas for efficient data handling
• Keep your code modular and reusable for analysis blocks
• Focus on insights, not just charts

- SQL and Pandas

When SQL meets Pandas, you get a powerful duo for analytics. SQL is great for fetching data from structured sources, while Pandas gives you flexible transformation and exploration.

What is it?
A blended approach of using SQL for querying and Pandas for post-processing and analysis in Python.

Where is it used?
In data pipelines, dashboards, business intelligence reports, and any scenario where raw data needs cleaning and analysis.


Scenario 1: Detailed in-memory analysis. Fetch once with SQL, then slice, group, and compute in Pandas.


Scenario 2: Joining tables. Let the database combine tables before the data ever reaches Python.

Both scenarios are sketched in the example after the list below.


How is it used?
• Use SQL (via SQLite or databases) to extract relevant data
• Load SQL results into a Pandas DataFrame using pd.read_sql_query()
• Perform filtering, grouping, and visualization in Pandas
• Combine multiple queries and DataFrames for deeper analysis
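
A minimal sketch of both scenarios (the database file, table, and column names here are hypothetical):

import sqlite3
import pandas as pd

conn = sqlite3.connect("bank.db")  # hypothetical database file

# Scenario 2: join tables at the database level and fetch only the needed fields
query = """
SELECT c.customer_id, c.age, a.balance
FROM customers c
JOIN accounts a ON a.customer_id = c.customer_id
"""
df = pd.read_sql_query(query, conn)
conn.close()

# Scenario 1: detailed in-memory analysis in Pandas
print(df.groupby("age")["balance"].mean().head())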


Summary:

SQL: Ideal for querying and filtering large datasets stored in databases and joining tables.

Pandas: Best for detailed in-memory analysis, custom calculations, and data transformations after the data is fetched. 


--- Takeaways / best practices
• Always limit/select only required fields in SQL
• Use Pandas after SQL for flexibility and readability
• Avoid redundant transformations — choose the right tool for the step
• Index your data when using joins or merges in Pandas


- Bank Attrition Analysis

Why do customers leave a bank? Attrition analysis helps uncover behavioral and demographic patterns that influence churn, helping institutions retain their clients.

What is it?
A real-world case study using customer data to analyze and predict churn.

Where is it used?
Common in finance and customer success teams for retention strategy and predictive modeling.


Practice Problem:


Dataset to be used - Link
Problem Statement:

A banking institution is facing challenges in credit risk assessment and customer retention.

The goal of this analysis is to understand customer behavior, detect patterns, and derive insights that will help the bank make data-driven decisions, such as:

  • Upselling products, and

  • Improving customer retention


Objective:

  • Improve Customer Retention – Identify factors that influence customer churn and engagement levels.

  • Understand how the churn rate varies across age groups, demographics, gender, etc.





Key Metrics Involved:
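
A minimal sketch of the kind of metrics involved, assuming the data is loaded into df with columns such as Attrition_Flag and Customer_Age (hypothetical names):

import pandas as pd

# Overall churn rate
churn_rate = (df["Attrition_Flag"] == "Attrited Customer").mean()
print(f"Overall churn rate: {churn_rate:.1%}")

# Churn rate by age group
df["Age Group"] = pd.cut(df["Customer_Age"], bins=[18, 30, 45, 60, 100])
print(df.groupby("Age Group")["Attrition_Flag"].value_counts(normalize=True).unstack())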


--- Takeaways / best practices
• Look for patterns in demographics, tenure, and transactions
• Check data balance — is churn a minority class?
• Use correlation heatmaps to guide variable selection
• Don’t forget to document business implications of your insights


Topic 2: Visualizing Data with Pandas and Matplotlib


- Importing Data using SQLite

Working with real-world data often means querying structured databases. SQLite offers a lightweight and efficient way to do this directly in Python.

Unlike traditional databases, it does not require a separate server process, making it ideal for local data storage.


What is it?
SQLite is a self-contained SQL database engine often used to store and query relational data locally.

Where is it used?
In projects needing small-scale relational storage, quick prototyping, and local querying (e.g., apps, internal tools, bootstrapped dashboards).


What makes SQLite different from other SQL databases:

No Server Needed: Unlike databases like MySQL or PostgreSQL that require a separate server, SQLite runs directly on your computer.

Same SQL, Different Storage: The SQL queries you already know (SELECT, JOIN, GROUP BY) work in SQLite, but instead of storing data on a big server, it’s stored in a single file. 


How is it used?
• Use Python’s sqlite3 library to connect to the database
• Write SQL queries to fetch data
• Use pandas.read_sql_query() to load the results into a DataFrame

Example:
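
A minimal sketch, assuming a hypothetical sales.db file with an orders table:

import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")
df = pd.read_sql_query(
    "SELECT product, SUM(amount) AS total FROM orders GROUP BY product", conn
)
conn.close()  # always close the connection after use
print(df.head())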


--- Takeaways / best practices
• Always close your database connection after use
• Filter data at the SQL level when possible to reduce memory load
• Use SQL joins to combine multiple tables before loading into Pandas

- Visualization in Pandas

Pandas makes it surprisingly simple to go from DataFrame to basic plots with minimal code.

What is it?
Pandas has built-in .plot() functionality powered by Matplotlib for quick visualizations.

Where is it used?
Ideal for rapid exploratory data analysis and creating visuals directly from tabular data.

How is it used?
• Use .plot(kind='line'/'bar'/...) on Series or DataFrame objects
• Customize title, labels, and colors with simple arguments
• Combine with grouping or pivoting for insightful plots
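
For example, a quick bar chart straight from a DataFrame (sample data made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [100, 140, 120]})
df.plot(kind="bar", x="month", y="sales", title="Monthly Sales", color="green")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()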

--- Takeaways / best practices
• Use value_counts() or groupby() before plotting for better insights
• Add titles and axis labels to improve readability
• For customization beyond basic charts, switch to Matplotlib

- Intro to Matplotlib

Matplotlib is the foundation for most Python visualizations — it gives you full control over your plots.

What is it?
A 2D plotting library used for generating static, animated, or interactive visualizations.

Where is it used?
Across domains in Python-based analytics, reporting, and dashboards.


Syntax:
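
A minimal sketch of the basic call pattern:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])  # x values, y values
plt.show()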



How is it used?
• Import matplotlib.pyplot as plt
• Use plt.plot(), plt.bar(), plt.scatter() to create charts
• Call plt.show() to display the plot


Example:
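
A small example (the data is made up for illustration):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12, 15, 14, 18]

plt.plot(months, revenue, marker="o")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.grid(True)
plt.show()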


--- Takeaways / best practices
• Always include plt.title(), plt.xlabel(), and plt.ylabel()
• Use plt.grid(True) to improve chart readability
• Combine with subplots for comparative visuals

- Customizing Plots

Customization transforms your charts from raw output to polished insights.

What is it?
The process of modifying visual elements like color, style, labels, and size in a chart.

Where is it used?
In dashboards, reports, and presentations to match audience needs and highlight key data points.


How is it used?
• Customize color using color='green'
• Add legends with plt.legend()
• Set figure size using plt.figure(figsize=(width, height))
• Use markers, line styles, and annotations to enhance the message

Example:
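
A sketch showing the customizations listed above (sample data made up):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

plt.figure(figsize=(8, 4))  # set figure size
plt.plot(x, y, color="green", linestyle="--", marker="o", label="Sales")
plt.legend()  # show the legend
plt.title("Customized Line Plot")
plt.show()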


Note:

To adjust spacing and labels on the axes, use x-axis and y-axis ticks.

Syntax:

plt.xticks(ticks=None, labels=None)

plt.xticks([1, 2, 3, 4], ["A", "B", "C", "D"])


# ticks: a list of positions where the ticks should appear on the x-axis

# labels: a list of labels corresponding to the tick positions


Line Plot for Multiple Product Sales:

A company tracks sales of three products over time.

Data: Monthly sales of three different products.

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

product_A = [10000, 12000, 13000, 12500, 14000, 15000]

product_B = [8000, 9500, 10500, 10200, 11000, 12000]

product_C = [6000, 7200, 8000, 8200, 8800, 9500]


  1. Create line plots for the above in the same chart (one possible solution follows below).

  2. Add a title and axis labels, and show grids.
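
One possible solution using the data above:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
product_A = [10000, 12000, 13000, 12500, 14000, 15000]
product_B = [8000, 9500, 10500, 10200, 11000, 12000]
product_C = [6000, 7200, 8000, 8200, 8800, 9500]

plt.plot(months, product_A, marker="o", label="Product A")
plt.plot(months, product_B, marker="o", label="Product B")
plt.plot(months, product_C, marker="o", label="Product C")
plt.title("Monthly Sales by Product")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.grid(True)
plt.legend()
plt.show()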




--- Takeaways / best practices
• Consistency in colors and labels improves clarity
• Use legends only when multiple series are plotted
• Limit over-customization that may distract from data


- Creating Basic Plots

Creating simple visualizations is a vital part of quick data storytelling.

What is it?
Generating standard chart types like bar, line, and scatter plots using Matplotlib or Pandas.

Where is it used?
At every stage of data analysis — especially EDA and final presentations.

How is it used?
• Choose chart type based on data (e.g., bar for categories, line for trends)
• Use .plot(kind='bar'), plt.bar(), or plt.scatter() as needed
• Label axes, title, and apply suitable styles

Example:
Create a bar chart for product sales.

To create a bar chart, use: plt.bar(<x-values>, <y-values>)

  1. For the product sales data, create a bar chart with category on the x-axis and sales on the y-axis

  2. Add a title and axis labels to your chart (one possible solution follows the data below)

Data for reference:

categories = ["Electronics", "Clothing", "Furniture", "Toys", "Books"]


sales = [50000, 30000, 40000, 20000, 10000] 
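
One possible solution:

import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Furniture", "Toys", "Books"]
sales = [50000, 30000, 40000, 20000, 10000]

plt.bar(categories, sales)
plt.title("Product Sales by Category")
plt.xlabel("Category")
plt.ylabel("Sales")
plt.show()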




Matplotlib, Seaborn and Plotly:

Matplotlib = Basic Word Document (manual styling required).

Fully customizable but requires effort for better styling.


Seaborn = Pre-designed Word Template (better visuals with less effort).

Built on Matplotlib, with predefined styles and statistical plots.


Plotly = Interactive Online Report (users can explore data dynamically).

Web-based library for interactive, dynamic visualizations.



--- Takeaways / best practices
• Pick the right chart for your data type
• Avoid 3D and overly colorful visuals unless necessary


Topic 3: Cleaning and Preparing Real-World Data


This is the dataset to be used:

  • Contains around 15k rows of data

  • Is stored in a CSV file

- Importing Data

Before any analysis begins, data needs to be loaded into your environment correctly and efficiently.

What is it?
Importing data means reading files (like CSV, Excel, or SQL) into Python for further processing.

Where is it used?
Every step of analytics begins with bringing in the data — whether from local files or databases.

How is it used?
• Use pd.read_csv() to load CSV files
• Use pd.read_excel() for Excel spreadsheets


Recap: Data Import using Pandas

df = pd.read_csv("data.csv") #reads dataframe from the CSV file

df = pd.read_csv("data.csv", header=1)  # Second row as header

df = pd.read_csv("data.csv", header=None)  # No header row, default column numbers


Recap: Skipping Rows 


df = pd.read_csv('data.csv', skiprows=n) # Skip first n rows

df = pd.read_csv('data.csv', skiprows=[2, 5]) # Skip specific rows by index



Example:
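
A minimal sketch (the file name is hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")
print(df.shape)   # (rows, columns)
print(df.head())  # preview the first five rows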



--- Takeaways / best practices
• Always verify file encoding and separators
• Preview with .head() and .shape after importing (.shape is an attribute, not a method)
• Parse dates and handle missing values during import when possible (e.g., parse_dates, na_values)

- Data Cleaning

Raw data can be messy. Cleaning ensures it’s usable, accurate, and structured for insights.

What is it?
Data cleaning involves correcting or removing bad, missing, or inconsistent entries.

Where is it used?
Used in every domain to ensure high data quality before visualization or modeling.

How is it used?
• Drop irrelevant rows/columns
• Replace or fill nulls
• Remove duplicates
• Fix incorrect formats
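
A short sketch of these steps (column names are assumed for illustration):

import pandas as pd

df = df.drop(columns=["unused_col"])     # drop an irrelevant column
df = df.drop_duplicates()                # remove duplicate rows
df["amount"] = df["amount"].fillna(0)    # replace nulls with a default
df["date"] = pd.to_datetime(df["date"])  # fix an incorrect format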

--- Takeaways / best practices
• Always check for nulls, duplicates, and invalid entries
• Cleaning improves accuracy and trust in your analysis
• Save checkpoints while cleaning large datasets

- Fixing Column Names

Column names often come in irregular or messy formats — which impacts readability and coding.



What is it?
Standardizing column names to be clean, lowercase, and consistent.


Where is it used?
In dataframes used for scripting, automation, and modeling.

How is it used?
• Strip white spaces
• Convert to lowercase
• Replace spaces with underscores
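
All three steps fit in one line:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
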
--- Takeaways / best practices
• Always clean column names right after import
• Avoid spaces and capital letters in column names
• Use meaningful, readable names for clarity

- Handling Missing Data

Missing values can silently distort your analysis — they need special attention.

What is it?
Handling missing data means either removing or replacing NaN or null values.

Where is it used?
Common in survey data, logs, financial records, and customer data.

How is it used?
• Check missing values using isnull().sum()
• Use dropna() to remove rows with missing values
• Use fillna() to fill with a default or statistical value

Checklist for Handling Missing Values

Choose a Strategy

✔ Drop if missing data is minimal: df.dropna()

✔ Fill (Impute)

  • Numeric: mean(), median(), mode()

  • Categorical: mode() or "Unknown"

✔ Forward/backward fill for sequential data: df.fillna(method='ffill') (or df.ffill() in newer pandas)


Example:
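
A sketch of these steps (column names are assumed):

print(df.isnull().sum())                          # count missing values per column

df = df.dropna(subset=["customer_id"])            # drop rows missing a key field
df["age"] = df["age"].fillna(df["age"].median())  # numeric: fill with the median
df["gender"] = df["gender"].fillna("Unknown")     # categorical: fill with a sentinel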



--- Takeaways / best practices
• Don’t blindly drop — understand the context of missing values
• Use domain knowledge to decide how to impute
• Missing values can sometimes carry useful signals

- Data Consistency Issues

Inconsistent formats, spellings, or cases in data lead to poor grouping and analysis.

What is it?
Data consistency issues occur when the same value appears in different formats (e.g., "Male", "male", "MALE").




How is it used?
• Use str.lower() or str.upper() for uniformity
• Replace known variations using .replace()
• Check unique values with df['col'].unique()



Fix Inconsistent Data:
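
A sketch, assuming a 'gender' column with mixed formats:

df["gender"] = df["gender"].str.strip().str.lower()                # standardize case
df["gender"] = df["gender"].replace({"m": "male", "f": "female"})  # map known variations
print(df["gender"].unique())                                       # verify the remaining values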


--- Takeaways / best practices
• Standardize as early as possible in the pipeline
• Always check unique values for key columns
• Use dictionaries for efficient replacements

- Detecting Outliers

Outliers can skew your analysis and often signal anomalies or errors.

What is it?
Outliers are values that are significantly higher or lower than the rest of the data.

Where is it used?
In customer spending, sensor data, fraud detection, etc.

How is it used?
• Use IQR (Interquartile Range) method


• Use visualization: boxplot

Boxplot:


Example:
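
A sketch of the IQR method plus a boxplot (the 'amount' column is assumed):

import matplotlib.pyplot as plt

Q1 = df["amount"].quantile(0.25)
Q3 = df["amount"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["amount"] < Q1 - 1.5 * IQR) | (df["amount"] > Q3 + 1.5 * IQR)]
print(len(outliers), "potential outliers")

plt.boxplot(df["amount"].dropna())  # visual check
plt.title("Boxplot of amount")
plt.show()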



--- Takeaways / best practices
• Visuals are often the best way to spot outliers
• Treating outliers depends on context: fix, remove, or keep
• Document assumptions while removing them

- Handling Outliers

Once detected, outliers must be addressed carefully depending on the data’s purpose.

What is it?
The process of adjusting, capping, or removing extreme values from the dataset.

Where is it used?
Crucial before modeling or summarizing data, especially in sensitive domains.

How is it used?
• Remove rows containing outliers
• Cap values using quantiles (winsorization)
• Transform values using log or square root

Example (Use IQR method to list Salary Outliers):
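
A possible sketch, assuming a 'Salary' column:

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

print(df[(df["Salary"] < lower) | (df["Salary"] > upper)])  # list the outliers

# Option: cap instead of removing (winsorization)
df["Salary"] = df["Salary"].clip(lower=lower, upper=upper)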



--- Takeaways / best practices
• Never handle outliers without understanding why they exist
• Try visualizing before and after cleaning
• Outliers might be valid — validate with domain experts when unsure


Topic 4: Exploratory Data Analysis (EDA)


- Working on the Bank Dataset

Working with a real-world dataset helps bring context to statistical patterns and customer behavior.

What is it?
An exercise using actual customer data from a bank to analyze churn, demographics, and financial activity.


Bank attrition dataset link: click here


Where is it used?
In business analytics, customer retention modeling, and financial forecasting.

How is it used?
• Load the dataset using pd.read_csv()
• Explore customer profiles, balances, and account activity
• Clean and transform data
• Use this dataset as the foundation for EDA and feature engineering

--- Takeaways / best practices
• Understand business context before diving into analysis
• Start by exploring key features like age, balance, and churn label
• Visualize categorical and numerical fields separately

- Performing Basic Exploratory Data Analysis (EDA)

EDA helps you “talk” to the data and understand its structure, patterns, and oddities.

What is it?
A structured process to summarize main characteristics of data before any modeling.

Where is it used?
In every data project to get a sense of distribution, outliers, and trends.

How is it used?
• Use .info() and .describe() for structural overview
• Use .value_counts() to understand categories:


• Plot histograms, bar charts, and box plots for visual insights:
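
A sketch pulling these together (column names assumed from the bank dataset):

import matplotlib.pyplot as plt

df.info()                                    # structure and dtypes
print(df.describe(include="all"))            # numeric and categorical summary
print(df["Attrition_Flag"].value_counts())   # category counts

df["Customer_Age"].hist(bins=20)             # distribution of a numeric field
plt.title("Age Distribution")
plt.show()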







--- Takeaways / best practices
• Use .describe(include='all') for both numeric and categorical insights
• Summarize findings after each round of EDA

- Feature engineering


What is it?

Feature engineering involves creating new meaningful columns from existing ones. 

This helps in:

Better customer segmentation (e.g., identifying high-value customers).

Improved risk assessment (e.g., calculating debt-to-income ratio).

Enhancing predictive models (e.g., predicting customer churn). 


Where is it used?
Widely used in machine learning, reporting automation, and performance segmentation.

How is it used?
• Combine features (e.g., total balance = savings + credit)
• Derive new ones (e.g., tenure buckets, spending categories)


Example:

 Adding new features:

Create new feature to help assess loyalty.

Steps:

Add Loyalty Score column using (Tenure * Satisfaction Score) / (1 + Count of Complaints) to predict churn risk.

Display the first five customers with these new scores using df.head() 
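
A sketch of these steps (column names assumed from the formula above):

df["Loyalty Score"] = (df["Tenure"] * df["Satisfaction Score"]) / (1 + df["Count of Complaints"])
print(df.head())  # first five customers with the new score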




--- Takeaways / best practices
• Features should be interpretable and relevant
• Normalize or scale if required by algorithms
• Don’t create redundant or highly correlated features

- Using transform()

When you want to apply a function while keeping the shape of the data intact, transform() is your friend.

What is it?
A Pandas method that applies a function to each group and returns a Series with the same index.

Where is it used?
In group-based operations where you need the transformed result to align with the original structure.



Syntax:

df["New Column"] = df.groupby("Grouping Column")["Target  Column"].transform("Function") 


How is it used?
• Use inside groupby() to compute metrics per group
• Apply standardization or custom functions within each group

Example:

  1. Create a 'State Avg Complaints' column with each customer's state-level average Count of Complaints

  2. Create a 'High Complainer' boolean flag (1 or 0) for customers above their state average (hint: use astype)

  3. Display the derived values with First Name included

For this, use random sampling - # Random sampling of 5 users

df_sample = df.sample(n=5, random_state=42)
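
A possible sketch of all three steps (the 'State', 'Count of Complaints', and 'First Name' columns are assumed):

df["State Avg Complaints"] = df.groupby("State")["Count of Complaints"].transform("mean")
df["High Complainer"] = (df["Count of Complaints"] > df["State Avg Complaints"]).astype(int)

df_sample = df.sample(n=5, random_state=42)  # random sampling of 5 users
print(df_sample[["First Name", "Count of Complaints", "State Avg Complaints", "High Complainer"]])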



--- Takeaways / best practices
• Use transform() when you want to keep original row-level details
• Works great for per-group normalization
• Prefer transform() over apply() when returning same-sized output

- Correlation Between Variables

Understanding how features relate to each other is key to preventing redundancy and building better models.

What is it?
Correlation measures the degree to which two variables move together.

Where is it used?
In feature selection, multicollinearity detection, and statistical reporting.

How is it used?
• Use .corr() on DataFrame to get the correlation matrix
• Visualize with heatmaps for easier pattern recognition

Example:

correlation_matrix = df.corr(numeric_only=True)  # numeric columns only


print(correlation_matrix[['Customer_Age', 'Total_Balance']])

Output:

               Customer_Age  Total_Balance
Customer_Age           1.00           0.25
Total_Balance          0.25           1.00







--- Takeaways / best practices
• Focus on variables with strong correlations (absolute value above roughly 0.5)
• Use heatmaps to filter top correlations visually
• Watch for highly correlated inputs in predictive models

- Univariate, Bivariate and Multivariate Analysis

Data analysis can be classified into three main types based on the number of variables being analyzed:

Univariate Analysis → Examines a single variable.

Bivariate Analysis → Examines the relationship between two variables.

Multivariate Analysis → Examines the relationship among three or more variables. 

Where is it used?
In hypothesis testing, EDA, feature exploration, and visual storytelling.

How is it used?
• Univariate: use hist(), value_counts()
• Bivariate: use scatter(), groupby(), corr()
• Multivariate: use pairplots, heatmaps, and regression plots


Example (bivariate):
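
A bivariate sketch (column names assumed):

import matplotlib.pyplot as plt

plt.scatter(df["Customer_Age"], df["Total_Trans_Amt"], alpha=0.4)
plt.xlabel("Customer Age")
plt.ylabel("Total Transaction Amount")
plt.show()

print(df["Customer_Age"].corr(df["Total_Trans_Amt"]))  # correlation coefficient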



--- Takeaways / best practices
• Start from univariate to multivariate for deeper insights
• Use visual and statistical tools side by side
• Capture interactions — not just individual trends

Topic 5: Churn Analysis with Advanced Visuals


To better understand why customers leave, analyzing demographic patterns is critical.

Bank Dataset:  Link 


What is it?
This involves using customer age, gender, marital status, income, and similar features to detect churn behavior patterns.

Where is it used?
In customer segmentation, retention strategy design, and churn prediction.

How is it used?
• Filter or group data based on demographic features
• Compare churn rates across gender, age groups, and income levels
• Use summary tables and bar plots to show variation

--- Takeaways / best practices
• Always explore churn by categories like gender, age, marital status
• Use normalized percentages to avoid volume bias
• Combine demographic data with product usage for better insights


- Univariate analysis

This helps you look at individual variables to understand their distribution and characteristics.

What is it?
Analysis of a single variable, whether numeric or categorical, to observe patterns, extremes, or imbalance.

Where is it used?
Used early in EDA to evaluate variables before exploring relationships.

How is it used?
• Use value_counts() for categorical features
• Use histograms or boxplots for numerical features
• Check for skewness, outliers, and frequency distribution

--- Takeaways / best practices
• Helps determine if transformation or binning is needed
• Identify outliers visually before deciding how to treat them
• Use boxplots when comparing distributions


Example:
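
A univariate sketch (column names assumed):

import matplotlib.pyplot as plt

print(df["Gender"].value_counts())  # categorical: frequency counts

df["Customer_Age"].plot(kind="hist", bins=20, title="Age Distribution")
plt.xlabel("Age")
plt.show()

df["Customer_Age"].plot(kind="box")  # boxplot to spot outliers and skew
plt.show()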



- Bivariate analysis

Here we explore how two variables are related — e.g., gender vs churn, age vs income.

What is it?
Analysis of the relationship between two variables, typically dependent vs independent.

Where is it used?
Common in identifying influencers of churn, pricing impact, or campaign response.

How is it used?
• Use groupby() and mean() for numerical trends
• Use scatter plots or bar plots for categorical comparisons
• Use correlation matrix for continuous variables

--- Takeaways / best practices
• Focus on churn differences across segments
• Visualize comparisons using grouped bar plots
• Watch for misleading averages — always check variance

Example:
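
A bivariate sketch comparing churn across a category (column names assumed):

import matplotlib.pyplot as plt

churn_by_gender = df.groupby("Gender")["Attrition_Flag"].apply(
    lambda s: (s == "Attrited Customer").mean()
)
churn_by_gender.plot(kind="bar", title="Churn Rate by Gender")
plt.ylabel("Churn rate")
plt.show()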


- Visualization with matplotlib and seaborn

These libraries allow deep, beautiful, and customized static visualizations.

What is it?
Matplotlib is a base Python plotting library; Seaborn builds on it for statistical plots.

Where is it used?
Widely used in data analysis notebooks, dashboards, and presentations.




How is it used?
• Use Matplotlib for line, bar, and pie charts
• Use Seaborn for distributions, boxplots, heatmaps
• Customize themes, colors, and layouts easily


How can we give custom labels for each series?
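
One way, assuming the question refers to legend labels for each plotted series (column names are assumed):

import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.countplot(data=df, x="Card_Category", hue="Attrition_Flag")
ax.set_xlabel("Card Category")
ax.set_ylabel("Number of Customers")
ax.legend(title="Customer Status")  # pass labels=[...] to rename the entries
plt.show()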



--- Takeaways / best practices
• Use Seaborn for better visuals and less code
• Always label axes and titles clearly
• Save plots with plt.savefig() for reuse

- Interactive visualization using plotly

Static plots tell stories; interactive plots allow your audience to explore those stories.

What is it?
Plotly is a library that creates dynamic, web-based interactive charts like sliders, zooms, and hover data.

Where is it used?
Ideal for dashboards, reports, and live data tools.

How is it used?
• Use plotly.express for quick plots
• Embed in web apps using Dash or notebooks

Example:

import plotly.express as px
fig = px.scatter(df, x="Customer_Age", y="Total_Trans_Amt", color="Attrition_Flag")
fig.show()


--- Takeaways / best practices
• Use for large datasets to explore deeper interactions
• Customize tooltips and colors for better storytelling


What is unstack()?

In basic terms, unstack() is a function that reshapes your data from rows into columns.

E.g., if employment type is in the rows, unstack() moves it into separate columns.

How to use it?

Syntax: data.unstack()




Topic 6: Product-Related Churn Analysis


When a customer stops using a product — like a credit card or savings account — it’s essential to know which product and why. This helps businesses refine offerings and retain users.

What is it?
Product-related churn analysis involves examining usage patterns across various products (e.g., credit cards, loans, online services) to identify features driving churn.

Where is it used?
Used heavily in banking, telecom, e-commerce, SaaS platforms, and insurance to identify underperforming products or services leading to customer loss.

How is it used?
• Identify product-specific columns (e.g., Total_Trans_Ct, Credit_Limit, Card_Category)
• Compare metrics across churned vs non-churned groups
• Use bar plots or pivot tables to visualize product usage patterns
• Group by product features and analyze churn percentage


Why is product-related churn analysis important?

A bank wants to understand why some of its customers are leaving. 


They suspect that customers who use fewer products (like credit cards, loans, or savings accounts) are more likely to churn. 


They aim to identify which products are most linked to churn to create targeted retention strategies


E.g., a customer using both a credit card and a loan might be more likely to stay because they have multiple touchpoints with the bank.


Example:
# Calculate churn rate by card category
churn_rate = df.groupby('Card_Category')['Attrition_Flag'].value_counts(normalize=True).unstack()
print(churn_rate)


Output (sample):

Attrition_Flag  Attrited Customer  Existing Customer
Card_Category
Blue                         0.18               0.82
Silver                       0.24               0.76
Gold                         0.31               0.69
Platinum                     0.42               0.58


This shows higher attrition rates in premium products like Platinum and Gold, suggesting they may not be meeting customer expectations.

--- Takeaways / best practices
• Compare product features with churn labels to find hidden trends
• Consider creating new variables like “product engagement score”
• Use visuals (grouped bars, pie charts) to share insights with business teams
• Always map churn back to business value — not all churn is equal


Topic 7: Financial Metrics and Customer Feedback


- Financial and Feedback analysis

Understanding how customers spend and what they feel about the service gives a 360° view of why they stay — or leave.

What is it?
Financial and Feedback analysis combines transactional metrics (spending, credit limit, income) with subjective input (feedback, complaints, satisfaction).

Where is it used?
Used in churn prediction, customer satisfaction tracking, and service quality optimization — especially in banking, e-commerce, and telecom industries.

How is it used?
• Analyze key financial indicators like Credit_Limit, Total_Revolving_Bal, Total_Trans_Amt
• Compare churned vs non-churned on financial performance
• If feedback data exists (e.g., survey scores, complaints), apply sentiment analysis or frequency counts
• Use groupby() and describe() for financial summaries
• Visualize trends over churn, income brackets, or transaction levels
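
A sketch of the financial side (column names assumed from the bank dataset):

# Compare financial indicators for churned vs. existing customers
print(df.groupby("Attrition_Flag")[["Credit_Limit", "Total_Revolving_Bal", "Total_Trans_Amt"]].describe())

# Churn split by income bracket
print(df.groupby("Income_Category")["Attrition_Flag"].value_counts(normalize=True).unstack())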



--- Takeaways / best practices
• Always normalize financial data (e.g., per customer or per month) for fair comparison
• Use outlier detection to remove unusually high-value accounts that may skew averages
• Combine numerical and textual insights (if available) for better churn profiling
• Financial analysis alone isn’t enough — it should be paired with customer behavior or sentiment






Topic 8: Solving Business Problems with Data


- Understanding Data and Problem Statement

Before jumping into analysis, you must clearly understand what problem you’re solving and what data you're working with.

What is it?
Interpreting the business challenge and aligning it with the available dataset.

Where is it used?
In any analytics or machine learning project — to make sure the solution fits the actual business need.

How is it used?
• Review the dataset: what each column means
• Identify the target variable (e.g., churn, revenue)
• Align data fields with business context
• Formulate the key questions (e.g., What influences churn? What predicts purchase?)

--- Takeaways / best practices
• Never begin without defining the problem clearly
• Look at business metrics, not just columns
• Ask: What decision will this analysis support?

Example:

Data Analyst @Airbnb 

First, What is Airbnb?

  • Online platform for listing, discovering, and booking accommodations.

  • Business Model: Connects hosts offering short-term rentals with travelers.

  • Listing Types: Entire homes, private rooms, shared rooms.

The problem statements to work through:

  1. Customer Preference Analysis

  2. Pricing Strategy Analysis

  3. Growth Opportunity Analysis

Let’s understand each one better!


Customer Preference Analysis:


Find answers to the following:

  1. What is the most popular room type across different neighborhood groups?

  2. Do customers prefer certain price ranges in specific areas?

  3. Which locations receive the most reviews, indicating high demand?

Pricing Strategy Analysis:


Find answers to the following:

  1. What is the average price of Airbnb listings in different neighborhood groups?

  2. Does room type (Private room, Shared room, Entire home) impact price?

  3. What is the relationship between availability (availability_365) and price?


Growth Opportunity Analysis:


Find answers to the following:

  1. Which neighborhoods are undersaturated or oversaturated with Airbnb listings?

  2. How does the competition vary between budget vs. luxury listings?

  3. What are the best opportunities for new Airbnb hosts based on revenue potential?
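
A sketch of how these questions translate into Pandas, assuming the usual Airbnb listing columns (neighbourhood_group, room_type, price; names may differ in your file):

import pandas as pd

df = pd.read_csv("airbnb_listings.csv")  # hypothetical file name

# Pricing: average price by neighbourhood group
print(df.groupby("neighbourhood_group")["price"].mean().round(2))

# Preference: most popular room type per neighbourhood group
print(df.groupby("neighbourhood_group")["room_type"].agg(lambda s: s.mode()[0]))

# Growth: listing counts to gauge saturation
print(df["neighbourhood_group"].value_counts())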


- Load and Explore the Data

The first technical step in any project — get the data into your workspace and inspect it thoroughly.

What is it?
Reading the dataset and performing an initial scan to understand structure and quality.


Here’s the dataset to be used.


Where is it used?
First phase in every analytics pipeline — EDA begins here.

How is it used?
• Use pd.read_csv() or similar to load data
• Check .shape, .head(), .info()
• Use .describe() to summarize numerical fields
• Check for data types and null values
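
A minimal sketch (file name hypothetical):

import pandas as pd

df = pd.read_csv("listings.csv")
print(df.shape)            # rows x columns
print(df.head())           # first five rows
df.info()                  # dtypes and non-null counts
print(df.describe())       # summary of numeric fields
print(df.isnull().sum())   # null counts per column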


--- Takeaways / best practices
• Always inspect column names and types
• Early insights here save debugging time later

- Filling Missing Values

Missing data can lead to skewed analysis — handling it is critical.

What is it?
Filling gaps in the dataset using strategies like mean, median, or domain-specific logic.

Where is it used?
Before modeling or dashboarding — required for clean results.

How is it used?
• Use .isnull().sum() to count nulls
• Use .fillna() for imputation
• Choose method: mean (for continuous), mode (for categorical), forward fill (for time series)
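
A sketch of these strategies (column names assumed):

print(df.isnull().sum())  # count nulls per column

df["reviews_per_month"] = df["reviews_per_month"].fillna(df["reviews_per_month"].median())  # continuous: median
df["host_name"] = df["host_name"].fillna("Unknown")  # categorical: sentinel value
df["last_review"] = df["last_review"].ffill()        # sequential data: forward fill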



- Handling Outliers

Outliers can distort trends and reduce model accuracy — you must spot and treat them.

What is it?
Values that fall far outside the normal range of the dataset.


Example:
# IQR method
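# A sketch, assuming a hypothetical 'price' column:

Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

print(len(df[(df["price"] < lower) | (df["price"] > upper)]), "outlier rows")

df["price"] = df["price"].clip(lower=lower, upper=upper)  # cap instead of dropping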


For other columns like number_of_reviews and reviews_per_month, the values are valid but extreme, so we cap them rather than drop the rows.


--- Takeaways / best practices
• Outliers aren’t always bad — they may be valid extreme cases
• Visualize before deciding to drop or cap
• Use business context to guide treatment

- Working around Problem statements

Solving a problem is more than just analyzing data — it’s about turning findings into actionable outcomes.

What is it?
Applying analytical methods to answer specific business questions defined at the start.

Where is it used?
In final stages of analytics — drawing conclusions or building models to meet goals.

How is it used?
• Use findings from EDA and feature engineering
• Form hypotheses (e.g., high-transaction customers churn less)
• Support your insights with visuals, stats, or models
• Recommend actionable next steps

Example:
If churn is highest among customers with low balance and no product upgrades — suggest targeting them with promotions.

--- Takeaways / best practices
• Tie your insights back to business impact
• Don’t just answer what — explain why
• Structure your solution so it’s usable (in reports, models, or dashboards)


Topic 9: AI in Data Analytics & Review Analysis


- Data Analytics in Python using AI

We’re entering the era where AI enhances how we analyze data — automating insights, surfacing trends, and making predictions.

What is it?
Leveraging AI techniques (like NLP or ML) within Python to extract deeper, faster insights from data.

Where is it used?
In product analytics, customer service automation, recommendation systems, fraud detection, and more.


How is it used?
• Use AI libraries (e.g., scikit-learn, spaCy, transformers) to process and model data
• Automate feature extraction using ML
• Apply pre-trained models to categorize or summarize textual feedback
• Use AI to scale decision-making based on complex data
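
As one illustration, a pre-trained sentiment model from the transformers library can categorize feedback (the feedback strings are made up; the default model downloads on first use):

from transformers import pipeline

feedback = [
    "The support team resolved my issue quickly.",
    "Hidden fees on my account, very disappointed.",
]

classifier = pipeline("sentiment-analysis")
for text, result in zip(feedback, classifier(feedback)):
    print(f"{result['label']:8} ({result['score']:.2f})  {text}")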




- Prompt Engineering

Prompt engineering is the art of asking AI models the right questions — it’s key to getting accurate, helpful outputs.

What is it?
Structuring your input (prompts) to AI tools like ChatGPT or LLMs for better results.

Where is it used?
Used in data cleaning, report generation, code writing, and summarizing customer feedback via AI.

How is it used?
• Start with clear, specific instructions
• Use examples in prompts when needed
• Break down large problems into smaller steps
• Refine outputs using follow-up prompts



- AI Libraries in Python

Python has a robust ecosystem for AI — from classic ML to state-of-the-art deep learning and NLP.

What is it?
Libraries that offer AI and ML capabilities: modeling, text processing, prediction, classification, etc.

Where is it used?
Across data science tasks like classification, regression, clustering, sentiment analysis, and more.
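
A minimal scikit-learn sketch on churn-style data (column names assumed; not tied to a specific dataset):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df[["Customer_Age", "Total_Trans_Amt", "Total_Revolving_Bal"]]
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))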







  • Interview Preparation



Prepare for an interview using this doc.

  • This contains interview questions on given topics

  • Make sure to go through all the questions










