TABLE OF CONTENTS
DA-201
Sprint summary
Getting Started
Topic 1: What-If Analysis
- What if analysis
Understanding different possibilities and planning for the unknown is critical in data-driven decision-making. That’s where What If Analysis becomes a powerful tool.
What is it?
What If Analysis allows analysts to model different scenarios and evaluate the outcomes by changing input variables without altering the actual data.
Where is it used?
Used in budgeting, forecasting, sensitivity analysis, and decision modeling in tools like Excel, Power BI, and financial modeling platforms.
How is it used?
Excel supports this kind of scenario modeling through its built-in What-If Analysis feature. With What-If Analysis, you try out different possible scenarios and see how they affect your outcomes, so you can prepare for different situations instead of being surprised by them.
Note: This feature is not available in the free online version of Excel.
Types of What-If Analysis
There are three types:
• Goal Seek
• Scenario Manager
• Data Table
Goal Seek:
Goal Seek is a tool that helps you find the right input to achieve a specific target.
Example: Goal Seek Analysis by Region for Insurance Coverage
Goal: Determine the insurance coverage needed in each region to keep the total amount that patients pay out-of-pocket below a set limit. Example: What insurance coverage percentage is required in each region to ensure out-of-pocket expenses stay below $100,000?
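To see the arithmetic Goal Seek is solving: out-of-pocket cost = (1 - coverage) x total billed amount, so the required coverage is at least 1 - (out-of-pocket limit / total billed). As a made-up illustration, if a region billed $500,000 in total, keeping out-of-pocket below $100,000 needs coverage of at least 1 - 100,000/500,000 = 80%. Goal Seek finds that value automatically by varying the coverage cell until the out-of-pocket formula hits the target.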
Scenario Manager:
Scenario Manager allows you to create and compare multiple "what-if" scenarios.
Data Table:
Data Tables allow you to see how changes in one or two variables affect your results across a range of scenarios.
How the Data Table Works:
You build a formula, list the input values you want to test in a row or column, and Excel recalculates the formula once for each input value so you can see all the results side by side.
Data Table vs Scenario Manager
Data Table: automatically varies one or two input cells across a whole range of values; best suited for sensitivity analysis.
Scenario Manager: stores named scenarios (saved sets of input values) that you can switch between or summarize in a report; best suited for comparing a handful of distinct cases.
--- Takeaways / Best Practices
• Always validate scenarios with domain knowledge
• Limit parameter ranges to realistic values
• Use slicers for interactivity in dashboards
Topic 2: Window Functions in SQL
- Window functions
When you are working with large datasets and need to perform calculations across a set of rows related to the current row, window functions are your go-to SQL powerhouse.
What is it?
Window Functions perform calculations across a set of table rows that are somehow related to the current row, without collapsing the result set like GROUP BY does.
A window function lets us perform calculations across a specific set of rows related to the current row. The calculations happen within a defined window of data, which makes window functions useful for aggregates, rankings, and cumulative totals without altering the dataset. The OVER clause is what defines this window.
Where is it used?
Used in SQL queries for ranking, running totals, moving averages, and comparisons across partitions in databases like PostgreSQL, MySQL, SQL Server, etc.
How is it used?
Syntax:
SELECT column_name1,
       window_function(column_name2) OVER ([ORDER BY column_name3]) AS new_column
FROM table_name;
How window functions work - Example
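The worked example from the session is not reproduced in these notes. As a minimal sketch of the idea, using the Orders table referenced later in this sprint (the order_id column is assumed for illustration), each row keeps its own values and the window function simply adds a running total alongside them:

SELECT order_id,
       restaurant_id,
       total_price,
       SUM(total_price) OVER (ORDER BY order_id) AS running_revenue  -- running total, no rows collapsed
FROM Orders;

Unlike a GROUP BY, every order still appears as its own row; only the extra running_revenue column is added.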
Partition By Clause:
The PARTITION BY clause is an extension of the OVER() clause that divides the result set into logical partitions, allowing you to isolate calculations within each partition without collapsing the original rows.
Syntax: <window_function>(<expression>) OVER (PARTITION BY <column> ORDER BY <column>)
Working of the PARTITION BY clause:
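The session's illustration is not included here. A minimal sketch on the same assumed Orders table: each row receives aggregates computed only within its own restaurant's partition, while the individual rows stay intact:

SELECT order_id,
       restaurant_id,
       total_price,
       SUM(total_price) OVER (PARTITION BY restaurant_id) AS restaurant_total,      -- total per restaurant
       AVG(total_price) OVER (PARTITION BY restaurant_id) AS restaurant_avg_order   -- average order per restaurant
FROM Orders;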
--- Takeaways / Best Practices
• Window functions don’t reduce the row count — ideal for detailed analytics
• Use partitions carefully to isolate group-wise calculations
• Essential for leaderboard creation, time-based comparisons, and retention analysis
- SQL Ranking Functions
1. ROW_NUMBER():
The ROW_NUMBER() function is ideal when you need to uniquely number rows within a result set; it assigns a distinct sequential number to every row, even when values tie.
Syntax: ROW_NUMBER() OVER ([PARTITION BY <column>] ORDER BY <column>)
Example:
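The original example is not preserved in these notes. A sketch in the same spirit as the RANK() query below, numbering restaurants by total revenue (every row gets a distinct number, even when revenues tie):

SELECT r.restaurant_id,
       r.name AS restaurant_name,
       SUM(o.total_price) AS total_revenue,
       ROW_NUMBER() OVER (ORDER BY SUM(o.total_price) DESC) AS rownumber
FROM Restaurants r
JOIN Orders o ON r.restaurant_id = o.restaurant_id
GROUP BY r.restaurant_id, r.name
ORDER BY total_revenue DESC;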
2. RANK():
RANK() is a ranking function that assigns ranks to rows within a partition of a result set. It allows ties by giving multiple rows the same rank when they have the same value in the ORDER BY clause, and it skips the next rank(s) after a tie.
Syntax: RANK() OVER ([PARTITION BY <column>] ORDER BY <column>)
Example:
Let’s rank the restaurants by total revenue, this time allowing for ties:
SELECT r.restaurant_id,
       r.name AS restaurant_name,
       CASE WHEN r.restaurant_id IN (96814, 18424018) THEN 50000 -- Force a tie for testing
            ELSE SUM(o.total_price)
       END AS total_revenue,
       RANK() OVER (
           ORDER BY CASE WHEN r.restaurant_id IN (96814, 18424018) THEN 50000 -- Same total for testing
                         ELSE SUM(o.total_price)
                    END DESC
       ) AS rownumber
FROM Restaurants r
JOIN Orders o ON r.restaurant_id = o.restaurant_id
GROUP BY r.restaurant_id, r.name
ORDER BY total_revenue DESC;
3. DENSE_RANK():
DENSE_RANK() is a ranking function similar to RANK(), but it does not leave gaps in the ranking sequence when there are ties. It is ideal when you want a continuous ranking order, even when multiple rows share the same rank.
Syntax: DENSE_RANK() OVER ([PARTITION BY <column>] ORDER BY <column>)
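No example survives in these notes; a sketch reusing the same revenue query, where tied restaurants share a rank and the next rank follows without a gap:

SELECT r.restaurant_id,
       r.name AS restaurant_name,
       SUM(o.total_price) AS total_revenue,
       DENSE_RANK() OVER (ORDER BY SUM(o.total_price) DESC) AS revenue_rank
FROM Restaurants r
JOIN Orders o ON r.restaurant_id = o.restaurant_id
GROUP BY r.restaurant_id, r.name
ORDER BY total_revenue DESC;

If two restaurants tie at rank 2, the next restaurant gets rank 3 here, whereas RANK() would jump to 4.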
4. NTILE():
NTILE() is a ranking function that distributes rows into a specified number of roughly equal groups, called tiles or buckets. It is particularly useful for percentile calculations and cohort analysis.
Syntax: NTILE(<number_of_groups>) OVER ([PARTITION BY <column>] ORDER BY <column>)
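A sketch on the same revenue query, splitting restaurants into four revenue buckets (the bucket count of 4 is chosen only for illustration):

SELECT r.restaurant_id,
       r.name AS restaurant_name,
       SUM(o.total_price) AS total_revenue,
       NTILE(4) OVER (ORDER BY SUM(o.total_price) DESC) AS revenue_quartile  -- 1 = top quarter, 4 = bottom quarter
FROM Restaurants r
JOIN Orders o ON r.restaurant_id = o.restaurant_id
GROUP BY r.restaurant_id, r.name
ORDER BY total_revenue DESC;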
Topic 3: Working with CTEs and Introduction to Real-World Projects
- CTEs
A CTE (Common Table Expression) is a temporary, named result set defined within a SQL query that can be referenced like a table inside a SELECT, INSERT, UPDATE, or DELETE statement.
Where is it used?
CTEs are used in SQL to simplify complex joins and subqueries, improve code readability, enable recursive queries, and in some cases improve query performance.
How is it used?
• Start with a WITH clause followed by the CTE name and query.
• Reference the CTE in the main query like a regular table.
• Useful in breaking down multi-step logic (e.g., filtering before aggregation).
Syntax:
Simple CTE:
WITH cte_name AS (
    SELECT ...
)
SELECT * FROM cte_name;

Chaining Multiple CTEs:
WITH first_cte AS (
    SELECT ...
),
second_cte AS (
    SELECT ... FROM first_cte
)
SELECT * FROM second_cte;
Example:
Improve the query we wrote earlier using CTEs to make it scalable and reusable:
https://www.crio.do/learn/PSDS/ME_SQL_CTE/ME_SQL_CTE_MODULE_CTEISQL/
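The linked module walks through the full rewrite. As a rough sketch of the idea, the revenue aggregation can live in a CTE and the ranking logic can then read from it like a regular table (column names follow the RANK() example above):

WITH restaurant_revenue AS (
    SELECT r.restaurant_id,
           r.name AS restaurant_name,
           SUM(o.total_price) AS total_revenue
    FROM Restaurants r
    JOIN Orders o ON r.restaurant_id = o.restaurant_id
    GROUP BY r.restaurant_id, r.name
)
SELECT restaurant_id,
       restaurant_name,
       total_revenue,
       RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank
FROM restaurant_revenue
ORDER BY total_revenue DESC;

The aggregation step now has a name, can be reused by later CTEs, and can be tested on its own.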
Subqueries vs CTEs
Aspect | Subqueries | CTEs |
Readability | Harder with nested levels | Easier to read and modular |
Reusability | Not reusable | CTEs can be referenced multiple times |
Debuggability | Difficult | Easier to debug each CTE block |
Performance | May be faster for simple cases | Comparable; may optimize better for complex logic |
Use Case Suitability | Best for simple, one-off operations | Best for layered, complex logic |
--- Takeaways / Best Practices
• Use CTEs to make queries easier to maintain and understand.
• Name CTEs clearly to reflect their purpose.
• Combine multiple CTEs for layered logic without nesting.
- Mini Project (Coke Sales Analysis)
- Problem Statement:
We want to analyze the factors affecting sales. For example:
• Which month had the lowest and the highest sales?
• How do sales vary across regions?
• How do sales vary for different brands?
• How do sales vary for different brands across different months?
• How do sales vary for different brands across different regions?
• How do sales vary with different retailers?
…and so on. A starting query for the first question is sketched below.
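How the data is loaded is up to you; assuming a hypothetical coke_sales table with sale_date, region, brand, retailer, and revenue columns (names chosen for illustration only), the month-level question could start like this:

SELECT DATE_FORMAT(sale_date, '%Y-%m') AS sale_month,
       SUM(revenue) AS total_revenue
FROM coke_sales
GROUP BY DATE_FORMAT(sale_date, '%Y-%m')
ORDER BY total_revenue DESC;   -- first row = best month, last row = worst month

The same pattern (swap the GROUP BY column) answers the region, brand, and retailer questions.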
- Project Submission Guidelines
As part of the project submission, we are not limiting you to doing the project in a certain way; you are free to approach it on your own and come up with your own analysis. Here are a few guidelines for submitting your project.
Video demo of your project:
• Record a video presenting your project work. You can use any recording platform such as Loom or OBS Studio, or record in Zoom, download the recording, and submit it.
• Make sure the video length does not exceed 15 minutes.
• Video recording is supported on the Crio platform, but you won't be able to share your screen there.
• Upload the video in the video activity given.
Instructions for your project submission - Link:
• Submit the GitHub repository link in the activity.
• Follow this process to create a GitHub account, push the project, and submit the repository link - Process document.
Topic 4: Debrief and Creating a Reusable Analysis Template
- Template for analysis
How to attempt an analysis?
For the Coke sales analysis, don't we need a template to structure our thought process in a systematic way?
Data Analysis Template:
1. Understanding the Problem Statement
Clearly define the business or operational problem. What are we trying to understand, improve, or predict? Who are the stakeholders, and what decision(s) will this analysis support?
2. What Data Exists?
What data sources are available (databases, Excel, APIs)? What’s the granularity (daily, transactional, user-level)? Key tables, columns, and data types. Any known data limitations (missing data, outdated records)?
3. Data Points Related to the Problem Statement
What variables are likely to influence or explain the problem? Are there metrics/KPIs you should define (e.g., churn rate, conversion rate)? Data cleaning or preprocessing steps required?
4. Analysis and Visualization
Which charts, tables, and summary statistics best answer the questions defined above?
5. Insights & Next Steps
What are the actionable insights? What follow-up analyses are needed? What business actions can be taken?
--- Takeaways / Best Practices
• Templates save time and reduce errors
• Keep them modular and easy to update
• Include both technical (code, queries) and communication (summary, insights) components
• Store them in shared repositories for team collaboration and version control
Topic 5: Project Setup and Execution Using Google Colab
- Using Google Colab
What is it?
Google Colab is a free, cloud-hosted Jupyter notebook environment provided by Google. It enables you to run Python code directly in the browser with no installation needed.
How is it used?
• Access at: https://colab.research.google.com
• Upload or connect to data via Google Drive or web URLs
• Use Python and common libraries like Pandas, NumPy, Matplotlib
• Share notebooks with teams using Google Drive
• Run cells individually and use markdown for documentation
--- Takeaways / Best Practices
• No installation needed—ideal for quick collaboration
• Keep your code blocks modular and well-commented
• Use markdown cells to explain steps and findings
• Take advantage of built-in versioning through Drive for backups
Problem statement: ChatGPT Reviews
People have shared a wide range of feedback about ChatGPT through ratings and written reviews. In this session, you'll explore these reviews to understand what users love, what frustrates them, and what can be improved. By analyzing sentiment, patterns over time, and common issues, we can learn how to make ChatGPT better.
Understanding the problem statement
The key objectives of this problem are:
• Identify positive, negative, and neutral user sentiments
• Detect common problems driving negative reviews
• Track how sentiment evolves over time
Exploring the dataset:
To download the dataset, head over to this link:
https://statso.io/wp-content/uploads/2024/08/chatgpt_reviews.csv
The dataset contains 4 fields:
• Review ID: a unique identifier for every review
• Review: the comment given by the user
• Rating: the user's satisfaction level on a scale of 0-5
• Review date: when the review was posted
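If you load the CSV into SQL (a hypothetical chatgpt_reviews table with review_date and rating columns; the actual column names depend on how you import the file), the "sentiment over time" objective can be approximated by the average rating per month:

SELECT DATE_FORMAT(review_date, '%Y-%m') AS review_month,
       AVG(rating) AS avg_rating,
       COUNT(*) AS review_count
FROM chatgpt_reviews
GROUP BY DATE_FORMAT(review_date, '%Y-%m')
ORDER BY review_month;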
Topic 6: Reading Dashboards and Debriefing Results
- Dashboard Reading
What is it?
Dashboard reading involves analyzing and drawing insights from visual data summaries created using tools like Power BI or Tableau.
Where is it used?
Used by business analysts, stakeholders, and decision-makers to monitor performance, identify trends, and take data-driven actions.
How is it used?
Dashboard: Transport Analytics
Steps:
1. Observe the Dashboard
What are the main charts, graphs, and visualizations? What KPIs (Key Performance Indicators) are being displayed? (e.g., On-time Delivery Rate, Fuel Consumption, Inventory Levels, Shipment Volume). How is the data segmented or filtered? (e.g., by region, date, product, carrier). Are there any interactive elements?
2. Make Inferences about the Data
What kind of data would be needed to create this dashboard? (e.g., historical delivery records, vehicle telemetry data, warehouse inventory logs, customer order details). What data sources might be involved? (e.g., ERP systems, TMS, WMS, GPS tracking, sensor data). What's the granularity of the data? (e.g., daily, per shipment, per vehicle, per item).
3. Infer Business Questions & Insights
What business questions is this dashboard trying to answer? (e.g., "Are we meeting our delivery SLAs?", "Which routes are most fuel-efficient?", "Do we have too much stock of a particular item?", "Which suppliers are performing best?"). What immediate insights can you draw from the current state of the dashboard? (e.g., "On-time delivery dropped significantly last month," "Warehouse B has consistently higher picking errors," "Fuel costs are rising in the North region"). How might a logistics manager use this dashboard to make decisions?
4. Identify Potential Improvements/Next Steps
What other KPIs or visualizations would you add to make this dashboard more useful? What questions does the dashboard raise that aren't answered? If you had access to the data, what deeper analysis would you perform?
--- Takeaways / Best Practices
• Always check for filters/slicers that affect what you're viewing
• Focus on high-impact visuals first (KPIs, summaries)
• Context is everything—understand what each metric means for the business
- Project debrief
Dashboard #2: Supply Chain Dashboard
Repeat the steps above for this dashboard:
1. Observe the dashboard
2. Make inferences about the data
3. Infer business questions and insights
4. Identify potential improvements/next steps
Topic 7: SQL Workbench
- SQL Workbench
SQL Workbench is a lightweight SQL client that lets you connect to databases, write queries, and manage data interactively.
Where is it used?
Used by data analysts, developers, and DBAs to run SQL scripts, test queries, and visualize query output from various databases (PostgreSQL, MySQL, etc.)
How is it used?
Working on the dataset:
Run the provided SQL file in Workbench to set up the Zomato database with 1 lakh (100,000) rows:
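Once the script finishes, a quick sanity check helps confirm the load (the Orders table name follows the earlier window-function examples; adjust it to whatever the script actually creates):

SELECT COUNT(*) AS total_rows FROM Orders;   -- expect roughly 100,000 rows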
Reading Data in Chunks Using Pandas (Option 1)
SQL First, Pandas Second (Option 2)
--- Takeaways / Best Practices
• Always test queries on limited data before scaling
• Use aliases, formatting, and comments for readability
• Save frequently used queries as scripts or templates
• Verify connection settings before executing DML (INSERT/UPDATE/DELETE) commands
How will you handle data updates?
Will you manually query for updated data every day, or is there a better way?
Stored Procedure
A stored procedure is a precompiled group of SQL statements stored in the database that you execute by calling it by name. It helps you:
• Reuse code across applications
• Encapsulate business logic
• Improve performance for complex or repeated queries
Instead of manually querying for updated data every day, you can create a stored procedure:
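The procedure body shown in the session is not reproduced here. A minimal MySQL-style sketch using the name called below (the order_date column on Orders is an assumption for illustration):

DELIMITER //
CREATE PROCEDURE Get_Yesterday_Orders()
BEGIN
    -- return all orders placed yesterday (order_date is an assumed column)
    SELECT *
    FROM Orders
    WHERE order_date = CURDATE() - INTERVAL 1 DAY;
END //
DELIMITER ;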
Then in Python:
pd.read_sql("CALL Get_Yesterday_Orders();", engine) |
|
Topic 8: Aptitude Building and Interview Preparation
- Aptitude Building
Why is Aptitude Building important for Data Analysts?
• Problem-Solving Ability: identify patterns, troubleshoot issues, and develop solutions from complex datasets.
• Logical Thinking: interpret trends, correlations, and anomalies in data.
• Adaptability to Tools and Methods: quickly learn new tools, programming languages, or analytical techniques as technology evolves.
• Decision-Making Under Uncertainty: make informed decisions even when data is incomplete, ambiguous, or conflicting.
Categories of Aptitude tests:
Numerical Aptitude:
• Solve percentage-based problems
• Apply ratio and proportion concepts to quantitative problems
• Calculate averages effectively
• Analyze profit and loss scenarios
Data Interpretation:
Visual data interpretation involves understanding and drawing conclusions from data represented graphically or in tables. This includes identifying patterns, comparing values, and recognizing trends in formats such as bar graphs, pie charts, tables, and line graphs.
Logical and Analytical Skills:
• Number Series
• Letter Series
• Code and Decode
• Syllogism
- Some Interview Questions
1. Q: What is a window function in SQL? A: A function that performs calculations across rows related to the current row (e.g., ROW_NUMBER(), RANK()).
2. Q: What is the use of WITH clause (CTE) in SQL? A: It defines a temporary result set to improve readability and reusability in complex queries.
3. Q: Difference between RANK() and DENSE_RANK()? A: RANK() skips numbers after ties; DENSE_RANK() doesn’t.
4. Q: What is a What-If parameter in Power BI? A: It allows users to simulate different scenarios by dynamically changing input values to see their effect on measures.
5. Q: How is DirectQuery different from Import mode? A: Import stores data in Power BI; DirectQuery queries the source live—best for real-time data but slower performance.
6. Q: When would you use CALCULATE() in Power BI? A: To change the context of a measure, e.g., calculating sales for a specific region: CALCULATE(SUM(Sales[Amount]), Region="East")
7. Q: What is the use of Google Colab in data analysis? A: It's a cloud-based notebook that supports Python—used for writing, running, and sharing code without setup.
8. Q: How is SQL Workbench used in projects? A: For writing, testing, and executing SQL queries—commonly used in data transformation and retrieval tasks.
Artificial Intelligence (AI) is reshaping data analytics by automating insights, simplifying data preparation, and enabling predictive capabilities. This session introduces the use of AI tools that help analysts work faster and smarter.
What is it?
AI tools for analysis include platforms and features that apply machine learning, natural language processing, and automation to analyze data and extract patterns with minimal manual effort.
AI tools for analysis:
• Julius
• Perplexity Labs
• Google Colab
How to use them?
Julius
Let's further our cause by using different tools for different purposes!
• Use the same dataset - Global Retail & Online Store Sales.xlsx
• Open Julius and log in using your email address.
• Attach the file and ask: "Create a workflow for: Identify customer segments with high return rates."
Perplexity Labs
Let's use our Global Retail Sales data from one of the previous sprints and try to identify trends directly using AI tools.
• Download the dataset - Global Retail & Online Store Sales.xlsx
• Open Perplexity Labs and log in using your email address.
• Attach the file and ask: "Referencing the attached file - What were the major trends in global retail sales during 2024?"
--- Takeaways / Best Practices
• Combine human intuition with AI-powered suggestions for the best outcomes
• Always review AI-generated outputs for accuracy and context
• Use AI to speed up repetitive tasks like data cleaning or formatting
• Prompt engineering is critical—be clear and specific with your instructions
- Interview Prep: Data Analyst Roles
Example: What is the difference between Data Mining and Data profiling?
Example: What are the various steps involved in any analytics project?
Example: What is the significance of Exploratory Data Analysis (EDA)?
Type 1:
Note:
Intro line - (What are both of these used for / how are they similar?)
3 points compared and contrasted -
What is the difference between Data Mining and Data Profiling?
Use case / similarity: Both are essential steps in preparing and understanding data before it is used for decision-making or machine learning.
Data Mining = "What can we learn from this data?"
Data Profiling = "What does this data look like, and is it usable?"
3 points compared and contrasted:
Feature | Data Mining | Data Profiling |
Scope | Broad and analytical – focuses on predictive modeling and pattern discovery. | Narrow and descriptive – focuses on understanding the current state of data. |
Techniques Used | Clustering, classification, regression, association rules, etc. | Column profiling, structure analysis, dependency analysis, etc. |
Usage Scenario | Used in business intelligence, marketing, fraud detection. | Used in data cleaning, data migration, ETL processes. |
Type 2:
How, if possible; otherwise, why not -
2 things to keep in mind while doing so - (best practices/pitfalls)
Type 3:
Need - (Give a real-world deficit.) What is it? & How is it used? (Fulfill the need stated above.)
What is the significance of Exploratory Data Analysis (EDA)?
Need: Before jumping into modeling or drawing conclusions, it's crucial to understand the data you're working with. Raw data often contains missing values, outliers, inconsistencies, and errors. EDA reduces risk and increases model effectiveness by laying a strong foundation.
What is it? Exploratory Data Analysis (EDA) is the initial process of analyzing data sets to:
• Summarize their main characteristics
• Discover patterns and relationships
• Detect anomalies
• Test hypotheses using statistical graphics and data visualization
It combines statistics, visualization, and domain knowledge to extract insights before applying machine learning or advanced models.
How is it used? In practice, EDA typically involves computing summary statistics, checking for missing values and outliers, examining distributions, and visualizing relationships (histograms, box plots, scatter plots, correlation matrices) before moving on to modeling or advanced analysis.
How does data visualization help you?
Data visualization has grown rapidly in popularity due to its ease of viewing and understanding complex data in the form of charts and graphs. In addition to providing data in a format that is easier to understand, it highlights trends and outliers. The best visualizations illuminate meaningful information while removing noise from data.
What do you mean by Time Series Analysis? Where is it used?
In Time Series Analysis (TSA), a sequence of data points is analyzed over an interval of time. Instead of recording data points intermittently or randomly, analysts record them at regular intervals over a period of time. TSA can be done in two different ways: in the frequency domain and in the time domain. Because TSA has a broad scope of application, it plays a vital role in fields such as:
• Statistics
• Signal processing
• Econometrics
• Weather forecasting
• Earthquake prediction
• Astronomy
• Applied science