Python Pandas — II
Name some descriptive statistic functions used with DataFrames.
Answer
Some descriptive statistic functions used with DataFrames are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().
To calculate statistical values for each row, the axis argument should be.
Answer
The axis argument should be equal to 1 to calculate statistical values for each row.
What are quantile and quartiles ?
Answer
Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.
Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.
What does quantile() do ?
Answer
The quantile()
function is used to calculate any type of quantiles in Pandas.
Name pivoting functions available for DataFrames.
Answer
The pivoting functions available for DataFrames are pivot(), pivot_table(), melt(), stack(), unstack(), crosstab().
What does hist() do ?
Answer
The hist()
function of PyPlot module is used to create and plot histogram from a given sequence(s) of numbers.
What are missing values in a dataset ?
Answer
Missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance.
Write two methods generally used for handling missing values.
Answer
The dropna()
and fillna()
are the two methods used for handling missing values in a DataFrame.
What is the difference between dropna() and fillna() ?
Answer
The dropna()
method is used to drop all rows that contain NaN values from a DataFrame. On the other hand, the fillna()
method is used to replace all NaN values in a Pandas object with the specified value.
To iterate over horizontal subsets of DataFrame, ............... function may be used.
- iterate()
- iterrows()
- itercols()
- iteritems()
Answer
iterrows()
Reason — The iterrows()
method iterates over a DataFrame row-wise, where each horizontal subset is in the form of (row-index, Series) where Series contains all column values for that row-index.
To iterate over vertical subsets of a DataFrame, ............... function may be used.
- iterate()
- iterrows()
- itercols()
- iteritems()
Answer
iteritems()
Reason — The iteritems()
method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
To add two DataFrames' values, ............... function may be used.
- plus
- rplus
- add
- radd
Answer
add, radd
Reason — We can add two DataFrame objects using either (+) operator or using add()
function as per syntax : <DF1>.add(<DF2>)
which means <DF1>+<DF2>
or by using radd()
function i.e., reverse add as per syntax : <DF1>.radd(<DF2>)
which means <DF2>+<DF1>
.
To subtract the values of two DataFrames, ............... function may be used.
- sub
- difference
- minus
- rsub
Answer
sub, rsub
Reason — We can subtract two DataFrame objects using either (-) operator or using sub()
function as per syntax : <DF1>.sub(<DF2>)
which means <DF1>-<DF2>
or by using rsub()
function i.e., reverse subtract as per syntax : <DF1>.rsub(<DF2>)
which means <DF2>-<DF1>
.
To divide the values of two DataFrames, ............... function may be used.
- divide
- div
- rdiv
- division
Answer
div, rdiv
Reason — We can divide two DataFrame objects using either (/) operator or using div()
function as per syntax : <DF1>.div(<DF2>)
which means <DF1>/<DF2>
or by using rdiv()
function i.e., reverse division as per syntax : <DF1>.rdiv(<DF2>)
which means <DF2>/<DF1>
.
Which of the following would give the same output as DF/DF1 where DF and DF1 are DataFrames ?
- DF.div(DF1)
- DF1.div(DF)
- Divide(DF, DF1)
- Div(DF, DF1)
Answer
DF.div(DF1)
Reason — The div()
function with the syntax <DF>.div(<DF1>)
, performs element-wise division between DF
and DF1
, which is equivalent to <DF>/<DF1>
in terms of functionality and output.
To skip NaN values in a calculation, you can specify ............... attribute.
- NaN
- NA
- skipna
- all of these
Answer
skipna
Reason — The skipna
attribute in Pandas is used to skip (exclude) NaN, NA, or null values when performing calculations.
Which of the following is not a valid function that can be used with DataFrames ?
- count()
- sum()
- length()
- mad()
Answer
length()
Reason — In Pandas, the length()
function is not a valid function for DataFrames. Instead, we can use the len()
function to get the length of a DataFrame. The other functions listed (count(), sum(), mad()) are valid functions for performing operations on DataFrames.
The technique that divides total distribution of data into a given number of equal proportions is called a ............... .
- quartile
- tercile
- median
- quantile
Answer
quantile
Reason — Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.
............... divides the total distribution in four equal parts.
- quartile
- tercile
- median
- quantile
Answer
quartile
Reason — Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.
To divide total distribution of given data in two equal parts, ............... function is used.
- median()
- quartile()
- quantile()
- all of these
Answer
quantile()
Reason — The quantile()
function is used to divide the total distribution of given data into two equal parts. For example, to divide the data of DataFrame df
into two equal parts, the statement would be df.quantile(q = 0.5)
.
To divide total distribution of given data in four equal parts, ............... function is used.
- median()
- quartile()
- quantile()
- all of these
Answer
quantile()
Reason — The quantile()
function is used to divide the total distribution of given data into four equal parts. For example, to divide the data of DataFrame ndf
into four equal parts, the statement would be df.quantile(q = 0.25)
.
To divide total distribution of given data in eight equal parts, ............... function is used.
- median()
- quartile()
- quantile()
- all of these
Answer
quantile()
Reason — To divide the total distribution of given data into eight equal parts, we can use the quantile()
function with a quantile value of 0.125 (1/8). For example, to divide the data of DataFrame df1
into eight equal parts, the statement would be df.quantile(q = 0.125)
.
Which of the following is not a descriptive statistics function ?
- count()
- add()
- sum()
- max()
Answer
add()
Reason — The add()
function is a binary operation function, while count()
, sum()
, max()
are descriptive statistics functions.
Which function calculates descriptive statistical details for a DataFrame ?
- info()
- describe()
- show()
- list()
Answer
describe()
Reason — The descriptive statistics details can be calculated through a single function, namely the describe()
function.
To calculative cumulative sum of a column of a DataFrame, you may use ............... function.
- sum()
- sum(cumulative = True)
- cumsum()
- none of these
Answer
cumsum()
Reason — The cumsum()
function in pandas is used to calculate the cumulative sum of a column in a DataFrame i.e., in the output of this function, the value of each row is replaced by the sum of all prior rows, including this row.
The function to get the index of maximum value in a column of DataFrame is ............... .
- max()
- index()
- idxmax()
- maxidx()
Answer
idxmax()
Reason — The idxmax()
function in pandas is used to get the index of the maximum value in a column of a DataFrame.
To get top 5 rows of a DataFrame, you may use ............... function.
- head()
- head(5)
- top()
- top(5)
Answer
head(), head(5)
Reason — The head()
function in pandas is used to get the top n
rows of a DataFrame, where n
is an optional argument. If n
is not provided, it defaults to 5. Hence, both head()
and head(5)
are correct and will return the top 5 rows of the DataFrame.
To get bottom 3 rows of a DataFrame, you may use ............... function.
- tail()
- tail(3)
- bottom()
- bottom(3)
Answer
tail(3)
Reason — The tail()
function in pandas is used to get the bottom n
rows of a DataFrame, where n
is an optional argument. Hence, tail(3)
is correct and will return the bottom 3 rows of the DataFrame.
Function ............... can be used to drop missing values.
- fillna()
- isnull()
- dropna()
- delna()
Answer
dropna()
Reason — The dropna()
function in pandas is used to drop rows with missing values.
Which of the following methods of combining two DataFrames is a patching method ?
- concat()
- merge()
- join()
- none of these
Answer
join()
Reason — The join()
function in pandas is a patching method, which means that it aligns the DataFrames based on a common index or set of keys, and then combines them.
iterrows() and iteritems() functions help you to iterate over a DataFrame.
To add two DataFrames, you may use functions add() or radd( ).
To concatenate two string columns of a DataFrame, + operator is used.
The mode() function returns the maximum repeating value.
The mean() function returns the average of given data.
The median() function returns the halfway point in a given data.
Using quantile() function you can calculate terciles.
Using quantile() function you can calculate quartiles.
Using quantile() function you can calculate octiles.
Using isnull() method, you can check if there are any missing values in DataFrame.
The merge() function combines two DataFrames such that two rows with some common value are merged together in the final result.
The iteritems() iterates over the rows of a DataFrame.
Answer
False
Reason — The iteritems()
method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
The iteritems() iterates over the columns of a DataFrame.
Answer
True
Reason — The iteritems()
method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
The result produced by the functions sub() and rsub() is the same.
Answer
False
Reason — The sub()
and rsub()
functions produce different results because they subtract the operands in a different order. The sub()
function performs element-wise subtraction between two DataFrames, subtracting the right operand from the left operand. The syntax is <DF1>.sub(<DF2>)
, which means <DF1> - <DF2>
. On the other hand, the rsub()
function performs element-wise subtraction with the right operand subtracted from the left operand. The syntax is <DF1>.rsub(<DF2>)
, which means <DF2> - <DF1>
.
The result produced by the functions add() and radd() is the same.
Answer
True
Reason — When adding two DataFrame objects using the add()
function with the syntax <DF1>.add(<DF2>)
, it means <DF1> + <DF2>
. Similarly, using the radd()
function with the syntax <DF1>.radd(<DF2>)
means <DF2> + <DF1>
. Both operations produce the same results because addition is commutative, meaning the order of operands does not affect the result.
The result produced by the functions div() and rdiv() is the same.
Answer
False
Reason — The div()
and rdiv()
functions do not produce the same result. The div()
function performs element-wise division between two DataFrames, where the left operand is the dividend and the right operand is the divisor. The syntax is <DF1>.div(<DF2>)
, which means <DF1> / <DF2>
. On the other hand, the rdiv()
function performs element-wise division with the right operand as the dividend and the left operand as the divisor. The syntax is <DF1>.rdiv(<DF2>)
, which means <DF2> / <DF1>
.
The info() and describe() list the same information about a DataFrame.
Answer
False
Reason — The info()
method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the describe()
method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.
Function add() and operator + give the same result.
Answer
True
Reason — In pandas, both the '+' operator and the add()
function can be used to perform element-wise addition between two DataFrames, resulting in the same output. The syntax for the '+' operator is <DF1> + <DF2>
, while the syntax for the add()
function is <DF1>.add(<DF2>)
, which is equivalent to <DF1> + <DF2>
.
Function rsub() and operator - give the same result.
Answer
False
Reason — The rsub()
function and the '-' operator in pandas do not produce the same result because the rsub()
function performs reverse subtraction, which means it subtracts the left operand from the right, while the '-' subtracts the right operand from the left operand.
The minus - operator's result is same as sub() and rsub().
Answer
False
Reason — The '-' operator in pandas subtracts the right operand from the left operand, similar to the sub()
function. However, the rsub()
function subtracts the left operand from the right operand. Therefore, '-' and sub()
produce the same result, while rsub()
produces a different result due to the order of operands.
Python integer datatype can store NaN values.
Answer
False
Reason — Python integer datatype cannot store NaN values. To store a NaN value in a column, the datatype of a column is changed to non-integer suitable type.
Functions sum() and cumsum() produce the same result.
Answer
False
Reason — The function sum()
returns the sum of the values for the requested axis. On the other hand, cumsum()
calculates the cumulative sum. In the output of this function, the value of each row is replaced by the sum of all prior rows, including this row.
The fillna() can also fill individual missing values for different columns.
Answer
True
Reason — The fillna()
function in pandas can be used to fill individual missing values for different columns. This can be done by passing a dictionary with column names as keys and the corresponding filling values as values to the fillna()
function.
To drop missing values from a DataFrame, the function used is delna().
Answer
False
Reason — To drop missing values from a DataFrame, the function used is dropna()
.
Assertion. A quantile refers to equally distributed portion of a data set.
Reason. A median divides a distribution in 2 quantiles while a quartile divides a distribution in 4 quantiles.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
Quantile is a statistical term that refers to dividing the total distribution of given data into a given number of equal proportions. Each portion is called a quantile, and it represents a specific percentage of the data set. The median divides a distribution into two equal parts, representing the middle value of a data set when the data is sorted in ascending order. A quartile divides a distribution into four equal parts, with each quartile representing 25% of the data set.
Assertion. Data aggregation produces a summary statistics of a dataset.
Reason. Data aggregation summarizes data using statistical aggregation functions.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
Data aggregation produces a summary statistic of a dataset because it involves combining and summarizing data from multiple sources or data points, and then calculating a single summary statistic or a set of summary statistics. This process utilizes statistical aggregation functions, such as sum, average, count, minimum, and maximum.
Assertion. In a dataset, there can be missing values that cannot contribute to any computation.
Reason. In a dataset, NULL, NaN or None are considered the missing values.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
Missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance. In a dataset, NULL, NaN or None are considered the missing values.
Assertion (A). The output of addition of two series will be NaN, if one of the elements or both the elements have no value(s).
Reason (R). While performing mathematical operations on a series, by default all missing values are filled in with 0.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
A is true but R is false.
Explanation
When adding two pandas Series and encountering NaN values in either or both Series, the result of the addition operation will be NaN for those corresponding elements. Additionally, pandas does not automatically replace missing values with 0 during mathematical operations but instead with NaN.
Name the function to iterate over a DataFrame horizontally.
Answer
The iterrows()
function iterates over a DataFrame horizontally.
Name the function to iterate over a DataFrame vertically.
Answer
The iteritems()
function iterates over a DataFrame vertically.
Write equivalent expressions for the given functions :
(i) A.add(B)
(ii) B.add(A)
(iii) A.sub(B)
(iv) B.sub(A)
(v) A.rsub(B)
(vi) B.mul(A)
(vii) A.rdiv(B)
(viii) B.div(A)
(ix) B.rdiv(A)
(x) A.div(B)
Answer
(i) A.add(B) — A + B
(ii) B.add(A) — B + A
(iii) A.sub(B) — A - B
(iv) B.sub(A) — B - A
(v) A.rsub(B) — B - A
(vi) B.mul(A) — B * A
(vii) A.rdiv(B) — B / A
(viii) B.div(A) — B / A
(ix) B.rdiv(A) — A / B
(x) A.div(B) — A / B
Is the result of sub() and rsub() the same? Why/why not ?
Answer
The sub()
and rsub()
functions produce different results because they subtract the operands in a different order. The sub()
function performs element-wise subtraction between two DataFrames, subtracting the right operand from the left operand. The syntax is <DF1>.sub(<DF2>)
, which means <DF1> - <DF2>
. On the other hand, the rsub()
function performs element-wise subtraction with the right operand subtracted from the left operand. The syntax is <DF1>.rsub(<DF2>)
, which means <DF2> - <DF1>
.
Write appropriate functions to perform the following on a DataFrame ?
(i) Calculate the sum
(ii) Count the values
(iii) Calculate the average
(iv) Calculate the most repeated value
(v) Calculate the median
(vi) Calculate the standard deviation
(vii) Calculate the variance
(viii) Calculate the maximum value
(ix) Calculate the standard deviation
(x) Calculate the variance
Answer
(i) Calculate the sum — sum() function
(ii) Count the values — count() function
(iii) Calculate the average — mean() function
(iv) Calculate the most repeated value — mode() function
(v) Calculate the median — median() function
(vi) Calculate the standard deviation — std() function
(vii) Calculate the variance — var() function
(viii) Calculate the maximum value — max() function
(ix) Calculate the standard deviation — std() function
(x) Calculate the variance — var() function
What does info() and describe() do ?
Answer
The info()
method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the describe()
method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.
Are sum() and add() functions the same ?
Answer
No, the sum()
and add()
functions are not the same in pandas. The sum()
function in pandas calculates the sum of values along a specified axis. On the other hand, the add()
function in pandas is used to add two objects element-wise.
Name some functions that perform descriptive statistics on a DataFrame.
Answer
Some functions that perform descriptive statistics on a DataFrame are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().
To consider only the numeric values for calculation, what argument do you pass to statistics functions of Pandas ?
Answer
The numeric_only = True
argument is passed to statistics functions in pandas to consider only the numeric values (int, float, boolean columns) for calculation.
Is there one function that calculates much of descriptive statistics values ? Name it.
Answer
Yes, there is one function that calculates many descriptive statistics values, namely the describe() function.
What happens if mode() returns multiple values for a column but other columns have a single mode ?
Answer
When the mode()
function in pandas returns multiple values for a column (meaning there are multiple modes), but other columns have a single mode, pandas fills the non-mode columns with NaN values.
What is quantile and quartile ?
Answer
Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.
Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.
Name the function that lets you calculate different types of quantiles.
Answer
The function used to calculate different types of quantiles is quantiles()
with syntax : <dataframe>.quantile(q = 0.5, axis = 0, numeric_only = True)
.
Name the functions that give you maximum and minimum values in a DataFrame.
Answer
The max()
and min()
functions find out the maximum and minimum values respectively from a DataFrame.
Name the functions that give you the index of maximum and minimum values in a DataFrame.
Answer
The idxmax()
and idxmin()
functions find out the index of maximum and minimum values respectively from a DataFrame.
What is missing data ?
Answer
Missing data, also known as missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance.
Why is missing data filled in DataFrame with some value ?
Answer
The dropna()
method in pandas can be used to remove rows or columns with missing values from a DataFrame. However, this can also lead to a loss of non-null data, which may not be desirable for data analysis. To avoid losing non-null data, we can use the fillna()
method to fill missing values with a specified value. This can help to ensure that the dataset is complete and can be used for data analysis.
Name the functions you can use for filling missing data.
Answer
In Python, the fillna()
function is used to fill missing data in a Pandas DataFrame.
How do you iterate over a DataFrame?
Answer
To iterate over a DataFrame in Python, we can use various methods depending on our specific needs. We can use the iterrows()
method to iterate over each row in the DataFrame. In this method, each horizontal subset is in the form of (row-index, series), where the series contains all column values for that row-index. Additionally, we can use the iteritems()
method to iterate over each column in the DataFrame. Here, each vertical subset is in the form of (column-index, series), where the series contains all row values for that column-index.
What are binary operations ? Name the functions that let you perform binary operations on a DataFrame.
Answer
Binary operations refer to operations that require two values to perform and these values are picked element-wise. In a binary operation involving DataFrames, the data from the two DataFrames are aligned based on their row and column indexes. For matching row and column indexes, the specified operation is performed, while for non-matching row and column indexes, NaN values are stored in the result. The functions that perform binary operations on a DataFrame include add(), radd(), sub(), rsub(), mul(), div(), and rdiv().
What is descriptive statistics ? Name Pandas descriptive statistics functions.
Answer
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information.
The functions that perform descriptive statistics on a DataFrame are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().
The info() and describe() are said to be inspection functions. What do they do ?
Answer
The info()
and describe()
functions in pandas are considered inspection functions because they provide valuable insights into the structure, content, and statistical summary of a DataFrame. The info()
method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the describe()
method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.
What is the difference between sum and cumulative sum? How do you perform the two on DataFrame ?
Answer
The sum provides the total of all numbers in a sequence, while the cumulative sum represents the running total of numbers encountered up to a specific point in the sequence.
The sum()
function gives the total sum of values along specified axes, whereas the cumsum()
function provides the cumulative sum of values along specified axes, either column-wise or row-wise.
Name and explain some cumulative functions provided by Pandas.
Answer
The cumulative functions provided by Pandas are cumsum(), cumprod(), cummax(), cummin().
cumsum() — It calculates cumulative sum i.e., in the output of this function, the value of each row is replaced by sum of all prior rows including this row. The syntax is
<DF>.cumsum([axis = None])
.cumprod() — It calculates cumulative product of values in a DataFrame object.
cummax() — It calculates cumulative maximum of value from a DataFrame object.
cummin() — It calculates cumulative minimum of value from a DataFrame object.
The head() and tail() extract rows or columns from a DataFrame. Explain.
Answer
The head()
function in pandas retrieves the top n rows of a DataFrame, where n is an optional argument defaulting to 5 if not provided. It is used with the syntax <DF>.head()
. Similarly, the tail()
function in pandas fetches the bottom n rows of a DataFrame, where n is also optional and defaults to 5 if not specified. Its syntax is <DF>.tail()
.
Why does Python change the datatype of a column as soon as it stores an empty value (NaN) even though it has all other values stored as integer ?
Answer
When a column in a pandas DataFrame contains a mixture of integers and missing values (NaNs), the data type is automatically changed to a floating-point type to accommodate these missing values because in Python integer types cannot store NaN values.
What do quantile and var() functions do ?
Answer
The quantile()
function returns the values at the given quantiles over requested axis (axis 0 or 1). On the other hand, var()
function computes variance and returns the unbiased variance over the requested axis.
What is a quartile ? How is it different from quantile ?
Answer
Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.
Quartiles specifically divide the dataset into four equal parts, while quantiles can divide the dataset into any desired number of equal parts.
How do you create quantiles and quartiles in Python Pandas ?
Answer
In Python pandas, we can create quantiles and quartiles using the quantile()
method.
The statement to create quartiles is df.quantile(q = 0.25)
.
Assume that required libraries (Pandas and Numpy) are imported and DataFrame ndf has been created as shown in solved problem 16. Predict the output produced by following code fragment :
print(ndf[ndf["age"] > 30])
print(ndf.head(2))
print(ndf.tail(3))
The following DataFrame ndf
is from solved problem 16 :
Name Sex Position City age Projects Budget
0 Rabina F Manager Bangalore 30 13 48
1 Evan M Programer New Delhi 27 17 13
2 Jia F Manager Chennai 32 16 32
3 Lalit M Manager Mumbai 40 20 21
4 Jaspreet M Programmer Chennai 28 21 17
5 Suji F Programmer Bangalore 32 14 10
Answer
Name Sex Position City age Projects Budget
2 Jia F Manager Chennai 32 16 32
3 Lalit M Manager Mumbai 40 20 21
5 Suji F Programmer Bangalore 32 14 10
Name Sex Position City age Projects Budget
0 Rabina F Manager Bangalore 30 13 48
1 Evan M Programer New Delhi 27 17 13
Name Sex Position City age Projects Budget
3 Lalit M Manager Mumbai 40 20 21
4 Jaspreet M Programmer Chennai 28 21 17
5 Suji F Programmer Bangalore 32 14 10
print(ndf[ndf["age"] > 30])
— Thendf["age"] > 30
creates a boolean mask where True indicates rows where the "age" column has values greater than 30. Thenndf[...]
uses this boolean mask to filter the DataFramendf
, returning only rows where the condition is True.print(ndf.head(2))
— It prints the first 2 rows of the DataFramendf
.print(ndf.tail(3))
— It prints the last 3 rows of the DataFramendf
.
Given the two DataFrames as :
>>> dfc1
- -
0 1
0 2 a
1 3 b
2 4 c
>>> dfc2
- -
0 1 2
0 2 3 4
2 p q r
Why are following statements giving errors ?
(a) print(dfc1 + dfc2)
(b) print(dfc1.sub(dfc2))
(c) print(dfc1 * dfc2)
Answer
(a) print(dfc1 + dfc2)
— This statement tries to add two DataFrames dfc1
and dfc2
. However, they have different shapes (dfc1
has 3 rows and 2 columns, while dfc2
has 2 rows and 3 columns), and their indices do not match. DataFrame addition requires both DataFrames to have the same shape and compatible indices, which is not the case here.
(b) print(dfc1.sub(dfc2))
— This statement attempts to subtract dfc1
from dfc2
using the sub()
method. Similar to addition, subtraction between DataFrames requires them to have the same shape and compatible indices, which is not satisfied here due to the mismatch in shapes and indices.
(c) print(dfc1 * dfc2)
— This statement tries to perform element-wise multiplication between dfc1
and dfc2
. Again, this operation requires both DataFrames to have the same shape and compatible indices, which is not the case here due to the mismatched shapes and indices.
Consider the following code that creates two DataFrames :
ore1 = pd.DataFrame(np.array([[20, 35, 25, 20], [11, 28, 32, 29]]),
columns = ['iron', 'magnesium', 'copper', 'silver'])
ore2 = pd.DataFrame(np.array([[14, 34, 26, 26], [33, 19, 25, 23]]),
columns = ['iron', 'magnesium', 'gold', 'silver'])
What will be the output produced by the following code fragments ?
(a) print(ore1 + ore2)
ore3 = ore1.radd(ore2)
print(ore3)
(b) print(ore1 - ore2)
ore3 = ore1.rsub(ore2)
print(ore3)
(c) print(ore1 * ore2)
ore3 = ore1.mul(ore2)
print(ore3)
(d) print(ore1 / ore2)
ore3 = ore1.rdiv(ore2)
print(ore3)
Answer
(a)
copper gold iron magnesium silver
0 NaN NaN 34 69 46
1 NaN NaN 44 47 52
copper gold iron magnesium silver
0 NaN NaN 34 69 46
1 NaN NaN 44 47 52
print(ore1 + ore2)
: This line attempts to add the DataFramesore1
andore2
using the '+' operator. When adding DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.ore3 = ore1.radd(ore2)
: This line uses theradd()
method, which reverses the addition operation between DataFramesore1
andore2
.- The
radd()
function and '+' operator produce the same result in pandas, as the order of operands does not affect addition due to its commutative property.
(b)
copper gold iron magnesium silver
0 NaN NaN 6 1 -6
1 NaN NaN -22 9 6
copper gold iron magnesium silver
0 NaN NaN -6 -1 6
1 NaN NaN 22 -9 -6
print(ore1 - ore2)
: This line performs a subtraction operation between corresponding elements in DataFramesore1
andore2
. When subtracting DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.ore3 = ore1.rsub(ore2)
: This line uses thersub()
method, which reverses the subtraction operation between DataFramesore1
andore2
.- The
rsub()
function and the '-' operator in pandas do not produce the same result because thersub()
function performs reverse subtraction, which means it subtracts the left operand from the right, while the '-' subtracts the right operand from the left operand.
(c)
copper gold iron magnesium silver
0 NaN NaN 280 1190 520
1 NaN NaN 363 532 667
copper gold iron magnesium silver
0 NaN NaN 280 1190 520
1 NaN NaN 363 532 667
print(ore1 * ore2)
: This line attempts to perform element-wise multiplication between the DataFramesore1
andore2
using the '*' operator. When multiplying DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.ore3 = ore1.mul(ore2)
: This line uses themul()
method to perform element-wise multiplication between DataFramesore1
andore2
.
(d)
copper gold iron magnesium silver
0 NaN NaN 1.428571 1.029412 0.769231
1 NaN NaN 0.333333 1.473684 1.260870
copper gold iron magnesium silver
0 NaN NaN 0.7 0.971429 1.300000
1 NaN NaN 3.0 0.678571 0.793103
print(ore1 / ore2)
: This line attempts to perform element-wise division between the DataFramesore1
andore2
using the '/' operator. When dividing DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.ore3 = ore1.rdiv(ore2)
: This line uses therdiv()
method to perform reciprocal division between DataFramesore1
andore2
.
Consider the DataFrame wdf as shown below :
minTemp | maxTemp | Rainfall | Evaporation | |
---|---|---|---|---|
0 | 2.9 | 8.0 | 24.3 | 0.0 |
1 | 3.1 | 14.0 | 26.9 | 3.6 |
2 | 6.2 | 13.7 | 23.4 | 3.6 |
3 | 5.3 | 13.3 | 15.5 | 39.8 |
4 | 6.3 | 17.6 | 16.1 | 2.8 |
5 | 5.4 | 18.2 | 16.9 | 0.0 |
6 | 5.5 | 21.1 | 18.2 | 0.2 |
7 | 4.8 | 18.3 | 17.0 | 0.0 |
8 | 3.6 | 20.8 | 19.5 | 0.0 |
9 | 7.7 | 19.4 | 22.8 | 16.2 |
10 | 9.9 | 24.1 | 25.2 | 0.0 |
11 | 11.8 | 28.5 | 27.3 | 0.2 |
12 | 13.2 | 29.1 | 27.9 | 0.0 |
13 | 16.8 | 24.1 | 30.9 | 0.0 |
14 | 19.4 | 28.1 | 31.2 | 0.0 |
15 | 21.6 | 34.4 | 32.1 | 0.0 |
16 | 20.4 | 33.8 | 31.2 | 0.0 |
17 | 18.5 | 26.7 | 30.0 | 1.2 |
18 | 18.8 | 32.4 | 32.3 | 0.6 |
19 | 17.6 | 28.6 | 33.4 | 0.0 |
20 | 19.7 | 30.3 | 33.4 | 0.0 |
(a) Write statement(s) to calculate minimum value for each of the columns.
(b) Write statement(s) to calculate maximum value for each of the rows.
(c) Write statement(s) to calculate variance for column Rainfall.
(d) Write statement(s) to compute mean , mode median for last 10 rows.
Answer
(a)
>>> wdf.min()
Minimum values for each column:
minTemp 2.9
maxTemp 8.0
Rainfall 15.5
Evaporation 0.0
dtype: float64
(b)
>>> wdf.max(axis=1)
Maximum values for each row:
0 24.3
1 26.9
2 23.4
3 39.8
4 17.6
5 18.2
6 21.1
7 18.3
8 20.8
9 22.8
10 25.2
11 28.5
12 29.1
13 30.9
14 31.2
15 34.4
16 33.8
17 30.0
18 32.4
19 33.4
20 33.4
dtype: float64
(c)
>>> wdf['Rainfall'].var()
Variance for column Rainfall: 38.852999999999994
(d)
>>> last_10_rows = wdf.tail(10)
>>> last_10_rows.mean()
>>> last_10_rows.mode()
>>> last_10_rows.median()
Mean for last 10 rows:
minTemp 17.78
maxTemp 29.60
Rainfall 30.97
Evaporation 0.20
dtype: float64
Mode for last 10 rows:
minTemp maxTemp Rainfall Evaporation
0 11.8 24.1 31.2 0.0
1 13.2 26.7 33.4 NaN
2 16.8 28.1 NaN NaN
3 17.6 28.5 NaN NaN
4 18.5 28.6 NaN NaN
5 18.8 29.1 NaN NaN
6 19.4 30.3 NaN NaN
7 19.7 32.4 NaN NaN
8 20.4 33.8 NaN NaN
9 21.6 34.4 NaN NaN
Median for last 10 rows:
minTemp 18.65
maxTemp 28.85
Rainfall 31.20
Evaporation 0.00
dtype: float64
Write a program to print a DataFrame one column at a time and print only first three columns.
import pandas as pd
data = {
'Name': ['Aliya', 'Hemanth', 'Charlie'],
'Age': [25, 30, 35],
'City': ['Bangalore', 'Chennai', 'Mumbai'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
first_three_columns = df.iloc[:, :3]
print("Each column:")
for column_name in first_three_columns:
column_data = first_three_columns[column_name]
print(column_name)
print(column_data)
Each column:
Name
0 Aliya
1 Hemanth
2 Charlie
Name: Name, dtype: object
Age
0 25
1 30
2 35
Name: Age, dtype: int64
City
0 Bangalore
1 Chennai
2 Mumbai
Name: City, dtype: object
Write a program to print a DataFrame one row at a time and print only first five rows.
import pandas as pd
data = {
'Name': ['Amruta', 'Harsh', 'Yogesh', 'Shreya', 'Zoya', 'Nyra'],
'Age': [25, 30, 35, 40, 45, 28],
'City': ['Chandigarh', 'Jaipur', 'Dehradun', 'Delhi', 'Vadodara', 'Guwahati']
}
df = pd.DataFrame(data)
first_five_rows = df.head(5)
print("Each row:")
for index, row in first_five_rows.iterrows():
print("Index:", index)
print(row)
Each row:
Index: 0
Name Amruta
Age 25
City Chandigarh
Name: 0, dtype: object
Index: 1
Name Harsh
Age 30
City Jaipur
Name: 1, dtype: object
Index: 2
Name Yogesh
Age 35
City Dehradun
Name: 2, dtype: object
Index: 3
Name Shreya
Age 40
City Delhi
Name: 3, dtype: object
Index: 4
Name Zoya
Age 45
City Vadodara
Name: 4, dtype: object
Write a program that performs count, sum, max, and min functions :
On rows
On columns
import pandas as pd
data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)
print("Operations on rows:")
print("Count:")
print(df.count(axis=1))
print("Sum:")
print(df.sum(axis=1))
print("Max:")
print(df.max(axis=1))
print("Min:")
print(df.min(axis=1))
print("\nOperations on columns:")
print("Count:")
print(df.count())
print("Sum:")
print(df.sum())
print("Max:")
print(df.max())
print("Min:")
print(df.min())
Operations on rows:
Count:
0 3
1 3
2 3
dtype: int64
Sum:
0 12
1 15
2 18
dtype: int64
Max:
0 7
1 8
2 9
dtype: int64
Min:
0 1
1 2
2 3
dtype: int64
Operations on columns:
Count:
A 3
B 3
C 3
dtype: int64
Sum:
A 6
B 15
C 24
dtype: int64
Max:
A 3
B 6
C 9
dtype: int64
Min:
A 1
B 4
C 7
dtype: int64
Take a DataFrame of your choice. Write a program to calculate count of values only in a selective column.
import pandas as pd
data = {
'Name': ['Deepika', 'Simran', 'Shivang', 'Anurag', 'Ankith'],
'Age': [25, 30, None, 40, 45],
'City': ['Goa', 'Kochi', 'Pondicherry', 'Rohtak', 'Kasol']
}
df = pd.DataFrame(data)
selected_column = 'Age'
count_values = df[selected_column].count()
print("Count of values in", selected_column, ":" , count_values)
Count of values in Age : 4
Give two identical DataFrames Sales16 and Sales17. But Sales17 has some values missing. Write code so that Sales17 fills its missing values from the corresponding entries of Sales16.
import pandas as pd
data_sales16 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, 150, 120, 180]
}
Sales16 = pd.DataFrame(data_sales16)
data_sales17 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, None, 120, None]
}
Sales17 = pd.DataFrame(data_sales17)
Sales17 = Sales16.fillna({'B': 150, 'D': 180})
print("Sales16:")
print(Sales16)
print("\nSales17 (after filling missing values):")
print(Sales17)
Sales16:
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
Sales17 (after filling missing values):
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
Write code that just produces single True/False as a result for the presence of missing values in whole DataFrame namely df.
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 5, 6, 7], 'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)
has_missing_values = False
for column in df.columns:
for value in df[column]:
if pd.isnull(value):
has_missing_values = True
break
print(has_missing_values)
True
Four Series objects Temp1, Temp2, Temp3 and Temp4 store the temperatures of week1, week2, week3 and week4 respectively. Create a DataFrame from these four series objects where the indexes should be 'Sunday', 'Monday', ... , 'Saturday', and columns should be 'Week1', 'Week2', 'Week3' and 'week4'.
import pandas as pd
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
week1 = [25.4, 26.7, 27.2, 28.9, 29.8, 30.2, 31.5]
week2 = [22.3, 23.2, 24.6, 25.4, 26.8, 27.9, 28.4]
week3 = [20.1, 21.2, 22.4, 23.9, 24.7, 25.57, 26.78]
week4 = [18.4, 19.1, 20.3, 21.7, 22.67, 23.4, 24.09]
Temp1 = pd.Series(week1, index=weekdays)
Temp2 = pd.Series(week2, index=weekdays)
Temp3 = pd.Series(week3, index=weekdays)
Temp4 = pd.Series(week4, index=weekdays)
df = pd.DataFrame({'Week1': Temp1, 'Week2': Temp2, 'Week3': Temp3, 'Week4': Temp4})
print(df)
Week1 Week2 Week3 Week4
Sunday 25.4 22.3 20.10 18.40
Monday 26.7 23.2 21.20 19.10
Tuesday 27.2 24.6 22.40 20.30
Wednesday 28.9 25.4 23.90 21.70
Thursday 29.8 26.8 24.70 22.67
Friday 30.2 27.9 25.57 23.40
Saturday 31.5 28.4 26.78 24.09
Given a DataFrame that stores the details of past 25 years' monthly sales. Some old data is however missing. Write a script to calculate average :
Monthly sales across years
Yearly sales
Make sure that missing values do not hamper the overall result.
import pandas as pd
data = {
'Year': [],
'Month': [],
'Sales': []
}
n = int(input("Enter the number of rows in the DataFrame: "))
for i in range(n):
year = int(input("Enter the year: "))
month = int(input("Enter the month (1-12): "))
sales = float(input("Enter the sales amount: "))
data['Year'].append(year)
data['Month'].append(month)
data['Sales'].append(sales)
df = pd.DataFrame(data)
df.fillna(0, inplace=True)
avg_monthly_sales = df.groupby('Month')['Sales'].mean()
yearly_sales = df.groupby('Year')['Sales'].mean()
print("Average Monthly Sales Across Years:")
print(avg_monthly_sales)
print("\nYearly Sales:")
print(yearly_sales)
Average Monthly Sales Across Years:
Month
1 543.25
2 644.25
3 486.50
4 583.50
5 376.50
6 644.75
7 847.00
8 497.75
9 564.50
10 701.00
11 524.75
12 522.25
Name: Sales, dtype: float64
Yearly Sales:
Year
2000 781.082333
2001 480.030233
2002 390.530000
2003 682.917767
2004 421.250050
2005 408.333333
2006 509.757500
2007 634.333667
2008 907.250000
2009 876.222222
2010 940.757578
2011 690.900067
2012 623.250043
2013 890.088888
2014 491.500000
2015 601.919997
2016 623.250055
2017 489.086633
2018 523.750000
2019 789.910067
2020 456.250000
2021 481.083333
2022 590.750000
2023 618.916667
2024 621.250000
Name: Sales, dtype: float64
Given two identical DataFrames Sales16 and Sales17. But Sales17 has some values missing. Write code so that Sales17 fills its missing values from corresponding entries of Sales16.
import pandas as pd
data_sales16 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, 150, 120, 180]
}
Sales16 = pd.DataFrame(data_sales16)
data_sales17 = {
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, None, 120, None]
}
Sales17 = pd.DataFrame(data_sales17)
Sales17 = Sales16.fillna({'B': 150, 'D': 180})
print("Sales16:")
print(Sales16)
print("\nSales17 (after filling missing values):")
print(Sales17)
Sales16:
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
Sales17 (after filling missing values):
Product Sales
0 A 100
1 B 150
2 C 120
3 D 180
Write code that just produces single True/False as a result for the presence of missing values in whole DataFrame namely df.
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 5, 6, 7], 'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)
has_missing_values = False
for column in df.columns:
for value in df[column]:
if pd.isnull(value):
has_missing_values = True
break
print(has_missing_values)
True