Python Pandas — II

PrevNextBack

Python Pandas — II

Checkpoint 2.1

Question 1

Name some descriptive statistic functions used with DataFrames.

Answer

Some descriptive statistic functions used with DataFrames are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().

Question 2

To calculate statistical values for each row, the axis argument should be.

Answer

The axis argument should be equal to 1 to calculate statistical values for each row.

Question 3

What are quantile and quartiles ?

Answer

Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.

Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

Question 4

What does quantile() do ?

Answer

The quantile() function is used to calculate any type of quantiles in Pandas.

Question 5

Name pivoting functions available for DataFrames.

Answer

The pivoting functions available for DataFrames are pivot(), pivot_table(), melt(), stack(), unstack(), crosstab().

Question 6

What does hist() do ?

Answer

The hist() function of PyPlot module is used to create and plot histogram from a given sequence(s) of numbers.

Question 7

What are missing values in a dataset ?

Answer

Missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance.

Question 8

Write two methods generally used for handling missing values.

Answer

The dropna() and fillna() are the two methods used for handling missing values in a DataFrame.

Question 9

What is the difference between dropna() and fillna() ?

Answer

The dropna() method is used to drop all rows that contain NaN values from a DataFrame. On the other hand, the fillna() method is used to replace all NaN values in a Pandas object with the specified value.

Multiple Choice Questions

Question 1

To iterate over horizontal subsets of DataFrame, ............... function may be used.

  1. iterate()
  2. iterrows()
  3. itercols()
  4. iteritems()

Answer

iterrows()

Reason — The iterrows() method iterates over a DataFrame row-wise, where each horizontal subset is in the form of (row-index, Series) where Series contains all column values for that row-index.

Question 2

To iterate over vertical subsets of a DataFrame, ............... function may be used.

  1. iterate()
  2. iterrows()
  3. itercols()
  4. iteritems()

Answer

iteritems()

Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.

Question 3

To add two DataFrames' values, ............... function may be used.

  1. plus
  2. rplus
  3. add
  4. radd

Answer

add, radd

Reason — We can add two DataFrame objects using either (+) operator or using add() function as per syntax : <DF1>.add(<DF2>) which means <DF1>+<DF2> or by using radd() function i.e., reverse add as per syntax : <DF1>.radd(<DF2>) which means <DF2>+<DF1>.

Question 4

To subtract the values of two DataFrames, ............... function may be used.

  1. sub
  2. difference
  3. minus
  4. rsub

Answer

sub, rsub

Reason — We can subtract two DataFrame objects using either (-) operator or using sub() function as per syntax : <DF1>.sub(<DF2>) which means <DF1>-<DF2> or by using rsub() function i.e., reverse subtract as per syntax : <DF1>.rsub(<DF2>) which means <DF2>-<DF1>.

Question 5

To divide the values of two DataFrames, ............... function may be used.

  1. divide
  2. div
  3. rdiv
  4. division

Answer

div, rdiv

Reason — We can divide two DataFrame objects using either (/) operator or using div() function as per syntax : <DF1>.div(<DF2>) which means <DF1>/<DF2> or by using rdiv() function i.e., reverse division as per syntax : <DF1>.rdiv(<DF2>) which means <DF2>/<DF1>.

Question 6

Which of the following would give the same output as DF/DF1 where DF and DF1 are DataFrames ?

  1. DF.div(DF1)
  2. DF1.div(DF)
  3. Divide(DF, DF1)
  4. Div(DF, DF1)

Answer

DF.div(DF1)

Reason — The div() function with the syntax <DF>.div(<DF1>), performs element-wise division between DF and DF1, which is equivalent to <DF>/<DF1> in terms of functionality and output.

Question 7

To skip NaN values in a calculation, you can specify ............... attribute.

  1. NaN
  2. NA
  3. skipna
  4. all of these

Answer

skipna

Reason — The skipna attribute in Pandas is used to skip (exclude) NaN, NA, or null values when performing calculations.

Question 8

Which of the following is not a valid function that can be used with DataFrames ?

  1. count()
  2. sum()
  3. length()
  4. mad()

Answer

length()

Reason — In Pandas, the length() function is not a valid function for DataFrames. Instead, we can use the len() function to get the length of a DataFrame. The other functions listed (count(), sum(), mad()) are valid functions for performing operations on DataFrames.

Question 9

The technique that divides total distribution of data into a given number of equal proportions is called a ............... .

  1. quartile
  2. tercile
  3. median
  4. quantile

Answer

quantile

Reason — Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.

Question 10

............... divides the total distribution in four equal parts.

  1. quartile
  2. tercile
  3. median
  4. quantile

Answer

quartile

Reason — Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

Question 11

To divide total distribution of given data in two equal parts, ............... function is used.

  1. median()
  2. quartile()
  3. quantile()
  4. all of these

Answer

quantile()

Reason — The quantile() function is used to divide the total distribution of given data into two equal parts. For example, to divide the data of DataFrame df into two equal parts, the statement would be df.quantile(q = 0.5).

Question 12

To divide total distribution of given data in four equal parts, ............... function is used.

  1. median()
  2. quartile()
  3. quantile()
  4. all of these

Answer

quantile()

Reason — The quantile() function is used to divide the total distribution of given data into four equal parts. For example, to divide the data of DataFrame ndf into four equal parts, the statement would be df.quantile(q = 0.25).

Question 13

To divide total distribution of given data in eight equal parts, ............... function is used.

  1. median()
  2. quartile()
  3. quantile()
  4. all of these

Answer

quantile()

Reason — To divide the total distribution of given data into eight equal parts, we can use the quantile() function with a quantile value of 0.125 (1/8). For example, to divide the data of DataFrame df1 into eight equal parts, the statement would be df.quantile(q = 0.125).

Question 14

Which of the following is not a descriptive statistics function ?

  1. count()
  2. add()
  3. sum()
  4. max()

Answer

add()

Reason — The add() function is a binary operation function, while count(), sum(), max() are descriptive statistics functions.

Question 15

Which function calculates descriptive statistical details for a DataFrame ?

  1. info()
  2. describe()
  3. show()
  4. list()

Answer

describe()

Reason — The descriptive statistics details can be calculated through a single function, namely the describe() function.

Question 16

To calculative cumulative sum of a column of a DataFrame, you may use ............... function.

  1. sum()
  2. sum(cumulative = True)
  3. cumsum()
  4. none of these

Answer

cumsum()

Reason — The cumsum() function in pandas is used to calculate the cumulative sum of a column in a DataFrame i.e., in the output of this function, the value of each row is replaced by the sum of all prior rows, including this row.

Question 17

The function to get the index of maximum value in a column of DataFrame is ............... .

  1. max()
  2. index()
  3. idxmax()
  4. maxidx()

Answer

idxmax()

Reason — The idxmax() function in pandas is used to get the index of the maximum value in a column of a DataFrame.

Question 18

To get top 5 rows of a DataFrame, you may use ............... function.

  1. head()
  2. head(5)
  3. top()
  4. top(5)

Answer

head(), head(5)

Reason — The head() function in pandas is used to get the top n rows of a DataFrame, where n is an optional argument. If n is not provided, it defaults to 5. Hence, both head() and head(5) are correct and will return the top 5 rows of the DataFrame.

Question 19

To get bottom 3 rows of a DataFrame, you may use ............... function.

  1. tail()
  2. tail(3)
  3. bottom()
  4. bottom(3)

Answer

tail(3)

Reason — The tail() function in pandas is used to get the bottom n rows of a DataFrame, where n is an optional argument. Hence, tail(3) is correct and will return the bottom 3 rows of the DataFrame.

Question 20

Function ............... can be used to drop missing values.

  1. fillna()
  2. isnull()
  3. dropna()
  4. delna()

Answer

dropna()

Reason — The dropna() function in pandas is used to drop rows with missing values.

Question 21

Which of the following methods of combining two DataFrames is a patching method ?

  1. concat()
  2. merge()
  3. join()
  4. none of these

Answer

join()

Reason — The join() function in pandas is a patching method, which means that it aligns the DataFrames based on a common index or set of keys, and then combines them.

Fill in the Blanks

Question 1

iterrows() and iteritems() functions help you to iterate over a DataFrame.

Question 2

To add two DataFrames, you may use functions add() or radd( ).

Question 3

To concatenate two string columns of a DataFrame, + operator is used.

Question 4

The mode() function returns the maximum repeating value.

Question 5

The mean() function returns the average of given data.

Question 6

The median() function returns the halfway point in a given data.

Question 7

Using quantile() function you can calculate terciles.

Question 8

Using quantile() function you can calculate quartiles.

Question 9

Using quantile() function you can calculate octiles.

Question 10

Using isnull() method, you can check if there are any missing values in DataFrame.

Question 11

The merge() function combines two DataFrames such that two rows with some common value are merged together in the final result.

True/False Questions

Question 1

The iteritems() iterates over the rows of a DataFrame.

Answer

False

Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.

Question 2

The iteritems() iterates over the columns of a DataFrame.

Answer

True

Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.

Question 3

The result produced by the functions sub() and rsub() is the same.

Answer

False

Reason — The sub() and rsub() functions produce different results because they subtract the operands in a different order. The sub() function performs element-wise subtraction between two DataFrames, subtracting the right operand from the left operand. The syntax is <DF1>.sub(<DF2>), which means <DF1> - <DF2>. On the other hand, the rsub() function performs element-wise subtraction with the right operand subtracted from the left operand. The syntax is <DF1>.rsub(<DF2>), which means <DF2> - <DF1>.

Question 4

The result produced by the functions add() and radd() is the same.

Answer

True

Reason — When adding two DataFrame objects using the add() function with the syntax <DF1>.add(<DF2>), it means <DF1> + <DF2>. Similarly, using the radd() function with the syntax <DF1>.radd(<DF2>) means <DF2> + <DF1>. Both operations produce the same results because addition is commutative, meaning the order of operands does not affect the result.

Question 5

The result produced by the functions div() and rdiv() is the same.

Answer

False

Reason — The div() and rdiv() functions do not produce the same result. The div() function performs element-wise division between two DataFrames, where the left operand is the dividend and the right operand is the divisor. The syntax is <DF1>.div(<DF2>), which means <DF1> / <DF2>. On the other hand, the rdiv() function performs element-wise division with the right operand as the dividend and the left operand as the divisor. The syntax is <DF1>.rdiv(<DF2>), which means <DF2> / <DF1>.

Question 6

The info() and describe() list the same information about a DataFrame.

Answer

False

Reason — The info() method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the describe() method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

Question 7

Function add() and operator + give the same result.

Answer

True

Reason — In pandas, both the '+' operator and the add() function can be used to perform element-wise addition between two DataFrames, resulting in the same output. The syntax for the '+' operator is <DF1> + <DF2>, while the syntax for the add() function is <DF1>.add(<DF2>), which is equivalent to <DF1> + <DF2>.

Question 8

Function rsub() and operator - give the same result.

Answer

False

Reason — The rsub() function and the '-' operator in pandas do not produce the same result because the rsub() function performs reverse subtraction, which means it subtracts the left operand from the right, while the '-' subtracts the right operand from the left operand.

Question 9

The minus - operator's result is same as sub() and rsub().

Answer

False

Reason — The '-' operator in pandas subtracts the right operand from the left operand, similar to the sub() function. However, the rsub() function subtracts the left operand from the right operand. Therefore, '-' and sub() produce the same result, while rsub() produces a different result due to the order of operands.

Question 10

Python integer datatype can store NaN values.

Answer

False

Reason — Python integer datatype cannot store NaN values. To store a NaN value in a column, the datatype of a column is changed to non-integer suitable type.

Question 11

Functions sum() and cumsum() produce the same result.

Answer

False

Reason — The function sum() returns the sum of the values for the requested axis. On the other hand, cumsum() calculates the cumulative sum. In the output of this function, the value of each row is replaced by the sum of all prior rows, including this row.

Question 12

The fillna() can also fill individual missing values for different columns.

Answer

True

Reason — The fillna() function in pandas can be used to fill individual missing values for different columns. This can be done by passing a dictionary with column names as keys and the corresponding filling values as values to the fillna() function.

Question 13

To drop missing values from a DataFrame, the function used is delna().

Answer

False

Reason — To drop missing values from a DataFrame, the function used is dropna().

Assertions and Reasons

Question 1

Assertion. A quantile refers to equally distributed portion of a data set.

Reason. A median divides a distribution in 2 quantiles while a quartile divides a distribution in 4 quantiles.

  1. Both A and R are true and R is the correct explanation of A.
  2. Both A and R are true but R is not the correct explanation of A.
  3. A is true but R is false.
  4. A is false but R is true.

Answer

Both A and R are true and R is the correct explanation of A.

Explanation
Quantile is a statistical term that refers to dividing the total distribution of given data into a given number of equal proportions. Each portion is called a quantile, and it represents a specific percentage of the data set. The median divides a distribution into two equal parts, representing the middle value of a data set when the data is sorted in ascending order. A quartile divides a distribution into four equal parts, with each quartile representing 25% of the data set.

Question 2

Assertion. Data aggregation produces a summary statistics of a dataset.

Reason. Data aggregation summarizes data using statistical aggregation functions.

  1. Both A and R are true and R is the correct explanation of A.
  2. Both A and R are true but R is not the correct explanation of A.
  3. A is true but R is false.
  4. A is false but R is true.

Answer

Both A and R are true and R is the correct explanation of A.

Explanation
Data aggregation produces a summary statistic of a dataset because it involves combining and summarizing data from multiple sources or data points, and then calculating a single summary statistic or a set of summary statistics. This process utilizes statistical aggregation functions, such as sum, average, count, minimum, and maximum.

Question 3

Assertion. In a dataset, there can be missing values that cannot contribute to any computation.

Reason. In a dataset, NULL, NaN or None are considered the missing values.

  1. Both A and R are true and R is the correct explanation of A.
  2. Both A and R are true but R is not the correct explanation of A.
  3. A is true but R is false.
  4. A is false but R is true.

Answer

Both A and R are true and R is the correct explanation of A.

Explanation
Missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance. In a dataset, NULL, NaN or None are considered the missing values.

Question 4

Assertion (A). The output of addition of two series will be NaN, if one of the elements or both the elements have no value(s).

Reason (R). While performing mathematical operations on a series, by default all missing values are filled in with 0.

  1. Both A and R are true and R is the correct explanation of A.
  2. Both A and R are true but R is not the correct explanation of A.
  3. A is true but R is false.
  4. A is false but R is true.

Answer

A is true but R is false.

Explanation
When adding two pandas Series and encountering NaN values in either or both Series, the result of the addition operation will be NaN for those corresponding elements. Additionally, pandas does not automatically replace missing values with 0 during mathematical operations but instead with NaN.

Type A: Very Short Answer Questions

Question 1

Name the function to iterate over a DataFrame horizontally.

Answer

The iterrows() function iterates over a DataFrame horizontally.

Question 2

Name the function to iterate over a DataFrame vertically.

Answer

The iteritems() function iterates over a DataFrame vertically.

Question 3

Write equivalent expressions for the given functions :

(i) A.add(B)

(ii) B.add(A)

(iii) A.sub(B)

(iv) B.sub(A)

(v) A.rsub(B)

(vi) B.mul(A)

(vii) A.rdiv(B)

(viii) B.div(A)

(ix) B.rdiv(A)

(x) A.div(B)

Answer

(i) A.add(B) — A + B

(ii) B.add(A) — B + A

(iii) A.sub(B) — A - B

(iv) B.sub(A) — B - A

(v) A.rsub(B) — B - A

(vi) B.mul(A) — B * A

(vii) A.rdiv(B) — B / A

(viii) B.div(A) — B / A

(ix) B.rdiv(A) — A / B

(x) A.div(B) — A / B

Question 4

Is the result of sub() and rsub() the same? Why/why not ?

Answer

The sub() and rsub() functions produce different results because they subtract the operands in a different order. The sub() function performs element-wise subtraction between two DataFrames, subtracting the right operand from the left operand. The syntax is <DF1>.sub(<DF2>), which means <DF1> - <DF2>. On the other hand, the rsub() function performs element-wise subtraction with the right operand subtracted from the left operand. The syntax is <DF1>.rsub(<DF2>), which means <DF2> - <DF1>.

Question 5

Write appropriate functions to perform the following on a DataFrame ?

(i) Calculate the sum

(ii) Count the values

(iii) Calculate the average

(iv) Calculate the most repeated value

(v) Calculate the median

(vi) Calculate the standard deviation

(vii) Calculate the variance

(viii) Calculate the maximum value

(ix) Calculate the standard deviation

(x) Calculate the variance

Answer

(i) Calculate the sum — sum() function

(ii) Count the values — count() function

(iii) Calculate the average — mean() function

(iv) Calculate the most repeated value — mode() function

(v) Calculate the median — median() function

(vi) Calculate the standard deviation — std() function

(vii) Calculate the variance — var() function

(viii) Calculate the maximum value — max() function

(ix) Calculate the standard deviation — std() function

(x) Calculate the variance — var() function

Question 6

What does info() and describe() do ?

Answer

The info() method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the describe() method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

Question 7

Are sum() and add() functions the same ?

Answer

No, the sum() and add() functions are not the same in pandas. The sum() function in pandas calculates the sum of values along a specified axis. On the other hand, the add() function in pandas is used to add two objects element-wise.

Question 8

Name some functions that perform descriptive statistics on a DataFrame.

Answer

Some functions that perform descriptive statistics on a DataFrame are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().

Question 9

To consider only the numeric values for calculation, what argument do you pass to statistics functions of Pandas ?

Answer

The numeric_only = True argument is passed to statistics functions in pandas to consider only the numeric values (int, float, boolean columns) for calculation.

Question 10

Is there one function that calculates much of descriptive statistics values ? Name it.

Answer

Yes, there is one function that calculates many descriptive statistics values, namely the describe() function.

Question 11

What happens if mode() returns multiple values for a column but other columns have a single mode ?

Answer

When the mode() function in pandas returns multiple values for a column (meaning there are multiple modes), but other columns have a single mode, pandas fills the non-mode columns with NaN values.

Question 12

What is quantile and quartile ?

Answer

Quantile is a process of dividing the total distribution of given data into a given number of equal proportions.

Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

Question 13

Name the function that lets you calculate different types of quantiles.

Answer

The function used to calculate different types of quantiles is quantiles() with syntax : <dataframe>.quantile(q = 0.5, axis = 0, numeric_only = True).

Question 14

Name the functions that give you maximum and minimum values in a DataFrame.

Answer

The max() and min() functions find out the maximum and minimum values respectively from a DataFrame.

Question 15

Name the functions that give you the index of maximum and minimum values in a DataFrame.

Answer

The idxmax() and idxmin() functions find out the index of maximum and minimum values respectively from a DataFrame.

Question 16

What is missing data ?

Answer

Missing data, also known as missing values are the values that cannot contribute to any computation or we can say that missing values are the values that carry no computational significance.

Question 17

Why is missing data filled in DataFrame with some value ?

Answer

The dropna() method in pandas can be used to remove rows or columns with missing values from a DataFrame. However, this can also lead to a loss of non-null data, which may not be desirable for data analysis. To avoid losing non-null data, we can use the fillna() method to fill missing values with a specified value. This can help to ensure that the dataset is complete and can be used for data analysis.

Question 18

Name the functions you can use for filling missing data.

Answer

In Python, the fillna() function is used to fill missing data in a Pandas DataFrame.

Type B: Short Answer Questions/Conceptual Questions

Question 1

How do you iterate over a DataFrame?

Answer

To iterate over a DataFrame in Python, we can use various methods depending on our specific needs. We can use the iterrows() method to iterate over each row in the DataFrame. In this method, each horizontal subset is in the form of (row-index, series), where the series contains all column values for that row-index. Additionally, we can use the iteritems() method to iterate over each column in the DataFrame. Here, each vertical subset is in the form of (column-index, series), where the series contains all row values for that column-index.

Question 2

What are binary operations ? Name the functions that let you perform binary operations on a DataFrame.

Answer

Binary operations refer to operations that require two values to perform and these values are picked element-wise. In a binary operation involving DataFrames, the data from the two DataFrames are aligned based on their row and column indexes. For matching row and column indexes, the specified operation is performed, while for non-matching row and column indexes, NaN values are stored in the result. The functions that perform binary operations on a DataFrame include add(), radd(), sub(), rsub(), mul(), div(), and rdiv().

Question 3

What is descriptive statistics ? Name Pandas descriptive statistics functions.

Answer

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information.

The functions that perform descriptive statistics on a DataFrame are min(), max(), idxmax(), idxmin(), mode(), mean(), median(), count(), sum(), quantile(), std(), var().

Question 4

The info() and describe() are said to be inspection functions. What do they do ?

Answer

The info() and describe() functions in pandas are considered inspection functions because they provide valuable insights into the structure, content, and statistical summary of a DataFrame. The info() method in pandas provides basic information about a DataFrame, including the data types of each column, the number of rows, and the memory usage. It also displays the index name and type. On the other hand, the describe() method provides detailed descriptive statistics about the numerical columns in a DataFrame, including count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.

Question 5

What is the difference between sum and cumulative sum? How do you perform the two on DataFrame ?

Answer

The sum provides the total of all numbers in a sequence, while the cumulative sum represents the running total of numbers encountered up to a specific point in the sequence.

The sum() function gives the total sum of values along specified axes, whereas the cumsum() function provides the cumulative sum of values along specified axes, either column-wise or row-wise.

Question 6

Name and explain some cumulative functions provided by Pandas.

Answer

The cumulative functions provided by Pandas are cumsum(), cumprod(), cummax(), cummin().

  1. cumsum() — It calculates cumulative sum i.e., in the output of this function, the value of each row is replaced by sum of all prior rows including this row. The syntax is <DF>.cumsum([axis = None]).

  2. cumprod() — It calculates cumulative product of values in a DataFrame object.

  3. cummax() — It calculates cumulative maximum of value from a DataFrame object.

  4. cummin() — It calculates cumulative minimum of value from a DataFrame object.

Question 7

The head() and tail() extract rows or columns from a DataFrame. Explain.

Answer

The head() function in pandas retrieves the top n rows of a DataFrame, where n is an optional argument defaulting to 5 if not provided. It is used with the syntax <DF>.head(). Similarly, the tail() function in pandas fetches the bottom n rows of a DataFrame, where n is also optional and defaults to 5 if not specified. Its syntax is <DF>.tail().

Question 8

Why does Python change the datatype of a column as soon as it stores an empty value (NaN) even though it has all other values stored as integer ?

Answer

When a column in a pandas DataFrame contains a mixture of integers and missing values (NaNs), the data type is automatically changed to a floating-point type to accommodate these missing values because in Python integer types cannot store NaN values.

Question 9

What do quantile and var() functions do ?

Answer

The quantile() function returns the values at the given quantiles over requested axis (axis 0 or 1). On the other hand, var() function computes variance and returns the unbiased variance over the requested axis.

Question 10

What is a quartile ? How is it different from quantile ?

Answer

Quartile refers to the division of the total distribution of given data into a four equal proportions with each containing one fourth of the total population.

Quartiles specifically divide the dataset into four equal parts, while quantiles can divide the dataset into any desired number of equal parts.

Question 11

How do you create quantiles and quartiles in Python Pandas ?

Answer

In Python pandas, we can create quantiles and quartiles using the quantile() method.

The statement to create quartiles is df.quantile(q = 0.25).

Question 12

Assume that required libraries (Pandas and Numpy) are imported and DataFrame ndf has been created as shown in solved problem 16. Predict the output produced by following code fragment :

print(ndf[ndf["age"] > 30])
print(ndf.head(2)) 
print(ndf.tail(3))

The following DataFrame ndf is from solved problem 16 :

       Name Sex    Position       City  age  Projects  Budget
0    Rabina   F     Manager  Bangalore   30        13      48
1      Evan   M   Programer  New Delhi   27        17      13
2       Jia   F     Manager    Chennai   32        16      32
3     Lalit   M     Manager     Mumbai   40        20      21
4  Jaspreet   M  Programmer    Chennai   28        21      17
5      Suji   F  Programmer  Bangalore   32        14      10

Answer

Output
    Name Sex    Position       City  age  Projects  Budget
2    Jia   F     Manager    Chennai   32        16      32
3  Lalit   M     Manager     Mumbai   40        20      21
5   Suji   F  Programmer  Bangalore   32        14      10
     Name Sex   Position       City  age  Projects  Budget
0  Rabina   F    Manager  Bangalore   30        13      48
1    Evan   M  Programer  New Delhi   27        17      13
       Name Sex    Position       City  age  Projects  Budget
3     Lalit   M     Manager     Mumbai   40        20      21
4  Jaspreet   M  Programmer    Chennai   28        21      17
5      Suji   F  Programmer  Bangalore   32        14      10
Explanation
  1. print(ndf[ndf["age"] > 30]) — The ndf["age"] > 30 creates a boolean mask where True indicates rows where the "age" column has values greater than 30. Then ndf[...] uses this boolean mask to filter the DataFrame ndf, returning only rows where the condition is True.

  2. print(ndf.head(2)) — It prints the first 2 rows of the DataFrame ndf.

  3. print(ndf.tail(3)) — It prints the last 3 rows of the DataFrame ndf.

Question 13

Given the two DataFrames as :

>>> dfc1        
    -   -         
    0   1        
0   2   a      
1   3   b      
2   4   c   
>>> dfc2
    -   -   
    0   1   2
0   2   3   4
2   p   q   r

Why are following statements giving errors ?

(a) print(dfc1 + dfc2)

(b) print(dfc1.sub(dfc2))

(c) print(dfc1 * dfc2)

Answer

(a) print(dfc1 + dfc2) — This statement tries to add two DataFrames dfc1 and dfc2. However, they have different shapes (dfc1 has 3 rows and 2 columns, while dfc2 has 2 rows and 3 columns), and their indices do not match. DataFrame addition requires both DataFrames to have the same shape and compatible indices, which is not the case here.

(b) print(dfc1.sub(dfc2)) — This statement attempts to subtract dfc1 from dfc2 using the sub() method. Similar to addition, subtraction between DataFrames requires them to have the same shape and compatible indices, which is not satisfied here due to the mismatch in shapes and indices.

(c) print(dfc1 * dfc2) — This statement tries to perform element-wise multiplication between dfc1 and dfc2. Again, this operation requires both DataFrames to have the same shape and compatible indices, which is not the case here due to the mismatched shapes and indices.

Question 14

Consider the following code that creates two DataFrames :

ore1 = pd.DataFrame(np.array([[20, 35, 25, 20], [11, 28, 32, 29]]),
       columns = ['iron', 'magnesium', 'copper', 'silver'])
ore2 = pd.DataFrame(np.array([[14, 34, 26, 26], [33, 19, 25, 23]]),
       columns = ['iron', 'magnesium', 'gold', 'silver'])

What will be the output produced by the following code fragments ?

(a) print(ore1 + ore2)
     ore3 = ore1.radd(ore2)
     print(ore3)

(b) print(ore1 - ore2)
     ore3 = ore1.rsub(ore2)
     print(ore3)

(c) print(ore1 * ore2)
     ore3 = ore1.mul(ore2)
     print(ore3)

(d) print(ore1 / ore2)
     ore3 = ore1.rdiv(ore2)
     print(ore3)

Answer

(a)

Output
   copper  gold  iron  magnesium  silver
0     NaN   NaN    34         69      46
1     NaN   NaN    44         47      52
   copper  gold  iron  magnesium  silver
0     NaN   NaN    34         69      46
1     NaN   NaN    44         47      52
Explanation
  1. print(ore1 + ore2): This line attempts to add the DataFrames ore1 and ore2 using the '+' operator. When adding DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.
  2. ore3 = ore1.radd(ore2): This line uses the radd() method, which reverses the addition operation between DataFrames ore1 and ore2.
  3. The radd() function and '+' operator produce the same result in pandas, as the order of operands does not affect addition due to its commutative property.

(b)

Output
  copper  gold  iron  magnesium  silver
0     NaN   NaN     6          1      -6
1     NaN   NaN   -22          9       6
   copper  gold  iron  magnesium  silver
0     NaN   NaN    -6         -1       6
1     NaN   NaN    22         -9      -6
Explanation
  1. print(ore1 - ore2): This line performs a subtraction operation between corresponding elements in DataFrames ore1 and ore2. When subtracting DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.
  2. ore3 = ore1.rsub(ore2): This line uses the rsub() method, which reverses the subtraction operation between DataFrames ore1 and ore2.
  3. The rsub() function and the '-' operator in pandas do not produce the same result because the rsub() function performs reverse subtraction, which means it subtracts the left operand from the right, while the '-' subtracts the right operand from the left operand.

(c)

Output
   copper  gold  iron  magnesium  silver
0     NaN   NaN   280       1190     520
1     NaN   NaN   363        532     667
   copper  gold  iron  magnesium  silver
0     NaN   NaN   280       1190     520
1     NaN   NaN   363        532     667
Explanation
  1. print(ore1 * ore2): This line attempts to perform element-wise multiplication between the DataFrames ore1 and ore2 using the '*' operator. When multiplying DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.
  2. ore3 = ore1.mul(ore2): This line uses the mul() method to perform element-wise multiplication between DataFrames ore1 and ore2.

(d)

Output
   copper  gold      iron  magnesium    silver
0     NaN   NaN  1.428571   1.029412  0.769231
1     NaN   NaN  0.333333   1.473684  1.260870
   copper  gold  iron  magnesium    silver
0     NaN   NaN   0.7   0.971429  1.300000
1     NaN   NaN   3.0   0.678571  0.793103
Explanation
  1. print(ore1 / ore2): This line attempts to perform element-wise division between the DataFrames ore1 and ore2 using the '/' operator. When dividing DataFrames with different shapes and column names, pandas aligns the DataFrames based on indices and columns, resulting in NaN values where elements are missing in either DataFrame.
  2. ore3 = ore1.rdiv(ore2): This line uses the rdiv() method to perform reciprocal division between DataFrames ore1 and ore2.

Question 15

Consider the DataFrame wdf as shown below :

 minTempmaxTempRainfallEvaporation
02.98.024.30.0
13.114.026.93.6
26.213.723.43.6
35.313.315.539.8
46.317.616.12.8
55.418.216.90.0
65.521.118.20.2
74.818.317.00.0
83.620.819.50.0
97.719.422.816.2
109.924.125.20.0
1111.828.527.30.2
1213.229.127.90.0
1316.824.130.90.0
1419.428.131.20.0
1521.634.432.10.0
1620.433.831.20.0
1718.526.730.01.2
1818.832.432.30.6
1917.628.633.40.0
2019.730.333.40.0

(a) Write statement(s) to calculate minimum value for each of the columns.

(b) Write statement(s) to calculate maximum value for each of the rows.

(c) Write statement(s) to calculate variance for column Rainfall.

(d) Write statement(s) to compute mean , mode median for last 10 rows.

Answer

(a)

>>> wdf.min()
Output
Minimum values for each column:
minTemp         2.9
maxTemp         8.0
Rainfall       15.5
Evaporation     0.0
dtype: float64

(b)

>>> wdf.max(axis=1)
Output
Maximum values for each row:
0     24.3
1     26.9
2     23.4
3     39.8
4     17.6
5     18.2
6     21.1
7     18.3
8     20.8
9     22.8
10    25.2
11    28.5
12    29.1
13    30.9
14    31.2
15    34.4
16    33.8
17    30.0
18    32.4
19    33.4
20    33.4
dtype: float64

(c)

>>> wdf['Rainfall'].var()
Output
Variance for column Rainfall: 38.852999999999994

(d)

>>> last_10_rows = wdf.tail(10)
>>> last_10_rows.mean()
>>> last_10_rows.mode()
>>> last_10_rows.median()
Output
Mean for last 10 rows:
minTemp        17.78
maxTemp        29.60
Rainfall       30.97
Evaporation     0.20
dtype: float64

Mode for last 10 rows:
   minTemp  maxTemp  Rainfall  Evaporation
0     11.8     24.1      31.2          0.0
1     13.2     26.7      33.4          NaN
2     16.8     28.1       NaN          NaN
3     17.6     28.5       NaN          NaN
4     18.5     28.6       NaN          NaN
5     18.8     29.1       NaN          NaN
6     19.4     30.3       NaN          NaN
7     19.7     32.4       NaN          NaN
8     20.4     33.8       NaN          NaN
9     21.6     34.4       NaN          NaN

Median for last 10 rows:
minTemp        18.65
maxTemp        28.85
Rainfall       31.20
Evaporation     0.00
dtype: float64

Type C: Long Answer Questions

Question 1

Write a program to print a DataFrame one column at a time and print only first three columns.

Solution
import pandas as pd

data = {
    'Name': ['Aliya', 'Hemanth', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Bangalore', 'Chennai', 'Mumbai'],
    'Salary': [50000, 60000, 70000]
}

df = pd.DataFrame(data)
first_three_columns = df.iloc[:, :3]

print("Each column:")
for column_name in first_three_columns:
    column_data = first_three_columns[column_name]
    print(column_name)
    print(column_data)
Output
Each column:
Name
0      Aliya
1    Hemanth
2    Charlie
Name: Name, dtype: object
Age
0    25
1    30
2    35
Name: Age, dtype: int64
City
0    Bangalore
1      Chennai
2       Mumbai
Name: City, dtype: object

Question 2

Write a program to print a DataFrame one row at a time and print only first five rows.

Solution
import pandas as pd
data = {
    'Name': ['Amruta', 'Harsh', 'Yogesh', 'Shreya', 'Zoya', 'Nyra'],
    'Age': [25, 30, 35, 40, 45, 28],
    'City': ['Chandigarh', 'Jaipur', 'Dehradun', 'Delhi', 'Vadodara', 'Guwahati']
}

df = pd.DataFrame(data)
first_five_rows = df.head(5)
print("Each row:")
for index, row in first_five_rows.iterrows():
    print("Index:", index)
    print(row)
Output
Each row:
Index: 0
Name        Amruta
Age             25
City    Chandigarh
Name: 0, dtype: object
Index: 1
Name     Harsh
Age         30
City    Jaipur
Name: 1, dtype: object
Index: 2
Name      Yogesh
Age           35
City    Dehradun
Name: 2, dtype: object
Index: 3
Name    Shreya
Age         40
City     Delhi
Name: 3, dtype: object
Index: 4
Name        Zoya
Age           45
City    Vadodara
Name: 4, dtype: object

Question 3

Write a program that performs count, sum, max, and min functions :

  • On rows

  • On columns

Solution
import pandas as pd
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}

df = pd.DataFrame(data)
print("Operations on rows:")
print("Count:")
print(df.count(axis=1))
print("Sum:")
print(df.sum(axis=1))
print("Max:")
print(df.max(axis=1))
print("Min:")
print(df.min(axis=1))

print("\nOperations on columns:")
print("Count:")
print(df.count())
print("Sum:")
print(df.sum())
print("Max:")
print(df.max())
print("Min:")
print(df.min())
Output
Operations on rows:
Count:
0    3
1    3
2    3
dtype: int64
Sum:
0    12
1    15
2    18
dtype: int64
Max:
0    7
1    8
2    9
dtype: int64
Min:
0    1
1    2
2    3
dtype: int64

Operations on columns:
Count:
A    3
B    3
C    3
dtype: int64
Sum:
A     6
B    15
C    24
dtype: int64
Max:
A    3
B    6
C    9
dtype: int64
Min:
A    1
B    4
C    7
dtype: int64

Question 4

Take a DataFrame of your choice. Write a program to calculate count of values only in a selective column.

Solution
import pandas as pd
data = {
    'Name': ['Deepika', 'Simran', 'Shivang', 'Anurag', 'Ankith'],
    'Age': [25, 30, None, 40, 45],
    'City': ['Goa', 'Kochi', 'Pondicherry', 'Rohtak', 'Kasol']
}

df = pd.DataFrame(data)
selected_column = 'Age'
count_values = df[selected_column].count()
print("Count of values in", selected_column, ":" , count_values)
Output
Count of values in Age : 4

Question 5

Give two identical DataFrames Sales16 and Sales17. But Sales17 has some values missing. Write code so that Sales17 fills its missing values from the corresponding entries of Sales16.

Solution
import pandas as pd
data_sales16 = {
    'Product': ['A', 'B', 'C', 'D'],
    'Sales': [100, 150, 120, 180]
}
Sales16 = pd.DataFrame(data_sales16)
data_sales17 = {
    'Product': ['A', 'B', 'C', 'D'],
    'Sales': [100, None, 120, None]
}
Sales17 = pd.DataFrame(data_sales17)
Sales17 = Sales16.fillna({'B': 150, 'D': 180})
print("Sales16:")
print(Sales16)
print("\nSales17 (after filling missing values):")
print(Sales17)
Output
Sales16:
  Product  Sales
0       A    100
1       B    150
2       C    120
3       D    180

Sales17 (after filling missing values):
  Product  Sales
0       A    100
1       B    150
2       C    120
3       D    180

Question 6

Write code that just produces single True/False as a result for the presence of missing values in whole DataFrame namely df.

Solution
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 5, 6, 7], 'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)
has_missing_values = False

for column in df.columns:
    for value in df[column]:
        if pd.isnull(value):
            has_missing_values = True
            break
print(has_missing_values)
Output
True

Question 7

Four Series objects Temp1, Temp2, Temp3 and Temp4 store the temperatures of week1, week2, week3 and week4 respectively. Create a DataFrame from these four series objects where the indexes should be 'Sunday', 'Monday', ... , 'Saturday', and columns should be 'Week1', 'Week2', 'Week3' and 'week4'.

Solution
import pandas as pd
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
week1 = [25.4, 26.7, 27.2, 28.9, 29.8, 30.2, 31.5]  
week2 = [22.3, 23.2, 24.6, 25.4, 26.8, 27.9, 28.4]  
week3 = [20.1, 21.2, 22.4, 23.9, 24.7, 25.57, 26.78]  
week4 = [18.4, 19.1, 20.3, 21.7, 22.67, 23.4, 24.09]  
Temp1 = pd.Series(week1, index=weekdays)
Temp2 = pd.Series(week2, index=weekdays)
Temp3 = pd.Series(week3, index=weekdays)
Temp4 = pd.Series(week4, index=weekdays)
df = pd.DataFrame({'Week1': Temp1, 'Week2': Temp2, 'Week3': Temp3, 'Week4': Temp4})
print(df)
Output
           Week1  Week2  Week3  Week4
Sunday      25.4   22.3  20.10  18.40
Monday      26.7   23.2  21.20  19.10
Tuesday     27.2   24.6  22.40  20.30
Wednesday   28.9   25.4  23.90  21.70
Thursday    29.8   26.8  24.70  22.67
Friday      30.2   27.9  25.57  23.40
Saturday    31.5   28.4  26.78  24.09

Question 8

Given a DataFrame that stores the details of past 25 years' monthly sales. Some old data is however missing. Write a script to calculate average :

  • Monthly sales across years

  • Yearly sales

Make sure that missing values do not hamper the overall result.

Solution
import pandas as pd
data = {
    'Year': [],
    'Month': [],
    'Sales': []
}

n = int(input("Enter the number of rows in the DataFrame: "))
for i in range(n):
    year = int(input("Enter the year: "))
    month = int(input("Enter the month (1-12): "))
    sales = float(input("Enter the sales amount: "))
    data['Year'].append(year)
    data['Month'].append(month)
    data['Sales'].append(sales)

df = pd.DataFrame(data)

df.fillna(0, inplace=True)

avg_monthly_sales = df.groupby('Month')['Sales'].mean()

yearly_sales = df.groupby('Year')['Sales'].mean()

print("Average Monthly Sales Across Years:")
print(avg_monthly_sales)
print("\nYearly Sales:")
print(yearly_sales)
Output
Average Monthly Sales Across Years:
Month
1     543.25
2     644.25
3     486.50
4     583.50
5     376.50
6     644.75
7     847.00
8     497.75
9     564.50
10    701.00
11    524.75
12    522.25
Name: Sales, dtype: float64

Yearly Sales:
Year
2000    781.082333
2001    480.030233
2002    390.530000
2003    682.917767
2004    421.250050
2005    408.333333
2006    509.757500
2007    634.333667
2008    907.250000
2009    876.222222
2010    940.757578
2011    690.900067
2012    623.250043
2013    890.088888
2014    491.500000
2015    601.919997
2016    623.250055
2017    489.086633
2018    523.750000
2019    789.910067
2020    456.250000
2021    481.083333
2022    590.750000
2023    618.916667
2024    621.250000
Name: Sales, dtype: float64

Question 9

Given two identical DataFrames Sales16 and Sales17. But Sales17 has some values missing. Write code so that Sales17 fills its missing values from corresponding entries of Sales16.

Solution
import pandas as pd
data_sales16 = {
    'Product': ['A', 'B', 'C', 'D'],
    'Sales': [100, 150, 120, 180]
}
Sales16 = pd.DataFrame(data_sales16)
data_sales17 = {
    'Product': ['A', 'B', 'C', 'D'],
    'Sales': [100, None, 120, None]
}
Sales17 = pd.DataFrame(data_sales17)
Sales17 = Sales16.fillna({'B': 150, 'D': 180})
print("Sales16:")
print(Sales16)
print("\nSales17 (after filling missing values):")
print(Sales17)
Output
Sales16:
  Product  Sales
0       A    100
1       B    150
2       C    120
3       D    180

Sales17 (after filling missing values):
  Product  Sales
0       A    100
1       B    150
2       C    120
3       D    180

Question 10

Write code that just produces single True/False as a result for the presence of missing values in whole DataFrame namely df.

Solution
import pandas as pd
data = {'A': [1, 2, None, 4], 'B': [None, 5, 6, 7], 'C': [8, 9, 10, 11]}
df = pd.DataFrame(data)
has_missing_values = False

for column in df.columns:
    for value in df[column]:
        if pd.isnull(value):
            has_missing_values = True
            break
print(has_missing_values)
Output
True