Data Handling using Pandas
Pandas is the most popular open-source Python library used for doing data analysis.
In order to work with Pandas in Python, you need to import pandas library in your program.
The two basic and universally popular data structures of Pandas are Series and DataFrame.
A Series is a Pandas data structure that represents a one-dimensional array-like object of indexed data.
To create a series object, Series() method is used.
Missing data in Pandas series and DataFrames can be filled with a NaN-Not a Number value.
DataFrame has two indices.
Selecting a subset from a DataFrame requires loc and iloc functions.
read_csv() function is used to read data from a CSV file in your DataFrame.
to_csv() function saves the data of DataFrame to a CSV file.
A series object is 2D array that stores ordered collection columns that can store data of different type.
Answer
False
Reason — A DataFrame object is 2D labelled array like Pandas data structure that stores an ordered collection columns that can store data of different types.
A DataFrame is a 1D array-like object containing an array of data and an associated array of data labels.
Answer
False
Reason — A Series is a Pandas data structure that represents a one-dimensional array like object containing an array of data and an associated array of data labels, called its index.
To access subset of a dataframe, we can use loc[] and iloc[] attributes.
Answer
True
Reason — To access a subset of a DataFrame in pandas, we can utilize the loc[]
and iloc[]
attributes, where loc[]
is label-based and iloc[]
is integer index-based.
The iteritems() iterate over vertical subset in the form of (col_index,series) pair.
Answer
True
Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
The value NA/NAT/None are the same in Pandas and considered as NaN-values.
Answer
True
Reason — In Pandas, 'NA' (Not Available) and 'NAT' (Not A Time) are used to represent missing values in numeric and datetime columns, respectively. 'None' is used to represent missing values in object columns. Although they are used for different data types, they are all considered as NaN (Not a Number) values.
The iteritems() brings horizontal subsets from a dataframe.
Answer
False
Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
The all() and any() functions are used to check if all or any item is non-zero, not-empty or not-False.
Answer
True
Reason — The all() and any() functions are used to check if all or any item is non-zero, not-empty or not-False.
CSV refers to tabular data saved as plain text where data values are separated by commas.
Answer
True
Reason — CSV stands for Comma Separated Values. It is a file format used to store tabular data, such as tables or spreadsheets, as plain text. In a CSV file, each line represents a single row of data, and each value in that row is separated from the next value by a comma (,).
read_csv() method automatically takes the last row of the CSV file.
Answer
False
Reason — The read_csv()
method in pandas does not automatically take the last row of the CSV file. Instead, it reads the entire CSV file into a DataFrame.
Data and index in an ndarray must be of the same length.
Answer
True
Reason — In a pandas DataFrame or Series, the data and index should be of the same length. Each element in the data should have a corresponding label in the index.
We need to define an index in Pandas.
Answer
False
Reason — In Pandas, an index is automatically created when a DataFrame or Series is created. By default, the index is a range of integers starting from 0, incrementing by 1 for each row.
Which of the following commands is used to install Pandas?
- pip install python-pandas
- pip install pandas
- python install python
- python install pandas
Answer
pip install pandas
Reason — The correct command to install Pandas is pip install pandas
.
A two-dimensional labelled array that is an ordered collection of columns to store heterogeneous data types is:
- Series
- NumPy array
- Dataframe
- Panel
Answer
Dataframe
Reason — A DataFrame object is 2D labelled array like Pandas data structure that stores an ordered collection of columns that can store heterogeneous data types.
In a dataframe, axis = 0 is for
- Columns
- Rows
- Rows and Columns both
- None of these
Answer
Rows
Reason — The axis 0 identifies a DataFrame's row index.
Which of the following statements is false?
- Dataframe is size-mutable.
- Dataframe is values-mutable.
- Dataframe is immutable.
- Dataframe is capable of holding multiple types of data.
Answer
Dataframe is immutable.
Reason — DataFrame is value-mutable, size-mutable and stores heterogeneous data types.
Which attribute of dataframe is used to perform the transpose operation on a dataframe?
- T
- Ndim
- Empty
- Shape
Answer
T
Reason — The T
attribute of DataFrame is used to perform transpose operation on a DataFrame. The syntax is : <DataFrame>.T
.
Which attribute of dataframe is used to retrieve its shape?
- T
- Ndim
- Empty
- Shape
Answer
Shape
Reason — The shape
attribute is used to retrieve DataFrame's shape.
Which attribute of dataframe is used to get number of axes?
- T
- Ndim
- Empty
- Shape
Answer
Ndim
Reason — The ndim
attribute of dataframe is used to get number of axes/array dimensions.
CSV stands for:
- Comma Separated Values
- Comma Separated Variables
- Column Separated Values
- Column Separated Variables
Answer
Comma Separated Values
Reason — The acronym CSV is short for Comma-Separated Value.
Which of the following can be used to specify the data while creating a dataframe?
- Series
- List of Dictionaries
- Structured ndarray
- All of these
Answer
All of these
Reason — We can create a DataFrame object by passing data in many different ways, such as two-dimensional dictionaries (i.e., dictionaries having lists or dictionaries or ndarrays or series objects etc), two-dimensional ndarrays, series type object and another DataFrame object.
Which of the following commands shows the information with city="Delhi" from dataframe SHOP?
- print(SHOP[City == 'Delhi'])
- print(SHOP[SHOP.City == 'Delhi])
- print(SHOP[SHOP.'City' == 'Delhi'])
- print(SHOP[SHOP[City] == 'Delhi'])
Answer
print(SHOP[SHOP.City == 'Delhi'])
Reason — The correct code print(SHOP[SHOP.City == 'Delhi'])
filters the SHOP
DataFrame to show only the rows where the City
column is equal to 'Delhi'. It does this by creating a boolean mask SHOP.City == 'Delhi'
that returns True for rows where the city is 'Delhi' and False otherwise, and then using this mask to select the corresponding rows from the original DataFrame using SHOP[]
. The resulting DataFrame, which contains only the rows that match the condition, is then printed to the console.
The following statement will ............... .
df = df.drop(['Name', 'Class', 'Rollno'], axis = 1) #df is a DataFrame object
- delete three columns having labels 'Name', 'Class' and `Rollno'
- delete three rows having labels 'Name', 'Class' and 'Rollno'
- delete any three columns
- return error
Answer
delete three columns having labels 'Name', 'Class' and `Rollno'
Reason — The drop()
function is used to remove rows from a DataFrame. In this case, the axis=1
parameter specifies that we want to drop columns. The list ['Name', 'Class', 'Rollno'] contains the labels of the columns to be dropped. Therefore, the statement will delete the three columns with labels 'Name', 'Class', and 'Rollno' from the DataFrame df
.
Assertion (A): Pandas is a Python library.
Reasoning (R): Pandas is a powerful, flexible and easy to use open source data analysis library.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
Pandas is a Python library that makes data analysis easy and effective. It is the most famous Python library for data science, offering powerful and flexible data structures that facilitate data analysis and manipulation. As an open-source library, Pandas provides high-performance, easy-to-use data structures and data analysis tools.
Assertion (A): A series stores data row-wise.
Reasoning (R): A series is a one-dimensional labelled data structure.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
A is false but R is true.
Explanation
A series in pandas does not store data row-wise. It is a one-dimensional array like object containing an array of data and an associated array of data labels, called its index.
Assertion (A): Dataframe has both a row and column index.
Reasoning (R): Dataframe is a two-dimensional labelled data structure like a table of MySQL.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
A DataFrame in Pandas has both a row index and a column index. It is a two-dimensional labeled data structure, similar to a table in MySQL, each value is identifiable with the combination of row and column indices.
Assertion (A): While creating a series using scalar values, index must be provided.
Reasoning (R): The scalar value is repeated to match the length of index.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
When creating a Series using scalar values, we must provide an index. This is because a Series requires an index to associate with each value. This scalar value shall be repeated to match the length of the index.
Assertion (A): The shape attribute returns the number of rows and number of columns available in dataframe.
Reasoning (R): The shape attribute returns the values in the form of list.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
A is true but R is false.
Explanation
The shape attribute of a Pandas DataFrame returns the number of rows and the number of columns in the DataFrame. The shape attribute returns the values in the form of tuple.
Assertion (A): After running the following code:
df = pd.DataFrame([11,46], index = ['True', 'False'])
print(df[True])
A key error will be produced.
Reasoning (R): Dataframe does not support Boolean Indexing.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
A is true but R is false.
Explanation
DataFrames do support Boolean Indexing, which allows to select rows based on a Boolean condition. The code df[True]
is trying to access a column named True
, which does not exist in the DataFrame. The index of the DataFrame is ['True', 'False']
. To access the row where the index is 'True', we would use df.loc['True']
. This is an example of label-based indexing, where we are selecting a row based on its index label.
Assertion (A): We can add a new column in an existing dataframe using .at or .loc methods.
Reasoning (R): When we reassign new values to an existing column in a dataframe, the previous values are overwritten.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true but R is not the correct explanation of A.
Explanation
We can add a new column in an existing dataframe using at
or loc
methods. When we reassign new values to an existing column in a dataframe, the previous values are overwritten.
Assertion (A): When a dataframe is created using dictionary, the keys will be the columns and corresponding values will be the rows of the dataframe.
Reasoning (R): NaN values are automatically filled for values of non-matching keys while creating a dataframe using dictionary.
- Both A and R are true and R is the correct explanation of A.
- Both A and R are true but R is not the correct explanation of A.
- A is true but R is false.
- A is false but R is true.
Answer
Both A and R are true and R is the correct explanation of A.
Explanation
When a dataframe is created using a dictionary, the keys of the dictionary become the columns of the dataframe, and the values of the dictionary become the rows of the dataframe. If the dictionaries used to create the dataframe have non-matching keys, NaN values will be automatically filled in for the missing values.
Inayra is writing a program using Pandas library for creating a dataframe from two dictionaries. Given below is the snippet she has developed. Help her to complete it by selecting the correct option given as under:
import ............... as pd # Statement 1
dict = {'x': [10,25], 'y' : [32,93] }
dict 1 = {'x',: [14,56], 'y': [36,72] }
df = pd.DataFrame (dict)
df1 = pd. ............... (dict) # Statement 2
df2 = pd. ............... ([df, df1]) # Statement 3
1. Choose the correct option from the following for Statement 1.
(a) pd
(b) data
(c) df
(d) pandas
2. Which of the following option should be taken for Statement 2?
(a) Series
(b) Dataframe
(c) DataFrame
(d) Dictionary
3. Select the correct method from the following for Statement 3.
(a) concat()
(b) shape()
(c) index()
(d) append()
Answer
1. pandas
Reason — The pandas library is imported as pd using the statement import pandas as pd
.
2. DataFrame
Reason — The pd.DataFrame()
function is used to create a DataFrame from a dictionary.
3. concat()
Reason — The pd.concat()
function is used to concatenate two or more DataFrames into a single DataFrame. In this case, it is used to concatenate df
and df1
into a single DataFrame df2
.
What is the significance of Pandas Library?
Answer
The significance of Python Pandas library is as follows:
- It can read or write in many different data formats (integer, float, double, etc.).
- It can calculate in all the possible ways data is organized i.e., across rows and down columns.
- It can easily select subsets of data from bulky data sets and even combine multiple datasets together. It has functionality to find and fill missing data.
- It allows to apply operations to independent groups within the data.
- It supports reshaping of data into different forms.
- It supports advanced time-series functionality.
- It supports visualization by integrating matplotlib and seaborn etc. libraries.
Name some common data structures of Python's Pandas Library.
Answer
The common data structures of Python's Pandas library are Series and DataFrame.
Name the function to iterate over a dataframe horizontally.
Answer
The iterrows()
function iterates over a DataFrame horizontally.
Name the function to iterate over a dataframe vertically.
Answer
The iteritems()
function iterates over a DataFrame vertically.
What is CSV file?
Answer
CSV (Comma Separated Values) files are delimited files that store tabular data (data stored in rows and columns as we see in spreadsheets or databases) where comma delimits every value. Each line in a CSV file is a data record. Each record consists of one or more fields, separated by commas (or the chosen delimiter).
What is the use of nrows argument in read_csv() method?
Answer
The nrows
argument in the read_csv()
method is used to specify the number of rows to read from the CSV file.
How can we create CSV file? Explain with steps.
Answer
The steps to create CSV files are as follows:
- Launch Microsoft Excel.
- Type the data given in Table in the Excel sheet.
- Save the file with a proper name by clicking File -> Save or Save As or press Ctrl + S to open the Save As window.
- Type the name of the file and select file type as CSV from the drop-down arrow.
- Click the Save button. Excel will ask for confirmation to select CSV format.
- Click OK.
- It will display a dialog box asking permission to keep comma as delimiter for CSV file.
- Lastly, click Yes to retain and save the Excel file in CSV format.
How do you iterate over a dataframe? Explain with the help of code snippet.
Answer
Iterating over rows in the DataFrame:
The iterrows()
method is used to iterate over each row in the DataFrame. In this method, each horizontal subset is in the form of (row-index, series), where the series contains all column values for that row-index.
For example :
import pandas as pd
total_sales = {2015 : {'Qtr1' : 34500, 'Qtr2' : 45000},
2016 : {'Qtr1' : 44500, 'Qtr2' : 65000}}
df = pd.DataFrame(total_sales)
for (row, rowseries) in df.iterrows():
print("RowIndex :", row)
print('Containing :')
print(rowseries)
RowIndex : Qtr1
Containing :
2015 34500
2016 44500
Name: Qtr1, dtype: int64
RowIndex : Qtr2
Containing :
2015 45000
2016 65000
Name: Qtr2, dtype: int64
Iterating over columns in the DataFrame:
The iteritems()
method is used to iterate over each column in the DataFrame. In this method, each vertical subset is in the form of (column-index, series), where the series contains all row values for that column-index.
For example :
import pandas as pd
total_sales = {2015 : {'Qtr1' : 34500, 'Qtr2' : 45000},
2016 : {'Qtr1' : 44500, 'Qtr2' : 65000}}
df = pd.DataFrame(total_sales)
for (col, colseries) in df.iteritems():
print("Column Index :", col)
print('Containing :')
print(colseries)
Column Index : 2015
Containing :
Qtr1 34500
Qtr2 45000
Name: 2015, dtype: int64
Column Index : 2016
Containing :
Qtr1 44500
Qtr2 65000
Name: 2016, dtype: int64
Write commands to print the following details of a series object seal.
(a) If the series is empty
(b) Indexes of the series
(c) The data type of underlying data
(d) If the series stores any NaN values
Answer
(a)
seal.empty
(b)
seal.index
(c)
seal.dtype
(d)
seal.hasnans
How do you fill all missing values with previous non-missing values?
Answer
To fill all missing values in a pandas DataFrame with the previous non-missing values, we can use the fillna
method with the ffill
parameter. The syntax is df.fillna(method = 'ffill', inplace = True)
Consider the following tables Item and Customer and answer the questions that follow:
Table: Item
Item_ID | ItemName | Manufacturer | Price |
---|---|---|---|
PC01 | Personal Computer | HCL India | 42000 |
LCO5 | Laptop | HP USA | 55000 |
PCO3 | Personal Computer | Dell USA | 32000 |
PC06 | Personal Computer | Zenith USA | 37000 |
LCO3 | Laptop | Dell USA | 57000 |
Table: Customer
Item_ID | CustomerName | City |
---|---|---|
LCO3 | N Roy | Delhi |
PCO3 | H Singh | Mumbai |
PC06 | R Pandey | Delhi |
LCO3 | C Sharma | Chennai |
PC01 | K Agarwal | Bengaluru |
Assume that the Pandas has been imported as pd.
(a) Create a dataframe called dfI for table Item.
(b) Create a dataframe called dfC for table Customer.
(c) Perform the default join operation on item_ID using two dataframes: dfI and dfC.
(d) Perform the left join operation on item_ID using two dataframes: dfI and dfC.
(e) Perform the right join operation on Item_ID using two dataframes: dfI and dfC.
(f) Perform the default operation on Item_ID using two dataframes: dfI and dfC with the left index as true.
(g) Perform the outer join operation on item_ID using two dataframes: dfI and dfC.
(h) Create a new dataframe dfN using dataframes: dfI and dfC. The new dataframe data will hold both left index and right index true values.
(i) Arrange the dataframe dfN in descending order of Price.
(j) Arrange the dataframe dfN in descending order of City and Price.
Answer
(a)
dfI = pd.DataFrame({
'Item_ID': ['PC01', 'LCO5', 'PCO3', 'PC06', 'LCO3'],
'ItemName': ['Personal Computer', 'Laptop', 'Personal Computer', 'Personal Computer', 'Laptop'],
'Manufacturer': ['HCL India', 'HP USA', 'Dell USA', 'Zenith USA', 'Dell USA'],
'Price': [42000, 55000, 32000, 37000, 57000]
})
Item_ID ItemName Manufacturer Price
0 PC01 Personal Computer HCL India 42000
1 LCO5 Laptop HP USA 55000
2 PCO3 Personal Computer Dell USA 32000
3 PC06 Personal Computer Zenith USA 37000
4 LCO3 Laptop Dell USA 57000
(b)
dfC = pd.DataFrame({
'Item_ID': ['LCO3', 'PCO3', 'PC06', 'LCO3', 'PC01'],
'CustomerName': ['N Roy', 'H Singh', 'R Pandey', 'C Sharma', 'K Agarwal'],
'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Bengaluru']
})
Item_ID CustomerName City
0 LCO3 N Roy Delhi
1 PCO3 H Singh Mumbai
2 PC06 R Pandey Delhi
3 LCO3 C Sharma Chennai
4 PC01 K Agarwal Bengaluru
(c)
df_default_join = pd.merge(dfI, dfC, on = 'Item_ID')
Item_ID ItemName Manufacturer Price CustomerName City
0 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
1 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
2 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
3 LCO3 Laptop Dell USA 57000 N Roy Delhi
4 LCO3 Laptop Dell USA 57000 C Sharma Chennai
(d)
df_left_join = pd.merge(dfI, dfC, left_on='Item_ID', right_on='Item_ID', how='left')
Item_ID ItemName Manufacturer Price CustomerName City
0 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
1 LCO5 Laptop HP USA 55000 NaN NaN
2 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
3 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
4 LCO3 Laptop Dell USA 57000 N Roy Delhi
5 LCO3 Laptop Dell USA 57000 C Sharma Chennai
(e)
df_right_join = pd.merge(dfI, dfC, left_on='Item_ID', right_on='Item_ID', how='right')
Item_ID ItemName Manufacturer Price CustomerName City
0 LCO3 Laptop Dell USA 57000 N Roy Delhi
1 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
2 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
3 LCO3 Laptop Dell USA 57000 C Sharma Chennai
4 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
(f) This operation is not possible because the left_index
parameter is used for merging on the index of the dataframes, not on a specific column. The default merge operation is an inner join, and it requires a common column to merge on.
(g)
df_outer_join = pd.merge(dfI, dfC, on='Item_ID', how='outer')
Item_ID ItemName Manufacturer Price CustomerName City
0 LCO3 Laptop Dell USA 57000 N Roy Delhi
1 LCO3 Laptop Dell USA 57000 C Sharma Chennai
2 LCO5 Laptop HP USA 55000 NaN NaN
3 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
4 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
5 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
(h)
dfN = pd.merge(dfI, dfC, left_index=True, right_index=True)
Item_ID_x ItemName Manufacturer Price Item_ID_y CustomerName City
0 PC01 Personal Computer HCL India 42000 LCO3 N Roy Delhi
1 LCO5 Laptop HP USA 55000 PCO3 H Singh Mumbai
2 PCO3 Personal Computer Dell USA 32000 PC06 R Pandey Delhi
3 PC06 Personal Computer Zenith USA 37000 LCO3 C Sharma Chennai
4 LCO3 Laptop Dell USA 57000 PC01 K Agarwal Bengaluru
(i)
dfN_sorted_price = dfN.sort_values(by='Price', ascending=False)
Item_ID_x ItemName Manufacturer Price Item_ID_y CustomerName City
4 LCO3 Laptop Dell USA 57000 PC01 K Agarwal Bengaluru
1 LCO5 Laptop HP USA 55000 PCO3 H Singh Mumbai
0 PC01 Personal Computer HCL India 42000 LCO3 N Roy Delhi
3 PC06 Personal Computer Zenith USA 37000 LCO3 C Sharma Chennai
2 PCO3 Personal Computer Dell USA 32000 PC06 R Pandey Delhi
(j)
dfN_sorted_city_price = dfN.sort_values(by=['City', 'Price'], ascending=[False, False])
Item_ID_x ItemName Manufacturer Price Item_ID_y CustomerName City
1 LCO5 Laptop HP USA 55000 PCO3 H Singh Mumbai
0 PC01 Personal Computer HCL India 42000 LCO3 N Roy Delhi
2 PCO3 Personal Computer Dell USA 32000 PC06 R Pandey Delhi
3 PC06 Personal Computer Zenith USA 37000 LCO3 C Sharma Chennai
4 LCO3 Laptop Dell USA 57000 PC01 K Agarwal Bengaluru
Consider the following series object namely S:
0 0.430271
1 0.617328
2 0.265421
3 0.836113
dtype: float64
What will be returned by the following statements?
(a) S * 100
(b) S > 0
(c) S1 = pd.Series(S)
(d) S3 = pd.Series(S1) + 3
Answer
(a) S * 100
0 43.0271
1 61.7328
2 26.5421
3 83.6113
dtype: float64
(b) S > 0
0 True
1 True
2 True
3 True
dtype: bool
(c) S1 = pd.Series(S)
0 0.430271
1 0.617328
2 0.265421
3 0.836113
dtype: float64
(d) S3 = pd.Series(S1) + 3
0 3.430271
1 3.617328
2 3.265421
3 3.836113
dtype: float64
What will be the output produced by the following code?
Stationery = ['pencils', 'notebooks', 'scales', 'erasers']
S = pd.Series([20, 33, 52, 10], index = Stationery)
S2 = pd.Series([17, 13, 31, 32], index = Stationery)
print(S == S2)
S = S + S2
print(S)
Answer
pencils False
notebooks False
scales False
erasers False
dtype: bool
pencils 37
notebooks 46
scales 83
erasers 42
dtype: int64
The code creates two pandas Series, S
and S2
, with the index Stationery which is a list. The code then compares S
and S2
element-wise using the "==" operator, which returns a boolean Series indicating whether each pair of values is equal. Finally, the code adds S
and S2
element-wise using the "+" operator, which returns a new Series with the summed values, and assigns the result back to S
. The resulting S
Series will have the same index as before i.e., Stationery.
What will be the output produced by the following codes, considering the Series object S given in Q.13?
(a) print(S[1:4])
(b) print(S[:1])
(c) print(S[0:2])
(d) S[0:2] = 12
print(S)
(e) print(S.index)
(f) print(S.values)
The Series object 'S' is as follows:
pencils 20
notebooks 33
scales 52
erasers 10
dtype: int64
Answer
(a) print(S[1:4])
notebooks 33
scales 52
erasers 10
dtype: int64
The slice S[1:4]
starts at index 1 and ends at index 3, hence, it includes three elements i.e., elements from index 1 and 3.
(b) print(S[:1])
pencils 20
dtype: int64
The slice S[:1]
starts at index 0 and ends at index 1, but because the end index is exclusive, it includes only one element i.e., the element at index 0.
(c) print(S[0:2])
pencils 20
notebooks 33
dtype: int64
The slice S[0:2]
starts at index 0 and ends at index 1, hence, it includes two elements i.e., elements from index 0 and 1.
(d) S[0:2] = 12
print(S)
pencils 12
notebooks 12
scales 52
erasers 10
dtype: int64
The slice S[0:2] = 12
assigns the value 12 to indices 0 and 1 in Series S
, directly modifying those elements. The updated Series is then printed.
(e) print(S.index)
Index(['pencils', 'notebooks', 'scales', 'erasers'], dtype='object')
The code print(S.index)
displays the indices of Series S
.
(f) print(S.values)
[12 12 52 10]
The code print(S.values)
displays the values of Series S
.
Write a program to iterate and print a dataframe column-wise and print only first three columns.
import pandas as pd
data = {
'Name': ['Aliya', 'Hemanth', 'Charlie'],
'Age': [25, 30, 35],
'City': ['Bangalore', 'Chennai', 'Mumbai'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
first_three_columns = df.iloc[:, :3]
print("Each column:")
for column_name in first_three_columns:
column_data = first_three_columns[column_name]
print(column_name)
print(column_data)
Each column:
Name
0 Aliya
1 Hemanth
2 Charlie
Name: Name, dtype: object
Age
0 25
1 30
2 35
Name: Age, dtype: int64
City
0 Bangalore
1 Chennai
2 Mumbai
Name: City, dtype: object
Write a program to iterate and print a dataframe row-wise at a time and print only first five rows.
import pandas as pd
data = {
'Name': ['Amruta', 'Harsh', 'Yogesh', 'Shreya', 'Zoya', 'Nyra'],
'Age': [25, 30, 35, 40, 45, 28],
'City': ['Chandigarh', 'Jaipur', 'Dehradun', 'Delhi', 'Vadodara', 'Guwahati']
}
df = pd.DataFrame(data)
first_five_rows = df.head(5)
print("Each row:")
for index, row in first_five_rows.iterrows():
print("Index:", index)
print(row)
Each row:
Index: 0
Name Amruta
Age 25
City Chandigarh
Name: 0, dtype: object
Index: 1
Name Harsh
Age 30
City Jaipur
Name: 1, dtype: object
Index: 2
Name Yogesh
Age 35
City Dehradun
Name: 2, dtype: object
Index: 3
Name Shreya
Age 40
City Delhi
Name: 3, dtype: object
Index: 4
Name Zoya
Age 45
City Vadodara
Name: 4, dtype: object
Find the error in the following code fragments:
S2 = pd.Series([101, 102, 1-2, 104])
print (S2.index)
S2.index = [0.1.2.3, 4, 5]
S2[5] = 220
print (S2)
Answer
S2 = pd.Series([101, 102, 1-2, 104])
print(S2.index)
S2.index = [0.1.2.3, 4, 5] #Error 1
S2[5] = 220
print(S2)
Error 1 — In the line S2.index = [0.1.2.3, 4, 5]
, the index values are not separated by commas. It should be S2.index = [0, 1, 2, 3, 4, 5]
. Additionally, the Series S2
initially has four elements, so assigning a new index list of six elements ([0, 1, 2, 3, 4, 5]) to S2.index
will raise a ValueError because the new index list length does not match the length of the Series.
The corrected code is:
S2 = pd.Series([101, 102, 1-2, 104])
print(S2.index)
S2.index = [0, 1, 2, 3]
S2[5] = 220
print(S2)
Find the error in the following code fragments:
S = pd.Series(2, 3, 4, 55, index = range (4))
Answer
In the above code fragment, the data values should be enclosed in square brackets [] to form a list.
The corrected code is:
S = pd.Series([2, 3, 4, 55], index = range(4))
Find the error in the following code fragments:
S1 = pd.Series(1, 2, 3, 4, index = range(7))
Answer
In the above code fragment, the data values should be enclosed in square brackets to form a list and the specified index range range(7) is out of range for the provided data [1, 2, 3, 4]. Since there are only four data values, the index should have a length that matches the number of data values.
The corrected code is:
S1 = pd.Series([1, 2, 3, 4], index = range(4))
Find the error in the following code fragments:
S2 = pd.Series([1, 2, 3, 4, 5], index = range(4))
Answer
The error in the code fragment is that the length of the data list [1, 2, 3, 4, 5] does not match the length of the index range(4). Since there are only five data values, the index should have a length that matches the number of data values.
The corrected code is:
S1 = pd.Series([1, 2, 3, 4, 5], index = range(5))
Find the error:
data = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
s = pd.Series (data, index=[100, 101, 102, 103, 104, 105])
print(s[102,103,104])
Answer
The error in the above code is in the line print(s[102, 103, 104])
. When accessing elements in a pandas Series using square brackets, we should use a list of index values, not multiple separate index values separated by commas.
The corrected code is:
data = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
s = pd.Series(data, index = [100, 101, 102, 103, 104, 105])
print(s[[102, 103, 104]])
Why does the following code cause error?
s1 = pd.Series(range 1, 15, 5), index = list('ababa')
print(s1['ab'])
Answer
The statement s1['ab']
causes an error because 'ab' is not a single key in the index. The index has individual keys 'a' and 'b', but not 'ab'.
Consider the following Class12.csv file containing the data as given below:
RollNo | Name | Accounts | Maths | BSt | IP | Eco |
---|---|---|---|---|---|---|
10 | Ritu Jain | 88 | 67 | 87 | 97 | 56 |
11 | Mridul Mehta | 67 | 78 | 77 | 87 | 90 |
12 | Divij | 87 | 89 | 78 | 82 | 92 |
13 | Yashvi Verma | 67 | 82.3 | 76.5 | 98.2 | 78.6 |
14 | Deepak Virmani | 56.7 | 76.5 | 88 | 78 | 67 |
15 | Jatin Malik | 76 | 66 | 77 | 87.5 | 67.5 |
(a) Read the csv file into a dataframe df which is stored with tab ('\t') separator.
(b) Write the code to find the total marks (Total_ marks) for each student and add it to the newly-created dataframe.
(c) Also calculate the percentage obtained by each student under a new column “Average” in the dataframe.
Answer
(a)
df = pd.read_csv('Class12.csv', sep='\t')
RollNo Name Accounts Maths BSt IP Eco
0 10 Ritu Jain 88.0 67.0 87.0 97.0 56.0
1 11 Mridul Mehta 67.0 78.0 77.0 87.0 90.0
2 12 Divij 87.0 89.0 78.0 82.0 92.0
3 13 Yashvi Verma 67.0 82.3 76.5 98.2 78.6
4 14 Deepak Virmani 56.7 76.5 88.0 78.0 67.0
5 15 Jatin Malik 76.0 66.0 77.0 87.5 67.5
(b)
df['Total_marks'] = df[['Accounts', 'Maths', 'BSt', 'IP', 'Eco']].sum(axis=1)
RollNo Name Accounts Maths BSt IP Eco Total_marks
0 10 Ritu Jain 88.0 67.0 87.0 97.0 56.0 395.0
1 11 Mridul Mehta 67.0 78.0 77.0 87.0 90.0 399.0
2 12 Divij 87.0 89.0 78.0 82.0 92.0 428.0
3 13 Yashvi Verma 67.0 82.3 76.5 98.2 78.6 402.6
4 14 Deepak Virmani 56.7 76.5 88.0 78.0 67.0 366.2
5 15 Jatin Malik 76.0 66.0 77.0 87.5 67.5 374.0
(c)
df['Average'] = (df['Total_marks'] / 500) * 100
RollNo Name Accounts Maths BSt IP Eco Total_marks Average
0 10 Ritu Jain 88.0 67.0 87.0 97.0 56.0 395.0 79.00
1 11 Mridul Mehta 67.0 78.0 77.0 87.0 90.0 399.0 79.80
2 12 Divij 87.0 89.0 78.0 82.0 92.0 428.0 85.60
3 13 Yashvi Verma 67.0 82.3 76.5 98.2 78.6 402.6 80.52
4 14 Deepak Virmani 56.7 76.5 88.0 78.0 67.0 366.2 73.24
5 15 Jatin Malik 76.0 66.0 77.0 87.5 67.5 374.0 74.80
Write a program that reads students marks from a ‘Result.csv’ file and displays percentage of each student.
import pandas as pd
df = pd.read_csv('Result.csv')
df['Total_marks'] = df[['Maths', 'Science', 'English',]].sum(axis=1)
df['Percentage'] = (df['Total_marks'] / 300) * 100
print("Percentage of each student: ")
print(df[['Name', 'Percentage']])
Percentage of each student:
Name Percentage
0 Rahul 85.000000
1 Rohan 81.666667
2 Riya 88.333333
3 Raj 86.666667
Write the name of function to store data from a dataframe into a CSV file.
Answer
to_csv()
function is used to store data from a DataFrame into a CSV file.
How can we import specific columns from a CSV file?
Answer
To import specific columns from a CSV file, we can use the usecols
parameter of the read_csv
function from the pandas library. The usecols
parameter is used to specify the list of columns to be read from the CSV file. The syntax is pd.read_csv('filename.csv', usecols = ['column1', 'column2',...])
.
For example, the following command will access Name and Age columns of Employee file.
df = pd.read_csv("Employee.csv", usecols = ['Name', 'Age'])
print(df)
What are the advantages of CSV file formats?
Answer
The advantages of CSV file formats are as follows:
It is a simple, compact and ubiquitous format for data storage.
It is a common format for data interchange.
It can be opened in popular spreadsheet packages like MS-Excel, Calc etc.
Nearly all spreadsheets and databases support import/export to csv format.
What all libraries do you require in order to bring data from a CSV file into a dataframe?
Answer
Python's Pandas library is required to bring data from a CSV file into a DataFrame.
You want to read data from a CSV file in a dataframe but you want to provide your own column names to the dataframe. What additional argument would you specify in read_csv()?
Answer
To read data from a CSV file into a DataFrame while providing our own column names, we can use the names
argument in the read_csv()
function. The syntax is : <DF> = pandas.read_csv(<filepath>, names = <sequence containing column names>)
.
By default, read_csv() uses the value of first row as column headers in dataframes. Which argument will you give to ensure that the top/first row’s data is used as data and not as column headers?
Answer
To ensure that the top/first row's data is used as data and not as column headers in a DataFrame when using the read_csv()
function, we need to use the header
argument and set it to None. The syntax is : <DF> = pandas.read_csv(<filepath>, header = None)
.
Which argument would you give to read.csv() if you only want to read top 10 rows of data?
Answer
The nrows
argument can be used to read only the top 10 rows of data from a CSV file using the read_csv()
function in pandas. The nrows
argument specifies the number of rows of the file to read. The syntax is : df = pandas.read_csv(<filepath>, nrows = 10)
.
Create the following dataframe by the name Project regarding a competition and answer the questions given below:
Enrolment No. | Name | Class | Section | Project Name |
---|---|---|---|---|
101 | Rekha | XII | B | Data Analysis |
102 | Divya | XII | C | Graphical Analysis |
103 | Geet | XII | H | Machine Learning |
104 | Jeet | XII | B | App Development |
(a) Insert two records with different methods.
(b) Insert a column to store grades given to their projects.
(c) Write a command to display the name and section for all.
(d) Write a command to display the records with index value 101 and 102.
(e) Insert a column after name to store the school name.
(f) Display the second and third record.
(g) Replace the name and section of Jeet to 'XI','A'.
(h) Remove the column Project Name and Section.
Answer
The DataFrame project is created as follows:
import pandas as pd
data = {'Name': ['Rekha', 'Divya', 'Geet', 'Jeet'],
'Class': ['XII', 'XII', 'XII', 'XII'],
'Section': ['B', 'C', 'H', 'B'],
'Project Name': ['Data Analysis', 'Graphical Analysis', 'Machine Learning', 'App Development']
}
Project = pd.DataFrame(data, index = [101, 102, 103, 104])
print(Project)
Name Class Section Project Name
101 Rekha XII B Data Analysis
102 Divya XII C Graphical Analysis
103 Geet XII H Machine Learning
104 Jeet XII B App Development
(a)
Project.loc[105] = [105, 'Arya', 'XI', 'D', 'Web Development']
Project.loc[105] = ['Arya', 'XI', 'D', 'Web Development']
Project.at[106, 'Name'] = 'Vikram'
Project.at[106, 'Class'] = 'XI'
Project.at[106, 'Section'] = 'A'
Project.at[106, 'Project Name'] = 'AI Research'
Name Class Section Project Name
101 Rekha XII B Data Analysis
102 Divya XII C Graphical Analysis
103 Geet XII H Machine Learning
104 Jeet XII B App Development
105 Arya XI D Web Development
106 Vikram XI A AI Research
(b)
Project['Grade'] = ['A', 'B+', 'C+', 'B', 'A+', 'C']
Name Class Section Project Name Grade
101 Rekha XII B Data Analysis A
102 Divya XII C Graphical Analysis B+
103 Geet XII H Machine Learning C+
104 Jeet XII B App Development B
105 Arya XI D Web Development A+
106 Vikram XI A AI Research C
(c)
print(Project[['Name', 'Section']])
Name Section
101 Rekha B
102 Divya C
103 Geet H
104 Jeet B
105 Arya D
106 Vikram A
(d)
print(Project.loc[[101, 102]])
Name Class Section Project Name Grade
101 Rekha XII B Data Analysis A
102 Divya XII C Graphical Analysis B+
(e)
Project.insert(1, 'School', ['ABC', 'PQR', 'ABC', 'PQR', 'XYZ', 'XYZ'])
Name School Class Section Project Name Grade
101 Rekha ABC XII B Data Analysis A
102 Divya PQR XII C Graphical Analysis B+
103 Geet ABC XII H Machine Learning c+
104 Jeet PQR XII B App Development B
105 Arya XYZ XI D Web Development A+
106 Vikram XYZ XI A AI Research C
(f)
print(Project.iloc[1:3])
Name School Class Section Project Name Grade
102 Divya PQR XII C Graphical Analysis B+
103 Geet ABC XII H Machine Learning c+
(g)
Project.Class[104] = 'XI'
Project.Section[104] = 'A'
Name School Class Section Project Name Grade
101 Rekha ABC XII B Data Analysis A
102 Divya PQR XII C Graphical Analysis B+
103 Geet ABC XII H Machine Learning c+
104 Jeet PQR XI A App Development B
105 Arya XYZ XI D Web Development A+
106 Vikram XYZ XI A AI Research C
(h)
Project = Project.drop(['Project Name', 'Section'], axis = 1)
Name School Class Grade
101 Rekha ABC XII A
102 Divya PQR XII B+
103 Geet ABC XII c+
104 Jeet PQR XI B
105 Arya XYZ XI A+
106 Vikram XYZ XI C
Consider the following dataframe: CORONA and answer the questions given below:
ID | State | Cases |
---|---|---|
100 | Delhi | 3000 |
110 | Mumbai | 4000 |
120 | Chennai | 5000 |
130 | Surat | 4500 |
Create the above-given dictionary with the given indexes.
(a) Write code to add a new column “Recovery” using the series method to store the number of patients recovered in every state.
(b) To add a new column “Deaths” using the assign() method to store the number of deaths in every state.
(c) To add a new row to store details of another state using loc (assume values).
(d) To add a new column "Percentage" using the insert() method to store the percentage of recovery in every state (assume values). The column should be added as the fourth column in the dataframe.
(e) To delete the column “Percentage” using del command.
(f) To delete the column “Deaths” using pop() method.
(g) To insert a new row of values using iloc[] at the 1st position.
(h) To delete Cases and State temporarily from the dataframe.
Answer
The DataFrame CORONA is created as :
import pandas as pd
data = {'State': ['Delhi', 'Mumbai', 'Chennai', 'Surat'],
'Cases': [3000, 4000, 5000, 4500]}
CORONA = pd.DataFrame(data, index=[100, 110, 120, 130])
print(CORONA)
State Cases
100 Delhi 3000
110 Mumbai 4000
120 Chennai 5000
130 Surat 4500
(a)
CORONA['Recovery'] = pd.Series([2500, 3000, 3500, 3200], index=[100, 110, 120, 130])
State Cases Recovery
100 Delhi 3000 2500
110 Mumbai 4000 3000
120 Chennai 5000 3500
130 Surat 4500 3200
(b)
CORONA = CORONA.assign(Deaths=[200, 250, 300, 220])
State Cases Recovery Deaths
100 Delhi 3000 2500 200
110 Mumbai 4000 3000 250
120 Chennai 5000 3500 300
130 Surat 4500 3200 220
(c)
CORONA.loc[140] = ['Karnataka', 4200, 2800, 180]
State Cases Recovery Deaths
100 Delhi 3000 2500 200
110 Mumbai 4000 3000 250
120 Chennai 5000 3500 300
130 Surat 4500 3200 220
140 Karnataka 4200 2800 180
(d)
CORONA.insert(3, 'Percentage', [80, 75, 70, 71, 67])
State Cases Recovery Percentage Deaths
100 Delhi 3000 2500 80 200
110 Mumbai 4000 3000 75 250
120 Chennai 5000 3500 70 300
130 Surat 4500 3200 71 220
140 Karnataka 4200 2800 67 180
(e)
del CORONA['Percentage']
State Cases Recovery Deaths
100 Delhi 3000 2500 200
110 Mumbai 4000 3000 250
120 Chennai 5000 3500 300
130 Surat 4500 3200 220
140 Karnataka 4200 2800 180
(f)
CORONA.pop('Deaths')
State Cases Recovery
100 Delhi 3000 2500
110 Mumbai 4000 3000
120 Chennai 5000 3500
130 Surat 4500 3200
140 Karnataka 4200 2800
(g) The iloc
method is not used to add rows to a DataFrame. It is used for index-based or integer-location-based accessing of rows and columns, not for adding rows.
One way of adding a row at a specific position in a DataFrame is by creating a new DataFrame including the row and then concatenating the two DataFrames as shown below :
new_row = {'State': 'Hyderabad', 'Cases': 5200, 'Recovery': 3800}
new_df = pd.DataFrame([new_row], index= [150])
CORONA = pd.concat([new_df, CORONA])
State Cases Recovery
150 Hyderabad 5200 3800
100 Delhi 3000 2500
110 Mumbai 4000 3000
120 Chennai 5000 3500
130 Surat 4500 3200
140 Karnataka 4200 2800
(h)
CORONA.drop(['Cases', 'State'], axis=1, inplace=True)
Recovery
100 3800
110 2500
120 3000
130 3500
140 3200
5 2800
Create a dataframe ‘Student’ from two series—Name and Grade, Name and Marks of five students.
(a) Display the first three records from student dataframe.
(b) Display the last two records from student dataframe.
Answer
The Student DataFrame is created as :
import pandas as pd
Name = ['John', 'Anna', 'Peter', 'Linda', 'Bob']
Grade = ['A', 'B', 'A+', 'C', 'B+']
Marks = [90, 80, 95, 70, 85]
S1 = pd.Series(Grade, index = Name)
S2 = pd.Series(Marks, index = Name)
Student = pd.DataFrame({'Grade' : S1, 'Marks' : S2 })
print(Student)
Grade Marks
John A 90
Anna B 80
Peter A+ 95
Linda C 70
Bob B+ 85
(a)
print(Student.head(3))
Grade Marks
John A 90
Anna B 80
Peter A+ 95
(b)
print(Student.tail(2))
Grade Marks
Linda C 70
Bob B+ 85
Create a dataframe of dictionary consisting of Name, Sub1, Sub2, Sub3, Sub4, Sub5 of five students.
(a) Display the dataframe.
(b) Display the first 5 rows and bottom 3 rows of student dataframe.
Answer
(a)
import pandas as pd
data = {
'Name': ['Joseph', 'Ananya', 'Praneet', 'Lakshmi', 'Bhagya'],
'Sub1': [90, 80, 95, 70, 85],
'Sub2': [85, 90, 88, 92, 89],
'Sub3': [88, 85, 90, 95, 92],
'Sub4': [92, 88, 85, 90, 95],
'Sub5': [95, 92, 92, 88, 90]
}
Student = pd.DataFrame(data)
print(Student)
Name Sub1 Sub2 Sub3 Sub4 Sub5
0 Joseph 90 85 88 92 95
1 Ananya 80 90 85 88 92
2 Praneet 95 88 90 85 92
3 Lakshmi 70 92 95 90 88
4 Bhagya 85 89 92 95 90
(b)
print("First 5 rows:")
print(Student.head(5))
print("\nBottom 3 rows:")
print(Student.tail(3))
First 5 rows:
Name Sub1 Sub2 Sub3 Sub4 Sub5
0 Joseph 90 85 88 92 95
1 Ananya 80 90 85 88 92
2 Praneet 95 88 90 85 92
3 Lakshmi 70 92 95 90 88
4 Bhagya 85 89 92 95 90
Bottom 3 rows:
Name Sub1 Sub2 Sub3 Sub4 Sub5
2 Praneet 95 88 90 85 92
3 Lakshmi 70 92 95 90 88
4 Bhagya 85 89 92 95 90
Create two dataframes of salary of five employees and do the following:
(a) Display both the dataframes.
(b) Add 5000 as bonus in both dataframes and display them.
Answer
(a)
import pandas as pd
data1 = {'Employee': ['Jatin', 'Avinash', 'Kavya', 'Apoorva', 'Nitin'],
'Salary': [50000, 60000, 70000, 80000, 90000]}
df1 = pd.DataFrame(data1)
data2 = {'Employee': ['Saanvi', 'Aditi', 'Shashank', 'Swapnil', 'Shravani'],
'Salary': [55000, 45000, 30000, 85000, 66000]}
df2 = pd.DataFrame(data2)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
DataFrame 1:
Employee Salary
0 Jatin 50000
1 Avinash 60000
2 Kavya 70000
3 Apoorva 80000
4 Nitin 90000
DataFrame 2:
Employee Salary
0 Saanvi 55000
1 Aditi 45000
2 Shashank 30000
3 Swapnil 85000
4 Shravani 66000
(b)
df1['Bonus'] = 5000
df2['Bonus'] = 5000
print("DataFrame 1 with bonus:")
print(df1)
print("\nDataFrame 2 with bonus:")
print(df2)
DataFrame 1 with bonus:
Employee Salary Bonus
0 Jatin 50000 5000
1 Avinash 60000 5000
2 Kavya 70000 5000
3 Apoorva 80000 5000
4 Nitin 90000 5000
DataFrame 2 with bonus:
Employee Salary Bonus
0 Saanvi 55000 5000
1 Aditi 45000 5000
2 Shashank 30000 5000
3 Swapnil 85000 5000
4 Shravani 66000 5000
Create a dataframe using list [10, 11, 12, 13, 14] [23, 34, 45, 32, 65] [55, 60, 65, 70, 75] and do the following:
(a) Display the dataframe.
(b) Add the list [1, 2, 3, 4, 5] to dataframe and display it.
Answer
(a)
import pandas as pd
data = [[10, 11, 12, 13, 14],
[23, 34, 45, 32, 65],
[55, 60, 65, 70, 75]]
df = pd.DataFrame(data)
print(df)
0 1 2 3 4
0 10 11 12 13 14
1 23 34 45 32 65
2 55 60 65 70 75
(b)
df.loc[3] = [12, 62, 53, 34, 75]
print(df)
0 1 2 3 4
0 10 11 12 13 14
1 23 34 45 32 65
2 55 60 65 70 75
3 12 62 53 34 75
Create a dataframe of [23, 25], [34], [43, 44, 45, 46] and do the following: :
(a) Display the dataframe. Notice that the missing value is represented by NaN.
(b) Replace the missing value with 0.
(c) Replace the missing value with -1, -2, -3, -4 for columns 0, 1, 2, 3.
(d) Replace the missing value by copying the value from the above cell.
Answer
(a)
import pandas as pd
data = [[23, 25],
[34],
[43, 44, 45, 46]]
df = pd.DataFrame(data)
print(df)
0 1 2 3
0 23 25.0 NaN NaN
1 34 NaN NaN NaN
2 43 44.0 45.0 46.0
(b)
df = df.fillna(0)
0 1 2 3
0 23 25.0 0.0 0.0
1 34 0.0 0.0 0.0
2 43 44.0 45.0 46.0
(c)
df = df.fillna({0:-1, 1:-2, 2:-3, 3:-4})
0 1 2 3
0 23 25.0 -3.0 -4.0
1 34 -2.0 -3.0 -4.0
2 43 44.0 45.0 46.0
(d)
df= df.fillna(method='ffill')
0 1 2 3
0 23 25.0 NaN NaN
1 34 25.0 NaN NaN
2 43 44.0 45.0 46.0
Create a dataframe of D1 and D2;
D1 = {‘Rollno’ : [1001, 1004, 1005, 1008, 1009], ‘Name’: [‘Sarika’, ‘Abhay’, ‘Mohit’, ‘Ruby’, ‘Govind’ ]}
D2 = {‘Rollno’ : [1002, 1003, 1004, 1005, 1006], ‘Name’:[‘Seema’,‘Jia’,‘Shweta’, ‘Sonia’, ‘Nishant’]}
(a) Concatenate row-wise.
(b) Concatenate column-wise.
Answer
The DataFrame D1 and D2 are created as :
import pandas as pd
Data1 = {'Rollno': [1001, 1004, 1005, 1008, 1009], 'Name': ['Sarika', 'Abhay', 'Mahit', 'Ruby', 'Govind']}
Data2 = {'Rollno': [1002, 1003, 1004, 1005, 1006], 'Name': ['Seema', 'Jia', 'Shweta', 'Sonia', 'Nishant']}
D1 = pd.DataFrame(Data1)
D2 = pd.DataFrame(Data2)
print(D1)
print(D2)
Rollno Name
0 1001 Sarika
1 1004 Abhay
2 1005 Mahit
3 1008 Ruby
4 1009 Govind
Rollno Name
0 1002 Seema
1 1003 Jia
2 1004 Shweta
3 1005 Sonia
4 1006 Nishant
(a)
df = pd.concat([D1, D2])
Rollno Name
0 1001 Sarika
1 1004 Abhay
2 1005 Mahit
3 1008 Ruby
4 1009 Govind
0 1002 Seema
1 1003 Jia
2 1004 Shweta
3 1005 Sonia
4 1006 Nishant
(b)
df = pd.concat([D1, D2], axis=1)
Rollno Name Rollno Name
0 1001 Sarika 1002 Seema
1 1004 Abhay 1003 Jia
2 1005 Mahit 1004 Shweta
3 1008 Ruby 1005 Sonia
4 1009 Govind 1006 Nishant
Create a dataframe of {‘A’ : [ ]} and display whether it is empty or not.
import pandas as pd
df = pd.DataFrame({'A': []})
print(df)
if df.empty:
print("The dataframe is empty.")
else:
print("The dataframe is not empty.")
Empty DataFrame
Columns: [A]
Index: []
The dataframe is empty.
Create a dataframe of {‘A’ : [5, 6], ‘B’: [3, 0], 'C': [0, 0]} and display the result of all() and any() functions.
import pandas as pd
df = pd.DataFrame({'A': [5, 6], 'B': [3, 0], 'C': [0, 0]})
print(df)
print("\nResult of all() function:")
print(df.all())
print("\nResult of any() function:")
print(df.any())
A B C
0 5 3 0
1 6 0 0
Result of all() function:
A True
B False
C False
dtype: bool
Result of any() function:
A True
B True
C False
dtype: bool
Create a dataframe of {‘A’ : [True, True], ‘B’: [True, False], ‘C’: [False, False]} and display the result of all() and any().
import pandas as pd
df = pd.DataFrame({'A': [True, True], 'B': [True, False], 'C': [False, False]})
print(df)
print("\nResult of all() function:")
print(df.all())
print("\nResult of any() function:")
print(df.any())
A B C
0 True True False
1 True False False
Result of all() function:
A True
B False
C False
dtype: bool
Result of any() function:
A True
B True
C False
dtype: bool
What is the use of statement: inplace=True?
Answer
When inplace=True
is set, the DataFrame is modified directly, and the changes are applied to the existing DataFrame. This means that the original DataFrame is altered, and no new DataFrame is created.
Differentiate between del, pop() and drop() functions.
Answer
del function | pop() function | drop() function |
---|---|---|
The del statement deletes a column from a DataFrame. | The pop() function is used to delete a column or an item from a DataFrame or Series. | The drop() function is used to drop rows or columns from a DataFrame. |
It does not return a value. | It returns the removed item. | It returns a new DataFrame with the dropped labels. |
It modifies the original DataFrame or Series. | It modifies the original DataFrame or Series. | It does not modifies the original DataFrame by default (unless inplace = True is specified). |
The syntax is del df['column_name'] . | The syntax is df['column_name'].pop() . | The syntax is df.drop(index or sequence of indexes) . |