Pandas Basic (Part I: Indexing, Selecting & Assigning)

Indexing, Selecting, Assigning

Choosing between loc and iloc

  • iloc uses Python's standard slicing scheme, where the first element of the range is included and the last one is excluded. So 0:10 selects entries 0,...,9.
  • loc, meanwhile, slices inclusively by label. So 0:10 selects entries 0,...,10.
  • This is particularly confusing when the DataFrame index is a simple sequence of integers, e.g. 0,...,1000.
  • In this case df.iloc[0:1000] returns 1000 entries, while df.loc[0:1000] returns 1001 of them!
  • To get 1000 elements using loc, you need to go one lower and ask for df.loc[0:999].
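
The off-by-one difference is easy to check on a small stand-in frame. The sketch below assumes a hypothetical DataFrame df with a default integer index running from 0 to 1000; the column name value is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a default integer index 0..1000 (1001 rows).
df = pd.DataFrame({"value": range(1001)})

by_position = df.iloc[0:1000]      # positions 0..999, end excluded
by_label = df.loc[0:1000]          # labels 0..1000, end included
by_label_trimmed = df.loc[0:999]   # labels 0..999, end included

print(len(by_position), len(by_label), len(by_label_trimmed))
```

With this index, the two 1000-row results contain exactly the same rows; they would differ if the index labels were not 0, 1, 2, ...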

Manipulating the index

  • The set_index() method replaces the current index with one of the columns. It returns a new DataFrame and leaves reviews unchanged, so assign the result if you want to keep it.
reviews.set_index("title")
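
A minimal sketch of what set_index() does, using a tiny made-up stand-in for reviews (the titles and points here are illustrative, not real data). It shows that the original frame keeps its integer index and that the new frame can be looked up by title.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [87, 92, 95],
})

# set_index returns a new DataFrame; reviews itself is not modified.
indexed = reviews.set_index("title")

print(reviews.index.tolist())            # still the default integer index
print(indexed.index.tolist())            # titles are now the row labels
print(indexed.loc["Wine B", "points"])   # label-based lookup by title
```

Note that title is no longer a regular column in indexed, because it has moved into the index.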

Assigning data with a constant value or with an iterable of values

reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)
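
To see both kinds of assignment in action, here is a small sketch on a made-up three-row stand-in for reviews. A constant is broadcast to every row, while an iterable must have exactly one value per row.

```python
import pandas as pd

# Hypothetical three-row stand-in for the reviews data.
reviews = pd.DataFrame({"points": [87, 92, 95]})

# A constant is repeated for every row.
reviews["critic"] = "everyone"

# An iterable supplies one value per row, in order.
reviews["index_backwards"] = range(len(reviews), 0, -1)

# An iterable of the wrong length is rejected.
try:
    reviews["too_short"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(reviews)
```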

Example

  • Select the description column from reviews and assign the result to the variable desc
desc = reviews.description
desc = reviews['description']
  • Select the first value from the description column of reviews, assigning it to variable first_description
first_description = reviews.description[0]
first_description = reviews.description.loc[0]
first_description = reviews.description.iloc[0]
  • Select the first row of data (the first record) from reviews, assigning it to the variable first_row
first_row = reviews.loc[0]
first_row = reviews.iloc[0]
  • Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions
first_descriptions = reviews.loc[:9, 'description']
first_descriptions = reviews.description.iloc[:10]
first_descriptions = desc.head(10)
  • Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100
indices = [0,1,10,100]
cols = ['country', 'province', 'region_1', 'region_2']
df = reviews.loc[indices, cols]
  • Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand
top_oceania_wines = reviews.loc[(reviews.points >= 95) & (reviews.country.isin(['Australia', 'New Zealand']))]
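
The last filter is easier to read when each condition gets its own name. The sketch below runs it on a made-up four-row stand-in for reviews; the rows are illustrative only. The parentheses in the one-line version matter because & binds more tightly than >= in Python.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({
    "country": ["Australia", "France", "New Zealand", "Australia"],
    "points": [96, 97, 95, 90],
})

# Each condition is a boolean Series aligned with the rows of reviews.
high_score = reviews.points >= 95
from_oceania = reviews.country.isin(["Australia", "New Zealand"])

# & combines the two masks element-wise; loc keeps rows where both are True.
top_oceania_wines = reviews.loc[high_score & from_oceania]

print(top_oceania_wines)
```

Only the first and third rows pass both conditions: the French wine fails the country check and the 90-point Australian wine fails the score check.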
