Creating an empty Pandas DataFrame, and then filling it


I'm starting from the pandas DataFrame documentation here: Introduction to data structures

I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and timestamp rows, all 0 or all NaN.

I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.

import pandas as pd
import datetime as dt
import scipy as s
base =
dates = [ base - dt.timedelta(days=x) for x in range(9, -1, -1) ]

valdict = {} symbols = [‘A’,‘B’, ‘C’] for symb in symbols: valdict[symb] = pd.Series( s.zeros(len(dates)), dates )

for thedate in dates: if thedate > dates[0]: for symb in valdict: valdict[symb][thedate] = 1 + valdict[symb][thedate - dt.timedelta(days=1)]


NEVER grow a DataFrame row-wise!

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Accumulate data in a list, not a DataFrame.

Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.

data = []
for row in some_function_that_yields_data():

df = pd.DataFrame(data)

pd.DataFrame converts the list of rows (where each row is a scalar value) into a DataFrame. If your function yields DataFrames instead, call pd.concat.

Pros of this approach:

  1. It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.

  2. Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).

  3. dtypes are automatically inferred (rather than assigning object to all of them).

  4. A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.

If you aren't convinced yet, this is also mentioned in the documentation:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

pandas >= 2.0 Update: append has been removed!

DataFrame.append was deprecated in version 1.4 and removed from the pandas API entirely in version 2.0.

See the docs on Deprecations as well as this github issue that originally proposed its deprecation.

These options are horrible

append or concat inside a loop

Here is the biggest mistake I've seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.Series({'A': i, 'B': b, 'C': c})], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes A object # yuck! B float64 C object dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

A      int64
B    float64
C     object
dtype: object

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.

Empty DataFrame of NaNs

And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the issues as the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

enter image description here

Benchmarking code for reference.

"Large data" workflows using pandas [closed]

Pandas Merging 101