python - Import data into DataFrame with additional commas


I have a data file such as this:

    id,orig,time,text
    364,1,7-10-15,this works fine
    16254,1,7-10-15,but, don't work :(
    9846,0,7-10-15,neither, do,

When I import it using pandas, I'm trying to get the following:

+-------+------+---------+----------------------+
| id    | orig | time    | text                 |
+=======+======+=========+======================+
| 3464  | 1    | 7-10-15 | works fine           |
+-------+------+---------+----------------------+
| 16254 | 1    | 7-10-15 | but, don't work :(   |
+-------+------+---------+----------------------+
| 9846  | 0    | 7-10-15 | neither, do,         |
+-------+------+---------+----------------------+

Using the script data_df = pd.read_csv('data.csv', low_memory=False), the first row imports fine (with no index set).

However, in the second row there is a comma inside the text, so the data in id moves into the index column and everything gets shifted one column to the left.

+-------+----+---------+-----------------+-----------------+
|       | id | orig    | time            | text            |
+=======+====+=========+=================+=================+
| 3464  | 1  | 7-10-15 | works fine      | nan             |
+-------+----+---------+-----------------+-----------------+
| 16254 | 1  | 7-10-15 |                 | don't work :(   |
+-------+----+---------+-----------------+-----------------+

The pattern repeats as more commas are found in the last column. One possible solution is to rewrite the files, but I'm trying to find a way to import them without having to rewrite each file (I have 65+).

My question is:

Is it possible to import (per row) the first column as "id", the second column as "orig", the third column as "time", and everything else as "text"?

Your CSV is malformed because it does not use quotes to distinguish commas that are delimiters from commas that are part of a field's value.
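To illustrate the point about quoting, here is a small sketch showing that pd.read_csv copes with embedded commas once the text field is wrapped in double quotes. The quoted sample string is hypothetical; it only mirrors the data in the question and is not how the actual files look.

```python
import io

import pandas as pd

# Hypothetical, properly quoted version of the sample data from the question
quoted_csv = io.StringIO(
    'id,orig,time,text\n'
    '364,1,7-10-15,this works fine\n'
    '16254,1,7-10-15,"but, don\'t work :("\n'
    '9846,0,7-10-15,"neither, do,"\n'
)

# With quotes in place, commas inside the text field are kept as part of the value
quoted_df = pd.read_csv(quoted_csv)
print(quoted_df)
```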

However, you can iterate through the lines of the CSV and use str.split(',', 3) to split on only the first 3 commas:

    lines = (line.split(',', 3) for line in f)
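As a quick check of what the second argument to str.split does, this sketch splits one of the problem rows from the question. The variable names row and fields are only for illustration.

```python
# maxsplit=3 allows at most 3 splits, so at most 4 fields come back
row = "16254,1,7-10-15,but, don't work :("
fields = row.split(',', 3)

# The comma inside the text stays in the last field
print(fields)  # ['16254', '1', '7-10-15', "but, don't work :("]
```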

We can pass this iterator directly to pd.DataFrame:

    df = pd.DataFrame(lines, columns=header)

This is not as fast as loading a valid CSV using pd.read_csv's optimized parsing engine, but I think the result is pretty good considering the input is malformed.


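    import numpy as np
    import pandas as pd

    with open('data', 'r') as f:
        # The first line holds the column names
        header = [item.strip() for item in next(f).split(',')]
        # Split each remaining line on the first 3 commas only
        lines = (line.split(',', 3) for line in f)
        df = pd.DataFrame(lines, columns=header)
        # Note: convert_objects was removed in pandas 1.0;
        # on newer versions use pd.to_numeric on the numeric columns instead
        df = df.convert_objects(convert_numeric=True)
        df['time'] = pd.to_datetime(df['time'])

    print(df)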

yields

          id  orig       time                  text
    0    364     1 2015-07-10          works fine\n
    1  16254     1 2015-07-10  but, don't work :(\n
    2   9846     0 2015-07-10          neither, do,

with

    print(df.dtypes)
    # id               int64
    # orig             int64
    # time    datetime64[ns]
    # text            object
    # dtype: object
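Since DataFrame.convert_objects is no longer available in recent pandas releases, here is a hedged sketch of the same idea for newer versions. The helper name parse_loose_csv and its n_fixed parameter are hypothetical, the explicit date format is an assumption (month-day-year, matching the 2015-07-10 shown in the output above), and unlike the code above it strips the trailing newline from the text column.

```python
import io

import pandas as pd


def parse_loose_csv(lines, n_fixed=3):
    """Build a DataFrame from lines whose last field may contain commas.

    `parse_loose_csv` and `n_fixed` are hypothetical names for this sketch.
    `lines` is any iterable of text lines, such as an open file.
    """
    lines = iter(lines)
    header = [item.strip() for item in next(lines).split(',')]
    # Strip the line ending, skip blank lines, split on the first n_fixed commas
    rows = [line.rstrip('\r\n').split(',', n_fixed)
            for line in lines if line.strip()]
    df = pd.DataFrame(rows, columns=header)
    # Explicit numeric conversion, standing in for convert_objects
    for col in ('id', 'orig'):
        df[col] = pd.to_numeric(df[col])
    # Assumed month-day-year format
    df['time'] = pd.to_datetime(df['time'], format='%m-%d-%y')
    return df


# Sample data copied from the question, held in memory for the demo
sample = io.StringIO(
    "id,orig,time,text\n"
    "364,1,7-10-15,this works fine\n"
    "16254,1,7-10-15,but, don't work :(\n"
    "9846,0,7-10-15,neither, do,\n"
)
loose_df = parse_loose_csv(sample)
print(loose_df)
print(loose_df.dtypes)
```

For the 65+ files mentioned in the question, the same helper could be called on each opened file in a loop (for example over glob.glob('*.csv')) and the resulting frames combined with pd.concat.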
