I have a data file such as:

    id,orig,time,text
    364,1,7-10-15,this works fine
    16254,1,7-10-15,but, don't work :(
    9846,0,7-10-15,neither, do,

When I import it using pandas, I'm trying to get the following:
+-------+------+---------+----------------------+
| id    | orig | time    | text                 |
+=======+======+=========+======================+
| 3464  | 1    | 7-10-15 | works fine           |
+-------+------+---------+----------------------+
| 16254 | 1    | 7-10-15 | but, don't work :(   |
+-------+------+---------+----------------------+
| 9846  | 0    | 7-10-15 | neither, do,         |
+-------+------+---------+----------------------+

Using the script

    data_df = pd.read_csv('data.csv', low_memory=False)

the first row imports fine (with no index set).
However, in the second row, since there is a comma in the text, the data in id moves into the index column and everything gets shifted one column to the left.
+-------+----+---------+-----------------+-----------------+
|       | id | orig    | time            | text            |
+=======+====+=========+=================+=================+
| 3464  | 1  | 7-10-15 | works fine      | nan             |
+-------+----+---------+-----------------+-----------------+
| 16254 | 1  | 7-10-15 | but             | don't work :(   |
+-------+----+---------+-----------------+-----------------+

The pattern repeats when more commas are found in the last column. A possible solution is to rewrite the file, but I'm trying to find a way to import it without having to rewrite each file (I have 65+).
My question is:
Is it possible to import (per row) the first column as "id", the second column as "orig", the third column as "time", and everything else as "text"?
Your CSV is malformed because it does not use quotes to distinguish commas that are delimiters from commas that are part of a field's value.
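To illustrate the point about quoting, here is a minimal sketch, assuming the same rows had been written with the text field wrapped in double quotes. The quoted sample below is made up for illustration and is not the asker's actual file. With quotes in place, pd.read_csv should keep the embedded commas inside the text column:

```python
import io

import pandas as pd

# Hypothetical sample: the rows from the question, but with the text field quoted.
quoted = io.StringIO(
    'id,orig,time,text\n'
    '364,1,7-10-15,"this works fine"\n'
    '16254,1,7-10-15,"but, don\'t work :("\n'
    '9846,0,7-10-15,"neither, do,"\n'
)

# The default quotechar is '"', so commas inside quotes are not treated as delimiters.
df_quoted = pd.read_csv(quoted)
print(df_quoted)
```

Since the files in the question are not quoted, this only shows what a well-formed version would look like.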
However, you could iterate through the lines of the CSV and use str.split(',', 3) to split on the first 3 commas only:

    lines = (line.split(',', 3) for line in f)

We can then pass this iterator directly to pd.DataFrame:

    df = pd.DataFrame(lines, columns=header)

This is not as fast as loading a valid CSV using pd.read_csv's optimized parsing engine, but I think the result is pretty good considering the input is malformed.
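    import pandas as pd

    with open('data', 'r') as f:
        # The first line holds the column names.
        header = [item.strip() for item in next(f).split(',')]
        # Split each remaining line on the first 3 commas only.
        lines = (line.split(',', 3) for line in f)
        df = pd.DataFrame(lines, columns=header)

    # DataFrame.convert_objects was removed in pandas 1.0; use pd.to_numeric instead.
    for col in ['id', 'orig']:
        df[col] = pd.to_numeric(df[col])
    df['time'] = pd.to_datetime(df['time'])
    print(df)

yields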
          id  orig       time                  text
    0    364     1 2015-07-10          works fine\n
    1  16254     1 2015-07-10  but, don't work :(\n
    2   9846     0 2015-07-10          neither, do,

with

    print(df.dtypes)
    # id               int64
    # orig             int64
    # time    datetime64[ns]
    # text            object
    # dtype: object
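The output above still carries a trailing \n in the text column. As a minimal sketch of one way to tidy that up, the same split-on-the-first-3-commas idea can be wrapped in a small function that strips the line ending before splitting. The names read_loose_csv and n_fixed are hypothetical, and the in-memory sample stands in for a real file:

```python
import io

import pandas as pd


def read_loose_csv(fileobj, n_fixed=3):
    """Read CSV-like text whose last column may contain unquoted commas.

    read_loose_csv and n_fixed are hypothetical names used for this sketch.
    """
    # The first line holds the column names.
    header = [item.strip() for item in next(fileobj).split(',')]
    # Drop the trailing newline, skip blank lines, split on the first n_fixed commas.
    rows = [
        line.rstrip('\n').split(',', n_fixed)
        for line in fileobj
        if line.strip()
    ]
    return pd.DataFrame(rows, columns=header)


# Sample rows taken from the question, held in memory instead of a file on disk.
sample = io.StringIO(
    "id,orig,time,text\n"
    "364,1,7-10-15,this works fine\n"
    "16254,1,7-10-15,but, don't work :(\n"
    "9846,0,7-10-15,neither, do,\n"
)

df = read_loose_csv(sample)
for col in ['id', 'orig']:
    df[col] = pd.to_numeric(df[col])
print(df)
```

For many files, the same function could be called in a loop over the file paths and the resulting frames combined with pd.concat, so that none of the files has to be rewritten.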