csv - Python Pandas DataFrame read_csv UnicodeDecodeError


I have a 129 MB CSV file with 849,275 rows and 18 columns. I'm trying to read the CSV file into a pandas DataFrame using read_csv.

When I use encoding='cp1252':

read_file = pd.read_csv('myfile.csv', encoding='cp1252') 

The error is quite long, but at the bottom it says:

 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 41: character maps to <undefined>

When I specify no encoding, encoding='utf-8', or encoding='utf-8-sig', I get:

 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 65: invalid start byte
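To see why both attempts fail, it can help to decode the two reported bytes in isolation. The following is only a small diagnostic sketch using the standard library; the helper name try_decode is made up for illustration, and latin-1 is included only because it maps every byte to some character.

```python
# Byte values taken from the two error messages above.
samples = {"0x9d": b"\x9d", "0x96": b"\x96"}


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short note describing the decode error."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# Try every sample byte against every candidate encoding.
results = {
    (name, enc): try_decode(raw, enc)
    for name, raw in samples.items()
    for enc in ("utf-8", "cp1252", "latin-1")
}

for key, value in results.items():
    print(key, repr(value))
```

Byte 0x96 is an en dash in cp1252 but is not a valid first byte in UTF-8, while 0x9d has no character assigned in cp1252 at all. That suggests the file is mostly cp1252 with a few stray bytes, although this is a guess based only on the two messages.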

Question:

I am fine with deleting these problematic characters altogether. Better yet, I'd like to normalize them to ASCII characters under 127. How can I do this using just pandas? I'm looking for a pandas-like way, if one exists.

Not to overkill the question, but here's a list of the types of characters in one of the columns that I believe is causing the problem:

Character and ord pairs:

(space) 32   ! 33   " 34   # 35   $ 36   % 37   & 38   ' 39   ( 40   ) 41   * 42   + 43   , 44   - 45   . 46   / 47
0 48   1 49   2 50   3 51   4 52   5 53   6 54   7 55   8 56   9 57   : 58   ; 59   < 60   = 61   > 62   ? 63   @ 64
A 65   B 66   C 67   D 68   E 69   F 70   G 71   H 72   I 73   J 74   K 75   L 76   M 77
N 78   O 79   P 80   Q 81   R 82   S 83   T 84   U 85   V 86   W 87   X 88   Y 89   Z 90
[ 91   \ 92   ] 93   ^ 94   _ 95   ` 96
a 97   b 98   c 99   d 100   e 101   f 102   g 103   h 104   i 105   j 106   k 107   l 108   m 109
n 110   o 111   p 112   q 113   r 114   s 115   t 116   u 117   v 118   w 119   x 120   y 121   z 122
{ 123   | 124   } 125   ~ 126
(control) 129   (control) 143   (control) 157   (no-break space) 160
¡ 161   ¢ 162   £ 163   § 167   ¨ 168   © 169   « 171   ¬ 172   ® 174   ° 176   ± 177   ² 178   ³ 179   ´ 180   µ 181
· 183   ¸ 184   ¹ 185   º 186   ¼ 188   ½ 189   ¾ 190   × 215   ß 223
à 224   á 225   â 226   ã 227   ä 228   å 229   æ 230   ç 231   è 232   é 233   ì 236   í 237   î 238   ï 239
ð 240   ñ 241   ó 243   ô 244   ö 246   ú 250   û 251   ü 252
š 353   Ž 381   ƒ 402
– 8211   — 8212   ‘ 8216   ’ 8217   ‚ 8218   “ 8220   ” 8221   „ 8222   † 8224   • 8226   … 8230   ‹ 8249   › 8250   € 8364   ™ 8482
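For anyone who wants to build a similar character inventory for their own column, here is a minimal standard-library sketch. The function name char_inventory and the sample strings are hypothetical and not taken from the original file; with pandas, the values of a text column could be passed in the same way.

```python
from collections import Counter


def char_inventory(values):
    """List (character, ord, count) for every distinct character, sorted by code point."""
    counts = Counter(ch for text in values for ch in text)
    return [
        (ch, ord(ch), n)
        for ch, n in sorted(counts.items(), key=lambda kv: ord(kv[0]))
    ]


# Hypothetical sample values standing in for one text column.
sample = ["caf\u00e9 \u2013 5\u20ac", "na\u00efve"]
for ch, code, n in char_inventory(sample):
    print(repr(ch), code, n)
```

Any entry with an ord above 127 is a candidate for removal or normalization.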

The best option is to use Python 3. Alternatively, what has helped me in a number of cases is string.encode('ascii', errors='ignore') inside read_csv:

pd.read_csv(..., converters={'column_x': lambda v: v.encode('ascii', errors='ignore')})

This link has more examples: Python: Convert Unicode to ASCII without errors
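As a fuller sketch of a pandas-style approach, the example below reads the file through a handle opened with errors="replace", so undecodable bytes do not raise, and then strips each text column down to ASCII. It assumes the file is mostly cp1252. The sample bytes, the column names and the helper to_ascii are invented for illustration. Note that in Python 3 encode() returns bytes, so the helper decodes back to str afterwards.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for myfile.csv: cp1252 text plus the
# byte 0x9d, which cp1252 leaves undefined.
raw = b"name,note\ncaf\xe9,5 \x96 10\nna\xefve,bad\x9dbyte\n"

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(path, encoding="cp1252", errors="replace", newline="") as fh:
        df = pd.read_csv(fh)
finally:
    os.remove(path)


def to_ascii(col: pd.Series) -> pd.Series:
    """Decompose accented characters (NFKD), then drop anything non-ASCII."""
    return (
        col.str.normalize("NFKD")
        .str.encode("ascii", errors="ignore")
        .str.decode("ascii")
    )


# Apply the cleanup to each text column by name.
for col in ["name", "note"]:
    df[col] = to_ascii(df[col])

print(df)
```

With this approach accented letters such as é become their base letter, while characters with no ASCII decomposition (dashes, curly quotes, the euro sign) are simply dropped; replace them explicitly with str.replace beforehand if you want ASCII equivalents. Recent pandas versions (1.3 and later, as far as I know) also accept an encoding_errors argument in read_csv, which would make the explicit open() unnecessary.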

