I have a 129 MB CSV file with 849,275 rows and 18 columns. I'm trying to read the CSV file into a pandas DataFrame using read_csv.
When I use encoding='cp1252':
read_file = pd.read_csv('myfile.csv', encoding='cp1252')
the error is quite long, but at the bottom it says:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 41: character maps to <undefined>
When I specify no encoding, encoding='utf-8', or encoding='utf-8-sig', I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 65: invalid start byte
Question:
I am fine with deleting these problematic characters altogether. Better yet would be to normalize them to ASCII characters (code points up to 127). How can I do this using just pandas? I'm looking for a pandas-like way, if one exists.
Not to overkill the question, but here's a list of the types of characters in one of the columns that I suspect is causing the problem:
Character and ord, listed as pairs:
(space) 32  ! 33  " 34  # 35  $ 36  % 37  & 38  ' 39  ( 40  ) 41  * 42  + 43  , 44  - 45  . 46  / 47
0 48  1 49  2 50  3 51  4 52  5 53  6 54  7 55  8 56  9 57  : 58  ; 59  < 60  = 61  > 62  ? 63  @ 64
A 65  B 66  C 67  D 68  E 69  F 70  G 71  H 72  I 73  J 74  K 75  L 76  M 77  N 78  O 79  P 80  Q 81  R 82  S 83  T 84  U 85  V 86  W 87  X 88  Y 89  Z 90
[ 91  \ 92  ] 93  ^ 94  _ 95  ` 96
a 97  b 98  c 99  d 100  e 101  f 102  g 103  h 104  i 105  j 106  k 107  l 108  m 109  n 110  o 111  p 112  q 113  r 114  s 115  t 116  u 117  v 118  w 119  x 120  y 121  z 122
{ 123  | 124  } 125  ~ 126
(unprintable) 129  (unprintable) 143  (unprintable) 157  (no-break space) 160
¡ 161  ¢ 162  £ 163  § 167  ¨ 168  © 169  « 171  ¬ 172  ® 174  ° 176  ± 177  ² 178  ³ 179  ´ 180  µ 181  · 183  ¸ 184  ¹ 185  º 186  ¼ 188  ½ 189  ¾ 190  × 215
ß 223  à 224  á 225  â 226  ã 227  ä 228  å 229  æ 230  ç 231  è 232  é 233  ì 236  í 237  î 238  ï 239  ð 240  ñ 241  ó 243  ô 244  ö 246  ú 250  û 251  ü 252
š 353  Ž 381  ƒ 402
– 8211  — 8212  ‘ 8216  ’ 8217  ‚ 8218  “ 8220  ” 8221  „ 8222  † 8224  • 8226  … 8230  ‹ 8249  › 8250  € 8364  ™ 8482
The best option is to use Python 3. Alternatively, what has helped me in a number of cases is string.encode('ascii', errors='ignore') inside read_csv:
read_csv(..., converters={'column_x': lambda v: v.encode('ascii', errors='ignore').decode('ascii')})
This link has more examples: "Python: Convert Unicode to ASCII without errors".
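One caveat: converters only run after the bytes have been decoded, so on their own they cannot prevent a UnicodeDecodeError raised while the file is being read. A minimal sketch of a two-step approach follows, assuming the file is mostly cp1252 with a few undefined bytes such as 0x9d. The helper names to_ascii and read_csv_ascii and the PUNCT table are made up for illustration and are not part of pandas. The file is opened with errors='replace' so that undecodable bytes no longer raise, and the chosen text columns are then reduced to ASCII.

```python
import unicodedata

import pandas as pd

# Hypothetical mapping for common cp1252 punctuation; extend it as needed.
PUNCT = str.maketrans({
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2026": "...",                # horizontal ellipsis
})


def to_ascii(value):
    """Return value with every character above code point 127 mapped or dropped."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-string values untouched
    value = value.translate(PUNCT)
    # NFKD splits accented letters into a base letter plus combining marks;
    # the marks (and anything else outside ASCII) are then dropped.
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def read_csv_ascii(path, text_columns, encoding="cp1252"):
    """Read a CSV without decode errors, then clean the given text columns."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(path, encoding=encoding, errors="replace", newline="") as handle:
        df = pd.read_csv(handle)
    for column in text_columns:
        df[column] = df[column].map(to_ascii)
    return df
```

Usage would look like read_file = read_csv_ascii('myfile.csv', ['column_x']). Characters with no ASCII decomposition, such as ß or €, are simply removed by this sketch. If I remember correctly, pandas 1.3 and later also accept an encoding_errors argument on read_csv, which may make the explicit open() unnecessary.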