Python website crawler saves JPG and PNG files, but they can't be opened. Why?


I'm on Windows 8.1 (32-bit) with Python 3.4. I wrote a web robot that fetches the www.douban.com main page HTML, plus its JPG and PNG files. When it finishes, I can't open the picture files (Windows Photo Viewer says it can't open the picture, blah blah blah).

Questions:

1: Why can't the pictures be opened?

2: If I change line 35 (the dbr.write call) to dbr.write(data), the command line reports: TypeError: 'str' does not support the buffer interface. The same thing happens at lines 51 and 59 (the jpg_file.write and png_file.write calls). When line 35 is dbr.write(bytes(data, 'utf-8')), I get a correct HTML file. I did the same at lines 51 and 59 for the picture files, but something went wrong. I wonder whether there is a bug in write(), but I can't figure out what is wrong.

Here is the code.

    import urllib.request
    import os
    import re

    #make dirs douban_robot, jpg, png
    dirpath = 'd:/pwork/webrobot/'
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)
    jpg_path = dirpath + 'jpgfiles/'
    png_path = dirpath + 'pngfiles/'
    if not os.path.isdir(jpg_path):
        os.makedirs(jpg_path)
    if not os.path.isdir(png_path):
        os.makedirs(png_path)

    douban_robot = dirpath + 'douban.html'

    url = 'http://www.douban.com'

    #get .html
    data = urllib.request.urlopen(url).read().decode('utf-8')
    with open(douban_robot, 'wb') as dbr:
        dbr.write(bytes(data, 'utf-8'))
    dbr.close()

    # create regex
    re_jpg = re.compile(r'<img src="(http.+?.jpg)"')
    re_png = re.compile(r'<img src="(http:.+?.png)"')
    jpg_data = re_jpg.findall(data)
    png_data = re_png.findall(data)
    # test jpg and png data
    print(jpg_data, png_data)

    #get jpg files
    i = 1
    for image in jpg_data:
        jpg_name = jpg_path + str(i)+'.jpg'

        #urllib.request.urlretrieve(image, jpg_name)
        with open(jpg_name, 'wb') as jpg_file:
            jpg_file.write(bytes(image, 'utf-8'))
        jpg_file.close()
        i += 1

    for image in png_data:
        png_name = png_path + str(i)+'.png'

        #urllib.request.urlretrieve(image, png_name)
        with open(png_name, 'wb') as png_file:
            png_file.write(bytes(image, 'utf-8'))
        png_file.close()
        i += 1

  1. The variables jpg_data and png_data are lists containing the captured URLs. Your loops iterate over each URL, placing the URL string in the variable image. Then, in both loops, you write the URL string to the file, not the actual image. It looks like the commented-out urllib lines would do the trick, instead of what you're doing now.

  2. The .write() function expects you to give it an object that matches the mode of the file. When you call open(..., 'wb'), you're saying to open the file in write and binary mode, which means you need to give it bytes instead of str.
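
To make point 1 concrete, here is a minimal sketch of a download step that writes the response body rather than the URL string. The helper name save_image is made up for illustration, and the demo uses a data: URL (an assumption, chosen only so the sketch runs without a network connection) in place of the real image URLs collected in jpg_data and png_data.

```python
import base64
import os
import tempfile
import urllib.request


def save_image(url, path):
    """Download url and write the raw response bytes to path.

    Hypothetical helper, not part of the original script.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()      # bytes: the actual image content
    with open(path, 'wb') as image_file:
        image_file.write(data)      # bytes into a binary-mode file, no conversion needed
    return len(data)


# Demo input (assumption): a data: URL carrying the 8-byte PNG file signature,
# standing in for a real image URL such as the ones in jpg_data / png_data.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
demo_url = 'data:image/png;base64,' + base64.b64encode(PNG_SIGNATURE).decode('ascii')

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, '1.png')
written = save_image(demo_url, out_path)
print(written, 'bytes written to', out_path)
```

In the original loops this would be called as save_image(image, jpg_name). The commented-out urllib.request.urlretrieve(image, jpg_name) does the same job in a single call.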

Bytes are the fundamental way that everything is stored in a computer. Everything is a series of bytes: the data on your hard drive, and the data you send and receive on the internet. Bytes don't have any meaning on their own; each one is just 8 bits strung together. The meaning depends on how you interpret the bytes. For instance, you could interpret a single byte as representing a number from 0 to 255. Or, you could interpret it as a number from -128 to 127 (both of these are common). You can also assign these "numbers" to characters, and interpret a sequence of bytes as text. However, one byte per character only allows you to represent 256 characters, and there are many more than that in the world's various languages. So, there are multiple ways of representing text as sequences of bytes. These are called "character encodings". The most popular modern one is "UTF-8".
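
As a small illustration of "the meaning depends on how you interpret the bytes", the sketch below reads the same byte as an unsigned and as a signed number, and encodes the same character with two different encodings. The sample values are arbitrary choices for the demo.

```python
raw = b'\xe9'   # one byte, bit pattern 11101001

# The same 8 bits read as an unsigned and as a signed number.
unsigned = int.from_bytes(raw, byteorder='big', signed=False)   # 233
signed = int.from_bytes(raw, byteorder='big', signed=True)      # -23
print(unsigned, signed)

# The same character becomes different bytes under different encodings.
text = '\u00e9'                           # the character "e with acute accent"
utf8_bytes = text.encode('utf-8')         # b'\xc3\xa9' (two bytes)
latin1_bytes = text.encode('latin-1')     # b'\xe9' (one byte)
print(utf8_bytes, latin1_bytes)

# Decoding with the wrong encoding does not fail here, it just yields other text.
misread = utf8_bytes.decode('latin-1')
print(misread == text)
```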

In Python, a bytes object is just a series of bytes. It has no special meaning; nobody has said what it represents yet. If you want to use it as text, you need to decode it, using one of the character encodings. Once you do that (.decode('utf-8')), you have a str object. In order to write it to disk (or the network), the str will eventually have to be encoded back to bytes. When you open a file in text mode, Python chooses your computer's default encoding, decodes everything you read using that, and encodes everything you write with it. However, when you open a file in b mode, Python expects you to give it bytes, and throws an error when you give it a str instead.

Since you know that the HTML file you downloaded and put in data is text, it would have been best to save it to a file in text mode. However, encoding it as UTF-8 and writing it to a binary file works too, and gives the same bytes as text mode as long as your system's default encoding is UTF-8. In general, when you have a str and you want to write it to a file, open the file in text mode (just don't pass b in the mode parameter) and let Python pick the encoding, since it knows better than you do!
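
The difference between the two modes can be sketched as follows. The sample string and file names are made up, and passing encoding='utf-8' explicitly is an assumption of this sketch (it makes the result independent of the system default) rather than something the script above does.

```python
import os
import tempfile

html = '<title>\u8c46\u74e3</title>'   # sample str with non-ASCII characters (made up)

out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, 'text_mode.html')
binary_path = os.path.join(out_dir, 'binary_mode.html')

# Text mode: pass a str and Python encodes it on write.
# encoding='utf-8' is given explicitly here (an assumption of this sketch)
# so that the output does not depend on the system's default encoding.
with open(text_path, 'w', encoding='utf-8') as text_file:
    text_file.write(html)

# Binary mode: you must encode the str to bytes yourself.
with open(binary_path, 'wb') as binary_file:
    binary_file.write(html.encode('utf-8'))

# Binary mode with a str: write() raises TypeError.
try:
    with open(binary_path, 'ab') as binary_file:
        binary_file.write(html)
    str_rejected = False
except TypeError as error:
    str_rejected = True
    print('TypeError:', error)

# Both files end up holding the same bytes.
with open(text_path, 'rb') as f:
    text_mode_bytes = f.read()
with open(binary_path, 'rb') as f:
    binary_mode_bytes = f.read()
print(text_mode_bytes == binary_mode_bytes)
```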

For more info on character sets and encoding stuff (which I glossed over), you should read this article.

