On Windows 8.1 (32-bit) with Python 3.4, I made a web robot that downloads the main HTML page of www.douban.com along with its JPG and PNG files. When it finishes, I can't open the picture files (Windows Photo Viewer says it can't open the picture, blah blah blah).
Questions:
1: Why can't the pictures be opened?
2: If I edit line 35 (the `dbr.write` call) to `dbr.write(data)`, the command line reports: TypeError: 'str' does not support the buffer interface. The same thing happens at lines 51 and 59 (the `jpg_file.write` and `png_file.write` calls). With line 35 written as `dbr.write(bytes(data, 'utf-8'))`, I get the right HTML file. I did the same at lines 51 and 59 for the picture files, but something went wrong. I wonder whether there is a bug in `write()`, but I can't figure out what is wrong.
Here is the code.
    import urllib.request
    import os
    import re

    # make dirs douban_robot, jpg, png
    dirpath = 'd:/pwork/webrobot/'
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)
    jpg_path = dirpath + 'jpgfiles/'
    png_path = dirpath + 'pngfiles/'
    if not os.path.isdir(jpg_path):
        os.makedirs(jpg_path)
    if not os.path.isdir(png_path):
        os.makedirs(png_path)
    douban_robot = dirpath + 'douban.html'
    url = 'http://www.douban.com'

    # get .html
    data = urllib.request.urlopen(url).read().decode('utf-8')
    with open(douban_robot, 'wb') as dbr:
        dbr.write(bytes(data, 'utf-8'))
        dbr.close()

    # create regex
    re_jpg = re.compile(r'<img src="(http.+?.jpg)"')
    re_png = re.compile(r'<img src="(http:.+?.png)"')
    jpg_data = re_jpg.findall(data)
    png_data = re_png.findall(data)

    # test jpg, png data
    print(jpg_data, png_data)

    # get jpg files
    i = 1
    for image in jpg_data:
        jpg_name = jpg_path + str(i) + '.jpg'
        #urllib.request.urlretrieve(image, jpg_name)
        with open(jpg_name, 'wb') as jpg_file:
            jpg_file.write(bytes(image, 'utf-8'))
            jpg_file.close()
        i += 1
    for image in png_data:
        png_name = png_path + str(i) + '.png'
        #urllib.request.urlretrieve(image, png_name)
        with open(png_name, 'wb') as png_file:
            png_file.write(bytes(image, 'utf-8'))
            png_file.close()
        i += 1
The variables `jpg_data` and `png_data` are lists containing the captured URLs. Your loops iterate over each URL, placing the URL string in the variable `image`. Then, in both loops, you write the URL string to the file, not the actual image. It looks like the commented-out `urllib` lines will do the trick, instead of what you're doing now.

The `.write()` function expects you to give it an object that matches the mode of the file. When you call `open(..., 'wb')`, you're saying to open the file in write, binary mode, which means you need to give it `bytes` instead of `str`.
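As a rough sketch of what downloading the image itself (rather than writing its URL) could look like, here is one way to do it with `urllib.request.urlopen`, the same standard-library call your code already uses for the HTML. The helper names `save_image` and `save_all` are made up for this illustration and are not part of your script; the commented-out `urlretrieve` lines would be an equally reasonable route.

```python
import os
import urllib.request


def save_image(url, path):
    """Fetch `url` and write the raw response bytes to `path`."""
    with urllib.request.urlopen(url) as response:
        payload = response.read()  # bytes: the image itself, not its URL
    with open(path, 'wb') as image_file:  # binary mode matches bytes
        image_file.write(payload)
    return len(payload)


def save_all(urls, folder, extension):
    """Save each URL as 1.<extension>, 2.<extension>, ... inside `folder`."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for number, url in enumerate(urls, start=1):
        path = os.path.join(folder, '{}.{}'.format(number, extension))
        save_image(url, path)
        saved.append(path)
    return saved
```

With your variables, a call such as `save_all(jpg_data, jpg_path, 'jpg')` would stand in for the first loop. No `close()` is needed, because the `with` block closes the file on exit.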
Bytes are the fundamental way that data is stored in a computer. Everything is a series of bytes: the data on your hard drive, and the data you send and receive over the internet. Bytes don't really have meaning on their own; each one is just 8 bits strung together. The meaning depends on how you interpret the bytes. For instance, you could interpret a single byte as representing a number from 0 to 255. Or, you could interpret it as a number from -128 to 127 (both of these are common). You could also assign these "numbers" to characters, and interpret a sequence of bytes as text. However, this only allows you to represent 256 characters, and there are many more than that in the world's various languages. So, there are multiple ways of representing text as sequences of bytes. These are called "character encodings". The most popular modern one is "UTF-8".
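To make the "same bytes, different meaning" point concrete, here is a small sketch. The byte value 200 and the accented character are arbitrary choices for the illustration, not anything taken from your script.

```python
raw = bytes([200])  # one byte, bit pattern 11001000

as_unsigned = int.from_bytes(raw, 'big', signed=False)  # read as 0..255
as_signed = int.from_bytes(raw, 'big', signed=True)     # read as -128..127

# The same character becomes different bytes under different encodings.
char = '\u00e9'  # "e" with an acute accent
utf8_bytes = char.encode('utf-8')
latin1_bytes = char.encode('latin-1')

print(as_unsigned, as_signed)    # 200 -56
print(utf8_bytes, latin1_bytes)  # b'\xc3\xa9' b'\xe9'
```

Decoding `utf8_bytes` with the wrong encoding would not raise an error here; it would just give you different characters, which is why you have to know which encoding the bytes were written with.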
In Python, a `bytes` object is just a series of bytes. It has no special meaning; nobody has said what it represents yet. If you want to use it as text, you need to decode it, using one of the character encodings. Once you do that (`.decode('utf-8')`), you have a `str` object. In order to write it to disk (or to the network), the `str` will have to be encoded back into bytes. When you open a file in text mode, Python chooses your computer's default encoding, and it will decode anything you read using that, and encode anything you write with it. However, when you open a file in `b` mode, Python expects that you will give it bytes, and so it throws an error when you give it a `str` instead.

Since you know that the HTML file you downloaded and put in `data` is text, it would probably have been best to save it to a file in text mode. However, encoding it as UTF-8 and writing it to a binary file works too, as long as your system's default encoding is UTF-8. In general, when you have a `str` and you want to write it to a file, open the file in text mode (just don't pass `b` in the mode parameter) and let Python pick the encoding, since it knows better than you do!

For more info on character sets and encoding stuff (which I glossed over), you should read this article.
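Here is a minimal sketch of the two modes side by side. The function name `write_both_ways` and the file names are invented for the example. Passing `encoding='utf-8'` explicitly is an assumption on my part so that the comparison is predictable; leave it out to get the system default described above.

```python
import os
import tempfile


def write_both_ways(text):
    """Write `text` in text mode and in binary mode, then compare the files."""
    folder = tempfile.mkdtemp()
    text_path = os.path.join(folder, 'text_mode.html')
    binary_path = os.path.join(folder, 'binary_mode.html')

    # Text mode: hand over a str, Python encodes it while writing.
    # newline='' switches off newline translation so both files match.
    with open(text_path, 'w', encoding='utf-8', newline='') as text_file:
        text_file.write(text)

    # Binary mode: encode first, then hand over bytes.
    with open(binary_path, 'wb') as binary_file:
        binary_file.write(text.encode('utf-8'))

    # Binary mode rejects a str outright with a TypeError.
    rejected = False
    with open(os.path.join(folder, 'rejected.bin'), 'wb') as binary_file:
        try:
            binary_file.write(text)
        except TypeError:
            rejected = True

    with open(text_path, 'rb') as first, open(binary_path, 'rb') as second:
        return first.read() == second.read(), rejected
```

Calling `write_both_ways(data)` with your decoded HTML should report that both files hold the same bytes and that the `str` write to the binary file was rejected, which is the TypeError you ran into.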