Python: Writing Unicode Characters to a Temporary File (TemporaryFile)

ohgyun 2017. 2. 27. 00:07

Date encountered: 2016.04.14

Keywords: python Temporary file, tempfile.TemporaryFile, unicode, utf8, Create temporary file with unicode encoding, unicode object

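Problem:

I wanted to replace words matching a specific pattern in a file with other words.

The plan was to proceed in this order:

1. Read the original file line by line and replace the matching words.

2. Write each replaced line, in order, to a temporary file.

3. Once the whole file has been read, copy the contents of the temporary file back over the original file.

I created the temporary file with the tempfile module, but writing to it raised the following error.

UnicodeEncodeError: 'ascii' codec can't encode character ...

Why does this happen?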

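Solution:

The problem was that I tried to write a unicode object to the temporary file without encoding it.

In Python 2, TemporaryFile has no option for setting an encoding, so I changed the code to encode each line as UTF-8 before writing it.

The fix is the .encode('utf-8') call on the write line in the code below.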

    import codecs
    import shutil
    import tempfile

    def replace_characters(filepath):
        # codecs.open decodes the file, so each line is a unicode object
        in_file = codecs.open(filepath, 'r', 'utf-8')
        out_file = tempfile.NamedTemporaryFile(mode='w', delete=False)
        for line in in_file:
            new_line = replace_specific_pattern(line)
            # The fix: encode to UTF-8 bytes before writing
            out_file.write(new_line.encode('utf-8'))
        in_file.close()
        out_file.close()
        shutil.copy(out_file.name, filepath)
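An alternative to calling encode() on every line is to wrap the temporary file in a stream writer that encodes on the way out. The sketch below uses codecs.getwriter from the standard library (it exists in Python 2 as well). The snippet itself is written for Python 3, and the helper name write_lines_utf8 and the sample lines are made up for illustration.

```python
import codecs
import os
import tempfile


def write_lines_utf8(lines):
    """Write text lines to a new temporary file as UTF-8 and return its path.

    Hypothetical helper for illustration, not part of the original script.
    """
    # Binary mode: the wrapper below hands bytes to the file, not text.
    raw = tempfile.NamedTemporaryFile(mode='wb', delete=False)
    try:
        # The StreamWriter encodes each string to UTF-8 before writing it.
        writer = codecs.getwriter('utf-8')(raw)
        for line in lines:
            writer.write(line)
    finally:
        raw.close()
    return raw.name


path = write_lines_utf8([u'가나다\n', u'abc\n'])
with open(path, 'rb') as f:
    data = f.read()
os.remove(path)  # delete=False leaves cleanup to the caller
```

With this approach the loop body can pass unicode text straight through, and the encoding is stated once where the file is created.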

Discussion:

Python 2 has a separate unicode type for Unicode data, and a literal such as u'가' creates an object of that type.

    >>> type('가')
    <type 'str'>
    >>> type(u'가')
    <type 'unicode'>

To convert between unicode and str, use unicode.encode() and str.decode().

- unicode.encode(character_set):
  Encodes a unicode object into a byte string using the given character set.

- str.decode(character_set):
  Decodes a byte string into a unicode object using the given character set.

(Note that encode() is called on unicode, and decode() is called on str.)

A unicode object can be encoded into a byte string like this.

    >>> a = u'가'.encode('utf8')
    >>> type(a)
    <type 'str'>
    >>> a
    '\xea\xb0\x80'

A character set other than utf8 can be used as well.

    >>> b = u'가'.encode('euckr')
    >>> type(b)
    <type 'str'>
    >>> b
    '\xb0\xa1'

Decoding a byte string that was produced by encoding a unicode object gives back a unicode object.

    >>> a2 = a.decode('utf8')
    >>> type(a2)
    <type 'unicode'>
    >>> a2
    u'\uac00'
    >>> a2 == u'가'
    True
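The round trip only works when decoding uses the same character set that was used for encoding. A small sketch of that point, written for Python 3 (where the byte type is bytes rather than str); the variable names are illustrative.

```python
text = '가'

utf8_bytes = text.encode('utf-8')    # three bytes
euckr_bytes = text.encode('euc-kr')  # two bytes

# Decoding with the matching character set restores the original text.
same_charset_ok = (utf8_bytes.decode('utf-8') == text
                   and euckr_bytes.decode('euc-kr') == text)

# Decoding EUC-KR bytes as UTF-8 fails because they are not valid UTF-8.
try:
    euckr_bytes.decode('utf-8')
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True
```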

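The situation in the problem was this: the value read from the file with a specific character set was a unicode object, and I tried to write that unicode object to a temporary file that has no encoding support of its own.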
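The step that fails can be reproduced without any file. In Python 2, writing a unicode object to a byte-oriented file makes the interpreter encode it with the default ascii codec first. The sketch below performs that encode explicitly so that the same exception appears; it is written to run on Python 3, and the variable names are illustrative.

```python
line = u'가'  # the kind of non-ASCII text read from the source file

try:
    # What Python 2 effectively attempts when no encoding is given.
    line.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding explicitly with UTF-8 succeeds and yields bytes a file can store.
encoded = line.encode('utf-8')
```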

The code and comments below make the cause and the fix clear.

    def replace_characters(filepath):
        # For clarity, read with open() instead of codecs
        with open(filepath, 'r') as f:
            content = f.read()
        type(content)  #-> <type 'str'>
        content = content.decode('utf-8')
        type(content)  #-> <type 'unicode'>

        out_file = tempfile.NamedTemporaryFile(mode='w', delete=False)
        # splitlines(True) keeps the line endings; iterating over the
        # string itself would yield single characters, not lines
        for line in content.splitlines(True):
            new_line = replace_specific_pattern(line)
            type(new_line)  #-> <type 'unicode'>
            encoded = new_line.encode('utf-8')
            type(encoded)  #-> <type 'str'>
            out_file.write(encoded)

        # The input file was already closed by the with block
        out_file.close()
        shutil.copy(out_file.name, filepath)
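For comparison, Python 3's tempfile.NamedTemporaryFile accepts an encoding argument in text mode, so the manual encode() call is not needed there. The following is a sketch for Python 3 only. Here replace_specific_pattern is a stand-in for the post's function (it just swaps one hypothetical word), and the demo file exists only for illustration.

```python
import os
import shutil
import tempfile


def replace_specific_pattern(line):
    # Stand-in for the post's function: a hypothetical one-word replacement.
    return line.replace('사과', '바나나')


def replace_characters(filepath):
    with open(filepath, 'r', encoding='utf-8') as in_file, \
            tempfile.NamedTemporaryFile(mode='w', encoding='utf-8',
                                        delete=False) as out_file:
        for line in in_file:
            # Both files are text streams, so str goes in and out directly.
            out_file.write(replace_specific_pattern(line))
    shutil.copy(out_file.name, filepath)
    os.remove(out_file.name)  # delete=False leaves cleanup to us


# Demo on a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('사과 하나\n사과 둘\n')
replace_characters(demo_path)
with open(demo_path, encoding='utf-8') as f:
    result = f.read()
os.remove(demo_path)
```

Unlike the Python 2 version, this sketch also removes the temporary file after copying, since delete=False would otherwise leave it on disk.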

References:

https://docs.python.org/3.1/library/tempfile.html

http://stackoverflow.com/questions/10490816/how-to-create-a-temporary-file-with-unicode-encoding

http://stackoverflow.com/questions/94153/how-do-i-persist-to-disk-a-temporary-file-using-python

http://raccoonyy.github.io/working-with-unicode-streams-in-python-korean/

License: Attribution, NonCommercial, NoDerivatives