• Purpose of this example
    • Convert HTML (URL) percent-encoded text to UTF-8 characters with Python
    • Convert HTML Unicode to a Python string
    • Convert the %xx escapes used in HTML (URLs) into a UTF-8 string, using the Python library

  • Why does URL encoding exist?
    URLs accept only alphanumeric characters and a few punctuation symbols, such as parentheses and underscores.

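    To use any other symbol (such as a space), you have to URL-encode it: a percent sign followed by the two hexadecimal digits that represent that character in the ASCII table. Non-ASCII characters such as Korean are first encoded as UTF-8 bytes, and each byte is written as its own %xx escape.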

    For example, the space symbol is character 32 (hexadecimal 20) in the ASCII table, so it's expressed as %20.
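
The encoding rule above can be checked with a minimal sketch. It assumes Python 3, where the helpers live in `urllib.parse` (the listing further down targets Python 2), and it borrows the first item of that listing's sample data as input.

```python
from urllib.parse import quote, unquote

# A space is ASCII 32 (hex 20), so it is percent-encoded as %20.
encoded_space = quote(" ")

# Letters, digits, and "_" are allowed in URLs and pass through unchanged.
encoded_plain = quote("abc_123")

# Non-ASCII text: each UTF-8 byte becomes its own %XX escape.
# The input is the first item of the sample data used later in this post.
encoded_korean = "%EA%B9%80%EC%9A%A9%EC%9D%B5"
decoded_korean = unquote(encoded_korean)   # three Hangul syllables
round_trip = quote(decoded_korean)         # encode again to compare

print(encoded_space)
print(encoded_plain)
print(decoded_korean, len(decoded_korean))
print(round_trip == encoded_korean)
```

Each Hangul syllable takes three bytes in UTF-8, which is why three characters expand to nine %XX escapes.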

  • Using the urllib library
    • Utility functions of the urllib library
    • Example
      • Python 
        ##
        # -*- coding: utf-8 -*-
        """
        Example that converts HTML (URL) percent-encoded Unicode text into a
        UTF-8 string that Python can read.
        """


        import urllib  # library used for the conversion (Python 2 API)


        # --------------------------------------------------------------------------------
        # Data
        # --------------------------------------------------------------------------------
        # Percent-encoded data to convert
        # The data are Twitter trending keywords from 2016-02-26 19:00.
        target_strings = [u'%EA%B9%80%EC%9A%A9%EC%9D%B5',
        u'%23%EC%9A%B0%EB%A6%AC_%EC%A7%80%EC%97%AD%EC%9D%84_%EA%B9%8C%EB%B3%B4%EC%9E%90',
        u'%EB%B0%B0%EC%9E%AC%EC%A0%95', u'%EA%B2%BD%EC%A0%9C%EC%9C%84%EA%B8%B0',
        u'%23%EC%9E%90%EC%BA%90%EC%9D%98_%EC%9D%B4_%EB%A7%90%EC%9D%80_%EB%AF%BF%EC%9C%BC%EB%A9%B4_%EC%95%88_%EB%90%9C%EB%8B%A4',
        u'%23%EB%84%88%EB%8F%84_%EB%82%98%EB%8F%84_%ED%91%B8%EC%9E%AC%EA%B0%9C%EA%B7%B8',
        u'%EC%86%90%EA%B0%80%EB%9D%BD%ED%95%98%ED%8A%B8', u'%EB%AC%B4%EC%8A%A8%EB%86%88',
        u'%EB%B9%84%EC%83%81%EA%B7%BC%EB%AC%B4',
        u'%EC%9E%90%EC%B9%B4%EB%A5%B4%ED%83%80', u'%22%EC%B4%9D%EC%84%A0+%EB%B6%88%EC%B6%9C%EB%A7%88%22',
        u'%EA%B5%AD%ED%9A%8C%EA%B8%B0%EB%A1%9D', u'%ED%83%95%EC%9B%A8%EC%9D%B4',
        u'%22%EB%A7%88%EB%A6%AC+%EC%BE%85%ED%88%AC%22', u'%EC%84%B1%EC%9D%B8%EC%9E%A1%EC%A7%80',
        u'%22DJ+3%EB%82%A8%22', u'%22%EA%B5%AD%EA%B0%80+%EB%B9%84%EC%83%81%22',
        u'%EC%84%9C%EC%9A%B8%EB%9E%9C%EB%93%9C', u'%EA%B2%8C%EC%9E%84%ED%94%84%EC%82%AC',
        u'UN%EC%97%90%EC%84%9C', u'%ED%97%88%EB%9D%BD%EC%B6%A9', u'%EC%95%88%EC%A0%84%EB%B9%84%ED%96%89',
        u'%22%ED%8B%B0%EC%97%94+%EC%B7%A8%EB%AF%B8%22', u'%EB%B0%A9%EC%B2%AD%EC%84%9D',
        u'%EC%8A%A4%EC%BF%A8%EC%96%B4%ED%83%9D', u'%EC%9C%A0%EC%95%BC%EB%AC%B4%EC%95%BC',
        u'%23%EC%A0%9C_%EC%9D%B4%EB%AF%B8%EC%A7%80_%EC%BB%AC%EB%9F%AC%EB%8A%94_%EB%AC%B4%EC%97%87%EC%9D%B8%EA%B0%80%EC%9A%94',
        u'%EC%95%A0%EB%8B%88%ED%94%8C%EC%82%AC', u'%EC%95%84%EC%9D%B4%EB%A7%A5%EC%8A%A4',
        u'%22%EC%95%A0%EB%93%A4+%EC%96%B4%EB%94%94%EA%B0%80%22',
        u'%22%EC%9D%B4%EB%AF%B8%ED%85%8C%EC%9D%B4%EC%85%98+%EA%B2%8C%EC%9E%84%22',
        u'%22%EB%AC%B4%EC%8A%A8+%EA%B6%8C%EB%A6%AC%22', u'%EC%BD%94%EB%AF%B8%EC%85%98',
        u'%EC%83%81%EC%9E%84%EC%9C%84%EC%9B%90%EC%9E%A5', u'%EB%B2%95%ED%92%8D%EC%84%A0',
        u'%EC%9D%B8%EC%B2%9C%ED%95%AD']


        # --------------------------------------------------------------------------------
        # Conversion
        # --------------------------------------------------------------------------------
        def convert_html_url_encode_to_ascii(target_strings):
            # Print the URL-encoded words to be converted
            print_items(target_strings)

            # Method 1: encode each unicode item to UTF-8 bytes, then percent-decode.
            # The result is a UTF-8 byte string (not ASCII, despite the function name).
            result_strings = [urllib.unquote(one_str.encode('utf-8')) for one_str in target_strings]

            # # Method 2: Method 1 written out as an explicit loop
            # result_strings = []
            # for item in target_strings:
            #     # ----------------------------------------------------------------
            #     # TODO - copy this part into your own code.
            #     # ----------------------------------------------------------------
            #     # 1. Encode the item to UTF-8
            #     item_to_utf8 = item.encode('utf-8')
            #     # 2. Use urllib.unquote to decode the %xx escapes into UTF-8 bytes
            #     item_to_ascii = urllib.unquote(item_to_utf8)
            #     result_strings.append(item_to_ascii)
            #     # ----------------------------------------------------------------

            # Print the converted words
            print_items(result_strings)

            return


        def print_items(items):
            print '-' * 80
            for item in items:
                print item
            print '-' * 80


        if __name__ == '__main__':
            convert_html_url_encode_to_ascii(target_strings)
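
The listing above uses the Python 2 `urllib.unquote`. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes `urllib.parse.unquote` and `urllib.parse.unquote_plus`; the helper name `decode_items` and its `plus_as_space` parameter are hypothetical, introduced only for illustration. Several sample items contain `+`, which likely stands for a space as in form-style query encoding (an assumption about how the data was produced). `unquote` leaves `+` untouched, while `unquote_plus` turns it into a space.

```python
from urllib.parse import unquote, unquote_plus

# Two items copied from the sample data in the listing above.
samples = [
    "%EA%B9%80%EC%9A%A9%EC%9D%B5",
    "%22DJ+3%EB%82%A8%22",
]


def decode_items(items, plus_as_space=False):
    """Percent-decode each item as UTF-8 (hypothetical helper for illustration)."""
    decoder = unquote_plus if plus_as_space else unquote
    return [decoder(item, encoding="utf-8") for item in items]


plain = decode_items(samples)                         # "+" is kept as-is
spaced = decode_items(samples, plus_as_space=True)    # "+" becomes a space

for original, a, b in zip(samples, plain, spaced):
    print(original, "->", a, "|", b)
```

In Python 3 the input is already a `str`, so the `.encode('utf-8')` step from the Python 2 version is not needed; the decoder interprets the escaped bytes as UTF-8 by default.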
      • End.





