US20040205673A1 - Method for detecting current client-side browser encoding - Google Patents
- Publication number
- US20040205673A1
- Authority
- US
- United States
- Prior art keywords
- encoding
- encodings
- language
- detection
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/30—Definitions, standards or architectural aspects of layered protocol stacks
- H04L69/32—Architecture of open systems interconnection [OSI] 7-layer type protocol stacks, e.g. the interfaces between the data link level and the physical level
- H04L69/322—Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions
- H04L69/329—Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions in the application layer [OSI layer 7]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
Abstract
To make World Wide Web pages adaptable to the user's language and encoding, a method is provided by which the encoding currently set in the client browser can be detected within the page being browsed. This makes it possible to feed the encoding back to the server side and to adapt the page to the language most likely to match the user's native language. To this end, sample Unicode strings are matched against encoding-specific string values, selected so that a match uniquely determines the encoding currently set. Users around the world ordinarily do not change this setting and are often unaware of it. When forms are passed back to the server, knowing the encoding of the form data makes it possible to parse the data correctly and pass them on to search engines, databases, or other servers.
Description
- The World Wide Web is used by millions of users around the world, in many languages. The TCP/IP and HTTP protocols transmit data between server and client, in most cases without exact knowledge of the language and encoding used on the client side. While Unicode covers all known languages and characters, its encodings, UTF-8 and UTF-16, are very rarely used as a standard for information exchange. Instead, some languages use several different encodings. For instance, there are two widely used Russian encodings and two more that are less widely used. Many languages have one encoding for the Windows operating system and another for DOS. Linux and Unix often use yet another; e.g. in Japanese, Shift JIS is widely (but not always) used on Windows, and EUC-JP is widely (but not always) used on Linux and Unix.
- Ordinary users around the world do not know, and often do not care, what encoding they use. This can be a problem when a user downloads a page in a different encoding, but that is solved by specifying the page encoding inside the HTML. When a user sends a form to the server, though, the server cannot find out the client-side encoding; it can either guess or keep the data as received, in whatever encoding they arrived.
- This makes searches in international databases almost impossible: for instance, the same set of codes can correspond to different characters in different languages. It also makes it impossible to store data in server databases in an encoding-independent way (which basically means in Unicode).
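To make the ambiguity concrete, here is a minimal sketch of one byte value read under three single-byte encodings. The three-entry table and the function name `decodeC0` are hypothetical and cover only byte 0xC0; this is an illustration, not a code page decoder.

```javascript
// Hypothetical lookup: the character that byte 0xC0 stands for
// under three different single-byte encodings.
const BYTE_C0 = {
  "ISO-8859-1": "\u00c0",   // Latin capital A with grave
  "windows-1251": "\u0410", // Cyrillic capital A
  "windows-1257": "\u0104", // Latin capital A with ogonek
};

// Return the character for byte 0xC0 in the named encoding.
function decodeC0(encoding) {
  if (!(encoding in BYTE_C0)) {
    throw new Error("unknown encoding: " + encoding);
  }
  return BYTE_C0[encoding];
}

for (const name of Object.keys(BYTE_C0)) {
  console.log(name, decodeC0(name));
}
```

A server that receives byte 0xC0 without knowing which of these encodings was in effect cannot tell which of the three characters the user typed.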
- Some web sites address this problem by having different pages for different languages. This is only a partial solution for languages that have several encodings, and since users, as experience shows, do not know their encoding, the data they supply cannot always be parsed correctly.
- Another solution is to retrieve, from the HTTP request headers, the encodings that are enabled on the client side. This gives only a hint as to which languages may be installed on the user's computer. In some cases that is enough, namely when there is one language with one encoding; in other cases it is not, for instance when a computer is used for Japanese-Ukrainian translation. In the latter case the computer has at least two languages installed, each with three different encodings, so we have to choose among seven encodings (counting English).
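The limitation of the header approach can be sketched as follows. The text does not name the header; this example assumes the standard HTTP `Accept-Charset` request header, and both the function name `parseAcceptCharset` and the sample header value are hypothetical.

```javascript
// Parse an Accept-Charset style header value into entries sorted by q-value.
function parseAcceptCharset(header) {
  return header
    .split(",")
    .map((part) => {
      const [name, ...params] = part.trim().split(";");
      let q = 1; // default weight when no q parameter is given
      for (const p of params) {
        const [k, v] = p.trim().split("=");
        if (k === "q" && v !== undefined) q = parseFloat(v);
      }
      return { name: name.trim().toLowerCase(), q };
    })
    .filter((entry) => entry.name !== "")
    .sort((a, b) => b.q - a.q);
}

// Hypothetical header from a machine set up for Japanese and Ukrainian.
const candidates = parseAcceptCharset(
  "Shift_JIS, EUC-JP;q=0.9, ISO-2022-JP;q=0.9, " +
    "windows-1251;q=0.8, KOI8-U;q=0.8, ISO-8859-5;q=0.8, *;q=0.1"
);
console.log(candidates.map((c) => c.name));
```

The result is a list of encodings the client accepts. It says nothing about which one the browser is using for the current page, which is the gap described above.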
- If browsers made the current encoding available in a JavaScript object on the web page, or to the server in the HTTP request, that would be a solution; unfortunately, browsers do not provide this information.
- Related US Patents:
U.S. Pat. No. | Date | Author |
---|---|---|
5944790 | July, 1996 | Levy |
- Other References
  - Peter Kent, John Kent, "Official Netscape JavaScript 1.2 Book, Second Edition", Ventana, 1997.
  - The Unicode Consortium (Joan Aliprand, Julie Allen, Rick McGowan, Joe Becker, Michael Everson, Mike Ksar, Lisa Moore, Michel Suignard, Ken Whistler, Mark Davis, Asmus Freytag, John Jenkins), "The Unicode Standard, Version 3.0", The Unicode Consortium.
  - Nadine Kano, "Developing International Software for Windows 95 and Windows NT", Microsoft Press, 1995.
- The present invention solves the problem of browser encoding detection. The result of detection can be used in a JavaScript program or in a Java applet to adapt the contents depending on the encoding. The result can also be passed to the server, either in subsequent HTTP requests or with the form data. If the form data are accompanied by the encoding name, then the data can be uniquely converted into encoding-neutral Unicode strings.
- The method consists of creating an invisible form in the HTML document whose only field is a hidden input containing the Unicode character codes of a sample Unicode string, and matching parts of that sample string against characters or sequences of characters in various specific encodings; when the characters match, the encoding is detected.
- The browser encoding is detected by a piece of JavaScript code placed at the very top of the HEAD part of the HTML page, before any body text is written to the document. First, a form is written to the document, with a hidden input whose value is the sample Unicode string, e.g.: document.write("<form name=VP_encoding><input name=t type=hidden value='АÀĄĎ΅ğ  -個 '></input></form>"); The JavaScript also contains a function, VP_getEncoding(), that returns the current encoding name. The function works as follows. First, it splits the sample Unicode string into two samples: one for multi-byte encodings (the multi-byte sample) and another for UTF-8 and single-byte encodings (the single-byte sample).
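The hidden form can be sketched as plain string building. The code points and the field names `t1b` and `t2b` come from the program listing further down; the helper names `toCharRefs` and `buildSampleForm` are hypothetical. Numeric character references are pure ASCII, so the browser resolves them to the same Unicode characters whatever encoding it applies to the page.

```javascript
// Code points taken from the det1b and det2b arrays of the program listing.
const SINGLE_BYTE_SAMPLE = [1040, 192, 260, 270, 901, 287];
const MULTI_BYTE_SAMPLE = [0x500b];

// Turn code points into numeric character references such as "&#1040;".
function toCharRefs(codePoints) {
  return codePoints.map((cp) => "&#" + cp + ";").join("");
}

// Build the invisible form; the multi-byte sample gets one padding space.
function buildSampleForm() {
  return (
    '<form name="_unicode_">' +
    '<input name="t1b" type="hidden" value="' +
    toCharRefs(SINGLE_BYTE_SAMPLE) +
    '">' +
    '<input name="t2b" type="hidden" value="' +
    toCharRefs(MULTI_BYTE_SAMPLE) +
    ' ">' +
    "</form>"
  );
}

console.log(buildSampleForm());
```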
- The second step detects the UTF-8 encoding by comparing the single-byte sample to the same string directly encoded in UTF-8. If the comparison succeeds, the algorithm stops.
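As a rough illustration of what "directly encoded in UTF-8" means here, the sketch below encodes sample code points to UTF-8 and writes one character per byte, which is how the program listing spells its byte samples. The function names `utf8Bytes` and `utf8Sample` are hypothetical, and the encoder only handles code points up to U+FFFF, which is enough for the samples in this document.

```javascript
// Encode one code point (up to U+FFFF in this sketch) as UTF-8 byte values.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
}

// Render the bytes as a string with one character per byte.
function utf8Sample(codePoints) {
  return codePoints
    .flatMap(utf8Bytes)
    .map((b) => String.fromCharCode(b))
    .join("");
}

// U+500B becomes the three bytes E5 80 8B.
console.log(utf8Bytes(0x500b).map((b) => b.toString(16)));
```

When such bytes are sent raw inside the script and the browser is set to UTF-8, it decodes them back to the original characters, so the literal equals the value of the hidden field. Under any other encoding the literal decodes to something else and the comparison fails.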
- The third step compares the multi-byte sample string to the same string encoded in Big5 Chinese, GBK Chinese, EUC_TW Chinese, EUC_JP Japanese, and SJIS Japanese (the list can easily be extended). Note that the multi-byte sample string is padded with a space character, to make it a valid sequence of bytes when the encoding is UTF-8.
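The comparison in this step can be sketched as a first-match lookup. This is a simulation under stated assumptions, not the listing itself: the function name `detectMultiByte` is hypothetical, and because a standalone script cannot reproduce the browser's decoding, the literals are supplied in already-decoded form with placeholder text for the non-matching entries.

```javascript
// Return the first encoding whose script literal, as decoded by the browser,
// equals the hidden field value; "?" when nothing matches.
function detectMultiByte(fieldValue, decodedLiterals) {
  for (const [name, literal] of decodedLiterals) {
    if (fieldValue === literal) return name;
  }
  return "?";
}

// Hypothetical page state for a browser set to Shift JIS: the character
// reference produced U+500B plus the padding space, and only the SJIS byte
// pair decoded to that same character. The other entries are placeholders.
const fieldValue = "\u500b ";
const decodedLiterals = [
  ["UTF8", "(other) "],
  ["Big5", "(other) "],
  ["SJIS", "\u500b "],
];

console.log(detectMultiByte(fieldValue, decodedLiterals));
```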
- The fourth step compares one or two characters of the single-byte sample string to the characters directly encoded in different single-byte encodings. Note that a character cannot be stored alone in the string but has to be padded with a space character, to make the sequence legal in UTF-8. The set of encoding samples can easily be expanded.
- If the fourth step does not detect the encoding, “?” is returned.
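The positional comparison of the fourth step, including the "?" fallback, can be sketched as follows. The function names are hypothetical, the literals are given in already-decoded form to simulate a browser set to Windows Cyrillic, and the exact padding positions are an assumption based on the order of the sample characters, since the spacing in the listing is not reliable.

```javascript
// Compare only the non-space positions of the literal with the field value.
// Spaces in the literal are padding and are not tested.
function matchesSample(fieldValue, literal) {
  let checked = 0;
  for (let i = 0; i < literal.length; i++) {
    if (literal[i] === " ") continue;
    if (fieldValue.charAt(i) !== literal[i]) return false;
    checked++;
  }
  return checked > 0;
}

// Return the first matching encoding name, or "?" when none matches.
function detectSingleByte(fieldValue, decodedLiterals) {
  for (const [name, literal] of decodedLiterals) {
    if (matchesSample(fieldValue, literal)) return name;
  }
  return "?";
}

// Field value from the character references: the six sample characters.
const t1 = "\u0410\u00c0\u0104\u010e\u0385\u011f";

// Simulated literals as a Windows Cyrillic browser would decode them:
// byte 0x80 reads as U+0402 and byte 0xC0 reads as U+0410.
const literals = [
  ["Cyrillic DOS", "\u0402     "],
  ["Baltic Windows", "  \u0410   "],
  ["Cyrillic Windows", "\u0410     "],
];

console.log(detectSingleByte(t1, literals));
```

Only the Cyrillic Windows entry puts the expected character at the tested position, so that name is returned; with no matching entry the function falls back to "?".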
- The function VP_getEncoding() can then be used by JavaScript later on the page, or in event-handling routines, and the result can be passed back to the server if needed.
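Once the encoding name reaches the server together with the form data, the conversion to Unicode can be sketched as below. Everything here is an assumption for illustration: the function names, the idea of a decoder table keyed by the detected name, and the two partial decoders, which cover only ASCII and the basic Cyrillic letters (unsupported bytes become U+FFFD).

```javascript
// Percent-decode a form value into raw byte values, assuming no encoding yet.
function percentDecodeBytes(s) {
  const bytes = [];
  for (let i = 0; i < s.length; i++) {
    const hex = s.slice(i + 1, i + 3);
    if (s[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2;
    } else if (s[i] === "+") {
      bytes.push(0x20);
    } else {
      bytes.push(s.charCodeAt(i) & 0xff);
    }
  }
  return bytes;
}

// Partial byte-to-code-point decoders, keyed by detected encoding name.
const DECODERS = {
  "Cyrillic Windows": (b) =>
    b < 0x80 ? b : b >= 0xc0 ? 0x0410 + (b - 0xc0) : 0xfffd,
  "Cyrillic ISO": (b) =>
    b < 0x80 ? b : b >= 0xb0 && b <= 0xef ? 0x0410 + (b - 0xb0) : 0xfffd,
};

// Convert one raw form value to a Unicode string using the named decoder.
function decodeField(raw, encoding) {
  const toCodePoint = DECODERS[encoding];
  if (!toCodePoint) throw new Error("no decoder for " + encoding);
  return percentDecodeBytes(raw)
    .map((b) => String.fromCharCode(toCodePoint(b)))
    .join("");
}

console.log(decodeField("%CF%F0%E8%E2%E5%F2", "Cyrillic Windows"));
```

The same six bytes read as the Russian word "Привет" under Windows Cyrillic and as a different, meaningless string under ISO Cyrillic, which is why the encoding name has to travel with the data.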
- Program Listing Deposit
<HTML><HEAD><TITLE>Encoding test</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache"
<%
  int []det1b = new int[] { 1040, 192, 260, 270, 901, 287 };
  //                        Cyr   West CtrE Balt GR   Turk
  //                                        (with prev)
  int []det2b = new int[] { 0x500b }; // dbl/utf
%>
<form name="_unicode_">
  <input name="t1b" type="hidden" value="<%
    for (int i = 0; i < det1b.length; i++) {
      out.print("&#" + det1b[i] + ";");
    }
  %>"></input>
  <input name="t2b" type="hidden" value="<%
    for (int i = 0; i < det2b.length; i++) {
      out.print("&#" + det2b[i] + ";");
    }
  %> "></input>
</form>
<hr>
<script language="javascript">
<%
  String[] b2 = new String[] {
    "UTF8", "\u00e5\u0080\u008b",
    "Big5", "\u00ad\u00d3",
    "GBK", "\u0082\u0080",
    "EUC_TW", "\u00d4\u00b6",
    "EUC_JP", "\u00b8\u00c4",
    "SJIS", "\u008c\u00c2"
  };
  String[] b1 = new String[] {
    "UTF-8", "\u00d0\u0090\u00c3\u008
    "Central-European Windows", " \u00a5\u00cf ",
    "Central-European ISO", " \u00a1\u00cf ",
    "Baltic ISO", " \u00a1 ",
    "Cyrillic DOS", "\u0080 ",
    "Baltic Windows", " \u00c0 ",
    "Cyrillic Windows", "\u00c0 ",
    "Cyrillic KOI-8", "\u00e1 ",
    "Cyrillic ISO", "\u00b0 ",
    "Turkish", " \u00c0 \u00f0 ",
    "ISO_8859_1", " \u00c0 ",
    "Greek ISO", " \u00b5 ",
    "Greek Windows", " \u00a1 ",
  };
%>
function VP_getEncoding() {
  var encoding = "?";
  var t1 = document.forms._unicode_.t1b.value;
  var t2 = document.forms._unicode_.t2b.value;
<%
  // Check for multibyte stuff
  for (int i = 0; i < b2.length; i+=2) {
%>
  <%= i > 0 ? "else " : "" %> if (t2 == "<%= b2[i+1] %> ") { encoding = "<%= b2[i] %>"; }<%
  }
  // Check for single-byte stuff
  for (int i = 0; i < b1.length; i+=2) {
%>
  if (encoding == "?") {
<%
    String originalSample = b1[i+1];
    String workingSample = "";
    int[] chosen = new int[originalSample.length()];
    for (int j = 0; j < originalSample.length(); j++) {
      char c = originalSample.charAt(j);
      if (c != ' ') {
        chosen[workingSample.length()] = j;
        workingSample += c;
      }
    }
    if (workingSample.length() == originalSample.length()) {
%>
    if (t1 == "<%= originalSample %>") {
<%
    } else {
%>
    test = "<%= originalSample %> ";
    if ( <% for (int j = 0; j < workingSample.length(); j++) { %><%= j > 0 ? ") && " : ""%> (t1.charAt(<%= chosen[j] %>) == test.charAt(<%= chosen[ }%>)) {
<%
    }
%>
      encoding = "<%= b1[i]%>";
    }
  }
<%}%>
  return encoding;
}
document.write("Encoding is <font color=red><b>" + VP_getEncoding() + "</b></font><b
</script>
</BODY>
</HTML>
Claims (5)
1. A method for detecting character set (also known as character encoding) currently selected on the browser on the world wide web client computer system, comprising: a sample Unicode string that contains a set of test character codes which is independent of current client encoding; a plurality of instructions comparing parts of sample Unicode strings with characters or sequences of characters directly encoded using various encodings to be detected; a function that returns the currently selected encoding.
2. The method of claim 1, wherein the scripting programming language comprises a JavaScript programming language.
3. The method of claim 1, wherein the detection is done in three consecutive steps: detection of Utf encodings; detection of multi-byte language encodings; detection of single-byte language encodings.
4. The method of accompanying the form data sent from the web client to the web server with the encoding information collected using method of claim 1.
5. The method of correct form data conversion on the server side based on the accompanying encoding information collected using method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/682,576 US20040205673A1 (en) | 2001-09-22 | 2001-09-22 | Method for detecting current client-side browser encoding |
Publications (1)
Publication Number | Publication Date |
---|---|
US20040205673A1 (en) | 2004-10-14 |
Family
ID=33132106
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/682,576 Abandoned US20040205673A1 (en) | 2001-09-22 | 2001-09-22 | Method for detecting current client-side browser encoding |
Country Status (1)
Country | Link |
---|---|
US (1) | US20040205673A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6092100A (en) * | 1997-11-21 | 2000-07-18 | International Business Machines Corporation | Method for intelligently resolving entry of an incorrect uniform resource locator (URL) |
US6253326B1 (en) * | 1998-05-29 | 2001-06-26 | Palm, Inc. | Method and system for secure communications |
US6345307B1 (en) * | 1999-04-30 | 2002-02-05 | General Instrument Corporation | Method and apparatus for compressing hypertext transfer protocol (HTTP) messages |
US20020156688A1 (en) * | 2001-02-21 | 2002-10-24 | Michel Horn | Global electronic commerce system |
US6766296B1 (en) * | 1999-09-17 | 2004-07-20 | Nec Corporation | Data conversion system |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050262511A1 (en) * | 2004-05-18 | 2005-11-24 | Bea Systems, Inc. | System and method for implementing MBString in weblogic Tuxedo connector |
US7849085B2 (en) * | 2004-05-18 | 2010-12-07 | Oracle International Corporation | System and method for implementing MBSTRING in weblogic tuxedo connector |
CN103336761A (en) * | 2013-05-14 | 2013-10-02 | 成都网安科技发展有限公司 | Interference filtration matching algorithm based on dynamic partitioning and semantic weighting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |