Sunday, November 19, 2017

Different text standards and encoding methods

While programming and testing, you may have programs that perform perfectly. But if you add data from various sources, or use libraries like pandas that encode or decode text, programs that work with text file data may not perform as programmed. One problem I encountered is that the pohmeliy.com bot does not read URLs from text data that has passed through pandas, which in turn encodes everything as UTF-8. I solved this by downloading the list with the pohmeliy.com tolist bot. Another problem is that some URL lines get attached to each other, which hinders the program or makes the list of URLs unreadable.

Detecting encodings is a universal problem. To understand how text works with encodings, it is useful to read "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text". To detect text encodings I use the Python tool chardetect from Chardet: The Universal Character Encoding Detector, which detects a number of encoding standards and some of their variants, and reports a detection confidence. Today I tested two of my lists and got:

C:\Users\ANTRAS>chardetect C:\Pohmeliy_FB\FB_post_to_groupsCR\lists\CRGroups.txt C:\Pohmeliy_FB\FB_post_to_groupsSILK\lists\SILKGroups.txt
C:\Pohmeliy_FB\FB_post_to_groupsCR\lists\CRGroups.txt: ascii with confidence 1.0
C:\Pohmeliy_FB\FB_post_to_groupsSILK\lists\SILKGroups.txt: ascii with confidence 1.0
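The same check can also be run from inside a Python script. The following is only a minimal sketch, assuming the chardet package is installed; detect_encoding and format_guess are hypothetical helper names of mine, not part of chardet, and the only chardet call used is chardet.detect on raw bytes.

```python
import sys


def detect_encoding(path):
    """Return chardet's guess for the file at `path` as a dict."""
    # Imported here so the formatting helper below works without chardet.
    import chardet  # third-party package: pip install chardet

    with open(path, "rb") as f:  # read bytes, not text: nothing is decoded yet
        raw = f.read()
    return chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys


def format_guess(path, guess):
    """Format a guess in the same style as the chardetect output."""
    return f"{path}: {guess['encoding']} with confidence {guess['confidence']}"


if __name__ == "__main__":
    # Usage: python check_encoding.py list1.txt list2.txt
    for path in sys.argv[1:]:
        print(format_guess(path, detect_encoding(path)))
```

A guess is only a guess: a file that contains nothing but plain URLs will usually be reported as ascii, which is also valid UTF-8.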

So you have to either use a single data source, or decode everything and re-encode it to the same encoding and byte order variant. You can read about the byte order mark on Wikipedia.org.
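Re-encoding a list to one variant could look roughly like this. It is a sketch using only the Python standard library, under my own assumptions: the file either starts with a byte order mark or is UTF-8, and the joined URL lines come from mixed line endings, which is only a guess. The function names are hypothetical.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(raw, default="utf-8"):
    """Decode bytes using the BOM if present, else the assumed default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM and pick the byte order from it.
            return raw.decode(name)
    return raw.decode(default)


def normalize_lines(raw):
    """Return one clean URL per list item, whatever the line endings were."""
    text = decode_with_bom(raw)
    # splitlines() handles \n, \r\n and \r, so lines are not glued together.
    return [line.strip() for line in text.splitlines() if line.strip()]


def to_utf8(raw):
    """Re-encode to UTF-8 without a BOM, one URL per line, \\n endings."""
    return "\n".join(normalize_lines(raw)).encode("utf-8") + b"\n"
```

For example, with open("CRGroups.txt", "rb") you would read the bytes, pass them to to_utf8, and write the result to a new file opened in "wb" mode, so that every list ends up in the same form.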
