When returning an empty JSON, '{}' turns into a Chinese character #111

Open
Ni-Knight opened this issue Dec 5, 2022 · 12 comments

Comments

@Ni-Knight

Ni-Knight commented Dec 5, 2022

When the server returns an empty response, the string '{}' is converted somewhere into a single Unicode character. The two characters are:
"{" = U+007B
"}" = U+007D
Their code points are concatenated somewhere, giving:
"筽" = U+7B7D
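
The concatenation described above is exactly what you get when the two bytes of '{}' are decoded as one UTF-16 big-endian code unit. A minimal sketch using only Python's built-in codecs (it does not involve taxii-client or requests):

```python
# The two raw bytes of an empty JSON object: 0x7B 0x7D.
raw = b"{}"

# Decoded as UTF-8, each byte is its own character.
as_utf8 = raw.decode("utf-8")

# Decoded as UTF-16 big-endian, both bytes form a single 16-bit code unit.
as_utf16_be = raw.decode("utf-16-be")

print(repr(as_utf8))                 # '{}'
print(f"U+{ord(as_utf16_be):04X}")   # U+7B7D
```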

To reproduce, send a query that returns an empty response from a TAXII server. curl and Postman return '{}', but taxii-client returns '筽'.

For example:
2022-08-15T11:21:48.17891632Z info: (TAXII 2 Feed test_instance_1_TAXII 2 Feed test_taxii2-get-indicators) python logging: DEBUG [urllib3.connectionpool] - https://ais2.cisa.dhs.gov:443 "GET /public/collections/---/objects/?limit=25&match%5Btype%5D=campaign HTTP/1.1" 200 2
2022-08-15T11:21:48.180585842Z debug: (TAXII 2 Feed test_instance_1_TAXII 2 Feed test_taxii2-get-indicators) GOT RESPONSE resp.content=b'{}' resp.text='筽' resp.status_code=200 resp.headers={'x-transaction-id': '124a663c-e7c5-48c0-a4ba-6fff95cab122', 'Strict-Transport-Security': 'max-age=31536000 ; includeSubDomains', 'Date': 'Mon, 15 Aug 2022 11:21:47 GMT', 'Keep-Alive': 'timeout=60', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'X-Frame-Options': 'DENY', 'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Connection': 'keep-alive'}

@chisholm
Contributor

chisholm commented Dec 6, 2022

I am not familiar with 'pack', but if the problem seems specific to that tool, perhaps that tool is misunderstanding the textual encoding of the response? If that were the case though, it seems like it ought to misunderstand the encoding regardless of response content. It wouldn't be specific to "empty" responses.

@Ni-Knight
Author

@chisholm by "pack" I meant this package, i.e., taxii-client.

@chisholm
Contributor

chisholm commented Dec 6, 2022

I'm not sure what you mean by taxii-client "returning" something. It's a library with classes and methods, and some methods do return things. It's not clear where the line you quoted came from (looks like a line of logging?). Can you provide a small code sample to reproduce the error?

I ran my own experiment that produces an empty result, with a simple logging config enabled so I could compare the logging output to yours. It was run against the Medallion server:

import logging
import taxii2client

logging.basicConfig(level="DEBUG")

coll = taxii2client.Collection(
    "http://127.0.0.1:5000/trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/",
    user="(user)", password="(password)"
)

envelope = coll.get_objects(
  type="foo"
)

print(envelope)

Notice I had to add my own print statement. The library has some error logging, but doesn't automatically log all of the HTTP responses.

I got:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1:5000
DEBUG:urllib3.connectionpool:http://127.0.0.1:5000 "GET /trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/ HTTP/1.1" 200 254
DEBUG:urllib3.connectionpool:Resetting dropped connection: 127.0.0.1
DEBUG:urllib3.connectionpool:http://127.0.0.1:5000 "GET /trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/objects/?match%5Btype%5D=foo HTTP/1.1" 200 2
{}

The first four lines are logging output; the last line is my print statement. It shows the use of the taxii-client API to send a request to a TAXII server. There is no Chinese in the output.

@Ni-Knight
Author

Thank you for the reply. I'll try to recreate it and post an update (I can't use the server where we first saw it, as we don't have credentials for it). Maybe it's something about the CISA server that causes the weird character.

@BEAdi

BEAdi commented Dec 7, 2022

Hi @chisholm, I am working with @Ni-Knight and wanted to share what we did.

> It's not clear where the line you quoted came from (looks like a line of logging?). Can you provide a small code sample to reproduce the error?

When we use code equivalent to yours, we get an InvalidJSONError.
To see the content of the response, I changed our code to use the following method instead of the collection's get_objects method. It works similarly to get_objects and uses the library's own methods.

def v21_get_objects(self, accept="application/taxii+json;version=2.1", **filter_kwargs):
    collection = self.collection_to_fetch
    collection._verify_can_read()
    query_params = _filter_kwargs_to_query_params(filter_kwargs)
    merged_headers = collection._conn._merge_headers({"Accept": accept, "Content-Type": "application/taxii+json"})

    resp = collection._conn.session.get(collection.objects_url, headers=merged_headers, params=query_params)
    print(f'GOT RESPONSE {resp.content=} {resp.text=} {resp.status_code=} {resp.headers=}')
    if len(resp.text) <= len('{}'):  # body is too short to be anything but an empty JSON object
        return {}

    return _to_json(resp)

We tried to reproduce it on another server, but it returns {}\n and not just {}. Maybe this is the case you also checked, and without the \n at the end of the response it will reproduce for you?

@chisholm
Contributor

chisholm commented Dec 8, 2022

Using your code, slightly modified as:

def v21_get_objects(collection, accept="application/taxii+json;version=2.1", **filter_kwargs):
    collection._verify_can_read()
    query_params = _filter_kwargs_to_query_params(filter_kwargs)
    merged_headers = collection._conn._merge_headers({"Accept": accept, "Content-Type": "application/taxii+json"})

    resp = collection._conn.session.get(collection.objects_url, headers=merged_headers, params=query_params)
    print(f'GOT RESPONSE {resp.content=} {resp.text=} {resp.status_code=} {resp.headers=} {resp.encoding=}')
    if len(resp.text) <= len('{}'):  # in case it is not a json that has to have {}
        return {}

    return _to_json(resp)


coll = taxii2client.Collection(
    "http://127.0.0.1:5000/trustgroup1/collections/91a7b528-80eb-42ed-a74d-c6fbd5a26116/",
    user="(user)", password="(password)"
)

v21_get_objects(coll, type="foo")

Run against the Medallion server, I get as output:

GOT RESPONSE resp.content=b'{}' resp.text='{}' resp.status_code=200 resp.headers={'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Server': 'Werkzeug/2.0.2 Python/3.9.13', 'Date': 'Thu, 08 Dec 2022 01:15:49 GMT'} resp.encoding=None

Again, you can see there is no Chinese.

The (Chinese) text you see comes from the resp.text code fragment. That invokes the requests library's decoding logic, which includes working out the encoding. As documented, it makes "educated guesses" at the encoding. Maybe in your case it guessed wrong? The linked docs say you can show the encoding it is using via resp.encoding, so I added that to the code to see what it would show me. It just shows None for me, which is not very informative. I wonder if it would show you something else?

The TAXII 2.1 spec looks to require implementers to use UTF-8.
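
If the body is supposed to be UTF-8 anyway, one possible client-side workaround is to skip the guess and decode the raw bytes as UTF-8 directly. The sketch below uses only the standard library; `parse_taxii_body` is a hypothetical helper name, not part of taxii2client or requests. With a requests response, assigning `resp.encoding = "utf-8"` before reading `resp.text` should have the same effect.

```python
import json


def parse_taxii_body(content: bytes) -> dict:
    """Hypothetical helper: decode a response body as UTF-8 and parse it as JSON."""
    # No guessing here: bytes that are not valid UTF-8 raise UnicodeDecodeError.
    text = content.decode("utf-8")
    return json.loads(text)


# The two-byte body from the logs parses as an empty object.
print(parse_taxii_body(b"{}"))  # {}
```

In the modified method above, this would mean passing `resp.content` to the helper instead of relying on `resp.text`.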

@BEAdi

BEAdi commented Dec 15, 2022

We added the encoding, and it also shows None.

2022-12-15T07:39:23.616361458Z info: (DHS Feed v2_instance_1_DHS Feed v2_dhs-get-indicators) python logging: DEBUG [urllib3.connectionpool] - https://ais2.cisa.dhs.gov:443 "GET /public/collections/a6313101-fa6c-4276-bb96-7e826f0b248a/objects/?limit=10&added_after=2022-12-14T07%3A39%3A23.038316Z HTTP/1.1" 200 2
2022-12-15T07:39:23.618157698Z debug: (DHS Feed v2_instance_1_DHS Feed v2_dhs-get-indicators) resp.content=b'{}' resp.text='筽' resp.status_code=200 resp.headers={'x-transaction-id': '05608c51-ddf4-4f9c-851f-38f5d3c9b546', 'Strict-Transport-Security': 'max-age=31536000 ; includeSubDomains', 'Date': 'Thu, 15 Dec 2022 07:39:23 GMT', 'Keep-Alive': 'timeout=60', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'X-Frame-Options': 'DENY', 'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Connection': 'keep-alive'} resp.encoding=None

Weird that in your case it guesses right, and in ours it guesses Chinese.

@chisholm
Contributor

Checking the requests implementation, looks like if resp.encoding is None, it falls back to resp.apparent_encoding. I think the latter is what triggers the encoding "guess". If I add a print out of that, I get:

GOT RESPONSE resp.content=b'{}' resp.text='{}' resp.status_code=200 resp.headers={'Content-Type': 'application/taxii+json;version=2.1', 'Content-Length': '2', 'Server': 'Werkzeug/2.0.2 Python/3.9.13', 'Date': 'Sat, 17 Dec 2022 02:57:08 GMT'} resp.encoding=None resp.apparent_encoding='ascii'

And that shows "ascii" for me. Maybe that will show a Chinese encoding for you.
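
The fallback can be modelled without a network call. This is a simplified sketch, not the requests source: `declared_charset` and `effective_encoding` are hypothetical names, and the guessed value is passed in by hand rather than detected. It shows that the Content-Type header from the logs carries no charset parameter, so whatever the detector guesses ends up deciding how the body is decoded.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_encoding(content_type: str, guessed: str) -> str:
    """Use the declared charset when there is one, otherwise fall back to the guess."""
    return declared_charset(content_type) or guessed


header = "application/taxii+json;version=2.1"  # header value from the logs above

print(declared_charset(header))                                            # None
print(effective_encoding(header, guessed="utf_16_be"))                     # utf_16_be
print(effective_encoding(header + ";charset=utf-8", guessed="utf_16_be"))  # utf-8
```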

@BEAdi

BEAdi commented Jan 17, 2023

When we print resp.apparent_encoding, we get utf_16_be.
Is there anything else you can think of?
We have also run into the Chinese character in another case when using the library.

@JasonKeirstead
Member

The bug is in the TAXII server, if it is not setting the response encoding to UTF-8.

@chisholm
Contributor

Well, utf_16_be might be incorrect. This has gone beyond being a cti-taxii-client issue. This library relies on the requests library as mentioned above, to handle the lower-level HTTP request/response details. If the server does not tell the client what encoding it uses (JasonKeirstead's point above), the client must guess, and it is possible to guess wrong. If you don't have control over the server, I guess there's not much you can do about that.
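
To illustrate why a two-byte body is hard to classify: the bytes of '{}' decode without error under several encodings, so validity alone cannot pick one. The sketch below uses only the standard codecs, and the candidate list is an arbitrary choice for illustration. It also shows that the three-byte body '{}\n' mentioned earlier in the thread cannot be UTF-16 at all, which may be why that other server did not reproduce the problem.

```python
CANDIDATES = ["ascii", "utf-8", "utf-16-be", "utf-16-le"]


def valid_encodings(body: bytes) -> dict:
    """Map each candidate encoding that decodes `body` without error to the decoded text."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = body.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


for body in (b"{}", b"{}\n"):
    decoded = valid_encodings(body)
    # Show code points instead of glyphs so the output is unambiguous in any terminal.
    summary = {name: [f"U+{ord(ch):04X}" for ch in text] for name, text in decoded.items()}
    print(body, summary)
```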

Looks like by default, requests uses charset_normalizer to detect encodings. It calls a detect() method, but that is a legacy wrapper around from_bytes(). The latter has an interesting explain argument, which may or may not be useful. It is easy to run a test just from the python REPL:

>>> import charset_normalizer
>>> charset_normalizer.from_bytes(b'{}', explain=True)
2023-01-18 03:55:22,198 | WARNING | override steps (5) and chunk_size (512) as content does not fit (2 byte(s) given) parameters.
2023-01-18 03:55:22,202 | WARNING | Trying to detect encoding from a tiny portion of (2) byte(s).
2023-01-18 03:55:22,204 | INFO | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-01-18 03:55:22,205 | INFO | ascii should target any language(s) of ['Latin Based']
2023-01-18 03:55:22,205 | INFO | ascii is most likely the one. Stopping the process.
<charset_normalizer.models.CharsetMatches object at 0x000001ECF224B4F0>

@Ni-Knight
Author

What an odd bug :) I think we can try asking them which server they spun up. However, you are right that this is definitely not an issue with the client itself. It also seems that chardet does guess the encoding correctly, as @chisholm stated (and I've also tested it).


4 participants