
Error with the Timeout #1

Open
Lambru99 opened this issue Nov 20, 2024 · 3 comments

Lambru99 commented Nov 20, 2024

Hi, I'm trying to use the crawler, but it fails with an error. In particular, the error is raised on this line of code:

    def launchBrowser(self):
        """ """
        # setup selenium firefox profile w/ proxy
        self.profile = webdriver.FirefoxProfile()
        self.profile.set_preference('network.proxy.type', 1)
        self.profile.set_preference("network.proxy.socks_version", 5)
        self.profile.set_preference('network.proxy.socks', '127.0.0.1')
        self.profile.set_preference('network.proxy.socks_port', self.socks + 8)

        options = Options()
        options.headless = True
        self.browser = webdriver.Firefox(self.profile, options=options) #<-----------------       This Line

It gives me:

/home/docker/corrcrawl /
Tor config: {'controlport': '9051', 'socksport': '9050'}
Traceback (most recent call last):
  File "/home/docker/corrcrawl/./crawler/main.py", line 50, in <module>
    main()
  File "/home/docker/corrcrawl/./crawler/main.py", line 43, in main
    collector.run(args.start, args.batches, args.chunksize, webFile=args.sites)
  File "/home/docker/corrcrawl/crawler/OfficialTC.py", line 143, in run
    self.launchBrowser()
  File "/home/docker/corrcrawl/crawler/OfficialTC.py", line 105, in launchBrowser
    self.browser = webdriver.Firefox(self.profile, options=options)
  File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/firefox/webdriver.py", line 170, in __init__
    RemoteWebDriver.__init__(
  File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 319, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/remote_connection.py", line 374, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/remote_connection.py", line 397, in _request
    resp = self._conn.request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/_request_methods.py", line 143, in request
    return self.request_encode_body(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/_request_methods.py", line 278, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/poolmanager.py", line 432, in urlopen
    conn = self.connection_from_host(u.host, port=u.port, scheme=u.scheme)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/poolmanager.py", line 303, in connection_from_host
    return self.connection_from_context(request_context)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/poolmanager.py", line 328, in connection_from_context
    return self.connection_from_pool_key(pool_key, request_context=request_context)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/poolmanager.py", line 351, in connection_from_pool_key
    pool = self._new_pool(scheme, host, port, request_context=request_context)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/poolmanager.py", line 265, in _new_pool
    return pool_cls(host, port, **request_context)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 195, in __init__
    timeout = Timeout.from_float(timeout)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/timeout.py", line 186, in from_float
    return Timeout(read=timeout, connect=timeout)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/timeout.py", line 115, in __init__
    self._connect = self._validate_timeout(connect, "connect")
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/timeout.py", line 152, in _validate_timeout
    raise ValueError(
ValueError: Timeout value connect was <object object at 0x75b803a74500>, but it must be an int, float or None.
make: *** [Makefile:38: run] Error 1

From what I read, it could be related to the Firefox profile. I tried setting the profile manually with a path, but that did not solve the problem.
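(What I tried was roughly the following; the profile path here is only a placeholder.)

    # Hypothetical illustration of the manual-profile attempt; the path is a placeholder.
    self.profile = webdriver.FirefoxProfile('/path/to/firefox/profile')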

notem (Owner) commented Nov 20, 2024

I believe the error is caused by the webdriver failing to connect to the launched Firefox process. Any number of reasons could explain why this happens (the Firefox process is not launching correctly, something wrong with the webdriver API, etc.).

Here's what ChatGPT thinks the problem is (I make no promises that it is correct)...
[image: screenshot of ChatGPT's explanation]

In my prior experience, these libraries are very picky about versions (both Python-level libraries and OS-level libraries). If the API interface or the geckodriver's behavior has changed, some code may need to be rewritten.
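If it helps as a starting point, here is a minimal, untested sketch of what the launch code might look like against Selenium 4, where the positional firefox_profile argument was removed and the proxy preferences move onto Options (the port value below is an example standing in for self.socks + 8):

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # replaces the deprecated options.headless = True
    # Preferences formerly set on FirefoxProfile now go on Options.
    options.set_preference("network.proxy.type", 1)
    options.set_preference("network.proxy.socks_version", 5)
    options.set_preference("network.proxy.socks", "127.0.0.1")
    options.set_preference("network.proxy.socks_port", 9058)  # example value for self.socks + 8

    browser = webdriver.Firefox(options=options)  # no positional profile argument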

Unfortunately, I don't have the time to spare to debug and rewrite the crawler to work with the current versions of Selenium, geckodriver, and the Tor libraries.

Lambru99 (Author) commented

I can update the code and open a pull request, but I need some information about how you are able to sniff both the client-to-entry-node and the exit-node-to-webserver packets. In particular, I understand that you create a Docker container and start sniffing and visiting sites from it (this covers the client-to-entry-node part). For the second part you use an SSH proxy server, but in my experience, whether I start the SSH connection first and then Tor, or Tor first and then SSH, the proxy server still receives the packets before the Tor circuit, so even in this case I can only sniff traffic as a client-to-entry-node observer. How can you sniff from the exit node to the webserver?

notem (Owner) commented Nov 21, 2024

(Just for clarity's sake) the general intended connection structure looks like this...

[ Browser ] <----> [ Tor ] <----> [ SSH Proxy ] <----> [ Website ]

The critical command to build this is...

cmd = f"sshpass -p {self.password} ssh -D {tunnelport} -o "ProxyCommand=nc -X 5 -x 127.0.0.1:{self.socks} %h %p" {self.sshName}@{self.sshHost}"

to break it down further...

  • sshpass -p {self.password} avoids triggering a password prompt, which would break the automation
  • ssh -D {tunnelport} {self.sshName}@{self.sshHost} opens a tunnel to sshName@sshHost; the Firefox browser will use {tunnelport} as the proxy tunnel to access the web.
    • I think it's possible for the proxy host and the browser host to be the same machine, but it's generally easier to run a separate host / VM / container as the SSH proxy. The only requirements for the SSH proxy, I believe, are that it runs SSH and can host the pcap capture.
  • -o "ProxyCommand=nc -X 5 -x 127.0.0.1:{self.socks} %h %p" defines the proxying command used to connect to the host acting as the SSH proxy; it uses netcat to connect through the local Tor process via the SOCKS port defined for it. (A sketch of how this command might be assembled in Python follows this list.)

The sequence of connections should look something like this:

  1. BrowserHost constructs a tunnel to SSHProxyHost using the Tor process running on the BrowserHost at the {self.socks} port; this opens a socket for the tunnel at {tunnelport} on the BrowserHost. The Tor process must be running on the BrowserHost for this to work.
  2. BrowserHost configures Firefox to use the newly created {tunnelport} for its connections to websites (see the sketch after this list). The SSH proxy must be functioning for this to work.
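In Selenium terms, step 2 amounts to pointing Firefox's SOCKS preferences at the tunnel rather than at Tor directly. A sketch, assuming options and {tunnelport} as defined above:

    # Sketch: options is a selenium.webdriver.firefox.options.Options instance
    # and tunnelport is the port passed to ssh -D above.
    options.set_preference("network.proxy.type", 1)
    options.set_preference("network.proxy.socks", "127.0.0.1")
    options.set_preference("network.proxy.socks_port", tunnelport)  # the SSH tunnel, not Tor's SOCKS port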

Debugging any breakdown in this chain of connections is a massive headache, so try to get as many of these processes as possible logging to a file somewhere.
