
This issue was moved to a discussion.

File parsing takes too long. #96

Closed
KKKKK-tech opened this issue May 6, 2023 · 6 comments


Comments

@KKKKK-tech

Hi,
This may be a stupid question, but I need to solve this problem urgently, so I opened this issue. I have a very large JSON file, about 5 GB with 30 million lines. I tried parsing it with JSON Machine, but it took so long that I got an Internal Server Error in the browser. I noticed in the README that even 100 GB files can be parsed, but I'm not sure how to write the code since I'm not very good at PHP.
My JSON file is structured roughly as follows:
{ "head":{...}, "PTHT0000001":{"CDD":[...],"SMART":[...]}, ..., "PTHT0012803":{"CDD":[...],"SMART":[...]} }
My goal is to find a unique PTHTxxxxxxx and extract its value. How should I parse it?
Thank you very much!

@pkoppstein

The jm script (https://github.com/pkoppstein/jm) is based on JSON Machine, and could be used as follows to find the value of a specific key:

$ jm -s | grep --max-count 1 '^{"PTHT0000001"'

Even if you don't want to use jm itself, you could examine it to see how it accomplishes what you want.

Alternatively, you might consider using the --stream option of jq (https://github.com/stedolan/jq), which is designed for just this kind of problem:

$ jq -n --stream 'first(fromstream( (inputs | select(.[0][0] == "PTHT0000001")), [["PTHT0000001"]]  ))'

@halaxa
Owner

halaxa commented May 6, 2023

If you need to do it from inside PHP, just use a simple foreach and look for your key there.

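use JsonMachine\Items;

// Items::fromFile() yields the top-level keys and values one at a time,
// so the whole file is never held in memory.
foreach (Items::fromFile('500gb.json') as $key => $item) {
    if ($key === "PTHT0012803") {
        // your code
    }
}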

Keep in mind that a file of this size might take hours to parse with JSON Machine. My guess is 2-4 hours, depending on the machine and PHP configuration. You might also be interested in #97.
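
Since the goal is to find a single unique key, the loop can stop as soon as that key turns up instead of reading the rest of the file. A minimal sketch of that idea follows. The helper name `findByKey` is hypothetical and not part of JSON Machine; it accepts any key => value iterable, which is how the foreach above consumes `Items::fromFile()`.

```php
<?php
/**
 * Return the value stored under $wantedKey in a key => value iterable,
 * or null if the key never appears. (Hypothetical helper for illustration.)
 */
function findByKey(iterable $items, string $wantedKey)
{
    foreach ($items as $key => $item) {
        if ($key === $wantedKey) {
            // Returning here ends the loop, so nothing after the match is read.
            return $item;
        }
    }

    return null;
}
```

With JSON Machine, the call would presumably look like `findByKey(Items::fromFile('large.json'), 'PTHT0012803')`, where `large.json` is a placeholder file name. The time saved depends on where the key sits in the file: a key near the end still requires parsing almost everything before it.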

@halaxa
Owner

halaxa commented May 6, 2023

Sorry, I read 500 GB instead of just 5 GB. Then it should be a matter of minutes. Make sure Xdebug is disabled and JIT is enabled.

Also increase your PHP time limit if you parse from the browser.
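
The checks above could be scripted roughly as follows. This is only a sketch using PHP built-ins: it lifts the time limit for the current request and reports whether Xdebug is loaded and whether the OPcache JIT is active. The JIT status entry is assumed to be available only on PHP 8.0 or later with OPcache loaded; on other setups the script simply reports "no".

```php
<?php
// Remove the execution time limit for this request only (0 means no limit).
set_time_limit(0);

// Xdebug slows parsing down considerably, so report whether it is loaded.
$xdebugLoaded = extension_loaded('xdebug');

// opcache_get_status() exists only when the OPcache extension is loaded;
// it returns false when OPcache is disabled for the current SAPI.
$jitEnabled = false;
if (function_exists('opcache_get_status')) {
    $status = @opcache_get_status(false);
    $jitEnabled = is_array($status) && !empty($status['jit']['enabled']);
}

echo 'Xdebug loaded: ', $xdebugLoaded ? 'yes' : 'no', PHP_EOL;
echo 'JIT enabled: ', $jitEnabled ? 'yes' : 'no', PHP_EOL;
```

Calling `set_time_limit(0)` inside the script affects only that request; changing `max_execution_time` in php.ini would apply the limit globally, which may not be desirable on a shared server.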

@KKKKK-tech
Author

@pkoppstein Thank you for your help! I will try the methods you mentioned later.

@KKKKK-tech
Author

@halaxa Thank you for your reply. I used a foreach loop like the one you suggested before, but it took too long. I think it may be because foreach takes too much time. Is there any way to avoid this in JSON Machine? If not, I will try to split the large file into smaller ones.

@halaxa
Owner

halaxa commented May 7, 2023

If you split the file, it will take about the same time anyway. There is no faster solution in JSON Machine for now. Keep an eye on #97, which should bring some speedup.

Repository owner locked and limited conversation to collaborators May 7, 2023
@halaxa halaxa converted this issue into discussion #98 May 7, 2023
