Intersecting two pandas DataFrames #332

Open
wants to merge 1 commit into
base: master
@KaparaNewbie KaparaNewbie commented Jan 20, 2021

The need for (or the convenience of) a function such as intersect_dfs is demonstrated in this Bioinformatics Stack Exchange thread (linked below).
A workflow is often pandas-based, yet at some point it may require intersecting two data frames. This is where intersect_dfs becomes useful: it takes two data frames as input and returns their intersection as another data frame.

Given below are three tests for this function.

# preparations
import pandas as pd

df1 = pd.DataFrame({0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'],
                    1: [1, 100, 150, 900, 1],
                    2: [100, 200, 500, 950, 100],
                    3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
                    4: [0, 0, 0, 0, 0],
                    5: ['+', '+', '-', '+', '+'],
                    6: ["remember", "the", "5th", "of", "november"]})

df2 = pd.DataFrame({0: ['chr1', 'chr1'],
                    1: [155, 800],
                    2: [200, 901],
                    4: [0, 0],
                    5: ['-', '+']})

intersect_kwargs = {"s": True}
read_table_names = ["chrom", "start", "end", "name", "score", "strand", "whatever"]
other_read_table_kwargs = {"usecols": ["chrom", "start", "end"]}

# test 1: same-strand intersection (s=True), keeping only chrom, start and end
intersected_df = intersect_dfs(df1, df2,
                               intersect_kwargs=intersect_kwargs,
                               other_read_table_kwargs=other_read_table_kwargs,
                               read_table_names=read_table_names)

expected_df = pd.DataFrame({"chrom": ["chr1", "chr1"],
                            "start": [155, 900],
                            "end": [200, 901]})

assert intersected_df.equals(expected_df)

# test 2: unstranded intersection, keeping only chrom, start and end
intersected_df = intersect_dfs(df1, df2,
                               other_read_table_kwargs=other_read_table_kwargs)

expected_df = pd.DataFrame({"chrom": ["chr1", "chr1", "chr1"],
                            "start": [155, 155, 900],
                            "end": [200, 200, 901]})

assert intersected_df.equals(expected_df)

# test 3: unstranded intersection, keeping all seven named columns
intersected_df = intersect_dfs(df1, df2,
                               read_table_names=read_table_names)

expected_df = pd.DataFrame({"chrom": ["chr1", "chr1", "chr1"],
                            "start": [155, 155, 900],
                            "end": [200, 200, 901],
                            "name": ["feature2", "feature3", "feature4"],
                            "score": [0, 0, 0],
                            "strand": ["+", "-", "+"],
                            "whatever": ["the", "5th", "of"]})

assert intersected_df.equals(expected_df)
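
The expected coordinates in the tests above can be cross-checked without bedtools. What follows is a minimal pandas-only sketch, not the implementation in this PR: overlap_pandas and its same_strand flag are hypothetical names, and the code assumes the integer column labels used by df1 and df2 (0, 1, 2 for chrom, start, end and 5 for strand). It pairs intervals per chromosome and keeps the overlapping part of each pair, optionally requiring equal strands.

```python
import pandas as pd


def overlap_pandas(a, b, same_strand=False):
    """Pairwise interval overlaps of two BED-like frames, using pandas only.

    Hypothetical helper for cross-checking, not the PR's intersect_dfs.
    Assumes chrom/start/end are columns 0/1/2 and strand is column 5.
    """
    cols = {0: "chrom", 1: "start", 2: "end", 5: "strand"}
    left = a[list(cols)].rename(columns=cols)
    right = b[list(cols)].rename(columns=cols)

    # Pair every interval in `a` with every interval in `b` on the same chromosome.
    pairs = left.merge(right, on="chrom", suffixes=("_a", "_b"))

    # Half-open intervals overlap when each one starts before the other ends.
    hit = (pairs["start_a"] < pairs["end_b"]) & (pairs["start_b"] < pairs["end_a"])
    if same_strand:
        hit = hit & (pairs["strand_a"] == pairs["strand_b"])
    pairs = pairs[hit]

    # The shared part runs from the later start to the earlier end.
    result = pd.DataFrame({
        "chrom": pairs["chrom"],
        "start": pairs[["start_a", "start_b"]].max(axis=1),
        "end": pairs[["end_a", "end_b"]].min(axis=1),
    })
    # Sort so the row order does not depend on merge internals.
    return result.sort_values(["chrom", "start", "end"]).reset_index(drop=True)


# Same frames as in the preparations above.
df1 = pd.DataFrame({0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'],
                    1: [1, 100, 150, 900, 1],
                    2: [100, 200, 500, 950, 100],
                    3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
                    4: [0, 0, 0, 0, 0],
                    5: ['+', '+', '-', '+', '+'],
                    6: ["remember", "the", "5th", "of", "november"]})

df2 = pd.DataFrame({0: ['chr1', 'chr1'],
                    1: [155, 800],
                    2: [200, 901],
                    4: [0, 0],
                    5: ['-', '+']})

unstranded = overlap_pandas(df1, df2)
stranded = overlap_pandas(df1, df2, same_strand=True)
print(unstranded)
print(stranded)
```

On these frames the unstranded call gives the three coordinate rows expected in tests 2 and 3, and same_strand=True gives the two rows expected in test 1. The merge builds every same-chromosome pair before filtering, so this is only practical for small inputs.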

… demonstrated in this bioinformatics.stackexchange thread:

https://bioinformatics.stackexchange.com/questions/9015/how-to-do-bedtools-intersection-using-pandas-alone/15181#15181

@KaparaNewbie KaparaNewbie changed the title The need for (or convenience of) a function such as intersect_dfs can is ... Intersecting two pandas DataFrame's Jan 20, 2021
@KaparaNewbie KaparaNewbie changed the title Intersecting two pandas DataFrame's Intersecting two pandas DataFrames Jan 20, 2021